I. Introduction
Long-range 3D perception and semantic understanding of the environment are key challenges for mobile robots and autonomous vehicles. Although it has been proposed to address both of these issues jointly via a multitask convolutional neural network [1], such approaches are not sufficient on their own. Convolutional neural networks are sensitive to minimal distortions within an image, as shown in [2], which justifies using them only when paired with highly reliable sensors, such as LIDAR scanners [3].

This paper therefore proposes a fusion framework that generates 2D evidential grids from LIDAR scans and from the segmentation results of a state-of-the-art convolutional neural network. Evidential theory was adopted because this formalism can efficiently cope with incoherences between multiple pieces of information, while being able to model unobserved areas and to manage sensor uncertainties [4]. Furthermore, 2D evidential grids are pertinent inputs for path-planning systems, as in [5]. As depicted in Fig. 1, Dempster's conjunctive rule is used to fuse, over time, evidential Cartesian grids built from LIDAR scans and from image segmentation results projected onto the ground plane. Since the system has a modular design, the components that process raw data, such as the neural network, can easily be replaced.

One of the specificities of the proposed approach is its asynchronous behaviour: the LIDAR scanner and the RGB camera are not assumed to acquire data at the same time. Instead, each sensor operates freely: each sensor input is processed individually, then fused with the grid representing the global state, or EgoGrid (a minimal per-cell fusion sketch is given at the end of this section). The resulting system is flexible enough to withstand sensor failure. The main contributions of this paper are:
A novel Cartesian grid mapping framework for LIDAR sensors;
A novel evidential and asynchronous fusion scheme for LIDAR scans and for semantic segmentation results generated by a convolutional neural network;
Fig. 1: Asynchronous fusion-based algorithm. Images and LIDAR scans are processed individually as soon as they are acquired. An evidential grid is then built from the sensor input, and fused with the current global grid, or EgoGrid, based on the current vehicle pose. Thus, no LIDAR/camera synchronization is needed.
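To make the per-cell fusion of Fig. 1 concrete, the short Python listing below applies the unnormalized Dempster conjunctive rule to two mass functions, and cell by cell to a sensor grid and the EgoGrid. It is only an illustrative sketch: the binary frame {F, O} (free, occupied), the dictionary-based mass functions, and the names conjunctive_rule and fuse_grids are assumptions made for this example, not the data structures used in the paper.

    from itertools import product

    OMEGA = frozenset({"F", "O"})  # assumed frame of discernment: Free / Occupied

    def conjunctive_rule(m1, m2):
        # Unnormalized Dempster conjunctive rule: the masses of every pair of
        # focal sets are multiplied and assigned to their intersection; mass
        # accumulated on the empty set quantifies the conflict between sources.
        fused = {}
        for (a, ma), (b, mb) in product(m1.items(), m2.items()):
            fused[a & b] = fused.get(a & b, 0.0) + ma * mb
        return fused

    def fuse_grids(ego_grid, sensor_grid):
        # Cell-wise fusion of a newly built sensor grid into the EgoGrid; both
        # grids are assumed to be lists of per-cell mass functions already
        # aligned on the current vehicle pose.
        return [conjunctive_rule(m_ego, m_sensor)
                for m_ego, m_sensor in zip(ego_grid, sensor_grid)]

    # Example cell: the LIDAR grid supports "free", while the camera grid is
    # uninformative for this cell (all mass on OMEGA, i.e. unobserved).
    m_lidar = {frozenset({"F"}): 0.7, OMEGA: 0.3}
    m_camera = {OMEGA: 1.0}
    print(conjunctive_rule(m_lidar, m_camera))  # {frozenset({'F'}): 0.7, OMEGA: 0.3}

Because each sensor grid is fused into the EgoGrid as soon as it is available, the same combination rule serves LIDAR and camera inputs alike, which is what allows the pipeline to remain asynchronous.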