Introduction
Accurate, fast, and reliable 3D object detection (OD) is a core requirement of autonomous driving technologies, because key tasks such as object tracking, adjacent-agent behaviour prediction, and ego-vehicle trajectory planning depend heavily on the ability to perceive objects in the surrounding area. Moreover, perception errors propagate to all downstream tasks and can lead to faulty autonomous-vehicle behavior. To mitigate this error propagation, the perception module that has 3D-OD as a sub-task needs to be carefully designed.
It was recently suggested that sensor fusion (SF) can increase the reliability of the perception module. SF exploits several bands of the electromagnetic spectrum: RGB cameras cover the visible-light band, radars the radio-frequency (RF) band, and lidars the infrared (IR) band. Integrating data from these diverse modalities can therefore enhance the robustness of 3D OD by leveraging their complementary physical characteristics in situations that challenge individual modalities.
Cameras and lidars suffer from strong signal attenuation in extreme weather, e.g., fog or rain, while radars stay relatively robust. Yet, due to the sparsity of the resultant point cloud (PC), radar-based depth estimation is imprecise. In recent years, major progress has been made in 3D OD with lidars [6], [19], [34], [35], cameras [36], [37], [38], [39], [40], and radars [3], [41], [42], with public multi-modal datasets such as nuScenes [16] and the Waymo Open Dataset [31] greatly contributing to the research community. As lidar is an expensive sensor for commercial usage, camera-radar SF is commonly used, with lidars typically reserved for data collection, self-supervision, semantic heat-map generation, occupancy-grid prediction, and high-end use-cases. A common approach is to manually annotate a portion of the data and to use a pre-trained lidar or lidar-camera model to automatically generate more training examples. This process is often referred to as self-supervision or auto-annotation.
A major challenge when fusing multi-modal data is time synchronization and real-time performance. Commonly available SF models assume the data to be synchronized, whereas in real life each sensor provides its data at a different time point, and in some cases data may be missing altogether. SF models can also be computationally demanding, as they require compute resources for all sensors across the different modalities. In addition, most current models assume a certain sensor configuration, meaning that adapting them to a specific use-case requires considerable effort. In this work, we propose a modular, asynchronous SF approach that fuses cameras and radars for accurate, robust, real-time 3D-OD. This approach maintains detection performance even in sensor-failure cases, preventing degradation.
Our model initially extracts features from the sensors and transforms them into a common Bird's-eye-view (BEV) representation. These features predict the 3D locations of objects in the environment, are then ego-motion-compensated (propagated with time) to match the timestamp of the subsequent sensor reading, and act as temporal priors for the next detection interval, where they are correlated with the new data (updated). These features can also be utilized by other downstream tasks (e.g., prediction or planning) to enable end-to-end training. Alternatively, a transformer decoder head can be applied at any time to these predicted, propagated, or updated latent features to output bounding-box predictions, as in [24]. One can view this latent-space representation of the 3D objects in the environment as a modern revisit of the model-based Kalman Filter (KF) [46], Extended KF (EKF) [55], and variations of Bayesian inference [47], where we keep a state, predict this state using some motion model, and update it using some external measurement. Furthermore, while the primary emphasis of this study is sensor fusion in the context of autonomous driving (AD), the methodology is broadly applicable: its principles hold considerable promise for robotics, medical applications, and the Internet of Things (IoT), where cameras and radars are extensively employed.
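To make the Kalman-filter analogy concrete, the snippet below shows the classical predict/update cycle on a toy 1-D constant-velocity model. It is only an illustration of the analogy, not the paper's learned pipeline, in which BEV queries, ego-motion compensation, and attention take the roles of the state, motion model, and measurement update.

```python
# Classical Kalman predict/update cycle on a toy 1-D constant-velocity model.
# In RCF-TP the state is a set of learned BEV queries, the motion model is
# ego-motion compensation, and the update is an attention block.
import numpy as np

dt = 0.1
F_mat = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity motion model
H = np.array([[1.0, 0.0]])                  # we only measure position
Q = 0.01 * np.eye(2)                        # process-noise covariance
R = np.array([[0.25]])                      # measurement-noise covariance

x, P = np.array([0.0, 1.0]), np.eye(2)      # state (position, velocity) and covariance
for z in (0.12, 0.19, 0.33):                # incoming position measurements
    x, P = F_mat @ x, F_mat @ P @ F_mat.T + Q            # predict
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)         # Kalman gain
    x = x + K @ (np.array([z]) - H @ x)                  # update with measurement
    P = (np.eye(2) - K @ H) @ P
print(x)  # estimated position and velocity after three updates
```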
Our main contributions are outlined below:
We propose a modular, asynchronous, real-time architecture that fuses cameras and radars for 3D-OD in a straightforward, effective manner by incorporating learned temporal priors.
Our model is flexible and not optimized for a specific sensor setup. Therefore, it can be easily trained with any number of radars and cameras. As such, our model is also capable of mitigating sensor failures by employing a training pipeline that simulates degraded sensor availability.
We introduce a merged dataset approach, leveraging both self-supervised 3D annotations and manual annotations to enhance dataset diversity. This combined approach can significantly enhance the performance of perception models by offering a larger and richer dataset for training. This feature should particularly benefit underrepresented classes in autonomous driving, such as pedestrians.
When trained on the combined dataset, our model outperforms recent approaches by 10% in average precision (AP) for the vehicle class, without degrading the translation, scale, and orientation errors.
The rest of the paper is organized as follows: Section II describes related work. Section III depicts our fusion module. Section IV describes the dataset used in this paper, as well as the training setup. Sections V and VI describe the results of our evaluations and provide a conclusion and future work, respectively.
Related Work
In this section, we briefly present the properties of cameras and radars in 3D perception, and summarize common deep learning methods for multi-modal OD. For a more comprehensive review, please refer to [14] and [15].
A. Radar Object Detection
Radar PCs are sparse by nature, a property investigated in [2] as a means to further improve radar OD. Another method to improve radar OD, suggested by [3], is to combine point-based feature extractors with a grid-based OD network to achieve high detection performance and good orientation estimates simultaneously. Grid-rendering approaches [4] that leverage KPConvs [5] can achieve better radar OD by exchanging information between input points and mitigating grid-discretization effects. The authors of [6] suggested treating objects as points in 3D space: they first detect object centers using a key-point detector and then regress other attributes, including 3D size, 3D orientation, and velocity. Their model was designed for lidar PCs, but it can also be used with radar points, as suggested by [43]. Another commonly used approach for OD from PCs is that of [7], which proposed an encoder that utilizes PointNets [48] to learn a representation of the PC organized in vertical columns (pillars).
B. Camera Object Detection
The topic of 2D OD has become a vibrant field of research in the past years [26], [44]. The main component of these object detectors is the image backbone and 2D feature extractor [12], [29], [30]. Well-studied 2D-CNN architectures [19], [20] are used for this purpose, as their dense 2D outputs are rich in detail; other popular approaches, such as 3D sparse convolutions [21], are also a conceivable option. The object detectors consist of a feature pyramid network (FPN) [22] for extracting multi-scale feature maps and a detection head for classification and box-regression outputs. The FPN serves as a feature extractor for fusion, whereas the detection head is used for pre-training and as an auxiliary loss in an end-to-end training setup. In addition, especially in the context of AD, after obtaining 2D image features from the image backbone, a projection to BEV feature maps is used to represent the objects in a common 3D space [13], [18], [40]. One such approach, which projects a column of pixels in a 2D image onto the BEV, was presented in [13].
C. Sensor Fusion for Object Detection
RGB cameras and lidars are the most commonly used sensors for fusion [6], [23], [49], [51], [52], [54]. Recently developed SF neural networks use either the two-stage [6], [23], [49], [50] or the one-stage OD pipeline [51], [52], [53]. Recent attempts have been made to increase the robustness of 3D-OD using variants of sensor dropout, as suggested in [8]. In addition, the authors of [24] suggest a flexible method to fuse lidars, cameras, and radars that avoids additional late-stage processing. Sensor dropout was also used in [11], this time on radar images (radar spectra), where a cross-attention mechanism fuses range measurements from radar to improve camera depth estimates, albeit at an additional computational cost. Others, like [9], have proposed spatio-temporal methods that feed temporal information directly into the radar feature encoder to avoid information loss and to cope with the sparsity and clutter of radar PCs. Alternatively, as reported in [10], a center-point detection network can be employed to detect objects by identifying their center points on the image and then associating the radar detections with the corresponding object's center point. Grid-based approaches have also been used to leverage information from several sensor modalities to increase both accuracy and detection range [1], [18], [40], [45]. Reference [14] provides a review of SF methods, focusing on practical questions such as why, what, where, when, and how to fuse data in the context of 3D-OD.
Proposed Approach: Deep Fusion With Temporal Priors
Our modular, flexible architectural design for radar-camera fusion using temporal priors is shown in Fig. 1. It relies heavily on pre-trained feature extractors for the rich embedding of single-modality input data. In this section, we describe the individual sensor-modality processing streams as well as our proposed deep-fusion stream with temporal priors.
A. Camera Stream
The camera stream consists of three parts: the camera feature extractor, the BEV projection, and the camera detector head, as detailed below.
1) Camera Feature Extractor
A camera sensor provides RGB images $\mathbf{I}$ of shape $(H, W, 3)$. The camera feature extractor, $H^{cam}$, maps each image to a feature map with $L_{cam}$ channels and spatial dimensions $(kH, kW)$, where $k$ is the spatial scaling factor of the backbone:\begin{equation*} \mathbf {F}_{cam}(kH, kW, L_{cam}) = H^{cam}(\mathbf {I}(H, W, 3)). \tag {1}\end{equation*}
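As a minimal illustration of (1), the sketch below feeds an RGB image through a generic CNN backbone; the choice of a ResNet-18 truncated at an overall stride of 8 is an assumption for the example, not the backbone used in this work.

```python
# Sketch of Eq. (1): an image backbone H_cam mapping an (H, W, 3) RGB image to a
# (kH, kW, L_cam) feature map. ResNet-18 truncated after its second residual stage
# (overall stride 8) is an illustrative choice, not the paper's backbone.
import torch
import torchvision

backbone = torchvision.models.resnet18(weights=None)
h_cam = torch.nn.Sequential(*list(backbone.children())[:-4])  # stem + first two stages

image = torch.randn(1, 3, 384, 640)   # (B, 3, H, W)
f_cam = h_cam(image)                  # (B, L_cam, kH, kW), here L_cam = 128 and k = 1/8
print(f_cam.shape)                    # torch.Size([1, 128, 48, 80])
```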
2) BEV Projection
In the camera processing stream, an additional BEV transformation module is needed to map the 2D camera features onto a common latent 3D representation space. Our 2D-to-3D projection module relies on the column transformer [13], whose run-time performance we improved in order to successfully deploy the model on the test vehicle. This module defines a regular grid $\mathbf{G}$, onto which 2D image features representing 3D locations of objects in the vicinity of the ego vehicle are projected. The ego vehicle is assumed to be located at the origin of this regular grid, i.e.,\begin{equation*} \mathbf {G}:= \mathbf {G}(m_{1}X, n_{1}Y). \tag {2}\end{equation*}
The BEV projection module, $B$, then outputs BEV features with $K_{BEV}$ channels on the grid $\mathbf{G}$:\begin{equation*} \mathbf {F}_{BEV}(m_{1}X, n_{1}Y, K_{BEV}) = B(\mathbf {F}_{cam}(kH, kW, L_{cam})). \tag {3}\end{equation*}
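A minimal sketch of the projection $B$ in (3) is given below: each cell of the BEV grid is projected into the image with a pinhole model and the corresponding image feature is gathered. The intrinsics, grid extent, and camera height are placeholder values, and the actual column transformer of [13] is considerably more elaborate.

```python
# Sketch of the BEV projection B in Eq. (3): every cell of the regular grid G is
# projected into the image with a pinhole model and the feature at that pixel is
# gathered. Intrinsics, grid extent, camera height (1.5 m) and the feature stride
# are placeholder assumptions.
import torch

def project_to_bev(f_cam, intrinsics, grid_xy, feat_stride=8):
    """f_cam: (C, kH, kW) image features; grid_xy: (M, N, 2) cell centres (forward, lateral) in m."""
    C, kH, kW = f_cam.shape
    M, N, _ = grid_xy.shape
    pts = torch.cat([grid_xy[..., 1:2],                    # lateral  -> x_cam
                     torch.full((M, N, 1), 1.5),           # ground plane 1.5 m below camera -> y_cam
                     grid_xy[..., 0:1]], dim=-1)           # forward  -> z_cam
    uvw = pts @ intrinsics.T                               # pinhole projection
    uv = uvw[..., :2] / uvw[..., 2:3].clamp(min=1e-3)
    u = (uv[..., 0] / feat_stride).long().clamp(0, kW - 1)
    v = (uv[..., 1] / feat_stride).long().clamp(0, kH - 1)
    return f_cam[:, v, u].permute(1, 2, 0)                 # (M, N, C) BEV feature map

intrinsics = torch.tensor([[800.0, 0.0, 320.0], [0.0, 800.0, 192.0], [0.0, 0.0, 1.0]])
xs = torch.arange(0.5, 64.5, 1.0)                          # forward range, 1 m cells
ys = torch.arange(-31.5, 32.5, 1.0)                        # lateral range, 1 m cells
grid_xy = torch.stack(torch.meshgrid(xs, ys, indexing="ij"), dim=-1)
f_bev = project_to_bev(torch.randn(128, 48, 80), intrinsics, grid_xy)
print(f_bev.shape)                                         # torch.Size([64, 64, 128])
```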
3) Camera Detector Head
Our attention-based detection head can be applied to the camera feature maps to output explainable representations of the objects in the scene, e.g., classes and bounding boxes (BBXs) described by the tuple\begin{equation*} BBX: (x, y, z, w, l, h, \psi), \tag {4}\end{equation*} where $(x, y, z)$ is the box center, $(w, l, h)$ are its width, length, and height, and $\psi$ is its yaw angle.
B. Radar Stream
In contrast to the camera stream, the radar stream inherently provides 3D information in BEV. It comprises two components, the radar feature extractor and the radar detector head, outlined below.
1) Radar Feature Extractor
In the radar processing stream, the radar feature extractor, $H^{radar}$, maps the radar detections $R^{d}$ to BEV feature maps with $K_{radar}$ channels:\begin{equation*} \mathbf {F}_{radar}(m_{2}X, n_{2}Y, K_{radar}) = H^{radar}(R^{d}). \tag {5}\end{equation*}
Our radar feature extractor was inspired by [6], but here it is fed with radar data rather than lidar PC data. We examined three grid-resolution versions of the same radar feature extractor, denoted\begin{equation*} H_{i}^{radar}, \quad i \in \{1, 2, 4\}, \tag {6}\end{equation*} where the index $i$ scales the base grid resolution (see (14)-(16)).
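The sketch below illustrates the grid-based idea behind (5)-(6): radar detections are scattered onto a BEV grid whose cell size scales with the index $i$. The hand-crafted per-cell features and the 0.1 m base resolution are assumptions for the example; the actual extractor is a learned network inspired by [6].

```python
# Grid-based radar feature sketch for Eqs. (5)-(6): detections R^d with per-point
# attributes (here RCS and radial velocity, an assumption) are scattered onto a BEV
# grid whose cell size is base_res * i for i in {1, 2, 4}.
import numpy as np

def radar_bev_features(points, i=2, base_res=0.1, extent=64.0):
    """points: (P, 4) array of [x, y, rcs, v_r]; returns an (M, N, 2) BEV feature map."""
    res = base_res * i                                   # 0.1 m, 0.2 m or 0.4 m cells
    n_cells = int(extent / res)
    grid = np.zeros((n_cells, n_cells, 2), dtype=np.float32)
    ix = ((points[:, 0] + extent / 2) / res).astype(int)
    iy = ((points[:, 1] + extent / 2) / res).astype(int)
    valid = (ix >= 0) & (ix < n_cells) & (iy >= 0) & (iy < n_cells)
    # Accumulate per-cell sums of RCS and radial velocity; a learned encoder such as
    # a pillar-style network would replace this hand-crafted step.
    np.add.at(grid, (ix[valid], iy[valid]), points[valid, 2:4])
    return grid

pc = np.random.uniform(-30, 30, size=(500, 4)).astype(np.float32)
print(radar_bev_features(pc, i=4).shape)                 # (160, 160, 2)
```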
2) Radar Detector Head
Our radar detector head has the same architecture as that described in Section III-A3, but here it is applied to the latent representation of the objects predicted by the radar feature extractor (Section III-B1).
C. Sensor Fusion With Temporal Priors
Our sensor fusion model is attention-based and was designed to process data asynchronously, fuse modalities across all grid resolutions, and cope with sensor outage, as described below.
1) Fusing All Grids
An important property of our fusion model is that it can take individual modalities that were trained on different grid resolutions and sizes and fuse them into a common feature space. Applying the general grid notation (2) to the individual modalities results in\begin{align*} \mathbf {G_{cam}}& := \mathbf {G}(m_{1}X, n_{1}Y) \tag {7a}\\ \mathbf {G_{radar}}& := \mathbf {G}(m_{2}X, n_{2}Y), \tag {7b}\end{align*} where, in general,\begin{equation*} m_{1} \neq n_{1} \neq m_{2} \neq n_{2}. \tag {8}\end{equation*}
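The following sketch shows one simple way to bring feature maps defined on the grids (7a)-(7b) onto a common grid before fusion, via bilinear resampling. This is an illustrative stand-in, since our model performs the alignment implicitly through query-based attention.

```python
# Bringing camera and radar BEV features defined on different grids (Eqs. (7a)-(8))
# onto one common grid via bilinear resampling before concatenation; the sizes are
# arbitrary examples.
import torch
import torch.nn.functional as F

def to_common_grid(feat, common_hw):
    """feat: (B, C, M, N) BEV features on a modality-specific grid."""
    return F.interpolate(feat, size=common_hw, mode="bilinear", align_corners=False)

f_cam_bev = torch.randn(1, 128, 64, 64)     # camera grid: m1 x n1 cells
f_radar_bev = torch.randn(1, 64, 160, 160)  # radar grid:  m2 x n2 cells
common = (128, 128)
fused_in = torch.cat([to_common_grid(f_cam_bev, common),
                      to_common_grid(f_radar_bev, common)], dim=1)
print(fused_in.shape)                       # torch.Size([1, 192, 128, 128])
```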
2) Asynchronous Data Processing
Our fusion module is able to process data asynchronously. In real-world scenarios, measurements may arrive at different times and frequencies, and may even be unavailable due to malfunction, occlusion, or any other reason. Therefore, the fusion module needs to be agnostic to the time synchronization of the data arriving from the sensors. Moreover, even in severe cases, where a whole sensor stream stops sending data, the fusion model should continue to provide object predictions. To that end, we introduce our sensor-dropout training regime and our asynchronous data-processing module. As described in Section III-C1, we keep our state representation of the objects in the environment as predicted queries from the previous timestamp,\begin{equation*} \mathbf {q_{pred}^{t_{k}}}, \quad t_{k} \in T. \tag {9}\end{equation*} When new sensor data $\mathbf{d^{t_{k}}}$ arrives at time $t_{k}$, these predicted queries are correlated with (updated by) the data through the attention module $\mathbf{A}$:\begin{equation*} \mathbf {q_{upd}^{t_{k}}} = \mathbf {A} (\mathbf {q_{pred}^{t_{k}}},\mathbf {d^{t_{k}}}), \quad t_{k} \in T. \tag {10}\end{equation*} The updated queries are then ego-motion compensated (EMC) to the timestamp of the next sensor reading,\begin{equation*} \mathbf {q_{pred}^{t_{k+1}}} = EMC(\mathbf {q_{upd}^{t_{k}}}, t_{k+1}), \quad (t_{k},t_{k+1}) \in T, \tag {11}\end{equation*} where the time interval between consecutive readings need not be constant, i.e.,\begin{equation*} t_{k+1} \neq t_{k} + t_{\Delta }. \tag {12}\end{equation*}
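A compact sketch of (9)-(12) is given below: the updated queries at $t_{k}$ are ego-motion compensated to the arbitrary arrival time $t_{k+1}$ of the next measurement and then correlated with it via attention. The SE(2) compensation of the query reference positions and the generic multi-head attention layer are illustrative assumptions, not the exact modules of RCF-TP.

```python
# Sketch of Eqs. (9)-(12): queries updated at t_k are ego-motion compensated (EMC)
# to the arbitrary arrival time t_{k+1} of the next measurement, then correlated with
# it via attention. The SE(2) shift of the query reference positions and the generic
# multi-head attention layer are illustrative assumptions.
import math
import torch

def emc(queries, positions, dxy, dyaw):
    """Shift the BEV reference positions of the queries by the ego motion between t_k and t_{k+1}."""
    c, s = math.cos(dyaw), math.sin(dyaw)
    R = torch.tensor([[c, -s], [s, c]])
    # Query content is left unchanged here; only the geometry is propagated.
    return queries, (positions - torch.tensor(dxy)) @ R

attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
q_upd, q_pos = torch.randn(1, 100, 64), torch.randn(1, 100, 2)     # state at t_k

# The next sensor fires after a non-constant gap, cf. Eq. (12).
q_pred, q_pos = emc(q_upd, q_pos, dxy=(0.8, 0.05), dyaw=0.01)       # Eq. (11)
d_tk1 = torch.randn(1, 500, 64)                                     # new sensor features
q_upd_next, _ = attn(q_pred, d_tk1, d_tk1)                          # Eq. (10): A(q_pred, d)
```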
3) Fusion Detection Head
The fusion detection head can be applied at any time to the latent representation of the objects to produce a BBX representation for path planning, visualization, or any other need:\begin{equation*} [(x, y, z, w, l, h, \psi), Cls] = \mathbf {H_{detection}} (\mathbf {q}), \tag {13}\end{equation*} where $Cls$ denotes the predicted object class.
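A minimal sketch of (13) follows: a small head regresses the seven box parameters and a class distribution from each latent query. The layer sizes and the number of classes are placeholders.

```python
# Sketch of Eq. (13): a detection head H_detection applied to the latent queries q,
# regressing the box tuple (x, y, z, w, l, h, psi) and a class distribution.
# Layer sizes and the number of classes are placeholders.
import torch

class DetectionHead(torch.nn.Module):
    def __init__(self, dim=64, num_classes=3):
        super().__init__()
        self.box = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU(),
                                       torch.nn.Linear(dim, 7))   # (x, y, z, w, l, h, psi)
        self.cls = torch.nn.Linear(dim, num_classes)

    def forward(self, q):
        return self.box(q), self.cls(q).softmax(dim=-1)

head = DetectionHead()
boxes, classes = head(torch.randn(1, 100, 64))   # 100 queries -> 100 candidate boxes
print(boxes.shape, classes.shape)                # torch.Size([1, 100, 7]) torch.Size([1, 100, 3])
```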
Implementation Details
Here, we depict our training strategy, datasets and evaluation method.
A. Bosch Dense Dataset
The Bosch Dense dataset is our internal large-scale dataset for autonomous driving that was collected in Germany during 2022-2023. The purpose of this dataset is to train models for Bosch use-cases and mitigate real-world challenges in automated driving. This collection of over 100 driving scenes, each more than 30 seconds in length, features a diverse and interesting set of driving scenarios. While widely used datasets for automated driving, such as [16], [31], [32], and [33], often provide annotations for all sensors at 2 Hz, our dataset offers the option to incorporate a blend of manually and auto-annotated labels. In our dataset, we manually annotate only a subset, while utilizing our lidar-camera model [1] to automatically generate additional training examples. This approach yields a more diverse and robust training set, one that benefits both from manual expertise and from the capabilities of our pre-trained model. In doing so, this paper also provides a qualitative and quantitative evaluation of the gap in 3D OD quality between using auto labels and manual labels in this dataset.
The structure of this dataset was inspired by the nuScenes [16] dataset; i.e., its data was obtained from a single vehicle equipped with 5 lidars, 10 cameras, 11 radars, a global navigation satellite system (GNSS), and an inertial measurement unit (IMU). This dataset is therefore suitable not only for computer vision (CV) use-cases but also for deploying sensor fusion models across the full sensor suite, for self-supervision of models, and for training foundation models.
B. Radar and Camera Model Training
1) Radar Model
For the radar branch, we trained three models that differed in their grid resolution. The motivation was to strike a balance between resolution and performance in 3D-OD. Namely, a 0.1m grid resolution, while high, introduces increased computational complexity, whereas a 0.4m grid resolution reduces computational cost but might yield degraded detection performance. Thus, we examined the following models:\begin{align*} H_{1}^{radar}, Grid^{1}_{res} & = [0.1m, 0.1m, 0.2m]; \tag {14}\\ H_{2}^{radar}, Grid^{2}_{res} & = [0.2m, 0.2m, 0.4m]; \tag {15}\\ H_{4}^{radar}, Grid^{4}_{res} & = [0.4m, 0.4m, 0.8m]. \tag {16}\end{align*}
C. Fusion Model Training
For the fusion, we trained three models that, as in the radar case, differed in their grid resolution, since the fusion model fuses the single camera model with the three pre-trained radar models (14)-(16). To allow a fair comparison, all three fusion models assume the same output grid resolution:\begin{equation*} H_{i}^{fusion}:= H^{cam} \otimes H_{i}^{radar}, \quad i \in \{1,2,4\}, \tag {17}\end{equation*} where $\otimes$ denotes the fusion of the camera and radar streams.
D. Dataset
We trained two distinct fusion models: one exclusively using manual labels and the other utilizing only auto labels. Subsequently, we trained three additional models by merging the datasets. This step was motivated by the observation (detailed in subsection IV-A) that not all frames in the dataset have manual annotations, resulting in a less-populated manual dataset. The amalgamation enabled the inclusion of numerous additional samples, especially ones with underrepresented classes, within the original manual dataset. We hypothesized that this augmentation would be particularly beneficial, anticipating an enhancement in detection performance, which we aimed to quantify. Furthermore, since the model had already acquired most of the relevant knowledge, this augmentation only necessitates some fine-tuning. Consequently, we chose not to re-train the model from scratch. Instead, we appended three additional epochs with the following compositions (a minimal sampling sketch is given after the list):
75% manual labels +25% auto labels;
50% manual labels +50% auto labels;
25% manual labels +75% auto labels.
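The sketch below illustrates the sampling strategy behind these three compositions, assuming placeholder dataset objects; it is not tied to the Bosch Dense data format.

```python
# Drawing each fine-tuning epoch from the manual and auto-labelled pools according
# to the chosen ratio. Dataset contents and sizes are placeholders.
import random

def mixed_epoch(manual_samples, auto_samples, manual_frac=0.75, epoch_size=1000):
    """Return an epoch in which manual_frac of the samples come from the manual pool."""
    n_manual = int(manual_frac * epoch_size)
    epoch = (random.choices(manual_samples, k=n_manual) +
             random.choices(auto_samples, k=epoch_size - n_manual))
    random.shuffle(epoch)
    return epoch

manual = [f"manual_{i}" for i in range(200)]
auto = [f"auto_{i}" for i in range(5000)]
for frac in (0.75, 0.50, 0.25):                  # the three compositions listed above
    epoch = mixed_epoch(manual, auto, manual_frac=frac)
    print(frac, sum(s.startswith("manual") for s in epoch))
```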
E. Sensor Drop-Out for Increased Robustness
Here, we purposefully degraded the quality of the data generated by the camera and radar with the intention of overcoming challenging environmental conditions. To degrade the quality of the radar data, we gradually reduced the number of points in each radar PC sample from 0% to 99.9% (see Fig. 3). To degrade the quality of the camera data, we quantized the image intensity using a quantization step $q_{step}$ determined by the bit depth $q_{val}$ (see Fig. 4):\begin{align*} q_{step} & = \frac {1}{2^{q_{val}}-1}, \\ I_{quant} & = \left ({{\frac {I_{in}}{q_{step}} \mod {255} }}\right) * q_{step}.\end{align*}
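The two degradations can be summarized as follows; the explicit rounding in the image quantization is our reading of the intended operation (the mod-255 term has no effect on intensities normalized to [0, 1]), and the drop rate shown is one point of the 0%-99.9% schedule.

```python
# Radar point dropout and image-intensity quantization used to degrade the training
# data. q_step follows the formula in the text; the explicit rounding is our reading
# of the intended quantization, and the mod-255 term is dropped since it has no
# effect on intensities normalized to [0, 1].
import numpy as np

def drop_radar_points(points, drop_rate):
    """Keep each radar point with probability (1 - drop_rate)."""
    keep = np.random.rand(len(points)) >= drop_rate
    return points[keep]

def quantize_image(img_in, q_val):
    """Reduce an image with intensities in [0, 1] to q_val bits."""
    q_step = 1.0 / (2 ** q_val - 1)
    return np.round(img_in / q_step) * q_step

radar_pc = np.random.randn(1000, 4)
img = np.random.rand(384, 640, 3).astype(np.float32)
print(drop_radar_points(radar_pc, drop_rate=0.999).shape)   # only ~1 of 1000 points survives
print(np.unique(quantize_image(img, q_val=3)).size)         # at most 8 intensity levels
```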
Fig. 3. Detection visualization with and without our radar degradation during the training pipeline. In these three images, the number of radar points increases from top to bottom while the image quality is kept constant. The white rectangles are the GT BBXs from the manually annotated dataset, and the pink ones are the BBXs predicted by the model. When there are few radar points (effectively camera only, 3a and 3b), the depth estimation has limitations. When the number of points increases, radar helps reduce the localization error of objects, as shown in 3c.
Fig. 4. Detection visualization with and without our camera degradation during the training pipeline. In these three images, the image quality improves from top to bottom while the number of radar points is kept constant. The white rectangles are the GT BBXs from the manually annotated dataset, and the pink ones are the BBXs predicted by the model. When the image quality is low (effectively radar only, 4a), the objects in the oncoming lane (yellow) are not detectable by radar alone due to sensor occlusion. The camera supports the detection of oncoming objects (yellow), as shown in 4b and 4c.
F. Training Setup
For the classification task, we used the focal loss, $L_{cls}$, and for box regression a regression loss, $L_{reg}$; the total loss is their weighted sum:\begin{equation*} loss = \lambda _{cls}L_{cls} + \lambda _{reg}L_{reg}, \tag {18}\end{equation*} where $\lambda_{cls}$ and $\lambda_{reg}$ are the corresponding weighting factors.
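A minimal sketch of (18) is shown below; the smooth-L1 choice for $L_{reg}$ and the values of $\lambda_{cls}$ and $\lambda_{reg}$ are illustrative assumptions.

```python
# Sketch of Eq. (18): focal classification loss plus a weighted box-regression loss.
# The smooth-L1 regression term and the lambda values are illustrative assumptions.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss; targets are one-hot class indicators."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def total_loss(cls_logits, cls_targets, box_preds, box_targets,
               lambda_cls=1.0, lambda_reg=0.25):
    return (lambda_cls * focal_loss(cls_logits, cls_targets) +
            lambda_reg * F.smooth_l1_loss(box_preds, box_targets))

loss = total_loss(torch.randn(100, 3), torch.randint(0, 2, (100, 3)).float(),
                  torch.randn(100, 7), torch.randn(100, 7))
print(loss.item())
```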
Experiments and Results
We conducted rigorous experiments to verify our claims and quantify the performance gain achieved by our proposed approach for training a radar-camera fusion model. In Section V-A, we present the results of training and optimizing our model to obtain the best detection results, and in Section V-B, we compare our approach to previous works. As is common in 3D-OD [1], [16], [31], we employ the average precision (AP) metric to evaluate detection performance. In addition, we assess the localization error of objects with respect to the ground-truth (GT) labels: the translation error is measured as the Euclidean planar center distance in [m], the orientation error is computed as the smallest yaw-angle difference in [rad], and the scale error quantifies the size mismatch between the predicted and GT boxes.
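For reference, the two localization errors can be computed as sketched below; the AP matching machinery and the scale error are omitted, as they follow the benchmark's standard evaluation code.

```python
# Translation and orientation errors as described above; the AP computation and the
# scale error follow the benchmark's standard matching machinery and are omitted.
import numpy as np

def translation_error(pred_xy, gt_xy):
    """Euclidean planar centre distance in metres."""
    return float(np.linalg.norm(np.asarray(pred_xy) - np.asarray(gt_xy)))

def orientation_error(pred_yaw, gt_yaw):
    """Smallest absolute yaw difference in radians."""
    diff = (pred_yaw - gt_yaw + np.pi) % (2 * np.pi) - np.pi
    return abs(diff)

print(translation_error((10.2, 3.1), (10.0, 3.0)))   # ~0.22 m
print(orientation_error(0.1, 2 * np.pi - 0.1))       # 0.2 rad, not ~6.08 rad
```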
A. Training Our Model
Here we present the steps we took to arrive at an optimal training strategy for our model. In subsection V-A1, the performance under different radar grid resolutions is compared, while in subsection V-A2, the fusion performance of these different radar models is evaluated. Subsection V-A3 presents the results of mixing auto labels and manual labels, and subsection V-A4 re-validates our conclusion from previous work [1] regarding detection performance in distance-dependent bins. The outcome of this section is an optimized model, RCF-TP, which is compared to previous works in Section V-B.
1) Radar Grid Resolution
Table 1 quantifies the performance gain when increasing the grid resolution for the radar modality. As explained in subsection III-B, the radar BEV grid is a regular grid and, as expected, by increasing the resolution of this regular grid, we can improve the detection performance. Training and testing
As real-world scenarios are reflected in the manual-labels version of the dataset, we also quantified the expected performance gap when (i) training and testing only with the auto labels, (ii) training only with the auto labels and testing only with the manual labels, and (iii) training and testing only with the manual labels. The auto-label results in Table 1 are thus intended to give an impression of the expected performance when comparing dataset versions.
2) Fusion With Different Radar Grids
Table 2 quantifies the performance of our trained fusion model when fused with three versions of radar models, each with a different grid resolution. As detailed in subsection IV-C, the fusion models share the same grid resolution. In accordance with subsection V-A1, we compare training with auto- and manual labels. Each entry in this table is presented as a tuple denoting (fusion, camera, radar) results.
As expected, the fusion model
Of note, as explained in subsection V-A1 and as in Table 1, the auto-label results presented in Table 2 are intended to give an impression of the expected performance when comparing dataset versions.
In addition, the trained fusion model w/ our sensor drop-out regime, referred to as
Please refer to Fig. 3 and Fig. 4 for visualizations of the detection performance improvement with and without our sensor-degradation training pipeline for radar and camera degradation, respectively.
3) Datasets Mixing
Tables 3 and 4 present the detection performance for the simultaneous combining of the two dataset versions for the Vehicle and Pedestrian classes, respectively. For the Vehicle class (Table 3), the results are not decisive, implying that adding more samples from the auto-labels version of the dataset does not improve the performance, as the manual version of the dataset already includes the necessary amount of data for training the model. For the Pedestrian class (Table 4), conversely, adding more samples from the auto-labels version of the dataset improves performance by
4) Distance Detection Performance
To evaluate the properties of each modality, we assessed the detection performance in distance-dependent range bins (Table 5). Here, we compared radar-only, camera-only, and fusion detection AP at a 4.0 m localization threshold for the ranges 10-30 m, 30-50 m, and 50-70 m. As can be gleaned from Table 5, camera-only detection is preferable at closer ranges (10-50 m), while radar-only detection is better at larger distances (50-70 m). In addition, fusion detection performance is better than that of the individual modalities in all tested bins. This suggests that our RCF-TP module is indeed asynchronous and accurate, as it successfully processes asynchronous data with improved performance.
5) Training Summary
To summarize, our optimized model,
B. Comparison to Previous Works
We implemented the algorithm of [24] and compared its performance with our approach. The comparison indicates that increasing the radar resolution leads to improved fusion performance, as shown in Table 2.
As such, and to allow fair comparison, we did not use our optimal model, discussed in subsection V-A5, but our RCF-TP model that fuses radar with a grid resolution of
We also compared our optimal model, presented in Table 4, with our implementation of [24] for the Pedestrian class. The results are provided in Table 7. Our model outperforms [24] by a factor of 3 in AP at the 4.0 m threshold, demonstrating a 2-fold improvement in the translation error and a marginal enhancement in scale error.
Conclusion and Future Work
We have presented RCF-TP, a modular, asynchronous, real-time architecture for fusing multiple cameras and radars for 3D OD using learned temporal priors. RCF-TP first extracts modality-specific features using dedicated backbones. These features are projected onto a common BEV and represented as queries for the objects in the environment, which constitute the model's state. After ego-motion correction, these queries are correlated with data received from other modalities, or from different sensors within the same modality. Using the attention mechanism, a detection head processes the fused features to output bounding-box representations. The model can easily be adapted to any sensor configuration with numerous cameras and radars, and can therefore mitigate sensor failures; it can fuse individual modalities that were trained with different grid sizes or resolutions; and it runs as fast as 10 Hz.
In our training regime, we employed a dataset-merger approach, leveraging both self-supervised and manual 3D annotations to enhance dataset diversity. This approach significantly improved detection performance, particularly for pedestrians, demonstrating a 2-fold enhancement compared to using manual labels alone, while not degrading the scale, translation, and orientation errors.
Extensive experiments highlight the importance of our asynchronous and modular sensor fusion for improving robustness under challenging weather conditions or in cases of sensor faults. The approach proved valuable in real-time, real-world scenarios where accurate and robust 3D-OD perception is essential, especially when certain classes are underrepresented in the dataset. Furthermore, its applicability extends beyond AD to robotics, medical, and Internet of Things (IoT) applications, as it addresses challenges related to sensor resolution, dataset combination, and the handling of minority classes that are common in these fields as well.
While our sensor fusion model is adept at mitigating sensor occlusion by compensating with data from other sensors and acquiring a shared latent representation of the scene, it may face difficulties in highly dynamic environments. Consequently, our future work will focus on expanding the detection range to encompass highway scenarios and on integrating a tracking module. This enhancement is anticipated to improve performance in highly dynamic environments and to contribute to the optimization of downstream planning tasks.