Introduction
Accurate, fast, and reliable 3D object detection (OD) is a core requirement of autonomous driving technologies, because key tasks such as object tracking, adjacent-agent behaviour prediction, and ego-vehicle trajectory planning depend heavily on the ability to perceive objects in the surrounding area. Moreover, perception errors propagate to all downstream tasks and can lead to faulty autonomous-vehicle behavior. To mitigate this error propagation, the perception module that has 3D-OD as a sub-task needs to be carefully designed.
It was recently suggested that sensor fusion (SF) can increase the reliability of the perception module. SF exploits several bands of the electromagnetic spectrum: RGB cameras cover the visible-light band, radars the radio-frequency (RF) band, and lidars the infrared (IR) band. Integrating data from these diverse modalities can therefore enhance the robustness of 3D OD by leveraging their complementary physical characteristics in situations that challenge individual modalities.
Cameras and lidars suffer from strong signal attenuation in extreme weather, e.g., fog or rain, while radars stay relatively robust. Yet, due to the sparsity of the resultant point cloud (PC), radar-based depth estimation is imprecise. In recent years, major progress has been made in 3D OD with lidars [6], [19], [34], [35], cameras [36], [37], [38], [39], [40], and radars [3], [41], [42], with public multi-modal datasets such as nuScenes [16] and the Waymo Open Dataset [31] greatly contributing to the research community. As lidar is an expensive sensor for commercial usage, camera-radar SF is commonly used, with lidars typically reserved for data collection, self-supervision, semantic heat-map generation, occupancy-grid prediction, and high-end use-cases. A common approach is to manually annotate a portion of the data and to use a pre-trained lidar or lidar-camera model to automatically generate more training examples. This process is often referred to as self-supervision or auto-annotation.
A major challenge when fusing multi-modal data is time synchronization and real-time performance. Commonly available SF models assume the data to be synchronized, whereas in real life each sensor provides its data at a different time point, and in some cases data may be missing altogether. SF models can also be computationally demanding, as they require compute resources for all sensors across the different modalities. In addition, most current models assume a certain sensor configuration, meaning that adapting them to a specific use-case requires considerable effort. In this work, we propose a modular, asynchronous SF approach that fuses cameras and radars for accurate, robust, real-time 3D-OD. This approach maintains detection performance even in sensor-failure cases, preventing degradation.
Our model initially extracts features from the sensors and transforms them into a common Bird's-eye-view (BEV) representation. These features predict the 3D locations of objects in the environment, are then ego-motion-compensated (propagated with time) to match the timestamp of the subsequent sensor reading, and act as temporal priors for the next detection interval, where they are correlated with the new data (updated). These features can also be utilized by other downstream tasks (e.g., prediction or planning) to enable end-to-end training. Alternatively, a transformer decoder head can be applied at any time to these predicted, propagated, or updated latent features to output bounding-box predictions, as in [24]. One can view this latent-space representation of the 3D objects in the environment as a modern revisit of the model-based Kalman Filter (KF) [46], Extended KF (EKF) [55], and variations of Bayesian inference [47], where we keep a state, predict this state using some motion model, and update it using some external measurement. Furthermore, while the primary emphasis of this study is sensor fusion in the context of autonomous driving (AD), the methodology is broadly applicable: its principles hold considerable promise for robotics, medical applications, and the Internet of Things (IoT), where cameras and radars are extensively employed.
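To make the Kalman-filter analogy concrete, the snippet below shows the classical predict/update cycle on a toy 1-D constant-velocity model. It is only an illustration of the analogy, not the paper's learned pipeline, in which BEV queries, ego-motion compensation, and attention take the roles of the state, motion model, and measurement update.

```python
# Classical Kalman predict/update cycle on a toy 1-D constant-velocity model.
# In RCF-TP the state is a set of learned BEV queries, the motion model is
# ego-motion compensation, and the update is an attention block.
import numpy as np

dt = 0.1
F_mat = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity motion model
H = np.array([[1.0, 0.0]])                  # we only measure position
Q = 0.01 * np.eye(2)                        # process-noise covariance
R = np.array([[0.25]])                      # measurement-noise covariance

x, P = np.array([0.0, 1.0]), np.eye(2)      # state (position, velocity) and covariance
for z in (0.12, 0.19, 0.33):                # incoming position measurements
    x, P = F_mat @ x, F_mat @ P @ F_mat.T + Q            # predict
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)         # Kalman gain
    x = x + K @ (np.array([z]) - H @ x)                  # update with measurement
    P = (np.eye(2) - K @ H) @ P
print(x)  # estimated position and velocity after three updates
```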
Our main contributions are outlined below:
We propose a modular, asynchronous, real-time architecture that fuses cameras and radars for 3D-OD in a straightforward, effective manner by incorporating learned temporal priors.
Our model is flexible and not optimized for a specific sensor setup. Therefore, it can be easily trained with any number of radars and cameras. As such, our model is also capable of mitigating sensor failures by employing a training pipeline that simulates degraded sensor availability.
We introduce a merged dataset approach, leveraging both self-supervised 3D annotations and manual annotations to enhance dataset diversity. This combined approach can significantly enhance the performance of perception models by offering a larger and richer dataset for training. This feature should particularly benefit underrepresented classes in autonomous driving, such as pedestrians.
When trained on the combined dataset, our model outperforms recent approaches by 10% in average precision (AP) for the vehicle class, without degrading the translation, scale, and orientation errors.
The rest of the paper is organized as follows: Section II describes related work. Section III depicts our fusion module. Section IV describes the dataset used in this paper, as well as the training setup. Sections V and VI describe the results of our evaluations and provide a conclusion and future work, respectively.
Related Work
In this section, we briefly present the properties of cameras and radars in 3D perception, and summarize common deep learning methods for multi-modal OD. For a more comprehensive review, please refer to [14] and [15].
A. Radar Object Detection
Radar PCs are sparse by nature, a property investigated in [2] as a means to further improve radar OD. Another method to improve radar OD, suggested by [3], is to combine point-based feature extractors with a grid-based OD network to achieve high detection performance and good orientation estimates simultaneously. Grid-rendering approaches [4] that leverage KPConvs [5] can achieve better radar OD by exchanging information between input points and mitigating grid-discretization effects. The authors of [6] suggested treating objects as points in 3D space: they first detect object centers using a key-point detector and then regress other attributes, including 3D size, 3D orientation, and velocity. Their model was designed for lidar PCs, but it can also be used with radar points, as suggested by [43]. Another commonly used approach for OD from PCs is that of [7], which proposed an encoder that utilizes PointNets [48] to learn a representation of the PC organized in vertical columns (pillars).
B. Camera Object Detection
The topic of 2D OD has become a vibrant field of research in the past years [26], [44]. The main component of these object detectors is the image backbone and 2D feature extractor [12], [29], [30]. Well-studied 2D-CNN architectures [19], [20] are used for this purpose, as their dense 2D outputs are rich in detail; other popular approaches, such as 3D sparse convolutions [21], are also a conceivable option. The object detectors consist of a feature pyramid network (FPN) [22] for extracting multi-scale feature maps and a detection head for classification and box-regression outputs. The FPN serves as a feature extractor for fusion, whereas the detection head is used for pre-training and as an auxiliary loss in an end-to-end training setup. In addition, especially in the context of AD, after obtaining 2D image features from the image backbone, a projection to BEV feature maps is used to represent the objects in a common 3D space [13], [18], [40]. One such approach, which projects a column of pixels in a 2D image onto the BEV, was presented in [13].
C. Sensor Fusion for Object Detection
RGB cameras and lidars are the most commonly used sensors for fusion [6], [23], [49], [51], [52], [54]. Recently developed SF neural networks use either the two-stage [6], [23], [49], [50] or the one-stage OD pipeline [51], [52], [53]. Recent attempts have been made to increase the robustness of 3D-OD using variants of sensor dropout, as suggested in [8]. In addition, the authors of [24] suggest a flexible method to fuse lidars, cameras, and radars that avoids additional late-stage processing. Sensor dropout was also used in [11], this time on radar images (radar spectra), where a cross-attention mechanism fuses range measurements from radar to improve camera depth estimates, albeit at an additional computational cost. Others, like [9], have proposed spatio-temporal methods that feed temporal information directly into the radar feature encoder to avoid information loss and to cope with the sparsity and clutter of radar PCs. Alternatively, as reported in [10], a center-point detection network can be employed to detect objects by identifying their center points on the image and then associating the radar detections with the corresponding object's center point. Grid-based approaches have also been used to leverage information from several sensor modalities to increase both accuracy and detection range [1], [18], [40], [45]. Reference [14] provides a review of SF methods, focusing on practical questions such as why, what, where, when, and how to fuse data in the context of 3D-OD.
Proposed Approach: Deep Fusion With Temporal Priors
Our modular, flexible architectural design for radar-camera fusion using temporal priors is shown in Fig. 1. It relies heavily on pre-trained feature extractors for the rich embedding of single-modality input data. In this section, we describe the individual sensor-modality processing streams as well as our proposed deep-fusion stream with temporal priors.
A. Camera Stream
The camera stream consists of three parts: the camera feature extractor, the BEV projection, and the camera detector head, as detailed below.
1) Camera Feature Extractor
A camera sensor provides RGB images $\mathbf{I}$ of shape $(H, W, 3)$. The camera feature extractor, $H^{cam}$, maps each image to a feature map with $L_{cam}$ channels and spatial dimensions $(kH, kW)$, where $k$ is the spatial scaling factor of the backbone:\begin{equation*} \mathbf {F}_{cam}(kH, kW, L_{cam}) = H^{cam}(\mathbf {I}(H, W, 3)). \tag {1}\end{equation*}
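As a minimal illustration of (1), the sketch below feeds an RGB image through a generic CNN backbone; the choice of a ResNet-18 truncated at an overall stride of 8 is an assumption for the example, not the backbone used in this work.

```python
# Sketch of Eq. (1): an image backbone H_cam mapping an (H, W, 3) RGB image to a
# (kH, kW, L_cam) feature map. ResNet-18 truncated after its second residual stage
# (overall stride 8) is an illustrative choice, not the paper's backbone.
import torch
import torchvision

backbone = torchvision.models.resnet18(weights=None)
h_cam = torch.nn.Sequential(*list(backbone.children())[:-4])  # stem + first two stages

image = torch.randn(1, 3, 384, 640)   # (B, 3, H, W)
f_cam = h_cam(image)                  # (B, L_cam, kH, kW), here L_cam = 128 and k = 1/8
print(f_cam.shape)                    # torch.Size([1, 128, 48, 80])
```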
2) BEV Projection
In the camera processing stream, an additional BEV transformation module is needed to map the 2D camera features onto a common latent 3D representation space. Our 2D-to-3D projection module relies on the column transformer [13], whose run-time performance we improved in order to successfully deploy the model on the test vehicle. This module defines a regular grid $\mathbf{G}$, onto which 2D image features representing 3D locations of objects in the vicinity of the ego vehicle are projected. The ego vehicle is assumed to be located at the origin of this regular grid, i.e.,\begin{equation*} \mathbf {G}:= \mathbf {G}(m_{1}X, n_{1}Y). \tag {2}\end{equation*}
The BEV projection module, $B$, then outputs BEV features with $K_{BEV}$ channels on the grid $\mathbf{G}$:\begin{equation*} \mathbf {F}_{BEV}(m_{1}X, n_{1}Y, K_{BEV}) = B(\mathbf {F}_{cam}(kH, kW, L_{cam})). \tag {3}\end{equation*}
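A minimal sketch of the projection $B$ in (3) is given below: each cell of the BEV grid is projected into the image with a pinhole model and the corresponding image feature is gathered. The intrinsics, grid extent, and camera height are placeholder values, and the actual column transformer of [13] is considerably more elaborate.

```python
# Sketch of the BEV projection B in Eq. (3): every cell of the regular grid G is
# projected into the image with a pinhole model and the feature at that pixel is
# gathered. Intrinsics, grid extent, camera height (1.5 m) and the feature stride
# are placeholder assumptions.
import torch

def project_to_bev(f_cam, intrinsics, grid_xy, feat_stride=8):
    """f_cam: (C, kH, kW) image features; grid_xy: (M, N, 2) cell centres (forward, lateral) in m."""
    C, kH, kW = f_cam.shape
    M, N, _ = grid_xy.shape
    pts = torch.cat([grid_xy[..., 1:2],                    # lateral  -> x_cam
                     torch.full((M, N, 1), 1.5),           # ground plane 1.5 m below camera -> y_cam
                     grid_xy[..., 0:1]], dim=-1)           # forward  -> z_cam
    uvw = pts @ intrinsics.T                               # pinhole projection
    uv = uvw[..., :2] / uvw[..., 2:3].clamp(min=1e-3)
    u = (uv[..., 0] / feat_stride).long().clamp(0, kW - 1)
    v = (uv[..., 1] / feat_stride).long().clamp(0, kH - 1)
    return f_cam[:, v, u].permute(1, 2, 0)                 # (M, N, C) BEV feature map

intrinsics = torch.tensor([[800.0, 0.0, 320.0], [0.0, 800.0, 192.0], [0.0, 0.0, 1.0]])
xs = torch.arange(0.5, 64.5, 1.0)                          # forward range, 1 m cells
ys = torch.arange(-31.5, 32.5, 1.0)                        # lateral range, 1 m cells
grid_xy = torch.stack(torch.meshgrid(xs, ys, indexing="ij"), dim=-1)
f_bev = project_to_bev(torch.randn(128, 48, 80), intrinsics, grid_xy)
print(f_bev.shape)                                         # torch.Size([64, 64, 128])
```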
3) Camera Detector Head
Our attention-based detection head can be applied to the camera feature maps to output explainable representations of the objects in the scene, e.g., classes and bounding boxes (BBXs) described by the tuple\begin{equation*} BBX: (x, y, z, w, l, h, \psi), \tag {4}\end{equation*} where $(x, y, z)$ is the box center, $(w, l, h)$ are its width, length, and height, and $\psi$ is its yaw angle.
B. Radar Stream
In contrast to the camera stream, the radar stream inherently provides 3D information in BEV. It comprises two components, the radar feature extractor and the radar detector head, outlined below.
1) Radar Feature Extractor
In the radar processing stream, the radar feature extractor, $H^{radar}$, maps the radar detections $R^{d}$ to BEV feature maps with $K_{radar}$ channels:\begin{equation*} \mathbf {F}_{radar}(m_{2}X, n_{2}Y, K_{radar}) = H^{radar}(R^{d}). \tag {5}\end{equation*}
Our radar feature extractor was inspired by [6], but here it is fed with radar data rather than lidar PC data. We examined three grid-resolution versions of the same radar feature extractor, denoted\begin{equation*} H_{i}^{radar}, \quad i \in \{1, 2, 4\}, \tag {6}\end{equation*} where the index $i$ scales the base grid resolution (see (14)-(16)).
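The sketch below illustrates the grid-based idea behind (5)-(6): radar detections are scattered onto a BEV grid whose cell size scales with the index $i$. The hand-crafted per-cell features and the 0.1 m base resolution are assumptions for the example; the actual extractor is a learned network inspired by [6].

```python
# Grid-based radar feature sketch for Eqs. (5)-(6): detections R^d with per-point
# attributes (here RCS and radial velocity, an assumption) are scattered onto a BEV
# grid whose cell size is base_res * i for i in {1, 2, 4}.
import numpy as np

def radar_bev_features(points, i=2, base_res=0.1, extent=64.0):
    """points: (P, 4) array of [x, y, rcs, v_r]; returns an (M, N, 2) BEV feature map."""
    res = base_res * i                                   # 0.1 m, 0.2 m or 0.4 m cells
    n_cells = int(extent / res)
    grid = np.zeros((n_cells, n_cells, 2), dtype=np.float32)
    ix = ((points[:, 0] + extent / 2) / res).astype(int)
    iy = ((points[:, 1] + extent / 2) / res).astype(int)
    valid = (ix >= 0) & (ix < n_cells) & (iy >= 0) & (iy < n_cells)
    # Accumulate per-cell sums of RCS and radial velocity; a learned encoder such as
    # a pillar-style network would replace this hand-crafted step.
    np.add.at(grid, (ix[valid], iy[valid]), points[valid, 2:4])
    return grid

pc = np.random.uniform(-30, 30, size=(500, 4)).astype(np.float32)
print(radar_bev_features(pc, i=4).shape)                 # (160, 160, 2)
```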
2) Radar Detector Head
Our radar detector head has the same architecture as that described in Section III-A3, but here it is applied to the latent representation of the objects predicted by the radar feature extractor (Section III-B1).
C. Sensor Fusion With Temporal Priors
Our sensor fusion model is attention-based and was designed to process data asynchronously, fuse modalities across all grid resolutions, and cope with sensor outage, as described below.
1) Fusing All Grids
An important property of our fusion model is that it can take individual modalities that were trained on different grid resolutions and sizes and fuse them into a common feature space. Applying the general grid notation (2) to the individual modalities results in\begin{align*} \mathbf {G_{cam}}& := \mathbf {G}(m_{1}X, n_{1}Y) \tag {7a}\\ \mathbf {G_{radar}}& := \mathbf {G}(m_{2}X, n_{2}Y), \tag {7b}\end{align*} where, in general,\begin{equation*} m_{1} \neq n_{1} \neq m_{2} \neq n_{2}. \tag {8}\end{equation*}
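The following sketch shows one simple way to bring feature maps defined on the grids (7a)-(7b) onto a common grid before fusion, via bilinear resampling. This is an illustrative stand-in, since our model performs the alignment implicitly through query-based attention.

```python
# Bringing camera and radar BEV features defined on different grids (Eqs. (7a)-(8))
# onto one common grid via bilinear resampling before concatenation; the sizes are
# arbitrary examples.
import torch
import torch.nn.functional as F

def to_common_grid(feat, common_hw):
    """feat: (B, C, M, N) BEV features on a modality-specific grid."""
    return F.interpolate(feat, size=common_hw, mode="bilinear", align_corners=False)

f_cam_bev = torch.randn(1, 128, 64, 64)     # camera grid: m1 x n1 cells
f_radar_bev = torch.randn(1, 64, 160, 160)  # radar grid:  m2 x n2 cells
common = (128, 128)
fused_in = torch.cat([to_common_grid(f_cam_bev, common),
                      to_common_grid(f_radar_bev, common)], dim=1)
print(fused_in.shape)                       # torch.Size([1, 192, 128, 128])
```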
2) Asynchronous Data Processing
Our fusion module is able to process data asynchronously. In real-world scenarios, measurements may arrive at different times and frequencies, and may even be unavailable due to malfunction, occlusion, or any other reason. Therefore, the fusion module needs to be agnostic to the time synchronization of the data arriving from the sensors. Moreover, even in severe cases, where a whole sensor stream stops sending data, the fusion model should continue to provide object predictions. To that end, we introduce our sensor-dropout training regime and our asynchronous data-processing module. As described in Section III-C1, we keep our state representation of the objects in the environment as predicted queries from the previous timestamp,\begin{equation*} \mathbf {q_{pred}^{t_{k}}}, \quad t_{k} \in T. \tag {9}\end{equation*} When new sensor data $\mathbf{d^{t_{k}}}$ arrives at time $t_{k}$, these predicted queries are correlated with (updated by) the data through the attention module $\mathbf{A}$:\begin{equation*} \mathbf {q_{upd}^{t_{k}}} = \mathbf {A} (\mathbf {q_{pred}^{t_{k}}},\mathbf {d^{t_{k}}}), \quad t_{k} \in T. \tag {10}\end{equation*} The updated queries are then ego-motion compensated (EMC) to the timestamp of the next sensor reading,\begin{equation*} \mathbf {q_{pred}^{t_{k+1}}} = EMC(\mathbf {q_{upd}^{t_{k}}}, t_{k+1}), \quad (t_{k},t_{k+1}) \in T, \tag {11}\end{equation*} where the time interval between consecutive readings need not be constant, i.e.,\begin{equation*} t_{k+1} \neq t_{k} + t_{\Delta }. \tag {12}\end{equation*}
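A compact sketch of (9)-(12) is given below: the updated queries at $t_{k}$ are ego-motion compensated to the arbitrary arrival time $t_{k+1}$ of the next measurement and then correlated with it via attention. The SE(2) compensation of the query reference positions and the generic multi-head attention layer are illustrative assumptions, not the exact modules of RCF-TP.

```python
# Sketch of Eqs. (9)-(12): queries updated at t_k are ego-motion compensated (EMC)
# to the arbitrary arrival time t_{k+1} of the next measurement, then correlated with
# it via attention. The SE(2) shift of the query reference positions and the generic
# multi-head attention layer are illustrative assumptions.
import math
import torch

def emc(queries, positions, dxy, dyaw):
    """Shift the BEV reference positions of the queries by the ego motion between t_k and t_{k+1}."""
    c, s = math.cos(dyaw), math.sin(dyaw)
    R = torch.tensor([[c, -s], [s, c]])
    # Query content is left unchanged here; only the geometry is propagated.
    return queries, (positions - torch.tensor(dxy)) @ R

attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
q_upd, q_pos = torch.randn(1, 100, 64), torch.randn(1, 100, 2)     # state at t_k

# The next sensor fires after a non-constant gap, cf. Eq. (12).
q_pred, q_pos = emc(q_upd, q_pos, dxy=(0.8, 0.05), dyaw=0.01)       # Eq. (11)
d_tk1 = torch.randn(1, 500, 64)                                     # new sensor features
q_upd_next, _ = attn(q_pred, d_tk1, d_tk1)                          # Eq. (10): A(q_pred, d)
```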
3) Fusion Detection Head
The fusion detection head can be applied at any time to the latent representation of the objects to produce a BBX representation for path planning, visualization, or any other need:\begin{equation*} [(x, y, z, w, l, h, \psi), Cls] = \mathbf {H_{detection}} (\mathbf {q}), \tag {13}\end{equation*} where $Cls$ denotes the predicted object class.
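A minimal sketch of (13) follows: a small head regresses the seven box parameters and a class distribution from each latent query. The layer sizes and the number of classes are placeholders.

```python
# Sketch of Eq. (13): a detection head H_detection applied to the latent queries q,
# regressing the box tuple (x, y, z, w, l, h, psi) and a class distribution.
# Layer sizes and the number of classes are placeholders.
import torch

class DetectionHead(torch.nn.Module):
    def __init__(self, dim=64, num_classes=3):
        super().__init__()
        self.box = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU(),
                                       torch.nn.Linear(dim, 7))   # (x, y, z, w, l, h, psi)
        self.cls = torch.nn.Linear(dim, num_classes)

    def forward(self, q):
        return self.box(q), self.cls(q).softmax(dim=-1)

head = DetectionHead()
boxes, classes = head(torch.randn(1, 100, 64))   # 100 queries -> 100 candidate boxes
print(boxes.shape, classes.shape)                # torch.Size([1, 100, 7]) torch.Size([1, 100, 3])
```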
Implementation Details
Here, we depict our training strategy, datasets and evaluation method.
A. Bosch Dense Dataset
The Bosch Dense dataset is our internal large-scale dataset for autonomous driving that was collected in Germany during 2022-2023. The purpose of this dataset is to train models for Bosch use-cases and mitigate real-world challenges in automated driving. This collection of over 100 driving scenes, each more than 30 seconds in length, features a diverse and interesting set of driving scenarios. While widely used datasets for automated driving, such as [16], [31], [32], and [33], often provide annotations for all sensors at 2 Hz, our dataset offers the option to incorporate a blend of manually and auto-annotated labels. In our dataset, we manually annotate only a subset, while utilizing our lidar-camera model [1] to automatically generate additional training examples. This approach yields a more diverse and robust training set, one that benefits both from manual expertise and from the capabilities of our pre-trained model. In doing so, this paper also provides a qualitative and quantitative evaluation of the gap in 3D OD quality between using auto labels and manual labels in this dataset.
The structure of this dataset was inspired by the nuScenes [16] dataset; i.e., its data was obtained from a single vehicle equipped with 5 lidars, 10 cameras, 11 radars, a global navigation satellite system (GNSS), and an inertial measurement unit (IMU). This dataset is therefore suitable not only for computer vision (CV) use-cases but also for deploying sensor fusion models across the full sensor suite, for self-supervision of models, and for training foundation models.
B. Radar and Camera Model Training
1) Radar Model
For the radar branch, we trained three models that differed in their grid resolution. The motivation was to strike a balance between resolution and performance in 3D-OD. Namely, a 0.1m grid resolution, while high, introduces increased computational complexity, whereas a 0.4m grid resolution reduces computational cost but might yield degraded detection performance. Thus, we examined the following models:\begin{align*} H_{1}^{radar}, Grid^{1}_{res} & = [0.1m, 0.1m, 0.2m]; \tag {14}\\ H_{2}^{radar}, Grid^{2}_{res} & = [0.2m, 0.2m, 0.4m]; \tag {15}\\ H_{4}^{radar}, Grid^{4}_{res} & = [0.4m, 0.4m, 0.8m]. \tag {16}\end{align*}
C. Fusion Model Training
For the fusion, we trained three models that, as in the radar case, differed in their grid resolution, since the fusion model fuses the single camera model with the three pre-trained radar models (14)-(16). To allow a fair comparison, all three fusion models assume the same output grid resolution:\begin{equation*} H_{i}^{fusion}:= H^{cam} \otimes H_{i}^{radar}, \quad i \in \{1,2,4\}, \tag {17}\end{equation*} where $\otimes$ denotes the fusion of the camera and radar streams.
D. Dataset
We trained two distinct fusion models: one exclusively using manual labels and the other utilizing only auto labels. Subsequently, we trained three additional models by merging the datasets. This step was motivated by the observation (detailed in subsection IV-A) that not all frames in the dataset have manual annotations, resulting in a less-populated manual dataset. The amalgamation enabled the inclusion of numerous additional samples, especially ones with underrepresented classes, within the original manual dataset. We hypothesized that this augmentation would be particularly beneficial, anticipating an enhancement in detection performance, which we aimed to quantify. Furthermore, since the model had already acquired most of the relevant knowledge, this augmentation only necessitates some fine-tuning. Consequently, we chose not to re-train the model from scratch. Instead, we appended three additional epochs with the following compositions (a minimal sampling sketch is given after the list):
75% manual labels +25% auto labels;
50% manual labels +50% auto labels;
25% manual labels +75% auto labels.
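The sketch below illustrates the sampling strategy behind these three compositions, assuming placeholder dataset objects; it is not tied to the Bosch Dense data format.

```python
# Drawing each fine-tuning epoch from the manual and auto-labelled pools according
# to the chosen ratio. Dataset contents and sizes are placeholders.
import random

def mixed_epoch(manual_samples, auto_samples, manual_frac=0.75, epoch_size=1000):
    """Return an epoch in which manual_frac of the samples come from the manual pool."""
    n_manual = int(manual_frac * epoch_size)
    epoch = (random.choices(manual_samples, k=n_manual) +
             random.choices(auto_samples, k=epoch_size - n_manual))
    random.shuffle(epoch)
    return epoch

manual = [f"manual_{i}" for i in range(200)]
auto = [f"auto_{i}" for i in range(5000)]
for frac in (0.75, 0.50, 0.25):                  # the three compositions listed above
    epoch = mixed_epoch(manual, auto, manual_frac=frac)
    print(frac, sum(s.startswith("manual") for s in epoch))
```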
E. Sensor Drop-Out for Increased Robustness
Here, we purposefully degraded the quality of the data generated by the camera and radar with the intention of overcoming challenging environmental conditions. To degrade the quality of the radar data, we gradually reduced the number of points in each radar PC sample from 0% to 99.9% (see Fig. 3). To degrade the quality of the camera data, we quantized the image intensity using a quantization step $q_{step}$ determined by the bit depth $q_{val}$ (see Fig. 4):\begin{align*} q_{step} & = \frac {1}{2^{q_{val}}-1}, \\ I_{quant} & = \left ({{\frac {I_{in}}{q_{step}} \mod {255} }}\right) * q_{step}.\end{align*}
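The two degradations can be summarized as follows; the explicit rounding in the image quantization is our reading of the intended operation (the mod-255 term has no effect on intensities normalized to [0, 1]), and the drop rate shown is one point of the 0%-99.9% schedule.

```python
# Radar point dropout and image-intensity quantization used to degrade the training
# data. q_step follows the formula in the text; the explicit rounding is our reading
# of the intended quantization, and the mod-255 term is dropped since it has no
# effect on intensities normalized to [0, 1].
import numpy as np

def drop_radar_points(points, drop_rate):
    """Keep each radar point with probability (1 - drop_rate)."""
    keep = np.random.rand(len(points)) >= drop_rate
    return points[keep]

def quantize_image(img_in, q_val):
    """Reduce an image with intensities in [0, 1] to q_val bits."""
    q_step = 1.0 / (2 ** q_val - 1)
    return np.round(img_in / q_step) * q_step

radar_pc = np.random.randn(1000, 4)
img = np.random.rand(384, 640, 3).astype(np.float32)
print(drop_radar_points(radar_pc, drop_rate=0.999).shape)   # only ~1 of 1000 points survives
print(np.unique(quantize_image(img, q_val=3)).size)         # at most 8 intensity levels
```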
Fig. 3. Detection visualization with and without our radar degradation during the training pipeline. In these three images, the number of radar points increases from top to bottom while the image quality is kept constant. The white rectangles are the GT BBXs from the manually annotated dataset, and the pink ones are the BBXs predicted by the model. When there are few radar points (effectively camera only, 3a and 3b), the depth estimation has limitations. When the number of points increases, radar helps reduce the localization error of objects, as shown in 3c.
Fig. 4. Detection visualization with and without our camera degradation during the training pipeline. In these three images, the image quality improves from top to bottom while the number of radar points is kept constant. The white rectangles are the GT BBXs from the manually annotated dataset, and the pink ones are the BBXs predicted by the model. When the image quality is low (effectively radar only, 4a), the objects in the oncoming lane (yellow) are not detectable by radar alone due to sensor occlusion. The camera supports the detection of oncoming objects (yellow), as shown in 4b and 4c.
F. Training Setup
For the classification task, we used the focal loss, $L_{cls}$, and for box regression a regression loss, $L_{reg}$; the total loss is their weighted sum:\begin{equation*} loss = \lambda _{cls}L_{cls} + \lambda _{reg}L_{reg}, \tag {18}\end{equation*} where $\lambda_{cls}$ and $\lambda_{reg}$ are the corresponding weighting factors.
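A minimal sketch of (18) is shown below; the smooth-L1 choice for $L_{reg}$ and the values of $\lambda_{cls}$ and $\lambda_{reg}$ are illustrative assumptions.

```python
# Sketch of Eq. (18): focal classification loss plus a weighted box-regression loss.
# The smooth-L1 regression term and the lambda values are illustrative assumptions.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss; targets are one-hot class indicators."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def total_loss(cls_logits, cls_targets, box_preds, box_targets,
               lambda_cls=1.0, lambda_reg=0.25):
    return (lambda_cls * focal_loss(cls_logits, cls_targets) +
            lambda_reg * F.smooth_l1_loss(box_preds, box_targets))

loss = total_loss(torch.randn(100, 3), torch.randint(0, 2, (100, 3)).float(),
                  torch.randn(100, 7), torch.randn(100, 7))
print(loss.item())
```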
Experiments and Results
We conducted rigorous experiments to verify our claims and quantify the performance gain achieved by our proposed approach for training a radar-camera fusion model. In Section V-A, we present the results of training and optimizing our model to obtain the best detection results, and in Section V-B, we compare our approach to previous works. As is common in 3D-OD [1], [16], [31], we employ the average precision (AP) metric to evaluate detection performance. In addition, we assess the localization error of objects with respect to the ground-truth (GT) labels: the translation error is measured as the Euclidean planar center distance in [m], the orientation error is computed as the smallest yaw-angle difference in [rad], and the scale error quantifies the size mismatch between the predicted and GT boxes.
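For reference, the two localization errors can be computed as sketched below; the AP matching machinery and the scale error are omitted, as they follow the benchmark's standard evaluation code.

```python
# Translation and orientation errors as described above; the AP computation and the
# scale error follow the benchmark's standard matching machinery and are omitted.
import numpy as np

def translation_error(pred_xy, gt_xy):
    """Euclidean planar centre distance in metres."""
    return float(np.linalg.norm(np.asarray(pred_xy) - np.asarray(gt_xy)))

def orientation_error(pred_yaw, gt_yaw):
    """Smallest absolute yaw difference in radians."""
    diff = (pred_yaw - gt_yaw + np.pi) % (2 * np.pi) - np.pi
    return abs(diff)

print(translation_error((10.2, 3.1), (10.0, 3.0)))   # ~0.22 m
print(orientation_error(0.1, 2 * np.pi - 0.1))       # 0.2 rad, not ~6.08 rad
```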
A. Training Our Model
Here we present the steps we took to arrive at an optimal training strategy for our model. In subsection V-A1, the performance under different radar grid resolutions is compared, while in subsection V-A2, the fusion performance of these different radar models is evaluated. Subsection V-A3 presents the results of mixing auto labels and manual labels, and subsection V-A4 re-validates our conclusion from previous work [1] regarding detection performance in distance-dependent bins. The outcome of this section is an optimized model, RCF-TP, which is compared to previous works in Section V-B.
1) Radar Grid Resolution
Table 1 quantifies the performance gain when increasing the grid resolution for the radar modality. As explained in subsection III-B, the radar BEV grid is a regular grid and, as expected, by increasing the resolution of this regular grid, we can improve the detection performance. Training and testing
As real-world scenarios are reflected in the manual-labels version of the dataset, we also quantified the expected performance gap when (i) training and testing only with the auto labels, (ii) training only with the auto labels and testing only with the manual labels, and (iii) training and testing only with the manual labels. The auto-label results in Table 1 are thus intended to give an impression of the expected performance when comparing dataset versions.
2) Fusion With Different Radar Grids
Table 2 quantifies the performance of our trained fusion model when fused with three versions of radar models, each with a different grid resolution. As detailed in subsection IV-C, the fusion models share the same grid resolution. In accordance with subsection V-A1, we compare training with auto- and manual labels. Each entry in this table is presented as a tuple denoting (fusion, camera, radar) results.
As expected, the fusion model
Of note, as explained in subsection V-A1 and as in Table 1, the auto-label results presented in Table 2 are intended to give an impression of the expected performance when comparing dataset versions.
In addition, the trained fusion model w/ our sensor drop-out regime, referred to as
Please refer to Fig. 3 and Fig. 4 for visualizations of the detection performance improvement with and without our sensor-degradation training pipeline for radar and camera degradation, respectively.
3) Datasets Mixing
Tables 3 and 4 present the detection performance for the simultaneous combining of the two dataset versions for the Vehicle and Pedestrian classes, respectively. For the Vehicle class (Table 3), the results are not decisive, implying that adding more samples from the auto-labels version of the dataset does not improve the performance, as the manual version of the dataset already includes the necessary amount of data for training the model. For the Pedestrian class (Table 4), conversely, adding more samples from the auto-labels version of the dataset improves performance by
4) Distance Detection Performance
To evaluate the properties of each modality, we assessed the detection performance in distance-dependent range bins (Table 5). Here, we compared radar-only, camera-only, and fusion detection AP at a 4.0 m localization threshold for the ranges 10-30 m, 30-50 m, and 50-70 m. As can be gleaned from Table 5, camera-only detection is preferable at closer ranges (10-50 m), while radar-only detection is better at larger distances (50-70 m). In addition, fusion detection performance is better than that of the individual modalities in all tested bins. This suggests that our RCF-TP module is indeed asynchronous and accurate, as it successfully processes asynchronous data with improved performance.
5) Training Summary
To summarize, our optimized model,
B. Comparison to Previous Works
We implemented the algorithm of [24] and compared its performance with our approach. The comparison indicates that increasing the radar resolution leads to improved fusion performance, as shown in Table 2.
As such, and to allow fair comparison, we did not use our optimal model, discussed in subsection V-A5, but our RCF-TP model that fuses radar with a grid resolution of
We also compared our optimal model, presented in Table 4, with our implementation of [24] for the Pedestrian class. The results are provided in Table 7. Our model outperforms [24] by a factor of 3 in AP at the 4.0 m threshold, demonstrating a 2-fold improvement in the translation error and a marginal enhancement in scale error.
Conclusion and Future Work
We have presented RCF-TP, a modular, asynchronous, real-time architecture for fusing multiple cameras and radars for 3D OD using learned temporal priors. RCF-TP first extracts modality-specific features using dedicated backbones. These features are projected onto a common BEV and represented as queries for the objects in the environment, which constitute the model's state. After ego-motion correction, these queries are correlated with data received from other modalities, or from different sensors within the same modality. Using the attention mechanism, a detection head processes the fused features to output bounding-box representations. The model can easily be adapted to any sensor configuration with numerous cameras and radars, and can therefore mitigate sensor failures; it can fuse individual modalities that were trained with different grid sizes or resolutions; and it runs as fast as 10 Hz.
In our training regime, we employed a dataset-merger approach, leveraging both self-supervised and manual 3D annotations to enhance dataset diversity. This approach significantly improved detection performance, particularly for pedestrians, demonstrating a 2-fold enhancement compared to using manual labels alone, while not degrading the scale, translation, and orientation errors.
Extensive experiments highlight the importance of our asynchronous and modular sensor fusion for improving robustness under challenging weather conditions or in cases of sensor faults. The approach proved valuable in real-time, real-world scenarios where accurate and robust 3D-OD perception is essential, especially when certain classes are underrepresented in the dataset. Furthermore, its applicability extends beyond AD to robotics, medical, and Internet of Things (IoT) applications, as it addresses challenges related to sensor resolution, dataset combination, and the handling of minority classes that are common in these fields as well.
While our sensor fusion model is adept at mitigating sensor occlusion by compensating with data from other sensors and acquiring a shared latent representation of the scene, it may face difficulties in highly dynamic environments. Consequently, our future work will focus on expanding the detection range to encompass highway scenarios and on integrating a tracking module. This enhancement is anticipated to improve performance in highly dynamic environments and to contribute to the optimization of downstream planning tasks.