Introduction
Automated vehicles predominantly rely on onboard sensors like cameras and LiDAR for object detection, but each sensor type has inherent limitations that can hinder overall performance. Camera-based systems are susceptible to lighting conditions, often performing poorly in extreme brightness or darkness. LiDAR, while unaffected by lighting, faces challenges in adverse weather conditions such as rain, snow, and dust. Additionally, LiDAR systems are costly, making them less accessible. Even when using sensor fusion, which combines data from multiple sensors to enhance detection capabilities, performance remains constrained by environmental factors and the intrinsic limitations of the sensors. For instance, severe occlusion can significantly reduce detection accuracy, and sensors are typically unable to detect objects beyond their designed range, limiting the vehicle’s ability to respond to distant hazards.
The U.S. Department of Transportation (USDOT) has recently placed significant emphasis on the deployment of infrastructure sensors and the promotion of Vehicle-to-Everything (V2X) technology. In parallel, advancements in related technologies, such as 5G communication, are gradually being developed to enhance V2X capabilities and facilitate its real-world application. The advent of 5G, with its low latency, high-speed data transmission, and large bandwidth, is crucial for enabling real-time communication between infrastructure sensors and vehicles. This high-speed connectivity ensures that data collected from various infrastructure sensors can be transmitted rapidly to autonomous vehicles (AVs), which is essential for the real-time functionality required by V2X systems in practical applications.
As infrastructure sensor networks mature and related technologies develop, integrating these sensors into detection systems becomes increasingly appealing as a way to achieve more robust and accurate detection. Infrastructure sensors can provide a broader view than vehicle-mounted sensors, capturing additional information and resolving ambiguities caused by occlusions in crowded environments. In congested urban environments with buildings, parked cars, or pedestrians, vehicle-based sensors like LiDAR or cameras may have blind spots caused by occlusion, leading to incomplete or delayed detection of objects. Infrastructure sensors, placed at higher vantage points such as traffic lights, road signs, or elevated structures, can offer a more comprehensive view of the surroundings. This ensures that critical objects or obstacles are detected even when a vehicle's onboard sensors are obstructed, improving safety. Positioned in stable locations, infrastructure sensors are less affected by dynamic conditions such as vibrations and tilting, providing more reliable data. Moreover, they can extend the detection range of vehicles, enabling the detection of objects outside the vehicle's immediate range. Incorporating infrastructure sensors into object detection systems therefore significantly enhances the detection capabilities of autonomous vehicles, contributing to safer, more efficient, and more reliable operations. This broader detection allows AVs to make more informed decisions in advance, such as adjusting speed, choosing alternate routes, or making lane changes, thereby reducing the likelihood of accidents.
Literature Review
Multimodal fusion for 3D object detection can be categorized into three approaches based on the stage at which fusion occurs: early fusion, intermediate fusion, and late fusion. Early fusion involves combining raw data from infrastructure and onboard sensors at the initial stage, with subsequent processing performed on the fused data. Intermediate fusion occurs at the feature level, integrating meaningful features extracted from raw data, offering a balance between flexibility and scalability. Late fusion allows independent sensor operation, with their outputs combined only at the final stage.
A. Vehicle/Infrastructure Sensor Fusion for 3D Object Detection
On the vehicle side, multimodal fusion detection methods can be categorized into three types: point-level, proposal-level, and BEV-level fusion. Among point-level methods [1], [2], [3], [4], [5], PointPainting [1] is an early point-level fusion method that integrates segmentation information from images with LiDAR points to enhance detection. Among proposal-level methods [6], [7], [8], [9], Ku et al. [6] implement fusion at a late stage using proposal generation techniques. BEVFusion [10] is the first approach to perform multimodal intermediate fusion in the Bird's Eye View (BEV) domain, simplifying the fusion process while preserving both semantic and geometric information and mitigating perspective challenges such as occlusion. BEVFusion's structure is straightforward, and subsequent works [11], [12], [13], [14] have built upon it. For example, IS-Fusion [11] integrates scene-level and instance-level features during BEV fusion, and UniTR [12] presents a weight-sharing backbone that speeds up inference.
In contrast, research on multimodal fusion methods for infrastructure is less developed than on the vehicle side. On the infrastructure side, Yao et al. [15] introduce VBRFusion, which performs early-stage fusion of LiDAR points and images. Zimmer et al. [16] apply both early and late fusion to infrastructure point clouds and images, while Arnold et al. [17] compare early, late, and combined fusion strategies in their work on infrastructure LiDAR. CIP [18] fuses point clouds from multiple infrastructure LiDARs through point alignment.
B. Vehicle-Infrastructure Early Fusion and Late Fusion
Early fusion offers a richer data representation of the environment, potentially leading to better detection results. However, it incurs high transmission and computation costs, posing challenges for real-time applications. VRF [19] proposes an HD map-assisted fusion method that projects infrastructure and vehicle LiDAR points into a common 3D point domain and performs fusion there. VI-Eye [20] fuses infrastructure and vehicle LiDAR points at an early stage through saliency point registration, achieving real-time point cloud fusion.
In contrast, late fusion reduces computational costs and is robust to sensor failures, as the failure of one sensor minimally impacts the entire system. Additionally, late fusion is transparent and interpretable.
Recent research has made significant progress in this area. For instance, QUEST [21] treats each detected object as an element in a query and fuses objects detected from infrastructure and vehicle images using transformer techniques. IFTR [22] also employs transformer-based instance-level fusion by aggregating instance features from images and completing object detection with a BEV map. VICOD [23] develops a fusion network that uses thresholds to differentiate and fuse bounding boxes from infrastructure and vehicle sensors. VIPS [24] introduces graph matching to fuse detected instances from infrastructure and vehicle LiDAR, achieving real-time performance.
C. Vehicle-Infrastructure Intermediate Fusion
The advantages of intermediate fusion compared to early and late fusion are evident. Intermediate fusion reduces computational and storage costs by avoiding the need to store and process the entire dataset, as required in early fusion. It also preserves more contextual information from raw data than late fusion, which loses the ability to exploit the complementary strengths of different sensors.
Researchers have devoted significant effort to this domain, with many leveraging vision transformer techniques owing to their recent advances in computer vision. For example, VIMI [25] uses transformers to augment infrastructure-side images with vehicle-side information, converting these images to BEV representations for object detection. TransIFF [26] fuses infrastructure and vehicle point clouds using transformers, concatenating key and value pairs to reduce domain gaps. CenterCoop [27] introduces cross-attention blocks in a transformer architecture to enable interaction between infrastructure and vehicle point clouds. ViT-FuseNet [28] directly fuses infrastructure point clouds with vehicle images, performing 3D detection on the fused features. FETR [29] uses a transformer to predict features of the future frame and combines them with current features. Other methods, such as DI-V2X [30], propose a teacher-student model to generate domain-invariant feature maps, reducing the domain gap between infrastructure and onboard point clouds. PillarGrid [31] suggests a grid-wise fusion method for point clouds. Wang et al. [32] extract infrastructure-side features from the previous and current frames and fuse them with vehicle-side features using weighted addition. However, image-based methods generally focus only on the front view, neglecting the vehicle's surroundings, and most existing methods concentrate solely on areas where infrastructure and vehicle views largely overlap.
In our work, we extend the input from one image to six images, covering the entire surrounding area of the ego vehicle for more accurate and comprehensive object detection. Furthermore, we investigate scenarios both with and without overlapping coverage between infrastructure and vehicle detection.
D. Contribution
The contributions of our work can be summarized in three main points. First, we introduce a novel framework capable of seamlessly integrating infrastructure-side sensors with multiple vehicle onboard sensors. This framework is designed to be scalable, allowing for the incorporation of a wide variety of sensors and the execution of different tasks, thereby ensuring adaptability to diverse applications. Second, we propose an innovative cross-attention block that effectively harnesses and maximizes the interaction of information between sensors, leading to enhanced data fusion and more accurate object detection. Finally, we rigorously validate our model on the publicly available V2X-Sim [33] dataset under two distinct scenarios that reflect realistic conditions.
Methodology
Inspired by recent advancements in BEVFusion, an innovative pipeline has been developed to integrate infrastructure sensor data with vehicle onboard data, enabling a comprehensive data fusion system. A vital feature of this pipeline is the introduction of a novel Cross-Attention Block designed to reduce feature misalignment between heterogeneous sensors and incorporate auxiliary information from additional sources. This block enhances raw image data by fusing supplemental contextual information, leading to improved accuracy and robustness in the data processing pipeline. The proposed approach marks a significant advancement in multi-sensor fusion, providing a more comprehensive understanding of the environment and improving performance in tasks such as 3D object detection and map segmentation.
A. Vehicle Sensor Fusion for Object Detection
The entire framework is illustrated in Fig. 1. Our network architecture comprises three distinct pipelines: one for processing vehicle onboard images, one for handling LiDAR points from both the vehicle and infrastructure sides, and another for managing infrastructure images. The first two pipelines, dedicated to vehicle onboard camera and LiDAR data, are adapted from the BEVFusion framework because of its well-structured network design and proven efficacy in object detection tasks. The third pipeline focuses on infrastructure cameras and incorporates a series of processing steps: the Cross-Attention Augmentation Block, the Feature Extraction Block, and the View Transformation Block. The Cross-Attention Augmentation Block is explained in more detail in the next section. The Feature Extraction Block and View Transformation Block together constitute the BEV generation network in Fig. 1. For the Feature Extraction Block, we utilize a publicly available pre-trained model and fine-tune it on the V2X-Sim dataset. This block is crucial for extracting high-level features from raw images to generate detailed feature maps; a Swin Transformer [34] encoder is used for feature extraction. The View Transformation Block employs an LSS-based method [35] to convert images from various sensor viewpoints into BEV images, facilitating a unified perspective for subsequent processing stages. Further structural details of the Feature Extraction Block and View Transformation Block are shown in Fig. 2.
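To make the BEV generation network more concrete, the following is a minimal PyTorch-style sketch of the Feature Extraction Block and an LSS-style View Transformation Block. The convolutional encoder is only a stand-in for the Swin Transformer [34] backbone, and the view transform omits the geometric splatting that LSS [35] performs with camera intrinsics and extrinsics; all module and parameter names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn


class FeatureExtractionBlock(nn.Module):
    """Stand-in image encoder producing a downsampled feature map."""

    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, out_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) -> features: (B, C, H/4, W/4)
        return self.encoder(images)


class ViewTransformationBlock(nn.Module):
    """LSS-style lift: per-pixel depth distribution times context features."""

    def __init__(self, in_channels: int = 64, depth_bins: int = 32, bev_channels: int = 64):
        super().__init__()
        self.depth_head = nn.Conv2d(in_channels, depth_bins, kernel_size=1)  # depth distribution
        self.feat_head = nn.Conv2d(in_channels, bev_channels, kernel_size=1)  # context features

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        depth = self.depth_head(feats).softmax(dim=1)        # (B, D, h, w)
        context = self.feat_head(feats)                      # (B, C, h, w)
        frustum = depth.unsqueeze(2) * context.unsqueeze(1)  # (B, D, C, h, w), the "lift" step
        # Real LSS splats the frustum into BEV using camera intrinsics/extrinsics;
        # collapsing the image-height axis here is only a placeholder for that pooling.
        return frustum.sum(dim=3).permute(0, 2, 1, 3)        # pseudo-BEV: (B, C, D, w)


# Example on a single infrastructure view (the framework processes four such views).
feats = FeatureExtractionBlock()(torch.randn(1, 3, 256, 448))
bev = ViewTransformationBlock()(feats)                       # torch.Size([1, 64, 32, 112])
```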
The system takes six images from vehicle cameras; each image size is
B. Cross-Attention Block
As depicted in Fig. 1, the Cross-Attention Augmentation Module, inspired by the Vision Transformer (ViT) [36], is utilized in three distinct areas of our framework: two segments handle input images, and one segment processes BEV images. The integration of this module for input images aims to capture and leverage the relationships between different image inputs. This is crucial because image-to-image information is invaluable for accurate object reconstruction. A single camera may not always capture an entire object, necessitating the use of multiple cameras to reconstruct complete views accurately. Additionally, individual cameras might miss essential areas due to occlusion or misrecognition, in which case the corresponding features cannot be extracted. By incorporating auxiliary information from neighboring cameras, the primary camera can extract features more precisely. For example, if a camera fails to detect an object due to adverse lighting conditions, adjacent cameras that are less affected by these conditions can provide the missing information, allowing the network to address the overlooked areas effectively.
The Cross-Attention Augmentation Module is also applied to BEV images to address the spatial misalignment of features and the domain gap between BEV images. Based on our observations of the BEV feature maps, there exists a spatial misalignment between the BEV images caused by depth estimation errors. Although convolution-based fusion can reduce such misalignment to some extent, its capability is limited. Moreover, because of the sharp discrepancy between infrastructure and vehicle perspectives, objects might exhibit different feature representations in the infrastructure BEV images compared to the vehicle BEV images, making it challenging for conventional methods to fuse these disparate views effectively. The Cross-Attention Module is designed to bridge this gap by enhancing the alignment of features across different views, thereby improving the accuracy and completeness of the fused information.
For illustrative purposes, we focus here on the Cross-Attention Block within the infrastructure pipeline. The overall process is depicted in Fig. 3. Each black rectangle represents one infrastructure image, and the arrows depict the interactions between the images. The rectangular region highlighted in red represents the image undergoing processing, while the remaining images from neighboring homogeneous sensors supply their respective key and value components for concatenation. These concatenated keys and values are then used in the cross-attention mechanism. Detailed calculations and operations are outlined in Fig. 4. The cross-attention layer is defined in (1).\begin{align*}& \mathrm{CrossAttention}\left( Q_{i}^{\mathrm{infra}},\ K_{j}^{\mathrm{concat}},\ V_{j}^{\mathrm{concat}} \right) \\& \quad = \mathrm{softmax}\left( \frac{Q_{i}^{\mathrm{infra}} \left( K_{j}^{\mathrm{concat}} \right)^{T}}{\sqrt{d_{k}}} \right) V_{j}^{\mathrm{concat}} \tag{1}\end{align*}
\begin{align*}& \mathrm{image}_{i}^{\mathrm{infra}} \\& \quad = f_{2}\left( f_{1}\left( \mathrm{CrossAttention}\left( Q_{i}^{\mathrm{infra}},\ K_{j}^{\mathrm{concat}},\ V_{j}^{\mathrm{concat}} \right) \right) \right) \tag{2}\end{align*}
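For clarity, the following is a minimal PyTorch-style sketch of (1) and (2), assuming each image has already been tokenized into flattened patch features. Here f1 and f2 are interpreted as a linear projection followed by a small feed-forward network; their exact form follows Fig. 4, and the tensor shapes and module names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # f1 and f2 are assumed to be a linear projection and a feed-forward network.
        self.f1 = nn.Linear(dim, dim)
        self.f2 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.scale = dim ** -0.5

    def forward(self, x_i: torch.Tensor, neighbors: list) -> torch.Tensor:
        # x_i: (B, N, C) tokens of the image being processed (red box in Fig. 3).
        # neighbors: token tensors of the other homogeneous (infrastructure) cameras.
        q = self.q_proj(x_i)                                   # Q_i^{infra}
        kv = torch.cat(neighbors, dim=1)                       # concatenate neighboring views
        k, v = self.k_proj(kv), self.v_proj(kv)                # K_j^{concat}, V_j^{concat}
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = attn @ v                                         # Eq. (1)
        return self.f2(self.f1(out))                           # Eq. (2)


# Example: augment the first of four infrastructure images with the other three.
tokens = [torch.randn(1, 196, 256) for _ in range(4)]
augmented = CrossAttentionBlock()(tokens[0], tokens[1:])       # (1, 196, 256)
```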
Experiment and Results
A. Experiment Hardware Settings
For our experiments, compared with other publicly available datasets that include infrastructure data [18], [37], [38], V2X-Sim best fits our needs. V2X-Sim is generated in a CARLA-SUMO co-simulation and encompasses a comprehensive, well-synchronized collection of both infrastructure and vehicle sensor data. It also contains well-annotated ground truth that can support multiple perception tasks such as detection, segmentation, and tracking. More specifically, it includes six views of vehicle-side images and four views of infrastructure-side images, which are not provided by other online datasets. The dataset contains 100 scenes in total, each comprising 100 records. To ensure compatibility with the nuScenes tool library, some adjustments to the dataset are necessary, such as modifying scene names and folder names. The training and evaluation processes are conducted on ASU's supercomputing facilities, which feature multiple machines with hundreds of compute nodes, thousands of Central Processing Unit (CPU) cores, and numerous discrete Graphics Processing Units (GPUs). For our tasks, we use two NVIDIA A100 GPUs and four CPU cores with 64 GB of memory each.
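As an illustration of the kind of adjustment involved, the following is a hypothetical sketch of renaming folders and scene entries so that the nuScenes devkit can parse the data. The directory layout, file names, and naming patterns shown here are assumptions and will differ depending on the V2X-Sim release.

```python
import json
from pathlib import Path

root = Path("data/v2x-sim")  # hypothetical dataset root

# Rename sensor folders to a nuScenes-style convention (pattern is an assumption).
for folder in root.glob("sweeps/LIDAR_TOP_id_*"):
    folder.rename(folder.with_name(folder.name.replace("_id_", "_")))

# Rewrite scene names in scene.json to match the nuScenes "scene-XXXX" pattern.
scene_file = root / "v1.0-trainval" / "scene.json"
scenes = json.loads(scene_file.read_text())
for scene in scenes:
    scene["name"] = "scene-" + scene["name"].split("_")[-1].zfill(4)
scene_file.write_text(json.dumps(scenes, indent=2))
```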
B. 3D Object Detection
The default evaluation metrics from the nuScenes dataset [39] are used to compare results on the Car class. Two distinct scenarios, short-range and long-range, are designed for the 3D object detection task. By evaluating the VI-BEV framework in both scenarios, its effectiveness in improving object detection is comprehensively assessed.
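As a minimal sketch of how the two evaluation areas can be enforced, the snippet below gates boxes by their center coordinates in the ego frame, assuming a square region of [-50 m, 50 m] for the short-range case and [-100 m, 100 m] for the long-range case described below; the function name and array layout are illustrative.

```python
import numpy as np


def filter_boxes_by_range(centers: np.ndarray, half_extent: float) -> np.ndarray:
    """Keep boxes whose (x, y) center lies inside the square evaluation area."""
    inside = np.all(np.abs(centers[:, :2]) <= half_extent, axis=1)
    return centers[inside]


# Box centers (x, y, z) in the ego frame, in meters.
centers = np.array([[12.0, -3.5, 0.8], [76.0, 20.0, 1.1], [-48.0, 49.0, 0.6]])
short_range = filter_boxes_by_range(centers, 50.0)    # keeps the 1st and 3rd box
long_range = filter_boxes_by_range(centers, 100.0)    # keeps all three boxes
```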
In the short-range scenario, the evaluation focuses on the area within the detection range of the ego-vehicle’s sensors. This evaluation area is a [−50m, 50m]
In the long-range scenario, we expand the evaluation space to a larger area that includes regions beyond the vehicle’s detection range. Considering the distance between ego-vehicle and infrastructure, and the onboard sensor capability in V2X-Sim, the evaluation area is designed as a [−100m, 100m]
C. Short-Range Case
We set the evaluation range threshold to [−50m, 50m]
As shown in Fig. 9, the use of infrastructure assistance significantly improves object detection, particularly at the intersection, where a higher number of objects are detected with increased accuracy; the red rectangles highlight the regions with improved accuracy. From the comparison between camera-view images, it is evident that more objects are detected with the help of the infrastructure. For example, in the left-front image of Fig. 6, the white vehicle cannot be detected because of interference from the bicycle in front; that area is enlarged in the zoomed-in left-front image of Fig. 6 for clearer display. Infrastructure sensors are not subject to this interference, and the VI-BEV model detects the white vehicle, as shown in the left-front image of Fig. 7. Red rectangles highlight this region in Fig. 6, Fig. 7, and Fig. 8. Table 2 indicates that the detection accuracy of our model exceeds the baseline model by over 11.3%. This improvement can be attributed to two key factors: the inclusion of additional feature points from infrastructure sensors and the different view-angle detections provided by infrastructure sensors. The incorporation of more features enhances detection accuracy by providing richer data from the infrastructure sensors for the model to process. Additionally, the elevated placement of infrastructure sensors allows for top-down detections, which can mitigate the effects of occlusion.
D. Long-Range Case
In this extension, we broaden the evaluation range threshold to [−100m, 100m]
Based on the experimental results, we can conclude that infrastructure sensors effectively extend the vehicle detection range. As illustrated in Fig. 13(b), the inclusion of infrastructure sensors allows for the detection of additional objects outside the vehicle's range, as highlighted by the red rectangle. Table 2 further demonstrates a significant improvement in detection accuracy, with our model outperforming the baseline method by 15.1%. Moreover, the camera-view images and Table 2 show that the orientation predictions from our model are better than those of the baseline model; one example object is highlighted by red rectangles in the back-view images of Fig. 11 and Fig. 12. This improvement can be attributed to the integration of infrastructure sensors, which provide additional feature points that enhance detection accuracy and extend the detection range. Compared to the short-range scenario, detection accuracy decreases for both methods in the long-range case. This reduction can be explained by the increased evaluation range, which introduces more objects, some of which may be outside the detection capability of the sensors or challenging to detect due to distance; these are marked by the blue rectangle in Fig. 13(c). Additionally, the resolution of the BEV images decreases as a larger area is included for detection. However, the performance degradation of our model is notably less than that of the baseline model, indicating that our approach is more robust for long-range detection scenarios.
Conclusion
In this paper, we propose a novel framework for integrating infrastructure sensor data with vehicle sensor data through an intermediate fusion approach. Our model is designed to be well-organized and scalable, allowing for the incorporation of additional sensors as needed. It effectively leverages information from both homogeneous (e.g., vehicle sensors) and heterogeneous (e.g., infrastructure sensors) sources to enhance 3D object detection performance. We validate our framework using the V2X-Sim dataset across two scenarios: short-range and long-range. The results demonstrate that our model consistently outperforms the baseline model in both scenarios. Specifically, our approach shows that incorporating infrastructure data significantly improves vehicle detection capabilities and expands the detection range. This integration allows for more comprehensive situational awareness and better object detection performance.
For future work, we acknowledge that the reliance on infrastructure LiDAR points may limit the general applicability of our model in real-world settings due to the current sparse installation of infrastructure LiDAR at intersections. To address this limitation, a promising direction is to adapt our method to a camera-only model, which could offer broader applicability and flexibility in various environments. Additionally, our model is well-suited for map segmentation tasks, which are crucial for high-definition (HD) map construction. Another potential avenue for future research is to enhance the model for online HD map creation, further advancing the capabilities of real-time mapping and navigation systems.