Introduction
Automated vehicles predominantly rely on onboard sensors like cameras and LiDAR for object detection, but each sensor type has inherent limitations that can hinder overall performance. Camera-based systems are susceptible to lighting conditions, often performing poorly in extreme brightness or darkness. LiDAR, while unaffected by lighting, faces challenges in adverse weather conditions such as rain, snow, and dust. Additionally, LiDAR systems are costly, making them less accessible. Even when using sensor fusion, which combines data from multiple sensors to enhance detection capabilities, performance remains constrained by environmental factors and the intrinsic limitations of the sensors. For instance, severe occlusion can significantly reduce detection accuracy, and sensors are typically unable to detect objects beyond their designed range, limiting the vehicle’s ability to respond to distant hazards.
The U.S. Department of Transportation (USDOT) has recently placed significant emphasis on the deployment of infrastructure sensors and the promotion of Vehicle-to-Everything (V2X) technology. In parallel, advancements in related technologies, such as 5G communication, are gradually being developed to enhance V2X capabilities and facilitate its real-world application. The advent of 5G, with its low latency, high-speed data transmission, and large bandwidth, is crucial for enabling real-time communication between infrastructure sensors and vehicles. This high-speed connectivity ensures that data collected from various infrastructure sensors can be transmitted rapidly to autonomous vehicles (AVs), which is essential for the real-time functionality required by V2X systems in practical applications.
As infrastructure sensor networks mature and related technologies develop, integrating these sensors into detection systems becomes increasingly appealing as a way to achieve more robust and accurate detection. Infrastructure sensors can provide a broader view than vehicle-mounted sensors, capturing additional information and resolving ambiguities caused by occlusions in crowded environments. In congested urban environments with buildings, parked cars, or pedestrians, vehicle-based sensors like LiDAR or cameras may have blind spots caused by occlusion, leading to incomplete or delayed detection of objects. Infrastructure sensors, placed at higher vantage points such as traffic lights, road signs, or elevated structures, can offer a more comprehensive view of the surroundings. This ensures that critical objects or obstacles are detected even when a vehicle's onboard sensors are obstructed, improving safety. Positioned in stable locations, infrastructure sensors are less affected by dynamic conditions such as vibrations and tilting, providing more reliable data. Moreover, they can extend the detection range of vehicles, enabling the detection of objects outside the vehicle's immediate range. Incorporating infrastructure sensors into object detection systems therefore significantly enhances the detection capabilities of autonomous vehicles, contributing to safer, more efficient, and more reliable operations. This broader detection allows AVs to make more informed decisions in advance, such as adjusting speed, choosing alternate routes, or making lane changes, thereby reducing the likelihood of accidents.
Literature Review
Multimodal fusion for 3D object detection can be categorized into three approaches based on the stage at which fusion occurs: early fusion, intermediate fusion, and late fusion. Early fusion involves combining raw data from infrastructure and onboard sensors at the initial stage, with subsequent processing performed on the fused data. Intermediate fusion occurs at the feature level, integrating meaningful features extracted from raw data, offering a balance between flexibility and scalability. Late fusion allows independent sensor operation, with their outputs combined only at the final stage.
A. Vehicle/Infrastructure Sensor Fusion for 3D Object Detection
On the vehicle side, multimodal fusion detection methods can be categorized into three types: point-level, proposal-level, and BEV-level fusion. Among point-level methods [1], [2], [3], [4], [5], PointPainting [1] is an early point-level fusion method that integrates segmentation information from images with LiDAR points to enhance detection. Among proposal-level methods [6], [7], [8], [9], Ku et al. [6] implement fusion at a late stage using proposal generation techniques. BEVFusion [10] is the first approach to perform multimodal intermediate fusion in the Bird's Eye View (BEV) domain, simplifying the fusion process while preserving both semantic and geometric information and mitigating perspective challenges such as occlusion. BEVFusion's structure is straightforward, and subsequent works [11], [12], [13], [14] have built upon it. For example, IS-Fusion [11] integrates scene-level and instance-level features during BEV fusion, and UniTR [12] presents a weight-sharing backbone that speeds up inference.
In contrast, research on multimodal fusion methods for infrastructure is less developed than on the vehicle side. On the infrastructure side, Yao et al. [15] introduce VBRFusion, which performs early-stage fusion of LiDAR points and images. Zimmer et al. [16] apply both early and late fusion to infrastructure point clouds and images, while Arnold et al. [17] compare early, late, and combined fusion strategies in their work on infrastructure LiDAR. CIP [18] fuses point clouds from multiple infrastructure LiDARs through point alignment.
B. Vehicle-Infrastructure Early Fusion and Late Fusion
Early fusion offers a richer data representation of the environment, potentially leading to better detection results. However, it incurs high transmission and computation costs, posing challenges for real-time applications. VRF [19] proposes an HD map-assisted fusion method that projects infrastructure and vehicle LiDAR points into a common 3D point domain and performs fusion there. VI-Eye [20] fuses infrastructure and vehicle LiDAR points at an early stage through saliency point registration, achieving real-time point cloud fusion.
In contrast, late fusion reduces computational costs and is robust to sensor failures, as the failure of one sensor minimally impacts the entire system. Additionally, late fusion is transparent and interpretable.
Recent research has made significant progress in this area. For instance, QUEST [21] treats each detected object as an element in a query and fuses objects detected from infrastructure and vehicle images using transformer techniques. IFTR [22] also employs transformer-based instance-level fusion by aggregating instance features from images and completing object detection with a BEV map. VICOD [23] develops a fusion network that uses thresholds to differentiate and fuse bounding boxes from infrastructure and vehicle sensors. VIPS [24] introduces graph matching to fuse detected instances from infrastructure and vehicle LiDAR, achieving real-time performance.
C. Vehicle-Infrastructure Intermediate Fusion
The advantages of intermediate fusion compared to early and late fusion are evident. Intermediate fusion reduces computational and storage costs by avoiding the need to store and process the entire dataset, as required in early fusion. It also preserves more contextual information from raw data than late fusion, which loses the ability to exploit the complementary strengths of different sensors.
Researchers have devoted significant effort to this domain, with many leveraging vision transformer techniques owing to their recent advances in computer vision. For example, VIMI [25] uses transformers to augment infrastructure-side images with vehicle-side information, converting these images to BEV representations for object detection. TransIFF [26] fuses infrastructure and vehicle point clouds using transformers, concatenating key and value pairs to reduce domain gaps. CenterCoop [27] introduces cross-attention blocks in a transformer architecture to enable interaction between infrastructure and vehicle point clouds. ViT-FuseNet [28] directly fuses infrastructure point clouds with vehicle images, performing 3D detection on the fused features. FETR [29] uses a transformer to predict features of the future frame and combines them with current features. Other methods, such as DI-V2X [30], propose a teacher-student model to generate domain-invariant feature maps, reducing the domain gap between infrastructure and onboard point clouds. PillarGrid [31] suggests a grid-wise fusion method for point clouds. Wang et al. [32] extract infrastructure-side features from the previous and current frames and fuse them with vehicle-side features using weighted addition. However, image-based methods generally focus only on the front view, neglecting the vehicle's surroundings, and most existing methods concentrate solely on areas where infrastructure and vehicle views largely overlap.
In our work, we extend the input from one image to six images, covering the entire surrounding area of the ego vehicle for more accurate and comprehensive object detection. Furthermore, we investigate scenarios both with and without overlapping coverage between infrastructure and vehicle detection.
D. Contribution
The contributions of our work can be summarized in three main points. First, we introduce a novel framework capable of seamlessly integrating infrastructure-side sensors with multiple vehicle onboard sensors. This framework is designed to be scalable, allowing for the incorporation of a wide variety of sensors and the execution of different tasks, thereby ensuring adaptability to diverse applications. Second, we propose an innovative cross-attention block that effectively harnesses and maximizes the interaction of information between sensors, leading to enhanced data fusion and more accurate object detection. Finally, we rigorously validate our model on the publicly available V2X-Sim [33] dataset under two distinct scenarios that reflect realistic conditions.
Methodology
Inspired by recent advancements in BEVFusion, an innovative pipeline has been developed to integrate infrastructure sensor data with vehicle onboard data, enabling a comprehensive data fusion system. A vital feature of this pipeline is the introduction of a novel Cross-Attention Block designed to reduce feature misalignment between heterogeneous sensors and incorporate auxiliary information from additional sources. This block enhances raw image data by fusing supplemental contextual information, leading to improved accuracy and robustness in the data processing pipeline. The proposed approach marks a significant advancement in multi-sensor fusion, providing a more comprehensive understanding of the environment and improving performance in tasks such as 3D object detection and map segmentation.
A. Vehicle Sensor Fusion for Object Detection
The entire framework is illustrated in Fig. 1. Our network architecture comprises three distinct pipelines: one for processing vehicle onboard images, one for handling LiDAR points from both the vehicle and infrastructure sides, and another for managing infrastructure images. The first two pipelines, dedicated to vehicle onboard camera and LiDAR data, are adapted from the BEVFusion framework because of its well-structured network design and proven efficacy in object detection tasks. The third pipeline focuses on infrastructure cameras and incorporates a series of processing steps: the Cross-Attention Augmentation Block, the Feature Extraction Block, and the View Transformation Block. The Cross-Attention Augmentation Block is explained in more detail in the next section. The Feature Extraction Block and View Transformation Block together constitute the BEV generation network in Fig. 1. For the Feature Extraction Block, we utilize a publicly available pre-trained model and fine-tune it on the V2X-Sim dataset. This block is crucial for extracting high-level features from raw images to generate detailed feature maps; a Swin Transformer [34] encoder is used for feature extraction. The View Transformation Block employs an LSS-based method [35] to convert images from various sensor viewpoints into BEV images, facilitating a unified perspective for subsequent processing stages. Further structural details of the Feature Extraction Block and View Transformation Block are shown in Fig. 2.
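To make the BEV generation network more concrete, the following is a minimal PyTorch-style sketch of the Feature Extraction Block and an LSS-style View Transformation Block. The convolutional encoder is only a stand-in for the Swin Transformer [34] backbone, and the view transform omits the geometric splatting that LSS [35] performs with camera intrinsics and extrinsics; all module and parameter names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn


class FeatureExtractionBlock(nn.Module):
    """Stand-in image encoder producing a downsampled feature map."""

    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, out_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) -> features: (B, C, H/4, W/4)
        return self.encoder(images)


class ViewTransformationBlock(nn.Module):
    """LSS-style lift: per-pixel depth distribution times context features."""

    def __init__(self, in_channels: int = 64, depth_bins: int = 32, bev_channels: int = 64):
        super().__init__()
        self.depth_head = nn.Conv2d(in_channels, depth_bins, kernel_size=1)  # depth distribution
        self.feat_head = nn.Conv2d(in_channels, bev_channels, kernel_size=1)  # context features

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        depth = self.depth_head(feats).softmax(dim=1)        # (B, D, h, w)
        context = self.feat_head(feats)                      # (B, C, h, w)
        frustum = depth.unsqueeze(2) * context.unsqueeze(1)  # (B, D, C, h, w), the "lift" step
        # Real LSS splats the frustum into BEV using camera intrinsics/extrinsics;
        # collapsing the image-height axis here is only a placeholder for that pooling.
        return frustum.sum(dim=3).permute(0, 2, 1, 3)        # pseudo-BEV: (B, C, D, w)


# Example on a single infrastructure view (the framework processes four such views).
feats = FeatureExtractionBlock()(torch.randn(1, 3, 256, 448))
bev = ViewTransformationBlock()(feats)                       # torch.Size([1, 64, 32, 112])
```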
The system takes six images from vehicle cameras; each image size is
B. Cross-Attention Block
As depicted in Fig. 1, the Cross-Attention Augmentation Module, inspired by the Vision Transformer (ViT) [36], is utilized in three distinct areas of our framework: two segments handle input images, and one segment processes BEV images. The integration of this module for input images aims to capture and leverage the relationships between different image inputs. This is crucial because image-to-image information is invaluable for accurate object reconstruction. A single camera may not always capture an entire object, necessitating the use of multiple cameras to reconstruct complete views accurately. Additionally, individual cameras might miss essential areas due to occlusion or misrecognition, in which case the corresponding features cannot be extracted. By incorporating auxiliary information from neighboring cameras, the primary camera can extract features more precisely. For example, if a camera fails to detect an object due to adverse lighting conditions, adjacent cameras that are less affected by these conditions can provide the missing information, allowing the network to address the overlooked areas effectively.
The Cross-Attention Augmentation Module is also applied to BEV images to address the spatial misalignment of features and the domain gap between BEV images. Based on our observations of the BEV feature maps, there exists a spatial misalignment between the BEV images caused by depth estimation errors. Although convolution-based fusion can reduce such misalignment to some extent, its capability is limited. Moreover, because of the sharp discrepancy between infrastructure and vehicle perspectives, objects might exhibit different feature representations in the infrastructure BEV images compared to the vehicle BEV images, making it challenging for conventional methods to fuse these disparate views effectively. The Cross-Attention Module is designed to bridge this gap by enhancing the alignment of features across different views, thereby improving the accuracy and completeness of the fused information.
For illustrative purposes, we focus here on the Cross-Attention Block within the infrastructure pipeline. The overall process is depicted in Fig. 3. Each black rectangle represents one infrastructure image, and the arrows depict the interactions between the images. The rectangular region highlighted in red represents the image undergoing processing, while the remaining images from neighboring homogeneous sensors supply their respective key and value components for concatenation. These concatenated keys and values are then used in the cross-attention mechanism. Detailed calculations and operations are outlined in Fig. 4. The cross-attention layer is defined in (1).\begin{align*}& \mathrm{CrossAttention}\left( Q_{i}^{\mathrm{infra}},\ K_{j}^{\mathrm{concat}},\ V_{j}^{\mathrm{concat}} \right) \\& \quad = \mathrm{softmax}\left( \frac{Q_{i}^{\mathrm{infra}} \left( K_{j}^{\mathrm{concat}} \right)^{T}}{\sqrt{d_{k}}} \right) V_{j}^{\mathrm{concat}} \tag{1}\end{align*}
\begin{align*}& \mathrm{image}_{i}^{\mathrm{infra}} \\& \quad = f_{2}\left( f_{1}\left( \mathrm{CrossAttention}\left( Q_{i}^{\mathrm{infra}},\ K_{j}^{\mathrm{concat}},\ V_{j}^{\mathrm{concat}} \right) \right) \right) \tag{2}\end{align*}
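For clarity, the following is a minimal PyTorch-style sketch of (1) and (2), assuming each image has already been tokenized into flattened patch features. Here f1 and f2 are interpreted as a linear projection followed by a small feed-forward network; their exact form follows Fig. 4, and the tensor shapes and module names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # f1 and f2 are assumed to be a linear projection and a feed-forward network.
        self.f1 = nn.Linear(dim, dim)
        self.f2 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.scale = dim ** -0.5

    def forward(self, x_i: torch.Tensor, neighbors: list) -> torch.Tensor:
        # x_i: (B, N, C) tokens of the image being processed (red box in Fig. 3).
        # neighbors: token tensors of the other homogeneous (infrastructure) cameras.
        q = self.q_proj(x_i)                                   # Q_i^{infra}
        kv = torch.cat(neighbors, dim=1)                       # concatenate neighboring views
        k, v = self.k_proj(kv), self.v_proj(kv)                # K_j^{concat}, V_j^{concat}
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = attn @ v                                         # Eq. (1)
        return self.f2(self.f1(out))                           # Eq. (2)


# Example: augment the first of four infrastructure images with the other three.
tokens = [torch.randn(1, 196, 256) for _ in range(4)]
augmented = CrossAttentionBlock()(tokens[0], tokens[1:])       # (1, 196, 256)
```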
Experiment and Results
A. Experiment Hardware Settings
For our experiments, compared with other publicly available datasets that include infrastructure data [18], [37], [38], V2X-Sim best fits our needs. V2X-Sim is generated in a CARLA-SUMO co-simulation and encompasses a comprehensive, well-synchronized collection of both infrastructure and vehicle sensor data. It also contains well-annotated ground truth that can support multiple perception tasks such as detection, segmentation, and tracking. More specifically, it includes six views of vehicle-side images and four views of infrastructure-side images, which are not provided by other online datasets. The dataset contains 100 scenes in total, each comprising 100 records. To ensure compatibility with the nuScenes tool library, some adjustments to the dataset are necessary, such as modifying scene names and folder names. The training and evaluation processes are conducted on ASU's supercomputing facilities, which feature multiple machines with hundreds of compute nodes, thousands of Central Processing Unit (CPU) cores, and numerous discrete Graphics Processing Units (GPUs). For our tasks, we use two NVIDIA A100 GPUs and four CPU cores with 64 GB of memory each.
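As an illustration of the kind of adjustment involved, the following is a hypothetical sketch of renaming folders and scene entries so that the nuScenes devkit can parse the data. The directory layout, file names, and naming patterns shown here are assumptions and will differ depending on the V2X-Sim release.

```python
import json
from pathlib import Path

root = Path("data/v2x-sim")  # hypothetical dataset root

# Rename sensor folders to a nuScenes-style convention (pattern is an assumption).
for folder in root.glob("sweeps/LIDAR_TOP_id_*"):
    folder.rename(folder.with_name(folder.name.replace("_id_", "_")))

# Rewrite scene names in scene.json to match the nuScenes "scene-XXXX" pattern.
scene_file = root / "v1.0-trainval" / "scene.json"
scenes = json.loads(scene_file.read_text())
for scene in scenes:
    scene["name"] = "scene-" + scene["name"].split("_")[-1].zfill(4)
scene_file.write_text(json.dumps(scenes, indent=2))
```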
B. 3D Object Detection
The default evaluation metrics from the nuScenes dataset [39] are used to compare results on the Car class. Two distinct scenarios, short-range and long-range, are designed for the 3D object detection task. By evaluating the VI-BEV framework in both scenarios, its effectiveness in improving object detection is comprehensively assessed.
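As a minimal sketch of how the two evaluation areas can be enforced, the snippet below gates boxes by their center coordinates in the ego frame, assuming a square region of [-50 m, 50 m] for the short-range case and [-100 m, 100 m] for the long-range case described below; the function name and array layout are illustrative.

```python
import numpy as np


def filter_boxes_by_range(centers: np.ndarray, half_extent: float) -> np.ndarray:
    """Keep boxes whose (x, y) center lies inside the square evaluation area."""
    inside = np.all(np.abs(centers[:, :2]) <= half_extent, axis=1)
    return centers[inside]


# Box centers (x, y, z) in the ego frame, in meters.
centers = np.array([[12.0, -3.5, 0.8], [76.0, 20.0, 1.1], [-48.0, 49.0, 0.6]])
short_range = filter_boxes_by_range(centers, 50.0)    # keeps the 1st and 3rd box
long_range = filter_boxes_by_range(centers, 100.0)    # keeps all three boxes
```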
In the short-range scenario, the evaluation focuses on the area within the detection range of the ego-vehicle’s sensors. This evaluation area is a [−50m, 50m]
In the long-range scenario, we expand the evaluation space to a larger area that includes regions beyond the vehicle’s detection range. Considering the distance between ego-vehicle and infrastructure, and the onboard sensor capability in V2X-Sim, the evaluation area is designed as a [−100m, 100m]
C. Short-Range Case
We set the evaluation range threshold to [−50m, 50m]
As shown in Fig. 9, the use of infrastructure assistance significantly improves object detection, particularly at the intersection, where a higher number of objects are detected with increased accuracy; the red rectangles highlight the regions with improved accuracy. From the comparison between camera-view images, it is evident that more objects are detected with the help of the infrastructure. For example, in the left-front image of Fig. 6, the white vehicle cannot be detected because of interference from the bicycle in front; that area is enlarged in the zoomed-in left-front image of Fig. 6 for clearer display. Infrastructure sensors are not subject to this interference, and the VI-BEV model detects the white vehicle, as shown in the left-front image of Fig. 7. Red rectangles highlight this region in Fig. 6, Fig. 7, and Fig. 8. Table 2 indicates that the detection accuracy of our model exceeds the baseline model by over 11.3%. This improvement can be attributed to two key factors: the inclusion of additional feature points from infrastructure sensors and the different view-angle detections provided by infrastructure sensors. The incorporation of more features enhances detection accuracy by providing richer data from the infrastructure sensors for the model to process. Additionally, the elevated placement of infrastructure sensors allows for top-down detections, which can mitigate the effects of occlusion.
D. Long-Range Case
In this extension, we broaden the evaluation range threshold to [−100m, 100m]
Based on the experimental results, we can conclude that infrastructure sensors effectively extend the vehicle detection range. As illustrated in Fig. 13(b), the inclusion of infrastructure sensors allows for the detection of additional objects outside the vehicle's range, as highlighted by the red rectangle. Table 2 further demonstrates a significant improvement in detection accuracy, with our model outperforming the baseline method by 15.1%. Moreover, the camera-view images and Table 2 show that the orientation predictions from our model are better than those of the baseline model; one example object is highlighted by red rectangles in the back-view images of Fig. 11 and Fig. 12. This improvement can be attributed to the integration of infrastructure sensors, which provide additional feature points that enhance detection accuracy and extend the detection range. Compared to the short-range scenario, detection accuracy decreases for both methods in the long-range case. This reduction can be explained by the increased evaluation range, which introduces more objects, some of which may be outside the detection capability of the sensors or challenging to detect due to distance; these are marked by the blue rectangle in Fig. 13(c). Additionally, the resolution of the BEV images decreases as a larger area is included for detection. However, the performance degradation of our model is notably less than that of the baseline model, indicating that our approach is more robust for long-range detection scenarios.
Conclusion
In this paper, we propose a novel framework for integrating infrastructure sensor data with vehicle sensor data through an intermediate fusion approach. Our model is designed to be well-organized and scalable, allowing for the incorporation of additional sensors as needed. It effectively leverages information from both homogeneous (e.g., vehicle sensors) and heterogeneous (e.g., infrastructure sensors) sources to enhance 3D object detection performance. We validate our framework using the V2X-Sim dataset across two scenarios: short-range and long-range. The results demonstrate that our model consistently outperforms the baseline model in both scenarios. Specifically, our approach shows that incorporating infrastructure data significantly improves vehicle detection capabilities and expands the detection range. This integration allows for more comprehensive situational awareness and better object detection performance.
For future work, we acknowledge that the reliance on infrastructure LiDAR points may limit the general applicability of our model in real-world settings due to the current sparse installation of infrastructure LiDAR at intersections. To address this limitation, a promising direction is to adapt our method to a camera-only model, which could offer broader applicability and flexibility in various environments. Additionally, our model is well-suited for map segmentation tasks, which are crucial for high-definition (HD) map construction. Another potential avenue for future research is to enhance the model for online HD map creation, further advancing the capabilities of real-time mapping and navigation systems.