
Channelwise and Spatially Guided Multimodal Feature Fusion Network for 3-D Object Detection in Autonomous Vehicles


Abstract:

Accurate 3-D object detection is vital in autonomous driving. Traditional LiDAR-only models struggle with sparse point clouds. We propose a novel approach that integrates LiDAR and camera data, exploiting the strengths of each sensor while overcoming their individual limitations for enhanced 3-D object detection. Our research introduces the channelwise and spatially guided multimodal feature fusion network (CSMNET) for 3-D object detection. First, our method enhances LiDAR data by projecting it onto a 2-D plane, enabling the extraction of class-specific features from a probability map. Second, we design class-based farthest point sampling (C-FPS), which boosts the selection of foreground points by weighting points according to geometric or probability features while ensuring diversity among the selected points. Third, we develop a parallel attention (PAT)-based multimodal fusion mechanism that achieves higher resolution than raw LiDAR points. This fusion mechanism combines two attention mechanisms: channel attention for LiDAR data and spatial attention for camera data. These mechanisms enhance the utilization of semantic features in a region of interest (ROI) to obtain more representative point features, leading to a more effective fusion of information from the LiDAR and camera sources. Specifically, CSMNET achieves an average precision (AP) in bird’s eye view (BEV) detection of 90.16% (easy), 85.18% (moderate), and 80.51% (hard), with a mean AP (mAP) of 85.12%. In 3-D detection, CSMNET attains 82.05% (easy), 72.64% (moderate), and 67.10% (hard), with an mAP of 73.75%. For 2-D detection, the scores are 95.47% (easy), 93.25% (moderate), and 86.68% (hard), yielding an mAP of 91.72% on the KITTI dataset.
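To make the parallel attention fusion described above concrete, the following is a minimal, illustrative PyTorch sketch only: it assumes an SE-style channel attention on the LiDAR branch and a CBAM-style spatial attention on the camera branch, followed by concatenation and a 1x1 convolution. All module and tensor names (ChannelAttention, SpatialAttention, PATFusion, lidar_feats, cam_feats) and the fusion step are assumptions made for exposition, not the authors' implementation.

```python
# Illustrative sketch only; names and layer choices are assumptions, not CSMNET's released code.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """SE-style channel attention, applied here to LiDAR ROI features."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); squeeze spatial dims, then excite per-channel weights.
        w = self.mlp(x.mean(dim=(2, 3)))           # (B, C)
        return x * w.unsqueeze(-1).unsqueeze(-1)   # reweight channels


class SpatialAttention(nn.Module):
    """Single-map spatial attention, applied here to camera features."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Aggregate channel statistics, then predict a per-pixel weight map.
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # (B, 1, H, W)
        return x * w


class PATFusion(nn.Module):
    """Parallel attention fusion: channel attention on the LiDAR branch and
    spatial attention on the camera branch, fused by concatenation + 1x1 conv."""

    def __init__(self, lidar_ch: int, cam_ch: int, out_ch: int):
        super().__init__()
        self.lidar_att = ChannelAttention(lidar_ch)
        self.cam_att = SpatialAttention()
        self.fuse = nn.Conv2d(lidar_ch + cam_ch, out_ch, kernel_size=1)

    def forward(self, lidar_feats: torch.Tensor, cam_feats: torch.Tensor) -> torch.Tensor:
        # Both feature maps are assumed to be aligned to the same ROI grid.
        fused = torch.cat([self.lidar_att(lidar_feats), self.cam_att(cam_feats)], dim=1)
        return self.fuse(fused)


if __name__ == "__main__":
    lidar = torch.randn(2, 64, 32, 32)   # toy LiDAR ROI features
    cam = torch.randn(2, 32, 32, 32)     # toy camera ROI features
    print(PATFusion(64, 32, 128)(lidar, cam).shape)  # torch.Size([2, 128, 32, 32])
```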
Article Sequence Number: 5707515
Date of Publication: 08 October 2024

I. Introduction

Accurate 3-D object detection is of paramount importance in the domain of autonomous vehicles (AVs), as well as for understanding object dimensions and positions in real-world scenarios [1], [2], [3]. Recent research focuses on harnessing LiDAR and camera data for this purpose, capitalizing on LiDAR’s point cloud-based 3-D data and cameras’ high-resolution RGB images [4]. Despite their importance, efficiently extracting and fusing features from these sources poses challenges. While deep learning-based feature extraction is well established, especially for RGB images, dealing with the irregular distribution and sparsity of point clouds is complex [5]. Existing methods have transformed point clouds into either voxel grids or dense 2-D images so that 2-D neural networks can be applied [6], [7], [8], [9], [10], [11]. Recent advancements include the direct use of multilayer perceptrons (MLPs) to aggregate features from raw point clouds and the exploration of graph-based representations that treat points as vertices [12], [13].
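As a concrete illustration of the MLP-based aggregation mentioned above, the following is a minimal PyTorch sketch of a shared per-point MLP followed by symmetric max pooling, in the spirit of PointNet-style set abstraction. The class name PointMLPAggregator, the layer sizes, and feat_dim are illustrative assumptions and are not taken from any cited implementation.

```python
# Illustrative sketch of per-point MLP feature aggregation; names and sizes are assumptions.
import torch
import torch.nn as nn


class PointMLPAggregator(nn.Module):
    """Shared per-point MLP followed by max pooling, so the aggregated
    feature is invariant to the ordering of the input points."""

    def __init__(self, in_dim: int = 3, feat_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(inplace=True),
            nn.Linear(64, feat_dim), nn.ReLU(inplace=True),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) raw xyz coordinates; no voxelization required.
        per_point = self.mlp(points)            # (B, N, feat_dim)
        global_feat, _ = per_point.max(dim=1)   # (B, feat_dim), order-invariant
        return global_feat


if __name__ == "__main__":
    cloud = torch.randn(4, 1024, 3)  # toy batch of 1024-point clouds
    print(PointMLPAggregator()(cloud).shape)  # torch.Size([4, 128])
```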

References

[1] J. Deng, S. Shi, P. Li, W. Zhou, Y. Zhang and H. Li, "Voxel R-CNN: Towards high performance voxel-based 3D object detection", Proc. 35th AAAI Conf. Artif. Intell., vol. 35, no. 2, pp. 1201-1209, 2021.
[2] C. He, H. Zeng, J. Huang, X. Hua and L. Zhang, "Structure aware single-stage 3D object detection from point cloud", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 11870-11879, Jun. 2020.
[3] C. Vishnu, J. Khandelwal, C. K. Mohan and C. L. Reddy, "EVAA—Exchange vanishing adversarial attack on LiDAR point clouds in autonomous vehicles", IEEE Trans. Geosci. Remote Sens., vol. 61, pp. 1-10, 2023.
[4] S. Gu, Y. Zhang, J. Tang, J. Yang, J. M. Alvarez and H. Kong, "Integrating dense LiDAR-camera road detection maps by a multi-modal CRF model", IEEE Trans. Veh. Technol., vol. 68, no. 12, pp. 11635-11645, Dec. 2019.
[5] Y. Zhang, Q. Hu, G. Xu, Y. Ma, J. Wan and Y. Guo, "Not all points are equal: Learning highly efficient point-based detectors for 3D LiDAR point clouds", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 18931-18940, Jun. 2022.
[6] B. Yang, W. Luo and R. Urtasun, "PIXOR: Real-time 3D object detection from point clouds", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 7652-7660, Jun. 2018.
[7] Y. Zhou and O. Tuzel, "VoxelNet: End-to-end learning for point cloud based 3D object detection", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 4490-4499, Jun. 2018.
[8] Y. Yan, Y. Mao and B. Li, "SECOND: Sparsely embedded convolutional detection", Sensors, vol. 18, no. 10, Art. no. 3337, Oct. 2018.
[9] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang and O. Beijbom, "PointPillars: Fast encoders for object detection from point clouds", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 12689-12697, Jun. 2019.
[10] S. Shi, X. Wang and H. Li, "PointRCNN: 3D object proposal generation and detection from point cloud", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 770-779, Jun. 2019.
[11] D. Lu, K. Gao, Q. Xie, L. Xu and J. Li, "3DGTN: 3-D dual-attention glocal transformer network for point cloud classification and segmentation", IEEE Trans. Geosci. Remote Sens., vol. 62, pp. 1-13, 2024.
[12] P. Sun et al., "Scalability in perception for autonomous driving: Waymo open dataset", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 2446-2454, Jun. 2020.
[13] J. Chen, B. Kakillioglu and S. Velipasalar, "Background-aware 3-D point cloud segmentation with dynamic point feature aggregation", IEEE Trans. Geosci. Remote Sens., vol. 60, pp. 1-12, 2022.
[14] D. Park, R. Ambrus, V. Guizilini, J. Li and A. Gaidon, "Is pseudo-LiDAR needed for monocular 3D object detection?", Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pp. 3122-3132, Oct. 2021.
[15] Z. Liu, X. Zhao, T. Huang, R. Hu, Y. Zhou and X. Bai, "TANet: Robust 3D object detection from point clouds with triple attention", Proc. Conf. Artif. Intell. (AAAI), pp. 11677-11684, Feb. 2020.
[16] J. Yang et al., "Modeling point clouds with self-attention and Gumbel subset sampling", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3318-3327, Jun. 2019.
[17] W. Zheng, W. Tang, S. Chen, L. Jiang and C. W. Fu, "CIA-SSD: Confident IoU-aware single-stage object detector from point cloud", Proc. 35th AAAI Conf. Artif. Intell., vol. 35, no. 4, pp. 3555-3562, 2021.
[18] S. Shi et al., "PV-RCNN++: Point-voxel feature set abstraction with local vector representation for 3D object detection", Int. J. Comput. Vis., vol. 131, no. 2, pp. 531-551, Feb. 2023.
[19] H. Mushtaq, X. Deng, M. Ali, B. Hayat and H. H. Raza Sherazi, "DFA-SAT: Dynamic feature abstraction with self-attention-based 3D object detection for autonomous driving", Sustainability, vol. 15, no. 18, Art. no. 13667, Sep. 2023, [online] Available: https://www.mdpi.com/2071-1050/15/18/13667.
[20] Z. Yang, Y. Sun, S. Liu and J. Jia, "3DSSD: Point-based 3D single stage object detector", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 11040-11048, Jun. 2020.
[21] L. Xie et al., "PI-RCNN: An efficient multi-sensor 3D object detector with point-based attentive cont-conv fusion module", Proc. AAAI Conf. Artif. Intell., pp. 12460-12467, 2020.
[22] J. H. Yoo, Y. Kim, J. Kim and J. W. Choi, "3D-CVF: Generating joint camera and LiDAR features using cross-view spatial feature fusion for 3D object detection", Proc. 16th Eur. Conf. Comput. Vis. (ECCV), pp. 720-736, 2020.
[23] C.-H. Hsia, "Improved depth image-based rendering using an adaptive compensation method on an autostereoscopic 3-D display for a Kinect sensor", IEEE Sensors J., vol. 15, no. 2, pp. 994-1002, Feb. 2015.
[24] S. Pang, D. Morris and H. Radha, "CLOCs: Camera-LiDAR object candidates fusion for 3D object detection", Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), pp. 10386-10393, Oct. 2020.
[25] J. Ku, M. Mozifian, J. Lee, A. Harakeh and S. L. Waslander, "Joint 3D proposal generation and object detection from view aggregation", Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), pp. 1-8, Oct. 2018.
[26] C. R. Qi, W. Liu, C. Wu, H. Su and L. J. Guibas, "Frustum PointNets for 3D object detection from RGB-D data", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 918-927, Jun. 2018.
[27] C. Lin, D. Tian, X. Duan, J. Zhou, D. Zhao and D. Cao, "CL3D: Camera-LiDAR 3D object detection with point feature enhancement and point-guided fusion", IEEE Trans. Intell. Transp. Syst., vol. 23, no. 10, pp. 18040-18050, Oct. 2022.
[28] Z. Wang and K. Jia, "Frustum ConvNet: Sliding frustums to aggregate local point-wise features for amodal 3D object detection", Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), pp. 1742-1749, Nov. 2019.
[29] Z. Liu, T. Huang, B. Li, X. Chen, X. Wang and X. Bai, "EPNet++: Cascade bi-directional fusion for multi-modal 3D object detection", IEEE Trans. Pattern Anal. Mach. Intell., 2022.
[30] Y. Sun, Z. Fu, C. Sun, Y. Hu and S. Zhang, "Deep multimodal fusion network for semantic segmentation using remote sensing image and LiDAR data", IEEE Trans. Geosci. Remote Sens., vol. 60, 2022.
