
Channelwise and Spatially Guided Multimodal Feature Fusion Network for 3-D Object Detection in Autonomous Vehicles



Abstract:

Accurate 3-D object detection is vital in autonomous driving. Traditional LiDAR models struggle with sparse point clouds. We propose a novel approach integrating LiDAR and camera data to maximize sensor strengths while overcoming individual limitations for enhanced 3-D object detection. Our research introduces the channelwise and spatially guided multimodal feature fusion network (CSMNET) for 3-D object detection. First, our method enhances LiDAR data by projecting it onto a 2-D plane, enabling the extraction of class-specific features from a probability map. Second, we design class-based farthest point sampling (C-FPS), which boosts the selection of foreground points by utilizing point weights based on geometric or probability features while ensuring diversity among the selected points. Third, we develop a parallel attention (PAT)-based multimodal fusion mechanism that achieves higher resolution than raw LiDAR points. This fusion mechanism integrates two attention mechanisms: channel attention for LiDAR data and spatial attention for camera data. These mechanisms enhance the utilization of semantic features in a region of interest (ROI) to obtain more representative point features, leading to a more effective fusion of information from both LiDAR and camera sources. Specifically, CSMNET achieves an average precision (AP) in bird’s eye view (BEV) detection of 90.16% (easy), 85.18% (moderate), and 80.51% (hard), with a mean AP (mAP) of 85.12%. In 3-D detection, CSMNET attains 82.05% (easy), 72.64% (moderate), and 67.10% (hard), with an mAP of 73.75%. For 2-D detection, the scores are 95.47% (easy), 93.25% (moderate), and 86.68% (hard), yielding an mAP of 91.72% on the KITTI dataset.
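As an illustration of the class-based farthest point sampling step, the sketch below biases the standard farthest-point criterion with per-point foreground weights. It is a minimal reading of the idea under assumed inputs (a point array and a weight vector, e.g., foreground probabilities read from the projected 2-D probability map); the function name and the exact weighting rule are illustrative, not the paper's formulation.

import numpy as np

def class_based_fps(points, weights, num_samples):
    # Minimal sketch of class-based farthest point sampling (C-FPS).
    # points  : (N, 3) LiDAR point coordinates.
    # weights : (N,) per-point foreground scores (assumed input), e.g.
    #           probabilities taken from the 2-D class probability map.
    # The usual farthest-point distance term is scaled by the weight, so
    # likely-foreground points are favored while the distance term still
    # enforces diversity among the selected points.
    n = points.shape[0]
    selected = np.empty(num_samples, dtype=np.int64)
    min_dist = np.full(n, np.inf)
    selected[0] = int(np.argmax(weights))          # start from the strongest foreground point
    for i in range(1, num_samples):
        diff = points - points[selected[i - 1]]
        dist = np.sum(diff * diff, axis=1)         # squared distance to the latest pick
        min_dist = np.minimum(min_dist, dist)      # distance to the nearest selected point
        selected[i] = int(np.argmax(min_dist * weights))
    return selected

Similarly, the parallel attention (PAT) fusion can be pictured as a channel-attention branch applied to LiDAR features in parallel with a spatial-attention branch applied to camera features, followed by a fusion step. The PyTorch sketch below uses squeeze-and-excitation style channel attention, CBAM-style spatial attention, and fusion by concatenation plus a 1x1 convolution; the module names, reduction ratio, kernel size, and fusion operator are assumptions for illustration, not the architecture reported in the paper.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Squeeze-and-excitation style channel attention for LiDAR features.
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (B, C, H, W)
        return x * self.mlp(x)             # channelwise re-weighting

class SpatialAttention(nn.Module):
    # CBAM-style spatial attention for camera features.
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                  # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        mask = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * mask                    # spatial re-weighting

class ParallelAttentionFusion(nn.Module):
    # Fuses LiDAR and camera feature maps assumed to share the same H x W grid.
    def __init__(self, lidar_ch, cam_ch, out_ch):
        super().__init__()
        self.ca = ChannelAttention(lidar_ch)
        self.sa = SpatialAttention()
        self.fuse = nn.Conv2d(lidar_ch + cam_ch, out_ch, 1)

    def forward(self, lidar_feat, cam_feat):
        l = self.ca(lidar_feat)                     # channel attention on the LiDAR branch
        c = self.sa(cam_feat)                       # spatial attention on the camera branch
        return self.fuse(torch.cat([l, c], dim=1))  # concatenate and project

# Usage with hypothetical shapes: camera features already projected to the LiDAR grid.
fusion = ParallelAttentionFusion(lidar_ch=64, cam_ch=64, out_ch=128)
out = fusion(torch.rand(2, 64, 100, 100), torch.rand(2, 64, 100, 100))   # -> (2, 128, 100, 100)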
Article Sequence Number: 5707515
Date of Publication: 08 October 2024



I. Introduction

Accurate 3-D object detection is of utmost importance in the domain of autonomous vehicles (AVs), where it underpins the understanding of object dimensions and positions in real-world scenarios [1], [2], [3]. Recent research focuses on harnessing LiDAR and camera data for this purpose, capitalizing on LiDAR’s point cloud-based 3-D data and cameras’ high-resolution RGB images [4]. Despite their importance, efficiently extracting and fusing features from these sources poses challenges. While deep learning-based feature extraction, especially for RGB images, is prevalent, dealing with the irregular distribution and sparsity of point clouds is complex [5]. Existing methods transform point clouds into either voxel grids or dense 2-D images so that 2-D neural networks can be applied [6], [7], [8], [9], [10], [11]. Recent advancements include the direct use of multilayer perceptrons (MLPs) for feature aggregation from point clouds and the exploration of graph-based representations that treat points as vertices for feature extraction [12], [13].
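To make the MLP-based aggregation mentioned above concrete, the following minimal PointNet-style sketch applies a shared MLP to every point and aggregates the set with a symmetric max-pool; the layer widths and names are illustrative choices rather than those of any particular cited method.

import torch
import torch.nn as nn

class PointMLPEncoder(nn.Module):
    # Shared MLP applied to each point, followed by a symmetric max-pool
    # that yields a permutation-invariant feature for the whole point set.
    def __init__(self, in_dim=3, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 32),
            nn.ReLU(inplace=True),
            nn.Linear(32, feat_dim),
        )

    def forward(self, points):               # points: (B, N, in_dim)
        per_point = self.mlp(points)          # (B, N, feat_dim)
        return per_point.max(dim=1).values    # (B, feat_dim), order-invariant

# Usage: encode a batch of 2 clouds with 1024 points each.
encoder = PointMLPEncoder()
global_feat = encoder(torch.rand(2, 1024, 3))  # -> torch.Size([2, 64])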
