I. Introduction
Accurate 3-D object detection is essential for autonomous vehicles (AVs), which must infer the dimensions and positions of objects in real-world scenes [1], [2], [3]. Recent research combines LiDAR and camera data for this task, exploiting the precise 3-D geometry of LiDAR point clouds and the high-resolution RGB appearance captured by cameras [4]. Efficiently extracting and fusing features from these two modalities, however, remains challenging. Deep learning-based feature extraction is well established for RGB images, but the irregular distribution and sparsity of point clouds make them considerably harder to process [5]. Existing methods therefore transform point clouds into voxel grids or dense 2-D images so that standard convolutional networks can be applied [6], [7], [8], [9], [10], [11]. More recent approaches apply multilayer perceptrons (MLPs) directly to raw points to aggregate features, or represent the point cloud as a graph whose vertices are the points themselves [12], [13].
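To make the two point-cloud processing strategies mentioned above concrete, the sketch below contrasts a voxel-grid conversion with a PointNet-style shared MLP. It is an illustrative toy in PyTorch, not the pipeline of any cited method; the shapes and hyperparameters (voxel_size, grid_range, feat_dim) are arbitrary assumptions chosen for the example.

```python
# Minimal sketch: two common ways to make an irregular point cloud
# digestible by a neural network. Names and values are illustrative.
import torch
import torch.nn as nn


def voxelize(points: torch.Tensor, voxel_size: float, grid_range: float) -> torch.Tensor:
    """Scatter an (N, 3) point cloud into a dense binary occupancy grid.

    Points outside [-grid_range, grid_range) are dropped. The result is a
    (D, D, D) float tensor that a standard 3-D CNN could consume.
    """
    D = int(2 * grid_range / voxel_size)
    idx = ((points + grid_range) / voxel_size).floor().long()  # (N, 3) voxel indices
    valid = ((idx >= 0) & (idx < D)).all(dim=1)                # keep in-range points
    idx = idx[valid]
    grid = torch.zeros(D, D, D)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0                # mark occupied voxels
    return grid


class SharedMLP(nn.Module):
    """PointNet-style aggregation: a per-point MLP followed by a
    symmetric max pooling, so the output is invariant to point order."""

    def __init__(self, in_dim: int = 3, feat_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (N, 3) -> per-point features (N, feat_dim) -> global (feat_dim,)
        return self.mlp(points).max(dim=0).values


if __name__ == "__main__":
    pts = torch.randn(2048, 3) * 10.0  # toy stand-in for a LiDAR sweep
    grid = voxelize(pts, voxel_size=0.5, grid_range=20.0)
    feat = SharedMLP()(pts)
    print(grid.shape, int(grid.sum().item()), feat.shape)
```

A graph-based variant of the second strategy would instead connect each point to its k nearest neighbors and aggregate features over edges; the same symmetric-pooling idea then applies per neighborhood rather than globally.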