I. INTRODUCTION
Accurate 3D perception of the surrounding environment is fundamental to autonomous driving. Since the rise of deep learning, object detection has been one of the central problems in computer vision. Following remarkable progress in image-based 2D object detection, researchers have gradually shifted their attention to the more challenging task of 3D object detection. Whereas 2D object detection only requires localizing objects in the image plane, 3D object detection must predict objects' accurate coordinates in 3D space. Mainstream solutions to this task currently fall into three categories. First, at the hardware level, sensors such as LiDAR are used to directly acquire the 3D spatial structure of the scene. Second, at the algorithm level, multi-view geometry or deep neural networks are employed to estimate depth from 2D images, from which the 3D position and geometry of objects are then inferred. Third, combining the strengths of both hardware and algorithms, 3D point clouds from LiDAR are fused with 2D images from RGB cameras; this approach exploits mature 2D image feature extraction techniques together with the reliable 3D coordinates provided by point clouds to achieve more accurate object detection and classification.
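As a minimal sketch of the geometry underlying the third (fusion) category, a LiDAR point can be associated with image pixels through the standard pinhole projection; the symbols below (intrinsic matrix $K$, extrinsic rotation $R$, and translation $\mathbf{t}$) are standard calibration notation assumed for illustration rather than quantities defined in this paper:

\[
d \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
= K \left( R \, \mathbf{P}_{\mathrm{LiDAR}} + \mathbf{t} \right),
\qquad
K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix},
\]

where $\mathbf{P}_{\mathrm{LiDAR}} \in \mathbb{R}^3$ is a point in the LiDAR frame, $(u, v)$ its pixel coordinates, and $d$ its depth in the camera frame. Projecting points in this way allows 2D image features to be sampled at the locations of 3D points, which is the basic mechanism that fusion methods exploit.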