I. Introduction
Monocular 3D object detection is the task of estimating three-dimensional information solely from a single 2D image, offering extensive applications in real-world scenarios such as autonomous driving and robotics. Owing to its low cost and readiness for mass production, it has attracted increasing attention from researchers in both academia and industry. However, existing research has mainly focused on ego-vehicle applications [5], [32], [34], [37], [38], [54], [55], [56], [57], where the camera is mounted close to the ground and obstacles are easily occluded by other vehicles. This greatly limits ego-vehicle perception capabilities and can lead to potential safety hazards in autonomous driving. To address this occlusion problem, researchers have begun studying roadside perception systems that use higher-mounted intelligent sensors, such as cameras; these systems expand the perception range and, through cooperative techniques [11], [16], [36], [48], [50], increase the reaction time available to autonomous vehicles in dangerous situations, thereby improving safety. To promote future research, several large-scale roadside datasets [9], [13], [46], [47], [49], containing images collected from roadside views together with corresponding 3D annotations, have been released, providing an important basis for training and evaluating roadside monocular 3D object detection methods.