1. Introduction
In recent years, LiDAR-based 3D detection [57], [28], [46], [45], [61], [60] has achieved increasing performance and stability in autonomous driving and robotics. However, the high cost of LiDAR sensors has limited its applications in low-cost products. Stereo matching [3], [18], [26], [35] is the most common depth sensing technique using only cameras. Compared with LiDAR sensors, stereo cameras are at a much lower cost and higher resolutions, which makes it a suitable alternative solution for 3D perception. 3D detection from stereo images aims at detecting objects using estimated depth maps [39], [52], [64] or implicit 3D geometry representations [30], [10], [53]. However, the performance of existing stereo-based 3D detection algorithms is still inferior compared with LiDAR-based algorithms.