1. Introduction
Perceiving the physical world in 3D space is critical for reliable autonomous driving systems [2], [54]. As the sensor suites of self-driving vehicles grow more diverse and capable, it becomes essential to integrate the complementary signals captured by different sensors (e.g., cameras, LiDAR, and radar) in a unified manner. To achieve this goal, we propose UniTR, a unified yet efficient multi-modal transformer backbone that processes both 3D sparse point clouds and 2D multi-view dense images in parallel, learning unified bird's-eye-view (BEV) representations that boost 3D outdoor perception.
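To make the idea of a single backbone consuming both modalities concrete, the following is a minimal, illustrative sketch (not the authors' implementation): voxelized LiDAR features and flattened image patches are projected into a shared token space, processed by one weight-shared transformer, and scattered onto a common BEV grid. All module names, arguments, and the patch-to-BEV index inputs are hypothetical assumptions for illustration only.

```python
import torch
import torch.nn as nn


class UnifiedMultiModalBackbone(nn.Module):
    """Hypothetical sketch of a unified multi-modal backbone interface."""

    def __init__(self, lidar_feat_dim=64, img_patch_dim=3 * 16 * 16,
                 embed_dim=128, num_layers=4, bev_size=(128, 128)):
        super().__init__()
        self.bev_size = bev_size
        # Modality-specific input projections into a shared token space.
        self.lidar_embed = nn.Linear(lidar_feat_dim, embed_dim)
        self.img_embed = nn.Linear(img_patch_dim, embed_dim)
        # One transformer encoder shared by both modalities (the "unified" part).
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, lidar_tokens, lidar_bev_idx, img_tokens, img_bev_idx):
        """
        lidar_tokens: (B, N_l, lidar_feat_dim) sparse voxel features
        lidar_bev_idx: (B, N_l) long tensor, flat BEV cell index of each voxel
        img_tokens:   (B, N_i, img_patch_dim) flattened multi-view image patches
        img_bev_idx:  (B, N_i) long tensor, flat BEV cell each patch projects to
                      (assumed precomputed from camera calibration)
        returns: (B, embed_dim, H, W) dense BEV feature map
        """
        # Embed both modalities and concatenate them into one token sequence.
        tokens = torch.cat([self.lidar_embed(lidar_tokens),
                            self.img_embed(img_tokens)], dim=1)
        bev_idx = torch.cat([lidar_bev_idx, img_bev_idx], dim=1)

        # Joint intra- and inter-modal attention over all tokens in parallel.
        tokens = self.encoder(tokens)

        # Scatter tokens into a shared BEV grid (mean over tokens per cell).
        B, N, C = tokens.shape
        H, W = self.bev_size
        bev = tokens.new_zeros(B, H * W, C)
        count = tokens.new_zeros(B, H * W, 1)
        bev.scatter_add_(1, bev_idx.unsqueeze(-1).expand(-1, -1, C), tokens)
        count.scatter_add_(1, bev_idx.unsqueeze(-1),
                           torch.ones_like(tokens[..., :1]))
        bev = bev / count.clamp(min=1)
        return bev.transpose(1, 2).reshape(B, C, H, W)
```

This sketch simplifies the 2D-to-BEV mapping to a precomputed per-patch index; it is meant only to convey the interface of a backbone that fuses sparse point-cloud tokens and dense image tokens within one shared set of transformer blocks before producing a unified BEV feature map.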