I. Introduction
Machine perception based on deep neural networks (DNNs) is widely deployed in autonomous systems such as autonomous vehicles (AVs) and unmanned aerial vehicles (UAVs). These systems are typically equipped with multiple sensory modalities (e.g., RGB cameras and LiDAR) and rely on a 3D object detector that integrates information from all modalities and views to localize and classify surrounding objects [1]–[3]. It is imperative to achieve accurate environment perception while meeting real-time requirements, and the trade-off between the two becomes more complex in dynamically evolving physical environments.