I. INTRODUCTION
3D computer vision models (e.g., object detectors) help robotic and control systems perceive and understand the environment from 3D data (e.g., point clouds), which provide more accurate geometric and spatial information and are robust to illumination changes and domain shifts. Since point clouds lack the grid-like structure of images, previous works have proposed various neural network architectures for point cloud understanding [1]–[13]. With the success of attention-based architectures (i.e., transformers) in other learning regimes [14]–[16], they have recently been applied to point clouds [17]–[23]. Several properties of transformers make them well suited to modeling point clouds. For example, their permutation-invariant property is necessary for modeling unordered sets like point clouds, and their attention mechanism helps learn long-range relationships and capture global context.
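The permutation property mentioned above can be checked with a small NumPy sketch (the toy dimensions, random weights, and max pooling below are illustrative assumptions, not part of any cited architecture): plain dot-product self-attention without positional encodings is permutation-equivariant per point, so a symmetric pooling on top yields a permutation-invariant set-level feature.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(X, Wq, Wk, Wv):
    # Single-head dot-product self-attention, no positional encodings.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)  # row-wise softmax
    return A @ V

N, d = 8, 4                                      # toy setting: 8 points, 4-dim features
X = rng.normal(size=(N, d))                      # stand-in for point features
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(N)                        # reorder the unordered point set
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Per-point outputs are permuted the same way as the input (equivariance)...
assert np.allclose(out[perm], out_perm)
# ...so a symmetric pooling gives the same set feature regardless of order (invariance).
assert np.allclose(out.max(axis=0), out_perm.max(axis=0))
```

This is why, unlike convolutions over an image grid, attention needs no canonical ordering of the input points.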