I. Introduction
Single object tracking (SOT) is a key task in computer vision with wide downstream applications in both outdoor and indoor scenarios, including autonomous driving [1], [2], robot vision [3], [4], [5], [6], and intelligent transportation systems [7]. For example, an autonomous pedestrian-following robot must accurately track the person it follows to enable efficient crowd-following control. Another example is autonomous landing of unmanned aerial vehicles, in which the drone must track the landing target and estimate its exact distance and pose in order to land safely [8]. In indoor environments, tracking methods [5], [6], [9] can provide the six-degrees-of-freedom (6DoF) pose of an object for robust robotic manipulation.

Given an initial bounding box of a template object in the first frame of an image sequence or LiDAR scan, the aim of SOT is to localize that object in every subsequent frame, thereby recovering its trajectory. In the past decade, a variety of image-based trackers (e.g., Siamese neural networks [10]) have achieved promising performance in 2D tracking. However, image-based methods often degrade in adverse conditions, e.g., under drastic lighting changes [11], [12]. As a possible remedy, 3D point clouds collected by LiDAR provide detailed depth and geometric information that is inherently invariant to lighting changes [13], making LiDAR-based trackers more robust across frames captured under different illumination conditions.