I. Introduction
Real-time visual object tracking is a key capability in many robotic perception systems [1]–[6]. Recently, deep regression trackers [7]–[9] (DRTs) have been proposed in the robotics community [7] because of their efficiency and generality. Thanks to their simple architecture, DRTs achieve processing speeds exceeding 100 FPS, making them suitable even for low-resource robots. Moreover, with the availability of large-scale computer vision datasets [10], these trackers can learn to track a wide variety of targets without relying on particular assumptions, thus simplifying the development of tracking pipelines.

However, acquiring thousands of videos to train such systems is not realistic in many real-world robotic application domains. Additionally, many domains present peculiar scenarios that differ substantially from the examples on which DRTs are trained. For example, drone [11] and driving [3], [12] applications require tracking objects from unusual camera viewpoints. Underwater robots encounter uncommon targets and settings [4], [13]. Other robotic systems rely on different imaging modalities [2]. Robotic manipulation setups require the tracking of atypical objects [14]. As shown in Fig. 1, these situations cause DRTs’ accuracy to drop drastically. This is because their deep learning architectures overfit when trained directly on small application-specific datasets, and suffer from the shift between training and test data distributions when trained for large-scale generic object tracking.