I. Introduction
Visual object tracking aims to localize a specified object throughout a video sequence, given its initial state in the first frame. With continuous innovation in feature extraction backbones [1], [2], [3] and in relation modeling between template and search regions [4], [5], [6], [7], [8], [9], together with efforts to reduce annotation costs [10], [11], the performance of tracking algorithms has reached new heights. Meanwhile, long-term tracking has gradually attracted more attention. Compared with short-term tracking, which has received the most attention so far, long-term tracking involves longer videos in which the target may disappear and reappear more frequently. Long sequences bring more complicated challenges and more pronounced accumulation of tracking errors, while target disappearances and reappearances require trackers to have a re-detection capability. In short-term tracking, by contrast, short sequences in which the target is always visible are dominant. Therefore, searching in a local region around the target location in the previous frame can handle most scenarios.