I. Introduction
Multiple object tracking (MOT) is the task of assigning a unique, consistent ID to each observed object throughout a video sequence. It is important in domains such as motion planning, safe robot navigation, and autonomous driving [1]. The primary challenge in MOT is to establish accurate associations between tracklets from preceding frames and the object detections in the current frame. Two mainstream paradigms have emerged to address this challenge: tracking-by-detection [2], [3] and joint-tracking-and-detection [4], [5], [6]. Tracking-by-detection is a two-stage process: a pre-trained detector first produces object detections, and a tracker then performs data association, assigning each detected object an ID that persists across successive frames. Joint-tracking-and-detection, in contrast, performs detection and tracking concurrently and benefits from joint optimization. This paper focuses on the tracking stage, so we adopt the tracking-by-detection paradigm for its efficiency and proven effectiveness.
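To make the second stage of tracking-by-detection concrete, the following is a minimal illustrative sketch of data association, not the method proposed in this paper. It assumes that detections arrive as axis-aligned boxes (x1, y1, x2, y2) from some external detector and uses a simple greedy matching on intersection-over-union (IoU). The function names `iou` and `associate`, the threshold of 0.3, and the policy of dropping unmatched tracks are all assumptions made for illustration.

```python
from itertools import count


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, next_id, iou_threshold=0.3):
    """Greedily match existing tracks {id: box} to the current detections.

    Returns {id: box} for the current frame. Matched detections inherit the
    track ID, unmatched detections receive a fresh ID from `next_id`, and
    unmatched tracks are dropped (a simplification assumed for this sketch).
    """
    # Score every (track, detection) pair and visit the best overlaps first.
    pairs = sorted(
        (
            (iou(box, det), tid, j)
            for tid, box in tracks.items()
            for j, det in enumerate(detections)
        ),
        reverse=True,
    )
    updated, used_tracks, used_dets = {}, set(), set()
    for score, tid, j in pairs:
        if score < iou_threshold:
            break  # all remaining pairs overlap too little to be matched
        if tid in used_tracks or j in used_dets:
            continue
        updated[tid] = detections[j]
        used_tracks.add(tid)
        used_dets.add(j)
    # Any detection left unmatched starts a new track.
    for j, det in enumerate(detections):
        if j not in used_dets:
            updated[next(next_id)] = det
    return updated


# Hypothetical detector output for two consecutive frames.
ids = count(1)
frame1 = associate({}, [(0, 0, 10, 10), (50, 50, 60, 60)], ids)
frame2 = associate(frame1, [(51, 50, 61, 60), (1, 0, 11, 10)], ids)
```

In this toy example, both objects move by one pixel and appear in a different order in the second frame, yet each keeps the ID it received in the first frame. Practical trackers commonly replace the greedy step with an optimal assignment and add motion or appearance cues, which this sketch omits.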