I. Introduction
Obstacle detection and tracking is a safety-critical task across various domains, including autonomous robot navigation [1]–[5] and self-driving vehicles [6]–[9]. For instance, a service robot must detect the people and pillars around it, track their motions (if any), and even predict their future trajectories to avoid collisions. Accurate obstacle detection and tracking are crucial components of autonomous navigation systems, particularly in state-based frameworks, for ensuring collision-free navigation [10]–[13]. Recent research efforts focus on using low-cost visual sensors for obstacle perception [10], [14]–[17], which improve affordability compared with expensive sensors such as LiDAR. In this paper, we concentrate on a specific line of research employing stereo cameras, which offer higher 3D perception accuracy, an extended sensing range, and enhanced agility for robots compared with monocular-based systems [18], [19].