I. Introduction
Video object segmentation (VOS) is a fundamental task in video processing and understanding that aims to identify and segment specific objects within a video. It has wide-ranging applications, including video editing [1], [2], [3], [4], [5], [6], [7], [8], robotic mapping and navigation [9], [10], [11], [12], [13], and autonomous driving [14], [15], [16], [17], [18], [19]. In autonomous driving, VOS improves the accuracy and efficiency of object recognition, enhances perception, aids semantic understanding, and supports communication; it therefore plays a crucial role in enabling autonomous vehicles to navigate safely and efficiently in intelligent transportation systems. A survey [20] offers a comprehensive overview of the substantial advances in video segmentation, covering the multiple task settings, background concepts, perceived need, development history, and main challenges. We focus on the semi-supervised setting, in which the target masks are provided in the first frame of a video and the goal is to segment the targets in the subsequent frames. Semi-supervised VOS remains challenging due to target deformation, occlusion, and appearance variation.