I. Introduction
With the rapid advancement of CNN-based [1] and attention-based [2] models, neural networks have achieved significant performance gains in various vision tasks, such as object detection [3], [4], [5], captioning [6], [7], [8], and semantic segmentation [9], [10], [11], [12], [13]. Among vision tasks, video object segmentation (VOS), which aims to segment objects in video frames, has attracted increasing attention in recent years. VOS consists of two subtasks: (1) unsupervised VOS (UVOS), which segments salient objects without any given clues, and (2) semi-supervised VOS (SVOS), which uses the first-frame mask to segment specific entities. However, existing VOS models rely on large amounts of training data and can hardly recognize unseen classes. Consequently, few-shot video object segmentation (FSVOS), which uses support images to help discover objects of unseen classes in query videos, has been proposed to address these issues.