1. Introduction
Video object segmentation (VOS) aims to segment the objects of interest in a video sequence. It is widely used in downstream applications such as video editing and object tracking. The recently introduced DAVIS [24], [25] and YouTube-VOS [36] datasets have significantly advanced this task. However, collecting such densely annotated datasets is expensive and time-consuming. For example, labeling a single object in one frame of the DAVIS dataset takes more than 100 seconds [2]. As a result, existing VOS datasets are either limited in size [24], [25] or only coarsely annotated [36].