1. Introduction
Video Object Segmentation (VOS) aims to automatically generate accurate pixel masks for the objects in each frame of a video and to associate those masks across successive frames to obtain temporally consistent tracks. VOS has mostly been tackled in a semi-supervised fashion [19], [32], [30], where the masks of the objects to be tracked are given in the first frame, and only those objects need to be tracked and segmented throughout the rest of the video.