1 Introduction
Motion plays a key role in many state-of-the-art methods for Video Object Segmentation (VOS) [1], [2], [3], [4], as motion estimates such as optical flow [5] or pixel trajectories [6] reveal the pixel-wise correspondence between frames and enable the propagation of instance labels. Moreover, the rich spatio-temporal structure in motion provides information that is beneficial for segmenting moving objects. However, motion estimation itself remains a difficult task, as it suffers from challenges such as noise, blurring, deformation, and occlusion.

Different from previous methods that rely mainly on motion, recent attempts based on deep CNNs [7], [8], [9] tackle VOS through appearance learning. Building on their powerful learning capacity and large amounts of training data, deep CNNs have achieved strong performance in still-image segmentation [10]. For VOS, however, annotated training data is scarce, and treating frames as still images discards the information hidden in motion. It has been shown in [7], [8] that, after fine-tuning on the first frame, deep CNNs can “recognize” the target object by its similar appearance in subsequent frames. However, relying solely on “memorizing” the appearance of the target object in the first frame suffers from several limitations: the object's appearance may change over time, and objects in the background may appear similar to the target. Although online adaptation [7] is robust to temporal variations across video frames, repeatedly fine-tuning the model at every time step is time-consuming and harms efficiency.
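To make the role of motion concrete, the sketch below illustrates (in a minimal, illustrative form that is not any cited method) how a dense motion estimate can propagate an instance mask from one frame to the next. It assumes a backward optical flow field from frame t+1 to frame t is available from some off-the-shelf estimator; the function and array names are assumptions introduced here for illustration.

```python
import numpy as np


def propagate_mask(mask_t: np.ndarray, flow_t1_to_t: np.ndarray) -> np.ndarray:
    """Warp a binary instance mask from frame t to frame t+1 via backward flow.

    mask_t       : (H, W) binary mask for frame t.
    flow_t1_to_t : (H, W, 2) backward flow; flow_t1_to_t[y, x] = (dx, dy) points
                   from pixel (x, y) in frame t+1 to its correspondence in frame t.
    """
    h, w = mask_t.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Follow the backward flow to find where each pixel of frame t+1 came from.
    src_x = xs + flow_t1_to_t[..., 0]
    src_y = ys + flow_t1_to_t[..., 1]
    # Nearest-neighbour sampling keeps labels discrete; clamp to the image bounds.
    src_x = np.clip(np.round(src_x), 0, w - 1).astype(int)
    src_y = np.clip(np.round(src_y), 0, h - 1).astype(int)
    return mask_t[src_y, src_x]
```

Even this simple propagation inherits the weaknesses of the underlying flow: noise, blur, large deformations, and occlusions all corrupt the correspondence and hence the propagated labels, which is precisely the motivation for complementing motion with appearance learning.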