I. Introduction
Video object segmentation plays an important role in computer vision and is a fundamental task for many applications, such as multi-target tracking [1], pose estimation [2]–[4], and action recognition [5]–[8]. Unlike semantic image segmentation [9], [10], video object segmentation extracts a consistent object region throughout the whole video sequence. Obtaining an accurate object prior is critical to video object segmentation. To this end, many methods have been proposed in recent years, such as object ranking based video segmentation [11]–[15], tracking based video segmentation [16]–[20], and iterative correction based interactive video segmentation [21], [22]. Object ranking based video segmentation focuses on designing accurate methods for initial object detection by exploiting appearance and motion consistency. However, its performance suffers from similarity between foreground and background, abrupt motion, motion blur, and non-rigid object deformation. Tracking based video segmentation tracks the object regions through the whole sequence based on the user annotation in the first frame. This method requires only a small amount of manual segmentation at the start of the sequence. However, segmentation errors in the unlabeled frames cannot be corrected and accumulate along the sequence. In iterative correction based interactive video segmentation, the user either provides just a few disjoint scribbles, such as foreground and background strokes [22], similar to interactive Graph Cuts [23], or adjusts landmarks to fit the object boundary [21]. Although this method can repair errors during the segmentation process, it involves many trivial operations [22] because there is no guidance on when and where annotation would benefit the segmentation most.