1. Introduction
Recent advances in salient object segmentation (SOS) in videos using CNNs [22], [27], [29], [50] have demonstrated impressive accuracy. Such SOS methods [22], [27], [29], [50] focus only on localizing the region of interest by labeling each pixel in a video frame as “salient” or “non-salient”. The localized salient region, however, may contain multiple (interacting) objects (Fig. 1a), a common situation in real-world scenes. For a better understanding of videos, localized salient regions should therefore be decomposed into conceptually meaningful components, called salient instances [26] (Fig. 1b). Furthermore, attaching a semantic label to each salient instance (Fig. 1c) would widen the range of applications of SOS even further, to areas such as autonomous driving [54] and robotic interaction [52]. Nevertheless, segmenting semantic salient instances has not yet been addressed in the literature.
Fig. 1. Segmentation levels of salient objects. The input video frame is followed by different levels of label annotation. Our work focuses on segmenting semantic salient instances (rightmost).
Fig. 2. Examples obtained by our method on the SESIV dataset. From left to right: the original video frame, the instance labels, and the semantic labels. The first and second rows show ground-truth labels and segmented results, respectively.