I. Introduction
Localizing primary objects [1], [2] in videos is an important task in computer vision because it facilitates many other vision tasks, such as object recognition, retrieval, and action recognition. Following the success of joint processing in images [3]–[5], research interest has recently shifted from single-video object localization to video object co-localization [6], [7], which aims to jointly localize common objects across videos by exploiting the attributes the videos share as a form of weak supervision.