I. Introduction
With the prevalence of online social video sharing, a considerable number of videos are created and processed every day. Many of these videos contain a primary object on which we want to focus our attention, e.g., a child or a pet in a homemade personal video. We define the primary object in a video sequence as the object that appears saliently in most of the frames; some examples are shown in Fig. 1. In this paper, we address the problem of automatically discovering the primary objects in videos, which is an essential step for many applications such as advertisement design [36] and video summarization [20], [28], [44]. Traditional video object detection and localization methods, however, are either too category specific (e.g., face detection [47] and pedestrian detection [13]) or rely heavily on manual initialization (e.g., object tracking [19] and interactive object segmentation [18]). They are suitable for targeted object detection tailored to users’ interests, but are too limited for many multimedia applications that require automatically processing large volumes of video data with diverse content. Throughout this paper, we use the terms foreground object and foreground interchangeably with the term primary object.
Fig. 1. Examples of primary object discovery. Each row corresponds to one video, and the red rectangle highlights the primary object.