I. Introduction
Temporal action localization (TAL), which aims to localize the action instances in an untrimmed video [1], has been used in many applications, such as video summarization [2], video retrieval [3], [4], and video surveillance [5]. Unlike fully supervised TAL, which requires time-consuming frame-level annotations, weakly supervised TAL requires only video-level annotations, which makes it much easier to build large-scale datasets. To tackle weakly supervised TAL, many methods employ the multiple instance learning framework [1], in which a labeled bag consists of multiple unlabeled instances (e.g., video snippets). These methods learn the action classification scores of video snippets to form temporal class activation sequences (CAS) [6], [7]. The CAS is then aggregated with a top-k mean strategy to predict the video-level labels. However, the action classifier that produces the CAS still relies on unlabeled snippets and may fail to distinguish highly similar action and background snippets, resulting in incomplete action localization.
To distinguish these unlabeled snippets, some methods use unsupervised clustering to learn class-wise prototypes [8], [9]. Each prototype uses a central feature and a variation to cover a cluster of action or background snippets. However, action snippets contain multiple subactions, so their features have large intraclass variations, and background snippets contain many unconstrained frames, so their features have complex variations. As shown in Fig. 1(b), these complex variations hinder the accurate classification of snippets into action and background categories. The above methods focus on learning the snippet representation from unlabeled snippets but do not address the misclassification caused by the complex variations. Learning the snippet representation from snippets with complex variations therefore remains challenging in weakly supervised TAL.
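To make the top-k mean aggregation of the CAS described above concrete, the following is a minimal NumPy sketch, not the implementation of any cited method. The function name `topk_mean_video_scores`, the parameter `k_ratio`, the rule k = max(1, T // k_ratio), and the final softmax are illustrative assumptions; the text specifies only that the CAS is aggregated with a top-k mean to predict video-level labels.

```python
import numpy as np


def topk_mean_video_scores(cas, k_ratio=8):
    """Aggregate a class activation sequence into video-level scores.

    cas:     array of shape (T, C), one score per snippet and class.
    k_ratio: hypothetical hyperparameter; k = max(1, T // k_ratio)
             snippets are averaged per class (an assumption).
    """
    cas = np.asarray(cas, dtype=float)
    num_snippets = cas.shape[0]
    k = max(1, num_snippets // k_ratio)

    # Sort each class column in ascending order over time and keep
    # the k largest snippet scores of every class.
    topk = np.sort(cas, axis=0)[-k:, :]

    # The mean of the top-k scores is the video-level logit per class.
    video_logits = topk.mean(axis=0)

    # Numerically stable softmax over classes (assumed normalization).
    shifted = np.exp(video_logits - video_logits.max())
    video_probs = shifted / shifted.sum()
    return video_logits, video_probs


# Toy CAS with T = 4 snippets and C = 2 classes.
toy_cas = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
logits, probs = topk_mean_video_scores(toy_cas, k_ratio=2)
```

Because only the k highest-scoring snippets of each class contribute to the video-level prediction, the video-level loss supervises just those snippets. This is one way to see why the classifier can leave ambiguous action and background snippets poorly separated.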
Fig. 1. Methodological pipeline of our ensemble prototype learning: (a) action snippets and background snippets with large variations, (b) clustered prototype learning, (c) CPL, and (d) ESWL. In this work, we ensemble the consensus prototypes from multiple stages for weakly supervised TAL.