1. Introduction
Temporal action localization (TAL) is an important yet challenging task for video understanding. Its goal is to localize temporal boundaries of actions with specific categories in untrimmed videos [13], [7]. Because of its broad applications in high-level tasks such as video surveillance [40], video summarization [17], and event detection [15], TAL has recently drawn increasing attentions from the community. Up to now, deep learning based methods have made impressive progresses in this area. However, most of them handle this task in a fully supervised way, requiring massive temporal boundary annotations for actions [24], [51], [5], [42], [36]. Such manual annotations are expensive to obtain, which limits the development potential of fully-supervised methods in real-world scenarios.