1 Introduction
Detecting or localizing activities in videos [1], [2], [3], [4], [5], [6], [7], [8], [9] is a prominent and fundamental problem in video understanding. Since videos often contain intricate activities that cannot be captured by a predefined list of action classes, a new task, namely temporal sentence grounding in videos (TSG) [10], [11], has recently attracted much research attention [12], [13], [14], [15], [16], [17], [18]. Formally, given an untrimmed video and a natural-language sentence query, the TSG task aims to identify the start and end timestamps of a specific video segment that contains the activities of interest semantically corresponding to the given query.