I. Introduction
Video understanding is a challenging task in computer vision. Owing to its wide range of applications, many related tasks have been extensively studied, such as video captioning [1], action localization [46], and video question answering [47]. Temporal language grounding (TLG) [2], [3], which aims to automatically locate the temporal boundaries of the segments described by a natural language query in an untrimmed video, has recently attracted increasing attention from both the computer vision and natural language processing communities because of its potential applications in video search engines and automated video editing.