I. Introduction
Video content analysis has attracted increasing attention from both academia and industry, stimulating research on and applications of novel video understanding tasks such as video retrieval [1], [2] and video question answering [3], [4]. As a classic cross-modal information retrieval task, video retrieval returns the videos that are semantically most relevant to a textual sentence query from a dataset of trimmed videos. However, videos often contain redundant and irrelevant content; that is, only a small fraction of the clips in a video are semantically relevant to the query [5], [6]. For example, in a long untrimmed surveillance video, only a few short key clips are of interest, and localizing them manually can require hours of browsing through the entire video. This process is inefficient and labor-intensive [7].