1 Introduction
Natural language video localization (NLVL) is a prominent yet challenging problem in vision-language understanding. Given an untrimmed video, NLVL aims to retrieve the temporal moment that semantically corresponds to a given language query. As illustrated in Fig. 1, NLVL involves both computer vision and natural language processing techniques [1], [2], [3], [4], [5], [6]. Cross-modal reasoning is essential for NLVL to correctly locate the target moment in a video. Prior studies primarily treat NLVL as a ranking task and apply multimodal matching architectures to find the video segment that best matches a query [7], [8], [9], [10], [11]. Some works [11], [12], [13], [14] assign multi-scale temporal anchors to frames and select the anchor with the highest confidence as the result. Recently, several methods have explored modeling cross-modal interactions between video and query and directly regressing the temporal location of the target moment [15], [16], [17]. Other studies formulate NLVL as a sequential decision-making problem and solve it with reinforcement learning [18], [19], [20].
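To make the anchor-based formulation mentioned above concrete, the following is a minimal illustrative sketch rather than any cited method's implementation: it enumerates multi-scale temporal anchors over frame indices and selects the highest-scoring segment. The function names and the random placeholder scores are assumptions; real systems predict anchor confidences with a learned cross-modal matching model over video and query features.

```python
# Illustrative sketch of anchor-based NLVL (hypothetical names, not a cited method).
from typing import List, Tuple
import random

def generate_anchors(num_frames: int, scales: List[int]) -> List[Tuple[int, int]]:
    """Enumerate candidate segments: at each frame, one anchor per temporal scale."""
    anchors = []
    for end in range(num_frames):
        for scale in scales:
            start = end - scale + 1
            if start >= 0:
                anchors.append((start, end))
    return anchors

def localize(num_frames: int, scales: List[int]) -> Tuple[int, int]:
    """Return the anchor with the highest confidence score."""
    anchors = generate_anchors(num_frames, scales)
    # Placeholder scores; a real model computes these from fused video/query features.
    scores = [random.random() for _ in anchors]
    best = max(range(len(anchors)), key=lambda i: scores[i])
    return anchors[best]

if __name__ == "__main__":
    print(localize(num_frames=128, scales=[8, 16, 32, 64]))
```

In this scheme the localization quality is bounded by the preset anchor scales, which is one motivation for the regression-based and reinforcement-learning formulations cited above.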