1 Introduction
Natural language video localization (NLVL) is a prominent yet challenging problem in vision-language understanding. Given an untrimmed video, NLVL aims to retrieve the temporal moment that semantically corresponds to a given language query. As illustrated in Fig. 1, NLVL involves both computer vision and natural language processing techniques [1], [2], [3], [4], [5], [6]. Cross-modal reasoning is essential for NLVL to correctly locate the target moment in a video. Prior studies primarily treat NLVL as a ranking task and apply multimodal matching architectures to find the video segment that best matches a query [7], [8], [9], [10], [11]. Some works [11], [12], [13], [14] assign multi-scale temporal anchors to frames and select the anchor with the highest confidence as the result. Recently, several methods have explored modeling cross-modal interactions between video and query and directly regressing the temporal location of the target moment [15], [16], [17]. Other studies formulate NLVL as a sequential decision-making problem and solve it with reinforcement learning [18], [19], [20].
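To make the anchor-based formulation mentioned above concrete, the following is a minimal illustrative sketch rather than any cited method's implementation: it enumerates multi-scale temporal anchors over frame indices and selects the highest-scoring segment. The function names and the random placeholder scores are assumptions; real systems predict anchor confidences with a learned cross-modal matching model over video and query features.

```python
# Illustrative sketch of anchor-based NLVL (hypothetical names, not a cited method).
from typing import List, Tuple
import random

def generate_anchors(num_frames: int, scales: List[int]) -> List[Tuple[int, int]]:
    """Enumerate candidate segments: at each frame, one anchor per temporal scale."""
    anchors = []
    for end in range(num_frames):
        for scale in scales:
            start = end - scale + 1
            if start >= 0:
                anchors.append((start, end))
    return anchors

def localize(num_frames: int, scales: List[int]) -> Tuple[int, int]:
    """Return the anchor with the highest confidence score."""
    anchors = generate_anchors(num_frames, scales)
    # Placeholder scores; a real model computes these from fused video/query features.
    scores = [random.random() for _ in anchors]
    best = max(range(len(anchors)), key=lambda i: scores[i])
    return anchors[best]

if __name__ == "__main__":
    print(localize(num_frames=128, scales=[8, 16, 32, 64]))
```

In this scheme the localization quality is bounded by the preset anchor scales, which is one motivation for the regression-based and reinforcement-learning formulations cited above.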