
DPHANet: Discriminative Parallel and Hierarchical Attention Network for Natural Language Video Localization



Abstract:

Natural Language Video Localization (NLVL) has recently attracted much attention because of its practical significance. However, existing methods still face the following challenges: 1) when models learn intra-modal semantic associations, temporal causal interaction information and contextual semantic discriminative information are ignored, so intra-modal semantic context connections are missing; 2) when learning fused representations, existing cross-modal interaction modules lack a hierarchical attention mechanism for extracting inter-modal similarity information and intra-modal self-correlation information, so cross-modal information interaction is insufficient; and 3) when optimizing the loss function, existing models ignore the causal correlation between the start and end boundaries, leading to inaccurate boundary calibration. To address these challenges, we propose a novel NLVL model, the Discriminative Parallel and Hierarchical Attention Network (DPHANet). Specifically, we emphasize the importance of temporal causal interaction information and contextual semantic discriminative information and propose a Discriminative Parallel Attention Encoder (DPAE) module to infer and encode this critical information. To overcome the shortcomings of existing cross-modal interaction modules, we design a Video-Query Hierarchical Attention (VQHA) module, which performs cross-modal interaction and intra-modal self-correlation modeling in a hierarchical manner. Furthermore, a novel deviation loss is proposed to capture the causal correlation between the start and end boundaries and force the model to focus on continuity and temporal causality in the video. Finally, extensive experiments on three benchmark datasets demonstrate the superiority of the proposed DPHANet model, which achieves about 1.5% and 3.5% average performance improvement and a...
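To make the abstract's high-level description more concrete, below is a minimal PyTorch sketch of a hierarchical video-query attention block (cross-modal attention followed by intra-modal self-attention) and one plausible form of a deviation loss that couples the start and end boundaries through the predicted span length. All class names, tensor shapes, and the exact loss formulation are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of the ideas described in the abstract: hierarchical
# cross-modal + intra-modal attention, and a boundary "deviation" term.
# Names, shapes, and the loss form are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalAttentionBlock(nn.Module):
    """Cross-modal attention (video attends to query words), then
    intra-modal self-attention over the fused video features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, video: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # video: (B, T, dim) frame features; query: (B, L, dim) word features
        fused, _ = self.cross_attn(video, query, query)    # inter-modal similarity
        video = self.norm1(video + fused)
        refined, _ = self.self_attn(video, video, video)   # intra-modal self-correlation
        return self.norm2(video + refined)


def deviation_loss(pred_start, pred_end, gt_start, gt_end):
    """One plausible deviation term: besides matching each boundary,
    also match the span length so start/end predictions stay correlated."""
    boundary = F.smooth_l1_loss(pred_start, gt_start) + F.smooth_l1_loss(pred_end, gt_end)
    length_dev = F.smooth_l1_loss(pred_end - pred_start, gt_end - gt_start)
    return boundary + length_dev


if __name__ == "__main__":
    B, T, L, D = 2, 128, 12, 256
    block = HierarchicalAttentionBlock(D)
    out = block(torch.randn(B, T, D), torch.randn(B, L, D))
    print(out.shape)  # torch.Size([2, 128, 256])
```

The key design point being illustrated is the ordering: inter-modal similarity is aggregated first, and self-attention then propagates the fused evidence along the video's own temporal axis, which is what the abstract refers to as hierarchical cross-modal interaction and intra-modal self-correlation modeling.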
Published in: IEEE Transactions on Multimedia (Volume: 26)
Page(s): 9575 - 9590
Date of Publication: 02 May 2024



I. Introduction

Video content analysis has received increasing attention from both academia and industry, which has stimulated research on and applications of novel video understanding tasks such as video retrieval [1], [2] and video question answering [3], [4]. As a classic cross-modal information retrieval task, video retrieval returns the most semantically relevant videos from a dataset of trimmed videos given a textual sentence query. However, videos often contain redundant and irrelevant content; that is, only a small fraction of a video's clips are semantically relevant to the query [5], [6]. For example, in a long untrimmed surveillance video, only a few short key clips may be of interest. Localizing these clips manually can require hours of browsing through the entire video, which is inefficient and labor-intensive [7].

