Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering

Abstract:

The main challenge in video question answering (VideoQA) is to capture and understand the complex spatial and temporal relations between objects based on given questions. Existing graph-based methods for VideoQA usually ignore keywords in questions and employ a simple graph to aggregate features without considering relative relations between objects, which may lead to inferior performance. In this paper, we propose a Keyword-aware Relative Spatio-Temporal (KRST) graph network for VideoQA. First, to make question features aware of keywords, we employ an attention mechanism to assign high weights to keywords during question encoding. The keyword-aware question features are then used to guide video graph construction. Second, because relations are relative, we integrate the relative relation modeling to better capture the spatio-temporal dynamics among object nodes. Moreover, we disentangle the spatio-temporal reasoning into an object-level spatial graph and a frame-level temporal graph, which reduces the impact of spatial and temporal relation reasoning on each other. Extensive experiments on the TGIF-QA, MSVD-QA and MSRVTT-QA datasets demonstrate the superiority of our KRST over multiple state-of-the-art methods.
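The keyword-aware question encoding described above can be illustrated with a generic attention-pooling step. The following is a minimal sketch under stated assumptions, not the KRST implementation: the abstract says only that an attention mechanism assigns high weights to keywords, so the dot-product scoring, the `scoring_vec` parameter, the function names, and the toy feature values are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def keyword_aware_pool(word_feats, scoring_vec):
    """Pool per-word features into one question vector via attention.

    word_feats:  list of per-word feature vectors (lists of floats).
    scoring_vec: hypothetical learned vector used to score each word
                 (an assumption; the source does not specify the scorer).
    Returns (weights, pooled_vector).
    """
    # Score each word with a dot product against the scoring vector.
    scores = [sum(f * v for f, v in zip(feat, scoring_vec)) for feat in word_feats]
    # Normalize the scores so the weights sum to 1.
    weights = softmax(scores)
    # Attention-weighted sum of the word features.
    dim = len(word_feats[0])
    pooled = [
        sum(a * feat[d] for a, feat in zip(weights, word_feats))
        for d in range(dim)
    ]
    return weights, pooled


# Toy question with made-up 2-d word features (illustrative values only).
words = ["what", "does", "the", "man", "do"]
feats = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0], [1.0, 0.5], [0.8, 0.9]]
weights, question_vec = keyword_aware_pool(feats, scoring_vec=[2.0, 2.0])
```

In this toy setup the content words ("man", "do") receive larger weights than the function words, so the pooled question vector is dominated by their features. In a trained model the scoring parameters would be learned rather than fixed by hand.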

Published in: IEEE Transactions on Multimedia (Volume: 26)

Page(s): 6131 - 6141

Date of Publication: 20 December 2023

DOI: 10.1109/TMM.2023.3345172

I. Introduction

Video question answering (VideoQA) is a challenging task in Multimedia Intelligence [1], [2], [3] that aims to answer a question based on a thorough understanding of the given video. The task requires learning spatio-temporal visual representations guided by the compositional semantics of the question. In recent years, VideoQA has drawn increasing attention due to its wide applications in domains such as human-robot interaction and autonomous driving. Despite recent progress, VideoQA remains challenging because it requires effective reasoning about complex spatio-temporal relations across the vision and language modalities [4], [5].
