I. Introduction
Visual question answering (VQA) is a hot topic in the artificial intelligence field, which aims to answer questions about a given image. It has broad applications in daily life, such as assistance for the visually impaired, advanced image semantic retrieval, and intelligent medical diagnosis systems [1]. Given an image and a corresponding question, the goal of a VQA model is to output the correct answer through cross-modal reasoning, where the key challenges lie in effectively fusing the multi-modal information and designing elaborate reasoning approaches.

For multi-modal information fusion, early approaches fuse global image and question features [2], [3]; such coarse representations struggle with questions about details and limit model performance. To tackle this problem, fine-grained methods [4], [5], [6] take local details into consideration. These methods typically use an object detector to extract multiple salient regions, apply a word tokenizer to split the question into word tokens, and then construct fine-grained interactions between the two modalities for robust reasoning. Later on, researchers observed that the objects in an image exhibit diverse relationships in terms of their positions (e.g., to the right of) or mutual relations (e.g., ride). To exploit such structure, graph-based methods [5], [7], [8] construct the diverse relationships among objects for graph reasoning. Similarly, [8] parses the question into a syntactic dependency tree, where relations among words are built according to syntactic rules, which facilitates question understanding and promotes more effective visual reasoning. Motivated by the significant advantages of the BERT model [9] in pre-training on large-scale text corpora, researchers have built cross-modal pre-training models to obtain better fusion features, and such models usually perform best in VQA challenges.

The above VQA methods mostly follow a general process (Fig. 1(a)): first, a feature extractor extracts features from the image and the question respectively; then the Single-Modality Module constructs relationships among the extracted features to capture crucial intra-modal information; next, the Cross-Modality Module fuses the multi-modal features into a contextualized vector, which is finally fed into a classifier to output the expected answer.
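To make the generic process in Fig. 1(a) concrete, the following PyTorch sketch instantiates each stage with a simple attention-based design. The module internals, feature dimensions, and answer vocabulary size are illustrative assumptions for exposition, not the architecture of any particular cited model.

```python
# Minimal sketch of the generic VQA pipeline in Fig. 1(a).
# All module designs and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn


class SingleModalityModule(nn.Module):
    """Relates features within one modality via self-attention (assumed design)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x)          # intra-modal relation modeling
        return self.norm(x + out)


class CrossModalityModule(nn.Module):
    """Fuses question and image features into a contextualized vector (assumed design)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, question: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(question, image, image)  # question attends to image regions
        fused = self.norm(question + out)
        return fused.mean(dim=1)                    # pool into a single contextualized vector


class GenericVQAModel(nn.Module):
    def __init__(self, dim: int = 512, num_answers: int = 3129):
        super().__init__()
        self.visual_encoder = SingleModalityModule(dim)
        self.text_encoder = SingleModalityModule(dim)
        self.fusion = CrossModalityModule(dim)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, region_feats: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
        v = self.visual_encoder(region_feats)   # detected object regions
        q = self.text_encoder(word_feats)       # question word tokens
        fused = self.fusion(q, v)               # cross-modal fusion
        return self.classifier(fused)           # answer logits


# Example with pre-extracted features: 36 detected regions and 14 word tokens per sample.
model = GenericVQAModel()
logits = model(torch.randn(2, 36, 512), torch.randn(2, 14, 512))  # shape: (2, 3129)
```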
(a) Traditional models concentrate on image and question information only, applying a cross-modal fusion method (Cross-Modality Module) to generate answers. (b) Our dense caption-aware model (DenseCapBert) first generates dense captions from the image to enhance the visual information, and then applies a novel triple information fusion mechanism to alleviate the domain gap between the vision and language modalities for robust visual reasoning.
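The sketch below illustrates only the high-level idea of Fig. 1(b): dense caption tokens act as a textual bridge between image regions and the question. The concatenation-plus-attention fusion shown here is a simplifying assumption for exposition, not the actual triple information fusion mechanism of DenseCapBert.

```python
# Illustrative-only sketch of dense caption-aware fusion; the design is assumed, not the paper's.
import torch
import torch.nn as nn


class TripleFusionSketch(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, num_answers: int = 3129):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, regions, captions, question):
        # The question attends jointly to image regions and dense caption tokens,
        # so the caption text can help bridge the vision-language domain gap.
        context = torch.cat([regions, captions], dim=1)
        out, _ = self.attn(question, context, context)
        fused = self.norm(question + out).mean(dim=1)
        return self.classifier(fused)


# regions: detector features; captions: encoded dense caption tokens; question: word tokens.
model = TripleFusionSketch()
logits = model(torch.randn(2, 36, 512), torch.randn(2, 20, 512), torch.randn(2, 14, 512))
```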