I. Introduction
Visual question answering (VQA) has become a popular research problem that is being studied from the perspectives of multiple disciplines, such as natural language processing (NLP) and computer vision. The objective of a VQA task is to generate an answer based on an image and a related question. The answer can be a number, a yes/no response, or a word or phrase. Such a task is not trivial: a model must first understand both the image and its corresponding question and then correlate the question-image pair based on their respective features as well as auxiliary external features [1]–[9].

Currently, most VQA approaches exploit multiple modalities by representing the image and the question with separate embedding or feature vectors [10]. For the vision modality, image features are extracted using a convolutional neural network (CNN) [11], whereas for the question-understanding modality, a question embedding vector is generated to represent the semantic meaning of the question using either a recurrent neural network (RNN) or a bag-of-words approach [12]. To identify the information that is important for a given question-image pair, most current algorithms employ attention or co-attention. By assigning different weights to different image features, a good attention mechanism selects the image features that are most relevant to the question being asked. Attention weights can be generated in many ways, e.g., through an elementwise sum, an elementwise product, or multimodal compact bilinear pooling (MCB) [10], and most such methods demonstrate reasonable performance on VQA tasks [13]. Some recent works also address the question-answering component by leveraging semantic frame parsing with RNN models [14]–[29].
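To make the attention idea above concrete, a generic soft-attention formulation (a sketch for exposition only, not the specific formulation of any cited method; the symbols v_i, q, W_v, W_q, and w are introduced here as illustrative notation) computes a weight for each image region feature v_i conditioned on the question embedding q:

\[
\alpha_i = \frac{\exp\!\big( w^{\top} \tanh( W_v v_i + W_q q ) \big)}{\sum_{j=1}^{N} \exp\!\big( w^{\top} \tanh( W_v v_j + W_q q ) \big)}, \qquad \tilde{v} = \sum_{i=1}^{N} \alpha_i v_i,
\]

where N is the number of image regions, W_v, W_q, and w are learned parameters, and the attended feature \tilde{v} is subsequently fused with q (e.g., by an elementwise sum or product, or by MCB pooling [10]) before answer prediction.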
[Figure: An example in which an incorrect answer is obtained using a VQA model.]