I. Introduction
Visual question answering (VQA) has become a popular research problem that is being studied from the perspectives of multiple disciplines, such as natural language processing (NLP) and computer vision. The objective of a VQA task is to generate an answer based on an image and a related question. The answer can be a number, a yes/no response, or a word or phrase. Such a task is not trivial: a model must first understand both the image and its corresponding question and then correlate the question-image pair based on their respective features as well as auxiliary external features [1]–[9].

Currently, most VQA approaches exploit multiple modalities by representing the image and the question with separate embedding or feature vectors [10]. For the vision modality, image features are extracted using a convolutional neural network (CNN) [11], whereas for the question-understanding modality, a question embedding vector is generated to represent the semantic meaning of the question using either a recurrent neural network (RNN) or a bag-of-words approach [12]. To identify the information that is important for a given question-image pair, most current algorithms employ attention or co-attention. By assigning different weights to different image features, a good attention mechanism selects the image features that are most relevant to the question being asked. Attention weights can be generated in many ways, e.g., through an elementwise sum, an elementwise product, or multimodal compact bilinear pooling (MCB) [10], and most such methods demonstrate reasonable performance on VQA tasks [13]. Some recent works also address the question-answering component by leveraging semantic frame parsing with RNN models [14]–[29].
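To make the attention idea above concrete, a generic soft-attention formulation (a sketch for exposition only, not the specific formulation of any cited method; the symbols v_i, q, W_v, W_q, and w are introduced here as illustrative notation) computes a weight for each image region feature v_i conditioned on the question embedding q:

\[
\alpha_i = \frac{\exp\!\big( w^{\top} \tanh( W_v v_i + W_q q ) \big)}{\sum_{j=1}^{N} \exp\!\big( w^{\top} \tanh( W_v v_j + W_q q ) \big)}, \qquad \tilde{v} = \sum_{i=1}^{N} \alpha_i v_i,
\]

where N is the number of image regions, W_v, W_q, and w are learned parameters, and the attended feature \tilde{v} is subsequently fused with q (e.g., by an elementwise sum or product, or by MCB pooling [10]) before answer prediction.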
[Figure: An example in which an incorrect answer is obtained using a VQA model.]