I. Introduction
Visual question answering (VQA) is a hot topic in the artificial intelligence field, which aims to answer questions about a given image. It has broad applications in daily life, such as assistance for the visually impaired, advanced image semantic retrieval, and intelligent medical diagnosis systems [1]. Given an image and a corresponding question, the goal of a VQA model is to output the correct answer through cross-modal reasoning, where the key challenges lie in effectively fusing the multi-modal information and designing elaborate reasoning approaches.

For multi-modal information fusion, early approaches fuse global image and question features [2], [3]; such coarse representations struggle with questions about details and limit model performance. To tackle this problem, fine-grained methods [4], [5], [6] take local details into consideration. These methods typically use an object detector to extract multiple salient regions, apply a word tokenizer to split the question into word tokens, and then construct fine-grained interactions between the two modalities for robust reasoning. Later on, researchers observed that the objects in an image exhibit diverse relationships in terms of their positions (e.g., to the right of) or mutual relations (e.g., ride). To exploit such structure, graph-based methods [5], [7], [8] construct the diverse relationships among objects for graph reasoning. Similarly, [8] parses the question into a syntactic dependency tree, where relations among words are built according to syntactic rules, which facilitates question understanding and promotes more effective visual reasoning. Motivated by the significant advantages of the BERT model [9] in pre-training on large-scale text corpora, researchers have built cross-modal pre-training models to obtain better fusion features, and such models usually perform best in VQA challenges.

The above VQA methods mostly follow a general process (Fig. 1(a)): first, a feature extractor extracts features from the image and the question respectively; then the Single-Modality Module constructs relationships among the extracted features to capture crucial intra-modal information; next, the Cross-Modality Module fuses the multi-modal features into a contextualized vector, which is finally fed into a classifier to output the expected answer.
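To make the generic process in Fig. 1(a) concrete, the following PyTorch sketch instantiates each stage with a simple attention-based design. The module internals, feature dimensions, and answer vocabulary size are illustrative assumptions for exposition, not the architecture of any particular cited model.

```python
# Minimal sketch of the generic VQA pipeline in Fig. 1(a).
# All module designs and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn


class SingleModalityModule(nn.Module):
    """Relates features within one modality via self-attention (assumed design)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x)          # intra-modal relation modeling
        return self.norm(x + out)


class CrossModalityModule(nn.Module):
    """Fuses question and image features into a contextualized vector (assumed design)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, question: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(question, image, image)  # question attends to image regions
        fused = self.norm(question + out)
        return fused.mean(dim=1)                    # pool into a single contextualized vector


class GenericVQAModel(nn.Module):
    def __init__(self, dim: int = 512, num_answers: int = 3129):
        super().__init__()
        self.visual_encoder = SingleModalityModule(dim)
        self.text_encoder = SingleModalityModule(dim)
        self.fusion = CrossModalityModule(dim)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, region_feats: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
        v = self.visual_encoder(region_feats)   # detected object regions
        q = self.text_encoder(word_feats)       # question word tokens
        fused = self.fusion(q, v)               # cross-modal fusion
        return self.classifier(fused)           # answer logits


# Example with pre-extracted features: 36 detected regions and 14 word tokens per sample.
model = GenericVQAModel()
logits = model(torch.randn(2, 36, 512), torch.randn(2, 14, 512))  # shape: (2, 3129)
```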
(a) Traditional models concentrate on image and question information only, applying a cross-modal fusion method (Cross-Modality Module) to generate answers. (b) Our dense caption-aware model (DenseCapBert) first generates dense captions from the image to enhance the visual information, and then applies a novel triple information fusion mechanism to alleviate the domain gap between the vision and language modalities for robust visual reasoning.
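The sketch below illustrates only the high-level idea of Fig. 1(b): dense caption tokens act as a textual bridge between image regions and the question. The concatenation-plus-attention fusion shown here is a simplifying assumption for exposition, not the actual triple information fusion mechanism of DenseCapBert.

```python
# Illustrative-only sketch of dense caption-aware fusion; the design is assumed, not the paper's.
import torch
import torch.nn as nn


class TripleFusionSketch(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, num_answers: int = 3129):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, regions, captions, question):
        # The question attends jointly to image regions and dense caption tokens,
        # so the caption text can help bridge the vision-language domain gap.
        context = torch.cat([regions, captions], dim=1)
        out, _ = self.attn(question, context, context)
        fused = self.norm(question + out).mean(dim=1)
        return self.classifier(fused)


# regions: detector features; captions: encoded dense caption tokens; question: word tokens.
model = TripleFusionSketch()
logits = model(torch.randn(2, 36, 512), torch.randn(2, 20, 512), torch.randn(2, 14, 512))
```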