See and Learn More: Dense Caption-Aware Representation for Visual Question Answering


Abstract:

With the rapid development of deep learning models, great improvements have been achieved in the Visual Question Answering (VQA) field. However, modern VQA models are easily affected by language priors: they ignore image information and learn superficial correlations between questions and answers, even in the best pre-training models. The main reason is that visual information is not fully extracted and utilized, which leaves a domain gap between the vision and language modalities to a certain extent. To mitigate this issue, we propose to extract dense captions (auxiliary semantic information) from images to enhance the visual information used for reasoning, and to utilize them to bridge the gap between vision and language, since dense captions and questions come from the same language modality (i.e., phrases or sentences). In this paper, we propose a novel dense caption-aware visual question answering model, called DenseCapBert, to enhance visual reasoning. Specifically, we generate dense captions for the images and propose a multimodal interaction mechanism that fuses dense captions, images, and questions in a unified framework, which makes VQA models more robust. Experimental results on the GQA, GQA-OOD, VQA v2, and VQA-CP v2 datasets show that dense captions are beneficial to model generalization and that our model effectively mitigates the language bias problem.
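The "same language modality" point can be made concrete with a plain text tokenizer: dense captions and the question are both text, so one BERT-style pipeline can consume them as a sentence pair. The snippet below is a minimal sketch, not the authors' code; the tokenizer name, question, and caption strings are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): dense captions and the question share
# the language modality, so a single BERT-style tokenizer can encode both as a
# sentence pair. Model name, question, and captions are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

question = "What is the man on the right holding?"
dense_captions = [  # hypothetical output of a dense captioning model
    "a man wearing a red jacket",
    "a black umbrella held above his head",
    "wet pavement reflecting street lights",
]

# Encode the question and the concatenated captions as a standard sentence pair.
encoded = tokenizer(
    question,
    " . ".join(dense_captions),
    padding="max_length",
    truncation=True,
    max_length=64,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)         # torch.Size([1, 64])
print(encoded["token_type_ids"][0][:20])  # segment ids separate question from captions
```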
Page(s): 1135 - 1146
Date of Publication: 03 July 2023


I. Introduction

The visual question answering (VQA) task is a hot topic in the artificial intelligence field, which aims to answer questions related to a specific image. It has been widely applied in daily life, for example in blind assistance, advanced image semantic retrieval, and intelligent medical diagnosis systems [1]. Given an image and a corresponding question, the goal of VQA models is to output the right answer through cross-modal reasoning, where the key challenge lies in effectively fusing the multi-modal information and designing elaborate reasoning approaches.

For multi-modal information fusion, previous approaches use global image and question features for cross-modal information fusion [2], [3], which makes it difficult to answer questions about fine details and limits model performance. To tackle this problem, fine-grained methods [4], [5], [6] are proposed, which take local details into consideration. These methods usually use an object detector to extract multiple salient regions and apply a word tokenizer to split questions into word tokens, and then construct fine-grained interactions between the two modalities for robust reasoning. Later, researchers find that diverse relationships exist among the objects in an image, in terms of their positions (e.g., "to the right of") or mutual relations (e.g., "ride"). For better relational reasoning, graph-based methods [5], [7], [8] are proposed to model these relationships among objects and reason over the resulting graphs. Similarly, [8] parses the question into a syntactic dependency tree, where relations among words are built according to syntactic rules, which facilitates question understanding and promotes more effective visual reasoning. Given the significant advantages of the BERT model [9] in pre-training on large-scale text corpora, researchers have also adapted it to build cross-modal pre-training models that obtain better fused features and usually perform best in VQA challenges.

The above VQA methods mostly follow a general process (Fig. 1(a)): first, the feature extractor extracts features from the image and the question respectively; then, the Single-Modality Module constructs relationships among the extracted features to capture crucial intra-modal information; next, the Cross-Modality Module fuses the multi-modal features into a contextualized vector, which is finally fed into the classifier to output the expected answer.
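As a concrete illustration of this general process, the following is a minimal PyTorch sketch, not the implementation of any cited model: pre-extracted region features and question tokens pass through single-modality encoders, a cross-modality module fuses them into a contextualized vector, and a classifier predicts the answer. All dimensions, module choices, and names are illustrative assumptions.

```python
# Minimal PyTorch sketch of the generic VQA process in Fig. 1(a); illustrative only.
import torch
import torch.nn as nn

class GenericVQA(nn.Module):
    def __init__(self, d_model=256, region_dim=2048, vocab_size=10000, n_answers=1000):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, d_model)    # project detector features
        self.word_embed = nn.Embedding(vocab_size, d_model)  # embed question tokens
        # Single-Modality Modules: capture intra-modal relationships within each modality
        self.visual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        # Cross-Modality Module: question tokens attend to image regions
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, n_answers))

    def forward(self, regions, question_ids):
        v = self.visual_encoder(self.region_proj(regions))    # (B, R, d) region features
        q = self.text_encoder(self.word_embed(question_ids))  # (B, T, d) word features
        fused, _ = self.cross_attn(query=q, key=v, value=v)   # cross-modal fusion
        return self.classifier(fused.mean(dim=1))             # contextualized vector -> answer logits

# Toy forward pass with random detector features and token ids.
model = GenericVQA()
regions = torch.randn(2, 36, 2048)               # 36 salient regions per image
question_ids = torch.randint(0, 10000, (2, 14))  # 14 question tokens
print(model(regions, question_ids).shape)        # torch.Size([2, 1000])
```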

Fig. 1. (a) Traditional models often concentrate on image and question information, applying a cross-modal fusion method (Cross-Modality Module) to generate the answers. (b) Our dense caption-aware model (DenseCapBert) first generates dense captions from images to enhance the visual information and then applies a novel triple information fusion mechanism to reduce the domain gap between the vision and language modalities for robust visual reasoning.
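Under the assumption that dense captions have already been generated and encoded as token features, the triple interaction in Fig. 1(b) can be sketched as two chained cross-attention steps: the question first attends to the caption tokens (same language modality), and the caption-aware question then attends to the image regions. This is a simplified illustration of the idea, not the released DenseCapBert architecture.

```python
# Hedged sketch of a triple (question, dense caption, image) fusion; illustrative only.
import torch
import torch.nn as nn

class TripleFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_answers=1000):
        super().__init__()
        self.q_to_cap = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.q_to_img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_answers)

    def forward(self, q_tokens, cap_tokens, img_regions):
        # Step 1: question attends to dense caption tokens (text-to-text, no modality gap).
        q_cap, _ = self.q_to_cap(q_tokens, cap_tokens, cap_tokens)
        # Step 2: the caption-aware question attends to visual region features.
        q_vis, _ = self.q_to_img(q_cap, img_regions, img_regions)
        return self.classifier(q_vis.mean(dim=1))  # answer logits

# Toy tensors standing in for encoded question words, caption tokens, and regions.
fusion = TripleFusion()
q = torch.randn(2, 14, 256)   # question token features
c = torch.randn(2, 40, 256)   # dense caption token features
v = torch.randn(2, 36, 256)   # projected image region features
print(fusion(q, c, v).shape)  # torch.Size([2, 1000])
```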

References

[1] S. Barra, C. Bisogni, M. De Marsico, and S. Ricciardi, "Visual question answering: Which investigated applications?", Pattern Recognit. Lett., vol. 151, pp. 325-331, Nov. 2021.
[2] J. H. Kim, J. Jun, and B. T. Zhang, "Bilinear attention networks", Proc. NeurIPS, pp. 1571-1581, 2018.
[3] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach, "Multimodal compact bilinear pooling for visual question answering and visual grounding", Proc. Conf. Empirical Methods Natural Lang. Process., pp. 457-468, 2016.
[4] P. Anderson et al., "Bottom-up and top-down attention for image captioning and VQA", Proc. CVPR, pp. 6077-6086, 2018.
[5] L. Li, Z. Gan, Y. Cheng, and J. Liu, "Relation-aware graph attention network for visual question answering", Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pp. 10312-10321, Oct. 2019.
[6] W. Zhu, X. Wang, and H. Li, "Multi-modal deep analysis for multimedia", IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 10, pp. 3740-3764, Oct. 2020.
[7] E. Kim, W. Y. Kang, K. On, Y. Heo, and B. Zhang, "Hypergraph attention networks for multimodal learning", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 14569-14578, Jun. 2020.
[8] Q. Huang et al., "Aligned dual channel graph convolutional network for visual question answering", Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, pp. 7166-7176, 2020.
[9] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding", Proc. NAACL, pp. 4171-4186, 2019.
[10] C. Kervadec, T. Jaunet, G. Antipov, M. Baccouche, R. Vuillemot, and C. Wolf, "How transferable are reasoning patterns in VQA?", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 4205-4214, Jun. 2021.
[11] H. Jiang, I. Misra, M. Rohrbach, E. Learned-Miller, and X. Chen, "In defense of grid features for visual question answering", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 10264-10273, Jun. 2020.
[12] T. Yu, J. Yu, Z. Yu, Q. Huang, and Q. Tian, "Long-term video question answering via multimodal hierarchical memory attentive networks", IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 3, pp. 931-944, Mar. 2021.
[13] D. A. Hudson and C. D. Manning, "Learning by abstraction: The neural state machine", Proc. NeurIPS, pp. 5901-5914, 2019.
[14] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, "VisualBERT: A simple and performant baseline for vision and language", arXiv:1908.03557, 2019.
[15] H. Xu et al., "E2E-VLP: End-to-end vision-language pre-training enhanced by visual learning", Proc. 59th Annu. Meeting Assoc. Comput. Linguistics 11th Int. Joint Conf. Natural Lang. Process., pp. 503-513, 2021.
[16] X. Li et al., "OSCAR: Object-semantics aligned pre-training for vision-language tasks", Proc. ECCV, pp. 121-137, 2020.
[17] J. Lu, D. Batra, D. Parikh, and S. Lee, "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks", Proc. NeurIPS, pp. 13-23, 2019.
[18] H. Tan and M. Bansal, "LXMERT: Learning cross-modality encoder representations from transformers", Proc. Conf. Empirical Methods Natural Lang. Process. 9th Int. Joint Conf. Natural Lang. Process. (EMNLP-IJCNLP), pp. 5099-5110, 2019.
[19] W. Guo, Y. Zhang, J. Yang, and X. Yuan, "Re-attention for visual question answering", IEEE Trans. Image Process., vol. 30, pp. 6730-6743, 2021.
[20] R. R. Selvaraju et al., "Taking a HINT: Leveraging explanations to make vision and language models more grounded", Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pp. 2591-2600, Oct. 2019.
[21] J. Wu and R. Mooney, "Self-critical reasoning for robust visual question answering", Proc. NeurIPS, vol. 32, pp. 8601-8611, 2019.
[22] R. Cadene, C. Dancette, H. Ben-Younes, M. Cord, and D. Parikh, "RUBi: Reducing unimodal biases in visual question answering", Proc. NeurIPS, pp. 839-850, 2019.
[23] Y. Niu, K. Tang, H. Zhang, Z. Lu, X. Hua, and J. Wen, "Counterfactual VQA: A cause-effect look at language bias", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 12695-12705, Jun. 2021.
[24] S. Ramakrishnan, A. Agrawal, and S. Lee, "Overcoming language priors in visual question answering with adversarial regularization", Proc. NeurIPS, vol. 31, pp. 1548-1558, 2018.
[25] N. Ouyang et al., "Suppressing biased samples for robust VQA", IEEE Trans. Multimedia, vol. 24, pp. 3405-3415, 2022.
[26] L. Chen, X. Yan, J. Xiao, H. Zhang, S. Pu, and Y. Zhuang, "Counterfactual samples synthesizing for robust visual question answering", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 10797-10806, Jun. 2020.
[27] T. Gokhale, P. Banerjee, C. Baral, and Y. Yang, "MUTANT: A training paradigm for out-of-distribution generalization in visual question answering", Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), pp. 878-892, 2020.
[28] Z. Liang, W. Jiang, H. Hu, and J. Zhu, "Learning to contrast the counterfactual samples for robust visual question answering", Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), pp. 3285-3292, 2020.
[29] A. Vaswani et al., "Attention is all you need", Proc. NeurIPS, pp. 5998-6008, 2017.
[30] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. Corso, and J. Gao, "Unified vision-language pre-training for image captioning and VQA", Proc. AAAI Conf. Artif. Intell., vol. 34, no. 7, pp. 13041-13049, 2020.
