
Counterfactual VQA: A Cause-Effect Look at Language Bias


Abstract:

VQA models may tend to rely on language bias as a shortcut and thus fail to sufficiently learn the multi-modal knowledge from both vision and language. Recent debiasing methods propose to exclude the language prior during inference. However, they fail to disentangle the "good" language context from the "bad" language bias. In this paper, we investigate how to mitigate language bias in VQA. Motivated by causal effects, we propose a novel counterfactual inference framework, which enables us to capture the language bias as the direct causal effect of questions on answers, and to reduce it by subtracting the direct language effect from the total causal effect. Experiments demonstrate that our counterfactual inference framework 1) is general to various VQA backbones and fusion strategies, and 2) achieves competitive performance on the language-bias-sensitive VQA-CP dataset while performing robustly on the balanced VQA v2 dataset without any augmented data. The code is available at https://github.com/yuleiniu/cfvqa.
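
To make the subtraction concrete, below is a minimal Python sketch of inference-time debiasing under simplifying assumptions: z_fused and z_question stand for the answer logits of a fused vision-language branch and a question-only branch, and the plain logit subtraction is illustrative only, not the exact formulation in the paper or the released code.

import numpy as np

# Illustrative sketch (not the authors' implementation): approximate the
# total causal effect (TE) with fused vision+language logits and the
# natural direct effect (NDE) of the question with question-only logits,
# then answer by the remaining indirect effect, TIE = TE - NDE.

def counterfactual_inference(z_fused: np.ndarray, z_question: np.ndarray) -> np.ndarray:
    """Debiased answer scores: subtract the direct language effect."""
    return z_fused - z_question

# Toy example: the language prior pushes answer 0 ("tennis"), while the
# visual evidence favors answer 1.
z_fused = np.array([2.0, 1.8, 0.1])     # vision + language branch
z_question = np.array([1.5, 0.2, 0.1])  # question-only (prior) branch
scores = counterfactual_inference(z_fused, z_question)
print(int(scores.argmax()))  # -> 1: the prior-driven shortcut is suppressed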
Date of Conference: 20-25 June 2021
Date Added to IEEE Xplore: 02 November 2021
Conference Location: Nashville, TN, USA


1. Introduction

Visual Question Answering (VQA) [8], [4] has become a fundamental building block that underpins many frontier interactive AI systems, such as visual dialog [15], vision-and-language navigation [6], and visual commonsense reasoning [51]. VQA systems are required to perform visual analysis, language understanding, and multi-modal reasoning. Recent studies [3], [8], [19], [25] found that VQA models may rely on spurious linguistic correlations rather than multi-modal reasoning. For instance, simply answering "tennis" to sport-related questions and "yes" to questions starting with "Do you see a ..." achieves approximately 40% and 90% accuracy, respectively, on the VQA v1.0 dataset. As a result, VQA models fail to generalize well if they simply memorize the strong language priors in the training data [2], [19], especially on the recently proposed VQA-CP [3] dataset, where the priors differ substantially between the training and test sets.
