1. Introduction
Visual Question Answering (VQA) [8], [4] has become a fundamental building block underpinning many frontier interactive AI systems, such as visual dialog [15], vision-and-language navigation [6], and visual commonsense reasoning [51]. VQA systems must perform visual analysis, language understanding, and multi-modal reasoning. However, recent studies [3], [8], [19], [25] have found that VQA models may rely on spurious linguistic correlations rather than genuine multi-modal reasoning. For instance, simply answering "tennis" to sport-related questions and "yes" to questions of the form "Do you see a ..." achieves approximately 40% and 90% accuracy, respectively, on the VQA v1.0 dataset. Consequently, VQA models that merely memorize the strong language priors in the training data fail to generalize [2], [19], especially on the recently proposed VQA-CP dataset [3], where the priors differ substantially between the training and test sets.
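To make the notion of a language prior concrete, the sketch below shows a hypothetical prior-only baseline (not any cited model): it groups training questions by a crude question-type key, always predicts the majority training answer for that type, and never looks at the image. The function name `majority_prior_baseline` and the three-word type key are illustrative assumptions.

```python
from collections import Counter, defaultdict

def majority_prior_baseline(train_qas, test_qas):
    """Illustrative language-prior baseline: for each question type
    (approximated by the first three words of the question), always
    predict the most frequent training answer; the image is ignored."""
    # Group training answers by a crude question-type key.
    answers_by_type = defaultdict(Counter)
    for question, answer in train_qas:
        qtype = " ".join(question.lower().split()[:3])
        answers_by_type[qtype][answer] += 1

    # Predict the majority training answer for each test question.
    correct = 0
    for question, answer in test_qas:
        qtype = " ".join(question.lower().split()[:3])
        if qtype in answers_by_type:
            prediction = answers_by_type[qtype].most_common(1)[0][0]
        else:
            prediction = "yes"  # fallback to the globally common answer
        correct += prediction == answer
    return correct / len(test_qas)

# Toy usage: the baseline scores well whenever test priors match training priors.
train = [("What sport is this?", "tennis"),
         ("What sport is this?", "tennis"),
         ("Do you see a dog?", "yes")]
test = [("What sport is this?", "tennis"),
        ("Do you see a cat?", "yes")]
print(majority_prior_baseline(train, test))  # 1.0 on this toy split
```

Under a VQA-CP-style split, where the answer distribution per question type is deliberately shifted between training and test sets, such a baseline (and any model that implicitly imitates it) degrades sharply.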