I. Introduction
Visual question answering (VQA) is a prevalent and challenging multi-modal task that demands a strong grasp of visual context understanding [27] and linguistically-aware reasoning [5], [21], [43]. With the advancement of deep neural networks, VQA models have made significant strides in applications such as human-robot interaction [42] and visual dialog [44]. Typically, existing VQA models excel when they have access to large-scale training data and the training and testing sets share similar data distributions. However, these well-trained VQA models often struggle in the out-of-distribution scenarios that arise in real-world applications [40], where the answer distributions of the training and testing datasets diverge. For instance, as depicted in Figure 1 (a), when we query a well-trained VQA model with “What color is her shirt?”, it may return a biased answer such as “black”, reflecting the answer distribution of the training dataset while neglecting the actual visual context. It is therefore crucial to adapt a deployed VQA model to each distinct scene so that it maintains strong performance under distribution shifts in the test samples.
Figure 1. Test-time Adaptation for Visual Question Answering with Biased Dataset Distributions. (a) Illustration of the biased training and testing subsets for the question type “what color is” in the VQA-CP v2 [1] dataset. In the training subset, “black” constitutes a significant portion of the answers, whereas in the testing subset it accounts for only a small proportion. Current methods tend to predict the biased answer because they capture language biases in the training dataset. (b) Given the biased testing dataset and a pre-trained, biased VQA model (e.g., UpDn [2]), test-time adaptation aims to improve the VQA model’s out-of-distribution performance by leveraging unlabeled, sequentially arriving test samples.
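To make the test-time adaptation setting in Figure 1 (b) concrete, the sketch below shows one generic way to adapt a pre-trained classifier on unlabeled, sequentially arriving test batches via entropy minimization (in the spirit of Tent-style fully test-time adaptation); it is an illustrative sketch, not the method proposed in this work, and the toy answer head, feature dimension, and batch stream are placeholder assumptions standing in for a real VQA model such as UpDn.

```python
# Minimal, hypothetical sketch: entropy-minimization test-time adaptation
# on unlabeled, sequential test batches. The toy classifier is a stand-in
# for a pre-trained VQA model; only normalization parameters are updated.
import torch
import torch.nn as nn


class ToyAnswerHead(nn.Module):
    """Placeholder classifier over fused image-question features."""

    def __init__(self, feat_dim=512, num_answers=100):
        super().__init__()
        self.norm = nn.BatchNorm1d(feat_dim)
        self.fc = nn.Linear(feat_dim, num_answers)

    def forward(self, fused_feat):
        return self.fc(self.norm(fused_feat))


def prediction_entropy(logits):
    # Mean entropy of the batch predictions; lower means more confident.
    probs = logits.softmax(dim=-1)
    return -(probs * logits.log_softmax(dim=-1)).sum(dim=-1).mean()


def test_time_adapt(model, test_stream, lr=1e-4):
    # Adapt only the normalization affine parameters, keeping the rest
    # frozen -- a common choice for fully test-time adaptation.
    model.train()  # BatchNorm uses statistics of the current test batch
    params = [p for m in model.modules() if isinstance(m, nn.BatchNorm1d)
              for p in m.parameters()]
    optimizer = torch.optim.Adam(params, lr=lr)
    for fused_feat in test_stream:       # unlabeled test batches, in order
        logits = model(fused_feat)
        loss = prediction_entropy(logits)  # no ground-truth answers used
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        yield logits.argmax(dim=-1)      # predictions from the adapted model


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyAnswerHead()
    fake_stream = (torch.randn(32, 512) for _ in range(5))  # stand-in batches
    for preds in test_time_adapt(model, fake_stream):
        print(preds[:5])
```

In practice the stream would yield fused image-question features (or raw inputs) from the biased testing set, and the adapted predictions would be compared against the frozen pre-trained model to measure the out-of-distribution gain.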