1. Introduction
Visual Question Answering (VQA) is the task of automatically answering free-form natural language questions about an image. For VQA systems to work reliably when deployed in the wild, for applications such as assisting visually impaired users, they need to be robust to the different ways a user might ask the same question. For example, a VQA model should produce the same answer for two paraphrased questions, such as "What is in the basket?" and "What is contained in the basket?", since their semantic meaning is the same. While significant progress has been made towards building more accurate VQA systems, these models remain brittle to minor linguistic variations in the input question.