1. Introduction
Visual Question Answering (VQA) is a challenging task that requires both language-aware reasoning and image understanding. With advances in deep learning, neural networks [37], [34], [6], [13], [18], [17], [19], [29] that model the interactions between vision and language have achieved remarkable results on large-scale benchmark datasets [3], [15], [23], [20].