I. Introduction
Visual Question Answering (VQA) is a challenging task at the intersection of natural language understanding and computer vision: given an image and a corresponding question, a model must predict the correct answer. Compared with traditional computer vision tasks such as object detection [1], VQA therefore requires more sophisticated joint reasoning over images and text. Because the task demands the joint analysis of multimodal features from both vision and language, it has attracted increasing attention.