I. Introduction
Visual Question Answering (VQA) is an image-text multimodal question answering task proposed by Agrawal et al. in 2015 [1]. The task requires multimodal models to read, understand, fuse, and reason over visual and textual information. In natural scenes, text appears in many places, such as license plates, store signs, and clothing logos. These OCR texts are indispensable supplementary information for visual question answering. The variant of the task that makes use of OCR text is called Text-based Visual Question Answering [2]. An example is shown in Fig. 1.