I. Introduction
With the development of remote sensing technology, remote sensing images (RSIs) have become widely accessible [1], including panchromatic images, multispectral images [2], hyperspectral images [3], and infrared images. These RSIs contain rich visual properties of the land surface, which can be used in RSI scene classification [4], object detection [5], [6], and image captioning [7]. Multispectral and hyperspectral images [8] also record the spectral characteristics of each material [9], which can be used to discriminate materials in tasks such as hyperspectral image classification [10], change detection, and anomaly detection. However, all these tasks extract only task-specific information (such as scene categories, object locations, and labels) from RSIs. In contrast, the remote sensing visual question answering (RSVQA) task answers natural-language questions about RSIs by combining image processing and natural language processing (NLP), which provides the user with high-level semantic information.