Mutual Attention Inception Network for Remote Sensing Visual Question Answering

Abstract:

Remote sensing images (RSIs) containing various ground objects have been applied in many fields. To make the semantic understanding of RSIs objective and interactive, the task of remote sensing visual question answering (VQA) has emerged. Given an RSI, the goal of remote sensing VQA is to enable an intelligent agent to answer a question about the remote sensing scene. Existing remote sensing VQA methods use a nonspatial fusion strategy to fuse image features and question features, which ignores both the spatial information of images and the word-level information of questions. A novel method is proposed that takes these two aspects into account. First, convolutional features of the image are used to represent spatial information, and the word vectors of the question are adopted to represent word-level semantic information. Second, an attention mechanism and a bilinear technique are introduced to enhance the features by considering the alignments between spatial positions and words. Finally, a fully connected layer with softmax outputs an answer, treating answer prediction as a multiclass classification task. To benchmark this task, an RSIVQA dataset is introduced in this article. For each of more than 37 000 RSIs, the proposed dataset contains at least one question, plus the corresponding answers. Experimental results demonstrate that the proposed method can capture the alignments between images and questions. The code and dataset are available at https://github.com/spectralpublic/RSIVQA.
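The pipeline summarized in the abstract (spatial image features, word vectors, mutual attention, bilinear fusion, and a softmax classifier) can be illustrated with a minimal pure-Python sketch. It is not the authors' Mutual Attention Inception Network, and all function names are hypothetical. It assumes that the image is already encoded as one feature vector per spatial position of a CNN feature map and that the word vectors share that dimension. The mutual attention is a generic co-attention that scores each position (or word) by its strongest alignment with any word (or position), and an elementwise product stands in for the bilinear technique as a low-rank simplification.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Vector) -> Vector:
    # Subtract the maximum score for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights: Vector, vectors: Matrix) -> Vector:
    # Attention-weighted average of a list of feature vectors.
    dim = len(vectors[0])
    return [sum(w * vec[k] for w, vec in zip(weights, vectors)) for k in range(dim)]


def mutual_attention(regions: Matrix, words: Matrix) -> Tuple[Vector, Vector, Vector, Vector]:
    """Hypothetical co-attention between spatial positions and question words.

    regions: one CNN feature vector per spatial position (assumed precomputed).
    words:   one word vector per question token (assumed same dimension).
    """
    # Affinity between every spatial position i and every word t.
    affinity = [[dot(r, w) for w in words] for r in regions]
    # A position is relevant if it aligns strongly with some word, and vice versa.
    region_scores = [max(row) for row in affinity]
    word_scores = [max(row[t] for row in affinity) for t in range(len(words))]
    region_att = softmax(region_scores)
    word_att = softmax(word_scores)
    image_vec = weighted_sum(region_att, regions)
    question_vec = weighted_sum(word_att, words)
    return image_vec, question_vec, region_att, word_att


def fuse(image_vec: Vector, question_vec: Vector) -> Vector:
    # Elementwise (Hadamard) product: a low-rank stand-in for bilinear pooling.
    return [a * b for a, b in zip(image_vec, question_vec)]


def classify(fused: Vector, weights: Matrix, bias: Vector) -> Vector:
    # Fully connected layer followed by softmax over the answer classes.
    logits = [dot(row, fused) + b for row, b in zip(weights, bias)]
    return softmax(logits)


if __name__ == "__main__":
    # Toy inputs: 3 spatial positions and 2 question words, feature dimension 2.
    regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    words = [[2.0, 0.0], [0.0, 0.1]]
    v, q, region_att, word_att = mutual_attention(regions, words)
    # Two hypothetical answer classes with identity weights, for illustration only.
    probs = classify(fuse(v, q), [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
    print("region attention:", region_att)
    print("answer probabilities:", probs)
```

Because the affinity matrix is computed between every spatial position and every word, the attention weights show directly which regions and words are aligned. A trained model would learn the feature projections, fusion, and classifier weights rather than use the fixed toy values above.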

Published in: IEEE Transactions on Geoscience and Remote Sensing ( Volume: 60)

Article Sequence Number: 5606514

Date of Publication: 31 May 2021

DOI: 10.1109/TGRS.2021.3079918

I. Introduction

With the development of remote sensing technology, remote sensing images (RSIs) have become widely accessible [1], including panchromatic images, multispectral images [2], hyperspectral images [3], and infrared images. These RSIs contain rich visual information about the land surface, which can be used in RSI scene classification [4], object detection [5], [6], and image captioning [7]. Multispectral images and hyperspectral images [8] record the spectral characteristics of each material [9], which makes them suitable for discriminating materials in tasks such as hyperspectral classification [10], change detection, and anomaly detection. However, all these tasks extract only task-specific information (such as scene categories, object locations, and labels) from RSIs. In contrast, the remote sensing visual question answering (RSVQA) task answers questions about RSIs by combining image processing and natural language processing (NLP), which provides the user with high-level semantic information.