I. Introduction
With the development of geoinformatics technology, the field of Earth observation has made significant progress, and a variety of remote sensing (RS) sensors and devices are now widely used. Among the resulting data, aerial images offer real-time acquisition, abundant volume, and easy access, and have therefore become one of the most important data sources in Earth vision, serving practical tasks such as precision agriculture [1], [2] and environmental monitoring [3]. In these applications, aerial scene recognition has been a fundamental and active research topic in recent years. However, because of the inherent characteristics of aerial images, understanding aerial scenes efficiently remains challenging.