I. Introduction
Remote sensing images possess tremendous potential in various applications, such as land management, hazard assessment, national defense, and intelligent agriculture, and extracting crucial information from images has become an important topic [1]– [3]. Scene classification aims to assign a semantic label for each queried images, and it can provide scene-level information for downstream tasks [4]– [6]. Scene images always exhibit complex spatial patterns, which leads to within-class diversity and between-class similarity [7]– [9]. To improve classification performance, various methods based on scene representations have been proposed, including hand-crafted feature-based methods and deep learning methods. Although hand-crafted features (e.g., texture, color, structure, and shape) possess the ability to describe simple scenes for classification, it is a huge challenge to effectively extract high-level semantic information for complex scenes because of its limited description ability, which restricts the improvement of performance [10], [11].