I. Introduction
Remote sensing scene understanding is a vital yet difficult task in the field of intelligent interpretation of remote sensing data. It aims to capture high-level semantic information from images and precisely assign corresponding class labels to them. It has applications in various military and civilian domains, including natural disaster detection, weapon guidance, traffic supervision, and land cover monitoring [1], [2], [3], [4], [5]. In recent years, the advancement in remote sensing technology has increased the level of data abstraction from pixels to objects and ultimately to scenes [6], [7], [8]. To keep pace with these advancements, numerous researchers have dedicated their efforts over the past few decades to address the challenges and achieve scene-level image understanding [9], [10], [11]. In this task, effective feature extraction plays a crucial role, and based on the means of feature extraction, existing scene understanding works can be roughly divided into three directions: methods using low-level visual features, methods relying on mid-level visual representations, and methods based on high-level visual information [12], [13].