1. INTRODUCTION
Remote sensing (RS) image scene classification has captured growing interest owing to its invaluable data support across various domains, encompassing environmental monitoring, resource management, disaster response, etc [1], [2]. At the same time, remarkable advancements have been made in the application of deep learning algorithms for RS image scene classification, progressing from initial usage of CNN [3], vision transformer [4], to vision-language Models [1] and multi-modal learning [5]. These algorithms continuously enhance the accuracy and efficiency of classifying RS images and can even surpass human capabilities. Nonetheless, their high accuracy predominantly pertains to familiar scene categories. Additionally, the application of deep learning in the field of remote sensing [6], [7], [8] is widespread, reflecting its significant impact and utility in various aspects of this area.