1. INTRODUCTION
The rapid development of spectral and optical sensors for satellites has provided increasingly large volumes of Earth observation data, meeting the data-rich prerequisite for developing automatic analysis algorithms, especially deep learning (DL)-based methods. Among these, Vision Transformer (ViT) models have gained considerable traction in the Remote Sensing (RS) domain [1] and achieved superior performance in various RS applications, such as semantic segmentation of Very High Resolution (VHR) [2] and multispectral images [3]. Apart from producing more uniform global representations than Convolutional Neural Networks (CNNs), ViTs also scale effectively to large data and model sizes, which is why they are regarded as the foundation vision models for large-scale RS models [4]. Despite these advantages, the computational complexity of the attention mechanism, which grows quadratically with the number of input tokens, undermines the performance benefits of these powerful models on resource-constrained devices, particularly for very-high-resolution inputs [5].
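To make the quadratic scaling concrete, consider a standard ViT that divides an H x W input into non-overlapping P x P patches (the resolutions and patch size below are illustrative assumptions, not figures taken from [5]). The token count and the per-layer cost of self-attention are

\[
N = \frac{HW}{P^{2}}, \qquad \text{cost}_{\text{attn}} = O\!\left(N^{2} d\right),
\]

where d is the embedding dimension. With P = 16, a 1024 x 1024 tile yields N = 64^2 = 4096 tokens and roughly 1.7 x 10^7 pairwise attention interactions per head per layer; doubling the resolution to 2048 x 2048 gives N = 16384 and about 2.7 x 10^8 interactions, a 16-fold increase, which is precisely the regime in which resource-constrained devices struggle.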