I. Introduction
Object detection is a fundamental task in the field of remote sensing scene understanding, with a wide range of real-world applications, including environmental monitoring, intelligent transportation, and military deployment [1], [2], [3], [4], [5], [6]. Thanks to the easy availability of large-scale remote sensing data, deep learning has recently dominated the field [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20]. Specifically, convolutional neural networks (CNNs) and vision Transformers (ViTs) [21], [22], [23] have successively demonstrated their exceptional representational abilities, while the latter, as a more promising alternative to CNNs, has received increasing popularity, which reflects their significant potential for future applications in remote sensing object detection [24], [25], [26].