I. Introduction
Object counting in aerial scenes aims to estimate the number of objects of a specific category in a given image, and it supports urban planning [1], [2], [3], environmental monitoring and mapping [4], [5], [6], disaster detection [7], [8], [9], and other practical applications [10], [11], [12]. In recent years, benefiting from the development of deep learning and neural networks, several algorithms have been proposed for counting objects (pedestrians, cars, boats, etc.) in remote sensing scenes. Specifically, Bahmanyar et al. [13] introduce a crowd-counting dataset captured from a UAV view and propose a multiresolution network to estimate the pedestrian count. Gao et al. [14] construct the RSOC dataset for remote sensing object counting, which contains four subsets: buildings, small vehicles, large vehicles, and boats. Based on this dataset, Gao et al. [15] propose PSGCNet, which addresses challenges such as scale variation and complex backgrounds in remote sensing scenes by extracting and fusing multiscale and global feature information.