I. Introduction
In recent years, rapid progress in sensor technology has substantially improved the resolution of urban remote sensing imagery. Intelligent interpretation of the spatial detail and latent semantic content in this imagery supports a wide range of applications, including high-resolution Earth observation, urban construction planning, and land resource management [1]–[3]. However, high-precision semantic segmentation of urban remote sensing imagery is challenging because of the inherent diversity of ground objects, variations in lighting conditions, shadows, and occlusions. Moreover, high-resolution urban remote sensing imagery is characterized by a substantial volume of data. To this end, efficient algorithms capable of real-time processing and analysis have attracted significant interest [4].