I. Introduction
Semantic segmentation is one of the traditional tasks in computer vision. The general purpose of semantic segmentation is to assign pixel-level semantic labels by generalizing a large number of densely labeled images [1]–[3]. Along with the development of the field of remote sensing, remote sensing satellites can acquire a large amount of remote sensing image data. Effective semantic segmentation of remote sensing images can classify ground objects at pixel level, which is widely used in road network extraction [4], [5] and land cover [6]–[8] , etc. It is of great significance in updating basic geographic data, autonomous agriculture, intelligent transportation, urban planning and sustainable development, and has a wide range of practical value. There are two challenges in semantic segmentation of remote sensing images: high resolution and large scale variance, which requires huge human resources and time to label; Moreover, there are great differences in topography and architectural style in different regions, and the segmentation effect of trained models is often unsatisfactory when applied to different geographical space regions. For example, in urban and rural areas, land cover is completely different in class distribution, object scale and pixel spectrum.