1 Introduction
Semantic segmentation is an important processing step in natural and medical image analysis for detecting distinct types of objects in images [1]. In this process, a semantic label is assigned to each pixel of a given image. The breakthrough of semantic segmentation methods came when fully convolutional networks (FCNs) were first used by [2] to perform end-to-end segmentation of images. While semantic segmentation has improved substantially since the introduction of fully convolutional networks, small and thin objects in a scene remain difficult to segment because information about small objects is lost during the convolution and pooling operations [3], [4], [5], [6].

For example, Fig. 1a is an image of size 800 by 1200 pixels that contains two cars: the larger car is 160 by 220 pixels (Fig. 1b), and the smaller one is 30 by 40 pixels (Fig. 1c). After a convolution with a 10×10 kernel and a stride of 10, the height and width of the image are reduced to one-tenth of their original size (as shown in Fig. 1d). Accordingly, the large and small cars shrink to 16 by 22 and 3 by 4 pixels, respectively. The large car's features are still visible in Fig. 1e (its feature map), but the features of the small car are barely discernible in its 12-pixel feature map; the downsampling arithmetic is sketched in the code example below. This is because the high-level representations produced by successive convolution and pooling operations have low resolution, which often leads to the loss of detailed information about small/thin objects [3]; as a result, recovering the small car from such coarse feature maps is difficult for segmentation models [7]. However, accurately segmenting small objects is critical in many applications, such as autonomous driving, where segmenting and recognizing small-sized cars and pedestrians in the distance is essential [8], [9], [10], [11].
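To make the arithmetic concrete, the following minimal sketch reproduces the downsampling in the Fig. 1 example with a single stride-10 convolution; the kernel size, stride, and the 64 output channels are illustrative assumptions rather than the configuration of any particular segmentation model.

import torch
import torch.nn as nn

# Input tensor standing in for the 800 x 1200 image of Fig. 1a
# (batch of 1, 3 color channels).
image = torch.randn(1, 3, 800, 1200)

# A 10x10 convolution with stride 10 reduces each spatial dimension
# to one-tenth of its original size (illustrative layer, not a model
# from the paper).
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=10, stride=10)
feature_map = conv(image)
print(feature_map.shape)  # torch.Size([1, 64, 80, 120])

# Object footprints shrink by the same factor per spatial dimension.
large_car = (160, 220)  # size in the original image (Fig. 1b)
small_car = (30, 40)    # size in the original image (Fig. 1c)
print(tuple(s // 10 for s in large_car))  # (16, 22): still recognizable
print(tuple(s // 10 for s in small_car))  # (3, 4): only 12 positions remain

The printed shapes show that the whole image is mapped to an 80 by 120 feature map, so the small car occupies only about 12 spatial positions, which illustrates why its details are hard to recover from the coarse representation.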