1. Introduction
Because of the powerful representation ability of deep convolutional neural networks [26], it has deeply boosted the performance of the computer vision tasks including image recognition [45], [21], object detection [16], [32], semantic segmentation [33], [63], [3], etc. They all require plentiful images and accurate annotations to train high-performance models. Compared with image recognition, semantic segmentation is more complex and aims at classifying each pixel in an image. Therefore, collecting the annotations for segmentation is an extremely expensive and laborious process (e.g., 90 minutes per image for Cityscapes [7]).