1. Introduction
Semantic segmentation aims to assign a semantic class label to each pixel of an image. Current achievements come at the price of large amounts of dense pixel-level annotations obtained through expensive human labor [4], [23], [27]. An alternative is to resort to simulated data, such as computer-generated scenes [31], [32], so that an unlimited amount of labels becomes available. However, models trained on simulated images do not generalize well to realistic domains. The reason lies in the different data distributions of the two domains, a problem typically known as domain shift [37]. To address this issue, domain adaptation approaches [35], [41], [14], [46], [17], [16], [13], [48] have been proposed to bridge the gap between the source and target domains.

A majority of recent methods [26], [24], [40], [43], [42] aim to align the feature distributions of the two domains. Works along this line build on the theoretical insight of [1] that minimizing the divergence between domains lowers the upper bound of the error on the target domain. Among this cohort of domain adaptation methods, a common and pivotal step is to minimize some distance metric between the source and target feature distributions [24], [40] (see the first sketch below for one concrete choice). Another popular choice, which borrows the idea of adversarial learning [10], is to minimize the accuracy of domain prediction: through a minimax game between two adversarial networks, the generator is trained to produce features that confuse the discriminator, while the discriminator is required to correctly classify which domain the features come from (see the second sketch below).
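As a concrete illustration of minimizing a distance metric between feature distributions, the following is a minimal sketch using a linear-kernel Maximum Mean Discrepancy (MMD) as one possible choice; the specific metrics used in [24], [40] may differ, and the shapes and data here are illustrative assumptions, not the papers' implementations.

```python
import torch

def linear_mmd(f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
    """Squared distance between the mean source and target feature vectors
    (linear-kernel MMD), one example of a distribution-alignment loss."""
    delta = f_s.mean(dim=0) - f_t.mean(dim=0)
    return (delta * delta).sum()

# Toy stand-ins for source and target features produced by a shared encoder.
f_s = torch.randn(32, 128, requires_grad=True)
f_t = torch.randn(32, 128, requires_grad=True)

# In practice this term would be added to the segmentation loss so that
# gradients pull the two feature distributions together.
loss = linear_mmd(f_s, f_t)
loss.backward()
```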
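The minimax game described above can be sketched as follows. This is a hedged, toy example with assumed fully connected networks, optimizer settings, and random stand-in data, not any cited method's actual architecture: the discriminator learns to tell source features from target features, while the feature extractor is updated to fool it.

```python
import torch
import torch.nn as nn

# Toy feature extractor (generator) and domain discriminator; sizes are
# illustrative assumptions.
feature_extractor = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128))
discriminator = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

opt_f = torch.optim.Adam(feature_extractor.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

source = torch.randn(32, 64)  # stand-in for source-domain inputs
target = torch.randn(32, 64)  # stand-in for target-domain inputs

for step in range(100):
    f_s = feature_extractor(source)
    f_t = feature_extractor(target)

    # Discriminator step: correctly classify which domain each feature
    # comes from (source -> 1, target -> 0). Features are detached so this
    # step only updates the discriminator.
    d_loss = (bce(discriminator(f_s.detach()), torch.ones(32, 1))
              + bce(discriminator(f_t.detach()), torch.zeros(32, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: produce target features that confuse the
    # discriminator, i.e. are classified as source (inverted label).
    g_loss = bce(discriminator(f_t), torch.ones(32, 1))
    opt_f.zero_grad()
    g_loss.backward()
    opt_f.step()
```

Alternating the two updates in this way realizes the minimax objective: as the discriminator improves, the feature extractor is pushed to produce domain-invariant features.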