1. Introduction
Image-to-image translation refers to image generation conditioned on certain inputs from other domains [25], [29], [32], and has delivered impressive advances with the emergence of Generative Adversarial Networks (GANs) [5] in recent years. As a typically ill-posed problem, image translation naturally admits diverse solutions, since one conditional input can correspond to multiple image instances. Faithful control of the generation style not only enables diverse generation under a given condition but also allows flexible user control of the desired output. However, yielding high-fidelity images with controllable styles remains a grand challenge.
Learning domain-invariant features for building correspondences across domains: we exploit contrastive learning between the conditional input and the ground truth, pulling features at the same spatial position closer while pushing features at different positions apart. With the learned condition encoder and image encoder, explicit feature correspondences can be established between the conditional input and the exemplar image, as sketched below.
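To make the position-wise contrastive objective concrete, the following is a minimal PyTorch sketch of an InfoNCE-style loss over spatially aligned feature maps. The function name `position_wise_infonce`, the temperature `tau`, and the assumption that both encoders emit (B, C, H, W) feature maps of the same resolution are our illustrative choices, not details fixed by the paper.

```python
import torch
import torch.nn.functional as F

def position_wise_infonce(cond_feat, img_feat, tau=0.07):
    """Contrastive loss between condition and image feature maps.

    cond_feat, img_feat: (B, C, H, W) tensors from the condition
    encoder and image encoder. Features at the same spatial position
    form positive pairs; features at all other positions serve as
    negatives (a hypothetical instantiation of the described scheme).
    """
    B, C, H, W = cond_feat.shape
    # Flatten the spatial grid to (B, H*W, C) and L2-normalize channels.
    q = F.normalize(cond_feat.flatten(2).transpose(1, 2), dim=-1)
    k = F.normalize(img_feat.flatten(2).transpose(1, 2), dim=-1)
    # Similarities between every pair of positions: (B, H*W, H*W).
    logits = torch.bmm(q, k.transpose(1, 2)) / tau
    # The positive for position i in one map is position i in the other.
    target = torch.arange(H * W, device=logits.device).expand(B, -1)
    return F.cross_entropy(logits.reshape(B * H * W, H * W),
                           target.reshape(-1))
```

Minimizing this loss encourages the two encoders to map corresponding positions to nearby points in a shared feature space, which is what makes the subsequent feature matching between condition and exemplar meaningful.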