1. Introduction
The generative adversarial network (GAN) has been a workhorse of deep generative modeling since its introduction for image generation [16], and its popularity stems from its ability to generate sharp, realistic images from low-dimensional noise vectors. Despite this success, the original GAN architecture only supports random generation from Gaussian noise; an important family of variants instead controls generation with pre-defined auxiliary information (e.g., class labels or text), constituting the conditional GAN (cGAN). Taking advantage of such auxiliary information, cGANs have proven capable of producing realistic images conditioned on extra semantic cues [42], [32], [33]. Consequently, the past few years have witnessed extensive applications of cGANs, including class-conditional generation [31], [37], style transfer [55], and text-to-image synthesis [42], [51], to name but a few.
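To make the conditioning mechanism concrete, the sketch below shows one common way a cGAN generator consumes a class label: the label is embedded and concatenated with the noise vector before being mapped to an image. This is a minimal illustrative example in PyTorch; the architecture, layer widths, and dimensions are our own assumptions for exposition, not the design of any cited work.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Minimal cGAN generator sketch: G(z, y) conditions on a class label y."""

    def __init__(self, noise_dim=100, num_classes=10, embed_dim=50, img_dim=28 * 28):
        super().__init__()
        # Learn a dense embedding for each class label (illustrative sizes).
        self.label_embed = nn.Embedding(num_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(noise_dim + embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, img_dim),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, z, labels):
        # Condition generation by concatenating the label embedding with the noise.
        cond = torch.cat([z, self.label_embed(labels)], dim=1)
        return self.net(cond)

# Usage: draw Gaussian noise and request eight samples of class 3.
g = ConditionalGenerator()
z = torch.randn(8, 100)
labels = torch.full((8,), 3, dtype=torch.long)
fake_images = g(z, labels)  # shape: (8, 784)
```

Concatenating a label embedding with the noise vector is only one of several conditioning schemes in the literature; class-conditional batch normalization and projection discriminators are common alternatives in large-scale cGANs.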