1. Introduction
[Figure] We present a method that maps diffusion features to the latent space of a pretrained GAN, enabling diverse tasks in multimodal face image generation and style transfer. Our method can be applied to both 2D and 3D-aware face image generation.

In recent years, multimodal image generation has achieved remarkable success, driven by advances in Generative Adversarial Networks (GANs) [15] and diffusion models (DMs) [11], [18], [48]. Facial image processing has become a popular application across a variety of tasks, including face image generation [21], [39], face editing [6], [12], [30], [36], [37], [46], and style transfer [7], [64]. Many of these tasks build on the pretrained StyleGAN [21], [22], which can generate realistic facial images and edit facial attributes by manipulating its latent space through GAN inversion [39], [42], [58]. In these tasks, conditioning on multiple modalities is becoming a popular approach, as it improves the user's controllability in generating realistic face images. However, existing GAN inversion methods [51], [58] align poorly with the inputs because they neglect the correlation between the multimodal inputs. They struggle to map the different modalities into the latent space of the pretrained GAN, whether by mixing latent codes or by optimizing the latent code inverted from a given image according to the input text.
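To make the second baseline concrete, the sketch below illustrates text-driven optimization of an inverted latent code, in the spirit of CLIP-guided optimization methods. It is a minimal sketch, not the paper's pipeline: `generator`, `clip_image_encoder`, and `clip_text_embedding` are assumed placeholders for pretrained components, and the loss weights and step counts are illustrative.

```python
# Minimal sketch of text-guided latent-code optimization (assumed components).
import torch
import torch.nn.functional as F

def optimize_latent(w_init: torch.Tensor,              # latent code inverted from a source image
                    generator,                          # pretrained GAN generator: w -> image
                    clip_image_encoder,                 # pretrained image encoder: image -> embedding
                    clip_text_embedding: torch.Tensor,  # embedding of the input text prompt
                    steps: int = 200,
                    lr: float = 0.01,
                    reg_weight: float = 0.5) -> torch.Tensor:
    """Optimize a GAN latent code so the generated image matches a text prompt."""
    w = w_init.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)
    txt_emb = F.normalize(clip_text_embedding, dim=-1)
    for _ in range(steps):
        image = generator(w)                                      # synthesize from the current latent
        img_emb = F.normalize(clip_image_encoder(image), dim=-1)
        clip_loss = 1.0 - (img_emb * txt_emb).sum(dim=-1).mean()  # cosine distance to the text
        latent_reg = F.mse_loss(w, w_init)                        # stay close to the inverted code
        loss = clip_loss + reg_weight * latent_reg
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return w.detach()
```

Because each modality is handled by a separate loss or latent operation of this kind, the correlation between the modalities is never modeled jointly, which is the limitation the paragraph above points out.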