
DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model



Abstract:

Recent 3D generative models have achieved remarkable performance in synthesizing high-resolution photorealistic images with view consistency and detailed 3D shapes, but training them for diverse domains is challenging since it requires massive training images and their camera distribution information. Text-guided domain adaptation methods have shown impressive performance in converting a 2D generative model trained on one domain into models for other domains with different styles by leveraging CLIP (Contrastive Language-Image Pre-training), rather than collecting massive datasets for those domains. However, one drawback of these methods is that the sample diversity of the original generative model is not well preserved in the domain-adapted generative models due to the deterministic nature of the CLIP text encoder. Text-guided domain adaptation is even more challenging for 3D generative models, not only because of catastrophic diversity loss, but also because of inferior text-image correspondence and poor image quality. Here we propose DATID-3D, a domain adaptation method tailored for 3D generative models that uses text-to-image diffusion models capable of synthesizing diverse images per text prompt, without collecting additional images or camera information for the target domain. Unlike 3D extensions of prior text-guided domain adaptation methods, our novel pipeline was able to fine-tune the state-of-the-art 3D generator of the source domain to synthesize high-resolution, multi-view-consistent images in text-guided target domains without additional data, outperforming existing text-guided domain adaptation methods in diversity and text-image correspondence. Furthermore, we propose and demonstrate diverse 3D image manipulations, such as one-shot instance-selected adaptation and single-view manipulated 3D reconstruction, to fully enjoy the diversity in text.
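For context on the CLIP-based approach that the abstract contrasts against, the following is a minimal, illustrative sketch (not code from this paper) of the CLIP directional loss used by 2D text-guided domain adaptation methods such as StyleGAN-NADA. Because the text encoder is deterministic, one prompt pair yields a single fixed target direction, so all adapted samples are pulled the same way; this is one intuition behind the diversity loss described above. The function name and setup below are my own.

import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

def clip_directional_loss(img_src, img_tgt, text_src, text_tgt):
    """Push the image edit direction (source -> adapted) toward the text direction.
    img_src / img_tgt are image batches already run through clip_preprocess."""
    with torch.no_grad():
        t_src = clip_model.encode_text(clip.tokenize([text_src]).to(device))
        t_tgt = clip_model.encode_text(clip.tokenize([text_tgt]).to(device))
    text_dir = F.normalize(t_tgt - t_src, dim=-1)   # one fixed direction per prompt pair
    img_dir = F.normalize(clip_model.encode_image(img_tgt)
                          - clip_model.encode_image(img_src), dim=-1)
    return (1.0 - F.cosine_similarity(img_dir, text_dir)).mean()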
Date of Conference: 17-24 June 2023
Date Added to IEEE Xplore: 22 August 2023
Conference Location: Vancouver, BC, Canada

1. Introduction

Recently, 3D generative models [5], [6], [13], [18], [19], [22], [31], [40]–[42], [59], [60], [65], [69], [74], [75] have been developed to extend 2D generative models for multi-view consistent and explicitly pose-controlled image synthesis. In particular, some of them [5], [18], [74] combined 2D CNN generators such as StyleGAN2 [28] with a 3D inductive bias from neural rendering [38], enabling efficient synthesis of high-resolution photorealistic images with remarkable view consistency and detailed 3D shapes. These 3D generative models can be trained with single-view images and can then sample an unlimited number of 3D images in real time, whereas 3D scene representations as neural implicit fields using NeRF [38] and its variants [3], [4], [8], [10], [14], [17], [20], [32]–[34], [36], [45], [47], [50], [53], [54], [64], [66], [70]–[73] require multi-view images and per-scene training.
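As a concrete illustration of the pose-controlled synthesis described above, here is a minimal sketch assuming an EG3D-style generator interface G(z, camera_params) -> {'image': ...}. The camera layout used here (flattened 4x4 cam2world pose plus flattened 3x3 intrinsics, 25 values in total), the yaw-only camera helper, and the stub generator are assumptions for illustration, not the released code of any specific model.

import math
import torch

def yaw_camera(yaw, radius=2.7, fov_deg=18.8):
    """Assumed 25-dim camera label: flattened 4x4 cam2world pose + flattened 3x3 intrinsics
    (simplified: the camera orbits the origin at a fixed radius, varying only in yaw)."""
    cam2world = torch.eye(4)
    cam2world[0, 0], cam2world[0, 2] = math.cos(yaw), math.sin(yaw)
    cam2world[2, 0], cam2world[2, 2] = -math.sin(yaw), math.cos(yaw)
    cam2world[:3, 3] = cam2world[:3, :3] @ torch.tensor([0.0, 0.0, radius])  # camera position facing the origin
    focal = 0.5 / math.tan(0.5 * math.radians(fov_deg))                      # focal length in normalized image units
    intrinsics = torch.tensor([[focal, 0.0, 0.5], [0.0, focal, 0.5], [0.0, 0.0, 1.0]])
    return torch.cat([cam2world.reshape(-1), intrinsics.reshape(-1)])[None]  # shape (1, 25)

class StubGenerator(torch.nn.Module):
    """Stand-in so the sketch runs; swap in a pretrained 3D-aware generator (e.g., EG3D)."""
    def forward(self, z, camera_params):
        return {"image": torch.zeros(z.shape[0], 3, 512, 512)}

G = StubGenerator()
z = torch.randn(1, 512)                      # one latent code = one 3D scene/identity
views = [G(z, yaw_camera(yaw))["image"]      # the same scene rendered from three viewpoints
         for yaw in (-0.4, 0.0, 0.4)]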

Our DATID-3D succeeded in domain adaptation of 3D-aware generative models without additional data for the target domain, while preserving the diversity inherent in the text prompt and enabling high-quality pose-controlled image synthesis with excellent text-image correspondence. In contrast, StyleGAN-NADA*, a 3D extension of the state-of-the-art StyleGAN-NADA for 2D generative models [16], yielded images that are alike in style and show poor text-image correspondence. See the supplementary videos at gwang-kim.github.io/datid_3d. A simplified code sketch of the underlying idea follows below.
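To make the comparison concrete, this is a heavily simplified sketch of the kind of pipeline DATID-3D describes: render pose-labeled images from the source-domain 3D generator, translate them into the text-specified target domain with a text-to-image diffusion model, and then fine-tune the generator on the resulting diverse dataset. The diffusers img2img API is real, but the generator interface, the camera-sampling helper, and the prompt are assumptions, and the paper's filtering and adversarial fine-tuning steps are omitted here.

import torch
from diffusers import StableDiffusionImg2ImgPipeline
from torchvision.transforms.functional import to_pil_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def build_target_dataset(G, sample_camera, prompt, n_samples=1000, strength=0.7):
    """Translate source-domain renders into the target domain, keeping their camera labels."""
    dataset = []
    for _ in range(n_samples):
        z = torch.randn(1, 512, device="cuda")
        cam = sample_camera()                                  # camera label used for rendering
        src = G(z, cam)["image"][0]                            # source render, assumed in [-1, 1]
        src_pil = to_pil_image((src.float().cpu().clamp(-1, 1) + 1) / 2)
        tgt = pipe(prompt=prompt, image=src_pil, strength=strength).images[0]
        dataset.append((tgt, cam.cpu()))                       # diverse target image + its known pose
    return dataset

# Usage (assuming G and sample_camera come from a pretrained 3D-aware generator):
# data = build_target_dataset(G, sample_camera, "a 3D render of a face of a Pixar character")
# ...then fine-tune G on `data` (DATID-3D does this adversarially, with additional filtering).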

