
Beyond Textual Constraints: Learning Novel Diffusion Conditions with Fewer Examples



Abstract:

In this paper, we delve into a novel aspect of learning novel diffusion conditions with datasets orders of magnitude smaller. The rationale behind our approach is the elimination of textual constraints during the few-shot learning process. To that end, we implement two optimization strategies. The first, prompt-free conditional learning, utilizes a prompt-free encoder derived from a pre-trained Stable Diffusion model. This strategy is designed to adapt new conditions to the diffusion process by minimizing the textual-visual correlation, thereby ensuring a more precise alignment between the generated content and the specified conditions. The second strategy entails condition-specific negative rectification, which addresses the inconsistencies typically brought about by classifier-free guidance in few-shot training contexts. Our extensive experiments across a variety of condition modalities demonstrate the effectiveness and efficiency of our framework, yielding results comparable to those obtained with datasets a thousand times larger. Our code is available at https://github.com/Yuyan9Yu/BeyondTextConstraint.
Date of Conference: 16-22 June 2024
Date Added to IEEE Xplore: 16 September 2024
Conference Location: Seattle, WA, USA


1. Introduction

Large-scale generative models [9], [18], [19], [33], notably the Stable Diffusion (SD) series [33], have significantly advanced image synthesis. These models produce high-fidelity images with visually striking content from concise textual prompts. However, while textual descriptions are versatile in directing the visual content of generated images, text prompts alone frequently lack the precision to convey intricate details such as spatial layouts, poses, shapes, and forms.

We explore a novel aspect of learning diffusion conditions, requiring roughly a thousand times fewer examples (100 vs. 100k) than existing methods such as ControlNet [44] and T2I-Adapter [26]. The "-100" suffix in our model names indicates training with just 100 text-image-condition pairs. With these limited samples, our method achieves both structural consistency and high-quality generation, delivering performance comparable to that of the fully trained competing models.
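
As a concrete picture of the first strategy named in the abstract, prompt-free conditional learning, the sketch below trains a ControlNet-style adapter on top of a frozen, pre-trained Stable Diffusion model while the text branch is held at a fixed empty-prompt embedding, so the new condition is learned without any textual constraint. This is a minimal sketch of the idea rather than the authors' implementation: the checkpoint name, the training loop, and the use of a null-text embedding as the "prompt-free" context are illustrative assumptions.

# Minimal sketch (not the released code): learn a new condition without text by
# training a ControlNet-style adapter against a frozen Stable Diffusion backbone
# whose text branch always receives the empty-prompt ("null") embedding.
import torch
from diffusers import StableDiffusionPipeline, ControlNetModel, DDPMScheduler

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet, vae, tokenizer, text_encoder = pipe.unet, pipe.vae, pipe.tokenizer, pipe.text_encoder
controlnet = ControlNetModel.from_unet(unet)              # the only trainable part
scheduler = DDPMScheduler.from_config(pipe.scheduler.config)
for frozen in (unet, vae, text_encoder):
    frozen.requires_grad_(False)

# Encode the empty string once; this fixed embedding replaces every text prompt.
null_ids = tokenizer("", padding="max_length",
                     max_length=tokenizer.model_max_length,
                     return_tensors="pt").input_ids
null_emb = text_encoder(null_ids)[0]                      # shape (1, 77, 768)

optimizer = torch.optim.AdamW(controlnet.parameters(), lr=1e-5)

def train_step(images, cond_images):
    """images: pixel tensors in [-1, 1]; cond_images: condition maps in [0, 1]."""
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy = scheduler.add_noise(latents, noise, t)
    context = null_emb.expand(latents.shape[0], -1, -1)   # no text prompt is used
    down_res, mid_res = controlnet(noisy, t, encoder_hidden_states=context,
                                   controlnet_cond=cond_images, return_dict=False)
    pred = unet(noisy, t, encoder_hidden_states=context,
                down_block_additional_residuals=down_res,
                mid_block_additional_residual=mid_res).sample
    loss = torch.nn.functional.mse_loss(pred, noise)      # standard epsilon objective
    loss.backward(); optimizer.step(); optimizer.zero_grad()
    return loss.item()

Removing the text pathway in this way forces the adapter to explain image structure through the condition input alone, which is one way to read the paper's goal of tighter alignment between generated content and the specified condition.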

Figure: Segmentation-conditioned text-to-image generation by ControlNet-100 with and without the text condition. Incorporating text constraints leads to structurally inconsistent regions (bounded by red boxes) when only limited training exemplars are available.
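
For the second strategy, condition-specific negative rectification, the relevant baseline is standard classifier-free guidance [13], in which the sampler extrapolates from an unconditional (negative) noise prediction toward the conditional one. The snippet below only reproduces that textbook combination; the paper's rectification rule itself is not spelled out in this excerpt, and the function name and default guidance scale are illustrative.

# Classifier-free guidance as in Ho and Salimans [13]. The paper argues that in
# few-shot training this text-driven extrapolation introduces structural
# inconsistencies, motivating a condition-specific negative term instead.
def cfg_noise(eps_uncond, eps_cond, guidance_scale=7.5):
    # eps_uncond: UNet prediction under the null/negative embedding
    # eps_cond:   UNet prediction under the conditional embedding
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)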

References
1. Paul Bao, Lei Zhang and Xiaolin Wu, "Canny edge detection enhancement by scale multiplication", IEEE TPAMI, vol. 27, no. 9, pp. 1485-1490, 2005.
2. Omer Bar-Tal, Lior Yariv, Yaron Lipman and Tali Dekel, "MultiDiffusion: Fusing diffusion paths for controlled image generation", 2023.
3. Andrew Brock, Jeff Donahue and Karen Simonyan, "Large scale GAN training for high fidelity natural image synthesis", arXiv preprint, 2018.
4. Tim Brooks, Aleksander Holynski and Alexei A. Efros, "InstructPix2Pix: Learning to follow image editing instructions", CVPR, pp. 18392-18402, 2023.
5. Holger Caesar, Jasper Uijlings and Vittorio Ferrari, "COCO-Stuff: Thing and stuff classes in context", CVPR, pp. 1209-1218, 2018.
6. Guillaume Couairon, Marlene Careil, Matthieu Cord, Stephane Lathuiliere and Jakob Verbeek, "Zero-shot spatial layout conditioning for text-to-image diffusion models", ICCV, pp. 2174-2183, 2023.
7. Prafulla Dhariwal and Alexander Nichol, "Diffusion models beat GANs on image synthesis", NeurIPS, vol. 34, pp. 8780-8794, 2021.
8. Ziyi Dong, Pengxu Wei and Liang Lin, "DreamArtist: Towards controllable one-shot text-to-image generation via positive-negative prompt-tuning", arXiv preprint, 2022.
9. Patrick Esser, Robin Rombach and Bjorn Ommer, "Taming transformers for high-resolution image synthesis", CVPR, pp. 12873-12883, 2021.
10. Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, et al., "Vector quantized diffusion model for text-to-image synthesis", CVPR, pp. 10696-10706, 2022.
11. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler and Sepp Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium", NeurIPS, vol. 30, 2017.
12. Jonathan Ho, Ajay Jain and Pieter Abbeel, "Denoising diffusion probabilistic models", NeurIPS, vol. 33, pp. 6840-6851, 2020.
13. Jonathan Ho and Tim Salimans, "Classifier-free diffusion guidance", arXiv preprint, 2022.
14. Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao and Jingren Zhou, "Composer: Creative and controllable image synthesis with composable conditions", arXiv preprint, 2023.
15. Yutao Jiang, Yang Zhou, Yuan Liang, Wenxi Liu, Jianbo Jiao, Yuhui Quan, et al., "Diffuse3D: Wide-angle 3D photography via bilateral diffusion", CVPR, pp. 8998-9008, 2023.
16. Xuan Ju, Ailing Zeng, Jianan Wang, Qiang Xu and Lei Zhang, "Human-Art: A versatile human-centric dataset bridging natural and artificial scenes", ICCV, 2023.
17. Xuan Ju, Ailing Zeng, Jianan Wang, Qiang Xu and Lei Zhang, "Human-Art: A versatile human-centric dataset bridging natural and artificial scenes", CVPR, pp. 618-629, 2023.
18. Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, et al., "Scaling up GANs for text-to-image synthesis", CVPR, pp. 10124-10134, 2023.
19. Tero Karras, Samuli Laine and Timo Aila, "A style-based generator architecture for generative adversarial networks", CVPR, pp. 4401-4410, 2019.
20. Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding", NAACL-HLT, pp. 4171-4186, 2019.
21. Sungnyun Kim, Junsoo Lee, Kibeom Hong, Daesik Kim and Namhyuk Ahn, "DiffBlender: Scalable and composable multimodal text-to-image diffusion models", arXiv preprint, 2023.
22. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, et al., "Microsoft COCO: Common objects in context", ECCV, Springer, pp. 740-755, 2014.
23. Haofeng Liu, Chenshu Xu, Yifei Yang, Lihua Zeng and Shengfeng He, "Drag Your Noise: Interactive point-based editing via diffusion semantic propagation", CVPR, 2024.
24. Ilya Loshchilov and Frank Hutter, "Decoupled weight decay regularization", ICLR, 2018.
25. Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch and Daniel Cohen-Or, "Null-text inversion for editing real images using guided diffusion models", CVPR, pp. 6038-6047, 2023.
26. Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, et al., "T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models", arXiv preprint, 2023.
27. Alexander Quinn Nichol and Prafulla Dhariwal, "Improved denoising diffusion probabilistic models", ICML, pp. 8162-8171, 2021.
28. Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, et al., "GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models", ICML, pp. 16784-16804, 2022.
29. Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka and Christian Theobalt, "Drag Your GAN: Interactive point-based manipulation on the generative image manifold", ACM SIGGRAPH, pp. 1-11, 2023.
30. Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, et al., "UniControl: A unified diffusion model for controllable visual generation in the wild", NeurIPS, 2023.