
Beyond Textual Constraints: Learning Novel Diffusion Conditions with Fewer Examples



Abstract:

In this paper, we delve into a novel aspect of learning novel diffusion conditions with datasets orders of magnitude smaller. The rationale behind our approach is the elimination of textual constraints during the few-shot learning process. To that end, we implement two optimization strategies. The first, prompt-free conditional learning, utilizes a prompt-free encoder derived from a pre-trained Stable Diffusion model. This strategy is designed to adapt new conditions to the diffusion process by minimizing the textual-visual correlation, thereby ensuring a more precise alignment between the generated content and the specified conditions. The second strategy entails condition-specific negative rectification, which addresses the inconsistencies typically brought about by classifier-free guidance in few-shot training contexts. Our extensive experiments across a variety of condition modalities demonstrate the effectiveness and efficiency of our framework, yielding results comparable to those obtained with datasets a thousand times larger. Our code is available at https://github.com/Yuyan9Yu/BeyondTextConstraint.
Date of Conference: 16-22 June 2024
Date Added to IEEE Xplore: 16 September 2024
Conference Location: Seattle, WA, USA


1. Introduction

Large-scale generative models [9], [18], [19], [33], notably the Stable Diffusion (SD) series [33], have significantly advanced image synthesis. These models produce high-fidelity images with visually striking content from concise textual prompts. However, while textual descriptions are versatile in directing the visual content of generated images, text prompts alone frequently lack the precision to convey intricate details such as spatial layouts, poses, shapes, and forms.

We explore a novel aspect of learning diffusion conditions, requiring roughly a thousand times fewer examples (100 vs. 100k) than existing methods such as ControlNet [44] and T2I-Adapter [26]. The "-100" suffix in our model names indicates training with just 100 text-image-condition pairs. With these limited samples, our method achieves both structural consistency and high-quality generation, delivering performance comparable to that of the fully trained competing models.
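
As a concrete picture of the first strategy named in the abstract, prompt-free conditional learning, the sketch below trains a ControlNet-style adapter on top of a frozen, pre-trained Stable Diffusion model while the text branch is held at a fixed empty-prompt embedding, so the new condition is learned without any textual constraint. This is a minimal sketch of the idea rather than the authors' implementation: the checkpoint name, the training loop, and the use of a null-text embedding as the "prompt-free" context are illustrative assumptions.

# Minimal sketch (not the released code): learn a new condition without text by
# training a ControlNet-style adapter against a frozen Stable Diffusion backbone
# whose text branch always receives the empty-prompt ("null") embedding.
import torch
from diffusers import StableDiffusionPipeline, ControlNetModel, DDPMScheduler

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet, vae, tokenizer, text_encoder = pipe.unet, pipe.vae, pipe.tokenizer, pipe.text_encoder
controlnet = ControlNetModel.from_unet(unet)              # the only trainable part
scheduler = DDPMScheduler.from_config(pipe.scheduler.config)
for frozen in (unet, vae, text_encoder):
    frozen.requires_grad_(False)

# Encode the empty string once; this fixed embedding replaces every text prompt.
null_ids = tokenizer("", padding="max_length",
                     max_length=tokenizer.model_max_length,
                     return_tensors="pt").input_ids
null_emb = text_encoder(null_ids)[0]                      # shape (1, 77, 768)

optimizer = torch.optim.AdamW(controlnet.parameters(), lr=1e-5)

def train_step(images, cond_images):
    """images: pixel tensors in [-1, 1]; cond_images: condition maps in [0, 1]."""
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy = scheduler.add_noise(latents, noise, t)
    context = null_emb.expand(latents.shape[0], -1, -1)   # no text prompt is used
    down_res, mid_res = controlnet(noisy, t, encoder_hidden_states=context,
                                   controlnet_cond=cond_images, return_dict=False)
    pred = unet(noisy, t, encoder_hidden_states=context,
                down_block_additional_residuals=down_res,
                mid_block_additional_residual=mid_res).sample
    loss = torch.nn.functional.mse_loss(pred, noise)      # standard epsilon objective
    loss.backward(); optimizer.step(); optimizer.zero_grad()
    return loss.item()

Removing the text pathway in this way forces the adapter to explain image structure through the condition input alone, which is one way to read the paper's goal of tighter alignment between generated content and the specified condition.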

Figure: Segmentation-conditioned text-to-image generation by ControlNet-100 with and without the text condition. Incorporating text constraints leads to structurally inconsistent regions (bounded by red boxes) when only limited training exemplars are available.
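
For the second strategy, condition-specific negative rectification, the relevant baseline is standard classifier-free guidance [13], in which the sampler extrapolates from an unconditional (negative) noise prediction toward the conditional one. The snippet below only reproduces that textbook combination; the paper's rectification rule itself is not spelled out in this excerpt, and the function name and default guidance scale are illustrative.

# Classifier-free guidance as in Ho and Salimans [13]. The paper argues that in
# few-shot training this text-driven extrapolation introduces structural
# inconsistencies, motivating a condition-specific negative term instead.
def cfg_noise(eps_uncond, eps_cond, guidance_scale=7.5):
    # eps_uncond: UNet prediction under the null/negative embedding
    # eps_cond:   UNet prediction under the conditional embedding
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)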

References
1. Paul Bao, Lei Zhang and Xiaolin Wu, "Canny edge detection enhancement by scale multiplication", IEEE TPAMI, vol. 27, no. 9, pp. 1485-1490, 2005.
2. Omer Bar-Tal, Lior Yariv, Yaron Lipman and Tali Dekel, "MultiDiffusion: Fusing diffusion paths for controlled image generation", 2023.
3. Andrew Brock, Jeff Donahue and Karen Simonyan, "Large scale GAN training for high fidelity natural image synthesis", arXiv preprint, 2018.
4. Tim Brooks, Aleksander Holynski and Alexei A. Efros, "InstructPix2Pix: Learning to follow image editing instructions", CVPR, pp. 18392-18402, 2023.
5. Holger Caesar, Jasper Uijlings and Vittorio Ferrari, "COCO-Stuff: Thing and stuff classes in context", CVPR, pp. 1209-1218, 2018.
6. Guillaume Couairon, Marlene Careil, Matthieu Cord, Stephane Lathuiliere and Jakob Verbeek, "Zero-shot spatial layout conditioning for text-to-image diffusion models", ICCV, pp. 2174-2183, 2023.
7. Prafulla Dhariwal and Alexander Nichol, "Diffusion models beat GANs on image synthesis", NeurIPS, vol. 34, pp. 8780-8794, 2021.
8. Ziyi Dong, Pengxu Wei and Liang Lin, "DreamArtist: Towards controllable one-shot text-to-image generation via positive-negative prompt-tuning", arXiv preprint, 2022.
9. Patrick Esser, Robin Rombach and Bjorn Ommer, "Taming transformers for high-resolution image synthesis", CVPR, pp. 12873-12883, 2021.
10. Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, et al., "Vector quantized diffusion model for text-to-image synthesis", CVPR, pp. 10696-10706, 2022.
11. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler and Sepp Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium", NeurIPS, vol. 30, 2017.
12. Jonathan Ho, Ajay Jain and Pieter Abbeel, "Denoising diffusion probabilistic models", NeurIPS, vol. 33, pp. 6840-6851, 2020.
13. Jonathan Ho and Tim Salimans, "Classifier-free diffusion guidance", arXiv preprint, 2022.
14. Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao and Jingren Zhou, "Composer: Creative and controllable image synthesis with composable conditions", arXiv preprint, 2023.
15. Yutao Jiang, Yang Zhou, Yuan Liang, Wenxi Liu, Jianbo Jiao, Yuhui Quan, et al., "Diffuse3D: Wide-angle 3D photography via bilateral diffusion", CVPR, pp. 8998-9008, 2023.
16. Xuan Ju, Ailing Zeng, Jianan Wang, Qiang Xu and Lei Zhang, "Human-Art: A versatile human-centric dataset bridging natural and artificial scenes", ICCV, 2023.
17. Xuan Ju, Ailing Zeng, Jianan Wang, Qiang Xu and Lei Zhang, "Human-Art: A versatile human-centric dataset bridging natural and artificial scenes", CVPR, pp. 618-629, 2023.
18. Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, et al., "Scaling up GANs for text-to-image synthesis", CVPR, pp. 10124-10134, 2023.
19. Tero Karras, Samuli Laine and Timo Aila, "A style-based generator architecture for generative adversarial networks", CVPR, pp. 4401-4410, 2019.
20. Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding", NAACL-HLT, pp. 4171-4186, 2019.
21. Sungnyun Kim, Junsoo Lee, Kibeom Hong, Daesik Kim and Namhyuk Ahn, "DiffBlender: Scalable and composable multimodal text-to-image diffusion models", arXiv preprint, 2023.
22. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, et al., "Microsoft COCO: Common objects in context", ECCV, Springer, pp. 740-755, 2014.
23. Haofeng Liu, Chenshu Xu, Yifei Yang, Lihua Zeng and Shengfeng He, "Drag Your Noise: Interactive point-based editing via diffusion semantic propagation", CVPR, 2024.
24. Ilya Loshchilov and Frank Hutter, "Decoupled weight decay regularization", ICLR, 2018.
25. Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch and Daniel Cohen-Or, "Null-text inversion for editing real images using guided diffusion models", CVPR, pp. 6038-6047, 2023.
26. Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, et al., "T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models", arXiv preprint, 2023.
27. Alexander Quinn Nichol and Prafulla Dhariwal, "Improved denoising diffusion probabilistic models", ICML, pp. 8162-8171, 2021.
28. Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, et al., "GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models", ICML, pp. 16784-16804, 2022.
29. Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka and Christian Theobalt, "Drag Your GAN: Interactive point-based manipulation on the generative image manifold", ACM SIGGRAPH, pp. 1-11, 2023.
30. Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, et al., "UniControl: A unified diffusion model for controllable visual generation in the wild", NeurIPS, 2023.