1. Introduction
Large-scale generative models [9], [18], [19], [33], notably the Stable Diffusion (SD) series [33], have significantly advanced image synthesis. These models produce high-fidelity images with visually striking content from concise textual prompts. However, although text prompts are versatile in directing the visual elements of generated images, they often lack the precision to convey intricate details such as spatial layouts, poses, shapes, and forms.
We explore a novel aspect of learning diffusion conditions that requires roughly a thousand times fewer examples (100 vs. 100k) than existing methods such as ControlNet [44] and T2I-Adapter [26]. The “-100” suffix in our model names indicates training with just 100 text-image-condition pairs. With these limited samples, our method achieves both structural consistency and high-quality generation, delivering performance comparable to the fully trained models of our competitors.
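To make the data regime concrete, the sketch below shows one way a 100-pair training subset could be drawn from a larger text-image-condition corpus before fine-tuning the conditional branch. The dataset class and sampling helper are illustrative assumptions, not the released training code.

```python
import random
from torch.utils.data import Dataset, Subset, DataLoader

# Hypothetical corpus of (image, text, condition) triplets, e.g. an image,
# its caption, and a segmentation map. Any dataset yielding such triplets works.
class TextImageConditionDataset(Dataset):
    def __init__(self, records):
        self.records = records  # list of dicts: {"image", "text", "condition"}

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        r = self.records[idx]
        return r["image"], r["text"], r["condition"]

def make_100_pair_subset(full_dataset, num_pairs=100, seed=0):
    """Sample the small training set implied by the '-100' suffix (100 triplets)."""
    rng = random.Random(seed)
    indices = rng.sample(range(len(full_dataset)), num_pairs)
    return Subset(full_dataset, indices)

# Usage sketch: only the condition branch (ControlNet- or adapter-style) is then
# fine-tuned on these 100 pairs, while the base Stable Diffusion weights stay frozen.
# subset = make_100_pair_subset(full_dataset)
# loader = DataLoader(subset, batch_size=4, shuffle=True)
```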
Figure: Segmentation-conditioned text-to-image generation with ControlNet-100, with and without the text condition. Incorporating text constraints leads to structurally inconsistent regions (bounded by red boxes) when only limited training exemplars are available.