1. Introduction
Text-to-image (T2I) diffusion models [4], [42] have achieved tremendous success in high-quality image synthesis, yet a text description alone is far from enough for users to convey their preferences and intents for content creation. Recent advances such as ControlNet [59] enable spatial control of pretrained T2I diffusion models, allowing users to specify the desired image composition by providing a guidance image (e.g., depth map, human pose) alongside the text description. Despite their superior generation results, these methods [6], [30], [33], [55], [59], [62] require training an additional module specific to each spatial condition type. Considering the large space of control signals, constantly evolving model architectures, and a growing number of customized model checkpoints (e.g., Stable Diffusion [44] fine-tuned for Disney characters or user-specified objects [24], [46]), this repetitive training on every new model and condition type is costly and uneconomical.
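To make this per-condition training cost concrete, the sketch below (not part of the original paper) illustrates a typical ControlNet [59] workflow with the Hugging Face diffusers library: every control signal requires loading (and, before that, training) a separate condition-specific module, while the Stable Diffusion backbone is shared. The checkpoint names and pipeline classes are assumptions about a common public setup, not artifacts of this work.

```python
# Minimal sketch (assumed diffusers API and public checkpoint names) showing that
# ControlNet-style spatial control needs one trained module per condition type.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Each spatial condition (depth, human pose, ...) has its own trained ControlNet.
controlnet_depth = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
controlnet_pose = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)

# The pretrained T2I backbone is shared, but the pipeline must be paired with the
# module that matches the guidance image supplied by the user.
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet_depth,  # swap in controlnet_pose for pose guidance
    torch_dtype=torch.float16,
).to("cuda")

# image = pipe("a photo of a cat", image=depth_map).images[0]
```

A customized or newly released backbone would need this set of condition-specific modules retrained or re-matched, which is precisely the cost that a training-free approach avoids.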
Figure 1. Training-free conditional control of Stable Diffusion [44]. (a) FreeControl enables zero-shot control of pretrained text-to-image diffusion models given various input control conditions. (b) Compared to ControlNet [59], FreeControl achieves a good balance between spatial and image-text alignment, especially when facing a conflict between the guidance image and the text description. Additionally, FreeControl supports several condition types (e.g., 2D projections of point clouds and meshes in the bottom row) for which it is difficult to construct training pairs.