
FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition



Abstract:

Recent approaches such as ControlNet [59] offer users fine-grained spatial control over text-to-image (T2I) diffusion models. However, auxiliary modules have to be trained for each spatial condition type, model architecture, and checkpoint, putting them at odds with the diverse intents and preferences a human designer would like to convey to the AI models during the content creation process. In this work, we present FreeControl, a training-free approach for controllable T2I generation that supports multiple conditions, architectures, and checkpoints simultaneously. FreeControl enforces structure guidance to facilitate global alignment with a guidance image, and appearance guidance to collect visual details from images generated without control. Extensive qualitative and quantitative experiments demonstrate the superior performance of FreeControl across a variety of pre-trained T2I models. In particular, FreeControl enables convenient training-free control over many different architectures and checkpoints, handles challenging input conditions on which most existing training-free methods fail, and achieves competitive synthesis quality compared to training-based approaches. Project page: https://genforce.github.io/freecontrol/.
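The abstract describes two guidance signals applied at sampling time: a structure term that aligns the generation with the guidance image, and an appearance term that borrows visual detail from an uncontrolled generation. As a rough, hypothetical illustration of how such energy-based guidance could be folded into a denoising step (this is a sketch, not the authors' released implementation), the snippet below differentiates two placeholder energies with respect to the noisy latent and adds the gradient to the model's noise prediction; eps_model, structure_energy, and appearance_energy are assumed names, not APIs from the paper.

```python
import torch

def guided_denoise_step(x_t, t, eps_model, structure_energy, appearance_energy,
                        lambda_s=1.0, lambda_a=0.5):
    """One hypothetical guidance-augmented denoising step.

    eps_model(x_t, t)      -> predicted noise (stand-in for a T2I diffusion UNet)
    structure_energy(x)    -> scalar penalizing misalignment with the guidance image
    appearance_energy(x)   -> scalar pulling appearance toward an uncontrolled generation
    """
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)
    # Differentiate the combined energy w.r.t. the noisy latent and add the
    # gradient to the noise estimate, in the spirit of energy/classifier guidance.
    energy = lambda_s * structure_energy(x_t) + lambda_a * appearance_energy(x_t)
    grad = torch.autograd.grad(energy, x_t)[0]
    return eps + grad  # guided noise estimate handed to the sampler


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    eps_model = lambda x, t: torch.zeros_like(x)
    structure_energy = lambda x: (x - 1.0).pow(2).mean()
    appearance_energy = lambda x: x.pow(2).mean()
    x = torch.randn(1, 4, 8, 8)
    guided_eps = guided_denoise_step(x, t=torch.tensor([10]),
                                     eps_model=eps_model,
                                     structure_energy=structure_energy,
                                     appearance_energy=appearance_energy)
    print(guided_eps.shape)
```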
Date of Conference: 16-22 June 2024
Date Added to IEEE Xplore: 16 September 2024

Conference Location: Seattle, WA, USA


1. Introduction

Text-to-image (T2I) diffusion models [4], [42] have achieved tremendous success in high-quality image synthesis, yet a text description alone is far from enough for users to convey their preferences and intents for content creation. Recent advances such as ControlNet [59] enable spatial control of pretrained T2I diffusion models, allowing users to specify the desired image composition by providing a guidance image (e.g., depth map, human pose) alongside the text description. Despite their superior generation results, these methods [6], [30], [33], [55], [59], [62] require training an additional module specific to each spatial condition type. Considering the large space of control signals, constantly evolving model architectures, and a growing number of customized model checkpoints (e.g., Stable Diffusion [44] fine-tuned for Disney characters or user-specified objects [24], [46]), this repetitive training on every new model and condition type is costly and uneconomical.

Figure 1: Training-free conditional control of Stable Diffusion [44]. (a) FreeControl enables zero-shot control of pretrained text-to-image diffusion models given various input control conditions. (b) Compared to ControlNet [59], FreeControl achieves a good balance between spatial and image-text alignment, especially when facing a conflict between the guidance image and text description. Additionally, FreeControl supports several condition types (e.g., 2D projections of point clouds and meshes in the bottom row) where it is difficult to construct training pairs.

