1. Introduction
Diffusion text-to-image models [15] have showcased remarkable progress in generating images from textual cues [33], [34], [37]. These models offer a wide set of capabilities, ranging from image editing [2], [3], [9], [13], [29], [42] and personalized content creation [36] to inpainting [25]. However, images produced by these models might not always faithfully represent the semantic intent of the given text prompt [6], [39]. Notable semantic discrepancies in models like Stable Diffusion [34] and Imagen [37] include a) missing objects, where the model might overlook or entirely fail to produce certain objects; b) attribute binding, where the model might mistakenly link attributes to the wrong subjects [6]; and c) miscounting, where the model fails to produce the correct quantity of objects [22], [48]. Figure 2 illustrates these shortcomings in the popular diffusion models Stable Diffusion [34] and Imagen [37]. For example, the output might neglect certain subjects, as in the ‘a bear and an elephant’ prompt, where the bear is ignored, as depicted in Fig. 2(a). Additionally, the model might mix up attributes, such as swapping the colors in the ‘a purple crown and a yellow suitcase’ prompt, as seen in Fig. 2(b). Another behavior, often attributed to the imprecise language comprehension of the CLIP text encoder [28], [30], is the failure to produce the correct quantity of subjects, as in Fig. 2(c), where the model either produces an excessive number of cats (SD) or fails to include a cat (Imagen) for the ‘one dog and two cats’ prompt.
Our training-free method combines a contrastive objective with test-time optimization, significantly improving how models such as Imagen and Stable Diffusion generate images from text prompts containing multiple concepts or subjects, such as ‘a bear and a horse’.
Figure 2. Failure cases of Stable Diffusion [34] and Imagen [37]. Text-to-image diffusion models may not faithfully adhere to the subjects specified in the text prompt: a) missing objects (e.g., the bear), b) misaligned attributes (e.g., the color yellow bleeds into the crown), and c) inaccurate object count (e.g., only one cat is generated instead of two). Our method steers the diffusion process towards more faithful images in both SD and Imagen.