
CONFORM: Contrast is All You Need For High-Fidelity Text-to-Image Diffusion Models


Abstract:

Images produced by text-to-image diffusion models might not always faithfully represent the semantic intent of the provided text prompt, where the model might overlook or entirely fail to produce certain objects. Existing solutions often require custom-tailored functions for each of these problems, leading to sub-optimal results, especially for complex prompts. Our work introduces a novel perspective by tackling this challenge in a contrastive context. Our approach intuitively promotes the segregation of objects in attention maps while also maintaining that pairs of related attributes are kept close to each other. We conduct extensive experiments across a wide variety of scenarios, each involving unique combinations of objects, attributes, and scenes. These experiments effectively showcase the versatility, efficiency, and flexibility of our method in working with both latent and pixel-based diffusion models, including Stable Diffusion and Imagen. Moreover, we publicly share our source code to facilitate further research.
Date of Conference: 16-22 June 2024
Date Added to IEEE Xplore: 16 September 2024
Conference Location: Seattle, WA, USA

1. Introduction

Diffusion text-to-image models [15] have showcased remarkable progress in generating images from textual cues [33], [34], [37]. These models offer a wide set of capabilities, ranging from image editing [2], [3], [9], [13], [29], [42] and personalized content creation [36] to inpainting [25]. However, images produced by these models might not always faithfully represent the semantic intent of the given text prompt [6], [39]. Notable semantic discrepancies in models like Stable Diffusion [34] and Imagen [37] include a) missing objects, where the model might overlook or entirely fail to produce certain objects; b) attribute binding, where the model might mistakenly link attributes to the wrong subjects [6]; and c) miscounting, where the model fails to produce the correct quantity of objects [22], [48]. Figure 2 illustrates these shortcomings in the popular diffusion models Stable Diffusion [34] and Imagen [37]. For example, the output might neglect certain subjects, as in the ‘a bear and an elephant’ prompt, where the bear is ignored as depicted in Fig. 2(a). Additionally, the model might mix up attributes, such as mixing the colors in the ‘a purple crown and a yellow suitcase’ prompt as seen in Fig. 2(b). Another behavior, often attributed to the imprecise language comprehension of the CLIP text encoder [28], [30], is the failure to produce the correct quantity of subjects, as in Fig. 2(c), where the model either produces an excessive number of cats (SD) or fails to include a cat (Imagen) for the ‘one dog and two cats’ prompt.

Our training-free method combines a contrastive objective with test-time optimization, significantly improving how models such as Imagen and Stable Diffusion generate images with text prompts consisting of multiple concepts or subjects such as ‘a bear and a horse’.

Failure cases of Stable Diffusion [34] and Imagen [37]. Text-to-image diffusion models may not faithfully adhere to the subjects specified in the text prompt: a) missing objects (e.g., bear), b) misaligned attributes (e.g., the color yellow blends into the crown), and c) inaccurate object count (e.g., only one cat is generated instead of two). Our method steers the diffusion process towards more faithful images in both SD and Imagen.
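To make the contrastive, test-time idea described above concrete, the following is a minimal, hypothetical sketch of an InfoNCE-style objective over per-token cross-attention maps, minimized by a gradient update on the noisy latent during sampling. The function names (contrastive_attention_loss, refine_latent), the attn_fn hook, and the step size are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: a contrastive (InfoNCE-style) loss over cross-attention
# maps, plus one test-time gradient step on the noisy latent. All names here are
# hypothetical and not taken from the CONFORM source code.
import torch
import torch.nn.functional as F


def contrastive_attention_loss(attn_maps, positive_pairs, temperature=0.07):
    """InfoNCE-style loss over per-token cross-attention maps.

    attn_maps:      dict mapping a token index to its (H, W) attention map.
    positive_pairs: list of (i, j) token-index pairs that should attend to the
                    same region (e.g., an attribute and the object it modifies);
                    all remaining token maps act as negatives.
    """
    # Flatten and L2-normalize each map so that similarity is a cosine score.
    tokens = sorted(attn_maps.keys())
    feats = torch.stack([F.normalize(attn_maps[t].flatten(), dim=0) for t in tokens])
    sim = feats @ feats.T / temperature              # (N, N) pairwise similarities
    idx = {t: k for k, t in enumerate(tokens)}

    loss = 0.0
    for i, j in positive_pairs:
        a, p = idx[i], idx[j]
        logits = sim[a]                              # similarities of anchor to all tokens
        mask = torch.ones_like(logits, dtype=torch.bool)
        mask[a] = False                              # drop self-similarity
        # Cross-entropy pulls the positive pair together and pushes the
        # remaining (negative) token maps apart.
        target = torch.tensor(int(mask[:p].sum()))   # index of the positive after masking
        loss = loss + F.cross_entropy(logits[mask].unsqueeze(0), target.unsqueeze(0))
    return loss / max(len(positive_pairs), 1)


def refine_latent(latent, attn_fn, positive_pairs, step_size=20.0):
    """One test-time update: nudge the noisy latent so attention maps separate.

    attn_fn(latent) is assumed to run the denoiser and return the dict of
    cross-attention maps for the tokens of interest (a hypothetical hook).
    """
    latent = latent.detach().requires_grad_(True)
    loss = contrastive_attention_loss(attn_fn(latent), positive_pairs)
    (grad,) = torch.autograd.grad(loss, latent)
    return (latent - step_size * grad).detach()
```

In practice such attention maps would be gathered from the denoiser's cross-attention layers at selected sampling steps, with positive pairs linking each attribute token to its object (e.g., ‘yellow’ with ‘suitcase’) and the maps of different objects serving as negatives.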


