1. Introduction
Diffusion text-to-image models [15] have showcased remarkable progress in generating images from textual cues [33], [34], [37]. These models offer a wide set of capabilities, ranging from image editing [2], [3], [9], [13], [29], [42] and personalized content creation [36] to inpainting [25]. However, images produced by these models might not always faithfully represent the semantic intent of the given text prompt [6], [39]. Notable semantic discrepancies in models like Stable Diffusion [34] and Imagen [37] include a) missing objects, where the model might overlook or entirely fail to produce certain objects; b) attribute binding, where the model might mistakenly link attributes to the wrong subjects [6]; and c) miscounting, where the model fails to produce the correct quantity of objects [22], [48]. Figure 2 illustrates these shortcomings in the popular diffusion models Stable Diffusion [34] and Imagen [37]. For example, the output might neglect certain subjects, as in the ‘a bear and an elephant’ prompt, where the bear is ignored, as depicted in Fig. 2(a). Additionally, the model might mix up attributes, such as swapping the colors in the ‘a purple crown and a yellow suitcase’ prompt, as seen in Fig. 2(b). Another behavior, often attributed to the imprecise language comprehension of the CLIP text encoder [28], [30], is the failure to produce the correct quantity of subjects, as in Fig. 2(c), where the model either produces an excessive number of cats (SD) or fails to include a cat (Imagen) for the ‘one dog and two cats’ prompt.
Our training-free method combines a contrastive objective with test-time optimization, significantly improving how models such as Imagen and Stable Diffusion generate images from text prompts containing multiple concepts or subjects, such as ‘a bear and a horse’.
Figure 2. Failure cases of Stable Diffusion [34] and Imagen [37]. Text-to-image diffusion models may not faithfully adhere to the subjects specified in the text prompt: a) missing objects (e.g., the bear), b) misaligned attributes (e.g., the color yellow bleeds into the crown), and c) inaccurate object count (e.g., only one cat is generated instead of two). Our method steers the diffusion process towards more faithful images in both SD and Imagen.