
A-STAR: Test-time Attention Segregation and Retention for Text-to-image Synthesis



Abstract:

While recent developments in text-to-image generative models have led to a suite of high-performing methods capable of producing creative imagery from free-form text, there are several limitations. By analyzing the cross-attention representations of these models, we notice two key issues. First, for text prompts that contain multiple concepts, there is a significant amount of pixel-space overlap (i.e., same spatial regions) among pairs of different concepts. This eventually leads to the model being unable to distinguish between the two concepts, with one of them being ignored in the final generation. Next, while these models attempt to capture all such concepts during the beginning of denoising (e.g., the first few steps), as evidenced by cross-attention maps, this knowledge is not retained by the end of denoising (e.g., the last few steps). Such loss of knowledge eventually leads to inaccurate generation outputs. To address these issues, our key innovations include two test-time attention-based loss functions that substantially improve the performance of pretrained baseline text-to-image diffusion models. First, our attention segregation loss reduces the cross-attention overlap between attention maps of different concepts in the text prompt, thereby reducing the confusion/conflict among concepts and ensuring that all concepts are captured in the generated output. Next, our attention retention loss explicitly forces text-to-image diffusion models to retain cross-attention information for all concepts across all denoising time steps, thereby reducing information loss and preserving all concepts in the generated output. We conduct extensive experiments with the proposed loss functions on a variety of text prompts and demonstrate that they lead to generated images that are significantly closer semantically to the input text than those of baseline text-to-image diffusion models.
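
To make the two objectives more concrete, the following is a minimal, hypothetical PyTorch sketch of how such test-time attention losses could be computed from per-concept cross-attention maps. The tensor layout, the soft intersection-over-union overlap measure, the running-maximum retention reference, and names such as attn_maps and prev_max_attn are illustrative assumptions, not the paper's exact formulations.

import torch

def attention_segregation_loss(attn_maps):
    """Penalize spatial overlap between attention maps of different concepts.

    attn_maps: dict mapping each concept token to an (H, W) cross-attention
    map with values in [0, 1] (illustrative assumption).
    """
    concepts = list(attn_maps.keys())
    loss = 0.0
    for i in range(len(concepts)):
        for j in range(i + 1, len(concepts)):
            a, b = attn_maps[concepts[i]], attn_maps[concepts[j]]
            # Soft intersection-over-union between the two maps.
            intersection = torch.minimum(a, b).sum()
            union = torch.maximum(a, b).sum() + 1e-8
            loss = loss + intersection / union
    return loss

def attention_retention_loss(attn_maps, prev_max_attn):
    """Penalize attention that is lost relative to earlier denoising steps.

    prev_max_attn: dict holding, for each concept, the element-wise maximum of
    its attention maps over all previous steps (an illustrative reference).
    """
    loss = 0.0
    for concept, current in attn_maps.items():
        retained = torch.minimum(current, prev_max_attn[concept]).sum()
        reference = prev_max_attn[concept].sum() + 1e-8
        loss = loss + (1.0 - retained / reference)
    return loss

# At each denoising step, the noisy latent would be nudged along the negative
# gradient of L = w_seg * segregation + w_ret * retention before the next
# diffusion update; w_seg and w_ret are illustrative weights.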
Date of Conference: 01-06 October 2023
Date Added to IEEE Xplore: 15 January 2024
Conference Location: Paris, France

1. Introduction

The last few years have seen a dramatic rise in the ability of text-to-image generative models to produce creative image outputs conditioned on free-form text inputs. While the recent class of pixel [20], [23] and latent [21] diffusion models has shown unprecedented image generation results, these models have some key limitations. First, as noted in prior work [3], [27], [2], they do not always produce a semantically accurate image output consistent with the text prompt. As a consequence, there are numerous cases where not all subjects of the input text prompt are reflected in the model’s generated output. For instance, see Figure 1, where Stable Diffusion [21] omits the ship in the first column, the crown in the second column, and the salmon in the third column.

