1. Introduction
The last few years have seen a dramatic rise in the ability of text-to-image generative models to produce creative image outputs conditioned on free-form text inputs. While the recent class of pixel [20], [23] and latent [21] diffusion models has shown unprecedented image generation results, these models have some key limitations. First, as noted in prior work [3], [27], [2], they do not always produce an image that is semantically consistent with the text prompt. As a consequence, there are numerous cases where not all subjects mentioned in the input text prompt are reflected in the model’s generated output. For instance, see Figure 1, where Stable Diffusion [21] omits the ship in the first column, the crown in the second column, and the salmon in the third column.