I. Introduction
Recent developments in text-to-image models have significantly enhanced the ability to generate high-quality images from natural language prompts [1]–[4]. Writing a text prompt, as shown in the first method in Figure 1, is the most straightforward way to specify a desired image. Having been trained on extensive datasets of images and their corresponding captions, these models possess a strong semantic understanding. However, image quality often degrades when generating scenes that contain multiple objects. Crafting precise prompts is essential, yet current models lack the flexibility to control the positioning and combination of specific objects without affecting other elements. For example, adjusting a single object in an image can inadvertently disrupt other well-formed aspects of the scene, leading to consistency issues [5], [6].