I. INTRODUCTION
Image generation has been driven by deep learning approaches that have achieved impressive results over the past decade, including Generative Adversarial Networks (GANs) [1]–[3], Variational AutoEncoders (VAEs) [4]–[6], and Diffusion Models (DMs) [7], [8]. Without any conditioning, these generative models synthesize images from a latent space; the results depend heavily on the training data and are usually limited to a specific type of scene, such as human faces or bedrooms. Conditioning methods [8]–[10], whether based on text or images, can guide the generation process toward the result a user intends. However, users may sometimes simply want to select specific objects and place them at preferred locations, much as they would when drawing by hand.
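To make the notion of conditioning concrete, the minimal sketch below shows how a text prompt can guide a pretrained diffusion model; it uses the open-source diffusers library, and the checkpoint name and prompt are illustrative assumptions rather than part of the cited works.

```python
# A minimal sketch of text-conditioned image generation with the Hugging Face
# "diffusers" library. The checkpoint name below is an assumed example.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed pretrained checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# The prompt steers the denoising process toward the described scene, but it
# gives no direct control over which objects appear or where they are placed.
image = pipe("a bedroom with a red armchair by the window").images[0]
image.save("bedroom.png")
```

As the comment notes, such text conditioning constrains the overall content of the image but does not let the user position individual objects, which motivates the finer-grained control discussed above.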