1. Introduction
Recently, deep generative models such as StyleGAN [24], [25] and diffusion models [9], [19], [49] have made significant breakthroughs in generating high-quality images. Image generation and editing technologies enabled by these models have become highly appealing to artists and designers by supporting their creative workflows. To make image generation more controllable, researchers have put substantial effort into conditional image synthesis, introducing models conditioned on various types and levels of semantic input, such as object categories, text prompts, and segmentation maps [23], [35], [36], [43], [44], [67].
Figure: Examples of image synthesis from any-level semantic layouts. (a) The coarsest layout, i.e., the 0-th precision level, is equivalent to a text input; (d) the finest layout, i.e., the highest level, is close to an accurate segmentation map; (b)-(c) intermediate-level layouts (from coarse to fine), where shape control becomes tighter as the level increases. (e) Different precision levels can be specified for different components, e.g., a 0-th-level style indicator while the remaining regions are of higher levels.
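To make the notion of an any-level layout concrete, the following minimal Python sketch models a layout as a set of regions, each carrying an open-domain text description, an optional binary shape mask, and a precision level. All names here (`Region`, `SemanticLayout`, the `level` field, the example mask) are illustrative assumptions for exposition, not the paper's actual interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np


@dataclass
class Region:
    """One component of a semantic layout (illustrative, not the paper's API)."""
    text: str                   # open-domain description of the region
    mask: Optional[np.ndarray]  # binary shape mask; None when level == 0
    level: int                  # 0 = text only; higher = tighter shape control


@dataclass
class SemanticLayout:
    """An any-level layout: components may use different precision levels."""
    regions: List[Region] = field(default_factory=list)


# Mirror panel (e): a 0-th-level global style indicator combined with a
# higher-level object region whose shape is specified by a mask.
h, w = 64, 64
object_mask = np.zeros((h, w), dtype=bool)
object_mask[16:48, 20:44] = True  # stand-in for a user-drawn coarse shape

layout = SemanticLayout(regions=[
    Region(text="impressionist oil painting", mask=None, level=0),
    Region(text="a dog sitting on grass", mask=object_mask, level=3),
])
```

Under this reading, a level-0 region degenerates to a pure text cue (panel (a)); as the level increases, its mask is interpreted more strictly, approaching an accurate segmentation map at the highest level (panel (d)).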
Table: Difference from related conditional image synthesis works. T2I: text-to-image; S2I: segmentation-to-image; ST2I: scene-based text-to-image; Box2I: bounding-box-layout-to-image. The settings are compared in terms of open-domain layout, shape control, sparse layout, coarse shape, and level control; only our setting supports all five.