
SceneComposer: Any-Level Semantic Image Synthesis


Abstract:

We propose a new framework for conditional image synthesis from semantic layouts of any precision level, ranging from pure text to a 2D semantic canvas with precise shapes. More specifically, the input layout consists of one or more semantic regions with free-form text descriptions and adjustable precision levels, which can be set based on the desired controllability. The framework naturally reduces to text-to-image (T2I) at the lowest level with no shape information, and it becomes segmentation-to-image (S2I) at the highest level. By supporting the levels in between, our framework is flexible in assisting users of different drawing expertise and at different stages of their creative workflow. We introduce several novel techniques to address the challenges arising from this new setup, including a pipeline for collecting training data; a precision-encoded mask pyramid and a text feature map representation to jointly encode precision level, semantics, and composition information; and a multi-scale guided diffusion model to synthesize images. To evaluate the proposed method, we collect a test dataset containing user-drawn layouts with diverse scenes and styles. Experimental results show that the proposed method can generate high-quality images following the layout at the given precision, and compares favorably against existing methods. Project page: https://zengxianyu.github.io/scenec/
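To make the "precision-encoded mask pyramid" idea concrete, here is a minimal illustrative sketch (not the paper's exact construction): a region's binary mask is represented at several precision levels, where level 0 carries no shape at all (the whole canvas, i.e. the T2I case), intermediate levels carry progressively less coarsened shapes, and the highest level is the exact segmentation mask (the S2I case). The pooling block sizes and the `mask_pyramid` helper are our own assumptions for demonstration.

```python
import numpy as np

def mask_pyramid(mask: np.ndarray, num_levels: int = 4) -> list[np.ndarray]:
    """Illustrative precision-level pyramid for one semantic region.

    Level 0 has no shape information (full canvas); levels 1..num_levels-1
    coarsen the mask by block max-pooling (larger blocks -> blockier,
    looser shapes); the final level is the exact mask.
    """
    h, w = mask.shape
    pyramid = [np.ones((h, w), dtype=bool)]  # level 0: text-only, no shape
    for level in range(1, num_levels):
        block = 2 ** (num_levels - level)  # block size shrinks as level rises
        # Pad so the canvas divides evenly into blocks.
        padded = np.pad(mask, ((0, -h % block), (0, -w % block)))
        H, W = padded.shape
        # Max-pool into blocks, then nearest-neighbor upsample back.
        pooled = padded.reshape(H // block, block, W // block, block).max(axis=(1, 3))
        coarse = np.repeat(np.repeat(pooled, block, axis=0), block, axis=1)[:h, :w]
        pyramid.append(coarse.astype(bool))
    pyramid.append(mask.astype(bool))  # highest level: exact segmentation
    return pyramid
```

By construction, every coarsened level is a superset of the exact mask, so a region drawn at a low precision level constrains the layout loosely, and the constraint tightens monotonically as the level increases.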
Date of Conference: 17-24 June 2023
Date Added to IEEE Xplore: 22 August 2023
ISBN Information:

ISSN Information:

Conference Location: Vancouver, BC, Canada

1. Introduction

Recently, deep generative models such as StyleGAN [24], [25] and diffusion models [9], [19], [49] have made significant breakthroughs in generating high-quality images. Image generation and editing technologies enabled by these models have become highly appealing to artists and designers by supporting their creative workflows. To make image generation more controllable, researchers have devoted substantial effort to conditional image synthesis, introducing models conditioned on various types and levels of semantic input, such as object categories, text prompts, and segmentation maps [23], [35], [36], [43], [44], [67].

Examples of image synthesis from any-level semantic layouts. (a) The coarsest layout, i.e., at the 0-th precision level, is equivalent to a text input; (d) the finest layout, i.e., at the highest level, is close to an accurate segmentation map; (b)-(c) intermediate-level layouts (from coarse to fine): shape control becomes tighter with increasing levels. (e) We can specify different precision levels for different components, e.g., a 0-th-level style indicator while the remaining regions are at higher levels.

Table: Differences from related conditional image synthesis settings (T2I: text-to-image; S2I: segmentation-to-image; ST2I: scene-based text-to-image; Box2I: bounding-box layout-to-image), compared along five axes: open-domain layout, shape control, sparse layout, coarse shape, and level control.
