I. Introduction
With the rapid development of computer vision and natural language processing, text-to-image synthesis, which refers to generating a visually realistic image that matches a given textual sentence, has recently attracted considerable attention. Because it is very difficult to parse sentences and bridge the semantic gap between sentence and image, text-to-image synthesis remains an open problem. The task poses two major challenges. The first is visual realism: generating rich and detailed images from text with a limited number of words is difficult. The second is semantic consistency: establishing the relationships between text semantics and visual features is difficult.