1. Introduction
Text-to-image synthesis aims to automatically generate images conditioned on text descriptions, and is one of the most popular and challenging multi-modal tasks. The task requires the generator not only to produce high-quality images, but also to preserve the semantic consistency between the text and the generated image. Generative Adversarial Networks (GANs) [1] have shown promising results on text-to-image generation by using the sentence vector as conditional information. Zhang et al. [2] propose StackGAN++, which employs a multi-stage structure to improve image resolution stage by stage, with an unconditional loss in addition to a conditional loss at each stage. Xu et al. [3] propose AttnGAN, whose DAMSM module strengthens the consistency constraint on the generator. These models have achieved great improvements on the task, but their performance is still unsatisfactory, especially on complex scenes.
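For concreteness, the per-stage objective of [2] can be sketched as follows (a simplified form assuming the standard non-saturating GAN loss; see [2] for the exact formulation): the generator $G_i$ at stage $i$ minimizes

\[
\mathcal{L}_{G_i} \;=\; \underbrace{-\tfrac{1}{2}\,\mathbb{E}_{s_i \sim p_{G_i}}\!\left[\log D_i(s_i)\right]}_{\text{unconditional loss}} \;\; \underbrace{-\;\tfrac{1}{2}\,\mathbb{E}_{s_i \sim p_{G_i}}\!\left[\log D_i(s_i, c)\right]}_{\text{conditional loss}},
\]

where $s_i$ is the image generated at stage $i$, $c$ is the sentence embedding, and $D_i$ is the stage-$i$ discriminator. The unconditional term pushes $s_i$ toward the real-image distribution, while the conditional term enforces consistency between the image and the text condition.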