1. Introduction
Synthesizing images from text descriptions (known as text-to-image synthesis) is an important machine learning task, which requires handling the ambiguous and incomplete information in natural language descriptions and learning across the vision and language modalities. Approaches based on Generative Adversarial Networks (GANs) [5] have recently achieved promising results on this task [23, 22, 32, 33, 29, 16, 9, 12, 34]. Most GAN-based methods synthesize the image conditioned only on a global sentence vector, which may miss important fine-grained information at the word level and thus prevents the generation of high-quality images. More recently, AttnGAN [29] was proposed, which introduces the attention mechanism [28, 30, 2, 27] into the GAN framework and thereby allows attention-driven, multi-stage refinement for fine-grained text-to-image generation.
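To make the contrast with global sentence conditioning concrete, the following is a minimal NumPy sketch of word-level attention in the spirit of AttnGAN-style models, not the authors' implementation: each image region forms a context vector as a weighted sum over word embeddings, instead of every region seeing only one sentence vector. All names and dimensions here are illustrative assumptions.

```python
import numpy as np

def word_attention(word_emb, region_feat):
    """Each image region attends over all words of the description.

    word_emb:    (T, D) array of T word embeddings
    region_feat: (N, D) array of N image-region features
    Returns per-region word-context vectors and the attention weights.
    """
    scores = region_feat @ word_emb.T                 # (N, T) similarities
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)     # softmax over words
    context = weights @ word_emb                      # (N, D) context vectors
    return context, weights

rng = np.random.default_rng(0)
words = rng.standard_normal((5, 8))     # 5 words, 8-dim embeddings
regions = rng.standard_normal((16, 8))  # 16 image regions
ctx, attn = word_attention(words, regions)
print(ctx.shape, attn.shape)  # (16, 8) (16, 5)
```

Because each region receives its own word-weighted context, later refinement stages can attend to the most relevant words when synthesizing each sub-region of the image.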