1. Introduction
Text-to-image synthesis aims to generate images from natural language descriptions. A generated image is expected to be both photo-realistic and semantically realistic; specifically, it should contain sufficient visual details that semantically align with the text description. Since the proposal of the Generative Adversarial Network (GAN) [1], numerous advances have addressed photo-realistic quality [11], [18]–[20], and semantic consistency [17]. While both aspects emphasize image quality, one aspect overlooked in the literature is the cause-and-effect visual scenario in image generation. For example, an image corresponding to the text “cut chicken into dice and stir with roasted peanuts” is difficult to generate under the current text-to-image synthesis paradigm. The reason is that the sentence is action-oriented: the expected image details are entities such as “diced chicken” and “roasted peanuts”, together with the visual consequence of stirring the two entities. Current state-of-the-art techniques, which rely on mapping between textual and visual entities, cannot handle this kind of cause-and-effect realistic image generation.