I. Introduction
Generating realistic images from textual descriptions is a compelling and rapidly evolving area at the intersection of natural language processing and computer vision. Text-to-image synthesis, which produces images from natural-language inputs, has attracted growing interest in fields such as content creation, e-commerce, virtual reality, and accessibility for the blind and visually impaired [1]. Among the most successful models for this task are Generative Adversarial Networks (GANs), which learn from paired datasets of images and their textual descriptions to produce highly realistic images [2], [3].
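To make the conditional-GAN formulation referenced above concrete, the sketch below outlines a minimal text-conditioned generator and discriminator in PyTorch. The layer sizes, the use of a pre-computed sentence embedding, and the class names (TextConditionalGenerator, TextConditionalDiscriminator) are illustrative assumptions for exposition only, not the architecture of any model cited in this paper.

```python
# Minimal sketch of a text-conditional GAN (illustrative; sizes and names are assumptions).
import torch
import torch.nn as nn

class TextConditionalGenerator(nn.Module):
    """Maps a noise vector concatenated with a sentence embedding to an image."""
    def __init__(self, noise_dim=100, text_dim=256, img_channels=3, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, img_channels * img_size * img_size),
            nn.Tanh(),  # pixel values scaled to [-1, 1]
        )

    def forward(self, noise, text_emb):
        x = torch.cat([noise, text_emb], dim=1)
        return self.net(x).view(-1, 3, self.img_size, self.img_size)

class TextConditionalDiscriminator(nn.Module):
    """Scores how real an image looks given its paired text embedding."""
    def __init__(self, text_dim=256, img_channels=3, img_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_channels * img_size * img_size + text_dim, 512),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(512, 1),  # single real/fake logit
        )

    def forward(self, img, text_emb):
        x = torch.cat([img.flatten(start_dim=1), text_emb], dim=1)
        return self.net(x)

if __name__ == "__main__":
    # One adversarial step on a batch of (real image, text embedding) pairs.
    G, D = TextConditionalGenerator(), TextConditionalDiscriminator()
    bce = nn.BCEWithLogitsLoss()
    real_img = torch.rand(4, 3, 64, 64) * 2 - 1   # stand-in for real images
    text_emb = torch.randn(4, 256)                # stand-in for sentence embeddings
    noise = torch.randn(4, 100)
    fake_img = G(noise, text_emb)
    d_loss = bce(D(real_img, text_emb), torch.ones(4, 1)) + \
             bce(D(fake_img.detach(), text_emb), torch.zeros(4, 1))
    g_loss = bce(D(fake_img, text_emb), torch.ones(4, 1))
```

In practice the fully connected layers shown here would be replaced with convolutional and transposed-convolutional blocks, but the conditioning mechanism (concatenating a text embedding with the noise vector and with the image features) is the same.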