I. Introduction
Image generation, also referred to as computer graphics (CG), is a group of methods used to create the most realistic image possible within the constraints of time, money, skill sets, and available computer hardware. These images can be static, like a digital painting or a photograph, or dynamic, like an animated movie.

This project integrates the generator of StyleGAN with the discriminator of AttnGAN. The generator creates a picture that corresponds to a text description given as input; the discriminator then evaluates the generated image and provides feedback that the generator uses to improve its output. StyleGAN uses a technique called "style-based synthesis" to separate the generation of image content from the generation of image styles such as the color, texture, and lighting of a face. Its ability to produce high-quality, controllable outputs has made it a popular tool.

AttnGAN (Attentional Generative Adversarial Network) is a type of Generative Adversarial Network (GAN) used for text-to-image generation, a task at the intersection of computer vision and natural language processing (NLP). AttnGAN uses a multi-level attention mechanism that attends to different parts of the textual description at different scales. This allows the network to capture all the features of the text and generate images that are more faithful to the description (a minimal sketch of this word-level attention is given below). Another important feature of AttnGAN is its hierarchical structure for generating images: the generator first produces a low-resolution image and then refines it step by step into a higher-resolution output (also sketched below). This enables the network to produce images that are both high in quality and coherent with the textual description. AttnGAN has been applied in several settings, including the generation of realistic images from textual descriptions of birds, flowers, and other natural objects. Its capacity to produce pictures from written descriptions has important applications in fields such as e-commerce, where it can be used to generate product images from product descriptions.

This project uses the COCO dataset and a flower dataset. COCO, the Common Objects in Context dataset, contains over 330,000 images and is one of the largest and most complete datasets for object comprehension and verification, covering 80 object categories. Each image in the COCO dataset is annotated with object instance masks, object bounding boxes, and object category labels (a short example of reading these annotations follows below). A further motivation for this project is to avoid copyright issues that arise from reusing existing images: by inputting a specific caption or description, unique and personalized images can be generated. Image generation from captions is also important because it can help us better understand how language and vision are related.
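To make the attention mechanism concrete, the following is a minimal sketch in PyTorch of word-level attention between caption words and image sub-regions. It is not AttnGAN's exact implementation; the tensor shapes and the names word_emb and region_feat are illustrative assumptions.

```python
# Minimal sketch (not AttnGAN's exact code) of word-level attention between
# word embeddings and image sub-region features. Shapes are illustrative.
import torch
import torch.nn.functional as F

def word_region_attention(word_emb, region_feat):
    """word_emb:    (batch, n_words, dim)   -- one vector per caption word
       region_feat: (batch, n_regions, dim) -- one vector per image sub-region
       Returns a word-specific image context built from attended regions."""
    # Similarity between every word and every image region.
    scores = torch.bmm(word_emb, region_feat.transpose(1, 2))  # (b, n_words, n_regions)
    # Normalize over regions so each word distributes attention across the image.
    attn = F.softmax(scores, dim=-1)
    # Weighted sum of region features gives an image context per word.
    context = torch.bmm(attn, region_feat)                     # (b, n_words, dim)
    return context, attn

# Example: 4 captions of 12 words attending over 64 (8x8) image regions.
words = torch.randn(4, 12, 256)
regions = torch.randn(4, 64, 256)
ctx, attn = word_region_attention(words, regions)
print(ctx.shape, attn.shape)  # torch.Size([4, 12, 256]) torch.Size([4, 12, 64])
```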
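The hierarchical, coarse-to-fine generation can be sketched in the same spirit: a noise vector (in the full model, combined with a text embedding) is projected to a coarse 8x8 image, which successive upsampling stages then refine. All layer sizes and module names here are illustrative assumptions, not AttnGAN's actual architecture.

```python
# Sketch of hierarchical (coarse-to-fine) generation: a low-resolution image
# is produced first, then refined stage by stage. Sizes are illustrative.
import torch
import torch.nn as nn

class CoarseToFineGenerator(nn.Module):
    def __init__(self, z_dim=100, base_ch=64):
        super().__init__()
        # Stage 0: project the noise vector to an 8x8 feature map.
        # (In the full model the text embedding would be concatenated here.)
        self.fc = nn.Linear(z_dim, base_ch * 8 * 8)
        self.to_rgb = nn.Conv2d(base_ch, 3, 3, padding=1)  # shared across scales for brevity
        # Refinement stages: each doubles the spatial resolution.
        self.refine = nn.ModuleList([
            nn.Sequential(
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv2d(base_ch, base_ch, 3, padding=1),
                nn.ReLU(inplace=True),
            )
            for _ in range(3)  # 8 -> 16 -> 32 -> 64
        ])

    def forward(self, z):
        h = self.fc(z).view(z.size(0), -1, 8, 8)
        images = [torch.tanh(self.to_rgb(h))]      # coarse 8x8 output
        for stage in self.refine:
            h = stage(h)
            images.append(torch.tanh(self.to_rgb(h)))  # progressively finer outputs
        return images                              # list of images: 8x8 ... 64x64

gen = CoarseToFineGenerator()
outs = gen(torch.randn(2, 100))
print([o.shape[-1] for o in outs])  # [8, 16, 32, 64]
```

Returning every intermediate image, as AttnGAN does, lets a discriminator supervise each resolution rather than only the final one.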
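COCO annotations can be read with the pycocotools package; the short sketch below shows how the instance masks, bounding boxes, and category labels mentioned above are accessed. The annotation file path is a placeholder for a local copy of the dataset.

```python
# Sketch of reading COCO instance annotations with pycocotools.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_train2017.json")  # placeholder local path
img_ids = coco.getImgIds()
print(f"{len(img_ids)} images, {len(coco.getCatIds())} object categories")

# Annotations for one image: each entry carries a bbox, a segmentation mask,
# and a category id that maps to one of the 80 object categories.
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_ids[0]))
for ann in anns:
    name = coco.loadCats(ann["category_id"])[0]["name"]
    print(name, ann["bbox"])  # bbox format: [x, y, width, height]
```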