1. Introduction
Over the last few years, generative models have achieved great success across various applications [4], [47]. Among them, text-to-image synthesis [3], [5], [16], [19]–[22], [26], [29], [30], [34], [43], [48], [50]–[53], [60] is one of the most appealing, as it generates high-fidelity images from given language guidance. Owing to the convenience of language as an interface for users, text-to-image synthesis has attracted many researchers and become an active research area.
Figure: (a) Existing text-to-image GANs conduct adversarial training from scratch. (b) Our proposed GALIP conducts adversarial training based on the integrated CLIP model.