GR-GAN: Gradual Refinement Text-To-Image Generation


Abstract:

A good text-to-image model should not only generate high-quality images but also ensure consistency between the text and the generated image. Previous models have failed to handle both aspects well at the same time. This paper proposes a Gradual Refinement Generative Adversarial Network (GR-GAN) to alleviate the problem efficiently. A GRG module is designed to generate images from low to high resolution, stage by stage, under text constraints that move from coarse granularity (sentence) to fine granularity (word); an ITM module is designed to provide image-text matching losses at both the sentence-image level and the word-region level for the corresponding stages. We also introduce a new metric, Cross-Model Distance (CMD), for simultaneously evaluating image quality and image-text consistency. Experimental results show that GR-GAN significantly outperforms previous models and achieves a new state of the art on both FID and CMD. A detailed analysis demonstrates the efficiency of the different generation stages in GR-GAN.
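The full definition of CMD appears in the body of the paper, which is not reproduced on this page. Given that the paper cites CLIP [18] and the Fréchet distance [19], one plausible reading is a Fréchet distance computed between text features and generated-image features in a shared embedding space. The sketch below implements only that reading; the function name, the CLIP-based pairing, and the Gaussian fit are assumptions, not the paper's exact formulation.

# Sketch of a Frechet-style cross-modal distance. This assumes CMD
# follows the cited Frechet distance [19], applied to text features and
# generated-image features embedded in a shared space such as CLIP's [18].
# It is an illustrative reading, not GR-GAN's published definition.
import numpy as np
from scipy import linalg

def cross_modal_frechet_distance(text_feats: np.ndarray,
                                 image_feats: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.

    text_feats, image_feats: (N, D) arrays, e.g. CLIP embeddings of the
    captions and of the images generated from them (hypothetical pairing).
    """
    mu_t, mu_i = text_feats.mean(axis=0), image_feats.mean(axis=0)
    sigma_t = np.cov(text_feats, rowvar=False)
    sigma_i = np.cov(image_feats, rowvar=False)
    # Matrix square root of the covariance product; keep the real part
    # to discard small imaginary numerical noise.
    covmean = linalg.sqrtm(sigma_t @ sigma_i).real
    diff = mu_t - mu_i
    return float(diff @ diff + np.trace(sigma_t + sigma_i - 2.0 * covmean))

Under this reading, a lower value means the generated images, as a population, lie close to their descriptions in the shared embedding space, which is consistent with the abstract's claim that CMD evaluates image quality and image-text consistency together.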
Date of Conference: 18-22 July 2022
Date Added to IEEE Xplore: 26 August 2022
Conference Location: Taipei, Taiwan

1. Introduction

Text-to-image synthesis aims to automatically generate images conditioned on text descriptions, and is one of the most popular and challenging multi-modal tasks. The task requires the generator not only to produce high-quality images, but also to preserve semantic consistency between the text and the generated image. Generative Adversarial Networks (GANs) [1] have shown promising results on text-to-image generation by using the sentence vector as conditional information. Zhang et al. [2] propose StackGAN++, which employs a multi-stage structure to improve image resolution stage by stage, with both an unconditional and a conditional loss at each stage. Xu et al. [3] propose AttnGAN with a DAMSM module to strengthen the consistency constraint on the generator. These models have achieved great improvements on the task, but their performance is still not satisfactory, especially on complex scenes.
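To make the multi-stage pattern concrete, below is a minimal PyTorch-style sketch of a stacked generator in the spirit of StackGAN++ [2]: a sentence vector conditions an initial low-resolution feature map, and each stage doubles the resolution and emits an image for its own discriminator. All module names, dimensions, and the stage count are illustrative assumptions, not GR-GAN's implementation.

# Illustrative skeleton of a multi-stage text-to-image generator with one
# output image per stage (after StackGAN++ [2]). All names and sizes are
# hypothetical, not GR-GAN's actual code.
import torch
import torch.nn as nn

class StageGenerator(nn.Module):
    """One refinement stage: upsample features 2x and emit an image."""
    def __init__(self, channels: int):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.to_image = nn.Sequential(
            nn.Conv2d(channels, 3, 3, padding=1), nn.Tanh())

    def forward(self, h):
        h = self.refine(h)
        return h, self.to_image(h)

class MultiStageGenerator(nn.Module):
    """Maps (noise, sentence vector) to images of growing resolution."""
    def __init__(self, z_dim=100, sent_dim=256, channels=64, stages=3):
        super().__init__()
        self.channels = channels
        self.fc = nn.Linear(z_dim + sent_dim, channels * 4 * 4)
        self.stages = nn.ModuleList(
            [StageGenerator(channels) for _ in range(stages)])

    def forward(self, z, sent):
        # Fuse noise with the sentence embedding into a 4x4 feature map.
        h = self.fc(torch.cat([z, sent], dim=1))
        h = h.view(-1, self.channels, 4, 4)
        images = []
        for stage in self.stages:
            h, img = stage(h)   # 8x8, then 16x16, then 32x32, ...
            images.append(img)
        return images           # one image per stage

In training, each returned image would receive both an unconditional real/fake loss and a conditional loss against the sentence embedding from its stage's discriminator; GR-GAN's gradual refinement additionally shifts the text constraint from the whole sentence to individual words as resolution grows.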

References
1. Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele and Honglak Lee, "Generative adversarial text to image synthesis", International Conference on Machine Learning, PMLR, pp. 1060-1069, 2016.
2. Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, et al., "StackGAN++: Realistic image synthesis with stacked generative adversarial networks", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1947-1962, 2018.
3. Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, et al., "AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316-1324, 2018.
4. Tobias Hinz, Stefan Heinrich and Stefan Wermter, "Semantic object accuracy for generative text-to-image synthesis", arXiv preprint, 2019.
5. Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang, Xiaodong He, Siwei Lyu, et al., "Object-driven text-to-image synthesis via adversarial training", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12174-12182, 2019.
6. Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Xiao-Yuan Jing, Fei Wu, et al., "DF-GAN: Deep fusion generative adversarial networks for text-to-image synthesis", arXiv preprint, 2020.
7. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford and Xi Chen, "Improved techniques for training GANs", Advances in Neural Information Processing Systems, vol. 29, pp. 2234-2242, 2016.
8. Minfeng Zhu, Pingbo Pan, Wei Chen and Yi Yang, "DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5802-5810, 2019.
9. Tobias Hinz, Stefan Heinrich and Stefan Wermter, "Generating multiple objects at spatially distinct locations", arXiv preprint, 2019.
10. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, et al., "Microsoft COCO: Common objects in context", European Conference on Computer Vision, pp. 740-755, 2014.
11. Tingting Qiao, Jing Zhang, Duanqing Xu and Dacheng Tao, "MirrorGAN: Learning text-to-image generation by redescription", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1505-1514, 2019.
12. Jiadong Liang, Wenjie Pei and Feng Lu, "CP-GAN: Content-parsing generative adversarial networks for text-to-image synthesis", European Conference on Computer Vision, pp. 491-508, 2020.
13. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler and Sepp Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium", Advances in Neural Information Processing Systems, vol. 30, 2017.
14. Stanislav Frolov, Tobias Hinz, Federico Raue, Jörn Hees and Andreas Dengel, "Adversarial text-to-image synthesis: A review", arXiv preprint, 2021.
15. Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee and Yinfei Yang, "Cross-modal contrastive learning for text-to-image generation", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 833-842, 2021.
16. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, et al., "Attention is all you need", Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.
17. Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, "Deep residual learning for image recognition", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
18. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., "Learning transferable visual models from natural language supervision", arXiv preprint, 2021.
19. Maurice Fréchet, "Sur la distance de deux lois de probabilité" [On the distance between two probability laws], Comptes Rendus Hebdomadaires des Séances de l'Académie des Sciences, vol. 244, no. 6, pp. 689-692, 1957.
20. Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, et al., "Zero-shot text-to-image generation", arXiv preprint, 2021.