Cross-Modal Contrastive Learning for Text-to-Image Generation



Abstract:

The output of text-to-image synthesis systems should be coherent, clear, photo-realistic scenes with high semantic fidelity to their conditioned text descriptions. Our Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN) addresses this challenge by maximizing the mutual information between image and text. It does this via multiple contrastive losses which capture inter-modality and intra-modality correspondences. XMC-GAN uses an attentional self-modulation generator, which enforces strong text-image correspondence, and a contrastive discriminator, which acts as a critic as well as a feature encoder for contrastive learning. The quality of XMC-GAN's output is a major step up from previous models, as we show on three challenging datasets. On MS-COCO, not only does XMC-GAN improve state-of-the-art FID from 24.70 to 9.33, but, more importantly, people prefer XMC-GAN by 77.3% for image quality and 74.1% for image-text alignment, compared to three other recent models. XMC-GAN also generalizes to the challenging Localized Narratives dataset (which has longer, more detailed descriptions), improving state-of-the-art FID from 48.70 to 14.12. Lastly, we train and evaluate XMC-GAN on the challenging Open Images data, establishing a strong benchmark FID score of 26.91.
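The inter-modality objective sketched in the abstract can be illustrated with an InfoNCE-style loss between image and sentence embeddings. The snippet below is a minimal sketch, not the authors' implementation: the encoder outputs, the `temperature` value, and the function name are assumptions, and the paper's additional intra-modality (image-image and region-word) losses are omitted.

```python
# Minimal sketch of a cross-modal (image-text) contrastive loss in the
# spirit of XMC-GAN's inter-modality objective. Names, shapes, and the
# temperature are illustrative assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(image_emb, text_emb, temperature=0.1):
    """InfoNCE-style loss: matching image/text pairs in a batch are pulled
    together, mismatched pairs are pushed apart.

    image_emb, text_emb: (batch, dim) embeddings from hypothetical image
    and sentence encoders (any projection heads are assumed upstream).
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Cosine similarity between every image and every sentence in the batch.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    # Symmetric cross-entropy over the image->text and text->image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

Minimizing this loss is one standard way to maximize a lower bound on the mutual information between the two modalities, which is the stated goal of the contrastive objectives in the abstract.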
Date of Conference: 20-25 June 2021
Date Added to IEEE Xplore: 02 November 2021
Conference Location: Nashville, TN, USA

1. Introduction

Compared to other kinds of inputs (e.g., sketches and object masks), descriptive sentences are an intuitive and flexible way to express visual concepts for generating images. The main challenge for text-to-image synthesis lies in learning from unstructured descriptions and in bridging the different statistical properties of visual and language inputs; a sketch of one such conditioning mechanism follows below.
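To make the conditioning concrete, here is a hedged sketch of sentence-conditioned feature modulation, loosely in the spirit of the attentional self-modulation generator described in the abstract. The class name, tensor shapes, and use of batch normalization are assumptions for illustration; the word-level attention used by the actual generator is omitted.

```python
# A hedged sketch of text-conditioned feature modulation: a sentence
# embedding produces per-channel scale and shift parameters that steer
# the statistics of generator activations. Illustrative only.
import torch
import torch.nn as nn

class TextModulatedBlock(nn.Module):
    """Scales and shifts normalized generator features using a sentence
    embedding, letting language statistics modulate image features."""
    def __init__(self, channels, text_dim):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels, affine=False)
        self.to_gamma = nn.Linear(text_dim, channels)  # per-channel scale
        self.to_beta = nn.Linear(text_dim, channels)   # per-channel shift

    def forward(self, features, sent_emb):
        # features: (B, C, H, W) generator activations
        # sent_emb: (B, text_dim) sentence embedding (e.g., from a BERT encoder)
        gamma = self.to_gamma(sent_emb).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(sent_emb).unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(features) + beta
```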

