Cross-Modal Contrastive Learning for Text-to-Image Generation



Abstract:

The output of text-to-image synthesis systems should be coherent, clear, photo-realistic scenes with high semantic fidelity to their conditioned text descriptions. Our Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN) addresses this challenge by maximizing the mutual information between image and text. It does this via multiple contrastive losses which capture inter-modality and intra-modality correspondences. XMC-GAN uses an attentional self-modulation generator, which enforces strong text-image correspondence, and a contrastive discriminator, which acts as a critic as well as a feature encoder for contrastive learning. The quality of XMC-GAN's output is a major step up from previous models, as we show on three challenging datasets. On MS-COCO, not only does XMC-GAN improve state-of-the-art FID from 24.70 to 9.33, but, more importantly, people prefer XMC-GAN by 77.3% for image quality and 74.1% for image-text alignment, compared to three other recent models. XMC-GAN also generalizes to the challenging Localized Narratives dataset (which has longer, more detailed descriptions), improving state-of-the-art FID from 48.70 to 14.12. Lastly, we train and evaluate XMC-GAN on the challenging Open Images data, establishing a strong benchmark FID score of 26.91.
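The inter-modality objective sketched in the abstract can be illustrated with an InfoNCE-style loss between image and sentence embeddings. The snippet below is a minimal sketch, not the authors' implementation: the encoder outputs, the `temperature` value, and the function name are assumptions, and the paper's additional intra-modality (image-image and region-word) losses are omitted.

```python
# Minimal sketch of a cross-modal (image-text) contrastive loss in the
# spirit of XMC-GAN's inter-modality objective. Names, shapes, and the
# temperature are illustrative assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(image_emb, text_emb, temperature=0.1):
    """InfoNCE-style loss: matching image/text pairs in a batch are pulled
    together, mismatched pairs are pushed apart.

    image_emb, text_emb: (batch, dim) embeddings from hypothetical image
    and sentence encoders (any projection heads are assumed upstream).
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Cosine similarity between every image and every sentence in the batch.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    # Symmetric cross-entropy over the image->text and text->image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

Minimizing this loss is one standard way to maximize a lower bound on the mutual information between the two modalities, which is the stated goal of the contrastive objectives in the abstract.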
Date of Conference: 20-25 June 2021
Date Added to IEEE Xplore: 02 November 2021
Conference Location: Nashville, TN, USA

1. Introduction

Compared to other kinds of inputs (e.g., sketches and object masks), descriptive sentences are an intuitive and flexible way to express visual concepts for generating images. The main challenge for text-to-image synthesis lies in learning from unstructured descriptions and in bridging the different statistical properties of visual and language inputs; a sketch of one such conditioning mechanism follows below.
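To make the conditioning concrete, here is a hedged sketch of sentence-conditioned feature modulation, loosely in the spirit of the attentional self-modulation generator described in the abstract. The class name, tensor shapes, and use of batch normalization are assumptions for illustration; the word-level attention used by the actual generator is omitted.

```python
# A hedged sketch of text-conditioned feature modulation: a sentence
# embedding produces per-channel scale and shift parameters that steer
# the statistics of generator activations. Illustrative only.
import torch
import torch.nn as nn

class TextModulatedBlock(nn.Module):
    """Scales and shifts normalized generator features using a sentence
    embedding, letting language statistics modulate image features."""
    def __init__(self, channels, text_dim):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels, affine=False)
        self.to_gamma = nn.Linear(text_dim, channels)  # per-channel scale
        self.to_beta = nn.Linear(text_dim, channels)   # per-channel shift

    def forward(self, features, sent_emb):
        # features: (B, C, H, W) generator activations
        # sent_emb: (B, text_dim) sentence embedding (e.g., from a BERT encoder)
        gamma = self.to_gamma(sent_emb).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(sent_emb).unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(features) + beta
```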

