1. Introduction
Pre-training vision-language models on massive image-text pairs to learn transferable representations for image-text retrieval has attracted a lot of attention in recent years. Previous dominant methods [11], [29], [38] adopt a “dual-encoder” architecture to enable efficient retrieval, where two separate encoders extract image and text representations. They learn a joint image-text embedding space by enforcing coarse-grained alignment between global image and text features. However, this coarse-grained alignment constraint fails to capture detailed image and text semantics, as well as the associations between them, which limits further improvement in image-text retrieval performance.
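To make the dual-encoder setup concrete, the following is a minimal sketch (not the implementation of the cited methods) of the global image-text contrastive objective such architectures typically optimize; `image_feats` and `text_feats` stand for the pooled outputs of two hypothetical separate encoders, and the temperature value is an assumed default.

```python
import torch
import torch.nn.functional as F

def global_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over global image/text embeddings.

    image_feats, text_feats: (batch, dim) global features from two separate
    encoders; samples with the same batch index form a positive pair.
    """
    # L2-normalise so the dot product becomes cosine similarity
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds matched pairs
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # pull paired data together and push unpaired data apart, in both directions
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Because the loss only compares single global vectors, any word-level or region-level detail not preserved in those vectors cannot be aligned, which is the limitation the paragraph above points to.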
Figure 1. Illustration of image-text contrastive learning (ITC) and visual-language error modeling (ViLEM). ITC learns image-text global alignment by distinguishing paired data from unpaired data. ViLEM establishes detailed image-text associations by discriminating and correcting wrong words in plausible negative texts.
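A rough sketch of the idea the caption describes, under the assumption that plausible negative texts are built by swapping individual words and that the model is supervised, per token, to detect which words are wrong and to recover the original ones; the corruption strategy and function names here are illustrative, not the paper's actual procedure.

```python
import random
import torch
import torch.nn.functional as F

def corrupt_caption(token_ids, vocab_size, replace_prob=0.15):
    """Build a plausible negative text by replacing random words.

    Returns the corrupted token ids and a 0/1 mask marking replaced positions.
    (A hypothetical stand-in for however wrong words are actually sampled.)
    """
    corrupted = token_ids.clone()
    error_mask = torch.zeros_like(token_ids)
    for i in range(token_ids.size(0)):
        if random.random() < replace_prob:
            corrupted[i] = random.randrange(vocab_size)
            error_mask[i] = 1
    return corrupted, error_mask

def error_modeling_loss(detect_logits, correct_logits, error_mask, original_ids):
    """Token-level objectives: discriminate wrong words, then correct them.

    detect_logits:  (seq_len, 2)          per-token right/wrong prediction
    correct_logits: (seq_len, vocab_size) per-token prediction of the true word
    """
    detect_loss = F.cross_entropy(detect_logits, error_mask.long())
    # correction is only supervised on the corrupted positions
    if error_mask.any():
        correct_loss = F.cross_entropy(
            correct_logits[error_mask.bool()], original_ids[error_mask.bool()]
        )
    else:
        correct_loss = detect_logits.new_zeros(())
    return detect_loss + correct_loss
```

In contrast to the global ITC objective above, this kind of token-level supervision forces the text side to be checked word by word against the image, which is how detailed image-text associations can be established.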