1. Introduction
Learning to recognize visual concepts in an image has been a fundamental and long-standing research problem. Typically, it can be tackled via either supervised learning on human-annotated image-label pairs [10] or contrastive learning on web-crawled image-text pairs [29], [47]. When fueled with clean, large-scale human-annotated image-label data, e.g., ImageNet [10], supervised learning can attain strong visual recognition capabilities over the given categories [23], [34], [53] as well as powerful transfer learning abilities [14], [32]. Nevertheless, collecting precise image-label data is a laborious and expensive process, and it is difficult to scale up to numerous visual concepts (the largest-scale, but private, JFT-300M covers 18,291 concepts).
On the other hand, language-image contrastive learning has recently emerged as a promising alternative that leverages huge amounts of web-crawled image-text pairs. These pairs are typically noisy and free-form, but they cover a wide range of visual concepts. As demonstrated by CLIP [47] and ALIGN [29], models learned from hundreds of millions of image-text pairs can attain impressive low-shot recognition performance across a wide range of visual understanding scenarios. Although these image-text models show broad coverage of visual concepts, we find in our experiments that they usually lack the strong discriminative ability required for transfer learning. A natural question is: can we have one model that provides both discriminative representations and broad visual concept coverage?

Figure: Unified contrastive learning paradigm in the image-text-label space, which recovers supervised learning (e.g., cross-entropy (CE) [46] or supervised contrastive learning (SupCon) [30]) on image-label data, and language-image contrastive learning (e.g., CLIP [47] or ALIGN [29]) on image-text data.
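To make the unified image-text-label objective sketched in the figure concrete, the following is a minimal, hypothetical PyTorch-style sketch (not the authors' released code). It assumes L2-normalized image and text embeddings of shape (B, D), integer concept labels of shape (B,), and a fixed temperature of 0.07; the function name `unified_contrastive_loss` and these shapes are illustrative assumptions. Image-text pairs in the batch that share a label (or class name) are treated as positives for one another, while assigning each pair a unique dummy label reduces the target matrix to the identity used by CLIP/ALIGN.

```python
# Minimal sketch of a bidirectional contrastive loss over the image-text-label space.
# Assumptions: embeddings are already L2-normalized; `labels` encodes shared visual
# concepts (same label = positive pair); unique labels per pair recover a CLIP-style loss.
import torch
import torch.nn.functional as F

def unified_contrastive_loss(img_feat, txt_feat, labels, temperature=0.07):
    # (B, B) similarity matrix between image and text embeddings, scaled by temperature.
    logits = img_feat @ txt_feat.t() / temperature

    # Soft targets: entries are 1 where the image and text share a label, then each
    # row is normalized so the targets form a distribution over positives.
    targets = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    targets = targets / targets.sum(dim=1, keepdim=True)

    # Bidirectional cross-entropy: image-to-text and text-to-image directions.
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage: with shared labels the loss behaves like a supervised contrastive
# objective; with all-unique labels it behaves like a CLIP/ALIGN-style objective.
B, D = 8, 256
img = F.normalize(torch.randn(B, D), dim=1)
txt = F.normalize(torch.randn(B, D), dim=1)
labels = torch.randint(0, 4, (B,))
print(unified_contrastive_loss(img, txt, labels))
```

The key design point this sketch illustrates is that the only difference between the supervised and language-image settings lies in the target matrix: image-label data yields block-structured positives, while free-form image-text pairs yield a diagonal target, so a single loss can cover both.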