1. Introduction
There have been impressive advances in the performance of zero-shot recognition through the use of large-scale pre-trained Vision & Language (VL) models [45, 20, 50, 28, 15, 57, 29, 13]. However, these VL models still face important challenges in understanding Visual Language Concepts (VLC) beyond object nouns (e.g., recognizing attributes, relations, and states) and in compositional reasoning (i.e., understanding subtle changes in meaning due to small changes in word order). Recently, several benchmarks have been devised to demonstrate the extent to which these models lack these capabilities [51, 62, 59]. As noted in several recent works [59, 62, 9], this behavior of VL models is likely due to the contrastive pre-training prevalent in all of them, which tends to induce ‘bag-of-objects’ representations (for images and text alike). Indeed, for (even large) random batches of paired image-text samples, the collection of objects (nouns) in an image (or text) is likely to uniquely determine that image (or text) within the batch, so the contrastive batch loss can be minimized by focusing on the objects (nouns) while regarding other details (attributes, relations, states, word order, etc.) as unnecessary. Intuitively, this impairs the VLC understanding and compositional reasoning of the resulting model.
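To make the ‘bag-of-objects’ shortcut argument concrete, the sketch below shows a standard CLIP-style symmetric contrastive (InfoNCE) batch loss; the function name, tensor shapes, and temperature value are illustrative assumptions, not the exact implementation of any of the cited models. Because each image only needs to be matched to its own caption against the other captions in the batch, and those captions typically already differ in which object nouns they mention, noun-level matching suffices to drive the loss down.

```python
# A minimal sketch (under the assumptions stated above) of a CLIP-style
# symmetric contrastive batch loss over paired image-text embeddings.
import torch
import torch.nn.functional as F

def contrastive_batch_loss(image_emb: torch.Tensor,
                           text_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) L2-normalized embeddings of paired samples."""
    # Cosine-similarity logits between every image and every text in the batch.
    logits = image_emb @ text_emb.t() / temperature          # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Each image must pick out its own caption among the in-batch negatives
    # (and vice versa). If the negative captions already differ in the object
    # nouns they mention, matching on nouns alone solves this task, so
    # attributes, relations, states, and word order receive little gradient.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```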