1. Introduction
In vision-language models, there has been a recent move to build object awareness explicitly into the vision module, either by adding bespoke components or by adopting entirely object-centric architectures. The motivation comes partly from the attractive compositional nature of objects and their inter-relationships in language, which enables inexhaustible novel combinations [10], [45], and partly from infant cognitive studies that stress the importance of objects in early visual development [29], [56], [60]. Examples in the video domain include explicit internal object representations [2], e.g., RoI-align-pooled features [17] obtained either from a pre-trained region-proposal network (RPN) [2], [52], [57], [62] or from bounding-box coordinates given as input [19], [42], [48], [71]. This contrasts with the large body of work in which standard representations are learnt end-to-end without any explicit factorization into objects/entities, such as dual-encoder vision-language models in the image [21], [49] and video domains [4], [64].