1. Introduction
In recent years, there has been significant progress in Language Models (LMs) [7, 11, 47, 52] and Vision Language Models (VLMs) [1, 2, 6, 8, 10, 15-18, 21, 23, 25-30, 39-41, 44, 45, 49, 50, 56, 57, 60-63, 65-75, 77-80], which exhibit strong zero-shot generalization and adaptability to a wide range of tasks. Though they may differ in architecture, data, and task formulation, such foundation models predominantly rely on large-scale pre-training on massive corpora of web-scraped data, such as C4 [51], The Pile [20], and LAION-5B [54], which serve as the source of their generalization capability.