1. Introduction
The advent of contrastive vision-language pre-training has provided a new paradigm for multi-modal learning [16], [17], [22], [42]. Its adoption has spread across diverse downstream vision tasks, including 2D and 3D classification [14], [39], [41], [9], segmentation [27], [48], [36], [44], and detection [38], [45], [29]. CLIP [26] is one of the most widely recognized contrastive vision-language models and has attracted broad attention for its simplicity and strong performance. Pre-trained on massive image-text pairs sourced from the Internet, CLIP aligns visual and textual representations remarkably well, yielding favorable zero-shot performance on downstream tasks. To further enhance CLIP in low-data regimes, many efforts propose few-shot learning techniques that attach additional learnable modules to the frozen CLIP backbone to adapt it to new semantic domains.
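To make the zero-shot mechanism referenced above concrete, the following is a minimal sketch of CLIP-style zero-shot classification using OpenAI's released CLIP package; the class names, prompt template, and image path are illustrative assumptions, not part of the original text.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical label set; zero-shot classification turns class names into text prompts.
class_names = ["cat", "dog", "car"]
text_tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

# "example.jpg" is a placeholder image path for illustration only.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

    # L2-normalize, then score classes by (scaled) cosine similarity
    # between the image embedding and each class-prompt embedding.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_features @ text_features.T
    probs = logits.softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```

Few-shot adapters of the kind discussed above typically keep both encoders frozen and learn a small module on top of these image/text features.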