I. Introduction
Vision-Language Models (VLMs) such as CLIP [1] and Florence [2], built on large-scale contrastive image-text pre-training, have shown remarkable zero-shot generalization in video understanding. Within this field, egocentric action recognition (EAR), which identifies human actions in first-person videos, has gained increasing attention for its broad applications in areas such as human-object interaction [3], sports analysis [4], and video summarization [5]. Pre-training such versatile VLMs on egocentric videos and then adapting them for Zero-Shot Egocentric Action Recognition (ZS-EAR) is emerging as an effective and promising strategy.