I. Introduction
The Action Recognition (AR) task has attracted considerable attention in recent years owing to its wide applications in video surveillance, video understanding [1], [2], human-computer interaction, and robotics [3]. However, due to the explosive growth of video data and the diversity of action categories, supervised action recognition bears the burden of laboriously collecting and annotating large amounts of new action video data [4]. In addition, a supervised action recognition model must be fine-tuned or retrained when encountering unseen categories [5]. Unlike traditional supervised learning, Zero-Shot Action Recognition (ZSAR) exploits a shared semantic/attribute space to recognize unseen action categories without collecting the corresponding labeled data. ZSAR has therefore received ever-increasing attention in the community, as it recognizes unseen categories in a weakly supervised manner. However, to the best of our knowledge, the framework design of ZSAR still lacks structural guidance compared to traditional fully supervised learning.