I. Introduction
Human action recognition (HAR) is a central task in video understanding [1], [2], [3]. Existing studies have explored various modalities for feature extraction, such as optical flow [4], [5], [6], RGB images [7], [8], [9], and human skeletons [10], [11], [12]. Optical flow captures temporal features that are largely immune to irrelevant environmental influences, but it incurs a relatively high computational cost. RGB images contain rich information but are sensitive to environmental noise. Human skeletons are action-focused and compact but contain less information than optical flow and RGB images. Despite their sparsity, human skeletons have received increasing attention in recent research on action recognition because they are lightweight and robust to environmental variations.