I. Introduction
With the significant increase in digital cameras installed for public visual surveillance, human action recognition has received increasing attention from researchers over the past decade. The aim of human action recognition is to recognize human actions from videos so that a system can understand the scene and produce further classification or semantic description of it [1]–[4]. The results can be applied in many applications, such as visual surveillance, human–computer interfaces, video summarization, content-based video retrieval, and others. Human action recognition is a challenging research area because dynamic human body motions have an almost unlimited range of underlying representations. Further difficulties arise from perspective distortions, different viewpoints, and illumination variations.

To recognize human actions, an action model [5], [6] is often required. Most existing human action recognition algorithms construct the model from a single discriminative feature. However, a single feature can hardly model complicated actions well; multiple human action features should therefore be employed. Moreover, training a good action model requires a large amount of labeled data so that there are sufficient training samples to achieve generalization ability. Labeling videos is costly because it requires substantial human effort, whereas unlabeled videos can easily be obtained from public cameras. In this case, how to fully use the labeled data and, more importantly, how to exploit the large amount of unlabeled data to boost the performance of the overall system is a crucial problem. Semi-supervised learning is a promising approach to this problem.