I. Introduction
Recognizing human actions in videos is a challenging task that has received significant attention from the research community because of its wide range of applications, including visual surveillance and human-computer interaction [1]–[13]. Although the problem has been actively studied for more than a decade [1]–[5], many issues remain unsolved for the following reasons. First, the task typically exhibits large intraclass differences caused by variations in viewpoint, background clutter, object speed, and motion patterns. Second, complex contextual information, such as unexpected interactions among objects, people, and the scene, can degrade recognition performance. Third, the diverse and dynamic nature of each action category makes it difficult to model the salient action units. Designing a robust human action recognition model is therefore an urgent requirement.