1. Introduction
Human action recognition has been widely explored and supports applications in many fields, such as content-based action retrieval [1], intelligent surveillance [2], and gaming [3]. Early attempts at this task used RGB data, since RGB sensors are cheap and have been deployed in a wide range of scenarios. However, because RGB sensors cannot capture depth information, it is rather difficult for algorithms to detect human bodies against cluttered backgrounds. Moreover, the absence of depth information introduces ambiguities when distinguishing similar actions.

With the progress of depth sensors, e.g., the Microsoft Kinect, researchers began using depth data for human action recognition. Compared with RGB data, human bodies can be segmented from the background more easily, since complex and confusing textures and illumination are ignored by depth sensors. More importantly, the additional information in depth data provides a new way to distinguish actions whose appearances are similar in the X-Y plane but differ along the depth (Z-axis) direction. The drawbacks of depth data are mainly twofold. First, depth data contains jumping noise. Second, depth data is usually redundant for mapping a complex depth sequence to a simple action label.

Recently, robust skeleton estimation algorithms have made it possible to extract skeleton joints from depth data in real time, which opens a new way to understand human actions using 3D skeleton data. Compared with depth data, skeleton joints estimated by robust algorithms [4] are more compact and suffer less from jumping noise.