1. Introduction
Automatically recognizing human actions is receiving increasing attention due to its wide range of applications, such as video indexing and retrieval, human-computer interaction, and activity monitoring. Although a large body of research on action categorization has been reported, recognizing actions in realistic video remains a challenging problem due to significant intra-class variations, occlusion, and background clutter. In order to obtain reliable features, most early work made a number of strong assumptions about the videos, such as the availability of reliable human body tracking, slight or no camera motion, and a limited number of viewpoints [3], [5]. The commonly used KTH dataset contains relatively simple scenarios, and many methods employing this dataset have been reported [8], [9], [10]. However, very few attempts have been made to recognize actions from videos “in the wild,” as shown by the examples in Fig. 1.

Here, a video “in the wild” refers to a video captured under uncontrolled conditions, such as one recorded by an amateur using a hand-held camera. Owing to diverse sources such as YouTube, TV broadcasts, and personal video collections, this type of video generally exhibits significant camera motion, background clutter, and changes in object appearance, scale, illumination, and viewpoint. In this paper, our goal is to offer a generic framework for recognizing such realistic actions. Since we collected most of these videos from YouTube, hereafter “YouTube videos” refers to videos “in the wild.” Our YouTube action dataset consists of 11 categories and contains about 1,160 videos; a detailed description of the dataset is given in a later section.