1. Introduction
Recognizing actions in videos is an active area of research in computer vision. Because action appearance exhibits many fine-grained spatio-temporal variations, current performance remains far below that achieved in other recognition tasks such as image search. The goal of action classification is to determine which action appears in a video. Temporal action detection additionally estimates when the action occurs. This paper considers the problem of action localization, where the objective is to detect both when and where an action of interest occurs. The expected output of an action localization system is typically a spatio-temporal subvolume encompassing the action of interest. Since a localized action covers only a fraction of the spatio-temporal volume of a video, the task is considerably more challenging than action classification and temporal detection. It can be seen as the video counterpart of object detection in still images.