
Spatio-temporal Shape and Flow Correlation for Action Recognition


Abstract:

This paper explores the use of volumetric features for action recognition. First, we propose a novel method to correlate spatio-temporal shapes to video clips that have been automatically segmented. Our method works on over-segmented videos, which means that we do not require background subtraction for reliable object segmentation. Next, we discuss and demonstrate the complementary nature of shape- and flow-based features for action recognition. Our method, when combined with a recent flow-based correlation technique, can detect a wide range of actions in video, as demonstrated by results on a long tennis video. Although not specifically designed for whole-video classification, we also show that our method's performance is competitive with current action classification techniques on a standard video classification dataset.
Date of Conference: 17-22 June 2007
Date Added to IEEE Xplore: 16 July 2007
Print ISSN: 1063-6919
Conference Location: Minneapolis, MN, USA

1. Introduction

The goal of action recognition is to localize a particular event of interest in video, such as a tennis serve, both in space and in time. Just as object recognition is a key problem in image understanding, action recognition is a fundamental challenge for interpreting video. A recent trend in action recognition has been the emergence of techniques based on the volumetric analysis of video, where a sequence of images is treated as a three-dimensional space-time volume. Eschewing explicit models of the actor or environment (e.g., kinematic models of humans), these approaches attempt to perform recognition directly on the raw video. An obvious benefit is that recognition need not be limited to a specific set of actors or actions but can, in principle, extend to a variety of events, given appropriate training data. The drawback is that volumetric representations do not easily generalize across appearance changes due to different actors, varying environmental conditions, and camera viewpoint. This observation has motivated the use of video features that are robust to appearance; these can be broadly categorized as shape-based (e.g., background-subtracted human silhouettes) and flow-based (e.g., motion fields generated using optical flow). However, as discussed below, both types of methods have significant limitations.

Our goal is to detect specific actions in realistic videos with cluttered environments. First, we segment the input video into space-time volumes. Then, we correlate action templates with the volumes using shape and flow features. This allows us to localize events in space-time without requiring background-subtracted video.
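To make this pipeline concrete, the following Python sketch illustrates the shape-correlation idea under stated assumptions: the function names, the segment-selection rule (keep a segment when most of its voxels fall inside the template), the intersection-over-union score, and the fusion weight alpha are all illustrative choices, not the paper's exact matching criterion.

```python
import numpy as np

def shape_match_score(template_mask, segment_labels, inside_frac=0.5):
    """Score a binary space-time shape template against an over-segmented clip.

    template_mask:  (T, H, W) bool array, the action template's silhouette
                    over time (e.g., a tennis serve).
    segment_labels: (T, H, W) int array, region labels from a space-time
                    over-segmentation (e.g., mean shift).

    A segment is selected when most of its voxels fall inside the template;
    the returned score is the intersection-over-union between the template
    and the union of the selected segments. This rule is an assumption for
    illustration, not the authors' exact criterion.
    """
    selected = np.zeros(template_mask.shape, dtype=bool)
    for label in np.unique(segment_labels):
        region = segment_labels == label
        # Keep whole segments that lie mostly inside the template shape.
        if template_mask[region].mean() > inside_frac:
            selected |= region
    intersection = (template_mask & selected).sum()
    union = (template_mask | selected).sum()
    return intersection / union if union else 0.0


def combined_score(shape_score, flow_score, alpha=0.5):
    """Fuse complementary shape and flow evidence (convex combination assumed)."""
    return alpha * shape_score + (1.0 - alpha) * flow_score
```

In a detection setting, one would slide the template over the video, evaluate the combined score at each space-time location, and threshold. The flow score here stands in for the output of a separate flow-based correlation method, which the paper pairs with the shape channel. Note why over-segmentation suffices: the segmentation only has to respect object boundaries so that the template can be assembled from whole segments; it does not need to isolate the actor, which is what removes the need for background subtraction.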
