1 Introduction
Accurately reconstructing the 3D poses of people from real images, in a variety of indoor and outdoor scenarios, has a broad spectrum of applications in entertainment, environmental awareness, and human-computer interaction [1]–[3]. Over the past 15 years the field has made significant progress, fueled by new optimization and modeling methodology, discriminative methods, feature design, and standardized datasets for model training. It is now widely agreed that any successful human sensing system, be it generative, discriminative, or combined, needs a significant training component together with strong constraints from image measurements, particularly under monocular viewing and (self-)occlusion. Such conditions are not infrequent but commonplace in images acquired in the real world, yet these images cannot be handled well with the human models and training tools currently available in computer vision. Part of the problem is that humans are highly flexible, move in complex ways against natural backgrounds, and their clothing and muscles deform. Other confounding factors, like occlusion, may also require comprehensive scene modeling beyond just the humans in the scene. Such image understanding scenarios stretch a pose sensing system's ability to exploit prior knowledge and structural correlations, using the incomplete visible information to constrain estimates of unobserved body parts.

One of the key challenges for trainable systems is insufficient data coverage. Existing state-of-the-art datasets like HumanEva [4] contain about 40,000 different poses, and the class of motions covered is somewhat small, reflecting a design geared primarily towards algorithm evaluation. In contrast, while we want to continue offering difficult benchmarks, we also wish to collect datasets that can be used to build operational systems for realistic environments.

People in the real world move less regularly than many existing datasets assume. Consider a pedestrian: it is not that frequent, particularly in busy urban environments, to encounter 'perfect' walkers. Driven by their daily tasks, people carry bags, walk with hands in their pockets, and gesticulate when talking to other people or on the phone. Since the human kinematic space is too large to be sampled regularly and densely, we chose to collect data by focusing on a set of poses that are likely to be of interest because they are common in urban and office scenes. The poses are derived from 15 chosen scenarios for which our actors were given general instructions but were also left ample freedom to improvise. This choice helps us cover some of the common pose variations more densely and, at the same time, control the difference between training and testing data (or covariate shift [5]) without placing unrealistic restrictions on their similarity. Note, however, that the variability within a daily task like 'talking on the phone' or 'eating' is subtle, since functionally similar programs are performed irrespective of the exact execution. In contrast, the distributions of any two such different scenarios are likely to contain more widely separated poses, although the manifolds from which these data are sampled may intersect.
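To make the notion of separation between scenario pose distributions concrete, the following is a minimal, illustrative sketch (not a method from this paper) of one common way to quantify covariate shift between two pose sets: a maximum mean discrepancy (MMD) statistic with an RBF kernel. The scenario names, feature dimensionality, sample sizes, and kernel bandwidth are hypothetical stand-ins; in practice each row would be a real pose feature vector, e.g., flattened 3D joint positions for one frame.

```python
# Illustrative sketch (hypothetical data): quantifying the separation between
# the pose distributions of two scenarios with a biased squared-MMD estimate.

import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, sigma: float) -> np.ndarray:
    """Pairwise RBF kernel matrix between rows of a and rows of b."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma**2))


def mmd2(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> float:
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, sigma).mean()
    k_yy = rbf_kernel(y, y, sigma).mean()
    k_xy = rbf_kernel(x, y, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy


rng = np.random.default_rng(0)
# Synthetic stand-ins for two scenarios (e.g., "walking" vs. "eating"):
# 200 frames each, 17 joints x 3 coordinates flattened into 51-D vectors.
walking = rng.normal(loc=0.0, scale=1.0, size=(200, 51))
eating = rng.normal(loc=0.5, scale=1.0, size=(200, 51))

print(f"squared MMD between scenarios: {mmd2(walking, eating):.4f}")
```

Under such a diagnostic, a small discrepancy between repetitions of the same scenario and a larger one across different scenarios would mirror the intra- versus inter-scenario structure described above, even when the underlying pose manifolds intersect.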