
Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments



Abstract:

We introduce a new dataset, Human3.6M, of 3.6 million accurate 3D human poses, acquired by recording the performance of 5 female and 6 male subjects under 4 different viewpoints, for training realistic human sensing systems and for evaluating the next generation of human pose estimation models and algorithms. Besides increasing the size of the datasets in the current state of the art by several orders of magnitude, we also aim to complement such datasets with a diverse set of motions and poses encountered as part of typical human activities (taking photos, talking on the phone, posing, greeting, eating, etc.), with additional synchronized image, human motion capture, and time-of-flight (depth) data, and with accurate 3D body scans of all the subject actors involved. We also provide controlled mixed-reality evaluation scenarios where 3D human models are animated using motion capture and inserted, using correct 3D geometry, in complex real environments, viewed with moving cameras, and under occlusion. Finally, we provide a set of large-scale statistical models and detailed evaluation baselines for the dataset, illustrating its diversity and the scope for improvement by future work in the research community. Our experiments show that our best large-scale model can leverage our full training set to obtain a 20% improvement in performance over a training set of the scale of the largest existing public dataset for this problem. Yet the potential for improvement by leveraging higher-capacity, more complex models with our large dataset is substantially vaster and should stimulate future research. The dataset, together with code for the associated large-scale learning models, features, visualization tools, and the evaluation server, is available online at http://vision.imar.ro/human3.6m.
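The abstract mentions detailed evaluation baselines and an evaluation server. The metric conventionally reported on Human3.6M is the mean per joint position error (MPJPE); the sketch below is only an illustration of that metric, and the array shapes, the 32-joint layout, and all variable names are our assumptions rather than the dataset's actual file format or API.

import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: average Euclidean distance
    between corresponding joints, in the units of the input
    (typically mm). pred, gt: (N, J, 3) arrays of N poses with
    J 3D joints each."""
    assert pred.shape == gt.shape
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Hypothetical usage with random stand-ins for real data.
rng = np.random.default_rng(0)
gt = rng.normal(size=(100, 32, 3)) * 100.0    # placeholder ground truth, mm
pred = gt + rng.normal(size=gt.shape) * 10.0  # placeholder predictions
print(f"MPJPE: {mpjpe(pred, gt):.1f} mm")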
Page(s): 1325 - 1339
Date of Publication: 12 December 2013

PubMed ID: 26353306

1 Introduction

Accurately reconstructing the 3D human poses of people from real images, in a variety of indoor and outdoor scenarios, has a broad spectrum of applications in entertainment, environmental awareness, and human-computer interaction [1]–[3]. Over the past 15 years, the field has made significant progress, fueled by new optimization and modeling methodology, discriminative methods, feature design, and standardized datasets for model training. It is now widely agreed that any successful human sensing system, be it generative, discriminative, or combined, needs a significant training component, together with strong constraints from image measurements, in order to be successful, particularly under monocular viewing and (self-)occlusion. Such situations are not infrequent but commonplace in images acquired in real-world settings, yet these images cannot be handled well with the human models and training tools currently available in computer vision. Part of the problem is that humans are highly flexible, move in complex ways against natural backgrounds, and their clothing and muscles deform. Other confounding factors like occlusion may also require comprehensive scene modeling, beyond just the humans in the scene. Such image understanding scenarios stretch the ability of the pose sensing system to exploit prior knowledge and structural correlations, using the incomplete visible information to constrain estimates of unobserved body parts.

One of the key challenges for trainable systems is insufficient data coverage. Existing state-of-the-art datasets like HumanEva [4] contain about 40,000 different poses, and the class of motions covered is relatively narrow, reflecting a design geared primarily toward algorithm evaluation. In contrast, while we want to continue to offer difficult benchmarks, we also wish to collect datasets that can be used to build operational systems for realistic environments.

People in the real world move less regularly than many existing datasets assume. Consider the case of a pedestrian: it is not that frequent, particularly in busy urban environments, to encounter 'perfect' walkers. Driven by their daily tasks, people carry bags, walk with their hands in their pockets, and gesticulate when talking to other people or on the phone.

Since the human kinematic space is too large to be sampled regularly and densely, we chose to collect data by focusing on a set of poses that are likely to be of interest because they are common in urban and office scenes. The poses are derived from 15 chosen scenarios for which our actors were given general instructions but were also left ample freedom to improvise. This choice helps us cover some of the common pose variations more densely and, at the same time, control the difference between training and testing data (the covariate shift [5]) without placing unrealistic restrictions on their similarity. The variability within a daily task like "talking on the phone" or "eating" is subtle, since functionally similar programs are performed irrespective of the exact execution. In contrast, the pose distributions of any two different scenarios are likely to contain more widely separated poses, although the manifolds from which this data is sampled may intersect.
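As a rough illustration of how the distributional separation between two scenarios discussed above could be quantified, the sketch below computes a squared maximum mean discrepancy (MMD) between flattened pose vectors. The choice of MMD, the kernel bandwidth, and the random stand-in data are our assumptions for illustration only, not a procedure taken from the paper.

import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    """Squared maximum mean discrepancy between sample sets
    X (n, d) and Y (m, d) under an RBF kernel (biased estimate)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return float(k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean())

# Hypothetical usage: each pose flattened to a 96-dim vector
# (32 joints x 3 coordinates); random stand-ins for real scenario data.
rng = np.random.default_rng(0)
walking = rng.normal(0.0, 1.0, size=(200, 96))
eating = rng.normal(0.5, 1.0, size=(200, 96))
print(f"MMD^2(walking, eating) = {rbf_mmd2(walking, eating, sigma=5.0):.4f}")

A larger MMD between two scenarios indicates more widely separated pose distributions, while a value near zero suggests heavy overlap of the underlying manifolds.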

