1. Introduction
High-fidelity human digitization is the key to enabling a myriad of applications, from medical imaging to virtual reality. While metrically accurate and precise reconstructions of humans are now possible with multi-view systems [12], [26], they have remained largely inaccessible to the general community because they rely on professional capture systems with strict environmental constraints (e.g., a large number of cameras, controlled illumination) that are prohibitively expensive and cumbersome to deploy. Increasingly, the community has turned to high-capacity deep learning models, which have shown great promise in acquiring reconstructions from even a single image [19], [42], [30], [1]. However, the performance of these methods currently remains significantly lower than what is achievable with professional capture systems.
Given a high-resolution single image of a person, we recover highly detailed 3D reconstructions of clothed humans at 1k resolution.