1. Introduction
In the realm of computer graphics and computer vision, the synthesis and manipulation of human images have evolved into a captivating and transformative field with valuable applications across a range of domains: reposing strives to generate an image of a person in a target pose [2], [43], [45], [47]; virtual try-on aims to seamlessly fit a new garment onto a person [23], [26], [48]; and text-to-image editing manipulates a person's clothing style based on text prompts [5], [11], [12], [40]. However, most approaches address these tasks in isolation, neglecting the benefit of learning them jointly so that they mutually reinforce one another through the auxiliary information provided by related tasks [9], [16], [42]. In addition, few studies have explored effective ways to adapt to unseen in-the-wild human images.
Figure: Results of UniHuman on diverse real-world images. UniHuman learns informative representations by leveraging multiple data sources and the connections between related tasks, achieving high-quality results across various human image editing objectives.