1. Introduction
3D human pose estimation (HPE) is the task of predicting the 3D coordinates of human joints from images or videos. It serves as the foundation for various applications, including person re-identification [29], action recognition [10], [20], [34], human mesh recovery [44], [45], and virtual reality [9], [35]. However, annotated 3D data are often collected in controlled laboratory environments for convenience, featuring indoor settings and a limited set of actions performed by few individuals. As a result, pose estimators trained on these labeled datasets struggle to generalize to varied in-the-wild scenarios. Hence, domain generalization (DG) is pivotal for incorporating knowledge from labeled (source) data into a pose estimator that generalizes well to unseen (target) data. Unlike domain adaptation (DA), which involves training with target data, DG relies solely on the source data as a reference, without any prior information about the target data.
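To make the DG setting concrete, the following is a minimal sketch of a 2D-to-3D lifting estimator trained only on labeled source pairs; the LiftingNet architecture, the train_dg helper, and the 17-joint layout are illustrative assumptions for this sketch, not the method proposed in this paper.

```python
import torch
import torch.nn as nn

NUM_JOINTS = 17  # assumed joint count; common in benchmarks such as Human3.6M


class LiftingNet(nn.Module):
    """Minimal 2D-to-3D lifting estimator: 2D keypoints in, 3D joints out."""

    def __init__(self, num_joints: int = NUM_JOINTS, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_joints * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_joints * 3),
        )

    def forward(self, pose_2d: torch.Tensor) -> torch.Tensor:  # (B, J, 2)
        b = pose_2d.shape[0]
        out = self.net(pose_2d.reshape(b, -1))
        return out.reshape(b, -1, 3)  # (B, J, 3)


def train_dg(estimator, source_loader, optimizer):
    """Domain generalization: only labeled source (2D, 3D) pairs are visible.
    No target-domain samples are used anywhere during training, unlike DA."""
    criterion = nn.MSELoss()
    for pose_2d, pose_3d in source_loader:
        optimizer.zero_grad()
        loss = criterion(estimator(pose_2d), pose_3d)
        loss.backward()
        optimizer.step()


if __name__ == "__main__":
    model = LiftingNet()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Dummy source batch standing in for a labeled laboratory dataset.
    dummy_loader = [(torch.randn(8, NUM_JOINTS, 2), torch.randn(8, NUM_JOINTS, 3))]
    train_dg(model, dummy_loader, opt)
```

Evaluation on the unseen target domain then uses the trained estimator as-is, which is what makes the diversity of the source (or of its augmentations) decisive for generalization.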
Comparison between existing single-augmentor frameworks and our proposed dual-augmentor framework on a toy example. Current single-augmentor methods excel at simulating Target Domain 2 but exhibit limitations in simulating Target Domain 1, which closely resembles the source, and Target Domain 3, which deviates significantly from the source. In our framework, the weak augmentor excels at simulating Target Domain 1, while the strong augmentor effectively imitates both Target Domains 2 and 3.