1. Introduction
3D human pose estimation is an essential computer vision task that aims to estimate the 3D coordinates of body joints from single-frame images or videos. Its output supports a range of downstream tasks, including multiple object tracking [1], [15], [13], [21], person re-identification [45], action recognition [44], robotics [47], human body reconstruction [14], and sports applications [64]. However, large-scale 3D-annotated datasets are hard to obtain. Existing methods are therefore usually built on off-the-shelf 2D pose estimators [9], [46], following a two-stage scheme.