I. Introduction
3D human pose estimation is widely used in human-machine interaction, autonomous driving, assistive medical care, etc. Traditional monocular 3D human pose estimation methods typically employ convolutional and fully connected layers to predict 3D human joints. To better exploit 2D human pose estimation and improve accuracy, a typical 3D human pose estimation pipeline has two stages: first, 2D human joint positions are obtained with a 2D pose estimator; second, they are mapped to the corresponding 3D joint positions, e.g., SimpleBaseline3D [1] and VideoPose3D [2]. Since the introduction of PoseFormer [3], the Transformer [4] has become a promising foundational architecture with strong performance [1], [5], [6], [7]. However, the persistent challenges of location uncertainty and depth ambiguity remain unsolved due to the absence of depth information.
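To make the two-stage pipeline concrete, the sketch below shows a minimal fully connected 2D-to-3D lifting network in the spirit of SimpleBaseline3D [1]. It is only an illustrative assumption: the class name, joint count, and layer widths are hypothetical choices, not the implementation of any cited method.

```python
import torch
import torch.nn as nn

class LiftingNetwork(nn.Module):
    """Stage 2 of the two-stage pipeline: lift detected 2D joints to 3D.
    A minimal fully connected lifter; sizes are illustrative only."""
    def __init__(self, num_joints=17, hidden=1024):
        super().__init__()
        in_dim = num_joints * 2    # (x, y) per joint
        out_dim = num_joints * 3   # (x, y, z) per joint
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, joints_2d):
        # joints_2d: (batch, num_joints, 2), produced by any 2D pose
        # estimator in stage 1 (not shown here)
        b = joints_2d.shape[0]
        out = self.net(joints_2d.reshape(b, -1))
        return out.reshape(b, -1, 3)  # (batch, num_joints, 3)

# Dummy stage-1 output standing in for real 2D detections
joints_2d = torch.randn(8, 17, 2)
pose_3d = LiftingNetwork()(joints_2d)
print(pose_3d.shape)  # torch.Size([8, 17, 3])
```

Because such a lifter sees only 2D coordinates, several distinct 3D poses can project to the same 2D input, which is one concrete form of the depth ambiguity discussed above.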