I. Introduction
3D human pose estimation is one of the most important problems in computer vision and is closely related to human motion analysis, action recognition, and human-computer interaction [1], [9], [28], [36], [55]. A common approach to this problem is to predict the 3D keypoint coordinates of a predefined human skeleton from single-view [34] or multi-view RGB images [27], [41], [47]. In this approach, it is widely known that occlusion and self-occlusion in images introduce stochastic errors when image features are extracted. As shown in Fig. 1, occlusion is inevitable in visual data. It can significantly degrade performance and result in implausible human poses [22], [58].