I. Introduction
Human pose estimation is a computer vision technique that underpins a wide range of real-world intelligent applications, such as smart sports [8], healthcare [25], and human behavior cognition [24]. Current human pose estimation techniques typically transform the discrete 2D coordinates of human keypoints into heatmaps and train a deep neural network to predict those heatmaps [4], [38], [7], [33], [16], [17]. This process can be seen as a constrained point distribution optimization problem in the image plane [27]. However, this paradigm focuses solely on predicting each body joint individually from the input image. It does not treat the body parts of one person as a cohesive unit that must be discriminated from the body parts of other people in the image. As a result, body joints can be assigned to the wrong individual in challenging scenarios.
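To make the heatmap encoding described above concrete, the following is a minimal sketch and not the implementation of any cited method. It assumes an unnormalized 2D Gaussian with peak value 1 centered on each keypoint; the function names `keypoints_to_heatmaps` and `heatmaps_to_keypoints` and the parameter `sigma` are hypothetical, and real systems may differ in kernel shape, heatmap resolution, and decoding.

```python
import numpy as np


def keypoints_to_heatmaps(keypoints, height, width, sigma=2.0):
    """Render one Gaussian heatmap per (x, y) keypoint.

    Assumption: an unnormalized Gaussian with peak value 1 at the keypoint;
    `sigma` (in pixels) is an illustrative choice, not taken from the paper.
    """
    keypoints = np.asarray(keypoints, dtype=np.float64)  # shape (K, 2)
    ys = np.arange(height, dtype=np.float64)[:, None]    # shape (H, 1)
    xs = np.arange(width, dtype=np.float64)[None, :]     # shape (1, W)
    heatmaps = np.empty((len(keypoints), height, width))
    for k, (x, y) in enumerate(keypoints):
        # Squared pixel distance to the keypoint, broadcast to (H, W).
        sq_dist = (xs - x) ** 2 + (ys - y) ** 2
        heatmaps[k] = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return heatmaps


def heatmaps_to_keypoints(heatmaps):
    """Decode each heatmap back to the (x, y) location of its maximum."""
    num_joints, height, width = heatmaps.shape
    flat_idx = heatmaps.reshape(num_joints, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (height, width))
    return np.stack([xs, ys], axis=1)
```

In this sketch each heatmap channel describes a single joint type and carries no information about which person the joint belongs to, which is the limitation the paragraph above points out.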