1. Introduction
Multi-person pose estimation is a well-explored area in computer vision, which involves locating the keypoints that correspond to the body parts of each person within an image. It has been adopted in various applications, including human action recognition [39], [40], human body reconstruction [35], [47], and human image generation [15], [21]. Multi-person pose estimation methods can be broadly classified into three categories: top-down [26], [34], [36], [42], [44], bottom-up [5], [8], [22], [38], and one-stage methods [20], [29], [37], [41]. The top-down approach typically relies on an off-the-shelf object detector, which is first applied to identify persons within an image, followed by single-person pose estimation on each detection. In contrast, the bottom-up approach initially detects keypoints in an instance-agnostic manner and subsequently groups them to form individual human instances. Compared to these two-stage approaches, the one-stage method directly outputs a sequence of candidate human poses, yielding improved computational efficiency and thus drawing increased research attention.
[Figure: Comparison of one-stage end-to-end human pose estimation frameworks. (a) The naive method; (b) our proposed diffusion-based approach.]