Introduction
Human pose estimation [1]–[5] is a challenging task in computer vision. When a complicated scenario (such as occlusion or limb entanglement) occurs in the input image, most state-of-the-art estimation methods [6]–[10] may incorrectly identify the key points of non-target individuals as those of the target individual. Such incorrect identifications can propagate and affect the predictions of other key points. For instance, the ankle of the target individual might be misidentified as that of another individual, consequently influencing the perception of the target individual's knee.
Existing methods (e.g., [11], [12]) have largely focused on the development of individual attention modules and neglect the complex interactions among different attention mechanisms. Furthermore, these mechanisms are often combined through a simple cascade or parallel connection. Unfortunately, the limited exchange of information between these modules can hinder the perception of key points, since each module is affected by the ones preceding it, and can negatively impact the accuracy of the final predictions.
Fig. 1. Demonstration of the effect of our proposed perception-enhanced network (PEN), which achieves a sharper focus on the target key point than the baseline model, HRNet [13].
A single attention mechanism usually captures information only in a specific dimension, making it difficult to effectively utilize all extracted features. Most existing methods aggregate spatial and channel attention by combining them through a simple cascade or parallel connection. Such a simple fusion approach tends to yield poor accuracy, especially when dealing with occlusion and limb entanglement. Furthermore, different attentions may also exhibit a mutually inhibitory phenomenon, as observed in our experiments.
To address the above-described issues, we propose a perception-enhanced module that enhances the ability to discern the difference between salient features and average features. This is achieved by exploiting lightweight spatial and channel attentions, which effectively correct errors in feature extraction and are fused with the corresponding attention via a dynamic re-weighting mechanism. The spatial information complements the channel modelling capability, while the channel features interact with spatial ones, enabling adaptive perception of features across both spatial and channel dimensions.
Fig. 2. The structure of our proposed perception-enhanced network, which contains three stages: (1) the backbone to extract shallow features; (2) our proposed perception-enhanced module to extract deep features from the channel and spatial dimensions; (3) generation of key points through a multilayer perceptron.
Fig. 3. Details of our proposed perception-enhanced module, which contains three blocks: (a) spatial attention; (b) channel attention; (c) pixel attention.
In addition, a novel pixel attention mechanism is developed, which is guided by a shallow feature map to further enhance key point sensitivity and mitigate the impact of occlusion on key point prediction.
To demonstrate the effectiveness of our proposed perception-enhanced network, Fig. 1 compares the heat maps produced by our method with those of HRNet [13], which serves as the baseline. HRNet produces incorrect predictions of key points, whereas the heat maps produced by our network are almost identical to the ground truth. Our approach effectively suppresses the responses related to non-target individuals while amplifying those associated with the target individual.
Proposed Method
A. Outline
Fig. 2 depicts the outline of the proposed network. First, HRNet [13] is chosen as the backbone to extract shallow features from the input image, since it can effectively capture complex pose details at different scales. Subsequently, our perception-enhanced module extracts deep features from these shallow features. Finally, the key points are predicted through a multilayer perceptron, which realizes a lightweight classification of coordinates. In the following, a detailed introduction of our network architecture is provided.
B. Perception-enhanced Network (PEN)
For the feature map X ∈ ℝ^{H×W×C} extracted from the backbone (where H denotes the height of the feature map, W denotes the width, and C denotes the number of channels), convolution operations are used as preprocessing. The spatial attention module, as shown in Fig. 3, is designed to perceive the spatial positions and relative relationships of key posture joints or body parts in the image, and is formulated as:
\begin{gather*} {\operatorname{GAP} ^s}{({\mathbf{X}})_{mn}} = \frac{1}{C}\sum\limits_{k = 1}^C {{{\mathbf{X}}_{mnk}}} \tag{1} \\ {\operatorname{GMP} ^s}{({\mathbf{X}})_{mn}} = \mathop {\max }\limits_k \left\{ {{{\mathbf{X}}_{mnk}}} \right\}\tag{2} \\ {{\mathbf{Y}}^s} = \left( {{{\operatorname{GAP} }^s}({\mathbf{X}}) \otimes {{\operatorname{GMP} }^s}({\mathbf{X}})} \right){{\mathbf{W}}_s}\tag{3}\end{gather*}
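For concreteness, the following is a minimal PyTorch sketch of the spatial attention branch in Eqs. (1)–(3). It assumes that the ⊗ operator is read as channel-wise concatenation of the two pooled maps and that W_s is realized as a 7 × 7 convolution; both choices are illustrative rather than prescribed by the text.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of Eqs. (1)-(3): pool over channels, then fuse with a learned W_s."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # W_s: maps the concatenated [GAP^s; GMP^s] map (2 channels) to one attention map
        self.w_s = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        gap = x.mean(dim=1, keepdim=True)         # Eq. (1): average over channels
        gmp = x.amax(dim=1, keepdim=True)         # Eq. (2): max over channels
        y_s = self.w_s(torch.cat([gap, gmp], 1))  # Eq. (3): fuse the two maps with W_s
        return y_s                                # (B, 1, H, W), spatial attention map Y^s
```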
Each channel of the feature maps X represents information related to a key point. It is widely accepted that there exists a certain correlation between different key joints in the human body. Channel attention can effectively capture the dependencies among different key points and suppress noise information. This can be formulated as
\begin{gather*} {\operatorname{GAP} ^c}{({\mathbf{X}})_k} = \frac{1}{{HW}}\sum\limits_{m = 1}^H {\sum\limits_{n = 1}^W {{{\mathbf{X}}_{mnk}}} } \tag{4} \\ {\operatorname{GMP} ^c}{({\mathbf{X}})_k} = \mathop {\max }\limits_{m,n} \left\{ {{{\mathbf{X}}_{mnk}}} \right\}\tag{5} \\ {{\mathbf{Y}}^c} = \operatorname{ReLU} \left( {\left( {{{\operatorname{GAP} }^c}({\mathbf{X}}) \otimes {{\operatorname{GMP} }^c}({\mathbf{X}})} \right){\mathbf{W}}_c^1} \right){\mathbf{W}}_c^2\tag{6}\end{gather*}
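Analogously, a minimal PyTorch sketch of the channel attention branch in Eqs. (4)–(6) is given below, again reading ⊗ as concatenation of the two pooled vectors and realizing W_c^1 and W_c^2 as a bottleneck MLP; the reduction ratio is an illustrative choice.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of Eqs. (4)-(6): pool over space, then a ReLU bottleneck MLP."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.w_c1 = nn.Linear(2 * channels, channels // reduction, bias=False)
        self.w_c2 = nn.Linear(channels // reduction, channels, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        gap = x.mean(dim=(2, 3))  # Eq. (4): average over the spatial dimensions
        gmp = x.amax(dim=(2, 3))  # Eq. (5): max over the spatial dimensions
        y_c = self.w_c2(self.relu(self.w_c1(torch.cat([gap, gmp], dim=1))))  # Eq. (6)
        return y_c                # (B, C), channel attention vector Y^c
```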
Although attention mechanisms are effective in capturing information from a specific dimension, they typically do not consider information from other dimensions within a single attention mechanism. To address this limitation, the Perception-Enhanced Module (PEM) is proposed, depicted in the pink shadow in Fig. 2. The PEM adaptively fuses information from both channel and spatial attentions. Specifically, in the output of channel attention, spatial attention features are dynamically integrated based on the learned content. Similarly, within the spatial attention dimension, channel attention features are adaptively incorporated. In summary, our method can be defined as follows:
\begin{align*} & {{\mathbf{F}}^s} = \left( {\operatorname{Sigmoid} \left( {{{\mathbf{Y}}^s} \odot {\mathbf{X}}} \right)} \right){W^s}\tag{7} \\ & {{\mathbf{F}}^c} = \left( {\operatorname{Sigmoid} \left( {{{\mathbf{Y}}^c} \odot {\mathbf{X}}} \right)} \right){W^c}\tag{8}\end{align*}
Due to the different output dimensions of channel attention and spatial attention, simple addition using the broadcast principle is inadequate. Therefore, incorporating learnable parameters into the adaptive interactions facilitates enhanced feature integration and dimensional normalization. It is worth noting that a sigmoid is applied to normalize the data, which avoids exploding gradients when the values are too large or too small.
Subsequently, through a straightforward fusion process, the spatial dimension enriches the channel information, while the channel dimension bolsters the spatial information, ultimately enhancing the feature expression. This process can be defined as
\begin{equation*}{{\mathbf{Y}}^f} = {{\mathbf{F}}^c} + {{\mathbf{F}}^s}\tag{9}\end{equation*}
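A minimal sketch of the fusion in Eqs. (7)–(9) is given below, assuming that W^s and W^c are realized as 1 × 1 convolutions that bring both branches to a common (B, C, H, W) shape before the addition in Eq. (9); this is one plausible reading of the dimensional normalization described above.

```python
import torch
import torch.nn as nn

class PerceptionEnhancedFusion(nn.Module):
    """Sketch of Eqs. (7)-(9): re-weight X with each attention, project, then add."""
    def __init__(self, channels: int):
        super().__init__()
        self.w_s = nn.Conv2d(channels, channels, kernel_size=1, bias=False)  # W^s
        self.w_c = nn.Conv2d(channels, channels, kernel_size=1, bias=False)  # W^c

    def forward(self, x, y_s, y_c):
        # x: (B, C, H, W); y_s: (B, 1, H, W); y_c: (B, C)
        f_s = self.w_s(torch.sigmoid(y_s * x))                    # Eq. (7)
        f_c = self.w_c(torch.sigmoid(y_c[:, :, None, None] * x))  # Eq. (8)
        return f_c + f_s                                          # Eq. (9): fused map Y^f
```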
Pixel attention, achieved by assigning a weight to each pixel, directs the model to focus more on specific pixels, which aligns well with key point prediction tasks. Therefore, the shallow features are utilized as a reference to conduct pixel attention operations on the fused attention features. This approach further refines the localization of key points while mitigating erroneous information based on the feedback results. This process can be defined as follows
\begin{equation*}{{\mathbf{Y}}^p} = \left( {\left( {{\mathbf{X}} \otimes {{\mathbf{Y}}^f}} \right){\mathbf{W}}_p^1} \right){\mathbf{W}}_p^2\tag{10}\end{equation*}
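A minimal sketch of the pixel attention step in Eq. (10) is shown below, reading ⊗ as channel-wise concatenation of the shallow feature map X with the fused map Y^f and realizing W_p^1 and W_p^2 as 1 × 1 convolutions; these are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """Sketch of Eq. (10): shallow features guide a per-pixel re-weighting of Y^f."""
    def __init__(self, channels: int):
        super().__init__()
        self.w_p1 = nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False)  # W_p^1
        self.w_p2 = nn.Conv2d(channels, channels, kernel_size=1, bias=False)      # W_p^2

    def forward(self, x, y_f):
        # x: shallow features (B, C, H, W); y_f: fused attention features (B, C, H, W)
        return self.w_p2(self.w_p1(torch.cat([x, y_f], dim=1)))  # Eq. (10): output Y^p
```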
C. Coordinate Classifier
After obtaining the refined feature maps Y^p, the x and y coordinates of the key points are encoded into two separate one-dimensional vectors. The length of each vector equals twice the side length of the image, facilitating sub-pixel positioning. This process can be formulated as
\begin{gather*} \left\{ {{{\mathbf{K}}_i}} \right\}_{i = 1}^{\mathcal{C}} = \operatorname{Flatten} \left( {{{\mathbf{Y}}^p}} \right)\tag{11} \\ {{\mathbf{x}}_i} = \operatorname{Softmax} \left( {{{\mathbf{K}}_i}{{\mathbf{W}}_x} + {{\mathbf{b}}_x}} \right)\tag{12} \\ {{\mathbf{y}}_i} = \operatorname{Softmax} \left( {{{\mathbf{K}}_i}{{\mathbf{W}}_y} + {{\mathbf{b}}_y}} \right)\tag{13}\end{gather*}
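The classifier in Eqs. (11)–(13) can be sketched as follows, in the spirit of SimCC-style 1-D coordinate classification. It assumes Y^p has one channel per key point and that each classification vector is twice the corresponding image side length for sub-pixel localization; the names img_h and img_w are illustrative.

```python
import torch
import torch.nn as nn

class CoordinateClassifier(nn.Module):
    """Sketch of Eqs. (11)-(13): flatten per-key-point features, classify x and y."""
    def __init__(self, feat_h: int, feat_w: int, img_h: int, img_w: int):
        super().__init__()
        flat = feat_h * feat_w
        self.w_x = nn.Linear(flat, 2 * img_w)  # Eq. (12): x-axis classifier (bias b_x)
        self.w_y = nn.Linear(flat, 2 * img_h)  # Eq. (13): y-axis classifier (bias b_y)

    def forward(self, y_p: torch.Tensor):
        # y_p: (B, K, H, W), one feature map per key point
        k = torch.flatten(y_p, start_dim=2)   # Eq. (11): K_i of shape (B, K, H*W)
        x_prob = self.w_x(k).softmax(dim=-1)  # Eq. (12): distribution over x bins
        y_prob = self.w_y(k).softmax(dim=-1)  # Eq. (13): distribution over y bins
        return x_prob, y_prob
```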
Experiments
A. Experimental Setup
Extensive experiments are conducted on the datasets COCO [14] and MPII [15]. As widely practiced in the literature, four evaluation metrics are used: AP (Average Precision), AR (Average Recall), OKS (Object Key Point Similarity) and PCKh (Percentage of Correct Key points of Head).
Following the HRNet [13] baseline model, we adhere closely to the original paper's configurations. The learning rate starts at 0.001 and decays by a factor of 0.1 at the 170th and 200th epochs, respectively, with training concluding within 210 epochs. In this paper, the two-stage top-down human pose estimation pipeline [6], [25], [29], [38] is used: the individual instances are first detected and then the key points are estimated. A popular person detector provided by [6], with 56.4% AP on the COCO validation set, is adopted. The experiments are conducted using 4 NVIDIA TITAN Xp GPUs.
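For reference, a minimal sketch of this optimization schedule in PyTorch is shown below. Only the learning rate, the decay milestones, and the epoch budget come from the text; the stand-in model, the Adam optimizer, and the training-loop placeholder are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 17, kernel_size=1)  # stand-in for the pose network (assumption)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[170, 200], gamma=0.1)  # decay by 0.1 at epochs 170 and 200

for epoch in range(210):  # training concludes within 210 epochs
    # ... one training epoch over the COCO / MPII training split would run here ...
    scheduler.step()
```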
B. Comparison with the State-of-the-Art Methods
Our method exhibits superior performance compared to previous methods [6], [11]–[13], [16]–[19] on the COCO validation set and test-dev set, as summarized in Table I and Table II. On the validation set, we outperform the baseline models HRNet-W32 and HRNet-W48 [13] by 0.9 and 0.6 AP, respectively. Additionally, even compared to the AFC model [11], which addresses occlusion issues, our network achieves improvements of 0.1 and 0.4 AP. Notably, our network achieves even larger gains on the test-dev set, improving over the baseline models by 0.8 and 1.1 AP. Specifically, when the input image size is 384 × 288, our network achieves a 0.4 AP improvement over the baseline model, and a 0.2 AP improvement over the AFC model designed to address occlusion problems.
On the MPII validation set, extensive experiments are conducted against previous methods [6], [13], [20]–[22], with our results evaluated at an input size of 256 × 256. As shown in Table III, our approach has significant advantages. To further verify the effectiveness of our method, a comparison with HRNet under PCKh@0.1 is conducted, as reported in Table IV; the results show that our method retains a clear advantage at input sizes of both 64 × 64 and 256 × 256.
To demonstrate the superiority of our model, we selected six test images from the COCO dataset, involving occlusion and multiple individuals. The results are shown in Fig. 4. Some of the key points predicted by HRNet are incorrectly associated with non-target individuals, marked as yellow dots and lines. In contrast, our method accurately localizes all key points on the target individual, marked as green dots and lines. Notably, even in the presence of occlusions, our model maintains high accuracy in key point localization. This comparison clearly shows that our proposed model outperforms existing methods in terms of precision.
Fig. 4. A comparison between HRNet [13] (left) and our proposed method (right) using six test images. In each comparison, HRNet produces some incorrect predictions (highlighted in yellow), while our method generates the correct predictions (highlighted in green). Note that the blue lines indicate areas where both methods produce similar results.
C. Ablation Study
To investigate the performance of our proposed method, an ablation study is conducted on the COCO validation set, and Table V documents the results. In this study, Case 1 serves as the baseline, constructed from HRNet [13]. Next, in Case 2, we incorporate both channel attention and spatial attention in a cascade configuration, leading to a slight improvement of 0.2 AP. Case 3, on the other hand, integrates the same attention mechanisms in parallel, which yields a more notable improvement of 0.4 AP. These results suggest that parallel attention is more effective in capturing complex spatial dependencies than the cascade approach. Case 4 is finally created by adding our PEM, which further boosts AP by 1.1.
Conclusion
A novel network that integrates spatial, channel, and pixel attention is proposed to address the challenges caused by occlusion and limb entanglement. Initially, spatial and channel attention mechanisms arranged in parallel are employed to extract relevant features. During this process, a novel approach is utilized to overcome the limitations of single-dimensional attention, enabling the consideration of information from other attention mechanisms within one attention process. Subsequently, guided by the original feature map, incorrect information is suppressed. Extensive experiments validate the effectiveness of our approach. We believe this study offers a novel solution to occlusion-related issues in human pose estimation.