Efficient Pose Estimation via a Lightweight Single-Branch Pose Distillation Network


Abstract:

Accurate lightweight (LW) pose estimation remains a challenging task because of diverse human poses and complex backgrounds in 2-D human images. To address these problems, we propose a lightweight single-branch pose distillation network, termed LSPD: a lightweight yet powerful fully convolutional pose network that runs quickly at a low computational cost while estimating poses accurately. First, we introduce an efficient end-to-end sequential pose distillation framework that uses a small number of lightweight but strong pose estimation stages to effectively transfer the pose knowledge of our teacher model. Second, we construct a compact and strong pose estimation stage that uses a lightweight multiscale residual block to enhance the model's ability to represent image features and image-dependent spatial features while also reducing computational cost. Finally, once training is complete, we deploy only the backbone network and the first student stage as a simple architecture. Extensive experiments demonstrate that the proposed method achieves high accuracy with a small number of model parameters.
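The train-then-truncate deployment described in the abstract can be illustrated with a small sketch. It assumes a hypothetical `StagedPoseNet` made of a shared backbone followed by an ordered tuple of pose estimation stages, where every stage takes part in training but only the backbone and the first student stage are kept for inference. The class names and the parameter counts in the usage lines are invented for illustration; they are not figures reported for LSPD.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Module:
    """A named network component with a hypothetical parameter count."""
    name: str
    params: int


@dataclass(frozen=True)
class StagedPoseNet:
    """Hypothetical layout: shared backbone + ordered pose estimation stages.

    stages[0] is the first student stage; in this sketch the later stages
    take part only in training and are dropped before deployment.
    """
    backbone: Module
    stages: Tuple[Module, ...]

    def total_params(self) -> int:
        return self.backbone.params + sum(stage.params for stage in self.stages)

    def deployable(self) -> "StagedPoseNet":
        # Keep only the backbone and the first student stage for inference.
        if not self.stages:
            raise ValueError("the network has no pose stages")
        return StagedPoseNet(self.backbone, self.stages[:1])


# Usage with made-up sizes: three stages during training, one at deployment.
train_net = StagedPoseNet(
    backbone=Module("backbone", 1_000_000),
    stages=(Module("stage1", 300_000), Module("stage2", 300_000), Module("stage3", 300_000)),
)
deploy_net = train_net.deployable()
print(train_net.total_params(), deploy_net.total_params())  # 1900000 1300000
```

Because the discarded stages never run at inference time, they add no parameters or computation to the deployed model; only the quality of the backbone and the first stage matters after deployment.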
Published in: IEEE Sensors Journal (Volume: 23, Issue: 22, 15 November 2023)
Page(s): 27709-27719
Date of Publication: 13 October 2023


I. Introduction

Single-person pose estimation, also known as human keypoint detection, aims to locate the coordinates of the keypoints (joints) of the human body from image sensor data, and it has become a fundamental yet challenging problem in computer vision. It has many applications, including human behavior recognition [1], human-computer interaction, and distracted-driving detection [2]. With the development of deep convolutional neural networks (DCNNs) and their excellent performance, DCNN-based human pose estimation has also made significant progress. Most existing state-of-the-art (SOTA) pose estimation methods [3], [4], [5], [6] achieve good detection accuracy; however, they usually come with complex network structures and high resource consumption, which limits their deployment on resource-limited devices such as robots, cars, and monitoring equipment. To achieve good accuracy, low cost, and real-time performance, many efficient pose estimation methods have been proposed. They fall mainly into two categories: conventional lightweight (LW) networks [7], [8], [9] and efficient knowledge distillation networks [10], [11], [12], [13].

Although conventional lightweight networks are generally concise, pose estimation methods based on knowledge distillation have received increasing attention and offer a good balance between detection accuracy and deployment cost. Traditional two-stage offline pose distillation schemes [10], [11] distill pose knowledge from a heavy pretrained pose estimator (the teacher model) into a compact lightweight pose estimator (the student model). This process is usually time-consuming, however, and strong teacher models are not always available. One-stage online multibranch pose distillation schemes [12] were therefore proposed to reduce the complexity and tedium of model training in the traditional distillation process; they also remove the need for a large pretrained teacher model.
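The difference between the two distillation regimes can be sketched in plain Python. The sketch assumes, as is common for fully convolutional pose estimators, that each network outputs one heatmap per keypoint and that the student is trained on a mean-squared-error term against the teacher's heatmaps blended with a ground-truth term through a weight `alpha`. In the offline case the teacher heatmaps come from a frozen, pretrained model; in the online case they are formed during training as a weighted average of the student branches. The function names, the fixed averaging weights, and the default `alpha` are illustrative assumptions, not the exact losses of [10] or [12] (the aggregation in [12] need not use fixed weights).

```python
from typing import List, Sequence

# One H x W heatmap per keypoint; a "stack" holds all keypoints of one image.
HeatmapStack = Sequence[Sequence[Sequence[float]]]


def _flatten(stack: HeatmapStack) -> List[float]:
    return [value for heatmap in stack for row in heatmap for value in row]


def mse(a: HeatmapStack, b: HeatmapStack) -> float:
    """Mean squared error over every keypoint and pixel of two heatmap stacks."""
    fa, fb = _flatten(a), _flatten(b)
    if not fa or len(fa) != len(fb):
        raise ValueError("heatmap stacks must be non-empty and of equal size")
    return sum((x - y) ** 2 for x, y in zip(fa, fb)) / len(fa)


def distillation_loss(student, teacher, ground_truth, alpha=0.5):
    """Blend of a teacher (soft-target) term and a ground-truth term.
    Offline scheme: `teacher` holds a frozen, pretrained model's heatmaps."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * mse(student, teacher) + (1.0 - alpha) * mse(student, ground_truth)


def ensemble_teacher(branches, weights):
    """Online scheme: build the teacher on the fly as a weighted average of the
    student branches' heatmaps (fixed weights in this sketch)."""
    if not branches or len(branches) != len(weights) or sum(weights) <= 0:
        raise ValueError("need one weight per branch with a positive sum")
    total = sum(weights)
    n_kpt, height, width = len(branches[0]), len(branches[0][0]), len(branches[0][0][0])
    return [
        [
            [sum(w * b[k][i][j] for w, b in zip(weights, branches)) / total for j in range(width)]
            for i in range(height)
        ]
        for k in range(n_kpt)
    ]


def online_kd_losses(branches, weights, ground_truth, alpha=0.5):
    """Per-branch loss against the ensemble teacher and the ground truth.
    A real framework would stop gradients through the ensemble for the
    distillation term; plain Python has no gradients, so that is not shown."""
    teacher = ensemble_teacher(branches, weights)
    return [distillation_loss(b, teacher, ground_truth, alpha) for b in branches]


gt = [[[1.0, 0.0]]]                      # one keypoint, 1 x 2 heatmap
b1, b2 = [[[1.0, 0.0]]], [[[0.0, 1.0]]]  # two student branches
print(ensemble_teacher([b1, b2], [1, 1]))      # [[[0.5, 0.5]]]
print(online_kd_losses([b1, b2], [1, 1], gt))  # [0.125, 0.625]
```

The same blended loss serves both regimes; what changes is where the teacher heatmaps come from, which is exactly why the online variant needs no separately pretrained teacher.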
Although these distillation-based methods compress model parameters and reduce training cost while maintaining high accuracy, several problems remain. First, current top-performing pose distillation methods rely on complex and heavy basic building blocks and neglect to design or use lightweight structures that reduce computational cost and model parameters. Second, existing online pose distillation schemes rely on a teacher model composed of redundant student models and do not explore how the number of student models affects the performance of the final target model. Finally, invisible keypoints remain difficult to detect because of blurry appearance, occlusion, and similar factors.
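The cost argument behind the first two problems can be made concrete with parameter arithmetic. The sketch below compares a standard k x k convolution with a depthwise-separable one, a widely used lightweight building block chosen here purely for illustration (the source does not state that LSPD's multiscale residual block is built this way), and shows how a hypothetical shared-backbone, multibranch model grows with the number of student branches. All function names and sizes are hypothetical, and the formulas assume stride 1, "same" padding, and no bias.

```python
def standard_conv_cost(k: int, c_in: int, c_out: int, h: int, w: int):
    """(weights, multiply-accumulates) of a k x k convolution producing an
    h x w output map; assumes stride 1, 'same' padding, and no bias."""
    weights = k * k * c_in * c_out
    return weights, weights * h * w


def depthwise_separable_cost(k: int, c_in: int, c_out: int, h: int, w: int):
    """Same assumptions: a k x k depthwise convolution (one filter per input
    channel) followed by a 1 x 1 pointwise convolution that mixes channels."""
    weights = k * k * c_in + c_in * c_out
    return weights, weights * h * w


def multibranch_params(backbone: int, branch: int, n_branches: int) -> int:
    """Hypothetical online-distillation model: a shared backbone plus n
    identical student branches, so training-time size grows linearly with n."""
    return backbone + n_branches * branch


std = standard_conv_cost(3, 128, 128, 64, 64)
dws = depthwise_separable_cost(3, 128, 128, 64, 64)
print(std)                        # (147456, 603979776)
print(dws)                        # (17536, 71827456)
print(round(std[0] / dws[0], 2))  # 8.41
print([multibranch_params(1_000_000, 300_000, n) for n in (1, 2, 3, 4)])
# [1300000, 1600000, 1900000, 2200000]
```

For k = 3 and 128 channels the depthwise-separable layer needs about 8.4 times fewer weights; in general the reduction factor is 1 / (1/C_out + 1/k^2), which approaches k^2 as C_out grows.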

References
[1] M. M. E. Yurtsever and S. Eken, "BabyPose: Real-time decoding of baby’s non-verbal communication using 2D video-based pose estimation", IEEE Sensors J., vol. 22, no. 14, pp. 13776-13784, Jul. 2022.
[2] L. Ye et al., "Using CNN and channel attention mechanism to identify driver’s distracted behavior", Trans. Edutainment, vol. 16, pp. 175-183, Apr. 2020.
[3] A. Toshev and C. Szegedy, "DeepPose: Human pose estimation via deep neural networks", Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 1653-1660, Jun. 2014.
[4] S.-E. Wei, V. Ramakrishna, T. Kanade and Y. Sheikh, "Convolutional pose machines", Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 4724-4732, Jun. 2016.
[5] W. Yang, S. Li, W. Ouyang, H. Li and X. Wang, "Learning feature pyramids for human pose estimation", Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 1290-1299, Oct. 2017.
[6] Y. Zhou, H. Dong and A. E. Saddik, "Learning to estimate 3D human pose from point cloud", IEEE Sensors J., vol. 20, no. 20, pp. 12334-12342, Oct. 2020.
[7] Z. Cao, T. Simon, S.-E. Wei and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields", Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 1302-1310, Jul. 2017.
[8] J. Zhou et al., "Resource management for improving soft-error and lifetime reliability of real-time MPSoCs", IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 12, pp. 2215-2228, Dec. 2019.
[9] W. Wang, K. Zhang, H. Ren, D. Wei, Y. Gao and J. Liu, "UULPN: An ultra-lightweight network for human pose estimation based on unbiased data processing", Neurocomputing, vol. 480, pp. 220-233, Apr. 2022.
[10] F. Zhang, X. Zhu and M. Ye, "Fast human pose estimation", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3512-3521, Jun. 2019.
[11] D.-H. Hwang, S. Kim, N. Monet, H. Koike and S. Bae, "Lightweight 3D human pose estimation network training using teacher–student learning", Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), pp. 468-477, Mar. 2020.
[12] Z. Li, J. Ye, M. Song, Y. Huang and Z. Pan, "Online knowledge distillation for efficient pose estimation", Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pp. 11720-11730, Oct. 2021.
[13] S. Zhang, B. Qiang, X. Yang, M. Zhou and R. Chen, "Knowledge distillation for lightweight 2D single-person pose estimation", J. Circuits Syst. Comput., vol. 32, no. 3, 2022.
[14] P. F. Felzenszwalb and D. P. Huttenlocher, "Pictorial structures for object recognition", Int. J. Comput. Vis., vol. 61, no. 1, pp. 55-79, Jan. 2005.
[15] M. Andriluka, S. Roth and B. Schiele, "Pictorial structures revisited: People detection and articulated pose estimation", Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 1014-1021, Jun. 2009.
[16] S. Johnson and M. Everingham, "Clustered pose and nonlinear appearance models for human pose estimation", Proc. BMVC, pp. 1-11, Sep. 2010.
[17] L. Pishchulin, M. Andriluka, P. Gehler and B. Schiele, "Poselet conditioned pictorial structures", Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 588-595, Jun. 2013.
[18] L. Pishchulin, M. Andriluka, P. Gehler and B. Schiele, "Strong appearance and expressive spatial models for human pose estimation", Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 3487-3494, Dec. 2013.
[19] B. Sapp, A. Toshev and B. Taskar, "Cascaded models for articulated pose estimation", Proc. Eur. Conf. Comput. Vis. (ECCV), pp. 406-420, Sep. 2010.
[20] B. Sapp and B. Taskar, "MODEC: Multimodal decomposable models for human pose estimation", Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3674-3681, Jun. 2013.
[21] X. Chen and A. Yuille, "Articulated pose estimation by a graphical model with image dependent pairwise relations", Proc. Adv. Neural Inf. Process. Syst. (NIPS), pp. 1736-1744, Dec. 2014.
[22] A. Cherian, J. Mairal, K. Alahari and C. Schmid, "Mixing body-part sequences for human pose estimation", Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 2361-2368, Jun. 2014.
[23] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition", arXiv:1409.1556, 2014.
[24] C. Szegedy et al., "Going deeper with convolutions", Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 1-9, Jun. 2015.
[25] J. Hu, L. Shen and G. Sun, "Squeeze-and-excitation networks", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 7132-7141, Jun. 2018.
[26] J. J. Tompson, A. Jain, Y. LeCun and C. Bregler, "Joint training of a convolutional network and a graphical model for human pose estimation", Proc. Adv. Neural Inf. Process. Syst. (NIPS), pp. 1799-1807, Dec. 2014.
[27] A. Newell, K. Yang and J. Deng, "Stacked hourglass networks for human pose estimation", Proc. Eur. Conf. Comput. Vis. (ECCV), pp. 483-499, Oct. 2016.
[28] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille and X. Wang, "Multi-context attention for human pose estimation", Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 5669-5678, Jul. 2017.
[29] K. Sun, B. Xiao, D. Liu and J. Wang, "Deep high-resolution representation learning for human pose estimation", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 5686-5696, Jun. 2019.
[30] B. Cheng, B. Xiao, J. Wang, H. Shi, T. S. Huang and L. Zhang, "HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 5385-5394, Jun. 2020.
