I. Introduction
Single-person pose estimation, also known as human keypoint detection, aims to locate the coordinates of the keypoints or joints of the human body from image sensor input data, and it has become a fundamental and challenging problem in computer vision. It has many application scenarios, including human behavior recognition [1], human-computer interaction, and distracted driving behavior detection [2]. With the development of deep convolutional neural networks (DCNNs) and their excellent performance, human pose estimation based on DCNNs has also made significant progress. Most existing state-of-the-art (SOTA) pose estimation methods [3], [4], [5], [6] achieve good detection accuracy; however, they usually come with a complex network structure and high resource consumption, which limits their deployment on resource-limited devices such as robots, cars, and monitoring equipment. To achieve good accuracy, low cost, and real-time performance, many efficient pose estimation methods have been proposed. These can be mainly divided into two categories: conventional lightweight (LW) networks [7], [8], [9] and efficient knowledge distillation networks [10], [11], [12], [13]. Although conventional lightweight networks are generally concise, pose estimation methods based on knowledge distillation have received increasing attention and achieve a good balance between detection accuracy and deployment cost. Traditional two-stage offline pose distillation schemes [10], [11] distill pose knowledge from a heavy pre-trained pose estimator (the teacher model) to a lightweight, compact pose estimator (the student model). This process is usually time-consuming, and strong teacher models are not always available. One-stage online multibranch pose distillation schemes [12] have therefore been proposed to reduce the complexity and tediousness of model training in the traditional distillation process, and they do not require a large pre-trained teacher model.
Although these methods use knowledge distillation to compress model parameters and reduce the training cost while maintaining high accuracy, several problems remain to be solved. First, current top-performing pose distillation methods rely on complex and heavy basic building blocks and neglect to design or use lightweight structures that would reduce computational cost and model parameters. Second, existing online pose distillation schemes rely on a teacher model composed of redundant student models and do not explore how the number of student models affects the performance of the final target model. Finally, keypoints that are invisible because of blurry appearance, occlusion, and similar factors are more difficult to detect.