I. Introduction
Human motion capture has been an active research topic for years. It is widely used in human-computer interaction [1], virtual reality and extended reality environments [2], [3], creative industries and fashion [4], animation generation [5], medicine and psychology [6], robotics and autonomous driving [7], and video monitoring and security [8]. It enables the accurate recording and analysis of human movements, which facilitates realistic animation, performance evaluation, and human-computer interaction. Traditional motion capture systems, such as marker-based and markerless optical systems, have shown remarkable results but suffer from limitations such as high cost, complex setup, and dependency on controlled environments. In recent years, deep learning has emerged as a promising way to overcome these limitations, enabling more accessible, flexible, and efficient motion capture systems.