I. Introduction
In educational environments, human pose estimation is crucial for understanding student behavior and engagement, offering key insights for assessing teaching effectiveness. However, capturing human poses accurately in dynamic and complex classroom settings is difficult: frequent occlusions, overlapping figures, and the diversity of student postures all complicate the extraction of skeletal structures [1]. Although advances in pose estimation technology have improved the interpretation of human postures, consistently extracting precise skeletal details in such scenarios remains an open problem [2].