I. Introduction
Pedestrian detection, together with associating each detected pedestrian with their own head, has long been a popular research topic in computer vision. This research has been widely applied in fields such as robotics and security monitoring. Over the past decade, with the rapid growth of computational power, deep learning methods based on Convolutional Neural Networks (CNNs) [1] have been extensively applied to this task. Current mainstream methods fall into two categories: anchor-based [2], [3], [4], [5], [6] and anchor-free [7], [8], [9], [10]. Recently, Carion et al. introduced the Transformer [11] into object detection and proposed DETR [12]. Since then, numerous DETR-based methods [13], [14], [15], [16], [17], [18], [19], [20] have emerged in quick succession, forming a third mainstream category.