1. Introduction
As one of the fundamental computer vision tasks, object detection has attracted increasing attention in various real-world applications, including autonomous driving and surveillance video analysis. Recent advances in deep learning have introduced many convolutional neural network based solutions to object detection. The backbone of a detector is often composed of heavy convolution operations that produce dense features critical to detection accuracy, but this inevitably leads to a sharp increase in computational cost and an apparent decrease in detection speed. Techniques such as quantization [19], [58], [31], [57], [62], pruning [2], [17], [20], network design [55], [49], [15], [18] and knowledge distillation [56], [6] have been developed to overcome this dilemma and achieve efficient inference on the detection task. We are particularly interested in knowledge distillation [24], as it provides an elegant way to learn a compact student network when a teacher network of proven performance is available. Classical knowledge distillation methods were first developed for the classification task, which decides the category an image belongs to. The information from the soft label outputs [24], [28], [38], [13] or intermediate features [1], [23], [66] of a well-optimized teacher network has been well exploited to learn student networks, but these methods cannot be directly extended to the detection task, which must further figure out where the objects are.
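To make the classical soft-label objective [24] concrete, the following is a minimal sketch in PyTorch: the student is trained to match the teacher's temperature-softened output distribution via a KL divergence. The function name, the temperature value, and the T² scaling convention are illustrative, not a specific method from the works cited above.

```python
import torch
import torch.nn.functional as F


def soft_label_distillation_loss(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor,
                                 temperature: float = 4.0) -> torch.Tensor:
    """Classical soft-label distillation: KL divergence between the
    temperature-softened teacher and student class distributions."""
    t = temperature
    # Softening with t > 1 exposes the teacher's "dark knowledge"
    # (relative probabilities of incorrect classes).
    soft_targets = F.softmax(teacher_logits / t, dim=1)
    log_student = F.log_softmax(student_logits / t, dim=1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * (t * t)
```

In practice this term is combined with the ordinary cross-entropy loss on ground-truth labels; for detection, as the paragraph notes, such a classification-only objective is insufficient because localization must also be supervised.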