I. Introduction
With the remarkable performance achieved by Convolutional Neural Network (CNN) models on multiple computer vision tasks such as image classification [1]–[3], object detection [4]–[10], and image segmentation [11]–[13], more advanced neural networks with various strengths have been proposed to improve performance further. However, in object detection, fully supervised methods that rely on a massive number of anchor boxes [4]–[6] or anchor points [7]–[9] remain in the majority, which compels researchers to devote considerable effort to precisely annotating the coordinates of the ground-truth box for each object before training.