1. Introduction
Object detection has been intensively investigated by the computer vision community for decades. Traditional detectors built on deep convolutional neural networks (CNNs) are either anchor-based [13], [32], [26] or anchor-free [30], [37], [46]. The former perform classification and regression relative to pre-defined, densely tiled bounding boxes, while the latter assume only grid points in the 2D image plane. Orthogonally to this distinction, detection can be completed in a single stage, in two stages, or in multiple cascaded stages. Single-stage methods output predictions directly, without further refinement, and are usually simple and efficient. Two- and multi-stage methods repeatedly correct the results of earlier stages, which yields better results at the cost of more model parameters and computation. Except for the first stage, later stages are usually region-based: they focus on the local region within each bounding box, which is often realized by RoI Align [15].
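To make the anchor-based versus anchor-free distinction concrete, the following is a minimal sketch of the two kinds of priors for a single feature map. It is illustrative only and not taken from any of the cited detectors: the function names `grid_points` and `tiled_anchors`, the cell-center offset convention, and the stride, scale, and aspect-ratio values are all assumptions.

```python
from itertools import product
from math import sqrt


def grid_points(height, width, stride):
    """Anchor-free prior: one (x, y) point per feature-map cell.

    Points are placed at cell centers in image coordinates
    (an assumed convention; others offset differently).
    """
    return [((j + 0.5) * stride, (i + 0.5) * stride)
            for i, j in product(range(height), range(width))]


def tiled_anchors(height, width, stride, scales, ratios):
    """Anchor-based prior: boxes (x1, y1, x2, y2) tiled on every grid point.

    Each box has area scale**2 and aspect ratio h / w = ratio.
    """
    anchors = []
    for cx, cy in grid_points(height, width, stride):
        for scale, ratio in product(scales, ratios):
            w = scale / sqrt(ratio)
            h = scale * sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# Hypothetical 2x3 feature map with stride 8, one scale, three aspect ratios.
points = grid_points(2, 3, stride=8)
anchors = tiled_anchors(2, 3, stride=8, scales=(32,), ratios=(0.5, 1.0, 2.0))
print(len(points), len(anchors))
```

In this sketch the anchor-based prior produces one box per scale and ratio at every location, so its size grows with the number of box shapes, whereas the anchor-free prior has exactly one point per location.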