I. Introduction
With the development of deep learning [1], [2], recent years have witnessed remarkable advances in object detection [3]. Representative successes include the two-stage R-CNN detectors [4]–[15]: the first stage uses a region proposal network (RPN [4]) to select candidate proposals from dense, predefined bounding boxes (i.e., anchors), and the second stage uses a region-of-interest subnetwork (RoI-subnet) to classify and localize objects. To pursue higher efficiency, one-stage approaches [16]–[23] recognize objects directly from the dense anchors, without generating candidate proposals. Both two-stage and one-stage detectors adopt the anchoring scheme, in which a massive number of anchors (~10^5) are uniformly sampled over an image.
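
To make the scale of the anchoring scheme concrete, the following is a minimal sketch of dense anchor generation, not the configuration of any particular cited detector. It assumes a layout loosely modeled on common feature-pyramid settings: five feature levels with strides 8 to 128, three scales and three aspect ratios per location, and an 800 x 1333 input. The function name `generate_anchors` and all of its parameters (including `base_scale`) are hypothetical choices made for illustration.

```python
import math


def generate_anchors(image_h, image_w,
                     strides=(8, 16, 32, 64, 128),
                     scales=(1.0, 2 ** (1 / 3), 2 ** (2 / 3)),
                     ratios=(0.5, 1.0, 2.0),
                     base_scale=4):
    """Uniformly tile anchors over an image (illustrative sketch only).

    Every parameter is an assumption: `strides` are the feature-map strides,
    `scales` and `ratios` define the anchor shapes at each location, and
    `base_scale` sets the anchor side length relative to the stride.
    Returns a list of (x1, y1, x2, y2) boxes in image coordinates.
    """
    anchors = []
    for stride in strides:
        # Number of feature-map cells needed to cover the image at this stride.
        rows = math.ceil(image_h / stride)
        cols = math.ceil(image_w / stride)
        for i in range(rows):
            for j in range(cols):
                # Anchor center placed at the middle of the cell.
                cx = (j + 0.5) * stride
                cy = (i + 0.5) * stride
                for scale in scales:
                    size = stride * base_scale * scale
                    for ratio in ratios:
                        # ratio = height / width, with area kept at size ** 2.
                        w = size / math.sqrt(ratio)
                        h = size * math.sqrt(ratio)
                        anchors.append((cx - w / 2, cy - h / 2,
                                        cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(800, 1333)
print(len(anchors))  # 200700 under these assumed settings
```

Under these assumed settings the image is covered by 22,300 locations across the five levels, each carrying 9 anchors, which gives about 2 x 10^5 boxes. This is the order of magnitude referred to above, and every one of these boxes must be scored by the detector.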