1. Introduction
Recent years have witnessed the remarkable success of person search, which aims to match persons appearing in real-world scene images. It is often treated as a joint task consisting of person detection [26], [29], [46] and re-identification (re-id) [32], [43], [36]. To achieve high performance, existing methods are commonly trained in a fully supervised setting [4], [1], [41], [45], [13], [20], [23], [5], where both bounding boxes and identity labels are required. However, annotating both in a large-scale dataset is time-consuming and labor-intensive, which has encouraged some researchers to explore ways of reducing the supervision.