I. Introduction
Crowd counting aims to estimate the number of people and the crowd density distribution in an image, and is generally formulated as the estimation of a crowd density map [1]. The ground-truth density map is obtained by convolving the annotated head locations with a Gaussian kernel. Supervised learning methods [2], [3], [4], [5], [6] are frequently used in crowd counting, but manually labeling the location of every person is time-consuming, especially for images containing thousands of people. Moreover, we observe that most existing models work well only when the test dataset is similar to the training dataset. A real scene, with its varying scale, occlusion, nonuniform distribution, and background clutter, may differ substantially from the training dataset. Consequently, new images must be added and labeled manually whenever a model is applied to a new scene, and this labeling cost limits practical application. It is therefore necessary to reduce the tedious annotation work and to improve the learning efficiency of models trained with limited data. To this end, synthetic data were used to train a counting model in [7]. However, the distribution shift between synthetic and real data degrades model performance in real crowd scenes. Semi-supervised crowd counting (SSCC) methods [8], [9], [10], [11], [12], [13] instead utilize a large amount of unlabeled data to train the counting model, introducing surrogate tasks to exploit the information contained in the unlabeled data.
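To make the ground-truth construction mentioned above concrete, the following is a minimal sketch of how a density map can be built from head annotations by placing one normalized Gaussian per head. It is an illustration under stated assumptions, not the procedure of any cited method: the function names `density_map` and `count`, the `(row, col)` annotation format, the fixed kernel width `sigma=4.0`, and the truncation at three standard deviations are all hypothetical choices made for this example.

```python
import math


def density_map(heads, height, width, sigma=4.0):
    """Build a density map by placing one normalized 2D Gaussian per head.

    heads: list of integer (row, col) head coordinates inside the image
           (hypothetical annotation format, assumed for this sketch).
    sigma: fixed kernel width in pixels (an assumed value).
    """
    dmap = [[0.0] * width for _ in range(height)]
    radius = int(3 * sigma)  # truncate the kernel at 3 sigma (assumption)
    for r0, c0 in heads:
        # Collect the kernel weights that fall inside the image bounds.
        weights = []
        for r in range(max(0, r0 - radius), min(height, r0 + radius + 1)):
            for c in range(max(0, c0 - radius), min(width, c0 + radius + 1)):
                d2 = (r - r0) ** 2 + (c - c0) ** 2
                weights.append((r, c, math.exp(-d2 / (2 * sigma ** 2))))
        total = sum(w for _, _, w in weights)
        # Normalize so that each head contributes exactly 1 to the map,
        # even when the kernel is clipped by the image border.
        for r, c, w in weights:
            dmap[r][c] += w / total
    return dmap


def count(dmap):
    """The people count is the sum of all values in the density map."""
    return sum(sum(row) for row in dmap)


# Three hypothetical head annotations in a 48 x 64 image.
heads = [(10, 12), (30, 40), (2, 3)]
dmap = density_map(heads, height=48, width=64)
print(round(count(dmap), 6))
```

Because each kernel is normalized to sum to one, summing the map recovers the number of annotated heads, which is why a counting model can be trained to regress the density map and then read off the count by summation.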