I. Introduction
Crowd counting is usually treated as a pixel-level estimation problem: a model predicts a density value for each pixel, and the predicted map is summed to obtain the final count. For a complex crowd scene, a pixelwise density map conveys more detailed information than a single number. It also supports higher-level crowd analysis tasks (group detection [1]–[3], crowd segmentation [4], public management [5], and so on) and video surveillance tasks (video summarization [6]–[8] and anomaly detection [9]). Recently, crowd counting has advanced significantly thanks to the powerful capacity of deep learning. However, currently released datasets are too small to meet the data demands of mainstream deep learning-based methods [10]–[15]. The main reason is that constructing a large-scale crowd counting dataset is extremely demanding and requires substantial human labor [16].
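To illustrate the density-map formulation described at the start of this section, the following minimal Python sketch builds a toy density map from annotated head positions and recovers the count by summing it. It is an illustration under stated assumptions, not the method of any cited work: placing one normalized Gaussian per annotation is a common convention that we assume here, and the names `density_map`, `count_from_density`, `heads`, and the value of `sigma` are hypothetical.

```python
import numpy as np


def density_map(points, height, width, sigma=4.0):
    """Build a toy density map with one unit-mass Gaussian per annotated head.

    Assumption: `points` holds (x, y) pixel coordinates of heads; the Gaussian
    kernel and `sigma` are illustrative choices, not taken from the text.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    density = np.zeros((height, width), dtype=np.float64)
    for px, py in points:
        kernel = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
        # Normalize within the image so each person contributes exactly 1.
        density += kernel / kernel.sum()
    return density


def count_from_density(density):
    """The count is the sum of all per-pixel density values."""
    return float(density.sum())


# Hypothetical annotations: three heads in a 64 x 64 image.
heads = [(10, 12), (30, 40), (31, 42)]
dmap = density_map(heads, height=64, width=64)
estimated_count = count_from_density(dmap)
```

Because each kernel is normalized to unit mass, the map sums to the number of annotated people, while its spatial layout still shows where the crowd is concentrated. In a learned counter, a network would predict such a map from the image, and the same summation would give the estimated count.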