I. Introduction
Crowd counting is a fundamental task in computer vision that aims to estimate the number of people in an image. Most existing methods [1]–[5] learn an intermediate representation as the training target, such as a density map, and obtain the crowd count by summing over the estimated density map. However, a count alone can hardly support downstream tasks that depend on the crowd distribution. In response, numerous methods have been proposed for the more challenging, fine-grained task of predicting the exact locations of individuals. Specifically, some approaches [6], [7] bypass the error-prone intermediate steps and directly predict the center points of heads, yielding encouraging counting performance and impressive localization accuracy.
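As a minimal illustration of density-map counting, the sketch below builds a toy density map from hypothetical head annotations and recovers the count by summation. It is not the pipeline of any cited method: the function names, the fixed Gaussian kernel, and the values of `radius` and `sigma` are assumptions chosen for illustration only.

```python
import math


def gaussian_kernel(radius, sigma):
    """Return a (2*radius+1) x (2*radius+1) Gaussian kernel that sums to 1."""
    size = 2 * radius + 1
    kernel = [
        [
            math.exp(-((x - radius) ** 2 + (y - radius) ** 2) / (2 * sigma ** 2))
            for x in range(size)
        ]
        for y in range(size)
    ]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]


def density_map(points, height, width, radius=3, sigma=1.5):
    """Place one normalized kernel at each (x, y) head position.

    Assumption: a fixed kernel is used for every head. Kernel mass that
    falls outside the image is dropped, so heads near the border
    contribute slightly less than 1 to the total.
    """
    kernel = gaussian_kernel(radius, sigma)
    dmap = [[0.0] * width for _ in range(height)]
    for px, py in points:
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = py + dy, px + dx
                if 0 <= y < height and 0 <= x < width:
                    dmap[y][x] += kernel[dy + radius][dx + radius]
    return dmap


def count_from_density(dmap):
    """The crowd count is the sum over all pixels of the density map."""
    return sum(map(sum, dmap))


# Hypothetical annotations: three heads, all far enough from the border.
heads = [(5, 5), (10, 8), (20, 12)]
dmap = density_map(heads, height=20, width=30)
estimated_count = count_from_density(dmap)
```

Because each kernel is normalized to unit mass, the sum recovers the number of annotated heads, but the map itself spreads each person over a neighborhood of pixels, so individual positions are not directly available from it.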