I. Introduction
Crowd counting aims to automatically estimate the number of people in a given image or video. Regarded as a critical task in crowd analysis [1], it has gained increasing attention in the deep learning community [2], [3], [4] because of its wide range of applications, including public safety in smart cities [5], intelligent surveillance [6], [7], and traffic monitoring [8]. In particular, crowd counting can be integrated into consumer electronic devices, such as smart glasses or drones, to provide security personnel with crowd distribution information, give early warning of crowd stampedes, or detect crowd gatherings during the COVID-19 pandemic. Current state-of-the-art methods generally fall into two categories: density map-based methods [9], [10], [11], [12] and point-based methods [13], [14], [15]. Although these methods have been studied extensively to handle various challenges in crowd counting (e.g., large variations in the scale of people, occlusion and heavy clutter, or uneven crowd distributions), they all require point annotations in advance.