I. Introduction
With urban populations becoming increasingly concentrated worldwide, computer vision-based crowd recognition technology plays an important role in public safety, abnormal event detection, and urban traffic management. In the past few years, many approaches have been proposed to solve the crowd counting problem in images, mainly comprising traditional detection-based and regression-based methods, as well as convolutional neural network (CNN)-based methods [1]. For sparse scenes containing one or a few targets, crowd counting can be performed easily and accurately, either by detection-based methods that detect human bodies in the image or by regression-based methods that learn the mapping between image features and the number of people. For crowded scenes containing a large number of targets, however, counting remains difficult because of challenges such as scale variation, nonuniform distribution, occlusion, and complex background. Sample images illustrating these challenges are shown in Fig. 1. These challenges are usually not independent of each other but coupled, which means that several of them may be present simultaneously in one image. To overcome them, a variety of CNN-based crowd counting methods have been designed that automatically learn the mapping between images and density maps and achieve excellent counting results. Although these methods have solved some problems, there is still considerable room for improvement, especially in densely crowded scenes. Therefore, in this article, a generic and robust CNN-based network is designed to further improve crowd counting performance.
Another hot topic of research in crowd counting is to alleviate the reliance on large amounts of labeled data and the difficulty of pixel-level annotation by using cross-domain [2], domain-adaptive [3], and semi-supervised [4] crowd counting methods, among others. In this article, we focus on improving the performance of CNN-based crowd counting; therefore, a brief review of traditional crowd counting methods and a detailed review of CNN-based crowd counting methods are provided in this Introduction section.
Fig. 1. Image samples of challenges including (a) scale variation, (b) nonuniform distribution, (c) occlusion, and (d) complex background.