
Crowd Counting Based on Multiscale Spatial Guided Perception Aggregation Network


Abstract:

Crowd counting has received extensive attention in computer vision, and methods based on deep convolutional neural networks (CNNs) have made great progress in this task. However, scale variation, nonuniform distribution, complex backgrounds, and occlusion in crowded scenes hinder the performance of these networks. To overcome these challenges, this article proposes a multiscale spatial guidance perception aggregation network (MGANet) for efficient and accurate crowd counting. MGANet consists of three parts: a multiscale feature extraction network (MFEN), a spatial guidance network (SGN), and an attention fusion network (AFN). Specifically, to alleviate the scale variation problem in crowded scenes, MFEN is introduced to enhance scale adaptability and effectively capture multiscale features in scenes with drastic scale variation. To address nonuniform crowd distribution and complex backgrounds, an SGN is proposed. The SGN includes two parts: the spatial context network (SCN) and the guidance perception network (GPN). SCN captures the detailed semantic information between the positions of the multiscale features extracted by MFEN and improves the ability to explore deep structured information. At the same time, dependencies between spatially distant contexts are established to enlarge the receptive field. GPN enhances the information exchange between channels and guides the network to select appropriate multiscale features and spatial context semantic features. AFN adaptively weighs the importance of these different features and derives accurate and effective feature representations from them. In addition, this article proposes a novel region-adaptive loss function, which optimizes the image regions with large recognition errors and alleviates the inconsistency between the training target and the evaluation metric.
In order to eva...
Published in: IEEE Transactions on Neural Networks and Learning Systems ( Volume: 35, Issue: 12, December 2024)
Page(s): 17465 - 17478
Date of Publication: 23 August 2023

PubMed ID: 37610898


I. Introduction

With the increasing concentration of urban populations worldwide, computer vision-based crowd recognition technology plays an important role in public safety, abnormal event detection, and urban traffic management. In the past few years, many approaches have been proposed to solve the crowd counting problem in images, mainly including traditional detection-based and regression-based methods, as well as convolutional neural network (CNN)-based crowd counting methods [1]. For sparse scenes containing one or a few targets, crowd counting can be performed easily and accurately either by detection-based methods, which detect human bodies in an image, or by regression-based methods, which learn the mapping between image features and the number of people. However, for crowded scenes containing a large number of targets, recognition remains difficult because of challenges such as scale variation, nonuniform distribution, occlusion, and complex background. Sample images of these challenges are illustrated in Fig. 1. These challenges are usually not independent of each other but coupled, which means that several of them may appear simultaneously in one image. To overcome them, a variety of CNN-based crowd counting methods have been designed that automatically learn the mapping between images and density maps and achieve excellent counting results. Although these methods have solved some problems, there is still much room for improvement, especially in densely crowded scenes. Therefore, in this article, a generic and robust CNN-based network is designed to further improve crowd counting performance.
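To make the notion of a density map concrete, the following is a minimal sketch of how a ground-truth density map is commonly built from head annotations and how a count is read back from it. It is not taken from this article: the function names `density_map` and `count_from_density`, the fixed kernel width `sigma`, and the toy annotations are all illustrative assumptions.

```python
import math


def density_map(height, width, heads, sigma=2.0):
    """Place one unit-mass Gaussian per annotated head.

    `heads` is a list of (row, col) annotations. `sigma` is a fixed,
    illustrative kernel width (an assumption, not a value from the article).
    """
    density = [[0.0] * width for _ in range(height)]
    for r0, c0 in heads:
        # Unnormalized Gaussian weights over the whole image.
        weights = [
            [math.exp(-((r - r0) ** 2 + (c - c0) ** 2) / (2.0 * sigma ** 2))
             for c in range(width)]
            for r in range(height)
        ]
        total = sum(map(sum, weights))
        # Normalize so each person contributes exactly 1 to the sum,
        # even when the kernel is clipped by the image border.
        for r in range(height):
            for c in range(width):
                density[r][c] += weights[r][c] / total
    return density


def count_from_density(density):
    """The estimated crowd count is the sum over all density values."""
    return sum(map(sum, density))


# Toy annotations (hypothetical): two nearby heads, one isolated, one in a corner.
heads = [(5, 5), (5, 7), (20, 30), (0, 0)]
dmap = density_map(32, 48, heads)
estimated = count_from_density(dmap)
print(round(estimated, 6))
```

Because every kernel is normalized to unit mass, summing the map recovers the number of annotations up to floating-point error. A counting network trained on such targets predicts the map, and the count is obtained by the same summation.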
Another active research topic in crowd counting is reducing the reliance on large amounts of labeled data and the difficulty of pixel-level annotation by using cross-domain crowd counting methods [2], domain-adaptive crowd counting methods [3], semi-supervised crowd counting methods [4], etc. In this article, we focus on improving the performance of CNN-based crowd counting; therefore, this Introduction section provides a brief review of traditional crowd counting methods and a detailed review of CNN-based crowd counting methods.

Fig. 1. Image samples of challenges including (a) scale variation, (b) nonuniform distribution, (c) occlusion, and (d) complex background.

