Global Representation Guided Adaptive Fusion Network for Stable Video Crowd Counting


Abstract:

Modern crowd counting methods for natural scenes are mostly image-based, even when video datasets are available. Because of background interference or occlusion in the scene, these methods are prone to abrupt changes and instability in density prediction. Little research has addressed how to exploit the inherent consistency among adjacent frames to achieve high estimation accuracy on video sequences. In this study, we explore the long-term global temporal consistency in the video sequence and propose a novel Global Representation Guided Adaptive Fusion Network (GRGAF) for video crowd counting. The primary aim is to establish a long-term temporal representation among consecutive frames to guide the density estimation of local frames, which can alleviate the prediction instability caused by background noise and occlusions in crowd scenes. Moreover, to further enforce temporal consistency, we apply a generative adversarial learning scheme and design a global-local joint loss, which makes the estimated density maps more temporally coherent. Extensive experiments on four challenging video-based crowd counting datasets (FDST, DroneCrowd, MALL, and UCSD) demonstrate that our method makes effective use of the spatio-temporal information in video and outperforms other state-of-the-art approaches.
Published in: IEEE Transactions on Multimedia (Volume: 25)
Page(s): 5222 - 5233
Date of Publication: 07 July 2022


I. Introduction

Crowd counting is an important computer vision task because it facilitates a variety of fundamental applications, such as public safety management [1], autonomous driving technologies [2], video surveillance [3], [4], and traffic management [5], [6]. The primary aim is to accurately count the number of people in a crowd scene from a video or image. Counting in diverse real-world scenarios remains challenging due to severe occlusion, large variations in scale, and illumination changes.
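
To make the counting objective and the notion of frame-to-frame stability concrete, the following minimal sketch assumes the common density-map convention, in which the estimated count for a frame is the sum of its density map. This convention is an assumption here, since this section does not define it. The function names (`count_from_density`, `mean_abs_count_change`) and the toy values are hypothetical illustrations; they are not part of GRGAF and not an evaluation metric taken from the paper.

```python
from typing import List, Sequence

# A density map is represented here as a 2D grid of non-negative floats.
DensityMap = Sequence[Sequence[float]]


def count_from_density(density: DensityMap) -> float:
    """Estimated count for one frame: the sum of all density values (assumed convention)."""
    return sum(sum(row) for row in density)


def frame_counts(frames: Sequence[DensityMap]) -> List[float]:
    """Per-frame estimated counts for a video given as a sequence of density maps."""
    return [count_from_density(d) for d in frames]


def mean_abs_count_change(counts: Sequence[float]) -> float:
    """Hypothetical instability score: mean absolute change in count between
    consecutive frames. Returns 0.0 when there are fewer than two frames."""
    if len(counts) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(counts, counts[1:])]
    return sum(diffs) / len(diffs)


# Toy three-frame sequence; the values are made up for illustration only.
frames = [
    [[0.5, 0.5], [1.0, 0.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[0.25, 0.25], [0.5, 0.0]],
]
counts = frame_counts(frames)
instability = mean_abs_count_change(counts)
```

In a real crowd video the true count usually changes slowly between adjacent frames, so a large value of such a score for an image-based predictor would suggest the kind of abrupt, unstable estimates that temporal modeling is meant to reduce.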
