I. Introduction
Crowd counting aims to accurately estimate the number of people in images or videos, which is important for applications such as video surveillance, traffic monitoring, public safety, and urban planning. In practical applications, the main source of data is video captured by drones or surveillance cameras. Video data can be naturally decomposed into a temporal part and a spatial part. However, most crowd density estimation models use only the spatial information of a video and ignore the strong correlation between adjacent video frames. Methods for processing a single image can be roughly divided into two categories: detection-based methods[1]–[3] and regression-based methods[4]. The latter address the occlusion and clutter problems of the former by using CNN-based models such as MCNN and CSRNet[5], [6].