1 Introduction
Crowd counting is the task of estimating the number of people in an image or a surveillance video, and it has drawn considerable attention in the computer vision community due to its potential applications in security-related scenarios. Almost all previous work targets RGB-image-based crowd counting [1], [2], [3], [4] and achieves satisfactory performance on this task. With the growing popularity of depth sensors, researchers have also begun to study RGB-D crowd counting [5], [6], [7] in surveillance scenarios. Compared with a single RGB image, a depth map provides additional geometric information (e.g., the size of an object) [8], [9] for understanding scenes, which motivates us to simultaneously count and locate heads with RGB-D data.