I. Introduction
Crowd analysis is an emerging topic in computer vision and a crucial task in smart city applications, e.g., video monitoring, urban planning, and public security [1], [2]. It has two essential subtasks, counting and localization, which have drawn significant attention in recent years. Their objectives are to infer the number of pedestrians and their locations, respectively. Approaches to crowd counting are continually updated and have become increasingly effective. Meanwhile, crowd localization, which evolved from crowd counting, is gradually being explored and developed. Both subtasks can serve high-level vision tasks, e.g., crowd tracking [3] and 3-D human pose estimation [4].