1. Introduction
Deep neural network-based multi-view (MV) crowd counting [56], [57] was recently proposed to count people in wide-area scenes that cannot be covered by a single camera. In these works, feature maps from multiple camera views are fused and decoded to predict a scene-level crowd density map. However, one major disadvantage of the current MV paradigm is that the models are trained and tested on the same single scene with a fixed camera layout, and thus the trained models do not generalize well to other scenes or other camera layouts.