1. Introduction
The problem of crowd counting [4] has recently received significant attention in computer vision. Given video of a crowded environment, the goal is to estimate the density of the crowd by counting the number of people it contains. Various methods have shown that the problem can be solved with fairly high accuracy [4], [5], [14], [21], [23]. In fact, state-of-the-art results place the prediction error at around ±1 person per video frame for crowds with dozens of pedestrians. While this is sufficient for most applications of practical interest, the problem of scalability remains open. Most existing methods assume a large annotated training set for each camera view. This is not practical for large camera networks, which are where crowd counting systems are most useful.