I. Introduction
Over the past decades, research has focused on video-based human and crowd activity recognition, and significant progress has been made. However, most of this work is devoted to single-camera videos captured against simple backgrounds [1]–[6]. Single-camera methods suffer from inherent drawbacks, such as limited fields of view, object occlusions, and low recognition accuracy in cluttered scenes, and these shortcomings are difficult to overcome. In recent years, with the wide deployment of low-cost cameras in public places for safety purposes, a single site is often covered by several cameras. Researchers have therefore begun to study human and crowd activity analysis in multicamera networks, which helps mitigate the aforementioned drawbacks of single-camera approaches. Intuitively, the abundant and complementary information provided by a multicamera system should improve activity recognition. Toward this goal, several challenges must be addressed: how to effectively represent multicamera data, how to extract shared features from multicamera videos, how to handle the discrepancies among different views, and how to fuse information from multiple cameras for the analysis of human or crowd behavior.