1. Introduction
Activity detection has drawn rapidly growing attention in both industry and research. Activity detection in extended videos [4], [15] is widely applied to public safety in indoor and outdoor scenarios, and activity detection on streaming videos captured by in-vehicle cameras is applied to vision-based autonomous driving. Developing these applications raises several challenges. First, most of these systems take unconstrained videos as input, recorded with large fields of view in which multiple objects and multiple activities occur simultaneously and continuously over time. Second, unconstrained real-world videos span multiple scenarios and conditions, e.g., dynamically changing road environments from day to night in autonomous driving [21]. Third, efficient algorithms are required to process streaming video and respond to it in real time.