1. Introduction
Anomaly detection is an important task in computer vision owing to its wide range of applications, such as video surveillance, video summarization, and scene understanding. However, the task remains extremely challenging because it is ill-posed: the space of abnormal events is unbounded, so it is extremely difficult or infeasible to collect data covering all of them. In contrast, ordinary moments in videos are much easier to acquire. Thus, a common setting for anomaly detection is that only ordinary moments are available in the training set.
†: Equal Contribution
Anomaly detection can be cast as two subproblems: i) how to characterize appearance and motion; ii) how to model the change in appearance or motion. For a long time, hand-crafted features [1] [2] were used to characterize the appearance and motion in videos, and sparse representation based approaches [3]–[5] were then used to measure the change in appearance or motion. However, such a sparse representation strategy is very time-consuming in both training and testing. More recently, deep neural networks have shown their advantages over hand-crafted features for visual data representation in image classification [6] and activity recognition [7] [8]. Building on this, Hasan et al. [9] proposed a 3D Convolutional Neural Network (ConvNet or CNN) based Auto-Encoder framework that simultaneously learns the regularity of appearance and motion for anomaly detection. However, many existing works on activity recognition have shown that 3D convolution does not characterize motion well enough [10] [11].