1. Introduction
In recent years, research on abnormal event detection in video has gained significant traction [1, 10, 17, 18, 26–28, 36, 38, 43, 44, 49, 52, 57, 58, 61, 62, 65, 69, 76, 78, 80, 83, 87, 90, 95, 97–100], owing to its utmost importance in video surveillance. Despite the growing interest, video anomaly detection remains a complex task, largely because abnormal situations are context-dependent and occur very rarely. This makes it very difficult to collect a representative set of abnormal events for training state-of-the-art deep learning models in a fully supervised manner. To illustrate the rarity and context dependence of anomalies, we refer to the vehicle ramming attacks carried out by terrorists against pedestrians. As soon as a car is steered onto the sidewalk, it becomes an abnormal event. Hence, the place where the car is driven (street versus sidewalk) determines the normal or abnormal label of the action, i.e. the label depends on context. Furthermore, fewer than 200 vehicle ramming attacks have been registered to date
https://en.wikipedia.org/wiki/Vehicle-ramming_attack
, confirming the scarcity of such events (even fewer are caught on video).

Figure: Our masked auto-encoder for abnormal event detection based on self-distillation. At training time, some video frames are augmented with synthetic anomalies. The teacher decoder learns to reconstruct the original frames (without anomalies) and to predict anomaly maps. The student decoder learns to reproduce the teacher's output. Motion gradients are aggregated at the token level and used as weights for the reconstruction loss. Red dashed lines represent steps executed only during training.
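The token-level motion weighting mentioned in the caption (motion gradients aggregated per token and used as weights for the reconstruction loss) can be sketched roughly as follows. This is a hypothetical NumPy illustration, not the authors' implementation: the function name, the patch size, and the use of a simple temporal frame difference as the motion gradient are all assumptions made for clarity.

```python
import numpy as np

def motion_weighted_reconstruction_loss(recon, target, prev_frame, patch=8):
    """Hypothetical sketch: weight per-token reconstruction errors by
    motion gradients aggregated at the token (patch) level.

    recon, target, prev_frame: 2D arrays (H, W) with H, W divisible by patch.
    """
    # Motion gradient approximated as the absolute temporal difference
    # between the current and previous frame (an assumption, for illustration).
    motion = np.abs(target - prev_frame)
    h, w = target.shape
    th, tw = h // patch, w // patch
    # Aggregate motion and squared reconstruction error over each
    # non-overlapping patch (token).
    motion_tok = motion.reshape(th, patch, tw, patch).mean(axis=(1, 3))
    err_tok = ((recon - target) ** 2).reshape(th, patch, tw, patch).mean(axis=(1, 3))
    # Normalize the motion weights so they sum to 1 over all tokens,
    # emphasizing tokens with high motion in the final loss.
    weights = motion_tok / (motion_tok.sum() + 1e-8)
    return float((weights * err_tok).sum())
```

Tokens covering static regions receive small weights, so the loss concentrates on moving regions, where anomalies are more likely to manifest.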
Figure: Performance versus speed trade-offs for our self-distilled masked AE and several state-of-the-art methods [26–28, 47, 49, 60, 61, 69, 84] (with open-sourced code), on the Avenue data set. The running times of all methods are measured on a computer with one Nvidia GeForce RTX 3090 GPU with 24 GB of VRAM. Best viewed in color.