
Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors



Abstract:

We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level. The novelty of the proposed model is threefold. First, we introduce an approach to weight tokens based on motion gradients, thus shifting the focus from the static background scene to the foreground objects. Second, we integrate a teacher decoder and a student decoder into our architecture, leveraging the discrepancy between the outputs given by the two decoders to improve anomaly detection. Third, we generate synthetic abnormal events to augment the training videos, and task the masked AE model to jointly reconstruct the original frames (without anomalies) and the corresponding pixel-level anomaly maps. Our design leads to an efficient and effective model, as demonstrated by the extensive experiments carried out on four benchmarks: Avenue, ShanghaiTech, UBnormal and UCSD Ped2. The empirical results show that our model achieves an excellent trade-off between speed and accuracy, obtaining competitive AUC scores while processing 1655 FPS. Hence, our model is between 8 and 70 times faster than competing methods. We also conduct an ablation study to justify our design. Our code is freely available at: https://github.com/ristea/aed-mae.
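The first contribution above, weighting tokens by motion gradients so that foreground objects dominate the reconstruction loss, could be sketched as follows. This is a minimal illustration, not the authors' exact formulation: the patch size, the max-pooling of the temporal gradient over each patch, and the sum-to-one normalization are all assumptions made for the example.

```python
import numpy as np

def token_motion_weights(prev_frame, frame, patch=8):
    """Weight each token (patch) by its motion gradient magnitude.

    Hypothetical sketch: the absolute temporal difference between two
    consecutive frames is max-pooled per non-overlapping patch, then
    normalized, so high-motion (foreground) tokens receive larger weights.
    """
    grad = np.abs(frame.astype(float) - prev_frame.astype(float))
    h, w = grad.shape[0] // patch, grad.shape[1] // patch
    # max-pool the gradient magnitude over each patch
    pooled = grad[:h * patch, :w * patch].reshape(h, patch, w, patch).max(axis=(1, 3))
    return pooled / (pooled.sum() + 1e-8)  # weights sum to ~1

def weighted_recon_loss(target, recon, weights, patch=8):
    """Mean squared error per token, re-weighted by the motion weights."""
    err = (target.astype(float) - recon.astype(float)) ** 2
    h, w = err.shape[0] // patch, err.shape[1] // patch
    per_token = err[:h * patch, :w * patch].reshape(h, patch, w, patch).mean(axis=(1, 3))
    return float((weights * per_token).sum())
```

With weights computed this way, a static background patch contributes almost nothing to the loss, while a moving object forces the auto-encoder to reconstruct it accurately.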
Date of Conference: 16-22 June 2024
Date Added to IEEE Xplore: 16 September 2024
Conference Location: Seattle, WA, USA

1. Introduction

In recent years, research on abnormal event detection in video has gained significant traction [1, 10, 17, 18, 26–28, 36, 38, 43, 44, 49, 52, 57, 58, 61, 62, 65, 69, 76, 78, 80, 83, 87, 90, 95, 97–100], due to its importance in video surveillance. Despite the growing interest, video anomaly detection remains a complex task, because abnormal situations are context-dependent and rarely occur. This makes it very difficult to collect a representative set of abnormal events for training state-of-the-art deep learning models in a fully supervised manner. To illustrate the rarity and context dependence of anomalies, consider the vehicle ramming attacks carried out by terrorists against pedestrians. As soon as a car is steered onto the sidewalk, it becomes an abnormal event. Hence, the place where the car is driven (street versus sidewalk) determines the normal or abnormal label of the action, i.e. the label depends on context. Furthermore, fewer than 200 vehicle ramming attacks have been registered to date (https://en.wikipedia.org/wiki/Vehicle-ramming_attack), confirming the scarcity of such events (even fewer are caught on video).

Our masked auto-encoder for abnormal event detection based on self-distillation. At training time, some video frames are augmented with synthetic anomalies. The teacher decoder learns to reconstruct original frames (without anomalies) and predict anomaly maps. The student decoder learns to reproduce the teacher's output. Motion gradients are aggregated at the token level and used as weights for the reconstruction loss. Red dashed lines represent steps executed only during training.
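Since the student decoder only learns to mimic the teacher on normal training data, a large teacher-student output discrepancy at test time signals an anomaly. A minimal sketch of turning this discrepancy into a frame-level score follows; the mean-squared discrepancy and the `alpha`-weighted blending with the teacher's predicted anomaly map are illustrative assumptions, not the exact scoring function used in the paper.

```python
import numpy as np

def anomaly_score(teacher_out, student_out, teacher_map=None, alpha=0.5):
    """Frame-level anomaly score from teacher-student discrepancy.

    Hypothetical sketch: the mean squared difference between the two
    decoder outputs measures how "unfamiliar" the frame is; optionally,
    it is blended with the teacher's predicted pixel-level anomaly map.
    """
    discrepancy = float(np.mean((teacher_out - student_out) ** 2))
    if teacher_map is None:
        return discrepancy
    # blend reconstruction discrepancy with the predicted anomaly map
    return alpha * discrepancy + (1 - alpha) * float(np.mean(teacher_map))
```

On a normal frame the two decoders agree and the score stays near zero; on an anomalous frame the student fails to reproduce the teacher's output, raising the score.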

Performance versus speed trade-offs for our self-distilled masked AE and several state-of-the-art methods [26–28, 47, 49, 60, 61, 69, 84] (with open-sourced code), on the Avenue data set. The running times of all methods are measured on a computer with one Nvidia GeForce RTX 3090 GPU with 24 GB of VRAM. Best viewed in color.

