Introduction
Despite extensive efforts and significant achievements in intelligent monitoring systems and scene understanding, video anomaly detection remains one of the most sought-after research domains in academia and industry [1], [2], [3], [4]. The real-time detection, localization, and tracking of anomalies are increasingly crucial for applications such as security systems, crowd management, industrial transportation, and healthcare administration [5], [6], [7], [8], [9].
One of the most pressing challenges in video anomaly detection is the dynamic nature of anomaly definitions across different domains and time frames [10]. What constitutes an anomaly can vary significantly when transitioning between domains and even from one frame to the next [6], [11]. Another challenge is delving into the detected anomaly patterns to extract comprehensive insights into their origins. This analytical depth is instrumental in enhancing monitoring systems' understanding of the causes behind anomaly patterns, allowing for effective prioritization. Anomaly detection can be roughly classified into three primary categories: 1) point; 2) collective; and 3) contextual anomaly detection [12]. Point anomalies are data values that significantly diverge from most data points [12], [13]. Collective anomalies are groups of data points that jointly deviate from the rest, and contextual anomalies surface when data items manifest peculiar behavior within a specific context or setting [12].
This article presents a novel semi-supervised traffic video monitoring system for anomaly detection, localization, and classification. Our method analyzes incoming frames using three distinct baselines. The first baseline involves learning normal patterns during training and labeling deviations from these patterns as anomalies. The second baseline employs a scenario-based assessment, evaluating anomalous patterns based on the context of the previous frames. The third baseline is a frame-based object detection analyzer within the current frame. This multifaceted approach provides a comprehensive perspective on video anomaly detection, allowing adaptability to diverse scenarios and types of anomalies. It is worth noting that a pattern considered an anomaly in the frame-based baseline may be deemed normal when considering the context of previous frames.
The proposed methodology extends beyond anomaly detection by conducting in-depth analyses of the detected anomalies. It distinguishes between anomalies caused by motion and appearance, providing valuable insights into the nature of the anomalies. For instance, a pattern may be classified as a point anomaly in the frame-based baseline while being categorized as a collective anomaly when considering previous frames. This approach offers a nuanced and versatile framework for video anomaly detection, contributing to improved accuracy and adaptability.
The core structure of the proposed method is a semi-supervised Siamese (SemiSiam) network designed to detect contextual anomaly patterns with only limited normal samples provided during the training phase, without any abnormal samples. Input videos undergo comprehensive analysis of their appearance and motion patterns. These features are then directed into distinct encoder engines: the appearance analyzer engine (AAE) and the motion analyzer engine (MAE). Within these encoder engines, the input patterns are mapped to unique latent representations, a critical step in the anomaly detection process. The extracted representations are strategically fused and then fed into the decoder, which plays a pivotal role in reconstructing the input patterns. The algorithm is designed such that similar input patterns yield similar latent representations, ensuring effective anomaly detection. The contributions of this article include the following.
Multiple Baseline Anomaly Detection: This article introduces a flexible anomaly detection framework built on multiple baselines: the conventional method that relies on patterns learned during training, a frame-based approach, and a scenario-based assessment that considers the context of the current frame, even if it contradicts past patterns. This adaptability enhances the system's ability to detect and respond to anomalies effectively in diverse situations.
Anomaly Pattern Classification: This article presents a robust framework for classifying detected anomaly patterns into point, collective, and contextual anomalies. By utilizing these baselines, this classification provides a comprehensive understanding of the nature of the anomalies. It allows the system to prioritize responses based on the specific type of anomaly.
Semi-Supervised Few-Shot Learning (FSL): This article introduces a novel semi-supervised FSL approach for detecting contextual anomaly patterns. It can effectively address the challenge of sparse training data and enhance the algorithm’s practicality.
The remainder of this article is organized as follows. The related works are reviewed in Section II. The proposed problem formulation is described in Section III. Section IV presents SemiSiam as a semi-supervised FSL methodology for video anomaly detection. Section V analyzes the simulation results, and Section VI concludes this article.
Related Works
Reconstruction-based and distribution-based techniques are widely used in video anomaly detection, but they face limitations in adaptability and generalization. For instance, the work in [14] employs spatial-temporal autoencoders (AEs) to reconstruct a sequence of frames, but this method is constrained by its reliance on a fixed set of general features, limiting its adaptability to new, unseen scenarios. Real-world environments often involve situations where data is scarce, making FSL approaches more suitable. In contrast, our work addresses this limitation by utilizing a novel FSL methodology, which adapts more effectively to environments with limited data by leveraging motion and appearance information via two AEs.
The method in [15] employs a 3-D Conv-net AE to learn abstract latent features and predict future frames, but this approach struggles with complex scenarios that require extensive historical data for accurate prediction. Additionally, frame prediction becomes increasingly challenging as the complexity of the environment grows. Similarly, the work in [16] uses a Conv-LSTM-AE for appearance encoding and temporal memorization, which improves accuracy but lacks adaptability in dynamic environments. Our proposed method overcomes these challenges by introducing an architecture that learns both appearance and motion in a more adaptable way, enabling it to perform well even with FSL tasks, which are crucial in scenarios with limited training data.
Generative adversarial networks (GANs) have also been explored for video anomaly detection, as seen in [17], but they suffer from significant computational costs and slow convergence. The AnoGAN framework in [18] learns the distribution of normal data but requires an optimization step during testing, making it computationally expensive. In contrast, our proposed SemiSiam technique strikes a balance between performance and computational efficiency by reducing the reliance on extensive optimization and training time, thus making it more practical for real-time anomaly detection applications.
Meta-learning, a popular FSL approach, has been explored in [19] and [20] for adapting models to specific scenarios. However, this method encounters challenges related to high computational costs, slow convergence, and sensitivity to hyperparameters, which limits its applicability in real-time systems. Our SemiSiam approach directly addresses these issues by offering a lightweight, adaptable model that achieves faster convergence and robustness to varying hyperparameters, making it more suitable for real-world applications where computational resources are limited.
Another notable limitation of existing works, such as the online anomaly detection method proposed in [21], is their reliance on pretrained models, which may fail to generalize to novel anomaly patterns in real-world scenarios. Our approach, in contrast, offers a solution that balances adaptability, performance, and complexity. By employing a semi-supervised learning framework, we can detect anomalies using only normal patterns during training, which is essential in scenarios where abnormal patterns are difficult to collect or define.
Finally, weakly supervised FSL frameworks, such as [22], show promise in improving sample efficiency but still require abnormal patterns during training. This assumption is not always practical, especially in real-world scenarios with limited abnormal data. Our proposed approach addresses this issue by using only normal patterns for training, making it highly effective in semi-supervised contexts, while still maintaining robust performance when confronted with novel anomaly patterns.
Problem Formulation
From the perspective of FSL, video anomaly detection can be considered a semi-supervised problem in the sense that only some nominal samples are available in the training set. A video sequence $X$ and its corresponding video-level annotation $y$ together constitute one training sample.
Video anomaly detection aims to estimate an anomaly score for every frame of a video so that each frame can be labeled as normal or abnormal [1], formally\begin{equation*} S(x) = \arg \max _{\phi } \log p\left ({\phi \mid D_{\text {train}}}\right ) \tag {1}\end{equation*}
\begin{equation*} S(x) = \arg \max _{\phi } \left [{\log p\left ({D_{\text {train}} \mid \phi }\right ) + \log p\left ({\phi }\right )}\right ]. \tag {2}\end{equation*}
\begin{equation*} S(x) = \arg \max _{\phi } \left [{\sum _{i} \log p\left ({y_{i} \mid x_{i}, \phi }\right ) + \log p\left ({\phi }\right )}\right ]. \tag {3}\end{equation*}
The proposed algorithm is devised based on a semi-supervised paradigm, with no abnormal pattern in the training set. Three different datasets are defined: $D_{\text{pretraining}}$, a large set of generic normal samples; $D_{\text{adaptation}}$, a few normal samples from the target scene; and $D_{\text{test}}$, the evaluation set.
The proposed video anomaly detection methodology presents a two-stage algorithm [23]. During the prior training step, the model parameters are learned from $D_{\text{pretraining}}$; they are then refined on $D_{\text{adaptation}}$, so the objective becomes\begin{equation*} S(x) = \arg \max _{\phi } \log p\left ({\phi \mid D_{\text {adaptation}}, D_{\text {pretraining}}}\right ). \tag {4}\end{equation*}
Marginalizing over the prior parameters $\theta$ gives\begin{align*} S(x)=& \arg \max _{\phi } \log p\left ({\phi \mid D_{\text {adaptation}}, D_{\text {pretraining}}}\right ) \\=& \log \int _{\theta } p\left ({\phi \mid D_{\text {adaptation}}, \theta }\right ) p\left ({\theta \mid D_{\text {pretraining}}}\right ) d\theta . \tag {5}\end{align*}
Equation (5) assumes that the posterior $p(\theta \mid D_{\text{pretraining}})$ is sharply peaked around a point estimate $\theta ^{*}$, so the integral can be approximated as\begin{align*} S(x) \approx \log p\left ({\phi \mid D_{\text {adaptation}}, \theta ^{*}}\right ) + \log p\left ({\theta ^{*} \mid D_{\text {pretraining}}}\right ) \tag {6}\end{align*}
and therefore\begin{align*}& \arg \max _{\phi } \log p\left ({\phi \mid D_{\text {adaptation}}, D_{\text {pretraining}}}\right ) \\& \qquad \approx \arg \max _{\phi } \log p\left ({\phi \mid D_{\text {adaptation}}, \theta ^{*}}\right ). \tag {7}\end{align*}
In other words, a massive amount of training data containing all possible normal samples of various scenarios is not needed; the problem instead decomposes into two stages\begin{align*} \text {prior-learning:}~ \theta ^{*}=& \arg \max _{\theta } \log p\left ({\theta \mid D_{\text {pretraining}}}\right ) \tag {8}\\ \text {adaptation:}~ \phi ^{*}=& \arg \max _{\phi } \log p\left ({\phi \mid D_{\text {adaptation}}, \theta ^{*}}\right ). \tag {9}\end{align*}
Across $t$ different scenes, $\theta ^{*}$ is selected such that the adapted parameters $\phi _{i}$ perform well on each scene's data\begin{equation*} \theta ^{*} = \arg \max _{\theta } \sum _{i=1}^{t} \log p\left ({\phi _{i} \mid D_{\text {test}_{i}}}\right ) \tag {10}\end{equation*}
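The two-stage procedure in (8) and (9) amounts to two successive optimization runs over the same network, first on $D_{\text{pretraining}}$ and then, starting from the resulting weights, on $D_{\text{adaptation}}$. The following minimal PyTorch sketch assumes a reconstruction objective; the loader names, epoch counts, and learning rates are illustrative placeholders, not the authors' settings.

```python
import torch
from torch import nn, optim

def train_stage(model: nn.Module, loader, epochs: int, lr: float) -> None:
    """One optimization stage: minimize reconstruction error on normal data.
    Assumes model(x) returns a reconstruction of x; loaders, epoch counts,
    and learning rates are illustrative placeholders."""
    opt = optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for x in loader:                     # batches of normal samples only
            opt.zero_grad()
            mse(model(x), x).backward()
            opt.step()

# Prior-learning (8): obtain theta* from the large generic normal dataset.
# train_stage(model, pretraining_loader, epochs=200, lr=1e-3)
# Adaptation (9): refine to phi* on a few scene-specific normal clips,
# starting from theta* (the weights now in `model`) at a smaller rate.
# train_stage(model, adaptation_loader, epochs=20, lr=1e-4)
```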
Proposed Methodology
A. System Overview
Fig. 1 shows the training and testing phases of the proposed approach. The proposed system monitors the scene and reports the appearance and motion patterns in each frame. The conventional baseline analyzes incoming patterns compared to those modeled during the training phase. The scenario baseline considers the previous frames to detect anomalies (here we used ten previous frames), while the frame baseline focuses on objects in the incoming frames. In each case, frames are analyzed based on appearance and motion patterns, classifying them as point or collective anomalies.
High-level overview of the proposed SemiSiam network for video anomaly detection. It is structured by two encoders, one fusion network, and one decoder. (a) Learning procedure in the training phase. (b) Evaluating incoming frames in the testing phase.
To detect contextual anomalies, the proposed model aims to generate meaningful latent representations for similar input patterns. The AAE and MAE encode each input pattern to different latent representations. The fusion block then fuses the encoded features into the latent space. Since the training dataset contains only normal patterns, the proposed network aims to decrease the distance among the generated latent representations during the training phase. The decoder’s role is to reconstruct the input patterns at the output.
B. Multiple Baseline Anomaly Detection
Each frame undergoes two processes: it is initially input into the pretrained object detection block to identify the moving objects in the scene. The extracted information serves as both the appearance details of each object and the coordinates utilized in the optical flow extraction algorithm to trace the movement of each object, capturing information about their velocity and direction. Subsequently, multiple baseline anomaly detection is initiated.
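The paper does not name a specific optical flow algorithm; the sketch below uses OpenCV's dense Farneback flow (an assumption) to derive the per-object velocity and direction that feed the baselines, with the bounding boxes supplied by the object detection block.

```python
import cv2
import numpy as np

def motion_features(prev_frame: np.ndarray, frame: np.ndarray):
    """Dense optical flow between consecutive frames, summarized as per-pixel
    velocity (magnitude) and direction (angle). Farneback flow is an
    assumption; the paper does not specify the flow algorithm."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return magnitude, angle

def object_motion(magnitude, angle, box):
    """Mean velocity and direction inside a detected object's bounding box,
    as consumed by the multiple baseline anomaly detection."""
    x1, y1, x2, y2 = box
    return magnitude[y1:y2, x1:x2].mean(), angle[y1:y2, x1:x2].mean()
```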
In the conventional baseline, the information from each frame is compared with what was learned during the training phase, and any deviation is labeled as an anomaly. The second baseline, the scenario baseline, involves comparing the incoming frame with the previous ten frames in the scenario. Any inconsistencies with the scenario are considered anomalies. The third baseline is the frame-based baseline, where the objects within the frames are compared. The patterns are analyzed in each case based on their motion and appearance characteristics. Algorithm 1 outlines the proposed multiple baseline anomaly detection method.
Algorithm 1 Proposed Multiple Baseline Anomaly Detection Method
Input: video frames
Output: anomaly detection labels
for each frame do
Input the frame into the pretrained object detection block
Extract the appearance details of each detected object
Use the optical flow extraction algorithm to capture the velocity and direction of each object
Conventional Baseline:
Compare the frame information with the patterns learned during training
Calculate the deviation from the learned patterns
Label as anomaly if the deviation exceeds its threshold
Scenario Baseline:
Compare the incoming frame with the previous ten frames
Calculate the scenario deviation
Label as anomaly if the scenario deviation exceeds its threshold
Frame-based Baseline:
Compare the objects within the current frame
Calculate the frame-based deviation
Label as anomaly if the frame-based deviation exceeds its threshold
end for
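As a concrete reading of Algorithm 1, the sketch below applies the three baselines to a single object's feature vector. The z-score deviation measure and the threshold values are illustrative choices; the paper leaves both symbolic.

```python
import numpy as np

# Illustrative thresholds; Algorithm 1 leaves the deviation measures and
# thresholds symbolic, so a z-score test is one plausible instantiation.
TAU_CONV = TAU_SCEN = TAU_FRAME = 2.0

def baseline_labels(feat, train_mean, train_std, history, frame_feats):
    """Apply the three baselines to one object's feature vector `feat`
    (e.g., an appearance descriptor concatenated with velocity/direction).
    history: object features from the previous ten frames (scenario baseline);
    frame_feats: features of the other objects in the current frame."""
    eps = 1e-8
    labels = {}
    # Conventional baseline: deviation from the patterns learned in training.
    labels["conventional"] = np.abs((feat - train_mean) / (train_std + eps)).max() > TAU_CONV
    # Scenario baseline: deviation from the previous ten frames' statistics.
    hist = np.stack(history)
    labels["scenario"] = np.abs((feat - hist.mean(0)) / (hist.std(0) + eps)).max() > TAU_SCEN
    # Frame-based baseline: deviation from the other objects in the same frame.
    peers = np.stack(frame_feats)
    labels["frame"] = np.abs((feat - peers.mean(0)) / (peers.std(0) + eps)).max() > TAU_FRAME
    return labels
```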
C. Anomaly Pattern Classification
The proposed video anomaly detection goes beyond a simple binary classification, aiming for a thorough analysis within each identified baseline. Specifically, it categorizes the identified anomaly patterns into three distinctive types: point, collective, and contextual anomalies. To realize this, the gathered information on velocity and direction is summarized, and the summarized data becomes the foundation for comparison against the patterns established during the training phase.
When a single object demonstrates a noticeable deviation from the recognized patterns, it is classified as a “point anomaly.” On the contrary, if multiple objects collectively show deviations, the frame is labeled a “collective anomaly.” The SemiSiam structure is designed to detect the contextual anomalies described in the following section. Algorithm 2 outlines the proposed anomaly pattern classification.
Algorithm 2 Proposed Anomaly Pattern Classification
Input: video frames, velocity v, direction d
Output: anomaly classification labels
for each frame do
Summarize the velocity and direction information of the detected objects
Point Anomaly Detection:
for each object do
Calculate the object's deviation from the learned patterns
Label as a point anomaly if a single object's deviation exceeds its threshold
end for
Collective Anomaly Detection:
Calculate the collective deviation of the objects in the frame
Label as a collective anomaly if multiple objects deviate collectively
Contextual Anomaly Detection:
Refer to the SemiSiam structure for context-based anomaly detection
end for
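A minimal sketch of the point/collective decision rule described above, assuming per-object anomaly flags produced by one of the baselines; contextual anomalies are delegated to the SemiSiam network and are not decided here.

```python
def classify_anomaly(object_flags):
    """Map per-object anomaly flags (from one baseline of Algorithm 1) to the
    point/collective taxonomy of Algorithm 2; contextual anomalies are
    handled separately by the SemiSiam network."""
    n_anomalous = sum(object_flags)
    if n_anomalous == 0:
        return "normal"
    if n_anomalous == 1:
        return "point anomaly"          # a single deviating object
    return "collective anomaly"         # multiple objects deviate together

# Example: one bicycle among pedestrians -> point anomaly
# classify_anomaly([False, False, True])  ->  "point anomaly"
```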
D. Semi-Siam: Semi-Supervised Siamese Network
The proposed SemiSiam network analyzes each video sequence based on its appearance and motion. The algorithm employs two different encoder engines to inspect the appearance and movement of each frame. The input video sequence is divided into separate frames $F$, and the optical flow between consecutive frames yields the motion patterns $M$.
AAE: The proposed AAE takes the video frames $F$ as the input and encodes them to the corresponding latent representations.
MAE: The MAE is used to identify motion anomalies by processing $M$ within each video frame. Using optical flow as a preprocessing step, it maps the input movement patterns to the corresponding latent representations.
As illustrated in Fig. 2, the proposed encoders contain six convolutional layers to extract features and three fully connected layers to classify each pattern as normal or anomalous. The input of both the AAE and the MAE is a (158, 258, 3) frame, matching the decoder's output shape.
Fusion Block: A fusion block is introduced to combine the latent representations encoded separately by the AAE and MAE into a single joint latent representation.
Decoder: The decoder is crucial in guiding the AAE and MAE to generate meaningful latent representations. Essentially, the proposed AE is constrained to create similar latent representations that still enable the faithful reproduction of the input patterns; the reconstruction requirement prevents the model from collapsing to identical latent representations for all inputs.
The input of the decoder is a 256-dimensional latent representation, and its output is a (158, 258, 3) image. It comprises fully connected layers of 2048 and 8064 units, followed by a reshaping to a (64, 9, 14) feature map. Subsequent upsampling convolutional layers restore the (158, 258, 3) output resolution.
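The following PyTorch sketch assembles the described components: two encoders, a fusion block, and a decoder. Only the 256-dimensional code, the 2048- and 8064-unit decoder layers, the (64, 9, 14) reshape, and the (158, 258, 3) input/output shape come from the text; the channel widths, strides, encoder fully connected sizes, and the exact upsampling layers are assumptions.

```python
import torch
from torch import nn

class Encoder(nn.Module):
    """Six convolutional layers plus three fully connected layers, as the
    text describes; channel widths and strides are assumptions."""
    def __init__(self, latent_dim: int = 256):
        super().__init__()
        chans = [3, 16, 32, 64, 64, 64, 64]
        conv = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            conv += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU()]
        self.conv = nn.Sequential(*conv)
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.LazyLinear(1024), nn.ReLU(),
                                nn.Linear(1024, 512), nn.ReLU(),
                                nn.Linear(512, latent_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(x))

class SemiSiam(nn.Module):
    def __init__(self, latent_dim: int = 256):
        super().__init__()
        self.aae = Encoder(latent_dim)              # appearance analyzer engine
        self.mae = Encoder(latent_dim)              # motion analyzer engine
        self.fuse = nn.Linear(2 * latent_dim, latent_dim)
        # Decoder: 256 -> 2048 -> 8064 (= 64*9*14), reshaped to (64, 9, 14),
        # then upsampled back to a (3, 158, 258) frame.
        self.decode_fc = nn.Sequential(nn.Linear(latent_dim, 2048), nn.ReLU(),
                                       nn.Linear(2048, 64 * 9 * 14), nn.ReLU())
        self.decode_conv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),
            nn.Upsample(size=(158, 258)))           # exact layer choice assumed

    def forward(self, frame: torch.Tensor, motion: torch.Tensor):
        # Encode appearance and motion separately, fuse, then reconstruct.
        z = self.fuse(torch.cat([self.aae(frame), self.mae(motion)], dim=1))
        recon = self.decode_conv(self.decode_fc(z).view(-1, 64, 9, 14))
        return recon, z
```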
E. Objective Function
The proposed SemiSiam network considers the following underlying assumptions.
1) Considering appearance and motion, if two input patterns $x_{i}$ and $x_{j}$ are close in high-density regions, their corresponding latent representations $z_{i}$ and $z_{j}$ should also be close.
2) Considering $x_{i}$ and $x_{j}$ as two samples from the normal class, $z_{i}$ and $z_{j}$ as their corresponding latent representations, and the Euclidean distance as the distance measure, the following condition should be satisfied:\begin{equation*} \text {if}~ p\left ({x_{i} \mid 0}\right ) \approx p\left ({x_{j} \mid 0}\right ), \text {then}~ \|z_{i} - z_{j}\|_{2}^{2} \ll 1 \tag {11}\end{equation*}where $p(x_{i} \mid 0)$ is the probability of the $i$th frame being classified as the normal class.
The pretraining phase is designed to learn general information about the normal patterns. Following this, the adaptation phase transfers the knowledge learned during the previous step and improves the algorithm's adaptivity using a few training samples from each scene. Two training datasets are considered: 1) $D_{\text{pretraining}}$, containing generic normal patterns, and 2) $D_{\text{adaptation}}$, containing a few normal samples from the target scene.
The proposed SemiSiam network follows a two-stage training procedure with no abnormal patterns in the training data.
Latent Similarity Loss Function: The distance between two input patterns is computed as the distance between their latent representations. That is,\begin{equation*} \text {loss}_{\text {Latent Similarity}} = \min \| z_{i} - z_{j} \|_{2}^{2}\ \forall z_{i}, z_{j} \in Z. \tag {12}\end{equation*}
Reconstruction Error Loss Function: The reconstruction loss function makes the proposed network generate meaningful latent representations. Within each batch, the input patterns $x_{i}$ are reconstructed as $\hat {x}_{i}$, and the reconstruction error is minimized\begin{equation*} \text {loss}_{\text {Reconstruction Error}} = \min \sum \| x_{i} - \hat {x}_{i} \|_{2}^{2}\ \forall x_{i} \in M . \tag {13}\end{equation*}
As given in (12) and (13), two constraints are considered for each batch. The first constraint makes the latent representations similar, and the second reduces the reconstruction error of the input patterns. A max function merges these two objectives\begin{equation*} \text {loss} = \max \left ({\alpha \, \text {loss}_{\text {Latent Similarity}},\ \beta \, \text {loss}_{\text {Reconstruction Error}},\ 0}\right ) \tag {14}\end{equation*}where $\alpha$ and $\beta$ are weighting coefficients.
Algorithm 3 Proposed Loss Function Computation Procedure
Divide the input patterns into different batches
Compute the corresponding latent representations
Compute the reconstructed patterns
Compute the Latent Similarity Loss Function:\begin{equation*} \text {loss}_{\text {Latent}} = \min \| z_{i} - z_{j} \|_{2}^{2}\ \forall z_{i}, z_{j} \in Z\end{equation*}
Compute the Reconstruction Error Loss:\begin{equation*} \text {loss}_{\text {Reconstruction}} = \min \sum \| x_{i} - \hat {x}_{i} \|_{2}^{2}\ \forall x_{i} \in M\end{equation*}
Combine the losses using the max function:\begin{equation*} \text {loss} = \max (\alpha \, \text {loss}_{\text {Latent}},\ \beta \, \text {loss}_{\text {Reconstruction}},\ 0)\end{equation*}
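A batch-level PyTorch reading of (12)–(14); the mean reductions over the batch and the placeholder values of $\alpha$ and $\beta$ are assumptions, since the paper leaves them unspecified.

```python
import torch

def semisiam_loss(z: torch.Tensor, x: torch.Tensor, x_hat: torch.Tensor,
                  alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """Batch loss following (14). z: (B, D) latent codes; x, x_hat: (B, C, H, W)
    inputs and reconstructions. Reductions and alpha/beta are assumptions."""
    # (12): pairwise squared distances between latent codes within the batch
    diff = z.unsqueeze(0) - z.unsqueeze(1)              # (B, B, D)
    loss_latent = (diff ** 2).sum(-1).mean()
    # (13): squared reconstruction error of the input patterns
    loss_recon = ((x - x_hat) ** 2).sum(dim=(1, 2, 3)).mean()
    # (14): merge the weighted terms with a max, floored at zero
    zero = torch.zeros((), device=z.device)
    return torch.max(torch.max(alpha * loss_latent, beta * loss_recon), zero)
```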
Simulation Results
The performance of the proposed algorithm is compared to reconstruction-based methods and FSL methods for video anomaly detection. The cross-domain simulation performance is also presented; it evaluates the adaptivity of the proposed methodology when the training and testing data come from different datasets.
A. Experimental Results and Discussion
Four datasets are utilized to assess the performance of the proposed methodology and compare it with existing works in the literature.
UCSD Pedestrian 1 [24]: This dataset was captured by a stationary camera overlooking pedestrian walkways. In the normal setting, the video only contains pedestrians, while in the testing set, many unseen patterns appear, which are considered anomalies. UCSD Pedestrian 1 is composed of 34 training and 36 testing scenarios, acquired with frames of 238 × 158 pixels.
A sanity check is performed to validate the proposed SemiSiam methodology. Table 3 provides a comprehensive comparison with a wide range of techniques, including reconstruction-based and prediction-based methods. It is important to note that the sanity check employs the standard training/testing setup provided by the datasets.
Fig. 5 illustrates some samples of the proposed method in detecting and classifying anomalies. In Frame 30 of Testing Scenario Number 26 in UCSD 1, the object detection algorithm identifies a bicycle. As it deviates from the patterns observed during the training phase, it is labeled as an anomaly in the conventional baseline. Additionally, since it is detected for the first time in this frame, it is considered an anomaly in the scenario baseline, indicating the recognition of a new scenario. According to the frame baseline, it is also classified as an anomaly. From frame number 40 onward (beyond ten frames), it continues to be categorized as an anomaly in the conventional and frame baselines but is now considered normal in the scenario baseline. The classification of these anomalies as “point anomalies” is based on the analysis of objects’ appearance and motion patterns, specifically considering their distinct velocities and appearances.
Detection and classification of a bicycle as a point anomaly based on distinct appearance and velocity patterns in the proposed video anomaly detection methodology. (a) Bicycle is identified as an anomaly across all baselines. (b) Bicycle is considered normal in scenario-based baselines but remains an anomaly for the rest.
Fig. 6 illustrates the significance of employing multiple baselines and classifying anomalies. In Fig. 6(a), we have a snapshot of the training set with only normal patterns, specifically pedestrians. Fig. 6(b) displays a testing frame where the proposed algorithm detects a car labeled as an anomaly in all three baselines: in the conventional baseline because there was no car in the training set; in the scenario baseline because no car appeared in the ten previous frames; and in the frame baseline because it differs from the other objects (pedestrians in this case) within the frame. Since no other anomaly patterns are detected, it is classified as a point anomaly because of its distinct appearance and velocity. Fig. 6(c) highlights the effectiveness of the proposed methodology, where the car is labeled as an anomaly in the conventional and frame baselines but is considered normal in the scenario baseline, as it was previously detected. Fig. 6(d) introduces a scenario with the appearance of a bicycle. While existing methods might simply flag the frame as anomalous, the proposed algorithm designates the car as an anomaly in the frame and conventional baselines but normal in the scenario baseline; the bicycle, however, is an anomaly in all baselines. Since there are multiple anomaly patterns, they are classified as collective anomalies. Fig. 7 shows the status of the anomalies under the three baselines for this specific scenario.
Illustration of the proposed methodology showcasing detecting and classifying anomalies in various scenarios. (a) Training set with only normal patterns (pedestrians). (b) Testing frame detecting a car as an anomaly in all three baselines. (c) Highlight the proposed methodology’s power with a scenario where the car is labeled differently in various baselines. (d) Introducing a situation with the appearance of a bicycle demonstrates the algorithm’s ability to classify collective anomalies.
Status of the anomalies under the three baselines for a specific scenario with a car and bicycles in the testing phase.
UCSD Pedestrian 2 [24]: This dataset reports scenes with pedestrian movement parallel to the camera plane. UCSD Pedestrian 2 contains 16 training video samples and 12 testing ones, captured with frames of 360 × 240 pixels.
Table 4 provides a detailed comparison of several state-of-the-art methodologies for anomaly detection on the Pedestrian 2 dataset. The table includes metrics, such as accuracy, hardware used, and processing time, highlighting both reconstruction-based and prediction-based methods.
Our proposed SemiSiam model achieves the highest accuracy at 98.5%, outperforming other reconstruction-based approaches like Conv-AE (90%) in [25] and MemAE (94.1%) in [30]. Additionally, the SemiSiam model demonstrates efficient processing capabilities, achieving this result with 200 epochs and only 54948 parameters. This relatively low number of parameters highlights the efficiency and compactness of our model, making it suitable for real-time anomaly detection applications without compromising accuracy.
When it comes to training time, the SemiSiam model requires only 1 h to complete training on an NVIDIA RTX 3060 Ti GPU, which is comparable to other methods. For instance, Stacked RNN in [31] requires a similar 1 h for 10000 epochs but achieves lower accuracy (92.2%).
Regarding processing speed, our model strikes a balance between accuracy and efficiency. Although methods like FFP report a faster speed of 25 FPS using an NVIDIA GeForce TITAN GPU, FFP achieves a lower accuracy (95.4%) compared to our SemiSiam model. This demonstrates that our approach provides a robust tradeoff between accuracy and computational efficiency.
This comparison highlights the advantage of our SemiSiam model in terms of both performance and efficiency. The relatively low number of parameters combined with high accuracy and a fast training time makes it a strong candidate for real-time applications, where accuracy and efficiency are critical.
Fig. 8 illustrates a complex scenario, specifically Scenario 11 in the Pedestrian 2 dataset. In Frame 11, a bicycle is detected and classified as an anomaly across all baselines. Unlike the observations in Fig. 6(a), this bicycle remains anomalous in the conventional and frame baselines but is deemed normal in the scenario baseline. It is identified as a point anomaly at this stage due to its unique appearance and motion patterns. Subsequently, in Frame 125, another bicycle is detected. It is considered normal in the scenario baseline since bicycles were observed in the preceding ten frames. However, it qualifies as an anomaly in the frame baseline and conventional baseline. This situation is categorized as a collective anomaly with multiple anomalies identified.
Complex anomaly scenario: the proposed methodology detects and classifies anomalies in Scenario 11 of the Pedestrian 2 dataset (Fig. 7). A bicycle is initially considered a point anomaly due to its distinct appearance and motion patterns. (a) Bicycle is classified as an anomaly across all baselines. (b) It is considered normal in scenario-based baselines but classified as a point anomaly in the remaining baselines. (c) It is considered normal in scenario-based baselines but classified as a collective anomaly in the remaining baselines.
The adaptivity capability of video anomaly detection methods represents a nascent paradigm in the field, with few existing works exploring this simulation approach. Our proposed method follows a two-phase training and testing process: initially trained on one dataset and subsequently tested on another to assess its adaptability and performance.
To evaluate the performance of our method, we conducted experiments using the UCSD Pedestrian datasets. In the first experiment, we trained the model on the Pedestrian 1 (Peds1) dataset and tested it on the Pedestrian 2 (Peds2) dataset. This setup assesses how well the model, trained on one dataset, generalizes to a different one. The results of this experiment are summarized in Fig. 9(a), where the model's performance is measured using the area under the curve (AUC) metric. In the second scenario, Peds2 is the training dataset and Peds1 is the target dataset, further evaluating the method's ability to adapt across datasets. The results of this experiment are presented in Fig. 9(b), highlighting the AUC scores achieved with and without adaptation.
Adaptivity capability of the proposed video anomaly detection method. (a) Model is trained on the Pedestrian 1 dataset and tested on the Pedestrian 2 dataset. (b) Model is trained on the Pedestrian 2 dataset and tested on the Pedestrian 1 dataset.
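The cross-domain protocol reduces to a simple harness: optionally run the adaptation step (9) on a few normal clips from the target domain, then score every target test frame and compute the frame-level AUC. The function and loader names below are hypothetical.

```python
from sklearn.metrics import roc_auc_score

def cross_domain_auc(model, score_fn, test_loader, adaptation_loader=None,
                     adapt_fn=None):
    """Cross-domain evaluation harness. The model is assumed to be already
    pretrained on the source dataset; `adapt_fn` (hypothetical) runs the
    adaptation step (9) on a few normal target-domain clips, and `score_fn`
    (hypothetical) maps a batch of frames to per-frame anomaly scores."""
    if adapt_fn is not None and adaptation_loader is not None:
        adapt_fn(model, adaptation_loader)   # the "with adaptation" setting
    scores, labels = [], []
    for frames, y in test_loader:            # y: frame-level ground truth
        scores.extend(score_fn(model, frames))
        labels.extend(y.tolist())
    # Frame-level AUC, the metric reported in Fig. 9
    return roc_auc_score(labels, scores)
```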
CUHK Avenue [32]: This dataset contains 16 training video clips (15328 training frames) and 21 testing clips (15324 testing frames).
Table 5 presents a comparative analysis of anomaly detection methods on the CUHK Avenue dataset, including performance metrics and computational information, such as hardware details and processing times.
Our SemiSiam model achieves the highest accuracy at 88.5%, outperforming both reconstruction-based and prediction-based methods. In comparison, Conv-AE [25] achieves 70% accuracy and ConvLSTM AE [16] reports 77% accuracy, showcasing the superiority of the proposed SemiSiam method.
In terms of computational efficiency, the SemiSiam model operates at 20 FPS on an NVIDIA RTX 3060 Ti GPU. This demonstrates a strong balance between processing speed and accuracy, making it suitable for real-time applications. Although the FFP method [27] reports a faster processing speed of 25 FPS using an NVIDIA GeForce TITAN, it achieves a lower accuracy of 84.9%. The Stacked RNN [31] reports similar training times (1 h for 10000 steps), but with lower accuracy (81.7%).
City of Calgary: The proposed methodology is evaluated for its practical applicability using real-world datasets provided by the City of Calgary, Canada. Defining anomalies here is more complex than in synthetic or simplified datasets. For example, while datasets like UCSD may adopt a simplistic definition where all nonpedestrian entities are considered anomalies, such criteria fail to capture the nuanced nature of anomalies in real-world scenarios. By leveraging this dataset, our methodology demonstrates its efficacy in handling the complexities of urban surveillance and aims to advance anomaly detection techniques relevant to today's dynamic urban landscapes. The dataset comprises 257 clips captured by urban surveillance cameras, with durations ranging from 10 min to 1 h each. These clips cover ten different scenarios, reflecting real-world challenges such as varying weather conditions, changes in lighting, blurry images, and fluctuations in crowd density.
Fig. 10 shows an interesting traffic anomaly scenario in which a reversing car is detected as an anomaly in all three baselines. It is important to acknowledge that certain anomaly patterns related to the direction of movement pose specific challenges. For example, while a U-turn might be considered normal in most cities, it is categorized as an anomaly in some cities, including Calgary. Despite this discrepancy, the proposed algorithm effectively addresses such scenarios.
Scenario where a car reversing is detected as an anomaly. (a) Analysis of the frame where a car reverses. (b) Detected anomaly based on the direction of the movement.
Fig. 11 showcases instances where a bus is detected within the frame. During the training phase, the model learned to recognize the bus as part of the normal pattern. However, when evaluated against the scenario and frame baselines, the bus is detected as an anomaly. This discrepancy highlights the nuanced nature of anomaly detection, where the context in which an object appears can significantly influence its classification.
To evaluate the performance of the proposed model, we considered four different clips, each with a duration of 20 s. The anomalies in these clips are specifically motion anomalies, as the proposed anomaly detection module is designed to monitor this type of anomaly. The specifications of the clips, including the nature and frequency of the anomalies, are illustrated in Table 6.
Conclusion
This article introduces an innovative and adaptable approach to video anomaly detection, with the capability of anomaly classification. The proposed methodology, grounded in multiple baselines—including a conventional method, a frame-based approach, and a scenario-based assessment—showcases remarkable adaptability to diverse scenarios, enhancing the system’s efficacy in detecting anomalies. A significant contribution of this article is the introduction of a semi-supervised FSL approach, which addresses challenges associated with limited training data. This approach not only enhances the algorithm’s adaptability but also makes it a practical and valuable tool for real-world video anomaly detection scenarios. Results across multiple datasets affirm the high-performance capabilities of the proposed methodology.
ACKNOWLEDGMENT
The authors thank the City of Calgary for providing the video data used in this work.