Introduction
Despite extensive efforts and significant achievements in intelligent monitoring systems and scene understanding, video anomaly detection remains one of the most sought-after research domains in academia and industry [1], [2], [3], [4]. The real-time detection, localization, and tracking of anomalies are increasingly crucial for applications such as security systems, crowd management, industrial transportation, and healthcare administration [5], [6], [7], [8], [9].
One of the most pressing challenges in video anomaly detection is the dynamic nature of anomaly definitions across different domains and time frames [10]. What constitutes an anomaly can vary significantly when transitioning between domains and even from one frame to the next [6], [11]. Another challenge is delving into the detected anomaly patterns to extract comprehensive insights into their origins. This analytical depth is instrumental in enhancing monitoring systems' understanding of the causes behind anomaly patterns, allowing for effective prioritization. Anomaly detection can be roughly classified into three primary categories: 1) point; 2) collective; and 3) contextual anomaly detection [12]. Point anomalies are data values that significantly diverge from most data points [12], [13]. Collective anomalies are groups of data points that jointly deviate from the rest, and contextual anomalies surface when data items manifest peculiar behavior within a specific context or setting [12].
This article presents a novel semi-supervised traffic video monitoring system for anomaly detection, localization, and classification. Our method analyzes incoming frames using three distinct baselines. The first baseline involves learning normal patterns during training and labeling deviations from these patterns as anomalies. The second baseline employs a scenario-based assessment, evaluating anomalous patterns based on the context of the previous frames. The third baseline is a frame-based object detection analyzer within the current frame. This multifaceted approach provides a comprehensive perspective on video anomaly detection, allowing adaptability to diverse scenarios and types of anomalies. It is worth noting that a pattern considered an anomaly in the frame-based baseline may be deemed normal when considering the context of previous frames.
The proposed methodology extends beyond anomaly detection by conducting in-depth analyses of the detected anomalies. It distinguishes between anomalies caused by motion and appearance, providing valuable insights into the nature of the anomalies. For instance, a pattern may be classified as a point anomaly in the frame-based baseline while being categorized as a collective anomaly when considering previous frames. This approach offers a nuanced and versatile framework for video anomaly detection, contributing to improved accuracy and adaptability.
The core structure of the proposed method is a semi-supervised Siamese (SemiSiam) network designed to detect contextual anomaly patterns with only limited normal samples provided during the training phase, without any abnormal samples. Input videos undergo comprehensive analysis of their appearance and motion patterns. These features are then directed into distinct encoder engines: the appearance analyzer engine (AAE) and the motion analyzer engine (MAE). Within these encoder engines, the input patterns are mapped to unique latent representations, a critical step in the anomaly detection process. The extracted representations are strategically fused and then fed into the decoder, which plays a pivotal role in reconstructing the input patterns. The algorithm is designed such that similar input patterns yield similar latent representations, ensuring effective anomaly detection. The contributions of this article include the following.
Multiple Baseline Anomaly Detection: This article introduces a flexible anomaly detection framework built on multiple baselines: the conventional method that relies on patterns learned during training, a frame-based approach, and a scenario-based assessment that considers the context of the current frame, even if it contradicts past patterns. This adaptability enhances the system's ability to detect and respond to anomalies effectively in diverse situations.
Anomaly Pattern Classification: This article presents a robust framework for classifying detected anomaly patterns into point, collective, and contextual anomalies. By utilizing these baselines, this classification provides a comprehensive understanding of the nature of the anomalies. It allows the system to prioritize responses based on the specific type of anomaly.
Semi-Supervised Few-Shot Learning (FSL): This article introduces a novel semi-supervised FSL approach for detecting contextual anomaly patterns. It can effectively address the challenge of sparse training data and enhance the algorithm’s practicality.
The remainder of this article is organized as follows. The related works are reviewed in Section II. The proposed problem formulation is described in Section III. Section IV presents SemiSiam as a semi-supervised FSL methodology for video anomaly detection. Section V analyzes the simulation results, and Section VI concludes this article.
Related Works
Reconstruction-based and distribution-based techniques are widely used in video anomaly detection, but they face limitations in adaptability and generalization. For instance, the work in [14] employs spatial-temporal autoencoders (AEs) to reconstruct a sequence of frames, but this method is constrained by its reliance on a fixed set of general features, limiting its adaptability to new, unseen scenarios. Real-world environments often involve situations where data is scarce, making FSL approaches more suitable. In contrast, our work addresses this limitation by utilizing a novel FSL methodology, which adapts more effectively to environments with limited data by leveraging motion and appearance information via two AEs.
The method in [15] employs a 3-D Conv-net AE to learn abstract latent features and predict future frames, but this approach struggles with complex scenarios that require extensive historical data for accurate prediction. Additionally, frame prediction becomes increasingly challenging as the complexity of the environment grows. Similarly, the work in [16] uses a Conv-LSTM-AE for appearance encoding and temporal memorization, which improves accuracy but lacks adaptability in dynamic environments. Our proposed method overcomes these challenges by introducing an architecture that learns both appearance and motion in a more adaptable way, enabling it to perform well even with FSL tasks, which are crucial in scenarios with limited training data.
Generative adversarial networks (GANs) have also been explored for video anomaly detection, as seen in [17], but they suffer from significant computational costs and slow convergence. The AnoGAN framework in [18] learns the distribution of normal data but requires an optimization step during testing, making it computationally expensive. In contrast, our proposed SemiSiam technique strikes a balance between performance and computational efficiency by reducing the reliance on extensive optimization and training time, thus making it more practical for real-time anomaly detection applications.
Meta-learning, a popular FSL approach, has been explored in [19] and [20] for adapting models to specific scenarios. However, this method encounters challenges related to high computational costs, slow convergence, and sensitivity to hyperparameters, which limits its applicability in real-time systems. Our SemiSiam approach directly addresses these issues by offering a lightweight, adaptable model that achieves faster convergence and robustness to varying hyperparameters, making it more suitable for real-world applications where computational resources are limited.
Another notable limitation of existing works, such as the online anomaly detection method proposed in [21], is their reliance on pretrained models, which may fail to generalize to novel anomaly patterns in real-world scenarios. Our approach, in contrast, offers a solution that balances adaptability, performance, and complexity. By employing a semi-supervised learning framework, we can detect anomalies using only normal patterns during training, which is essential in scenarios where abnormal patterns are difficult to collect or define.
Finally, weakly supervised FSL frameworks, such as [22], show promise in improving sample efficiency but still require abnormal patterns during training. This assumption is not always practical, especially in real-world scenarios with limited abnormal data. Our proposed approach addresses this issue by using only normal patterns for training, making it highly effective in semi-supervised contexts, while still maintaining robust performance when confronted with novel anomaly patterns.
Problem Formulation
From the perspective of FSL, video anomaly detection can be considered a semi-supervised problem in the sense that only some nominal samples are available in the training set. A video sequence $X$ and its corresponding video-level annotation $y$ together constitute one training sample.
Video anomaly detection aims to estimate an anomaly score for every frame of a video so that each frame can be labeled as normal or abnormal [1], formally\begin{equation*} S(x) = \arg \max _{\phi } \log p\left ({\phi \mid D_{\text {train}}}\right ) \tag {1}\end{equation*}
\begin{equation*} S(x) = \arg \max _{\phi } \left [{\log p\left ({D_{\text {train}} \mid \phi }\right ) + \log p\left ({\phi }\right )}\right ]. \tag {2}\end{equation*}
\begin{equation*} S(x) = \arg \max _{\phi } \left [{\sum _{i} \log p\left ({y_{i} \mid x_{i}, \phi }\right ) + \log p\left ({\phi }\right )}\right ]. \tag {3}\end{equation*}
The proposed algorithm is devised based on a semi-supervised paradigm, with no abnormal pattern in the training set. Three different datasets are defined: $D_{\text{pretraining}}$, a large set of generic normal samples; $D_{\text{adaptation}}$, a few normal samples from the target scene; and $D_{\text{test}}$, the evaluation set.
The proposed video anomaly detection methodology presents a two-stage algorithm [23]. During the prior training step, the model parameters are learned from $D_{\text{pretraining}}$; they are then refined on $D_{\text{adaptation}}$, so the objective becomes\begin{equation*} S(x) = \arg \max _{\phi } \log p\left ({\phi \mid D_{\text {adaptation}}, D_{\text {pretraining}}}\right ). \tag {4}\end{equation*}
Marginalizing over the prior parameters $\theta$ gives\begin{align*} S(x)=& \arg \max _{\phi } \log p\left ({\phi \mid D_{\text {adaptation}}, D_{\text {pretraining}}}\right ) \\=& \log \int _{\theta } p\left ({\phi \mid D_{\text {adaptation}}, \theta }\right ) p\left ({\theta \mid D_{\text {pretraining}}}\right ) d\theta . \tag {5}\end{align*}
Equation (5) assumes that the posterior $p(\theta \mid D_{\text{pretraining}})$ is sharply peaked around a point estimate $\theta ^{*}$, so the integral can be approximated as\begin{align*} S(x) \approx \log p\left ({\phi \mid D_{\text {adaptation}}, \theta ^{*}}\right ) + \log p\left ({\theta ^{*} \mid D_{\text {pretraining}}}\right ) \tag {6}\end{align*}
and therefore\begin{align*}& \arg \max _{\phi } \log p\left ({\phi \mid D_{\text {adaptation}}, D_{\text {pretraining}}}\right ) \\& \qquad \approx \arg \max _{\phi } \log p\left ({\phi \mid D_{\text {adaptation}}, \theta ^{*}}\right ). \tag {7}\end{align*}
In other words, a massive amount of training data containing all possible normal samples of various scenarios is not needed; the problem instead decomposes into two stages\begin{align*} \text {prior-learning:}~ \theta ^{*}=& \arg \max _{\theta } \log p\left ({\theta \mid D_{\text {pretraining}}}\right ) \tag {8}\\ \text {adaptation:}~ \phi ^{*}=& \arg \max _{\phi } \log p\left ({\phi \mid D_{\text {adaptation}}, \theta ^{*}}\right ). \tag {9}\end{align*}
Across $t$ different scenes, $\theta ^{*}$ is selected such that the adapted parameters $\phi _{i}$ perform well on each scene's data\begin{equation*} \theta ^{*} = \arg \max _{\theta } \sum _{i=1}^{t} \log p\left ({\phi _{i} \mid D_{\text {test}_{i}}}\right ) \tag {10}\end{equation*}
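The two-stage procedure in (8) and (9) amounts to two successive optimization runs over the same network, first on $D_{\text{pretraining}}$ and then, starting from the resulting weights, on $D_{\text{adaptation}}$. The following minimal PyTorch sketch assumes a reconstruction objective; the loader names, epoch counts, and learning rates are illustrative placeholders, not the authors' settings.

```python
import torch
from torch import nn, optim

def train_stage(model: nn.Module, loader, epochs: int, lr: float) -> None:
    """One optimization stage: minimize reconstruction error on normal data.
    Assumes model(x) returns a reconstruction of x; loaders, epoch counts,
    and learning rates are illustrative placeholders."""
    opt = optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for x in loader:                     # batches of normal samples only
            opt.zero_grad()
            mse(model(x), x).backward()
            opt.step()

# Prior-learning (8): obtain theta* from the large generic normal dataset.
# train_stage(model, pretraining_loader, epochs=200, lr=1e-3)
# Adaptation (9): refine to phi* on a few scene-specific normal clips,
# starting from theta* (the weights now in `model`) at a smaller rate.
# train_stage(model, adaptation_loader, epochs=20, lr=1e-4)
```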
Proposed Methodology
A. System Overview
Fig. 1 shows the training and testing phases of the proposed approach. The proposed system monitors the scene and reports the appearance and motion patterns in each frame. The conventional baseline analyzes incoming patterns compared to those modeled during the training phase. The scenario baseline considers the previous frames to detect anomalies (here we used ten previous frames), while the frame baseline focuses on objects in the incoming frames. In each case, frames are analyzed based on appearance and motion patterns, classifying them as point or collective anomalies.
High-level overview of the proposed SemiSiam network for video anomaly detection. It is structured by two encoders, one fusion network, and one decoder. (a) Learning procedure in the training phase. (b) Evaluating incoming frames in the testing phase.
To detect contextual anomalies, the proposed model aims to generate meaningful latent representations for similar input patterns. The AAE and MAE encode each input pattern to different latent representations. The fusion block then fuses the encoded features into the latent space. Since the training dataset contains only normal patterns, the proposed network aims to decrease the distance among the generated latent representations during the training phase. The decoder’s role is to reconstruct the input patterns at the output.
B. Multiple Baseline Anomaly Detection
Each frame undergoes two processes: it is initially input into the pretrained object detection block to identify the moving objects in the scene. The extracted information serves as both the appearance details of each object and the coordinates utilized in the optical flow extraction algorithm to trace the movement of each object, capturing information about their velocity and direction. Subsequently, multiple baseline anomaly detection is initiated.
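The paper does not name a specific optical flow algorithm; the sketch below uses OpenCV's dense Farneback flow (an assumption) to derive the per-object velocity and direction that feed the baselines, with the bounding boxes supplied by the object detection block.

```python
import cv2
import numpy as np

def motion_features(prev_frame: np.ndarray, frame: np.ndarray):
    """Dense optical flow between consecutive frames, summarized as per-pixel
    velocity (magnitude) and direction (angle). Farneback flow is an
    assumption; the paper does not specify the flow algorithm."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return magnitude, angle

def object_motion(magnitude, angle, box):
    """Mean velocity and direction inside a detected object's bounding box,
    as consumed by the multiple baseline anomaly detection."""
    x1, y1, x2, y2 = box
    return magnitude[y1:y2, x1:x2].mean(), angle[y1:y2, x1:x2].mean()
```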
In the conventional baseline, the information from each frame is compared with what was learned during the training phase, and any deviation is labeled as an anomaly. The second baseline, the scenario baseline, involves comparing the incoming frame with the previous ten frames in the scenario. Any inconsistencies with the scenario are considered anomalies. The third baseline is the frame-based baseline, where the objects within the frames are compared. The patterns are analyzed in each case based on their motion and appearance characteristics. Algorithm 1 outlines the proposed multiple baseline anomaly detection method.
Algorithm 1 Proposed Multiple Baseline Anomaly Detection Method
Input: video frames
Output: anomaly detection labels
for each frame do
Input the frame into the pretrained object detection block
Extract the appearance details of each detected object
Use the optical flow extraction algorithm to capture the velocity and direction of each object
Conventional Baseline:
Compare the frame information with the patterns learned during training
Calculate the deviation from the learned patterns
Label as anomaly if the deviation exceeds its threshold
Scenario Baseline:
Compare the incoming frame with the previous ten frames
Calculate the scenario deviation
Label as anomaly if the scenario deviation exceeds its threshold
Frame-based Baseline:
Compare the objects within the current frame
Calculate the frame-based deviation
Label as anomaly if the frame-based deviation exceeds its threshold
end for
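As a concrete reading of Algorithm 1, the sketch below applies the three baselines to a single object's feature vector. The z-score deviation measure and the threshold values are illustrative choices; the paper leaves both symbolic.

```python
import numpy as np

# Illustrative thresholds; Algorithm 1 leaves the deviation measures and
# thresholds symbolic, so a z-score test is one plausible instantiation.
TAU_CONV = TAU_SCEN = TAU_FRAME = 2.0

def baseline_labels(feat, train_mean, train_std, history, frame_feats):
    """Apply the three baselines to one object's feature vector `feat`
    (e.g., an appearance descriptor concatenated with velocity/direction).
    history: object features from the previous ten frames (scenario baseline);
    frame_feats: features of the other objects in the current frame."""
    eps = 1e-8
    labels = {}
    # Conventional baseline: deviation from the patterns learned in training.
    labels["conventional"] = np.abs((feat - train_mean) / (train_std + eps)).max() > TAU_CONV
    # Scenario baseline: deviation from the previous ten frames' statistics.
    hist = np.stack(history)
    labels["scenario"] = np.abs((feat - hist.mean(0)) / (hist.std(0) + eps)).max() > TAU_SCEN
    # Frame-based baseline: deviation from the other objects in the same frame.
    peers = np.stack(frame_feats)
    labels["frame"] = np.abs((feat - peers.mean(0)) / (peers.std(0) + eps)).max() > TAU_FRAME
    return labels
```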
C. Anomaly Pattern Classification
The proposed video anomaly detection goes beyond a simple binary classification, aiming for a thorough analysis within each identified baseline. Specifically, it categorizes the identified anomaly patterns into three distinctive types: point, collective, and contextual anomalies. To realize this, the gathered information on velocity and direction is summarized, and the summarized data becomes the foundation for comparison against the patterns established during the training phase.
When a single object demonstrates a noticeable deviation from the recognized patterns, it is classified as a “point anomaly.” On the contrary, if multiple objects collectively show deviations, the frame is labeled a “collective anomaly.” The SemiSiam structure is designed to detect the contextual anomalies described in the following section. Algorithm 2 outlines the proposed anomaly pattern classification.
Algorithm 2 Proposed Anomaly Pattern Classification
Input: video frames, velocity v, direction d
Output: anomaly classification labels
for each frame do
Summarize the velocity and direction information of the detected objects
Point Anomaly Detection:
for each object do
Calculate the object's deviation from the learned patterns
Label as a point anomaly if a single object's deviation exceeds its threshold
end for
Collective Anomaly Detection:
Calculate the collective deviation of the objects in the frame
Label as a collective anomaly if multiple objects deviate collectively
Contextual Anomaly Detection:
Refer to the SemiSiam structure for context-based anomaly detection
end for
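A minimal sketch of the point/collective decision rule described above, assuming per-object anomaly flags produced by one of the baselines; contextual anomalies are delegated to the SemiSiam network and are not decided here.

```python
def classify_anomaly(object_flags):
    """Map per-object anomaly flags (from one baseline of Algorithm 1) to the
    point/collective taxonomy of Algorithm 2; contextual anomalies are
    handled separately by the SemiSiam network."""
    n_anomalous = sum(object_flags)
    if n_anomalous == 0:
        return "normal"
    if n_anomalous == 1:
        return "point anomaly"          # a single deviating object
    return "collective anomaly"         # multiple objects deviate together

# Example: one bicycle among pedestrians -> point anomaly
# classify_anomaly([False, False, True])  ->  "point anomaly"
```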
D. Semi-Siam: Semi-Supervised Siamese Network
The proposed SemiSiam network analyzes each video sequence based on its appearance and motion. The algorithm employs two different encoder engines to inspect the appearance and movement of each frame. The input video sequence is divided into separate frames $F$, and the optical flow between consecutive frames yields the motion patterns $M$.
AAE: The proposed AAE takes the video frames $F$ as the input and encodes them to the corresponding latent representations.
MAE: The MAE is used to identify motion anomalies by processing $M$ within each video frame. Using optical flow as a preprocessing step, it maps the input movement patterns to the corresponding latent representations.
As illustrated in Fig. 2, the proposed encoders contain six convolutional layers to extract features and three fully connected layers to classify each pattern as normal or anomalous. The input of both the AAE and the MAE is a (158, 258, 3) frame, matching the decoder's output shape.
Fusion Block: A fusion block is introduced to combine the latent representations encoded separately by the AAE and MAE into a single joint latent representation.
Decoder: The decoder is crucial in guiding the AAE and MAE to generate meaningful latent representations. Essentially, the proposed AE is constrained to create similar latent representations that still enable the faithful reproduction of the input patterns; the reconstruction requirement prevents the model from collapsing to identical latent representations for all inputs.
The input of the decoder is a 256-dimensional latent representation, and its output is a (158, 258, 3) image. It comprises fully connected layers of 2048 and 8064 units, followed by a reshaping to a (64, 9, 14) feature map. Subsequent upsampling convolutional layers restore the (158, 258, 3) output resolution.
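The following PyTorch sketch assembles the described components: two encoders, a fusion block, and a decoder. Only the 256-dimensional code, the 2048- and 8064-unit decoder layers, the (64, 9, 14) reshape, and the (158, 258, 3) input/output shape come from the text; the channel widths, strides, encoder fully connected sizes, and the exact upsampling layers are assumptions.

```python
import torch
from torch import nn

class Encoder(nn.Module):
    """Six convolutional layers plus three fully connected layers, as the
    text describes; channel widths and strides are assumptions."""
    def __init__(self, latent_dim: int = 256):
        super().__init__()
        chans = [3, 16, 32, 64, 64, 64, 64]
        conv = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            conv += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU()]
        self.conv = nn.Sequential(*conv)
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.LazyLinear(1024), nn.ReLU(),
                                nn.Linear(1024, 512), nn.ReLU(),
                                nn.Linear(512, latent_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(x))

class SemiSiam(nn.Module):
    def __init__(self, latent_dim: int = 256):
        super().__init__()
        self.aae = Encoder(latent_dim)              # appearance analyzer engine
        self.mae = Encoder(latent_dim)              # motion analyzer engine
        self.fuse = nn.Linear(2 * latent_dim, latent_dim)
        # Decoder: 256 -> 2048 -> 8064 (= 64*9*14), reshaped to (64, 9, 14),
        # then upsampled back to a (3, 158, 258) frame.
        self.decode_fc = nn.Sequential(nn.Linear(latent_dim, 2048), nn.ReLU(),
                                       nn.Linear(2048, 64 * 9 * 14), nn.ReLU())
        self.decode_conv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),
            nn.Upsample(size=(158, 258)))           # exact layer choice assumed

    def forward(self, frame: torch.Tensor, motion: torch.Tensor):
        # Encode appearance and motion separately, fuse, then reconstruct.
        z = self.fuse(torch.cat([self.aae(frame), self.mae(motion)], dim=1))
        recon = self.decode_conv(self.decode_fc(z).view(-1, 64, 9, 14))
        return recon, z
```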
E. Objective Function
The proposed SemiSiam network considers the following underlying assumptions.
1) Considering appearance and motion, if two input patterns $x_{i}$ and $x_{j}$ are close in high-density regions, their corresponding latent representations $z_{i}$ and $z_{j}$ should also be close.
2) Considering $x_{i}$ and $x_{j}$ as two samples from the normal class, $z_{i}$ and $z_{j}$ as their corresponding latent representations, and the Euclidean distance as the distance measure, the following condition should be satisfied:\begin{equation*} \text {if}~ p\left ({x_{i} \mid 0}\right ) \approx p\left ({x_{j} \mid 0}\right ), \text {then}~ \|z_{i} - z_{j}\|_{2}^{2} \ll 1 \tag {11}\end{equation*}where $p(x_{i} \mid 0)$ is the probability of the $i$th frame being classified as the normal class.
The pretraining phase is designed to learn general information about the normal patterns. Following this, the adaptation phase transfers the knowledge learned during the previous step and improves the algorithm's adaptivity using a few training samples from each scene. Two training datasets are considered: 1) $D_{\text{pretraining}}$, containing generic normal patterns, and 2) $D_{\text{adaptation}}$, containing a few normal samples from the target scene.
The proposed SemiSiam network follows a two-stage training procedure with no abnormal patterns in the training data.
Latent Similarity Loss Function: The distance between two input patterns is computed as the distance between their latent representations. That is,\begin{equation*} \text {loss}_{\text {Latent Similarity}} = \min \| z_{i} - z_{j} \|_{2}^{2}\ \forall z_{i}, z_{j} \in Z. \tag {12}\end{equation*}
Reconstruction Error Loss Function: The reconstruction loss function makes the proposed network generate meaningful latent representations. Within each batch, the input patterns $x_{i}$ are reconstructed as $\hat {x}_{i}$, and the reconstruction error is minimized\begin{equation*} \text {loss}_{\text {Reconstruction Error}} = \min \sum \| x_{i} - \hat {x}_{i} \|_{2}^{2}\ \forall x_{i} \in M . \tag {13}\end{equation*}
As given in (12) and (13), two constraints are considered for each batch. The first constraint makes the latent representations similar, and the second reduces the reconstruction error of the input patterns. A max function merges these two objectives\begin{equation*} \text {loss} = \max \left ({\alpha \, \text {loss}_{\text {Latent Similarity}},\ \beta \, \text {loss}_{\text {Reconstruction Error}},\ 0}\right ) \tag {14}\end{equation*}where $\alpha$ and $\beta$ are weighting coefficients.
Algorithm 3 Proposed Loss Function Computation Procedure
Divide the input patterns into different batches
Compute the corresponding latent representations
Compute the reconstructed patterns
Compute the Latent Similarity Loss Function:\begin{equation*} \text {loss}_{\text {Latent}} = \min \| z_{i} - z_{j} \|_{2}^{2}\ \forall z_{i}, z_{j} \in Z\end{equation*}
Compute the Reconstruction Error Loss:\begin{equation*} \text {loss}_{\text {Reconstruction}} = \min \sum \| x_{i} - \hat {x}_{i} \|_{2}^{2}\ \forall x_{i} \in M\end{equation*}
Combine the losses using the max function:\begin{equation*} \text {loss} = \max (\alpha \, \text {loss}_{\text {Latent}},\ \beta \, \text {loss}_{\text {Reconstruction}},\ 0)\end{equation*}
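A batch-level PyTorch reading of (12)–(14); the mean reductions over the batch and the placeholder values of $\alpha$ and $\beta$ are assumptions, since the paper leaves them unspecified.

```python
import torch

def semisiam_loss(z: torch.Tensor, x: torch.Tensor, x_hat: torch.Tensor,
                  alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """Batch loss following (14). z: (B, D) latent codes; x, x_hat: (B, C, H, W)
    inputs and reconstructions. Reductions and alpha/beta are assumptions."""
    # (12): pairwise squared distances between latent codes within the batch
    diff = z.unsqueeze(0) - z.unsqueeze(1)              # (B, B, D)
    loss_latent = (diff ** 2).sum(-1).mean()
    # (13): squared reconstruction error of the input patterns
    loss_recon = ((x - x_hat) ** 2).sum(dim=(1, 2, 3)).mean()
    # (14): merge the weighted terms with a max, floored at zero
    zero = torch.zeros((), device=z.device)
    return torch.max(torch.max(alpha * loss_latent, beta * loss_recon), zero)
```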
Simulation Results
The performance of the proposed algorithm is compared to reconstruction-based methods and FSL methods for video anomaly detection. The cross-domain simulation performance is also presented; it evaluates the adaptivity of the proposed methodology when the training and testing data come from different datasets.
A. Experimental Results and Discussion
Four datasets are utilized to assess the performance of the proposed methodology and compare it with existing works in the literature.
UCSD Pedestrian 1 [24]: This dataset was captured by a stationary camera overlooking pedestrian walkways. In the normal setting, the video only contains pedestrians, while in the testing set, many unseen patterns appear, which are considered anomalies. UCSD Pedestrian 1 is composed of 34 training and 36 testing scenarios, acquired with frames of 238 × 158 pixels.
A sanity check is performed to validate the proposed SemiSiam methodology. Table 3 provides a comprehensive comparison with a wide range of techniques, including reconstruction-based and prediction-based methods. It is important to note that the sanity check employs the standard training/testing setup provided by the datasets.
Fig. 5 illustrates some samples of the proposed method in detecting and classifying anomalies. In Frame 30 of Testing Scenario Number 26 in UCSD 1, the object detection algorithm identifies a bicycle. As it deviates from the patterns observed during the training phase, it is labeled as an anomaly in the conventional baseline. Additionally, since it is detected for the first time in this frame, it is considered an anomaly in the scenario baseline, indicating the recognition of a new scenario. According to the frame baseline, it is also classified as an anomaly. From frame number 40 onward (beyond ten frames), it continues to be categorized as an anomaly in the conventional and frame baselines but is now considered normal in the scenario baseline. The classification of these anomalies as “point anomalies” is based on the analysis of objects’ appearance and motion patterns, specifically considering their distinct velocities and appearances.
Detection and classification of a bicycle as a point anomaly based on distinct appearance and velocity patterns in the proposed video anomaly detection methodology. (a) Bicycle is identified as an anomaly across all baselines. (b) Bicycle is considered normal in scenario-based baselines but remains an anomaly for the rest.
Fig. 6 illustrates the significance of employing multiple baselines and classifying anomalies. In Fig. 6(a), we have a snapshot of the training set with only normal patterns, specifically pedestrians. Fig. 6(b) displays a testing frame where the proposed algorithm detects a car labeled as an anomaly in all three baselines: in the conventional baseline because there was no car in the training set; in the scenario baseline because no car appeared in the ten previous frames; and in the frame baseline because it differs from the other objects (pedestrians in this case) within the frame. Since no other anomaly patterns are detected, it is classified as a point anomaly because of its distinct appearance and velocity. Fig. 6(c) highlights the effectiveness of the proposed methodology, where the car is labeled as an anomaly in the conventional and frame baselines but is considered normal in the scenario baseline, as it was previously detected. Fig. 6(d) introduces a scenario with the appearance of a bicycle. While existing methods might simply flag the frame as anomalous, the proposed algorithm designates the car as an anomaly in the frame and conventional baselines but normal in the scenario baseline; the bicycle, however, is an anomaly in all baselines. Since there are multiple anomaly patterns, they are classified as collective anomalies. Fig. 7 shows the status of the anomalies under the three baselines for this specific scenario.
Illustration of the proposed methodology showcasing detecting and classifying anomalies in various scenarios. (a) Training set with only normal patterns (pedestrians). (b) Testing frame detecting a car as an anomaly in all three baselines. (c) Highlight the proposed methodology’s power with a scenario where the car is labeled differently in various baselines. (d) Introducing a situation with the appearance of a bicycle demonstrates the algorithm’s ability to classify collective anomalies.
Status of the anomalies under the three baselines for a specific scenario with a car and bicycles in the testing phase.
UCSD Pedestrian 2 [24]: This dataset reports scenes with pedestrian movement parallel to the camera plane. UCSD Pedestrian 2 contains 16 training video samples and 12 testing ones, captured with frames of 360 × 240 pixels.
Table 4 provides a detailed comparison of several state-of-the-art methodologies for anomaly detection on the Pedestrian 2 dataset. The table includes metrics, such as accuracy, hardware used, and processing time, highlighting both reconstruction-based and prediction-based methods.
Our proposed SemiSiam model achieves the highest accuracy at 98.5%, outperforming other reconstruction-based approaches like Conv-AE (90%) in [25] and MemAE (94.1%) in [30]. Additionally, the SemiSiam model demonstrates efficient processing capabilities, achieving this result with 200 epochs and only 54948 parameters. This relatively low number of parameters highlights the efficiency and compactness of our model, making it suitable for real-time anomaly detection applications without compromising accuracy.
When it comes to training time, the SemiSiam model requires only 1 h to complete training on an NVIDIA RTX 3060 Ti GPU, which is comparable to other methods. For instance, Stacked RNN in [31] requires a similar 1 h for 10000 epochs but achieves lower accuracy (92.2%).
Regarding processing speed, our model strikes a balance between accuracy and efficiency. Although methods like FFP report a faster speed of 25 FPS using an NVIDIA GeForce TITAN GPU, FFP achieves a lower accuracy (95.4%) compared to our SemiSiam model. This demonstrates that our approach provides a robust tradeoff between accuracy and computational efficiency.
This comparison highlights the advantage of our SemiSiam model in terms of both performance and efficiency. The relatively low number of parameters combined with high accuracy and a fast training time makes it a strong candidate for real-time applications, where accuracy and efficiency are critical.
Fig. 8 illustrates a complex scenario, specifically Scenario 11 in the Pedestrian 2 dataset. In Frame 11, a bicycle is detected and classified as an anomaly across all baselines. Unlike the observations in Fig. 6(a), this bicycle remains anomalous in the conventional and frame baselines but is deemed normal in the scenario baseline. It is identified as a point anomaly at this stage due to its unique appearance and motion patterns. Subsequently, in Frame 125, another bicycle is detected. It is considered normal in the scenario baseline since bicycles were observed in the preceding ten frames. However, it qualifies as an anomaly in the frame baseline and conventional baseline. This situation is categorized as a collective anomaly with multiple anomalies identified.
Complex anomaly scenario: the proposed methodology detects and classifies anomalies in Scenario 11 of the Pedestrian 2 dataset (Fig. 7). A bicycle is initially considered a point anomaly due to its distinct appearance and motion patterns. (a) Bicycle is classified as an anomaly across all baselines. (b) It is considered normal in scenario-based baselines but classified as a point anomaly in the remaining baselines. (c) It is considered normal in scenario-based baselines but classified as a collective anomaly in the remaining baselines.
The adaptivity capability of video anomaly detection methods represents a nascent paradigm in the field, with few existing works exploring this simulation approach. Our proposed method follows a two-phase training and testing process: initially trained on one dataset and subsequently tested on another to assess its adaptability and performance.
To evaluate the performance of our method, we conducted experiments using the UCSD Pedestrian datasets. In the first experiment, we trained the model on the Pedestrian 1 (Peds1) dataset and tested it on the Pedestrian 2 (Peds2) dataset. This setup assesses how well the model, trained on one dataset, generalizes to a different one. The results of this experiment are summarized in Fig. 9(a), where the model's performance is measured using the area under the curve (AUC) metric. In the second scenario, Peds2 is the training dataset and Peds1 is the target dataset, further evaluating the method's ability to adapt across datasets. The results of this experiment are presented in Fig. 9(b), highlighting the AUC scores achieved with and without adaptation.
Adaptivity capability of the proposed video anomaly detection method. (a) Model is trained on the Pedestrian 1 dataset and tested on the Pedestrian 2 dataset. (b) Model is trained on the Pedestrian 2 dataset and tested on the Pedestrian 1 dataset.
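The cross-domain protocol reduces to a simple harness: optionally run the adaptation step (9) on a few normal clips from the target domain, then score every target test frame and compute the frame-level AUC. The function and loader names below are hypothetical.

```python
from sklearn.metrics import roc_auc_score

def cross_domain_auc(model, score_fn, test_loader, adaptation_loader=None,
                     adapt_fn=None):
    """Cross-domain evaluation harness. The model is assumed to be already
    pretrained on the source dataset; `adapt_fn` (hypothetical) runs the
    adaptation step (9) on a few normal target-domain clips, and `score_fn`
    (hypothetical) maps a batch of frames to per-frame anomaly scores."""
    if adapt_fn is not None and adaptation_loader is not None:
        adapt_fn(model, adaptation_loader)   # the "with adaptation" setting
    scores, labels = [], []
    for frames, y in test_loader:            # y: frame-level ground truth
        scores.extend(score_fn(model, frames))
        labels.extend(y.tolist())
    # Frame-level AUC, the metric reported in Fig. 9
    return roc_auc_score(labels, scores)
```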
CUHK Avenue [32]: This dataset contains 16 training video clips (15328 training frames) and 21 testing clips (15324 testing frames).
Table 5 presents a comparative analysis of anomaly detection methods on the CUHK Avenue dataset, including performance metrics and computational information, such as hardware details and processing times.
Our SemiSiam model achieves the highest accuracy at 88.5%, outperforming both reconstruction-based and prediction-based methods. In comparison, Conv-AE [25] achieves 70% accuracy and ConvLSTM AE [16] reports 77% accuracy, showcasing the superiority of the proposed SemiSiam method.
In terms of computational efficiency, the SemiSiam model operates at 20 FPS on an NVIDIA RTX 3060 Ti GPU. This demonstrates a strong balance between processing speed and accuracy, making it suitable for real-time applications. Although the FFP method [27] reports a faster processing speed of 25 FPS using an NVIDIA GeForce TITAN, it achieves a lower accuracy of 84.9%. The Stacked RNN [31] reports similar training times (1 h for 10000 steps), but with lower accuracy (81.7%).
City of Calgary: The proposed methodology is evaluated for its practical applicability using real-world datasets provided by the City of Calgary, Canada. Defining anomalies here is more complex than in synthetic or simplified datasets. For example, while datasets like UCSD may adopt a simplistic definition where all nonpedestrian entities are considered anomalies, such criteria fail to capture the nuanced nature of anomalies in real-world scenarios. By leveraging this dataset, our methodology demonstrates its efficacy in handling the complexities of urban surveillance and aims to advance anomaly detection techniques relevant to today's dynamic urban landscapes. The dataset comprises 257 clips captured by urban surveillance cameras, with durations ranging from 10 min to 1 h each. These clips cover ten different scenarios, reflecting real-world challenges such as varying weather conditions, changes in lighting, blurry images, and fluctuations in crowd density.
Fig. 10 shows an interesting traffic anomaly scenario in which a reversing car is detected as an anomaly in all three baselines. It is important to acknowledge that certain anomaly patterns related to the direction of movement pose specific challenges. For example, while a U-turn might be considered normal in most cities, it is categorized as an anomaly in some cities, including Calgary. Despite this discrepancy, the proposed algorithm effectively addresses such scenarios.
Scenario where a car reversing is detected as an anomaly. (a) Analysis of the frame where a car reverses. (b) Detected anomaly based on the direction of the movement.
Fig. 11 showcases instances where a bus is detected within the frame. During the training phase, the model learned to recognize the bus as part of the normal pattern. However, when evaluated against the scenario and frame baselines, the bus is detected as an anomaly. This discrepancy highlights the nuanced nature of anomaly detection, where the context in which an object appears can significantly influence its classification.
To evaluate the performance of the proposed model, we considered four different clips, each with a duration of 20 s. The anomalies in these clips are specifically motion anomalies, as the proposed anomaly detection module is designed to monitor this type of anomaly. The specifications of the clips, including the nature and frequency of the anomalies, are illustrated in Table 6.
Conclusion
This article introduces an innovative and adaptable approach to video anomaly detection, with the capability of anomaly classification. The proposed methodology, grounded in multiple baselines—including a conventional method, a frame-based approach, and a scenario-based assessment—showcases remarkable adaptability to diverse scenarios, enhancing the system’s efficacy in detecting anomalies. A significant contribution of this article is the introduction of a semi-supervised FSL approach, which addresses challenges associated with limited training data. This approach not only enhances the algorithm’s adaptability but also makes it a practical and valuable tool for real-world video anomaly detection scenarios. Results across multiple datasets affirm the high-performance capabilities of the proposed methodology.
ACKNOWLEDGMENT
The authors thank the City of Calgary for providing the video data used in this work.