Introduction
Anomaly detection (AD), also known as novelty detection or outlier detection, is the task of identifying abnormal cases in a pool of collected data or in a data stream. It has been studied extensively since the 1960s [1] and applied in a broad range of domains, such as fraud detection [2]–[4], network security [5]–[7], video surveillance [8]–[10], medical diagnosis [11]–[13], and multi-sensor data analysis [14]–[18]. In particular, as IoT and big data technologies have become commonplace, acquiring meaningful features for AD from massive sensor data has become more challenging. Under these circumstances, recent advances in neural networks and deep learning have significantly influenced the field of AD, and deep anomaly detection (DAD) methods have demonstrated improved performance in many complicated AD tasks [19].
DAD frameworks can be classified into supervised, unsupervised, and semi-supervised AD according to the problem and data formulation. Supervised AD is feasible when both normal and abnormal samples are sufficient and labeled [20]. In many real-world applications, however, abnormal samples are neither sufficient nor labeled [19], which is why semi-supervised and unsupervised methods have been applied widely in DAD. Unlike supervised AD, which directly determines whether an input is normal, semi-supervised and unsupervised AD methods learn the features of normality and then calculate an anomaly score that measures the degree of abnormality.
A recent review [21] further categorized DAD according to the role of the deep neural network (DNN) in the AD process. Two major branches are generic normality feature learning and anomaly measure-dependent feature learning. The former is based on general feature extraction methods, including autoencoders and generative adversarial networks (GANs); the network learns generic representations that reconstruct, generate, or predict normal data well, and the reconstruction error is typically used as the anomaly score. In the latter approach, feature extraction depends on the anomaly scoring function: by designing a loss function for a specific anomaly measure, the deep learning model learns score-dependent latent features, e.g., nearest-neighbor distance, one-class classification (DSVDD [22]), and clustering-based scores (DAGMM [23]).
We focus on the former approach, in particular autoencoder-based methods in semi-supervised AD settings, which are easy to implement and offer straightforward intuition for detecting anomalies [21]. Deep autoencoders perform dimensionality reduction, in the spirit of principal component analysis [24]–[26] and random projection [27]–[29]. When trained on normal samples, the network yields low reconstruction error on normal data but high error on abnormal data that was not seen during training; this difference allows the two classes to be separated by the reconstruction error. However, AD performance can be improved further, because minimizing the reconstruction error is not identical to maximizing detection performance. In this regard, one way to improve DAD performance is to leverage additional sources for anomaly scoring (i.e., anomaly source diversification); here we exploit aleatoric uncertainty and recursive reconstruction errors.
Depending on its cause, uncertainty in deep learning is of two types: epistemic and aleatoric [30]. Epistemic uncertainty, also called model uncertainty, originates from differences between trained models and can be reduced by acquiring more data. Aleatoric uncertainty, also called data uncertainty, is attributed to the data itself and is therefore inherent and irreducible. Epistemic uncertainty has recently been considered in DAD because the reconstruction of an abnormal sample varies significantly across models [31]–[35]. Aleatoric uncertainty, on the other hand, has received less attention in DAD and has mainly been used as a threshold for classifying normal and abnormal samples [36].
In this study, we propose a novel DAD framework that considers aleatoric uncertainty by introducing a quantile autoencoder (QAE). We leverage aleatoric uncertainty under the assumption of channel-wise consistency in normal conditions; that is, the inherent deviation of normal data is expected to be smaller than that of abnormal data. Aleatoric uncertainty, expressed as the range between two quantiles, is used in the proposed framework together with reconstruction errors. In addition, we propose the abnormality accumulation (AA) technique, which aggregates the errors of recursive reconstructions and calculates the anomaly score from them, making the difference between the normal and abnormal distributions more evident. We verified the proposed framework on multivariate sensor datasets from different domains. Each of the two methods contributes to anomaly source diversification, and we further provide theoretical grounds that support this idea under the assumption of Gaussian error distributions.
The main contributions of this study are as follows:
We propose the QAE for uncertainty-based DAD. The aleatoric uncertainty term, defined as the range between two predicted quantiles, is additionally considered in anomaly scoring. To the best of our knowledge, this is the first work to utilize a QAE and the quantile range as a source of the anomaly score in AD.
We propose AA, in which recursive reconstruction errors are additionally considered in anomaly scoring. AA decreases the overlap between the anomaly score distributions of normal and abnormal data, making the two distributions more distinguishable and facilitating the separation of normal samples from anomalies. The performance of the proposed QAE-AA is tested on various multivariate sensor datasets, where it demonstrates a significant improvement in AD performance in terms of AUROC.
We introduce the concept of anomaly source diversification, which states that normal and abnormal samples become easier to distinguish as more diverse error sources are gathered for calculating the anomaly score. We provide mathematical proofs of why anomaly source diversification is helpful in reconstruction error-based DAD under the assumption of Gaussian error distributions. This explains the use of the QAE, AA, and Mahalanobis distance in the proposed framework, and we further show that the empirical error distributions can be modeled by a mixture of Gaussians.
This paper consists of five sections, including this introduction. Section II reviews related DAD methods. Section III describes the proposed framework, QAE-AA, and the concept of anomaly source diversification. Experimental results and conclusions are presented in Sections IV and V, respectively.
Related Works
A. Reconstruction Error Based Methods
In DAD based on generic normality feature learning, the difference between an input and its reconstructed output (e.g., the mean squared error) is typically used as the measure of abnormality. The autoencoder (AE) [45]–[47] and the variational autoencoder (VAE) [37], [48], [49] are widely used because of their capability to learn latent representations. Once a neural network is trained on normal samples to minimize the reconstruction error, normal samples are reconstructed effectively from lower-dimensional latent features, whereas abnormal samples are not. Anomalies therefore produce larger reconstruction errors and can be distinguished by specifying a suitable threshold for the normal cases. In the recent work of [39], a DAD method named reconstruction along projection pathways (RAPP) was introduced, which leverages reconstruction errors in the latent spaces: the latent features at each encoder layer are collected during the first forward pass, the reconstructed output is fed into the encoder again, and the differences in the latent features between the two forward passes are used as the anomaly score.
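To make the second-forward-pass mechanism of RAPP concrete, the following PyTorch sketch reflects our reading of the idea rather than the original implementation of [39]: the same encoder is applied to the input and to its reconstruction, and the layer-wise differences of the latent features are aggregated (here by a simple sum of squared differences, which is an assumption; [39] defines more elaborate aggregation metrics).

```python
import torch
import torch.nn as nn

def hidden_activations(encoder: nn.Sequential, x: torch.Tensor):
    """Collect the activation of every encoder layer during a forward pass."""
    activations, h = [], x
    for layer in encoder:
        h = layer(h)
        activations.append(h)
    return activations

@torch.no_grad()
def rapp_score(encoder: nn.Sequential, decoder: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Layer-wise latent differences between the first and second forward passes."""
    acts_first = hidden_activations(encoder, x)
    x_rec = decoder(acts_first[-1])                  # reconstruction from the bottleneck feature
    acts_second = hidden_activations(encoder, x_rec)
    diffs = [(a - b).pow(2).mean(dim=-1) for a, b in zip(acts_first, acts_second)]
    return torch.stack(diffs, dim=-1).sum(dim=-1)    # simple sum over layers (our assumption)
```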
Another branch is based on generative adversarial networks (GANs) [11], [38], [43], [50], [51]. With a GAN architecture, a neural network can model the distribution of normal samples. Anomaly scoring in GAN-based approaches is also based on the reconstruction error, and the discriminator loss is additionally used alongside it [11], [43], [50], [51]. GANomaly [38] performs anomaly scoring based on the reconstruction error of the bottleneck features, so the score is defined in the latent space. A unified framework for GAN-based DAD is summarized in [43], which shows that an ensemble of anomaly scores from GAN variants further improves detection performance.
These studies show that utilizing latent reconstruction errors, the discriminator loss, and ensembles of reconstruction errors improves AD performance, which can be explained by the concept of anomaly source diversification. Similarly, uncertainty can serve as an additional source and contribute to improving DAD methods.
B. Anomaly Detection With Uncertainty
There are various methods for dealing with uncertainty within a deep learning framework, such as Bayesian deep learning, the Monte Carlo (MC) dropout technique, and deep ensembles [52]. In particular, the MC dropout technique has been adopted in uncertainty-based DAD applications [31]–[35]. MC dropout applies dropout in both the training and inference stages, so a single neural network can generate different outputs through the probabilistic connections between neurons. Statistical information about the outputs is obtained through MC sampling with dropout activated at inference; this approach mainly exploits epistemic uncertainty. The following examples use MC dropout for epistemic uncertainty quantification: [31] studied deep learning-based time-series prediction with a confidence interval for the Uber dataset, performing AD by triggering an alarm whenever the observed value fell outside the 95% predictive interval. In [32], uncertainty in terms of the variance of the reconstruction errors was used as a weighting factor for the anomaly score. In medical imaging, uncertainty was adopted to diagnose diabetic retinopathy from fundus images in [33]. In [34], pixel-wise variations in retinal optical coherence tomography images were derived and used for segmenting abnormal areas. Similarly, [35] derived the uncertainty of abnormal images in the MVTec-AD dataset and compared the area under the receiver operating characteristic curve (AUROC) between residual-based and uncertainty-based detection results.
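As a minimal illustration of the MC dropout mechanism described above (our own sketch, not code from the cited works), dropout is kept active at inference and the sample variance over repeated stochastic forward passes serves as the epistemic uncertainty estimate.

```python
import torch
import torch.nn as nn

class DropoutRegressor(nn.Module):
    """Small reconstruction network with dropout layers used for MC sampling."""
    def __init__(self, in_dim: int, hidden: int = 64, p: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, in_dim),
        )

    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 50):
    """Keep dropout active at inference and aggregate the stochastic outputs."""
    model.train()  # dropout stays active; batch-norm layers, if any, should be kept in eval mode
    outputs = torch.stack([model(x) for _ in range(n_samples)])  # (n_samples, batch, dim)
    mean = outputs.mean(dim=0)           # predictive mean
    epistemic_var = outputs.var(dim=0)   # spread across passes approximates epistemic uncertainty
    return mean, epistemic_var
```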
In this study, we consider aleatoric uncertainty via QAE and use the uncertainty term to diversify the sources for anomaly scoring. Our approach is different from the previous methods in terms of the method for measuring uncertainty (aleatoric uncertainty with multiple quantile regression) and the way in which uncertainty is used for anomaly scoring (Mahalanobis distance-based anomaly score).
Proposed Methodology
In this section, we describe the proposed QAE model and AA technique in detail.
A. Quantile Autoencoder for Anomaly Detection
The aleatoric uncertainty is utilized for AD through the QAE. The basic concept is that the reconstruction of normal samples produces stable outputs within a certain range of variation for each input channel; in other words, normal cases reconstructed from normal-fitted latent features are likely to have lower variability than abnormal cases, because the loss function induces the output variance to be minimized on normal data. This consistency of normal samples holds independently for each channel. To leverage aleatoric uncertainty, we propose a QAE that predicts multiple quantiles with a single neural network and uses the range between the upper and lower quantiles as the degree of uncertainty. The anomaly score is then derived via the Mahalanobis distance from both the reconstruction error and the uncertainty term, as illustrated in Fig. 1.
Structure of the QAE for AD. The QAE predicts the median value and two quantiles. The reconstruction error and the aleatoric uncertainty are used for anomaly scoring.
1) Quantile Autoencoder
The basic AE performs a single-value reconstruction, corresponding to the mean of a Gaussian distribution, by minimizing the mean squared error (MSE). The proposed QAE is a variant of the AE that predicts different quantiles of the output distribution by minimizing a sum of pinball losses. Thus, the QAE performs multiple quantile regressions through a single neural network, which can be regarded as multi-task learning.
Let $Q_{enc}$ and $Q_{dec}$ denote the encoder and decoder of the QAE. For an input $x$,
\begin{align*} Q_{enc}(x)&=z, \\ Q_{dec}(z)&=[\hat {x}_{\tau _{l}},\hat {x}_{\tau _{m}},\hat {x}_{\tau _{u}}]=\hat {\mathbf {x}}_\tau,\tag{1}\end{align*}
where $z$ is the latent feature and $\hat {x}_{\tau _{l}}$, $\hat {x}_{\tau _{m}}$, and $\hat {x}_{\tau _{u}}$ are the predicted lower, median, and upper quantiles, respectively.
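As an illustration of the input-output structure in (1), a QAE could be sketched in PyTorch as follows; the layer widths, activations, and latent dimension here are placeholders rather than the configuration used in the experiments.

```python
import torch
import torch.nn as nn

class QuantileAutoencoder(nn.Module):
    """Autoencoder whose decoder outputs three quantiles per input channel, as in (1)."""
    def __init__(self, in_dim: int, latent_dim: int = 8, hidden: int = 64):
        super().__init__()
        self.in_dim = in_dim
        self.encoder = nn.Sequential(        # Q_enc
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )
        self.decoder = nn.Sequential(        # Q_dec, final layer widened to 3 * in_dim
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 * in_dim),
        )

    def forward(self, x):
        z = self.encoder(x)                              # Q_enc(x) = z
        out = self.decoder(z).view(-1, 3, self.in_dim)
        x_low, x_med, x_up = out[:, 0], out[:, 1], out[:, 2]
        return x_low, x_med, x_up                        # lower, median, upper quantiles
```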
Quantile regression can be performed by minimizing the pinball loss [53]. For a target quantile $\tau$, the pinball loss is defined as
\begin{align*} L_\tau (x,\hat {x}_\tau) = \begin{cases} \tau (x-\hat {x}_\tau) & \text {if $x \geq \hat {x}_\tau $} \\ (1-\tau)(\hat {x}_\tau -x) & \text {if $x < \hat {x}_\tau $.} \end{cases}\tag{2}\end{align*}
The QAE is trained by minimizing the sum of the pinball losses for the three target quantiles:
\begin{equation*} L_{Q}(x,\hat {\mathbf {x}}_\tau) = L_{\tau _{l}}(x,\hat {x}_{\tau _{l}})+L_{\tau _{m}}(x,\hat {x}_{\tau _{m}})+L_{\tau _{u}}(x,\hat {x}_{\tau _{u}}).\tag{3}\end{equation*}
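A minimal implementation of the pinball loss (2) and the training objective (3) is sketched below; the quantile levels in the default argument are illustrative placeholders, not the values used in our experiments.

```python
import torch

def pinball_loss(x: torch.Tensor, x_hat: torch.Tensor, tau: float) -> torch.Tensor:
    """Pinball (quantile) loss of (2), averaged over elements."""
    diff = x - x_hat
    return torch.mean(torch.maximum(tau * diff, (tau - 1.0) * diff))

def qae_loss(x, x_low, x_med, x_up, taus=(0.1, 0.5, 0.9)) -> torch.Tensor:
    """Sum of pinball losses for the lower, median, and upper quantiles, as in (3)."""
    return (pinball_loss(x, x_low, taus[0])
            + pinball_loss(x, x_med, taus[1])
            + pinball_loss(x, x_up, taus[2]))
```

In the semi-supervised setting described above, this loss would be minimized over normal training samples only.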
2) Quantile-Based Anomaly Scoring
In the proposed approach, anomaly scores are derived from the reconstruction error and the aleatoric uncertainty term, which is the range between the predicted upper and lower quantiles. From the output of the QAE, these are computed channel-wise as
\begin{align*} \epsilon _{rec}&=x - \hat {x}_{\tau _{m}}, \\ \epsilon _{unc}&=\hat {x}_{\tau _{u}}-\hat {x}_{\tau _{l}}.\tag{4}\end{align*}
With $\epsilon$ denoting the concatenation of $\epsilon_{rec}$ and $\epsilon_{unc}$, and $d_{\epsilon_{rec}}$ and $d_{\epsilon}$ the dimensions of the respective error vectors, the reconstruction-error score and the quantile-based score are
\begin{align*} A_{r}(x)&=||\epsilon _{rec}||^{2}/d_{\epsilon _{rec}}, \tag{5}\\ A_{q}(x)&=||\epsilon ||^{2}/d_{\epsilon }.\tag{6}\end{align*}
To account for the different scales and correlations of the error channels, we further define a Mahalanobis distance-based score
\begin{equation*} A_{nq}(x)=(\epsilon -\mu)S^{-1}(\epsilon -\mu)^{T},\tag{7}\end{equation*}
where $\mu$ and $S$ are the mean vector and covariance matrix of $\epsilon$ estimated on normal data.
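The scoring functions (5)–(7) can be sketched with NumPy as below; we assume, consistently with (8), that $\mu$ and $S$ are estimated from the error vectors of normal training or validation data.

```python
import numpy as np

def error_vector(x, x_low, x_med, x_up):
    """Channel-wise reconstruction error and uncertainty term of (4), concatenated."""
    eps_rec = x - x_med
    eps_unc = x_up - x_low
    return np.concatenate([eps_rec, eps_unc], axis=-1)

def fit_normal_statistics(eps_normal):
    """Mean vector and covariance matrix of error vectors on normal data."""
    return eps_normal.mean(axis=0), np.cov(eps_normal, rowvar=False)

def mse_score(eps):
    """Dimension-normalized squared error, as in (5) and (6)."""
    return np.mean(eps ** 2, axis=-1)

def mahalanobis_score(eps, mu, S):
    """Mahalanobis distance-based anomaly score A_nq of (7)."""
    diff = eps - mu
    S_inv = np.linalg.pinv(S)  # pseudo-inverse for numerical stability
    return np.einsum('...i,ij,...j->...', diff, S_inv, diff)
```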
B. Abnormality Accumulation
In addition to the QAE, we propose an anomaly scoring technique, namely abnormality accumulation (AA), to further improve AD performance. The basic concept of AA is to calculate the anomaly score from errors aggregated over recursive reconstructions with a single QAE. Let the superscript $i$ denote the $i$-th recursive reconstruction, in which the reconstructed output of the previous step is fed into the network again, and let $[\epsilon^{i}]_{i=1}^{N}$ denote the concatenation of the error vectors over $N$ recursions. The AA score is then
\begin{equation*} A_{nqa}(x)=([\epsilon ^{i}]_{i=1}^{N}-\mu _{A})S^{-1}_{A}([\epsilon ^{i}]_{i=1}^{N}-\mu _{A})^{T},\tag{8}\end{equation*}
where $\mu_{A}$ and $S_{A}$ are the mean vector and covariance matrix of the concatenated errors estimated on normal data.
Algorithm 1 Abnormality Accumulation
Input: input sample $x$, trained QAE $(Q_{enc}, Q_{dec})$, number of recursions $N$
set initial value $x^{1} = x$
for $i = 1$ to $N$ do
obtain $\hat{\mathbf{x}}_\tau^{i} = Q_{dec}(Q_{enc}(x^{i}))$
compute $\epsilon^{i} = [x^{i} - \hat{x}_{\tau_{m}}^{i},\; \hat{x}_{\tau_{u}}^{i} - \hat{x}_{\tau_{l}}^{i}]$
concatenate $\epsilon^{i}$ to the accumulated errors and set $x^{i+1} = \hat{x}_{\tau_{m}}^{i}$
end for
aggregate $[\epsilon^{i}]_{i=1}^{N}$
compute the anomaly score using the channel-wise mean $\mu_{A}$ and covariance $S_{A}$ as in (8)
Output: anomaly score with AA, $A_{nqa}(x)$
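The following Python sketch reflects one possible reading of Algorithm 1: the median reconstruction is fed back into the QAE $N$ times, the error vectors of (4) are concatenated over the recursions, and the Mahalanobis score (8) is computed with $\mu_{A}$ and $S_{A}$ estimated on normal data. The helper `qae_predict` is a hypothetical wrapper around a trained QAE.

```python
import numpy as np

def abnormality_accumulation(x, qae_predict, n_recursions, mu_A, S_A):
    """Anomaly score A_nqa(x) of (8) from recursive reconstructions.

    qae_predict(x) is assumed to return (x_low, x_med, x_up) for a 1-D sample x;
    mu_A and S_A are the mean and covariance of the concatenated errors on normal data.
    """
    errors, x_i = [], x
    for _ in range(n_recursions):
        x_low, x_med, x_up = qae_predict(x_i)
        errors.append(np.concatenate([x_i - x_med, x_up - x_low]))  # error vector of (4)
        x_i = x_med                            # feed the median reconstruction back in
    eps_all = np.concatenate(errors)           # [eps^1, ..., eps^N]
    diff = eps_all - mu_A
    return diff @ np.linalg.pinv(S_A) @ diff   # Mahalanobis score of (8)
```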
Illustration of AA. Abnormality becomes accumulated by aggregating the error vectors $\epsilon^{i}$ over recursive reconstructions.
C. Anomaly Source Diversification
As previously stated, the use of the QAE and AA with the Mahalanobis distance in the proposed DAD method is supported by the concept of anomaly source diversification, which is based on the following proposition.
Proposition 1:
For $k$ independent error sources whose values on normal and abnormal data follow zero-mean Gaussian distributions with variances $\sigma_{n}^{2} < \sigma_{a}^{2}$, the mean squared error of each class follows a gamma distribution, and the overlap between the two score distributions decreases as $k$ increases.
Proof:
Let $\theta^{*}$ denote the network parameters that minimize the expected reconstruction error over the normal data $X_{n}$:
\begin{equation*} \theta ^{*} = \mathop {\mathrm {arg\,min}} _\theta \mathbb {E}_{X_{n}}[||x-Q_\theta (x)||^{2}]. \tag{9}\end{equation*}
If we consider the multivariate case of $k$ independent zero-mean Gaussian error sources, the mean squared error of each class follows a gamma distribution (a scaled chi-squared distribution) whose relative dispersion shrinks as $k$ grows; since the error variance on normal data under $\theta^{*}$ is smaller than that on abnormal data, the two gamma distributions concentrate around different means and their overlap decreases with $k$.
Fig. 3 shows the changes in the anomaly score distribution according to different numbers of error sources $k$.
Difference between gamma distributions of normal and abnormal data based on the increase in the number of error sources $k$.
The above proposition emphasizes the importance of obtaining as many independent error sources as possible when calculating the anomaly score.
Note that Proposition 1 assumes the ideal case of zero-mean Gaussian error distributions for both normal and abnormal data. Although normally distributed empirical errors have been reported in the literature [55], the assumption of zero-mean Gaussianity on untrained abnormal data can be questioned, because empirical errors are more likely to be skewed and biased. We therefore analyze the empirical error distributions of the real-world datasets used in this study. Fig. 4 shows histograms of the reconstruction errors on abnormal data together with Gaussian mixture model fits. The Gaussian mixture model fits the empirical error distributions well, and we further generalize Proposition 1 to Gaussian distributions with arbitrary mean and variance as follows.
Error distributions with Gaussian mixture model fitting results: (a) MI-F, (b) SNSR.
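A mixture fit like the one shown in Fig. 4 can be obtained along the following lines; the number of mixture components here is an illustrative choice, not the value used for the figure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_error_gmm(errors: np.ndarray, n_components: int = 3) -> GaussianMixture:
    """Fit a Gaussian mixture to one-dimensional reconstruction errors, as in Fig. 4."""
    gmm = GaussianMixture(n_components=n_components, random_state=0)
    return gmm.fit(errors.reshape(-1, 1))
```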
Proposition 2:
For error sources following Gaussian distributions with arbitrary mean and variance, the conclusion of Proposition 1 still holds: the overlap between the anomaly score distributions of normal and abnormal data decreases as the number of error sources $k$ increases.
Proof:
For random variables $Y_{i}=\sigma_{y}X_{i}$ with $X_{i}\sim\mathcal{N}(0,1)$, obtained by centering the Gaussian error sources as in the Mahalanobis-based scores, we have
\begin{equation*} \frac {1}{k}\sum _{i=1}^{k} Y_{i}^{2} =\frac {\sigma _{y}^{2}}{k}\sum _{i=1}^{k} X_{i}^{2} = \frac {\sigma _{y}^{2}}{k} Z,\tag{10}\end{equation*}
where $Z=\sum_{i=1}^{k}X_{i}^{2}$ follows a chi-squared distribution with $k$ degrees of freedom; hence the mean squared error follows a gamma distribution whose dispersion relative to its mean decreases as $k$ grows.
Thus, increasing the number of error sources $k$ reduces the relative spread of the score distributions for both normal and abnormal data, and the two distributions become more separable.
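The behavior stated in Propositions 1 and 2 can also be checked numerically. The sketch below draws zero-mean Gaussian errors with a smaller standard deviation for normal than for abnormal data (the particular $\sigma$ values are illustrative only) and shows that the AUROC of the mean-squared-error score approaches 1 as the number of error sources $k$ grows.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def mse_scores(sigma: float, k: int, n_samples: int = 5000) -> np.ndarray:
    """Mean squared error over k zero-mean Gaussian error sources (a scaled chi-squared)."""
    errors = rng.normal(0.0, sigma, size=(n_samples, k))
    return np.mean(errors ** 2, axis=1)

for k in (1, 4, 16, 64):
    s_normal = mse_scores(sigma=1.0, k=k)    # smaller error scale on normal data
    s_abnormal = mse_scores(sigma=1.5, k=k)  # larger error scale on abnormal data
    labels = np.r_[np.zeros(len(s_normal)), np.ones(len(s_abnormal))]
    auroc = roc_auc_score(labels, np.r_[s_normal, s_abnormal])
    print(f"k={k:3d}  AUROC={auroc:.3f}")    # separability grows with k
```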
Experiments
To verify the effectiveness of the proposed methodology, we compare AUROC scores, a well-known evaluation metric for classification models. In binary classification, a perfect classifier has an AUROC of 1, and a uniformly random classifier has an AUROC of 0.5. In the experiments, we follow the verification framework and datasets presented in [39].
A. Datasets and Problem Settings
We compare the results on five datasets: MI-F, MI-V, EOPT, RARM (binary class), and SNSR (multi-class). Descriptions of the datasets are given in Table 2. For the experiments, we set two normality conditions: unimodal normality, in which there is a single normal class, and multimodal normality, in which the normal dataset is composed of multiple normal classes. The binary-class datasets have a labeled normal class, so their experiments are performed with unimodal normality. The multi-class dataset has no explicit normal label, so we designate a target class that is considered normal under unimodal normality and abnormal under multimodal normality (the remaining classes then become normal; in the following tables, MM abbreviates multimodal normality). We report results averaged over the different target classes. For the training-test split, a randomly selected 60% of the normal-class data is used for training, and each half of the remaining data is used for the validation and normal test sets, respectively. All input features are normalized with z-score normalization.
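A sketch of this data preparation, with hypothetical array and function names, is given below; fitting the z-score statistics on the training split only is our assumption, as the normalization details are not spelled out above.

```python
import numpy as np

def prepare_splits(x_normal: np.ndarray, seed: int = 0):
    """60/20/20 split of the normal class into train / validation / normal-test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x_normal))
    n_train = int(0.6 * len(x_normal))
    n_val = (len(x_normal) - n_train) // 2
    train = x_normal[idx[:n_train]]
    val = x_normal[idx[n_train:n_train + n_val]]
    test_normal = x_normal[idx[n_train + n_val:]]

    # z-score normalization; statistics taken from the training split (our assumption)
    mu, sigma = train.mean(axis=0), train.std(axis=0) + 1e-8

    def normalize(a: np.ndarray) -> np.ndarray:
        return (a - mu) / sigma

    return normalize(train), normalize(val), normalize(test_normal), normalize
```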
B. Network Structure and Experimental Setup
We build the QAE with the same backbone network structure as in [39], except for the final layer and the loss function, which are modified to predict multiple quantiles.
We evaluated the proposed methodology in two steps for the ablation study. First, we compared the QAE with different anomaly score settings ($A_{r}$, $A_{q}$, and $A_{nq}$); second, we examined the effect of applying AA.
C. Experimental Results
Table 3 compares the mean and standard deviation of the AUROC results from the QAE with different scoring functions. Utilizing uncertainty terms without normalization (
Table 4 summarizes the AUROC of
Finally, Table 5 compares the proposed methods with other AD methods. The characteristics of the anomalies differ across datasets, so it is difficult to find a single model that outperforms the others in all cases. However, the proposed QAE-AA shows the best AUROC score in four out of six datasets: MI-V, RARM, SNSR,
This comparison with various AD methodologies verifies the effectiveness of the proposed methodology and shows that the concept of anomaly source diversification can be embodied by utilizing aleatoric uncertainty and recursive reconstruction errors.
D. Visualization on Anomaly Score Sources
Fig. 5 shows examples of the QAE output (i.e., square of anomaly sources
Examples of squared anomaly sources
E. Analysis on Score Distribution
Fig. 6 shows the changes in the anomaly score distributions for the MI datasets. The anomaly score distributions of normal and abnormal samples are shown as blue and orange histograms, respectively. Although the shapes do not perfectly match the ideal cases presented in Fig. 3, the overlapping region between the normal and abnormal distributions is reduced when the QAE and AA are applied; therefore, a higher AUROC score can be achieved by the proposed methodology.
Examples of anomaly score distribution of normal and abnormal data: (a) MI-F, (b) MI-V.
Conclusion
In this research, we investigated the concept of anomaly source diversification and proposed the QAE network and the AA technique, whose effectiveness is verified with experiments on real-world datasets. Anomaly source diversification is based on the idea that diversifying the error sources used to calculate the anomaly score improves anomaly detection performance. We provide a theoretical background for this by showing that, under the assumption of Gaussian errors, the distributions of the mean-squared-error anomaly scores on normal and abnormal data move farther apart as the number of error sources increases.
To this end, we proposed a QAE that produces not only the median but also additional quantiles, so that aleatoric uncertainty can be leveraged as an additional error source for anomaly scoring. Outputs reconstructed from abnormal samples are likely to have larger channel-wise uncertainty than those from normal samples, just as with the reconstruction errors. In addition, we introduced the AA technique, which aggregates the errors of recursive reconstructions and then calculates the anomaly score using the Mahalanobis distance. As the dimension of the error vector increases with the recursion, the difference between the anomaly score distributions of normal and abnormal samples becomes more apparent. The effectiveness of the proposed QAE-AA is verified on various real-world datasets: QAE-AA obtained the highest AUROC score in four out of six datasets and achieved, on average, a 4% to 23% higher AUROC score. These experimental results show that the proposed methodology can improve AD performance.
Recent works [44], [62] reported notable AD performance on some benchmark datasets by utilizing adversarial examples and time-series AD settings, which could additionally be applied to the proposed QAE-AA framework. In this regard, our future research will move toward overcoming such limitations and further improving AD performance; for example, AD performance on image data could be enhanced by additionally considering epistemic uncertainty or latent-space errors.
ACKNOWLEDGMENT
An earlier version of this paper was presented in part at the Workshop on AI for Design and Manufacturing (ADAM) during the 36th AAAI Conference on Artificial Intelligence (AAAI-22), in February 2022 [63].