Detecting Outlier Machine Instances Through Gaussian Mixture Variational Autoencoder With One Dimensional CNN


Abstract:

Today's large datacenters house a massive number of machines, each of which is closely monitored with multivariate time series (e.g., CPU idle, memory utilization) to ensure service quality. Detecting outlier machine instances with multivariate time series is crucial for service management. However, it is a challenging task because multivariate time series have multiple classes and various shapes, are high-dimensional, and lack labels. In this article, we propose DOMI, a novel unsupervised model that combines Gaussian mixture VAE with 1D-CNN, to detect outlier machine instances. Its core idea is to capture the normal patterns of machine instances by learning latent representations that consider their shape characteristics, reconstruct the input data from the learned representations, and apply reconstruction probabilities to determine outliers. Moreover, DOMI interprets a detected outlier instance based on the reconstruction probability changes of its univariate time series. Extensive experiments have been conducted on a dataset collected over a 1.5-month period from 1821 machines deployed in ByteDance, a top global content service provider. DOMI achieves the best F1-Score of 0.94 and AUC score of 0.99, significantly outperforming the best performing baseline method by 0.08 and 0.03, respectively. Moreover, its interpretation accuracy is up to 0.93.
Published in: IEEE Transactions on Computers ( Volume: 71, Issue: 4, 01 April 2022)
Page(s): 892 - 905
Date of Publication: 09 March 2021


1 Introduction

Modern large datacenters usually deploy hundreds of thousands to millions of machines, including physical servers, virtual machines, and dockers, to support diverse types of Internet-based services [1], [2]. About 5% to 18% of these machines suffer from software bugs and/or hardware failures per year ([Online]. Available: https://www.statista.com/statistics/430769/annual-failure-rates-of-servers/). These unexpected failures may cause data loss and resource congestion because machines become unavailable [3], which can heavily degrade the quality of service (QoS) and reduce revenue [4]. Therefore, operation engineers carefully monitor machine metrics, such as CPU idle, memory utilization, and TCP retransmission rate, to obtain a global view of each machine's status [5]. The monitoring data of each metric forms a univariate time series, and thus each machine can be represented as an entity with multivariate time series [6], [7].
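As a rough illustration of this representation, the sketch below stores one machine's monitoring window as a set of named univariate series and stacks them into a metrics-by-time matrix. The class name `MachineInstance`, the metric keys, the machine identifier, and all numeric values are hypothetical and chosen only for illustration; they are not taken from the paper or its dataset.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MachineInstance:
    """One machine over one observation window (hypothetical container)."""
    machine_id: str
    metrics: Dict[str, List[float]]  # metric name -> univariate time series

    def as_matrix(self) -> List[List[float]]:
        """Return rows = metrics (sorted by name), columns = time steps."""
        lengths = {len(series) for series in self.metrics.values()}
        if len(lengths) != 1:
            raise ValueError("all metrics must cover the same time steps")
        return [self.metrics[name] for name in sorted(self.metrics)]


# Made-up sample values: three metrics observed at four time steps.
instance = MachineInstance(
    machine_id="machine-0001",
    metrics={
        "cpu_idle": [0.91, 0.88, 0.35, 0.90],
        "memory_utilization": [0.42, 0.43, 0.47, 0.44],
        "tcp_retransmission_rate": [0.01, 0.01, 0.09, 0.01],
    },
)
matrix = instance.as_matrix()
shape = (len(matrix), len(matrix[0]))  # (number of metrics, number of time steps)
```

Each row of `matrix` is one univariate time series, so the whole matrix is the multivariate time series describing the machine during that window.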

