1 Introduction
Modern large datacenters usually deploy hundreds of thousands to millions of machines, including physical servers, virtual machines, and Docker containers, to support diverse types of Internet-based services [1], [2]. About 5% to 18% of these machines suffer from software bugs and/or hardware failures per year (see https://www.statista.com/statistics/430769/annual-failure-rates-of-servers/). Unexpected failures may cause data loss and resource congestion due to machines becoming unavailable [3], which can heavily degrade the quality of service (QoS) and reduce revenue [4]. Therefore, operations engineers carefully monitor machine metrics, such as CPU idle time, memory utilization, and TCP retransmission rate, to obtain a global view of each machine's status [5]. The monitoring data of each metric forms a univariate time series, and thus each machine can be represented as an entity with a multivariate time series [6], [7].
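To make the data representation concrete, the following minimal sketch (not part of the cited works) illustrates how a machine's monitoring data can be organized: each monitored metric is a univariate time series, and stacking the metrics over a shared time index yields the multivariate time series that represents one machine (entity). The metric names, sampling interval, and synthetic values are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# One day of observations at 1-minute granularity (illustrative choice).
timestamps = pd.date_range("2024-01-01", periods=1440, freq="min")

# Each column is one monitored metric, i.e., a univariate time series.
machine_entity = pd.DataFrame(
    {
        "cpu_idle": rng.uniform(20, 90, size=len(timestamps)),            # percent
        "memory_utilization": rng.uniform(30, 80, size=len(timestamps)),  # percent
        "tcp_retransmission_rate": rng.exponential(0.01, size=len(timestamps)),
    },
    index=timestamps,
)

# The whole DataFrame is the multivariate time series for one machine:
# shape = (number of observations, number of metrics).
print(machine_entity.shape)  # (1440, 3)
```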