I. Introduction
As cloud infrastructure becomes more complex, technology for detecting and recovering from faults grows increasingly important. Various frameworks have been proposed for fault detection and recovery in cloud environments. Existing fault detection methods fall largely into two categories: system log analysis [1][2] and threshold-based alarm generation [3][4][5][6][7][8]. Fault detection based on system log analysis requires knowledge of the log keywords associated with each specific fault. In addition, log data contains substantial noise, so data processing time and computational cost are very high. Threshold-based fault detection is the method most commonly used in current cloud systems; the open-source monitoring tools Prometheus [9] and Zabbix [10] also use threshold-based event triggering. However, this method requires the user to set an appropriate threshold for each metric manually, which in turn requires knowledge of the correlation between each metric and the fault. Such knowledge depends on empirical experience or on benchmarking results for a specific workload, which makes it difficult for a system administrator to find appropriate threshold values for a given situation and incurs enormous follow-up cost. These costs rise rapidly as the size and complexity of the infrastructure increase. To address this problem, machine learning-based fault detection is being actively studied. Fault detection using machine learning has low dependence on domain knowledge such as the correlation between metrics and faults. Moreover, because it considers various metrics comprehensively, its accuracy is high, and it can potentially be extended to fault prediction and fault cause analysis.
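The manual-tuning burden of threshold-based alarm generation can be illustrated with a minimal sketch. The metric names and threshold values below are illustrative assumptions, not values from any real deployment; in practice, each limit in such a table must be chosen per metric and per workload, which is exactly the maintenance cost described above.

```python
# Minimal sketch of threshold-based alarm generation. The metric names and
# limits are illustrative assumptions; real monitoring tools such as
# Prometheus evaluate comparable rules against live metric streams.
THRESHOLDS = {
    "cpu_usage_percent": 90.0,
    "memory_usage_percent": 85.0,
    "network_errors_per_sec": 50.0,
}

def check_thresholds(sample: dict) -> list:
    """Return the metrics in `sample` whose values exceed their static limit."""
    return [name for name, limit in THRESHOLDS.items()
            if sample.get(name, 0.0) > limit]

sample = {"cpu_usage_percent": 97.2,
          "memory_usage_percent": 60.1,
          "network_errors_per_sec": 3.0}
alerts = check_thresholds(sample)  # only the CPU metric crosses its limit
```

Each entry in the threshold table must be re-tuned whenever the workload changes, whereas a learned detector considers the metrics jointly.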
In this paper, we define three fault types, CPU Fault, Memory Fault, and Network Fault, generate fault conditions by injecting the defined faults, and collect multi-layer data under these fault conditions using Prometheus. We train machine learning models, namely RF (Random Forest), SVM (Support Vector Machine), and DNN (Deep Neural Network), on the collected data to detect faults in cloud infrastructure. In a supervised detection model, the number of features can strongly influence detection accuracy: many overlapping or highly similar features can lead to negative results such as overfitting, while too few features can reduce accuracy. Therefore, this paper introduces various feature engineering techniques and analyzes detection accuracy according to the feature analysis model. By analyzing the results of combinations of various feature analysis models, we present a feature elimination method that increases the accuracy of fault detection in cloud infrastructure.
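One simple instance of the feature-elimination idea motivated above is dropping one feature from each pair of highly correlated features before training, so that redundant metrics do not inflate the feature set. The sketch below is an assumed illustration of that general technique, not the paper's exact procedure; the 0.95 cutoff and function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def drop_correlated(features: dict, cutoff: float = 0.95) -> list:
    """Keep one representative from each group of features whose pairwise
    absolute correlation exceeds `cutoff`; return the kept feature names."""
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept

# A duplicated CPU metric is eliminated; an uncorrelated memory metric survives.
features = {"cpu": [1, 2, 3, 4],
            "cpu_copy": [2, 4, 6, 8],   # perfectly correlated with "cpu"
            "mem": [4, 1, 3, 2]}
selected = drop_correlated(features)  # -> ["cpu", "mem"]
```

Reducing redundancy this way addresses the overfitting risk of overlapping features while retaining enough distinct signals to keep accuracy high.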