I. Introduction
As cloud infrastructure becomes more complex, technology for detecting and recovering from faults grows increasingly important. Various frameworks have been proposed for fault detection and recovery in cloud environments. Existing fault detection methods fall largely into two categories: system log analysis [1][2] and threshold-based alarm generation [3][4][5][6][7][8]. Fault detection based on system log analysis requires knowledge of the log keywords associated with each specific fault. In addition, log data contains substantial noise, so data processing time and computational cost are very high. Threshold-based fault detection is the method most commonly used in current cloud systems; the open-source monitoring tools Prometheus [9] and Zabbix [10] also use threshold-based event triggering. However, this method requires the user to set an appropriate threshold for each metric manually, which in turn requires knowledge of the correlation between each metric and the fault. Such knowledge depends on empirical experience or on benchmarking results for a specific workload, which makes it difficult for a system administrator to find appropriate threshold values for a given situation and incurs enormous follow-up cost. These costs rise rapidly as the size and complexity of the infrastructure increase. To address this problem, machine learning-based fault detection is being actively studied. Fault detection using machine learning has low dependence on domain knowledge such as the correlation between metrics and faults. Moreover, because it considers various metrics comprehensively, its accuracy is high, and it can potentially be extended to fault prediction and fault cause analysis.
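The manual-tuning burden of threshold-based alarm generation can be illustrated with a minimal sketch. The metric names and threshold values below are illustrative assumptions, not values from any real deployment; in practice, each limit in such a table must be chosen per metric and per workload, which is exactly the maintenance cost described above.

```python
# Minimal sketch of threshold-based alarm generation. The metric names and
# limits are illustrative assumptions; real monitoring tools such as
# Prometheus evaluate comparable rules against live metric streams.
THRESHOLDS = {
    "cpu_usage_percent": 90.0,
    "memory_usage_percent": 85.0,
    "network_errors_per_sec": 50.0,
}

def check_thresholds(sample: dict) -> list:
    """Return the metrics in `sample` whose values exceed their static limit."""
    return [name for name, limit in THRESHOLDS.items()
            if sample.get(name, 0.0) > limit]

sample = {"cpu_usage_percent": 97.2,
          "memory_usage_percent": 60.1,
          "network_errors_per_sec": 3.0}
alerts = check_thresholds(sample)  # only the CPU metric crosses its limit
```

Each entry in the threshold table must be re-tuned whenever the workload changes, whereas a learned detector considers the metrics jointly.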
In this paper, we define three fault types, CPU Fault, Memory Fault, and Network Fault, generate fault conditions by injecting the defined faults, and collect multi-layer data under these fault conditions using Prometheus. We train machine learning models, namely RF (Random Forest), SVM (Support Vector Machine), and DNN (Deep Neural Network), on the collected data to detect faults in cloud infrastructure. In a supervised detection model, the number of features can strongly influence detection accuracy: many overlapping or highly similar features can lead to negative results such as overfitting, while too few features can reduce accuracy. Therefore, this paper introduces various feature engineering techniques and analyzes detection accuracy according to the feature analysis model. By analyzing the results of combinations of various feature analysis models, we present a feature elimination method that increases the accuracy of fault detection in cloud infrastructure.
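One simple instance of the feature-elimination idea motivated above is dropping one feature from each pair of highly correlated features before training, so that redundant metrics do not inflate the feature set. The sketch below is an assumed illustration of that general technique, not the paper's exact procedure; the 0.95 cutoff and function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def drop_correlated(features: dict, cutoff: float = 0.95) -> list:
    """Keep one representative from each group of features whose pairwise
    absolute correlation exceeds `cutoff`; return the kept feature names."""
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept

# A duplicated CPU metric is eliminated; an uncorrelated memory metric survives.
features = {"cpu": [1, 2, 3, 4],
            "cpu_copy": [2, 4, 6, 8],   # perfectly correlated with "cpu"
            "mem": [4, 1, 3, 2]}
selected = drop_correlated(features)  # -> ["cpu", "mem"]
```

Reducing redundancy this way addresses the overfitting risk of overlapping features while retaining enough distinct signals to keep accuracy high.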