FSFP: A Fine-Grained Online Service System Performance Fault Prediction Method Based on Cross-attention | IEEE Conference Publication | IEEE Xplore


Abstract:

An online service system may experience various performance faults during operation. Detecting and locating these faults only after they occur can degrade the user experience and lead to significant losses, so it is necessary to predict faults before they occur. Existing fault prediction methods typically predict only the possibility of a fault, without providing finer-grained predictions such as the fault type, which makes troubleshooting more difficult for developers. In this paper, we propose a fine-grained fault prediction method called FSFP, which predicts not only the possibility of a fault but also the type of fault that may occur. The method first collects performance monitoring metrics from the running system under two conditions: normal operation and abnormal operation. It then uses cross-attention to capture the interdependencies between these two types of monitoring metrics and builds a multi-label classification model on top of them. We evaluated FSFP by injecting faults into a benchmark microservice system. For predicting the possibility of a fault, FSFP achieved a precision of 0.999, a recall of 0.998, and an F1 score of 0.999. For predicting the fault type, FSFP achieved an exact match ratio of 0.955 and a Hamming loss of 0.017. Across six specific fault types, FSFP achieved the best F1 score on four.
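The two multi-label metrics reported above can be illustrated with a minimal sketch. It uses the standard definitions of exact match ratio (the fraction of samples whose full label set is predicted correctly) and Hamming loss (the fraction of individual label slots predicted incorrectly). The function names and the toy label matrices are assumptions made for illustration; they are not taken from the paper or its data.

```python
from typing import Sequence

LabelMatrix = Sequence[Sequence[int]]  # one row per sample, one 0/1 entry per fault type


def exact_match_ratio(y_true: LabelMatrix, y_pred: LabelMatrix) -> float:
    """Fraction of samples whose predicted label vector equals the true one."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)


def hamming_loss(y_true: LabelMatrix, y_pred: LabelMatrix) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(len(t) for t in y_true)
    wrong = sum(
        1
        for t, p in zip(y_true, y_pred)
        for t_i, p_i in zip(t, p)
        if t_i != p_i
    )
    return wrong / total


# Hypothetical example: 4 samples, 3 fault types (illustrative values only).
y_true = [[1, 0, 0], [0, 1, 1], [0, 0, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 0, 0], [1, 1, 0]]

print(exact_match_ratio(y_true, y_pred))  # 3 of 4 rows match exactly
print(hamming_loss(y_true, y_pred))       # 1 of 12 label slots is wrong
```

A single wrong label in one sample costs a full sample under exact match ratio but only one slot under Hamming loss, which is why the two metrics are usually reported together.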
Date of Conference: 04-07 December 2023
Date Added to IEEE Xplore: 02 April 2024
Conference Location: Seoul, Korea, Republic of

I. Introduction

Software systems support many aspects of our daily work and life. However, performance faults such as slow response times are inevitable while these systems are running. Once such faults occur, they can significantly reduce the system's availability and reliability, resulting in financial losses. For example, according to a recent survey [1], the average cost per hour of server downtime is between $301,000 and $400,000. One way to minimize the losses caused by performance faults is remediation after a fault has occurred [2]–[7]. However, predicting and identifying potential risks before faults happen, and taking preventive measures, can directly prevent service unavailability. Therefore, many engineers have conducted research in this area.
