MLPing: Real-Time Proactive Fault Detection and Alarm for Large-Scale Distributed IDC Network | IEEE Conference Publication | IEEE Xplore

Abstract:


By providing inexpensive rack and network hosting services, third-party internet data centers (IDCs) have gained significant popularity among cloud service providers. Monitoring IDC network quality in real time and raising alarms proactively are crucial to guaranteeing the reliability of cloud services. The prevailing approach to this problem uses active probes and evaluates network quality based on the results of single-link or multi-link probing. However, existing efforts still tend to generate a significant number of unnecessary alerts, resulting in enormous operational costs. To address this, we first build a large-scale distributed ping-based dial test system that monitors the quality of the IDC network in a many-to-one probe mode. We develop an efficient exporter tool based on the standard Prometheus data interface to ensure real-time and precise collection of measurement data. To detect potential network issues quickly and accurately, we also design a multi-step heuristic-based fault detection and alarm method. Furthermore, we propose a comprehensive alarm life-cycle model based on the results of multi-link probing to guide alarm management in production practice. The system has been deployed in the production environment of Sangfor's managed cloud for over a year, enabling proactive diagnosis of hundreds of IDC gateway IP addresses. Production statistics indicate a significant improvement in the mean time to repair (MTTR) for IDC network failures, from a few hours to just a few minutes. The system generates fewer than 15 alarms per day on average, approximately 85% fewer than before. The alarm accuracy exceeds 95% and the false negative rate is less than 2%.
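
The abstract does not spell out the detection heuristics, so the following is only an illustrative sketch of how many-to-one probing can suppress unnecessary alerts: a target is flagged only when a sufficient share of probe sources observe packet loss in the same window. The `ProbeResult` type, the `faulty_targets` function, the probe names, and both thresholds are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """One ping measurement window (hypothetical record layout)."""
    source: str    # probe node that sent the pings
    target: str    # IDC gateway IP address being probed
    sent: int      # packets sent in the window
    received: int  # packets answered in the window

    @property
    def loss_ratio(self) -> float:
        # Treat a window with no packets sent as fully lost.
        return 1.0 - self.received / self.sent if self.sent else 1.0


def faulty_targets(results, loss_threshold=0.2, min_bad_fraction=0.5):
    """Return targets that enough probe sources see as lossy.

    Both thresholds are assumed values chosen for illustration only.
    """
    by_target = {}
    for r in results:
        by_target.setdefault(r.target, []).append(r)

    flagged = []
    for target, probes in by_target.items():
        bad = sum(1 for p in probes if p.loss_ratio > loss_threshold)
        # A single lossy source is more likely a probe-side problem
        # than a fault at the target, so require agreement.
        if bad / len(probes) >= min_bad_fraction:
            flagged.append(target)
    return sorted(flagged)


results = [
    ProbeResult("probe-a", "192.0.2.1", 10, 10),
    ProbeResult("probe-b", "192.0.2.1", 10, 3),
    ProbeResult("probe-c", "192.0.2.1", 10, 10),
    ProbeResult("probe-a", "192.0.2.2", 10, 0),
    ProbeResult("probe-b", "192.0.2.2", 10, 2),
    ProbeResult("probe-c", "192.0.2.2", 10, 10),
]
print(faulty_targets(results))  # ['192.0.2.2']
```

In this example only one of three sources sees loss toward 192.0.2.1, so no alarm is raised for it, while two of three sources agree on loss toward 192.0.2.2. A production system would presumably add further steps, such as requiring the condition to persist across several windows before alarming.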
Date of Conference: 23-26 July 2024
Date Added to IEEE Xplore: 22 August 2024
Conference Location: Jersey City, NJ, USA
