MLPing: Real-Time Proactive Fault Detection and Alarm for Large-Scale Distributed IDC Network | IEEE Conference Publication | IEEE Xplore


Abstract:

By providing inexpensive rack and network hosting services, third-party internet data centers (IDCs) have gained significant popularity among cloud service providers. Monitoring the quality of the IDC network in real time and raising alarms proactively are crucial to guaranteeing the reliability of cloud services. The prevailing approach uses active probes and evaluates network quality based on the results of single-link or multi-link probing. However, existing efforts still tend to generate a large number of unnecessary alerts, resulting in enormous operational costs. To address this, we first build a large-scale distributed ping-based dial test system that monitors the quality of the IDC network in a many-to-one probe mode. We develop an efficient exporter tool based on the standard Prometheus data interface to ensure real-time and precise collection of measurement data. To detect potential network issues quickly and accurately, we also design a multi-step heuristic-based fault detection and alarm method. Furthermore, we propose a comprehensive alarm life-cycle model based on the results of multi-link probing to guide alarm management in production practice. The system has been deployed in the production environment of Sangfor's managed cloud for over a year, enabling proactive diagnosis of hundreds of IDC gateway IP addresses. Production statistics indicate a significant improvement in the mean time to repair (MTTR) for IDC network failures, which dropped from a few hours to just a few minutes. The system generates fewer than 15 alarms per day on average, a decrease of approximately 85% compared to before. The alarm accuracy exceeds 95% and the false negative rate is less than 2%.
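The abstract does not spell out the detection heuristics, so the following is only a minimal sketch of how a many-to-one probe mode can suppress unnecessary alerts: a target is flagged only when a sufficient share of independent probers observe heavy packet loss in the same window. The names `ProbeResult` and `is_target_faulty`, and the `loss_threshold` and `quorum` values, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ProbeResult:
    prober: str        # hypothetical identifier of the probing node
    loss_ratio: float  # fraction of ping packets lost in the window, 0.0 to 1.0


def is_target_faulty(
    results: Sequence[ProbeResult],
    loss_threshold: float = 0.5,  # assumed per-prober loss cutoff
    quorum: float = 0.6,          # assumed share of probers that must agree
) -> bool:
    """Flag a target only if enough probers report heavy loss at once."""
    if not results:
        # No measurements: this sketch chooses not to raise an alarm.
        return False
    bad = sum(1 for r in results if r.loss_ratio >= loss_threshold)
    return bad / len(results) >= quorum


# One prober with a bad local link should not trigger an alarm on its own.
single_bad = [
    ProbeResult("prober-a", 1.0),
    ProbeResult("prober-b", 0.0),
    ProbeResult("prober-c", 0.0),
]
# Two of three probers seeing heavy loss points at the target itself.
mostly_bad = [
    ProbeResult("prober-a", 1.0),
    ProbeResult("prober-b", 0.8),
    ProbeResult("prober-c", 0.0),
]
print(is_target_faulty(single_bad), is_target_faulty(mostly_bad))
```

Requiring agreement across probers trades some detection speed for fewer false alarms; the actual system layers a multi-step heuristic and an alarm life-cycle model on top of its probe results, which this sketch does not attempt to reproduce.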
Date of Conference: 23-26 July 2024
Date Added to IEEE Xplore: 22 August 2024
Conference Location: Jersey City, NJ, USA
I. Introduction

Cloud data centers have emerged as the dominant platform for hosting services for numerous companies, granting them access to computing, storage, and network resources [1]. Many cloud service providers (CSPs) opt to lease space from large-scale third-party wholesale internet data center (IDC) operators to provide users with their self-developed cloud services, including Infrastructure-as-a-Service, Platform-as-a-Service, and Software-as-a-Service [2]. We believe that cloud providers prefer not to build their own data centers because of the huge capital expenditure (capex) requirements, long construction cycles, and inflexible service delivery.

A typical network infrastructure employed by a cloud service provider that utilizes a third-party IDC facility.

References
1.
Gartner predicts robust cloud computing market till 2027, 2023, [online] Available: https://www.techrepublic.com/article/gartner-predicts-robust-cloud-market/.
2.
D. A. Insights, China data centre sector, 2023, [online] Available: https://www.dbs.id/id/personal/aics/pdfController.page?pdfpath=/content/article/pdf/AI0/122018/181214_insights_china_data_centre_sector.pdf.
3.
Icmp-ping, 2019, [online] Available: https://ccnatutorials.in/subnetting-for-ccna/ping-packet-internet-groper/.
4.
S. Pillai, What is tcp ping and how is it used, 2013, [online] Available: https://www.slashroot.in/what-tcp-ping-and-how-it-used.
5.
Http-ping, 2023, [online] Available: https://www.vanheusden.com/httping/.
6.
D. Schweikert, fping, 2023, [online] Available: https://fping.org/.
7.
Telegraph-ping, 2023, [online] Available: https://www.influxdata.com/time-series-platform/telegraf/.
8.
Bpf documentation, 2023, [online] Available: https://www.kernel.org/doc/html/latest/bpf/index.html.
9.
Prometheus, 2023, [online] Available: https://prometheus.io/docs/introduction/overview/.
10.
Zabbix, 2023, [online] Available: https://www.zabbix.com/cn.
11.
O. AG, Smokeping, 2023, [online] Available: https://oss.oetiker.ch/smokeping/.
12.
A. Cloud, Application real-time monitoring service (arms), 2023, [online] Available: https://help.aliyun.com/zh/arms/cloud-dial-test/create-alert-rules-for-browsing-tasks?spm=a2c4g.11186623.0.0.7450127epJSYXI.
13.
T. Cloud, Cloud automated testing (cat), 2023, [online] Available: https://cloud.tencent.com/product/cat.
14.
G. Yu, Z. Cai, S. Wang, H. Chen, F. Liu and A. Liu, "Unsupervised online anomaly detection with parameter adaptation for KPI abrupt changes", IEEE Trans. Netw. Serv. Manag., vol. 17, no. 3, pp. 1294-1308, 2020.
15.
M. Sun, Y. Su, S. Zhang, Y. Cao, Y. Liu, D. Pei, et al., "CTF: anomaly detection in high-dimensional time series with coarse-to-fine model transfer", 40th IEEE Conference on Computer Communications INFOCOM 2021, pp. 1-10, May 2021.
16.
J. Huang, E. Kurniawan and S. Sun, "Cellular KPI anomaly detection with GAN and time series decomposition", IEEE International Conference on Communications ICC 2022, pp. 4074-4079, 2022.
17.
S. Zhang, Z. Zhong, D. Li, Q. Fan, Y. Sun, M. Zhu, et al., "Efficient KPI anomaly detection through transfer learning for large-scale web services", IEEE J. Sel. Areas Commun., vol. 40, no. 8, pp. 2440-2455, 2022.
18.
H. Zeng, R. Mahajan, N. McKeown, G. Varghese, L. Yuan and M. Zhang, "Measuring and troubleshooting large operational multipath networks with gray box testing", Microsoft Research Tech. Rep., 2015.
19.
Y. Jin, S. Renganathan, G. Ananthanarayanan, J. Jiang, V. N. Padmanabhan, M. Schroder, et al., "Zooming in on wide-area latencies to a global cloud provider", Proceedings of the ACM Special Interest Group on Data Communication SIGCOMM 2019, pp. 104-116, August 2019.
20.
B. Hou, C. Hou, T. Zhou, Z. Cai and F. Liu, "Detection and characterization of network anomalies in large-scale RTT time series", IEEE Transactions on Network and Service Management, no. 99, pp. 1-1, 2021.
21.
C. Guo, L. Yuan, D. Xiang, Y. Dang, R. Huang, D. A. Maltz, et al., "Pingmesh: A large-scale system for data center network latency measurement and analysis", Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication SIGCOMM 2015, pp. 139-152, August 2015.
22.
A. Roy, D. Bansal, D. Brumley, H. K. Chandrappa, P. Sharma, R. Tewari, et al., "Cloud datacenter SDN monitoring: Experiences and challenges", Proceedings of the Internet Measurement Conference 2018 IMC 2018, pp. 464-470, 2018.
23.
C. Tan, Z. Jin, C. Guo, T. Zhang, H. Wu, K. Deng, et al., "NetBouncer: Active device and link failure localization in data center networks", 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pp. 599-614, 2019.
24.
C. Fang, H. Liu, M. Miao, J. Ye, L. Wang, W. Zhang, et al., "Vtrace: Automatic diagnostic system for persistent packet loss in cloud-scale overlay network", SIGCOMM '20: Proceedings of the 2020 Annual conference of the ACM Special Interest Group on Data Communication on the applications technologies architectures and protocols for computer communication, pp. 31-43, 2020.
25.
S. Zhu, J. Lu, B. Lyu, T. Pan, C. Jia, X. Cheng, et al., "Zoonet: a proactive telemetry system for large-scale cloud networks", Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies CoNEXT 2022, pp. 321-336, December 2022.
26.
Y. Wang and R. Fang, "An approach for fast fault detection in virtual network", Tehnicki vjesnik, vol. 30, no. 4, pp. 1146-1151, 2023.
27.
K. Liu, Z. Jiang, J. Zhang, H. Wei, X. Zhong, L. Tan, et al., "Hostping: Diagnosing intra-host network bottlenecks in RDMA servers", 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pp. 15-29, 2023.
28.
C. Inc, Inband network telemetry, 2023, [online] Available: https://www.cisco.com/c/en/us/td/docs/switches/datacenter/nexus9000/sw/92x/programmability/guide/b-cisco-nexus-9000-series-nx-os-programmability-guide-92x/b-cisco-nexus-9000-series-nx-os-programmability-guide-92x_chapter_0100001.html.
29.
D. Cao, Understanding ping command and icmp with examples, 2023, [online] Available: https://www.howtouselinux.com/post/ping-icmp.
30.
Arp-ping, 2023, [online] Available: https://linuxtect.com/linux-arping-command-tutorial/.