Machine Learning-Based Monitoring Trail: On Fast and Accurate Optical Path Failure Localization in All-Optical DCNs


Abstract:

We propose the Machine Learning-based Monitoring Trail (mlm-trail) to monitor optical path failures during large model training. The existing monitoring trail (m-trail) method can only localize optical link failures, whereas mlm-trail combines m-trail with machine learning and adds a bidirectional voltage constraint. As a result, mlm-trail provides fast, accurate, and integrated failure localization for both optical links and optical switches using only a small number of monitors. First, we construct an input dataset from the edge relationships of 10000 virtual network topologies (the mappings of optical links and optical switches in all-optical data center networks (DCNs)). Then, the monitoring trail problem under the bidirectional voltage constraint is formulated as an integer linear program (ILP) that minimizes the overall monitoring cost (including monitor and bandwidth costs), and its solutions form the output dataset. Finally, we train classical and proposed hybrid machine learning models on this dataset to generate monitoring trails quickly. The machine learning outputs are then modified to satisfy the constraints of full monitoring trail coverage and alarm uniqueness, which yields unambiguous localization for each optical path. Numerical results show that mlm-trail outperforms m-trail in localization speed and scalability, and also outperforms machine learning algorithms in accuracy and cost.
Published in: Journal of Lightwave Technology ( Volume: 43, Issue: 5, 01 March 2025)
Page(s): 2039 - 2052
Date of Publication: 24 October 2024
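
The two constraints named in the abstract, full coverage and alarm uniqueness, can be illustrated with a minimal sketch. It is not the authors' implementation: it omits the ILP, the bidirectional voltage constraint, and the learning models, and it treats each monitoring trail as a plain set of elements. The function names and the element labels (`l1`, `s1`, and so on) are hypothetical. Each element (optical link or optical switch) gets an alarm code whose bit j is 1 when trail j traverses it. Localization is unambiguous only if every code is nonzero and no two elements share a code.

```python
from typing import Dict, List, Set, Tuple


def alarm_codes(trails: List[Set[str]], elements: List[str]) -> Dict[str, Tuple[int, ...]]:
    """Map each element to its alarm code: bit j is 1 if trail j traverses it."""
    return {e: tuple(int(e in trail) for trail in trails) for e in elements}


def check_trails(
    trails: List[Set[str]], elements: List[str]
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (uncovered elements, pairs of elements sharing an alarm code)."""
    codes = alarm_codes(trails, elements)

    # Full coverage: every element must lie on at least one trail.
    uncovered = [e for e, code in codes.items() if not any(code)]

    # Alarm uniqueness: no two covered elements may share an alarm code.
    first_owner: Dict[Tuple[int, ...], str] = {}
    ambiguous: List[Tuple[str, str]] = []
    for e, code in codes.items():
        if not any(code):
            continue
        if code in first_owner:
            ambiguous.append((first_owner[code], e))
        else:
            first_owner[code] = e
    return uncovered, ambiguous


# Hypothetical toy instance: three links and one switch.
elements = ["l1", "l2", "l3", "s1"]
two_trails = [{"l1", "l2", "s1"}, {"l2", "l3", "s1"}]
three_trails = two_trails + [{"l3", "s1"}]

print(check_trails(two_trails, elements))
print(check_trails(three_trails, elements))
```

With two trails, `l2` and `s1` raise the same pair of alarms, so a failure of either cannot be told apart. The third trail separates them, which mirrors the kind of modification step the abstract describes for machine-learning-generated trails.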


I. Introduction

With the exponential growth of large model parameters and the advancement of Dense Wavelength Division Multiplexing (DWDM) technology, computing networks such as all-optical data center networks (DCNs) are developing rapidly owing to their high computing and transport capacity [1]. In all-optical DCNs used for parallel training of large models, optical path flicker caused by dirty optical modules or fiber bending can interrupt the entire training task. Meta reported in the training log of the OPT-175B model that almost the entire training process of a large model encounters frequent restarts and interruptions, which leads to significant economic losses [2]. According to the 2023 Artificial Intelligence Index report by the Stanford Institute for Human-Centered Artificial Intelligence, OpenAI's GPT-3 consumed 1287 megawatt hours of electricity in a single training run, which means that a one-day interruption would result in a loss of approximately $2.5 million [3]. This raises new challenges for the management of all-optical DCNs. Failed optical paths must be localized and bypassed immediately, but it is generally difficult to localize failures of optical links and optical switches simultaneously, accurately, and near-instantaneously [4]. Because optical propagation and optical switching in all-optical DCNs involve no photoelectric conversion, the traditional electronic methods used to detect failures of CPUs, memory, and disks cannot be applied to optical paths [5].
