I. Introduction
With the exponential growth in the parameter counts of large models and advances in Dense Wavelength Division Multiplexing (DWDM) technology, computing networks such as all-optical data center networks (DCNs) are developing rapidly owing to their high computing and transport capacity [1]. In all-optical DCNs used for parallel training of large models, optical path flicker caused by dirty optical modules or fiber bending can interrupt an entire training task. Meta reported in its training log for the OPT-175B model that restarts and interruptions occurred frequently throughout almost the entire training process, leading to significant economic losses [2]. According to the 2023 Artificial Intelligence Index Report by the Stanford Institute for Human-Centered Artificial Intelligence, a single training run of OpenAI's GPT-3 consumed 1287 megawatt-hours of electricity, implying that a one-day interruption would result in a loss of approximately $2.5 million [3]. Such losses raise new challenges for the management of all-optical DCNs. Failed optical paths must therefore be localized and bypassed immediately. However, it is generally difficult to localize failures of optical links and optical switches accurately, simultaneously, and in real time [4]. Because optical propagation and optical switching in all-optical DCNs involve no photoelectric conversion, traditional electronic methods used to detect failures of CPUs, memory, and disks cannot be applied to detect optical path failures [5].