I. Introduction
Large-scale deep learning (DL) [1] models such as ChatGPT (https://openai.com/blog/chatgpt) have recently drawn considerable attention for their superior performance on natural language tasks such as dialogue, text summarization, and translation. Training such large models is challenging for two reasons. On the one hand, the volume of their model parameters exceeds the memory capacity of a single computing device. On the other hand, these models are trained on terabyte-scale datasets, so completing the training process can take several GPU-years or more [2]. Research and industrial communities therefore apply distributed training [3] to address this problem: they manually design parallelism strategies that strive to fully utilize the aggregate computing power of all available devices. These strategies typically combine schemes such as data parallelism (DP) [4], tensor parallelism (TP) [5], and pipeline parallelism (PP) [6]. However, as model types and sizes diversify and deep learning infrastructure evolves rapidly, manual strategies designed for specific models and hardware may become inadequate and require redesign. Such redesign is time-consuming and demands expert engineering experience in deep learning, distributed training, and infrastructure.
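To make the design space concrete, the following is a minimal sketch, not taken from any particular system: even restricting a manual strategy to the three degrees alone, the cluster's devices must be factored into DP, TP, and PP degrees whose product equals the device count, and each valid factorization is a distinct candidate strategy. The names WORLD_SIZE and valid_strategies are hypothetical, introduced only for illustration.

```python
# Minimal sketch: enumerate candidate (DP, TP, PP) degree combinations
# for a hypothetical cluster. Real manual strategies must further decide
# per-layer placement, which enlarges the search space considerably.
from itertools import product

WORLD_SIZE = 16  # hypothetical number of GPUs in the cluster


def valid_strategies(world_size: int):
    """Yield every (dp, tp, pp) degree triple that uses all devices."""
    degrees = range(1, world_size + 1)
    for dp, tp, pp in product(degrees, repeat=3):
        if dp * tp * pp == world_size:
            yield dp, tp, pp


if __name__ == "__main__":
    for dp, tp, pp in valid_strategies(WORLD_SIZE):
        print(f"DP={dp:2d}  TP={tp:2d}  PP={pp:2d}")
```

Even this toy enumeration yields dozens of candidates for 16 devices; choosing among them for a specific model and interconnect is where the expert effort described above is spent.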