A Survey on Auto-Parallelism of Large-Scale Deep Learning Training


Abstract:

Deep learning (DL) has achieved great success in recent years, delivering state-of-the-art performance in both the research community and industry in fields like computer vision and natural language processing. One reason for this success is the huge number of parameters adopted in DL models. However, it is impractical to train even a moderately large model on a typical single device. It is therefore necessary to train DL models on clusters with distributed training algorithms. Traditional distributed training algorithms, however, are usually sub-optimal and highly customized, which makes them ill-suited to training large-scale DL models on varying computing clusters. To handle this problem, researchers have proposed auto-parallelism, which promises to train large-scale DL models efficiently and practically on various computing clusters. In this survey, we perform a broad and thorough investigation of the challenges, foundations, and strategy-searching methods of auto-parallelism in DL training. First, we abstract the basic parallelism schemes together with their communication costs and memory consumption in DL training. We then analyze and compare a series of current auto-parallelism works and investigate the strategies and searching methods commonly used in practice. Finally, we discuss several trends in auto-parallelism that are promising for further research.
Published in: IEEE Transactions on Parallel and Distributed Systems ( Volume: 34, Issue: 8, August 2023)
Page(s): 2377 - 2390
Date of Publication: 26 June 2023



I. Introduction

Large-scale deep learning (DL) [1] models such as ChatGPT (https://openai.com/blog/chatgpt) have recently drawn a lot of attention for their superior performance in natural language tasks like dialogue, text summarization, and translation. Training such large models is hard for two reasons. On the one hand, their model parameters exceed the storage capacity of a typical computing device: a 175-billion-parameter model such as GPT-3, for instance, needs roughly 350 GB just to hold its weights in 16-bit precision, far more than the memory of a single GPU. On the other hand, these models are trained on terabyte (TB)-scale datasets, which require several GPU-years or more to finish the training process [2]. Thus, the research and industrial communities apply distributed training [3] to address this problem. They manually design parallelism strategies that make the best effort to utilize the aggregated computing power of all available devices. These strategies may combine schemes such as data parallelism (DP) [4], tensor parallelism (TP) [5], and pipeline parallelism (PP) [6]. However, with the increasing diversity of model types and sizes and the rapid development of deep learning infrastructure, manual strategies designed for specific models and hardware may become inadequate and require redesign. Such redesigns are time-consuming and demand expert engineering experience in deep learning, distributed training, and the underlying infrastructure.
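
To make the simplest of these schemes concrete, the sketch below shows data parallelism with PyTorch's DistributedDataParallel wrapper: every rank holds a full replica of the model, and gradients are averaged by all-reduce during the backward pass. The toy model, tensor sizes, and backend choice are illustrative assumptions, not taken from the surveyed systems.

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for us.
        dist.init_process_group(backend="gloo")  # use "nccl" on GPU clusters

        model = nn.Linear(1024, 1024)   # stand-in for a real DL model
        ddp_model = DDP(model)          # full model replica on every rank
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

        for _ in range(3):
            # In real training each rank would load its own shard of the data;
            # random tensors keep this sketch self-contained.
            inputs = torch.randn(32, 1024)
            loss = ddp_model(inputs).sum()
            optimizer.zero_grad()
            loss.backward()             # triggers gradient all-reduce across ranks
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with, e.g., torchrun --nproc_per_node=2 ddp_example.py, each process trains the same replica on a different data shard. TP and PP, by contrast, split the model itself across devices, which is what makes choosing how to combine the three schemes a non-trivial search problem.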

