
Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication



Abstract:

Fully Sharded Data Parallel (FSDP) technology achieves higher performance by scaling out data-parallel training of Deep Learning (DL) models. It shards the model parameters, gradients, and optimizer states across multiple GPUs. Consequently, data-intensive Allgather and Reduce-Scatter communication is required to share the model parameters, and this communication becomes a bottleneck. Existing schemes that use GPU-aware MPI libraries are highly prone to saturating the interconnect bandwidth. Therefore, integrating GPU-based compression into MPI libraries has proven effective in reducing training time. In this paper, we propose an optimized Ring algorithm for the Allgather and Reduce-Scatter collectives that encompasses an efficient collective-level online compression scheme. At the microbenchmark level, Allgather achieves improvements of up to 83.6% and 30.3% over the baseline and the existing point-to-point-based compression, respectively, in a state-of-the-art MPI library on modern GPU clusters. Reduce-Scatter achieves improvements of 88.1% and 40.6% over the baseline and point-to-point compression, respectively. For distributed DL training with PyTorch-FSDP, our approach yields 31.7% faster training than the baseline and up to 12.5% faster training than the existing point-to-point-based compression, while maintaining similar accuracy.
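
The collective-level online compression idea from the abstract can be illustrated with a small sketch. The code below is a conceptual, CPU-side approximation only: it uses mpi4py with zlib as a stand-in compressor, whereas the paper's implementation uses GPU-based compression inside a GPU-aware MPI library; the shard size and variable names are illustrative assumptions, not the authors' code.

```python
# Conceptual sketch: a ring Allgather in which each shard travels around the
# ring in compressed form and is decompressed exactly once at the end.
# NOT the paper's implementation; mpi4py + zlib are readability stand-ins.
import zlib
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank starts with one shard (e.g., its slice of the model parameters).
shard = np.full(4, fill_value=float(rank), dtype=np.float32)

compressed = [None] * size
compressed[rank] = zlib.compress(shard.tobytes())  # compress once at the source

send_idx = rank
for _ in range(size - 1):
    dst = (rank + 1) % size
    src = (rank - 1) % size
    # Send the compressed chunk we currently hold to the right neighbor while
    # receiving the next compressed chunk from the left neighbor.
    recv_payload = comm.sendrecv(compressed[send_idx], dest=dst, source=src)
    recv_idx = (send_idx - 1) % size
    compressed[recv_idx] = recv_payload
    send_idx = recv_idx

# Decompress each shard once, after the ring exchange completes.
full = np.concatenate(
    [np.frombuffer(zlib.decompress(c), dtype=np.float32) for c in compressed]
)
assert full.shape == (4 * size,)
```

A Reduce-Scatter could follow the same ring pattern, except that at each hop the received chunk would be decompressed, reduced into the local partial result, and recompressed before being forwarded.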
Date of Conference: 15-19 May 2023
Date Added to IEEE Xplore: 18 July 2023
Conference Location: St. Petersburg, FL, USA

I. Introduction

Over the past decade, DL (Deep Learning) has gained tremendous success in many areas, including image classification, natural language processing, and self-driving cars. DNNs (Deep Neural Networks) are the key technology behind this success: they automatically extract features from multi-modal datasets and develop models that capture the complex, non-linear relationships between these features. Training these DNNs is a compute-intensive workload that is typically done on parallel systems with GPUs (Graphics Processing Units). DL frameworks such as TensorFlow [1] and PyTorch [2] support efficient DNN training on such systems.
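
As a concrete, hedged example of the workload targeted in this paper, the sketch below wraps a small model in PyTorch's FullyShardedDataParallel: FSDP shards parameters across ranks, allgathers the shards for the forward and backward passes, and reduce-scatters gradients afterwards. The model, sizes, and launch assumptions (one process per GPU via torchrun) are illustrative and not taken from the paper.

```python
# Minimal PyTorch-FSDP sketch (illustrative, not from the paper): it shows
# where the Allgather / Reduce-Scatter traffic discussed above comes from.
# Assumes one process per GPU, launched with torchrun so that the rank and
# world-size environment variables are already set.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).cuda()

sharded_model = FSDP(model)            # parameters are sharded across ranks
optimizer = torch.optim.Adam(sharded_model.parameters(), lr=1e-3)

inputs = torch.randn(32, 1024, device="cuda")
loss = sharded_model(inputs).sum()     # Allgather of parameter shards
loss.backward()                        # Reduce-Scatter of gradients
optimizer.step()

dist.destroy_process_group()
```

With this default wrapping the whole model forms a single FSDP unit; in practice an auto-wrap policy is typically used so that per-layer allgathers can be overlapped with computation.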

Cites in Papers - IEEE (3)

1. Mikhail Khalilov, Salvatore Di Girolamo, Marcin Chrapek, Rami Nudelman, Gil Bloch, Torsten Hoefler, "Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI", SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-17, 2024.
2. Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Zhaorui Zhang, Jinyang Liu, Xiaoyi Lu, Ken Raffenetti, Hui Zhou, Kai Zhao, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur, "An Optimized Error-controlled MPI Collective Framework Integrated with Lossy Compression", 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 752-764, 2024.
3. Lang Xu, Quentin Anthony, Qinghua Zhou, Nawras Alnaasan, Radha Gulhane, Aamir Shafi, Hari Subramoni, Dhabaleswar K. DK Panda, "Accelerating Large Language Model Training with Hybrid GPU-based Compression", 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid), pp. 196-205, 2024.

Cites in Papers - Other Publishers (1)

1. Junqi Yin, Sajal Dash, John Gounley, Feiyi Wang, Georgia Tourassi, "Evaluation of pre-training large language models on leadership-class supercomputers", The Journal of Supercomputing, vol. 79, no. 18, pp. 20747, 2023.
