
Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication



Abstract:

Fully Sharded Data Parallel (FSDP) technology achieves higher performance by scaling out data-parallel training of Deep Learning (DL) models. It shards the model parameters, gradients, and optimizer states across multiple GPUs. Consequently, data-intensive Allgather and Reduce-Scatter communication is required to share the model parameters, and this communication becomes a bottleneck. Existing schemes that use GPU-aware MPI libraries are highly prone to saturating the interconnect bandwidth. Therefore, integrating GPU-based compression into MPI libraries has proven effective in reducing training time. In this paper, we propose an optimized Ring algorithm for the Allgather and Reduce-Scatter collectives that encompasses an efficient collective-level online compression scheme. At the microbenchmark level, Allgather achieves improvements of up to 83.6% and 30.3% over the baseline and the existing point-to-point-based compression, respectively, in a state-of-the-art MPI library on modern GPU clusters. Reduce-Scatter achieves improvements of 88.1% and 40.6% over the baseline and point-to-point compression, respectively. For distributed DL training with PyTorch-FSDP, our approach yields 31.7% faster training than the baseline and up to 12.5% faster training than the existing point-to-point-based compression, while maintaining similar accuracy.
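
The collective-level online compression idea from the abstract can be illustrated with a small sketch. The code below is a conceptual, CPU-side approximation only: it uses mpi4py with zlib as a stand-in compressor, whereas the paper's implementation uses GPU-based compression inside a GPU-aware MPI library; the shard size and variable names are illustrative assumptions, not the authors' code.

```python
# Conceptual sketch: a ring Allgather in which each shard travels around the
# ring in compressed form and is decompressed exactly once at the end.
# NOT the paper's implementation; mpi4py + zlib are readability stand-ins.
import zlib
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank starts with one shard (e.g., its slice of the model parameters).
shard = np.full(4, fill_value=float(rank), dtype=np.float32)

compressed = [None] * size
compressed[rank] = zlib.compress(shard.tobytes())  # compress once at the source

send_idx = rank
for _ in range(size - 1):
    dst = (rank + 1) % size
    src = (rank - 1) % size
    # Send the compressed chunk we currently hold to the right neighbor while
    # receiving the next compressed chunk from the left neighbor.
    recv_payload = comm.sendrecv(compressed[send_idx], dest=dst, source=src)
    recv_idx = (send_idx - 1) % size
    compressed[recv_idx] = recv_payload
    send_idx = recv_idx

# Decompress each shard once, after the ring exchange completes.
full = np.concatenate(
    [np.frombuffer(zlib.decompress(c), dtype=np.float32) for c in compressed]
)
assert full.shape == (4 * size,)
```

A Reduce-Scatter could follow the same ring pattern, except that at each hop the received chunk would be decompressed, reduced into the local partial result, and recompressed before being forwarded.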
Date of Conference: 15-19 May 2023
Date Added to IEEE Xplore: 18 July 2023
Conference Location: St. Petersburg, FL, USA

I. Introduction

Over the past decade, DL (Deep Learning) has gained tremendous success in many areas, including image classification, natural language processing, and self-driving cars. DNNs (Deep Neural Networks) are the key technology behind this success: they automatically extract features from multi-modal datasets and develop models that capture the complex, non-linear relationships between these features. Training these DNNs is a compute-intensive workload that is typically done on parallel systems with GPUs (Graphics Processing Units). DL frameworks such as TensorFlow [1] and PyTorch [2] support efficient DNN training on such systems.
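
As a concrete, hedged example of the workload targeted in this paper, the sketch below wraps a small model in PyTorch's FullyShardedDataParallel: FSDP shards parameters across ranks, allgathers the shards for the forward and backward passes, and reduce-scatters gradients afterwards. The model, sizes, and launch assumptions (one process per GPU via torchrun) are illustrative and not taken from the paper.

```python
# Minimal PyTorch-FSDP sketch (illustrative, not from the paper): it shows
# where the Allgather / Reduce-Scatter traffic discussed above comes from.
# Assumes one process per GPU, launched with torchrun so that the rank and
# world-size environment variables are already set.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).cuda()

sharded_model = FSDP(model)            # parameters are sharded across ranks
optimizer = torch.optim.Adam(sharded_model.parameters(), lr=1e-3)

inputs = torch.randn(32, 1024, device="cuda")
loss = sharded_model(inputs).sum()     # Allgather of parameter shards
loss.backward()                        # Reduce-Scatter of gradients
optimizer.step()

dist.destroy_process_group()
```

With this default wrapping the whole model forms a single FSDP unit; in practice an auto-wrap policy is typically used so that per-layer allgathers can be overlapped with computation.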

Cites in Papers - IEEE (3)

1. Mikhail Khalilov, Salvatore Di Girolamo, Marcin Chrapek, Rami Nudelman, Gil Bloch, Torsten Hoefler, "Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI", SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-17, 2024.
2. Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Zhaorui Zhang, Jinyang Liu, Xiaoyi Lu, Ken Raffenetti, Hui Zhou, Kai Zhao, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur, "An Optimized Error-controlled MPI Collective Framework Integrated with Lossy Compression", 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 752-764, 2024.
3. Lang Xu, Quentin Anthony, Qinghua Zhou, Nawras Alnaasan, Radha Gulhane, Aamir Shafi, Hari Subramoni, Dhabaleswar K. DK Panda, "Accelerating Large Language Model Training with Hybrid GPU-based Compression", 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid), pp. 196-205, 2024.

Cites in Papers - Other Publishers (1)

1. Junqi Yin, Sajal Dash, John Gounley, Feiyi Wang, Georgia Tourassi, "Evaluation of pre-training large language models on leadership-class supercomputers", The Journal of Supercomputing, vol. 79, no. 18, pp. 20747, 2023.
