Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication

Abstract:

Fully Sharded Data Parallel (FSDP) technology achieves higher performance by scaling out data-parallel training of Deep Learning (DL) models. It shards the model parameters, gradients, and optimizer states among multiple GPUs. This sharding requires data-intensive Allgather and Reduce-Scatter communication to share the model parameters, which becomes a bottleneck. Existing schemes that use GPU-aware MPI libraries are highly prone to saturating the interconnect bandwidth. Integrating GPU-based compression into MPI libraries has therefore proven effective at reducing training time. In this paper, we propose optimized Ring algorithms for the Allgather and Reduce-Scatter collectives that incorporate an efficient collective-level online compression scheme. At the microbenchmark level, Allgather achieves benefits of up to 83.6% and 30.3% compared to the baseline and to the existing point-to-point-based compression in a state-of-the-art MPI library, respectively, on modern GPU clusters. Reduce-Scatter achieves benefits of 88.1% and 40.6% compared to the baseline and to point-to-point compression, respectively. For distributed DL training with PyTorch-FSDP, our approach yields 31.7% faster training than the baseline and up to 12.5% faster training than the existing point-to-point-based compression, while maintaining similar accuracy.
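
The Ring communication pattern named in the abstract can be illustrated with a small single-process simulation. This is a minimal sketch of the textbook ring Allgather and Reduce-Scatter, not the paper's optimized algorithm: compression is omitted entirely, ranks are simulated as list entries rather than MPI processes, and the function names `ring_allgather` and `ring_reduce_scatter` are hypothetical.

```python
def ring_allgather(shards):
    """Simulate a ring Allgather; shards[r] is the shard owned by rank r."""
    p = len(shards)
    # buf[r][i] is shard i as seen by rank r (None until it arrives).
    buf = [[None] * p for _ in range(p)]
    for r in range(p):
        buf[r][r] = shards[r]
    for step in range(p - 1):
        # Rank r forwards the shard it obtained most recently,
        # index (r - step) mod p, to its right neighbour (r + 1) mod p.
        sends = [((r + 1) % p, (r - step) % p, buf[r][(r - step) % p])
                 for r in range(p)]
        for dst, idx, data in sends:
            buf[dst][idx] = data
    return buf


def ring_reduce_scatter(inputs):
    """Simulate a ring Reduce-Scatter (sum); inputs[r][i] is rank r's chunk i."""
    p = len(inputs)
    acc = [list(row) for row in inputs]  # per-rank working copies
    for step in range(p - 1):
        # Rank r sends its partial sum of chunk (r - step - 1) mod p to the
        # right neighbour, which adds it to its own copy of that chunk.
        sends = [((r + 1) % p, (r - step - 1) % p, acc[r][(r - step - 1) % p])
                 for r in range(p)]
        for dst, idx, val in sends:
            acc[dst][idx] += val
    # After p - 1 steps, rank r holds the fully reduced chunk r.
    return [acc[r][r] for r in range(p)]


if __name__ == "__main__":
    print(ring_allgather(["w0", "w1", "w2", "w3"])[0])
    print(ring_reduce_scatter([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
```

Both collectives finish in p - 1 steps for p ranks, and every step moves one shard per rank over the ring, which is why the volume of data per shard bears directly on interconnect bandwidth.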

Published in: 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Date of Conference: 15-19 May 2023

Date Added to IEEE Xplore: 18 July 2023

DOI: 10.1109/IPDPS54959.2023.00023

Conference Location: St. Petersburg, FL, USA

I. Introduction

Over the past decade, DL (Deep Learning) has achieved tremendous success in many areas, including image classification, natural language processing, and self-driving cars. DNNs (Deep Neural Networks) are the key technology: they automatically extract features from multi-modal datasets and build a model that captures the complex, non-linear relationships between these features. Training DNNs is a compute-intensive workload that is typically run on parallel systems with GPUs (Graphics Processing Units). DL frameworks such as TensorFlow [1] and PyTorch [2] support efficient DNN training on such systems.
