
Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication


Abstract:

Fully Sharded Data Parallel (FSDP) technology achieves higher performance by scaling out data-parallel training of Deep Learning (DL) models. It shards the model parameters, gradients, and optimizer states across multiple GPUs. Consequently, this requires data-intensive Allgather and Reduce-Scatter communication to share the model parameters, which becomes a bottleneck. Existing schemes that use GPU-aware MPI libraries are highly prone to saturating the interconnect bandwidth. Therefore, integrating GPU-based compression into MPI libraries has proven efficient for achieving faster training times. In this paper, we propose optimized Ring algorithms for the Allgather and Reduce-Scatter collectives that incorporate an efficient collective-level online compression scheme. At the microbenchmark level, Allgather achieves improvements of up to 83.6% and 30.3% over the baseline and the existing point-to-point-based compression, respectively, in a state-of-the-art MPI library on modern GPU clusters; Reduce-Scatter achieves improvements of 88.1% and 40.6%, respectively. For distributed DL training with PyTorch-FSDP, our approach yields 31.7% faster training than the baseline and up to 12.5% faster training than the existing point-to-point-based compression, while maintaining similar accuracy.
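To make the communication pattern concrete, the sketch below illustrates a ring Allgather with per-chunk compression. It is only a conceptual sketch, not the implementation evaluated in the paper: mpi4py on the host stands in for the GPU-aware MPI library, an fp32-to-fp16 cast stands in for the GPU-based compressor, and the names compress, decompress, and ring_allgather_compressed are illustrative. It also assumes one plausible benefit of collective-level (as opposed to point-to-point) compression, namely that a chunk can be forwarded around the ring in compressed form instead of being decompressed and recompressed at every hop.

# Conceptual sketch (not the authors' implementation): a ring Allgather in
# which each rank compresses its shard once and forwards the compressed
# payload around the ring. mpi4py stands in for the GPU-aware MPI library,
# and an fp32 -> fp16 cast stands in for the GPU-based compressor.
from mpi4py import MPI
import numpy as np


def compress(chunk):
    # Placeholder lossy compressor: halves the payload by casting to fp16.
    return chunk.astype(np.float16)


def decompress(payload):
    # Restore a full-precision (but lossy) copy for local use.
    return np.frombuffer(payload, dtype=np.float16).astype(np.float32)


def ring_allgather_compressed(comm, my_shard):
    rank, size = comm.Get_rank(), comm.Get_size()
    right, left = (rank + 1) % size, (rank - 1) % size

    shards = [None] * size
    shards[rank] = my_shard
    send_buf = compress(my_shard).tobytes()   # compress once

    for step in range(size - 1):
        recv_buf = bytearray(len(send_buf))
        # Exchange compressed chunks with the ring neighbors.
        comm.Sendrecv(send_buf, dest=right, recvbuf=recv_buf, source=left)
        owner = (rank - step - 1) % size      # rank whose shard just arrived
        shards[owner] = decompress(recv_buf)
        send_buf = bytes(recv_buf)            # forward without re-compressing

    return np.concatenate(shards)


if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    shard = np.full(4, comm.Get_rank(), dtype=np.float32)
    gathered = ring_allgather_compressed(comm, shard)
    print(f"rank {comm.Get_rank()}: {gathered}")

Run with, for example, mpiexec -n 4 python ring_allgather_sketch.py (hypothetical filename). A Reduce-Scatter counterpart would follow the same ring pattern, except that each received chunk is decompressed, reduced into the local partial result, and recompressed before forwarding.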
Date of Conference: 15-19 May 2023
Date Added to IEEE Xplore: 18 July 2023
Conference Location: St. Petersburg, FL, USA

I. Introduction

Over the past decade, DL (Deep Learning) has achieved tremendous success in many areas, including image classification, natural language processing, and self-driving cars. DNNs (Deep Neural Networks) are the key technology, capable of automatically extracting features from multi-modal datasets and developing a model that captures the complex, non-linear relationships among these features. Training these DNNs is a compute-intensive workload that is typically done on parallel systems equipped with GPUs (Graphics Processing Units). DL frameworks like TensorFlow [1] and PyTorch [2] support efficient DNN training on such systems.
