
Accelerating MPI AllReduce Communication with Efficient GPU-Based Compression Schemes on Modern GPU Clusters


Abstract:

With the growing scale of High-Performance Computing (HPC) and Deep Learning (DL) applications driven by GPU adoption, efficient communication of data stored on GPUs has become a critical factor in overall application performance. AllReduce is a collective communication operation commonly used in HPC applications and in distributed DL training, especially under Data Parallelism, a common strategy in which parallel GPUs each hold a replica of the DL model and process a partition of the training dataset. However, the AllReduce operation on large GPU data still performs poorly due to the limited interconnect bandwidth between GPU nodes. Strategies such as Gradient Quantization or Sparse AllReduce, which modify the Stochastic Gradient Descent (SGD) algorithm, may not support all training scenarios. Recent research shows that integrating GPU-based compression into MPI libraries is an efficient way to achieve faster data transmission. In this paper, we propose optimized Recursive-Doubling and Ring AllReduce algorithms that incorporate efficient collective-level GPU-based compression schemes into a state-of-the-art GPU-Aware MPI library. At the microbenchmark level, the proposed Recursive-Doubling and Ring algorithms with compression support achieve benefits of up to 75.3% and 85.5%, respectively, over the baseline, and 24.8% and 66.1%, respectively, over naive point-to-point compression on modern GPU clusters. For distributed DL training with PyTorch-DDP, the two approaches yield up to 32.3% and 35.7% faster training than the baseline while maintaining similar accuracy.
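The setting described in the abstract can be illustrated with a minimal data-parallel training sketch using PyTorch-DDP: each rank holds a full model replica, processes its own partition of the data, and the gradient AllReduce triggered during the backward pass is the communication step that collective-level GPU-based compression targets. The model, data, and hyperparameters below are hypothetical placeholders and are not taken from the paper.

```python
# Minimal PyTorch-DDP sketch (hypothetical model/data), launched e.g. with torchrun.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # NCCL (or a GPU-aware MPI backend) moves gradient data directly between GPUs.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Data parallelism: every rank keeps a replica of the model.
    model = torch.nn.Linear(1024, 1024).cuda()
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for step in range(10):
        # Each rank processes its own partition of the training data
        # (random tensors stand in for a real partitioned dataset).
        inputs = torch.randn(32, 1024, device="cuda")
        targets = torch.randn(32, 1024, device="cuda")

        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
        # backward() triggers AllReduce on gradient buckets; this is the
        # inter-GPU communication that compressed AllReduce aims to speed up.
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```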
Date of Conference: 12-16 May 2024
Date Added to IEEE Xplore: 10 May 2024
Electronic ISBN: 978-3-9826336-0-2
Conference Location: Hamburg, Germany
