
Designing Efficient Pipelined Communication Schemes using Compression in MPI Libraries


Abstract:

The emergence of trillion-parameter models in AI and the deployment of dense Graphics Processing Unit (GPU) systems with high-bandwidth inter-GPU and network interconnects underscore the need to design efficient, architecture-aware large message communication operations. GPU-based on-the-fly compression communication designs reduce the amount of data transferred across processes, thereby improving large message communication performance. In this paper, we first analyze bottlenecks in state-of-the-art on-the-fly compression-based MPI implementations for blocking as well as non-blocking point-to-point communication operations. We then propose efficient point-to-point designs that improve upon state-of-the-art implementations through fine-grained overlap of copy, compression, and communication operations. We demonstrate the efficacy of our proposed designs by comparing against state-of-the-art communication runtimes using micro-benchmarks and candidate communication patterns. Our proposed designs deliver improvements of 28.7% in latency, 49.7% in bandwidth, and 36% in bi-directional bandwidth on micro-benchmarks, and up to 16.5% for 3D stencil-based communication patterns over state-of-the-art designs.
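To make the idea of overlapping compression with communication concrete, the sketch below shows a chunked, double-buffered sender in C with MPI. It is not the paper's implementation: the chunk size, the two staging buffers, and the compress_chunk() routine (a plain byte copy standing in for a real GPU or CPU compressor) are assumptions made purely for illustration, and the matching receiver is omitted.

/* Illustrative sketch only (not the paper's design): overlap compression of
 * chunk i+1 with the network transfer of chunk i using double buffering.
 * compress_chunk() is a hypothetical placeholder for a real compressor;
 * here it simply copies bytes so the example stays self-contained. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder "compressor": copies src to dst and reports the output size. */
static size_t compress_chunk(const char *src, size_t n, char *dst) {
    memcpy(dst, src, n);
    return n;
}

static void pipelined_compressed_send(const char *msg, size_t total,
                                      size_t chunk, int dest, MPI_Comm comm) {
    char *staged[2];
    staged[0] = malloc(chunk);
    staged[1] = malloc(chunk);

    MPI_Request req = MPI_REQUEST_NULL;
    int slot = 0;

    for (size_t off = 0; off < total; off += chunk) {
        size_t n = (total - off < chunk) ? (total - off) : chunk;

        /* Compress the current chunk into the free staging buffer while the
         * previous chunk (if any) is still in flight on the network. */
        size_t out = compress_chunk(msg + off, n, staged[slot]);

        /* Retire the previous transfer before reusing its request handle. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        MPI_Isend(staged[slot], (int)out, MPI_BYTE, dest, 0, comm, &req);
        slot ^= 1;  /* double buffering: next compression uses the other slot */
    }

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    free(staged[0]);
    free(staged[1]);
}

A matching receiver would need to learn each chunk's compressed size (e.g., via MPI_Probe and MPI_Get_count) and could likewise decompress one chunk while the next is in flight. The paper's designs go further, targeting fine-grained overlap of copy, compression, and communication with GPU-based on-the-fly compression, which this host-side sketch does not capture.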
Date of Conference: 18-21 December 2022
Date Added to IEEE Xplore: 26 April 2023
Conference Location: Bengaluru, India

I. Introduction

The advent of Graphics Processing Units (GPUs) has enabled applications to perform a wide variety of compute-intensive tasks at a much faster rate than on CPUs. Owing to these massive compute capabilities, High Performance Computing (HPC) clusters such as Summit, the #4 supercomputer on the Top500 list [1], employ multiple GPUs per node across thousands of nodes. These clusters use high-bandwidth inter-node interconnects such as InfiniBand [2] and inter-GPU interconnects such as NVIDIA NVLink [3] to support large volumes of low-latency distributed communication between the GPUs in the system. The Message Passing Interface (MPI) is the de facto standard for distributed communication on HPC clusters, providing APIs for point-to-point as well as collective communication operations. The trend towards building supercomputers with GPUs and high-performance interconnects is only expected to grow with the move towards exascale. The onus of utilizing the different interconnects and compute elements in supercomputing systems, while achieving the lowest possible communication latency between processes, falls on MPI libraries.
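As a point of reference for the point-to-point APIs mentioned above, the minimal example below sends a GPU-resident buffer between two ranks. It assumes an MPI library built with CUDA-aware support (so device pointers can be passed directly to MPI calls); the buffer size, tag, and rank assignment are arbitrary choices for illustration.

/* Minimal CUDA-aware MPI point-to-point example (illustrative sketch;
 * assumes a CUDA-aware MPI build so GPU pointers can be passed to MPI). */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 24;               /* 16M floats (~64 MB) */
    float *gpu_buf;
    cudaMalloc((void **)&gpu_buf, count * sizeof(float));

    if (rank == 0) {
        cudaMemset(gpu_buf, 0, count * sizeof(float));
        /* CUDA-aware MPI: the device pointer is handed to MPI directly. */
        MPI_Send(gpu_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(gpu_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d floats into a GPU buffer\n", count);
    }

    cudaFree(gpu_buf);
    MPI_Finalize();
    return 0;
}

Run with at least two ranks, e.g. mpirun -np 2 ./a.out. Without a CUDA-aware MPI build, the same transfer would require explicit staging through host memory with cudaMemcpy before and after the MPI calls.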

