OHIO: Improving RDMA Network Scalability in MPI_Alltoall Through Optimized Hierarchical and Intra/Inter-Node Communication Overlap Design

Abstract:

The arrival of exascale computers has pushed the boundary of computing capability, which poses performance challenges for parallel programming models that must exploit such systems efficiently. A dominant programming model for running parallel programs is the Message Passing Interface (MPI). Among the primitives provided by MPI, Alltoall is a communication-intensive operation that is used by many applications and is well known for being difficult to optimize. Alltoall algorithms can be broadly classified as flat or hierarchical. Hierarchical designs decouple intra-node communication from inter-node communication, so that intra-node transfers are not slowed down by inter-node transfers. Hierarchical designs also reduce network congestion by reducing the number of messages injected into the network concurrently. This work demonstrates an additional benefit of hierarchical designs: improved connection scalability in RDMA networks, which is attributed to the cache thrashing that occurs inside network adapters. Together, these advantages of hierarchical schemes contribute to the network scalability of Alltoall. This motivates us to propose a further optimized hierarchical design to enhance performance and network scalability. The design is network-agnostic and is evaluated on clusters with InfiniBand and Omni-Path network adapters. At the micro-benchmark level with up to 7168 cores, the proposed design achieves average latency improvements of 61.13%, 56.40%, 37.49%, and 51.90% over Open MPI + UCX, HPC-X, Intel MPI, and MVAPICH2-X, respectively. In addition, an application-level evaluation with the Car-Parrinello Molecular Dynamics code shows 24.98%, 40.44%, and 50.48% improvement in simulation time compared to MVAPICH2-X, Open MPI + UCX, and Intel MPI, respectively.
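The difference between flat and hierarchical Alltoall can be illustrated with a small simulation. The sketch below is not the design proposed in the paper: it assumes a generic single-leader scheme (gather to a node leader, one aggregated message per pair of nodes, scatter locally), performs no intra/inter-node overlap, and runs in plain Python rather than MPI. The function names and the `ppn` (processes per node) parameter are hypothetical, and the rank count is assumed to be a multiple of `ppn`.

```python
def flat_alltoall(send, ppn):
    """Flat scheme: every rank sends its block directly to every other rank.

    send[src][dst] is the block that rank src sends to rank dst.
    Returns (recv, inter_node_msgs) where recv[dst][src] is the delivered block.
    """
    n = len(send)
    recv = [[None] * n for _ in range(n)]
    inter_node_msgs = 0
    for src in range(n):
        for dst in range(n):
            recv[dst][src] = send[src][dst]
            if src // ppn != dst // ppn:
                inter_node_msgs += 1  # one network message per remote rank pair
    return recv, inter_node_msgs


def hierarchical_alltoall(send, ppn):
    """Single-leader hierarchical scheme (illustrative assumption, see lead-in)."""
    n = len(send)
    nodes = n // ppn
    recv = [[None] * n for _ in range(n)]

    # Step 1, intra-node gather: each node leader collects the blocks of its
    # local ranks and groups them by destination node.
    outbox = [[[] for _ in range(nodes)] for _ in range(nodes)]
    for src in range(n):
        for dst in range(n):
            outbox[src // ppn][dst // ppn].append((src, dst, send[src][dst]))

    inter_node_msgs = 0
    for a in range(nodes):
        for b in range(nodes):
            # Step 2, inter-node exchange: one aggregated message per ordered
            # pair of distinct nodes, sent leader to leader.
            if a != b:
                inter_node_msgs += 1
            # Step 3, intra-node scatter: the receiving leader hands each
            # block to its final destination rank.
            for src, dst, block in outbox[a][b]:
                recv[dst][src] = block
    return recv, inter_node_msgs
```

Under these assumptions, 4 nodes with 8 ranks each need 768 inter-node messages in the flat scheme and 12 larger aggregated messages in the hierarchical one, while both deliver identical data. The total bytes crossing the network are unchanged; what shrinks is the number of concurrent messages and of distinct remote peers per network adapter, which is the quantity the abstract relates to congestion and connection scalability.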

Published in: 2024 IEEE Symposium on High-Performance Interconnects (HOTI)

Date of Conference: 21-23 August 2024

Date Added to IEEE Xplore: 10 September 2024

DOI: 10.1109/HOTI63208.2024.00019

Conference Location: Albuquerque, NM, USA

I. Introduction

Contemporary high-performance clusters are equipped with powerful CPUs with a high core count per node. An AMD system featuring the Milan or Rome architecture supports up to 128 cores per node, while the Intel Ice Lake architecture offers a maximum of 80 cores. Such nodes are connected by high-performance networks such as InfiniBand, Omni-Path, and RoCE. These networks distinguish themselves from traditional Ethernet by using Remote Direct Memory Access (RDMA) to provide high-performance, zero-copy, kernel-bypass message transfers. To fully exploit supercomputers, an efficient and robust programming model is required to keep pace with the trends of such systems in core count per node, network speed, and system size.
