HINT: Designing Cache-Efficient MPI_Alltoall using Hybrid Memory Copy Ordering and Non-Temporal Instructions


Abstract:

Modern multi/many-core processors in HPC systems have hundreds of cores with deep memory hierarchies. HPC applications running at high core counts often experience contention between processes/threads on shared resources such as caches, leading to degraded performance. This is especially true for dense collective patterns, such as MPI_Alltoall, that issue many concurrent memory transactions. The ordering of memory copies during the MPI_Alltoall operation can significantly affect performance, as cache-efficient access patterns can reduce cache misses. However, the best access pattern depends on various factors, including cache associativity, cache sizes, coherence protocols, and memory layouts. This paper first identifies the sources of bottlenecks in performing memory operations in an Alltoall. We propose different orderings for the memory copies in Alltoall operations and study their effectiveness for various message sizes. We overcome bandwidth bottlenecks related to repeated bus requests in the cache by proposing a hybrid memory copy scheme that combines regular temporal and non-temporal stores. Then, we implement an Alltoall algorithm that dynamically picks between memory orders based on their performance for different message sizes and process counts. To the best of our knowledge, this is the first work that explores a combination of dynamic memory copy orders and non-temporal instructions for optimizing MPI_Alltoall operations. Our proposed solutions reduce latency versus state-of-the-art solutions by up to 10x at the micro-benchmark level and by 22.2% for the CPU time per loop in distributed Fast Fourier Transforms (FFTs) using P3DFFT.
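
As a concrete illustration of the hybrid copy scheme, the sketch below shows one way to combine the two store types: small or unaligned copies use ordinary temporal stores via memcpy, while large aligned copies use AVX-512 non-temporal (streaming) stores that bypass the cache. This is a minimal sketch under assumed parameters, not the paper's implementation; the hybrid_copy name and the 64 KiB threshold are illustrative placeholders.

/*
 * Illustrative sketch of a hybrid temporal/non-temporal copy.
 * NT_THRESHOLD and hybrid_copy are placeholders, not from the paper.
 * Compile with AVX-512 support (e.g., -mavx512f).
 */
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

#define NT_THRESHOLD (64 * 1024)   /* assumed per-machine tuning knob */

static void hybrid_copy(void *dst, const void *src, size_t len)
{
    /* Temporal path: small blocks benefit from staying in cache, and
     * streaming stores require a 64-byte-aligned destination. */
    if (len < NT_THRESHOLD || ((uintptr_t)dst & 63) != 0) {
        memcpy(dst, src, len);
        return;
    }
    char *d = (char *)dst;
    const char *s = (const char *)src;
    size_t vec = len & ~(size_t)63;            /* whole 64-byte chunks */
    for (size_t i = 0; i < vec; i += 64) {
        __m512i v = _mm512_loadu_si512((const void *)(s + i));
        _mm512_stream_si512((void *)(d + i), v);   /* bypasses the cache */
    }
    if (len != vec)
        memcpy(d + vec, s + vec, len - vec);   /* temporal remainder */
    _mm_sfence();   /* order streaming stores before later accesses */
}

Non-temporal stores avoid the read-for-ownership traffic that regular stores incur on write misses, which is one source of the repeated bus requests the abstract mentions.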
Date of Conference: 27-31 May 2024
Date Added to IEEE Xplore: 08 July 2024
Conference Location: San Francisco, CA, USA

I. Introduction

Modern multi-core CPUs have multiple levels of memory hierarchy to cater to the needs of diverse workloads. Processor caches play an essential role in performance due to their low latency compared to other forms of memory such as DRAM, HBM, and SSDs, but they have limited capacity. Traditionally, caches are divided into multiple levels (in most cases, L1, L2, and L3), with L1 having the lowest latency and the latency increasing multi-fold as we traverse the hierarchy from L2 to L3 and from L3 to main memory. This puts the burden on applications to utilize processor caches effectively to avoid unnecessary latency penalties. In a distributed setting with multiple cores, the cache behavior of communication libraries also plays an important role in application performance. The Message Passing Interface (MPI) [1] has been a pervasive programming model in high-performance computing for distributed communication. MPI libraries often need to reduce cache misses by making buffer accesses cache-efficient, especially for jobs with high core counts per node. Efficient cache usage is especially important for dense collective patterns such as MPI_Alltoall, as the significant increase in memory transactions at high core counts is likely to cause a large number of conflict misses in the cache. With the trend of increasing core counts on modern processors, such as the AMD EPYC 9004 series [2] with up to 192 cores per node, optimizing the cache usage of MPI_Alltoall within the node (intra-node) is increasingly important. Cache optimizations for MPI_Alltoall can be viewed from two standpoints: improving spatial locality and improving the bandwidth of memory transactions.
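
To illustrate the first standpoint, the sketch below contrasts two copy orderings for an intra-node Alltoall over shared memory; it is an assumed example rather than the paper's algorithm, and the shm_buf() addressing helper is hypothetical. In the linear order, every rank visits destinations 0..P-1 in the same sequence, so all ranks contend for the same region of the staging buffer at the same time; the rotated order staggers each rank's starting point so that concurrent copies touch disjoint regions and map to different cache sets.

#include <string.h>

/* Hypothetical helper: returns the address of the block that rank
 * `src` sends to rank `dst` inside a shared-memory staging buffer. */
extern char *shm_buf(int src, int dst, size_t bytes);

/* Linear order: identical traversal on every rank (contention-prone). */
void alltoall_copy_linear(int me, int nprocs, const char *sbuf, size_t bytes)
{
    for (int dst = 0; dst < nprocs; dst++)
        memcpy(shm_buf(me, dst, bytes), sbuf + (size_t)dst * bytes, bytes);
}

/* Rotated order: rank r starts at peer (r+1) mod P, so concurrent
 * copies from different ranks are staggered across the buffer. */
void alltoall_copy_rotated(int me, int nprocs, const char *sbuf, size_t bytes)
{
    for (int step = 1; step <= nprocs; step++) {
        int dst = (me + step) % nprocs;
        memcpy(shm_buf(me, dst, bytes), sbuf + (size_t)dst * bytes, bytes);
    }
}

The second standpoint, memory bandwidth, is what the hybrid temporal/non-temporal store scheme sketched after the abstract above targets.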
