I. Introduction
Modern multi-core CPUs have multi-level memory hierarchies to cater to the needs of diverse workloads. Processor caches play an essential role in performance due to their low latency compared to other forms of memory such as DRAM, HBM, and SSDs, but they have limited capacity. Traditionally, caches are divided into multiple levels (in most cases, L1, L2, and L3), with L1 having the lowest latency and the latency increasing several-fold as we traverse the memory hierarchy from L2 to L3 and from L3 to main memory. This puts the burden on applications to utilize processor caches effectively to avoid unnecessary latency penalties.

In a distributed setting with many cores, the cache behavior of communication libraries also plays an important role in application performance. The Message Passing Interface (MPI) [1] has been a pervasive programming model for distributed communication in high-performance computing. MPI libraries therefore need to reduce cache misses by making their buffer accesses cache-efficient, especially for jobs with high core counts per node. Efficient cache usage is especially important for dense collective patterns such as MPI_Alltoall, as the sharp increase in memory transactions at high core counts is likely to cause a large number of conflict misses in the cache. Given the trend of increasing core counts on modern processors, such as dual-socket AMD EPYC 9004 series systems [2] with 192 cores per node, optimizing the cache usage of MPI_Alltoall within the node (intra-node) is an important problem. Cache optimizations for MPI_Alltoall can be viewed from two standpoints: improving spatial locality and improving the bandwidth of memory transactions.
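To make the access pattern concrete, the following minimal sketch (in C, with an illustrative block size not taken from this work) shows the intra-node MPI_Alltoall exchange discussed above: every rank sends a distinct block to, and receives a distinct block from, every other rank, so the per-rank buffer footprint and the number of memory transactions both grow linearly with the number of ranks on the node.

    /* Minimal MPI_Alltoall sketch; the block size is illustrative only. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int block = 1024;  /* elements exchanged per rank pair (assumed) */
        int *sendbuf = malloc((size_t)nprocs * block * sizeof(int));
        int *recvbuf = malloc((size_t)nprocs * block * sizeof(int));
        for (int i = 0; i < nprocs * block; i++)
            sendbuf[i] = rank;   /* fill the send buffer with this rank's id */

        /* Each rank contributes `block` ints to every other rank; after the
         * call, the i-th block of recvbuf holds the data sent by rank i. */
        MPI_Alltoall(sendbuf, block, MPI_INT,
                     recvbuf, block, MPI_INT, MPI_COMM_WORLD);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

When all ranks reside on one node, each of these nprocs * block transfers is a memory-to-memory copy, which is why the pattern is sensitive to how well the library's buffer accesses fit in the cache hierarchy.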