Graphics processing units (GPUs) have become ubiquitous in modern supercomputers due to their high compute capability and power efficiency. On these clusters, many large-scale message passing interface (MPI)-based GPU applications exchange data that is noncontiguous in memory. MPI provides derived datatypes (DDTs) that allow an application to describe any noncontiguous memory layout. Table 1 summarizes the access patterns and candidate datatypes of several HPC applications, underscoring the importance of optimizing such noncontiguous transfers.
Abstract:
The importance of graphics processing units (GPUs) in accelerating HPC applications is evident from the fact that a large number of supercomputing clusters are GPU enabled. Many of these HPC applications use the message passing interface (MPI) as their programming model and frequently exchange data that is noncontiguous in GPU memory. MPI provides derived datatypes (DDTs) to represent such data. Past research on DDTs has mainly focused on optimizing the pack/unpack kernels. Modern host channel adapters (HCAs) are capable of gathering/scattering data from/to noncontiguous GPU memory regions. We propose a low-overhead HCA-assisted scheme that improves the performance of GPU-based noncontiguous exchanges without GPU-based pack/unpack kernels. The proposed scheme provides up to a 2× benefit over the existing pack-based scheme at the benchmark level. Furthermore, we show up to a 17% improvement with the SW4Lite application compared to other MPI libraries, such as MVAPICH2-GDR and OpenMPI+UCX.
Published in: IEEE Micro (Volume 43, Issue 2, March-April 2023)