
Performance Characterization of Network Mechanisms for Non-Contiguous Data Transfers in MPI


Abstract:

Message Passing Interface (MPI) is a very popular parallel programming model for developing parallel scientific applications. The complexity of the data handled by scientific applications often results in its placement in non-contiguous locations in memory. To handle such complex and non-contiguous data, domain scientists often use user-defined datatypes that are supported by the MPI standard through Derived Data Types (DDT). Traditionally, popular implementations of the MPI standard have used simple schemes to “pack” and “unpack” non-contiguous data to and from contiguous memory regions before and after communication operations. On the other hand, vendors of high-performance interconnects have introduced several hardware-offloaded schemes to perform optimized transfers of non-contiguous data. Although researchers have attempted to characterize the performance of non-contiguous transfers in the past, they have not gone deep into the communication run-time to see where the bottlenecks lie, especially in the presence of network-offloaded support. In this paper, we take up this challenge and evaluate different designs for non-contiguous data transfers in a particular MPI run-time using our synthetic benchmarks. We consider the following designs: 1) pack-unpack based RDMA transfer, 2) User-mode Memory Registration (UMR) based RDMA transfer, 3) pipelined transfer, and 4) SGL-based transfer. For each of these designs, we measure the impact of a) serialization, b) memory registration, c) packing, and d) additional send overheads. From these evaluations, we identify why MPI run-times may not meet the performance expectations of DDTs and when to use DDT-based implementations.
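To make the traditional pack-unpack scheme evaluated here concrete, the following is a minimal C sketch; it is not taken from the paper, and the strided layout, the COUNT/STRIDE constants, and the function names are illustrative assumptions. The sender gathers the non-contiguous elements into a contiguous staging buffer before the send, and the receiver scatters them back afterwards.

/* Minimal sketch of the pack-unpack scheme for a strided layout
 * (names, counts, and the stride are illustrative assumptions). */
#include <mpi.h>
#include <stdlib.h>

#define COUNT  1024   /* number of non-contiguous elements */
#define STRIDE 4      /* distance (in doubles) between consecutive elements */

/* Sender: gather the strided elements into a contiguous staging buffer,
 * then send that buffer as an ordinary contiguous message. */
void pack_and_send(const double *src, int dest, MPI_Comm comm)
{
    double *staging = malloc(COUNT * sizeof(double));
    for (int i = 0; i < COUNT; i++)
        staging[i] = src[i * STRIDE];          /* "pack" */
    MPI_Send(staging, COUNT, MPI_DOUBLE, dest, 0, comm);
    free(staging);
}

/* Receiver: receive the contiguous message, then scatter ("unpack")
 * the elements back into the local non-contiguous locations. */
void recv_and_unpack(double *dst, int src_rank, MPI_Comm comm)
{
    double *staging = malloc(COUNT * sizeof(double));
    MPI_Recv(staging, COUNT, MPI_DOUBLE, src_rank, 0, comm, MPI_STATUS_IGNORE);
    for (int i = 0; i < COUNT; i++)
        dst[i * STRIDE] = staging[i];          /* "unpack" */
    free(staging);
}

The extra copy on each side is exactly the packing overhead this paper measures, alongside serialization, memory registration, and additional send overheads.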
Date of Conference: 18-22 May 2020
Date Added to IEEE Xplore: 28 July 2020
Conference Location: New Orleans, LA, USA

I. Introduction

The Message Passing Interface (MPI) [8] has remained the de facto programming model for developing large-scale parallel applications. The ranks (i.e., processes) in MPI applications communicate with each other to send/receive data to/from their peer ranks. Oftentimes, parallel applications such as NAS MG, SPEC3D, MILC, and others require transferring data that is non-contiguous in system memory (e.g., columns of a 2D matrix, faces of a 3D stencil, and more). MPI semantics provide high-level abstractions to represent non-contiguous data layouts in memory as user-defined datatypes, or Derived Data Types (DDTs). The programmer can create derived datatypes that describe their application's layouts and use these derived types in data transfers instead of primitive types (e.g., INT, DOUBLE). This relieves the application programmer of dealing with the non-contiguity but puts the onus on the communication run-time (e.g., the MPI library) to achieve efficient communication performance for these types. MPI libraries employ several approaches to handle derived-datatype transfers. The simplest approach involves the sender packing all the non-contiguous regions into a single contiguous buffer and sending it out, while the receiver unpacks the data into its local non-contiguous memory locations. The other approaches involve the use of hardware-assisted mechanisms such as Mellanox InfiniBand Scatter-Gather Lists (SGL) and User-mode Memory Registration (UMR), which offload transfers to the Network Interface Card (NIC).
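For illustration, the following is a minimal C sketch of the derived-datatype path described above; it is not from the paper, and the row-major N x N matrix, the value of N, and the function names are assumptions. One column of the matrix is described once with MPI_Type_vector, so a single send or receive of one element of the derived type covers the whole non-contiguous column.

/* Minimal sketch of the derived-datatype path: one column of a row-major
 * N x N matrix of doubles is described with MPI_Type_vector, and the MPI
 * library handles the non-contiguity (N and the names are assumptions). */
#include <mpi.h>

#define N 1024

void send_column(double matrix[N][N], int col, int dest, MPI_Comm comm)
{
    MPI_Datatype column_t;

    /* N blocks of 1 double, separated by a stride of N doubles (one row) */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column_t);
    MPI_Type_commit(&column_t);

    /* A single element of the derived type covers the whole column */
    MPI_Send(&matrix[0][col], 1, column_t, dest, 0, comm);

    MPI_Type_free(&column_t);
}

void recv_column(double matrix[N][N], int col, int src, MPI_Comm comm)
{
    MPI_Datatype column_t;

    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column_t);
    MPI_Type_commit(&column_t);

    MPI_Recv(&matrix[0][col], 1, column_t, src, 0, comm, MPI_STATUS_IGNORE);

    MPI_Type_free(&column_t);
}

Whether the MPI library services such a transfer by packing internally, pipelining it, or offloading it to the NIC through mechanisms such as SGL or UMR is precisely the design space this paper characterizes.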
