I. Introduction
The Message Passing Interface (MPI) [8] has remained the de facto programming model for developing large-scale parallel applications. The ranks (i.e., processes) in MPI applications communicate with each other to send data to and receive data from their peer ranks. Oftentimes, parallel applications such as NAS MG, SPEC3D, MILC, and others require transferring data that is non-contiguous in system memory (e.g., columns of a 2D matrix, faces of a 3D stencil, and more). MPI semantics provide high-level abstractions to represent non-contiguous data layouts in memory as user-defined datatypes, or Derived Data Types. The programmer can create derived datatypes that describe their application's layouts and use them in data transfers in place of primitive types (e.g., INT, DOUBLE). This relieves the application programmer of dealing with the non-contiguity but puts the onus on the communication runtime (e.g., the MPI library) to achieve efficient communication performance for these types.

MPI libraries employ several approaches to handle derived-datatype transfers. The simplest approach involves the sender packing all the non-contiguous regions into a single contiguous buffer and sending it out, while the receiver unpacks the data into its local non-contiguous memory locations. Other approaches use hardware-assisted mechanisms such as Mellanox InfiniBand Scatter-Gather Lists (SGL) and User-mode Memory Registration (UMR), which offload the transfers to the Network Interface Card (NIC).