1 Introduction
Emerging high-performance computing (HPC) systems are marked by two factors: 1) the use of accelerators such as general-purpose graphics processing units (GPGPUs) to boost their computing capabilities, and 2) the use of high-performance commodity interconnects such as InfiniBand (IB) to push the frontiers of performance and scalability. As a result, numerous HPC applications, runtimes, and frameworks are adopting the massively parallel computing power of GPUs [1], [2], [3], [4], [5], [6].