1. Introduction
The emergence of accelerators such as NVIDIA General Purpose Graphics Processing Units (GPGPUs, or GPUs for short) is changing the landscape of supercomputing systems. This trend is evident in the TOP500 list released in July 2015, where 90 systems make use of accelerator/co-processor technology [1]. GPUs, being PCIe devices, have their own memory space and require data to be transferred to their memory through specific mechanisms. The Compute Unified Device Architecture (CUDA) [2] API is the most popular programming framework available for users to take advantage of GPUs. It provides mechanisms to compute on the GPU, synchronize threads on the GPU, and move data between the CPU and the GPU. In addition to the generic CUDA APIs, auxiliary features such as GPUDirect help expedite data transfers to/from GPU memory. GPUDirect is a set of features that enables efficient data movement among GPUs as well as between GPUs and peer PCI Express (PCIe) devices. CUDA 5.0 introduced the GPUDirect RDMA (GDR) feature, which allows InfiniBand network adapters to directly read from or write to GPU device memory while completely bypassing the host [3]. This has the potential to yield significant performance benefits, especially given the multiple communication configurations that GPU devices expose. In these heterogeneous systems, data can be transferred Host-to-Host (H-H), Device-to-Device (D-D), Host-to-Device (H-D), and Device-to-Host (D-H). Further, each of these configurations can be either intra-node or inter-node.
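To make these configurations concrete, the following minimal sketch (buffer names and sizes are illustrative, not taken from this work) uses the standard CUDA runtime API to perform the four intra-node copy directions on a single node; the inter-node variants additionally involve the network adapter, for example via GPUDirect RDMA.

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const size_t nbytes = 1 << 20;               /* 1 MiB per buffer (illustrative size) */
    char *h_src, *h_dst, *d_src, *d_dst;

    h_src = (char *)malloc(nbytes);              /* host (CPU) buffers */
    h_dst = (char *)malloc(nbytes);
    cudaMalloc((void **)&d_src, nbytes);         /* device (GPU) buffers */
    cudaMalloc((void **)&d_dst, nbytes);

    /* The four intra-node transfer configurations discussed in the text: */
    cudaMemcpy(h_dst, h_src, nbytes, cudaMemcpyHostToHost);     /* H-H */
    cudaMemcpy(d_dst, h_src, nbytes, cudaMemcpyHostToDevice);   /* H-D */
    cudaMemcpy(h_dst, d_src, nbytes, cudaMemcpyDeviceToHost);   /* D-H */
    cudaMemcpy(d_dst, d_src, nbytes, cudaMemcpyDeviceToDevice); /* D-D */

    cudaDeviceSynchronize();                     /* ensure all copies have completed */

    cudaFree(d_src);
    cudaFree(d_dst);
    free(h_src);
    free(h_dst);
    return 0;
}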