1. Introduction
The emergence of accelerators such as NVIDIA General Purpose Graphics Processing Units (GPGPUs, or GPUs for short) is changing the landscape of supercomputing systems. This trend is evident in the TOP500 list released in July 2015, where 90 systems make use of accelerator/co-processor technology [1]. GPUs, being PCIe devices, have their own memory space and require data to be transferred to their memory through specific mechanisms. The Compute Unified Device Architecture (CUDA) [2] API is the most popular programming framework available for users to take advantage of GPUs. It provides mechanisms to compute on the GPU, synchronize threads on the GPU, and move data between the CPU and the GPU. In addition to the generic CUDA APIs, auxiliary features such as GPUDirect help expedite data transfers to/from GPU memory. GPUDirect is a set of features that enables efficient data movement among GPUs as well as between GPUs and peer PCI Express (PCIe) devices. CUDA 5.0 introduced the GPUDirect RDMA (GDR) feature, which allows InfiniBand network adapters to directly read from or write to GPU device memory while completely bypassing the host [3]. This has the potential to yield significant performance benefits, especially given the multiple communication configurations that GPU devices expose. In these heterogeneous systems, data can be transferred Host-to-Host (H-H), Device-to-Device (D-D), Host-to-Device (H-D), and Device-to-Host (D-H). Further, each of these configurations can be either intra-node or inter-node.
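To make these configurations concrete, the following minimal sketch (buffer names and sizes are illustrative, not taken from this work) uses the standard CUDA runtime API to perform the four intra-node copy directions on a single node; the inter-node variants additionally involve the network adapter, for example via GPUDirect RDMA.

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const size_t nbytes = 1 << 20;               /* 1 MiB per buffer (illustrative size) */
    char *h_src, *h_dst, *d_src, *d_dst;

    h_src = (char *)malloc(nbytes);              /* host (CPU) buffers */
    h_dst = (char *)malloc(nbytes);
    cudaMalloc((void **)&d_src, nbytes);         /* device (GPU) buffers */
    cudaMalloc((void **)&d_dst, nbytes);

    /* The four intra-node transfer configurations discussed in the text: */
    cudaMemcpy(h_dst, h_src, nbytes, cudaMemcpyHostToHost);     /* H-H */
    cudaMemcpy(d_dst, h_src, nbytes, cudaMemcpyHostToDevice);   /* H-D */
    cudaMemcpy(h_dst, d_src, nbytes, cudaMemcpyDeviceToHost);   /* D-H */
    cudaMemcpy(d_dst, d_src, nbytes, cudaMemcpyDeviceToDevice); /* D-D */

    cudaDeviceSynchronize();                     /* ensure all copies have completed */

    cudaFree(d_src);
    cudaFree(d_dst);
    free(h_src);
    free(h_dst);
    return 0;
}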