
A Novel Framework for Efficient Offloading of Communication Operations to Bluefield SmartNICs



Abstract:

Smart Network Interface Cards (SmartNICs) such as NVIDIA’s BlueField Data Processing Units (DPUs) provide advanced networking capabilities and processor cores, enabling the offload of complex operations away from the host. In the context of MPI, prior work has explored the use of DPUs to offload non-blocking collective operations. The limitations of current state-of-the-art approaches are twofold: They only work for a pre-defined set of algorithms/communication patterns and have degraded communication latency due to staging data between the DPU and the host. In this paper, we propose a framework that supports the offload of any communication pattern to the DPU while achieving low communication latency with perfect overlap. To achieve this, we first study the limitations of higher-level programming models such as MPI in expressing the offload of complex communication patterns to the DPU. We present a new set of APIs to alleviate these shortcomings and support any generic communication pattern. Then, we analyze the bottlenecks involved in offloading communication operations to the DPU and propose efficient designs for a few candidate communication patterns. To the best of our knowledge, this is the first framework providing both efficient and generic communication offload to the DPU. Our proposed framework outperforms state-of-the-art staging-based offload solutions by 47% in Alltoall micro-benchmarks, and at the application level, we see improvements up to 60% in P3DFFT and 15% in HPL on 512 processes.
Date of Conference: 15-19 May 2023
Date Added to IEEE Xplore: 18 July 2023

Conference Location: St. Petersburg, FL, USA

I. Introduction

Modern CPU-based High-Performance Computing (HPC) clusters employ powerful processors with high core counts alongside high-bandwidth, low-latency network interface cards (NICs) and switches. The emergence of network hardware such as NVIDIA’s ConnectX-7 NICs [1] and Quantum-2 switches [2], capable of 400 Gbps per port, and of 128+ core AMD EPYC CPUs [3] indicates a trend toward supporting massive amounts of compute and network parallelism for AI/HPC workloads. The onus is on communication libraries and applications to utilize these platforms efficiently. A popular strategy toward this goal is to offload communication operations to another thread or hardware resource so that they can be overlapped with compute operations; the sketch below illustrates the conventional host-driven form of this overlap. The idea of "offloading" a communication pattern can be viewed from two perspectives: 1) the set of APIs that define how to orchestrate the offload of these patterns, and 2) the underlying mechanisms used to efficiently offload communication operations.
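For reference, the following is a minimal sketch of the host-driven overlap pattern that offload mechanisms aim to improve upon: a standard non-blocking MPI collective (MPI_Ialltoall) posted by the host and overlapped with independent computation. The buffer sizes and the dummy compute loop are illustrative assumptions, not taken from the paper; without thread or hardware offload, the degree of overlap actually achieved depends on the MPI library's progress engine.

    /*
     * Sketch: overlap a non-blocking Alltoall with independent host
     * computation. Sizes and the compute loop are placeholders.
     */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int count = 1024;  /* elements exchanged per peer */
        double *sendbuf = malloc((size_t)count * size * sizeof(double));
        double *recvbuf = malloc((size_t)count * size * sizeof(double));
        double *work    = malloc((size_t)count * sizeof(double));

        for (int i = 0; i < count * size; i++)
            sendbuf[i] = rank + i;

        /* Post the collective; ideally it progresses while the host
           computes on unrelated data. */
        MPI_Request req;
        MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                      recvbuf, count, MPI_DOUBLE,
                      MPI_COMM_WORLD, &req);

        /* Independent computation overlapped with the collective. */
        for (int i = 0; i < count; i++)
            work[i] = (double)i * 0.5 + rank;

        /* Complete the collective before using recvbuf. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        free(sendbuf);
        free(recvbuf);
        free(work);
        MPI_Finalize();
        return 0;
    }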
