I. Introduction
Modern CPU-based High-Performance Computing (HPC) clusters employ powerful processors with high core counts, together with high-bandwidth, low-latency network interface cards (NICs) and switches. The emergence of network hardware such as NVIDIA's ConnectX-7 NICs [1] and Quantum-2 switches [2], capable of 400 Gbps per port, and 128+ core AMD EPYC CPUs [3] indicates a trend toward supporting massive amounts of compute and network parallelism for AI/HPC workloads. The onus is on communication libraries and applications to efficiently utilize these platforms. A popular strategy to achieve this goal is to offload communication operations to a separate thread or hardware resource so that they overlap with compute operations. The idea of "offloading" a communication pattern can be viewed from two perspectives: 1) the set of APIs that define how to orchestrate the offload of such patterns, and 2) the underlying mechanisms used to efficiently offload communication operations.
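To make the overlap idea concrete, the following minimal sketch (not taken from any particular library in this work) uses standard MPI nonblocking collectives: the application posts the operation, performs independent computation while the library or NIC progresses the communication, and only then waits for completion. The buffer sizes and the compute kernel are illustrative placeholders.

```c
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 20)  /* illustrative message size */

/* Placeholder kernel that does not touch the communication buffers,
 * so it can proceed while the allreduce is in flight. */
static void independent_compute(double *work, int n) {
    for (int i = 0; i < n; i++)
        work[i] = work[i] * 2.0 + 1.0;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    double *sendbuf = malloc(N * sizeof(double));
    double *recvbuf = malloc(N * sizeof(double));
    double *work    = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) { sendbuf[i] = 1.0; work[i] = 1.0; }

    MPI_Request req;
    /* 1) Post the collective; the MPI library may progress it
     *    asynchronously, e.g., via a progress thread or NIC offload. */
    MPI_Iallreduce(sendbuf, recvbuf, N, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* 2) Overlap: computation that does not depend on recvbuf. */
    independent_compute(work, N);

    /* 3) Complete the collective before using its result. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(sendbuf); free(recvbuf); free(work);
    MPI_Finalize();
    return 0;
}
```

How much overlap is actually achieved depends on the second perspective above, i.e., whether the library relies on software progress (progress threads, opportunistic polling) or hardware mechanisms in the NIC/switch to advance the operation while the host computes.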