I. Introduction
Contemporary high-performance clusters are equipped with powerful CPUs with a high processor count per node. An AMD system featuring Milan or Rome architectures supports up to 128 cores per node, while the Intel Ice Lake architecture offers a maximum of 80 cores. Such nodes are then connected together with high-performance networks such as InfiniBand, Omni-Path, and RoCE. These networks distinguish themselves from traditional Ethernet with Remote Direct Memory Access (RDMA) in providing high-performance zero-copy and kernel bypass message transfers. To fully exploit supercomputers, an efficient and robust programming model is required to catch up with the trend of such systems in the number of core counts per node, network speed, and system size.