Kaushik Kandadi Suresh - IEEE Xplore Author Profile

Showing 1-13 of 13 results

Results

Modern SmartNICs are capable of performing both computation and communication operations. In this context, past work on accelerating HPC/DL applications has manually selected certain computational phases to offload to the SmartNICs. In this work, we identify Vector Multiply-Adds (VMA), Distributed Dot Products (DDOT), and Sparse Matrix-Vector Multiplication (Matvec) as three fundamental op...
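For context, a minimal host-side sketch of the DDOT pattern named above: each rank computes a partial dot product over its local slice and the partials are combined with MPI_Allreduce. This is an illustrative baseline, not the SmartNIC-offloaded design described in the paper.

/* Distributed dot product (DDOT): local multiply-add loop followed by a
 * global sum. Illustrative baseline only, not the offloaded design. */
#include <mpi.h>

double ddot(const double *x, const double *y, int n, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n; i++)          /* local partial dot product */
        local += x[i] * y[i];
    /* combine partial sums across all ranks */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}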
One-sided communication is one of several approaches for data transfer in High-Performance Computing (HPC) applications. One-sided operations place less demand on parallel programming libraries and do not require HPC hardware to issue acknowledgments of successful data transfer. Thanks to its inherently non-blocking nature, one-sided communication is also useful for improving overlap between...
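For context, a minimal sketch of MPI one-sided communication: rank 0 writes directly into rank 1's exposed memory window with MPI_Put inside a fence epoch, and the target posts no matching receive. The window size and payload are illustrative.

/* One-sided transfer: rank 0 puts a value into rank 1's window. */
#include <mpi.h>

void put_example(MPI_Comm comm)
{
    int rank;
    double buf[8] = {0};
    MPI_Win win;

    MPI_Comm_rank(comm, &rank);
    /* every rank exposes buf as a remotely accessible memory window */
    MPI_Win_create(buf, sizeof(buf), sizeof(double),
                   MPI_INFO_NULL, comm, &win);

    MPI_Win_fence(0, win);               /* open the access epoch */
    if (rank == 0) {
        double payload = 3.14;
        /* one-sided, non-blocking write into rank 1's window at offset 0 */
        MPI_Put(&payload, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);               /* close the epoch: put is complete */

    MPI_Win_free(&win);
}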
Modern multi/many-core processors in HPC systems have hundreds of cores with deep memory hierarchies. HPC applications running at high core counts often experience contention between processes/threads on shared resources such as caches, leading to degraded performance. This is especially true for dense collective patterns, such as MPI_Alltoall, that have many concurrent memory transactions. The orderi...
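For context, a minimal sketch of the dense MPI_Alltoall exchange referred to above: every rank sends one block to, and receives one block from, every other rank. Any contention-aware ordering of the underlying memory transactions happens inside the MPI library, not in application code; buffer sizes and contents here are illustrative.

/* Dense all-to-all exchange of 'block' ints per rank pair. */
#include <mpi.h>
#include <stdlib.h>

void alltoall_example(MPI_Comm comm, int block)
{
    int size;
    MPI_Comm_size(comm, &size);

    int *sendbuf = malloc((size_t)size * block * sizeof(int));
    int *recvbuf = malloc((size_t)size * block * sizeof(int));
    for (int i = 0; i < size * block; i++)
        sendbuf[i] = i;                  /* arbitrary payload */

    MPI_Alltoall(sendbuf, block, MPI_INT,
                 recvbuf, block, MPI_INT, comm);

    free(sendbuf);
    free(recvbuf);
}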
Over the past several years, Smart Network Interface Cards (SmartNICs) have rapidly grown in popularity. In particular, NVIDIA’s BlueField line of SmartNICs has proven effective across a wide variety of uses: offloading communication in High-Performance Computing (HPC) applications, accelerating stages of the Deep Learning (DL) pipeline, and serving the Datacenter/virtualization workloads for which it was especially designed. Th...
The Message-Passing Interface (MPI) provides convenient abstractions such as MPI_Allreduce for inter-process collective reduction operations. With the advent of deep learning and large-scale HPC systems, it is increasingly important to optimize the latency of the MPI_Allreduce operation for large messages. Due to the amount of compute and communication involved in MPI_Allreduce, it is beneficial to off...
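For context, a minimal sketch of the large-message MPI_Allreduce targeted here: an in-place sum over a multi-megabyte buffer of the kind produced by deep-learning gradient exchanges. The buffer size and contents are illustrative.

/* Large-message in-place allreduce: each rank's buffer is both input
 * and output. */
#include <mpi.h>
#include <stdlib.h>

void large_allreduce(MPI_Comm comm)
{
    const int count = 1 << 22;           /* ~4M doubles, roughly 32 MB */
    double *buf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++)
        buf[i] = 1.0;                    /* placeholder contribution */

    MPI_Allreduce(MPI_IN_PLACE, buf, count, MPI_DOUBLE, MPI_SUM, comm);

    free(buf);
}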
Many High-Performance Computing (HPC) clusters around the world use some variation of InfiniBand interconnects, all of which are driven by the “Verbs” API. Verbs provides a fast, efficient, and developer-friendly way of passing data buffers between nodes over the interconnect(s). In recent years, the MLX5-DV (Direct Verbs) API has emerged as a method of providing mechanism...
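For context, a minimal Verbs sketch assuming an already connected queue pair: register a buffer, post a send work request, and poll the completion queue. The protection domain, QP, and CQ setup (and the out-of-band connection exchange) that real Verbs programs require are omitted.

/* Register a buffer and post a send on a pre-connected QP. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_send(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_cq *cq,
              char *buf, size_t len)
{
    /* register the buffer so the HCA can access it */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;   /* request a completion entry */

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* busy-poll the completion queue for the send completion */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    ibv_dereg_mr(mr);
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}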
Smart Network Interface Cards (SmartNICs) such as NVIDIA’s BlueField Data Processing Units (DPUs) provide advanced networking capabilities and processor cores, enabling the offload of complex operations away from the host. In the context of MPI, prior work has explored the use of DPUs to offload non-blocking collective operations. The limitations of current state-of-the-art approaches are twofold:...
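For context, a minimal sketch of the non-blocking collective pattern that such DPU offload targets: start the reduction, overlap it with independent host compute, then complete it. The offload itself is internal to the MPI library and is not visible at the application level; the compute stub is a placeholder.

/* Overlap a non-blocking allreduce with independent host work. */
#include <mpi.h>

static void do_independent_compute(void)
{
    /* placeholder for application work that overlaps the collective */
}

void overlap_example(double *buf, int count, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Iallreduce(MPI_IN_PLACE, buf, count, MPI_DOUBLE, MPI_SUM,
                   comm, &req);

    do_independent_compute();

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* reduction result now in buf */
}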
The importance of graphics processing units (GPUs) in accelerating HPC applications is evident from the fact that a large number of supercomputing clusters are GPU-enabled. Many of these HPC applications use the Message Passing Interface (MPI) as their programming model. These MPI applications frequently exchange data that is non-contiguous in GPU memory. MPI provides derived datatypes (DDTs) to represen...
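For context, a minimal sketch of describing non-contiguous data with an MPI derived datatype: a strided matrix column expressed with MPI_Type_vector and sent in a single call. The strided-column layout is an illustrative case, not the specific layouts studied in the paper; GPU-resident buffers follow the same pattern with a GPU-aware MPI library.

/* Send one column of a row-major rows x cols matrix as a DDT. */
#include <mpi.h>

void send_column(const double *matrix, int rows, int cols,
                 int col, int dest, MPI_Comm comm)
{
    MPI_Datatype column;
    /* rows blocks of 1 double, separated by a stride of cols doubles */
    MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    MPI_Send(&matrix[col], 1, column, dest, 0, comm);

    MPI_Type_free(&column);
}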
Graphics Processing Units (GPUs) have become ubiquitous in today’s supercomputing clusters, primarily because of their high compute capability and power efficiency. Message Passing Interface (MPI) is a widely adopted programming model for large-scale GPU-based applications used in such clusters. Modern GPU-based systems have multiple Host Channel Adapters (HCAs). In the past, scientists have leveraged multi-HCA systems to ...
The importance of GPUs in accelerating HPC applications is evident from the fact that a large number of supercomputing clusters are GPU-enabled. Many of these HPC applications use MPI as their programming model. These MPI applications often exchange data that is non-contiguous in GPU memory. MPI provides Derived Datatypes (DDTs) to represent such data. In the past, researchers have proposed sol...
Modern MPI-based scientific applications frequently use derived datatypes (DDTs) for inter-process communication. Designing scalable solutions capable of dynamically adapting to the complex communication requirements posed by DDT-based applications brings forth several new challenges. In this work, we address these challenges and propose solutions to efficiently improve the performance of...
The Message-Passing Interface (MPI) is the de-facto standard for designing and executing applications on massively parallel hardware. MPI collectives provide a convenient abstraction for multiple processes/threads to communicate with one another. Mellanox's HDR InfiniBand switches provide Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) capabilities to offload collective communica...
Message Passing Interface (MPI) is a very popular parallel programming model for developing parallel scientific applications. The complexity of the data handled by scientific applications often results in its placement in non-contiguous locations in memory. To handle such complex, non-contiguous data, domain scientists often use the user-defined datatypes supported by the MPI standa...
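For context, a minimal sketch of a user-defined MPI datatype for irregularly placed data: MPI_Type_indexed describes variable-length blocks at arbitrary displacements so they can be communicated in one call without manual packing. The block lengths and displacements below are purely illustrative.

/* Describe and send three irregularly placed blocks of doubles. */
#include <mpi.h>

void send_indexed(const double *base, int dest, MPI_Comm comm)
{
    /* blocks of 2, 1, and 4 doubles at displacements 0, 5, and 9 */
    int blocklens[3] = {2, 1, 4};
    int displs[3]    = {0, 5, 9};

    MPI_Datatype irregular;
    MPI_Type_indexed(3, blocklens, displs, MPI_DOUBLE, &irregular);
    MPI_Type_commit(&irregular);

    MPI_Send(base, 1, irregular, dest, 0, comm);

    MPI_Type_free(&irregular);
}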