Dhabaleswar K. (DK) Panda - IEEE Xplore Author Profile

Showing 1-25 of 399 results


Scaling up Large Language Model (LLM) training involves fitting a tremendous number of training parameters across a limited number of workers. However, methods like ZeRO-3 that drastically reduce GPU memory pressure often incur heavy communication to ensure global synchronization and consistency. Established efforts such as ZeRO++ use secondary partitions to avoid inter-node communications, given t...
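The key idea in ZeRO-3-style sharding can be illustrated with a minimal sketch: each worker retains only a slice of the flat parameter vector, and the full parameters are re-assembled (in practice via an allgather collective) when needed. The helper names below are hypothetical, not the ZeRO-3 API.

```python
# Minimal sketch of ZeRO-3-style parameter sharding (illustrative only;
# `shard_params` and `gather_params` are hypothetical names).
import numpy as np

def shard_params(params: np.ndarray, num_workers: int, rank: int) -> np.ndarray:
    """Each worker keeps only a 1/num_workers slice of the flat parameters."""
    shards = np.array_split(params, num_workers)
    return shards[rank]

def gather_params(shards: list) -> np.ndarray:
    """Before a forward/backward pass, shards are re-assembled into full
    parameters. Real ZeRO-3 uses an allgather; here we just concatenate."""
    return np.concatenate(shards)

params = np.arange(12, dtype=np.float32)              # stand-in model parameters
world = [shard_params(params, 4, r) for r in range(4)]
assert np.array_equal(gather_params(world), params)   # round-trips losslessly
```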
Hyperparameter Optimization (HPO) can unlock the full potential of Deep Learning (DL) models; however, it is considered one of the most compute-intensive tasks in the DL domain due to multi-dimensional search spaces and complex neural network architectures. A common method for accelerating HPO workloads is parallelizing training jobs on multiple computing devices, such as modern GPUs in High-Perf...
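As a rough illustration of that parallel-trials approach, the sketch below fans a hypothetical grid of hyperparameter configurations out over a process pool; the objective function is a stand-in for a real training job on a device.

```python
# Hedged sketch: parallelizing HPO trials across workers with a process pool.
from multiprocessing import Pool
from itertools import product

def objective(config):
    lr, batch_size = config
    # Placeholder "validation loss"; a real trial would train a model here.
    return (lr - 0.01) ** 2 + 1.0 / batch_size, config

if __name__ == "__main__":
    grid = list(product([0.001, 0.01, 0.1], [32, 64, 128]))
    with Pool(processes=4) as pool:            # typically one worker per GPU
        results = pool.map(objective, grid)    # trials run concurrently
    best_loss, best_config = min(results)
    print(f"best config: lr={best_config[0]}, batch={best_config[1]}")
```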
Modern SmartNICs are capable of performing both computation and communication operations. In this context, past works on accelerating HPC/DL applications have manually selected certain computational phases to offload to the SmartNICs. In this work, we identify Vector Multiply-Adds (VMA), Distributed Dot Products (DDOT), and Sparse Matrix-Vector Multiplication (Matvec) as three fundamental op...
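Of the three primitives, DDOT is the simplest to picture. Below is a minimal host-side sketch using mpi4py; the offloaded variant would perform the same partial-product-plus-reduction on the SmartNIC cores instead of the host.

```python
# Sketch of a Distributed Dot Product (DDOT): each rank computes a partial
# dot product over its slice, then the partials are summed with allreduce.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_local = 1 << 20
x = np.random.rand(n_local)
y = np.random.rand(n_local)

local = np.dot(x, y)                             # local partial dot product
global_dot = comm.allreduce(local, op=MPI.SUM)   # combine partials across ranks
if rank == 0:
    print(f"global dot product across {size} ranks: {global_dot:.4f}")
```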
The demand for computing power in high-performance computing and deep learning applications is steadily increasing, leading to a noticeable inclination toward equipping modern exascale clusters with accelerators. In particular, distributed Deep Learning training necessitates high-performance GPU-aware MPI operations, with reduction operations being widely employed. Unlike data movement-based M...
One-sided communication is one of several approaches to data transfer in High-Performance Computing (HPC) applications. One-sided operations place fewer demands on parallel programming libraries and do not require HPC hardware to issue acknowledgments of successful data transfer. Thanks to its inherently non-blocking nature, one-sided communication is also useful for improving overlap between...
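A minimal sketch of the one-sided model, assuming mpi4py's RMA bindings: rank 0 writes directly into a memory window exposed by rank 1, with no matching receive posted on the target.

```python
# One-sided MPI RMA with mpi4py: a Put into a remote window, synchronized
# with fences (an active-target epoch).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = np.zeros(4, dtype=np.float64)
win = MPI.Win.Create(buf, comm=comm)       # expose local buffer for RMA

win.Fence()                                # open access epoch
if rank == 0 and comm.Get_size() > 1:
    payload = np.arange(4, dtype=np.float64)
    win.Put(payload, target_rank=1)        # one-sided write into rank 1's buffer
win.Fence()                                # close epoch; data now visible

if rank == 1:
    print("rank 1 received:", buf)
win.Free()
```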
In Artificial Intelligence (AI) and high-performance computing (HPC), growing data and model sizes require distributed processing across multiple nodes due to single-node limitations, increasing inter-node communication. To address these challenges, we propose a novel MPI allgather method leveraging CXL technology, which supports composable architectures and dynamic resource allocation in data cen...
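For reference, the baseline allgather pattern the method targets looks like this with mpi4py; the proposed design would service the same collective through CXL-attached memory rather than network transfers.

```python
# MPI Allgather: every rank contributes one block and receives all blocks.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

block = np.full(4, rank, dtype=np.int32)     # this rank's contribution
out = np.empty(4 * size, dtype=np.int32)     # room for every rank's block
comm.Allgather(block, out)

if rank == 0:
    print("gathered:", out)                  # blocks from ranks 0..size-1
```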
Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) are the three strategies widely adopted to enable fast and efficient Large Language Model (LLM) training. However, these approaches rely on data-intensive communication routines to collect, aggregate, and re-distribute gradients, activations, and other important model information, which pose significant overhead. Co-desi...
Parameter-efficient Fine-tuning (PEFT) methods have emerged as powerful techniques for adapting pre-trained Large Language Models (LLMs) to specific tasks with reduced computational and memory overhead. However, despite their promising potential, there remains a gap in understanding how these methods perform in distributed computing settings. In this paper, we present a comprehensive characterizat...
Deep learning (DL) models based on the transformer architecture have revolutionized many DL applications such as large language models (LLMs), vision transformers, audio generation, and time series prediction. Much of this progress has been fueled by distributed training, yet distributed communication remains a substantial bottleneck to training progress. This paper examines the communication beha...
The arrival of exascale computers has pushed the boundary of computing capability, posing performance challenges for parallel programming models in exploiting such systems efficiently. A dominant programming model for running parallel programs is the Message Passing Interface (MPI). Among the primitives provided by MPI, Alltoall is a communication-intensive operation utilized by many ap...
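The Alltoall pattern itself is easy to state: every rank sends a distinct block to every other rank, a personalized exchange that stresses the network at scale. A small mpi4py sketch:

```python
# MPI Alltoall: rank r sends send[i] to rank i and receives one block from each.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# send[i] is the block destined for rank i
send = np.array([rank * 100 + i for i in range(size)], dtype=np.int32)
recv = np.empty(size, dtype=np.int32)
comm.Alltoall(send, recv)

# recv[i] now holds the block that rank i addressed to this rank
print(f"rank {rank} received {recv}")
```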
The Message Passing Interface (MPI) is a common parallel programming model in High-Performance Computing. Recently, it has also been widely used in Artificial Intelligence (AI) applications. However, the performance of those applications is limited by the memory wall problem, the performance gap between the processor and memory. To address this problem, we propose a novel computing archit...
The Message Passing Interface is the de facto standard in high-performance computing (HPC) for inter-process communication. MPI libraries employ numerous algorithms for each collective communication pattern, whose behavior is largely affected by the underlying hardware, communication pattern, message size, and number of processes involved. Choosing the “best” algorithm for every possible scenario i...
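The selection problem can be pictured as a rule table keyed on message size and process count. The thresholds and algorithm names below are invented for illustration; production MPI libraries tune such tables per system.

```python
# Hedged sketch of rule-based collective-algorithm selection.
def select_allreduce_algorithm(msg_size: int, num_procs: int) -> str:
    if msg_size <= 4096:
        return "recursive-doubling"      # latency-bound small messages
    if num_procs >= 64 and msg_size >= 1 << 20:
        return "ring"                    # bandwidth-bound large messages
    return "recursive-halving-doubling"  # reasonable middle ground

for size, procs in [(1024, 16), (1 << 22, 128), (65536, 32)]:
    print(size, procs, "->", select_allreduce_algorithm(size, procs))
```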
In the realm of large language models (LLMs) like the Generative Pre-trained Transformer (GPT), the Mixture of Experts (MoE) paradigm has emerged as a powerful technique for enhancing model expressiveness and accuracy. However, the deployment of GPT MoE models for parallel inference on distributed systems presents significant challenges, primarily due to the extensive Alltoall communication requir...Show More
Modern multi-/many-core processors in HPC systems have hundreds of cores with deep memory hierarchies. HPC applications running at high core counts often experience contention between processes/threads on shared resources such as caches, leading to degraded performance. This is especially true for dense collective patterns, such as MPI_Alltoall, that have many concurrent memory transactions. The orderi...
With the increasing scale of High-Performance Computing (HPC) and Deep Learning (DL) applications through GPU adoption, the seamless communication of data stored on GPUs has become a critical factor in enhancing overall application performance. AllReduce is a collective communication operation that is commonly used in HPC applications and distributed DL training, especially Data Parallelism. Dat...
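The gradient-averaging step that makes AllReduce central to Data Parallelism can be sketched as follows with mpi4py (host-staged here; a GPU-aware MPI would pass device buffers directly, avoiding the copies implied by this version):

```python
# AllReduce in data-parallel training: average local gradients across ranks.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()

grads = np.random.rand(1024).astype(np.float32)   # local gradient buffer
comm.Allreduce(MPI.IN_PLACE, grads, op=MPI.SUM)   # sum gradients across ranks
grads /= size                                     # turn the sum into an average
```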
Autoregressive models, despite their commendable performance in a myriad of generative tasks, face challenges stemming from their inherently sequential structure. Inference on these models is, by design, subject to a temporal dependency, where the current token's probability distribution is conditioned on the preceding tokens. This inherent characteristic severely impedes computational efficiency during i...
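That temporal dependency is visible in a bare-bones decode loop: each step must wait for the token produced by the previous one. The model below is a dummy next-token predictor, not any specific library API.

```python
# Why autoregressive inference is sequential: step t conditions on tokens 0..t-1.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 100

def model(tokens):
    """Dummy next-token distribution conditioned on the prefix length."""
    logits = rng.random(VOCAB) + 0.01 * len(tokens)
    return np.exp(logits) / np.exp(logits).sum()

tokens = [1]                                 # begin-of-sequence token
for _ in range(8):
    probs = model(tokens)                    # depends on the whole prefix
    tokens.append(int(np.argmax(probs)))     # greedy decode, one token per step
print(tokens)
```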
In modern multi-/many-core HPC systems, the increasing number of processor cores presents new challenges in managing parallel compute workloads across multiple nodes. One crucial aspect that significantly impacts the startup phase of parallel MPI jobs is the methodology used for connection establishment. In this paper, we investigate the limitations of existing all-to-all connection establishment ...
Quantization is a popular technique used in Deep Neural Network (DNN) inference to reduce the size of models and improve overall numerical performance by exploiting native hardware. This paper conducts a detailed performance characterization of the benefits of using quantization techniques, mainly FP16/INT8 variants with static and dynamic schemes, using the MLPerf Edge Inference b...
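As a concrete example of the dynamic INT8 scheme, the sketch below applies PyTorch's quantize_dynamic to a toy model; this mirrors the kind of setup such a characterization covers, not the paper's exact configuration.

```python
# Dynamic INT8 quantization of Linear layers with PyTorch (CPU-side sketch).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8    # quantize Linear weights to INT8
)

x = torch.randn(1, 128)
with torch.no_grad():
    print(qmodel(x).shape)                   # same interface, smaller weights
```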
The MPI4Spark effort reconciled disparities between High-Performance Computing (HPC) environments and Big Data stacks by adopting an MPI-based solution inside Apache Spark's Netty communication layer that was capable of better utilizing high-speed interconnects, such as InfiniBand (IB), Intel Omni-Path (OPA), and HPE Slingshot, across a variety of HPC systems. Apache...
Supervised Deep Learning (DL) thrives on Big Data; however, it inherits a major limitation: training and testing datasets must be fully annotated to train Deep Neural Networks (DNNs). To mitigate this bottleneck, we propose HARVEST, a distributed computer-vision framework that employs state-of-the-art semi-supervised learning (SSL) algorithms to train accurate DNNs using Distributed Data Parallelism...
Over the past several years, Smart Network Interface Cards (SmartNICs) have rapidly grown in popularity. In particular, NVIDIA's BlueField line of SmartNICs has proven effective in a wide variety of uses: offloading communication in High-Performance Computing (HPC) applications, accelerating various stages of the Deep Learning (DL) pipeline, and serving the Datacenter/virtualization workloads it was especially designed for. Th...
The Message-Passing Interface (MPI) provides convenient abstractions such as MPI_Allreduce for inter-process collective reduction operations. With the advent of deep learning and large-scale HPC systems, it is increasingly important to optimize the latency of the MPI_Allreduce operation for large messages. Due to the amount of compute and communication involved in MPI_Allreduce, it is beneficial to off...
Many High-Performance Computing (HPC) clusters around the world use some variation of InfiniBand interconnects, all of which are powered by the “Verbs” API. Verbs supplies a quick, efficient, and developer-friendly method of passing data buffers between nodes through their interconnect(s). In recent years, the MLX5-DV (Direct Verbs) API has emerged as a way of providing mechanism...
In this paper, we propose Scalable Meta-Parallelism for Deep Learning Search (ScaMP): a distributed Hyperparameter Optimization (HPO) and Neural Architecture Search (NAS) framework that supports out-of-core models with flexible parallelism schemes. ScaMP is integrated into the modern DL ecosystem and enables both efficient parallel training of concurrent candidate architectures and aggregate devi...
In recent years, the training requirements of many state-of-the-art Deep Learning (DL) models have scaled beyond the compute and memory capabilities of a single processor, necessitating distribution among multiple processors. Training such massive models requires advanced parallelism strategies [1], [2] to maintain efficiency. However, such distributed DL parallelism strategies require a varied mixt...