Mohamed Wahib - IEEE Xplore Author Profile

Showing 1-25 of 42 results

Results

We introduce CSI-Inpainter, a novel approach for obstacle removal using Wi-Fi channel state information (CSI). The method harnesses CSI data to reconstruct obscured visual elements, regardless of lighting conditions. Extensive empirical evaluation in both office and industrial settings demonstrates CSI-Inpainter's ability to identify and reconstruct occluded segments …
This paper presents an open-source library that pushes the limits of performance portability for irregular General Matrix Multiplication (GEMM) on the widely used Arm architectures. Our library, autoGEMM, is designed to support a wide range of Arm processors, from edge devices to HPC-grade CPUs. autoGEMM generates optimized kernels for various hardware configurations by auto-combining fragments of …
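To make the idea concrete, here is a minimal sketch of the cache-blocked GEMM pattern that a kernel generator of this kind specializes per hardware target; the tile sizes and code are illustrative assumptions, not autoGEMM's actual output:

```python
# Sketch of cache-blocked GEMM, the kind of kernel a generator specializes
# per target; block sizes mb/nb/kb are made-up tuning parameters.
import numpy as np

def blocked_gemm(A, B, mb=64, nb=64, kb=64):
    """C = A @ B computed over mb x nb x kb tiles."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, mb):
        for j in range(0, N, nb):
            for k in range(0, K, kb):
                # Inner tile multiply; a real kernel would use SIMD here.
                C[i:i+mb, j:j+nb] += A[i:i+mb, k:k+kb] @ B[k:k+kb, j:j+nb]
    return C

A = np.random.rand(128, 96).astype(np.float32)
B = np.random.rand(96, 80).astype(np.float32)
assert np.allclose(blocked_gemm(A, B), A @ B, atol=1e-3)
```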
Attention-based models are proliferating in the space of image analytics, including segmentation. The standard method of feeding images to transformer encoders is to divide the images into patches and then feed the patches to the model as a linear sequence of tokens. For high-resolution images, e.g., microscopic pathology images, the quadratic compute and memory cost prohibits the use of an attention…
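The patch-tokenization step the abstract describes is simple to sketch; the 16x16 patch size below is the common choice, not necessarily what this paper uses:

```python
# Standard patch tokenization: an H x W x C image is cut into P x P patches,
# each flattened into one token of the transformer's input sequence.
import numpy as np

def image_to_tokens(img, patch=16):
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (num_patches, P*P*C)
    t = img.reshape(H // patch, patch, W // patch, patch, C)
    t = t.transpose(0, 2, 1, 3, 4)
    return t.reshape(-1, patch * patch * C)

img = np.random.rand(224, 224, 3)
tokens = image_to_tokens(img)   # (196, 768)
print(tokens.shape)             # sequence length grows as (H*W)/P^2, which is
                                # what makes attention quadratic cost prohibitive
```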
GPUDirect Storage, a novel tool provided by Nvidia, facilitates better utilization of GPU I/O by avoiding extra copies through a bounce buffer in the CPU host memory and enabling direct memory access. This technology offers significant advantages, particularly its high throughput capabilities and low latency. However, it also presents challenges in implementation due to strict layout requirements. …
Graph Convolutional Networks (GCNs) are widely used in various domains. However, training distributed full-batch GCNs on large-scale graphs poses challenges due to high communication overhead. This work presents a hybrid pre-post-aggregation approach and an integer quantization method to reduce communication costs. With these techniques, we develop a scalable distributed GCN training framework, Su…
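A minimal sketch of the integer-quantization idea for shrinking communicated features; the symmetric int8 scheme and names here are my own assumptions, not necessarily the paper's method:

```python
# Quantize node features to int8 before communication, dequantize on arrival.
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: float32 -> int8 plus one fp32 scale.
    scale = np.abs(x).max() / 127.0 if x.size else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

feats = np.random.randn(1024, 256).astype(np.float32)  # boundary-node features
q, s = quantize_int8(feats)                             # 4x smaller payload
restored = dequantize(q, s)
print("max abs error:", np.abs(feats - restored).max())
```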
Overfitting is a well-documented and studied issue in supervised learning. Human experts have designed methods to reduce overfitting by observing validation behavior, e.g., learning rate schedules, dropout, and adversarial training. We propose a validation-loss landscape exploration/exploitation method called VKI (Validation Knowledge Inheritance). We reformulate the traditional gradient …
Neural architecture search (NAS) automates the design of neural networks, but faces high computational costs for evaluating the performance of candidate architectures. Surrogate-assisted NAS methods use approximate computational models to obtain predictive estimates in place of full training runs, but also face the challenge of maintaining the balance between training cost and predictive effectiveness…
Neural architecture search (NAS) is an effective approach for automating the design of deep neural networks. Evolutionary computation (EC) is commonly used in NAS due to its global optimization capability. However, the evaluation phase of architecture candidates in EC-based NAS is compute-intensive, limiting its application to many real-world problems. To overcome this challenge, we propose a novel …
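A toy sketch of surrogate-assisted evaluation, assuming a linear least-squares surrogate as a stand-in for whatever predictor the paper actually uses:

```python
# A cheap model trained on (architecture encoding -> measured accuracy) pairs
# ranks new candidates so only the most promising ones get full training.
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.random((50, 8))        # encodings of 50 fully trained archs
y_train = X_train @ rng.random(8)    # stand-in for their measured accuracies

# Linear least-squares surrogate (real systems use GPs, RFs, or neural predictors).
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

candidates = rng.random((500, 8))    # a new EC population, untrained
predicted = candidates @ w
top_k = np.argsort(predicted)[-10:]  # only these 10 get real training
print("candidates promoted to full evaluation:", top_k)
```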
This paper proposes a method called Meta Generative Data Augmentation Optimization (MGDAO) to overcome limitations in existing data augmentation techniques. While traditional data augmentation methods have relied on expert intuition to determine effective transformations, recent approaches have attempted to generate data augmentation strategies automatically. However, these automatic methods can …
When training neural networks, the weights of the model are updated at each optimization step, and the older weights are discarded. In this paper, we propose a method called Training Knowledge Inheritance (TKI) that uses knowledge about the progression of weight and loss data to reduce overfitting and improve generalization in the later stages of training. We reformulate the traditional…
With the success of deep learning, there have been numerous efforts to build hardware for it. One approach that is gaining momentum is neuromorphic computing with spiking neural networks (SNNs), which are multiplication-free and open the possibility of using analog computing via novel technologies. However, to design effective and efficient hardware for such architectures, a fast and accurate software …
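For reference, the core update such a software simulator has to execute quickly is the leaky integrate-and-fire step; this is a generic textbook LIF sketch, not the simulator described in the paper:

```python
# A leaky integrate-and-fire (LIF) layer stepped over discrete time.
import numpy as np

def lif_step(v, spikes_in, W, tau=0.9, v_th=1.0):
    # Leak, integrate weighted input spikes (binary spikes need no multiplies
    # in real hardware), then fire and reset.
    v = tau * v + spikes_in @ W
    fired = v >= v_th
    v = np.where(fired, 0.0, v)
    return v, fired.astype(np.float32)

rng = np.random.default_rng(1)
W = rng.normal(0, 0.3, size=(64, 32))
v = np.zeros(32)
for t in range(100):
    spikes_in = (rng.random(64) < 0.1).astype(np.float32)
    v, spikes_out = lif_step(v, spikes_in, W)
print("output spikes at last step:", int(spikes_out.sum()))
```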
Ptychography is a popular microscopic imaging modality for many scientific discoveries and sets the record for the highest image resolution. Unfortunately, the high image resolution of ptychographic reconstruction requires a significant amount of memory and computation, forcing many applications to compromise their image resolution in exchange for a smaller memory footprint and a shorter reconstruction…
Traditional deep model optimization methods discard the training weights, which contain information about the loss landscape that could guide further model optimization. In this paper, we show that a supervisor neural network can be used to predict the validation performance of another target neural network (the student) from its training weights. Based on this behavior, we propose a weight…
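A loose illustration of the supervisor idea, assuming a linear model over hand-picked weight statistics; the paper's supervisor is a neural network, and all names and features below are hypothetical:

```python
# Map summary statistics of a student's weight snapshots to its validation
# score, so promising checkpoints can be spotted without running validation.
import numpy as np

def weight_features(weights):
    # Compact descriptor of a weight snapshot; the paper's encoding may differ.
    w = np.concatenate([p.ravel() for p in weights])
    return np.array([w.mean(), w.std(), np.abs(w).max(), (w ** 2).mean()])

rng = np.random.default_rng(7)
# Pretend we logged 40 checkpoints with their measured validation accuracy.
snapshots = [[rng.normal(0, 0.1 + 0.01 * t, size=(32, 32))] for t in range(40)]
val_acc = np.linspace(0.5, 0.9, 40) + rng.normal(0, 0.01, 40)

X = np.stack([weight_features(s) for s in snapshots])
coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], val_acc, rcond=None)

new_feat = np.append(weight_features(snapshots[-1]), 1.0)
print("predicted validation accuracy:", new_feat @ coef)
```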
Stochastic gradient descent (SGD) is the most prevalent algorithm for training Deep Neural Networks (DNNs). SGD iterates over the input dataset in each training epoch, processing data samples in a random-access fashion. Because this puts enormous pressure on the I/O subsystem, the most common approach to distributed SGD in HPC environments is to replicate the entire dataset to node-local SSDs. However, …
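The I/O trade-off can be sketched by contrasting a global shuffle, which issues random reads across the whole dataset, with a node-local shuffle over a shard; the sharding and seeding below are illustrative assumptions:

```python
import numpy as np

def global_shuffle_order(n_samples, epoch):
    rng = np.random.default_rng(epoch)
    return rng.permutation(n_samples)            # random I/O over entire dataset

def local_shuffle_order(n_samples, n_nodes, rank, epoch):
    shard = np.arange(rank, n_samples, n_nodes)  # this node's shard, no replication
    rng = np.random.default_rng(epoch * 100003 + rank)
    return rng.permutation(shard)                # random I/O only on local SSD

print(global_shuffle_order(12, epoch=0))
print(local_shuffle_order(12, n_nodes=4, rank=1, epoch=0))
```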
We present FastConv, a template-based code auto-generation open-source library that can automatically generate high-performance deep learning convolution kernels for arbitrary matrix/tensor shapes. FastConv is based on the Winograd algorithm, which is reportedly the highest-performing algorithm for the time-consuming layers of convolutional neural networks. ARM CPUs cover a wide range of designs…
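As background, the 1D Winograd F(2,3) minimal-filtering step underlying such libraries computes two convolution outputs with four multiplications instead of six; this is a generic sketch, not FastConv's NEON-optimized 2D kernels:

```python
import numpy as np

def winograd_f23(d, g):
    """Two outputs of a 3-tap convolution over a 4-sample input tile,
    using 4 multiplications instead of 6."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.array([1.0, 2.0, 3.0, 4.0])   # input tile
g = np.array([0.5, 1.0, -0.5])       # filter
direct = np.array([np.dot(d[i:i+3], g) for i in range(2)])
assert np.allclose(winograd_f23(d, g), direct)
```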
Filtered Back-Projection (FBP) is a fundamental compute-intensive algorithm used in tomographic image reconstruction. Cone-Beam Computed Tomography (CBCT) devices use a cone-shaped X-ray beam, in contrast to the parallel beam used in older CT generations. Distributed image reconstruction of cone-beam datasets typically relies on dividing batches of images across different nodes. This simple input decomposition…
Adaptive mesh refinement (AMR) is an important method that enables many mesh-based applications to run at effectively higher resolution within limited computing resources by allowing high resolution only where it is really needed. This advantage comes at a cost, however: greater complexity in the mesh management machinery and challenges with load distribution. With the current trend of increasing heterogeneity…
Distributed training of Deep Neural Networks (DNNs) on High-Performance Computing (HPC) systems is becoming increasingly common. HPC systems dedicated entirely or mainly to Deep Learning (DL) workloads are becoming a reality. The collective communication overhead of calculating the average of weight gradients, e.g., an Allreduce operation, is one of the main factors limiting the scaling of data …
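The collective in question can be sketched with mpi4py; the gradient below is a stand-in, and a real DL framework would fuse this into its optimizer step (run under e.g. mpirun):

```python
# Data-parallel gradient averaging via Allreduce, the communication step
# whose cost grows with model size and node count.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
local_grad = np.random.randn(1_000_000).astype(np.float32)  # this rank's gradient
avg_grad = np.empty_like(local_grad)

comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)
avg_grad /= comm.Get_size()
# Each rank now applies the identical averaged gradient to its model replica.
```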
The dedicated memory of hardware accelerators can be insufficient to store all the weights and/or intermediate states of large deep learning models. Although model parallelism is a viable approach to reducing memory pressure, it requires significant modification of the source code and algorithmic considerations. An alternative solution is to use out-of-core methods instead of, or in addition…
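A toy sketch of the out-of-core pattern, with a numpy memmap standing in for the slower memory tier and all class/method names being my own invention:

```python
# Spill activations that do not fit in accelerator memory to a larger tier,
# then reload them on demand just before they are needed again.
import numpy as np, tempfile, os

class OffloadStore:
    def __init__(self):
        self.dir = tempfile.mkdtemp()

    def offload(self, name, arr):
        # Evict from (simulated) device memory to the backing store.
        m = np.memmap(os.path.join(self.dir, name), dtype=arr.dtype,
                      mode="w+", shape=arr.shape)
        m[:] = arr
        m.flush()
        return arr.shape, arr.dtype

    def prefetch(self, name, shape, dtype):
        # Bring it back just before the layer that needs it runs.
        return np.array(np.memmap(os.path.join(self.dir, name),
                                  dtype=dtype, mode="r", shape=shape))

store = OffloadStore()
act = np.random.rand(256, 1024).astype(np.float32)   # forward-pass activation
meta = store.offload("layer3_act", act)
restored = store.prefetch("layer3_act", *meta)
assert np.array_equal(act, restored)
```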
GPUs are playing an increasingly important role in general-purpose computing. Many algorithms require synchronization at different levels of granularity within a single GPU. Additionally, the emergence of dense GPU nodes also calls for multi-GPU synchronization. Nvidia's latest CUDA provides a variety of synchronization methods. Until now, there has been no full understanding of the characteristics of those…
Computed Tomography (CT) is a widely used technology that requires compute-intensive algorithms for image reconstruction. We propose a novel back-projection algorithm that reduces the projection computation cost to 1/6 of that of the standard algorithm. We also propose an efficient implementation that takes advantage of the heterogeneity of GPU-accelerated systems by overlapping the filtering and back-projection…
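For contrast with the optimized algorithm, here is the standard unoptimized back-projection loop under a simplified parallel-beam geometry (the paper targets production CT geometries on GPUs):

```python
# Each pixel accumulates the filtered projection value at the detector
# position it maps to, for every view angle.
import numpy as np

def backproject(sinogram, angles, size):
    """sinogram: (n_angles, n_detectors) filtered projections."""
    img = np.zeros((size, size))
    c = size // 2
    ys, xs = np.mgrid[-c:size - c, -c:size - c]
    n_det = sinogram.shape[1]
    for sino_row, theta in zip(sinogram, angles):
        # Detector coordinate of every pixel for this view angle.
        t = xs * np.cos(theta) + ys * np.sin(theta) + n_det // 2
        idx = np.clip(np.round(t).astype(int), 0, n_det - 1)
        img += sino_row[idx]
    return img * np.pi / len(angles)

angles = np.linspace(0, np.pi, 180, endpoint=False)
sino = np.random.rand(180, 128)   # stand-in for filtered projections
print(backproject(sino, angles, 128).shape)
```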
This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that shifts partial sums via CUDA warp primitives for the computation. We also employ register files as a cache resource in order to operate the entire model efficiently. We demonstrate the effectiveness and versatility…
Data parallelism is the dominant method used to scale up deep learning (DL) training across multiple compute nodes. Collective communication of the local gradients between nodes is a critical bottleneck due to the significant increase in the complexity and size of DL models. Researchers cope with this problem with one of the following solutions: a) optimizing the collective communication algorithm to ac…
Among the (uncontended) common wisdom in High-Performance Computing (HPC) is the applications' need for a large amount of double-precision support in hardware. Hardware manufacturers, the TOP500 list, and (rarely revisited) legacy software have without doubt followed and contributed to this view. In this paper, we challenge that wisdom by exhaustively comparing a large number of HPC pr…
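The kind of precision experiment this argument rests on is easy to sketch: run the same computation in fp32 and fp64 and measure the deviation (a toy dot-product reduction, not the paper's benchmark set):

```python
import numpy as np

rng = np.random.default_rng(42)
x64 = rng.random(10_000_000)
y64 = rng.random(10_000_000)

ref = np.dot(x64, y64)                                        # fp64 result
approx = np.dot(x64.astype(np.float32), y64.astype(np.float32))

rel_err = abs(ref - approx) / abs(ref)
print(f"fp32 relative error: {rel_err:.2e}")
# Often far below the accuracy the application actually needs,
# which is the kind of evidence the paper's challenge rests on.
```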