Mohamed Wahib - IEEE Xplore Author Profile

Showing 1-25 of 42 results

Results

We introduce CSI-Inpainter, a novel approach for obstacle removal using Wi-Fi channel state information (CSI). The method harnesses CSI data to reconstruct obscured visual elements, regardless of lighting conditions. Extensive empirical evaluation in both office and industrial settings demonstrates CSI-Inpainter's ability to identify and reconstruct occluded segments …
This paper presents an open-source library that pushes the limits of performance portability for irregular General Matrix Multiplication (GEMM) on the widely used Arm architectures. Our library, autoGEMM, is designed to support a wide range of Arm processors, from edge devices to HPC-grade CPUs. autoGEMM generates optimized kernels for various hardware configurations by auto-combining fragments of …
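To make the idea concrete, here is a minimal sketch of the cache-blocked GEMM pattern that a kernel generator of this kind specializes per hardware target; the tile sizes and code are illustrative assumptions, not autoGEMM's actual output:

```python
# Sketch of cache-blocked GEMM, the kind of kernel a generator specializes
# per target; block sizes mb/nb/kb are made-up tuning parameters.
import numpy as np

def blocked_gemm(A, B, mb=64, nb=64, kb=64):
    """C = A @ B computed over mb x nb x kb tiles."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, mb):
        for j in range(0, N, nb):
            for k in range(0, K, kb):
                # Inner tile multiply; a real kernel would use SIMD here.
                C[i:i+mb, j:j+nb] += A[i:i+mb, k:k+kb] @ B[k:k+kb, j:j+nb]
    return C

A = np.random.rand(128, 96).astype(np.float32)
B = np.random.rand(96, 80).astype(np.float32)
assert np.allclose(blocked_gemm(A, B), A @ B, atol=1e-3)
```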
Attention-based models are proliferating in the space of image analytics, including segmentation. The standard method of feeding images to transformer encoders is to divide the images into patches and then feed the patches to the model as a linear sequence of tokens. For high-resolution images, e.g., microscopic pathology images, the quadratic compute and memory cost prohibits the use of an attention…
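The patch-tokenization step the abstract describes is simple to sketch; the 16x16 patch size below is the common choice, not necessarily what this paper uses:

```python
# Standard patch tokenization: an H x W x C image is cut into P x P patches,
# each flattened into one token of the transformer's input sequence.
import numpy as np

def image_to_tokens(img, patch=16):
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (num_patches, P*P*C)
    t = img.reshape(H // patch, patch, W // patch, patch, C)
    t = t.transpose(0, 2, 1, 3, 4)
    return t.reshape(-1, patch * patch * C)

img = np.random.rand(224, 224, 3)
tokens = image_to_tokens(img)   # (196, 768)
print(tokens.shape)             # sequence length grows as (H*W)/P^2, which is
                                # what makes attention quadratic cost prohibitive
```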
GPUDirect Storage, a novel tool provided by Nvidia, facilitates better utilization of GPU I/O by avoiding extra copies through a bounce buffer in the CPU host memory and enabling direct memory access. This technology offers significant advantages, particularly its high throughput capabilities and low latency. However, it also presents challenges in implementation due to strict layout requirements. …
Graph Convolutional Networks (GCNs) are widely used in various domains. However, training distributed full-batch GCNs on large-scale graphs poses challenges due to high communication overhead. This work presents a hybrid pre-post-aggregation approach and an integer quantization method to reduce communication costs. With these techniques, we develop a scalable distributed GCN training framework, Su…
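A minimal sketch of the integer-quantization idea for shrinking communicated features; the symmetric int8 scheme and names here are my own assumptions, not necessarily the paper's method:

```python
# Quantize node features to int8 before communication, dequantize on arrival.
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: float32 -> int8 plus one fp32 scale.
    scale = np.abs(x).max() / 127.0 if x.size else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

feats = np.random.randn(1024, 256).astype(np.float32)  # boundary-node features
q, s = quantize_int8(feats)                             # 4x smaller payload
restored = dequantize(q, s)
print("max abs error:", np.abs(feats - restored).max())
```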
Overfitting is a well-documented and studied issue in supervised learning. Human experts have designed methods to reduce overfitting by observing validation behavior, e.g., learning rate schedules, dropout, and adversarial training. We propose a validation-loss landscape exploration/exploitation method called VKI (Validation Knowledge Inheritance). We reformulate the traditional gradient …
Neural architecture search (NAS) automates the design of neural networks, but faces high computational costs for evaluating the performance of candidate architectures. Surrogate-assisted NAS methods use approximate computational models to obtain predictive estimates in place of full training runs, but also face the challenge of maintaining the balance between training cost and predictive effectiveness…
Neural architecture search (NAS) is an effective approach for automating the design of deep neural networks. Evolutionary computation (EC) is commonly used in NAS due to its global optimization capability. However, the evaluation phase of architecture candidates in EC-based NAS is compute-intensive, limiting its application to many real-world problems. To overcome this challenge, we propose a novel …
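A toy sketch of surrogate-assisted evaluation, assuming a linear least-squares surrogate as a stand-in for whatever predictor the paper actually uses:

```python
# A cheap model trained on (architecture encoding -> measured accuracy) pairs
# ranks new candidates so only the most promising ones get full training.
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.random((50, 8))        # encodings of 50 fully trained archs
y_train = X_train @ rng.random(8)    # stand-in for their measured accuracies

# Linear least-squares surrogate (real systems use GPs, RFs, or neural predictors).
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

candidates = rng.random((500, 8))    # a new EC population, untrained
predicted = candidates @ w
top_k = np.argsort(predicted)[-10:]  # only these 10 get real training
print("candidates promoted to full evaluation:", top_k)
```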
This paper proposes a method called Meta Generative Data Augmentation Optimization (MGDAO) to overcome limitations in existing data augmentation techniques. While traditional data augmentation methods have relied on expert intuition to determine effective transformations, recent approaches have attempted to generate data augmentation strategies automatically. However, these automatic methods can …
When training neural networks, the weights of the model are updated at each optimization step, and the older weights are discarded. In this paper, we propose a method called Training Knowledge Inheritance (TKI) that uses knowledge about the progression of weight and loss data to reduce overfitting and improve generalization in the later stages of training. We reformulate the traditional…
With the success of deep learning, there have been numerous efforts to build hardware for it. One approach that is gaining momentum is neuromorphic computing with spiking neural networks (SNNs), which are multiplication-free and open the possibility of using analog computing via novel technologies. However, to design effective and efficient hardware for such architectures, a fast and accurate software …
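For reference, the core update such a software simulator has to execute quickly is the leaky integrate-and-fire step; this is a generic textbook LIF sketch, not the simulator described in the paper:

```python
# A leaky integrate-and-fire (LIF) layer stepped over discrete time.
import numpy as np

def lif_step(v, spikes_in, W, tau=0.9, v_th=1.0):
    # Leak, integrate weighted input spikes (binary spikes need no multiplies
    # in real hardware), then fire and reset.
    v = tau * v + spikes_in @ W
    fired = v >= v_th
    v = np.where(fired, 0.0, v)
    return v, fired.astype(np.float32)

rng = np.random.default_rng(1)
W = rng.normal(0, 0.3, size=(64, 32))
v = np.zeros(32)
for t in range(100):
    spikes_in = (rng.random(64) < 0.1).astype(np.float32)
    v, spikes_out = lif_step(v, spikes_in, W)
print("output spikes at last step:", int(spikes_out.sum()))
```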
Ptychography is a popular microscopic imaging modality for many scientific discoveries and sets the record for the highest image resolution. Unfortunately, the high image resolution of ptychographic reconstruction requires a significant amount of memory and computation, forcing many applications to compromise their image resolution in exchange for a smaller memory footprint and a shorter reconstruction…
Traditional deep model optimization methods discard the training weights, which contain information about the loss landscape that could guide further model optimization. In this paper, we show that a supervisor neural network can be used to predict the validation performance of another target neural network (the student) from its training weights. Based on this behavior, we propose a weight…
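A loose illustration of the supervisor idea, assuming a linear model over hand-picked weight statistics; the paper's supervisor is a neural network, and all names and features below are hypothetical:

```python
# Map summary statistics of a student's weight snapshots to its validation
# score, so promising checkpoints can be spotted without running validation.
import numpy as np

def weight_features(weights):
    # Compact descriptor of a weight snapshot; the paper's encoding may differ.
    w = np.concatenate([p.ravel() for p in weights])
    return np.array([w.mean(), w.std(), np.abs(w).max(), (w ** 2).mean()])

rng = np.random.default_rng(7)
# Pretend we logged 40 checkpoints with their measured validation accuracy.
snapshots = [[rng.normal(0, 0.1 + 0.01 * t, size=(32, 32))] for t in range(40)]
val_acc = np.linspace(0.5, 0.9, 40) + rng.normal(0, 0.01, 40)

X = np.stack([weight_features(s) for s in snapshots])
coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], val_acc, rcond=None)

new_feat = np.append(weight_features(snapshots[-1]), 1.0)
print("predicted validation accuracy:", new_feat @ coef)
```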
Stochastic gradient descent (SGD) is the most prevalent algorithm for training Deep Neural Networks (DNNs). SGD iterates over the input dataset in each training epoch, processing data samples in a random-access fashion. Because this puts enormous pressure on the I/O subsystem, the most common approach to distributed SGD in HPC environments is to replicate the entire dataset to node-local SSDs. However, …
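The I/O trade-off can be sketched by contrasting a global shuffle, which issues random reads across the whole dataset, with a node-local shuffle over a shard; the sharding and seeding below are illustrative assumptions:

```python
import numpy as np

def global_shuffle_order(n_samples, epoch):
    rng = np.random.default_rng(epoch)
    return rng.permutation(n_samples)            # random I/O over entire dataset

def local_shuffle_order(n_samples, n_nodes, rank, epoch):
    shard = np.arange(rank, n_samples, n_nodes)  # this node's shard, no replication
    rng = np.random.default_rng(epoch * 100003 + rank)
    return rng.permutation(shard)                # random I/O only on local SSD

print(global_shuffle_order(12, epoch=0))
print(local_shuffle_order(12, n_nodes=4, rank=1, epoch=0))
```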
We present FastConv, a template-based code auto-generation open-source library that can automatically generate high-performance deep learning convolution kernels for arbitrary matrix/tensor shapes. FastConv is based on the Winograd algorithm, which is reportedly the highest-performing algorithm for the time-consuming layers of convolutional neural networks. ARM CPUs cover a wide range of designs…
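As background, the 1D Winograd F(2,3) minimal-filtering step underlying such libraries computes two convolution outputs with four multiplications instead of six; this is a generic sketch, not FastConv's NEON-optimized 2D kernels:

```python
import numpy as np

def winograd_f23(d, g):
    """Two outputs of a 3-tap convolution over a 4-sample input tile,
    using 4 multiplications instead of 6."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.array([1.0, 2.0, 3.0, 4.0])   # input tile
g = np.array([0.5, 1.0, -0.5])       # filter
direct = np.array([np.dot(d[i:i+3], g) for i in range(2)])
assert np.allclose(winograd_f23(d, g), direct)
```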
Filtered Back-Projection (FBP) is a fundamental compute-intensive algorithm used in tomographic image reconstruction. Cone-Beam Computed Tomography (CBCT) devices use a cone-shaped X-ray beam, in contrast to the parallel beam used in older CT generations. Distributed image reconstruction of cone-beam datasets typically relies on dividing batches of images across different nodes. This simple input decomposition…
Adaptive mesh refinement (AMR) is an important method that enables many mesh-based applications to run at effectively higher resolution within limited computing resources by allowing high resolution only where it is really needed. This advantage comes at a cost, however: greater complexity in the mesh management machinery and challenges with load distribution. With the current trend of increasing heterogeneity…
Distributed training of Deep Neural Networks (DNNs) on High-Performance Computing (HPC) systems is becoming increasingly common. HPC systems dedicated entirely or mainly to Deep Learning (DL) workloads are becoming a reality. The collective communication overhead of calculating the average of weight gradients, e.g., an Allreduce operation, is one of the main factors limiting the scaling of data …
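The collective in question can be sketched with mpi4py; the gradient below is a stand-in, and a real DL framework would fuse this into its optimizer step (run under e.g. mpirun):

```python
# Data-parallel gradient averaging via Allreduce, the communication step
# whose cost grows with model size and node count.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
local_grad = np.random.randn(1_000_000).astype(np.float32)  # this rank's gradient
avg_grad = np.empty_like(local_grad)

comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)
avg_grad /= comm.Get_size()
# Each rank now applies the identical averaged gradient to its model replica.
```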
The dedicated memory of hardware accelerators can be insufficient to store all the weights and/or intermediate states of large deep learning models. Although model parallelism is a viable approach to reducing memory pressure, it requires significant modification of the source code and algorithmic considerations. An alternative solution is to use out-of-core methods instead of, or in addition…
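A toy sketch of the out-of-core pattern, with a numpy memmap standing in for the slower memory tier and all class/method names being my own invention:

```python
# Spill activations that do not fit in accelerator memory to a larger tier,
# then reload them on demand just before they are needed again.
import numpy as np, tempfile, os

class OffloadStore:
    def __init__(self):
        self.dir = tempfile.mkdtemp()

    def offload(self, name, arr):
        # Evict from (simulated) device memory to the backing store.
        m = np.memmap(os.path.join(self.dir, name), dtype=arr.dtype,
                      mode="w+", shape=arr.shape)
        m[:] = arr
        m.flush()
        return arr.shape, arr.dtype

    def prefetch(self, name, shape, dtype):
        # Bring it back just before the layer that needs it runs.
        return np.array(np.memmap(os.path.join(self.dir, name),
                                  dtype=dtype, mode="r", shape=shape))

store = OffloadStore()
act = np.random.rand(256, 1024).astype(np.float32)   # forward-pass activation
meta = store.offload("layer3_act", act)
restored = store.prefetch("layer3_act", *meta)
assert np.array_equal(act, restored)
```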
GPUs are playing an increasingly important role in general-purpose computing. Many algorithms require synchronization at different levels of granularity within a single GPU. Additionally, the emergence of dense GPU nodes also calls for multi-GPU synchronization. Nvidia's latest CUDA provides a variety of synchronization methods. Until now, there has been no full understanding of the characteristics of those…
Computed Tomography (CT) is a widely used technology that requires compute-intensive algorithms for image reconstruction. We propose a novel back-projection algorithm that reduces the projection computation cost to 1/6 of that of the standard algorithm. We also propose an efficient implementation that takes advantage of the heterogeneity of GPU-accelerated systems by overlapping the filtering and back-projection…
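For contrast with the optimized algorithm, here is the standard unoptimized back-projection loop under a simplified parallel-beam geometry (the paper targets production CT geometries on GPUs):

```python
# Each pixel accumulates the filtered projection value at the detector
# position it maps to, for every view angle.
import numpy as np

def backproject(sinogram, angles, size):
    """sinogram: (n_angles, n_detectors) filtered projections."""
    img = np.zeros((size, size))
    c = size // 2
    ys, xs = np.mgrid[-c:size - c, -c:size - c]
    n_det = sinogram.shape[1]
    for sino_row, theta in zip(sinogram, angles):
        # Detector coordinate of every pixel for this view angle.
        t = xs * np.cos(theta) + ys * np.sin(theta) + n_det // 2
        idx = np.clip(np.round(t).astype(int), 0, n_det - 1)
        img += sino_row[idx]
    return img * np.pi / len(angles)

angles = np.linspace(0, np.pi, 180, endpoint=False)
sino = np.random.rand(180, 128)   # stand-in for filtered projections
print(backproject(sino, angles, 128).shape)
```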
This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that shifts partial sums via CUDA warp primitives for the computation. We also employ register files as a cache resource in order to operate the entire model efficiently. We demonstrate the effectiveness and versatility…
Data parallelism is the dominant method used to scale up deep learning (DL) training across multiple compute nodes. Collective communication of the local gradients between nodes is a critical bottleneck due to the significant increase in the complexity and size of DL models. Researchers cope with this problem with one of the following solutions: a) optimizing the collective communication algorithm to ac…
Among the (uncontended) common wisdom in High-Performance Computing (HPC) is the applications' need for a large amount of double-precision support in hardware. Hardware manufacturers, the TOP500 list, and (rarely revisited) legacy software have without doubt followed and contributed to this view. In this paper, we challenge that wisdom by exhaustively comparing a large number of HPC pr…
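The kind of precision experiment this argument rests on is easy to sketch: run the same computation in fp32 and fp64 and measure the deviation (a toy dot-product reduction, not the paper's benchmark set):

```python
import numpy as np

rng = np.random.default_rng(42)
x64 = rng.random(10_000_000)
y64 = rng.random(10_000_000)

ref = np.dot(x64, y64)                                        # fp64 result
approx = np.dot(x64.astype(np.float32), y64.astype(np.float32))

rel_err = abs(ref - approx) / abs(ref)
print(f"fp32 relative error: {rel_err:.2e}")
# Often far below the accuracy the application actually needs,
# which is the kind of evidence the paper's challenge rests on.
```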