
Fast Parameter Synchronization for Distributed Learning with Selective Multicast


Abstract:

Recent advances in distributed machine learning show, theoretically and empirically, that for many models, provided all workers eventually participate in synchronization, i) training still converges even if only p workers take part in each round of synchronization, and ii) a larger p generally leads to a faster rate of convergence. These findings shed light on eliminating the bottleneck effects of parameter synchronization in large-scale data-parallel distributed training and have motivated several optimization designs. In this paper, we focus on optimizing parameter synchronization for peer-to-peer distributed learning, in which workers generally broadcast or multicast their updated parameters to others for synchronization, and propose SELMCAST, an expressive and Pareto-optimal multicast receiver selection algorithm, to achieve this goal. Compared with the state-of-the-art design, which randomly selects exactly p receivers for each worker's multicast in a bandwidth-agnostic way, SELMCAST chooses receivers based on a global view of their available bandwidth and loads, yielding two advantages. First, it optimizes the bottleneck sending rate, thus cutting down the time cost of parameter synchronization. Second, when more than p receivers have sufficient bandwidth, as many of them as possible are selected, which benefits the convergence of training. Extensive evaluations show that SELMCAST is efficient and always achieves near-optimal performance.
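To make the idea concrete, the following is a minimal, hypothetical sketch of bandwidth-aware receiver selection in the spirit described above; it is not the authors' SELMCAST algorithm. The function name, the `bw_threshold` parameter, and the per-receiver `bw`/`load` bookkeeping are assumptions introduced for illustration only.

```python
# Hypothetical sketch (not SELMCAST itself): pick multicast receivers
# using each candidate's available bandwidth and current load, instead
# of picking exactly p receivers at random.

def select_receivers(sender, candidates, bw, load, p, bw_threshold):
    """Pick at least p receivers for `sender`'s multicast.

    Receivers with more spare bandwidth and fewer already-assigned
    multicasts are preferred, so the bottleneck (slowest) receiver is
    as fast as possible. Any extra receiver whose available bandwidth
    exceeds `bw_threshold` is also included, since synchronizing with
    more peers tends to help convergence.
    """
    peers = [r for r in candidates if r != sender]
    # Sort by spare bandwidth (descending), then by current load (ascending).
    peers.sort(key=lambda r: (-bw[r], load[r]))

    selected = peers[:p]            # the mandatory p receivers
    for r in peers[p:]:             # optional extras with ample bandwidth
        if bw[r] >= bw_threshold:
            selected.append(r)

    for r in selected:              # track how many multicasts each receiver serves
        load[r] += 1
    return selected


# Example usage with made-up bandwidth figures (Gbps):
bw = {"w1": 10.0, "w2": 4.0, "w3": 9.0, "w4": 1.0}
load = {r: 0 for r in bw}
print(select_receivers("w1", list(bw), bw, load, p=2, bw_threshold=5.0))
# ['w3', 'w2'] -- w4 is skipped as an extra because its bandwidth is below the threshold
```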
Date of Conference: 16-20 May 2022
Date Added to IEEE Xplore: 11 August 2022
Conference Location: Seoul, Korea, Republic of


I. Introduction

Over the past decade, machine learning techniques have achieved huge success and have been widely employed for various applications such as email filtering, advertising recommendation, speech recognition, machine translation, and computer vision [1]–[4]. With the increasing popularity of machine learning and the rapid development of new technologies, the realistic quantities of training data for a learning task have grown from GBs to TBs and PBs. Data-parallel distributed training has become the key to obtaining the resulting model over such massive volumes of data within a reasonable time [1]–[3].

