
Efficient Parameter Synchronization for Peer-to-Peer Distributed Learning With Selective Multicast


Abstract:

Recent advances in distributed machine learning show theoretically and empirically that, for many models, provided that workers will eventually participate in the synchronizations, i) the training still converges, even if only p workers take part in each round of synchronization, and ii) a larger p generally leads to a faster rate of convergence. These findings shed light on eliminating the bottleneck effects of parameter synchronization in large-scale data-parallel distributed training and have motivated several optimization designs. In this paper, we focus on optimizing the parameter synchronization for peer-to-peer distributed learning, where workers broadcast or multicast their updated parameters to others for synchronization, and propose SelMcast, a suite of expressive and efficient multicast receiver selection algorithms, to achieve the goal. Compared with the state-of-the-art (SOTA) design, which randomly selects exactly p receivers for each worker’s multicast in a bandwidth-agnostic way, SelMcast chooses receivers based on the global view of their available bandwidth and loads, yielding two advantages, i.e., accelerated parameter synchronization for higher utilization of computing resources and enlarged average p values for faster convergence. Comprehensive evaluations show that SelMcast is efficient for both peer-to-peer Bulk Synchronous Parallel (BSP) and Stale Synchronous Parallel (SSP) distributed training, outperforming the SOTA solution significantly.
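The concrete SelMcast algorithms are presented in the body of the paper; as a rough, hypothetical illustration of the idea sketched above, the Python snippet below contrasts the bandwidth-agnostic baseline, which picks exactly p receivers uniformly at random, with a greedy choice that prefers receivers with high available bandwidth and low current load. All names here (bw, load, select_greedy, and so on) are illustrative assumptions, not the paper's actual interfaces.

import random

# Hypothetical global view of worker state: available bandwidth (Gbps) and the
# number of multicasts each worker is currently receiving ("load"). The values
# and field names are illustrative only.
workers = {
    "w0": {"bw": 10.0, "load": 0},
    "w1": {"bw": 2.5, "load": 3},
    "w2": {"bw": 10.0, "load": 1},
    "w3": {"bw": 5.0, "load": 0},
    "w4": {"bw": 1.0, "load": 4},
}

def select_random(sender, p, workers):
    # Bandwidth-agnostic baseline: exactly p receivers, chosen uniformly at random.
    candidates = [w for w in workers if w != sender]
    return random.sample(candidates, min(p, len(candidates)))

def select_greedy(sender, p, workers):
    # Bandwidth/load-aware sketch: rank candidates by available bandwidth
    # (descending) and current load (ascending), keep the top p, and record the
    # extra load placed on the chosen receivers. The real SelMcast algorithms
    # may differ; this only illustrates using a global view instead of chance.
    candidates = [w for w in workers if w != sender]
    ranked = sorted(candidates, key=lambda w: (-workers[w]["bw"], workers[w]["load"]))
    chosen = ranked[:p]
    for w in chosen:
        workers[w]["load"] += 1  # each selected receiver absorbs one more multicast
    return chosen

if __name__ == "__main__":
    print("random baseline:", select_random("w0", 2, workers))
    print("greedy sketch:  ", select_greedy("w0", 2, workers))

Per the abstract, the intended effect of such bandwidth- and load-aware selection is twofold: each multicast finishes sooner, and the average number of receivers p served per round grows, which in turn speeds up convergence.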
Published in: IEEE Transactions on Services Computing ( Volume: 18, Issue: 1, Jan.-Feb. 2025)
Page(s): 156 - 168
Date of Publication: 25 November 2024


I. Introduction

Over the past decade, machine learning techniques have achieved tremendous success and been widely employed in applications such as email filtering, advertising recommendation, speech recognition, machine translation, and computer vision [1], [2], [3], [4], [5]. With the increasing popularity of machine learning and the rapid development of new technologies, the amount of training data for a realistic learning task has grown from gigabytes to terabytes and even petabytes. Data-parallel distributed training has become the key to obtaining the resulting model over such massive amounts of data within a reasonable time [2], [3], [4].

References
[1] S. Luo, P. Fan, K. Li, H. Xing, L. Luo and H. Yu, "Fast parameter synchronization for distributed learning with selective multicast", Proc. IEEE Int. Conf. Commun., pp. 4775-4780, 2022.
[2] S. Shi, Z. Tang, X. Chu, C. Liu, W. Wang and B. Li, "A quantitative survey of communication optimizations in distributed deep learning", IEEE Netw., vol. 35, no. 3, pp. 230-237, Jun. 2021.
[3] J. Verbraeken et al., "A survey on distributed machine learning", ACM Comput. Surv., vol. 53, no. 2, pp. 1-33, Mar. 2020.
[4] P. Xie et al., "Orpheus: Efficient distributed machine learning via system and algorithm co-design", Proc. ACM Symp. Cloud Comput., pp. 1-13, 2018.
[5] S. Luo, P. Fan, H. Xing, L. Luo and H. Yu, "Eliminating communication bottlenecks in cross-device federated learning with in-network processing at the edge", Proc. IEEE Int. Conf. Commun., pp. 4601-4606, 2022.
[6] S. Luo, X. Yu, K. Li and H. Xing, "Releasing the power of in-network aggregation with aggregator-aware routing optimization", IEEE/ACM Trans. Netw., vol. 32, no. 5, pp. 4488-4502, Oct. 2024.
[7] A. Sapio et al., "Scaling distributed machine learning with in-network aggregation", Proc. 18th Symp. Netw. Syst. Des. Implementation, pp. 785-808, 2021.
[8] L. Luo et al., "Fast synchronization of model updates for collaborative learning in micro-clouds", Proc. IEEE 23rd Int. Conf. High Perform. Comput. Commun., pp. 831-836, 2021.
[9] S. Luo, R. Wang, K. Li and H. Xing, "Efficient cross-cloud partial reduce with CREW", IEEE Trans. Parallel Distrib. Syst., vol. 35, no. 11, pp. 2224-2238, Nov. 2024.
[10] S. Luo, R. Wang and H. Xing, "Efficient inter-datacenter AllReduce with multiple trees", IEEE Trans. Netw. Sci. Eng., vol. 11, no. 5, pp. 4793-4806, Sep./Oct. 2024.
[11] X. Miao et al., "Heterogeneity-aware distributed machine learning training via partial reduce", Proc. ACM SIGMOD Int. Conf. Manage. Data, pp. 2262-2270, 2021.
[12] Q. Luo et al., "Prague: High-performance heterogeneity-aware asynchronous decentralized training", Proc. 25th ACM Int. Conf. Architectural Support Program. Lang. Operating Syst., pp. 401-416, 2020.
[13] S. Dutta, J. Wang and G. Joshi, "Slow and stale gradients can win the race", IEEE J. Sel. Areas Inf. Theory, vol. 2, no. 3, pp. 1012-1024, Sep. 2021.
[14] P. Xie et al., "Lighter-communication distributed machine learning via sufficient factor broadcasting", Proc. 32nd Conf. Uncertainty Artif. Intell., pp. 795-804, 2016.
[15] H. Li et al., "MALT: Distributed data-parallelism for existing ML applications", Proc. 10th ACM Eur. Conf. Comput. Syst., pp. 1-16, 2015.
[16] Q. Hu et al., "Hydro: Surrogate-based hyperparameter tuning service in datacenters", Proc. 17th Symp. Operating Syst. Des. Implementation, pp. 757-777, 2023.
[17] L. Mai et al., "KungFu: Making training in distributed machine learning adaptive", Proc. 14th Symp. Operating Syst. Des. Implementation, pp. 937-954, 2020.
[18] Q. Ho et al., "More effective distributed ML via a stale synchronous parallel parameter server", Proc. 26th Int. Conf. Neural Inf. Process. Syst., pp. 1223-1231, 2013.
[19] H. Cui et al., "Exploiting bounded staleness to speed up big data analytics", Proc. USENIX Annu. Tech. Conf., pp. 37-48, 2014.
[20] S. Li et al., "Taming unbalanced training workloads in deep learning with partial collective operations", Proc. 25th ACM SIGPLAN Symp. Princ. Pract. Parallel Program., pp. 45-61, 2020.
[21] X. Zhao, A. An, J. Liu and B. X. Chen, "Dynamic stale synchronous parallel distributed training for deep learning", Proc. 39th IEEE Int. Conf. Distrib. Comput. Syst., pp. 1507-1517, 2019.
[22] A. Barrak, "The promise of serverless computing within peer-to-peer architectures for distributed ML training", Proc. AAAI Conf. Artif. Intell., pp. 23383-23384, 2024.
[23] S. Luo, H. Yu, K. Li and H. Xing, "Efficient file dissemination in data center networks with priority-based adaptive multicast", IEEE J. Sel. Areas Commun., vol. 38, no. 6, pp. 1161-1175, Jun. 2020.
[24] S. Luo, H. Xing and P. Fan, "Softwarized IP multicast in the cloud", IEEE Netw., vol. 35, no. 6, pp. 233-239, Nov./Dec. 2021.
[25] S. Li et al., "Sync-Switch: Hybrid parameter synchronization for distributed deep learning", Proc. IEEE 41st Int. Conf. Distrib. Comput. Syst., pp. 528-538, 2021.
[26] S. Luo, H. Yu, Y. Zhao, S. Wang, S. Yu and L. Li, "Towards practical and near-optimal coflow scheduling for data center networks", IEEE Trans. Parallel Distrib. Syst., vol. 27, no. 11, pp. 3366-3380, Nov. 2016.
[27] S. Luo et al., "Selective coflow completion for time-sensitive distributed applications with Poco", Proc. 49th Int. Conf. Parallel Process., pp. 1-10, 2020.
[28] S. Luo, P. Fan, H. Xing and H. Yu, "Meeting coflow deadlines in data center networks with policy-based selective completion", IEEE/ACM Trans. Netw., vol. 31, no. 1, pp. 178-191, Feb. 2023.
[29] T. H. Cormen et al., Introduction to Algorithms, Cambridge, MA, USA: MIT Press, 2009.
[30] L. L. Larmore, Oct. 2024, [online] Available: https://web.cs.unlv.edu/larmore/Courses/CSC456/ssPart.pdf.