
Efficient Parameter Synchronization for Peer-to-Peer Distributed Learning With Selective Multicast


Abstract:

Recent advances in distributed machine learning show theoretically and empirically that, for many models, provided that workers will eventually participate in the synchronizations, i) the training still converges, even if only p workers take part in each round of synchronization, and ii) a larger p generally leads to a faster rate of convergence. These findings shed light on eliminating the bottleneck effects of parameter synchronization in large-scale data-parallel distributed training and have motivated several optimization designs. In this paper, we focus on optimizing the parameter synchronization for peer-to-peer distributed learning, where workers broadcast or multicast their updated parameters to others for synchronization, and propose SelMcast, a suite of expressive and efficient multicast receiver selection algorithms, to achieve the goal. Compared with the state-of-the-art (SOTA) design, which randomly selects exactly p receivers for each worker’s multicast in a bandwidth-agnostic way, SelMcast chooses receivers based on the global view of their available bandwidth and loads, yielding two advantages, i.e., accelerated parameter synchronization for higher utilization of computing resources and enlarged average p values for faster convergence. Comprehensive evaluations show that SelMcast is efficient for both peer-to-peer Bulk Synchronous Parallel (BSP) and Stale Synchronous Parallel (SSP) distributed training, outperforming the SOTA solution significantly.
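The concrete SelMcast algorithms are presented in the body of the paper; as a rough, hypothetical illustration of the idea sketched above, the Python snippet below contrasts the bandwidth-agnostic baseline, which picks exactly p receivers uniformly at random, with a greedy choice that prefers receivers with high available bandwidth and low current load. All names here (bw, load, select_greedy, and so on) are illustrative assumptions, not the paper's actual interfaces.

import random

# Hypothetical global view of worker state: available bandwidth (Gbps) and the
# number of multicasts each worker is currently receiving ("load"). The values
# and field names are illustrative only.
workers = {
    "w0": {"bw": 10.0, "load": 0},
    "w1": {"bw": 2.5, "load": 3},
    "w2": {"bw": 10.0, "load": 1},
    "w3": {"bw": 5.0, "load": 0},
    "w4": {"bw": 1.0, "load": 4},
}

def select_random(sender, p, workers):
    # Bandwidth-agnostic baseline: exactly p receivers, chosen uniformly at random.
    candidates = [w for w in workers if w != sender]
    return random.sample(candidates, min(p, len(candidates)))

def select_greedy(sender, p, workers):
    # Bandwidth/load-aware sketch: rank candidates by available bandwidth
    # (descending) and current load (ascending), keep the top p, and record the
    # extra load placed on the chosen receivers. The real SelMcast algorithms
    # may differ; this only illustrates using a global view instead of chance.
    candidates = [w for w in workers if w != sender]
    ranked = sorted(candidates, key=lambda w: (-workers[w]["bw"], workers[w]["load"]))
    chosen = ranked[:p]
    for w in chosen:
        workers[w]["load"] += 1  # each selected receiver absorbs one more multicast
    return chosen

if __name__ == "__main__":
    print("random baseline:", select_random("w0", 2, workers))
    print("greedy sketch:  ", select_greedy("w0", 2, workers))

Per the abstract, the intended effect of such bandwidth- and load-aware selection is twofold: each multicast finishes sooner, and the average number of receivers p served per round grows, which in turn speeds up convergence.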
Published in: IEEE Transactions on Services Computing ( Volume: 18, Issue: 1, Jan.-Feb. 2025)
Page(s): 156 - 168
Date of Publication: 25 November 2024


I. Introduction

Over the past decade, machine learning techniques have achieved tremendous success and been widely employed in applications such as email filtering, advertising recommendation, speech recognition, machine translation, and computer vision [1], [2], [3], [4], [5]. With the increasing popularity of machine learning and the rapid development of new technologies, the amount of training data for a realistic learning task has grown from gigabytes to terabytes and even petabytes. Data-parallel distributed training has become the key to obtaining the resulting model over such massive amounts of data within a reasonable time [2], [3], [4].

References
[1] S. Luo, P. Fan, K. Li, H. Xing, L. Luo and H. Yu, "Fast parameter synchronization for distributed learning with selective multicast", Proc. IEEE Int. Conf. Commun., pp. 4775-4780, 2022.
[2] S. Shi, Z. Tang, X. Chu, C. Liu, W. Wang and B. Li, "A quantitative survey of communication optimizations in distributed deep learning", IEEE Netw., vol. 35, no. 3, pp. 230-237, Jun. 2021.
[3] J. Verbraeken et al., "A survey on distributed machine learning", ACM Comput. Surv., vol. 53, no. 2, pp. 1-33, Mar. 2020.
[4] P. Xie et al., "Orpheus: Efficient distributed machine learning via system and algorithm co-design", Proc. ACM Symp. Cloud Comput., pp. 1-13, 2018.
[5] S. Luo, P. Fan, H. Xing, L. Luo and H. Yu, "Eliminating communication bottlenecks in cross-device federated learning with in-network processing at the edge", Proc. IEEE Int. Conf. Commun., pp. 4601-4606, 2022.
[6] S. Luo, X. Yu, K. Li and H. Xing, "Releasing the power of in-network aggregation with aggregator-aware routing optimization", IEEE/ACM Trans. Netw., vol. 32, no. 5, pp. 4488-4502, Oct. 2024.
[7] A. Sapio et al., "Scaling distributed machine learning with in-network aggregation", Proc. 18th Symp. Netw. Syst. Des. Implementation, pp. 785-808, 2021.
[8] L. Luo et al., "Fast synchronization of model updates for collaborative learning in micro-clouds", Proc. IEEE 23rd Int. Conf. High Perform. Comput. Commun., pp. 831-836, 2021.
[9] S. Luo, R. Wang, K. Li and H. Xing, "Efficient cross-cloud partial reduce with CREW", IEEE Trans. Parallel Distrib. Syst., vol. 35, no. 11, pp. 2224-2238, Nov. 2024.
[10] S. Luo, R. Wang and H. Xing, "Efficient inter-datacenter AllReduce with multiple trees", IEEE Trans. Netw. Sci. Eng., vol. 11, no. 5, pp. 4793-4806, Sep./Oct. 2024.
[11] X. Miao et al., "Heterogeneity-aware distributed machine learning training via partial reduce", Proc. ACM SIGMOD Int. Conf. Manage. Data, pp. 2262-2270, 2021.
[12] Q. Luo et al., "Prague: High-performance heterogeneity-aware asynchronous decentralized training", Proc. 25th ACM Int. Conf. Architectural Support Program. Lang. Operating Syst., pp. 401-416, 2020.
[13] S. Dutta, J. Wang and G. Joshi, "Slow and stale gradients can win the race", IEEE J. Sel. Areas Inf. Theory, vol. 2, no. 3, pp. 1012-1024, Sep. 2021.
[14] P. Xie et al., "Lighter-communication distributed machine learning via sufficient factor broadcasting", Proc. 32nd Conf. Uncertainty Artif. Intell., pp. 795-804, 2016.
[15] H. Li et al., "MALT: Distributed data-parallelism for existing ML applications", Proc. 10th ACM Eur. Conf. Comput. Syst., pp. 1-16, 2015.
[16] Q. Hu et al., "Hydro: Surrogate-based hyperparameter tuning service in datacenters", Proc. 17th Symp. Operating Syst. Des. Implementation, pp. 757-777, 2023.
[17] L. Mai et al., "KungFu: Making training in distributed machine learning adaptive", Proc. 14th Symp. Operating Syst. Des. Implementation, pp. 937-954, 2020.
[18] Q. Ho et al., "More effective distributed ML via a stale synchronous parallel parameter server", Proc. 26th Int. Conf. Neural Inf. Process. Syst., pp. 1223-1231, 2013.
[19] H. Cui et al., "Exploiting bounded staleness to speed up big data analytics", Proc. USENIX Annu. Tech. Conf., pp. 37-48, 2014.
[20] S. Li et al., "Taming unbalanced training workloads in deep learning with partial collective operations", Proc. 25th ACM SIGPLAN Symp. Princ. Pract. Parallel Program., pp. 45-61, 2020.
[21] X. Zhao, A. An, J. Liu and B. X. Chen, "Dynamic stale synchronous parallel distributed training for deep learning", Proc. 39th IEEE Int. Conf. Distrib. Comput. Syst., pp. 1507-1517, 2019.
[22] A. Barrak, "The promise of serverless computing within peer-to-peer architectures for distributed ML training", Proc. AAAI Conf. Artif. Intell., pp. 23383-23384, 2024.
[23] S. Luo, H. Yu, K. Li and H. Xing, "Efficient file dissemination in data center networks with priority-based adaptive multicast", IEEE J. Sel. Areas Commun., vol. 38, no. 6, pp. 1161-1175, Jun. 2020.
[24] S. Luo, H. Xing and P. Fan, "Softwarized IP multicast in the cloud", IEEE Netw., vol. 35, no. 6, pp. 233-239, Nov./Dec. 2021.
[25] S. Li et al., "Sync-Switch: Hybrid parameter synchronization for distributed deep learning", Proc. IEEE 41st Int. Conf. Distrib. Comput. Syst., pp. 528-538, 2021.
[26] S. Luo, H. Yu, Y. Zhao, S. Wang, S. Yu and L. Li, "Towards practical and near-optimal coflow scheduling for data center networks", IEEE Trans. Parallel Distrib. Syst., vol. 27, no. 11, pp. 3366-3380, Nov. 2016.
[27] S. Luo et al., "Selective coflow completion for time-sensitive distributed applications with Poco", Proc. 49th Int. Conf. Parallel Process., pp. 1-10, 2020.
[28] S. Luo, P. Fan, H. Xing and H. Yu, "Meeting coflow deadlines in data center networks with policy-based selective completion", IEEE/ACM Trans. Netw., vol. 31, no. 1, pp. 178-191, Feb. 2023.
[29] T. H. Cormen et al., Introduction to Algorithms, Cambridge, MA, USA: MIT Press, 2009.
[30] L. L. Larmore, Oct. 2024, [online] Available: https://web.cs.unlv.edu/larmore/Courses/CSC456/ssPart.pdf.