
Efficient Inter-Datacenter AllReduce With Multiple Trees


Abstract:

In this paper, we look into the problem of achieving efficient inter-datacenter AllReduce operations for geo-distributed machine learning (Geo-DML). Compared with the networks available to intra-datacenter distributed training, the heterogeneous wide-area network (WAN) connections among Geo-DML workers are scarce, expensive, and unstable, so existing proposals designed for homogeneous networks fall short. Although some recent optimizations have been proposed for Geo-DML, they break the consistency semantics of bulk synchronous parallel (BSP) and thus bring no benefit to the many existing BSP-based applications. To address these issues, we propose mTree, a topology management suite for Geo-DML. With a global view of the heterogeneous WAN connections, mTree builds multiple optimized spanning trees along with suggested workload distribution proportions, respecting the constraints on both the number of trees and their maximum height specified by the training job. Based on these results, geo-distributed workers can launch concurrent tree-based pipelined AllReduce operations that make efficient use of the heterogeneous network. Detailed performance studies on real-world network topologies show that mTree achieves efficient AllReduce, significantly outperforming existing solutions.
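To make the multi-tree mechanism concrete, the following is a minimal, runnable Python sketch of the idea under simplifying assumptions: each worker's gradient is partitioned into one shard per spanning tree according to the suggested proportions, and each tree reduces its shard across workers and broadcasts the result back. The spanning trees are simulated locally here, and the helper names (split_by_proportion, simulated_tree_allreduce, multi_tree_allreduce) are illustrative, not mTree's actual API; a real deployment would move chunks along actual WAN links, concurrently and in a pipelined fashion.

```python
import numpy as np

def split_by_proportion(grad, proportions):
    """Partition a flat gradient into one shard per tree, sized by proportion."""
    cuts = np.cumsum(proportions)[:-1] / np.sum(proportions)
    return np.split(grad, (cuts * len(grad)).astype(int))

def simulated_tree_allreduce(worker_shards):
    """Stand-in for one tree's reduce-then-broadcast: sum the shard held by
    every worker (reduce toward the root), then hand the total back to all
    workers (broadcast back down the tree)."""
    total = np.sum(worker_shards, axis=0)
    return [total.copy() for _ in worker_shards]

def multi_tree_allreduce(grads, proportions):
    """AllReduce per-worker gradients `grads` over len(proportions) trees.
    In a real system the per-tree transfers run concurrently; serialized here."""
    num_trees, num_workers = len(proportions), len(grads)
    shards = [split_by_proportion(g, proportions) for g in grads]  # [worker][tree]
    reduced = [simulated_tree_allreduce([shards[w][t] for w in range(num_workers)])
               for t in range(num_trees)]                          # [tree][worker]
    return [np.concatenate([reduced[t][w] for t in range(num_trees)])
            for w in range(num_workers)]

# Example: 3 workers, 2 trees carrying 70% / 30% of the gradient volume.
grads = [np.arange(10.0) * (w + 1) for w in range(3)]
result = multi_tree_allreduce(grads, [0.7, 0.3])
assert all(np.allclose(r, sum(grads)) for r in result)
```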
Published in: IEEE Transactions on Network Science and Engineering (Volume: 11, Issue: 5, Sept.-Oct. 2024)
Page(s): 4793 - 4806
Date of Publication: 25 June 2024


I. Introduction

Nowadays, an increasing number of geo-distributed machine learning (Geo-DML) systems are employed to train large, sophisticated models for applications such as image and video classification, speech processing, machine translation, and topic modeling over massive data around the globe [1], [2]. In these systems, the participating training workers are hosted in different datacenters connected by scarce, expensive, and unstable wide-area network (WAN) links [2]. To guarantee convergence, workers participating in data-parallel training must periodically synchronize their local training results with AllReduce operations. As confirmed by numerous recent studies, the time spent on parameter synchronization over these cross-datacenter connections dominates the overall cost of geo-distributed training [1], [2], [3]. Accordingly, improving the efficiency of inter-datacenter (inter-DC) AllReduce operations over WAN connections is the key to optimizing the performance of large-scale Geo-DML. A fundamental question then follows: how can we make maximum use of heterogeneous inter-datacenter WAN connections to achieve efficient AllReduce operations for geo-distributed training?
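For context, the sketch below shows a generic BSP data-parallel training step; the names bsp_step, compute_grad, and allreduce are illustrative placeholders rather than any particular framework's API. The blocking allreduce call is the point where every worker must wait on communication, and in Geo-DML that call traverses inter-DC WAN links, which is why it dominates per-iteration time.

```python
import numpy as np

def bsp_step(params, local_batch, compute_grad, allreduce, num_workers, lr=0.01):
    """One BSP iteration: local compute, global synchronization, identical update."""
    grad = compute_grad(params, local_batch)  # local, intra-DC computation
    grad = allreduce(grad) / num_workers      # global barrier over inter-DC WAN links
    return params - lr * grad                 # same averaged update on every worker
```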
