I. Introduction
Nowadays, an increasing number of geo-distributed machine learning (Geo-DML) systems are employed to train large, sophisticated models for applications such as image and video classification, speech processing, machine translation, and topic modeling over massive data around the globe [1], [2]. In these systems, the participating training workers are hosted in different datacenters, which are interconnected by scarce, expensive, and unstable wide-area network (WAN) links [2]. To guarantee convergence, workers in data-parallel training must periodically synchronize their local training results with AllReduce operations. As confirmed by numerous recent studies, the time cost of parameter synchronization over these cross-datacenter links has come to dominate the end-to-end efficiency of geo-distributed training [1], [2], [3]. Accordingly, improving the efficiency of inter-datacenter (inter-DC) AllReduce operations over WAN links is key to optimizing the performance of large-scale Geo-DML. A fundamental question thus follows: how can heterogeneous inter-datacenter WAN connections be utilized to the fullest to achieve efficient AllReduce operations for geo-distributed training?
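To make the AllReduce primitive referred to above concrete, the following minimal sketch simulates the classic ring AllReduce (reduce-scatter followed by all-gather) over a set of workers holding gradient vectors; after the operation, every worker holds the element-wise sum of all vectors. This is an illustration of the general primitive only, not the algorithm of any system cited here, and the function name `ring_allreduce` and list-based "communication" are our own assumptions.

```python
# Illustrative simulation of a ring AllReduce over n workers.
# Each worker holds one gradient vector; after the call, every
# worker holds the element-wise sum of all vectors.
# (Hypothetical sketch; real systems use libraries such as MPI or NCCL.)

def ring_allreduce(grads):
    """grads: list of per-worker gradient lists of equal length."""
    n = len(grads)                          # number of workers
    dim = len(grads[0])
    # Partition each vector into n contiguous chunks.
    bounds = [dim * i // n for i in range(n + 1)]
    # Local working copy for each worker.
    buf = [list(g) for g in grads]

    # Phase 1: reduce-scatter. In step s, worker w sends chunk
    # (w - s) mod n to its ring neighbor, which accumulates it.
    # After n-1 steps, worker w holds the fully reduced chunk (w+1) mod n.
    for s in range(n - 1):
        for w in range(n):
            dst = (w + 1) % n
            c = (w - s) % n
            for i in range(bounds[c], bounds[c + 1]):
                buf[dst][i] += buf[w][i]

    # Phase 2: all-gather. Each fully reduced chunk is passed around
    # the ring; receivers overwrite their stale copy.
    for s in range(n - 1):
        for w in range(n):
            dst = (w + 1) % n
            c = (w + 1 - s) % n
            for i in range(bounds[c], bounds[c + 1]):
                buf[dst][i] = buf[w][i]
    return buf
```

Each worker sends and receives only 2(n-1)/n of the vector in total, which is why ring AllReduce is bandwidth-optimal on homogeneous links; the heterogeneity of WAN links is precisely what complicates this picture in the Geo-DML setting.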