I. Introduction
In recent years, machine learning (ML), and deep learning in particular, has achieved remarkable success across a wide range of fields, including machine vision [1], natural language processing [2], weather prediction [3], content generation [4], and game playing [5]. As ML advances, increasingly sophisticated models are continually proposed, and both model sizes and training datasets are growing explosively. To complete model training within a reasonable time, distributed machine learning (DML), especially under the data-parallel paradigm, has become indispensable. However, simply enlarging the cluster to increase compute capacity often fails to deliver proportional performance gains. In data-parallel distributed training, to guarantee convergence of the global model, workers must periodically synchronize their locally computed gradients or updated model parameters [6]. As confirmed by recent studies [7], [8], [9], [10], as the scale of the training cluster increases, the communication cost of model synchronization gradually becomes a prominent performance bottleneck for the overall training process.
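To make the synchronization step concrete, the following minimal sketch simulates data-parallel SGD on a toy linear-regression task: each simulated worker computes a gradient on its own data shard, the gradients are averaged (the role played by an all-reduce or a parameter server in a real DML system), and every replica applies the same update. All names, shard sizes, and hyperparameters here are illustrative assumptions, not the systems or settings studied in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_WORKERS, DIM, STEPS, LR = 4, 10, 100, 0.1   # hypothetical toy settings

# Shared linear-regression task; each worker holds a private data shard.
true_w = rng.normal(size=DIM)
shards = []
for _ in range(NUM_WORKERS):
    X = rng.normal(size=(64, DIM))
    y = X @ true_w + 0.01 * rng.normal(size=64)
    shards.append((X, y))

w = np.zeros(DIM)                                # model replicated on every worker
for step in range(STEPS):
    # 1) Each worker computes a local gradient on its own shard.
    local_grads = []
    for X, y in shards:
        err = X @ w - y
        local_grads.append(X.T @ err / len(y))
    # 2) Synchronization: average gradients across workers
    #    (this is the communication step that becomes the bottleneck at scale).
    global_grad = np.mean(local_grads, axis=0)
    # 3) Every replica applies the identical update, keeping the models consistent.
    w -= LR * global_grad

print("parameter error:", np.linalg.norm(w - true_w))
```

In a real cluster this averaging is carried out over the network every iteration (or every few iterations), which is why its cost grows into the dominant bottleneck as the number of workers increases.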