I. Introduction
Machine learning has demonstrated great promise in a wide range of application domains, such as autonomous driving, smart cities, and natural language processing, and is fundamentally altering the way individuals and organizations live, work, and interact [2]–[4]. With the rapid growth of training data and machine learning model sizes, how to efficiently train machine learning models in a distributed manner has received much attention, since computation can be parallelized across multiple nodes. A widely used framework in distributed machine learning is data parallelism over the Parameter Server (PS) architecture, i.e., data are distributed over multiple workers and a global model is cooperatively optimized under the coordination of servers [5], [6]. As for the training algorithm running on the PS architecture, distributed Stochastic Gradient Descent (D-SGD) is usually adopted because it is applicable to a wide range of model optimization problems and has proven efficiency in terms of scalability [5], [7], [8].
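To make the data-parallel PS workflow concrete, the following is a minimal sketch of synchronous D-SGD: workers compute stochastic gradients on their local data shards, and the server aggregates them to update the global model. The least-squares model, worker count, and hyperparameters are illustrative assumptions, not the setup considered in this paper.

```python
# Minimal sketch of synchronous D-SGD under a parameter-server pattern.
# The loss, model, and worker count below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
num_workers, dim, lr = 4, 10, 0.1

# Synthetic linear-regression data, sharded evenly across workers.
X = rng.normal(size=(400, dim))
y = X @ rng.normal(size=dim) + 0.01 * rng.normal(size=400)
shards = list(zip(np.array_split(X, num_workers), np.array_split(y, num_workers)))

w = np.zeros(dim)  # global model held by the parameter server

def local_gradient(w, X_k, y_k, batch=32):
    """Worker-side stochastic gradient on a mini-batch of its own shard."""
    idx = rng.choice(len(X_k), size=batch, replace=False)
    Xb, yb = X_k[idx], y_k[idx]
    return Xb.T @ (Xb @ w - yb) / batch

for step in range(100):
    # Each worker pulls the current model and computes a local stochastic gradient.
    grads = [local_gradient(w, Xk, yk) for Xk, yk in shards]
    # The server aggregates the workers' gradients and updates the global model.
    w -= lr * np.mean(grads, axis=0)
```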