I. Introduction
Machine learning (ML) has drawn considerable attention owing to its potential in a wide range of application areas. Training large ML models is compute-intensive and may involve large amounts of training data. Reducing training time is therefore important for ML applications and directly affects a company's profit [1], [2]. To this end, distributed machine learning (DML) was proposed. Typically, DML partitions the training data and uses a set of workers to perform the training process in parallel. Worker instances are placed on GPUs across different servers, with each GPU used exclusively by one worker. The parameters trained by each worker are aggregated and synchronized periodically. In this way, DML accelerates training by making efficient, parallel use of compute resources. With the development of data centers and cloud computing, DML is now widely used in industry.
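As an illustration of the data-parallel pattern described above, the following minimal sketch partitions a toy dataset across workers, lets each worker train locally, and periodically averages the resulting parameters. It is a single-process simulation under stated assumptions, not the scheme of any particular DML system: the model (a one-parameter linear fit), the averaging rule, and all names (`partition`, `local_train`, `train_dml`, `local_steps`) are hypothetical and chosen only for illustration.

```python
def partition(data, num_workers):
    """Split the training data into one shard per worker (round-robin)."""
    return [data[i::num_workers] for i in range(num_workers)]


def local_train(w, shard, local_steps, lr):
    """Run a few gradient-descent steps on one worker's shard.

    Assumed model: y = w * x with mean squared error loss.
    """
    for _ in range(local_steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)
        w -= lr * grad
    return w


def train_dml(data, num_workers=4, rounds=50, local_steps=5, lr=0.01):
    """Simulate periodic parameter aggregation across workers."""
    shards = partition(data, num_workers)
    w_global = 0.0
    for _ in range(rounds):
        # In a real deployment these calls would run in parallel, one per GPU;
        # here they run sequentially in a single process.
        local_params = [
            local_train(w_global, shard, local_steps, lr) for shard in shards
        ]
        # Aggregate and synchronize: every worker restarts from the average.
        w_global = sum(local_params) / len(local_params)
    return w_global


# Toy data generated from y = 3x (an assumption made for this example).
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w = train_dml(data)
```

On this toy dataset the averaged parameter approaches the true slope of 3. The `local_steps` value controls how often workers synchronize, which in a real deployment trades communication cost against how far the workers' parameters drift apart between aggregations.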