Job-aware Communication Scheduling for DML Training in Shared Cluster

Abstract:

Distributed machine learning (DML) systems equipped with multiple computing nodes have been widely adopted in industry to accelerate the training of large models. To maximize resource utilization, a critical problem is how to schedule the communication of DML jobs efficiently. However, previous approaches work well only when a job has exclusive use of the network resources. Training multiple jobs in a shared cluster without scheduling causes significant performance degradation due to network contention. In this paper, we propose JCS, a job-aware communication scheduler, to overcome these problems. JCS profiles the priority among jobs with a novel metric and schedules the communication of jobs according to both computation and communication information. To demonstrate the effectiveness of our algorithm, we perform extensive simulations with DML job traces. The simulation results show that our algorithm reduces the average job completion time by 19%, 39% and 46% compared with RRSP, SCF and LCoF, respectively.

Published in: 2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)

Date of Conference: 14-16 December 2020

Date Added to IEEE Xplore: 26 April 2021

DOI: 10.1109/HPCC-SmartCity-DSS50907.2020.00058

Conference Location: Yanuca Island, Cuvu, Fiji

I. Introduction

Machine learning (ML) technology has drawn huge attention due to its great potential in various application areas. Training large ML models is compute-intensive and may involve a large amount of training data. Reducing the training time is important for ML applications and directly affects the profit of a company [1] [2]. To this end, distributed machine learning (DML) was proposed. Typically, DML partitions the training data and uses a set of workers to perform the training process in parallel. Worker instances are placed exclusively on GPUs in different servers. The parameters trained by each worker are aggregated and synchronized periodically. In this way, DML accelerates the training process by utilizing compute resources efficiently. With the development of data centers and cloud computing, DML is now widely used in industry.
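To make the data-parallel pattern described above concrete, the following is a minimal single-process sketch, not the system studied in this paper. It assumes a toy one-parameter model y = w * x, a round-robin data split, and synchronous gradient averaging on every step; the function names (partition, local_gradient, train) and all hyperparameters are hypothetical and chosen only for illustration. In a real DML job the per-worker gradients would be computed on separate GPUs, and the averaging step is where network communication occurs.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) training pair


def partition(data: List[Sample], num_workers: int) -> List[List[Sample]]:
    """Split the training data into one shard per worker (round-robin).

    Assumes len(data) >= num_workers so that no shard is empty.
    """
    return [data[i::num_workers] for i in range(num_workers)]


def local_gradient(w: float, shard: List[Sample]) -> float:
    """Gradient of the mean squared error of y = w * x on one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def train(data: List[Sample], num_workers: int = 4,
          lr: float = 0.01, steps: int = 200) -> float:
    """Synchronous data-parallel gradient descent, simulated in one process."""
    shards = partition(data, num_workers)
    w = 0.0  # every worker starts from the same parameter value
    for _ in range(steps):
        # Computation phase: each worker computes a gradient on its own shard.
        grads = [local_gradient(w, shard) for shard in shards]
        # Communication phase: gradients are aggregated (averaged) and the
        # updated parameter is shared with all workers.
        w -= lr * sum(grads) / len(grads)
    return w


# Toy data generated from y = 3x; the trained parameter should approach 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w = train(data)
```

Each loop iteration alternates a computation phase with a communication phase. When several such jobs share a cluster, their communication phases compete for the same network links, which is the contention that communication scheduling addresses.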
