Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning


Abstract:

Deep learning is a popular machine learning technique and has been applied to many real-world problems, ranging from computer vision to natural language processing. However, training a deep neural network is very time-consuming, especially on big data. It has become difficult for a single machine to train a large model over large datasets. A popular solution is to distribute and parallelize the training process across multiple machines using the parameter server framework. In this paper, we present a distributed paradigm on the parameter server framework called Dynamic Stale Synchronous Parallel (DSSP), which improves the state-of-the-art Stale Synchronous Parallel (SSP) paradigm by determining the staleness threshold dynamically at run time. Conventionally, to run distributed training with SSP, the user must specify a staleness threshold as a hyper-parameter. However, users rarely know how to set this threshold and often find a value through trial and error, which is time-consuming. Based on workers' recent processing times, DSSP adaptively adjusts the threshold at each iteration during training to reduce the time faster workers spend waiting for synchronization of the globally shared parameters (the weights of the model). This increases the frequency of parameter updates (i.e., iteration throughput), which speeds up convergence. We compare DSSP with other paradigms, namely Bulk Synchronous Parallel (BSP), Asynchronous Parallel (ASP), and SSP, by training deep neural network (DNN) models on GPU clusters in both homogeneous and heterogeneous environments. The results show that in a heterogeneous environment, where the cluster consists of mixed models of GPUs, DSSP converges to a higher accuracy much earlier than SSP and BSP and performs similarly to ASP. In a homogeneous distributed cluster, DSSP has more stable and slightly better performance than SSP and ASP, and converges much faster than BSP.
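
The core mechanism summarized above is that each worker's recent per-iteration processing time drives a staleness bound chosen anew at every iteration. The Python sketch below is only a rough illustration of that idea under assumptions made here (the class name DynamicStalenessController, the parameters base_threshold, max_extra and window, and the particular slack rule are all invented for illustration); the paper defines the actual DSSP algorithm.

from collections import defaultdict, deque

class DynamicStalenessController:
    """Illustrative (not the paper's) per-worker staleness controller.

    SSP lets a worker at iteration t proceed only while
    t - min(all workers' iterations) <= threshold; here the threshold is
    picked per request from [base_threshold, base_threshold + max_extra]
    using the workers' recent per-iteration processing times.
    """

    def __init__(self, base_threshold=3, max_extra=12, window=5):
        self.base = base_threshold
        self.max_extra = max_extra
        self.iter_times = defaultdict(lambda: deque(maxlen=window))
        self.clock = defaultdict(int)  # latest completed iteration per worker

    def report(self, worker_id, iteration, seconds):
        """A worker reports the wall-clock time of its last iteration."""
        self.clock[worker_id] = iteration
        self.iter_times[worker_id].append(seconds)

    def threshold(self, worker_id):
        """Staleness bound applied to this worker's next parameter pull."""
        avg = {w: sum(t) / len(t) for w, t in self.iter_times.items() if t}
        if len(avg) < 2:
            return self.base
        slowest = max(avg.values())
        mine = avg.get(worker_id, slowest)
        # A fast worker (small average iteration time) receives extra slack
        # so it is not blocked waiting for stragglers; slow workers get none.
        extra = min(self.max_extra, int(round((slowest / mine - 1.0) * self.base)))
        return self.base + max(0, extra)

    def may_proceed(self, worker_id):
        """SSP-style check: allow the pull unless the worker is too far ahead."""
        slowest_iter = min(self.clock.values()) if self.clock else 0
        return self.clock[worker_id] - slowest_iter <= self.threshold(worker_id)

In this sketch a worker would call report() after finishing an iteration and may_proceed() before pulling fresh weights; setting max_extra = 0 recovers plain SSP with a fixed threshold, and base_threshold = max_extra = 0 corresponds to BSP.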
Date of Conference: 07-10 July 2019
Conference Location: Dallas, TX, USA

I. Introduction

The parameter server framework [1] [2] has been developed to support distributed training of large-scale machine learning (ML) models (such as deep neural networks [3]–[5]) over very large datasets, such as Microsoft COCO [6], ImageNet 1K [3] and ImageNet 22K [7]. Training a deep model on a large-scale cluster with an efficient distributed paradigm reduces the training time from weeks on a single server to days or hours.
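
For concreteness, the following self-contained Python sketch simulates the data-parallel parameter-server pattern in a single process: workers pull the globally shared weights, compute gradients on their data shards, and push them back to the server. It is a toy written under assumptions made here (a least-squares "model" and the names ParameterServer, push, pull and worker_step), not the interface of the systems in [1], [2].

import numpy as np

class ParameterServer:
    """Toy in-process stand-in for the server holding the shared weights."""

    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)   # globally shared parameters
        self.lr = lr

    def push(self, grad):        # a worker pushes its gradient
        self.w -= self.lr * grad

    def pull(self):              # a worker pulls the current weights
        return self.w.copy()

def worker_step(w, x_shard, y_shard):
    """One least-squares gradient on a worker's data shard."""
    return x_shard.T @ (x_shard @ w - y_shard) / len(y_shard)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(1000, 2))
y = X @ true_w

server = ParameterServer(dim=2)
shards = np.array_split(np.arange(1000), 4)   # four simulated workers

for epoch in range(100):
    for shard in shards:                      # toy sequential rounds
        w = server.pull()
        server.push(worker_step(w, X[shard], y[shard]))

print(server.w)   # converges toward [2.0, -1.0]

The paradigms compared in this paper differ only in the synchronization rule wrapped around pull(): BSP makes every worker wait for all others each iteration, ASP never waits, and SSP and DSSP wait only when a worker runs ahead of the slowest worker by more than a fixed or dynamically chosen staleness threshold, respectively.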

References
1.
J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le et al., "Large scale distributed deep networks", Advances in neural information processing systems, pp. 1223-1231, 2012.
2.
T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, et al., Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems, 2015.
3.
A. Krizhevsky, I. Sutskever and G. E. Hinton, "Imagenet classification with deep convolutional neural networks", Advances in neural information processing systems, pp. 1097-1105, 2012.
4.
C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna, "Rethinking the inception architecture for computer vision", The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
5.
K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014.
6.
T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, et al., "Microsoft coco: Common objects in context", European conference on computer vision, pp. 740-755, 2014.
7.
T. M. Chilimbi, Y. Suzue, J. Apacible and K. Kalyanaraman, "Project adam: Building an efficient and scalable deep learning training system", OSDI, vol. 14, pp. 571-582, 2014.
8.
M. Li, D. G. Andersen, J. W. Park et al., "Scaling distributed machine learning with the parameter server", OSDI, 2014.
9.
E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, et al., "Petuum: A new platform for distributed machine learning on big data", IEEE Transactions on Big Data, vol. 1, no. 2, pp. 49-67, 2015.
10.
S. Landset, T. M. Khoshgoftaar, A. N. Richter and T. Hasanin, "A survey of open source tools for machine learning with big data in the hadoop ecosystem", Journal of Big Data, vol. 2, no. 1, pp. 24, 2015.
11.
X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen et al., "Mllib: Machine learning in apache spark", The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1235-1241, 2016.
12.
J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang and A. Yuille, Deep captioning with multimodal recurrent neural networks (m-rnn), 2014.
13.
J. Zhou, X. Li, P. Zhao, C. Chen, L. Li, X. Yang, Q. Cui, J. Yu, X. Chen, Y. Ding et al., "Kunpeng: Parameter server based distributed learning systems and its applications in alibaba and ant financial", Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1693-1702, 2017.
14.
S. Lee, J. K. Kim, X. Zheng, Q. Ho, G. A. Gibson and E. P. Xing, "On model parallelization and scheduling strategies for distributed machine learning", Advances in neural information processing systems, pp. 2834-2842, 2014.
15.
Y. Zhou, Y. Yu, W. Dai, Y. Liang and E. Xing, "On convergence of model parallel proximal gradient algorithm for stale synchronous parallel system", Artificial Intelligence and Statistics, pp. 713-722, 2016.
16.
E. P. Xing, Q. Ho, P. Xie and D. Wei, "Strategies and principles of distributed machine learning on big data", Engineering, vol. 2, no. 2, pp. 179-195, 2016.
17.
A. V. Gerbessiotis and L. G. Valiant, "Direct bulk-synchronous parallel algorithms", Journal of parallel and distributed computing, vol. 22, no. 2, pp. 251-267, 1994.
18.
Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, et al., "More effective distributed ml via a stale synchronous parallel parameter server", Advances in neural information processing systems, pp. 1223-1231, 2013.
19.
H. Cui, J. Cipar, Q. Ho, J. K. Kim, S. Lee, A. Kumar, J. Wei, W. Dai, G. R. Ganger, P. B. Gibbons et al., "Exploiting bounded staleness to speed up big data analytics", USENIX Annual Technical Conference, pp. 37-48, 2014.
20.
B. Recht, C. Re, S. Wright and F. Niu, "Hogwild: A lock-free approach to parallelizing stochastic gradient descent", Advances in neural information processing systems, pp. 693-701, 2011.
21.
M. Zinkevich, M. Weimer, L. Li and A. J. Smola, "Parallelized stochastic gradient descent", Advances in neural information processing systems, pp. 2595-2603, 2010.
22.
L. Wang, Y. Yang, R. Min and S. Chakradhar, "Accelerating deep neural network training with inconsistent stochastic gradient descent", Neural Networks, vol. 93, pp. 219-229, 2017.
23.
W. Dai, A. Kumar, J. Wei, Q. Ho, G. A. Gibson and E. P. Xing, High-performance distributed ml at scale through parameter server consistency models, 2015.
24.
J. Wei, W. Dai, A. Qiao, Q. Ho, H. Cui, G. R. Ganger, et al., "Managed communication and consistency for fast data-parallel iterative analytics", Proceedings of the Sixth ACM Symposium on Cloud Computing, pp. 381-394, 2015.
25.
M. Li, D. G. Andersen, A. J. Smola and K. Yu, "Communication efficient distributed machine learning with the parameter server", Advances in Neural Information Processing Systems, pp. 19-27, 2014.
26.
J. Chen, X. Pan, R. Monga, S. Bengio and R. Jozefowicz, Revisiting distributed synchronous sgd, 2016.
27.
S. Hadjis, C. Zhang, I. Mitliagkas, D. Iter and C. Ré, Omnivore: An optimizer for multi-device deep learning on cpus and gpus, 2016.
28.
R. Zhang and J. Kwok, "Asynchronous distributed admm for consensus optimization", International Conference on Machine Learning, pp. 1701-1709, 2014.
29.
SOSCIP GPU, [online] Available: https://docs.scinet.utoronto.ca/index.php/SOSCIP_GPU.
30.
A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images", Citeseer Tech. Rep., 2009.
