I. Introduction
In recent years, there has been growing interest in scaling up machine learning algorithms through parallelization and distribution [1]. Within this resurging field, deep learning has attracted particular attention owing to its many empirical successes, mostly achieved by simply building larger models with more parameters [2]–[6]. Deep learning models often achieve the best accuracies, but their training speed is limited by available hardware. Fortunately, neural network-based algorithms can be parallelized to harness greater computing power and thereby train faster.