I. Introduction
We are witnessing increasingly widespread adoption of deep learning across a plethora of application domains. The unprecedented success of deep learning is powered in large part by rapid model innovations, which in turn depend critically on algorithms and systems support for training. One such innovation, distributed deep learning, i.e., training deep neural networks on a cluster of GPU servers, is increasingly leveraged to train complex models on larger datasets. In particular, SGD-based optimization has emerged as the de facto approach to distributed training: it provides the basis for parallelizing training jobs and allows deep learning practitioners to evaluate different model variants quickly.
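To make the data-parallel pattern behind SGD-based distributed training concrete, the following is a minimal sketch in which the cluster is simulated by a loop over worker shards: each worker computes a gradient on its own data shard, the gradients are averaged (the all-reduce step), and every model replica applies the same update. The toy least-squares model, worker count, and learning rate are illustrative assumptions, not details from this paper.

```python
# Minimal sketch of data-parallel SGD, with the "cluster" simulated as a loop
# over worker shards. The dataset, num_workers, lr, and steps are illustrative
# assumptions for this example only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 16))                              # toy features
y = X @ rng.normal(size=16) + 0.1 * rng.normal(size=1024)    # toy targets

num_workers, lr, steps = 4, 0.1, 100
w = np.zeros(16)                                             # model replicated on every worker
shards = np.array_split(np.arange(1024), num_workers)        # one data shard per worker

for _ in range(steps):
    # Each worker computes a gradient on its own shard (in parallel on a real cluster).
    grads = []
    for idx in shards:
        Xi, yi = X[idx], y[idx]
        grads.append(2.0 * Xi.T @ (Xi @ w - yi) / len(idx))  # grad of mean squared error
    # All-reduce: average the per-worker gradients, then apply one identical update everywhere.
    w -= lr * np.mean(grads, axis=0)

print("training loss:", float(np.mean((X @ w - y) ** 2)))
```

In a real deployment the inner loop runs concurrently across GPU servers and the averaging is performed by a collective communication step, but the synchronous SGD semantics are the same as in this sketch.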