I. Introduction
Over the past few decades, deep learning has become the dominant machine learning approach and has achieved remarkable success in application fields ranging from machine translation [1] and image processing [2] to speech recognition [3] and many others. As the scale and complexity of data sets and neural network models grow, training deep neural networks becomes increasingly difficult. For example, a single network model can contain as many as a billion parameters [4]. Training such complex neural networks is very costly in both time and computing resources. Therefore, how to reduce training time while maintaining satisfactory accuracy has become a central topic in deep learning research.