I. Introduction
Recently, deep neural networks have achieved superior performance in a variety of applications, such as computer vision [1]–[4] and natural language processing [5], [6]. However, this high performance has come with architectures that grow ever deeper and wider, incurring a high cost in computation and memory at inference time. This makes it a great burden to deploy such models on edge-computing systems, such as embedded devices and mobile phones. Therefore, many methods [7]–[11] have been proposed to reduce the computational complexity and storage requirements of deep neural networks. Lightweight networks, such as Inception [12], MobileNet [13], ShuffleNet [14], SqueezeNet [15], and CondenseNet [16], have been proposed to reduce the network size as much as possible while maintaining high recognition accuracy. All of the above-mentioned methods focus on physically reducing the internal redundancy of the model to obtain a shallow and thin architecture. Nevertheless, how to train such a reduced network to achieve high performance remains an unresolved issue.