I. Introduction
Deep neural networks (DNNs) are artificial neural networks with more than three layers (i.e., more than one hidden layer), which progressively extract higher-level features from the raw input during learning. They have delivered state-of-the-art accuracy on various real-world problems, such as image classification, face recognition, and language translation [1]. The superior accuracy of DNNs, however, comes at the cost of high computational and space complexity. For example, the VGG-16 model [2] has about 138 million parameters, requiring over 500 MB of memory for storage and 15.5G multiply-and-accumulate operations (MACs) to process a single 224 × 224 input image. In myriad application scenarios, it is desirable to perform inference on edge devices rather than in the cloud, to reduce latency, lessen the dependence on connectivity, and improve privacy and security. However, many edge devices that run DNN inference have stringent limitations on energy consumption, memory capacity, etc. Large-scale DNNs [3], [4] are thus difficult to deploy on such devices, which hinders their wide application.
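As a rough sanity check on the figures cited above, the following sketch reproduces the storage estimate (assuming 32-bit floating-point weights, i.e., 4 bytes per parameter) and the MAC count of VGG-16's first convolutional layer using the standard per-layer formula for convolutions; the helper `conv_macs` is illustrative and not taken from any cited work.

```python
# Back-of-the-envelope check of the VGG-16 numbers cited above,
# assuming 32-bit floating-point (4-byte) weights.

NUM_PARAMS = 138_000_000   # ~138 million parameters (commonly reported for VGG-16)
BYTES_PER_PARAM = 4        # float32 storage

storage_mb = NUM_PARAMS * BYTES_PER_PARAM / 1e6
print(f"Weight storage: {storage_mb:.0f} MB")  # ~552 MB, i.e., over 500 MB


def conv_macs(h_out: int, w_out: int, c_in: int, c_out: int, k: int) -> int:
    """MACs for one convolutional layer with a square k x k kernel:
    each of the h_out * w_out output positions computes c_out dot
    products over a c_in * k * k receptive field."""
    return h_out * w_out * c_in * c_out * k * k


# First VGG-16 conv layer: 224 x 224 output, 3 -> 64 channels, 3 x 3 kernel.
print(f"conv1_1 MACs: {conv_macs(224, 224, 3, 64, 3) / 1e6:.1f} M")  # ~86.7 M
```

Summing this formula over all 13 convolutional layers, plus the dot products of the 3 fully connected layers, yields the roughly 15.5G MACs per 224 × 224 image quoted above.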