I. Introduction
Deep convolutional neural networks (CNNs) have achieved unprecedented success in artificial intelligence (AI) over the past decade. However, the intensive computation required even for inference makes it challenging to deploy pre-trained models on resource-constrained edge devices. The essential and computationally dominant operation in CNN models, the convolution, demands an overwhelming number of multiply-and-accumulate (MAC) operations along with excessive on-/off-chip memory accesses. It is well known that the energy bottleneck of such computation lies in data movement rather than in the arithmetic itself, a phenomenon known as the memory wall [1].
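To make the scale of the MAC workload concrete, the following sketch applies the standard per-layer MAC count formula (output height × output width × output channels × kernel height × kernel width × input channels) to a hypothetical layer shape; the specific dimensions are illustrative assumptions, not taken from the paper.

```python
def conv_macs(h_out: int, w_out: int, c_in: int, c_out: int, k: int) -> int:
    """MAC count for one 2-D convolutional layer with a k x k kernel.

    Each of the h_out * w_out * c_out output elements requires
    k * k * c_in multiply-and-accumulate operations.
    """
    return h_out * w_out * c_out * (k * k * c_in)

# Hypothetical ResNet-like layer: 3x3 kernel, 64 -> 128 channels,
# 56x56 output feature map.
macs = conv_macs(h_out=56, w_out=56, c_in=64, c_out=128, k=3)
print(f"{macs:,} MACs")  # roughly 2.3e8 MACs for this single layer
```

A single layer of this shape already requires on the order of hundreds of millions of MACs, and a full network stacks dozens of such layers, which is why data movement for weights and activations dominates the energy budget.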