I. Introduction
The convolutional neural network (CNN) has become a promising machine learning engine for image-oriented data analytics [1]. GPU-based CNN accelerators currently dominate in practice. They achieve high throughput in convolution, but at the cost of high power consumption. FPGA-based CNN accelerators have also been investigated for their energy efficiency benefits [2], but they offer limited parallelism and require reduced numeric precision. Moreover, image-oriented computing requires a large amount of data to be held in memory, which incurs significant leakage power consumption. As such, both the CNN algorithm and the underlying hardware platform need to be re-examined to achieve high energy efficiency as well as high throughput in convolution.