1 Introduction
There is currently huge interest in accelerating Convolutional Neural Network (CNN) inference on mobile devices, for tasks such as image classification, detection, and segmentation. CNN layers are typically implemented by lowering 2D convolution to general matrix multiply (GEMM) kernels, which are the runtime bottleneck when executed on CPUs, motivating hardware acceleration. The systolic array (SA) is a special-purpose processor for efficiently accelerating GEMM. An SA consists of an array of multiply-accumulate (MAC) processing elements (PEs) that exchange operands and results through local register-to-register communication only, making the array highly efficient and easily scalable without timing degradation. These advantages have led to the deployment of SAs in commercial products, e.g., the Google Tensor Processing Unit (TPU) [1].
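To illustrate the lowering of 2D convolution to GEMM mentioned above, the following is a minimal NumPy sketch of the standard im2col approach (an illustration, not code from this work; the function name and shapes are our own choices). Each input patch becomes one column of a matrix, so the whole convolution reduces to a single matrix multiply of the flattened filters against the patch matrix.

```python
import numpy as np

def im2col_conv2d(x, w):
    """Lower a 2D convolution (no padding, stride 1) to one GEMM.

    x: input feature map, shape (C, H, W)
    w: filters, shape (K, C, R, S)
    Returns output of shape (K, H-R+1, W-S+1).
    """
    C, H, W = x.shape
    K, _, R, S = w.shape
    Ho, Wo = H - R + 1, W - S + 1
    # im2col: the receptive field of each output position
    # is flattened into one column.
    cols = np.empty((C * R * S, Ho * Wo))
    for i in range(Ho):
        for j in range(Wo):
            cols[:, i * Wo + j] = x[:, i:i + R, j:j + S].ravel()
    # GEMM: (K, C*R*S) @ (C*R*S, Ho*Wo) -- this matrix multiply is
    # the kernel a systolic array would accelerate.
    out = w.reshape(K, -1) @ cols
    return out.reshape(K, Ho, Wo)
```

The resulting GEMM has dimensions K x (C*R*S) x (Ho*Wo), which is exactly the dense, regular workload that maps well onto the local, register-to-register dataflow of a systolic array.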