1 Introduction
There is currently huge interest in accelerating Convolutional Neural Network (CNN) inference on mobile devices, for tasks such as image classification, detection, and segmentation. CNN layers are typically implemented by lowering 2D convolution to general matrix multiply (GEMM) kernels, which are the runtime bottleneck when executed on CPUs, motivating hardware acceleration. The systolic array (SA) is a special-purpose processor for efficiently accelerating GEMM. An SA consists of an array of multiply-accumulate (MAC) processing elements (PEs) that exchange operands and results through local register-to-register communication only, making the array highly efficient and easily scalable without timing degradation. These advantages have led to the deployment of SAs in commercial products, e.g., the Google Tensor Processing Unit (TPU) [1].
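To illustrate the lowering of 2D convolution to GEMM mentioned above, the following is a minimal NumPy sketch of the standard im2col approach (an illustration, not code from this work; the function name and shapes are our own choices). Each input patch becomes one column of a matrix, so the whole convolution reduces to a single matrix multiply of the flattened filters against the patch matrix.

```python
import numpy as np

def im2col_conv2d(x, w):
    """Lower a 2D convolution (no padding, stride 1) to one GEMM.

    x: input feature map, shape (C, H, W)
    w: filters, shape (K, C, R, S)
    Returns output of shape (K, H-R+1, W-S+1).
    """
    C, H, W = x.shape
    K, _, R, S = w.shape
    Ho, Wo = H - R + 1, W - S + 1
    # im2col: the receptive field of each output position
    # is flattened into one column.
    cols = np.empty((C * R * S, Ho * Wo))
    for i in range(Ho):
        for j in range(Wo):
            cols[:, i * Wo + j] = x[:, i:i + R, j:j + S].ravel()
    # GEMM: (K, C*R*S) @ (C*R*S, Ho*Wo) -- this matrix multiply is
    # the kernel a systolic array would accelerate.
    out = w.reshape(K, -1) @ cols
    return out.reshape(K, Ho, Wo)
```

The resulting GEMM has dimensions K x (C*R*S) x (Ho*Wo), which is exactly the dense, regular workload that maps well onto the local, register-to-register dataflow of a systolic array.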