I. Introduction
Deep learning (DL) has revolutionized application domains such as image processing, autonomous driving, and remote healthcare. However, DL is both compute- and data-intensive. As a result, a major portion of DL computation (e.g., training) has traditionally remained confined to datacenters. CPU- and GPU-based manycore architectures are the most common hardware choice for deep learning applications. However, general-purpose CPU- and GPU-based systems are not customized for deep learning and often suffer from (a) high area and power overheads and (b) memory bottlenecks. These limitations of traditional manycore systems have motivated many studies aimed at developing the next generation of DL accelerators.