I. Introduction
Numerous emerging smart applications (e.g., IoT, wearables, and drones) demand on-chip continuous learning, compelling the development of application-specific memories and architectures. These applications often require implementing learning algorithms for large network models in an energy-efficient manner. Conventional digital memory solutions based on SRAM or DRAM cannot meet the required density and energy targets because of their large area and restrictive off-chip memory access costs. High-end, expensive graphics processing units (GPUs) have been the default choice for DNN training, but the energy and time required to train state-of-the-art DNN architectures on GPUs are high [1]. This necessitates the development of more energy- and area-efficient custom hardware accelerators for deep learning training workloads. A few ASIC processors for DNN training have recently been reported [2], [3], [4], but they rely on conventional SRAM for on-chip storage, which requires a large number of memory accesses and suffers from density and leakage power constraints.