I. Introduction
Research on developing accelerators for training deep neural networks (DNNs) has attracted significant interest. Several potential applications, such as autonomous navigation, health care, and mobile devices, require learning in the field while adhering to strict memory and energy budgets. DNN training demands substantial time, compute, and memory. The two most expensive computations in DNNs are the matrix-vector multiplication (MVM) and the vector-vector outer product (VVOP), both of which require N² multiplications for a layer with a weight matrix of size N×N. Several strategies to improve the efficiency of MVM computation have been proposed with minimal impact on training accuracy. These strategies leverage either low-precision digital representations [1], [2] or crossbar architectures [3], [4], [5], [6]. Reduced-precision implementations of MVM have been shown to perform sufficiently well for DNN training [1], [2], [7], [8], [9], [10].
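To make the cost claim concrete, the following sketch (an illustration under assumed names and shapes, not code from any cited work) shows the two dominant per-layer computations for a fully connected layer with an N×N weight matrix: the forward-pass MVM and the weight-gradient VVOP, each costing N² scalar multiplications.

```python
import numpy as np

N = 4
rng = np.random.default_rng(0)
W = rng.standard_normal((N, N))   # layer weight matrix (N x N)
x = rng.standard_normal(N)        # layer input vector
delta = rng.standard_normal(N)    # backpropagated error signal (assumed given)

# 1) MVM: forward pass y = W x  -> N*N scalar multiplications
y = W @ x

# 2) VVOP: weight gradient dW = delta x^T -> also N*N scalar multiplications
dW = np.outer(delta, x)

mvm_mults = N * N
vvop_mults = N * N
total_mults = mvm_mults + vvop_mults
```

Since both operations scale as N², an accelerator that cheapens only the MVM still leaves roughly half of the multiplication cost of training untouched, which is what motivates efficient VVOP schemes as well.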