I. Introduction
As DNNs become deeper and more complicated, the number of operations and parameters also increases significantly. For example, VGG-16 [1], a representative DNN, has 138 million parameters, which amount to 138 MB of storage with 8-bit precision for weights and activations. As a result, data movement between the computing units and the memory units becomes the bottleneck. In particular, expensive off-chip DRAM accesses occur frequently due to the limited on-chip SRAM buffer size for DNN workloads. There have been many research efforts on the design of application-specific integrated circuit (ASIC) accelerators such as Eyeriss [2] and TPU [3], where the parameters are stored in a global buffer and the computation is still performed by digital multiply-and-accumulate (MAC) arrays.

Compute-in-memory (CIM) is an efficient paradigm to address the memory wall problem in DNN hardware acceleration [4]. The convolution operation essentially consists of vector-matrix multiplications (VMMs), which take up most of the computation in DNNs. The crossbar structure supports analog VMM by activating multiple rows and performing current summation along the bit lines (BLs), as sketched in the formulation at the end of this section. Emerging non-volatile memories (eNVMs) such as phase change memory (PCM) [5] and resistive random-access memory (RRAM) [6] provide promising solutions for designing CIM-based accelerators owing to their smaller cell size than SRAM at the same technology node.

Although these eNVM-based CIM architectures are promising, grand challenges remain in designing a practical CIM accelerator that supports both training and inference. First, most of the CIM architectures proposed so far, such as PRIME [7] and ISAAC [8], support inference only; the data flow for training with CIM is largely unexplored. Second, the impact of ADC resolution on training/inference accuracy is rarely examined. Third, the asymmetric and nonlinear conductance tuning of eNVM cells introduces significant training accuracy loss [9], making it difficult to utilize their multilevel states for in-situ training.
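As a brief illustration of the crossbar VMM mentioned above (the notation here is introduced only for exposition and is not tied to any specific architecture cited in this section), applying read voltages on the word lines and sensing the summed currents on the bit lines realizes a vector-matrix product in the analog domain:

\[
I_j = \sum_{i} G_{ij} \, V_i ,
\]

where $V_i$ is the read voltage applied on word line $i$ (encoding the input activation), $G_{ij}$ is the programmed conductance of the eNVM cell at the crossing of word line $i$ and bit line $j$ (encoding the weight), and $I_j$ is the current accumulated along bit line $j$ by Kirchhoff's current law, which is subsequently digitized by an ADC.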