I. Introduction
Neural networks (NNs) are critical drivers of new technologies such as image processing and speech recognition. Modern NNs have millions of trainable parameters [1], [2], demanding large amounts of memory and computational resources. This makes training difficult to perform on-chip. As a result, most hardware architectures for NNs perform training off-chip on CPUs/GPUs or in the cloud, and support only inference on the final FPGA/ASIC device [3]–[5]. Unfortunately, off-chip training yields a non-reconfigurable on-chip network that cannot support training-time optimizations over model structure and hyperparameters. This severely hinders the development of independent NN devices that a) dynamically adapt themselves to new models and data, and b) do not depend on costly, possibly insecure cloud resources or power-hungry data centers for training.