I. Introduction
Driven by the increasing complexity of near-sensor data analytics algorithms, the computational performance required of Internet-of-Things (IoT) end-nodes has increased dramatically in the last few years. Near-sensor applications such as convolutional neural network (CNN)-based image analysis and biometric processing must now operate efficiently on large volumes of sensor data captured by microcontrollers, as well as on application parameters such as CNN weights. To cope with this complexity, state-of-the-art systems-on-chip (SoCs) already achieve performance on the order of several giga operations per second (GOPS) within a power envelope on the order of a few mW by exploiting parallelism, Instruction Set Architecture (ISA) specialization, and domain-specific acceleration [1]–[3].