I. Introduction
Decentralized edge computing advocates equipping edge devices, where data are generated locally, with adequate computational capabilities [1]. However, deploying computationally demanding algorithms on these resource-constrained edge nodes is often challenging. A typical example is inner-product (sum-of-products) computation, which is widely used for ‘in-node’ signal pre-processing, conditioning, and feature extraction. Distributed arithmetic (DA) is a promising design alternative that achieves a bit-serial, multiplier-less implementation of inner-product computation [2], with reduced area and power consumption compared to a parallel, multiply-accumulate (MAC) based implementation. Several DA designs have been demonstrated, including finite impulse response (FIR) filters [3], the discrete cosine transform [4], and convolution [5]. However, a major hurdle for employing DA is that the size of its look-up table (LUT) grows exponentially with the length of the inner-product vectors.
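To make the LUT-size issue concrete, the following is a minimal software sketch of the general DA principle. It is not the design proposed in this paper. It assumes fixed integer coefficients and unsigned inputs of `n_bits` bits each, and the names `build_lut` and `da_inner_product` are hypothetical, chosen only for illustration.

```python
def build_lut(coeffs):
    """Precompute every partial sum of the fixed coefficients.

    Bit i of the address selects coeffs[i], so a vector of length K
    needs 2**K entries. This is the exponential growth in LUT size.
    """
    k = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << k)
    ]


def da_inner_product(coeffs, xs, n_bits):
    """Bit-serial inner product of coeffs and xs without multiplications.

    Assumption: every x in xs is an unsigned integer below 2**n_bits.
    """
    lut = build_lut(coeffs)
    acc = 0
    for b in range(n_bits):
        # Gather bit b of every input into one LUT address.
        addr = 0
        for i, x in enumerate(xs):
            addr |= ((x >> b) & 1) << i
        # Weight the looked-up partial sum by 2**b using a shift.
        acc += lut[addr] << b
    return acc


coeffs = [3, -1, 4, 2]
xs = [5, 7, 0, 9]
result = da_inner_product(coeffs, xs, n_bits=4)
```

In this sketch a vector of length 4 needs a 16-entry LUT, while a vector of length 16 would already need 65,536 entries. Handling signed (two's complement) inputs would require subtracting the partial sum for the sign bit, which is omitted here for brevity.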