I. Introduction
In the conventional von Neumann architecture, there is a clear separation between data storage and processing: memories store data, while processors compute on it. Owing to Moore’s law, the computing power of integrated circuits scaled rapidly over the past few decades, as logic gates became faster and the number of processing cores grew steadily, until we hit the “Memory Wall” [1]. The latency and energy of on-chip global interconnects have not kept pace with the scaling of logic gates, so computation throughput and energy are now dominated by memory bandwidth and data-movement energy. As shown in Fig. 1(a), the aggregate bandwidth at the I/Os of all SRAM banks inside a large memory macro, such as a 20-MB L3 cache, exceeds a hundred TB per second [2], [3] and is comparable to the theoretical maximum computation bandwidth of a state-of-the-art systolic processing array [4]. Hence, the bottleneck lies in the local data network inside the memory macro and the global data bus on chip. Furthermore, a large fraction of today’s energy consumption is spent moving data back and forth between memory and compute units [5]. As shown in Fig. 1(b), a 32-bit addition takes only sub-picojoule energy, while tens of picojoules are spent retrieving data from distant memory banks.
Fig. 1. Bottlenecks in the conventional von Neumann architecture. (a) Low on-chip network bandwidth. (b) High data-movement energy.
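To make the magnitude of these two bottlenecks concrete, the following back-of-envelope sketch reproduces the comparison in rough numbers. Every parameter below (bank count, port widths, clock rates, array dimensions, and per-operation energies) is an illustrative assumption chosen to match the orders of magnitude cited above, not a value taken from the paper or its references.

```python
# Back-of-envelope sketch (illustrative only): all parameters are assumptions.

# --- Aggregate I/O bandwidth of all SRAM banks in a large memory macro ---
num_banks = 512          # assumed number of SRAM banks in a 20-MB L3 cache
bytes_per_access = 64    # assumed 512-bit I/O port per bank
bank_clock_hz = 4e9      # assumed 4 GHz bank access rate
sram_bw = num_banks * bytes_per_access * bank_clock_hz   # bytes per second
print(f"Aggregate SRAM bank I/O bandwidth: {sram_bw / 1e12:.0f} TB/s")

# --- Theoretical peak operand bandwidth of a systolic processing array ---
rows, cols = 256, 256    # assumed 256 x 256 MAC array
operand_bytes = 2        # assumed two 8-bit operands consumed per MAC per cycle
array_clock_hz = 1e9     # assumed 1 GHz array clock
systolic_bw = rows * cols * operand_bytes * array_clock_hz
print(f"Systolic array peak operand bandwidth: {systolic_bw / 1e12:.0f} TB/s")

# --- Energy ratio: data movement vs. computation ---
add_energy_pj = 0.1      # assumed sub-pJ energy for a 32-bit addition
fetch_energy_pj = 20.0   # assumed tens of pJ to fetch 32 bits from a far bank
print(f"Data movement / compute energy ratio: "
      f"{fetch_energy_pj / add_energy_pj:.0f}x")
```

Under these assumed parameters, both bandwidths land near 130 TB/s, consistent with the claim that the memory macro’s internal I/Os can feed a systolic array in principle; it is the narrower local network and global bus between them that throttles throughput, while the roughly 200x energy gap between a fetch and an add drives the overall energy budget.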