I. Introduction
With the development of technologies such as neural networks and big data, the hardware resource demands of applications keep growing toward large scale, large capacity, and fast computation. Generative pretrained transformer 3 (GPT-3) [2], the third-generation language prediction model in the GPT-n series created by OpenAI, has 175 billion machine learning parameters and was trained on about 45 TB of text data from multiple sources, including Wikipedia and books. Training GPT-3 requires an enormous number of floating-point operations, roughly the equivalent of 3200 TitanX GPUs running for one year. This huge demand for hardware resources creates an urgent need for innovations in hardware architectures. As a result, as shown in Fig. 1, the coarse-grained reconfigurable architecture (CGRA) [3], [4], [5], [6], [7], [8], an energy-efficient and processing-flexible parallel hardware architecture, has been designed. CGRAs combine the flexibility of general-purpose processors (GPPs) with the energy efficiency of application-specific integrated circuits (ASICs), and are popular for multimedia and AI applications [3], [4], [5], [6], [7], [8].
Fig. 1. Architecture of a CGRA with a PEA.