
An Energy-Efficient Coarse-Grained Reconfigurable Processing Unit for Multiple-Standard Video Decoding


Abstract:

A coarse-grained reconfigurable processing unit (RPU) consisting of 16 × 16 multi-functional processing elements (PEs) interconnected by an area-efficient line-switched mesh connect (LSMC) routing is implemented on a 5.4 mm × 3.1 mm die in TSMC 65 nm LP1P8M CMOS technology. A hierarchical configuration context (HCC) organization scheme is proposed to reduce the implementation overhead and the energy dissipated on fast reconfiguration. The proposed RPU is integrated into two systems-on-chip (SoCs) targeting multiple-standard video decoding. The high-performance chip, comprising two RPU processors (named REMUS_HPP), can decode 1920 × 1080 H.264 video streams at 30 frames per second (fps) at 200 MHz. REMUS_HPP achieves a 25% performance gain over the XPP-III reconfigurable processor with only 280 mW power consumption, resulting in a 14.3× improvement in energy efficiency. The other chip (named REMUS_LPP), targeting low-power applications, integrates a single RPU processor. REMUS_LPP can decode 720 × 480 H.264 video streams at 35 fps with 24.5 mW at 75 MHz, achieving a 76% reduction in power dissipation and a 3.96× improvement in energy efficiency compared with the ADRES reconfigurable processor.
Published in: IEEE Transactions on Multimedia (Volume: 17, Issue: 10, October 2015)
Page(s): 1706 - 1720
Date of Publication: 03 August 2015


SECTION I.

Introduction

The rapid evolution of today's embedded computing and portable, handheld, wireless products is pushing the boundaries of computer design and manufacturing technologies. One of the most significant trends is that consumer markets are demanding products with ever more features and sophistication. Technology should help these products consume less power while delivering greater computational capacity and density for emerging communication and multimedia processing demands.

Traditionally, application-specific DSPs or FPGAs cooperate with General Purpose Processors (GPPs) to accelerate computation and improve the energy efficiency of the overall system. However, they prevail in only one or two domains of the large design space of speed, power, cost, and flexibility [1]. Most FPGAs provide low-level fine-grained parallelism with a high degree of flexibility, but pay a power penalty and can only be configured at power-up [2]. In contrast, although DSPs are easy to program and have relatively low power consumption, their computational throughput is often intrinsically limited by the sequential fetch-and-execute instruction-based architecture, known as the von Neumann bottleneck [3].

Coarse-grained reconfigurable computing fabrics fall between DSPs and FPGAs, offering several capabilities and features that cannot be simultaneously addressed by the aforementioned (or other) implementation technologies. High-level programming languages (such as C or C++) are often used in the development tool chain, providing fast time-to-market similar to DSPs and GPPs. These processors also possess clusters of reconfigurable computational elements capable of performing various arithmetic operations, so extremely high computational throughput can be achieved through multiple levels of parallelism.

Several designs of coarse-grained reconfigurable processors have been presented in the literature during the last decade. A dynamically reconfigurable processor named DAPDNA-2 targeting multimedia applications is proposed in [4], [5]. In total, 376 coarse-grained data processing units and I/O units are closely coupled with a central Reduced Instruction Set Computer (RISC) processor through a specially designed on-chip diorama network. The implemented chip achieves performance comparable to Intel's Pentium 4 processor at a clock frequency of 166 MHz for typical signal and image processing applications. XPP-III, reported in [6], [7], is another example of a high-performance commercial reconfigurable multimedia processor. It consists of 56 Processing Elements (PEs) and 8 Very Long Instruction Word (VLIW) RISC cores. The XPP-III architecture utilizes a fast configuration mechanism that can dynamically switch preloaded configuration contexts in a single clock cycle. The implemented chip is capable of decoding 1080p Main Profile (MP) H.264 video streams at 24 frames per second (fps). ADRES, which combines a VLIW processor with a reconfigurable array of 8 × 8 function units, is presented in [8]. In this design, the reconfigurable array accelerates the dataflow-like kernels, whereas the VLIW core executes the non-kernel code by exploiting instruction-level parallelism.

A common focus of these works is how to utilize coarse-grained reconfigurable processing circuits to improve the performance and throughput of the fabricated processors. However, another important issue, the energy efficiency of coarse-grained reconfigurable processing circuits, has not yet been deeply considered or extensively discussed. The main contributions of this work are: an energy-efficient VLSI array architecture for a coarse-grained Reconfigurable Processing Unit (RPU) targeting computation-intensive multiple-standard video decoding; a Line-Switched Mesh Connect (LSMC) routing scheme that significantly reduces the chip area; and a Hierarchical Configuration Context (HCC) organization scheme that reduces the implementation overhead and the energy dissipation caused by fast reconfiguration of the PEs. A prototype of the RPU processor is implemented in 65 nm CMOS, and its functionality is tested with multiple video decoding applications, including H.264, MPEG-2, and AVS [9]. In the implementation section, results on applying the proposed architecture to other computation-intensive and control-intensive applications, such as kernels selected from a widely used universal algorithm set, data encryption, and communication baseband signal processing, are also provided. Experimental results show that the proposed flexible architecture is also effective in speeding up different types of applications with moderate power dissipation.

SECTION II.

Algorithm Analysis and Design Consideration

The proposed architecture targets multiple-standard video decoding, which is normally regarded as one of the most challenging applications because it consists of wide-ranging computation/processing subtasks with different characteristics. Although there is a variety of video encoding/decoding standards catering to different application scenarios, most of them can be summarized by a unified framework [10]. In this paper, three widely used standards, i.e., H.264, MPEG-2, and AVS, are considered. The general decoding framework of these three standards is illustrated in Fig. 1 and includes the following subtasks: entropy decoding, Inverse Quantization (IQ), Inverse Discrete Cosine Transform (IDCT), Motion Compensation (MC), intra-prediction, and post filtering (i.e., deblocking).

Fig. 1. General framework for video decoding.

These decoding subtasks can generally be classified into three categories: bit-level arithmetic operations (entropy decoding and IQ), block-based word-level calculations (IDCT, MC, intra-prediction, reconstruction, and deblocking filtering), and 2-D image data reading/writing (frame memory accesses). It has been shown in [11] that the block-based word-level calculations comprise the majority of the decoding workload. For instance, when an H.264 decoding application is run on a RISC processor, the measured MC, deblocking, and Inverse Quantization and Transformation (IQT) workloads account for more than 75% of the total, as illustrated in Fig. 2. MPEG-2, AVS, and even High Efficiency Video Coding (HEVC) [12] show similar computational patterns. These features make the reconfigurable fabric, which provides a great number of PEs, an ideal platform for implementing video decoding applications.

Fig. 2. Measured computational workload of a typical H.264 decoding application (video stream “foreman” on an ARM926EJ processor) [11].

Another advantage of utilizing reconfigurable devices is that the structure and functionality of the whole circuit, or parts of it, can be dynamically reconfigured after fabrication to satisfy the requirements of different subtasks within a specific decoding standard or across different standards. According to the foregoing analysis, the major design issues cover:

  • PE design, including the selection of the granularity, the arithmetic functionality of the PE and the coding scheme of the operation codes;

  • structure of the reconfigurable PE Array (PEA), including the optimal size of the PEA and the organization of the heterogeneous PEs;

  • hardware interconnections, including the topology of the interconnections and the router design; and

  • on-chip storage structures, including the hierarchy of the memory and the input/output (I/O) interfaces.

A particular design issue considered in this paper is how to improve the efficiency of the configuration procedure. Several H.264 decoding experiments were carried out on a reconfigurable fabric exploiting the MorphoSys hardware architecture [13]. The workloads of the configuration procedure, the computation, and the data I/O were counted for each decoding subtask, and the simulated results are illustrated in Fig. 3. One can clearly see that the configuration of the reconfigurable processor comprises a major part of the overall workload for all three subtasks. The main reason is that, in Coarse-Grained Reconfigurable Architectures (CGRAs), the computation tasks are usually divided into many small kernels (compiled in the form of multiple contexts), which are then repeatedly mapped onto and executed by the array [14]. Due to the large scale of the array, these contexts generally consist of a few hundred to a few thousand bytes, so the configuration time is normally long compared with the computation time [15]. Moreover, the configuration workload inevitably results in a large portion of the power consumption. Therefore, the key to designing an energy-efficient reconfigurable processor also lies in reducing the overhead of the reconfiguration process.

Fig. 3. Timing breakdown of different operations on a reconfigurable system.

Besides multimedia algorithms, many other kinds of applications with different computation patterns are also suitable for reconfigurable hardware implementation. Some representative examples include GEMM [16], SPMV, FFT, and Viterbi. GEMM and SPMV represent the categories of applications with dense- and sparse-matrix operations, respectively. Both share data structures very similar to the pixel-based data structures in image processing applications, on which parallel operations can be efficiently exploited. The FFT algorithm is very similar to the DCT operation in multimedia applications, and Viterbi is based on constructing graphical models; both are also suitable for parallelization on PE arrays. However, it is worth noting that some applications, such as NQueen and Floydwarshall, are not suitable for array-based hardware acceleration. These algorithms contain plenty of back-tracking and branching operations, which require complicated flow control and task scheduling, and such operations cannot be efficiently mapped onto PE arrays to exploit the massive computation units. A good summary of the above-mentioned algorithms with their different computation patterns and data structures is reported in [17] and [18]. More detailed discussions and analyses are carried out in the following sections.

SECTION III.

RPU Architecture

Fig. 4. Logical structure of the proposed RPU.

The proposed architecture of the coarse-grained RPU [19] is shown in Fig. 4. It consists of five major parts: the PEA, the input control logic, the output control logic, the inner buffering memory, and the context storing and controlling logic. The former four parts make up the data processing path, and the latter forms the configuration path.

The operation of the RPU is driven by the input context. When the context streams are pushed into the context storage module, the controlling logic first translates and reorganizes the contexts, and then configures the input control module or the inner buffering module to prepare the data. At the same time, the PEA is reconfigured to the desired functional structure. Once the data are prepared, the data stream is sent to the PEA and processed.

A. Data Processing Path

The data processing path is responsible for the fetching, processing, storing and exporting of the data stream. The core part of the data processing path is a 16 × 16 coarse-grained heterogeneous PEA.

1) Reconfigurable PEA

Fig. 5. Architecture of PEA and RCA.

The reconfigurable PEA is organized into four groups, each referred to as a Reconfigurable Cell Array (RCA) containing 8 × 8 reconfigurable PEs, as illustrated in Fig. 5. Each RCA can work independently, providing a higher level of parallelism to improve the RPU throughput. Moreover, individual RCAs can be turned off to maintain very low power consumption when desired.

Fig. 6. Internal structure of PE.

Reconfigurable PE: Each PE contains an Arithmetic Logic Unit (ALU), a group of input/output registers (A_REG, B_REG, and C_REG), a temporary result register (T_REG), and a route block, as shown in Fig. 6. As summarized in Table I, the ALU in each PE integrates up to 26 different operators, covering logical operations such as NOT, AND, OR, and XOR, and arithmetic operations such as addition, multiplication, shift, comparison, saturation, and absolute value.

Table I Function summary of the ALU

The probability of each specific operation in H.264/MPEG-2/AVS video decoding applications is also listed in Table I. One can see that the multiplication-related operations (No. 17, 18, and 24) do not appear often in video decoding applications (their total probability is no more than 0.069). Therefore, in each RCA, only the ALUs in the second line contain 16 × 16 fixed-point multipliers (i.e., 32 multipliers in total per RPU). This optimization reduces the total area of the 256 PEs by up to 80%. The proposed PEA is thus a heterogeneous array rather than a homogeneous one.

A 5-bit fixed-length operation code (OPCODE) represents each operation. To reduce the power spent on continuous toggling of the configuration registers, each bit of the OPCODE is designed so that, when it is zero, it disables the toggling of the configuration registers of the corresponding arithmetic or logic circuits in the ALU. The transition probabilities of the individual bits are kept minimal and evenly distributed according to experimental results from several rounds of decoding simulations. The power consumption of the ALUs can therefore be further reduced.
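
As an illustration of this mechanism, the following C fragment models the per-bit gating behaviorally; the assumption that each of the five OPCODE bits gates one functional sub-unit's configuration register is ours, made for illustration, as the exact bit-to-circuit assignment is not specified here.

    #include <stdint.h>

    #define NUM_UNITS 5  /* assumed: one gate bit per functional sub-unit */

    typedef struct {
        uint16_t cfg[NUM_UNITS];  /* configuration registers, one per sub-unit */
    } alu_state_t;

    /* Behavioral model of opcode-gated register updates: a sub-unit's
     * configuration register only toggles when its OPCODE bit is 1, so
     * unused circuits see no switching activity. */
    void update_config(alu_state_t *alu, uint8_t opcode,
                       const uint16_t next_cfg[NUM_UNITS])
    {
        for (int k = 0; k < NUM_UNITS; k++) {
            if ((opcode >> k) & 1u)   /* zero bit => register holds its value */
                alu->cfg[k] = next_cfg[k];
        }
    }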

Reconfigurable Interconnections: LSMC is exploited in the design of the RCA. In this structure, the PEs within each RCA are organized in a line-to-line manner, i.e., each PE can be connected to any PE in the adjacent upper and lower lines (the last line connects back to the first line to support iterative operations) through the inter-layer mesh interconnections shown in Fig. 5. For I/O data, the RCA is connected to a pair of I/O FIFOs (referred to as Level-2 FIFOs in Fig. 5) through an I/O bus. Unlike the traditional full mesh connection scheme, where all PEs in all lines are equally connected to the I/O FIFO's data ports (D1, D2, …, D32), as shown in Fig. 7, in our proposed structure only one particular line of PEs and a corresponding line of FIFO data ports are selected and connected to the I/O bus in each configuration.

Fig. 7. Comparison between the traditional full mesh scheme and the proposed LSMC.

Fig. 8. The profiled area (kgates) of the components of the four RCAs (one PEA). (a) Full mesh interconnection. (b) Proposed LSMC interconnection.

After that, a full mesh network is utilized to transmit data between the I/O FIFO and the PEs on the I/O bus. During the architectural exploration step, to compare the cost of the two interconnection schemes, we implemented a PEA module with four N × N RCAs and measured the ratio of the interconnection area to the PEs' area, reported in Fig. 8 for N = 8 to 128. It can be clearly seen that the interconnection consumes a large portion of the total RCA area when a full mesh connection scheme is adopted, while the ratio is much smaller for the proposed LSMC.

To verify that the optimization of the interconnection network does not affect overall performance, the measured execution times (in clock cycles) of the mapped kernel algorithms, including MC, deblocking, and IDCT, are compared for both structures in Fig. 9. The results show that the proposed LSMC introduces no more than 5% performance deterioration.

Fig. 9. Performance (cycles) comparison between the full mesh scheme and the proposed LSMC.

PEA Architecture Exploration: As analyzed in Section II, the data in the computation-intensive tasks, including prediction (MC or intra-prediction), IDCT, and deblocking filtering (DeB.), are processed in parallel within each Macro Block (MB) or sub-MB. Although the MB size is fixed at 16 × 16 in H.264, sub-algorithms such as MC can be performed with block sizes of 4 × 4, 4 × 8, 8 × 4, 8 × 8, 8 × 16, 16 × 8, or 16 × 16. To decide the optimal array size, we performed an architectural exploration in which the PEA is assumed to be composed of four independent RCA modules, each with a minimum size of 4 × 4 to fit the smallest possible sub-MB.

The primary performance constraint is the average decoding speed per MB (one MB consists of 16 × 16 pixels). In a high-definition video stream, one frame consists of 8160 MBs (i.e., 1920 × 1088/256). Thus, the equivalent average performance requirement for processing one MB at an operating frequency of 200 MHz is around 816 cycles (i.e., 200 × 10^6/(8160 × 30)). Another design constraint is the implementation cost, measured as the silicon area of the fabricated chip. As depicted in Figs. 4 and 5, the total area of the RPU is determined by several design parameters, including the 2-D scale of the RCA (width W and height H) and the width/depth of the input/output FIFOs. However, in the early stage of the PEA architecture exploration, we assume that both the data memory and the context memory are large enough so that the performance evaluation is accurate, and concentrate on the size of the PEA.
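
As a quick check of this budget, the following C fragment (a minimal sketch; all constants are taken from the text) reproduces the arithmetic:

    #include <stdio.h>

    int main(void)
    {
        const long freq_hz   = 200000000L;        /* operating frequency: 200 MHz */
        const int  fps       = 30;                /* target frame rate            */
        const int  mbs_frame = 1920 * 1088 / 256; /* 8160 MBs per HD frame        */

        /* Cycle budget available for decoding one macroblock. */
        long cycles_per_mb = freq_hz / ((long)mbs_frame * fps);
        printf("MBs/frame = %d, cycle budget per MB = %ld\n",
               mbs_frame, cycles_per_mb);         /* prints 8160 and 816          */
        return 0;
    }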

The decoding performance is then measured by cycle-accurate simulations with a SystemC model of the proposed architecture. In the simulation, we fixed both W and H to N and measured the decoding speed in cycles for N = 4 to N = 32. Taking H.264 high-definition decoding (resolution 1920 × 1080) as an example, the average performances of the most important sub-algorithms are profiled separately in Fig. 10.

It can be seen that the deblocking filtering algorithm requires the longest processing time in all cases; in most cases it even exceeds the sum of the prediction and IDCT times. Furthermore, since the deblocking filtering process depends on the data produced by the Pred. and IDCT sub-algorithms, we decided to allocate independent hardware resources, i.e., two identical RPU modules, to implement DeB. and Pred. + IDCT processing in parallel and improve decoding efficiency. The profiling data show that when N ≥ 8, the proposed architecture satisfies the performance constraint (816 clock cycles) discussed in the previous paragraph. Using an area profiler integrated with our simulator, we also estimated the area of one PEA (integrating four RCAs) for N from 4 to 32. The area for N = 16 is around 3.1 times larger than that for N = 8. We therefore selected N = 8 (i.e., one 16 × 16 PEA composed of four 8 × 8 RCAs), which is both sufficient for the performance demand and economical in hardware resources.

It should be noted that the hardware resource optimization only targets the computational workload of typical decoding scenarios. In worst-case scenarios, such as frames full of high-speed moving objects, occasional frame-rate drops will occur, which is a common design issue. Several schemes have been presented in the literature to address it. For instance, [20] presents several frame-dropping mechanisms for MPEG videos, including dropping I frames (DIF), DIF(λ), discarding the first P frame (DFPF), discarding the third P frame (DTPF), and discarding B frames (DBF), to reduce bandwidth consumption subject to a quality-of-service (QoS) constraint. In [21], frame-dropping algorithms are proposed that only select one type of B frame and avoid consecutive frame drops; two mechanisms, immediate dropping and dropping of next frames, are developed based on the frame type, the deferred decoding time, and the expected decoding time.

2) Input/Output and Inner Buffering Memory Structure

During the decoding process, the fetched source data stream is usually discrete and thus needs to be buffered. From the previous section, the input/output data path of the RCA is 8 × 2 × 16 = 256 bits wide (i.e., each PE takes two 16-bit input operands from outside). Therefore, we designed two 256-bit-wide input/output FIFOs (the Level-2 FIFOs mentioned in Fig. 5) for each RCA, as shown in Fig. 11. The input Level-2 FIFO has a depth of 32 and the output Level-2 FIFO a depth of 4.

Fig. 10. The profiled performance (clock cycle number) when performing H.264 decoding sub-algorithms on four RCAs. (a) Pred. (b) IDCT. (c) Deblocking filtering.

Fig. 11. Input/output and inner buffering structure of RPU.

An RCA within one RPU may also need to exchange data with the other RCAs. A 32 × 256-bit single-port memory (the Macro Buffer shown in Fig. 11) is designed to store temporary results shared among RCAs. A single RCA may also be configured to perform different types of computation tasks in sequence, for example IDCT followed by MC or intra-prediction; in this case, the temporary results are stored in another 32 × 256-bit dual-port memory, the RCA Internal Memory (RIM) illustrated in Fig. 11. The Exchange Memory (EM) is a 64 × 256-bit dual-port memory designed to store data exchanged between RPUs. Its two 1024-byte blocks realize a double-buffering strategy, i.e., they work in a ping-pong fashion.
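
The ping-pong discipline on the EM can be sketched in C as follows; the type and function names are illustrative, not the actual RTL interface.

    #include <stdint.h>

    #define EM_BLOCK_BYTES 1024  /* each of the two Exchange Memory blocks */

    /* Illustrative ping-pong discipline: while one RPU writes results into
     * the "produce" block, the other RPU reads the previously filled
     * "consume" block; the roles swap at every iteration boundary. */
    typedef struct {
        uint8_t block[2][EM_BLOCK_BYTES];
        int     produce;              /* index of the block being written */
    } exchange_mem_t;

    static inline uint8_t *em_produce_block(exchange_mem_t *em)
    {
        return em->block[em->produce];
    }

    static inline const uint8_t *em_consume_block(const exchange_mem_t *em)
    {
        return em->block[1 - em->produce];
    }

    static inline void em_swap(exchange_mem_t *em)  /* iteration boundary */
    {
        em->produce = 1 - em->produce;
    }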

B. Configuration Path

1) Hierarchical Configuration Context Representation

In this paper, we propose the HCC representation scheme to reduce the context storage, transmission and power consumption overhead. As shown in Fig. 12, the context for the proposed architecture is logically organized in three levels: core context, group context and complete context.

Fig. 12. Hierarchical representation of context.

The first level of context (core context) contains the ALU OPCODEs, the inter-layer interconnection configuration bits, and the data loading/storing commands. Using a group of OPCODEs, the RCA can be configured to execute a specific operation. The configuration bits for the inter-layer interconnections control the flow of the processed data stream, and the data loading/storing commands for the input/output interfaces select the data sources for each RCA.

The indexes of the core contexts are grouped into a higher level (Level 2), i.e., the group context in Fig. 12. It represents a series of data-dependent RCA-level calculations. The idea of the Level-2 group context representation is based on the fact that for certain applications, such as iterative algorithms, the computation requires similar operations and the processed data may be reused repeatedly, meaning the RCA is likely to be reconfigured with the same structures.

The highest level of context representation is referred to as the complete context. It is an extension of the group context with synchronization commands to control the flow of the implemented algorithm. The complete contexts are dynamically generated by the host processor and are not stored within the RPU. Table II summarizes the sizes of the different context types (in 32-bit words) and the corresponding memories used in the RPU design.
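
For illustration only, the three HCC levels could be modeled with C structures as below; the field widths and array bounds are placeholders rather than the exact Table II encodings.

    #include <stdint.h>

    typedef struct {              /* Level 1: core context, stored in GCCM   */
        uint8_t  opcode[8][8];    /* 5-bit ALU OPCODEs for one 8 x 8 RCA     */
        uint32_t interconnect[8]; /* inter-layer routing configuration bits  */
        uint32_t io_command;      /* data loading/storing command            */
    } core_context_t;

    typedef struct {              /* Level 2: group context, stored in GGCM  */
        uint16_t core_index[16];  /* indexes of data-dependent core contexts */
        uint8_t  length;          /* number of valid entries                 */
    } group_context_t;

    typedef struct {              /* Level 3: complete context, host-side    */
        uint16_t group_index;     /* which group context to replay           */
        uint32_t sync_command;    /* synchronization/flow-control command    */
    } complete_context_t;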

Table II Summary of the context designed for RPU

2) Hardware Support

The proposed hardware architecture of the configuration path is illustrated in Fig. 13. Two dedicated context memories, the Global Core Context Memory (GCCM) and the Global Group Context Memory (GGCM), are designed to store the Level-1 core contexts and Level-2 group contexts, respectively. During execution, the Level-3 complete contexts are fed from the external host processor to the RPU context parser, which translates them and reads the corresponding group contexts from the GGCM.

Fig. 13. Architecture of the configuration path.

Table III summarizes the final implemented storage sizes of the three context types when using the proposed scheme for H.264 decoding; the data for a non-hierarchical scheme are also reported for reference. The comparison shows that, thanks to HCC, the overall context volume is reduced by 76% compared with the non-hierarchical scheme for H.264 decoding applications. Fig. 14 further compares the ratio of configuration time to calculation and data-accessing time in the XPP40 processor [22] and the proposed RPU when implementing H.264 decoding sub-algorithms. The reconfiguration time of the RPU is only 4% to 13% of the entire execution time, much less than that of the XPP40 architecture.

Table III Two solutions for the storage of context

Fig. 14. Ratio of the configuration time versus the calculation and data accessing time in the XPP40 [22] and RPU architectures. (a) XPP40 processor. (b) RPU. The absolute cycle numbers are also given for reference.

C. Flexibility Analysis

Both the data processing and configuration paths are designed with a high level of flexibility so that non-multimedia computation-intensive tasks can also be executed efficiently on the RPU. First, the proposed structure of the PE array and the input/output data paths are highly flexible. The PEs can be regarded as the ALUs of general-purpose CPUs, supporting a wide range of instructions. The interconnection of the PE array is a non-blocking network that enables data communication between PEs within one RCA, input/output streams between the FIFOs outside the RCA and the register stacks within the RCAs. From the computation perspective, the hardware structure also satisfies the requirements of non-multimedia applications, such as GEMM matrix operations, structured-grid modeling, and FFT algorithms.

GEMM [16] is widely used in process control and real-time signal processing applications. Fig. 15 illustrates the computation kernel of GEMM, where A, B, and C denote matrices of size m × n, n × k, and m × k, respectively (a plain-C rendering follows Fig. 15). Since the inner loop contains only one operation, a multiply-add, and there are no data dependences between successive operations, the inner loop can be unfolded and mapped onto the PEA to fully exploit parallelism. The proposed LSMC interconnection scheme and the multi-level buffering structure efficiently reduce the interval between successive pipelines, maximizing the data throughput and the utilization of the PEs. Moreover, by rearranging the input data stream, the PEA needs only a one-time configuration, further boosting the execution of the kernel. Another example is the structured-grid algorithm, which simulates the node-update process of a structured network. A fixed-point version of the structured-grid algorithm is illustrated in Fig. 16 (a C sketch follows Fig. 16). As can be seen, all the internal nodes share the same operation, data, and memory-access pattern. Moreover, the input and output data are updated in the same cycle without any dependences. Therefore, the RPU can be efficiently utilized to execute these computation tasks, as illustrated in Fig. 17.

Fig. 15. Computation kernel of GEMM.
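
Since the image of Fig. 15 is not reproduced here, the following plain-C rendering of the GEMM kernel is given as a sketch of the computation the text describes (the paper's exact fixed-point coding may differ):

    /* GEMM kernel in the spirit of Fig. 15: C += A * B, with A (m x n),
     * B (n x k), and C (m x k).  Iterations over i and j are mutually
     * independent, so they can be unfolded and spread across the PEA. */
    void gemm(int m, int n, int k,
              const short A[m][n], const short B[n][k], short C[m][k])
    {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < k; j++)
                for (int p = 0; p < n; p++)
                    C[i][j] += A[i][p] * B[p][j];  /* one multiply-add per PE */
    }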

Fig. 16. The computation kernel of the structured algorithm (u is the 3-D matrix of a structured grid, u_prev is the previous value, time_step is the time interval of structure updating).
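
Likewise, a fixed-point 3-D structured-grid update in the spirit of Figs. 16 and 17 can be sketched in C as follows; the grid sizes and the neighbour-weighting form are assumptions for illustration, with the time_step scaling assumed folded into the factors:

    #define NX 16
    #define NY 16
    #define NZ 16

    /* Every interior node is refreshed from its six axis neighbours using
     * the weighting factors named in Fig. 17.  All nodes read u_prev and
     * write u, so one update step carries no internal dependences. */
    void grid_step(short u[NX][NY][NZ], const short u_prev[NX][NY][NZ],
                   short factor_x, short factor_y, short factor_z,
                   short factor_all)
    {
        for (int x = 1; x < NX - 1; x++)
            for (int y = 1; y < NY - 1; y++)
                for (int z = 1; z < NZ - 1; z++)
                    u[x][y][z] = (short)(
                        factor_x * (u_prev[x-1][y][z] + u_prev[x+1][y][z]) +
                        factor_y * (u_prev[x][y-1][z] + u_prev[x][y+1][z]) +
                        factor_z * (u_prev[x][y][z-1] + u_prev[x][y][z+1]) +
                        factor_all * u_prev[x][y][z]);
    }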

Fig. 17. Mapped scheme of the kernel of the structured algorithm (u_prev is a grid node value; factor_x, factor_y, factor_z, factor_all are the updating impact factors from the adjacent nodes).

In contrast, control-intensive algorithms with unbounded branching and iterative operations cannot be efficiently mapped onto the RPU architecture. Take the NQueen algorithm, a classic solution to a backtracking problem, for instance; its computation flow is summarized in Fig. 18 (a compact C version follows the figure). Although the conditioning and forwarding operations can be implemented as loops on the PEs, the backtracking operations are too costly to map onto the PE structure. The main reason is that backtracking has no fixed bound, which would require indefinite reconfiguration of the PEs and a very large memory space for temporary results. So far, control-intensive algorithms cannot be mapped and executed on coarse-grained reconfigurable hardware as efficiently as on general-purpose CPUs. Related studies can be found in [23] and [24]; detailed discussion is beyond the scope of this paper.

Fig. 18. Computation kernel of the NQueen algorithm (queenrows stores the locations of the queens, n is the number of queens).
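
For reference, a compact C version of the backtracking kernel in Fig. 18 is sketched below (variable names follow the figure caption; the exact coding in the paper may differ). The unbounded recursive retreat in solve() is exactly the control pattern that resists static mapping onto the PEA:

    /* queenrows[r] = column of the queen placed in row r. */
    static int safe(const int queenrows[], int row, int col)
    {
        for (int r = 0; r < row; r++)
            if (queenrows[r] == col || row - r == col - queenrows[r]
                                    || row - r == queenrows[r] - col)
                return 0;                 /* same column or diagonal */
        return 1;
    }

    static int solve(int queenrows[], int row, int n)
    {
        if (row == n) return 1;           /* all queens placed       */
        for (int col = 0; col < n; col++)
            if (safe(queenrows, row, col)) {
                queenrows[row] = col;
                if (solve(queenrows, row + 1, n))
                    return 1;             /* forward                 */
            }                             /* fall-through: backtrack */
        return 0;
    }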

Fig. 19. Two extreme cases of the execution sequence for a generic application on a CGRA [25] (context N denotes a reconfiguration period of the CGRA, and kernel N denotes an execution period of the CGRA; N_t is the iteration number of the execution).

In terms of the configuration path, the proposed HCC scheme significantly reduces the configuration time of the RPU. Generally speaking, the key to improving computation and energy efficiency is to maintain a very high utilization of the PE array's hardware resources. Two major practices are maximizing the portion of computation time and reducing the configuration time within the whole execution period. Fig. 19 shows how applications are normally executed on CGRAs [25]. Since the hardware resources of a CGRA are always limited, real-world applications cannot be fully mapped onto it at once. Therefore, the software compiler first generates multiple computation kernels, each small enough to be mapped on the CGRA. Each context in Fig. 19 represents the configuration bit stream required for one kernel. The upper and lower parts of Fig. 19 represent two extreme execution scenarios of the compiled application: (a) each kernel is executed N_t times one by one, and (b) the kernels are first executed sequentially, and the whole sequence is then repeated N_t times. For most real-world applications, the actual scenario lies between these two extremes. Therefore, frequent context switching is always required to complete an application on a CGRA. Fig. 20 illustrates how such optimizations can be conducted to achieve an optimal execution sequence.

Fig. 20. Optimized execution and configuration pipeline.

However, current CGRAs usually do not achieve a satisfactory configuration-to-computation ratio and hardware utilization rate. For instance, the ratio achieved by the XPP40 processor [22] is only around 24%. Using the proposed HCC scheme, the total volume of the configuration context can be compressed by 70%, which yields an 86% increase of the computation time within the total execution period. This optimization is a general scheme and is equally effective when non-multimedia applications are mapped on the RPU.

SECTION IV.

Algorithms Mapping

A C-based compiler framework [26] is developed for mapping target applications to the proposed reconfigurable architecture, as shown in Fig. 21. Application programs are written in a high-level programming language (standard C). The compiler automatically extracts instruction-level data parallelism with the help of manually annotated directives and produces Data Flow Graphs (DFGs), which are then mapped onto the RPU by matching the extracted templates against predefined templates.

Fig. 21. Compilation flow for mapping programs on RPU.

Table IV Execution cycles of kernels in PolyBench and H.264 decoding on REMUS

A. Front-End Processing

In the front-end processing, the SUIF2 [27] and Mach-SUIF [28] tools are utilized and extended for standard optimizations. The computation-intensive program tasks that are suitable for parallel execution on the RPU are extracted as DFGs. The remaining parts are annotated and converted back to C programs.

In multimedia applications, the major computation-intensive portions are nested loops. To improve decoding performance, the nested loops must be mapped onto the PEA judiciously so that instruction-level parallelism can be fully exploited. Due to its architectural features, the performance of a CGRA is greatly affected by the reconfiguration of the PEA, the data communication between the host controller and the PEA, and the computation on the PEA, and these factors are correlated. To take all of them into consideration, an analytic Total Execution Time (TET) model is established based on the polyhedral model [29]. Taking this TET model as the optimization target, the mapping of loop nests onto the PEA is formulated as a constrained optimization problem. To resolve it efficiently, a heuristic loop transformation algorithm (PolyMAP) is proposed. PolyMAP explores the solution space of nested-loop transformations and finds the best tradeoff among reconfiguration cost, data communication cost, and computation cost. It is also a source-to-source tool that generates the transformed loop nests for subsequent processing.
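
PolyMAP's actual output is not reproduced in this paper, but the flavor of such a source-to-source transformation can be illustrated with a generic tiled loop, where the tile size is the kind of knob a TET-style cost model would tune (the names and sizes below are placeholders):

    #define N_ELEMS 1024
    #define T       64        /* tile size, chosen by the cost model */

    /* Tiling lets one PEA configuration be reused across a whole tile,
     * amortizing reconfiguration cost against data communication cost. */
    void scale_tiled(short dst[N_ELEMS], const short src[N_ELEMS], short c)
    {
        for (int jj = 0; jj < N_ELEMS; jj += T)   /* one configuration per tile */
            for (int j = jj; j < jj + T; j++)     /* executed on the PEA        */
                dst[j] = (short)(c * src[j]);
    }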

To evaluate the improvement of the proposed scheme, Table IV compares the performance of most kernels in PolyBench [30] and of H.264 decoding when mapped by PolyMAP, EPIMap [31], and PolyCOM [32]. EPIMap and PolyCOM are two representative state-of-the-art loop-mapping approaches for CGRAs: EPIMap is the best existing loop-pipelining method, employing the re-computation technique [31], while PolyCOM [32] leverages the polyhedral model to reduce the data-memory communication cost, resulting in better execution performance of nested loops on CGRAs. Table IV shows that PolyMAP improves kernel performance by 22% and 32% compared with EPIMap and PolyCOM, respectively.

B. Back-End Processing

Improved template-based DFG sub-graph isomorphism and recombination mapping schemes are proposed in this paper. As shown in Fig. 21, they are utilized to reduce the amount of configuration context and the configuration time in the proposed compilation flow. The sub-graph isomorphism scheme classifies the original input atomic DFGs into several categories to facilitate the generation of congruent kernels: it recursively compares the parameter sets of the atomic DFGs, placing those with identical parameter sets into the same category (i.e., a congruent kernel) and creating a new category otherwise. This scheme can reduce the amount of configuration context by 30%-40%.
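
A minimal sketch of this classification step is given below in C; the contents of the parameter set are placeholders, since the exact parameters compared by the scheme are not enumerated here:

    #include <string.h>

    /* Two atomic DFGs with identical parameter sets share one congruent
     * kernel and hence one configuration context.  All-int fields mean
     * memcmp is a safe byte-wise equality test here. */
    typedef struct {
        int node_count;       /* number of DFG nodes       */
        int depth;            /* longest dependence chain  */
        int opcode_hist[26];  /* per-operator usage counts */
    } dfg_params_t;

    typedef struct { dfg_params_t params; int category; } atomic_dfg_t;

    int classify(atomic_dfg_t dfgs[], int n)
    {
        int categories = 0;
        for (int i = 0; i < n; i++) {
            dfgs[i].category = -1;
            for (int j = 0; j < i; j++)   /* reuse an existing category */
                if (!memcmp(&dfgs[i].params, &dfgs[j].params,
                            sizeof(dfg_params_t))) {
                    dfgs[i].category = dfgs[j].category;
                    break;
                }
            if (dfgs[i].category < 0)     /* otherwise open a new one   */
                dfgs[i].category = categories++;
        }
        return categories;                /* number of congruent kernels */
    }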

Table V shows the results of mapping the IDCT, MC, and deblocking algorithms onto the RPU. Performance is measured in cycles per MB and compared with that of mapping the same algorithms on the XPP40. Both worst-case and typical-case performance are compared; for the XPP40, only one result is reported due to the limited available information, so to be conservative we compare it against the worst-case performance of the RPU. Note that template-based mapping schemes share a common limitation: new templates must be developed manually to satisfy new applications' requirements.

SECTION V.

Silicon Implementation and Evaluation

A. Chip Implementation and Test Environment

In order to fully test and verify the functionality and performance of the proposed architecture, we first implemented the reconfigurable processor core RPU as a single chip named CHAMELEON. This prototype chip is used for extensive algorithm tests and performance profiling. Two other SoCs, one integrating two RPU processors and targeting high-performance video decoding applications, the other comprising a single RPU processor and targeting low-power multimedia applications, are also implemented and named RHINOCEROS and REINDEER, respectively. The relationships between the developed IP cores and the fabricated chips are depicted in Fig. 22.

The CHAMELEON prototype chip is implemented on a 6.5 mm × 6.2 mm die using the TSMC 65 nm low-power one-poly eight-metal (LP1P8M) CMOS process. Fig. 23 shows its die micrograph. Since many test I/O pins are intentionally implemented, the final die of CHAMELEON is pad-limited, resulting in a large die compared with the 5.4 mm × 3.1 mm RPU core.

A medium-scale reconfigurable processor core, named REMUS_HPP (REconfigurable MUltimedia System-High Performance Processor), is then designed. The block diagram of its internal architecture is shown in Fig. 24. The REMUS_HPP processor integrates two identical RPU cores and an RPU Micro Controller (RMC). The RMC is directly connected to the RPUs without any bus-based structure and is dedicated to generating the configuration contexts and controlling the data I/O of the RPUs, satisfying the high configuration bandwidth requirements analyzed in Section III-B1). A central Master Micro Controller (MMC) initializes the RMC and the other peripheral circuits, which include an interrupt control module, a Direct Memory Access Controller (DMAC), a 128-Kbyte program memory, and a dedicated 64-bit External Memory Interface (EMI), through a high-speed 32-bit multi-level system bus. A Video Pre-Processor (VPP) is also integrated to perform the bit-level entropy decoding subtasks, while the most computation-intensive tasks (i.e., IDCT, MC, intra-prediction, reconstruction, deblocking, etc.) are implemented on the RPUs. The REMUS_HPP processor core can independently realize complete H.264/MPEG-2/AVS decoding applications; therefore, only comparison data for REMUS_HPP are provided in the following discussion.

Table V Performance comparison of executing some computation-intensive subtasks of H.264 decoding algorithms

Fig. 22. Relation map of the RPU core and implemented chips.

Fig. 23. Die micrograph of CHAMELEON.

Fig. 24. Block diagram of the REMUS_HPP core.

The REMUS_HPP core is then embedded into RHINOCEROS, whose VLSI architecture and die micrograph are shown in Figs. 25 and 26. The integrated REMUS_HPP core supports the whole multi-standard video decoding task, while the other modules in RHINOCEROS are responsible for routine tasks such as stream tuning, video display, peripheral device control, and running the operating system and applications.

Fig. 25. Architecture of RHINOCEROS.

Fig. 26. Die micrograph of RHINOCEROS.

Fig. 27. (a) Architecture of REINDEER. (b) Die micrograph of REINDEER.

Fig. 28. Block diagram of the REMUS_LPP core.

Another energy-efficient SoC, referred to as REINDEER in this paper and intended for portable multimedia processing devices, is implemented last using the same technology and physical integration methodology, as shown in Fig. 27. REINDEER is built around a low-power reconfigurable processor core named REMUS_LPP (REconfigurable MUltimedia System-Low Power Processor), whose VLSI architecture is presented in Fig. 28. The internal structure of REMUS_LPP is identical to that of REMUS_HPP except that only a single RPU core is utilized. The physical information of the three chips and three cores is summarized in Table VI.

B. Evaluation and Comparison

1) Performance

To evaluate the performance of the proposed RPU, we measured the decoding frame rate of the RHINOCEROS and REINDEER chips and report the results in Tables VII and VIII, respectively. For each video format, fifty-four encoded video streams were tested and the average decoding speeds recorded. The working clock frequency is 200 MHz for the RHINOCEROS processor and 100 MHz for the REINDEER chip.

Table VI Physical information of CHAMELEON, RHINOCEROS, and REINDEER

Table VII Measured performance for REMUS_HPP (working frequency 200 MHz)

Table VIII Measured performance for REMUS_LPP (working frequency 100 MHz)

Table IX Comparisons when performing H.264 decoding

Table X Comparisons when performing H.264 decoding

We first compare the video decoding performance of the REMUS_HPP processor with the XPP-III reconfigurable processor [6], [7], a state-of-the-art many-core processor [33], and a dedicated multi-format video codec chip [34] in Table IX. From the measured data, one can clearly see that the proposed design achieves a 25% performance boost over the XPP-III processor and a decoding speed similar to the many-core and Application-Specific Integrated Circuit (ASIC) designs.

To show the performance differences between the proposed architecture and [33], [34] more clearly, the data from Table IX can be normalized as the number of processed MBs per second per MHz (MBs/s/MHz). For instance, compared with the recently reported many-core processor [33], the normalized performance is 0.66× higher.

The maximum bit rate the proposed REMUS_HPP architecture can support is around 73 Mbps. For scenarios with very high bit rate and/or high motion, frame drops are inevitable. As discussed in Section III-A, several optimization schemes are adopted to minimize the effects in such scenarios.

In Table X, we compare the decoding performance of the REMUS_LPP processor with the ADRES reconfigurable processor [8], a DSP-based video decoder chip [35] developed for multi-format set-top boxes, and another video codec chip [36]. The normalized performance data show that REMUS_LPP achieves a 16% faster decoding speed than both the ADRES reconfigurable processor and the DSP chip. Compared with the video codec chip in [36], the performance of REMUS_LPP is 1.28× lower; the very high decoding speed of [36] benefits from its 3.44 times higher working frequency. When both performances are normalized in MBs/s/MHz, our design has a clear 0.94× advantage in decoding speed.

2) Energy Efficiency

To evaluate the energy efficiency of the implemented RPU architecture, the power data in both tables are converted to MBs/s/mW. REMUS_HPP shows a 14.3× (i.e., (874-57)/57) improvement in energy efficiency over the XPP-III processor, and its energy efficiency is almost twice that of the many-core processor. As for the dedicated codec chip, its energy efficiency is 1.8× that of the proposed architecture, which is expected: the decoding engine in the codec processor is a custom-designed ASIC, whose energy efficiency should surpass that of a reconfigurable architecture.

REMUS_LPP achieves a 76% power reduction and a 3.96× energy efficiency improvement compared with ADRES. Compared with the DSP chip, the proposed architecture has a much larger advantage. Notably, the video codec chip in [36] reaches only around 40% of the proposed design's energy efficiency, quite unlike the codec processor reported in [34]. The main reason is that this codec processor has a RISC- and VLIW-based multi-core architecture whose functions are programmable.

It is also worth noting that although the above comparisons of performance and energy efficiency are made between the REMUS_HPP/REMUS_LPP processors and the reference designs, the major advantages of the proposed architecture come from the RPUs, since the computation-intensive video decoding tasks (IDCT, MC, intra-prediction, reconstruction, deblocking, etc.) are all carried out on them.

C. Flexibility Evaluation

Although the above performance and energy-efficiency comparisons are all based on video decoding applications, the advantages of the proposed architecture are not limited to video decoding. To verify its efficiency in accelerating non-multimedia applications, a series of kernels of different types are implemented on the proposed platform. The detailed results are as follows.

Thirteen Kernels. Thirteen kernels, shown in the first column of Table XI, are selected as benchmarks based on the algorithm classification proposed in [17], [18]. This classification is widely used and cited as a universal benchmark [38], [39]. According to [17], any real-world application can be characterized by a single class or a combination of the thirteen classes of algorithms; therefore, studies such as [38] and [39] have used these algorithms to evaluate system performance and efficiency instead of implementing many full applications. The selected thirteen kernels are implemented on REMUS_HPP and compared with the Atom 230 and Cortex A8 processors in terms of clock cycles, as reported in Table XI. In this experiment, the size of the input data is carefully tuned to guarantee that the execution times of all kernels are of the same order of magnitude. For control-intensive kernels, a mapping strategy that transforms branches into partial predication is adopted: instead of executing “If-Then-Else” statements sequentially, the “If” condition and both “Then/Else” paths are executed in parallel, as sketched below. For irregular memory accesses whose addresses are only resolved at runtime, the EMI module can be configured into a gather mode that fetches multiple discrete data in one transaction.
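
The following C fragment illustrates the partial-predication transformation on a simple clipping function; it is an illustrative example of the strategy, not code from the actual tool chain:

    /* Branch form: only one path executes, serializing the PE array. */
    short clip_branch(short x, short lo, short hi)
    {
        if (x < lo)      return lo;
        else if (x > hi) return hi;
        else             return x;
    }

    /* Partially predicated form: the conditions and both candidate values
     * are evaluated in parallel, and selects merge the results, so no
     * control-flow divergence occurs on the PEA. */
    short clip_predicated(short x, short lo, short hi)
    {
        short below = (short)(x < lo);          /* condition lanes          */
        short above = (short)(x > hi);
        short t     = (short)(below ? lo : x);  /* select instead of branch */
        return (short)(above ? hi : t);
    }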

Table XI Performance comparison of execution time of thirteen kernel algorithms on different architectures

From Table XI, a clear performance boost can be observed for the proposed REMUS_HPP over the Atom 230 and Cortex A8 processors. The thirteen kernels can be roughly divided into three categories: computation-intensive, control-intensive, and bandwidth-limited (or data-intensive) applications. For computation-intensive kernels such as GEMM, SPMV, and FFT, which fit the array architecture efficiently and exploit parallelism, REMUS_HPP achieves very high performance, i.e., more than 8× speedups. For kernels requiring very high memory bandwidth (bandwidth-limited), such as BFS, the specially optimized memory interface can feed a sufficient amount of data to the reconfigurable array to improve performance. For control-intensive kernels that lack parallelism, such as Floydwarshall and NQueen, REMUS_HPP can still provide acceptable performance, thanks to the reduced configuration time enabled by the proposed HCC scheme and the reduced memory accesses enabled by the flexible array interconnection.

In terms of power, REMUS_HPP consumes around 100 mW for all these kernels, while the Atom 230 consumes about 4 W and the Cortex A8 around 300 mW according to their datasheets. This energy-efficiency advantage is attributed to the reconfigurable array design and the HCC scheme, which lower the energy overhead of data migration and configuration context decoding, respectively.

Table XII Comparison of data encryption algorithms implemented on different hardware architectures (REMUS_LPP)

Data Encryption. The Advanced Encryption Standard (AES) and Data Encryption Standard (DES) applications are implemented on the REMUS_LPP platform. For AES, the DFGs are mapped and pipelined on the reconfigurable PE array, and different data blocks are processed in parallel using the same set of configuration contexts. For DES, the round function is unrolled and then mapped onto the PE array; a large number of pipeline stages are adopted so that a higher level of data parallelism can be exploited. As shown in Table XII, the proposed architecture achieves throughput comparable to state-of-the-art schemes, and its most distinct advantage is an energy efficiency much higher than both the many-core processor [40] and commercial mainstream FPGA [41] based designs.

Global Positioning System (GPS). The baseband algorithm of GPS is also mapped onto the REMUS_LPP architecture. The tasks with high computation workload (i.e., acquisition and correlation) are accelerated on the RPU, while the tracking loop and positioning run on the MMC. The reconfigurable PE array is well suited to the Discrete Fourier Transform (DFT)-based acquisition calculation. In the implemented design, REMUS_LPP can simultaneously capture and track 9 satellites with only 23.75 mW power consumption. The achieved positioning accuracy is 1.24 meters, which satisfies the requirements of the target applications.

SECTION VI.

Conclusion

This paper has presented an energy-efficient architecture for a coarse-grained reconfigurable processing unit, the RPU. Thanks to the proposed reconfigurable PEA with the LSMC structure and the HCC-based context organization scheme, the implemented high-performance processor, which integrates two RPUs, achieves a 25% speedup and a 14.3× improvement in energy efficiency for H.264 high-profile decoding compared with the XPP-III processor. Meanwhile, the low-power processor, integrating a single RPU, has a 1.92× energy efficiency advantage over ADRES for the H.264 baseline-profile decoding application. Even compared with the state-of-the-art many-core processor and dedicated video codec designs, the performance of the proposed architecture is very close. The notable improvements in both performance and energy efficiency show that the proposed RPU is an ideal candidate for implementing multiple-standard video decoding applications. Taking HEVC as an example, our qualitative analysis indicates that, since both the operations and the calculation patterns are potentially supported by the proposed PEA-based reconfigurable fabric, the motion compensation, intra prediction, and inverse transform and quantization of HEVC (accounting for about 50% of the total computation burden [42]) can also be processed effectively by the RPU.

Moreover, it has been demonstrated that the proposed hardware architecture and optimization schemes are applicable to designing other domain-specific reconfigurable processors and systems, and are especially suitable for computation-intensive applications. Thirteen kernel algorithms and applications from domains such as signal processing and data encryption were implemented on the proposed architecture; their performance and energy efficiency were verified and compare favorably with general-purpose processors and FPGAs.

Finally, it is worth noting that the proposed architecture is not a universal solution for all general-purpose computation tasks. The experimental results show that, for certain kernel algorithms, the proposed architecture gains no performance boost over general-purpose CPUs. Moreover, the performance advantages achieved over the Atom and Cortex processors do not hold for high-end general-purpose CPUs such as Intel's Core i7 [43]; the references selected in this study target low-power mobile platforms. It should also be noted that the proposed RPU architecture and HCC configuration scheme are specifically designed to improve computation speed and energy efficiency for multimedia applications. Although both architectural and compilation optimizations could further improve the computation efficiency for other general applications, such studies are beyond the scope of this paper and remain future work.
