

Received September 25, 2019, accepted October 21, 2019, date of publication November 4, 2019, date of current version November 26, 2019.

Digital Object Identifier 10.1109/ACCESS.2019.2951408

# A Low-Power and High-PSNR Unified DCT/IDCT Architecture Based on EARC and Enhanced Scale Factor Approximation

# JIANFENG ZHANG<sup>®</sup>, WEI SHI, LI ZHOU, RUI GONG, LEI WANG, AND HONGWEI ZHOU

College of Computer, National University of Defense Technology, Changsha 410072, China

Corresponding author: Jianfeng Zhang (jianfengzhang@nudt.edu.cn)

This work was supported in part by the HGJ under Grant 2017ZX01028103 and Grant 2018ZX01029103, in part by the National Key Research and Development Program of China under Grant 2018YFB2202603, and in part by the National Natural Science Foundation of China under Grant 61802427, Grant 61502507, Grant 61602496, and Grant 61832018.

**ABSTRACT** The Discrete cosine transform (DCT) and inverse Discrete cosine transform (IDCT) have been widely used in image and video compression standards, and more and more researchers focus on the method that taking use of the coordinate rotation digital computer (CORDIC) to execute the DCT and IDCT, the reason is that CORDIC can realize the complex transcendental functions only by using shifters and adders, and is easily suitable implemented in a parallel way. However, the conventional CORDIC has some drawbacks, such as low precision, long iteration number, scale factor and so on. In our previous paper, we have implemented an unified DCT/IDCT based on adaptive recoding CORDIC (ARC) to improve the performance. In this paper, compared with our previous work, we propose an enhanced unified architecture for DCT and IDCT based on the cooperation between the enhanced adaptive recoding coordinate rotation digital computer (EARC) and the conventional CORDIC, in which the radix-2 scale factor approximation proposed in our previous paper is also optimized significantly to achieve better performance in power consumption and PSNR. To conduct a fair competition, the proposed architecture is also validated on a Virtex 5 FPGA platform to evaluate the performance. Under DCT only mode, compared with the Huang and Lee architectures, the proposed architecture at least uses 3.3% less area, dissipates 9% less power at the nearly 2% cost of the critical path delay. Under DCT/IDCT mode, the proposed architecture also saves over 2% hardware resources, reduces more than 5.9% power dissipation when compared to the latest unified DCT/IDCT architectures. Meanwhile, the proposed architecture exceeds the state-of-the-art unified architectures over 0.98 dB in PSNR. Compared with 2S-8P-DCT and 3S-8P-DCT, the proposed unified DCT/IDCT architecture also achieves the best performance in accuracy, processing time and throughput.

# INDEX TERMS DCT, IDCT, ARC, FPGA, PSNR.

# I. INTRODUCTION

Due to the perfect energy packing [1] and very close approximation to the optimal Karhunen-Loeve transform (KLT) [2], the discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) have been widely applied in various image and video compression standards, including JPEG [3], MPEG [4], H.264 [5] and HEVC [6] when they were proposed [7]. However, the computations of 2-D DCT and IDCT are too complex, and then the2-D DCT is commonly calculated by first applying a 1-D DCT over the rows followed by another 1-D DCT applied to the columns of the

The associate editor coordinating the review of this manuscript and approving it for publication was Songwen Pei<sup>(1)</sup>.

input matrix. Hence, the kernel processing for 2-D DCT and IDCT is 1-D DCT. In the meantime, as DCT and IDCT are usually applied in image and video systems simultaneously, a unified 1-D DCT and IDCT architecture would be very efficient. As the existence of transcendental function calculation for 1-D DCT/IDCT, many acceleration algorithms are proposed, which can be divided into three different kinds, such as multiplier-based algorithms [8], [9], distributed arithmetic (DA) based algorithms [10], [11] and coordinate rotation digital computer (CORDIC) based algorithms [12]–[15]. Even the multiplier-based and the DA-based DCT/IDCT could have high peak signal-to-noise ratio (PSNR) compared with the conventional CORDIC-based method, but they are not well applied because of the high-power consumption and

the complicated computation components. Therefore, more and more researchers focus on the method by implementing DCT/IDCT based on CORDIC [16], in which the transcendental functions can be realized in a parallel way only by using adders and shifters, which means CORDIC-based DCT/IDCT can reduce architecture complexity and save power consumption. For example, Huang and Xiao [12], [13] proposed a rapid radix-2 algorithm for computing DCT/IDCT based on CORDIC, in which the latency is restricted by the data dependence of the neighbouring rotations. Lee *et al.* [14] proposed a series of low-power DCTs based on look-ahead CORDIC at the cost of PSNR.

Our previous work [15] only based on adaptive recoding CORDIC (ARC) [18] achieves better performance, but the accuracy of CORDIC based rotation elements can be improved by our latest published work enhanced adaptive recoding CORDIC (EARC) [19]. Meanwhile, our another published paper [22] proposed a novel method that the two different vector rotations share the same CORDIC-based component, and the performance is improved compared with the Huang and the Lee architectures, but is even worse than the one in [15]. Hence, we propose an enhanced unified DCT/IDCT architecture based on the adopting of the efficient decomposition method in this paper, in which the transcendental functions are executed by the cooperation between the conventional CORDIC and EARC to further reduce the computation error, and the scale factor of conventional CORDIC and the compensated scale factor of DCT/IDCT are integrated together, which are realized by the optimized radix-2 scale factor approximation scheme to further cover the cases that the first terms of the quadratic polynomials are  $2^{-1}$  to further improve the computation accuracy. To make a fair competition with other works, the enhanced unified DCT/IDCT architecture has also been synthesized on a Xilinx Virtex-5 LX110T platform to verify the correctness and performance. Compared to the state-of-the-art DCTs and the latest unified DCT/IDCT architectures, including our previous work [15], the proposed architecture demonstrates better performance.

The rest of this article is organized as follows: The proposed enhanced unified DTC/IDCT architecture and the optimized radix-2 scale-factor approximation scheme are discussed in Section II. Section III analyses the simulation and comparison results. Conclusions are drawn in Section IV.

### **II. ENHANCED UNIFIED ARCHITECTURE**

As the 8-point DCT and IDCT are mostly used in image and video standards [3]–[6], we propose an enhanced unified 8-point DCT/IDCT architecture based on the cooperation between EARC and the conventional CORDIC, and the optimized radix-2 scale-factor approximation scheme for the unified architecture is also described in this section.

Firstly, we analyse the decompositions of the DCT to illustrate the unified architecture. Secondly, we discuss how to implement these rotation elements based on EARC or on the cooperation between EARC and the conventional CORDIC. Finally, the optimized radix-2 scale factor approximation scheme is depicted.

Concurrently, to conduct a fair comparison, we also assume the natural number 1 equals 12'b01000000000 when the word length is 12-bit.

# A. ENHANCED UNIFIED DCT/IDCT ARCHITECTURE

The N-point 1-D DCT transforms a real data sequence from the time domain  $\{x(n), n = 0, 1, 2, ..., N-1\}$  to the frequency domain  $\{y(n), n = 0, 1, 2; ...; N-1\}$ , which is described as

$$y(k) = \sqrt{\frac{2}{N}} \cdot c(k) \cdot \sum_{n=0}^{N-1} x(n) \cdot \cos\left[\frac{(2n+1)k\pi}{2N}\right]$$
(1)

in which the range of k is [0, 1, ..., N-1],  $c(0) = 1/\sqrt{2}$ and c(k) = 1 for k = 1, 2, ..., N-1. In the meantime, the computation of Equation (1) requires  $O(N^2)$  multiplications and  $O(N^2)$  memory to store the trigonometric values accompanying with the point N.

Meanwhile, IDCT transforms a real data sequence from the frequency domain  $\{y(n), n = 0, 1, 2; ...; N-1\}$  to the time domain  $\{x(n), n = 0, 1, 2, ..., N-1\}$ , and can be depicted as:

$$\mathbf{x}(\mathbf{n}) = \sqrt{\frac{2}{N}} \cdot \sum_{k=0}^{N-1} c(k) \cdot y(k) \cdot \cos \left[\frac{(2n+1)k\pi}{2N}\right]$$
(2)

where n = 0, 1, ..., N-1, and c(k) has the same meaning as the one in DCT. Hence, we can adopt the orthogonal property of the trigonometric transforms to derive Equation (2) from Equation (1).

For Equation (1), we decompose the 8-point DCT into four parts. First, the input signals are pre-processed and represented as follows:

$$\begin{cases} a_0 = x(0) + x(7), a_1 = x(0) - x(7) \\ a_2 = x(1) + x(6), a_3 = x(1) - x(6), \\ a_4 = x(2) + x(5), a_5 = x(2) - x(5) \\ a_6 = x(3) + x(4), a_7 = x(3) - x(4) \end{cases}$$
(3)

Then, the outputs of the 8-point DCT can be written as:

$$\begin{bmatrix} y(0) \\ y(4) \end{bmatrix} = \frac{1}{\sqrt{8}} \cdot \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix} \cdot \begin{bmatrix} a_0 + a_6 \\ a_4 + a_2 \end{bmatrix}$$

$$\begin{bmatrix} y(2) \\ y(6) \end{bmatrix} = \frac{1}{2} \cdot \begin{bmatrix} c_2 & -s_2 \\ s_2 & c_2 \end{bmatrix} \cdot \begin{bmatrix} a_0 - a_6 \\ a_4 - a_2 \end{bmatrix}$$

$$\begin{bmatrix} y(1) \\ y(7) \end{bmatrix} = \frac{1}{2} \cdot \begin{bmatrix} c_1 & -s_1 \\ s_1 & c_1 \end{bmatrix} \cdot \begin{bmatrix} a_1 \\ -a_7 \end{bmatrix}$$

$$+ \frac{1}{2} \cdot \begin{bmatrix} c_3 & s_3 \\ -s_3 & c_3 \end{bmatrix} \cdot \begin{bmatrix} a_3 \\ a_5 \end{bmatrix}$$

$$\begin{bmatrix} y(5) \\ y(3) \end{bmatrix} = \frac{1}{2} \cdot \begin{bmatrix} c_3 & s_3 \\ -s_3 & c_3 \end{bmatrix} \cdot \begin{bmatrix} a_7 \\ a_1 \end{bmatrix}$$

$$-\frac{1}{2} \cdot \begin{bmatrix} c_1 & -s_1 \\ s_1 & c_1 \end{bmatrix} \cdot \begin{bmatrix} a_3 \\ a_5 \end{bmatrix}$$

$$(4)$$

165685

in which  $c_m = \cos(m\pi/16)$  and  $s_m = \sin(m\pi/16)$ . As expressed in Equation (4), the 8-point DCT could be realized by vector rotations, whose rotation angles are  $\pi/16$ ,  $\pi/8$  and  $-3\pi/16$  respectively. Due to rotation angles are in the range of the CORDIC, it means that we can take use of CORDIC to implement the 8-point DCT for Equation (4).

For Equation (2), we can also use the same method for IDCT, which is illustrated as:

$$\begin{bmatrix} b_{1} \\ b_{0} \end{bmatrix} = \frac{1}{\sqrt{8}} \cdot \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix} \cdot \begin{bmatrix} y(0) \\ y(4) \end{bmatrix},$$

$$\begin{bmatrix} b_{3} \\ b_{2} \end{bmatrix} = \frac{1}{2} \cdot \begin{bmatrix} c_{2} - s_{2} \\ s_{2} & c_{2} \end{bmatrix} \cdot \begin{bmatrix} y(6) \\ y(2) \end{bmatrix}$$

$$\begin{bmatrix} b_{5} \\ b_{4} \end{bmatrix} = \frac{1}{2} \cdot \begin{bmatrix} c_{1} - s_{1} \\ s_{1} & c_{1} \end{bmatrix} \cdot \begin{bmatrix} y(7) \\ y(1) \end{bmatrix},$$

$$\begin{bmatrix} b_{7} \\ b_{6} \end{bmatrix} = \frac{1}{2} \cdot \begin{bmatrix} c_{1} - s_{1} \\ s_{1} & c_{1} \end{bmatrix} \cdot \begin{bmatrix} y(3) \\ y(5) \end{bmatrix}$$

$$\begin{bmatrix} b_{9} \\ b_{8} \end{bmatrix} = \frac{1}{2} \cdot \begin{bmatrix} c_{3} & s_{3} \\ -s_{3} & c_{3} \end{bmatrix} \cdot \begin{bmatrix} y(7) \\ y(1) \end{bmatrix},$$

$$\begin{bmatrix} b_{11} \\ b_{10} \end{bmatrix} = \frac{1}{2} \cdot \begin{bmatrix} c_{3} & s_{3} \\ -s_{3} & c_{3} \end{bmatrix} \cdot \begin{bmatrix} y(3) \\ y(5) \end{bmatrix}$$

$$(5)$$

in which the expressions are also realized by the vector rotations, and the rotation angles are the same as the ones in Equation (4). Therefore, by substituting Equation (5) into (2), we can get:

$$\begin{aligned} x(0) &= (b_1 + b_2) + (b_4 + b_{11}) \\ x(7) &= (b_1 + b_2) - (b_4 + b_{11}), \\ x(2) &= (b_0 + b_3) + (b_9 - b_7) \\ x(5) &= (b_0 + b_3) - (b_9 - b_7) \\ x(1) &= (b_0 - b_3) + (b_8 - b_6) \\ x(6) &= (b_0 - b_3) - (b_8 - b_6), \\ x(3) &= (b_1 - b_2) + (b_{10} - b_5) \\ x(4) &= (b_1 - b_2) - (b_{10} - b_5) \end{aligned}$$
(6)

After analysing the decomposed expressions for DCT and IDCT, it can be concluded that the DCT and IDCT would be implemented in the same structure under the control of the Mode controller. Figure 1 illustrates the data flow of the proposed unified DCT/IDCT architecture, which consists of three adder arrays (Adder 1, Adder 2, Adder 3) and two processing elements (PE\_1, PE\_2). For the Adder 1 and the Adder 2, they are the same as the ones in our previous work [15]. Although the components of the Adder\_3 are still the same as the ones in [15], the two constant scale factor  $1/\sqrt{8}$  compensation units are optimized, which would be discussed in the following part of the paper. However, the architecture of the PE\_1 consisted of the EARC rotation element EARC\_1, whose rotation angle is  $\pi/8$  and the scale factor is combined with the value 1/2, is the same as the one in our previous work [15]. The reason is that our previous work



even uses the ARC algorithm, but also replaces i = 2 iteration by executing i = 3 iteration twice. The PE\_2 consists of two inverters, four adders, two subtractors, two identical EARC\_2 and EARC\_3 rotators, whose rotation angle are  $-3\pi/16$  and  $\pi/16$  respectively. In the meantime, both of the results of EARC\_2 and EARC\_3 need to be scaled by 1/2.

#### **B. ENHANCED SCALE FACTOR APPROXIMATION**

As presented in [15], the radix-2 scale factor approximation improves the accuracy, and each part of the proposed approximation expressions has the same critical path delay as the required CORDIC iterations. However, the previous approximation expressions are not sufficient for the enhanced DCT/IDCT architecture, as it only considers the quadratic polynomials whose first terms are 1, whereas the scale factor range of the enhanced DCT/IDCT architecture is  $(2^{-2}, 1)$ , which means that it could also be represented by the quadratic polynomials whose first terms are  $2^{-1}$ . Hence, we propose an enhanced radix-2 scale factor approximation to achieve better performance, which is shown in Algorithm 1.

Compared with the previous one [15], the lines 10-18 whose first items are  $2^{-1}$  are further proposed, and the critical path delay of each part of the additionally proposed expressions is still the same as lines 5-9. The most important is that the enhanced scale factor approximation covers all the situations of two radix-2 parts multiplication, the computation results of which are in the range  $(2^{-2}, 1)$ .

After applying the enhanced scale factor approximation, the scale factor  $1/\sqrt{8}$  in the Adder\_3 is reformulated into

$$k'_{Adder_{3}} = \frac{1}{\sqrt{8}} \cong (1 - 2^{-2} - 2^{-9}) \cdot (2^{-1} - 2^{-5} + 2^{-8})$$
(7)

where the quantization error is below 1.6E-5. Meanwhile, compared with the approximation in [15], the accuracy of the enhanced scale factor approximation for Adder\_3 is improved by a factor of ten. Figure 2 shows the data flow of

| Algorithm 1 | 1 Enhanced | Scale Factor | Approximation |
|-------------|------------|--------------|---------------|
|-------------|------------|--------------|---------------|

| Initialization:                                                             |
|-----------------------------------------------------------------------------|
| Set $k_1 = k'_{ADDER-3} = \frac{1}{\sqrt{8}}$ ; $R_{Accuracy} = 2^{-12}$ ;  |
| $A_{Accuracy} = 2^{-12};$                                                   |
| Iteration:                                                                  |
| 1: for $m = 1$ to 10 do                                                     |
| 2: <b>for</b> $n = 1$ to 10 <b>do</b>                                       |
| 3: <b>for</b> $k = 1$ to 10 <b>do</b>                                       |
| 4: <b>for</b> $l = 1$ to 10 <b>do</b>                                       |
| 5: $A_1 = (1 + 2^{-m} + 2^{-n}) \cdot (1 + 2^{-k} - 2^{-l});$               |
| 6: $A_2 = (1 + 2^{-m} + 2^{-n}) \cdot (1 - 2^{-k} - 2^{-l});$               |
| 7: $A_3 = (1 + 2^{-m} - 2^{-n}) \cdot (1 + 2^{-k} - 2^{-l});$               |
| 8: $A_4 = (1 + 2^{-m} - 2^{-n}) \cdot (1 - 2^{-k} - 2^{-l});$               |
| 9: $A_5 = (1 - 2^{-m} - 2^{-n}) \cdot (1 - 2^{-k} - 2^{-l});$               |
| 10: $A_6 = (1 + 2^{-m} + 2^{-n}) \cdot (2^{-1} - 2^{-k} - 2^{-l});$         |
| 11: $A_7 = (1 + 2^{-m} - 2^{-n}) \cdot (2^{-1} - 2^{-k} - 2^{-l});$         |
| 12: $A_8 = (1 - 2^{-m} - 2^{-n}) \cdot (2^{-1} - 2^{-k} - 2^{-l});$         |
| 13: $A_9 = (1 - 2^{-m} + 2^{-n}) \cdot (2^{-1} - 2^{-k} - 2^{-l});$         |
| 14: $A_{10} = (1 + 2^{-m} - 2^{-n}) \cdot (2^{-1} - 2^{-k} + 2^{-l});$      |
| 15: $A_{11} = (1 - 2^{-m} - 2^{-n}) \cdot (2^{-1} - 2^{-k} + 2^{-l});$      |
| 16: $A_{12} = (1 + 2^{-m} + 2^{-n}) \cdot (2^{-1} + 2^{-k} + 2^{-l});$      |
| 17: $A_{13} = (1 + 2^{-m} - 2^{-n}) \cdot (2^{-1} + 2^{-k} + 2^{-l});$      |
| 18: $A_{14} = (1 - 2^{-m} - 2^{-n}) \cdot (2^{-1} + 2^{-k} + 2^{-l});$      |
| 19: $A_{15} = (2^{-1} + 2^{-m} + 2^{-n}) \cdot (2^{-1} + 2^{-k} + 2^{-l});$ |
| 20: $A_{16} = (2^{-1} + 2^{-m} + 2^{-n}) \cdot (2^{-1} - 2^{-k} + 2^{-l});$ |
| 21: $A_17 = (2^{-1} + 2^{-m} + 2^{-n}) \cdot (2^{-1} - 2^{-k} - 2^{-l});$   |
| 22: $A_{18} = (2^{-1} + 2^{-m} - 2^{-n}) \cdot (2^{-1} + 2^{-k} - 2^{-l});$ |
| 23: Set $C\_Accuracy = Min( k_1 - A\_i , i \in [1, 18])$                    |
| 24: <b>if</b> $C_Accuracy < R_Accuracy$ <b>then</b>                         |
| 25: <b>if</b> $C_Accuracy < A_Accuracy$ <b>then</b>                         |
| 26: $A\_Accuracy = C\_Accuracy;$                                            |
| 27: $q = i; w = m; r = n; t = k; s = l;$                                    |
| 28: <b>end if</b>                                                           |
| 29: <b>end if</b>                                                           |
| 30: end for                                                                 |
| 31: end for                                                                 |
| 32: end for                                                                 |
| 33: end for                                                                 |
| 34: return A Accuracy, $q, w, r, s, t$ ;                                    |



FIGURE 2. Data Flow of Adder\_3.

the Adder\_3, in which the first two clock cycles are required for a balanced pipeline to execute the scale factor compensation expressed as Equation (7), and the last stage is used to implement the Equation (4) vector rotation.



FIGURE 3. Data Flow of EARC\_2.

#### C. ROTATION ELEMENT EARC\_2

The rotation angle of EARC\_2 is  $-3\pi/16$ , which exceeds the convergence range of EARC, and we can use domain folding technology [15], but it imports the scale factor compensation again. Therefore, we first choose to execute the i = 1 iteration of the conventional CORDIC, and then the residual angle is 12'b000010000000, which can be realized by executing the i = 4 iteration of EARC. The rotation iteration sequencing of EARC\_2 ordered from left to right is

$$\begin{bmatrix} 1 & 2^{-1} \\ -2^{-1} & 1 \end{bmatrix}, \begin{bmatrix} 1 - 2^{-7} & 2^{-3} \\ -2^{-3} & 1 - 2^{-7} \end{bmatrix}$$
(8)

Even the EARC is scaling-free, but the EARC\_2 executes the i = 1 rotation of the conventional CORDIC first, thus the results of EARC\_2 need to be scaled by the scale factor  $\cos(\arctan(2^{-1})) = 1/\sqrt{1+2^{-2}}$  of the i = 1 rotation. Meanwhile, we can combine the post scale-factor 1/2 with  $1/\sqrt{1+2^{-2}}$  together to save the resources. We also use Algorithm 1 to approximate the factor k<sub>EARC\_2</sub>, which is expressed as

$$k_{EARC_2} = \frac{1}{\sqrt{1+2^{-2}}} \cdot \frac{1}{2} \cong (1+2^{-10}-2^{-4}) \cdot (2^{-1}-2^{-5}+2^{-7}) \quad (9)$$

where the accuracy of the  $k_{ARC_2}$  is above 2.9E-5.

Figure 3 shows the pipeline-balanced data flow of EARC\_2, in which the first two stages are used to execute the Equation (8) rotation iterations, and the scale factor compensation depicted as Equation (9) is implemented in the remaining two stages. Meanwhile, the shifter splitting scheme is used in the last stage to reduce the hardware resources [15].

### D. ROTATION ELEMENT EARC\_3

As  $\pi/16$  is in the convergence range of EARC, it is expressed as 12'b000011001001, and then EARC i = 4, i = 5, i = 8 and i = 11 iterations are implemented. The corresponding rotations are

$$\begin{bmatrix} 1 - 2^{-7} & -2^{-3} \\ 2^{-3} & 1 - 2^{-7} \end{bmatrix}, \begin{bmatrix} 1 - 2^{-9} & -2^4 \\ 2^{-4} & 1 - 2^{-9} \end{bmatrix}, \\\begin{bmatrix} 1 & -2^{-7} \\ 2^{-7} & 1 \end{bmatrix}, \begin{bmatrix} 1 & -2^{-10} \\ 2^{-10} & 1 \end{bmatrix}$$
(10)

Figure 4 shows the data flow of EARC\_3, which is also divided into four stages, but all of the stages are used to



FIGURE 4. Data Flow of EARC\_3.

execute the rotations illustrated in Equation (10). The reason is that EARC is scaling-free, and the rotation angle is in the convergence range of EARC.

## **III. PERFORMANCE EVALUATION AND COMPARISON**

In this section, we compare our enhanced unified DCT/IDCT architecture with the state-of-the-art DCT discussed in [13], [14] and our previous work presented in [15] in terms of accuracy, hardware resources, speed and power consumption. Due to our precious work [15] and the enhanced unified DCT/IDCT architecture are both synthesized on the Xilinx Virtex5 LX110T platform, the Huang CORDIC algorithm [13] and the Lee CORDIC algorithm [14] are re-implemented and synthesized on the same FPGA to conduct a fair comparison.

# A. ACCURACY COMPARISON

The accuracy of the CORDIC based rotation elements are first analysed, because they have a significant impact on the PSNR of the whole unified DCT/IDCT architecture.

The CORDIC algorithm has two types of errors: quantization error and truncation error [17]. As discussed in [19], the bit error position (BEP) is commonly used to compare the accuracy of the CORDIC based vector rotations. Therefore, we also the method mentioned in [19] that a pseudorandom sequence of 1000 vectors lying evenly in the convergence range  $[0, 2^8-1)$  is generated. Taking the general rotation angle  $-3\pi/16$  that is realized in all of the compared architecture as an example, we use these vectors as the inputs of the Huang CORDIC [13], the Lee CORDIC [14], the previous ARC 2 [15] and the proposed EARC 2. The corresponding cosine BEPs of the four different architectures are illustrated in Figure 5. The BEPs of the Lee CORDIC are approximately located at the 10<sup>th</sup> decimal digit position, and the Huang CORDIC, the previous ARC\_2 and the proposed EARC\_2 are located above the 9<sup>th</sup>, the 10<sup>th</sup> and the 12<sup>th</sup> decimal digit positions, respectively. The reason is that the proposed EARC\_2 is implemented by the cooperation between EARC and the conventional CORDIC to overcome the rotation angle out of the convergence range of EARC, and the scale factor of the conventional CORDIC is combined with the post scale factor in DCT/IDCT, which is realized by the enhanced radix-2 scale factor approximation.



FIGURE 5. BEP Comparison

In the meantime, we also use the MSE mentioned in [21] to compare our proposed algorithm with the ones described in [21]. As illustrated in [21], the MSE of the 2S-8P-DCT and 3S-8P-DCT are 1.403e-4 and 3.864e-5, individually, and the MSE of our proposed unified DCT/IDCT architecture is 9.462e-7, which is much less than 2S-8P-DCT and 3S-8P-DCT. The reason is that our unified DCT/IDCT architecture adopts the EARC, which reduces the computation error.

Furthermore, we generate  $500.8 \times 8$  test matrices to obtain the average PSNR to evaluate the performance of the different DCT and unified DCT/IDCT architectures. The PSNRs of the Huang architecture [13], the Lee architecture [14], the previous unified DCT/IDCT architecture [15] and the proposed unified DCT/IDCT architecture are 43.16 dB, 45.98 dB, 47.04 dB and 48.02 dB, respectively, which means the PSNR of the proposed unified DCT/IDCT architecture is the best and the Huang architecture is the worst. More previously, the proposed unified DCT/IDCT architecture exceeds the Lee, the Huang and the previous architectures by 4.86 dB, 2.04 dB and 0.98 dB, individually. As previously discussed, the reason is that the components in the proposed unified DCT/IDCT is realized based on EARC and the optimized radix-2 scale factor approximation scheme is adopted simultaneously.

### **B. AREA COMPARISON**

Due to the Lee architecture only works in DCT mode, the Huang, the previous architecture and the proposed unified DCT/IDCT architecture are all synthesized under DCT mode and unified DCT/IDCT mode.

Table 1 illustrates the consumed hardware resources of the Lee, the Huang, the previous work and the proposed architecture under DCT mode, and the required area of the Huang, the previous work and the proposed unified DCT/IDCT architecture working in DCT/IDCT mode is shown Table 2. No matter under DCT mode or DCT/IDCT mode, the proposed unified architecture uses the least hardware resources,

 TABLE 1. Area and speed comparison under DCT mode.

| Mode  |                                | DCT             |                 |                  |               |  |
|-------|--------------------------------|-----------------|-----------------|------------------|---------------|--|
| Arch. |                                | Huang<br>Arch.  | Lee<br>Arch.    | Previous<br>work | This<br>paper |  |
|       | Slice<br>Registers             | 1506            | 992             | 962              | 947           |  |
| Area  | Slice LUTs                     | 1526<br>(33.7%) | 1172<br>(13.7%) | 1046<br>(3.3%)   | 1012          |  |
|       | Slice LUT-<br>FF Pairs         | 1660            | 1216            | 1133             | 1106          |  |
|       | Critical Path<br>Delay (ns)    | 3.05<br>(3.6%)  | 3.16            | 2.95<br>(7.1%)   | 3<br>(5.3%)   |  |
|       | Latency                        | 13              | 6               | 6                | 6             |  |
| Speed | Processing<br>Time (ns)        | 39.65           | 18.96           | 17.70            | 18            |  |
|       | Throughput<br>(M<br>samples/s) | 328             | 316             | 339              | 333           |  |

TABLE 2. Area and speed comparison under DCT/IDCT mode.

| Mode  |                             | DCT/IDCT         |                   |            |  |
|-------|-----------------------------|------------------|-------------------|------------|--|
|       | Arch.                       |                  | Previous<br>work  | This paper |  |
|       | Slice Registers             | 1714<br>(44%)    | 980 (2%)          | 960        |  |
| Area  | Slice LUTs                  | 1812             | 1342              | 1273       |  |
|       | Slice LUT-FF<br>Pairs       | 2019             | 1358              | 1307       |  |
|       | Critical Path<br>Delay (ns) | 3.20             | 3.36              | 3.42       |  |
|       | Latency                     | 13               | 6                 | 6          |  |
| Speed | Processing<br>Time (ns)     | 41.60<br>(50.7%) | 20.16 (-<br>1.8%) | 20.52      |  |
|       | Throughput<br>(M samples/s) | 313              | 298               | 292        |  |

and the Huang architecture requires the most area. Taking the consumed Slice LUTs under DCT mode shown in Table 1 as an example, compared with the Huang, the Lee and the previous architectures, the proposed unified architecture saves area by 33.7%, 13.7% and 3.3%, respectively. For DCT/IDCT mode, using Slice Registers depicted in Table 2 as an example, compared with the proposed unified DCT/IDCT architecture, the Huang architecture and the previous architecture uses 44% and 2% more hardware resources, individually.

Compared with the Lee architecture, the previous work and the proposed unified DCT/IDCT architecture, the scale factor compensations of the Huang architecture is not integrated into the CORDIC based rotation elements to reduce the required hardware resource. Compared with the Lee architecture and the previous unified architecture, the proposed unified architecture adopts EARC or the cooperation of EARC and the conventional CORDIC to implement the required vector rotations to further save hardware resources.

Meanwhile, we also compare the proposed unified DCT/IDCT architecture with the 2S-8P-DCT and 3S-8P-DCT [21]. As the 2S-8P-DCT and 3S-8P-DCT are synthesized on Altera Stratix and our proposed architecture is validated on Xilinx Virtix5, we use the consumed adders to

compare the different architectures. The Adders of the 2S-8P-DCT, 3S-8P-DCT and the proposed unified DCT/IDCT architecture are 38, 48 and 100, separately, in which the proposed unified DCT/IDCT architecture supports to implement the DCT and IDCT function time-sharing, but the 2S-8P-DCT and 3S-8P-DCT can only realize the DCT function. Hence, the 2S-8P-DCT and 3S-8P-DCT require less adders.

## C. SPEED COMPARISON

As discussed in [15], we also compare the speed in terms of critical path delay, latency, processing time and throughput, which are also illustrated in Table 1 under DCT mode and Table 2 working in DCT/IDTC mode. As illustrated in Table 1 and Table 2, compared to the Huang architecture no matter working modes, all of the Lee, the previous work and the proposed unified DCT/IDCT architecture provide a factor of 2.17-fold improvement in latency. The reason is that the data dependence of the neighbouring rotation elements have also been eliminated to reduce the latency for the Lee, the previous work and the proposed unified DCT/IDCT architecture. Meanwhile, compared with the Huang and Lee architectures under DCT mode, the proposed unified architecture has the shorter critical path delay and processing time, and owns the higher throughput. However, compared to our previous work, the proposed unified DCT/IDCT architecture takes the longer critical path delay and processing time. For example, compared with the Lee architecture, Huang architecture and the previous work, the proposed unified DCT/IDCT architecture reduces the critical path delay by 3.6%, 7.1% and 5.3%, respectively.

Under the DCT/IDCT mode, the Huang architecture owns the least critical path delay. However, compared with the Huang architecture and the previous work, the proposed unified DCT/IDCT architecture reduces 50.7% and -1.8% in processing time.

Meanwhile, as the 2S-8P-DCT and 3S-8P-DCT only can work in DCT mode, we compare the latency and the throughput with the 2S-8P-DCT, 3S-8P-DCT and our proposed unified DCT/DCT architecture under DCT mode. The throughputs (M samples/s) of the three different architecture are 216, 181 and 333, individually, which illustrates that our unified DCT/DCT architecture provides a factor of 1.54-fold and 1.84-fold improvement in throughput. The latency of the 2S-8P-DCT, 3S-8P-DCT and our proposed unified DCT/DCT architecture are 4, 5 and 6, respectively. Therefore, the processing time (ns) for the 2S-8P-DCT, 3S-8P-DCT and our proposed unified DCT/DCT architecture are 18.52, 27.62 and 18 respectively, which illustrates that our unified DCT/DCT architecture uses the least processing time even the latency is the highest.

## **D. POWER COMPARISON**

As power plays an critical role in embedded systems, we also use the same methods in [15] to compare the power consumption between the Huang architecture, the Lee architecture, the previous work and the proposed unified DCT/IDCT

| Arch.        | Huang Arch. | Lee Arch.  |                 |
|--------------|-------------|------------|-----------------|
|              |             | 316 MHz    | 146 MHz         |
| Toggle Rates | 316 MHz     | Equal      | Equal Processin |
|              |             | Throughput | Time            |
| 12.5% (mW)   | 148 (45.3%) | 98 (17.3%) | 52              |
| 100% (mW)    | 257         | 184        | 88              |

#### TABLE 3. Power comparison under DCT mode (1).

TABLE 4. Power comparison under DCT mode (2).

| Arch.                 | Previous work       |                             | This paper          |                             |  |
|-----------------------|---------------------|-----------------------------|---------------------|-----------------------------|--|
|                       | 316 MHz             | 146 MHz                     | 316 MHz             | 146 MHz                     |  |
| Toggle<br>Rates       | Equal<br>Throughput | Equal<br>Processing<br>Time | Equal<br>Throughput | Equal<br>Processing<br>Time |  |
| 12.5%<br>(mW)         | 89 (9%)             | 47                          | 81                  | 38                          |  |
| 100%<br>( <i>mW</i> ) | 167                 | 82                          | 148                 | 67                          |  |

architecture by using the XPower tool [20], in which the first method is to keep the throughput constant by setting the same operating frequency, and the other method is to have the identical processing time according to the ratio of latencies.

For DCT mode, as discussed in speed comparison, due to the Lee architecture only could work under the frequency no more than 316 MHz, we first set the operating frequencies of all the architectures to 316 MHz to obtain the same throughput. In the meantime, as the Huang architecture requires the longest processing time, the operating frequencies of the Lee architecture, the previous work and the proposed unified DCT/IDCT architecture must be 146 MHz to keep the processing time constant after the frequency of the Huang architecture set to 316 MHz. The power dissipation of the four different architectures are illustrated in Table 3 and Table 4. No matter under same throughput or equal processing time and which toggle rate is selected, the proposed unified consumes the least power.

Taking toggle rate = 12.5% under the same throughput as an example, the power consumption of the Huang architecture, the Lee architecture, the previous work and the proposed unified DCT/IDCT architecture are 148 mW, 98 mW, 89 mW and 81 mW, respectively, which means that the Huang architecture, the Lee architecture and the previous work consumes 45.3%, 17.3% and 9% more power compared with the proposed unified DCT/IDCT architecture, respectively. On the condition that the processing time is the same, the proposed unified DCT/IDCT architecture also consumes much less power than the other three architectures for all tested toggle rates. The reason is that the required hardware resources of the proposed unified DCT/IDCT architecture are less than the Huang architecture, the Lee architecture and the previous work.

Table 5 illustrates the estimated power dissipation of the Lee, the previous work and the proposed unified DCT/IDCT architecture under DCT/IDCT mode, in which as the proposed unified DCT/IDCT architecture only could work under the frequency no more than 292 MHz, the operating

| Arch.                  | Huang<br>Arch. | Previous work            |                             | This paper               |                             |
|------------------------|----------------|--------------------------|-----------------------------|--------------------------|-----------------------------|
| Tagala                 | 292<br>MHz     | 292<br>MHz               | 137 MHz                     | 292<br>MHz               | 137 MHz                     |
| Rates                  |                | Equal<br>Through-<br>put | Equal<br>Processing<br>Time | Equal<br>Through-<br>put | Equal<br>Processing<br>Time |
| 12.5%<br>( <i>mW</i> ) | 208<br>(38.5%) | 136<br>(5.9%)            | 64                          | 128                      | 59                          |
| 100%<br>( <i>mW</i> )  | 391<br>(51.9%) | 244                      | 107<br>(6.5%)               | 227                      | 100                         |

TABLE 5. Power comparison under DCT/IDCT mode.

frequencies of the Huang architecture and the previous work are both set to 292 MHz to achieve the same throughput, and the previous work and the proposed unified DCT/IDCT architecture are also working under 137 MHz frequency to realize the same processing time with the Huang architecture working under 292 MHz. Taking the same throughput under the 12.5% toggle rate as an example again, the power dissipation of the Huang architecture, the previous work and the proposed unified DCT/IDCT architecture are 208 mW, 136 mW and 128 mW, individually, which means the proposed unified DCT/IDCT architecture consumes 38.5% and 5.9% less power than the Huang architecture and the previous work, respectively. On the equal processing time condition, even when choosing the best case 12.5% toggle rate for the Huang architecture and the worst case 100% toggle rate for the proposed unified DCT/IDCT architecture and the previous work, the proposed unified DCT/IDCT architecture still saves over 51.9% and 6.5% power consumption than the Huang architecture and the previous work, respectively.

Hence, no matter which mode is chosen, the proposed unified architecture dissipates the least power, especially under the equal processing time. The reason is that the proposed unified DCT/IDCT architecture consumes the least hardware resources.

# **IV. CONCLUSION**

This paper has presented an enhanced implementation of unified DCT/IDCT architecture based on the cooperation between the Enhanced Adaptive Recoding CORDIC (EARC) and the conventional CORDIC. Meanwhile, we also propose an optimized efficient adder and shifter-based radix-2 scale factor approximation. Under DCT mode, compared with the Huang architecture, the Lee architecture and the previous work, the proposed unified DCT/IDCT architecture saves over 3.3% in area, improves PSNR by over 0.98 dB, and dissipates over 9% less power at the nearly 1.8% cost of the critical path delay. For DCT/IDCT mode, compared with the Huang architecture and the previous work, the proposed unified DCT/IDCT architecture uses over 2% less hardware resources. Under the same throughput, compared with the Huang architecture and the previous work, the proposed unified architecture dissipates 38.5% and 5.9% less power, respectively. On the same processing time condition, even when choosing the 100% toggle rate for the proposed unified

architecture and 12.5% toggle rate for the Huang architecture, the proposed architecture still could save nearly 51.9% power consumption. Compared with the 2S-8P-DCT and 3S-8P-DCT, our proposed unified DCT/IDCT architecture provides the best accuracy and speed, but consumes the most hardware resources as the 2S-8P-DCT and 3S-8P-DCT only can work in DCT mode. Hence, no matter under DCT mode or DCT/DICT mode, the proposed unified architecture has the best performance.

#### REFERENCES

- K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications. San Diego, CA, USA: Academic, 1990.
- [2] R. J. Clarke, "Relation between the Karhunen Loève and cosine transforms," *IEE Proc. F-Commun., Radar Signal Process.*, vol. 128, no. 6, pp. 359–360, Nov. 1981.
- [3] G. K. Wallace, "The JPEG still picture compression standard," *IEEE Trans. Consum. Electron.*, vol. 38, no. 1, pp. 18–34, Feb. 1992.
- [4] D. Le Gall, "MPEG: A video compression standard for multimedia applications," *Commun. ACM*, vol. 34, no. 4, pp. 46–58, 1991.
- [5] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," *IEEE Trans. Circuits Syst. Video Technol.*, vol. 13, no. 7, pp. 560–576, Jul. 2003.
- [6] A. Ahmed, M. U. Shahid, and A. ur Rehman, "N point DCT VLSI architecture for emerging HEVC standard," J. VLSI Des., vol. 2012, Jan. 2012, Art. no. 6.
- [7] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," *IEEE Trans. Comput.*, vol. C-23, no. 1, pp. 90–93, Jan. 1974.
- [8] Y. Jeong, I. Lee, H. S. Kim, and K. T. Park, "Fast DCT algorithm with fewer multiplication stages," *Electron. Lett.*, vol. 34, no. 8, pp. 723–724, Apr. 1998.
- [9] C.-Y. Li, Y.-H. Chen, T.-Y. Chang, and J.-N. Chen, "A probabilistic estimation bias circuit for fixed-width Booth multiplier and its DCT applications," *IEEE Trans. Circuits Syst.*, *II, Exp. Briefs*, vol. 58, no. 4, pp. 215–219, Apr. 2011.
- [10] S. Yu and E. E. Swartziander, "DCT implementation with distributed arithmetic," *IEEE Trans. Comput.*, vol. 50, no. 9, pp. 985–991, Sep. 2001.
- [11] Y.-H. Chen, T.-Y. Chang, and C.-Y. Li, "High throughput DA-based DCT with high accuracy error-compensated adder tree," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 19, no. 4, pp. 709–714, Apr. 2011.
- [12] H. Huang and L. Xiao, "CORDIC based fast radix-2 DCT algorithm," *IEEE Signal Process. Lett.*, vol. 20, no. 5, pp. 483–486, May 2013.
- [13] H. Huang, L. Xiao, and J. Liu, "CORDIC-based unified architectures for computation of DCT/IDCT/DST/IDST," *Circuits, Syst., Signal Process.*, vol. 33, no. 3, pp. 799–814, Mar. 2014.
- [14] M.-W. Lee, J.-H. Yoon, and J. Park, "Reconfigurable CORDIC-based lowpower DCT architecture based on data priority," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 22, no. 5, pp. 1060–1068, May 2014.
- [15] J. Zhang, P. Chow, and H. Liu, "FPGA implementation of low-power and high-PSNR DCT/IDCT architecture based on adaptive recoding CORDIC," in *Proc. Int. Conf. Field-Program. Technol. (FPT)*, Dec. 2015, pp. 128–135.
- [16] J. E. Volder, "The CORDIC trigonometric computing technique," IRE Trans. Electron. Comput., vol. 8, no. 3, pp. 330–334, Sep. 1959.
- [17] P. K. Meher, J. Valls, T.-B. Juang, K. Sridharan, and K. Maharatna, "50 years of CORDIC: Algorithms, architectures, and applications," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 56, no. 9, pp. 1893–1907, Sep. 2009.
- [18] J. Zhang, H. Liu, W. Hu, D. Liu, and B. Zhang, "Adaptive recoding CORDIC," *IEICE Electron. Exp.*, vol. 9, no. 8, pp. 765–771, Apr. 2012.
- [19] J. Zhang, P. Chow, and H. Liu, "An enhanced adaptive recoding rotation CORDIC," ACM Trans. Reconfigurable Technol. Syst., vol. 9, no. 1, pp. 4:1–4:25, 2015.
- [20] XPower Estimator User Guide, User Guide, Xilinx, San Jose, CA, USA, Jan. 2012.
- [21] T.-T. Hoang, D.-H. Le, and C.-K. Pham, "Minimum adder-delay architecture of 8/16/32-point DCT based on fixed-rotation adaptive CORDIC," *IEICE Electron. Express*, vol. 15, no. 10, pp. 1–12, May 2018.

[22] Y. Feng, J. Zhang, and H. Liu, "A novel low-power and high-PSNR architecture based on ARC for DCT/IDCT," in *Computer Engineering and Technology*. Cham, Switzerland: Springer, 2016, pp. 55–68.



**JIANFENG ZHANG** received the B.S., M.S., and Ph.D. degrees from the National University of Defense Technology, in 2009, 2011, and 2016, respectively. He is currently an Associate Professor with the College of Computer, National University of Defense Technology. His current research interests include hardware accelerator, computer architecture, and artificial intelligence.



**WEI SHI** received the M.S. and Ph.D. degrees from the National University of Defense Technology, in 2006 and 2010, respectively. He is currently an Associate Professor with the College of Computer, National University of Defense Technology. His current research interests include computer architecture, microprocessor design, and information security.



**LI ZHOU** received the B.S., M.S., and Ph.D. degrees from the National University of Defense Technology, in 2008, 2010, and 2014, respectively. He is currently an Associate Professor with the College of Computer, National University of Defense Technology. His current research interests include microprocessor architecture and reconfigurable hardware design as well as accelerator hardware.





ogy, in 2005 and 2008, respectively. He is currently an Associate Professor with the College of Computer, National University of Defense Technology. His current research interests include computer architecture, microprocessor design, asynchronous circuit, and information security.

**RUI GONG** received the M.S. and Ph.D. degrees

from the National University of Defense Technol-

**LEI WANG** received the M.S. and Ph.D. degrees from the National University of Defense Technology, in 2000 and 2006, respectively. She is currently an Associate Professor with the College of Computer, National University of Defense Technology. Her current research interests include computer architecture, asynchronous circuit, artificial intelligence, and neuromorphic computation.



**HONGWEI ZHOU** received the M.S. and Ph.D. degrees from the National University of Defense Technology, in 2003 and 2007, respectively. He is currently an Associate Professor with the College of Computer, National University of Defense Technology. His current research interests include performance and power optimization for network-on-chip, memory system in multi-core processors, and architecture level power optimization for ultradeep submicron microprocessors.

...