

Received February 6, 2020, accepted February 18, 2020, date of publication March 6, 2020, date of current version March 18, 2020. Digital Object Identifier 10.1109/ACCESS.2020.2978773

# **Approximate Multiplier Design Using Novel Dual-Stage 4 : 2 Compressors**

# PRANOSE J. EDAVOOR, SITHARA RAVEENDRAN, AND AMOL D. RAHULKAR (Member, IEEE)

National Institute of Technology, Goa 403401, India

Corresponding author: Pranose J. Edavoor (pranose@gmail.com)

This work was supported by the Science and Engineering Research Board (SERB), Department of Science and Technology, Government of India, under Grant ECR/2016/001352.

**ABSTRACT** High speed multimedia applications have paved the way for a whole new area in high speed error-tolerant circuits with approximate computing. These applications deliver high performance at the cost of reduction in accuracy. Furthermore, such implementations reduce the complexity of the system architecture, delay and power consumption. This paper explores and proposes the design and analysis of two approximate compressors with reduced area, delay and power with comparable accuracy when compared with the existing architectures. The proposed designs are implemented using 45 nm CMOS technology and the efficiency of the proposed designs has been extensively verified and projected on scales of area, delay, power, Power Delay Product (PDP), Error Rate (ER), Error Distance (ED), and Accurate Output Count (AOC). The proposed approximate 4 : 2 compressor shows 56.80% reduction in area, 57.20% reduction in power, and 73.30% reduction in delay compared to an accurate 4 : 2 compressor. The proposed compressors are utilised to implement $8 \times 8$ and $16 \times 16$ Dadda multipliers. These multipliers have comparable accuracy when compared with existing designs. The proposed design is also applied in error resilient applications like image smoothing and multiplication.

**INDEX TERMS** Approximate 4:2 compressors, approximate multipliers, error resilient applications, image processing.

#### I. INTRODUCTION

The overhead on computation units in a processor to deliver high performance and execution efficiency can be reduced by introducing approximation. Speed of operation, which is inversely proportional to the delay of the system, requires immense parallel operations that incur huge hardware and power dissipation [1], [2]. Energy and area efficient systems can be realised by relaxing the precision and reliability of the system. In order to maintain the balance between delay, area and power, approximate computing has emerged as a promising solution. Approximation in arithmetic operations results in faster systems with lesser design complexity and power consumption [3]–[5]. The trade-off would be reduction in accuracy, which does not necessarily affect the normal operation for machine learning and multimedia applications. Such applications effectively take advantage of the inability of the human eye to detect variation in finer details within images and videos. This level of error tolerance is used to design approximate arithmetic circuits for Artificial Intelligence (AI) and Digital Signal Processing (DSP) applications.

The associate editor coordinating the review of this manuscript and approving it for publication was Yizhang Jiang.

Extensive research has been carried out to enhance the efficiency of approximate arithmetic units [6]–[11]. In the multiplication operation, partial product summation has inarguably been the major contributor to the power dissipation and delay in the system [12]. Research shows that compressors can reduce the delay associated with the partial product summations [12]. Compressors count the number of logic 1s in the input using half adders and/or full adders. The commonly used topologies for compressors are 7 : 3, 5 : 2, 4 : 2, and 3 : 2 [12]–[15]. A 4 : 2 compressor is preferred over any other topology due to the regularity achieved while cascading. Also, it has been widely used to design Dadda multipliers [15].

A transistor level XOR-XNOR based low power design for 4 : 2 compressor was proposed by Jiangmin *et al.* [16] which is ideal for tree structured fast multipliers. Chang *et al.* [12] have proposed a 4 : 2 and a novel 5 : 2 compressor that operates on low supply voltage of 0.6 V. Momeni *et al.* [17] have proposed logic level approximation based architectures for 4 : 2 approximate compressor that are optimised for delay and power consumption. A re-configurable architecture for a 4 : 2 approximate compressor is proposed by Akbari *et al.* [18], where the re-configurability is achieved by switching between approximate and accurate operations when required. Ha and Lee [19] have proposed a 4 : 2 approximate compressor that reduces the error profile of the compressor by introducing a module for error recovery. While performing the multiplication operation, truncation of $\frac{n}{2}$ columns (starting from right in the complete partial product array) is carried out. Compressors are applied only to the remaining columns. A probability driven approximate compressor is presented by Guo *et al.* [20]. The authors have proposed a top-down structure for an approximate multiplier which dynamically allocates between the 8 : 2, 6 : 2 and 4 : 2 approximate compressors based on the partial product count. As a measure to increase the accuracy of the multiplier, a grouped error recovery scheme is also proposed. Alouani *et al.* [21] have presented an approximate adder based heterogeneous approximate multiplier with reduced MED. This is achieved by utilising the genetic algorithm based approximate adders. Esposito *et al.* [22] have proposed an XOR-less (AND-OR based) compressor to minimise the average error and error probability. Chang *et al.* [23] have proposed a 4 : 2 compressor to improve energy quality efficiency in image processing with 25% error rate. Gorantla and Deepa [24] have proposed 4 : 2 and 5 : 2 compressors to reduce delay and power. Reddy *et al.* [25] have proposed a novel design for 4 : 2 compressor with an error rate of 12.5%. This is achieved by relaxing the constraints on area, delay and power.

Due to the considerable reduction in delay using transmission gates when compared to traditional CMOS based logic, optimised designs with transmission gates are explored in literature. However, the major disadvantage is the inconsistency in the rise and fall times for different inputs [12], [16], [26].

In this paper, two novel 4 : 2 compressor architectures are presented. The contributions of the paper are listed below.

- A novel high speed area-efficient, low power 4 : 2 compressor architecture is proposed.
- Overall error rate is 25% with equal number of +1 and -1 ED. This leads to reduced Mean Error Distance (MED) and Mean Relative Error Distance (MRED) which is ideal for Multiply-Accumulate (MAC) operations.
- Dadda multiplier is implemented with the proposed 4 : 2 compressor.
- A modified dual-stage compressor design is proposed to reduce area, delay and power dissipation in multipliers in which more than two stages of cascaded compressors are required for partial product accumulation.
- Proposed designs were verified with the implementation of multiplication and smoothing of images.

FIGURE 1. Exact 4:2 compressor.

The rest of the paper is organised as follows. Section II presents the need for approximation in multipliers and approximate multipliers. The performance metrics evaluated to measure the efficiency of the proposed architectures are presented in Section III. Section IV describes the proposed 4 : 2 compressor architectures. Experimental results are presented in Section V, followed by Section VI with the concluding remarks.

#### **II. APPROXIMATE MULTIPLIERS**

Multiplication is unquestionably a performance determining operation in AI and DSP applications. These applications demand high speed multiplier architectures to support high speed parallel operations with acceptable levels of accuracy. Introduction of approximation in multipliers leads to realisation of faster computations with reduced hardware complexity, delay and power, with accuracy at desirable levels.

Partial product summation is the speed limiting operation in multiplication due to the propagation delay in adder networks. In order to reduce the propagation delay, compressors are introduced. Compressors compute the sum and carry at each level simultaneously. The resultant carry is added with a higher significant sum bit in the next stage. This is continued until the final product is generated.

#### A. EXACT 4:2 COMPRESSOR

The general block diagram of an exact 4 : 2 compressor is shown in Figure 1. It comprises five inputs, three outputs and two cascaded full adders. $A_1, A_2, A_3, A_4$ and $C_{IN}$ are the inputs and $C_{OUT}$, *CARRY* and *SUM* are the outputs of the exact 4 : 2 compressor. $C_{OUT}$, *CARRY* and *SUM* are given as

$$C_{OUT} = A_3(A_1 \oplus A_2) + A_1(\overline{A_1 \oplus A_2}) \tag{1}$$

$$CARRY = C_{IN}(A_1 \oplus A_2 \oplus A_3 \oplus A_4) + A_4(\overline{A_1 \oplus A_2 \oplus A_3 \oplus A_4}) \tag{2}$$

$$SUM = C_{IN} \oplus A_1 \oplus A_2 \oplus A_3 \oplus A_4 \tag{3}$$

A compressor chain is shown in Figure 2. $C_{IN}$ represents the input carry from the preceding 4 : 2 compressor that has processed the lower significant bits. *CARRY* and $C_{OUT}$ are the outputs that carry one order higher significance than the input $C_{IN}$. Table 1 presents the truth table for the exact compressor.

FIGURE 2. Compressor chain.

TABLE 1. Truth table for exact 4:2 compressor.

| $A_1$ | $A_2$ | $A_3$ | $A_4$ | $C_{IN}$ | $C_{OUT}$ | CARRY | SUM |
|-------|-------|-------|-------|----------|-----------|-------|-----|
| 0     | 0     | 0     | 0     | 0        | 0         | 0     | 0   |
| 0     | 0     | 0     | 0     | 1        | 0         | 0     | 1   |
| 0     | 0     | 0     | 1     | 0        | 0         | 0     | 1   |
| 0     | 0     | 0     | 1     | 1        | 0         | 1     | 0   |
| 0     | 0     | 1     | 0     | 0        | 0         | 0     | 1   |
| 0     | 0     | 1     | 0     | 1        | 0         | 1     | 0   |
| 0     | 0     | 1     | 1     | 0        | 0         | 1     | 0   |
| 0     | 0     | 1     | 1     | 1        | 0         | 1     | 1   |
| 0     | 1     | 0     | 0     | 0        | 0         | 0     | 1   |
| 0     | 1     | 0     | 0     | 1        | 0         | 1     | 0   |
| 0     | 1     | 0     | 1     | 0        | 0         | 1     | 0   |
| 0     | 1     | 0     | 1     | 1        | 0         | 1     | 1   |
| 0     | 1     | 1     | 0     | 0        | 1         | 0     | 0   |
| 0     | 1     | 1     | 0     | 1        | 1         | 0     | 1   |
| 0     | 1     | 1     | 1     | 0        | 1         | 0     | 1   |
| 0     | 1     | 1     | 1     | 1        | 1         | 1     | 0   |
| 1     | 0     | 0     | 0     | 0        | 0         | 0     | 1   |
| 1     | 0     | 0     | 0     | 1        | 0         | 1     | 0   |
| 1     | 0     | 0     | 1     | 0        | 0         | 1     | 0   |
| 1     | 0     | 0     | 1     | 1        | 0         | 1     | 1   |
| 1     | 0     | 1     | 0     | 0        | 1         | 0     | 0   |
| 1     | 0     | 1     | 0     | 1        | 1         | 0     | 1   |
| 1     | 0     | 1     | 1     | 0        | 1         | 0     | 1   |
| 1     | 0     | 1     | 1     | 1        | 1         | 1     | 0   |
| 1     | 1     | 0     | 0     | 0        | 1         | 0     | 0   |
| 1     | 1     | 0     | 0     | 1        | 1         | 0     | 1   |
| 1     | 1     | 0     | 1     | 0        | 1         | 0     | 1   |
| 1     | 1     | 0     | 1     | 1        | 1         | 1     | 0   |
| 1     | 1     | 1     | 0     | 0        | 1         | 0     | 1   |
| 1     | 1     | 1     | 0     | 1        | 1         | 1     | 0   |
| 1     | 1     | 1     | 1     | 0        | 1         | 1     | 0   |
| 1     | 1     | 1     | 1     | 1        | 1         | 1     | 1   |
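Equations (1)-(3) can be checked exhaustively against the arithmetic meaning of the compressor. The following is a minimal Python sketch, not the authors' implementation; the function names `exact_compressor` and `check_exact` are illustrative assumptions. It asserts that, with *CARRY* and $C_{OUT}$ weighted 2 and *SUM* weighted 1, the outputs encode the number of ones among the five inputs.

```python
from itertools import product

def exact_compressor(a1, a2, a3, a4, c_in):
    """Exact 4:2 compressor written directly from equations (1)-(3)."""
    x12 = a1 ^ a2
    x1234 = x12 ^ a3 ^ a4
    c_out = (a3 & x12) | (a1 & (1 - x12))          # equation (1)
    carry = (c_in & x1234) | (a4 & (1 - x1234))    # equation (2)
    s = c_in ^ x1234                               # equation (3)
    return c_out, carry, s

def check_exact():
    """Return True if all 32 input rows encode the count of ones."""
    for bits in product((0, 1), repeat=5):
        c_out, carry, s = exact_compressor(*bits)
        # C_OUT and CARRY have weight 2, SUM has weight 1.
        assert 2 * (c_out + carry) + s == sum(bits)
    return True
```

Calling `check_exact()` walks the same 32 rows that Table 1 lists.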

#### **B. NEED FOR APPROXIMATION IN MULTIPLIERS**

Let the unsigned multiplier  $(m_L)$  and multiplicand  $(m_C)$  be denoted as

$$m_L = \sum_{x=0}^{7} m_{L_x} \times 2^x \tag{4}$$

$$m_C = \sum_{y=0}^{7} m_{C_y} \times 2^y \tag{5}$$

Multiplication operation starts with the generation of the partial product array and the generalised expression for partial product is shown in equation (6).

$$P_{p_{x,y}} = \sum_{y=0}^{7} \sum_{x=0}^{7} m_{L_x} \times m_{C_y} \times 2^{(x+y)} \tag{6}$$

FIGURE 3. 8 × 8 approximate multiplier.

The partial products generated for an $8 \times 8$ operation are shown in Figure 3. The complete output ($\lambda$) of the multiplier (product) can be represented as

$$\lambda = \sum_{z=0}^{15} \lambda_z \times 2^z \tag{7}$$

Level 1 in Stage 1 has  $\lambda_0 = P_{p_{0,0}}$  and does not involve any operation. A half adder is required to generate  $\lambda_1$  in Level 2. The carry bit from half adder is passed onto Stage 2. Starting from Level 3, the number of terms in partial product array increases to 4 or more and reduces to 1 as the level increases. At this point, a 4 : 2 compressor facilitates fast partial product summation using approximation.
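The shape of the partial product array can be illustrated with a short Python sketch. It groups the bits $m_{L_x} \times m_{C_y}$ by their weight $x + y$; the helper names `partial_product_columns` and `product_from_columns` are assumptions made for illustration and do not appear in the paper.

```python
def partial_product_columns(m_l, m_c, n=8):
    """Group the partial-product bits m_L[x] AND m_C[y] by weight x + y."""
    columns = [[] for _ in range(2 * n - 1)]
    for x in range(n):
        for y in range(n):
            bit = ((m_l >> x) & 1) & ((m_c >> y) & 1)
            columns[x + y].append(bit)
    return columns

def product_from_columns(columns):
    """Exact accumulation: every bit in column z contributes 2**z."""
    return sum(sum(col) << z for z, col in enumerate(columns))
```

For $n = 8$ the column heights rise from 1 to 8 and fall back to 1, which is why columns holding four or more bits are the natural place for 4 : 2 compressors.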

#### C. APPROXIMATE COMPRESSORS

On applying approximation to the 4 : 2 compressor, the output count can be reduced to 2. Approximation is done by eliminating $C_{OUT}$ [25]. This incurs an error only when the input combination is '1111'. When the input bits are '1111', the *CARRY* and *SUM* are set to '11' and an error of -1 is introduced. An $8 \times 8$ multiplication operation using approximate compressors is shown in Figure 3.
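The effect of eliminating $C_{OUT}$ can be sketched in Python as a behavioural model in which the two remaining outputs saturate at '11'. This is an assumption-laden illustration of the idea described above, not the gate-level design of [25]; the names `approx_no_cout` and `error_inputs` are hypothetical, and the error is taken as approximate minus exact so that the sign matches the -1 quoted in the text.

```python
from itertools import product

def approx_no_cout(a1, a2, a3, a4):
    """Two-output model: (CARRY, SUM) can represent at most the value 3."""
    value = min(a1 + a2 + a3 + a4, 3)
    return value >> 1, value & 1

def error_inputs():
    """Map each erroneous input combination to its signed error."""
    errors = {}
    for bits in product((0, 1), repeat=4):
        carry, s = approx_no_cout(*bits)
        ed = (2 * carry + s) - sum(bits)   # approximate minus exact
        if ed != 0:
            errors[bits] = ed
    return errors
```

Under this model the only erroneous row is '1111', where the output 3 replaces the exact value 4.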

#### **III. PERFORMANCE METRICS**

This section introduces the various performance metrics that are analysed to measure the efficiency of approximate multipliers and compressors. The performance metrics can be broadly classified into accuracy metrics and implementation efficiency metrics.

#### A. ACCURACY METRICS

Accuracy metrics are used to measure the degree of accuracy achieved by the multipliers designed with proposed compressors and existing approximate compressors.

#### 1) ERROR DISTANCE (ED)

ED refers to the difference between the exact 4 : 2 compressor output and the approximate 4 : 2 compressor output.

$$ED = Exact_{out} - Approx_{out} \tag{8}$$

#### 2) MEAN ERROR DISTANCE (MED)

MED refers to the mean of the ED for all possible input combinations.

$$MED = \frac{1}{2^{2N}} \sum_{k=1}^{2^{2N}} \left| ED_k \right| \tag{9}$$

#### 3) MEAN RELATIVE ERROR DISTANCE (MRED)

MRED refers to the mean of ED upon the corresponding *Exact*<sub>out</sub> for all possible input combinations.

$$MRED = \frac{1}{2^{2N}} \sum_{k=1}^{2^{2N}} \frac{ED_k}{Exact_{out\,k}} \tag{10}$$

#### 4) NORMALISED ERROR DISTANCE (NED)

NED measures the mean of ED normalised with the maximum possible error in the proposed design for all possible input combinations.

$$NED = \frac{1}{2^{2N}} \sum_{k=1}^{2^{2N}} \frac{ED_k}{ED_{max}} \tag{11}$$

#### 5) ACCURATE OUTPUT COUNT (AOC)

AOC measures the number of accurate outputs for all possible input combinations.
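The accuracy metrics can be computed by exhaustive enumeration for small operand widths. The sketch below is one possible Python rendering under stated assumptions: it uses $|ED|$ in all three means (as equation (9) does), skips the relative term when the exact product is zero, and takes $ED_{max}$ as the largest $|ED|$ observed. The function `accuracy_metrics` and the toy multiplier `drop_lsb` are hypothetical and are not designs from the paper.

```python
from fractions import Fraction

def accuracy_metrics(approx_mul, n_bits):
    """ER, AOC, MED, MRED and NED over all 2**(2N) operand pairs."""
    total = 2 ** (2 * n_bits)
    abs_eds, rel_sum = [], Fraction(0)
    for a in range(2 ** n_bits):
        for b in range(2 ** n_bits):
            exact = a * b
            ed = abs(exact - approx_mul(a, b))
            abs_eds.append(ed)
            if exact != 0:                 # assumption: skip undefined ratios
                rel_sum += Fraction(ed, exact)
    ed_max = max(abs_eds)
    wrong = sum(1 for ed in abs_eds if ed != 0)
    return {
        "ER": Fraction(wrong, total),
        "AOC": total - wrong,
        "MED": Fraction(sum(abs_eds), total),
        "MRED": rel_sum / total,
        "NED": Fraction(sum(abs_eds), total * ed_max) if ed_max else Fraction(0),
    }

def drop_lsb(a, b):
    """Toy approximate multiplier: clears the least significant product bit."""
    return (a * b) & ~1
```

For example, `accuracy_metrics(drop_lsb, 2)` evaluates the toy multiplier over all 16 pairs of 2-bit operands; exact fractions are used so that the results can be compared without rounding.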

#### **B. IMPLEMENTATION EFFICIENCY METRICS**

The proposed compressors are implemented using 45-nm CMOS technology, at an operating frequency of 1 GHz and supply voltage of 1 V. Implementation efficiency metrics analyse area, power, delay, and PDP. Area refers to how well the proposed design optimises the hardware, which makes the design compact. Delay refers to the time taken by a design to perform its intended operation and it determines the maximum speed at which the circuit can operate. PDP is the product of the power dissipation and the delay. An ideal design must have optimised parameters like area, delay and power for a comparable range of accuracy.
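As a small worked illustration of these metrics, the Python sketch below computes PDP and a percentage reduction relative to a baseline. The helper names `pdp` and `reduction_percent` are assumptions; the sample figures in the comments are the power and delay values of the accurate compressor and of Reddy et al. [25] as listed in Table 5.

```python
def pdp(power_uw, delay_ps):
    """Power Delay Product in ps x uW."""
    return power_uw * delay_ps

def reduction_percent(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline

# Sample rows from Table 5: (power in uW, delay in ps).
accurate = (5.80, 95.7)
reddy = (3.54, 48.8)

pdp_accurate = pdp(*accurate)   # about 555.06, as listed in Table 5
pdp_reddy = pdp(*reddy)         # about 172.75, as listed in Table 5
pdp_saving = reduction_percent(pdp_accurate, pdp_reddy)
```

The same `reduction_percent` helper applies to area, power and delay individually.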

#### IV. APPROXIMATE MULTIPLIER USING PROPOSED APPROXIMATE 4 : 2 COMPRESSOR ARCHITECTURES

Let the probability of error at the output of the compressor be $p$. So, $1-p$ denotes the probability of output to be correct. The probability tree for error to occur in cascaded compressors with ED = 1 is presented in Figure 4. With the reduction of error in the cascaded compressor network in multipliers, MED and MRED of the multiplier can be reduced.

FIGURE 4. Probability tree for 2-Stage cascaded compressors.

It is evident from Figure 4 that, if the inputs at Stages 1 and 2 are one of the input combinations that produce correct results, then Stage 2 output will have ED = 0 with probability $(1-p)^2$. If the input to either Stage 1 or Stage 2 is erroneous, then the output of Stage 2 has ED = 1 with a probability of $p \times (1-p)$ for each of the two cases. Further, if the inputs of Stages 1 and 2 are both erroneous, it results in an ED of 2 with a probability of $p^2$. For a multiplication operation with $n$ stages of partial product summation, the probability to get correct output at the $n^{th}$ stage is given in equation (12).

$$P_{(correct)_n} = (1-p)^n \tag{12}$$

The probability for output of the $n^{th}$ stage to be erroneous $(P_{(error)_n})$ is the sum of the probabilities for all conditions where $ED \ge 1$.

$$P_{(error)_n} = P(ED = 1) + P(ED = 2) + \dots + P(ED = n) \tag{13}$$

The probability for output of the  $2^{nd}$  stage (Figure 4) to be erroneous  $(P_{(error)_2})$  is the sum of the probabilities for all conditions where  $1 \le ED \le 2$ .

$$P_{(error)_2} = P(ED = 1) + P(ED = 2) \tag{14}$$

$$P_{(error)_2} = 2(1-p)p + p^2 \tag{15}$$

In general, the probability of output at  $n^{th}$  stage to be erroneous is given in equation (16).

$$P_{(error)_n} = \sum_{m=1}^n \binom{n}{m} (1-p)^{n-m} p^m \tag{16}$$
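Equation (16) is the complement of equation (12), since the binomial terms for $m = 0, \dots, n$ sum to 1. A minimal Python sketch of this check follows; the function names are illustrative assumptions and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from math import comb

def p_error_sum(p, n):
    """Equation (16): sum over m = 1..n erroneous stages out of n."""
    return sum(comb(n, m) * (1 - p) ** (n - m) * p ** m
               for m in range(1, n + 1))

def p_error_closed(p, n):
    """Complement of equation (12): 1 - (1 - p)**n."""
    return 1 - (1 - p) ** n
```

With $p = 1/4$ and $n = 2$, both forms reduce to equation (15), $2(1-p)p + p^2 = 7/16$.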

If the error has equal probability to be positive or negative, then the overall error probability in the cascaded architecture reduces. It is assumed that the probability of error to be positive and negative is equal to $\frac{p}{2}$. The probability tree modified for the current assumption is shown in Figure 5.

**FIGURE 5.** Probability tree for 2-Stage cascaded compressor with ED of +1 and -1.

In Figure 5, $1 - p$ represents the probability of output to be correct and $\frac{p}{2}, \frac{p}{2}$ represent the probability of the output to have a deviation of +1 and -1 respectively. When the inputs to Stages 1 and 2 are the combinations that produce correct results, the probability of output of Stage 2 is $(1 - p)^2$ with ED = 0. If only one stage gets the correct input combination, then the probability of the output to be erroneous is $(1 - p) \times p/2$ and ED is estimated to be 1. When ED at both the stages is the same ((-1, -1) or (+1, +1)), the error distance is estimated to be 2 with a probability of $p/2 \times p/2$. When ED at both the stages differs ((-1, +1) or (+1, -1)), ED at the output of Stage 2 is reduced to 0, with a probability $p/2 \times p/2$. It is evident that cascading of compressors with positive and negative error distance with the same absolute deviation from the actual output can nullify the error at the output of the second stage and reduce the MED and MRED of the multiplier. Here, the probability of ED to be 0 at Stage 2 is given as

$$P'_{(correct)_2} = (1-p)^2 + \left(\frac{p}{2}\right)^2 + \left(\frac{p}{2}\right)^2 \tag{17}$$

$$P'_{(correct)_2} = P_{(correct)_2} + 2\left(\frac{p}{2}\right)^2 \tag{18}$$

It is evident from equation (18) that, with two stage cascading of compressors, the probability of the output to be correct is increased by $2(\frac{p}{2})^2$. Consequently, the probability of the error, MED and MRED are reduced. Probability of error can further be reduced if there are more cascaded stages. Thus, to obtain minimum MED and MRED, equal positive and negative deviation with minimum ED is required.
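The cancellation argument can be reproduced by enumerating the nine branches of the two-stage tree. The Python sketch below assumes, as the probability tree does, that the two stages err independently; the function name `two_stage_distribution` is an illustrative assumption.

```python
from fractions import Fraction
from itertools import product

def two_stage_distribution(p):
    """Distribution of the net error after two stages with per-stage
    error 0 (probability 1 - p), +1 (p / 2) and -1 (p / 2)."""
    stage = {0: 1 - p, +1: p / 2, -1: p / 2}
    dist = {}
    for (e1, q1), (e2, q2) in product(stage.items(), repeat=2):
        dist[e1 + e2] = dist.get(e1 + e2, 0) + q1 * q2
    return dist
```

For the 25% error rate, `two_stage_distribution(Fraction(1, 4))[0]` evaluates to $19/32$, which exceeds $(1-p)^2 = 18/32$ by exactly $2(p/2)^2 = 1/32$, in agreement with equation (18).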



FIGURE 6. Proposed area-efficient 4:2 compressor.

TABLE 2. Truth table for proposed area efficient 4:2 compressor.

| $A_1$ | $A_2$ | $A_3$ | $A_4$ | CARRY | SUM | ED |
|-------|-------|-------|-------|-------|-----|----|
| 0     | 0     | 0     | 0     | 0     | 0   | 0  |
| 0     | 0     | 0     | 1     | 0     | 1   | 0  |
| 0     | 0     | 1     | 0     | 0     | 1   | 0  |
| 0     | 0     | 1     | 1     | 0     | 1   | -1 |
| 0     | 1     | 0     | 0     | 1     | 0   | +1 |
| 0     | 1     | 0     | 1     | 1     | 0   | 0  |
| 0     | 1     | 1     | 0     | 1     | 0   | 0  |
| 0     | 1     | 1     | 1     | 1     | 1   | 0  |
| 1     | 0     | 0     | 0     | 1     | 0   | +1 |
| 1     | 0     | 0     | 1     | 1     | 0   | 0  |
| 1     | 0     | 1     | 0     | 1     | 0   | 0  |
| 1     | 0     | 1     | 1     | 1     | 1   | 0  |
| 1     | 1     | 0     | 0     | 1     | 0   | 0  |
| 1     | 1     | 0     | 1     | 1     | 1   | 0  |
| 1     | 1     | 1     | 0     | 1     | 1   | 0  |
| 1     | 1     | 1     | 1     | 1     | 1   | -1 |

Momeni *et al.* [17] (Design 2) has an error rate of 25% with ED of ±1, but the EDs introduced are (-1, -1, -1, +1). A 25% error rate is obtained by the design proposed by Ha and Lee [19] with a constant *ED* of -1. Alouani *et al.* [21] have proposed 3 designs with 25% error rate. Design 1 has $ED = \pm 2$, and Designs 2 and 3 have $ED = \pm 1$. However, all the architectures are for approximate full adders using genetic algorithm rather than compressors. The 4 : 2 compressor proposed by Chang *et al.* [23] has 25% error rate but all the errors introduced for optimisation have ED = -1.

In order to address these issues, this paper proposes two novel 4 : 2 compressor architectures with 25% error rate to optimise the approximate multiplier in terms of area, delay and power dissipation. The proposed compressor design ensures two positive and two negative errors with |ED| = 1. The two architectures are listed below.

- A) A novel high speed area-efficient 4 : 2 compressor with reduced MED and MRED.
- B) A modified dual-stage compressor design for multipliers, with cascaded compressors.

#### A. PROPOSED HIGH SPEED AREA-EFFICIENT APPROXIMATE 4 : 2 COMPRESSOR

The proposed high speed area-efficient 4 : 2 approximate compressor is shown in Figure 6. The compressor inputs are  $A_1$ ,  $A_2$ ,  $A_3$  and  $A_4$ , outputs are *CARRY* and *SUM*.

| Outputs | Proposed 4 : 2 compressor design | Proposed modified Dual-stage cascaded 4 : 2 compressor |
|---------|----------------------------------|--------------------------------------------------------|
| **Stage 1** | | |
| $Sum_0$ | $\overline{(K_3 \oplus K_4)} (K_2 + K_1) + (K_3 \oplus K_4) (K_2 K_1)$ | $\overline{(K_3 \oplus K_4)} \cdot \overline{(K_2 + K_1)} + (K_3 \oplus K_4) \overline{(K_2 K_1)}$ |
| $Carry_0$ | $K_3 + K_4$ | $\overline{K_3 + K_4}$ |
| $Sum_1$ | $\overline{(K_7 \oplus K_8)} (K_6 + K_5) + (K_7 \oplus K_8) (K_6 K_5)$ | $\overline{(K_7 \oplus K_8)} \cdot \overline{(K_6 + K_5)} + (K_7 \oplus K_8) \overline{(K_6 K_5)}$ |
| $Carry_1$ | $K_7 + K_8$ | $\overline{K_7 + K_8}$ |
| $Sum_2$ | $\overline{(K_{11} \oplus K_{12})} (K_{10} + K_9) + (K_{11} \oplus K_{12}) (K_{10} K_9)$ | $\overline{(K_{11} \oplus K_{12})} \cdot \overline{(K_{10} + K_9)} + (K_{11} \oplus K_{12}) \overline{(K_{10} K_9)}$ |
| $Carry_2$ | $K_{11} + K_{12}$ | $\overline{K_{11} + K_{12}}$ |
| $Sum_3$ | $\overline{(K_{15} \oplus K_{16})} (K_{14} + K_{13}) + (K_{15} \oplus K_{16}) (K_{14} K_{13})$ | $\overline{(K_{15} \oplus K_{16})} \cdot \overline{(K_{14} + K_{13})} + (K_{15} \oplus K_{16}) \overline{(K_{14} K_{13})}$ |
| $Carry_3$ | $K_{15} + K_{16}$ | $\overline{K_{15} + K_{16}}$ |
| **Stage 2** | | |
| $Carry_0$ | $\overline{(K_3 \oplus K_4)} (K_2 + K_1) + (K_3 \oplus K_4) (K_2 K_1) + \overline{(K_7 \oplus K_8)} (K_6 + K_5) + (K_7 \oplus K_8) (K_6 K_5)$ | $\overline{\overline{(K_3 \oplus K_4)} \cdot \overline{(K_2 + K_1)} + (K_3 \oplus K_4) \overline{(K_2 K_1)}} + \overline{\overline{(K_7 \oplus K_8)} \cdot \overline{(K_6 + K_5)} + (K_7 \oplus K_8) \overline{(K_6 K_5)}}$ |

**TABLE 3.** Logical expressions for outputs at Stage 1 and Stage 2 of the proposed 4 : 2 compressor and the proposed modified Dual-stage 4 : 2 compressor in multiplier implementation.

A multiplexer (MUX) based design approach is used to generate *SUM*.

Output of *XOR* gate acts as the select line for the *MUX*. When select line goes high,  $(A_3A_4)$  is selected and when it goes low,  $(A_3 + A_4)$  is selected. By introducing an error with error distance 1 in the truth table of the exact compressor, the proposed 4 : 2 compressor is able to reduce carry generation logic to an *OR* gate. The logical expressions for realisation of *SUM* and *CARRY* are given below.

$$SUM = (A_1 \oplus A_2)A_3A_4 + \overline{(A_1 \oplus A_2)}(A_3 + A_4)$$
 (19)

$$CARRY = A_1 + A_2 \tag{20}$$

From the truth table of the proposed 4 : 2 compressor (Table 2), it can be observed that the error has been introduced for the input values $\{0011\}, \{0100\}, \{1000\}$ and $\{1111\}$, so as to ensure that equal positive and negative deviation with |ED| = 1 (minimum) is obtained.
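Equations (19) and (20) can be evaluated over all 16 input combinations to reproduce the ED column of Table 2. The Python sketch below is a behavioural illustration only, with the hypothetical names `proposed_compressor` and `error_table`; the error is computed as approximate minus exact, which is the sign convention that matches Table 2.

```python
from itertools import product

def proposed_compressor(a1, a2, a3, a4):
    """Behavioural model of equations (19) and (20)."""
    sel = a1 ^ a2                           # XOR output drives the MUX select
    s = (a3 & a4) if sel else (a3 | a4)     # equation (19)
    carry = a1 | a2                         # equation (20)
    return carry, s

def error_table():
    """Map each erroneous input (A1, A2, A3, A4) to its signed ED."""
    errors = {}
    for bits in product((0, 1), repeat=4):
        carry, s = proposed_compressor(*bits)
        ed = (2 * carry + s) - sum(bits)    # approximate minus exact
        if ed != 0:
            errors[bits] = ed
    return errors
```

The four entries returned by `error_table()` sum to zero, which is the balance between positive and negative deviations that the design targets.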

#### **B. PROPOSED MODIFIED DUAL-STAGE APPROXIMATE** 4 : 2 COMPRESSOR

As a measure to optimise the hardware utilisation of the proposed design, this paper proposes an alternate architecture for multipliers with more than three stages of cascaded compressors. In the high speed area-efficient compressor architecture (as shown in Figure 6), apart from the MUX, one *XOR*, one *AND* and two *OR* gates are required. *OR* and *AND* gates each need 6 transistors in CMOS logic implementation. In order to reduce the transistor count, this paper proposes an architecture with *NAND* and *NOR* gates as shown in Figure 7. Even though the *SUM* and *CARRY* generated by the modified architecture are not the same as those of the proposed 4 : 2 compressor architecture, with cascading of the compressor in multiples of 2, the error is nullified. This is explained with the help of Figure 8. Figure 8(a) has a two level cascading of proposed high speed area-efficient 4 : 2 compressors.



**FIGURE 7.** Basic building block for proposed modified Dual-stage 4 : 2 compressor.

Figure 8(b) has a two level cascading of modified dual-stage 4 : 2 compressors. The outputs at Stage 1 differ for the two architectures, but the occurrence of negation in the order of an integral multiple of two (in Stage 1 and Stage 2) in the modified dual-stage 4 : 2 compressor will ensure that the outputs at Stage 2 are the same. The modified dual-stage 4 : 2 compressor reduces area, delay and power dissipation compared to the proposed high speed area-efficient 4 : 2 compressor and other compressors in the literature due to the reduction in transistor count. Table 3 analyses the output of the two proposed architectures at different stages in a 2 stage cascaded structure.

$Carry_0$ at Stage 2 output is minimised and is given in equation (21).

$$\overline{\overline{(K_3 \oplus K_4)} \cdot \overline{(K_2 + K_1)} + (K_3 \oplus K_4)\overline{(K_2K_1)}} = \overline{(K_3 \oplus K_4)} \cdot (K_2 + K_1) + (K_3 \oplus K_4) \cdot (K_2K_1) + (K_2 + K_1)(K_2K_1) \tag{21}$$

Here, it is seen that $(K_2 + K_1)(K_2K_1)$ is not an essential prime implicant. Therefore, output expressions of Stage 2 for both the proposed architectures are the same. Similarly, $Sum_0$ generated at Stage 1 differs, but the resultant logical expression at Stage 2 output remains the same.

FIGURE 8. Cascaded architecture for multiplication (a) Proposed area-efficient 4 : 2 compressor and (b) Proposed modified dual-stage architecture for 4 : 2 compressor.

TABLE 4. Accuracy efficiency metrics comparison of proposed and existing 4 : 2 compressors.

| Compressor Design             | Error Rate (%) | Maximum ED |
|-------------------------------|----------------|------------|
| Reddy et al. [25]             | 12.5           | $\pm 1$    |
| Momeni et al. [17] (Design 1) | 37.5           | $\pm 1$    |
| Momeni et al. [17] (Design 2) | 25             | $\pm 1$    |
| Akbari et al. [18] (Design 1) | 62.5           | $\pm 2$    |
| Akbari et al. [18] (Design 2) | 62.5           | $\pm 2$    |
| Akbari et al. [18] (Design 3) | 50             | -2         |
| Akbari et al. [18] (Design 4) | 31.25          | +2         |
| Ha and Lee [19]               | 25             | -1         |
| Gorantla and Deepa [24]       | 18.75          | -2         |
| Esposito et al. [22]          | 56.25          | $\pm 1$    |
| Chang et al. [23]             | 25             | -1         |
| Proposed                      | 25             | $\pm 1$    |

The proposed modified dual-stage 4 : 2 compressor has the same output in cascaded architectures with 18.23% and 14.84% reduction in area and power respectively when compared to the proposed high speed area-efficient 4 : 2 compressor. The accuracy remains unaffected with the modifications introduced.
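The Boolean identity in equation (21) and the complement relation between the Stage 1 columns of Table 3 can be confirmed by brute force. The Python sketch below models only the Stage 1 expressions for $Sum_0$ and $Carry_0$; the function names are illustrative assumptions and the code is not a netlist of either architecture.

```python
from itertools import product

def proposed_stage(k1, k2, k3, k4):
    """Stage 1 (Sum_0, Carry_0) of the proposed compressor (Table 3)."""
    x = k3 ^ k4
    s = ((1 - x) & (k2 | k1)) | (x & (k2 & k1))
    return s, k3 | k4

def modified_stage(k1, k2, k3, k4):
    """Stage 1 (Sum_0, Carry_0) of the modified design with NOR/NAND gates."""
    x = k3 ^ k4
    nor = 1 - (k2 | k1)
    nand = 1 - (k2 & k1)
    s = ((1 - x) & nor) | (x & nand)
    return s, 1 - (k3 | k4)

def check_dual_stage():
    """Return True if the modified outputs are complements of the proposed
    ones and equation (21) holds for all 16 input combinations."""
    for k1, k2, k3, k4 in product((0, 1), repeat=4):
        s, c = proposed_stage(k1, k2, k3, k4)
        s_m, c_m = modified_stage(k1, k2, k3, k4)
        assert s_m == 1 - s and c_m == 1 - c
        x = k3 ^ k4
        # Right-hand side of equation (21), including the redundant term.
        rhs = (((1 - x) & (k2 | k1)) | (x & (k2 & k1))
               | ((k2 | k1) & (k2 & k1)))
        assert (1 - s_m) == rhs == s
    return True
```

Since every Stage 1 output of the modified design is the complement of the corresponding proposed output, a second inversion in Stage 2 restores the original expression, which is the behaviour the text describes.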

#### **V. SIMULATION RESULTS**

This section presents the analyses of the proposed 4 : 2 compressor architectures and the $8 \times 8$ and $16 \times 16$ Dadda multipliers designed with the proposed compressors. The analysis is carried out to determine the efficiency of the proposed designs, which is projected in terms of accuracy metrics and implementation efficiency metrics.

# A. EFFICIENCY MEASURE OF PROPOSED 4 : 2 COMPRESSOR DESIGN

The proposed high speed area-efficient 4 : 2 compressor design is compared with 11 architectures in the literature and with the accurate 4 : 2 compressor. The designs are implemented in 45 nm CMOS technology using Cadence. The simulations of the compressor designs are carried out with a supply voltage of 1 V at 1 GHz operating frequency.

Accuracy metrics of the proposed high speed area-efficient 4 : 2 compressor are presented in Table 4. The observations show that Akbari *et al.* [18] (Design 1) and (Design 2) have the highest error rate of 62.5% with an ED of  $\pm 2$ . Reddy *et al.* [25] have the minimum error rate of 12.5% with an ED of  $\pm 1$ . The proposed 4 : 2 compressor achieves a 25% error rate with an ED of  $\pm 1$ . This increase of 12.5 percentage points in error rate in the proposed design when compared to [25] is compensated by the considerable reduction in area, delay and power.
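The error rate and ED columns of Table 4 can be reproduced for any candidate design by enumerating all input combinations. The following is a minimal sketch under stated assumptions: a four-input compressor with 16 input combinations (consistent with the error rates in Table 4 being multiples of 6.25%), and ED taken as the approximate value minus the exact value. The function `saturating_compressor` is a hypothetical design used only for illustration; it is not one of the compressors listed in Table 4.

```python
from itertools import product


def saturating_compressor(x1, x2, x3, x4):
    """Hypothetical approximate 4:2 compressor, returns (carry, sum).

    It encodes min(x1 + x2 + x3 + x4, 3) on two bits, so only the
    all-ones input is wrong. This is NOT a design from Table 4.
    """
    value = min(x1 + x2 + x3 + x4, 3)
    return value >> 1, value & 1


def error_profile(compressor):
    """Return (error rate in %, sorted list of distinct non-zero EDs)."""
    errors = []
    for bits in product((0, 1), repeat=4):
        carry, s = compressor(*bits)
        approx = 2 * carry + s         # weighted value of the two outputs
        exact = sum(bits)              # exact number of ones, 0..4
        errors.append(approx - exact)  # signed error distance
    wrong = [e for e in errors if e != 0]
    return 100.0 * len(wrong) / len(errors), sorted(set(wrong))


rate, distances = error_profile(saturating_compressor)
print(rate, distances)  # 6.25 [-1]
```

Passing a function that models one of the compared designs to `error_profile` would give the corresponding row of Table 4, provided the model follows the same input and output conventions.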

The implementation efficiency of the proposed 4 : 2 compressor is presented in Table 5. This table compares the proposed design with existing state-of-the-art 4 : 2 compressor designs in the literature. From Table 5, it is evident that the proposed design is able to achieve 56.88% reduction in area, 57.24% reduction in power, 73.38% reduction in delay

TABLE 5. Implementation efficiency metrics comparison of proposed and existing 4 : 2 compressors.

| Compressor Design            | Area $(\mu m^2)$ | Power $(\mu W)$ | Delay (ps) | PDP $(ps \times \mu W)$ |
|------------------------------|------------------|-----------------|------------|-------------------------|
| Accurate                     | 42.6        | 5.80      | 95.7  | 555.06              |
| Reddy et al. [25]            | 33.6        | 3.54      | 48.8  | 172.75              |
| Momeni et al. [17](Design 1) | 34          | 3.45      | 47.6  | 164.22              |
| Momeni et al. [17](Design 2) | 31          | 3.39      | 40.6  | 137.63              |
| Akbari et al. [18](Design 1) | 4           | 0.86      | 12.2  | 10.49               |
| Akbari et al. [18](Design 2) | 7.6         | 1.08      | 12.2  | 13.18               |
| Akbari et al. [18](Design 3) | 24.2        | 2.63      | 41.5  | 109.15              |
| Akbari et al. [18](Design 4) | 30.5        | 3.41      | 40.3  | 137.42              |
| Ha and Lee [19]              | 35.4        | 3.62      | 47.1  | 170.5               |
| Gorantla and Deepa [24]      | 37.6        | 3.97      | 61.84 | 245.35              |
| Esposito et al. [22]         | 13.66       | 1.83      | 17.31 | 31.66               |
| Chang et al. [23]            | 29.42       | 3.38      | 44.3  | 149.7               |
| Proposed                     | 18.37       | 2.48      | 25.48 | 63.19               |

and 88.62% reduction in PDP when compared to the accurate 4 : 2 compressor. The proposed design achieves the best area, delay, power and PDP among the existing designs with error rate less than 62.5%. Among the approximate compressors, the maximum area, power and delay are those of Gorantla and Deepa [24]. The least area, delay, power and PDP are reported by Akbari *et al.* [18] (Design 1) and (Design 2). However, these designs have an ER of 62.5% and an ED of  $\pm 2$ . The reduction in implementation efficiency metrics is achieved at the cost of degradation in accuracy of the compressor. The efficiency achieved in implementation metrics is projected as a graph and is shown in Figure 9.
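The reduction figures quoted above follow directly from the Accurate and Proposed rows of Table 5. As a quick check, the short sketch below recomputes them; the helper name `reduction_percent` is chosen here for illustration only.

```python
# Values copied from Table 5: (area um^2, power uW, delay ps, PDP ps*uW).
ACCURATE = (42.6, 5.80, 95.7, 555.06)
PROPOSED = (18.37, 2.48, 25.48, 63.19)
METRICS = ("area", "power", "delay", "pdp")


def reduction_percent(reference, value):
    """Percentage by which `value` is lower than `reference`."""
    return 100.0 * (reference - value) / reference


reductions = {
    name: round(reduction_percent(ref, val), 2)
    for name, ref, val in zip(METRICS, ACCURATE, PROPOSED)
}
print(reductions)
```

The same helper applied to any other row of Table 5 gives that design's reduction relative to the accurate compressor.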

# B. EFFICIENCY MEASURE OF 8 × 8 DADDA MULTIPLIER USING PROPOSED 4 : 2 COMPRESSOR DESIGN

 $8 \times 8$  Dadda multipliers are implemented using the proposed and existing compressor designs. All the multiplier designs compared in this analysis are implemented without truncation in the partial products. In the  $8 \times 8$  Dadda multiplier, Level 1 to Level 8 (as shown in Figure 3) employ approximate compressors and Level 9 to Level 15 employ exact compressors. The multipliers are implemented using 45 nm CMOS technology and the simulations are carried out with a supply voltage of 1 V at 1 GHz operating frequency. Table 6 shows the implementation and accuracy efficiency metrics for the  $8 \times 8$  Dadda multiplier using the proposed high speed area-efficient 4 : 2 compressor and existing compressors. An exhaustive analysis was carried out to measure the efficiency of the implemented  $8 \times 8$  multipliers. When compared to an exact compressor based multiplier, the multiplier with the proposed design is able to achieve a reduction in area, delay, power and PDP by 30.43%, 43.55%, 38.09% and 65.06% respectively. The proposed 4 : 2 compressor based multiplier has the best area, delay and power reduction among all the architectures except Akbari *et al.* [18] (Design 1) and (Design 2) and Esposito *et al.* [22]. Akbari *et al.* [18] are able to



FIGURE 9. Implementation metrics comparison of proposed and existing 4 : 2 compressor designs in scales of area reduction ratio, power reduction ratio and delay reduction ratio.

TABLE 6. Comparison of implementation and accuracy efficiency metrics for 8 × 8 Dadda multipliers implemented with the proposed and existing 4 : 2 compressors.

| Compressor Design Used       | Implementation Metrics |           |       |                     | Accuracy Metrics |        |        |        |
|------------------------------|------------------------|-----------|-------|---------------------|------------------|--------|--------|--------|
|                              | Area $(\mu m^2)$       | Power $(\mu W)$ | Delay (ns) | PDP $(ns \times \mu W)$ | AOC        | MED    | MRED   | NED    |
| Accurate                     | 3979                   | 101       | 0.63  | 63.63               | -                | -      | -      | -      |
| Reddy et al. [25]            | 3312                   | 65.74     | 0.40  | 26.29               | 19527            | 502.9  | 0.0728 | 0.0077 |
| Momeni et al. [17](Design 1) | 3261                   | 64.85     | 0.41  | 26.59               | 103              | 3873.8 | 4.5483 | 0.0596 |
| Momeni et al. [17](Design 2) | 3092                   | 59.69     | 0.37  | 22.08               | 503              | 3508.8 | 4.2843 | 0.0540 |
| Akbari et al. [18](Design 1) | 682.2                  | 33.53     | 0.29  | 9.72                | 690              | 4376   | 0.4424 | 0.0673 |
| Akbari et al. [18](Design 2) | 713.8                  | 34.51     | 0.28  | 9.66                | 624              | 2787.8 | 0.3135 | 0.0429 |
| Akbari et al. [18](Design 3) | 2618                   | 53.92     | 0.34  | 18.33               | 1885             | 3095   | 0.3907 | 0.0476 |
| Akbari et al. [18](Design 4) | 3070                   | 61.47     | 0.42  | 25.82               | 11052            | 1376.6 | 0.0854 | 0.0212 |
| Ha and Lee [19]              | 3624                   | 79.24     | 0.50  | 39.62               | 15863            | 25.212 | 0.0326 | 0.0023 |
| Gorantla and Deepa [24]      | 3715                   | 80.19     | 0.49  | 39.29               | 8562             | 6476   | 1.2    | 0.0537 |
| Esposito et al. [22]         | 834                    | 39.22     | 0.31  | 12.16               | 2313             | 2888   | 0.3002 | 0.0444 |
| Chang et al. [23]            | 3683                   | 80.60     | 0.47  | 37.88               | 3130             | 1245   | 0.8107 | 0.0386 |
| Proposed                     | 2468                   | 51.01     | 0.33  | 16.83               | 7782             | 573.4  | 0.0487 | 0.0027 |

achieve this by reducing the accuracy of the system to an error rate of 62.5%. The proposed design achieves accuracy metrics comparable with those of the other designs with an error rate of 25%, with the exception of Ha and Lee [19], who have incorporated dedicated error recovery modules to improve MED at the cost of increased area and power.

In order to examine the accuracy and energy efficiency of the multipliers designed, MRED and PDP are considered. Figure 10 shows the  $PDP \times MRED$  for all the  $8 \times 8$  multipliers. Momeni *et al.* [17] (Design 1) and (Design 2) have the highest *PDP* × *MRED* as MRED is the highest for these designs. The Chang *et al.* [23] based multiplier has a moderate *PDP* × *MRED*. Among the designs with low values, Ha and Lee [19] have 1.291 and Reddy *et al.* [25] have 1.91. It is observed that the proposed design has the least *PDP* × *MRED*, and hence it can be concluded that the proposed design is able to maintain a balance between the accuracy and energy efficiency better than the existing designs.
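The figure of merit discussed above is simply the product of the PDP and MRED columns of Table 6. The sketch below recomputes it from the tabulated values; the dictionary layout and variable names are illustrative choices, not part of the original analysis.

```python
# (PDP in ns*uW, MRED) pairs copied from Table 6.
TABLE_6 = {
    "Reddy et al. [25]": (26.29, 0.0728),
    "Momeni et al. [17] (Design 1)": (26.59, 4.5483),
    "Momeni et al. [17] (Design 2)": (22.08, 4.2843),
    "Akbari et al. [18] (Design 1)": (9.72, 0.4424),
    "Akbari et al. [18] (Design 2)": (9.66, 0.3135),
    "Akbari et al. [18] (Design 3)": (18.33, 0.3907),
    "Akbari et al. [18] (Design 4)": (25.82, 0.0854),
    "Ha and Lee [19]": (39.62, 0.0326),
    "Gorantla and Deepa [24]": (39.29, 1.2),
    "Esposito et al. [22]": (12.16, 0.3002),
    "Chang et al. [23]": (37.88, 0.8107),
    "Proposed": (16.83, 0.0487),
}

# Figure of merit: lower is better (less energy per operation and less error).
figure_of_merit = {name: pdp * mred for name, (pdp, mred) in TABLE_6.items()}

# Print the designs from best to worst.
for name, fom in sorted(figure_of_merit.items(), key=lambda item: item[1]):
    print(f"{name:32s} {fom:8.3f}")

best = min(figure_of_merit, key=figure_of_merit.get)
```

With these inputs the smallest product belongs to the proposed design (about 0.82), followed by Ha and Lee [19] (about 1.29) and Reddy *et al.* [25] (about 1.91), in agreement with the values quoted in the text.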

| Compressor Design Used       | Implementation Metrics |           |        |                     | Accuracy Metrics      |          |                              |
|------------------------------|------------------------|-----------|--------|---------------------|-----------------------|----------|------------------------------|
|                              | Transistor Count       | Power $(\mu W)$ | Delay (ns) | PDP $(ns \times \mu W)$ | MED                   | MRED     | NED                      |
| Accurate                     | 11523                  | 458.54    | 1.2222 | 560.428             | -                     | -        | -                        |
| Reddy et al. [25]            | 7880                   | 251.84    | 0.8020 | 201.980             | $2.37 \times 10^{5}$  | 243.88   | $5.53 \times 10^{-5}$    |
| Momeni et al. [17](Design 1) | 7932                   | 279.95    | 0.7363 | 206.127             | $1.9 \times 10^{6}$   | 15245.16 | $4.64 \times 10^{-4}$    |
| Momeni et al. [17](Design 2) | 7824                   | 249.27    | 0.6918 | 172.445             | $1.7 \times 10^{6}$   | 14333.63 | $4.05 \times 10^{-4}$    |
| Akbari et al. [18](Design 1) | 5680                   | 172.98    | 0.5844 | 101.090             | $2.2 \times 10^{6}$   | 1482.14  | $5.34 \times 10^{-4}$    |
| Akbari et al. [18](Design 2) | 5790                   | 178.58    | 0.5691 | 101.630             | $1.2 \times 10^{6}$   | 1050.22  | $3.003 \times 10^{-4}$   |
| Akbari et al. [18](Design 3) | 6780                   | 238.66    | 0.6674 | 159.282             | $1.4 \times 10^{6}$   | 1308.63  | $3.4748 \times 10^{-4}$  |
| Akbari et al. [18](Design 4) | 7380                   | 245.38    | 0.7632 | 187.274             | $6.8 \times 10^{5}$   | 286.74   | $1.6 \times 10^{-4}$     |
| Ha and Lee [19]              | 10324                  | 297.25    | 1.1217 | 333.425             | $6.8 \times 10^{4}$   | 109.373  | $1.587 \times 10^{-5}$   |
| Gorantla and Deepa [24]      | 9324                   | 309.76    | 1.0153 | 314.499             | $1.77 \times 10^{5}$  | 4027.2   | $4.1348 \times 10^{-4}$  |
| Esposito et al. [22]         | 6918                   | 186.34    | 0.6141 | 114.431             | $1.3 \times 10^{6}$   | 1005.81  | $3.06 \times 10^{-4}$    |
| Chang et al. [23]            | 8352                   | 304.65    | 0.9781 | 297.978             | $1.2 \times 10^{6}$   | 2720.39  | $3.01 \times 10^{-4}$    |
| Proposed 4 : 2 Compressors   | 7078                   | 218.8     | 0.6246 | 136.662             | $8.8 \times 10^{4}$   | 163.54   | $2.05 \times 10^{-5}$    |

TABLE 7. Comparison of implementation and accuracy efficiency metrics for 16 × 16 Dadda multipliers implemented with the proposed and existing 4 : 2 compressors.



**FIGURE 10.** Figure of merit for 8 × 8 multiplier architectures.

# C. EFFICIENCY MEASURE OF 16 × 16 DADDA MULTIPLIER USING PROPOSED 4 : 2 COMPRESSORS

 $16 \times 16$  Dadda multipliers are implemented using the proposed and existing compressor designs. All the multiplier designs compared in this analysis are implemented without truncation in the partial products. In the  $16 \times 16$  Dadda multiplier, Level 1 to Level 17 employ approximate compressors and Level 18 to Level 32 employ exact compressors. The multipliers are implemented using 45 nm CMOS technology and the simulations are carried out with a supply voltage of 1 V at 700 MHz operating frequency. Multiplication operation for  $16 \times 16$  inputs using the proposed compressors is shown in Figure 11. The proposed modified dual-stage compressors are used where there are two stages of cascaded partial products for summation. For all other partial product levels less than 14, proposed high speed area-efficient 4 : 2 compressors, full adders and half adders are used.

Table 7 presents the comparison of implementation efficiency metrics and accuracy metrics of the  $16 \times 16$  multiplier using the proposed modified dual-stage 4 : 2 compressor architecture and the proposed area-efficient 4 : 2 compressor, with multipliers based on the accurate 4 : 2 compressor and on existing approximate 4 : 2 compressors. The simulations are carried out for one million random input combinations. The results show that the multiplier based on the proposed compressor designs is able to reduce the transistor count, power and delay by 38.58%, 52.28% and 48.89% respectively, when compared to the multiplier with accurate compressors. Ha and Lee [19] and Chang *et al.* [23] based multipliers have the highest transistor count, power, and delay. The error recovery circuit which improves the accuracy in Ha and Lee [19] adds additional circuitry, thereby adding overhead in all the implementation efficiency metrics. Akbari *et al.* [18] (Design 1), (Design 2), (Design 3) and Esposito *et al.* [22] are able to reduce the power, delay and transistor count with a considerable reduction in accuracy. The ER in all these designs is 50% or above. Among all the designs with ER less than 50%, the proposed design is able to achieve substantial



FIGURE 11. Proposed 4 : 2 compressors used in 16 × 16 multiplier.

reduction in implementation efficiency metrics and has the least transistor count, delay, and power.

The analysis of various multiplier architectures is presented in scales of accuracy metrics: MED, MRED and NED. Momeni *et al.* [17] (Design 1) and Akbari *et al.* [18] (Design 1) have the highest MED and NED. The lowest MED and NED are for the multipliers using Ha and Lee [19] and the proposed approximate compressors. The highest MRED is for multiplier designs using Momeni *et al.* [17] (Design 1) and (Design 2). The lowest MRED is for Ha and Lee [19] and the proposed compressor based multipliers. The Ha and Lee [19] based design is able to achieve low MED, low MRED and low NED due to the error recovery circuit, which adds to the area, power and delay of the circuit. The proposed compressor is able to achieve results comparable to those of Ha and Lee [19], with reduced transistor count, power and delay.
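For readers who wish to reproduce such accuracy figures for their own multiplier models, the sketch below computes AOC, MED, MRED and NED by exhaustive enumeration. It assumes the conventional definitions: MED is the mean absolute error distance, MRED is the mean of the error distance divided by the exact product, NED is MED normalised by the maximum exact product $(2^n - 1)^2$, and AOC counts the input pairs with an exact result. The normalisation of NED is an assumption that is consistent with most rows of Table 6. The function `truncating_multiplier` is a hypothetical stand-in, not any design evaluated here, and the exhaustive loop is practical for 8-bit operands only (the 16 × 16 results above use random sampling instead).

```python
def accuracy_metrics(multiply, bits=8):
    """Exhaustively compare `multiply(a, b)` against the exact product."""
    limit = 1 << bits
    max_product = (limit - 1) ** 2
    total_ed = 0
    total_red = 0.0
    accurate = 0
    for a in range(limit):
        for b in range(limit):
            exact = a * b
            ed = abs(multiply(a, b) - exact)  # error distance for this input pair
            total_ed += ed
            if ed == 0:
                accurate += 1
            elif exact != 0:
                # Relative error is undefined for a zero product, so it is skipped.
                total_red += ed / exact
    count = limit * limit
    med = total_ed / count
    return {
        "AOC": accurate,
        "MED": med,
        "MRED": total_red / count,
        "NED": med / max_product,
    }


def truncating_multiplier(a, b, dropped=4):
    """Hypothetical approximate multiplier: clears the lowest product bits."""
    return ((a * b) >> dropped) << dropped


print(accuracy_metrics(truncating_multiplier))
```

Replacing `truncating_multiplier` with a bit-accurate model of a compressor-based Dadda multiplier would yield metrics in the form reported in Table 6 and Table 7, subject to the definitional assumptions above.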

The accuracy and energy efficiency of the multiplier designed is analysed by considering the MRED and PDP. Figure 12 shows the  $PDP \times MRED$  for all the




**FIGURE 12.** Figure of merit for 16 × 16 multiplier architectures.

| 1 | 1 | 1  | 1 | 1 |
|---|---|----|---|---|
| 1 | 4 | 4  | 4 | 1 |
| 1 | 4 | 12 | 4 | 1 |
| 1 | 4 | 4  | 4 | 1 |
| 1 | 1 | 1  | 1 | 1 |

FIGURE 13. Image kernel for smoothing.

16 × 16 multipliers. Momeni *et al.* [17] (Design 1) and (Design 2) have the highest *PDP* × *MRED* as MRED is the highest for these designs. Gorantla and Deepa [24] and Chang *et al.* [23] based multipliers have moderate *PDP* × *MRED*. Among the designs with low values of *PDP* × *MRED*, Ha and Lee [19] have 0.36 and Reddy *et al.* [25] have 0.49. It is observed that the proposed design has the least *PDP* × *MRED* of 0.22. Thus, it can be concluded that the proposed design is able to maintain a balance between the accuracy and energy efficiency better than the existing designs.

# D. IMAGE PROCESSING APPLICATION WITH MULTIPLIER IMPLEMENTED USING PROPOSED COMPRESSOR

The approximate multiplier designed in this paper using the proposed 4 : 2 compressors is used in two image processing applications, namely multiplication and smoothing of images.

TABLE 8. Efficiency of the image processing applications measured in terms of MSSIMs.

| Compressor Design Used        | Multiplication | Smoothing |
|-------------------------------|----------------|-----------|
| Reddy et al. [25]             | 0.92           | 0.74      |
| Momeni et al. [17] (Design 1) | 0.65           | 0.23      |
| Momeni et al. [17] (Design 2) | 0.68           | 0.23      |
| Akbari et al. [18] (Design 1) | 0.60           | 0.27      |
| Akbari et al. [18] (Design 2) | 0.65           | 0.28      |
| Akbari et al. [18] (Design 3) | 0.64           | 0.57      |
| Akbari et al. [18] (Design 4) | 0.81           | 0.64      |
| Ha and Lee [19]               | 0.93           | 0.72      |
| Gorantla and Deepa [24]       | 0.86           | 0.69      |
| Esposito et al. [22]          | 0.63           | 0.29      |
| Chang et al. [23]             | 0.84           | 0.73      |
| Proposed                      | 0.90           | 0.74      |

The image kernel for smoothing is shown in Figure 13. The smoothing operation is carried out by convolution of the image kernel with  $5 \times 5$  image sub-blocks. For the smoothing application, twelve standard images are considered: Pirate, Tiffany, Moon, Jet, Room, Cameraman, Lena, Elaine, Mandrill, Bridge, Lake and Peppers [28].
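The smoothing operation can be sketched as follows, with the multiplier passed in as a function so that an approximate 8 × 8 multiplier model can be substituted for the exact product. Only the kernel values are taken from Figure 13; dividing the accumulated sum by the kernel total and leaving the two-pixel border unchanged are assumptions made here for illustration, and the names `smooth` and `multiply` are hypothetical.

```python
import operator

# Kernel values from Figure 13.
KERNEL = [
    [1, 1, 1, 1, 1],
    [1, 4, 4, 4, 1],
    [1, 4, 12, 4, 1],
    [1, 4, 4, 4, 1],
    [1, 1, 1, 1, 1],
]
KERNEL_SUM = sum(map(sum, KERNEL))  # normalisation constant (assumed)


def smooth(image, multiply=operator.mul):
    """Smooth the interior pixels of a 2-D list of 8-bit grey levels.

    `multiply(pixel, weight)` performs each pixel-by-weight product; pass a
    model of an approximate multiplier here. Border pixels are copied as-is.
    """
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(2, rows - 2):
        for c in range(2, cols - 2):
            acc = 0
            for i in range(5):
                for j in range(5):
                    acc += multiply(image[r + i - 2][c + j - 2], KERNEL[i][j])
            out[r][c] = acc // KERNEL_SUM
    return out


flat = [[100] * 7 for _ in range(7)]
print(smooth(flat) == flat)  # True: a uniform image is unchanged
```

Because the kernel is symmetric, sliding it over the image without flipping gives the same result as a true convolution.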

For image multiplication, six image pairs are considered ((Lena, Tiffany), (Elaine, Pirate), (Bridge, Room), (Mandrill, Peppers), (Moon, Cameraman) and (Jet, Lake)). The image pairs are multiplied and the resultant images are evaluated to measure their quality. The multiplied images for the (Moon, Cameraman) pair using all multiplier architectures are shown in Figure 14.

In both the applications,  $8 \times 8$  multipliers implemented using proposed and existing compressors are used. In the multipliers, the approximate compressors are employed in the first 8 levels from the LSB (wherever applicable, as levels 1, 2 and 3 employ half and full adders). The simulations are carried out using MATLAB 2019.2. The efficiency of the image processing applications is measured in terms of mean structural similarity (MSSIM) [27]. MSSIM predicts the quality of an image by calculating the deviation in pixel values with respect to a distortion-free reference image. Table 8 presents the average MSSIM for the images processed with the approximate multiplier implemented using compressors. The proposed 4 : 2 compressor design has comparable accuracy with Reddy *et al.* [25] and Ha and Lee [19]. In addition to this, the proposed compressor has better accuracy compared to all other compressor architectures under consideration.

Use of approximate compressors in multipliers can reduce the accuracy if compressors are employed in the




FIGURE 14. Multiplication of Cameraman and Moon images using 8 × 8 multipliers with proposed and existing compressor architectures. Panels (k) to (o): Ha and Lee [19], Gorantla and Deepa [24], Esposito et al. [22], Chang et al. [23], and Proposed.



FIGURE 15. Effect of approximate compressor levels on MSSIM in an 8 × 8 multiplier.

MSB levels. To understand the effect, the levels (as shown in Figure 3) in which the approximate compressors are employed in an  $8 \times 8$  multiplier are varied, and the result is shown in Figure 15. Approximate compressors are employed in the least significant levels, LEVEL 1 to LEVEL 11. LEVEL 12 to LEVEL 15 employ exact compressors to maintain accuracy in an acceptable range.

In Figure 15, 'n' represents the number of levels with approximate compressors. In this study, 'n' is varied from 7 to 11 and the changes in accuracy are examined by measuring the MSSIM. From Figure 15, it can be observed that the proposed architecture has comparable results when compared to other state-of-the-art compressor designs and is the best among the compressors with an error rate of 25%.

# **VI. CONCLUSION**

This paper presents two novel approximate 4 : 2 compressor architectures. Firstly, a high speed area-efficient compressor architecture is proposed, which achieves a considerable reduction in area, delay and power when compared to other state-of-the-art compressor designs. The proposed design has comparable accuracy with 25% error rate and equal positive and negative absolute error deviation of 1. As a result, the proposed design reduces MED and MRED considerably without reducing the error rate. In addition to this, the paper also proposes a modified dual-stage compressor architecture, which further optimises the area, delay and power without altering the accuracy metrics. The architectures are designed and implemented at the transistor level using 45 nm technology with a supply voltage of 1 V. The designs are validated using  $8 \times 8$  and  $16 \times 16$  Dadda multipliers in image processing applications such as image multiplication and smoothing.

#### REFERENCES

- [1] S. Ghosh, D. Mohapatra, G. Karakonstantis, and K. Roy, "Voltage scalable high-speed robust hybrid arithmetic units using adaptive clocking," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 18, no. 9, pp. 1301–1309, Sep. 2010.
- [2] D. Baran, M. Aktan, and V. G. Oklobdzija, "Multiplier structures for low power applications in deep-CMOS," in *Proc. IEEE Int. Symp. Circuits Syst. (ISCAS)*, Rio de Janeiro, Brazil, May 2011, pp. 1061–1064.
- [3] S. Mittal, "A survey of techniques for approximate computing," *ACM Comput. Surv.*, vol. 48, no. 4, pp. 1–33, Mar. 2016.
- [4] H. Jiang, C. Liu, L. Liu, F. Lombardi, and J. Han, "A review, classification, and comparative evaluation of approximate arithmetic circuits," *ACM J. Emerg. Technol. Comput. Syst.*, vol. 13, no. 4, p. 60, 2017.
- [5] J. Liang, J. Han, and F. Lombardi, "New metrics for the reliability of approximate and probabilistic adders," *IEEE Trans. Comput.*, vol. 62, no. 9, pp. 1760–1771, Sep. 2013.
- [6] R. Zendegani, M. Kamal, M. Bahadori, A. Afzali-Kusha, and M. Pedram, "RoBA multiplier: A rounding-based approximate multiplier for high-speed yet energy-efficient digital signal processing," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 25, no. 2, pp. 393–401, Feb. 2017.
- [7] H. Jiang, J. Han, F. Qiao, and F. Lombardi, "Approximate Radix-8 booth multipliers for low-power and high-performance operation," *IEEE Trans. Comput.*, vol. 65, no. 8, pp. 2638–2644, Aug. 2016.
- [8] S. Hashemi, R. I. Bahar, and S. Reda, "DRUM: A dynamic range unbiased multiplier for approximate applications," in *Proc. IEEE/ACM Int. Conf. Computer-Aided Des. (ICCAD)*, Austin, TX, USA, Nov. 2015, pp. 418–425.
- [9] S. Narayanamoorthy, H. A. Moghaddam, Z. Liu, T. Park, and N. S. Kim, "Energy-efficient approximate multiplication for digital signal processing and classification applications," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 23, no. 6, pp. 1180–1184, Jun. 2015.
- [10] G. Zervakis, S. Xydis, K. Tsoumanis, D. Soudris, and K. Pekmestzi, "Hybrid approximate multiplier architectures for improved power-accuracy trade-offs," in *Proc. IEEE/ACM Int. Symp. Low Power Electron. Des. (ISLPED)*, Rome, Italy, Jul. 2015, pp. 79–84.

- [11] B. Shao and P. Li, "Array-based approximate arithmetic computing: A general model and applications to multiplier and squarer design," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 62, no. 4, pp. 1081–1090, Apr. 2015.
- [12] C.-H. Chang, J. Gu, and M. Zhang, "Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 51, no. 10, pp. 1985–1997, Oct. 2004.
- [13] A. Saha, R. Pal, A. G. Naik, and D. Pal, "Novel CMOS multi-bit counter for speed-power optimization in multiplier design," *AEU-Int. J. Electron. Commun.*, vol. 95, pp. 189–198, Oct. 2018.
- [14] S.-F. Hsiao, M.-R. Jiang, and J.-S. Yeh, "Design of high-speed low-power 3-2 counter and 4-2 compressor for fast multipliers," *Electron. Lett.*, vol. 34, no. 4, p. 341, Feb. 1998.
- [15] Z. Wang, G. A. Jullien, and W. C. Miller, "A new design technique for column compression multipliers," *IEEE Trans. Comput.*, vol. 44, no. 8, pp. 962–970, Aug. 1995.
- [16] J. Gu and C.-H. Chang, "Ultra low voltage, low power 4-2 compressor for high speed multiplications," in *Proc. IEEE Int. Symp. Circuits Syst. (ISCAS)*, Bangkok, Thailand, 2003, pp. 321–324.
- [17] A. Momeni, J. Han, P. Montuschi, and F. Lombardi, "Design and analysis of approximate compressors for multiplication," *IEEE Trans. Comput.*, vol. 64, no. 4, pp. 984–994, Apr. 2015.
- [18] O. Akbari, M. Kamal, A. Afzali-Kusha, and M. Pedram, "Dual-quality 4:2 compressors for utilizing in dynamic accuracy configurable multipliers," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 25, no. 4, pp. 1352–1361, Apr. 2017.
- [19] M. Ha and S. Lee, "Multipliers with approximate 4–2 compressors and error recovery modules," *IEEE Embedded Syst. Lett.*, vol. 10, no. 1, pp. 6–9, Mar. 2018.
- [20] Y. Guo, H. Sun, L. Guo, and S. Kimura, "Low-cost approximate multiplier design using probability-driven inexact compressors," in *Proc. IEEE Asia Pacific Conf. Circuits Syst. (APCCAS)*, Chengdu, China, Oct. 2018, pp. 291–294.
- [21] I. Alouani, H. Ahangari, O. Ozturk, and S. Niar, "A novel heterogeneous approximate multiplier for low power and high performance," *IEEE Embedded Syst. Lett.*, vol. 10, no. 2, pp. 45–48, Jun. 2018.
- [22] D. Esposito, A. G. M. Strollo, E. Napoli, D. De Caro, and N. Petra, "Approximate multipliers based on new approximate compressors," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 65, no. 12, pp. 4169–4182, Dec. 2018.
- [23] Y.-J. Chang, Y.-C. Cheng, Y.-F. Lin, S.-C. Liao, C.-H. Lai, and T.-C. Wu, "Imprecise 4-2 compressor design used in image processing applications," *IET Circuits, Devices Syst.*, vol. 13, no. 6, pp. 848–856, Sep. 2019.
- [24] A. Gorantla and P. Deepa, "Design of approximate compressors for multiplication," *ACM J. Emerg. Technol. Comput. Syst.*, vol. 13, no. 3, pp. 1–17, Apr. 2017.
- [25] K. Manikantta Reddy, M. H. Vasantha, Y. B. N. Kumar, and D. Dwivedi, "Design and analysis of multiplier using approximate 4-2 compressor," *AEU-Int. J. Electron. Commun.*, vol. 107, pp. 89–97, Jul. 2019.
- [26] B. Parhami, *Computer Arithmetic: Algorithms and Hardware Designs*. Oxford, U.K.: Oxford Univ. Press, 2000.
- [27] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," *IEEE Trans. Image Process.*, vol. 13, no. 4, pp. 600–612, Apr. 2004.
- [28] Sipi.usc.edu. (2016). *SIPI Image Database*. [Online]. Available: http://sipi.usc.edu/database/



**PRANOSE J. EDAVOOR** received the bachelor's degree in electronics and communication engineering from the Cochin University of Science and Technology, Kerala, India, in 2012, and the master's degree in VLSI from the National Institute of Technology at Goa, India, in 2016, where he is currently pursuing the Ph.D. degree with the Electrical and Electronics Department. His research interests include digital design, FPGA accelerators, ASIC design, wavelets, and deep neural networks.



**SITHARA RAVEENDRAN** received the bachelor's degree in electronics and communication engineering from the Cochin University of Science and Technology, Kerala, India, in 2008, and the master's degree in VLSI from the National Institute of Technology at Goa, India, in 2016, where she is currently pursuing the Ph.D. degree with the Electronics and Communication Department. She has more than five years of industrial experience in design and verification of digital circuits (SOC level). Her research interests include digital design, reversible logic, FPGA, and ASIC design.



**AMOL D. RAHULKAR** (Member, IEEE) received the B.E. degree in instrumentation from the Shri Guru Gobind Singhji (SGGS) Institute of Engineering and Technology, Nanded, India, in 2000, the M.Tech. degree from the Indian Institute of Technology (IIT), Kharagpur, India, in 2002, and the Ph.D. degree from S. R. T. Marathwada University, Nanded, in 2013. He is currently working as an Associate Professor with the Department of Electrical and Electronics Engineering, National Institute of Technology at Goa, Goa, India. His research interests include design of wavelets and filter banks, biometrics, applications of wavelet transform, and real time processing of signals using FPGA.

...