

Received May 16, 2019, accepted May 28, 2019, date of publication June 17, 2019, date of current version July 16, 2019. Digital Object Identifier 10.1109/ACCESS.2019.2923251

# VLSI Design of a High Throughput Hybrid **Precoding Processor for Wireless MIMO Systems**

**KUAN-TING CHEN<sup>1</sup>, YIN-TSUNG HWANG<sup>10</sup>, (Member, IEEE), AND YEN-CHANG LIAO<sup>2</sup>** <sup>1</sup>Department of Electrical Engineering, National Chung Hsing University, Taichung 40227, Taiwan

<sup>2</sup>Division of Digital IC Design, Ilitek Corporation, Hsinchu 30288, Taiwan

Corresponding author: Yin-Tsung Hwang (hwangyt@dragon.nchu.edu.tw)

This work was supported by the National Science Council, Taiwan, under Project 105-2221-E-005-071-MY3 and Project 107-2218-E-005-013.

ABSTRACT Hybrid precoding, a combination of radio frequency (RF) beamforming and digital precoding, has been investigated intensively these days for millimeter wave (mmWave) communication systems employing large antenna arrays. The key problem is constructing beamforming and precoding matrices for the RF beamformer and the digital baseband, respectively, based on the channel matrix decomposition result. This paper presents a new computing algorithm to achieve the matrix decomposition efficiently without compromising the performance. The algorithm computes beamforming (steering) and precoding matrices in separate phases to alleviate the computing overheads of iterative matrix updates. This measure also creates the computing parallelism to facilitate efficient hardware implementation. A novel computing scheme based on QR decomposition and blockwise inversion techniques is also developed to tackle the most critical least square solution module. This leads to a computing complexity reduction by a factor of 0.3 N when compared with the popular orthogonal matching pursuit (OMP) scheme, where N is the antenna array size. The simulation results indicate the percentage of choosing correct steering vectors is 90%, which is as good as the OMP scheme can achieve. A hardware accelerator design of the proposed scheme is developed by using a TSMC 40 nm CLN40G technology. The design, with a gate count of 419.3 k, can operate up to 333 MHz with a power consumption of 267.1 mW. This suggests a throughput rate of processing 10.4 M channel matrices per second. The core size is merely 0.58mm<sup>2</sup> while the entire die size including I/O pads is 2.26**mm<sup>2</sup>**.

**INDEX TERMS** Multiple-input multiple-output, hybrid precoding, array beamforming, millimeter wave, orthogonal matching pursuit, QR-decomposition, hardware accelerator.

#### I. INTRODUCTION

Millimeter wave (mmWave) communication systems can provide multi-gigabit per second data rates in short distance applications [1]. However, because of high path loss and small antenna element spacing, the transmission distance is limited. Beamforming schemes using large-scale antenna arrays can be used to combat path loss problems and to support spatial multiplexing. Beamforming can be performed in either the baseband or the RF end. The former approach is termed as digital beamforming. It can alleviate the accuracy problems of phase shift and signal combination encountered in RF beamforming but require more RF chains (one per antenna) and ADCs/DACs. The latter approach provides extra antenna gain to alleviate the link budget but calls for programmable phase shifters and RF signal combiners. Besides beamforming, another popular scheme for multiple-input multiple-output (MIMO) systems is spatial division multiplexing (SDM) for achieving high transmission rate. Assume the channel state information (CSI) is known, multiple data streams can be transmitted with a precoder and a combiner employed in the transmitter and the receiver, respectively. This creates an equivalent transmission channel consisting of parallel and independent communication links. Beamforming can be considered a special case of precoding, where the precoder applies phase shift only. Current mmWave systems usually employ large-scale antenna arrays to expand the channel capacity. While a purely digital precoding approach has been proved feasible in implementation [2] and may enjoy the advantages of full channel capacity and lower precoding complexity, an alternative approach seeking the combination of RF beamforming and digital precoding is

The associate editor coordinating the review of this manuscript and approving it for publication was Filbert Juwono.



FIGURE 1. A system block diagram of mmWave single user system with the joint hybrid analog and digital beamforming architecture.

getting popular. The main driving force is the cheaper implementation cost. The solution of combining digital precoding and RF beamforming (also known as hybrid precoding or hybrid beamforming) was proposed in [3], [4]. It requires fewer RF chains and ADC/DAC devices in system implementation. RF beamforming can also alleviate the power link budget. Fig. 1 shows the system block diagram of a single user system. At the transmitter side,  $N_s$  data streams are first processed by a digital base band precoding module and mapped to  $N_{RF}$  coded data streams for parallel transmissions. Coded data streams are up converted to the RF band and a beamformer network, which applies a distinctive steering vector to each individual coded data stream, generates  $N_{RF}$  directed beams transmitted by an antenna array of size  $N_t$  simultaneously. An RF signal combiner is needed per antenna. At the receiving end, the beamformer network applies multiple steering vectors, each aims at a specific beam, to the received antenna array signals.  $N_{RF}$  signals are retrieved and then down converted to the baseband, where a digital combiner decodes the data streams to recover the original ones. The essential issue of hybrid precoding is determining the steering matrix for the beamformer network and the precoding matrix for the baseband module efficiently. To create an equivalent transmission channel consisting of parallel and independent communication links, the channel matrix modeling the air link between  $N_t$  transmitting and Nr receiving antennas are factorized using Singular Value Decomposition (SVD). The product of the steering matrix and precoding matrix in the transmitting end should be equal to the conjugate transpose of the right unitary matrix, while the counterpart in the receiving end should also be the conjugate transpose of the left unitary matrix. This calls for additional matrix factorizations of the SVD result - the main design challenge of achieving the hybrid precoding. In addition, the steering vectors of the steering matrix should match the real signal transmission paths as much as possible for better beamforming results. If the factorization result does not lead to a perfect reconstruction of the unitary matrix, this causes losses in channel capacity. The matrix factorization for hybrid precoding can be considered as a simultaneous sparse approximation problem, which is often encountered in the applications of compressive sensing (CS). In hybrid precoding, the problem is formulated as selecting a small number of steering vectors from a codebook (dictionary) to reconstruct one of the unitary matrices of the SVD result. These steering vectors form the steering matrix of the beamformer network while the weighting matrix corresponds to the precoding matrix of the baseband module. Two principal methods for sparse approximation are  $\ell_1$  minimization, also known as Basis Pursuit (BP) [5], [6], and matching pursuit (MP) [7], [8]. Both methods are iterative. Although the  $\ell_1$ minimization schemes generally achieve better approximations, the computing complexity is much higher. The MP schemes are based on a greedy search approach, trying to find the columns (steering vectors) one at a time from the codebook to maximally reduce the approximation error. Orthogonal matching pursuit (OMP) [9], [10] is an extension of MP by computing the orthogonal projection of the signal onto the subspace spanned by the set of columns selected so far. Simultaneous sparse approximation differs from the sparse approximation in that multiple signal vectors are reconstructed from the same set of basis vectors simultaneously. The schemes presented in [11]-[13] adopt OMP based approaches and the work in [14] uses convex relaxation, a variation of basis pursuit. The application of OMP schemes to the hybrid precoding problem for large millimeter wave MIMO systems was first discussed in [3], [4]. However, all the aforementioned research efforts focus on algorithm development and pay little attention to the implementation issues. In particular, the computing complexities of these schemes are formidable and grow cubically with the size of MIMO systems. To achieve real time operations, significant efforts in largely reducing the computing complexity are needed. Among few works devoted to this purpose, in [15], the measurement matrix is QR factorized and updated progressively using the modified Gram-Schmidt (MGS) scheme. The least squares (LS) solution in each iteration can then be obtained incrementally to reduce the complexity. In [16], matrix inversions, the most computation demanding in the approximation scheme, are computed efficiently using Schur-Banachiewicz blockwise inversions. The scheme in [17], by assuming the millimeter wave systems have low rank channel matrices and the correlations between channels (beams) are low, performs the orthogonal projection only after all basis vectors are selected. This leads to an MP-like approach but can reduce the computing complexity significantly.

Aside from the algorithm developments, several efforts have been devoted to the hardware accelerator. Most of them are for compressive sensing reconstructions. A high-speed hardware architecture, which implements a simplified  $\ell_1$ minimization scheme to reconstruct compressively sensed images, is presented in [6]. An FPGA design realizing a modified OMP scheme for CS reconstruction is proposed in [18]. The design also shows significant speed advantages over the CPU and GPU based implementations. In [16], a simplified OMP scheme is first derived by employing matrix inversion bypass techniques to the projection operations. A hardwired design is then developed and synthesized using a 65nm CMOS process. A similar simplification is employed in [17] to develop the first hardware accelerator for hybrid precoding. The ASIC is implemented by using a TSMC 90nm process technology and can operate up to 167MHz.

Efficient hardware accelerator designs of hybrid precoding are essential to the implementations of millimeter wave massive MIMO systems. It requires both delicate algorithm simplifications to curb the circuit complexity and extensive architecture optimizations to enhance the throughput. In this paper, a low-complexity hybrid precoding algorithm based on a modified OMP scheme is first developed. The algorithm computes steering and precoding matrices in separate phases to alleviate the computing overheads of iterative matrix updates. This measure also creates the computing parallelism to facilitate hardware acceleration. A novel computing scheme based on QR decomposition and blockwise inversion techniques is also developed to tackle the most critical least square solution module. Based on the developed scheme, an efficient hardware accelerator design is presented. It supports an  $8 \times 8$  MIMO configuration with 4 data streams. The design can achieve 10.4M matrix factorizations of hybrid precoding per second and implemented using a 40nm process technology.

The rest of the paper is organized as follows: In Section II, preliminaries of joint RF beamforming and digital precoding schemes for mmWave MIMO system are introduced. These include the channel model, SVD based hybrid precoding schemes, and OMP based precoding matrices construction schemes. In section III, algorithm modifications are first addressed. A low complexity LS computing scheme based on QR decomposition and blockwise inversion is next described. Finally, the proposed hybrid precoding algorithm is presented. The algorithm performance simulations and complexity analyses are presented in section IV. The algorithm to hardware mapping and task scheduling details are elaborated in section V. The chip design results and performance comparison with prior arts are also summarized.

#### **II. PRELIMINARIES OF HYBRID PRECODING**

### A. MILLIMETER WAVE CHANNEL MODEL

Consider a single user mmWave MIMO system with  $N_t$  transmitter antennas and  $N_r$  receiver antennas. A  $N_r \times N_t$  block-fading channel can be mathematically expressed as

$$\mathbf{y}_{\mathbf{N}_{\mathbf{r}}\times\mathbf{1}} = \mathbf{H}_{N_{r}\times N_{t}}\mathbf{x}_{N_{t}\times\mathbf{1}} + \mathbf{n}_{N_{r}\times\mathbf{1}}, \qquad (1)$$

where **y** is the received signal vector, **H** is the channel matrix, **x** is the transmitted signal vector, and **n** is an independent and identically distributed (i.i.d.)  $C\mathcal{N}(0, \sigma_n^2)$  additive Gaussian noise. Millimeter Wave systems with tightly packed arrays and high path loss usually lead to limited spatial selectivity and high levels of antenna correlation. Therefore, we adopt a narrowband finite scatterer representation [19], based on the extended Saleh Valenzuela model (extended S-V model), to describe the channel matrix **H** as,

$$\mathbf{H} = \sum_{p=1}^{P} \alpha_{p} \boldsymbol{\psi} \left( \psi_{p} \right) \boldsymbol{\varphi}^{T} \left( \varphi_{p} \right) = \boldsymbol{\Psi} \boldsymbol{\Omega} \boldsymbol{\Theta}^{T}, \qquad (2)$$

VOLUME 7, 2019

where  $\boldsymbol{\psi}(\psi_p), \boldsymbol{\varphi}^T(\varphi_p)$ , are the receiver and transmitter phase weighting vectors, respectively, of the  $p^{\text{th}}$  scattering ray.  $\psi_p$  and  $\varphi_p$  are the associated azimuth angles.  $\alpha_p \sim C\mathcal{N}(0, 1)$  is the complex gain of the  $p^{\text{th}}$  scattering ray.  $\boldsymbol{\Psi} = [\boldsymbol{\psi}(\psi_1)\cdots\boldsymbol{\psi}(\psi_p)]$  and  $\boldsymbol{\Theta} = [\boldsymbol{\varphi}(\varphi_1)\cdots\boldsymbol{\varphi}(\varphi_p)]$ indicate the corresponding phase weighting matrices.  $\boldsymbol{\Omega} =$ diag $(\alpha_1\cdots\alpha_p)$  is a diagonal matrix representing the multipath amplitudes. The phase weighting vector in an *N*-element uniform linear array (ULA) antenna system assumes a format as

$$\boldsymbol{a}\left(\phi\right) = \frac{1}{\sqrt{N}} \left[1, e^{j\frac{2\pi}{\lambda}dsin(\phi)}, \dots, e^{j(N-1)\frac{2\pi}{\lambda}dsin(\phi)}\right]^{T}, \quad (3)$$

where  $\lambda$  is the wavelength of radio frequency and *d* is the inter-antenna spacing. This paper assumes omni-directional antennas at the receiver and transmitter and considers the field of view range in beamforming as  $\left(-\frac{\pi}{3}, \frac{\pi}{3}\right)$ .

#### **B. SVD BASED PRECODING SCHEME**

SVD is the most popular precoding scheme, where channel matrix  $\mathbf{H}$  is first decomposed as

$$\mathbf{H} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^{H} = \sum_{i=1}^{N_{min}} \sigma_{i} \boldsymbol{u}_{i} \boldsymbol{v}_{i}^{H},$$
  
$$N_{min} = min \left( N_{r}, N_{t} \right), \qquad (4)$$

 $\mathbf{U} = \begin{bmatrix} \mathbf{u}_1 \cdots \mathbf{u}_{N_r} \end{bmatrix} (\mathbf{V} = \begin{bmatrix} \mathbf{v}_1 \cdots \mathbf{v}_{N_t} \end{bmatrix}) \text{ is an } N_r \times N_r (N_t \times N_t)$ unitary matrix consisting of the left (right) singular vectors  $\mathbf{u}_i (\mathbf{v}_i)$ .  $\boldsymbol{\Sigma}$  is an  $N_r \times N_t$  diagonal matrix of real-valued singular values  $\sigma_i$  arranged in a decreasing order. It aims at creating a system of parallel independent sub-channels free of interferences, and the equivalent gain of each sub-channel is determined by the corresponding singular values  $\sigma_i$ . Given the transmit signal **s** and the precoder **V**, the received signal **y** with the combiner **U** can be written as

$$\mathbf{U}^{H}\mathbf{y} = \mathbf{U}^{H} (\mathbf{H}\mathbf{V}\mathbf{s} + \mathbf{n})$$
  
=  $\mathbf{U}^{H}\mathbf{H}\mathbf{V}\mathbf{s} + \mathbf{U}^{H}\mathbf{n}$   
=  $\mathbf{U}^{H} (\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{H})\mathbf{V}\mathbf{s} + \mathbf{U}^{H}\mathbf{n}$   
=  $\boldsymbol{\Sigma}\mathbf{s} + \mathbf{U}^{H}\mathbf{n}$ , (5)

Because of channel sparsity and correlation in mmWave MIMO systems, the maximum number of transmitted data streams  $N_s$  cannot exceed the rank of channel matrix. In a hybrid precoding system, the precoder V is equivalent to the product of the baseband precoding matrix  $\mathbf{F}_{\mathbf{BB},\mathbf{t}}$ , which dispatches  $N_s$  data streams into  $N_{RF}^t$  transmitting beams; and the steering matrix  $\mathbf{F}_{\mathbf{RF},\mathbf{t}}$ , which transmits  $N_{RF}^t$  beams simultaneously through the transmitter beamformer network with  $N_t$  antennas. Similarly, the combiner  $\mathbf{U}^H$  equals the product of two counterparts ( $\mathbf{W}_{\mathbf{BB},\mathbf{r}}, \mathbf{W}_{\mathbf{RF},\mathbf{r}}$ ) in the receiving end. Therefore, we have

$$\mathbf{V} = (\mathbf{F}_{\mathbf{RF},\mathbf{t}})_{N_t \times N_{\mathbf{RF}}^t} \cdot (\mathbf{F}_{\mathbf{BB},\mathbf{t}})_{N_{\mathbf{RF}}^t \times N_s}$$
(6)

$$\mathbf{U}^{H} = \left(\mathbf{W}_{\mathbf{BB},\mathbf{r}}^{H}\right)_{N_{s} \times N_{\mathrm{RF}}^{r}} \cdot \left(\mathbf{W}_{\mathbf{RF},\mathbf{r}}^{H}\right)_{N_{\mathrm{RF}}^{r} \times N_{r}}$$
(7)

 $N_{\text{RF}}^{t}$  and  $N_{\text{RF}}^{r}$  are number of RF chains at the transmitter and the receiver, respectively. For notification simplicity,

we will omit the "t" and "r" subscripts in the following discussion. Considering a block-fading channel model with additive white Gaussian noises, the received signal of the hybrid precoding scheme can be expressed as

$$\hat{\mathbf{y}} = \sqrt{\rho} \mathbf{W}_{\mathbf{B}\mathbf{B}}^{H} \mathbf{W}_{\mathbf{R}\mathbf{F}}^{H} \mathbf{H} \mathbf{F}_{\mathbf{R}\mathbf{F}} \mathbf{F}_{\mathbf{B}\mathbf{B}} \mathbf{s} + \mathbf{W}_{\mathbf{B}\mathbf{B}}^{H} \mathbf{W}_{\mathbf{R}\mathbf{F}}^{H} \mathbf{n}, \qquad (8)$$

where  $\rho$  denotes the average received power. The hybrid precoder design at the transmitting site tries to minimize the discrepancy between the product of  $\mathbf{F_{RF}}$ ,  $\mathbf{F_{BB}}$  and the right unitary matrix V of the SVD result. In implementation practices, to reduce the complexity of determining the steering matrix  $\mathbf{F_{RF}}$ , a beamforming codebook  $\mathcal{B}$  containing a fixed number of supported steering vectors is developed. The column vectors of  $\mathbf{F_{RF}}$  are selected from the codebook.

$$\mathcal{B} \triangleq \{ \boldsymbol{a} \left( \phi_l \right), 1 \le l \le L \}$$
(9)

The optimal precoder design  $(F_{RF}^{opt}, F_{BB}^{opt})$  problem can then be formulated as:

$$\begin{pmatrix} \mathbf{F}_{\mathbf{RF}}^{\mathbf{opt}}, \mathbf{F}_{\mathbf{BB}}^{\mathbf{opt}} \end{pmatrix} = \underset{\mathbf{F}_{\mathbf{RF}}, \mathbf{F}_{\mathbf{BB}}}{\operatorname{arg\,min}} \|\mathbf{V} - \mathbf{F}_{\mathbf{RF}} \mathbf{F}_{\mathbf{BB}}\|_{F} , \\ \forall \mathbf{F}_{\mathbf{RF}} (:, i) \in \mathcal{B}, i = 1 \sim N_{\mathrm{RF}} \text{ and } \|\mathbf{F}_{\mathbf{RF}} \mathbf{F}_{\mathbf{BB}}\|_{F} = N_{s}, (10)$$

where  $\|\cdot\|_F$  indicates the Frobenius norm of the matrix.

## C. OMP BASED PRECODING MATRICES CONSTRUCTION SCHEME

As suggested in [3], [4], an OMP algorithm can be employed to construct  $\mathbf{F}_{\mathbf{RF}}$ ,  $\mathbf{F}_{\mathbf{BB}}$ . Algorithm 1 shows the pseudo code of the scheme, where unitary matrix V and the codebook matrix  $\Phi = [a(\phi_1), \cdots, a(\phi_L)]$  are given as inputs The steering matrix  $\mathbf{F}_{\mathbf{RF}}$  is set empty and the residual matrix  $\mathbf{R}_i$  is set to  $\mathbf{V}$ initially ( $\mathbf{R}_0 = \mathbf{V}$ ). The algorithm then constructs the steering matrix  $\mathbf{F}_{\mathbf{RF}}$  progressively. In each iteration, the steering vectors remained in the codebook are projected to a space spanned by the residual matrix  $\mathbf{R}_i$ . The vector with the largest norm after projection is added to the steering matrix  $\mathbf{F}_{\mathbf{RF}}$  and then removed from the codebook, where  $\Phi(:, k)$  indicates the  $k^{th}$  column of the codebook matrix  $\Phi$ . This is the essence of matching pursuit, i.e. finding the basis vector with the highest correlation in a greedy manner. Since the product of  $F_{RF}$ and  $\mathbf{F}_{\mathbf{BB}}$  should approximate V, a least square error solution of  $\mathbf{F}_{\mathbf{BB}}$  is calculated subject to a partially constructed  $\mathbf{F}_{\mathbf{RF}}$ . This also corresponds to an orthogonal projection in the OMP scheme. A pseudo inverse computation as indicated in line 7 of the algorithm is thus needed. The residual matrix  $\mathbf{R}_i$ is updated by subtracting the current product result  $\mathbf{F}_{\mathbf{RF}}\mathbf{F}_{\mathbf{BB}}$ from it. There are two major concerns associated with this algorithm when hardware implementation is considered. The first one is computing complexity. This is mostly attributed to the repetitive matrix pseudo-inverse computations needed every time  $F_{RF}$  is updated. The second one is the computing latency because of the intertwined computations of  $F_{RF}$ and  $\mathbf{F}_{\mathbf{B}\mathbf{B}}$  in each iteration. This leads to a purely sequential computation without any chance of parallelism to enhance the throughput.

## Algorithm 1 Precoder matrices construction via OMP

1.  $\mathbf{F}_{\mathbf{RF}} = \text{Empty Matrix}$ 

2. 
$$\mathbf{R}_0 = \mathbf{V}$$

3. **for** i = 1 to  $N_{\rm RF}$ 

4. 
$$\Psi_{i-1} = \Phi^H \mathbf{R}_{i-1}$$

- 5.  $k = \arg \max_{l=1,...,L} (\Psi_{i-1} \Psi_{i-1}^{H})_{l}$
- 6.  $\mathbf{F}_{\mathbf{RF}} = [\mathbf{F}_{\mathbf{RF}} | \mathbf{\Phi}(:, k)]$

7. 
$$\mathbf{F}_{\mathbf{BB}} = (\mathbf{F}_{\mathbf{RF}}^{H} \mathbf{F}_{\mathbf{RF}})^{-1} \mathbf{F}_{\mathbf{RF}}^{H} \mathbf{V}$$

8. 
$$\mathbf{R}_i = \mathbf{V} - \mathbf{F}_{\mathbf{RF}}\mathbf{F}_{\mathbf{BE}}$$

9. end for

10. 
$$\mathbf{F}_{\mathbf{BB}} = \sqrt{N_s} \frac{\mathbf{F}_{\mathbf{BB}}}{\|\mathbf{F}_{\mathbf{RF}}\mathbf{F}_{\mathbf{BB}}\|_F}$$

11. return F<sub>RF</sub>, F<sub>BB</sub>

FIGURE 2. The OMP scheme for constructing precoder matrices.

## III. PROPOSED HYBRID PRECODER/COMBINER CONSTRUCTION SCHEME

### A. MODIFICATIONS TO FACILITATE HARDWARE IMPLEMENTATION

To mitigate the aforementioned problems, the algorithm is modified in two aspects, i.e., computing parallelism and computing complexity. To enhance computing parallelism, the computations of  $\mathbf{F}_{\mathbf{RF}}$  and  $\mathbf{F}_{\mathbf{BB}}$  are decoupled. The sparse approximation scheme in [8] embraces a similar idea. However, an orthogonalization process is still needed. Note that the orthogonalization process adopted in the OMP scheme yields better approximation results than the original MP scheme when the correlations among the selected basis vectors are high. For millimeter wave MIMO systems using hybrid precoding schemes, the transmission beams should separate from each other by a modest angular margin to ensure the channel capacity. It is thus safe to assume the selected steering vectors are not highly correlated. (Simulation results will be given later in section III to justify the assumptions.) Based on this assumption, the steering matrix  $\mathbf{F}_{\mathbf{RF}}$  is constructed independently without performing a orthogonal projection to update the  $F_{BB}$  matrix. The update of the residual matrix  $\mathbf{R}_i$ , as indicated in Eq. (11), is obtained by mapping it to a subspace orthogonal to the selected steering vector  $\mathbf{\Phi}_{\mathbf{k}} = \mathbf{a} (\phi_k)$  in the  $(i-1)^{\text{th}}$  iteration.

$$\mathbf{R}_i = \mathbf{R}_{i-1} - \mathbf{\Phi}_{\mathbf{k}} \mathbf{\Phi}_{\mathbf{k}}^H \mathbf{R}_{i-1}, \qquad (11)$$

In other words, the residual matrix is updated by taking away the contribution from the selected vector. After the required number of steering vectors for  $\mathbf{F}_{\mathbf{RF}}$  is achieved, an orthogonal projection of matrix V to the space spanned by the column vectors of matrix  $\mathbf{F}_{\mathbf{RF}}$  is performed to obtain the  $\mathbf{F}_{\mathbf{BB}}$  matrix. This is equivalent to calculate the LS solution of  $\mathbf{V} = \mathbf{F}_{\mathbf{RF}}\mathbf{F}_{\mathbf{BB}}$ . Besides decoupling the computations of  $\mathbf{F}_{\mathbf{RF}}$ and  $\mathbf{F}_{\mathbf{BB}}$ , this approach also reduces the number of orthogonal projection to just one. In terms of system implementation practices, decoupling the computations of  $F_{RF}$  and  $F_{BB}$  is also a must for multi-carrier systems such as OFDM, where each carrier has its own distinctive channel matrix. Since analog beamforming setting indicated by  $F_{RF}$  is common to all carriers, it is calculated first and each carrier calculates its  $F_{BB}$  subject to  $F_{RF}$ , respectively.

Although only one pseudo-inverse computation remains after the decoupling process, its computing complexity (cubically proportional to the number of antennas) is still formidable especially for the massive MIMO systems. The pseudo inverse computation is used to find the LS solution of  $\mathbf{F}_{BB}$  subject to  $\mathbf{V} = \mathbf{F}_{RF}\mathbf{F}_{BB}$ . QR decomposition (QRD) is an alternative approach, which performs a matrix triangularization first followed by a backward substitution to obtain the result. The triangularization can be achieved by using a sequence of Givens rotations, which can be implemented efficiently by using a Coordinate Rotation Digital Computer (CORDIC) scheme. The CORDIC scheme is also known for its simplicity in circuit implementation. After ORD, a backward substitution process is needed to obtain the final LS result. However, its data flow is opposite to the data generation order of the QRD stage. In terms of hardware implementation, this implies a prolonged computing latency and extra data buffer to redirect the data flow. In this paper, instead of performing backward substitution, we propose to compute the inverse of matrix  $\mathcal{R}$  directly after QR decomposition. This enables a seamless data flow between two computing stages, i.e. triangularization of  $F_{RF}$  and solution of  $F_{BB}$ . A blockwise inversion scheme is applied to reduce the complexity of computing  $\mathcal{R}^{-1}$  significantly.

## B. LOW COMPLEXITY LS COMPUTATION BASED ON QR DECOMPOSITION AND BLOCKWISE INVERSION

Assume the QR decomposition of  $\mathbf{F}_{\mathbf{RF}}$  is  $\mathbf{Q} \cdot \mathbf{\mathcal{R}}$ , where  $\mathbf{Q}$  is a an  $N_t \times N_t$  unitary matrix and  $\mathbf{\mathcal{R}}$  is an  $N_t \times N_{RF}$  upper triangular matrix. Since  $\mathbf{F}_{\mathbf{BB}}$  is the LS solution of  $\mathbf{F}_{\mathbf{RF}}\mathbf{F}_{\mathbf{BB}} = \mathbf{V}$ , based on the QR decomposition result, it can be calculated as

$$\mathbf{F}_{\mathbf{B}\mathbf{B}} = \mathcal{R}^{-1} \mathbf{Q}^{\mathbf{H}} \mathbf{V},\tag{12}$$

where superscript "H" indicates the complex conjugate and transpose form.  $\mathbf{Q}^{\mathbf{H}}$  can be obtained by applying a sequence of complex valued Givens rotations  $\mathbf{Q}(i, k)$ , each aimed at nullifying the  $(i, k)^{\text{th}}$  element of  $\mathbf{F}_{\mathbf{RF}}$ . Note that  $\mathbf{F}_{\mathbf{RF}}$  is updated accordingly.

$$\mathbf{Q}^{\mathbf{H}} = \prod_{1}^{\mathbf{k} = \mathbf{N}_{\mathbf{RF}}} \left( \prod_{\mathbf{k}+1}^{\mathbf{i} = \mathbf{N}_{\mathbf{t}}} \mathbf{Q}(\mathbf{i}, \mathbf{k}) \right), \tag{13}$$

$$\mathbf{Q}(\mathbf{i},\mathbf{k}) = \begin{bmatrix} \cos\left(\alpha_{ik}\right) & \sin\left(\alpha_{ik}\right) \\ -\sin\left(\alpha_{ik}\right) & \cos\left(\alpha_{ik}\right) \end{bmatrix} \begin{bmatrix} \mathbf{e}^{-j\beta_{ik}} & \mathbf{0} \\ \mathbf{0} & \mathbf{e}^{-j\gamma_{ik}} \end{bmatrix},$$
(14)

 $\mathbf{Q}(i, k)$  consists of three real valued rotations by angles  $\alpha_{ik}$ ,  $\beta_{ik}$ , and  $\gamma_{ik}$ , respectively.  $e^{-j\beta_{ik}}$  and  $e^{-j\gamma_{ik}}$  correspond to two rotations converting the two complex valued elements respectively to real values, i.e. nullifying the imaginary parts.



FIGURE 3. Inversion of an upper triangular matrix after applying two levels of blockwise inversion.

The third rotation by angle  $\alpha_{ik}$  then nullifies one of the two elements.

The matrix inversion  $\mathcal{R}^{-1}$  is computed by using a blockwise inversion scheme. Eq. (15) shows the blockwise inversion of a matrix divided into 4 blocks (sub-matrices) (15), as shown at the bottom of the next page.

where **A** and **D** must be square, and **A** and  $\mathbf{D} - \mathbf{C}\mathbf{A}^{-1}\mathbf{B}$ must be nonsingular. Since **R** is an upper triangular matrix, **C** is always a zero matrix, and **A**, **D** are both upper triangular. The non-singularity conditions are thus guaranteed. The blockwise matrix inversion equation can be simplified as

$$\begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{0} & \mathbf{D} \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{A}^{-1} & -\mathbf{A}^{-1}\mathbf{B}\mathbf{D}^{-1} \\ \mathbf{0} & \mathbf{D}^{-1} \end{bmatrix}, \quad (16)$$

The inversions of diagonal blocks **A** and **D** can be computed first, then the remaining block can be computed by performing two matrix multiplications. We may apply the blockwise inversion scheme recursively to the diagonal blocks, which remain upper triangular, to further simplify the computations. This leads to a divide-and-conquer approach. If the block size reduces to  $1 \times 1$ , the matrix inversion becomes taking the reciprocal. Fig. 3 illustrates how the inversion of an upper triangular matrix is achieved by applying two levels of blockwise inversion. Note that

$$\mathbf{B}' = -\mathbf{A}^{-1}\mathbf{B}\mathbf{C}^{-1} \tag{17}$$

$$\mathbf{F}' = -\mathbf{E}^{-1}\mathbf{B}\mathbf{G}^{-1} \tag{18}$$

$$\mathbf{D}' = -\begin{bmatrix} \mathbf{A}^{-1} & \mathbf{B}' \\ 0 & \mathbf{C}^{-1} \end{bmatrix} \mathbf{D} \begin{bmatrix} \mathbf{E}^{-1} & \mathbf{F}' \\ 0 & \mathbf{G}^{-1} \end{bmatrix}$$
(19)

This suggest a computing scheme that computes the inversions of the diagonal blocks **A**, **C**, **E** and **G** first, and then computes the inversion of blocks **B**, **F** and **D** above the diagonal successively.

### C. LOW COMPLEXITY PRECODING MATRICES CONSTRUCTION SCHEME

Based on the modifications of OMP algorithm and the low complexity LS computing scheme elaborated in the previous two sub-sections, the proposed precoding matrices construction scheme is presented in Fig. 4.

The for loop between line 4 and line 8 selects the steering vector  $\Phi_k$  from the codebook, one at a time, until the number of RF beams ( $N_{\text{RF}}$ ) is reached.  $S_i$  is the index set of selected

Algorithm 2 Precoder construction scheme using modified OMP and low complexity projection

Input: V ,  $\Phi$ 

 $\Psi_0 = \mathbf{\Phi}^H \mathbf{V}$ 1.  $S_0 = \text{Empty Set}, \ \overline{S}_0 = \{i, 1 \le i \le L\}$ 2. 3.  $\mathcal{R}_{inv}$  = Empty Matrix for i = 1 to  $N_{\rm RF}$ 4.  $k = \arg \max_{l=1,\dots,l} \langle \Psi_{i-1}(l, :), \Psi_{i-1}^{H}(:, l) \rangle$ 5.  $S_i = S_{i-1} \cup \{k\}, \ \overline{S}_i = \overline{S}_{i-1} - \{k\}$ 6.  $\Psi_{i} = \Psi_{i-1}(\overline{S}_{i},:) - \Phi_{\overline{S}_{i}}^{H} \Phi_{k} \Psi_{i-1}(k,:)$ 7. 8. end for  $\mathbf{F}_{\mathbf{RF}} = \mathbf{\Phi}_{S_{N_{\mathbf{RF}}}}$ 9.  $[\mathbf{Q}, \mathbf{\mathcal{R}}] = \text{QRD}(\mathbf{F}_{\text{RF}})$ 10.  $\mathcal{R}^{-1} = binv(\mathcal{R})$ 11.  $\mathbf{F}_{\mathbf{B}\mathbf{B}} = \mathcal{R}^{-1}\mathbf{Q}^{H}\mathbf{V}$ 12.

13. return F<sub>RF</sub>, F<sub>BB</sub>

FIGURE 4. Proposed precoding matrices construction scheme.

steering vectors up to the *i*<sup>th</sup> iteration and  $\overline{S}_i$  is its complement set. Notation ( $\overline{S}_i$ , :) of a matrix means the sub-matrix consisting of the columns belonging to the index set  $\overline{S}_i$ . To simplify the computations, the update of residual matrix, as shown in Eq. (11), is replaced with the update of the residual matrix projection on the null space of the selected steering vectors (line 7 of the algorithm), i.e.

$$\Psi_{i} = \Phi_{\overline{S}_{i}}^{H} \mathbf{R}_{i} = \Phi_{\overline{S}_{i}}^{H} \left( \mathbf{R}_{i-1} - \Phi_{k} \Phi_{k}^{H} \mathbf{R}_{i-1} \right)$$
  
$$= \Phi_{\overline{S}_{i}}^{H} \mathbf{R}_{i-1} - \Phi_{\overline{S}_{i}}^{H} \Phi_{k} \Phi_{k}^{H} \mathbf{R}_{i-1}$$
  
$$= \Psi_{i-1} \left( \overline{S}_{i}, : \right) - \Phi_{\overline{S}_{i}}^{H} \Phi_{k} \Psi_{i-1} \left( k, : \right), \qquad (20)$$

After obtaining the steering matrix  $\mathbf{F}_{\mathbf{RF}}$ , QR decomposition is performed to obtain the unitary matrix  $\mathbf{Q}$  and upper triangular matrix  $\mathcal{R}$ . As shown in Eq. (13),  $\mathbf{Q}$  is the product of a sequence of Givens rotations, and can be computed explicitly by applying the same sequence of Givens rotations to an identity matrix while performing triangularization. *binv*() indicates the recursive blockwise upper triangular matrix inversion scheme as stated in sub-section III.B. Table 1 shows the computing complexity comparison between the proposed modified OMP scheme and the original OMP scheme, where N is the antenna array size,  $N_{RF}$  is the number of RF transmission beams,  $N_s$  is the number of data streams, and L is the codebook size. The measurement unit is complex-valued multiplication. For easier comparisons, normalized complexities subject to the assumptions  $N_s = N_{RF} = N/2$  and L = Nare also provided. The complexity of vector selection and residual matrix update of proposed scheme is smaller than the original OMP scheme by a factor of 0.3N. Similarly, the **F**<sub>BB</sub> update complexity of proposed scheme is reduced by a factor 0.67N. This is because only one instead of  $N_{RF}$ (N/2) orthogonal projections is performed. Further complexity reduction is attributed to the QRD and blockwise inversion based LS computing scheme described in section III.B. In total, the computing complexity reduction factor is 0.31N.

## IV. ALGORITHM PERFORMANCE SIMULATION RESULTS AND ANALYSES

Algorithm performance simulation is conducted to determine the codebook size of beamforming and to evaluate the performance discrepancy between the original OMP scheme and the proposed one. In simulation setting, the antenna array sizes  $N_t = N_r$  are both set to 8. The extended S-V model described in section II.A is used to characterize the channel matrix **H**. The transmission paths, each has a distinct path gain  $\alpha_p \sim C\mathcal{N}(0, 1)$ , coincide in angles with the beamforming vectors defined in the codebook.

The azimuth angles of departure and arrival both are bounded by  $\left(-\frac{\pi}{3}, \frac{\pi}{3}\right)$ . The evaluation index is the percentage of correctly choosing the  $N_{RF}$  steering vectors with largest path gains from the codebook. Since  $N_{RF}$  cannot be greater than N,  $N_{RF} = 2$ , 4 and 6, are simulated with different codebook sizes. Note that the half-power beam width of a sized N antenna array can be calculated as follows [19]:

$$\boldsymbol{\theta}_{3db} = 50 \frac{\lambda}{Nd} = \frac{100}{N} (deg) \tag{21}$$

where  $d = \lambda/2$  is the inter-antenna element spacing. This indicates the beamforming angular resolution. For N =8, the resolution is 12.5°. The maximum number of uniformly spaced beams that can be accommodated in the target angle span without overlapping in half power profile, is  $\lfloor (2\pi/3)/12.5^\circ \rfloor = 9$ . In the simulation experiments, the codebook consists of equally spaced steering vectors and the size rages from 8 to 20. Fig. 5 shows the comparison results between the original OMP scheme and the proposed, simplified one. In all three  $N_{RF}$  cases, the performance of the two schemes in choosing correct steering vectors are almost identical. In certain cases, e.g.,  $N_{RF} = 6$  and codebook size L between  $9 \sim 13$ , the correct selection ratio of the proposed scheme is even slightly higher than that of the original OMP scheme. This concludes that the proposed algorithm

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} + A^{-1}B(D - CA^{-1}B)^{-1}CA^{-1} & -A^{-1}B(D - CA^{-1}B)^{-1} \\ -(D - CA^{-1}B)^{-1}CA^{-1} & (D - CA^{-1}B)^{-1} \end{bmatrix},$$
(15)

| OMP algorithm                                                                     |                                                                                                                                                     |                                         |  |  |  |  |  |  |  |  |
|-----------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------|--|--|--|--|--|--|--|--|
| update $\Psi_i$ $N_t \times N_S \times (N_{RF} \times L - N_{RF}^2/2 + N_{RF}/2)$ |                                                                                                                                                     |                                         |  |  |  |  |  |  |  |  |
| select k                                                                          | $\frac{1}{N_S \times (N_{RF} \times L - N_{RF}^2/2 + N_{RF}/2)}$                                                                                    |                                         |  |  |  |  |  |  |  |  |
| update <b>R</b> <sub>i</sub>                                                      | $\frac{N_t \times N_S \times (N_{RF}^2 + N_{RF})/2}{N_t \times N_S \times (N_{RF}^2 + N_{RF})/2}$                                                   |                                         |  |  |  |  |  |  |  |  |
| update <b>F<sub>BB</sub></b>                                                      | $\frac{N_t^2(N_{RF}^2 + N_{RF})/2 + N_t(N_{RF}^3/3 + N_{RF}^2/2 + N_{RF}/6)}{+N_t N_S(N_{RF}^2 + N_{RF})/2 + N_{RF}^4/4 + N_{RF}^3/2 + N_{RF}^2/4}$ |                                         |  |  |  |  |  |  |  |  |
| Modified OMP algorithm                                                            |                                                                                                                                                     |                                         |  |  |  |  |  |  |  |  |
| $\Psi_0$                                                                          | $L \times N_t \times N_S$                                                                                                                           |                                         |  |  |  |  |  |  |  |  |
| update $\Psi_i$                                                                   | $(N_t + N_S) \times (N_{RF} \times L - N_{RF}^2/2 + N_{RF}/2)$                                                                                      |                                         |  |  |  |  |  |  |  |  |
| select k                                                                          | $N_S \times (N_{RF} \times L - N_{RF}^2/2 + N_{RF}/2)$                                                                                              |                                         |  |  |  |  |  |  |  |  |
| update F <sub>BB</sub>                                                            |                                                                                                                                                     |                                         |  |  |  |  |  |  |  |  |
| Normali                                                                           | zed complexity with $N_s = \Lambda$                                                                                                                 | $I_{RF} = N/2$ and $L = N$              |  |  |  |  |  |  |  |  |
| Computation                                                                       | OMP                                                                                                                                                 | modified OMP                            |  |  |  |  |  |  |  |  |
| $\Psi_0$                                                                          | -                                                                                                                                                   | N <sup>3</sup> /2                       |  |  |  |  |  |  |  |  |
| update $\Psi_i$                                                                   | $3N^4/16 + N^3/8$                                                                                                                                   | $9N^3/16 - 3N^2/8$                      |  |  |  |  |  |  |  |  |
| Select k                                                                          | $3N^3/16 + N^2/8$                                                                                                                                   | $3N^3/16 + N^2/8$                       |  |  |  |  |  |  |  |  |
| update $\mathbf{R}_i$                                                             | $N^4/16 + N^3/8$                                                                                                                                    | -                                       |  |  |  |  |  |  |  |  |
| update F <sub>BB</sub>                                                            | $47N^4/192 + 9N^3/16$<br>+ $7N^2/48$                                                                                                                | $35N^3/96 + 5N^2/8$<br>- 7N/12          |  |  |  |  |  |  |  |  |
| Total                                                                             | $95N^4/192 + N^3$<br>+ $13N^2/48$                                                                                                                   | $\frac{155N^3}{96} + 3N^2/8$<br>- 7N/12 |  |  |  |  |  |  |  |  |

TABLE 1. Computing complexity analysis of OMP schemes.

simplification measures do not cause any performance losses. The simulation results also show the correct selection ratio degrades as L increases because of a finer angular resolution that has actually exceeded the capacity of the antenna array. The impact of  $N_{RF}$  value, however, depends on the codebook size L.

When a smaller codebook is used, the correct selection ratio with a smaller  $N_{RF}$  value is inferior to that with a larger one. This is because greedy based sparse approximation schemes cannot guarantee the order of vector selection always match the ranking order of path gain. If a larger set of vectors is to be selected, the correctness of selection order is less crucial, which leads to a higher "hit ratio". When a larger codebook is used, the selection procedure becomes more error prone, particularly for those steering vectors corresponding to less significant path gains. A bigger  $N_{RF}$  value means more error prone selections are made, which degrades the overall correct selection ratio.

Considering the angular resolution supported by the target antenna array size (8) and a goal of 90% accuracy in vector selections, the codebook size is set to 9 in designing the hardware accelerator.

Further hardware design parameters are explored. For a low complexity fixed point implementation, bit widths for all computing variables should be simulated to take the effect of quantization errors into account. The iteration number of the CORDIC scheme used in QR decomposition should also be determined. To curb the simulation complexity, the CORDIC iteration number is chosen first based on floating point simulation results. After fixing the iteration number, which is



**FIGURE 5.** Simulation results of percentage of correct steering vectors selection for OMP and the modified OMP schemes. (a)  $N_{RF} = 2$ . (b)  $N_{RF} = 4$ . (c)  $N_{RF} = 6$ .

8 in this case, the variable bit widths are simulated next. A QPSK modulation format is assumed. Other simulation settings are ( $N_s = N_{RF} = 4, N = 8, L = 9$ ). The CORDIC iteration number is 8. Variables at different computing stages may require different precisions. The fixed point simulations are performed progressively from input stage to output stage. Fig. 6 shows the bit error rate (BER) simulation results of the final stage. The BER curves of the three floating point versioned schemes, i.e., the original OMP scheme, the



**FIGURE 6.** Fixed point algorithm simulation results for hardware implementation.

| TABLE 2. | System configuration and hardware design specs. |
|----------|-------------------------------------------------|
|----------|-------------------------------------------------|

| Parameter                                                    | Value             |
|--------------------------------------------------------------|-------------------|
| Antenna size $N_t \times N_r$                                | 8×8 ULA           |
| Max number of data stream $N_s$                              | 4                 |
| Max number of RF chains $N_{RF}$                             | 4                 |
| Number of candidate basis vectors                            | 9                 |
| Input matrix V size / element size                           | 8×4 / 2×12 bits   |
| Output matrix $\mathbf{F}_{\mathbf{RF}}$ size / element size | 8×4 / 2×12 bits   |
| Output matrix $\mathbf{F}_{\mathbf{BB}}$ size / element size | 4×4 / 2×12 bits   |
| Initiation interval                                          | 32 clock cycles   |
| Computing latency                                            | ≤128 clock cycles |

proposed modified scheme (denoted as mOMP), and the SVD precoding scheme are included as the reference. The last one, especially, is to show how the two-stage hybrid precoding scheme deviates in performance from an ideal one-stage SVD precoding scheme. For fixed point simulations, the integer part is set to 2-bit wide, which is sufficient to avoid overflows according to the profiling results. The size of the fractional part ranges from 5 to 11 bits. Any deviation in BER curves from the floating point version is considered the fixed point implementation loss. According to the simulation results, the BER curve of the floating point mOMP scheme almost overlaps with that of the floating point OMP scheme. The performance loss of fixed point implementation becomes negligible when the fractional precision reaches 10 bits. When compared with the SVD precoding scheme, the SNR values for a  $10^{-4}$  BER differ by just 0.6dB. The actual performance gap can be even smaller if the link power issue is considered. The SVD precoding scheme does not employ RF beamforming and requires a larger transmission power to compensate the path loss.

## V. HIGH THROUGHPUT HYBRID PRECODING PROCESSOR

The design specs of the hybrid precoding processor, the hardware accelerator of the proposed precoding scheme, are summarized in Table 2. The system configuration largely follows the simulation setting described in section IV. Although the



FIGURE 7. System block diagram of the proposed hybrid precoding processor.

maximum number of RF chains  $N_{RF}$  is upper bounded by the antenna array size (N = 8), it is set deliberately to half the size, i.e. 4, to alleviate the circuit complexity of the RF beamformer network. The input matrix V is  $8 \times 4$  in size and contains 32 complex-valued elements. Each element consists of two 12-bit wide numbers. Similarly, the two output matrices  $F_{RF}$  and  $F_{BB}$  contains 48 complex-valued elements in total. To avoid an oversized I/O pin count, the I/O bandwidth is confined to one element per matrix per clock cycle. This calls for 72 (24  $\times$  3) data I/O pins. 32 clock cycles are needed to accomplish the I/O operations of one hybrid precoding. This is also the minimum time separation (initiation interval) between two consecutive hybrid precoding computations. The computing latency, which is defined as the time span between the completion of V matrix input and the availability of the first element of  $F_{BB}$  matrix, is set to be no larger than 4 times the initiation interval, i.e., 128 clock cycles. A prolonged computing latency requires excessive data storage to hold the intermediate results of all uncompleted hybrid precoding computations. This is a challenging goal in view of the algorithm complexity. It also implies that 4 distinct hybrid precoding computations should be performed concurrently to meet the goal.

The hardware architecture of proposed hybrid precoding processor, as shown in Fig. 7, consists of three functional blocks—the index selection unit (ISU), the QR decompensation unit (QRDU) and the blockwise inversion unit (BIU). The ISU (Steps 1–9 in **Algorithm 2**) is further divided into five modules—buffer registers (for  $\mathbf{V}, \Psi_i, |\Psi_i|^2$ , and  $S_i$ ), complex-valued matrix multiplier (MM), matrix adder/ subtractor, 3 look-up tables ( $\Phi, \Phi^H, \Phi^H \Phi$ ) and maximum index finding (MIF) module as shown in Fig. 8.

Both  $\Phi$ ,  $\Phi^H$  look-up tables are needed to support simultaneous accesses in parallel processing. The complex-valued matrix multiplier is shared among initial  $\Psi_0$  calculation, inner product operation of  $|\Psi_i|^2$  and update of  $\Psi_i$ . The matrix multiplier is a systolic array design consisting of 36 complex-valued multiplier-and-accumulators (MACs). It can complete the computations in 8 clock cycles. The FMI module employs a binary tree of comparators so that maximum index can be obtained in one clock cycle.



FIGURE 8. Architecture design of the index selection unit.

The QRDU (Step 10 in Algorithm 2), as shown in Fig. 9, consists of a computing kernel and two sets of data ping-pong buffers. The buffers hold  $(\mathbf{F}_{\mathbf{RF}}, \mathbf{V})$  initially and end up with  $(\mathbf{R}, \mathbf{Q}^H \mathbf{V})$  after decomposition.  $\mathbf{Q}^H \mathbf{V}$  is the result needed in  $\mathbf{F}_{\mathbf{BB}}$  computation (line 12 of the proposed algorithm). The computing kernel contains two systolic array designs. The upper one is responsible for converting the complex-valued diagonal elements to real ones. The lower one performs nullifications of elements beneath the diagonal. To the right of the two systolic arrays are the corresponding dependence graphs. A space-time transform scheme is employed to map the algorithm to the architecture domain. More specifically, each systolic array contains a triangular array sitting on top of a rectangular array. The former one is for the nullification purpose and the latter one computes the update of V matrix. A total of 74 function units is needed. The basic function unit is a CORDIC processor, which performs the Givens rotation and takes 8 iterations (micro-rotations) to achieve the required accuracy. Those function units located along the diagonal of the upper layer operate in the vector mode and all the remaining operate in the rotation mode. As shown in Fig. 10, a pipelined architecture using a folding factor of 4 is adopted. The hardware complexity is one fourth complexity of a fully extended one and 4 clock cycles are needed to complete one Givens rotation. This is the minimum hardware requirement needed to meet the initiation interval design constraint. Among the two CORDIC processor designs, Givens generator operates in the vector mode while Givens rotation operates in the rotation mode.

Refer to Fig. 11, the BIU (lines 11-12 in Algorithm 2) design contains 4 data buffers for  $\mathcal{R}$ ,  $\mathcal{R}^{-1}$ ,  $Q^{H}V$ ,  $F_{BB}$ , respectively, four dividers and a 4 × 4 complex-valued matrix multiplier. Since the upper triangular matrix  $\mathcal{R}$  size is 4 × 4, all sub-matrix inversions in Eq. (17) ~ (19) are reduced to



FIGURE 9. Architecture of QR decompensation unit.



FIGURE 10. Pipelined CORDIC processor design with a folding factor of 4.



FIGURE 11. Architecture of blockwise inversion unit.

scalar inversions and computed by dividers. Four dividers compute the inversions of 4 diagonal elements in parallel. To shorten the critical path delay, these dividers, instantiated from Synopsys DesignWare Library, are each configured with four-stage pipelining. The  $4 \times 4$  matrix multiplier multiples  $\mathcal{R}^{-1}$  with  $Q^{H}V$  to obtain  $F_{BB}$ .

Fig. 12 illustrates the system timing diagrams of the proposed hybrid precoding processor design. Each time slot corresponds to 8 clock cycles. A new V matrix is admitted every 32 clock cycles as indicated in the design specs. After matrix input, the processing is divided into three phases, i.e. ISU (24 clock cycles), QRDU (60 clock cycles), and BIU (16 clock cycles). The computing latency is thus 100 clock cycles but the initiation interval between two successive computations is 32 clock cycles. It takes another 16 clock cycles to output the result of  $F_{BB}$ . Since the computing latency of QRDU is greater than the target initiation interval, two QRDU modules (marked in different colors in Fig. 12) are employed and compute successive matrix decompositions alternately.

|      | Proposed Hybrid Precoding Processor |    |       |       |           |       |           |           |                |          | T = 8 clocks |                 |                 |     |     |           |           |                 |                 |                 |                 |                 |                 |                 |          |                 |     |                 |     |                 |    |
|------|-------------------------------------|----|-------|-------|-----------|-------|-----------|-----------|----------------|----------|--------------|-----------------|-----------------|-----|-----|-----------|-----------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|----------|-----------------|-----|-----------------|-----|-----------------|----|
| Time | $T_1$                               | T2 | $T_3$ | $T_4$ | $T_5$     | $T_6$ | Τ,        | $T_8$     | T <sub>9</sub> | $T_{10}$ | $T_{11}$     | T <sub>12</sub> | T <sub>13</sub> | T14 | T15 | $T_{16}$  | $T_{17}$  | T <sub>18</sub> | T <sub>19</sub> | T <sub>20</sub> | T <sub>21</sub> | T <sub>22</sub> | T <sub>23</sub> | T <sub>24</sub> | $T_{25}$ | T <sub>26</sub> | T21 | T <sub>28</sub> | T29 | T <sub>30</sub> | 31 |
| V1   |                                     | In | out   |       |           | ISU   | $\supset$ | $\langle$ | _              | _        | QRI          | DU              | _               | _   | Х   | BIU       | )(0       | hutpu           | )               |                 |                 |                 |                 |                 |          |                 |     |                 |     |                 | ٦  |
| V2   |                                     |    |       |       | $\subset$ | In    | put       |           | $\subset$      | ISU      |              | $\sim$          |                 |     | QRI | U         |           |                 | $\times$        | BIU             | χo              | utpu            | )               |                 |          |                 |     |                 |     |                 |    |
| V3   |                                     |    |       |       |           |       |           |           | $\subset$      | Inp      | out          | $\supset$       | $\subset$       | ISU |     | <         |           | _               | QRI             | U               |                 | _               | $\times$        | BIU             | Xo       | utpu            | )   |                 |     |                 |    |
| V4   |                                     |    |       |       |           |       |           |           |                |          |              |                 | $\subset$       | In  | out | $\supset$ | $\square$ | ISU             |                 | <               |                 |                 | QRL             | U               |          |                 | ×   | BIU             | XO  | utput           |    |
|      |                                     |    |       |       |           |       |           |           |                |          |              |                 |                 |     |     |           |           |                 |                 |                 |                 |                 |                 |                 |          |                 |     |                 |     |                 |    |
| •    |                                     |    |       |       |           |       |           |           |                |          |              |                 |                 |     |     |           |           |                 |                 |                 |                 |                 |                 |                 |          |                 |     |                 |     |                 |    |





FIGURE 13. Post layout view of the proposed precoder/combiner reconstruction processor.

The chip design is based on an ARM standard cell library in a 40nm CMOS technology. Fig. 13 shows the final layout view accomplished by using Synopsys IC Compiler. The chip design summary is listed in Table 3. Layout area, gate count and power consumption are based on the report from IC Compiler. The timing information is obtained by using the PrimeTime-PX tool. The core size of the chip is 0.58 mm<sup>2</sup> with an equivalent gate count of 419.3k.

Among the three function units, QRDU is the most complex one and accounts for 45.3% of the total logic gates used. The numbers of ISU and BIU are 35.5% and 17.7%, respectively. The remaining 5.3% is attributed to the I/O buffer. A CQFP 100-pin packing is adopted and 72 pins are used for data I/O. The chip die size is expanded to 2.26 mm<sup>2</sup>. The maximum clock rate of the core design can reach 333MHz. It suggests a throughput of processing 10.4M hybrid precoding per second. The corresponding initiation interval is merely 96ns. The power consumption of the entire chip is 267.1mW @333MHz compared with 125.1 mW for the core design.

Table 4 summarizes the performance of related ASIC designs. Since most of the prior arts on hybrid precoding focusing on the algorithm aspect, only one, i.e. [17], develops the ASIC design. It also uses an  $8 \times 8$  MIMO configuration but assumes a fully connected beamformer network to reach the maximum possible number of RF beams ( $N_{RF} = 8$ ). In our design, the beamformer network is partially connected to reduce the circuit complexity, and the maximum number of RF beams supported is 4.

#### TABLE 3. Chip design summary.

| Item                | Specification             |
|---------------------|---------------------------|
| Technology process  | TSMC 40nm CLN40G          |
| Cell library        | ARM Cell-based Design Kit |
| Package             | CQFP 100                  |
| Voltage (Core/IO)   | 0.9/3.3 V                 |
| Clock Rate          | 333 MHz                   |
| Core Size           | 0.58 mm <sup>2</sup>      |
| Die Size            | 2.26 mm <sup>2</sup>      |
| IO Pad              | 95                        |
| Gate Count          | 419.3 k                   |
| Power (Chip/Core)   | 267.1/125.1 mW            |
| Latency             | 100 Cycles                |
| Initiation interval | 96ns (32 clock cycles)    |
| Throughput          | 10.4 M matrices/sec       |

TABLE 4. Performance comparison of hybrid precoding chip designs.

|                    | [17]                                        | This work                                   |
|--------------------|---------------------------------------------|---------------------------------------------|
| Precoding Scheme   | Hybrid Precoding                            | Hybrid Precoding                            |
| Technology Process | TSMC 90nm                                   | TSMC 40nm                                   |
| MIMO size          | $8 \times 8$                                | $8 \times 8$                                |
| N <sub>RF</sub>    | 8                                           | 4                                           |
| N <sub>s</sub>     | 4                                           | 4                                           |
| Problem complexity | $9 \times 8 \times 4 + 8 \times 8 \times 4$ | $9 \times 8 \times 4 + 8 \times 4 \times 4$ |
| Clock Rate         | 167 MHz                                     | 333 MHz                                     |
| Gate Count         | 985 k                                       | 419.3 k                                     |
| Latency            | 114 cycles                                  | 132 cycles                                  |
| T <sub>C</sub>     | 41 cycles                                   | 32 cycles                                   |
| Throughput         | 4 M                                         | 10.4 M                                      |
| α                  | 9/4                                         | 1                                           |
| $\tilde{eta}$      | 17/13                                       | 1                                           |
| Norm. throughput   | 11.95                                       | 24.82                                       |
| BER @ SNR = $9dB$  | 10-6                                        | 10-6                                        |

To cope with the differences in system configuration and implementation process technologies, a normalization process is performed when comparing the chip design performance. The normalization takes three factors into account, i.e. process technology, problem complexity and circuit complexity. For the first factor, the ratio of technology feature size to 40nm is used as the process normalization weight  $\alpha$ . This is a commonly accepted scheme to factor out the process discrepancy when comparing chip designs [20]-[22]. For the second factor, we use the multiplication of two matrices as a generic problem complexity index  $\beta$ . In hybrid precoding, this index contains two terms. The first one corresponds to the index selection, which involves the codebook matrix  $\Phi$ and the unitary matrix V. The complexity is expressed as  $L \times N_t \times N_s$ . The second one is the decomposition of V to **F**<sub>RF</sub> and **F**<sub>BB</sub>. This leads to a complexity term of  $N_t \times N_{RF} \times N_s$ . The third factor (circuit complexity)  $\gamma$  is represented by the gate count of the design, which is a process independent term. A normalized throughput  $\rho_{norm}$  is defined as

$$\rho_{norm} = \frac{\alpha \times \hat{\beta} \times \text{no. of matrices/sec}}{\gamma}, \qquad (22)$$

where  $\hat{\beta}$  is the ratio of problem complexity with respect to the proposed scheme. From Table 4, the problem complexity of design [17] is 31% higher than that of the proposed one. Design [17], however, uses more than twice the gate count of the proposed design. This is mainly attributed to the efficiency of computing the inverse of the upper triangular matrix  $\mathcal{R}$  in matrix decomposition. The proposed two-level blockwise inversion scheme can effectively reduce the computing complexity and thus the gate count. The clock rate of the proposed design is twice that of design [17] (333MHz versus 167MHz). This basically matches the speed discrepancy between 40nm and 90nm process technologies. The normalized throughput of the proposed design is roughly two times that of design [17] (11.95 versus 24.82). The BER performance comparison is also shown in Table 4. Under the simulation setting of  $N_s = 4$ ,  $N_t = 8$ , and QPSK modulation, both designs can achieve BER =  $10^{-6}$  when SNR is 9dB. Actually, if the two schemes choose the same steering vectors from the beamforming codebook, the BER performance would be identical.

### **VI. CONCLUSION**

A joint RF beamforming and digital baseband precoding design on mmWave MIMO systems with a limited number of RF chains is investigated. The core problem is finding steering matrix  $\mathbf{F}_{\mathbf{RF}}$  and digital precoding matrix  $\mathbf{F}_{\mathbf{BB}}$  for the RF network and the baseband precoder, respectively. The steering vectors of  $\mathbf{F}_{\mathbf{RF}}$  are selected from a codebook and the goal is identifying best matched vectors to optimize the performance. In this paper, a constructive process is developed to obtain the two matrices. We present a modified OMP scheme, which decouples the constructions of  $F_{RF}$  and  $F_{BB}$  in order to reduce the computing complexity and to enhance the computing parallelism. The steering matrix  $\mathbf{F}_{\mathbf{RF}}$  is computed first followed by solving a least square problem to obtain the precoding matrix  $\mathbf{F}_{\mathbf{BB}}$ . A novel computing scheme using QR decomposition and blockwise inversion techniques is introduced to obtain the least square solution efficiently. The accuracy of matrix constructions and the algorithm computing complexity were both analyzed to verify the effectiveness of the proposed scheme. The results show that the correct vector selection ratio of the proposed scheme, in spite of a significant complexity reduction, is almost identical to that obtained by using the original OMP scheme. For the system configuration used in our chip implementation, i.e.,  $(N_t, N_r, N_s, N_{RF}, L) =$ (8, 8, 4, 4, 9), the ratio is nearly 0.9. Based on the proposed scheme, a hardware accelerator design is developed based on a TSMC 40nm CLN40G technology. From the post layout simulation results, the design is capable of working at a clock rate of 333MHz with a power consumption of 267.1mW when the supply voltage is set to 0.9V. The core size is merely 0.58 mm<sup>2</sup> while the entire die size including I/O pads is 2.26 mm<sup>2</sup>. The proposed design can perform 10.4M matrix decompositions per second and has the best normalized throughput performance when compared with existing work.

#### ACKNOWLEDGMENT

The authors would like to thank Taiwan Semiconductor Research Institute for its technical supports in IC designs and simulations.

#### REFERENCES

- S. K. Yong and C.-C. Chong, "An overview of multigigabit wireless through millimeter wave technology: Potentials and technical challenges," *EURASIP J. Wireless Commun. Netw.*, vol. 2007, no. 1, 2007, Art. no. 078907.
- [2] N. Tawa, T. Kuwabara, Y. Maruta, M. Tanio, and T. Kaneko, "28 GHz downlink multi-user MIMO experimental verification using 360 element digital AAS for 5G massive MIMO," in *Proc. 48th Eur. Microw. Conf.*, Sep. 2018, pp. 934–937.
- [3] O. E. Ayach, R. W. Heath, Jr., S. Abu-Surra, S. Rajagopal, and Z. Pi, "Low complexity precoding for large millimeter wave MIMO systems," in *Proc. IEEE Int. Conf. Commun. (ICC)*, Jun. 2012, pp. 3724–3729.
- [4] O. El Ayach, S. Rajagopal, S. Abu-Surra, Z. Pi, and R. W. Heath, Jr., "Spatially sparse precoding in millimeter wave MIMO systems," *IEEE Trans. Wireless Commun.*, vol. 13, no. 3, pp. 1499–1513, Mar. 2014.
- [5] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, "Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems," *IEEE J. Sel. Topics Signal Process.*, vol. 1, no. 4, pp. 586–597, Dec. 2007.
- [6] Y. Chen and X. Zhang, "High-speed architecture for image reconstruction based on compressive sensing," in *Proc. IEEE ICASSP*, Mar. 2010, pp. 1574–1577.
- [7] S. G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," *IEEE Trans. Signal Process.*, vol. 41, no. 12, pp. 3397–3415, Dec. 1993.
- [8] B. K. Natarajan, "Sparse approximate solutions to linear systems," SIAM J. Comput., vol. 24, no. 2, pp. 227–234, 1995.
- [9] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition," in *Proc. Asilomar Conf. Signals, Syst. Comput.*, Nov. 1993, pp. 40–44.
- [10] G. M. Davis, S. G. Mallat, and Z. Zhang, "Adaptive time-frequency decompositions with matching pursuit," *Opt. Eng.*, vol. 33, pp. 1–15, 1994.
- [11] J. A. Tropp, A. C. Gilbert, and M. J. Strauss, "Simultaneous sparse approximation via greedy pursuit," in *Proc. ICASSP*, Mar. 2005, pp. v/721-v/724.
- [12] J. A. Tropp, A. C. Gilbert, and M. J. Strauss, "Algorithms for simultaneous sparse approximation. Part I: Greedy pursuit," *Signal Process.*, vol. 86, no. 3, pp. 572–588, 2006.
- [13] L. Belmerhnia, E.-H. Djermoune, and D. Brie, "Greedy methods for simultaneous sparse approximation," in *Proc. 22nd Eur. Signal Process. Conf. (EUSIPCO)*, Sep. 2014, pp. 1851–1855.
- [14] J. A. Tropp, "Algorithms for simultaneous sparse approximation. Part II: Convex relaxation," *Signal Process.*, vol. 86, no. 3, pp. 589–602, Mar. 2006.
- [15] J. A. Tropp and A. C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," *IEEE Trans. Inf. Theory*, vol. 53, no. 12, pp. 4655–4666, Dec. 2007.
- [16] G. Huang and L. Wang, "High-speed signal reconstruction with orthogonal matching pursuit via matrix inversion bypass," in *Proc. IEEE Workshop Signal Process. Syst. (SiPS)*, Oct. 2012, pp. 191–196.
- [17] Y.-Y. Lee, C.-H. Wang, and Y.-H. Huang, "A hybrid RF/baseband precoding processor based on parallel-index-selection matrix-inversion-bypass simultaneous orthogonal matching pursuit for millimeter wave MIMO systems," *IEEE Trans. Signal Process.*, vol. 63, no. 2, pp. 305–317, Jan. 2015.
- [18] A. Septimus and R. Steinberg, "Compressive sensing hardware reconstruction," in *Proc. IEEE Int. Symp. Circuits Syst.*, 2010, pp. 3116–3119.
- [19] M. Richards, Fundamentals of Radar Signal Processing. New York, NY, USA: McGraw-Hill, 2005.
- [20] C.-H. Chen, C.-R. Tsai, Y.-H. Liu, W.-L. Hung, and A.-Y. Wu, "Compressive sensing (CS) assisted low-complexity beamspace hybrid precoding for millimeter-wave MIMO systems," *IEEE Trans. Signal Process.*, vol. 65, no. 6, pp. 1412–1424, Mar. 2017.

- [21] M.-R. Li, C.-H. Yang, and Y.-L. Ueng, "A 5.28-Gb/s LDPC decoder with time-domain signal processing for IEEE 802.15.3c applications," *IEEE J. Solid-State Circuits*, vol. 52, no. 2, pp. 592–604, Feb. 2017.
- [22] T.-H. Liu, Y.-J. Chen, Y.-K. Ko, Y.-C. Lin, and Y.-S. Chu, "Hardware implementation of the compressed beamforming weights calculation for the practical wireless MIMO-OFDM communication system," *IEEE Trans. Circuits Syst., II, Exp. Briefs*, vol. 65, no. 1, pp. 46–50, Jan. 2018.



**KUAN-TING CHEN** received the B.S. and M.S. degrees from the Department of Electronic Engineering and Electrical Engineering, Hsiuping University of Science and Technology, Taichung, Taiwan, in 2009 and 2011, respectively. He is currently pursuing the Ph.D. degree with the Department of Electrical Engineering, National Chung Hsing University, Taichung, Taiwan. His research interests include massive MIMO communications, hybrid precoding/beamforming, and communication baseband IC design.



**YIN-TSUNG HWANG** (M'93) received the B.S. and M.S. degrees in electronic engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1983 and 1985, respectively, and the Ph.D. degree from the Department of Electrical and Computer Engineering, University of Wisconsin-Madison, in 1993. He became a faculty member of the Department of Electronic Engineering, National Yunlin University of Science and Technology, Taiwan. In 2004, he joined the

Department of Electrical Engineering, National Chung Hsing University, where he is currently a Professor. He has also served as the Director of the MengYao Chip Center, National Chung Hsing University, from 2007 to 2010. He has been a council member of the Taiwan Integrated Circuits Design Association, since 2014. His research interests include VLSI designs for wireless base band signal processing, beamforming and radar signal processing, image compression, and low power VLSI circuit designs.



**YEN-CHANG LIAO** received the B.S. degree from the Department of Electrical Engineering, National Chung Hsing University, Taiwan, in 2015. He was then admitted to the graduate school of National Chung Hsing University and also participated in a dual-degree program with the Toyota Technological Institute (TTI), Japan. He spent one year in Japan to fulfill the academic requirements and received the M.S. degrees from both universities, in 2018. He then joined Ilitek

Corporation, Hsinchu, Taiwan, after his graduation. He is currently a Digital IC Design Engineer.

. . .