

Received 1 October 2022, accepted 17 October 2022, date of publication 26 October 2022, date of current version 4 November 2022. Digital Object Identifier 10.1109/ACCESS.2022.3217523

### **RESEARCH ARTICLE**

## Efficient Hardware Implementation of CORDIC-Based Symbol Detector for GSM MIMO Systems: Algorithm and Hardware Architecture

# HOANG-YANG LU<sup>1</sup>, (Member, IEEE), MAO-HSU YEN<sup>2</sup>, CHE-WEI CHANG<sup>1</sup>, CHUNG-WEI CHENG<sup>1</sup>, TZU-CHING HSU<sup>1</sup>, AND YU-CHI LIN<sup>1</sup>

<sup>1</sup>Department of Electrical Engineering, National Taiwan Ocean University, Keelung 20224, Taiwan
<sup>2</sup>Department of Computer Science and Engineering, National Taiwan Ocean University, Keelung 20224, Taiwan

Corresponding author: Mao-Hsu Yen (ymh@mail.ntou.edu.tw)

This work was supported in part by the National Science Council, Taiwan, under Contract MOST 111-2221-E-019-027; and in part by the Bureau of Standards, Metrology and Inspection (BSMI), Taiwan, under Contract 1D171101112-99.

**ABSTRACT** This paper proposes a COordinate Rotation DIgital Computer (CORDIC)-based symbol detector that provides an efficient hardware architecture for generalized spatial modulation (GSM) multiple-input multiple-output (MIMO) systems. In the proposed detector, several CORDIC-based Givens rotation modules operate in parallel for the transmit antenna combinations (TACs) to perform QR-decomposition. The detector then uses adders, shifters, and backward substitution to concurrently estimate the symbols of each TAC. Finally, the estimated symbols and the corresponding TAC index with the smallest distance measurement are chosen as the final solution. To achieve efficient hardware implementation, the architecture of the proposed detector has several features, including a multiplication-free ranking mechanism, parallel CORDIC-based architectures, and shorter-bit-length multipliers. In addition, the overall architecture is pipelined to speed up processing. Computer simulations and hardware implementation are conducted for a configuration of four transmit, two active transmit, and four receive antennas. Simulation results reveal that the proposed detector performs close to the optimal maximum likelihood (ML) detector with lower computational complexity. Additionally, the VLSI implementation results under the TSMC 90-nm CMOS technology show that the proposed hardware architecture requires 266K gates (KGEs), provides a detection throughput of 2.008 Gbps, works with a pre-processing latency of 47 clock cycles, and has a hardware efficiency of 7.54 Mbps/KGE while operating at 200.8 MHz. Moreover, implementation comparisons show that the proposed architecture provides a high throughput rate and hardware efficiency, and works with low pre-processing latency.

**INDEX TERMS** Givens rotation, CORDIC, symbol detector, GSM MIMO.

#### I. INTRODUCTION

Multiple-input multiple-output (MIMO), which deploys multiple antennas at the transmitters and at the receivers, has been shown as a potential approach to improve spectral efficiency and transmission reliability [1], [2], [3], [4]. In conventional MIMO schemes, all transmit antennas are activated simultaneously to send signals. As a result, the transmission incurs inter-channel interference (ICI), which leads to performance degradation. Recently, an attractive MIMO technique, namely spatial modulation (SM), was proposed in [5], [6], [7], and [8]. In SM MIMO systems, only one of the transmit antennas is activated for transmission and hence the ICI is naturally avoided [8]. In addition, because only one antenna is activated at each symbol slot of SM MIMO systems, only one radio-frequency (RF) chain is needed at the transmitter and hence hardware cost can be significantly reduced [8]. Further, in SM MIMO systems, the information bit stream sent at each symbol slot is divided into two blocks: (1) the first block with $\log_2 M$ bits is mapped to a data symbol of M-ary quadrature-amplitude modulation (M-QAM) [9]; (2) the second block with $\log_2 N_t$ bits is mapped to a spatial symbol, which conveys the index of the active antenna. To further improve the spectral efficiency of SM, generalized spatial modulation (GSM), which allows multiple transmit antennas to be activated simultaneously, has been proposed in [10], [11], and [12]. To provide a better trade-off between spectral efficiency and ICI, GSM MIMO has in the recent decade inspired many researchers to develop suitable schemes such as symbol detectors and their corresponding hardware architectures.

The associate editor coordinating the review of this manuscript and approving it for publication was Stefan Schwarz.

In GSM MIMO systems, symbol detectors not only estimate data symbols as conventional MIMO detectors, but also need to find indices of transmit antenna combinations (TACs). Hence, symbol detection in GSM MIMO systems requires more computations than those in conventional MIMOs. Therefore, developing low-complexity symbol detectors for GSM MIMOs is important. The maximum likelihood (ML) detector is optimal but calls for prohibitive complexity [13]. As for some suboptimal detection methods, such as minimum mean square error (MMSE) and zero forcing (ZF), most of them can be applied to GSM MIMOs, but different levels of modifications are required since the estimations of data symbols and spatial symbols are both needed. Further, in [13], the authors proposed a low complexity detector for GSM MIMO systems, namely ordered block minimum mean-square error (OB-MMSE), which achieves performance close to the optimal result. However, the OB-MMSE needs to compute the inversion of some fading channel matrices for the corresponding TACs. Moreover, the amount of computing the matrix inversion depends on a preset threshold. Hence, inadequate choices for the threshold will incur the problem of extra (less) searches and eventually increase computational complexity (degrade the performance of bit error rate (BER)) [14]. To further improve the BER performance of GSM MIMO systems, some detectors have been proposed in [14], [15], [16], and [17]. For example, in [14], two efficient compressive-sensing (ECS) detectors are designed, which are based on the orthogonal matching pursuit (OMP) algorithm [15]. Simulation results show the proposed two ECS detectors achieve the near-optimal performance with a considerable complexity reduction. A new successive sphere decoding algorithm (SSDA) is proposed in [16]. The SSDA conducts an efficient sphere decoding algorithm (SDA) to do the tree search for symbol estimation. 
In [17], an ordered TAC-ordered successive interference cancellation with ML verification (O<sup>2</sup>SIC-ML) detector is proposed, which first sorts the probable activated TACs and then applies the ordered successive interference cancellation (OSIC) mechanism for each TAC while considering the sliced symbol as well as its neighbors. Afterward, an ML verification is used until the computed distance between the received signal and the product of the estimated symbol candidate and the corresponding channel matrix is smaller than or equal to a preset threshold. In fact, the threshold-based detectors, including the OB-MMSE [13], the two ECS [14], the SSDA [16], and the O<sup>2</sup>SIC-ML [17], share the problem of determining suitable threshold values. Also, different threshold values may cause different amounts of computation or different levels of performance degradation. In addition, to the best of our knowledge, the above-mentioned threshold-based detectors have not yet been investigated for physical hardware implementation.

In this paper, an efficient hardware implementation of a COordinate Rotation DIgital Computer (CORDIC) [18] based symbol detector for GSM MIMO systems is proposed. In the proposed scheme, the Givens rotation method [18] is applied for each TAC to QR-decompose the corresponding sub-block channel matrix into the product of a semi-unitary matrix and an upper triangular matrix [19]. Then, backward substitution is applied to the upper triangular matrix and the received signal to estimate the transmitted symbols of the corresponding TAC. After that, the proposed detector computes the distance between the received signal and the product of the estimated symbols and their corresponding sub-block channel matrix. Finally, the estimated symbols and the corresponding TAC with the minimum distance are chosen as the final estimated result. Specifically, the architecture of the proposed detector has four key characteristics to achieve efficient hardware implementation:

1) For ranking the detection order of the transmitted symbols, the sum of the absolute values of the channel vectors' entries is computed instead of their norms. As a result, the hardware implementation of the ranking mechanism is multiplication-free;

2) The complex CORDIC-based module is adopted to perform the Givens rotation. Moreover, the CORDIC-based modules are implemented in a parallel and pipelined architecture to speed up the detection;

3) To further reduce the computation of the above-mentioned distance, multipliers with shorter bit lengths are deployed in the proposed detector;

4) For each TAC, individual hardware mechanisms, which include the CORDIC-based module and symbol detection, are deployed in order to achieve parallel processing.

Computer simulation results reveal that the proposed detector performs close to the optimal detector while using lower computational complexity. In addition, the VLSI implementation results under the TSMC 90-nm CMOS technology show that the proposed hardware architecture requires 266K gates (KGEs), provides a detection throughput of 2.008 Gbps, works with a pre-processing latency of 47 clock cycles, and has a hardware efficiency of 7.54 Mbps/KGE while operating at 200.8 MHz. Owing to these results, in particular the high hardware efficiency and throughput, the proposed detector is an appealing candidate for current and future GSM MIMO systems.

The remaining sections of the paper are organized as follows: Section II gives a system model. The detailed demonstration of the proposed scheme and hardware architecture are provided in Section III and Section IV, respectively. Conclusions are drawn in Section V.

*Notation* - Upper-case and lower-case bold letters denote matrices and vectors, respectively. $\Re\{w\}$ and $\Im\{w\}$ are the real part and imaginary part of a complex value w. |a| denotes the absolute value of a. $|| \cdot ||$, $(\cdot)^T$, and $(\cdot)^H$ are the norm, transpose, and conjugate transpose operators, respectively [20]. $I_L$ is the $L \times L$ identity matrix. $\binom{n}{x}$ is the binomial coefficient, which represents the number of combinations of x items taken from a set of n items. $\lfloor x \rfloor$ is the floor function that outputs the greatest integer less than or equal to x. A complex Gaussian random variable with mean $\mu$ and variance $\sigma^2$ is denoted as $\mathcal{CN}(\mu, \sigma^2)$.

#### **II. SYSTEM MODEL**

In this paper, a GSM MIMO system with $N_t$ transmit and $N_r$ receive antennas is considered. Further, $N_a$ ($N_a < N_t$) denotes the number of active transmit antennas, which are selected from the $N_t$ transmit antennas and used for data transmission at each symbol slot. So, the GSM MIMO system considered has $K = 2^{\lfloor \log_2 \binom{N_t}{N_a} \rfloor}$ TACs. In addition, it is also assumed that at each symbol slot, a bit stream with length $b = b_s + b_d$ is sent from the GSM transmitter, where the $b_s = \lfloor \log_2 \binom{N_t}{N_a} \rfloor$ spatial bits are used to select a TAC $l$ with $N_a$ corresponding transmit antenna indices, $l \in L = \{1, 2, \dots, K\}$; and the $b_d$ bits, namely data bits, are mapped to $N_a$ *M*-QAM symbols, which are transmitted from the above $N_a$ active transmit antennas. Hence, in the GSM MIMO system, a bit stream with $b = \lfloor \log_2 \binom{N_t}{N_a} \rfloor + N_a \log_2 M$ bits is transmitted at each symbol slot. Furthermore, denote the $N_a$ antenna indices of TAC $l$ as $(l_1, \dots, l_{N_a})$. Then, the transmitted symbol vector in the GSM MIMO system can be expressed as $x_{(l,s)} = [\cdots, 0, s_{l_1}, 0, \cdots, 0, s_{l_2}, 0, \cdots, 0, s_{l_{N_a}}, 0, \cdots]^T$, where $s_{l_1}, \cdots, s_{l_{N_a}}$ are the $N_a$ symbols chosen from the *M*-QAM modulation set $S$ and sent from the $N_a$ active antennas of TAC $l$. In addition, it is also assumed that the channel between the transmitter and the receiver is flat fading. Then, the baseband received signal can be expressed as

$$y = Hx_{(l,s)} + n = \sum_{k=l_1}^{l_{N_a}} h_k s_k + n = H_l s + n$$
(1)

where y, H, and n are the complex received signal vector, flat fading channel matrix, and additive white Gaussian noise (AWGN) vector, respectively; $s = [s_{l_1}, \dots, s_{l_{N_a}}]^T$ is the transmitted symbol vector and $H_l$ is the corresponding sub-matrix of H for TAC $l$; and $h_k$ is the kth column of $H$. It is also assumed that the elements of the channel matrix H and of the AWGN vector *n* are distributed as $\mathcal{CN}(0, 1)$ and $\mathcal{CN}(0, \sigma_n^2)$, respectively.
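The bookkeeping of this section can be illustrated with a minimal sketch for the configuration used later in the paper ($N_t = 4$, $N_r = 4$, $N_a = 2$, 16-QAM). The choice of which $K$ combinations serve as TACs is not stated in this section, so taking the first $K$ in lexicographic order is an assumption, as are the helper name `gsm_transmit` and the zero-based antenna indices.

```python
import itertools
import math
import numpy as np

N_T, N_R, N_A, M = 4, 4, 2, 16   # antennas and QAM order of the studied setup

# Number of usable TACs: largest power of two not exceeding C(N_T, N_A).
K = 2 ** math.floor(math.log2(math.comb(N_T, N_A)))
BITS_PER_SLOT = int(math.log2(K)) + N_A * int(math.log2(M))

# Assumption: the first K combinations in lexicographic order are the TACs.
TACS = list(itertools.combinations(range(N_T), N_A))[:K]

def gsm_transmit(tac, symbols):
    """Build x_(l,s): zeros everywhere except on the active antennas."""
    x = np.zeros(N_T, dtype=complex)
    x[list(tac)] = symbols
    return x

rng = np.random.default_rng(0)
H = (rng.standard_normal((N_R, N_T))
     + 1j * rng.standard_normal((N_R, N_T))) / np.sqrt(2)   # CN(0, 1) entries
s = np.array([1 - 3j, -3 + 1j])                             # two 16-QAM symbols
x = gsm_transmit(TACS[1], s)
y_noiseless = H @ x               # equals H_l @ s, the noise-free part of (1)
```

For this configuration the sketch gives $K = 4$ TACs and 10 bits per symbol slot (2 spatial bits plus 8 data bits).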

#### **III. THE PROPOSED ALGORITHM**

To estimate the transmitted symbol vector and the index of the corresponding TAC in (1), the ML detector is to find

$$(\hat{l}, \hat{s}) = \underset{l \in L, s \in S^{N_a}}{\arg \min} ||y - H_l s||^2.$$
 (2)

The ML detector provides the optimal performance but requires huge computational complexity, which makes it impractical for hardware implementation. Using QR-factorization, (2) can be re-expressed as [16]

$$(\hat{l}, \ \hat{s}) = \underset{l \in L}{\arg\min} (-||Q_l^H y||^2 + \min_{s \in S^{N_a}} ||Q_l^H y - R_l s||^2), \quad (3)$$

where $H_l = Q_l R_l$, and $Q_l \in \mathbb{C}^{N_r \times N_a}$, $R_l \in \mathbb{C}^{N_a \times N_a}$ are the corresponding semi-unitary matrix and upper triangular matrix, respectively. The expression in (3) shows that the joint problem in (2) can be tackled by first detecting the transmitted symbols and then estimating the indices of the TACs. In addition, the rightmost term of (3) can be considered as a least-squares formulation for symbol detection. Hence, in the proposed scheme the back substitution mechanism can be used to estimate the transmitted symbol vector *s* of each TAC. After that, all TACs' estimated symbols are substituted into (3), and the symbol vector and TAC index with the minimum value in (3) are chosen as the final result. Further, to facilitate the QR-decomposition, in this section a low-complexity Givens rotation detector is proposed and described in detail as follows.
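The step from (2) to (3) rests on the identity $||y - H_l s||^2 = ||y||^2 - ||Q_l^H y||^2 + ||Q_l^H y - R_l s||^2$, where $||y||^2$ does not depend on $l$ or $s$ and can be dropped from the minimization. The following sketch checks this identity numerically. It uses the library routine `numpy.linalg.qr` as a floating-point stand-in for the Givens rotations of the proposed detector, and the random test data are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_r, n_a = 4, 2
H_l = rng.standard_normal((n_r, n_a)) + 1j * rng.standard_normal((n_r, n_a))
y = rng.standard_normal(n_r) + 1j * rng.standard_normal(n_r)
s = np.array([3 + 1j, -1 - 3j])            # an arbitrary 16-QAM symbol pair

# Reduced QR: Q_l is n_r x n_a with orthonormal columns, R_l is upper triangular.
Q_l, R_l = np.linalg.qr(H_l)
y_rot = Q_l.conj().T @ y

lhs = np.linalg.norm(y - H_l @ s) ** 2
rhs = (np.linalg.norm(y) ** 2 - np.linalg.norm(y_rot) ** 2
       + np.linalg.norm(y_rot - R_l @ s) ** 2)
```

Both sides agree up to rounding error, which is why only the two terms that appear in (3) need to be evaluated per TAC.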

#### Step 1: Sort the strengths of the channel vectors:

Because backward substitution is adopted in the proposed scheme, a ranking mechanism for the detection order is deployed to improve the detection performance. For the ranking, the 2-norms of all channel vectors are computed and sorted in descending order [21], [22], which can be expressed as

$$\{k_1, k_2, \cdots, k_{N_t}\} = sort\{||h_1||, ||h_2||, \cdots, ||h_{N_t}||\}, \quad (4)$$

where $k_i$, $i = 1, \dots, N_t$, corresponds to the *i*th strongest of the $N_t$ channel column vectors. However, computing the vector norms requires multipliers, which usually consume many clock cycles and therefore reduce system throughput. To solve this problem, we compute the 1-norm of the channel vectors instead of the 2-norm. That is, we compute the sum of the absolute values of the channel entries and sort the sums to approximate the goal of (4). In other words, the expression

$$\{k_1, k_2, \cdots, k_{N_t}\} = sort\{\sum_{i=1}^{N_r} |h_{i,1}|, \cdots, \sum_{i=1}^{N_r} |h_{i,N_t}|\}, \quad (5)$$

is computed in the proposed scheme to approximately achieve the goal of (4). Simulation results (shown in Fig. 1 and Fig. 2) reveal that the approximation in (5) for channel strength leads to only slight degradation of the detection performance, while it saves multipliers and clock cycles.
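The two ranking rules in (4) and (5) can be compared with a short sketch. The function names are illustrative. The third variant, which sums $|\Re\{h\}| + |\Im\{h\}|$, is an assumption about how a magnitude could be obtained with adders only; the text does not state how the absolute value of a complex entry is computed in hardware.

```python
import numpy as np

def rank_by_2norm(H):
    """Column indices in descending order of Euclidean norm, as in (4)."""
    order = np.argsort(-np.linalg.norm(H, axis=0), kind="stable")
    return [int(i) for i in order]

def rank_by_abs_sum(H):
    """Column indices in descending order of summed entry magnitudes, as in (5)."""
    order = np.argsort(-np.abs(H).sum(axis=0), kind="stable")
    return [int(i) for i in order]

def rank_by_re_im_sum(H):
    """Hypothetical adder-only variant: sum |Re| + |Im| over each column."""
    strength = (np.abs(H.real) + np.abs(H.imag)).sum(axis=0)
    return [int(i) for i in np.argsort(-strength, kind="stable")]

# Toy channel whose columns differ only by the scale factors 1, 4, 2, 3.
H = np.ones((4, 4), dtype=complex) * (1 + 1j) * np.array([1.0, 4.0, 2.0, 3.0])
order_2norm = rank_by_2norm(H)
order_abs = rank_by_abs_sum(H)
order_re_im = rank_by_re_im_sum(H)
```

On this toy channel all three rules return the same order. On random channels the orders can differ occasionally, which is consistent with the slight degradation reported for the approximation.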

#### Step 2: Compute the Givens rotations:

In this step, complex-valued Givens rotations are applied to QR-decompose each TAC's channel matrix. First, for TAC $l$, relocate the column vectors of $H_l$ from the rightmost column to the leftmost column according to their descending strength order in (5), so that the strongest column is placed rightmost. The relocated channel matrix is denoted as $\tilde{H}_l$. For example, assume $N_t = 4$, $N_a = 2$, and TAC $l$'s antenna index set is $\{2, 4\}$. Hence, originally $H_l = [h_2 \ h_4]$. It is also assumed that from the result of (5), the strength of $h_2$ is larger than that of $h_4$. Then, the relocated matrix is $\tilde{H}_l = [h_4 \ h_2]$. As mentioned above, the goal of relocating the channel vectors is to reduce the error propagation effect, which may be incurred by the backward substitution. Furthermore, after the relocation, apply the complex-valued Givens rotation (CGR) to QR-decompose the augmented matrix $[\tilde{H}_l \ y]$ as $[\tilde{R}_l \ \tilde{y}_l]$, where $\tilde{Q}_l$ and $\tilde{R}_l$ are, respectively, the corresponding semi-unitary matrix and upper triangular matrix for $\tilde{H}_l$; and $\tilde{y}_l = \tilde{Q}_l^H y$. The detailed expression of $[\tilde{R}_l \ \tilde{y}_l]$ is

$$\begin{bmatrix} \tilde{R}_{l} & \tilde{y}_{l} \end{bmatrix} = \begin{bmatrix} \tilde{r}_{1,1}^{l} & \tilde{r}_{1,2}^{l} & \cdots & \tilde{r}_{1,N_{a}}^{l} & \tilde{y}_{1}^{l} \\ 0 & \tilde{r}_{2,2}^{l} & \cdots & \tilde{r}_{2,N_{a}}^{l} & \tilde{y}_{2}^{l} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & \tilde{r}_{N_{a}-1,N_{a}-1}^{l} & \tilde{r}_{N_{a}-1,N_{a}}^{l} & \tilde{y}_{N_{a}-1}^{l} \\ 0 & 0 & 0 & \tilde{r}_{N_{a},N_{a}}^{l} & \tilde{y}_{N_{a}}^{l} \end{bmatrix}. \tag{6}$$

To obtain the result in (6), $((N_r - 1) + (N_r - N_a)) \times N_a/2$ CGRs must be conducted successively. For example, the CGR that converts a $2 \times 2$ complex matrix $\begin{bmatrix} a & c \\ b & d \end{bmatrix}$ to an upper triangular matrix can be written as

$$\begin{bmatrix} \cos \theta_{ab} & \sin \theta_{ab} \\ -\sin \theta_{ab} & \cos \theta_{ab} \end{bmatrix} \begin{bmatrix} e^{-j\theta_a} & 0 \\ 0 & e^{-j\theta_b} \end{bmatrix} \begin{bmatrix} a & c \\ b & d \end{bmatrix} = \begin{bmatrix} \sqrt{|a|^2 + |b|^2} & \dot{c} \\ 0 & \dot{d} \end{bmatrix}, \tag{7}$$

where $\theta_a$ and $\theta_b$ are the phase angles of the complex numbers *a* and *b*, respectively; $\theta_{ab} = \tan^{-1}(|b|/|a|)$; and $\dot{c}$ and $\dot{d}$ denote the rotated entries of the second column. To implement the rotation matrix $\begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}$ in (7), the CORDIC algorithm [24] will be used in the next section. The CORDIC uses the fact that $\tan \theta = \frac{\sin \theta}{\cos \theta}$, so that the rotation matrix can be written in the equivalent form $[(1 + \tan^2 \theta)^{\frac{-1}{2}}] \begin{bmatrix} 1 & \tan \theta \\ -\tan \theta & 1 \end{bmatrix}$. Also, instead of directly performing the rotation through the angle $\theta$, the CORDIC performs the rotation iteratively as a sequence of micro-rotations through a set of predefined small angles $\alpha_k$, where $\alpha_k = \tan^{-1}(2^{-k})$ (*i.e.* $\tan(\alpha_k) = 2^{-k}$), so that each micro-rotation needs only shifts and additions. The micro-rotations are expressed as

$$\theta = \sum_{k=1}^{n} \rho_k \alpha_k, \text{ and } \rho_k = \pm 1.$$
(8)
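The micro-rotation idea in (8) can be sketched in a few lines. This is a floating-point illustration, not the fixed-point hardware: multiplications by `2.0 ** -k` stand in for right shifts, the iteration index runs over $k = 0, \dots, n-1$ (a common CORDIC convention, assumed here so that angles up to $\pi/2$ are covered, whereas (8) indexes from 1), and all function names are illustrative. The vectoring routine records the signs $\rho_k$, and the rotation routine replays them on another vector.

```python
import math

def cordic_gain(n_iter):
    """Accumulated scale factor prod sqrt(1 + 2^(-2k)) of n_iter micro-rotations."""
    return math.prod(math.sqrt(1.0 + 4.0 ** -k) for k in range(n_iter))

def cordic_vectoring(x, y, n_iter=16):
    """Drive y to zero (x must be >= 0). Returns (magnitude, angle, signs)."""
    signs, angle = [], 0.0
    for k in range(n_iter):
        rho = -1 if y >= 0 else 1                 # direction of this micro-rotation
        x, y = x - rho * y * 2.0 ** -k, y + rho * x * 2.0 ** -k
        angle -= rho * math.atan(2.0 ** -k)       # accumulates the input vector's angle
        signs.append(rho)
    return x / cordic_gain(n_iter), angle, signs

def cordic_rotation(x, y, signs):
    """Replay stored micro-rotation signs on another vector (rotation mode)."""
    for k, rho in enumerate(signs):
        x, y = x - rho * y * 2.0 ** -k, y + rho * x * 2.0 ** -k
    gain = cordic_gain(len(signs))
    return x / gain, y / gain

mag, theta, rho_seq = cordic_vectoring(3.0, 4.0)   # vector of length 5
u, v = cordic_rotation(-4.0, 3.0, rho_seq)         # perpendicular vector, same rotation
```

With 16 micro-rotations the vectoring output approximates the length 5 and the angle $\tan^{-1}(4/3)$ closely, and the replayed rotation maps the perpendicular vector onto the second axis.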

#### Step 3: Conduct the backward substitution for symbol detection:

After conducting the QR-decomposition of Step 2, the rightmost term of (3) can be re-expressed as $\min ||\tilde{y}_l - \tilde{R}_l s||^2$, where $\tilde{y}_l = \tilde{Q}_l^H y$. Using (6), the corresponding detailed expression can be further re-written as

$$\left\| \begin{bmatrix} \tilde{y}_{1}^{l} \\ \vdots \\ \tilde{y}_{N_{a}}^{l} \end{bmatrix} - \begin{bmatrix} \tilde{r}_{1,1}^{l} & \cdots & \tilde{r}_{1,N_{a}}^{l} \\ & \ddots & \vdots \\ 0 & & \tilde{r}_{N_{a},N_{a}}^{l} \end{bmatrix} \begin{bmatrix} s_{1} \\ \vdots \\ s_{N_{a}} \end{bmatrix} \right\|^{2} \tag{9}$$

$$= |\tilde{y}_{N_{a}}^{l} - \tilde{r}_{N_{a},N_{a}}^{l} \times s_{N_{a}}|^{2} + |\tilde{y}_{N_{a}-1}^{l} - \tilde{r}_{N_{a}-1,N_{a}}^{l} \times s_{N_{a}} - \tilde{r}_{N_{a}-1,N_{a}-1}^{l} \times s_{N_{a}-1}|^{2} + \cdots \tag{10}$$

Then, using (10) and the backward substitution method, $s_{N_a}$ is first estimated, which is formulated as

$$\hat{s}_{N_a} = \underset{s_{N_a} \in S}{\arg\min} |\tilde{y}_{N_a}^l - \tilde{r}_{N_a,N_a}^l \times s_{N_a}|.$$
(11)

To find the solution of (11), setting the right-side term of (11) to zero yields the estimate $\hat{s}_{N_a} = \tilde{y}_{N_a}^l / \tilde{r}_{N_a,N_a}^l$, which requires a division. For hardware implementation of the proposed detector, we adopt shifters and adders instead of dividers. For demonstration, in the following we assume that all transmitted symbols are 16-QAM symbols, *i.e.* $s_i \in S = \{u + jv : u, v \in \{-3, -1, 1, 3\}\}$, $i = 1, \dots, N_a$. In addition, it is worth mentioning that the diagonal elements of the upper triangular matrix $\tilde{R}_l$ are all positive real numbers. Hence, instead of the 2-norm operation in (11), the 1-norm estimation is considered and expressed as

$$\Re\{\hat{s}_{N_a}\} = \arg\min_{\Re\{s_{N_a}\}\in\Re\{S\}} |\Re\{\tilde{y}_{N_a}^l\} - \tilde{r}_{N_a,N_a}^l \times \Re\{s_{N_a}\}|.$$
(12)

and

$$\Im\{\hat{s}_{N_a}\} = \arg\min_{\Im\{s_{N_a}\}\in\Im\{S\}} |\Im\{\tilde{y}_{N_a}^l\} - \tilde{r}_{N_a,N_a}^l \times \Im\{s_{N_a}\}|.$$
(13)

As a result, $\hat{s}_{N_a} = \tilde{y}_{N_a}^l / \tilde{r}_{N_a,N_a}^l$ derived from (11) can be separated as $\Re\{\hat{s}_{N_a}\} = \Re\{\tilde{y}_{N_a}^l\} / \tilde{r}_{N_a,N_a}^l$ and $\Im\{\hat{s}_{N_a}\} = \Im\{\tilde{y}_{N_a}^l\} / \tilde{r}_{N_a,N_a}^l$. Then, we estimate $\Re\{s_{N_a}\}$ by first checking whether $\Re\{\tilde{y}_{N_a}^l\}$ is positive. If $\Re\{\tilde{y}_{N_a}^l\}$ is positive (negative), the possible candidates for $\Re\{s_{N_a}\}$ are $\{1, 3\}$ ($\{-1, -3\}$). Assume that $\Re\{\tilde{y}_{N_a}^l\}$ is positive, so that the candidates for $\Re\{s_{N_a}\}$ are $\{1, 3\}$. Next, compare $\tilde{r}_{N_a,N_a}^l \times 2$ with $\Re\{\tilde{y}_{N_a}^l\}$. The final estimate $\Re\{\hat{s}_{N_a}\}$ is 3 if $\Re\{\tilde{y}_{N_a}^l\} \ge \tilde{r}_{N_a,N_a}^l \times 2$; otherwise, it is 1. Note that the factor 2 in the product $\tilde{r}_{N_a,N_a}^l \times 2$ is the decision boundary between 1 and 3. Instead of multipliers, left shifters are used in the hardware implementation to compute the product by 2. Similarly, applying the same detection mechanism yields the estimate of $\Im\{s_{N_a}\}$.
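The sign check and shift-and-compare decision described above can be written as a small sketch. Integer arguments are used so that the left shift is explicit. The function names and the integer scaling are illustrative assumptions, and the rule follows from the fact that a ratio $\Re\{\tilde{y}\}/\tilde{r}$ above the boundary 2 is closer to 3 than to 1.

```python
def slice_pam4(y_part, r_diag):
    """Division-free decision on one real or imaginary part of a 16-QAM symbol.

    y_part : integer (fixed-point) real or imaginary part of the rotated sample
    r_diag : positive integer diagonal entry of the upper triangular matrix
    """
    sign = 1 if y_part >= 0 else -1          # sign check selects {1, 3} or {-1, -3}
    magnitude = y_part if y_part >= 0 else -y_part
    boundary = r_diag << 1                   # 2 * r_diag obtained by a left shift
    level = 3 if magnitude >= boundary else 1
    return sign * level

def slice_16qam(y_re, y_im, r_diag):
    """Apply the same decision independently to the real and imaginary parts."""
    return complex(slice_pam4(y_re, r_diag), slice_pam4(y_im, r_diag))

decisions = [slice_pam4(v, 2) for v in (5, 3, -5, -1)]
symbol = slice_16qam(7, -2, 3)
```

For the inputs 5, 3, -5, and -1 with a diagonal entry of 2, the ratios are 2.5, 1.5, -2.5, and -0.5, so the decisions are 3, 1, -3, and -1 without any division.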

#### TABLE 1. Algorithm 1: Proposed Givens rotation algorithm.

| Step | Operation |
|------|-----------|
| Step 1. Sort the strengths of the channel vectors | Conduct the computation for sorting the strengths of the channel vectors by (5). |
| Step 2. Compute the Givens rotations for all TACs' channel matrices | For TAC $l, l = 1, 2, \cdots, K$, compute the Givens rotations by (7) for the augmented matrix $[\mathbf{H}_l \ \mathbf{y}_l]$ to find the corresponding matrices, as shown in (6). |
| Step 3. Conduct the backward substitution for symbol detection | For TAC $l, l = 1, 2, \cdots, K$, conduct the backward substitution in (11) and (14) to find the estimation $(\hat{s}_{l_1},\cdots,\hat{s}_{l_{N_a}})$. |
| Step 4. Compute the distances and choose the final solution | (1) Compute the distances in (15) for all TACs' symbol estimation. (2) Choose the estimation of TAC index and symbols with minimum distance as the final solution. |

Next, substitute  $\hat{s}_{N_a}$  into the second term of (10) and formulate the estimation for  $s_{N_a-1}$  as

$$\hat{s}_{N_a-1} = \operatorname*{arg\,min}_{s_{N_a-1}\in S} |\ddot{y}_{N_a-1}^l - \tilde{r}_{N_a-1,N_a-1}^l \times s_{N_a-1}|, \quad (14)$$

where we denote  $\ddot{y}_{N_a-1}^l = \tilde{y}_{N_a-1}^l - \tilde{r}_{N_a-1,N_a}^l \times \hat{s}_{N_a}$ . Similarly, the estimation of  $s_{N_a-1}$  can be found by using the above estimation method for  $s_{N_a}$  in (11). Then, repeat the above estimation method to find  $\hat{s}_{N_a-2}, \dots, \hat{s}_1$ . Note that  $\{\hat{s}_1, \dots, \hat{s}_{N_a}\}$  is the estimation result for TAC *l*. It is worth mentioning that for each TAC, we deploy their individual hardware mechanisms (described in the next section), including the CORDIC-based module for the Givens rotation and symbol detection, to achieve the benefits of parallel processing.
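The recursion of (11) and (14) can be sketched as a loop that detects the last symbol first and cancels every detected symbol from the rows above it. The sketch works in floating point and uses an ordinary multiplication by 2 where the hardware would shift. The function names and the small test matrix are illustrative assumptions.

```python
import numpy as np

def slice_pam4(y_part, r_diag):
    """Nearest level in {-3, -1, 1, 3} to y_part / r_diag, found without dividing."""
    level = 3 if abs(y_part) >= 2 * r_diag else 1
    return level if y_part >= 0 else -level

def back_substitute(R, y_rot):
    """Detect 16-QAM symbols from y_rot ~ R s, starting with the last symbol."""
    n = R.shape[0]
    s_hat = np.zeros(n, dtype=complex)
    for i in range(n - 1, -1, -1):
        # Cancel the symbols that are already detected, as in the term after (14).
        residual = y_rot[i] - R[i, i + 1:] @ s_hat[i + 1:]
        r_ii = R[i, i].real                  # diagonal is real and positive
        s_hat[i] = complex(slice_pam4(residual.real, r_ii),
                           slice_pam4(residual.imag, r_ii))
    return s_hat

R = np.array([[2.0, 0.5 + 0.5j], [0.0, 1.5]], dtype=complex)
s_true = np.array([1 - 3j, -3 + 1j])
s_hat = back_substitute(R, R @ s_true)       # noise-free input
```

In the noise-free case the loop recovers the transmitted pair exactly. With noise, an error in the first decision propagates upward, which is the reason for the ranking in Step 1.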

#### Step 4: Compute the distances and choose the final solution:

In this step, the estimated symbol vectors of all TACs obtained at Step 3 are substituted into the following distance measurements,

$$\eta_l = -||\tilde{y}_l||^2 + ||\tilde{y}_l - \tilde{R}_l \hat{s}_l||^2, \tag{15}$$

where $l = 1, 2, \dots, K$. Finally, the estimated symbol vector and its corresponding TAC index whose distance in (15) is the minimum are chosen as the final solution. Note that the proposed scheme is a Givens rotation-based approach to QR decomposition and uses the backward substitution mechanism for symbol estimation. To avoid the drawbacks of the above-mentioned threshold-based detectors, no threshold setting is required in the proposed scheme, and hence the estimations of the individual TACs are independent. In particular, the proposed scheme is division-free and the estimation modules for all TACs operate in parallel. Hence, the goal of efficient hardware implementation can be achieved. For ease of reading, the detailed flow of the proposed algorithm is shown in Table 1.
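The four steps can be tied together in a floating-point reference sketch. It is not the proposed hardware: `numpy.linalg.qr` replaces the CORDIC-based Givens rotations, the TAC set (first $K = 4$ pairs in lexicographic order) is an assumption, antenna indices are zero-based, and the name `detect` is illustrative. The diagonal of the triangular factor is normalized to be real and positive so that the slicing rule of Step 3 applies.

```python
import itertools
import numpy as np

def slice_pam4(v, r):
    level = 3 if abs(v) >= 2 * r else 1
    return level if v >= 0 else -level

def detect(H, y, tacs):
    """Floating-point reference of Steps 1-4; returns (TAC, symbol per antenna)."""
    strength = np.abs(H).sum(axis=0)                       # Step 1, eq. (5)
    best = None
    for tac in tacs:
        cols = sorted(tac, key=lambda t: strength[t])      # strongest column last
        Q, R = np.linalg.qr(H[:, cols])                    # Step 2 (stand-in for CGRs)
        phase = np.diag(R) / np.abs(np.diag(R))
        Q, R = Q * phase, phase.conj()[:, None] * R        # make diag(R) real, positive
        y_rot = Q.conj().T @ y
        s_hat = np.zeros(len(cols), dtype=complex)         # Step 3
        for i in range(len(cols) - 1, -1, -1):
            res = y_rot[i] - R[i, i + 1:] @ s_hat[i + 1:]
            r_ii = R[i, i].real
            s_hat[i] = complex(slice_pam4(res.real, r_ii),
                               slice_pam4(res.imag, r_ii))
        eta = (-np.linalg.norm(y_rot) ** 2                 # Step 4, eq. (15)
               + np.linalg.norm(y_rot - R @ s_hat) ** 2)
        if best is None or eta < best[0]:
            best = (eta, tac, dict(zip(cols, s_hat)))
    return best[1], best[2]

rng = np.random.default_rng(7)
H = (rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))) / np.sqrt(2)
tacs = list(itertools.combinations(range(4), 2))[:4]       # assumed TAC set
tac_true, s_true = tacs[2], {0: 3 - 1j, 3: -1 + 3j}
y = sum(H[:, t] * s for t, s in s_true.items())            # noise-free received vector
tac_hat, s_hat = detect(H, y, tacs)
```

Without noise, the true TAC attains the smallest possible distance $-||y||^2$, so the sketch returns the transmitted TAC and symbols. Since every TAC is processed independently inside the loop, the loop body corresponds to what the hardware replicates per TAC.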

To verify the proposed algorithm, computer simulations with the settings $N_t = 4$, $N_r = 4$, $N_a = 2$, and 16-QAM symbol modulation are conducted, and the results are shown in Figs. 1-3. Fig. 1 shows the BER performance versus the number of iterations, which is used to determine the number of iterations for the CORDIC modules of the proposed Givens rotation detector. This number of iterations is the number of micro-rotations of the CORDIC modules in (8). From Fig. 1, we can observe that the proposed detector with 14- or 16-bit fixed-point numerical representations converges after about 6 iterations. Moreover, the BER performance under the 16-bit fixed-point representation is better than that under the 14-bit fixed-point representation, regardless of whether the 2-norm measurement in (4) or the 1-norm measurement in (5) is used. To provide more insight into the performance degradation caused by fixed-point numerical representation, the BER performance versus SNR for the floating-point and fixed-point representations is compared in Fig. 2. From Fig. 2, it can be observed that the performance of the proposed detector with every fixed-point setting is slightly worse than that with floating point. In particular, under the 16-bit fixed-point representation, the BER performance of the proposed detector using the 2-norm or the 1-norm is very close. Hence, for the proposed detector, 16-bit fixed-point numerical representation, 1-norm measurement for channel strength, and 6 micro-rotations for each CORDIC module will be adopted in the following simulations and architecture design.

Next, the performance comparisons between the proposed detector and the existing methods are provided in Fig. 3. The existing methods include the OB-MMSE [13] and the SSDA [16]. The performance of the ML method is also provided as the performance benchmark. From Fig. 3, we can observe that the SSDA [16] indeed achieves the optimal ML performance, but its results are all obtained with floating-point numerical representation. It is worth mentioning that, due to hardware cost, floating-point numerical representation is seldom used in hardware implementation. In contrast, the proposed detector with 16-bit fixed-point numerical representation, 1-norm channel strength measurement, and 6 micro-rotations outperforms the OB-MMSE [13] and achieves near-ML performance. Therefore, the settings supported by the above simulation results, namely 16-bit fixed-point numerical representation, 1-norm channel strength measurement, and 6 micro-rotations, are adopted for the design of the proposed detector. The detailed description of the proposed hardware architecture is given in the next section.

**FIGURE 1.** Comparisons of 14- and 16-bit-fixed-point numerical representations at SNR = 16 dB.

FIGURE 2. Comparisons of bit error rate performance.

FIGURE 3. BER comparisons of different algorithms.

#### **IV. HARDWARE ARCHITECTURE**

In this section, a CORDIC-based hardware architecture, as depicted in Fig. 4, is proposed to implement the proposed algorithm in Table 1. As in the simulation settings of Fig. 1, the proposed architecture is for the GSM MIMO system with $N_t = 4$, $N_r = 4$, $N_a = 2$, and 16-QAM symbol modulation. The proposed architecture can be readily extended to other settings. Furthermore, 16-bit signed-magnitude notation is used to represent each real number in the proposed architecture. In addition, the proposed hardware is implemented as a pipelined architecture. Fig. 4 shows the overall modules of the hardware architecture, including: (1) channel-gain adder and strength sorter, (2) CORDIC-based VM-H, CORDIC-based RM-y, and memory, (3) symbol detection, and (4) distance computation and final-solution selection. The modules are discussed in detail as follows.

FIGURE 4. Hardware architecture.

#### A. CHANNEL-GAIN ADDER AND STRENGTH SORTER

To implement Step 1 of the proposed algorithm, the channel-gain adder and strength sorter, as depicted in Fig. 4, are deployed at the beginning of the proposed architecture. The channel-gain adder computes $\sum_{i=1}^{N_r} |h_{i,t}|, t = 1, \cdots, N_t$, to find the strength of the corresponding channel column vectors. Then, the strength sorter ranks the $N_t$ channel strengths and concurrently re-combines the antenna indices of all TACs. For example, suppose the original antenna indices of TAC $l$ are (1, 3). From the result of the adder and the strength sorter, the re-combination result of TAC $l$ becomes (3, 1) if the channel strength of antenna 3 is larger than that of antenna 1. Note that the 1-norm is used to approximate the channel strength in the proposed algorithm. Moreover, in the proposed algorithm and hardware architecture, signed-magnitude notation is used to represent each real number. Hence, instead of multipliers, only adders are needed to compute the channel strengths in the proposed architecture. As a result, the proposed algorithm and architecture have low complexity and are efficient.
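The reason signed-magnitude notation removes the need for multipliers or negation in this module can be seen from a small sketch: the absolute value of a signed-magnitude word is obtained by dropping its sign bit. The 16-bit word length follows the text. The integer scaling, the treatment of each real and imaginary part as a separate word, and the function names are assumptions made for illustration.

```python
WORD_BITS = 16
SIGN_MASK = 1 << (WORD_BITS - 1)      # most significant bit holds the sign
MAG_MASK = SIGN_MASK - 1              # remaining 15 bits hold the magnitude

def to_signed_magnitude(value):
    """Encode an integer in [-32767, 32767] as a 16-bit signed-magnitude word."""
    if abs(value) > MAG_MASK:
        raise ValueError("value does not fit in 15 magnitude bits")
    return (SIGN_MASK if value < 0 else 0) | abs(value)

def magnitude(word):
    """Absolute value: clear the sign bit, no negation or multiplier needed."""
    return word & MAG_MASK

def column_strength(words):
    """Channel-gain adder: accumulate the magnitudes of one column's words."""
    return sum(magnitude(w) for w in words)

column = [to_signed_magnitude(v) for v in (-5, 3, -7, 2)]
strength = column_strength(column)
```

For the four example words the adder returns 17, and the encoded form of -5 is `0x8005`, which differs from the encoding of 5 only in the sign bit.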

#### B. CORDIC-BASED VM-H AND RM-Y, AND MEMORY

To implement Step 2 of the proposed algorithm in Table 1, we assume that the fading channel H remains fixed within a block and changes block by block. Note that a block can be one or several packets. Hence, the QR-decomposition of the fading channel H is only conducted at the beginning of each block. Further, to achieve (6), the overall CORDIC-based Givens rotation of the proposed algorithm can be separated into a pre-processing stage and a payload stage. For the pre-processing stage, the CORDIC-based vectoring mode (VM)-H module is built to perform the QR-decomposition in order to find the upper triangular matrices $\tilde{R}_l$, $l = 1, \dots, K$, in (6). At the same time, the corresponding rotation angles created by the CORDIC-based VM-H module are stored in the memory [23]. The angles are used again in the following payload stage, *i.e.* the CORDIC-based rotation mode (RM)-y module, to find the corresponding received signal vectors $\tilde{y}_l$, $l = 1, \dots, K$. Note that CORDIC can be used in two operating modes, vectoring and rotation [24]. In the VM, a vector, e.g. $\begin{bmatrix} a \\ b \end{bmatrix}$, is rotated to $\begin{bmatrix} \sqrt{a^2 + b^2} \\ 0 \end{bmatrix}$. In the RM, a vector *v* is rotated by an angle $\theta$. The architectures of the CORDIC-based VM-**H** and the CORDIC-based RM-**y** are described in detail as follows.



FIGURE 5. CORDIC-based VM-H.



FIGURE 6. Architecture of the CGR.

**CORDIC-based VM-H:** Fig. 5 shows the architecture of the CORDIC-based VM-H, which consists of 2 CGR modules and 3 reduced-CGR (R-CGR) modules. The architecture of the CGR is shown in Fig. 6; it performs the computation in (7). In Fig. 5, each CGR also saves its three corresponding angles $\{\theta_a, \theta_b, \theta_{ab}\}$ of (7) in the memory. The R-CGRs are used when the input a in (7) is positive. Each R-CGR saves its two corresponding angles $\{\theta_b, \theta_{ab}\}$ of (7) in the memory module. Furthermore, each CGR or R-CGR is realized with 6 micro-rotations of (8). Additionally, registers (*i.e.* D-type flip-flops (DFFs)) are inserted into each of the CGRs and the R-CGRs to pipeline the architecture. Finally, the angles saved in the memory are used by the following RM-y module to rotate the received signal and find $\tilde{y}_l$.
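To make the vectoring operation concrete, the following is a minimal floating-point sketch of a textbook CORDIC in vectoring mode with 6 micro-rotations. It is not the authors' fixed-point circuit: equation (8) is not reproduced here, the multiplications by `2.0 ** -i` stand in for hardware right shifts, and keeping the direction bits as the stored "angle" is an assumption made for illustration.

```python
import math

N_ITER = 6  # number of micro-rotations stated in the text

def cordic_gain(n_iter=N_ITER):
    """Constant scale factor accumulated by n_iter micro-rotations."""
    return math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(n_iter))

def cordic_vectoring(x, y, n_iter=N_ITER):
    """Drive y toward 0 for x >= 0; return (magnitude, residual y, directions)."""
    dirs = []
    for i in range(n_iter):
        d = -1 if y >= 0 else 1          # rotate toward the x-axis
        # 2.0 ** -i models a right shift by i bits in fixed-point hardware
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        dirs.append(d)                   # assumed form of the stored angle
    k = cordic_gain(n_iter)
    return x / k, y / k, dirs

mag, resid, dirs = cordic_vectoring(3.0, 4.0)
print(mag, resid, dirs)
```

With only 6 micro-rotations the result is approximate: the magnitude of the vector (3, 4) comes out close to 5 and a small residual remains in the second component, which is the accuracy trade-off governed by the iteration number.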

Further, for ease of demonstration, choose TAC 1 as the example and assume its correspondingly relocated matrix  $\tilde{H}_1 = [h_1 \ h_2]$  is

$$\begin{bmatrix} h_{1,1} & h_{1,2} \\ h_{2,1} & h_{2,2} \\ h_{3,1} & h_{3,2} \\ h_{4,1} & h_{4,2} \end{bmatrix}.$$

In detail, using the elements of $\tilde{H}_1$ as example inputs, Fig. 5 shows the relationships among the modules of the VM-*H*. Fig. 5 also shows the outputs of the corresponding upper triangular matrix $\tilde{R}_l$, which are $\tilde{r}_{1,1}^l, \tilde{r}_{1,2}^l, 0$, and $\tilde{r}_{2,2}^l$. In addition, the VM-*H* outputs 12 angles in total (three from each of the 2 CGRs and two from each of the 3 R-CGRs), which are saved in the memory module for the following RM-y module. It is worth mentioning that, in the practical implementation, the VM-H in Fig. 5 requires 47 clock cycles in total, instead of the 51 that would be needed if the five modules were executed successively. These 47 clock cycles of the VM-H also constitute the latency of the proposed architecture. Furthermore, the architecture in Fig. 5 is easily extended to $N_a > 2$.
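The module and angle counts can be checked with a small sketch of a Givens-rotation QR-decomposition of a $4 \times 2$ complex matrix. It assumes that (7) splits each complex Givens rotation into two phase rotations ($\theta_a$, $\theta_b$) followed by one real rotation ($\theta_{ab}$), which is consistent with the angle names in the text but is not taken from (7) itself. Exact floating-point angles replace the CORDIC micro-rotations, and the schedule and matrix values are hypothetical.

```python
import cmath
import math

def complex_givens(M, p, q, col, reduced):
    """Null M[q][col] against M[p][col]; return the angles that would be stored."""
    a, b = M[p][col], M[q][col]
    th_a = 0.0 if reduced else cmath.phase(a)   # R-CGR: a is already real, positive
    th_b = cmath.phase(b)
    th_ab = math.atan2(abs(b), abs(a))
    ea, eb = cmath.exp(-1j * th_a), cmath.exp(-1j * th_b)
    c, s = math.cos(th_ab), math.sin(th_ab)
    row_p = [ea * x for x in M[p]]              # phase rotation makes a real
    row_q = [eb * x for x in M[q]]              # phase rotation makes b real
    M[p] = [c * x + s * y for x, y in zip(row_p, row_q)]
    M[q] = [-s * x + c * y for x, y in zip(row_p, row_q)]
    return (th_b, th_ab) if reduced else (th_a, th_b, th_ab)

# (pivot row, nulled row, column, reduced?): one schedule with 2 CGRs + 3 R-CGRs
SCHEDULE = [(0, 1, 0, False), (0, 2, 0, True), (0, 3, 0, True),
            (1, 2, 1, False), (1, 3, 1, True)]

def vm_h(H):
    M = [[complex(x) for x in row] for row in H]
    angles = [complex_givens(M, p, q, col, red) for p, q, col, red in SCHEDULE]
    return M, angles

# Hypothetical relocated channel matrix for one TAC
H1 = [[1 + 2j, 0.5 - 1j], [-1 + 0.5j, 2 + 1j],
      [0.3 - 0.7j, -1.5 + 0.2j], [2 - 1j, 0.4 + 0.9j]]
R1, stored = vm_h(H1)
print(sum(len(a) for a in stored), "angles stored")
```

Under this schedule the first rotation in each column makes the pivot real and positive, so the remaining rotations in that column can skip $\theta_a$, giving $2 \times 3 + 3 \times 2 = 12$ stored angles.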

**CORDIC-based RM-y:** For the hardware implementation of the payload stage, the CORDIC-based RM-y is constructed. The CORDIC-based RM-y first loads the angles of all TACs from the memory module and then performs the 6 micro-rotations of (8) to iteratively rotate the received signal vector y for each TAC, yielding the corresponding vectors $\tilde{y}_l$, $l = 1, \dots, K$, in (6). Note that the number of micro-rotations is the iteration number of the CORDIC; based on the results in Fig. 1, it is set to 6.
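The rotation mode can be sketched in the same spirit. The example below replays a stored sequence of micro-rotation directions on a new vector, which is one common way of reusing an angle found in vectoring mode. The storage format (one direction per micro-rotation) and the memory content shown are assumptions, and floating-point arithmetic stands in for the fixed-point shifts and additions.

```python
import math

def cordic_rotate(x, y, dirs):
    """Replay stored micro-rotation directions on (x, y) and undo the CORDIC gain."""
    for i, d in enumerate(dirs):
        # 2.0 ** -i models a right shift by i bits in fixed-point hardware
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
    gain = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(len(dirs)))
    return x / gain, y / gain

stored_dirs = [-1, -1, 1, 1, -1, 1]   # hypothetical memory content for one angle
theta = sum(d * math.atan(2.0 ** -i) for i, d in enumerate(stored_dirs))

xr, yr = cordic_rotate(4.0, -3.0, stored_dirs)
print(math.degrees(theta), xr, yr)
```

Because the directions are replayed rather than recomputed, the payload vector is rotated by exactly the angle that the pre-processing stage realized, including its quantization to 6 micro-rotations.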

#### C. SYMBOL DETECTION

To facilitate Step 3 of the proposed algorithm in Table 1, the hardware architecture of symbol detection is proposed in Fig. 7. It contains two 16-QAM slicing modules and one backward substitution module. In Fig. 7, the left and the right slicing modules compute (11) and (14), respectively. Further, the backward substitution module calculates $\ddot{y}_{N_a-1}^l = \tilde{y}_{N_a-1}^l - \tilde{r}_{N_a-1,N_a}^l \times \hat{s}_{N_a}$ in (14). In addition, registers (*i.e.* DFFs) are deployed in Fig. 7 for pipelined implementation. Note that only adders and shifters are used in the two slicing modules, in order to avoid multipliers. The processing of the two slicing modules, which use only adders and shifters, has been described in detail at Step 3 of the proposed algorithm in Section III.
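The data flow of Fig. 7 can be sketched as follows for $N_a = 2$. The slicer shown here is only one possible multiplier-free realization, assumed for illustration: it presumes unnormalized 16-QAM levels $\{\pm 1, \pm 3\}$ and a real, positive diagonal entry $r$, and it decides the level by comparing $|y|$ with $2r$ (a one-bit left shift of $r$) instead of dividing by $r$. The actual slicing rule is the one given at Step 3 in Section III.

```python
def slice_pam4(y, r):
    """Nearest of {-3, -1, 1, 3} to y / r for r > 0, without dividing by r."""
    level = 1 if abs(y) < 2 * r else 3   # 2 * r is a one-bit left shift of r
    return level if y >= 0 else -level

def slice_16qam(y, r):
    """Slice the real and imaginary parts independently."""
    return complex(slice_pam4(y.real, r), slice_pam4(y.imag, r))

def detect_two_symbols(y1, y2, r11, r12, r22):
    """Detect the last layer first, cancel it, then detect the first layer."""
    s2 = slice_16qam(y2, r22)            # slicing of the last layer, cf. (11)
    y1_bar = y1 - r12 * s2               # backward substitution, cf. (14)
    s1 = slice_16qam(y1_bar, r11)
    return s1, s2

# Hypothetical triangular entries and transmitted symbols
r11, r12, r22 = 2.0, 0.5 - 0.25j, 1.5
s1_true, s2_true = 3 - 1j, -1 + 3j
y1 = r11 * s1_true + r12 * s2_true + 0.1      # small perturbation as noise
y2 = r22 * s2_true - 0.1j
print(detect_two_symbols(y1, y2, r11, r12, r22))
```

The single complex product `r12 * s2` in this sketch is the multiplication that the text assigns to the backward substitution module.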



FIGURE 7. Symbol detection.

FIGURE 8. Backward substitution.

FIGURE 9. Subblocks of the backward substitution: (a) PTCV and (b) PTRN.

FIGURE 10. Distance computation.

|                                            | Proposed  | SA-SrFSD [25] | MLD [27]  |
|--------------------------------------------|-----------|---------------|-----------|
| Antenna Configuration (Nt, Na, Nr)         | (4, 2, 4) | (5, 2, 4)     | (8, 1, 4) |
| Number of TACs                             | 4         | 8             | 8         |
| Modulation                                 | 16-QAM    | 64-QAM        | 64-QAM    |
| Number of transmitted bits per symbol slot | 10        | 15            | 9         |
| Payload processing cycles                  | 1         | 3             | 8         |
| CMOS Technology                            | 90 nm     | 90 nm         | 90 nm     |
| Gate Count (KGEs)                          | 266       | 276.7         | 70.5      |
| Frequency (MHz)                            | 200.8     | 322.6         | 384.6     |
| Throughput (Mbps)                          | 2008      | 1613          | 432.68    |
| Hardware Efficiency (Mbps/KGEs)            | 7.54      | 5.83          | 6.14      |

TABLE 2. Implementation results of the proposed architecture and other architectures for SM and GSM MIMOs.

$$\text{Throughput} = \frac{\text{Technology}}{90\,\text{nm}} \times \text{Frequency} \times \frac{\text{bits per GSM mapping}}{\text{Processing Cycles}} \quad [27]$$

$$\text{Hardware Efficiency} = \frac{\text{Throughput}}{\text{Gate Count}} \quad [25]$$

As for the backward substitution in Fig. 7, the detailed architecture is shown in Fig. 8. First, finding $\ddot{y}_{N_a-1}^l$ in (14) requires calculating $\tilde{r}_{N_a-1,N_a}^l \times \hat{s}_{N_a}$. In addition, computing $R_l s_l$ in (15) for Step 4 of the proposed algorithm also needs multiplication. Hence, a general module for the product of two complex vectors (PTCV) with dimension $2 \times 1$ is proposed in Fig. 9. To further explain the PTCV operation, consider two complex vectors, $u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$ and $v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}$. The PTCV operation is defined as $u^T v = u_1 \times v_1 + u_2 \times v_2$. Note that, to implement the PTCV operation, 8 modules, each computing the product of two real numbers (PTRN), are deployed in Fig. 9(a). The PTRN architecture is shown in Fig. 9(b). Specifically, in the PTRN module, shifters and adders are used to approximate the multiplication of two real numbers. Moreover, because computing $\tilde{r}_{N_a-1,N_a}^l \times \hat{s}_{N_a}$ is only the product of two complex numbers, four inputs of the PTCV module in Fig. 9 are set to zero.
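The PTCV and PTRN operations can be sketched as follows. This is an illustration under stated assumptions, not the circuit of Fig. 9: operands are modeled as Python integers standing for fixed-point words, the sign is handled separately from the magnitude (in the spirit of the signed-magnitude notation), and the shift-and-add loop is exact, whereas the text describes the hardware PTRN as an approximation.

```python
def ptrn(a, b):
    """Product of two integers using only shifts and additions."""
    sign = -1 if (a < 0) != (b < 0) else 1
    a, b = abs(a), abs(b)
    acc, shift = 0, 0
    while b:
        if b & 1:                 # add a shifted copy of a for each set bit of b
            acc += a << shift
        b >>= 1
        shift += 1
    return sign * acc

def ptcv(u, v):
    """u^T v for two length-2 vectors of (re, im) pairs, using 8 PTRN calls."""
    re = im = 0
    for (ur, ui), (vr, vi) in zip(u, v):
        re += ptrn(ur, vr) - ptrn(ui, vi)
        im += ptrn(ur, vi) + ptrn(ui, vr)
    return re, im

# Full 2 x 1 product: (3 - 2j)(5 + 1j) + (1 + 4j)(-2 + 3j)
print(ptcv([(3, -2), (1, 4)], [(5, 1), (-2, 3)]))
# Single complex product: the second entries (four real inputs) are zero
print(ptcv([(3, -2), (0, 0)], [(5, 1), (0, 0)]))
```

Each complex product needs four real products, so a $2 \times 1$ PTCV uses 8 PTRNs; zeroing the second entry of both vectors reduces it to one complex product, as used in the backward substitution.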

#### D. DISTANCE COMPUTATION AND CHOOSING THE FINAL SOLUTION

In Fig. 10, the architecture for the distance computation in (15) is proposed. In this architecture, two PTCVs, identical to that of the backward substitution in Fig. 8, are used to compute $R_l s_l$ in (15). In addition, 8 multipliers are utilized to compute the norm of a complex vector with dimension $4 \times 1$. Note that the dimension of the complex vector is determined by the number of receive antennas. Moreover, based on computer simulations, the 8 multipliers use 12-bit rather than 16-bit fixed-point operands, in order to reduce the computational complexity. Finally, a sorter is used to find the minimum distance and output the estimated symbols and the index of the corresponding TAC.
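The selection step can be summarized by a short sketch. It assumes that the distance in (15) is the squared Euclidean norm of $\tilde{y}_l - R_l s_l$, with $R_l$ stored as a $4 \times 2$ matrix whose last two rows are zero; floating-point arithmetic replaces the 12-bit fixed-point multipliers, and all numerical values are hypothetical.

```python
def squared_distance(y, R, s):
    """||y - R s||^2 for a length-4 complex y, a 4 x 2 matrix R, and a length-2 s."""
    total = 0.0
    for y_i, row in zip(y, R):
        e = y_i - (row[0] * s[0] + row[1] * s[1])
        total += e.real * e.real + e.imag * e.imag   # 2 real squares per entry
    return total

def choose_solution(y_tilde, R, s_hat):
    """Return (TAC index, symbols, all distances) for the minimum distance."""
    dists = [squared_distance(y, Rl, s) for y, Rl, s in zip(y_tilde, R, s_hat)]
    best = min(range(len(dists)), key=dists.__getitem__)
    return best, s_hat[best], dists

# Two hypothetical TAC candidates sharing the same triangular factor
R_a = [[2, 1 + 1j], [0, 1.5], [0, 0], [0, 0]]
y_tilde = [[0.1 + 2j, -4.5 - 1.5j, 0.1j, 0], [3 + 2j, 1.5 - 1.5j, 1, 0]]
s_hat = [[1 + 3j, -3 - 1j], [1 + 1j, 1 - 1j]]

best, symbols, dists = choose_solution(y_tilde, [R_a, R_a], s_hat)
print(best, symbols, dists)
```

The four complex residual entries contribute two real squares each, which matches the count of 8 multipliers for a $4 \times 1$ complex vector.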

Finally, the proposed hardware architecture is synthesized and implemented in the TSMC 90-nm CMOS technology. The implementation results show that the hardware architecture requires 266 KGEs, provides a throughput of 2.008 Gbps, works with a pre-processing latency of 47 clock cycles, and has a hardware efficiency of 7.54 (Mbps/KGEs) while operating at 200.8 MHz. Furthermore, the implementation comparisons in Table 2 show that the proposed architecture provides a high throughput as well as a high hardware efficiency, and needs only 47 clock cycles for the pre-processing stage. The small latency of the pre-processing stage implies that only a small buffer is needed to hold the incoming received signal vectors for the payload stage of the proposed hardware.
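The figures in Table 2 can be reproduced from the two formulas given with the table. The sketch below uses the values listed in Table 2; the relation for the number of bits per mapping, $\log_2 K + N_a \log_2 M$ for $K$ TACs and $M$-QAM, is the standard GSM relation and is used here because it agrees with the bit counts in the table.

```python
import math

def bits_per_mapping(n_tacs, n_active, qam_order):
    """TAC index bits plus the bits carried by the active antennas."""
    return int(math.log2(n_tacs)) + n_active * int(math.log2(qam_order))

def throughput_mbps(freq_mhz, bits, cycles, tech_nm=90):
    """Technology-normalized throughput, following the formula under Table 2."""
    return (tech_nm / 90) * freq_mhz * bits / cycles

def hardware_efficiency(tp_mbps, gate_count_kge):
    return tp_mbps / gate_count_kge

# (frequency MHz, TACs, N_a, QAM order, payload cycles, gate count KGE)
designs = {
    "Proposed": (200.8, 4, 2, 16, 1, 266.0),
    "SA-SrFSD [25]": (322.6, 8, 2, 64, 3, 276.7),
    "MLD [27]": (384.6, 8, 1, 64, 8, 70.5),
}
for name, (f, k, na, m, cyc, kge) in designs.items():
    bits = bits_per_mapping(k, na, m)
    tp = throughput_mbps(f, bits, cyc)
    print(f"{name}: {bits} bits, {tp:.2f} Mbps, "
          f"{hardware_efficiency(tp, kge):.3f} Mbps/KGE")
```

For the proposed design, $200.8 \times 10 / 1 = 2008$ Mbps and $2008 / 266 \approx 7.549$ Mbps/KGE; the value 7.54 listed in Table 2 corresponds to this quotient truncated to two decimals.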

#### **V. CONCLUDING REMARKS**

Efficient algorithms for SM MIMO and GSM MIMO systems have been presented in existing works [1], [2], [3], [4], [5], [6], [7], [8]. However, the hardware implementation of low-complexity detectors for these systems remains a major challenge. In this paper, we have proposed a CORDIC-based symbol detector for GSM MIMO systems, aiming at efficient hardware implementation. The proposed detector uses the CORDIC-based Givens rotation to QR-decompose the corresponding channel matrix for each TAC. Then, it adopts backward substitution and simple hardware such as adders and shifters to estimate the transmitted symbols and the corresponding TAC index. Finally, the detector chooses the estimated results with the minimum distance as the final solution. In particular, the efficient hardware features of the proposed detector include the multiplication-free ranking mechanism, the parallel CORDIC-based architecture, and shorter-bit-length multipliers. Computer simulations and hardware implementation are conducted under the antenna configuration $(N_t, N_a, N_r) = (4, 2, 4)$. The results of the computer simulations and the VLSI implementation in the TSMC 90-nm CMOS technology show that the proposed detector performs close to the optimal ML detector and provides a high hardware efficiency of 7.54 (Mbps/KGEs) and a throughput of 2.008 Gbps while operating at 200.8 MHz. In GSM MIMO systems, different antenna configurations lead to different symbol-detection performance. Extending the proposed architecture to other configurations is challenging but possible.

#### ACKNOWLEDGMENT

The authors would like to thank the Taiwan Semiconductor Research Institute (TSRI) of Taiwan for providing technical assistance.

#### REFERENCES

- [1] L. Lu, G. Y. Li, A. L. Swindlehurst, A. Ashikhmin, and R. Zhang, "An overview of massive MIMO: Benefits and challenges," *IEEE J. Sel. Topics Signal Process.*, vol. 8, no. 5, pp. 742–758, Oct. 2014.
- [2] A. Arnaz, J. Lipman, M. Abolhasan, and M. Hiltunen, "Toward integrating intelligence and programmability in open radio access networks: A comprehensive survey," *IEEE Access*, vol. 10, pp. 67747–67770, 2022.
- [3] K. Izadinasab, A. W. Shaban, and O. Damen, "Low-complexity detectors for uplink massive MIMO systems leveraging truncated polynomial expansion," *IEEE Access*, vol. 10, pp. 91610–91621, 2022.
- [4] K. A. Alnajjar, P. J. Smith, and G. K. Woodward, "Low complexity V-BLAST for massive MIMO," in *Proc. Austral. Commun. Theory Workshop (AusCTW)*, Feb. 2014, pp. 22–26.

- [5] R. Y. Mesleh, H. Haas, S. Sinanovic, C. W. Ahn, and S. Yun, "Spatial modulation," *IEEE Trans. Veh. Technol.*, vol. 57, no. 4, pp. 2228–2241, Jul. 2008.
- [6] P. S. Koundinya, K. V. S. Hari, and L. Hanzo, "Joint design of the spatial and of the classic symbol alphabet improves single-RF spatial modulation," *IEEE Access*, vol. 4, pp. 10246–10257, 2016.
- [7] M. Di Renzo, H. Haas, A. Ghrayeb, S. Sugiura, and L. Hanzo, "Spatial modulation for generalized MIMO: Challenges, opportunities, and implementation," *Proc. IEEE*, vol. 102, no. 1, pp. 56–103, Jan. 2014.
- [8] R. Rajashekar, K. V. S. Hari, and L. Hanzo, "Reduced-complexity ML: Detection and capacity-optimized training for spatial modulation systems," *IEEE Trans. Commun.*, vol. 62, no. 1, pp. 112–125, Jan. 2014.
- [9] J. G. Proakis and M. Salehi, *Digital Communications*, 5th ed. New York, NY, USA: McGraw-Hill, 2008.
- [10] A. Younis, N. Serafimovski, R. Mesleh, and H. Haas, "Generalised spatial modulation," in *Proc. Asilomar Conf. Signals, Syst., Comput.*, Nov. 2010, pp. 1498–1502.
- [11] J. Wang, S. Jia, and J. Song, "Generalised spatial modulation system with multiple active transmit antennas and low complexity detection scheme," *IEEE Trans. Wireless Commun.*, vol. 11, no. 4, pp. 1605–1615, Apr. 2012.
- [12] P. Yang, M. Di Renzo, Y. Xiao, S. Li, and L. Hanzo, "Design guidelines for spatial modulation," *IEEE Commun. Surveys Tuts.*, vol. 17, no. 1, pp. 6–26, 1st Quart., 2014.
- [13] Y. Xiao, Z. Yang, L. Dan, P. Yang, L. Yin, and W. Xiang, "Low-complexity signal detection for generalized spatial modulation," *IEEE Commun. Lett.*, vol. 18, no. 3, pp. 403–406, Mar. 2014.
- [14] L. Xiao, P. Yang, Y. Xiao, S. Fan, M. Di Renzo, W. Xiang, and S. Li, "Efficient compressive sensing detectors for generalized spatial modulation systems," *IEEE Trans. Veh. Technol.*, vol. 66, no. 2, pp. 1284–1298, Feb. 2017.
- [15] J. A. Tropp and A. C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," *IEEE Trans. Inf. Theory*, vol. 53, no. 12, pp. 4655–4666, Dec. 2007.
- [16] T.-H. Liu, C.-E. Chen, and C.-H. Liu, "Fast maximum likelihood detection of the generalized spatially modulated signals using successive sphere decoding algorithms," *IEEE Commun. Lett.*, vol. 23, no. 4, pp. 656–659, Apr. 2019.
- [17] M. Saad, H. Hijazi, A. C. A. Ghouwayel, F. Bader, and J. Palicot, "Low complexity quasi-optimal detector for generalized spatial modulation," *IEEE Commun. Lett.*, vol. 25, no. 9, pp. 3003–3007, Sep. 2021.
- [18] T. K. Moon and W. C. Stirling, *Mathematical Methods and Algorithms for Signal Processing*. London, U.K.: Pearson, 1999.
- [19] D. Wübben, R. Böhnke, J. Rinas, V. Kühn, and K. D. Kammeyer, "Efficient algorithm for decoding layered space-time codes," *Electron. Lett.*, vol. 37, no. 22, pp. 1348–1350, Oct. 2001.
- [20] G. H. Golub and C. F. Van Loan, *Matrix Computations*, 3rd ed. Baltimore, MD, USA: Johns-Hopkins, 1996.
- [21] T.-H. Kim, "Low-complexity sorted QR decomposition for MIMO systems based on pairwise column symmetrization," *IEEE Trans. Wireless Commun.*, vol. 13, no. 3, pp. 1388–1396, Mar. 2014.
- [22] H. Lee, K. Oh, M. Cho, Y. Jang, and J. Kim, "Efficient low-latency implementation of CORDIC-based sorted QR decomposition for multi-Gbps MIMO systems," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 65, no. 10, pp. 1375–1379, Oct. 2018.
- [23] J. S. Lin, Y. T. Hwang, S. H. Fang, P. H. Chu, and M. D. Shieh, "Low-complexity high-throughput QR decomposition design for MIMO systems," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 23, no. 10, pp. 2342–2346, Oct. 2015.
- [24] P. K. Meher, J. Valls, T.-B. Juang, K. Sridharan, and K. Maharatna, "50 years of CORDIC: Algorithms, architectures, and applications," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 56, no. 9, pp. 1893–1907, Sep. 2009.
- [25] T.-H. Liu, S.-L. Wang, Y.-J. Lin, Y.-T. Hwang, C.-E. Chen, and Y.-S. Chu, "Fixed-complexity tree search schemes for detecting generalized spatially modulated signals: Algorithms and hardware architectures," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 68, no. 2, pp. 904–917, Feb. 2021.
- [26] G. H. Lee and T. H. Kim, "Implementation of a near-optimal detector for spatial modulation MIMO systems," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 63, no. 10, pp. 954–958, Oct. 2016.
- [27] T.-H. Liu, Y.-Z. Ye, C.-K. Huang, C.-E. Chen, Y.-T. Hwang, and Y.-S. Chu, "A low-complexity maximum likelihood detector for the spatially modulated signals: Algorithm and hardware implementation," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 66, no. 11, pp. 1820–1824, Nov. 2019.




**HOANG-YANG LU** (Member, IEEE) received the B.S., M.S., and Ph.D. degrees from the National Taiwan University of Science and Technology, Taipei, in 1991, 1993, and 2007, respectively, all in electronics engineering. From 1993 to 2007, he was a Lecturer with the Department of Electronics Engineering, Lee-Ming Institute of Technology, Taipei. He joined the Digital Technology Department, Kainan University, Taoyuan, Taiwan, in 2007, and the Department of Electronics Engineering, Huafan University, Taipei, in 2008, where he was an Associate Professor. Since 2009, he has been an Assistant Professor with the Department of Electrical Engineering, National Taiwan Ocean University (NTOU), where he is currently an Associate Professor. His research interests include signal processing, communication systems, and machine learning.



**MAO-HSU YEN** received the B.S., M.S., and Ph.D. degrees in electronic engineering from the National Taiwan University of Science and Technology, Taipei, Taiwan, in 1991, 1993, and 2000, respectively. He has been a Faculty Member with the Department of Computer Science and Engineering, National Taiwan Ocean University, Keelung, Taiwan, since 2005, where he is currently an Associate Professor. His current research interests include the design of application-specific integrated circuit, microcontroller unit, and FPGA architectures.



**CHE-WEI CHANG** was born in Taipei, Taiwan, in 2001. He is currently pursuing the bachelor's degree in electrical engineering with the National Taiwan Ocean University, Keelung, Taiwan, in 2022. His research interests include algorithms and hardware architectures of wireless communications, and machine learning for signal processing.



**CHUNG-WEI CHENG** was born in Hsinchu, Taiwan, in 2000. He is currently pursuing the bachelor's degree in electrical engineering with the National Taiwan Ocean University, Keelung, Taiwan, in 2022. His research interests include algorithms and hardware architectures of wireless communications, and machine learning for signal processing.



**TZU-CHING HSU** was born in Taoyuan, Taiwan, in 2000. He is currently pursuing the bachelor's degree in electrical engineering with the National Taiwan Ocean University, Keelung, Taiwan, in 2022. His research interests include algorithms and hardware architectures of wireless communications, and machine learning for signal processing.



**YU-CHI LIN** was born in Yilan, Taiwan, in 2001. He is currently pursuing the bachelor's degree in electrical engineering with the National Taiwan Ocean University, Keelung, Taiwan, in 2022. His research interests include wireless communications and ASIC implementation.

...