

Received August 11, 2017, accepted August 31, 2017, date of publication September 18, 2017, date of current version October 12, 2017. *Digital Object Identifier* 10.1109/ACCESS.2017.2751622

# **SDR Implementation of a Real-Time Testbed for Future Multi-Antenna Smartphone Applications**

# XINTONG LU<sup>1</sup>, (Student Member, IEEE), LUYAO NI<sup>1</sup>, (Student Member, IEEE), SHI JIN<sup>1</sup>, (Member, IEEE), CHAO-KAI WEN<sup>2</sup>, (Member, IEEE), AND WEN-JUN LU<sup>3</sup>, (Member, IEEE)

<sup>1</sup>National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China
 <sup>2</sup>Institute of Communications Engineering, National Sun Yat-sen University, Kaohsiung 80424, Taiwan
 <sup>3</sup>Jiangsu Key Laboratory of Wireless Communications, Nanjing University of Posts and Telecommunications, Nanjing 210003, China

Corresponding author: Shi Jin (jinshi@seu.edu.cn)

This work was supported in part by the National Science Foundation for Distinguished Young Scholars of China under Grant 61625106, in part by the National Natural Science Foundation of China under Grant 61531011, and in part by the Hong Kong, Macao and Taiwan Science and Technology Cooperation Program of China under Grant 2016YFE0123100. The work of C.-K. Wen was supported in part by the Ministry of Science and Technology of Taiwan under Grant MOST 106-2221-E-110-019 and in part by ITRI, Hsinchu, Taiwan. The work of W.-J. Lu was supported by the National Natural Science Foundation of China under Grant 61427801 and Grant 61471204.

**ABSTRACT** Equipping mobile terminals with multiple antennas, a bandwidth-friendly approach for increasing data rate and reliability, has become a main trend in the development of future fifth-generation (5G) systems. However, this scenario presents significant challenges in the antenna and hardware design. For further system development, building real-time testbeds is a desirable track given that this endeavor can demonstrate the possibilities and limitations of the technology. In this paper, we present the design, implementation, and evaluation of a multiple input, multiple output system with eight antennas at a mobile terminal on the basis of a software-defined radio (SDR) platform. The system uses long-term evolution-like system parameters. We illustrate the hierarchical hardware architecture and the implementation features, including the timing and synchronization, the processing partitioning, and the performance indicators of a particular eight-antenna module, which could meet the demands of 5G smartphone applications. Link-level simulations corresponding to the designs are conducted to validate the feasibility of the system. Accordingly, the SDR-based testbed is implemented, and a series of experiments are carried out to test the performance of our design in realistic situations. The proposed system has experimentally demonstrated its capability to transmit four independent high-definition video streams on the same time-frequency resource.

**INDEX TERMS** Multi-antenna terminal, 5G application, prototyping testbed, software-defined radio.

## I. INTRODUCTION

Long-term evolution (LTE) has become the most successful mobile wireless broadband technology among similar technologies, serving over one billion users as of the beginning of 2016 [1]. Several releases have been published over the past decade. The first version of LTE (Release 8) was introduced in 2008 and later evolved to LTE-Advanced (Release 10) in 2011. LTE-Advanced Pro (Release 13) emerged in 2015, wherein a wide range of enhancements were added to address the challenges posed by existing services. To meet and surpass the high data rate and transmission reliability requirements, increasing the number of antennas on a compact user equipment (UE) has been considered in near-term releases [2], where the UE is proposed to be equipped with two or more antennas. Catering to this trend, the industry intends to make LTE-Advanced Pro an opportunity for future development, beginning with updates to existing LTE modem chipsets and multi-antenna modules. At the 2016 International Consumer Electronics Show (CES), Qualcomm showed off its latest X12 LTE modem add-on [3]. The unit can enable up to  $4 \times 4$  multiple input, multiple output (MIMO) on the downlink, resulting in enhanced bandwidth and coverage. Recent works [4]–[6] have proposed equipping compact smartphones with 8-, 10-, and 16-element antenna arrays, respectively, for fifth-generation (5G) operations within the 3400-3600 MHz frequency band.

Building over-the-air (OTA) testbeds is crucial in the further exploration and verification of the potential algorithms as well as circuit designs under the aforementioned standards. In previous works, a fair number of notable software-defined radio (SDR) testbeds, such as the Argos [7]-[9] and the massive MIMO testbed at Southeast University (SEU) [10], have been introduced. However, despite considerable efforts in the field of SDR prototyping system design, UE has yet to be fully considered in past studies. Specifically, these testbeds commonly assumed that the terminal is equipped with a single antenna and that no strict requirements are imposed for the antenna modules at UE in terms of materials and frequency bands. Accordingly, a demand exists for continued efforts to conduct SDR-based MIMO testbed development considering UE, including UE designs and their operating schemes. The systems should operate under realistic conditions in real-time to measure the performance of newly designed internal multi-antenna modules, which are proposed for 5G mobile devices. These works can be extended to large scales in the next step to cater to the trend of massive MIMO.

FIGURE 1. The PSS allocation. The PSS is modulated on subcarriers around the DC-carrier according to a Zadoff-Chu root sequence; the sequence index u is defined in the LTE-Advanced standard.

In this study, we design and implement a practical MIMO system on the basis of an off-the-shelf SDR platform. We design the platform based on LTE-like parameters and introduce several highly efficient data processing schemes. To validate the feasibility of the chosen parameters and the proposed schemes, we first present the link-level simulation results. Subsequently, we accomplish real-time signal processing in the physical layer by employing a flexible SDR solution and a modular architecture. In particular, synchronization between base station (BS) and UE is realized based on the primary synchronization signal (PSS), and a high-throughput bus is used for real-time data transfer. Finally, field tests are set up for experimental investigation, including the visualization of constellations and streamed high-definition (HD) videos, as well as the evaluation of performance metrics.

The rest of this paper is structured as follows: Section 2 provides a basic description of the physical layer. Section 3 presents the link-level simulation in detail. Section 4 discusses the implementation of our prototype system. Section 5 presents the corresponding experimental evaluation. Finally, Section 6 concludes this paper.

*Notation:* Matrices and vectors are denoted by uppercase and lowercase letters in boldface. The  $N \times N$  identity matrix is denoted by  $\mathbf{I}_N$ ;  $(\cdot)^{\mathrm{H}}$  and  $(\cdot)^{-1}$  stand for the conjugate transpose and inverse operations respectively;  $\mathbb{E}\{\cdot\}$  is the statistical expectation; and  $\|\cdot\|$  denotes the Euclidean norm.

## **II. BASIC DESCRIPTION**

This section describes the basic aspects of the physical layer of our prototyping testbed, including the signal model and the proposed processing schemes.

# A. SIGNAL MODEL

Consider a MIMO system, in which BS is equipped with  $N_{BS}$  antennas and serves an  $N_{UE}$ -antenna UE adopting  $N_{FFT}$ -subcarrier orthogonal frequency division multiplexing (OFDM). When transmitting data symbols, the relationship between the received data vector  $\mathbf{y}_n$  and the transmitted vector  $\mathbf{s}_n$  for the subcarrier index *n* is indicated as follows:

$$\mathbf{y}_n = \mathbf{H}_n \mathbf{s}_n + \mathbf{z}_n,\tag{1}$$

where  $\mathbf{H}_n$  is an  $N_{\text{UE}} \times N_{\text{BS}}$  channel matrix on subcarrier *n*, and  $\mathbf{z}_n$  is a complex Gaussian noise vector with i.i.d. elements and  $\mathbb{E}\{\mathbf{z}_n \mathbf{z}_n^{\text{H}}\} = \sigma_{\mathbf{z}}^2 \mathbf{I}$ .

# **B. TIMING SYNCHRONIZATION**

A key requirement to initiate communications between BS and UE is the synchronization protocol. Designing a robust algorithm is important for the stability of the entire baseband processing. As shown in Figure 1, the PSS is mapped onto the subcarriers around the direct current (DC) carrier, where the element with index k is taken from a Zadoff-Chu sequence [11] according to the following:

$$d_u(k) = \begin{cases} e^{-j\frac{\pi u k(k+1)}{63}}, & k = 0, 1, \dots, 30\\ e^{-j\frac{\pi u(k+1)(k+2)}{63}}, & k = 31, 32, \dots, 61, \end{cases}\tag{2}$$

where the Zadoff-Chu root sequence index u is defined in the LTE-Advanced standard, i.e., u = 25.
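Equation (2) can be evaluated directly. The following Python sketch generates the 62 PSS elements for a given root index and exposes two properties that follow from the definition: constant amplitude and symmetry about the centre of the sequence. NumPy is our choice for illustration and is not part of the testbed toolchain; the function name `pss_zadoff_chu` is hypothetical.

```python
import numpy as np

def pss_zadoff_chu(u: int = 25) -> np.ndarray:
    """Return the 62 frequency-domain PSS elements d_u(k) of Eq. (2)."""
    k = np.arange(62)
    # First branch of Eq. (2) for k = 0..30, second branch for k = 31..61.
    exponent = np.where(k <= 30, k * (k + 1), (k + 1) * (k + 2))
    return np.exp(-1j * np.pi * u * exponent / 63)

d = pss_zadoff_chu(25)
print(np.abs(d).min(), np.abs(d).max())   # every element has unit magnitude
print(np.allclose(d, d[::-1]))            # the sequence is symmetric about its centre
```

The symmetry is a property of the odd-length Zadoff-Chu sequence itself, so it holds for any integer root index passed to the function.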

As listed in Table I, the synchronization algorithm consists of two main parts: the approximate synchronization (Steps 1 to 3) and the accurate synchronization (Steps 4 and 5). For a single Rx data stream, because the PSS occupies only a narrow band around the DC carrier, Step 1 applies low-pass filtering to the baseband OFDM samples to retain the appropriate bandwidth. To reduce the large number of computations, down-sampling is then conducted in Step 2, with the down-sampling ratio checked to ensure that spectrum aliasing is avoided. As depicted in Figure 2, to determine the approximate starting point  $\tau_1$  of the radio frame within each 10-ms duration, Step 3 finds the index of the maximum value of the cross-correlation with an  $N_{\text{DS}}$ -point sequence, which is sampled from an  $N_{\text{FFT}}$ -point inverse fast Fourier transform (IFFT) of the ideal Zadoff-Chu sequence. In Step 4, we take each  $\tau_2 \in \{\tau_1 - \Delta \tau, \tau_1 - \Delta \tau + 1, \dots, \tau_1 + \Delta \tau\}$  as a candidate starting point and cut out the corresponding  $2\Delta \tau + 1$  sequences of  $N_{\text{FFT}}$  points each from the filtered sequence in Step 1. To calculate the accurate index  $\tau_2$  in Step 5, we search for the maximum value of the cross-correlation between the  $2\Delta \tau + 1$  sequences in Step 4 and the ideal Zadoff-Chu sequence.

FIGURE 2. The approximate synchronization. We find the index of the maximum value of the cross-correlation with an N<sub>DS</sub>-point sequence sampled from an N<sub>FFT</sub>-point IFFT of the ideal Zadoff-Chu sequence.

TABLE 1. Synchronization algorithm.

| Step   | Algorithm |
|--------|-----------|
| Step 1 | Operate low-pass filtering after ADC sampling |
| Step 2 | Operate the $N_{DS}$-point down-sampling of both the filtered sequence in Step 1 and the $N_{FFT}$-point IFFT of the ideal Zadoff-Chu sequence |
| Step 3 | Find the approximate index $\tau_1$ by searching for the maximum value of the cross-correlation between the two $N_{DS}$-point sequences in Step 2 |
| Step 4 | Suppose $\tau_2 \in \{\tau_1 - \Delta \tau, \tau_1 - \Delta \tau + 1, \dots, \tau_1 + \Delta \tau\}$ as the starting point, and cut out these $2\Delta \tau + 1$ sequences of $N_{FFT}$ points each from the filtered sequence in Step 1 |
| Step 5 | Find the accurate index $\tau_2$ by searching for the maximum value of the cross-correlation between the $2\Delta \tau + 1$ sequences in Step 4 and the ideal Zadoff-Chu sequence |
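The five synchronization steps can be sketched in a few lines of Python. This is a minimal, noise-free illustration under stated assumptions, not the LabVIEW FPGA implementation: the windowed-sinc filter (cutoff and tap count), the search half-width Δτ set equal to the decimation factor, the mapping of the PSS onto subcarriers −31 to +31 with an empty DC carrier, and all function names are hypothetical choices.

```python
import numpy as np

N_FFT, N_DS = 2048, 128            # values from the general parameters
D = N_FFT // N_DS                  # down-sampling ratio (16)

def pss_time_reference(u=25):
    """N_FFT-point IFFT of the ideal Zadoff-Chu sequence (assumed mapping)."""
    k = np.arange(62)
    d = np.exp(-1j * np.pi * u * np.where(k <= 30, k * (k + 1), (k + 1) * (k + 2)) / 63)
    X = np.zeros(N_FFT, dtype=complex)
    X[-31:] = d[:31]               # subcarriers -31..-1
    X[1:32] = d[31:]               # subcarriers +1..+31, DC left empty
    return np.fft.ifft(X)

def lowpass(x, cutoff=0.023, taps=257):
    """Zero-delay windowed-sinc low-pass filter (assumed design)."""
    n = np.arange(taps) - (taps - 1) / 2
    h = 2 * cutoff * np.sinc(2 * cutoff * n) * np.hamming(taps)
    return np.convolve(x, h / h.sum(), mode="same")

def synchronize(rx, ref, delta_tau=D):
    rx_f = lowpass(rx)                                          # Step 1
    rx_ds, ref_ds = rx_f[::D], ref[::D]                         # Step 2
    coarse = np.abs(np.correlate(rx_ds, ref_ds, mode="valid"))  # conjugates ref_ds
    tau1 = int(np.argmax(coarse)) * D                           # Step 3
    lo = max(tau1 - delta_tau, 0)
    hi = min(tau1 + delta_tau, len(rx_f) - N_FFT)
    candidates = np.arange(lo, hi + 1)                          # Step 4
    fine = [abs(np.vdot(ref, rx_f[t:t + N_FFT])) for t in candidates]
    tau2 = int(candidates[int(np.argmax(fine))])                # Step 5
    return tau1, tau2

# Noise-free demo: one PSS symbol placed at a known offset.
ref = pss_time_reference()
true_tau = 1003
rx = np.zeros(4 * N_FFT, dtype=complex)
rx[true_tau:true_tau + N_FFT] = ref
tau1, tau2 = synchronize(rx, ref)
print(tau1, tau2)
```

The coarse stage evaluates the correlation only at every sixteenth sample, so its result can be off by up to one decimation step; the fine stage then needs only 2Δτ + 1 full-rate correlations instead of one per sample of the frame.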

For the data streams of all eight Rx antennas, we assume that the difference in over-the-air delays is within the cyclic prefix (CP) duration, such that the indexes of the maximum value of the cross-correlation, which decide the starting points of the different streams, are nearly identical [12]. However, a number of chains may occasionally obtain wrong starting points. To improve the probability of successful synchronization in this case, we collect all eight indexes and use their mode, i.e., the most frequent value, as the common index for all data streams.
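The mode-based selection of a common index amounts to a majority vote. A minimal sketch follows; the function name and the example values are hypothetical, and the tie-breaking rule (the first value to reach the highest count wins) is our assumption because the text does not specify one.

```python
from collections import Counter

def common_start_index(indexes):
    """Return the most frequent start index among the Rx chains."""
    value, _count = Counter(indexes).most_common(1)[0]
    return value

# Eight per-antenna peak indexes, two of which are outliers.
peaks = [1003, 1003, 1004, 1003, 517, 1003, 1003, 1003]
print(common_start_index(peaks))
```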

## C. CHANNEL ESTIMATION

In our system, when transmitting pilot symbols, a set of  $N_{\text{pilot}}$  independent Gaussian channels are modelled as follows:

$$\mathbf{y} = \mathbf{X}\mathbf{h} + \mathbf{z},\tag{3}$$

where **y** is a received pilot vector of size  $N_{\text{pilot}} \times 1$ , **X** is an  $N_{\text{pilot}} \times N_{\text{pilot}}$  matrix with the pilot elements on its diagonal, **h** is a channel attenuation vector of size  $N_{\text{pilot}} \times 1$  in the frequency domain, and **z** denotes an i.i.d. complex zero-mean Gaussian noise vector with variance  $\sigma_z^2$ . The low-complexity linear minimum mean-square error (LMMSE) estimation mentioned in [13] is applied and expressed as follows:

$$\mathbf{h}_{\text{lmmse}} = \mathbf{W} \tilde{\mathbf{h}},\tag{4}$$

where  $\tilde{\mathbf{h}} = \mathbf{X}^{-1}\mathbf{y}$  is the least-squares (LS) estimate of  $\mathbf{h}$ , and  $\mathbf{W}$  is given by the following:

$$\mathbf{W} = \mathbf{R}_{\tilde{\mathbf{h}}\tilde{\mathbf{h}}} \left[ \mathbf{R}_{\tilde{\mathbf{h}}\tilde{\mathbf{h}}} + \sigma_z^2 \left( \mathbf{X} \mathbf{X}^{\mathrm{H}} \right)^{-1} \right]^{-1}.\tag{5}$$

Due to the difficulty in calculating the channel autocorrelation matrices  $\mathbf{R}_{h\tilde{h}}$  and  $\mathbf{R}_{\tilde{h}\tilde{h}}$ , we consider a fading multipath channel model, wherein the delays are uniformly and independently distributed over the length of the CP. Accordingly, the element of the correlation matrix

$$\mathbf{R} = \begin{bmatrix} 1 & \cdots & r_{1,N_{\text{FFT}}} \\ \vdots & \ddots & \vdots \\ r_{N_{\text{FFT}},1} & \cdots & 1 \end{bmatrix}\tag{6}$$

can be approximated by the uniform channel correlation between the attenuations  $h_p$  and  $h_q$ , which is given by

$$r_{p,q} = \begin{cases} 1, & p = q \\ \frac{1 - e^{-j2\pi L \frac{p-q}{N_{\rm FFT}}}}{j2\pi L \frac{p-q}{N_{\rm FFT}}}, & p \neq q. \end{cases}\tag{7}$$

This correlation depends only on the distance between the tones p - q and the length of the CP L [14]. A balance between the performance improvement and the increased complexity must be struck; therefore, a simplified MMSE channel estimation is carried out to facilitate the implementation. The main goal is to reduce the complexity of the matrix multiplication  $\mathbf{W}\tilde{\mathbf{h}}$  by partitioning the channel autocorrelation matrix into *K* submatrices on the basis of the correlations among different subcarriers, so that

**FIGURE 3.** The block diagram of the single-cell MIMO system when  $N_{BS} = 4$  and  $N_{UE} = 8$  under a 20-MHz bandwidth with OFDM utilized. The transmitter first prepares raw bits and maps them to symbols. Afterwards, IFFT and CP adding are computed. Either the pilot or data OFDM symbols are converted to the analogue domain and sent out. At the receiver, we synchronize BS and UE with the PSS. FFT and CP removal are then carried out. After channel estimation and MIMO detection, QAM demodulation is conducted to recover the initial binary digits.

$$\mathbf{h}_{\text{lmmse}} \approx \begin{bmatrix} \mathbf{W}_1 \tilde{\mathbf{h}}_1 \\ \vdots \\ \mathbf{W}_K \tilde{\mathbf{h}}_K \end{bmatrix}, \tag{8}$$

where  $\mathbf{W}_k$  is part of  $\mathbf{W}$ , and  $\tilde{\mathbf{h}}_k$  is extracted from  $\tilde{\mathbf{h}}$ .
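The block-wise estimator of (8) can be sketched as follows. Several choices here are assumptions made only for illustration: the correlation matrix of (7) is used as the channel correlation in (5), in the spirit of [13]; the blocks are contiguous groups of pilot tones; the CP length is taken as L = 144 samples; the pilot tones are spaced six subcarriers apart; and the pilots have unit modulus. All function names are hypothetical.

```python
import numpy as np

def uniform_correlation(idx, L, n_fft):
    """Correlation matrix of Eq. (7) for the subcarrier indexes in idx."""
    diff = idx[:, None] - idx[None, :]
    x = 2j * np.pi * L * diff / n_fft
    R = np.ones(diff.shape, dtype=complex)
    off = diff != 0
    R[off] = (1 - np.exp(-x[off])) / x[off]
    return R

def lmmse_weights(R, pilots, sigma2):
    """W of Eq. (5), with R used as the channel correlation (assumption)."""
    inv_xxh = np.diag(1.0 / np.abs(pilots) ** 2)   # (X X^H)^{-1} for diagonal X
    return R @ np.linalg.inv(R + sigma2 * inv_xxh)

def blockwise_lmmse(y, pilots, idx, L, n_fft, sigma2, K):
    """Eq. (8): filter each of K contiguous blocks of the LS estimate separately."""
    h_ls = y / pilots                               # LS estimate X^{-1} y
    out = np.empty_like(h_ls)
    for blk in np.array_split(np.arange(len(idx)), K):
        R_k = uniform_correlation(idx[blk], L, n_fft)
        out[blk] = lmmse_weights(R_k, pilots[blk], sigma2) @ h_ls[blk]
    return out

# Demo with a random multipath channel whose delays lie within the CP.
rng = np.random.default_rng(0)
n_fft, L, sigma2 = 2048, 144, 0.1                   # L is an assumed CP length
idx = np.arange(0, 1200, 6)                         # 200 pilot tones (assumed spacing)
pilots = np.exp(1j * np.pi / 2 * rng.integers(0, 4, idx.size))
delays = rng.uniform(0, L, 6)
gains = (rng.standard_normal(6) + 1j * rng.standard_normal(6)) / np.sqrt(12)
h = (gains * np.exp(-2j * np.pi * np.outer(idx, delays) / n_fft)).sum(axis=1)
noise = np.sqrt(sigma2 / 2) * (rng.standard_normal(idx.size)
                               + 1j * rng.standard_normal(idx.size))
y = pilots * h + noise
h_ls = y / pilots
h_hat = blockwise_lmmse(y, pilots, idx, L, n_fft, sigma2, K=10)
print(np.mean(np.abs(h_ls - h) ** 2), np.mean(np.abs(h_hat - h) ** 2))
```

With K blocks, each filter is an (N_pilot/K) × (N_pilot/K) matrix, so the number of multiplications per estimate drops roughly by a factor of K relative to the full matrix product, at the cost of ignoring the correlation between blocks.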

# **D. MIMO DETECTION**

The MIMO detector plays an important role at the receiver. Among the available detection algorithms, the LMMSE detector is widely adopted and offers good performance. The estimated channel on subcarrier *n* is denoted by  $\hat{\mathbf{H}}_n$ , with which the detected signal vector is given by the following:

$$\hat{\mathbf{s}}_n = \mathbf{W}_{\text{lmmse},n}\mathbf{y}_n,\tag{9}$$

where

$$\mathbf{W}_{\text{lmmse},n} = \left[ \left( \hat{\mathbf{H}}_n \right)^{\text{H}} \hat{\mathbf{H}}_n + \sigma_z^2 \mathbf{I}_{N_{\text{BS}}} \right]^{-1} \left( \hat{\mathbf{H}}_n \right)^{\text{H}}.\tag{10}$$

We assume that

$$\mathbf{B}^{\mathrm{H}}\mathbf{B} = \left(\hat{\mathbf{H}}_{n}\right)^{\mathrm{H}}\hat{\mathbf{H}}_{n} + \sigma_{z}^{2}\mathbf{I}_{N_{\mathrm{BS}}},\tag{11}$$

thus

$$\mathbf{W}_{\text{lmmse},n} = (\mathbf{B}^{\text{H}}\mathbf{B})^{-1} \left(\hat{\mathbf{H}}_{n}\right)^{\text{H}}.\tag{12}$$

LabVIEW FPGA provides no built-in module for matrix inversion. A widely used approach computes the inverse from the adjugate matrix and the determinant as follows:

$$(\mathbf{B}^{\mathrm{H}}\mathbf{B})^{-1} = \frac{\operatorname{adj}(\mathbf{B}^{\mathrm{H}}\mathbf{B})}{|\mathbf{B}^{\mathrm{H}}\mathbf{B}|}.\tag{13}$$

Equation (13) becomes prohibitive for real-time applications and large matrix sizes. To perform the inversion with low complexity, we use a detection approach based on QR decomposition. With QR decomposition, matrix **B** can be expressed as follows:

$$\mathbf{B} = \begin{bmatrix} \hat{\mathbf{H}}_n \\ \sigma_{\mathbf{z}} \mathbf{I}_{N_{\rm BS}} \end{bmatrix} = \mathbf{Q} \mathbf{R} = \begin{bmatrix} \mathbf{Q}_1 \\ \mathbf{Q}_2 \end{bmatrix} \mathbf{R}, \tag{14}$$

such that

$$\hat{\mathbf{H}}_n = \mathbf{Q}_1 \mathbf{R},\tag{15}$$

and

$$\mathbf{R}^{-1} = \frac{\mathbf{Q}_2}{\sigma_{\mathbf{z}}}.\tag{16}$$

By substituting (14) and (15) into (12),  $\mathbf{W}_{\text{lmmse},n}$  is derived as follows:

$$\mathbf{W}_{\text{lmmse},n} = \left(\mathbf{R}^{\mathrm{H}}\mathbf{R}\right)^{-1} \left(\hat{\mathbf{H}}_{n}\right)^{\mathrm{H}}.\tag{17}$$

Therefore, by combining (15), (16), and (17), the filter matrix at subcarrier *n* is written as follows:

$$\mathbf{W}_{\text{lmmse},n} = \frac{\mathbf{Q}_2 \mathbf{Q}_1^{\text{H}}}{\sigma_{\mathbf{z}}}.\tag{18}$$

Consequently, we only need to calculate the QR decomposition of **B** in (14), followed by a simple matrix multiplication in (18), instead of calculating  $\mathbf{B}^{H}\mathbf{B}$  and the determinant in (13). In addition, the QR decomposition can be carried out with Gram-Schmidt orthogonalization, which is convenient for real applications.
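The equivalence of the QR-based filter in (18) and the direct form in (10) can be checked numerically. The sketch below is only an illustration: it relies on NumPy's `numpy.linalg.qr` (a LAPACK Householder routine) rather than the Gram-Schmidt procedure used on the FPGA, and the function names, the random channel, and the noise level are hypothetical.

```python
import numpy as np

def lmmse_filter_qr(H_hat, sigma_z):
    """Filter of Eq. (18) from the QR decomposition of B in Eq. (14)."""
    n_ue, n_bs = H_hat.shape
    B = np.vstack([H_hat, sigma_z * np.eye(n_bs)])   # stacked matrix B
    Q, _R = np.linalg.qr(B)                          # reduced QR, Q is (n_ue + n_bs) x n_bs
    Q1, Q2 = Q[:n_ue], Q[n_ue:]
    return Q2 @ Q1.conj().T / sigma_z

def lmmse_filter_direct(H_hat, sigma_z):
    """Filter of Eq. (10), computed by solving the linear system directly."""
    n_bs = H_hat.shape[1]
    G = H_hat.conj().T @ H_hat + sigma_z ** 2 * np.eye(n_bs)
    return np.linalg.solve(G, H_hat.conj().T)

rng = np.random.default_rng(1)
n_ue, n_bs, sigma_z = 8, 4, 0.1
H = (rng.standard_normal((n_ue, n_bs))
     + 1j * rng.standard_normal((n_ue, n_bs))) / np.sqrt(2)
W_qr = lmmse_filter_qr(H, sigma_z)
W_direct = lmmse_filter_direct(H, sigma_z)
print(np.allclose(W_qr, W_direct))

# Detection of one QPSK vector as in Eq. (9), without noise for brevity.
s = (rng.choice([-1, 1], n_bs) + 1j * rng.choice([-1, 1], n_bs)) / np.sqrt(2)
s_hat = W_qr @ (H @ s)
```

The product of the lower block of Q with the conjugate transpose of its upper block does not depend on the sign or phase convention of the QR routine, which is why the two filters agree regardless of how the decomposition is computed.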

#### **III. LINK-LEVEL SIMULATION**

This section presents the link-level simulation of our system. First, we show the detailed block diagram of the link-level procedure. The general parameters and the time-frequency resource grids are also presented. The numerical results of the simulation are provided, including the bit error rates (BERs) for different Tx data streams and the total throughput with different modulation schemes.

## A. BLOCK DIAGRAM

Consider a single-cell MIMO system with  $N_{\rm BS} = 4$  and  $N_{\rm UE} = 8$  under a 20-MHz bandwidth using OFDM. At the transmitter, as illustrated in Figure 3, a series of binary digits is first prepared and mapped to symbols with a proper modulation scheme (QPSK, 16-QAM, or 64-QAM). Subsequently, IFFT and CP adding are computed, thereby providing a set of complex samples. Either the pilot or the data OFDM symbol is converted to the analogue domain and sent out via  $N_{\rm BS}$  Tx antennas. At the receiver, we synchronize BS and UE by identifying the start position of a radio frame through correlation with the PSS. Thereafter, CP removal and fast Fourier transform (FFT) are carried out to convert the time-domain signals back to the frequency domain.

#### TABLE 2. General parameters.

| Variable    | Value  |
|-------------|--------|
| $N_{UE}$    | 8      |
| $N_{BS}$    | 4      |
| W           | 20 MHz |
| $N_{FFT}$   | 2048   |
| $N_{DS}$    | 128    |
| $N_{SC}$    | 1200   |
| $N_{pilot}$ | 200    |
| -           | Normal |



FIGURE 4. The frame structure. A 10-ms radio frame is divided into 10 subframes with the same structure of two 0.5-ms time slots. Subframe 0 is set up for the initial synchronization between BS and UE, and Subframe 1 to Subframe 9 are for data transmission. Each time slot contains seven OFDM symbols; four symbols are for data, and the remaining ones are for pilots.

To eliminate the influence of noise and make full use of subcarrier correlation, we apply the low-complexity LMMSE estimator mentioned in [13]. With the estimated channel matrices, the LMMSE detector based on QR decomposition is implemented [22]. Finally, QAM demodulation is conducted to recover the initial binary digits, and the performance of our system is evaluated.

#### **B. GENERAL PARAMETERS**

We focus on TDD operation in our proposed system. In the current setting, the testbed operates with many parameters identical to those of TDD-based LTE cellular systems (Table II). Sharing parameters with LTE makes it easy to evaluate how our system would influence current cellular systems.

## C. TIME-FREQUENCY RESOURCE GRIDS

Following the LTE-Advanced standard, the time-frequency resource grids are assigned as follows.

# 1) FRAME STRUCTURE

Figure 4 presents the LTE-like frame structure adopted in this study. Specifically, the 10-ms radio frame is divided into 10 subframes with the same structure of two 0.5-ms time slots. The first subframe is set up for the initial synchronization between BS and UE, and the other subframes are set aside for data transmission. Figure 4 exhibits that each time slot contains seven OFDM symbols, among which four symbols are assigned for data and the others are assigned as pilot symbols.



FIGURE 5. The pilot allocation. As mentioned in the LTE-Advanced standard, the N<sub>BS</sub> groups of orthogonal pilots are placed on certain subcarriers, padding the remaining subcarriers with zero.

# 2) PILOT ALLOCATION

Figure 5 shows the pilot allocation of the proposed system. According to the 3GPP LTE-Advanced standard, the  $N_{BS}$  groups of orthogonal pilots are regularly and discontinuously allocated in the frequency domain. The remaining subcarriers are padded with zero for ease of implementation.

# D. NUMERICAL RESULTS

In the numerical results presented below, the BERs for different Tx streams with different modulations, along with the total throughput, are investigated.

# 1) BER PERFORMANCE

Figure 6 shows the BERs for different modulation schemes. On the one hand, the BER can reach  $10^{-4}$  when the signal-to-noise ratio (SNR) is 10 dB with QPSK modulation, which is generally acceptable given that no channel coding is used. On the other hand, high-order modulation degrades the BER performance; hence, choosing the appropriate modulation scheme is crucial. The performance of the total throughput is also considered as follows.

## 2) TOTAL THROUGHPUT

On the basis of the BERs, the total throughput of the proposed system can be calculated as follows:

$$T_{\text{total}} = \sum_{i=0}^{3} N_{\text{sym}} \times (1 - BER_i) \times N_{\text{QAM}}, \qquad (19)$$

where  $N_{\text{QAM}}$  is the number of bits per modulated symbol, and  $N_{\text{sym}}$  is the number of data symbols per second, which is indicated as follows:

$$N_{\rm sym} = \frac{4 \times N_{\rm SC}}{0.5 \rm ms} \times \frac{9}{10},\tag{20}$$

given the presence of four data symbols per 0.5-ms time slot.  $N_{SC}$  is shown in Table II. Additionally, nine-tenths of the radio frame is intended for data transmission. Figure 7 illustrates that a system with high-order modulation results in high throughput. Consequently, a tradeoff between BER performance and system throughput is evident. In our testbed, we choose QPSK for the video streaming application to obtain good BER performance in the absence of channel coding.

FIGURE 6. The BER performance of different Tx streams with different modulation schemes, i.e., QPSK, 16-QAM, and 64-QAM. As can be seen, a higher modulation order leads to worse BER performance.

**FIGURE 7.** The total throughput with different modulation schemes, i.e., QPSK, 16-QAM, and 64-QAM. The theoretical throughput is presented as a baseline for each modulation scheme. The system with a high-order modulation results in a higher throughput.
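Equations (19) and (20) are simple enough to evaluate by hand, and the short sketch below does so for an error-free link. The function names are hypothetical, and the zero BER values are placeholders rather than simulation results.

```python
def symbols_per_second(n_sc=1200, data_symbols_per_slot=4,
                       slot_s=0.5e-3, data_fraction=9 / 10):
    """N_sym of Eq. (20): data symbols per second on one Tx stream."""
    return data_symbols_per_slot * n_sc / slot_s * data_fraction

def total_throughput(bers, bits_per_symbol, n_sym):
    """T_total of Eq. (19): sum over the Tx streams, in bits per second."""
    return sum(n_sym * (1 - ber) * bits_per_symbol for ber in bers)

n_sym = symbols_per_second()
for name, bits in [("QPSK", 2), ("16-QAM", 4), ("64-QAM", 6)]:
    t = total_throughput([0.0, 0.0, 0.0, 0.0], bits, n_sym)
    print(f"{name}: {t / 1e6:.2f} Mbit/s")
```

With the parameters of Table 2, each stream carries 8.64 million data symbols per second, so four error-free QPSK streams give about 69.1 Mbit/s under these formulas; any non-zero BER reduces the result proportionally.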

Considering the hardware components, we translate each sub-block into LabVIEW code after the MATLAB simulation. With LabVIEW, we can integrate additional programming approaches, such as ANSI C/C++, VHDL, and even .m file scripts. The LabVIEW code is simulated functionally and verified by comparing its output with the MATLAB simulation results. Subsequently, the corresponding compilation is executed, generating bit files for the configuration and operation of the FPGA device. The hardware components that constitute our testbed are described in Section 4.



FIGURE 8. The hierarchical overview of UE. The entire framework is made up of a central controller, a switch, a co-processor and four SDRs.

#### **IV. PROTOTYPE SETUP**

Having finished the link-level simulation, we present the prototype setup in this section. We illustrate the hierarchical hardware architecture and describe the implementation features, including the timing and synchronization, the processing partitioning along with the hardware devices, as well as the specifications and performance indicators of the particular 5G antenna array at UE.

## A. HARDWARE ARCHITECTURE

For the proper use of the off-the-shelf hardware (e.g., the NI 2943R manufactured by National Instruments), the hierarchical architecture of UE is proposed (Figure 8), whose main blocks are detailed below.

# 1) CENTRAL CONTROLLER

The central controller provides a user interface for parameter setting and bitfile deployment. Moreover, this controller serves as the sink for the user data.

## 2) SWITCH

The switch acts as the real-time signal processing engine and main data aggregation node, e.g., between SDRs and co-processors.

#### 3) CO-PROCESSOR

The co-processor is used to help the CPU process the baseband data, including MIMO detection and QAM demodulation.

#### 4) SDR

The SDR provides an integrated hardware and software solution. All the SDRs are connected to the antenna array, and part of the baseband processing is assigned to the FPGA.

On the basis of the aforementioned data, specific hardware components are assembled (Figure 9). Table III demonstrates the corresponding features of selected hardware modules introduced in [15]–[21]. Specifically, the PXIe-8135 embedded in the PXIe-1085 chassis serves as the central controller, whereas the NI-2943R and the PXIe-7976 FPGA co-processor are used for data processing. Notably, the PXIe-6674T is used as the timing and synchronization module, generating a stable 10-MHz reference clock and enlarging

#### TABLE 3. General parameters.

| Module     | Type                   | Features                                                                                         |
|------------|------------------------|--------------------------------------------------------------------------------------------------|
| PXIe-8135  | Central controller     | Intel Core i7-3610QE quad-core processor; 2.3 GHz base frequency and 3.3 GHz quad-core CPU       |
| PXIe-1085  | Switch                 | 16 hybrid slots and 1 PXI Express system timing slot; 8 GB/s per-slot dedicated bandwidth and 24 GB/s system bandwidth |
| PXIe-7976R | Co-processor           | DSP-focused Xilinx Kintex-7 FPGA; up to 400 Mb/s single-ended configuration rates                |
| NI 2943R   | SDR                    | 2 RF front ends and 1 Xilinx Kintex-7 FPGA; independent frequencies with options from 50 MHz to 6 GHz |
| PXIe-8262  | Expansion module       | Connected via a x4 cabled PCI Express cable; software-transparent link without programming       |
| PXIe-6674T | Reference clock source | 6 configurable I/O connections; 10 MHz clock based on an onboard precision OCXO reference        |
| OctoClock  | Clock distribution     | Integrated timing and clocking source with 8-way distribution; source detection with automatic switch-over |



FIGURE 9. The hardware architecture of UE. The PXIe-8135 embedded in the PXIe-1085 chassis serves as the central controller. The NI-2943R and PXIe-7976R FPGA co-processors are used for data processing. The PXIe-6674T is used for the timing and synchronization.

digital triggers that can be routed among multiple devices. In future work, larger-scale prototyping testbeds can be built by extending the current hierarchical architecture, in line with the trend toward massive MIMO.

## **B. TIMING AND SYNCHRONIZATION**

Timing and synchronization are important aspects of the MIMO system. As mentioned in Section II, the PSS is used to achieve synchronization between BS and UE. Specifically, each SDR of UE cross-correlates the received PSS with the local PSS and searches for the peak index, which is immediately transmitted to the central controller via the invoke method. Subsequently, the central controller selects the most frequently reported value as the unified peak index and conveys it to the four USRP RIOs to align all the radio devices.
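The peak search and the selection of a unified index can be illustrated with a short host-side sketch. It is not the FPGA implementation of the testbed: the length-63 Zadoff-Chu sequence stands in for the PSS, and the helper `simulate_rx`, the frame length, the offset, and the noise level are hypothetical values chosen only for illustration. The unified index is taken as the most frequent report, which is one plausible reading of the selection rule described above.

```python
from collections import Counter

import numpy as np


def zadoff_chu(root: int, length: int = 63) -> np.ndarray:
    """Odd-length Zadoff-Chu sequence, used here as a stand-in for the PSS."""
    n = np.arange(length)
    return np.exp(-1j * np.pi * root * n * (n + 1) / length)


def peak_index(rx: np.ndarray, local_pss: np.ndarray) -> int:
    """Sample index at which the cross-correlation magnitude peaks."""
    # np.correlate conjugates its second argument, so this acts as a matched filter.
    corr = np.correlate(rx, local_pss, mode="valid")
    return int(np.argmax(np.abs(corr)))


def unified_peak_index(indices) -> int:
    """Most frequent peak index among the SDR reports (majority vote)."""
    return Counter(indices).most_common(1)[0][0]


def simulate_rx(pss, offset, total_len, noise_std, rng):
    """Hypothetical received frame: the PSS placed at `offset` plus complex noise."""
    rx = noise_std * (rng.standard_normal(total_len) + 1j * rng.standard_normal(total_len))
    rx[offset:offset + pss.size] += pss
    return rx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pss = zadoff_chu(root=25)
    # Four SDRs observe the same frame timing through independent noise.
    reports = [peak_index(simulate_rx(pss, 100, 400, 0.1, rng), pss) for _ in range(4)]
    print(reports, "->", unified_peak_index(reports))
```

Taking the most frequent index makes the alignment robust to a single SDR that locks onto a spurious correlation peak.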

In addition to the timing and synchronization between BS and UE, two other aspects must be considered because of the distributed architecture. First, a precise and stable 10-MHz reference clock is used as the clock source. Second, digital triggers are used to start generation or acquisition on the USRP RIOs at BS and UE. Figure 10 shows the clock and trigger signal distribution network of UE. The source of the clock is an



FIGURE 10. The clock and trigger distribution network of UE. The PXIe-6674T first provides a 10-MHz reference clock and an amplified trigger. Then, the reference clock and trigger signal are transmitted to the OctoClock for further amplification and distribution.

oven-controlled crystal oscillator within the PXIe-6674T, which provides a 10-MHz reference clock. An amplified trigger is derived from a start pulse that the master SDR at UE issues via a software trigger. Then, the reference clock and trigger signal are transmitted to the OctoClock for further amplification and distribution. Consequently, all the antennas of UE share the same reference clock and trigger, so that all radio devices can start data collection synchronously.

## C. PROCESSING PARTITIONING

On the basis of the block diagram in Section III and the aforementioned hardware architecture, Figure 11 shows how the baseband processing is partitioned across the hardware platform. At BS, a serial stream of binary digits is generated in the central controller and then transmitted to the master SDR via DMA FIFO. To enable data exchange among SDRs, we adopt a P2P FIFO approach. After resource mapping and OFDM modulation, the high-rate complex symbols are conveyed to every RF chain for up-conversion and then sent out by four transmitting antennas. At UE, the RF signals acquired by the four NI 2943Rs first pass through eight RF chains, where they undergo down-conversion, analog-to-digital conversion, and quantization. Synchronization, FFT along with CP removal, and LMMSE channel estimation are then carried out in the FPGA of each NI 2943R. Thereafter, the valid baseband data and obtained channel information are aggregated to the



FIGURE 11. The block diagram related to hardware implementation for data transmission at both BS and UE. At BS, binary digits are generated and transmitted to the master SDR via DMA FIFO. The P2P FIFO approach is then adopted to enable data exchange between SDRs. After resource mapping and OFDM modulation, the symbols are sent out. At UE, the RF signals go through eight RF chains and perform down-conversion, analog-to-digital conversion, and quantization. After synchronization, FFT along with CP removal and LMMSE channel estimation, the valid data and channel information are aggregated to the FPGA co-processor PXIe-7976R via P2P FIFO to achieve LMMSE MIMO detection and QAM demodulation.



FIGURE 12. Photograph of the multi-antenna module of UE.

FPGA co-processor PXIe-7976R via P2P FIFO to achieve LMMSE MIMO detection as well as QAM demodulation. Finally, the demodulated data bytes are transmitted back to the embedded controller for further analysis and display.
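To make the co-processor stage concrete, the following host-side sketch applies a textbook LMMSE detector to one subcarrier of an 8 × 4 link and then makes QPSK hard decisions. It is an illustration under stated assumptions rather than the FPGA design: unit-power symbols are assumed, `noise_var` plays the role of the noise parameter used in the detector, and the function names and the random channel are hypothetical.

```python
import numpy as np


def lmmse_detect(H: np.ndarray, y: np.ndarray, noise_var: float) -> np.ndarray:
    """LMMSE estimate x_hat = (H^H H + noise_var * I)^(-1) H^H y for unit-power symbols."""
    n_tx = H.shape[1]
    gram = H.conj().T @ H + noise_var * np.eye(n_tx)
    # Solving the linear system avoids forming an explicit matrix inverse.
    return np.linalg.solve(gram, H.conj().T @ y)


def qpsk_hard_decision(symbols: np.ndarray) -> np.ndarray:
    """Map each estimate to the nearest unit-power QPSK point."""
    return (np.sign(symbols.real) + 1j * np.sign(symbols.imag)) / np.sqrt(2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_rx, n_tx = 8, 4  # eight UE antennas, four BS streams
    H = (rng.standard_normal((n_rx, n_tx)) + 1j * rng.standard_normal((n_rx, n_tx))) / np.sqrt(2)
    x = (rng.choice([-1, 1], n_tx) + 1j * rng.choice([-1, 1], n_tx)) / np.sqrt(2)
    noise_var = 1e-3  # illustrative value only
    z = np.sqrt(noise_var / 2) * (rng.standard_normal(n_rx) + 1j * rng.standard_normal(n_rx))
    x_hat = lmmse_detect(H, H @ x + z, noise_var)
    print("all symbols recovered:", np.allclose(qpsk_hard_decision(x_hat), x))
```

In the testbed this operation is repeated for every data subcarrier, which is why it is offloaded to a dedicated FPGA co-processor.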

## D. ANTENNA ARRAY

A 3.4-3.6 GHz smartphone antenna prototype is designed and used as the UE antenna array. As shown in Figure 12, eight identical slotted planar inverted-F antenna (PIFA) elements (Ant 1-Ant 8) are integrated along the two long side edges of a printed circuit board (PCB). The antenna element is a modified, double-C-shaped PIFA [23]. To enhance the isolation among neighboring elements, additional slots [24] and side walls composed of metallic via-holes [25] are introduced on the ground plane. All antennas are fed by coaxial probes and connected to the external system through standard 50-$\Omega$ SMA launchers. Unlike traditional MIMO systems without metallic packages, the developed system is packaged in a metallic handset whose backed cavity and frames are made of copper. The measured reflection coefficient of each element is less than -6 dB, and the measured minimum isolation among them is greater than 10 dB. Within the desired 3.4-3.6 GHz band, the measured average antenna efficiency is 37%, and the corresponding average gain is 1.8 dBi. Moreover, the envelope correlation coefficient (ECC) is less than 0.2.
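For readers who prefer linear quantities, the measured figures above can be converted with standard relations. The sketch below uses only textbook conversions (power ratio to dB, reflection coefficient to reflected power fraction and VSWR); the function names are illustrative and not part of the testbed software.

```python
import math


def ratio_to_db(power_ratio: float) -> float:
    """Convert a linear power ratio to decibels."""
    return 10.0 * math.log10(power_ratio)


def reflected_power_fraction(s11_db: float) -> float:
    """Fraction of incident power reflected at the port for a given |S11| in dB."""
    return 10.0 ** (s11_db / 10.0)


def vswr(s11_db: float) -> float:
    """Voltage standing wave ratio for a given |S11| in dB."""
    gamma = 10.0 ** (s11_db / 20.0)
    return (1.0 + gamma) / (1.0 - gamma)


if __name__ == "__main__":
    print(f"37% efficiency      = {ratio_to_db(0.37):.2f} dB")
    print(f"-6 dB reflection    = {100 * reflected_power_fraction(-6.0):.1f}% reflected power")
    print(f"-6 dB reflection    = VSWR {vswr(-6.0):.2f}")
    print(f"10 dB isolation     = {100 * reflected_power_fraction(-10.0):.0f}% coupled power")
```

A reflection coefficient below -6 dB therefore means that at most about a quarter of the incident power is reflected at each port.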

## **V. EXPERIMENTAL EVALUATION**

This section describes the experimental setup used to test the feasibility of our design under realistic conditions. To validate the performance of the MIMO system, we visualize constellations and streamed HD videos, and we evaluate the performance results by streaming random information bits.

# A. EXPERIMENT DEPLOYMENT

A real-time data transmission test is carried out in a typical indoor environment. Figure 13 shows a photograph of the testbed with its main components labeled, including the central controller, switch, co-processor, SDRs, and antenna arrays. In addition, the 8-element antenna module with a height of 1.3 m is located near the chassis, 5.5 m away from the BS antennas. Table 4 lists the parameters



FIGURE 13. Photograph of the assembled testbed. Left: the photo of UE, including an 8-antenna module, a display screen, and baseband processing aspects. Right: the photo of BS, including a 4-antenna module and baseband processing aspects. The 8-element antenna module is located near the chassis, 1.5 meters away from the BS antennas.

#### TABLE 4. Front panel parameters.

| Parameter                        | Variable | Value                               |
|----------------------------------|----------|-------------------------------------|
| Sampling rate                    | $F_{SR}$ | 30.72 MS/s                          |
| Carrier frequency                | $f$      | 3.5 GHz                             |
| LO frequency                     | $f_{LO}$ | 3.49 GHz                            |
| $\mathrm{Sigma}_{\mathbf{z}}$    | $\sigma$ | 0.001                               |
| Tx gain                          | $G_T$    | $\{0,1,\cdots,30\}$ dB              |
| Rx gain                          | $G_R$    | 25 dB                               |
| Modulation                       | -        | QPSK, 16-QAM, 64-QAM, and 256-QAM   |



FIGURE 14. Test results for multiple video streaming transmission. UE successfully recovers the multiple video streams in the absence of channel coding on the left portion of the screen. The obtained constellation points are clearly shown on the right side.

set in the front panel. Note that sigma is the variable for LMMSE MIMO detection (Section II). With these settings, a series of experiments is carried out, including the visualization of video streams and constellation points, as well as the measurement of specific performance indicators.

#### **B. EXPERIMENT RESULTS**

To verify the performance from a reliability standpoint, we enable HD video streaming transmission. We also



**FIGURE 15.** The equalized constellation points of all the available modulation schemes. We use QPSK for Tx 0, 16-QAM for Tx 1, 64-QAM for Tx 2 and 256-QAM for Tx 3.

evaluate the performance metrics by transmitting random information bits.

## 1) HOST-BASED VISUALIZATION

In our system, we choose QPSK to transmit videos. Figure 14 shows the results for multiple video streaming transmissions, which verify the feasibility of our design. UE successfully recovers the multiple video streams in the absence of channel coding and displays them on the left portion of the screen, while the constellation points of the received OFDM data symbols are clearly visible on the right side. Accordingly, QPSK is sufficient to meet the requirement for video transmission, and higher-order modulation schemes provide headroom for more demanding data rates. In Figure 15, each constellation represents a single stream that adopts a particular modulation scheme. Specifically, we use QPSK for Tx 0, 16-QAM for Tx 1, 64-QAM for Tx 2, and 256-QAM for Tx 3.

#### 2) PERFORMANCE METRICS

Figure 16 shows the BER curves, one of the main performance metrics, obtained by sweeping the Tx PA gain while keeping the other system parameters constant. Over $2.4 \times 10^6$, $4.8 \times 10^6$, $7.2 \times 10^6$, and $9.6 \times 10^6$



FIGURE 16. The BER performance according to the Tx power of different data streams for (a) QPSK, (b) 16-QAM, (c) 64-QAM, and (d) 256-QAM modulation. The BERs can be extremely small with enough transmission power using QPSK, 16-QAM, and 64-QAM. With 256-QAM modulation, all streams show an error-floor towards higher Tx gain values.

transmitted bits are collected per Tx power amplifier (PA) gain for the QPSK, 16-QAM, 64-QAM, and 256-QAM modulations, respectively, to obtain a reliable average. Notably, the BERs become very small given enough transmission power with QPSK, 16-QAM, and 64-QAM. With the 256-QAM modulation, all streams show an error floor toward high Tx gain values, probably because the peak values of the channel estimation results exceed the available range and are clipped at high Tx power. On the basis of (17) and the aforementioned BERs, a maximum throughput of 275.51 Mbps can be obtained using the 256-QAM modulation in the current bandwidth. Moreover, Tx 3 shows the best BER, followed by Tx 1 (the BER of Tx 2 is close to that of Tx 1), and the performance of Tx 0 is the worst. This finding can be attributed to differences among the Tx streams, that is, different antennas and RF chains provide different SNRs.
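The bit counts quoted above scale with the number of bits per symbol, so each modulation is evaluated over the same number of symbols per gain setting. The sketch below checks this arithmetic and shows a minimal BER computation; the dictionary and function names are illustrative and not taken from the testbed software.

```python
import numpy as np

# Bits carried by one symbol of each modulation scheme.
BITS_PER_SYMBOL = {"QPSK": 2, "16-QAM": 4, "64-QAM": 6, "256-QAM": 8}

# Transmitted bits collected per Tx PA gain value, as stated in the text.
BITS_PER_GAIN = {"QPSK": 2_400_000, "16-QAM": 4_800_000,
                 "64-QAM": 7_200_000, "256-QAM": 9_600_000}


def symbols_per_gain(modulation: str) -> int:
    """Number of symbols behind the collected bits for one gain setting."""
    return BITS_PER_GAIN[modulation] // BITS_PER_SYMBOL[modulation]


def bit_error_rate(tx_bits: np.ndarray, rx_bits: np.ndarray) -> float:
    """Fraction of positions at which the received bits differ from the sent bits."""
    tx_bits = np.asarray(tx_bits)
    rx_bits = np.asarray(rx_bits)
    return float(np.count_nonzero(tx_bits != rx_bits)) / tx_bits.size


if __name__ == "__main__":
    for modulation in BITS_PER_SYMBOL:
        print(modulation, symbols_per_gain(modulation), "symbols per gain value")
    print("example BER:", bit_error_rate([0, 1, 1, 0], [0, 1, 0, 0]))
```

With a fixed symbol count per point, the smallest BER that a single sweep point can resolve is the reciprocal of the bit count for that modulation.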

Figure 17 provides a snapshot of the SNR calculation. For each pilot symbol, an FFT is conducted, and the 1200 middle subcarriers, which contain the pilot signals as well as noise, are extracted. We first calculate the average noise variance $\sigma_z^2$.



**FIGURE 17.** Snapshots of the SNR calculation. We calculate the average noise variance $\sigma_z^2$, which is then used to measure the SNR with an approximate expression.

By applying this result, the SNR at the *n*-th index can be defined as follows:

$$\operatorname{SNR} = \frac{\mathbb{E}\left\{\|\mathbf{h}_{n}\mathbf{s}_{n}\|^{2}\right\}}{\sigma_{z}^{2}} = \frac{\mathbb{E}\left\{\|\mathbf{y}_{n} - \mathbf{z}_{n}\|^{2}\right\}}{\sigma_{z}^{2}}.$$
 (21)



**FIGURE 18.** The SNR performance according to the Tx power of different Tx streams. The measured SNRs rise with the Tx power.

Assuming that the noise is additive white Gaussian noise (AWGN), we have the following approximate expression:

$$\operatorname{SNR} \simeq \frac{\mathbb{E}\left\{\|\mathbf{y}_{n}\|^{2}\right\} - \sigma_{z}^{2}}{\sigma_{z}^{2}}.$$
 (22)

We set up single-input single-output (SISO) tests to measure the SNR of each of the four Tx channels while sweeping the Tx power. In Figure 18, the SNRs rise with the Tx power, and certain gaps exist among the Tx streams.
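The approximation in (22) can be sketched in a few lines. The example assumes that the noise variance is estimated from a noise-only capture, which is one possible approach and is not confirmed by the text; the function names, the 1200-sample pilot vector, and the noise level are hypothetical.

```python
import numpy as np


def noise_variance(noise_samples: np.ndarray) -> float:
    """Average noise power estimated from noise-only samples (assumed available)."""
    return float(np.mean(np.abs(noise_samples) ** 2))


def snr_db(y: np.ndarray, noise_var: float) -> float:
    """SNR in dB from (E{|y|^2} - noise_var) / noise_var, following (22)."""
    rx_power = float(np.mean(np.abs(y) ** 2))
    snr = (rx_power - noise_var) / noise_var
    # A non-positive estimate means the signal is buried in noise.
    return float(10.0 * np.log10(snr)) if snr > 0 else float("-inf")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_sc = 1200  # middle subcarriers extracted per pilot symbol
    true_noise_var = 0.01
    scale = np.sqrt(true_noise_var / 2)
    noise_only = scale * (rng.standard_normal(n_sc) + 1j * rng.standard_normal(n_sc))
    pilots = np.exp(2j * np.pi * rng.random(n_sc))  # unit-power pilot symbols
    y = pilots + scale * (rng.standard_normal(n_sc) + 1j * rng.standard_normal(n_sc))
    print(f"estimated SNR: {snr_db(y, noise_variance(noise_only)):.1f} dB")
```

Subtracting the noise variance in the numerator matters mostly at low SNR, where the received power is only slightly larger than the noise power.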

# **VI. CONCLUSION**

In this study, we design and implement a MIMO testbed equipped with a particular 5G multi-antenna module according to the LTE-Advanced standard. The block diagram and the general parameters are provided to explain the prototyping system design in detail. Moreover, highly efficient but low-complexity processing schemes are proposed for the rational use of hardware. Subsequently, link-level simulation is presented, and our prototyping testbed is implemented. Several field trials are then carried out to test the performance of the proposed system. The results confirm that four HD videos have been transported successfully without channel coding. Finally, the main performance metrics are evaluated.

## REFERENCES

- [1] R. W. Heath, Jr., M. Honig, S. Nagata, S. Parkvall, and A. C. K. Soong, "LTE-Advanced Pro: Part 1," *IEEE Commun. Mag.*, vol. 54, no. 5, pp. 52–53, May 2016.
- [2] Evolved Universal Terrestrial Radio Access (E-UTRA), document 3GPP TS 36.211 V14.0.0, Sept. 2016.
- [3] Qualcomm shows off 4.5G LTE Advanced Pro. Accessed: Jan. 2017.
   [Online]. Available: http://fudzilla.com/news/mobile/39759-qualcommshows-off-next-gen-lte-advanced-pro
- [4] A. A. Al-Hadi, J. Ilvonen, R. Valkonen, and V. Viikari, "Eight-element antenna array for diversity and MIMO mobile terminal in LTE 3500 MHz band," *Microw. Opt. Technol. Lett.*, vol. 56, no. 6, pp. 1323–1327, Jun. 2014.
- [5] K.-L. Wong and J. Y. Lu, "3.6-GHz 10-antenna array for MIMO operation in the smartphone," *Microw. Opt. Technol. Lett.*, vol. 57, no. 7, pp. 1699–1704, Jul. 2015.

- [7] C. Shepard, H. Yu, and L. Zhong, "ArgosV2: A flexible many-antenna research platform," in *Proc. Annu. Int. Conf. Mobile Comput. Netw.*, Oct. 2013, pp. 163–166.
- [8] C. W. Shepard, "Argos: Practical base stations for large-scale beamforming," Ph.D. dissertation, Dept. Elect. Comput. Eng., Rice Univ., Houston, TX, USA, Apr. 2012.
- [9] C. Shepard et al., "Argos: Practical many-antenna base stations," in Proc. Annu. Int. Conf. Mobile Comput. Netw., Aug. 2012, pp. 53–64.
- [10] X. Yang et al. (2016). "Design and implementation of a TDD-based 128-antenna massive MIMO prototyping system." [Online]. Available: https://arxiv.org/abs/1608.07362
- [11] L.-C. Wung, Y.-C. Lin, Y.-J. Fan, and S.-L. Su, "A robust scheme in downlink synchronization and initial cell search for 3GPP LTE system," in *Proc. 7th Int. Symp. Wireless Pervasive Comput.*, Dalian, China, Feb. 2011, pp. 1–6.
- [12] M. K. Chung, M. S. Sim, D. K. Kim, and C.-B. Chae. (2016). "Compact full duplex MIMO radios in D2D underlaid cellular networks: From system design to prototype results." [Online]. Available: https://arxiv.org/abs/1612.06112
- [13] M. Noh, Y. Lee, and H. Park, "Low complexity LMMSE channel estimation for OFDM," *IEE Proc.-Commun.*, vol. 153, no. 5, pp. 645–650, Oct. 2006.
- [14] O. Edfors, M. Sandell, J.-J. van de Beek, and S. K. Wilson, "OFDM channel estimation by singular value decomposition," *IEEE Trans. Commun.*, vol. 46, no. 7, pp. 931–939, Jul. 1998.
- [15] NI PXIe-8135 User Manual. Accessed: Jan. 2017. [Online]. Available: http://www.ni.com/pdf/manuals/373716b.pdf
- [16] NI PXIe-1085 Series User Manual. Accessed: Jan. 2017. [Online]. Available: http://www.ni.com/pdf/manuals/373712h.pdf
- [17] PXIe-7976 PXI FPGA Module for FlexRIO. Accessed: Jan. 2017. [Online]. Available: http://www.ni.com/en-us/support/model.pxie-7976.html
- [18] USRP-2943 (Software Defined Radio Reconfigurable Device). Accessed: Jan. 2017. [Online]. Available: http://www.ni.com/en-us/support/model.usrp-2943.html
- [19] Installation Guide NI 8262. Accessed: Feb. 2017. [Online]. Available: http://www.ni.com/pdf/manuals/372354b.pdf
- [20] NI PXIe-6674T User Manual. Accessed: Feb. 2017. [Online]. Available: http://www.ni.com/pdf/manuals/373089c.pdf
- [21] OctoClock & OctoClock-G. Accessed: Mar. 2017. [Online]. Available: https://www.ettus.com/product/details/OctoClock
- [22] M. Karkooti, J. R. Cavallaro, and C. Dick, "FPGA implementation of matrix inversion using QRD-RLS algorithm," in *Proc. 39th Asilomar Conf. Signals, Syst., Comput.*, Feb. 2005, pp. 1625–1629.
- [23] M. Sanad, "Double C-patch antennas having different aperture shapes," in *Proc. IEEE AP-S Symp.*, Newport Beach, CA, USA, Jun. 1995, pp. 2116–2119.
- [24] Y. Cheng, W. J. Lu, and C. H. Cheng, "Printed diversity antenna for ultra-wideband applications," in *Proc. IEEE Int. Conf. Ultra-Wideband*, Nanjing, China, Sep. 2010, pp. 1–4.
- [25] G. M. Zelinski, G. A. Thiele, M. L. Hastriter, M. J. Havrilla, and A. J. Terzuoli, "Half width leaky wave antennas," *Microw., Antennas Propag.*, vol. 1, no. 2, pp. 341–348, Apr. 2007.



**XINTONG LU** was born in Jiangsu, China, in 1993. She received the B.S. degree from Southeast University, Nanjing, China, in 2015, where she is currently pursuing the M.S. degree with the School of Information Science and Engineering. Her main research interests include massive MIMO systems, multiantenna user terminals, and finite-bit ADCs.



**LUYAO NI** was born in Jiangsu, China, in 1992. She received the B.S. degree from Southeast University, Nanjing, China, in 2015, where she is currently pursuing the M.S. degree with the School of Information Science and Engineering. Her research interests include PAPR reduction in massive MIMO systems.



**CHAO-KAI WEN** (S'00–M'04) received the Ph.D. degree from the Institute of Communications Engineering, National Tsing Hua University, Taiwan, in 2004. He was with the Industrial Technology Research Institute, Hsinchu, Taiwan, and MediaTek Inc., Hsinchu, from 2004 to 2009. He is currently an Associate Professor with the Institute of Communications Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan. His research interests include the optimization in wireless multimedia networks.



**SHI JIN** (S'06–M'07) received the B.S. degree in communications engineering from the Guilin University of Electronic Technology, Guilin, China, in 1996, the M.S. degree from the Nanjing University of Posts and Telecommunications, Nanjing, China, in 2003, and the Ph.D. degree in communications and information systems from Southeast University, Nanjing, in 2007. From 2007 to 2009, he was a Research Fellow with University College London at Adastral Park, London, U.K. He is currently with the Faculty of the National Mobile Communications Research Laboratory, Southeast University. His research interests include space-time wireless communications, random matrix theory, and information theory. He and his coauthors received the 2010 Young Author Best Paper Award by the IEEE Signal Processing Society and the 2011 IEEE Communications Society Stephen O. Rice Prize Paper Award in the field of communication theory. He serves as an Associate Editor of the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, the IEEE COMMUNICATIONS LETTERS, and the *IET Communications*.



**WEN-JUN LU** (M'12) was born in Jiangmen, China, in 1978. He received the B.E. and Ph.D. degrees in communication engineering and electrical engineering from the Nanjing University of Posts and Telecommunications (NUPT), Nanjing, China, in 2001 and 2007, respectively. He was with the Jiangsu Key Laboratory of Wireless Communications, NUPT, as a Lecturer from 2007 to 2009 and an Associate Professor from 2009 to 2013, where he has been a Professor since 2013.

He has authored or coauthored over 100 technical papers published in peer-reviewed international journals and conference proceedings. He is the translator of the Chinese version of *The Art and Science of Ultra Wideband Antennas* (by H. Schantz). He has authored the book *Antennas: Concise Theory, Design and Applications* (in Chinese). His research interests include antenna theory, wideband antennas, and arrays. He has been serving as an Editorial Board Member of the *International Journal of RF and Microwave Computer-Aided Engineering* since 2014. He was a recipient of the Exceptional Reviewers Award of IEEE TRANSACTIONS ON ANTENNAS AND PROPAGATION in 2016, the Award of New Century Excellent Talents in Universities from the Ministry of Education of China in 2012, and the Nomination Award of Top-100 Outstanding Ph.D. Dissertation of China in 2009. He was also a co-recipient of six other scientific and technological awards granted by the Jiangsu Province, the Chinese Institute of Electronics, and the Chinese Institute of Communications.