

Received April 25, 2021, accepted May 7, 2021, date of publication May 10, 2021, date of current version May 18, 2021. *Digital Object Identifier* 10.1109/ACCESS.2021.3078844

# An 11.05 mW/Gbps Quad-Channel 1.25-10.3125 Gbps Serial Transceiver With a 2-Tap Adaptive DFE and a 3-Tap Transmit FFE in 40 nm CMOS

# HONG CHEN<sup>1</sup>, (Senior Member, IEEE), DENGJIE WANG<sup>1</sup>, ZIQIANG WANG<sup>2</sup>, SHUAI YUAN<sup>1,3</sup>, CHUN ZHANG<sup>1</sup>, (Member, IEEE), AND ZHIHUA WANG<sup>1</sup>, (Fellow, IEEE)

<sup>1</sup>Institute of Microelectronics, Tsinghua University, Beijing 100084, China

<sup>2</sup>Guangdong Engineering Research Center on ICs for Wireless Healthcare, Research Institute of Tsinghua University in Shenzhen, Shenzhen 518057, China <sup>3</sup>HiSilicon Technologies Company Ltd., Beijing 100084, China

Corresponding author: Dengjie Wang (wangdj15@mails.tsinghua.edu.cn)

This work was supported in part by the National Science and Technology Major Project from Minister of Science and Technology, China, under Grant 2018AAA0103100, in part by the National Natural Science Foundation of China under Grant U19B2041, in part by the Science, Technology and Innovation Commission of Shenzhen Municipality under Grant 20180306170609470 and Grant SGLH20180622095014688, in part by the Beijing Engineering Research Center under Grant BG0149, and in part by the Tsinghua National Laboratory for Information Science and Technology under Grant 042003266.

**ABSTRACT** This paper presents a quad-channel 1.25-10.3125 Gbps wireline transceiver implemented in 40 nm CMOS technology. The transmitter consists of a bit-width adjustment block, a 40:2 multiplexer, a 2:1 multiplexer, and a current-mode logic driver with a 3-tap feedforward equalizer. The receiver has a two-stage continuous-time linear equalizer, a 2-tap half-rate fully adaptive decision-feedback equalizer, and a phase interpolation-based digital clock and data recovery (CDR) circuit followed by a 2:40 demultiplexer and a bit-width adaption block. The transceiver also supports AC/DC coupling, CDR lock detection, PLL lock detection, loss-of-signal detection, and automatic termination impedance calibration. A ring VCO-based PLL is designed in each lane to save power, and a dual-core LC VCO-based PLL is implemented in each bank to generate a low-jitter clock signal. At 10.3125 Gbps, the transceiver can equalize 28 dB Nyquist loss at a bit error rate of $10^{-12}$, and it consumes 114 mW with a 1.1 V supply. This work achieves a power efficiency of 11.05 mW/Gbps, and the transceiver is suitable for multi-standard applications due to its flexibility and power efficiency.

**INDEX TERMS** Transceiver, feedforward equalizer, continuous-time linear equalizer (CTLE), adaptive decision-feedback equalizer (DFE), clock and data recovery (CDR).

# I. INTRODUCTION

With the development of interconnect technology, high-speed serial transceivers are widely used in computers, embedded systems, communication networks, and consumer electronic products, and different applications often use different protocols. It is therefore of great significance to develop a multi-standard wireline transceiver that meets the urgent need for real-time, flexible interconnection and high-speed data exchange among devices with different protocols. Most reported multi-standard designs target field-programmable gate array (FPGA) applications [1]–[6]. An FPGA transceiver supports many protocols and must have fine-grained programmability, so its structure is relatively complex and its power consumption is high. Ethernet [7], RapidIO [8], and Fibre Channel (FC) [9] are three widely used interconnect technologies with high performance, high bandwidth, and low latency. It is desirable to have a customized transceiver that can support these standards with high power efficiency.

The associate editor coordinating the review of this manuscript and approving it for publication was Gian Domenico Licciardo.

Such transceivers must contend with significant challenges from aggressive equalization, power efficiency, wide frequency range, and the different electrical requirements of multiple standards. The transceiver must support a variety of channels with different channel losses at the maximum data rate, which requires a flexible equalization scheme. On the other hand, the tradeoff between equalization capability and power efficiency makes the design more difficult. Moreover, the transceiver should work normally at nine data rates: 1.25 Gbps, 2.5 Gbps, 5 Gbps, 2.125 Gbps, 4.25 Gbps, 8.5 Gbps, 3.125 Gbps, 6.25 Gbps, and 10.3125 Gbps. The transceiver may also support channel-binding mode (different protocols with various lanes) and needs to support 10-bit, 20-bit, and 40-bit parallel data widths. All these requirements make the clock distribution network more complex and increase the design difficulty of the transceiver.

FIGURE 1. Quad transceiver architecture.

In order to cope with these challenges, a 1.25-10.3125 Gbps transceiver with a 3-tap transmitter (TX) feed-forward equalizer (FFE), a 2-tap adaptive decision feed-back equalizer (DFE), and 2-stage continuous-time-linear-equalizer (CTLE) is presented in this paper. Fig. 1 shows the overall architecture of the quad transceiver, which employs a flexible clock structure of one Ring-PLL (RPLL) per lane and one dual-core LC PLL (CPLL) per quad to guarantee the wide working data rate and allow independent or shared clock source for each lane for supporting the channel-binding mode. Moreover, CMOS logic is suitable for low power design, and the power can scale with the data rate. In our work, we adopt CMOS logic operation as much as possible.

The paper is organized as follows: Section II describes the architecture of the TX, Section III details the receiver (RX), Section IV introduces wideband clocking techniques, Section V presents the measurement results, and Section VI concludes the paper.

## **II. TRANSMITTER**

Fig. 2 shows the proposed TX architecture. The programmable serializer first converts 10-bit, 20-bit, or 40-bit parallel data into 40-bit, and a 40:2 multiplexer (MUX) using the C<sup>2</sup>MOS technique converts 40 bit into 2-bit data at half rate. A 2:1 MUX (to convert half-rate data stream to full-rate data stream) based on the dynamic transmission-gate [10] is employed to trade off the power and speed. The TX includes a current-mode logic (CML) driver with a 3-tap FFE to realize the TX de-emphasis for pre-distorting the signal before transmission. In addition, a separate CML pre-driver is designed as the source for the TX-RX loopback path for full-speed production testing.

The CML driver with adjustable tail currents controlled by digital-to-analog converters (DACs) is designed to achieve flexibility in output swing and equalization as well as high energy efficiency. As shown in Fig. 3, the driver taps (PRE, MAIN, and POST) are implemented as open-drain CML differential pairs. The outputs of the driver taps are shorted together and connected to a calibrated resistor. The output swing and the equalization are set by adjusting the DAC at the tail current of each driver tap. Taking the MAIN tap as an example, it has an 8-bit DAC, where eq<0:3> adjusts the equalization coefficient (and can also be used for swing), sw<0:3> adjusts the output swing (and can also be used for equalization), and the adjustable vref sets the base current of the output driver. Similarly, the PRE and POST taps are regulated by a 4-bit DAC (1 bit for redundancy) and a 6-bit DAC (2 bits for redundancy), respectively. Using $C_{-1}$, $C_0$, and $C_1$ to represent the tap coefficients of PRE, MAIN, and POST, respectively, the equalization capability of this FFE is given by (1). When the PRE and POST taps are set to their maximum, the FFE equalization capability is the largest, as described in (2); with $C_0 = 0$, it reaches 12.3 dB.

$$EQ_{FFE} = -20\log\left(\frac{36+C_0-C_{-1}-C_1}{36+C_0+C_{-1}+C_1}\right) \quad (1)$$

$$EQ_{FFEMAX} = -20\log\left(\frac{14+C_0}{58+C_0}\right) \quad (2)$$
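As a quick numerical check of (1) and (2), the following minimal sketch evaluates the FFE equalization capability for given tap codes. The maximum codes of 7 for PRE and 15 for POST are an assumption inferred from the usable DAC bits (3 and 4 bits after redundancy) and from the fact that they reduce (1) to (2); the function name is illustrative only.

```python
import math

def ffe_boost_db(c_pre: int, c_main: int, c_post: int) -> float:
    """Equalization capability of the 3-tap FFE according to Eq. (1), in dB."""
    numerator = 36 + c_main - c_pre - c_post
    denominator = 36 + c_main + c_pre + c_post
    return -20.0 * math.log10(numerator / denominator)

# Assumed full-scale codes: 3 usable PRE bits and 4 usable POST bits.
C_PRE_MAX = 7
C_POST_MAX = 15

# With C0 = 0 this is Eq. (2): -20*log10(14/58).
max_boost_db = ffe_boost_db(C_PRE_MAX, 0, C_POST_MAX)
no_boost_db = ffe_boost_db(0, 0, 0)

print(f"maximum FFE boost: {max_boost_db:.2f} dB")
print(f"boost with PRE/POST off: {no_boost_db:.2f} dB")
```

With these assumed codes the result lands slightly above 12.3 dB, in line with the value quoted in the text.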

FIGURE 2. Transmitter block diagram.

FIGURE 3. Schematic of driver.

FIGURE 4. Termination calibration process.

The poly resistance, which dominates the output impedance of the CML drivers used in the transceiver, has a 20% absolute variation over process, voltage, and temperature (PVT), leading to a variation of more than 20% in output impedance. However, the standards allow only a $\pm 10\%$ variation of the termination resistance. To track this wide variation, a termination resistor calibration circuit is designed. The same current flows through the on-chip resistor array (a replica of the termination) and through the off-chip precision resistor, and successive-approximation register (SAR) logic adjusts the control code according to the result of comparing the voltages on the on-chip and off-chip resistors. When the two voltages are equal, the calibration is complete. Details on the transistor-level design of the calibration circuit are provided in our previous work [11]. Fig. 4 demonstrates the calibration process. C<5:0> is the control word of the on-chip replica resistor array of the termination, the red line VRE in Fig. 4 is the voltage on the off-chip precision resistor, and the blue line VR is the voltage on the replica resistor array. VR approaches VRE under the control of the SAR logic, and finally the termination resistance is calibrated to 50 Ohm. Moreover, the simulation results show that the calibration circuit achieves 4% resistance accuracy over PVT and allows high power efficiency without any significant impact on inter-symbol interference (ISI) or the rise/fall time of the output driver.
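The SAR-based calibration can be illustrated with a small behavioral sketch. The replica array model below (a fixed 70-ohm branch in parallel with switched 5.6-kohm unit branches, all scaled by the poly variation) is purely hypothetical and is not the circuit of [11]; only the 6-bit control word, the 50-ohm reference, and the compare-and-decide loop follow the description above.

```python
R_REF_OHM = 50.0   # off-chip precision resistor
N_BITS = 6         # width of the control word C<5:0>

def replica_resistance(code: int, process_scale: float) -> float:
    """Hypothetical replica array: resistance falls as more branches are enabled."""
    conductance = (1.0 / 70.0 + code / 5600.0) / process_scale
    return 1.0 / conductance

def sar_calibrate(process_scale: float, i_bias: float = 1e-3) -> int:
    """Successive approximation: test one bit per step, MSB first."""
    v_re = i_bias * R_REF_OHM            # voltage on the off-chip resistor
    code = 0
    for bit in reversed(range(N_BITS)):
        trial = code | (1 << bit)
        v_r = i_bias * replica_resistance(trial, process_scale)
        if v_r >= v_re:                  # replica still too resistive: keep the bit
            code = trial
    return code

# Sweep the assumed +/-20% poly-resistance spread.
results = {}
for scale in (0.8, 1.0, 1.2):
    code = sar_calibrate(scale)
    results[scale] = (code, replica_resistance(code, scale))
    print(f"scale={scale:.1f} code={code:2d} R={results[scale][1]:.2f} ohm")
```

In this toy model the calibrated resistance stays within one LSB step of 50 ohm across the assumed spread; the accuracy of the real circuit depends on the actual array and comparator.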

#### **III. RECEIVER**

Fig. 5 shows the block diagram of the receiver, which supports AC/DC coupling and uses the same termination resistor as the TX. The RX inputs first go through two stages of CTLE and GAIN to cancel long-tail ISI. Then, the signal is buffered into the DFE sampler to generate the data, edge, and DFE error samples. The data and DFE error are sampled at the center of the data eye, and the edge is sampled at the edge of the data eye. After de-multiplexing, data and edge are used by the clock and data recovery (CDR) loop to recover the sampling clock, and the DFE error is used for DFE adaption. Finally, the data is deserialized to 40 bits and then converted to a 10/20/40-bit output as required by the standard. The RX also has function blocks that meet other requirements of the standards (e.g., an on-chip eye monitor, a loss-of-signal detector, and a CDR lock detector).

## A. LINEAR EQUALIZER

FIGURE 5. Receiver block diagram.

The CTLE circuit removes the long-tail post-cursor ISI by boosting the high-frequency content of the received signal. A traditional CTLE with capacitive and resistive source degeneration [2], [6], [12] is popular for its low power and area overhead, but it trades off low-frequency gain against boost factor. To relax this tradeoff, passive inductors [13], [14] are often adopted to broaden the bandwidth and enlarge the boost. However, an inductor occupies a large area. Active inductor [15] and negative capacitor [16] solutions have been proposed to replace the inductor; they offer advantages in area and flexibility at the cost of a higher supply voltage or power. In this quad-lane transceiver design, we give priority to area and power efficiency. Therefore, the traditional CTLE structure with capacitive and resistive source degeneration is chosen. As presented in Fig. 6(a), the degeneration resistor is composed of an NMOS transistor $M_5$ and a poly resistor $R_1$, and the degeneration capacitor consists of two MOS varactors ($C_1$, $C_2$). The gate of $M_5$ is connected to the positive terminals of $C_1$ and $C_2$. $R_S$ and $C_S$ are controlled by the voltage $V_{CTLE}$, which affects the zero frequency ($\omega_{Z1}$), the amount of peaking, and the DC gain. To avoid a large low-frequency loss and realize a 16 dB boost target, two stages of CTLE and GAIN are designed. The GAIN stage is a differential amplifier that provides a low DC gain to make up for the DC loss of the CTLE. As the number of cascaded CTLE and GAIN stages increases to achieve a higher boost factor, the overall bandwidth tends to drop unless a greater low-frequency loss is allowed in each stage. Therefore, we put the peaking frequency at 1.2-1.4 times the Nyquist frequency in this design to preserve the overall bandwidth. The simulation results of the cascaded CTLE and GAIN are given in Fig. 6(b). When $V_{CTLE}$ changes from 100 mV to 0.9 V, the boost of a single CTLE and GAIN stage at 5 GHz varies from 8.9 dB to 0.3 dB with a worst-case DC gain of -115 mdB, so two stages provide a maximum equalization gain of 17.8 dB.
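To illustrate how source degeneration produces the boost, the sketch below evaluates the textbook small-signal response of an RC-degenerated differential pair (one zero, two poles) and cascades two identical stages. All component values are assumptions chosen only for illustration and are not taken from this design; the only figures from the text are the 8.9 dB per-stage boost and the 17.8 dB two-stage total.

```python
import math

def ctle_gain(freq_hz, gm, r_s, c_s, r_d, c_l):
    """Textbook response of a CTLE with RC source degeneration (complex gain)."""
    s = 2j * math.pi * freq_hz
    degeneration = 1.0 + gm * r_s / 2.0
    zero = 1.0 + s * r_s * c_s                   # zero at 1/(R_S*C_S)
    pole1 = 1.0 + s * r_s * c_s / degeneration   # degeneration pole
    pole2 = 1.0 + s * r_d * c_l                  # output-node pole
    return (gm * r_d / degeneration) * zero / (pole1 * pole2)

def boost_db(freq_hz, **params):
    """Gain at freq_hz relative to the DC gain, in dB."""
    return 20.0 * math.log10(abs(ctle_gain(freq_hz, **params)) / abs(ctle_gain(0.0, **params)))

# Assumed (hypothetical) component values, not from the paper.
stage = dict(gm=8e-3, r_s=400.0, c_s=0.4e-12, r_d=250.0, c_l=50e-15)

single_stage_db = boost_db(5e9, **stage)
two_stage_db = 2 * single_stage_db            # identical stages add in dB
ideal_peaking_db = 20.0 * math.log10(1.0 + stage["gm"] * stage["r_s"] / 2.0)

print(f"single stage: {single_stage_db:.2f} dB, two stages: {two_stage_db:.2f} dB")
print(f"ideal peaking limit per stage: {ideal_peaking_db:.2f} dB")

# Reported per-stage boost from the text, cascaded twice.
reported_two_stage_db = 2 * 8.9
```

The output pole keeps the realized boost below the ideal peaking of $1 + g_m R_S / 2$, which is one reason a second stage is needed to reach the boost target.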

# B. DFE

Compared with full-rate DFEs [17]–[19], half-rate DFEs [20]–[22] are favored for their simpler CDR circuit and clock path. Given that this work targets a quad-channel design, a 2-tap half-rate direct-feedback DFE is designed. As presented in Fig. 7, the DFE has two slicers: an even slicer and an odd slicer. Each slicer consists of a DFE summer, edge sampling, data sampling, and error sampling. The data and edge sampling include two D flip-flops (DFFs) and a CML-to-CMOS signal converter; the error sampling has two summers, four DFFs, and two CML-to-CMOS signal converters. An extra latch is added in the even slicer for data synchronization.



FIGURE 6. (a) Schematic of CTLE and (b) Transfer function and simulation results of CTLE.

The DFE summer depicted in Fig. 8 is a CML summer, which sums the weighted feedback signal from previous data decisions with the signal from the CTLE. $V_{IP}$ and $V_{IN}$ are the outputs of the CTLE, $D_{1P}$ and $D_{1N}$ are the first feedback tap data, and $D_{2P}$ and $D_{2N}$ are the second. The tap coefficients are adjusted by the gate voltages $TAP_1$ and $TAP_2$ of $M_{11}$ and $M_{12}$, which are generated by a 6-bit DAC. The signal DFE\_MODE can be configured to disable the DFE function when the insertion loss is low. $M_7$ in the MAIN tap is for matching the tap branch. The summer in the error sampling path is similar to the DFE summer and is used as a comparator: the signal from the DFE summer $(V_{IP}, V_{IN})$ and the eye height $(V_H - V_L \text{ or } V_L - V_H)$ are compared through the tail current. The DFF, composed of two traditional CML latches, converts an analog signal to a "digital" signal, which is then fed back to cancel the corresponding post-cursor ISI. Moreover, an amplifier-based CML-to-CMOS converter converts the CML level to the CMOS level to save power.

FIGURE 7. The structure of half-rate DFE.

FIGURE 8. Schematic of DFE summer.

Because of the feedback nature of the DFE, the key to DFE design is to meet the timing constraint set by the critical path of the feedback loop, especially for the first tap, whose timing constraint is one unit interval (UI). As shown by the red line in Fig. 7, the first-tap critical path of the DFE includes two DFFs and a DFE summer. The total delay includes the delay from *CIP* sampling to the output data of the first DFF in the even slicer, the delay of the DFE summer, and the setup time of the first DFF in the odd slicer (sampled by *CIN*), and it should be less than 1 UI. That is

$$T_{ckq} + T_{setup} + T_{settle} < 1UI, \tag{3}$$

where $T_{ckq}$ is the clock-to-output delay of the DFF, $T_{setup}$ is the setup time of the DFF, and $T_{settle}$ is the delay of the DFE summer. Fig. 9 shows the closed-loop first tap with the CML DFF. As the output of the DFF is directly connected to the feedback stage ($M_3$ and $M_4$ in Fig. 9), a few conclusions can be drawn. First, the $T_{ckq}$ of the CML DFF is directly related to the definition of the Q point, which can be defined as the voltage at which the input devices of the feedback stage interpret the signal produced by the CML DFF as a *digital* level. Second, the larger the feedback stage is, the smaller the clipping voltage (*digital* level) is, but the larger the capacitive loading on the summation node. Considering that it is not difficult to achieve 5 GHz bandwidth in 40 nm CMOS, a larger feedback stage (2.5u/0.04u) is designed to relax the required CML DFF output swing and shorten the DFF delay. Fig. 10 shows the simulated effective feedback current of the first tap for various single-ended swings of the tap data ($D_{1P}$, $D_{1N}$) in the DFE summer.

FIGURE 9. Closed-loop first tap with CML DFF.

FIGURE 10. Effective tap current vs. single-ended TAP data swing.

FIGURE 11. T<sub>ckq</sub> vs. single-ended output swing of DFF.

FIGURE 12. (a) The input eye of DFE, (b) the output eye of even DFE summer, and (c) the output eye of odd DFE summer.

FIGURE 13. Block diagram of CDR.

We can conclude that when the tap data swing is above 400 mV, the current utilization rate is more than 90%. Meanwhile, as reported in Fig. 11, the relationship between $T_{ckq}$ and the single-ended output swing of the DFF is simulated with a 300 mV single-ended input swing at the DFF. With the Q point defined as 400 mV, the worst-case $T_{ckq}$ is 47.97 ps over PVT. In addition, the simulation results confirm that the sum of $T_{settle}$ and $T_{setup}$ is less than 30 ps over PVT. That is, the total delay of the critical path does not exceed 77.97 ps, which is less than 1 UI at 10.3125 Gbps (about 97 ps) and therefore meets the timing constraint over 1.25-10.3125 Gbps.
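The timing budget in (3) can be checked numerically for every supported data rate. The sketch below uses only the worst-case delays quoted above; the variable names are illustrative.

```python
def unit_interval_ps(data_rate_gbps: float) -> float:
    """Length of one unit interval in picoseconds."""
    return 1000.0 / data_rate_gbps

T_CKQ_PS = 47.97                # worst-case clock-to-Q delay over PVT (from the text)
T_SETTLE_PLUS_SETUP_PS = 30.0   # upper bound on T_settle + T_setup (from the text)
critical_path_ps = T_CKQ_PS + T_SETTLE_PLUS_SETUP_PS

DATA_RATES_GBPS = (1.25, 2.125, 2.5, 3.125, 4.25, 5.0, 6.25, 8.5, 10.3125)

margins_ps = {}
for rate in DATA_RATES_GBPS:
    ui = unit_interval_ps(rate)
    margins_ps[rate] = ui - critical_path_ps
    print(f"{rate:8.4f} Gbps: UI = {ui:7.2f} ps, margin = {margins_ps[rate]:7.2f} ps")
```

The tightest case is 10.3125 Gbps, where roughly 19 ps of margin remains in this first-tap budget.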

Fig. 12 shows the simulated eye diagrams in the TT corner at the output of each DFE summer when the DFE is fed by differential input data with a 400 mV single-ended swing that has been filtered by a $(0.7+0.2Z^{-1}+0.1Z^{-2})$ channel. Good timing and voltage margins are both achieved, verifying the function of the DFE.
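The effect of the two feedback taps on this test channel can be reproduced with a short behavioral model. This is an idealized, normalized sketch (bit-spaced samples, ±1 symbols, no noise or timing effects), not the transistor-level simulation behind Fig. 12; the function and variable names are illustrative.

```python
from itertools import product

CHANNEL = (0.7, 0.2, 0.1)   # main cursor and two post-cursors, from the text

def dfe_eye(bits, taps):
    """Return the decisions and the worst-case vertical opening at the slicer."""
    sent = [1, 1]      # assumed history: the two bits before the pattern were +1
    decided = [1, 1]   # matching decision history, most recent first
    decisions, levels = [], []
    for d in bits:
        y = CHANNEL[0] * d + CHANNEL[1] * sent[0] + CHANNEL[2] * sent[1]
        z = y - taps[0] * decided[0] - taps[1] * decided[1]   # DFE summer
        d_hat = 1 if z >= 0 else -1
        decisions.append(d_hat)
        levels.append(abs(z))
        sent = [d, sent[0]]
        decided = [d_hat, decided[0]]
    return decisions, min(levels)

# Concatenate all eight 3-bit patterns so every ISI combination occurs.
bits = [b for triple in product((-1, 1), repeat=3) for b in triple]

_, eye_without_dfe = dfe_eye(bits, taps=(0.0, 0.0))
decisions, eye_with_dfe = dfe_eye(bits, taps=(0.2, 0.1))

print(f"worst-case level without DFE: {eye_without_dfe:.2f}")
print(f"worst-case level with 2-tap DFE: {eye_with_dfe:.2f}")
```

With the taps matched to the two post-cursors, the worst-case level at the summer output returns to the main-cursor amplitude, which is the behavior the eye diagrams are meant to confirm.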

# C. CDR AND DFE ADAPTION

A phase interpolation (PI)-based digital CDR is designed to meet such a wide working data rate of 1.25-10.3125 Gbps.

Fig. 13 depicts the block diagram of the CDR, which includes a 2-to-4 de-multiplexer, a bang-bang phase detector (BBPD), a majority voter, a digital filter, a PI decoder, and a PI. The data and edge samples from the DFE sampler are first deserialized to 4 bits, which reduces the operating frequency of the subsequent digital modules to about 2.6 GHz at the maximum data rate. As a result, these modules can be designed with standard cells, which not only reduces the overall power dissipation but also shortens the design cycle. The BBPD generates phase error information (early/late), and the voter reduces the rate of the phase error samples to one compatible with the digital signal processing of the filter.

The digital filter is based on a state machine [23]: an *early* signal is generated when the accumulated number of *early* inputs exceeds that of *late* inputs by N, and a *late* signal is output when the accumulated number of *late* inputs exceeds that of *early* inputs by N. The bandwidth of the loop filter can be adjusted by changing the value of N. The filter is implemented with a ring shift register, which has the advantages of a simple structure and low loop delay.
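The phase detection and filtering chain can be sketched behaviorally as follows. The BBPD uses the common Alexander decision rule, the early/late polarity is a convention chosen here, and the filter is modeled with an integer accumulator that resets after each output; the actual design uses a ring shift register [23], so this is an assumption-laden illustration rather than the implemented logic.

```python
def bbpd(prev_data: int, edge: int, curr_data: int) -> int:
    """Alexander-style decision: +1 early, -1 late, 0 when there is no transition."""
    if prev_data == curr_data:
        return 0
    return 1 if edge == prev_data else -1

def majority_vote(decisions) -> int:
    """Collapse several early/late decisions into one vote (+1, -1, or 0)."""
    total = sum(decisions)
    return (total > 0) - (total < 0)

def loop_filter(votes, n: int):
    """Emit +1 (early) or -1 (late) once one side leads the other by n."""
    acc = 0
    outputs = []
    for vote in votes:
        acc += vote
        if acc >= n:
            outputs.append(1)
            acc = 0          # assumed: restart accumulation after each output
        elif acc <= -n:
            outputs.append(-1)
            acc = 0
        else:
            outputs.append(0)
    return outputs

steady_early = loop_filter([1] * 8, n=4)
dithering = loop_filter([1, -1] * 4, n=2)
print(steady_early)
print(dithering)
```

A larger N makes the filter respond more slowly to a consistent phase error and ignore more dithering, which corresponds to a lower loop bandwidth.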

A conventional PI, depicted in Fig. 14, is used in this work. The output clock is the weighted sum of the input quadrature clocks CI and CQ. The principle of the PI can be explained by (4), (5), and (6).

$$CK_{PI} = A1 \sin(\omega t) + A2 \cos(\omega t) \tag{4}$$

$$CK_{PI} = \sqrt{A1^2 + A2^2} \sin(\omega t + \varphi) \tag{5}$$

$$\tan\left(\varphi\right) = \frac{A2}{A1} \tag{6}$$

FIGURE 14. Schematic of PI.

FIGURE 15. Variable bandwidth clock buffer.

FIGURE 16. Block diagram of DFE adaption loop.

A1 and A2 are determined by the tail currents, which are controlled by C0<0:15>, C90<0:15>, C180<0:15>, and C270<0:15>. A fixed resistor is connected to the output terminal of the I/Q path, so the total current of the I and Q paths must be kept constant to guarantee a fixed output common-mode level for all PI codes. That is, A1 + A2 is a constant. The theoretical analysis in (4) assumes that the input quadrature clocks are sine waves. However, when the input clock has sharp edges, the PI linearity deteriorates rapidly, and the interpolation can even fail. For example, at a data rate of 10 Gbps the input clock is close to a sine wave, while at 1 Gbps it becomes a square wave, resulting in PI failure and loss of CDR lock. To address this problem, we design a variable-bandwidth buffer (called clock preprocessing) before the PI, as shown in Fig. 15. The clock preprocessing circuit consists of a CML buffer, resistors, capacitors, and switches. When the receiver works at high speed (8.5-10.3125 Gbps), S1, S2, and S3 are closed, the resistors R, 2R, and 4R are shorted, and only the small capacitor C is connected to the CML buffer, which makes the PI input clocks CKIP, CKIN, CKQP, and CKQN closer to sine waves. When the receiver works at a lower speed (6.25-8.5 Gbps), the clock's rising and falling edges are steeper than at high speed. S1 is then opened so that R and C are both connected to the CML buffer, and the RC filter smooths the clock edges to ensure the correct function of the PI. Similarly, in the lower data rate ranges (4.25-6.25 Gbps and 1.25-4.25 Gbps), *S2* and *S3* are opened correspondingly to make the PI input clocks *CKIP*, *CKIN*, *CKQP*, and *CKQN* closer to triangular waves. As the data rate goes from 1.25 Gbps to 10.3125 Gbps, the single-ended output swing of the clock buffer varies from 500 mV to 260 mV. The input-referred 1 dB compression point of the PI circuit is about 600 mV, which guarantees the PI linearity over all data rates. Using resistors and switches saves more area than using capacitors alone, and the passive components introduce no extra power consumption.
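Equations (4) to (6), together with the constraint that A1 + A2 is constant, determine the output phase and amplitude for each PI code within one quadrant. The sketch below tabulates them for ideal sinusoidal inputs; the choice of 16 unit current steps per quadrant is an assumption based on the 16-bit control words, and the model ignores the edge-shape effects discussed above.

```python
import math

STEPS = 16   # assumed number of unit current cells shared between I and Q

def pi_output(code: int):
    """Amplitude and phase (degrees) from Eqs. (5) and (6) with A1 + A2 = 1."""
    a1 = (STEPS - code) / STEPS   # weight of the in-phase clock
    a2 = code / STEPS             # weight of the quadrature clock
    amplitude = math.hypot(a1, a2)
    phase_deg = math.degrees(math.atan2(a2, a1))
    return amplitude, phase_deg

table = [pi_output(code) for code in range(STEPS + 1)]

# Deviation from an ideal, uniformly spaced phase step of 90/STEPS degrees.
max_phase_error_deg = max(
    abs(phase - 90.0 * code / STEPS) for code, (_, phase) in enumerate(table)
)
min_amplitude = min(amplitude for amplitude, _ in table)

print(f"maximum phase error: {max_phase_error_deg:.2f} degrees")
print(f"minimum relative amplitude: {min_amplitude:.3f}")
```

Even with perfect sine-wave inputs, linear current weighting leaves a phase error of a few degrees and an amplitude dip at mid-code; sharper clock edges make the interpolation worse, which motivates the clock preprocessing stage.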

FIGURE 17. (a) Definition of DFE error and (b) unit pulse response.

The DFE adaption logic [10], based on the sign-sign least-mean-square (SS-LMS) algorithm, is used in this work. As presented in Fig. 16, the DFE adaptation loop is similar to the CDR loop, and the input of the SS-LMS logic is 8-bit data de-multiplexed from the DFE sampler. The DFE error data X in Fig. 17(a) is generated by the DFE sampler. $V_H - V_L$ in Fig. 17(a) is the expected eye height: when the data is higher than the expected eye height, the error data X is 1; otherwise it is 0. The SS-LMS logic in [10] converts the DFE error signal (X) to the ISI error signal (S). The relationship between X and S is

$$S_N = -X_N * D_N * D_{N-K}. \tag{7}$$

Take the first post-cursor tap as an example, and ignore the ISI from other taps. Suppose $D_N$ and $X_N$ are both 1, which corresponds to the $t_{samp1}$ sampling point on the red line in Fig. 17(a). If $D_{N-1}$ (1 UI before $D_N$) is 1, the unit-pulse response in Fig. 17(b) shows that the ISI from $D_{N-1}$ (*h1*) is positive. The current data is higher than the ideal eye height, that is, the ISI has not been eliminated, so the DFE is under-equalizing. If $D_{N-1}$ is 0, its ISI at the current sampling point is negative, and the current data should be lower than the expected eye height; data higher than that level therefore indicates over-equalizing.
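A behavioral sketch of the adaptation based on (7) is given below. Several points are assumptions made for illustration: logic 0 is mapped to -1 for both data and error, the expected eye level is taken as known, the step size is arbitrary, and each tap is decreased by the step size times $S$ (so a negative $S$, i.e. under-equalization, raises the tap), which is our reading of the discussion above rather than the logic of [10].

```python
import random

CHANNEL = (0.7, 0.2, 0.1)   # same test channel as used for the DFE eye simulation
TARGET = 0.7                # assumed expected eye level (normalized)
MU = 0.002                  # assumed adaptation step size

def adapt(num_bits: int = 5000, seed: int = 1):
    """Run sign-sign LMS on two DFE taps starting from zero."""
    rng = random.Random(seed)
    taps = [0.0, 0.0]
    sent = [1, 1]       # transmitted history, most recent first
    decided = [1, 1]    # decision history, most recent first
    for _ in range(num_bits):
        d = rng.choice((-1, 1))
        y = CHANNEL[0] * d + CHANNEL[1] * sent[0] + CHANNEL[2] * sent[1]
        z = y - taps[0] * decided[0] - taps[1] * decided[1]
        d_n = 1 if z >= 0 else -1
        x_n = 1 if abs(z) > TARGET else -1        # error: outside the expected level?
        for k in range(2):
            s = -x_n * d_n * decided[k]           # Eq. (7) with K = k + 1
            taps[k] -= MU * s                     # under-equalized (s < 0): raise tap
        sent = [d, sent[0]]
        decided = [d_n, decided[0]]
    return taps

tap1, tap2 = adapt()
print(f"adapted taps: {tap1:.3f}, {tap2:.3f}")
```

In this idealized setting the taps settle within a few step sizes of the two post-cursor values and then dither around them, which is the expected steady-state behavior of a sign-sign loop.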

## **IV. CLOCKING**

The overall clock structure of one RPLL per lane and one CPLL per quad, shown in Fig. 1, is designed to achieve high clock flexibility across multiple TXs and RXs. The clock source for each lane can be independently selected from three clock sources: the RPLL, the CPLL, and an external clock source. In addition, separate frequency dividers are instantiated in each lane to support lower data rates.

The ring VCO proposed in previous work [24] covers a frequency range of 8-11 GHz. It uses two differential delay cells with cross-coupled MOSFET pairs. The tail-current NMOS works in the linear region to expand the output swing and achieve better phase noise, and a single PMOS is used in the cross-coupled pairs to reduce power consumption. The ring VCO adopts a 16-band design and VCO gain ($K_{VCO}$) compensation to avoid the high output spurs caused by an excessively large $K_{VCO}$. The RPLL output frequencies using this VCO are 8.5 GHz, 10 GHz, and 10.3125 GHz. The 2.125 GHz and 4.25 GHz clocks generated from 8.5 GHz are for the FC protocol. The 1.25 GHz clock, produced by dividing 10 GHz by eight, is for 1000BASE-X, and 10.3125 GHz is for 10GBASE-KR.

The CPLL with dual VCO cores in [25] is designed to generate 10 GHz, 10.3125 GHz, and 12.5 GHz clocks. The 5 GHz, 2.5 GHz, and 1.25 GHz clocks are generated by dividing the 10 GHz clock, and the 6.25 GHz and 3.125 GHz clocks are produced by dividing the 12.5 GHz clock. The dual VCO cores are adopted to provide the required frequency range and ensure high clock quality. VCO1 is enabled for 10 GHz and 10.3125 GHz, and VCO2 is selected for 12.5 GHz. To avoid interaction between the two VCO cores, only one VCO works at any given time (the other is powered down), and they are placed 260 µm apart in the layout for isolation. The dual VCO cores with a 4-bit capacitor array considerably enlarge the frequency tuning range without significantly degrading the Q factor of the LC tank. The 4-bit capacitor array also greatly reduces K<sub>VCO</sub>, which helps to decrease the loop filter area and lower the spurs. The CPLL is a fully differential system that effectively suppresses common-mode noise as well as substrate and ground noise. The clock generated by the CPLL has low jitter and is suitable for clocking multiple lanes. It provides full-rate differential clocks to all the lanes through a symmetrical clock routing channel. The TX and RX are placed next to each other in the layout to minimize the clock routing distance.
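The frequency plan described above can be summarized by enumerating which PLL output and divider ratio produce each of the nine data rates. The divider set {1, 2, 4, 8} is an assumption consistent with the ratios mentioned in the text, and the full-rate clock frequency in GHz is taken to equal the data rate in Gbps.

```python
RPLL_GHZ = (8.5, 10.0, 10.3125)
CPLL_GHZ = (10.0, 10.3125, 12.5)
DIVIDERS = (1, 2, 4, 8)   # assumed per-lane divider ratios
DATA_RATES_GBPS = (1.25, 2.125, 2.5, 3.125, 4.25, 5.0, 6.25, 8.5, 10.3125)

def clock_options(rate_gbps: float):
    """List (source, PLL frequency in GHz, divider) combinations for a data rate."""
    options = []
    for name, freqs in (("RPLL", RPLL_GHZ), ("CPLL", CPLL_GHZ)):
        for freq in freqs:
            for div in DIVIDERS:
                if freq / div == rate_gbps:   # exact: all values are binary fractions
                    options.append((name, freq, div))
    return options

plan = {rate: clock_options(rate) for rate in DATA_RATES_GBPS}
for rate, options in plan.items():
    print(rate, options)
```

Under these assumptions the Fibre Channel rates are reachable only from the RPLL, the 3.125 and 6.25 Gbps rates only from the CPLL, and the remaining rates from either source.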

In practical applications, the oscillation frequency of either the LC VCO or the ring VCO varies over PVT. An adaptive frequency calibration (AFC) circuit is essential for correct locking of the PLL; it automatically selects an appropriate frequency band when the environment changes. The AFCs in the RPLL and CPLL are designed using open-loop AFC technology [26]. The AFC algorithm is given in Fig. 18. It uses a binary search to go through the control codes while storing the optimal code, and it counts the VCO clock divided by 8 instead of the input clock of the phase-frequency detector. Therefore, the proposed AFC has advantages in speed and accuracy.
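The binary search with best-code storage can be sketched as follows. The VCO band plan (8 GHz plus 200 MHz per band), the 156.25 MHz reference, and the 64-cycle counting window are all hypothetical values chosen for illustration; only the 16-band search, the divide-by-8 counting, and the idea of remembering the best code seen so far follow the description above.

```python
F_REF_HZ = 156_250_000      # assumed reference clock
WINDOW_REF_CYCLES = 64      # assumed counting window, in reference cycles
BAND_BITS = 4               # 16 bands, as in the ring VCO

def vco_freq_hz(band: int) -> int:
    """Hypothetical open-loop band plan: 8 GHz plus 200 MHz per band."""
    return 8_000_000_000 + 200_000_000 * band

def count_cycles(freq_hz: int) -> int:
    """Cycles of the divide-by-8 clock counted within the window."""
    return (freq_hz // 8) * WINDOW_REF_CYCLES // F_REF_HZ

def afc_search(target_hz: int) -> int:
    """Binary search over the band code while storing the closest code seen."""
    target = count_cycles(target_hz)
    code = 0
    best_code = 0
    best_err = abs(count_cycles(vco_freq_hz(0)) - target)
    for bit in reversed(range(BAND_BITS)):
        trial = code | (1 << bit)
        count = count_cycles(vco_freq_hz(trial))
        err = abs(count - target)
        if err < best_err:              # remember the optimal code so far
            best_code, best_err = trial, err
        if count <= target:             # VCO not too fast: keep this bit
            code = trial
    return best_code

band_10g3125 = afc_search(10_312_500_000)
band_10g = afc_search(10_000_000_000)
print(band_10g3125, band_10g)
```

Because the best code is stored during the search, the result can differ from the final search position when a band that was tried and rejected happens to lie closer to the target.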

Fig. 19 summarizes the clock distribution of a lane. The circuits in the dotted line are designed to realize flexible data rate configuration. As shown in Fig. 19, differential full-rate clocks, which can be at the same or different frequencies, are fed to the TX and RX. After division by 2, the clocks enter two separate divider chains: one provides clocks to the TX *MUX*, and the other provides clocks for bit-width adaption and the digital circuits. A low-power duty cycle corrector [27] is designed to ensure that the duty cycle of the clock to the physical coding sub-layer (PCS) is 50%. The RX clock network is similar to that of the TX.

FIGURE 18. The AFC algorithm.

## **V. MEASUREMENT RESULTS**

The multi-standard serial transceiver chip is fabricated in a 40 nm CMOS process. Fig. 20 shows the test board, chip package, and chip micrograph. The chip measures 4 mm $\times$ 6 mm (including 401 pads, ESD protection, and a large area of decoupling capacitance), while the core area of a quad is only 1.7 mm<sup>2</sup>. In addition to the TX, RX, and RPLL, each lane also includes a PCS (in the digital part) that performs protocol-level functions such as scrambling, descrambling, encoding, and decoding. The chip is packaged in an FC-BGA on an organic substrate and connected to the test PCB through a customized socket.

The transceiver chip works well at data rates from 1.25 Gbps to 10.3125 Gbps. Fig. 21 shows the TX eye diagrams at the different data rates, with the total jitter (TJ) given in the corresponding figure. The signal quality deteriorates seriously at 8.5 Gbps and 10.3125 Gbps. The TX provides a 280 mV swing and exhibits a TJ of 35.93 ps (0.31 UI) at 8.5 Gbps. At 10.3125 Gbps, the TX provides a 100 mV swing and exhibits a TJ of 63.37 ps (0.65 UI). The signal path from the TX chip pin to the test point includes the chip package, socket, PCB trace, SMA connector, and cable. The total insertion loss is about 14 dB at 5.15625 GHz, which is estimated by sending the fixed pattern of length "0" length "1" at 1.25 Gbps and the clock pattern of "0101" at 10.3125 Gbps.

FIGURE 19. Clock distribution of a lane.

FIGURE 20. The PCB board, chip package, and die photo.

A BERT is used to test the bit error rate (BER) and jitter tolerance of the RX. In the test loop, the BERT first sends out high-speed PRBS data. After receiving the data, the RX outputs the 1:8 demultiplexed data from the chip. Finally, the low-speed data is returned to the BERT for bit error detection. The jitter tolerance of the CDR in the RX is tested at nine data rates using the BERT. Fig. 22(a) shows the results from 1.25 Gbps to 8.5 Gbps, and the jitter tolerance complies with the 8GFC specification at 8.5 Gbps. Note that the BER threshold is set to $10^{-9}$ in the jitter tolerance test because of test time limitations. Fig. 22(b) shows the BER test result at 10.3125 Gbps. The measurement results show that the transceiver works normally at the required data rates.

#### TABLE 1. Performance summary.

|                            | [27]         | [28]          | [1]          | [6]          | This work    |
|----------------------------|--------------|---------------|--------------|--------------|--------------|
| Process (nm)               | 90           | 40            | 20           | 16           | 40           |
| Data rate (Gbps)           | 1.25~10.3125 | 1.0625~14.025 | 0.5~16.3     | 0.5~16.3     | 1.25~10.3125 |
| Clock                      | CPLL         | 2CPLL         | CPLL+RPLL    | CPLL+RPLL    | CPLL+RPLL    |
| Equalizer                  | FFE+CTLE+DFE | FFE+CTLE+DFE  | FFE+CTLE+DFE | FFE+CTLE+DFE | FFE+CTLE+DFE |
| Channel loss (dB)          | 35.8         | 26            | 28           | 28           | 28           |
| Power supply (V)           | 1.2          | 1.1           | 1.0          | 0.9          | 1.1          |
| Power per lane (mW)        | 260          | 410           | 278          | 219          | 114          |
| Power efficiency (mW/Gbps) | 25.24        | 29.23         | 17.06        | 13.44        | 11.05        |

The TRX power consumption is 96 mW with a 1.1 V supply at 10.3125 Gbps. The RPLL and CPLL consume 18 mW and 28.6 mW, respectively; shared among the four lanes, the CPLL contributes 7.15 mW per lane. Therefore, the average power consumption of a single lane of the multi-protocol transceiver chip does not exceed 114 mW. Fig. 22(c) details the power consumption of a lane with the RPLL.
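The per-lane power and efficiency figures follow from simple arithmetic on the measured numbers, as the short check below shows. The figure for a lane clocked from the shared CPLL is derived here from the quoted values and is not separately reported.

```python
TRX_MW = 96.0          # TX + RX power of one lane at 10.3125 Gbps
RPLL_MW = 18.0         # per-lane ring PLL
CPLL_MW = 28.6         # LC PLL shared by the quad
LANES_PER_QUAD = 4
DATA_RATE_GBPS = 10.3125

lane_with_rpll_mw = TRX_MW + RPLL_MW
cpll_share_mw = CPLL_MW / LANES_PER_QUAD
lane_with_cpll_mw = TRX_MW + cpll_share_mw      # derived, not reported in the text

efficiency_mw_per_gbps = lane_with_rpll_mw / DATA_RATE_GBPS

print(f"lane with RPLL: {lane_with_rpll_mw:.1f} mW")
print(f"CPLL share per lane: {cpll_share_mw:.2f} mW")
print(f"efficiency: {efficiency_mw_per_gbps:.2f} mW/Gbps")
```

The RPLL-clocked lane is the worst case, which is why 114 mW is quoted as the upper bound on per-lane power.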

The performance of this work and a comparison with prior work are provided in Table 1. Our work achieves a better power efficiency of 11.05 mW/Gbps, and its equalization capability is comparable to that of the others.



FIGURE 21. Measured TX output eye diagram with PRBS7 at (a) 1.25 Gbps, (b) 2.5 Gbps, (c) 3.125 Gbps, (d) 4.25 Gbps, (e) 5 Gbps, (f) 6.25 Gbps, (g) 8.5 Gbps, and (h) 10.3125 Gbps.

# **VI. CONCLUSION**

A 1.25-10.3125 Gbps multi-standard serial transceiver chip supporting the 1000BASE-X, 10GBASE-KR, FC-AE-ASM, and RapidIO 3.0 protocols and fabricated in a 40 nm CMOS process is presented in this paper. The core area of a quad is only 1.7 mm<sup>2</sup>. The TX adopts a flexible CML driver with a 3-tap FFE and a replica termination circuit that calibrates the poly resistor to 50 Ohm. The RX contains two stages of CTLE and gain, a 2-tap half-rate direct-feedback DFE, and a PI-based CDR. A dual-core CPLL per quad and an RPLL per lane are designed to ensure a wide working range and good performance. With full equalization capability (3-tap TX FFE, CTLE, 2-tap DFE), we achieved 11.05 mW/Gbps at 10.3125 Gbps over 28 dB Nyquist insertion loss with a BER of $10^{-12}$.



FIGURE 22. (a) Measured jitter tolerance of CDR, (b) measured BER of RX with PRBS31 at 10.3125 Gbps, and (c) power breakdown of a lane.

# REFERENCES

- [1] Y. Frans, D. Carey, M. Erett, H. Amir-Aslanzadeh, W. Y. Fang, D. Turker, A. P. Jose, A. Bekele, J. Im, P. Upadhyaya, Z. D. Wu, K. C. H. Hsieh, J. Savoj, and K. Chang, "A 0.5–16.3 Gb/s fully adaptive flexible-reach transceiver for FPGA in 20 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 50, no. 8, pp. 1932–1944, Aug. 2015.
- [2] J. Savoj, K. Hsieh, P. Upadhyaya, F.-T. An, J. Im, X. Jiang, J. Kamali, K. W. Lai, D. Wu, E. Alon, and K. Chang, "Design of high-speed wireline transceivers for backplane communications in 28 nm CMOS," in *Proc. IEEE Custom Integr. Circuits Conf.*, San Jose, CA, USA, Sep. 2012, pp. 1–4.
- [3] S. D. Vamvakos, C. R. Gauthier, C. Rao, K. R. Canagasaby, P. Choudhary, S. Dabral, S. Desai, M. Hassan, K. C. Hsieh, B. Kleveland, and G. Mandal, "A 2.488–11.2 Gbps multi-protocol SerDes in 40 nm low-leakage CMOS for FPGA applications," in *Proc. IEEE 55th Int. Midwest Symp. Circuits Syst. (MWSCAS)*, Boise, ID, USA, Aug. 2012, pp. 5–8.
- [4] J. Savoj, H. Aslanzadeh, D. Carey, M. Erett, W. Fang, Y. Frans, K. Hsieh, J. Im, A. Jose, D. Turker, P. Upadhyaya, D. Wu, and K. Chang, "Wideband flexible-reach techniques for a 0.5-16.3 Gbps fully-adaptive transceiver in 20 nm CMOS," in *Proc. IEEE Custom Integr. Circuits Conf.*, San Jose, CA, USA, Sep. 2014, pp. 1–4.
- [5] J. Savoj, K. Hsieh, P. Upadhyaya, F.-T. An, A. Bekele, S. Chen, X. Jiang, K. W. Lai, C. F. Poon, A. Sewani, D. Turker, K. Venna, D. Wu, B. Xu, E. Alon, and K. Chang, "A wide common-mode fully-adaptive multi-standard 12.5 Gb/s backplane transceiver in 28 nm CMOS," in *Proc. Symp. VLSI Circuits (VLSIC)*, Honolulu, HI, USA, Jun. 2012, pp. 104–105.
- [6] M. Erett, J. Hudner, D. Carey, R. Casey, K. Geary, K. Hearne, P. Neto, T. Mallard, V. Sooden, M. Smyth, Y. Frans, J. Im, P. Upadhyaya, W. Zhang, W. Lin, B. Xu, and K. Chang, "A 0.5–16.3 Gbps multi-standard serial transceiver with 219 mW/Channel in 16-nm FinFET," *IEEE J. Solid-State Circuits*, vol. 52, no. 7, pp. 1783–1797, Jul. 2017.
- [7] (2007). IEEE 802.3ap. [Online]. Available: https://ieeexplore.ieee.org/document/4213276/
- [8] (2014). RapidIO Specification 3.0. [Online]. Available: http://rapidio.wpengine.com/rapidio-specifications/#tab1530
- [9] Fibre Channel Physical Interface-5 (FC-PI-5) Rev 6.00, ANSI INCITS, Washington, DC, USA, 2010.

- [10] S. Yuan, L. Wu, Z. Wang, X. Zheng, C. Zhang, and Z. Wang, "A 70 mW 25 Gb/s quarter-rate SerDes transmitter and receiver chipset with 40 dB of equalization in 65 nm CMOS technology," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 63, no. 7, pp. 939–949, Jul. 2016.
- [11] C. Ying, L. Fule, Z. Xuqiang, and Z. Chun, "Self-calibrating on-chip termination resistor for high-speed SerDes," in *Proc. Int. Conf. Consum. Electron., Commun. Netw. (CECNet)*, Xianning, China, Apr. 2011, pp. 5207–5210.
- [12] J.-S. Choi, M.-S. Hwang, and D.-K. Jeong, "A 0.18-µm CMOS 3.5-Gbps continuous-time adaptive cable equalizer using enhanced low-frequency gain control method," *IEEE J. Solid-State Circuits*, vol. 39, no. 3, pp. 419–425, Mar. 2004.
- [13] S. Gondi, J. Lee, D. Takeuchi, and B. Razavi, "A 10 Gbps CMOS adaptive equalizer for backplane applications," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, San Francisco, CA, USA, vol. 1, Feb. 2005, pp. 328–601.
- [14] Y. Chen, P.-I. Mak, and Y. Wang, "A highly-scalable analog equalizer using a tunable and current-reusable active inductor for 10-Gb/s I/O links," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 23, no. 5, pp. 978–982, May 2015.
- [15] D. Lee, J. Han, G. Han, and S. M. Park, "10 Gbit/s 0.0065 mm<sup>2</sup> 6 mW analogue adaptive equaliser utilising negative capacitance," *Electron. Lett.*, vol. 45, no. 17, pp. 863–865, 2009.
- [16] T. Masuda, H. Suzuki, H. Iizuka, A. Igarashi, K. Takeshita, T. Mogi, N. Shoji, J. Chatwin, I. Butler, and D. Mellor, "A 250 mW full-rate 10 Gb/s transceiver core in 90 nm CMOS using a tri-state binary PD with 100 ps gated digital output," in *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, San Francisco, CA, USA, Feb. 2007, pp. 438–614.
- [17] H. Wang and J. Lee, "A 21-Gb/s 87-mW transceiver with FFE/DFE/Analog equalizer in 65-nm CMOS technology," *IEEE J. Solid-State Circuits*, vol. 45, no. 4, pp. 909–920, Apr. 2010.
- [18] Y. Chen, P. Mak, L. Zhang, and Y. Wang, "A 0.002-mm<sup>2</sup> 6.4-mW 10-Gbps full-rate direct DFE receiver with 59.6% horizontal eye opening under 23.3-dB channel loss at Nyquist frequency," *IEEE Trans. Microw. Theory Techn.*, vol. 62, no. 12, pp. 3107–3117, Dec. 2014.
- [19] Y. Lu and E. Alon, "Design techniques for a 66 Gb/s 46 mW 3-tap decision feedback equalizer in 65 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 48, no. 12, pp. 3243–3257, Dec. 2013.
- [20] Z.-H. Hong, Y.-C. Liu, and W.-Z. Chen, "A 3.12 pJ/bit, 19–27 Gbps receiver with 2-tap DFE embedded clock and data recovery," *IEEE J. Solid-State Circuits*, vol. 50, no. 11, pp. 2625–2634, Nov. 2015.
- [21] T. Norimatsu, K. Kogo, T. Komori, N. Kohmu, F. Yuki, and T. Kawamoto, "A 100-Gbps 4-lane transceiver for 47-dB loss copper cable in 28-nm CMOS," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 67, no. 10, pp. 3433–3443, Oct. 2020.
- [22] K. Huang, Z. Wang, X. Zheng, X. Ma, K. Yu, C. Zhang, and Z. Wang, "A novel clock and data recovery scheme for 10 Gbps source synchronous receiver in 65 nm CMOS," in *Proc. IEEE 55th Int. Midwest Symp. Circuits Syst. (MWSCAS)*, Boise, ID, USA, Aug. 2012, pp. 932–935.
- [23] X. Lin, Z. Wang, Y. He, Y. Zhou, C. Zhang, W. Luan, and M. Li, "An 8-11 GHz low phase noise ring voltage controlled oscillator," in *Proc. Int. Conf. Electron Devices Solid-State Circuits (EDSSC)*, Hsinchu, Taiwan, Oct. 2017, pp. 1–2.
- [24] Y. He, Z. Wang, H. Liu, F. Lv, S. Yuan, C. Zhang, Z. Wang, and H. Jiang, "An 8.5–12.5 GHz wideband LC PLL with dual VCO cores for multiprotocol SerDes," in *Proc. IEEE 60th Int. Midwest Symp. Circuits Syst. (MWSCAS)*, Boston, MA, USA, Aug. 2017, pp. 791–794.
- [25] J. Shin and H. Shin, "A 1.9-3.8 GHz ΔΣ fractional-N PLL frequency synthesizer with fast auto-calibration of loop bandwidth and VCO frequency," *IEEE J. Solid-State Circuits*, vol. 47, no. 3, pp. 665–675, Mar. 2012.
- [26] H.-Y. Huang, C.-M. Liang, and S.-J. Sun, "Low-power 50% duty cycle corrector," in *Proc. IEEE Int. Symp. Circuits Syst.*, Seattle, WA, USA, May 2008, pp. 2362–2365.
- [27] Y. Hidaka, W. Gai, T. Horie, J. H. Jiang, Y. Koyanagi, and H. Osone, "A 4-channel 1.25–10.3 Gbps backplane transceiver macro with 35 dB equalizer and sign-based zero-forcing adaptive control," *IEEE J. Solid-State Circuits*, vol. 44, no. 12, pp. 3547–3559, Dec. 2009.
- [28] F. Zhong, S. Quan, W. Liu, P. Aziz, T. Jing, J. Dong, C. Desai, H. Gao, M. Garcia, G. Hom, and T. Huynh, "A 1.0625~14.025 Gbps multi-media transceiver with full-rate source-series-terminated transmit driver and floating-tap decision-feedback equalizer in 40 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 46, no. 12, pp. 3126–3139, Dec. 2011.



**HONG CHEN** (Senior Member, IEEE) received the Ph.D. degree from the Department of Electronic Engineering, Tsinghua University, in 2005. From 2005 to 2007, she worked with the Institute of Microelectronics in Tsinghua University (IMETU) as a Postdoctoral Fellow. From 2006 to 2016, she worked with the Medical Center, Nebraska University, and the Department of Electronics and Computer Engineering, Georgia Tech, as a Visiting Scholar, respectively. Since 2007, she has been working with IMETU, where she is currently an Associate Professor. Her research interests include monitoring-system design for TKR/THR surgery, low-power digital integrated-circuit design, asynchronous circuit design, PZT power electronics, low-power mixed-signal SoC design, and serial transceivers.



**DENGJIE WANG** was born in Hebei, China. He received the B.S. degree in microelectronics from Xidian University, Xi'an, China, in 2015. He is currently pursuing the Ph.D. degree with the Institute of Microelectronics, Tsinghua University. His research interests include high-speed wireline communication systems.



**ZIQIANG WANG** was born in Beijing, China, in 1975. He received the B.S. and Ph.D. degrees from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 1999 and 2006, respectively.

After receiving the Ph.D. degree, he worked as a Research Assistant with the Institute of Microelectronics, Tsinghua University. Since 2015, he has been an Associate Professor with the Institute of Microelectronics. His research interests include analog circuit design.



**SHUAI YUAN** was born in Jilin, China. He received the B.S. degree from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 2011, and the Ph.D. degree from the Institute of Microelectronics, Tsinghua University, China, in 2016.

He is currently with HiSilicon Technologies Company Ltd., Beijing. His research interests include high-speed wireline transceivers.



**CHUN ZHANG** (Member, IEEE) received the B.S. and Ph.D. degrees from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 1995 and 2000, respectively.

He has been with Tsinghua University since 2000, where he was with the Department of Electronic Engineering from 2000 to 2004. He has been an Associate Professor with the Institute of Microelectronics since 2005. His research interests include mixed-signal integrated circuits and systems, embedded microprocessor design, digital signal processing, and radio-frequency identification.



**ZHIHUA WANG** (Fellow, IEEE) received the B.S., M.S., and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 1983, 1985, and 1990, respectively.

He served as a Full Professor and the Deputy Director of the Institute of Microelectronics, Tsinghua University, from 1997 to 2000. He was a Visiting Scholar with CMU, Pittsburgh, PA, USA, from 1992 to 1993, and with KU Leuven, Leuven, Belgium, from 1993 to 1994, and was a Visiting Professor with HKUST, Hong Kong, from September 2014 to March 2015. He has coauthored 13 books/chapters, more than 197 (514) articles in international journals (conferences), and over 246 (29) articles in Chinese journals (conferences). He holds 118 Chinese and nine U.S. patents. His current research interests include CMOS RFIC and biomedical applications, involving RFID, PLL, low-power wireless transceivers, and smart clinic equipment combined with leading-edge RFIC and digital image processing techniques. He served as a Technical Program Committee Member for IEEE ISSCC from 2005 to 2011. He has been a Steering Committee Member of IEEE A-SSCC since 2005. He also served as the Chairman of the IEEE SSCS Beijing Chapter from 1999 to 2009 and as an AdCom Member of IEEE SSCS from 2016 to 2019. He has also served as the Technical Program Chair for A-SSCC 2013; a Guest Editor for the IEEE JSSC Special Issues in December 2006, December 2009, and November 2014; an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II: EXPRESS BRIEFS, and the IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS; and in other administrative/expert committee positions in China's national science and technology projects.
