

Received 26 May 2020; revised 7 September 2020; accepted 21 October 2020. Date of current version 8 January 2021. Digital Object Identifier 10.1109/OJCAS.2020.3034819

# A 52-Gb/s Sub-1-pJ/bit PAM4 Receiver in 40-nm CMOS for Low-Power Interconnects

CAN WANG<sup>10</sup> (Graduate Student Member, IEEE), LI WANG<sup>10</sup> (Graduate Student Member, IEEE), ZHAO ZHANG<sup>10</sup> (Member, IEEE), MILAD KALANTARI MAHMOUDABADI<sup>10</sup> (Member, IEEE), WEIMIN SHI<sup>1</sup> (Member, IEEE), AND C. PATRICK YUE<sup>10</sup> <sup>1,3</sup> (Fellow, IEEE)

<sup>1</sup>HKUST-Qualcomm Optical Wireless Lab, Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong

<sup>2</sup>State Key Laboratory of Superlattices and Microstructures, Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China

<sup>3</sup>HKUST Shenzhen Research Institute, Shenzhen 518057, China

This article was recommended by Associate Editor P.-H. Hsieh.

CORRESPONDING AUTHOR: C. P. YUE (e-mail: eepatrick@ust.hk)

This work was supported in part by the Hong Kong Innovation and Technology Fund under Project GHP/004/18SZ; in part by the Research Grant Council General Research Fund under Project 16200419; and in part by the Research and Development Program in Key Areas of Guangdong Province under Grant 2019B010116002.

**ABSTRACT** This article presents a quarter-rate source-synchronous PAM-4 receiver for energy-efficient chip-to-module communication. A novel single-stage multiple-peaking continuous-time linear equalizer (MP-CTLE), which uses a feedback-enabled multiple-peaking scheme to provide both high-frequency equalization (HF-EQ) and low-frequency equalization (LF-EQ), is proposed to improve the BER performance and overall energy efficiency. The LF-EQ of the MP-CTLE eliminates the need for many DFE taps in SR/VSR applications, saving power and area. A 1-tap feedforward equalizer (FFE) further compensates for the high-frequency loss. A ring-oscillator-based wide-bandwidth phase-locked loop (WBW-PLL) serves as the multiphase clock generator (MPCG) in the clock and data recovery loop to save power with acceptable phase accuracy. Fabricated in 40-nm CMOS technology, the prototype receiver chip achieves error-free operation up to 52-Gb/s PAM-4 with a bit efficiency of 0.126 pJ/bit/s/dB while compensating 7.3-dB channel loss at 13 GHz. With the proposed single-stage MP-CTLE, the receiver extends the error-free operation from PRBS-7 to PRBS-9, and the horizontal opening of the BER bathtub curve at $10^{-6}$ is improved from 0.018 UI to around 0.1 UI.

**INDEX TERMS** PAM4 receiver, wide bandwidth PLL, multiple peaking CTLE, energy efficient, quarter-rate source-synchronous receiver.

# I. INTRODUCTION

GLOBAL IP traffic is predicted to triple in five years. To meet the stringent demand on bandwidth, the data rate of next-generation I/O will exceed 50 Gb/s. Since PAM-4 signaling doubles the bandwidth efficiency of its NRZ counterpart, it becomes attractive for such high-data-rate I/O. At the same time, I/O density is also growing rapidly, which requires highly energy-efficient PAM4 receiver circuits to reduce thermal dissipation. Compared with the ADC-based PAM4 receivers [1]–[3], which are widely used in long-reach applications, the mixed-signal PAM4 receivers [4], [5] are more suitable for very-short-reach (VSR) and short-reach (SR) applications with moderately lossy channels (less than 20 dB), since the equalization requirement of SR or VSR links is relaxed. Thus, complicated and power-hungry digital equalizers are replaced by low-power analog equalizers in pursuit of energy efficiency.

However, the highly energy-efficient mixed-signal PAM4 receiver mainly suffers from two design challenges. Firstly, the long-tail ISI caused by skin-effect loss has proved to be crucial for bit-error-rate performance, as it produces a dip in the channel response that takes effect from around 1.5 GHz. Many techniques have been proposed to compensate for this low-frequency loss dip, such as multiple DFE taps [6], [7] to remove the dominant ISI, or multiple cascaded conventional CTLE stages and CTLEs cascaded with feedback/feed-forward enabled high-pass amplifiers [8], [9] to cancel the long-tail ISI and the high-frequency channel loss simultaneously. These techniques are usually very effective, but their power and area consumption is typically significant. Secondly, the widely used phase interpolator (PI) based multiphase clock generator (MPCG) in the quarter-rate receiver is power hungry. Although the injection-locked ring oscillator (ILRO) can be used as a low-power MPCG and has been used in the quarter-rate forwarded-clock receiver [5], it suffers from a small locking range [10] and phase inaccuracy due to free-running frequency shift.

FIGURE 1. Proposed PAM-4 receiver system diagram.

FIGURE 2. Two-stage CTLE with proposed single stage MP-CTLE in 2nd stage.

In this article, an energy-efficient quarter-rate 52-Gb/s sub-1-pJ/bit mixed-signal PAM-4 receiver is presented as an expansion of [11]. A single-stage CTLE with a multiple-peaking scheme (MP-CTLE) is proposed, so that both the skin-effect loss [8] at low frequency and the channel loss at high frequency can be compensated simultaneously using one CTLE stage to save power. A quarter-rate architecture is adopted to reduce the circuit power consumption. An RO-based wide-bandwidth PLL (WBW-PLL) is adopted as the MPCG, which provides better phase accuracy than an ILRO-based MPCG and lower power consumption than a PI-based MPCG. A 1-tap feed-forward equalizer (FFE) is adopted instead of a decision feedback equalizer (DFE) to cancel the first post-cursor tap. This further saves power because the power-hungry first tap of the DFE is removed. This article is organized as follows. Section II introduces the receiver design details. Experimental results are presented in Section III.

# II. CIRCUITS DESIGN

## A. SYSTEM OVERVIEW

Fig. 1 shows the proposed PAM-4 receiver topology, which includes four parts: 1) a two-stage CTLE with the proposed single-stage MP-CTLE; 2) an embedded FFE with four data and edge paths; 3) a transition selection PD and charge pump; 4) a WBW-PLL as the MPCG. The two-stage CTLE aims at opening the data eye before the sample-and-hold (S/H) stages. The compensated signal is then sampled by the quarter-rate clocks CKDx/CKEx and decoded to RDx. The edge information is sampled by the slicer to REx. The data samples and edge samples control the PD, which drives the charge pump to adjust the voltage-controlled delay line (VCDL) in the reference clock path and thereby align the sampling clock with the input data. For a wireline receiver, equalization and clocking are the most critical blocks that affect the BER performance and power consumption. By removing one extra stage from the conventional multiple-peaking cascaded CTLE design [8], [9], our single-stage MP-CTLE can provide similar performance while saving a significant amount of power. A brief introduction to the proposed MP-CTLE is given in the following paragraphs. The frequency response and power-saving characteristics of the MP-CTLE are discussed in Section II-B. Section II-C covers the FFE design. Section II-D introduces the transition selection PD. The charge pump is covered in Section II-E. The design of the WBW-PLL and VCDL is introduced in Section II-F.

For a given overall bandwidth target, the bandwidth of $N$ cascaded identical CML stages can be expressed as [12]

$$f_{-3\,\mathrm{dB,overall}} \approx \frac{0.9}{\sqrt{N}} \cdot f_{-3\,\mathrm{dB,single\text{-}stage}}. \tag{1}$$

The 3-dB bandwidth of each individual CML stage in the cascade must therefore be $\frac{\sqrt{N}}{0.9}$ times the overall bandwidth target. Assuming the load capacitance is fixed for all stages, it is straightforward to see that the load resistor $R_D$, which enters the CML gain $g_m \cdot R_D$, must decrease to achieve the higher bandwidth. To maintain the same gain, the decrease in $R_D$ must be offset by an increase in $g_m$, and hence in current consumption.
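The trade-off described by (1) can be made concrete with a small numerical sketch. The code below is illustrative only: it assumes each stage behaves as a single-pole RC load, so that $R_D = 1/(2\pi f C_L)$, and the overall bandwidth, load capacitance, and per-stage gain are hypothetical values, not taken from the design.

```python
import math

def stage_design(f_overall_hz, n_stages, c_load_f, stage_gain):
    """Per-stage requirements implied by Eq. (1) for n_stages identical stages.

    Assumes a single-pole RC load per stage (an illustrative simplification).
    Returns (per-stage bandwidth in Hz, load resistor in ohm, gm in S).
    """
    f_stage = f_overall_hz * math.sqrt(n_stages) / 0.9   # Eq. (1) inverted
    r_d = 1.0 / (2 * math.pi * f_stage * c_load_f)        # fixed C_L sets R_D
    g_m = stage_gain / r_d                                # keep gm * R_D constant
    return f_stage, r_d, g_m

# Hypothetical numbers: 20-GHz overall target, 50-fF load, gain of 2 per stage.
for n in (2, 3):
    f_stage, r_d, g_m = stage_design(20e9, n, 50e-15, 2.0)
    print(f"N={n}: f_stage={f_stage / 1e9:.1f} GHz, "
          f"R_D={r_d:.1f} ohm, gm={g_m * 1e3:.1f} mS")
```

Under these assumptions, the required $g_m$ per stage scales with $\sqrt{N}$, so every extra cascaded stage raises both the stage count and the current per stage.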

A CTLE resembles a CML amplifier in terms of the trade-off among power consumption, bandwidth, and gain. It creates peaking by degenerating the transconductance of the CML differential pair at low frequency, with a capacitor bypassing the degeneration at high frequency, so that the amplifier's frequency response peaks at high frequency. The proposed single-stage MP-CTLE, shown in Fig. 2, provides both low-frequency and high-frequency equalization with only one stage. Overall, our receiver chip still uses a two-stage CTLE to obtain enough high-frequency peaking gain.



FIGURE 3. Simplified half-circuit diagram (a) conventional capacitive CTLE; (b) feedback enabled high-pass amplifier; (c) feedforward enabled high-pass amplifier with CTLE; (d) Simplified half circuit of proposed multiple peaking CTLE(MP-CTLE).



FIGURE 4. Simulated single-stage MP-CTLE frequency response.

Compared with existing works, the main distinction between the different realizations of multiple peaking is the number of zeros each stage provides. The simplified half-circuit diagrams are drawn in Fig. 3 for comparison.

As shown in Fig. 3 (a), the conventional CTLE uses the capacitive degeneration network $R_SC_S$ to produce one zero in the transconductance. At low frequency, the parallel resistor and capacitor exhibit a relatively high impedance that degenerates the transconductance, since the resistor dominates. At higher frequency, the capacitor shorts the resistor. The stage therefore creates one peaking at the desired frequency. To compensate for both high-frequency and low-frequency channel loss, multiple cascaded conventional CTLE stages are needed to provide peaking at different frequencies.

In Fig. 3 (b), the feedback-enabled high-pass amplifier is another alternative. An LPF is placed around the CML amplifier to sense the output and feed it back to the input. The feedback loop suppresses low-frequency energy until the LPF breaks the loop as the frequency rises above the designed corner frequency.

There are also other variations of the peaking amplifier. A simplified half-circuit of the feedforward-enabled high-pass amplifier is depicted in Fig. 3 (c). Based on the conventional source-degenerated CTLE, a feedforward path is added in parallel with the main signal path. From left to right, it consists of an LPF and the feedforward transconductance $g_{mfd}$. The output of this feedforward branch is subtracted from the main-path output signal, which effectively reduces the transconductance within the LPF bandwidth. As the frequency goes up, the feedforward path is attenuated, which generates a peaking.

Fig. 3 (d) shows the simplified half-circuit of the proposed MP-CTLE, which provides two peakings by adding a feedback loop on top of the conventional capacitive source degeneration. The feedback loop uses a low-pass filter (LPF) to sense the output. The filtered signal is then amplified by $g_{mfb}$ and subtracted from the input of the MP-CTLE to suppress the low-frequency signal. The LPF attenuates the feedback loop gain beyond its roll-off frequency, which avoids interference with the source-degenerated high-frequency peaking and restores some of the gain before the high-frequency peaking. Overall, the frequency response of the closed-loop MP-CTLE shows two peaking points located at frequencies determined by the LPF and $R_SC_S$, respectively. The detailed analysis of the proposed single-stage MP-CTLE is covered in the following sections.

## B. PROPOSED SINGLE-STAGE MP-CTLE

The overall two-stage CTLE used in this receiver is shown in Fig. 2; its second stage is the proposed single-stage MP-CTLE. The simulated frequency response of the single-stage MP-CTLE is shown in Fig. 4. The single-stage MP-CTLE provides a high-frequency boost of 5.8 dB at 18 GHz and an LF-EQ of around 1.8 dB at 1.5 GHz. The proposed single-stage MP-CTLE realizes the multiple-peaking scheme by feeding the output signal back through a low-pass filter (LPF) and subtracting it from the input signal. Its major advantage is the power saved by eliminating the extra cascaded stage, a second CTLE or CML feedback amplifier, used in prior works. Intuitively, the proposed single-stage MP-CTLE reuses the transconductance of $M_2$ in the feedback loop to provide low-frequency voltage gain suppression. After the LPF kicks in, the loop gain begins to decrease, which creates the LF-EQ zero. For high-frequency peaking, the LPF attenuates the feedback signal severely enough to break the loop; in other words, the feedback loop does not compromise the HF-EQ peaking amplitude. Thus, by providing HF-EQ and LF-EQ at the same time, the single-stage MP-CTLE is more energy efficient than the conventional cascade-stage MP-CTLE.

FIGURE 5. Simplified system diagram for the 2nd stage CTLE.

FIGURE 6. Numerical simulation of closed loop frequency response of MP-CTLE and approximation errors for closed-loop transfer function brought by approximation of using  $R_D/(1 + R_D C_D s + L_p C_D s^2)$  for $Z_L(s)$.

With the added low-frequency zero, the overall system transfer function becomes less explicit. The denominator consists of two complex poles and two real poles; hence, the undetermined coefficient method cannot be used to calculate the pole positions [12]. Therefore, we propose an approximation approach to obtain a simplified overall system transfer function and the pole positions of this system. In Fig. 5, $G_m$ is the degenerated transconductance of $M_2$, and $Z_L$ is the output load of the single-stage MP-CTLE, which consists of the shunt-peaking inductor $L_2$, the load resistor $R_{D2}$, and the parasitic capacitance $C_{L2}$ (refer to Fig. 2); these appear as $L_p$, $R_D$, and $C_D$, respectively, in the equations below. The output of the single-stage MP-CTLE goes through an LPF, denoted as LPF(s). The feedback network gain $g_{mfb} \cdot Z_{IN}(s)$ is close to unity in our case. We can write the complete closed-loop transfer function as follows:

$$\frac{V_{out}(s)}{V_{in}(s)} = \frac{G_m(s) \cdot Z_L(s)}{1 + G_m(s) \cdot Z_L(s) \cdot LPF(s)} \tag{2}$$

where

$$G_m(s) = \frac{g_m(R_S C_S s + 1)}{1 + R_S C_S s + \frac{g_m R_S}{2}} \tag{3}$$

$$Z_L(s) = \frac{L_p s + R_D}{1 + R_D C_D s + L_p C_D s^2}. \tag{4}$$
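As a rough numerical illustration of (2)–(4), the sketch below evaluates the closed-loop magnitude at a few frequencies. It rests on several assumptions that are not stated in the text: the LPF is modeled as a first-order response $1/(1 + R_Z C_Z s)$, the degeneration term in $G_m(s)$ is written as $g_m R_S/2$ (the form implied by (5) and (7)), and all component values are hypothetical.

```python
import math

# Hypothetical component values, for illustration only (not from the design).
GM = 20e-3                           # input-pair transconductance g_m (S)
R_S = 200.0                          # degeneration resistor (ohm)
TAU_S = 1 / (2 * math.pi * 5e9)      # R_S * C_S, source zero near 5 GHz
R_D = 50.0                           # load resistor (ohm)
C_D = 60e-15                         # load capacitance (F)
L_P = 0.1e-9                         # shunt-peaking inductor (H)
TAU_Z = 1 / (2 * math.pi * 0.5e9)    # R_Z * C_Z of the assumed first-order LPF

def g_m_eff(s):
    """Degenerated transconductance, with the g_m*R_S/2 degeneration term."""
    return GM * (1 + TAU_S * s) / (1 + TAU_S * s + GM * R_S / 2)

def z_load(s):
    """Shunt-peaked load impedance, Eq. (4)."""
    return (L_P * s + R_D) / (1 + R_D * C_D * s + L_P * C_D * s * s)

def lpf(s):
    """Assumed first-order low-pass filter in the feedback path."""
    return 1 / (1 + TAU_Z * s)

def h_closed(f_hz):
    """Closed-loop transfer function, Eq. (2), evaluated at s = j*2*pi*f."""
    s = 2j * math.pi * f_hz
    forward = g_m_eff(s) * z_load(s)
    return forward / (1 + forward * lpf(s))

for f_hz in (0.0, 1.5e9, 10e9):
    gain_db = 20 * math.log10(abs(h_closed(f_hz)))
    print(f"{f_hz / 1e9:5.1f} GHz: {gain_db:6.2f} dB")
```

With these values the gain rises in two steps, first as the LPF opens the feedback loop and then as the source degeneration is bypassed, which is the two-peaking behavior described above.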

FIGURE 7. Numerical simulation of LPF(f)·$Z_L$(f), LPF(f)·($R_D$ term), and LPF(f)·($L_p$ term).

This complete closed-loop transfer function can be simplified because the term $R_D/(1 + R_D C_D s + L_p C_D s^2)$ dominates $Z_L(s) \cdot LPF(s)$, as shown in Fig. 7. This is because the LPF filters out the high-frequency inductive response of $Z_L(s)$, so the feedback network only sees the $R_D$ term. This approximation yields a simplified version of the system transfer function, although it introduces errors at higher frequencies. Fig. 6 presents the numerical results $|H(f)|$ of the closed-loop transfer function (2) together with the approximation error. It shows no significant difference at low frequency (< 2 GHz), while the pole location at high frequency remains relatively unchanged compared with the original system transfer function. It is therefore reasonable to write the simplified system transfer function and the approximated pole locations as follows:

$$\frac{V_{out}(s)}{V_{in}(s)} \approx \frac{G_m(s) \cdot Z_L(s)}{1 + G_m(s) \cdot \frac{R_D}{1 + R_Z C_Z s}} = \frac{\frac{2 g_m}{2 + g_m R_S + 2g_m R_D}(R_S C_S s + 1)(R_Z C_Z s + 1)}{\frac{2\tau_S \tau_Z}{2 + g_m R_S + 2g_m R_D} s^2 + \frac{2\tau_S + 2\tau_Z + g_m R_S \tau_Z + 2g_m R_D \tau_S}{2 + g_m R_S + 2g_m R_D} s + 1} \times \frac{L_p s + R_D}{1 + R_D C_D s + L_p C_D s^2} \tag{5}$$

where  $\tau_X = R_X C_X$  and the LPF is taken as the first-order response $LPF(s) = 1/(1 + R_Z C_Z s)$.

From the simplified transfer function, assuming that  $2R_ZC_Z + g_mR_SR_ZC_Z \gg 2R_SC_S + 2g_mR_DR_SC_S$, the poles can be written as follows:

$$P_1 = -\left(1 + \frac{2g_m R_D}{2 + g_m R_S}\right) \cdot \frac{1}{2\pi R_Z C_Z} \tag{6}$$

$$P_2 = -\left(1 + \frac{g_m R_S}{2}\right) \cdot \frac{1}{2\pi R_S C_S}. \tag{7}$$
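A quick way to judge how well (6) and (7) hold is to compare them with the exact roots of the quadratic denominator in (5). The following sketch does this for one hypothetical set of values chosen to satisfy the stated assumption; the numbers are illustrative and are not the design values.

```python
import math

# Hypothetical values for illustration; chosen so that
# 2*tau_Z + gm*R_S*tau_Z >> 2*tau_S + 2*gm*R_D*tau_S holds.
GM, R_S, R_D = 20e-3, 200.0, 50.0
TAU_S = 1 / (2 * math.pi * 5e9)     # R_S * C_S
TAU_Z = 1 / (2 * math.pi * 0.5e9)   # R_Z * C_Z

def exact_poles_hz():
    """Roots of a*s^2 + b*s + 1 from the denominator of Eq. (5), in Hz."""
    d = 2 + GM * R_S + 2 * GM * R_D
    a = 2 * TAU_S * TAU_Z / d
    b = (2 * TAU_S + 2 * TAU_Z + GM * R_S * TAU_Z + 2 * GM * R_D * TAU_S) / d
    root = math.sqrt(b * b - 4 * a)  # discriminant is positive for these values
    s_low = (-b + root) / (2 * a)
    s_high = (-b - root) / (2 * a)
    return s_low / (2 * math.pi), s_high / (2 * math.pi)

def approx_poles_hz():
    """Approximate poles from Eqs. (6) and (7), in Hz."""
    p1 = -(1 + 2 * GM * R_D / (2 + GM * R_S)) / (2 * math.pi * TAU_Z)
    p2 = -(1 + GM * R_S / 2) / (2 * math.pi * TAU_S)
    return p1, p2

for name, exact, approx in zip(("P1", "P2"), exact_poles_hz(), approx_poles_hz()):
    err = abs(approx - exact) / abs(exact)
    print(f"{name}: exact {exact / 1e9:.3f} GHz, "
          f"approx {approx / 1e9:.3f} GHz, error {err:.1%}")
```

For this choice of values both approximate poles land within a few percent of the exact roots; the agreement degrades as the two time-constant terms in the assumption approach each other.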

Unlike the pole positions of the cascade-stage MP-CTLE, which are affected only by the degeneration resistors and capacitors, the location of the first pole $P_1$ of the single-stage MP-CTLE is determined by $R_D$ and $R_S$ together. From the above derivation, the major difference between the proposed single-stage MP-CTLE and the cascade-stage MP-CTLE is that the first pole (6) of the proposed design sits at a slightly higher frequency, since the first pole of the conventional cascade-stage MP-CTLE takes the same form as (7).



FIGURE 8. Simulation of compensated channel response with/without LF-EQ.



FIGURE 9. Unequalized PRBS-31 52-Gbps PAM-4 eye diagram.

The simulated performance of the proposed single-stage MP-CTLE is used to examine the impact of the differences in pole locations. Fig. 8 shows the channel response compensated by the single-stage MP-CTLE with and without the LF-EQ; a 1.8-dB dip appears around 1.5 GHz when the LF-EQ is off. The 3-dB bandwidth is the same with the LF-EQ on or off. After passing through the lossy channel, the 52-Gb/s PAM-4 signal is totally corrupted, as shown in Fig. 9. The recovered PAM-4 eye diagram with LF-EQ in Fig. 10 shows a significant improvement over its counterpart without LF-EQ in Fig. 11, with more than 30% enhancement in both vertical and horizontal opening. The differences are denoted by the comparison bars in the figures, where the red bars mark the vertical and horizontal openings of the recovered PAM-4 eye diagram and the green bars mark the improvement from turning on the LF-EQ. We can therefore conclude that the proposed single-stage MP-CTLE compensates the low-frequency loss as the prior art does [6]–[8]. In our simulation, the $g_{mfb}$ branch consumes roughly 400 µA, since the feedback loop gain is also provided by the load of the previous stage; this is 60% less than the consumption of an extra CTLE stage at this data rate.

The conventional cascade-stage MP-CTLE exhibits very good design flexibility but poor power efficiency. Due to its cascaded structure, the 3-dB bandwidth of each stage needs to be much higher than the overall 3-dB bandwidth. Since the 3-dB bandwidth is inversely proportional to $R_D$, $g_m$ needs to be larger at higher bandwidth to provide enough gain. Thus, power consumption is proportional to the number of stages/peakings.

FIGURE 10. Recovered PRBS-31 52-Gbps PAM-4 eye diagram with LF-EQ.

FIGURE 11. Recovered PRBS-31 52-Gbps PAM-4 eye diagram without LF-EQ.

FIGURE 12. Schematics of FFE summer.

## C. FFE SUMMER AND TIMING DIAGRAM

After the CTLE and S/H circuits, the full-rate input data is converted into quarter-rate data. A feedforward equalizer is embedded with the S/H circuits by adding an FFE summer that sums consecutive data samples. The FFE summer, shown in Fig. 12, is a current-mode summer with a resistive load. An adjustable degeneration resistor is used to improve the linearity, and $I_B$ sets the output swing. The equalization coefficient is adjusted by changing the $I_{FFE}$ tail current source. SD1 and SD2 are the consecutive sampled quarter-rate data of D1 and D2, respectively. The FFE uses SD1 to cancel the first post-cursor effect of D1 on SD2. The timing diagram is shown in Fig. 14.
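The cancellation performed by the FFE summer can be sketched behaviorally. The model below is a simplification under stated assumptions: the channel is reduced to a unit main cursor plus a single post-cursor tap `H1`, noise is ignored, and the tap weight and symbol sequence are hypothetical. It only illustrates the principle of subtracting a scaled copy of the previous analog sample (SD1) from the current one (SD2).

```python
import random

PAM4_LEVELS = (-3, -1, 1, 3)

def channel(symbols, h1):
    """Toy channel: unit main cursor plus one post-cursor tap h1 (assumed)."""
    out, prev = [], 0.0
    for a in symbols:
        out.append(a + h1 * prev)
        prev = a
    return out

def ffe_1tap(samples, alpha):
    """Subtract alpha times the previous analog sample from the current one."""
    out, prev = [], 0.0
    for r in samples:
        out.append(r - alpha * prev)
        prev = r
    return out

def worst_isi(equalized, symbols):
    """Largest deviation of an equalized sample from its ideal PAM-4 level."""
    return max(abs(y - a) for y, a in zip(equalized, symbols))

rng = random.Random(0)                       # fixed seed for repeatability
symbols = [rng.choice(PAM4_LEVELS) for _ in range(200)]
H1 = 0.25                                    # hypothetical post-cursor weight
received = channel(symbols, H1)
isi_raw = worst_isi(received, symbols)
isi_ffe = worst_isi(ffe_1tap(received, H1), symbols)
print(f"worst-case ISI without FFE: {isi_raw}, with FFE: {isi_ffe}")
```

Because the FFE operates on the analog sample rather than on a decision, the first post-cursor is removed but a smaller residual proportional to `H1**2` appears two symbols later; in this toy model the worst-case ISI drops from at most `3*H1` to at most `3*H1**2`.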



FIGURE 13. PMOS S/H circuit diagram.



FIGURE 14. Timing diagram of the FFE and simulated FFE compensated signal.



FIGURE 15. Timing diagram of the quarter-rate clock and S/H.

In SD1, D1 is held for 2.5 unit intervals (UI). D2 in SD2 has a 1-UI delay with respect to D1. The simulated FFE-compensated eye diagram is also shown in Fig. 14. As denoted, the summation happens within the 1.5-UI overlap. With the FFE turned on, the eye is widely open in the first 1.5-UI region of the 2.5-UI D2. After the 1.5-UI FFE window, the data eye is compressed by the ISI. Thus, the slicing point for D2 is set using the same S/H clock CKD02, as depicted in Fig. 15, to obtain the optimal eye opening.

This receiver adopts a quarter-rate architecture. Driven by the quarter-rate clock, the sample-and-hold circuits convert the full-rate data input to quarter-rate signals before the slicer and decoder. D1 to D4 represent the quarter-rate data streams after the sample-and-hold. The original 1-UI bit is extended to 2.5 UI. Fig. 15 shows how this happens. Taking D2 in the input data Din and the quarter-rate clock CKD2 as an example, the recovered sampling clock CKD2 samples at the middle of the input data and holds it for half a clock cycle, which extends the D2 duration to 2.5 UI. The hold period of the clock ends at the middle of D4, so the 2.5-UI D2 is followed by a 0.5-UI D4.

FIGURE 16. PAM-4 transitions (left) and transition selection phase detector (right).

FIGURE 17. Charge pump with middle branch.

The analog FFE summer and the S/H circuit (shown in Fig. 13) contribute noise to the signal. According to simulation, the integrated output noise voltages of the FFE summer and the S/H circuit from 10 kHz to 10 GHz are 491 µV and 8.15 mV, respectively. The noise added by the equalization and S/H blocks needs to be optimized for noise-sensitive applications.

## D. TRANSITION SELECTION PD

The PAM-4 signal possesses complicated data transition patterns that create pattern-dependent input jitter, as illustrated in Fig. 16 (left). The red transition lines have their crossing points centered uniformly, while the crossing points of the green lines deviate from the center because of their vertical and horizontal asymmetry. This deviation of the crossing points causes pattern-dependent jitter. The transition selection PD acts like a filter, so that only the red transitions are valid for controlling the charge pump. Fig. 16 (right) shows the simplified transition selection phase detector diagram. It is a bang-bang PD that uses XNOR logic to determine the lead/lag relation between the data and the sampling clock. If only the MSB is considered, the crossing points of the thick green lines translate into clock phase wandering that deteriorates the recovered clock and thus degrades the bit-error-rate performance. To avoid this, the LSB toggle detection branch (red dotted box) is added, so that the early/late information is considered valid only when the LSB and the MSB toggle at the same time. That is, the condition

$$(M_0 \oplus M_1) \land (L_0 \oplus L_1) = 1$$

must hold true.
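The selection rule can be checked by enumerating all PAM-4 level transitions. The sketch below assumes a natural-binary mapping from level to (MSB, LSB), which is not stated in the text; the helper names are hypothetical, and a different bit mapping would select a different set of transitions.

```python
from itertools import product

def bits(level):
    """Assumed natural-binary mapping: level 0..3 -> (MSB, LSB)."""
    return (level >> 1) & 1, level & 1

def transition_valid(prev_level, curr_level):
    """True when both the MSB and the LSB toggle between consecutive symbols."""
    m0, l0 = bits(prev_level)
    m1, l1 = bits(curr_level)
    return bool((m0 ^ m1) and (l0 ^ l1))

valid = [(a, b) for a, b in product(range(4), repeat=2) if transition_valid(a, b)]
print(valid)  # level pairs whose early/late information is passed on
```

Under this mapping only the 0↔3 and 1↔2 transitions pass the filter. Both pairs are symmetric about the mid-level of the eye, which matches the description of transitions whose crossing points are centered.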



FIGURE 18. Simulation of charge pump output voltage under monotonic early and late input.



FIGURE 19. System diagram RO-based WBW PLL.



FIGURE 20. Schematics of CMOS delay cell.

## E. CHARGE PUMP WITH MIDDLE BRANCH

The simplified circuit diagram is shown in Fig. 17. Early and Late steer the current in the differential pair. The steered current is copied to the rail-to-rail push-pull output stage by current mirrors. When Early = 1 and Late = 0, the bias current goes to the left half of the circuit and is mirrored to discharge the CLP, which decreases the output voltage. If the sampling clock leads, the bias current goes to the right half and is mirrored to charge the CLP. The differential pair works very well when consecutive transitions occur and valid Early/Late information is fed into the CP. For long runs of zeros or ones, both Early and Late are 0 and the tail current source IB is completely shut down. To maintain a satisfactory bandwidth, a middle branch controlled by the Early and Late inputs is added to avoid large-signal on-off operation when the Early/Late information is invalid. As shown in Fig. 17, the middle branch forms a pseudo-differential pair by using PMOS devices to turn on the middle branch. The AC current flow when Early and Late are both zero is denoted by the red arrow in the figure. The simulation results in Fig. 18 show the output voltage  $V_{LP}$  under consecutive Early or Late inputs; the output voltage changes linearly with time.

FIGURE 21. Simulated VCDL delay range versus control voltage VC across corners.

## F. RO-BASED PLL AND VCDL


The ring-oscillator (RO) based wide-bandwidth (WBW) phase-locked loop (PLL) in this design, shown in Fig. 19, employs a 4-stage ring oscillator. Six of the RO phases are used for data and edge sampling. The other two phases are fed back to the PFD/CP without division. There are two major advantages of this RO-based WBW-PLL MPCG. Firstly, the PLL bandwidth is designed to be tunable from 200 MHz to 500 MHz. During the measurement, the PLL bandwidth is set to 300 MHz to suppress the phase noise of the ring oscillator, which has a strong dependency on the supply noise. Since the intrinsic phase noise of the ring oscillator is inversely proportional to its power consumption, the large PLL loop bandwidth allows the ring oscillator to be somewhat noisier in order to save power. Secondly, the open-loop injection MPCG used in [5], [14] suffers from poor phase accuracy due to the unbalanced loading between the injection node and the other nodes. This RO-based WBW-PLL is fully symmetric, and its phase accuracy is further improved by proper layout design. The measured frequency locking range is around 500 MHz at 6.5 GHz.

In Fig. 1, the frequency-synchronous external clock CKREF serves as the reference clock of the ring-oscillator-based PLL after passing through a VCDL. Since one of the output clocks of the PLL aligns with the delayed CKREF, a delay adjustment of the VCDL translates into a phase adjustment of the PLL output clocks. The schematic of the VCDL delay cell is shown in Fig. 20. The delay cell adopts CMOS logic, and its delay is controlled by tuning the load resistance. The cross-coupled PMOS pair exhibits negative resistance, whereas the PMOS controlled by the control voltage VC provides positive resistance. Thus, the load resistance can be adjusted through VC. To guarantee correct clock phase recovery, the tunable delay range of the VCDL should be at least 1 UI (~38.46 ps). The designed tunable delay range should also have enough margin to cover PVT variations and the accumulated phase shift of the input data. A delay chain is employed, and the simulated tunable delay range under different PVT conditions is plotted with respect to the control voltage VC in Fig. 21. Under the ff corner and  $-40^{\circ}C$, the minimum tunable



#### TABLE 1. Performance summary and comparison.\*

|                               | [14] **              | [16] **                           |                  | [3] **            |                   | [15]                 | [2] **          | [17]              | [18] **       | This work              |
|-------------------------------|----------------------|-----------------------------------|------------------|-------------------|-------------------|----------------------|-----------------|-------------------|---------------|------------------------|
| Clocking                      | 1/2                  | 1/4                               |                  | 1/2               |                   | 1/2                  | 1/4             | 1/4               | 1/4           | 1/4                    |
| Functions and<br>Equalization | CTLE, DFE,<br>PI-CDR | CTLE, DFE,<br>FFE, ADC,<br>PI-CDR | CTLE, PI-<br>CDR | CTLE, FFE,<br>ADC | CTLE,<br>ADC      | CTLE, DFE,<br>PI-CDR | CTLE,<br>PI-CDR | CTLE, DFE         | CTLE          | CTLE, FFE,<br>VCDL-CDR |
| DR (Gb/s)                     | 40-56                | 56                                |                  | 64                |                   | 56                   | 64              | 100               | 56            | 52                     |
| PRBS-n                        | 31                   | 31                                |                  | 15                |                   | 7                    | NA              | 15                | 31            | 7                      |
| Ch. Att (dB)                  | 10                   | 32                                | 7.4              | 29.5              | 8.6               | 24                   | 16.8            | 19.2              | 17.8          | 7.3                    |
| Modulation                    | PAM-4                | PAM-4                             | PAM-4            | PAM-4             | PAM-4             | PAM-4                | PAM-4           | PAM-4             | PAM-4         | PAM-4                  |
| Application                   | VSR                  | LR                                | VSR              | LR                | VSR               | MR                   | SR              | SR                | SR            | VSR                    |
| BER                           | 10<sup>-12</sup>     | 10<sup>-12</sup>                  | 10<sup>-12</sup> | 10<sup>-6</sup>   | 10<sup>-4</sup>   | 10<sup>-12</sup>     | 10<sup>-12</sup> | 10<sup>-12</sup>  | 10<sup>-12</sup> | 10<sup>-12</sup>       |
| H. W. @10 <sup>-6</sup> (UI)  | 0.2                  | 0.15                              | 0.18             | N/A               | N/A               | 0.31                 | 0.19            | 0.09              | 0.12          | 0.20                   |
| Supply (V)                    | 0.9/1.2              | 0.85/0.9/1.2/1.8                  |                  | 0.9/1.2           |                   | 1                    | 1               | 0.8/0.9/1         | 0.75/0.85/1.3 | 1                      |
| Pwr (mW)                      | 230                  | 450                               | 270              | 284               | 100               | 420                  | 180             | 111.4             | 104           | 48                     |
| Eff. (pJ/b/s)*                | 4.1                  | 8.0                               | 4.8              | 4.4               | 1.6               | 7.5                  | 2.8             | 1.1               | 1.87          | 0.92                   |
| Eff. (pJ/b/s/dB) *            | 0.41                 | 0.25                              | 0.64             | 0.149             | 0.186             | 0.313                | 0.167           | 0.057             | 0.105         | 0.126                  |
| Process                       | 16nm                 | 16nm                              |                  | 16nm              |                   | 40nm                 | 28nm            | 14nm              | 7nm FinFET    | 40nm                   |
| Area (mm <sup>2</sup> )       | 0.36                 | 2.2 (Tx+Rx)                       |                  | 0.16              |                   | 1.6                  | 0.32            | 0.053             | 0.32          | 0.72                   |

\*The energy-efficiency comparison considers only the receiver side unless otherwise specified.

\*\*Transmitter FFE included, thus the energy efficiency of such a work is overestimated.



FIGURE 22. Measured VCDL delay range



FIGURE 23. Die photo and power consumption breakdown.

delay range is 58 ps, which leaves enough margin to tolerate PVT variations. The tunable delay range was measured by sweeping VC; as shown in Fig. 22, the actual delay range reaches 66 ps.
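The timing numbers quoted in this section follow from the data rate and can be reproduced with a few lines of arithmetic. The sketch below uses only values stated in the text (52 Gb/s PAM-4, quarter-rate clocking, 58-ps simulated and 66-ps measured delay range); the variable names are illustrative.

```python
DATA_RATE_BPS = 52e9        # aggregate data rate
BITS_PER_SYMBOL = 2         # PAM-4 carries two bits per symbol
CLOCK_DIVISION = 4          # quarter-rate clocking

baud = DATA_RATE_BPS / BITS_PER_SYMBOL          # symbol rate in Bd
ui_ps = 1e12 / baud                             # unit interval in ps
clock_ghz = baud / CLOCK_DIVISION / 1e9         # quarter-rate clock frequency

margin_sim = 58.0 / ui_ps    # worst-case simulated range, in UI
margin_meas = 66.0 / ui_ps   # measured range, in UI

print(f"UI = {ui_ps:.2f} ps, quarter-rate clock = {clock_ghz:.1f} GHz")
print(f"VCDL range: {margin_sim:.2f} UI simulated (worst case), "
      f"{margin_meas:.2f} UI measured")
```

Both the worst-case simulated range and the measured range exceed the 1-UI minimum, by roughly half a UI or more.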

The WBW-PLL with the VCDL draws only roughly 10 mW from the supply (Fig. 23), while providing good phase noise performance to ensure proper clocking for 52-Gb/s data.

# III. MEASUREMENT RESULTS

This PAM4 receiver was designed and fabricated in 40-nm CMOS technology. The die photo is shown in Fig. 23; the chip occupies a total area of 0.72 mm<sup>2</sup>. The PAM4 receiver was measured under a 1-V supply and consumes 48 mW of power. The 52-Gb/s PAM4 input signal was generated by combining two 26-Gb/s PRBS-7/PRBS-9 sequences, as depicted in Fig. 25. The receiver compensates 7.3-dB channel loss at 13 GHz, which includes 5.8 dB of coaxial cable loss and 1.5 dB of PCB loss.



FIGURE 24. Phase noise profile of the recover clock.



FIGURE 25. Measured PRBS-9 52-Gbps PAM-4 eye diagram.

The recovered quarter-rate data achieves error-free operation at 6.5 Gb/s. The WBW-PLL achieves an integrated jitter of 550 fs from 2 kHz to 2 GHz; its phase noise profile is shown in Fig. 24. The recovered clock exhibits a time-domain RMS jitter of 380 fs and a peak-to-peak jitter of 2.2 ps, as shown in Fig. 26.

The proposed single-stage MP-CTLE helps extend error-free operation from PRBS-7 to PRBS-9 and enlarges the horizontal opening at a BER of $10^{-6}$ from 0.018 UI to around 0.1 UI, as shown in Fig. 27. Because of noise coupling from the power supply, it is hard to extend error-free operation to longer PRBS sequences. Overall system power consumption is around 48 mW. The source-synchronous quarter-rate receiver achieves error-free



FIGURE 26. Time domain jitter measurement results.



FIGURE 27. BER bathtub curve of PRBS-7/9 with or without LF-EQ.

operation with a superior energy efficiency of 0.92 pJ/bit by utilizing the RO-based wide-bandwidth (WBW) PLL and the single-stage MP-CTLE to save power. A performance comparison with previous works is listed in Table 1.
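The efficiency figures follow directly from the measured power, data rate, and channel loss reported in this section. As a short worked check (the function names are illustrative only):

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """mW divided by Gb/s gives pJ/bit (1e-3 W / 1e9 b/s = 1e-12 J/b)."""
    return power_mw / data_rate_gbps


def energy_per_bit_per_db(power_mw, data_rate_gbps, loss_db):
    """Energy per bit normalized to the compensated channel loss."""
    return energy_per_bit_pj(power_mw, data_rate_gbps) / loss_db


POWER_MW = 48.0        # measured receiver power
DATA_RATE_GBPS = 52.0  # PAM4 data rate
CHANNEL_LOSS_DB = 7.3  # compensated loss at 13 GHz

eff = energy_per_bit_pj(POWER_MW, DATA_RATE_GBPS)
eff_per_db = energy_per_bit_per_db(POWER_MW, DATA_RATE_GBPS, CHANNEL_LOSS_DB)
print(f"{eff:.2f} pJ/bit, {eff_per_db:.3f} pJ/bit/dB")
# prints "0.92 pJ/bit, 0.126 pJ/bit/dB"
```

Both values agree with the 0.92 and 0.126 figures quoted in the text and the abstract.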

#### ACKNOWLEDGMENT

The authors would like to thank Dr. Guang Zhu for his encouragement of young students and for warm discussions on the paper. Gratitude also goes to Xuan Wu for her generous help with testing and discussions.

#### REFERENCES

- [1] Y. Frans *et al.*, "A 56-Gb/s PAM4 wireline transceiver using a 32-way time-interleaved SAR ADC in 16-nm FinFET," *IEEE J. Solid-State Circuits*, vol. 52, no. 4, pp. 1101–1110, Apr. 2017.
- [2] E. Depaoli et al., "A 64 Gb/s low-power transceiver for shortreach PAM-4 electrical links in 28-nm FDSOI CMOS," *IEEE J. Solid-State Circuits*, vol. 54, no. 1, pp. 6–17, Jan. 2019, doi: 10.1109/JSSC.2018.2873602.
- [3] L. Wang, Y. Fu, M.-A. LaCroix, E. Chong, and A. C. Carusone, "A 64Gb/s PAM-4 transceiver utilizing an adaptive threshold ADC in 16nm FinFET," in *Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC)*, San Francisco, CA, USA, 2018, pp. 110–112.
- [4] P.-J. Peng, J.-F. Li, L.-Y. Chen, and J. Lee, "A 56Gb/s PAM-4/NRZ transceiver in 40nm CMOS," in *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, Feb. 2017, pp. 110–111.
- [5] G. Zhu et al., "A 26-Gb/s 0.31-pJ/bit receiver with linear sampling phase detector for data and edge equalization," *IEEE Solid-State Circuits Lett.*, vol. 1, no. 2, pp. 46–49, Feb. 2018.
- [6] J. F. Bulzacchelli et al., "A 28Gb/s 4-tap FFE/15-tap DFE serial link transceiver in 32nm SOI CMOS technology," in *Proc. IEEE Int. Solid-State Circuits Conf.*, San Francisco, CA, USA, 2012, pp. 324–326.
- [7] J. Savoj et al., "Design of high-speed wireline transceivers for backplane communications in 28nm CMOS," in *Proc. IEEE Custom Integr. Circuits Conf.*, San Jose, CA, USA, 2012, pp. 1–4.
- [8] S. Parikh et al., "A 32Gb/s wireline receiver with a low-frequency equalizer, CTLE and 2-tap DFE in 28nm CMOS," in *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers*, San Francisco, CA, USA, 2013, pp. 28–29.

- [9] B. Zhang et al., "A 28 Gb/s multistandard serial link transceiver for backplane applications in 28 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 50, no. 12, pp. 3089–3100, Dec. 2015.
- [10] G. Anzalone et al., "A 0.2–11.7GHz, high accuracy injection-locking multi-phase generation with mixed analog/digital calibration loops in 28nm FDSOI CMOS," in *Proc. ESSCIRC Conf. 42nd Eur. Solid-State Circuits*, Lausanne, Switzerland, 2016, pp. 335–338.
- [11] C. Wang, G. Zhu, Z. Zhang, and C. P. Yue, "A 52-Gb/s sub-1pJ/bit PAM4 receiver in 40-nm CMOS for low-power interconnects," in *Proc. Symp. VLSI Circuits*, Kyoto, Japan, 2019, pp. C274–C275.
- [12] B. Razavi, *Design of Integrated Circuits for Optical Communications*. Hoboken, NJ, USA: Wiley, 2012.
- [13] X. Yang *et al.*, "An open-loop 10GHz 8-phase clock generator in 65nm CMOS," in *Proc. IEEE Cust. Integr. Circuits Conf. (CICC)*, San Jose, CA, USA, 2011, pp. 1–4.
- [14] J. Im et al., "A 40-to-56 Gb/s PAM-4 receiver with ten-tap direct decision-feedback equalization in 16-nm FinFET," *IEEE J. Solid-State Circuits*, vol. 52, no. 12, pp. 3486–3502, Dec. 2017.
- [15] J. Lee, P.-C. Chiang, and C.-C. Weng, "56Gb/s PAM4 and NRZ SerDes transceiver in 40nm CMOS," in *IEEE Symp. VLSI Circuits Dig. Tech. Papers*, Jun. 2015, pp. 118–119.
- [16] P. Upadhyaya *et al.*, "A fully adaptive 19-to-56Gb/s PAM-4 wireline transceiver with a configurable ADC in 16nm FinFET," in *Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC)*, San Francisco, CA, USA, 2018, pp. 108–110, doi: 10.1109/ISSCC.2018.8310207.
- [17] A. Cevrero et al., "6.1 A 100Gb/s 1.1pJ/b PAM-4 RX with dual-mode 1-tap PAM-4/3-tap NRZ speculative DFE in 14nm CMOS FinFET," in *Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC)*, San Francisco, CA, USA, 2019, pp. 112–114, doi: 10.1109/ISSCC.2019.8662495.
- [18] S. Shahramian et al., "30.5 A 1.41pJ/b 56Gb/s PAM-4 wireline receiver employing enhanced pattern utilization CDR and genetic adaptation algorithms in 7nm CMOS," in *Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC)*, San Francisco, CA, USA, 2019, pp. 482–484, doi: 10.1109/ISSCC.2019.8662421.



**CAN WANG** (Graduate Student Member, IEEE) received the bachelor's degree in microelectronics from the Chongqing University of Posts and Telecommunications, Chongqing, China, in 2016. He is currently pursuing the Master of Philosophy degree in electronic and computer engineering with the Hong Kong University of Science and Technology under the supervision of Prof. C. P. Yue.

He was with PhotonIC Technologies and the State Key Laboratory of ASIC and System, Fudan University, Shanghai, China, in 2017. His research interests focus on advanced equalization techniques in wireline and optical communications, compact modeling, and EDA.



**LI WANG** (Graduate Student Member, IEEE) received the bachelor's degree in microelectronics from the Huazhong University of Science and Technology in 2016. He is currently pursuing the Ph.D. degree with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong.





**ZHAO ZHANG** (Member, IEEE) received the B.S. degree from the Beijing University of Posts and Telecommunications, Beijing, China, in 2011, and the Ph.D. degree from the Institute of Semiconductors, Chinese Academy of Sciences, Beijing, in 2016.

From 2016 to 2018, he was with the Hong Kong University of Science and Technology, as a Postdoctoral Fellow working on the design of ultra-low-jitter PLLs and PAM4 CDRs. From 2019 to 2020, he was with Hiroshima University, Higashi-Hiroshima, Japan, as an Assistant Professor. He is currently with the Institute of Semiconductors, Chinese Academy of Sciences and with the Center of Material Science and Optoelectronics Engineering, University of Chinese Academy of Sciences, Beijing. His research interests include the design of low-jitter and low-power PLLs, RF/mm-wave frequency synthesizers, and wireline transceivers.



**WEIMIN SHI** (Member, IEEE) received the B.S. degree in electronic information science and technology from the Shandong Institute of Business and Technology, Yantai, China, in 2013, and the Ph.D. degree in circuits and systems from the University of Electronic Science and Technology of China, Chengdu, China, in 2019.

He is currently a Postdoctoral Fellow with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong. His current research interests include highly efficient broadband power amplifiers and CMOS millimeter-wave IC design.



**C. PATRICK YUE** (Fellow, IEEE) received the B.S. degree (with Highest Hons.) from the University of Texas, Austin, in 1992, and the M.S. and Ph.D. degrees in electrical engineering from Stanford University in 1994 and 1998, respectively.

He is a Professor with the Department of Electronic and Computer Engineering and the Founding Director of the HKUST-Qualcomm Joint Innovation and Research Lab, Hong Kong University of Science and Technology (HKUST). In 1998, he cofounded Atheros Communications (currently, Qualcomm-Atheros). While working in Silicon Valley, he served as a Consulting Assistant Professor with Stanford. In 2003, he joined Carnegie Mellon University as an Assistant Professor. In 2006, he moved to the University of California at Santa Barbara and was promoted to Professor in 2010. From 2014 to 2015, he served as the Associate Provost for knowledge transfer. He has contributed to more than 180 peer-reviewed papers and two book chapters, and holds 17 U.S. patents. His current research interests focus on optical communication and millimeter-wave system-on-chip design, visible and laser light communication systems, and wireless power transfer techniques for IoT applications.

Prof. Yue was presented the 11th Guanghua Engineering Science and Technology Youth Award by the Chinese Academy of Engineering in 2016. Together with his students, he has received the Best Student Paper Award at the IEEE International Solid-State Circuits Conference in 2003 and at the IEEE International Wireless Symposium in 2016, and the IEEE Circuits and Systems Society Outstanding Young Author Award in 2017. He currently serves on the committees of the IEEE International Solid-State Circuits Conference, the IEEE Symposium on VLSI Circuits, and the IEEE European Solid-State Circuits Conference. He is currently serving the IEEE Solid-State Circuits Society (SSCS) as the Vice President of Membership. He is an Editor of the PROCEEDINGS OF THE IEEE. He has served as an Editor for the IEEE ELECTRON DEVICE LETTERS and IEEE Solid-State Circuits Society Magazine, and a Guest Editor for the IEEE TRANSACTIONS ON MICROWAVE THEORY AND TECHNIQUES. He was an IEEE SSCS Distinguished Lecturer in 2017 and an Elected IEEE SSCS AdCom Member from 2015 to 2017. He is a Fellow of OSA.



**MILAD KALANTARI MAHMOUDABADI** (Member, IEEE) received the M.Sc. degree in electrical engineering from the Sharif University of Technology, Tehran, Iran, in 2011, and the dual Ph.D. degrees in electrical engineering from the Sharif University of Technology, Tehran, and the Hong Kong University of Science and Technology (HKUST), Hong Kong, in 2020.

His Ph.D. thesis focuses on miniaturized CMOS integrated circuits and systems including radar frontends at the millimeter-wave band. He is currently with the HKUST-Qualcomm Joint Innovation and Research Lab, Hong Kong, as a Postdoctoral Fellow, working on integrated millimeter-wave radars and wireless system-on-chips for 5G applications.