# Edge-Based Adaptation for a 1 IIR + 1 Discrete-Time Tap DFE Converging in 5 $\mu$s

Shayan Shahramian, Behzad Dehlaghi, *Student Member, IEEE*, and Anthony Chan Carusone, *Senior Member, IEEE* 

Abstract-A 16 Gb/s 1-tap infinite impulse response (IIR) + 1-tap discrete-time (DT) decision feedback equalizer (DFE) with integrated clock recovery and adaptation is demonstrated in 28 nm FD-SOI CMOS. Using a CMOS phase rotator, 0.7 unit interval (UI) high-frequency jitter tolerance is achieved when operating mesochronously, and over 0.4 UI when operating plesiochronously. The half-rate architecture includes a novel 2:1 multiplexer to reduce delay in the IIR feedback path. With a 28 dB loss channel, a BER below $10^{-12}$ is measured over a 0.32 UI timing window with a TX swing of 0.8 Vpp-diff. Using a 2 Vpp-diff TX swing, a 30 dB loss channel has a measured BER below $10^{-12}$ over a 0.3 UI timing window. A novel edge-based algorithm adapts both IIR and DT equalizer coefficients using the same high-speed circuitry and signals required for clock recovery. The algorithm utilizes all transitions to inform the adaptation of all equalizer coefficients, thereby providing faster convergence than previously reported algorithms which await specific patterns. Moreover, the adaptation freezes automatically unless a diverse set of data patterns is received, thereby making the algorithm robust in the presence of poorly-conditioned data. The adaptive DFE converges within 5 $\mu$s.

*Index Terms*—Clock recovery, clockless multiplexor, decision feedback equalizers, DFE adaptation, edge-based adaptation, infinite impulse response filters, IIR DFE, phase interpolator.

## I. INTRODUCTION AND BACKGROUND

INCREASING demand for higher speed data networking and computer infrastructure has led I/O bandwidths to double approximately every 24 months [1]. In data center switch chips, the aggregate data rate has reached 1.28 Tb/s using 128 lanes at 10 Gb/s [1]. Similarly, demand for faster chip-to-chip communication links has been increasing. Links ranging from ultrashort reach (distances < 2.5 cm) all the way up to long reach (up to 100 cm over a backplane) are required between chips at increasing speeds.

Manuscript received April 13, 2016; revised June 11, 2016; accepted July 12, 2016. Date of publication September 8, 2016; date of current version November 21, 2016. This paper was approved by Guest Editor Elad Alon.

S. Shahramian is with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 3G4, Canada (e-mail: shayan.shahramian@utoronto.ca).

B. Dehlaghi and A. Chan Carusone are with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 3G4, Canada.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSSC.2016.2594209

Increasing data rates both pose signal integrity challenges and raise thermal issues. Massive infrastructure is needed to cool growing data centers, which makes energy efficiency a critical parameter. Furthermore, the power consumption of chip-to-chip communication is bounded by the point at which the chip heats to catastrophic breakdown. The maximum power density for an air-cooled chip is approximately 100 W/cm<sup>2</sup> [2]. To obtain a reasonable yield, the maximum die size is approximately 1.5 cm × 1.5 cm, so a die can dissipate somewhere between 100–200 W before thermal failure. Allocating 10–20% of the power for the I/O of the chip provides approximately 10–40 W. Hence, chips that require an aggregate bandwidth of 1–4 Tb/s require an energy efficiency of 2.5–10 pJ/bit while operating at 10+ Gb/s. At such data rates, many chip-to-chip links are required to equalize inter-symbol interference (ISI) over 10 or more post-cursor unit intervals (UIs).
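The energy-efficiency range quoted above follows from dividing the I/O power budget by the aggregate bandwidth. The short Python sketch below reproduces that arithmetic; the function name `energy_per_bit_pj` is illustrative only, and the chosen corner cases are an assumption about which ends of the quoted ranges pair up.

```python
def energy_per_bit_pj(io_power_w: float, bandwidth_tbps: float) -> float:
    """Energy per bit in pJ: watts divided by Tb/s is numerically pJ/bit."""
    return io_power_w / bandwidth_tbps

# Corner cases of the ranges quoted in the text (10-40 W of I/O power, 1-4 Tb/s).
cases = [(10, 1), (10, 4), (40, 4)]
for power_w, bw_tbps in cases:
    pj = energy_per_bit_pj(power_w, bw_tbps)
    print(f"{power_w} W at {bw_tbps} Tb/s -> {pj:.1f} pJ/bit")
```

Under these pairings the budget spans 2.5 pJ/bit to 10 pJ/bit, matching the range in the text.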

To compensate for channel losses, continuous-time linear equalizers can be used; however, they amplify high-frequency noise and crosstalk. Traditional discrete-time (DT) decision feedback equalizers (DFEs) can be used to remove the post-cursor ISI; however, for channels with over 10 UI of post-cursor ISI, DT DFE power consumption can become prohibitive. IIR DFEs are a low-power technique for canceling long post-cursor ISI tails and have been demonstrated compensating over 20 dB loss at $f_{bit}/2$ up to 10 Gb/s [3]–[7]. Fig. 1(a) shows the implementation of a 1 IIR + 1 DT tap DFE with the pulse response shown in Fig. 1(b). By adjusting the gain and filter time-constant, multiple post-cursor ISI samples can be canceled at once. It has been shown in [6] that the feedback loop delay in an IIR DFE degrades the performance significantly even if the delay is less than 1 UI. It has also been shown in [6] that adding a DT tap helps alleviate loop timing issues and improves the performance. Although DT taps have been added to IIR DFEs in previous NRZ implementations to help with timing issues, their maximum data rates have been limited to 10 Gb/s [3], [6], [7]. This work operates with 1 IIR + 1 DT tap up to 16 Gb/s.

To determine the IIR DFE parameters and to maintain signal integrity in time-varying channels and circuit conditions, equalizer adaptation is required. Robust adaptation algorithms suitable for DT DFEs are well-established [8]–[10], but there are few examples of adaptive algorithms for IIR DFEs [4], [7], each exhibiting relatively slow convergence, requiring additional high-bandwidth hardware, and/or requiring the input data statistics to meet specific criteria. In this work, a 16 Gb/s

0018-9200 © 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications\_standards/publications/rights/index.html for more information.



Fig. 1. (a) 1 IIR + 1 DT tap DFE architecture. (b) Channel, DT tap, and IIR pulse responses.



Fig. 2. Block diagram of the proposed receiver.

IIR DFE is integrated into a clock recovery unit (CRU), and the adaptation algorithm makes use of signals available in a regular binary phase detector to simultaneously adapt the IIR and DT taps. The novel algorithm provides faster and more robust convergence than has been previously demonstrated for IIR DFEs.

The remainder of the paper is organized as follows. Section II shows the circuit implementation details of the half-rate 1 IIR + 1 DT DFE, the CRU, the phase rotator (PR), and the DFE adaptation algorithm. The measurement results are provided in Section III. Finally, Section IV concludes the paper.

## II. PROPOSED RECEIVER

Fig. 2 shows a block diagram of the implemented system. A half-rate DFE with 1 IIR + 1 DT tap is incorporated into a phase detector providing binary samples of the received data and edges. The data and edge samples, both needed by the adaptation algorithm and the digital CRU, are demultiplexed 2:64 using custom high-speed demultiplexers (2:8) followed by synthesized logic (8:64). The 64 demultiplexed signals allow the digital logic for adaptation and clock recovery to be synthesized using standard cells. The output of the DFE adaptation algorithm is 15 bits: 5 for the DT gain, 5 for the IIR gain, and 5 for the IIR time-constant. The digital CRU outputs codes needed for the phase rotator which adjusts the phase of the 4 sampling clocks (CLK0, CLK90, CLK180, CLK270) for the 4 comparators in the half-rate DFE.

### A. Half-Rate 1 IIR + 1 DT DFE

Fig. 3 shows the implementation of the half-rate IIR DFE. The received data is captured alternately at  $D_{ODD}$  and  $D_{EVEN}$ ,



Fig. 3. Half-rate 1 IIR + 1 DT tap DFE block diagram.



Fig. 4. Clockless 2:1 multiplexer schematic.

and the edge samples $E_{ODD}$ and $E_{EVEN}$ are used for coefficient adaptation and clock phase detection. For the DT tap, the output of the SR latch is fed back into the other half of the DFE [6]. A double-tail latch is used [11] with the subtraction performed directly inside the latch [6]. A key challenge which has limited the speed of past IIR DFEs is their feedback loop delay, which includes a full-rate multiplexer. In this work, instead of connecting the output of the SR latch to a 2:1 multiplexer [3], [6], [7], the output of the double-tail latch is directly connected to a new clockless 2:1 multiplexer. This allows the IIR DFE loop delay to be reduced, which has a significant impact on the performance of the system [6]. The output of the 2:1 multiplexer is fed into the IIR filter, which is then connected to all latches in the DFE as shown in Fig. 3. In this system, the same coefficients are used by the data and edge latches. Because the algorithm removes the edge ISI, the data latches do not optimally cancel the ISI at the center of the eye. For an optimal solution, the data and edge



Fig. 5. (a) Filter bandwidth for an IIR filter with tunable BWR vs. BWC. (b) Normalized pulse response for an IIR filter with tunable BWR vs. BWC.

latches could have different coefficients to optimize for vertical and horizontal eye opening, respectively.

The clockless multiplexer schematic is shown in Fig. 4. It has several advantages over the clocked multiplexers in [3], [6], [7]. First, similar to [5], [12], the multiplexer is connected to the output of the double-tail latch, as opposed to the output of the SR latch, as shown in Fig. 3. This allows the delay to be minimized, improving the maximum data rate of the IIR DFE. In this work, the multiplexer is, in fact, two SR latches in parallel which alternately control its output. When the even path of the DFE is evaluating, the output of the multiplexer is set by in1 coming from the even data latch (clocked by CLK0). It therefore latches to the even value. Meanwhile, the odd-path double-tail latch (clocked by CLK180) is reset to zero and has no impact on the multiplexer output. Similarly, during the other phase of the half-rate DFE, the odd double-tail latch output is valid and determines the multiplexer output. Therefore, together the two SR latches function as a 2:1 clockless multiplexer. The second advantage of this multiplexer is that it does not need a half-rate clock input with precise phase applied to it, reducing the load on the clock buffers and their power consumption. It should be noted that the additional load from the multiplexer on the double-tail latch requires an increase in the driving strength, which will increase the power consumed by the latch.
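The alternating behaviour described above can be captured in a purely behavioural Python sketch. It assumes each double-tail latch presents a differential pair `(q, qb)` that is `(0, 0)` while reset and complementary while evaluating; the function names and this encoding are illustrative assumptions, and analog effects such as delay and overlap between phases are not modelled.

```python
RESET = (0, 0)  # assumed encoding of a double-tail latch in its reset phase

def clockless_mux(prev_out, even, odd):
    """Two SR latches in parallel: whichever input pair is valid sets the output."""
    for q, qb in (even, odd):
        if q and not qb:
            return 1          # "set" from the evaluating latch
        if qb and not q:
            return 0          # "reset" from the evaluating latch
    return prev_out           # both latches reset: hold the previous value

def run_mux(bits):
    """Feed a full-rate bit sequence through alternating even/odd latches."""
    out, trace = 0, []
    for n, bit in enumerate(bits):
        valid = (bit, 1 - bit)
        even, odd = (valid, RESET) if n % 2 == 0 else (RESET, valid)
        out = clockless_mux(out, even, odd)
        trace.append(out)
    return trace

print(run_mux([1, 0, 0, 1, 1]))
```

In this idealized model the multiplexer output simply reproduces the interleaved full-rate sequence without any clock input, which is the property the text relies on.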

Fig. 5(a) shows the filter bandwidth vs. filter code for two possible implementations of the IIR filter. The first case uses binary weighted capacitors (BWCs) [6], [7] and the alternative implementation uses binary weighted resistors (BWRs) to set the filter bandwidth. Both approaches can be designed to give the same total tuning range for the filter bandwidth as shown in Fig. 5(a). Using BWCs allows for more accuracy in setting the time constant at low bandwidths, while using BWRs varies the filter bandwidth linearly with the resistor code. Fig. 5(b) shows IIR filter pulse responses for each of the two cases. Using the BWR approach allows for more granularity in matching the shape of the pulse response to cancel the first few post-cursor ISI terms, whereas the BWC approach gives more accuracy in cancelling distant post-cursor ISI. Since the first few post-cursor ISI terms affect the performance more severely, a BWR scheme is chosen to allow for better cancellation of the



Fig. 6. IIR filter implementation using BWR.



Fig. 7. Clock recovery unit block diagram.

dominant post-cursor ISI. In [6], the imprecise ISI cancellation of the BWC scheme limited measured loss compensation to 24 dB. The filter schematic is shown in Fig. 6. The resistors are binary weighted with a constant capacitor. The switch sizes are also binary weighted to keep the total resistance in each branch binary weighted. The filter is designed to have 31 bandwidth settings with 75 MHz resolution over the range 75 MHz–2.4 GHz.
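The contrast between the two tuning schemes can be illustrated with an idealized first-order RC model, in which bandwidth is proportional to conductance (BWR) or inversely proportional to capacitance (BWC). The sketch below is an assumption-laden illustration: it uses codes 1 to 31, forces both schemes to share the same end points, and tops out at 31 × 75 MHz = 2.325 GHz, slightly below the 2.4 GHz quoted above.

```python
STEP_MHZ = 75.0          # bandwidth resolution quoted in the text
CODES = range(1, 32)     # 31 settings; code 0 is assumed unused

def bw_bwr_mhz(code: int) -> float:
    """Binary weighted resistors: conductance, hence bandwidth, is linear in code."""
    return STEP_MHZ * code

def bw_bwc_mhz(code: int) -> float:
    """Binary weighted capacitors: bandwidth varies as 1/code (same end points assumed)."""
    return STEP_MHZ * 31 / code

for code in (1, 2, 16, 30, 31):
    print(f"code {code:2d}: BWR {bw_bwr_mhz(code):7.1f} MHz, BWC {bw_bwc_mhz(code):7.1f} MHz")

# Step size near the low-bandwidth end of each scheme.
print("BWR low-end step:", bw_bwr_mhz(2) - bw_bwr_mhz(1), "MHz")
print("BWC low-end step:", round(bw_bwc_mhz(30) - bw_bwc_mhz(31), 2), "MHz")
```

In this model the BWC scheme has much finer steps at low bandwidths, while the BWR scheme spreads its steps uniformly, which is consistent with the trade-off described in the text.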

### B. Clock Recovery Unit

The digital CRU uses the 64 demultiplexed data,  $a_k$ , and edge,  $e_k$ , samples to track the incoming data phase with the recovered clock. Fig. 7 shows a block diagram of the CRU. The bang-bang phase detector logic looks at all 64 bits of incoming data and edge samples to determine the number of



Fig. 8. Multi-phase generator implementation.

early/late clock occurrences. The early/late counts are then subtracted and passed through parallel paths with gains of Kp (proportional path) and Ki (integral path), respectively. The gains Kp and Ki are programmable by factors of 2 (i.e., $1\times$, $2\times$, $4\times$, ...). The integral path can also be disabled. The outputs of the two paths are then integrated, producing a 24-bit output, which is then truncated to 7 bits and converted to thermometer- and gray-encoded signals for the phase rotator. The truncation provides some averaging, so that the phase rotator code is only updated when the 7 MSBs of the 24-bit integrator output change. The skew between the data sampling clocks and edge clocks can be adjusted by applying a programmable offset between the phase codes applied to the data and edge phase rotators. This allows calibration to ensure the phases are in fact 90° apart. The phase rotator code is down-sampled by $7\times$ and can be monitored off-chip. This allows the phase code vs. time to be plotted, illustrating the locking behaviour of the CRU discussed in Section III.
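As a rough illustration of the loop just described, the following Python sketch models one CRU update per demultiplexed word. The class and function names are hypothetical, the sign convention (early minus late advances the phase accumulator) is an assumption, and the exact placement of the integral-path accumulator is inferred from the description rather than taken from the schematic.

```python
def bang_bang_votes(data, edges):
    """Alexander phase detector: edges[i] is sampled between data[i] and data[i+1]."""
    early = late = 0
    for i in range(len(data) - 1):
        if data[i] != data[i + 1]:      # only transitions carry phase information
            if edges[i] == data[i]:
                early += 1              # edge sample still equals the earlier bit
            else:
                late += 1
    return early, late

class DigitalCRU:
    ACC_BITS = 24    # integrator width quoted in the text
    CODE_BITS = 7    # bits kept after truncation

    def __init__(self, kp_shift=0, ki_shift=0, integral_enabled=True):
        self.kp = 1 << kp_shift         # gains programmable by factors of 2
        self.ki = 1 << ki_shift
        self.integral_enabled = integral_enabled
        self.freq = 0                   # integral-path state (assumed accumulator)
        self.phase = 0                  # 24-bit phase integrator

    def update(self, data, edges):
        early, late = bang_bang_votes(data, edges)
        delta = early - late
        if self.integral_enabled:
            self.freq += self.ki * delta
        self.phase = (self.phase + self.kp * delta + self.freq) % (1 << self.ACC_BITS)
        return self.phase >> (self.ACC_BITS - self.CODE_BITS)   # keep the 7 MSBs

cru = DigitalCRU(kp_shift=10, ki_shift=0)
code = cru.update([0, 1, 0, 1], [0, 1, 0, 0])   # three transitions, all voting early
print("phase code:", code)
```

Because only the 7 MSBs are forwarded, many small early/late imbalances must accumulate before the phase rotator code moves, which is the averaging effect noted in the text.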

### C. Phase Rotator

The phase rotator consists of a multi-phase generator (MPG) and 8 single-ended phase interpolators (PIs), 4 for data sampling clocks and 4 for edge sampling clocks, followed by two pseudo-differential multiplexers.

The MPG input is a differential clock, CLKin, AC-coupled to a two-stage ring oscillator that generates differential quadrature clocks I/Ib and Q/Qb as shown in Fig. 8. The free-running frequency of the MPG is adjusted externally via an 8-bit control to bring it as close to half the incoming data rate as possible. The further the free-running frequency of the MPG deviates from the CLKin frequency, the more phase error arises between the I/Q outputs of the MPG. In a multi-lane transceiver, the MPG free-running frequency codes could be slaved to a single central master on-chip frequency-locked loop that uses a duplicate MPG as its digitally-controlled oscillator [13]. The free-running frequency can further be influenced by the body bias voltage of the SOI process. The 28 nm SOI process allows the threshold of the NMOS and PMOS transistors to be adjusted, further increasing the tuning range of the MPG [14]. A replica CLKin driver is included for better I/Q phase generation when there is a mismatch between the free-running frequency of the



Fig. 9. Phase interpolator implementation.

ring oscillator and the injected clock. However, if the two frequencies are identical, the replica may result in higher I/Q mismatch.

Each of the four PIs for the data samplers is responsible for covering one quadrant of possible clock phases: $0^{\circ}-90^{\circ}$, $90^{\circ}-180^{\circ}$, etc., as shown in Fig. 9. Likewise, four phase interpolators are used for the edge samplers. Each phase interpolator comprises two sets of 31 inverters driving the same node. The inverters are selectively (de)activated to provide a weighted combination of the input phases depending on the input phase code. To improve PI linearity, capacitor $C_1$ reduces the swing at the inverter-bank output to approximately 400 mVpp. AC-coupling capacitor $C_2$ and an inverter with resistive feedback follow to alleviate sensitivity to common-mode variations. Finally, the multiplexer outputs select between the different phase interpolator outputs depending on the quadrant of the selected phase.
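The weighted combination of two input phases can be illustrated with an idealized phasor model, in which a thermometer code enables `n` of the 31 inverters driven by one phase and the remaining `31 - n` driven by the other. This is only a sketch under a sinusoidal-signal assumption: the actual inverter-based PI, with its swing-limiting capacitor, will not follow this curve exactly, and the function names are illustrative.

```python
import math

N_INV = 31  # inverters per bank, as stated in the text

def thermometer(n: int) -> list:
    """31-bit thermometer code with n ones (n between 0 and 31)."""
    return [1] * n + [0] * (N_INV - n)

def interpolated_phase_deg(n: int, phase_a_deg: float = 0.0, phase_b_deg: float = 90.0) -> float:
    """Phase of the weighted sum of two unit phasors (idealized sinusoidal model)."""
    w_b = sum(thermometer(n))    # inverters driven by the second phase
    w_a = N_INV - w_b            # inverters driven by the first phase
    a, b = math.radians(phase_a_deg), math.radians(phase_b_deg)
    x = w_a * math.cos(a) + w_b * math.cos(b)
    y = w_a * math.sin(a) + w_b * math.sin(b)
    return math.degrees(math.atan2(y, x))

for n in (0, 8, 16, 24, 31):
    print(f"n = {n:2d}: {interpolated_phase_deg(n):6.2f} deg")
```

Even in this ideal model the phase steps are not perfectly uniform across the quadrant, which gives some intuition for why PI linearity needs attention.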

Fig. 10(a) shows the operation of the phase rotator when the phase code varies from 0 to 63. During phase rotator codes 0–31, PI A is selected by Data<sub>MUX</sub> and the output phase of CLK0 is between 0° and 90°. At the same time, PI C is selected to provide CLK180. To ensure a smooth transition between PI A and PI B, the clock phases of PI A and PI B must be coincident at the switching point. Therefore, PI B is encoded so that its phase changes in the opposite direction, from $180^{\circ}$ to $90^{\circ}$, as shown in Fig. 10(a). When switching from PI A to PI B, to ensure there is no missing clock edge that would result in a bit error due to skew between Data<sub>MUX</sub> control signals, both PIs are selected for one code at the boundary between two PIs (90°, in this case). Although this approach prevents glitches in the clock signals, it degrades the linearity of the phase rotator. This is mainly because at these boundary codes two PIs are driving the clock buffers, which reduces the delay compared to the case where only one PI is selected. The thermometer coding of the PIs (Data<sub>PI</sub>) and gray coding of the multiplexers (Data<sub>MUX</sub>) for CLK0 are shown in Fig. 10(b).



Fig. 10. (a) Phase diagram for phase interpolator showing wrapping behavior. (b) Details of phase interpolator and multiplexer coding.



Fig. 11. Proposed edge-based adaptation algorithm using all patterns with a transition  $(a_0 \neq a_{-1})$  to update the DFE coefficients.

To reduce phase rotator power consumption, PIs corresponding to unused clock phase quadrants may be powered down. When operating plesiochronously, the phase code rotates through all quadrants, so the CRU powers up the other phase interpolators whenever the recovered phase code approaches a transition between quadrants (at 0°, 90°, 180°, and 270°) to avoid glitches. Unused PIs are powered off again when the clock phase slips away from the transitions.

### D. DFE Adaptation Algorithm

This section describes the novel edge-based adaptation algorithm employed in this work, and explains how it provides significantly faster convergence than the prior art. Let $d_m \in \{-1, 1\}$ represent the transmitted data, c(t) the channel pulse response including the transmitter and receiver front end, T the bit period, and $t_{samp} \in \{0, T\}$ the sampling point for recovering data. The received pulse response data samples are

$$h_m = c(t_{samp} + mT) \tag{1}$$

where  $m \in \mathbb{Z}$ . The received pulse response edge samples are

$$h_{m+0.5} = c(t_{samp} + (m+0.5)T). \tag{2}$$

These are shown graphically in Fig. 11. Let G be the DFE DT tap gain, and f(t) the IIR filter response. The sampled IIR filter response is

$$f_m = f(mT). \tag{3}$$
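To make the sampled quantities concrete, the Python sketch below evaluates data samples $h_m$, edge samples $h_{m+0.5}$, and filter samples $f_m$ for a toy channel. Everything numerical here is an assumption for illustration: the channel is modelled as one NRZ pulse through a single-pole low-pass, the IIR response as a decaying exponential, and the constants `TAU_CH`, `gain`, and `tau` are arbitrary.

```python
import math

T = 1.0        # bit period (normalized)
TAU_CH = 1.5   # hypothetical channel time constant, in units of T

def c(t: float) -> float:
    """Toy pulse response: an NRZ pulse of width T through a single-pole low-pass."""
    if t < 0:
        return 0.0
    if t < T:
        return 1.0 - math.exp(-t / TAU_CH)
    return (1.0 - math.exp(-T / TAU_CH)) * math.exp(-(t - T) / TAU_CH)

def f(t: float, gain: float, tau: float) -> float:
    """Assumed first-order IIR feedback response: gain * exp(-t / tau) for t >= 0."""
    return gain * math.exp(-t / tau) if t >= 0 else 0.0

t_samp = T  # sample at the peak of the toy pulse
h_data = [c(t_samp + m * T) for m in range(6)]            # data samples h_m
h_edge = [c(t_samp + (m + 0.5) * T) for m in range(6)]    # edge samples h_{m+0.5}
f_samp = [f(m * T, gain=0.25, tau=TAU_CH) for m in range(6)]  # filter samples f_m

print("h_m     :", [round(v, 3) for v in h_data])
print("h_m+0.5 :", [round(v, 3) for v in h_edge])
print("f_m     :", [round(v, 3) for v in f_samp])
```

For this toy channel the post-cursor samples decay geometrically, which is the kind of smooth tail a single gain and time constant can track.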

Let  $u_m$  be the step function where  $u_m = 0$  for m < 0 and  $u_m = 1$  for  $m \ge 0$ . The equalized data at the sampling point can be expressed as

$$D_m = \sum_{i=-\infty}^{\infty} d_i \cdot h_{m-i} - G \cdot \operatorname{sign}(D_{m-1}) - \sum_{i=-\infty}^{\infty} \operatorname{sign}(D_{m-2}) \cdot f_{m-i} \cdot u_{m-2}. \tag{4}$$

The equalized edge samples can be expressed as

$$E_{m} = \sum_{i=-\infty}^{\infty} d_{i} \cdot h_{m+0.5-i} - G \cdot \operatorname{sign}(D_{m-1}) - \sum_{i=-\infty}^{\infty} \operatorname{sign}(D_{m-2}) \cdot f_{m+0.5-i} \cdot u_{m-2}. \tag{5}$$

The recovered data is  $a_m = \text{sign}(D_m)$ , and the recovered edge samples are  $e_m = \text{sign}(E_m)$ .

Edge-based DFE adaptation strives to eliminate the correlation between each edge sample, $E_m$, and the preceding bits, $a_{m-2}$, $a_{m-3}$, etc., which is caused by post-cursor ISI. For independent random data and assuming $d_i \approx a_i$, all terms in the product $a_{m-k-1} \cdot E_{m-1}$ average to zero over time, except for a term proportional to the residual edge ISI sample $h_{k+0.5}$ minus corresponding terms containing G and f(t). Hence, an LMS algorithm that iteratively updates G and f(t) to drive the product to zero will adapt the DFE response to cancel $h_{k+0.5}$, eliminating post-cursor zero-crossing ISI. However, computing the product $a_{m-k-1} \cdot E_{m-1}$ would require the analog value of $E_{m-1}$ to be digitized with a high-speed and high-resolution ADC. Instead, the 1-bit quantized version of $E_{m-1}$ that is already available inside any Alexander bang-bang phase detector, $e_{m-1}$, can be used [15]. This is analogous to other sign-sign LMS adaptation algorithms [8], [9], [16] which use 1-bit quantized versions of their gradient estimates to perform adaptation.

Fig. 11 illustrates the 1 IIR + 1 DT adaptation algorithm. The same edge, $e_k$, and data, $a_k$, samples required by the CRU are used to inform the adaptation. Following the preceding discussion, correlations between early-late phase detector outputs $e_{k-1}$ and the four preceding bits $a_{k-2} \dots a_{k-5}$ have the same sign as residual post-cursor edge ISI terms $h_{1.5}$, $h_{2.5}$, $h_{3.5}$, and $h_{4.5}$. The algorithm updates each DFE coefficient iteratively, moving the observed correlations $(a_{-k} \times e_{-1})$ towards zero, thereby minimizing post-cursor ISI. The binary product $(a_{-k} \times e_{-1})$ requires simply a logical XOR. Using this approach, no training pattern or lengthy BER measurements are required to perform adaptation as in, for example, [7]. No additional high-speed comparator is required for adaptation, avoiding the associated extra power, loading on a critical node in the DFE, and phase-adjustment circuitry. Note that the bang-bang phase detector only provides useful information when there is a transition in the data; i.e., $a_{m-1} \neq a_m$. The adaptation algorithm assumes a relatively smooth channel response without any major discontinuities since that is where IIR DFEs are most useful [17]. While the adaptation algorithm will converge based on the cost function, if there are major bumps or notches in the pulse response, the adapted coefficients may not lead to the best possible horizontal eye opening.
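The statement that the binary product needs only an XOR can be checked with a small Python sketch. It assumes bits are stored as 0/1 and represent the values −1/+1; the helper names are illustrative, and `edge_correlation` uses the $(a_{-k} \times e_{-1})$ indexing of this paragraph, with `edges[i]` taken between `data[i]` and `data[i+1]`.

```python
def to_pm1(bit: int) -> int:
    """Map a stored bit {0, 1} to the signal value {-1, +1}."""
    return 2 * bit - 1

def sign_product(a_bit: int, e_bit: int) -> int:
    """Product of the two +/-1 values, computed with a single XOR on the bits."""
    return 1 - 2 * (a_bit ^ e_bit)

def edge_correlation(data, edges, k):
    """Sum of (a_{-k} x e_{-1}) over all transitions a_{-1} != a_0 in the word."""
    total = 0
    for m in range(k, len(data)):
        if data[m - 1] != data[m]:            # phase detector output is valid
            total += sign_product(data[m - k], edges[m - 1])
    return total

# The XOR form agrees with the arithmetic product for every bit combination.
for a in (0, 1):
    for e in (0, 1):
        assert sign_product(a, e) == to_pm1(a) * to_pm1(e)

print(edge_correlation([1, 1, 0], [0, 1, 0], k=2))
```

A correlation that stays positive or negative over many words indicates residual edge ISI of the corresponding sign at that offset.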

The algorithm will next be described more rigorously. Consider a data pattern *N* bits long, $[a_{m-N+1} \ a_{m-N+2} \dots a_m]$. The pattern is assigned index [A, k], where $A \in \{1, \dots, 2^{N-2}\}$, when $a_{m-1} \neq a_m$ and $a_{m-k-1} = -1$, while the index [A', k] is assigned to the same pattern except with bit $a_{m-k-1} = 1$. An example for N = 6 is shown in Fig. 12. The patterns [A, k] and [A', k] are used together with the corresponding edge samples $e_{m-1}$ to inform the adaptation of the DFE. Let $\Gamma_m^{[A,k]} = 1$ when pattern [A, k] occurs and 0 otherwise. The equalizer parameter updates indicated by each occurrence of pattern [A, k] are recorded as follows,

$$P_{m+1}^{[A,k]} = P_m^{[A,k]} + e_{m-1} \cdot a_{m-k-1} \cdot \Gamma_m^{[A,k]} \tag{6}$$

and a similar equation can be written for $P_m^{[A',k]}$.

Next, the parameter updates for all received patterns are summed and  $\zeta_m^k$  is calculated every  $\Lambda$  bits. Specifically, let  $\psi$  be 1 every  $\Lambda$  bits and zero otherwise. Then the adaptation gain step can be written as

$$\zeta_m^k = \mu \cdot \sum_{A=1}^{2^{N-2}} \left( P_m^{[A,k]} + P_m^{[A',k]} \right) \cdot \psi. \tag{7}$$

The values of  $P_m^{[A,k]}$  in equation (6) are reset to zero every  $\Lambda$  bits, after each DFE parameter update. By utilizing all patterns to inform the adaptation, this approach allows larger DFE coefficient updates to be performed on each iteration and faster convergence than, for example, [4] where only one specific pattern is used at a time.

Recall that $\zeta_m^k$ will provide information regarding the amount of ISI present at $h_{k+0.5}$. The update equation for the first post-cursor DT tap, G, is therefore

$$G_{m+1} = G_m + \zeta_m^1 \cdot \psi. \tag{8}$$



Fig. 12. Example patterns of length N = 6 used for obtaining ISI information at  $h_{0.5+k}$ .

Essentially the DT tap weight is iteratively updated to drive the correlation  $(a_{-2} \times e_{-1})$  towards zero, minimizing ISI at  $h_{1.5}$ .

One challenge is how to independently adapt the IIR DFE gain, B, and time constant,  $\tau$ , both of which contribute to the cancellation of all post-cursor ISI terms, along with the DT tap weight, G. In [4] only  $h_{2.5}$  information is used to adapt the IIR time constant which may not result in a good fit to a long tail in the channel pulse response. Moreover, [4] does not include a DT tap in the DFE which leaves its performance sensitive to any process or voltage variations in the DFE feedback delay; if a first post-cursor discrete-time tap were added to the architecture in [4], its adaptation would certainly interact and confuse the adaptation of the IIR tap using that scheme. In this work, the product  $(a_{-3} \times e_{-1})$  is used, via  $\zeta_m^2$ , to guide the IIR gain coefficient, B, towards zero ISI at  $h_{2.5}$ . The update equation for the IIR gain is

$$B_{m+1} = B_m + \zeta_m^2. \tag{9}$$

Finally, the IIR time constant  $\tau$  is guided by both the products  $(a_{-4} \times e_{-1})$  via  $\zeta_m^3$  and  $(a_{-5} \times e_{-1})$  via  $\zeta_m^4$ , thereby adjusting  $\tau$  to remove ISI at both  $h_{3.5}$  and  $h_{4.5}$ . It should be noted that changing the time-constant of the IIR filter will also affect  $h_{2.5}$  and, hence, the gain *B*. To slow down the interaction between these two adaptation loops, the IIR gain is updated more frequently than the IIR time constant. This allows the gain to always adjust to cancel  $h_{2.5}$  as the time-constant is adapted. The update equation for the IIR filter time constant is

$$\tau_{m+1} = \tau_m + (\zeta_m^3 + \zeta_m^4) \cdot \chi, \tag{10}$$

where $\chi = 1$ every $3 \times \Lambda$ bits and zero otherwise. In the proposed design, $\Lambda = 64$, which allows the data to be demultiplexed by 64, and each group of 64 samples is used for a single update of the DFE coefficients. All equalizer coefficient updates are calculated at the demultiplexed clock rate, $f_{bit}/64$.
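The update equations (6)–(10) can be sketched in Python as one update per demultiplexed word. Because every transition matches exactly one pattern index [A, k] or [A', k], the sum over patterns in (7) reduces to a sum of $e_{m-1} \cdot a_{m-k-1}$ over all transitions in the word, which is what the sketch computes. The class name, the unbounded integer coefficients, and the choice to ignore history across word boundaries are simplifying assumptions; the chip uses 5-bit coefficient codes.

```python
LAMBDA = 64  # bits per update word, as in the proposed design

def sign_product(a_bit: int, e_bit: int) -> int:
    """+1 if the bits agree, -1 otherwise (XOR form of the +/-1 product)."""
    return 1 - 2 * (a_bit ^ e_bit)

class EdgeAdapter:
    def __init__(self, mu: int = 1):
        self.mu = mu
        self.G = 0        # DT tap gain, eq. (8)
        self.B = 0        # IIR gain, eq. (9)
        self.tau = 0      # IIR time-constant code, eq. (10)
        self.words_seen = 0

    def zeta(self, data, edges, k):
        """Eqs. (6)-(7): accumulate e_{m-1} * a_{m-k-1} over transitions in one word."""
        total = 0
        for m in range(k + 1, len(data)):
            if data[m - 1] != data[m]:
                total += sign_product(data[m - k - 1], edges[m - 1])
        return self.mu * total

    def update(self, data, edges):
        """Apply one coefficient update from a word of data and edge samples."""
        self.words_seen += 1
        self.G += self.zeta(data, edges, 1)
        self.B += self.zeta(data, edges, 2)
        if self.words_seen % 3 == 0:     # chi: tau is updated every 3 * LAMBDA bits
            self.tau += self.zeta(data, edges, 3) + self.zeta(data, edges, 4)
        return self.G, self.B, self.tau

adapter = EdgeAdapter()
word = [1, 1, 1, 1, 1, 0]      # short word for illustration; the chip uses 64 bits
edge = [0, 0, 0, 0, 1, 0]
for _ in range(3):
    print(adapter.update(word, edge))
```

Updating the time constant only on every third word mirrors the text's approach of letting the IIR gain track $h_{2.5}$ between time-constant adjustments.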

Fig. 11 shows a channel pulse response highlighting which ISI terms are used to direct the adaptation of each DFE



|   | Block                                          | Area (µm²) |
|---|------------------------------------------------|------------|
| A | DFE<br>(Data + Edge Latches + 2:1 Mux)         | 11,900     |
| B | Demux (2:8)                                    | 8,000      |
| C | Digital Logic<br>(Adaptation, CRU, 8:64 demux) | 41,000     |
| D | MPG                                            | 5,700      |
| E | PIs                                            | 13,000     |
| F | Clock Buffers                                  | 5,800      |

Fig. 13. Chip die photo and area breakdown.



Fig. 14. Measurement setup.

coefficient. Using two edge ISI samples  $(h_{3.5} \& h_{4.5})$  allows the IIR time-constant to achieve a better fit to the channel pulse response than in [4].

Fundamentally, to infer complete information about a channel response, spectrally-rich data patterns are required. Other edge-based adaptation algorithms wait for specific patterns to arrive before updating the equalizer [4]. By contrast, this algorithm updates the equalizer upon receiving any 64-bit demultiplexed word containing any 10 (or more) different 6-bit sequences $(a_{-5}, a_{-4}, \ldots a_0)$ having transitions $(a_0 \neq a_{-1})$. This criterion is easy to implement in digital logic, and prevents instability of the adaptation algorithm in the presence of patterns with insufficient spectral diversity. Yet the criterion also provides much faster convergence than previous approaches that await a specific pattern [4]. Section III provides measured results for input patterns that do not meet this spectral diversity requirement and shows that the DFE coefficients diverge without this feature enabled.
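The spectral-diversity criterion above can be sketched as a simple count of distinct 6-bit windows that end in a transition. The Python below is an illustrative software model, not the synthesized logic; the function names are hypothetical, and windows are taken only within a single 64-bit word.

```python
def distinct_transition_patterns(word, n=6):
    """Count distinct n-bit windows (a_{-n+1}..a_0) whose last two bits differ."""
    seen = set()
    for i in range(n - 1, len(word)):
        window = tuple(word[i - n + 1 : i + 1])
        if window[-1] != window[-2]:      # transition a_0 != a_{-1}
            seen.add(window)
    return len(seen)

def adaptation_enabled(word, min_patterns=10):
    """Allow a coefficient update only for sufficiently diverse words."""
    return distinct_transition_patterns(word) >= min_patterns

def repeat_to_64(pattern: str):
    """Build a 64-bit word by repeating a short bit string."""
    bits = [int(ch) for ch in pattern]
    return (bits * (64 // len(bits) + 1))[:64]

for pattern in ("10", "0000001100111111"):
    word = repeat_to_64(pattern)
    print(pattern, distinct_transition_patterns(word), adaptation_enabled(word))
```

Short repeating patterns such as these produce only a handful of distinct transition windows, so the update is withheld and the coefficients hold their previous values.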

## III. MEASUREMENT RESULTS

The 28 nm FD-SOI chip die photo along with an area breakdown are shown in Fig. 13. The measurement setup is shown

![](_page_6_Figure_10.jpeg)

Fig. 15. Half-rate 8 Gb/s re-timed output eye.

in Fig. 14. An Agilent N4951B Pattern Generator is used to provide PRBS and programmable data patterns to the chip. A Centellax TG1B1-A BERT unit is used to measure BER of the half-rate output data transmitted off-chip. An Agilent N4960A clock synthesizer provides a half-rate, 8 GHz, clock to the DUT. The clock synthesizer also provides another 8 GHz clock for the BERT. For jitter tolerance measurements, either the jittered or divided clock is used for the BERT as shown in Fig. 14. An Agilent DSA-X 91604A Digital Signal Analyzer

![](_page_7_Figure_1.jpeg)

Fig. 16. Measured INL/DNL plots vs. PI code.

![](_page_7_Figure_3.jpeg)

Fig. 17. Measured jitter tolerance for 0 ppm, 100 ppm and 150 ppm frequency offset and transmit swings of 0.8 Vpp-diff and 2 Vpp-diff.

![](_page_7_Figure_5.jpeg)

Fig. 18. Measured phase code vs. time for 0 ppm and 100 ppm frequency error.

is used to capture the adaptation curves and phase code vs. time from the chip. Fig. 15 shows an 8 Gb/s half-rate re-timed output of the chip, which has 2.55 ps RMS jitter.

The PI DNL and INL curves are shown in Fig. 16. As explained in Section II-C, there is a significant nonlinearity at the quadrant boundaries between PIs, at which points two PIs are enabled at the same time. Turning off the unused PIs saves 5 mW, leading to a total phase rotator power of 41.9 mW (20.2 mW for the MPG and 21.7 mW for the PIs) at 16 Gb/s.

Fig. 17 shows measured jitter tolerance with a PRBS7 input at 16 Gb/s. The total setup loss introduced is 2.7 dB at 8 GHz. The setup loss includes the loss of the characterization PCB

![](_page_7_Figure_10.jpeg)

![](_page_7_Figure_11.jpeg)

Fig. 19. Measured insertion loss for channels including setup losses.

which consists of a 1" co-planar waveguide trace on Rogers RO 4003 material. The setup loss also includes the losses of the QFN package, mainly the pad capacitance and the bondwire inductance. In Fig. 17, measurements are shown for both mesochronous and plesiochronous (100–150 ppm frequency error) half-rate receiver input clocks. Both show similar low-frequency jitter tolerance, demonstrating proper phase rotation as plotted in Fig. 18. The jitter tolerance is provided for two different input amplitudes, 2 Vpp-diff and 0.8 Vpp-diff, and shows only a slight degradation at the lower input amplitude. The clock recovery and adaptation came up without any special settings. The edge-based adaptation algorithm helps the clock recovery by improving the zero crossing distribution. As the DFE starts to cancel ISI, the clocking moves towards the zero crossing location, which in turn helps the coefficients cancel even more ISI. This process helps the clock recovery loop to lock even for high loss channels.

The measured channel insertion losses are shown in Fig. 19. Four different channels were used for the characterization of the DFE ranging from 15.7 dB to 30 dB of attenuation at half the bit-rate.

The adaptation curves for channels 1 and 2 with a PRBS7 input are shown in Fig. 20(a) and (b). Increasing $\tau$ corresponds to increasing the IIR filter time constant. All the coefficients converge within 80,000 UI, over an order of magnitude faster than in [4], after which the BER is below $10^{-12}$.
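The 80,000 UI figure can be related to wall-clock time and to the number of coefficient updates with a few lines of Python. The constants come from the text (16 Gb/s, $\Lambda = 64$, time-constant updates every $3 \times \Lambda$ bits); the variable names are illustrative.

```python
BIT_RATE = 16e9          # bits per second
LAMBDA = 64              # bits per coefficient update
CONVERGENCE_UI = 80_000  # measured convergence bound from the text

ui_seconds = 1.0 / BIT_RATE
convergence_seconds = CONVERGENCE_UI * ui_seconds
gain_updates = CONVERGENCE_UI // LAMBDA        # updates of G and B
tau_updates = CONVERGENCE_UI // (3 * LAMBDA)   # updates of the time constant

print(f"UI = {ui_seconds * 1e12:.1f} ps")
print(f"convergence time = {convergence_seconds * 1e6:.1f} us")
print(f"gain updates = {gain_updates}, time-constant updates = {tau_updates}")
```

At 16 Gb/s, 80,000 UI corresponds to the 5 $\mu$s convergence time quoted in the title and abstract.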

![](_page_8_Figure_1.jpeg)

Fig. 20. (a)/(b) Measured channel 1/2 coefficient adaptation with PRBS7 input.

![](_page_8_Figure_3.jpeg)

Fig. 21. (a)/(b) Measured channel 1/2 coefficient adaptation with 5  $\mu$ s intervals of repeating patterns (0000001100111111, 10101010101010, 111111000000) followed by PRBS7.

![](_page_8_Figure_5.jpeg)

Fig. 22. Measured adaptation curves for channel 1 with the (a) STM64 and (b) SSPS64 input patterns.

Fig. 21 shows measured adaptation curves for channels 1 and 2 when repeating patterns are inserted. It is evident that the equalizer coefficients are not updated when the repeating patterns are present. With this feature deactivated, the coefficients diverge in Fig. 21(a) and (b) and the BER increases when the repeating patterns arise.

Fig. 22(a) and (b) show the measured adaptation curves for channel 1 with the STM64 and SSPS64 patterns [18], respectively. There is some variation in the coefficients from their adapted values once the repeating-pattern protection feature is disabled. The variations will increase as the duration of the repeating patterns increases.

In all edge-based adaptation algorithms [4], [10], [19], removing the edge ISI leads to the best horizontal eye opening. However, even though the edge ISI (i.e.,  $h_{1.5}$ ,  $h_{2.5}$ , etc.) is cancelled, the main cursor tap  $h_1$  may not be perfectly cancelled.

![](_page_9_Figure_1.jpeg)

Fig. 23. (a)/(b) Measured adaptation curves for channel 3/4 with the DT tap fixed.

![](_page_9_Figure_3.jpeg)

Fig. 24. Measured bathtub curves for all 4 channels for two different transmit swings.

In channels 3 and 4, this led to a scenario where the vertical eye opening is not sufficient for the latches to operate at the required speed, increasing the BER. To show that this is in fact the problem, in Fig. 23 the discrete-tap gain was fixed and the IIR coefficients were adapted. The discrete-tap value is chosen somewhat larger than the value to which it adapts under the edge-based algorithm so that there is enough cancellation of  $h_1$  to allow for vertical eye opening larger than the latch sensitivity. This correction could be incorporated into appropriate on-chip adaptation logic, or preamplifiers could be introduced preceding the latches to improve their sensitivity and obviate this problem.

The bathtub curves for all 4 channels are shown in Fig. 24 for both a 2 Vpp-diff and a 0.8 Vpp-diff transmit swing with the coefficients frozen after they have adapted. All the bathtub curves are closed if the DFE is disabled.

To characterize the clock recovery with the DFE, jitter tolerance was measured with channel 1. Fig. 25 shows the measured jitter tolerance for channel 1 with 0 ppm, 100 ppm, and 150 ppm frequency offset for both 2 Vpp-diff and 0.8 Vpp-diff transmit swing amplitudes. The reduction in jitter tolerance at the lower input amplitude with a frequency offset is due to the slight clock amplitude variations in the phase interpolator's output at different phase codes. At higher input amplitudes, the latches resolve more quickly and are less affected by the variations in the clock amplitude. At lower input amplitudes, the latches take longer to resolve; combined with the clock amplitude variations, the increased latch delay affects the DT DFE timing closure, which degrades the BER of the receiver.

![](_page_9_Figure_8.jpeg)

Fig. 25. Measured jitter tolerance with channel 1 for 0 ppm, 100 ppm, and 150 ppm frequency offset and transmit swings of 0.8 Vpp-diff and 2 Vpp-diff.
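The frequency offsets used in the plesiochronous measurements translate directly into how quickly the recovered clock phase must rotate. The sketch below is a back-of-the-envelope calculation; the function names `ui_per_slip` and `drift_ui` are assumptions introduced here for illustration.

```python
def ui_per_slip(offset_ppm):
    """Number of UI over which the phase drifts by one full UI."""
    return 1.0 / (offset_ppm * 1e-6)


def drift_ui(offset_ppm, n_ui):
    """Accumulated phase drift, in UI, after n_ui unit intervals."""
    return offset_ppm * 1e-6 * n_ui


for ppm in (100, 150):
    print(f"{ppm} ppm: one UI of drift every {ui_per_slip(ppm):.0f} UI, "
          f"{drift_ui(ppm, 80_000):.0f} UI of drift over 80,000 UI")
```

At 150 ppm the phase rotator has to sweep a full UI roughly every 6,667 UI, so it cycles through its phase codes many times within an 80,000 UI adaptation run.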

Fig. 26 shows the power breakdown for the chip at 16 Gb/s. The adaptation engine, CRU and 8:64 demux consume 24.3 mW and occupy an area of 41,000  $\mu$ m<sup>2</sup>.

|                                    | [3]                 | [4]                        | [5]                 | [6]                 | [7]                                                    | This work                 |
|------------------------------------|---------------------|----------------------------|---------------------|---------------------|--------------------------------------------------------|---------------------------|
| Data Rate (Gb/s)                   | 10                  | 6                          | 10                  | 10                  | 5                                                      | 16                        |
| Architecture                       | 1 IIR + 1<br>DT DFE | 1 IIR DFE                  | 2 IIR<br>DFE        | 2 IIR + 1<br>DT DFE | 1 IIR + 1 DT DFE +<br>Pass. EQ                         | 1 IIR + 1 DT<br>DFE       |
| Loss @ Half Bitrate<br>(dB)        | 27                  | 32.7                       | 35                  | 24                  | 15                                                     | 28                        |
| Technology                         | 65 nm               | 90 nm                      | 65 nm               | 28 nm (LP)          | 65 nm (LP)                                             | 28 nm FDSOI               |
| Supply (V)                         | 1                   | -                          | 1                   | 1                   | 1.2                                                    | 1                         |
| DFE Power (mW)<br>[Data path only] | 3.5                 | 4                          | 9.9                 | 4.1                 | 2.3                                                    | 15.8                      |
| mW / Gbps                          | 0.35                | 0.67                       | 0.99                | 0.41                | 0.46                                                   | 0.99                      |
| Area (µm <sup>2</sup> )            | 17,250              | 89,000                     | 30,400 <sup>a</sup> | 8,760 <sup>a</sup>  | 23,321                                                 | <b>8,100</b> <sup>a</sup> |
| Adaptive<br>Equalization           | NO                  | YES                        | NO                  | NO                  | YES                                                    | YES                       |
| Adaptation Time                    | -                   | 250 μs<br>(1.5 million UI) | -                   | -                   | Off-chip, 1 hour 25 min<br>( $2.55 \times 10^{13}$ UI) | 5 μs<br>(80,000 UI)       |

TABLE I Performance Summary Comparison

<sup>a</sup>Area of DFE core only.

Power Breakdown (mW)

![](_page_10_Figure_5.jpeg)

DFE Data Path Includes: Data Latches + 2:1 Mux. DFE Edge Path Includes: Edge Latches. Digital Logic Includes: Adaptation, CRU, 2× 8:64 demux.

Fig. 26. Power breakdown.

Table I shows a comparison with previous work. The adaptation algorithm in this work is at least  $18 \times$  faster than the other IIR adaptation implementations. The DFE data path consumes 0.99 mW/Gbps operating at 16 Gb/s while equalizing 28 dB.
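The normalized power row of Table I follows from dividing the DFE data-path power by the data rate. The short sketch below reproduces that row; the dictionary `TABLE_I` and the function `mw_per_gbps` are names introduced here for illustration, and the numbers are copied from the table.

```python
# Data rate (Gb/s) and DFE data-path power (mW), copied from Table I.
TABLE_I = {
    "[3]": (10, 3.5),
    "[4]": (6, 4.0),
    "[5]": (10, 9.9),
    "[6]": (10, 4.1),
    "[7]": (5, 2.3),
    "This work": (16, 15.8),
}


def mw_per_gbps(rate_gbps, power_mw):
    """Power normalized to data rate; 1 mW/Gbps equals 1 pJ/bit."""
    return power_mw / rate_gbps


efficiency = {name: mw_per_gbps(*row) for name, row in TABLE_I.items()}

for name, value in efficiency.items():
    print(f"{name:>10}: {value:.2f} mW/Gbps")
```

Because 1 mW/Gbps is the same quantity as 1 pJ/bit, the row can also be read as energy per bit for the data path alone.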

# IV. CONCLUSION

A 16 Gb/s 1 IIR + 1 DT DFE was demonstrated in 28 nm FD-SOI CMOS with integrated clock recovery and adaptation. The novel edge-based adaptation algorithm reuses the high-speed circuitry and signals required for clock recovery, is robust in the presence of ill-conditioned data statistics, and yet converges over an order of magnitude faster than previous techniques. The energy efficiency of the DFE, PIs, MPG, two 8:64 demux paths, clock buffers as well as the digital logic for the clock recovery and adaptation is 8.82 pJ/bit.
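As a rough cross-check, the quoted energy efficiency implies the combined power of the listed blocks at 16 Gb/s, since energy per bit multiplied by bit rate gives power. The symbols $P_{\text{total}}$, $E_{\text{bit}}$, and $f_{\text{bit}}$ are introduced here for illustration only:

$$
P_{\text{total}} = E_{\text{bit}} \cdot f_{\text{bit}} = 8.82\ \text{pJ/bit} \times 16\ \text{Gb/s} \approx 141\ \text{mW}
$$

Under this reading, the 15.8 mW DFE data path listed in Table I accounts for roughly 11% of that total.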

#### ACKNOWLEDGMENT

The authors would like to thank Huawei and Semtech for financial support, STMicroelectronics for IC fabrication, and CMC Microsystems for test equipment.

## REFERENCES

- [1] N. Kocaman *et al.*, "A 3.8 mW/Gbps quad-channel 8.5–13 Gbps serial link with a 5 tap DFE and a 4 tap transmit FFE in 28 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 51, no. 4, pp. 881–892, Apr. 2016.
- [2] G. Shahidi, "Evolution of CMOS technology at 32 nm and beyond," in *Proc. IEEE Custom Integr. Circuits Conf. (CICC)*, Sep. 2007, pp. 413–416.
- [3] B. Kim, Y. Liu, T. Dickson, J. Bulzacchelli, and D. Friedman, "A 10-Gb/s compact low-power serial I/O with DFE-IIR equalization in 65-nm CMOS," *IEEE J. Solid-State Circuits*, vol. 44, no. 12, pp. 3526–3538, Dec. 2009.
- [4] Y.-C. Huang and S.-I. Liu, "A 6 Gb/s receiver with 32.7 dB adaptive DFE-IIR equalization," in *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC)*, Feb. 2011, pp. 356–358.
- [5] O. Elhadidy and S. Palermo, "A 10 Gb/s 2-IIR-tap DFE receiver with 35 dB loss compensation in 65-nm CMOS," in *Proc. Symp. VLSI Circuits (VLSIC)*, Jun. 2013, pp. C272–C273.
- [6] S. Shahramian and A. C. Carusone, "A 0.41 pJ/Bit 10 Gb/s hybrid 2 IIR and 1 discrete-time DFE tap in 28 nm-LP CMOS," *IEEE J. Solid-State Circuits*, vol. 50, no. 7, pp. 1722–1735, Jul. 2015.
- [7] S. Son *et al.*, "A 2.3-mW, 5-Gb/s low-power decision-feedback equalizer receiver front-end and its two-step, minimum bit-error-rate adaptation algorithm," *IEEE J. Solid-State Circuits*, vol. 48, no. 11, pp. 2693–2704, Nov. 2013.
- [8] H.-J. Chi *et al.*, "A single-loop SS-LMS algorithm with single-ended integrating DFE receiver for multi-drop DRAM interface," *IEEE J. Solid-State Circuits*, vol. 46, no. 9, pp. 2053–2063, Sep. 2011.
- [9] Z.-H. Hong, Y.-C. Liu, and W.-Z. Chen, "A 3.12 pJ/bit, 19–27 Gbps receiver with 2-tap DFE embedded clock and data recovery," *IEEE J. Solid-State Circuits*, vol. 50, no. 11, pp. 2625–2634, Nov. 2015.
- [10] R. Payne *et al.*, "A 6.25-Gb/s binary transceiver in 0.13-μm CMOS for serial data transmission across high loss legacy backplane channels," *IEEE J. Solid-State Circuits*, vol. 40, no. 12, pp. 2646–2657, Dec. 2005.
- [11] D. Schinkel, E. Mensink, E. Klumperink, E. van Tuijl, and B. Nauta, "A double-tail latch-type voltage sense amplifier with 18 ps setup+hold time," in *IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC)*, Feb. 2007, pp. 314–605.

- [12] O. Elhadidy, A. Roshan-Zamir, H.-W. Yang, and S. Palermo, "A 32 Gb/s 0.55 mW/Gbps PAM4 1-FIR 2-IIR tap DFE receiver in 65-nm CMOS," in *Proc. Symp. VLSI Circuits (VLSI Circuits)*, Jun. 2015, pp. C224–C225.
- [13] J. Lee and M. Liu, "A 20-Gb/s burst-mode clock and data recovery circuit using injection-locking technique," *IEEE J. Solid-State Circuits*, vol. 43, no. 3, pp. 619–630, Mar. 2008.
- [14] M. Raj and A. Emami, "A wideband injection-locking scheme and quadrature phase generation in 65-nm CMOS," *IEEE Trans. Microw. Theory Techn.*, vol. 62, no. 4, pp. 763–772, Apr. 2014.
- [15] A. Carusone, "An equalizer adaptation algorithm to reduce jitter in binary receivers," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 53, no. 9, pp. 807–811, Sep. 2006.
- [16] C. Thakkar, L. Kong, K. Jung, A. Frappe, and E. Alon, "A 10 Gb/s 45 mW adaptive 60 GHz baseband in 65 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 47, no. 4, pp. 952–968, Apr. 2012.
- [17] S. Shahramian, H. Yasotharan, and A. C. Carusone, "Decision feedback equalizer architectures with multiple continuous-time infinite impulse response filters," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 59, no. 6, pp. 326–330, Jun. 2012.
- [18] P. Anslow, "CEI short stress patterns white paper," Optical Internetworking Forum, 2007.
- [19] Y. Hidaka, W. Gai, T. Horie, J. H. Jiang, Y. Koyanagi, and H. Osone, "A 4-channel 1.25–10.3 Gb/s backplane transceiver macro with 35 dB equalizer and sign-based zero-forcing adaptive control," *IEEE J. Solid-State Circuits*, vol. 44, no. 12, pp. 3547–3559, Dec. 2009.

![](_page_11_Picture_10.jpeg)

**Shayan Shahramian** received the M.A.Sc. and Ph.D. degrees from the Department of Electrical and Computer Engineering at the University of Toronto, Canada, in 2011 and 2016, respectively. His focus is on high-speed chip-to-chip communication for wireline applications. He is the recipient of the NSERC Industrial Postgraduate scholarship in collaboration with Semtech Corporation (Gennum Products). He is the recipient of the best young scientist paper award at ESSCIRC 2014 and received the Analog Devices outstanding designer award for 2014. He has won five teaching assistant awards at the University of Toronto. He joined Huawei Canada in January 2016.

![](_page_11_Picture_13.jpeg)

**Behzad Dehlaghi** (S'08) received the B.Sc. degree from University of Tehran, Tehran, Iran, in 2009, and the M.Sc. degree from University of Calgary, AB, Canada, in 2012, both in electrical engineering. Since 2013, he has been working towards the Ph.D. degree in electrical engineering at the University of Toronto, ON, Canada.

During his M.Sc. studies, he worked on on-chip circuits for jitter measurement and signal capture with sub-picosecond resolution. From February 2012 to December 2012, he worked at Semtech Canada Corporation as an Analog Design Engineer where he was involved in the development and modeling of high-speed transceivers. His research interests include on-chip measurement circuits and high-speed chip-to-chip communication.

![](_page_11_Picture_17.jpeg)

**Anthony Chan Carusone** (S'96–M'02–SM'08) received the Ph.D. from the University of Toronto in 2002 and has since been with the Department of Electrical and Computer Engineering at the University of Toronto where he is currently a Professor. He is also an occasional consultant to industry in the areas of integrated circuit design, clocking, and digital communication.

Prof. Chan Carusone has co-authored over 90 conference and journal papers on integrated circuit design, including the Best Student Papers at the 2007, 2008 and 2011 Custom Integrated Circuits Conferences, the Best Invited Paper at the 2010 Custom Integrated Circuits Conference, the Best Paper at the 2005 Compound Semiconductor Integrated Circuits Symposium, and the Best Young Scientist Paper at the 2014 European Solid-State Circuits Conference. He authored, along with David Johns and Ken Martin, the 2nd edition of the classic textbook *Analog Integrated Circuit Design*. He was Editor-in-Chief of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS in 2009, and has served on the technical program committee for the Custom Integrated Circuits Conference and the VLSI Circuits Symposium. He currently serves on the editorial board of the IEEE JOURNAL OF SOLID-STATE CIRCUITS, as a member of the Technical Program Committee of the International Solid-State Circuits Society.