A 21-GS/s Single-Bit Second-Order ΔΣ Modulator for FPGAs

Haolin Li, Laurens Breyne, Joris Van Kerrebrouck, Michiel Verplaets, Chia-Yi Wu, Piet Demeester, Fellow, IEEE, and Guy Torfs, Member, IEEE

Abstract—A new high-speed delta-sigma modulator (DSM) topology is proposed by cascading a bit reduction process with a multi-stage noise shaping MASH-1-1 DSM. This process converts the two-bit output sequence of the MASH-1-1 DSM to a single-bit sequence, only slightly compromising the DSM noise-shaping performance. Furthermore, the high clock frequency requirements are significantly relaxed by using parallel processing. This DSM topology facilitates the design of, e.g., wideband software defined radio (SDR) transmitters and delta-sigma radio-over-fiber transmitters. Experimental results of the FPGA implementation show that the proposed low-pass DSM can operate at 21 GS/s, providing 520 MHz baseband bandwidth with 42.76 dB signal-to-noise-and-distortion ratio (SNDR) or 1.1 GHz bandwidth with 32.04 dB SNDR (based on continuous wave measurements). An all-digital transmitter based on this topology can generate 218.75 MBd 256-QAM over 200 m OM4 multimode fiber in real-time, with 7-GS/s sampling rate and an error vector magnitude below 1.89%.

Index Terms—Delta-sigma modulator, multi-stage noise shaping (MASH), software defined radio, quantization noise, FPGA.

I. INTRODUCTION

DELTA-SIGMA modulators (DSMs) have attracted significant interest since they provide the possibility for all-digital transmitters and are compatible with advanced CMOS technologies. To achieve a high signal-to-noise-and-distortion ratio (SNDR), high-order DSMs and high sampling rates are pursued [1]. A single-bit modulator can be beneficial as it avoids a high-speed multi-bit digital-to-analog converter (DAC) and can directly drive an energy-efficient switching-mode power amplifier (PA) in all-digital transmitters or modulate a nonlinear device, e.g., a vertical-cavity surface-emitting laser (VCSEL), in delta-sigma radio-over-fiber systems [2].

A single-loop high-order DSM with a one-bit quantizer generates a one-bit output stream. However, its sampling rate is usually limited and dependent upon the hardware realization (e.g., ASIC or FPGA), since it involves several high-precision additions in a single clock cycle. As a result, most prior works rely on time-interleaved DSMs (TI DSMs) for the implementation of a high-speed DSM [3]–[6]. However, the loop-unrolled architectures in TI DSMs are eventually limited by the critical path (CP). In [6], a bit separation architecture was proposed to increase the sampling rate, where all possible internal states of the high-order bit calculation (where the feedback path exists) were stored in memory. In [7] and [8], a high-speed parallel DSM architecture using multicore DSMs was proposed, where samples were sliced into multiple blocks and each block was processed by one DSM core.

Another approach is to use a multi-stage noise shaping (MASH) DSM topology that cascades multiple low-order DSMs, and hence can easily reduce the required clock frequency thanks to the efficient parallelization of these low-order DSMs [9] while avoiding instability issues [10]. However, a high-order MASH topology increases the number of output bits by summing up multiple low-order DSMs, requiring a high-speed multi-bit DAC followed by a linear stage (e.g., a linear PA). As a consequence, compared to single-bit transmitters, the overall design complexity is significantly increased.

Our work employs a two-stage MASH-1-1 DSM topology as it has the possibility to dramatically boost the sampling rate by using the parallelization technique in [9]. To benefit from the immunity to nonlinear distortions of a bi-level signal and to make the DSM compatible with the high-speed serial interface on FPGAs, we introduce a bit reduction process cascaded with the MASH-1-1 DSM. Moreover, we have efficiently parallelized this process without any critical path limiting the degree of parallelization.

In Section II, the bit reduction process and a feedback loop unrolling technique are proposed. The hardware implementation is elaborated in Section III. The simulation and experimental results are discussed in Section IV. Finally, conclusions are drawn in Section V.

II. DELTA-SIGMA MODULATOR AND BIT REDUCTION PROCESS

A. Delta-Sigma Modulator and Bit Reduction Process

A regular second-order low-pass DSM with a one-bit quantizer is presented in Fig. 1(a). A MASH-1-1 DSM is presented in Fig. 1(b), comprising two identical first-order low-pass DSMs (each DSM generates a one-bit output 0 or 1) [10]. The outputs \( v[n] \) (one-bit output of Fig. 1(a)) and \( v_{\text{out}}[n] \) (two-bit output of Fig. 1(b)) contain the same information as their signal transfer functions are both all-pass filters and their noise transfer functions are both second-order high-pass filters (\( (1 - z^{-1})^2 \)). The two-bit \( v_{\text{out}}[n] \) in Fig. 1(b) is given by

\[
v_{\text{out}}[n] = v_1[n] + v_2[n] - v_2[n-1] \quad (1)
\]

where both \( v_1[n] \) and \( v_2[n] \) can be either 0 or 1. This creates four discrete levels: \( \{-1, 0, 1, 2\} \). This output can again be reduced to a one-bit stream by applying the signal \( v_{\text{out}}[n] \) to the second-order DSM shown in Fig. 1(a). This ideal bit reduction process cascades a second-order DSM (with a one-bit quantizer and a two-bit input width) with the MASH-1-1 DSM. Identical spectra are observed in MATLAB simulations between the second-order DSM and the proposed DSM. However, the achievable sampling rate of the ideal bit reduction process is still limited by the critical path, even though the reduced input width (from \( N \) bits to two bits) can ease the calculations.

Fig. 1. Block diagrams of (a) a regular single-loop second-order low-pass DSM with a one-bit quantizer and (b) the proposed DSM topology with a MASH-1-1 DSM and the bit reduction process in cascade.
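As a rough behavioral illustration of (1), the sketch below models a MASH-1-1 DSM with two cascaded accumulators whose carry bits serve as the one-bit stage outputs. The accumulator-with-carry form, the function name `mash11`, and the `nbits` input width are assumptions made for illustration; the internal arithmetic of Fig. 1(b) is not specified in the text.

```python
def mash11(samples, nbits=8):
    """Behavioral MASH-1-1 model (assumed accumulator-with-carry form).

    samples: integers in [0, 2**nbits - 1]; a value x represents x / 2**nbits.
    Returns the two-bit output v_out[n] = v1[n] + v2[n] - v2[n-1].
    """
    modulus = 1 << nbits
    acc1 = acc2 = 0      # residues of the two first-order stages
    v2_prev = 0          # v2[n-1]
    v_out = []
    for x in samples:
        v1, acc1 = divmod(acc1 + x, modulus)      # stage 1: carry bit and residue
        v2, acc2 = divmod(acc2 + acc1, modulus)   # stage 2 is fed the stage-1 residue
        v_out.append(v1 + v2 - v2_prev)
        v2_prev = v2
    return v_out


if __name__ == "__main__":
    out = mash11([64] * 1024)       # constant input of 64/256 = 0.25
    print(sorted(set(out)))         # levels that actually occur
    print(sum(out) / len(out))      # close to 0.25
```

Because both carry bits are 0 or 1, every output sample falls in the four-level set given above, and the running average of the output tracks the input.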

B. Feedback Loop Unrolling for the Bit Reduction Process

We propose a feedback loop unrolling technique for the ideal bit reduction process. Note that MASH-1-1 output samples of 0 or 1 that occur before the first -1 or 2 do not generate a feedback error in the ideal bit reduction process (no quantization error is made by the one-bit quantizer). Therefore, we can assume that the MASH-1-1 output is -1 or 2 at time index \( n = 0 \). If, for example, the output sample -1 is applied to the ideal bit reduction process in Fig. 1(a), the integration signal \( s[0] \) becomes -1 and the quantizer output \( v[0] \) becomes 0, yielding a feedback error \( w[0] = -1 \). Due to the two delay blocks in the feedback loop in Fig. 1(a), the two subsequent samples of \( s[n] \) are also influenced by this feedback error \( w[0] \), being \( s[1] = u[1] + 2w[0] \) and \( s[2] = u[2] + 2w[1] - w[0] \). Considering only the contribution of \( w[0] \), these bit reduction operations are equivalent to adding a vector \( V_{\text{equiv}} \) to the original samples, being

\[
\begin{bmatrix}
-1 & u[1] & u[2]
\end{bmatrix}
\rightarrow
\begin{bmatrix}
-1 & u[1] & u[2]
\end{bmatrix}
+ \begin{bmatrix}
1 & -2 & 1
\end{bmatrix}
\quad (2)
\]

A similar vector \([-1 \ 2 \ -1]\) is obtained when 2 is applied. Note that these operations can yield a multi-bit (even more than two-bit) \( s[n] \) for the two subsequent samples.
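The equivalent vectors follow from the error-feedback recursion implied by the relations above. The sketch below assumes \( s[n] = u[n] + 2w[n-1] - w[n-2] \) with \( w[n] = s[n] - v[n] \) and a quantizer threshold at zero; the function names are hypothetical and only serve this illustration.

```python
def quantize(s):
    """One-bit quantizer with threshold at zero (integer input assumed)."""
    return 1 if s > 0 else 0


def equivalent_vector(s0):
    """Correction applied to [s[0], s[1], s[2]] when sample s0 is quantized."""
    w0 = s0 - quantize(s0)          # feedback error w[0]
    return [-w0, 2 * w0, -w0]       # s[0] -> v[0], s[1] += 2*w[0], s[2] -= w[0]


def ideal_bit_reduction(u):
    """Second-order error-feedback DSM: s[n] = u[n] + 2*w[n-1] - w[n-2]."""
    w1 = w2 = 0                     # w[n-1] and w[n-2]
    bits = []
    for sample in u:
        s = sample + 2 * w1 - w2
        v = quantize(s)
        bits.append(v)
        w2, w1 = w1, s - v
    return bits


if __name__ == "__main__":
    print(equivalent_vector(-1))                # [1, -2, 1]
    print(equivalent_vector(2))                 # [-1, 2, -1]
    print(ideal_bit_reduction([-1, 1, 2, 0]))   # [0, 0, 1, 1]
```

In the last call, the -1 at index 0 pushes \( s[1] \) down to -1, and the corrections of both samples bring \( s[2] \) back to 1, so the output stays one-bit without any residual error.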

The possible values of \( s[n] \) can be mathematically obtained by enumerating the possible values of \( v_{\text{out}}[n] \) using (1) and sequentially performing the ideal bit reduction process on these values. If \( v_{\text{out}}[n] = -1 \), the only possible combination of \( \{v_1[n], v_2[n], v_2[n-1]\} \) is \( \{0, 0, 1\} \). The possible values of the subsequent samples \( v_{\text{out}}[n+1] \) and \( v_{\text{out}}[n+2] \) are enumerated in Fig. 2(a). By performing the ideal bit reduction process on \( v_{\text{out}}[n] \) with the equivalent vector \([1 \ -2 \ 1]\) from (2), we obtain the quantizer output \( v[n] = 0 \) and the new integration signals \( s[n+1] \) and \( s[n+2] \) as shown in Fig. 2(b). Then we compare \( s[n+1] \) with zero (quantizer threshold), and obtain the corresponding quantizer output \( v[n+1] \) and the new \( s[n+2] \) in Fig. 2(c). This process continues for each newly obtained integration signal \( s[n] \). When \( v_{\text{out}}[n] = 2 \), a similar expansion tree structure is shown in Fig. 3.

Note that most paths in Fig. 2 and Fig. 3 can be covered by \( s[n] \in \{-2, -1, 0, 1, 2, 3\} \) (marked in gray) except two extreme paths, being path 1 and path 16 (in red dashed line). In these two paths, both \( v_1[n] \) and \( v_2[n] \) are binary sequences of 000... and 111..., respectively. These correspond to input signals that approach the limits of the MASH-1-1 DSM input dynamic range. It should be pointed out that some paths do not appear in the actual bit reduction process, as some combinations of \( v_{\text{out}}[n] \) are not generated by the MASH-1-1 DSM.

Two methods are proposed to unroll the feedback loop and to explore how many paths should be included without performance degradation. In method I, we sequentially apply the equivalent vector \([1 \ -2 \ 1]\) or \([-1 \ 2 \ -1]\) to the three samples when \( s[n] \) is either -1 or 2, respectively. In method II, we sequentially compare each sample \( s[n] \) with -2, -1, 2 and 3. The equivalent vectors for method II are summarized as follows

\[
V_{\text{equiv}} = \begin{cases}
[-2 \ 4 \ -2], & s[n] \geq 3 \\
[-1 \ 2 \ -1], & s[n] = 2 \\
[1 \ -2 \ 1], & s[n] = -1 \\
[2 \ -4 \ 2], & s[n] \leq -2
\end{cases}
\quad (3)
\]

Note that for \( s[n] = 3 \) (or \( s[n] = -2 \)), the equivalent vector is \([-2 \ 4 \ -2]\) (or \([2 \ -4 \ 2]\)). For \( s[n] > 3 \) (or \( s[n] < -2 \)), the quantizer output is forced to be 1 (or 0) and the same feedback error vector \([4 \ -2]\) (or \([-4 \ 2]\)) is added to the subsequent samples. All possible values of \( s[n] \) can be covered by the segmentation in (3). In this way, the integration signal \( s[n] \) in the bit reduction process is made predictable, completely unrolling the feedback loop and easing the parallelization. As will be shown in Section IV, the proposed method II performs almost as well as the ideal bit reduction process.
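A minimal sequential sketch of method II is given below, assuming the samples are corrected in place from left to right and that the saturated cases (\( s[n] \geq 3 \) and \( s[n] \leq -2 \)) use the feedback of \( s[n] = 3 \) and \( s[n] = -2 \), as described above. The function name `method_ii` and the two zero-valued guard samples at the end are illustrative assumptions.

```python
def method_ii(v_out):
    """Unrolled bit reduction using the four equivalent vectors of (3)."""
    x = list(v_out) + [0, 0]        # two guard samples absorb the tail corrections
    for n in range(len(v_out)):
        s = x[n]
        if s >= 3:
            x[n] = 1                # quantizer output forced to 1
            vec = (4, -2)
        elif s == 2:
            x[n] = 1
            vec = (2, -1)
        elif s == -1:
            x[n] = 0
            vec = (-2, 1)
        elif s <= -2:
            x[n] = 0                # quantizer output forced to 0
            vec = (-4, 2)
        else:
            continue                # s is 0 or 1: already one bit, no feedback error
        x[n + 1] += vec[0]          # correction of the next sample
        x[n + 2] += vec[1]          # correction of the sample after that
    return x[:len(v_out)]


if __name__ == "__main__":
    print(method_ii([-1, 1, 2, 0]))    # [0, 0, 1, 1]
    print(method_ii([2, 2, 2, -1]))    # [1, 1, 1, 1], with saturation at n = 1 and n = 2
```

Every branch leaves either 0 or 1 at position `n`, so the result is a one-bit sequence regardless of how large the intermediate values become. The second call shows the saturated case, where the applied feedback is smaller than the ideal process would use.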

III. HARDWARE IMPLEMENTATION

A parallelization degree \( p \) is assumed, implying that in each clock cycle, \( p \) samples are processed in parallel and that the clock frequency of the MASH-1-1 DSM and the bit reduction is lowered from the sampling frequency \( f_s \) to \( f_s/p \). Fig. 4 depicts the parallelized implementation of the proposed bit reduction process (method I). At the input stage, \( p \) samples from the same clock cycle (represented by clock cycle index \( i \)) of the parallelized MASH-1-1 DSM output \( X_{in} \) are defined as a column vector. Note that these multi-bit samples can also be located at the transition between two subsequent vectors (between \( i \)-th and \((i+1)\)-th vectors of \( X_{in} \)). Since each time three samples are corrected by the equivalent vector, the first two samples from the \((i+1)\)-th vector are therefore concatenated in the \( p \)-th and \((p+1)\)-th samples of the vector \( X \) (marked as grey cells) for the bit reduction process.

Then we sequentially compare each sample in \( X \) with -1 and 2, and apply the equivalent vectors to the three consecutive samples. To alleviate the computation during one clock cycle, we split the iterative bit reduction into \( p \) pipelining stages.

Due to the possible multi-bit samples at the transition, the \((i+1)\)-th vector of \( X_{in} \) must not be processed before the \( i \)-th vector has been corrected, because the \( i \)-th vector can overwrite the first two samples of the \((i+1)\)-th vector from the next clock cycle. However, the \( i \)-th vector can only be corrected after \( p \) clock cycles owing to the pipelining stages. This can be resolved by duplicating the pipelined bit reduction stage for all nine possible combinations of multi-bit \( X[k] \) appearing at the transition, since both \( X[k = p-2] \) and \( X[k = p-1] \) \( \in \{-1, 2, \text{others}\} \). Therefore, the first two samples of \( X \) are first corrected by assuming one of the nine combinations to be applied. In Fig. 4, the desired output \( Y \) is selected from the nine possible outcomes: \( Y_0, Y_1, \ldots, Y_8 \).

This implementation is also applicable for method II, where 25 different combinations (both \( X[k = p-2] \) and \( X[k = p-1] \) \( \in \{-2,-1,2,3, \text{others}\} \)) should be taken into account.
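The duplicate-and-select idea can be sketched in software as follows. This is a behavioral model only, not the pipelined structure of Fig. 4: it assumes that the five sample classes of method II map one-to-one onto clipped feedback errors \( w \in \{-2, -1, 0, 1, 2\} \), so the state carried across a block boundary is one of 25 error pairs. All names are hypothetical.

```python
from itertools import product

# All 25 candidate pairs (w[n-1], w[n-2]) that can cross a block boundary.
CANDIDATE_STATES = list(product(range(-2, 3), repeat=2))


def reduce_block(block, w1, w2):
    """Method II on one block, given the last two feedback errors (w1 is the newest)."""
    bits = []
    for u in block:
        s = u + 2 * w1 - w2
        v = 1 if s > 0 else 0
        w2, w1 = w1, max(-2, min(2, s - v))   # clipped feedback error
        bits.append(v)
    return bits, (w1, w2)


def speculative_reduce(samples, p):
    """Process p-sample blocks for every candidate state, then select the valid one."""
    actual = (0, 0)
    out = []
    for i in range(0, len(samples), p):
        block = samples[i:i + p]
        # Each candidate is independent of the previous block's result,
        # so in hardware all of them could be evaluated concurrently.
        candidates = {st: reduce_block(block, *st) for st in CANDIDATE_STATES}
        bits, actual = candidates[actual]     # late selection by the true state
        out.extend(bits)
    return out


if __name__ == "__main__":
    v_out = [-1, 1, 2, 0, 2, -1, 1, 0, 1, 2, -1, 0]
    print(speculative_reduce(v_out, p=4) == reduce_block(v_out, 0, 0)[0])   # True
```

Restricting the clipping range to \( \{-1, 0, 1\} \) would give nine candidate pairs, which matches the count quoted for method I.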

IV. EXPERIMENTAL VERIFICATION

The complete DSM topology (a MASH-1-1 DSM with cascaded bit reduction) was implemented on a Xilinx Virtex UltraScale VCU108 FPGA. The measurement setup is shown in Fig. 5(a). A retiming chip from [11] was used to resample the FPGA serial output, aiming to reduce the effect of jitter, duty-cycle distortion and the FPGA output noise floor. The measurements were performed using an Anritsu vector signal analyzer whose clock was synchronized with the clocks of the FPGA and the retiming chip. The eye diagrams before and after retiming a 21-GS/s stream are shown in Fig. 5(b).

Fig. 6 presents the measured spectra of a single sinusoidal tone at the retiming chip output. A 19 MHz single-tone was processed by the DSM at 7 GS/s in Fig. 6(a). A slope of the second-order DSM (40 dB/dec) was observed for both methods, validating that the single-tone was indeed modulated by a second-order DSM. For method II, a slope reduction from 40 dB/dec to 34 dB/dec has been perceived when increasing the DSM input amplitude from -3.75 dBFS to -2.47 dBFS. Method I can no longer perform second-order DSM operations when the input amplitude exceeds -6.22 dBFS. Fig. 6(b) shows that a 21-GS/s low-pass second-order DSM can be achieved on FPGA. It should be pointed out that the noise floor saturates at low frequencies due to limited driver performance of the retiming chip, as shown in Fig. 6(a) and Fig. 6(b). Simulation results, which are not degraded by noise, jitter and duty-cycle distortion, show the accuracy of the bit-reduction algorithm.

Fig. 7 depicts the simulated and measured in-band SNDR as a function of the DSM input signal amplitude. The SNDR was measured by generating a single-tone input at 19 MHz in the baseband and taking into account all the noise, and the second and the third harmonic distortions (HDs) up to 65 MHz. The simulations comparing the proposed methods to a second-order DSM show that the input dynamic range of method I is compromised by 3.37 dB and the peak SNDR is reduced by 0.87 dB. The input dynamic range of method II is only compromised by 0.90 dB and the peak SNDR reduction is unnoticeable. The measured peak SNDR of 51.41 dB is found at the input amplitude of -6.22 dBFS for method I, and for method II, the measured peak SNDR of 52.3 dB is found at the input amplitude of -4.18 dBFS. These input amplitudes are in good agreement with the simulation. The 14.6 dB SNDR gap is mainly caused by the limited driver performance.

The relationship between the in-band peak SNDR and the baseband signal bandwidth (BW) is also explored with different sampling rates (method II), as shown in Fig. 8. As with the measurements in Fig. 7, the in-band noise, and the second and the third HDs are taken into account. A 21-GS/s DSM provides a large BW of 1.1 GHz with an SNDR of 32.04 dB, implying that a baseband transmitter for the IEEE 802.11ad (“WiGig”) standard (baseband BW > 880 MHz for both I and Q channels) is realizable on FPGA, avoiding the use of high-speed DACs.

An all-digital single-bit RF transmitter based on the proposed DSM topology (method II) was implemented on FPGA following the structure in [12]. In this structure, the carrier frequency is a half of the low-pass DSM sampling frequency owing to the digital upconversion. For the 3.5 GHz band, the sampling rate is 7 GS/s for both I and Q signals. Fig. 9(a) shows the spectrum and the demodulated constellation of a 256-QAM signal (I-channel) processed by a 21-GS/s DSM (p = 64).

DSM (p = 16).

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCSII.2018.2855962, IEEE Transactions on Circuits and Systems II: Express Briefs.

TABLE I

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>BW (MHz)</td>
<td>Single-tone</td>
<td>Single-tone</td>
<td>64-QAM</td>
<td>16-QAM</td>
<td>64-QAM</td>
</tr>
<tr>
<td>SNDR (dB)</td>
<td>200</td>
<td>1100</td>
<td>122</td>
<td>31.4</td>
<td>35</td>
</tr>
<tr>
<td>EVM (%)</td>
<td>26*</td>
<td>39*</td>
<td>31.4</td>
<td>35</td>
<td>–</td>
</tr>
<tr>
<td>Sampling Rate (GS/s)</td>
<td>8</td>
<td>11</td>
<td>3.2</td>
<td>10.24</td>
<td>9.6</td>
</tr>
<tr>
<td>Critical Path Limitation†</td>
<td>Yes</td>
<td>Yes</td>
<td>No</td>
<td>No</td>
<td>No</td>
</tr>
<tr>
<td>Realization/Output Bits</td>
<td>65nm CMOS/3</td>
<td>65nm CMOS/4</td>
<td>FPGA/1</td>
<td>FPGA/1</td>
<td>FPGA/1</td>
</tr>
<tr>
<td>Topology</td>
<td>2nd-order</td>
<td>2nd-order</td>
<td>1st-order</td>
<td>2nd-order</td>
<td>2nd-order</td>
</tr>
</tbody>
</table>

* Signal-to-noise ratio (SNR): measured from DC to input frequency without harmonic distortions.
† This describes if the parallelization degree or the total sampling rate is limited by the critical path in the DSM.

![Fig. 10. Implementation consumption of the proposed bit reduction processes on FPGA as a function of the parallelization degree p.](image)

Different values for $p$ are chosen in the implementation to meet both the required timing constraints and the input data width of the FPGA parallel-to-serial conversion. However, the bit reduction process and its hardware implementation in Fig. 4 are generally not limited to these values. The DSM can run at approximately 437.5 MHz on VCU108 and the sampling rate of 7 GS/s is achieved at relatively low hardware consumption.
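The quoted clock frequency follows from dividing the sampling rate by the parallelization degree. The short check below uses the two operating points mentioned in the text (7 GS/s with p = 16 and 21 GS/s with p = 64); the helper name is hypothetical.

```python
def dsm_clock_mhz(sampling_rate_gsps, p):
    """Clock frequency in MHz when p samples are processed per clock cycle."""
    return sampling_rate_gsps * 1e3 / p


if __name__ == "__main__":
    print(dsm_clock_mhz(7, 16))     # 437.5
    print(dsm_clock_mhz(21, 64))    # 328.125
```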

A. Comparison to State-of-the-Art

Table I gives an overview of relevant publications on high-speed $\Delta\Sigma$ modulators. In [3], the critical path includes two 10-bit back-to-back adders and a NOR-gate that must operate at over 4 GHz. The critical path in [5] is only limited by one adder by decoupling the two back-to-back adders in [3] with a look-ahead adder. However, the adder in the critical path should operate at 5.52 GHz to achieve the 11 GHz sampling frequency. The speed of addition in [3] and [5] is far beyond the FPGA’s capability. Besides, at 1.1 GHz bandwidth, 32.04 dB SNDR of our DSM is comparable with the 39 dB SNR achieved in [5] where a four-bit output format is used. In [7] and [8], samples are deinterleaved into multiple blocks and processed by multicore DSMs in parallel, leading to increased memory usage and latency. In [6], good performance has been achieved using a bit separation architecture where all possible internal states of the high-order bits calculation were stored in memory. It is, however, unclear how much the performance is compromised by reducing the internal states. In this work, the sampling rate of each MASH stage is not directly limited by the critical path in the DSM [9] and the feedback path in the bit reduction process is removed by duplicating the bit reduction process and pipelining the multi-bit reduction.

V. CONCLUSION

In this brief, a bit reduction process and its parallelization technique have been proposed to make the second-order MASH-1-1 DSM compatible with a single-bit transmitter. A low-pass second-order DSM based on the proposed DSM topology achieves a high sampling rate of 21 GS/s. This is the fastest second-order DSM among state-of-the-art FPGA-based DSMs. Even higher sampling rates are achievable as the degree of parallelization and the total sampling rate are not limited by the critical path in the DSM.

REFERENCES