

Received March 31, 2021, accepted May 31, 2021, date of publication June 11, 2021, date of current version June 21, 2021. Digital Object Identifier 10.1109/ACCESS.2021.3088448

# Time-to-Digital Converter IP-Core for FPGA at State of the Art

FABIO GARZETTI, (Member, IEEE), NICOLA CORNA, (Member, IEEE), NICOLA LUSARDI, (Member, IEEE), AND ANGELO GERACI, (Senior Member, IEEE)

Politecnico di Milano, 20133 Milano, Italy

Corresponding author: Fabio Garzetti (fabio.garzetti@polimi.it)

**ABSTRACT** The structure of a Field Programmable Gate Array (FPGA) poses several constraints that make the implementation of complex asynchronous circuits, such as Time-Mode (TM) circuits, almost unfeasible. In particular, in Programmable Logic (PL) devices such as FPGAs, the logic usually operates synchronously with the system clock. However, very high performance specifications can demand abandoning this paradigm in favor of an asynchronous implementation. The main driver for using programmable logic instead of tailored Application Specific Integrated Circuits (ASICs), which are better suited to asynchronous design, is the demand from the research community and industrial R&D for fast prototyping at low Non-Recurring Engineering (NRE) costs. For instance, in the case of a high-resolution Time-to-Digital Converter (TDC), a signal clocked at some hundreds of MHz in an FPGA allows implementing a TDC with a resolution in the ns range. If a higher resolution is required, the signal frequency cannot be increased further, and one of the aces up the designer's sleeve is to use the propagation delay of the logic to quantize time intervals by means of a so-called Tapped Delay-Line (TDL). Implementing a TDL-based TDC in FPGAs requires special attention from the designer, both in making the best use of all available resources and in foreseeing how signals propagate inside these devices. In this paper, we investigate the implementation of a high-performance TDL-TDC targeting 28-nm 7-Series Xilinx FPGAs, taking into account the comparison between different technological nodes from 65 nm to 20 nm. In this context, the term high-performance means extended dynamic range (up to 10.3 s), high resolution and single-shot precision (up to 366 fs and 12 ps r.m.s., respectively), low differential and integral non-linearity (up to 250 fs and 2.5 ps, respectively), and multi-channel capability (up to 16).

**INDEX TERMS** Bubble errors, calibration, decoding, Field Programmable Gate Array (FPGA), interpolation, Nutt–Interpolation, Sub–Interpolation, Tapped Delay–Line (TDL), Time–to–Digital Converter (TDC).

# I. INTRODUCTION

Today, and especially in a long-term perspective, Time-to-Digital Conversion (TDC) measurement techniques are the reference for determining the instants at which digital events occur. This procedure is at the base of the latest generation of digital electronic circuits, called Time-Mode circuits, in which the philosophy of information representation changes radically. In fact, these circuits encode information in the difference between the instants at which digital events occur, rather than in the voltages at the nodes or the currents in the branches of the electrical network. In this scenario, TDC circuits are the core of modern Time–of–Flight (ToF) [1] measurements, which are last-generation solutions in medical diagnostics (e.g. ToF Positron Emission Tomography, ToF–PET [2], [3]), in automotive (e.g. LiDAR rangefinders and 3D mapping [4]–[6]), and in spectroscopy (e.g. Time Correlated Single Photon Counting, TCSPC [7]–[9]). The huge variety of applications of time measurements explains the research that has grown in recent years around the TDC, an enabling component of these measurements. The demand for ever greater performance and for features targeted at specific applications has naturally turned the research towards TDC architectures implemented in FPGA devices [10]–[14].

The associate editor coordinating the review of this manuscript and approving it for publication was Marco Anisetti.

Furthermore, the rapid prototyping and negligible NRE costs of FPGAs have consolidated the view that TDCs based on the classic ASIC (Application Specific Integrated Circuit) design are destined to be increasingly relegated to mass production.

The issue is not limited to the device used to implement the TDC. The specifications required today force the designer to abandon the standard synchronous digital design and move to asynchronous operating modes; this is a completely out-of-the-box approach in the field of PL devices [14]–[17]. It yields performance equivalent to overclocking at hundreds of GHz, which is unfeasible in real FPGA devices but necessary for achieving, for instance, resolutions in the ps range. In concrete terms, this means, for example, making controlled use of the intrinsic delays of the logic blocks that make up the FPGA device. In this way, it is possible to create chains of buffers (a.k.a. bins or taps) that constitute a delay-line, in which each tap behaves like a unit that quantizes the time interval over the delay-line, performing its digital conversion. The TDC based on this architecture is referred to as Tapped Delay-Line TDC, TDL-TDC [18].

Meeting the specific requirements of different applications requires equipping the basic TDL-TDC architecture with additional processing resources: a Nutt interpolator, which extends the full-scale range (FSR) of the measure beyond the maximum delay allowed by the TDL [19]; a calibrator, which compensates for the non-linearities (DNL and INL) introduced by the physical mismatches among the taps of the TDL [20] and by voltage and temperature fluctuations [21]; and a sub–interpolator, which improves the resolution by lowering, by means of processing, the minimum physical propagation delay of the tap available in the technological node of the device used [18], [22].

The paper introduces a TDL-TDC in FPGA that achieves an FSR of up to 10.3 s, high resolution and single–shot precision of up to 366 fs and 12 ps r.m.s., respectively, and low DNL and INL of up to 250 fs and 2.5 ps, respectively.

Section 2 briefly takes stock of the state of the art of TDL-TDCs implemented in FPGA devices. Section 3 describes the presented TDL-TDC from the theoretical point of view. Section 4 deals with the implementation of the presented TDL-TDC in different last-generation Xilinx FPGA devices, with particular attention to the portability of the proposed architecture among different devices.

# **II. STATE OF THE ART**

Several TDL-TDCs implemented in FPGA devices are available, in different configurations of features and performance, to serve a wide range of applications. In order to give a synoptic view of the state of the art of these instruments, we collected in Tab. 1 the most significant implementations in Xilinx FPGA devices, highlighting resolution (LSB), single–shot channel precision ( $\sigma_{CH}$ ), FSR, linearity (DNL/INL), and maximum rate per channel, if available.

# **III. TDL-TDC ARCHITECTURE**

The structure of the proposed TDL–TDC consists of one or more TDLs to digitize the time information with ps resolution [41], a sub–interpolation mechanism that improves the resolution below the propagation delay of the TDL bins [18], [20], [22], a synchronous counter for implementing the Nutt–Interpolation [22], a calibrator to maintain the linearity [4], [21], [42], and a decoding system to convert the thermometric code coming from the TDL into pure binary format [43], [44].

In principle, the TDL–TDC converts a time interval, defined by the occurrence of a START and a STOP edge, into a number. With reference to Fig. 1, the operation is realized by propagating the START rising edge along a sequence of buffers (called taps or bins) that constitute the TDL. The buffer outputs feed an array of D Flip–Flops (DFFs), whose clock is the line on which the STOP edge occurs. In this way, when the STOP edge arrives, the DFFs whose inputs have already been reached by the START rising edge store a 1, and the array returns as output a sequence of 1s whose length is proportional to the duration of the time interval under measure.

A further step converts this thermometric representation of the interval length in pure binary format [17].
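The operating principle described above can be illustrated with a minimal behavioral sketch in Python. It is not the hardware implementation: the function names, the tap delays, and the interval value are assumptions chosen only for illustration, and the sketch ignores bubble errors and metastability.

```python
from itertools import accumulate


def sample_tdl(tap_delays_ps, interval_ps):
    """Return the thermometric code latched by the DFFs at the STOP edge.

    tap_delays_ps: propagation delay of each tap (illustrative values).
    interval_ps:   START-to-STOP time interval under measure.
    """
    # Time at which the START edge reaches the output of each tap.
    arrival_times = accumulate(tap_delays_ps)
    # A DFF stores 1 if the START edge reached its input before STOP.
    return [1 if t <= interval_ps else 0 for t in arrival_times]


def thermometer_to_binary(code):
    """Convert an ideal (bubble-free) thermometric code to a bin number."""
    return sum(code)


# Assumed example: 8 identical taps of 20 ps and a 70 ps interval.
code = sample_tdl([20] * 8, 70)
bin_number = thermometer_to_binary(code)
```

With equal taps the bin number grows linearly with the interval; with uneven taps the same code length corresponds to different interval widths, which is the non-uniformity addressed by sub–interpolation and calibration.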

Since precisely matched propagation delays through blocks of the same type are not required for the uses for which an FPGA is normally intended, the delays of the buffers that constitute the TDL are not strictly equal to one another, as the quantization of the information would require [45]. Moreover, the organization of the device in clock regions introduces further inhomogeneities between the delays when the signal passes from one region to another. This unevenness of delays (Fig. 2) significantly reduces both the resolution and the linearity. To mitigate this issue, sub–interpolation and calibration are mandatory [42], [46]: sub–interpolation compensates for resolution, and calibration for linearity.

### A. TDL

In the presented IP-Core, the choice of the buffers constituting the bins of the TDL is crucial for maximizing resolution and minimizing the resources used. The best compromise is offered by the carry propagation chains within the adders in the Fabric of Xilinx FPGAs [47], [48]. Therefore, the implemented TDL consists of a series connection of a suitable number of carry propagation structures. Nevertheless, despite the well-balanced delays between the bins, these structures suffer from an intrinsic non-linearity, due to the carry-skip mechanism, that translates into missing commutations within the output thermometric code (a.k.a. bubble errors). These defects, which can be considered as such only because the device is used for a non-standard purpose, cannot be compensated exactly, since the manufacturer does not provide the information necessary to fully characterize them. This is the reason why, as mentioned, sub-interpolation and calibration procedures must be used. In last-generation Xilinx FPGAs, the logic primitives needed to implement the TDL are within the Configurable Logic Blocks (CLBs) that make up the Fabric structure of the device (Fig. 3). Both slices of each CLB have a custom primitive for carry management, called CARRY4 (the number 4 stands for 4 bits) in Xilinx 5, 6, and 7-Series, at 65–nm (X5S), 40–nm (X6S), and 28–nm (X7S) respectively, and CARRY8 in Xilinx UltraScale and UltraScale+, at 20–nm (XUS) and 16–nm (XUS+) respectively. In the Spartan–6 family (40–nm) there is a SLICEX that does not contain the CARRY4 primitive and is therefore useless for implementing TDLs. In each SLICE, the CARRY4/8 primitive can only be connected to the corresponding CARRY4/8 resource above it, thus realizing a vertical, ascending, and unidirectional structure.

| Reference  | FPGA device | Tech. Node | LSB     | $\sigma_{CH}$  | FSR            | DNL     | INL     | Ch. Rate |
|------------|-------------|------------|---------|----------------|----------------|---------|---------|----------|
| [23]       | Virtex-6    | 40–nm      | 10.0 ps | 19.6 ps r.m.s. | N.A.           | 15.0 ps | 22.5 ps | N.A.     |
| [24]       | Virtex-5    | 65–nm      | 16.3 ps | N.A.           | N.A.           | 48.9 ps | 81.5 ps | N.A.     |
| [25]       | Virtex-6    | 40–nm      | 1.70 ps | 4.2 ps r.m.s.  | N.A.           | 1.36 ps | 1.70 ps | N.A.     |
| [26]       | Kintex-7    | 28–nm      | 17.6 ps | 12.7 ps r.m.s. | N.A.           | 17.6 ps | 15.3 ps | N.A.     |
| [27]       | Virtex-6    | 40–nm      | 10.0 ps | 10.0 ps r.m.s. | N.A.           | 19.1 ps | 22.0 ps | N.A.     |
| [28]       | Kintex-7    | 28–nm      | 10.6 ps | 8.13 ps r.m.s. | N.A.           | 10.6 ps | 45.6 ps | N.A.     |
| [28]       | Virtex-6    | 40–nm      | 10.1 ps | 9.82 ps r.m.s. | N.A.           | 11.9 ps | 33.3 ps | N.A.     |
| [28]       | Spartan-6   | 40–nm      | 16.7 ps | 12.8 ps r.m.s. | N.A.           | 20.4 ps | 42.4 ps | N.A.     |
| [29]       | UltraScale  | 20–nm      | 2.25 ps | 3.90 ps r.m.s. | N.A.           | N.A.    | N.A.    | N.A.     |
| [30]       | Virtex-5    | 65–nm      | 7.40 ps | 6.80 ps r.m.s. | N.A.           | 5.48 ps | 11.6 ps | N.A.     |
| [31]       | Virtex-7    | 28–nm      | 1.15 ps | 3.50 ps r.m.s. | N.A.           | 4.03 ps | 6.79 ps | N.A.     |
| [12]       | Virtex-7    | 28–nm      | 10.5 ps | 14.6 ps r.m.s. | N.A.           | 0.84 ps | 1.16 ps | N.A.     |
| [12]       | UltraScale  | 20–nm      | 5.02 ps | 7.80 ps r.m.s. | N.A.           | 0.60 ps | 2.31 ps | N.A.     |
| [32]       | Virtex-5    | 65–nm      | 18.0 ps | 25.0 ps r.m.s. | 10.7 s         | N.A.    | N.A.    | 5 MHz    |
| [33]       | Artix-7     | 28–nm      | 10.0 ps | 15 ps r.m.s.   | 10.7 s         | N.A.    | N.A.    | 10 MHz   |
| [34]       | Artix-7     | 28–nm      | 250 fs  | 12 ps r.m.s.   | 10.3 s         | N.A.    | 4.2 ps  | 20 MHz   |
| [35]       | Artix-7     | 28–nm      | 250 fs  | 12 ps r.m.s.   | 10.3 s         | 33 fs   | 4.6 ps  | 45 MHz   |
| [36], [37] | Zynq-7000   | 28–nm      | 2 ps    | 12 ps r.m.s.   | 10.7 s         | N.A.    | N.A.    | 45 MHz   |
| [38]       | UltraScale  | 20–nm      | 305 fs  | 8.5 ps r.m.s.  | $10.2 \ \mu s$ | N.A.    | N.A.    | 50 MHz   |
| [39]       | Spartan-6   | 40–nm      | 7.70 ps | 8.90 ps r.m.s. | N.A.           | 22.3 ps | 67.8 ps | 40 MHz   |
| [40]       | Virtex-7    | 28–nm      | 6.00 ps | 7.00 ps r.m.s. | 2.1 s          | N.A.    | N.A.    | 125 MHz  |

TABLE 1. Most significant implementations of TDL-TDCs in Xilinx FPGA devices, sorted by resolution (LSB).

The number  $N_R$  of TDL taps to implement depends on the TDC clock period  $T_{CLK-TDC}$  and on the mean real propagation delay  $t_p[n_R]$  of the bins, which is determined by the technological node of the FPGA used. The number of taps should be at least  $N_R = T_{CLK-TDC}/t_p[n_R]$, and preferably greater. If energy saving is not a primary concern in the application, it is always possible to increase the TDC clock frequency, which reduces the number of taps of the TDLs and avoids the crossing of different clock regions with the resulting anomalous delays mentioned above and depicted in Fig. 2. Table 5 gives possible combinations of values of the three parameters  $N_R$ ,  $T_{CLK-TDC}$ ,  $t_p[n_R]$  for last-generation FPGA families. In particular, for the implementation of the presented IP-Core, we have experimentally verified that a suitable number of taps is  $N_R = 256$  in X5S, X6S, and X7S, and  $N_R = 512$  in XUS. The implementation in XUS+ is still under test.
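As a rough illustration of this sizing rule, the following Python sketch computes the minimum number of taps covering one clock period and rounds it up to a whole number of carry primitives. The function name, the clock period, and the mean tap delay used in the example are hypothetical values, not figures taken from Table 5.

```python
import math


def taps_needed(t_clk_ps, mean_tap_delay_ps, taps_per_carry=4):
    """Smallest tap count covering one clock period, rounded up to a
    whole number of carry primitives (4 taps for CARRY4, 8 for CARRY8)."""
    min_taps = math.ceil(t_clk_ps / mean_tap_delay_ps)
    carries = math.ceil(min_taps / taps_per_carry)
    return carries * taps_per_carry


# Assumed values for illustration only: 2.4 ns clock, 17 ps mean tap delay.
n_r = taps_needed(2400, 17, taps_per_carry=4)
```

In practice a margin above this minimum is advisable, since the mean tap delay varies with process, voltage, and temperature; the values of $N_R$ adopted in the IP-Core were verified experimentally rather than computed this way.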

### **B. SUB-INTERPOLATION**

The differences among the delays of the taps of the delay line entail a non-uniform quantization of the time interval under measure, with a consequent deterioration of the measurement precision, in particular for single-shot measurements [46].

At first glance, the solution may seem to be averaging repeated measurements of the same time interval performed with different TDL implementations [20]. This is not effective since, in practice, the number of feasible averages is statistically insufficient to compensate for the effect of any ultra-bin in the series of measurements. If present, the ultra-bin would continue to dominate even the averaged measure [20], [22].

FIGURE 1. TDL and structure coding the time distance between START and STOP edges into a binary number.

The sub-interpolation process consists in averaging F measures of the same interval performed over one TDL with  $N_R$  bins, adding an appropriate offset to the interval at each measurement so that a different set of bins is involved each time. The same result is obtained by measuring the interval once on F different TDLs with  $N_R$  bins each. In both cases, the result is as if the final measurement had been performed on a virtual TDL (V-TDL) made of  $F \cdot N_R$  virtual bins, whose average propagation delay is about F times shorter than that of the real bins. In other words, it is like having performed a bin-by-bin average that reduces the quantization noise [46]. Fig. 4 shows an example of the distribution of delays over the bins of a TDL and over the bins of the resulting V-TDL corresponding to sub-interpolation with F = 2. The reduction of the average and variance of the delay values is evident.
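The equivalence between summing F real bins and reading a single bin on the V-TDL can be checked with a small behavioral model. The sketch below is only illustrative: the tap delays and offsets are assumed values, and the offset is modeled as a different arrival delay of START at each TDL.

```python
from bisect import bisect_right
from itertools import accumulate


def tap_edges(tap_delays, offset=0):
    """Times at which the START edge leaves each tap of one TDL."""
    return [offset + t for t in accumulate(tap_delays)]


def real_bin(edges, interval):
    """Number of taps already crossed when STOP arrives (real bin)."""
    return bisect_right(edges, interval)


def virtual_bin(all_edges, interval):
    """Virtual bin: sum of the F real bins measured for the same interval."""
    return sum(real_bin(edges, interval) for edges in all_edges)


# Assumed example: F = 2 measurements over the same uneven 16-tap line
# (mean delay 20), the second one shifted by an offset of 7.
tap_delays = [10, 30, 5, 35] * 4
all_edges = [tap_edges(tap_delays, off) for off in (0, 7)]

# The V-TDL is the merge of all tap edges; its bins are the gaps between them.
merged = sorted(e for edges in all_edges for e in edges)
virtual_widths = [b - a for a, b in zip(merged, merged[1:])]
mean_virtual = sum(virtual_widths) / len(virtual_widths)
```

In this toy case the mean virtual bin width is roughly half the mean real delay, and the widest virtual bin is narrower than the widest real tap, which mirrors the behavior shown qualitatively in Fig. 4.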

In the presented IP-Core, the sub-interpolation has been realized by the principle of performing F measurements over the same TDL.



FIGURE 2. Distribution of propagation delays of the taps constituting the TDL. The picture highlights that the TDL can cross different clock regions introducing extra delays. The greatest delay in the TDL is referred to as ultra-bin, in the pictured scenario due to the crossing between different clock regions.



FIGURE 3. Schematic of the partition in SLICEs of the CLBs in last generation Xilinx FPGAs [47], [48]. The transit path of the carry signal (CIN/COUT) from one SLICE to the other is highlighted. In the array of CLBs, SLICE positions are identified in terms of column (X) and row (Y) numbers occupied. Generically, the SLICEs in a CLB are also called SLICE(0) and SLICE(1) because of differences in the internal structure.

It can be shown that it is possible to mitigate the effects of the non-uniformity of the TDL bin delays by propagating two fronts instead of just one (Fig. 5), with suitable logic averaging the positions of the fronts. This technique goes by the name of Wave Union A [18]. Propagating F replicas of a signal that is now made up of multiple edges on the same TDL is complex in hardware, which suggests that a compromise between the two extreme solutions, a single TDL and one TDL for each replica, offers the best implementation efficiency. This is how a version of Wave Union A has been implemented in the IP-Core, in which  $f_{OUT}$  TDLs are placed side by side in parallel, each one performing E measures, to give  $F = f_{OUT} \cdot E$ . This technique is known as Super Wave Union (SuperWU) [13], [35], [49].

**FIGURE 4.** Superposition of the propagation delays in a TDL composed of  $N_R$  real bins (red) over the propagation delays of the F = 2 sub-interpolated V-TDL with  $N_V$  virtual-bins (green). Data are obtained from actually implemented TDLs.

FIGURE 5. Comparison of a simple thermometric-code (red) and square wave (green) sampled over the TDL in the case of single and multiple edges propagations.

Therefore, a reasonable compromise adopted in the IP-Core is the SuperWU with two measures (E = 2) over four parallel TDLs ($f_{OUT} = 4$), i.e.  $F = 2 \cdot 4 = 8$  [46].

From an implementation point of view, the SuperWU is obtained by instantiating  $f_{OUT}$  TDLs ([13], [35]) in parallel. The START signal is conveyed to a SuperWU–Launcher (SWUL) (Fig. 6) that generates a 2–edge square wave, composed of a down–edge (DN) and an up–edge (UP), which is injected into the four TDLs. Moreover, the SWUL samples the START by means of a DFF, generating the STOP. The STOP is used to sample all the TDLs. In this way, each START–STOP event generates 8 measures that are processed by the next stage, the decoder (Fig. 6).

**FIGURE 6.** Hardware implementation of Super WU at F = 8 ( $f_{OUT} = 4$ , E = 2). The waveforms below the four TDL-TDC blocks in parallel symbolize the START input which arrives at each of the blocks with a different delay.

**TABLE 2.** Values of the  $N_R$,  $T_{CLK-TDC}$, and  $\overline{t_p}$  parameters after the SuperWU interpolation.

| Series | $E \cdot f_{OUT}$ | $N_V$ | $\overline{t_p[n_V]}$ | $t_p^{MAX}[n_V]$ | Tech. Node |
|--------|-------------------|-------|-----------------------|------------------|------------|
| X5S    | 2x4               | 2048  | 4.6 ps                | 14.1 ps          | 65–nm      |
| X6S    | 2x4               | 2048  | 3.7 ps                | 18 ps            | 40–nm      |
| X7S    | 2x4               | 2048  | 2.5 ps                | 16 ps            | 28–nm      |
| XUS    | 2x4               | 4096  | 1.2 ps                | 20 ps            | 20–nm      |

**TABLE 3.** Resource occupancy of the proposed 2 · 4 SuperWU interpolation, expressed as the number of CARRY4/8 primitives, LUTs, and FFs as a function of the target FPGA family.

| Series | CARRY4/8 | LUT | FF   | Tech. Node |
|--------|----------|-----|------|------------|
| X5S    | 256      | 1   | 1025 | 65–nm      |
| X6S    | 256      | 1   | 1025 | 40–nm      |
| X7S    | 256      | 1   | 1025 | 28–nm      |
| XUS    | 256      | 1   | 2049 | 20–nm      |

Table 2 reports the resulting improvements in resolution, expressed as mean "virtual" propagation delay ( $\overline{t_p[n_V]}$ ) and "virtual" ultra–bin ( $t_p^{MAX}[n_V]$ ), in the X5S, X6S, X7S, and XUS FPGA families, comparing the proposed SuperWU with the simple TDL described in Table 5.

Table 3 reports the resource occupancy of the implemented SuperWU interpolation.
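The counts in Table 3 appear consistent with a simple accounting, sketched below in Python: one carry primitive every 4 (CARRY4) or 8 (CARRY8) taps, one sampling DFF per tap, plus the single DFF with which the SWUL samples the START. This accounting is our reading of the table and is an assumption, not a formula stated in the text; the function name is hypothetical.

```python
def superwu_resources(n_r, f_out, taps_per_carry):
    """Estimate carry primitives and flip-flops of the SuperWU front end.

    Assumption (not stated in the source): one DFF per tap of each TDL,
    plus one DFF in the SuperWU-Launcher that samples START.
    """
    carries = f_out * n_r // taps_per_carry
    flip_flops = f_out * n_r + 1
    return carries, flip_flops


# 7-Series: four 256-tap TDLs built from CARRY4 primitives.
x7s = superwu_resources(n_r=256, f_out=4, taps_per_carry=4)
# UltraScale: four 512-tap TDLs built from CARRY8 primitives.
xus = superwu_resources(n_r=512, f_out=4, taps_per_carry=8)
```

Under this reading, doubling the taps in XUS leaves the number of carry primitives unchanged, because each CARRY8 hosts twice as many taps as a CARRY4, while the number of sampling flip-flops doubles.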

# C. DECODER

The decoder has the task of converting the thermometric code resulting from the sampling of the TDC bins into a binary format. As reported in Fig. 8, this is accomplished through three stages: the identification of the real-bin hit positions on the TDL, or TDLs in the case of SuperWU (Edge Detection Phase, EDP); the detection and correction of the bubble errors (Bubble–Errors Correction Phase, BECP) [44]; and the calculation of the virtual bin by summing up the F real ones (Sub–Interpolation Phase, SIP).

**FIGURE 7.** Graphical representation of a 2–edge square wave on a TDL that shows in the output thermometric code bubble error sequences. The twilight zone associated to the rising transition is  $N_{BL}$  long.

FIGURE 8. Decoder block diagram. The decoder is composed of the sequence of three cascaded processing blocks, the edge detection (EDP), the correction of the bubble errors (BECP), and the calculation of the virtual bin (SIP). The individual blocks are described in detail in the main text.

On the implementation side, different solutions are possible for the EDP and BECP modules. In this project, we have chosen an EDP based on the base–2 logarithm (LOG2), a BECP based on the Bubble–Error Compression (BEC) principle, and a SIP performed by means of a Tree–Adder (TA). Furthermore, pipelined architectures are mandatory to sustain high measurement rates.

### 1) EDGE DETECTION PHASE (EDP)

The core of the EDP is a pipeline–based LOG2 engine (Algorithm 1).

The EDP is performed over each of the  $f_{OUT} = 4$  TDLs using  $2 \cdot f_{OUT} = 8$  LOG2 engines, in particular $f_{OUT} = 4$ LOG2–DN stages detecting the falling (DN) edges  $(n_{DN})$ , and  $f_{OUT} = 4$  LOG2–UP stages detecting the rising (UP) edges  $(n_{UP})$ . In detail, each of the  $f_{OUT} = 4$  TDLs propagates the E = 2 edges, DN and UP, generated by the SWUL (Fig. 6). Both the LOG2–DN and the LOG2–UP modules are based on the same LOG2 engine structure.

The aim of the EDP is to detect the positions of the DN and UP edges  $(n_{DN}, n_{UP})$  over the four TDLs, which correspond to the real bins.

Algorithm 1. Pseudo-code for the Hardware Implementation of the LOG2 Module.

**FIGURE 9.** Implementation scheme of the EDP by means of LOG2 described in Algorithm 1.

Consider now an unrealistic case without bubble errors and a TDL composed of  $N_R$  taps generating an  $N_R$-bit wide word  $n_{TDL} \in [0; 2^{N_R} - 1]$. The position of the DN edge  $(n_{DN})$  is

$$n_{DN} = \lfloor \log_2(n_{TDL}) \rfloor \tag{1}$$

The position of the UP edge  $(n_{UP})$  can be similarly calculated by using a swapped version of  $n_{TDL}$  (*swap*( $n_{TDL}$ )), i.e.

$$n_{UP} = (N_R - 1) - \lfloor \log_2(swap(n_{TDL})) \rfloor \tag{2}$$

In (2), the power-of-2 weights of the digits are swapped, so the roles of the MSB and LSB are inverted.

As Fig. 9 shows, (1) and (2) can be easily translated into the LOG2–DN and LOG2–UP hardware pipeline modules with a latency of  $\lceil log_2(N_R) \rceil$  clock pulses.
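To make (1) and (2) concrete, the following Python sketch evaluates them with a successive-halving search in which each loop iteration stands for one pipeline stage. It is a behavioral illustration written for this purpose, not a transcription of Algorithm 1; the function names are hypothetical and the word is assumed to be non-zero and free of bubble errors.

```python
def floor_log2_staged(word, n_bits):
    """Position of the highest '1' in word, found by successive halving.

    Each loop iteration models one pipeline stage, so the loop runs
    ceil(log2(n_bits)) times.
    """
    if word <= 0 or word >> n_bits:
        raise ValueError("word must be a non-zero n_bits-wide value")
    stages = (n_bits - 1).bit_length()  # equals ceil(log2(n_bits))
    half = 1 << stages
    position = 0
    for _ in range(stages):
        half >>= 1
        upper = word >> half
        if upper:  # the highest '1' lies in the upper half
            position += half
            word = upper
    return position


def swap(word, n_bits):
    """Bit-reverse an n_bits-wide word (MSB and LSB exchange roles)."""
    return int(format(word, f"0{n_bits}b")[::-1], 2)


def edge_positions(word, n_bits):
    """Return (n_DN, n_UP) following equations (1) and (2)."""
    n_dn = floor_log2_staged(word, n_bits)
    n_up = (n_bits - 1) - floor_log2_staged(swap(word, n_bits), n_bits)
    return n_dn, n_up


# Example: a run of ones occupying bits 2 to 5 of an 8-bit word.
n_dn, n_up = edge_positions(0b00111100, 8)
```

For a 256-tap TDL the loop runs 8 times, in agreement with the latency of $\lceil log_2(N_R) \rceil$ clock pulses stated above.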

Table 4 reports the area occupancy for the implementation of LOG2–DN and LOG2–UP modules in terms of number of LUTs and FFs as a function of the number of taps $N_R$  of the TDL. Here, we can see the difference in the used resources due to the asymmetry between the two modules. For the proposed IP-Core, we set  $N_R = 256$  in X5S, X6S, X7S and  $N_R = 512$  in XUS.

### 2) BUBBLE ERROR CORRECTION PHASE (BECP)

In correspondence to the real transitions of the digital signals entering the TDL, the resulting thermometric code has no clear transition edges from 0 to 1 or from 1 to 0. Rather, between the real values 0 and 1, it shows a twilight zone consisting of a sequence of random bits with no physical meaning, called bubble errors. The number of bubble errors constituting the twilight zone is referred to as bubble length (BL) (Fig. 7). These bubble errors can depend on non–uniform propagation over the TDL and on the mismatch of the interconnections from the buffers to the DFFs. Experimentally, we have measured a BL of 4 bits in the CARRY4 and of 16 bits in the CARRY8 [49].

**TABLE 4.** Area occupancy, as a function of $N_R$, expressed as number of LUTs and FFs, for (a) LOG2–DN and (b) LOG2–UP.

| $N_R$ | (a) LOG2–DN LUT | (a) LOG2–DN FF | (b) LOG2–UP LUT | (b) LOG2–UP FF |
|-------|-----------------|----------------|-----------------|----------------|
| 8     | 7               | 11             | 7               | 13             |
| 16    | 17              | 20             | 17              | 21             |
| 32    | 37              | 37             | 37              | 38             |
| 64    | 100             | 70             | 102             | 71             |
| 128   | 133             | 135            | 134             | 136            |
| 256   | 262             | 264            | 264             | 265            |
| 512   | 306             | 512            | 457             | 520            |
| 1024  | 516             | 1033           | 986             | 1032           |

Working in the presence of bubble errors means losing a factor  $N_{BL}$  in resolution, which is unacceptable and makes the introduction of a correction mechanism mandatory. In the decoder of the IP-Core, the module performing this correction is referred to as Bubble–Errors Correction Phase, BECP.

After the EDP, the outputs produced by the  $f_{OUT} = 4$ LOG2–DN and  $f_{OUT} = 4$  LOG2–UP engines enter the corresponding  $f_{OUT} = 4$  BEC–DN and  $f_{OUT} = 4$  BEC–UP modules, whose outputs are F = 8 real bins without bubble errors, namely  $f_{OUT} = 4$  values of  $n_{DN}^{-}$  and  $f_{OUT} = 4$  values of  $n_{UP}^{+}$.

From the operational point of view, while the LOG2–DN and LOG2–UP modules detect the DN and UP edges, the  $N_{BL}$  bits before  $n_{DN}$  ( $\Delta n_{DN}[n_R]$  with  $n_R \in [n_{DN} - (N_{BL} - 1); n_{DN}]$ ) and those after  $n_{UP}$  ( $\Delta n_{UP}[n_R]$  with  $n_R \in [n_{UP}; n_{UP} + (N_{BL} - 1)]$ ) are selected. For the DN and UP edges, the BEC–DN and BEC–UP stages count the number of zeros ("0") in  $\Delta n_{DN}[n_R]$  and  $\Delta n_{UP}[n_R]$, mathematically represented by  $N_{BL} - \sum \Delta n_{DN}$  and  $N_{BL} - \sum \Delta n_{UP}$  respectively. Then, these values are subtracted from  $n_{DN}$  and added to  $n_{UP}$ , obtaining the real bins  $n_{DN}^-$  and  $n_{UP}^+$  without bubble errors,

$$n_{DN}^{-} = n_{DN} - \left\{ N_{BL} - \sum \Delta n_{DN} \right\} \tag{3}$$

$$n_{UP}^{+} = n_{UP} + \left\{ N_{BL} - \sum \Delta n_{UP} \right\} \tag{4}$$

From the implementation point of view,  $N_{BL} - \sum \Delta n_{DN}$ and  $N_{BL} - \sum \Delta n_{UP}$  are performed in Look–Up Tables and the operation of subtraction/sum in (3) and (4) are performed in a pipeline stage. For this reason the BECP has 1 clock cycle of latency (Fig. 10).

From a theoretical point of view, the adopted correction strategy is a compromise between performance and complexity for the BECP. In fact, different bubble patterns produce the same output code once compressed. E.g. the BEs "1010", "1001", and "0101" produce the same correction of 2 by means of the implemented BEC mechanism [49].

(a) LOG–DN and BEC–DN implementation.

**(b)** LOG–UP and BEC–UP implementation.

FIGURE 10. Implementation scheme of EDP and BECP by means of LOG and BEC respectively.

**FIGURE 11.** Hardware implementation of the TA for F = 8.

Table 7 shows the area occupancy of the sequence of LOG2–DN/BEC–DN and LOG2–UP/BEC–UP modules in the case of  $N_R = 256$  and  $N_{BL} = 4$  in X5S, X6S, X7S and  $N_R = 512$ and  $N_{BL} = 16$  in XUS and XUS+.
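The bubble-error compression of (3) and (4) can be sketched in a few lines of Python. This is a behavioral illustration only: the helper names are hypothetical, the thermometric code is represented as an integer whose bit $n_R$ is the output of tap $n_R$, and the selected windows are assumed to lie entirely inside the word.

```python
def count_zeros(word, lowest_bit, n_bl):
    """Number of '0' bits in the n_bl-wide window starting at lowest_bit,
    i.e. N_BL minus the sum of the selected bits."""
    ones = sum((word >> i) & 1 for i in range(lowest_bit, lowest_bit + n_bl))
    return n_bl - ones


def bec_dn(word, n_dn, n_bl):
    """Equation (3): compress the bubbles in the N_BL bits up to n_DN."""
    return n_dn - count_zeros(word, n_dn - (n_bl - 1), n_bl)


def bec_up(word, n_up, n_bl):
    """Equation (4): compress the bubbles in the N_BL bits from n_UP."""
    return n_up + count_zeros(word, n_up, n_bl)


# Assumed 16-bit example with one bubble next to each edge.
word = 0b0000101111010000
n_dn_corrected = bec_dn(word, n_dn=11, n_bl=4)  # window bits 8..11
n_up_corrected = bec_up(word, n_up=4, n_bl=4)   # window bits 4..7
```

Because only the number of zeros in the window matters, patterns such as "1010", "1001", and "0101" all yield a correction of 2, which is the compression effect mentioned in the text.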

## 3) SUB-INTERPOLATION PHASE (SIP)

After the correction of bubble errors, the real bins are summed up by a fast pipelined adder based on a tree structure for parallelism (Tree Adder, TA) to calculate the virtual bins  $(n_V)$. The Sub-Interpolation Phase (SIP) module performing this function adds the *F* real bins, which are 8 in the presented IP-Core. The module takes  $\lceil log_2(F) \rceil$  clock cycles (3 stages) to carry out the computation of the  $n_V$  bins. Table 8 reports the area occupancy of the TA as a function of *F*, considering 8–bit wide ports (w = 8), suitable for X5S, X6S, X7S ( $N_R = 256$ ), and 9–bit wide ports (w = 9), compatible with XUS and XUS+ ( $N_R = 512$ ). Fig. 11 shows the scheme of the TA implementation for F = 8.
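The structure of the TA can be sketched behaviorally as follows. The Python function below is an illustration with a hypothetical name: each pass of the loop models one pipeline stage in which adjacent pairs are summed in parallel, and the input values are assumed real bins chosen only for the example.

```python
def tree_add(values):
    """Sum the inputs pairwise, one tree level per pipeline stage.

    Returns the total and the number of stages, ceil(log2(len(values))).
    """
    level = list(values)
    stages = 0
    while len(level) > 1:
        if len(level) % 2:  # pad an odd level with a zero operand
            level.append(0)
        # All pair sums of one level happen in the same clock cycle.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        stages += 1
    return level[0], stages


# Assumed example: F = 8 corrected real bins from one START-STOP event.
n_v, latency = tree_add([10, 11, 10, 12, 9, 10, 11, 10])
```

Note that each level may need one extra bit of width to avoid overflow, so the output of an F = 8 tree fed by w-bit inputs can require up to w + 3 bits.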

Finally, at the end of the description of the modules constituting the decoder structure, Fig. 12 depicts a synoptic view of the decoder scheme also from the functional point of view.

### D. CALIBRATOR

The non-linearity due to the unevenness of the delays in the TDL is reduced by the calibration procedure [42]. In fact, the sub–interpolation reduces the propagation delay of the real bins without any effect on linearization: if ultra–bins are reduced in magnitude, to a first approximation, by the factor F, the same also happens to the faster bins. In other words, the V–TDL has the same percentage inhomogeneity as the TDL. Therefore, with or without sub–interpolation, the measures performed by a TDL–based TDC are affected by high DNL and INL.

Non-linearities can be identified by collecting a sufficiently large number of measurements of events that follow a Poissonian distribution, i.e. that are uncorrelated with the TDC clock. With ideal, equal bins the resulting histogram of hits would be uniform, so deviations from the uniform distribution reveal the non-linearities. This is a Code-Density Test (CDT) [50], and it involves the creation of a Calibration Table (CT) made up of the measured delays  $t_p[n_{R,V}]$  of the  $n_{R,V} \in [1; N_{R,V}]$  bins. The error  $\delta t_{CAL}$  in the estimation of propagation delays is given by  $\delta t_{CAL} =$  $(1/K) \cdot \sum t_p[n_{R,V}]$, where *K* is the calibration length (Fig. 13). We refer to this procedure as bin-by-bin calibration.
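The principle of the CDT can be illustrated with the 4-tap TDL of Fig. 13 (10 ps, 50 ps, 5 ps, 20 ps). The Python sketch below is a simplified model: for reproducibility it replaces the random (Poissonian) events with an evenly spaced sweep of K hit times, which is an assumption of this example and not the procedure used in hardware. Delays are expressed in fs to keep the arithmetic in integers.

```python
from bisect import bisect_right
from itertools import accumulate


def code_density_test(tap_delays_fs, k):
    """Build the Calibration Table (histogram of hits per bin).

    The K hit times are evenly spaced over the line (assumption made for
    reproducibility; real events are random and uncorrelated with the clock).
    """
    edges = list(accumulate(tap_delays_fs))
    total = edges[-1]
    ct = [0] * len(tap_delays_fs)
    for i in range(k):
        hit = (2 * i + 1) * total // (2 * k)  # centre of the i-th time slice
        ct[bisect_right(edges, hit)] += 1
    return ct


def estimated_delays(ct, total_fs):
    """Each bin delay is estimated as its share of hits times the full span."""
    k = sum(ct)
    return [count * total_fs / k for count in ct]


tap_delays_fs = [10_000, 50_000, 5_000, 20_000]  # 4-tap TDL of Fig. 13
ct = code_density_test(tap_delays_fs, k=8500)
delays = estimated_delays(ct, sum(tap_delays_fs))
```

Each hit resolves the span in steps of $\sum t_p / K$ (here 10 fs), so a longer calibration length K gives a finer estimate of the bin delays at the cost of a longer acquisition.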

To use the CT, it is integrated to obtain the Characteristic Curve (CC), a look-up table that converts the uncalibrated measures coming from the decoder into calibrated ones (Fig. 14).

Seeking a compromise between performance and area occupancy, a periodic and pipelined calibration based on fixed–point arithmetic has been implemented in the IP-Core.

The calibration is a sequence of two steps. First, a histogram (CT) of the uncalibrated measures is created; then, it is integrated to generate the CC. The CT is stored in a Block RAM (BRAM), a configurable memory module inside the FPGA. The addresses needed are the virtual bins  $n_V \in [0; N_V - 1]$  of the V–TDL, and each bin of the histogram needs enough bits to represent the maximum number of possible counts K (calibration length). For this reason, a BRAM with  $N_V$  locations of  $\lceil \log_2 K \rceil$  bits each, i.e.  $N_V \cdot \lceil \log_2 K \rceil$  bits, is required. Each time a virtual bin is measured, the calibrator increments by one the corresponding location in BRAM. This process is performed K times.

The next step is the pipelined integration of the CT. From a theoretical point of view, the calculation of the CC can be approximated by truncation or by rounding. Truncation has a lower computational cost but performs worse in terms of quantization noise; rounding is therefore preferable, since it provides better system precision. In order to maximize processing efficiency, the CT is integrated to provide the CC every $K$ measures, according to the algorithm described by the following equations,

$$\begin{aligned}
CC[0] &= \frac{CT[0]}{2} \\
CC[1] &= CC[0] + \frac{CT[0] + CT[1]}{2} \\
&\;\;\vdots \\
CC[n_V] &= CC[n_V - 1] + \frac{CT[n_V - 1] + CT[n_V]}{2}
\end{aligned} \tag{5}$$







**FIGURE 13.** Graphical representation of a 4-tap TDL with propagation delays (10 ps, 50 ps, 5 ps, 20 ps) represented as filled bars, the estimated CT (dotted line), and the relative calibration error $\delta t_{CAL}$.

In (5), we notice a division by a power of two, which in binary representation corresponds to a right shift by a number of bits equal to the exponent. However, in this way, there is an increasing loss of precision, because the remainder of the division is completely neglected. To avoid this detrimental effect, (5) is multiplied by two,

$$2 \cdot CC[n_V] = 2 \cdot CC[n_V - 1] + (CT[n_V - 1] + CT[n_V]) \tag{6}$$

At the end of the integration process, the $2 \cdot CC[n_V]$ values are scaled down by a factor of two and stored in a BRAM of $N_V \cdot \lceil \log_2 K \rceil$ bits. Without this trick, the truncation error would accumulate in the CC at every step.
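The benefit of accumulating $2 \cdot CC$ as in (6) can be seen with a small software model. The sketch below is not the pipelined hardware of the IP-Core; the function names are hypothetical, and the final scaling is assumed to be a plain right shift.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Integration following (6): the accumulator holds 2*CC[n] exactly,
 * and the division by two is applied only once, when the value is stored. */
void integrate_doubled(const uint32_t *ct, size_t n_v, uint32_t *cc)
{
    uint64_t acc2 = 0;   /* 2 * CC[n] */
    uint32_t prev = 0;   /* CT[n-1]; zero before the first bin */
    assert(n_v > 0);
    for (size_t n = 0; n < n_v; ++n) {
        acc2 += (uint64_t)prev + ct[n];
        cc[n] = (uint32_t)(acc2 >> 1);
        prev = ct[n];
    }
}

/* Direct use of (5) with a right shift at every step:
 * each discarded remainder is lost and the loss adds up. */
void integrate_truncating(const uint32_t *ct, size_t n_v, uint32_t *cc)
{
    assert(n_v > 0);
    cc[0] = ct[0] >> 1;
    for (size_t n = 1; n < n_v; ++n)
        cc[n] = cc[n - 1] + ((ct[n - 1] + ct[n]) >> 1);
}
```

For the toy table CT = {1, 2, 1, 2}, the exact curve is {0.5, 2, 3.5, 5}. The doubled accumulator yields {0, 2, 3, 5}, with an error that never exceeds half a count, whereas the step-by-step shift yields {0, 1, 2, 3}, with an error that grows along the line.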

The calibrator defines the LSB of the V-TDL-TDC as

$$LSB = \frac{\sum t_p}{K} \tag{7}$$





To guarantee an updated and stable calibration status of the system, we store two CCs, CC#1 and CC#2, as Fig. 15 shows. While one CC (e.g. CC#1) is being created by integrating the CT, the other (e.g. CC#2) is used for calibrating the measures performed. Then, when a new set of $K$ samples and the corresponding updated CT are available, the roles of the two CCs are swapped.
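The alternation between the two CCs is a classic double-buffering scheme. A minimal sketch is given below, with a hypothetical table size `N_V` and hypothetical names (`calibrator_t`, `cal_convert`, `cal_shadow`, `cal_swap`); in the IP-Core the two tables reside in BRAM, while here they are plain arrays.

```c
#include <assert.h>
#include <stdint.h>

#define N_V 8u   /* hypothetical, small number of virtual bins */

typedef struct {
    uint32_t cc[2][N_V];  /* CC#1 and CC#2 */
    unsigned active;      /* index of the CC currently used for conversion */
} calibrator_t;

/* Convert an uncalibrated code with the CC currently in use. */
uint32_t cal_convert(const calibrator_t *c, uint32_t code)
{
    assert(code < N_V);
    return c->cc[c->active][code];
}

/* The CC that is free to be rewritten while the other one is in use. */
uint32_t *cal_shadow(calibrator_t *c)
{
    return c->cc[1u - c->active];
}

/* Called once the shadow CC has been fully rebuilt from a new CT. */
void cal_swap(calibrator_t *c)
{
    c->active = 1u - c->active;
}
```

Conversions always read a complete table, since the swap takes place only after the shadow CC has been entirely rewritten.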

From the implementation point of view, considering the trade–off between area occupancy and  $\delta t_{CAL}$ , we have implemented the calibration algorithm based on  $K = 2^{16}$ . Table 9 reports the area occupancy of the module as a function of the different FPGA families. The value of  $N_V = 2048$  is used for X5S, X6S, and X7S, and of  $N_V = 4096$  for XUS and XUS+.

# E. FULL-SCALE RANGE EXTENSION AND MULTI-CHANNEL SYNCHRONIZATION

A wide full-scale range (FSR) of measures is mandatory in several leading applications of TDCs, such as 3D imaging and time-of-flight measures (e.g. in LIDAR systems). The issue of full-scale range extension arises from the fact that the interpolator has high resolution but it cannot measure long time intervals.



**FIGURE 15.** Implementation scheme of the calibrator.

The adopted solution to get a longer FSR is to place beside the V–TDL, which measures short intervals with high resolution, an $N_{CC}$-bit-wide counter [19] that measures long intervals with limited resolution. It is like having two TDCs in parallel, one based on the counter and one on the V–TDL. The measure is thus composed of a coarse part ($T_{COARSE}$) provided by the counter and a fine contribution ($T_{FINE}$) calculated by the V–TDL. Since both are driven by the same $T_{CLK-TDC}$, the counter measure is combined with the interval measured by the interpolator between the asynchronous event and the following clock event. Precisely, as Fig. 16 shows, the generic time event $T$ is used as the START signal for the V–TDL, and its sampled version is used as the STOP signal. In this way the V–TDL returns $T_{FINE}$, and the same STOP event is used to latch the value at the counter output, providing $T_{COARSE}$, i.e. $T = T_{COARSE} - T_{FINE}$ (Fig. 17). This technique is referred to as Nutt interpolation [51].

The resolution of the system is that of the V–TDL based part of the TDC, while the counter determines the FSR, which is equal to $2^{N_{CC}} \cdot T_{CLK-TDC}$.
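The two relations above, $T = T_{COARSE} - T_{FINE}$ and $FSR = 2^{N_{CC}} \cdot T_{CLK-TDC}$, can be checked numerically with the following sketch. The function names and the choice of picoseconds and femtoseconds as integer units are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Full-scale range of an n_cc-bit counter clocked with period t_clk_ps,
 * in picoseconds. */
uint64_t fsr_ps(unsigned n_cc, uint64_t t_clk_ps)
{
    assert(n_cc < 64);
    return ((uint64_t)1 << n_cc) * t_clk_ps;
}

/* Timestamp T = T_COARSE - T_FINE, in femtoseconds.
 * coarse  : counter value latched by the STOP event
 * fine_fs : calibrated interval between the event and the following clock event */
int64_t timestamp_fs(uint32_t coarse, uint64_t t_clk_fs, uint64_t fine_fs)
{
    return (int64_t)((uint64_t)coarse * t_clk_fs) - (int64_t)fine_fs;
}
```

For instance, `fsr_ps(32, 2400)` returns 10307921510400 ps, i.e. about 10.3 s, and `fsr_ps(8, 2500)` returns 640000 ps, i.e. 640 ns, in agreement with the X7S and X6S rows of Table 10.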

Multichannel TDC measurements are also increasingly used in many sectors, first of all in digital imaging. As Fig. 18 shows, the Nutt interpolation allows synchronizing $N_{CH}$ parallel channels. In this case, the synchronization is realized through $T_{CLK-TDC}$, which drives both the counters and the V–TDLs.

From an implementation point of view, the FSR is limited by the number of bits used for the counter ($N_{CC}$) and by $T_{CLK-TDC}$ (Table 5, column 2). These parameters change with the technological node; in particular, faster technologies allow more bits for the counter together with shorter clock periods (Table 10). Table 11 reports the area occupancy of the counter as a function of $N_{CC}$.

### **IV. EXPERIMENTAL MEASUREMENTS**

The proposed IP-Core has been validated on a 28–nm X7S Artix–7 200T FPGA. For this implementation the minimum TDC clock period is equal to 2.4 ns.

### A. AREA OCCUPANCY AND POWER CONSUMPTION

As a first test, we have verified the area occupancy and the power consumption of one single channel as a function of $F = E \cdot f_{OUT}$, implementing one, two, four, and eight TDLs respectively. Resources involved in the whole IP-Core implementation are summarized in Table 12.

**FIGURE 16.** Nutt-Interpolation block diagram. The abbreviation CC stands for coarse counter.

**FIGURE 17.** Timestamp generation. The abbreviation CC stands for coarse counter.

**TABLE 5.** Compliant values of  $N_R$ ,  $T_{CLK-TDC}$ ,  $\overline{t_P[n_R]}$  for different FPGA series. Also the value  $t_P^{MAX}[n_R]$  of the ultra-bin measured in realized implementations is reported.

| FPGA<br>series | $T_{CLK-TDC}$ | $\overline{t_p[n_R]}$ | $t_p^{MAX}[n_R]$ | $N_R$ | Tech.<br>Node |
|----------------|---------------|-----------------------|------------------|-------|---------------|
| X5S            | > 3.2 ns      | 34 ps                 | 78 ps            | > 94  | 65–nm         |
| X6S            | > 2.5 ns      | 25 ps                 | 55 ps            | > 100 | 40–nm         |
| X7S            | > 2.4 ns      | 16 ps                 | 50 ps            | > 150 | 28–nm         |
| XUS            | > 2.0 ns      | 10 ps                 | 50 ps            | > 200 | 20–nm         |

**Algorithm 2** Pseudo–Code for the Hardware Computation of LOG2

    /* Integer base-2 logarithm: floor(log2(n_TDL)); returns 0 for n_TDL <= 1. */
    unsigned int log2(unsigned int n_TDL) {
        unsigned int n = 0;
        while (n_TDL >>= 1)   /* shift right until no set bit remains */
            ++n;
        return n;
    }

**FIGURE 18.** Nutt-Interpolation block diagram. The abbreviation CC stands for coarse counter.

**TABLE 6.** Area occupancy, as a function of  $N_R$ , expressed as number of LUTs and FFs.

| (a) LOG2–DN |     |      |   | (b) LOG2–UP |     |      |
|-------|---------|------|---|-------------|-------|------|
| $N_R$ | LUT     | FF   |   | $N_R$       | LUT   | FF   |
| 8     | 7       | 11   |   | 8           | 7     | 13   |
| 16    | 17      | 20   |   | 16          | 17    | 21   |
| 32    | 37      | 37   |   | 32          | 37    | 38   |
| 64    | 100     | 70   |   | 64          | 102   | 71   |
| 128   | 133     | 135  |   | 128         | 134   | 136  |
| 256   | 262     | 264  |   | 256         | 264   | 265  |
| 512   | 306     | 512  |   | 512         | 457   | 520  |
| 1024  | 516     | 1033 |   | 1024        | 986   | 1032 |

**TABLE 7.** Area occupancy, number of LUTs and FFs, of the LOG2–DN/BEC–DN and LOG2–UP/BEC–UP, for X5S, X6S, X7S ( $N_R = 256$ ,  $N_{BL} = 4$ ) and XUS and XUS+ ( $N_R = 512$ ,  $N_{BL} = 16$ ).

| $N_R$ | $N_{BL}$ | Edge | LUT | FF  |
|-------|----------|------|-----|-----|
| 256   | 4        | DN   | 311 | 339 |
| 256   | 4        | UP   | 303 | 359 |
| 512   | 16       | DN   | 417 | 694 |
| 512   | 16       | UP   | 408 | 714 |

**TABLE 8.** Area occupancy, as a function of *F*, expressed as number of LUTs and FFs.

| (a) $N_R$-bit wide ($N_R = 256$) |     |    |   | (b) $N_R$-bit wide ($N_R = 512$) |     |    |
|---|-----|----|---|---|-----|----|
| F | LUT | FF |   | F | LUT | FF |
| 2 | 7   | 16 |   | 2 | 11  | 18 |
| 4 | 16  | 28 |   | 4 | 18  | 31 |
| 6 | 33  | 53 |   | 6 | 37  | 59 |
| 8 | 52  | 68 |   | 8 | 58  | 75 |
**TABLE 9.** Area occupancy of the calibrator, expressed as kib ( $2^{10}$  bits) of BRAM, number of LUTs, and FFs, considering  $K = 2^{16}$ , as function of  $N_V$ .

| $N_V$ | BRAM         | LUT | FF  |
|-------|--------------|-----|-----|
| 256   | 1.5x(32 kib) | 394 | 320 |
| 512   | 1.5x(32 kib) | 406 | 327 |
| 1024  | 1.5x(32 kib) | 419 | 334 |
| 2048  | 3x(32 kib)   | 423 | 341 |
| 4096  | 6x(32 kib)   | 488 | 348 |

**TABLE 10.** Dependency between maximum  $N_{CC}$ , minimum  $T_{CLK-TDC}$ , and FSR as a function of the FPGA families.

| Series | max. $N_{CC}$ | min. $T_{CLK-TDC}$ | FSR    | Tech. Node |
|--------|---------------|--------------------|--------|------------|
| X5S    | 16            | 3.2 ns             | 0.2 ms | 65–nm      |
| X6S    | 8             | 2.5 ns             | 640 ns | 40–nm      |
| X7S    | 32            | 2.4 ns             | 10.3 s | 28–nm      |
| XUS    | 32            | 2 ns               | 8.6 s  | 20–nm      |

The area occupancy defines the maximum number of channels implementable, which is limited to 16 in the selected device by the number of clock resources (BUFG) available.

**TABLE 11.** Area occupancy of the coarse counter as a function of  $N_{CC}$, expressed in terms of number of LUTs and FFs.

| $N_{CC}$ | LUT | FF |
|----------|-----|----|
| 4        | 2   | 4  |
| 8        | 1   | 8  |
| 16       | 1   | 16 |
| 24       | 1   | 24 |
| 32       | 6   | 32 |

**TABLE 12.** Single-channel area occupancy and power consumption in 28-nm X7S Artix7-200T FPGA reported by VIVADO. Note that the global resource quantities are different from the sum of the resources necessary for independently implemented single parts, due to synthesis optimization by the VIVADO design suite.

| $F = E \cdot f_{OUT}$ | FF   | LUT  | DSP | BRAM    | BUFG | Power   |
|-----------------------|------|------|-----|---------|------|---------|
| 1 x 1                 | 751  | 782  | 1   | 48 kib  | 1    | 10.9 mW |
| 2 x 1                 | 1071 | 1203 | 1   | 48 kib  | 1    | 11.8 mW |
| 2 x 2                 | 1890 | 1931 | 1   | 48 kib  | 1    | 12.5 mW |
| 2 x 4                 | 3738 | 3694 | 1   | 96 kib  | 1    | 13.8 mW |
| 2 x 8                 | 7294 | 7010 | 1   | 192 kib | 1    | 20.2 mW |

**TABLE 13.** Mean-bin  $(\overline{t_p})$ , ultra-bin  $(t_p^{MAX})$ , and precision in the selected 28-nm X7S Artix7-200T FPGA.

| $F = E \cdot f_{OUT}$ | $\overline{t_p}$ | $t_p^{MAX}$ | Prec.          |
|-----------------------|------------------|-------------|----------------|
| 1 x 1                 | 18 ps            | 50 ps       | 11.2 ps r.m.s. |
| 2 x 1                 | 10 ps            | 30 ps       | 9.4 ps r.m.s.  |
| 2 x 2                 | 5 ps             | 20 ps       | 8.3 ps r.m.s.  |
| 2 x 4                 | 2.5 ps           | 16 ps       | 8.0 ps r.m.s.  |
| 2 x 8                 | 1.2 ps           | 9 ps        | 7.9 ps r.m.s.  |
| 2 x 10                | 1.0 ps           | 9 ps        | 8.1 ps r.m.s.  |

### B. SUB-INTERPOLATION, RESOLUTION, AND PRECISION

To show that the best compromise between area occupancy and resolution is given by the SuperWU with $f_{OUT} = 4$, we can observe the improvement in the propagation delays of the virtual bins as a function of $f_{OUT}$, implementing one, two, four, eight, and ten TDLs respectively. A reduction of the quantization error due to the V–TDL means an increase in resolution, and a reduction of the mean bin and ultra–bin $(\overline{t_p}, t_p^{MAX})$. Moreover, we have estimated the single–shot channel precision, which is composed of the intrinsic jitter of the start/stop signal (~7 ps r.m.s.), the quantization error, and a further jitter proportional to $f_{OUT}$ introduced by the SuperWU algorithm [46].

As Table 13 summarizes, we found that, increasing $f_{OUT}$, area occupancy and power consumption increase, while the improvement in resolution and precision saturates around $f_{OUT} = 4$. Considering this evidence, we have chosen $f_{OUT} = 4$.

### C. CALIBRATION AND TEMPERATURE COMPENSATION

The bin-by-bin calibration algorithm guarantees linearity and consequently determines the LSB.

To claim a single-shot precision in the order of picoseconds r.m.s. (Table 13, column 4), a calibration length of $K = 2^{16}$ is mandatory. This corresponds to an LSB equal to 366 fs, as given by (7).

Furthermore, the implementation has demonstrated that the continuous updating mechanism of the CT allows compensating for temperature fluctuations with a maximum error of 286 fs/°C.

### D. DIFFERENTIAL AND INTEGRAL NON-LINEARITY

We have measured the differential (DNL) and integral (INL) non-linearity of the presented TDC over an FSR of 400 ns, obtaining DNL < 250 fs and INL < 2.5 ps. These values have been measured with a Code–Density Test (CDT), applying $N_{CDT} = 5 \cdot 10^9$ START/STOP signals distributed uniformly over the FSR. In this way, it is possible to compute the DNL errors as a function of $t \in [0; FSR]$ ($dnl[t]$),

$$dnl[t] = \frac{CDT[t] - \overline{CDT}}{N_{CDT}} \cdot FSR \tag{8}$$

and the INL errors as a function of $t \in [0; FSR]$ ($inl[t]$),

$$inl[t] = \sum_{\tau = 0}^{t} dnl[\tau] \tag{9}$$

Index *t* is the digital code at the output of the TDC multiplied by the LSB defined in (7).

The DNL and INL values correspond to the maximum of the functions dnl[t] and inl[t] respectively. Fig. 19 represents dnl[t] and inl[t].
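The computation of the two curves from a code-density histogram can be sketched as follows. This is an illustrative model of (8) and of the running sum that defines the INL, not the measurement software used for Fig. 19; it assumes that $N_{CDT}$ equals the total number of counts in the histogram, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* Compute dnl[t] as in (8) and inl[t] as its running sum.
 * cdt : code-density histogram with n entries
 * fsr : full-scale range, in the time unit wanted for the results
 * Returns max |dnl[t]|; max |inl[t]| is written to *inl_max. */
double dnl_inl(const uint32_t *cdt, size_t n, double fsr,
               double *dnl, double *inl, double *inl_max)
{
    double total = 0.0;               /* N_CDT, total number of counts */
    assert(n > 0);
    for (size_t t = 0; t < n; ++t)
        total += (double)cdt[t];
    assert(total > 0.0);

    double mean = total / (double)n;  /* average count per code */
    double acc = 0.0, dmax = 0.0, imax = 0.0;
    for (size_t t = 0; t < n; ++t) {
        dnl[t] = ((double)cdt[t] - mean) / total * fsr;
        acc += dnl[t];
        inl[t] = acc;
        if (abs_d(dnl[t]) > dmax) dmax = abs_d(dnl[t]);
        if (abs_d(inl[t]) > imax) imax = abs_d(inl[t]);
    }
    *inl_max = imax;
    return dmax;
}
```

A perfectly flat histogram gives zero for both curves; any code that collects more or fewer counts than the average shows up as a positive or negative DNL, and the INL tracks how these deviations add up along the range.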

# E. DEAD-TIME AND CHANNEL RATE

Finally, we report the measures of the minimum dead-time and the maximum channel rate achievable with the tested implementation.

For the maximum channel rate, we have connected the TDC to a START/STOP square wave signal, where the STOP is a delayed replica of the START. We have increased the frequency of the wave until errors in the measure of the delay between START and STOP occurred. In this way, we have found that the maximum channel rate is equal to 150 MHz.

For the minimum dead-time, we have generated only two consecutive START/STOP pulses. In this case, we have reduced the distance between the two pulses for as long as the measured value remained correct. A minimum dead-time of 5 ns resulted.

# V. COMPARISON AND RESULTS

Table 14 summarizes all the measurements performed, both on the last generation X7S and XUS Xilinx FPGAs selected for testing, i.e., 28–nm Artix–7 200T, Kintex–7 375T, 28–nm Zynq–7000 7020, and 20–nm Kintex UltraScale, and on past generations of Xilinx FPGAs, i.e., X5S 65–nm Virtex–5 70T and X6S 40–nm Spartan–6 45T. Although not all the tests performed with the selected device have been re-run with all the other devices, the information in the table is sufficiently exhaustive to provide a meaningful comparative frame.

From Table 14, referring to Artix–7, we can observe that all the X7S FPGAs (i.e., Kintex–7 and Zynq–7000) have almost the same performance. The reduction in performance of the Zynq–7000 is due to the fewer available resources, which prevent the implementation of accurate calibration and high-order sub-interpolation algorithms. Also in the Virtex–5 and Spartan–6 the performance is lower, particularly in terms of resolution, due to the limited hardware resources and the obsolete technology of these devices (65–nm and 40–nm respectively). Furthermore, the greater resource use of the multi-channel version of the TDC makes fewer resources available for implementing the calibration and high-order sub-interpolation algorithms, which are consequently less effective.

**FIGURE 19.** Differential and Integral non-linearity errors as a function of time.

**FIGURE 20.** Maximum channel rate, measured counts (blue dots), and fitting (red line).

| Performance             | Artix–7       | Virtex-5       | Spartan-6    | Kintex-7      | Zynq-7000    | Kintex UltraScale |
|-------------------------|---------------|----------------|--------------|---------------|--------------|-------------------|
| Resolution              | 366 fs        | 18 ps          | 25 ps        | 250 fs        | 2 ps         | 305 fs            |
| Precision               | 8.0 ps r.m.s. | 25.0 ps r.m.s. | 17 ps r.m.s. | 8.0 ps r.m.s. | 12 ps r.m.s. | 8.5 ps r.m.s.     |
| Full Scale Range        | 10.3 s        | 10.7 s         | 640 ns       | 10.3 s        | 10.7 s       | $10.2 \ \mu s$    |
| DNL                     | 250 fs        | N.A.           | N.A.         | 200 fs        | 1.4 ps       | N.A.              |
| INL                     | 2.5 ps        | N.A.           | N.A.         | 2.2 ps        | 5 ps         | N.A.              |
| Number of Channels      | 16            | 16             | 4            | 16            | 8            | 24                |
| Channel Rate            | 150 MHz       | 5 MHz          | N.A.         | 150 MHz       | 45 MHz       | 50 MHz            |
| Dead–Time               | 5 ns          | N.A.           | N.A.         | 5 ns          | 20 ns        | N.A.              |
| Temperature Sensitivity | 286 fs/°C     | N.A.           | N.A.         | N.A.          | N.A.         | N.A.              |

**TABLE 14.** Implementation results and measurements. In particular, the Artix–7 column refers to the device selected for testing the IP-Core.

Finally, the Kintex UltraScale provides good performance in terms of resolution, precision, number of channels, FSR, and channel rate. Unfortunately, the proposed IP-Core, designed for the X7S, cannot be migrated as is, but needs minor adaptations to fit the XUS technology node.

### **VI. CONCLUSION**

A completely engineered TDL–based TDC on FPGA suited for multi–channel implementation is proposed. Architectural details are analyzed with respect to different Xilinx FPGA families at different technological nodes, i.e. X5S (65–nm), X6S (40–nm), X7S (28–nm), and XUS (20–nm). The robustness of the different modules, i.e. TDL, interpolator, decoder, and calibrator, is thoroughly investigated from the theoretical and implementation points of view, reporting design rules and results obtained at different technological nodes.

The proposed IP-Core has been fully tested in a 28–nm X7S Artix7–200T FPGA. In this specific implementation, we have measured an LSB of 366 fs, with single–shot channel precision below 12 ps r.m.s., an FSR of up to several seconds, and DNL and INL below 250 fs and 2.5 ps respectively. Furthermore, a maximum channel rate of 150 MHz and a minimum dead–time of 5 ns have been demonstrated.

Moreover, the trade–off between area occupancy and achievable resolution offered by the SuperWU based interpolation and the effectiveness of the calibration mechanism have been highlighted. By way of example, sensitivity to temperature fluctuations of 286 fs/°C has been assessed.

### REFERENCES

- [1] E. Charbon, "Introduction to time-of-flight imaging," in *Proc. IEEE SENSORS*, Nov. 2014, pp. 610–613.
- [2] E. Venialgo, N. Lusardi, F. Garzetti, A. Geraci, S. E. Brunner, D. R. Schaart, and E. Charbon, "Toward a full-flexible and fast-prototyping TOF-PET block detector based on TDC-on-FPGA," *IEEE Trans. Radiat. Plasma Med. Sci.*, vol. 3, no. 5, pp. 538–548, Sep. 2019.
- [3] F. Garzetti, S. Salgaro, E. Venialgo, N. Lusardi, N. Corna, A. Geraci, and E. Charbon, "Plug-and-play TOF-PET module readout based on TDC-on-FPGA and gigabit optical fiber network," in *Proc. IEEE Nucl. Sci. Symp. Med. Imag. Conf. (NSS/MIC)*, Oct. 2019, pp. 1–4.
- [4] D. Li, M. Liu, R. Ma, and Z. Zhu, "An 8-ch LiDAR receiver based on TDC with multi-interval detection and real-time *in situ* calibration," *IEEE Trans. Instrum. Meas.*, vol. 69, no. 7, pp. 5081–5090, Nov. 2020.
- [5] R. Lussana, F. Villa, A. D. Mora, D. Contini, A. Tosi, and F. Zappa, "Enhanced single-photon time-of-flight 3D ranging," *Opt. Exp.*, vol. 23, no. 19, p. 24962, 2015.

- [6] D. Bronzi, Y. Zou, F. Villa, S. Tisa, A. Tosi, and F. Zappa, "Automotive three-dimensional vision through a single-photon counting SPAD camera," *IEEE Trans. Intell. Transp. Syst.*, vol. 17, no. 3, pp. 782–795, Mar. 2016.
- [7] R. J. Cotter, "Time-of-flight mass spectrometry for the structural analysis of biological molecules," *Anal. Chem.*, vol. 64, no. 21, pp. 1027A–1039A, Nov. 1992.
- [8] D. V. O'Connor, D. Phillips, and D. R. Phillips, *Time-Correlated Single Photon Counting*. London, U.K.: Academic, 1984.
- [9] W. Becker, *Advanced Time-Correlated Single Photon Counting Techniques*. Berlin, Germany: Springer, 2005, doi: 10.1007/3-540-28882-1.
- [10] N. Lusardi, J. W. N. Los, R. B. M. Gourgues, G. Bulgarini, and A. Geraci, "Photon counting with photon number resolution through superconducting nanowires coupled to a multi-channel TDC in FPGA," *Rev. Sci. Instrum.*, vol. 88, no. 3, Mar. 2017, Art. no. 035003, doi: 10.1063/1.4977594.
- [11] F. Garzetti, N. Lusardi, A. Geraci, E. Dobovicnik, G. Cautero, C. Dri, R. Sergo, and L. Stebel, "Fully FPGA-based and all-reconfigurable TDC for 3D (x, y, t) cross delay-line detectors," in *Proc. IEEE Nucl. Sci. Symp. Med. Imag. Conf. (NSS/MIC)*, Nov. 2018, pp. 1–3.
- [12] H. Chen and D. D.-U. Li, "Multichannel, low nonlinearity time-to-digital converters based on 20 and 28 nm FPGAs," *IEEE Trans. Ind. Electron.*, vol. 66, no. 4, pp. 3265–3274, Apr. 2019.
- [13] N. Lusardi, F. Garzetti, and A. Geraci, "Digital instrument with configurable hardware and firmware for multi-channel time measures," *Rev. Sci. Instrum.*, vol. 90, no. 5, May 2019, Art. no. 055113, doi: 10.1063/1.5028131.
- [14] E. Bayer, P. Zipf, and M. Traxler, "A multichannel high-resolution (<5 ps RMS between two channels) time-to-digital converter (TDC) implemented in a field programmable gate array (FPGA)," in *Proc. IEEE Nucl. Sci. Symp. Conf. Rec.*, Oct. 2011, pp. 876–879.
- [15] C. Favi and E. Charbon, "A 17 ps time-to-digital converter implemented in 65 nm FPGA technology," in *Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA)*, New York, NY, USA, 2009, pp. 113–120. [Online]. Available: http://doi.acm.org/10.1145/1508128.1508145
- [16] J. Wu, "An FPGA wave union TDC for time-of-flight applications," in *Proc. IEEE Nucl. Sci. Symp. Conf. Rec. (NSS/MIC)*, Oct. 2009, pp. 299–304.
- [17] F. Caponio, A. Abba, N. Lusardi, and A. Geraci, "A high-precision wave union TDC implementation in FPGA," in *Proc. IEEE Nucl. Sci. Symp. Med. Imag. Conf. (NSS/MIC)*, Nov. 2013, pp. 1–4.
- [18] J. Wu and Z. Shi, "The 10-ps wave union TDC: Improving FPGA TDC resolution beyond its cell delay," in *Proc. IEEE Nucl. Sci. Symp. Conf. Rec.*, Oct. 2008, pp. 3440–3446.
- [19] N. Lusardi and A. Geraci, "8-channels high-resolution TDC in FPGA," in *Proc. IEEE Nucl. Sci. Symp. Med. Imag. Conf. (NSS/MIC)*, Oct. 2015, pp. 1–2.
- [20] M.-A. Daigneault and J. P. David, "A novel 10 ps resolution TDC architecture implemented in a 130 nm process FPGA," in *Proc. 8th IEEE Int. NEWCAS Conf.*, Jun. 2010, pp. 281–284.
- [21] W. Pan, G. Gong, H. Li, and J. Li, "A 20-ps temperature compensated time-to-digital converter (TDC) implemented in FPGA," in *Proc. IEEE Nucl. Sci. Symp. Med. Imag. Conf. (NSS/MIC)*, Nov. 2013, pp. 1–6.
- [22] N. Lusardi and A. Geraci, "Comparison of interpolation techniques for TDCs implementation in FPGA," in *Proc. IEEE Nucl. Sci. Symp. Med. Imag. Conf. (NSS/MIC)*, Oct. 2015, pp. 1–2.

- [23] M. Fishburn, L. H. Menninga, C. Favi, and E. Charbon, "A 19.6 ps, FPGA-based TDC with multiple channels for open source applications," *IEEE Trans. Nucl. Sci.*, vol. 60, no. 3, pp. 2203–2208, Jun. 2013.
- [24] N. Dutton, J. Vergote, S. Gnecchi, L. Grant, D. Lee, S. Pellegrini, B. Rae, and R. Henderson, "Multiple-event direct to histogram TDC in 65 nm FPGA technology," in *Proc. 10th Conf. Ph.D. Res. Microelectron. Electron. (PRIME)*, Jun. 2014, pp. 1–5.
- [25] Q. Shen, S. Liu, B. Qi, Q. An, S. Liao, P. Shang, C. Peng, and W. Liu, "A 1.7 ps equivalent bin size and 4.2 ps RMS FPGA TDC based on multichain measurements averaging method," *IEEE Trans. Nucl. Sci.*, vol. 62, no. 3, pp. 947–954, Jun. 2015.
- [26] Y. Wang and C. Liu, "A nonlinearity minimization-oriented resource-saving time-to-digital converter implemented in a 28 nm Xilinx FPGA," *IEEE Trans. Nucl. Sci.*, vol. 62, no. 5, pp. 2003–2009, Oct. 2015.
- [27] J. Y. Won, S. I. Kwon, H. S. Yoon, G. B. Ko, J.-W. Son, and J. S. Lee, "Dual-phase tapped-delay-line time-to-digital converter with on-the-fly calibration implemented in 40 nm FPGA," *IEEE Trans. Biomed. Circuits Syst.*, vol. 10, no. 1, pp. 231–242, Feb. 2016.
- [28] J. Y. Won and J. S. Lee, "Time-to-digital converter using a tuned-delay line evaluated in 28-, 40-, and 45-nm FPGAs," *IEEE Trans. Instrum. Meas.*, vol. 65, no. 7, pp. 1678–1689, Jul. 2016.
- [29] Y. Wang and C. Liu, "A 3.9 ps time-interval RMS precision time-to-digital converter using a dual-sampling method in an UltraScale FPGA," *IEEE Trans. Nucl. Sci.*, vol. 63, no. 5, pp. 2617–2621, Oct. 2016.
- [30] M. Zhang, H. Wang, and Y. Liu, "A 7.4 ps FPGA-based TDC with a 1024-unit measurement matrix," *Sensors*, vol. 17, no. 4, p. 865, Apr. 2017.
- [31] X. Qin, L. Wang, D. Liu, Y. Zhao, X. Rong, and J. Du, "A 1.15-ps bin size and 3.5-ps single-shot precision time-to-digital converter with on-board offset correction in an FPGA," *IEEE Trans. Nucl. Sci.*, vol. 64, no. 12, pp. 2951–2957, Dec. 2017.
- [32] N. Lusardi, A. Geraci, J. Marjanovič, and M. Gustin, "High-resolution TDL-TDC system for MTCA.4 standard," in *Proc. IEEE Nucl. Sci. Symp., Med. Imag. Conf. Room-Temp. Semiconductor Detect. Workshop (NSS/MIC/RTSD)*, Oct. 2016, pp. 1–4.
- [33] N. Lusardi, F. Garzetti, A. Geraci, J. Marjanović, and M. Gustin, "Multichannel time-to-digital converter for MTCA.4 standard in FPGA," in *Proc. IEEE Nucl. Sci. Symp. Med. Imag. Conf. (NSS/MIC)*, Oct. 2017, pp. 1–4.
- [34] N. Lusardi, F. Garzetti, and A. Geraci, "Fully programmable system for multi-channel experiments targeting to time measurement at high performance," in *Proc. IEEE Nucl. Sci. Symp. Med. Imag. Conf. (NSS/MIC)*, Oct. 2017, pp. 1–5.
- [35] F. Garzetti, N. Lusardi, and A. Geraci, "All-digital fully-configurable instrument for multi-channel time measurements at high performance," in *Proc. IEEE Nucl. Sci. Symp. Med. Imag. Conf. Proc. (NSS/MIC)*, Nov. 2018, pp. 1–5.
- [36] N. Lusardi, F. Garzetti, M. A. Cibin, R. Sury, and A. Geraci, "Hardware and software co-design of a system-on-chip for real-time bidirectional transfer and processing of data from a time-to-digital converter," in *Proc. IEEE Nucl. Sci. Symp. Med. Imag. Conf. (NSS/MIC)*, Oct. 2017, pp. 1–6.
- [37] N. Corna, F. Garzetti, N. Lusardi, and A. Geraci, "System-on-chip Linux-based platform for high-performance time-to-digital conversion," in *Proc. IEEE Nucl. Sci. Symp. Med. Imag. Conf. (NSS/MIC)*, Nov. 2018, pp. 1–4.
- [38] N. Lusardi, F. Garzetti, N. Corna, R. D. Marco, and A. Geraci, "Very high-performance 24-channels time-to-digital converter in Xilinx 20-nm Kintex UltraScale FPGA," in *Proc. IEEE Nucl. Sci. Symp. Med. Imag. Conf. (NSS/MIC)*, Oct. 2019, pp. 1–4.
- [39] E. Arabul, S. Paesani, S. Tancock, J. Rarity, and N. Dahnoun, "A precise high count-rate FPGA based multi-channel coincidence counting system for quantum photonics applications," *IEEE Photon. J.*, vol. 12, no. 2, pp. 1–14, Apr. 2020.
- [40] X. Qin, W. Zhang, L. Wang, Y. Zhao, Y. Tong, X. Rong, and J. Du, "An FPGA-based hardware platform for the control of spin-based quantum systems," *IEEE Trans. Instrum. Meas.*, vol. 69, no. 4, pp. 1127–1139, Apr. 2020.
- [41] A. Aloisio, P. Branchini, R. Cicalese, R. Giordano, V. Izzo, and S. Loffredo, "FPGA implementation of a high-resolution time-to-digital converter," in *Proc. IEEE Nucl. Sci. Symp. Conf. Rec.*, vol. 1, Oct. 2007, pp. 504–507.
- [42] K.-J. Choi and D.-W. Jee, "Design and calibration techniques for a multichannel FPGA-based time-to-digital converter in an object positioning system," *IEEE Trans. Instrum. Meas.*, vol. 70, pp. 1–9, 2021.
- [43] X. Hu, L. Zhao, S. Liu, J. Wang, and Q. An, "A stepped-up tree encoder for the 10-ps wave union TDC," *IEEE Trans. Nucl. Sci.*, vol. 60, no. 5, pp. 3544–3549, Oct. 2013.

- [44] Z. Jaworski, "Verilog HDL model based thermometer-to-binary encoder with bubble error correction," in *Proc. 23rd Int. Conf. Mixed Design Integr. Circuits Syst. (MIXDES)*, Jun. 2016, pp. 249–254.
- [45] N. Lusardi, A. Abba, F. Caponio, and A. Geraci, "Quantization noise in non-homogeneous calibration table of a TCD implemented in FPGA," in *Proc. IEEE Nucl. Sci. Symp. Med. Imag. Conf. (NSS/MIC)*, Nov. 2014, pp. 1–5.
- [46] N. Lusardi, F. Garzetti, and A. Geraci, "The role of sub-interpolation for delay-line Time-to-Digital converters in FPGA devices," *Nucl. Instrum. Methods Phys. Res. A, Accel. Spectrom. Detect. Assoc. Equip.*, vol. 916, pp. 204–214, Feb. 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0168900218317479
- [47] Libraries Guide for Schematic Designs, Xilinx, San Jose, CA, USA, 2012.
- [48] UltraScale Architecture Configurable Logic Block, Xilinx, San Jose, CA, USA, 2017.
- [49] N. Lusardi, F. Garzetti, R. D. Marco, and A. Geraci, "Implementation issues of a high-performance multi-channel time-to-digital converter in Xilinx 20-nm UltraScale FPGAs," in *Proc. IEEE Nucl. Sci. Symp. Med. Imag. Conf. (NSS/MIC)*, Nov. 2018, pp. 1–4.
- [50] R. Pelka, J. Kalisz, and R. Szplet, "Nonlinearity correction of the integrated time-to-digital converter with direct coding," *IEEE Trans. Instrum. Meas.*, vol. 46, no. 2, pp. 449–453, Apr. 1997.
- [51] R. Nutt, K. Milam, and C. W. Williams, "Digital intervalometer," U.S. Patent 3 983 481 A, Aug. 4, 1975.



**FABIO GARZETTI** (Member, IEEE) received the bachelor's and master's degrees in electronic engineering from the Politecnico di Milano, where he is currently pursuing the Ph.D. degree with a focus on the improvement of the current technology for time-to-digital converters (TDCs) on FPGAs and the synchronization of multiple devices for a very high number of measurement channels on a distributed topology. He developed his thesis work with the Digital Electronics Laboratory, DEIB, on a topic regarding innovative solutions for calibration and triggering of asynchronous signals for TDCs in field-programmable gate arrays (FPGAs). In DEIB, he applied for the award of temporary research fellowships within the framework of the research program design of modules for readout and processing of sampled data based on FPGA architectures, supported by CAEN ELS. In 2020, he became an Associate Member of the Italian National Institute for Nuclear Physics (INFN).



**NICOLA CORNA** (Member, IEEE) was born in 1992. He received the bachelor's and master's degrees in electronics engineering from the Politecnico di Milano, in 2015 and 2018, respectively. He is currently pursuing the Ph.D. degree with DEIB, with a focus on the development of systems on FPGA and SoC reconfigurable devices, with particular interest in time-domain devices. Since 2020, he has been associated with the Italian National Institute for Nuclear Physics (INFN).

He is the author and developer of various open source projects. His research interest is free software.



**NICOLA LUSARDI** (Member, IEEE) was born in Piacenza, in November 1990. He received the Ph.D. degree in 2018.

He developed his thesis work with the Digital Electronics Laboratory, DEIB, with a focus on high-resolution time-to-digital converters (TDCs) in field-programmable gate arrays (FPGAs). He is currently a Temporary Researcher with DEIB, a Professor of electronics with the Politecnico di Milano, and associated with the Italian National Institute for Nuclear Physics (INFN). He has offered his research line and his knowledge as a digital designer to public and private research centers. Since 2014, he has collaborated with CERN in the LHCb experiment; CAEN S.p.A., Viareggio, Italy; CAEN ELS S.r.l., Basovizza, Italy; the Elettra Sincrotrone Trieste S.C.p.A., Basovizza, Italy; Single Quantum B.V., Delft, The Netherlands; the Delft University of Technology; the École Polytechnique Fédérale de Lausanne; and Rete Ferroviaria Italiana. He is a Co-Founder of TEDIEL S.r.l., an Italian start-up and spin-off of the Politecnico di Milano.



**ANGELO GERACI** (Senior Member, IEEE) received the M.Sc. degree (*cum laude*) in electrical engineering and the Ph.D. degree (*cum laude*) in electronics and communication engineering from the Politecnico di Milano, in 1993 and 1996, respectively. Since 2004, he has been an Associate Professor with the Department of Electronics, Information, and Bioengineering (DEIB), Politecnico di Milano, where his research activity is mainly focused on digital electronics based on microcontrollers, DSP, and FPGA devices, specifically in the areas of radiation detection, medical imaging, energy storage for automotive electric systems, and HPC applications. He is currently a Lecturer of the courses Sistemi Elettronici Digitali and Digital Electronic Systems Design with the School of Industrial and Information Engineering, Politecnico di Milano, and holds courses for the Ph.D. program in information technology. He is the author or coauthor of more than 320 publications in refereed international congress proceedings and journals. Since 1995, he has been a Scientific Collaborator of the Italian National Institute for Nuclear Physics (INFN). He is a member of the Directive Board and the Deputy Coordinator of the Ph.D. students at the School in Information Engineering, Politecnico di Milano. He is also an Auditor of projects on behalf of the Italian MIUR and a Referee for several international journals, including IEEE TRANSACTIONS ON NUCLEAR SCIENCE and Review of Scientific Instruments. He has been the proposer and the manager of several sponsored joint research projects between the Politecnico di Milano and private/public companies. He received the 2004 IEEE TRANSACTIONS Prize Paper Award from the IEEE Power Electronics Society. He has been a Senior Member of the IEEE Nuclear and Plasma Sciences Society since 2003.
