Abstract-This paper presents a low-nonlinearity, missing-code-free time-to-digital converter (TDC) implemented in a 28-nm field programmable gate array (FPGA) device (Xilinx Virtex 7 XC7V690T) with novel direct bin-width calibrations. We combine tuned tapped delay lines (TDLs) with a modified direct-histogram architecture to correct the nonuniformity originating from carry chains, and use a multiphase sampling structure to minimize the skews of clock routes. Code density test results show that the proposed TDC has much better linearity than previously published TDCs. Moreover, our TDC does not generate missing codes.
I. INTRODUCTION
TIME-TO-DIGITAL converters (TDCs) are required in many time-resolved applications due to their excellent timing resolution; they have been widely applied in space sciences [1]-[3], medical diagnosis and imaging [4]-[14], nuclear physics [15], [16], quantum communications [17]-[19], and time-of-flight detection [20]-[23].
TDCs are actually high-precision (several picoseconds) stopwatches that are capable of time-tagging fast events and generating corresponding digital codes. For example, TDCs in time-correlated single photon counting instruments [24] , [25] generate picosecond timestamps for photon events in fluorescence lifetime imaging microscopy (FLIM), fluorescence spectroscopy [9] - [13] , or time-resolved luminescence experiments for characterizing solid-state materials [26] . With rapid advances in CMOS and digital technologies, TDCs can be implemented in application-specific integrated circuits (ASIC) [14] , [20] , [27] - [34] or field programmable gate arrays (FPGA) [35] - [59] to achieve a subnanosecond resolution. Compared with FPGA-based TDCs, ASIC-based solutions usually have better precision and linearity [29] , [30] . However, they are more expensive and time consuming, usually more suitable for large-scale commercial products. On the other hand, FPGA TDCs provide greater flexibility with a shorter developing cycle for prototyping and verifications. FPGAs are reprogrammable, easy to access (low cost), and promising for product developments. Furthermore, recent advances in FPGAs have allowed tapped delay line (TDL) TDCs to achieve a resolution less than 20 ps [45] , [46] .
The simplest digital TDCs can be implemented by clock driven counters, but the time resolution of this type of TDCs is limited by the clock frequency [29] , [36] . To achieve a better resolution, vernier delay line (VDL) [35] , [43] and TDL methods [35] , [37] , [47] have been widely used. Furthermore, coarse and fine code methods and interpolation methods [43] , [50] have been proposed to achieve a larger measurement range with higher precision. Besides, the cyclic pulse shrinking [39] and dynamic reconfiguration methods [41] were proposed to explore the different FPGA-TDC architectures. Over the past few years, the TDL has become a mainstream method for FPGA-TDC implementations [35] , [46] , [47] , [51] - [58] . TDLs can be easily built using carry chains in most FPGA devices [37] . In a TDL, signals ("hits") with transitions (0-1 or 1-0) propagate along a carry chain, and they are sampled and registered by a clock at each tap.
Time intervals between the signal transitions and the rising edge of a sampling clock can be estimated through the registered codes. Thus, the resolution of a TDL-TDC is determined by the propagation speed of carry chains. In 1997, an FPGA-based VDL-TDC was reported [35] achieving 200 ps resolution. In 2006, Xilinx (Virtex-II) and Altera (ACEX 1K) FPGAs were used to achieve 69.4 and 112.5 ps, respectively [51]. Chen et al. implemented a 17-ps TDC in a 65-nm CMOS FPGA (Xilinx Virtex 5) in 2009 [47]. Fishburn et al. [46] used a 40-nm CMOS FPGA (Xilinx Virtex 6) to achieve 10 ps. For a raw TDL-TDC, the manufacturing process of the FPGA is the main factor that determines the resolution. Several methods have been presented to break this process-related limitation and achieve a better resolution, such as the wave union (WU) method [40], [53], [60] and the multichain averaging method [49], [52]. WU approaches, however, require extra binary converters and data processing units, whereas the multichain averaging method needs multichannel TDCs to serve one channel.
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/
The dead time is the shortest time interval between the end of a measurement and the start of the next one [36], and it has been studied by various research groups [38], [54], [55], [58]. To reduce the dead time and increase the conversion rate, multichannel TDCs [40], [46], [57], [61] are commonly applied. Dutton et al. [54] proposed a multiple-event TDC using a new direct-histogram architecture in a single TDL, which captures multiple events in each sampling period and reduces the dead time to less than 500 ps in Virtex 5 FPGA devices.
The nonlinearity of TDCs is usually quantified by differential nonlinearity (DNL) and integral nonlinearity (INL) and evaluated by statistical code density tests [36], [38]. The code density test feeds the TDC with a large number of hits randomly distributed in time, so that the number of hits collected in a single bin is proportional to the width of that code bin [56]. ASIC-based TDCs can achieve better linearity (DNL < 1 least significant bit (LSB) and INL < 1 LSB) by optimizing circuit design and layouts [29], [38]. The nonlinearity of FPGA-based TDCs is usually worse than that of ASIC-based solutions, due to the skews of sampling clocks and the nonuniformity of carry chains [36], [47], [58]. Won et al. [56] proposed a dual-phase method to reduce the nonlinearity originating from clock routes by placing two parallel TDLs in the central area of an FPGA. Downsampling or decimation methods can be used to reduce the nonuniformity of carry chains, but these methods also degrade the resolution [47], [61]. Wang and Liu proposed a way to reduce the nonuniformity of carry chains and the number of missing codes by realigning and reorganizing the output codes of TDLs [58]. In 2016, Won and Lee reported a tuned-TDL method and implemented it in Xilinx Kintex 7, Virtex 6, and Spartan 6 FPGAs [55]. Their work shows improvements in linearity and bin-width distributions, but it is unable to remove missing codes completely.
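The proportionality between hit counts and code bin widths can be illustrated with a short simulation. The following Python sketch is only an illustration under stated assumptions: the bin widths are made-up values, the function names `code_density_test` and `dnl_inl` are hypothetical, and the INL is taken as the running sum of the DNL, which is one common convention.

```python
import bisect
import random


def code_density_test(bin_widths_ps, n_hits, seed=0):
    """Count uniformly distributed random hits falling into each delay-line bin."""
    rng = random.Random(seed)
    # Right edge of every bin, measured from the start of the delay line.
    edges = []
    position = 0.0
    for width in bin_widths_ps:
        position += width
        edges.append(position)
    counts = [0] * len(bin_widths_ps)
    for _ in range(n_hits):
        t = rng.uniform(0.0, position)      # hit time, uniform over the line
        k = bisect.bisect_right(edges, t)   # first bin whose right edge exceeds t
        counts[min(k, len(counts) - 1)] += 1
    return counts


def dnl_inl(counts):
    """DNL[k] = H[k] / mean(H) - 1 (in LSB); INL is the running sum of the DNL."""
    mean = sum(counts) / len(counts)
    dnl = [c / mean - 1.0 for c in counts]
    inl = []
    running = 0.0
    for d in dnl:
        running += d
        inl.append(running)
    return dnl, inl


# Illustrative widths in ps (assumed values); the zero-width bin is a missing code.
example_widths = [10.0, 5.0, 15.0, 0.0, 20.0, 10.0]
example_counts = code_density_test(example_widths, 200_000)
example_dnl, example_inl = dnl_inl(example_counts)
```

In this sketch a zero-width bin collects no hits and therefore shows DNL = -1 LSB, which is how a missing code appears in a code density test.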
Calibrations of process, voltage, and temperature variations and nonlinearities are necessary [59] , [60] in FPGA-TDCs. Static nonlinearities caused by the nonuniformity of TDLs and clock distributions are commonly calibrated by applying on-line or off-line bin-by-bin calibration [37] , [59] .
In this paper, we make the following contributions for the first time.
1) We combine the tuned-TDL [55] and the modified direct-histogram architecture [54] to enhance the linearity of carry chains, completely remove missing codes, and significantly suppress the number of very-narrow (code bin width < 0.33 LSB) and very-wide (code bin width > 2 LSB) bins that appear in almost all FPGA-TDCs (see Section III).
2) We implement a multiphase sampling approach, inspired by dual-phase TDL-TDCs [56], to minimize the clock skews (and therefore enhance the linearity) and to lower the required clock frequency (from 476 to 159 MHz for Virtex 7 FPGAs).
3) We propose innovative on-line bin-width calibrations that need no additional processing time, using hardware-friendly weighted addends and bit-shifting operations.
4) We implement the proposed TDC in 28-nm Virtex 7 FPGAs.
Comparisons with previously reported TDCs are summarized in Table I. The proposed TDC clearly shows excellent linearity.
II. DESIGN AND ARCHITECTURE
We implemented the tuned-direct-histogram TDC in a 28-nm Xilinx Virtex-7 FPGA. The proposed TDC can be used to interface CMOS single-photon avalanche diodes (SPAD) for ranging, FLIM or positron emission tomography applications [7] . The output signals of CMOS SPADs [25] are compatible with the proposed TDC with very simple frontend circuitry converting the SPAD signals into digital ones. The dead time of a SPAD can range from several to tens of nanoseconds. Benefiting from a short dead time (<500 ps), the proposed TDC can serve multichannel SPADs for high-speed time-resolved spectroscopy applications [7] .
The architecture of the TDC is shown in Fig. 1. The 'start' port of a tuned-TDL is buffered by an inverter driven by the hit signal. The TDL is based on cascaded carry chain modules (called CARRY4 in Xilinx FPGAs) with modified outputs. The states along the TDL are registered by D flip-flops as thermometer codes (1111000... or 0000111...), which are converted to one-hot codes (0001000...) by the XOR-based edge detector to indicate the position of the transition. We applied the direct-histogram architecture; however, to enable the bin-width calibration, each bit of the one-hot code drives a synchronous counter instead of the ripple counter used in Dutton's original design [54]. Fig. 1(b) shows that each synchronous counter has multiple count registers (selected according to the coarse code) to extend the measurement range. With this arrangement, the counters can be further employed for the novel bin-width calibrations detailed in Section II-D. In order to cover the sampling clock period and reduce the length of the TDL simultaneously, we propose a multiphase sampling architecture, Fig. 1(c), extended from the dual-phase method. The histogram data are stored in the registers of the counters, buffered in block RAM, and transferred to a PC via an on-board universal asynchronous receiver/transmitter (UART) module.
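The thermometer-to-one-hot conversion can be modeled in a few lines of software. The sketch below is a behavioral illustration only, not the hardware description: the function name `xor_edge_detect` is hypothetical, and replicating the last tap so that the output keeps the same number of bits is an assumption chosen to reproduce the example codes quoted above.

```python
def xor_edge_detect(taps):
    """XOR every registered tap with its neighbour to mark signal transitions.

    taps is a list of 0/1 values sampled along the delay line.
    Assumption: the last tap is replicated so the output has len(taps) bits.
    """
    padded = taps + [taps[-1]]
    return [padded[i] ^ padded[i + 1] for i in range(len(taps))]


def from_string(code):
    """Helper: turn a string such as '1111000' into a list of bits."""
    return [int(ch) for ch in code]


def to_string(bits):
    """Helper: turn a list of bits back into a string."""
    return "".join(str(b) for b in bits)


# Both transition polarities map to the same one-hot position.
falling = to_string(xor_edge_detect(from_string("1111000")))
rising = to_string(xor_edge_detect(from_string("0000111")))
```

Because the XOR only looks at neighbouring taps, the same detector handles 1-0 and 0-1 transitions without a polarity flag.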
A. CARRY4 Nonlinearity and Tuned-TDL
As Fig. 1 shows, each CARRY4 module includes four cascaded carry elements, each with a direct output and an XORed output (labeled as 'C' and 'S' ports, respectively). The first and last elements in a CARRY4 have 'cin' and 'cout' ports, respectively, for connecting to adjacent CARRY4s. CARRY4 modules provide fast propagation, but the large nonuniformity between delay taps leads to poor linearity. Most previously published FPGA-TDCs used four C-type ports ('CCCC'). The sampling patterns of delay elements were described in Won's recent work [55], and large nonuniformity can be observed when certain patterns of output ports are used (such as 'CCCC' or 'SSSS'). The principle and theoretical justification are given in that work, which suggests that C- and S-type outputs should be used alternately as 'SCSC' to obtain better performance (for Xilinx Kintex-7 and Virtex-6 devices). We also tested different patterns, and 'SCSC' provides the best performance for our Xilinx Virtex-7 device (XC7V690T) as well.
B. Missing Codes and Modified Direct-Histogram Architecture
Due to the nonuniformity of carry chains in a TDL, "bubbles" (1110100... or 00010111...) are generated in thermometer codes after sampling. Traditional TDL-TDCs convert one-hot codes to binary codes, so bubbles must be removed by bubble-proof circuits [47], [60]. However, after the bubble removal, missing codes (DNL ≤ −0.9) appear because some taps are not able to detect enough hit events [58]. Some research groups proposed the bin realignment and tuned-TDL methods to reduce missing codes, but these methods are not able to remove missing codes completely.
The direct-histogram architecture [54] used in this paper does not convert thermometer codes to binary codes, and the bubbles are counted into the histogram on purpose to remove the missing codes. When bubbles appear in thermometer codes, multihot codes (0011100) are generated by the XOR edge detector, and the missing codes are compensated and filled up by the multihot codes. Dutton's direct-histogram design does not have missing codes; however, its linearity is not satisfactory. The proposed TDC combines the direct-histogram with the tuned-TDL, not only removing the missing codes completely, but also greatly enhancing the linearity (see Section III-C). Although bubbles introduce errors, our study shows that these errors are static and can be corrected easily by bin-width calibrations. Our study also shows that missing codes have a more dominant effect on the linearity of a TDC. In addition, benefiting from the direct-histogram architecture, multiple events can be recorded simultaneously by the same TDL, and the dead time is reduced to only hundreds of picoseconds [54].
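How bubbles end up filling neighbouring bins can be seen in a small behavioral model. This is a sketch under assumptions rather than the implemented logic: the names `hot_bits` and `accumulate` are hypothetical, the last tap is replicated to keep the code length, and the three sampled codes are invented for illustration.

```python
def hot_bits(taps):
    """XOR neighbouring taps; the last tap is replicated (an assumption)."""
    padded = taps + [taps[-1]]
    return [padded[i] ^ padded[i + 1] for i in range(len(taps))]


def accumulate(histogram, sampled_code):
    """Let every hot bit increment its own counter, bubbles included."""
    taps = [int(ch) for ch in sampled_code]
    for k, bit in enumerate(hot_bits(taps)):
        histogram[k] += bit
    return histogram


# Three illustrative samples; the middle one contains a bubble.
histogram = [0] * 7
for code in ["1111000", "1110100", "1110000"]:
    accumulate(histogram, code)
```

In this model the bubbled sample 1110100 yields the multihot code 0011100 and increments three adjacent counters at once, so a tap that would otherwise stay empty still receives counts. No thermometer-to-binary encoder or bubble-removal stage is involved.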
C. Clock Distribution Routes and Multiphase Architecture
FPGA chips have well-designed clock routes and different clock regions (CRs), as shown in Fig. 2, to reduce clock skews [62]. In order to optimize the linearity of an FPGA-TDC, clock skews have to be considered carefully. The clock signal is delivered to a dedicated global buffer (BUFG) in the center of the FPGA chip. The clock signals then spread to the upper and lower parts of the chip along two vertical routes and branch into horizontal subroutes (at the middle node of each CR). The direction of the TDL is vertical, and a large skew exists between two delay cells that are located at the boundary of two adjacent CRs.
The length of the TDL (N, the number of bins in a TDL) should be able to cover at least one period of the sampling clock, LSB × N ≥ τ (where LSB is the average code bin width and τ is the period of the sampling clock); otherwise, the TDC cannot capture events completely [56]. In Virtex 7 devices, the maximum clock frequency (depending on the speed grade) ranges from 450 MHz (2.22 ns) to 600 MHz (1.67 ns) [63]. Devices operating at higher clock frequencies are more prone to timing errors. On the other hand, TDLs have larger nonlinearity when they cross the boundaries of CRs. To avoid crossing CR boundaries, the length and the location of a TDL should be controlled properly. It is hard to achieve a short TDL and use a high clock frequency simultaneously if a single TDL is used. Won et al. [56] proposed a dual-phase method to reduce the length of the TDL and to allow the TDC to operate at a lower clock frequency. The dual-phase method uses two parallel TDLs sampled by two clocks with 0° and 180° phases, respectively, and therefore each TDL only needs to cover half of the clock period. In general, LSB × N ≥ τ/N_phase, where N_phase is the number of phases. The number of phases does not influence the linearity of the TDC directly. A larger number of phases lowers the required clock frequency, but it also increases the system complexity. There is a trade-off, and the number of phases is selected according to the device or system specifications. After performing full-length TDL tests, which confirmed that dual-phase sampling causes more timing errors, we used the multiphase architecture with three sampling phases. Three parallel TDLs are sampled by three clock signals with 0°, 120°, and 240° phase shifts, respectively, and each TDL covers one-third of the clock period. The timing diagrams of the triple-phase architecture are shown in Fig. 3.
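The coverage condition LSB × N ≥ τ/N_phase translates directly into a lower bound on the sampling clock frequency. The short Python sketch below works through that arithmetic; the function name `min_clock_mhz` is hypothetical, and the inputs (an average bin width of 10.5 ps and 200 bins per TDL) are the values reported in Section III-A.

```python
def min_clock_mhz(lsb_ps, n_bins, n_phases=1):
    """Lowest sampling clock frequency (MHz) whose period is still covered.

    Coverage requires tau <= LSB * N * N_phase, so f_min = 1 / (LSB * N * N_phase).
    """
    covered_period_ps = lsb_ps * n_bins * n_phases
    return 1e6 / covered_period_ps  # a period of 1 ps corresponds to 1e6 MHz


single_phase = min_clock_mhz(10.5, 200, n_phases=1)  # one 200-bin TDL
triple_phase = min_clock_mhz(10.5, 200, n_phases=3)  # three 200-bin TDLs
```

With these inputs the single-TDL case needs roughly 476 MHz, while three phases relax the requirement to roughly 159 MHz, in line with the figures quoted in the contributions.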
D. Bin Width Calibration
The bin-by-bin calibration method has been widely used to enhance the linearity of TDCs [57], [58], [60]. This method can be summarized as (1), where the calibrated time of bin $n$, $t_n$, is derived as

$$t_n = \frac{W_n}{2} + \sum_{k=0}^{n-1} W_k \tag{1}$$

where $W_n$ and $W_k$ are the code bin widths of code bins $n$ and $k$. The effect of applying (1) is equivalent to discarding all missing codes. However, removing missing codes also reduces the number of effective bins [58]. Equation (1) does not provide a significant improvement even after calibration according to (3).

Another simple calibration approach can be easily derived from the definition of the DNL [54]. Different from the original design for post calibration, we propose a new strategy to allow on-line calibration, denoted as bin-width calibration in this paper. Because the count in a histogram bin is proportional to the code bin width, the DNL is related to the actual count of code bin $k$, $H[k]$, and the ideal count, $\bar{H}$, as

$$\mathrm{DNL}[k] = \frac{W_k - Q}{Q} = \frac{H[k] - \bar{H}}{\bar{H}}, \qquad \bar{H} = \frac{1}{N}\sum_{j=0}^{N-1} H[j]$$

where $Q$ is the ideal code bin width in a code density test, and $N$ is the number of bins in a TDL.
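The bin-by-bin calibration discussed at the start of this subsection is easy to express in software. The sketch below assumes the commonly used form in which the calibrated time of bin n is the sum of all preceding bin widths plus half of the width of bin n, and it assumes that the widths are estimated from code density counts as W_k = τ·H[k]/ΣH. The function names and the example counts are hypothetical.

```python
def widths_from_counts(counts, period_ps):
    """Estimate code bin widths from code density counts: W_k = tau * H[k] / sum(H)."""
    total = sum(counts)
    return [period_ps * c / total for c in counts]


def bin_centre_times(widths):
    """Calibrated time of bin n: all preceding widths plus half of its own width."""
    times = []
    elapsed = 0.0
    for w in widths:
        times.append(elapsed + w / 2.0)
        elapsed += w
    return times


# Illustrative counts (assumed); the second bin is a missing code.
example_widths = widths_from_counts([1000, 0, 2000, 1000], period_ps=40.0)
example_times = bin_centre_times(example_widths)
```

A zero-width bin adds nothing to the running sum, which illustrates why applying this calibration is equivalent to discarding the missing codes.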
To perform (4), TDCs should not contain any missing codes, and therefore most FPGA-based TDCs with missing codes are not able to apply (4) 
2) right shifting the accumulator by M bits to obtain
is stored in an I-bit register. The advantages of this method, (5) and (6), are: 1) it is extremely easy to implement and 2) no postprocessing is needed. The disadvantage is that more resources are required for a larger M.
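The idea of calibrating with weighted addends followed by an M-bit right shift can be illustrated with a software model. This is a sketch under assumptions, not the implemented hardware: the weight formula (the ratio of the mean calibration count to the calibration count of each bin, scaled by 2^M and rounded) is assumed here, and the names `weighted_addends` and `calibrated_histogram` are hypothetical.

```python
def weighted_addends(cal_counts, m_bits):
    """Integer addend per bin, assumed to be round(2**M * mean(H) / H[k]).

    A bin with zero calibration counts (a missing code) has no finite weight,
    which is why this kind of calibration needs a missing-code-free TDC.
    """
    if min(cal_counts) <= 0:
        raise ValueError("calibration requires every bin to collect hits")
    mean = sum(cal_counts) / len(cal_counts)
    return [round((mean / c) * (1 << m_bits)) for c in cal_counts]


def calibrated_histogram(hit_bins, addends, m_bits):
    """Add the bin's addend for every hit, then right shift by M bits on readout."""
    accumulators = [0] * len(addends)
    for k in hit_bins:
        accumulators[k] += addends[k]
    return [acc >> m_bits for acc in accumulators]


# Illustrative calibration run (assumed counts) with M = 5.
cal_counts = [1000, 500, 2000, 500]
addends = weighted_addends(cal_counts, m_bits=5)

# A measurement with the same nonuniformity flattens out after calibration.
hits = [0] * 1000 + [1] * 500 + [2] * 2000 + [3] * 500
flat = calibrated_histogram(hits, addends, m_bits=5)
```

Only an integer addition per hit and a fixed shift are involved, so no division or postprocessing step is needed at run time. In this model a larger M means wider addends and accumulators.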
III. EXPERIMENTS AND RESULTS
To evaluate the proposed TDC, code density tests were performed. A raw-TDL and a tuned-TDL were tested with the traditional and the modified direct-histogram architectures, respectively (four combinations). The bin-width calibration method was tested and discussed as well. Two independent low-jitter crystal oscillators (DSC1103) were used as the signal sources for code density tests. The temperature and operating voltage on the FPGA chip were maintained within a stable range (temperature: 30.1°C ± 0.3°C, voltage: 0.995 V ± 0.002 V).
A. Full-Length TDL Test
In order to determine the length and location of the TDL, a full-length TDL with 2000 bins (from bin 0 to bin 1999) was tested. The TDL fully covers a column of slices in the FPGA chip and crosses ten CRs, as shown in Fig. 2. The DNL plots in Fig. 4 show large nonlinearity (DNL > 2 LSB) at the boundaries of CRs (at bins 200, 400, 600, 800, 1200, 1400, 1600, and 1800), except at the boundary (bin 1000) between the two central CRs (CRX1Y4: bin 800 to bin 999; CRX1Y5: bin 1000 to bin 1199) shown in Fig. 2. The reason for the exception is that these two CRs are symmetrical in terms of the clock routing. At bin 1100 (corresponding to Node B, In order to minimize nonlinearity, the length of the single TDL is set to 200 bins (from bin 900 to bin 1100). In the Virtex-7 FPGA, the average code bin width is 10.5 ps. A single TDL with 200 bins has a propagation delay of 2.1 ns. Three parallel TDLs were implemented for the proposed multiphase method in the two central CRs (X1Y4 and X1Y5). Each TDL only covers one-third of the clock period. With this arrangement, the minimum frequency of the sampling clock signal can be reduced to 159 MHz.
B. Linearity Tests
In this paper, we compared the direct-histogram architecture with some traditional architectures. The tested TDLs are located between Slice-X106Y225 and Slice-X106Y274 (50 Slices, 200 bins). The output pattern (as 'CCCC') of the 
C. Bin Width Distribution
The integration of the tuned-TDL and the direct-histogram architecture also significantly improves the uniformity of the code bin widths. According to the bin-width distributions shown in Fig. 6, the traditional raw-TDL generates a large number of missing codes and shows poor uniformity (σ_bin-width = 12.60 ps). Even with the direct-histogram architecture applied to the raw-TDL, the improvement (σ_bin-width = 6.40 ps) is not significant, and very-wide (code bin width > 2 LSB) and very-narrow (code bin width < 0.33 LSB) bins still exist although missing codes are removed. In Fig. 6(b), the tuned-TDL improves the distribution of the bin width (σ_bin-width = 5.98 ps) and reduces the number of missing codes. Combining the direct-histogram and the tuned-TDL, however, not only significantly improves the distribution of the bin width (σ_bin-width = 2.10 ps), but also completely removes the missing codes and reduces the numbers of very-narrow and very-wide bins.
D. Equivalent Bin Width and Equivalent Standard Deviation
The equivalent bin width w_eq and the equivalent standard deviation σ_eq were proposed by Wu for assessing the linearity performance of TDCs [64], defined as
Applying (7) and (8) to raw and tuned TDLs (with traditional or with direct-histogram architectures), w_eq and σ_eq are summarized in Table III. The proposed TDC shows the best results with w_eq = 11.15 ps and σ_eq = 3.22 ps before calibration.
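For readers who want to reproduce such figures from measured bin widths, the following Python sketch computes both quantities. It assumes the commonly cited form of these definitions, in which each bin contributes its quantization variance W_i²/12 weighted by its hit probability W_i/W_total, and w_eq = σ_eq·√12. This form is an assumption here, and the function name `equivalent_bin_metrics` is hypothetical. The reported pair (11.15 ps, 3.22 ps) is consistent with the √12 relation.

```python
import math


def equivalent_bin_metrics(widths):
    """Return (w_eq, sigma_eq) for a list of code bin widths.

    Assumed definitions:
      sigma_eq**2 = sum((W_i**2 / 12) * (W_i / W_total))
      w_eq        = sigma_eq * sqrt(12)
    """
    total = sum(widths)
    variance = sum((w * w / 12.0) * (w / total) for w in widths)
    sigma_eq = math.sqrt(variance)
    return sigma_eq * math.sqrt(12.0), sigma_eq


# Perfectly uniform bins: the equivalent bin width equals the actual bin width.
uniform_w_eq, uniform_sigma_eq = equivalent_bin_metrics([10.0] * 4)

# Nonuniform bins with the same mean width give a larger equivalent bin width.
skewed_w_eq, skewed_sigma_eq = equivalent_bin_metrics([5.0, 15.0])
```

Under these assumed definitions wide bins are penalized twice, once through their larger quantization error and once through their higher hit probability. This is why nonuniform delay lines have an equivalent bin width above the average bin width.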
E. Hardware Bin-Width Calibration
The proposed TDC has no missing codes, and therefore, from (3), the bin-width calibration can be implemented in (5), without extra processing time. A larger M leads to a better calibration for the code bin width, but M > 5 does not improve much further. Fig. 7 and Table IV 
F. Time Interval Measurements
To verify the linearity of the proposed TDL, a programmable delay generator called IDELAYE2 in Virtex-7 FPGAs was used to generate known time intervals [65]. The delay of each tap in the IDELAYE2 was continuously calibrated by an IDELAYCTRL module based on a low-jitter reference clock. The tap delay is 39 ± 5 ps per tap when the reference clock runs at 400 MHz [63]. Furthermore, the external jitter and error are minimized since the time intervals are generated in the FPGA chip and sent to the TDC directly. A copy of the sampling clock was delayed by the IDELAYE2 module and connected to the 'start' port of a single-channel TDL. By controlling the tap value of the IDELAYE2, time intervals from 1244 to 2464 ps in steps of around 38.1 ps were provided and measured by both the uncalibrated and calibrated TDCs. The time intervals were also measured by a commercial time-correlated single-photon counting (TCSPC) module (PicoQuant PicoHarp 300, with 4 ps resolution and DNL < 5% peak, < 1% rms). Each measurement captured more than 100 000 samples, and the time intervals were calculated based on the histogram. The measurement results and the differences between the measured and expected values for uncalibrated and calibrated tuned-TDLs are shown in Fig. 8. The standard deviations of the measurements were calculated from these differences, and they are 5.11 and 4.42 ps for the uncalibrated and calibrated TDCs, respectively.
IV. CONCLUSION
We integrate, for the first time, the tuned-TDL and the modified direct-histogram architecture, based on a multiphase architecture, to implement a low-nonlinearity, missing-code-free TDC with fast bin-width calibration in FPGAs. The unique advantages are as follows.
1) The synergistic effects brought by this combination are significant in suppressing the nonuniformity according to the tested DNL and INL, measurement deviations, the equivalent bin widths and their standard deviations. Moreover, the missing codes are completely removed.
2) The multiphase method extended from the dual-phase method provides extra design flexibility to minimize the nonlinearity from clock route skews and to lower the timing requirements for the clock frequency simultaneously.
3) Based on the direct-histogram architecture and the missing-code-free feature, a novel bin-width calibration method can be applied, and the performance was presented and evaluated. DNL_pk-pk and INL_pk-pk after calibration are reduced to 0.08 and 0.13 LSB, respectively. σ_DNL and σ_INL decrease to 0.10 and 0.21 ps, respectively.
In summary, a new design concept for FPGA-TDCs is presented and evaluated in this paper.
In the previously published literature, traditional thermometer-to-binary architectures have been fully studied and their limitations clearly evaluated. The newly proposed direct-histogram architecture has not yet gained enough attention and has not been widely applied. This paper shows that the direct-histogram architecture can be applied to tuned-TDLs to achieve low-nonlinearity, missing-code-free TDCs with direct bin-width calibrations, providing distinct advantages over traditional methods. However, resource consumption is the main drawback of this architecture and should be noted. In the future, we will continue to investigate solutions to this drawback.
