

## Strategies Towards High Performance (High-Resolution/Linearity) Time-to-digital Converters on Field-programmable Gate Arrays

Thesis submitted by

Wujun Xie

Strathclyde Institute of Pharmacy and Biomedical Sciences

University of Strathclyde

This thesis is submitted for the degree of Doctor of Philosophy in Pharmacy and Biomedical Sciences

## **Declaration of Work**

This thesis is the result of the author's original research. It has been composed by the author and has not been previously submitted for examination which has led to the award of a degree.

The copyright of this thesis belongs to the author under the terms of the United Kingdom Copyright Acts as qualified by University of Strathclyde Regulation 3.50.

Due acknowledgement must always be made of the use of any material contained in, or derived from, this thesis.

Signed: Wujun Xie. Date: 2023/July/07

## Abstract

Time-correlated single-photon counting (TCSPC) technology has become popular in scientific research and industrial applications, such as high-energy physics, bio-sensing, non-invasive health monitoring, and 3D imaging. Because of the increasing demand for high-precision time measurements, time-to-digital converters (TDCs) have attracted attention since the 1970s. As a fully digital solution, TDCs are portable and have great potential for multichannel applications compared to bulky and expensive time-to-amplitude converters (TACs).

A TDC can be implemented in ASIC or FPGA devices. Owing to their low cost, flexibility, and short development cycle, FPGA-TDCs have become a promising option. Following a literature review, this thesis introduces three original FPGA-TDCs with outstanding performance. The first design is the first efficient wave union (WU) based TDC implemented in Xilinx UltraScale (20 nm) FPGAs, using a bubble-free sub-TDL structure. Combined with other existing methods, its resolution is further enhanced to 1.23 ps. The second TDC is designed for LiDAR applications, especially in driverless vehicles. Using the proposed new calibration method, the resolution is adjustable (50, 80, and 100 ps), and the linearity is exceptionally high ($DNL_{pk-pk}$ and $INL_{pk-pk}$ are lower than 0.05 LSB). In addition, a software tool with a graphical user interface (GUI) has been open-sourced to predict TDCs' performance. In the third TDC, an onboard automatic calibration (AC) function is realized by exploiting Xilinx ZYNQ SoC architectures. The test results show the robustness of the proposed method. By removing the need for manual calibration, the AC function enables FPGA-TDCs to be applied in commercial products where mass production is required.

## Acknowledgements

I would like to express my gratitude to everyone who helped me in the past four years. PhD training is a long journey. Without their support, I could not have completed this PhD project successfully.

First, I would like to thank my academic supervisor, Dr David Day-Uei Li. He is a brilliant researcher. He supervised and guided me through my PhD project, and when I faced challenges, he was always willing to support and encourage me. He was the first to tell me that I was an excellent student and could become a qualified researcher after the training. Without his encouragement, I might not have been productive and published 3 papers in IEEE Trans. Ind. Electron. He is also a supportive friend. Without his moral support, I could not have completed my studies during COVID-19. Also, I would like to thank the University of Strathclyde for providing funding and resources. The University of Strathclyde offered me the lab space and equipment to conduct my research and is also a great platform to network with other researchers worldwide.

Second, I would like to thank my parents for their selfless love. My PhD study became possible because of their emotional and financial support. They are the light in my life. My family has faced many challenges in the past years. However, they still provided their strongest support and gave me the faith that I could complete my PhD project.

Third, to my naughty cat, Dobbie. During the year-long lockdown, he was my lifesaver. He is a playful boy and a troublemaker. He really relieved me of the continuous pressure. Meanwhile, I would like to thank my friends Yu Wang, Zhenya Zang, Tianyu Jiang, Miaomiao Chang, Yifei Wan, and Hengqing Tian. They are fiercely loyal friends and shared my happiness and sadness over the past four years.

Fourth, to Dr Haochang Chen. He is an innovative and talented researcher. Without his guidance, I could not have started my project smoothly. Besides, I would like to thank the various researchers and peers, including the researchers I met and held discussions with and the reviewers of the papers I submitted. They shared their valuable ideas and comments, which guided me in conducting my project.

## Contents

| Section    | Title                                                       |
|------------|-------------------------------------------------------------|
|            | Abstract                                                    |
|            | Acknowledgements                                            |
|            | Contents                                                    |
|            | List of Figures                                             |
|            | List of Tables                                              |
|            | Abbreviations                                               |
| Chapter 1. | Introduction                                                |
| 1.1.       | Background and motivation                                   |
| 1.1.1.     | Time-to-digital converters and time-of-flight measurements  |
| 1.1.2.     | Applications                                                |
| 1.2.       | Research aim                                                |
| 1.3.       | List of Contributions                                       |
| 1.4.       | Outline of the thesis                                       |
| Chapter 2. | Literature Review                                           |
| 2.1.       | Hardware platform                                           |
| 2.1.1.     | Application-specific integrated circuit (ASIC)              |
| 2.1.2.     | Field-programmable gate array (FPGA)                        |
| 2.1.3.     | Setup time, hold time, and metastability                    |
| 2.2.       | Performance parameters                                      |
| 2.2.1.     | Resolution                                                  |
| 2.2.2.     | Linearity                                                   |
| 2.2.3.     | Precision and accuracy                                      |
| 2.2.4.     | Measurement range                                           |
| 2.2.5.     | Deadtime                                                    |
| 2.3.       | Methods for time measurements                               |
| 2.3.1.     | Direct counting                                             |
| 2.3.2.     | Interpolation                                               |
| 2.4.       | Challenges and developments                                 |
| 2.4.1.     | Challenges                                                  |
| 2.4.2.     | Structures for FPGA-TDCs                                    |
| 2.4.3.     | De-bubble method                                            |
| 2.4.4.     | Calibration methods                                         |
| 2.5.       | Summary                                                     |
| Chapter 3. | High resolution wave union TDC in 20 nm FPGA                |
| 3.1.       | Motivation                                                  |
| 3.2.       | Architecture and method                                     |
| 3.2.1.     | CARRY8s and dual-sampling structures                        |
| 3.2.2.     | Bubble errors introduced by the wave union method           |
| 3.2.3.     | Wave union methods with the sub-TDL structure               |
| 3.2.4.     | Encoding and Compensation Strategies                        |
| 3.2.5.     | Dual-sampling wave union TDC with Sub-TDL                   |
| 3.2.6.     | Binned dual-sampling wave union TDC                         |
| 3.3.       | Experimental tests                                          |
| 3.3.1.     | Experimental setup                                          |
| 3.3.2.     | Linearity tests                                             |
| 3.3.3.     | Time-interval tests                                         |
| 3.4.       | Theoretical analysis                                        |
| 3.5.       | Logic resource consumption                                  |
| 3.6.       | Summary                                                     |
| Chapter 4. | High-linearity TDC for LiDAR applications                   |
| 4.1.       | Motivation                                                  |
| 4.2.       | Architecture and method                                     |
| 4.2.1.     | Architecture                                                |
| 4.2.2.     | Distortions caused by mixed-calibration methods             |
| 4.2.3.     | Mixed binning method with resolution adjustments            |
| 4.3.       | Software prediction                                         |
| 4.4.       | Hardware implementation and test results                    |
| 4.4.1.     | Linearity                                                   |
| 4.4.2.     | Precision                                                   |
| 4.5.       | Multichannel design                                         |
| 4.6.       | Comparison                                                  |
| 4.7.       | Summary                                                     |
| Chapter 5. | Automatic calibration in ZYNQ-based Structures              |
| 5.1.       | Motivation                                                  |
| 5.2.       | ZYNQ architecture and TDC structure                         |
| 5.2.1.     | CARRY4 and tuned-TDL architecture                           |
| 5.2.2.     | Weighted histogram calibration method                       |
| 5.2.3.     | Communications between PL and PS                            |
| 5.3.       | Automatic calibration function                              |
| 5.4.       | Experimental results                                        |
| 5.4.1.     | Linearity and bin width distribution                        |
| 5.4.2.     | Time interval tests                                         |
| 5.5.       | Multichannel implementation and logic resource consumption  |
| 5.6.       | Comparison                                                  |
| 5.7.       | Conclusion                                                  |
| Chapter 6. | Conclusion                                                  |
| 6.1.       | Summary                                                     |
| 6.2.       | Future work                                                 |
|            | References                                                  |
|            | Appendix                                                    |
|            | Journal publications                                        |
|            | Conference submission                                       |

## **List of Figures**

| Figure      | Caption |
|-------------|---------|
| Figure 2.17 | a) An example of the gray code transition. The structures of b) the gray code counter and c) gray code oscillator TDC |
| Figure 2.18 | Bubble removal circuits: a) AND gate based, b) XOR gate based. c) an AND + inverter-based bubble-proof circuit |
| Figure 2.19 | The sub-TDL structure for CY4 |
| Figure 2.20 | a) Conversion between the analogue input and digital output. b) Quantization errors caused by the conversion |
| Figure 2.21 | Errors caused by nonlinearities |
| Figure 3.1  | WU method's operating principle [111] |
| Figure 3.2  | Examples of a) a LUT-based WU-A launcher and its truth table and b) a NAND-based WU-B launcher. The concepts of c) the WU-A method and d) the WU-B method |
| Figure 3.3  | The block diagram of the proposed TDC system a) with the WU method and b) without the WU method |
| Figure 3.4  | Slices in a) 7-series FPGAs and b) UltraScale FPGAs |
| Figure 3.5  | a) DNL and INL curves for the 790-bin DS TDC. b) DNL and INL curves for the 390-bin DS TDC |
| Figure 3.6  | a) An example of the propagation speed difference between the rising and falling edges. b) The bin number versus the measured time fitting curves for the rising and falling edge signals |
| Figure 3.7  | The block diagram of the encoder |
| Figure 3.8  | Concepts of a) the compensation strategy and b) the binning method. c) Hardware implementation of the compensation method |
| Figure 3.9  | DNL and INL performances. a) DNL and b) INL plots of the original WU TDC and the compensated WU TDC in the UltraScale FPGA (810 bins) |
| Figure 3.10 | The block diagram of the DSWU TDC |
| Figure 3.11 | Experimental Setup |
| Figure 3.12 | Xilinx KCU105 evaluation board |
| Figure 3.13 | a) Si5324 evaluation board. b) CG635 clock generator |
| Figure 3.14 | DNL and INL curves for the DSWU TDC (1626 bins) |
| Figure 3.15 | DNL and INL curves for the binned-DSWU TDC (807 bins) |
| Figure 3.16 | Bin width distributions for the compensated WU TDC and the binned-DSWU TDC |
| Figure 3.17 | TIT results and RMS resolutions for a) the DS TDC system and b) the WU TDC system |
| Figure 3.18 | TIT results and RMS resolutions for a) the DSWU TDC system and b) the binned-DSWU TDC system |
| Figure 3.19 | Test setup for investigating jitters of LUT and the delay element |
| Figure 3.20 | Test results for investigating jitters of LUT and the delay element |
| Figure 3.21 | Layouts of the DSWU TDC. a) Overview. b) Clock regions (X2Y1 $\sim$ X2Y3) |
| Figure 4.1  | a) Synchronized LiDAR system. b) Timing diagram of time interval measurements. c) Conception of a timing event histogramming function |
| Figure 4.2  | Block diagrams of a) the proposed TDC architecture and b) the coarse code histogramming module |
| Figure 4.3  | Two-step histogramming methods in [156] |
| Figure 4.4  | Comparison between a) the proposed mixed-binning method and b) the mixed-calibration method in [66] |
| Figure 4.5  | Concepts of a) the MC method in [66] and b) the proposed MB method. c) Errors in the MC method |
| Figure 4.6  | a) Linearity curves for the full-length (2400 bins; $LSB = 5.13 \text{ ps}$) original TDC (without using any calibration methods). b) The layout of the full-length original TDC placed in Slice X49Y0 – X49Y299 |
| Figure 4.7  | Linearity curves for the 460-bin original TDC placed in Slice X49Y120-X49Y179 |
| Figure 4.8  | The prediction tool's GUI |
| Figure 4.9  | DNL curves for the proposed TDCs when $i = 10, 16$ and 20 |
| Figure 4.10 | INL curves for the proposed TDCs when $i = 10, 16$ and 20 |
| Figure 4.11 | Short delay time interval tests results (< 2 ns): a) $i = 10$, b) $i = 16$, and c) $i = 20$ |
| Figure 4.12 | Valid RMS resolutions in long delay time interval tests (< 1000 ns) |
| Figure 4.13 | The layout of the 128-channel hybrid TDC |
| Figure 5.1  | The block diagram of the proposed TDC system based on ZYNQ structures |
| Figure 5.2  | Example of bin compensation in a) the MC method [66]; and b) the proposed weighted histogram calibration method |
| Figure 5.3  | Hardware implementation of the weighted histogram calibration |
| Figure 5.4  | Bin compensation when a) $W_k \le 1$ LSB; b) 1 LSB $< W_k \le 2$ LSB; and c) 2 LSB $< W_k \le 3$ LSB |
| Figure 5.5  | The pseudo-code for calculating address factors |
| Figure 5.6  | Flow diagrams of a) the proposed weighted histogram calibration method; b) the previous mixed calibration method [66] |
| Figure 5.7  | Vivado IP Integrator system block design |
| Figure 5.8  | a) AXI communication model; b) Three types of AXI interface |
| Figure 5.9  | The workflow of AC TDCs |
| Figure 5.10 | a) DNL and b) INL plots of the calibrated and uncalibrated TDCs |
| Figure 5.11 | Distribution of a) calibrated bin-widths and b) uncalibrated bin-widths. LSB = 9.83 ps |
| Figure 5.12 | a) Time interval measurement results and b) Time interval histogram at the time interval about 980 ps |
| Figure 5.13 | Implementation layouts of a) a single channel and b) 16 channels |

## **List of Tables**

| Table     | Caption |
|-----------|---------|
| Table 1-1 | Comparisons between analog and digital solutions |
| Table 2-1 | Comparison between mainstream interpolation methods |
| Table 2-2 | TDCs in ASICs and FPGAs |
| Table 3-1 | Interpolation Efficiency on Resolution |
| Table 3-2 | Comparison of The Linearity Performances Between Four Different TDC Designs in UltraScale 20 nm FPGAs (The unit is LSB if not mentioned) |
| Table 3-3 | Evaluation of Measurement Uncertainties |
| Table 3-4 | Consumption of Logic Resources |
| Table 3-5 | Comparison of Published FPGA-based TDCs and the Proposed TDCs |
| Table 4-1 | Linearity performances of the proposed TDCs obtained from software predictions and hardware implementations |
| Table 4-2 | Precision performance of the proposed TDCs in the software Prediction and the hardware implementation |
| Table 4-3 | Logic resource consumption |
| Table 4-4 | Linearity performances of 16 channels (out of 128 channels in the proposed multichannel TDC, unit: ×1.0E-3 LSB) |
| Table 4-5 | Comparison Between Reported High-linearity TDCs with Acceptable Resolutions |
| Table 5-1 | Sampling Pattern Comparison |
| Table 5-2 | Linearity comparison between the uncalibrated TDC and calibrated TDC |
| Table 5-3 | Logic resource utilization |
| Table 5-4 | Summary of linearity performance for 16-channel TDCs (Units: LSB) |
| Table 5-5 | Comparison of published calibration methods |
| Table 5-6 | Summary of recently published FPGA-TDCs with comparable performances |

## Abbreviations

| Abbreviation | Definition                                 |
|--------------|--------------------------------------------|
| AC           | Automatic calibration                      |
| ADC          | Analogue-to-digital converter              |
| AI           | Artificial intelligence                    |
| ALU          | Arithmetic logic unit                      |
| AMBA         | Advanced microcontroller bus architecture  |
| ASIC         | Application-specific integrated circuit    |
| AXI          | Advanced extensible interface              |
| BCF          | Bin correction factor                      |
| BRAM         | Block random access memory                 |
| CARRY4       | CY4                                        |
| CARRY8       | CY8                                        |
| CB           | Connection block                           |
| CGPM         | Conférence générale des poids et mesures   |
| CLB          | Configurable logic block                   |
| CMOS         | Complementary metal-oxide-semiconductor    |
| CPU          | Central processing unit                    |
| CR           | Clock region                               |
| CT           | Computed tomography                        |
| CW           | Continuous wave                            |
| CY4          | CARRY4                                     |
| CY8          | CARRY8                                     |
| DDR          | Double Data Rate                           |
| deoxy-Hb     | Deoxygenated haemoglobin                   |
| DFF          | D-type flip-flop                           |
| DNL          | Differential nonlinearity                  |
| DSP          | Digital signal processing                  |
| DSWU         | Dual-sampling wave union                   |
| D-ToF        | Direct ToF                                 |
| EDA          | Electronic design automation               |
| ELF          | Executable and linkable format             |
| FF           | Flip-flop                                  |
| FIFO         | First-in, first-out                        |
| FLIM         | Fluorescence lifetime imaging microscopy   |
| FMC          | FPGA mezzanine card                        |
| fNIRS        | Functional near-infrared spectroscopy      |
| FoV          | Field-of-View                              |
| FPGA         | Field-programmable gate array              |
| FSR          | Finite step response                       |
| FWHM         | Full width at half-maximum                 |
| GCO          | Gray code oscillator                       |
| GPIO         | General purpose IO                         |
| GUI          | Graphical user interface                   |
| HDL          | Hardware description language              |
| HPC          | High pin count                             |
| IC           | Integrated circuit                         |
| INL          | Integral nonlinearity                      |
| IO           | Input/output                               |
| IOB          | IO block                                   |
| IoT          | Internet of Things                         |
| IP           | Intellectual property                      |
| ISR          | Infinite step response                     |
| I-ToF        | Indirect ToF                               |
| LAB          | Logic array block                          |
| LB           | Logic block                                |
| LiDAR        | Light detection and ranging                |
| LOR          | Line of response                           |
| LPC          | Low pin count                              |
| LSB          | Least significant bit                      |
| LSI          | Large-scale integration                    |
| LSPR         | Large-scale parallel routing               |
| LUT          | Look-up table                              |
| MB           | Mixed-binning                              |
| MCU          | Microcontroller unit                       |
| MMCM         | Mixed-mode clock manager                   |
| MOS          | Metal-oxide-semiconductor                  |
| MR           | Measurement range                          |
| MUX          | Multiplexer                                |
| NIR          | Near infrared ray                          |
| OH2BIN       | One-hot-to-binary                          |
| OS           | Operating system                           |
| oxy-Hb       | Oxygenated haemoglobin                     |
| PC           | Personal computer                          |
| PCB          | Printed circuit board                      |
| PET          | Positron emission tomography               |
| pk-pk        | Peak-to-peak                               |
| PL           | Programmable logic                         |
| PLL          | Phase locked loop                          |
| PPA          | Performance, power, and area               |
| PS           | Processing system                          |
| PSDL         | Pulse shrinking delay line                 |
| QC           | Quantum communication                      |
| QRNG         | Quantum random number generator            |
| Radar        | Radio detection and ranging                |
| RF           | Radio frequency                            |
| RO           | Ring oscillator                            |
| RTL          | Register transfer level                    |
| SB           | Switch block                               |
| SD           | Secure digital                             |
| SDK          | Software development kit                   |
| SDRAM        | Synchronous Dynamic Random Access Memory   |
| SI           | International system of units              |
| SiPM         | Silicon photomultiplier                    |
| SMA          | Sub-Miniature version A                    |
| SNR          | Signal-to-noise ratio                      |
| SoC          | System on chip                             |
| SPAD         | Single-photon avalanche diode              |
| SRS          | Stanford Research Systems                  |
| TAC          | Time-to-amplitude converter                |
| TCSPC        | Time-correlated single-photon counting     |
| TDC          | Time-to-digital converter                  |
| TDL          | Tapped delay line                          |
| TH2OH        | Thermometer-to-one-hot                     |
| TI           | Time interval                              |
| TIM          | Time-interval meter                        |
| TIT          | Time interval test                         |
| TM2BIN       | Thermometer-to-binary                      |
| ToF          | Time-of-flight                             |
| TVC          | Time-to-voltage converter                  |
| UTC          | Coordinated universal time                 |
| VDL          | Vernier delay line                         |
| WCF          | Width correction factor                    |
| WR           | White rabbit                               |
| WU           | Wave union                                 |

## **Chapter 1.** Introduction

Section 1.1 of this chapter introduces the research background and explains the motivation. The aim and contributions are then listed in Section 1.2 and Section 1.3, respectively. Finally, Section 1.4 provides the outline of the thesis.

## 1.1. Background and motivation

Time is the continuous and irreversible sequence of events from the past into the future. Thirty thousand years ago, ancient humans first recorded the period of the Moon, and the history of time measurement has continued since then. People noticed the regular movement of the Sun, the Moon, and the Earth, and developed the concept of days, months, and years. Timing techniques can be found throughout history. The ancient Chinese created the lunar calendar for agricultural cultivation. The Babylonians invented the base-60 system, defined hours, minutes, and seconds, and developed fundamental astronomy. Many inventions were made to record the flow of time, such as sundials, candle clocks, and water clocks. The first mechanical clock, invented in 1275, kept time by counting the rotation of an escapement. Since then, mechanical clocks have been miniaturized further and have become the wristwatches of modern life.

To keep pace with the fast development of science, more precise timing techniques are required. Marrison designed a quartz clock by extracting oscillating signals from a quartz crystal in 1929 [1]. Shaped and insulated quartz crystals are robust to environmental variations [2], such as humidity, pressure, and temperature. The atomic clock was first built by Essen and Parry in 1955 [3]; it counts time by monitoring the frequency of cesium-133 atom radiation (around 9.2 GHz). In 1967, the second was redefined as 9,192,631,770 periods of the cesium-133 atom radiation, the definition used in the International System of Units (SI). As the frequency of light is much higher than that of microwaves, optical clocks have become attractive. In 2001, an optical clock with 532 THz light was proposed and achieved excellent measured stability (at femtosecond level) [4]. Although atomic and optical clocks can achieve outstanding performance, the whole systems are bulky and require a strict experimental environment. Small-size high-precision electronic timing devices are still required.



Figure 1.1 a) General structure of analog TIMs. b) The TAC principle.

In physics and biology, reactions and responses happen in a very short time interval (TI). To capture the time information, TI meters (TIMs), acting as electronic high-precision stopwatches, are used to measure the time interval between two or more events and convert the result into a digital format [5]. A TIM can be built in an analog or a digital manner. Because of capacitors' voltage-storing capability, time-to-amplitude converters (TACs) or time-to-voltage converters (TVCs) were the first efficient TIMs, reaching a resolution in the sub-nanosecond range [6]. Figure 1.1a shows the simplified structure of an analog TIM. It contains three components: a TAC, an amplifier, and an analog-to-digital converter (ADC). Figure 1.1b shows the TAC's operating principle. When the switch is on, the current source $I_{conv}$ charges the capacitor $C_{conv}$. The capacitor's voltage, V(t), increases during this process and is expressed in Eq. (1.1).

$$V(t) = \frac{1}{C_{conv}} \int_{T_{start}}^{T_{stop}} I_{conv} \, dt \tag{1.1}$$
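As a rough illustration of Eq. (1.1), assume the charging current is constant, so the integral reduces to $V = I_{conv} \cdot (T_{stop} - T_{start}) / C_{conv}$. The sketch below evaluates this expression; the function name and the component values (1 mA, 100 pF, 10 ns) are hypothetical and are not taken from any design discussed in this thesis.

```python
def tac_voltage(i_conv_a: float, c_conv_f: float,
                t_start_s: float, t_stop_s: float) -> float:
    """Capacitor voltage after charging with a constant current (Eq. 1.1)."""
    if t_stop_s < t_start_s:
        raise ValueError("stop event must not precede start event")
    return i_conv_a * (t_stop_s - t_start_s) / c_conv_f


# Hypothetical values: 1 mA current source, 100 pF capacitor, 10 ns interval.
v = tac_voltage(1e-3, 100e-12, 0.0, 10e-9)
print(f"V = {v:.3f} V")
```

With these assumed values, the voltage scales linearly with the interval, which is why a TAC followed by an ADC can act as a time interval meter.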

Passing through the amplifier, V(t) is increased by a gain G and then converted into a digital format by the ADC. With multistage amplifiers and well-developed ADCs, analog TIMs can achieve outstanding performance. In the 1990s, Kalisz *et al.* designed an analog TIM with a 3 ps resolution [7]. A recently reported TAC-ADC solution achieved a 782 fs resolution [8]. However, constrained by the discrete components, these analog solutions are bulky, expensive, and complicated to extend to multi-channel parallel systems [9]. Time-to-digital converters (TDCs), as a digital solution, have become more popular because they are more portable and can be easily integrated with digital systems (see Table 1-1).

Table 1-1 Comparisons between analog and digital solutions

|         | Pros                           | Cons                                  |
|---------|--------------------------------|---------------------------------------|
| Analog  | Excellent resolution (100 fs)  | Bulky and expensive, limited channels |
| Digital | Portable, multi-channel design | Limited resolution                    |

### 1.1.1. Time-to-digital converters and time-of-flight measurements

Unlike complex analog TIM systems, a TDC can independently convert time intervals into digital codes and is vital in time-of-flight (ToF) measurements. Although all-digital TDCs are more portable and compatible, the complementary metal-oxide-semiconductor (CMOS) process still restricts TDCs' performance. Thanks to improved manufacturing techniques over the past 20 years, all-digital TDCs now have better resolution and linearity, attracting researchers' attention. In 2021, Chung *et al.* reported a 360-fs TDC in 65-nm CMOS [10], the highest-resolution TDC reported.

The ToF technique is a method for measuring the distance between objects [11]. Figure 1.2a is an example of a ToF system. A signal is emitted and then detected; the ToF corresponds to the distance. Depending on the type of emitted signal, ToF measurements are classified as direct ToF (D-ToF, shown in Fig. 1.2b) or indirect ToF (I-ToF, shown in Fig. 1.2c). Both D-ToF and I-ToF can measure distance. In D-ToF, the emitted signal is a light pulse, which is detected after being reflected by the target. The time difference ( $\Delta t$ ) between the two events (the emission and the detection) is measured by a TDC and converted to the distance, see Eq. (1.2).



Figure 1.2 a) Direct ToF measurement system. The principles of b) the direct ToF and c) the indirect ToF (Emitted signal: red; received signal: blue).

$$d = \frac{1}{2} \cdot c \cdot \Delta t \tag{1.2}$$
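Eq. (1.2) can be evaluated directly. The following is a minimal sketch, assuming light travels at its vacuum speed; the function names are illustrative only.

```python
C_M_PER_S = 299_792_458.0  # speed of light in vacuum (assumed propagation speed)


def dtof_distance_m(delta_t_s: float) -> float:
    """Distance from the round-trip time measured by the TDC (Eq. 1.2)."""
    return 0.5 * C_M_PER_S * delta_t_s


def round_trip_time_s(distance_m: float) -> float:
    """Inverse of Eq. (1.2): round-trip time for a given target distance."""
    return 2.0 * distance_m / C_M_PER_S


print(f"100 ps -> {dtof_distance_m(100e-12) * 100:.2f} cm")
print(f"1 cm   -> {round_trip_time_s(0.01) * 1e12:.1f} ps")
```

The second call shows the time step a TDC must resolve for a given depth step, which links the required TDC resolution to the target ranging resolution.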

However, in I-ToF, the emitted light is a modulated continuous wave (CW) signal with a frequency (f). After travel, a phase difference  $(\Delta \varphi)$  between emitted and received signals is obtained by a phase comparator. The distance can be calculated by Eq. (1.3).

$$d = \frac{1}{2} \cdot \frac{\Delta \varphi}{2\pi} \cdot \frac{1}{f} \cdot c \tag{1.3}$$

Compared with D-ToF sensors, I-ToF sensors are more commercially available because the image sensor can be implemented in standard CMOS technology. However, their measurement range is limited by the modulation frequency (typically 20–100 MHz, equivalent to 1.5–7.5 m) [12], [13]. With the breakthrough of single-photon avalanche diode (SPAD) imagers [14], which enable the detection of light at the single-photon level, D-ToF has become more attractive because of its fast acquisition and long measurement range, and it has been widely applied to bio-sensing and industrial applications [13].
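The quoted range limit follows from Eq. (1.3): the phase difference wraps at $2\pi$, so the largest unambiguous distance is $c/(2f)$. The sketch below, with illustrative function names and the vacuum speed of light as an assumption, evaluates Eq. (1.3) and this limit.

```python
import math

C_M_PER_S = 299_792_458.0  # speed of light in vacuum (assumed propagation speed)


def itof_distance_m(delta_phi_rad: float, f_mod_hz: float) -> float:
    """Distance from the measured phase difference (Eq. 1.3)."""
    return 0.5 * (delta_phi_rad / (2.0 * math.pi)) * C_M_PER_S / f_mod_hz


def unambiguous_range_m(f_mod_hz: float) -> float:
    """Distance at which the phase wraps around (delta_phi = 2*pi)."""
    return itof_distance_m(2.0 * math.pi, f_mod_hz)


for f_mod in (20e6, 100e6):
    print(f"{f_mod / 1e6:.0f} MHz -> {unambiguous_range_m(f_mod):.2f} m")
```

A lower modulation frequency extends the range but, for a fixed phase resolution, coarsens the distance resolution, which is the trade-off behind the typical 20–100 MHz choice.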

### 1.1.2. Applications

As the critical component in ToF measurements, TDCs are widely used in many fields, like light detection and ranging (LiDAR), ToF-based positron emission tomography (PET), and fluorescence lifetime imaging microscopy (FLIM). As a timing device, TDCs are popular in digital synchronizers, time-resolved temperature sensing, and quantum security.

#### a) Light detection and ranging (LiDAR)

Light detection and ranging (LiDAR) is a technique that has been used in many areas, such as driverless vehicles, industrial robots, and landscape mapping [15]–[17]. A typical LiDAR hardware system contains a pulsed laser emitter, optical components, photon detectors, and TDCs, see Fig. 1.2a. In these industrial LiDAR applications, the distances between objects can range from a few centimetres to hundreds or even thousands of metres, for example, vehicles on roads [18], [19] and lakes and forests on land. Therefore, TDCs in these applications only require an acceptable resolution (around 100–200 ps) but need high precision and an extended measurement range. For reference, in ToF measurements, a time interval of about 66.7 ps corresponds to a distance of 1 cm.

Similar to other image sensors, such as cameras, multi-pixel photon detectors are required in LiDAR applications. A single pixel in a photon detector has a limited field of view (FoV), determined by the lens, for example, 0.5 × 0.5 degrees. A sensor array is therefore built to obtain a wide-field (e.g., 120 × 100 degrees) image. To serve the sensor array, multichannel TDCs have become popular.

#### b) Time-of-flight Positron Emission Tomography (ToF PET)

With great sensitivity at the picomolar level, Positron Emission Tomography (PET) is an irreplaceable molecular imaging modality in nuclear medical physics [20], [21]. It uses radioactive tracers to obtain metabolic information and to image the distribution of labelled molecules in vivo [22]. PET imaging has been widely applied to human and pre-clinical applications, especially in cancer diagnosis and therapy monitoring. Figure 1.3a shows an example of a PET scanner. The patient is injected with radioactive tracers (labelled molecules, highlighted in red), which are absorbed by cancerous tissue. The cancerous tissue then releases gamma rays due to the labelled molecules, and the detector triggers when it receives them. In non-ToF PET, the line of response (LOR, highlighted in blue) with power information indicates where the cancerous tissue is. Combined with TDCs, however, ToF-PET can also record the timestamps of events. With the ToF data (highlighted in yellow), the location of the cancerous tissue can be quantified.



Figure 1.3 Examples of a) PET scanner and b) FLIM systems.

Compared with a non-ToF PET system, a ToF system can provide a higher signal-to-noise ratio (SNR) due to the extra ToF information [23]. However, the PET's performance is limited by the ToF resolution. For example, a 600 ps ToF resolution results in a location uncertainty with a full width at half maximum (FWHM) of roughly 9 cm. Therefore, one of the major concerns in developing ToF-PET systems is improving TDCs' resolution [22]–[24].
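The 9 cm figure can be checked with a short worked equation, assuming the usual relation between the timing uncertainty $\Delta t$ of a coincidence pair and the position uncertainty $\Delta x$ along the LOR, and taking $c \approx 3 \times 10^{8}$ m/s:

$$\Delta x = \frac{c \cdot \Delta t}{2} = \frac{(3 \times 10^{8}\ \text{m/s}) \times (600 \times 10^{-12}\ \text{s})}{2} = 0.09\ \text{m} = 9\ \text{cm}.$$

Under the same assumption, reaching a 1.5 cm uncertainty would require a ToF resolution of about 100 ps.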

#### c) Fluorescence lifetime imaging microscopy (FLIM)

Fluorescence-based imaging techniques are tools for a range of applications, from chemical science and clinical diagnosis to cell biology [25]. A high-contrast image can be constructed with labelled samples by capturing the fluorescence intensity. The fluorescence lifetime, obtained by recording the fluorescence decay, efficiently quantifies information about molecules and cells [26]. Fluorescence lifetime imaging can be implemented in the frequency domain or the time domain [25]. Due to the fast development of photon detectors, such as SPADs and silicon photomultipliers (SiPMs), time-domain fluorescence lifetime imaging microscopy (FLIM, see Fig. 1.3b) offers higher photon efficiency, throughput, and SNR than frequency-domain solutions [27].

#### d) Functional near-infrared spectroscopy (fNIRS)

Functional near-infrared spectroscopy (fNIRS) is a non-invasive imaging technique that uses near-infrared (NIR) light with wavelengths in the range of 600–900 nm to measure brain activity [28]. Most biological tissues are transparent to NIR light because the absorption factors of the main constituents of tissue, such as water and lipid, are relatively low [29]. However, NIR light is absorbed and scattered by oxygenated haemoglobin (oxy-Hb) and deoxygenated haemoglobin (deoxy-Hb), whose concentrations change during brain activity. Therefore, brain activity can be detected. In traditional fNIRS, CW light is emitted, attenuated by the target tissue, and finally detected by a photodiode. With Monte-Carlo simulations, the photon path can be estimated, and the concentrations of oxy-Hb and deoxy-Hb can be approximated using physical models [30]. With TCSPC techniques, ToF-fNIRS allows researchers to quantify the concentrations and restore the photon path more precisely [29].

#### e) Digital synchronizer

In digital circuits, to reduce metastability failures, asynchronous signals or signals from other clock regions should be synchronized before arriving at registers. In an extensive distributed system with thousands of nodes over 1 km<sup>2</sup>, synchronization between nodes becomes a problem. In Gigabit Ethernet, the precision time protocol (PTP) [31] only provides a time resolution of 8 ns. To provide sub-nanosecond accuracy and picosecond precision, White Rabbit (WR), a synchronized network, was introduced in [32] and is now used in telescope systems [33] and extreme light facilities (SHINE) [34]. As precise stopwatches, TDCs are used to monitor the synchronization process in WR.

#### f) Temperature sensing

Temperature sensing and thermal management are essential for the Internet of Things (IoT) and semiconductor manufacturing [35], [36]. Temperature sensors fall into two types: voltage domain and time domain [37]. In voltage-domain solutions, a voltage in the circuit changes with the ambient temperature, and the temperature is obtained by measuring this voltage with an ADC. Although the voltage-domain solution demonstrates high accuracy and resolution, its high power consumption and large chip area make it imperfect for chip designs. Instead, the TDC-based temperature sensor has become more popular due to its low power consumption and simplicity [37]–[39].

#### g) Quantum security

Quantum communication (QC) is a new technology that provides a global secure communication network [40]. To safeguard a secure communication channel, an unpredictable key generated from high-quality random numbers is essential [41]. Extracting randomness from quantum machines is the most trustworthy way to generate random numbers [42]. By combining a TDC with a single-photon detector (e.g., a SPAD), a quantum random number generator (QRNG) can extract random codes from photon propagation [43]. Photon detection at the single-photon level follows the Poisson distribution: the numbers of recorded events in different time intervals have their own probabilities, and each photon event is independent of the previous one. Exploiting this Poisson statistics, a QRNG can convert the timestamps from the TDC into true random digital codes [44]. With a higher TDC timing resolution, the QRNG achieves a better generation rate.

## 1.2. Research aim

The key aims and goals of this PhD project are:

- To review reported TDC architectures and calibration methods and fairly compare their performances.
- To develop new TDC structures, targeting a higher resolution and comparable linearity.

- To further explore the commercial potential of FPGA-TDCs and develop a suitable solution for commercial FPGA-TDC timing products.

## 1.3. List of contributions

### a) A 1.23-ps resolution TDC for bio-sensing and high-energy physics.

- Corrected the misunderstanding in some previously published articles that the well-known wave union (WU) method is not applicable in high-end FPGA devices, such as 20 nm and 16 nm FPGAs.
- Proposed the first efficient WU-TDC in 20 nm UltraScale FPGAs.
- Provided a theoretical analysis for TDCs' precision.

### b) A resolution-adjustable high-linearity TDC for LiDAR in driverless vehicles.

- Proposed a resolution-adjustable TDC (resolutions: 50, 80, 100 ps) with great linearity.
- Optimized resource usage and the 128-channel design, suitable for LiDAR in driverless vehicles.
- Designed a simulation tool to predict TDCs' performance with different calibration methods and made it open-sourced.

### c) An automatic calibration TDC based on ZYNQ SoC structures.

- Achieved an automatic calibration function based on ZYNQ SoC structures.
- Proposed an improved single-step block random access memory (BRAM) calibration method.

## 1.4. Outline of the thesis

A summary of the following chapters of this thesis is shown below:

#### Chapter 2: A literature review of TDC designs

Chapter 2 will introduce and evaluate current hardware platforms where TDCs can be implemented, from ASICs to FPGAs, and discuss the differences between these platforms. Then, a review of the recent developments of FPGA-TDCs will be presented, and the critical parameters of TDCs will be explained. Based on previously published TDC designs, the research trends and challenges will be shown.

#### Chapter 3: Wave-union-based high-resolution TDCs on 20 nm FPGAs

In this chapter, a high-resolution TDC on 20 nm FPGAs will be presented. This FPGA-TDC is the first efficient WU-based TDC on 20 nm FPGAs, which corrects the misunderstanding that the WU method is inapplicable to high-end FPGAs. Moreover, Chapter 3 provides theoretical analysis to evaluate TDCs' precision.

#### Chapter 4: 128-channel resolution-adjustable TDCs on 20 nm FPGAs

Although there is a growing research trend in developing higher resolution TDCs, Chapter 4 points out that the linearity and measurement range are more critical for a TDC in LiDAR systems for driverless vehicles. A multi-channel high-linearity resolution-adjustable TDC is presented in this chapter.

#### Chapter 5: Automatic calibration TDCs on ZYNQ SoCs

Chapter 5 describes the automatic calibration TDCs on ZYNQ SoCs. This TDC shows excellent potential for commercial FPGA-based TCSPC products with the automatic calibration function.

#### Chapter 6: Conclusion

Chapter 6 summarizes the major contributions during this PhD study. A prediction about future work based on current studies is given in this chapter.

## **Chapter 2.** Literature Review

To better understand the implementation and performance of TDCs, hardware platforms and techniques which are used to build a TDC should be first introduced. Therefore, in this chapter, two mainstream hardware platforms, where TDCs can be implemented, will be explained and the parameters to evaluate TDCs' performance will be discussed. After that, the general methods used to measure time intervals will be explained. At the end of this chapter, some challenges and recent achievements in TDC studies will be reviewed.

The FPGA-TDCs reported in the thesis were mainly implemented on Xilinx FPGAs. However, Xilinx was acquired by AMD in 2022. Therefore, the Xilinx name is now being replaced by AMD.

## 2.1. Hardware platform

A digital TDC can be implemented on application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs). Modern ASICs and FPGAs are fabricated with complementary metal-oxide-semiconductor (CMOS) processes because of their low static power consumption. Figure 2.1 shows an example, a CMOS inverter. It is built from two transistors (PMOS and NMOS) that have a similar structure but charge carriers of opposite polarity. Since one of the two transistors is always off in a steady state, the static power consumption decreases significantly. In this section, the hardware platforms (ASICs and FPGAs) for TDC implementations will be introduced.



Figure 2.1 CMOS inverter


### 2.1.1. Application-specific integrated circuit (ASIC)

Unlike general-purpose integrated circuit (IC) chips that can be used in many fields, such as central processing units (CPUs) in personal computers (PCs) and microcontroller units (MCUs) in low-power embedded systems, an ASIC is a fully customized IC designed for a specific purpose or application [45].

An ASIC-TDC shows great flexibility. Starting from the register transfer level (RTL) implementation using hardware description language (HDL), engineers can design a TDC with any standard cell. However, to optimise performance, power, and area (PPA), the cells should be placed carefully. Furthermore, pre-manufacturing tests, such as function verification, and post-manufacturing tests, for example, failure analysis, are essential to ensure chips' reliability. The whole project will take more than 6 months and require a team of engineers to be involved.

Implementing TDCs on ASICs is a good choice if mass production is required. For fast prototyping, however, an FPGA is more cost-effective.

### 2.1.2. Field-programmable gate array (FPGA)

A field-programmable gate array (FPGA) is a semi-customised and reprogrammable IC. Unlike ASIC devices, an FPGA has a fixed set of logic resources, surrounded by programmable I/O blocks, in a fixed physical layout. However, the inner connections and the resource usage are configurable. Due to their flexibility and low cost, FPGAs are popular platforms for TDC designs.

Figure 2.2 shows the structure of modern FPGAs (island-structure). Roughly, a modern FPGA consists of three different types of resources: logic elements, input/output (IO) elements, and wiring elements. The logic elements, such as look-up tables (LUTs), carry-chains, multiplexers (MUXes), and flip-flops (FFs, also called registers), are placed in logic blocks (LBs). Depending on the FPGA vendor, LBs have different names, for example, configurable logic block (CLB) in Xilinx FPGAs [47] and logic array block (LAB) in Intel FPGAs [48]. The IO blocks (IOBs) are used to input and output signals to and from peripheral circuits. The wiring elements include routing channels, connection blocks (CBs), and switch blocks (SBs). The routing channels are wires with fixed tracks, linking all resources in FPGAs. Both CBs and SBs are programmable switches. CBs are the connections between neighbouring LBs (vertical and horizontal), and SBs connect routing channels. Besides the configurable blocks, "hard logic" blocks are also implemented on FPGAs for specific applications, such as digital signal processing (DSP) units for high-efficiency computing (e.g., floating-point calculations), clock managers for clock networking, including phase-locked loops (PLLs) and mixed-mode clock managers (MMCMs), and BRAMs for data storage.



Figure 2.2 The structure of modern FPGAs [46].

Having abundant programmable logic resources, FPGAs are ideal platforms for implementing algorithms that can be pipelined or parallelised. However, many problems require the hardware to process a large amount of data and execute sophisticated algorithms with extensive branching. Therefore, to enhance the software programmability, there has been a growing trend since the 2000s of integrating a hard-core processor (e.g., an ARM-based processor) with an FPGA in a single chip [49]. Such an FPGA+ARM chip is called a System-on-Chip (SoC) [50]. Figure 2.3 shows the general architecture of Xilinx Zynq SoCs. It contains two parts: a processing system (PS) and programmable logic (PL). In Zynq, the PS is a 'hardened' ARM-based processor, offering the flexibility to implement complex algorithms. Unlike a soft intellectual property (IP) core (a reusable and configurable block-level design in FPGAs), the ARM core in Zynq SoCs has a dedicated toolchain and can operate at a higher frequency (a few hundred MHz) but consumes more power (a few mW). The PL is equivalent to a traditional FPGA. Communications between the PS and the PL use Advanced eXtensible Interface (AXI) buses [51].



Figure 2.3 Simplified architecture of Xilinx Zynq SoCs [52].

### 2.1.3. Setup time, hold time, and metastability

Timing violations, including setup time and hold time violations, lead to metastability and limit the performance of digital sampling elements, for example, flip-flops. To capture data correctly, the data signal must remain stable for a minimum period before the clock's active edge (setup time). To store data correctly, the data must also remain stable for a minimum period after the clock's active edge (hold time).

Due to timing violations, the circuit falls into metastable states and may be unable to settle into a stable logic level. As a result, the behaviour of the circuit is unpredictable. To avoid metastability, careful timing analysis is a must.

Although metastability endangers TDCs' precision, the effect varies between TDC architectures. For example, in a tapped-delay-line (TDL) TDC (which will be discussed in Sec. 2.3.2), metastability is not a big issue given careful timing constraints, because the signal to be measured remains stable while it propagates through a long delay line. In addition, delay lines are placed within a single clock region to reduce the risk of metastability.

## 2.2. Performance parameters

Similar to ADC conversion, a TDC measurement is a quantisation process, converting time intervals to digital codes. Figure 2.4 shows the TDC transfer function [53]. Various parameters are used to evaluate the performance of TDCs. This section will discuss the most critical parameters, including resolution, linearity, precision, accuracy, measurement range (MR), and deadtime.

### 2.2.1. Resolution

The resolution, or the least significant bit (LSB), is the minimum time interval (TI) that a TDC can distinguish. In Fig. 2.4, the resolution is the bin/step width of the ideal conversion (highlighted in black). In actual cases, however, the bins are non-uniform (highlighted in red), so the resolution is measured as the averaged bin width or estimated from the fitted curve between the input TI and the digital codes.



Figure 2.4 TDC transfer function [53]. (DNL: differential nonlinearity; INL: integral nonlinearity)

### 2.2.2. Linearity

The linearity can be characterised by differential nonlinearity (DNL) and integral nonlinearity (INL). DNL is the deviation from the single quantisation step, and INL is the accumulation of DNL. They are:

$$DNL[i] = (W[i] - W_{ideal})/W_{ideal}, \tag{2.1}$$

$$INL[i] = \sum_{n=0}^{i} DNL[n], \tag{2.2}$$

where W[i] is the width of the *i*-th bin, and  $W_{ideal}$  is the ideal bin width or averaged bin width.

Nonlinearity is one of the main sources of measurement error. The code density test is the standard method to estimate TDCs' linearity [5]. As shown in Fig. 2.5, many random TIs (more than 1,000,000 events) are fed into a TDC, and the number of TIs that fall into each time bin is counted. As the fed TIs are distributed evenly over the MR, the count in each bin is proportional to the width of that bin. The recorded histogram therefore reflects the width distribution of the time bins (see Fig. 2.5), and DNL and INL can be measured using Eqs (2.1) and (2.2).
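The code density test can be sketched in software. The example below is a simplified simulation, not the test setup used in this thesis: it assumes a hypothetical 4-bin TDC whose true bin edges are known, feeds it uniformly distributed TIs, and recovers the bin widths, DNL, and INL from the histogram using Eqs (2.1) and (2.2) with the averaged bin width as $W_{ideal}$.

```python
import bisect
import random


def code_density_test(bin_edges, n_hits, seed=0):
    """Estimate bin widths, DNL and INL (both in LSB) from random hits."""
    rng = random.Random(seed)
    t_min, t_max = bin_edges[0], bin_edges[-1]
    n_bins = len(bin_edges) - 1
    counts = [0] * n_bins
    for _ in range(n_hits):
        t = rng.uniform(t_min, t_max)            # evenly distributed TI
        idx = bisect.bisect_right(bin_edges, t) - 1
        counts[min(idx, n_bins - 1)] += 1        # clamp the t == t_max case
    # A bin's share of the hits is proportional to its width.
    widths = [c / n_hits * (t_max - t_min) for c in counts]
    w_ideal = (t_max - t_min) / n_bins           # averaged bin width
    dnl = [(w - w_ideal) / w_ideal for w in widths]
    inl, acc = [], 0.0
    for d in dnl:                                # running sum of DNL, Eq. (2.2)
        acc += d
        inl.append(acc)
    return widths, dnl, inl


# Hypothetical TDC with true bin widths of 8, 12, 10 and 10 ps.
edges_ps = [0.0, 8.0, 20.0, 30.0, 40.0]
widths, dnl, inl = code_density_test(edges_ps, n_hits=200_000)
print([round(w, 2) for w in widths])
print([round(d, 3) for d in dnl])
print([round(v, 3) for v in inl])
```

Because the estimate is statistical, its error shrinks as the number of hits grows, which is why a large number of events is used in practice.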



Figure 2.5 The concept of the time histogram

### 2.2.3. Precision and accuracy

Precision and accuracy are commonly mixed up in many reported TDC designs. The precision (also called RMS resolution, $\sigma$) reflects the concentration of measurement readouts for a given TI. The precision can be measured by time interval tests (TITs) or by the full width at half maximum (FWHM). In a TIT, hit signals with a fixed time interval are fed to a TDC, and the standard deviation of the readouts is the precision (shown in Fig. 2.6a), see Eq. (2.3). The FWHM is the width of the readout distribution between the two points at which it falls to half of its maximum. For a Gaussian distribution, the relationship between $\sigma$ and FWHM is given in Eq. (2.4):

$$\sigma^{2} = \frac{1}{b-1} \sum_{i=1}^{b} (x_{i} - \bar{x})^{2}, \tag{2.3}$$

$$FWHM = 2\sqrt{2\ln 2}\, \sigma \approx 2.355 \sigma, \tag{2.4}$$

where  $x_i$  is the *i*-th test result in *b* tests and  $\bar{x}$  is the average value.
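Both quantities are straightforward to compute from a set of TIT readouts. The sketch below uses Python's `statistics.stdev`, which implements the sample standard deviation with the $b-1$ denominator, and the Gaussian relation of Eq. (2.4); the readout values are invented for illustration.

```python
import math
import statistics


def rms_precision(readouts):
    """Sample standard deviation of repeated readouts (Eq. 2.3)."""
    return statistics.stdev(readouts)


def fwhm_from_sigma(sigma):
    """FWHM of a Gaussian distribution with standard deviation sigma (Eq. 2.4)."""
    return 2.0 * math.sqrt(2.0 * math.log(2.0)) * sigma


# Hypothetical readouts (in ps) for a fixed input time interval.
readouts_ps = [100.2, 99.5, 101.1, 100.4, 98.8]
sigma_ps = rms_precision(readouts_ps)
print(f"sigma = {sigma_ps:.2f} ps, FWHM = {fwhm_from_sigma(sigma_ps):.2f} ps")
```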

However, the accuracy is defined as the average absolute difference between the measurement value (see the dashed line in Fig. 2.6a) and the true reference (highlighted in red in Fig. 2.6a). Generally, the accuracy is related to the INL (the accumulated difference) and is also affected by the system offset.



Figure 2.6 a) The concept of accuracy and precision. b) Different scenarios of measurement readouts.

Figure 2.6b presents different scenarios of measurement readouts. It shows that, for a precise TDC, the inaccuracy can be cancelled quickly by offset corrections. Hence, precision is more critical for TDC performance. Meanwhile, a TDC with high precision can obtain a reliable result with fewer measurements, reducing the measurement time significantly. However, many factors can degrade TDCs' precision, for example, environmental noise (temperature changes) and electronic noise (jitter). Therefore, calibration is an essential technique in TDC studies.

### 2.2.4. Measurement range

The measurement range (MR) is the maximum time interval a TDC can measure. It is limited by the TDC's resolution and logic resources: a TDC with a longer MR requires more registers and delay elements. For a *C*-bit TDC, the measurement range can be expressed as:

$$MR = t_{LSB} \times 2^C. \tag{2.5}$$

The required MR varies between applications. For example, FLIM requires an MR within tens of nanoseconds, but geospatial surveys require an MR of more than thousands of nanoseconds. The Nutt method [54] provides a way to extend the MR using two TDC channels, the start channel and the stop channel. Figure 2.7 is the timing diagram. A coarse counter measures the valid clock cycles ( $t_{coarse}$ ), and two TDC channels (start channel,  $t_{start}$ , and stop channel,  $t_{stop}$ ) sample the fine time interval. The time interval (t) can be measured by Eq. (2.6).

$$t = t_{start} + t_{coarse} - t_{stop}. \tag{2.6}$$
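Eq. (2.6) can be illustrated with a small numerical sketch. It assumes one common convention, namely that each fine channel measures the time from its hit to the next rising clock edge and that $t_{coarse}$ is the number of clock cycles between those two edges multiplied by the clock period. The function names and the 2 ns clock are hypothetical.

```python
import math


def fine_time_to_next_edge(t_hit_ps: int, t_clk_ps: int) -> int:
    """Fine time from a hit to the next rising clock edge (assumed convention)."""
    return math.ceil(t_hit_ps / t_clk_ps) * t_clk_ps - t_hit_ps


def nutt_interval_ps(n_coarse: int, t_clk_ps: int,
                     t_start_ps: int, t_stop_ps: int) -> int:
    """Eq. (2.6): t = t_start + t_coarse - t_stop, with t_coarse = n_coarse * t_clk."""
    return t_start_ps + n_coarse * t_clk_ps - t_stop_ps


T_CLK_PS = 2000                         # hypothetical 500 MHz system clock
start_hit_ps, stop_hit_ps = 500, 9300   # absolute hit times, for checking only

t_start = fine_time_to_next_edge(start_hit_ps, T_CLK_PS)   # 1500 ps
t_stop = fine_time_to_next_edge(stop_hit_ps, T_CLK_PS)     # 700 ps
n_coarse = math.ceil(stop_hit_ps / T_CLK_PS) - math.ceil(start_hit_ps / T_CLK_PS)

print(nutt_interval_ps(n_coarse, T_CLK_PS, t_start, t_stop))  # 8800
```

The result equals the true interval of 9300 − 500 = 8800 ps, while each fine channel only ever has to cover one clock period.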



Figure 2.7 Timing diagram for the Nutt method [54].

### 2.2.5. Deadtime

The deadtime is the time interval between the end of the current measurement and the start of the next one [5]. During the deadtime, the TDC cannot conduct a measurement. Therefore, the deadtime limits the TDC's sampling rate. For a time-resolved measurement system, the repetition rate is affected by the deadtime of both the TDC and the detector.

Traditionally, one TDC channel serves one detector. However, detectors such as SPADs have a long deadtime, ranging from a few nanoseconds to tens of nanoseconds [55], [56]. A dynamic reallocation strategy offers a way to save logic resources [18], [57], [58]: one shared TDC channel (e.g., the low-deadtime TDC in [59]) can serve multiple detectors at the same time, reducing resource usage and improving system throughput.

## 2.3. Methods for time measurements

### 2.3.1. Direct counting

The direct counting method is the simplest way to measure time intervals: it counts the number of clock cycles between the start and stop events, as the coarse counter in Fig. 2.7 does. Timers in DSPs, CPUs, and MCUs are built with this method. The resolution of a direct-counting TDC is determined by the clock frequency of the counter, as one LSB equals one clock period, whereas the number of counter bits sets the measurement range. The resolution can therefore only be improved by increasing the clock frequency.
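A minimal model of direct counting is sketched below. It assumes that the start event is aligned with a clock edge and that the counter simply counts whole clock periods; in this model the quantisation step is one clock period and the counter width only bounds the longest measurable interval. The 500 MHz clock and 8-bit counter are hypothetical values.

```python
def direct_count(delta_t_ns: float, f_clk_hz: float, n_bits: int):
    """Return (count, measured interval in ns) for a simple clock-cycle counter."""
    t_clk_ns = 1e9 / f_clk_hz              # one LSB equals one clock period
    count = int(delta_t_ns // t_clk_ns)    # whole clock cycles within the interval
    if count > 2 ** n_bits - 1:
        raise OverflowError("interval exceeds the counter's measurement range")
    return count, count * t_clk_ns


# Hypothetical 500 MHz clock (2 ns LSB) and 8-bit counter.
print(direct_count(13.7, 500e6, 8))
```

Under these assumptions, the 13.7 ns interval is reported as 12 ns, a quantisation error of less than one clock period.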

One advantage of a TDC with direct counting is its simplicity and low cost, as it only requires a counter and basic digital logic. Additionally, direct counting TDCs are typically faster than other TDC techniques, as they require minimal time for data processing and are not affected by jitter. However, direct counting TDCs are limited in resolution, because the resolution cannot be finer than the clock period of the counter. They also suffer from errors due to clock skew, which can lead to inaccuracies in the measured time interval.

The TDCs with the direct counting method are still popular in some applications where the required resolution is low (> 2 ns), because of the simple structure and low resource consumption. However, the limited resolution cannot meet the requirements of applications such as PET-CT and FLIM.

### 2.3.2. Interpolation

#### a) Tapped delay line

Interpolation is a method to achieve better resolution beyond the limit caused by reference clocks [60]. It uses accumulated delays caused by signal propagation to divide a longer, measurable time interval.

In FPGAs, all combinational logic (e.g., MUXes and logic gates) and internal wire connections can be used to interpolate time intervals. Carry-chain elements are well-established and carefully routed components for building arithmetic functions, containing a series of MUXes and XORs [61]. As carry-chain elements are available in most modern FPGAs, such as CARRY4 (CY4) in 28, 40, and 45 nm Xilinx FPGAs, CARRY8 (CY8) in 20 nm and more advanced Xilinx FPGAs, and CARRY\_SUM in Intel FPGAs [48], the tapped delay line (TDL) structure has become the primary approach in FPGA-TDCs [59], [62]–[66].



Figure 2.8 Concepts of a) the tapped delay line and b) the Vernier delay line.

Figure 2.8a shows the general architecture of TDL-TDCs. Carry-chain elements construct a delay line, and start/hit signals propagate through it. Once the stop/clk signal is valid, the registers (D-type flip-flops, DFFs) capture the status of the delay line as a thermometer code (e.g., "111110...000"; unlike binary coding, the number of '1's in a thermometer code increases as the input value increases). If logic "1" has propagated through *M* delay elements and is captured by *M* registers, the measured time interval is *t*, and the resolution is the average delay of the delay elements ( $\tau$ ), see Eqs (2.7) and (2.8).

$$t = M \times t_{LSB}, \tag{2.7}$$

$$t_{LSB} = \tau. \tag{2.8}$$
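Eqs (2.7) and (2.8) amount to counting the taps that the hit signal has reached. The sketch below is a simplified decoder that counts the '1's in the captured code; this ones-counting approach is one possible choice rather than the encoder used in this thesis, and the 10 ps average tap delay is a hypothetical value.

```python
def thermometer_to_time_ps(code: str, tau_ps: float) -> float:
    """Convert a captured thermometer code to a time interval (Eqs 2.7 and 2.8)."""
    if set(code) - {"0", "1"}:
        raise ValueError("code must contain only '0' and '1'")
    m = code.count("1")      # number of delay elements the hit has passed (M)
    return m * tau_ps        # t = M * t_LSB, with t_LSB = tau


# Hypothetical 10-tap delay line with an average tap delay of 10 ps.
print(thermometer_to_time_ps("1111100000", tau_ps=10.0))  # 50.0
```

Counting ones, instead of locating the single 1-to-0 transition, also gives a defined result when the captured code is not a perfect thermometer code.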

#### b) Vernier method

In 1631, Pierre Vernier invented the Vernier calliper, a calliper with a secondary scale, improving measurement accuracy by using mechanical interpolation. Similar to the Vernier calliper, Vernier methods have been introduced to TDCs. The structure of the Vernier delay line (VDL) is shown in Fig. 2.8b. It uses two types of delay elements with different propagation delays ( $\tau_1$ ,  $\tau_2$ ), and  $\tau_1 > \tau_2$ . Figure 2.9a presents the signal propagation in the VDL. In the beginning, the TI between the start and stop signals is  $t_x$ . The gap gradually narrows during propagation because the stop signal's transmission is faster than the start signal. Meanwhile, the DFFs capture propagation status, and then the TI is measured. In the VDL method, the resolution is  $t_{LSB} = \tau_1 - \tau_2$ . However, this method requires two types of delay elements, consuming more logic resources and requiring specific logic elements.
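The catching-up process in the VDL can be modelled in a few lines. In this idealised sketch (uniform delays, no jitter, illustrative names and values), the start edge reaches stage *k* at $k \tau_1$ and the stop edge at $t_x + k \tau_2$; the number of stages at which the start edge still arrives first, multiplied by $\tau_1 - \tau_2$, gives the measured interval.

```python
def vdl_measure(t_x_ps: int, tau1_ps: int, tau2_ps: int, n_stages: int):
    """Return (stages where start still leads, measured interval in ps)."""
    if tau1_ps <= tau2_ps:
        raise ValueError("the start line must be slower (tau1 > tau2)")
    lsb_ps = tau1_ps - tau2_ps
    leading = 0
    for k in range(1, n_stages + 1):
        start_arrival = k * tau1_ps            # slow line, launched at t = 0
        stop_arrival = t_x_ps + k * tau2_ps    # fast line, launched at t = t_x
        if start_arrival < stop_arrival:
            leading += 1                       # this DFF sees start before stop
        else:
            break                              # the stop edge has caught up
    return leading, leading * lsb_ps


# Hypothetical delays: tau1 = 50 ps, tau2 = 45 ps, so the LSB is 5 ps.
print(vdl_measure(t_x_ps=23, tau1_ps=50, tau2_ps=45, n_stages=64))  # (4, 20)
```

The 23 ps input is reported as 20 ps, within one LSB of 5 ps, although each individual delay element is ten times slower than that.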

Another variation of the Vernier method in TDCs uses two ring oscillators (ROs) to generate two periodic signals. The block diagram is shown in Fig. 2.9b. The frequency of RO1 is lower than that of RO2 ( $f_{RO1} < f_{RO2}$ ). A coincidence circuit is used to check the phases of the two periodic signals, and two counters measure the cycles of these signals. The timing diagram is demonstrated in Fig. 2.9c. The start and stop signals trigger the TDC. After many cycles, the rising edges of the periodic signals are aligned. Then, the TI can be expressed as:


Figure 2.9 a) Signal propagation in the delay element-based VDL. The ring oscillator-based Vernier TDC's b) block diagram and c) timing diagram. The rising edge of the fast signal (start or stop) catches up with the rising edge of the slow signal (start or stop), forming a Vernier interpolation. Therefore, the time interval is measured with a finer resolution.

$$t_x = N_1 \cdot T_1 - N_2 \cdot T_2, \tag{2.9}$$

$$t_{LSB} = T_1 - T_2 = \frac{1}{f_{RO1}} - \frac{1}{f_{RO2}}. \tag{2.10}$$
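The coincidence search behind Eqs (2.9) and (2.10) can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the start signal launches the slow oscillator RO1, the stop signal launches the fast oscillator RO2, the interval is shorter than one RO1 period, and times are integer femtoseconds. The function name and the example periods are hypothetical.

```python
def ro_vernier_measure(t_x_fs: int, t1_fs: int, t2_fs: int):
    """Behavioural model of an RO-based Vernier measurement.

    t_x_fs : true start-to-stop interval (assumed 0 <= t_x < T1)
    t1_fs  : period T1 of the slow oscillator RO1 (started by 'start')
    t2_fs  : period T2 of the fast oscillator RO2 (started by 'stop')
    Returns (N1, N2, estimated interval), following Eq. (2.9).
    """
    if not t1_fs > t2_fs > 0:
        raise ValueError("RO1 must be slower than RO2 (T1 > T2 > 0)")
    n = 1
    # The n-th RO1 edge occurs at n*T1, the n-th RO2 edge at t_x + n*T2.
    # Count cycles until the fast edge has caught up with the slow edge.
    while t_x_fs + n * t2_fs > n * t1_fs:
        n += 1
    n1 = n2 = n
    return n1, n2, n1 * t1_fs - n2 * t2_fs      # Eq. (2.9)


# Assumed example: T1 = 1000 ps, T2 = 990 ps, so t_LSB = 10 ps (Eq. (2.10)).
n1, n2, estimate = ro_vernier_measure(123_000, 1_000_000, 990_000)
print(n1, n2, estimate)
```

In this model the number of cycles grows roughly as the interval divided by the resolution, which is consistent with the longer deadtime of the RO-based method noted in the comparison that follows.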

Table 2-1 lists the pros and cons of the mainstream interpolation methods. Compared to TDL-TDCs, the VDL-TDC reaches better resolution (beyond the delay of delay elements). However, it requires different delay elements to construct a VDL. As a result, the VDL-TDC is more suitable for ASIC design. On the contrary, in the RO-based Vernier method, RO can be implemented using a few carry-chains, and the frequency is changeable by modifying the number of delay elements [67], [68]. Therefore, the RO-based Vernier method is an excellent way to build a resource-saving TDC in FPGAs. However, the linearity and precision highly rely on the stability of the ROs, and a longer deadtime is unavoidable if a better resolution is wanted [69]. Moreover, manual adjustments for the RO period difference are needed during the design process [70].

| Type             | Pros                                 | Cons                                                                                          |
|------------------|--------------------------------------|-----------------------------------------------------------------------------------------------|
| TDL              | • Easy to implement                  | • High resource consumption (more than hundreds of registers and delay elements per channel) |
|                  | • Short deadtime (typically 2–5 ns)  |                                                                                               |
| VDL              | • Better resolution                  | • Requires specific logic elements                                                            |
| RO-based Vernier | • Easy to implement                  | • Long deadtime (> 20 ns)                                                                     |
|                  | • Resource-saving                    | • Manual intervention                                                                         |

 Table 2-1 Comparison between mainstream interpolation methods

# 2.4. Challenges and developments

## 2.4.1. Challenges

#### a) Ultra-wide bin problems

As discussed in Sec. 2.3.2, compared to the RO-based Vernier method, the TDL method performs better because of the acceptable resource consumption and the low deadtime. However, ultra-wide bin problems constitute a primary concern in TDL-TDCs. Ultra-wide bins are bins that are much wider than the other bins in a delay line, and they degrade the linearity significantly. Factors like clock skew, temperature, and boundaries between clock regions can lead to ultra-wide bins.

The extra delay introduced by the boundary between clock regions is the main reason for ultra-wide bin problems [71]. In FPGAs, clock signals are distributed through dedicated clock routes. Figure 2.10a shows an example of an FPGA's clock network. With these clock routes (highlighted in blue in Fig. 2.10a), an FPGA can be divided into several clock regions [72]. A clock region contains many configurable logic blocks (CLBs), and each CLB has several LUTs, MUXes, registers, and a carry-chain, as shown in Fig. 2.10b. The carry-chain is a lookahead carry logic containing MUXes and XORs (see Fig. 2.10c). It provides a single-bit fast addition and can be cascaded to perform multi-bit operations [73]. In Fig. 2.10c, to let the signal from *Cin* pass through the line, *DI* is set to "000...000" and *Sel* is set to "111...111". The cascaded carry-chains within a clock region are suitable for implementing a delay line because the element delays are nearly uniform, with only a small deviation. However, in FPGAs that run at a low clock frequency, the total delay of the TDL must cover the period of the sampling clock to avoid interpolation loss [71]. This requires a longer TDL that crosses the boundary between clock regions, which introduces an extra delay and forms an ultra-wide bin.



Figure 2.10 a) An example of an FPGA's clock network. The block diagrams of b) a CLB and c) a carry-chain.

#### b) Bubble problems

For delay line-based methods, the raw data is encoded in thermometer codes, for example, "111...111100...000" and "000...000111...111". Bubble problems are the unexpected transitions of the logic status, such as logic "1" surrounded with logic "0" and vice versa, and will result in encoding errors [65], [74].



Figure 2.11 Example of bubbles in the TDL structure.

Figure 2.11 illustrates the cause of bubble problems. Due to the clock network (as shown in Fig. 2.10a), the clock signal experiences different propagation delays before it arrives at each DFF. For example, DFF2 (highlighted in blue in Fig. 2.11) captures the TDL status earlier than the other DFFs because DFF2 is located near the clock tree node. Because of this mismatch, there is a probability that DFF2 does not store the data as expected, and a bubble appears.

#### c) PVT variations

Due to process, voltage, and temperature (PVT) variations, the element delays vary under different conditions. An FPGA device can have a stable voltage when it is integrated with a well-designed power supply and filter. However, the process and temperature variations affect TDC performance significantly.

Even with an advanced manufacturing technique, cells in FPGAs are still uneven, decreasing the linearity of the TDC. Although TDCs in high-end FPGAs can achieve a better resolution due to the better CMOS process (short cell delays), smaller transistors (width and length of gates) lead to more thermal noise [75] and therefore affect TDCs' precision.

The temperature condition affects the signal propagation in FPGA devices and TDCs' resolution. A report from Szplet *et al.* concluded that from -20 °C to +60 °C, the value of the resolution increases by 0.5 ps/°C, degrading the timing resolution [76]. When operating an FPGA-TDC in a cryogenic environment, a dummy carry is a must (to calibrate the clock delay), and other electronic components in the printed circuit board (PCB) should be carefully selected [77].

## 2.4.2. Structures for FPGA-TDCs

Many structures and methods were proposed to improve the FPGA-TDCs' performance further. In this subsection, some advanced FPGA-TDC structures in the recently published papers will be listed.

#### a) Multi-chain TDCs

The structure of the multichain TDC [63], [78]–[81] is shown in Fig. 2.12. In this method, the TI is measured by several chains. In Fig. 2.12, there are N TDLs, indexed from 0 to N-1. Multiple TDLs construct the chains, and the resolution of each TDL is  $\tau_d$ . Between each chain, there is a slight delay caused by wires and buffers (highlighted in yellow in Fig. 2.12).



Figure 2.12 The structure of the multichain TDC [78].

For each chain, the measured TI can be expressed as:

$$t(i) = t_c(i) - \tau(i), \tag{2.11}$$

where  $t_c(i)$  is the timestamp generated by the *i*-th chain, and  $\tau(i)$  is the delay caused by wires and buffers between the chains. The final time interval can be expressed:

$$t = \frac{1}{N} \sum_{i=0}^{N-1} (t_c(i) - \tau(i)). \tag{2.12}$$

Therefore, the resolution of the multichain structure is  $\frac{\tau_d}{N}$ .
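The averaging in Eqs (2.11) and (2.12) can be sketched with a toy model. The assumptions here are not from the cited designs: four chains, a 10 ps chain resolution, evenly spaced inter-chain offsets, and a simple floor quantiser. All names are hypothetical.

```python
import math
from statistics import mean


def chain_timestamp(t_ps, offset_ps, tau_d_ps):
    """Timestamp t_c(i) of one chain: the hit arrives offset_ps later and is
    quantised to the chain resolution tau_d_ps (simple floor model)."""
    return math.floor((t_ps + offset_ps) / tau_d_ps) * tau_d_ps


def multichain_interval(timestamps_ps, offsets_ps):
    """Apply Eq. (2.11) to each chain, then average as in Eq. (2.12)."""
    if not timestamps_ps or len(timestamps_ps) != len(offsets_ps):
        raise ValueError("need exactly one offset per chain")
    return mean(tc - tau for tc, tau in zip(timestamps_ps, offsets_ps))


TAU_D = 10.0                      # assumed single-chain resolution (ps)
OFFSETS = [0.0, 2.5, 5.0, 7.5]    # assumed inter-chain delays tau(i) (ps)


def measure(t_ps):
    stamps = [chain_timestamp(t_ps, off, TAU_D) for off in OFFSETS]
    return multichain_interval(stamps, OFFSETS)


# A single chain reports 20 ps for both inputs; the average separates them.
print(measure(23.0), measure(26.0))
```

In this idealised model the two outputs differ by 2.5 ps, i.e. the chain resolution divided by the number of chains. Real inter-chain delays are not evenly spaced, which is one reason per-chain calibration is needed.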

The multichain structure can improve the resolution significantly. The latest multichain TDC [80] achieves a 0.3 ps resolution. However, compared with the single-chain TDL, the resource consumption of the multichain structure grows linearly (N-fold). Moreover, the non-linearities from the N chains accumulate and degrade the final precision. Calibration modules for each individual chain are necessary to obtain better precision.

#### b) Tuned-TDL TDCs

As shown in Fig. 2.10c, in both Xilinx and Intel FPGA devices, carry-chains have two output ports: CARRY\_OUT (C) and SUM\_OUT (S). The signal propagation delay differs between the two ports. Won and Lee used this feature to tune the TDL and improve linearity [62]. In their study, they examined 28 nm (Kintex-7), 40 nm (Virtex-6), and 45 nm (Spartan-6) Xilinx FPGA devices and suggested that the output pattern "SCSC" performs the best for devices whose carry-chains are CARRY4 (CY4, four pairs of C and S ports). In high-end FPGAs, for example, Xilinx UltraScale (20 nm) and UltraScale+ (16 nm) FPGAs, the carry-chains are CARRY8 (CY8, eight pairs of C and S ports). Chen and Li [66] set tap timing tests for 16 ports and built a 96-channel TDC integrated with the tuned-TDL structure in UltraScale FPGAs. The proposed TDC [66] achieves averaged  $DNL_{pk-pk} = 0.27$  LSB and  $INL_{pk-pk} = 0.59$  LSB.

#### c) Multi-phase TDCs

Ideally, in a TDL-TDC, the delays of all delay cells are the same, and the FFs sample the TDL's status simultaneously. However, in FPGAs, it is impossible to meet this requirement due to the clock network, resulting in high nonlinearity, bubble errors, and ultra-wide bin problems. To ease these effects, the multi-phase structure was proposed [82].

Figure 2.13 compares the structure of the single-phase TDC and multi-phase TDC. *D* is the number of used delay elements. Assuming that a single-phase TDC is running at 200 MHz, the total delay of the corresponding TDL should be larger than 5 ns. However, for the 2-phase TDC in Fig. 2.13b, TDLs are truncated and duplicated. Each TDL only needs to cover half of the clock cycle. Therefore, ultra-wide bin problems are smoothed out because of the short delay lines, and the non-linearity caused by clock skews is eased [59], [82]. The latest multi-phase TDC achieves a 1.56 ps resolution [83].



Figure 2.13 A comparison between a) the single-phase TDC and b) multi-phase TDC.
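The length argument above can be put into numbers with a short sketch. The 5 ps average element delay and the helper name `taps_per_tdl` are assumptions chosen for illustration only.

```python
import math


def taps_per_tdl(f_clk_mhz: float, n_phases: int, tau_ps: float) -> int:
    """Minimum number of delay elements D in each TDL.

    Each TDL must cover 1/n_phases of the sampling clock period;
    tau_ps is the average delay per element (an assumed value).
    """
    period_ps = 1e6 / f_clk_mhz                 # 200 MHz -> 5000 ps
    return math.ceil(period_ps / n_phases / tau_ps)


single = taps_per_tdl(200.0, n_phases=1, tau_ps=5.0)   # one long TDL
dual = taps_per_tdl(200.0, n_phases=2, tau_ps=5.0)     # two shorter TDLs
print(single, dual)
```

Under these assumptions, each TDL of the 2-phase design needs half as many elements as the single-phase TDL, so it is easier to keep within one clock region.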

The multi-phase method can reduce clock skews and clock frequency by increasing the number of phases. However, multiple phases increase the complexity of the target system and may introduce timing errors [59]. Therefore, the number of phases should be carefully selected according to the target platform and specifications.

#### d) Pulse shrinking TDCs

Rahkonen and Kostamovaara [60] first proposed the pulse shrinking method. They found that the width of a pulse will be narrowed when it transmits through some specific elements. With this feature, pulse shrinking TDCs can be implemented on customised CMOS devices [60], [84].

Figure 2.14a shows the structure of the pulse shrinking TDC. The pulse forming circuit generates a pulse based on the start and stop signals. A pulse shrinking delay line (PSDL) and an OR gate construct a loop. Once the pulse is undetectable after *n* cycles (see the mechanism in Fig. 2.14b), the measured time interval is nR, where the resolution (R) is the pulse width narrowed by the PSDL in each cycle. Szplet and Klepacki applied this method to FPGA-TDC design [85], [86]. In FPGAs, the propagation speeds of the rising and falling edges are different. Basic logic resources can be used to build a PSDL in FPGAs, and the proposed pulse shrinking FPGA-TDC achieves 41.8 ps resolution in Spartan-3 FPGAs [86].



Figure 2.14 a) The structure of the pulse shrinking TDC. b) The pulse shrinking mechanism.

Like the RO-based Vernier TDC, the pulse shrinking TDC consumes few logic resources. However, the deadtime increases linearly if a high resolution is required (more cycles). Furthermore, lacking fully customised logic elements, the achievable resolution of the FPGA-based pulse shrinking TDC is limited compared to ASIC designs (2 ps in 180 nm ASICs [87] and 41.8 ps in 90 nm FPGAs [86]).

#### e) DSP based TDCs

To boost on-chip computing power, especially for floating-point numbers, DSP modules are integrated into FPGAs. Figure 2.15 shows a simplified structure of the DSP48 module. It contains three major components: a pre-adder for fast fixed-point addition and subtraction operations, a multiplier, and an arithmetic logic unit (ALU) for complex operations [88].



Figure 2.15 The simplified structure of the DSP48 module [89].

The reports [89]–[91] indicate that the ALU and pre-adder can perform addition operations. Therefore, like carry-chain modules, they can be used to build a delay line. Moreover, the resolution (LSB = 4.23 ps) of the ALU-based DSP-TDC is better than that of the pre-adder-based DSP-TDC (LSB = 8.12 ps) in Kintex-7 FPGAs.

The bit width of a DSP is limited. To increase the MR, multiple DSPs need to be cascaded. However, the DSP connection introduces extra wire delays and results in severe ultra-wide bin problems [89]. Moreover, compared with other resources like carry-chains and LUTs, DSPs in FPGAs are limited. It is therefore advisable to reserve DSPs for functions that need more computing power.

#### f) Large-scale parallel routing

Large-scale parallel routing (LSPR) [92] or delay matrix [93] uses wire delays to interpolate the time interval. Figure 2.16a presents the ideal propagation path through wires. Ideally, the signal propagates through the wires at a fixed speed, and the delays between node A and  $B_n$  meet Eq. (2.13):

$$D_{A-B_2} - D_{A-B_1} = D_{A-B_3} - D_{A-B_2} = D_{A-B_{i+1}} - D_{A-B_i}. \tag{2.13}$$

In FPGAs, multilength wires are provided to balance area, flexibility, and delays. Zhang *et al.* examined 20 nm, 28 nm, and 40 nm FPGAs and reported four types of wires with different lengths [92], see Fig. 2.16b. Signals propagate through wires, and registers capture the status. The implemented TDCs can achieve 5.5 ps resolution in Virtex-6 FPGAs, 1.29 ps resolution in Kintex-7 FPGAs, and 3.95 ps in UltraScale FPGAs [92]. Although this method only consumes a few logic resources, it occupies many CLBs and is unsuitable for multichannel applications [92], [93].



Figure 2.16 a) Ideal propagation path through wires. b) multiple-length lines in FPGAs [92].

#### g) Gray code oscillator TDCs

Loops in pure-combinational logic are not allowed because feedback delays for multiple bits vary from path to path, resulting in encoding errors and making the system unstable. As sequential logic elements, registers are necessary to slow down and stabilise the system. An exception is a gray counter. Figure 2.17a shows an example of the gray code transition. The possibility of an encoding error is reduced because only one bit changes in each state transition. Therefore, gray code counters (see Fig. 2.17b) are widely applied in digital systems.



Figure 2.17 a) An example of the gray code transition. The structures of b) the gray code counter and c) gray code oscillator TDC.
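The single-bit-change property can be checked with the standard binary-reflected gray code. The snippet is a software illustration only; the helper names are hypothetical, and it does not model the LUT-based oscillator itself.

```python
def binary_to_gray(n: int) -> int:
    """Binary-reflected gray code of a non-negative integer."""
    return n ^ (n >> 1)


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


codes = [binary_to_gray(i) for i in range(8)]
print([format(c, "03b") for c in codes])

# Every transition, including the wrap-around from the last state to the
# first, changes exactly one bit.
assert all(bits_changed(codes[i], codes[(i + 1) % 8]) == 1 for i in range(8))
```

Because only one bit toggles per state, a register that samples the code during a transition can be wrong by at most one state, which is the property the GCO TDC relies on.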

In 2019, Wu and Xu proposed a new TDC structure called gray code oscillator (GCO) TDC [94]. The method contains two parts: GCO (highlighted in the red line) and DFF. LUTs construct GCO. Without the restrictions caused by DFFs, the oscillating speed is only limited by the total delay caused by wires and LUTs. The DFFs are used to store the status of GCO. Then, the time interval is measured.

The GCO TDC saves logic resources significantly. It can encode  $2^n$  stages using *n* LUTs. However, FPGAs, e.g., UltraScale+, only provide 5-input LUTs and 6-input LUTs [73], limiting the performance of GCO TDCs. Moreover, uneven wire connections will affect TDCs' linearity. A careful manual adjustment is suggested to improve the linearity [95].

## 2.4.3. De-bubble method

Bubbles are the unexpected transitions of the logic status, such as logic "1" surrounded with logic "0". To assess the severity of bubble problems, "bubble depth", the maximum number of successive unexpected logic statuses, is defined. As bubbles will result in encoding and conversion errors, the bubbles should be removed before processing the raw data from the TDC. This section will study some previously reported methods that can remove bubbles.

#### a) Basic de-bubble circuits

Figures 2.18a and 2.18b present the circuits using AND gates and XOR gates to remove bubbles, respectively. However, these circuits can only remove 1-bit depth bubbles from thermometer codes.

AND gates and inverters can be used to build a bubble-proof circuit [96]–[98]. Figure 2.18c shows an example of this circuit. This presented circuit can locate the actual signal propagation by detecting the patterns "110" and "100" and then correctly convert the thermometer codes with bubbles to one-hot codes. Moreover, this circuit can be extended with more gates to remove multi-bit bubbles.



Figure 2.18 Bubble removal circuits: a) AND gate based, b) XOR gate based. c) an AND + inverter-based bubble-proof circuit.

Unfortunately, these methods will lead to zero-width bin problems and degrade the resolution. For example, in Fig. 2.18a, the 4-th bit of the raw data is the bubble, and the bubble will always exist due to the clock network. However, there will be a zero-width bin in the final timing histogram because the circuit simply ignores the bubble in Fig. 2.18a.

#### b) Bin realignment

The bin realignment method was proposed in [99]. The raw data output from a TDL is sent to a computer. A MATLAB program is used for figuring out the position of the bubbles. LUTs, where the correct data is stored, are used to readdress the raw data from the TDL. Therefore, the bubbles and zero-width bins are removed. However, because a PC is involved, the pre-processing is time-consuming.

#### c) Ones-counter method

The main concept of the ones-counter method is counting the total number of valid logic statuses, for example, logic "1", in thermometer codes [74]. This method was first proposed for flash ADCs in 2014 [100]. Then, Wang *et al.* introduced this method to FPGA-TDCs [74]. Unlike the methods shown in Figs 2.18a and 2.18b that can only remove 1-bit depth bubbles, the ones-counter method can remove multi-bit depth bubbles. Compared with the bubble-proof circuit in Fig. 2.18c, the ones-counter method can remove bubbles without losing resolution. However, in FPGAs, the input ports of a LUT are limited (5 or 6 ports). Therefore, the ones-counter method needs a multi-stage pipelined structure to encode the raw data from the TDL, requiring extra logic resources and processing time.
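The difference between a position-based encoder and the ones-counter can be sketched in software. This is an illustration under assumptions: the 16-bit codes are made up, and Python's `str.count` stands in for the pipelined adder tree that an FPGA implementation would need.

```python
def encode_by_transition(code: str) -> int:
    """Position-based encoder: one plus the index of the first '1'->'0' edge."""
    return code.index("10") + 1


def encode_by_ones_counter(code: str) -> int:
    """Ones-counter encoder: total number of '1's, wherever they are."""
    return code.count("1")


clean = "1111111000000000"      # hit has passed 7 taps, no bubbles
bubbled = "1111101010000000"    # same 7 sampled '1's, but out of order

for name, code in (("clean", clean), ("bubbled", bubbled)):
    print(name, encode_by_transition(code), encode_by_ones_counter(code))
```

On the clean code both encoders agree. On the bubbled code the position-based encoder stops at the first bubble, while the ones-counter still reports seven.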

#### d) Sub-TDL structure

Bubble errors happen due to the mismatches caused by clock skews. In high-end FPGAs, bubble problems become severe because the fast signal propagation makes the whole system more vulnerable to clock skews.

The sub-TDL structure [66] (called the decomposition method in [65]) is an intelligent solution that can remove bubbles without extra logic resources. The term "bubble area" is a particular digital sequence that contains bubbles, and the term "bubble depth" is the number of bubbles in the "bubble area" [65]. The sub-TDL structure aims to remove bubbles by mapping the raw TDL into different sub-sequences; the number of sub-sequences should be larger than the maximum bubble depth.

Figure 2.19 is an example of the sub-TDL structure. The observed bubble number is three. Hence, it is assumed that the maximum bubble depth is three. Then, the TDL, whose resolution is  $\tau$ , is constructed from CY4s and divided into four groups by wire connections. The four sub-TDLs (each with a resolution of  $4\tau$ ) are bubble-free without extra logic resources. Summing the sub-codes restores the resolution of the sub-TDL structure to  $\tau$ , similar to the multichain averaging structure.



Figure 2.19 The sub-TDL structure for CY4.
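A software sketch of the decomposition idea follows. It assumes that tap *k* is wired to sub-TDL *k* mod 4, which is one plausible reading of the grouping and is not confirmed by the figure; the raw code and the function names are made up for illustration.

```python
def split_sub_tdls(code: str, n_sub: int) -> list:
    """Assign tap k to sub-TDL (k mod n_sub); assumed interleaved wiring."""
    return [code[k::n_sub] for k in range(n_sub)]


def is_thermometer(code: str) -> bool:
    """True if the code is '1's followed by '0's, i.e. contains no bubble."""
    return "01" not in code


def encode_sub_tdl(code: str, n_sub: int) -> int:
    """Encode each bubble-free sub-code by its transition position and sum."""
    subs = split_sub_tdls(code, n_sub)
    if not all(is_thermometer(s) for s in subs):
        raise ValueError("bubble depth exceeds what n_sub sub-TDLs can absorb")
    return sum(s.index("0") if "0" in s else len(s) for s in subs)


raw = "1111101010000000"              # raw TDL code containing bubbles
print(split_sub_tdls(raw, 4))         # each sub-code is a clean thermometer code
print(encode_sub_tdl(raw, 4))
```

For this input the summed sub-codes equal the number of '1's in the raw code, so the fine resolution is kept while each sub-code can be encoded by a simple transition detector.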

## 2.4.4. Calibration methods

Conversions between the digital output and analogue input result in errors, as shown in Fig. 2.20a. Due to device imperfection, the nonlinearity will add more errors to the whole system. Hence, calibration is essential in TDC designs.



Figure 2.20 a) Conversion between the analogue input and digital output. b) Quantization errors caused by the conversion.

Factors, such as temperature, device offset, quantization error (Q), and nonlinearities will introduce measurement errors into TDC systems. To cancel these errors, calibration is performed case by case.

In most industrial applications, the working temperature ranges from -20 °C to 60 °C. Because of the particles' thermal motion, signal propagation speeds change when the temperature rises. Therefore, the TDC's averaged bin size is not fixed but predictable [79]. LUTs that store temperature coefficients can correct the changes.

Due to inner delays, for example, delays caused by IO ports, there will be a fixed difference between the measured time interval and the ground truth. This offset could be cancelled in the final stage of data processing.

Quantization errors (Q) occur during the signal conversions, see Fig. 2.20b. Bin-by-bin calibration [101] is the most efficient way to reduce Q. In this method, bins are calibrated to their centre. For a bin with lower and upper boundaries,  $t_1$  and  $t_2$ , the quantisation error (Q) can be represented by Eq. (2.14):

$$\begin{aligned}
Q^{2} &= \frac{1}{t_{2} - t_{1}} \int_{t_{1}}^{t_{2}} (t - t_{out})^{2} \, dt \\
&= \frac{1}{t_{2} - t_{1}} \cdot \frac{1}{3} \cdot (t - t_{out})^{3} \Big|_{t_{1}}^{t_{2}} \\
&= \frac{1}{3(t_{2} - t_{1})} \cdot \{(t_{2} - t_{out})^{3} - (t_{1} - t_{out})^{3}\} \\
&= \frac{1}{3(t_{2} - t_{1})} \cdot (t_{2} - t_{1}) \{3t_{out}^{2} - 3(t_{1} + t_{2})t_{out} + t_{1}^{2} + t_{1}t_{2} + t_{2}^{2}\},
\end{aligned} \tag{2.14}$$

where *t* is the input time interval and  $t_{out}$  is the converted output. According to Eq. (2.14),  $Q^2$  reaches its minimum,  $\frac{(t_2 - t_1)^2}{12}$ , when  $t_{out} = \frac{1}{2}(t_2 + t_1)$ , i.e. when the output is the bin centre. To achieve real-time onboard calibration, LUTs are used in [102]–[104] to remap bins to their centres.
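A bin-centre table of the kind such a LUT would hold can be sketched as follows. This is a minimal sketch and not the implementation of [101]–[104]: it assumes bin widths are estimated from code density counts, and the counts, the 100 ps range, and the function names are illustrative.

```python
def bin_centres(counts, measurement_range_ps):
    """Estimate each bin's centre from code density counts.

    A bin's width is taken as proportional to its share of the events.
    """
    total = sum(counts)
    centres, edge = [], 0.0
    for c in counts:
        width = measurement_range_ps * c / total
        centres.append(edge + width / 2)     # remap the bin to its centre
        edge += width
    return centres


def q_squared(t1, t2, t_out):
    """Mean squared quantisation error of one bin (integrated form of Eq. (2.14))."""
    return ((t2 - t_out) ** 3 - (t1 - t_out) ** 3) / (3 * (t2 - t1))


print(bin_centres([100, 300, 200, 400], measurement_range_ps=100.0))
print(q_squared(0.0, 12.0, 6.0), q_squared(0.0, 12.0, 0.0))
```

For a 12 ps wide bin, reporting the centre gives a mean squared error of 12 ps², i.e. the width squared over 12, whereas reporting the lower edge gives four times that value.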

In some scientific applications, nonlinearities are ignored. In a TCSPC histogram (see Fig. 2.21), the central bin of the distribution can be treated statistically as the measured time interval, and the measurement error can be reduced further with bin-by-bin calibration. However, industrial applications use the peak bin as the measured time interval because of the limited computing power. In this scenario, errors introduced by nonlinearities become pronounced. Bin N in Fig. 2.21 is a good example. Because of its large DNL (a higher detection probability), Bin N records more events than any other bin, so the measurement is inaccurate. Moreover, the histogram distortion caused by the large DNL degrades TDCs' precision (the concentration of the histogram). Meanwhile, as the accumulation of deviations, the INL affects the TDC's accuracy directly. Therefore, nonlinearity calibration is also essential in TDC designs. In the following chapters, the calibration methods will be explained in detail.



Figure 2.21 Errors caused by nonlinearities.

# 2.5. Summary

In this chapter, the different platforms (ASIC and FPGA) have been analysed in terms of TDC implementation. The advantages and disadvantages are summarised in Table 2-2. Due to the affordability and short development cycle, FPGA-TDCs have become more attractive in fast prototyping and scientific instruments.

| Platforms | Pros                      | Cons                     |
|-----------|---------------------------|--------------------------|
| ASIC      | • High resolution         | • Long development cycle |
|           | • High linearity          | • High one-off cost      |
|           | • High reliability        |                          |
|           | • Low power consumption   |                          |
| FPGA      | • Comparable resolution   | • Worse linearity        |
|           | • Affordable              | • High power consumption |
|           | • Short development cycle |                          |

Table 2-2 TDCs in ASICs and FPGAs

The resolution, precision, and linearity are the most critical parameters for TDCs. TDL and VDL structures are the basic FPGA-TDC structures. Architectures like the multichain TDC [63], [78]–[81], multiphase TDC [82], and RO Vernier TDC [69] are proposed to improve performances further. Researchers use DSPs [90], LUTs [94], and routing resources [92] to construct a delay line and then build a TDC to make the most of FPGAs.

Challenges exist in FPGA-TDC designs. TDLs should be constrained within a clock region to avoid ultra-wide bin problems. Many circuits and structures (e.g., bin realignment and sub-TDL) are proposed to remove bubbles. Besides, the bin-by-bin calibration can reduce the quantisation error efficiently.

# Chapter 3. High resolution wave union TDC in 20 nm FPGA

# 3.1. Motivation

To meet the demands for high-precision bioimaging or biosensing techniques, such as PET-based computed tomography (CT), Raman spectroscopy, and NIR-based oximetry [14], [21], [105], [106], a high-resolution TDC is desired. For early cancer diagnosis, tumours (typically < 5 mm in size) are detectable if a detector can achieve a resolution in picoseconds [107], [108]. To build TDCs with an excellent resolution, researchers have proposed many methods. The multichain TDC proposed by Qin *et al.* reached 1.15 ps [79]. Sui *et al.* built a 1.56-ps resolution TDC [83]. A recently proposed VDL-TDC achieves a 2.50 ps resolution [109]. Although these methods efficiently improve TDCs' resolutions, the logic resource consumption increases significantly. In 2009, Wu and Shi proposed the wave union (WU) method [110]. This method can improve TDCs' resolution and ease the ultra-wide bin problems with extra but acceptable resource costs.

In an actual delay line, bins or delay elements are uneven due to many factors, such as manufacturing imperfections, clock skews, and PVT variations. Some bins have small delays (high resolution), but other bins have large delays (low resolution, highlighted in red in Fig. 3.1). Therefore, ultra-wide bins occur. An ultra-wide bin is a bin whose width is several times larger than the average bin width [71]. In this study, ultra-wide bins are the bins whose DNL > 2 LSB.



Figure 3.1 WU method's operating principle [111].

Ultra-wide bin problems have been explained in Chapter 2 and are troublesome in TDC designs, increasing TDCs' nonlinearity. The WU method is an efficient way to eliminate ultra-wide bin problems. Figure 3.1 shows the operating principle of the WU method. A union of signal wavelets, containing multiple rising (0-1 transition) and falling (1-0 transition) edges, is generated by a signal launcher and then fed into a delay line. During the signal propagation, some edges may fall into large bins. However, other edges will be captured by small bins. With multiple edges, large bins are sub-divided into small bins, and therefore, the linearity is maintained.

Wu and Shi summarized that the WU launcher has two different versions, with a finite step response (FSR) launcher (WU-A) or an infinite step response (ISR) launcher (WU-B) [110]. The WU-A launcher generates a pulse train with a limited number of rising and falling edges. However, similar to a ring oscillator, the WU-B launcher releases an infinite pulse train when activated. Figures 3.2a and 3.2b are examples of the WU-A and WU-B launchers, respectively.

In Fig. 3.2a, the launcher is built with a LUT and a delay buffer (a series of CY8). The WU signal with two edges is propagating through a single TDL. Then the edges are captured by DFFs individually, performing like multiple sampling. As a result, a new virtual TDL with a smaller averaged bin width (better resolution) has been constructed (see Fig. 3.2c). In the meantime, the equivalent small bins have eased ultra-wide bin problems. Therefore, the TDC's linearity is improved.



Figure 3.2 Examples of a) a LUT-based WU-A launcher and its truth table and b) a NAND-based WU-B launcher. The concepts of c) the WU-A method and d) the WU-B method.

The structure of the WU-B launcher is like a ring oscillator, activated by the hit signal, see Fig. 3.2b. Data processing in the WU-B method is more complex than that in the WU-A method [110]. The time of *n*-th oscillation edge t can be expressed as:

$$t = t_0 + t_{osc} \times n, \tag{3.1}$$

where  $t_0$  is the arrival time of the hit signal and  $t_{osc}$  is the period of the launcher's oscillation. Meanwhile, the oscillation edge will be captured *m* times in *m* system clock cycles whose period is  $t_{clk}$ . Therefore, *t* can be calculated as:

$$t = T_m(m) + t_{clk} \times m, \tag{3.2}$$

where  $T_m(m)$  is the measured value from the TDC. According to Eqs (3.1) and (3.2), the arrival time of the hit signal is:

$$t_0(m) = T_m(m) + t_{clk} \times m - t_{osc} \times n(m), \tag{3.3}$$

where n(m) is the number of the oscillating edges encoded at the *m*-th clock cycle, and n(0) = 0. The oscillation period  $t_{osc}$  is close to the system clock period  $t_{clk}$ , ensuring that the oscillation edge will be captured once in each sampling cycle and that not all measured events fall into an ultra-wide bin. The timing diagram is shown in Fig. 3.2d. After *m* clock cycles, the TDC performs *m* equivalent measurements, improving the resolution.
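Eq. (3.3) can be applied to a list of captured samples as in the sketch below. The numbers are synthetic and chosen for illustration (integer picoseconds, one oscillation edge captured per clock cycle), and the function name is hypothetical.

```python
from statistics import mean


def wu_b_arrival_estimates(t_m_ps, n_edges, t_clk_ps, t_osc_ps):
    """Apply Eq. (3.3) to every captured sample.

    t_m_ps  : TDC readings T_m(m) for clock cycles m = 0, 1, 2, ...
    n_edges : n(m), number of oscillation edges encoded up to cycle m
    """
    if len(t_m_ps) != len(n_edges):
        raise ValueError("need one n(m) for each T_m(m)")
    return [tm + t_clk_ps * m - t_osc_ps * n
            for m, (tm, n) in enumerate(zip(t_m_ps, n_edges))]


# Synthetic readings with a coarse 20 ps quantisation step (assumed).
readings = [1230, 1250, 1250, 1270]
estimates = wu_b_arrival_estimates(readings, [0, 1, 2, 3],
                                   t_clk_ps=5000, t_osc_ps=5010)
print(estimates, mean(estimates))
```

Each clock cycle contributes one estimate of the same arrival time, and averaging them gives a finer value than any single reading, which is how the repeated measurements improve the resolution.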

The WU-A and WU-B methods can achieve excellent resolution and ease ultra-wide bin problems. However, in the WU-B method, the launcher should be designed carefully by selecting the oscillation period and ensuring the stability of the ring oscillator. Generally, the WU-B method is more suitable for ASIC-TDCs, and the WU-A method is popular in FPGA. Hence, the following sections will focus on implementing WU-A TDCs in FPGAs.

The WU method has been studied in many reports [112]–[114]. A small interval between the target edges ensures the samplings are correlated and provides better efficiency. However, Szplet *et al.* found that the propagation speeds of rising and falling edges differ on Spartan-6 FPGAs [115]. The speed differences vary with CMOS processes [116] and become pronounced in high-resolution TDCs in more advanced CMOS technologies. Consequently, the traditional de-bubble method, removing bubbles by recognizing their positions, becomes inefficient. Therefore, a recently published study claimed that the WU method is unstable for UltraScale FPGAs [117] because of the severe bubble errors. Since then, no efficient WU TDC has been reported in UltraScale FPGAs.

However, in 2018, the sub-TDL structure [66] was proposed to remove bubbles and zero-width bins without extra logic resources. As the sub-TDL structure can remove bubbles entirely, an efficient WU TDC with a resolution near 1.0 ps is presented in this chapter.

# 3.2. Architecture and method

Figure 3.3a shows the general structure of the sub-TDL-based TDC system implemented in UltraScale FPGAs. It contains four parts: a TDL, an encoder, a calibration module, and a histogram module. An extra coarse counter is required if an extended measurement range is wanted. The TDL is tuned to improve the linearity [62] and re-wired based on the sub-TDL structure to remove bubbles [66]. The encoder performs thermometer-to-one-hot (TH2OH) and one-hot-to-binary (OH2BIN) operations. The calibration module can further enhance the linearity, and the histogram records the sampled time events and performs the TCSPC function.

Integrated with the WU-A method, the TDC will sample rising and falling edges, and the edges are processed individually. Therefore, extra modules for falling edges, such as sub-TDL structures and encoders, are required (highlighted in yellow in Fig. 3.3b), consuming more logic resources.

The following sections will present the structures and methods used to build a WU TDC with a resolution towards 1.0 ps in detail.



Figure 3.3 The block diagram of the proposed TDC system a) with the WU method and b) without the WU method.



Figure 3.4 Slices in a) 7-series FPGAs and b) UltraScale FPGAs.

## 3.2.1. CARRY8s and dual-sampling structures

Due to layout and placement strategies, the slices in UltraScale FPGAs differ from those in 7-series FPGAs. The carry logic in 7-series FPGAs is CY4, whereas that in UltraScale FPGAs is CY8. In 7-series FPGAs, the *C* and *S* outputs of a CY4 are selected by a MUX and then stored by DFFs (see Fig. 3.4a). However, in UltraScale FPGAs, each pair of C and S is accompanied by two DFFs, equivalent to adding more sampling taps physically. Wang and Liu used this feature to build a dual-sampling (DS) TDC in UltraScale FPGAs with a 2.25 ps resolution [117]. However, the linearity degrades, and bubble errors worsen when the resolution is enhanced.



Figure 3.5 a) DNL and INL curves for the 790-bin DS TDC. b) DNL and INL curves for the 390-bin DS TDC.

A DS TDC with sub-TDLs was implemented and tested, achieving 2.53 ps resolution. The linearity performance was measured by code density tests (see Fig. 3.5a). In code density tests, the TDC captures and records more than 1,000,000 events with random time intervals (TIs) and then forms a histogram of time stamps. The histogram reflects the TDC's linearity because the fed TIs are distributed evenly over the measurement range (MR). Statistically speaking, other minor errors (signal jitters) can be ignored with millions of samples. For a perfect conversion, the DNL and INL curves should be flat lines at zero, because each bin's DNL and INL are always 0. Large clock skews result in a distinct step in the INL curve (~Bin 450) due to the clock distribution tree's bifurcation [82]. Calibration methods, like bin-by-bin calibration, can suppress the nonlinearity caused by device imperfections but cannot remove the deviation caused by the clock network. An efficient solution to alleviate clock skews is the multi-phase structure, which truncates the length of each TDL [82]. The DNL and INL curves for a shorter TDL (around 390 bins) with the DS structure are shown in Fig. 3.5b, where the  $INL_{pk-pk}$  (peak-to-peak INL) has been improved to 9.80 LSB.
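The code density evaluation described above can be sketched in a few lines. This is a minimal illustration, not the processing chain used in this work: the function name `dnl_inl` and the four-bin histogram are hypothetical, and the INL is assumed to be the running sum of the DNL up to and including each bin, which is one common convention.

```python
def dnl_inl(counts):
    """Return per-bin DNL and INL (in LSB) from a code density histogram."""
    n_bins = len(counts)
    total = sum(counts)
    if n_bins == 0 or total == 0:
        raise ValueError("histogram must contain at least one hit")
    # Hits that an ideal, equal-width bin would collect.
    expected = total / n_bins
    dnl = [c / expected - 1.0 for c in counts]
    # Assumed convention: INL[n] is the running sum of DNL[0..n].
    inl, running = [], 0.0
    for d in dnl:
        running += d
        inl.append(running)
    return dnl, inl


# Hypothetical 4-bin histogram, for illustration only (not measured data).
counts = [100, 150, 50, 100]
dnl, inl = dnl_inl(counts)
print("DNL:", dnl)
print("INL:", inl)
```

With a uniform histogram both lists are all zeros, which corresponds to the flat DNL and INL curves of a perfect conversion.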

## **3.2.2.** Bubble errors introduced by the wave union method

As shown in Fig. 3.2c, with rising and falling edges, a WU TDC performs like a TDC with two delay lines (rising and falling). The resolution of the WU TDC is:

$$LSB_{wu} = \frac{MR}{N_{wu}} = \frac{MR}{N_{rising} + N_{falling}} = \frac{LSB_{rising} \times LSB_{falling}}{LSB_{rising} + LSB_{falling}}, \tag{3.4}$$

where MR is the measurement range of the delay line,  $N_{wu}$ ,  $N_{rising}$ , and  $N_{falling}$  are the total bin numbers of the WU TDC, the plain TDC with the rising sampling, and the plain TDC with the falling sampling, respectively.

| Ref.               | Device     | Fast edge (ps) | Slow edge (ps) | WU signal (ps)    |
|--------------------|------------|----------------|----------------|-------------------|
| [116] <sup>1</sup> | Virtex-5   | 35.90          | 37.43          | 18.23             |
|                    | Artix-7    | 15.50          | 15.97          | 7.83              |
| [115]              | Spartan-6  | 15.20          | 16.10          | 7.82 <sup>2</sup> |
| This Work          | UltraScale | 4.79           | 5.08           | 2.46 <sup>2</sup> |

Table 3-1 Interpolation Efficiency on Resolution

<sup>1</sup> Averaged value in [116]. <sup>2</sup> Calculated values based on Eq. (3.4).

The interpolation efficiency of the WU method has been studied in many projects [99], [115], [116]. The propagation speeds of the rising and falling edges in different FPGAs were quantified in [115], [116] (summarized in Table 3-1). Similar tests were performed in UltraScale FPGAs by feeding the rising and falling edges to a TDL separately, and the impact was quantified. The test results confirm the speed difference between the two edges. Figure 3.6a illustrates the principle of the speed difference; the gap between the falling and rising edges changes as the WU signal propagates through the TDL. Figure 3.6b shows the fitting curves between the bin number and the measured time for the edges; the edges use different numbers of taps to convert a fixed time interval. In these tests, the LSB of the slower edge is 5.08 ps, whereas the LSB of the faster edge is 4.79 ps (included in Table 3-1). According to Eq. (3.4), the ideal LSB of the WU signal, containing a pair of 0-1 and 1-0 transitions, should be 2.46 ps. Due to the speed difference, it becomes hard to predict the position of bubbles, making traditional bubble correction methods invalid [117].
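As a quick numerical check of Eq. (3.4), the ideal WU resolution can be computed from the two single-edge LSBs. The helper name `wu_lsb` is hypothetical; the input values are those of the UltraScale row in Table 3-1.

```python
def wu_lsb(lsb_rising, lsb_falling):
    """Ideal LSB of a two-edge wave union TDC according to Eq. (3.4)."""
    return (lsb_rising * lsb_falling) / (lsb_rising + lsb_falling)


# Single-edge LSBs from Table 3-1 (UltraScale row), in ps.
lsb = wu_lsb(4.79, 5.08)
print(f"Ideal WU LSB: {lsb:.3f} ps")
```

The result is close to half of either single-edge LSB, as expected when the two edges propagate at similar speeds.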



Figure 3.6 a) An example of the propagation speed difference between the rising and falling edges. b) The bin number versus the measured time fitting curves for the rising and falling edge signals.

## **3.2.3.** Wave union methods with the sub-TDL structure

The WU method was considered invalid in UltraScale FPGAs because of severe bubble errors and the speed difference between the rising and the falling edges. However, in 2018, researchers found that a physical TDL has a maximum bubble depth. Exploiting this phenomenon, the decomposition method [65] and the sub-TDL structure [66] were proposed, and it was found that bubbles can be removed by rewiring output taps into sub-groups. Using simulation analysis, Kwiatkowski and Szplet examined clock skews in Kintex-7 FPGAs and reported that the maximum clock skew is around 19 ps within a clock region [118]. The sub-TDL structure can minimize the effect of clock skews because the delays are elongated by rewiring and become larger than the maximum clock skew. Therefore, bubbles are removed. As a result, the WU method is still efficient in UltraScale FPGAs if the sub-TDL structure is applied. Note that the sub-TDL structure has to be modified if more edges are involved, for example, in the TDC using an 8-edge WU signal in [64]. The encoding process in [64] is complicated; however, it is beyond the scope of this study.

## **3.2.4. Encoding and Compensation Strategies**

Fetching the data from the sub-TDL, the encoder converts data into binary code. Figure 3.7 is the block diagram of the proposed encoder. The thermometer-to-binary (TM2BIN) process typically has two steps: TH2OH and OH2BIN. Considering that the two functions are implemented using combinational logic, they can be operated simultaneously and merged into one module (TM2BIN). The encoder assembles the binary codes and then outputs the final result.
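The two encoding steps can be illustrated with a small software model. This is only a behavioural sketch under stated assumptions: the thermometer code is assumed to be bubble-free (as provided by the sub-TDL structure) and ordered from the first tap to the last, and the function names are hypothetical. The hardware encoder is combinational logic and is not reproduced here.

```python
def thermometer_to_one_hot(thermo):
    """TH2OH: mark the last 1 before the 1->0 boundary of a thermometer code."""
    one_hot = []
    for i, bit in enumerate(thermo):
        # Treat the position after the last tap as 0 so a full line still encodes.
        nxt = thermo[i + 1] if i + 1 < len(thermo) else 0
        one_hot.append(1 if bit == 1 and nxt == 0 else 0)
    return one_hot


def one_hot_to_binary(one_hot):
    """OH2BIN: return the 1-based position of the set bit (0 if no tap fired)."""
    if sum(one_hot) > 1:
        raise ValueError("more than one transition: code is not bubble-free")
    for i, bit in enumerate(one_hot):
        if bit:
            return i + 1
    return 0


# Hypothetical 8-tap snapshot: the hit has propagated through five taps.
thermo = [1, 1, 1, 1, 1, 0, 0, 0]
one_hot = thermometer_to_one_hot(thermo)
code = one_hot_to_binary(one_hot)
print(one_hot, code)
```

A code containing a bubble, such as `[1, 0, 1, 0]`, produces two transitions and is rejected by this model, which is why bubble removal must precede a simple encoder of this kind.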




Figure 3.7 The block diagram of the encoder.

Due to the uneven TDL, the calibration module is essential to suppress the nonlinearity (see Fig. 3.8a, the actual TDL). A fast calibration approach, called the bin compensation strategy, was used to improve linearity, aiming to compensate the bins with a nominal bin width [66]. This method was integrated into the design. In this method, two factors, the main bin calibration factor ( $BCF_m$ , indicated by the black arrow in Fig. 3.8a) and the compensation bin calibration factor ( $BCF_c$ , indicated by the red arrow in Fig. 3.8a), are introduced to reassign the TDC's fine codes to the corrected codes.



Figure 3.8 Concepts of a) the compensation strategy and b) the binning method. c) Hardware implementation of the compensation method.

The address of the corrected bins can be calculated based on code density tests. T[k] can be defined as:

$$T[k] = \sum_{n=0}^{k-1} W[n] = \sum_{n=0}^{k-1} \{ \text{LSB} \times (\text{DNL}[n] + 1) \}, \tag{3.5}$$

where W[n] is the width of the *n*-th bin.  $BCF_m$  and  $BCF_c$  are calculated accordingly. For example, the uneven  $Bin_{actual}\ N-2$  collects a larger count proportional to its bin width; therefore, it is necessary to assign a proportion of the count to  $Bin_{ideal}\ N-1$  through  $BCF_c$ . This compensation strategy works well if the bin boundaries do not deviate from the ideal bin boundaries (highlighted in dashed lines in Fig. 3.8a). As each bin has at most one  $BCF_m$  and one  $BCF_c$ , with this compensation method  $Bin_{ideal}\ N+1$  receives no count allocation (due to the ultra-wide bin  $Bin_{ideal}\ N+1$ , as  $Bin_{ideal}\ N+1$  has already assigned its contribution to  $Bin_{ideal}\ N+2$  and  $N+3$ ), resulting in a missing code. This method can remap and compensate bins simultaneously without changing the resolution. A few missing codes can be ignored at the cost of a slightly degraded resolution, and the compensation process is simplified as the pseudocode below.

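$$\begin{array}{l}
\textbf{for } k = 1 : N \\
\quad \textbf{if } (T_{actual}[k] < T_{ideal}[k]) \\
\quad\quad \textbf{if } (T_{actual}[k+1] < T_{ideal}[k]) \\
\quad\quad\quad BCF_m = k - 1 \\
\quad\quad\quad BCF_c = \text{void} \\
\quad\quad \textbf{else if } (T_{actual}[k+1] > T_{ideal}[k]) \\
\quad\quad\quad BCF_m = k - 1 \\
\quad\quad\quad BCF_c = k \\
\quad \textbf{else} \\
\quad\quad \text{continue...}
\end{array}$$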
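The bin boundaries that the compensation factors are derived from follow directly from Eq. (3.5). The sketch below only computes the actual and ideal boundaries and their difference; it does not implement the $BCF_m$ / $BCF_c$ assignment itself. The function name, the LSB value, and the DNL values are hypothetical.

```python
def bin_edges_from_dnl(dnl, lsb):
    """T[k] = sum over n < k of LSB * (DNL[n] + 1), for k = 0..N (Eq. 3.5)."""
    edges = [0.0]
    for d in dnl:
        edges.append(edges[-1] + lsb * (d + 1.0))
    return edges


# Hypothetical values for illustration (not measured data).
lsb = 2.0                      # nominal bin width in ps
dnl = [0.0, 0.5, -0.5, 0.0]    # per-bin DNL in LSB

t_actual = bin_edges_from_dnl(dnl, lsb)
t_ideal = [k * lsb for k in range(len(dnl) + 1)]
# Offset of each actual boundary from its ideal position, in ps.
deviation = [a - i for a, i in zip(t_actual, t_ideal)]
print(t_actual, t_ideal, deviation)
```

A non-zero entry in `deviation` marks a boundary where part of a bin's count would have to be reassigned to a neighbouring ideal bin.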

This method, implemented with two BRAM modules, is shown in Fig. 3.8c. The two factors are fetched from the calibration BRAM based on the fine codes from the TDL and are then fed to the histogram BRAM; both BRAMs are implemented in the simple dual-port mode. With the on-board compensation method, the post-processing time is reduced significantly. However, more troublesome missing code problems degrade the INL and the performance, especially for two-stage TDCs [119], [120]. A new compensation method is still to be developed. A simple approach will be introduced later in this chapter.

Figure 3.9 presents the linearity performances of the original and the compensated WU TDCs. For compensated WU TDCs, the DNL is [-0.92, 1.75] LSB, and the INL is [-1.20, 5.97] LSB.



Figure 3.9 DNL and INL performances. a) DNL and b) INL plots of the original WU TDC and the compensated WU TDC in the UltraScale FPGA (810 bins).

## 3.2.5. Dual-sampling wave union TDC with sub-TDL

As discussed in the sections above, the DS architecture and wave union method can be integrated with the sub-TDL structure. A dual-sampling wave union (DSWU) TDC was implemented to improve the resolution further. Its block diagram is shown in Fig. 3.10. With the DS architecture, there are 16 sampling taps physically, and the wave union method with two edges doubles the taps virtually. Therefore, the total number of the equivalent sampling taps is  $8 \times 2 \times 2 = 32$ . As the typical cell delay of CY8 in UltraScale FPGAs is around 5.0 ps, the achievable resolution of the DSWU TDC in UltraScale FPGAs should be around  $5 \div 4 = 1.25$  ps. The linearity, however, degrades when the resolution is improved.



Figure 3.10 The block diagram of the DSWU TDC.

## **3.2.6.** Binned dual-sampling wave union TDC

A binning method was used to improve the DSWU TDC's linearity, inspired by the compensation strategy and the bin decimation method in [121]. As shown in Fig. 3.8b, a set of merged ideal bins is constructed by merging every two consecutive bins into a larger bin. Then, the compensation strategy is applied to improve the linearity. The central concept is to reduce the number of ultra-wide and ultra-narrow bins, enhancing both the DNL and the INL.
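The effect of merging two consecutive bins can be illustrated with a toy example. This sketch covers only the merging step, not the subsequent compensation, and the bin widths (in arbitrary integer units) are hypothetical values chosen for illustration.

```python
def merge_bins(widths, factor=2):
    """Merge every `factor` consecutive bin widths into one wider bin."""
    return [sum(widths[i:i + factor]) for i in range(0, len(widths), factor)]


def dnl_pk_pk(widths):
    """Peak-to-peak DNL (in LSB) of a set of bin widths."""
    mean = sum(widths) / len(widths)
    dnl = [w / mean - 1.0 for w in widths]
    return max(dnl) - min(dnl)


# Hypothetical raw bin widths with ultra-wide and ultra-narrow bins.
raw = [1, 11, 8, 4, 5, 9, 2, 8]
merged = merge_bins(raw)

print("raw    DNL pk-pk:", round(dnl_pk_pk(raw), 3))
print("merged DNL pk-pk:", round(dnl_pk_pk(merged), 3))
```

In this example a narrow bin tends to sit next to a wide one, so merging averages them out; the price is that the number of bins halves and the LSB doubles.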

# 3.3. Experimental tests

## 3.3.1. Experimental setup

To evaluate the performance of the proposed TDCs, code density tests and time-interval tests have been performed. The whole TDC runs at 500 MHz (which is also the system clock frequency). The experimental setup is shown in Fig. 3.11. The test system contains a Xilinx KCU105 board [122], a Silicon Labs Si5324 evaluation board [123], a Stanford Research Systems (SRS) CG635 clock generator [124], and a WaveRunner 640Z oscilloscope [125].

As shown in Fig. 3.12, the KCU105 board has a Kintex UltraScale XCKU040 FPGA chip with sufficient logic resources to implement TDCs. The board also provides two pairs of configurable (differential or single-ended) Sub-Miniature version A (SMA) ports for low-jitter signal transmission and two FPGA Mezzanine Card (FMC) connectors for port extensions (LPC: low pin count connector with 160 pins; HPC: high pin count connector with 400 pins).

The Si5324 is a clock multiplier/jitter attenuator that can generate any frequency from 2 kHz to 945 MHz with low jitter (see Fig. 3.13a). The CG635 is a high-performance clock generator (see Fig. 3.13b). It can generate an extremely stable square-wave clock between 1  $\mu$Hz and 2.05 GHz and provides a phase-shifting function. The oscilloscope (WaveRunner 640Z) examines the signals fed to the Hit and CLK ports.



Figure 3.11 Experimental Setup.



Figure 3.12 Xilinx KCU105 evaluation board.



Figure 3.13 a) Si5324 evaluation board. b) CG635 clock generator.

In code density tests, two independent signal sources were used to ensure that time events were sufficiently random and that there was no correlation between the hit and clock signals [101]. Moreover, because millions of time events are fed to the TDC in code density tests, coherence between the hit and clock signals can occur if the hit frequency is not selected carefully, making the linearity test results invalid [126]. Therefore, the hit signal is set to 99.97 MHz in the tests. The hit and clock signals are fed to the board via single-ended ports, which adds more jitter to the system.
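Why the hit frequency matters can be seen from a simple idealized model. Assuming two perfectly stable, jitter-free sources (which real instruments are not), the phase of the *n*-th hit within the clock period is $n \cdot f_{clk}/f_{hit}$ modulo 1, so the number of distinct phases equals the denominator of the reduced frequency ratio. The function name is hypothetical, and the 100 MHz case is included only as a counter-example.

```python
from fractions import Fraction


def distinct_phases(f_clk_hz, f_hit_hz):
    """Number of distinct hit phases per clock period for ideal sources."""
    # Fraction reduces the ratio; its denominator is the number of phases
    # that n * f_clk / f_hit (mod 1) cycles through.
    return Fraction(f_clk_hz, f_hit_hz).denominator


f_clk = 500_000_000                                   # 500 MHz system clock
phases_good = distinct_phases(f_clk, 99_970_000)      # 99.97 MHz hit signal
phases_bad = distinct_phases(f_clk, 100_000_000)      # coherent counter-example

clock_period_ps = 1e12 / f_clk
step_ps = clock_period_ps / phases_good               # spacing between phases
print(phases_good, phases_bad, round(step_ps, 3))
```

Under these assumptions, a 99.97 MHz hit signal visits 9997 phases spaced about 0.2 ps apart across the 2 ns clock period, whereas a 100 MHz hit signal would always land on the same phase.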

In time-interval tests, Si 5324-EVB and SRS CG-635 are synchronized to a 10 MHz signal (highlighted in red, see Fig. 3.11). With the phase-shifting function provided by SRS CG-635, a controllable time interval can be obtained between the hit signal and the system clock. For each fixed time interval, measurements were repeated 50,000 times.

|                                | DNL                    | $\sigma_{DNL}$ | INL                      | $\sigma_{INL}$ | $w_{eq}$ (ps) | $\sigma_{eq}$ (ps) |
|--------------------------------|------------------------|----------------|--------------------------|----------------|---------------|--------------------|
| Original<br>WU<br>(2.47 ps)    | [-0.90, 4.06],<br>4.96 | 0.82           | [-4.62, 11.58],<br>16.20 | 3.26           | 4.85          | 1.87               |
| Compensated<br>WU<br>(2.47 ps) | [-0.92, 1.75],<br>2.67 | 0.43           | [-1.20, 5.97],<br>7.17   | 1.06           | 2.99          | 0.86               |
| DS<br>(2.53 ps)                | [-0.93, 4.59],<br>5.52 | 1.07           | [-5.39, 13.57],<br>18.96 | 4.01           | 6.51          | 1.88               |
| DSWU<br>(1.23 ps)              | [-0.84, 7.93],<br>8.77 | 0.93           | [-6.36, 24.70],<br>31.06 | 6.42           | 2.95          | 0.85               |
| Binned-DSWU<br>(2.48 ps)       | [-0.93, 1.68],<br>2.61 | 0.35           | [-1.78, 2.67],<br>4.45   | 0.80           | 2.95          | 0.85               |

Table 3-2 Comparison of The Linearity Performances Between Four Different TDC Designs in UltraScale 20 nm FPGAs (The unit is LSB if not mentioned)

## **3.3.2.** Linearity tests

DNL, INL, and their standard deviations ( $\sigma_{DNL}$  and  $\sigma_{INL}$ ) are important parameters for evaluating a TDC's linearity. Two equations were derived to assess the equivalent bin width and its standard deviation, summarized in [127]:

$$\sigma_{eq}^{2} = \sum_{i=1}^{N} \left( \frac{W[i]^{2}}{12} \times \frac{W[i]}{W_{total}} \right) \text{ where } W_{total} = \sum_{i=1}^{N} W[i], \tag{3.6}$$

$$w_{eq} = \sigma_{eq}\sqrt{12} = \sqrt{\sum_{i=1}^{N} \left(\frac{W[i]^3}{W_{total}}\right)},\tag{3.7}$$

where N is the number of bins,  $w_{eq}$  is the equivalent bin width and  $\sigma_{eq}$  is the standard deviation of the equivalent bin width (named differently by other groups, for example, the quantization error  $\sigma_0$  [128]).
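Equations (3.6) and (3.7) translate directly into code. The sketch below is a minimal illustration with hypothetical bin widths; the function name `equivalent_bin_width` is not taken from any tool used in this work.

```python
import math


def equivalent_bin_width(widths):
    """Return (w_eq, sigma_eq) for a list of bin widths, per Eqs. (3.6)-(3.7)."""
    total = sum(widths)
    # Eq. (3.6): quantization variance of each bin, weighted by its hit probability.
    sigma_eq_sq = sum((w ** 2 / 12.0) * (w / total) for w in widths)
    sigma_eq = math.sqrt(sigma_eq_sq)
    # Eq. (3.7): equivalent bin width.
    w_eq = sigma_eq * math.sqrt(12.0)
    return w_eq, sigma_eq


# Hypothetical bin widths in ps: same average width, different uniformity.
uniform = [2.0] * 8
uneven = [1.0, 3.0] * 4

w_uniform, s_uniform = equivalent_bin_width(uniform)
w_uneven, s_uneven = equivalent_bin_width(uneven)
print(round(w_uniform, 3), round(w_uneven, 3))
```

Both lists have an average width of 2 ps, but the uneven one yields a larger equivalent bin width because wide bins are hit more often and contribute more quantization error.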

The linearity performances of the proposed TDCs are listed in Table 3-2. With the compensation strategy, the linearity of the WU TDC is much better: the  $DNL_{pk-pk}$  (peak-to-peak DNL) is reduced from 4.96 to 2.67 LSB, and the  $INL_{pk-pk}$  is reduced from 16.20 to 7.17 LSB. Comparing the original WU TDC with the DS TDC shows that the WU method is still advantageous in easing ultra-wide bin problems: the  $DNL_{pk-pk}$  of the original WU TDC is 4.96 LSB, whereas that of the DS TDC is 5.52 LSB. The compensation strategy is less suitable for the DS TDC due to more ultra-wide bins.

Figure 3.14 shows DNL and INL curves for the DSWU TDC. The resolution of the DSWU TDC is 1.23 ps. The DNL<sub>pk-pk</sub> is 8.77 LSB and the INL<sub>pk-pk</sub> is 31.06 LSB. The DNL and INL curves for the binned-DSWU TDC are shown in Fig. 3.15. The DNL is [-0.93, 1.68] LSB and the INL is [-1.78, 2.67] LSB. Compared with the compensated WU TDC, the binning method is still advantageous, even though the two TDCs have a similar resolution. The INL<sub>pk-pk</sub> for the binned-DSWU TDC (4.45 LSB) is much smaller than that of the compensated WU TDC (7.17 LSB). The binning method can also improve linearity with less distortion. Figure 3.16 shows the bin width distributions of the compensated WU and the binned-DSWU TDCs. The binned-DSWU TDC has a more concentrated bin-width distribution. The  $\sigma_{DNL}$  values of the compensated WU (0.43 LSB) and the binned-DSWU (0.35 LSB) TDCs in Table 3-2 also show that the binned-DSWU TDC can deliver more robust results.



Figure 3.14 DNL and INL curves for the DSWU TDC (1626 bins).



Figure 3.15 DNL and INL curves for the binned-DSWU TDC (807 bins).



Figure 3.16 Bin width distributions for the compensated WU TDC and the binned-DSWU TDC.

## **3.3.3.** Time-interval tests

Precision (also called RMS resolution) shows the quality of measurements and can be affected by clock jitters, jitters of input signals, electronic noise, and PVT variations [129]. Time-interval tests are for precision estimation. A large number of time events with a fixed time interval are fed into the TDC, and the standard deviation of the distribution of these events reflects the precision.

The test results and the RMS resolutions for the proposed TDCs are shown in Figs. 3.17 and 3.18. The averaged RMS resolutions for the DS and the WU TDCs are 3.80 ps and 3.64 ps, respectively, whereas those for the DSWU and the binned-DSWU TDCs are 3.67 ps and 3.63 ps, respectively.



Figure 3.17 TIT results and RMS resolutions for a) the DS TDC system and b) the WU TDC system.



Figure 3.18 TIT results and RMS resolutions for a) the DSWU TDC system and b) the binned-DSWU TDC system.

# 3.4. Theoretical analysis

Szplet *et al.* estimated measurement uncertainties by analyzing error sources of two-stage multi-phase TDCs [130]. Their analysis approach can be extended to single-stage TDCs, and the RMS resolution can be derived as:

$$\sigma_{system}^2 = \sigma_{in}^2 + \sigma_{sig}^2 + \sigma_{eq}^2 + \sigma_{clk}^2, \qquad (3.8)$$

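where  $\sigma_{in}$  is the jitter introduced by the input signal,  $\sigma_{clk}$  is the jitter caused by the clock signal,  $\sigma_{eq}$  is given by Eq. (3.6), and  $\sigma_{sig}$  is the jitter accumulated when a signal propagates along a delay line: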

$$\sigma_{sig}^2 = \sigma_{wu}^2 + \sigma_{DL}^2, \tag{3.9}$$

where  $\sigma_{wu}$  is the jitter caused by the wave-union launcher and  $\sigma_{DL}$  is the jitter caused by the delay elements of the tapped delay line that are involved in a measurement. The wave union launcher contains a LUT and a CY8 (each CY8 contains eight delay elements, each with jitter  $\sigma_{CY}$ ); therefore,  $\sigma_{wu}$  can be expressed as:

$$\sigma_{wu}^2 = 8\sigma_{CY}^2 + \sigma_{LUT}^2, \qquad (3.10)$$

It is difficult to predict how many delay elements of the TDL are involved in measurements. However, according to [130], a TDL with *k* delay elements can have  $\sigma_{DL}$  as:

$$\begin{aligned}
E[\sigma_{DL}^{2}] &= \sum_{i=1}^{k} \sigma_{DL_i}^{2} \cdot p(\sigma_{DL_i}^{2}) = \sum_{i=1}^{k} i \cdot \sigma_{CY}^{2} \cdot p(\sigma_{CY}^{2}) \\
&= \sum_{i=1}^{k} i \cdot \sigma_{CY}^{2} / k = (\sigma_{CY}^{2} / k) \sum_{i=1}^{k} i \\
&= \frac{k+1}{2} \sigma_{CY}^{2} \approx \frac{k}{2} \sigma_{CY}^{2},
\end{aligned} \tag{3.11}$$

where  $\sigma_{DL_i}$  is the jitter accumulated up to the *i*-th delay element, and the *i*-th delay element has the probability  $p(\sigma_{DL_i}^2)$  of being hit during a signal measurement. Because the delay elements belong to CY8s,  $\sigma_{DL_i}^2 = i\sigma_{CY}^2$  and  $p(\sigma_{DL_i}^2) = p(\sigma_{CY}^2) = 1/k$ . From Eqs. (3.8)-(3.11), the RMS resolution becomes:

$$\sigma_{system}^2 = \sigma_{in}^2 + \sigma_{clk}^2 + \sigma_{eq}^2 + \sigma_{LUT}^2 + \left(\frac{k}{2} + 8\right)\sigma_{CY}^2, \qquad (3.12)$$
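The closed form in Eq. (3.11) can be checked numerically against the direct sum. In the sketch below, the TDL length `k` is a hypothetical value chosen for illustration, while  $\sigma_{CY} = 0.16$  ps is the value reported from Fig. 3.20 in this section; a uniform hit probability of 1/k per element is assumed, as in the derivation.

```python
def expected_dl_variance(k, sigma_cy):
    """E[sigma_DL^2] when a hit stops at any of k elements with probability 1/k."""
    return sum(i * sigma_cy ** 2 for i in range(1, k + 1)) / k


k = 400            # hypothetical number of delay elements in the TDL
sigma_cy = 0.16    # ps, per-element jitter reported in this chapter

direct = expected_dl_variance(k, sigma_cy)
closed_form = (k + 1) / 2 * sigma_cy ** 2
approximation = k / 2 * sigma_cy ** 2

print(round(direct, 4), round(closed_form, 4), round(approximation, 4))
```

For a TDL with a few hundred elements the k/2 approximation differs from the exact (k+1)/2 expression by well under 1%.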

To estimate  $\sigma_{LUT}$  and  $\sigma_{CY}$ , a ring oscillator (RO) was constructed, as shown in Fig. 3.19. There are *m* delay elements in the RO, and the jitter of the RO is:

$$\sigma_{RO}^2 = \sigma_{LUT}^2 + m\sigma_{CY}^2. \tag{3.13}$$
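One plausible way to extract the two parameters from Eq. (3.13) is a straight-line fit of  $\sigma_{RO}^2$  against *m*: the slope gives  $\sigma_{CY}^2$  and the intercept gives  $\sigma_{LUT}^2$ . The sketch below assumes this approach, which is not necessarily the procedure used for Fig. 3.20, and it uses synthetic, noise-free data generated from Eq. (3.13) rather than measured jitter values.

```python
import math


def fit_ro_jitter(ms, sigma_ro):
    """Least-squares fit of sigma_RO^2 = sigma_LUT^2 + m * sigma_CY^2 (Eq. 3.13)."""
    ys = [s ** 2 for s in sigma_ro]
    n = len(ms)
    mean_m = sum(ms) / n
    mean_y = sum(ys) / n
    sxx = sum((m - mean_m) ** 2 for m in ms)
    sxy = sum((m - mean_m) * (y - mean_y) for m, y in zip(ms, ys))
    slope = sxy / sxx                      # estimate of sigma_CY^2
    intercept = mean_y - slope * mean_m    # estimate of sigma_LUT^2
    return math.sqrt(intercept), math.sqrt(slope)


# Synthetic data: RO lengths are hypothetical; jitter follows Eq. (3.13) exactly.
ms = [8, 16, 32, 64, 128]
sigma_ro = [math.sqrt(1.45 ** 2 + m * 0.16 ** 2) for m in ms]

sigma_lut, sigma_cy = fit_ro_jitter(ms, sigma_ro)
print(round(sigma_lut, 3), round(sigma_cy, 3))
```

With real measurements the points scatter around the line, and several RO lengths are needed for a stable estimate of the small slope.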

From Fig. 3.20,  $\sigma_{CY} = 0.16$  ps and  $\sigma_{LUT} = 1.45$  ps. The values of  $\sigma_{CY}$  and  $\sigma_{LUT}$  are slightly larger than those in 45 nm Spartan-6 FPGAs, but the difference is minimal (see Table 3-3, [130]). Experimental setups, including evaluation boards and instruments, can affect the test results. Moreover, more advanced CMOS manufacturing technologies might also contribute to higher uncertainties, because much smaller transistors (in gate width and length) result in more thermal noise [75]. Therefore, it is reasonable that the presented results are slightly different from those in [130]. The four proposed TDCs have similar structures, but there is no WU launcher in the DS TDC, so  $\sigma_{sig} = 2.74$  ps (for WU, DSWU, and binned-DSWU TDCs) and  $\sigma_{sig} = 2.33$  ps (for DS TDC).



Figure 3.19 Test setup for investigating jitters of LUT and the delay element.



Figure 3.20 Test results for investigating jitters of LUT and the delay element.

|                              | Spartan-6 (45 nm)<br>[130]-2019 | UltraScale (20 nm)<br>[66]-2019 | UltraScale (20 nm)<br>Compensated WU | UltraScale (20 nm)<br>DS | UltraScale (20 nm)<br>DSWU | UltraScale (20 nm)<br>Binned-DSWU |
|------------------------------|------------|------|------|------|------|------|
| LSB (ps)                     | -          | 5.02 | 2.47 | 2.53 | 1.23 | 2.48 |
| **Error Source Analysis**    |            |      |      |      |      |      |
| $\sigma_{clk}$ (ps)          | 1.93       | -    | 1.49 | 1.49 | 1.49 | 1.49 |
| $\sigma_{in}$ (ps)           | 2.64       | -    | 1.53 | 1.53 | 1.53 | 1.53 |
| $\sigma_{CY}$ (ps)           | 0.153      | -    | 0.16 | 0.16 | 0.16 | 0.16 |
| $\sigma_{LUT}$ (ps)          | 1.33       | -    | 1.45 | 1.45 | 1.45 | 1.45 |
| $\sigma_{sig}$ (ps)          | -          | -    | 2.74 | 2.33 | 2.74 | 2.74 |
| $\sigma_{eq}$ (ps)           | -          | 1.45 | 0.86 | 1.88 | 0.85 | 0.85 |
| $\sigma_{TDC}$ (ps)          | -          | -    | 2.88 | 2.99 | 2.87 | 2.87 |
| $\sigma_{system}$ (ps)       | 10.19      | -    | 3.58 | 3.68 | 3.58 | 3.58 |
| **Time Interval Test**       |            |      |      |      |      |      |
| $\sigma_{system}$ (ps)       | -          | 7.8  | 3.64 | 3.80 | 3.67 | 3.63 |

Table 3-3 Evaluation of Measurement Uncertainties

Measured by the oscilloscope, the jitter caused by the clock signal ( $\sigma_{clk}$ , generated by SRS CG-635) is 1.49 ps and the jitter introduced by the hit signal ( $\sigma_{in}$ , generated by Si 5324-EVB) is 1.53 ps.  $\sigma_{TDC}$  was introduced to evaluate the RMS resolution of the TDC, and it can be expressed as Eq. (3.14). Hence,  $\sigma_{system}$  can be modified as Eq. (3.15). The measurement uncertainties of the proposed TDCs are listed in Table 3-3 (two previously published works in 45 nm and 20 nm FPGAs are also listed). The analysis provides more detailed contributions from error sources. Additional jitters caused by input circuits are significant for some complicated structures [130]. However, the impact is negligible for the single-stage TDCs. The results ( $\sigma_{system}$ ) obtained from the error source analysis agree with those obtained from TIT tests.

$$\sigma_{TDC}^2 = \sigma_{sig}^2 + \sigma_{eq}^2, \qquad (3.14)$$

$$\sigma_{system}^2 = \sigma_{TDC}^2 + \sigma_{in}^2 + \sigma_{clk}^2. \tag{3.15}$$
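The uncertainty budget of Eqs. (3.14) and (3.15) is a root-sum-square combination, which can be reproduced from the  $\sigma_{sig}$ ,  $\sigma_{eq}$ ,  $\sigma_{in}$ , and  $\sigma_{clk}$  values listed in Table 3-3. The helper name `rss` and the dictionary layout are illustrative choices only.

```python
import math


def rss(*sigmas):
    """Root-sum-square combination of independent jitter contributions."""
    return math.sqrt(sum(s ** 2 for s in sigmas))


# Values from Table 3-3, in ps.
SIGMA_IN, SIGMA_CLK = 1.53, 1.49
designs = {                     # name: (sigma_sig, sigma_eq)
    "Compensated WU": (2.74, 0.86),
    "DS": (2.33, 1.88),
    "DSWU": (2.74, 0.85),
    "Binned-DSWU": (2.74, 0.85),
}

budget = {}
for name, (sigma_sig, sigma_eq) in designs.items():
    sigma_tdc = rss(sigma_sig, sigma_eq)                    # Eq. (3.14)
    sigma_system = rss(sigma_tdc, SIGMA_IN, SIGMA_CLK)      # Eq. (3.15)
    budget[name] = (sigma_tdc, sigma_system)
    print(f"{name}: sigma_TDC = {sigma_tdc:.2f} ps, sigma_system = {sigma_system:.2f} ps")
```

The computed values agree with the  $\sigma_{TDC}$  and  $\sigma_{system}$  rows of Table 3-3 to within rounding.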


| Modules   | Total  | w/o WU [66]<br>Used | WU<br>Used   | DS<br>Used   | DSWU / Binned-DSWU<br>Used |
|-----------|--------|--------------|--------------|--------------|--------------|
| CY8       | 30300  | 80 (0.26%)   | 85 (0.28%)   | 76 (0.25%)   | 88 (0.29%)   |
| LUT       | 242400 | 703 (0.29%)  | 1349 (0.56%) | 1272 (0.52%) | 2460 (1.01%) |
| FF        | 484800 | 1195 (0.24%) | 1840 (0.38%) | 2190 (0.45%) | 3463 (0.71%) |
| BRAM      | 600    | 1.5 (0.25%)  | 4.5 (0.75%)  | 3.5 (0.58%)  | 7.5 (1.25%)  |
| CLB*      | 30300  | -            | 251 (0.83%)  | 211 (0.79%)  | 405 (1.37%)  |
| Power (W) | -      | -            | 0.92         | 0.88         | 1.03         |

Table 3-4 Consumption of Logic Resources

\* Configurable logic block, including several LUTs, FFs and a CY8 [73].

# 3.5. Logic resource consumption

The power consumption and the logic resources required for the proposed TDCs were calculated using the electronic design automation (EDA) tool (Vivado Design Suite) and summarized in Table 3-4. Extra logic resources were used when using the WU method and the DS structure (more BRAMs are needed to perform histogram functions). However, the usages of logic resources for the proposed TDCs are still low, showing great potential for multiple-channel applications.

The implementation layouts of the DSWU TDC are shown in Fig. 3.21. The TDL is confined within a central clock region (Slices X49Y120~X49Y179) to avoid large clock skews. The WU launcher is in Slice X49Y119. These constraints are also applicable to the WU TDC and the binned-DSWU TDC. The constraints for the WU launcher are not required in the DS-only TDC.

Figure 3.21 shows that the layouts do not consume many hardware resources, so the proposed TDCs are suitable for multi-channel applications. If a more extended measurement range is required, coarse counters can be easily included for the proposed TDCs.



Figure 3.21 Layouts of the DSWU TDC. a) Overview. b) Clock regions (X2Y1 ~ X2Y3).

# 3.6. Summary

In this chapter, the first efficient WU-based FPGA-TDC in UltraScale FPGAs was introduced, correcting the misunderstanding that the WU method is unsuitable for UltraScale FPGAs. Combining the sub-TDL structure, the DS method, and the WU method, four different TDCs were built and tested, and it was found that the proposed DSWU TDC achieves an excellent resolution (1.23 ps). To further improve the linearity, the mixed calibration method in [66] was modified and a new compensation strategy was proposed.

Table 3-5 summarizes the performances and resource consumptions of recently published high-resolution FPGA-TDCs and the proposed TDCs. With the single-TDL structure, the proposed TDCs are easy to implement and resource-saving compared with other TDCs based on multichain averaging methods [63], [79]–[81]. The 2D Vernier TDC [109] achieved a similar resolution (2.50 ps) but used 37,420 LUTs (about 28-fold more than the WU TDC). Although the large-scale parallel routing TDC [92] achieved a fine resolution (1.29 ps) with few logic resources, it required many CLBs and could result in signal congestion in a complex digital system. The logic resource consumption shows that the proposed TDC architectures have great potential for multi-channel applications.

Moreover, this chapter proposes a new method to evaluate precision by analysing the error sources. For a fixed architecture ( $\sigma_{sig}$ ),  $\sigma_{eq}$  is the only factor that varies when different methods are implemented and can be reduced by improving the resolution and the DNL, from Eq. (3.3). From this study, the architecture-dependent jitter  $\sigma_{sig}$ , including signal noises and element errors, is the main contributor to the measurement uncertainties of the proposed TDCs ( $\sigma_{TDC}$ : ~2.9 ps). The precision of the proposed WU-related TDCs ( $\sigma_{eq}$ : ~0.85 ps) is close to the upper limit of what this structure can offer, and the precision of DS TDC, however, can still be further improved with  $\sigma_{eq} = 1.88$  ps. Also, the error analysis indicates the reason why the previously reported high-resolution (<2.5 ps) TDL-TDCs can only achieve limited precision (> 3 ps) [63], [78], [79].

| Ref-<br>Year | Methods                           | Device                | LSB<br>(ps) | RMS<br>Resol.<br>(ps)                      | DNL, $DNL_{pk-pk}$ (LSB)   | INL, $INL_{pk-pk}$ (LSB)    | CLB  | LUT             | FF                | B-<br>RAM | DSP              |
|--------------|-----------------------------------|-----------------------|-------------|--------------------------------------------|----------------------------|-----------------------------|------|-----------------|-------------------|-----------|------------------|
| [63]-16      | Multichain, TDL<br>2-stage, WU-A  | Spartan-6             | 0.90        | < 6.00                                     | [-1.00, 6.25] <sup>1</sup> | [-26.20, 11.50],<br>37.70   | N/S  | N/S             | 4364 <sup>4</sup> | N/S       | 0 <sup>4</sup>   |
| [131]-16     | DS                                | UltraScale            | 2.25        | 3.90                                       | [-1.00, 4.78] <sup>1</sup> | N/S                         | N/S  | N/S             | 1810 <sup>4</sup> | N/S       | 0 <sup>4</sup>   |
| [109]-17     | 2D-Vernier,<br>Multi-core         | Stratix IV            | 2.50        | 6.72                                       | [-0.56, 0.46],<br>1.02     | [-2.98, 3.23],<br>6.21      | N/S  | 37420<br>5-core | N/S               | N/S       | 0 <sup>4</sup>   |
| [79]-17      | Multichain, TDL Offset-correction | Virtex-7<br>(250 MHz) | 1.15        | 3.50                                       | [-0.98, 3.50],<br>4.48     | [-5.90, 3.10],<br>9.00      | N/S  | 19666           | 7000 <sup>4</sup> | 43        | 127 <sup>5</sup> |
| [83]-18      | Multi-phase                       | Cyclone V             | 1.56        | 2.30                                       | [-1.00, 5.60] <sup>1</sup> | [-8.00, 35.00] <sup>1</sup> | N/S  | N/S             | 1300 <sup>4</sup> | N/S       | 0 <sup>4</sup>   |
| [66]-19      | Multi-channel<br>Sub-TDL          | UltraScale            | 5.02        | 7.80                                       | [-0.12, 0.11],<br>0.27     | [-0.18, 0.46],<br>0.59      | 271  | 703             | 1195              | 1.5       | 0 <sup>4</sup>   |
| [81]-19      | Multichain, WU,<br>Multichannel   | Artix-7               | 2.00        | < 12.50                                    | N/S                        | N/S, 2.10                   | N/S  | 7010            | 3738              | 1.5       | 0 <sup>4</sup>   |
| [80]-19      | Multichain, WU,<br>Multichannel   | UltraScale            | 0.30        | < 8.50                                     | N/S                        | N/S                         | 3200 | N/S             | N/S               | 11.1      | 1                |
| [92]-21      | Large scale parallel routing      | Kintex-7              | 1.29        | 3.54                                       | [-1.20, 1.40], 2.60        | [-3.28, 3.78], 7.06         | 2200 | 1002            | 3900              | 2         | 0 <sup>4</sup>   |
| This Work    | Sub-TDL, WU-A<br>Compensation     | UltraScale            | 2.47        | 3.64 <sup>2</sup><br>3.58 <sup>3</sup>     | [-0.92, 1.75],<br>2.67     | [-1.20, 5.97],<br>7.17      | 251  | 1349            | 1840              | 4.5       | 0                |
| This Work    | Sub-TDL, DS                       | UltraScale            | 2.53        | 3.80 <sup>2</sup><br>3.68 <sup>3</sup>     | [-0.93, 4.59],<br>5.52     | [-5.39, 13.57],<br>18.96    | 211  | 1272            | 2190              | 3.5       | 0                |
| This Work    | Sub-TDL, DS<br>WU-A               | UltraScale            | 1.23        | 3.67 <sup>2</sup><br>3.58 <sup>3</sup>     | [-0.84, 7.93],<br>8.77     | [-6.36, 24.70],<br>31.06    | 405  | 2460            | 3463              | 7.5       | 0                |
| This Work    | Sub-TDL, DS<br>WU-A, Binning      | UltraScale            | 2.48        | 3.63 <sup>2</sup><br>3.58 <sup>3</sup>     | [-0.93, 1.68],<br>2.61     | [-1.78, 2.67],<br>4.45      | 405  | 2460            | 3463              | 7.5       | 0                |

Table 3-5 Comparison of Published FPGA-based TDCs and the Proposed TDCs

<sup>1</sup>Approximate value from figures presented in literature; <sup>2</sup> Data obtained from TITs; <sup>3</sup> Data obtained from the analysis of measurement uncertainties. <sup>4</sup> Estimated values based on the architectures and encoding methods. <sup>5</sup> 20-chain TDCs require 127 DSP blocks and need much more LUTs, FFs, and BRAMs.

# Chapter 4. High-linearity TDC for LiDAR applications

## 4.1. Motivation

With advances in CMOS manufacturing, photon detectors, such as SiPMs and SPADs, have become more compatible and commercially available [132], [133]. LiDAR has gradually become a standard technique in both scientific and industrial applications, for example, remote sensing of atmospheric aerosols, landscape mapping, robotics, and home monitoring [16], [17], [134]–[136]. Callenberg *et al.* examined a very low-cost SPAD product, priced at 3 USD for large volumes, and designed a system for non-line-of-sight tracking, material classification, and depth imaging [137], [138]. LiDAR is also a promising technique for driverless vehicles. With time information, an image or video can be reconstructed to guide the vehicle.

Boosted by artificial intelligence (AI), software for LiDAR-based driverless vehicles has developed rapidly [139]–[142]. However, hardware development has not kept the same pace. FPGAs, due to their flexibility and short development cycle, are well-suited platforms for LiDAR systems in driverless vehicles.

As a time-meter, a TDC is a critical component in LiDAR systems. In the past decades, researchers aimed to build high-resolution TDCs to meet the demands of high-energy physics, PET imaging, and bio-sensing [105], [143]–[145]. Many architectures and methods, including the dual-sampling structure, the Vernier delay line, the multi-phase design, the multi-chain design, and the wave-union method, were proposed to overcome process-related limitations and improve TDC resolutions [74], [110], [146], [83], [78]. However, TDCs in ToF LiDAR systems for robotics and driverless vehicles prioritize different parameters, especially the linearity and the measurement range [147], [18], [148]. In many applications, LiDAR systems can detect objects' locations and even estimate their speeds and directions of movement [148]. Distances between vehicles, for example, measured by a LiDAR system range from a few centimetres to hundreds of metres. Therefore, LiDAR systems for such specifications require TDCs with a resolution from 50 ps to 200 ps [18]. In ToF measurements, a round-trip time interval of about 66.7 ps corresponds to a distance of 1 cm. According to [149], a TDC's precision can be expressed as:

$$\sigma_{TDC}^2 = \sigma_{in}^2 + \sigma_{clk}^2 + \sigma_q^2 + \sigma_{INL}^2 + \sigma_{extra}^2, \qquad (4.1)$$

where $\sigma_{in}$ is the input signal jitter, $\sigma_{clk}$ is the system-clock jitter, $\sigma_q$ is the quantization error caused by delay cells, $\sigma_{INL}$ is the INL standard deviation, and $\sigma_{extra}$ is the jitter from external sources. Eq. (4.1) indicates that, for a given resolution, the most efficient way to achieve precise measurements is to improve the linearity. An extensive measurement range can be easily achieved using coarse and fine counters. For the coarse counter, the MMCM provides a stable and compensated clock [61]. Therefore, coarse-time codes do not need further calibration. However, due to the uneven delay line, it is still challenging to guarantee high linearity for the fine counter.
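As a rough illustration of Eq. (4.1) and of the time-to-distance conversion used in ToF ranging, the sketch below sums independent jitter terms in quadrature and converts a round-trip time into a range. The function names and the sample jitter values are illustrative assumptions, not values or code from this work.

```python
import math

C_M_PER_S = 299_792_458.0  # speed of light in vacuum


def tof_to_distance_m(t_round_trip_s):
    """Convert a round-trip time of flight into a one-way distance."""
    return C_M_PER_S * t_round_trip_s / 2.0


def tdc_precision(sigma_in, sigma_clk, sigma_q, sigma_inl, sigma_extra=0.0):
    """Eq. (4.1): independent error terms add in quadrature (same unit for all)."""
    return math.sqrt(
        sigma_in**2 + sigma_clk**2 + sigma_q**2 + sigma_inl**2 + sigma_extra**2
    )


if __name__ == "__main__":
    # Range step corresponding to one LSB for the three target resolutions.
    for lsb_ps in (50.0, 80.0, 100.0):
        step_mm = tof_to_distance_m(lsb_ps * 1e-12) * 1e3
        print(f"LSB = {lsb_ps:5.1f} ps -> range step = {step_mm:.2f} mm")

    # Hypothetical jitter budget in ps, only to show how the INL term enters.
    print(tdc_precision(sigma_in=5.0, sigma_clk=4.0, sigma_q=14.5, sigma_inl=1.0))
```

With these definitions, one LSB of 50 ps, 80 ps, or 100 ps corresponds to a range step of roughly 7.5 mm, 12 mm, or 15 mm.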



Figure 4.1 a) Synchronized LiDAR system. b) Timing diagram of time interval measurements. c) Concept of a timing event histogramming function.

In a synchronized LiDAR system, the complexity of the system interface can be reduced significantly [150]. Measurements are assessed by a TDC channel, as the laser diode and the TDC are synchronous to the timing generator module (see Fig. 4.1a). Figure 4.1b is a timing diagram of time interval measurements; the measured time contains two parts: the coarse time ( $T_{coarse}$ ) and the fine time ( $T_{fine}$ ). However, device uncertainties and offsets, such as the laser trigger delay ( $\delta$ ), detector timing jitter, background noise, and quantization error, can result in measurement uncertainties [151]. Therefore, post-processing (histogramming of timing events, see Fig. 4.1c) is needed. Onboard histogramming modules using on-chip block random-access memories (BRAMs) were proposed to improve processing efficiency [59], [66], [152]. However, for TDCs with an extended measurement range (> 500 ns), onboard histogramming modules consume significant BRAM resources and are not suitable for multichannel TDC designs. Therefore, many previously reported TDCs with long measurement ranges can only post-process data on PCs [67], [92], [146], [153].

Furthermore, most commercially-available time-correlated single-photon counting (TCSPC) systems only have a fixed resolution [154]. Commercial TCSPC products [155] providing operation-mode selections (for example, high-speed low-resolution or low-speed high-resolution modes) are standard. It is desirable to have a resolution-adjustable TDC offering broader time-resolved applications.

This chapter presents a new calibration method, the mixed-binning (MB) method, to improve linearity. A 128-channel resolution-adjustable TDC is built with the proposed method. A software tool for predicting the TDC performance is also fully described in the chapter.

## 4.2. Architecture and method

#### 4.2.1. Architecture

Figure 4.2a shows the proposed TDC architecture. The TDL is implemented with cascaded carry-chain modules (CARRY8, CY8) in UltraScale FPGAs. The TDL is also tuned to maximize linearity [62]. The sub-TDL structure [66] is introduced to remove bubbles by elongating tap intervals and reducing mismatch effects, whereas the encoder converts thermometer codes from sub-TDL modules to binary codes. The TDC resolution is adjustable by the signal *Resol\_sel* (highlighted in red). A 9-bit coarse counter extends the measurement range, and the signal *Trig* is an asynchronous reset signal for the coarse counter. The calibration module (highlighted in blue) can be removed if calibration is not applied. In this chapter, the uncalibrated TDC is called the original TDC.



Figure 4.2 Block diagrams of a) the proposed TDC architecture and b) the coarse code histogramming module.

Figure 4.2b is the block diagram for the coarse code histogramming module. In [156], a two-step coarse-fine timing module was proposed to achieve a histogramming function for a measurement distance > 50 m. Due to the limited memory resources in ASIC chips [156], long-range measurements are divided into coarse timing and fine timing modes. As shown in Fig. 4.3, the first exposure determines the coarse code using an external 60 MHz clock (16 ns bin-width), and then the second exposure returns the fine code with a 1 GHz clock (1 ns bin-width). The scheme requires more photon events and two different clock signals. However, the UltraScale XCKU040 FPGA has sufficient resources to implement two histogramming modules simultaneously (see Fig. 4.2a): the fine histogramming module and the coarse histogramming module.



Figure 4.3 Two-step histogramming methods in [156].

Figure 4.4a is the hardware implementation of the proposed MB method with resolution adjustments. Two BRAM modules are used in the proposed method: the calibration module and the histogramming module. Unlike the MC method [66] in Fig. 4.4b, the calibration module contains several BRAMs (for different resolutions) in the extended MB method, and each calibration BRAM only stores two factors: the bin-correction factor (*BCF*) and the bin-width calibration factor (*WCF*). A multiplexer is controlled by the signal *Resol\_sel* and outputs the factors for the corresponding resolution. The histogram is stored in the module *Histo\_BRAM*.



Figure 4.4 Comparison between a) the proposed mixed-binning method and b) the mixed-calibration method in [66].



Figure 4.5 Concepts of a) the MC method in [66] and b) the proposed MB method. c) Errors in the MC method.

#### 4.2.2. Distortions caused by mixed-calibration methods

The concepts of the MC method in [66] and the proposed MB method are presented in Fig. 4.5. TDCs with the MC method [66] (derived from the histogram processing algorithm in [157]) reach excellent linearity; both  $DNL_{pk-pk}$  and  $INL_{pk-pk}$  are much less than 1 LSB. The MC method contains two steps: bin compensations and width calibrations. As shown in Fig. 4.4b, four factors are stored in the calibration BRAM: the main bin factor ( $BCF_m$ ), the compensated bin factor ( $BCF_c$ ), the main width factor ( $WCF_m$ ) and the compensated width factor ( $WCF_c$ ).  $BCF_m$  and  $BCF_c$  are used to reassign actual TDLs to virtual TDLs (see Fig. 4.5a). The width calibration makes TDLs more even, e.g., Bin [CAL2] and Bin [CAL4] in Fig. 4.5a.

Although the MC's bin compensation (related to  $BCF_m$  and  $BCF_c$ ) can improve linearity, it introduces extra errors ( $\sigma_{comp}$ ). A simple example is shown in Fig. 4.5c. Hit signals with a fixed time interval are registered in one bin ideally before the MC's bin compensation (without considering jitters from circuits and signals). In this stage,  $\sigma_{comp} = 0$ . However, following the rules shown in Fig. 4.5a, a much larger bin (e.g.,  $Bin_{actual}$  [3] with S recorded events, highlighted in yellow in Fig. 4.5a) remaps to two ideal bins, resulting in extra jitter  $\sigma_{comp} \sim 0.5$  LSB (see Fig. 4.5c). Although  $\sigma_{comp}$  can be reduced through width calibration, it is still significant in LiDAR TDCs when LSB  $\gg 10$  ps. In other words, the MC method can 'over calibrate' the proposed TDC and is therefore unsuitable for this work. Instead, a much more efficient MB strategy was developed to improve linearity.

#### 4.2.3. Mixed binning method with resolution adjustments

This study aims to develop high-linearity TDCs for driverless vehicle LiDAR systems instead of high-resolution solutions for scientific applications [66]. The proposed MB method integrates the binning method (or the bin decimation [102]) and the width calibration. To avoid  $\sigma_{comp}$ , each fine code is remapped to a new bin (see the difference between Figs 4.4a and 4.4b, highlighted in yellow). Unlike down-sampling methods with a fixed sampling interval, the binning method is more flexible, merging several physical bins dynamically, regardless of smaller or larger bins, into a new bin and making a more even TDL with a larger average bin size. Figure 4.5b shows the binning method's concept with resolution adjustments, and the pseudo-code is shown below.

assume *n* actual bins and *m* merged bins, $m=\text{floor}\left(\frac{n}{i}\right)$  
set  $W_{merged}[m] = i \times W_{ideal}$  
set  $T_{merged}[m] = \sum_{k=0}^{m-1} W_{merged}[k]$  
set  $T_{actual}[n] = \sum_{k=0}^{n-1} W_{actual}[k]$  
for  $k = 1: n$  
&nbsp;&nbsp;&nbsp;&nbsp;$j = \text{floor}\left(\frac{k}{i}\right)$  
&nbsp;&nbsp;&nbsp;&nbsp;if  $(T_{actual}[k] < T_{merged}[j])$  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$BCF[k] = j$  
&nbsp;&nbsp;&nbsp;&nbsp;else  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;continue...
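The pseudo-code leaves the `else` branch open, so the following Python sketch is only one possible reading of the binning step, not the implementation used in this work. It assumes that $W_{ideal}$ is the mean actual bin width, that each actual bin is assigned to the merged bin containing its centre time, and that bins beyond the last merged bin are clipped into it. The function names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def bin_correction_factors(actual_widths, i):
    """Map each actual bin to a merged-bin address (BCF).

    Assumptions (not from the thesis): W_ideal is the mean actual width,
    and an actual bin belongs to the merged bin that contains its centre.
    """
    n = len(actual_widths)
    m = n // i                                   # number of merged bins
    w_ideal = sum(actual_widths) / n
    # Upper time edges of the ideal merged bins: T_merged[j] = (j + 1) * i * W_ideal
    merged_upper_edges = [(j + 1) * i * w_ideal for j in range(m)]
    # Cumulative time at the upper edge of every actual bin: T_actual[k]
    actual_upper_edges = list(accumulate(actual_widths))

    bcf = []
    for width, upper in zip(actual_widths, actual_upper_edges):
        centre = upper - width / 2.0
        j = bisect_right(merged_upper_edges, centre)
        bcf.append(min(j, m - 1))                # clip leftover bins into the last one
    return bcf


def merged_widths(actual_widths, bcf, m):
    """Width of each merged bin after remapping."""
    widths = [0.0] * m
    for width, j in zip(actual_widths, bcf):
        widths[j] += width
    return widths


if __name__ == "__main__":
    widths = [0.5, 0.5, 3.0, 0.5, 0.5, 1.0]     # hypothetical TDL with one ultra-wide bin
    bcf = bin_correction_factors(widths, i=2)
    print(bcf, merged_widths(widths, bcf, m=3))
```

In this toy example the ultra-wide bin ends up alone in a merged bin that is still wider than its neighbours, which illustrates why remapping alone cannot even the bins and a width calibration is applied afterwards.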

As in Fig 4.5b, the ideal width of the *m*-th merged bin is:

$$W_{merged}[m] = i \times W_{ideal}, \tag{4.2}$$

where *i* is the number of ideal bins to be merged and  $W_{ideal}$  is the ideal bin-width. BCFs are the addresses of merged bins calculated by the actual bin distribution obtained from code density tests. *BCFs*' remapping operations can make the TDL smoother but cannot even the bins. Therefore, the bin width calibration is needed to enhance linearity further. *WCFs* can be considered as a normalization factor and can be estimated from the results of code density tests after binning, expressed as:

$$WCF[m] = (DNL\{BCF[m]\} + 1)^{-1}. \tag{4.3}$$

To implement Eq. (4.3) in FPGAs, $WCF[m]$ can be converted into an approximate integer in binary code by scaling it with $2^{M}$ [59]:

$$WCF[m] = 2^{M} \cdot (DNL\{BCF[m]\} + 1)^{-1}. \tag{4.4}$$

Accumulation operations for *WCF*s act like multiplication operations (see Fig. 4.4a, highlighted in green). The *J*-bit output data from Histo\_BRAM is right-shifted by *M* bits (in red).
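The fixed-point width calibration of Eqs. (4.3) and (4.4) can be sketched in software as follows. This is a minimal model under stated assumptions, not the FPGA logic itself: each event adds the integer $WCF$ of its merged bin to the histogram instead of adding 1, and the read-out divides by $2^{M}$ with a right shift. The function names, the DNL values, and the event counts are hypothetical.

```python
def wcf_fixed_point(dnl, m_bits):
    """Eq. (4.4): integer WCF = round(2^M / (DNL + 1)) for every merged bin."""
    return [round((1 << m_bits) / (d + 1.0)) for d in dnl]


def accumulate_histogram(merged_codes, wcf):
    """Add WCF[code] per event; repeated addition replaces a multiplier."""
    acc = [0] * len(wcf)
    for code in merged_codes:
        acc[code] += wcf[code]
    return acc


def read_histogram(acc, m_bits):
    """Right-shift by M bits to undo the 2^M scaling at read-out."""
    return [a >> m_bits for a in acc]


if __name__ == "__main__":
    M = 10
    dnl = [0.25, -0.20, 0.00]                    # hypothetical DNL after binning (LSB)
    wcf = wcf_fixed_point(dnl, M)
    # A uniform input fills bins in proportion to their widths (DNL + 1).
    events = [0] * 1250 + [1] * 800 + [2] * 1000
    print(wcf, read_histogram(accumulate_histogram(events, wcf), M))
```

The rounding in `wcf_fixed_point` and the truncating shift are the two places where this model loses accuracy relative to floating-point factors.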

## 4.3. Software prediction

To predict the performance of TDCs before the hardware implementation, a software tool with a graphical user interface (GUI) has been developed and open-sourced on GitHub, available at https://github.com/GitForWJ/TDC\_tools [158].



Figure 4.6 a) Linearity curves for the full-length (2400 bins; LSB = 5.13 ps) original TDC (without using any calibration methods). b) the layout of the full-length original TDC placed in Slice X49Y0 – X49Y299.

To find a proper *i*, a full-length (2400 bins; LSB = 5.13 ps) original TDC placed in Slice X49Y0-X49Y299 was implemented without using any calibration methods. Figure 4.6a shows its linearity curves;  $DNL_{pk-pk}$  is 8.63 LSB and  $INL_{pk-pk}$  is 41.81 LSB. The layout of the full-length original TDC is shown in Fig. 4.6b. For multichannel (128 or more channels) TDCs, TDLs are placed across the whole FPGA chip. Therefore, the proposed TDC offers three different resolutions by merging 10, 16, and 20 ideal bins so that ultra-wide bin problems can be ignored. The achievable resolutions are around 50 ps, 80 ps, and 100 ps, respectively; they are typical resolutions for ToF LiDAR applications [18], [147], [148] and are also similar to the resolutions of commercial TCSPC systems [155], [159].



Figure 4.7 Linearity curves for the 460-bin original TDC placed in Slice X49Y120-X49Y179

Seventeen original TDCs, placed in the different clock regions where the final 128-channel TDC was implemented, were tested so that the software tool covers as many variations as possible. Figure 4.7 shows the linearity curves for one of the tested original TDCs placed in Slice X49Y120-X49Y179, achieving 5.02 ps resolution with DNL<sub>*pk*-*pk*</sub> = 3.86 LSB and INL<sub>*pk*-*pk*</sub> = 8.52 LSB. Figure 4.8 shows the tool's GUI. The linearity measurements of original TDCs are used as the raw data for predictions, selected by the channel number (defined as *Ch-No* in the GUI).

The binning method can build a new TDL by remapping actual bins into merged bins [102]. *WCFs* make the TDL more even. Due to the clock network, ultra-wide bins commonly appear at the edges of a TDL [118]. To further improve the linearity, only a segment of the TDL was selected by changing the start-point and the endpoint (see Fig. 4.8). In this case, the start-point is Bin 6, and the endpoint is Bin 400.



Figure 4.8 The prediction tool's GUI

The measurement uncertainty contains two parts: signal propagation ( $\sigma_{sig}$ ) and equivalent quantization ( $\sigma_{eq}$ ). Therefore, the expected precision ( $\sigma_{exp}$ ) of a TDC can be expressed as:

$$\sigma_{exp}^2 = \sigma_{sig}^2 + \sigma_{eq}^2, \tag{4.5}$$

According to [130],  $\sigma_{sig}$  can be re-derived as:

$$\begin{aligned}\sigma_{sig}^{2} &= \sigma_{clk}^{2} + \sigma_{in}^{2} + \sigma_{DL}^{2} + \sigma_{TIC}^{2} \\ &= \sigma_{clk}^{2} + \sigma_{in}^{2} + \frac{n}{2}\sigma_{CY}^{2} + \sigma_{TIC}^{2},\end{aligned} \tag{4.6}$$

where $\sigma_{clk}$ is the system-clock jitter, and $\sigma_{in}$ is the input signal jitter. The architecture-dependent jitter ( $\sigma_{DL}$ ) caused by delay elements ( $\sigma_{CY}$ ) accumulates through the delay line. The jitter from input circuits ( $\sigma_{TIC}$ ) is negligible in single-TDL single-stage TDCs, as signals are from input/output buffers (IOBs) and transmitted via internal wire connections. Therefore, the expected precision ( $\sigma_{exp}$ ) of a single-stage single-TDL TDC can be considered as:

$$\sigma_{exp}^2 = \sigma_{clk}^2 + \sigma_{in}^2 + \frac{n}{2}\sigma_{CY}^2 + \sigma_{eq}^2, \qquad (4.7)$$

According to [127], [128], the equivalent quantization error  $\sigma_{eq}$  and the equivalent bin width  $w_{eq}$  can be calculated as:

$$\sigma_{eq}^{2} = \sum_{i=1}^{N} \left( \frac{W[i]^{2}}{12} \times \frac{W[i]}{W_{total}} \right) \text{ where } W_{total} = \sum_{i=1}^{N} W[i], \tag{4.8}$$

$$w_{eq} = \sigma_{eq} \sqrt{12},\tag{4.9}$$
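Eqs. (4.8) and (4.9) translate directly into a few lines of code. The sketch below is illustrative only (the function name and the sample bin widths are assumptions); it weights each bin's uniform-quantization variance $W[i]^2/12$ by the probability $W[i]/W_{total}$ that a random hit lands in that bin.

```python
import math


def equivalent_quantization(widths):
    """Return (sigma_eq, w_eq) for a list of bin widths, per Eqs. (4.8)-(4.9)."""
    w_total = sum(widths)
    variance = sum((w**2 / 12.0) * (w / w_total) for w in widths)
    sigma_eq = math.sqrt(variance)
    w_eq = sigma_eq * math.sqrt(12.0)
    return sigma_eq, w_eq


if __name__ == "__main__":
    # Perfectly even TDL: w_eq equals the bin width, sigma_eq = W / sqrt(12).
    print(equivalent_quantization([50.2] * 39))
    # Uneven TDL with the same mean width: wide bins dominate, so w_eq grows.
    print(equivalent_quantization([30.0, 70.4] * 19 + [50.2]))
```

For an even delay line the result reduces to $\sigma_{eq} = W/\sqrt{12} \approx 0.289\,W$; any unevenness pushes $w_{eq}$ above the mean bin width.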

For a fixed input time interval  $(T_{input})$ , the propagation jitter  $(\sigma_{sig})$  follows a Gaussian distribution [160] and causes errors  $(\varepsilon_{sig})$ . Therefore, the captured time interval  $(T_{captured})$  can be expressed in Eq. (4.10). The corresponding bin registers the time interval and results in  $\sigma_{eq}$ .

$$T_{captured} = T_{input} + \varepsilon_{sig}, \tag{4.10}$$

With Eqs (4.5) - (4.10), the precision can be predicted in software.  $\sigma_{CY}$ ,  $\sigma_{in}$  and  $\sigma_{clk}$  are set by changing the element jitter in the GUI (see Fig. 4.8), and each prediction runs 100,000 trials.
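A prediction of this kind can be approximated with a small Monte Carlo model. The sketch below is not the released tool; it assumes that $\sigma_{sig}$ has already been obtained from Eq. (4.6), draws Gaussian errors as in Eq. (4.10), quantizes each captured interval with the bin edges, and reports the standard deviation of the output codes. The function name, the seed, and the bin layout are hypothetical.

```python
import random
import statistics
from bisect import bisect_right
from itertools import accumulate


def simulate_rms_lsb(bin_widths_ps, t_input_ps, sigma_sig_ps, trials=100_000, seed=1):
    """Monte Carlo estimate of the RMS resolution (in bins) for one fixed interval."""
    rng = random.Random(seed)
    upper_edges = list(accumulate(bin_widths_ps))
    last_bin = len(upper_edges) - 1
    codes = []
    for _ in range(trials):
        # Eq. (4.10): captured interval = input interval + Gaussian error
        t_captured = t_input_ps + rng.gauss(0.0, sigma_sig_ps)
        # Number of upper edges at or below t_captured is the bin index
        code = bisect_right(upper_edges, t_captured)
        codes.append(min(code, last_bin))
    return statistics.stdev(codes)


if __name__ == "__main__":
    bins = [50.0] * 40                            # hypothetical even TDL, 50 ps bins
    print(simulate_rms_lsb(bins, t_input_ps=125.0, sigma_sig_ps=5.0))   # mid-bin
    print(simulate_rms_lsb(bins, t_input_ps=100.0, sigma_sig_ps=5.0))   # on a boundary
```

An interval in the middle of a bin is almost always registered by that bin alone, whereas an interval on a boundary is split between two bins, which is the worst case for an even delay line.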

## 4.4. Hardware implementation and test results

To evaluate the proposed method, the proposed 128-channel TDCs were implemented in the Xilinx Kintex UltraScale KCU105 Kit (UltraScale XCKU040), operating at 500 MHz. Code density and time interval tests were conducted to assess linearity and precision performances. Two independent onboard low-jitter crystal oscillators were used for code density tests to ensure the randomness of hit signals and the sampling clock [101]. In time interval tests, the delay elements, IDELAYE3 and ODELAYE3, were used to generate a short delay (< 2 ns) with a controllable time interval between the two event signals [66]. The precision in long measurement ranges was evaluated by measuring the time intervals generated from mixed-mode clock manager (MMCM) modules and delay elements (IDELAYE3 and ODELAYE3). Each experiment captured 1,000,000 samples in code density tests and 100,000 events in time interval tests. The testing environment's temperature was maintained, and an IDELAYCTRL module was used to reduce the impact of PVT variations.
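For reference, linearity figures are commonly derived from a code density histogram as sketched below. This uses the textbook definitions (DNL as the normalized deviation of each bin count from the mean count, INL as the running sum of DNL); the exact conventions used for the results in this chapter may differ, and the function names are illustrative.

```python
from itertools import accumulate


def dnl_inl(counts):
    """DNL and INL (in LSB) from code-density counts of uncorrelated hits."""
    mean = sum(counts) / len(counts)              # count expected for an ideal bin
    dnl = [c / mean - 1.0 for c in counts]
    inl = list(accumulate(dnl))                   # running sum of DNL
    return dnl, inl


def pk_pk(values):
    """Peak-to-peak span of a linearity curve."""
    return max(values) - min(values)


if __name__ == "__main__":
    counts = [150, 50, 100, 100]                  # hypothetical 4-bin histogram
    dnl, inl = dnl_inl(counts)
    print(dnl, inl, pk_pk(dnl), pk_pk(inl))
```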

#### 4.4.1. Linearity

Table 4-1 summarizes linearity performances of the proposed TDCs obtained from software predictions and hardware implementations. Binned TDCs use the binning-only method, whereas hybrid TDCs integrate the MB method.

By changing *i* (*i* = 10, 16, 20), three virtual TDLs were built with different LSBs. According to the software predictions, the binning method improves the linearity but degrades the resolution. The MB method can further improve the linearity by making the bins more even. A similar conclusion can be drawn based on the results from hardware implementations. Figures 4.9 and 4.10 show the DNL and INL curves for the proposed TDCs in hardware implementations (*i* = 10, 16, 20). Hybrid TDCs achieve good linearity (DNL<sub>*pk*-*pk*</sub> and INL<sub>*pk*-*pk*</sub> are less than 0.06 LSB, 0.04 LSB, and 0.02 LSB when the resolutions are 51.28 ps, 83.33 ps, and 105.26 ps, respectively). Moreover, with the MB method, the virtual TDLs are even enough, making  $\sigma_{eq}$  close to its ideal value ( $\frac{1}{\sqrt{12}} \approx 0.289$  LSB, based on [128]).

|                      | <i>i</i> = 10   |                 | <i>i</i> = 16   |                 | <i>i</i> = 20   |                 |
|----------------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
|                      | Binned          | Hybrid          | Binned          | Hybrid          | Binned          | Hybrid          |
| <b>Software predictions (Start-point = 6, Stop-point = 400)</b> |  |  |  |  |  |  |
| LSB (ps)             | 50.20           | 50.20           | 80.32           | 80.32           | 100.40          | 100.40          |
| DNL (LSB)            | [-0.296, 0.305] | [-0.004, 0.003] | [-0.115, 0.116] | [-0.004, 0.004] | [-0.120, 0.154] | [-0.004, 0.004] |
| $DNL_{pk-pk}$ (LSB)  | 0.601           | 0.008           | 0.231           | 0.008           | 0.275           | 0.007           |
| $\sigma_{DNL}$ (LSB) | 0.109           | 0.002           | 0.067           | 0.003           | 0.070           | 0.002           |
| INL (LSB)            | [-0.121, 0.184] | [-0.010, 0.002] | [-0.031, 0.107] | [-0.005, 0.007] | [-0.039, 0.120] | [-0.006, 0.008] |
| $INL_{pk-pk}$ (LSB)  | 0.305           | 0.012           | 0.139           | 0.012           | 0.159           | 0.014           |
| $\sigma_{INL}$ (LSB) | 0.078           | 0.004           | 0.040           | 0.003           | 0.045           | 0.003           |
| $\sigma_{eq}$ (LSB)  | 0.294           | 0.289           | 0.290           | 0.289           | 0.291           | 0.289           |
| $w_{eq}$ (ps)        | 51.05           | 50.20           | 80.83           | 80.32           | 101.10          | 100.40          |
| <b>Hardware implementations</b> |  |  |  |  |  |  |
| LSB (ps)             | 51.28           | 51.28           | 83.33           | 83.33           | 105.26          | 105.26          |
| DNL (LSB)            | [-0.313, 0.215] | [-0.018, 0.021] | [-0.097, 0.113] | [-0.017, 0.016] | [-0.118, 0.156] | [-0.008, 0.008] |
| $DNL_{pk-pk}$ (LSB)  | 0.528           | 0.039           | 0.210           | 0.033           | 0.274           | 0.016           |
| $\sigma_{DNL}$ (LSB) | 0.095           | 0.011           | 0.052           | 0.008           | 0.064           | 0.004           |
| INL (LSB)            | [-0.328, 0.000] | [-0.019, 0.035] | [-0.111, 0.067] | [-0.028, 0.003] | [-0.158, 0.000] | [-0.009, 0.007] |
| $INL_{pk-pk}$ (LSB)  | 0.328           | 0.054           | 0.178           | 0.032           | 0.158           | 0.016           |
| $\sigma_{INL}$ (LSB) | 0.069           | 0.012           | 0.041           | 0.007           | 0.039           | 0.004           |
| $\sigma_{eq}$ (LSB)  | 0.292           | 0.289           | 0.290           | 0.289           | 0.290           | 0.289           |
| $w_{eq}$ (ps)        | 51.94           | 51.29           | 83.65           | 83.34           | 105.87          | 105.26          |

Table 4-1 Linearity performances of the proposed TDCs obtained from software predictions and hardware implementations



Figure 4.9 DNL curves for the proposed TDCs when i = 10, 16 and 20.

The resolutions and linearities obtained from software predictions and hardware implementations are slightly different (highlighted in bold). The difference in the resolution is caused by interpolation loss. In software, virtual TDLs are constructed by merging actual bins directly. To interpolate the 2 ns clock period, the TDCs need 39 bins, 24 bins, and 19 bins when i = 10, 16, and 20 (as the proposed TDC operates at 500 MHz). Therefore, the resolutions in hardware implementations are 51.28 ps, 83.33 ps, and 105.26 ps. Quantization errors caused by *WCF*s result in linearity differences. In software predictions, *WCF*s are floating-point numbers and are multiplied by the bins' widths directly. However, as a trade-off between hardware resources and accuracy, *WCF*s are approximate integers in binary codes in hardware implementations (see Eq. (4.4)) and contribute to quantization errors. Moreover, accumulation operations in hardware also contribute to these errors. Therefore, the proposed hybrid TDC's linearity in hardware is slightly worse than software estimations. Although there are discrepancies, they are minimal and acceptable.
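The interpolation loss can be checked with a short calculation. The sketch below assumes that the number of merged bins is the largest whole number of software-predicted LSBs that fits in one 2 ns clock period; this floor rule is an inference that reproduces the reported bin counts and resolutions, not a statement of how the hardware was configured. The function name is hypothetical.

```python
import math


def hardware_lsb_ps(software_lsb_ps, clock_period_ps=2000.0):
    """Return (bins per clock period, hardware LSB in ps) under the floor-rule assumption."""
    n_bins = math.floor(clock_period_ps / software_lsb_ps)
    return n_bins, clock_period_ps / n_bins


if __name__ == "__main__":
    for i, lsb_sw in ((10, 50.20), (16, 80.32), (20, 100.40)):
        n_bins, lsb_hw = hardware_lsb_ps(lsb_sw)
        print(f"i = {i}: {n_bins} bins, LSB = {lsb_hw:.2f} ps")
```

Because the clock period must be covered by a whole number of bins, the hardware LSB is always slightly larger than the software-predicted one.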



Figure 4.10 INL curves for the proposed TDCs when i = 10, 16 and 20.

In contrast, the binned TDC linearity estimations are similar in software and hardware, especially  $INL_{pk-pk}$ .

#### 4.4.2. Precision

Using the WaveRunner 640Z,  $\sigma_{clk} = 4.42$  ps,  $\sigma_{in} = 4.81$  ps, and  $\sigma_{CY} = 0.16$  ps were obtained. The precision of the proposed TDC can be evaluated in software. Jitters caused by delay elements are accumulated and degrade the precision, see Eq. (4.6). From Eqs. (4.7) and (4.8), the precision decreases when the resolution drops. The expected precisions are 0.31 LSB, 0.30 LSB and 0.29 LSB when i = 10, 16, 20 (included in Table 4-2).

When hit signals with a fixed time interval are fed to the TDC, the precision or root-mean-square (RMS) resolution can be estimated by the standard deviation ( $\sigma$ ) of time interval tests in hardware and can be expressed as:

$$\sigma^2 = \frac{1}{R-1} \sum_{i=1}^{R} (x_i - \mu)^2, \qquad (4.11)$$

where $x_i$ is the bin number of the *i*-th output and $\mu$ is the average value of *R* repeated measurements. In time interval tests, IDELAYE3 and ODELAYE3 controlled small intervals with a step of 11.11 ps. Figure 4.11 shows the results of the short-delay (< 2 ns) time interval tests. Due to an even TDL distribution, the worst cases happen when the input signal falls at the boundary between two bins. The two bins register the signals equally, resulting in the maximum RMS resolutions of 0.5 LSB for the three selected resolutions. Also, due to larger LSBs, at times, only one bin catches time intervals, and the RMS resolution is 0 LSB (see Figs 4.11b and 4.11c).



Figure 4.11 Short-delay time interval test results (< 2 ns): a) i = 10, b) i = 16, and c) i = 20.

The averaged value and the maximum value are not suitable for evaluating the precision performance of the proposed TDCs, because they overestimate or underestimate the TDC with good linearities. Therefore, time interval tests were conducted with *H* different intervals in a coarse counter period  $(T_H - T_1 < 2 \text{ ns})$  and the valid RMS resolution  $(\sigma_{valid})$  was defined to evaluate the precision:

$$\sigma_{valid}^2 = \frac{1}{H} \sum_{i=1}^{H} \sigma_i^2, \qquad (4.12)$$

where  $\sigma_i$  is the standard deviation for tests with a fixed time interval. Figure 4.12 presents valid RMS resolutions for long-range time interval tests. The averaged valid RMS resolutions ( $\sigma_{valid\_ave}$ ) are 0.31 LSB, 0.26 LSB, and 0.25 LSB when i = 10, 16, and 20. Figure 4.12 shows that the proposed TDC performs robustly in precision in long-range measurements (up to 1000 ns).
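Eqs. (4.11) and (4.12) can be illustrated with a few lines of code. The sketch is a toy model with hypothetical data and function names: it computes the sample standard deviation of the output codes for each fixed interval and then takes the root mean square of those values over all tested intervals.

```python
import math
import statistics


def rms_resolution(codes):
    """Eq. (4.11): sample standard deviation (R - 1 denominator) of output codes."""
    return statistics.stdev(codes)


def valid_rms(sigmas):
    """Eq. (4.12): root mean square of the per-interval standard deviations."""
    return math.sqrt(sum(s * s for s in sigmas) / len(sigmas))


if __name__ == "__main__":
    mid_bin = [3] * 2000                          # every hit lands in one bin
    boundary = [3] * 1000 + [4] * 1000            # hits split equally between two bins
    sigmas = [rms_resolution(mid_bin), rms_resolution(boundary)] * 2
    print(sigmas, valid_rms(sigmas))
```

In this toy case the per-interval values alternate between 0 LSB and about 0.5 LSB, and the valid RMS resolution lies between the two extremes, which is why it is preferred here over either the maximum or a single test point.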



Figure 4.12 Valid RMS resolutions in long delay time interval tests (< 1000 ns).

Differences are also observed in Table 4-2;  $\sigma_{exp}$  in software is slightly larger than  $\sigma_{valid\_ave}$  in hardware (highlighted in bold). In software, the signal propagation in the delay line can be restored and the expected precision can be predicted. The quantization error in hardware significantly increases when the input time interval is close to the boundary between two bins. However, the step of the input time intervals is relatively large (11.11 ps), so the hardware tests overestimate the precision. Although the precision estimated from software predictions differs from the precision measured from hardware implementations,  $\sigma_{eq}$  estimated by both approaches is still similar (see Table 4-1), showing that the software tool can robustly predict hardware implementations.

Table 4-2 Precision performance of the proposed TDCs in the software prediction and the hardware implementation

|              | LSB (ps)                                     | Software<br>$\sigma_{exp}$ <sup>1</sup> (LSB) | Hardware, short delay<br>$\sigma_{valid}$ (LSB) | Hardware, long delay<br>$\sigma_{valid\_ave}$ <sup>2</sup> (LSB) |
|--------------|----------------------------------------------|-----------------------------------------------|-------------------------------------------------|------------------------------------------------------------------|
| <i>i</i> =10 | 50.20 <sup>3</sup><br>51.28 <sup>4</sup>     | 0.31                                          | 0.31                                            | 0.31                                                             |
| <i>i</i> =16 | 80.32 <sup>3</sup><br>83.33 <sup>4</sup>     | 0.30                                          | 0.26                                            | 0.26                                                             |
| <i>i</i> =20 | 100.40 <sup>3</sup><br>105.26 <sup>4</sup>   | 0.29                                          | 0.25                                            | 0.25                                                             |

<sup>1</sup> The expected precision based on Eq (4.7); <sup>2</sup> Averaged valid RMS resolution; <sup>3</sup> Values from software predictions; <sup>4</sup> Values from hardware implementations.

## 4.5. Multichannel design

A 128-channel hybrid TDC was implemented in UltraScale FPGAs, and Table 4-3 summarizes the logic resource consumption. Each channel costs around 660 LUTs and 1100 registers. The BRAM usage depends on the configuration of the resolution. In this design, each channel requires 2.5 BRAMs.

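| Modules | Total  | 1-channel<br>Used | 128-channel<br>Used |
|---------|--------|-------------------|---------------------|
| CARRY8  | 30300  | 74 (0.24%)        | 9472 (31.26%)       |
| LUT     | 242400 | 663 (0.27%)       | 87078 (35.92%)      |
| FF      | 484800 | 1124 (0.23%)      | 143940 (29.69%)     |
| BRAM    | 600    | 2.5 (0.42%)       | 320 (53.33%)        |
| CLB     | 30300  | 185 (0.61%)       | 20729 (68.41%)      |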

Table 4-3 Logic resource consumption

Each channel is placed within a clock region to avoid significant clock skews. The 128 channels are placed evenly in the target chip, and the space between adjacent channels needs to be maintained due to the timing requirement and routing congestion. Figure 4.13 shows the layout of the 128-channel hybrid TDC in UltraScale FPGAs. Table 4-4 summarizes the linearity performances of 16 (out of 128) channels spread evenly across the FPGA chip to avoid an over-length presentation. The linearities of the TDC channels in different locations are uniform.


Figure 4.13 The layout of the 128-channel hybrid TDC.

Table 4-4 Linearity performances of 16 channels (out of 128 channels in the proposed multichannel TDC, unit: ×1.0E-3 LSB)

| Ch.                                | 0  | 8  | 16 | 24 | 32 | 40 | 48 | 56 | 64 | 72 | 80 | 88 | 96 | 104 | 112 | 120 | Ave |
|------------------------------------|----|----|----|----|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|
| DNL<sub>pk-pk</sub>, <i>i</i> = 10 | 38 | 33 | 36 | 36 | 39 | 37 | 30 | 34 | 32 | 36 | 34 | 43 | 38 | 35  | 39  | 30  | 36  |
| INL<sub>pk-pk</sub>, <i>i</i> = 10 | 57 | 55 | 50 | 58 | 55 | 55 | 51 | 54 | 58 | 57 | 54 | 54 | 59 | 50  | 57  | 52  | 55  |
| DNL<sub>pk-pk</sub>, <i>i</i> = 16 | 29 | 35 | 34 | 27 | 35 | 34 | 32 | 29 | 29 | 24 | 29 | 27 | 28 | 32  | 25  | 25  | 30  |
| INL<sub>pk-pk</sub>, <i>i</i> = 16 | 28 | 31 | 26 | 28 | 33 | 35 | 28 | 28 | 26 | 27 | 30 | 25 | 35 | 30  | 29  | 26  | 29  |
| DNL<sub>pk-pk</sub>, <i>i</i> = 20 | 20 | 14 | 20 | 12 | 15 | 21 | 19 | 19 | 18 | 15 | 16 | 16 | 14 | 27  | 19  | 15  | 18  |
| INL<sub>pk-pk</sub>, <i>i</i> = 20 | 18 | 15 | 12 | 23 | 12 | 13 | 23 | 18 | 14 | 15 | 23 | 13 | 14 | 19  | 18  | 22  | 17  |

|                   | This Work | 18-[161] | 17-[162] | 20-[148] | 20-[147] | 19-[163] | 19-[156] | 19-[18] |
|-------------------|-----------|----------|----------|----------|----------|----------|----------|---------|
| Type              | FPGA | FPGA | FPGA | ASIC | ASIC | ASIC | ASIC | ASIC |
| Device            | UltraScale | Cyclone IV | Virtex 5 | 180 nm | 350 nm | 40 nm | 90 nm | 180 nm |
| Method            | Mixed-binning | Bin realignment | Counting-weighted | DLL<sup>3</sup> | DLL<sup>3</sup> | Gated Ring Oscillator | Multi-event Histogramming | Dual clock |
| Resolution (ps)   | 51.28, <i>i</i> = 10<br>83.33, <i>i</i> = 16<br>105.26, <i>i</i> = 20 | 45.00 | 60.00 | 50.00 | 78.00 | 33-120 | 35/560 | 48.80 |
| Precision (ps)    | 15.89<sup>1</sup>, <i>i</i> = 10<br>21.67<sup>1</sup>, <i>i</i> = 16<br>26.32<sup>1</sup>, <i>i</i> = 20 | 18.00 | N/S | 36.50 | 33.60 | 208.00 | N/S | 62.37 |
| DNL (×1.0E-3 LSB) | 36<sup>2</sup>, <i>i</i> = 10<br>30<sup>2</sup>, <i>i</i> = 16<br>18<sup>2</sup>, <i>i</i> = 20 | 630 | 780 | 470 | 540<sup>4</sup><br>830<sup>5</sup> | 900 | 100 | 960 |
| INL (×1.0E-3 LSB) | 55<sup>2</sup>, <i>i</i> = 10<br>29<sup>2</sup>, <i>i</i> = 16<br>17<sup>2</sup>, <i>i</i> = 20 | 850 | 1310 | 710 | 360<sup>4</sup><br>1240<sup>5</sup> | 5640 | 180 | 2560 |
| Range (µs)        | 1.00 | 0.007 | N/S | 13.10 | 0.64 | 0.14-0.49 | 0.33<sup>6</sup> | 0.33<sup>6</sup> |

Table 4-5 Comparison Between Reported High-linearity TDCs with Acceptable Resolutions.

<sup>1</sup> Averaged valid RMS resolution measured from long-range tests; <sup>2</sup> The averaged peak-to-peak DNL and INL results of the multichannel hybrid TDC; <sup>3</sup> Delay locked loop, DLL. <sup>4</sup> Minimum values measured from 257 channels; <sup>5</sup> Maximum values measured from 257 channels. <sup>6</sup> Calculated by 50 m maximum measured distance.

## 4.6. Comparison

Table 4-5 summarizes the proposed TDCs and recently reported TDCs with similar resolutions, including FPGA and ASIC designs in the past four years. Although TDCs in [161], [162] can achieve similar resolutions (close to 50 ps), the proposed TDCs have much better linearities. Also, the proposed built-in histogramming function ensures that fast data transmission and processing are feasible for LiDAR systems, especially in driverless vehicles and robotics.

Using gated ring-oscillator architectures, ASIC-TDCs in [163] can configure their resolutions by changing the voltage. However, as a function of the supply voltage, the TDC resolution can be significantly affected by voltage jitter [163]. In contrast, carry chain structures in FPGAs are more robust [54], and the MB method provides a more flexible and reliable way to adjust TDCs' resolution. Generally, ASIC-TDCs can achieve better linearity than FPGA-TDCs through well-planned layout strategies. However, compared with ASIC-TDCs in [18], [147], [148], [156], [163], the proposed TDC can achieve much better linearity and comparable precision (see the proposed TDC with i = 10 for comparison), thanks to the proposed MB method and the resolution-adjustable architecture.

Choosing a suitable resolution is essential in LiDAR systems. In LiDAR image reconstruction, memory usage limits the performance of reconstruction methods (e.g., neural network methods [164]–[166]). A larger bin size corresponds to a smaller number of bins and consumes less memory. In addition, fewer bins can speed up image reconstruction through faster bin indexing [164]. Unlike binning in software [164], the proposed MB method in hardware maintains the TDC's linearity and provides an efficient and flexible way to change the resolution, which makes it suitable for LiDAR applications.
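The relation between bin size and histogram memory can be illustrated with a short sketch. It uses the three resolutions and the 1.00 µs range reported for the proposed TDC in Table 4-5; the 16-bit counter depth and the helper names `histogram_bins` and `histogram_bytes` are assumptions made only for this illustration.

```python
import math


def histogram_bins(range_ps: float, lsb_ps: float) -> int:
    """Number of bins needed to cover range_ps with bins of width lsb_ps."""
    return math.ceil(range_ps / lsb_ps)


def histogram_bytes(n_bins: int, counter_bits: int = 16) -> int:
    """Memory for one histogram; the 16-bit counter depth is an assumption."""
    return n_bins * counter_bits // 8


RANGE_PS = 1.00e6  # 1.00 us measurement range listed in Table 4-5
RESOLUTIONS_PS = {10: 51.28, 16: 83.33, 20: 105.26}  # i -> resolution (ps)

bins_per_setting = {
    i: histogram_bins(RANGE_PS, lsb) for i, lsb in RESOLUTIONS_PS.items()
}

for i, n_bins in bins_per_setting.items():
    print(f"i = {i}: {n_bins} bins, {histogram_bytes(n_bins)} bytes per histogram")
```

Under these assumptions, moving from i = 10 to i = 20 roughly halves the number of bins, and therefore the memory needed per histogram.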

With the MB method, $\sigma_{TDC}$ is degraded due to the relatively large $\sigma_{eq}$ (see Eq. (4.1)) but is still acceptable. LiDAR systems (see Fig. 4.1a) in driverless vehicles can tolerate a distance error of a few centimetres. SPADs are portable and cost-effective detectors for LiDAR systems but can contribute significant jitter (compared with the proposed TDC). For example, the typical SPAD jitter is 219 ps in Ref. [163] and 170 ps in Ref. [156]. However, the distance errors, including jitter from the SPADs and the proposed TDC ($\sigma_{TDC} = 15.89$ ps when i = 10), are still acceptable in driverless vehicles because measured distances generally range from tens of centimetres to hundreds of metres, and 1 cm corresponds to about 66.7 ps of round-trip time in ToF measurements. Furthermore, if a low-jitter detector is used (e.g., SPADs with 25 ps jitter in Ref. [167] and 35 ps jitter in Ref. [168]), the proposed TDC can offer an overall low-jitter system, better than the TDCs in Refs [18], [147], [148], [156], [163].
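The jitter budget discussed above can be checked numerically. The following sketch assumes that the SPAD and TDC jitters are independent and therefore add in quadrature (an assumption of this example, not a result stated in the text), and converts the combined timing jitter into a one-way distance error using the round-trip ToF relation. The function names are hypothetical.

```python
import math

C_M_PER_S = 299_792_458.0  # speed of light in vacuum


def tof_to_distance_cm(t_ps: float) -> float:
    """Convert a round-trip time (ps) into a one-way distance (cm)."""
    return C_M_PER_S * (t_ps * 1e-12) / 2.0 * 100.0


def combined_jitter_ps(*sigmas_ps: float) -> float:
    """Root-sum-square of jitter terms, assuming they are independent."""
    return math.sqrt(sum(s * s for s in sigmas_ps))


SIGMA_TDC_PS = 15.89  # proposed TDC, i = 10

# SPAD jitter values quoted in the text (Refs [163], [156], [168], [167]).
for sigma_spad in (219.0, 170.0, 35.0, 25.0):
    total = combined_jitter_ps(sigma_spad, SIGMA_TDC_PS)
    print(
        f"SPAD {sigma_spad:6.1f} ps -> total {total:6.1f} ps "
        f"-> {tof_to_distance_cm(total):.2f} cm"
    )
```

With the 219 ps detector the combined jitter is dominated by the SPAD and corresponds to a few centimetres, whereas with the low-jitter detectors the TDC contribution becomes a noticeable part of the total.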

The width calibration performs like the bin-by-bin calibration proposed in [110]. In [110], TDC output codes were calibrated to the bins' centre values, resulting in smaller quantization errors. With the width calibration, the difference between calibrating bins to their centre values and to their boundary values is negligible because all bins are sufficiently even. Moreover, compared with the look-up table (LUT) based bin-by-bin calibration, the BRAM-based width calibration is more suitable for multichannel applications, since using many LUTs (or distributed RAMs) would result in congestion in the synthesis and implementation stages [169].

## 4.7. Summary

A new calibration method, the MB method, was proposed to improve linearity and adjust the TDC resolution. A software tool was developed for the TDC community to predict TDC performance robustly. It can assess the performance of calibration methods and TDCs before hardware implementation. A GUI has also been developed to help users design their systems and to guide beginners in understanding the TDC principle.

A cost-effective 128-channel high-linearity resolution-adjustable TDC has been implemented and assessed in UltraScale FPGAs. The proposed 128-channel TDC shows excellent uniformity and, with the MB method, offers excellent linearity, comparable with recently reported ASIC-TDCs with similar resolutions [18], [148]. With an adjustable resolution and the histogramming function, the proposed TDC can be applied to a broad range of ToF LiDAR applications, such as driverless vehicles and robotics.

Moreover, the short development cycle in FPGAs is suitable for the current competitive market.

## Chapter 5. Automatic calibration in ZYNQ-based Structures

#### 5.1. Motivation

Fast prototyping is a common requirement in commercial products. Meanwhile, in industrial applications such as driverless vehicles, multiple sensors, including LiDAR, radar (radio detection and ranging), and cameras, are used to achieve better performance [15], [164], [170]. FPGAs, working in a parallel manner, are a suitable solution for connecting each sensor and implementing functional modules due to the flexibility and short development cycle.

As a high-precision stopwatch, a TDC is a key component in LiDAR and can be realized in ASICs and FPGAs. Owing to significant developments in CMOS technology, both FPGA-TDCs and ASIC-TDCs can achieve picosecond resolutions. However, the nonlinearity in FPGA-TDCs is significant due to the fixed routing and clock networks. As such, there are only a small number of commercial FPGA-TDCs [171]. In contrast, ASIC-TDCs dominate the market for commercial TCSPC products [154], [155], [159], which can be attributed to the better linearity and precision obtained through careful placement and routing strategies.

In FPGAs, TDLs are uneven due to clock skews and process variations, resulting in large nonlinearity. To improve linearity by changing the output pattern of carry-chain modules, Won and Lee proposed the tuned-TDL structure [62]. In bin-by-bin methods, each bin is remapped to its centre and the INL is reduced. The mixed-calibration method and mixed-binning method for different applications were proposed in previous studies by the present authors [19], [66], [172], in which BRAM modules are used to store calibration factors to enhance the linearity further. Despite such efforts, in the described methods the factors are pre-calculated on PCs from the results of code density tests and then loaded into BRAM modules, which makes such methods time-consuming, especially for multichannel designs. In the absence of proper calibration strategies, a trade-off is needed between the number of channels and area occupancy in commercial FPGA-TDC products [173], [174].

In this chapter, a new automatic calibration (AC) strategy for FPGA-TDCs that exploits the ARM-based ZYNQ SoC architecture is introduced.



Figure 5.1 The block diagram of the proposed TDC system based on ZYNQ structures.

#### 5.2. ZYNQ architecture and TDC structure

A block diagram of the proposed TDC system is presented in Fig. 5.1. The system was implemented in a low-cost Xilinx ZYNQ-7000 SoC (28 nm, XC7Z020, ZedBoard development board). The whole system contains two parts: the programmable logic (PL, equivalent to an Artix-7 FPGA) and the processing system (PS, with a dual-core ARM Cortex-A9 inside) [175]. The TDC, including TDLs (the backbone of the proposed TDC), encoders, and calibration modules, was built in PL. PS is used for calculating the calibration coefficients. The channel selector in PL is used for multichannel applications and transfers data between the TDC channels and the ARM core. The advanced eXtensible interface (AXI) is the data bus for communications between PL and PS [176].

To further improve the performance of the TDC, techniques such as WU methods [71], sub-TDL structures [66], and tuned-TDL methods [62] were also applied in the present study.

#### 5.2.1. CARRY4 and tuned-TDL architecture

Unlike CY8 in UltraScale (20 nm) or UltraScale+ (16 nm) devices, the carry chain module in ZYNQ-7000 is CY4, containing four cascaded delay elements. Each delay element has two outputs: CARRY\_OUT (*C*) and SUM\_OUT (*S*) [177]. Won and Lee [62] suggested that TDL-TDCs can be improved by changing the output patterns. Similar tests were conducted in the present study to determine the optimal sampling pattern for ZYNQ-7000 devices. The test results are shown in Table 5-1 and indicate that the "SCSC" pattern had the lowest peak-to-peak DNL in ZYNQ-7000, while its peak-to-peak INL was comparable to that of the other patterns.

| Pattern (LSB = 9.83 ps) | DNL (LSB): [min, max], peak-to-peak | INL (LSB): [min, max], peak-to-peak |
|-------------------------|-------------------------------------|-------------------------------------|
| CCCC                    | [-0.99, 4.32], 5.32                 | [-5.44, 4.17], 9.61                 |
| SSSS                    | [-0.91, 3.26], 4.17                 | [-5.32, 4.71], 10.03                |
| CSCC                    | [-0.98, 3.25], 4.23                 | [-5.29, 5.19], 10.48                |
| CSCS                    | [-0.95, 5.07], 6.02                 | [-7.40, 5.28], 12.68                |
| SCSS                    | [-0.97, 3.26], 4.24                 | [-7.18, 5.27], 12.45                |
| SCSC                    | [-0.89, 2.94], 3.84                 | [-5.73, 4.96], 10.69                |

Table 5-1. Sampling Pattern Comparison.

#### 5.2.2. Weighted histogram calibration method

The MC method proposed by Chen and Li [66] has been well-discussed in the previous chapters. In the described method, two pairs of calibration factors are efficiently used to overcome ultra-wide bin (DNL > 2 LSB) and ultra-narrow bin (DNL < -0.9 LSB) problems. However, one of the significant drawbacks of the MC method is that it cannot suppress the nonlinearity caused by bins with bin-width > 2 LSB. As shown in Fig. 5.2a, the compensation failure caused by Bin [*n*] introduces missing codes. For example, Bin [*M*+1], which is highlighted in red, degrades the resolution. Moreover, the MC method has two steps, the bin compensation and the width calibration, and is therefore not suitable for AC. Thus, a single-step weighted histogram calibration method was proposed in the present design.



Figure 5.2 Example of bin compensation in a) the MC method [66]; and b) the proposed weighted histogram calibration method.



Figure 5.3 Hardware implementation of the weighted histogram calibration.

Figure 5.3 shows the hardware implementation of the proposed weighted histogram calibration. Similar to the MB method [19] and MC method [66], BRAM modules are used in the proposed method to store the factors. However, there are three pairs of factors, including address factors (*Addr L, Addr M,* and *Addr R*) and width factors (*Coe L, Coe M,* and *Coe R*), stored in the calibration BRAM module. At the same time, to store the result, three Histogram BRAM modules were implemented.

There are three possible cases for bin compensation in the proposed calibration method, as shown in Figs 5.4a - 5.4c:

Case A:  $W[k] \le 1$  LSB, Case B: 1 LSB <  $W[k] \le 2$  LSB, Case C: 2 LSB <  $W[k] \le 3$  LSB.

where W[k] is the bin width of the k-th actual bin. Figure 5.5 shows the pseudo-code for calculating address factors. With three pairs of factors, the proposed method can ideally overcome ultra-wide bin problems caused by bins with a width of up to 5 LSB. Figure 5.2b shows an example of such a scenario: Bin [n] is the ultra-wide bin ($W[n] \le 5$ LSB) neighboured by two wide bins, Bin [n-1] and Bin [n+1]. Bin [n] can only be remapped to three ideal bins (Bin [*M*-1], Bin [*M*], and Bin [*M*+1]). However, with the factors *Addr R* [*n*-1] and *Addr L* [*n*+1], the ideal bins (Bin [*M*-2] and Bin [*M*+2]) are filled by the actual bins (Bin [*n*-1] and Bin [*n*+1]). As such, the ultra-wide bin problem caused by Bin [*n*] ($W[n] \le 5$ LSB) can be overcome by means of the proposed method, which is more efficient than the mixed calibration method described in a previous study [66]. Meanwhile, the proposed calibration method is highly suitable for the present design ($DNL_{pk-pk} = 3.84$ LSB in the *SCSC* pattern).



Figure 5.4 Bin compensation when a)  $W[k] \le 1$  LSB; b) 1 LSB  $< W[k] \le 2$  LSB; and c) 2 LSB  $< W[k] \le 3$  LSB.



Figure 5.5 The pseudo-code for calculating address factors.

Calculations for width factors follow similar rules shown in Fig. 5.4. In Case A, only *Coe L* is valid and can be expressed as:

$$Coe \ L[k] = \frac{Width[k,n]}{W[k]},\tag{5.1}$$

where Width[k, n] is a portion of Bin [k] in the actual TDL which should be in Bin [n] in the ideal TDL (highlighted in red in Fig. 5.4a).

In Case B, Bin [k] is remapped to two ideal bins (see Fig. 5.4b). Therefore, *Coe L* and *Coe M* can be defined as:

$$Coe \ L[k] = \frac{Width[k,n-1]}{W[k]},\tag{5.2}$$

$$Coe \ M[k] = \frac{Width[k,n]}{W[k]}.\tag{5.3}$$

In Case C, all width factors are valid and can be calculated as:

$$Coe \ L[k] = \frac{Width[k,n-1]}{W[k]},\tag{5.4}$$

$$Coe \ M[k] = \frac{Width[k,n]}{W[k]},\tag{5.5}$$

$$Coe \ R[k] = \frac{Width[k,n+1]}{W[k]}.\tag{5.6}$$

Compared with the previous mixed calibration method [66], which needs two-round code density tests, the proposed weighted histogram calibration method can calculate both address factors and width factors in a single round (see the difference between Fig. 5.6a and Fig. 5.6b, highlighted in yellow), which is more suitable for AC.
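The single-round calculation of address and width factors can be sketched as follows. This is not the pseudo-code of Fig. 5.5; it is a simplified overlap-based illustration that assumes the actual bin widths (in LSB) are already known from a code density test, places the actual bins back-to-back on the ideal bin grid, and keeps at most three target bins per actual bin, read as the (L, M, R) factor pairs. The function name `calibration_factors` is hypothetical.

```python
import math


def calibration_factors(widths_lsb, max_targets=3):
    """Return, per actual bin, (address factors, width factors).

    Address factors are the indices of the ideal bins that the actual bin
    overlaps; width factors are the overlapped portion divided by W[k],
    in the spirit of Eqs. (5.1)-(5.6). At most max_targets ideal bins are
    kept per actual bin (a simplification of this sketch).
    """
    factors = []
    left = 0.0  # left edge of the current actual bin, in LSB
    for w in widths_lsb:
        right = left + w
        addrs, coes = [], []
        if w > 0:
            n = math.floor(left)  # first ideal bin touched by this actual bin
            while n < right and len(addrs) < max_targets:
                overlap = min(right, n + 1) - max(left, n)
                if overlap > 1e-12:
                    addrs.append(n)
                    coes.append(overlap / w)
                n += 1
        factors.append((addrs, coes))
        left = right
    return factors


# Hypothetical bin widths (LSB): one narrow, one wide, and one ultra-wide bin.
widths = [0.5, 1.5, 2.5, 0.5, 1.0]
factors = calibration_factors(widths)
for k, (addrs, coes) in enumerate(factors):
    print(k, addrs, [round(c, 3) for c in coes])
```

In this example the 2.5 LSB bin is shared among three ideal bins, while the narrow bins each feed a single ideal bin, so that both kinds of factors follow from the same set of measured widths.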



Figure 5.6 Flow diagrams of a) the proposed weighted histogram calibration method; b) the previous mixed calibration method [66].



Figure 5.7 Vivado IP Integrator system block design.



Figure 5.8 a) AXI communication model; b) Three types of AXI interface.

#### 5.2.3. Communications between PL and PS

In ZYNQ devices, PS is a "hardened" IP core that has already been implemented in the chip. IPs can be configured in the block design in the EDA tool (in the present design, Xilinx Vivado 2019.1 was used). Figure 5.7 shows the block design of the proposed TDC system. A basic Xilinx SoC system should contain three parts: PS and its reset, PL, and AXI buses for signal and data transmission.

AXI is an interface protocol defined by ARM and is part of the advanced microcontroller bus architecture (AMBA) standard [178]. The AXI communication model is shown in Fig. 5.8a. The master device (for example, PS) and the slave device (for example, peripheral circuits) communicate through AXI interfaces. There are three types of AXI interfaces as listed in Fig. 5.8b: AXI-Full, AXI-Stream, and AXI-Lite. AXI-Full is for high-performance memory-mapped requirements, such as read and write operations in double data rate (DDR) or synchronous dynamic random-access memory (SDRAM). AXI-Stream is used to transmit data streams, such as videos. For low-speed communications, such as read and write operations in general purpose IO (GPIO), AXI-Lite is used.

In the present design, as a proof-of-concept, the communications between PS and PL use AXI-Lite (low-speed transmission). However, AXI-Full would be required for a real commercial product because of the fast multichannel data transmission. As multiple slave devices are involved (data ports and control signals which are highlighted in red in Fig. 5.6), an AXI interconnect module is introduced to control communications.

#### 5.3. Automatic calibration function

With the ZYNQ SoC architecture, the proposed AC function was realized. The calibration function was written in the C programming language. Although an operating system (OS) can run on ZYNQ, an OS is unnecessary in this study. Using the Vivado SDK (software development kit), a BIN file, containing an executable and linkable format (ELF) file and a bitstream file, was generated. The ELF file is for PS and the bitstream file is for PL. A secure digital (SD) card was used to store the BIN file. After powering up, the ZYNQ evaluation board reads the data from the SD card and executes the code, and then the TDC system is launched. The workflow of the proposed function is shown in Fig. 5.9 and can be divided into two parts: the initial stage and the measurement stage. A program runs after the board is powered up. Code density tests are conducted automatically, and the histogram records the results. PS obtains the histogram from PL and calculates the calibration factors. Subsequently, PS loads the factors into the calibration BRAM module and completes the initial stage. In the measurement stage, the proposed TDC performs like other ordinary TDCs with the TCSPC function. Indexed by the raw data, the calibration results are delivered from the calibration module and stored in the histogram module. Through this procedure, real-time and automatic calibration is achieved.
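The measurement stage can be modelled in software as a table lookup followed by a weighted histogram update. The sketch below is a behavioural illustration only: the factor tables `ADDR` and `COE` hold hypothetical values for five actual bins, floating-point weights stand in for whatever number format the BRAM modules use, and `accumulate` is a made-up helper name.

```python
def accumulate(raw_codes, addr_table, coe_table, n_ideal_bins):
    """Build a calibrated histogram from raw TDC codes.

    Each raw code indexes the factor tables and adds its width factors
    to the ideal bins given by its address factors.
    """
    hist = [0.0] * n_ideal_bins
    for code in raw_codes:
        for addr, coe in zip(addr_table[code], coe_table[code]):
            hist[addr] += coe
    return hist


# Hypothetical factor tables for actual bins of 0.5, 1.5, 2.5, 0.5, 1.0 LSB.
ADDR = [(0,), (0, 1), (2, 3, 4), (4,), (5,)]
COE = [(1.0,), (1 / 3, 2 / 3), (0.4, 0.4, 0.2), (1.0,), (1.0,)]

# Uniformly distributed hits land in each actual bin in proportion to its width.
raw_codes = [0] * 500 + [1] * 1500 + [2] * 2500 + [3] * 500 + [4] * 1000

hist = accumulate(raw_codes, ADDR, COE, n_ideal_bins=6)
print([round(h, 1) for h in hist])
```

For uniformly distributed hits, the uneven raw counts are redistributed into six ideal bins of nearly equal height, which is the behaviour expected from a calibrated code density test.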



Figure 5.9 The workflow of AC TDCs.

The described workflow frees researchers and engineers from time-consuming manual calibration, opening up the possibility of mass production of commercial FPGA-TDC products.



Figure 5.10 a) DNL and b) INL plots of the calibrated and uncalibrated TDCs.

#### 5.4. Experimental results

The proposed AC-WU TDC was implemented and tested on a ZedBoard, which is a complete development board using the ZYNQ-7000 SoC [179]. The board has two independent low-jitter crystal oscillators (Fox-767) [179]. The two uncorrelated oscillators serve as the system clock and the random input for code density tests. The temperature and the operating voltage were kept stable during the experiments.



Figure 5.11 Distribution of a) calibrated bin-widths and b) uncalibrated bin-widths. LSB = 9.83 ps.



Figure 5.12 a) Time interval measurement results and b) Time interval histogram at the time interval about 980 ps.

#### 5.4.1. Linearity and bin width distribution

Figure 5.10 presents the DNL and INL curves of the calibrated and uncalibrated TDCs. Table 5-2 summarizes all parameters for the linearity performance. After the calibration,  $DNL_{pk-pk}$  and  $INL_{pk-pk}$  were improved by 13-fold (from 3.91 LSB to 0.30 LSB) and 18-fold (from 12.05 LSB to 0.67 LSB), respectively.  $\sigma_{DNL}$  was enhanced by 21-fold (from 0.86 LSB to 0.04 LSB), and  $\sigma_{INL}$  was enhanced by 21-fold (from 2.79 LSB to 0.13 LSB). Moreover,  $\omega_{eq}$  and  $\sigma_{eq}$  were improved from 19.76 ps to 9.85 ps and from 5.70 ps to 2.84 ps, respectively.

Figure 5.11 shows the bin-width distributions of the calibrated and uncalibrated TDCs. The bin widths of the calibrated TDC (Fig. 5.11a) are considerably more concentrated than those of the uncalibrated TDC (Fig. 5.11b).

|                      | Tuned & Sub-WU | AC-WU        |
|----------------------|----------------|--------------|
| LSB (ps)             | 9.83           | 9.83         |
| DNL (LSB)            | [-0.93,2.98]   | [-0.14,0.16] |
| $DNL_{pk-pk}$ (LSB)  | 3.91           | 0.30         |
| $\sigma_{DNL}$ (LSB) | 0.86           | 0.04         |
| INL (LSB)            | [-6.52,5.53]   | [-0.25,0.42] |
| $INL_{pk-pk}$ (LSB)  | 12.05          | 0.67         |
| $\sigma_{INL}$ (LSB) | 2.79           | 0.13         |
| $\omega_{eq}$ (ps)   | 19.76          | 9.85         |
| $\sigma_{eq}$ (ps)   | 5.70           | 2.84         |

Table 5-2 Linearity comparison between the uncalibrated TDC and calibrated TDC.
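The quantities in Table 5-2 can be derived from a code density histogram as sketched below. The sketch assumes the commonly used definitions (DNL as the normalized bin-width deviation, INL as the running sum of DNL, $\omega_{eq} = \sqrt{\sum_k W[k]^3 / \sum_k W[k]}$, and $\sigma_{eq} = \omega_{eq}/\sqrt{12}$); these are assumed here to match the definitions used earlier in the thesis, and the function names and the four-bin histogram are hypothetical.

```python
import math


def sigma_eq_ps(w_eq_ps: float) -> float:
    """Equivalent quantization standard deviation for a given equivalent bin width."""
    return w_eq_ps / math.sqrt(12.0)


def linearity_metrics(counts, lsb_ps):
    """Estimate bin widths from code density counts and derive the metrics."""
    total = sum(counts)
    n = len(counts)
    # With uniformly distributed hits, counts are proportional to bin widths.
    widths_ps = [c / total * n * lsb_ps for c in counts]
    dnl = [w / lsb_ps - 1.0 for w in widths_ps]
    inl, acc = [], 0.0
    for d in dnl:
        acc += d
        inl.append(acc)
    w_eq = math.sqrt(sum(w ** 3 for w in widths_ps) / sum(widths_ps))
    return {
        "dnl_pkpk": max(dnl) - min(dnl),
        "inl_pkpk": max(inl) - min(inl),
        "w_eq_ps": w_eq,
        "sigma_eq_ps": sigma_eq_ps(w_eq),
    }


# Hypothetical four-bin histogram with a mean bin width (LSB) of 10 ps.
metrics = linearity_metrics([50, 150, 100, 100], lsb_ps=10.0)
print(metrics)

# Consistency check against the equivalent bin widths listed in Table 5-2.
print(round(sigma_eq_ps(19.76), 2), round(sigma_eq_ps(9.85), 2))
```

Under these definitions, the $\sigma_{eq}$ values computed from the tabulated $\omega_{eq}$ values agree with the 5.70 ps and 2.84 ps entries of Table 5-2.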

#### 5.4.2. Time interval tests

To evaluate the precision of the proposed TDC, time interval tests were conducted. IDELAYE2 and IDELAYCTRL were used to generate a controllable delay between the sampling clock and the hit signal.

As shown in Fig. 5.12a, 30 measurements covering one sampling clock period were conducted, and each measurement captured 100,000 samples. The standard deviation of each measurement was calculated, and the averaged value (13.86 ps) was taken as the RMS resolution of the TDC. Meanwhile, Fig. 5.12b shows a histogram of a 980 ps time interval, which indicates a 14.16 ps RMS resolution.
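The averaging step can be written down in a few lines. The sketch below uses two tiny, made-up sample sets in place of the 30 sets of 100,000 samples, and it uses the population standard deviation; whether the population or the sample form was used in the experiments is not stated, so that choice is an assumption. The helper name `rms_resolution_ps` is hypothetical.

```python
import statistics


def rms_resolution_ps(measurement_sets):
    """Average the per-measurement standard deviations (all values in ps)."""
    sigmas = [statistics.pstdev(samples) for samples in measurement_sets]
    return sum(sigmas) / len(sigmas), sigmas


# Two made-up sets of measured time intervals (ps), one per delay setting.
measurement_sets = [
    [970.0, 990.0, 980.0, 980.0],
    [100.0, 120.0, 100.0, 120.0],
]

rms_ps, per_set_sigmas = rms_resolution_ps(measurement_sets)
print(f"per-set sigma: {[round(s, 2) for s in per_set_sigmas]} ps")
print(f"averaged RMS resolution: {rms_ps:.2f} ps")
```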



Figure 5.13 Implementation layouts of a) a single channel and b) 16 channels.

|                | CARRY4      | LUTs          | DFFs           | BRAM        |
|----------------|-------------|---------------|----------------|-------------|
| Available      | 13300       | 53200         | 106400         | 140         |
| Single channel | 50 (0.38%)  | 764 (1.44%)   | 1095 (1.02%)   | 2 (1.42%)   |
| 16-channel     | 800 (6.02%) | 9681 (18.19%) | 15141 (14.23%) | 32 (22.85%) |
| AXI Bus        | 0           | 797 (1.49%)   | 1278 (1.20%)   | 0           |

Table 5-3 Logic resource utilization

#### 5.5. Multichannel implementation and logic resource consumption

A 16-channel AC-WU TDC was implemented and tested in the ZYNQ-7000 SoC. Each TDL (containing 50 CARRY4s) was placed within a clock region to avoid significant clock skews. Every WU launcher was constrained near the first CARRY4 of the corresponding TDL to minimize the jitter introduced by routing resources (see Fig. 5.13a).

To show the efficiency of the proposed weighted histogram calibration method, 16 channels were distributed in the tested chip randomly (see Fig. 5.13b).

Table 5-3 summarizes the resource consumption for the 16-channel design. Each channel costs 764 LUTs, 1095 DFFs, and 2 BRAMs. The usage report shows the considerable potential of the proposed TDC architecture in multichannel applications. Moreover, the AXI bus was used for communications between PL and PS with the cost of 1278 DFFs and 797 LUTs.

Code density tests were conducted for all channels. The linearity performances for the 16 channels are shown in Table 5-4. In the optimal case (Channel No.0), $DNL_{pk-pk}$ was 0.21 LSB and $INL_{pk-pk}$ was 0.52 LSB. In the worst case (Channel No.7), $DNL_{pk-pk}$ was 0.69 LSB and $INL_{pk-pk}$ was 0.86 LSB. The averaged peak-to-peak values of DNL and INL were 0.38 LSB and 0.63 LSB, respectively, revealing that the proposed 16-channel TDC has good uniformity.

| Channel        | 0    | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10   | 11   | 12   | 13   | 14   | 15   | Ave. |
|----------------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| $DNL_{pk-pk}$  | 0.21 | 0.34 | 0.33 | 0.28 | 0.30 | 0.29 | 0.56 | 0.69 | 0.25 | 0.41 | 0.25 | 0.26 | 0.52 | 0.27 | 0.48 | 0.60 | 0.38 |
| $\sigma_{DNL}$ | 0.03 | 0.05 | 0.05 | 0.04 | 0.04 | 0.04 | 0.10 | 0.07 | 0.04 | 0.06 | 0.03 | 0.04 | 0.07 | 0.03 | 0.06 | 0.07 | 0.05 |
| $INL_{pk-pk}$  | 0.52 | 0.55 | 0.51 | 0.52 | 0.67 | 0.45 | 0.90 | 0.86 | 0.51 | 0.45 | 0.57 | 0.37 | 0.87 | 0.62 | 0.89 | 0.87 | 0.63 |
| $\sigma_{INL}$ | 0.10 | 0.11 | 0.10 | 0.12 | 0.13 | 0.08 | 0.17 | 0.18 | 0.09 | 0.08 | 0.11 | 0.08 | 0.19 | 0.14 | 0.23 | 0.26 | 0.14 |

Table 5-4 Summary of linearity performance for 16-channel TDCs (Units: LSB)

Table 5-5 Comparison of published calibration methods

| Ref-Year  | Methods                                            | Multiple steps | Auto/manual calibration |
|-----------|----------------------------------------------------|----------------|-------------------------|
| [101]-10  | Bin-by-bin calibration                             | Single-step    | Manual                  |
| [59]-17   | Bin-width calibration                              | Single-step    | Manual                  |
| [66]-19   | Mixed calibration                                  | Two-step       | Manual                  |
| [153]-21  | Gain & error calibration                           | Single-step    | Auto                    |
| This work | Weighted histogram calibration with an AC function | Single-step    | Auto                    |

| Ref-Year  | Methods | LSB (ps) | $\omega_{eq}$ (ps) | RMS Resol. (ps) | DNL (LSB) | INL (LSB) | LUT | DFF | BRAM | AXI Bus |
|-----------|---------|----------|--------------------|-----------------|-----------|-----------|-----|-----|------|---------|
| [69]-18   | Two-Stage Delay Line Loop Shrinking | 8.50 | N/S<sup>1</sup> | 42.40 | [-0.22, 0.36] | [-0.62, 0.91] | N/S<sup>1</sup> | N/S<sup>1</sup> | 0 | - |
| [67]-20   | Bidirectional RO Vernier | 24.50 | N/S<sup>1</sup> | 28.00 | [-0.20, 0.25] | [0.03, 0.82] | 172 | 986 | 0 | - |
| [146]-20  | PLL Delay Matrix with DDR | 15.60 | N/S<sup>1</sup> | 15.60 | [-0.18, 0.18]<sup>2</sup> | [-0.16, 0.14]<sup>2</sup> | 9886<sup>3</sup> | N/S<sup>1</sup> | 0 | - |
| [153]-21  | Slide Scale, Gain & Error cal., Moving Ave. | 4.88 | N/S<sup>1</sup> | 2.90~8.03 | [-0.10, 0.15] | [-0.23, 0.28] | 2962 | 4157 | 0 | - |
| This Work | WU-A, Tuned-TDL, Sub-TDL, Auto Cal. | 9.83 | 9.85 | 13.86<sup>4</sup> | [-0.14, 0.16],<br>0.38<sup>5</sup> | [-0.25, 0.42],<br>0.63<sup>5</sup> | 764 | 1095 | 2 | 1278 DFFs<br>797 LUTs |

Table 5-6 Summary of recently published FPGA-TDCs with comparable performances

<sup>1</sup>N/S= not specified; <sup>2</sup> Rounding values from data presented in literature; <sup>3</sup>Combinational ALUTs in Altera FPGA; <sup>4</sup> The RMS resolution is measured internally. <sup>5</sup>Averaged peak-peak DNL or INL results of the Multichannel TDCs.

#### 5.6. Comparison

A comparison of the proposed calibration method with other published calibration methods is provided in Table 5-5, while summaries of recently published FPGA-TDCs and the proposed TDC are provided in Table 5-6.

Although calibration methods such as bin-by-bin calibration [101], bin-width calibration [59], and mixed calibration (MC) [66] can improve linearity and precision, they require manual calibration, which makes them unsuitable for commercial products. The gain and error calibration [153] corrects data automatically using signal processing methods, which is efficient but complex. However, the method in [153] requires two signal interpolators and consumes more resources per channel than the present solution (see Table 5-6).

Unlike ultra-high resolution (< 5 ps) TDCs [79], [92], [118], [121], [153], [172], the aim of the AC-WU TDC is to achieve a high resolution and linearity simultaneously in low-cost SoC devices. Compared with previously reported TDCs with similar resolutions, the proposed TDCs are easy to implement in modern SoC devices and have better linearity. The TDC has also exhibited similar linearity performance to the PLL delay matrix TDC proposed in previous research [146]. However, the PLL delay matrix TDC requires 6-fold more LUTs than the proposed TDC with the AXI bus. At the same time, easy implementation is essential for broader applications, which is achieved in the proposed design compared with the two-stage delay line loop shrinking TDC [69] and the bidirectional RO Vernier TDC [67], since modern FPGAs do not have dedicated logic resources to construct loop architectures.

#### 5.7. Conclusion

In the present study, a multichannel automatic calibration architecture based on ZYNQ SoCs and a weighted histogram calibration method were proposed for the first time. A 16-channel TDC system was implemented and tested. The achieved resolution was 9.83 ps, and  $DNL_{pk-pk}$  and  $INL_{pk-pk}$  were improved to 0.38 LSB and 0.63 LSB, respectively. Meanwhile, the multichannel test results show good uniformity between channels.
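For readers who want to see how peak-to-peak figures of this kind are typically derived, the following is a minimal sketch of the conventional code density test. It is not the weighted histogram calibration proposed in this chapter. The function names and the synthetic 8-bin histogram are illustrative assumptions, and INL is taken here as the running sum of DNL.

```python
def dnl_inl_from_histogram(counts):
    """Return per-bin DNL and INL (in LSB) from a code density histogram.

    Assumes the hits were produced by events uniformly distributed in time,
    so each bin's hit count is proportional to its width.
    """
    total = sum(counts)
    ideal = total / len(counts)                # expected hits per bin (1 LSB wide)
    dnl = [c / ideal - 1.0 for c in counts]    # bin-width error relative to 1 LSB
    inl = []
    acc = 0.0
    for d in dnl:
        acc += d                               # INL as the running sum of DNL
        inl.append(acc)
    return dnl, inl


def peak_to_peak(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


# Synthetic histogram (hypothetical data, not measured results)
counts = [100, 110, 90, 105, 95, 100, 120, 80]
dnl, inl = dnl_inl_from_histogram(counts)
print(f"DNL pk-pk = {peak_to_peak(dnl):.2f} LSB")
print(f"INL pk-pk = {peak_to_peak(inl):.2f} LSB")
```

Because the DNL values sum to zero by construction, the last INL entry returns to zero, which is a quick sanity check on any histogram fed to this sketch.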

Through the proposed design, researchers can easily produce repeatable measurements using established ARM-based ZYNQ devices. The proposed approach can also be implemented with open-source softcore processors for low-cost commercial developments. Notably, a softcore processor in an FPGA costs at least thousands of LUTs and DFFs [180], [181], which is close to or even higher than the resource consumption of a single channel of the proposed TDC. The Rocket Chip Generator [182], a dedicated open-source core, and its variation S-RISC-V [183] consume over 30K LUTs and 15K DFFs. Resource optimization in softcore processors is complex and requires expertise in the implementation of control buses and data buses as well as the development of arithmetic logic units (ALUs).

In the current TCSPC products, timestamps are calibrated and processed on PCs. Peripheral Component Interconnect Express (PCIe) interfaces are essential in such systems to achieve real-time data processing. However, the proposed ZYNQ-based AC-WU TDC is a cheaper solution. Benefiting from automatic calibration with ARM-core processors, the proposed design does not require manual calibration, differing from the direct histogram TDC [59] and the mixed calibration TDC [66]. At the same time, the on-chip solution will meet the increasing demand for fast data processing, which widens the scope of application of the AC-WU TDC.

## Chapter 6. Conclusion

#### 6.1. Summary

Owing to the growing demand for TCSPC techniques, TDCs are gaining increasing attention in industry and academic circles. In this thesis, three efficient implementations of FPGA-TDCs using TDL structures were successfully demonstrated. A review of the reported designs and methods was first provided, with the aim of addressing the drawbacks that limit the performance of TDCs and their application in industry.

A review was provided in Chapter 2, including TDCs' critical parameters, novel architectures, calibration methods, and TDCs' applications. In general, due to the development of FPGA manufacturing techniques, FPGA-TDCs can achieve a high resolution comparable with that of ASIC-TDCs. Despite this advantage, the major drawbacks of FPGA-TDCs are bubbles caused by clock distributions and nonlinearities caused by device imperfections. Starting with the review, this thesis presented three different FPGA-TDC designs.

Chapter 3 demonstrated a strategy to implement a high-resolution WU TDC on UltraScale FPGAs (20 nm CMOS process). The proposed DSWU TDC is the first efficient WU-based TDC on UltraScale FPGAs, achieving 1.23-ps resolution. To improve the TDC's linearity, a binning method was proposed in this chapter. The binned DSWU TDC (2.48-ps resolution) achieved  $DNL_{pk-pk} = 2.61$  LSB and  $INL_{pk-pk} = 4.45$  LSB. Moreover, equations to analyse the measurement uncertainty were also provided. Following these equations, the theoretical value for the DSWU TDC is  $\sigma_{system} = 3.58$  ps and the experimental value is  $\sigma_{system} = 3.63$  ps, so the experimental results are in good agreement with the analysis. With the proposed equations, researchers can build a high-precision TDC by reducing the errors contributed by different error sources.

A 128-channel resolution-adjustable (51.28 ps, 83.33 ps, and 105.26 ps) high-linearity (both  $DNL_{pk-pk}$  and  $INL_{pk-pk}$  are less than 0.05 LSB) TDC was presented in Chapter 4. Owing to its high linearity, the proposed TDC reached a competitive precision: 15.89 ps at the selected resolution of 51.28 ps, whereas other reported TDCs with a resolution around 50 ps have a worse precision (> 18 ps) [161], [162]. With a 128-channel design, the proposed TDC is suitable for LiDAR applications, especially in driverless vehicles, where a photon detector array (hundreds of detectors) is required. Moreover, a software tool has been developed and open-sourced, which will continuously help peers to understand and evaluate different TDC calibration methods. The link is https://github.com/GitForWJ/TDC\_tools.
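As a rough cross-check of the precision quoted above, the sketch below computes the RMS quantization error of an ideal uniform quantizer, LSB/√12, for the three selectable resolutions. This is a textbook lower bound under the assumption of perfectly uniform bins and no jitter. It is not the uncertainty analysis developed in this thesis, and the function name is illustrative.

```python
import math


def quantization_rms(lsb_ps):
    """RMS quantization error (ps) of an ideal uniform quantizer with bin width lsb_ps."""
    return lsb_ps / math.sqrt(12)


# The three selectable resolutions reported for the Chapter 4 TDC
for lsb in (51.28, 83.33, 105.26):
    print(f"LSB = {lsb:6.2f} ps -> ideal quantization floor = {quantization_rms(lsb):5.2f} ps")
```

Under these assumptions the 51.28 ps setting has a floor of about 14.8 ps, so a measured precision of 15.89 ps sits close to what an ideal quantizer of that bin width could deliver.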

Based on the ZYNQ SoC architecture, a 16-channel automatic calibration WU-TDC (9.83 ps resolution) with good linearity ( $DNL_{pk-pk} = 0.38$  LSB and  $INL_{pk-pk} = 0.63$  LSB) was proposed in Chapter 5. Using the ZYNQ structure, the proposed TDC overcomes one of the major drawbacks of the FPGA-TDC, the time-consuming manual calibration. Because of the troublesome manual calibration, mass production becomes impossible for commercial FPGA-TDC products. The test results show that the proposed calibration method is robust and is a good solution for commercial FPGA-TDC products.

The three proposed TDCs show comparable performances to other reported FPGA-TDCs. Meanwhile, the proposed equations for theoretical analysis and the open-sourced tool will continuously help peers to improve their TDCs further. Besides, the proposed automatic calibration solution shows a path to commercialize FPGA-TDC designs for industrial applications.

#### 6.2. Future work

Several potential works could be further developed based on existing research.

Error and uncertainty analysis has been explored for many years. Inspired by Szplet *et al.* [130], a detailed error source analysis was previously conducted [172], and inherent convert timing jitter was found to limit TDCs' precision, especially when the achievable resolution is better than 5 ps [63], [80], [92], [172]. Sondej *et al.* later analysed the transfer function of the wave union method [184]. Such research facilitates a better understanding of the characteristics and performance of a TDC and can further improve TDCs' performance.

In the previous research [172], the wave union method was demonstrated to still be efficient in UltraScale FPGAs if bubbles are removed. The multi-sampling wave union (MSWU) method was proposed to fully utilize the wave union method [186]. However, bubble sorting<sup>1</sup> is completed on a PC [187], [188] and is not efficient. Therefore, an intelligent bubble removing method is needed.

Calibration in FPGA-TDCs uses correction coefficients to minimize nonlinearity. Since these coefficients are calculated on a PC, manual channel-by-channel, chip-by-chip calibration is unavoidable. Therefore, in the present study, the first automatic calibration TDC using the ZYNQ SoC architecture was proposed, rendering FPGA-TDCs more suitable for commercial applications. Despite such developments, the power consumption of the PS (the ARM core) must not be ignored if the product is applied in industry. A considerably smaller, low-power soft-core processor is desired. In the soft-core processor, floating-point operations are unnecessary, because they may introduce rounding errors. Also, the frequency of the soft-core processor could be lower than 100 MHz, with a suitable FIFO (first-in, first-out) depth. A processor of at least 16 bits is a must ($2^{16} - 1 = 65535$), because more than 1,000,000 events will be recorded in a histogram with more than 100 bins. Additionally, to cancel nonlinearity, a trained neural network was integrated with TDCs in 1997 [189]. Considering that thermal noise and electronic jitter follow a Gaussian distribution, a jitter-removing neural network might be a potential solution for the further improvement of TDCs' precision.

<sup>1</sup> Bubble sorting in this thesis is not the software algorithm for list sorting in computer science. Here, bubble sorting is a method to locate bubbles.
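The word-width argument above can be checked with a few lines of integer arithmetic. The sketch below is illustrative only: `counter_bits` is a hypothetical helper, and the even spread of events over bins is an assumption made to estimate a typical per-bin count.

```python
def counter_bits(max_count: int) -> int:
    """Smallest unsigned register width (in bits) that can hold max_count."""
    return max(1, max_count.bit_length())


events = 1_000_000          # events accumulated in one calibration histogram
bins = 100                  # number of histogram bins
mean_per_bin = events // bins

print("largest 16-bit unsigned value:", 2**16 - 1)
print("bits for the mean bin count  :", counter_bits(mean_per_bin))
print("bits for the total event count:", counter_bits(events))
```

With these numbers an average bin holds 10,000 hits, which needs 14 bits and leaves headroom in a 16-bit word for wider-than-average bins. A running total of all events needs 20 bits, so a 16-bit core would have to hold that value in more than one word.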

In addition to the fundamental works which aim to improve TDCs' performance, a notable research topic is the implementation of TDC-based ADCs in FPGAs. In FPGAs, LVDS pads and the input buffers are basic elements. Because of their comparator structures, LVDS buffers are naturally 1-bit ADCs [190]. Wu et al. first described an FPGA-based slope ADC [191]. The system clock feeds an output buffer, and a slope is generated by the parasitic capacitance. Two input pads of an LVDS comparator are used, one for the tested signal and the other for the slope signal. The LVDS comparator generates a pulse, and the time of the pulse reflects the voltage to be measured. Without external resistors and capacitors, FPGA-based TDC-ADCs are simple and fully reconfigurable. At the same time, owing to TDCs' low deadtime, FPGA-based TDC-ADCs can work at a high sampling rate. In a recent report [192], a 1.2 GSample/s ADC was implemented in UltraScale+ FPGAs. The newest radio frequency (RF) SoC [193] integrates a high-rate ADC, which supports up to 6 GSample/s. However, the price is high (around 30,000 USD per evaluation board). The TDC-ADC can be implemented on a low-price FPGA. Due to the simplicity and high sampling rate, the FPGA-based TDC-ADC is an optimal and low-cost solution for radio spectrometers [194].
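The slope ADC principle described above can be illustrated with a short conversion sketch. It assumes an ideal linear ramp, whereas a ramp produced by parasitic capacitance would in practice be closer to an RC curve and need calibration. The function name, the 10 ps LSB, and the 0.5 V/ns ramp rate are hypothetical values chosen for illustration, not figures from the cited works.

```python
def code_to_voltage(code, lsb_ps, ramp_v_per_ns, v_start=0.0):
    """Map a TDC code to the ramp voltage at the comparator crossing.

    Assumes an ideal linear ramp starting at v_start when the TDC starts counting.
    """
    t_ns = code * lsb_ps / 1000.0           # crossing time measured by the TDC
    return v_start + ramp_v_per_ns * t_ns   # voltage reached by the ramp at that time


lsb_ps = 10.0            # hypothetical TDC resolution
ramp_v_per_ns = 0.5      # hypothetical ramp rate

voltage = code_to_voltage(100, lsb_ps, ramp_v_per_ns)
step_mv = ramp_v_per_ns * lsb_ps            # (V/ns) * ps = mV per TDC code
print(f"code 100 -> {voltage:.3f} V, voltage step per code = {step_mv:.1f} mV")
```

The sketch makes the trade-off visible: for a fixed TDC resolution, a slower ramp gives a finer voltage step but takes longer to sweep the input range, which lowers the achievable sampling rate.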

### References

- [1] W. A. Marrison, 'A high precision standard of frequency', *The Bell System Technical Journal*, vol. 8, no. 3, pp. 493–514, Jul. 1929, doi: 10.1002/j.1538-7305.1929.tb04431.x.
- [2] W. A. Marrison, 'The evolution of the quartz crystal clock', *The Bell System Technical Journal*, vol. 27, no. 3, pp. 510–588, Jul. 1948, doi: 10.1002/j.1538-7305.1948.tb01343.x.
- [3] L. Essen and J. V. L. Parry, 'An Atomic Standard of Frequency and Time Interval: A Cæsium Resonator', *Nature*, vol. 176, no. 4476, pp. 280–282, Aug. 1955, doi: 10.1038/176280a0.
- [4] S. A. Diddams *et al.*, 'An Optical Clock Based on a Single Trapped <sup>199</sup>Hg<sup>+</sup> Ion', *Science*, vol. 293, no. 5531, pp. 825–828, Aug. 2001, doi: 10.1126/science.1061171.
- [5] J. Kalisz, 'Review of methods for time interval measurements with picosecond resolution', *Metrologia*, vol. 41, no. 1, pp. 17–32, Dec. 2003, doi: 10.1088/0026-1394/41/1/004.
- [6] A. E. Stevens, R. P. Van Berg, J. Van der Spiegel, and H. H. Williams, 'A time-to-voltage converter and analog memory for colliding beam detectors', *IEEE Journal of Solid-State Circuits*, vol. 24, no. 6, pp. 1748–1752, Dec. 1989, doi: 10.1109/4.45016.
- [7] J. Kalisz, R. Pelka, and A. Poniecki, 'Precision time counter for laser ranging to satellites', *Review of Scientific Instruments*, vol. 65, no. 3, pp. 736–741, Mar. 1994, doi: 10.1063/1.1145094.
- [8] G. Acconcia, F. Malanga, S. Farina, M. Ghioni, and I. Rech, 'A 1.9 ps-rms precision Time-to-Amplitude Converter with 782 fs LSB and 0.79%-rms DNL', *IEEE Trans. Instrum. Meas.*, pp. 1–1, 2023, doi: 10.1109/TIM.2023.3271755.
- [9] G. Acconcia, M. Ghioni, and I. Rech, '4.3ps rms jitter time to amplitude converter in 350nm Si-Ge technology', in 2021 7th International Conference on Event-Based Control, Communication, and Signal Processing (EBCCSP), Jun. 2021, pp. 1–4. doi: 10.1109/EBCCSP53293.2021.9502398.
- [10] H. Chung, M. Hyun, and J. Kim, 'A 360-fs-Time-Resolution 7-bit Stochastic Time-to-Digital Converter With Linearity Calibration Using Dual Time Offset Arbiters in 65-nm CMOS', *IEEE Journal of Solid-State Circuits*, vol. 56, no. 3, pp. 940–949, Mar. 2021, doi: 10.1109/JSSC.2020.3036960.
- [11] D. Piatti, F. Remondino, and D. Stoppa, 'State-of-the-Art of TOF Range-Imaging Sensors', in *TOF Range-Imaging Cameras*, F. Remondino and D. Stoppa, Eds., Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 1–9. doi: 10.1007/978-3-642-27523-4\_1.
- [12] K. Nie, G. Tian, Q. Chen, Z. Wang, J. Xu, and Z. Gao, 'Method for power reduction of demodulation driver circuit in indirect time-of-flight CMOS image sensor', *Appl. Opt.*, vol. 60, no. 34, p. 10649, Dec. 2021, doi: 10.1364/AO.441845.
- [13] F. Piron, D. Morrison, M. R. Yuce, and J.-M. Redouté, 'A Review of Single-Photon Avalanche Diode Time-of-Flight Imaging Sensor Arrays', *IEEE Sensors Journal*, vol. 21, no. 11, pp. 12654–12666, Jun. 2021, doi: 10.1109/JSEN.2020.3039362.

- [14] C. Bruschini, H. Homulle, I. M. Antolovic, S. Burri, and E. Charbon, 'Single-photon avalanche diode imagers in biophotonics: review and outlook', *Light: Science & Applications*, vol. 8, no. 1, Art. no. 1, Sep. 2019, doi: 10.1038/s41377-019-0191-5.
- [15] H. Song, W. Choi, and H. Kim, 'Robust Vision-Based Relative-Localization Approach Using an RGB-Depth Camera and LiDAR Sensor Fusion', *IEEE Transactions on Industrial Electronics*, vol. 63, no. 6, pp. 3725–3736, Jun. 2016, doi: 10.1109/TIE.2016.2521346.
- [16] U. Larsson, J. Forsberg, and A. Wernersson, 'Mobile robot localization: integrating measurements from a time-of-flight laser', *IEEE Transactions on Industrial Electronics*, vol. 43, no. 3, pp. 422–431, Jun. 1996, doi: 10.1109/41.499815.
- [17] Z.-P. Li *et al.*, 'Super-resolution single-photon imaging at 8.2 kilometers', *Opt. Express, OE*, vol. 28, no. 3, pp. 4076–4087, Feb. 2020, doi: 10.1364/OE.383456.
- [18] C. Zhang, S. Lindner, I. M. Antolović, J. Mata Pavia, M. Wolf, and E. Charbon, 'A 30-frames/s, 252x144 SPAD Flash LiDAR With 1728 Dual-Clock 48.8-ps TDCs, and Pixel-Wise Integrated Histogramming', *IEEE Journal of Solid-State Circuits*, vol. 54, no. 4, pp. 1137–1151, Apr. 2019, doi: 10.1109/JSSC.2018.2883720.
- [19] W. Xie, Y. Wang, H. Chen, and D. D.-U. Li, '128-channel high-linearity resolution-adjustable time-to-digital converters for LiDAR applications: software predictions and hardware implementations', *IEEE Transactions on Industrial Electronics*, pp. 1–1, 2021, doi: 10.1109/TIE.2021.3076708.
- [20] P. Lecoq *et al.*, 'Roadmap toward the 10 ps time-of-flight PET challenge', *Phys. Med. Biol.*, May 2020, doi: 10.1088/1361-6560/ab9500.
- [21] S. Surti and J. S. Karp, 'Update on latest advances in time-of-flight PET', *Physica Medica*, vol. 80, pp. 251–258, Dec. 2020, doi: 10.1016/j.ejmp.2020.10.031.
- [22] T. K. Lewellen, 'Recent developments in PET detector technology', *Phys. Med. Biol.*, vol. 53, no. 17, pp. R287–R317, Sep. 2008, doi: 10.1088/0031-9155/53/17/R01.
- [23] M. Conti, 'State of the art and challenges of time-of-flight PET', *Physica Medica*, vol. 25, no. 1, pp. 1–11, 2009, doi: 10.1016/j.ejmp.2008.10.001.
- [24] H. Li et al., 'An Accurate Timing Alignment Method With Time-to-Digital Converter Linearity Calibration for High-Resolution TOF PET', *IEEE Transactions on Nuclear Science*, vol. 62, no. 3, pp. 799–804, Jun. 2015, doi: 10.1109/TNS.2015.2430751.
- [25] J. R. Lakowicz, Ed., Principles of Fluorescence Spectroscopy. Boston, MA: Springer US, 2006. doi: 10.1007/978-0-387-46312-4.
- [26] P. Bastiaens, 'Fluorescence lifetime imaging microscopy: spatial resolution of biochemical processes in the cell', *Trends in Cell Biology*, vol. 9, no. 2, pp. 48–52, Feb. 1999, doi: 10.1016/S0962-8924(98)01410-X.
- [27] W. Becker, *Advanced time-correlated single photon counting techniques*. in Springer series in chemical physics, no. 81. Berlin; New York: Springer, 2005.
- [28] S. I. Di Domenico et al., 'Chapter 28 Functional Near-Infrared Spectroscopy: Proof of Concept for Its Application in Social Neuroscience', in *Neuroergonomics*, H. Ayaz and F. Dehais, Eds., Academic Press, 2019, pp. 169–173. doi: 10.1016/B978-0-12-811926-6.00028-2.

- [29] A. Torricelli *et al.*, 'Time domain functional NIRS imaging for human brain mapping', *NeuroImage*, vol. 85, pp. 28–50, Jan. 2014, doi: 10.1016/j.neuroimage.2013.05.106.
- [30] F. Scholkmann *et al.*, 'A review on continuous wave functional near-infrared spectroscopy and imaging instrumentation and methodology', *NeuroImage*, vol. 85, pp. 6–27, Jan. 2014, doi: 10.1016/j.neuroimage.2013.05.004.
- [31] 'IEEE Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems', IEEE. doi: 10.1109/IEEESTD.2008.4579760.
- [32] J. Serrano and P. Alvarez, 'The White Rabbit Project', in *Proceedings of ICALEPCS TUC004*, Kobe, Japan, 2009, p. 3.
- [33] J. Sánchez-Garrido *et al.*, 'A White Rabbit-Synchronized Accurate Time-Stamping Solution for the Small-Sized Cameras of the Cherenkov Telescope Array', *IEEE Transactions on Instrumentation and Measurement*, vol. 70, pp. 1–14, 2021, doi: 10.1109/tim.2020.3013343.
- [34] J. Gu et al., 'Prototype of Clock and Timing Distribution and Synchronization Electronics for SHINE', *IEEE Transactions on Nuclear Science*, vol. 68, no. 8, pp. 2113–2120, Aug. 2021, doi: 10.1109/TNS.2021.3094546.
- [35] A. Bakker and J. H. Huijsing, 'Micropower CMOS temperature sensor with digital output', *IEEE J. Solid-State Circuits*, vol. 31, no. 7, pp. 933–937, Jul. 1996, doi: 10.1109/4.508205.
- [36] U. Sonmez, F. Sebastiano, and K. A. A. Makinwa, 'Compact Thermal-Diffusivity-Based Temperature Sensors in 40-nm CMOS for SoC Thermal Monitoring', *IEEE J. Solid-State Circuits*, vol. 52, no. 3, pp. 834–843, Mar. 2017, doi: 10.1109/JSSC.2016.2646798.
- [37] W. Song, J. Lee, N. Cho, and J. Burm, 'An Ultralow Power Time-Domain Temperature Sensor With Time-Domain Delta–Sigma TDC', *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 64, no. 10, pp. 1117–1121, Oct. 2017, doi: 10.1109/TCSII.2015.2503717.
- [38] H. Jiang, C. Huang, M. R. Chan, and D. A. Hall, 'A 2-in-1 Temperature and Humidity Sensor With a Single FLL Wheatstone-Bridge Front-End', *IEEE Journal of Solid-State Circuits*, vol. 55, no. 8, pp. 2174–2185, Aug. 2020, doi: 10.1109/JSSC.2020.2989585.
- [39] C. Chen, C. Chen, Y. Lin, and S. You, 'An All-Digital Time-Domain Smart Temperature Sensor With a Cost-Efficient Curvature Correction', *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 27, no. 1, pp. 29–36, Jan. 2019, doi: 10.1109/TVLSI.2018.2867215.
- [40] H. J. Kimble, 'The quantum internet', *Nature*, vol. 453, no. 7198, pp. 1023–1030, Jun. 2008, doi: 10.1038/nature07127.
- [41] J. Zhang, C. Shen, H. Su, M. T. Arafin, and G. Qu, 'Voltage Over-scaling-based Lightweight Authentication for IoT Security', *IEEE Transactions on Computers*, pp. 1–1, 2021, doi: 10.1109/TC.2021.3049543.
- [42] X. Ma, X. Yuan, Z. Cao, B. Qi, and Z. Zhang, 'Quantum random number generation', *npj Quantum Information*, vol. 2, no. 1, Art. no. 1, Jun. 2016, doi: 10.1038/npjqi.2016.21.
- [43] F. Acerbi *et al.*, 'Structures and Methods for Fully-Integrated Quantum Random Number Generators', *IEEE Journal of Selected Topics in Quantum Electronics*, vol. 26, no. 3, pp. 1–8, May 2020, doi: 10.1109/JSTQE.2020.2990216.
- [44] A. Khanmohammadi, R. Enne, M. Hofbauer, and H. Zimmermann, 'A Monolithic Silicon Quantum Random Number Generator Based on Measurement of Photon Detection Time', *IEEE Photonics Journal*, vol. 7, no. 5, pp. 1–13, Oct. 2015, doi: 10.1109/JPHOT.2015.2479411.

- [45] P. Zhang, Advanced Industrial Control Technology. Elsevier, 2010. doi: 10.1016/C2009-0-20337-0.
- [46] I. Kuon, R. Tessier, and J. Rose, FPGA architecture: survey and challenges. in Foundations and trends in electronic design automation, no. 2,2. Boston, Mass.: Now Publ, 2008.
- [47] Xilinx, '7 Series FPGAs SelectIO Resources User Guide (UG471)', 2018. https://www.xilinx.com/support/documentation/user\_guides/ug471\_7Series\_SelectIO.pdf
- [48] Intel, 'Designing with Low-Level Primitives User Guide'. https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug\_low\_level.pdf (accessed Sep. 03, 2020).
- [49] H. Amano, Ed., *Principles and Structures of FPGAs*. Singapore: Springer Singapore, 2018. doi: 10.1007/978-981-13-0824-6.
- [50] L. H. Crockett, R. A. Elliot, M. A. Enderwitz, R. W. Stewart, and University of Strathclyde, Eds., *The Zynq Book: embedded processing with the ARM Cortex-A9 on the Xilinx Zynq-7000 all programmable SoC*, 1. ed. Glasgow: Strathclyde Academic Media, 2014.
- [51] Xilinx, 'AXI Reference Guide (UG761)', Nov. 15, 2012. https://www.xilinx.com/support/documentation/ip\_documentation/axi\_ref\_guide/latest/ug761\_axi\_reference\_guide.pdf (accessed Mar. 07, 2022).
- [52] AMD, 'ds190-Zynq-7000-Overview.pdf Viewer AMD Adaptive Computing Documentation Portal', 2018. https://docs.xilinx.com/v/u/en-US/ds190-Zynq-7000-Overview (accessed May 21, 2023).
- [53] R. Szplet, *Time-to-Digital Converters*. in Signals and Communication Technology. Berlin, Heidelberg: Springer Berlin Heidelberg, 2014. doi: 10.1007/978-3-642-39655-7\_7.
- [54] R. Nutt, 'Digital Time Intervalometer', *Review of Scientific Instruments*, vol. 39, no. 9, pp. 1342–1345, Sep. 1968, doi: 10.1063/1.1683667.
- [55] F. Severini, I. Cusini, D. Berretta, K. Pasquinelli, A. Incoronato, and F. Villa, 'SPAD Pixel With Sub-NS Dead-Time for High-Count Rate Applications', *IEEE Journal of Selected Topics in Quantum Electronics*, vol. 28, no. 2, pp. 1–8, Mar. 2022, doi: 10.1109/JSTQE.2021.3124825.
- [56] L. Neri *et al.*, 'Dead Time of Single Photon Avalanche Diodes', *Nuclear Physics B Proceedings Supplements*, vol. 215, no. 1, pp. 291–293, Jun. 2011, doi: 10.1016/j.nuclphysbps.2011.04.034.
- [57] C. Zhang *et al.*, 'A 240 × 160 3D-Stacked SPAD dToF Image Sensor With Rolling Shutter and In-Pixel Histogram for Mobile Devices', *IEEE Open Journal of the Solid-State Circuits Society*, vol. 2, pp. 3–11, 2022, doi: 10.1109/OJSSCS.2021.3118332.
- [58] C. Zhang, S. Lindner, I. M. Antolovic, M. Wolf, and E. Charbon, 'A CMOS SPAD Imager with Collision Detection and 128 Dynamically Reallocating TDCs for Single-Photon Counting and 3D Time-of-Flight Imaging', *Sensors*, vol. 18, no. 11, Art. no. 11, Nov. 2018, doi: 10.3390/s18114016.
- [59] H. Chen, Y. Zhang, and D. D. Li, 'A Low Nonlinearity, Missing-Code Free Time-to-Digital Converter Based on 28-nm FPGAs With Embedded Bin-Width Calibrations', *IEEE Transactions on Instrumentation and Measurement*, vol. 66, no. 7, pp. 1912–1921, Jul. 2017, doi: 10.1109/TIM.2017.2663498.

- [60] T. E. Rahkonen and J. T. Kostamovaara, 'The use of stabilized CMOS delay lines for the digitization of short time intervals', *IEEE Journal of Solid-State Circuits*, vol. 28, no. 8, pp. 887–894, Aug. 1993, doi: 10.1109/4.231325.
- [61] Xilinx, 'UltraScale Architecture Libraries Guide (UG974)', 2014. https://www.xilinx.com/support/documentation/sw\_manuals/xilinx2014\_1/ug974-vivado-ultrascale-libraries.pdf
- [62] J. Y. Won and J. S. Lee, 'Time-to-Digital Converter Using a Tuned-Delay Line Evaluated in 28-, 40-, and 45-nm FPGAs', *IEEE Trans. Instrum. Meas.*, vol. 65, no. 7, pp. 1678–1689, Jul. 2016, doi: 10.1109/TIM.2016.2534670.
- [63] R. Szplet, D. Sondej, and G. Grzęda, 'High-Precision Time Digitizer Based on Multiedge Coding in Independent Coding Lines', *IEEE Transactions on Instrumentation and Measurement*, vol. 65, no. 8, pp. 1884–1894, Aug. 2016, doi: 10.1109/TIM.2016.2555218.
- [64] Y. Wang, X. Zhou, Z. Song, J. Kuang, and Q. Cao, 'A 3.0-ps rms Precision 277-MSamples/s Throughput Time-to-Digital Converter Using Multi-Edge Encoding Scheme in a Kintex-7 FPGA', *IEEE Transactions on Nuclear Science*, vol. 66, no. 10, pp. 2275–2281, Oct. 2019, doi: 10.1109/TNS.2019.2938571.
- [65] Z. Song, Y. Wang, and J. Kuang, 'A 256-channel, high throughput and precision time-to-digital converter with a decomposition encoding scheme in a Kintex-7 FPGA', J. Inst., vol. 13, no. 05, pp. P05012–P05012, May 2018, doi: 10.1088/1748-0221/13/05/P05012.
- [66] H. Chen and D. D. Li, 'Multichannel, Low Nonlinearity Time-to-Digital Converters Based on 20 and 28 nm FPGAs', *IEEE Transactions on Industrial Electronics*, vol. 66, no. 4, pp. 3265–3274, Apr. 2019, doi: 10.1109/TIE.2018.2842787.
- [67] K. Cui and X. Li, 'A High-Linearity Vernier Time-to-Digital Converter on FPGAs With Improved Resolution Using Bidirectional-Operating Vernier Delay Lines', *IEEE Transactions on Instrumentation and Measurement*, vol. 69, no. 8, pp. 5941–5949, Aug. 2020, doi: 10.1109/TIM.2019.2959423.
- [68] J. Zhang, Y. Wang, and Z. Song, 'A ring-oscillator based multi-mode time-to-digital converter on Xilinx Kintex-7 FPGA', *Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment*, vol. 1011, p. 165578, Sep. 2021, doi: 10.1016/j.nima.2021.165578.
- [69] J. Zhang and D. Zhou, 'An 8.5-ps Two-Stage Vernier Delay-Line Loop Shrinking Time-to-Digital Converter in 130-nm Flash FPGA', *IEEE Trans. Instrum. Meas.*, vol. 67, no. 2, pp. 406–414, Feb. 2018, doi: 10.1109/TIM.2017.2769239.
- [70] K. Cui, X. Li, Z. Liu, and R. Zhu, 'Toward Implementing Multichannels, Ring-Oscillator-Based, Vernier Time-to-Digital Converter in FPGAs: Key Design Points and Construction Method', *IEEE Transactions on Radiation and Plasma Medical Sciences*, vol. 1, no. 5, pp. 391–399, Sep. 2017, doi: 10.1109/TRPMS.2017.2712260.
- [71] J. Wu and Z. Shi, 'The 10-ps wave union TDC: Improving FPGA TDC resolution beyond its cell delay', in 2008 IEEE Nuclear Science Symposium Conference Record, Oct. 2008, pp. 3440–3446. doi: 10.1109/NSSMIC.2008.4775079.
- [72] Xilinx, 'UltraScale Architecture Clocking Resources User Guide (UG572)', Aug. 25, 2021. https://docs.xilinx.com/v/u/en-US/ug572-ultrascale-clocking
- [73] Xilinx, 'UltraScale Architecture Configurable Logic Block User Guide (UG574)'. 2017. Accessed: Nov. 05, 2019. [Online]. Available: https://www.xilinx.com/support/documentation/user\_guides/ug574-ultrascale-clb.pdf

- [74] Y. Wang, J. Kuang, C. Liu, and Q. Cao, 'A 3.9-ps RMS Precision Time-to-Digital Converter Using Ones-Counter Encoding Scheme in a Kintex-7 FPGA', *IEEE Transactions on Nuclear Science*, vol. 64, no. 10, pp. 2713–2718, Oct. 2017, doi: 10.1109/TNS.2017.2746626.
- [75] B. I. Abdulrazzaq, I. Abdul Halin, S. Kawahito, R. M. Sidek, S. Shafie, and N. A. Md. Yunus, 'A review on high-resolution CMOS delay lines: towards sub-picosecond jitter performance', *SpringerPlus*, vol. 5, no. 1, p. 434, Apr. 2016, doi: 10.1186/s40064-016-2090-z.
- [76] R. Szplet, J. Kalisz, and R. Szymanowski, 'Interpolating time counter with 100 ps resolution on a single FPGA device', *IEEE Transactions on Instrumentation and Measurement*, vol. 49, no. 4, pp. 879–883, Aug. 2000, doi: 10.1109/19.863942.
- [77] H. Homulle, S. Visser, B. Patra, and E. Charbon, 'Design techniques for a stable operation of cryogenic field-programmable gate arrays', *Review of Scientific Instruments*, vol. 89, no. 1, p. 014703, Jan. 2018, doi: 10.1063/1.5004484.
- [78] Q. Shen et al., 'A 1.7 ps Equivalent Bin Size and 4.2 ps RMS FPGA TDC Based on Multichain Measurements Averaging Method', *IEEE Transactions on Nuclear Science*, vol. 62, no. 3, pp. 947–954, Jun. 2015, doi: 10.1109/TNS.2015.2426214.
- [79] X. Qin, L. Wang, D. Liu, Y. Zhao, X. Rong, and J. Du, 'A 1.15-ps Bin Size and 3.5-ps Single-Shot Precision Time-to-Digital Converter With On-Board Offset Correction in an FPGA', *IEEE Transactions on Nuclear Science*, vol. 64, no. 12, pp. 2951–2957, Dec. 2017, doi: 10.1109/TNS.2017.2768082.
- [80] N. Lusardi, F. Garzetti, N. Corna, R. D. Marco, and A. Geraci, 'Very High-Performance 24-Channels Time-to-Digital Converter in Xilinx 20-nm Kintex UltraScale FPGA', in 2019 IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC), Oct. 2019, pp. 1–4. doi: 10.1109/NSS/MIC42101.2019.9059958.
- [81] N. Lusardi, F. Garzetti, and A. Geraci, 'Digital instrument with configurable hardware and firmware for multi-channel time measures', *Rev. Sci. Instrum.*, vol. 90, no. 5, p. 14, May 2019.
- [82] J. Y. Won, S. I. Kwon, H. S. Yoon, G. B. Ko, J. Son, and J. S. Lee, 'Dual-Phase Tapped-Delay-Line Time-to-Digital Converter With On-the-Fly Calibration Implemented in 40 nm FPGA', *IEEE Transactions on Biomedical Circuits and Systems*, vol. 10, no. 1, pp. 231–242, Feb. 2016, doi: 10.1109/TBCAS.2015.2389227.
- [83] T. Sui et al., 'A 2.3-ps RMS Resolution Time-to-Digital Converter Implemented in a Low-Cost Cyclone V FPGA', *IEEE Transactions on Instrumentation and Measurement*, vol. 68, no. 10, pp. 3647–3660, Oct. 2019, doi: 10.1109/TIM.2018.2880940.
- [84] P. Chen, S.-L. Liu, and J. Wu, 'A CMOS pulse-shrinking delay element for time interval measurement', *IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing*, vol. 47, no. 9, pp. 954–958, Sep. 2000, doi: 10.1109/82.868466.
- [85] R. Szplet and K. Klepacki, 'A two-stage time-to-digital converter based on cyclic pulse shrinking', in 2009 IEEE International Frequency Control Symposium Joint with the 22nd European Frequency and Time forum, Apr. 2009, pp. 1133–1136. doi: 10.1109/FREQ.2009.5168374.

- [86] R. Szplet and K. Klepacki, 'An FPGA-Integrated Time-to-Digital Converter Based on Two-Stage Pulse Shrinking', *IEEE Transactions on Instrumentation and Measurement*, vol. 59, no. 6, pp. 1663–1670, Jun. 2010, doi: 10.1109/TIM.2009.2027777.
- [87] R. Enomoto, T. Iizuka, T. Koga, T. Nakura, and K. Asada, 'A 16-bit 2.0-ps Resolution Two-Step TDC in 0.18-μm CMOS Utilizing Pulse-Shrinking Fine Stage With Built-In Coarse Gain Calibration', *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 27, no. 1, pp. 11–19, Jan. 2019, doi: 10.1109/TVLSI.2018.2867505.
- [88] Xilinx, 'UltraScale Architecture DSP Slice User Guide', Aug. 30, 2021. https://docs.xilinx.com/v/u/en-US/ug579-ultrascale-dsp
- [89] P. Kwiatkowski, 'Employing FPGA DSP blocks for time-to-digital conversion', *Metrology and Measurement Systems*, vol. 26, no. 4, pp. 631–643, Dec. 2019, doi: 10.24425/MMS.2019.130570.
- [90] S. Tancock, E. Arabul, N. Dahnoun, and S. Mehmood, 'Can DSP48A1 adders be used for high-resolution delay generation?', in 2018 7th Mediterranean Conference on Embedded Computing (MECO), Jun. 2018, pp. 1–6. doi: 10.1109/MECO.2018.8406083.
- [91] X. Qin et al., 'A high resolution time-to-digital-convertor based on a carry-chain and DSP48E1 adders in a 28-nm field-programmable-gate-array', *Review of Scientific Instruments*, vol. 91, no. 2, p. 024708, Feb. 2020, doi: 10.1063/1.5141391.
- [92] M. Zhang, K. Yang, Z. Chai, H. Wang, Z. Ding, and W. Bao, 'High-Resolution Time-to-Digital Converters Implemented on 40-, 28-, and 20-nm FPGAs', *IEEE Transactions on Instrumentation and Measurement*, vol. 70, pp. 1–10, 2021, doi: 10.1109/TIM.2020.3036066.
- [93] M. Zhang, H. Wang, and Y. Liu, 'A 7.4 ps FPGA-Based TDC with a 1024-Unit Measurement Matrix', *Sensors*, vol. 17, no. 4, p. 865, Apr. 2017, doi: 10.3390/s17040865.
- [94] J. Wu and J. Xu, 'A Novel TDC Scheme: Combinatorial Gray Code Oscillator Based TDC for Low Power and Low Resource Usage Applications', in 2019 5th International Conference on Event-Based Control, Communication, and Signal Processing (EBCCSP), May 2019, pp. 1–7. doi: 10.1109/EBCCSP.2019.8836892.
- [95] S. Araújo, R. Machado, and J. Cabral, 'Double-sampling Gray TDC with a ROS Interface for a LiDAR System', in 2021 7th International Conference on Event-Based Control, Communication, and Signal Processing (EBCCSP), Jun. 2021, pp. 1–8. doi: 10.1109/EBCCSP53293.2021.9502403.
- [96] Yao-Jen Chuang, Hsin-Hung Ou, and Bin-Da Liu, 'A novel bubble tolerant thermometer-to-binary encoder for flash A/D converter', in 2005 IEEE VLSI-TSA International Symposium on VLSI Design, Automation and Test (VLSI-TSA-DAT), Apr. 2005, pp. 315–318. doi: 10.1109/VDAT.2005.1500084.
- [97] Q. Shen *et al.*, 'A fast improved fat tree encoder for wave union TDC in an FPGA', *Chinese Phys. C*, vol. 37, no. 10, p. 106102, Oct. 2013, doi: 10.1088/1674-1137/37/10/106102.
- [98] V. H. Bui, S. Beak, S. Choi, J. Seon, and T. Ted. Jeong, 'Thermometer-to-binary encoder with bubble error correction (BEC) circuit for Flash Analog-to-Digital Converter (FADC)', in *International Conference on Communications and Electronics 2010*, Aug. 2010, pp. 102–106. doi: 10.1109/ICCE.2010.5670690.

- [99] C. Liu and Y. Wang, 'A 128-Channel, 710 M Samples/Second, and Less Than 10 ps RMS Resolution Time-to-Digital Converter Implemented in a Kintex-7 FPGA', *IEEE Transactions on Nuclear Science*, vol. 62, no. 3, pp. 773–783, Jun. 2015, doi: 10.1109/TNS.2015.2421319.
- [100] S. Kumar, M. K. Suman, and K. L. Baishnab, 'A novel approach to thermometer-to-binary encoder of flash ADCs-bubble error correction circuit', in 2014 2nd International Conference on Devices, Circuits and Systems (ICDCS), Coimbatore, India: IEEE, Mar. 2014, pp. 1–6. doi: 10.1109/ICDCSyst.2014.6926213.
- [101] J. Wu, 'Several Key Issues on Implementing Delay Line Based TDCs Using FPGAs', *IEEE Transactions on Nuclear Science*, vol. 57, no. 3, pp. 1543–1548, Jun. 2010, doi: 10.1109/TNS.2010.2045901.
- [102] Y. Wang and C. Liu, 'A Nonlinearity Minimization-Oriented Resource-Saving Time-to-Digital Converter Implemented in a 28 nm Xilinx FPGA', *IEEE Transactions on Nuclear Science*, vol. 62, no. 5, pp. 2003–2009, Oct. 2015, doi: 10.1109/TNS.2015.2475630.
- [103] W. Pan, G. Gong, and J. Li, 'A 20-ps Time-to-Digital Converter (TDC) Implemented in Field-Programmable Gate Array (FPGA) with Automatic Temperature Correction', *IEEE Transactions on Nuclear Science*, vol. 61, no. 3, pp. 1468–1473, Jun. 2014, doi: 10.1109/TNS.2014.2320325.
- [104] C. Ugur, E. Bayer, N. Kurz, and M. Traxler, 'A 16 channel high resolution (<11 ps RMS) Time-to-Digital Converter in a Field Programmable Gate Array', *J. Inst.*, vol. 7, no. 02, pp. C02004–C02004, Feb. 2012, doi: 10.1088/1748-0221/7/02/C02004.
- [105] P. Lecoq, 'Pushing the Limits in Time-of-Flight PET Imaging', *IEEE Transactions on Radiation and Plasma Medical Sciences*, vol. 1, no. 6, pp. 473–485, Nov. 2017, doi: 10.1109/TRPMS.2017.2756674.
- [106] Y. Hua, K. Bantounos, A. V. N. Jalajakumari, A. Turpin, I. Underwood, and D. Chitnis, 'A pulse oximeter based on time-of-flight histograms', in *Photonic Instrumentation Engineering VIII*, International Society for Optics and Photonics, Mar. 2021, p. 116930G. doi: 10.1117/12.2577757.
- [107] G. Kaissis and R. Braren, 'Pancreatic cancer detection and characterization state of the art cross-sectional imaging and imaging data analysis', *Translational Gastroenterology and Hepatology*, vol. 4, no. 0, May 2019, Accessed: Jun. 26, 2019. [Online]. Available: http://tgh.amegroups.com/article/view/5070
- [108] W. T. Kwong et al., 'Low Rates of Malignancy and Mortality in Asymptomatic Patients With Suspected Neoplastic Pancreatic Cysts Beyond 5 Years of Surveillance', *Clin. Gastroenterol. Hepatol.*, vol. 14, no. 6, pp. 865–871, 2016, doi: 10.1016/j.cgh.2015.11.013.
- [109] P. Chen, Y. Hsiao, Y. Chung, W. X. Tsai, and J. Lin, 'A 2.5-ps Bin Size and 6.7-ps Resolution FPGA Time-to-Digital Converter Based on Delay Wrapping and Averaging', *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 25, no. 1, pp. 114–124, Jan. 2017, doi: 10.1109/TVLSI.2016.2569626.
- [110] J. Wu, 'On-Chip processing for the wave union TDC implemented in FPGA', in 2009 16th IEEE-NPSS Real Time Conference, May 2009, pp. 279–282. doi: 10.1109/RTC.2009.5322002.
- [111] S. Tancock, J. Rarity, and N. Dahnoun, 'The Wave-Union Method on DSP Blocks: Improving FPGA-Based TDC Resolutions by 3x With a 1.5x Area Increase', *IEEE Transactions on Instrumentation and Measurement*, vol. 71, pp. 1–11, 2022, doi: 10.1109/TIM.2022.3141753.

- [112] J. Wu, Y. Shi, and D. Zhu, 'A low-power Wave Union TDC implemented in FPGA', J. Inst., vol. 7, no. 01, pp. C01021–C01021, Jan. 2012, doi: 10.1088/1748-0221/7/01/C01021.
- [113] S. Wang, J. Wu, S. Yao, and W. Chang, 'A Field-Programmable Gate Array (FPGA) TDC for the Fermilab SeaQuest (E906) Experiment and Its Test with a Novel External Wave Union Launcher', *IEEE Transactions on Nuclear Science*, vol. 61, no. 6, pp. 3592–3598, Dec. 2014, doi: 10.1109/TNS.2014.2362883.
- [114] J. Qi, H. Gong, and Y. Liu, 'On-Chip Real-Time Correction for a 20-ps Wave Union Time-To-Digital Converter (TDC) in a Field-Programmable Gate Array (FPGA)', *IEEE Transactions on Nuclear Science*, vol. 59, no. 4, pp. 1605–1610, Aug. 2012, doi: 10.1109/TNS.2012.2201952.
- [115] R. Szplet, P. Kwiatkowski, Z. Jachna, and K. Rozyc, 'Several issues on the use of independent coding lines for time-to-digital conversion', in 2013 IEEE Nordic-Mediterranean Workshop on Time-to-Digital Converters (NoMe TDC), Oct. 2013, pp. 1–8. doi: 10.1109/NoMeTDC.2013.6658238.
- [116] N. Lusardi, F. Garzetti, and A. Geraci, 'The role of sub-interpolation for Delay-Line Time-to-Digital Converters in FPGA devices', *Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment*, vol. 916, pp. 204–214, Feb. 2019, doi: 10.1016/j.nima.2018.11.100.
- [117] Y. Wang and C. Liu, 'A 3.9 ps Time-Interval RMS Precision Time-to-Digital Converter Using a Dual-Sampling Method in an UltraScale FPGA', *IEEE Transactions on Nuclear Science*, vol. 63, no. 5, pp. 2617–2621, Oct. 2016, doi: 10.1109/TNS.2016.2596305.
- [118] P. Kwiatkowski and R. Szplet, 'Efficient implementation of multiple time coding lines-based TDC in an FPGA device', *IEEE Transactions on Instrumentation and Measurement*, vol. 69, no. 10, pp. 7353–7364, Oct. 2020, doi: 10.1109/TIM.2020.2984929.
- [119] R. Szplet, J. Kalisz, and Z. Jachna, 'A 45 ps time digitizer with a two-phase clock and dual-edge two-stage interpolation in a field programmable gate array device', *Meas. Sci. Technol.*, vol. 20, no. 2, p. 025108, Jan. 2009, doi: 10.1088/0957-0233/20/2/025108.
- [120] R. Szplet, P. Kwiatkowski, Z. Jachna, and K. Różyc, 'An Eight-Channel 4.5-ps Precision Timestamps-Based Time Interval Counter in FPGA Chip', *IEEE Transactions on Instrumentation and Measurement*, vol. 65, no. 9, pp. 2088–2100, Sep. 2016, doi: 10.1109/TIM.2016.2564038.
- [121] Y. Wang and C. Liu, 'A 4.2 ps Time-Interval RMS Resolution Time-to-Digital Converter Using a Bin Decimation Method in an UltraScale FPGA', *IEEE Transactions on Nuclear Science*, vol. 63, no. 5, pp. 2632–2638, Oct. 2016, doi: 10.1109/TNS.2016.2606627.
- [122] Xilinx, 'KCU105 Board User Guide', Feb. 06, 2019. https://www.xilinx.com/support/documents/boards\_and\_kits/kcu105/ug917-kcu105-eval-bd.pdf
- [123] Silicon Labs, 'ANY-FREQUENCY PRECISION CLOCK MULTIPLIER/ JITTER ATTENUATOR', Aug. 23, 2021. https://www.skyworksinc.com/-/media/Skyworks/SL/documents/public/data-sheets/Si5324.pdf
- [124] Stanford Research Systems, 'CG 635 2.05 GHz Synthesized Clock Generator'. https://www.thinksrs.com/downloads/pdfs/manuals/CG635m.pdf

- [125] LeCroy, 'WaveRunner 6 Zi Oscilloscopes', May 2017. https://cdn.teledynelecroy.com/files/manuals/waverunner-6zi-operators-manual.pdf
- [126] J. Rivoir, 'Statistical Linearity Calibration of Time-To-Digital Converters Using a Free-Running Ring Oscillator', in 2006 15th Asian Test Symposium, Nov. 2006, pp. 45–50. doi: 10.1109/ATS.2006.260991.
- [127] J. Wu, 'Uneven bin width digitization and a timing calibration method using cascaded PLL', in 2014 19th IEEE-NPSS Real Time Conference, May 2014, pp. 1–4. doi: 10.1109/RTC.2014.7097534.
- [128] R. Szymanowski, R. Szplet, and P. Kwiatkowski, 'Quantization error in precision time counters', *Meas. Sci. Technol.*, vol. 26, no. 7, p. 075002, Jun. 2015, doi: 10.1088/0957-0233/26/7/075002.
- [129] A. Balla *et al.*, 'The characterization and application of a low resource FPGA-based time to digital converter', *Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment*, vol. 739, pp. 75–82, Mar. 2014, doi: 10.1016/j.nima.2013.12.033.
- [130] R. Szplet, R. Szymanowski, and D. Sondej, 'Measurement Uncertainty of Precise Interpolating Time Counters', *IEEE Transactions on Instrumentation and Measurement*, vol. 68, no. 11, pp. 4348–4356, Nov. 2019, doi: 10.1109/TIM.2018.2886940.
- [131] Y. Wang and C. Liu, 'A 3.9 ps Time-Interval RMS Precision Time-to-Digital Converter Using a Dual-Sampling Method in an UltraScale FPGA', *IEEE Transactions on Nuclear Science*, vol. 63, no. 5, pp. 2617–2621, Oct. 2016, doi: 10.1109/TNS.2016.2596305.
- [132] Hamamatsu Photonics, 'MPPC module C13366-3050GD'. https://www.hamamatsu.com/resources/pdf/ssd/c13366-1350gd\_etc\_kacc1229e.pdf (accessed Jun. 28, 2021).
- [133] ID Quantique, 'ID101 Brochure', *ID Quantique*. https://marketing.idquantique.com/acton/attachment/11868/f-0235/1/-/-/-//ID101\_Brochure.pdf (accessed Mar. 10, 2021).
- [134] M. H. Nunes *et al.*, 'Recovery of logged forest fragments in a human-modified tropical landscape during the 2015-16 El Niño', *Nat Commun*, vol. 12, no. 1, p. 1526, Dec. 2021, doi: 10.1038/s41467-020-20811-y.
- [135] A. Comerón, C. Muñoz-Porcar, F. Rocadenbosch, A. Rodríguez-Gómez, and M. Sicard, 'Current Research in Lidar Technology Used for the Remote Sensing of Atmospheric Aerosols', *Sensors*, vol. 17, no. 6, Art. no. 6, Jun. 2017, doi: 10.3390/s17061450.
- [136] D. Yoon, J.-E. Joo, and S. M. Park, 'Mirrored Current-Conveyor Transimpedance Amplifier for Home Monitoring LiDAR Sensors', *IEEE Sensors Journal*, pp. 1–1, 2020, doi: 10.1109/JSEN.2020.3043797.
- [137] C. Callenberg, Z. Shi, F. Heide, and M. B. Hullin, 'Low-cost SPAD sensing for non-line-of-sight tracking, material classification and depth imaging', *ACM Trans. Graph.*, vol. 40, no. 4, p. 61:1-61:12, 2021, doi: 10.1145/3450626.3459824.
- [138] STMicroelectronics, 'VL53L1X: Time-of-Flight ranging sensor based on ST's FlightSense technology', Apr. 2022. https://www.st.com/en/imaging-and-photonics-solutions/vl53l1x.html
- [139] H. Gao, B. Cheng, J. Wang, K. Li, J. Zhao, and D. Li, 'Object Classification Using CNN-Based Fusion of Vision and LIDAR in Autonomous Vehicle Environment', *IEEE Transactions on Industrial Informatics*, vol. 14, no. 9, pp. 4224–4231, Sep. 2018, doi: 10.1109/TII.2018.2822828.

- [140] H. Wang, B. Wang, B. Liu, X. Meng, and G. Yang, 'Pedestrian recognition and tracking using 3D LiDAR for autonomous vehicle', *Robotics and Autonomous Systems*, vol. 88, pp. 71–78, Feb. 2017, doi: 10.1016/j.robot.2016.11.014.
- [141] X. Zhao, P. Sun, Z. Xu, H. Min, and H. Yu, 'Fusion of 3D LIDAR and Camera Data for Object Detection in Autonomous Vehicle Applications', *IEEE Sensors Journal*, vol. 20, no. 9, pp. 4901–4913, May 2020, doi: 10.1109/JSEN.2020.2966034.
- [142] Q. Zou, Q. Sun, L. Chen, B. Nie, and Q. Li, 'A Comparative Analysis of LiDAR SLAM-Based Indoor Navigation for Autonomous Vehicles', *IEEE Transactions on Intelligent Transportation Systems*, pp. 1–15, 2021, doi: 10.1109/TITS.2021.3063477.
- [143] S. Bartknecht *et al.*, 'GANDALF Design of a modular high resolution transient recorder for high energy physics', in 2009 IEEE Nuclear Science Symposium Conference Record (NSS/MIC), Oct. 2009, pp. 2221–2224. doi: 10.1109/NSSMIC.2009.5402076.
- [144] C. Liu, Y.-L. Liu, E. P. Perillo, A. K. Dunn, and H.-C. Yeh, 'Single-Molecule Tracking and Its Application in Biomolecular Binding Detection', *IEEE Journal of Selected Topics in Quantum Electronics*, vol. 22, no. 4, pp. 64–76, Jul. 2016, doi: 10.1109/JSTQE.2016.2568160.
- [145] K. Akiba *et al.*, 'The Timepix Telescope for high performance particle tracking', *Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment*, vol. 723, pp. 47–54, Sep. 2013, doi: 10.1016/j.nima.2013.04.060.
- [146] P. Chen *et al.*, 'High-Precision PLL Delay Matrix With Overclocking and Double Data Rate for Accurate FPGA Time-to-Digital Converters', *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 28, no. 4, pp. 904–913, Apr. 2020, doi: 10.1109/TVLSI.2019.2962606.
- [147] S. Jahromi, J. Jansson, P. Keränen, and J. Kostamovaara, 'A 32 × 128 SPAD-257 TDC Receiver IC for Pulsed TOF Solid-State 3-D Imaging', *IEEE Journal of Solid-State Circuits*, vol. 55, no. 7, pp. 1960–1970, Jul. 2020, doi: 10.1109/JSSC.2020.2970704.
- [148] A. Hejazi et al., 'A Low Power Multichannel Time to Digital Converter Using All Digital Nested Delay Locked Loops with 50 ps Resolution and High Throughput for LiDAR Sensors', *IEEE Transactions on Instrumentation and Measurement*, vol. 69, no. 11, pp. 9262–9271, Nov. 2020, doi: 10.1109/TIM.2020.2995249.
- [149] R. Machado, J. Cabral, and F. S. Alves, 'Recent Developments and Challenges in FPGA-Based Time-to-Digital Converters', *IEEE Transactions on Instrumentation and Measurement*, vol. 68, no. 11, pp. 4205–4221, Nov. 2019, doi: 10.1109/TIM.2019.2938436.
- [150] V. Slobodyanyuk and D. Butterfield, 'Timing synchronization of lidar system to reduce interference', US20170082737A1, Mar. 23, 2017. Accessed: May 06, 2022. [Online]. Available: https://patents.google.com/patent/US20170082737A1/en
- [151] A. R. Ximenes, P. Padmanabhan, M. Lee, Y. Yamashita, D. Yaung, and E. Charbon, 'A Modular, Direct Time-of-Flight Depth Sensor in 45/65-nm 3-D-Stacked CMOS Technology', *IEEE Journal of Solid-State Circuits*, vol. 54, no. 11, pp. 3203–3214, Nov. 2019, doi: 10.1109/JSSC.2019.2938412.

- [152] N. Dutton et al., 'Multiple-event direct to histogram TDC in 65nm FPGA technology', in 2014 10th Conference on Ph.D. Research in Microelectronics and Electronics (PRIME), Jun. 2014, pp. 1–5. doi: 10.1109/PRIME.2014.6872727.
- [153] K. Choi and D. Jee, 'Design and Calibration Techniques for a Multichannel FPGA-Based Time-to-Digital Converter in an Object Positioning System', *IEEE Transactions on Instrumentation and Measurement*, vol. 70, pp. 1–9, 2021, doi: 10.1109/TIM.2020.3011490.
- [154] Becker & Hickl GmbH, 'The bh TCSPC Handbook', Sep. 2019. https://www.becker-hickl.com/wp-content/uploads/2019/09/hb-bh-TCSPC-1.pdf
- [155] 'ID900 Brochure', *ID Quantique*. https://marketing.idquantique.com/acton/attachment/11868/f-023e/1/-/-/-//ID900\_Brochure.pdf (accessed Oct. 13, 2020).
- [156] S. W. Hutchings *et al.*, 'A Reconfigurable 3-D-Stacked SPAD Imager With In-Pixel Histogramming for Flash LIDAR or High-Speed Time-of-Flight Imaging', *IEEE Journal of Solid-State Circuits*, vol. 54, no. 11, pp. 2947–2956, Nov. 2019, doi: 10.1109/JSSC.2019.2939083.
- [157] S. Burri, H. Homulle, C. Bruschini, and E. Charbon, 'LinoSPAD: a time-resolved 256x1 CMOS SPAD line sensor system featuring 64 FPGA-based TDC channels running at up to 8.5 giga-events per second', in *Optical Sensing and Detection IV*, International Society for Optics and Photonics, Apr. 2016, p. 98990D. doi: 10.1117/12.2227564.
- [158] 'GitForWJ/TDC\_tools', *GitHub*. https://github.com/GitForWJ/TDC\_tools (accessed Nov. 02, 2020).
- [159] PicoQuant, 'PicoQuant Photon Counting and Timing'. https://www.picoquant.com/images/uploads/downloads/7304photon\_counting\_brochure.pdf
- [160] S. Henzler, *Time-to-digital converters*. in Springer series in advanced microelectronics, no. 29. Dordrecht: Springer, 2010.
- [161] G. Cao, H. Xia, and N. Dong, 'An 18-ps TDC using timing adjustment and bin realignment methods in a Cyclone-IV FPGA', *Review of Scientific Instruments*, vol. 89, no. 5, p. 054707, May 2018, doi: 10.1063/1.5008610.
- [162] Y.-H. Chen, 'A counting-weighted calibration method for a field-programmable-gate-array-based time-to-digital converter', *Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment*, vol. 854, pp. 61–63, May 2017, doi: 10.1016/j.nima.2017.02.053.
- [163] R. K. Henderson *et al.*, 'A 192x128 Time Correlated SPAD Image Sensor in 40nm CMOS Technology', *IEEE Journal of Solid-State Circuits*, vol. 54, no. 7, pp. 1907–1916, Jul. 2019, doi: 10.1109/JSSC.2019.2905163.
- [164] Z. Sun, D. B. Lindell, O. Solgaard, and G. Wetzstein, 'SPADnet: deep RGB-SPAD sensor fusion assisted by monocular depth estimation', *Opt. Express, OE*, vol. 28, no. 10, pp. 14948–14962, May 2020, doi: 10.1364/OE.392386.
- [165] A. G. Howard *et al.*, 'MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications', *arXiv:1704.04861 [cs]*, Apr. 2017, Accessed: Mar. 05, 2021. [Online]. Available: http://arxiv.org/abs/1704.04861
- [166] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, 'SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size', *arXiv:1602.07360 [cs]*, Nov. 2016, Accessed: Mar. 05, 2021. [Online]. Available: http://arxiv.org/abs/1602.07360

- [167] K. Zang et al., 'Silicon single-photon avalanche diodes with nano-structured light trapping', Nat Commun, vol. 8, no. 1, p. 628, Dec. 2017, doi: 10.1038/s41467-017-00733-y.
- [168] M. Ghioni, A. Gulinatti, I. Rech, P. Maccagnani, and S. Cova, 'Large-area low-jitter silicon single photon avalanche diodes', in *Quantum Sensing and Nanophotonic Devices V*, International Society for Optics and Photonics, Feb. 2008, p. 69001D. doi: 10.1117/12.761578.
- [169] Xilinx, 'XST User Guide for Virtex-6, Spartan-6, and 7 Series Devices', Mar. 20, 2013. https://www.xilinx.com/support/documentation/sw\_manuals/xilinx14\_7/xst\_v6\_s6.pdf
- [170] S. Chan et al., 'Long-range depth imaging using a single-photon detector array and non-local data fusion', Sci Rep, vol. 9, no. 1, p. 8075, Dec. 2019, doi: 10.1038/s41598-019-44316-x.
- [171] Swabian Instruments, 'Time Tagger Series Brochure'. https://www.swabianinstruments.com/static/downloads/TimeTaggerSeries.pdf
- [172] W. Xie, H. Chen, and D. D.-U. Li, 'Efficient time-to-digital converters in 20 nm FPGAs with wave union methods', *IEEE Transactions on Industrial Electronics*, pp. 1–1, 2021, doi: 10.1109/TIE.2021.3053905.
- [173] F. Garzetti, N. Corna, N. Lusardi, and A. Geraci, 'Time-to-Digital Converter IP-Core for FPGA at State of the Art', *IEEE Access*, vol. 9, pp. 85515–85528, 2021, doi: 10.1109/ACCESS.2021.3088448.
- [174] S. Bourdeauducq, 'A 26 ps RMS time-to-digital converter core for Spartan-6 FPGAs', *arXiv:1303.6840 [physics]*, Mar. 2013, Accessed: Aug. 08, 2021. [Online]. Available: http://arxiv.org/abs/1303.6840
- [175] Xilinx, 'Zynq-7000 SoC Data Sheet: Overview (DS190)', 2018. https://www.xilinx.com/support/documentation/data\_sheets/ds190-Zynq-7000-Overview.pdf (accessed Mar. 23, 2021).
- [176] Xilinx, 'Zynq-7000 SoC and 7 Series Devices Memory Interface Solutions v4.2, User Guide (UG586)', 2018. https://www.xilinx.com/support/documentation/ip\_documentation/mig\_7series/v4\_2/ug586\_7Series\_MIS.pdf (accessed Mar. 23, 2021).
- [177] Xilinx, 'Xilinx 7 Series FPGA and Zynq-7000 All Programmable SoC Libraries Guide for HDL Designs (UG768)', 2013. https://www.xilinx.com/support/documentation/ip\_documentation/mig\_7series/v4\_2/ug586\_7Series\_MIS.pdf (accessed Mar. 23, 2021).
- [178] ARM, 'AMBA AXI and ACE Protocol Specification Version E', 2013. https://developer.arm.com/documentation/ihi0022/e?\_ga=2.67820049.1631882347.1556009271-151447318.1544783517 (accessed Jun. 04, 2022).
- [179] Digilent, 'ZedBoard Digilent Reference', Jan. 27, 2014. https://digilent.com/reference/programmable-logic/zedboard/start (accessed Jul. 20, 2022).
- [180] Á. B. de Oliveira et al., 'Evaluating Soft Core RISC-V Processor in SRAM-Based FPGA Under Radiation Effects', *IEEE Transactions on Nuclear Science*, vol. 67, no. 7, pp. 1503–1510, Jul. 2020, doi: 10.1109/TNS.2020.2995729.
- [181] Z. Zang, Y. Liu, and R. C. C. Cheung, 'Reconfigurable RISC-V Secure Processor And SoC Integration', in 2019 IEEE International Conference on Industrial Technology (ICIT), Feb. 2019, pp. 827–832. doi: 10.1109/ICIT.2019.8755206.

- [182] K. Asanovic *et al.*, 'The Rocket Chip Generator'. [Online]. Available: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-17.html
- [183] Y. Liu, R. C. C. Cheung, and H. Wong, 'Lightweight Secure Processor Prototype on FPGA', in 2018 28th International Conference on Field Programmable Logic and Applications (FPL), Aug. 2018, pp. 443–4431. doi: 10.1109/FPL.2018.00081.
- [184] D. Sondej, R. Szymanowski, and R. Szplet, 'Methods of precise determining the transfer function of picosecond time-to-digital converters', *Metrology and Measurement Systems*, vol. 28, no. 3, 2021, doi: 10.24425/MMS.2021.137697.
- [185] J. Kalisz, M. Pawlowski, and R. Pelka, 'Error analysis and design of the Nutt time-interval digitiser with picosecond resolution', *J. Phys. E: Sci. Instrum.*, vol. 20, no. 11, pp. 1330–1341, Nov. 1987, doi: 10.1088/0022-3735/20/11/005.
- [186] P. Kwiatkowski and R. Szplet, 'Multisampling wave union time-to-digital converter', in 2020 6th International Conference on Event-Based Control, Communication, and Signal Processing (EBCCSP), Sep. 2020, pp. 1–5. doi: 10.1109/EBCCSP51266.2020.9291363.
- [187] P. Kwiatkowski, D. Sondej, and R. Szplet, 'Bubble-Proof Algorithm for Wave Union TDCs', *Electronics*, vol. 11, no. 1, Art. no. 1, Jan. 2022, doi: 10.3390/electronics11010030.
- [188] X. Deng and Q. Chen, 'A 4.32-ps precision Time-to-Digital Convertor using multisampling wave union method on a 28-nm FPGA', J. Inst., vol. 16, no. 12, p. P12031, Dec. 2021, doi: 10.1088/1748-0221/16/12/P12031.
- [189] R. Pelka, J. Kalisz, and R. Szplet, 'Nonlinearity correction of the integrated time-to-digital converter with direct coding', *IEEE Transactions on Instrumentation and Measurement*, vol. 46, no. 2, pp. 449–453, Apr. 1997, doi: 10.1109/19.571882.
- [190] F. Sousa, V. Mauer, N. Duarte, R. P. Jasinski, and V. A. Pedroni, 'Taking advantage of LVDS input buffers to implement sigma-delta A/D converters in FPGAs', in 2004 IEEE International Symposium on Circuits and Systems (IEEE Cat. No.04CH37512), May 2004, p. I–1088. doi: 10.1109/ISCAS.2004.1328388.
- [191] J. Wu, S. Hansen, and Z. Shi, 'ADC and TDC implemented using FPGA', in 2007 IEEE Nuclear Science Symposium Conference Record, Oct. 2007, pp. 281–286. doi: 10.1109/NSSMIC.2007.4436331.
- [192] L. Leuenberger, D. Amiet, T. Wei, and P. Zbinden, 'An FPGA-based 7-ENOB 600 MSample/s ADC without any External Components', in *The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, Virtual Event USA: ACM, Feb. 2021, pp. 240–250. doi: 10.1145/3431920.3439287.
- [193] D. Brubaker, 'Next-Generation Zynq UltraScale+ RFSoC', 2019. https://www.xilinx.com/content/dam/xilinx/imgs/press/media-kits/zu+rfsocmulti-gen-press-pitch-deck.pdf
- [194] A. Nishimura *et al.*, 'Observational demonstration of a low-cost fast Fourier transform spectrometer with a delay-line-based ramp-compare ADC implemented on FPGA', *Publications of the Astronomical Society of Japan*, vol. 73, no. 3, pp. 692–700, Jun. 2021, doi: 10.1093/pasj/psab030.

## Appendix

## **Journal publications**

1. <u>W. Xie</u>, Y. Wang, H. Chen, and D. Li\*, "128-channel high-linearity resolution-adjustable time-to-digital converters for LiDAR applications: software predictions and hardware implementations," IEEE Trans. Ind. Electron., vol. 69, no. 4, pp. 4264-4274, April 2022.
2. <u>W. Xie</u>, H. Chen, and D. Li\*, "Efficient time-to-digital converters in 20nm FPGAs with wave-union methods," IEEE Trans. Ind. Electron., vol. 69, no. 1, pp. 1021-1031, Jan. 2022.
3. Y. Wang<sup>1</sup>, <u>W. Xie<sup>1</sup></u>, H. Chen, and D. Li<sup>\*</sup>, "Multichannel time-to-digital converters with automatic calibration in Xilinx Zynq-7000 FPGA devices," IEEE Trans. Ind. Electron., vol. 69, no. 9, pp. 9634-9643, Sept. 2022. <sup>1</sup> Co-first Author
4. Y. Wang, <u>W. Xie</u>, H. Chen, and D. Li\*, "Low hardware consumption, resolution-configurable Gray code oscillator time-to-digital converters implemented in 16nm, 20nm and 28nm FPGAs," IEEE Trans. Ind. Electron., vol. 70, no. 4, pp. 4256-4266, April 2023.
5. Y. Wang, <u>W. Xie</u>, H. Chen, and D. Li\*, "High-resolution time-to-digital converters (TDCs) with a bidirectional encoder," Measurement, vol. 206, 2023.
6. D. Xiao, Z. Zang, <u>W. Xie</u>, N. Sapermsap, Y. Chen, and D. Li\*, "Spatial resolution improved fluorescence lifetime imaging via deep learning," Opt. Express 30, 11479-11494 (2022).
7. D. Xiao, Z. Zang, N. Sapermsap, Q. Wang, <u>W. Xie</u>, Y. Chen, and D. Li\*, "Time-resolved flow cytometry with CMOS single-photon avalanche diode arrays and deep learning processors," Biomed. Opt. Express 12(6), 3450-3462, 2021.
8. Z. Zang, D. Xiao, Q. Wang, Z. Li, <u>W. Xie</u>, Y. Chen, and D. Li\*, "Fast analysis of time-domain fluorescence lifetime imaging via extreme learning machine," Sensors, no. 10, 2022.

## **Conference submission**

 <u>W. Xie</u>, H. Chen, Z. Zang, and D. Li\*, "Multi-channel high-linearity time-to-digital converters in 20 nm and 28 nm FPGAs for LiDAR applications", 2020 6th International Conference on Event-Based Control, Communication, and Signal Processing (EBCCSP), Sept. 2020.