I. INTRODUCTION

DIRECT time-of-flight (TOF) based LIDARs require a time-to-digital converter (TDC) to measure the time that it takes for a light pulse to travel to the target and reflect back to a detector circuit. A complete single-chip CMOS receiver solution might incorporate a SPAD based optical detector and a TDC circuit to measure the time-of-flight. A small SPAD and a miniaturized, low power, but accurate and high precision TDC would make it possible to build a large detector array on a single chip, which can contain hundreds, or even thousands, of independent SPAD+TDC receiver channels.
For this kind of application, the TDC should be small, yet it should still provide a measurement range of hundreds of nanoseconds with an accuracy of a few tens of picoseconds when a distance resolution of about a centimeter is required. The precision of the TDC should be at least comparable to the jitter of the SPAD so that the system precision is not limited by the TDC's jitter. Moreover, in order to avoid the need for post-processing, the large TDC array should be uniform in terms of LSB size, linearity and static offset errors. The TDC performance should be insensitive to process, voltage and temperature (PVT) variations, since compensating for these errors by post-processing might be infeasible when the receiver contains hundreds of TDCs. When static errors are considered, it is especially important that the LSB size is uniform across the TDC array. The size of the LSB defines not only the resolution of the converter, but also its time-to-digital gain. For example, in laser radar applications, if the random mismatch induced LSB deviation between the TDCs is 5 percent, the distance deviation for an object measured at 0.5 meters would be only 2.5cm, but for an object at 10 meters the error would be 50cm, which is unacceptable when cm level accuracy is the aim. Although the gain error can be compensated by post-processing, the compensation might require a complex calibration process that needs to be performed for each TDC, which can be quite infeasible in practice when the circuit contains hundreds of TDCs.
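The numbers in the example above follow from simple proportionality, which the short sketch below reproduces. The function names are illustrative only, and the speed of light in vacuum is used as an approximation for propagation in air.

```python
C_M_PER_S = 299_792_458.0  # speed of light in vacuum (approximation for air)


def tof_to_distance_m(t_round_trip_s):
    """Convert a round-trip time of flight to a one-way distance."""
    return C_M_PER_S * t_round_trip_s / 2.0


def gain_error_distance_m(distance_m, lsb_deviation):
    """Distance error caused by a relative LSB (gain) deviation.

    A gain error scales the measured interval, so the resulting
    distance error grows linearly with the measured distance.
    """
    return distance_m * lsb_deviation


if __name__ == "__main__":
    for d in (0.5, 10.0):
        err_cm = 100.0 * gain_error_distance_m(d, 0.05)
        print(f"{d} m target, 5 % LSB deviation: {err_cm:.1f} cm error")
    # Round-trip time that corresponds to 1 cm of distance
    print(f"1 cm <-> {2 * 0.01 / C_M_PER_S * 1e12:.1f} ps")
```

The last line illustrates why an accuracy of a few tens of picoseconds is needed for cm level distance resolution.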
Although static gain errors can have a large impact on the uniformity when longer distances are measured, static offset errors between the TDC channels are usually less important, since the error is independent of the laser ranging distance and is usually in the range of a few tens of picoseconds.
Commonly, a TDC array is realized as a flash type converter that utilizes a delay line or a ring oscillator. The resolution of such a converter is defined by the propagation delay of a single delay element in the delay line/ring oscillator. In order to make the size of the LSB, and therefore the gain of the converter, well defined and uniform across the whole TDC array, a replica biasing method is often used [1]-[3]. The replica bias block creates a shared bias for the TDCs by using a PLL/DLL to fix the propagation delay of the delay line to be proportional to the reference clock's period. However, since a SPAD/TDC imager circuit with hundreds of measurement channels can be very large, random mismatches, IR drops and process parameters varying across the chip might become an issue that replica biasing cannot solve.
Instead of replica biasing, one option to reduce the LSB deviation of a flash TDC is to distribute all the clock/delay phases from the reference PLL/DLL to all of the flash TDCs in the array, rather than providing a shared bias to the TDCs. This technique has been used, for example, in [4]. However, distributing several, or even tens of, clock phases across the array requires a careful buffering arrangement in order to minimize the timing skew (offset error) between the TDC channels.
In order to improve the resolution and to decrease the power consumption of a TDC, a two-step approach is sometimes used [5] . First, the time interval is quantized with a coarse quantizer and then the residue of the coarse quantizer is quantized with a fine converter. However, in these designs it is important that the ratio between the coarse LSB size and the fine LSB size is a power of two. If the ratio is not a power of two, the results need to be normalized in order to avoid large steps in the INL at the boundaries between the coarse/fine transition. The normalization can require calibration routines that might be impractical to implement in the real application environment.
Although flash type TDCs are common and simple to realize as an array, a cyclic/algorithmic TDC could be one potential converter architecture that fits the time resolution and area requirements. Although the measurement rate of a cyclic converter cannot match the speed of a flash converter, it is not usually an issue since a direct ToF application typically only requires a measurement rate of a few hundreds of kilohertz. The measurement rate is mainly limited by the laser driver circuit and the measurement range of the LIDAR system.
The main challenges in cyclic TDCs are related to the residue amplifier circuit. The time amplifier blocks are typically very nonlinear [6]-[9], which makes the useful linear range very small. Moreover, most of the previously published time amplifiers require some sort of calibration or biasing to lock the gain factor of the amplifier circuit [8], [9]. Some designs have been introduced that do not need complex calibration [10], but their measurement range is quite short for time-of-flight applications. Moreover, in many of these designs, the time amplifier circuit is used to resolve only a few bits [6], [9], [11], and thus the requirements for the residue gain accuracy are quite relaxed.
In this work, a residue time amplifier with a wide linear input range has been designed that does not need any kind of biasing to accurately amplify a time interval by a factor of two. The gain factor of the residue amplifier is insensitive to PVT variations and does not require any calibration either. The complete TDC utilizes the Nutt interpolation principle, where the input time interval is quantized in two steps. First, a coarse quantization is done with a counter clocked by an accurate reference clock. Then the residues of the coarse stage are quantized by fine converters that utilize the residue time amplifier to build a cyclic data converter. A total of 256 TDC channels have been implemented on a single chip. All the TDCs use the same accurate external reference clock for coarse and fine quantization to guarantee that the LSB (i.e. the gain) is identical between the TDCs, and therefore the post-processing needs are minimal for the entire TDC array.
The paper is organized as follows: Section II provides an overview of the TDC array and a top level description of the design. Section III presents the design of a single TDC channel and the time amplifier. Section IV presents the measurement results for the TDC array and compares them to previously published relevant designs. Section V concludes the paper.
II. CHIP ARCHITECTURE
A block diagram of the designed circuit is shown in Fig. 1. 256 TDC channels have been implemented on a single chip. One goal of the design has been to make the TDC channels modular, so that the number of TDCs can be easily increased without any changes to the design other than adding TDC blocks. At the top level, only some of the buffers driving global signals across the chip might need resizing.
The TDC is based on the Nutt interpolation method [12], also referred to as the sliding-scale technique in some sources [4], [13]-[15]. In the Nutt interpolation method, a reference clock counter is used as a coarse quantizer, and two interpolators quantize the time residues of the coarse quantizer. Thus, the range of the interpolator only needs to cover a single reference clock period. Using the three partial time intervals, the time interval between the Start and Stop signals can be expressed as
$$t = N_{clk}\,t_{clk} + t_{start} - t_{stop} \tag{1}$$
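where $N_{clk}$ is the result of the clock counter (coarse measurement) and $t_{start}$ and $t_{stop}$ are the time residues measured by the interpolators (fine measurement). The counter result and the time residues measured by the interpolators can also be written, for an input time interval $t$, as
$$N_{clk} = \left\lceil \frac{t - t_{start}}{t_{clk}} \right\rceil, \qquad t_{stop} = N_{clk}\,t_{clk} - (t - t_{start}) \tag{2}$$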
where $\lceil\cdot\rceil$ is the ceiling function and $t_{start}$ is now a random variable between 0 and $t_{clk}$. One of the key benefits of the Nutt interpolation method is that, when the Start signal is asynchronous with respect to the reference clock, i.e. the arrival times of the Start signal are uncorrelated with the reference clock edges, $t_{start}$ and $t_{stop}$ are also random and uniformly distributed between 0 and $t_{clk}$. Because of this, the interpolators' static quantization and nonlinearity errors are also randomized (i.e. converted to noise), even when the input time interval is constant, which significantly improves the linearity of the complete TDC. The randomization of static errors also allows the Nutt converter to achieve sub-LSB level accuracy by averaging.
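The randomizing effect described above can be illustrated with a small behavioral simulation. The sketch below is not the chip's implementation: it assumes the usual Nutt relation $t = N_{clk} t_{clk} + t_{start} - t_{stop}$, with each residue measured from the signal edge to the next clock edge, and models each interpolator as an ideal truncating quantizer with an LSB of $t_{clk}/2^9$. All function names are hypothetical.

```python
import math
import random

T_CLK_NS = 10.0            # 100 MHz reference clock period
LSB_NS = T_CLK_NS / 2**9   # assumed interpolator LSB (about 19.5 ps)


def nutt_partials(t_ns, t_start_ns):
    """Return (N_clk, t_stop) for interval t and Start residue t_start."""
    n_clk = math.ceil((t_ns - t_start_ns) / T_CLK_NS)
    t_stop_ns = n_clk * T_CLK_NS + t_start_ns - t_ns
    return n_clk, t_stop_ns


def quantize(x_ns):
    """Ideal truncating quantizer standing in for an interpolator."""
    return math.floor(x_ns / LSB_NS) * LSB_NS


def measure_once(t_ns, rng):
    """One single-shot measurement with an asynchronous Start signal."""
    t_start_ns = rng.uniform(0.0, T_CLK_NS)   # random clock phase
    n_clk, t_stop_ns = nutt_partials(t_ns, t_start_ns)
    return n_clk * T_CLK_NS + quantize(t_start_ns) - quantize(t_stop_ns)


def measure_averaged(t_ns, shots, seed=0):
    """Average many single-shot results of the same interval."""
    rng = random.Random(seed)
    return sum(measure_once(t_ns, rng) for _ in range(shots)) / shots


if __name__ == "__main__":
    t = 123.456
    rng = random.Random(1)
    print("single shot error (ps):", 1e3 * (measure_once(t, rng) - t))
    print("averaged error (ps):   ", 1e3 * (measure_averaged(t, 20000) - t))
```

In this model a single shot is off by up to one LSB, while the average converges towards the true interval because the quantization error changes with the random clock phase.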
As shown in Fig. 1 , this work uses one common Start interpolator together with 256 Stop interpolators and coarse counters. This arrangement allows us to measure the arrival times of 256 Stop signals with respect to one common Start signal. For each Stop channel, a multiplexer is used to select the source for the Stop signal. The Stop signal can be either a common off-chip signal(for testing purposes) routed to every Stop channel, or a separate input coming from an array of SPAD detectors.
To save power, the global reference clock routed to every TDC channel can be gated. The gating logic activates the global reference clock only for the short duration when the measurement is active. When reference clock is disabled, the power consumption of the chip consists of only leakage currents and the dynamic power consumption of the clocking path before the clock gating block. A clock multiplexer is also used to select between the reference clock and the IO read/write clock in order for the read-out to work independently from the reference clock. Although this arrangement does not allow for parallel readout and data conversion, it minimizes the switching noise impacting the TDCs since the output buffers don't need to drive large off-chip loads during conversion. Moreover, no dedicated readout registers are needed, thus some amount of area is saved.
All timing sensitive global signals common to all TDC channels, such as the reference clock, have been carefully routed and buffered to avoid timing skew(static offset error) issues and to achieve good uniformity in terms of timing between all the channels.
Each Start/Stop channel contains one interpolator, a coarse clock counter and read-out logic. The interpolator has an effective resolution of 9 bits, but a 10-bit output word is used in order to allow room for static timing offsets that can occur due to synchronization delays or routing/gate delays. Since the Start and Stop channel interpolators are identical, the nominal static offset due to synchronization logic and gate delays is equal in both interpolators and any such offset is removed when the final output word is calculated according to (1). A small offset error might remain due to random mismatches and process gradients.
The coarse counters are 6 bits wide and the LSB of the coarse counter overlaps with the MSB of the 10-bit interpolator output, which gives a final output word length of 15 bits. The frequency of the off-chip reference clock is 100MHz and thus the TDC's dynamic range is about 640ns with a nominal resolution of about 20ps. An off-chip low-noise crystal oscillator is used in order to guarantee that the reference time base is insensitive to on-chip PVT variations and IR drops, which might affect linearity if the reference time base were synthesized on-chip with a PLL, for example.
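The word length, range and resolution quoted above follow directly from the counter widths and the reference clock frequency. The short calculation below reproduces them; the constant names are illustrative only.

```python
F_REF_HZ = 100e6          # off-chip reference clock
COARSE_BITS = 6           # coarse counter width
FINE_WORD_BITS = 10       # interpolator output word
FINE_EFFECTIVE_BITS = 9   # effective interpolator resolution

t_clk_s = 1.0 / F_REF_HZ

# The coarse LSB overlaps the interpolator MSB, so one bit is shared.
word_bits = COARSE_BITS + FINE_WORD_BITS - 1

# The coarse counter spans 2^6 reference clock periods.
range_s = (2 ** COARSE_BITS) * t_clk_s

# One reference clock period is resolved into 2^9 steps.
lsb_s = t_clk_s / (2 ** FINE_EFFECTIVE_BITS)

if __name__ == "__main__":
    print(f"output word: {word_bits} bits")
    print(f"range:       {range_s * 1e9:.0f} ns")
    print(f"LSB:         {lsb_s * 1e12:.2f} ps")
```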
III. TDC CHANNEL
A. Architecture
Fig. 3 shows the block diagram of a single Stop channel. Each Stop channel consists of an interpolator (fine measurement) and a coarse measurement unit along with read-out logic. The Start channel is identical to the Stop channel, but the coarse counter has been disabled since it is not needed. Also, the Start channel's synchronized Start signal is buffered and routed globally to each Stop channel in order to start the coarse counters in every Stop channel, as shown in Fig. 1.
The interpolator in every Start/Stop channel is a cyclic data converter. The interpolator core consists of synchronization logic for residue generation, a residue time amplifier circuit with a gain of 2 and a clock counter for quantizing the amplified time residues. The interpolator counter uses the same 100MHz reference clock for quantization that is used by the coarse time interval counter. The linear range of the time amplifier is wide, and it is designed to work with input time intervals ranging from 5ns to 15ns. Because both the fine and the coarse converters use the same reference clock for quantization, the raw TDC results don't need post-processing or normalization to account for gain mismatch between the coarse result and the interpolator result. Thus, one bit shift, one addition and one subtraction are the only operations needed to calculate (1). The calculation of the final result is done off-chip, so that the raw interpolator data can also be collected and analyzed.
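The off-chip calculation mentioned above can be sketched as follows. This is a minimal illustration, assuming that the raw words are available as plain integers, that the interpolator words are expressed in units of $t_{clk}/2^9$ and that the Start residue is added and the Stop residue subtracted. The function name and argument names are hypothetical.

```python
FINE_EFFECTIVE_BITS = 9          # one clock period equals 2^9 fine LSBs
LSB_PS = 10_000 / 2**FINE_EFFECTIVE_BITS  # 100 MHz clock, about 19.53 ps


def combine(n_clk, fine_start, fine_stop):
    """Combine raw words into a result in LSBs: shift, add, subtract."""
    return (n_clk << FINE_EFFECTIVE_BITS) + fine_start - fine_stop


if __name__ == "__main__":
    code = combine(n_clk=3, fine_start=700, fine_stop=300)
    print(code, "LSB =", code * LSB_PS, "ps")
    # An offset common to both interpolators cancels in the subtraction.
    print(combine(3, 700 + 25, 300 + 25) == code)
```

Because the same offset appears in both interpolator words, it drops out of the result, which is why the extra range in the 10-bit word does not need to be calibrated.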
The result of each TDC channel is stored in a register bank, which is configured to work as a counter for the fine/coarse quantizer during a measurement. During read-out, the register bank is reconfigured to operate as a shift-register. In read-out mode, the shift-registers of consecutive TDC channels are chained together and the data is read out by shifting it from one TDC channel to the next. The shift-register output of the last TDC is connected to an 8-bit IO-bus. The counter and shift-register arrangement is shown in Fig. 3.
Fig. 4 shows the timing diagram of a single Stop channel. When the Stop signal arrives at the Stop channel, the synchronizer's reset is released. The clock synchronized Stop signal then stops the coarse counter and starts the cyclic interpolation process. The synchronizer generates a time residue, which is the time difference between the synchronizer's input and output signals. This residue is then amplified by a factor of 2 with the pulse width doubling circuit shown in Fig. 3 below the synchronizer. The amplified time residue is then quantized by a counter and a new time residue is formed by the synchronizer. This process repeats until the required number of bits has been resolved.
B. Operating Principle
After n cycles, the estimate for the fine time interval measured by the interpolator can be written as
where $\lceil\cdot\rceil$ is the ceiling function, $FCTR(i)$ is the number of clock cycles that were either added to or subtracted from FCTR during the $i$th residue amplification cycle, and $t_r(i)$ is the time residue of the $i$th quantization cycle (note that $t_r(1) = t_{start/stop}$). The quantization error after $n$ cycles is
, which is distributed between $\frac{1}{2}\frac{t_{clk}}{2^n}$ and $\frac{3}{2}\frac{t_{clk}}{2^n}$. In practice, the binary output word is formed by increasing or decreasing the fine counter (FCTR) value every clock cycle, as shown in the timing diagram in Fig. 4. After each complete residue amplification cycle, the counter value is multiplied by 2, i.e. shifted left by one bit.
The effective resolution of the interpolator is 9 bits, but 10 bits of range is used to account for static offset errors. With 10 bits of dynamic range, the interpolator can measure a range up to $2t_{clk}$. The offset errors can occur due to routing and gate delays, but the synchronizer also introduces an offset of $t_{clk}/2$. This additional offset occurs because the synchronizer needs to be built out of several chained latches in order to avoid metastability induced delay variations that might occur if the time difference between the clock edge and the Stop signal is close to zero. Thus, the actual time residues are distributed between $\frac{1}{2}t_{clk}$ and $\frac{3}{2}t_{clk}$. This means that any additional timing offset, for example due to PVT variations, can be tolerated given that the timing offset is in the range of $\pm t_{clk}/2$. The results of the Start and Stop interpolators are subtracted according to (1) and thus the offsets cancel out. Only small mismatch related errors might remain.
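The cyclic conversion described in this section can be illustrated with an idealized behavioral model. The sketch below is an assumption-laden reading of the text, not the synthesized control logic: the synchronizer is modeled as firing on the first clock edge that is at least $t_{clk}/2$ after its input edge, the time amplifier has an exact gain of 2, and the up/down counter alternates between adding and subtracting in consecutive cycles. Times are integers in picoseconds and all names are hypothetical.

```python
T_CLK_PS = 10_000   # 100 MHz reference clock
N_CYCLES = 9        # number of residue amplification cycles


def sync_count(pulse_ps, t_clk=T_CLK_PS):
    """Clock cycles until the first edge at least t_clk/2 after the pulse ends.

    The pulse starts on a clock edge, so this is a ceiling division.
    """
    return -(-(pulse_ps + t_clk // 2) // t_clk)


def cyclic_convert(t_r1_ps, n=N_CYCLES, t_clk=T_CLK_PS):
    """Return (fctr, final_residue_ps) for an initial residue t_r1_ps."""
    fctr = 0
    residue = t_r1_ps
    sign = 1
    for _ in range(n):
        amplified = 2 * residue              # ideal gain of two
        k = sync_count(amplified, t_clk)     # clock cycles counted
        fctr = 2 * fctr + sign * k           # shift left, then count up/down
        residue = k * t_clk - amplified      # new residue from synchronizer
        sign = -sign                         # next residue has opposite sign
    return fctr, residue


def estimate_ps(fctr, n=N_CYCLES, t_clk=T_CLK_PS):
    """Convert the counter value to a time estimate."""
    return fctr * t_clk / 2**n


if __name__ == "__main__":
    for t_r1 in (5_000, 7_777, 10_000, 14_999):
        fctr, res = cyclic_convert(t_r1)
        err = estimate_ps(fctr) - t_r1
        print(f"t_r(1)={t_r1} ps  FCTR={fctr}  error={err:.2f} ps")
```

In this model every residue stays between $t_{clk}/2$ and $3t_{clk}/2$, the counter value fits in 10 bits, and after nine cycles the error equals the last residue divided by $2^9$, i.e. it lies between about 9.8ps and 29.3ps, in line with the bounds given above.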
The time amplifier is based on two pulse width doubling circuits. The pulse width doubling circuit takes a CMOS logic level signal as an input, and doubles the pulse width of the input signal. An example of a delay line based pulse width doubler is shown in Fig. 5 . The number of delay stages depends on the required linear range and the delay of a single stage. The pulse width gain is always 2 and no calibration or bias tuning is required to stabilize the gain against PVT variations.
The pulse width doubling delay line differs from a normal delay line in that the direction of propagation can be controlled. This functionality can be used to double the pulse width of the IN signal. When IN goes high, the delay line allows a signal to propagate from left to right. When IN goes low, the direction of propagation is reversed and the signal returns to its starting position. The return naturally takes the same amount of time that the signal first spent propagating from left to right, and thus the total time that the signal spends propagating along the delay line is two times the pulse width of the IN signal. The output signal is a logic level signal with a pulse width of $2t_{in}$. Although the output pulse width might be associated with static timing offsets, these are again cancelled in the final TDC result when the results of the Start interpolator and the Stop interpolator are subtracted.
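The bidirectional propagation described above can be mimicked with a very simple discrete-time model. In the sketch below, one sample corresponds to one stage delay, forward and backward stage delays are assumed to be identical, and the wavefront is assumed to simply stop at the end of the line when the input pulse is too long. These are modeling assumptions, and the function names are hypothetical.

```python
def make_pulse(width, lead=2, tail=None):
    """Build a sampled input pulse: `lead` zeros, `width` ones, then zeros."""
    if tail is None:
        tail = width + 2
    return [0] * lead + [1] * width + [0] * tail


def pulse_width_doubler(in_wave, n_stages):
    """Idealized pulse width doubler based on a reversible delay line."""
    pos = 0      # wavefront position along the delay line, in stages
    out = []
    for level in in_wave:
        if level:
            # IN high: wavefront moves left to right (limited by line length)
            pos = min(pos + 1, n_stages)
            out.append(1)
        elif pos > 0:
            # IN low: wavefront travels back towards the start
            pos -= 1
            out.append(1)
        else:
            # Wavefront is back at the start: output pulse has ended
            out.append(0)
    return out


if __name__ == "__main__":
    for width in (5, 12, 43, 50):
        out = pulse_width_doubler(make_pulse(width), n_stages=43)
        print(f"input width {width:2d} -> output width {sum(out)}")
```

Inside the linear range the output width is exactly twice the input width, independent of the stage delay itself, while an input longer than the line is no longer doubled. This is the reason the line length has to be sized for the full scale input.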
As shown in Fig. 5, the leading edge of the pulse width doubler's output is synchronous with the leading edge of the input pulse. In order to quantize the amplified time residue with a clock counter, the leading edge of the output pulse should be synchronous with the reference clock. Because of this, two pulse width doublers are used, as shown in Fig. 3. The timing diagram in Fig. 6 illustrates how two doubling circuits are used to generate an amplified residue with a clock synchronous leading edge. The first pulse width doubler (A) produces an amplified residue (TA_OUT) that is aligned with the input pulse. The second pulse width doubler (B) then uses the output of the synchronizer and the output of the first pulse width doubler (A) to produce another amplified residue (TB_OUT) whose leading edge is synchronous with the reference clock. Thus, the falling edge of TB_OUT occurs $2t_{in}$ after the rising edge of the reference clock. This falling edge is then synchronized again and a clock counter quantizes the pulse width of TB_OUT by counting the clock cycles between consecutive synchronized outputs. The counting functionality is realized by passing the clock synchronous output of the synchronizer to the digital circuit block shown in Fig. 3, where a control block generates all the required synchronous control signals for the coarse counter and the fine up/down counter. The control block also tracks the number of bits resolved and ends the conversion when nine cycles have completed. The control block and all the counters have been synthesized from Verilog code and automatically placed and routed.
The length of the pulse width doubling delay line should be sized according to the full scale input. In this case, the full scale input is entirely defined by the frequency of the reference clock, which in this work is 100MHz. The length of the delay line was sized rather conservatively with a corner simulation by ensuring that the length is adequate even in the case of fast PMOS/NMOS at a temperature of −50°C and with a supply voltage 10% above the nominal VDD. As a result, the delay line consists of 43 inverter stages, which ensures that the time amplifier functions correctly in a wide range of operating conditions. The whole chip was simulated to work over a temperature range of −50°C to 150°C and a supply voltage range of ±10% from the nominal value. Fig. 7 and Fig. 8 show the simulated behavior of the pulse width doubler (PWD) versus temperature and supply voltage. The upper plot shows the error between the PWD output and the expected output, i.e. $2 \times T_{in}$. Note, however, that the offset has been removed from these plots for clarity. The offset, again, if common to both Start and Stop interpolators, does not affect the TDC result. A nonlinear behavior can be seen when $T_{in}$ is shorter than about 2ns. In this design, this does not have an effect on the results, since the synchronizer always adds an offset of 5ns to the time interval that is fed to the PWD.
Although the mean gain factor is always 2 regardless of temperature and supply voltage, a periodic error is present. The period of this error corresponds to the delay of two consecutive inverters, i.e. the sum of the low-to-high and high-to-low transitions. This error can be reduced by matching the drive strengths of the NMOS and PMOS; however, based on parametric sweeps, the error cannot be completely removed by matching. This is probably caused by the nonlinear behavior of the low-to-high and high-to-low transitions, which cannot be completely matched by simply adjusting the drive strengths of the NMOS/PMOS pair. The periodic error can also be reduced if the propagation delay of the inverter is made as small as possible. However, if the delay is minimized, then more inverters are needed to cover the required linear range.
Fig. 9 shows the Monte Carlo simulated PWD behavior against process variations. Again, neglecting the periodic errors, the mean gain factor is 2.
Although PVT variations don't affect the mean gain factor of the PWD, random mismatches will induce some amount of gain error. Fig. 10 shows the Monte Carlo simulated results for the PWD error when random mismatches are considered. The offset has been removed from the plots and the error is centered around 10ns, since the input time intervals for the PWD are distributed between 5ns and 15ns. A linear slope is present in some of the Monte Carlo results, which indicates gain error. However, the RMS error is quite comparable to the RMS errors induced by process variations and the most dominant error source is still the periodic error.
Single-shot precision of the 256 TDC channels over a single reference clock period (10ns).
IV. MEASUREMENT RESULTS
A photomicrograph of the TDC channels is shown in Fig. 11. The dimensions of a single channel are 41.6µm × 725µm. Thus, the total area taken by the TDCs is about 7.71mm², i.e. 0.03mm² per TDC.
The measured INL of all the 256 TDC channels is shown in Fig. 12. The data were collected by using an asynchronous Start signal and a fixed frequency Stop signal with a period of about 620ns. This way the time intervals measured by the TDCs are uniformly distributed between 0ns and 620ns. The worst case peak-to-peak INL is about 45ps. The INL is worst for short time intervals of less than 40ns, after which the peak-to-peak INL settles to about 25ps. The common Start signal activates the coarse counter in every TDC channel simultaneously, which in turn causes a large current transient in the power delivery network. This turn-on transient might have an effect on the delays in the Stop signal's path, which probably explains why the INL is slightly worse for short time intervals. Although the INL of an individual interpolator can be in the range of ±100ps, the INL of the complete TDC is significantly better due to the asynchronous Nutt interpolation method, as seen in Fig. 12. The interpolators' static nonlinearity errors are converted to dynamic noise-like errors, which affects the single-shot precision of the TDC as explained in Section II. Because of this, the single-shot precision can vary within a reference clock period, as shown in Fig. 14, which depicts the measured single-shot precision over one reference clock cycle for all the TDC channels. The single-shot precision was measured with an SRS DG645 delay generator using external asynchronous triggering. The delay was incremented with a step of about 100ps to cover a full reference clock cycle, and for each step the standard deviation of the results was recorded.
Since the jitter of the delay generator itself is less than 25ps, the single-shot precision is dominated by the interpolators' INL error, which also causes the single-shot precision to vary within a clock period. The RMS single-shot precision is about 72ps.
It is possible to compensate for the interpolators' INL errors by using a look-up table containing the measured INL. In order to illustrate how much the interpolators' INL affects the single-shot precision, Fig. 15 shows the single-shot precision for the TDC channels when the INL correction is used. After the look-up table correction, the single-shot precision is around 20ps and the periodicity due to the INL has disappeared.
Fig. 16 shows the static timing skew (offset) of each channel with respect to the average result of all channels. The sigma value of the skew is about 26ps. Fig. 17 shows the extracted LSB deviation from the mean LSB across the whole TDC array. Since all the channels use the same reference clock for quantization, the LSB error is less than $\pm 3 \times 10^{-5}$%, which is so small that the uniformity error at full scale input (640ns) is less than the nominal LSB of 20ps.
Fig. 18 and Fig. 19 show the measured supply and temperature sensitivity of the TDC channels. The rather high sensitivity is caused by a large tapered buffer in the Stop signal path, which drives a common Stop signal across the whole chip to every Stop channel. In the Start signal path no such buffer exists, which causes the Start-Stop delay to vary with temperature and supply voltage. However, in the real laser radar application, the Stop signal comes from the SPAD array and the tapered buffer is not used, which should improve the temperature and supply sensitivity significantly. The simulated nominal supply and temperature sensitivities due to the tapered buffer arrangement are around −250ps/V and 2.42ps/°C, respectively, which match quite well with the measured average sensitivity results.
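The text does not state how the INL look-up table was built. One common approach, shown here purely as an assumption and not as the method used for the measurements, is a code density test: with asynchronous inputs every interpolator code should occur equally often, so the relative hit count of a code estimates its actual bin width. The sketch below derives DNL, INL and a table of corrected bin centers from such a histogram. All names are hypothetical.

```python
from itertools import accumulate


def dnl_inl_from_histogram(counts):
    """Estimate DNL and INL (in LSB) from a code density histogram."""
    ideal = sum(counts) / len(counts)         # expected hits per code
    dnl = [c / ideal - 1.0 for c in counts]   # bin width error per code
    inl = list(accumulate(dnl))               # error at upper edge of each code
    return dnl, inl


def build_correction_lut(counts):
    """Return the corrected bin center (in LSB) for every raw code."""
    n = len(counts)
    total = sum(counts)
    widths = [c * n / total for c in counts]      # actual bin widths in LSB
    edges = [0.0] + list(accumulate(widths))      # actual bin edges in LSB
    return [(edges[i] + edges[i + 1]) / 2.0 for i in range(n)]


if __name__ == "__main__":
    # Synthetic 4-code example with alternating narrow and wide bins
    counts = [100, 300, 100, 300]
    dnl, inl = dnl_inl_from_histogram(counts)
    lut = build_correction_lut(counts)
    print("DNL:", dnl)
    print("INL:", inl)
    print("LUT:", lut)   # replace raw code i by lut[i] before combining
```

Applying such a table to the raw interpolator words before the final result is calculated removes the static part of the interpolator error, which is the effect illustrated in Fig. 15.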
The measured power consumption of all the TDC channels combined (including I/O power consumption) versus measurement rate is shown in Fig. 20. The quiescent power consumption for the whole chip is about 1mW. When active, the power consumption is linearly proportional to the measurement rate, at about 12nW per Hz per TDC. The conversion rate is limited to about 170kHz, mainly by the 8-bit read-out bus, which operates at 100MHz and thus cannot handle data rates higher than about 800Mbit/s. The shift-register arrangement for readout consumes most of the power. Full-chip post-layout simulations indicate that the data conversion phase with a full scale input consumes about 10% of the total power, while the remaining 90% is consumed in the readout phase. In the readout phase, all data is gradually shifted out of the chip in 2 × 257 clock cycles. In the measurement phase, the full scale input is 64 clock cycles long, thus the registers are updated at most 64 times and therefore the rough power ratio between the measurement and readout phases is about 64/(2 × 257) ≈ 0.12. Some other readout scheme could improve the total power consumption of the chip significantly.
Table I summarizes the key performance parameters of a single TDC channel and provides a comparison against recently published TDC designs utilized in SPAD+TDC receiver/sensor chips. It should be noted that, since CMOS TDC circuits are digital-like in behavior and are usually built out of digital logic gates, power and area are highly influenced by the technology used. Moreover, it is not clear in some publications whether the reported power and conversion rate values are for the core design only or for the whole chip with I/O related constraints taken into account as well.
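The rate and power figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below assumes that conversion and readout are strictly sequential (no parallel readout), that one measurement takes a full scale conversion of 64 clock cycles followed by 2 × 257 readout cycles, and that no other overhead exists. The constant names are illustrative only.

```python
F_CLK_HZ = 100e6             # reference and readout clock
BUS_BITS = 8                 # width of the read-out bus
READOUT_CYCLES = 2 * 257     # clock cycles needed to shift out all data
CONVERSION_CYCLES = 64       # full scale input, in clock cycles
POWER_PER_HZ_PER_TDC_W = 12e-9

bus_rate_bps = BUS_BITS * F_CLK_HZ

# Sequential conversion and readout set the shortest measurement period.
period_s = (CONVERSION_CYCLES + READOUT_CYCLES) / F_CLK_HZ
max_rate_hz = 1.0 / period_s

# Rough ratio of register activity between the two phases
activity_ratio = CONVERSION_CYCLES / READOUT_CYCLES


def power_per_tdc_w(rate_hz):
    """Active power of one TDC channel at a given measurement rate."""
    return POWER_PER_HZ_PER_TDC_W * rate_hz


if __name__ == "__main__":
    print(f"bus data rate:      {bus_rate_bps / 1e6:.0f} Mbit/s")
    print(f"max. rate:          {max_rate_hz / 1e3:.0f} kHz")
    print(f"activity ratio:     {activity_ratio:.2f}")
    print(f"power at 170 kHz:   {power_per_tdc_w(170e3) * 1e3:.2f} mW per TDC")
```

Under these assumptions the maximum rate comes out at roughly 173kHz, which is consistent with the limit of about 170kHz stated above.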
V. CONCLUSION
A long range TDC has been designed for a SPAD based time-of-flight laser radar application in 0.35µm CMOS technology. The TDC uses a cyclic converter as an interpolator, in which a pulse width doubling circuit is used to amplify the quantization residue by a factor of 2. A total of 256 TDCs have been implemented on a single chip, and all of the TDCs use the same reference clock for quantization in order to avoid LSB size mismatch across the array. The TDC does not need any calibration or replica biasing for setting the amplification factor or LSB size, which is an advantage when a large array of converters needs to be designed on a single chip with strict requirements for uniformity across the whole array in terms of LSB size, linearity and static errors.
A single channel has an area of 0.03mm², a range of 640ns and a nominal LSB size of about 20ps with a reference clock of 100MHz. The power consumption per channel is about 12nW/Hz. Although the interpolators suffer from nonlinearity due to delay line transition mismatches, the asynchronous Nutt interpolation method converts these errors to noise-like dynamic errors, which improves the worst-case nonlinearity of the complete TDC to around 45ps.
