Abstract: Time-to-digital converters (TDCs) are widely used to measure time intervals in scientific, industrial, and portable electronics applications. To shorten the development cycle of such digital systems, a TDC should be all-digital, flexible, portable, and scalable across processes. To this end, an all-digital cyclic TDC is proposed and described in synthesizable Verilog HDL so that it can be delivered as an IP block. The proposed TDC uses a cyclic-delaying clock technique to reduce the error caused by flip-flop metastability and to simplify the control circuit, and a compensation scheme to alleviate the need for continuous calibration. To confirm its functionality and performance, the proposed TDC has been simulated behaviorally using timing parameters obtained from HSPICE simulation. The achieved LSB equals the fastest loaded buffer delay, 13.87 ps in a 32-nm PTM, and the proposed TDC exhibits superior periodicity, high linearity, and a scalable dynamic range.
1 Introduction
TDCs are widely used to measure time intervals in scientific, industrial, and portable electronics applications. Because modern process scaling offers several benefits to digital circuits, digital conversion approaches have recently become dominant, yielding simple, area-efficient, and low-power solutions. An all-digital TDC can be used as a digital IP block to shorten the development cycle of digital systems.
Many digital TDC approaches [1, 2, 3, 4, 5, 6] have been developed over the past decades. Among these, the delay-chain-based configuration is of particular interest. The simplest delay-chain TDC is composed of a chain of digital delay cells, as shown in Fig. 1.
Despite its simplicity and ease of implementation, this simplest TDC has several drawbacks. First, the time resolution can be affected significantly by process, voltage, and temperature (PVT) variations. Second, its dynamic range is difficult to extend, since a very long delay chain is normally required. Third, the nonlinearity arising from the long delay chain limits the precision of the time resolution.
To alleviate the continuous calibration problem, relax the dynamic range limitation, and achieve greater linearity, the signal path should be kept as short as possible. To this end, the number of stages in the delay chain can be reduced by folding the delay chain into a loop [7], as depicted in Fig. 2. The START signal circulates in the loop until the measurement is finished, and a loop counter records the number of times it has circulated. On the arrival of the STOP signal, the delay chain and the loop counter are frozen and sampled, and the time interval of interest can then be determined. Nevertheless, several problems must be solved for this looped TDC. First, a technique is required for detecting the STOP signal so that the loop counter can be stopped reliably. Second, setup-time violations of the sampling flip-flops used to detect the STOP signal are unavoidable because the START and STOP signals are asynchronous in nature. In this paper, these design-dependent problems will be addressed. Third, the physical layout of a looped TDC is critical and must be designed with care, since the feedback path introduces layout asymmetries and a nonlinearity error between two loop cycles.
To eliminate the nonlinearity of the above-mentioned TDC, another architecture, referred to as a linearly extended TDC, was proposed in [8], as shown in Fig. 3. The main TDC forms a loop structure as described previously. The extender TDC provides the measurement value when the measurement is stopped while the timing event is passing through the feedback path, thereby characterizing the delay asymmetry of the feedback loop. Thus, carefully designed layout techniques are no longer required.
To prevent the TDC from entering an erroneous state, the pulse-width of the START signal must not be less than the total pulse shrinking caused by the buffers of the two delay chains in the main TDC and the extender TDC. It is also important to use a proper number of stages for the delay chain; a bad choice may lead to erroneous readings. In addition, the inevitable pulse shrinking or stretching due to PVT variations or other factors may produce a variety of output bit patterns in the delay-chain readings. Hence, the decoder of this type of TDC must be designed with care so that it outputs a reasonable value, which often requires a complex decoding scheme. Moreover, measuring an arbitrary interval requires even more complicated control logic because of both the pulse shrinking or stretching effect and the input profiles.
In the following section, we propose a scalable, cyclic delay-chain TDC with a simpler control circuit that tackles the above-mentioned problems. This paper demonstrates how the proposed TDC achieves its key design goals: simplified circuits, easily scalable configurations, good linearity, a theoretically unlimited dynamic range, and reduced implementation area.
2 The proposed TDC architecture
2.1 Top-level architecture of the proposed TDC
The major components of the top-level architecture of the proposed TDC are shown in Fig. 4. The key idea of the proposed TDC is to leverage the cyclic-delaying clock technique to simplify the control circuit and the compensation technique used to calibrate the feedback delay in between two loop cycles. The salient features of this arrangement are that the inherent error caused by the metastability of flip-flops can be alleviated, the constraint of the pulse-width of the START signal can be relaxed and is no longer a critical issue, and a simple digital decoding scheme with a very simple control logic circuit can be used. Next, we will detail these issues.
Pulse-shaping block
The pulse-shaping block not only schedules the flow of the START signal to the main block but also generates a control signal, loop, used to count the number of recycling times, CNT. A reasonable criterion for determining the number of stages, M, of the compensation block is based on the observation that the propagation delay of the feedback path must be smaller than the overall propagation delay of the compensation block, i.e.,
$T_{LB} < M \times \tau_{buf}$ (1)
where $\tau_{buf}$ and $T_{LB}$ are the propagation delays of a buffer in the delay chain and of the pulse-shaping block, respectively, ignoring the possible wire delay.
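As a rough illustration of this criterion, the following Python sketch computes the smallest number of compensation stages M whose total delay exceeds the feedback-path delay. The function name and its parameters are hypothetical and wire delay is ignored, as in the text. The example values (a 117.44 ps feedback delay and a 13.87 ps buffer delay) are the 32-nm figures quoted later in the paper.

```python
import math


def min_compensation_stages(t_lb_ps: float, tau_buf_ps: float) -> int:
    """Smallest integer M such that M * tau_buf > T_LB (strict inequality).

    t_lb_ps    -- propagation delay of the feedback path (pulse-shaping block), in ps
    tau_buf_ps -- propagation delay of one delay-chain buffer, in ps
    """
    if t_lb_ps < 0 or tau_buf_ps <= 0:
        raise ValueError("delays must be non-negative and tau_buf must be positive")
    # floor(x) + 1 is the smallest integer strictly greater than x.
    return math.floor(t_lb_ps / tau_buf_ps) + 1


if __name__ == "__main__":
    # 32-nm values reported in the simulation section of the paper.
    print(min_compensation_stages(117.44, 13.87))
```

Under these assumptions the bound is satisfied by the value M = 12 chosen in the simulation section, which leaves some margin above the minimum.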
Main block
The number of delay buffers in the main block, N, is scalable for the specified application under varying operating conditions. The main counter records the number of times (CNT) that the START signal loops through the delay chain of the main block before the arrival of the STOP signal, and determines the dynamic range of the proposed TDC. Suppose that, when the measurement is stopped, a timing event is detected in the main block and the START event has passed through the delay chain CNT times, as shown in Fig. 6. Then, the time interval T to be measured can be expressed as $T = T_{loop} + T_{residue}$, where $T_{loop}$ is CNT times the propagation delay of the delay chain of the main block together with the propagation delay of the pulse-shaping block, and $T_{residue}$ is the residue interval measured by the delay chain of the main block. Let $T_{buffers(main)}$ be the sum of the propagation delays of the N buffers in the main block, i.e., $T_{buffers(main)} = (N - 1) \times \tau_{buf} + \tau_{buf(N-1)}$, where $\tau_{buf(N-1)}$ is the propagation delay of the $(N-1)$th buffer and $\tau_{buf(N-1)} > \tau_{buf}$ because of its larger load. Then the propagation delay $T_{loop}$ is
$T_{loop} = CNT \times (T_{buffers(main)} + T_{LB})$ (2)
The residue interval $T_{residue}$ can be bounded as follows:
This residue interval is now quantized with the buffer delay-chain concept. The position of this quantized residue interval is denoted by n. Therefore, the quantized time interval of measurement, $T_m$, is equal to
where $T_{LB(0th)}$ is the propagation delay of the pulse-shaping block from the START signal to the START_m signal at the beginning.
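The main-block relations can be sketched numerically as follows. This is only an illustration under stated assumptions: the loop delay is taken as CNT times the sum of the main-chain delay and the pulse-shaping delay, as described in the text, and the quantized result is assumed to have the form initial pulse-shaping delay plus loop delay plus n buffer delays. That closing form is an assumption, since the corresponding expression is not reproduced here. All function names are hypothetical and the example numbers are arbitrary round values, not simulation data.

```python
def t_buffers_main(n_stages: int, tau_buf: float, tau_last: float) -> float:
    """Total delay of the N-buffer main chain: (N - 1) regular buffers plus
    the more heavily loaded last buffer."""
    return (n_stages - 1) * tau_buf + tau_last


def t_loop(cnt: int, t_buffers: float, t_lb: float) -> float:
    """Delay accumulated over CNT complete loop cycles (main chain plus feedback)."""
    return cnt * (t_buffers + t_lb)


def t_measured_main(cnt: int, n: int, n_stages: int, tau_buf: float,
                    tau_last: float, t_lb: float, t_lb0: float) -> float:
    """ASSUMED form of the quantized interval when the event is in the main block:
    T_m = T_LB(0th) + T_loop + n * tau_buf."""
    chain = t_buffers_main(n_stages, tau_buf, tau_last)
    return t_lb0 + t_loop(cnt, chain, t_lb) + n * tau_buf


if __name__ == "__main__":
    # Arbitrary illustrative values (ps): N = 4, tau_buf = 10, last buffer = 12,
    # feedback delay = 8, initial pulse-shaping delay = 5.
    print(t_measured_main(cnt=2, n=3, n_stages=4, tau_buf=10.0,
                          tau_last=12.0, t_lb=8.0, t_lb0=5.0))
```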
Compensation block
The compensation block (i.e., the M-stage delay chain) provides the measurement value, as a digital calibration, when the timing event is somewhere in the feedback path, thereby eliminating the measurement nonlinearity of the simple looped TDC depicted in Fig. 2. In this case, the timing event is captured by the corresponding D flip-flop of the compensation block when the measurement is stopped while the delay chain of the main block has been passed through $(CNT - 1)$ times by the START event, as illustrated in Fig. 7. It is worth noting that the number of stages M in the compensation block may be less than the number of stages N in the main block. The time interval T to be measured can be represented as $T = T_{loop} + T_{residue}$, where
and the residue interval $T_{residue}$ is bounded as follows:
By combining with the criterion described in (1), the relationship between M and N can be written as
Let the position of this quantized residue interval be m with the buffer delay-chain concept. Then, the quantized time interval of measurement, $T_m$, equals
where the maximum value of m is determined by $T_{LB}$ based on (7).
Read-out scheme
The measurement of the time interval ends at the rising edge of the STOP signal. At this instant, the timing information in both the main and compensation blocks is frozen, and the thermometer-to-binary decoder associated with the main block or the compensation block decodes the residue interval accordingly. To achieve this, a novel synchronous read-out scheme, depicted in Fig. 8, is proposed. It comprises a number of D flip-flops used as sampling elements and two thermometer-to-binary decoders, each consisting of an XOR array and a 1-out-of-n to binary encoder.
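The decoder structure named above (an XOR array followed by a 1-out-of-n to binary encoder) can be sketched in software as follows. This is a behavioral illustration only, not the Verilog of the proposed design. The bit ordering (ones first, then zeros), the padding of the XOR array with a trailing 0, and the convention that the decoded value is the number of leading ones are assumptions made for the example.

```python
from typing import List, Sequence


def xor_array(bits: Sequence[int]) -> List[int]:
    """XOR every bit with its right-hand neighbour (the last bit is XORed with 0).
    For a clean thermometer code 1...10...0 this marks the single 1 -> 0 boundary."""
    padded = list(bits) + [0]
    return [padded[i] ^ padded[i + 1] for i in range(len(bits))]


def one_hot_to_binary(one_hot: Sequence[int]) -> int:
    """Return the index of the single asserted bit of a 1-out-of-n code."""
    if sum(one_hot) != 1:
        raise ValueError("input is not a 1-out-of-n code")
    return list(one_hot).index(1)


def decode_thermometer(bits: Sequence[int]) -> int:
    """Decode a thermometer code into the number of leading ones."""
    if not any(bits):
        return 0  # all-zero pattern: no flip-flop has captured the event
    return one_hot_to_binary(xor_array(bits)) + 1


if __name__ == "__main__":
    print(decode_thermometer([1, 1, 1, 0, 0]))
```

A pattern that is not a valid thermometer code, such as 1, 0, 1, 0, produces more than one asserted bit after the XOR array and is rejected by the encoder in this sketch.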
As a consequence, a conventional TDC configuration seems to be infeasible if PVT variations become significant. Fortunately, this problem is avoided by the proposed TDC architecture, in which the hold-time violation concerning the falling edge of the pulse is insignificant. In addition, the pulse-width is no longer a critical issue in the proposed TDC architecture. To see this, consider Fig. 10, where the STOP signal is sampled by any four consecutive flip-flops on the rising edges of the delayed clock vector start[i-2:i+1]. Metastability drives the output to an erroneous state, from 1 → 0 (bit i-1) in case 1 and from 0 → 1 (bit i) in case 2. However, the error caused by metastability is limited to at most one extra buffer delay, since the STOP signal will be sampled by the next delayed clock (start[i] and start[i+1] for case 1 and case 2, respectively) after a buffer delay $\tau_{buf}$. Thus, the proposed TDC confines the error of unstable edges to a single bit, since at most one bit is flipped. In such a case, the quantization error of a single time-interval measurement may include an extra buffer delay of $-\tau_{buf}$ (−1 LSB), as shown in Fig. 10, depending on the actual value of the time interval T to be measured and the resulting state of interpolation.
In addition, to reduce the probability of metastability by relaxing the setup- and hold-time constraints, all D flip-flops in Fig. 5 can use the symmetric sense-amplifier-based flip-flop (SAFF) structure [9], if it is available in the standard-cell library, to benefit from its very small metastability window (equal to the sum of the setup and hold times) and hence its low sensitivity to mismatches between nMOS and pMOS transistors.
Pulse-width
As mentioned, the pulse-width of the START_m signal cannot be less than the total amount of pulse shrinking through the two delay chains in the main block and the compensation block. On the other hand, accounting for the actual delay of the feedback path, i.e., $T_{LB}$, the upper limit of the pulse-width of the delayed clock vector (start[0:N-1] and start_e[0:M-1]) can be determined in the manner shown in Fig. 11. Thus, the pulse-width can be bounded as
$(M + N) \times t_{buf(shrunk)} < T_{pulse} < T_{buffers(main)} + T_{LB}$ (10)
where M and N are scalable, depending on the end application, and $t_{buf(shrunk)}$ ($\ll \tau_{buf}$) is the amount by which a buffer shrinks the pulse-width. The lower bound can be met easily, or even ignored, by sizing the pMOS and nMOS transistors of each buffer in the delay chain so that the buffer stretches the pulse rather than shrinks it. The upper bound is trivially met since N, $\tau_{buf}$, and $T_{LB}$ are generally relatively large. Based on this and on Fig. 10, the pulse-width of the START_m signal is no longer a critical issue in the proposed TDC configuration and can be set to a somewhat arbitrary value, called the feasible pulse-width, on two conditions: it must be wide enough for the start[0] pulse (one buffer after START_m) to rise to $V_{DD}$ so that the STOP signal at the D input is sampled reliably, and the pulse-width of start_e[M-1] must still satisfy the upper bound of (10). The pulse-width of start_e[M-1] equals $T_{pulse} + \sum_{i=0}^{M+N-1} t_{buf(stretched)i}$, where $t_{buf(stretched)i}$ is the amount by which the ith buffer in the delay chain stretches the pulse-width; it may differ from the $t_{buf(stretched)}$ of a typical buffer because of PVT variations. Although the feasible pulse-width of START_m can be approximated to first order as $K C_L V_{DD} / (V_{DD} - V_T)^2$, where $C_L$, K, and $V_T$ are the load capacitance, the device-dependent constant, and the threshold voltage of the transistor, respectively, these parameters cannot be easily determined for modern VLSI processes. Thus, the feasible pulse-width of START_m for a given process is best obtained by simulation. The feasible pulse-widths of START_m for a number of processes are listed in Table I.
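A minimal sketch of the pulse-width check in (10) is given below. The function names are hypothetical, and the numeric values in the example are arbitrary round numbers chosen for illustration rather than figures from Table I.

```python
from typing import Tuple


def pulse_width_bounds(m: int, n: int, t_shrunk: float,
                       t_buffers_main: float, t_lb: float) -> Tuple[float, float]:
    """Return (lower, upper) bounds of (10), both exclusive:
    (M + N) * t_buf(shrunk) < T_pulse < T_buffers(main) + T_LB."""
    lower = (m + n) * t_shrunk
    upper = t_buffers_main + t_lb
    return lower, upper


def is_pulse_width_feasible(t_pulse: float, m: int, n: int, t_shrunk: float,
                            t_buffers_main: float, t_lb: float) -> bool:
    """True when t_pulse lies strictly inside the bounds of (10)."""
    lower, upper = pulse_width_bounds(m, n, t_shrunk, t_buffers_main, t_lb)
    return lower < t_pulse < upper


if __name__ == "__main__":
    # Illustrative values only (ps): M = 2, N = 4, 1 ps shrink per buffer,
    # 42 ps main-chain delay, 8 ps feedback delay.
    print(pulse_width_bounds(2, 4, 1.0, 42.0, 8.0))
    print(is_pulse_width_feasible(20.0, 2, 4, 1.0, 42.0, 8.0))
```

If the buffers stretch rather than shrink the pulse, the shrink per buffer can be passed as zero, which reduces the lower bound to zero in this sketch.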
To generate the feasible pulse-width of START_m for use in different processes, a programmable pulse generator is designed in the proposed TDC, as depicted in Fig. 12. The width of the generated pulse can be expressed as
$[\text{basic\_pulse\_width} + (\text{incremental\_pulse\_width}) \times k]$
where the basic_pulse_width is mainly determined by the propagation delay $[(2 t_{pd} \times i) + t_{pd}]$, plus the propagation delays of the AND gate and the 2-to-1 MUX, and $t_{pd}$ is the propagation delay of an inverter. The delay element is a buffer consisting of two inverters in cascade, so the propagation delay of a buffer is $2 t_{pd}$ (the incremental_pulse_width). Here k is a programmable integer that can be set to a positive value if a wider pulse-width is needed, for example when the buffers of the delay chain are not intentionally designed to stretch the pulse. In our design, k is set to 0 because each buffer in the delay chain is purposely designed to stretch the pulse passing through it, by the amount $t_{buf(stretched)}$ obtained by simulation and listed in Table I. The pulse generator architecture has been designed, simulated, and verified using HSPICE in different CMOS processes, including 32 nm, 45 nm, 65 nm and 90 nm. The actual values are determined by simulation.
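The programmable width can be illustrated with the short sketch below. The helper names are hypothetical, and restricting k to non-negative integers is an assumption of the sketch. The example uses the 32-nm figures quoted in the simulation section (a basic width of 103.53 ps and an increment of 11.73 ps).

```python
import math


def pulse_width_ps(k: int, basic_ps: float, incremental_ps: float) -> float:
    """Generated pulse width: basic_pulse_width + incremental_pulse_width * k."""
    if k < 0:
        raise ValueError("k is assumed to be a non-negative integer in this sketch")
    return basic_ps + incremental_ps * k


def min_k_for_width(target_ps: float, basic_ps: float, incremental_ps: float) -> int:
    """Smallest k whose generated width is at least target_ps."""
    if target_ps <= basic_ps:
        return 0
    return math.ceil((target_ps - basic_ps) / incremental_ps)


if __name__ == "__main__":
    # 32-nm values reported in the paper; k = 0 is the setting used there.
    print(pulse_width_ps(0, 103.53, 11.73))
```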
Features
The proposed TDC, which delays the clock (i.e., START) signal while sending the STOP signal to a common D input, has several salient features. First, the measurement error caused by the metastability of flip-flops can be alleviated. Second, the pulse-width is no longer a critical issue, as illustrated in Fig. 10. Third, unlike the traditional TDC, the proposed TDC produces only thermometer-coded output patterns (1…10…0) rather than a variety of output patterns (0…01…10…, 1…10…0, 0…010…0, etc.), which severely complicate the read-out circuitry of the conventional TDC because its pulse-width varies with PVT variations. This makes the proposed TDC particularly simple to implement: a thermometer-to-binary decoder suffices to decode the output pattern, and a very simple control circuit can be used to control the flow of signals.
3 Simulation results
TDC performance
To design and implement the proposed TDC in a cost-effective way, the essential timing delays required for the design procedure are obtained by simulation and summarized in Table I.
As described previously, the feasible pulse-width of the START_m signal for a specific process can be set to a somewhat arbitrary value and verified by simulation. The feasible pulse-widths used are listed in Table I. To make each basic pulse-width as close as possible to the feasible pulse-width, we choose a different value of i for each CMOS process, i.e., i = 8, 7, 7, 6 for 32 nm, 45 nm, 65 nm and 90 nm, respectively, as shown in Table I.
To illustrate Table I, consider the case of the 32-nm PTM at the nominal corner. The LSB resolution of the proposed TDC is 13.87 ps. The propagation delay of the pulse-shaping block, $T_{LB}$, is 30.09 ps measured from the external START signal to the START_m pulse for the first round and 117.44 ps for the feedback path. The number of stages M is chosen to be 12 to save area. The pulse-width of the START_m signal can be programmed as (103.53 + 11.73 × k) ps, where k is a programmable integer. In our simulation, k is set to 0, as described before.
To obtain the transfer characteristics of the proposed TDC versus the measured time interval, the measurement must be repeated over a sweep of time intervals. To this end, the functionality and timing of the proposed TDC are examined with a set of input time intervals by Verilog HDL behavioral simulation, in which the interval between the START and STOP signals is stepped by 100 ps from 0.4 to 15 ns, subject to the timing delays summarized in Table I, which were obtained by HSPICE simulation with PTM parameters [10]. During the simulation, we recorded, for each measured time interval, the loop-cycle count CNT and the position of the quantized residue interval, n in the main block and m in the compensation block. In the experiments, three values of N were used at the nominal process corner of the 32-nm PTM. Fig. 13 shows the transfer characteristics of the proposed TDC for these three numbers of stages in the main block, N = 128, 64, and 32. In Fig. 13(a), the loop-cycle characteristics show a counting value that increases with the measured interval in steps corresponding to the propagation delay $T_{loop}$, given by (2) or (5). Fig. 13(b) shows a partial enlargement of the resulting curves with respect to the quantized residue positions n and m. Each curve exhibits a sawtooth waveform with a repetition period equal to one complete loop cycle, i.e., $T_{buffers(main)} + T_{LB}$. The proposed cyclic TDC architecture thus achieves high linearity and a high dynamic range, and the range of the main counter can easily be adjusted on demand.
To further illustrate Fig. 13(b), consider the case in which N is 64 and the time interval to be measured is 9800 ps, indicated by T1. In this case, the residue interval is detected in the main block with n = 51 and CNT = 9, and $T_m$ follows from (2) and (4). As another instance, consider measuring a time interval of 10100 ps, indicated by T2, with the same N. In this case, the residue interval falls within the compensation block, with CNT = 10 and m = 9. Hence, from (5) and (8), $T_m = 10097.98$ ps, and the quantization error is −2.02 ps (= 10097.98 − 10100), within 1 LSB.
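The arithmetic of the T2 example can be checked with a few lines of Python. The helper names are hypothetical. The measured value 10097.98 ps, the applied interval of 10100 ps, and the 13.87 ps LSB are taken from the text.

```python
LSB_PS = 13.87  # LSB resolution reported for the 32-nm PTM at the nominal corner


def quantization_error_ps(t_measured_ps: float, t_actual_ps: float) -> float:
    """Quantization error defined as measured value minus applied interval."""
    return t_measured_ps - t_actual_ps


def within_one_lsb(error_ps: float, lsb_ps: float = LSB_PS) -> bool:
    """True when the magnitude of the error is smaller than one LSB."""
    return abs(error_ps) < lsb_ps


if __name__ == "__main__":
    err = quantization_error_ps(10097.98, 10100.0)
    # Rounded to two decimals to suppress floating-point noise.
    print(round(err, 2), within_one_lsb(err))
```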
Read-out scheme
The hardware cost of the read-out scheme can strongly constrain the choice of the number of delay cells, N, in the main block. To estimate this cost as a function of N, the read-out scheme has been synthesized for the Xilinx Spartan-3A DSP (XC3SD3400A-4FG676) device, which contains 47744 4-input LUTs. The synthesis results are plotted in Fig. 14 as the number of 4-input LUTs versus N. As expected, the number of 4-input LUTs required to realize the read-out scheme depends strongly on the length of the delay chain, N, in the main block.
Two observations can be made from Fig. 14. First, for N ≤ 128, the number of 4-input LUTs used fits line 2, with slope Sb = 8.8125. Second, for N > 128, the number of 4-input LUTs needed grows rapidly and fits line 1, with slope Sa = 25.56. Clearly, Sa is much greater than Sb. Based on this, the delay-chain length, N, should be less than 128 to limit area usage. Generally, the shorter the delay chain, the better the linearity and the lower the cost.
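The two reported slopes can be used for a rough estimate of how many additional LUTs a longer delay chain costs. The sketch below is only illustrative: the intercepts of the fitted lines are not given, so only increments are estimated, and treating N = 128 as the exact point where the slope changes is an assumption. The function name is hypothetical.

```python
KNEE_N = 128          # assumed breakpoint between the two fitted lines
SLOPE_BELOW = 8.8125  # Sb: LUTs per additional delay cell for N <= 128
SLOPE_ABOVE = 25.56   # Sa: LUTs per additional delay cell for N > 128


def extra_luts(n_from: int, n_to: int) -> float:
    """Estimated additional 4-input LUTs when the chain grows from n_from to n_to."""
    if not 0 <= n_from <= n_to:
        raise ValueError("expected 0 <= n_from <= n_to")
    # Portion of the increase that lies at or below the breakpoint.
    below = min(n_to, KNEE_N) - min(n_from, KNEE_N)
    # Portion of the increase that lies above the breakpoint.
    above = max(n_to, KNEE_N) - max(n_from, KNEE_N)
    return below * SLOPE_BELOW + above * SLOPE_ABOVE


if __name__ == "__main__":
    print(extra_luts(64, 128))   # growth within the low-slope region
    print(extra_luts(128, 256))  # growth within the high-slope region
```

Under these assumptions, doubling the chain from 128 to 256 cells costs several times more LUTs than doubling it from 64 to 128, which is the trend the text describes.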
4 Conclusion
The proposed cyclic TDC has the following features. First, it exhibits superior periodicity, high linearity, and a high dynamic range. The measurement error caused by the metastability of flip-flops is alleviated, with the quantization error staying within about −1 LSB over almost the entire characteristic. Second, the pulse-width is no longer a critical issue and can generally be set to a somewhat arbitrary value. Third, the proposed cyclic TDC produces only thermometer-coded output patterns rather than a variety of output patterns, thereby reducing the cost of the read-out scheme. Fourth, the synthesis results for the read-out scheme show that reducing the number of delay cells not only yields better linearity but also reduces area.
