Abstract-For time-of-flight positron emission tomography (TOF PET) detectors, time-to-digital converters (TDCs) are essential to resolve the coincidence time of the photon pairs. Recently, our team reported an efficient TDC structure, the ring-oscillator-based (RO-based) Vernier TDC using carry chains. The method is very promising due to its low linearity error and low resource cost. However, the implementation complexity is rather high, especially when moving to multi-channel TDC designs, since the method calls for manual intervention in the initial fitting results of the compilation software. In this paper, we elaborate the key points for implementing high-performance multi-channel TDCs of this kind while keeping the implementation complexity as low as possible. Furthermore, we propose an efficient construction method for the fine time interpolator, called period difference recording, which needs at most 31 adjustment trials to obtain a targeted TDC resolution. To validate the proposed techniques, we built a 32-channel TDC on a Stratix III FPGA chip and fully evaluated its performance. Code density tests show that, for each of the 32 TDC channels, the resolution lies in the range of (23 ps~37 ps), the differential nonlinearity (DNL) lies in the range of (-0.4 LSB~0.4 LSB), and the integral nonlinearity (INL) lies in the range of (-0.7 LSB~0.7 LSB). This paper greatly eases the design of carry chain RO-based TDCs and can significantly advance their practical use.
I. INTRODUCTION
Time-of-flight (TOF) PET relies on very fast detectors that use multi-channel time-to-digital converter (TDC) modules to resolve the coincidence time of the photon pairs. This helps to improve the tomographic reconstruction quality while simultaneously reducing radiation doses and/or scan times [1]. The performance of the overall PET system is directly related to the precision of the TDCs used. In this paper, we focus on the design of highly accurate multi-channel TDCs while keeping the implementation complexity as low as possible.
Most present TDC structures adopt a two-step time measurement technique [2] - [5]. In the first step, a coarse counter running at the system clock rate (usually corresponding to a period of several nanoseconds) records the elapsed coarse time to guarantee a large dynamic range. In the second step, a fine time interpolator with sub-nanosecond resolution records the time bin within the specific system clock cycle at which the coarse counter is latched, to guarantee high precision. The most commonly used fine time interpolator techniques include the tapped delay line (TDL) [3] - [12], the pulse shrinking delay line [13] and the Vernier delay line [14] - [16]. Generally, there are two platforms for implementing TDCs: application specific integrated circuits (ASICs) and field programmable gate arrays (FPGAs). ASIC-based TDCs offer strong design flexibility and can use beneficial analog circuits such as delay locked loops (DLLs), which contribute to an excellent delay line. However, their cost is especially high when the production volume is low, and their development period is rather long. FPGA-based TDCs have much less design freedom, being constrained to the digital design space. However, the reconfigurability of FPGAs makes the design much less expensive and allows it to be adjusted quickly to meet new requirements.
J. Wu proposed the carry chain based structure as an efficient TDL interpolator, which has been important to the development of FPGA-based TDCs [7]. Carry chains, which widely exist in modern FPGAs, are provided by the vendors to implement fast arithmetic functions such as addition or comparison. The delay of a basic carry chain cell is so small that the carry chain is reasonably regarded as the most suitable resource for fine time interpolation on FPGA chips. That is why carry chain based TDL TDCs have been studied extensively in recent years. However, the existence of ultra-wide bins, which are physically determined during chip fabrication, limits the precision of such TDCs significantly. J. Wu proposed the wave-union A and B methods to effectively subdivide the ultra-wide bins by multiple measurements along a single carry chain, improving the precision beyond the cell delay [6]. The wave-union methods were then widely adopted by many later FPGA-based TDC designs [8], [11]. Another drawback of such TDCs is the large differential nonlinearity (DNL) and integral nonlinearity (INL) caused by the uneven bin granularity of the carry chains. One possible technique to mitigate this problem is bin-by-bin calibration [2]. However, this incurs a large memory and logic resource cost.
Recently, we proposed a new ring-oscillator-based (RO-based) TDC structure that organizes the carry chains in a Vernier loop style [17]. A specific construction method to set up two ROs with a very small period difference was fully illustrated. This method opened up a new way to use carry chains to build the fine time interpolator, which led to much reduced DNL and INL. In this paper, we report a 32-channel TDC realized on a single FPGA chip by further exploiting the method proposed in [17]. One main challenge in implementation complexity emerges when moving to multi-channel TDC designs, since the required effort increases linearly with the channel number. This is especially true because the design flow requires exhaustive manual intervention in the initial fitting results of the compilation software to obtain two ROs with a targeted period difference. This paper proposes a new construction method for building the two ROs with at most 31 trials per TDC channel, which greatly reduces the design complexity.
In summary, the main contributions of this paper are: key points for obtaining a high-performance TDC with the carry chain RO-based method; a new and highly efficient construction method, called period difference recording (PDR), to build the ROs; and a demonstration of the multi-channel capability and scalability of the carry chain RO-based method in a single FPGA chip.
II. MULTI-CHANNEL TDC DESIGN
A. Carry Chain RO-Based TDC Structure
The basic carry chain RO-based Vernier TDC structure is depicted in Fig.1. It measures a time interval in two steps, a coarse and a fine time measurement (Fig.1(a)). The corresponding timing diagram is shown in Fig.1(b). The coarse counter running at the system clock rate records the coarse time. The clock extraction module finds the closest clock signal in time after the hit signal and passes the delayed hit and clock signal pair to the fine time interpolator module, which measures the fine time interval between them. The working principle and circuit implementation of the clock extraction module can be found in Section II-B. Two signals labeled ctrl_1 and ctrl_2 are generated to indicate the proper timing for the time assembler module to latch the coarse and fine counter values correctly and combine them to produce the final timestamp. Fig.1(c) shows the detailed structure of the RO-based Vernier fine time interpolator, in which the last cell of each carry chain is connected back to its first cell. The two ROs are composed of different numbers of carry chain cells (or delay units, DUs) and have different oscillation periods. The DU works as the basic delay unit, and a complete Vernier delay line can contain an even or odd number of DUs. The period difference between the two ROs determines the resolution of the TDC. Additionally, each RO contains a pulse width reshaping module to maintain the stability of the positive duration of the oscillation signal propagating along it. Its circuit implementation is shown in the rightmost part of Fig.1(c). According to the timing diagram (Fig.1(b)), the leading signal of a fine time interval event (the hit_syn signal in Fig.1) is fed to the slow RO while the lagging one (the clk_syn signal in Fig.1) is fed to the fast RO.
The fine timestamp is obtained by reading out the fine time counter, which records the oscillation number at the moment the lagging signal catches up with the leading signal. The DNL of such TDCs avoids the bad influence of uneven bin granularity, since the resolution is determined by the physical length difference of the two ROs rather than by the bin widths of the carry chains as in the TDL form. Much reduced DNL has been observed in [17]. However, since the ROs are not compensated and stabilized during the time measurement, the oscillation number cannot be set too large if a small precision RMS is to be maintained. The designers should therefore carefully consider the tradeoff between resolution and precision.
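The catch-up principle described above can be illustrated with a minimal sketch, assuming ideal, jitter-free oscillators. The function name and the 5030 ps / 5000 ps periods are illustrative assumptions chosen to give a 30 ps period difference; they are not taken from the measured design.

```python
def vernier_fine_count(interval_ps, t_slow_ps, t_fast_ps, max_cycles=100_000):
    """Return the oscillation number at which the lagging edge catches up.

    The slow RO is started by the leading signal at time 0; the fast RO is
    started by the lagging signal interval_ps later. Each oscillation closes
    the gap by (t_slow_ps - t_fast_ps), which plays the role of the LSB.
    """
    if t_slow_ps <= t_fast_ps:
        raise ValueError("the slow RO period must exceed the fast RO period")
    for n in range(max_cycles + 1):
        slow_edge = n * t_slow_ps                 # n-th edge of the slow RO
        fast_edge = interval_ps + n * t_fast_ps   # n-th edge of the fast RO
        if fast_edge <= slow_edge:                # lagging edge has caught up
            return n
    raise RuntimeError("no coincidence found within max_cycles")


if __name__ == "__main__":
    lsb = 5030 - 5000  # illustrative periods in ps (assumption)
    for interval in (0, 300, 310):
        n = vernier_fine_count(interval, 5030, 5000)
        print(interval, n, n * lsb)
```

In this idealized model the count equals the interval divided by the period difference, rounded up, so a longer interval or a finer resolution both increase the oscillation number and hence the accumulated jitter.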
B. Key Design Points
There are two key points for the designers to build the RO-based Vernier TDC: the clock extraction module and the fine time interpolator module.
1) Key design point for the clock extraction module: The basic task of the clock extraction module is to locate the clock signal that is nearest in time after the hit signal and then extract it. The output leading and lagging signals are termed hit_syn and clk_syn, respectively, and the fine time interval between them is measured by the following fine time interpolator module. The circuit implementation of the clock extraction module is depicted in Fig.2. It can be seen that an undesired additional delay $\tau_d = \tau_{reg} + \tau_2 - \tau_1$ is added to the original fine time interval and causes a minimal oscillation number $n_0 = \tau_d / \mathrm{LSB}$, where $\tau_{reg}$ represents the delay introduced by the sampling D-type flip-flops, $\tau_1$ and $\tau_2$ represent the adjustable delays introduced by the delay compensation units for the hit_in and clk_in signals respectively, and LSB represents the resolution of the TDC. The existence of $n_0$ deteriorates the TDC performance, since the precision RMS increases approximately in proportion to the square root of the overall oscillation number [18]. The timing diagram in Fig.2 shows that equation (1) should be satisfied to meet the requirement on the renewed arrival times of the two signals. Here $t_1$ and $t_2$ are the arrival times of the hit_in and clk_in signals respectively, while $t_1^*$ and $t_2^*$ are the corresponding output times after passing the clock extraction module.
In equation (1), $T_{pos}$ represents the positive duration of the hit_syn signal and $T_{clk}$ represents the period of the system clock. The compensation delays $\tau_1$ and $\tau_2$ are intentionally added to make $\tau_d$ as small as possible, leading to the smallest $n_0$ and the best precision performance. Each compensation unit is composed of 32 cascaded NOT gates implemented in look-up tables (LUTs). According to our experimental experience, this number of gates is adequate to find a good enough parameter set ($\tau_1$, $\tau_2$). The number of gates actually used for each signal is manually adjusted with the resource editor tool provided by the FPGA manufacturer (for example, the engineering change orders (ECO) tool by Altera and the FPGA editor tool by Xilinx). The adjustment criterion is to make $\tau_d$ lie in the range constrained by equation (1) and as close to zero as possible. However, this task is difficult to fulfill directly, since the timing parameters $\tau_1$, $\tau_2$, $\tau_{reg}$ and $T_{pos}$ are all very hard to know exactly. To overcome this difficulty, we propose to infer the actual $\tau_d$ value by observing the distribution of the fine time counter values n output by the fine time interpolator module. The cases of the relative phase between the hit_syn and clk_syn signals are depicted in Fig.3 and the corresponding distributions of n are summarized in Table I, where the parameter $n_m$ represents the maximal fine time counter value.
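[Table I: distributions of n for the relative phase cases of Fig.3. Surviving entry: all the relative phase relations are in the expected range satisfying equation (1), with distribution of n {($n_0$, $n_m$)}.]
According to the cases listed in Table I, the following steps are applied to efficiently find an optimal set ($\tau_1$, $\tau_2$) for each TDC channel: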
1) According to the collected distributions of n, classify which case the present parameter set ($\tau_1$, $\tau_2$) belongs to. If it matches case (c), go to step 2; if it matches case (d), go to step 3; if it matches case (e), go to step 4; otherwise modify the number of cascaded NOT gates in either of the two delay compensation units until one of the cases (c)~(e) appears.
2) Iteratively reduce the number of gates of the delay compensation unit in the clk_syn signal path until case (e) appears, and then go to step 4.
3) Iteratively reduce the number of gates of the delay compensation unit in the hit_syn signal path until case (e) appears, and then go to step 4.
4) Decrease $n_0$ by iteratively reducing the number of gates of the delay compensation unit in the clk_syn signal path until the minimal achievable positive $n_0$ is found.
The number of gates removed per iteration in steps 2~3 is usually set larger than 1 to speed up the search, while in step 4 it is set to exactly 1 to search for the optimal $n_0$. This manual adjustment of the clock extraction module is very useful: as the targeted number of TDC channels increases, the initial fitting results are more likely to fall outside case (e) and lead to operation failure. Even if the automatic fitting results initially match case (e), it is still very beneficial to apply step 4 to find the optimal $n_0$.
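The four steps above can be sketched as a small search routine. This is only an illustration under stated assumptions: `classify` and `measure_n0` are hypothetical callbacks standing in for the designer's inspection of the distribution of n after each recompilation, and the coarse step of 4 gates is an arbitrary choice for "larger than 1". Step 1 has no explicit search rule in the text, so the sketch leaves it to the designer.

```python
def tune_compensation(classify, measure_n0, gates_hit, gates_clk, coarse_step=4):
    """Follow steps 1-4 for one TDC channel.

    classify(gates_hit, gates_clk) -> case label "a".."e" (hypothetical
    callback representing the inspection of the distribution of n).
    measure_n0(gates_hit, gates_clk) -> observed minimal oscillation number.
    """
    case = classify(gates_hit, gates_clk)
    if case not in ("c", "d", "e"):
        # Step 1 is left to the designer: no search rule is given for it.
        raise ValueError("adjust either unit until case c, d or e appears")

    # Steps 2 and 3: coarse shortening until case (e) appears.
    while case != "e":
        if case == "c":
            gates_clk -= coarse_step      # step 2: shorten the clk_syn path
        elif case == "d":
            gates_hit -= coarse_step      # step 3: shorten the hit_syn path
        else:
            raise RuntimeError("left cases c-e during coarse adjustment")
        if gates_clk < 0 or gates_hit < 0:
            raise RuntimeError("ran out of NOT gates before reaching case e")
        case = classify(gates_hit, gates_clk)

    # Step 4: remove one clk_syn gate at a time while n0 keeps shrinking
    # and stays positive.
    best_n0 = measure_n0(gates_hit, gates_clk)
    while gates_clk > 0:
        trial = gates_clk - 1
        if classify(gates_hit, trial) != "e":
            break
        n0 = measure_n0(gates_hit, trial)
        if n0 <= 0 or n0 >= best_n0:
            break
        gates_clk, best_n0 = trial, n0
    return gates_hit, gates_clk, best_n0
```

In practice each callback invocation corresponds to one manual edit and one measurement run, so the coarse step trades the number of recompilations against the risk of overshooting case (e).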
2) Key design point for the fine time interpolator module: The fine time interpolator module contains two structure-symmetric ROs built from the carry chains. The structural similarity is obtained by using the partition based two-step construction method proposed in [17]. The complete oscillation period of the RO is composed of three parts: $\tau_{p1}$ caused by the carry chain (the path enclosed by the dotted rectangle), $\tau_{p2}$ caused by the connection path between the end of the carry chain and the pulse reshaping module (the bold line), and $\tau_{p3}$ caused by all the remaining logic units and paths in the RO, as shown in Fig.4.
It is clear that $\tau_{p3}$ stays constant once the RO is set up. In our previous work, $\tau_{p2}$ was also assumed to be unchanged after manual intervention at the fine tuning point, so an iterative adjustment process that assigns different DU number combination sets to alter $\tau_{p1}$ for the two ROs was adopted. The adjustment is performed as follows: for the RO whose oscillation period is longer, cut off the connection at the fine tuning point; shorten the carry chain by one DU; finally, reconnect the new, shorter carry chain to the corresponding pulse width reshaping module. The oscillation periods are observed on an external oscilloscope by routing the oscillation signals out of the FPGA chip. This adjustment principle restricts the DU number assignment to move only toward the front end of the carry chain, which may miss many potential DU number combination sets, since $\tau_{p2}$ actually becomes uncertain after each adjustment. For example, when an RO needs to reduce its oscillation period, the DU number may actually need to be increased by 1 instead of decreased by 1, if $\tau_{p2}$ decreases so dramatically that the entire oscillation period decreases even with the larger DU number. In our example design, the overall length of a complete carry chain is 32 DUs, which theoretically gives 32 × 32 = 1024 possible DU number combination sets if both $\tau_{p1}$ and $\tau_{p2}$ are viewed as changeable. Releasing the adjustment constraint opens up a large DU number combination set space and gives flexible design freedom. This point is also important in multi-channel TDC designs, since the more DU number combination candidates are available, the fewer design failures are encountered when using such a short carry chain (32 DUs in total).
If a design failure happens for a TDC channel, the designer has to allocate a new physical region on the FPGA chip and reconstruct the failed channel, which greatly increases the design complexity. Although extending the length of the carry chain is also a feasible way to improve the design success rate, it causes a much larger resource cost, especially in multi-channel TDC designs.
C. Period difference recording method for fine time interpolator construction
In this section we propose the PDR method, with which every possible DU number combination set can be covered with very few adjustment trials. To clarify the PDR method, we define the oscillation period of the fast RO as $\tau_{f,i}$ (i = 32, 31, ..., 1) when the i-th DU of the fast RO is connected to the fine tuning point. Similarly, we define the oscillation period of the slow RO as $\tau_{s,j}$ (j = 32, 31, ..., 1) when the j-th DU of the slow RO is connected to the fine tuning point. Additionally, we define the oscillation period difference between the slow and fast ROs as $\Delta\tau_{i,j} = \tau_{s,j} - \tau_{f,i}$, corresponding to the DU number combination set (i, j). Using these definitions, the PDR design flow is as follows:
1) Test and record the result of $\Delta\tau_{32,32}$, and go to step 2.
2) Fix i = 32, enumerate j = 31, 30, ..., 1, test and record the results of $\Delta\tau_{32,j}$, and go to step 3.
3) Fix j = 32, and initialize i = 31 if this is the first time entering this step. Test and record the result of $\Delta\tau_{i,32}$, and go to step 4.
4) For the current i and each j = 31, 30, ..., 1, compute $\Delta\tau_{i,j} = \Delta\tau_{32,j} + \Delta\tau_{i,32} - \Delta\tau_{32,32}$. If any $\Delta\tau_{i,j}$ lies in the targeted resolution range, output the combination set (i, j) and stop the iteration with success; otherwise set i = i − 1 and go to step 3. If no satisfying $\Delta\tau_{i,j}$ can be found even with i = 1, stop the iteration with failure.
The PDR method needs to record at most 63 results (1 + 31 + 31) but can cover as many as 1024 different DU number combination sets, which greatly reduces the design complexity. It originates from the following identity:
$$\Delta\tau_{i,j} = \tau_{s,j} - \tau_{f,i} = (\tau_{s,j} - \tau_{f,32}) + (\tau_{s,32} - \tau_{f,i}) - (\tau_{s,32} - \tau_{f,32}) = \Delta\tau_{32,j} + \Delta\tau_{i,32} - \Delta\tau_{32,32} \qquad (2)$$
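The PDR flow can be sketched as follows. This is a minimal illustration, not the authors' tooling: `measure` is a hypothetical callback standing in for one manual adjustment plus one oscilloscope reading of the period difference, and the returned trial count simply tallies how many such readings were taken.

```python
def pdr_search(measure, target_ps, size=32):
    """Search DU combination sets (i, j) with the period difference recording flow.

    measure(i, j) -> measured period difference in ps for the set (i, j)
    (hypothetical callback representing one adjustment trial).
    target_ps -> (low, high) bounds of the targeted resolution in ps.
    Returns ((i, j), estimate_ps, trials), or (None, None, trials) on failure.
    """
    low, high = target_ps

    ref = measure(size, size)                                    # step 1
    trials = 1

    row = {j: measure(size, j) for j in range(size - 1, 0, -1)}  # step 2
    trials += size - 1

    for i in range(size - 1, 0, -1):                             # step 3
        col_i = measure(i, size)
        trials += 1
        for j in range(size - 1, 0, -1):                         # step 4
            estimate = row[j] + col_i - ref                      # equation (2)
            if low <= estimate <= high:
                return (i, j), estimate, trials
    return None, None, trials
```

With `size=32` the routine never takes more than 63 readings, and with `size=16` never more than 31. The returned value is only an estimate of the resolution, since equation (2) is applied to measured rather than exact period differences.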
In practical use, the period difference between the two ROs is obtained by observing an external oscilloscope to collect the oscillation number k, the initial period difference $\Delta\tau_{ini}$ and the final period difference $\Delta\tau_{fnl}$ for an arbitrary DU number combination set (i, j) (i, j = 32, 31, ..., 1); the period difference is then calculated as $\Delta\tau_{i,j} = (\Delta\tau_{fnl} - \Delta\tau_{ini}) / k$. For example, Fig.5 shows a real waveform captured during our design process by a 2.5 GS/s Tektronix oscilloscope (model DPO 3032), in which channel 1 represents the slow RO and channel 2 represents the fast RO. Fig.5(a) shows the entire oscillation waveform, giving k = 25, and Fig.5(b) shows the locally enlarged waveform of the first two oscillations, giving $\Delta\tau_{ini}$ ≈ 300 ps.
According to our design experience, a 16 × 16 design space generating 256 possible DU number combination sets is large enough to construct our 32-channel TDC. No failure happened during the whole design process. As an example, we summarize the recorded period difference values with the 16 × 16 design space for TDC channel No.1 in Table II. Since a target resolution range of 25~35 ps is chosen in our design, the satisfying DU number combination sets can easily be identified from Table II by applying equation (2). For example, the combination set (i, j) = (25, 30) gives $\Delta\tau_{25,30}$ = −168 + 62 − (−133) = 27 ps, which makes it a valid DU combination candidate. It should be noted that the PDR method only provides an estimate of the resolution, whose accurate value should be obtained from the code density tests performed in Section III.
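The oscilloscope-based estimate and the range check can be written as a short helper. This is a sketch under assumptions: the function names are illustrative, and the 1050 ps final offset used in the example is a hypothetical value, since only k = 25 and the initial offset of about 300 ps are quoted in the text.

```python
def period_difference_ps(dt_initial_ps, dt_final_ps, oscillations):
    """Average period difference: growth of the edge offset per oscillation."""
    if oscillations <= 0:
        raise ValueError("oscillation number must be positive")
    return (dt_final_ps - dt_initial_ps) / oscillations


def in_target(dt_ps, low_ps=25.0, high_ps=35.0):
    """Check a period difference against the 25~35 ps target range."""
    return low_ps <= dt_ps <= high_ps


if __name__ == "__main__":
    # k = 25 and 300 ps are quoted in the text; 1050 ps is a hypothetical reading.
    estimate = period_difference_ps(300.0, 1050.0, 25)
    print(estimate, in_target(estimate))
    # The worked example for the set (25, 30) using equation (2).
    print(in_target(-168 + 62 - (-133)))
```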
III. TEST RESULTS
This paper built a 32-channels TDC prototype on a single EP3SE110F1152I3 Stratix III device from Altera using a self- During tests, all recorded timestamps were transferred to PC via the USB 2.0 bus for further analysis. Code density tests were applied to test the performance of the DNL and INL. Furthermore, precision RMS test was also performed via two TDC channels by feeding two hit signals having a fixed delay value. To reduce any possible statistical error of counting, the test sample size was set to be one million. All the mentioned tests were conducted using nominal supply voltages and at an ambient temperature of around 20°C.
A. Specific performance characterization of TDC channel No.1
To apply the code density test, an AFG3251 arbitrary function generator was used to generate pulsed signals with a repetition frequency of 500.1 kHz. The generator ran from a clock uncorrelated with the TDC clock to guarantee the correctness of the code density tests. The pulsed signals were fed into the FPGA chip as hit signals. The tested fine timestamps for TDC channel No.1 lie in the range of (9~64), so LSB = 1667 ps / (64 − 9) = 30.3 ps. The obtained DNL, lying in the range of (-0.15 LSB~0.82 LSB), and INL, lying in the range of (-0.21 LSB~0.28 LSB), are depicted in Fig.6.
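A code density analysis of this kind can be sketched as below. The LSB expression follows the text, with the 1667 ps clock period divided by the span of observed fine codes. The DNL and INL definitions are the common code density convention (bin occupancy relative to the mean occupancy, and its running sum), which is an assumption because the paper does not spell them out; the function names are illustrative.

```python
from collections import Counter
from itertools import accumulate


def lsb_ps(clock_period_ps, n_min, n_max):
    """Resolution as the clock period divided by the fine code span."""
    return clock_period_ps / (n_max - n_min)


def dnl_inl(codes):
    """Return (dnl, inl) lists in LSB from fine codes of uncorrelated hits.

    Assumes hits are uniformly distributed in time, so each bin's share of
    the hits is proportional to its width.
    """
    counts = Counter(codes)
    bins = range(min(counts), max(counts) + 1)
    hits = [counts[b] for b in bins]              # missing codes count as 0
    mean_hits = sum(hits) / len(hits)             # occupancy of an ideal bin
    dnl = [h / mean_hits - 1.0 for h in hits]     # relative width error per bin
    inl = list(accumulate(dnl))                   # running sum of the DNL
    return dnl, inl


if __name__ == "__main__":
    print(round(lsb_ps(1667, 9, 64), 1))
    print(dnl_inl([0, 0, 1, 1, 1, 1, 2, 2]))
```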
To test large time intervals and evaluate the precision RMS, TDC channel No.2 was included. The AFG3251 was used to generate two correlated hit signals with a programmed delay ranging over (0 ps~30000 ps). The two hit signals were fed to TDC channels No.1 and No.2 respectively through two coaxial cables of equal length. The TDC output results were obtained by subtracting the time results of channel No.1 from those of channel No.2. Before the test, a LeCroy 740Zi digital oscilloscope working in Random Interleaved Sampling (RIS) mode, which provides a 200 GS/s equivalent sampling rate, was used to determine the signal jitter introduced by the AFG3251, which turned out to be less than 8 ps. That value has a small influence on our final results. The transfer curves of the TDC are depicted in Fig.7, in which Fig.7(a) uses a step size of 500 ps and a dynamic range of 30 ns while Fig.7(b) uses a step size of 100 ps and a dynamic range of 3.5 ns. The fitted linear curve has a slope very close to 1, which demonstrates that the TDC has very good linearity. The offset in the figure is mainly caused by the difference between the delay paths of the two TDC channels from the IO element to the TDC module on the FPGA chip. During the transfer curve test, the precision RMS values at each time interval point were calculated simultaneously and turned out to lie in the range of (32 ps~40 ps). As an example, the histograms of the time interval results with values of 2324 ps and 32737 ps are depicted in Fig.8.
B. Performance summarization of all the 32 TDC channels
In this section, the specific test configuration depicted in Fig.9 was applied to simplify the test process. The AFG3251 is used to generate hit signals with a repetition frequency of 500.1 kHz. Each of the 32 TDC channels is given its own delay module, composed of an even number of cascaded NOT gates (the number of gates is randomly set in the range of 40~100). This configuration is very useful for evaluating the precision RMS under large time interval tests such as 4~20 ns.
By analyzing the distribution of the timestamps for each of the 32 TDC channels, all important performance parameters, including fine time counter range, resolution, equivalent bin width, DNL, INL and precision RMS, can be obtained. The detailed results are listed in Table III. In this table, the resolution is calculated as $\mathrm{LSB} = 1667\ \mathrm{ps} / (n_m - n_0)$. The equivalent bin width $w_{eq}$ takes the effects of the various bin widths into account [19]. It is calculated from a sum over all bins i of bin-width terms normalized by $W$, with $W = \sum_i w_i$, where $w_i$ represents the bin width of the i-th bin. All the $w_i$ values are obtained from the code density tests.
From Table III, we conclude that the obtained resolutions and equivalent bin widths all lie in the range of (23.2 ps~37.2 ps), and the fact that they are very close to each other reflects that the TDC has good linearity [10]. The obtained DNL results generally lie in the range of (-0.4 LSB~0.4 LSB) with a maximal amplitude of 0.59 LSB (channel No.6), and the obtained INL results generally lie in the range of (-0.7 LSB~0.7 LSB) with a maximal amplitude of 0.87 LSB (channel No.6). The obtained linearity is not as good as that reported in [17]. One reason is that the physical location of a TDC channel on the FPGA chip is found to influence the linearity error significantly. However, we did not optimize the physical locations in this design, since doing so would be considerably time consuming and is not required in most applications. All the implementation regions were automatically generated by the compilation software. If the designers want to obtain TDC channels with very small linearity error, manually assigning the implementation regions and comparing their performance is recommended. Another reason is that multiple channels may influence each other during operation and increase the linearity error; here, too, a careful manual assignment of the implementation regions may help. Even so, the linearity is still better than that of the carry chain based TDL method, which usually shows a maximal amplitude of 2~4 LSBs.
Large time interval results are obtained by subtracting the time results of TDC channel No.1 from those of TDC channels No.2~32 respectively. The precision RMS is calculated from the corresponding time interval results for each of the TDC channels. From Table III, it can be seen that all of the precision RMS results lie in the range of (32 ps~39 ps).
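The interval subtraction and RMS calculation can be sketched as follows. The precision RMS is taken here as the population standard deviation of the measured interval about its mean, which is an assumption since the paper does not give the exact estimator; the function name is illustrative.

```python
from statistics import mean, pstdev


def interval_stats_ps(ts_ref_ps, ts_ch_ps):
    """Return (mean interval, RMS spread) between a channel and the reference.

    ts_ref_ps and ts_ch_ps are per-event timestamps (in ps) of the reference
    channel and the channel under test, paired event by event.
    """
    if len(ts_ref_ps) != len(ts_ch_ps):
        raise ValueError("timestamp lists must pair up event by event")
    if len(ts_ref_ps) < 2:
        raise ValueError("need at least two events")
    intervals = [ch - ref for ref, ch in zip(ts_ref_ps, ts_ch_ps)]
    return mean(intervals), pstdev(intervals)


if __name__ == "__main__":
    ref = [0, 1000, 2000, 3000]          # synthetic timestamps for illustration
    ch = [10, 1012, 2010, 3012]
    print(interval_stats_ps(ref, ch))
```

Because each interval contains the jitter of two channels, the spread computed this way characterizes a channel pair rather than a single channel.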
Finally, the dead time of the realized TDC channels is mainly determined by the oscillation period of the Vernier delay line and the maximal oscillation number. The oscillation period (from Fig.5) is about 5 ns and the maximal oscillation number is 80 (from Table III, channel No.23), leading to a dead time of 5 × 80 = 400 ns.
IV. DISCUSSION
Carry chains are usually organized in TDL style which is the mainstream realization method for FPGA-based TDCs. This method provides low implementation complexity since the carry chain based TDL can be automatically synthesized by software compiler without any manual intervention. However, a plain TDC constructed by the TDL method usually suffers from large DNL and INL. Fortunately, by applying some well developed optimization techniques, such as the wave union [6] or multi-chains averaging technique [10] to improve the equivalent resolution and the bin-by-bin calibration technique [2] to improve the INL, this TDC method is very promising for practical use.
This paper emphasizes the Vernier method by organizing the carry chains in RO style. This method has demonstrated itself very competitive in terms of resource cost, DNL and INL when compared with the TDL method for a plain TDC design. The shortcomings are that the realized resolution is not as high as that in the TDL method so far, the dead time is relatively longer, and manual intervention to adjust the RO period difference is needed during design process.
However, similar optimization techniques such as multi-chains averaging and bin-by-bin calibration can also be applied to this kind of TDC to further improve its performance. Most importantly, applying the multi-chains averaging technique is very valuable for suppressing the large precision RMS and for pushing the resolution of such TDCs down to the 10 ps level. Some performance comparisons between this work and other recent FPGA-based works are summarized in Table IV.
V. CONCLUSIONS
Our recently proposed RO-based TDCs, which organize the carry chains in the Vernier loop style, are a promising option for TDC designers, mainly due to their remarkably low linearity error and low resource cost. However, an implementation complexity problem arises when moving to multi-channel TDC designs, since the design calls for manual intervention in the initial fitting results. To address that problem, this paper elaborates the key points for constructing multi-channel TDCs that achieve high performance while keeping the design complexity as low as possible: one for the clock extraction module and one for the fine time interpolator module. Furthermore, the PDR method is proposed to search the potential DU number combination sets for a targeted resolution, which costs at most 31 trials in our example design. The PDR method greatly
