Colom Abstract-Current PET systems with fully digital trigger rely on early digitization of detector signals and the use of digital processors, usually FPGAs, for recognition of valid gamma events on single detectors. Timestamps are assigned and later used for coincidence analysis. Good timing resolution is important, allowing better rejection of singles and leading to increased reconstructed image quality. In order to maintain a decent timing resolution for events detected on different acquisition boards, it is necessary that local timestamps on different FPGAs be synchronized. Sub-nanosecond accuracy is mandatory if we want this effect to be negligible on overall timing resolution. This is usually achieved by connecting all boards to a common backplane with a precise clock delivery network; however, this forces a rigid structure on the whole PET system, and clock synchronization gets more difficult as the size of the system grows. Instead, we propose a backplane-less PET system in which DAQ boards are connected by single full-duplex highspeed data links. Data encoding with embedded clock is used to avoid frequency differences between local oscillators. Timestamp synchronization between FPGAs with clock period resolution is maintained by means of data transfers in a way similar to the IEEE1588 standard. Finer resolution is achieved by reflection of received clocks and phase difference measurement on the transmitter. A hierarchic clock distribution ensures that accumulation of time uncertainty is minimized. It is crucial that data transceivers have very low latency uncertainty in order to achieve the desired timestamp accuracy; we discuss the availability of off-the-shelf hardware for these implementations.
I. INTRODUCTION
There is a trend among PET systems to move the digitization step closer to the photodetector outputs, implementing an increasing part of the required signal processing in digital devices. These changes are arguably motivated by advances in FPGA technology and availability of faster ADCs, whereby analog signals may be transferred to the digital domain with a smaller information loss. Research on systems with fully digital trigger, where trigger signals such as the last dynode output from PMTs undergo a minimal shaping stage before analog-to-digital conversion, is increasingly common [1] [2]. This trend naturally carries a tendency to integrate the increasingly simplified analog front-end, ADCs and first stages of digital processing together on the same circuit board. These boards may then be mounted directly on the photodetector rings in order to reduce cabling and simplify system assembly. This system electronics setup, as shown in Fig. 1 (right), strongly favors the physical separation of the detector-and-DAQ board set from the central processing unit in the system, which would be responsible for coincidence detection, data collection and communication with the external device that performs image reconstruction. This is opposed to a more usual electronic assembly such as [3], shown in Fig. 1 (left), where DAQ boards and higher-level digital processors are integrated inside the same chassis with a common backplane, and physically separated from the detectors and front-end electronics. The challenge then lies in the synchronization of all DAQ boards, which can no longer be achieved with a simple clock tree distributed through the backplane connections. The usual system-synchronous clocking scheme loses performance because the cables required to connect DAQ and central processing, even if length-matched, introduce a clock skew component that is no longer easily controllable.
On the other hand, overall timing resolution in digital PET is continuously improving. Advances in timing algorithms and especially increases in ADC sampling rates allow tighter shapings to be applied to trigger signals so that pile-up and peaking time uncertainties are reduced. With FWHM coincidence timing resolutions below 2 ns and expected to improve significantly [4] , the effect of mis-synchronization between DAQ boards becomes increasingly important, since any discrepancy between reference times on acquisition FPGAs is directly reflected on the measured time difference between coincident events. Hence, clock skews on the order of 500 ps already have a non-trivial effect on the measured resolution figures. Other effects also need to be considered, such as the delay mismatch between different trigger signals as they traverse electronic subsystems with potential latency uncertainty, such as the ADC to FPGA connection.
In this paper, we describe a clocking and synchronization scheme that allows time references in different DAQ boards, connected only by single full-duplex high-speed data links, to be matched with a resolution well below 1 ns, with synchronization operations taking up a small part of the whole data link bandwidth. We also propose a system architecture that takes advantage of this clocking method and allows the DAQ boards to be merged with the analog front-end directly on the detector location and physically separated from the rest of the digital processors with reduced cabling requirements. While other similar synchronization methods such as [5] have been proposed, they mainly focus on the compensation of propagation delay mismatches and variations over very long links. In our case, we focus on maintaining a very low reference time uncertainty between boards and removing the need for external timing calibration.
978-1-4244-7110-2/10/$26.00 ©2010 IEEE
II. PROPOSED SYSTEM ARCHITECTURE
Our proposed DAQ system architecture follows the hierarchic division of circuit boards outlined in Fig. 2, according to a tree topology. Boards on the lowest level are used to digitize detector signals, detect single gamma events and transmit them to their parent boards, connected to them only by means of a full-duplex data link. They may also contain the analog front-end and be directly coupled to the associated photodetectors; the number of detectors supported by each one is arbitrary. Boards on the higher levels are responsible for coincidence detection and may be connected with additional data links. The highest level contains a single board that concentrates all coincidence data and transmits them to an external computer that executes the image reconstruction algorithm. Data connections between different hierarchy levels are set up as master-slave connections, the master being the higher level board. The slave node is able to replicate exactly the local clock frequency in the master. This way, the local oscillator frequency in the top level board is distributed following the system tree structure and all boards run with exactly the same reference clock frequency.
Each data link synchronizes master and slave with a time resolution $t_r$, so local timestamps on different boards at the nth level are synchronized with a resolution of $n \cdot t_r$. Hence it is important to keep the total number of tree levels as small as possible. A maximum of three hierarchy levels seems appropriate, reducible to two for small systems such as single detector rings (i.e. a single parent board performing coincidence detection and a number of acquisition boards that are independently synchronized with the former). In order to achieve this, higher level boards should support as many downlink connections as possible.
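The linear accumulation of uncertainty along the tree can be illustrated with a short calculation. This is only a sketch: the function name is hypothetical, n is assumed to count the data links between a board and the top-level board, and the per-link figure of 150 ps is an illustrative value rather than a specification.

```python
def accumulated_resolution_ps(links_from_top: int, t_r_ps: float) -> float:
    """Accumulated resolution n * t_r for a board n links below the top board."""
    if links_from_top < 0:
        raise ValueError("links_from_top must be non-negative")
    return links_from_top * t_r_ps


# Illustrative per-link resolution (assumed value, in picoseconds).
T_R_PS = 150.0

for levels in (2, 3):
    links = levels - 1  # a tree with `levels` levels has at most levels-1 links to the top
    print(f"{levels} levels: {accumulated_resolution_ps(links, T_R_PS):.0f} ps")
```

Under these assumptions, removing one hierarchy level removes one full $t_r$ from the budget, which is why wide, shallow trees are preferred.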
III. SYNCHRONIZATION THROUGH DATA LINKS
In this section, we explain how clock frequency replication and reference time synchronization with a resolution below the clock period can be achieved between FPGAs on physically separated circuit boards, connected only with a single full-duplex data link. This link shall also be used for transmission of data related to the detected gamma events, so the bandwidth reserved for synchronization operations should ideally be as small as possible.
A. Frequency Replication
Most modern FPGA families offer some sort of embedded high-speed transceiver hard IPs, such as the GXB on Altera devices [6] and the GTP/GTX on Xilinx devices [7]. These hardware blocks, consisting of matched transmitter-receiver pairs, can be used to implement one-way links using optical fiber or dual coaxial cables with differential signaling, with maximum data rates being family-dependent but never lower than 3 Gbps. Such data rates can only be accomplished using self-synchronous signaling, where the transmitter data clock is embedded in the data signal. Fig. 3 shows the block diagram for a generic high-speed receiver. A PLL-based Clock Recovery Unit (CRU) recovers the embedded serial data clock, whose frequency is equal to the link's bit rate, from signal transitions and then generates the parallel clock, whose frequency is the word rate. The receiver then performs data sampling, deserialization, word alignment and physical layer decoding before handing the received word to the user through a clock-domain-crossing interface. It may be possible to bypass some of these blocks depending on implementation. Note that serial clock recovery is only possible if a minimum rate of signal transitions is guaranteed on the data link, as the receiver PLL can lose frequency lock after detecting a constant voltage level for some time. Hence, special physical-level coding must be used in order to force transitions even in the case of long strings of equal symbols. Common approaches include bit scrambling and 8B/10B physical coding. The latter method, which encodes data bytes as sequences of ten bits, takes up 20% of the link bandwidth just for clock transmission and limited link control services; however, it can be considered a superior choice because it guarantees at least one transition in a bounded time interval (5 bit periods) under all circumstances. Hence, embedded transceivers on FPGAs typically include hardware 8B/10B encoder/decoder blocks.
Although a different local oscillator may be used to clock the rest of the FPGA, the recovered parallel clock is available and can be used to drive all other logic blocks, so that transmitter and receiver nodes are clocked with exactly the same frequency. We can expand this method to achieve system-wide frequency syntonization following the hierarchy shown in Fig. 2, where each slave uses the recovered data clock to drive its timing logic, including the transmitters for its data links with lower nodes. In this way, every node in the system runs at the exact clock frequency of the local oscillator in the top node. The use of 8B/10B is mandatory in order to ensure that receiver PLLs stay locked at all times, as the loss of a received clock in a node causes the timing logic in that node, and in all other nodes that depend on it, to malfunction.
B. Coarse Synchronization
Each node runs a local timestamp counter. Even if all of them are clocked by exactly the same frequency, they may be started at different times and/or initial values and so they need to be synchronized. The standard method of timestamp synchronization across a data link is the Precision Time Protocol (PTP) [8]. The crucial step of said method is the transmission of the two synchronization frames shown in Fig. 4. A first frame is transmitted from master to slave, and local timestamp values at transmission and reception times, $T_{M1}$ and $T_{S1}$, are recorded. A response frame from the slave is transmitted some time later, again recording transmission and reception timestamp values $T_{S2}$ and $T_{M2}$. These four values can be used to compute the local timestamp offset between master and slave as follows: call $t_{MS}$ and $t_{SM}$ the time required for a frame to be transferred from master to slave and vice versa; these are unknown values. The total flight time for both frames can be computed as

$$t_{MS} + t_{SM} = (T_{M2} - T_{M1}) - (T_{S2} - T_{S1}) \qquad (1)$$

In the simplest case, we assume that $t_{MS} = t_{SM}$, so their value is obtained by halving the total flight time. Another possibility is that both values are different but we can measure the difference $\Delta t = t_{SM} - t_{MS}$ using a separate method. In any case, we can compute the values of $t_{MS}$ and $t_{SM}$ and then adjust the slave timestamp counter by adding

$$T_{M1} + t_{MS} - T_{S1} \qquad (2)$$

to it. After this operation, local timestamps on both ends are synchronized and remain so in the short term because their respective timestamp counters are locked to exactly the same clock frequency. Long term synchronization is maintained by periodic repetition of the above adjustment method.

Fig. 4. Standard PTP two-frame synchronization procedure. Local arrival and departure timestamps are recorded and later used to compute one-way travel time so the slave timestamp offset can be corrected.

In a typical application of this method, reception times $T_{S1}$ and $T_{M2}$ are assumed to be the value of the local timestamp counter at the time when the first word of the synchronization frame is read from the receiver block and captured by the PTP logic at its active clock edge. Likewise, transmission times $T_{M1}$ and $T_{S2}$ are the local timestamp values when the first word of the synchronization frame is written to the transmitter block, again at the same PTP logic's active clock edge. We may assume that the timestamp counters and the PTP logic in each node use the same clock with period $T_{clk}$. We can identify two steps in the synchronization method where rounding errors are introduced that limit overall time resolution to $T_{clk}$:
On one hand, the total flight time as computed in (1) is always an integer multiple of $T_{clk}$, because both terms in parentheses represent the time difference between specific active edges of the same clock. This is an obvious accuracy loss, since flight time depends on propagation time through external transmission lines and might take any intermediate value. The fundamental error is given by the measurement of reception times $T_{S1}$ and $T_{M2}$. The receiver is able to recover the parallel data clock from the received data stream and adjust its phase so that the parallel clock's active edge is either aligned with the reception of a new data word or misaligned with an easily measurable phase difference, which can be inserted into the correction term $\Delta t$. However, there is a clock domain change between the receiver parallel clock and the timestamp clock. These clocks have the same frequency but there may be a phase difference that introduces an error of up to $T_{clk}$ in the time measurements. It is possible to adjust the timestamp clock's phase in the slave so that it matches the receiver's parallel clock phase, eliminating the phase error from $T_{S1}$, but it is impossible to do the same in the master device. Hence $T_{M2}$ is rounded up to the nearest value that turns (1) into a multiple of $T_{clk}$. Fig. 5 shows an example timing diagram for both synchronization frames. Notice how each clock has a different phase. $T_{S1}$ has no error because the recovered clock is used to transmit data back to the master, but $T_{M2}$ is increased by the time difference between the receiver and transmitter clocks.
On the other hand, timestamp values are read from a simple counter with a period of $T_{clk}$. In particular, the correction factor given by (2) may only be applied with a resolution of $T_{clk}$, by changing the current value of the timestamp counter.
Restricting timestamp values to integer multiples of $T_{clk}$ neglects the fact that the phases of the local timestamp clocks on devices from different acquisition boards may be different. While this may have no effect on the synchronization procedure itself, it introduces timing errors when considering a third reference, such as the generation of gamma events whose arrival times to the acquisition boards have to be estimated with a finer resolution than $T_{clk}$.
The previous analysis suggests that several changes need to be applied to the synchronization scheme before we can guarantee resolutions well below $T_{clk}$. Local timestamp values need to have a fractional part that can be adjusted by the synchronization protocol, even if the local clock only updates its integer part. The correction term $\Delta t$ needs to be known with high accuracy. The local clock in a slave needs to be phase aligned so that its phase difference with the recovered parallel data clock is either null or a constant measurable value. Transmitter parallel clocks also need to be aligned to local clocks or have a known misalignment. Finally, the phase difference between the recovered reception clock and the local clock needs to be measured on the master side, in order to obtain an accurate value for $T_{M2}$.
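The two-frame offset computation of this subsection can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name and argument names are hypothetical, the relations used are the usual two-way time transfer ones (total flight time equals the master-side interval minus the slave-side interval), and `delta_t` stands for the separately measured asymmetry $t_{SM} - t_{MS}$.

```python
def ptp_correction(t_m1, t_s1, t_s2, t_m2, delta_t=0.0):
    """Return (t_ms, t_sm, correction) from the four recorded timestamps.

    correction is the value to add to the slave timestamp counter.
    All quantities share one time unit (e.g. clock periods).
    """
    total_flight = (t_m2 - t_m1) - (t_s2 - t_s1)  # t_ms + t_sm
    t_ms = (total_flight - delta_t) / 2           # master-to-slave flight time
    t_sm = (total_flight + delta_t) / 2           # slave-to-master flight time
    correction = t_m1 + t_ms - t_s1               # master time minus slave time at reception
    return t_ms, t_sm, correction


# Hypothetical scenario: slave counter lags the master by 37 ticks,
# each direction takes 5 ticks, and the slave replies 12 ticks after reception.
t_ms, t_sm, corr = ptp_correction(t_m1=100, t_s1=68, t_s2=80, t_m2=122)
print(t_ms, t_sm, corr)
```

With integer timestamps the correction is only as fine as one clock period, which is the limitation discussed above; a fractional `delta_t` is what allows a finer adjustment.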
C. Transceiver Latency Requirements
Consider the timing scheme for a full-duplex link between two digital boards in the system depicted in Fig. 5. Partial flight time values $t_{MS}$ and $t_{SM}$, representing latency values between corresponding synchronization blocks implemented in the FPGAs' programmable logic, are decomposed into three terms, namely transmitter and receiver latencies ($t_{TX}$ and $t_{RX}$) and propagation time through physical media external to the FPGA ($t_p$), plus an additional term that represents the phase error. Thus

$$\Delta t = t_{SM} - t_{MS} = (t_{TX}^{S} - t_{TX}^{M}) + (t_{p}^{SM} - t_{p}^{MS}) + (t_{RX}^{M} - t_{RX}^{S}) + \frac{T_{clk}}{2\pi}\,\Delta\varphi \qquad (3)$$

where superscripts M and S denote the master and slave devices and $\Delta\varphi$ is the phase difference, whose measurement method is explained in the next section. We may assume that propagation times in both half-links are approximately equal if circuit board design and connection cable choice are adequate. Hence, in order to maximize the accuracy in the measurement of $\Delta t$, it is necessary that the first and third terms in parentheses in (3) have as little error as possible, i.e. that transmitters and receivers have a very low latency uncertainty. Moreover, transceiver latency characterization should remain constant across power cycles so that no external calibration is required each time the system is turned on. Ideally, latency characterization should also remain constant across instances of the same FPGA device, so that a one-time calibration is not needed either. We remark that it is not mandatory for latencies to remain constant; they only need to be deterministic. For instance, $t_{RX}$ may include a variable number of bit intervals to account for a word alignment sub-block, as long as the total number of shifted bits is made available to the user logic inside the FPGA.
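The distinction between constant and deterministic latency can be made concrete with a small model. The sketch below is an assumption-laden illustration: the class, field names and all numeric values are hypothetical, the receiver latency is modelled as a fixed base plus one bit period per reported word-alignment shift, and the measured phase term is attributed to the slave-to-master direction.

```python
from dataclasses import dataclass


@dataclass
class HalfLink:
    """One direction of the full-duplex link (all values hypothetical)."""
    t_tx_ps: float       # transmitter latency
    t_p_ps: float        # propagation time through cable and board traces
    t_rx_base_ps: float  # receiver latency excluding word alignment
    rx_bit_shift: int    # bits shifted by the word aligner, reported to user logic


def one_way_latency_ps(link: HalfLink, ui_ps: float, phase_ps: float = 0.0) -> float:
    """Deterministic latency: fixed terms plus one unit interval per shifted bit."""
    return (link.t_tx_ps + link.t_p_ps + link.t_rx_base_ps
            + link.rx_bit_shift * ui_ps + phase_ps)


def asymmetry_ps(ms: HalfLink, sm: HalfLink, ui_ps: float, phase_ps: float) -> float:
    """Correction term t_SM - t_MS, with the phase term on the return path."""
    return one_way_latency_ps(sm, ui_ps, phase_ps) - one_way_latency_ps(ms, ui_ps)


# Matched hardware in both directions, differing only in alignment shifts.
ms = HalfLink(t_tx_ps=1000.0, t_p_ps=5000.0, t_rx_base_ps=2000.0, rx_bit_shift=3)
sm = HalfLink(t_tx_ps=1000.0, t_p_ps=5000.0, t_rx_base_ps=2000.0, rx_bit_shift=7)
print(asymmetry_ps(ms, sm, ui_ps=640.0, phase_ps=100.0))
```

In this model the latency changes from one power cycle to the next whenever the shift counts change, yet the asymmetry remains computable because the shifts are known.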
D. Phase Measurements for Fine Synchronization
Measurement of the phase difference between the local timestamp clock and the received data clock can be implemented with the Digital Dual-Mixer Time Difference (DDMTD) method as described in [9]. This is an all-digital version of the well-known DMTD technique [10], where measured clocks are mixed with a tone of similar frequency and filtered in order to obtain lower frequency signals with the same phase relationship. The main advantage of this method over other possibilities such as a Phase-Frequency Detector (PFD) is that it can be implemented entirely inside the FPGA without external analog components.
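Let $P_1(t)$ and $P_2(t)$ be the clock signals whose phase difference we want to obtain. These are assumed to be periodic signals with frequency $f = 1/T_{clk}$, so we can write

$$P_i(t) = \sum_{m=-\infty}^{\infty} p_i(t - m T_{clk}), \quad i = 1, 2 \qquad (4)$$

where $p_i(t)$ describes a single period of $P_i(t)$ and is zero outside the interval $[0, T_{clk})$. Sampling both signals with a DMTD clock of period $T_D$ yields the samples

$$P_i(n T_D) = \sum_{m=-\infty}^{\infty} p_i(n T_D - m T_{clk}) \qquad (5)$$

$$P_i(n T_D) = p_i(n T_D - m' T_{clk}) \qquad (6)$$

where $m'$ is the only integer such that the corresponding $p_i(\ldots)$ value is nonzero, i.e.

$$m' = \left[ \frac{n T_D}{T_{clk}} \right] \qquad (7)$$

where $[x]$ stands for the integer part of $x$, i.e. the largest integer that is less than or equal to $x$. Substitution of (7) into (5) yields

$$P_i(n T_D) = p_i\left( T_{clk} \left\{ \frac{n T_D}{T_{clk}} \right\} \right) \qquad (8)$$

where $\{x\} = x - [x]$ denotes the fractional part of $x$. Now assume that the DMTD clock period satisfies the relationship

$$T_D = \frac{N+1}{N}\, T_{clk} \qquad (9)$$

for an integer $N$. Then the n-th sample becomes

$$P_i(n T_D) = p_i\left( T_{clk} \left\{ \frac{n}{N} \right\} \right) = p_i\left( (n \bmod N)\, \frac{T_{clk}}{N} \right) \qquad (10)$$

Thus, we are effectively sampling one period of $P_i(t)$ that is stretched in time by a factor of $N$. In particular, the phase difference and thus the time difference between rising edges of $P_1$ and $P_2$ are amplified by $N$ and more easily measurable. Also, the set of samples has a period of $N$, assuming that the clock signals $P_1$ and $P_2$ remain stable.

Fig. 6 shows an implementation of the phase measurement algorithm inside the FPGA. The DMTD clock is synthesized directly from the more stable of the two clock signals under measurement by means of a PLL or DLL, so that it satisfies (9) exactly. Its phase relationship to the input clock signals is irrelevant. Both inputs are sampled and the system looks for edges in the sampled bit strings. Whenever edges are detected in sample numbers $n_1$ and $n_2$, respectively, we obtain a phase difference estimation given by

$$\Delta\varphi = \frac{2\pi}{N} \left( (n_2 - n_1) \bmod N \right) \qquad (11)$$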
Several estimations need to be obtained and averaged because there will be fluctuations between consecutive measurements due to clock jitter. A new estimation is obtained every $N + 1$ input clock cycles, assuming all active edges are detected. Phase difference is measured with a resolution of $2\pi/N$, so $N$ represents a trade-off between time resolution and measurement time. Moreover, it is advantageous to make $N$ a power of two, in order to simplify the implementation of the modulus operation in (11), which is needed every time an input clock edge is missed.
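The sampling principle can be checked with a small, jitter-free simulation. The following is a sketch under explicit assumptions: both clocks are ideal square waves with 50% duty cycle, the sampling period is taken as $(N+1)/N$ input clock periods (consistent with one estimation every $N + 1$ input cycles), delays are given as fractions of a clock period, and all function names are hypothetical. Exact rational arithmetic is used so that no rounding affects edge positions.

```python
from fractions import Fraction


def ddmtd_sample(n: int, N: int, delay: Fraction) -> int:
    """n-th sample of an ideal square clock delayed by `delay` periods."""
    t = Fraction(n * (N + 1), N) - delay      # sample instant in clock periods
    return 1 if (t % 1) < Fraction(1, 2) else 0  # high during the first half period


def rising_edge_index(N: int, delay: Fraction) -> int:
    """Index in [0, N) of the 0-to-1 transition in the sampled bit string."""
    for n in range(N):
        if ddmtd_sample(n - 1, N, delay) == 0 and ddmtd_sample(n, N, delay) == 1:
            return n
    raise RuntimeError("no rising edge found")


def phase_difference(N: int, delay1: Fraction, delay2: Fraction) -> Fraction:
    """Delay of clock 2 relative to clock 1, as a fraction of one period."""
    n1 = rising_edge_index(N, delay1)
    n2 = rising_edge_index(N, delay2)
    return Fraction((n2 - n1) % N, N)


N = 512
T_CLK_PS = 6400  # period of a 156.25 MHz clock, in picoseconds
estimate = phase_difference(N, Fraction(0), Fraction(3, 10))
print(float(estimate * T_CLK_PS), "ps estimated for a true delay of 1920 ps")
```

The estimate is quantized to steps of one $N$-th of a period; in real hardware, jitter makes the sampled edges glitch, which is why several estimations are averaged.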
IV. IMPLEMENTATION

A. FPGA and Transceivers
The main constraint for the selection of a suitable platform for the implementation of these data links is the need for transceivers with a very low latency uncertainty, below 1 UI (Unit Interval, i.e. serial bit period). We have found the most cost-effective choice to be the Xilinx Virtex-5 LXT FPGA family. Altera's roughly equivalent family, the Stratix II GX, features similar transceivers but it is not possible to configure them in a way that provides a deterministic latency figure, because several transceiver sub-blocks with unpredictable latency cannot be bypassed. One must resort to the newer, more expensive FPGA families to find fixed-latency configurations (although we have not tested whether they actually support our data link specification). There is also the possibility of using an external transceiver chipset with a parallel data interface to the FPGA; however, we have not found commercial transceivers with sufficiently small latency uncertainty. According to [11], the only chipset with this feature is Agilent's G-Link [12], but its production has been discontinued.
The Virtex-5 embedded transceivers, the GTPs, support up to 3.75 Gbps per half-link and are arranged into "tiles", groups of two GTPs (each including transmitter and receiver) sharing some clocking circuitry. In particular, they have a common reference clock input for their CRUs. Also, we are forced to use the serial transmit clock generated by an internal synthesizer from the reference input. This means that two different GTP tiles are needed for our slave implementation, because the recovered receiver clock has to be used as a reference clock input to the transmitter GTP in order to generate exactly the same line rate.
GTPs have to be configured in "RX phase alignment" and "TX phase alignment" modes, meaning that GTP-generated parallel clocks (the recovered clock and the internal transmit clock) are automatically phase-shifted so that clock domain crossing FIFOs can be bypassed and no phase error is introduced. The only latency uncertainty then stems from the word aligner block in the receiver, since every bit shift adds a delay of $T_{clk}/10$, i.e. one bit period. The transceiver does not provide information about the total bit shift count, hence we need to implement word alignment externally with user logic, so that we can access this value. The only remaining uncertainty, according to the device datasheet, comes from the bit sampling instant, up to 1 UI; however, we have experimentally found the actual delay variation between power cycles to be smaller.
B. ADCs and Delay Compensation
ADCs with serial outputs are very useful for trigger acquisition boards because the reduced number of digital traces simplifies board layout. This requirement, coupled with a high sampling rate in order to improve timing resolution, results in ADCs with gigabit-speed transmitters and the need for GTPs to recover the sampled data in the FPGA. Latency in this data link, including ADC and FPGA receiver, is effectively added to the trigger signal latency as perceived by the FPGA. Hence delay mismatches between different boards should be corrected.
We have chosen the AD9239 converter from Analog Devices [13], a quad-channel ADC that supports sampling frequencies up to 250 MHz with a high-speed serial output (up to 4 Gbps). We can take advantage of its multi-channel feature to implement the simple delay correction scheme outlined in Fig. 7. One of the ADC channels is used to sample a reference signal whose start can be triggered by FPGA logic. Receiver logic aligns all receiver channels so that the ADC-to-FPGA delay is identical for all of them; for example, by programming output test patterns in the ADC and selectively delaying input channels on the FPGA until the received patterns match exactly. The reference signal is then started and its start time is recovered from the read samples. A linear ramp signal is a good choice because it allows simple computation of the exact start time by fitting a line to recovered data. This way, the total delay $t_{FR} + t_{RA} + t_{AF}$ is measured. We can assume that $t_{FR}$ and $t_{RA}$ remain constant between identical circuit boards, so the $t_{AF}$ mismatch, the one affecting the sampled trigger signals, is corrected. We remark that this compensation scheme has no particular latency determinism requirements for the FPGA receivers.
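The ramp-based start time recovery can be sketched with an ordinary least-squares line fit. This is a minimal illustration, assuming an ideal noiseless ramp that rises from a zero baseline and that only samples lying on the ramp are passed in; the function name, sample times and slope are hypothetical.

```python
def ramp_start_time(times_ns, samples):
    """Fit samples = slope * t + intercept and return the zero-crossing time."""
    n = len(times_ns)
    if n < 2 or n != len(samples):
        raise ValueError("need at least two (time, sample) pairs")
    mean_t = sum(times_ns) / n
    mean_y = sum(samples) / n
    sxx = sum((t - mean_t) ** 2 for t in times_ns)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_ns, samples))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return -intercept / slope  # time at which the fitted line crosses the baseline


# Hypothetical ramp: starts at 10.3 ns with a slope of 2 ADC codes per ns,
# sampled every 4 ns (250 MHz sampling clock).
times = [12.0, 16.0, 20.0, 24.0]
codes = [2.0 * (t - 10.3) for t in times]
print(ramp_start_time(times, codes))
```

Because the start time comes from the fitted line rather than from a single sample, it is resolved much more finely than the 4 ns sampling period; with real data, noise and ADC nonlinearity would limit the achievable accuracy.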
V. PRELIMINARY MEASUREMENTS
Preliminary testing of the fine synchronization method was performed using two Xilinx ML505 evaluation boards, each containing an XC5VLX50T device. Local oscillators were tuned to a frequency of 156.25 MHz, and data links were set up with 8-bit words and 8B/10B physical coding, for a relatively slow physical line rate of 1.5625 Gbps (a net data rate of 1.25 Gbps). The evaluation boards allow direct interface with the transmitter and receiver pins in a single GTP block using SMA cables. However, we would need the receiver and transmitter to reside in different GTP tiles in order to implement the proposed clock replication scheme, as explained in Section IV. Hence we needed to use an additional connection to transmit the master clock frequency to the slave so it could be applied as a reference signal for the GTP.
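The link rates quoted above follow directly from the word clock and the 8B/10B coding ratio, as the short calculation below shows. The function name is hypothetical, and the only assumption is that one 8-bit word (10 coded bits) is transferred per word clock cycle.

```python
def link_rates(word_clock_hz: float, payload_bits: int = 8, coded_bits: int = 10):
    """Return (line_rate_bps, net_rate_bps, unit_interval_ps) for a coded link."""
    line_rate = word_clock_hz * coded_bits   # bits on the wire per second
    net_rate = word_clock_hz * payload_bits  # payload bits per second
    unit_interval_ps = 1e12 / line_rate      # serial bit period
    return line_rate, net_rate, unit_interval_ps


line, net, ui = link_rates(156.25e6)
print(f"line rate {line / 1e9} Gbps, net rate {net / 1e9} Gbps, UI {ui} ps")
```

The resulting unit interval of 640 ps is the delay contributed by each word-alignment bit shift in this configuration.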
DDMTD measurements were implemented on the master using a value of N = 512 by cascading two frequency synthesis elements: a DLL and a PLL with frequency multiplication factors of 16/19 and 32/27, respectively. This value enables a delay measurement step of 12.5 ps. A PLL was chosen as the last element because of its jitter filtering characteristics.
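The quoted values of N and of the measurement step can be verified from the two multiplication factors. The sketch below assumes that the cascaded factors give the DDMTD clock frequency as N/(N+1) times the input clock frequency and that the step equals the input clock period divided by N; the function name is hypothetical.

```python
from fractions import Fraction


def ddmtd_parameters(f_clk_hz: int, factors):
    """Return (N, step_ps) for a DDMTD clock built from cascaded multipliers."""
    ratio = Fraction(1)
    for f in factors:
        ratio *= f                       # overall frequency multiplication factor
    n = ratio.numerator
    if ratio != Fraction(n, n + 1):
        raise ValueError("overall factor is not of the form N/(N+1)")
    t_clk_ps = Fraction(10**12, f_clk_hz)  # input clock period in picoseconds
    return n, t_clk_ps / n


N, step_ps = ddmtd_parameters(156_250_000, [Fraction(16, 19), Fraction(32, 27)])
print(N, float(step_ps))
```

The product 16/19 x 32/27 reduces to 512/513, and a 6.4 ns clock period divided by 512 gives the 12.5 ps step.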
A large number of raw phase measurements between the master transmitter clock and the recovered receiver clock were taken and recorded. After that, the evaluation boards were turned off and on and new measurements were taken. The procedure was repeated 100 times. The resulting histogram is shown in Fig. 8, where ten different equally separated peaks can be observed. We checked that measurements from a single power cycle belonged to a single peak, and that classification into peaks was described exactly by the sum of the numbers of bits shifted on both master and slave receivers (modulo 10), confirming the latency determinism of the programmed transceiver configurations. The FWHM of the peaks gives a mean phase measurement resolution of 150 ps; we believe that clock jitter in the measurement logic is the main source of error.
VI. CONCLUSIONS
A method for the synchronization of high resolution timestamps on different digital processors at an arbitrary physical distance has been presented. No special hardware connections between acquisition boards are required: only a full-duplex data link supporting high data rates, which can be assumed to be present in the system anyway. However, the use of FPGAs with embedded high-speed transceivers that support specific low latency uncertainty configurations is mandatory.
Additionally, a DAQ architecture for a PET system has been proposed that takes advantage of the above synchronization method to allow an arbitrary arrangement of digital acquisition boards while maintaining a state-of-the-art timing resolution for coincidence detection. A compensation scheme for trigger delay mismatches caused by the introduction of high-speed serial ADCs has also been presented.
Preliminary measurements suggest that a synchronization resolution below 200 ps and 400 ps is possible for a small or large PET system, respectively. This resolution is even better than what is expected in many PET DAQ systems with a rigid structure and system-synchronous clocking. Unfortunately, we could not test whether the synchronization algorithm actually yields the exact time reference difference between the boards. To do this, we need to distribute common reference signals to both boards in a way that allows extraction of a reference time instant with sub-clock period accuracy. Complete validation of the method will be performed once prototype DAQ boards are available.
