Abstract-A DAQ architecture for a PET system is presented that focuses on modularity, scalability and reusability. The system defines two basic building blocks: data acquisitors and concentrators, which can be replicated in order to build a complete
I. INTRODUCTION
TIME coincidence resolution is one of the most important aspects of PET systems. Traditionally, finer resolutions have allowed a tightening of the coincidence window for event acceptance, yielding a better noise equivalent count rate (NEC) for cleaner reconstructed images. The current surge in popularity of time-of-flight (TOF) PET systems [1], [2], where time difference is used to estimate the radiotracer position along the line of response, imposes a much more stringent limitation on timing resolution. A coincidence resolution of 600 ps or better is estimated to be necessary for a modern TOF detector [3]. Coincidence is given by the difference between timestamps assigned to single gamma events on different acquisition boards in the system. Naturally, synchronization errors between timestamping boards need to be well below the coincidence resolution, since any discrepancy between reference times on acquisition nodes is directly reflected on the measured difference. Hence, an accurate system-wide synchronization scheme is mandatory.
The typical synchronization method for PET and other high energy physics (HEP) readout systems is the use of a clock tree, using zero-delay clock buffers and distributors to span the whole system. However, this approach has several drawbacks related to the use of cabling for clock distribution. All clock cables for a given tree stage have to be matched in length in order to obtain a fully balanced clock tree, which can be difficult when the number of acquisition nodes to be synchronized is large, forcing additional global timing calibration. Further problems arise when there is a large distance between system nodes [4], as fluctuating operating conditions such as temperature cause a variation of the delay of long cables or fibers. Hence, clock trees work best in a cableless environment, when all timestamping electronics are placed within the same crate or otherwise constrained, with clock distribution being implemented through controlled backplane connections. This condition severely limits the system scalability and the mobility of the detectors, often hardwiring the maximum number of supported detectors and forcing hardware redesigns for system expansions or whenever the detector topology is changed.
In a previous paper [5], we proposed a synchronization scheme over data links that was able to achieve state-of-the-art synchronization resolution in order to overcome these issues. The fact that each link is independently self-calibrated eliminates the necessity of cable length matching and allows automatic compensation of long cable delay variations. Freedom of placement for the individual boards is thus guaranteed. A modular, scalable DAQ architecture was also proposed based on this synchronization method. For this paper, a first working prototype of said system architecture has been built and evaluated. The system has been designed and tested with PET applications in mind, but it must be noted that the proposed architecture is valid for general HEP readout applications where several trigger levels are needed.
II. HARDWARE ARCHITECTURE
The proposed DAQ architecture is outlined in Fig. 1. The system is divided into a front-end section and a back-end section, and each of them is formed by an arbitrary number of identical modules with no constraints on physical location with respect to each other, i.e. the only placement restriction is given by the particle detectors at the front-end. The front-end section consists of acquisition modules, which contain the photodetectors and perform analog conditioning of detector signals, digitization, single event detection and position and timestamp extraction. The back-end section contains concentrator modules, which collect data from several acquisition modules and perform time coincidence detection. Data from several concentrators can themselves be collected by a higher-level, identical concentrator. All modules are connected forming a hierarchical tree, with the top node being responsible for the transmission of all aggregated data to the external processor that handles image reconstruction.
All connections between modules are purely digital, full duplex data links with embedded clock. Each link's ends are regarded as master or slave according to the global system hierarchy. The links serve three different purposes:
• Data transmission. Single event and coincidence data are transmitted upward, while configuration commands are sent downward.
• System frequency propagation. The slave recovers the clock frequency from the data link and uses it for its own downlink transmissions, as well as its digitizing circuitry in the case of acquisition modules. Thus, the whole system is syntonized with the master oscillator, which is located in the top concentrator module.
• Synchronization. Each link is capable of synchronizing the time reference for both nodes independently of all other module connections.
This architecture is indefinitely scalable and admits an arbitrary number of detectors, as long as the data link bandwidth is enough to support the transmission of all coincidence data at the top level. Since all modules are identical copies of one of two different designs, all hardware is fully reusable in the case of system expansion (increase in the number of detector modules) or any other topology change.
A. Acquisition Module
Figure 2 depicts a simplified diagram of the contents of a single acquisition module. Each module contains one gamma sensor, consisting of a scintillating crystal coupled to a photodetector unit, which can be either a PSPMT or an array of SiPM devices.
Photodetector outputs are sent to AMIC [6], an integrated analog front-end that converts 64 detector signals into 8 analog outputs, each of which is a weighted sum of the 64 inputs with digitally programmable coefficients. AMIC can thus be used as a replacement for a resistor network used as a charge division circuit for Anger logic [7], benefiting from a higher bandwidth due to integrated preamplifiers, and the capability for automatic correction of photodetector channel gain mismatch by fine adjustment of weighted sum coefficients [8]. Additionally, it can be used to obtain the first few statistical moments of the light distribution in a continuous scintillator [9], from which event position can be extracted. In particular, the second moment contains information on depth of interaction within the crystal [10]. The newest version of the ASIC, AMIC2GR [11], is compatible with both PMT and SiPM-based detectors, so they can be used interchangeably by defining a common connector. Additionally, the AMIC architecture is fully expandable and allows the readout of detectors with more than 64 outputs by using several instances of AMIC and adding their corresponding current outputs together [12].
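The weighted-sum principle described above can be illustrated with a short numerical sketch. It is only a software model under stated assumptions: the 8 x 8 pixel grid, the 6 mm pitch, the coefficient names and the helper functions are all hypothetical and do not describe the actual AMIC register map or coefficient encoding.

```python
# Software model of programmable weighted sums over 64 detector channels.
# Grid size, pitch and all names are illustrative assumptions.

def coefficient_sets(rows=8, cols=8, pitch_mm=6.0):
    """Build coefficient vectors for the zeroth, first and second moments."""
    xs = [(c - (cols - 1) / 2) * pitch_mm for r in range(rows) for c in range(cols)]
    ys = [(r - (rows - 1) / 2) * pitch_mm for r in range(rows) for c in range(cols)]
    return {
        "sum": [1.0] * len(xs),
        "x": xs,
        "y": ys,
        "xx": [x * x for x in xs],
        "yy": [y * y for y in ys],
    }

def gain_corrected(coeffs, gains):
    """Fold a per-channel gain correction into one coefficient vector."""
    return [c / g for c, g in zip(coeffs, gains)]

def weighted_sum(signals, coeffs):
    """One analog output: weighted sum of all detector signals."""
    return sum(s * c for s, c in zip(signals, coeffs))

def position_and_spread(signals, sets):
    """Centre of gravity and second central moments of the light distribution."""
    energy = weighted_sum(signals, sets["sum"])
    x = weighted_sum(signals, sets["x"]) / energy
    y = weighted_sum(signals, sets["y"]) / energy
    var_x = weighted_sum(signals, sets["xx"]) / energy - x * x
    var_y = weighted_sum(signals, sets["yy"]) / energy - y * y
    return x, y, var_x, var_y
```

In this model the first moments give the centre-of-gravity position, while the second central moments measure the width of the light distribution, which is the quantity the text relates to depth of interaction.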
The resulting analog signals undergo a shaping and anti-aliasing filtering stage using a second-order filter before being digitized by free-running ADCs. A digitally controlled offset voltage is added to each channel in order to push the signal baseline as close as possible to the edge of the ADC input range, so as to maximize the dynamic range of detectable pulses. AD9239 12-bit converters from Analog Devices [13] are used with a 156.25 MSPS sampling rate. These are quad-channel ADCs with serial outputs, which help reduce the number of board components and digital traces, simplifying board layout and reducing signal integrity issues. A total of 10 ADC channels are used: eight for AMIC outputs, one for an additional fast trigger output from the detector (used for the last dynode signal from PSPMTs), and one for ADC delay calibration.
ADC outputs are read by a Stratix IV EP4SGX110 FPGA [14]. The high sampling rate forces the use of embedded gigabit-speed transceivers for the serial signals. A channel alignment procedure is necessary after ADC frames are received and decoded, because the latency of the transceiver and frame decoding logic for each channel may be different. To do so, the ADCs are programmed to force their outputs to show a transition between two fixed values; data channels inside the FPGA are then selectively delayed in order to have the transitions occur at the same clock cycle.
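The alignment step can be sketched in software as follows. This is a simplified model, not the FPGA logic: the function names are hypothetical, each channel is assumed to contain exactly one transition of the test pattern, and delayed channels are padded with their first sample.

```python
# Illustrative model of per-channel alignment on a forced test transition.
# All names are assumptions made for this sketch.

def transition_index(samples):
    """Index of the first sample that differs from the initial value."""
    return next(i for i, s in enumerate(samples) if s != samples[0])

def alignment_delays(channels):
    """Delay (in clock cycles) to apply to each channel so that all
    transitions line up with the latest one."""
    edges = [transition_index(ch) for ch in channels]
    latest = max(edges)
    return [latest - e for e in edges]

def apply_delay(samples, delay):
    """Delay a channel by `delay` cycles, keeping the record length."""
    return [samples[0]] * delay + samples[:len(samples) - delay]
```

Aligning every channel to the slowest one means that only delays, never advances, are required, which is the natural choice for a register-based delay line.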
Single event detection and time and position extraction are performed on the sampled signals in a purely digital manner. Each ADC channel can be used in either amplitude or charge mode, i.e. computations can be performed on the raw samples or on their integrated value since the start of the current event, as indicated by the trigger signal crossing a threshold value. The maximum value (i.e. pulse amplitude or charge) is recorded for each position channel. For timing, the Digital Constant Fraction Discriminator (DCFD) method [15] is applied to the trigger signal t[n]: the bipolar signal b[n] defined in (1) is generated for some amplitude A and time shift k, and its zero-crossing point is computed by linear interpolation on the clock interval where b[n] changes its sign. By using either amplitude or charge signals, we obtain digital implementations of the CFD [16] and ARC (Amplitude and Rise Compensated) [17] methods. All data from a detected event is collected into a 160-bit frame containing a timestamp with 1.6 ps resolution and sent upstream to a concentrator module.
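The zero-crossing interpolation can be sketched as follows. The expression of the bipolar signal is not reproduced in this text, so the sketch assumes the textbook constant-fraction form b[n] = t[n-k] - A*t[n]; this form, the default parameter values and the synthetic pulse generator are assumptions made for illustration only.

```python
# Sketch of digital constant-fraction timing with linear interpolation.
# The bipolar signal b[n] = t[n-k] - A*t[n] is an ASSUMED textbook form.

def dcfd_time(t, A=0.5, k=4):
    """Fractional sample index of the first negative-to-positive zero
    crossing of the assumed bipolar signal, or None if there is none."""
    b = [t[n - k] - A * t[n] for n in range(k, len(t))]
    for i in range(1, len(b)):
        if b[i - 1] < 0 <= b[i]:
            # Linear interpolation inside the clock interval of the sign change.
            frac = b[i - 1] / (b[i - 1] - b[i])
            return (i - 1 + k) + frac
    return None

def ramp_pulse(start, length=40, rise=10, peak=10.0):
    """Synthetic pulse: flat baseline, linear rise, then flat top."""
    return [min(max(n - start, 0), rise) * peak / rise for n in range(length)]
```

For a pulse with a fixed shape, the computed time does not change when the pulse is scaled in amplitude and shifts by exactly the same amount as the pulse itself, which is the property that makes constant-fraction timing attractive.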

B. ADC Delay Compensation
Trigger signal delay from the digitization point to the timestamping algorithm block may be different for each acquisition module and hence must be taken into account in order to avoid timing errors for coincidence detection. Moreover, this delay may contain non-deterministic components such as ADC delay and FPGA transceiver latency, unless a specific deterministic latency mode is selected for the transceiver. In order to compensate for this effect, an analog linear ramp generator with digitally controlled charge and discharge signals is implemented on board and sampled by an ADC channel, with the goal of estimating the delay.
The FPGA continuously computes the linear regression coefficients corresponding to the last M samples from the ramp signal y[n]. This can be accomplished in an efficient way by iteratively computing the first two moments of the sample interval and using them to obtain the instantaneous linear coefficients $a_0$ and $a_1$.
Each delay estimation is obtained as follows: after fully discharging the ramp circuit, the charge signal is issued and the current signal baseline value B is stored. Ramp start is detected by coefficient $a_1$ crossing a certain threshold, and the number T of elapsed clock cycles is stored.
At this point, the logic waits for M cycles until a linear fit $y[n] \approx a_1 n + a_0$ of the ramp waveform is obtained. The measured delay D is then computed as the time when the fitted ramp takes value B, using the charge signal trigger as the time origin. This measurement is repeated continuously, and the moving average of the last 256 delay measurements is used as the valid delay estimation. This delay value is subtracted from the timestamp for all detected single events.
We remark that the delay value D contains not just the considered delay from ADC to timestamping logic, but also the propagation delay from the FPGA's delay estimation logic to the analog ramp circuit and to the ADC input; however, these additional components can be considered to be equal in identical acquisition modules, so they are canceled when computing timestamp differences.
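The following sketch illustrates how a sliding linear fit can be maintained with two running sums and then used to locate the ramp start with sub-sample precision. The expressions used by the actual firmware are not reproduced in this text, so this is an ordinary least-squares fit written under assumptions: the class and function names are hypothetical, the fit is taken over the last M samples of an offline record, and the threshold detection and moving average steps are omitted.

```python
from collections import deque

class SlidingLinearFit:
    """Least-squares line over the last M samples, using window-local
    abscissas 0..M-1 and two running sums updated once per sample."""

    def __init__(self, M):
        self.M = M
        self.window = deque()
        self.s0 = 0.0                           # sum of y[i]
        self.s1 = 0.0                           # sum of i * y[i]
        self.sx = M * (M - 1) / 2               # sum of i
        self.sxx = (M - 1) * M * (2 * M - 1) / 6  # sum of i^2

    def push(self, y):
        if len(self.window) < self.M:
            # Filling phase: the new sample takes the next free index.
            self.s1 += len(self.window) * y
            self.s0 += y
            self.window.append(y)
        else:
            # Sliding phase: drop the oldest sample, shift all indices by one.
            oldest = self.window.popleft()
            self.window.append(y)
            self.s0 += y - oldest
            self.s1 += self.M * y - self.s0

    def coefficients(self):
        """Return (a0, a1); a0 is the fitted value at the oldest sample."""
        M = self.M
        a1 = (M * self.s1 - self.sx * self.s0) / (M * self.sxx - self.sx ** 2)
        a0 = (self.s0 - a1 * self.sx) / M
        return a0, a1

def estimate_ramp_delay(samples, M, baseline):
    """Time (in samples from the start of the record) at which the line
    fitted to the last M samples takes the baseline value."""
    fit = SlidingLinearFit(M)
    for y in samples:
        fit.push(y)
    a0, a1 = fit.coefficients()
    first_index = len(samples) - M
    return first_index + (baseline - a0) / a1
```

Because the crossing point is obtained from the fitted line rather than from a single threshold crossing, the estimate is not limited to an integer number of clock cycles.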
C. Concentrator Module
The main purpose of concentrator modules is to detect coincident gamma events and to keep their child nodes (acquisitors and lower level concentrators) synchronized. A simplified scheme is shown in Fig. 3. Each concentrator has a number of links to lower level nodes where single event frames are received in chronological order and stored in FIFOs. The coincidence detection engine continually compares the timestamps of the first available event from each FIFO, and checks whether the timestamp difference between the two oldest visible events is within the selected time coincidence window. If so, both events are registered as a coincidence; if not, the oldest one is discarded as a random event.
The top concentrator module implements a Gigabit Ethernet connection for communication with external processors. This is used for the transmission of detected coincidences and for configuration and control commands. The FPGA in each module contains a system on chip (SoC) with an embedded Nios II processor which handles these commands and relays them to lower level modules whenever appropriate.
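The coincidence search described above can be modeled offline as follows. This is a minimal sketch, not the hardware engine: the function name is hypothetical, all events are assumed to be already queued, timestamps are plain numbers in arbitrary units, and events left over when fewer than two FIFOs hold data are simply not processed.

```python
from collections import deque

def detect_coincidences(fifos, window):
    """fifos: list of deques of timestamps, each in chronological order.
    Returns (coincidences, discarded), where every event is reported as a
    (fifo_index, timestamp) pair."""
    coincidences, discarded = [], []
    while sum(1 for f in fifos if f) >= 2:
        # Oldest visible event of every non-empty FIFO, sorted by time.
        heads = sorted((f[0], i) for i, f in enumerate(fifos) if f)
        (t0, i0), (t1, i1) = heads[0], heads[1]
        if t1 - t0 <= window:
            coincidences.append(((i0, t0), (i1, t1)))
            fifos[i0].popleft()
            fifos[i1].popleft()
        else:
            discarded.append((i0, t0))
            fifos[i0].popleft()
    return coincidences, discarded
```

Since each FIFO is already sorted, only the head of each queue has to be inspected, so the comparison cost per decision depends on the number of links and not on the number of stored events.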
III. SYNCHRONIZATION OVER DATA LINKS
Synchronization of all acquisition modules with sub-nanosecond resolution directly over data links is one of the key components behind the proposed architecture. In our previous paper [5], the general theory behind this method was presented, as well as implementation details for Xilinx FPGAs. Here, we summarize the most important aspects and explain the differences in implementation when using FPGAs from Altera.
A. Frequency Propagation
Gigabit rate data transmission between FPGAs is usually implemented using embedded high-speed transceivers and self-synchronous signaling, where the transmitter data clock is embedded in the data signal. The clock is recovered at the receiver using a PLL-based Clock Recovery Unit (CRU) and then used to sample and decode the incoming data stream. The transceiver is seeded by an external clock which is used both for transmission and as a reference for the CRU.
Our goal is to use the exact frequency of the recovered clock for the local logic at the slave node as well as for data transmission in the opposite direction (slave to master). However, the transmission clock is the same as the reference clock which is needed to recover the desired clock frequency in the first place; hence, a special clocking circuit is required if we want to use the same transceiver for both half-links.
The circuit from Fig. 4 is implemented in all system modules, using a National Semiconductor LMK02000 PLL and clock distributor [18] and an external VCXO with a nominal 156.25 MHz frequency. At the top node, the PLL's switch between the charge pump and the loop filter is kept open (tristate output), so that the VCXO control input stays at a constant bias value and the circuit works as a regular oscillator and clock distributor, feeding the transceiver's reference clock input. At a slave node, the PLL is initially configured in that same way; however, once the recovered clock is stable, the PLL loop is closed and the VCXO output eventually converges to a jitter-filtered copy of the recovered clock. The current functioning of the transceiver, including clock recovery, is not affected during the PLL transient phase because its reference clock suffers only very small variations while maintaining its nominal value. After PLL convergence, the half-link from slave to master is established, and the filtered recovered clock is used for local logic and for sampling in the case of an acquisition module.
B. Timestamp Synchronization
After all links are established, the main clocks in every module have exactly the same frequency $f_{clk}$ but different, fixed phases. Moreover, the phase relationships are different each time the system is reset. Hence, an additional step is required in order to correct this mismatch. Instead of phase shifting the clocks, we act directly on the timestamp counters: a fractional part is added to the local timestamp, so as to account for the phase differences. Every module runs its own timestamp counter, even concentrator modules. A timestamp counter updates the integer part as usual, but the fractional part is managed only by the synchronization algorithm. By adding the fractional part to the event timestamp values in the acquisition modules, the effect of varying ADC sampling clock phases is compensated when computing event time differences.
Fig. 4. Clock recovery circuit using an external PLL with VCXO. The PLL loop is closed after the recovered clock from the transceiver becomes available.
Synchronization of timestamps is done on a link by link basis, following the same hierarchy as frequency propagation. In each master-slave data link, the master timestamp's fractional part remains fixed and the slave's is updated. The top module's timestamp is taken as a reference and its fractional part is fixed at zero.
Timestamp synchronization over a single data link is based on the standard two-frame method, as used by the IEEE 1588 Precision Time Protocol (PTP) [19] among others. Two synchronization frames are sent, one from master to slave and later one from slave to master, with respective flight times $t_{MS}$ and $t_{SM}$, and their local departure and arrival timestamps are recorded, as given by Fig. 5. Even if the timestamp counters from both nodes are not yet synchronized, the difference of timestamps from the same node is still a valid measure of a local time interval. Hence, the exact round-trip time is given by
$$t_{MS} + t_{SM} = (T_{M2} - T_{M1}) - (T_{S2} - T_{S1}). \qquad (5)$$
By measuring the latency $t_{MS}$ (or $t_{SM}$), the correction offset for the slave timestamp counter can be obtained. Usual synchronization methods assume that $t_{MS} \approx t_{SM}$ and estimate their value directly from (5), but the resulting error is, at least, on the order of $T_{clk}/2$ [5].
Fig. 5. Standard two-frame synchronization procedure. Local arrival and departure timestamps are recorded and then used to compute the round-trip time.
The algorithm can be refined by taking the skew between both half-links into account. Let us split each half-link's data path latency into components, as shown in Fig. 6. The full data path between the synchronization protocol's digital logic in each node has to be considered, because that is where the timestamps from (5) are assigned. The latency components are $t_{TX}$ for the transmitter, $t_p$ for external transmission lines (board traces and cables) and $t_{RX}$ for the receiver. An additional term $\Delta\phi$ is needed to account for the phase change between the receiver clock domain (recovered clock) and the local clock domain at the receiver node, used for protocol logic and transmission. We obtain
$$t_{MS} - t_{SM} = (t_{TX,M} - t_{TX,S}) + (t_{RX,S} - t_{RX,M}) + (t_{p,MS} - t_{p,SM}) + \Delta\phi_S - \Delta\phi_M. \qquad (6)$$
Knowing the values of (5) and (6), we can compute half-link latencies and correct the slave timestamp.
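As a worked illustration, the sum (5) and the difference (6) form a linear system that yields both half-link latencies and the slave correction. The function and argument names below are hypothetical, times are in arbitrary but consistent units, and the skew is assumed to be already known from the terms of (6).

```python
def slave_correction(t_m1, t_s1, t_s2, t_m2, skew=0.0):
    """Two-frame synchronization with a known half-link skew.

    t_m1: master departure, t_s1: slave arrival (first frame)
    t_s2: slave departure,  t_m2: master arrival (second frame)
    skew: t_MS - t_SM as given by (6); 0.0 reproduces the usual
          symmetric-link assumption.
    Returns (t_ms, t_sm, correction), where `correction` is the value to
    add to the slave timestamp counter.
    """
    round_trip = (t_m2 - t_m1) - (t_s2 - t_s1)   # equation (5)
    t_ms = (round_trip + skew) / 2
    t_sm = round_trip - t_ms
    # When the first frame arrives, the master time is t_m1 + t_ms.
    correction = (t_m1 + t_ms) - t_s1
    return t_ms, t_sm, correction
```

If the skew is ignored, the correction is wrong by exactly half of its value, which is why the accuracy of (6) governs the final synchronization resolution.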
C. Measurement of Link Skew
Since (5) is always exact, synchronization resolution is governed by the accuracy of (6). Some considerations have to be made regarding the calculation of its addends. First of all, the term $t_{p,MS} - t_{p,SM}$ can be minimized by adequate board design and connection cable choice, e.g. by using composite cabling. Next, the requirement that the differences in $t_{TX}$ and $t_{RX}$ be known exactly implies that we have to use transceivers that can be configured in specific deterministic latency modes, i.e. where these latencies are fixed after each reset and their value, or at least their difference, can be obtained at runtime. This forces the use of specific FPGA families from the main vendors: for Xilinx, a Virtex-5 LXT or better is needed; for Altera, an Arria II/Stratix IV GX or better. Our modules include Stratix IV GX devices, whose transceiver latency in deterministic mode for 8B/10B-coded links satisfies
$$t_{TX} = \text{constant}, \qquad t_{RX} = \text{constant} + n \cdot T_{clk}/10, \qquad (7)$$
where the value of $n$ may vary across resets but is available to the user logic. The constant terms in (7) are constant across transceiver resets. If they are also constant across different FPGA instances, then they get canceled in (6). If not, then a one-time calibration is needed to compensate for them.
Finally, the phase differences $\Delta\phi$ are measured with the Digital Dual-Mixer Time Difference (DDMTD) method, which can be implemented entirely inside of the FPGA. Using embedded PLLs, a clock is synthesized with frequency
$$f_D = \frac{N}{N+1} f_{clk} \qquad (8)$$
for large $N$, and then used to sample the local and recovered clocks. It can be shown that the resulting bitstreams are equivalent to sampling the clock periods with an $N \cdot f_{clk}$ frequency. Fig. 7 illustrates the method with example waveforms for $N = 6$. The phase difference can be estimated by looking for active clock edges in both sampled bitstreams and counting the number of samples between them. These estimations are subject to errors due to clock jitter and metastability at the sampling registers, yielding a very poor phase resolution as well as incorrect measures at the opposite phase value. A clustering algorithm with two clusters is implemented to filter the phase estimation values, which yields a much better resolution.
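The DDMTD principle can be illustrated with an idealized simulation. The sketch below assumes jitter-free square clocks with 50% duty cycle and unit period, omits the clustering filter entirely, and uses hypothetical function names; the sign convention of the reported phase (delay of the recovered clock with respect to the local one) is also an assumption.

```python
def sample_clock(N, num_samples, delay=0.0):
    """Bitstream obtained by sampling a unit-period square clock, delayed
    by `delay` periods, with a clock of frequency N/(N+1) times its own."""
    bits = []
    for m in range(num_samples):
        # Sample m is taken at time m*(N+1)/N, i.e. at phase (m mod N)/N.
        phase = ((m % N) / N - delay) % 1.0
        bits.append(1 if phase < 0.5 else 0)
    return bits

def first_rising_edge(bits, start=1):
    """Index of the first 0-to-1 transition at or after `start`."""
    for m in range(start, len(bits)):
        if bits[m - 1] == 0 and bits[m] == 1:
            return m
    raise ValueError("no rising edge found")

def ddmtd_phase(N, delay):
    """Estimated delay of the recovered clock, in clock periods."""
    local = sample_clock(N, 3 * N)
    recovered = sample_clock(N, 3 * N, delay)
    e_local = first_rising_edge(local)
    e_recovered = first_rising_edge(recovered, e_local)
    return ((e_recovered - e_local) % N) / N
```

Each sample advances the sampling instant by 1/N of a clock period with respect to the sampled clock, so the count between edges resolves the phase in steps of $T_{clk}/N$ in this ideal model.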
The main contribution to synchronization inaccuracy comes from the resolution of $\Delta\phi$ using DDMTD. One key difference between the proposed architecture using Stratix IV FPGAs and the tests presented in [5] using Virtex-5 devices is that the Virtex transceivers can be configured in a special mode that allows the local and recovered clocks to be phase aligned on one of the link nodes (but not on both, because that would lead to an infinite loop in clock dependency), guaranteeing $\Delta\phi_S = 0$ by design and reducing the number of error contributing terms in (6). This is not possible with Altera devices, so a worse synchronization resolution is to be expected.
IV. SETUP DESCRIPTION
Prototype circuit boards were implemented in order to evaluate the performance of the proposed DAQ architecture for a small PET system with two detectors. The family of boards used in the tests is pictured in Fig. 8. Each acquisition module prototype is formed by two boards: an acquisition board with 9 analog inputs (8 general ones and 1 for a fast trigger signal), and an analog front-end board with AMIC devices. Two different front-end boards were designed, with one and four AMICs, that can be used with photodetectors with 64 and 256 output channels, respectively. The acquisition board has two inter-module links, so it can be used as a small concentrator module with two downlinks, as well as a mixed acquisition/concentration module. A total of three acquisition boards were built for testing.
The test setup used for evaluation consisted of two acquisition modules, one of them working as a concentrator as well and acting as the master in the module hierarchy. A continuous slab of 10 mm deep scintillating crystal covered by black epoxy was placed in each module, with a 49 × 49 mm² area coupled to a photodetector using optical grease. Two different types of photodetector unit were evaluated:
• A Hamamatsu H8500 position sensitive PMT. This detector has 64 outputs and its effective area matches that of the crystal. Parallelepiped LSO crystals were used with the PSPMTs.
• An array of 16 × 16 Hamamatsu S10362-11-50P MPPCs. These SiPM devices have an active area of 1 × 1 mm² but were soldered on a rectangular grid with 3.00 mm × 3.05 mm separation, so the effective scintillation area is only 10% of the total. A pyramidal frustum LYSO crystal was used in this case.
A ²²Na point gamma-ray source was placed between both detectors, at 5 mm and 690 mm distance, respectively. A PMT detector was placed on the far side, used primarily for electronic collimation: coincidences where the event on the far detector fell outside of the center region were filtered away. On the close side, PMT and SiPM detectors were tested. The close detector was mounted on a translation table in order to have the gamma source impinge on different, known positions on the crystal. The test setup is shown in Fig. 9.
Fig. 9. Detector setup for coincidence measurements.
V. RESULTS
A. ADC Delay Compensation
The ADC delay compensation method was evaluated first. This illustrates both the necessity and the effectiveness of the method.
B. Module Synchronization
The data link synchronization method was tested next. DDMTD phase measurements were implemented with N = 512, using the same parameters as in [5]. The timestamp adjustment procedure between modules was repeated every 200 ms, using 48-bit timestamps with 12 fractional bits. The synchronization algorithm was shown to converge in a single iteration, and then show a timestamp variation around $\sigma = 83$ ps over periods of one minute. Monitoring of the slave timestamp counter's fractional part over longer periods showed a slow drift, presumably due to temperature variation, until stabilization of the mean value is finally reached.
In order to test synchronization resolution, a common pulse source was distributed to two different acquisition channels, where the pulses were detected and timestamped; the timestamp difference should ideally be zero in all cases. Random event pulses from the close detector were used as the source. For each setup, measurements were repeated interchanging the cable connections to both channels in order to eliminate the fixed time bias from possible cable length mismatch. The DCFD method was used for pulse timestamping, using A = 1 and k = 1 in (1) and working in amplitude mode.
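The timestamp format can be made concrete with a short sketch. It assumes that the integer part of the timestamp counts cycles of the nominal 156.25 MHz system clock, which is consistent with the 1.6 ps timestamp resolution quoted for the event frames; the constant and function names are hypothetical.

```python
# Fixed-point timestamp model: 48 bits in total, 12 of them fractional.
# The 156.25 MHz counting clock is an assumption (nominal system clock).

F_CLK_HZ = 156.25e6
TOTAL_BITS = 48
FRAC_BITS = 12

# Value of one least significant bit, in picoseconds.
LSB_PS = 1e12 / F_CLK_HZ / (1 << FRAC_BITS)

def make_timestamp(cycles, fraction):
    """Pack an integer cycle count and a 12-bit fraction into 48 bits."""
    return ((cycles << FRAC_BITS) | fraction) & ((1 << TOTAL_BITS) - 1)

def difference_ps(a, b):
    """Signed difference a - b in picoseconds, tolerant to counter rollover."""
    span = 1 << TOTAL_BITS
    d = (a - b) % span
    if d >= span // 2:
        d -= span
    return d * LSB_PS
```

Under these assumptions one LSB corresponds to 6.4 ns / 4096 = 1.5625 ps, and the 36-bit integer part rolls over after roughly 440 s, so time differences have to be computed modulo the counter span.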
First of all, the time difference distribution was measured for two channels on the same acquisition board, in order to have an estimation of the impact of the timing algorithm on these measurements, and a resolution $\sigma_{1b} = 126$ ps was obtained. Using two channels on different acquisition boards yielded a resolution $\sigma_{2b} = 150$ ps. This value includes the variation due to the timing algorithm and the ADC delay compensation methods on both boards; we can obtain an estimate of synchronization resolution as
$$\sigma_{sync} \approx \sqrt{\sigma_{2b}^2 - \sigma_{1b}^2 - 2\sigma_{ramp}^2}. \qquad (9)$$
This formula yields $\sigma_{sync} = 78$ ps, or a FWHM resolution of 183 ps assuming a Gaussian distribution. Notice that this measure is very similar to what can be obtained simply by monitoring timestamp changes in the synchronization algorithm.
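The arithmetic behind (9) can be checked numerically, reading it as a quadrature subtraction of the timing algorithm contribution and of the ramp-based delay compensation contribution on both boards. The value of the delay compensation term is not stated in this section, so the sketch only solves (9) backwards for it; the result is an inference from rounded figures, not a reported measurement, and the function names are hypothetical.

```python
import math

def sync_resolution(sigma_2b, sigma_1b, sigma_ramp):
    """Equation (9): remove the timing algorithm term and the delay
    compensation term of both boards in quadrature."""
    return math.sqrt(sigma_2b ** 2 - sigma_1b ** 2 - 2 * sigma_ramp ** 2)

def implied_sigma_ramp(sigma_2b, sigma_1b, sigma_sync):
    """Solve (9) for the delay compensation term (inferred, not reported)."""
    return math.sqrt((sigma_2b ** 2 - sigma_1b ** 2 - sigma_sync ** 2) / 2)

def fwhm(sigma):
    """FWHM of a Gaussian distribution with standard deviation sigma."""
    return 2 * math.sqrt(2 * math.log(2)) * sigma

sigma_ramp = implied_sigma_ramp(150.0, 126.0, 78.0)   # about 16.4 ps
fwhm_sync = fwhm(78.0)                                # about 183.7 ps
```

With the rounded values quoted in the text, the implied delay compensation term is about 16 ps per board, and the Gaussian FWHM factor of about 2.355 agrees with the quoted 183 ps figure.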
Unfortunately, the time difference distribution was not centered around 0. The mean value $\mu$ was always fixed for a given board setup, but it depended on the choice of two boards used (out of 3 available) and also on the choice of inter-module link in each board (each board has two available connections). Different values of $\mu$ from 0 to 700 ps were measured. The reason for this variation is currently under investigation; one possible explanation would be that the constant transceiver latency components in (7) were not, in fact, constant across different Stratix IV GX devices or across different transceivers within the same device. This would imply the need for a one-time calibration of the system for the measurement of $\mu$.
C. Photodetector Comparison
For each detector type, a large number of coincidences were captured with the gamma source at different positions with respect to the close detector. Five positions were used, forming an X shape centered on the center of the crystal, with a 4 mm × 4 mm separation between them. AMICs were programmed to emulate an ideal 2D charge division circuit with 4 outputs for Center of Gravity (CoG) positioning and to generate a trigger signal proportional to the sum of all detector outputs. The AMIC coefficients were not calibrated, i.e. detector channel gain spread was not compensated for. DCFD in amplitude mode with A = 1 and k = 1 was used on the specified trigger signal for event timestamping. A wide time coincidence window of ±100 ns was used in order to observe the random coincidence background. Events where any ADC channel reached its full scale value were considered saturated and filtered away.
For position resolution, coincident events were energy filtered around the photopeaks, and electronic collimation was applied by using only events that were detected less than 15 mm away from the FOV center on the far side. An appropriate time coincidence window was applied so as to remove random coincidences. Figure 10 shows the 2D histogram of detected event position for both detectors. For PMT, the five positions are clearly separated, and a spatial resolution of 2.7 mm and 2.6 mm in different axes is obtained at the center. For SiPM, the image is noisier and the points are blurred but still distinguishable; the measured resolutions at the center are 4.4 mm and 3.9 mm.
Energy and time resolution were measured only at the center point. For each event, energy was estimated as the maximum detected amplitude value of the sampled trigger signal. Energy resolution was measured by histogramming energy measures and fitting a Gaussian curve around the photopeak. FWHM resolutions of 27% for PMT and 31% for SiPM were obtained. Similarly, time coincidence resolution was measured by histogramming timestamp differences and fitting a Gaussian curve around the peak. The result is shown in Fig. 11. The constant random background is clearly visible, and FWHM resolutions of 3.9 ns for PMT and 10.4 ns for SiPM are obtained.
VI. DISCUSSION AND CONCLUSION
A system architecture for PET has been proposed that is based on synchronization over data links. The architecture is fully modular and scalable and the same circuit boards accept both PMT and SiPM based photodetectors. Prototype boards have been designed and the architecture has been successfully tested for a small-scale PET system with two detectors. The design and validation of a full-scale coincidence detection system using several concentrator modules remains pending.
The synchronization method has been evaluated in a realistic setting, and its resolution has been shown to be within TOF PET specifications. However, the results appear to be inferior to those obtained in [5] using Virtex FPGAs. One reason is proposed for this: the lack of a phase alignment mode in Stratix transceivers that would allow the elimination of one $\Delta\phi$ term in (6). This conclusion is tentative, however, as the testing conditions for both implementations were not identical. Additionally, systematic errors are reported in timestamp synchronization whose cause is currently under investigation but might be related to the transceivers themselves.
The system has been shown to work with both PMT and SiPM based photodetectors; in particular, the ability to work with arrays of 256 SiPMs has been demonstrated. The performance of both detectors can be compared, but the conditions were not the same in this case either: the scintillator area coverage for the SiPM detector was only 10% of that of the PMT, so worse results are to be expected. Using only the simplest configuration for the front-end and digital algorithms (AMIC as a CoG network; lack of gain calibration; event detection by fixed threshold crossing; basic, fixed DCFD for timing), we have obtained decent results for all measured specifications, with the exception of PMT-PMT energy resolution. We expect to obtain much better resolutions just by optimizing these digital algorithm parameters.
