Abstract: Single photon avalanche diode (SPAD) arrays have proven themselves as serious candidates for time of flight positron emission tomography (PET). Discrete SPAD readout schemes mitigate the low-noise requirements of analog schemes and offer very fine control over threshold levels and timing pickup strategies. On the other hand, a high optical fill factor is paramount to timing performance in such detectors, and consequently space is limited for closely integrated electronics. Nonetheless, a production PET scanner in daily use must minimize bandwidth usage, data volume, data analysis time and power consumption, and therefore requires a real-time readout and data processing architecture as close to the detector as possible. We propose a fully digital, embedded real-time readout architecture for SPAD-based detectors. The readout circuit is located directly under the SPAD array instead of within or beside it to overcome the tradeoff between fill factor and circuit capabilities. Since the overall real-time engine provides all the required data processing, the system needs only to send the data required by the PET coincidence engine, significantly reducing the bandwidth requirement. A 3D prototype device was implemented in 2 tiers of 130 nm CMOS from Global Foundries/Tezzaron, featuring individual readout for 6 scintillator channels. The timing readout is provided by a first photon discriminator and a 31 ps resolution time to digital converter, while energy readout and event packaging are done in real time using synchronous logic from a CMOS standard cell library, all fully embedded in the ASIC. The dedicated serial output line supports a sustained rate of 2.2 Mcps in PET acquisition mode, or 170 kcps in an oscilloscope mode for offline validation and development.
I. INTRODUCTION
TIME of flight is increasingly considered an essential path to improve contrast to noise ratio in reconstructed positron emission tomography (PET) images [1], [2]. To do so, the data acquisition system (DAQ) must provide very sharp timing resolution.
(e-mail: etienne.desaulniers.lamy@usherbrooke.ca; Alexandre.Boisvert@USherbrooke.ca; Frederik.Dubois@USherbrooke.ca; rejean.fontaine@usherbrooke.ca; Jean-Francois.Pratte@USherbrooke.ca).
C. Thibaudeau is with the Department of Radiobiology and Nuclear Medicine, Université de Sherbrooke, Sherbrooke, QC J1K 2R1, Canada (e-mail: Christian.Thibaudeau@USherbrooke.Ca).
Digital Object Identifier 10.1109/TNS.2015.2409783
Single photon avalanche diodes (SPADs) have received great attention as serious contenders to achieve this goal. Like photomultiplier tubes, they have excellent gain and timing resolution, but they are also more compact, immune to magnetic fields and require a much lower bias voltage. Various SPAD array detector architectures are under investigation, the most common being the silicon photomultiplier (SiPM). These have reached PET coincidence resolutions close to or just below 100 ps FWHM in experimental setups with [3] and LYSO [4] scintillators. However, SiPMs have a large output capacitance, which limits the achievable electronic noise for a given power budget and hence has a direct impact on the achievable timing resolution. Moreover, SPAD to SPAD signal fluctuation limits the achievable photon counting resolution at counts above photons. Finally, the single photon timing resolution of SiPMs is limited by the variation in signal propagation time as a function of the position within the SPAD array.
Another approach reads the SPADs in an array using digital electronics, with an array-wide first photon timestamp [5] or sub-group timestamps [6]. These introduce precise threshold control and allow very fine configuration of acquisition parameters. This flexibility comes with a tradeoff between photodetection fill factor and embedded circuitry complexity. Furthermore, moving the large amount of raw data off the chip consumes a significant amount of power.
Whichever solution is selected, both analog and digital approaches require real-time device and system-level integration to minimize overall power consumption and physical size in the context of massive multi-channel environments such as a full PET system.
We propose a fully digital, real-time data acquisition (DAQ) architecture for a vertically stacked SPAD-based PET detector. Unlike 2D electronic integration, vertical integration with through-silicon vias allows both a high fill factor SPAD array and a complex embedded system by placing the acquisition electronics under the photosensitive layer. The real-time embedded engine, placed on a dedicated layer, provides all the necessary modules for event detection and analysis, reducing the overall data throughput, power consumption and off-chip system complexity.
This paper begins with a short introduction to the overall detector specifications before focusing on the real-time DAQ. The detailed data flow is first presented, followed by the measurement methodology, and finally intrinsic experimental results from the key sub-components.
0018-9499 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. 
II. DETECTOR ARCHITECTURE
To reach sub-millimeter spatial resolution, the overall detector design uses mm LYSO arrays with 0.1 mm gap (Fig. 1, A), individually coupled to SPAD sub-arrays. The SPAD devices used for this project are tiled with a m pitch, leading to an array of per scintillator (Fig. 1, B) [7]. Individual quenching circuits with programmable hold-off and recharge time are tiled under each SPAD (Fig. 1, C). The DAQ is located on its dedicated layer (Fig. 1, D). The first detector iteration supports 6 independent PET acquisition channels, for a total of 2904 SPAD units, with a 200 MHz system clock. A printed circuit board (PCB) interfaces the 3D electronics through the bottom layer (Fig. 1, E).
III. REAL-TIME ACQUISITION MODULE

A. Requirements
As for all PET DAQ systems, the DAQ needs to extract timestamp, energy and location information from incoming PET events. More importantly for time of flight measurements, the DAQ must eliminate or minimize its contribution to the overall timing jitter, independently of scintillator, SPAD and quenching circuit contributions. For best PET singles detection performance, the DAQ must eliminate acquisition dead time as much as possible and quickly detect and discard dark counts that triggered false starts. Finally, to help mitigate non-linearities in the energy estimation [8] , the DAQ should support counting multiple triggers from the same SPAD during a PET event.
B. Overview
To reach these design goals, the digital architecture is separated into 3 major blocks (Fig. 2): the timing pickup block, the DAQ, and the utilities module. The 1:1 readout scheme requires a timing block and a DAQ block dedicated to each channel, but the final real-time post-acquisition processing unit is fast enough to be shared by all PET channels.
C. Timing block
The timing pickup block is split into an OR-gate trigger tree with individual SPADs connected to its inputs, a delay-line first photon discriminator and a dual Vernier ring TDC (Fig. 3) targeting 20 ps resolution. All modules are asynchronous, except for the TDC handshake interface with the DAQ module.
The OR-gate trigger tree was designed using 4 stages of custom wired-OR cells. Thanks to 3D vertical integration flexibility, the tree is laid out by having the next stage gate in the geometric center of its input sources. This keeps wire connections as uniform as possible throughout the tree, reducing inter-SPAD skew, and therefore reducing system-level timing variations.
To discard dark count false starts without affecting coincidence timing resolution, a delay line discriminator is used [9] . At medium and high dark count rates, this is done at the cost of single PET event sensitivity, but with negligible to no cost at low dark count rates. In this implementation, the discriminator splits the SPAD array in 4 sectors to create the threshold comparison.
The dual Vernier ring TDC uses a 5-stage oscillator configuration with a maximum conversion time of 100 ns, shorter than the energy integration time (see next section).
D. DAQ dataflow
The DAQ block is fully synchronous logic, implementing all the necessary modules to monitor channels, detect events, and record PET events that fulfill trigger conditions (Fig. 4) .
SPAD avalanche detection triggers first cross into the synchronous domain through two cascaded flip-flops that protect against metastability, and are then shaped into a strobe lasting a single clock cycle. The chain falls into self-reset for 1 clock period and then becomes immediately ready for the next trigger. Successive avalanches must occur at least 4 clock periods apart to be distinguished. The quenching circuit's combined hold-off and recharge time should thus be set to 4 clock periods or more. This way, a single SPAD cell can detect more than one photon during a PET event, mitigating SPAD array non-linearities in energy measurements. The SPAD hit map is then summed using a succession of adders [10] split into two pipelined stages to maintain the target operational frequency. The 484 SPAD signals are thus compressed to an ADC-like 9-bit integer sample stream, covering the array's full dynamic range. This stream is then continuously written to a dual-buffered circular memory, 9 bits wide and 64 words deep.
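The two-stage parallel sum described above can be illustrated with a small behavioural model. This is only a sketch, not the ASIC's netlist: the split of the 484 inputs into 22 groups of 22 and all class and function names are assumptions made for illustration, since the text states only that the adder tree is divided into two pipelined stages.

```python
# Behavioural sketch of a two-stage pipelined sum of a SPAD hit map.
# ASSUMPTION: 22 groups of 22 inputs; the real partitioning is not given.
N_SPAD = 484
GROUP = 22  # hypothetical group size


def stage1_partial_sums(hit_map):
    """First pipeline stage: sum each group of 1-bit hit strobes."""
    assert len(hit_map) == N_SPAD
    return [sum(hit_map[i:i + GROUP]) for i in range(0, N_SPAD, GROUP)]


def stage2_total(partials):
    """Second pipeline stage: add the partial sums into one sample."""
    total = sum(partials)
    assert total < 2 ** 9  # must fit the 9-bit sample word
    return total


class PipelinedSum:
    """One hit map in per clock, one sample out per clock, 2-cycle latency."""

    def __init__(self):
        self.reg1 = [0] * (N_SPAD // GROUP)  # register after stage 1
        self.reg2 = 0                        # register after stage 2

    def clock(self, hit_map):
        out = self.reg2                      # sample from two clocks ago
        self.reg2 = stage2_total(self.reg1)
        self.reg1 = stage1_partial_sums(hit_map)
        return out
```

Even with every SPAD firing in the same clock cycle, the total of 484 stays below 512, which is why a 9-bit word covers the full dynamic range of one channel.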
However efficient the timing block's dark count discriminator may be, the TDC will sometimes be started by dark counts or low energy PET events. A simple level threshold allows the DAQ to confirm the event, or to quickly abort the time conversion and reset the TDC circuit. Since the metastability protection and parallel sum introduce a 4 clock cycle delay with regard to the input triggers, the abort signal has a minimum latency of 4 clock cycles (20 ns) following the TDC's activation. The dead time caused by a false start therefore lasts 5 clock cycles (the 4-cycle latency plus the abort request), and most false starts are already rejected by the dark count discriminator scheme [9]. Once an event is accepted, the control logic waits until 56 samples are written to the circular memory. The four following memory slots are then used to store the TDC's output code, the system counter value and internal status flags, which causes the second case of acquisition dead time in the system, and the dual buffer is switched. The four remaining circular memory slots already hold samples from before the trigger, thus providing the baseline, for a total window of 60 samples or 300 ns. The waveform thus spans more than 6 times the LYSO's decay constant, more than sufficient to estimate the PET event's energy.
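The memory and timing figures in this paragraph follow from the 200 MHz clock and the 64-word buffer depth. The short calculation below only restates that arithmetic; the constant names are chosen here for illustration and do not come from the design.

```python
# Timing and memory budget of one acquisition window, from the figures above.
CLOCK_HZ = 200e6                 # system clock
T_CLK_NS = 1e9 / CLOCK_HZ        # 5 ns clock period

BUFFER_WORDS = 64                # depth of each circular memory
POST_TRIGGER_SAMPLES = 56        # samples written after event acceptance
METADATA_WORDS = 4               # TDC code, system counter, status flags

# Whatever is left in the buffer holds pre-trigger (baseline) samples.
BASELINE_SAMPLES = BUFFER_WORDS - POST_TRIGGER_SAMPLES - METADATA_WORDS

window_samples = POST_TRIGGER_SAMPLES + BASELINE_SAMPLES
window_ns = window_samples * T_CLK_NS      # length of the stored waveform
abort_latency_ns = 4 * T_CLK_NS            # minimum abort latency
```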
Event data from all channels are then funneled through the real-time parser which sums the total SPAD count. It then takes the final packaging decision: in regular event mode, only the fields in Table I are transmitted to minimize data size. In oscilloscope mode, the packager adds the collected samples with data padding for off-chip transmission.
E. Utilities module
The utilities module groups all other electronic systems: various counters, the configuration and communication handshake module, the clock tree, the LVDS receivers (clock, commands and rough timer synchronization) and transmitters (data and command reply).
The data communication channels use a synchronous start/stop bit protocol to forgo a "chip select" line and minimize the package pin count. To accommodate different data types with various data package sizes, the transmission engine uses a 16-bit data alignment and adds a 16-bit header containing data type and byte count information.
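As an illustration of 16-bit alignment with a 16-bit header, the sketch below pads a payload to a whole number of 16-bit words and prepends a header. The header layout used here (a 4-bit data type in the upper bits, a 12-bit byte count in the lower bits, big-endian order) and the function name `frame` are assumptions made for this example only; the text does not give the actual field widths or ordering, and the start/stop bits of the serial protocol are not modelled.

```python
# Sketch of 16-bit aligned framing with a 16-bit header.
# ASSUMPTION: header = 4-bit data type (upper bits) + 12-bit byte count,
# sent big-endian. The real field widths and ordering are not specified.
import struct


def frame(data_type: int, payload: bytes) -> bytes:
    """Pad payload to a multiple of 16 bits and prepend a 16-bit header."""
    if not 0 <= data_type < 16:
        raise ValueError("data_type must fit in 4 bits (assumed width)")
    padded = payload + b"\x00" * (len(payload) % 2)   # 16-bit alignment
    if len(padded) >= 1 << 12:
        raise ValueError("payload too long for the assumed 12-bit count")
    header = (data_type << 12) | len(padded)
    return struct.pack(">H", header) + padded
```

For example, a 3-byte payload of hypothetical type 1 is padded to 4 bytes and framed as 6 bytes in total.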
SPAD cells have individual on/off and hold-off time settings on their passive quench, active recharge circuit. Each PET channel has individual on/off and trigger level settings, as well as per-PET channel SPAD recharge time.
Finally, a low rate internal trigger allows the user to simultaneously activate every quenching circuit on the die stack for functional tests and debug procedures.
IV. MATERIALS AND METHODS
A. Materials
The ASIC (Fig. 1, C and D) was fabricated with the Tezzaron/Global Foundries 3D mixed-signal 130 nm CMOS process. The prototype run included the quenching circuit arrays on one tier, face-to-face bonded with the DAQ on another tier. This particular fabrication run has only the top tier (Fig. 1, C) thinned, so the circuit samples are wire-bonded to the PCB instead of soldered flip-chip (Fig. 5).
The test hardware includes a main PCB that houses an FPGA, external communications links, local memory, and power distribution. Die samples and all specific external test circuits are hosted on a daughterboard PCB (Fig. 5) . A low jitter LVDS oscillator crystal provides the clock signal for the digital logic and communication and also serves as the TDC stop signal.
Several tests require external TDC start triggers. To this end, the daughterboard has two LMK01010 clock buffer devices from Texas Instruments with programmable divider and 400 ps increment delay lines. One provides correlated triggers and is derived from the system clock. The other provides uncorrelated triggers derived from an independent oscillator crystal.
Finally, the quenching circuit inputs from one PET channel were all shorted together and bonded to the PCB. This simplified experimental setups with external trigger sources and debug tools.
B. Methods
DAQ: The first series of tests verifies architectural functionalities against HDL functional simulations. To confirm correct internal 3D interconnects, every quenching circuit is enabled in turn and pulsed by the low rate internal trigger.
The sum tree feature is verified by enabling every quenching circuit in an acquisition channel and pulsing them with the low rate internal trigger. The test is repeated with the correlated external trigger.
TDC: The TDC's DNL and INL were obtained with the statistical code density test [11] . Four hundred thousand triggers were harvested by using the external uncorrelated trigger and by activating a single quenching circuit in the DAQ channel.
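The statistical code density test can be sketched as follows. With triggers uncorrelated to the stop clock, hits are uniformly distributed in time, so each TDC code should collect a count proportional to its bin width. The function name and the plain-list histogram format are assumptions for this illustration, and it does not reproduce the authors' analysis scripts.

```python
def code_density(hist):
    """Return (dnl, inl) in LSB from a histogram of TDC code counts.

    hist[i] is the number of hits recorded in code i. Uniformly
    distributed hits are assumed, so an ideal TDC gives equal counts.
    """
    mean = sum(hist) / len(hist)          # count expected for an ideal bin
    dnl = [h / mean - 1.0 for h in hist]  # relative bin width error
    inl, acc = [], 0.0
    for d in dnl:                         # INL is the running sum of DNL
        acc += d
        inl.append(acc)
    return dnl, inl
```

In practice the histogram would be restricted to the codes actually exercised by the TDC before computing the mean.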
Trigger Tree Jitter and Skew: Two elements in the trigger tree contribute to timing measurement errors: the skew induced by varying propagation delays and the electronic jitter. The propagation delay for each SPAD input was obtained using the correlated LMK01010 source driving an impulse generator circuit wired to the shorted PET channel inputs. The internal TDC is used to measure the timing information. A histogram of 400k timestamps was built for each detection path, and the propagation delays were calculated using a weighted average of the harvested TDC timing bins. The simulated trigger tree propagation delays were also obtained from a digital static timing analysis with extracted parasitic components taken into account. The electronic jitter for each signal source was derived from the same measurements, using the spread of the bin distribution. These timing skew and jitter measurements from the trigger tree quantify the 3D ASIC's contribution to the system Single Photon Timing Resolution (SPTR).
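A minimal sketch of the weighted average over TDC bins is given below. It assumes ideal, equal-width bins of 31 ps (the resolution reported in the results) and uses the standard deviation of the bin distribution as the jitter estimate; both choices, as well as the function name and the dictionary histogram format, are assumptions made here for illustration.

```python
import math


def path_delay_and_jitter(hist, lsb_ps=31.0):
    """Weighted mean and standard deviation of a TDC histogram, in ps.

    hist maps TDC code -> count. ASSUMPTION: all bins are lsb_ps wide,
    which ignores the TDC's measured non-linearity.
    """
    total = sum(hist.values())
    mean_code = sum(code * n for code, n in hist.items()) / total
    var = sum(n * (code - mean_code) ** 2 for code, n in hist.items()) / total
    return mean_code * lsb_ps, math.sqrt(var) * lsb_ps
```

The skew between paths is then the spread of the per-path mean delays across the array.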
V. RESULTS
A. DAQ
The embedded system and DAQ run as designed at the targeted 200 MHz frequency, and thus support up to 2.2 Mcps in PET singles mode on the single LVDS transmission line. The oscilloscope mode, after including the data padding to match the 16-bit data alignment, sustains 170 kcps on the same link.
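These sustained rates translate into an upper bound on the on-wire size of one event. The calculation below assumes the 200 Mbps link rate quoted in the conclusion and that the link is the only bottleneck; the resulting budgets include all framing overhead (start/stop bits, header, padding) and are not the actual packet sizes, which the text does not state.

```python
LINK_BPS = 200e6  # serial link rate, as quoted in the conclusion


def bits_per_event(rate_cps, link_bps=LINK_BPS):
    """Upper bound on the on-wire size of one event at a sustained rate."""
    return link_bps / rate_cps


pet_bits = bits_per_event(2.2e6)     # about 91 bits per PET event
scope_bits = bits_per_event(170e3)   # about 1176 bits per oscilloscope event
```

The roughly thirteen-fold difference reflects the cost of shipping the full sample window in oscilloscope mode.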
B. TDC
The TDC's first design iteration achieves a 31 ps resolution, coarser than the 20 ps design target. The maximum DNL is 2.15 LSB (Fig. 6) and the maximum INL is 9.6 LSB (Fig. 7). This performance was verified on 4 different die samples.
C. Trigger tree skew and electronic jitter
Simulations predicted a propagation delay skew of 30 ps peak-to-peak (6 ps std. dev.) among the 484 paths in the trigger tree. Based on the TDC's measured DNL and INL (Fig. 6 and Fig. 7), measurements were taken between TDC codes 64 and 100 for best linearity by using the programmable coarse delay line of the LMK01010 on the test bench. The delay map is shown in Fig. 8, which shows a 90 ps peak-to-peak propagation delay skew and 14 non-responsive channels (hashed boxes). The same results are reorganized in Fig. 9 to show the delay distribution. In this figure, the non-responsive channels were removed, as were 6 outliers identified with the interquartile threshold method. The central portion of the distribution contains over 95% of the active paths and has a 40 ps peak-to-peak skew (13 ps std. dev.). The presence of two peaks, where simulation predicted one, is assumed to be caused by transistor mismatch in the trigger tree.
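An interquartile outlier rejection of this kind can be sketched as follows. The fence factor of 1.5 is the conventional Tukey value and is an assumption here, as are the function name and the quartile interpolation method (Python's default); the text does not state which variant was used.

```python
import statistics


def iqr_filter(delays_ps, k=1.5):
    """Split delays into (kept, outliers) using interquartile fences.

    ASSUMPTION: k = 1.5, the conventional factor; the text does not
    state the threshold that was applied.
    """
    q1, _, q3 = statistics.quantiles(delays_ps, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [d for d in delays_ps if low <= d <= high]
    outliers = [d for d in delays_ps if d < low or d > high]
    return kept, outliers
```

The peak-to-peak skew of the central distribution is then `max(kept) - min(kept)`.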
Electronic jitter on individual paths was measured. The reported jitter measurement includes every component in the chain: the external clock crystal, the PCB clock buffers, the ASIC clock tree, the external pulsing circuit, the quenching circuit, the trigger tree, the dark count discriminator and the TDC. Every histogram contained either one or two timing bins with significant data, meaning that the electronic jitter is below one bin or 31 ps.
VI. DISCUSSION
The architecture supports successive SPAD triggers from individual cells to mitigate non-linearities in energy measurement. Its effectiveness will ultimately be limited by the hold-off and recharge time of the quenching circuit set by the user, in accordance with the characteristics of the SPAD array placed over the readout.
Indeed, the optimal hold-off depends on the SPAD afterpulsing noise. The recharge time should be as low as possible for a given SPAD and is a function of its capacitance.
The measurements show that the electronic jitter of a complete SPAD readout chain is below 31 ps and the propagation delay skew is 40 ps peak-to-peak for over 95% of the channels. Using these two measurements, the total contribution of the 3D ASIC to SPTR excluding SPAD and scintillator contributions can be assessed. A conservative SPTR estimate can be made by adding the two terms, yielding a maximum of 71 ps peak-to-peak. A more realistic SPTR can be estimated from the quadratic sum of skew and electronic jitter standard deviations (13 ps and less than 31 ps, respectively), but keeping in mind that the skew distribution is not normal. This gives a total SPTR contribution of less than 32 ps std. dev.
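The two ways of combining skew and jitter can be written out explicitly. The sketch below takes the 31 ps jitter bound itself as a standard deviation, which is a pessimistic assumption made here; the function name is illustrative. With these rounded inputs the quadrature term evaluates to roughly 34 ps, so the figure quoted above presumably rests on unrounded values or a tighter jitter estimate.

```python
import math


def sptr_contribution(skew_pp_ps, jitter_max_ps, skew_std_ps):
    """Return (linear worst case, quadrature sum) in ps.

    ASSUMPTION: jitter_max_ps is used both as a peak-to-peak bound and
    as an upper bound on the jitter standard deviation.
    """
    worst_case = skew_pp_ps + jitter_max_ps
    quadrature = math.sqrt(skew_std_ps ** 2 + jitter_max_ps ** 2)
    return worst_case, quadrature


worst, quad = sptr_contribution(40.0, 31.0, 13.0)
```

The quadrature sum treats skew and jitter as independent contributions, which is reasonable here, although the caveat about the non-normal skew distribution still applies.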
Both estimates are below the expected 156 ps std. dev. coincidence timing resolution predicted by the simulation model used in [9]. Considering that the current SPAD realized by our group has a jitter of ps FWHM [7], the limiting factor of the overall coincidence timing resolution is therefore not the acquisition system but the scintillator. Should new types of scintillators be introduced, the architecture can still be improved. First, the TDC's resolution will be corrected to reach 20 ps or better, and its DNL and INL will be optimized. Second, the skew will be further reduced with matched length routing in the ASIC, at the cost of additional routing congestion, and by improving transistor matching of the logic gates used in the trigger tree path.
These skew optimizations, combined with the centroid-like trigger tree placement, would not be possible without 3D integration. Indeed, similar designs limited to 2D implementations have important skew because the trigger logic and routing must work around the SPAD grid layout [5] . At the same time, they must use as little real estate as possible to maintain a high fill factor.
Fill factor is already very good in fully passive SiPM devices, reaching over 75% [12], [13]. However, these devices have no programmable features, not even one as simple as shutting down a particularly noisy cell. 3D electronic stacking is the natural path towards reaching these levels of fill factor with intelligent circuits embedded with the SPAD detector array. Current challenges include the limited range of 3D CMOS processes offered in multi-project wafer (MPW) fabrication runs compared to 2D processes, and the unavailability of commercial MPW runs offering easy combination of heterogeneous 3D CMOS processes (e.g. HV CMOS, low-power, high speed CMOS process, low noise SPAD process). This freedom to mix technologies is expected to become key to reaching system-level SPTR below 10 ps FWHM.
Finally, in the context of clinical and pre-clinical systems, efforts should not only target excellent timing, but also aim to reduce power consumption as much as possible. As SPAD dark count rate is sensitive to temperature, it will be easier to regulate and maintain the temperature in systems with lower power dissipation. By using a single-TDC approach with a timing discriminator and a real-time DAQ, the architecture significantly reduces the data throughput requirements compared to multi-TDC approaches or combinations. This reduces the overall power budget simply because slower and/or fewer data transceivers will provide sufficient bandwidth in the overall system. This is particularly important with highly pixelated scanners supporting hundreds or thousands of acquisition channels.
VII. CONCLUSION
This paper presented a full-fledged DAQ for a SPAD-based PET detector. The architecture takes advantage of 3D microelectronics vertical integration to include several real-time features and timing optimizations. It supports multiple SPAD hit counting during a single PET event, mitigating non-linearity in energy measurements. The 3D stack allows room for strategic placement and routing of the timing trigger tree, reducing the impact of the tree's skew to the SPTR. The embedded TDC has a 31 ps resolution and the electronic jitter is less than 1 TDC bin. Finally, the low dead time, fully embedded signal analysis system can sustain up to 2.2 Mcps on a single 200 Mbps transmission link, making it ideal for high sensitivity, low power systems.
Progress on scintillators is still ongoing and research teams aim to push the 100 ps FWHM coincidence timing resolution barrier [3], [4] to 10 ps FWHM. This will enable direct high resolution PET image reconstruction and provide faster feedback to acquisition technicians, biologists and medical doctors. On the other hand, to reach that goal, every known element in the detection chain must still be improved upon, whether it is the electronics, the SPAD, the fill factor or the scintillator.
Although low-noise analog-SiPM based systems should reach timing performance similar to that of digital systems, digital architectures are expected to be more resilient to dark count noise [14], and will likely be the SPAD readout scheme of choice for the 10 ps target. Initially, multiple-TDC systems will be essential to characterize and understand new scintillation materials. In the case of full PET systems, it remains to be seen if multi-TDC devices will be required to maintain the best possible timing, or if simpler but potentially lower power architectures will provide equivalent or similar timing performance.
