Many industrial and scientific applications, ranging from 3D imaging (such as LIDAR, surveillance, object tracking) to Time Correlated Single Photon Counting (TCSPC), require the ability to perform time-resolved detection of weak light signals, down to the single-photon level. Single-Photon Avalanche Diodes (SPADs) are solid-state detectors capable of single-photon sensitivity in the visible and near-infrared wavelength regions and compatible with standard silicon processes. This demand has driven the development of low-cost, large-size CMOS SPAD imagers. In this work we present the design and characterization of a time-gated 32x32 SPAD image sensor fabricated in a 0.16 μm BCD (Bipolar-CMOS-DMOS) technology. The sensor is based on an innovative structure of 16x16 macropixels, each composed of four SPADs with independent sensing front-end and event counters, plus a shared Time-to-Digital Converter (TDC). This approach enables a higher fill-factor (9.6% with a pixel pitch of 100 μm) by sharing the TDC, which is costly in terms of area, and also reduces power dissipation. The imager provides simultaneous photon-timing and photon-counting data and features a 12 bit, 75 ps bin width TDC, which can perform one conversion per gate window, with up to 62 windows per data readout frame. Two main operation modes are available: a single-photon mode, where an arbitration circuit within the macropixel shares the TDC among the 4 SPADs with no loss of X-Y resolution (i.e. keeping information about the triggered SPAD); and a two-photon-coincidence mode, where the TDC performs a conversion only if two SPADs of the macropixel are triggered within a preset coincidence window. Lastly, the sensor features multiple readout modes, with varying amounts of output data, in order to fit different end-user applications. The imager is capable of 100 kfps (frames per second) in full readout mode, and up to 400 kfps in a reduced data set mode.
INTRODUCTION
Many scientific and commercial applications require the ability to detect faint and fast light signals in the visible and near-infrared spectra, exploiting pulsed lasers to obtain the distance of targets (e.g., in LIDAR) or to reconstruct high-bandwidth optical signals through Time Correlated Single Photon Counting (TCSPC), as in Fluorescence Lifetime Imaging Microscopy (FLIM) and functional Near-Infrared Spectroscopy (fNIRS). Various CMOS SPAD arrays have been presented in the literature (1) (2) (3) (4), with resolutions spanning from a few tens of pixels up to QVGA and more, and capable of either photon counting or time-resolved operation. However, most time-resolved sensors suffer from very low fill-factor and cannot perform simultaneous counting and timing operations.
IMAGER STRUCTURE
2.1 Features
The imager developed in this work is fabricated in a 0.16 µm BCD technology, which combines SPADs having state-of-the-art photodetection efficiency (5), thick oxide front-end transistors for operation up to 5 V excess bias, and high-performance 1.8 V digital logic for core digital processing. The array is organized in 16×16 macropixels, each composed of 2×2 SPADs having independent time-gated front-end circuitry and photon counters, but sharing a single, two-stage Time-to-Digital Converter (TDC). Configurable logic allows each SPAD to be operated independently, sharing the TDC with no loss of spatial resolution, or the four detectors to be combined in a two-photon coincidence mode, intended for rejection of unwanted uncorrelated photons (e.g., ambient background in LIDAR applications). The macropixel employs square SPADs with 32 µm side and 5 µm rounded corners (corresponding to ~1000 µm²), with a 100 µm pitch, yielding a 9.6% fill-factor and a 3.2×3.2 mm² imaging area.
Macropixel structure
The building block of this array is the macropixel, a structure (schematically shown in Fig. 1) that connects four SPADs to a single TDC, thus reducing the area occupation and the power consumption of the in-pixel timing electronics. The SPAD front-end is a Variable Load Quenching Circuit (6) with active detector disable capability, used to perform temporal gating of incoming light, and a low avalanche-current sensing threshold, to improve timing response. A 5 bit saturating counter is provided for each detector, for independent photon-counting. Each macropixel includes four independent memory elements, which store one TDC conversion per SPAD, thus preventing loss of spatial resolution: the discriminator circuitry identifies, within each gate window, the first SPAD that fires and configures the TDC to save the conversion result into the corresponding memory cell. Alternatively, a fast readout operation mode (intended for low photon flux applications) is also possible, where only the first conversion in each frame is stored, and an additional register is used to memorize which SPAD(s) fired; this quadruples the frame rate compared to full operation. When the imager is configured in two-photon coincidence mode, the four SPADs in the macropixel are operated as a single entity, and the discriminator logic reconfigures itself to trigger a TDC conversion only if at least two SPADs fire within a predefined coincidence window (~1.5 ns). This operation mode is intended for LIDAR measurements, as it can provide a degree of rejection of background (random) light.
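The two discriminator behaviours described above can be illustrated with a short behavioural model. This is only a sketch, not the on-chip logic: the function names are hypothetical, detection times are given as plain numbers within one gate window, and the choice of returning the earlier photon of a coincident pair is an assumption, since the text does not state which event the TDC converts in coincidence mode.

```python
from typing import Optional, Sequence, Tuple

# Approximate coincidence window quoted in the text (~1.5 ns).
COINCIDENCE_WINDOW_NS = 1.5


def single_photon_mode(
    times_ns: Sequence[Optional[float]],
) -> Optional[Tuple[int, float]]:
    """Return (spad_index, time) of the first SPAD firing in a gate window.

    times_ns holds one entry per SPAD of the macropixel; None means that
    the SPAD did not fire in this gate window.
    """
    fired = [(t, i) for i, t in enumerate(times_ns) if t is not None]
    if not fired:
        return None
    t, i = min(fired)  # earliest event wins the shared TDC
    return i, t


def coincidence_mode(
    times_ns: Sequence[Optional[float]],
    window_ns: float = COINCIDENCE_WINDOW_NS,
) -> Optional[float]:
    """Return a timestamp only if two SPADs fire within window_ns.

    Assumption: the earlier event of the first coincident pair is the one
    that gets converted.
    """
    fired = sorted(t for t in times_ns if t is not None)
    for first, second in zip(fired, fired[1:]):
        if second - first <= window_ns:
            return first
    return None
```

For example, with detections at 12.0 ns and 3.5 ns on two of the four SPADs, the single-photon model selects the SPAD that fired at 3.5 ns, while the coincidence model reports no conversion because the two events are 8.5 ns apart.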
Timing architecture
The timing architecture is schematically shown in Fig. 2; it is based on a global "START" interpolator, placed in the chip periphery, and 256 in-pixel "STOP" interpolators plus coarse clock counters, providing a 12 bit output word (7 bit counter plus 5 bit interpolation) and implementing the cyclic sliding scale technique to improve conversion linearity. The interpolators employ 16 clock phases derived from a master reference and, unlike common implementations of this architecture, use both rising and falling edges of each phase. This solution subdivides the master clock period into 32 TDC bins while distributing just half the signals; this yields a significant power saving, but demands a strict 50% duty-cycle of the distributed clocks, thus requiring careful design of the distribution network. This requirement is satisfied using a novel architecture employing a Delay Locked Loop (DLL) followed by tunable interpolators and edge combiner circuitry. The DLL splits the external (420 MHz nominal) reference clock into 16 equidistant clocks (Δt = 150 ps), which are further interpolated to obtain 32 intermediate phases, spaced by 75 ps each. Tunable buffers are inserted at this stage, which allow the output clocks to be adjusted in order to minimize phase-to-phase mismatch and significantly improve overall conversion linearity. These intermediate signals are then combined to finally obtain the 16 interpolator clocks: phases 0-15 generate the rising edges of the clocks, while phases 16-31 generate the falling edges. Overall, the TDC provides 75 ps resolution with a 307 ns Full Scale Range (FSR).
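The timing figures quoted above follow from simple arithmetic on the reference clock, which the sketch below reproduces. The constant and function names are illustrative, and the packing of the output word with the coarse counter in the upper bits is an assumption. With exactly 420 MHz the bin width comes out at about 74.4 ps, which the text rounds to 75 ps; the 307 ns FSR corresponds to 4096 codes of 75 ps.

```python
# Values taken from the text; names are illustrative only.
F_REF_HZ = 420e6   # nominal reference clock
N_PHASES = 16      # clock phases distributed to the interpolators
COARSE_BITS = 7    # in-pixel coarse counter
FINE_BITS = 5      # interpolator output


def clock_period_ps(f_ref_hz: float = F_REF_HZ) -> float:
    """Period of the reference clock in picoseconds."""
    return 1e12 / f_ref_hz


def bins_per_period(n_phases: int = N_PHASES) -> int:
    """Both edges of each phase are used, so the bin count doubles."""
    return 2 * n_phases


def bin_width_ps(f_ref_hz: float = F_REF_HZ, n_phases: int = N_PHASES) -> float:
    """TDC bin width (LSB) in picoseconds."""
    return clock_period_ps(f_ref_hz) / bins_per_period(n_phases)


def full_scale_range_ns(lsb_ps: float) -> float:
    """Full scale range of the 12 bit word for a given bin width."""
    return (2 ** (COARSE_BITS + FINE_BITS)) * lsb_ps / 1000.0


def tdc_code(coarse: int, fine: int) -> int:
    """Pack coarse and fine fields into one word (assumed bit layout)."""
    if not 0 <= coarse < 2 ** COARSE_BITS or not 0 <= fine < 2 ** FINE_BITS:
        raise ValueError("field out of range")
    return (coarse << FINE_BITS) | fine


if __name__ == "__main__":
    print(f"clock period: {clock_period_ps():.1f} ps")
    print(f"bin width:    {bin_width_ps():.1f} ps")
    print(f"FSR (75 ps):  {full_scale_range_ns(75.0):.1f} ns")
```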
The TDC used in this work can convert the arrival time of at most one photon per gate window. To allow each macropixel to perform more than one conversion in each data readout frame (up to the previously mentioned maximum of four conversions), the timing circuit must be able to operate with multiple gate windows in each frame. To this end, a 64-entry memory bank has been added in the chip periphery to store up to 64 START conversions (one per gate window), and a 6 bit in-pixel 'gate counter' is used to identify the gate number in which each STOP conversion is performed, so that each STOP conversion can be associated with the correct START value. In this way, the imager can handle up to 64 gate windows per data readout frame. This effectively allows operation at a higher laser repetition rate, obtaining the same results as in (7) without limiting the effective FSR or adding any constraint on the laser source.
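The association between STOP and START conversions described above amounts to a table lookup indexed by the gate counter. The sketch below illustrates it under explicit assumptions: both conversions are treated as codes on the same 12 bit scale, and the photon arrival time is taken as STOP minus START modulo the full scale. The actual on-chip or off-chip arithmetic is not specified in the text, and all names are hypothetical.

```python
from typing import Sequence

N_GATES = 64          # entries in the peripheral START memory bank
N_CODES = 2 ** 12     # 12 bit TDC word


def photon_time_code(
    start_bank: Sequence[int], stop_code: int, gate_index: int
) -> int:
    """Combine a STOP conversion with the START value of the same gate.

    start_bank: START codes, one per gate window of the frame.
    stop_code:  STOP code read out from a macropixel.
    gate_index: value of the 6 bit in-pixel gate counter for that STOP.
    Assumption: the result is (STOP - START) modulo the TDC full scale.
    """
    if len(start_bank) > N_GATES:
        raise ValueError("START bank holds at most 64 entries")
    if not 0 <= gate_index < len(start_bank):
        raise ValueError("gate index outside the START bank")
    return (stop_code - start_bank[gate_index]) % N_CODES
```

As a usage example, a STOP code of 612 recorded in gate 3, whose START code is 100, gives a time difference of 512 codes.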
Ancillary electronics
The readout of the array employs a non-standard architecture, designed to minimize the number of logic gates and signals operating at full readout clock speed, while minimizing global signals and being easily scalable to larger imager sizes. In particular, only a peripheral row multiplexer operates at full clock frequency (100 MHz nominal), while the in-pixel bus drivers operate at 1/16th of that frequency; each row has 15 clock cycles of pre-charge time, which relaxes the in-pixel driver requirements and allows the drivers to be made much smaller, even for larger arrays. The readout of each macropixel, depending on the operation mode, requires 1, 2 or 4 clock cycles, thus providing frame rates as high as 400 kfps and a throughput of 2.3 Gbps.
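The quoted frame rates can be checked from the readout clock and the number of clock cycles per macropixel. The sketch below is an idealized estimate that assumes no per-frame or per-row overhead; it gives about 97.7 kfps with 4 cycles per macropixel and about 390.6 kfps with 1 cycle, consistent with the rounded 100 kfps and 400 kfps figures. Dividing the stated throughput by the clock frequency also gives the number of bits transferred per clock cycle. Names are illustrative.

```python
READOUT_CLOCK_HZ = 100e6   # nominal readout clock
N_MACROPIXELS = 16 * 16    # macropixels read out per frame
THROUGHPUT_BPS = 2.3e9     # throughput quoted in the text


def frame_rate_fps(cycles_per_macropixel: int) -> float:
    """Ideal frame rate, ignoring any readout overhead (assumption)."""
    if cycles_per_macropixel not in (1, 2, 4):
        raise ValueError("the text mentions 1, 2 or 4 cycles per macropixel")
    return READOUT_CLOCK_HZ / (N_MACROPIXELS * cycles_per_macropixel)


def bits_per_clock_cycle() -> float:
    """Bits transferred per readout clock cycle at the quoted throughput."""
    return THROUGHPUT_BPS / READOUT_CLOCK_HZ


if __name__ == "__main__":
    for cycles in (4, 2, 1):
        print(f"{cycles} cycle(s): {frame_rate_fps(cycles) / 1e3:.1f} kfps")
    print(f"bits per clock cycle: {bits_per_clock_cycle():.0f}")
```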
SPAD detectors
Two different imagers, with identical electronics but different SPAD cross-sections, have been fabricated. They differ in the position of the multiplication region with respect to the silicon surface, and will be referred to as "shallow" and "deep" SPADs, following the naming used in (5).
EXPERIMENTAL CHARACTERIZATION
3.1 Testing setup
The testing system consisted of a general purpose FPGA board, previously developed in-house, based on a Xilinx Artix 7 FPGA, 2 Gb DDR2 RAM and a USB 2.0 interface, connected to a daughter board through high-speed connectors. The daughter board hosts the imager die (directly wire-bonded to the PCB), a local oscillator for the TDC reference clock and two 50 Ω SMA connectors for the GATE and START input signals. The SPAD array is placed inside a mounting flange that provides a standard SM1 thread, as commonly used in optical setups. A third board provides all the power supplies required by the system, starting from a 12 V DC input and using linear regulators with DC-DC pre-regulators to ensure high efficiency and low-noise power rails.
SPAD efficiency, noise and uniformity
The Photon Detection Efficiency (PDE) was measured with a calibrated setup consisting of a wide-spectrum Quartz Tungsten Halogen (QTH) lamp attenuated with neutral density filters, then wavelength-filtered with a monochromator, whose output is monitored with a reference photodiode. The efficiencies are reported in Fig. 3 in the wavelength range from 400 nm to 1200 nm. The measured efficiencies match very closely the ones reported in (5), showing very good reproducibility of the detection performance between different production runs. The reduced efficiency of some outliers may be due to debris partially covering the SPAD area. In particular, the deep SPAD structure shows remarkable PDE, peaking at over 60% when operated at 4.5 V excess bias, a result that exceeds most planar SPADs, both in CMOS and even in custom technologies. The Dark Count Rates (DCR) of the two imagers at normal operating temperature are reported in the cumulative graph of Fig. 4; the median DCR is 590 cps for the shallow SPAD and 720 cps for the deep SPAD. Even the noisier SPADs show DCR levels acceptable for many applications, especially in the presence of background ambient light. Optical crosstalk was computed among various combinations of pixels; results show a crosstalk probability of 140 ppm between nearest neighbors (100 µm center-to-center distance), dropping to 12 ppm for first neighbors along the diagonal, and down to 6 ppm for second orthogonal neighbors (200 µm center-to-center distance); crosstalk between pixels further apart was too low to be reliably measured. The low optical crosstalk is due in part to the Deep Trench Isolation (DTI) around each SPAD, to the relatively large distance between detectors, and to the prompt avalanche quenching action enforced by the front-end circuitry, which reduces the number of secondary photons generated during each avalanche.
SPAD gating
Temporal gating of the detectors is performed by each SPAD's front-end circuitry, by applying or removing the excess bias voltage to the detectors themselves. Although this method does not result in the fastest possible transition time, it makes the SPAD completely insensitive to photons arriving before the gate signal, unlike gating implementations which use simple pass transistors or logic gates on the SPAD output. This is useful when measuring faint signals which are preceded by strong disturbances. In order to test the effectiveness of the detector temporal gating, the array was illuminated with an 850 nm, 40 ps FWHM (Full-Width at Half-Maximum) gain-switched pulsed diode laser; attenuation was placed on the optical path to avoid hold-off and pile-up distortion in the recorded data. The measurement reported in Fig. 5 was carried out by moving the temporal position of the laser pulse with respect to the rising edge of the gate signal, and evaluating the recorded photon counts at each measurement step. The gate-on rising edge, measured as the 20%-80% transition, ranges from 420 ps to 640 ps; a knee in the transition can be seen for some detectors, and is due to a combination of temporal skew between the activation of SPADs placed at different positions across the array, and voltage sag of the cathode rail, due to the high peak current when all SPADs are enabled simultaneously. The ringing visible in Fig. 5 is likely due to the long wire bonds used in the test setup, and can be reduced by a more optimized PCB design.
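The 20%-80% transition time mentioned above can be extracted from a delay scan by interpolating the photon counts recorded at each laser delay. The following is a minimal sketch of one way to do this, not the procedure actually used for Fig. 5: the function names are hypothetical, the gate-off and gate-on plateau levels are assumed to be supplied by the user (for instance from averages taken well before and well after the edge), and linear interpolation between scan points is assumed.

```python
from typing import Sequence


def crossing_time(
    times_ps: Sequence[float], counts: Sequence[float], level: float
) -> float:
    """First time at which the rising curve crosses `level`.

    Linear interpolation is used between adjacent scan points.
    """
    points = list(zip(times_ps, counts))
    for (t0, c0), (t1, c1) in zip(points, points[1:]):
        if c0 < level <= c1:
            return t0 + (level - c0) * (t1 - t0) / (c1 - c0)
    raise ValueError("level is never crossed on a rising segment")


def rise_time_20_80(
    times_ps: Sequence[float],
    counts: Sequence[float],
    off_level: float,
    on_level: float,
) -> float:
    """20%-80% rise time of the gate-on edge, in the units of times_ps."""
    span = on_level - off_level
    t20 = crossing_time(times_ps, counts, off_level + 0.2 * span)
    t80 = crossing_time(times_ps, counts, off_level + 0.8 * span)
    return t80 - t20
```

Passing the plateau levels explicitly, rather than taking the minimum and maximum of the curve, keeps the estimate from being biased by ringing after the edge.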
Temporal response and TDC linearity
Before evaluating the imager timing performance, a clock phase tuning procedure was performed. This procedure analyzes the raw output codes from the TDC interpolators, and modifies the configuration of the clock generator's tunable buffers (described in Section 2.1.1) to obtain uniformly spaced clock phases. After performing this procedure, the system temporal response was measured using the same laser setup described in Section 3.1.2. The FWHM of the Impulse Response Function (IRF) was measured at multiple temporal positions with respect to the START signal. Fig. 6 shows the average FWHM for each pixel across the array; overall, the entire imager shows an average response of 138 ps, with a spread limited to the 125 ps to 150 ps range. TDC linearity was first evaluated by means of a code density test: the imager was illuminated with uncorrelated light and the resulting random TDC conversions were recorded, which should ideally result in a flat code density histogram. The reported rms value, as seen in Fig. 7, shows excellent linearity, with an average standard deviation of 0.63% of the TDC bin width. However, this measurement overestimates the real non-linearity, as the light's intrinsic Poisson noise accounts for most of the reported rms value. A proper measurement of this parameter would have required an unfeasibly long data acquisition time. Therefore, to estimate the TDC linearity more accurately, we performed an electrical test on a separate chip, which includes the same timing electronics, but no photodetectors. In that case, the measurement was carried out by feeding the START and STOP inputs of the system with two uncorrelated oscillators. The measurement was stopped after 48 hours of acquisition: the DNL was measured at 0.15% rms of the LSB (Least Significant Bit) and the INL was 0.48% rms of the LSB.
In order to evaluate the effectiveness of the hardware calibration, a second measurement was carried out by configuring the phase tuning circuit to default values, i.e. with the imager completely uncalibrated. The uncalibrated DNL was measured at 1.1% rms of the LSB and the INL at 1.2% rms of the LSB, thus showing that the calibration procedure leads to a clear improvement; nevertheless, the imager shows remarkable DNL and INL performance even when uncalibrated. Fig. 7 shows a comparison between the measured non-linearities before and after calibration.
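The code density analysis discussed in this section can be sketched as follows. This is a generic textbook formulation with hypothetical function names, not the analysis code used for the reported figures: DNL is computed as the relative deviation of each bin from the mean bin count, INL as its running sum, and the Poisson floor as the relative statistical fluctuation expected on each bin for a given mean count. The last function illustrates why the optical test is dominated by counting noise: resolving a DNL of 0.15% rms requires a floor well below 0.0015, that is, far more than about 4.4e5 counts in every bin.

```python
import math
from typing import List, Sequence, Tuple


def dnl_inl(hist: Sequence[float]) -> Tuple[List[float], List[float]]:
    """DNL and INL, in units of LSB, from a code density histogram."""
    mean = sum(hist) / len(hist)
    dnl = [h / mean - 1.0 for h in hist]
    inl: List[float] = []
    acc = 0.0
    for d in dnl:
        acc += d          # INL is the cumulative sum of the DNL
        inl.append(acc)
    return dnl, inl


def rms(values: Sequence[float]) -> float:
    """Root mean square of a sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def poisson_floor(mean_counts_per_bin: float) -> float:
    """Relative rms fluctuation of a bin due to Poisson statistics alone."""
    return 1.0 / math.sqrt(mean_counts_per_bin)


def counts_needed(target_floor: float) -> float:
    """Mean counts per bin at which the Poisson floor equals target_floor."""
    return 1.0 / target_floor ** 2
```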
CONCLUSIONS
The two imagers presented and characterized in this paper show very good performance for effective use in LIDAR applications based on the Time-of-Flight (TOF) of single photons. The most important features are fast and uniform detector gating (to avoid unwanted light or remove unwanted laser reflections, as exploited in (8)), a photon-coincidence mode (to reduce the detrimental effect of uncorrelated background illumination, providing rejection at the pixel level), high frame-rate readout, high detection efficiency (still 5% at 900 nm wavelength), and good photon timing precision (always better than 150 ps FWHM for all 1024 SPADs and 256 TDCs composing the array). The combination of high detection efficiency, low noise and time-gating, together with the very good timing response of the SPADs (in terms of both FWHM and fast exponential tail, as shown in more detail in (5)) and the remarkable DNL and INL of the TDC measurements, also makes this imager a viable array detector for many scientific applications, such as FLIM, Diffuse Optical Spectroscopy, and other TCSPC-based applications.
