We present the design and preliminary characterization of a 32 × 32 SPAD time-gated imager in a 0.16 µm BCD technology featuring an innovative pixel structure, composed of four single-photon detectors, independent event counters and a shared Time-to-Digital Converter (TDC). This approach allows our design to reach higher fill factor (9.6 % with a pixel pitch of 100 µm) than typical SPAD imagers with in-pixel TDC, as well as reducing the power dissipation of the chip itself.
INTRODUCTION
Time-resolved detection of faint (down to single photon) light signals is required by many industrial and scientific applications, ranging from 3D imaging (such as LIDAR, surveillance, object tracking) [1], [2] to Time Correlated Single Photon Counting (TCSPC) [3]. Moreover, many of these applications benefit from the ability to perform temporal selection ("time gating") of the incoming photons in order to reduce or reject unwanted components of the light signal.
Single-Photon Avalanche Diodes (SPADs) are detectors capable of single photon sensitivity which easily lend themselves to sub-nanosecond time-gated operation and can be manufactured in standard CMOS processes, allowing the integration of detectors, front-end and more sophisticated circuitry such as Time-to-Digital Converters in a single, low cost monolithic IC.
This trend has driven the development of low cost, large size CMOS SPAD imagers. Indeed, several examples of CMOS SPAD arrays can be found in the literature, capable of either photon counting or photon timing and having from a few tens of pixels up to QVGA resolution [4], [5], [6], [7], [8]. However, only a few of those imagers are able to perform simultaneous counting and timing operation, and the in-pixel integration of high resolution TDCs generally results in low fill factor.
IMAGER ARCHITECTURE
We present a SPAD imager developed in a 0.16 µm BCD technology. The array is composed of 32 × 32 SPAD detectors having a 32 µm × 32 µm square-shaped active area with rounded corners, with 100 µm detector pitch and a fill factor of 9.6 %. The imager natively operates in simultaneous photon counting and timing mode, but it can be configured to work as counting only or timing only for reduced power consumption or increased frame rate. The array architecture is the same as that presented in [9]. A global GATE signal is used to enable and disable the SPADs, allowing for temporal selection of incoming photons, as well as operating as a global shutter signal. Gating windows with sub-ns transitions guarantee good detection uniformity within the gate window.
The SPAD array is divided into 16 × 16 basic units called macropixels, whose structure is schematically shown in Figure 1. Each macropixel is composed of 2 × 2 SPADs, each one having an independent sensing and quenching circuit and photon counter, a shared TDC and storage registers. Smart arbitration logic allows the TDC to be shared among the four detectors without loss of X-Y resolution in normal operating mode, as well as providing an alternate "two-photon-coincidence" operation mode which reduces the effect of background illumination by exploiting coincidence of photon arrival times.
Macropixel structure and operating modes
The SPAD front-end is a Variable Load Quenching Circuit (VLQC) [10], modified to be able to operate in either free-running or time-gated mode. This is done by adding a thick-oxide PMOS transistor connected between the SPAD anode and a 5 V rail, which reduces the reverse bias applied to the SPAD below breakdown and thus prevents any avalanche from being triggered in the gate-off condition. This functionality also makes it possible to selectively disable excessively noisy SPADs by means of a configuration register.
The hold-off time is enforced by a digital counter that is incremented by the gate signal: two configuration signals set the hold-off duration from a maximum of 3 gate periods down to a minimum of zero (corresponding to the detector being reactivated at the rising edge of the following gate signal).
A 5-bit photon counter is associated with each SPAD to perform counting operation. A central arbitration logic receives signals from all four VLQCs to share the single TDC without loss of resolution, and provides two main operation modes: standard single-photon operation and a two-photon-coincidence mode, intended for rejection of background illumination.
In single-photon mode, the first SPAD to detect a photon within the gate-on time wins access to the TDC and generates a conversion, which is then stored in the associated 18-bit register. This detector is then prevented from triggering the TDC again in the following gates, as only one conversion per SPAD can be stored.
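The gate-counted hold-off can be illustrated with a minimal behavioural sketch. The function name `armed_gates` and the exact re-arming rule (a detection in gate g with hold-off h re-enables the detector at gate g + 1 + h) are assumptions made for illustration, not a description of the actual circuit.

```python
def armed_gates(detection_gates, holdoff, n_gates):
    """Return, for each gate window, whether the SPAD is armed.

    detection_gates: set of gate indices in which a photon would be
                     detected if the SPAD were armed (hypothetical input).
    holdoff:         hold-off duration in gate periods, 0 to 3.
    """
    if not 0 <= holdoff <= 3:
        raise ValueError("hold-off must be between 0 and 3 gate periods")
    armed = []
    next_armed = 0  # first gate index at which the SPAD is enabled again
    for g in range(n_gates):
        is_armed = g >= next_armed
        armed.append(is_armed)
        if is_armed and g in detection_gates:
            # Assumed rule: skip `holdoff` full gate periods after a detection.
            next_armed = g + 1 + holdoff
    return armed
```

With a hold-off of zero the detector is armed again at the very next gate, which matches the minimum setting described above.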
In two-photon-coincidence mode, a TDC conversion is triggered only when two SPADs detect a photon within a coincidence window of about 1 ns. In particular, the first detected photon acts as an enable signal for the TDC, while the second one actually triggers the conversion. This approach avoids wasting TDC conversions on single-photon events (which is especially important in our design, as the TDC can be triggered at most once per gate window), at the expense of recording the arrival time of the second photon rather than that of the first.
In this operating mode, the entire macropixel effectively acts as a single detector with a threshold set at two photons, resulting in spatial resolution halved in both X and Y. Because of this, the four storage registers are no longer connected to a specific SPAD; instead, the arbitration logic rewires them to store up to four two-photon conversion results. Similarly, one of the four photon counters is rewired to count the number of two-photon events.
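The two arbitration modes can be summarized with a short behavioural sketch. It is a simplification under stated assumptions: arrival times are given per SPAD within a single gate window, the names `single_photon_winner` and `coincidence_event` are hypothetical, and only the two earliest photons are considered for the coincidence test.

```python
COINCIDENCE_WINDOW_NS = 1.0  # "about 1 ns" in the text; exact value assumed

def single_photon_winner(arrivals, already_stored):
    """Single-photon mode: the earliest SPAD without a stored conversion wins.

    arrivals:       dict mapping SPAD index (0..3) to arrival time in ns.
    already_stored: set of SPAD indices whose register is already full.
    Returns (spad_index, arrival_time) or None if no conversion is triggered.
    """
    candidates = [(t, s) for s, t in arrivals.items() if s not in already_stored]
    if not candidates:
        return None
    t, s = min(candidates)
    return s, t

def coincidence_event(arrivals, window_ns=COINCIDENCE_WINDOW_NS):
    """Two-photon mode: the first photon enables the TDC, the second triggers it.

    Returns the arrival time of the second photon (the one actually
    converted) or None if no second photon falls inside the window.
    """
    times = sorted(arrivals.values())
    if len(times) < 2:
        return None
    first, second = times[0], times[1]
    return second if second - first <= window_ns else None
```

For example, two photons at 10.0 ns and 10.4 ns produce a conversion at 10.4 ns, while photons 2 ns apart produce none.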
Lastly, each macropixel contains a second set of output registers, a small readout management logic and bus drivers, which allow for flexible readout modes and greatly reduced dead time thanks to the double-buffered operation.
Time-to-Digital Converter
The TDC used in this work is a flash type architecture, based on two "fine" interpolators and a "coarse" counter to extend the limited range of the interpolators, and is shown in Figure 2. The counter is 7 bits wide, while the interpolator gives a 5-bit result, yielding a 12-bit conversion result with an LSB of 75 ps and an FSR of about 300 ns with a nominal reference clock of 420 MHz. The imager receives a START signal (synchronous with the excitation laser) from the user at the beginning of each gate window, and its arrival time with respect to the reference clock is measured by the global START interpolator and stored in a memory bank. The START signal also enables the reference clock to be distributed to the macropixels, where it clocks the TDC coarse counter.
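As a quick check of the figures quoted above, the LSB and full-scale range follow directly from the reference clock frequency and the bit widths. The helper name `tdc_resolution` is introduced here only for illustration.

```python
def tdc_resolution(f_ref_hz=420e6, fine_bits=5, coarse_bits=7):
    """Return (LSB in ps, full-scale range in ns) for a coarse/fine TDC."""
    t_ck_s = 1.0 / f_ref_hz                    # reference clock period
    lsb_s = t_ck_s / (1 << fine_bits)          # one clock period split in 2^fine_bits
    fsr_s = t_ck_s * (1 << coarse_bits)        # 2^coarse_bits clock periods
    return lsb_s * 1e12, fsr_s * 1e9

lsb_ps, fsr_ns = tdc_resolution()
print(f"LSB = {lsb_ps:.1f} ps, FSR = {fsr_ns:.1f} ns")
```

With the nominal 420 MHz clock this gives roughly 74.4 ps and 304.8 ns, consistent with the rounded values of 75 ps and about 300 ns.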
The STOP signal is generated inside the macropixel and is synchronous to a photon detection. Arrival of the STOP signal halts the coarse counter and triggers the in-pixel STOP interpolator, which measures the photon arrival time against the reference clock. The final START-STOP delay (Tmeas) can simply be calculated as the difference between the two "fine" measurements (TSTART and TSTOP) plus the number of "coarse" clock cycles (Ncounter) multiplied by the clock period:
$$T_{meas} = N_{counter} \cdot T_{ck} + T_{START} - T_{STOP}$$
where Tck is the reference clock period. The "fine" interpolators use arbiters to sample the state of multiple clock phases provided by a global clock generator circuit, dividing the reference clock cycle into 32 sub-intervals and giving a 5-bit interpolation result. Standard implementations of this type of interpolator usually require one clock phase per sub-interval, as they are sensitive only to the rising edge of the clocks [11]. However, the interpolators used in this work are sensitive to both clock edges, thus halving the number of clock lines to be routed and reducing power consumption compared to the standard implementation.
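The START-STOP reconstruction described above can be sketched numerically as follows. The sign convention is an assumption: each fine code is taken to express the delay from the event to the following reference clock edge, so the fine terms enter as TSTART minus TSTOP; the opposite convention would flip that sign. The function name `t_meas_ps` is hypothetical.

```python
def t_meas_ps(n_counter, start_code, stop_code, f_ref_hz=420e6, fine_bits=5):
    """Combine the coarse count and the two fine codes into a delay in ps.

    n_counter:  number of reference clock cycles counted between START and STOP.
    start_code: output code of the global START interpolator.
    stop_code:  output code of the in-pixel STOP interpolator.
    """
    t_ck_ps = 1e12 / f_ref_hz
    lsb_ps = t_ck_ps / (1 << fine_bits)
    # Assumed convention: T_meas = N_counter * T_ck + T_START - T_STOP
    return n_counter * t_ck_ps + (start_code - stop_code) * lsb_ps
```

For instance, two coarse cycles with a START code of 16 and a STOP code of 0 correspond to 2.5 clock periods, about 5952 ps.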
The use of both clock edges for the interpolator results in a stringent 50 % duty cycle requirement for the clock phases. This is addressed in the clock generator circuit and clock distribution tree, shown as a block diagram in Figure 3. First, a DLL (Delay Locked Loop) is used to generate 16 phases from the 420 MHz reference clock, resulting in a 150 ps phase-to-phase delay. An interpolator circuit [12] is used to generate 16 additional intermediate phases, thus reaching the 75 ps TDC LSB. This two-step procedure is needed to obtain a phase delay which is shorter than the minimum delay achievable by the DLL. Lastly, an edge combiner circuit uses the rising edges of the first 16 phases to generate the rising edges of the TDC clocks, while the rising edges of the last 16 phases generate the corresponding falling edges.
The clock generator circuit provides 16 TDC clock phases having a nominal delay of 75 ps (1/32 of the reference clock period) and 50 % duty cycle. To correct for any mismatches or process variation that could affect the multiphase clocks, digitally adjustable buffers and drivers have been placed at various points of the clock generation and distribution tree, allowing each clock edge to be tuned over a range of a few tens of ps. Calibration of these buffers can be performed by means of code density tests on the interpolators' output data. Lastly, the TDC includes a 6-bit gate counter, whose value is saved each time a conversion is triggered. As the TDC can only perform one conversion per gate window, this value is used to match the STOP conversion results from the macropixels with the entries in the START memory bank, allowing multiple TDC acquisitions to be performed within the same frame. This makes it possible for all four detectors in the macropixel to trigger a conversion within the same frame, and also reduces the acquisition dead time.
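The nominal edge timing produced by this three-step scheme can be sketched as below. This is an idealized timing model, not the circuit: it assumes perfectly uniform phases, and the function name `tdc_clock_edges` is introduced for illustration only.

```python
def tdc_clock_edges(f_ref_hz=420e6, n_dll_phases=16):
    """Return nominal edge times (ps) for the DLL phases, the 32 fine phases
    and the 16 TDC clocks built by the edge combiner."""
    t_ck_ps = 1e12 / f_ref_hz
    # Step 1: the DLL splits one clock period into 16 phases (about 150 ps apart).
    dll = [k * t_ck_ps / n_dll_phases for k in range(n_dll_phases)]
    # Step 2: interpolation adds one intermediate phase between neighbours,
    # giving 32 phases about 75 ps apart.
    step_ps = t_ck_ps / (2 * n_dll_phases)
    phases = [k * step_ps for k in range(2 * n_dll_phases)]
    # Step 3: edge combiner. Phase k sets the rising edge of TDC clock k,
    # phase k + 16 sets its falling edge, half a period later.
    clocks = [(phases[k], phases[k + n_dll_phases]) for k in range(n_dll_phases)]
    return dll, phases, clocks
```

In this model every TDC clock is high for exactly half a reference period, which is the 50 % duty cycle the double-edge interpolators rely on.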
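The role of the gate counter in pairing STOP results with stored START measurements can be sketched as follows. The data layout (a dictionary for the START memory bank and tuples for STOP results) and the name `pair_conversions` are assumptions made for illustration.

```python
GATE_COUNTER_BITS = 6  # width of the gate counter stated in the text

def pair_conversions(start_bank, stop_results):
    """Match each STOP conversion with the START entry of the same gate.

    start_bank:   dict mapping gate counter value -> START interpolator code.
    stop_results: list of (gate_counter, n_counter, stop_code) tuples read
                  out from the macropixels.
    Returns a list of (gate, n_counter, start_code, stop_code) tuples.
    """
    mask = (1 << GATE_COUNTER_BITS) - 1
    paired = []
    for gate, n_counter, stop_code in stop_results:
        key = gate & mask  # the counter wraps after 64 gates
        if key in start_bank:
            paired.append((key, n_counter, start_bank[key], stop_code))
    return paired
```

Each paired tuple then carries everything needed to reconstruct one START-STOP delay, even when several conversions from different gates are stored in the same frame.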
Readout electronics
Macropixels along the same row share a common 23-bit row bus by means of three-state drivers; because, as described earlier, each detector is associated with a 23-bit register, complete readout of a single macropixel takes 4 clock cycles. However, depending on the selected operation mode, faster readout is possible: additional readout modes requiring 1 or 2 cycles are available depending on the intended application.
In particular, a full readout mode requiring 4 clock cycles per macropixel is provided for both single-photon and two-photon-coincidence operating modes, which allows the greatest amount of data to be collected from the imager. For applications which expect a low number of TDC conversions, a fast readout option is available which only stores one TDC conversion per macropixel and requires a single clock cycle for two-photon operation and either one or two cycles in single-photon mode, depending on whether counting data is required. Lastly, a photon-counting-only mode is also available, which requires a single clock cycle per macropixel. This flexibility is obtained by including in each macropixel a small readout state machine which manages the internal data multiplexers and row bus drivers. This in-pixel management strategy also removes the need for a global column select circuit, as each row of pixels essentially contains its own, independent column selector. In our design, schematically shown in Figure 4, each pixel receives a row clock signal from the preceding column. The state machine can either capture the clock signal to time its own readout operations, enabling its output buffers and taking ownership of the row bus, or allow the clock to propagate to the next pixel. Configuration signals are used by the state machine to format the output data according to the selected operating mode and to enable clock propagation after the correct number of row clock cycles.
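One way to picture the 23-bit word is as the concatenation of the 12-bit TDC result, the 6-bit gate counter and the 5-bit photon counter, whose widths sum to 23. The sketch below packs and unpacks such a word; the field order and the names `pack_word` and `unpack_word` are assumptions, since the actual bit layout is not specified here.

```python
TDC_BITS, GATE_BITS, COUNT_BITS = 12, 6, 5  # 12 + 6 + 5 = 23 bits

def pack_word(tdc, gate, count):
    """Pack the three fields into one 23-bit word (assumed field order:
    gate counter in the top bits, TDC code in the middle, photon count low)."""
    if not (0 <= tdc < 1 << TDC_BITS and 0 <= gate < 1 << GATE_BITS
            and 0 <= count < 1 << COUNT_BITS):
        raise ValueError("field out of range")
    return (gate << (TDC_BITS + COUNT_BITS)) | (tdc << COUNT_BITS) | count

def unpack_word(word):
    """Inverse of pack_word: return (tdc, gate, count)."""
    count = word & ((1 << COUNT_BITS) - 1)
    tdc = (word >> COUNT_BITS) & ((1 << TDC_BITS) - 1)
    gate = (word >> (TDC_BITS + COUNT_BITS)) & ((1 << GATE_BITS) - 1)
    return tdc, gate, count
```

Four such words, one per SPAD, account for the 4 row clock cycles needed by the full readout of a macropixel.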
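The cycle counts of the readout modes listed above can be collected in a small lookup function. The mode names are hypothetical labels, and the assumption that fast single-photon readout takes two cycles when counting data is requested (and one otherwise) is an interpretation of the text.

```python
def readout_cycles(readout_mode, operating_mode="single", with_counts=False):
    """Row clock cycles needed to read one macropixel.

    readout_mode:   "full", "fast" or "counting" (hypothetical labels).
    operating_mode: "single" (single-photon) or "coincidence" (two-photon).
    with_counts:    whether counting data is read in fast single-photon mode.
    """
    if operating_mode not in ("single", "coincidence"):
        raise ValueError("unknown operating mode")
    if readout_mode == "full":
        return 4                      # all four registers, either operating mode
    if readout_mode == "counting":
        return 1                      # photon counting only
    if readout_mode == "fast":
        if operating_mode == "coincidence":
            return 1
        return 2 if with_counts else 1
    raise ValueError("unknown readout mode")
```

Multiplying the result by the 256 macropixels of the array gives the number of words transferred per frame in each mode.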
A common output bus is shared among the 16 row buses using three-state buffers, which are controlled by a row selection circuit. The row selector is essentially a 16-bit shift register that circulates a one-hot pattern, enabling a new row on each cycle of the master readout clock. As soon as a new row is enabled, a clock pulse is sent to the previously read one. In this design, only the output bus and row selector are required to operate at the full readout clock frequency, while each macropixel has 15 readout clock cycles available to drive the row bus to valid logic levels, allowing the use of much smaller in-pixel bus drivers.
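The row selector behaviour can be sketched as a rotating one-hot pattern. The generator below is a behavioural model only; the name `row_select_sequence` and the choice of starting from row 0 are assumptions.

```python
def row_select_sequence(n_rows=16, n_cycles=32):
    """Yield the one-hot row-enable word for each master readout clock cycle."""
    mask = (1 << n_rows) - 1
    state = 1  # assumed reset state: row 0 enabled
    for _ in range(n_cycles):
        yield state
        # Rotate left by one position: the next row is enabled on each cycle.
        state = ((state << 1) | (state >> (n_rows - 1))) & mask

# Each row is selected once every 16 master clock cycles, so a macropixel
# has the 15 cycles in between to settle its data on the row bus.
for cycle, word in enumerate(row_select_sequence(16, 4)):
    print(cycle, format(word, "016b"))
```

Because the interval between two selections of the same row equals the number of rows, the settling time available to each row bus grows with the array size.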
This design lends itself well to scaling to larger array sizes, as it avoids the use of fast, global signals (row clocks are only locally routed from one pixel to the next one) and only requires the output bus to be able to operate at full readout speed, while the time available to precharge each row bus is always maximized and increases with the array size. It also allows for easy partitioning of the array into subsections with independent readout with no need to alter the macropixel design or routing within the array core.
PRELIMINARY CHARACTERIZATION
Separate test chips have been manufactured to validate the operation of key structures in the imager. Preliminary characterization measurements have been carried out on these test chips; specifically, the operation of the SPAD with integrated VLQC and of the TDC structure has been validated.
First, a START signal generated by the VLQC test chip and a constant frequency STOP signal were supplied to the TDC to calibrate the multiphase clock generator. As it is not possible to independently tune each interpolator, the calibration procedure aims at making the average code density of both interpolators uniform.
After performing the calibration, the timing response of the complete signal chain was characterized by shining an 850 nm, 55 ps FWHM laser on the SPAD; the laser synchronization signal was delayed and then used as a STOP signal for the TDC. The recorded timing histogram is reported in Figure 5, which shows a FWHM of 140 ps when operated at 4.5 V of excess bias. This can comfortably guarantee correct detector off-gating even in case of breakdown voltage variation across the array and resistive drops reducing power supply to pixels further from the periphery.
Conversion linearity has also been characterized using a fixed frequency STOP signal and an external SPAD as a source of uncorrelated START signals. The measured DNL (Differential Non-Linearity) is 0.65 % of the TDC LSB (equal to 0.48 ps RMS) and is shown in Figure 6.
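A code density test of this kind is commonly evaluated as sketched below: with uncorrelated START events every code should be hit equally often, so the relative deviation of each histogram bin from the mean estimates the DNL. This is a generic sketch, not the analysis code used for the measurement; the function names are hypothetical.

```python
import math

def dnl_from_histogram(counts):
    """Per-code DNL in LSB from a code density histogram."""
    if not counts:
        raise ValueError("empty histogram")
    mean = sum(counts) / len(counts)
    return [c / mean - 1.0 for c in counts]

def rms_ps(dnl_lsb, lsb_ps=75.0):
    """RMS of the DNL, converted from LSB to picoseconds."""
    rms_lsb = math.sqrt(sum(d * d for d in dnl_lsb) / len(dnl_lsb))
    return rms_lsb * lsb_ps

# Consistency check of the quoted figures: 0.65 % of a 75 ps LSB.
print(0.0065 * 75.0)  # about 0.49 ps, in line with the quoted 0.48 ps RMS
```

The same histogram, computed separately on the fine codes, is also the quantity that a calibration of the adjustable clock buffers would try to flatten.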
Preliminary tests have also been carried out on the array chip to evaluate SPAD Dark Count Rate (DCR) and yield. Figure 7 shows the DCR distribution of the 1024 SPADs in the test array when operated at 4.5 V excess bias with no cooling. The median DCR is 720 cps, with about 6.7 % of hot pixels (where a pixel is defined "hot" if its DCR is one decade higher than the median DCR). This value is higher than the expected room temperature DCR of a single SPAD of the same active area. However, it should be noted that on the test PCB the array chip configured for photon-counting mode operates at around 35 °C, due both to the internal circuitry and to other active components placed in the vicinity. A full characterization of the SPADs is provided in [13].
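The hot-pixel criterion used above (DCR one decade above the median) can be applied to a measured DCR map as in the following sketch. The function name `hot_pixel_fraction` and the choice of treating a DCR exactly at the threshold as hot are assumptions; the example data are made up purely to exercise the code.

```python
import statistics

def hot_pixel_fraction(dcr_cps, decades=1.0):
    """Return (median DCR, fraction of hot pixels).

    A pixel is counted as hot when its DCR is at least `decades` decades
    above the median of the whole array.
    """
    median = statistics.median(dcr_cps)
    threshold = median * 10 ** decades
    n_hot = sum(1 for d in dcr_cps if d >= threshold)
    return median, n_hot / len(dcr_cps)

# Illustrative values only, not measured data.
example = [100] * 9 + [5000]
print(hot_pixel_fraction(example))
```

With a median of 720 cps, this criterion places the hot-pixel threshold at 7200 cps.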
CONCLUSIONS
We presented the architecture and preliminary characterization of a 32 × 32 SPAD imager developed in a 0.16 µm BCD technology, capable of providing simultaneous photon-timing and photon-counting information and able to work in time-gated mode. The imager is based on macropixels containing 2 × 2 SPADs and shared electronics, which allow for increased fill factor (9.6 %) and reduced power dissipation, as well as providing an alternate operation mode intended for background light rejection.
Preliminary characterization performed on test chips shows a timing accuracy of 140 ps FWHM in response to a 55 ps FWHM laser pulse for the entire timing chain and a TDC DNL of 0.48 ps RMS.
