A 256 × 256 single-photon avalanche diode (SPAD) sensor integrated into a 3-D-stacked 90-nm 1P4M/40-nm 1P8M process is reported for flash light detection and ranging (LIDAR) or high-speed direct time-of-flight (ToF) 3-D imaging. The sensor bottom tier is composed of a 64 × 64 matrix of 36.72-µm pitch modular photon processing units which operate from shared 4 × 4 SPADs at 9.18-µm pitch and 51% fill-factor. A 16 × 14 bit counter array integrates photon counts or events to compress data to 31.4 Mb/s at 30-frame/s readout over 8 I/O operating at 100 MHz. The pixel-parallel multi-event time-to-digital converter (TDC) approach employs a programmable internal or external clock for 0.56-560-ns time bin resolution.
I. INTRODUCTION
LIGHT detection and ranging (LIDAR) applications pose extremely challenging dynamic range (DR) requirements on optical time-of-flight (ToF) receivers due to laser returns being affected by the inverse square law over 2-3 decades of distance, diverse target reflectivity, and high solar background [1], [2]. Integrated CMOS single-photon avalanche diodes (SPADs) have a native DR exceeding 140 dB, typically extending from the noise floor of a few counts per second (cps) to hundreds of Mcps peak rate [3]. To deliver this DR to downstream digital signal processing (DSP), large SPAD time-resolved imaging arrays must count and time billions of single-photon events per second, demanding massively parallel on-chip pixel processing to achieve practical I/O power consumption and data rates. Hybrid Cu-Cu bonding offers a mass-manufacturable platform to implement these sensors by providing high fill-factor SPADs optimized for the near infrared (NIR) stacked on dense nanoscale digital processors [4]-[6]. While some of the key challenges to practical SPAD widefield imagers are resolved by advanced manufacturing technologies, others must be addressed by design innovation. In particular, simultaneously achieving high spatial/temporal resolution and DR at low power consumption and I/O data rates is especially challenging. The new design freedom offered by 3-D-stacking of SPAD arrays has inspired a number of novel stacked sensor architectures involving pixel-level histogramming, on-chip peak detection, and time-to-digital converter (TDC)/processor resource sharing [7]-[9]. However, so far, none has managed to balance these conflicting factors satisfactorily, either consuming high power or generating high volumes of timestamp data.
This article presents the first large imaging array of compact, reconfigurable SPAD time-resolved pixels in a 3-D-stacked 90-nm 1P4M/40-nm 1P8M CMOS technology. We address pixel power consumption by employing an in-pixel correlator which triggers the TDC activity only on multiple photon detections within nanosecond timescales, most likely to occur from short laser pulses rather than background light or dark count [10]. In addition, power is conserved by a selectable internal/external TDC clock generator which uses high-resolution internal TDC power sparingly in small time ranges to refine coarse but lower power range estimates from external TDC clocking. DR is tackled by in-pixel integration of photon counts or events in a counter bank, with TDC throughput enhanced by a multi-event shift-register approach. Readout uses either a rolling scheme, in which rows are read out pairwise while all other rows are in integration, or a global shutter scheme, in which all rows are in integration, which is then suspended before readout. A uniform backside illumination (BSI) matrix of 256 × 256 p-well/deep-n-well SPADs at the 9.18-µm pitch with 51% fill-factor occupies the top tier [2], [4]. The SPAD has a median dark count rate (DCR) of 20 cps at 1.5-V excess bias and a peak photon detection probability (PDP) of 28% at 615 nm, falling to around 5% at 900 nm [11]. In the bottom tier, a 64 × 64 matrix of modular pixel processors at the 36.72-µm pitch is integrated in 40-nm CMOS technology. A dense custom layout approach employing area-optimized D-type flip-flops has been employed to allow ∼40 M transistors to be integrated under the 5.38-mm² focal plane.
II. SENSOR DESIGN
The reconfigurable pixel architecture is shown in Fig. 3. A 16 × 14 bit ripple counter array and a mode multiplexer allow the sensor to operate in a number of imaging modalities, which are summarized in Table I including their key signals, resolutions, and main applications. The counter bank can be re-purposed for either photon counting or as time bins for in-pixel histogramming. The full resolution, 256 × 256, of the sensor is available in the single-photon counting (SPC) mode with the bit depth commensurate with common full well depths (and hence SNR) in conventional pinned photodiode pixels (16 384). Photon counting modes can be combined with windowing or electrical gating of the SPADs to suppress the background or provide indirect ToF or frequency-domain lifetime imaging. This function also provides an option to operate the pixel in global shutter exposure (with zero parasitic light sensitivity), while the default is a high temporal aperture (127/128) rolling shutter exposure.
The gated ring oscillator (GRO) can be used for single-shot timestamping of single photons for precise but low throughput time-correlated SPC (TCSPC). This mode finds use in scientific imaging or long-range LIDAR where the 38-ps time resolution and 143-ns range (tunable to 500 ns) are suited to common distances and fluorophores. The same GRO can be employed as the frequency reference for an in-pixel histogramming mode for short-range ToF with high photon timing throughput at a modest I/O rate. Here, the time resolution is coarse (560 ps upward in factors of 2), but a finer depth is recovered by peak centroiding. In microscopy applications, the bin size is still sufficient for common fluorescence lifetimes (few ns decay time). The various modes generally involve a tradeoff of spatial and temporal resolutions where the counters and GRO are shared between all 16 SPADs via an XOR tree. Alternatively, DR and counter depth are traded for spatial resolution by chaining counters or sharing the timing functions of the multi-event TDC (METDC) or GRO. A counting correlator can be applied with a variable threshold to suppress background and save power in all timing modes. The use of an external clock for multi-event histogramming or a STOP clock in single-shot timestamping offers longer time ranges at lower power consumption than the in-pixel GRO.
Switching between the various modes can be achieved either instantaneously, where the control signals are available on pads, or within around 320 ns, where reprogramming of the serial interface is required. Rapidly alternating between, say, multi-event histogramming for 3-D imaging and the SPC mode has proven very useful to combine the lower spatial resolution depth estimation with improved spatial intensity information [12]. All operating modes apply globally to the whole array and cannot be reprogrammed on a per-pixel basis. Although the sensor is operable in both direct and indirect ToF modes, we favor the former in the following discussion and characterization. Direct ToF (dToF) is particularly suited to the fast response and timing precision of SPADs matched to narrowband pulsed laser sources, offering good background immunity and long-range depth imaging. Indirect ToF operating modes are incorporated in the sensor for later side-by-side comparison with the dToF mode, but are not evaluated here.
In the SPC mode, the 16 SPADs can be multiplexed directly to a bank of 16 14-bit counters, realizing shot-noise-limited digital photon counting imaging with a 16 384 photon full-well capacity (84-dB DR). GS-HDR modes are realized by chaining the counters into 8 × 28 bit while binning groups of 4 SPADs, thereby allowing 256M photons per SPAD to be counted in an exposure period with 64 × 256 spatial resolution. Binning SPADs in groups of 4 (rather than 2 for higher spatial resolution) was chosen to allow a ping-pong global-shutter indirect ToF mode. The extended counter depth is of value to allow the imager to operate over long exposure times (1 s) without the possibility of counter saturation even at the peak SPAD event rate of 200 Mcps. The same 16 counters are repurposed for LIDAR or 3-D imaging as histogram bins for the dToF operation where the resolution is now 64 × 64 pixels. In this case, a tradeoff is made between temporal and spatial resolutions to allow on-chip integration of timed photons, greatly reduced I/O rates, and extended DR. Fig. 4 shows the SPAD interface circuit, which has been designed to be highly area-efficient and multi-functional. Only four thick oxide devices are employed to perform passive quenching or gating of the SPAD as well as level shifting from the excess bias voltage V_eb (up to 3.3 V) to the logic voltage V_dd of 1.1 V. The interface is also agnostic to the SPAD polarity for compatibility with future top tier detector generations which may integrate n-on-p or p-on-n SPAD structures. The combination of external global V_qp and V_qn voltages and the invert signal allows selection of either NMOS or PMOS quench transistors and time gating. An enable SRAM allows masking of noisy detectors. Two 16-bit datapaths (see Fig. 3) connect the SPAD interface circuits to the pixel counter array, implementing parallel photon counting or timing using the SPAD(k) or Start(k) signals, respectively.
The Start(k) signals act as a start trigger for the in-pixel TDC, which can operate in either single-shot timestamping mode or multi-event histogramming mode. When ToggleEn is low, only the first photon within a frame exposure time will be time-stamped by the single-shot TDC. This first photon in the kth SPAD is encoded as a rising edge of Start(k) and is combined by the XOR tree and the correlator as PhotonEdges to initiate the single-shot timestamping operation (see Fig. 5). Alternatively, if the SPAD(k) outputs are used in conjunction with SPC mode = 0, the counters can be used to count STOP edges to provide per-SPAD photon arrival macro-times. A macro-time is the coarse time offset from the start of frame exposure to the detection of the first photon arrival, as a count of the number of laser cycles (or STOP clock cycles). A macro-time can be combined with a micro-time (the fine TDC timestamp of the photon arrival) to generate a precise time arrival estimate of the first photon in a frame for every pixel.
When ToggleEn is high, multiple photons can now be time-converted per laser cycle by the METDC and integrated in the histogram memory over many laser cycles. Photon arrivals are encoded on both rising and falling edges of Start(k). The 16 Start(k) toggling outputs are then mixed via an XOR tree and a correlator to a single sequence PhotonEdges, which is passed to the METDC (see Fig. 5).
PhotonEdges is processed by a counter-based correlator circuit (see Fig. 5), which continually counts photons from the first rising edge received in each laser cycle. A threshold of 1, 2, 4, or 8 photons occurring within a 0.5-10-ns delay time (adjustable by the current starving voltage V_ndelay) will generate an output trigger to start the GRO. This circuit saves power by activating the GRO only if there is a highly correlated burst of photons likely to belong to a laser return while suppressing uncorrelated background light. An example of the timing of this circuit is shown in Fig. 6. When Qtrig is asserted, it enables a qualified trigger as the delayed first photon in the burst, provided that the threshold has been met. Although this creates a histogram with a time offset, the matching of the V_ndelay controlled current starving circuit is sufficient to allow acceptable uniformity and simple offset correction. The toggling trigger sequence can also be generated on multiple bursts for use in the METDC.
Fig. 7. GRO which can operate either as a coarse-fine single-shot TDC or as a programmable clock source for the METDC. Also, notice the option to multiplex in a global external clock for longer time intervals.
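The correlator decision described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the function name `correlator_trigger`, the use of ideal photon timestamps, and the single coincidence window per laser cycle are illustrative choices, not a description of the actual circuit.

```python
from typing import Optional, Sequence


def correlator_trigger(photon_times_ns: Sequence[float],
                       threshold: int = 2,
                       window_ns: float = 2.0) -> Optional[float]:
    """Behavioral sketch of a counting correlator for one laser cycle.

    Assumptions (not from the source): photon arrivals are given as ideal
    timestamps, and only one coincidence window is opened per cycle,
    starting at the first photon.
    Returns the qualified trigger time (first photon delayed by the
    window length) if at least `threshold` photons fall inside the
    window, otherwise None (the GRO would stay off).
    """
    if len(photon_times_ns) == 0:
        return None
    times = sorted(photon_times_ns)
    first = times[0]
    # Count photons inside the window opened by the first photon.
    count = sum(1 for t in times if t - first <= window_ns)
    if count >= threshold:
        return first + window_ns
    return None


# Two photons 0.8 ns apart qualify; an isolated background photon does not.
print(correlator_trigger([10.0, 10.8, 40.0], threshold=2, window_ns=2.0))
print(correlator_trigger([10.0, 40.0], threshold=2, window_ns=2.0))
```

The returned time is offset from the first photon by the window length, which mirrors the fixed histogram time offset mentioned in the text.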
The METDC is based on the architecture presented in [13], operating with a shift-register delay chain to allow digitally programmable time bin resolution over a wide temporal range. This architecture allows time digitization of up to 16 photons per laser cycle, or 1 photon per SPAD per laser cycle. The time bin resolution is set over a range of 0.56-560 ns by TDCclk, which is selected from the divided phases of the GRO. This clock starts only on the first trigger (single or multi-photon) for power saving. The 14-bit GRO TDC shown in Fig. 7 is adjustable by the V_ddro voltage over a full-scale range of 560 ns to 1.6 µs for a 35-100-ps LSB resolution, matched with common automotive LIDAR ranges. Alternatively, for long-distance ranging, TDCclk can be connected to a lower frequency global external clock ExtClk. GRO TDCs are attractive because they offer high time resolution and low average power consumption proportional to the photon flux. However, they also feature high peak power consumption, accumulated jitter, and mismatch. We propose here to operate the GRO TDCs only for short time intervals (50-100 ns), where these disadvantages are minimized. For longer time intervals, such as those required by LIDAR operating over hundreds of meters or kilometers, the global ExtClk generated by an on-chip phase-locked loop (PLL) would be favored. In this case, long-term accumulation of jitter will be eliminated by the PLL, although constant average power consumption will be drawn by the clock distribution to the array, with ExtClk typically in the range of tens to hundreds of MHz.
TDCclk is then used to clock a 16-stage shift register within the METDC (see Fig. 8), setting the bin time duration of the resulting 16-bin histogram. The operation of this circuit is illustrated in Fig. 9 using a simplified example for a circuit with only 4 bins. The XOR-tree output from the SPADs toggles twice, the edges representing two photon arrivals from among the 16 SPADs. The GRO is started by the first of these and oscillates for around two and a half periods, producing three rising edges which clock the shift register onward three times. The initial reset state of the METDC in Fig. 9(b) shows 0 states at all internal nodes. After the first TDCclk edge, the high state of XOR16 is latched into the first position in the shift register [see Fig. 9(c)]. On the second TDCclk edge, this 1 moves one position forward and another 1 is introduced as no new photon has arrived [see Fig. 9(d)]. On the third clock edge, the XOR16 has toggled from the second photon and a 0 is introduced [see Fig. 9(e)]. The chain of XORs picks out these transitions in the shift register, generating a multiple-hot code. The positions of ones in this code represent the time offset in TDCclk cycles of photon arrivals from the STOPd clock, which is synchronized to the pulsed light source. On the high transition of the STOPd clock, this multiple-hot code is latched as TDC[3:0], which is multiplexed into the clock inputs of an array of counters forming a 4-bin histogram. The bits TDC[0] and TDC[2], which transition high, increment their respective histogram bins in parallel. Thus, multiple photons can be recorded within one laser cycle, compared to a more conventional TDC which places a single timestamp in a histogram memory per laser cycle.
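The 4-bin example above can be reproduced with a short behavioral model. It is a sketch rather than the circuit itself: the function names are hypothetical, and the wiring of the XOR chain (adjacent stages compared, with the last stage compared against an assumed reference level equal to the reset state) is an assumption chosen so that the model matches the TDC[0] and TDC[2] outcome described in the text.

```python
def metdc_hot_code(xor_levels, n_bins=4, ref_level=0):
    """Shift-register METDC sketch for one laser cycle.

    xor_levels: XOR16 logic level sampled at each TDCclk rising edge.
    Assumption: code bit i is the XOR of adjacent stages i and i+1, and
    the last stage is compared against `ref_level` (the reset state).
    """
    sr = [ref_level] * n_bins
    for level in xor_levels:
        sr = [level] + sr[:-1]          # shift in the newest sample
    code = [sr[i] ^ sr[i + 1] for i in range(n_bins - 1)]
    code.append(sr[-1] ^ ref_level)
    return code


def accumulate(histogram, code):
    """Increment every histogram bin whose code bit is set (in place)."""
    for i, bit in enumerate(code):
        histogram[i] += bit
    return histogram


# Three TDCclk edges: XOR16 is high, still high, then toggled low by the
# second photon, as in the Fig. 9 walk-through.
hist = [0, 0, 0, 0]
code = metdc_hot_code([1, 1, 0])
accumulate(hist, code)
print(code, hist)   # [1, 0, 1, 0] [1, 0, 1, 0]
```

Both photons land in the histogram within a single laser cycle, which is the throughput advantage over a TDC that records one timestamp per cycle.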
Readout of the 16 × 14 bit counter array is accomplished in either a rolling shutter or a global shutter approach. The pixel reads out over 4 × 16 bit parallel buses which are subsequently serialized over eight I/O pads operating at 100 MHz. Operating at 30 frames/s, the sensor produces a very modest 31.4 Mb/s of data which are readily processed to extract histogram peak locations by an external field-programmable gate array (FPGA). A frame rate of 760 frames/s has been achieved in the global shutter mode, although a glitch in the token-passing shift register currently limits the sensor to access only half the available image resolution. Fig. 10(a)-(d) shows a series of 33-ms exposure global shutter photon counting images taken using a white LED array lamp to illustrate the high DR of the sensor. In 1 klux lighting, the 14-bit counters do not saturate [see Fig. 10(a)]. The light level is increased to 100 klux, showing saturation and clipping in the Nikon reference image [see Fig. 10(b)]. Under the same illumination conditions, the 14-bit counters roll over and corrupt the image in the top left corner [see Fig. 10(c)]. Details of this area are recovered in the tone-mapped 28-bit photon-counting image, albeit at four times lower image resolution [see Fig. 10(d)]. No electronic masking or image post-processing has been applied to remove high DCR pixels in Fig. 10.
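The readout figures quoted above can be checked with a few lines of arithmetic. The sketch assumes that each 14-bit counter is read out as one 16-bit word (suggested by the 16-bit buses) and that each I/O pad carries one bit per 100-MHz clock cycle; neither assumption is stated explicitly in the text.

```python
# Array and readout parameters taken from the text.
PIXELS = 64 * 64            # modular pixel processors
COUNTERS_PER_PIXEL = 16     # histogram bins / counters per pixel
WORD_BITS = 16              # assumption: 14-bit counter padded to a 16-bit word
IO_PADS = 8
IO_CLOCK_HZ = 100e6         # assumption: one bit per pad per clock cycle

bits_per_frame = PIXELS * COUNTERS_PER_PIXEL * WORD_BITS
rate_mbps_at_30fps = bits_per_frame * 30 / 1e6
io_capacity_bps = IO_PADS * IO_CLOCK_HZ
max_fps = io_capacity_bps / bits_per_frame

print(f"bits per frame:     {bits_per_frame}")
print(f"data rate @ 30 fps: {rate_mbps_at_30fps:.2f} Mb/s")
print(f"I/O-limited rate:   {max_fps:.1f} frames/s")
```

Under these assumptions the 30-frames/s data rate comes out close to the quoted 31.4 Mb/s, and the I/O-limited frame rate is close to the 760 frames/s reported for the global shutter mode.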
III. SENSOR CHARACTERIZATION
The HDR global shutter mode is insensitive to parasitic light as the frame is stored in a digital form. A DR of 120 dB is seen in the photon transfer curve in Fig. 11 with peak count rates of 200 Mcps/SPAD or 13 Tcps for the whole array. We believe this is the first time the full native SPAD DR is available in a video rate widefield image sensor. Fig. 12 shows an impulse response function (IRF) in a single-shot TDC mode captured using a 775-nm Coherent Chameleon Ultra laser operating at 80 MHz. The TDC resolution is 38 ps and the mean full-width at half-maximum (FWHM) across the array including the laser, SPAD, and system jitter is 277 ps with a standard deviation of 30 ps. Fig. 13 shows the linearity of the single-shot TDC operating in dToF as ±10 cm over a 50-m range. The latter data were obtained with a 671-nm Picoquant pulsed laser coupled to a 100-µm multimode fiber and passed through a 15-mm lens to flood illuminate the scene. The pulse duration of the laser was ∼100 ps, the repetition rate was 1.9 MHz, and after the fiber and imaging lens, the laser power was measured to be 1.8 mW. Fig. 14 shows the distance accuracy in the multi-event histogramming mode, improved over that expected from the 560-ps bin size by spreading the pulse energy over bins and performing centroiding around the peak. As the linearity of the METDC depends primarily on clock intervals and not on delay cell matching, the integral nonlinearity (INL) and differential nonlinearity (DNL) performance is far better than the 4-bit level. Fig. 15 shows a map of INL/DNL across the array taken by uniformly illuminating the image with uncorrelated light and performing a code density test. The linearity in this mode exceeds the few bits expected from the 16 bins as the bin spacing is only dependent on clock edges within the shift register.
Fig. 13. Linearity in the single-shot TDC mode.
Fig. 15. Linearity in the multi-event histogramming mode.
Fig. 16 shows the operation of the in-pixel counter-based correlator with a threshold of 2 (a logic error prevents the characterization of the thresholds of four and eight). A 3.5-ns laser pulse is generated by an 840-nm Picoquant laser diode at 3.8 MHz. A controlled background light level is generated by a blue 450-nm LED operated with a 20-mA bias current, causing the SPAD to operate around 900 kcps (measured in the SPC mode). The TDC histograms are shown in the single-shot timestamping mode with the laser on in all cases for a single pixel with combinations of DCR and background, with and without a correlator. The high background swamps the laser return in Fig. 16(a), making it indistinguishable from background. In Fig. 16(b), the correlator then reduces the LED background by an order of magnitude and recovers the laser peak. In Fig. 16(c) and (d), the lower DCR level allows the laser peak to be seen, and the DCR is then entirely suppressed by the correlator. A study of the degree of rejection of uncorrelated noise sources versus the coincidence level setting is given in [14]. A 30-frames/s video of a subject waving 50 m away down a corridor under daylight conditions was taken with the previously mentioned 671-nm Picoquant laser setup using in-pixel histogramming (see Fig. 17). Under this operating condition, the sensor consumes 77.6 mW including the SPAD, GRO, and digital core. The images are post-processed in MATLAB software to calculate the centroid around the peak to generate real-time depth maps with low computational overhead. These calculations could readily be performed in FPGA or embedded on chip in a custom digital processor. Noisy pixels are removed by spatial filtering. Fig. 18(a) and (b) details two-step coarse-fine timings to allow the sensor to operate in a power-efficient manner to deliver video ToF images at >50-m distances. In the scheme of Fig. 18(a), the sensor uses only the multi-event histogramming mode.
A first exposure is taken with a coarse 16-ns bin size to determine the four MSBs of the target range. Power consumption and jitter can be reduced by operating the METDC with an ExtClk at 60 MHz and a laser rate of 3.9 MHz. In a second exposure, the laser rate is increased by 16× and the in-pixel GRO TDCclk generates the METDC clock tuned to 1 GHz. This determines the next four LSBs of the target range but aliases the background.
The second timing approach in Fig. 18(b) achieves the same disambiguation by applying the single-shot timestamping mode with the correlator prior to the short-range METDC step. Here, the power consumption and the accumulated jitter of the GRO can be minimized by operating with a STOP clock (60 MHz) at 16× laser repetition rate (3.9 MHz) and using the macro-time stamp feature to count STOP cycles. The second step is the same as in the first scheme using the METDC at 16 times the STOP rate. The METDC will alias photon returns outside the few meter bin ranges.
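The way the two exposures combine into one range estimate can be sketched as follows. The function name and the idealized bin sizes (exactly 16 ns and 1 ns, with no offset or bin mismatch between the two steps) are assumptions made for illustration only.

```python
C_M_PER_NS = 0.299792458  # speed of light in metres per nanosecond


def two_step_distance_m(coarse_bin: int, fine_bin: int,
                        coarse_bin_ns: float = 16.0,
                        fine_bin_ns: float = 1.0) -> float:
    """Combine a coarse (MSB) and a fine (LSB) histogram peak bin.

    Assumptions (idealized): the fine exposure spans exactly one coarse
    bin, both steps share the same time origin, and there is no mismatch
    between the bin sizes of the two clocks.
    """
    tof_ns = coarse_bin * coarse_bin_ns + fine_bin * fine_bin_ns
    return C_M_PER_NS * tof_ns / 2.0    # halve for the round trip


# Coarse peak in bin 5 (80 ns) refined by fine peak in bin 7 (7 ns).
print(f"{two_step_distance_m(5, 7):.3f} m")
```

With 16 fine bins of 1 ns, the second step covers a round-trip window of about 2.4 m, consistent with the few-meter aliasing range noted in the text.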
The power consumption in the corresponding modes is shown in Table II. Global ExtClk reduces power by a half compared with the same 60-MHz frequency generated by the GRO, while the correlator is able to reduce power by a third in the single-shot timestamping mode. A number of practical issues remain to be evaluated in terms of the real application of the proposed two-step scheme. A few are the "stitching" of the two-step images, associated motion blur, mismatch between the bin sizes, effect of noise, and errors in the MSB step on the LSB step, especially for moving targets. These require quite extensive modeling and experimental work, which is underway but beyond the immediate scope of this article. Fig. 19 shows the use of the global shutter histogramming mode to capture and freeze a fast motion. Peak centroids are calculated in real time using MATLAB software without any … Table III compares this sensor with other recent advanced SPAD sensors. Our sensor is distinguished by low power consumption, high sensitivity, and small pixel pitch, as well as by multi-mode imaging capability at a high DR.
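The peak centroiding used for the depth maps in this section can be sketched in a few lines. This is not the authors' MATLAB code: the function name, the ±1-bin window around the peak, and the use of the minimum bin value as a crude background estimate are all assumptions made for illustration.

```python
def peak_centroid_bin(hist, half_width=1):
    """Return the sub-bin position of the histogram peak.

    Assumptions: the background floor is estimated as the minimum bin
    value, and the centroid is taken over the peak bin and `half_width`
    neighbours on each side.
    """
    n = len(hist)
    peak = max(range(n), key=lambda i: hist[i])
    lo = max(0, peak - half_width)
    hi = min(n - 1, peak + half_width)
    background = min(hist)
    idx = range(lo, hi + 1)
    weights = [max(hist[i] - background, 0) for i in idx]
    total = sum(weights)
    if total == 0:
        return float(peak)              # flat histogram: no refinement
    return sum(i * w for i, w in zip(idx, weights)) / total


# A return that straddles bins 2 to 4 of a 16-bin histogram.
hist = [0, 0, 10, 30, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(peak_centroid_bin(hist))          # about 3.17 bins
```

Multiplying the sub-bin position by the bin duration gives a time estimate finer than one bin, which is how a depth resolution better than the 560-ps bin size can be recovered.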
IV. CONCLUSION
It has been demonstrated that a compact SPAD pixel can be designed in advanced 3-D-stacked manufacturing technologies, offering reconfigurable functionalities to reduce power and extend the DR, enabling fast 3-D imaging and flash LIDAR applications. Tradeoffs are made in the spatial and temporal resolutions, and resources are shared among SPADs, to achieve practical performance levels.
Istvan Gyongy received the M.Eng. and Ph.D. degrees from the University of Oxford, Oxford, U.K. Following a period in the industry, where he worked on processors for smartphones and on a cloud-connected activity tracking system for dairy farms, he joined The University of Edinburgh, Edinburgh, U.K., where he is currently a Research Fellow and is developing single-photon avalanche diode (SPAD) cameras and exploring applications in 3-D capture as well as in the life sciences.
Tarek Al Abbas (M'13) received the master's degree in analog electronics design and the Ph.D. degree in the design of CMOS single-photon avalanche diode (SPAD) image sensors from The University of Edinburgh, Edinburgh, U.K., in 2013 and 2019, respectively.
He is currently with the Pixel Design Team, Sense
