Time Delay Integration and In-Pixel Spatiotemporal
Filtering Using a Nanoscale Digital CMOS Focal Plane Readout
I. INTRODUCTION
The conventional approach to imaging applications may be characterized by an attempt to maximize the collection of information from an array of photodetectors and then pass that data to subsequent conversion, processing, and interpretation functions. This typically requires maximizing the effective number of bits and the readout frame rate, which are limited by the pixel dynamic range (DR), readout noise, and settling time. This paper describes a recent effort to leverage the availability of nanometer-scale CMOS technology to implement a smart focal plane architecture that makes more efficient use of the achievable signal-to-noise ratio (SNR), DR, and readout bandwidth by pushing the analog-digital boundary into the pixel.
This paper focuses on cooled long-wave infrared (LWIR) applications that use low-bandgap detectors. In conventional analog approaches, large integration capacitors are used to provide sufficient DR to accommodate the high backgrounds associated with these detectors. Since a significant percentage of the available DR is consumed by the background, the data from many subsequent frames often must be aggregated off the chip to overcome the detector shot noise. This aggregation must frequently be done in a way that mitigates what is often a relatively poor detector pixel yield. Our initial motivation was to eliminate this need for off-chip aggregation by incorporating a per-pixel analog-to-digital converter (ADC) that uses a digital counter that can roll over to accommodate high background currents. By allowing the digitized count value within each pixel to be shifted orthogonally, this design allows for image stabilization and various signal processing operations. Our goal has been to demonstrate that it is possible to obtain a significant effective readout bandwidth advantage while maintaining power, noise, and pixel pitch that are competitive with those of conventional analog cooled long-wave focal planes.
Multiple 256 × 256 30-μm-pitch digital focal plane array (DFPA) readout integrated circuit (ROIC) designs have been implemented to date, all using IBM 90-nm bulk CMOS technology. The most extensive testing has been performed on a DFPA design that was developed primarily to support digital time delay integration (TDI) operation, which is described in Section III. A model for the pixel-level behavior is presented and then applied to digital TDI and other exemplary DFPA applications.
II. ROIC ARCHITECTURE AND CIRCUIT PERFORMANCE

A. Per-Pixel Analog-to-Digital Conversion
A simple conceptualization of this digital ROIC architecture is illustrated in Fig. 1. A detector photocurrent is integrated onto a femtofarad-scale integration capacitor C_int that comprises the gate, diffusion, and interconnect parasitics of the integration node. This integration node is sensed by a comparator circuit with threshold voltage V_m, which triggers both a reset circuit and a bidirectional counter. The reset circuit has been successfully realized using both a charge subtraction and a reset-to-voltage approach. For the reset-to-voltage operation, the integration node is set to V_reset, thus making the number of signal electrons per count equal to s_e = C_int (V_m − V_reset)/q, where q is the charge of an electron. For example, if C_int is 1.4 fF and V_m − V_reset is 0.5 V, s_e is approximately 4375 electrons/count. This conversion gain may be improved by adjusting the reset level and represents the lowest signal that can be measured, a least significant bit (LSB). While useful for some LWIR applications, a much smaller LSB is required for general use. Lowering V_m − V_reset may allow an improvement by a factor of two to three before CMOS process variations make the approach impractical.
0018-9383/$26.00 © 2009 IEEE
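The LSB arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `electrons_per_count` is a label chosen here, and the half-swing case is an illustrative value that does not appear in the text.

```python
# Elementary charge in coulombs (exact SI value).
Q_E = 1.602176634e-19


def electrons_per_count(c_int_farads, v_swing_volts, q=Q_E):
    """Return s_e = C_int * (V_m - V_reset) / q in electrons per count."""
    return c_int_farads * v_swing_volts / q


# Values quoted in the text: C_int = 1.4 fF, V_m - V_reset = 0.5 V.
s_e_exact_q = electrons_per_count(1.4e-15, 0.5)
# The quoted 4375 electrons/count follows when q is rounded to 1.6e-19 C.
s_e_rounded_q = electrons_per_count(1.4e-15, 0.5, q=1.6e-19)
# Halving the voltage swing halves the LSB (illustrative, not from the text).
s_e_half_swing = electrons_per_count(1.4e-15, 0.25)

print(f"exact q:      {s_e_exact_q:.1f} electrons/count")
print(f"rounded q:    {s_e_rounded_q:.1f} electrons/count")
print(f"half swing:   {s_e_half_swing:.1f} electrons/count")
```

With the exact value of q the result is a few electrons below 4375, so the quoted figure appears to use q rounded to 1.6 × 10⁻¹⁹ C.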
Implementation of a low-capacitance integration node naturally gives rise to significant cross-die mismatch in capacitance ΔC_int and comparator threshold voltage ΔV_m. The effect of threshold voltage variation becomes more severe as ΔV_m becomes a larger fraction of V_m − V_reset, limiting the utility of reset level adjustment for achieving a small LSB size. This limitation first becomes apparent as gain nonuniformity, which affects the system DR and noise susceptibility, and then later as functional failure when particular comparators can no longer be triggered.
Two techniques have been investigated for mitigating these limitations due to mismatch. By using a charge subtraction approach as illustrated in Fig. 2, one may attempt to make
$$ s_e = \frac{Q_{\text{ref}}}{q} $$
where Q_ref is a reference charge packet. Counting subtracted reference charge packets instead of comparator transitions ideally removes sensitivity to ΔC_int and ΔV_m. In addition, the referencing of the integration node voltage swing to the comparator threshold and the gated oscillator architecture provide assurance that, for input currents that are smaller than a saturation current I_sat, the integration node voltage will always return to below the comparator threshold. I_sat approaches Q_ref · f_osc, where f_osc is the free-running frequency of the gated oscillator. Using this charge subtraction approach, we have successfully implemented pixels with conversion gains on the order of 1000 electrons/count, which is sufficiently small for our LWIR applications.
The second mismatch mitigation technique uses local pixel programmability. For reset-to-voltage operation, the comparator threshold or the reset voltage may be locally tuned to provide gain nonuniformity correction (NUC). For a charge subtraction approach, the size of the reference charge packet may be fine-tuned to achieve a similar effect. In Fig. 3(c), a pixel layout incorporating a 12-b local SRAM is shown. One function of this SRAM is to control a digital-to-analog converter (DAC) that tunes the ADC gain. This tunability is not only useful for compensating for CMOS variation but is also an effective technique for equalizing the effective detector response across the array, improving the system DR.
Noise performance of the pixel front end is derived in [1] . The total white noise contribution in terms of electrons per count is
where V_mr = V_m − V_reset, e_pa−white is the white noise component of the preamplifier noise (in V/√Hz), R_d is the dynamic impedance of the detector, τ is the average time per count, g_m is the transconductance of the comparator, k is Boltzmann's constant, and T is the temperature. The four terms correspond to the shot noise, reset (kT/C) noise, preamplifier noise, and comparator noise. For N_counts counts, the number of signal electrons scales proportionately, and the white noise contributions are averaged over multiple counts and longer integration times, so the white noise contribution in N_counts counts becomes
For integration time t_int and a time between samples of t_samp, we have respective photocurrent and detector 1/f noise contributions of
where α_1/f, α_gain_n@1Hz, and e_DI_n@1Hz correspond to the detector, comparator node, and injection or preamplifier 1/f noise contributions, respectively. The comparator node noise includes the input-referred noise voltage of the comparator itself and any 1/f noise in the voltage reset level. For charge subtraction operation, each comparison event is referenced (i.e., correlated) to the prior comparison, so the ROIC gain noise in that case arises from the reset charge packet generation circuit. The quantization noise in terms of the number of electrons is given by
Thus, the total noise in electrons for N_counts counts is
which, as illustrated in Fig. 4, is consistent with the measured single-frame data from a Lincoln DFPA ROIC. Because of this noise averaging over the number of counts, it is possible to achieve background-limited detection with a small integration capacitor. Note that, once digitized, the remainder of the readout process is noiseless and that, by using TDI techniques, the noise contribution of multiple pixels may be averaged, thus reducing not only white noise but also low-frequency noise that is not correlated between pixels.
B. Register Implementation
Two approaches to register implementation have been realized in Lincoln Laboratory DFPA designs. A compact 12-b dynamic linear feedback shift register (LFSR) with serial orthogonal transfer was used to obtain a 15-μm pixel footprint, as illustrated in Fig. 3(a). For the LWIR application of interest, the background pedestal resulted in a sufficiently high count rate to maintain valid count data. Excessive static power dissipation in pixels with low input current due to dynamic node leakage, the high activity factor (0.5) of the LFSR, and the serial data transfer resulted in a less than optimal power dissipation of 800 mW for a 100-Hz full-frame readout. A modified 16-b register implementation based on an asynchronous static ripple counter with parallel data transfer resulted in a 75-mW measured total ROIC power dissipation for similar operational conditions. A substantial fraction of this figure can be attributed to static power dissipation. A similar implementation in a 90-nm CMOS process variant optimized for portable electronics further reduced the measured power dissipation by about a factor of two.
For purposes of power analysis, it is convenient to consider the energy per count as a key figure of merit. For example, assume a Q_ref of 5000 electrons and a flat-field input photocurrent of 20 nA/pixel. Dividing the photocurrent by the reference charge gives a 25-MHz pulse-frequency modulator (PFM) oscillation frequency. Assume that the ADC front end consumes 10 fJ/count and that each register bit transition consumes 10 fJ. For a ripple counter, which has, on average, two transitioning bits per increment, we obtain a total energy per count E_count of 30 fJ. This number is consistent with the measured values for a 1-V supply rail V_DD. Multiplying by the PFM frequency and the total number of pixels gives a projected counter power dissipation of about 50 mW. Note that the same set of assumptions applied to a 16-b LFSR projects a threefold increase in the counter power dissipation. Power dissipation scales as V_DD^2 and may be significantly reduced by the adoption of low-voltage and low-swing design techniques.
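The energy-per-count estimate can be reproduced numerically. The sketch below uses only the figures quoted in this paragraph; the function names are labels chosen here, and the LFSR case assumes that an activity factor of 0.5 across 16 bits means eight bit transitions per count, which is one reading of the text rather than a stated formula.

```python
# Elementary charge in coulombs (exact SI value).
Q_E = 1.602176634e-19


def pfm_frequency(photocurrent_a, q_ref_electrons):
    """Count rate: photocurrent divided by the reference charge packet."""
    return photocurrent_a / (q_ref_electrons * Q_E)


def energy_per_count(e_frontend_j, e_transition_j, transitions_per_count):
    """Front-end energy plus register energy for one count."""
    return e_frontend_j + transitions_per_count * e_transition_j


def counter_power(e_count_j, f_pfm_hz, n_pixels):
    """Total counting power for an array of identical pixels."""
    return e_count_j * f_pfm_hz * n_pixels


f_pfm = pfm_frequency(20e-9, 5000)                   # about 25 MHz
e_ripple = energy_per_count(10e-15, 10e-15, 2)       # ripple counter: 2 bits/count
e_lfsr = energy_per_count(10e-15, 10e-15, 0.5 * 16)  # assumed: 8 bits/count
p_ripple = counter_power(e_ripple, f_pfm, 256 * 256)

print(f"PFM frequency:       {f_pfm / 1e6:.1f} MHz")
print(f"E_count (ripple):    {e_ripple * 1e15:.0f} fJ")
print(f"LFSR / ripple ratio: {e_lfsr / e_ripple:.1f}")
print(f"Counter power:       {p_ripple * 1e3:.0f} mW")
```

Under these assumptions the ratio of LFSR to ripple-counter energy comes out at three, matching the threefold increase stated above.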
Most of the pixel area is consumed by registers and digital logic, and, thus, the design benefits from Moore's law feature size and supply voltage scaling. A detailed discussion of the achievable pixel size versus technology node is provided in [1].
C. Data Transfer and Readout
A simplified version of the top-level data transfer structure is illustrated in Fig. 5. Data may be orthogonally shifted by way of a local orthogonal transfer multiplexer within each pixel. For a parallel shift implementation, a slow shift clock may be used throughout the array, and the energy per shift is approximated using an activity factor of α_pixelData = 0.5 for the local transfer. Anticipated image statistics may be used to refine this pessimistic estimate, which assumes uncorrelated randomized pixels. On a global scale, a second activity factor of α_loadData = 0.5 arises if we shift a constant value into the array to preload for the next frame (if we preload a 2-D array of NUC offset values, then this second activity factor may be significantly higher). For orthogonal transfer operations, the wire load on the register output is significantly higher than that for a counting operation, so it is reasonable to assume an energy E_shift of 100 fJ per register per shift for a 1-V supply rail V_DD. If N_X is the array size in the shift direction, N_Y is the orthogonal array dimension, N_bits is the number of register bits, and α_clk is one plus the clock overhead, shifting of array contents by a single pixel requires an energy E_shiftArray = α_pixelData α_clk N_X N_Y N_bits E_shift. To read out the entire array requires energy
If serialization of the output is performed by a set of N_taps-wide peripheral shift registers that first perform a parallel shift of the output data into an N_bits-b serializer, then we obtain two additional terms so that
For a 256 × 256 16-b array with 50% clock overhead reading out to four taps at 100 frames/s, readout requires 1.3 mW of shift power plus the power that is necessary to perform any desired encoding and to drive the ROIC output pins. As is the case for E_count, E_shift can be expected to scale as V_DD^2 and to decrease with node capacitance. Architectural modifications, such as binary tree serialization, may also provide useful power savings, as the higher shift rates in the serializer structure may preclude some of the low-voltage options that are suitable for the optimization of the array core.
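As a small worked example, the single-pixel array shift energy defined in this subsection can be evaluated for the 256 × 256, 16-b case. The sketch covers only that one expression; it does not attempt the full-readout or serializer terms, and the function name is a label chosen here.

```python
def shift_array_energy(alpha_pixel_data, alpha_clk, n_x, n_y, n_bits, e_shift_j):
    """E_shiftArray = alpha_pixelData * alpha_clk * N_X * N_Y * N_bits * E_shift."""
    return alpha_pixel_data * alpha_clk * n_x * n_y * n_bits * e_shift_j


# Values from the text: activity factor 0.5, 50% clock overhead (alpha_clk = 1.5),
# 256 x 256 pixels, 16 bits, 100 fJ per register per shift at V_DD = 1 V.
e_one_shift = shift_array_energy(0.5, 1.5, 256, 256, 16, 100e-15)

print(f"{e_one_shift * 1e9:.1f} nJ to shift the whole array by one pixel")
```

The result is on the order of tens of nanojoules per one-pixel shift of the whole array, which is why many shifts per frame remain affordable at a 100-Hz frame rate.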
III. TDI APPLICATIONS
Digital TDI is conceptually similar to charge-coupled-device (CCD)-based analog TDI [2], but, instead of transferring charge in synchrony with an optically scanned image, the digital case transfers a binary representation of the integrated charge. Assume that the image is scanned at a rate of 1/t_dwell pixels/s so that, at time t_0, the image ξ(i, j) is incident on the array, where i and j are pixel indexes; at time t_0 + t_dwell, the image ξ(i − 1, j) is incident on the array; and so on. For each time interval, the contents of each pixel's digital register are incremented in proportion to the received photocurrent. The register contents are then shifted by one pixel in the scan direction, and the data from the column with the highest i index, N_X − 1, are read out. Thus, the spatial x-axis of the image data is represented as a time sequence in the output data, and the total integration time for each output image pixel approaches N_X · t_dwell.
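The scan-and-shift sequence can be illustrated with a minimal one-row simulation. This is a sketch under simplifying assumptions, not a model of the actual ROIC: counts are ideal integers, each stage has a hypothetical integer gain (0 models a dead detector), and the scene is indexed so that element m sits over stage i = k − m at step k.

```python
def tdi_scan(scene, gains):
    """Simulate one row of digital TDI.

    scene: counts each scene element would produce per dwell time
    gains: per-stage integer response; 0 models a dead pixel
    Returns one accumulated value per scene element.
    """
    n_stages = len(gains)
    registers = [0] * n_stages
    outputs = []
    for k in range(len(scene) + n_stages - 1):
        # Integrate: stage i currently sees scene element k - i.
        for i in range(n_stages):
            m = k - i
            if 0 <= m < len(scene):
                registers[i] += gains[i] * scene[m]
        # Read out the last stage, then shift every register one stage along.
        outputs.append(registers[-1])
        registers = [0] + registers[:-1]
    # The first n_stages - 1 readouts occur before any scene element
    # has reached the last stage, so they are discarded.
    return outputs[n_stages - 1:]


scene = [3, 0, 7, 1]
uniform = tdi_scan(scene, [1] * 8)
one_dead = tdi_scan(scene, [1, 1, 1, 0, 1, 1, 1, 1])
print(uniform)   # each element is seen by all 8 stages
print(one_dead)  # a dead stage lowers the gain but leaves no hole
```

Every output value is the sum over all stages, so a single dead stage reduces the overall gain slightly instead of producing a missing pixel in the output image.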
Using the orthogonal transfer structure, it is straightforward to implement a 256-stage TDI-mode camera utilizing a 256 × 256-format digital ROIC constructed with this architectural approach. To demonstrate the TDI capability of the Lincoln DFPA ROIC, a tripod-mounted camera with a spinning polygon 1-D scan mirror was constructed. A DFPA ROIC was hybridized to an LWIR detector with an approximate cutoff wavelength of 11 μm. Fig. 6(a) shows an example 5000 × 256 raw TDI image of the Boston skyline, which was successfully captured at frame rates of up to 155 Hz, corresponding to a sustained pixel output rate of 200 megapixels/s. Because the ROIC is implemented in an aggressive digital CMOS process, the architecture is readily compatible with state-of-the-art high-speed serializer circuits that offer data rates in excess of 10 Gb/s. Full resolution and sensitivity are maintained across the entire scene, and, as is apparent from the image of an unresolved target in Fig. 6(d), it is possible to maintain the image point spread function (PSF) after 256 stages of TDI. The TDI operation averages out variations in detector responsivity and thus gives a highly uniform image. Fig. 6(c) shows the pixel response map for the same LWIR detector array. This HgₓCd₁₋ₓTe (MCT) array was obtained as a discarded residual that had inadequate operability. Note that there are many bad pixels and clusters, including a variety of disconnected, nonresponsive, high-dark-current, and noisy detectors. All of these operability concerns are mitigated by the TDI operation and the high pixel DR, yielding an image in which every pixel contains valid data and in which single pixel failure does not compromise the full image. Note also from the enlarged image in Fig. 6(b) that detector nonuniformities can easily be mitigated.
For a 16-b DFPA design, the entire DR of the counter may contain signal data, allowing data collection that would require an equivalent analog focal plane with a well depth of many billion electrons, a difficult task in a comparably sized analog pixel cell. The DFPA has been successfully operated in this imaging mode at temperatures approaching 100 K while hybridized to MCT LWIR detectors specified for 60 K to 80 K operation. The excess dark current is mitigated by the counter rollover capability, and the corresponding increased shot noise is mitigated by the TDI averaging.
In a CCD, TDI is most often associated with "noiseless" signal aggregation that allows the readout noise contribution and ADC operation to be applied to the larger aggregate signal [2]. If one can indeed assume a readout-noise-dominated design with low detector, injection, and transfer noise, then the signal is improved by the number of TDI stages, while the readout noise remains constant. In fact, however, noise contributions arising from the CCD transfer process, while insignificant for low-background applications, become more significant when the aggregate signal includes a background pedestal (as is the case for longer wave infrared) or when an extremely large number of TDI stages is used, and detector shot noise begins to become more dominant as well [2], [3]. If we assume the same readout (or conversion) noise domination as for a TDI CCD imager, then for TDI operations based on this digital ROIC, there is a noise benefit that is proportional to the square root of the number of TDI stages. Note that, for background-limited applications, both the CCD and digitized approaches have the same trend of square root improvement, but the digitized approach eliminates any transfer noise contribution. The digital TDI approach is also compatible with photon-counting pixel architectures [4], [5].
Unlike a simple increase in staring-mode integration time, uncorrelated low-frequency noise and detector nonuniformity are also averaged. However, because an anticorrelation exists between the quantization noise in a particular pixel and its trailing neighbor and because a shift operation typically occurs within a nanosecond-scale duration, as compared to a much longer microsecond-scale count period, a residual charge due to quantization of a leading pixel is essentially emptied into a trailing pixel, thereby imparting a spatial frequency dependence to the quantization noise components.
In TDI operation, counting operations are nearly continuous, and the count power may be obtained as previously described. For every 256-pixel column, one array shift and one serializer operation are required, giving an energy per output column of
For a pixel readout rate of f_readout, one must output f_readout/(N_Y · N_taps) columns/s. Thus, the readout power dissipation is given by
, E_shift = 100 fJ, and f_readout = 200 megapixels/s, we get a projected readout power dissipation of 17 mW, which may be added to the count power and the input/output power dissipation. Note that, for TDI operation, the readout power begins to become a more significant fraction of the total power than is the case for staring-mode operation. Recall that the value of E_shift = 100 fJ assumes that V_DD = 1 V and that both the count and shift power scale with V_DD^2. If the circuit must be operated at a higher supply voltage, the total power dissipation may increase into hundreds of milliwatts.
Architectural improvements to the output circuit, such as binary tree serialization, can reduce the readout power. In addition, since the counting and array shift operations occur at a very low frequency, the array core may be designed to use a low-voltage island. With these two improvements, the power dissipation is given by
so, for V_DD_core = 0.7 V and V_DD_periphery = 1.5 V, one obtains a readout power of 11 mW. The count power is also reduced by about a factor of two. This suggests an opportunity for scalability to very large format TDI ROICs.
In addition to linear TDI, a known platform or target jitter may be used to boost the SNR. In Fig. 7 (left), a dim target with known motion with respect to the imager platform is viewed with a 1-ms integration time, resulting in a maximum signal level of a few hundred counts (the target was created using a scene projection system). In Fig. 7 (right), this 1-ms integration time is replaced with a 100-ms total integration time in which the orthogonal transfer capability is used to move digitized values along with the target signal for half of the total integration time. For the remaining 50 ms, the object is allowed to exit the field of view, and the counters are configured to count down, thus subtracting the background pedestal, resulting in a nonuniformity-corrected high-SNR image. Because the target object is imaged by multiple detectors, the imaging operation is robust to array nonuniformities, inoperable pixels, and local low-frequency noise.
IV. SPATIAL AND TEMPORAL FILTER OPERATIONS
A. Spatial Filter Operations
By combining orthogonal transfer capability, bidirectional integration, and variable integration times, a wide range of linear static kernel filters may be implemented [6] . Fig. 8(a) shows a high-pass filtered image obtained in real time using the filter kernel shown in Fig. 8(b) . The imaging process was set up as follows. First, the array was reset, and the scene was integrated for eight time units. Negative integration was then used for eight subsequent integration periods, each one time unit in duration. Before each of these shorter negative integration periods, the data were orthogonally shifted by one pixel, first up, then left, then down, etc. Because the orthogonal shift can occur much faster than the object motion, the end result is a static filter operation that may be read out with no further off-chip processing.
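The integrate-shift-subtract sequence can be checked against a direct kernel evaluation. The sketch below makes several assumptions that are not in the text: the kernel of Fig. 8(b) is taken to be a center weight of 8 surrounded by eight weights of −1, the shift path after "up, left, down" is assumed to continue as a loop around the eight neighbors, shifts wrap around at the array edges for simplicity, and counts are ideal integers.

```python
def shift(regs, dr, dc):
    """Move every register value by (dr, dc) pixels, wrapping at the edges."""
    rows, cols = len(regs), len(regs[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[(r + dr) % rows][(c + dc) % cols] = regs[r][c]
    return out


def integrate(regs, scene, units, sign):
    """Count up (sign=+1) or down (sign=-1) for the given number of time units."""
    return [[regs[r][c] + sign * units * scene[r][c]
             for c in range(len(scene[0]))] for r in range(len(scene))]


# Assumed shift path: up, left, down, down, right, right, up, up.
PATH = [(-1, 0), (0, -1), (1, 0), (1, 0), (0, 1), (0, 1), (-1, 0), (-1, 0)]


def in_pixel_high_pass(scene):
    """Eight units of positive integration, then eight shifted negative units."""
    regs = [[0] * len(scene[0]) for _ in scene]
    regs = integrate(regs, scene, 8, +1)
    for dr, dc in PATH:
        regs = shift(regs, dr, dc)
        regs = integrate(regs, scene, 1, -1)
    return regs


def direct_high_pass(scene):
    """Reference: 8 * center minus the sum of the eight neighbors (wrapped)."""
    rows, cols = len(scene), len(scene[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbors = sum(scene[(r + dr) % rows][(c + dc) % cols]
                            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                            if (dr, dc) != (0, 0))
            out[r][c] = 8 * scene[r][c] - neighbors
    return out


scene = [[(3 * r + 5 * c * c) % 11 for c in range(5)] for r in range(4)]
filtered = in_pixel_high_pass(scene)
# The assumed path ends one pixel up and one pixel right of where it started,
# so the in-pixel result equals the reference displaced by that offset.
reference = shift(direct_high_pass(scene), -1, 1)
print(filtered == reference)
```

With this particular path the filtered image ends up displaced by one pixel diagonally; a real sequence could add return shifts, at the cost of extra shift energy.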
Static kernel filter operations are sensitive to array nonuniformities and pixel operability problems. Some of these limitations are addressed by in-pixel gain nonuniformity compensation and bad pixel mitigation, which are currently a focus of investigation for Lincoln DFPA designs. Power dissipation may be estimated using the array shift expression E_shiftArray = α_pixelData α_clk N_X N_Y N_bits E_shift, multiplied by the total number of shifts needed to implement the filter kernel. This is simply multiplied by the frame rate and added to the staring-mode power dissipation. The computation itself carries no additional power penalty (this example required a total ROIC power of only 75 mW), but half of the integration time is consumed by the negative integration process. However, the data volume may be significantly reduced, thus reducing the system-level readout bandwidth requirements.
In the pixel illustrated in Fig. 3(c), the 16-b register bank may be operated as a pair of independent 8-b register banks. This allows for more sophisticated real-time image processing. For example, a Roberts cross operation [3] may be applied using the following kernels in the two register banks:
$$ \begin{bmatrix} +1 & 0 \\ 0 & -1 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 0 & +1 \\ -1 & 0 \end{bmatrix} $$
Because the register banks operate independently, this only requires that the total integration time be divided into four parts. Upon readout, the edge gradient magnitude at each pixel of the image may be easily computed by adding the values in the two register banks in quadrature (or approximated by simply adding their magnitudes). At each pixel, the angle of orientation of the edge giving rise to the local gradient may be found using the expression tan⁻¹(U_bank2/U_bank1) − (3π/4), where U_bank1 and U_bank2 are the contents of the two register banks. It would be straightforward to incorporate onto a future ROIC the capability to perform these two simple computations and to output only the data exceeding a particular gradient threshold. Thus, it is possible to realize a reduction in the output bit depth, a reduction in the number of pixels for which values must be read out, and a reduction in the complexity of downstream processing electronics.
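The two post-readout computations can be sketched as follows. The example assumes the standard Roberts cross kernel pair, [[+1, 0], [0, −1]] for the first bank and [[0, +1], [−1, 0]] for the second, evaluates the bank contents directly from an image instead of simulating the shifts, and uses `math.atan2` in place of a plain arctangent of the ratio so that a zero in the first bank does not cause a division by zero. All function names are labels chosen here.

```python
import math


def roberts_banks(image):
    """Return the contents of the two register banks for an input image."""
    rows, cols = len(image), len(image[0])
    bank1 = [[image[r][c] - image[r + 1][c + 1] for c in range(cols - 1)]
             for r in range(rows - 1)]
    bank2 = [[image[r][c + 1] - image[r + 1][c] for c in range(cols - 1)]
             for r in range(rows - 1)]
    return bank1, bank2


def gradient(u_bank1, u_bank2):
    """Quadrature magnitude, sum-of-magnitudes approximation, and edge angle."""
    magnitude = math.hypot(u_bank1, u_bank2)
    approx = abs(u_bank1) + abs(u_bank2)
    angle = math.atan2(u_bank2, u_bank1) - 3 * math.pi / 4
    return magnitude, approx, angle


# A vertical step edge: dark on the left, bright on the right.
image = [[0, 0, 10, 10] for _ in range(3)]
bank1, bank2 = roberts_banks(image)
magnitude, approx, angle = gradient(bank1[0][1], bank2[0][1])

print(bank1[0], bank2[0])
print(f"magnitude {magnitude:.2f}, approximation {approx}, angle {angle:.3f} rad")
```

Only the bank entries that straddle the step are nonzero, which is the property a gradient threshold on readout would exploit to reduce the output data volume.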
B. Temporal Filter Operations
Many applications require high-speed tracking and velocimetry capabilities. The bidirectional integration capability of this DFPA architecture allows it to perform many useful motion detection and measurement operations without requiring a high readout frame rate. Fig. 9 shows one example. The same LWIR digital focal plane component and tripod camera used in the TDI example in Fig. 6 were set up to view a .357 Magnum gunshot at a firing range. The camera was configured so that each frame would consist of 40 immediately consecutive 200-μs integration periods prior to the readout. The counter was configured to count up for the odd integration periods and down for the even integration periods. After this 8-ms total integration time, the frame was read out, and the process was repeated.
Fig. 9. On-chip velocity filter was applied to measure the track and velocity of the bullet and reject scene clutter. The raw DFPA output data are displayed, and no off-chip processing was required.
As shown in Fig. 9 , fast-moving objects in the scene produced an alternating white and black signature. The bullet track was thus imaged as a dashed line with the spatial period proportional to the bullet velocity. From this track, the bullet speed was computed to be 395 m/s, which is consistent with the ballistic data. The heat wake produced by the muzzle flash is also captured, providing useful fluid velocimetry data that would conventionally require a frame rate of several kilohertz. Stationary and slow-moving objects are not detected, once again providing an inherent data reduction benefit. The filtered image in Fig. 9 was output from the DFPA as a raw picture in a 100-Hz sequence without any need for further off-chip processing. The ROIC power dissipation was almost identical to that for a standard staring-mode configuration, which is less than 100 mW.
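A one-dimensional toy model shows why the alternating count direction rejects static clutter and leaves a dashed track. It is a sketch under idealized assumptions (integer counts, a point target moving a whole number of pixels per integration period, hypothetical names). The 0.158-m spatial period in the last lines is simply the value implied by the reported 395 m/s and the 200-μs period, used as a consistency check and not a measurement from the paper.

```python
def velocity_filter(background, target_counts, pixels_per_period, n_periods):
    """Alternate up/down counting over n_periods integration periods."""
    regs = [0] * len(background)
    for k in range(n_periods):
        sign = 1 if k % 2 == 0 else -1  # periods 1, 3, ... count up
        # Static scene content adds in one period and subtracts in the next.
        for i, b in enumerate(background):
            regs[i] += sign * b
        # The target crosses pixels_per_period fresh pixels in each period.
        start = k * pixels_per_period
        for i in range(start, start + pixels_per_period):
            if i < len(regs):
                regs[i] += sign * target_counts
    return regs


def speed_from_dash_period(spatial_period_m, t_period_s):
    """One bright plus one dark dash spans two integration periods."""
    return spatial_period_m / (2 * t_period_s)


background = [5, 9, 2, 7, 7, 1, 0, 3, 8, 4, 6, 2]
track = velocity_filter(background, target_counts=7, pixels_per_period=3,
                        n_periods=4)
print(track)

speed = speed_from_dash_period(0.158, 200e-6)
print(f"{speed:.0f} m/s")
```

The stationary background cancels exactly because the number of periods is even, while the mover leaves dashes whose length in pixels equals its displacement per period.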
V. SCALABILITY TO LARGE FORMATS
Regardless of the application, it is often desirable to scale the ROIC to large formats. A historical concern surrounding digital pixels is excessive power dissipation. To illustrate the implications of the previously discussed power expressions, Fig. 10 includes a plot showing the total ROIC power dissipation for two cases. The first case, which approximates the DFPA ROIC demonstrated in the examples presented here, assumes that E_count is 30 fJ/count. This is consistent with a 90-nm-node, 16-b, ripple-counter-based pixel operating at V_DD = 1 V. E_shift is 100 fJ/transition, which is consistent with the wire load capacitance for a 30-μm-pitch pixel. The second case assumes that the same CMOS technology is used but that the core voltage is reduced to 0.7 V and the pixel pitch is reduced to 20 μm, giving an E_count of 15 fJ/count and an E_shift_core of 40 fJ/transition. Note that part of the decrease in E_shift_core is due to the reduced orthogonal transfer wire load. The supply voltage for the periphery is assumed to be 1.5 V, thus giving an E_shift_periphery of 180 fJ/transition. Other parameters are as discussed in Section II, including the 25-MHz mean PFM oscillation frequency, four output taps, and the 100-Hz frame rate. A square-format array is assumed, with the size of each side plotted on the abscissa from 32 to 2048. The design point demonstrated experimentally in a Lincoln DFPA is indicated on the plot.
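The second-case energies appear to follow from the quadratic supply-voltage scaling discussed earlier, and a short script makes that reading explicit. This is an interpretation and not a formula given in the paper: it assumes the 15-fJ count energy is the 30-fJ figure scaled from 1 V to 0.7 V, and that the 180 fJ/transition figure quoted for the 1.5-V periphery is the 40-fJ core shift energy scaled from 0.7 V to 1.5 V. The count-power function covers the counting term only and omits readout and static power.

```python
def scale_energy(e_ref_j, vdd_ref, vdd_new):
    """Scale a switching energy with the square of the supply voltage."""
    return e_ref_j * (vdd_new / vdd_ref) ** 2


def count_power(n_side, e_count_j, f_pfm_hz=25e6):
    """Counting power for a square n_side x n_side array (counting term only)."""
    return e_count_j * f_pfm_hz * n_side * n_side


e_count_low_v = scale_energy(30e-15, 1.0, 0.7)        # compare with quoted 15 fJ
e_shift_periphery = scale_energy(40e-15, 0.7, 1.5)    # compare with quoted 180 fJ

print(f"E_count at 0.7 V:      {e_count_low_v * 1e15:.1f} fJ")
print(f"Shift energy at 1.5 V: {e_shift_periphery * 1e15:.0f} fJ")

for n_side in (32, 256, 2048):
    p1 = count_power(n_side, 30e-15)
    p2 = count_power(n_side, 15e-15)
    print(f"{n_side:5d} x {n_side:<5d} case 1: {p1 * 1e3:9.2f} mW   "
          f"case 2: {p2 * 1e3:9.2f} mW")
```

Both scaled values land within a few percent of the quoted figures, and the counting term grows with the pixel count, reaching the watt range at the largest format considered.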
VI. CONCLUSION
A DFPA architecture that combines detection, ADC, and signal processing electronics into a single massively parallel architecture has been developed and applied to several interesting applications. Significant size, weight, power, and imaging performance benefits have been realized for TDI, stabilization, and signal processing modes of operation. While demonstrations have focused on the midwave infrared and LWIR wavebands, aspects of the architecture may prove useful across the optical spectrum. Because the design is predominantly digital, it stands to benefit in pixel pitch, scalability to large formats, power performance, and maximum readout bandwidth with advances in industry-standard digital CMOS technology. It is expected that future developments in such pixel processing imagers will require imaging system designers to revisit design tradeoffs in light of new opportunities for SNR improvement and data reduction early in the signal chain.
ACKNOWLEDGMENT
Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.
