Abstract. Architectures for focal plane image processing are discussed. On-chip image preprocessing for solid-state imagers using analog CCD circuits is described for low, medium, and high density detector arrays. A spatially parallel architecture for low density, high throughput applications is described. For sparse illumination or event detection, a content-addressable architecture is proposed. A new pipelined vector pixel processor architecture for medium density infrared staring focal plane arrays is described. Neighborhood reconstruction during serial readout of high density TV-quality imagers for a pixel processor is considered using delay and analog frame memory techniques. The potential of on-chip read/write analog frame memory for image transformation and frame-to-frame processing is discussed.
1. INTRODUCTION
Solid-state imaging devices have evolved rapidly in the past five years, and this trend is expected to continue. The consumer market for home video cameras and the industrial market for machine vision and security cameras have provided a strong driving force for this evolution. Technology derived from aerospace research and development in the area of visible and infrared imagers has further pushed the state of the art.1-3
Paper 2618 received Aug. 22, 1988; revised manuscript received March 14, 1989; accepted for publication March 14, 1989. ©1989 Society of Photo-Optical Instrumentation Engineers.
Although the pace in computer electronics has also been brisk, technology for processing electronic images acquired by solid-state devices has been lagging and now represents a severe bottleneck for real-time image processing. Television quality imagers capture images containing between 200,000 and 300,000 pixels per frame at frame rates of 30 Hz. Thus, a real-time image processor must contend with data rates of the order of 10 Mpixels/s. For simple image processing (e.g., edge detection), approximately 100 elementary operations such as addition and subtraction must be performed per pixel, resulting in a corresponding throughput requirement of 1 billion operations per second (Bops). Machine vision applications may require higher frame rates and more sophisticated processing, while higher definition imagers require more pixels to be processed, making the throughput requirement on the image processor potentially climb to 100 Bops. Such throughput is difficult to achieve in a general-purpose digital computer without massively parallel processing.
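The throughput estimate above is simple arithmetic, and a short Python sketch makes the scaling explicit. The function name `required_ops_per_second` is an illustrative assumption; the pixel counts, frame rate, and operation count are the figures quoted in the text.

```python
def required_ops_per_second(pixels_per_frame, frame_rate_hz, ops_per_pixel):
    """Return (pixel rate, operation rate) a real-time processor must sustain."""
    pixel_rate = pixels_per_frame * frame_rate_hz
    return pixel_rate, pixel_rate * ops_per_pixel


# TV-quality imager: 200,000 to 300,000 pixels at 30 Hz, about 100 operations per pixel.
for pixels in (200_000, 300_000):
    pixel_rate, op_rate = required_ops_per_second(pixels, 30, 100)
    print(f"{pixels} pixels: {pixel_rate / 1e6:.0f} Mpixels/s, {op_rate / 1e9:.1f} Bops")
```

The upper end of this range is what the text rounds to 10 Mpixels/s and 1 Bops.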
Recently, some special-purpose digital integrated circuits have been proposed to relieve this bottleneck.4-6 These circuits generally consume between 500 and 1500 mW of power, and most are restricted in function. Their digital nature requires high speed analog-to-digital (A/D) conversion of the pixel data between the imager output and the digital processor input, which is often a major source of power consumption. Most digital processors operate on a subset of the image data, and additional circuitry to feed the processor pixel blocks is required. Digital storage of a single frame of imagery for frame-to-frame operations or for buffer memory requires approximately 2 to 4 Mbits, with consequent real-estate and power consumption. Thus, a complete digital image processor system may require dozens of ICs and several printed circuit boards of electronics.
Analog image processing, which occurs prior to A/D conversion, has several advantages. These include lower power consumption, lower real-estate consumption, and no A/D converter. Some approaches for implementing analog image processing with CCDs have been explored, with processing occurring either in separate ICs or on the same IC as the imager itself.7-15 In general, analog approaches to image processing are perceived to suffer from more limited accuracy, from design and fabrication complexity, and from the dynamic nature of CCDs. In practice, these perceived problems are not particularly critical, and future image processing circuits may combine analog and digital functions in a CCD/CMOS process.
In this paper, architectures for image processing circuits located on the same chip as the imager itself, or hybridized with the imager, are considered. Such focal plane image processing* has a high potential for achieving high throughput with low power and chip-area consumption. The choice of architecture depends on the detector array density, the total number of pixels, the frame rate, and the processing complexity required by the application. Imager readout technology is described first, followed by discussions of architectures for low, medium, and high density detector arrays.
*This is actually image plane image preprocessing, but the term "focal plane array" has become an accepted part of the technical language.
2. IMAGER READOUT TECHNOLOGY
Technology currently employed for detector readout is central to a discussion of focal plane image processing architectures. Currently, there are three major readout or multiplexer technologies employed for solid-state imager readout.1 Multiplexers deliver pixel data one row at a time, and within each row, column positions are scanned horizontally. During the readout, the pixel data are shielded from light either by use of a fast frame-transfer operation or by interline transfer, and by the use of a light shield. In the first readout technology, the CCD approach, row data are shifted in parallel in the columnar direction by use of a slow CCD analog shift register. The bottom row is loaded, in parallel, into a serial CCD shift register that, in turn, shifts the data in the row direction to an output amplifier, as shown in Fig. 1. There is a tradeoff in the CCD approach between fill factor (percentage of the pixel unit cell used for detection) and overall chip size. Chip size is important from an IC manufacturing and packaging perspective, but imager performance is improved with increasing fill factor. Interline transfer yields a smaller chip size but a relatively poor fill factor; frame transfer improves fill factor at the expense of chip size.
The CCD multiplexer delivers the charge generated within each unit cell to the output amplifier by successive transfers. The charge transfer efficiency (CTE) is defined as the fraction of charge successfully transferred in each transfer process. A charge packet in a four-phase CCD might undergo 3000 to 4000 transfers before reaching the output amplifier, so a chip with a CTE of 99.995% would deliver only about 82% to 86% of the original charge packet to the output amplifier (and portions of other charge packets as well). This is adequate in most applications. Noise is introduced in the charge transfer process, but in a buried-channel CCD the multiplexer noise is usually smaller than the background photon shot noise.
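The effect of repeated transfers can be checked numerically. The sketch below assumes that every transfer retains the same fraction of charge independently, so that the fraction surviving n transfers is CTE raised to the power n; the function name `retained_fraction` is illustrative. For a CTE of 99.995% and 3000 to 4000 transfers this evaluates to roughly 0.82 to 0.86.

```python
def retained_fraction(cte, transfers):
    """Fraction of the original charge packet left after `transfers` transfers.

    Assumes each transfer independently retains the fraction `cte`.
    """
    return cte ** transfers


# CTE of 99.995% and the 3000 to 4000 transfers quoted for a four-phase CCD.
for n in (3000, 4000):
    print(f"{n} transfers: {retained_fraction(0.99995, n):.3f}")
```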
The second readout technology is the MOS X-Y or direct readout approach, in which pixels are addressed like digital bits in a random access memory. Each pixel is individually addressed, and the analog charge packet is placed on a global read line connected to an output amplifier. The capacitance of the read line may be large, making recovery of the pixel signal difficult. The resultant dynamic range is comparable to surface-channel CCD multiplexers but generally poorer than buried-channel CCD multiplexers. However, the MOS X-Y readout circuitry is easier to design and simpler to fabricate with high yield than are CCD circuits. The full random access capability of the direct readout architecture is generally not utilized in imaging systems, and the array is scanned in a rasterlike manner (usually by on-chip scanner circuits). The ability to focus the readout in a particular subregion of the array is interesting but so far unexploited.
The third readout architecture is the most recent and is a hybrid of the previous two approaches. The MOS-CCD approach uses direct readout for rows but a CCD serial shift register (loaded in parallel with the row data) to perform column multiplexing. The high read line capacitance of the MOS X-Y multiplexer is reduced, and clocking is simplified. The transfer efficiency of the read-line-to-CCD bucket process is typically 95%, with overall transfer efficiency further reduced by the serial CCD shift register. This architecture is shown schematically in Fig. 2.
In general, solid-state imagers for scientific applications have a single source-follower output stage. Commercially oriented imagers may include on-chip sample and hold circuitry and drive amplifiers. Video output for images that are digitally processed is sent to off-chip scaling amplifiers and then converted to digital format using an A/D converter. On-chip A/D conversion, desirable for improvement of dynamic range and system simplification, is rarely performed, although a CCD-based A/D converter for this purpose is currently being prototyped.†
The choice of readout architecture for solid-state imagers depends on a number of factors ranging from fabrication capability and design expertise to system considerations. It appears that architectures utilizing analog devices and circuits compatible with digital circuit technology are the most likely candidates for success.
3. FOCAL PLANE IMAGE PROCESSING ISSUES
The degree to which an electronic representation of the photon flux is altered or manipulated on the focal plane can vary significantly. For example, simple transimpedance amplifier (TIA) circuits placed between the detector and multiplexer for buffering purposes barely constitute focal plane image processing since they do not alter the spatial content of the image. On the other hand, circuitry for implementing on-chip discrete image transforms or image halftoning represents significant image processing rather than preprocessing. The extent of the processing possible is dictated primarily by the available chip area, which in turn depends on the detector array density and size.
There are strong motivations for performing image processing on the focal plane of the imager. Although digital circuitry is often easier to design and fabricate, analog signal processing uses little power and real estate and avoids the need for prior (and higher resolution) A/D conversion. The low power aspect of analog signal processing is especially important in aerospace applications, in which the focal plane array may be cooled or in which total system power is limited (e.g., satellites or missile systems). Thus, most of the ensuing discussion is made with analog circuitry in mind.
A major reason for considering focal plane image processing is that it avoids the introduction of noise and distortion through off-chip driving of the multiplexed output. Output amplifiers are often a major source of noise in readout circuits, and pickup on output lines is also a difficulty. Distortion introduced by the output amplifier can require nonlinear gain compensation.
Additionally, focal plane image processing has the potential to reduce the bandwidth of the signal driven off-chip. For example, for thresholded images (which become binary in nature) the requirements on off-chip drive circuitry and A/D converter resolution are significantly reduced. Alternatively, for video compression, frame-to-frame comparison of pixels can be performed to transmit only those pixels that change.
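Both bandwidth-reduction examples can be illustrated in software, although the paper has analog on-chip circuits in mind. The following Python sketch is only a behavioral model; the function names, the tolerance parameter, and the sample frames are illustrative assumptions.

```python
def threshold_image(frame, level):
    """Reduce each pixel to one bit: 1 if above `level`, else 0."""
    return [[1 if pixel > level else 0 for pixel in row] for row in frame]


def changed_pixels(previous, current, tolerance):
    """Return (row, col, value) only for pixels that changed by more than `tolerance`."""
    return [
        (r, c, value)
        for r, row in enumerate(current)
        for c, value in enumerate(row)
        if abs(value - previous[r][c]) > tolerance
    ]


previous = [[10, 10, 10], [10, 10, 10]]
current = [[10, 42, 10], [10, 10, 11]]
print(threshold_image(current, 20))          # one bit per pixel leaves the chip
print(changed_pixels(previous, current, 5))  # only the pixel that changed is sent
```

In the second case the off-chip traffic scales with the number of changed pixels rather than with the array size.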
Interconnection of processing elements in a parallel processor adds a substantial hardware and power burden to conventional digital computing systems. On-chip spatially parallel processing elements (described in Sec. 4) operating in the analog domain have particularly simple interconnect structures, which adds to the desirability of performing focal plane image processing.
Finally, on-chip processing has the potential to alleviate bottlenecks in massive detector arrays that are sparsely illuminated or have sparse event occurrences. For example, the detection of a sudden bright spot in an otherwise deactivated array might trigger full readout of the array. Surveillance of large fields of view is another example in which full array readout could be avoided until motion is detected by on-chip processing circuitry.
There is an unfortunate relationship between array size and processor complexity that exists for all image processing systems and is particularly acute for focal plane image processing. In the latter, for a given chip size, as the detector array size becomes larger the throughput requirements of the processor become more stringent as the chip area available for image processing is reduced. Three-dimensional stacked or hybridized structures (e.g., flip-chip or z-plane topologies) can be used to retain the advantages of focal plane image processing while extending the real estate available for the processor.16,17
4. LOW DENSITY ARRAYS
Low density detector arrays, in which chip real estate is readily available, offer the largest opportunity for focal plane image processing. A low density array is defined as one having a detector pitch greater than approximately 50L, where L is a typical feature size. For example, a low density detector array with a feature size of L = 3 µm would have a detector pitch greater than approximately 150 µm. Low density arrays are used in low resolution applications (e.g., event or motion detection) or in low carrier generation (e.g., low light or poor quantum efficiency) applications. In the former case, detectors may be monolithically integrated within the unit cell, and in the latter case, hybrid flip-chip configurations or amorphous silicon overlayers provide high detector fill factor without sacrificing readout chip real estate.
Circuitry for performing image processing functions can be placed within the available real estate in the unit cell. The circuitry, or processing element (PE), may provide only buffer/amplifier functions for the unit cell or more sophisticated functions. For example, circuitry simulating neuron behavior to perform motion or edge detection has been prototyped for low density arrays.18 Implemented as a switched capacitor CMOS IC, the unit cell size is 164 µm x 143 µm and the array size is 48 x 48.
Such an architecture is termed spatially parallel19 since the physical interconnect relationship between processing elements corresponds to the spatial connectivity of the image, as shown in Fig. 3. Spatially parallel general-purpose charge-domain analog computing circuitry can also be located in the unit cell to provide more sophisticated kernel functions.9,10,12,14 Implemented in a double-polysilicon CCD process, the detector pitch is typically 150 to 200 µm. Processing elements are designed to communicate with their nearest neighbors and can perform functions such as smoothing, signal averaging, edge detection, and A/D conversion. Such a focal plane image processor is a single-instruction, multiple-data (SIMD) architecture. Unlike the neural network circuit, it is digitally programmable by applying various clocking sequences.
Spatially parallel architectures can also be implemented with the z-plane technology.17 In this technology, a laminated stack of perhaps 128 chips is mated perpendicularly to a detector array using flip-chip technology. The detector array pitch is typically 100 µm, and the edge of each chip becomes mated to one detector array column, providing unit cell real estate in the z-direction. The z-plane hybridized approach has the advantage of providing more real estate per unit cell for pixel processing, such as nonuniformity correction or multiple frame buffering,20 but it is significantly more difficult to manufacture. Furthermore, PE communication in the columnar direction is easily achieved, but communication in the row direction (perpendicular to the lamination plane) must be performed through a backplane connection. It is expected that as this technology matures and becomes applicable to medium and perhaps high density detector arrays, it will provide the ultrahigh throughput required in future real-time image processing systems.
Three-dimensional integrated circuits utilizing laser-recrystallized silicon for spatially parallel image processing have also been reported.16 Digital CMOS technology is used after unit cell A/D conversion (2 bits) to perform some logic functions. Since it is a digital PE requiring approximately 7000 transistors per unit cell, the unit cell size is nearly 1000
The design of the PE depends strongly on the application, although some general-purpose operations can be anticipated. A serious difficulty in PE design is the reduction of the number of control lines required to operate the PE. These control lines can consume a significant portion of the available chip area. Decoding of control signals within the unit cell can reduce the number of control lines, but the decoding circuitry also requires significant unit cell real estate.
The throughput of a spatially parallel architecture can be very high. For an array size of N x M, the throughput increases simply as NM. For example, a 10 mm x 10 mm imager with a detector pitch of 150 µm could have approximately 4000 pixels. Assuming 100 elemental operations per pixel, a serial processor operating at a rate of 1 µs/operation would take 400 ms to process an acquired image, corresponding to a frame rate of 2.5 Hz. However, a spatially parallel architecture could process the image at a frame rate of 10,000 Hz! Thus, the image processing throughput is taken from a realm that is barely real time to one acceptable for ultrahigh velocity intercept applications. Alternatively, the PE design can be simplified by employing slower but more efficient circuits. The spatially parallel architecture is limited not by image processing functional throughput but by readout (or I/O) rate.
A lower degree of parallelism in a spatially parallel architecture can also be utilized to conserve real estate. For example, a single PE could serve p pixel detectors by time-domain multiplexing, reducing the parallelism by the factor p. The penalty for a lower degree of parallelism in a SIMD machine is increased software complexity for shuffling the data. This may be more cumbersome in an analog-circuit-based PE.
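The frame-rate figures for the serial and spatially parallel cases, and the effect of sharing one PE among p detectors, follow from one expression. The Python sketch below is a back-of-the-envelope model that ignores readout time and data-shuffling overhead; the function name `frame_rate_hz` is an illustrative assumption.

```python
def frame_rate_hz(ops_per_pixel, seconds_per_op, pixels_per_pe):
    """Processing-limited frame rate when each PE handles `pixels_per_pe` pixels in turn."""
    frame_time = pixels_per_pe * ops_per_pixel * seconds_per_op
    return 1.0 / frame_time


OPS_PER_PIXEL = 100
SECONDS_PER_OP = 1e-6  # 1 microsecond per operation

# One PE for all ~4000 pixels (serial), one PE per pixel, and one PE per p = 4 pixels.
for label, pixels_per_pe in (("serial", 4000), ("spatially parallel", 1), ("p = 4", 4)):
    rate = frame_rate_hz(OPS_PER_PIXEL, SECONDS_PER_OP, pixels_per_pe)
    print(f"{label}: {rate:.1f} Hz")
```

Time-multiplexing by a factor p divides the processing-limited frame rate by p, which still leaves a large margin over the serial case.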
A CCD-based spatially parallel image processor currently being fabricated10 combines an array of 48 x 48 detectors (180 µm pitch) with 24 x 24 charge-coupled computer PEs (p = 4), as shown in Fig. 4. Each PE is designed to perform nearest [...] halving in the analog charge domain. A bidirectional stack is used for local memory. This processor, which is digitally programmable, can perform algorithms for smoothing, thresholding, edge detection, and A/D conversion. The chip is projected to provide a maximum throughput of 0. [...]
Ideally, PEs would be addressed externally in a direct readout method since the output would be buffered or digital and thus immune to the parasitic effects described [...]. A content-addressable readout based on pixel values or subpatterns would be useful for sparse illumination or motion detection applications. In this case, unit cells that are illuminated above a particular threshold or in which a frame-to-frame change in photon flux has been detected set a system flag requesting readout. The unit cell then places its address on a global bus. Arbitration of bus contention can be achieved in a number of ways. For example, priority encoding based on location, asynchronous enabling of unit cells, and multiple bus lines are some ways to ensure that only one address is on the bus in a given cycle. Thus, unlike normal architectures in which an address is given and the data are returned, here the data are prescribed and the address of the pixel with the prescribed data is returned.
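The inverted addressing relationship (data prescribed, address returned) can be modeled in a few lines. The Python sketch below is a behavioral illustration only: it uses a simple above-threshold condition as the prescribed data and location-based priority (lowest row, then lowest column) as the arbitration rule. The function name and the particular priority order are assumptions, not details from the paper.

```python
def content_addressable_readout(frame, threshold):
    """Yield one pixel address per bus cycle for every unit cell above `threshold`.

    Bus contention is resolved by location priority: lowest row first, then
    lowest column, so only one address occupies the bus in any cycle.
    """
    requesting = [
        (r, c)
        for r, row in enumerate(frame)
        for c, value in enumerate(row)
        if value > threshold
    ]
    for address in sorted(requesting):
        yield address


frame = [
    [0, 0, 9],
    [0, 7, 0],
    [0, 0, 0],
]
for cycle, address in enumerate(content_addressable_readout(frame, 5)):
    print(f"cycle {cycle}: address {address}")
```

For a sparsely illuminated array the number of bus cycles equals the number of flagged cells rather than the number of pixels.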
The readout of contours generated by edge detection is a second example for which nonconventional readout might be more useful. In this case it would be desirable to have adjacent pixels on the same edge read out sequentially. Such "stitched" readout could be implemented on the focal plane by appropriate PE design.
5. MEDIUM DENSITY ARRAYS
Medium density arrays have detector pitches between approximately 10L and 50L. The small unit cell size prohibits all but the simplest PE circuits from being implemented. A unit cell PE would permit signal modification prior to readout to improve the overall dynamic range of the sensor array. Possible PEs include linear and logarithmic amplifiers, buffers, and perhaps magnitude comparison for A/D conversion. Alternatively, externally adjustable unit cell amplifier gain would provide for adaptive imaging or perhaps fixed-pattern noise removal.
For low and medium density arrays, a new approach would be to have image processing circuitry located at the bottom of each column, as illustrated in Fig. 5. For an N x M array, the degree of parallelism becomes just M, reduced by the factor N over spatially parallel architectures. The throughput requirement of each PE is increased over that of a spatially parallel architecture, but throughput requirements for real-time processing can most likely be met.
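The per-PE load for the different degrees of parallelism can be compared directly. In the Python sketch below, the 128 x 128 array size and 30 Hz frame rate are hypothetical values chosen only for illustration, and the function and architecture names are assumptions; 100 operations per pixel is the figure used earlier in the paper.

```python
def per_pe_ops_per_second(n_rows, n_cols, frame_rate_hz, ops_per_pixel, architecture):
    """Operation rate each PE must sustain for an N x M array (N rows, M columns)."""
    pixels_per_pe = {
        "spatially_parallel": 1,         # one PE per pixel: N*M PEs
        "column_parallel": n_rows,       # one PE per column: M PEs
        "serial": n_rows * n_cols,       # a single processor
    }[architecture]
    return pixels_per_pe * frame_rate_hz * ops_per_pixel


# Hypothetical 128 x 128 staring array at 30 Hz, 100 operations per pixel.
for arch in ("spatially_parallel", "column_parallel", "serial"):
    print(arch, per_pe_ops_per_second(128, 128, 30, 100, arch), "ops/s per PE")
```

Under these assumed numbers each column PE needs well under one million operations per second.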
This new architecture, even for low density arrays, has the advantage of making more real estate available for each PE. A pipelined PE design (in the columnar direction) can be readily achieved. The resulting architecture is referred to as a pipelined vector processor. The architecture would reduce the total fixed-pattern noise introduced by the analog PE circuits (since there are fewer PEs) and is compatible with time delay and integration (TDI) imaging and MOS-CCD readout structures. However, the possibility of pixel data distortion between the detector and the PE is increased.
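A behavioral sketch may help fix the idea of a column-parallel pipeline. In the Python model below, rows arrive one at a time from the multiplexer, and each of the M column PEs holds the previous pixel of its column in a one-stage pipeline register and outputs a vertical difference (a crude edge operator). The choice of operation, the single pipeline stage, and the function name are illustrative assumptions and not the design proposed in the paper.

```python
def column_pipeline(rows):
    """Stream rows through M column PEs; each PE emits current minus previous pixel."""
    previous = None  # pipeline registers, one per column
    for row in rows:
        if previous is not None:
            # All M column PEs operate in parallel on the same row.
            yield [cur - prev for cur, prev in zip(row, previous)]
        previous = row


image = [
    [1, 2],
    [3, 5],
    [6, 5],
]
for out_row in column_pipeline(image):
    print(out_row)
```

One output row is produced per input row after the pipeline fills, so the processing keeps pace with row-at-a-time readout.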
The pipelined vector processor architecture has the highest potential for monolithic focal plane image processing for infrared and other non-TV-quality images.
6. HIGH DENSITY ARRAYS
High density arrays are defined as arrays having a detector pitch less than approximately 10L, thereby prohibiting PE circuitry. The imager chip area is used for detection and readout. For some dedicated applications and harsh environments it may be possible to fan out an otherwise dense array using electronic or optical means (e.g., optical fibers) so that the detector density is reduced, but in this discussion, it is assumed that the on-chip image processing circuitry must be located beyond the region of detection and readout. One exception to this is the use of the readout circuitry in a nonconventional manner to perform some limited image processing functions such as Gaussian convolution.13 In this case, buckets are clocked to effect charge mixing and simulate a diffusion process.
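With this definition, all three density classes used in the paper are expressed through the ratio of detector pitch to feature size L. The Python sketch below encodes the thresholds as stated; since the paper calls them approximate, the sharp boundaries at 10L and 50L and the function name `density_class` are simplifying assumptions.

```python
def density_class(pitch_um, feature_um):
    """Classify a detector array by the ratio of detector pitch to feature size L."""
    ratio = pitch_um / feature_um
    if ratio > 50:
        return "low"     # room for a PE in every unit cell
    if ratio >= 10:
        return "medium"  # only the simplest unit cell circuits fit
    return "high"        # unit cell used entirely for detection and readout


# Feature size L = 3 um, as in the paper's low density example.
for pitch in (180, 100, 20):
    print(f"{pitch} um pitch: {density_class(pitch, 3)} density")
```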
The advantages of on-chip focal plane image processing given a serial readout of the detector array are more limited since only the off-chip transmission of the serial data would be avoided by on-chip processing. If the on-chip processor utilizes significant chip area or I/O pads or introduces noise through clock feedthrough, the advantages of on-chip processing could be negated.
Many image processing tasks involve the convolution of the image data with a 3 x 3 or 5 x 5 kernel; i.e., a processed pixel is a weighted sum of its surrounding neighborhood. The pixel may be further processed by applying nonlinear operators on it and its neighbors. These two steps may be repeated several times before the image processing task is complete. Rarely does the convolution kernel need to exceed a size of 3 x 3, especially if multiple convolutions can be performed. Thus, the reconstruction of a small local neighborhood is required, after serial readout, in order to process a given pixel.
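The weighted-sum operation on a 3 x 3 neighborhood can be written out directly. The Python sketch below computes the weighted sum for interior pixels only and leaves border pixels at zero; the border handling, the function name, and the Laplacian example kernel are illustrative assumptions. The kernel is applied without flipping, which is identical to convolution for symmetric kernels such as the one shown.

```python
def weighted_sum_3x3(image, kernel):
    """Replace each interior pixel by the kernel-weighted sum of its 3 x 3 neighborhood."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(
                kernel[i][j] * image[r - 1 + i][c - 1 + j]
                for i in range(3)
                for j in range(3)
            )
    return out


# A Laplacian kernel responds to local changes and gives zero on flat regions.
laplacian = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
spot = [[1, 1, 1], [1, 5, 1], [1, 1, 1]]
print(weighted_sum_3x3(spot, laplacian)[1][1])
```

Each output pixel consumes nine input values, which is why the neighborhood has to be reassembled after serial readout.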
The image processing architecture can be divided into two parts: neighborhood reconstruction and pixel processing. These two parts are intimately interrelated. For example, if the pixel processor destructively senses neighborhood data, those data must be regenerated prior to processing since a 3 x 3 neighborhood implies that each pixel must be utilized at least nine times. The necessity of regeneration in turn influences the manner in which the neighborhood is reconstructed.
Methods for neighborhood reconstruction using delay lines have been demonstrated both on-chip22 and off-chip.7 In the on-chip case, serial output data from an N x M detector array were loaded into a CCD delay line 2M stages long. The delay line pixel data were sensed nondestructively using a floating-gate technique at the beginning, middle, and end of the delay line. Three adjacent pixels were sensed at each location, the locations corresponding to the same columns on adjacent rows. Thus, nine analog outputs corresponding to the local neighborhood were simultaneously provided to an off-chip image processor. The off-chip method used two separate delay lines, each M stages long, with data regeneration required between the two. A drawback of both approaches is that pixels delayed by 2M stages can suffer from CTE effects. A technique utilizing a floating-diffusion direct-readout approach has also been reported, but it suffers from fixed pattern noise and low speed.23
An improved version of the delay approach to neighborhood reconstruction is proposed in Fig. 6. Pixel regenerators located at the bottom of each column fan out the pixel data into three row reconstruction registers. The first register is immediately loaded into a serial horizontal shift register and shifted toward the pixel processor. The second and third registers are delayed by one and two rows, respectively, prior to loading into their serial registers. At the end of each serial register, the data are regenerated a second time for fan-out into three column reconstruction registers. These add a further delay to the pixel data. The three column reconstruction registers for each of the three row reconstruction registers provide a total of nine simultaneous inputs to the 3 x 3 neighborhood pixel processor. To maximize throughput, a pipelined approach to the on-chip pixel processor might be employed.
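The single-delay-line method with taps at the beginning, middle, and end can be modeled with a fixed-length shift register. The Python sketch below is a behavioral illustration, not a circuit description: it uses 2M + 3 stages so that three adjacent pixels fit at each of the three tap locations, which is an assumption about how the taps sit on the nominally 2M-stage line, and the function name `neighborhoods` is hypothetical.

```python
from collections import deque


def neighborhoods(pixels, m):
    """Rebuild 3 x 3 neighborhoods from a serial raster stream of an N x M array.

    Yields (center_row, center_col, neighborhood) once the delay line is full
    and the three taps lie within single rows.
    """
    line = deque(maxlen=2 * m + 3)  # oldest pixel at index 0, newest at the end
    for k, value in enumerate(pixels):
        line.append(value)
        row, col = divmod(k, m)
        if row >= 2 and col >= 2:
            taps = list(line)  # nondestructive sensing: the line is not modified
            top = taps[0:3]                   # end of line: two rows earlier
            middle = taps[m:m + 3]            # middle of line: one row earlier
            bottom = taps[2 * m:2 * m + 3]    # start of line: current row
            yield row - 1, col - 1, [top, middle, bottom]


# A 4 x 4 array read out serially as pixel values 0..15.
for center_row, center_col, hood in neighborhoods(range(16), 4):
    print((center_row, center_col), hood)
```

Every pixel value is sensed up to nine times as it moves along the line, which is why nondestructive sensing or regeneration is needed.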
Neighborhood reconstruction might also be performed utilizing a buffer analog frame memory. An analog frame memory would require approximately the same amount of chip area as the imager, but if the frame store region of a frame-transfer mode imager is used, the total chip area remains approximately the same. The data in the frame memory are stored in a spatially parallel format. Ideally, the analog frame memory would have nondestructive random access readout capability to avoid the pixel delay circuitry described above. Without complex unit cell circuitry in the frame memory, neighborhood access time would increase since there is only one read line per column and multiple read cycles might be needed to obtain the 3 x 3 neighborhood. However, with buffer memory added to the pixel processor, the access time could be reduced since actually only three new pixel data are added to the neighborhood as the array is scanned, and these could be sensed in parallel.
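The saving from buffer memory in the pixel processor can be seen by counting reads. In the Python sketch below, a 3 x 3 window slides along one row of a frame memory and only the newly exposed column of three pixels is read at each step. The function name, the read counter, and the list-based frame memory are illustrative assumptions.

```python
def scan_row_with_buffer(frame, row):
    """Slide a 3 x 3 window along `row`, reading one new 3-pixel column per step.

    Returns (list of neighborhoods, number of frame-memory pixel reads).
    """
    reads = 0

    def read_column(c):
        nonlocal reads
        reads += 3  # three pixels, which could be sensed in parallel
        return [frame[row - 1][c], frame[row][c], frame[row + 1][c]]

    window = [read_column(0), read_column(1)]  # buffered columns
    results = []
    for c in range(2, len(frame[0])):
        window.append(read_column(c))
        # The buffer holds columns c-2..c; transpose it into row-major form.
        results.append([list(r) for r in zip(*window)])
        window.pop(0)  # discard the column that has left the neighborhood
    return results, reads


frame = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
hoods, reads = scan_row_with_buffer(frame, 1)
print(hoods)
print(reads, "reads, versus", 9 * len(hoods), "without buffering")
```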
The full potential of a frame buffer memory could be realized if the unit cell included read and write capability. (In a sense, frame buffer memory in a frame-transfer imager already does have read and write capability.) The unit cell complexity would increase, as would the real estate requirement. However, if a hybrid flip-chip or silicon-on-insulator 3-D IC approach is used, the overall chip size would not increase significantly. The proposed analog frame memory architecture is illustrated in Fig. 7.
With a read and write frame memory, algorithms having slow convergence to the final processed state could be executed. The overall frame rate of the system might be diminished, but some applications do not require high frame rates. A good example of this is image halftoning,24,25 in which neural-network-like algorithms might be used, with convergence occurring within a few hundred cycles. Since the resultant image is binary, off-chip transmission bandwidth requirements drop considerably. A 1 Hz frame rate would be acceptable for facsimile transmission and other document scanning systems.
A second application that can be implemented with the frame buffer memory architecture of Fig. 7 with nondestructive readout is image transformation. For example, implementation of the discrete cosine transform (DCT) using CCDs11 is a good candidate for this architecture. Other transforms to enable image data compression26 are currently under investigation.
Further increasing the chip complexity but also enhancing processing capability would be a second analog frame memory. This memory could be used for frame-to-frame operations such as motion or event detection. It might also be used for the temporary storage of intermediate results.
The real-estate penalty for a second frame memory is significant, and off-chip processing could become an attractive alternative. The off-chip processing of two frames of analog memory using CCD-like circuits continues to provide power and real-estate advantages. Cooled CCD circuits can provide charge storage times measured in hours, and proximity to a cooled detector array (i.e., in the same dewar) would help reduce noise and pickup. The z-plane architecture might also be employed for frame memory applications.20
Design of the pixel processor for either on-chip or off-chip processing is dependent on the application and choice of technology. Fixed algorithm architectures are easier to design and require fewer clock control lines but do not offer flexibility. General-purpose processors provide flexibility for multiple applications but may be less efficient. Pipelined architectures require more chip area, but the throughput scales as the number of stages in the pipeline with a small signal delay as the penalty.
7. CONCLUSIONS
Focal plane image processing, particularly in the analog domain, shows promise for reducing the severe throughput, power, and real-estate problems associated with current digital technology. Several new architectures have been proposed and discussed. These included a spatially parallel architecture for low density arrays, a pipelined vector processor architecture for medium density arrays, and a read/write frame memory architecture for high density arrays. The choice of architecture depends on the application. The higher the degree of parallelism, the higher the throughput, power, and chip-area consumption.
It can be anticipated that focal plane image processing in solid-state imaging systems will develop rapidly in the next few years. Since system input is expected to remain analog in nature, and high level processing digital, such systems will need to combine the best features of analog and digital processing circuitry.
