1. Introduction
2. Charge-coupled-computing circuits
3. Prototype charge-coupled computer
4. Prospects for spatially parallel image plane preprocessing
5. Conclusions
6. Acknowledgments
7. References
1. INTRODUCTION
Real-time vision in machines requires the synthesis of imaging hardware, computing hardware, and image processing software. For mobile robots, the system must be portable. These mobile robots may be used in the future for underwater inspection, agriculture, space exploration, and transportation in addition to obvious defense applications. Thus, the vision system must be lightweight and require low power as well as having the high throughput necessary for achieving real-time operation.
In the image preprocessing portion of the vision system, the tasks performed include smoothing (noise removal), level shifting (fixed-pattern-noise removal), gain adjustment (sensor nonuniformity compensation or adaptive contrast enhancement), sharpening, edge enhancement, thresholding, frame-to-frame subtraction, motion detection, and region growing. These tasks can be performed (although not always optimally) using local neighborhood operations on [...] time required by the vision process.
In the course of effecting the preprocessing tasks, each pixel may undergo as many as 100 to 500 simple arithmetic operations per frame. The frame rate may range from 1 Hz in low speed systems, to 100 Hz in systems operating on a par with human vision, to 1000 to 10,000 Hz in high performance systems. Thus, the number of operations required in a 100 Hz frame rate system with a modest array size of 100 × 100 elements easily could exceed $1 \times 10^8$ arithmetic operations per second. Advances in very high speed integrated circuits (VHSIC) may allow serial computing systems to achieve such high throughput, but a parallel computing approach (single instruction, multiple data) can be used to alleviate the performance requirements. Unfortunately, massively parallel computing systems are presently incompatible with the portability needs of a mobile robot system. Furthermore, the advantages of massively parallel systems often are offset by the problem of loading and unloading the parallel data from a serial data stream at sufficiently high data rates.
In this work, an advanced approach is advocated. It is proposed to perform the image preprocessing functions in parallel on the image plane itself, similar to a biological retina. Since the image data arrive in a parallel manner and are transduced to an electrical form in parallel, it seems natural to perform spatially parallel image preprocessing on the image plane as well. By contrast, conventional imaging [...] serial data bottleneck is one of the most serious problems facing conventional approaches to real-time image processing. With processing being performed on the image plane, not only are the throughput requirements of subsequent signal processing alleviated, but the opportunity exists for selectively reading only relevant data off the image plane, thus further reducing serial signal processing and transmission bottlenecks.
The computation on the image plane may be done in a number of ways. For example, conventional or bit-slice digital architectures might be employed. However, the real estate available for such computation is critically small, and the need for a parallel array of analog-to-digital converters will make such an approach particularly unattractive. Thus, a real-estate-efficient analog computer would obviate the need for A/D conversion prior to processing.
The degree of parallelism present on the image plane determines the real estate available for each processing element (PE). Ideally, a truly spatially parallel architecture, in which there is one PE for each pixel, would be used. For monolithic devices, the real estate consumed by the PE circuitry takes away from the real estate available for the photon transducer itself. This can be a serious problem if the light levels are low or the frame rate is high. In a hybrid "bumped" detector/readout system, the image resolution is not necessarily hindered by the density of the PE circuitry, but if the number of pixels exceeds the number of processors, then a lower degree of parallelism must be accepted. In future 3-D integrated circuits, the PE density can be increased if the PE real estate can be spread into the Z direction.
Charge-coupled-device structures are well suited for such a system [1]. The analog nature of the image data can be compactly represented in the charge domain, requiring a single electrode for storage. The image data are refreshed at the frame rate, so the dynamic nature of CCD signal representation is generally not a concern. The accuracy of computation in the charge domain, which can be of the order of one part in 256 or better, is more than adequate for most image preprocessing applications. Furthermore, charge-transfer devices already have established themselves as the technology of choice for image data readout. The difficulty lies in the design of the charge-domain circuits.
There recently has been a growing interest in performing image preprocessing on the image plane in the charge domain. Joseph
CHARGE-COUPLED-COMPUTING CIRCUITS
There are several issues to be concerned with when designing and implementing arithmetic functions in the charge domain for image plane processing applications. First, the circuit must be as real estate efficient as possible. Second, it can use only simple biasing; that is, all clock voltages must swing between the same two rail voltages, and the circuit operation should be insensitive to these rail voltages. Third, the circuit must have enough dynamic range for the application, so in general small signal approximations cannot be used to describe device behavior. Finally, the circuit must have the required accuracy. One may note that doing computation in the charge domain essentially limits one to positive signal quantities. However, since image data are essentially a representation of the local photon flux, the input data are always greater than zero, and there are no intrinsically negative signals to be concerned with. Intermediate negative quantities cannot be used easily and should be avoided in the design of algorithms. It also is possible to use a sign bit to represent negative quantities, but this often is cumbersome.
The approach taken in this work is to use three-dimensional charge coupling. In conventional charge-transfer devices, capacitors are coupled laterally through the use of fringing fields generated by bias voltages applied to the electrodes, as shown in Fig. 1(a). The circuits advocated here also use vertical coupling between the charge on the electrode and the charge in the channel. This is illustrated in Fig. 1(b). The approach is similar to that used in floating diffusion output amplifiers in conventional CCDs except that the discharged electrode controls a CCD potential well instead of a transistor current. The sequence is as follows: an electrode is precharged to a voltage ($-V$) by means of a transistor switch and is then left floating. Signal charge collected by a floating diffusion attached to the electrode by means of a wire (metal, polysilicon, diffusion, etc.) is "subtracted" from the precharged electrode. The resultant change in potential well depth is used to generate or control charge packets.
Such 3-D coupling has several advantages. Primarily, it offers a new set of possibilities for novel circuit design since both electrodes and the semiconductor are used in the charge domain. A second advantage is that charge can be transferred across long distances in a short time through the use of conventional wiring without requiring sequential electrode transfers. Thus, the planar topological constraint normally associated with CCD circuits is alleviated. Third, the precharged gate can be used as a summing node or for current integration.
There are problems introduced by this method as well. For example, if the wiring capacitance is not carefully minimized, charge-transfer efficiency can seriously suffer. Switched capacitor circuit designers face similar difficulties. Another problem is that circuits based on gate charge subtraction are susceptible to kTC noise, which may dominate the noise floor.
The simplest arithmetic function implemented in the charge domain is addition. The preferred approach is the use of a summing bucket or accumulator. Charge packets are successively inserted into the summing bucket, and then the sum is transferred when complete. In this circuit, and in succeeding circuits, the dynamic range of the operation is determined by the size of the electrode area and the voltages applied. However, since all buckets swing between the same clock rail voltages, electrode area is the only real variable. A summing bucket therefore, by its nature, should be larger than average. Increasing the electrode lengths can impair circuit speed since the charge transfer time scales as $L^n$, where $n$ is 2 or larger. Hence, there is a trade-off between dynamic range and speed of operation. The accuracy of the accumulator is limited by the input transfer efficiency, which depends on clock width. In general, accuracy is also traded for speed.
A simple function, although difficult to implement accurately, is time-invariant signal attenuation, wherein the output charge packet is a fixed fraction of the input charge packet. Previous workers have tried various approaches, but the most successful is that reported by Bencuya and Steckl [5], which uses a channel stop barrier to divide a bucket in the transverse direction (parallel to charge flow). Partition noise is minimized by this technique, and the accuracy is determined by photolithography. Repeated application of charge-packet splitting and summation can be used to effect variable attenuation [4] of charge packets.
A more difficult function to implement compactly is charge-packet differencing. Fossum and Barker [6] reported on an intrinsically linear and compact charge-packet differencing circuit that generates an output charge packet equal to the difference of two input charge packets, such that $Q_o = Q_a - Q_b$ for $Q_a > Q_b$ and $Q_o = 0$ otherwise. The circuit is shown schematically in Fig. 2. The advantage of this scheme is that the output charge packet can be regenerated many times. For example, if $Q_b = 0$, then the circuit can be used as a charge-packet-copying circuit, producing multiple copies of an input charge packet. This property also can be used to serve as short-term memory for frame-to-frame operations.
The absolute value of the difference can be implemented using the differencing circuit just discussed if it is operated twice, once for each ordering of the inputs, and the two outputs are summed. Since only one of the two output packets can be nonzero, the result is the absolute value of the difference.
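The behavior of the differencer, the replicator mode, and the absolute-value operation can be illustrated with a short numerical sketch. It is an idealized model only: charge packets are treated as non-negative floating-point numbers, transfer inefficiency and noise are ignored, and the function names are hypothetical. The absolute value is assumed to be formed by running the differencer once for each ordering of the inputs and accumulating both outputs in a summing bucket.

```python
def difference(q_a: float, q_b: float) -> float:
    """Idealized differencer: Q_o = Q_a - Q_b for Q_a > Q_b, and 0 otherwise."""
    if q_a < 0 or q_b < 0:
        raise ValueError("charge packets cannot be negative")
    return q_a - q_b if q_a > q_b else 0.0


def replicate(q_a: float, copies: int) -> list:
    """Replicator mode: with Q_b = 0 the output is a copy of Q_a each time
    the output packet is regenerated."""
    return [difference(q_a, 0.0) for _ in range(copies)]


def absolute_difference(q_a: float, q_b: float) -> float:
    """Assumed scheme: two differencer passes with the inputs interchanged,
    summed in an accumulator. At most one of the two terms is nonzero."""
    return difference(q_a, q_b) + difference(q_b, q_a)


if __name__ == "__main__":
    print(difference(5.0, 3.0), difference(3.0, 5.0))
    print(absolute_difference(3.0, 5.0))
    print(replicate(4.0, 3))
```

No intermediate quantity in this sketch is ever negative, which is the constraint the charge domain imposes on algorithm design.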
Fixed gain or attenuation also can be implemented using the differencing circuit if the areas of the A and B electrodes are not equal. However, as discussed, circuit speed will suffer if the electrode width becomes too large.
For several applications, it would be useful to compare the magnitude of two charge packets. The output of such a magnitude-comparator circuit is used to conditionally generate a charge packet. For example, the magnitude comparator could gate a second differencing circuit whose output would be zero if the circuit were disabled by the magnitude comparator. Colbeth et al. [7] recently developed a comparator that has a large dynamic range and the sensitivity and speed required for charge-coupled computing. This circuit, shown schematically in Fig. 3, is basically a flip-flop whose nodes A and B are precharged according to the size of charge packets $Q_a$ and $Q_b$, respectively. When the flip-flop is enabled, its positive feedback swings it into one of two stable states. The final state depends on the precharged node voltages, and the stable node voltages can be used to drive other circuits. In the case of the differencing circuit, the node voltage can be used to selectively gate the fill cycle, as illustrated in Fig. 4.
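The gating arrangement just described can be sketched as a conditional operation. This is a behavioral model under stated assumptions, not a circuit description: the comparator is reduced to an ideal decision with an optional offset parameter (hypothetical, zero by default), and the names `comparator` and `gated_difference` are illustrative.

```python
def comparator(q_a: float, q_b: float, offset: float = 0.0) -> bool:
    """Ideal magnitude comparator: True when Q_a exceeds Q_b by more than
    the offset. The flip-flop dynamics are not modeled."""
    return (q_a - q_b) > offset


def gated_difference(q_x: float, q_y: float, enable: bool) -> float:
    """Differencer whose fill cycle is gated: the output is zero when the
    circuit is disabled, and max(Q_x - Q_y, 0) when it is enabled."""
    if not enable:
        return 0.0
    return q_x - q_y if q_x > q_y else 0.0


if __name__ == "__main__":
    q_a, q_b = 3.0, 2.0
    # Conditional replication: copy Q_b only when Q_a is the larger packet.
    print(gated_difference(q_b, 0.0, comparator(q_a, q_b)))
```

The usage line shows one conditional operation that can be built this way, a thresholded copy in which a packet is passed on only when the comparison succeeds.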
Two arithmetic functions whose implementation in the charge domain is not yet fully developed are a charge-domain multiplier and a logarithmic compressor. In the former, an output charge packet is generated such that $Q_o = Q_a Q_b / Q_{ref}$, where $Q_{ref}$ is a reference size and may be externally controlled. A scheme for implementing this in the charge domain by adapting the circuit of Yamasaki and Ando [8] is being investigated. A logarithmic compression circuit has limited use in a computing circuit (although charge-packet multiplication would then become easy) but can be implemented in the photon transducer fairly easily, by using either photocapacitive transduction or an open-circuit solar-cell-type detector.
Input and output functions also must be implemented as circuits. In particular, to execute 3 × 3 kernel operations, a given PE must be able to obtain data from the local neighborhood on the image plane. A simple charge-transfer device structure could be used here, but since the same charge packet frequently must be shared with many neighbors, a replicator-based transceiver structure is suggested. Data to be transmitted are loaded into the replicator and are transmitted by replicating the charge packet and transferring it using the 3-D charge-coupling technique described.
PROTOTYPE CHARGE-COUPLED COMPUTER
To experimentally investigate the charge-coupled computing schemes described, a simple charge-coupled computer was designed, fabricated, and tested. The fabrication process used at Columbia to build the devices is primitive by modern standards but nevertheless allows exploration of the relevant concepts. P-channel diffused junctions, diffused channel stops, and single-level aluminum electrodes separated by open submicrometer gaps were employed, leading to a 10 μm design rule. An internal gettering cycle was used to reduce dark current to the nA/cm² level. The process results in relatively poor charge-transfer efficiency due to 20 mV barriers in the open gaps. Source and drain to gate capacitance is another problem with this process, especially in view of the 3-D coupled circuits used.
The architecture of the simple charge-coupled computer is shown in Fig. 5. It consists of input and output structures, a charge-packet differencer, a charge-packet-magnitude comparator, and a second charge-packet differencer gated by the output of the magnitude comparator. The circuits are connected together by means of a central diffused-junction bus through which charge to be exchanged between circuits is passed. MOSFET switches control the connection of various circuits to the bus at the appropriate time. A MOSFET switch connects the bus to an input diode to allow the precharging of electrodes to the voltage applied to the input diode. For example, the output amplifier reset voltage is applied in this manner.
The computer is general purpose and is programmed by applying a particular sequence of clocking signals. Figure 6 is a photograph of the fabricated computer.
Testing of the charge-coupled computer is under way, and the functionality of the charge-coupled-computing circuits has been verified. A block diagram of the test station is [...] the transient performance of the charge-packet-magnitude comparator, following the A and B node voltages. Data are captured using a dual differential sample-and-hold circuit and are either manually or automatically recorded through the use of a Keithley model 230 programmable voltage source and a Keithley model 617 electrometer. Despite reasonable precautions, the test station noise is dominated by residual 60 Hz pickup, corresponding to 10 mV peak-to-peak as seen on the output amplifier.
The simplest operation to verify on the charge-coupled computer is input and output. Figure 8 shows the input/output transfer characteristic with the input operated in a surface-equilibration mode. Charge is transferred to the output amplifier using "drop-and-push" clocking. From the geometry, a transfer time of 200 ns is expected, and the system clock rate was set accordingly. A rise rate of 0.5 V/ns was used on all push clocks. The output amplifier is reset through use of the bus gate and diode. The input circuit delivers 43 pC/V at the output as measured by monitoring the dc current flow through the input diode, and the output circuit delivers 0.65 V at the output per volt at its input, as measured by dc use of the bus gate and diode. Thus, the effective output amplifier capacitance is 28 pF, which agrees with calculations based on the layout. The parasitic bus capacitance is dominated by source and drain to gate capacitance for all bus switches and was estimated to be 1.1 pF. Although the charge-transfer loss due to this parasitic capacitance is expected to be large in this prototype computer, a modern self-aligned process would eliminate nearly all of the parasitic capacitance.
The charge-packet differencer was tested both in replicator mode ($Q_b = 0$) and as a differencer. These data are shown in the oscilloscope photograph in Fig. 9 and in Fig. 10. The differencer linearity and gain are strongly affected by the large bus parasitic capacitance. Nevertheless, it works surprisingly well. The differencer output was generated and summed two times in the differencer summing bucket prior to transfer to the output amplifier. However, to get optimal performance from the differencer, the operating voltages were 25 V on the input switches and 12 V on the transfer line.
The charge-packet-magnitude comparator was tested by again forming two charge packets $Q_a$ and $Q_b$ in the input circuit [...] in Fig. 11. The two critical parameters for characterizing the comparator (other than speed, which is sufficient for this computer) are offset and sensitivity. The offset for the comparator depends on alignment accuracy in this layout and was 150 fC for the prototype computer tested. The sensitivity arises due to noise. In this case, the node voltages develop a randomly signed difference for the same size of input charge packet. The sensitivity can be obtained by measuring the frequency of logic ones generated by the comparator as a function of the relative charge-packet size difference. This is shown in Fig. 12. The sensitivity corresponds to a dynamic range of better than 1 part in 100 and is limited by residual 60 Hz pickup.
The gated charge-packet differencer was tested by generating charge packets $Q_a$ and $Q_b$ in the input structure and sending them to the comparator. The charge packet $Q_b$ was replicated by the differencer only if $Q_a$ was larger than $Q_b$. The output is shown in Fig. 13.
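The sensitivity measurement described above, in which the frequency of logic ones is recorded against the charge-packet difference, can be pictured with a simple statistical model. The sketch below assumes that the noise on the node-voltage difference is Gaussian and can be referred to the input as an equivalent charge with standard deviation `sigma`; that assumption and the value of `sigma` used in the example are hypothetical and are not taken from the measurements. Only the 150 fC offset comes from the text.

```python
import math


def logic_one_probability(delta_q: float, offset: float, sigma: float) -> float:
    """Expected fraction of logic ones for a packet difference delta_q = Q_a - Q_b,
    assuming Gaussian input-referred noise of standard deviation sigma (assumption)."""
    z = (delta_q - offset) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


if __name__ == "__main__":
    offset_c = 150e-15   # comparator offset reported for the prototype, 150 fC
    sigma_c = 50e-15     # hypothetical noise level, for illustration only
    for delta_fc in (0, 100, 150, 200, 300):
        p = logic_one_probability(delta_fc * 1e-15, offset_c, sigma_c)
        print(f"delta = {delta_fc:3d} fC -> P(one) = {p:.3f}")
```

In this model the offset is the difference at which ones and zeros are equally likely, and the sensitivity is set by the width of the transition between the two logic states.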
With the functionality of the prototype computer established, measurements are now under way to test the computer's performance when executing more complex programs. These results will be reported when they become available.
PROSPECTS FOR SPATIALLY PARALLEL IMAGE PLANE PREPROCESSING
The feasibility of an array of charge-coupled computers is now addressed. There are several issues, as was discussed in the introduction. First is the matter of packing density. The prototype device has nominal dimensions of 600 μm × 1500 μm using a 10 μm design rule. Using a 1.5 μm design rule, allowing for additional real estate for a phototransducer and nearest-neighbor I/O, and with improved design, it is reasonable to expect a packing density of approximately 40 to 50 PE/cm, or between 1600 and 2500 PE/cm². Certainly, an array size of 32 × 32 should be obtainable and would give adequate resolution in a number of applications.
A second consideration is processing throughput. With a 1.5 μm design rule, 20 ns clock widths would be more than adequate for good charge-transfer efficiency. Each arithmetic function may take approximately 25 clock cycles, or 500 ns. If 500 operations are required per pixel, the total computation time is 0.25 ms. Thus, for complex image preprocessing tasks, a 4000 Hz frame rate should readily be obtainable. Since this is faster than most applications would require, one might suggest that a more optimal design is to create one charge-coupled computer for every four pixels, thus allowing a resolution on the order of 6000 to 10,000 pixels/cm². Such a compromise may make image plane processing attractive for staring IR focal plane array applications, depending on image contrast and dynamic range requirements.
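The throughput estimate can be retraced step by step. The following sketch simply repeats the arithmetic of the paragraph above with the stated figures (20 ns clock width, 25 clock cycles per arithmetic function, 500 operations per pixel); the function and variable names are illustrative.

```python
def max_frame_rate_hz(clock_width_s: float, cycles_per_op: int, ops_per_pixel: int) -> float:
    """Frame rate sustained by one PE that processes one pixel per frame."""
    op_time_s = clock_width_s * cycles_per_op      # time per arithmetic function
    frame_time_s = op_time_s * ops_per_pixel       # total computation time per pixel
    return 1.0 / frame_time_s


def pixel_density_per_cm2(pe_per_cm2: float, pixels_per_pe: int) -> float:
    """Pixel density when several pixels share one charge-coupled computer."""
    return pe_per_cm2 * pixels_per_pe


if __name__ == "__main__":
    rate = max_frame_rate_hz(20e-9, 25, 500)
    print(f"frame rate with one PE per pixel: {rate:.0f} Hz")
    # Sharing one PE among four pixels trades frame rate for resolution.
    print(f"frame rate with four pixels per PE: {rate / 4:.0f} Hz")
    print(pixel_density_per_cm2(1600, 4), pixel_density_per_cm2(2500, 4))
```

With four pixels per PE, the estimated frame rate drops to about 1000 Hz, which still exceeds the 100 Hz figure quoted for systems on a par with human vision.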
Power density is another issue to be considered. This will depend on the average amount of charge transferred during each clock cycle. A bucket with $10^7$ carriers, which is, on average, half full, will dissipate approximately 8 pJ per transfer if the bucket potential changes by 10 V. If each arithmetic function requires 25 transfers and if 500 operations are performed, then 100 nJ are dissipated in the semiconductor to process one pixel. For 5000 pixels per frame, 0.5 mJ is consumed per frame. Thus, at 100 Hz, 50 mW of power is dissipated, and power consumption will not be a problem in most applications. It should not be overlooked that 5000 pixels are being preprocessed at 100 Hz, corresponding to $250 \times 10^6$ operations per second, at a cost of 50 mW.
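The power estimate can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text ($10^7$ carriers per full bucket, half-full on average, a 10 V potential change, 25 transfers per function, 500 operations per pixel, 5000 pixels, 100 Hz); the names are illustrative. With these figures the energy per frame comes out near 0.5 mJ, which is the value consistent with 50 mW at 100 Hz.

```python
ELEMENTARY_CHARGE_C = 1.602e-19  # coulombs


def energy_per_transfer_j(full_well_carriers: float, fill_fraction: float, swing_v: float) -> float:
    """Energy dissipated when the average charge packet moves through a
    potential change of swing_v volts."""
    return full_well_carriers * fill_fraction * ELEMENTARY_CHARGE_C * swing_v


def power_budget(pixels: int, ops_per_pixel: int, transfers_per_op: int,
                 e_transfer_j: float, frame_rate_hz: float) -> dict:
    """Energy per pixel, energy per frame, average power, and operation rate."""
    e_pixel = e_transfer_j * transfers_per_op * ops_per_pixel
    e_frame = e_pixel * pixels
    return {
        "energy_per_pixel_j": e_pixel,
        "energy_per_frame_j": e_frame,
        "power_w": e_frame * frame_rate_hz,
        "ops_per_second": pixels * ops_per_pixel * frame_rate_hz,
    }


if __name__ == "__main__":
    e_t = energy_per_transfer_j(1e7, 0.5, 10.0)
    budget = power_budget(5000, 500, 25, e_t, 100.0)
    print(f"energy per transfer: {e_t * 1e12:.2f} pJ")
    for name, value in budget.items():
        print(f"{name}: {value:.3e}")
```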
Noise in charge-coupled-computing circuits should not be a problem since the number of noise-generated carriers is expected to be smaller than the number of error carriers introduced by circuit inaccuracy. (The impact of predictable circuit inaccuracy is an interesting but unexplored issue.*) The noise floor can be expected to be dominated by kTC noise for 3-D coupled circuits with otherwise good charge-transfer efficiency. If the minimum resolvable charge packet $Q_{min}$ is defined as $A C_{ox} V / r$, where $A$ is the electrode area, $C_{ox}$ is the specific oxide capacitance, $V$ is the bucket depth in volts, and $r$ is the resolution of the circuit, then the minimum resolvable charge packet is equal to the root-mean-square number of noise carriers, $(kTC/q^2)^{1/2}$, when the electrode area is reduced to a critical area $A_{cr}$ given by $A_{cr} = kTr^2 / (C_{ox} V^2)$. This critical area is significantly smaller than those envisioned in scaled charge-coupled-computing circuits. For example, with $r = 256$, a 100 Å oxide, and a 1 V bucket, the critical electrode size is approximately 0.25 μm × 0.25 μm and $Q_{min}/q$ is 7 carriers. During operation, kTC noise will be additive, but the number of operations required to build up noise equal to $Q_{min}$ is equal to $A/A_{cr}$ and is larger than the number of operations per pixel per frame anticipated for charge-coupled-computing applications. For example, for an electrode area of 100 μm², $r = 256$, a 250 Å oxide, and a 5 V bucket, $Q_{min}/q$ is 16,800 carriers and the critical number of operations is 12,700.
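The two numerical examples can be reproduced from the expressions for $Q_{min}$ and $A_{cr}$. The sketch below assumes a silicon dioxide relative permittivity of 3.9 and a temperature of 300 K, neither of which is stated in the text; the function names are illustrative. Under these assumptions the results agree with the quoted values to within rounding.

```python
import math

K_B = 1.380649e-23            # Boltzmann constant, J/K
Q_E = 1.602176634e-19         # elementary charge, C
EPS_OX = 3.9 * 8.8541878128e-12  # oxide permittivity, F/m (assumes SiO2, eps_r = 3.9)


def c_ox(t_ox_m: float) -> float:
    """Specific oxide capacitance in F/m^2 for an oxide thickness in meters."""
    return EPS_OX / t_ox_m


def q_min_carriers(area_m2: float, t_ox_m: float, v: float, r: float) -> float:
    """Minimum resolvable packet Q_min / q, with Q_min = A * C_ox * V / r."""
    return area_m2 * c_ox(t_ox_m) * v / (r * Q_E)


def critical_area_m2(t_ox_m: float, v: float, r: float, temp_k: float = 300.0) -> float:
    """A_cr = k T r^2 / (C_ox V^2); temp_k = 300 K is an assumption."""
    return K_B * temp_k * r ** 2 / (c_ox(t_ox_m) * v ** 2)


if __name__ == "__main__":
    # First example: r = 256, 100 angstrom oxide, 1 V bucket.
    a_cr = critical_area_m2(100e-10, 1.0, 256)
    print(f"critical side: {math.sqrt(a_cr) * 1e6:.2f} um, "
          f"Q_min/q at A_cr: {q_min_carriers(a_cr, 100e-10, 1.0, 256):.1f}")
    # Second example: 100 um^2 electrode, r = 256, 250 angstrom oxide, 5 V bucket.
    area = 100e-12
    print(f"Q_min/q: {q_min_carriers(area, 250e-10, 5.0, 256):.0f}, "
          f"critical operations: {area / critical_area_m2(250e-10, 5.0, 256):.0f}")
```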
Encouraged by this analysis, we are currently designing and fabricating a modest array of charge-coupled computers for image plane preprocessing experiments.
CONCLUSIONS
A new class of CCDs that perform arithmetic and logic functions in the analog charge domain has been described. A prototype charge-coupled computer employing these circuits has been designed, fabricated, and tested. The results of experimental and theoretical studies indicate that it is feasible to put an array of these simple computers on the image plane to perform image preprocessing functions in a spatially parallel way.
