A prototype 32 × 32 array processor fabricated in 2 μm CCD/CMOS technology implementing the multi-scale veto edge detection algorithm is presented. In this algorithm, differences between pixel values are computed in the original image, as well as after applying a series of smoothing filters of varying spatial scales. An edge exists between two pixels only if the magnitude of their difference is greater than a given threshold for all levels of smoothing tested. This algorithm maps particularly well to implementation as a focal plane processor as it requires only nearest neighbor communication. The CCD array performs the functions of image acquisition, charge loading and removal, and image smoothing. Analog circuits between each pair of pixels in the array compute the absolute value of the difference between neighboring values and compare it to a global threshold. These circuits have been designed to allow reliable discrimination of differences from 3.1% to 10.1% of full scale range and thus meet the performance requirements of many machine vision applications.
I. Introduction
Analog processing on the focal plane, directly coupled to the outputs of individual sensors, provides the possibility of performing preliminary operations on the image data at high speed and low power. The computational complexity associated with the early stages of image processing is tremendous, not so much because of the complexity of the calculations involved as because of the huge amount of data to be handled. For example, a single convolution operation required for low-pass filtering can involve from 9 to over 90 multiply-accumulates per pixel. Typical imagers produce hundreds of thousands to millions of pixels every 1/30 s. Processing this data digitally at video rates requires high speed analog-to-digital converters, large frame buffers, and fast processors capable of performing from 100 million to one billion arithmetic operations per second.
By placing analog circuitry on the focal plane, we can perform many basic operations, including convolution, with a high degree of parallelism and without the need for preliminary A/D conversion. Traditional disadvantages of analog processing, such as low precision and signal degradation, are irrelevant on the focal plane given that high precision, i.e., > 8 bits, is not required and signal loss is negligible. There are some drawbacks to analog focal plane processors. In particular, the additional circuitry takes up space, reducing the area available for light sensing and possibly interfering with the sensors. Also, the resultant devices, which require a major design effort to build, are limited to performing only a few specific tasks. These limitations should not be overstated, however. Technology improvements, such as reductions in lithographic limits and the development of novel device structures and architectures, will mitigate problems of having circuits and sensors share a common substrate. Furthermore, the growing body of work in developing analog computational sensors ([1]–[8], to name only a few) will eventually reduce design times as we improve our knowledge of how to build these devices.
The processor described in this paper is the first working implementation of the multi-scale veto algorithm for edge detection, originally presented in [9], and is an example of a fully parallel architecture incorporating analog processors within a CCD imaging array. This algorithm, described briefly in the next section, differs from most edge detection methods used in machine vision in that it was specifically developed to be implemented on a focal plane processor by taking into consideration the constraints of limited area and connectivity inherent to these devices.
A prototype 32 × 32 array was fabricated using a 2 μm CCD/CMOS process and fully characterized. The CCD array performs the functions of image acquisition, charge loading and removal, and image smoothing by implementing a convolution operation with a kernel approximating a Gaussian smoothing function. Within each unit cell of the array, there are two edge detection processors which compute the absolute value of the difference between neighboring pixel values and compare it to a global threshold. These circuits have been designed to allow discrimination of small differences of better than 1/32 of full scale range, which is adequate for many machine vision applications.
II. The Multi-Scale Veto Algorithm
Edge detection, in some form, is at the heart of all algorithms for object recognition and image interpretation. Edges, which are locations of rapid change in the 2-D image brightness function, are caused by variations in the surface reflectance of objects in the scene and indicate changes in material properties as well as surface discontinuities such as object boundaries. Two issues are of importance in edge detection. The first is the criterion for determining the presence of an edge, while the second is the matter of how to filter out edges that correspond to unwanted detail.
In the multi-scale veto algorithm, an edge is defined as a sharp change in brightness which is significant over a range of spatial scales. Scale refers to the extent of features in the image and is related to the frequency spectrum of their maximal first derivative. For example, small-scale features such as point impulses, i.e., a single pixel whose brightness is very different from all of its surroundings, and thin lines have maximal first derivatives whose energy is concentrated almost entirely in their high frequency components. On the other hand, a step edge, which occurs at the juxtaposition of two large blocks of constant, but different, brightnesses is a large-scale feature. Its maximal first derivative is an impulse which has a broadband spectrum.
In order to test for the presence of an edge, the differences between the brightness values of all pairs of neighboring pixels are computed and compared to a global threshold. If the magnitude of the difference exceeds the threshold, a candidate edge is considered to exist between the pair. The image is then smoothed with a low-pass filter, and the differences between neighboring pixel values are again computed and compared to another threshold which is chosen to account for the low frequency attenuation of the filter. This process may be repeated several times with filters of different bandwidths if desired. The end result is that an edge is considered to exist between two pixels only if the magnitude of their difference is greater than the specified threshold at every level of smoothing. If the threshold test is failed even once, the edge is vetoed.
The rationale behind the multi-scale veto method can be explained briefly by observing how it treats different types of features. For example, consider the cases of an ideal step edge and an isolated noise impulse, both of which are of equal magnitude and are above a given threshold in the original image. When the image is smoothed, the impulse, which is a high frequency event, will be attenuated more severely than the step edge. If the bandwidth of the smoothing filter is sufficiently narrow and the threshold properly chosen, the smoothed step will still pass the test while the smoothed impulse will not. The impulse will thus be vetoed and not produce an edge. On the other hand, even though the step will be smeared over some extent by the smoothing operation, the edge will be marked at its original location since that is the only site which was a candidate from the unsmoothed image. The multi-scale veto algorithm thus allows small unwanted features to be removed, with the minimum scale selected by the user who chooses the filter size, while preserving the location of edges that are retained.
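The veto rule and the step-versus-impulse argument above can be illustrated with a short 1-D sketch in Python. This is only an illustration under stated assumptions, not the chip's implementation: the function names `smooth_once` and `multi_scale_veto`, the 1-2-1 smoothing kernel with replicated borders, and the two threshold values are all hypothetical choices made for the example.

```python
def smooth_once(signal):
    """One pass of a [1/4, 1/2, 1/4] low-pass filter (borders replicated)."""
    n = len(signal)
    out = []
    for i in range(n):
        left = signal[max(i - 1, 0)]
        right = signal[min(i + 1, n - 1)]
        out.append(0.25 * left + 0.5 * signal[i] + 0.25 * right)
    return out


def multi_scale_veto(signal, thresholds):
    """Return edge flags between neighbouring samples.

    thresholds[k] is the threshold applied after k smoothing passes.
    edges[i] is True if an edge survives between samples i and i + 1.
    """
    edges = [True] * (len(signal) - 1)   # every site starts as a candidate
    current = list(signal)
    for level, thr in enumerate(thresholds):
        if level > 0:
            current = smooth_once(current)
        for i in range(len(edges)):
            if abs(current[i + 1] - current[i]) <= thr:
                edges[i] = False         # veto: test failed at this scale
    return edges


# Equal-magnitude step and impulse, both above the first threshold.
step = [0.0] * 8 + [1.0] * 8
impulse = [0.0] * 8 + [1.0] + [0.0] * 7
thresholds = [0.5, 0.3]   # hypothetical: unsmoothed test, then one smoothing pass

step_edges = multi_scale_veto(step, thresholds)
impulse_edges = multi_scale_veto(impulse, thresholds)
```

With these inputs the step keeps a single edge at its original location (between samples 7 and 8), while both candidate edges around the impulse are vetoed after one smoothing pass.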
III. System Architecture
To implement the multi-scale veto algorithm all that is required is a single two-dimensional network which can perform repeated smoothing operations and which contains circuitry to compute and store the results of the threshold tests for each pair of pixels. The architecture used in the present system is shown in Figure 1. It consists of a grid of orthogonal horizontal and vertical CCD transfer channels with edge detection circuits placed between the nodes. The numbers on the gates signify the different clock phases which are used to move signal charges in the array.
The CCD array, which is used for image acquisition, charge transfer, and smoothing, has the same structure, with only minor differences, as that previously developed by Keast and Sodini [6]. By appropriately sequencing the clock phases, it can implement a convolution operation on the image with a discrete binomial kernel of arbitrary size.
The novel aspect of this structure is the design of the circuits, represented by the boxes labeled EDC in Figure 1, which implement the multi-scale veto rule. Each of these contains an absolute-value-of-difference circuit and a 1-bit memory cell which stores the edge signal. This memory cell is initially charged before any tests are performed and is discharged if any of the threshold tests is failed. The complete execution of the multi-scale veto algorithm can be summarized as follows: The array is initialized by transferring signal charge proportional to image brightness under each node gate (pixel) and by charging the edge memory cells. The signal charge is formed either by direct acquisition or by loading the pixel values from off-chip. A first threshold test is performed by non-destructively sensing the charge at each node, computing the magnitude of the difference between all adjoining neighbors, comparing these to a global threshold value supplied from off-chip, and discharging the edge storage cells at all sites where the threshold wins. A smoothing operation is then performed and the threshold tests repeated. This smooth-test cycle can be executed as many times as desired, as the sequence is entirely programmable from the control signals brought off-chip. The contents of the memory cells, which constitute the binary edge map of the image, are read out once the tests are completed.
A. Charge smoothing
Six different clock phases are required to operate the CCD array. Only four are needed to move charge laterally across the array when loading an image or unloading residual charge. However, for the smoothing operation the motion is along all four branches connected to each node and two more clock phases are required to control the direction of charge flow.
The 2-D smoothing operation is performed in two passes by sequentially executing a 1-D smoothing operation along the horizontal and vertical branches. Each 1-D operation consists of four steps: (1) splitting the charge held under the node gates into two equal packets, (2) moving the packets out the branches connected to the node towards the mixing gate, (3) averaging the packets from adjacent nodes, and (4) returning the averaged packets back to the node gate where they are added together.
Splitting is performed when the charge is entirely confined under the node gate (φ1 high) by first raising the signals on the adjacent gates (φ2 and φ5 for horizontal smoothing; φ6 for vertical smoothing) which causes the charge to distribute itself evenly over the high potential regions under the three gates. Bringing φ1 low then divides the charge into two equal and isolated packets.
The separate packets are moved away from the node in opposite directions by executing a pseudo four-phase clocking sequence. It may be seen that the clock phases φ1, φ3, and φ4 along each branch are mirrored about the central gate clocked by φ3, which is also referred to as the mixing gate. When phases φ2 and φ5 for the horizontal branches, or φ2 and φ6 for the vertical branches, are clocked identically the half-packets from adjacent nodes can be moved towards each other. The situation which arises when these charges are about to collide at the mixing gate on the horizontal branch is illustrated in Figure 2. As can be seen, the charges are first combined and then split in two, creating two identical packets equal to the average of half the previous values of the two neighboring nodes. The averaged packets are returned to their respective node gates by reversing the clock sequence used to move the charges away and are added to the averaged half-packets from the opposite sides once they arrive.
It can easily be shown that the horizontal operation is equivalent to convolving the stored image with the discrete kernel

$$\frac{1}{4}\begin{bmatrix} 1 & 2 & 1 \end{bmatrix}$$

Performing both a horizontal and a vertical operation thus results in a convolution with the second order 2-D binomial kernel

$$\frac{1}{16}\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}$$

Smoothing is of course limited by the physical size of the array. Along the array boundaries where nodes lack one or two neighbors, averaging cannot take place in at least one direction. The consequence of this restricted smoothing capacity is that the multi-scale veto rule is much less effective in removing edges near and parallel to the array boundaries. If the array is large enough, however, the effect on edges in the interior of the image is negligible.
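The four-step charge operation (split, move, mix, return) can be checked against a 1-2-1 convolution with a small simulation. The sketch below is a hedged illustration: the function names are hypothetical, charge is modelled as an ideal floating-point quantity with no transfer loss, and the boundary behaviour (an unpaired half-packet simply returns to its node unaveraged) is an assumption rather than a documented property of the device.

```python
def smooth_1d_charge(q):
    """Simulate one split / move / mix / return cycle along a row of nodes."""
    n = len(q)
    # (1) split each node's charge into two equal half-packets
    left_half = [c / 2 for c in q]    # packet heading toward the left neighbour
    right_half = [c / 2 for c in q]   # packet heading toward the right neighbour
    new_q = [0.0] * n
    # (2)-(3) half-packets from adjacent nodes meet at the mixing gate,
    # are combined, and are split again into two equal averaged packets
    for i in range(n - 1):
        mixed = (right_half[i] + left_half[i + 1]) / 2
        # (4) one averaged packet returns to each of the two nodes
        new_q[i] += mixed
        new_q[i + 1] += mixed
    # Assumption: at the ends of the row the outward half-packet has no
    # partner and returns to its node unchanged.
    new_q[0] += left_half[0]
    new_q[n - 1] += right_half[n - 1]
    return new_q


def convolve_121(q):
    """Interior samples of q convolved with [1/4, 1/2, 1/4]."""
    return [0.25 * q[i - 1] + 0.5 * q[i] + 0.25 * q[i + 1]
            for i in range(1, len(q) - 1)]


row = [8.0, 0.0, 4.0, 16.0, 2.0]
smoothed = smooth_1d_charge(row)
```

For interior nodes the simulated result matches the [1/4, 1/2, 1/4] convolution, and total charge is conserved because every packet that leaves a node returns to some node.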
B. Edge detection circuit design
A block diagram of the unit cell is shown in Figure 3. The corresponding layout in the 2 μm CCD/CMOS process measures 224 μm × 224 μm. At the boundary of the cell are the CCD gates in alternating levels of polysilicon which are sized so that when cells are abutted to form the processing array, the gate structure seen in Figure 1 results. One floating gate amplifier (FGA) per cell senses the signal charge under the node gate in the lower lefthand corner and feeds the output voltage to the four differential amplifiers which are paired with its nearest neighbors. For reasons dictated by the layout, it was simplest to have the floating gate amplifier communicate with the differencing circuit directly adjacent to it and to those in the neighboring cells to the west and southwest. The edge signals stored in the blocks marked 'Horizontal' and 'Vertical' thus correspond to edges between the two nodes at the base of the cell and the two on the righthand vertical side, respectively.
B.1 Charge sensing
A more detailed picture of the floating gate amplifier used for charge sensing at the signal nodes is shown in Figure 4. The clock phase φ1 is gated through the PMOS reset transistor controlled by the signal V_fg. When V_fg is low, the node gate voltage is controlled by φ1 for the purposes of charge transfer and storage. When V_fg is brought high, however, the gate is left floating and can thus be used for measuring charge levels.
In order to sense the signal level, the charge must be initially transferred away from the node. Once this is done and φ1 is brought high, V_fg is also brought high, turning off the reset transistor and initializing the node gate voltage to an initial value V_i. The signal charge is then returned and dumped back into the empty potential well, causing the floating gate voltage to change to the value V_f such that

$$V_f = V_i + \frac{\gamma\, Q_{sig}}{C_{load}} \tag{5}$$

where C_load represents the total load capacitance on the floating gate and γ is a positive constant, 0 < γ ≤ 1, that accounts for the capacitive divider effect of the gate oxide capacitance and the depletion capacitances of the buried CCD channel in combination with C_load. Since the charge consists of electrons, Q_sig is negative and hence V_f ≤ V_i.
The voltage change on the floating gate is measured via a source-follower buffer whose design was determined by two considerations. The first was the need to minimize C_load, while the second was the need to have the output voltage in the correct range for interfacing to the differential amplifiers. For the latter reason, an NMOS source follower was used, despite the fact that a higher gain could be achieved in this process using a PMOS design with separate wells connected to the sources of the bias and input transistors. Due to the backgate effect, the source-follower buffer in this design has a gain of approximately 0.89.
B.2 Absolute-value-of-difference circuit
The primary challenge in designing the absolute-value-of-difference circuits was to obtain the required operating range. Because of the attenuation of the smoothing operation, there can be a wide variation in the differences that pass the threshold test in the original image and those that pass the test in the smoothed image [9]. For general use, the differencing circuits must operate in their linear regime for differences as small as 3% to as large as 10% of full scale range (FSR).
The circuit implementation of the differencing, threshold, and veto operations is shown in block diagram form in Figure 5. The outputs of the floating gate amplifiers from two neighboring nodes are connected to the inputs of a double-ended differential amplifier with gain A. The two outputs of the differential amplifier are equal to V_oc + AΔV and V_oc − AΔV, where ΔV = V_1 − V_2 and V_oc is the common-mode output when ΔV = 0. Since it is not known whether ΔV is positive or negative, the threshold test is performed by comparing both outputs to a voltage representing the threshold plus an offset to compensate for V_oc as well as any systematic bias in the comparator. If the threshold voltage is greater than both +AΔV and −AΔV, the edge is vetoed by grounding the input to the storage latch.
Since space is limited in the unit cell, it is not practical to duplicate the comparator circuit to perform both tests simultaneously. Instead, the tests are performed sequentially with a single comparator by selectively gating the differential amplifier outputs using the clock signals R_1 and R_2. The comparator output, which is high if the threshold voltage is less than the differential output, is also selectively switched to one of the inputs of the NOR circuit and is stored on the input gate capacitance when the switch is opened.
If the differencing circuit is to be effective in detecting edges over the full range of input values, two more requirements must be satisfied. First, the differential amplifier must have a very high common mode rejection ratio so that a given difference ΔV corresponds to approximately the same output signal for all common mode input levels. Second, the gain, A, of the amplifier should be large enough to magnify the minimum input difference so that it is greater than the minimum resolution of the comparator circuit. On the other hand, A cannot be so large that the amplifier output saturates when the input difference is less than the maximum value which must be measured. Given the range of voltages planned for use in this system, these constraints translate into requiring the amplifier gain to be between 2 and 15.
The circuit diagram of the differential amplifiers used in the prototype processor is shown in Figure 6. It consists of two identical cascaded differential pairs with diode-connected PMOS loads, each sized for a theoretical differential gain of A_d ≈ 3.7, based on the average device parameters for the process. By cascading two low-gain differential amplifiers, a higher combined differential gain and common mode rejection ratio with less input capacitance can be obtained than by using a single stage.
The comparator and dynamic NOR circuits used for the threshold tests are shown in Figure 7. The basis of the comparator is a standard clocked CMOS sense amplifier developed for measuring small voltage differences in memory circuits [10]. The sense amplifier output connected to the V_τ input side is fed into an inverter whose output is gated to one of the inputs on the adjoining NOR circuit. The inverter isolates the sense amplifier from the uneven capacitance of the NOR input gates. Although the inverter input creates a capacitive imbalance between the two sides of the sense amplifier, this imbalance is constant and can be compensated for in the selection of the threshold voltage V_τ, while the input capacitance of the NOR circuit depends on the result of the previous test.
Since the inverter output is low when V_τ wins the comparison, the NOR output will be high only if the threshold voltage, V_τ, is greater than both V_oc + AΔV and V_oc − AΔV. In this case the switch transistor at the bottom of the two-transistor chain connected to the edge storage latch input is turned on. The second transistor, clocked by the signal CG, is turned on after both comparisons have been completed and the NOR output is stable. If the NOR output is high, the storage latch input is connected to ground, and the edge signal is discharged. If it is low, however, the lower switch is open, and raising CG has no effect on the state of the storage latch.
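The logic of the two sequential comparisons, the NOR gate, and the CG-clocked discharge can be summarized in a small behavioural model. This is a sketch under simplifying assumptions, not a circuit simulation: the amplifier is treated as ideal and linear, the default `gain` (two cascaded stages of about 3.7) and `v_oc` values are illustrative, and the names `threshold_test` and `EdgeLatch` are hypothetical.

```python
def threshold_test(v1, v2, v_thresh, gain=13.7, v_oc=3.32):
    """Return True if this test vetoes the edge.

    v1, v2     : source-follower outputs of two neighbouring nodes (V)
    v_thresh   : threshold voltage, already including the V_oc offset (V)
    gain, v_oc : illustrative values for an ideal linear amplifier
    """
    dv = v1 - v2
    out_plus = v_oc + gain * dv     # first differential output
    out_minus = v_oc - gain * dv    # second differential output
    # Two sequential comparisons share one comparator (clocks R_1, R_2).
    # The inverter output is low when the threshold wins the comparison.
    inv_1 = not (v_thresh > out_plus)
    inv_2 = not (v_thresh > out_minus)
    nor_out = not (inv_1 or inv_2)  # high only if the threshold wins both
    return nor_out


class EdgeLatch:
    """1-bit edge memory: charged once, can only be discharged afterwards."""

    def __init__(self):
        self.edge = True            # SET pulse charges the latch

    def clock_cg(self, nor_out):
        if nor_out:                 # CG high with NOR high grounds the input
            self.edge = False


latch = EdgeLatch()
latch.clock_cg(threshold_test(2.00, 1.90, 4.0))   # large difference: no veto
kept_after_first = latch.edge
latch.clock_cg(threshold_test(2.00, 1.98, 4.0))   # small difference: veto
kept_after_second = latch.edge
```

Because the latch has no recharge path other than SET, a single failed test at any smoothing level removes the edge for the rest of the procedure, which is the veto rule.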
B.3 Edge storage
The edge storage latch, consisting of a pair of cross-coupled PMOS transistors along with two NMOS transistors for initializing and discharging the edge signal, is shown in Figure 8. At the beginning of the multi-scale veto procedure, before any threshold tests are performed, the latch is initialized by bringing the SET signal high. Positive feedback between the two PMOS transistors pulls the voltage on the input to the CMOS inverter formed by the righthand n-p transistor combination to V_DD and maintains this state as long as the edge is not vetoed. If the latch is discharged by the comparator circuit, i.e., if the NOR output is high when CG goes high, it will remain so for the remainder of the edge detection procedure as there is no mechanism other than the SET transistor for recharging it.
The edge signals for a given row are read out by bringing the 'Row Select' signal high, which connects the latch output to the bit line for its column. The bit line is in turn connected to an inverting digital output pad driver to bring the signal off-chip. The current for charging and discharging the pad driver is supplied by the CMOS inverter on the righthand side of the latch. Since this current can be supplied without affecting the state of the inverter input, the edge signals can be read out nondestructively at any point during the edge detection procedure.
IV. Test Results
A 32 × 32 array (the largest which would fit on the maximum available die size for the 2 μm CCD/CMOS process used) was laid out and fabricated as a single chip. The floor plan of the complete processor, shown in Figure 9, includes an input device and shift register for loading an image electronically as well as a decoder for selecting the edge output of a given row. A separate chip was also fabricated in which the component structures of the array and the pixel processors could be individually tested and characterized.
A. Charge input and calibration
One problem which was faced in testing the 32 × 32 array was that the CCD process available for fabricating the prototype device was not of the same high quality as used for producing commercial imagers. The measured charge transfer efficiency (CTE) was only 0.995 per gate. There were also no vertical anti-blooming structures, nor controlled depth wells to favor electron-hole pair generation from photons in the wavelengths to which humans are most sensitive (400–700 nm). In order to achieve better control over the test inputs to the array, it was thus decided to load images electronically rather than by focusing light onto the array.
The fill-and-spill structure diagrammed in Figure 10 was used to load each pixel of the image. In this structure, signal electrons are supplied by pulsing the voltage V_d on the n+ diffusion which fills the potential well under the poly2 gate connected to the input voltage V_in. After creating the charge packet, the stop gate (SG) and transfer gates (TG_1 and TG_2) are pulsed to move the charge into the shift register. The size of the charge packet is determined by the difference between V_in and the reference voltage, V_ref, connected to the poly1 gate next to the diffusion. An image is loaded one column at a time, using a four-phase CCD shift register, clocked by phases φs1 to φs4, to move the pixels to their appropriate rows. Once the last pixel of the column has been read in, the contents of the entire shift register are transferred horizontally into the processing array. In order to compensate for the poor CTE of the CCDs, a modified transfer sequence, taking advantage of the arrangement of the five independent clock phases on the horizontal channel, was used to ensure that the CTE inside the array was 0.995 per node rather than per gate. The input shift register was also designed so that each pixel could be input several times before arriving at its respective row, thus allowing the charge levels to stabilize and minimizing interference between rows.
To calibrate the input voltages and signal charge levels, the source-follower buffers of the floating-gate amplifiers for the last column of the array were connected to output pads to allow their voltages to be monitored as a function of the input signal. A constant signal level was applied for an entire column and the output was measured after transferring this column completely across the array. The total output swing is 1.95 V, from a maximum of 2.75 V for V_sig = −0.3 V to the minimum value of 0.8 V for V_in − V_ref > 0.5 V. Over the part of the range usable for interfacing with the differential amplifiers, 2.6 V ≥ V_out ≥ 1.2 V, the average slope is −4.94, and V_sig varies from 0 V to 0.26 V. Correcting for the measured source-follower gain of 0.89, this variation corresponds to a change in the floating gate voltage of 1.57 V. Using calculated values of C_load = 83 fF and γ = 0.43, the corresponding difference in the size of the charge packet is, from equation (5), −0.3 pC or 1.9 million electrons.
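The arithmetic behind these figures can be reproduced in a few lines of Python. The sketch assumes that equation (5) relates the floating-gate voltage change to the signal charge as ΔV = γ·Q_sig / C_load, where the symbol name `gamma` for the capacitive-divider constant is an assumption; only magnitudes are computed, so the sign of the electron charge is ignored.

```python
ELECTRON_CHARGE = 1.602e-19   # coulombs

v_out_swing = 2.6 - 1.2       # usable source-follower output range (V)
sf_gain = 0.89                # measured source-follower gain
c_load = 83e-15               # calculated load capacitance (F)
gamma = 0.43                  # capacitive-divider constant (symbol name assumed)

dv_fg = v_out_swing / sf_gain             # floating-gate voltage change (V)
q_sig = dv_fg * c_load / gamma            # magnitude of the charge difference (C)
n_electrons = q_sig / ELECTRON_CHARGE     # equivalent number of electrons

print(f"dV_fg = {dv_fg:.2f} V")
print(f"|Q|   = {q_sig * 1e12:.2f} pC")
print(f"N     = {n_electrons / 1e6:.1f} million electrons")
```

Under that assumed form, the script prints 1.57 V, 0.30 pC, and 1.9 million electrons, in agreement with the values quoted in the text.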
B. Absolute-value-of-difference and comparator circuits
A complete edge detection circuit, starting from the inputs to the source-follower buffers of the floating-gate amplifiers and ending with the output of the edge storage latch, was laid out as an individual test structure, as were its primary components: the source-follower, the cascaded differential amplifier, and the comparator circuit. The output characteristic of one complete edge detection circuit measured with V_DD = 6.1 V and V_bias = 1.5 V is shown in Figure 12, plotted for both V_1 − V_2 and |V_1 − V_2| against the value of V_τ for which the input difference could be considered as an edge. The dotted lines represent the maximum variation in V_τ due to the finite resolution of the sense amplifier comparator, while the separation between the parallel lines in the plot vs. |V_1 − V_2| is caused by the offset, of approximately 6 mV in this case, in the differencing circuit.
In order to estimate the statistical behavior of these circuits over the entire array, the composite responses of 32 different edge detection processors were measured, with the results plotted in Figure 13. Several factors contribute to the spread in the difference values corresponding to each value of V_τ. Among these the most significant is offset in the differential amplifiers, followed by variations in their common mode output voltages V_oc, mismatches in the source followers, and, to a minor extent, the resolution of the comparator circuits.
The solid lines in Figure 13 represent the expected standard variation of the edge detection circuits and are a measure of the overall resolution of the array processor. The horizontal distance between the solid lines determines the minimum difference levels that can be reliably discriminated by the edge detection circuits.
For values of |V_1 − V_2| < 140 mV, this distance is approximately 48 mV, while for differences greater than 160 mV, the distance becomes infinite. Given the full scale range of 1.57 V for the floating gate voltages, these limits correspond to minimum and maximum discernible differences of 3.1% and 10.1% of FSR respectively.
To estimate the effects of variations in the differential amplifiers on overall performance, 32 individual circuits were characterized. Their average offset was found to be 6.6 mV with a standard deviation of 4.43 mV, while the average value of V_oc was 3.32 V ± 0.06 V. It was also found that there was a slight systematic positive bias in the offsets. Of the 32 circuits measured, only two had offsets less than zero. It is reasonably certain that this bias could be eliminated with minor adjustments to the layout. Removing it could shave as much as 12 mV off the minimum measurable difference in future versions of this design.
C. Test pattern input
In order to test the operation of the 32 × 32 processor, several test images were loaded into the array, using the calibration curve of Figure 11 to convert pixel gray levels to input voltages. The edges found for two of these images are shown in Figure 14 after zero, four, and eight complete 2-D smoothing operations.
The threshold voltage for the unsmoothed data was chosen as V_τ0 = 4.55 V, based on the composite absolute-value-of-difference response curve, to select initial candidate edges at near the maximum discernible difference. Threshold voltages for the smoothed images were chosen by computing the attenuation factors of the equivalent binomial filters for an ideal unit step edge and scaling the zero-level threshold by this amount, resulting in the selection of V_τ4 = 4.14 V and V_τ8 = 4.06 V.
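One plausible way to compute such attenuation factors is to smooth an ideal unit step repeatedly and measure the difference that remains across the original edge location. The sketch below assumes that each smoothing operation acts on a straight, axis-aligned step as a 1-D [1/4, 1/2, 1/4] filter perpendicular to the edge; the function names are hypothetical, and the mapping from these factors to actual threshold voltages (which depends on the measured response curve and its offset) is not modelled.

```python
def smooth_121(x):
    """One 1-D pass of the [1/4, 1/2, 1/4] kernel (borders replicated)."""
    n = len(x)
    return [0.25 * x[max(i - 1, 0)] + 0.5 * x[i] + 0.25 * x[min(i + 1, n - 1)]
            for i in range(n)]


def step_attenuation(passes, half_width=32):
    """Difference across a unit step after `passes` smoothing operations."""
    x = [0.0] * half_width + [1.0] * half_width
    for _ in range(passes):
        x = smooth_121(x)
    # difference between the two pixels that straddle the original edge
    return x[half_width] - x[half_width - 1]


factors = {n: step_attenuation(n) for n in (0, 4, 8)}
```

Under these assumptions the remaining difference equals the central coefficient of the equivalent binomial kernel: 1 with no smoothing, 70/256 (about 0.27) after four operations, and 12870/65536 (about 0.20) after eight.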
Edges are represented by drawing a line between the two pixels where they occur. Artifact edges occur along the top and bottom rows of both test images due to the design of the input shift register which resulted in somewhat poorer transfer efficiency into these rows. Elsewhere the edges clearly occur around the prominent features in the image. In the letter-A image, some "noise" edges are picked up in the unsmoothed threshold tests. However, most of these are removed after four smoothing operations, and all but two are gone by the eighth operation. The edges for this image, which are spread over a two pixel width instead of being abrupt as one would expect, reflect the low CTE of 0.995 which caused some smearing as the image was read in. Because of the multi-scale veto rule, however, the repeated smoothing does not cause any further degradation in edge location.
In the face image, the initial test finds edges over almost the entire area of the face and shoulders as well as along some contours in the background. Unlike the first image, which has only abrupt step-like features, the face has many more subtle variations in brightness. As a result, the repeated smoothing removes many edges from small-scale events so that after eight cycles only the more prominent outlines remain.
V. Conclusions
The 32 × 32 array processor designed in 2 μm CCD/CMOS technology described in this paper is the first working implementation of the multi-scale veto edge detection algorithm in a fully parallel focal plane architecture. The overall resolution of the absolute-value-of-difference circuits contained within the array allows reliable discrimination of differences from 3.1% to 10.1% of the full scale range of pixel values. While this performance is adequate for many machine vision applications, minor adjustments to the design and layout of the processors may improve it considerably in future versions of this device.
Using a 0.8 μm process, an 80 × 80 array edge detector could be placed on a 1 cm die. An array of this size could have practical applications as a scanning, or moving eye, device in automated vision systems. Alternatively, by switching to a row- and column-parallel architecture, it would certainly be possible to build much larger arrays (> 256 × 256) using most of the same structures contained in the present fully parallel design.
Texas at Austin in 1989, and the Ph.D. in Electrical Engineering and Computer Science from MIT in 1994. While at the University of Texas, she was awarded an N. K. Wright Endowed Presidential Scholarship and, during her time at MIT, was supported by a graduate fellowship from AT&T Bell Labs. She also spent two summers (1989 and 1993) as an intern at the Bell Labs research facility in Holmdel, NJ. Her research interests are in developing "intelligent" imaging devices for use in machine vision, robotics, and image processing. She joined the faculty of the Electrical and Computer Engineering department at Northeastern University in September 1994 and is currently developing a new research program there to build new types of imaging sensors with on-chip processing.
