Two architectures for a programmable image processor with on-chip light sensing capability are described. The first is a VLSI implementation of a cellular neural network. The second is a distributed dual-structure mutation of the first architecture. The distributed dual architecture leverages the speed of silicon against the large silicon area requirements. Moreover, the integrated nature of the dual-structure design significantly reduces the bottleneck and computational overload caused by data transfer from the sensory focal plane to the image processor. The paper also describes VLSI chip prototypes and test results.
Introduction: Image Processing with Cellular Neural Networks (CNN)
CNN is a hybrid of Cellular Automata and Neural Networks (hence the name Cellular Neural Networks) and it incorporates the best features of both concepts. Its continuous time feature allows real-time signal processing, and its local interconnection feature makes physical realization in VLSI possible. Its grid-like structure is suitable for the solution of a high-order system of first order nonlinear differential equations on-line and in real-time. In summary, CNN can be viewed as an analog nonlinear dynamic processor array [Chua88] . The basic unit of CNN is called a cell. Each cell receives input from its immediate neighbors (and itself via feedback), and also from external sources (e.g., the sensor array points and/or previous layers). 
and $I_{ij} = I$ (Equation 3), where $u$ is the input, $x$ is the state, and $y$ is the output associated with the cell $(i,j)$; $\tau_{ij}$ is the time constant of the cell $(i,j)$, determined by the R and C values; $I_{ij}$ is the bias current; $N_r$ is the neighborhood of connectivity of the cell $(i,j)$, such that $kl \in N_r$; and A and B represent the cloning templates. In a typical CNN, local connections between the neighbors (feedback weights, or the entries of the matrix A in Equation 1), along with connections from the sensory array (input weights, or entries of the matrix B in Equation 1), form the programmable cloning templates. Cloning templates have been developed for numerous types of visual processing tasks. Each template is specific to a particular application, e.g., a cloning template for edge detection [Lee94] or binocular stereo [Park94]. Cellular neural networks are attractive in image processing because of their programmability: one needs to change only the template to perform a different iconic task.
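As a rough illustration of the cell dynamics, the sketch below integrates a small CNN with forward Euler steps. It assumes the standard form from [Chua88], $\tau \dot{x}_{ij} = -x_{ij} + \sum A\,y_{kl} + \sum B\,u_{kl} + I$ with the piecewise-linear output $y = \frac{1}{2}(|x+1| - |x-1|)$, which may differ in detail from Equations (1) and (2). The function names, the zero-valued boundary, the 3 x 3 neighborhood, and the step size are illustrative choices only.

```python
def output(x):
    """Piecewise-linear saturation y = 0.5 * (|x + 1| - |x - 1|) (assumed form)."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def template_sum(template, grid, i, j):
    """Weighted sum over the 3x3 neighborhood of (i, j); cells outside count as 0."""
    rows, cols = len(grid), len(grid[0])
    total = 0.0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            r, c = i + di, j + dj
            if 0 <= r < rows and 0 <= c < cols:
                total += template[di + 1][dj + 1] * grid[r][c]
    return total


def cnn_step(x, u, A, B, bias, tau=1.0, dt=0.05):
    """One forward Euler step of the assumed state equation."""
    y = [[output(v) for v in row] for row in x]
    new_x = []
    for i in range(len(x)):
        new_row = []
        for j in range(len(x[0])):
            dx = (-x[i][j] + template_sum(A, y, i, j)
                  + template_sum(B, u, i, j) + bias) / tau
            new_row.append(x[i][j] + dt * dx)
        new_x.append(new_row)
    return new_x


def run_cnn(u, A, B, bias, steps=400):
    """Start from a zero state and iterate; returns the final state grid."""
    x = [[0.0] * len(u[0]) for _ in u]
    for _ in range(steps):
        x = cnn_step(x, u, A, B, bias)
    return x
```

Changing only the A and B arguments changes the task performed, which mirrors the programmability argument made above.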
VLSI Adaptations of the CNN Model:

Motivation
The CNN model described in Equations (1-3) is not suitable for direct VLSI implementation. Below we give a qualitative discussion that motivates the models we implemented:
In an integrated circuit implementation of the CNN model, the summation equation is a current-based computation, as the circuit model in Figure 1 suggests. By Kirchhoff's current law, all currents flowing into the node that defines the state of the cell (x) must sum to zero. Because the intrinsic resistance values (R) are very large and capacitance values are very small, little charge is required to maintain a particular voltage. This also means that the current required to alter the voltage value of the state is very small. The values of the noise currents are of sufficient magnitude to make a significant difference. When one adds to that the fact that the transistor characteristics can generally vary as much as 20% within the same substrate, it becomes clear that the (x) node is highly likely to charge up or down to a power rail. One remedy is adding significant capacitance to each node. This is not desirable at all, since it requires precious VLSI real estate and furthermore increases the response time of the cell.
To overcome these obstacles, we introduced a self-normalizing feedback structure for the computation of the inputs and feedforward template. This borrows a crucial element from the resistive grid. The feedback in the feedforward input weight multiplication process is necessary to keep the state node (x) from quickly accumulating or losing charge and saturating to a power rail.
In the circuit implementation shown in Figure 2: 1. both u (unsigned) and x (signed) are represented as analog voltages u and x; 2. template weights can be digitally stored and are programmable; and 3. the transfer function f ( ) in Equation 2 can in fact become a programmable tri-level saturation function, that produces zero output in a region of the input (rather than only at a point).
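One possible reading of the tri-level transfer function in item 3 is sketched below. It assumes hard thresholds and output levels of -1, 0 and +1; the real circuit is likely to have smooth transitions, and the names `tri_level`, `low` and `high` are hypothetical.

```python
def tri_level(x, low, high, levels=(-1.0, 0.0, 1.0)):
    """Tri-level saturation: zero output over the whole region low <= x <= high.

    The thresholds are the programmable part; hard switching is an assumption.
    """
    if low > high:
        raise ValueError("low threshold must not exceed high threshold")
    if x < low:
        return levels[0]
    if x > high:
        return levels[2]
    return levels[1]
```

Widening the interval between `low` and `high` widens the region of input that maps to zero, which is the property the text highlights.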
The particular equations that incorporate the characteristics of the circuit structure are listed below.
Equation (5), and $I_{ij} = I$ (Equation 6), where the hyperbolic tangent (tanh) describes a basic transconductance circuit.
Note that the input (feedforward) weights (B) are implemented in a negative feedback structure. The reasons have been qualitatively explained above. This manner of implementation was chosen because (i) the feedback structure stabilizes the state (x) node, which otherwise has a tendency to charge all the way up or down to a rail limited by the power supply, and (ii) the structure is far more resilient to manufacturing related issues such as transistor mismatches. In Figure 2 , the modified CNN model is shown. Some assumptions have been made about the polarity of inputs and weights: These are that B is nonnegative and A is nonpositive. 
Integrated Light Sensing and Processing in the Cellular Realm
An overall integrated light sensing and processing architecture based on the cellular neural network paradigm comprises an array of identical sensor processor cells, each of which contains:
1. a photodiode and circuits for active light sensing, 2. transconductance amplifiers for feedforward template weight multiplication, 3. wide range transconductance amplifiers for feedback template weights, 4. analog/ single bit digital local dynamic memory cells, 5. a data bus for transfers, 6. local programmable logic, and 7. read/write data controls.
Within this structure, one can perform a variety of operations of the cellular paradigm using, for instance, a pair of 3 x 3 cloning templates. Each operation can be completed in about 30 ns over the entire image. This implies a processing rate of over 30 million frames/sec per template instruction. The operations include all known convolution operations, plus connected component detection and manipulation allowed for by the feedback weights. Example operations are edge location; morphology operators such as dilation, thinning, and erosion; light adaptation; scratch removal; and texture, color and shape analysis. In addition, by initializing the states with previously obtained frames, it is also possible to use the two-template array to implement temporal operations such as motion analysis, e.g., local velocity detection and motion direction detection.
In Figure 3, the overall cellular chip is illustrated as a system. In this system, an external microcontroller generates the necessary command signals to the integrated sensor processor, such as sensor timing, row select, and program code. This microcontroller can be replaced by a processor or a digital signal processor depending on the computing needs of the particular application at hand. The program memory can be internal, where a more compact address is horizontally coded into bit-level microinstructions that determine template values and data transfers within the architecture.
In the same figure we also illustrate the contents of a cell in detail. The feedforward and feedback weights which also integrate the transfer function are shown in outline. In addition, the local logic and memory functions are illustrated. There is a main data bus across which data transfers can occur. Two way connections are made between the data bus and the two analog and four digital memory units, the state (x) and input (u). Also, connection is made from the logic output and the reference voltage to the data bus. The logic can be implemented by programming, similar to programmable logic arrays. The emerging reconfigurable logic arrays represent another option. 
Implementation of Several Cellular Model Circuits
A CMOS chip containing several types of cellular units was designed, simulated and laid out. The design was submitted for manufacturing to the MOSIS 2 micron ORBIT ANALOG process. The die size is the so-called TINYCHIP package that is a 2.3 mm x 2.3 mm area bonded to a 40 pin DIP.
The implemented chip contains eleven types of cells. All outputs are available through wide range followers, and all programmable cell parameters, i.e., feedforward and feedback weights and bias, have 3 bit resolution (plus sign bit). External pins are available to set these weights.
Figure: The one input - one output cell with initialization.
Two Programmable Type (Positive or Negative) Feedback Cells
Two three-input, one-output cells with a selectable kind (+ or -) of feedback were implemented. One of these also allowed for initialization of the state in a manner identical to the one illustrated in Figure 5. Figure 6 shows this type of cell without initialization. Note that in this cell, there is an added delay of a pass-through transistor in the feedback of the state. The two signals (the state node x and Vref) are directed to the terminals of the feedback weight amplifier based on the feedback sign select bit. All amplifiers have three bit programmable gain. Two three-input, one-output cells, illustrated in Figure 7, were included with initialization. These cells differ only slightly from the ones described in Figure 6: the type of feedback is directly implemented rather than being programmable. Two combinations of +/- terminals are possible and both have been implemented. Both amplifiers have three bit programmable gain.
Initializable cells
Sample Tests: Feedforward Weights and Image Convolution
Convolution is a very common image processing step that precedes many vision tasks. It is often used as a technique to generate a second (different) image from the original, where desired features are enhanced and/or undesired characteristics (e.g., noise) are suppressed. Convolution can be described as a continuous spatial or temporal operation, but its application to sampled images is discrete and involves a convolution kernel with discrete values. This discrete convolution operation between an image I and a kernel k, denoted by $\otimes$, is given in Equation (7), where I is the input image, x and y are the two dimensional image coordinates, and k is a (2N+1) x (2N+1) square kernel.
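A minimal sketch of the discrete operation described above is given here. It assumes the usual flipped-index form, out(x, y) = sum over i, j in [-N, N] of I(x - i, y - j) k(i, j), and treats pixels outside the image as zero; the indexing convention of the original Equation (7) may differ. The name `convolve2d` is a local helper, not a library call.

```python
def convolve2d(image, kernel):
    """Discrete 2-D convolution of a list-of-lists image with a square kernel.

    kernel[i + n][j + n] holds k(i, j) for i, j in [-n, n], with n = N.
    Pixels outside the image are taken as 0 (an assumption).
    """
    n = len(kernel) // 2
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            acc = 0.0
            for i in range(-n, n + 1):
                for j in range(-n, n + 1):
                    r, c = x - i, y - j
                    if 0 <= r < rows and 0 <= c < cols:
                        acc += image[r][c] * kernel[i + n][j + n]
            out[x][y] = acc
    return out
```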
One can easily see the similarity between the feedforward weight-input product sum $\sum_{kl \in N_r} B_{ij,kl}\, u_{kl}$ in Equation (1) and the sum in Equation (7). Thus, if one sets $I = 0$ and $A = 0$, the steady-state solution to Equation (1) is determined by this feedforward sum alone, which is in fact equivalent to $I_{x,y} \otimes k_{i,j}$.
Convolution operations pertinent to this project are constrained by the analog VLSI hardware implementation. Even though this may appear to be a disadvantage, in reality, it creates the opportunity to implement a normalized convolution operation.
The described VLSI-adapted model performs a normalized operation, which replaces the summation of feedforward-weight-times-input terms in the canonical model of Equation (1) with the corresponding term in Equation (4).
We already discussed qualitatively the reasons for this modification in the previous section.
Normalized convolution, denoted by $\otimes_n$, is very similar to the conventional convolution operation, except that the output is normalized by the sum of kernel entries. Because kernel entries are the same across the whole image, the result is essentially division by a normalization factor common to the entire image. The advantage is that the dynamic range of the resulting image is essentially the same as that of the initial image. For the implementation of normalized convolution described above, the amplifier inputs are loaded with voltages from the image, and the gains are set from the entries of the kernel. Since all conductances (or gain values) are necessarily positive, implementing negative kernel values with this approach requires defining negative input voltages and selecting negative polarity for the input to the transconductance amplifier for which a negative kernel value is desired. This cuts the dynamic range of images in half, although the range of kernel entries remains unaffected. One can view the cell arrangement in Figure 8.
In the tests we performed on the actual VLSI circuits, we used the four bit plus sign representation shown in Figure 9 for the entries of the B template in Equation (1), where Vdd = 5 V and logic 1 = 5 V. We can select 1 V as bias to operate the bias transistors around threshold. Note that the state node (x) is referenced to $V_{ref}$ and does not need a signed representation.
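The normalized operation can be illustrated with a short sketch. It assumes that each kernel entry acts as a positive gain equal to its magnitude, that a negative entry is realized by negating the input voltage about a reference, and that the result is divided by the sum of the gains; this is one interpretation of the text, and the names `normalized_sum` and `v_ref` are hypothetical.

```python
def normalized_sum(voltages, kernel, v_ref=0.0):
    """Gain-weighted average of inputs, with sign handled at the input.

    Each gain is |k| (gains must be positive); an input whose kernel entry is
    negative is mirrored about v_ref before weighting.
    """
    gains = [abs(k) for k in kernel]
    total_gain = sum(gains)
    if total_gain == 0:
        raise ValueError("at least one kernel entry must be nonzero")
    acc = 0.0
    for v, k, g in zip(voltages, kernel, gains):
        signed_v = (v - v_ref) if k >= 0 else -(v - v_ref)
        acc += g * signed_v
    return v_ref + acc / total_gain
```

Because the result is an average, its magnitude about `v_ref` never exceeds that of the largest input, which is the dynamic range property noted above.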
Positive entry values for the feedback template A can be used to steer x to the + terminal, and negative entries to steer it to the - terminal, of the wide range differential amplifier. In the four bit plus sign representation, A values too can take on values in the range from -3.75 to 3.75 in increments of 0.25. With more area and more bits, higher resolution and dynamic range would be possible. Other parameters are as follows: $-3.75 \le a_{ij} \le 3.75$ and $0 < x < V_{dd}$.
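The stated range and increment correspond to 15 nonzero magnitude steps of 0.25, which fit in four bits. The sketch below assumes a plain sign-magnitude encoding; the actual bit layout of Figure 9 is not known here, and `encode_weight` and `decode_weight` are hypothetical helpers.

```python
STEP = 0.25      # weight increment stated in the text
MAX_CODE = 15    # largest 4-bit magnitude code: 15 * 0.25 = 3.75


def encode_weight(value):
    """Return (sign_bit, magnitude_code), clamping to the representable range."""
    magnitude = min(MAX_CODE, round(abs(value) / STEP))
    sign_bit = 1 if value < 0 else 0
    return sign_bit, magnitude


def decode_weight(sign_bit, magnitude):
    """Inverse of encode_weight for valid codes."""
    if not 0 <= magnitude <= MAX_CODE:
        raise ValueError("magnitude code must fit in four bits")
    value = magnitude * STEP
    return -value if sign_bit else value
```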
The feedback configuration of the feedforward weights provides for some safety against the tendency of active current computation units to be attracted to a power supply rail.
Two real images were captured using a CCD camera and presented to the chip three inputs at a time. Three weights for the inputs are set to represent the one dimensional vertical edge kernel of 1 0 -1. Each pixel is presented in this fashion. The whole image was presented three pixels at a time along the horizontal direction to the model circuit shown in Figure 8 (b) with the bias amplifier disabled. The result is shown in Figure 11 . 
Area estimates and Other Issues
The following are rough area estimates based on the full implementation of the cellular paradigm as shown in Figure 3.
In determining the dimensions of the preprototype, we used the models implemented as a guide.
Results are in Table 1 . Consequently, the technology size used for the area estimate is given in units of lambda. Lambda equals technology size, i.e., lambda=2µ in 2µ CMOS technology. Note that as the technology scales, the area will do so accordingly. For instance, for a 1.2 µ process, the size estimate should be multiplied with a scale factor of 0.36, or more precisely 1.44/4.0 (which equals (1.2 µ) 2 /(2.0 µ) 2 . The size estimate also includes a buffer analog storage for the pixel entries. These data show that a 16x16 pixel chip can be implemented using a roughly 2.0 mm x 2.0 mm area in 0.25µ technology. Pin count is also within most standard package limits. A one third inch chip (8.0 mm x 8.0 mm) in the same technology would be able to contain 64 x 64 pixels.
Table 1: Area per cell.
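The scaling arithmetic above can be checked with a few lines. The sketch assumes that area scales with the square of the feature size and that the cell pitch implied by the 16 x 16 estimate (2.0 mm / 16) carries over to the larger die; the function names are illustrative.

```python
def area_scale(new_feature_um, ref_feature_um=2.0):
    """Area scale factor when moving from the reference to a new feature size."""
    return (new_feature_um / ref_feature_um) ** 2


def cell_pitch_mm(die_side_mm, pixels_per_side):
    """Cell pitch implied by fitting pixels_per_side cells along one die side."""
    return die_side_mm / pixels_per_side


def pixels_per_side(die_side_mm, pitch_mm):
    """Whole number of cells that fit along one side of the die."""
    return int(die_side_mm // pitch_mm)


scale_1p2 = area_scale(1.2)                    # about 0.36, i.e. 1.44 / 4.0
pitch = cell_pitch_mm(2.0, 16)                 # 0.125 mm per cell
side_on_8mm_die = pixels_per_side(8.0, pitch)  # cells per side on an 8 mm die
```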
There are several comments to be made at this point:
(1) A commercial camera has about an order of magnitude higher resolution.
(2) Only a small portion of the cell is light sensitive. Consequently, the sensor would have a very small fill factor. Even for operation at quite a low resolution, the chip would require a mechanism for focusing the light incident over the whole cell onto the light sensitive area.
(3) The sensitivity of CMOS sensors to light (or quantum efficiency) is lowered as the technology shrinks. Thus using 0.25 micron technology may not be feasible.
(4) 3 x 3 kernels may be too small and restrictive.
A New Paradigm : The Dual Distributed Architecture
Motivation
In order to build a commercially viable product, one must pay close attention to the findings of the previous section. Commercial objectives would generally dictate that:
(1) The light sensing resolution of the imaging system be close to that available commercially.
(2) The fill factor be significantly higher.
(3) The sensor performance be maintained with shrinking technology size.
(4) A set of programmable kernel sizes be available, rather than only a 3 x 3 kernel.
The cellular neural network paradigm, as is, is not likely to allow us to reach these goals in a cost effective manner using currently available technology. We therefore undertake to think about image sensing and processing outside of it.
Since our goal is to build a system in which one can leverage the speed of silicon against the large area requirements of a cellular processor, we view again the components of the cell as they are listed in Section 2, namely:
1. a photodiode and circuits for active light sensing, 2. transconductance amplifiers for feedforward template weight multiplication, 3. wide range transconductance amplifiers for feedback template weights, 4. analog/single bit digital local dynamic memory cells, 5. a data bus for transfers, 6. local programmable logic, and 7. read/write data controls.
At this point, we take a systematic approach to meeting our objectives itemized above. We note that:
(1) To improve the resolution, one needs to shrink the cell. (2) To improve the fill factor, one needs to increase the light sensitive areas. (3) To keep the sensor operational, one may need to build the sensors (or the whole chip) using a larger technology, which again means that one needs to make the cell yet smaller. (4) To increase the kernel size, one may need to build a yet larger cell that connects to more of its neighbors.
Item four runs counter to all the rest until we decide to break up the cell into two, leading to the division of the array into (1) a sensory convolution unit and (2) a logic unit. This concept is illustrated in Figure 12. In this manner, one can also implement (in the feedforward connections) arbitrary size kernels, including a kernel that is as large as the sensor itself, but only one such kernel at a time.
In summary, the described architecture contains two distinct structures: (i) the Programmable Convolution Pixel Array (PCPA), a light sensing grid with convolution capability, which could also include some short term memory, and (ii) the Programmable Cellular Logic Array (PCLA), a digital logic and memory area on which a "mental picture or representation of the image" is recreated. This is in some very coarse sense analogous to the concept of the retina and the visual cortex being separated from one another, where the latter constructs a mental picture from the information it receives from the former. Both areas compute in parallel and communicate with each other as needed in a serial but random access fashion. Also, it is primarily the digital logic array (PCLA) that communicates with a conventional digital signal processor, microprocessor or microcontroller.
The Signed Output Pixel (SOP)
For the functioning of the convolution circuits, the light sensor should produce signed outputs, i.e., there needs to be a (+) and a (-) pixel output. The positive (+) output is fed into the amplifier if it contains a positive kernel value. Correspondingly, the negative (-) pixel output is used for a negative kernel value. A one dimensional example is given below.
Suppose that one needs to implement a one dimensional edge kernel of [1 0 -1]. The equivalent arithmetic operation is the sum pixel[i-1] + (-pixel[i+1]). Since current summation is used, one needs to add the current for the positive representation of pixel[i-1] to the current for the negative representation of pixel[i+1]. Figure 13 shows the layout of such a single signed output active sensor. The corresponding schematic of the signed sensor is illustrated in Figure 14, where the positive and negative pixel representations are marked. Note that the additional follower is necessary since the photocurrent is very small and the photosensing node should not be perturbed.
Programmable Convolution Pixel Array (PCPA)
Figure 15 shows a representative drawing of the schematic of one element of the sensor array (PCPA). Each such pixel cell can be addressed by the combination of row and column select signals. Weight bits can thus also be written to selected pixels. Pixel midpoint determines the "0" value, i.e., no light, of the signed pixel value representation. Reset pixel will charge the pixel output to that value. A dedicated GND signal need not be routed, since a metal layer covering the circuits (with openings at light sensitive areas) can be grounded. Each cell, shown schematically in Figure 15, occupies an area of about 350 λ by 350 λ and contains:
1. one signed output pixel (SOP), 2. four D flip flops for storage of three bit weight and sign, 3. four multiplexers, 4. one wide range transconductance amplifier connected in the unity follower configuration, and 5. a three input AND gate.
The size of the cell can be reduced significantly since, as the layout in Figure 16 shows, there are some empty spaces. Added metal layers will no doubt compact the cell further. A significant area savings would result from a dynamic memory cell, since each kernel would be used for a short time.
Figure 16: Layout of a single cell with areas of the circuit elements marked.
In Figure 17 , a PCPA layout of 5 x 5 pixels is illustrated. The 2 micron CMOS technology design with its pads fits into an area of 2.3 mm x 2.3 mm and is embedded in a 40 pin DIP chip package. All 25 pixel outputs are summed on a common output line. One can select different signs and weights for each of the cells and thus perform arbitrary convolution operations. Arbitrary 4-bit kernels up to 5 x 5 can be programmed. 
Programmable Cellular Logic Array (PCLA)
The output of a convolution operation performed by the PCPA is an analog voltage value. While this rapid arbitrary size kernel convolution operation is extremely useful, there are many early vision operations that require further processing of the results of many convolution operations. The programmable cellular logic processor is a means of implementing this stage. The cellular logic processor is a binary (or digital) set of programmable logic elements arranged on a cellular grid. The size of the PCPA and the size of the PCLA could be different, where the latter grid functions as a scratch-pad for a set of early vision operations, such as shape detection, contour following, and pattern matching, performed using the outputs of the former grid, potentially at different resolutions.
Furthermore, each element of the PCLA can be addressed like a memory array where the PCPA results can be stored and processed. Added functionality would result from the ability to transfer bits between arbitrary locations of the cellular logic grid.
A simple implementation is illustrated in Figure 18 . Each cell of the PCLA minimally accepts inputs from (1) the outputs of its immediate neighbors, and (2) the PCPA. Furthermore, added functionality could be realized (1) with increased connectivity, where each cell receives more inputs, such as those from additional cells in the neighborhood, or from arbitrary cells on the array via memory-like addressing; and (2) with increased cell memory, where the results of previous operations can be stored in the cell. The logic within the cell should be programmable. In fact, the program will change many times during the operation. Sample logic functions could be AND, OR, INVERT, XOR etc. of a set of the available inputs.
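The minimal cell described above can be sketched as a synchronous update over a binary grid. The sketch assumes eight immediate neighbors, zero-valued cells beyond the border, and a cell that also sees its own previous output (the cell memory option); the rule functions are examples of programmable logic, and all names are hypothetical.

```python
def pcla_step(state, pcpa_bits, rule):
    """Apply one programmable logic rule to every cell of a binary grid.

    rule(own, pcpa_bit, neighbors) receives the cell's previous output, the
    thresholded PCPA bit for that cell, and a list of 8 neighbor outputs.
    """
    rows, cols = len(state), len(state[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbors = []
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < rows and 0 <= cc < cols
                    neighbors.append(state[rr][cc] if inside else 0)
            nxt[r][c] = 1 if rule(state[r][c], pcpa_bits[r][c], neighbors) else 0
    return nxt


def dilate_rule(own, pcpa_bit, neighbors):
    """OR of all inputs: grows set regions by one cell per step."""
    return own or pcpa_bit or any(neighbors)


def and_rule(own, pcpa_bit, neighbors):
    """AND of the stored bit with the new PCPA result."""
    return own and pcpa_bit
```

Passing a different rule on each call corresponds to the program changing many times during operation.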
The PCLA would benefit from the use of the emerging reconfigurable field programmable gate arrays (FPGAs). This is a relatively new field; however, researchers are keenly looking at ways to capitalize on the inherent flexibility in these devices to facilitate the building of a better computing paradigm.
Figure 18: A PCLA cell. Inputs: the thresholded output from the PCPA and the outputs of the (8) immediate neighbors. Output: the output of the cell.
One specific concept in this domain, termed the plastic cell architecture (PCA), brings forth a new circuit type that is laid out as an array of identical computing elements, or cells, which can dynamically reconfigure themselves for specific problems [Nagami98]. This new computing paradigm offers a novel feature beyond the common reconfigurable FPGA concept, where so far it has been possible to reconfigure circuits only via software downloaded to one or more FPGAs, with the chips then directly executing the prescribed functions as hardwired circuits. This added feature is the ability of one circuit to dynamically configure another circuit. The resulting processing array is able to mimic the ability to create specialized cells, which in turn allows a cellular array like the PCLA to configure itself based on the outputs of its neighbors, or of itself. This level of data driven performance allows for the implementation of very complex functions from very simple rules.
Communication between the PCPA and PCLA
It is evident that in addition to power and ground signals, many other common data, control and address signals need to be distributed all across the chip. As is the case with many photosensor chips, the PCPA portion of the chip also needs to be covered with a layer of metal with openings only at the photosensitive areas, so that only those areas are exposed to light. Typically, one would ground the layer of metal that covers the whole chip. A similar layer could also be used on the PCLA portion of the dual architecture to carry the output of the PCPA to all points of the logic array.
The row select and column select signals of the PCPA are reminiscent of memory addressing diagrams. A specific pixel is selected only when both signals are logic high. These signals are used only to load the weights to the cell.
Elements of the PCLA can also be addressed as one would address memory cells. A row of cells could be thought of as a long word. A set of write and shift operations could replace the need to route multiple address signals across the PCLA.
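One way the write-and-shift idea could work is sketched below: a row behaves like a shift register, so a whole word is loaded through a single serial input instead of per-cell address lines. The class `PCLARow` and its methods are hypothetical, and the shift direction is an arbitrary choice.

```python
class PCLARow:
    """A row of logic cells treated as one long shift-register word."""

    def __init__(self, width):
        self.bits = [0] * width

    def shift_in(self, bit):
        """Shift the row by one cell, inserting bit at index 0.

        Returns the bit that falls off the far end.
        """
        out = self.bits[-1]
        self.bits = [1 if bit else 0] + self.bits[:-1]
        return out

    def write_word(self, word):
        """Load a word whose length equals the row width using shifts only."""
        if len(word) != len(self.bits):
            raise ValueError("word length must match row width")
        for bit in reversed(word):
            self.shift_in(bit)
```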
Conclusions and Future Work
With the advent of CMOS light sensors, the chance to implement integrated sensor processor devices has taken a big step towards realization. The integration of CMOS based light sensors with locally connected arrays will form very powerful as well as low power integrated sensor-processor arrays that will meet the visual computation challenges of the twenty first century. In this context, the two architectures presented in this paper offer two potential roadmaps.
For successful commercial realization, an integrated image sensor-processor must be packaged as a viable marketable product or a cost-effective problem solution option. Actual demonstrations should not only prove functionality but also address how the investment in the full-scale development of these systems can return high dividends to those who take the risk.
IC Tech has demonstrated the feasibility of several neurally inspired image processing systems to address specific opportunities in emerging consumer vision markets. Two architectures, their discussion and some results were included in this paper.
Future work involves full scale prototype implementation and test of the two architectures. In many miniature applications with little symbolic processing need, the integrated analog sensor processor may actually embody the whole system.
