I. INTRODUCTION

Manuscript received November 28, 1996; revised January 31, 1997. This work has been supported in part by Spanish CICYT under Contract TIC96-1392-C02-02 (SIVA).
R. Domínguez-Castro, S. Espejo, and A.

CONVENTIONAL image-processing systems use a charge-coupled device (CCD) camera for parallel acquisition of the input image and serial transmission of the digitized image to a separate processing element. This results in huge data rates which conventional computers are not capable of analyzing in real time. For instance, a color 512 × 512 pixel camera delivers about 20 MB/s for a typical rate of 25 frames per second. Such a rate may be managed by conventional computers for operations such as auto-focus, image stabilization, control of the luminance/chrominance, etc. However, real-time completion of more intricate spatio-temporal operations requires bulky and sophisticated processors. In contrast to this, the smallest insects, albeit equipped with really tiny brains, are capable of analyzing complex time-varying scenes in real time [1]. This contrast between artificial and natural vision systems is due to the inherent parallelism of the latter. In particular, the cells of the natural retina combine photo-transduction and collective parallel processing for the realization of low-level image processing operations (light adaptation, feature extraction, motion analysis, etc.) concurrent with the acquisition of the image [1]. Inspired by this, new generations of image processing systems have addressed the incorporation of distributed parallel processing already at the plane of the image sensor. One common strategy is to incorporate the sensory and the processing circuitry on the same semiconductor substrate [2]. CMOS technologies offer unique features for this type of chip due to the availability of good CMOS photo-transduction devices [3] and the possibility of realizing linear and nonlinear processing functions with simple CMOS circuitry [4], [5].
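The 20 MB/s figure quoted for the camera can be checked with a short calculation. The sketch below assumes 3 bytes per pixel (8 bits per color channel), which the text does not state; the function name is illustrative only.

```python
def raw_video_rate_bytes_per_s(width, height, bytes_per_pixel, frames_per_s):
    """Uncompressed data rate of a camera, in bytes per second."""
    return width * height * bytes_per_pixel * frames_per_s


# Assumption: 3 bytes per pixel (one byte per color channel).
rate = raw_video_rate_bytes_per_s(512, 512, 3, 25)
rate_mb = rate / 1e6  # decimal megabytes per second

print(f"{rate} B/s = {rate_mb:.2f} MB/s")
```

Under this assumption the rate comes out just under 20 MB/s, in line with the figure in the text.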
A number of CMOS retinas have been previously reported in the literature [2]. In many cases their development has emphasized light adaptation, i.e., the capability to adapt the response of the two-dimensional (2-D) optical sensor to the lighting conditions of the incoming image, while image processing has remained secondary. Some circuits which incorporate processing capabilities are intended for fixed processing functions [6]. On the other hand, the programmable circuits found in the literature have neither accurate standard control interfaces nor the capability of flexible operation. Besides, their development is not systematic because there is no clearly defined design path from the top processing algorithm down to the circuit design [7]. Establishing such a path first requires capturing the retina processing function into signal processing algorithms. Several remarkable efforts have been undertaken in this direction. In particular, researchers of the University of California at Berkeley and the Hungarian Academy of Sciences have recently set up a powerful methodological framework for the systematic formulation of the low-level image processing functions of the retina [8]-[10]. It is based on the observation that most image processing operations can be … With the addition of a few key functionalities, cellular neural networks (CNN's) have evolved into a broader concept, the CNN universal machine (CNN-UM) [10], on which the architecture of the chip reported here is based.
In particular, as compared to previous work by the authors [6], [11], this chip includes the following CNN-UM extended capabilities: a) electrically programmable local analog interactions; b) on-chip memories for internal storage of instructions (sets of local interconnection values), which can be loaded from outside the chip and used any number of times in arbitrary order; c) four spatially distributed 2-D on-chip image memories (each local processor includes its corresponding pixel of each stored image, allowing parallel data transfers in the network); d) a programmable Boolean operator at each local processor for parallel logic operations among images; e) multiplexing and control circuitry for the realization of complex processing tasks (for instance sequential and/or bifurcated-flow algorithms) based on the stored instructions and the available internal image memories; and f) a completely digital interface, making the chip easy to control with conventional computing systems.
The chip, whose fundamental operation is analog and continuous-time, is realized in 0.8-μm CMOS single-poly double-metal technology and intended for focal-plane array processing of binary images. To that purpose, it incorporates a 2-D optical interface for image acquisition with adaptive contrast adjustment. The additional CNN-UM functionalities allow the chip to operate as a powerful front-end for the realization of simple and medium-complexity image processing tasks, including sequential and bifurcated-flow algorithms.¹

¹ CNN-UM's realize the vast majority of image processing tasks through proper selection of the interaction strengths and/or task sequencing, and can hence be considered as general-purpose image processing computers [10], [12].
Other programmable CNN chips have been reported in the literature [13]-[16]. However, none of them incorporates the additional CNN-UM functionalities presented here, nor the optical interface. Also, it is common to limit the local interconnections to vertical and horizontal directions, while the chip described here allows simultaneous interaction among cells located in vertical, horizontal, and both diagonal directions. Despite these extended capabilities, the achieved cell density (27.5 cells/mm²) roughly doubles those previously reported with similar technology resolution. Other favorable differences include a higher complexity (in terms of total number of cells in the array) and the completely digital external interface.
II. SYSTEM ARCHITECTURE AND PROCESSING ALGORITHM

Fig. 1(a) shows the architecture of the chip, which consists of a core programmable array processor plus peripheral circuitry used for electrical I/O, control, storage of the processing coefficients, and conversion of the external digital programming codes into the internal analog programming signals. Processing follows the paradigm of CNN's and is described by using three variables per cell: the cell state; the cell output, which is a saturation-type nonlinear version of the state variable; and the cell input, which represents the external excitation of the cell. Each cell realizes a nonlinear transient evolution according to the evolution law (1), defined over the interacting cell neighborhood. In our chip this neighborhood includes the nine nearest cells: the cell itself plus the eight adjacent cells in the vertical, horizontal, and both diagonal directions.
The input to the above evolution law actually consists of two 2-D images: one is the matrix of initial values of the state variables; the other is the matrix of inputs, assumed time-invariant during processing. The output image is the matrix of cell outputs. In most applications only the steady-state value of this output, reached after the network transient evolution, is significant for processing [9]. This steady state, and hence the processing function performed by the system, is determined by the set of feedback coefficients, the set of control coefficients, the offset term, and the state of the spatial boundary cells in the array. These are the electrically programmable coefficients of the network.
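For orientation, the evolution law of a CNN is commonly written in the literature in the Chua-Yang form sketched below. The notation is assumed here rather than taken from the chip description: $x_c$, $y_c$, and $u_c$ denote the state, output, and input of cell $c$; $a_{ck}$ and $b_{ck}$ the feedback and control coefficients; $d$ the offset term; $\tau$ the network time constant; and $N_r(c)$ the interacting neighborhood. The exact form of (1) used for the chip may differ in its loss term and normalization.

```latex
\tau \frac{dx_c(t)}{dt} = -x_c(t) + d
  + \sum_{k \in N_r(c)} \left[ a_{ck}\, y_k(t) + b_{ck}\, u_k \right],
\qquad
y_c(t) = \tfrac{1}{2} \left( \left| x_c(t) + 1 \right| - \left| x_c(t) - 1 \right| \right)
```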
Our chip, as is common in CNN's, has the property of translational invariance, meaning that the interconnection pattern does not change throughout the array and can therefore be described by a scalar (the offset term) and two 3 × 3 template matrices whose entries represent the strength of the feedback and control interactions. Thus, leaving aside the state of the boundary cells, the processing function of the chip is determined by only 19 parameters (one offset plus two sets of nine template entries), rendering feasible the routing of programming control lines. On the other hand, this property of translational invariance does not over-constrain the processing capabilities of the CNN, as the results in [9], [10], and [12] demonstrate. In particular, [12] shows a full catalog of the processing functions which can be realized by translation-invariant CNN's.
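To make the role of the 19 parameters concrete, the following sketch simulates a small translation-invariant CNN in software using forward-Euler integration of the standard Chua-Yang evolution law. It is only an illustration under stated assumptions: time is normalized to the network time constant, the edge-extraction template values are a textbook-style choice checked by hand for this example (not the values programmed on the chip), and all function and variable names are hypothetical.

```python
def saturate(x):
    """Piecewise-linear output nonlinearity: clamp the state to [-1, 1]."""
    return max(-1.0, min(1.0, x))


def cnn_step(x, u, A, B, d, dt, boundary):
    """One forward-Euler step of dx/dt = -x + d + sum(A*y + B*u) over the 3x3 neighborhood."""
    rows, cols = len(x), len(x[0])

    def at(img, r, c):
        # Cells outside the array take the programmable boundary value.
        if 0 <= r < rows and 0 <= c < cols:
            return img[r][c]
        return boundary

    y = [[saturate(v) for v in row] for row in x]
    nxt = []
    for r in range(rows):
        new_row = []
        for c in range(cols):
            drive = -x[r][c] + d
            for i in (-1, 0, 1):
                for j in (-1, 0, 1):
                    drive += A[i + 1][j + 1] * at(y, r + i, c + j)
                    drive += B[i + 1][j + 1] * at(u, r + i, c + j)
            new_row.append(x[r][c] + dt * drive)
        nxt.append(new_row)
    return nxt


def run_cnn(x0, u, A, B, d, dt=0.05, steps=400, boundary=-1.0):
    """Integrate the transient and return the steady-state output image."""
    x = [row[:] for row in x0]
    for _ in range(steps):
        x = cnn_step(x, u, A, B, d, dt, boundary)
    return [[saturate(v) for v in row] for row in x]


# 19 parameters: nine feedback entries, nine control entries, one offset.
EDGE_A = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
EDGE_B = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]
EDGE_D = -1.0

# Binary input: +1 is black, -1 is white; a 3x3 black square on a white 5x5 field.
image = [[1.0 if 1 <= r <= 3 and 1 <= c <= 3 else -1.0 for c in range(5)]
         for r in range(5)]
initial_state = [[0.0] * 5 for _ in range(5)]

edges = run_cnn(initial_state, image, EDGE_A, EDGE_B, EDGE_D)
for row in edges:
    print("".join("#" if v > 0 else "." for v in row))
```

With this template a black pixel stays black only if at least one of its eight neighbors is white, so the interior pixel of the square is removed and its border is kept.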
Most of the cell area (about 70%) is occupied by the analog circuitry required to realize (1). The cell also contains other circuitry, as the inset of Fig. 1(b) shows. The functions associated with this circuitry are the following.
• Photo-Transduction and Light-Adaptation: Each cell includes a photo-sensor and a collective computation adaptive circuit [18] for automatic contrast enhancement.
• Local Logic Unit (LLU): A fully programmable two-input Boolean operator is included in each cell, providing the capability of parallel logical operations among binary images.
• Memory Storage (LLM): Each cell incorporates four image pixels which can be loaded from the image-sensor, the output of the processing circuitry, or the output of the LLU.
• Local Communication and Control Unit (LCCU): It connects the source and destination of signal transfers realized inside the cell.

Eight complete sets of CNN processing coefficients can be stored on chip. Although the internal control is analog, the coefficients are codified and stored in digital form, with a resolution of 7 b + sign, in accordance with the expected accuracy of the analog processing circuitry [5]. On the other hand, any stored image can be used as either of the two input images of the network, or as input to the Boolean operator. Besides the parallel optical loading, images can also be loaded or downloaded, on a row-by-row basis, through an external bidirectional I/O bus. Finally, because the spatial boundary conditions also play an important role in processing, the cell array is surrounded by a ring of border cells with programmable output variable.

Fig. 2 shows a block diagram of the cell analog processing circuitry, including different functional blocks: integrator, nonlinearity, memory, and programmable interconnection synapses. The latter are time-multiplexed, meaning that the 18 parameters associated with the feedback and the control templates are implemented using only nine synapses. To that purpose, the control contribution [see (1)] is first calculated at each computation cycle (bear in mind that the inputs remain constant during one such cycle) by programming the nine synapses with the values of the control coefficients and driving the disabled integrator with the cell input variable. The result is stored in an analog current-mode memory. Note that during this process a sign inversion takes place. This is internally solved using a weight inversion circuit, active only during the computation of the control contribution.
After the result has been stored in the analog memory, the nine synapses are reprogrammed with the feedback coefficients, the integrator is enabled and set to its initial condition, and the value previously stored in the analog memory is added (actually subtracted) to the feedback contribution to realize (1). The sign inversion performed by the analog memory provides a functional cancellation of the output-referred offset of the synapses. The offset term is generated by another synapse circuit driven by a constant reference signal.
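The two-phase use of the nine synapses, and the way the stored value can cancel a common output offset, can be illustrated with a simple behavioral model. This is one possible reading of the description, not the circuit itself: the synapse bank is modeled as an ideal weighted sum plus a lumped output-referred offset, the control phase uses inverted weights, and the stored value is subtracted in the feedback phase. All names and numeric values are hypothetical.

```python
def synapse_bank(weights, signals, offset):
    """Nine multipliers summed on one node, plus a lumped output-referred offset."""
    return sum(w * s for w, s in zip(weights, signals)) + offset


def control_phase(b, u, offset):
    """Phase 1: program inverted control weights and store the result.

    The stored value is -(B.u) + offset.
    """
    inverted = [-w for w in b]
    return synapse_bank(inverted, u, offset)


def feedback_phase(a, y, stored, offset):
    """Phase 2: reprogram with feedback weights and subtract the stored value.

    (A.y + offset) - (-(B.u) + offset) = A.y + B.u, so the offset drops out.
    """
    return synapse_bank(a, y, offset) - stored


a = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]   # feedback template (row-major)
b = [-1.0, -1.0, -1.0, -1.0, 8.0, -1.0, -1.0, -1.0, -1.0]  # control template
u = [1.0, -1.0, 1.0, 1.0, 1.0, -1.0, -1.0, 1.0, 1.0]  # neighborhood inputs
y = [0.5] * 9                                          # neighborhood outputs
offset = 0.3                                           # assumed lumped offset

stored = control_phase(b, u, offset)
total = feedback_phase(a, y, stored, offset)
ideal = sum(p * q for p, q in zip(a, y)) + sum(p * q for p, q in zip(b, u))

print(total, ideal)
```

In this model the same offset appears in both phases with the same sign, which is why subtracting the stored value removes it.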
III. LOCAL ANALOG CIRCUITRY
The analog processing circuitry is fully differential and operates in transconductance mode. The synapse inputs are voltages, which eases the intrachip distribution of template coefficients (common to all cells in the array), and the intracell distribution of the state-variable to all the cell synapses. On the other hand, the synapse output is a current, thus facilitating summation at the integrator input nodes.
A. Programmable Synapse: Control Strategy
The programmable synapse is a key building block of massively parallel analog processing systems. Its design must face structural and parametric optimization to combine large cell densities and moderate accuracy [19] . However, a previous step is to choose whether to use digital or analog signals to control the weights.
In a general electrically programmable analog synapse, the input and output signals must be analog. On the other hand, the control signal may be either analog or digital.
In the case of a generic digitally controlled synapse, its transfer characteristic may be written as follows (2) where is a digital code for the control signal, is an approximately linear and continuous function of , is the vector of technological parameters, is the absolute temperature.
as well as depend on the circuit structures used at the synapses.
On the other hand, for a generic analog-controlled synapse, the transfer characteristic may be written as (3) where the relationship between the output and the control signal may be nonlinear and the function may change with the value of the control signal.
The equations above highlight advantages and drawbacks of each strategy. Digital control yields inherent linearity (excluding quantization) with the weight signal. Also, because the functions in (2) are independent of the weight, their influence can be eliminated by using some inverse function cancellation technique, similar to what happens, for instance, in a current mirror. In contrast to this, the function in (3) depends on the weight signal, and the output signal of the analog synapse may be a nonlinear function of the weight. In addition, digitally coded weight signals are much better suited for on-chip storage than analog ones. Some crucial system-level advantages of analog-controlled synapses are their smaller area and power consumption and their reduced number of control routing lines.

Fig. 3 illustrates the control strategy adopted in our chip, which combines the benefits of analog and digital control. It is similar to the strategy used for on-chip filter tuning [20]. Ten peripheral tuning stages (i.e., located in the periphery of the cell array) are employed to generate analog weight signals from their digitally coded values. The analog weight signals are then used to control the synapses within the cells in the array, using only ten global routing channels (20 for differential schemes).
Each peripheral weight tuning stage consists of an analog-controlled synapse and a digitally controlled synapse connected in a feedback loop. The latter is driven by the digitally coded weight, while the former is driven by the output of the tuning loop, yielding the steady-state value given by (4). Assuming that the inner synapses are perfectly matched (same functions, same technological parameters, and same temperature) to the ones used at the tuning stage, and combining (3) and (4), one obtains (5), which retains the feature of linearity with the weight of the digital synapse and attenuates the dependence of the synapse function on the control signal. Note also that the control interface is digital, and hence the weights are easily storable using conventional digital memories. On the other hand, since the internal routing signals are analog, the area occupation due to the routing channels is the smallest possible. In addition, as for digitally controlled synapses, process parameter variations will not affect the linearity of the programmed weight values.
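The effect of the tuning loop can be illustrated with a small numerical model. The sketch below is a behavioral illustration only: the tanh-shaped weight law of the analog synapse, the loop gain, the full-scale value, and the signed code range of -127 to 127 (one reading of "7 b + sign") are all assumptions, and the integrator is modeled as a discrete accumulation of the loop error.

```python
import math

FULL_SCALE_CODE = 127   # assumed signed code range: -127 .. 127
MAX_WEIGHT = 0.8        # assumed full-scale weight of the digital synapse


def analog_synapse(weight_voltage, x):
    """Hypothetical analog synapse: linear in x, nonlinear in the weight voltage."""
    return math.tanh(1.5 * weight_voltage) * x


def digital_synapse(code, x):
    """Digitally controlled synapse: output is linear in the code and in x."""
    return MAX_WEIGHT * (code / FULL_SCALE_CODE) * x


def tune_weight(code, x_ref=1.0, loop_gain=0.2, steps=500):
    """Feedback loop: integrate the difference between both synapse outputs.

    At steady state the analog synapse reproduces the digital one, so the
    returned weight voltage predistorts the analog nonlinearity.
    """
    weight_voltage = 0.0
    for _ in range(steps):
        error = digital_synapse(code, x_ref) - analog_synapse(weight_voltage, x_ref)
        weight_voltage += loop_gain * error
    return weight_voltage


for code in (-127, -64, 0, 64, 127):
    w = tune_weight(code)
    # A matched in-cell synapse driven by w behaves linearly in the code.
    print(code, round(w, 4), round(analog_synapse(w, 0.5), 4))
```

The tuned weight voltage is a nonlinear function of the code, but the output of a matched synapse driven by it is proportional to the code, which is the property the text attributes to (5).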
B. Analog Synapses
The second step in the design of the programmable interconnections is the choice of the synapse circuitry. The major functional issues here are linearity with the input signal and sensitivity to technological parameter mismatches. An exhaustive analysis of the area-accuracy tradeoff of alternative CMOS synapses [5] has led us to choose the structure of Fig. 4, which consists of a linear multiplier core with four transistors operating in the ohmic region [21] and two source-follower buffers. The transfer characteristic of the multiplier core (dashed box in Fig. 4) is given by (6), which involves the large-signal transconductance of the inner transistors, the differential input signal, and the differential weight signal for the multiplier core. Although the derivation of (6) does not account for mobility degradation effects, the measured transfer characteristics are shown to be highly linear with both the input and the weight signals.
The addition of the (resistively loaded) source-follower buffers results in a nonlinear characteristic with respect to the new differential weight signal. However, it can be shown that the linearity with respect to the input signal is preserved. In fact, the source followers did not preclude obtaining integral nonlinearity (INL) values smaller than 0.4% and total harmonic distortion (THD) values below 0.2% for up to 2 V differential input range. On the other hand, the nonlinear dependence on the analog weight signal is transformed into a linear relationship with respect to its digital code by the tuning strategy described previously.
The second functional issue is related to mismatch sensitivity. Different instances of the same synapse will have, in general, different technological parameters and temperature. The influence of the former is better understood with reference to the behavioral representation (7) of a mismatch-affected synapse, in which the terms representing the gain error and three different offset sources are nominally null. Table I shows simplified expressions for the variances of these errors as functions of the variances of the threshold voltages and large-signal transconductances of the MOS transistors enclosed in the dashed box in Fig. 4. Mismatch effects on the source followers and on the current-conveyor input offset can be computed separately in a simple form.⁴ For convenience, the errors in the offset terms are normalized by the maximum values of the corresponding signals. These simplified error expressions correspond to the optimum design conditions given in the fifth row of Table I, which assume that the internal weight and input signals have approximately the same signal range, and that the common-mode component of the internal weight signals equals the dc voltage at the low-impedance input nodes of the current conveyors. This last condition is accomplished by the common-mode circuitry in the weight-tuning stages. The sixth row shows the maximum current which is drawn from the power supply under these design conditions, while the seventh row shows the maximum output current delivered by the synapses.
Similar expressions have been obtained for other synapse alternatives, allowing a comparison in terms of mismatch sensitivity, signal range, and linearity under the same constraints of area occupation and power consumption.⁵ Our choice of the topology in Fig. 4 is based on a clear superiority with respect to other alternatives, in particular as compared to synapse circuits based on MOS transistors operating in the saturation strong-inversion region.

The sizes of the transistors in Fig. 4 have been chosen based on the formulation of the short-distance (spatial white noise) component of the mismatch [19] to obtain 1% accuracy using the formulae in Table I. The core transistors are of 5 μm/12 μm, while the transistors in the buffer are of 5 μm/3 μm. Fig. 5(a) shows the dc response of the synapse for different weight values. As already mentioned, the INL is better than 0.4% and the THD is below 0.2% within the required signal range. Statistical characterization, based on ten samples⁶ located on different chip units, yielded a standard deviation of the output current offset below 1% relative to the maximum output current, and an average INL below 0.4% relative to full signal range, as illustrated in Fig. 5(b).

⁴ The buffers have only a slight influence on the error in Wos. The effect of the current-conveyor input offset voltage is cancelled as a side effect of the computation of the control contribution, as explained earlier.
⁵ The power expended by the buffers in Fig. 4 is about 50% of that in the core cell.
⁶ The foundry delivered only ten samples of the chip.

C. Integrator, Nonlinearity, Analog Memory, and Common-Mode Feedback

Fig. 6(a) shows the schematics of the cell integrator and the nonlinear block, which are driven by the aggregated currents from the synapses driving the cell and make use of a reference voltage. The integrator is realized with two current conveyors and two grounded capacitors. The conveyor has a fully differential architecture to conform to the differential nature of the synapse signals and to reduce common-mode parasitics.

To save silicon area, the integrating capacitors are in fact realized through the gate capacitance of the synapses connected to the integrator output. To this purpose, we take advantage of the fact that the signal transistors in the synapse operate inside the ohmic region in strong inversion, where the gate capacitance is fairly linear. Fig. 6(b) shows the schematics of the current conveyor, jointly with the analog memory and the common-mode feedback circuitry. The current conveyor includes a local regulation feedback to reduce the input impedance and, thus, the current division errors which arise when these nodes are driven by the multiple low-impedance synapses. The analog memories are realized through the capacitor-connected transistors shown at the top right and left of the figure. For clarity, the schematic does not contain details of the analog switches; those driving the memories are realized through complementary transistors, while all the others are realized through single transistors. Fig. 6(c) shows the schematics of the nonlinear resistor (realized with two diode-connected transistors) and the initialization circuitry, which is responsible for setting the initial condition and the common-mode signal of the cell.

Fig. 7(a) shows the large-signal V-I characteristic observed at the low-impedance input node of the current conveyors. Statistical characterization over ten samples showed a standard deviation for the input offset voltage of 3 mV. The effect of this offset is cancelled by the storage and subtraction process employed to compute the control contribution, as explained earlier. Fig. 7(b) contains the I-V characteristic of the voltage limiter employed for integrator saturation. Saturated voltage levels exhibit a standard deviation of 2% of the full signal range, resulting in the dominant accuracy limitation. This is attributed to the minimum-length transistors employed in the nonlinear element, and could be easily improved in future prototypes at the cost of a negligible increase in area consumption.
As a final point in this section, it seems appropriate to describe the operation of the cell, step by step.
• First: set the state variable of every cell to the value corresponding to the cell input. This is done by the circuitry of Fig. 6(c), in which a pair of sign switches controls the sign of the initial condition under the driving of the image memories. Simultaneously, the control weight voltages [the control coefficients in (1)] are applied to the synapses. At the end of this process, the resulting value is stored in the (current-mode) analog memory. A sign inversion occurs in this process, which is internally solved using opposite-sign control weights at every cell.
• Second: set the state variable of every cell to the value corresponding to the initial conditions, using again the circuitry of Fig. 6(c). Simultaneously, the feedback weight voltages [the feedback coefficients in (1)] are applied to the synapses. At the end of this step, the cell loop is not yet closed, and the corresponding nodes of Fig. 6(b) are both grounded.
• Third: the loop-closing control signal is enabled, closing the loop of the cells and starting the analog processing transient. After a typical 2 μs of evolution time, the processing is complete; the outputs are then stored into an image memory and the system is ready to start a new operation cycle.
IV. WEIGHT TUNING STAGES
As mentioned earlier, ten weight-tuning stages located in the periphery of the chip are employed to generate the analog weight signals from their digitally coded values. Each of them is a fully differential version of Fig. 3. The feedback loop comprises an analog synapse, an integrator, and a linear digitally controlled synapse (MDAC). The integrator is identical to those used inside the cells and described above. On the other hand, the analog synapse in the tuning stage is formed as a parallel connection of six instances of Fig. 4, that is, of the analog synapse employed within the cells. The subsequent averaging of the random errors at each of these instances results in a reduction of the relative output current standard deviation by a factor of $\sqrt{6}$, and hence features better matching. As can be seen from (4), the weight-tuning stages behave as a nonlinear digital-to-analog converter, predistorting the nonlinear relationship between the analog weight signal and the programmed weight of the analog synapse. Fig. 9 shows measurements taken from ten different samples of the tuning stage. In particular, it shows the integral nonlinearity [in least significant bits (LSB's)] of the output current of an analog synapse driven by the tuning stages (and with a fixed input signal) versus the digitally coded weight value. The worst-transition adaptation time of the tuning stage was below 1 μs with a capacitive load of 35 pF, which is the estimated on-chip load of the tuning stage.
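The benefit of connecting six synapse instances in parallel follows from the averaging of independent random errors: the relative spread of the combined output shrinks with the square root of the number of instances. A quick Monte Carlo check is sketched below; the Gaussian error model, the 1% per-instance spread, and the function name are assumptions made for illustration.

```python
import random
import statistics


def relative_spread(n_parallel, trials=20000, sigma=0.01, seed=0):
    """Standard deviation of the normalized output of n parallel synapses.

    Each instance delivers a unit current with an independent Gaussian
    error of relative spread `sigma`; the outputs are summed and the
    total is normalized by the number of instances.
    """
    rng = random.Random(seed)
    outputs = [
        sum(1.0 + rng.gauss(0.0, sigma) for _ in range(n_parallel)) / n_parallel
        for _ in range(trials)
    ]
    return statistics.pstdev(outputs)


single = relative_spread(1)
six = relative_spread(6)
ratio = single / six

print(f"improvement factor: {ratio:.2f} (sqrt(6) = {6 ** 0.5:.2f})")
```

The estimated improvement factor should lie close to the square root of six, up to the statistical noise of the simulation.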
V. OPTICAL INTERFACE CIRCUITRY
Image acquisition relies on photo-generated currents at floating-base vertical CMOS-bipolar junction transistors (BJT's) placed at every pixel. An automatic adaptive scheme is used to ensure appropriate contrast levels by shifting the observed scene to obtain a zero-mean distribution of pixel values. Optional external circuitry can be employed to adjust the mean of the distribution when needed, for instance for highly regular images with dominant background. This optical interface circuitry is identical to that previously described by the authors in [18], and already used in their fixed-function CNN image processing chips [6].

Pixel sensors consist of two vertical BJT's arranged in Darlington configuration, as shown in Fig. 10(a). The photo-generated current at the base-collector junction (n-well/p-substrate) of the first transistor is amplified by both transistors, yielding output current levels of about 0.8 μA under an environmental laboratory lighting of 0.9 W/m². The acquired gray-scale image is shifted by subtraction of the spatial average, thus ensuring proper contrast adjustment over a wide range of illumination conditions. The averaging circuitry, shown in Fig. 10(b), is included in every cell and globally interconnected through a common node SUM. A CMOS inverter transforms the shifted output current to digital levels, yielding a binary version of the image which can then be stored in the internal memories. Fig. 10(c) illustrates a system-level simulation of the performance of the threshold regulation circuitry. The area of the imaging circuitry (sensor + regulation) is of about 110 μm , which amounts to about 7% of the cell area.
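The contrast adjustment described above amounts to thresholding each pixel against the spatial average of the whole image. A minimal software analogue is sketched below; the polarity (brighter than average maps to 1) and the function name are assumptions, since the text does not state which level the inverter assigns to bright pixels.

```python
def binarize_by_mean(photocurrents):
    """Subtract the spatial average and keep only the sign of the result.

    Assumption: pixels brighter than the average map to 1, the rest to 0.
    """
    flat = [value for row in photocurrents for value in row]
    mean = sum(flat) / len(flat)
    return [[1 if value > mean else 0 for value in row] for row in photocurrents]


scene = [[2.0, 8.0],
         [9.0, 1.0]]
dim_scene = [[value / 2 for value in row] for row in scene]  # same scene, half the light

print(binarize_by_mean(scene))
print(binarize_by_mean(dim_scene))
```

Because the threshold follows the average, scaling the illumination leaves the binary image unchanged, which is the adaptation property the text describes.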
VI. LOCAL LOGIC CIRCUITRY
About 25% of the cell area is dedicated to local logic circuitry, including the programmable Boolean operator, image memories, and some signal multiplexing and control. Their realization is based on switches and conventional digital circuitry. The 4-b image memory, based on charge storage, employs metal-1 shields over sensitive areas to avoid the adverse effect of light on reverse diode currents, which could result in a significant reduction of storage time.

Fig. 11 shows the LLM. It consists of four digital dynamic memory cells of one transistor and one capacitor (the latter is really a transistor with the drain and source shorted). The binary nature of the output of analog CNN operations allows the use of the LLM to store the results of both analog and logic operations. In this sense, the LLM combines the functionality of the LLM and the local analog memory (LAM) of a general CNN universal machine [10]. The storage time of the digital memory is in the range of 200 ms. However, a refreshing procedure can be employed to maintain the information as long as desired, as explained below. Any of the four bits can be addressed at any time using four global selection signals. An inverter and four switches implement the sense amplifier. During writing, the digital input data is inverted and written into a particular capacitor, depending on which selection signal is activated. The read and refresh process follows three steps. First, the input and output of the inverter are shorted, and the parasitic capacitors are precharged to the quiescent point of the inverter, while all selection signals are disabled. Second, the short is released, one of the selection signals is enabled, and the corresponding data is inverted and written to a separate capacitor. Finally, the inverter is flipped again to refresh the data in the memory cell.
The capacitor written in the second step can be used as a temporary memory for inter-memory data transfers. Fig. 12(a) illustrates the implementation of the LLU, which is a two-input digital operator with a fully programmable truth table. The two input signals of the LLU are always taken from the contents of memories 1 and 2 through four dedicated signals (see Fig. 11). Fig. 12(b) shows the LCCU. It is an analog multiplexer which selects the origin or destination of the signal to be stored in (or retrieved from) one of the digital memories. Possible sources are the photo-sensor, the external I/O signal corresponding to a particular cell column, the cell output (obtained from the state variable by a simple comparator realized by a differential amplifier), and the output of the logic unit. Possible destinations are the analog network initialization circuitry or the external I/O signal. In addition, by selecting two signal paths simultaneously, the multiplexer permits the downloading of the cell output on a row-by-row schedule, as well as reading the output of the photo-sensors or the local logic unit. Although the logic circuitry plays an important role in the system operation, it may degrade the analog function through the induced substrate noise. To reduce this, all the digital circuitry is kept inactive while the analog operation takes place.
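A two-input operator with a fully programmable truth table can realize any of the 16 Boolean functions of two variables by storing four output bits. The sketch below models this per pixel in software; the bit ordering of the truth table and all names are assumptions made for illustration, not the chip's programming format.

```python
def make_llu(truth_bits):
    """Build a two-input Boolean operator from its four truth-table outputs.

    Assumed ordering: outputs for inputs (0,0), (0,1), (1,0), (1,1).
    """
    table = tuple(truth_bits)

    def llu(a, b):
        return table[(a << 1) | b]

    return llu


def apply_llu(llu, image_1, image_2):
    """Apply the same operator to every pixel pair of two binary images."""
    return [[llu(p, q) for p, q in zip(row_1, row_2)]
            for row_1, row_2 in zip(image_1, image_2)]


AND = make_llu((0, 0, 0, 1))
XOR = make_llu((0, 1, 1, 0))

memory_1 = [[0, 1], [1, 1]]
memory_2 = [[1, 1], [0, 1]]

print(apply_llu(AND, memory_1, memory_2))
print(apply_llu(XOR, memory_1, memory_2))
```

Reprogramming the operator only changes the four stored bits, so the same per-pixel hardware serves every logic operation between the two stored images.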
VII. SYSTEM-LEVEL PERFORMANCE

Fig. 13(a) shows a microphotograph of the prototype and a summary of its specifications. The technology is a double-metal, single-poly, n-well, 0.8-μm CMOS process available through the EUROCHIP consortium. The total area of the prototype is 30 mm², the time constant of the analog network is about 0.4 μs, and the operation speed of the digital circuitry is 10 MHz. Fig. 13(b) shows the microphotograph and the layout of one cell. The ten synapses occupy 45% of its area, 8% is occupied by the integrator and the nonlinear element, 15% is devoted to the analog memories, 25% to the digital and control circuitry, and 7% to the optical interface. Two additional smaller chips with analog and digital parts were fabricated for testing and characterization purposes. The area occupied by the cell array is 53% of the total chip area. The tuning stages and the coefficient memories use about 4% of the chip area, 35% is for routing, and 8% for the pads.

The electrical operation of key circuit blocks has been illustrated in the main text. In what follows we illustrate the system-level performance of the chip. The prototype has been globally tested and its functionality verified by the designers at the IMSE-CNM Laboratories, where the electrical characterization was undertaken, and it was also shown that the chip realizes low-level operations such as low-pass image filtering, corner and border extraction, hole filling, etc., according to specifications [22]. The test setup used for these measurements is similar to that described in connection with previous fixed-function chips reported by the designers [6]. In parallel to this, authors from the Hungarian Academy of Sciences have undertaken the system-level characterization and have tested the features of task sequencing and algorithmic control using stored instructions. A CNN chip prototyping system has been built for this purpose, and a number of programs have been developed for different applications [23], two of which are illustrated below.
A. Motion Analysis
The task here is to detect the motion of an object within a specific range of direction and speed. Processing is based on the comparison of two images acquired at different time instants and relies on a sequential, bifurcated data-path algorithm, as shown in Fig. 14, where real images acquired by the chip through its optical interface are employed. The top row shows a sequence of five input images. Each new input image is processed in sequential steps by applying different CNN templates or logic operations from the instructions memory. The first step consists of a CNN diffusion-type filtering (row B). The second step is the CNN mathematical morphology function of erosion (row C). The role of these steps is to eliminate the noise and to obtain a spot approximately in the center of each traced object. A reference image is then generated from this spot by shifting it in accordance with the nominal speed vector and enlarging it according to the tolerance (row E), both using analog CNN processing. The motion is detected by logic comparison of the reference image with the spot obtained after erosion of the next image. If both fit, the motion is in the expected direction and range, and the latter image is restored using a final CNN instruction.
As can be seen from the above description, this application makes extensive use of all the capabilities of the chip, including the electrical programmability of CNN coefficients, the internal image memories, the programmable local Boolean operator, the internal instructions (CNN templates) memory, and the optical interface.
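The sequence of steps in the motion-analysis algorithm can be sketched in software as follows. This is a rough functional analogue under stated assumptions, not the CNN templates executed on the chip: the diffusion step is approximated by repeated 3x3 averaging, erosion and enlargement by 3x3 binary morphology, and the logic comparison by a pixelwise AND. All function names, the threshold, the zero boundary condition, and the synthetic test frames are hypothetical, and the final restoration step is omitted.

```python
import numpy as np

def neighborhood_stack(img, pad_value):
    """Stack the nine 3x3-neighborhood shifts of img along axis 0."""
    p = np.pad(img, 1, constant_values=pad_value)
    h, w = img.shape
    return np.stack([p[r:r + h, c:c + w] for r in range(3) for c in range(3)])

def diffuse(img, steps=2):
    """Approximate diffusion-type filtering by repeated 3x3 averaging."""
    out = img.astype(float)
    for _ in range(steps):
        out = neighborhood_stack(out, 0.0).mean(axis=0)
    return out

def erode(binary):
    return neighborhood_stack(binary, False).all(axis=0)

def dilate(binary):
    return neighborhood_stack(binary, False).any(axis=0)

def shift(binary, dy, dx):
    """Move the image content by (dy, dx) pixels, filling with False."""
    h, w = binary.shape
    m = max(abs(dy), abs(dx))
    p = np.pad(binary, m, constant_values=False)
    return p[m - dy:m - dy + h, m - dx:m - dx + w]

def spot(frame, threshold=0.5):
    """Filter, threshold, and erode a frame to get a spot per object."""
    return erode(diffuse(frame) > threshold)

def motion_matches(frame_prev, frame_next, velocity, tolerance=1):
    """True if an object moved by roughly `velocity` between the frames."""
    ref = shift(spot(frame_prev), *velocity)   # nominal displacement
    for _ in range(tolerance):                 # enlarge by the tolerance
        ref = dilate(ref)
    return bool((ref & spot(frame_next)).any())  # logic comparison

# Synthetic frames: a 5x5 object moving 3 pixels to the right.
prev = np.zeros((16, 16))
prev[4:9, 2:7] = 1.0
nxt = np.zeros((16, 16))
nxt[4:9, 5:10] = 1.0

moved_right = motion_matches(prev, nxt, velocity=(0, 3))
moved_left = motion_matches(prev, nxt, velocity=(0, -3))
```

With these synthetic frames the comparison succeeds for the rightward velocity and fails for the leftward one, which mirrors the direction-selective behavior described above.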
B. Texture Classification
In this experiment, different Brodatz textures [24] are optically projected on the chip. The textures, which are kept in smooth motion, are repeatedly sampled at different positions with a resolution of about 0.2 mm/pixel [25]. This configuration ensures random sampling under changing illumination and noise. We use a genetic real-time learning algorithm to determine the optimal templates of the network [26]-[28]. However, instead of using a simulator, the chip itself is used for the detection/learning process. In the learning phase, five texture-pictures/texture types are considered for learning (memorized images). For the real-time test, several thousand images scanned by the on-chip sensors are used.
The ratio of black pixels (ROB) is about 0.49 for the input images; its small variation is independent of the texture class. Fig. 15 shows four examples of the original, the scanned-thresholded input, and the output images. The texture set consists of samples taken from [24], in particular French canvas (21), two Straw cloths (52) and (53), and Straw matting (55). As a result of the processing, the ROB at the output is significantly different for the four texture classes. The measured ROB distributions for the different textures are plotted in Fig. 16. Each curve is measured over about 4000 test samples. The histograms of the different texture classes can be separated well from one another. Here the histogram is the measured probability density function of the ROB values for a given texture. Measured as the average difference between neighboring histogram peaks of the different texture classes, the contrast is better than 5% for four textures, while the in-class consistency (the curve width around a peak) is good. Considering these probability densities, the Bayesian error for the recognition of the four textures is about 4-5%.
Fig. 16. Classification statistics of the chip measurements, testing the ROB values for four different texture classes. Histograms (measured probability density) were generated by moving the textures over the scanning window of the chip, sampling at 4000 different locations for each texture, and measuring the ROB of the output.
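As an illustration of how a Bayesian error figure can be derived from measured ROB histograms, the following sketch estimates it for equally likely classes. It is a generic textbook estimate, not the procedure used for the measurements above; the function names, the bin count, the assumption of equal priors, and the toy sample values are all hypothetical.

```python
import numpy as np

def rob(binary_image):
    """Ratio of black pixels, taking black pixels to be the nonzero ones."""
    return float(np.mean(np.asarray(binary_image) != 0))

def bayes_error(samples_by_class, bins=50):
    """Estimate the Bayes error from per-class ROB samples.

    Assumes equal class priors (hypothetical). Each class histogram is
    normalized to unit mass; the optimal rule assigns every bin to the
    class with the largest mass there, and the error is what is left.
    """
    edges = np.linspace(0.0, 1.0, bins + 1)
    hists = np.array([
        np.histogram(s, bins=edges)[0] / len(s) for s in samples_by_class
    ])
    n_classes = len(samples_by_class)
    return 1.0 - hists.max(axis=0).sum() / n_classes

# Toy data: two perfectly separated classes and two identical ones.
separated = bayes_error([[0.11, 0.13], [0.87, 0.89]])
identical = bayes_error([[0.31, 0.33], [0.31, 0.33]])
```

Perfectly separated histograms give an error of zero, while identical histograms for two classes give 0.5, the chance level; overlapping histograms such as those of Fig. 16 fall in between.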
In a robustness experiment, the textures were defocused and rotated into random positions. The result of our texture detector is nearly independent of the texture rotation and sheet tilt over a wide range. In addition, we found that different chips calculated the same result. We tested other texture sets with similar results. In the above experiments, the Bayes misclassification error is about 4-5% for four textures, 2-4% for three textures, and 0-4% for two textures. In another experiment, we trained the chip to detect different positions of textures containing nearly straight sections. The template was trained for one texture, but it can detect any other if the periods are in a similar range.
In these texture detection experiments only one template is used. More precise recognition can be achieved with a multitemplate process exploiting the template and image storage capabilities of the chip. In the new version of our training program, no stored images are used for the training. The whole learning process is done in real-time using the chip itself. The training period is only a few minutes for 3-5 textures.
VIII. SUMMARY
We have briefly described the design of a fully programmable vision chip with a wide range of potential applications. The processing function is based on the cellular neural network paradigm, while the photo-transduction relies on vertical BJT's available in standard CMOS technologies and includes automatic contrast-enhancement circuitry. Additional features such as internal image memories, algorithmic control, and programmable logic operators provide high versatility for artificial-vision applications of simple and medium complexity. Although the internal operation is fundamentally analog, the interface of the prototype is completely digital, making it directly controllable by conventional computing devices. The prototype has been manufactured in a standard 0.8-μm CMOS technology and successfully tested.
The main goal in the design of the analog cell circuitry has been to maximize the density while guaranteeing a reasonable degree of accuracy in the analog operations. The area-accuracy tradeoff involved [19] has been addressed through structural and parametric optimization [5]. The achieved cell density is 27.5 cells/mm², larger than that achieved in other programmable CNN implementations with similar analog accuracy levels and fewer on-chip functions [13].
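The area figures quoted in Section VII can be combined with this density in a short back-of-the-envelope calculation. The sketch below assumes that the cell density refers to the area of the cell array alone (53% of the chip), which the text does not state explicitly; the variable names are hypothetical, and the derived numbers are rough estimates rather than reported specifications.

```python
# Figures quoted in the text.
chip_area_mm2 = 30.0          # total prototype area
array_fraction = 0.53         # share of the chip taken by the cell array
cell_density_per_mm2 = 27.5   # achieved cell density
synapse_fraction = 0.45       # share of one cell taken by its ten synapses
synapses_per_cell = 10

# Derived estimates (assumption: density is defined over the array area).
array_area_mm2 = chip_area_mm2 * array_fraction
cell_count_estimate = cell_density_per_mm2 * array_area_mm2
cell_area_um2 = 1e6 / cell_density_per_mm2
synapse_area_um2 = cell_area_um2 * synapse_fraction / synapses_per_cell

# Consistency check: both area breakdowns in Section VII sum to 100%.
cell_breakdown = [45, 8, 15, 25, 7]   # synapses, integrator, memories, digital, optics
chip_breakdown = [53, 4, 35, 8]       # array, tuning/memories, routing, pads

print(f"array area      ~ {array_area_mm2:.1f} mm^2")
print(f"cell count      ~ {cell_count_estimate:.0f}")
print(f"cell area       ~ {cell_area_um2:.0f} um^2")
print(f"area per synapse ~ {synapse_area_um2:.0f} um^2")
```

Under this assumption the array holds on the order of a few hundred cells, each with an area of a few tens of thousands of square micrometers.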
