Abstract
Introduction
Vision is a computationally demanding activity that involves many tasks and data types. Tasks can be clustered hierarchically, starting with low-level image processing (basically 2-D spatial filters), passing through feature estimation (segmentation of textures, motion, etc.), and ending with image/object analysis (image classification, identification, 3-D reconstruction, etc.). Low-level tasks consist of simple operations executed pixel-wise on a very large data set. These low-level tasks do not require floating-point accuracy. However, operating on images involves intensive memory accesses and places hard constraints on the bandwidth of the communication between memory and processor. Besides, using the front-end chip of a vision system just for sensing (as an imager) makes it necessary to process the whole data set and requires high-speed data transfers. Though reaching 30 FPS is certainly not impossible, even for large-resolution images, high-speed industrial applications requiring a higher frame rate may become unfeasible.
The relaxed accuracy requirement of low-level vision tasks gives analog processing blocks an opportunity to compete with digital processors. The area and energy efficiency of moderate-accuracy analog circuits makes them suitable for the implementation of focal-plane processing. Thus, during the last few years, different analog and mixed-signal solutions [1]-[4] have been proposed to combine sensing and low-level image processing at high speed in a single chip.
The chip presented here outperforms all previous ones in terms of complexity, computational capability, and performance. It is a mixed-signal vision sensor/processor chip conceived as a general-purpose, programmable device. First, it can be used as the core element in a vision system. It can acquire images, process them according to a user-defined program (which may contain image combinations, data bifurcations, and conditional executions), and finally deliver the result as an 8-bit digital image. In this approach, resolution is limited by the chip size, 128x128, but processing tasks run at the maximum speed attainable by the chip, around 4 µs for convolutions.
Its second use is as an image co-processor. In this case, higher-resolution images are captured by a conventional imager, split into chip-sized windows by a controller, and transmitted to the chip in 8-bit format for processing. This approach unavoidably leads to lower frame rates, but it also opens the door to low-frame-rate, high-resolution applications.
Chip Description

Global Overview
The floorplan and microphotograph of the chip reported in this paper are shown in Fig. 1. The array of processing elements (PEs) constitutes the core of the system, the computing engine. As shown in Fig. 1, it is surrounded by the circuitry employed for addressing, I/O, timing, etc. and, most importantly, for the storage and decoding of the user-selectable instructions that the computing engine executes to realize involved vision processing algorithms. While the internal operation of the chip is basically analog, the external interface is fully digital.
The chip has been designed in a purely digital 0.35 µm 5M-1P CMOS technology. It contains more than 3.75 million transistors (85% of them working in analog mode) and provides peak computing figures of 330 GOPS, 3.6 GOPS/mm² and 82.5 GOPS/W.
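As a quick consistency check, the three peak figures can be combined to see which area and power they imply. The sketch below is plain arithmetic on the numbers quoted above; the derived area and power are not stated in the text, and which area and power the ratios were normalized to is an assumption.

```python
# Peak figures quoted in the text.
PEAK_GOPS = 330.0
GOPS_PER_MM2 = 3.6
GOPS_PER_W = 82.5
ARRAY_SIZE = 128 * 128  # number of processing elements

# Quantities implied by the ratios (derived here, not stated in the text).
implied_area_mm2 = PEAK_GOPS / GOPS_PER_MM2
implied_power_w = PEAK_GOPS / GOPS_PER_W

# Average peak throughput per processing element, in MOPS.
mops_per_pe = PEAK_GOPS * 1e3 / ARRAY_SIZE

print(f"implied area:  {implied_area_mm2:.1f} mm^2")
print(f"implied power: {implied_power_w:.1f} W")
print(f"per PE:        {mops_per_pe:.1f} MOPS")
```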
Program Memory
The program memory is composed of simple SRAM blocks and 8-bit D/A converters. This memory serves two purposes. A first sector of the memory is used to store the machine code of the algorithm to be implemented. Every instruction in this memory consists of a 64-bit digital word which defines the state of the different switches within the PEs and configures the digital I/O port. A second sector of the memory is employed to store (in 8-bit format) 32 sets of 19 analog coefficients, needed to define the internal analog references and the coefficients of the convolution masks to be applied. Since the internal building blocks are analog, the outputs of this second sector of the memory are connected to the input port of a bank of 19 8-bit D/A converters (resistor ladder plus multiplexer [6]), which drives a spatially distributed bank of buffers [7] to distribute the analog voltages to the PEs.
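The size of the coefficient sector and the behavior of one ladder-plus-multiplexer converter can be sketched as follows. This is a minimal illustration, assuming an ideal converter; the function names and the reference voltages `v_low` and `v_high` are hypothetical and do not come from the chip documentation.

```python
INSTRUCTION_BITS = 64   # width of one instruction word (from the text)
COEFF_SETS = 32         # number of stored coefficient sets (from the text)
COEFFS_PER_SET = 19     # analog coefficients per set (from the text)
COEFF_BITS = 8          # resolution of each stored coefficient (from the text)


def coefficient_sector_bits() -> int:
    """Total storage needed by the second sector of the program memory."""
    return COEFF_SETS * COEFFS_PER_SET * COEFF_BITS


def dac_output(code: int, v_low: float, v_high: float) -> float:
    """Ideal resistor-ladder DAC: the multiplexer selects tap `code` of 256.

    v_low and v_high are hypothetical reference voltages (assumption).
    """
    if not 0 <= code < 2 ** COEFF_BITS:
        raise ValueError("code must fit in 8 bits")
    return v_low + (v_high - v_low) * code / 2 ** COEFF_BITS


print(coefficient_sector_bits(), "bits =", coefficient_sector_bits() // 8, "bytes")
print(dac_output(128, 0.0, 1.0), "V at mid-scale for a 0-1 V ladder")
```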
I/O Port
Processed images are provided in digital format (using 8-bit coding) through a 32-bit bus. Communications between chip and host are carried out by means of simple handshaking protocols. The chip also accepts images through this bus (in the same format) to cover the cases in which a higher-resolution imager already exists in the system.
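Since pixels are coded with 8 bits and the bus is 32 bits wide, each bus word can carry four pixels. The sketch below illustrates one possible packing; the byte order within a word (first pixel in the least significant byte) and the function names are assumptions, as the text does not specify them.

```python
PIXELS_PER_WORD = 32 // 8  # four 8-bit pixels fit in one 32-bit bus word


def pack_pixels(pixels):
    """Pack 8-bit pixels into 32-bit words (first pixel in the lowest byte, assumed)."""
    if len(pixels) % PIXELS_PER_WORD:
        raise ValueError("pixel count must be a multiple of 4")
    words = []
    for i in range(0, len(pixels), PIXELS_PER_WORD):
        word = 0
        for k, pixel in enumerate(pixels[i:i + PIXELS_PER_WORD]):
            if not 0 <= pixel <= 0xFF:
                raise ValueError("pixel does not fit in 8 bits")
            word |= pixel << (8 * k)
        words.append(word)
    return words


def unpack_words(words):
    """Inverse of pack_pixels: recover the 8-bit pixels from 32-bit words."""
    return [(w >> (8 * k)) & 0xFF for w in words for k in range(PIXELS_PER_WORD)]


# Number of bus words needed for one full 128 x 128 image.
WORDS_PER_FRAME = 128 * 128 // PIXELS_PER_WORD
```

With this packing, one complete frame corresponds to 4096 bus transfers.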
The digital port consists of a bank of 128 (one per col-
Processing Unit
Fig. 2(a) shows the block diagram of the PE [4]. Arrows indicate how information flows. It contains 8 fundamental mixed-signal building blocks that communicate with each other through a global wire.
In addition to the processing kernel for running 2-D
Spatial Processing Kernel
Each PE updates its state driven by the cells located within its neighborhood. A bank of analog multipliers is used to implement these interactions. These analog multipliers, designed using a single-transistor technique [4], are driven by voltages at both inputs (the signal input and the scaling input) and provide a current at the output. The bank of multipliers, depicted at the conceptual level in Fig. 2(b), is driven by three different pixel values, $P_A$, $P_B$ and $P_C$, so that the current which flows into the PE is expressed as

$$I_{in}^{*} = A * P_A + b \cdot P_B + c \cdot P_C + z + I_{off} \tag{1}$$

where $A$ and $P_A$ are matrices ($A$ being the convolution mask and $P_A$ the corresponding pixel values), the operator $(*)$ accounts for the convolution product of those matrices, and $I_{off}$ is a spurious offset term produced by the one-transistor multiplier [4].
The currents generated by the multipliers are collected by the input block of the PE, also shown in Fig. 2(c), and are sent to a very simple current processing block. The offset term generated by the multipliers is subtracted by using a high-accuracy current memory block based on an S3I memorization scheme (see Fig. 2(c)). Afterwards, the actual signal current,

$$I_{in} = A * P_A + b \cdot P_B + c \cdot P_C + z \tag{3}$$

can be either directly steered to the global node or sent to the input of a current comparator, whose output connects to the global wire through a switch (see Fig. 2(c)).
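A behavioral sketch of Eq. (3) for a single PE may help to fix ideas. It is only a numerical model under stated assumptions: the convolution product is taken as a weighted sum over the 3 x 3 neighborhood without flipping the mask, pixels beyond the array border count as zero, and all function names are hypothetical.

```python
def neighborhood_sum(mask, image, row, col):
    """Weighted sum of the 3x3 neighborhood of (row, col).

    Pixels outside the array count as zero (assumed boundary condition).
    """
    rows, cols = len(image), len(image[0])
    total = 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if 0 <= r < rows and 0 <= c < cols:
                total += mask[dr + 1][dc + 1] * image[r][c]
    return total


def signal_current(mask, p_a, p_b, p_c, b, c, z, row, col):
    """Eq. (3): signal current of one PE after offset cancellation."""
    return (neighborhood_sum(mask, p_a, row, col)
            + b * p_b[row][col] + c * p_c[row][col] + z)


def multiplier_current(mask, p_a, p_b, p_c, b, c, z, i_off, row, col):
    """Raw multiplier output: the signal current plus the spurious offset."""
    return signal_current(mask, p_a, p_b, p_c, b, c, z, row, col) + i_off


# Tiny example: all-ones mask on a 3x3 image, b = -1, c = z = 0.
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
zeros = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
print(signal_current(ones, img, img, zeros, -1, 0, 0, 1, 1))
```

Subtracting a memorized copy of the offset from `multiplier_current` recovers `signal_current`, which is the role the text assigns to the current memory block.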
Two different situations may occur depending on whether this latter switch is ON or OFF:
* If the switch is ON, the voltage delivered to the global wire corresponds to the sign of $I_{in}$, i.e. to the sign of the convolution operation,

$$\operatorname{sign}(A * P_A + b \cdot P_B + c \cdot P_C + z) \tag{4}$$

In this case the output is a black-and-white pixel.
* If the switch is OFF, the analog current $I_{in}$ is routed to one of the capacitors associated with the pixels, and the output is a gray-scale pixel value.
In the latter case, the specific capacitor to which $I_{in}$ is routed is selected by the user through the activation of some bits in the currently selected instruction. By so doing, the evolution of each PE is described by a state equation whose actual expression depends on the selected integrating capacitor. Therefore, different kinds of processing kernels are available. For instance, to run a Sobel operator, the convolution matrix is defined in $A$; the image is loaded into $P_A$; the following values are set: $c = z = 0$ and $b = -1$; and the input current is routed to the corresponding integrating capacitor. With these settings, Eq. (3) reduces to $I_{in} = A * P_A - P_B$, and the integration of this current on the selected capacitor defines the state equation of each PE.
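The two output modes can be illustrated with a small behavioral model. The comparator branch follows Eq. (4) directly. The integrating branch is only one plausible reading: it assumes that the selected capacitor holds $P_B$ and that the PE behaves as a first-order system with a hypothetical time constant `tau`, neither of which is stated in the text. All names are illustrative.

```python
def comparator_output(i_in: float) -> int:
    """Switch ON: binary pixel given by the sign of I_in (1 if positive; coding assumed)."""
    return 1 if i_in > 0 else 0


def settle(conv_value, p_b0=0.0, tau=0.2e-6, t_end=1e-6, steps=1000):
    """Switch OFF with b = -1, c = z = 0, under an assumed first-order model.

    Forward-Euler integration of  tau * dP_B/dt = conv_value - P_B,
    where conv_value stands for A * P_A at this pixel. tau is hypothetical.
    """
    dt = t_end / steps
    p_b = p_b0
    for _ in range(steps):
        p_b += dt / tau * (conv_value - p_b)
    return p_b


print(comparator_output(-0.3), comparator_output(0.7))
print(settle(1.0))  # approaches conv_value as the transient dies out
```

Under these assumptions the stored value converges to the convolution result, which is consistent with kernels having to run for a fixed time before their transients settle.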
Optical Input
The optical input module, shown in Fig. 3 [8], consists of a multimode sensor in which both the physical device used as sensor and the transduction mechanism are programmable. A P-diff/N-well diode, an N-well/P-subs diode, or a P-diff/N-well/P-subs phototransistor is available. Selection is done by some bits of the 64-bit vector which defines the state of the chip. Furthermore, the phototransduction scheme is also programmable. Both linear integration of the photogenerated current and logarithmic compression are available through proper definition of the digital instructions controlling the operation of the chip.
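The difference between the two phototransduction schemes can be sketched with textbook first-order models. These are not the circuit equations of the chip: the integration capacitance, the reference current, the slope factor `n`, and the function names are all hypothetical values chosen for illustration.

```python
import math


def linear_integration(i_photo: float, t_int: float, c_int: float) -> float:
    """Voltage swing from integrating a constant photocurrent: dV = I * t / C."""
    return i_photo * t_int / c_int


def log_compression(i_photo: float, i_ref: float, n: float = 1.0,
                    v_t: float = 0.02585) -> float:
    """Generic logarithmic response: V = n * V_T * ln(I / I_ref).

    i_ref and n are hypothetical; v_t is the thermal voltage near 300 K.
    """
    return n * v_t * math.log(i_photo / i_ref)


# Linear mode: the output scales with the photocurrent.
print(linear_integration(1e-12, 1e-3, 1e-14), linear_integration(1e-11, 1e-3, 1e-14))
# Logarithmic mode: each decade of photocurrent adds the same voltage step.
print(log_compression(1e-11, 1e-12) - log_compression(1e-12, 1e-12))
```

In this model a tenfold increase in light multiplies the linear output by ten but only adds a fixed step to the logarithmic one, which is why the latter suits scenes with a wide dynamic range.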
Experimental Results
The chip is being tested using a dedicated hardware/software environment [4]. The hardware part of the system contains various layers of boards of the same size, which allows connecting them simply by stacking. The first layer, shown in Fig. 4, hosts the chip, while the others are intended to provide the program, power, and data, and to accommodate inputs/outputs to/from the chip from/to the PCI bus.

Fig. 4. Development Platform
Optical Input
Though the chip allows the selection of seven sensing modes, tests have shown that employing the well-substrate diode as the light-sensitive device provides the highest sensitivity. Besides, although having such a complex optical sensor [8] might have led to undesirably large fixed-pattern noise (FPN) figures, this has not been the case in practice. First of all, we designed both the schematic and the layout of the readout buffer paying special attention to mismatches. To that purpose, we employed PMOS transistors in the differential pair and connected source and bulk (hot well) to avoid mismatches due to the body factor, which introduces a signal-dependent spatial error and cannot be compensated by conventional CDS [9]. Moreover, thanks to the on-chip image memories, we can store and subtract the offset of the readout buffer at the PE level. Finally, the successive approximation A/D conversion algorithm includes some steps to calibrate the offset of the comparators within the converter, thus eliminating most of the column-wise FPN contribution.

Fig. 5. Fixed-Pattern Noise in the Chip
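To illustrate why a comparator offset matters in a successive approximation converter, the following sketch models a generic 8-bit conversion. It is not the calibration procedure of the chip, which the text does not detail: the offset model, the zero-input calibration step, and the names are assumptions.

```python
def sar_convert(v_in: float, v_ref: float, bits: int = 8,
                comparator_offset: float = 0.0) -> int:
    """Generic successive approximation: decide one bit per step, MSB first."""
    code = 0
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)
        v_dac = v_ref * trial / (1 << bits)
        # The comparator sees the input shifted by its own offset (assumed model).
        if v_in + comparator_offset >= v_dac:
            code = trial
    return code


V_REF = 1.0
OFFSET = 0.02  # hypothetical comparator offset, in volts

raw = sar_convert(0.5, V_REF, comparator_offset=OFFSET)
# Hypothetical calibration: convert a zero input and subtract the resulting code.
offset_code = sar_convert(0.0, V_REF, comparator_offset=OFFSET)
print(raw, offset_code, raw - offset_code)
```

Because every pixel of a column passes through the same comparator, an uncorrected offset would appear as a column-wise pattern in the image.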
Fig. 5 shows the FPN measured in one of the samples of the chip and its statistical distribution. Here, the cancellation of the FPN due to the offset of the comparators in the column was enabled, but that due to the offset of the readout buffers in each processing element was disabled, to isolate the performance of the sensor from that of the A/D converter. The standard deviation is 5.2 LSB, about 16 mV. Fig. 6 shows the result of capturing an image by employing the well/substrate diode with an exposure time of 1 ms when local cancellation of the FPN is enabled. This image was captured using the illumination from a 60 W bulb at 30 cm. Accurate optical characterization is currently under way.
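The FPN figure can be put in context with a short calculation. The sketch uses only the two quoted numbers; since the voltage is given as approximate, the derived LSB size and full-scale range are rough estimates and are not stated in the text.

```python
FPN_SIGMA_LSB = 5.2   # standard deviation in LSB (from the text)
FPN_SIGMA_MV = 16.0   # the same figure in millivolts (approximate, from the text)
LEVELS = 2 ** 8       # 8-bit coding

lsb_mv = FPN_SIGMA_MV / FPN_SIGMA_LSB          # implied size of one LSB
full_scale_mv = lsb_mv * LEVELS                # implied 8-bit signal range
fpn_percent = 100.0 * FPN_SIGMA_LSB / LEVELS   # FPN relative to full scale

print(f"LSB ~ {lsb_mv:.2f} mV, full scale ~ {full_scale_mv:.0f} mV, "
      f"FPN ~ {fpn_percent:.1f} % of full scale")
```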
Contents of the analog register and the different calibration memories do not show any apparent degradation due to the use of the optical input (8-bit accuracy remains), meaning that the chip can be processing or downloading an image at the same time as it is capturing the next frame (it works in a pipelined fashion).
Image Processing Examples
Fig. 7 shows the experimental results of implementing two well-known 3 x 3 image processing kernels, namely a low-pass filter, Fig. 7(b), and a horizontal Sobel filter, Fig. 7(c) [5]. These kernels must run for 1 µs before the transients settle. Before running a kernel, different calibrations are optionally executed; when all of them are executed, the processing time increases up to 4 µs. Outputs can be stored within the processing element in any of its local pixel memories without effective degradation (keeping the 8-bit accuracy) for 1 ms. Moreover, this accuracy is also kept after eight internal or external readouts, which is more than enough for standard algorithms. Image downloading takes 135 µs, meaning that the I/O port runs at 121.36 MBytes/s.
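The quoted I/O rate follows directly from the image size and the download time, as the short check below shows. It assumes that one pixel occupies one byte and that a megabyte means 10^6 bytes, which is the convention that reproduces the quoted figure.

```python
IMAGE_BYTES = 128 * 128     # one 8-bit pixel per byte
DOWNLOAD_TIME_S = 135e-6    # image download time from the text
BUS_BYTES = 4               # 32-bit bus

rate_mbytes_s = IMAGE_BYTES / DOWNLOAD_TIME_S / 1e6
words_per_second = IMAGE_BYTES / BUS_BYTES / DOWNLOAD_TIME_S

print(f"{rate_mbytes_s:.2f} MBytes/s")
print(f"{words_per_second / 1e6:.2f} million 32-bit words per second")
```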
Conclusions
Experimental evidence of the suitability of a recently designed 128x128 mixed-signal vision sensor/processor for solving real-time low-level vision problems has been shown throughout this paper. The chip can store different images on-chip and run user-defined 3 x 3 convolution masks. Input images and results can be arbitrarily combined on-chip by means of any linear operation or two-input Boolean function. Experimental results show that frame rates of 1000 FPS can be achieved under normal illumination conditions. Moreover, as the chip processes and downloads images during optical integration, almost 150 image filtering operations, additions, or Boolean combinations can be executed during the frame time.
The chip works from a 3.3 V power supply and provides peak computing figures of 330 GOPS, 3.6 GOPS/mm² and 82.5 GOPS/W, while achieving an equivalent accuracy of 8 bits. Output images are provided to the hosting system in digital format at a maximum rate of 121.36 MBytes/s.
