This paper presents a systematic approach to the design of CMOS chips with concurrent picture acquisition and processing capabilities. These chips consist of regular arrangements of elemen- technology. In addition to the sensory and processing circuitry, both chips incorporate light-adaptation circuitry for automatic contrast adjustment. They obtain smart-pixel densities up to 89 units/mm², with a power consumption down to 105µW/unit and image processing times below 2µs.
I. INTRODUCTION
Common architectures for image-processing systems use a front-end sensory plane with digital encoding of the pixel values, and serial transmission of these digital data for subsequent processing using either ASICs or general-purpose computers. Contrary to this approach, smart-pixel chips [3] incorporate an analog computing cell at each sensory point, achieving high speed and low area occupation in the combined sensory/processing functions by fully exploiting parallelism. The combined spatial distribution of sensory and processing circuitry eliminates the time required for data transmission from the sensory to the processing plane during the image acquisition process. In addition, in some image-processing applications, the relevant information contained in the output image can be described by a reduced number of variables, allowing fast downloading of the results for subsequent evaluation.
CMOS technologies offer unique features for the design of smart-pixel chips. On one hand, MOS transistor operation under normal biasing in strong inversion is not drastically affected by incident light; on the other, photosensitive CMOS devices can be built by exploiting the many junction devices available in CMOS technologies [4]. However, previous approaches to CMOS design of smart-pixel chips lack generality, as they rely on implementation methods suitable for specific applications. In some cases, the processing task performed at each pixel does not imply collective computation [3], while most of the approaches for "pixel-smartness" are based on active implementations of resistive-grid networks [5], [6].
The paradigm of Cellular Neural Networks (CNN) [1], [2] is a very suitable framework for systematic design of parallel sensory-processing chips. On one hand, CNNs consist of regular arrangements of cells --topologically identical to smart-pixel chips. On the other, their cells are only locally connected, and thus require simple routing. Also, the vast body of literature on CNN theory and applications demonstrates outstanding features of this paradigm for array-processing [7]. In particular, resistive grids have recently been demonstrated as a particular CNN class [8].
No experimental smart-pixel CNN chips have been reported to date. This paper outlines a design approach using Darlington phototransistors and current-mode processing circuitry. It is based on a modified version of the original CNN model which enables optimum speed/power and area occupation in VLSI design [9], [10]. The sensors include automatic adjustment circuitry that ensures proper behavior under different illumination conditions. Our proposals are demonstrated via two working smart-pixel chips, in a single-poly, 1.6µm n-well CMOS technology. In addition to their optical input, these chips exhibit much better area and speed/power figures than previous CNN implementations [11], [12].
Section II describes some general aspects of smart-pixel chips, and Section III outlines the proposed computation algorithm. Sections IV and V discuss the sensory and processing circuitry, respectively, and the experimental prototypes are described in Section VI.
II. SMART-PIXEL CHIPS
In this paper, pixel denotes the elementary sensory unit used to detect pointwise light signals. These sensory units are realized in CMOS technology using any compatible junction device to generate a current whose value is an increasing function of the light intensity [3], [13], [14]. The acquisition of two-dimensional scenes requires pixels arranged on regular grids, as shown in Fig. 1. Each pixel in this sensory plane generates a current I_c which codifies a corresponding point of the input image, where the index c ≡ (i, j) indicates the pixel at the i-th row and j-th column on the grid and varies over the whole grid domain GD (c ∈ GD). Thus, the whole image is captured into a matrix of currents [I_c]. Smart-pixel chips are of strong practical interest for pattern recognition problems, to detect features of the input image. For example, Fig. 2 illustrates the task of detection of connected components (DCC), which consists of counting the number of connected pieces encountered by scanning an input image in a given direction [15]. Pattern recognition can be realized by processing the data obtained after performing this task in the directions shown in Fig. 2 [16], [17]. This data is contained in a few rows and columns at the grid borders. In addition to their usage for preprocessing tasks, smart-pixel chips are also useful as stand-alone units for nonintensive computation tasks such as halftoning [18], motion detection [19], [20], [21], range-finding [3], etc.
III. THE CNN PARALLEL PROCESSING PARADIGM
As Fig. 1 Fig. 3(a); and (c) cell external-input: u_c. Processing itself is governed by a set of coupled nonlinear differential equations, one per cell. We use equations that differ from those originally proposed by Chua and Yang [1], and which enable optimizing the speed/power ratio and area occupation of VLSI CNN chips. The proposed equations are given by [9], [10], (2) where g(•) is a nonlinear dissipative term defined as, (3) where m > 1 is a parameter of the model. Function g(•) is drawn in Fig. 3(b). Summations in (2) extend over the neighborhood of the c-th cell, denoted by N_r(c), which contains adjacent cells located within a distance r in the grid, and includes cell c itself.
Processing tasks performed by CNNs are determined by the convergence of (2) to binary 
and under the boundary conditions imposed by cells at the net border. Depending on the application, the current I_c generated at each cell's photosensor is used as initial value of the state variable x_c(0) or as external input u_c. In the latter case, the initial states are usually set to a constant value. The outcome of the task depends on parameters B_cd, A_cd, and D_c of (2), called control, feedback, and offset parameters, respectively, and on the boundary conditions. The control and feedback parameters can be arranged into matrices, which provide a pictorial view of the interactions within each cell's neighborhood. For uniform networks these matrices are invariant throughout the grid domain --they are templates. The functionality of uniform CNNs is determined by their control, B, and feedback, A, template matrices, and their offset parameter, D. For illustration purposes, Table I summarizes the templates used for some significant preprocessing tasks.
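The role of the templates can be illustrated with a small numerical sketch. Since the printed forms of (2) and (3) are not reproduced here, the code assumes a commonly used form, τ dx_c/dt = −g(x_c) + D + Σ (A_cd y_d + B_cd u_d), with a saturating output y = clip(x, −1, 1) and a loss term g of unit slope inside [−1, 1] and slope m outside. The zero boundary condition, the forward-Euler integration, and the self-feedback template used in the demo are illustrative assumptions, not the templates of Table I.

```python
import numpy as np

def f_out(x):
    """Assumed piecewise-linear output: y = clip(x, -1, 1)."""
    return np.clip(x, -1.0, 1.0)

def g_loss(x, m=10.0):
    """Assumed dissipative term: slope 1 inside [-1, 1], slope m > 1 outside."""
    return np.where(x > 1, m * (x - 1) + 1,
                    np.where(x < -1, m * (x + 1) - 1, x))

def neighborhood_sum(template, field, boundary=0.0):
    """Weighted sum over the r = 1 neighborhood of every cell (3x3 template)."""
    padded = np.pad(field, 1, constant_values=boundary)  # assumed fixed boundary cells
    rows, cols = field.shape
    total = np.zeros((rows, cols))
    for k in range(3):
        for l in range(3):
            total += template[k, l] * padded[k:k + rows, l:l + cols]
    return total

def run_cnn(x0, u, A, B, D, tau=1.0, dt=0.01, steps=3000, m=10.0):
    """Forward-Euler integration of the assumed cell equations."""
    x = np.array(x0, dtype=float)
    drive = neighborhood_sum(B, u) + D      # input contribution is constant in time
    for _ in range(steps):
        dx = -g_loss(x, m) + neighborhood_sum(A, f_out(x)) + drive
        x += (dt / tau) * dx
    return x, f_out(x)

# Hypothetical demo: self-feedback only (A center = 2), image loaded as initial state.
x0_demo = 0.1 * np.array([[1.0, -1.0, 1.0], [-1.0, 1.0, -1.0]])
A_demo = np.zeros((3, 3))
A_demo[1, 1] = 2.0
x_final, y_final = run_cnn(x0_demo, np.zeros_like(x0_demo), A_demo, np.zeros((3, 3)), 0.0)
```

With this hypothetical template each cell simply amplifies the sign of its initial state until the output saturates, while the steep loss term keeps the state close to the unit interval. That bounded state swing is the property the modified model relies on.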
To guarantee correct operation of smart-pixel CNN chips, an important mathematical issue is to determine conditions on the template parameters that yield convergence of the output matrix [y_c] to binary states for any input. Such a mathematical analysis for the model proposed in this paper, given by (2) and (3), is out of this paper's scope and has been reported elsewhere 
IV. SENSORY CIRCUITRY

A. Photosensors
The simplest photosensitive devices for CMOS n-well technologies are reverse-biased photodiodes, formed either directly between n+-diffusion and substrate [3] or between well and substrate [13]. Current level for both devices is an increasing function of the junction area. In particular, we have measured currents up to 20nA for well-substrate photodiodes with well area of 100 × 100µm², in a 1.6µm single-poly technology, under environmental laboratory lighting.
This current level increases significantly using a vertical CMOS-BJT as photosensor. Fig. 4(a) shows a conceptual layout and cross-section for this device, whose current is approximately proportional to the area of the well/substrate junction, A_w in the figure. Current generated by this device is β+1 times larger than that of a photodiode with the same well area,
$$I_T = (\beta + 1)\, I_W \tag{4}$$
where I_T denotes the phototransistor current, I_W is the corresponding current for the well-diode, and β is the transistor current-gain; measured β for this technology is 37.7±0.8, basically independent of transistor geometry [22].
We have measured currents up to 430±30nA (under normal laboratory illumination) for phototransistors with passivated well area of 60 × 60µm². Consequently, and since current and well area are approximately linearly related, this extrapolates to current levels of about 20nA for minimum area devices (13.6 × 13.6µm²) --needed for increased density smart-pixel chips. However, for some critical tasks [7] the levels provided by minimum photosensors may not be large enough to guarantee the matching level required by the signal processing circuitry, thus requiring some amplification. The simplest strategies use either larger wells or cascaded current amplifiers --very costly in terms of area occupation and, for the latter, inaccurate. Instead, we use an additional vertical BJT to achieve Darlington amplification by a factor of β+1, with practically no area overhead. Fig. 4(b) shows the conceptual layout and cross-section for this Darlington phototransistor. Current for this device is approximately (β+1) I_T = (β+1)² I_W, while its area occupation is scarcely increased by that of a minimum-size vertical BJT. Measurements with A_w in Fig. 4(b) equal to 60 × 60µm² result in currents up to 18±2µA. Fig. 5 shows the result obtained when the environment light is gradually reduced to complete darkness. Dark-current was 215±10pA, which means that the bright-to-dark current-range is close to 100dB for environmental laboratory illumination. The same range is observed for a single p-n-p device, while simple photodiodes yield about 80dB. Although these results are optimistic in the sense that in real images there will be no completely dark areas, the bright-to-dark current-ratios measured provide a wide enough range for data acquisition. The amplification of the Darlington structure provides a sufficient current level even if device area is substantially decreased.
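The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the measured values given in the text; the linear area scaling, the (β+1) Darlington factor, and the 20·log10 convention for the current range in dB are assumptions inferred from the surrounding discussion, and the variable names are illustrative.

```python
import math

BETA = 37.7                 # measured BJT current gain
I_PT_60 = 430e-9            # phototransistor current, 60 x 60 um^2 well (A)
I_DARLINGTON_MEAS = 18e-6   # measured Darlington current, same well area (A)
I_DARK = 215e-12            # measured dark current (A)

def scale_with_area(i_ref, side_ref_um, side_new_um):
    """Scale a photocurrent linearly with well area (square wells assumed)."""
    return i_ref * (side_new_um ** 2) / (side_ref_um ** 2)

# Minimum-size phototransistor, 13.6 x 13.6 um^2
i_min_device = scale_with_area(I_PT_60, 60.0, 13.6)

# Darlington current predicted from the single phototransistor measurement
i_darlington_pred = (BETA + 1) * I_PT_60

# Bright-to-dark current range in dB (20*log10 convention assumed)
range_db = 20 * math.log10(I_DARLINGTON_MEAS / I_DARK)

print(f"minimum-size phototransistor: {i_min_device * 1e9:.1f} nA")
print(f"predicted Darlington current: {i_darlington_pred * 1e6:.1f} uA")
print(f"bright-to-dark range:         {range_db:.1f} dB")
```

Under these assumptions the extrapolated minimum-device current comes out near the quoted 20nA, the predicted Darlington current falls within the measured 18±2µA, and the range is just under 100dB.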
B. Autozero Strategy
Although photosensors produce unidirectional current flow, double-rail signals are easily obtained by bias-shifting, as shown in Fig. 6(a). Current source I_TH sets the zero-level of the double-rail signal. To guarantee good contrast, its value should be set somewhere between the maximum and the minimum light-induced currents among all photosensors. If lighting conditions for all possible input scenes are uniform and known a priori, I_TH can be set to a fixed value. In a more general case where the chip must handle scenes with different lighting conditions, some kind of auto-zero strategy must be devised to generate I_TH approximately equal to the average of the photosensor currents over the whole array.
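The benefit of an averaging threshold over a fixed one can be seen in a small numerical sketch. The photocurrent values, the fixed threshold, and the function names below are hypothetical; the only behavior taken from the text is that the double-rail signal is the photocurrent shifted by I_TH, and that the auto-zero threshold approximates the array average.

```python
def auto_threshold(currents):
    """Idealized auto-zero threshold: the mean photocurrent over the array."""
    return sum(currents) / len(currents)

def double_rail(currents, i_th):
    """Bias-shifted (signed) signals: photocurrent minus threshold."""
    return [i - i_th for i in currents]

def polarity(signals):
    """+1 for cells brighter than the threshold, -1 otherwise."""
    return [1 if s > 0 else -1 for s in signals]

# Hypothetical photocurrents in nA for one scene, and the same scene at 10% illumination
bright_scene = [5.0, 20.0, 12.0, 3.0]
dim_scene = [0.1 * i for i in bright_scene]
FIXED_I_TH = 10.0  # nA, chosen for the bright scene only

fixed_bright = polarity(double_rail(bright_scene, FIXED_I_TH))
fixed_dim = polarity(double_rail(dim_scene, FIXED_I_TH))
auto_bright = polarity(double_rail(bright_scene, auto_threshold(bright_scene)))
auto_dim = polarity(double_rail(dim_scene, auto_threshold(dim_scene)))
```

In this example the fixed threshold maps the whole dim scene to one polarity, so all contrast is lost, whereas the averaging threshold yields the same polarity pattern at both illumination levels.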
A simple, yet convenient, auto-zero strategy uses four extra transistors at each sensor. Fig. 6(b) shows the schematic of a sensor including the auto-zero circuitry. All p-channel transistors have equal size; the same applies to n-channel transistors. The low-impedance node labelled SUM is a global node, common to all pixels. Note that the current I_c at the c-th photosensor is replicated twice. One of the replicas interfaces the processing circuitry, while the other is routed to the global node SUM, and added to the remaining sensor currents. Thus, calculation of the current I_TH through transistor M_TH obtains the following, (6) where g_op is the output conductance of the p-channel transistor, g_mn is the transconductance of the n-channel transistor, and N the number of pixels. For simplicity (6) assumes equal transconductances and conductances for all pixels. The first factor in (6) reflects the current division performed at node SUM, while the second corresponds to the gain of the mirror formed by the parallel combination of the M_SUM transistors and M_TH. Assuming g_mn >> g_op, (6) gives I_TH equal to the average of the photosensor currents, and the light-threshold is automatically adjusted to the average illumination.
V. PROCESSING CIRCUITRY
Fig. 7 is a block diagram for the processing circuitry of the c-th unit in a smart-pixel CNN chip, according to (2). This figure shows a core integrator with nonlinear losses and an output structure to generate weighted replicas of the c-th input u_c and state x_c, for transmission to the neighbor smart-pixels. The integrator is driven by weighted replicas of the input and state signals of the smart pixels in the neighborhood N_r(c), plus an offset term, obtaining the following signal to drive the core integrator,
A. Basic Circuit Building Blocks
Current-mode provides a convenient choice to realize the processing circuitry of smart-pixel CNNs. On one hand, it enables direct interface with the sensors, whose outputs are currents. On the other, current summation at the integrator input node is directly achieved by routing signals to a common node. Finally, analog operators involved in Fig. 7 (weighted replication, integration, and limitation) are realized by simple current mirror circuits. Thus, analysis of this circuit results in,
as required to realize (2), and where g(•) is the function defined in (3) with m → ∞. In practice τ does not remain constant, but varies with input current level. However, most processing tasks tolerate this variation with no degradation of the network functionality [9].
B. Some Circuit Design Issues
The following is a brief comment of dominant nonidealities encountered in the practical implementation of smart-pixel CNN chips and associated circuits.
a) Current Gain Error:
A major source of error is the finite ratio of the input conductance g_in to the output conductance g_o of the current mirrors, which causes current gain error due to spurious current division. It is especially significant at the input node of the integrator, where the gain error ε is given approximately by [9],
where N denotes the number of mirrors driving this node --up to 18 for templates with no zero entries on a rectangular grid net with unity neighborhood parameter. For improved g_o/g_in figures with short channel devices, cascode mirrors, regulated mirrors, or a combination of both must be used [23]. In particular, analysis shows that the cascode mirror of Fig. 9(a) obtains values of g_o/g_in several orders of magnitude lower than that for single mirrors, with smaller area occupation. Chips reported in Section VI are realized using these mirrors, and sized to handle the whole input current range with minimum distortion and smallest possible devices. For mirrors biased by a current I_Q, we obtain the following sizing equations, (10) where k_n = µC_ox/2, V_Tn is the threshold voltage, and V_CAS is the cascode voltage, which can be generated as shown in Fig. 9(b). We assume the same geometries W_n and L_n for all n-channel transistors in the cascode mirror. W values for larger currents (associated to weighted replication) are calculated by imposing the constraint that all transistors have equal current density.
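The effect of spurious current division on the gain error can be sketched numerically. The model below is an assumption, not the expression from [9]: N mirror outputs of conductance g_o load an input node of conductance g_in, so a fraction N·g_o/(g_in + N·g_o) of the signal current is lost, which reduces to N·g_o/g_in to first order. The conductance ratios used are hypothetical.

```python
def gain_error_exact(n_mirrors, go_over_gin):
    """Fraction of current lost by division between g_in and N parallel g_o (assumed model)."""
    load = n_mirrors * go_over_gin
    return load / (1.0 + load)

def gain_error_first_order(n_mirrors, go_over_gin):
    """First-order approximation, valid when N * g_o << g_in."""
    return n_mirrors * go_over_gin

N_MAX = 18                     # 9 neighbors x 2 templates (A and B), as stated in the text
RATIO_SIMPLE_MIRROR = 1e-2     # hypothetical g_o / g_in for a simple mirror
RATIO_CASCODE_MIRROR = 1e-4    # hypothetical g_o / g_in for a cascode mirror

for name, ratio in [("simple", RATIO_SIMPLE_MIRROR), ("cascode", RATIO_CASCODE_MIRROR)]:
    exact = gain_error_exact(N_MAX, ratio)
    approx = gain_error_first_order(N_MAX, ratio)
    print(f"{name:8s} mirror: exact {exact:.4%}, first-order {approx:.4%}")
```

With these illustrative ratios the error at a fully loaded node drops from roughly 15% to below 0.2%, which shows why the number of driving mirrors makes a low g_o/g_in ratio important.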
Alternatively, for a given aspect ratio W_n/L_n, equation (10)
Mismatch is produced mainly by variations of V_T and β = µC_ox W/L, whose standard deviations σ(V_T) and σ(β)/β for devices with equal layout show a component inversely proportional to the square root of the channel area, and another proportional to the distance between devices [24]. However, in the technology used and for transistor pairs closer than about 2.5mm, the distance-dependent component is negligible for devices with channel area of less than 100µm² [24] --larger than the values obtained using (10) for bias currents below ∼50µA and channel lengths
Lower channel lengths have not been considered for several reasons, like short-channel effects, early-voltage degradation, and increased mismatch effects due to the associated low channel areas. In addition, lower transistor geometries do not result in appreciable area reductions due to the minimum contact size (4µm with surrounding metal and diffusion in the technology used).
Another important consideration is that for a given σ(V_T) and σ(β)/β, the ratio σ(I)/I in MOS transistors operating in strong inversion and after pinch-off is inversely dependent on v_gs − V_T. This means that once geometries have been set to achieve acceptable mismatch levels, the bias current cannot be decreased too far below the bound given by (10), since this would produce a low v_gs voltage at the bias point, with the corresponding large σ(I)/I. Hence, mismatch considerations establish bounds for both minimum area and power trends.
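The two trends described above, mismatch decreasing with channel area and increasing as v_gs − V_T is lowered, can be illustrated with the standard square-law estimate σ(I)/I = sqrt[(σ(β)/β)² + (2σ(V_T)/(v_gs − V_T))²]. This formula follows from I = (β/2)(v_gs − V_T)² with uncorrelated variations and is used here as an assumption; it is not quoted from the text. The area-scaling coefficients are hypothetical, and the distance-dependent component is ignored.

```python
import math

A_VT = 15e-3     # hypothetical V_T matching coefficient (V * um)
A_BETA = 0.02    # hypothetical beta matching coefficient (relative * um)

def sigma_vt(area_um2):
    """Threshold-voltage spread, inversely proportional to sqrt(channel area)."""
    return A_VT / math.sqrt(area_um2)

def sigma_beta_rel(area_um2):
    """Relative beta spread, inversely proportional to sqrt(channel area)."""
    return A_BETA / math.sqrt(area_um2)

def sigma_current_rel(area_um2, overdrive_v):
    """Relative drain-current spread for a square-law device in saturation."""
    vt_term = 2.0 * sigma_vt(area_um2) / overdrive_v
    return math.sqrt(sigma_beta_rel(area_um2) ** 2 + vt_term ** 2)

for area in (100.0, 400.0):
    for overdrive in (0.5, 0.1):
        rel = sigma_current_rel(area, overdrive)
        print(f"area {area:5.0f} um^2, v_gs - V_T = {overdrive:.1f} V: sigma(I)/I = {rel:.2%}")
```

With these coefficients, quadrupling the channel area halves σ(I)/I, while reducing the overdrive from 0.5V to 0.1V increases it nearly fivefold. This is the trade-off that limits how far the bias current can be lowered.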
c) Light effects on the processing circuitry:
Optical image acquisition forces the processing circuitry to be exposed to light, which results in an increase of the leakage currents at the reverse-biased substrate-diffusion junctions. Unitary bias currents must be sufficiently large for this effect to be negligible. Also, MOS threshold voltage depends on light intensity, increasing the mismatch effect on current mirrors and sources. Current mirror transistors are commonly placed close together, and hence light-intensity gradients have a reduced effect. On the contrary, current sources in different cells, biased by common global voltages, can exhibit larger dispersions. The tolerance of a particular application to variations in the unitary bias current must be evaluated in general, and local references should be used when required.
VI. EXPERIMENTAL RESULTS
A. 16 × 16 DCC Prototype
The following measurements were taken from a 16 × 16 smart-pixel CNN chip intended for horizontal connected component detection [15] (see Fig. 2) --a basic preprocessing step for pattern recognition. Fig. 10 shows a microphotograph of the prototype, which in addition to the smart-pixel array contains boundary cells, output buffers, bias stages, and some digital control circuitry for the output image downloading process. The dimensions of the core array are 1890 × 1530µm², and its power dissipation is 27mW. The total chip dimensions, including the bonding pads, are 2480 × 2500µm², with a total power dissipation of 42mW and a total of 24 pins. is reduced from the nominal 5V down to 2.7V. This is another positive consequence of using current-mode techniques.
B. 16 × 16 Radon Transform Prototype
This prototype performs the Radon Transform [25] of 16 × 16 pixels input images. This chip accepts electrical, as well as optical, input. The processing circuitry is based on a modified version of (2), where time has been discretized, and the nonlinearity is hard,
Also, this application requires signal-dependent weights [26] . In particular, the weights of the contributions going from a particular cell c to its neighbors depend on x c . The complete set of CNN coefficients can be described using unidimensional templates as follows, (12) which reflect the scaling factors applied to the contributions of a particular cell to its neighbors. Fig. 13(a) shows a simplified schematic of a cell, which uses pass transistors to realize the delay required in (11) and a high-resolution current comparator [27] for the hard nonlinearity.
The design technique and the algorithm used in this circuitry are described in detail in [9]. Fig. 13(b) shows a microphotograph of the prototype. Cell dimensions are 121µm × 124µm, and the power dissipated by each cell is 1mW --significantly larger than for the DCC due to the circuitry used to implement the hard nonlinearity.
The system contains a number of blocks located in the periphery of the cell array, such as output buffers, bias stages, and digital control circuitry dedicated to the uploading and downloading processes. This additional circuitry, together with the bonding pads, results in a total system area of 2670µm × 2680µm, and a total system dissipation of 330mW. The chip requires a total of 43 pins. This number is significantly higher than that of the previous prototype due to 16 
VII. CONCLUSIONS
Summarizing, this paper has outlined a basic model and some design issues related to a methodology to design CNN smart-pixel chips in digital CMOS processes, and has presented measurements from two working prototypes in a 1.6µm n-well CMOS technology. One calculates the number of connected pieces (DCC) of an input image in the horizontal direction, and the other evaluates the Radon Transform of an input image. The DCC chip obtains a density of ∼89 smart-pixels per mm² (each including sensory, regulation and processing circuitry), with a power consumption of 105µW per smart pixel and image processing times below 2µs. Area and speed figures for the RT chip are similar. Although power dissipation is larger for this prototype, this can be corrected with a careful design of the current comparator [27].
Compared to previous CNN implementations, the proposed technique achieves the required synergy between sensing and processing, and significantly improves area and speed/power figures. In particular, when compared to previous chips for the same application [11], [12], the DCC chip, apart from including sensors at the cells, reduces the area consumption by a factor of 4, and improves the speed/power figure by more than one order of magnitude.
These area and power figures, and the fact that connections among pixels are made by abutment (requiring no extra routing area) enable forecasting single-die CMOS chips with 100 × 100 complexity and about 1W power consumption.
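The density, per-pixel power, and 100 × 100 forecast can be reproduced from the DCC core measurements reported in Section VI. The sketch below assumes that area and power scale linearly with the number of cells and ignores peripheral circuitry and pads; variable names are illustrative.

```python
import math

CORE_WIDTH_UM = 1890.0     # DCC core array width
CORE_HEIGHT_UM = 1530.0    # DCC core array height
CORE_POWER_W = 27e-3       # DCC core array power dissipation
CELLS = 16 * 16            # smart pixels in the DCC core

core_area_mm2 = (CORE_WIDTH_UM / 1000.0) * (CORE_HEIGHT_UM / 1000.0)
density_per_mm2 = CELLS / core_area_mm2
power_per_cell_w = CORE_POWER_W / CELLS

# Linear extrapolation to a 100 x 100 array (core only, assumed scaling)
big_cells = 100 * 100
big_area_mm2 = big_cells / density_per_mm2
big_side_mm = math.sqrt(big_area_mm2)
big_power_w = big_cells * power_per_cell_w

print(f"density:          {density_per_mm2:.1f} cells/mm^2")
print(f"power per cell:   {power_per_cell_w * 1e6:.1f} uW")
print(f"100x100 core:     {big_area_mm2:.0f} mm^2 (about {big_side_mm:.1f} mm per side)")
print(f"100x100 power:    {big_power_w:.2f} W")
```

Under these assumptions the core of a 100 × 100 array would occupy a little over 1 cm² and dissipate close to 1W, consistent with the forecast above.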
These designs are mainly oriented towards preprocessing tasks which require fixed weights. We feel that there are potential application fields for these chips, provided efficient integration to massive processors is achieved. For this purpose, close cooperation between chip designers and system developers is necessary. 
