Abstract-A VLSI architecture is proposed for the realization of real-time two-dimensional (2-D) image filtering in an address-event-representation (AER) vision system. The architecture is capable of implementing any convolutional kernel $F(x, y)$ as long as it is decomposable into x-axis and y-axis components, i.e., $F(x, y) = H(x)V(y)$, for some rotated coordinate system $\{x, y\}$, and if this product can be approximated safely by a signed minimum operation. The proposed architecture is intended to be used in a complete vision system, known as the boundary contour system and feature contour system (BCS-FCS) vision model, proposed by Grossberg and collaborators. The present paper proposes the architecture, provides a circuit implementation using MOS transistors operated in weak inversion, and shows system-level behavioral simulation results and some electrical simulations.
I. INTRODUCTION

Manuscript received February 24, 1998; revised September 20, 1998. This paper was recommended by Associate Editor T. Roska.
T. Serrano-Gotarredona and B. Linares-Barranco are with the Instituto de Microelectrónica de Sevilla (IMSE), Centro Nacional de Microelectrónica (CNM), 41012 Sevilla, Spain (e-mail: terese@imse.cnm.es).
A. G. Andreou is with the Department of Electrical and Computer Engineering, The Johns Hopkins University, Baltimore, MD 21218 USA.
Publisher Item Identifier S 1057-7122(99)08066-6.

Human beings have the capability of recognizing objects, figures, and shapes, even if they appear embedded within noise, are partially occluded, or look distorted. To achieve this, the human vision-processing system is structured into a number of massively interconnected neural layers with feedforward and feedback connections among them. Neurons communicate by means of electrical streams of pulses. Each neuron broadcasts its output to a large number of other neurons, which can be inside the same layer or in different layers. This is done through physical connections called synapses [1]. One major problem engineers encounter when implementing bio-inspired (vision) processing systems is coping with the massive interconnections. One way of trying to solve this is to develop models and algorithms that require only a small local interconnectivity among neighboring neurons. Cellular neural networks (CNN's) are one way of doing this [2]-[4]. However, in this paper we focus on another approach, whose popularity has grown recently, known as address event representation (AER) [5]-[8]. Fig. 1 shows a schematic outlining the essence of AER. Suppose we have an emitter chip containing a large number of neurons or cells whose activity changes in time with a relatively slow time constant. For example, if the emitter chip is a retina chip and each neuron's activity represents the illumination sensed by a pixel, the time constant with which this activity changes can be equivalent to the frame rate (i.e., 25-30 changes/s or a time constant of about 30-40 ms). The purpose of an AER-based communication scheme is to reproduce the time evolution of each neuron's activity inside a second, or receiver, chip, using a fast digital bus with a small number of pins. In the emitter chip, the activity of each pixel has to be transformed into a pulse-stream signal such that the pulse width is minimal and the spacing between pulses is reasonably large, so that the activity of a relatively large number of neurons can be time multiplexed. Every time a neuron produces a pulse, its address or code should be written on the bus. For the case where several neurons produce a pulse simultaneously, a classical arbitration tree can be introduced [5]-[7], or one based on winner-takes-all (WTA) row-wise competitions [9], or the conflict can be resolved simply by letting none of the colliding neurons access the bus [10]. Whatever method is used, the result will be a sequence of addresses or codes on the digital bus that one or more receiver chips can read. Each receiver chip must contain decoding circuitry so that a pulse reaches the neuron (or neurons) that ought to be connected to the emitter chip neuron specified by the address read on the bus. If each neuron integrates the sequence of pulses properly, the original activity of the neurons in the emitter chip will be reproduced. Note that in AER, the more active neurons access the bus more frequently. This property optimizes the use of the bus, since neurons with low activity will not consume much communication bandwidth.
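The encoding principle described above can be illustrated with a minimal sketch. It assumes an idealized emitter in which every pixel fires periodically at a rate proportional to its activity, with no arbitration or collision handling, and a receiver that simply counts pulses per address. The names `aer_events`, `receive`, and `f_max` are hypothetical and are not taken from the circuits discussed in this paper.

```python
import random


def aer_events(intensity, f_max, duration, seed=0):
    """Return a time-sorted list of (time, address) bus events.

    intensity maps an address to an activity level in [0, 1]; each pixel
    fires periodically at rate level * f_max. A random initial phase keeps
    pixels from firing at exactly the same instants (assumption: the bus
    is ideal, so no arbitration is modeled).
    """
    rng = random.Random(seed)
    events = []
    for address, level in intensity.items():
        rate = level * f_max
        if rate <= 0:
            continue  # silent pixels never use the bus
        period = 1.0 / rate
        t = rng.random() * period
        while t < duration:
            events.append((t, address))
            t += period
    events.sort()
    return events


def receive(events, duration):
    """Receiver side: recover each pixel's mean pulse rate from the bus."""
    counts = {}
    for _, address in events:
        counts[address] = counts.get(address, 0) + 1
    return {address: n / duration for address, n in counts.items()}
```

In this sketch a pixel with twice the activity occupies the bus twice as often, and a pixel with zero activity consumes no bandwidth at all, which is the property noted in the text.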
Mathematically, if $I(x, y)$ is the intensity (or activity) of the pixel at coordinate $(x, y)$ of the emitter chip, this pixel will generate a stream of pulses $P_{x,y}(t)$ so that, when integrated at the receiver chip, the original activity is recovered

$$I(x, y) = \int P_{x,y}(t) \qquad (1)$$

Here, the operator $\int$ denotes integration of a sequence of pulses. Usually in AER, this operator is a lossy integration and the asynchronous sequence of pulses is such that its pulse density represents the pixel intensity. The interfacing bus activity is the time-multiplexed sequence of pulses of all active emitter chip pixels. This is the simplest AER-based communication scheme among chips. However, AER allows us to easily add more complicated processing. For example, input images can be translated or rotated by remapping the addresses while they travel from one chip to the next. By properly programming an EEPROM as a look-up table, any address remapping can be implemented by simply inserting the EEPROM between the two chips. Furthermore, many EEPROM's can be connected in parallel, each performing, for example, a rotation at a specific angle and each delivering the remapped addresses to a set of specialized processing chips. It is also possible to include synaptic weighting by having the EEPROM store the weight value and dump it on a data bus, so that the receiver chip reads both the address and the data bus and performs a weighted integration in the destination neuron(s). It is also possible to implement projection fields, i.e., for every address that appears on the bus, a small digital system could generate a sequence of addresses around it and send it to the receiver chip. This would be a time-multiplexed projection-field generation. In the architecture proposed in this paper, we implement a synaptically weighted projection field for each address read on the bus, not in a time-multiplexed manner, but in parallel.
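The address remapping through a look-up table can be sketched as follows. This is only an illustration under stated assumptions: the table is held in a Python dictionary rather than an EEPROM, the rotation is taken about the array center with nearest-pixel rounding, and addresses that leave the array are dropped. The names `rotation_lut` and `remap` are hypothetical.

```python
import math


def rotation_lut(size, angle_deg):
    """Build an address look-up table for a size x size pixel array.

    Each address (x, y) is rotated about the array center and rounded to
    the nearest pixel; addresses that fall outside the array get no entry.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    center = (size - 1) / 2.0
    lut = {}
    for x in range(size):
        for y in range(size):
            dx, dy = x - center, y - center
            xr = round(center + dx * cos_t - dy * sin_t)
            yr = round(center + dx * sin_t + dy * cos_t)
            if 0 <= xr < size and 0 <= yr < size:
                lut[(x, y)] = (xr, yr)
    return lut


def remap(addresses, lut):
    """Remap a stream of bus addresses, skipping those with no LUT entry."""
    return [lut[a] for a in addresses if a in lut]
```

For rotations that are not multiples of 90 degrees, nearest-pixel rounding can map two source addresses to the same destination or leave some destinations unused, so a practical table may need a more careful construction than this sketch.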
This can be done by either having a hard-wired kernel in the filtering chip [11] , or by implementing a programmable one, as proposed in this paper.
II. THE PROGRAMMABLE FILTER
The programmable filter described in this paper is intended to be used in a vision model system, known as the boundary contour system (BCS) and feature contour system (FCS) [12]. Such a vision model consists of an image-sensing layer, followed by a set of illumination normalization layers (this is also known as a retina [7], [8]). The output, which is a contrast image, is applied to a set of orientation-specific edge-extraction Gabor-like filters. Their outputs are then fed to a set of convolutional processing layers, organized in four stages connected in feedback, intended to extract long-range contours of the input image while removing noise. The convolutional kernels used in most of these layers are decomposable into x- and y-axis components, $F(x, y) = H(x)V(y)$, for some rotated coordinate system $\{x, y\}$. Using AER allows us to implement a filtering chip only for the coordinate system $\{x, y\}$ for which $F$ is decomposable. To do the filtering for another coordinate system $\{x', y'\}$, rotated with respect to $\{x, y\}$ by an arbitrary angle $\theta$, we can use the same chip, but provide addresses which have been rotated previously (by simply inserting an appropriately programmed EEPROM in the interfacing bus).
In the filtering chip, the convolutional kernel is implemented as follows. Every time a pulse for address $(p, q)$ is received, pulses are sent to all pixels in its vicinity. In this way, the lossy integrator at pixel $(x, y)$ of the receiver chip will integrate the sequence of pulses

$$\sum_{(p,q) \in N(x,y)} F(x - p, y - q)\, P_{p,q}(t) \qquad (2)$$

which are all pulses coming in from its vicinity $N(x, y)$, weighted by the convolutional kernel $F$. The weighting is performed by modulating the width of each incoming pulse. Thus, every time a pulse is received for pixel $(p, q)$, a pulse of width $|F(x - p, y - q)|$ is sent to each pixel $(x, y)$ in its vicinity. The resulting lossy integral of this stream of pulses is the output image

$$O(x, y) = \int \sum_{(p,q) \in N(x,y)} F(x - p, y - q)\, P_{p,q}(t) = \sum_{(p,q) \in N(x,y)} F(x - p, y - q)\, I(p, q) \qquad (3)$$

Pulse-width modulation is done as follows. When a pulse for coordinate $(p, q)$ is received, all columns $x$ in the vicinity of column $p$ receive a pulse of width $|H(x - p)|$ and all rows $y$ in the vicinity of row $q$ receive a pulse of width $|V(y - q)|$. The values of $H$ and $V$ are stored in a small on-chip RAM. The integrator at coordinate $(x, y)$ receives a pulse of width equal to the minimum of $|H(x - p)|$ and $|V(y - q)|$. Consequently, the convolutional kernel that the system implements is an approximation to $F$, which is

$$\tilde{F}(x, y) = \operatorname{sign}\big(H(x)V(y)\big)\, \min\big(|H(x)|, |V(y)|\big) \qquad (4)$$

the signed minimum of the vertical and horizontal components.
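The projection of a single input pulse onto its vicinity can be sketched as follows. This is a behavioral illustration only, not the circuit: the kernel components are given as dictionaries from integer offsets to values, array borders are ignored, and accumulated pulse width stands in for integrated charge. The names `signed_min` and `project_event`, and the two accumulators `pos` and `neg`, are hypothetical.

```python
def signed_min(h, v):
    """Signed minimum: sign of the product, magnitude of the smaller factor."""
    magnitude = min(abs(h), abs(v))
    return magnitude if (h >= 0) == (v >= 0) else -magnitude


def project_event(p, q, H, V, pos, neg):
    """Apply one input pulse received for address (p, q).

    H and V map integer offsets to kernel component values. pos and neg
    are dictionaries accumulating total pulse width per target pixel for
    the positive and negative terms of the kernel, respectively.
    """
    for dx, h in H.items():
        for dy, v in V.items():
            width = signed_min(h, v)
            target = (p + dx, q + dy)
            if width > 0:
                pos[target] = pos.get(target, 0.0) + width
            elif width < 0:
                neg[target] = neg.get(target, 0.0) - width
```

Keeping positive and negative contributions in separate accumulators mirrors the idea that a pulse width cannot be negative, so the sign has to be carried by the choice of accumulator rather than by the width itself.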
Certainly, replacing a product operation by a signed minimum introduces an error. However, in most bio-inspired vision-processing models, the task performed by a processing layer is mainly of qualitative importance rather than quantitative. Therefore, choosing one mathematical kernel or another to perform a given task (such as edge or orientation extraction) should not be too critical for the global operation of a realistic bio-inspired vision model. Nevertheless, whether or not a product can be substituted by a signed minimum should be evaluated for each particular application. In any case, to give a quantitative feeling of the error introduced, Table I shows the resulting normalized square error (NSE) when replacing the product by the signed minimum for some typical image processing x- and y-decomposable kernels. The NSE is the figure of merit used by Shi [14] to compare different kernels, defined here as

$$\mathrm{NSE} = \frac{\iint \| F(x, y) - \tilde{F}(x, y) \|^2 \, dx \, dy}{\iint \| F(x, y) \|^2 \, dx \, dy}$$
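A discrete version of this figure of merit can be sketched as follows, assuming the kernel components are sampled on a grid so that the integrals become sums. The function name `nse` is hypothetical, and the sketch does not attempt to reproduce the values of Table I.

```python
def nse(h_values, v_values):
    """Discrete normalized square error between the product kernel
    H(x)V(y) and its signed-minimum approximation.

    h_values and v_values are the sampled kernel components. The kernel
    must not be identically zero, otherwise the normalization fails.
    """
    error = 0.0
    energy = 0.0
    for h in h_values:
        for v in v_values:
            exact = h * v
            magnitude = min(abs(h), abs(v))
            approx = magnitude if exact >= 0 else -magnitude
            error += (exact - approx) ** 2
            energy += exact ** 2
    return error / energy
```

As a small worked case, for $H = V = (1, 2)$ the product kernel has entries 1, 2, 2, 4 and the signed minimum has entries 1, 1, 1, 2, so the squared error is 6, the kernel energy is 25, and the NSE is 0.24. When every component value is 0 or ±1 the two kernels coincide and the NSE is zero.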
III. CIRCUIT DESCRIPTION
This section provides a circuit that implements the previously described functionality. The address bus provides the coordinates $(p, q)$ of the neuron (or pixel) around which the convolutional kernel should be applied. Pulses will be applied to all rows whose y coordinate lies within the kernel width of row $q$ and to all columns whose x coordinate lies within the kernel width of column $p$. Pulses will be modulated in width, according to function $V$ for the rows and function $H$ for the columns. At each pixel there is an AND gate, which provides a pulse of width equal to the minimum of the two incoming pulse widths. This pulse will generate a fixed-magnitude current pulse of the same width, which will be integrated on a capacitor. Each pixel contains two integrators. One of them, the positive integrator, integrates the pulses when $H$ and $V$ have the same sign, while the other, the negative integrator, integrates the pulses when $H$ and $V$ have opposite signs. The values of $H$ and $V$ are stored digitally on chip in a small RAM. Fig. 2 shows the block diagram of the system. It consists of two input decoders that decode the address of the arriving pulse, a C-element required for the AER communication protocol [5]-[8], an array of integrator cells, two sets of programmable monostables whose pulse widths are controlled by the bits stored in two RAM's (which store the digital words for $H$ and $V$, respectively), two arrays of x- and y-selecting cells, two output decoders to select the cells to be scanned, and a scanning circuitry Scan to read out an analog output current. Note that in the present prototype of Fig. 2, the system does not generate an AER output. This can be solved either by adding the necessary circuitry to each pixel [5]-[7], which will decrease the cell density of the resulting chip, or by adding a postprocessing chip that sequentially scans all cells in the array of Fig. 2 and generates an AER output. Once the filter has an AER output, an arbitrary number of filtering stages can be cascaded.
The operation of the system in Fig. 2 is as follows. The two RAM's store digital words of a fixed number of bits. The first bit of each word indicates the sign of the corresponding function ($H$ or $V$). The following bits indicate its absolute value ($|H|$ or $|V|$). These bits linearly control the length of the pulse triggered by the corresponding monostable. The monostables achieve this by charging, with a constant current, a programmable capacitor controlled by the bits stored in the RAM's. Hspice simulations showed a linear relationship between digital code word and pulse width. The pulses generated by the monostables are sent through dedicated lines and are triggered whenever an external pulse arrives at the system.

Fig. 3. Schematic of (a) the neighborhood-selection cell and (b) one half of the core-cell diode-capacitor integrator.

When an external pulse arrives, the input decoders activate the column and row lines corresponding to the address of the arriving pulse. The selection cells controlled by (cells in Fig. 2 (5) where is the (lossy) integral over time of the number of pulses the address bus receives for pixel and is the fixed magnitude of the current pulses being integrated. Similarly, the negative integrator accumulates charge when pulses arriving through horizontal and vertical lines of opposite sign are simultaneously high. Consequently, the difference between the outputs of the positive and negative integrators is given by (6) which is the filter operation we want to implement. Fig. 3(a) shows the neighborhood-selection cell. Each synaptic cell has two integrators: the positive and the negative. Fig. 3(b) shows the circuit diagram for the positive integrator. The negative one is identical, except for labeling. The integrator is based on the capacitor-diode integrator concept for subthreshold MOS operation [7]. As will be seen next, this integrator has some interesting properties with respect to a conventional linear RC-integrator.
• The steady-state current is proportional to pulse stream frequency.
• The steady-state current is proportional to pulse width.
• The steady-state current ripple is independent of the current level.

In Fig. 3(b), the two AND and the NOR gates provide a pulse of width equal to the minimum of the pulse widths coming in horizontally and vertically. This pulse turns ON a current source, providing a current pulse of fixed amplitude (controlled by a bias voltage). Since the integrator transistors are biased in subthreshold, the integrator input and output currents are related by [7] (7) where the parameters are the thermal voltage and a characteristic subthreshold dimensionless technology parameter, whose value may range from 0.60 to 0.98 [14]. When a train of pulses of a given width and frequency is applied to this integrator, the steady-state output current is given by [7] (8) with a ripple given by (9). Equation (9) shows that the relative resolution in the integrator output is constant, independent of the signal level. According to (8), each integrator outputs a current which is proportional to the frequency and width of the input pulses. If the AER input-image pixel intensity is linearly encoded as the frequency of the arriving pulses and the convolutional kernel is encoded as the pulse width, the output current of the positive integrators is the input image filtered with the positive terms of the filter. Equivalently, the negative integrator output current is the input image filtered with the negative terms of the filter. Hence, the result of subtracting the output current of the negative integrator from the output current of the positive one is the filter output.

Fig. 4(a) shows an Hspice transient simulation for one of the integrator cells in Fig. 3(b). Transistor sizes are m and m, the integrating capacitor is pF, pulse amplitude is nA, pulse width is ns, frequency of pulse stream is KHz, V, and voltage was set to 4.67 V (which yields a current gain from transistor to of around 2000).
Similar simulations were performed by sweeping the frequency of the input pulse stream and the width of the pulses. The results are shown in Fig. 5. Fig. 5(a) shows the steady-state current level as a function of frequency, while maintaining the pulse width constant. Fig. 5(b) shows the steady-state current level as a function of pulse width, while maintaining the frequency constant at 4 kHz.
Sometimes in 2-D image filtering a rectification operation has to be performed. This is the case, for instance, when doing orientation extraction with Gabor-like kernel filters, where the output of the filter is rectified for each pixel [12]. Because of this, the chip scan-out circuitry, which brings the state of a cell out of the chip, has been designed so that it can add a rectification operation. The random-access scanning circuitry can read the rectified output current of any cell selected by the random scan bus of Fig. 2. The output decoder (see Fig. 2) selects a column through a column-select line. When a column is not selected, the output currents of the positive and negative integrators of all cells in that column flow to a line of constant voltage [see Fig. 3(b)]. If a column is selected, the two output currents of all cells in this column flow to two separate input lines of the scan-out cell Scan, shown in Fig. 6. Each scan-out cell Scan receives two input currents and provides an output current. One input current is mirrored through a PMOS current mirror and subtracted from the other. The PMOS current mirror has an active input [15], clamped to a fixed voltage. This maintains a constant voltage at the output nodes of the cells when they are selected, thus speeding up the readout of currents. The difference current enters a current comparator composed of transistors and an opamp [16], whose input node (and the output of all selected cells) is clamped to a fixed voltage. If the difference current is positive, one comparator transistor will sink this current. A second transistor shares its gate with the first and has its source connected to a voltage reference, so that it mirrors the current passing through the first: the difference current if it is positive, and zero otherwise.
The precision of this current mirroring depends on how tightly the source of the sinking transistor is clamped to the reference voltage. Good precision requires a high-gain opamp, although this slows down the process. Thus, a compromise between speed and precision must be made. For the opposite polarity of the difference current, a complementary transistor sources the current, which is mirrored by another transistor because its source is clamped to the same voltage by the current comparator composed of transistors and the opamp. Therefore, the current through this second pair of transistors is the magnitude of the difference current if it has this opposite polarity, and zero otherwise.
This current is again mirrored by a PMOS transistor pair. At the output node, the currents of the two branches are added together to obtain the rectified current. Since the transistors operate in weak inversion, increasing the source voltage of the mirror transistors with respect to the clamping voltage gives the current mirrors a gain higher than one (in fact, the gain is controlled exponentially by this voltage difference). This allows a current gain such that the output current is on the order of hundreds of µA or even a few mA, making it possible to drive this current directly off-chip at high speeds. Fig. 4(b) shows an Hspice simulation of the dc characteristic of a scan cell. In this simulation, one input current was set to 80 nA and the other was swept from 0 nA to 160 nA. Two traces are shown in Fig. 4(b). The dotted line shows the current flowing through the output transistor of one branch. The solid line corresponds to the current flowing through the output transistor of the other branch.
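The behavior of the scan-out cell can be summarized with a small behavioral sketch. It models only the ideal function (difference, two complementary branches, gain, and sum), not the transistor-level circuit. The name `scan_out`, the argument names, the convention that the negative-integrator current is subtracted from the positive one, and the unit default gain are all assumptions made for illustration.

```python
def scan_out(i_pos, i_neg, gain=1.0):
    """Ideal scan-out cell: return (branch_a, branch_b, i_out).

    branch_a conducts when the difference current i_pos - i_neg is
    positive, branch_b when it is negative. Their sum, scaled by the
    current-mirror gain, is the rectified output current.
    """
    diff = i_pos - i_neg
    branch_a = diff if diff > 0 else 0.0
    branch_b = -diff if diff < 0 else 0.0
    return branch_a, branch_b, gain * (branch_a + branch_b)
```

Sweeping one input from 0 nA to 160 nA while holding the other at 80 nA, as in the dc simulation described above, makes one branch current fall linearly to zero at 80 nA while the other rises linearly from that point, so that their sum traces the absolute value of the difference.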
IV. TIMING CONSIDERATIONS
The time response of the programmable 2-D image filter chip is dominated by the settling time of the integrators in the cells when they are fed by pulse streams. For the simulation of Fig. 4(a), for example, the settling time is about 1 ms. A general analysis is as follows. For the diode-capacitor integrator, fed with a stream of pulses of a given width and frequency, it is easy to find that the currents through the diode just before two consecutive pulses are related by [17] (12)
Calling , and the second term of the right-hand side in (12), this equation can be rewritten as (13) which converges to , as anticipated by (8). Consequently, (13) can be rewritten as (14) The number of pulses required to reach the steady state with a relative error is given by the solution of 3 , which yields or, equivalently, (15)

The maximum pulse stream frequency is limited by the communication throughput we would like to achieve. For example, consider a 128 × 128-pixel contrast output retina [7], [8] (i.e., the output image is already normalized with respect to contrast). Suppose also that, for a real-world image, the average contrast level is as if 10% of the pixels were at maximum and the rest at minimum. In this case, we would need to allocate, on average, 128 × 128 × 0.1 pulses of a given width within one period of the maximum pulse frequency. A reasonable value for that would assure a good range for the pulse-width modulation described in Section III could be ns. This would yield s or kHz. Consequently, maximum pixel activity should be coded with kHz.
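The throughput argument above reduces to a single division, sketched below. The function name `max_pixel_rate` is hypothetical. The 100-ns pulse width mentioned afterward is an arbitrary illustrative value chosen here, not the figure used in the text, which is not recoverable from this copy.

```python
def max_pixel_rate(n_pixels, active_fraction, pulse_width_s):
    """Highest per-pixel pulse frequency (Hz) the bus can carry.

    Assumes n_pixels * active_fraction pulses of the given width must
    fit, back to back, into one period of that maximum frequency.
    """
    pulses_per_period = n_pixels * active_fraction
    return 1.0 / (pulses_per_period * pulse_width_s)
```

For instance, with 128 × 128 pixels, a 10% average activity, and an assumed pulse width of 100 ns, about 1638 pulses must share each period, which takes roughly 164 µs and therefore limits the maximum pixel rate to roughly 6.1 kHz under these assumptions.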
V. SYSTEM LEVEL OPERATION BEHAVIORAL SIMULATIONS
Up to this point, electrical (Hspice) simulations of some of the circuit components have been presented. However, to validate the functionality of the proposed architecture, some system-level (behavioral) simulations are mandatory. In this section we provide such simulations, using MATLAB on the architecture of Fig. 2 for a system of 128 × 128 cells. The input image fed to the system is shown in Fig. 7(a), and the programmed convolutional kernel was a displaced Gaussian (see Table I). Using MATLAB, the AER stream of addresses that this image could generate was computed. The stream of pulses flowing through the bus is characterized by a sequence of address-time pairs, each giving the address present on the bus at a given time. This stream of addresses was then used to drive the mathematical model of the architecture of Fig. 2. Each one of the 128 × 128 cells is characterized by the state of two integrators: the positive integrator and the negative one. The state of the integrators is governed by the following differential equations [see (7)] (16) whose solutions were computed analytically. These solutions were used to update the state of the integrators in the following manner. For each address present on the bus, all cells within the kernel range around it were accessed. For each accessed cell, the pulse width was computed, using the approximation of (4) and the simulation results for the monostable. Depending on the resulting sign, either the positive or the negative integrator was updated. After an integrator has been updated, the present time is stored for it so that, the next time it needs to be updated, the simulator can properly compute its discharge amount. For each cell, the output is computed from the states of its two integrators. Applying this method until all integrators have reached their steady state within 1% tolerance results in the system output depicted in Fig. 7(b). In this case, addresses were not prerotated, so that the system is extracting vertical edges.
As can be seen, pixels around vertical edges produce a very high output value, while, as the edge angle around a pixel deviates from vertical, its output value decreases smoothly to zero.
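The event-driven update with stored time stamps can be sketched as follows. This is not the MATLAB model used for Fig. 7: as a loudly stated simplification, each integrator is modeled here as a first-order linear leak with time constant `tau` instead of the diode-capacitor dynamics, and the cell output is taken as the difference between the two integrator states. The class name `Cell` and all of its members are hypothetical.

```python
import math


class Cell:
    """One filter cell with lazily decayed positive and negative integrators.

    Assumption: exponential (linear RC-like) decay stands in for the
    diode-capacitor integrator, purely to illustrate the bookkeeping.
    """

    def __init__(self, tau):
        self.tau = tau
        self.state = {"pos": 0.0, "neg": 0.0}
        self.last_update = {"pos": 0.0, "neg": 0.0}

    def _decayed(self, which, t):
        # Discharge accumulated since this integrator was last touched.
        elapsed = t - self.last_update[which]
        return self.state[which] * math.exp(-elapsed / self.tau)

    def add_pulse(self, t, signed_width):
        """Update only the integrator selected by the sign of the pulse."""
        which = "pos" if signed_width > 0 else "neg"
        self.state[which] = self._decayed(which, t) + abs(signed_width)
        self.last_update[which] = t

    def output(self, t):
        """Difference between positive and negative integrator states."""
        return self._decayed("pos", t) - self._decayed("neg", t)
```

Storing the time of the last update per integrator means that cells are only touched when a pulse actually reaches them, so the cost of the simulation scales with the number of bus events times the kernel area rather than with the number of cells times the number of time steps.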
VI. CONCLUSION AND FUTURE WORK
An architecture that implements a programmable 2-D image filter has been presented. The architecture allows us to implement any 2-D filter $F(x, y)$ that is decomposable into x-axis and y-axis components $H(x)$ and $V(y)$ such that the product $H(x)V(y)$ can be approximated by a signed minimum. Positive and negative values of $H$ and $V$ can be programmed. The architecture requires an AER input. This allows us to rotate the 2-D convolution kernel to any angle.
A VLSI circuit implementation that realizes the proposed architecture has been provided. Circuit simulation results of critical components were given. System-level behavioral simulations of a 128 × 128 array have been included, which validate the proposed approach. Cell size is 67.2 µm × 72.6 µm if no AER output is available and 75 µm × 90.6 µm if AER output is included, for a 1.2-µm double-poly double-metal CMOS process. For a 1-cm² die, this would allow implementing a 2-D filter with approximately 128 × 128 pixels with no AER output, and 120 × 100 pixels if AER output is provided. Future work includes the fabrication of a test prototype, testing it with a retina chip with AER output, and assembling a cascade of convolutional processing layers to implement a vision-model system.
