An architecture is proposed for real-time edge-extraction filtering in an Address-Event-Representation (AER) vision system. The approach is valid for any 2D filtering operation as long as the convolutional kernel F(p,q) is decomposable into an x-axis and a y-axis component, i.e. F(p,q)=H(p)V(q), for some rotated coordinate system {p,q}. If such a coordinate system {p,q}, rotated by some angle with respect to the absolute coordinate system, can be found, then the proposed architecture is able to perform the filtering operation for any angle by which we would like the kernel to be rotated. This is achieved by taking advantage of the AER and manipulating the addresses in real time. The proposed architecture, however, requires one approximation: the product between the horizontal component H(p) and the vertical component V(q) must be approximable by a signed minimum operation without significant performance degradation. It is shown that for edge-extraction applications this approximation does not produce performance degradation. The proposed architecture is intended to be used in a complete vision system known as the Boundary-Contour-System and Feature-Contour-System Vision Model, proposed by Grossberg and collaborators. The present paper proposes the architecture, provides a circuit implementation using MOS transistors operated in weak inversion, and shows behavioral simulation results for system-level operation, together with electrical simulation and experimental results for circuit-level operation of some critical subcircuits.
I. Introduction
Human beings have the capability of recognizing objects, figures, and shapes even if they appear embedded within noise, are partially occluded, or look distorted. To achieve this, the human vision processing system is structured into a number of massively interconnected neural layers with feedforward and feedback connections among them. Neurons communicate by means of streams of electrical pulses. Each neuron broadcasts its output to a large number of other neurons, which can be in the same layer or in different layers, through physical connections called synapses.
One big problem engineers encounter when implementing bio-inspired (vision) processing systems is how to overcome the massive interconnections. An interesting way of trying to solve this is Address Event Representation (AER) [1] - [3]. In AER each neuron codes its activity as a pulse stream signal with very low duty cycle, i.e. the pulse width must be minimal while the separation between pulses should be fairly large. Each neuron has a code or address, and every time it produces a pulse it tries to write its code on a common digital bus. A receiving system continuously reads this bus and sends the pulse to those neurons that ought to be connected to the sending neuron. In this manner the activity of a large number of neurons can be time multiplexed on a common bus. This principle allows a very complex neural system to be structured hierarchically. For example, a retina chip with AER output continuously puts addresses on a bus representing the sensed images. Several chips, each with an AER receiver system, can read the same bus, do some specialized processing, and broadcast the outputs of all their neurons, again using AER, on another external bus, and so on. Furthermore, extra processing can be added easily while the "addresses" go from one chip to the next. For instance, image rotation or translation can be performed in a straightforward manner by inserting an EEPROM in which the transformation operation has been programmed pixel by pixel (or address by address). In the architecture proposed in this paper we take advantage of this fact to simplify the processing chip.
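The time-multiplexing idea can be illustrated with a minimal Python sketch. It assumes a simple rate code (each neuron fires at a rate proportional to its activity) and ignores bus arbitration and pulse timing; the function names `aer_encode` and `aer_decode` are hypothetical and are not taken from [1] - [3].

```python
from collections import Counter

def aer_encode(activity, n_steps):
    """Turn per-neuron activity (pulses per time step, between 0 and 1)
    into a time-ordered stream of addresses on a shared bus.

    Each neuron accumulates its activity and writes its own address
    whenever the accumulator reaches 1, so more active neurons appear
    on the bus more often. (Assumed rate code, no arbitration.)
    """
    acc = [0.0] * len(activity)
    bus = []
    for _ in range(n_steps):
        for addr, rate in enumerate(activity):
            acc[addr] += rate
            if acc[addr] >= 1.0:
                acc[addr] -= 1.0
                bus.append(addr)
    return bus

def aer_decode(bus, n_neurons, n_steps):
    """Receiver side: count the pulses per address to estimate activity."""
    counts = Counter(bus)
    return [counts[a] / n_steps for a in range(n_neurons)]

activity = [0.0, 0.1, 0.25, 0.5]
bus = aer_encode(activity, n_steps=400)
recovered = aer_decode(bus, len(activity), n_steps=400)
```

In this sketch the silent neuron (address 0) never occupies the bus, which is the bandwidth-saving property that AER relies on.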
As neuroscientists manage to unfold the internal structure and functions of the vision system, it becomes more feasible for mathematicians and computer scientists to propose and understand bio-inspired vision models and algorithms, and for engineers to build bio-inspired artificial vision systems. One powerful vision model proposed recently by Grossberg et al. [5] is the Boundary-Contour-System (BCS) and Feature-Contour-System (FCS) vision model. It consists of nine layers which, after local illumination normalization and contrast enhancement of an input image, perform local edge extraction for different spatial orientations and scales, and are then able to identify consistent long range contours of the shapes in the input image through processing layers with feedforward and feedback connections. In this vision model one of the stages performs a 2D filtering operation for edge extraction, and other stages perform other 2D filtering operations. The processing architecture proposed in this paper is intended to be used in this vision model to implement a simplified version of the edge-extraction filter. The same processing architecture can be reprogrammed to perform some of the other 2D filters needed in the BCS-FCS vision model.
The present paper is structured as follows. In the next Section we briefly describe the structure, functionality, and operations performed by the BCS-FCS vision model. In Section III we introduce a modification of the edge-extraction kernel which substitutes the product operation in the original kernel by a minimum operation. Section IV briefly describes the essence of AER, and in Section V we introduce a VLSI architecture capable of implementing a 2D programmable filter. Section VI provides system level behavioral simulation results of this architecture programmed with a kernel that extracts vertically oriented edges, and finally Section VII gives the conclusions and future work.
II. The Boundary-Contour-System and Feature-Contour-System Vision Model
Fig. 1 shows a schematic representation of the structure of the BCS-FCS model [5]. The BCS consists of several identical subsystems (three in the case of Fig. 1), each of which is tuned for a different spatial scale. Each BCS spatial subsystem consists of 8 layers. Consecutive layers have been drawn in Fig. 1 as connected by thick shaded arrows. We may think of these arrows as the representation of a convolution (or filter) operation applied to the state of the previous layer and resulting in the state of the next layer. For instance, the 2D input image undergoes three different filtering or convolutional operations, each of which is the starting point of a BCS subsystem. From here on, each BCS subsystem operates autonomously. From Layer 1 to Layer 3 there are only feedforward filtering operations, while Layers 4 to 8 are connected in a feedback loop configuration, which means the system will reach a steady state after a certain number of iterations (if the system is implemented sequentially on a computer) or after a certain time constant (if the system operates asynchronously and fully in parallel, as in biological brains). The outputs of Layers 1 and 5 of the three BCS subsystems are fed to the FCS. Next we briefly describe the processing performed on the different layers.
A. Stage 1: Center-ON OFF-Surround
Let us assume (1) is an N×M input image provided by a vision sensing front end. This input image is applied to a 2D filter whose impulsive response or kernel of radial symmetry is shown in Fig. 2(a) . We can see that pixels close to the center region of the kernel are going to contribute with positive weights to the convolution, while pixels further away will contribute negatively. The result of such a convolution is local illumination normalization and contrast enhancement. The mathematical expression for this kernel is (2) where , are positive parameters, , and controls the spatial scale of the filtering. In the case of Fig. 1 there are three BCS subsystems, which means three Center-ON OFF-Surround filters are applied in parallel to the same input image, each with a specific ( ). From now on the processing in each BCS subsystem is independent.
B. Stage 2: Simple Cells
The second stage of the BCS system applies an orientation-specific convolution for detecting edges oriented within a narrow angle range. This is performed by convolving the output of Layer 1 with the kernel shown in Fig. 2(b) for different orientations. This is why the output of Layer 1 in Fig. 1 undergoes several convolutions in parallel, one for each orientation, resulting in as many "Layer 2" outputs as orientations have been considered. The kernel of Fig. 2
where the coordinate system is rotated a certain angle with respect to the coordinate system of the input image ,
with being the total number of orientations to be considered.
C. Stage 3: Complex Cells
After applying the filtering of Stage 2, a pixel in Layer 2 for orientation k will display a high positive value if the input image presents a positive change in contrast with respect to the k-th orientation axis. If the change in contrast is negative, the output of this pixel would be a high negative value. In order to detect whether or not there is an edge at that orientation around the given pixel there is no need to distinguish between positive and negative values. Therefore, the purpose of this processing stage is simply to rectify the output of the previous one.
D. Stage 4: Hypercomplex Cells, Competition across Space
At Layer 3 for orientation k, pixels will present a positive value if around that pixel there is an edge at that orientation. The higher the pixel value, the clearer the edge was. At this stage, and independently for each orientation, a 2D Center-ON OFF-Surround filter is applied to contrast enhance the previous image. This is equivalent to performing a spatial competition among pixels, favoring those with higher values.
E. Stage 5: Hypercomplex Cells, Competition across Orientations
At this stage, all Layer 4 pixels at the same spatial position, but for all possible orientations k, compete among themselves to contrast enhance those orientations with higher pixel values. This is done by applying a 1D Center-ON OFF-Surround filter to pixels of Layer 4 at the same spatial position but for all k orientation values.
F. Stage 6: Bipole Cells, Long-Range Cooperation
The operation of this stage is the most complicated. It tries to identify long term "Contours", which can be defined as edges that remain consistent over larger space ranges. This is achieved by performing for each orientation k the following sum of convolutions, (5) where is the resulting state of pixel of Layer 6 for spatial scale g and orientation k, subscript r denotes orientation, and each convolution is given by (6) where denotes the state of pixel of Layer 5 for orientation r, denotes orientation perpendicular to r, and the kernel is defined by (7) with , , and being positive parameters. Fig. 2 
I. Stage 9: Feature Contour System (FCS)
The information about consistent long range contours can be taken from Layer 5 (once the feedback loop has settled), for all computed orientations. The FCS takes the original (local illumination normalized and contrast enhanced) image present in Layer 1 and performs a selective diffusion operation between pixels, using the contour information present at Layer 5: the contours at Layer 5 act as barriers to the diffusion operation. The result of all this processing is a clean noise-free image with clear and consistent long range contours.
III. An Edge-Extraction Filter
In the rest of this paper we will concentrate on Stages 2 and 3, the filter for edge-extraction and subsequent rectification. We will first introduce a simplification of the kernel of eq. (3) which will allow us to propose a very compact and efficient hardware that takes advantage of the AER as well.
The kernel of eq. (3) is decomposable into two factors, each of which depends only on either the x-coordinate or the y-coordinate,
with (9) The simplification proposed here consists in substituting the product of the horizontal and vertical components by their signed minimum. To evaluate quantitatively the effect of the proposed approximation we can use the Normalized Square Error (NSE), defined as,
This quantity helps us to evaluate the difference between the original kernel and the modified kernel obtained when the product operation is substituted by the signed minimum. Table 1 gives the computed NSE for several kernels. All kernels in Table 1 are decomposable into the product of two functions that depend separately on the x and y components.
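The substitution of the product by a signed minimum can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the kernel factors `H` and `V` below (a derivative-of-Gaussian and a Gaussian) are hypothetical stand-ins, not the kernel of eq. (3), and `nse` uses one common normalization (squared error divided by the energy of the reference kernel), which may differ from the definition used for Table 1.

```python
import math

def signed_min(h, v):
    """Signed minimum: sign of the product h*v, magnitude of the smaller factor."""
    sign = ((h > 0) - (h < 0)) * ((v > 0) - (v < 0))
    return sign * min(abs(h), abs(v))

def build_kernels(width, sigma):
    """Return (product kernel, signed-minimum kernel) on a (2*width+1)^2 grid.

    H and V are illustrative choices only (assumption, not eq. (3)).
    """
    coords = range(-width, width + 1)
    H = [(x / sigma) * math.exp(-x * x / (2 * sigma ** 2)) for x in coords]
    V = [math.exp(-y * y / (2 * sigma ** 2)) for y in coords]
    product = [[h * v for h in H] for v in V]
    approx = [[signed_min(h, v) for h in H] for v in V]
    return product, approx

def nse(reference, test):
    """Assumed normalization: sum of squared differences over reference energy."""
    num = sum((r - t) ** 2
              for ref_row, test_row in zip(reference, test)
              for r, t in zip(ref_row, test_row))
    den = sum(r * r for ref_row in reference for r in ref_row)
    return num / den

product, approx = build_kernels(width=5, sigma=2.0)
error = nse(product, approx)
```

Both kernels share the same sign at every position, so an edge detector built on either one responds with the same polarity; `error` quantifies how much the magnitudes differ.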
IV. Using Address Event Representation (AER)
Fig. 4 shows a schematic outlining the essence of AER. Suppose we have an "emitter" chip containing a large number of neurons or cells D1, D2, D3, ... whose activity changes in time with a "relatively slow" time constant. For example, if Chip 1 is a retina chip and each neuron's activity represents the illumination sensed by a pixel, the time constant with which this activity changes is, at most, equivalent to the frame rate (i.e., 25-30 changes per second, or a time constant of about 30-40 ms).
The purpose of an AER-based communication scheme is to reproduce the time evolution of each neuron's activity inside a second or "receiver" chip, using a fast digital bus with a small number of pins. In the "emitter" chip the activity of each pixel has to be transformed into a pulse stream signal such that the pulse width is minimal and the spacing between pulses is large enough to time multiplex the activity of a relatively large number of neurons. Every time a neuron produces a pulse, its address or code should be written on the bus. For the case in which several neurons produce pulses simultaneously, a classical arbitration tree can be introduced [1] - [3], or one based on Winner-Take-All (WTA) row-wise competitions [6], or collisions can be handled simply by letting no neuron access the bus when one occurs [7]. Whatever method is used, the result will be a continuous sequence of addresses or codes on the digital bus that one or more receiver chips can read. Each receiver chip must contain decoding circuitry so that a pulse reaches the neuron (or neurons) specified by the address read on the bus. If each neuron integrates the sequence of pulses properly, the original activity of the neurons in the emitter chip will be reproduced. Note that in AER the more active neurons access the bus more frequently. This property optimizes the use of the bus, since neurons with low activity do not consume much communication bandwidth. This is the simplest AER-based communication scheme among chips. However, AER makes it easy to add more complicated processing. For example, input images can be translated or rotated by remapping the addresses while they travel from one chip to the next. By properly programming an EEPROM as a look-up table and inserting it between the two chips, any address remapping can be implemented.
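The look-up-table remapping can be sketched as follows. This is a minimal Python illustration, assuming a square array of `n` × `n` addresses rotated about its center with nearest-neighbor rounding; `build_rotation_lut` and `remap` are hypothetical names, and addresses that rotate off the array are simply dropped.

```python
import math

def build_rotation_lut(n, angle_deg):
    """Precompute, for every (x, y) address, its rotated address.

    Plays the role of the EEPROM contents: one entry per input address.
    Entries that fall outside the n x n array are stored as None.
    """
    center = (n - 1) / 2.0
    angle = math.radians(angle_deg)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    lut = {}
    for y in range(n):
        for x in range(n):
            dx, dy = x - center, y - center
            xr = round(center + dx * cos_a - dy * sin_a)
            yr = round(center + dx * sin_a + dy * cos_a)
            inside = 0 <= xr < n and 0 <= yr < n
            lut[(x, y)] = (xr, yr) if inside else None
    return lut

def remap(bus, lut):
    """Translate a stream of addresses on the fly, one table look-up per event."""
    return [lut[addr] for addr in bus if lut[addr] is not None]

lut_90 = build_rotation_lut(5, 90)
rotated = remap([(3, 2), (2, 2)], lut_90)
```

Because the table is computed once and each event costs a single look-up, the rotation adds no arithmetic to the real-time path, which is the point of placing the memory between the two chips.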
Furthermore, many EEPROMs can be connected in parallel, each performing, for example, a rotation at a specific angle, and each delivering the remapped addresses to a set of specialized processing chips. It is also possible to include synaptic weighting by having the EEPROM store the weight value and dump it on a data bus, and having the "receiver" chip read both the address and the data bus and perform a weighted integration in the destination neuron(s). It is also possible to implement "projective fields", i.e. for every address that appears on the bus a small digital system could generate a sequence of addresses around it and send it to the "receiver" chip. This would be a time-multiplexed projection field generation. In the architecture proposed in this paper, we implement a synaptically weighted projection field for each address read on the bus, not in a time-multiplexed manner but in parallel. As we will see, the receiver chip will perform the following operations: for every address read on the bus it will send pulses to a bubble of neurons around that address. The width of those pulses is modulated according to some weights stored on chip. Time integration of those pulses for the complete array of neurons in the receiver chip implements a convolution operation. In the rest of the paper we will concentrate on describing the circuits able to implement such a convolutional or filtering chip.
V. System Design
Fig. 5 shows the basic operating principle of the proposed architecture. The address bus provides the coordinates of the neuron (or pixel) around which the kernel of eq. (10) should be applied. Pulses will be applied to all rows with y-coordinate in the interval , and all columns with x-coordinate in the interval , where is the width considered for the kernel.
Pulses will be modulated in width according to function (see eq. (8)) for the rows, and function for the columns. At each pixel there is an AND gate which provides a pulse of width equal to the minimum of and . This pulse will generate a fixed magnitude current pulse of the same width which will be integrated on a capacitor. Each pixel contains two integrators. One of them, called the "positive integrator", integrates the pulse of length when ; while the other, called the "negative integrator", integrates the pulse when
. The values of and ( ) are stored digitally on chip on a small RAM. Fig. 6 shows the floorplan diagram of the system. It consists of two input decoders that decode the address of the arriving pulses, a C-element required for the AER communication protocol [1] - [3] , an array of integrator cells , two sets of programmable monostables and whose pulse widths are controlled by the bits stored in two RAMs, and (which store the digital words and , respectively), two arrays of and selecting cells and , respectively, two output decoders to select the 
cells to be scanned, and a column of scanning circuits to read out the integrators' analog output current. Note that in the present prototype of Fig. 6 the system does not generate an AER output. This can be solved either by adding the necessary circuitry to each pixel [1] - [3], which will decrease the cell density, or by adding a postprocessing chip that sequentially scans all cells in the array of Fig. 6 and generates an AER output.
The operation of the system in Fig. 6 is as follows. In and digital words of bits are stored ( and ). The first bit (or ) indicates the sign of the function (or ). The following bits indicate the absolute value (or ). These bits linearly control the length of the pulse triggered by monostables (or ). The pulses generated by the monostables are sent through lines (or ) and are triggered whenever an external pulse arrives to the system (whenever signal Rqst pulses). When an external pulse arrives, the input decoders activate lines and corresponding to the address of the arriving pulse. The selection cells controlled by (cells in Fig. 6 where , , is the (lossy) integral over time of the number of pulses pixel is receiving, and is the fixed magnitude of the current pulses being integrated.
Similarly, the negative integrator accumulates charge when pulses arriving through horizontal and vertical lines of opposite sign and (or and ) are simultaneously high, that is, it performs the operation . Hence, along time it computes the following sum (13) Consequently, the difference between the outputs of the positive and negative integrators is given by, , (14) which is the filter operation we want to implement. In what follows we will describe the circuit components and operations of each block in Fig. 6 .
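The per-event operation described above can be sketched in Python. This is a simplified illustration, not the circuit model: it assumes lossless integrators (no leakage), takes the column lines to carry the horizontal factor `H` and the row lines the vertical factor `V` (an assumption), and uses hypothetical names (`apply_event`, `filter_output`) and example weights.

```python
def apply_event(pos, neg, x0, y0, H, V):
    """Project one incoming address (x0, y0) onto its neighborhood.

    Each cell receives a pulse whose width is min(|H|, |V|), the overlap
    (AND) of its row and column pulses. Same-sign lines feed the positive
    integrator, opposite-sign lines feed the negative one.
    """
    half = (len(H) - 1) // 2
    rows, cols = len(pos), len(pos[0])
    for j, v in enumerate(V):            # row lines
        y = y0 + j - half
        if not 0 <= y < rows:
            continue
        for i, h in enumerate(H):        # column lines
            x = x0 + i - half
            if not 0 <= x < cols:
                continue
            width = min(abs(h), abs(v))
            if width == 0:
                continue
            if (h > 0) == (v > 0):
                pos[y][x] += width
            else:
                neg[y][x] += width

def filter_output(pos, neg):
    """Difference between positive and negative integrators, cell by cell."""
    return [[p - n for p, n in zip(prow, nrow)] for prow, nrow in zip(pos, neg)]

H = [-1.0, 0.0, 1.0]   # hypothetical horizontal weights
V = [0.5, 1.0, 0.5]    # hypothetical vertical weights
pos = [[0.0] * 5 for _ in range(5)]
neg = [[0.0] * 5 for _ in range(5)]
apply_event(pos, neg, 2, 2, H, V)
out = filter_output(pos, neg)
```

Repeating `apply_event` for every address on the bus accumulates, at each cell, the signed-minimum kernel weighted by how often each neighbor fired.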
A. Communication Protocol: The C-element
To communicate properly between two chips, a communication protocol must be implemented [1] - [3]. In the AER scheme, the sender chip indicates when the address of a pulsing neuron is ready on the bus, and the receiver chip must acknowledge that the pulse has been received and that it is ready to receive a new pulse. Fig. 7 shows the timing diagram of a valid communication protocol for the two chips. The sender chip generates a request signal and the receiver generates an acknowledge signal. When the sender has put the address on the bus it pulls the request signal high. Once the receiver detects a high request it latches the received address and pulls the acknowledge signal high. The sender can now pull the request low and begin to process the following pulse. The receiver must wait until all the monostables have sent their pulses to the corresponding neurons and the request has gone low before pulling the acknowledge signal low. Once the acknowledge signal is low the sender can raise the request signal again. Fig. 8 shows the schematic of the cell used in the receiver chip to generate the acknowledge signal. This cell is known as a "C-element". It receives two input signals: the request signal generated by the sender system, and a signal which is the wired-NOR of all the monostable output pulses. The C-element generates the output acknowledge signal, which is sent back to the sender system. When no pulses are being received, the request is low. The wired-NOR signal is high, as no pulses are being generated by the monostables. Consequently, the acknowledge signal is low. When a valid address pulse arrives the sender pulls the request signal high. The rising edge of this signal triggers the monostables, so that the wired-NOR signal becomes low. Once the request is high and the wired-NOR signal is low, the C-element sets the acknowledge signal high, meaning that the pulse has been received. The high value of the acknowledge signal is used to latch the present bus address and the signal that triggers the monostables. Latching the address ensures that the corresponding neighborhood is kept selected until all monostables finish their pulses. Latching the trigger signal ensures that the monostable pulses do not end prematurely if the request goes low before they have finished. The C-element waits until the request goes low and all the monostable pulses have finished (wired-NOR signal high) before pulling the acknowledge signal low again. Once the acknowledge is low the sender is allowed to pull the request high again and a new communication cycle can begin.
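The handshake can be traced with a small behavioral sketch in Python. It is a discrete-time illustration under stated assumptions, not a model of the circuit of Fig. 8: the acknowledge is modeled as a Muller C-element whose inputs are the request and a `busy` flag (the complement of the wired-NOR signal), all monostable pulses are lumped into one duration `pulse_ticks`, and the names `CElement` and `handshake` are hypothetical.

```python
class CElement:
    """Muller C-element: the output follows the inputs when they agree,
    and holds its previous value when they differ."""

    def __init__(self):
        self.out = False

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out

def handshake(pulse_ticks):
    """Return a list of (request, busy, acknowledge) samples for one event."""
    if pulse_ticks < 1:
        raise ValueError("at least one monostable must produce a pulse")
    c = CElement()
    trace = [(False, False, c.update(False, False))]   # idle bus
    request, remaining = True, pulse_ticks             # rising edge triggers monostables
    while True:
        busy = remaining > 0
        ack = c.update(request, busy)
        trace.append((request, busy, ack))
        if request and ack:
            request = False      # sender withdraws the request once acknowledged
        if not request and not busy and not ack:
            return trace         # cycle complete, a new request may start
        remaining = max(0, remaining - 1)

trace = handshake(pulse_ticks=3)
```

In the resulting trace the acknowledge rises with the request, stays high after the request is withdrawn for as long as the monostables are active, and falls only when both are low.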
B. The Monostables
The schematic of a monostable with controlling bits is shown in Fig. 9 . Transistors and are equally sized, as well as transistors and . Switches are controlled by a digital n-bit word that set the capacitance connected to node . When no address is being received, and are both low and hence is also low. Transistor is cut off so that node is low. Node is also set low through those transistors with a high value. If all bits are low will always be high (by ) and no pulse will be generated. When an input pulse arrives signal and hence become high. As soon as goes high node goes high. Current begins to flow through the switch formed by transistors and charging node at the rate set by bits . When node reaches voltage value , the current through transistor becomes higher than the current supplied by so that the output node flips from high to low. The length of the pulse at is the time taken by current to charge node up to a voltage of . This time is given by (15) where is the total capacitance present at node and is set by the bits stored in the corresponding RAM word or . With this scheme, the length of the monostable pulses is linearly controlled between 0 and , with being the number of bits controlling each monostable pulse length, and the unit capacitance in Fig. 9 . Fig.  10 
D. The Core Cell
The schematic of cell of Fig. 6 is shown in Fig. 12. It consists of two diode-capacitor integrators [3]. The
positive integrator integrates the ANDED pulses that arrive in row and column lines with the same sign, that is, and (or and ). The negative integrator integrates the ANDED pulses that arrive in row and column lines of opposite sign, that is, and (or and ). Each diode-capacitor integrator consists of two transistors and ( and ) operating in the subthreshold region, a capacitor and a transistor (or ) acting as a current source of value (controlled by bias voltage ) with its source pulsed by the output of the NOR gate. The input and output currents and of the positive integrator are related to the voltage at node through the following differential equations (the treatment for the negative integrator would be the same for currents , and voltage ) [3] ,
where is the thermal voltage and , are model parameters of the MOS transistor operating in the subthreshold region. From eqs. (16) and (17) we can get an expression that relates the output and input currents of the diode ,
where (19) Note that current mirror gain A is controlled by voltage . During the time in which and (or and ) are simultaneously high, the source of transistor is low, and this transistor is acting as a current source sinking a constant current from the integration node . In this case in eq. (18).
Suppose that a train of pulses of constant frequency , pulse width and interspike interval (as depicted in Fig. 13) 
where the integrator time constant is given by . When the ANDED pulses are zero, the source of transistor becomes high and . If the pulses go low at time and stay low for an interspike time (see Fig. 13 ), the output current at time just before a new pulse is applied, is given by .
When pulses of width are applied at a constant frequency as shown in Fig. 13, a steady state is reached in which the charge injected by the diode during the inactive periods equals the charge sunk by the current source during the pulse. In this steady state, the two following equations must hold,
and .
Solving eqs. (20)- (23), ,
and the steady state ripple will be ,
where the assumption has been made. According to equation (24), each integrator outputs a current which is proportional to the frequency and width of the input pulses. Supposing the AER input image pixel intensity is linearly encoded as the frequency of the arriving pulses, and the convolutional kernel is encoded as the pulse width, the output current of the positive integrators would be the input image filtered with the positive terms of the filter. Equivalently, the negative integrator output currents would be the input image filtered with the negative terms of the filter. Hence, the result of subtracting the output current of the negative integrator from the output current of the positive one is the non-rectified filter output.
Fig. 12 . Transistor sizes are and , integrating capacitor is , pulse amplitude is , pulse width is , frequency of pulse stream is , and voltage was set to (which yields a current gain from transistor to of around 2000).
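The proportionality between steady-state output and the product of pulse frequency and pulse width can be checked numerically. The sketch below is deliberately simplified: it replaces the diode-capacitor dynamics with a first-order linear leaky integrator (an assumption made only for illustration) and integrates it with forward Euler steps; `mean_output` and all parameter values are hypothetical.

```python
def mean_output(freq_hz, width_s, amp=1.0, tau_s=5e-3, dt=1e-6):
    """Average output of a linear leaky integrator driven by a pulse train.

    Model (assumed): d(out)/dt = (inp - out) / tau, with inp = amp during
    each pulse and 0 otherwise. Returns the mean over 20 periods after
    letting the integrator settle for roughly ten time constants.
    """
    period = round(1.0 / (freq_hz * dt))        # steps per period
    on_steps = round(width_s / dt)              # steps with the pulse high
    settle = int(10 * tau_s / dt) // period * period
    measure = 20 * period
    out, total = 0.0, 0.0
    for k in range(settle + measure):
        inp = amp if (k % period) < on_steps else 0.0
        out += dt * (inp - out) / tau_s
        if k >= settle:
            total += out
    return total / measure

base = mean_output(1000.0, 1e-4)
double_freq = mean_output(2000.0, 1e-4)
double_width = mean_output(1000.0, 2e-4)
```

In this linear model, doubling either the frequency or the width doubles the mean output, which mirrors the roles the text assigns to pixel intensity (frequency) and kernel weight (width).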
To verify the operation of the diode-capacitor integrator a small prototype was integrated in a CMOS 2.5 double-poly single-metal technology. The sizes of transistors and of the integrator were set to and the total capacitance at node was approximately . To verify the linear dependence of the steady-state output current on the frequency of the arriving pulses and on the width of the pulses (see eq. (24)), the steady-state output current was measured while the frequency of the input pulse stream and the width of the pulses were swept separately. The results are shown in Fig. 15. Fig. 15(a) shows the measured steady-state current level as a function of frequency. During the measurement voltage was set to , the current bias and the width of the arriving pulses was . Fig. 15(b) shows the steady-state current level as a function of pulse width, while maintaining the frequency constant at , for and .
E. The Scan Out Cell
A random access scanning circuit can read the rectified output current of any cell selected by the Random Scan Bus of Fig. 6 . The output decoder X (see Fig. 6 ) selects a column i. When a column is not selected, the output currents and of all cells in that column flow to a line of constant voltage . If column is selected, currents and of all cells in these columns flow to lines and , respectively, of the scan out cell shown in Fig. 16 . Each scan out cell receives two input currents , and provides an output current . At the input of each scan out cell, current is mirrored through a PMOS current mirror and subtracted 
The precision of this current reflection depends on how tightly the source of is clamped to voltage . To achieve a good precision a high gain opamp is needed. If current is positive transistor sources this current, which is mirrored by transistor because its source is clamped to by the current comparator composed by transistors and OPAMP2. Therefore, the current through and is,
This current is again reflected by the PMOS transistor pair , . At the output node, the currents through transistors and are added together to get the rectified current . Since transistors operate in weak inversion, increasing the source voltage of transistors and with respect to will make the current mirrors and to have a gain higher than one (actually the gain will be exponentially controlled by this voltage difference). . Note that a current amplification of about 150 has been applied from the input to the output currents. This allows speeding up the current read out process.
VI. System Level Operation Behavioral Simulations
So far, electrical (Hspice) simulations and experimental measurements of some of the circuit components have been presented. However, to validate the functionality of the proposed architecture, some system level (behavioral) simulations are mandatory. In this section we provide such simulations using MATLAB on the architecture of Fig. 6 for a system of 128×128 cells. The input image fed to the system is shown in Fig. 18(a). Using MATLAB, the AER stream of addresses that this image could generate was computed. The stream of pulses flowing through the bus is characterized by a sequence where is the address present on the bus at time . This stream of addresses was then used to drive the mathematical model of the architecture of Fig. 6. Each one of the 128×128 cells is characterized by the state of two integrators: the positive integrator and the negative one . The state of the integrators is governed by the following differential equations (see eq. (18)) (28) whose solution is of the form given by eq. (20) (to compute its charging during the presence of a pulse) or by eq. (21) (to compute its discharge during the absence of pulses). These equations were used to update the state of the integrators in the following manner: for each address present on the bus all cells in the range were accessed. For each accessed cell the pulse width was computed using the approximation of eq. (10) and the simulation results of Fig. 10. Depending on the resulting sign, either the positive or the negative integrator was updated. After an integrator has been updated the present time is stored for it, so that the next time it needs to be updated the simulator can properly compute its discharge amount with eq. (21). For each cell its output is given by . Running this method until all integrators have reached their steady state within 1% tolerance results in the system output depicted in Fig. 18(b). In this case, addresses were not pre-rotated, so the system is extracting vertical edges. As can be seen, pixels around vertical edges result in a very high output value, while as the edge angle around a pixel deviates from vertical its output value smoothly decreases to zero.
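The bookkeeping trick of storing the last update time per integrator can be sketched in Python. This is not the MATLAB code used for Fig. 18; it is a minimal illustration that assumes a simple exponential leak with time constant `tau` in place of eqs. (20) and (21), and the names `LazyIntegrator` and `run_cell` are hypothetical.

```python
import math

class LazyIntegrator:
    """Leaky integrator that is only touched when a pulse arrives.

    The decay accumulated since the last update is applied on demand,
    so idle cells cost nothing between events. (Assumed exponential leak.)
    """

    def __init__(self, tau):
        self.tau = tau
        self.value = 0.0
        self.t_last = 0.0

    def read(self, t):
        """State at time t, including the discharge since the last update."""
        return self.value * math.exp(-(t - self.t_last) / self.tau)

    def pulse(self, t, width):
        """Apply the pending discharge, add the new pulse, store the time."""
        self.value = self.read(t) + width
        self.t_last = t

def run_cell(events, tau):
    """Process a time-sorted list of (time, signed_width) pulses for one cell
    and return positive minus negative integrator state at the last event."""
    pos, neg = LazyIntegrator(tau), LazyIntegrator(tau)
    for t, signed_width in events:
        if signed_width > 0:
            pos.pulse(t, signed_width)
        elif signed_width < 0:
            neg.pulse(t, -signed_width)
    t_end = events[-1][0] if events else 0.0
    return pos.read(t_end) - neg.read(t_end)

integ = LazyIntegrator(tau=1.0)
integ.pulse(0.0, 1.0)
integ.pulse(1.0, 1.0)
cell_out = run_cell([(0.0, 1.0), (1.0, -0.5)], tau=1.0)
```

With this scheme the cost of the simulation scales with the number of events times the kernel area, rather than with the number of cells times the number of time steps.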
VII. Conclusions and Future Work
An architecture that implements a pseudo-Gabor filter for edge extraction has been presented. The architecture can implement any 2D filter F(p,q) decomposable into x-axis and y-axis components such that their product can be approximated by a signed minimum. Positive and negative values of both components can be programmed. The architecture requires an AER input, which allows the 2D convolution kernel to be rotated by any angle.
A VLSI circuit implementation that realizes the proposed architecture is provided. Circuit simulation results and experimental measurements of critical components were given. System-level behavioral simulations of a 128×128 array have been included which validate the proposed approach. Cell size is if no AER output is available and if AER output is included, for a double-poly double-metal CMOS process. This would allow, for a die, to implement a 2D filter with approximately pixels for no AER output, and pixels if AER output is provided.
Future work includes the fabrication of a test prototype, its interface to a retina chip with AER output, and the implementation of more processing layers of the vision model system described in Section II. Note that the present architecture can be used to implement the processing of Stages 4, 5, 7, and 8 as well. For the implementation of Stage 6 the present architecture can be used if it is possible to substitute the kernel of eq. (7) by another one decomposable into horizontal and vertical components and if the product can be approximated by a signed minimum. 
VIII. References
