# **Bio-Inspired Analog Parallel Array Processor Chip with Programmable Spatio-Temporal Dynamics**

R. Carmona, F. Jiménez-Garrido, R. Domínguez-Castro, S. Espejo, A. Rodríguez-Vázquez Instituto de Microelectrónica de Sevilla. IMSE-CNM-CSIC Avda. Reina Mercedes s/n 41012 Sevilla (SPAIN) Tel.:+34955056666, Fax: +34955056686 E-mail: rcarmona@imse.cnm.es

# Abstract<sup>1</sup>

A bio-inspired model for an analog parallel array processor (APAP), based on studies of the vertebrate retina, permits the realization of complex spatio-temporal dynamics in VLSI. This model mimics the way in which images are processed in the natural visual pathway, which makes it a feasible alternative for the implementation of early vision tasks in standard technologies. A prototype chip has been designed and fabricated in $0.5\mu$m CMOS. Design challenges, trade-offs and the building blocks of such a high-complexity system ($0.5 \times 10^6$ transistors, most of them operating in analog mode) are presented in this paper.

#### 1. Bio-inspired APAP model

#### 1. 1. Sketch of the biological retina

The vertebrate retina has the structure displayed in Fig. 1. A first layer of photodetectors at the top, the cone cells, captures light and converts it to activation signals [1]. Bipolar cells carry these signals across the layers to the ganglion cells, which interface the retina with the optic nerve, over a path of several micrometers. The ganglion cells convert the continuous activation signals, characteristic of the retina, to pulse-like action potentials that can be transmitted over longer distances by the nervous system. On the way to the ganglion cells, the information carried by bipolar cells is affected by the operation of horizontal and amacrine cells. These form layers in which signals are weighted and averaged to bias the photodetectors and to inhibit the vertical pathway. Patterns of activity are formed dynamically by the presence or absence of visual stimuli. Inhibition is transmitted laterally through these layers of cells.

<sup>1</sup> This work has been partially funded by DICTAM IST-1999-19007 and CICYT TIC1999-0826.

There are, in this description, some interesting aspects of each retinal layer that markedly resemble the characteristics of Cellular Neural Networks (CNNs) [2]: 2D aggregations of continuous signals, local connectivity between elementary nonlinear processors, and analog weighted interactions between them.

#### 1. 2. CNN-based analogy

Based on measurements of the response of the inner and outer plexiform layers of the retina, a complex-cell CNN-based chip has been proposed [3]. This 2nd-order 3-layer CNN cell consists of 2 CNN layers coupled by inter-layer weights and an additional layer incorporating analog arithmetic to combine the outputs of the dynamically linked layers (Fig. 2). The cells in the first two layers have a first-order core, while the third layer, which can also be modeled in this way, has much faster dynamics $(\tau_3 \ll \tau_1, \tau_2)$. Complex dynamics can be programmed by adjusting the intra- and inter-layer coupling strengths. The evolution law of each layer node in the cell $C(i, j)$ is given by two coupled differential equations, one for each layer $n = 1, 2$, with $m \neq n$ denoting the other layer:

$$\tau_n \frac{dx_{n,ij}(t)}{dt} = -g[x_{n,ij}(t)] + b_{nn,00} \cdot u_{n,ij} + z_{n,ij} + \sum_{k=-r_n}^{r_n} \sum_{l=-r_n}^{r_n} a_{nn,kl} \cdot y_{n,(i+k)(j+l)} + a_{nm} \cdot y_{m,ij}$$
(1)

where the nonlinear loss term and the output function in each layer are those of the FSR CNN model [4] (in Eq. (2), $m$ denotes the slope of the outer segments, not the layer index of Eq. (1)):

$$g(x_{n,ij}) = \lim_{m \to \infty} \begin{cases} m(x_{n,ij} - 1) + 1 & \text{if } x_{n,ij} > 1 \\ x_{n,ij} & \text{if } |x_{n,ij}| \le 1 \\ m(x_{n,ij} + 1) - 1 & \text{if } x_{n,ij} < -1 \end{cases}$$
(2)

and 
$$y_{n,ij} = f(x_{n,ij}) = \frac{1}{2}(|x_{n,ij}+1| - |x_{n,ij}-1|)$$
 (3)
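The dynamics described by Eqs. (1)-(3) can be explored numerically. The following is a minimal sketch, not the authors' simulator: it integrates both layers with forward Euler, assumes zero-valued boundary cells, and models the FSR limit ($m \to \infty$) by clipping the state to $[-1, 1]$, where the loss term reduces to $x$. All function names, dictionary keys and numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np


def f_out(x):
    """Output function of Eq. (3): unity-gain saturation to [-1, 1]."""
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))


def neighbourhood_sum(y, a):
    """Sum over k, l of a[k, l] * y[i+k, j+l], with zero-valued boundary cells."""
    r = a.shape[0] // 2
    rows, cols = y.shape
    padded = np.pad(y, r, mode="constant")
    total = np.zeros_like(y, dtype=float)
    for k in range(-r, r + 1):
        for l in range(-r, r + 1):
            total += a[k + r, l + r] * padded[r + k:r + k + rows, r + l:r + l + cols]
    return total


def euler_step(x1, x2, u1, u2, z1, z2, p, dt):
    """One forward-Euler step of Eq. (1) for both layers (n = 1, 2)."""
    y1, y2 = f_out(x1), f_out(x2)
    # Inside [-1, 1] the loss term g(x) of Eq. (2) reduces to x.
    dx1 = (-x1 + p["b11"] * u1 + z1
           + neighbourhood_sum(y1, p["A11"]) + p["a12"] * y2) / p["tau1"]
    dx2 = (-x2 + p["b22"] * u2 + z2
           + neighbourhood_sum(y2, p["A22"]) + p["a21"] * y1) / p["tau2"]
    # The m -> infinity branches of Eq. (2) act as a hard wall: clip the state.
    return (np.clip(x1 + dt * dx1, -1.0, 1.0),
            np.clip(x2 + dt * dx2, -1.0, 1.0))


def simulate(x1, x2, u1, u2, z1, z2, p, dt, steps):
    """Repeat euler_step for a fixed number of steps and return both states."""
    for _ in range(steps):
        x1, x2 = euler_step(x1, x2, u1, u2, z1, z2, p, dt)
    return x1, x2


if __name__ == "__main__":
    shape = (32, 32)  # array size of the prototype
    rng = np.random.default_rng(0)
    # Hypothetical templates: weak diffusive coupling within each layer.
    diffusion = np.array([[0.0, 0.1, 0.0], [0.1, 0.6, 0.1], [0.0, 0.1, 0.0]])
    params = {"tau1": 1.0, "tau2": 4.0, "b11": 1.0, "b22": 0.0,
              "a12": -0.5, "a21": 0.5, "A11": diffusion, "A22": diffusion}
    u = rng.uniform(-1.0, 1.0, shape)
    zeros = np.zeros(shape)
    x1, x2 = simulate(zeros, zeros, u, zeros, zeros, zeros, params, 0.05, 200)
    print("layer means:", x1.mean(), x2.mean())
```

Clipping the state is one common way to emulate the hard limiter in discrete time; a finite but large slope $m$ would be an alternative at the cost of a much smaller integration step.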



Fig. 2. Diagram of the 2nd-order CNN.

#### 2. APAP architecture

#### 2. 1. Prototype chip floorplan

The proposed chip consists of an APAP of $32 \times 32$ identical cells (as can be seen in the microphotograph of Fig. 6). The array is surrounded by boundary-condition circuits for the CNN dynamics. There are also an I/O interface, a timing and control unit and a program memory. The I/O interface consists of a serializing-deserializing analog multiplexer. The program memory is composed of 24 SRAM blocks of 64 bytes each (1.5kB in total): 1kB is dedicated to the analog program and 0.5kB to the logic program. In addition, the analog instructions and reference signals need to be transmitted to every cell in the network in the form of analog voltages. Thus, a bank of D/A converters interfaces the analog program memory with the processing array. Distributing analog references across large distances within a chip is not a trivial task. Apart from the problems derived from electromagnetic interference, voltage drops in long metal lines carrying currents can be quite noticeable. Thus, signal buffering and low-resistance paths must be provided to avoid them. Finally, the timing unit is composed of an internal clock/counter and a set of FSMs that generate the internal signals that enable image uploading/downloading and program memory accesses.

#### 2. 2. Basic cell scheme

The elementary processor of the CNN-based APAP includes two coupled continuous-time CNN cores (Fig. 3), one for each of the two layers of the network. The synaptic connections between processing elements of the same or different layers are represented by arrows in the diagram. The basic processor also contains a programmable local logic unit (LLU) and local analog and logic memories (LAMs and LLMs) to store intermediate results. All the blocks in the cell communicate via an intra-cell data bus, which is multiplexed to the array I/O interface. Control bits and switch configuration are passed to the cell directly from the global programming unit.

The internal structure of each CNN core is depicted in the diagram of Fig. 4. Each core receives contributions from the rest of the processing nodes in the neighbourhood, which are summed and integrated in the state capacitor. The two layers differ in that the first layer has a scalable time constant, controlled by the appropriate binary code, while the second layer has a fixed time constant. The evolution of the state variable is also driven by self-feedback and by the feedforward action of the stored input and bias patterns. A voltage limiter implements the FSR CNN model. The state variable is transmitted in voltage form to the synaptic blocks, in the periphery of the cell, where the weighted contributions to the neighbours are generated. There is also a current memory that is employed to cancel the offset of the synaptic blocks. Initialization of the state, input and/or bias voltages is done through a mesh of multiplexing analog switches that connect to the cell's internal data bus.

#### 3. Analog building blocks for the basic cell

#### 3. 1. Single-transistor synapse

The synapse is a four-quadrant analog multiplier. Its inputs are the cell state or input and the weight voltages, while its output is the cell's current contribution to a neighbouring cell. It can be realized with a single transistor biased in the ohmic region [5]. For a PMOS with gate voltage $V_X = V_{x_0} + V_x$, and the p-diffusion terminals at $V_W = V_{w_0} + V_w$ and $V_{w_0}$, the drain-to-source current is:

$$I_{o} = -\beta_{p}V_{w}V_{x} - \beta_{p}V_{w}\left(V_{x_{0}} + \left|\hat{V}_{T_{p}}\right| - V_{w_{0}} - \frac{V_{w}}{2}\right)$$
(4)



which is a four-quadrant multiplier with an offset term that is time-invariant, at least during the evolution of the network, and independent of the cell state. This offset can be eliminated in a calibration step, with the help of a current memory.

Fig. 4. Internal structure of each CNN layer node.
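The split of Eq. (4) into a product term and a state-independent offset can be checked numerically. The sketch below assumes the textbook square-law expression for a PMOS in the ohmic region, with the two diffusion terminals at $V_{w_0} + V_w$ and $V_{w_0}$ and the current counted from the first terminal to the second. The device parameters, bias voltages and function names are hypothetical and serve only as an illustration.

```python
def pmos_ohmic_current(v_gate, v_a, v_b, beta_p, vt_abs):
    """Square-law ohmic-region current flowing from terminal a to terminal b."""
    v_ab = v_a - v_b
    return beta_p * ((v_a - v_gate - vt_abs) * v_ab - 0.5 * v_ab ** 2)


def synapse_terms_eq4(v_x, v_w, v_x0, v_w0, beta_p, vt_abs):
    """Return the (product, offset) terms of Eq. (4)."""
    product = -beta_p * v_w * v_x
    offset = -beta_p * v_w * (v_x0 + vt_abs - v_w0 - 0.5 * v_w)
    return product, offset


# Hypothetical device parameters and bias point (not taken from the chip).
BETA_P = 20e-6   # A/V^2
VT_ABS = 0.9     # V
V_X0 = 1.0       # V
V_W0 = 3.0       # V


def check(v_x, v_w):
    """Compare the direct device current with the two terms of Eq. (4)."""
    direct = pmos_ohmic_current(V_X0 + v_x, V_W0 + v_w, V_W0, BETA_P, VT_ABS)
    product, offset = synapse_terms_eq4(v_x, v_w, V_X0, V_W0, BETA_P, VT_ABS)
    return direct, product, offset


if __name__ == "__main__":
    for v_x in (-0.3, 0.3):
        for v_w in (-0.2, 0.2):
            direct, product, offset = check(v_x, v_w)
            print(f"Vx={v_x:+.1f} V, Vw={v_w:+.1f} V: "
                  f"I={direct:.3e} A, product={product:.3e} A, offset={offset:.3e} A")
```

For a fixed weight $V_w$ the offset term does not change with $V_x$, which is why a single calibration measurement taken at $V_x = 0$ is enough to remove it.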

#### 3. 2. Current conveyor and level shifting

For the synapse to operate properly, the input node of the CNN core must be kept at a constant voltage, regardless of the current injected into it. This is achieved by a current conveyor (Fig. 5). Any difference between the voltage at node $D$ and the reference $V_{w_0}$ is amplified, and the negative feedback corrects the deviation. Notice that a voltage offset in the amplifier results in an error of the same order. With the offset cancellation mechanism of Fig. 5, the current injected into the load is offset-free:

$$I_{L} = I_{o} + I_{mem} - I_{b} = g_{m}v_{d}$$
 (5)

#### 3. 3. S<sup>3</sup>I current memory

As mentioned above, the offset term of the synapse current must be removed for its output current to represent the result of a four-quadrant multiplication. For this purpose, all the synapses are reset to $V_X = V_{x_0}$. Then the resulting current, which is the sum of the offset currents of all the synapses concurrently connected to the same node, is memorized. This value is subtracted on-line from the input current when the CNN loop is closed, resulting in a one-step cancellation of the errors of all the synapses. The validity of this method relies on the accuracy of the current memory. For instance, in this chip, the sum of all the contributions will range, for the applications for which it has been designed, from $18\mu A$ to $46\mu A$. On the other hand, the maximum signal to be handled is $1\mu A$. If a signal resolution of 8b is intended, then 0.5LSB $\approx$ 2nA. Thus, the current memory must be able to distinguish 2nA out of $46\mu A$, which represents an equivalent resolution of 14.5b. In order to achieve such an accuracy level, an $S^{3}I$ current memory is used. It is composed of three stages (Fig. 5), each one consisting of a switch, a capacitor and a transistor. $I_{B}$ is the current to be memorized. After memorization, the only error left corresponds to the last stage; the earlier stages do not contribute to the error in the memorized current. If the $S^{3}I$ block is designed so as to store the most significant bits in the first capacitor and the least significant bits in the last one, the error can be made quite small.
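The resolution figures quoted above follow from simple arithmetic, reproduced in the sketch below. The second part is a purely behavioural illustration of why a multi-stage memory helps. It assumes, as a simplification not stated in the text, that each stage memorizes the residual left by the previous stages with a fixed relative error. The function name `s3i_memorize` and the 1% error figure are hypothetical.

```python
import math

I_SIGNAL_MAX = 1e-6    # maximum signal current (A), from the text
I_OFFSET_MAX = 46e-6   # largest summed offset current (A), from the text
BITS = 8               # target signal resolution

lsb = I_SIGNAL_MAX / 2 ** BITS                  # about 3.9 nA
half_lsb = lsb / 2                              # about 1.95 nA, rounded to 2 nA in the text
equiv_bits = math.log2(I_OFFSET_MAX / 2e-9)     # about 14.5 bits


def s3i_memorize(i_in, rel_err, stages=3):
    """Behavioural model: each stage stores the remaining residual imperfectly."""
    stored = 0.0
    for _ in range(stages):
        residual = i_in - stored
        # Assumption: a stage captures its input up to a relative error rel_err.
        stored += residual * (1.0 - rel_err)
    return stored


if __name__ == "__main__":
    print(f"LSB = {lsb * 1e9:.2f} nA, 0.5 LSB = {half_lsb * 1e9:.2f} nA")
    print(f"equivalent resolution = {equiv_bits:.2f} bits")
    for n in (1, 2, 3):
        err = I_OFFSET_MAX - s3i_memorize(I_OFFSET_MAX, 0.01, stages=n)
        print(f"{n} stage(s): residual error = {err:.3e} A")
```

Under this simplified model the residual error shrinks geometrically with the number of stages, so only the error of the last stage, acting on a small residual, remains in the memorized current.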

#### 3. 4. Time-constant scaling

The differential equation that governs the evolution of the network (1) can be written as a sum of current contributions injected into the state capacitor. Scaling this sum of currents up/down is equivalent to scaling the capacitor and, thus, to speeding up/slowing down the network dynamics. Therefore, scaling the input current with the help of a current mirror, for instance, has the effect of scaling the time constant. A circuit for continuously adjusting the current gain of a mirror can be designed based on a regulated-cascode current mirror in the ohmic region. However, the strong dependence of ohmic-region-biased transistors on the power rail voltage causes mismatches in $\tau$ between cells in the same layer. An alternative is a binary programmable current mirror. It trades resolution in $\tau$ for robustness; hence, the mismatch between the time constants of the different cells is considerably attenuated.
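The equivalence between current scaling and capacitor scaling can be made explicit with a short derivation. This is a sketch under the assumption that all the current contributions of Eq. (1), including the loss term, pass through the same mirror. The symbols $C$ (state capacitance), $I_k$ (individual current contributions) and $\kappa$ (mirror gain) are introduced here for illustration only.

$$C\,\frac{dx}{dt} = \kappa \sum_k I_k(t) \quad\Longleftrightarrow\quad \frac{C}{\kappa}\,\frac{dx}{dt} = \sum_k I_k(t) \quad\Longrightarrow\quad \tau_{\mathrm{eff}} = \frac{\tau}{\kappa}$$

For example, a gain of $\kappa = 1/16$ would stretch a time constant $\tau$ to $16\tau$, while $\kappa = 1$ leaves it unchanged.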

A new problem arises, though, because of current scaling. If the input current can be reshaped to a 16 times smaller waveform, then the current memory has to operate over both the larger and the smaller signals. But, if designed to operate on large currents, the current memory will not work for the tiny currents of the scaled version of the input. If it is designed to run on small input currents, long transistors will be needed, and the operation will be unreliable for the larger currents. One way of avoiding this situation is to make the S<sup>3</sup>I memory work on the original unscaled version of the input current. In that case, the adjustable-time-constant CNN core will be a current conveyor, followed by the S<sup>3</sup>I current memory and then the binary weighted current mirror. The problem now is that the offsets introduced by the scaling block add to the signal, and the required accuracy levels can be lost. Our proposal is depicted in Fig. 5. It consists of placing the scaling block (programmable mirror) between the current conveyor and the current memory. In this way, any offset error will be cancelled at the auto-zeroing phase. In the picture, the voltage reference generated with the current conveyor, the regulated-cascode current mirrors and the S<sup>3</sup>I memory can be easily identified. The inverter, $A_i$, driving the gates of the transistors of the current memory is required for stability.

Fig. 5. Input block with current scaling, S<sup>3</sup>I memory and offset-corrected OTA schematic.

#### 4. Experimental results

A prototype chip has been designed and fabricated in a $0.5\mu$m single-poly triple-metal CMOS technology. Its dimensions are $9.27 \times 8.45$ mm<sup>2</sup> (Fig. 6). The cell density achieved is 29.24 cells/mm<sup>2</sup>. The time constant of the layers is around 100ns (unscaled). The programmable dynamics of the chip permit the observation of different phenomena, such as wave propagation and pattern generation. By controlling the network dynamics and combining the results with the help of the built-in local logic and arithmetic operators, rather involved image processing tasks can be programmed [3]. Fig. 7 depicts the propagation of a travelling wave obtained from the first functional tests of the prototype.

#### 5. Conclusions

The proposed approach is a promising alternative to conventional digital image processing for applications related to early vision and low-level focal-plane image processing. Based on a simple but precise model of part of the real biological system, a feasible and efficient implementation of an artificial vision device has been designed.

Fig. 6. Microphotograph of the prototype chip.

#### References

- [1] F. Werblin, "Synaptic Connections, Receptive Fields and Patterns of Activity in the Tiger Salamander Retina", *Investigative Ophthalmology and Visual Science*, Vol. 32, No. 3, pp. 459-483, March 1991.
- [2] F. Werblin et al., "The Analogic Cellular Neural Network as a Bionic Eye", *IJCTA*, Vol. 23, No. 6, pp. 541-69, November-December 1995.
- [3] Cs. Rekeczky et al., "A Stored Program 2nd Order/3-Layer Complex Cell CNN-UM", Proc. of the 6th IEEE WCNNA, pp. 219-224, Catania, Italy, May 2000.
- [4] S. Espejo et al., "Convergence and Stability of the FSR CNN Model", Proc. of the 3rd IEEE WCNNA, pp. 411-417, Rome, December 1994.
- [5] R. Domínguez-Castro et al., "Four-Quadrant One-Transistor Synapse for High Density CNN Implementations", Proc. of the 5th IEEE WCNNA, pp. 243-248, London, UK, April 1998.

Fig. 7. Evolution of the 2 layers implementing travelling wave dynamics. Panel: 1st layer (slow); axis: t.