CMOS realization of a 2-layer CNN universal machine chip by Carmona Galán, Ricardo et al.
CMOS REALIZATION OF A 2-LAYER CNN UNIVERSAL MACHINE CHIP 
R. CARMONA, F. JIMÉNEZ-GARRIDO, R. DOMÍNGUEZ-CASTRO,
S. ESPEJO AND A. RODRÍGUEZ-VÁZQUEZ
Instituto de Microelectrónica de Sevilla-CNM-CSIC, Avda. Reina Mercedes s/n,
41012 Sevilla (SPAIN). Tel.: +34 955056666, Fax: +34 955056686.
E-mail: rcarmona@imse.cnm.es
Some of the features of the biological retina can be modelled by a cellular neural network (CNN)
composed of two dynamically coupled layers of locally connected elementary nonlinear processors. In
order to explore the possibilities of these complex spatio-temporal dynamics in image processing, a
prototype chip has been developed implementing this CNN model with analog signal processing
blocks. This chip has been designed in a 0.5 µm CMOS technology. Design challenges, trade-offs and
the building blocks of such a high-complexity system (0.5 × 10^6 transistors, most of them operating
in analog mode) are presented in this paper.
1 CNN-UM chip architecture 
1.1 CNN-based analogy of the biological retina
The vertebrate retina is composed of several layers of horizontal and amacrine cells [1].
These layers, coupled by means of bipolar cells, end, on one side, in a layer of photodetectors
and, on the other, in a layer of ganglion cells. The photodetectors capture the visual
stimuli and translate them into activation patterns. The ganglion cells, at the other end of the
retina, convert the continuous activation signals into pulse-like action potentials that
can be transmitted over longer distances by the nervous system. The activation signals in
the retina are weighted and averaged to bias the photodetectors and to inhibit the vertical
pathway. Patterns of activity are formed dynamically by the presence or absence of visual
stimuli. This description has clear similarities with CNNs [2]: not only in the
topology, but also in the 2D aggregations of continuous signals, the local connectivity
between elementary nonlinear processors and the analog weighted interactions between
them. Motivated by these coincidences, a model for the operations of the biological retina
based on CNNs has been developed [3]. It contains two coupled CNN layers plus an additional
layer incorporating analog arithmetic to combine the outputs of the dynamically
linked layers. This can be realized by a CNN Universal Machine (CNN-UM) [4]
architecture in which each cell contains two first-order cores, common local analog and
logic memories (LAMs and LLMs) and common logic and communication units (LLU
and LAOU). The evolution law of each cell, $C(i,j)$, is given by two coupled equations:
$$
\begin{aligned}
\tau_1 \frac{dx_{1,ij}(t)}{dt} &= -g[x_{1,ij}(t)] + b_{11,00}\,u_{1,ij} + z_{1,ij} + \sum_{k=-r_1}^{r_1}\sum_{l=-r_1}^{r_1} a_{11,kl}\,y_{1,(i+k)(j+l)} + a_{12}\,y_{2,ij} \\
\tau_2 \frac{dx_{2,ij}(t)}{dt} &= -g[x_{2,ij}(t)] + b_{22,00}\,u_{2,ij} + z_{2,ij} + \sum_{k=-r_2}^{r_2}\sum_{l=-r_2}^{r_2} a_{22,kl}\,y_{2,(i+k)(j+l)} + a_{21}\,y_{1,ij}
\end{aligned}
\tag{1}
$$
a. This work has been supported by ESPRIT V Project IST-1999-19007 and by ONR-NICOP Grant N-00014-00-1-0429,
and the Spanish CICYT Project TIC-1999-0826.
where the nonlinear loss term and the output function in each layer are those of the
Full-Signal-Range (FSR) CNN model [5], which, by limiting the cell state voltage,
allows the state and the output to be identified:

$$
g(x_{n,ij}) = \lim_{m \to \infty}
\begin{cases}
m\,x_{n,ij} & \text{if } x_{n,ij} > 1 \\
x_{n,ij} & \text{if } |x_{n,ij}| \le 1 \\
-m\,|x_{n,ij}| & \text{if } x_{n,ij} < -1
\end{cases}
\tag{2}
$$
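As an illustration of Eq. (2), the following Python sketch integrates a single, uncoupled cell with forward Euler. It is not part of the chip design: the function names, the drive values, the finite slope `m` and the step size are assumptions, chosen only to show that a steep loss term keeps the state inside the unit interval and that the limit $m \to \infty$ behaves like a hard clamp of the state.

```python
def g(x, m):
    """Loss term of Eq. (2) for a finite slope m (the model takes m -> infinity)."""
    if x > 1.0:
        return m * x
    if x < -1.0:
        return -m * abs(x)
    return x


def settle(drive, m, tau=1.0, dt=1e-4, steps=50000):
    """Forward-Euler integration of tau*dx/dt = -g(x) + drive for one isolated cell."""
    x = 0.0
    for _ in range(steps):
        x += dt / tau * (-g(x, m) + drive)
    return x


def settle_clamped(drive, tau=1.0, dt=1e-4, steps=50000):
    """Same cell, with the m -> infinity limit modelled as a hard clamp to [-1, 1]."""
    x = 0.0
    for _ in range(steps):
        x += dt / tau * (-x + drive)
        x = max(-1.0, min(1.0, x))
    return x


if __name__ == "__main__":
    for drive in (3.0, -3.0, 0.5):
        print(drive, settle(drive, m=100.0), settle_clamped(drive))
```

With a drive of magnitude 3 the finite-slope version stays within about 0.01 of the limit, while the clamped version sits exactly on it. With a drive of 0.5 both versions coincide, because the state never leaves the linear region.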
1.2 Prototype chip floorplan
The proposed chip consists of an analog parallel array processor (APAP) of 32 × 32 identical
cells (Fig. 4). It is surrounded by the circuits implementing the boundary conditions
for the CNN dynamics. There is also an I/O interface, a timing and control unit and a
program memory. The I/O interface consists of a serializing-deserializing analog multiplexer.
The program memory is composed of 24 SRAM blocks of 64 bytes each:
1 kB is dedicated to the analog program and 0.5 kB to the logic program. In addition, the
analog instructions and reference signals need to be transmitted to every cell in the network
in the form of analog voltages. Thus, a bank of D/A converters interfaces the analog
program memory with the processing array. Finally, the timing unit is composed of an
internal clock/counter and a set of finite-state machines that generate the internal signals
that enable image up/downloading and program memory accesses.
1.3 Basic cell scheme 
The elementary processor of the CNN includes two coupled continuous-time cores
(Fig. 1(a)). Each one belongs to one of the two different layers of the network. The synaptic
connections between processing elements of the same or different layers are represented
by arrows in the diagram. The basic processor also contains the LLU, and the LAMs and
LLMs to store intermediate results. All the blocks in the cell communicate via an intra-cell
data bus, which is multiplexed to the array I/O interface. Control bits and switch configuration
are passed to the cell directly from the global programming unit.
The internal structure of each CNN core is depicted in the diagram of Fig. 1(b). Each
core receives contributions from the rest of the processing nodes in the neighbourhood,
and these contributions are summed and integrated in the state capacitor. The two layers
differ in that the first layer has a scalable time constant, controlled by the appropriate
binary code, while the second layer has a fixed time constant. The evolution of the state
variable is also driven by self-feedback and by the feedforward action of the stored input
and bias patterns. There is a voltage limiter for implementing the FSR CNN model. The
state variable is transmitted in voltage form to the synaptic blocks, in the periphery of the
cell, where the weighted contributions to the neighbours are generated. There is also a current
memory that is employed for cancelling the offset of the synaptic blocks.
Figure 1. (a) Conceptual diagram of the basic cell and (b) the CNN layers' nodes.
Initialization of the state, input and/or bias voltages is done through a mesh of multiplex- 
ing analog switches that connect to the cell’s internal data bus. 
2 Analog building blocks for the basic cell 
2.1 Single-transistor synapse 
The synapse is a four-quadrant analog multiplier. Its inputs are the cell state ($V_x$),
identified with the cell output in the FSR model, or the cell input, and the weight voltage ($V_w$),
while the output ($I_o$) is the cell's current contribution to a neighbouring cell. It can
be realized by a single transistor biased in the ohmic region [6]. For a PMOS with gate voltage
$V_X = V_{x0} + V_x$, and the p-diffusion terminals at $V_W = V_{w0} + V_w$ and $V_{w0}$ (where
$V_{x0}$ and $V_{w0}$ are the reference central values for the state and weight voltages, which allow
the signals $V_x$ and $V_w$ to take either sign), the drain-to-source current is:

$$
I_o = -\beta_p V_w V_x - \beta_p V_w \left( V_{x0} + |V_{Tp}| - V_{w0} - \frac{V_w}{2} \right)
\tag{4}
$$

which is a four-quadrant multiplier with an offset term that is time-invariant, at least during
the evolution of the network, and does not depend on the cell state. This offset can
be eliminated by a calibration step, with the help of a current memory.
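The decomposition into a product term and a state-independent offset can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the textbook square-law expression for a MOS transistor in the ohmic region, and the values of `beta_p`, `v_x0`, `v_w0` and `v_tp` are hypothetical, picked so that the device stays in the ohmic region for signals of a few hundred millivolts. They are not taken from the chip.

```python
BETA_P = 50e-6   # hypothetical transconductance factor, A/V^2
V_X0 = 1.0       # hypothetical reference level for the state voltage, V
V_W0 = 2.5       # hypothetical reference level for the weight voltage, V
V_TP = 0.8       # hypothetical magnitude of the PMOS threshold voltage, V


def synapse_current(v_x, v_w):
    """Square-law ohmic-region current for gate V_x0+v_x and terminals V_w0+v_w, V_w0."""
    v_gate = V_X0 + v_x
    v_a = V_W0 + v_w          # terminal carrying the weight signal
    v_b = V_W0                # terminal held at the weight reference
    v_ab = v_a - v_b
    return BETA_P * ((v_a - v_gate - V_TP) * v_ab - 0.5 * v_ab ** 2)


def product_term(v_x, v_w):
    """Four-quadrant product part of the synapse current."""
    return -BETA_P * v_w * v_x


def offset_term(v_w):
    """Offset part of the synapse current; it does not depend on the state v_x."""
    return -BETA_P * v_w * (V_X0 + V_TP - V_W0 - 0.5 * v_w)


def calibrated_current(v_x, v_w):
    """Current left after subtracting the value measured with the state at its reference."""
    return synapse_current(v_x, v_w) - synapse_current(0.0, v_w)


if __name__ == "__main__":
    for v_x in (-0.2, 0.2):
        for v_w in (-0.1, 0.1):
            print(v_x, v_w, synapse_current(v_x, v_w), calibrated_current(v_x, v_w))
```

Subtracting the current measured with the state at its reference level leaves only the product term, which is the role the text assigns to the calibration step.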
2.2 Current conveyor and level shifting
For the synapse to operate properly, the input node of the CNN core must be kept at a constant
voltage, independently of the current entering it. This is achieved by a current conveyor
(Fig. 2(a)). Any difference between the voltage at node 0 and the reference $V_{w0}$ is
amplified and the negative feedback corrects the deviation. Notice that a voltage offset in
the amplifier will result in an error of the same order. An offset cancellation mechanism is
therefore provided (Fig. 2(b)). Signal $\phi_{oc}$ shorts the inputs of the Operational
Transconductance Amplifier (OTA) and enables diode-mode operation of transistor $M_{oc}$, which
will conduct a current $I_{oc}$ such as to cancel out the current offset. Once $\phi_{oc}$ is
turned off, the total current injected into the load capacitor is offset-free:

$$
I_o = I_d + I_{os} - I_{oc} = g_m v_d
\tag{5}
$$

Figure 2. (a) Current conveyor and (b) OTA realization with offset-correction mechanism.
2.3 S³I current memory
As mentioned, the offset term of the synapse current must be removed for its output current
to represent the result of a four-quadrant multiplication. For this purpose all the synapses
are reset to $V_X = V_{x0}$. Then the resulting current, which is the sum of the offset currents
of all the synapses concurrently connected to the same node, is memorized. This value
is subtracted on-line from the input current when the CNN loop is closed, resulting
in a one-step cancellation of the errors of all the synapses. The validity of this method
relies on the accuracy of the current memory. For instance, in this chip, the sum of all the
contributions will range, for the applications for which it has been designed, from 18 µA
to 46 µA. On the other hand, the maximum signal to be handled is 1 µA. If a signal resolution
of 8 b is intended, then 0.5 LSB ≈ 2 nA. Thus, our current memory must be able to
distinguish 2 nA out of 46 µA. This represents an equivalent resolution of 14.5 b. In order
to achieve such an accuracy level, an S³I current memory is used. It is composed of three
stages (Fig. 3), each one consisting of a switch, a capacitor and a transistor. The input
current is the current to be memorized. After memorization the only error left corresponds to the last stage.
The former stages do not contribute to the error in the memorized current. If the S³I block
is designed so as to store the most significant bits in the first capacitor and the least
significant bits in the last one, this error can be made quite small.
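The resolution figures quoted in this section can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers given in the text (1 µA full-scale signal, 8-bit target, 46 µA maximum summed offset); the variable names are illustrative.

```python
import math

full_scale = 1e-6     # maximum signal current: 1 uA
bits = 8              # target signal resolution
offset_max = 46e-6    # largest summed synapse offset: 46 uA

lsb = full_scale / 2 ** bits      # about 3.9 nA
half_lsb = lsb / 2                # about 1.95 nA, rounded to 2 nA in the text
rounded_half_lsb = 2e-9

# Number of bits needed to resolve 2 nA within a 46 uA current.
equivalent_bits = math.log2(offset_max / rounded_half_lsb)

if __name__ == "__main__":
    print(f"0.5 LSB = {half_lsb * 1e9:.2f} nA")
    print(f"equivalent resolution = {equivalent_bits:.1f} b")
```

The second printed value rounds to the 14.5 b quoted above, which is why a plain single-stage current memory is not considered sufficient.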
2.4 Time-constant scaling
The differential equation that governs the evolution of the network, Eq. (1), can be written
as a sum of current contributions injected into the state capacitor. Scaling this sum
of currents up or down is equivalent to scaling the capacitor and, thus, to speeding up or
slowing down the network dynamics. Therefore, scaling the input current, for instance with
the help of a current mirror, has the effect of scaling the time constant. A circuit for
continuously adjusting the current gain of a mirror can be designed based on a regulated-cascode
current mirror in the ohmic region. But the strong dependence of the ohmic-region biased
transistors on the power rail voltage causes mismatches in $\tau$ between cells in the same
layer. An alternative is a binary programmable current mirror. It trades resolution in
$\tau$ for robustness; hence, the mismatch between the time constants of the different cells is
fairly attenuated.
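The equivalence between scaling the summed current and scaling the state capacitor can be illustrated with a first-order toy model. This is a sketch under assumptions: a single linear cell with a hypothetical 1 pF state capacitor and 1 µS loss conductance, a mirror modelled as an ideal gain, and forward-Euler integration. None of these values come from the chip.

```python
import math


def rise_time(gain, c=1e-12, g=1e-6, i_in=1e-6, dt=1e-9, t_max=1e-3):
    """Time for the state to reach 63.2 % of its final value.

    The cell obeys c*dv/dt = gain*(i_in - g*v): every current reaching the
    capacitor, input and loss alike, is scaled by the mirror gain.
    """
    target = (1.0 - math.exp(-1.0)) * i_in / g
    v, t = 0.0, 0.0
    while v < target and t < t_max:
        v += dt / c * gain * (i_in - g * v)
        t += dt
    return t


if __name__ == "__main__":
    t_unity = rise_time(1.0)
    t_scaled = rise_time(1.0 / 16.0)            # currents reduced 16 times
    t_big_cap = rise_time(1.0, c=16e-12)        # capacitor enlarged 16 times
    print(t_unity, t_scaled, t_big_cap)
```

Dividing the currents by 16 and multiplying the capacitor by 16 give the same settling time, sixteen times longer than in the unscaled cell.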
A new problem arises, though, because of current scaling. If the input current is
allowed to be reshaped to a 16-times smaller waveform, then the current memory is
obliged to operate over a wider dynamic range. If designed to operate on large currents,
the current memory will not work for the tiny currents of the scaled version of the
input. Conversely, if it is designed to run on small input currents, long transistors will
be needed, and the operation will be unreliable for the larger currents. One way of avoiding
this situation is to make the S³I memory work on the original unscaled version of
the input current. In that case, the adjustable-time-constant CNN core consists of a current
conveyor, followed by the S³I current memory and then the binary weighted current mirror.
The problem now is that the offsets introduced by the scaling block add to the signal,
and the required accuracy levels can be lost. Our proposal is depicted in Fig. 3. It
consists of placing the scaling block (programmable mirror) between the current conveyor
and the current memory. In this way, any offset error will be cancelled in the auto-zeroing
phase. In the picture, the voltage reference generated with the current conveyor, the
regulated-cascode current mirrors and the S³I memory can be easily identified. The inverter,
$A_i$, driving the gates of the transistors of the current memory is required for stability.
Without it, the output node, 8, will diverge from the equilibrium.
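The argument about where the scaling block should sit can be made concrete with an idealized signal-chain model. The sketch below is a simplification, not a circuit simulation: the mirror is modelled as a gain `1/k` plus a fixed additive offset, the memory as a perfect store-and-subtract element calibrated with zero signal, and all numerical values are hypothetical.

```python
def memory_then_mirror(i_sig, i_off, k, e_mirror):
    """Conveyor -> memory -> mirror: the memory never sees the mirror offset."""
    stored = i_off                                  # calibration with zero signal
    return (i_sig + i_off - stored) / k + e_mirror


def mirror_then_memory(i_sig, i_off, k, e_mirror):
    """Conveyor -> mirror -> memory: the mirror offset is memorized and removed."""
    stored = i_off / k + e_mirror                   # calibration with zero signal
    return (i_sig + i_off) / k + e_mirror - stored


if __name__ == "__main__":
    i_sig, i_off = 0.5e-6, 30e-6    # signal and summed synapse offset, in amperes
    k, e_mirror = 16, 20e-9         # scaling factor and hypothetical mirror offset
    ideal = i_sig / k
    print("error, memory first:", memory_then_mirror(i_sig, i_off, k, e_mirror) - ideal)
    print("error, mirror first:", mirror_then_memory(i_sig, i_off, k, e_mirror) - ideal)
```

In this model the first ordering leaves the full mirror offset in the output, while the second removes it during calibration together with the synapse offsets.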
Figure 3. Input block with current scaling, S³I memory and offset-corrected OTA schematic.
Figure 4. Prototype chip photograph.
3 Chip data and simulations 
A prototype chip has been designed and fabricated in a 0.5 µm single-poly triple-metal
CMOS technology. Its dimensions are 9.27 × 8.45 mm² (photograph in Fig. 4). The
cell density achieved is 29.24 cells/mm². The programmable dynamics of the chip permit
the observation of different phenomena, such as wave propagation, pattern generation,
etc. Fig. 5 displays the evolution of the state variable in a reduced network, 1 × 8
cells, in which the propagation of a wave front in 1-D has been programmed. It is triggered 
by a marker in the first layer of cell C,, and induced in the second layer as can be seen. 
By controlling the network dynamics and combining the results with the help of the built-in
local logic and arithmetic operators, rather involved image processing tasks can be programmed,
for instance, grayscale contour detection, skeletonization, etc. [3].
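A 1-D wave of the kind described above can be mimicked in software. The following sketch is a behavioural illustration only: the templates, bias values, time constants and boundary condition are hypothetical choices that produce a travelling front, not the program run on the chip. It integrates the two coupled layers of Eq. (1) for a 1 × 8 network with forward Euler, modelling the FSR limiter of Eq. (2) as a hard clamp so that the output equals the state.

```python
def clamp(v):
    """FSR limiter: the state, and therefore the output, stays in [-1, 1]."""
    return max(-1.0, min(1.0, v))


def simulate(n_cells=8, tau1=4.0, tau2=1.0, dt=0.01, t_end=60.0):
    a_self, a_side, z1 = 2.0, 1.5, 2.0   # hypothetical layer-1 template and bias
    a21, z2 = 2.0, 0.0                   # hypothetical coupling from layer 1 into layer 2
    x1 = [-1.0] * n_cells
    x1[0] = 1.0                          # marker that triggers the wave
    x2 = [-1.0] * n_cells
    arrival1 = [None] * n_cells          # time at which each layer-1 state crosses zero
    arrival1[0] = 0.0
    arrival2 = [None] * n_cells          # same for layer 2
    steps = int(round(t_end / dt))
    for step in range(1, steps + 1):
        t = step * dt
        new1, new2 = [], []
        for i in range(n_cells):
            # Cells outside the array contribute nothing (assumed boundary condition).
            left = x1[i - 1] if i > 0 else 0.0
            right = x1[i + 1] if i < n_cells - 1 else 0.0
            dx1 = -x1[i] + a_self * x1[i] + a_side * (left + right) + z1
            dx2 = -x2[i] + a21 * x1[i] + z2
            new1.append(clamp(x1[i] + dt / tau1 * dx1))
            new2.append(clamp(x2[i] + dt / tau2 * dx2))
        x1, x2 = new1, new2
        for i in range(n_cells):
            if arrival1[i] is None and x1[i] > 0.0:
                arrival1[i] = t
            if arrival2[i] is None and x2[i] > 0.0:
                arrival2[i] = t
    return x1, x2, arrival1, arrival2


if __name__ == "__main__":
    x1, x2, arrival1, arrival2 = simulate()
    for i, (t1, t2) in enumerate(zip(arrival1, arrival2)):
        print(f"cell {i}: layer 1 at t={t1:.2f}, layer 2 at t={t2:.2f}")
```

With these values the front moves from cell to cell in the first layer and is reproduced in the second layer slightly later in each cell. The time units are arbitrary.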
4 Conclusions 
The proposed approach is a promising alternative to conventional digital image
processing for applications related to early vision and low-level focal-plane image
processing. Based on a simple but precise model of part of the real biological system, a
feasible and efficient implementation of an artificial vision device has been designed. The peak
operation speed of the chip will outperform its digital counterparts due to the fully parallel
nature of the processing, which is, once more, based on analogy and not on simulation.
Figure 5. 1-D wave propagation: (a) slow CNN layer, (b) fast CNN layer. Horizontal axes: time (seconds).
References 
1. F. Werblin, Synaptic Connections, Receptive Fields and Patterns of Activity in the Tiger
Salamander Retina, Inv. Oph. and Vis. Sc. 32 (3), 459 (1991).
2. F. Werblin, T. Roska and L. O. Chua, The Analogic Cellular Neural Network as a Bionic
Eye, Int. J. Circ. Theor. and App. 23 (6), 541 (Wiley, Boston, 1995).
3. Cs. Rekeczky, T. Serrano-Gotarredona, T. Roska and A. Rodríguez-Vázquez, A Stored
Program 2nd Order/3-Layer Complex Cell CNN-UM, Proc. 6th Int. W. Cel. Neur. Net.
Apps., 219 (Catania, 2000).
4. T. Roska and L. O. Chua, The CNN Universal Machine: An Analogic Array Computer,
IEEE Trans. Circ. Syst. II: Anal. Dig. Sign. Proc. 40 (3), 163 (1993).
5. S. Espejo, R. Carmona, R. Domínguez-Castro and A. Rodríguez-Vázquez, A VLSI
Oriented Continuous-Time CNN Model, Int. J. Circ. Theor. Apps. 24 (3), 341 (Wiley,
Boston, 1996).
6. R. Domínguez-Castro, A. Rodríguez-Vázquez, S. Espejo and R. Carmona, Four-
Quadrant One-Transistor Synapse for High Density CNN Implementations, Proc. 5th
Int. W. Cel. Neur. Net. and Apps., 243 (London, 1998).
