The CNNUC3: an analog I/O 64x64 CNN universal machine chip prototype with 7-bit analog accuracy by Liñán Cembrano, Gustavo et al.
2000 6th IEEE International Workshop on Cellular Neural Networks and Their Applications Proceedings
The CNNUC3: An Analog I/O 64 x 64 CNN Universal Machine Chip Prototype with 7-bit Analog Accuracy
G. Liñán, S. Espejo, R. Domínguez-Castro and A. Rodríguez-Vázquez
Instituto de Microelectrónica de Sevilla - CNM-CSIC
Edificio CICA-CNM, C/Tarfia s/n, 41012 - Sevilla, SPAIN
Phone: +34 95 4239923, Fax: +34 95 4231832, Email: linan@imse.cnm.es
ABSTRACT 
This paper describes a full-custom mixed-signal chip which embeds distributed optical signal
acquisition, digitally-programmable analog parallel processing, and distributed image memory
(cache) on a common silicon substrate. This chip, designed in a standard 0.5µm CMOS technology,
contains around 1,000,000 transistors, 80% of which operate in analog mode; it is hence one of the
most complex mixed-signal chips reported to date. Chip functional features are in accordance with the
CNN Universal Machine [1] paradigm: cellular, spatially-invariant array architecture; programmable
local interactions among cells; randomly-selectable memory of instructions (elementary instructions
are defined by specific values of the cell local interactions); random storage/retrieval of intermediate
images; capability to complete algorithmic image processing tasks controlled by the user-selected
stored instructions and interacting with the cache memory, etc. Thus, as illustrated in this paper, the
chip is capable of completing complex spatio-temporal image processing tasks within a short
computation time (on the order of hundreds of nanoseconds for linear convolutions) and using a low
power budget (up to 1.2W for the complete chip). The internal circuitry of the chip has been designed
to operate in a robust manner with >7-bit
equivalent accuracy in the internal analog operations, which has been confirmed by experimental
measurements. Hence, to all practical purposes, processing tasks completed by the chip have the
same accuracy as those completed by digital processors preceded by 7-bit analog-to-digital
converters for image digitization. Such 7-bit accuracy is enough for most image processing
applications. CNNUC3 has been demonstrated capable of implementing, either directly or through
template decomposition, 100% of the linear 3 x 3 templates reported in [2].
1. Introduction†
Full exploitation of Cellular Neural Network capabilities for image processing can only be achieved through
VLSI chips. Several CNN and CNN-UM chips have been made in the past; in particular, those having a size larger
than 10 x 10 and whose operation has actually been demonstrated through experimental evidence are described in
[3]-[6]. The chips in [3], [4] and [5] are intended for binary images, while that in [6] is intended for gray-scale
images. Those in [4] and [5] have been designed by keeping analog accuracy and robustness as targets, while those
in [3] and [6] are targeted for maximum cell density. Finally, only the chip in [5] embeds distributed optical sensors
for direct optical image acquisition.
CNNUC3 also embeds distributed optical sensors (it is a true focal-plane analog programmable array processor)
and is capable of acquiring gray-scale inputs and producing gray-scale outputs. It has been designed to achieve around
7-bit equivalent resolution in the internal analog operations, and its robust operation has been experimentally
demonstrated through implementation of 100% of the linear 3 x 3 templates reported in [2]. Besides, it can be directly
interfaced to digital equipment and incorporates all functional features needed for the realization of complex image
processing algorithms.
† This work has been partially funded by ONR-NICOP N68171-98-C-9004, DICTAM IST-1999-19007 and TIC 990826.
0-7803-6344-2/00/$10.00 ©2000 IEEE
2. General Characteristics
CNNUC3 consists basically of an array of 64 x 64 identical cells. Its processing is continuous-time and
spatially-invariant, with radius-1 neighbourhood and the cell state equation given by the FSR model [7].
Feedback and control templates, and the offset (or bias) term, are programmable with a resolution of eight bits
(seven + sign). Input and output pixel values are analog (gray-scale) in general. However, specific functions are
included for binary (black & white) images, which can also be processed. Spatially-distributed image memories are
available for storage of both analog and binary images on a pixel-by-pixel basis. This allows fully-parallel (64 x 64
wide) data transfer between processors and memory.
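As a rough illustration of what an eight-bit (seven bits plus sign) coefficient resolution implies, the sketch below quantizes a template weight to a sign-magnitude code and back. The full-scale value `full_scale`, the sign-magnitude coding and the function names are assumptions made for illustration; the text does not state the weight range or the on-chip coding.

```python
def quantize_weight(w, full_scale=4.0, magnitude_bits=7):
    """Map a real weight to (sign, magnitude code): 7 magnitude bits plus sign.

    full_scale is a hypothetical maximum |weight|, not a value from the paper.
    """
    levels = (1 << magnitude_bits) - 1           # 127 non-zero magnitude steps
    clipped = max(-full_scale, min(full_scale, w))
    code = round(abs(clipped) / full_scale * levels)
    sign = 1 if clipped < 0 else 0
    return sign, code


def dequantize_weight(sign, code, full_scale=4.0, magnitude_bits=7):
    """Recover the weight value represented by a (sign, code) pair."""
    levels = (1 << magnitude_bits) - 1
    value = code / levels * full_scale
    return -value if sign else value
```

Under these assumptions the worst-case rounding error is half a step, i.e. `full_scale / (2 * 127)`.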
The prototype incorporates global-control and programming
circuitry, located at the periphery of the array. This
includes memory for 32 arbitrary sets of coefficients which,
once programmed, can be randomly selected from the outside.
External control is completely digital. The interface has
been designed to be easily embedded in conventional digital
systems centred around a CPU or a DSP unit. Two bidirectional
data buses, one analog and one digital, are employed for
image loading and downloading.
The prototype has been designed and manufactured in a
0.5µm, single-poly, triple-metal CMOS technology. Cell
size is 102.2 x 120µm², as necessary to guarantee 7-bit equivalent
accuracy in the internal analog operations, while total die
size is 9.145 x 9.534mm². The cell array occupies 58% of the
die area. Nominal power supply is 3.3V, and worst-case
power consumption is 1.2W. Table 1 shows the most relevant
physical and electrical data of the prototype.
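The quoted figures can be cross-checked with a few lines of arithmetic. The sketch below recomputes the array area fraction from the cell and die dimensions, and the power drawn by the array alone from the per-cell power listed in Table 1 (read here as 250µW). The variable names are illustrative only.

```python
# Dimensions and counts quoted in the text and in Table 1
cell_w_um, cell_h_um = 102.2, 120.0   # cell size in micrometres
n_cells = 64 * 64                     # 4096 cells
die_w_mm, die_h_mm = 9.145, 9.534     # die size in millimetres
power_per_cell_w = 250e-6             # Table 1 entry, read as 250 microwatts

array_area_mm2 = cell_w_um * cell_h_um * n_cells * 1e-6   # um^2 -> mm^2
die_area_mm2 = die_w_mm * die_h_mm
array_fraction = array_area_mm2 / die_area_mm2
array_power_w = power_per_cell_w * n_cells

print(f"array area   : {array_area_mm2:.2f} mm^2")
print(f"die area     : {die_area_mm2:.2f} mm^2")
print(f"area fraction: {array_fraction:.3f}")
print(f"array power  : {array_power_w:.3f} W")
```

The area fraction comes out near 0.576, consistent with the quoted 58%, and the 4096 cells account for about 1.02W of the 1.2W worst-case budget.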
3. Chip Description 
Fig. 1(a) shows the chip architecture. The prototype
incorporates some global-control and programming circuitry
located at the array periphery. This includes memory for 32
arbitrary sets of CNN coefficients and for 64 arbitrary sets of
48 digital signals that are used as digital instructions to
properly configure the cell for the different tasks
it is designed to perform. These memories can be randomly
addressed from the outside once they have been programmed.
Fig. 1(b) shows the chip microphotograph.
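To make the size of these peripheral memories concrete, the sketch below derives the minimum address width needed to select one entry at random and the total number of instruction bits. The address widths are derived quantities, not figures stated in the text, and the names are hypothetical.

```python
TEMPLATE_SETS = 32             # coefficient sets held in the peripheral memory
INSTRUCTION_SETS = 64          # digital instruction words
SIGNALS_PER_INSTRUCTION = 48   # digital signals per instruction word


def address_bits(n_entries):
    """Smallest number of address bits that can select any of n_entries."""
    return max(1, (n_entries - 1).bit_length())


template_address_bits = address_bits(TEMPLATE_SETS)
instruction_address_bits = address_bits(INSTRUCTION_SETS)
instruction_memory_bits = INSTRUCTION_SETS * SIGNALS_PER_INSTRUCTION
```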
Table 1: Prototype Data
| # of Cells                            | 4096 (64 x 64 Array)         |
| Cell Size                             | 120µm x 102.2µm              |
| # Transistors on the cell             | 172                          |
| Power per cell                        | 250µW                        |
|                                       | [0.6, 1.4]V (Programmable)   |
|                                       | [2.15, 2.95]V (Programmable) |
| Time Constant                         | ~1.2µs                       |
| Time Constant for Linear Convolutions |                              |
| I/O Digital Rate                      |                              |
| I/O Analog Rate                       |                              |
| Power Supply                          | 3.3V                         |
| # of Templates Memory                 |                              |
| # of Instructions                     |                              |
Fig. 1: (a) Chip Architecture. (b) Chip Microphotograph.
[Figure legend: DI: Digital Instruction; AI: Analog Instruction]
$$ I_o = I_o(v_i, v_w) \approx G(v_w)\, v_i + I_0(v_w) \qquad (1) $$
where $v_i$ is some state voltage $v_x$, input voltage $v_u$, or $v_{sat}$, and $v_w$ is some feedback or control weight voltage. Both $v_i$ and $v_w$ are relative to their corresponding
zero-levels, $v_{x0}$ and $v_{w0}$ respectively. It is clear that both $v_i$ and $v_w$ must be kept within some bounds for (1) to be
valid. The signal ranges are limited to $[-v_{sat}, v_{sat}]$ and $[-w_{sat}, w_{sat}]$, respectively. Replacing the real form (1)
of every synapse output into the integral form of the FSR cell state equation yields a right-hand side whose last sum collects the weight-dependent terms $I_0(v_w)$ of all synapses.
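A sketch of the form this substitution takes, assuming the cell stays inside its linear region (so the limiter term is omitted), and writing $C$ for the integrating capacitance and $k$ for an index running over the synapses connected to the cell's input node. Both symbols, and the sign convention, are notational assumptions introduced here, not taken from the original:

```latex
C\,\bigl[v_x(t) - v_x(0)\bigr] \;\approx\;
\int_0^t \Bigl[\, \sum_{k} G(v_{w,k})\, v_{i,k}(\tau)
\;+\; \sum_{k} I_0(v_{w,k}) \Bigr]\, d\tau
```

The first sum is the intended weighted contribution of the synapse inputs. The second sum depends only on the weight signals, which is why it behaves as an extra, weight-dependent offset.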
The last sum on the right-hand side constitutes an undesired contribution to the offset term that must be
cancelled. For this purpose, it needs to be “computed”, stored, and subtracted. All this is done very easily using the
same 20 synapses (physically the same transistors, hence insensitive to mismatch) in a previous step in which every $v_i$ is
made zero (synapse input signals are all connected to $v_{x0}$). The resulting currents are added at the cell's input node,
and the result (the term to be cancelled) is stored in a current memory. Subtraction comes intrinsically associated
with the current-memorization operation. Because the cancelled term depends on every weight signal, the cancellation
must be repeated whenever the weight signals are changed.
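The two-step procedure described above can be sketched behaviourally as follows. The linear forms chosen for the conductance and offset terms, their coefficients, and all function names are illustrative assumptions; only the structure (store the zero-input current, then subtract it) follows the text.

```python
def synapse_current(v_i, v_w, g_per_volt=1e-6, i_off_per_volt=5e-7):
    """First-order synapse model in the spirit of Eq. (1): G(v_w)*v_i + I_0(v_w).

    The linear dependence of G and I_0 on v_w is a simplifying assumption.
    """
    return g_per_volt * v_w * v_i + i_off_per_volt * v_w


def cell_input_current(inputs, weights):
    """Sum of all synapse currents arriving at the cell's input node."""
    return sum(synapse_current(v_i, v_w) for v_i, v_w in zip(inputs, weights))


def calibrated_current(inputs, weights):
    """Offset-cancelled input current for one cell."""
    # Step 1: all synapse inputs at their zero level, so only offsets remain.
    stored_offset = cell_input_current([0.0] * len(weights), weights)
    # Step 2: normal operation, with the memorized current subtracted.
    return cell_input_current(inputs, weights) - stored_offset
```

Because `stored_offset` is computed from the current weights, it has to be recomputed whenever the weights change, which mirrors the requirement stated above.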
Cancellation also results in the effective elimination of any other offset current arising from circuit imperfections.
In fact, such elimination is needed, or at least very convenient, in most CNN hardware implementations because the
output-referred random offsets of every synapse add together, resulting in a large random (spatially variant) error
for the “offset” or bias term. This offset term is usually the dominant error source in circuit implementations, and
therefore the small amount of additional hardware required for its elimination is commonly worthwhile.
3.2.2 Current Memory 
The cancellation employed in CNNUC3 follows a “store & subtract” strategy. The main drawback of
this alternative is that the resulting current-memory specifications are tight, with a simultaneous requirement of a
large current range (maximum current to be stored) and low absolute current error. This has been solved using an
extension of the S²I technique [9] based on the addition of a third current memorization stage. This results in an S³I
current memory. For optimum performance, the three current memories must be carefully sized because their
corresponding signal ranges are different.
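Why a third stage helps, and why the three stages see different signal ranges, can be illustrated with a purely behavioural model in which each stage memorizes whatever residual the previous stages left, with its own relative error. This is an idealization under assumed error figures, not the circuit of [9] or of the chip.

```python
def multi_step_memory(i_in, stage_errors):
    """Behavioural multi-step current memory.

    Each stage stores the residual left by the previous stages, with a
    relative error taken from stage_errors (hypothetical values).
    Returns (per-stage stored currents, final residual error).
    """
    stored = []
    residual = i_in
    for eps in stage_errors:
        held = residual * (1.0 - eps)   # imperfect copy of the presented current
        stored.append(held)
        residual -= held                # what the next stage must memorize
    return stored, residual


# Example: 100 uA stored with 1% error per stage (assumed numbers)
stages, error = multi_step_memory(100e-6, [0.01, 0.01, 0.01])
```

With these assumed numbers, each extra stage reduces the residual by the stage error factor, and each stage handles a current roughly a hundred times smaller than the one before, which is the reason for sizing them differently.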
After the storage cycle, the resulting current source constitutes the biasing stage of the current conveyor
employed at the cell's input node.
3.2.3 Current Conveyor 
Because the transistors employed at the synapses operate in the ohmic region, and because a moderately large number
of them (20) are connected to the same cell input node, the input impedance of that node must be very
low. A class-II current conveyor is employed for this purpose. It is based on a common-gate amplifier with the input
admittance boosted using an internal amplifier and negative feedback. The high-impedance output of the current
conveyor is directly driven (through some initialization and control switches) to the integrating capacitor.
Because the random (spatially variant) component of the input-referred offset voltage of the current conveyors
would affect the accuracy of the weights, calibration circuitry can optionally be employed to cancel these offsets.
3.2.4 Integrating and Sampling Capacitors 
The integrating capacitor is implemented by the input capacitance of the 9 synapses corresponding to the
feedback template. An identical capacitor, implemented by the input capacitance of the 9 synapses corresponding to
the control template, is employed for the storage of the cell's input level ($u^c$). In fact, the role of each of the two
capacitors can be selected for each CNN process. As a first step, each of the two capacitors is precharged to the
corresponding pixel value (gray or B&W) of one of two images. The distinction between $x^c(0)$ and $u^c$ (alternatively,
between feedback and control templates) comes only afterwards, when one out of two control signals selects which
capacitor ($C_x$) will receive the current conveyor's output current, while the other ($C_u$) remains disconnected. On the




3.2.5 Voltage Limiter 
There are several very simple and hardware-efficient ways to implement the nonlinear resistor needed in the FSR
cell state equation. One possibility is to use two diodes and two reference levels. Diodes can be emulated using MOS
transistors with moderately large aspect ratio. This approach, however, has the disadvantages of a smooth transition
and a finite slope in the saturation region and, much more importantly, its sensitivity to mismatch produces a random
spatial variation of the cell saturation level. Note that the contribution of one cell to its neighbours, which is
always proportional to the corresponding weight, is also proportional to the local value of $v_{sat}$ whenever the cell is
saturated. Cell saturation occurs in many propagative templates, at the final steady state in binary-output
applications, and at the beginning of the transient in binary-input applications; in other words, in practically all CNN
processing functions. Therefore, the accuracy and uniformity of the local saturation levels are as important as the
accuracy and uniformity of the weights.
Another alternative is based on using active diodes, which employ negative feedback to achieve an abrupt
transition, closer to the ideal, but are still sensitive to mismatch due to amplifier offsets. A previous offset calibration cycle
could be used to eliminate this effect, at the expense of more complex circuitry and some additional global control
lines. Still, one problem would remain: a substantial amount of power is needed in order to obtain sufficiently
fast “diodes” without significant overshoot (i.e., with a dominant time constant well below that of the CNN
processing circuitry).
For these reasons, the limiter circuitry employed in CNNUC3 is somewhat involved. It is based on two
comparators that detect when the cell's signal goes beyond either border of the linear region. In that case, the integrating
capacitor is directly connected to one of two global wires driven by the corresponding saturation level, $-v_{sat}$ or $v_{sat}$,
whichever corresponds to the reached border. Although the input-referred offsets of the comparators will result in
small errors, these deviations are effective only during the short transient (response time) of the comparator. Some
minor additional tricks are needed to avoid possible instabilities in the proximity of the border points, and to allow
the state-variable signal to re-enter the linear region.
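The effect of such a comparator-based limiter on the state variable can be sketched with a discrete-time (forward Euler) model: the state integrates the input current, and whenever it would cross a border it is tied to the corresponding saturation level, from which it can later re-enter the linear region. The component values and function names are assumptions chosen only to give round numbers; comparator delay and offset are ignored.

```python
def limited_step(v_x, i_in, dt, c_int, v_sat):
    """One Euler step of the integrator followed by ideal hard limiting."""
    v_next = v_x + i_in * dt / c_int
    if v_next > v_sat:       # upper comparator trips: tie capacitor to +v_sat
        return v_sat
    if v_next < -v_sat:      # lower comparator trips: tie capacitor to -v_sat
        return -v_sat
    return v_next


def simulate(v0, currents, dt=1e-8, c_int=1e-12, v_sat=0.4):
    """Run the limited integrator over a sequence of input currents."""
    trace = []
    v = v0
    for i_in in currents:
        v = limited_step(v, i_in, dt, c_int, v_sat)
        trace.append(v)
    return trace


# Drive into saturation, then reverse the current so the state re-enters
trace = simulate(0.0, [10e-6] * 6 + [-10e-6] * 2)
```

In this sketch the saturation level is a single shared value, which reflects the point made above: when it is set by global wires rather than by local devices, it does not vary from cell to cell.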
3.2.6 Initialization and Control Circuitry 
A number of analog switches in every cell, and a similar number of global control lines are required to control 
the different cancellation circuits, the initialization process, and to actually launch the CNN transient. As a matter 
of fact, most of the control circuitry and global control lines are related to the enhanced functionalities described 
below. 
3.3 Enhanced Functionalities
Additional functionalities have been incorporated for further improvement of the CNN Universal Machine
capabilities [1] as required for relevant processing functions.
3.3.1 Image Memories
Every cell has the capability of storing four analog (gray-scale) and four binary (black & white) pixel values. At
system level, this means that the chip can simultaneously store eight different images. These images can be used as
inputs at any time during a processing sequence, and modified at any time as well; writing/reading time of the
memories is around 0.1µs. Binary memories employ conventional digital latches, while analog memories rely on
“bottom-plate sampling” switched-capacitor stages following the guidelines given in [10]. By using these memories for
storage of intermediate results, significant computation time reductions are achieved in the realization of complex
algorithms requiring iterative template applications, as well as in the realization of bifurcated-flow algorithms.
3.3.2 Local Logic Unit
The local logic unit (LLU) is a programmable boolean gate whose truth table is defined as part of the digital
instructions stored in the programming circuitry. It allows a completely parallel realization of arbitrary bit-to-bit
logic operations between images stored at two user-selectable binary memories. The resulting image can be
downloaded or stored in any of the four binary memories. Conventional digital circuitry is employed for this
purpose.
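A programmable two-input gate of this kind can be modelled as a four-entry truth table applied pixel by pixel to two binary images. The sketch below is a software analogy only; the table indexing convention and the names are assumptions, and nothing here describes the on-chip encoding of the truth table.

```python
def llu(truth_table, image_a, image_b):
    """Apply a 2-input boolean function to two binary images, pixel by pixel.

    truth_table[2 * a + b] is the output for input bits (a, b); this
    indexing convention is an assumption made for the example.
    """
    return [
        [truth_table[2 * a + b] for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]


# Two example programs for the gate
AND = (0, 0, 0, 1)
XOR = (0, 1, 1, 0)
```

Any of the sixteen two-input boolean functions is obtained by choosing a different four-entry table, which is what makes a single programmable gate sufficient.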
3.3.3 Freezing Mask 
Having a “freezing” mask means that the content of one user-selectable binary image memory can (optionally)
be used as a flag which disables the evolution of the marked pixels during CNN processing transients, keeping their
state variables time-invariant. The realization of this function requires just a few analog switches.
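In software terms, the freezing mask amounts to a per-pixel select between the old state and the newly computed one. A minimal sketch, with hypothetical names and with images represented as nested lists:

```python
def apply_freezing_mask(current_state, next_state, freeze_mask):
    """Return the updated image, leaving pixels flagged in freeze_mask unchanged."""
    return [
        [cur if frozen else nxt for cur, nxt, frozen in zip(row_c, row_n, row_m)]
        for row_c, row_n, row_m in zip(current_state, next_state, freeze_mask)
    ]
```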
3.3.4 Global Gates 
In many cases it is useful to find out whether some specific image is completely white or completely black, without
wasting the time required to download the whole image. The prototype incorporates two global gates, one NAND
and one NOR, to perform these logic operations over the pixel values of one user-selectable binary memory. With
this functionality, the time required to check if some image is completely black or white is around 3µs.
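The two global gates reduce a whole binary image to one bit each. The sketch below assumes that a stored 1 means black and a stored 0 means white; the text does not state this polarity, so it is labelled here as an assumption, as are the function names.

```python
def global_nor(image):
    """1 only if every pixel is 0."""
    return int(not any(p for row in image for p in row))


def global_nand(image):
    """0 only if every pixel is 1."""
    return int(not all(p for row in image for p in row))


def is_all_white(image):
    # Assumed polarity: 0 = white, so an all-white image makes the NOR output 1.
    return global_nor(image) == 1


def is_all_black(image):
    # Assumed polarity: 1 = black, so an all-black image makes the NAND output 0.
    return global_nand(image) == 0
```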
3.3.5 Optical Input 
In many real-life high-speed applications, the information to be processed by the network is an image that is
available in optical form, while the output contains only a few details extracted from the input. In these situations,
the read-out process is greatly simplified and hence speeded up. However, the input image is always a complete
frame and, therefore, the time needed to transfer the image to the array can constitute an actual bottleneck. In those
cases, the capability of combining the sensory and the processing planes provides a dramatic enhancement of system
performance, since it produces systems that exploit not only the advantages of fully parallel processing but
also those of the fully parallel image acquisition provided by a matrix of photosensors merged with that of
processors. CNNUC3 incorporates a photosensing device within each cell that allows the acquisition of images that
are directly projected onto the silicon surface. The sensing scheme is based on the integration, in the capacitor of
any of the analog image memories, of the current that is generated by a diffusion-substrate photodiode.
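For a constant photocurrent, integration on a capacitor gives a voltage change of I·t/C. The sketch below applies that relation; the photocurrent, capacitance and voltage swing used in the example are hypothetical round numbers, not data from the chip, and the sign of the voltage change is ignored.

```python
def integrated_voltage(i_photo, t_int, c_mem):
    """Magnitude of the voltage change after integrating a constant photocurrent."""
    return i_photo * t_int / c_mem


def integration_time_for_swing(i_photo, c_mem, v_swing):
    """Integration time needed for a given voltage swing on the memory capacitor."""
    return v_swing * c_mem / i_photo


# Hypothetical example: 100 pA photocurrent, 0.5 pF capacitor, 0.8 V swing
t_needed = integration_time_for_swing(100e-12, 0.5e-12, 0.8)
```

Under these assumed values the required integration time is in the millisecond range, and it scales inversely with illumination.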
4. Conclusions 
This paper describes a recently designed analog programmable array processor chip. The new prototype, called
CNNUC3, contains 64 x 64 cells arranged in an array and follows the CNNUM computing paradigm. For that
purpose it includes several specially designed modules, such as the Local Logic Unit, the Local Analog Memory, the
Switch Configuration Register, the Global Gates and the Freezing Map, that extend the prototype's capabilities. The chip
is able to process, store and provide gray-scale images. An optical acquisition mode is also available, thus allowing
full exploitation not only of parallel processing but also of parallel acquisition.
5. References 
[1] T. Roska and L.O. Chua, “The CNN Universal Machine: An Analogic Array Computer”. IEEE Trans. Circuits and Systems
II, Vol. 40, pp. 163-173, March 1993.
[2] T. Roska, L. Kék, L. Nemes, Á. Zarándy, M. Brendel, CSL - CNN Software Library - Version 7.2, Analogical and Neural
Computing Laboratory, Computer and Automation Institute, Hungarian Academy of Sciences, Budapest, 1998.
[3] A. Paasio, V. Porra, “A CNN Universal Machine with 295 cell/mm²”. Proc. of the 1997 Int. Symposium on Nonlinear
Theory and its Applications (NOLTA'97), Honolulu, USA, 1997, pp. 221-224.
[4] P. Kinget and M. Steyaert, Analog VLSI Integration of Massive Parallel Processing Systems. Kluwer Academic Publishers,
ISBN 0-7923-9823-8, 1997.
[5] R. Domínguez-Castro et al., “A 0.8µm CMOS 2-D Programmable Mixed-Signal Focal-Plane Array Processor with
On-Chip Binary Imaging and Instructions Storage”. IEEE J. Solid-State Circuits, Vol. 32, No. 7, pp. 1013-1026, July 1997.
[6] J. Cruz and L. Chua, “A 16x16 Cellular Neural Network Universal Chip”. Analog Integrated Circuits and Signal Processing,
Vol. 15, pp. 226-238, March 1998.
[7] S. Espejo, R. Carmona, R. Domínguez-Castro and A. Rodríguez-Vázquez, “A VLSI-Oriented Continuous-Time CNN
Model”. International Journal of Circuit Theory and Applications, Vol. 24, pp. 341-356, May-June 1996.
[8] R. Domínguez-Castro, A. Rodríguez-Vázquez, S. Espejo, R. Carmona, “Four-Quadrant One-Transistor-Synapse for
High-Density CNN Implementations”. Proc. of 5th IEEE Int. Workshop on Cellular Neural Networks and their
Applications, pp. 243-248, London, April 1998.
[9] J.B. Hughes and K.W. Moulding, “S²I: A Two-Step Approach to Switched-Currents”. Proc. 1993 IEEE Int. Symp. Circuits
and Systems, pp. 1235-1238, May 1993.
[10] R. Carmona, S. Espejo, R. Domínguez-Castro, A. Rodríguez-Vázquez, T. Roska, T. Kozek, L.O. Chua, “A 0.5µm CMOS
CNN Analog Random Access Memory Chip for Massive Image Processing”. Proc. of 5th IEEE Int. Workshop on Cellular
Neural Networks and their Applications, pp. 271-276, London, April 1998.
