ACE16K: A 128×128 focal plane analog processor with digital I/O by Liñán Cembrano, Gustavo et al.
ACE16K: A 128x128 FOCAL PLANE ANALOG PROCESSOR WITH DIGITAL I/O
G. LIÑÁN, A. RODRÍGUEZ-VÁZQUEZ, S. ESPEJO and R. DOMÍNGUEZ-CASTRO
Instituto de Microelectrónica de Sevilla - CNM-CSIC
Edificio CICA-CNM, C/Tarfia s/n, 41012 - Sevilla, SPAIN
Phone: +34 95 5056666, Fax: +34 95 5056686. E-mail: linan@imse.cnm.es
This paper presents, from a system-level perspective, a new-generation 128x128 Focal-Plane Analog
Programmable Array Processor -FPAPAP- which has been manufactured in a 0.35µm standard digital
1P-5M CMOS technology. The chip has been designed to achieve the high-speed and moderate-accuracy
-8b- requirements of most real-time early-vision processing applications. It is easily embedded
in conventional digital hosting systems: external data interchange and control are completely digital.
The chip contains close to four million transistors, 90% of them working in analog mode, and exhibits
a relatively low power consumption -<4W, i.e. less than 1µW per transistor-. Computing vs. power
peak values are in the order of 1 TeraOPS/W, while maintained VGA processing throughputs of
100 Frames/s are possible with about 10-20 basic image processing tasks on each frame.
1 Introduction 
The retina, the front-end “device” found in natural vision systems, is capable of
acquiring and processing visual information concurrently [1]. Among many other
tasks, the early processing performed at the retina serves to extract important features from
the raw sensory data and, thus, to reduce the amount of information transmitted to the
brain for subsequent processing. In contrast, image acquisition and processing are
usually separated in conventional artificial vision systems. Consequently, these systems
are much slower, bulkier and less efficient than even the simplest natural vision
ones.
Inspired by Nature’s efficiency, significant efforts have been made during the last few years
to develop new vision devices capable of overcoming the drawbacks of traditional
ones through the incorporation, at the sensory plane, of 2-D distributed parallel processors
that operate concurrently with signal acquisition [2], [3].
The chip presented in this paper belongs to this general family. However, while most
of its relatives are designed for specific functions, the chip reported here is a general-purpose
front-end vision device with the following features: 1) a massively parallel 2-D
imaging/processing core array consisting of locally-connected pixels with embedded optical
sensors and digitally-controlled analog processing circuitry; 2) distributed circuitry
for storing locally, pixel by pixel, several 2-D intermediate images; and 3) stored on-chip
programmability.
In addition to the core array, the ACE16K incorporates additional circuitry for control
and timing, a fully digital interface, address-event downloading, and on-chip program
storage. Hence, the ACE16K is actually a visual microprocessor on-chip capable of performing
a very large variety of image-related spatio-temporal operations and algorithms through







Figure 1. The ACE16K Chip. a) Architecture. b) Microphotograph.
2 System description 
2.1 Architecture 
ACE16K can be basically described as an array of 128x128 identical, locally interacting,
analog processing units designed for high-speed image processing tasks requiring moderate
accuracy levels -around 8b-. It contains a set of on-chip peripheral circuits that, on
one hand, allow a completely digital interface with the host and, on the other, provide high
algorithmic capability through conventional programming memories.
Although ACE16K is essentially an analog processor, it is digitally controlled. For this
purpose, the prototype incorporates DA and AD converters which form a digital I/O
port for images. The chip is conceived to be used in two alternative ways: first, in applications
where the images to be processed are directly acquired by the optical input module
of the chip [5]; and second, as a conventional image co-processor working in parallel with
a digital hosting system that provides and receives the images in electrical form.
The architecture of the system is sketched in Fig. 1 and contains five functional
blocks. 1) The analog processing core, which comprises the inner array of 128x128 identical
cells, a ring of border cells used to establish spatial boundary conditions for image
processing, and several buffers driving analog and digital signals to the cell array. 2) A
programming block, which contains the digital SRAM memories used to store the algorithms
to be executed by the chip. Finally, the third to fifth blocks are dedicated to the I/O of images
in electrical form. They contain a global I/O control unit, which generates the signals required
for I/O image accesses, the row and column addressing signals, and the control of the bank of
Digital-to-Analog and Analog-to-Digital I/O converters.
The chip uses a 32b bidirectional data bus for external communication purposes, and
several address buses for the different blocks within the programming memory. The I/O
interface follows very simple hand-shaking protocols. Table 1 summarizes the main characteristics
of the prototype.

Table 1. Main characteristics of the prototype.
| Technology                        | STM 0.35 µm 5M-1P                                                |
| Design Style                      | Full Custom (Analog Core) and Standard Cells (Digital I/O block) |
| Package                           | Ceramic QFP144                                                   |
| # of Cells                        | 16384 (128 x 128 Array)                                          |
| # of Transistors                  | 3,748,170                                                        |
| # of Transistors per Cell         | 198                                                              |
| Cell Density                      |                                                                  |
| State Signal Swing                | [0.6, 1.4] V (Programmable)                                      |
| Weight Signal Swing               |                                                                  |
| Time-Constant -linear convol.-    |                                                                  |
| Time-Constant -CT Dynamics-       |                                                                  |
| I/O Master Clock                  | 32 MHz                                                           |
| Power Supply                      | 3.3 V ± 5%                                                       |
| Power Consumption                 | < 4 W                                                            |
| # of Analog Instructions in mem.  | 32                                                               |
| # of Digital Instructions in mem. | 64 x 64 Configurations                                           |
| Die Size                          | 11885.0 µm x 12230 µm                                            |
2.2 Programming block 
The programming block, illustrated in Fig. 2, provides the algorithmic capability of
ACE16K. It is basically a set of 8 SRAM memory blocks holding miscellaneous
contents, ranging from digital vectors defining the algorithms to be executed -what we call
“digital instructions”- to sets of cell-to-cell interaction weights and reference levels to be
applied to the cell array -what we call “analog instructions”-.
The chip has two operating modes, namely the programming mode and the operation mode.
During the programming mode, each of the 8 SRAM blocks can be independently
accessed through the data bus in order to be written -or read, just for testability purposes-.
In the operation mode, on the other hand, the contents of different groups of memory
blocks are selected through different address buses and transmitted in parallel to the cell
array.
The programming block can be divided into three sub-groups. Two of them -Operations
Memory and Addresses Memory- are used to store digital instructions. Each of these
blocks is designed to store 64 words of 32 bits. A digital instruction is defined as a 64b
digital vector that controls the configuration of the chip circuitry. It comprises a word from
the operations memory -32b- and another one from the addresses memory -32b-. The third
group -Weight and Analog References memory- is used to store cell-to-cell interaction
weights and some reference levels. This group consists of six identical SRAM blocks,
each of them designed to store 32 words of 32b. Analog coefficients are defined by 8b
words, so each of these blocks stores 32 sets of 4 analog values. An analog instruction comprises
24 -i.e., 6 x 4- analog values that are transmitted in parallel to the processing core by
means of a bank of 24 digital-to-analog converters.
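The word arithmetic of this paragraph can be illustrated with a minimal Python sketch. It is not the chip's actual memory map: the function names, the byte order inside a 32b word, the choice of which memory supplies the upper half of the 64b vector, and the use of two independent addresses for the digital memories are all assumptions made for illustration.

```python
from typing import List

WORD_BITS = 32          # every SRAM block stores 32b words (from the text)
COEFF_BITS = 8          # analog coefficients are 8b words (from the text)
ANALOG_BLOCKS = 6       # six identical weight/reference blocks (from the text)


def unpack_analog_word(word: int) -> List[int]:
    """Split one 32b word into four 8b coefficients.

    Least-significant byte first is an assumption; the text does not
    specify the ordering of the four values inside a word.
    """
    if not 0 <= word < 2 ** WORD_BITS:
        raise ValueError("word must fit in 32 bits")
    return [(word >> (COEFF_BITS * k)) & 0xFF for k in range(4)]


def analog_instruction(blocks: List[List[int]], address: int) -> List[int]:
    """Collect the 6 x 4 = 24 coefficients of one analog instruction.

    Reads the word at the same address in each of the six blocks.
    """
    if len(blocks) != ANALOG_BLOCKS:
        raise ValueError("expected six weight/reference blocks")
    values: List[int] = []
    for block in blocks:
        values.extend(unpack_analog_word(block[address]))
    return values


def digital_instruction(operations: List[int], addresses: List[int],
                        op_addr: int, addr_addr: int) -> int:
    """Build a 64b digital instruction from one word of each memory.

    Placing the operations word in the upper half, and addressing the two
    memories independently, are assumptions.
    """
    return (operations[op_addr] << WORD_BITS) | addresses[addr_addr]
```

For example, storing `0x04030201` at address 3 of the first block yields the coefficients 1, 2, 3, 4 as the first four of the 24 values that would be sent to the DAC bank.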
2.3 Analog core
The analog processing core of ACE16K consists of an array of 128x128 locally interacting,
identical processing units arranged in a rectangular grid (see footnote a).
Fig. 3 shows the block diagram of the cell in ACE16K. Arrows indicate how information
flows. The cell contains 8 fundamental building blocks that communicate with each other by
means of the so-called ACE-BUS. Data transfers are always carried out in the same
way: one block -the data source- drives the ACE-BUS while another one -the data destination-
simultaneously acquires this information from the ACE-BUS. Since the processing
is done in the analog domain, this bus is, in practice, a single wire. In addition to the
basic analog processing kernel -which will be described later-, the cell contains the following
functional blocks: 1) An Analog Random Access Memory -LAM- with capacity
for 8 gray-scale pixel values with a resolution of 8b. 2) A Local Logic Unit, consisting of a
programmable two-input one-output logic operator. 3) A multimode optical sensor [5]. 4)
An Address Event Downloading module, which allows the chip to download, sequentially,
the location of active pixels. And 5) a resistive grid module that allows for continuous-time
diffusion in a resistive-grid-like manner.
Figure 2. Diagram of the programming block.
Figure 3. Block diagram of the PE in ACE16K.
a. In addition, a ring of surrounding blocks is used to establish the proper spatial boundary conditions and to
buffer the analog and digital instructions to the inner array.
2.3.1 Image Processing Kernel
A bank of programmable analog multipliers is used to implement the neighborhood operations
required in low-level image processing. It connects the cell with its 8 nearest neighbors
and with the cell itself. Multipliers are designed using a one-transistor technique [10],
which, in addition to the intended product term, also generates a signal-independent current
-offset- that must be cancelled afterwards. Both pixel and scaling-coefficient variables
are coded in voltage form, while the multiplier output is provided as a current.
The multipliers, in Fig. 4 (a), are driven by three different pixel values, P_A, P_B and P_C, in such
a way that the current which flows to the processing core is expressed as,
    I_{in} = A * P_A + b \cdot P_B + c \cdot P_C + z    (1)
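where the A and P_A matrices are defined as,
    A = [a_{kl}],  P_A = [P_{A,kl}],  k, l \in \{-1, 0, 1\}    (2)
i.e. the 3x3 arrays of weights and of P_A pixel values over the cell and its 8 nearest
neighbors, with a_{cc} denoting the weight of the central cell.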
The currents generated by those multipliers are collected by the input block of the
cell, shown in Fig. 4 (b). Due to the low output impedance of the one-transistor multipliers, a virtual
ground -with the appropriate voltage value Vw- must be provided by a class II current
conveyor. The undesired offset contribution generated by the multiplier topology is
subtracted from the total input current by using a high-accuracy current memory block
based on an S3I memorization scheme. Afterwards, I_{in} can be either directly steered to the
ACE-BUS or sent to the input of a current comparator, whose output can also be connected
to the ACE-BUS. When the cell is operated to produce a grey-scale result, the input
current is allowed to flow into any one -user-selectable- of the capacitors associated with the
pixels. Depending on this selection, different processing kernels are obtained. Thus, for
instance, to run a Sobel operator, we would define the operator in the A matrix, load the image
to be processed into the P_A pixel, and use c = z = 0 and
b = -1 to obtain,
    C_B \frac{dP_B}{dt} = -P_B + A * P_A    (3)
whose steady-state solution is P_B = A * P_A.
Figure 4. Distribution of Multipliers. a) Bank of Multipliers. b) Current Processing Block.
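The settling behavior described by equation (3) can be illustrated numerically. The following is a minimal sketch and not a model of the actual circuit: it assumes normalized units (the capacitor and the conversion gains are lumped into a single time constant `tau`), zero-valued border cells, and forward-Euler integration. The function names and the choice of the horizontal Sobel mask are illustrative.

```python
import numpy as np


def neighborhood_sum(A: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Weighted sum over each pixel's 3x3 neighborhood (A * P in the text).

    Zero-valued border cells are an assumption; on the chip the boundary
    conditions are set by the ring of border cells.
    """
    padded = np.pad(P, 1, mode="constant", constant_values=0.0)
    rows, cols = P.shape
    out = np.zeros((rows, cols), dtype=float)
    for di in range(3):
        for dj in range(3):
            out += A[di, dj] * padded[di:di + rows, dj:dj + cols]
    return out


def settle(A: np.ndarray, P_A: np.ndarray, tau: float = 1.0,
           dt: float = 0.05, steps: int = 400) -> np.ndarray:
    """Forward-Euler integration of tau * dP_B/dt = -P_B + A * P_A."""
    drive = neighborhood_sum(A, P_A)      # constant while P_A is held
    P_B = np.zeros_like(drive)            # capacitor assumed initially discharged
    for _ in range(steps):
        P_B += (dt / tau) * (-P_B + drive)
    return P_B


# Horizontal Sobel mask, used here as the weight matrix A.
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

if __name__ == "__main__":
    image = np.zeros((8, 8))
    image[:, 4:] = 1.0                    # vertical step edge
    result = settle(SOBEL_X, image)
    print(np.round(result[3], 3))         # one interior row of the settled P_B
```

After 400 steps of size 0.05 (20 time constants) the integrated value agrees with the direct neighborhood sum to well below 8b resolution, which is the sense in which the steady state of (3) implements the operator stored in A.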
If the capacitor which is allowed to be updated is C_A, then cells become dynamically
coupled, and we get CNN-like behavior. In addition, by allowing the current to flow into
C_A, and by defining a_{cc} = -1 and a_{ij} = 0 for all the other weights, the steady-state solution is,
    P_A = b \cdot P_B + c \cdot P_C + z    (4)
thus providing grey-scale arithmetic operations.
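The step from equation (1) to equation (4) can be written out explicitly. The derivation below assumes, as in equation (3), that the input current charges the selected capacitor and that all quantities are expressed in the same normalized units:

```latex
\begin{align*}
C_A \frac{dP_A}{dt} &= A * P_A + b \cdot P_B + c \cdot P_C + z
  && \text{(current of (1) steered into } C_A\text{)} \\
C_A \frac{dP_A}{dt} &= -P_A + b \cdot P_B + c \cdot P_C + z
  && (a_{cc} = -1,\ \text{all other weights } 0) \\
0 &= -P_A + b \cdot P_B + c \cdot P_C + z
  && \text{(steady state, } dP_A/dt = 0\text{)} \\
P_A &= b \cdot P_B + c \cdot P_C + z
\end{align*}
```

With only the central weight active, the neighborhood term reduces to the cell's own value, so the cells decouple and each one settles to a weighted sum of its own stored pixels.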
2.4 I/O interface
Compared with previous analog focal-plane processor implementations -[6], [7], [8], [9]-,
and leaving aside the increase in the number of cells, the main improvement of ACE16K is
the incorporation of a completely digital interface, not only for system control but for
digitized gray-scale image I/O as well -see Fig. 5-.
The chip incorporates 128 -one per column- DA and AD converters. The DAs, used for
image input, are based on a resistor string and an analog multiplexer, while the ADs, for image
output, follow a successive approximation approach. These converter architectures provide
a very good compromise in terms of area and power dissipation in this particular system.
On one hand, the same DACs used for image input can be used as part of the
successive approximation ADCs -comparison levels are shifted up 1/2 LSB-. On the other
hand, because the 128 converters work in parallel, a significant part of the digital circuitry
needed to control the successive approximation process can be shared in a common
peripheral block, resulting in a substantial reduction in area and power dissipation. Finally,
a self-calibration process is automatically executed at the beginning of every data conversion
to eliminate I/O-related fixed-pattern noise.
Figure 5. I/O block diagram (block labels: Control Buses, External 128x8 Register, 2x128 S&H Bank, DTBUS, 128x128 Cells Array).
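The reuse of the input DAC inside a successive approximation ADC can be sketched behaviorally as follows. This is a generic textbook successive approximation loop, not the chip's converter: the ideal resistor-string model, the function names, and the use of the [0.6, 1.4] V state signal swing of Table 1 as the conversion range are assumptions, and neither the 1/2 LSB shift of the comparison levels nor the self-calibration step is modeled.

```python
def dac(code: int, v_min: float = 0.6, v_max: float = 1.4, bits: int = 8) -> float:
    """Ideal resistor-string DAC: select tap `code` out of 2**bits equal steps."""
    if not 0 <= code < 2 ** bits:
        raise ValueError("code out of range")
    lsb = (v_max - v_min) / (2 ** bits)
    return v_min + code * lsb


def sar_adc(v_in: float, v_min: float = 0.6, v_max: float = 1.4, bits: int = 8) -> int:
    """Successive approximation conversion that reuses `dac` for its thresholds.

    Bits are decided from most to least significant: a trial bit is kept
    when the input is at or above the DAC level of the trial code.
    """
    code = 0
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)
        if v_in >= dac(trial, v_min, v_max, bits):
            code = trial
    return code


if __name__ == "__main__":
    for code in (0, 1, 127, 128, 255):
        assert sar_adc(dac(code)) == code   # DAC followed by ADC returns the code
    print(sar_adc(1.0))                      # mid-range input voltage
```

Each conversion needs one comparison per bit, so an 8b result takes eight DAC settings; because the loop control is identical for every column, it is the kind of logic that the text describes as shared among the 128 converters.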
Transferring a row to/from the chip requires 1µs. Since the chip uses a two-stage pipelined
architecture, the total time for image loading/uploading is 130µs. In order to avoid
undesirable digital coupling with the analog processing circuitry, image I/O and processing
are normally done sequentially at different times. In most practical cases, an allocation
of 140µs for image processing is more than enough -around 11 basic image processing
tasks can be executed within this time-. With this assumption, the time required to load,
process, and download a 128x128 image is about 400µs, while VGA frames would be processed
at 100 Frames/second.
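The timing budget above can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that the text does not state: that the 130µs figure is 128 row transfers plus two extra row periods for the pipeline, and that a VGA frame is handled as 5 x 4 = 20 non-overlapping 128x128 tiles processed one after another.

```python
import math

ROW_TIME_US = 1.0        # time to transfer one row (from the text)
ROWS = 128               # rows per image (from the text)
PIPELINE_STAGES = 2      # two-stage pipeline (from the text)


def io_time_us() -> float:
    """Image load or download time: 128 rows plus pipeline latency (assumed)."""
    return (ROWS + PIPELINE_STAGES) * ROW_TIME_US


def image_time_us(processing_us: float = 140.0) -> float:
    """Load, process and download one 128x128 image sequentially."""
    return io_time_us() + processing_us + io_time_us()


def vga_frames_per_second(width: int = 640, height: int = 480,
                          tile: int = 128) -> float:
    """Frame rate when a VGA frame is split into 128x128 tiles (assumed tiling)."""
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return 1e6 / (tiles * image_time_us())


if __name__ == "__main__":
    print(io_time_us())               # 130.0
    print(image_time_us())            # 400.0
    print(vga_frames_per_second())    # 125.0
```

Under these assumptions the estimate is 125 frames per second, which is consistent with the figure of about 100 VGA frames per second quoted in the text once some margin for host-side overhead is allowed.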
3 Conclusions 
A new Focal Plane Analog Programmable Array Processor has been presented. The chip
core consists of an array of 128x128 identical, locally interacting analog processing, sensing
and storing units. On-chip program memory allows the execution of complex, sequential
and/or bifurcation-flow image processing algorithms. The system is especially suited
for real-time, concurrent image sensing and processing applications, with maintained
complex processing rates in the range of 100 VGA frames per second and a power dissipation
below 4W. Its fully digital interface allows an easy interconnection with conventional
digital systems.
References 
1. B. Roska and F. Werblin, “Vertical Interactions across Ten Parallel, Stacked
Representations in the Mammalian Retina”, Nature, Vol. 410, pp. 583-587, March 2001.
2. C. Koch, H. Li (Eds.), Vision Chips: Implementing Vision Algorithms with Analog VLSI
Circuits, IEEE Press, 1995.
3. A. Moini, Vision Chips. Kluwer Academic Publishers, 2000.
4. T. Roska and A. Rodríguez-Vázquez (Editors), Towards the Visual Microprocessor.
John Wiley & Sons Ltd., 2000.
5. G. Liñán, A. Rodríguez-Vázquez, S. Espejo, R. Domínguez-Castro and E. Roca, “A
Multimode Gray-Scale CMOS Optical Sensor for Visual Computers”, Submitted to this
Conference.
6. G. Liñán, P. Földesy, S. Espejo, R. Domínguez-Castro and A. Rodríguez-Vázquez, “A
0.5µm CMOS 10^6 Transistors Analog Programmable Array Processor for Real-Time
Image Processing”, Proc. of the 25th European Solid-State Circuits Conference, pp.
358-36, Duisburg-Germany, Sept. 1999.
7. R. Domínguez-Castro, S. Espejo, A. Rodríguez-Vázquez, R. Carmona, P. Földesy, Á.
Zarándy, P. Szolgay, T. Szirányi and T. Roska, “A 0.8 µm CMOS Programmable Mixed-
Signal Focal-Plane Array Processor with On-Chip Binary Imaging and Instructions
Storage”, IEEE Journal of Solid-State Circuits, Vol. 32, No. 7, pp. 1013-1026, July
1997.
8. A. Paasio, A. Dawidziuk, K. Halonen and V. Porra, “Minimum Size 0.5µm CMOS
Programmable 48 x 48 CNN Test Chip”, Proc. of the 1997 European Conference on
Circuit Theory and Design, pp. 154-156, Budapest, Hungary, September 1997.
9. P. Kinget and M. Steyaert, “Analog VLSI Integration of Massive Parallel Processing
Systems”, Ed. Kluwer Academic Publishers, 1996.
10. G. Liñán, Design of Programmable Vision Chips with Low Power Consumption Levels,
Ph.D. Thesis, University of Seville, to be published in 2002.
