FAST AND EFFICIENT MULTI-LAYER CNN-UM EMULATOR USING FPGA by Nagy, Zoltán & Szolgay, Péter
PERIODICA POLYTECHNICA SER. EL. ENG. VOL. 47, NO. 1-2, PP. 57-70 (2003)
FAST AND EFFICIENT MULTI-LAYER CNN-UM EMULATOR
USING FPGA
Zoltán NAGY and Péter SZOLGAY 1
Department of Image Processing and Neurocomputing
University of Veszprém
Egyetem u. 10, H-8200 Veszprém, Hungary
e-mail: nagyz@almos.vein.hu
Received: August 30, 2002; Revised: May 27, 2003 
Abstract 
A new emulated digital multi-layer CNN-UM chip architecture called Falcon has been developed.
Simulation running time can be a hundred times shorter using the Falcon processor array than with
software simulation. This computing power makes real time image processing possible. In this
paper the main steps of the FPGA implementation and optimization are introduced. The Distributed
Arithmetic technique is used to optimize the architecture on FPGAs. Using this technique, smaller
and faster arithmetic units can be designed than with the conventional approach, where multiplier
cores and adder trees are used to compute the state equation of the CNN array.
Keywords: cellular neural networks, reconfigurable computing.
1. Introduction 
A Cellular Neural Network (CNN) is a non-linear dynamic processor array. Its extended
version, the CNN Universal Machine (CNN-UM), was invented in 1993 [1]. The
main application area of this architecture is 2D signal or image processing. The most
effective implementation of the CNN-UM architecture seems to be analog VLSI.
The latest analog CNN chip has a 128 x 128 pixel resolution and its equivalent
computing power is 4 tera operations/second, but its computational precision is only
about 7 or 8 bits [2]. In many applications these parameters are not high enough. If the
resolution is higher, we do not need to slice the images. If the precision is higher,
less robust or more sophisticated analogical algorithms can be used [3].
A multi-layer CNN array can be used to solve the state equation of a complex
dynamical system. Currently the only method to solve the state equation of a
multi-layer CNN array is software simulation. If every layer has a different time constant,
a very small simulation step size must be chosen, thus software simulation is very
slow. To achieve affordable runtimes, the simulations have to be accelerated. This
motivation came from the analysis of a retina model [4].
1 Also affiliated to Analogic and Neural Computing Laboratory, Computer and Automation
Institute of HAS, Kende u. 13-17, H-1111 Budapest, Hungary
The Falcon emulated digital CNN chip was designed to reach these goals.
A flexible emulated digital CNN-UM was developed in which the accuracy,
template size, cell array size and the number of layers can be configured. This
paper describes the synthesis, implementation and optimization methods used to
implement the Falcon processor array on FPGA.
2. Problem Statement 
The Falcon architecture is designed to solve the full range model of a CNN cell (1).

$$\dot{x}_{ij}(t) = \sum_{k=0}^{2n}\sum_{l=0}^{2n} A_{k,l}\, x_{i+k-n,\,j+l-n}(t) + \sum_{k=0}^{2n}\sum_{l=0}^{2n} B_{k,l}\, u_{i+k-n,\,j+l-n}(t) + I_{ij} \tag{1}$$

where x, u and I are the state, input and bias values of the CNN cell, n is the
neighbourhood value, A is the feedback and B is the feed-forward template. The templates
are (2n + 1) x (2n + 1) sized matrices. At the edges of the CNN array we use
zero-flux boundary conditions, i.e. the values of the edge cells are duplicated.
The state equation of the CNN array is solved by forward Euler discretization.
The time step h can be inserted into the templates A and B; the modified
templates are denoted by A' and B'. Usually the input values do not change for several
time steps, so the state equation (1) can be broken into two parts: the feedback part (2)
and the input part (3), which is computed once at the beginning of the emulation.

$$x_{ij}(m+1) = \sum_{k=0}^{2n}\sum_{l=0}^{2n} A'_{k,l}\, x_{i+k-n,\,j+l-n}(m) + g_{ij} \tag{2}$$

$$g_{ij} = \sum_{k=0}^{2n}\sum_{l=0}^{2n} B'_{k,l}\, u_{i+k-n,\,j+l-n} + h \cdot I_{ij} \tag{3}$$
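The split into an input part and a feedback part can be illustrated with a minimal pure-Python sketch. It is not the Falcon implementation: the function names are hypothetical, the folding of h into the templates (A' = h·A plus 1 at the centre, B' = h·B) is the usual forward Euler form and is an assumption here, and the state limitation of the full range model is left out because the text does not spell it out.

```python
def clamp(index, size):
    """Zero-flux boundary: an out-of-range index reuses the nearest edge cell."""
    return min(max(index, 0), size - 1)


def template_sum(template, cells, i, j):
    """Sum template[k][l] * cells[i+k-n][j+l-n] over the (2n+1) x (2n+1) window."""
    n = len(template) // 2
    rows, cols = len(cells), len(cells[0])
    total = 0.0
    for k in range(2 * n + 1):
        for l in range(2 * n + 1):
            total += template[k][l] * cells[clamp(i + k - n, rows)][clamp(j + l - n, cols)]
    return total


def euler_templates(a, b, h):
    """Assumed folding of the time step: A' = h*A plus 1 at the centre, B' = h*B."""
    n = len(a) // 2
    a_mod = [[h * v for v in row] for row in a]
    a_mod[n][n] += 1.0
    b_mod = [[h * v for v in row] for row in b]
    return a_mod, b_mod


def input_part(b_mod, u, bias, h):
    """Eq. (3): g = B' applied to u, plus h * I; computed once per input image."""
    return [[template_sum(b_mod, u, i, j) + h * bias[i][j]
             for j in range(len(u[0]))] for i in range(len(u))]


def feedback_step(a_mod, x, g):
    """Eq. (2): x(m+1) = A' applied to x(m), plus g."""
    return [[template_sum(a_mod, x, i, j) + g[i][j]
             for j in range(len(x[0]))] for i in range(len(x))]
```

In use, `input_part` is called once and `feedback_step` is then iterated, which mirrors the observation that the input contribution does not change between time steps.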
In the case of the multi-layer CNN we have the following set of state equations:

$$\dot{x}_{p,i,j}(t) = \sum_{q=0}^{r-1}\sum_{k=0}^{2n}\sum_{l=0}^{2n} A_{p,q,k,l}\, x_{q,\,i+k-n,\,j+l-n}(t) + \sum_{q=0}^{r-1}\sum_{k=0}^{2n}\sum_{l=0}^{2n} B_{p,q,k,l}\, u_{q,\,i+k-n,\,j+l-n}(t) + I_{p,i,j} \tag{4}$$

where r is the number of layers, x, u and I are the state, input and bias vectors
of one multi-layer cell. After discretization the following set of equations can be
derived:

$$x_{p,i,j}(m+1) = \sum_{q=0}^{r-1}\sum_{k=0}^{2n}\sum_{l=0}^{2n} A'_{p,q,k,l}\, x_{q,\,i+k-n,\,j+l-n}(m) + g_{p,i,j} \tag{5}$$

$$g_{p,i,j} = \sum_{q=0}^{r-1}\sum_{k=0}^{2n}\sum_{l=0}^{2n} B'_{p,q,k,l}\, u_{q,\,i+k-n,\,j+l-n} + h \cdot I_{p,i,j} \tag{6}$$

3. The Falcon CNN-UM Architecture
3.1. I/O Requirements
The first issue to be considered at design time is the I/O requirement
of the processor core. Even in the simplest case (when the neighbourhood n = 1)
we need to load 9 template, 9 state and 1 constant values to update a cell: this is
19 values altogether. Obviously we cannot provide all these values from an
external memory in real time. On the other hand, we cannot store the whole picture
on the chip because of the memory limitations of the FPGAs. When the number
of templates is low, it is natural to store the templates on the chip, and this solution
cuts the input requirements by half. But we still have to load 10 values to update
one cell, which requires very high bandwidth, so we must analyze the data-flow of
Eq. (2).
When updating the cell array, the state value $x_{i,j}(m)$ of a cell must be loaded
when its upper left neighbour $x_{i-1,j-1}(m+1)$ is computed. The state value
$x_{i,j}(m)$ is still required for the next two updates, $x_{i-1,j}(m+1)$ and
$x_{i-1,j+1}(m+1)$, and in the following two lines $x_{i,j}(m)$ again appears 3 times
per line. Because $x_{i,j}$ appears when we update lines i - 1, i and i + 1, the simplest
way to reduce memory bandwidth is to store these three lines of the picture on the FPGA,
see Fig. 1. For larger neighbourhood values the same reasoning applies, but 2n + 1
lines must be stored. The only drawback of this method is that we also have to store the
constants on the FPGA; fortunately, only n + 1 lines have to be stored. After these
optimizations the I/O requirements of the processor are reduced to two input (one state
and one constant) and two output operations per cell update. The memory size required to
store the belt can be computed by the following equation:

$$\big((2n + 1) \cdot sw + (n + 1) \cdot cw\big) \cdot w \tag{7}$$

where n is the neighbourhood value, sw and cw are the widths of the state and
constant values in bits and w is the width of the cell array.
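Eq. (7) and the input bandwidth can be checked numerically against the values listed in Table 1. The sketch below is only an illustration; the function names are hypothetical, and it assumes that 1 kbit = 1024 bits and 1 GB = 10^9 bytes, which is what reproduces the table entries.

```python
def belt_memory_bits(n, state_width, const_width, array_width):
    """Eq. (7): (2n+1) state lines plus (n+1) constant lines, each array_width cells wide."""
    return ((2 * n + 1) * state_width + (n + 1) * const_width) * array_width


def input_bandwidth_bytes_per_s(state_width, const_width, clock_hz, cycles_per_cell=1):
    """One state and one constant value are loaded per cell update."""
    return (state_width + const_width) / 8 * clock_hz / cycles_per_cell


# 3 x 3 template (n = 1), 4 bit values, 256 cell wide array, 150 MHz clock
print(belt_memory_bits(1, 4, 4, 256) / 1024)          # on-chip memory in kbit
print(input_bandwidth_bytes_per_s(4, 4, 150e6) / 1e9)  # input bandwidth in GB/s
```

For 4 bit values and a 3 x 3 template this gives 5 kbit and 0.15 GB/s.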
Fig. 1. The belt of the array stored on the chip in case of 3 x 3 and 5 x 5 templates
Table 1. I/O and memory requirements of the processor after optimization (assuming a 256
cell wide array and 150 MHz clock frequency)

State and constant width (bit)       4      6      8      12     16     24     32
Input bandwidth (GB/s)               0.15   0.225  0.3    0.45   0.6    0.9    1.2
On-chip memory requirements (kbit)
  3 x 3 template                     5      7.5    10     15     20     30     40
  5 x 5 template                     8      12     16     24     32     48     64
  7 x 7 template                     11     16.5   22     33     44     66     88
3.2. Conventional Arithmetic Unit 
After the I/O requirements of the processor cores have been reduced, the next
question is how to organize the computing resources in the arithmetic unit. How
many multipliers can we use efficiently? To achieve the highest performance,
(2n + 1) x (2n + 1) multipliers can be used in the arithmetic unit. An adder tree
is also required to sum the multiplied values. Using this arithmetic unit, one clock
cycle is required to update a cell value. Unfortunately, this arithmetic unit is very
large because the multipliers require a large area, see Table 2.
This area can be reduced if the number of multipliers is decreased and a
new cell value is computed serially, for example, in a row-wise order. In this case,
2n + 1 multipliers, an adder tree and an accumulator register to store the partial
results are used. The template operation is computed in 2n + 1 clock cycles. A new
block, called mixer, should be used between the memory and the arithmetic unit
to simplify the control and the architecture of the memory unit. The mixer holds a
vertical stripe of the cell array belt from the memory unit (dark grey area in Fig. 1).
Storing these values in a small memory allows the same memory unit to be used as in
the previous case. The I/O requirements of the processor core are also reduced by
this solution, in proportion to the number of clock cycles required for the cell update.
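The difference between the fully parallel and the row-wise serial organization can be sketched in a few lines. This is a behavioural illustration only, with hypothetical function names; it models the multipliers, adder tree and accumulator as plain arithmetic and counts clock cycles under the assumption of one template row per cycle.

```python
def full_parallel_update(template, window):
    """(2n+1) x (2n+1) multipliers: every product is formed in the same clock cycle."""
    return sum(t * w for t_row, w_row in zip(template, window)
               for t, w in zip(t_row, w_row))


def row_serial_update(template, window):
    """2n+1 multipliers: one template row per clock cycle, summed in an accumulator."""
    accumulator = 0
    cycles = 0
    for t_row, w_row in zip(template, window):
        accumulator += sum(t * w for t, w in zip(t_row, w_row))  # multipliers + adder tree
        cycles += 1
    return accumulator, cycles


template = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
window = [[1, 0, -1], [2, 3, 0], [0, 1, 1]]
print(full_parallel_update(template, window))  # 19
print(row_serial_update(template, window))     # (19, 3)
```

Both variants return the same sum; the serial one trades 2n + 1 clock cycles for a third of the multipliers in the 3 x 3 case.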
3.3. Distributed Arithmetic 
Distributed arithmetic is a bit level rearrangement of a multiply-accumulate to hide
the multiplication [5]. It is a powerful technique for reducing the size of a parallel
hardware multiply-accumulate that is well suited to FPGA designs. Distributed
arithmetic is widely used in FIR filter implementations on FPGAs [6].

Table 2. Area requirements of a 3 x 3 arithmetic unit in Virtex slices

3 multipliers
Template width (bit)     State and constant width (bit)
                         4      6      8      12     16     24     32
4                        66     87     102    138    174    246    318
6                        87     135    165    219    273    381    489
8                        102    165    192    252    312    432    552
12                       138    219    252    396    486    666    846
16                       174    273    312    486    603    819    1035

9 multipliers
Template width (bit)     State and constant width (bit)
                         4      6      8      12     16     24     32
4                        211    276    323    435    547    771    995
6                        276    422    514    680    846    1178   1510
8                        323    514    597    781    965    1333   1701
12                       435    680    781    1217   1491   2039   2587
16                       547    846    965    1491   1846   2502   3156
The conventional FIR filter computes the following convolution sum:

$$y(k) = \sum_{n=0}^{N-1} a(n)\, x(k-n) \tag{8}$$

where y(k) is the response of the filter at time k, a(n) are the filter coefficients,
x(k - n) is the input sample of the filter and N is the number of filter coefficients.
The input sample can be written in the following fractional format:

$$x(k) = -x_{B-1}(k) \cdot 2^{B-p-1} + \sum_{b=0}^{B-2} x_b(k) \cdot 2^{b-p} \tag{9}$$

where $x_b$ is a binary variable, B is the width of x and p is the position of the radix
point. If Eq. (9) is substituted into Eq. (8), the following equation can be derived:

$$\begin{aligned} y(k) ={} & \sum_{n=0}^{N-1} -a(n)\, x_{B-1}(k-n) \cdot 2^{B-p-1} + \sum_{n=0}^{N-1} a(n)\, x_{B-2}(k-n) \cdot 2^{B-p-2} + \cdots \\ & + \sum_{n=0}^{N-1} a(n)\, x_{1}(k-n) \cdot 2^{-p+1} + \sum_{n=0}^{N-1} a(n)\, x_{0}(k-n) \cdot 2^{-p} \end{aligned} \tag{10}$$
Eq. (10) can be computed serially by the architecture depicted in Fig. 2, which
contains only look-up tables (LUTs), shift registers and one scaling adder.

Fig. 2. Serial distributed arithmetic FIR filter (blocks: parallel to serial converter (PSC),
N-1 B-bit shift registers, partial product LUT, scaling accumulator with add/subtract)

If the input samples are represented with B bits of precision, B clock cycles
are required to complete the calculation. Additional speed can be achieved by using
two or more partial product LUTs and a scaling adder tree to sum the partial products
[7]. To achieve maximum performance, a fully parallel distributed arithmetic FIR
filter can be built, which computes a new result in a single clock cycle. The
only drawback of the architecture is that space variant CNN templates cannot be
used, because the coefficients in the partial product LUT cannot be changed while the
emulation is running. The cycle length of the arithmetic unit is determined by the
parallelism of the FIR filters. A trade-off can be made between speed and area by
using a fully serial or a fully parallel FIR filter.
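The serial evaluation of Eq. (10) can be sketched behaviourally as follows. This is an illustration under stated assumptions rather than the Falcon arithmetic unit: coefficients and samples are modelled as Python integers, samples are assumed to fit into `width`-bit two's complement, bits are processed LSB first, and all function names are hypothetical.

```python
def build_partial_product_lut(coeffs):
    """Entry `addr` holds the sum of the coefficients whose address bit is set."""
    return [sum(c for n, c in enumerate(coeffs) if (addr >> n) & 1)
            for addr in range(1 << len(coeffs))]


def da_fir_output(coeffs, samples, width, radix=0):
    """Evaluate Eq. (10) bit-serially; samples[n] is the integer code of x(k-n)."""
    lut = build_partial_product_lut(coeffs)
    accumulator = 0
    for b in range(width):  # one clock cycle per bit of the input samples
        address = 0
        for n, sample in enumerate(samples):
            # >> on a negative Python int is arithmetic, so this yields two's complement bits
            address |= ((sample >> b) & 1) << n
        partial = lut[address] * (1 << b)  # scaling by the bit weight
        if b == width - 1:
            accumulator -= partial  # the sign bit carries a negative weight
        else:
            accumulator += partial
    return accumulator / (1 << radix)  # radix point position p


coeffs = [3, -2, 5]
samples = [7, -8, -1]
print(da_fir_output(coeffs, samples, width=4))         # 32.0
print(sum(c * s for c, s in zip(coeffs, samples)))     # 32, direct multiply-accumulate
```

No multiplication between a coefficient and a sample is ever formed; the only products are scalings by powers of two, which are shifts in hardware.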
According to Eq. (2) this FIR filter architecture should be extended to 2
dimensions. This can be done by slightly modifying the shift register section and
using a larger partial product LUT, as shown in Fig. 3. The inputs of the 2 dimensional
FIR filter are connected to the cell memory, which stores a belt of the cell array.
Increasing the number of inputs of the partial product LUT greatly increases its
size. Using 3 x 3 sized templates the partial product LUT has 9 inputs and the area
requirement is 9 slices for every bit of the partial product. This area requirement can
be reduced to 3 slices/bit by using the architecture in Fig. 3, where the coefficients
are grouped to fit into a 4 input FPGA LUT and adders are used to calculate the
final partial products. Area requirements of the arithmetic unit built from various
distributed arithmetic FIR filters are summarized in Table 3.

Fig. 3. 2 dimensional serial distributed arithmetic FIR filter
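The effect of grouping the coefficients can be illustrated by comparing one 9-input partial product table with several small tables whose outputs are added. The sketch below assumes groups of at most 4 inputs (4 + 4 + 1), matching the 4 input LUT mentioned in the text; the exact grouping used in Fig. 3 is not recoverable from the text, and the function names and coefficient values are hypothetical.

```python
def build_lut(coeffs):
    """Partial product table: entry `addr` sums the coefficients whose bit is set."""
    return [sum(c for n, c in enumerate(coeffs) if (addr >> n) & 1)
            for addr in range(1 << len(coeffs))]


def address(bits):
    """Pack a list of 0/1 values into a table address (bit n selects coefficient n)."""
    return sum(bit << n for n, bit in enumerate(bits))


def grouped_luts(coeffs, group_size=4):
    """Split the coefficients into groups that each fit a small LUT."""
    return [build_lut(coeffs[s:s + group_size])
            for s in range(0, len(coeffs), group_size)]


def grouped_lookup(luts, bits, group_size=4):
    """Look up every group separately and add the results (the extra adders)."""
    return sum(table[address(bits[g * group_size:(g + 1) * group_size])]
               for g, table in enumerate(luts))


coeffs = [3, -1, 4, 1, -5, 9, 2, -6, 5]        # arbitrary 3 x 3 template, row by row
single = build_lut(coeffs)
small = grouped_luts(coeffs)
print(len(single), [len(t) for t in small])    # 512 [16, 16, 2]
```

Both organizations return identical partial products for every input pattern, while the grouped version stores 34 table entries instead of 512.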
The main advantage of this approach is its easy scalability. With the conventional
arithmetic unit, the scalability is limited to 3 cases: (2n + 1) x (2n + 1),
2n + 1 or just one multiplier. In the case of distributed arithmetic, the cycle
length is determined by the width of the state value; for example, in a 12 bit case
the cycle length can be 1, 2, 3, 4, 6 or 12 clock cycles/cell. The area requirements of
the distributed arithmetic units are usually smaller than those of the conventional
approach, especially when the template width is high.
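The cycle lengths listed for the 12 bit case are the divisors of the state width, which suggests that the state bits are split evenly over the clock cycles. A minimal sketch under that assumption (the function name is hypothetical):

```python
def possible_cycle_lengths(state_width):
    """Cycle lengths for which the state bits divide evenly over the clock cycles."""
    return [c for c in range(1, state_width + 1) if state_width % c == 0]


for cycles in possible_cycle_lengths(12):
    print(cycles, "cycle(s)/cell,", 12 // cycles, "bit(s) processed per cycle")
```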
Table 3. Size of the 3 x 3 arithmetic unit in Virtex slices

2 clock cycles/cell
Template width (bit)     State and constant width (bit)
                         4      6      8      12     16     24     32
4                        113    149    174    259    324    504    649
6                        129    171    202    299    376    580    749
8                        145    193    230    338    428    655    849
12                       177    237    286    418    532    807    1049
16                       209    281    342    498    648    959    1249

1 clock cycle/cell
Template width (bit)     State and constant width (bit)
                         4      6      8      12     16     24     32
4                        148    216    266    412    520    824    1069
6                        175    255    317    487    619    971    1264
8                        202    293    368    563    719    1118   1459
12                       256    371    470    713    918    1412   1849
16                       309    449    572    863    1116   1706   2239
3.4. Achieving Even More Performance 
Using the largest Virtex-II series FPGA from Xilinx, which contains 46,592 configurable
logic blocks (slices), several arithmetic units can be implemented. How
can we use this huge amount of resources to achieve more performance?
The Falcon processors can be connected in a square grid on the FPGA. The
performance of the array scales linearly with the number of processors.
The processed image is partitioned between the physical processors. Each physical
processor column works on a long and narrow vertical stripe of the image. In one
cycle a row of processor units gets the result of the previous iteration from the row
of processor units above, calculates one iteration and sends the results to the row of
processor units below. Adding columns to the grid increases the input
bandwidth of the whole array, so the number of available user I/O pins on the FPGA
device limits the number of columns.
3.5. Multi-Layer Extension: the Falcon-ML Processor 
To emulate a multi-layer CNN array we have to make some modifications to the
original Falcon architecture. The main structure is not changed and the processor
cores are arranged in a square grid. In the multi-layer case the same optimizations
can be made to reduce the I/O bandwidth as in the single-layer case. The memory
requirements and the required input bandwidth of the r-layer processor core are r
times higher than those of the single-layer architecture.
The Falcon-ML processor emulates a general multi-layer CNN array where
every layer is connected to every other layer in all possible ways. This means that the
arithmetic unit must do r^2 times more work than in the single-layer case. Templates in
the multi-layer case can be treated as r x r single-layer templates, and r single-layer
arithmetic units can compute the template operation for every layer. In
high-precision cases this arithmetic unit may require a very large area; in
this case the parallelism is reduced, and one multiplier per single-layer template or
serial distributed arithmetic FIR filters can be used in the arithmetic unit.
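Treating the multi-layer templates as r x r single-layer templates can be illustrated with a direct evaluation of the discretized multi-layer state equation. The sketch below is a plain software model with hypothetical names, not the Falcon-ML data path; it uses edge duplication at the boundary and omits any state limitation.

```python
def clamp(index, size):
    """Zero-flux boundary: an out-of-range index reuses the nearest edge cell."""
    return min(max(index, 0), size - 1)


def multilayer_step(a_mod, x, g):
    """One iteration of the discretized multi-layer state equation.

    a_mod[p][q] is the (2n+1) x (2n+1) template feeding layer q into layer p,
    x[q] is the state array of layer q and g[p] is the input part of layer p.
    """
    r = len(x)
    rows, cols = len(x[0]), len(x[0][0])
    size = len(a_mod[0][0])
    n = size // 2
    new_x = []
    for p in range(r):
        layer = [[g[p][i][j] for j in range(cols)] for i in range(rows)]
        for q in range(r):  # r single-layer template operations per output layer
            t = a_mod[p][q]
            for i in range(rows):
                for j in range(cols):
                    for k in range(size):
                        for l in range(size):
                            layer[i][j] += (t[k][l]
                                            * x[q][clamp(i + k - n, rows)][clamp(j + l - n, cols)])
        new_x.append(layer)
    return new_x


def centre_only(value):
    """Helper for the example: a 3 x 3 template with a single centre weight."""
    return [[0.0, 0.0, 0.0], [0.0, value, 0.0], [0.0, 0.0, 0.0]]


templates = [[centre_only(1.0), centre_only(0.5)],
             [centre_only(0.0), centre_only(2.0)]]
print(multilayer_step(templates, [[[1.0]], [[4.0]]], [[[0.0]], [[0.5]]]))
```

The two nested loops over p and q make the r^2 growth of the work visible: each of the r output layers needs r single-layer template operations.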
4. Features and Performance 
After these design considerations we were able to make a synthesizable RTL
(Register Transfer Level) VHDL (VHSIC Hardware Description Language)
description of the Falcon and Falcon-ML architectures. We used Synopsys FPGA
Express to synthesize our processors. The processors can be configured via a
configuration file before the synthesis.
The configurable parameters are the following: 
• the bit width and displacement of the radix point for the state, constant and 
template values, possible values for width are between 2 and 64 
• the neighborhood value of the templates 
• the width of the cell array slice 
• the number of processor core rows and columns 
• the number of layers in the multi-layer case 
The large number of configuration parameters makes it easy to synthesize a
Falcon architecture that is optimal for our requirements. If our requirements
change, the same FPGA can be used with differently configured Falcon
processors.
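The parameter set listed above can be pictured as a small configuration record. The format of the actual configuration file is not described in the text, so the following is a purely hypothetical model: all class and field names are assumptions, and only the 2 to 64 bit width range is taken from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedPointFormat:
    """Hypothetical description of one value type (state, constant or template)."""
    width: int        # total number of bits, 2..64 according to the text
    radix_point: int  # displacement of the radix point

    def __post_init__(self):
        if not 2 <= self.width <= 64:
            raise ValueError("width must be between 2 and 64 bits")


@dataclass(frozen=True)
class FalconConfig:
    """Hypothetical pre-synthesis parameter set of a Falcon processor array."""
    state: FixedPointFormat
    constant: FixedPointFormat
    template: FixedPointFormat
    neighbourhood: int   # n, giving (2n+1) x (2n+1) templates
    slice_width: int     # width of the cell array slice
    core_rows: int
    core_columns: int
    layers: int = 1      # more than 1 corresponds to the multi-layer case

    def __post_init__(self):
        for name in ("neighbourhood", "slice_width", "core_rows", "core_columns", "layers"):
            if getattr(self, name) < 1:
                raise ValueError(name + " must be at least 1")

    @property
    def template_size(self):
        return 2 * self.neighbourhood + 1

    @property
    def core_count(self):
        return self.core_rows * self.core_columns


fmt = FixedPointFormat(width=24, radix_point=16)
cfg = FalconConfig(state=fmt, constant=fmt, template=FixedPointFormat(12, 8),
                   neighbourhood=1, slice_width=256, core_rows=2, core_columns=3)
print(cfg.template_size, cfg.core_count)  # 3 6
```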
The performance of the Falcon processor is compared to the speed of the software
simulation and the CASTLE emulated digital CNN-UM [8]. In the software
simulations, Intel Pentium IV processors with DDR RAM (Double Data Rate) and RDRAM
(Rambus) and an AMD Athlon XP processor are used. To simulate a CNN array,
functions of the Intel Signal Processing Library [9] are used, which contains MMX
optimized functions for various signal and vector processing tasks. The performance
of the software simulation depends on the size of the cell array. If the size of
the data set is greater than the Level 2 cache of the microprocessor, the performance
drops to a lower level, which is determined by the FSB (Front Side Bus) frequency
of the processor.
The Falcon and the CASTLE processor arrays do not perform any rounding
until the final step of the computation, thus the precision used inside the processor
is higher than the input precision, and this must be considered in the comparison.
We selected 24-bit precision for the comparison with the double precision floating point
simulation.
Timing analysis of the implemented Falcon processor using distributed arithmetic
shows that the processor core can run at 200 MHz using 24 bit precision
and a new value can be computed in every clock cycle. The CASTLE processor
runs at a 125 MHz clock frequency and computes a new cell value in 3 clock cycles.
Table 4 shows the performance of the software simulation and the emulated digital
architectures in CNN iterations/s for different numbers of layers and cell
array sizes. The Falcon processor seems to be faster than the CASTLE architecture,
but we have to note that the Virtex-II FPGAs use 0.15 µm technology while the
CASTLE processor is manufactured with 0.35 µm technology. If the same technology is
used, the custom VLSI chip will be faster. In the single layer case the Falcon emulated
digital CNN-UM architecture offers 27.63 times more performance than the software
simulation. In the multi-layer configuration the speedup is more significant, but
a larger area is required to implement the processor cores. The results show that
the Falcon and the CASTLE architectures are considerably faster than the software
simulation, even in a single processor configuration.
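The Falcon entries of Table 4 follow from the clock frequency and the number of cells: with one cell update per clock cycle, one iteration takes as many cycles as there are cells. A minimal sketch of this model (the function name is hypothetical, and the `cores` factor simply encodes the linear scaling stated in the text):

```python
def iterations_per_second(clock_hz, cycles_per_cell, width, height, cores=1):
    """CNN iterations/s when every cell update takes `cycles_per_cell` clock cycles."""
    return cores * clock_hz / (cycles_per_cell * width * height)


for width, height in [(180, 135), (320, 200), (640, 480)]:
    print(width, "x", height, round(iterations_per_second(200e6, 1, width, height), 2))
```

For the 640 x 480 array this gives about 651 iterations/s; dividing by the 23.56 iterations/s of the fastest software configuration yields the quoted factor of 27.63.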
Table 4. Performance of the software simulation, the Falcon and the CASTLE architecture
in CNN iteration/s

Single layer
Array size   Athlon XP 1800+   PIV 1.7 GHz + DDR RAM   PIV 1.8 GHz + RDRAM   CASTLE*     Falcon*
180 x 135    338.88            403.65                  451.15                1,643.78    8,230.45
320 x 200    84.72             122.20                  148.51                650.94      3,125.00
640 x 480    17.59             20.18                   23.56                 135.61      651.04
Speedup      0.75              0.86                    1.00                  5.76        27.63

3 layers
Array size   Athlon XP 1800+   PIV 1.7 GHz + DDR RAM   PIV 1.8 GHz + RDRAM   Falcon*
180 x 135    39.03             46.18                   52.07                 8,230.45
320 x 200    13.77             16.67                   18.98                 3,125.00
640 x 480    2.73              3.04                    3.45                  651.04
Speedup      0.79              0.88                    1.00                  188.93

*Performance of the single processor core configuration
The strength of the emulated digital architectures is that multiple processor
cores can be connected to work in parallel. The area required to implement a processor
core depends on the accuracy, template size, cell array slice width and the number of
layers. Table 5 shows the number of implementable processor cores on different
FPGAs. Using low precision, more than a hundred processor cores can be implemented
on the largest FPGA. If an array of processor cores is used, the performance
scales linearly with the number of processors. The result is a 500-fold
speedup compared to the software simulation using moderate accuracy.
Table 5. Number of implementable Falcon and Falcon-ML processor cores on different
FPGAs

             State and template width (bit)
             Single layer            3 layers
FPGA         6      12     24        6      12     24
v300         8      3      1         1      0      0
v1000        32     15     6         6      2      0
v3200        85     41     16        16     6      2
2v1000       13     6      2         2      1      0
2v8000       123    59     24        23     9      3
5. Examples 
5.1. Image Halftoning 
The first example is the 5 x 5 halftoning template from the CNN Template Library 
[10]. This template converts greyscale images into black and white images preserv-
ing the main features of the image. This function is implemented by the following 
template: 
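$$A = \begin{bmatrix} -0.03 & -0.09 & -0.13 & -0.09 & -0.03 \\ -0.09 & -0.36 & -0.6 & -0.36 & -0.09 \\ -0.13 & -0.6 & 0.05 & -0.6 & -0.13 \\ -0.09 & -0.36 & -0.6 & -0.36 & -0.09 \\ -0.03 & -0.09 & -0.13 & -0.09 & -0.03 \end{bmatrix}, \quad B = \begin{bmatrix} 0 & 0 & 0.07 & 0 & 0 \\ 0 & 0.36 & 0.76 & 0.36 & 0 \\ 0.07 & 0.76 & 2.12 & 0.76 & 0.07 \\ 0 & 0.36 & 0.76 & 0.36 & 0 \\ 0 & 0 & 0.07 & 0 & 0 \end{bmatrix}, \quad I = 0.$$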
This example was run on our prototyping board with a Xilinx Virtex-300 FPGA.
Two processor configurations were used. In the first case one Falcon processor was
used with 16 bit wide state and 8 bit wide template values. In the second case the
template width remained 8 bit but the state width was decreased to 8 bit, which
enabled us to implement 4 Falcon processors. Each processor used 3 multipliers to
compute the results. The simulation time step was 25/128 to make the template value
representation more accurate. 100 simulation time steps were run on an AMD Athlon
XP 1800+ processor and on both Falcon processor configurations.
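One plausible reading of the 25/128 time step is that it turns the two-decimal template values into binary fractions: multiplying a value with two decimal places by 25/128 leaves only powers of two in the denominator. The sketch below checks this with exact rational arithmetic; it is an interpretation rather than a statement from the text, the function name is hypothetical, and how the scaled values map onto the 8 bit template format is not stated in the text.

```python
from fractions import Fraction

H = Fraction(25, 128)
# Distinct magnitudes appearing in the halftoning templates A and B
TEMPLATE_VALUES = ["0.03", "0.09", "0.13", "0.36", "0.6", "0.05", "0.07", "0.76", "2.12"]


def fraction_bits_needed(value, max_bits=32):
    """Smallest number of fraction bits that represents `value` exactly, else None."""
    for bits in range(max_bits + 1):
        if (value * 2 ** bits).denominator == 1:
            return bits
    return None


for text in TEMPLATE_VALUES:
    scaled = Fraction(text) * H
    print(text, "->", scaled, "needs", fraction_bits_needed(scaled), "fraction bits")
```

Every scaled value is an exact multiple of 2^-9 (the larger ones of 2^-7), whereas a decimal step such as 1/5 would leave values like 0.006 that have no finite binary representation.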
The input, output and error images are shown in Fig. 4 to 9 and the runtime and 
speed up of the computation are summarized in Table 6. The error images show the 
difference between the simulated and the emulated images, black pixel represents 
error larger than 0.01 and white pixel represents no error. In the 16 bit case, most 
of the errors are near to the edges of the image. The main source of these errors 
is the different boundary conditions. In the simulation, fixed boundary conditions 
were used, while the Falcon processor used zero-flux boundary conditions. In the 
8 bit case the number of erroneous pixels is higher but these errors on the output 
image are not noticeable by a human observer. 

Table 6. Runtime and speedup of the examples

                                        Runtime of 100 iterations (s)   Speedup
Athlon XP 1800+                         1.97475                         1
1 Falcon processor, 16 bit wide state   0.65497                         3.015
4 Falcon processors, 8 bit wide state   0.163782                        12.0571
5.2. Emulating a 3-Layer Retina Model 
The simulation of a retina model motivated the development of the multi-layer
Falcon-ML architecture [4]. The retina is modelled with a 3-layer CNN array where
every layer has a different time constant. One control and six feedback templates
describe the connections between the layers. Some template elements have large
values (±60,000) while others require fine resolution (0.2). The minimum time step
required to simulate the array is 0.0001, and the size of the input images is 180 x 135
pixels.
The implementation of a Falcon-ML processor which can emulate such a
CNN array on our prototyping board was a very challenging task. After examining
the templates we found that at least 28 bit wide state and 19 bit wide template values
should be used. Unfortunately, the Virtex-300 FPGA on our prototyping board
has very limited resources, so some modifications were required to implement the
Falcon-ML architecture with these parameters.
As the first step the memory unit was changed, and the on-chip SRAM memories
are used to store a belt of the image. The height of the belt stored in the memory
unit can be reduced to two lines instead of three because only one processor core is
used. The size of the arithmetic unit is also reduced by using serial FIR filters that
process two bits at a time, so 14 clock cycles are required to update a cell. The
displacement of the template values can be configured independently for every layer, and
the template precision can be reduced to 9 or even 4 bits depending on the values used in
the templates.
This specialized Falcon-ML architecture can be implemented on our prototyping
board. The processor runs at only 100 MHz clock frequency because of
the limitations of the Virtex-300 FPGA on our prototyping board, and it computes
a new value every 14th clock cycle. The performance of this slow processor is 293
CNN iterations/s on a 180 x 135 pixel image, which is 6 times faster than a
Pentium IV 1.8 GHz processor, see Table 4. Using faster memories, a higher speed
grade FPGA or the more advanced Virtex-E and Virtex-II FPGAs, 30 times
higher performance can be easily achieved.
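The figures quoted for the retina emulator are mutually consistent, which can be checked with a short calculation: a 28 bit state processed two bits at a time gives 14 cycles per cell, and 100 MHz spread over 14 cycles and 180 x 135 cells gives roughly 293 iterations/s. The function names below are hypothetical, and the software reference value of 52.07 iterations/s is the 3-layer Pentium IV 1.8 GHz entry for the 180 x 135 array.

```python
def serial_cycles_per_cell(state_width, bits_per_cycle):
    """Clock cycles needed when the state is processed `bits_per_cycle` bits at a time."""
    return -(-state_width // bits_per_cycle)  # ceiling division


def iterations_per_second(clock_hz, cycles_per_cell, width, height):
    """CNN iterations/s for a single processor core."""
    return clock_hz / (cycles_per_cell * width * height)


cycles = serial_cycles_per_cell(28, 2)
rate = iterations_per_second(100e6, cycles, 180, 135)
print(cycles, int(rate), round(rate / 52.07, 1))
```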
6. Conclusions 
The implementation of the Falcon architecture was successful on our prototyping
board, using a Virtex-300 FPGA from Xilinx Inc. The performance of the architecture
was encouraging: even in a single processor configuration a 27-fold speedup
can be achieved. The easy scalability of the array makes it possible to connect
the processor cores and achieve even more performance. Using re-configurable
devices to implement the Falcon architecture provides more flexibility compared
to the conventional emulated digital architectures, e.g., different configurations can
be used on the same hardware and no extra design effort is required to implement
them.
If the forward Euler method is used, a very small time step is required for precise
emulation of the CNN dynamics, mainly in the multi-layer case. Instead of computing
thousands of iterations, a better numerical method should be used, where the final
value of the iteration is computed from several substeps, or adaptive step size control
can be used. A software implementation of a sophisticated numerical method can be
slower than the forward Euler method. Re-configurable hardware can be used to
implement such an algorithm to improve performance and make very precise and
fast emulation of various CNN architectures possible.
Acknowledgements 
This research is sponsored by the National Research and Development Funds of the
Széchenyi Plan under the consortium NKFP OM-2/052/2001 and OTKA #029609.
References 
[1] ROSKA, T. - CHUA, L. O., The CNN Universal Machine: An Analogic Array Computer, IEEE
Trans. on Circuits and Systems-II, 40 (1993), pp. 163-173.
[2] LIÑÁN, G. - DOMÍNGUEZ-CASTRO, R. - ESPEJO, S. - RODRÍGUEZ-VÁZQUEZ, A.,
ACE16k: A Programmable Focal Plane Vision Processor with 128x128 Resolution, in Proc. of
the 15th European Conference on Circuit Theory and Design, I (2001), pp. 345-348.
[3] SZOLGAY, P. - TOMORDI, K., Analogic Algorithms for Optical Detection of Breaks and Short
Circuits on the Layouts of Printed Circuit Boards Using CNN, Int. J. of Circuit Theory and
Applications, 26 (1998).
[4] BALYA, D. - ROSKA, B. - ROSKA, T. - WERBLIN, F. S., A CNN Framework for Modeling
Parallel Processing in a Mammalian Retina, Int. J. on Circuit Theory and Applications, 29 No.
3, 2002.
[5] LIU, P. - LIU, B., A New Hardware Realization of Digital Filters, IEEE Trans. on Acoust.,
Speech, Signal Processing, ASSP-22, December 1974, pp. 456-462.
[6] MINTZER, L., FIR Filters with the Xilinx FPGA, in Proc. of FPGA '92 ACM/SIGDA Workshop
on FPGAs, pp. 129-134, 1992.
[7] WHITE, S. A., Applications of Distributed Arithmetic to Digital Signal Processing, IEEE ASSP
Magazine, 6 (3) (1989), pp. 4-19.
[8] KERESZTES, P. - ZARANDY, A. - ROSKA, T. - SZOLGAY, P. - HIDVEGI, T. - JÓNAS, P. -
KATONA, A., An Emulated Digital CNN Implementation, Int. J. of VLSI Signal Processing,
Kluwer, 1999 September 9.
[9] Intel Performance Libraries homepage, http://www.intel.com/software/products/perflib/
[10] CNN Software Library, http://lab.analogic.sztaki.hu/
[11] Xilinx products homepage, http://www.xilinx.com/
