XILINX BASED HARDWARE FOR PICTURE PROCESSING AND CHARACTER RECOGNITION PURPOSES by Szigeti, Zoltán & Mártonfi, Zsolt
PERIODICA POLYTECHNICA SER. EL. ENG. VOL. 39, NO. 2, PP. 103-114 (1995)
XILINX BASED HARDWARE FOR PICTURE 
PROCESSING AND CHARACTER RECOGNITION 
PURPOSES 
Zsolt MARTONFI and Zoltan SZIGETI 
Department of Automation 
Technical University of Budapest 
H-1111 Budapest, Hungary
Fax: (+36 1) 463 2871, Phone: (+36 1) 463 3969
E-mail: martonfi@bme-at.aut.bme.hu
szigeti@bme-at.aut.bme.hu 
Received: June 30, 1995 
Abstract 
The paper deals with the development of hardware capable of realising picture
processing and character recognition algorithms. The hardware was implemented
as an IBM PC peripheral card and contains up to five XILINX XC3090 FPGA
devices. Because the XILINX devices can be reconfigured on board, the hardware
allows several separate algorithms to be implemented at different times. To
evaluate the performance of the hardware, the Dineen and the Unger image
smoothing techniques were chosen. These image smoothing techniques can be used
at the pre-processing stage of the character recognition process.
Keywords: IBM PC, XILINX, programmable logic, character recognition, pre-processing, 
image processing. 
1. Introduction 
The development of picture processing hardware at the Department of 
Automation was initiated by Professor Gal in 1988, when an IBM PC based 
frame grabber card was developed (PRENCSOVSZKY et al., 1989). The aim
of the implemented picture processing card is to speed up parts of picture 
processing and character recognition software. The card has been designed 
to provide a cheap, efficient and flexible solution for the problem. The 
paper also deals with some aspects of the card development. 
2. The Predecessor of the XILINX Card 
We saw two ways of speeding up image processing: either to buy faster
computers or to design and implement special hardware dedicated to this
purpose. In the end we decided to build an IBM PC peripheral card to support
image processing. One of the most important requirements, besides speed, was
flexibility, so we started to design hardware based on XILINX FPGA chips. The
XILINX devices were ideal for our application because they could be
reprogrammed on board.
The card was developed in 1991 at the Department of Automation as an MSc
thesis project. It was our first attempt to implement a PC card capable of
implementing algorithms in hardware.
The card has the following parameters:
- It contains three XC2064 devices.
- The card has no on board memory; it works from and into the PC memory.
- During the memory accesses the card acts as bus master.
- The memory address generator was implemented with external counters.
- The bus arbitration was implemented in one of the XILINX chips.
Fig. 1 shows the block diagram of the card.
Fig. 1. Block diagram of the predecessor card: three XILINX devices (1, 2, 3)
on the PC bus, with data bus, address bus and XILINX-XILINX interconnection,
XILINX configuration and glue logic (signals for XILINX programming and for
other control), a control block, a source address and displacement counter,
and a destination address counter.
We implemented some picture processing algorithms; however, only rather simple
algorithms could be realised because of the low capacity of the XC2064 devices
(for example, the Dineen filter (ULLMANN, 1973) fills all three XC2064 chips).
The address generator with external counters was optimal for algorithms that
access every pixel of the image in a fixed manner, and it saved a lot of logic
in the programmable devices, but it was inefficient for a couple of algorithms
that required complex data transfers between the PC memory and the peripheral
card. The memory was accessed directly by the hardware, and we had problems
with bus mastering on some PC clones. We could initiate only byte wide read
and write operations because the data bus of the card was 8 bits wide, and
this reduced the efficiency of the data transfers.
Based on the conclusions drawn from using the hardware presented above, we
designed a more powerful and efficient peripheral card for image processing.
To be able to implement more complex algorithms we used XILINX XC3090 devices
for the programmable part of the new card. Using on board memory makes it
possible to organise the memory in banks and thus to use memory interleaving
to achieve higher memory bandwidth. The transfer rate between the processing
logic and the on board memory can also be increased by making 32 or 64 bit
wide accesses.
3. The XILINX Based Card 
The new XILINX based card has the following characteristics:
- it contains up to 5 (minimum 2) XILINX XC3090 FPGA devices,
- the size of the memory can be between 512 KB and 32 MB, depending on the
  memory modules,
- the memory is built from standard SIMM memory modules,
- the dynamic RAMs are controlled by the DP8422A DRAM controller (DRC)
  manufactured by National Semiconductor Corp.; the controller provides dual
  port access,
- the card also contains the glue logic needed for programming the XILINX
  devices.
Fig. 2 shows the block diagram of the card. 
The main modules of the card:
XILINX configuration logic:
At power up all of the XILINX devices are unconfigured, so glue logic is
needed through which the devices can be programmed. This glue logic consists
of a simple I/O decoder and ports through which the XILINX devices can be
programmed. The glue logic is used even after the configuration phase, as
decoder logic for the I/O ports implemented in the XILINX0 device.
Fig. 2. Block diagram of the XILINX based card: XILINX 0 and the XILINX
configuration logic on the PC bus (signals for programming the XILINX), two
DRAM controllers, memory blocks A, B, C and D, and XILINX 1 to 4, linked by
the data bus, the address bus and the XILINX-XILINX interconnection.
XILINX0:
This device is the heart of the card, because it controls all the memory
accesses from the PC side, and it addresses the memory in the picture
processing mode. When data blocks are loaded into the memory of the card,
special interface logic can be programmed into this device to organise the
data in the memory (on line) optimally for the processing stage, for example
to spread the words over the banks in such a way that memory interleaving
becomes possible.
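The word spreading described for XILINX0 can be pictured with a small address
mapping sketch. It is only an illustration: the bank count of four (one per
memory block A to D), the low-order interleaving scheme and the function name
are assumptions, not the logic actually programmed into the device.

```python
NUM_BANKS = 4  # assumption: one bank per memory block A..D


def interleave(word_index, num_banks=NUM_BANKS):
    """Map a linear word index to (bank, offset) with low-order interleaving."""
    return word_index % num_banks, word_index // num_banks


# Consecutive words land in different banks, so their accesses can overlap.
placement = [interleave(i) for i in range(8)]
```

With this mapping a sequential scan of the image touches the banks in
rotation, which is the property that interleaving relies on.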
XILINX1..4:
These devices are intended for the picture processing and character
recognition logic. They are connected in a ring and can access different
memory blocks, so it is possible to implement pipeline processing: reading
data at one end of the pipeline, processing it in several steps, and writing
it into the memory at the other end. If we use separate memory blocks for
source and destination data (and these memory blocks belong to different DRAM
controllers), the memory read and write operations can be fully overlapped.
Memory subsystem (Dynamic RAM + DRC):
The memory subsystem stores the image to be processed. It can be accessed
from the IBM PC side and from the XILINX side, too. 16, 32 or 64 bit accesses
are possible, depending on how many XILINX devices are present on the card,
because every XILINX chip is associated with a 16 bit organised memory block.
If we implement the same algorithm in all chips, then 64 bits can be processed
during one memory access.
4. Interfacing the Card to the IBM PC 
The interface logic provides all functions needed to program the XILINX
devices and the DRAM controllers, and to read and write the memory from the PC
side. The whole on board memory is mapped through a window into a free area of
the upper memory of the PC. The on board memory can also be accessed through
I/O ports, so direct memory access can be used to transfer data between the PC
and the peripheral card. Memory accesses for the data processors are also
controlled by this interface logic.
Fig. 3 shows the sketch of the interface logic. 
Fig. 3. Sketch of the interface logic: PC data bus ports, I/O and memory
address decoders driven by the AT address and control signals, memory page
select and address bus, DRC programming address and DRC control signals,
XILINX address, clock generator, and the interface of the control signals to
the data processor XILINX devices.
5. Implementing Image Smoothing Algorithms 
Two image smoothing techniques were chosen for evaluating the image 
processing card for the following reasons: both techniques can use the same 
3*3 window, and show the speed advantage of the hardware solution over 
the software one. 
Dineen's smoothing technique uses a 3*3 element window that is moved into all
positions in the binary image. In each position the black elements contained
in the window are counted. For each position of the window one new pixel is
generated. The destination pixel is made black only if the number of black
elements within the window exceeds a prescribed limit, θ. The technique
removes all the standalone black pixels that are caused by noise. High
threshold levels (5 ≤ θ ≤ 7) reduce the limb width (weak thinning), while low
levels (2 ≤ θ ≤ 4) fatten the character.
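A minimal software sketch of the counting rule may make the description
concrete. It is not the hardware implementation: the image is a plain list of
rows with 1 for black, the prescribed limit is passed as theta, and treating
pixels outside the image as white is an assumption made here for the borders.

```python
def dineen_smooth(image, theta):
    """Return a smoothed copy of a binary image (1 = black, 0 = white)."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            count = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    # Assumption: pixels outside the image count as white.
                    if 0 <= rr < rows and 0 <= cc < cols:
                        count += image[rr][cc]
            # Black only if the count exceeds the prescribed limit.
            out[r][c] = 1 if count > theta else 0
    return out


noisy = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],  # standalone black pixel (noise)
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
smoothed = dineen_smooth(noisy, theta=3)
```

In this small example the isolated pixel disappears while the solid block on
the right is kept.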
Unger's technique applies explicit logical rules to the pattern elements
appearing within the 3*3 window (see Fig. 4).
Fig. 4. Window patterns for the Unger technique: (a) positions A and B,
(b) positions C and D, (c) and (d) example windows.
The resulting pixel becomes black if either at least one pixel at position A
and one pixel at position B in Fig. 4 (a), or at least one pixel at position C
and one pixel at position D in Fig. 4 (b), are black.
The technique removes the black pixel at the centre of Fig. 4 (c), while it
changes the pixel in the middle of Fig. 4 (d) to black.
6. The Principle of the Implementation 
The two smoothing techniques have the following common characteristics:
- The smoothing is performed by using a 3*3 window,
- the resulting pixel is a function of the 9 pixels of the window, so the
  new pixel can be generated using a combinational circuit.
Consequently, the implementation of the two techniques is common, except for
the combinational circuit that is the core of the smoothing. The basic step of
the smoothing performed by the 9 input combinational circuit is shown by the
sign ① in Fig. 5.
The image is stored in the memory in a packed form: every pixel corresponds to
one bit in the memory. This feature implies the following:
- It is possible to make a combinational circuit that processes 16 pixels
  simultaneously (the XILINX based hardware can only perform 16 bit memory
  accesses).
- The sign ② in Fig. 5 shows the problem encountered at the word boundary
  (indicated by a thick line). For the smoothing at the word boundary, the
  first or last bit of the neighbouring words must be known.
Fig. 5. Rows n-1, n and n+1 of the original picture and row n of the output.
7. An Overview of the Smoothing Circuit 
Fig. 6 shows the block diagram of the smoothing circuit. The smoothing circuit
was implemented on the IBM-AT compatible card.
Fig. 6. Block diagram of the smoothing circuit: data processor and address
generator.
If the size of the picture is chosen as 2^n * 2^m, the address generation can
be performed by two separate counters. The address generator contains up/down
counter pairs. The size of the picture is chosen as 1024*1024 pixels; this
choice allows a 10 bit counter and a 7 bit counter to be used as the address
generator. The former counter acts as the row address counter, while the
latter counter addresses the words within a row.
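The benefit of power-of-two picture sizes can be sketched in a few lines:
the two counter values are simply concatenated, so no multiplier is needed.
The counter widths below are taken from the text; placing the row counter in
the high-order bits and the function names are assumptions for illustration.

```python
ROW_BITS = 10   # row address counter width, from the text
WORD_BITS = 7   # width of the counter addressing words within a row, from the text


def word_address(row, word, word_bits=WORD_BITS):
    """Concatenate the two counter values into one linear address."""
    return (row << word_bits) | word


def scan(rows, words_per_row):
    """Yield addresses in raster order, as two nested counters would."""
    for row in range(rows):
        for word in range(words_per_row):
            yield word_address(row, word)
```

Because the word field is a whole number of bits, the concatenation equals
row * 2^m + word, which is why a fixed scan needs only counters.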
8. The Data Processor 
The block diagram of the data processor for the smoothing is shown in Fig. 7.
The three 16+2 bit registers store the pixels of three succeeding rows. The
extra two bits are necessary to perform the smoothing at the word boundary.
Each register automatically keeps the last two bits of the previous word while
storing the new value.
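The behaviour of one 16+2 bit register can be modelled as follows. This is a
sketch under an assumed bit order (leftmost pixel in the most significant
bit), and the function names are hypothetical; it only shows why two carried
bits are enough to form sixteen complete 3-pixel slices per loaded word.

```python
WORD_MASK = 0xFFFF


def load_word(reg, word):
    """Return the new 18-bit register value after storing a 16-bit word.

    Assumption: the leftmost pixel is the most significant bit, so the
    last two bits of the previous word are its two low-order bits.
    """
    return ((reg & 0b11) << 16) | (word & WORD_MASK)


def slices(reg):
    """Return the sixteen 3-pixel horizontal slices held in one register."""
    return [(reg >> shift) & 0b111 for shift in range(15, -1, -1)]


reg = load_word(0, 0x8003)      # previous word ends with two black pixels
reg = load_word(reg, 0x4000)    # next word; the two carried bits are kept
row_slices = slices(reg)
```

The centres of these sixteen slices are the last pixel of the previous word
and the first fifteen pixels of the new one, so each result word is completed
only after the following input word has been loaded.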
Fig. 7. Block diagram of the data processor: three 16+2 bit registers fed from
the 16 bit data bus, their 18 bit outputs, and the control inputs.
In general, both image smoothing techniques can be realised by a 9 input
combinational circuit. A combinational circuit of at most 5 inputs can be
realised with one configurable logic block (CLB) within the XILINX LCA.
Therefore the 9 input circuit of the smoothing must be decomposed into smaller
circuits with fewer inputs.
The logical rules of the Unger smoothing (see Fig. 4) can be decomposed into
four 3 input combinational circuits that check whether there is at least one
black pixel at the positions A, B, C and D in Fig. 4. Using the outputs of
these circuits, a further 4 input combinational circuit can produce one pixel
of the smoothing result.
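The two-level decomposition can be sketched in software. The geometry of the
positions is an assumption made only for this illustration (A = top row,
B = bottom row, C = left column, D = right column); the actual positions are
those defined in Fig. 4.

```python
def any_black(p0, p1, p2):
    """3-input stage: is at least one of the three pixels black?"""
    return p0 | p1 | p2


def unger_pixel(window):
    """window is a 3x3 list of 0/1 values; returns the new centre pixel."""
    (nw, n, ne), (w, _centre, e), (sw, s, se) = window
    a = any_black(nw, n, ne)   # assumed position A: top row
    b = any_black(sw, s, se)   # assumed position B: bottom row
    c = any_black(nw, w, sw)   # assumed position C: left column
    d = any_black(ne, e, se)   # assumed position D: right column
    # 4-input stage combining the four partial results.
    return (a & b) | (c & d)


isolated = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
gap = [[0, 1, 0], [0, 0, 0], [0, 1, 0]]
results = (unger_pixel(isolated), unger_pixel(gap))
```

Each stage has at most four inputs, so it fits within the 5 input limit of a
single CLB function.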
Dineen's technique uses the count of the black pixels within the window, so it
is natural to compute the number of black pixels within a 3 pixel high column.
This value can be used three times, because the 3*3 windows corresponding to
succeeding pixels overlap.
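The reuse of the column counts can be illustrated with a short sketch. Rows
are plain lists of 0/1 pixels and the function name is hypothetical; treating
columns outside the image as empty is an assumption made for the borders.

```python
def dineen_row(above, current, below, theta):
    """Smooth one row given its two neighbours (lists of 0/1 pixels)."""
    # One count per 3-pixel-high column.
    column = [a + c + b for a, c, b in zip(above, current, below)]
    # Assumption: columns outside the image contribute no black pixels.
    padded = [0] + column + [0]
    # Each column count is reused by the three windows that overlap it.
    return [1 if padded[i] + padded[i + 1] + padded[i + 2] > theta else 0
            for i in range(len(column))]


row = dineen_row([0, 0, 0, 1, 1], [0, 0, 0, 1, 1], [0, 0, 0, 1, 1], theta=3)
```

Every window sum is built from three column counts, so each count is computed
once and shared by three neighbouring output pixels.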
The 16 bit wide smoothing circuit (see Fig. 1) can be implemented with
- Dineen's case: 34 CLBs, maximum no. of levels is 3,
- Unger's case: 48 CLBs, maximum no. of levels is 3.
9. The Timing of the Memory Access 
The memory of the XILINX based card was built with SIMM memory modules.
According to the data sheets of the memory modules (INTEL, 1991), an access to
the dynamic RAM can be split into two phases: the active phase performs the
read or write, while the passive phase is called the RAS precharge time. The
duration of the active part (in this case) is 80 ns, while the precharge time
interval is 60-75 ns. The length of the memory cycle is 150 ns. The smoothing
can be finished during a precharge time, which will be 80 ns long, equal to
the smoothing delay.
The timing of the memory access shown in Fig. 8 was designed according to the
following considerations:
- The last pixel of the result can be produced after reading the next three
  words of the input image,
- the smoothing process can be performed during a precharge time, so the
  memory accesses can follow each other continuously, although the precharge
  time that runs in parallel with the smoothing must be 20 ns longer,
- the refresh subsystem should be disabled, since the refresh of the memory
  is performed automatically by the subsequent memory accesses.
10. The Control Circuit 
The control circuit provides all the control signals needed for the proper
operation of the data processor and the RAM. It is also responsible for the
timings. The block diagram of the control circuit is shown in Fig. 9.
The controller is microprogrammed, and it is built up from the following
parts:
- The address counter is a loadable forward counter; it addresses the memory
  locations of the microprogram ROM.
- The conditional jump control is a multiplexer; its select lines are
  controlled by the microprogram ROM, and its output controls the load input
  of the address counter. Two inputs of the multiplexer are tied to logical 0
  and 1, respectively; these are for the unconditional jump and sequential
  execution. The remaining two inputs are for handling conditions.
- The delay logic is built from a loadable counter, which is normally
  deactivated. Activating the delay logic inhibits the address counter, and
  the controller holds its present state for a given number of clock periods.
  The length of the delay can be specified by the microprogram ROM. This
  circuit is ideal for producing fixed length delays without wasting memory
  cells in the ROM.
- The microprogram ROM is realised in the XILINX LCA. Since combinational
  circuits are implemented with look-up tables (LUT), a CLB can be treated as
  a high speed memory whose data cannot be altered. With one CLB a 32*1 bit
  ROM can be realised; if this is not enough, it can be doubled using an
  additional CLB and a multiplexer.
Fig. 8. Timing of the memory access. Input image words: A0 A1 A2 A3, B0 B1 B2
B3, C0 C1 C2 C3; output image words: D0, D1, ... Memory sequence: read C1,
write D0, read A2, read B2, read C2, write D1, read A3, ... The combinational
circuit produces the last bit of D0 and the first 15 bits of D1 during one RAS
precharge, and the last bit of D1 and the first 15 bits of D2 during a later
one.
Fig. 9. Block diagram of the control circuit: microprogram ROM, delay logic
and control signals.
The whole microprogram ROM should be drawn on a separate sheet using
hierarchical design methodology. Since the development system uses a separate
netlist file (XNF) for every sheet, the netlist file of the microprogram ROM
is separate. A utility program written by the author of this paper is used to
fill the ROM with the appropriate data. The program accepts waveform entry and
alters the XNF file that describes the ROM.
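The idea of using look-up tables as the microprogram ROM can be modelled in a
few lines. This is a behavioural sketch only: the 32 entry table follows from
the 5 input limit of a CLB function, while the function names and the choice
of the sixth address bit as the multiplexer select are assumptions.

```python
def lut_rom(contents):
    """Model one CLB used as a 32*1 bit ROM: contents is a 32-bit integer."""
    def read(address):
        return (contents >> (address & 0x1F)) & 1
    return read


def rom_64x1(low_contents, high_contents):
    """Double the depth with a second LUT and a 2-to-1 multiplexer."""
    low, high = lut_rom(low_contents), lut_rom(high_contents)

    def read(address):
        select = (address >> 5) & 1   # sixth address bit drives the multiplexer
        return high(address) if select else low(address)
    return read


rom = rom_64x1(0b1, 1 << 31)
bits = [rom(0), rom(1), rom(32), rom(63)]
```

A wider microinstruction word would use one such column per output bit, all
sharing the same address lines.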
11. Experimental Results 
Fig. 10 shows the experimental results. The software was run on an IBM PC with
a 486DX-33 processor. The programs were written in assembly language, using
8086 and 80386 (32 bit) instructions, respectively.
hardware                              19 ms
program using 8086 instructions      340 ms
program using 80386 instructions    1274 ms
Fig. 10.
The hardware is much faster than either of the software solutions. The
bottleneck of the hardware solution is the memory access, so the speed of the
processing depends strongly on it. Fortunately, in the case of Dineen and
Unger smoothing all 16 bits that are read at a time can be processed in
parallel. The memory access can be accelerated by using a cache, and thus an
even higher processing speed can be achieved. The experiment proves that even
a comparatively cheap solution can significantly speed up compute intensive
image processing or character recognition tasks.
12. Conclusion 
This paper introduced the card developed at the Department of Automation. Its
goal is a comparatively cheap and efficient speed-up of compute intensive
algorithms such as character recognition and picture processing. The
architecture of the card was developed using the experience gained from a
previous card.
References 
1. PRENCSOVSZKY, Cs. - SZIGETI, Z. (1989): Képbeviteli modul IBM PC-hez. Végzős
konferencia, 1989. (in Hungarian).
2. MARTONFI, Zs. (1993): Hardver algoritmusok megvalósítására alkalmas eszköz
készítése. Diplomaterv, 1993. (in Hungarian).
3. XILINX: The Programmable Logic Data Book, 1994.
4. INTEL Corp.: Memory Products, 1991.
5. ULLMANN, J. R. (1973): Pattern Recognition Techniques.
