Efficient design and implementation of image processing algorithms on reconfigurable hardware using Handel-C by Daggu, Venkateshwar Rao
UNLV Retrospective Theses & Dissertations 
1-1-2003 
Efficient design and implementation of image processing 
algorithms on reconfigurable hardware using Handel-C 
Venkateshwar Rao Daggu 
University of Nevada, Las Vegas 
Repository Citation 
Daggu, Venkateshwar Rao, "Efficient design and implementation of image processing algorithms on 
reconfigurable hardware using Handel-C" (2003). UNLV Retrospective Theses & Dissertations. 1586. 
http://dx.doi.org/10.25669/y0r1-qje0 
EFFICIENT DESIGN AND IMPLEMENTATION OF IMAGE PROCESSING
ALGORITHMS ON RECONFIGURABLE HARDWARE
USING HANDEL-C
by
Venkateshwar Rao Daggu
Bachelor of Technology 
Kakatiya University, Warangal, India
2000
A thesis submitted in partial fulfillment 
of the requirement for the
Master of Science Degree in Computer Engineering 
Department of Electrical and Computer Engineering 
Howard R. Hughes College of Engineering
Graduate College 
University of Nevada, Las Vegas 
December 2003
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
UMI Number: 1417763
UMI Microform 1417763 
Copyright 2004 by ProQuest Information and Learning Company. 
All rights reserved. This microform edition is protected against 
unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company 
300 North Zeeb Road 
P.O. Box 1346 
Ann Arbor, MI 48106-1346
Thesis Approval
The Graduate College
University of Nevada, Las Vegas
November 14, 2003
The Thesis prepared by
Venkateshwar R. Daggu
Entitled
An Efficient Design and Implementation of Image-Processing
Algorithms on Reconfigurable Hardware Using Handel-C
is approved in partial fulfillment of the requirements for the degree of
Master of Science in Electrical Engineering
Examination Committee Member
Examination Committee Chair
Dean of the Graduate College
Graduate College Faculty Representative
ABSTRACT
Efficient Design and Implementation of Image Processing Algorithms on 
Reconfigurable Hardware using Handel-C
by
Venkateshwar Rao Daggu
Dr. Muthukumar Venkatesan, Examination Committee Chair 
Professor of Electrical and Computer Engineering 
University of Nevada, Las Vegas
Computer manipulation of images is generally referred to as digital image processing
(DIP). DIP is used in a variety of applications, including video surveillance, target
recognition, and image enhancement. These applications are usually implemented in
software but may use special-purpose hardware for speed. With advances in VLSI
technology, hardware implementation has become an attractive alternative. Assigning
complex computation tasks to hardware and exploiting the parallelism and pipelining in
the algorithms yields a significant speedup in running time. In this thesis, image processing
algorithms such as the median filter, basic morphological operators, convolution, and edge
detection are implemented on an FPGA. A pipelined architecture for each of these
algorithms is presented. The proposed architectures are capable of producing one output
on every clock cycle. The hardware modeling was accomplished using Handel-C (DK2
environment). The algorithms were tested on standard image processing benchmarks, and
the results are compared with those obtained in software.
TABLE OF CONTENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS
CHAPTER 1 INTRODUCTION
CHAPTER 2 LITERATURE REVIEW
2.1 Field Programmable Gate Arrays (FPGAs)
2.2 Virtex™-E FPGA
2.2.1 Configurable Logic Block (CLB)
2.2.2 Block SelectRAM (BRAMs)
2.3 Handel-C
2.3.1 Parallel Hardware Generation
2.3.2 Efficient FPGA Resources Usage
2.3.3 Bit Level Operators
2.3.4 Channel communications
2.3.6 External Communication
2.3.5 Memory
2.3.7 Targets Supported by Handel-C
2.4 RC1000 PCI Board
2.5 Optimization Techniques
2.5.1 Parallelism
2.5.2 Longest Path Delay
2.5.3 Pipelining
2.6 Prior Related Work
CHAPTER 3 IMAGE PROCESSING ALGORITHMS
3.1 Median Filtering
3.2 Morphological Operators
3.3 Convolution
3.3.1 1D Convolution
3.3.2 2D-Convolution
3.4 Edge Detection
3.4.1 Smoothing
3.4.2 Gradient Calculation
3.4.3 Magnitude and Phase
3.4.4 Non-Maximum Suppression
3.4.5 Threshold
CHAPTER 4 HARDWARE IMPLEMENTATION
4.1 Moving Window Operator
4.2 Median Filter
4.3 Basic Morphological Operators
4.3 Convolution
4.4 Edge Detection
4.4.1 Image Smoothing
4.4.2 Vertical and Horizontal Gradient Calculation
4.4.3 Directional Non Maximum Suppression
4.4.4 Threshold
CHAPTER 5 RESULTS AND FUTURE WORK
5.1 Results
5.2 Future Work
BIBLIOGRAPHY
APPENDIX
VITA
LIST OF TABLES
Table 2.1 Summary of Four Commercial FPGAs
Table 5.1 Timing Result of Median Filter on a 256 x 256 gray scale image
Table 5.2 Timing Result of Median Filter on a 512 x 512 gray scale image
Table 5.3 Timing Result of 5x5 Convolution on a 256 x 256 gray scale image
Table 5.4 Timing Result of 5x5 Convolution on a 512 x 512 gray scale image
Table 5.5 Timing Results of 3x3 Convolution on a 256 x 256 gray scale image
Table 5.6 Timing Results of 3x3 Convolution on a 512 x 512 gray scale image
Table 5.7 Timing Result of edge detection algorithm on 256 x 256 gray scale image
Table 5.8 Timing Result of edge detection algorithm on 512 x 512 gray scale image
Table 5.9 Implementation Cost on FPGA
LIST OF FIGURES
Figure 2.1 FPGA Architecture
Figure 2.2 Virtex™-E Architecture Overview
Figure 2.3 Virtex-E Local Routing
Figure 2.4 2-slice Virtex™-E CLB
Figure 2.5 Parallel branch execution flow
Figure 2.6 Channel Communication
Figure 2.7 Translating code into hardware using Handel-C
Figure 2.8 RC1000 Block Diagram
Figure 2.9 Basic parallelism
Figure 2.10 Breaking Up complex Operations
Figure 3.1 Median Filter
Figure 3.2 Median Filter on an image
Figure 3.3 Structuring element fitting and not fitting
Figure 3.4 Erosion and Dilation of a Gray Scale Image
Figure 3.5 Convolution
Figure 3.6 Convolution Mask
Figure 3.7 5x5 Gaussian Convolution of Lena Image
Figure 3.8 Schematic of Canny edge detection
Figure 3.9 5x5 Gaussian convolution mask
Figure 3.10 Gradient of Image
Figure 3.11 Edge Detection of Gray Scale Image
Figure 4.1 Architecture of 3x3 moving window
Figure 4.2 Architecture of 5x5 moving window
Figure 4.3 Compare Unit
Figure 4.4 Hardware Design for Sorting Algorithm
Figure 4.5 Pipelining Process
Figure 4.6 8 bit Constant Coefficient Multiplier (KCM)
Figure 4.7 Convolution Masks
Figure 4.8 Design Flow of Edge Detection
Figure 4.9 Pipelined Architecture
Figure 4.10 Gradient Convolution Kernels
Figure 4.11 Gradient Orientation
Figure 4.12 Pixel Interpolation
Figure 4.13 Lena image processed on FPGA
Figure 5.1 Host FPGA Communication
ACKNOWLEDGEMENTS 
The comments, reviews, and support of many people have helped in the development of this
thesis. I would like to thank all of those who have been instrumental in this work.
First and foremost, I would like to thank Dr. Venkatesan Muthukumar for his guidance on
my research and study at the University of Nevada, Las Vegas. Without his valuable advice
and encouragement, I could not have reached this stage in my academic pursuits.
I wish to thank the members of my thesis committee, Dr. Evangelos Yfantis, Dr. Henry
Selvaraj, and Dr. Emma Regentova, for providing valuable tips that made
this thesis a success. I would like to extend my thanks to my colleagues at the University of
Nevada, Las Vegas for their unconditional support. Finally, I would like to
thank all those who were involved, either directly or indirectly, in the completion of this
thesis. I am too grateful to say anything else.
Last but certainly not least, I am forever indebted to my family for their love and care,
which made me capable of fulfilling their expectations.
CHAPTER 1
INTRODUCTION
Digital image processing is an ever-expanding and dynamic area with applications
reaching into everyday life, such as medicine, space exploration, surveillance,
authentication, automated industrial inspection, and many more.
Applications such as these involve different processes, like image enhancement and
object detection [14]. Implementing such applications on a general-purpose computer
is easier but not very efficient in terms of speed, because of the additional
constraints placed on memory and other peripheral device management. Application-specific
hardware offers much greater speed than a software implementation.
There are two types of technologies available for hardware design: full-custom
hardware design, also called Application Specific Integrated Circuits (ASICs), and semi-custom
hardware devices, which are programmable devices like digital signal processors
(DSPs) and Field Programmable Gate Arrays (FPGAs).
Full-custom ASIC design offers the highest performance, but the complexity and the cost
associated with the design are very high. An ASIC design cannot be changed, and the time taken
to design the hardware is also very high. ASIC designs are used in high-volume
commercial applications. In addition, if an error exists in the hardware design once the
design is fabricated, the product becomes useless.
DSPs are a class of hardware devices that fall somewhere between an ASIC and a
PC in terms of performance and design complexity. DSPs are specialized
microprocessors, typically programmed in C, perhaps with assembly code for
performance. They are well suited to extremely complex, math-intensive tasks such as image
processing. Hardware design knowledge is still required, but the learning curve is much
lower than for some other design choices [3].
Field Programmable Gate Arrays are programmable devices [2], also called
reconfigurable devices. A reconfigurable device can be
programmed with a design, and the design can be changed by reprogramming the device.
Hardware design techniques such as parallelism and pipelining can be
applied on an FPGA [7], which is not possible in dedicated DSP designs. FPGAs are therefore an
ideal choice for the implementation of real-time image processing algorithms.
FPGAs have traditionally been configured by hardware engineers using a Hardware
Description Language (HDL). The two principal languages in use are Verilog and
VHDL. Both are specialized design languages that are not immediately
accessible to software engineers, who have often been trained in imperative
programming languages. Consequently, over the last few years there have been several
attempts at translating algorithm-oriented programming languages directly into
hardware descriptions. Handel-C, a C-like hardware description language
introduced by Celoxica [5], allows the designer to focus more on the specification of an
algorithm rather than adopting a structural approach to coding. For these reasons,
Handel-C is used for the implementation of image processing algorithms on an FPGA.
The goal of this thesis is to implement image processing algorithms such as the median
filter, convolution, and Canny edge detection on an FPGA using Handel-C and to compare
their performance against a software implementation on a general-purpose computer.
Chapter two provides information on FPGAs, Handel-C, the RC1000 board, and prior related
work. Chapter three describes the image processing algorithms: the median filter,
convolution, and Canny edge detection. Chapter four provides the details of the
implementation of these algorithms on a Xilinx Virtex-E FPGA for a 256
x 256 gray scale image. Chapter five summarizes the results and future work.
CHAPTER 2
LITERATURE REVIEW 
This chapter provides information on the basic concepts of field programmable gate
arrays (FPGAs), including a description of the Xilinx Virtex-E FPGA, the Handel-C
language, and the RC1000 board used in this thesis, followed by prior related work.
2.1 Field Programmable Gate Arrays (FPGAs)
A Field Programmable Gate Array (FPGA) [2], as the name suggests, is a programmable
device in which the final logic structure can be directly configured by the end user for a
variety of applications. In its simplest form, an FPGA consists of an array of uncommitted
elements that can be programmed or interconnected according to a user's specification.
The ability to reprogram these devices over and over again, along with the flexibility of their
interconnection resources, makes FPGAs ideal devices for implementing and testing
ASIC prototypes. Figure 2.1 portrays the architecture of a conceptual FPGA. The
most important components in an FPGA are configurable logic blocks (CLBs), input-output
blocks, and programmable switches.
The architecture has a two-dimensional array of CLBs that are connected by general
interconnection resources. A CLB can be as simple as a 2-input NAND gate, or it can
have a more complex structure such as a multiplexer or a look-up table. Most logic blocks also
contain some type of flip-flop to aid in the implementation of sequential circuits.
Figure 2.1 FPGA Architecture (labeled element: Logic Block)
The interconnect consists of wire segments, which may be of various
lengths, and programmable switches that serve to connect
the CLBs to the wire segments, or one wire segment to another. The wire segments
together with the programmable switches are termed the routing architecture. Similar to
the logic blocks, these switches can be designed in many ways. Some FPGAs offer a
large number of simple connections between blocks, whereas others provide fewer but
more complex routes. The programmable switches can be constructed in several ways,
including pass transistors controlled by static RAM cells, antifuses, EPROM transistors,
and EEPROM transistors.
Commercially, FPGAs have been classified into four major categories based on their
interconnection: symmetrical array, row based, hierarchical,
and sea of gates. Table 2.1 shows commercially available FPGAs [2].

Table 2.1 Summary of Four Commercial FPGAs
Company      Architecture        Logic Block Type     Programming Technology
Actel        Row-Based           Multiplexer-Based    Anti-fuse
Altera       Hierarchical-PLD    PLD Block            EPROM
QuickLogic   Symmetrical Array   Multiplexer-Based    Anti-fuse
Xilinx       Symmetrical Array   Look-up Table        Static RAM

2.2 Virtex™-E FPGA
The Virtex™-E FPGA [6], produced by Xilinx, Inc., is used for our implementation. The
Virtex™-E FPGA architecture has two major configurable elements: configurable logic
blocks (CLBs) and input/output blocks (IOBs). CLBs provide the functional elements for
constructing logic. IOBs provide the interface between the package pins and the CLBs.
The Virtex™-E FPGA also has dedicated block memories called Block SelectRAM™
memories (BRAMs). The Virtex™-E belongs to the Virtex™ family of FPGAs, which
features regular arrays of CLBs arranged in columns and surrounded on all sides by IOBs, as
shown in Figure 2.2 [6].
The VersaRing I/O interface provides additional routing resources around the
periphery of the device. This routing improves I/O routability and facilitates pin locking.
The interconnection within the device is very versatile, as the wire segments are of varying
lengths and the programmable switches are fast and placed in locations that allow them to
connect these wire segments efficiently. CLBs are interconnected through a general
routing matrix (GRM), as shown in Figure 2.3. The GRM contains routing switches that
connect the vertical and horizontal routing channels. Each CLB nests into a VersaBlock™
that connects the CLB to the GRM. Virtex™ FPGAs are SRAM-based: a design is
implemented by loading configuration data into the internal memory cells. The values
stored in these static memory cells control the configurable logic elements and the interconnect
resources. These values are loaded into the memory cells on power-up and can be reloaded if
necessary to change the function of the device.

Figure 2.2 Virtex™-E Architecture Overview
Figure 2.3 Virtex-E Local Routing (labels: CLB, GRM, To Adjacent GRM, Direct Connection To Adjacent CLB)
2.2.1 Configurable Logic Block (CLB)
The basic building block of the Virtex-E CLB is the logic cell (LC). A Virtex™-E
CLB contains four logic cells (LCs). An LC contains a four-input function generator, carry
logic, and a storage element. The entire Virtex CLB is made of two CLB slices, each
containing two LCs. Figure 2.4 illustrates the various components of the Virtex™-E
CLB. The output from the function generator in each LC drives both the CLB output and
the D flip-flop. The function generators are implemented as four-input look-up tables
(LUTs) with 16 locations in each LUT. A function is implemented in an LC by loading data into
the LUT. The inputs to the LC form an address into the LUT, and the value stored at that
address is the output of the LC. Two LUTs within a slice can be combined to create a
16 x 2-bit or 32 x 1-bit synchronous RAM, or a 16 x 1-bit dual-port synchronous RAM. A LUT
can also work as a 16-bit shift register, which can be used to store data in high-speed
applications such as digital image processing.
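
The way a LUT acts as a function generator can be modeled in a few lines of software. The following is a minimal Python sketch, not a description of the Xilinx circuitry: the helper names make_lut4 and lut4_read are hypothetical, and the choice of the first input as the least significant address bit is an assumption made for illustration.

```python
def make_lut4(func):
    """Build the 16-entry truth table for a 4-input Boolean function.

    This mimics loading configuration data into the LUT.
    """
    table = []
    for address in range(16):
        # Unpack the address into four input bits (bit 0 = first input).
        bits = [(address >> i) & 1 for i in range(4)]
        table.append(func(*bits) & 1)
    return table


def lut4_read(table, a, b, c, d):
    """The four inputs form an address; the stored value is the output."""
    address = a | (b << 1) | (c << 2) | (d << 3)
    return table[address]


# Example: configure the LUT as a 4-input XOR (parity) generator.
parity_lut = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
```

Any 4-input Boolean function fits in the same 16 locations; only the table contents change, which is why reloading the configuration changes the logic.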
2.2.2 Block SelectRAM (BRAMs)
Block SelectRAMs (BRAMs) are dedicated blocks of memory that can store large
amounts of data. Each memory block is four CLBs high, and the blocks are organized into memory
columns stretching the entire height of the chip. There is one such memory column
between every twelve CLB columns. The block SelectRAM also includes dedicated
routing to provide an efficient interface with both CLBs and other block SelectRAMs.
Each block SelectRAM is a fully synchronous dual-ported RAM and can store 4096 bits. Each
port has independent control signals, so the two ports can be configured
independently. The width of each addressable location can vary from 1 to 16 bits. For
example, if each location is 16 bits wide, then there are 256 such locations within one
block SelectRAM memory. The dual-port block SelectRAM is used in our implementation
to store image data.

Figure 2.4 2-slice Virtex™-E CLB
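
The trade-off between width and depth in a 4096-bit block SelectRAM is simple arithmetic, sketched below in Python. The function names are hypothetical, and the restriction to power-of-two widths is an assumption. The second helper is illustrative only: it counts how many blocks a whole image would occupy at one pixel per location and says nothing about how this thesis actually lays out its image data.

```python
BLOCK_RAM_BITS = 4096  # capacity of one block SelectRAM, as stated in the text


def bram_depth(width_bits):
    """Number of addressable locations for a given data width."""
    if width_bits not in (1, 2, 4, 8, 16):
        # Assumption: only power-of-two widths from 1 to 16 are modeled.
        raise ValueError("width must be 1, 2, 4, 8 or 16 bits")
    return BLOCK_RAM_BITS // width_bits


def brams_for_image(rows, cols, bits_per_pixel=8):
    """Blocks needed to hold an image with one pixel per location."""
    depth = bram_depth(bits_per_pixel)
    pixels = rows * cols
    return -(-pixels // depth)  # ceiling division


print(bram_depth(16))             # 256 locations, matching the text
print(brams_for_image(256, 256))  # blocks for a 256 x 256 8-bit image
```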
2.3 Handel-C
Handel-C is essentially an extended subset of the standard ANSI C language,
specifically designed for use in a hardware environment. Unlike other C-to-FPGA tools,
which rely on several intermediate stages, Handel-C allows hardware to be
targeted directly from software, allowing a more efficient implementation to be created.
The language is designed around a simple timing model that makes it very accessible to
system architects and software engineers.
The Handel-C compiler comes packaged with the Celoxica DK1 development
environment. DK1 does not provide synthesis, and the suite must be used in conjunction
with one of the available synthesis tools to complete the design flow from
idea to hardware.
This section is not intended to be a full description of the tool and language, but it
does describe the most important features, especially those that influence the design
decisions described later in this thesis. For full details of the language and development
environment, the reader is referred to the user guides and reference material from the
manufacturer.
2.3.1 Parallel Hardware Generation
One of the advantages of using hardware is the ability to exploit parallelism directly.
Handel-C has additional constructs to support the parallelization of code using the par
statement. When instructed to execute two instructions in parallel, those two instructions
will be executed at exactly the same instant in time by two separate pieces of hardware.
When a parallel block is encountered, execution flow splits at the start of the parallel
block and each branch of the block executes simultaneously. Execution flow then rejoins
at the end of the block when all branches have completed. Any branches that complete
early are forced to wait for the slowest branch before continuing, as shown in Figure 2.5.

Figure 2.5 Parallel branch execution flow

For example, the block
par {
    a = 10;
    b = 20;
}
generates hardware to assign the value 10 to a and 20 to b in a single clock cycle. Using
this statement, large blocks of functionality can be generated that execute in parallel. It
should be noted that a variable cannot be written more than once in the same clock cycle.
par {
    a = 10;
    a = 20;
} // not allowed: a is written twice in the same clock cycle
Hardware can be replicated using the construct
par (i = 0; i < 10; i++)
{
    a[i] = b[i];
}
which results in 10 parallel assignment operations.
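
The single-cycle behaviour of a par block can be illustrated with a small software model. This is a hedged Python sketch rather than anything generated by the Handel-C compiler: par_step and par_replicate are hypothetical names, and the model assumes that every right-hand side is evaluated against the register values held before the clock edge, with all targets updated together.

```python
def par_step(state, assignments):
    """Apply one clock cycle of parallel assignments to a register state.

    assignments is a list of (target, expression) pairs, where each
    expression is a function of the state before the clock edge.
    """
    targets = [target for target, _ in assignments]
    if len(targets) != len(set(targets)):
        raise ValueError("a variable is written more than once in one cycle")
    # Evaluate every right-hand side against the old state first ...
    new_values = {target: expr(state) for target, expr in assignments}
    # ... then update all registers together, as on a clock edge.
    next_state = dict(state)
    next_state.update(new_values)
    return next_state


def par_replicate(count, body):
    """Expand a replicated par into `count` assignments, one per index."""
    return [body(i) for i in range(count)]


# Two branches that read each other's old value swap in one cycle.
swapped = par_step({"a": 1, "b": 2},
                   [("a", lambda s: s["b"]), ("b", lambda s: s["a"])])

# Replicated copy a[i] = b[i] for i = 0..9, keyed by (name, index).
regs = {("b", i): i * i for i in range(10)}
copied = par_step(regs, par_replicate(
    10, lambda i: (("a", i), lambda s: s[("b", i)])))
```

Passing two assignments to the same target raises an error in this model, mirroring the rule that a variable cannot be written twice in one clock cycle.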
2.3.2 Efficient FPGA Resources Usage 
For efficient use of hardware, Handel-C allows variables to be declared with
user-defined bit widths. The declaration int n x; defines a variable x of type int
with a width of n bits. For example, int 10 count; is a signed integer that is 10 bits wide.
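
To make the consequence of a chosen width concrete, the following Python sketch computes the value range of an n-bit signed integer. The helper names are hypothetical. The wrap-around on overflow shown by wrap_signed is an assumption about how a fixed-width two's-complement register behaves, not a statement taken from the Handel-C documentation.

```python
def signed_range(n):
    """Smallest and largest values of an n-bit two's-complement integer."""
    return -(1 << (n - 1)), (1 << (n - 1)) - 1


def wrap_signed(value, n):
    """Keep the low n bits of value and reinterpret them as signed."""
    mask = (1 << n) - 1
    value &= mask
    if value & (1 << (n - 1)):  # sign bit set
        return value - (1 << n)
    return value


print(signed_range(10))         # range of the 10-bit count variable
print(wrap_signed(511 + 1, 10)) # one past the maximum, under the wrap assumption
```

Choosing the narrowest width that covers the required range keeps the number of flip-flops and the width of the attached arithmetic logic to a minimum.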
2.3.3 Bit Level Operators 
Handel-C provides a number of bit manipulation operators. The following bit 
operators are provided:
y = x \\ n    Drops the n least significant bits from x
y = x <- n    Takes the n least significant bits from x
y = x @ z     Concatenates the bit patterns that represent x and z
y = x[3:1]    Selects bits 1, 2 and 3 from x
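The four operators above can be modelled in software to check what each one returns. The following is a minimal Python sketch, not Handel-C: the function names are hypothetical, the values are assumed to be unsigned, and the width of z has to be passed explicitly because Python integers carry no bit width. It also assumes that in x @ z the left operand supplies the most significant bits.

```python
def drop(x, n):
    """Model of x \\ n: drop the n least significant bits of x."""
    return x >> n


def take(x, n):
    """Model of x <- n: keep only the n least significant bits of x."""
    return x & ((1 << n) - 1)


def concat(x, z, z_width):
    """Model of x @ z: x forms the high bits, z the low z_width bits."""
    return (x << z_width) | take(z, z_width)


def select(x, hi, lo):
    """Model of x[hi:lo]: the bits lo..hi of x, inclusive."""
    return take(x >> lo, hi - lo + 1)


x = 0b10110110
print(bin(drop(x, 3)), bin(take(x, 3)), bin(concat(0b101, 0b0011, 4)), bin(select(x, 3, 1)))
```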
2.3.4 Channel communications 
Channels provide a link between parallel branches. One parallel branch outputs data 
onto the channel and the other branch reads data from the channel. Channels also provide 
synchronization between parallel branches because the data transfer can only complete 
when both parties are ready for it. If the transmitter is not ready for the communication 
then the receiver must wait for it to become ready and vice versa. In Figure 2.6, the 
channel is shown transferring data from the left branch to the right branch. If the left 
branch reaches point a before the right branch reaches point b, the left branch waits at 
point a until the right branch reaches point b.
Figure 2.6 Channel Communication
2.3.6 External Communication 
Communication between the hardware and the outside world is performed using 
interfaces. These may be specified as input or output and, as with assignment, a write to 
or a read from an interface takes one clock cycle. The language allows the designer to 
target particular hardware, assign input and output pins, specify the timing of signals, and 
generally control the low level hardware interfacing details. Macros are available to help 
target particular devices.
2.3.5 Memory
RAMs and ROMs can be implemented directly using the "ram" and "rom" keywords. 
Specifying the "block" parameter in conjunction with the "ram" keyword identifies 
block RAMs. Normal variables are implemented as flip-flops. The Handel-C code for 
RAM declarations is as follows.
A 16x8 bit RAM declaration
ram 8 RAM[16];
A 4x8 bit memory in block RAM declaration
ram 8 blockRAM[16] = {11, 22, 88, 44} with {block = 1};
A 84x6 bit memory in distributed RAM declaration
ram unsigned 6 distRAM[84];
2.3.6 Some Restrictions When Using Handel-C and FPGAs
One problem with Handel-C is that it is not designed as an HDL but rather as a high-level 
language with a hardware output, and as such it does not always fully support the 
utilisation of the underlying hardware.
Since Handel-C targets hardware, it imposes some programming restrictions when 
compared to a traditional C compiler. These need to be taken into consideration when 
designing code that can be compiled by Handel-C. Some of these restrictions particularly 
affect the implementation of algorithms. Firstly, there is no stack available, so recursive 
functions are not directly supported by the language and cannot be implemented without 
some modification. Secondly, the size of memory that can be implemented using 
standard logic cells on an FPGA is limited, because implementing memory is an 
inefficient use of FPGA resources. However, some FPGAs have internal RAM that can 
be used by Handel-C. A limitation of using RAM or ROM is that it cannot be accessed 
more than once per clock cycle. This restricts the potential for parallel execution of code 
that accesses RAM or ROM.
2.3.7 Targets Supported by Handel-C 
Handel-C supports two targets. The first is a simulator target that allows development 
and testing of code without the need to use any hardware. This is supported by a 
debugger and other tools. The second target is the synthesis of a netlist for input to place 
and route tools. Place and route is the process of translating a netlist into a hardware 
layout. This allows the design to be translated into configuration data for particular chips. 
An overview of the process is shown in Figure 2.7. When compiling the design for a 
hardware target, Handel-C emits the design in Electronic Design Interchange Format 
(EDIF). A cycle count is available from the simulator, and an estimate of gate 
count is generated by the Handel-C compiler. To get definitive timing information and 
actual hardware usage, the place and route tools need to be invoked.
[Figure: Handel-C source is either compiled for the simulator (simulate and debug, giving a cycle count) or compiled to a netlist (giving a gate estimation and an EDIF file, followed by place and route and programming the FPGA).]
Figure 2.7 Translating code into hardware using Handel-C
2.4 RC1000 PCI Board 
Figure 2.8 is a block diagram of the full-length RC1000 PCI board used in this 
thesis. The RC1000 is a PCI bus plug-in card for PCs. It has one large Xilinx Virtex-E 
FPGA, four 2MB banks of memory for data processing operations, a programmable clock 
and 50 auxiliary I/Os.
All four memory banks are accessible by both the FPGA and any device on the PCI 
bus. The FPGA has two of its pins connected to clocks. One pin is connected to either a 
programmable clock or an external clock. The programmable clocks are programmed by 
the host PC, and have a frequency range of 400kHz to 100MHz. The RC1000 FPGA can 
be programmed from the host PC over the PCI bus.
2.5 Optimization Techniques 
Handel-C is a C-based hardware description language. In Handel-C, registers are 
implemented using flip-flops and all other circuitry is made up of logic gates. Each of the 
logic gates in the circuit has a delay associated with it as the inputs propagate through to 
the outputs. Optimization [20] is a key part of modeling hardware: it reduces the 
propagation delay and exploits parallelism and pipelining. The following techniques are 
followed in this work for the hardware implementation of the image processing algorithms.
Figure 2.8 RC1000 Block Diagram 
2.5.1 Parallelism
Parallelism is exploited by identifying the potential parallelism of the program and then 
running different non-conflicting operations in the same clock cycle to gain speed. On 
FPGAs, specific hardware can be designed so that many operations run in parallel, and a 
significant speed-up can be obtained. This is the main reason why an application on an 
FPGA can sometimes run faster than the software version even though the FPGA hardware 
runs at a much slower clock speed. Figure 2.9 shows the basic parallelism.
{
a = b * c; // cycle 1
k = l * m; // cycle 2
y = x * z; // cycle 3
a = a * k; // cycle 4
y = x + y; // cycle 5
}
The above operations take 5 cycles.
par { // cycle 1
a = b * c;
k = l * m;
y = x * z;
}
par { // cycle 2
a = a * k;
y = x + y;
}
Using parallelism it takes 2 cycles.
Figure 2.9 Basic parallelism
2.5.2 Longest Path Delay 
Reducing the longest path delay is important because the hardware clock period can be no 
shorter than the delay of the longest path. Reducing this delay first ensures that the 
parallel optimization applied in the later stages is as effective as possible. The delay of 
a path can be defined as
$T_{delay} = T_{logic} + T_{routing}$
where $T_{delay}$ is the total delay of the path,
$T_{logic}$ is the delay due to logic, and
$T_{routing}$ is the delay due to routing.
Therefore, the delay is reduced by reducing $T_{logic}$, $T_{routing}$, or both.
In hardware, a complex operation requires a longer clock period to complete. The longest 
path delay can be reduced by breaking up complex operations into several simpler operations. 
This step reduces the logic in each operation and thus reduces $T_{logic}$. Figure 2.10 
shows an example of breaking up a complex operation into simple operations at the cost 
of extra hardware resources.
Complex Operation          Simpler Operations
sum = a*b + c*d + e*f;     T1 = a * b;
                           T2 = c * d;
                           T3 = e * f;
                           sum = T1 + T2 + T3;
Figure 2.10 Breaking Up Complex Operations
2.5.3 Pipelining
Pipelining is an implementation technique whereby multiple tasks are overlapped in 
execution. Ideally, a new task is started on every clock cycle. When the pipeline is full, 
the throughput is one task per cycle regardless of how many cycles each task takes to 
finish.
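The throughput claim can be made concrete with a small cycle count. The sketch below is an idealised Python model under stated assumptions: every task takes the same number of one-cycle stages, a new task enters on every clock cycle, and there are no stalls. The function names and the five-stage example are illustrative and are not taken from the thesis.

```python
def sequential_cycles(num_tasks, stages):
    """Cycles needed when each task must finish before the next one starts."""
    return num_tasks * stages


def pipelined_cycles(num_tasks, stages):
    """Cycles needed in an ideal pipeline: fill it once, then one result per cycle."""
    if num_tasks == 0:
        return 0
    return stages + (num_tasks - 1)


pixels = 256 * 256  # one task per pixel of a 256 x 256 image
print(sequential_cycles(pixels, 5), pipelined_cycles(pixels, 5))
```

For a long stream of tasks the pipelined count approaches one cycle per task, which is the behaviour described above.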
2.6 Prior Related Work 
Richard G.S [8] discusses the idea of parameterized program generation of 
convolution filters in an FPGA. A 2-D filter is assembled from a set of multipliers and 
adders, which are in turn generated from a canonical serial-parallel multiplier stage. 
Atmel application notes [9] discuss a 3x3 convolver with a run-time reconfigurable vector 
multiplier in an Atmel FPGA. Ernest and Wiatr [22] discuss a method for the development of 
an automated tool for generating convolution in FPGAs. Lorca, Kessal and Demigny [12] 
propose a new organization of the filter at the 2D and 1D levels, which reduces the memory 
size and the computation cost by a factor of two for both software and hardware 
implementations. Fahad Alzahrani and Tom Chen [13] present a high performance edge 
detection VLSI architecture for real time image processing applications. The architecture 
is fully pipelined and is capable of producing one edge pixel every clock cycle; at a clock 
rate of 10 MHz, it can process 30 frames per second. V. Gemignani, M. Demi, 
M. Paterni, M. Giannoni and A. Benassi [10] present the real time implementation 
of two mathematical operators which are commonly used to detect edges: (i) the gradient of 
Gaussian and (ii) the b operator, a new operator. The algorithms are implemented on a 
digital signal processor.
CHAPTER 3
IMAGE PROCESSING ALGORITHMS 
This thesis is focused on developing hardware implementations of image processing 
algorithms on FPGAs using Handel-C. This chapter discusses the following basic image 
processing algorithms:
i) Median Filter,
ii) Basic Morphological Operators,
iii) Convolution,
iv) Edge Detection.
3.1 Median Filtering
A median filter is a non-linear digital filter which is able to preserve sharp signal 
changes and is very effective in removing impulse noise (or salt and pepper noise) [1]. 
Impulse noise has a gray level that is markedly higher or lower than that of the 
neighboring points. Linear filters have no ability to remove this type of noise without 
affecting the distinguishing characteristics of the signal; median filters have remarkable 
advantages over linear filters for this particular type of noise. Therefore the median 
filter is very widely used in digital signal and image/video processing applications.
A standard median operation is implemented by sliding a window of odd size (e.g. a 
3x3 window) over an image. At each window position the sampled values of the signal or 
image are sorted, and the median value of the samples is taken as the output that replaces 
the sample in the center of the window, as shown in Figure 3.1.
[Figure: a 3x3 window; its pixel values are sorted (2 3 5 10 11 14 15 20 25) and the center pixel is replaced with the median value, 11.]
Figure 3.1 Median Filter
The main problem of the median filter is its high computational cost (for sorting N 
pixels, the time complexity is O(N log N), even with the most efficient sorting 
algorithms). When the median filter is carried out in real time, a software 
implementation on general-purpose processors does not usually give good results. For 
this reason, FPGAs are used in the real-time implementation of a median filter.
The initial version of the median filter is programmed using VC++ on a PC, so that its 
operation could be verified and its results could be compared to the hardware version. 
The following lines represent the pseudo-code of the median filter.
For x = number of rows
  For y = number of columns
    Window_Array = Array consisting of current window pixels
    Output_image(x,y) = Median(Window_Array);
  End
End
The software implementation works by using for loops to simulate a moving window of 
pixel neighborhoods. For every shift of the window, the algorithm creates a list of 
the pixel values sorted in ascending or descending order, and the middle value is picked 
from the sorted list. The output of the program is an image consisting of the median 
values of the moving window on an image. Since a 3x3 window is used in the median filter 
implementation, the output is dependent on the pixels from neighboring rows. The result 
of this is that some edge effects occur in the output image, meaning that there are always 
invalid pixels along the borders of the output image. This is true for all algorithms 
using the windowing approach in image processing. Figure 3.2 shows (a) a noisy gray scale 
Lena image and (b) the image filtered by a median filter.
(a) Original Image (b) Filtered Image
Figure 3.2 Median Filter on an image
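The pseudo-code above can be sketched as a small runnable software model. This is a Python illustration, not the VC++ program used in the thesis: the function name is hypothetical, the image is assumed to be a list of rows of integers, and border pixels (which have no complete 3x3 neighborhood) are simply copied from the input, which is one possible way of handling the invalid border.

```python
def median_filter_3x3(image):
    """Replace every interior pixel by the median of its 3x3 neighborhood."""
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]  # border pixels keep their input value
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = [image[r + dr][c + dc]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            window.sort()            # ascending order
            out[r][c] = window[4]    # middle of the 9 sorted values
    return out


noisy = [[10, 10, 10],
         [10, 255, 10],   # the 255 is an impulse ("salt") pixel
         [10, 10, 10]]
print(median_filter_3x3(noisy)[1][1])
```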
3.2 Morphological Operators 
The term morphological image processing [17] refers to a class of algorithms that is 
concerned with the geometric structure of an image. Morphology can be used on binary and 
gray scale images, and is useful in many areas of image processing, such as 
skeletonization, edge detection, restoration and texture analysis.
A morphological operator uses a structuring element to process an image, as shown in 
Figure 3.3. The structuring element is a window scanning over an image, which is similar 
to the pixel window used in the median filter. The structuring element can be of any size, 
but 3x3 and 5x5 sizes are common. When the structuring element scans over an element 
in the image, either the structuring element fits or it does not fit. Figure 3.3 
demonstrates the concept of a structuring element fitting and not fitting inside an image 
object.
The most basic building blocks for many morphological operators are erosion and 
dilation [20]. Erosion, as the name suggests, is shrinking or eroding an object in an image. 
Dilation, on the other hand, grows the image object. Both of these operations depend on the 
structuring element and how it fits within the object. For example, if erosion is applied 
to a binary image, the resultant image is one where there is a foreground pixel for every 
center pixel whose structuring element fits within the object. If dilation is applied, the 
output will be a foreground pixel for every point in the structuring element.
[Figure: an image object with one structuring element that fits inside it and one that does not fit.]
Figure 3.3 Structuring element fitting and not fitting
Important operations like opening and closing of an image can be derived by 
performing erosion and dilation in different orders. If erosion is followed by dilation, 
the resulting operation is called an opening. The closing operation is dilation followed by 
erosion. These two secondary morphological operations can be useful in image 
restoration, and their iterative use can yield further interesting results, such as 
skeletonization of an input image.
While morphological operations usually are performed on binary images, some 
processing techniques also apply to grayscale images. These operations are for the most 
part limited to erosion and dilation. Grayscale erosions and dilations produce results 
identical to the nonlinear minimum and maximum filters.
In a minimum filter, the center pixel in the moving window is replaced by the 
smallest pixel value. This has the effect of causing the bright areas of an image to shrink, 
or erode. Similarly, grayscale dilation is performed by using the maximum operator to 
select the greatest value in a window.
(a) Original Image
(b) Erosion (c) Dilation
Figure 3.4 Erosion and Dilation of a Gray Scale Image
The basic morphological operators are programmed using VC++, so that their 
operation could be verified and their results could be compared to the hardware version. 
Since grayscale erosion and dilation are minimum and maximum filters respectively, an 
algorithm similar to median filtering is used. Instead of selecting the middle value from a 
sorted list of window pixels, the minimum value is selected for the minimum filter 
(erosion) and the maximum value is selected for dilation. Figure 3.4 shows the output of 
an erosion and a dilation applied to a gray scale image.
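Because grayscale erosion and dilation differ from the median filter only in which window value is picked, one windowing routine can serve both. The following Python sketch illustrates this under assumptions: the function names are hypothetical, a flat 3x3 structuring element is used, and border pixels are copied unchanged. It is not the VC++ code of the thesis.

```python
def window_filter_3x3(image, pick):
    """Apply pick() to the 3x3 neighborhood of every interior pixel."""
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]  # border pixels keep their input value
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = [image[r + dr][c + dc]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            out[r][c] = pick(window)
    return out


def erode(image):
    """Grayscale erosion as a minimum filter."""
    return window_filter_3x3(image, min)


def dilate(image):
    """Grayscale dilation as a maximum filter."""
    return window_filter_3x3(image, max)


img = [[50, 50, 50],
       [50, 200, 10],
       [50, 50, 50]]
print(erode(img)[1][1], dilate(img)[1][1])
```

An opening or closing, as described earlier in this section, is then just `dilate(erode(img))` or `erode(dilate(img))`.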
3.3 Convolution
Convolution is a simple mathematical operation which is fundamental to many 
common image processing operators. Convolution is a way of multiplying together two 
arrays of numbers of different sizes to produce a third array of numbers. In image 
processing, convolution is used to implement operators whose output pixel values are 
simple linear combinations of certain input pixel values of the image. Convolution 
belongs to a class of algorithms called spatial filters. Spatial filters use a wide variety of 
masks, also known as kernels, to calculate different results, depending on the desired 
function.
3.3.1 1D Convolution
The convolution operation is a mathematical operation which takes two functions f(x) and 
g(x) and produces a third function h(x). Mathematically, convolution is defined as:
$h(x) = f(x) * g(x) = \int_{-\infty}^{\infty} f(\tau)\, g(x - \tau)\, d\tau$ (3.1)
where g(x) is referred to as the filter.
3.3.2 2D-Convolution 
2D convolution is most important to modern image processing. The basic idea is 
that a window of some finite size and shape is scanned over an image. The output pixel 
value is the weighted sum of the input pixels within the window, where the weights are 
the values of the filter assigned to every pixel of the window. The window with its 
weights is called the convolution mask. Mathematically, convolution on an image can be 
represented by the following equation.
$y(m,n) = \sum_{i=0}^{\text{height of image}} \sum_{j=0}^{\text{width of image}} h(i,j)\, x(m-i,\, n-j)$ (3.2)
where x is the input image, h is the filter and y is the output image.
An important aspect of the convolution algorithm is that it supports a virtually infinite 
variety of masks, each with its own feature. This flexibility allows many powerful 
applications. 3x3 convolution masks are most commonly used. For example, the 
derivative operators which are mostly used in edge detection use 3x3 window kernels. 
They operate only on a pixel and its directly adjacent neighbors. Figure 3.5 shows a 3x3 
convolution mask operated on an image. The center pixel is replaced with the output of 
the algorithm; this is carried out for the entire image. Similarly, larger convolution masks 
can be operated on an image.
To illustrate the convolution algorithm, Gaussian convolution filters are chosen. They 
are used to blur images. The Gaussian distribution in 1-D has the form:
$G(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}}$ (3.3)
where σ is the standard deviation of the distribution.
[Figure: a 3x3 convolution mask with weights W1 to W9 is applied to a 3x3 input window of pixels to produce one output pixel.]
Figure 3.5 Convolution
In 2-D, a circularly symmetric Gaussian has the form
$G(x,y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}$ (3.4)
The idea of Gaussian convolution is to use this 2-D distribution as a point spread 
function, and this is achieved by convolution. Since the image is stored as a collection of 
discrete pixels, a discrete approximation to the Gaussian function is required to perform 
the convolution. In theory, the Gaussian distribution is non-zero everywhere, which 
would require an infinitely large convolution kernel, but in practice it is effectively zero 
more than about three standard deviations from the mean, and so the convolution kernel is 
truncated as shown in Figure 3.6.
A software version of the 5x5 Gaussian convolution on an image is implemented in 
VC++. The results obtained from the software version are compared with the results 
obtained from the hardware version. Figure 3.7 shows the images obtained after applying 
the 5x5 Gaussian convolution kernel.
3x3 Gaussian Smooth Filter
        21 31 21
1/256 x 31 48 31
        21 31 21
5x5 Gaussian Smooth Filter, σ = 1.4
         2  4  5  4  2
         4  9 12  9  4
1/115 x  5 12 15 12  5
         4  9 12  9  4
         2  4  5  4  2
Figure 3.6 Convolution Mask
(a) Original Image (b) Convoluted Image
Figure 3.7 5x5 Gaussian Convolution of Lena Image
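Equation (3.2) applied with the 5x5 mask can be sketched in software as follows. This is a Python illustration rather than the thesis's VC++ program: the names are hypothetical, border pixels without a full 5x5 neighborhood are copied unchanged, and integer division is used for the normalization. One assumption needs flagging: the integer entries of the 5x5 mask sum to 159, so this sketch divides by the sum of the mask entries in order to preserve the brightness of flat regions; the figure lists a divisor of 115, and which divisor the hardware used is not stated in this section.

```python
GAUSS_5X5 = [
    [2, 4, 5, 4, 2],
    [4, 9, 12, 9, 4],
    [5, 12, 15, 12, 5],
    [4, 9, 12, 9, 4],
    [2, 4, 5, 4, 2],
]


def convolve2d(image, kernel):
    """Convolve interior pixels with a square, odd-sized integer kernel."""
    k = len(kernel) // 2
    norm = sum(sum(row) for row in kernel)  # assumption: normalize by the mask sum
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]         # border pixels keep their input value
    for r in range(k, rows - k):
        for c in range(k, cols - k):
            acc = 0
            for i in range(-k, k + 1):
                for j in range(-k, k + 1):
                    # h(i, j) * x(m - i, n - j), with the mask centered on the pixel
                    acc += kernel[i + k][j + k] * image[r - i][c - j]
            out[r][c] = acc // norm
    return out


flat = [[100] * 5 for _ in range(5)]
print(convolve2d(flat, GAUSS_5X5)[2][2])
```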
3.4 Edge Detection
Edges are places in the image with strong intensity contrast. Edges often occur at 
image locations representing object boundaries; edge detection is extensively used in 
image segmentation. Representing an image by its edges has the further advantage that 
the amount of data is reduced significantly while retaining most of the image information.
Edges can be detected by applying a high pass frequency filter in the Fourier domain 
or by convolving the image with an appropriate kernel in the spatial domain. In practice, 
edge detection is performed in the spatial domain, because it is computationally less 
expensive and often yields better results. Since edges correspond to strong illumination 
gradients, the derivatives of the image are used for calculating the edges.
The Canny edge detection algorithm [11] is considered a "standard method" and is 
used by many researchers, because it was designed to be an optimal edge detector and 
produces thin edges. The Canny operator works in a multi-stage process. Canny edge 
detection uses linear filtering with a Gaussian kernel to smooth noise and then computes 
the edge strength and direction for each pixel of the smoothed image. This is done by 
differentiating the image in two orthogonal directions and computing the gradient 
magnitude as the root sum of squares of the derivatives. The gradient direction is 
computed using the arctangent of the ratio of the derivatives. Candidate edge pixels are 
identified as the pixels that survive a thinning process called non-maximal suppression. 
In this process, the edge strength of each candidate edge pixel is set to zero if its edge 
strength is not larger than the edge strength of the two adjacent pixels in the gradient 
direction. Thresholding is then done on the thinned edge magnitude image using 
hysteresis. In hysteresis, two edge strength thresholds are used. All candidate edge pixels 
with values below the lower threshold are labeled as non-edges, and the pixels with values 
above the high threshold are considered definite edges. All pixels above the low threshold 
that can be connected to any pixel above the high threshold through a chain are labeled as 
edge pixels. The schematic of Canny edge detection is shown in Figure 3.8.
Figure 3.8 Schematic of Canny edge detection
3.4.1 Smoothing 
The Gaussian distribution in 1-D has the form:
$G(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}}$ (3.5)
where σ is the standard deviation of the distribution.
In 2-D, a circularly symmetric Gaussian has the form
$G(x,y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}$ (3.6)
The idea of Gaussian convolution is to use this 2-D distribution as a point spread 
function, and this is achieved by convolution. Since the image is stored as a collection of 
discrete pixels, a discrete approximation to the Gaussian function is required to perform 
the convolution. In theory, the Gaussian distribution is non-zero everywhere, which 
would require an infinitely large convolution kernel, but in practice it is effectively zero 
more than about three standard deviations from the mean, and so the convolution kernel is 
truncated. A convolution kernel with standard deviation (σ) 1.4 is used for smoothing in 
this thesis, as shown in Figure 3.9.
5x5 Gaussian Smooth Filter, σ = 1.4
         2  4  5  4  2
         4  9 12  9  4
1/115 x  5 12 15 12  5
         4  9 12  9  4
         2  4  5  4  2
Figure 3.9 5x5 Gaussian convolution mask
The effect of Gaussian convolution is to blur an image. The degree of smoothing is 
determined by the standard deviation of the Gaussian.
3.4.2 Gradient Calculation
After smoothing the image and eliminating the noise, the next step is to find the edge 
strength by taking the gradient of the image. Most edge detection methods work on the 
assumption that an edge occurs where there is a discontinuity in the intensity function or 
a very steep intensity gradient in the image as shown in Figure 3.10.
Most edge-detecting operators can be thought of as gradient-calculators. Because the 
gradient is a continuous-function concept and images are discrete functions, we have to 
approximate it. Since derivatives are linear and shift-invariant, gradient calculation is 
most often done using convolution. Numerous kernels have been proposed for finding 
edges, including the Roberts, Kirsch compass, Prewitt and Sobel kernels, among many 
others.
[Figure: an image and its first derivative.]
Figure 3.10 Gradient of Image
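The Prewitt kernels are based on the simple idea of the central difference: the difference 
between the neighboring pixels along a row gives the horizontal gradient, and the 
difference between the neighboring pixels along a column gives the vertical gradient.
$\frac{\partial I}{\partial x} \approx \frac{I(x+1,y) - I(x-1,y)}{2}, \qquad \frac{\partial I}{\partial y} \approx \frac{I(x,y+1) - I(x,y-1)}{2}$ (3.7)
The following convolution masks are derived from these equations.
Horizontal Convolution
 0  0  0
-1  0  1
 0  0  0
Vertical Convolution
 0 -1  0
 0  0  0
 0  1  0
These convolutions are used for calculating the horizontal and vertical gradients.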
3.4.3 Magnitude and Phase 
Convolution of the image with the horizontal and vertical masks produces the horizontal 
gradient (dx) and the vertical gradient (dy) respectively. The absolute gradient magnitude 
(|G|) is calculated as the square root of the sum of the squares of the horizontal (dx) and 
vertical (dy) gradients. That is, $|G| = \sqrt{dx^2 + dy^2}$. To reduce the computational 
cost of the magnitude, it is often approximated by the sum of the absolute values of the 
horizontal and vertical gradients, $|G| = |dx| + |dy|$.
The direction of the gradient (θ) is calculated as the arctangent of the ratio of the 
vertical gradient to the horizontal gradient, $\theta = \arctan(dy / dx)$.
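The gradient, magnitude and direction calculations can be illustrated with a short Python sketch. The names are hypothetical and several choices are assumptions rather than statements about the thesis's implementation: the masks are applied as written (right minus left, below minus above), the factor of 1/2 of the central difference is omitted as in the masks, and `math.atan2` is used in place of a plain arctangent so that dx = 0 does not cause a division by zero.

```python
import math


def gradients(image, r, c):
    """Central differences at an interior pixel, matching the two 3x3 masks."""
    dx = image[r][c + 1] - image[r][c - 1]   # horizontal mask: -1 0 1 along the row
    dy = image[r + 1][c] - image[r - 1][c]   # vertical mask: -1 0 1 down the column
    return dx, dy


def magnitude_exact(dx, dy):
    """Root sum of squares of the two gradients."""
    return math.sqrt(dx * dx + dy * dy)


def magnitude_approx(dx, dy):
    """Cheaper approximation: sum of the absolute values."""
    return abs(dx) + abs(dy)


def direction_degrees(dx, dy):
    """Gradient direction in degrees, safe when dx is zero."""
    return math.degrees(math.atan2(dy, dx))


img = [[0, 0, 0],
       [1, 5, 4],
       [0, 4, 0]]
dx, dy = gradients(img, 1, 1)
print(dx, dy, magnitude_exact(dx, dy), magnitude_approx(dx, dy))
```

The approximation avoids both the multiplications and the square root, which is why it is attractive in hardware, at the cost of overestimating the magnitude for diagonal gradients.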
3.4.4 Non-Maximum Suppression
With the magnitude and direction obtained from the previous stage, one can apply the 
thresholding operation of the gradient-based method, but this ends up with ridges of edge 
pixels. To get rid of the ridges, the edge strength of each candidate edge pixel is set to 
zero if its edge strength is not larger than the edge strength of the two adjacent pixels in 
the gradient direction. This is called the thinning process.
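A software sketch of this thinning step is given below. It is a Python illustration with hypothetical names. The section does not say how the gradient direction is mapped onto pixel neighbors, so the sketch assumes the common approach of quantizing the direction to the nearest of 0, 45, 90 and 135 degrees, with angles measured so that 90 degrees points down the rows; border pixels are set to zero.

```python
def non_max_suppress(mag, angle_deg):
    """Keep a pixel only if it is larger than both neighbors along its gradient."""
    rows, cols = len(mag), len(mag[0])
    out = [[0] * cols for _ in range(rows)]
    # (row step, column step) to the neighbor in each quantized direction
    offsets = {0: (0, 1), 45: (1, 1), 90: (1, 0), 135: (1, -1)}
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            a = angle_deg[r][c] % 180                       # direction is modulo 180
            sector = (int((a + 22.5) // 45) * 45) % 180     # nearest of 0/45/90/135
            dr, dc = offsets[sector]
            m = mag[r][c]
            if m > mag[r + dr][c + dc] and m > mag[r - dr][c - dc]:
                out[r][c] = m
    return out


ridge = [[0, 5, 9, 5, 0]] * 3          # a three-pixel-wide vertical ridge
angles = [[0] * 5 for _ in range(3)]   # gradient points along the rows
print(non_max_suppress(ridge, angles)[1])
```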
3.4.5 Threshold
The output image of the non-maximum suppression stage may contain broken edge 
contours and single edge points which contribute to noise. These can be eliminated by 
thresholding with hysteresis. Two thresholds are considered for hysteresis, one high 
threshold and one low threshold. If any edge response is above the high threshold, those 
pixels constitute definite edge output of the detector for a particular scale. Individual 
weak responses usually correspond to noise, but if these points are connected to any of the 
pixels above the high threshold, they are more likely to be actual edges in the image. Such 
connected pixels are treated as edge pixels if their response is above the low threshold.
To get thin edges, two thresholds (a high threshold $T_H$ and a low threshold $T_L$) are 
used. If the gradient of the edge pixel is above $T_H$, it is considered an edge pixel. If 
the gradient of the edge pixel is below $T_L$, it is unconditionally set to zero. If the 
gradient is between these two, it is set to zero unless there is a path from this pixel to 
a pixel with a gradient above $T_H$; the path must be entirely through pixels with 
gradients of at least $T_L$.
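The rule above amounts to growing edges outward from the strong pixels. The Python sketch below illustrates it with hypothetical names; it assumes 8-connectivity between neighboring pixels, which the section does not specify, and returns a binary edge map.

```python
def hysteresis(mag, t_low, t_high):
    """Mark pixels above t_high, then grow through neighbors of at least t_low."""
    rows, cols = len(mag), len(mag[0])
    edge = [[0] * cols for _ in range(rows)]
    # Seeds: definite edges, strictly above the high threshold
    stack = [(r, c) for r in range(rows) for c in range(cols)
             if mag[r][c] > t_high]
    for r, c in stack:
        edge[r][c] = 1
    while stack:
        r, c = stack.pop()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (0 <= rr < rows and 0 <= cc < cols
                        and not edge[rr][cc] and mag[rr][cc] >= t_low):
                    edge[rr][cc] = 1          # weak pixel connected to a strong one
                    stack.append((rr, cc))
    return edge


row = [[0, 40, 60, 100, 0, 60]]
print(hysteresis(row, 50, 90))
```

In the example, the first 60 is kept because it touches the strong pixel, while the last 60 is dropped because the zero between them breaks the path.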
First a software model is designed and programmed in VC++, so that its operation 
could be verified and its results could be compared with the hardware version. Figure 3.11 
(b) shows the Canny edge detection of a 256 x 256 gray scale Lena image.
(a) Original Lena Image (b) Canny Edge Detection Image
Figure 3.11 Edge Detection of Gray Scale Image
CHAPTER 4
HARDWARE IMPLEMENTATION 
This chapter explains in detail the reconfigurable hardware implementation of the image 
processing algorithms discussed in Chapter 3 on a Xilinx Virtex-E FPGA. The 
algorithms implemented are:
• Median Filtering,
• Basic Morphological Operations (Erosion and Dilation),
• Convolution,
• Edge Detection of an image.
First, the implementation of the moving window operator, which forms the basis of 
these algorithms, is explained.
4.1 Moving Window Operator 
The algorithms implemented in this work use the moving window operator. The 
moving window operator usually processes one pixel of the image at a time, changing its 
value by some function of a local region of pixels (covered by the window). The operator 
moves over the image to process all the pixels in the image. In this section, a 3x3 moving 
window used for the median filtering, morphological and edge detection algorithms and a 
5x5 moving window used in the Gaussian smoothing filter operation are explained.
For the pipelined implementation of image processing algorithms, all the pixels in the 
moving window must be accessible at the same time on every clock cycle. To achieve 
this, a design was devised that takes advantage of certain features of FPGAs. First In 
First Out (FIFO) buffers are used to create the effect of moving an entire window of 
pixels through the memory on every clock cycle. A FIFO consists of a block of memory 
and a controller that manages the traffic of data to and from the FIFO. The FIFOs are 
implemented using circular buffers constructed from multi-port block RAM, with an index 
keeping track of the front item in the buffer. The availability of multi-port block RAM in 
the Xilinx Virtex-E FPGA makes it possible to read from and write to the RAM in the same 
clock cycle. This allows a throughput of one pixel per clock cycle. The same effect can be 
achieved using double-width RAMs implemented in lookup tables on the FPGA. However, the 
use of block RAMs is more efficient and has less associated logic for reading and writing.
For a 3x3 moving window two FIFO buffers are used. The size of each FIFO buffer is 
W-3, where W is the width of the image. To access all the values of the window 
on every clock cycle the two FIFO buffers must be full. Figure 4.1 shows the architecture 
of the 3x3 moving window. On every clock cycle, a pixel is read from the RAM and 
placed into the bottom left corner location of the window. The contents of the window are 
shifted to the right, with the rightmost member being added to the tail of the FIFO. The 
top right pixel is discarded after the computation on the pixels is completed, since it is 
not used in future computation.
Figure 4.1 Architecture of 3x3 moving window
Similarly, for a 5x5 window operation four FIFO buffers are used. Each FIFO has a size
of W-5, where W is the width of the image. To access the values in the moving window on
every clock cycle, the four FIFO buffers must be full. Figure 4.2 shows the architecture
of the 5x5 moving window. On every clock cycle, a pixel read from the RAM is placed into
the bottom left corner location of the window. The contents of the window are shifted to
the right, with the rightmost member being added to the tail of the FIFO. The top right
pixel is discarded after the computation on the pixels is completed.
Figure 4.2 Architecture of 5x5 moving window
4.2 Median Filter
A median filter is implemented by sliding a window of odd size over an image. A 3x3
window size is chosen for the implementation of the median filter because it is small
enough to fit onto the target FPGAs and is considered large enough to be effective for
most commonly used image sizes. The median filter uses the 3x3 window operation discussed
in Section 4.1. The median filtering operation sorts the pixel values in a window in
ascending order and picks the middle value; the center pixel in the window is replaced
by this middle value. The most efficient method of accomplishing this is with a system of
hardware compare/sort units, which sort a window of pixels into ascending order.
The sorting can be accomplished simply in Handel-C with the "if/else" statement.
if (wx1 < wx2)
{
    par {               // both assignments happen in the same clock cycle
        Cx1_L = wx1;    // lower value
        Cx1_H = wx2;    // higher value
    }
}
else
{
    par {
        Cx1_L = wx2;
        Cx1_H = wx1;
    }
}
The hardware design of the median filter is shown in Figure 4.4. Each Cxx represents
a compare unit and each Rxx represents a register that stores an intermediate value. The
functionality of the compare unit is shown in Figure 4.3.
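A software model makes the role of the compare unit concrete. The Python sketch below is not the network of Figure 4.4; as an assumption for illustration it uses a plain odd-even transposition network, which needs more compare units but is easy to verify. `compare_unit` mirrors the if/else statement above, and `median9` is a hypothetical helper name.

```python
def compare_unit(a, b):
    """Return (low, high), like one hardware compare unit."""
    return (a, b) if a < b else (b, a)

def median9(window):
    """Median of nine values using only compare units.

    Uses an odd-even transposition network: nine stages, each stage
    applying compare units to disjoint neighbouring pairs, so every
    stage could run in parallel in hardware.
    """
    v = list(window)
    assert len(v) == 9
    for stage in range(9):
        start = stage % 2                  # even stages: (0,1),(2,3)...; odd: (1,2),(3,4)...
        for i in range(start, 8, 2):
            v[i], v[i + 1] = compare_unit(v[i], v[i + 1])
    return v[4]                            # middle element of the sorted list
```

For example, `median9([9, 1, 8, 2, 7, 3, 6, 4, 5])` returns 5.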
Figure 4.3 Compare Unit
The median filter algorithm used in this design is pipelined to produce one median
value on every clock cycle. The output for a given input appears at the other end of the
pipe after an initial latency. The median filtering algorithm uses complex pipelining to
accelerate the flow of data.
The pipelined architecture of the median filter has four processes, as shown in
Figure 4.5. They are as follows:
# Reading process: Reads the image pixels from the external RAM.
# Buffer process: Copies information from the reading process into the FIFO buffers
and shifts data through the FIFO buffers at the desired time instant, so that valid data
is placed at the input and collected at the output.
# Filtering process: Processes the data from the FIFO buffers at the desired time
instant and passes the filtered data to the writing process.
# Writing process: Reads the output of the filtering process and writes it into the
external RAM.
Figure 4.4 Hardware Design for Sorting Algorithm
In pipelining, each process generates the data that is passed to the subsequent process,
and that process is blocked until the data is available. Initially, the buffer process takes
the data from the reading process. The first output of the buffer process is blocked until
the FIFO buffer is full, after which the filtering process takes data as input from the
buffer process. The first output of the filtering process is subsequently blocked until the
filtering is done. The writing process starts writing the data into the external RAM once
it gets data from the filtering process. The pipelining process is shown in Figure 4.5.
Median filtering at the border of the image is handled by placing zero-valued pixels
around the borders. To perform this process, two counters are used to track the
borders.
The median filtering of a 256 x 256 Lena image was implemented using the pipelined
design on a Xilinx Virtex-E FPGA. A memory read has a latency of two cycles, while memory
writes are completed in the same cycle, so the reading process has a latency of two
cycles. The filtering process takes W+2 clock cycles to become ready, where W is the
width of the image. The output of the filtering process is available after a latency of
fourteen cycles. The writing process writes the filtered pixels into the RAM in one cycle.
Since the design is pipelined, an output is produced on every clock cycle. The median
filter implementation with this pipelined architecture on an FPGA requires far fewer cycles
than the same algorithm implemented on a general purpose computer.
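As a rough check on these figures, the cycle count can be tallied in a few lines of Python. This is a back-of-the-envelope sketch, not part of the thesis design: it assumes that the stage latencies simply add and that one pixel leaves the pipeline per clock cycle. The function name and its default arguments are illustrative.

```python
def estimated_time_ms(width, height, clock_mhz,
                      read_latency=2, filter_latency=14, write_latency=1):
    """Estimate processing time for a fully pipelined window filter.

    Assumes (for illustration) that the latencies quoted in the text add up
    once, after which one pixel is produced on every clock cycle.
    """
    fill_cycles = width + 2                       # cycles before the filter is ready
    cycles = (read_latency + fill_cycles + filter_latency
              + write_latency + width * height)
    return cycles / (clock_mhz * 1000.0)          # MHz -> cycles per millisecond
```

For a 256 x 256 image at the 25.9 MHz clock listed in Table 5.1 this gives about 2.54 ms, which is close to the 2.56 ms reported there.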
Figure 4.5 Pipelining Process
4.3 Basic Morphological Operators 
The basic morphological operators are erosion and dilation. The erosion and dilation
of a grayscale image are called grayscale erosion and grayscale dilation. Grayscale erosion
is performed by a minimum filter, whereas dilation is performed by a maximum filter. In a
3x3 minimum filter, the center pixel is replaced by the minimum value of the pixels in the
window. In a maximum filter, the center pixel is replaced by the maximum value of the
pixels in the window. The implementation of the minimum and maximum filters is similar to
the median filter implementation.
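A minimal Python sketch of the two operators on a single 3x3 window is given below. It is a software illustration only; the helper names are hypothetical, and the pairwise reduction merely stands in for a tree of hardware compare units.

```python
def compare_unit(a, b):
    """Return (low, high), like one hardware compare unit."""
    return (a, b) if a < b else (b, a)

def erode_dilate_3x3(window):
    """Return (erosion, dilation) of the centre pixel of a 3x3 window.

    Erosion is the minimum and dilation the maximum of the nine pixels;
    both are found by chaining compare units over the flattened window.
    """
    flat = [p for row in window for p in row]
    low = high = flat[0]
    for p in flat[1:]:
        low, _ = compare_unit(low, p)      # keep the running minimum
        _, high = compare_unit(high, p)    # keep the running maximum
    return low, high
```

For the window `[[5, 3, 9], [1, 7, 2], [8, 6, 4]]` the function returns `(1, 9)`.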
4.3 Convolution
Convolution is one of the basic and most common operations on images. It uses the window
operator discussed in Section 4.1. The center pixel in the window is replaced by the
weighted sum of the input pixels within the window divided by the sum of the weights in the
window, where the weights are the filter values assigned to each pixel of the window.
Convolution is a computationally expensive operation. To calculate a pixel for a given
mask of size m x n, m*n multiplications, m*n-1 additions and one division are required. So
to perform a 3x3 convolution on a 256 x 256 grayscale image, 589824 multiplications,
524288 additions and 65536 divisions are required.
Multiplication and division operators produce the deepest logic. A single cycle
divide or multiply produces a large amount of hardware and long delays through
deep logic. In order to improve the performance of the convolution operation, it is
necessary to reduce the number of multiplication and division operators. Multiplication
and division can be done using bit shifting, but this is only possible for powers of 2.
Multiplierless multiplication can be employed to multiply by numbers that are not powers
of 2, where the multiplication is done with only shifts and additions derived from the
binary representation of the multiplier. For example, A multiplied by B = 14 = 1110 in
binary can be implemented as ((A << 1) + (A << 2) + (A << 3)), where << denotes a shift
to the left.
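The shift-and-add idea can be checked in software. The Python sketch below is illustrative; the function names are not from the thesis. It shows the multiply-by-14 example and a general version that adds one shifted copy of the input per set bit of the constant.

```python
def times14_shift_add(a):
    """Multiply by 14 = 0b1110 using only shifts and additions."""
    return (a << 1) + (a << 2) + (a << 3)

def shift_add_multiply(a, constant):
    """Multiply a by a non-negative constant without a multiplier.

    For every set bit i in the constant, a << i is added to the total.
    """
    total = 0
    bit = 0
    while constant:
        if constant & 1:
            total += a << bit
        constant >>= 1
        bit += 1
    return total
```

Note that the parentheses around each shift matter: in C-like languages, as in Python, `+` binds more tightly than `<<`, so `A << 1 + A << 2` would not compute the intended sum.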
The coefficient values in the convolution mask employed in most image processing
applications remain constant for the entire processing, so Constant Coefficient
Multipliers (KCMs) can be employed. A KCM is composed of Look Up Tables (LUTs) and
adders. The constant value k is multiplied by the whole numbers 0 to 15 and the products
are stored in LUTs, as shown in Figure 4.6. To multiply an 8 bit number by an 8 bit
constant, the 8 bit number is split into two 4 bit values, each addressing a different
LUT to produce a 12 bit value. The two 12 bit values are combined to produce a 16 bit
output. KCMs can be implemented on an FPGA very efficiently because of its LUT based
architecture.
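The KCM scheme can be modelled in a few lines of Python. This is a software sketch under the assumptions stated in the text (8 bit input, 8 bit constant, two 16-entry tables); the names `build_kcm_table` and `kcm_multiply` are hypothetical.

```python
def build_kcm_table(k):
    """Return the 16-entry table [0*k, 1*k, ..., 15*k].

    For an 8 bit constant k the largest entry is 15 * 255 = 3825,
    which fits in 12 bits.
    """
    return [k * n for n in range(16)]

def kcm_multiply(x, table):
    """Multiply the 8 bit value x by the constant encoded in table.

    The low and high nibbles of x each address the table; the high
    product is shifted left by 4 and added to the low product.
    """
    low = table[x & 0xF]
    high = table[(x >> 4) & 0xF]
    return (high << 4) + low
```

In hardware the two lookups would use two copies of the table so that both nibbles can be resolved in the same clock cycle; the model uses one list for brevity.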
Figure 4.6 8 bit Constant Coefficient Multiplier (KCM)
In this section, 3x3 and 5x5 Gaussian convolution filters of standard deviation 1.4
(σ = 1.4), shown in Figure 4.7, are implemented. These smoothing windows contain no
negative numbers, so the convolution can be calculated using only unsigned numbers.
3x3 Gaussian smooth filter (scale factor 1/256):
21 31 21
31 48 31
21 31 21
5x5 Gaussian smooth filter, σ = 1.4 (scale factor 1/115):
2  4  5  4  2
4  9 12  9  4
5 12 15 12  5
4  9 12  9  4
2  4  5  4  2
Figure 4.7 Convolution Masks
To apply a 5x5 Gaussian convolution to an image, a 5x5 moving window operator is
used. A pipelined implementation is carried out for the Gaussian convolution of an image.
In order to access all the pixels in the window in a single clock cycle, four 8 bit FIFO
buffers are used. Since the convolution mask is fixed for the whole image, dedicated
hardware can be designed. Some of the window coefficients are powers of 2. The
multiplication of these coefficients with the corresponding pixels in the window can
be carried out using left shift operations, and the coefficients that are not powers of 2
can be implemented using multiplierless multiplication. An important property of the
symmetrical 5x5 window coefficients shown in Figure 4.7 is that they allow pre-addition
of certain input values before any multiplication takes place. This property is used in
the implementation of the convolution operation, and it reduces the number of
multiplication operations from 25 to 3. The multiplications by powers of 2 are done by
left shift operations.
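The effect of pre-addition can be illustrated with a Python model of the weighted sum for the 5x5 mask of Figure 4.7. This is a sketch, not the thesis implementation: pixels that share a coefficient are added first, and each group sum is then scaled with shifts and additions. The particular shift decompositions and all names are assumptions chosen for the example; the final division step is not modelled.

```python
MASK_5X5 = [
    [2, 4, 5, 4, 2],
    [4, 9, 12, 9, 4],
    [5, 12, 15, 12, 5],
    [4, 9, 12, 9, 4],
    [2, 4, 5, 4, 2],
]

def direct_sum(window):
    """Weighted sum using one multiplication per pixel (25 in total)."""
    return sum(MASK_5X5[r][c] * window[r][c]
               for r in range(5) for c in range(5))

def preadded_sum(window):
    """Same weighted sum, adding pixels with equal coefficients first."""
    groups = {}
    for r in range(5):
        for c in range(5):
            k = MASK_5X5[r][c]
            groups[k] = groups.get(k, 0) + window[r][c]
    s2, s4, s5, s9, s12, s15 = (groups[k] for k in (2, 4, 5, 9, 12, 15))
    return ((s2 << 1)                                  # 2  = 2
            + (s4 << 2)                                # 4  = 4
            + (s5 << 2) + s5                           # 5  = 4 + 1
            + (s9 << 3) + s9                           # 9  = 8 + 1
            + (s12 << 3) + (s12 << 2)                  # 12 = 8 + 4
            + (s15 << 3) + (s15 << 2) + (s15 << 1) + s15)  # 15 = 8 + 4 + 2 + 1
```

Both functions return the same value for any window; the second needs no general multiplier at all.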
Division is also a very expensive operation on an FPGA. Instead of using the division
operator it is much simpler to use the right shift operator, so a divide by 128 was
implemented instead of a divide by 115.
The longest path delay associated with wide adders is large, because the delay of the
carry ripple grows with the adder width. Therefore short adders are employed, which can
operate in parallel.
Convolution at the border of the image is handled by placing zero-valued pixels
around the borders. To perform this process, two counters are used to track the
borders.
The 3x3 convolution mask shown in Figure 4.7 has no coefficients that are powers of 2.
Direct multiplication is a complex operation on an FPGA. Since the window coefficients are
constant for the entire image, a KCM based multiplication approach can be employed. The
multiplication tables of 0 to 15 times the window coefficients are stored in ROMs.
When an 8 bit pixel is multiplied by an 8 bit constant, the 8 bit pixel is split into two
4 bit values that address the two ROMs, each of which produces a 12 bit result. The two
12 bit results are combined to produce a 16 bit output, as shown in Figure 4.6. On FPGAs,
two locations of the same ROM cannot be accessed in parallel. For an efficient
implementation, two ROMs are therefore created to store the multiplication table of each
constant, so that both lookups can be made in parallel. As in the 5x5 case, convolution at
the border of the image is handled by placing zero-valued pixels around the borders, with
two counters used to track the borders.
The 3x3 convolution is implemented using both direct multiplication and look up table
based multiplication, and the results are analyzed.
4.4 Edge Detection
The hardware implementation of the Canny edge detection algorithm is discussed in this
section. The Canny edge detector consists of four stages:
# Image smoothing.
# Vertical and horizontal gradient calculation.
# Directional non-maximum suppression.
# Threshold.
Figure 4.8 Design Flow of Edge Detection
Classically, the image smoothing is implemented by applying the Gaussian
convolution to the entire image. The smoothed image is then used as the input
to calculate the gradient at every pixel, and these gradient values are used to calculate
the phase and the magnitude for each pixel, which is followed by non-maximum suppression.
To get rid of ridges, the edge strength of each candidate edge pixel is set to zero if its
edge strength is not larger than the edge strength of the two adjacent pixels in the
gradient direction. This is called non-maximum suppression. Two threshold values (a high
threshold and a low threshold) are then used to get the connected edge pixels. This is
called hysteresis.
On a general purpose computer the four stages of Canny edge detection are
performed sequentially on the entire image, one stage followed by the next. On an FPGA
this approach requires a lot of hardware resources and the design is slow. In order to
use the hardware resources efficiently and increase the speed, hardware features like
parallelism and pipelining are employed. The pipelined architecture shown in Figure 4.9
is designed.
Figure 4.9 Pipelined Architecture
Since the output of each stage depends on the neighboring pixels, the moving window
operator discussed in Section 4.1 is adopted.
4.4.1 Image Smoothing 
Smoothing of the image is achieved by 5x5 Gaussian convolution, as described in
Section 4.3. A 5x5 moving window operator is used, and four FIFO buffers are employed to
access all the pixels in the 5x5 window at the same time. Since the design is pipelined,
the Gaussian smoothing starts once two of the FIFO buffers are full. That is, the output
is produced after a latency of twice the width of the image plus two (2*width + 2)
cycles. The output of this stage is given as input to the horizontal and vertical
gradient calculation stage.
4.4.2 Vertical and Horizontal Gradient Calculation 
This stage calculates the vertical and horizontal gradients using the 3x3 convolution
kernels shown in Figure 4.10. The 8 bit pixel produced in row order on every clock cycle
by the image smoothing stage is used as the input to this stage. Since 3x3 convolution
kernels are used to calculate the gradients, the eight neighboring pixels are required to
calculate the gradient of the center pixel, while the previous stage produces only one
pixel at a time in row order. In order to access the eight neighboring pixels in a single
clock cycle, two FIFO buffers are employed to store the output pixels of the previous
stage.
Horizontal convolution kernel:
 0  0  0
-1  0  1
 0  0  0
Vertical convolution kernel:
 0 -1  0
 0  0  0
 0  1  0
Figure 4.10 Gradient Convolution Kernels
The gradient calculation introduces negative numbers. In Handel-C, negative
numbers can be handled easily by using signed data types. Signed data means that a
negative number is interpreted as the 2's complement of the number. In this design, one
extra bit is used for signed numbers compared to unsigned 8 bit numbers, i.e. 9 bits are
used to represent a gradient output instead of 8. Two gradient values are calculated for
each pixel, one vertical and one horizontal. The 9 bits of the vertical gradient and the
9 bits of the horizontal gradient are concatenated to produce 18 bits. Since the whole
design is pipelined, an 18 bit number is generated on every clock cycle, which forms the
input to the next stage.
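The gradient computation and the 18 bit packing can be sketched in Python as follows. This is an illustrative model: the kernels of Figure 4.10 are applied as plain weighted sums, and placing the horizontal gradient in the upper nine bits is an assumption, since the text does not fix the bit order unambiguously. All function names are hypothetical.

```python
def gradients(window):
    """Return (dx, dy) for the centre of a 3x3 window of 8 bit pixels."""
    dx = window[1][2] - window[1][0]     # right neighbour minus left neighbour
    dy = window[2][1] - window[0][1]     # lower neighbour minus upper neighbour
    return dx, dy

def pack18(dx, dy):
    """Concatenate two 9 bit two's complement values into one 18 bit word."""
    return ((dx & 0x1FF) << 9) | (dy & 0x1FF)

def to_signed9(value):
    """Interpret a 9 bit field as a two's complement number."""
    return value - 512 if value & 0x100 else value

def unpack18(word):
    """Split an 18 bit word back into signed (dx, dy)."""
    return to_signed9((word >> 9) & 0x1FF), to_signed9(word & 0x1FF)
```

With 8 bit pixels each gradient lies in the range -255 to 255, which is why nine bits are sufficient.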
4.4.3 Directional Non Maximum Suppression 
Directional non-maximum suppression works with the magnitude and orientation of
the gradient of the pixel under consideration and creates edges that are one pixel wide.
The values of each component of the gradient obtained from the previous stage are used to
get the magnitude and direction. The direction of the gradient is calculated
mathematically as the arctangent of the vertical gradient component over the horizontal
gradient component, $\text{direction} = \arctan(d_y / d_x)$. Since the arctangent is a very
complex function and also requires floating point numbers, it is very difficult to
implement on an FPGA. Instead, the value and sign of the components of the gradient are
analyzed to calculate the direction of the gradient. If the current pixel is $P_{x,y}$ and
the values of the derivatives at that pixel are $d_x$ and $d_y$, the direction of the
gradient at P can be approximated by one of the sectors shown in Figure 4.11.
Once the direction of the gradient is known, the values of the pixels found in the
neighborhood of the pixel under analysis are interpolated. A pixel that does not have a
locally maximum gradient magnitude is eliminated. The comparison is made between the
actual pixel and its neighbors, along the direction of the gradient.
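A sector can be selected from the signs of dx and dy and a comparison of their absolute values, without evaluating an arctangent. The Python sketch below illustrates the idea with eight 45 degree sectors numbered counter-clockwise from the positive x axis; the numbering, the handling of sector boundaries and the function name are assumptions for the example and may differ from Figure 4.11.

```python
def gradient_octant(dx, dy):
    """Return the 45 degree sector (0..7) that contains the gradient direction.

    Sector k covers angles from 45*k up to (but not including) 45*(k+1)
    degrees. Only sign tests and one magnitude comparison are used.
    """
    if dx == 0 and dy == 0:
        return 0                                    # no gradient: sector is arbitrary
    ax, ay = abs(dx), abs(dy)
    if dx > 0 and dy >= 0:                          # first quadrant
        return 0 if ax > ay else 1
    if dx <= 0 and dy > 0:                          # second quadrant
        return 2 if ay > ax else 3
    if dx < 0 and dy <= 0:                          # third quadrant
        return 4 if ax > ay else 5
    return 6 if ay > ax else 7                      # fourth quadrant
```

For example, `gradient_octant(5, 2)` returns 0 (between 0 and 45 degrees) and `gradient_octant(-1, 4)` returns 2 (between 90 and 135 degrees).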
Figure 4.11 Gradient Orientation
For example, if the approximate direction of the gradient is between 0° and 45°, the
magnitude of the gradient at $P_{x,y}$ is compared with the magnitude of the gradient at
the adjacent points, as shown in Figure 4.12, where $P_{x,y} = |dx_{x,y}| + |dy_{x,y}|$.
The values of the gradient at the points $P_a$ and $P_b$ are defined as follows.
(4.1)
(4.2)
Figure 4.12 Pixel Interpolation
The center pixel is considered an edge if $P_{x,y} > P_a$ and $P_{x,y} > P_b$. If these
conditions are not satisfied, the center pixel is eliminated.
Since the gradient calculation as explained depends on the directly neighboring pixels,
a 3x3 window operator is used. The output of the previous stage is used as the input to
this stage. The output produced in the previous stage is an 18 bit number, in which the
first nine bits are the horizontal gradient and the other nine bits are the vertical
gradient. In order to access all the pixels in the 3x3 window at the same time, two
18 bit wide FIFO buffers, each with a length of the image width minus three, are employed.
To calculate the phase and magnitude at every pixel, the horizontal and vertical gradient
values extracted from the 18 bit number are used. The output produced in this stage is
given as input to the threshold stage.
4.4.4 Threshold
The output image of the non-maximum suppression stage may contain broken edge
contours and single edge points, which contribute to noise. These can be eliminated by
thresholding. To get thin edges, two thresholds (a high threshold T_H and a low threshold
T_L) are used. If the gradient of the edge pixel is above T_H, it is considered an edge
pixel; let us call it a definite edge pixel. If the gradient of the edge pixel is below
T_L, it is unconditionally set to zero. If the gradient is between these two, then it may
be an edge pixel; let us call it a maybe edge pixel. It is set to zero unless there is a
path from this pixel to a pixel with a gradient above T_H; the path must pass entirely
through pixels with gradients of at least T_L.
To find the connected path between the maybe edge pixel and the definite edge pixel, a
3x3 window operator is used. If the center pixel is a definite edge pixel and any of the
neighbors is a maybe edge pixel, then the maybe pixel is considered a definite edge pixel.
Likewise, if the center pixel is a maybe pixel and any of the neighbors is a definite
pixel, then the maybe pixel is considered a definite edge pixel. The resultant image is an
image with thin, sharp edges.
Since the design is pipelined, each output pixel produced is written to RAM on every
clock cycle.
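The two-threshold rule with a 3x3 neighborhood can be modelled in Python as below. This is a simplified software sketch, not the pipelined hardware: it makes a single pass, so a maybe pixel is promoted only when a definite pixel is one of its eight direct neighbors, and longer chains of maybe pixels are not followed. The function names and the 0/1 output encoding are assumptions for illustration.

```python
def classify(magnitude, t_low, t_high):
    """Return 2 for a definite edge, 1 for a maybe edge, 0 otherwise."""
    if magnitude > t_high:
        return 2
    if magnitude >= t_low:
        return 1
    return 0

def hysteresis_pass(mag, t_low, t_high):
    """Single-pass hysteresis thresholding over a 2-D list of magnitudes."""
    height, width = len(mag), len(mag[0])
    labels = [[classify(m, t_low, t_high) for m in row] for row in mag]
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if labels[r][c] == 2:
                out[r][c] = 1                       # definite edge pixel
            elif labels[r][c] == 1:
                # Promote a maybe pixel if any 3x3 neighbour is definite.
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if (0 <= rr < height and 0 <= cc < width
                                and labels[rr][cc] == 2):
                            out[r][c] = 1
    return out
```

With thresholds 40 and 90, a pixel of magnitude 60 next to one of magnitude 100 is kept, while an isolated pixel of magnitude 50 is set to zero.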
The 256 x 256 grayscale Lena image was used as a benchmark for the basic image
processing algorithms implemented on the FPGA: the median filter, the basic morphological
operators, 5x5 Gaussian convolution and Canny edge detection. Figure 4.13 (b) shows the
image obtained by processing (a), the Lena image with salt and pepper noise, on the
FPGA. Figure 4.13 (c) and (d) show the erosion and dilation, respectively, of the original
Lena image processed on the FPGA. Figure 4.13 (e) shows the 5x5 Gaussian convolution of
the image and (f) shows the edge detection of the Lena image implemented on the FPGA.
(a) Lena (salt & pepper noise) (b) Median filtered image
(c) Erosion of lena image (d) Dilation of Lena Image
(e) 5x5 Gaussian Convolution (f) Edge detection
Figure 4.13 Lena image processed on FPGA
CHAPTER 5
RESULTS AND FUTURE WORK
The image processing algorithms were simulated and synthesized in the Handel-C
hardware description language using the Celoxica DK2 environment. The Handel-C compiler
produces an Electronic Design Interchange Format (EDIF) output when compiling the
design for a hardware target. The Xilinx placement and routing tools are used to translate
the EDIF output into a hardware layout (bit format file).
A host side program on the PC was written using VC++ for FPGA configuration and
communication with the RC1000 board. The RC1000 device driver routines provided by the
board vendor are employed to accomplish this task. Once the FPGA is configured, the
host program requests ownership of the SRAM memory bank on the RC1000 board.
The host program loads the image to be processed into the SRAM memory bank and signals
the FPGA to read the image from the memory bank and start processing, as shown in
Figure 5.1.
Figure 5.1 Host FPGA Communication
When the processing of the image is completed by the FPGA, it signals the host program
that the processed image is in the memory bank. The host PC reads the processed image
from the memory bank into its main memory and displays it.
5.1 Results
Timing is an important metric when comparing the hardware and software
implementations. The timing results of the image processing algorithms implemented on
the Xilinx Virtex-E FPGA and on a Pentium III are tabulated below:
Table 5.1 Timing result of Median Filter on a 256 x 256 grayscale image
System                  Freq [MHz]   Time [ms]
Xilinx Virtex-E FPGA    25.9         2.56
Pentium III             1300         51
It may be observed from Table 5.1 that the time taken by the median filter on a
256 x 256 grayscale image on a PC with a 1300 MHz Pentium III is 51 ms. The pipelined
implementation of the median filter on a 256 x 256 grayscale Lena image on a
Xilinx Virtex-E takes 2.56 ms at a clock frequency of 25.9 MHz. A comparison of these
results shows that the implementation of the median filter on the Xilinx Virtex-E
FPGA is about 20 times faster than that on the PC.
Table 5.2 Timing result of Median Filter on a 512 x 512 grayscale image
System                  Freq [MHz]   Time [ms]
Xilinx Virtex-E FPGA    23.06        11.3
Pentium III             1300         235
The pipelined implementation of the median filter on a 512 x 512 grayscale Lena image
on a Xilinx Virtex-E takes 11.3 ms at a clock frequency of 23.06 MHz. A comparison of
these results shows that the pipelined median filter on the Xilinx Virtex-E is roughly
21 times faster than that on the PC.
Table 5.3 Timing result of 5x5 Convolution on a 256 x 256 grayscale image
System                                                 Freq [MHz]   Time [ms]
Xilinx Virtex-E FPGA, direct division by 115           25.9         2.62
Xilinx Virtex-E FPGA, division using right shift (>> 7)  42           1.57
Pentium III                                            1300         31
The time taken for 5x5 Gaussian convolution on a 256 x 256 grayscale image using
division by 115 on the Xilinx Virtex-E FPGA is 2.62 ms, whereas the time taken for
5x5 Gaussian convolution using the shift operation (>> 7) for division is 1.57 ms.
The time taken for 5x5 Gaussian smoothing convolution of a 256 x 256 grayscale image on
a PC with a 1300 MHz Pentium III is 31 ms.
Table 5.4 Timing result of 5x5 Convolution on a 512 x 512 grayscale image
System                                                 Freq [MHz]   Time [ms]
Xilinx Virtex-E FPGA, direct division by 115           24.8         10.16
Xilinx Virtex-E FPGA, division using right shift (>> 7)  40.43        7.03
Pentium III                                            1300         125
The time taken for 5x5 Gaussian convolution on a 512 x 512 grayscale image using
division by 115 on the Xilinx Virtex-E FPGA is 10.16 ms, whereas the time taken for
5x5 Gaussian convolution using the shift operation (>> 7) for division is 7.03 ms. On
a 1.3 GHz Pentium III PC the time taken for 5x5 Gaussian smoothing convolution of a
512 x 512 grayscale image is 125 ms.
Table 5.5 Timing results of 3x3 Convolution on a 256 x 256 grayscale image
System                                            Freq [MHz]   Time [ms]
Xilinx Virtex-E FPGA, direct multiplication       42.03        1.58
Xilinx Virtex-E FPGA, LUT based multiplication    50.99        1.31
Pentium III                                       1300         16
Table 5.5 shows the time taken for 3x3 Gaussian smoothing on a 256 x 256 grayscale
image by direct multiplication and by Look Up Table (LUT) based multiplication on the
Xilinx Virtex-E FPGA. It was observed that the convolution performed using LUT
based multiplication is faster than the convolution performed by direct multiplication.
Table 5.6 Timing results of 3x3 Convolution on a 512 x 512 grayscale image
System                                            Freq [MHz]   Time [ms]
Xilinx Virtex-E FPGA, direct multiplication       42.03        6.49
Xilinx Virtex-E FPGA, LUT based multiplication    50.99        5.32
Pentium III                                       1300         50
Table 5.6 shows the time taken for 3x3 Gaussian smoothing on a 512 x 512 grayscale
image by direct multiplication and by Look Up Table (LUT) based multiplication on the
Xilinx Virtex-E FPGA.
Table 5.7 Timing result of edge detection algorithm on a 256 x 256 grayscale image
System                  Freq [MHz]   Time [ms]
Xilinx Virtex-E FPGA    16           4.2
Pentium III             1300         47
Table 5.8 Timing result of edge detection algorithm on a 512 x 512 grayscale image
System                  Freq [MHz]   Time [ms]
Xilinx Virtex-E FPGA    15.8         16.7
Pentium III             1300         171
From Tables 5.7 and 5.8, it can be seen that the edge detection algorithm runs about
ten times faster in hardware than on the PC.
The device usage is normally reported in terms of the number of look up tables, block
RAMs and CLB slices, and the gate count. The device usages for the algorithms implemented
in this thesis are shown in Table 5.9.
From the tables, it was observed that the implementation of the algorithms on an FPGA,
exploiting parallelism and pipelining, is faster than on a general purpose computer.
Table 5.9 Implementation Cost on FPGA
Algorithm                                  LUTs   Flip-Flops   Block RAMs   Gate Count (%)   CLB Slices
Median, Erosion & Dilation                 856    298          2            1.9              652
5x5 Convolution, div by 115                947    926          4            2.08             1041
5x5 Convolution, div by shift (>> 7)       597    802          4            3.3              791
3x3 Convolution, direct multiplication     535    479          2            4.25             479
3x3 Convolution, LUT based multiplication  384    428          2            4.04             479
Edge Detection                             945    807          4            4.85             1820
5.2 Future Work
The performance of the image processing algorithms in this work is achieved by
implementing the algorithms on Field-Programmable Gate Arrays (FPGAs). Hardware
implementation accelerates the designs by performing the operations concurrently. At
the same time, the reprogrammability of FPGAs allows for a faster and cheaper design
cycle compared to Application Specific Integrated Circuit (ASIC) design.
Handel-C gives the flexibility of reusing the components of one design in
different designs. Because of this, the image processing algorithms implemented in this
thesis can be used in many different applications.
This work can be further extended to a real time implementation of object detection.
The implementation of the whole design on an FPGA is tedious and time consuming. With
the advances in software tools and the growing functionality and capabilities of
FPGAs, hardware/software solutions can be employed that drastically reduce the design
time for high speed applications.
One of the current shortcomings of the designs presented in this thesis is the resource
utilization of the FPGA. This is mainly due to the FIFO units used in the design.
FPGA resource utilization can be greatly reduced by creating the FIFO buffers in external
RAM. A large part of the improvement possible in this design lies in the algorithms
themselves. If the convolution design were changed so that the kernel is not fixed, the
convolution algorithms would gain the ability to change convolution kernels on
the fly.
62
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
BmUOGRAHGY
1. John C. Ross. "Image Processing Hand book", CRC Press. 1994.
2. Stephen D.Brown, R.J. Francis, JRose Z.G.Vranesic "Hied Programmable Gate 
Arrays", 1992.
3. Moore,M A DSP-based real time image processing sytem, "Proceedings of the 6^ 
IntemaAonal conference on signal processing applicaAons and technology, Boston 
MA", August 1995.
4. Rolf F. Molzl, Paulo M. Engel 1, Fernando G. Moraes2, Lionel Torres3, Michel 
Roberts. Design of a ClassiAcaAon System for Rectangular Shapes Using a Co- 
Design Environment, "13th Symposium on Integrated Circuits and Systems Design ,
01/01/2000, pp. 281-286".
5. Handel-C Tutonal Celoxica Ltd.
6. Xilinx Vertex™-E Field Prgrammable Gate Arrays (V2.4) "July 17, 2002 ProducAon 
Product speciAcaAon".
7. Digital Video & Image processing Xilinx soluAons for the Broadcast Chain. "Xilinx 
Ltd 2002".
8. Richard G.Shoup. Parameterized ConvoluAon Filtering in a Field rogrammable Gate 
Array Interval," Reasearch Palo Alto, California .1993".
9. 3x3 Convolver with Run-Time ReconAgurable Vector MuIAplier in Atmel AT6000 
FPGAs. "AT6000 FPGAs AppIicaAon Note 1997".
10. V.Gemignani, M. Demi, M Patemi , M Giannoni and A Benassi. DSP 
implementaAon of real Ame edge detectors. "Proceedings of speech and image 
processing pp 1721-1725,2001".
11. J.Canny . A computaAonal approach to the edge detecAon. "IEEE Trans Pattern and 
Machine Intelligent. Vol PAMI-8 1986 pp 679-698".
12. F.G.Lorca, L Kessal and D.Demigny EfAcent ASIC and FPGA implementaAon of 
IIR Alters for Real Ame edge detecAon. "IntemaAonal Conference on image 
processing(ICIP-97) volume 2. oct 1997".
63
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
13. Fahad Alzahrani, Tom Chen. Real-time high performance edge detector for computer 
vision applications. "Proceedings of ASP-DAC, 1997, pp 671-672".
14. Peter Mc Curry, Fearghal Morgan, Liam Kilmartin. Xilinx FPGA implementation of 
a pixel processor for object detection applications. "Proc Irish Signals and Systems 
Conference 2000".
15. David E., Fergal S. and Andy N. Correction of Geometric Image Distortion using 
FPGAs. "Optical Metrology, Imaging, and Machine Vision Conference, SPIE-Int'l 
Society Optical Engineering, Galway, IRL, 2002".
16. Peter Martin. A pipelined hardware implementation of genetic programming using 
FPGAs and Handel-C. "5th European Conference, EuroGP 2002, Kinsale, Ireland, 
April 3-5, 2002".
17. An Introduction to Morphological Image Processing. "SPIE Bellingham WA. 1992".
18. Lee Fergusion. Image Processing using Reconfigurable FPGAs. DSP and Multimedia 
Technology. "May/June 1996 Golden Gate Enterprises, Inc."
19. Chou C., MohanKrishnan S., Evans J. FPGA implementation of Digital Filters. 
"Proceedings of the International Conference Signal for Application & 
Technology, 1993".
20. Man Ng. High Level Design For High Speed FPGA Devices. Master's Thesis. "Dept 
of Computing, Imperial College. June 13, 2002".
21. Arrigo Benedetti, Andrea Prati, and Nello Scarabottolo. Image Convolution on 
FPGAs: the Implementation of a Multi-FPGA FIFO Structure. "Proceedings of the 
24th Euromicro Conference (1998), pp. 123-130".
22. Ernest Jamro, Kazimierz Wiatr. FPGA implementation of Addition as a part of 
convolution. "IEEE Euromicro Conference, Poland".
23. Advanced RAM access from Handel-C. Celoxica Ltd.
APPENDIX
This appendix lists the Handel-C code that was developed to describe the hardware 
for the image processing algorithms on the FPGA.
/**********************************************************************
Function: ReadImageFromRam
Reads data from RAM bank 0. When the host program signals the FPGA, the FPGA
takes ownership of RAM bank 0. The process reads one pixel on every clock cycle
and outputs it down the channel OutData.
**********************************************************************/
void ReadImageFromRam(chan unsigned 8 *OutData)
{
    unsigned 8 Pixel;        // 8 bit register to store the pixel value
    unsigned 17 addr;        // 17 bit address to access the RAM
    static signal unsigned 1 NoDataSent = 0;
    macro expr PipeLatency = 4;
    unsigned 2 DelayCounter; // counter to delay the output

    addr = 1;
    par // execution in parallel
    {
        while(1)
        {
            //if data has been sent on this clock cycle
            if (!NoDataSent)
            {
                par // execution in parallel
                {
                    //read one pixel from the memory bank
                    PP1000ReadBank0(Pixel, 0@addr);
                    addr++;
                }
            }
            else //otherwise do nothing
                delay;

            if (addr == 0)
            {
                //release the ownership of the RAM
                PP1000ReleaseMemoryBank(0x1);
            }
        }
        {
            //delay until the pipeline is primed
            do
            {
                DelayCounter++;
            } while(DelayCounter != PipeLatency-1);
            //send the data to the next process in the pipeline
            while(1)
            {
                prialt
                {
                    case *OutData ! Pixel: break;
                    default: NoDataSent = 1; break;
                }
            }
        }
    }
}// end of ReadImageFromRam
/**********************************************************************
Function: WritePixelToRam
Writes data to RAM bank 1. When the processed image has been written to the RAM,
the process releases ownership of RAM bank 1 and signals the host program. The
process reads a pixel from the channel and writes one pixel per clock cycle.
**********************************************************************/
void WritePixelToRam(chan unsigned 8 *PixelInChan)
{
    unsigned 8 Pixel;        // register to store the 8 bit pixel value
    unsigned 17 addr;        // 17 bit address for the memory location
    static signal unsigned 1 NoDataSent = 0;
    macro expr PipeLatency = 4;
    //counter to delay the output of data
    unsigned 2 DelayCounter;

    addr = 1;                // initialize the address location
    *PixelInChan ? Pixel;    // read pixel from the channel
    par // parallel execution
    {
        while(1)
        {
            //if data has been sent on this clock cycle
            if (!NoDataSent)
            {
                par
                {
                    //write the pixel at a memory location
                    PP1000WriteBank1(0@addr, Pixel);
                    addr++;
                }
            }
            else //otherwise do nothing
                delay;

            if (addr == 0)
            {
                //release the ownership of the RAM bank
                PP1000ReleaseMemoryBank(0x2);
                //write the control status to the host program
                PP1000WriteStatus(0);
                //read the control word from the host program
                //(Reg and readsignal are not declared in this listing)
                PP1000ReadControl(Reg);
                //request the memory banks
                PP1000RequestMemoryBank(0x3);
                readsignal = 1;
                addr = 1;
            }
        }
        {
            //delay until the pipeline is primed
            do
            {
                DelayCounter++;
            } while(DelayCounter != PipeLatency-1);
            //read the data from the previous process when the data is available
            while(1)
            {
                prialt
                {
                    case *PixelInChan ? Pixel: break;
                    default: NoDataSent = 1; break;
                }
            }
        }
    }
}// end of WritePixelToRam
/**********************************************************************
Declaration of 8-bit First In First Out (FIFO) buffers.
**********************************************************************/
struct _FIFO_PIXEL_8_
{
    //multiport RAM for the FIFO data
    mpram
    {
        rom PIXEL_8 Read[FIFO_WIDTH];   // read only memory
        wom PIXEL_8 Write[FIFO_WIDTH];  // write only memory
    } ramBuffer with {block = 1};
    //pointer to the front of the FIFO
    unsigned INDEX_WIDTH ramBufferFront;
};
The FIFO_PIXEL_8 is an 8-bit dual-ported RAM. One port is read-only and the 
other is write-only. The dual-port RAM allows a read and a write in the same clock cycle.
In the FIFO, data is read from one end and written at the other end in the same cycle.
/**********************************************************************
Macro Procedure:
ReadFIFO: To read a pixel from the head of the FIFO
**********************************************************************/
macro expr ReadFIFO(FIFO) = FIFO.ramBuffer.Read[FIFO.ramBufferFront
    == FIFO_WIDTH-1 ? 0 : FIFO.ramBufferFront+1];
/**********************************************************************
Macro Procedure:
WriteFIFO: To write a pixel to the tail of the FIFO
**********************************************************************/
macro proc WriteFIFO(FIFO, Pixel)
{
    FIFO.ramBuffer.Write[FIFO.ramBufferFront] = Pixel;
}
/**********************************************************************
Macro Procedure: IncrementFIFOPointer
Advances the pointer to the head of the FIFO, wrapping to 0 at the end of the buffer.
**********************************************************************/
macro proc IncrementFIFOPointer(FIFO)
{
    if (FIFO.ramBufferFront != FIFO_WIDTH-1)
    {
        FIFO.ramBufferFront++;
    }
    else
        FIFO.ramBufferFront = 0;
}
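The behaviour of the three FIFO macros can be cross-checked in software. The following is a minimal sketch in plain C (not Handel-C) of one clock cycle: read the slot after the front pointer, write the slot at the front pointer, then advance the pointer with wrap-around. The names `fifo_t` and `fifo_step` and the value `FIFO_WIDTH = 8` are assumptions chosen for illustration; the listing above does not give the actual width.

```c
#include <stdint.h>

#define FIFO_WIDTH 8u   /* assumed length, for illustration only */

typedef struct {
    uint8_t  ram[FIFO_WIDTH];  /* models the dual-port RAM contents */
    unsigned front;            /* models ramBufferFront */
} fifo_t;

/* One clock cycle: read slot front+1 (wrapping), write slot front,
   then advance front. Returns the value read. */
uint8_t fifo_step(fifo_t *f, uint8_t in)
{
    unsigned rd = (f->front == FIFO_WIDTH - 1u) ? 0u : f->front + 1u;
    uint8_t out = f->ram[rd];
    f->ram[f->front] = in;
    f->front = rd;
    return out;
}
```

Under these assumptions a value written in one cycle is read back `FIFO_WIDTH - 1` cycles later, so the buffer acts as a fixed delay line.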
/**********************************************************************
Function : ManageBuffers5x5
This function generates a 5x5 window on every clock cycle. Because the architecture
is pipelined, output is produced only once the FIFO buffers have been filled. The first
output occurs after 2*W+2 clock cycles; the channel output is delayed until the pipeline
is primed. Once the FIFOs are full, a 5x5 window of pixels is produced on every clock cycle.
**********************************************************************/
void ManageBuffers5x5(chan PIXEL_8* InPixelChan, chan WINDOW_PIXEL_8* OutWindowChan)
{
    //5 by 5 window to be processed
    static WINDOW_PIXEL_8 Window;
    //4 FIFO buffers of pixels for the four rows of the image
    static FIFO_PIXEL_8 Buffers[4];
    //input pixel
    static PIXEL_8 Pixel;
    //signal indicating whether data has been sent on a clock cycle
    static signal unsigned 1 NoDataSent = 0;
    //delay in clock cycles from the input to the output
    macro expr PipeLatency = ((WIDTH << 1) + 2);
    //counter to delay the output of data while the pipeline is primed
    unsigned 10 DelayCounter;

    //read in the first pixel (blocks until data is available)
    *InPixelChan ? Pixel;
    par // parallel execution of statements
    {
        while(1)
        {
            //if data has been sent this clock cycle
            if (!NoDataSent)
            {
                par
                {
                    //signal for data read from the buffers
                    signal unsigned sReadBuffer[5];
                    //read in a new pixel from the input channel
                    *InPixelChan ? Pixel;
                    //read the input data from SRAM
                    sReadBuffer[4] = Pixel;
                    //read the data from the head of the buffer into the signal
                    par (y=0; y<4; y++)
                    {
                        sReadBuffer[y] = ReadFIFO(Buffers[y<-2]);
                        IncrementFIFOPointer(Buffers[y<-2]);
                    }
                    /*
                    The four leftmost elements of the lower window rows are
                    moved to the FIFO buffers of the rows above
                    */
                    par (y=1; y<5; y++)
                    {
                        WriteFIFO(Buffers[(y-1)<-2], ReadDataWindow(Window, y, 0));
                    }
                    //construct the window from the registers and the buffers
                    par (y=0; y<5; y++)
                    {
                        WriteDataWindow(Window, y, 0, ReadDataWindow(Window, y, 1));
                        WriteDataWindow(Window, y, 1, ReadDataWindow(Window, y, 2));
                        WriteDataWindow(Window, y, 2, ReadDataWindow(Window, y, 3));
                        WriteDataWindow(Window, y, 3, ReadDataWindow(Window, y, 4));
                        WriteDataWindow(Window, y, 4, sReadBuffer[y]);
                    }
                }
            }
            //otherwise pause the pipeline for one clock cycle
            else
                delay;
        }
        {
            //delay the output until the pipeline is primed
            do
            {
                DelayCounter++;
            }
            while(DelayCounter != PipeLatency-1);
            //send the data to the next process in the pipeline
            while(1)
            {
                prialt
                {
                    case *OutWindowChan ! Window: break;
                    default: NoDataSent = 1; break;
                }
            }
        }
    }
}// end of ManageBuffers5x5
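The data relationship that the window generator maintains can be illustrated in software. The sketch below, in plain C rather than Handel-C, uses a single shift register of recent pixels in place of the four FIFOs and the window registers, so it models which pixels appear in the window, not the hardware structure or its exact latency. The names `window_gen_t` and `window_push` and the image width `IMG_W = 8` are assumptions for illustration.

```c
#include <stdint.h>
#include <string.h>

enum { IMG_W = 8, K = 5, HIST = (K - 1) * IMG_W + K };

typedef struct {
    uint8_t  hist[HIST];  /* hist[0] is the newest pixel */
    unsigned count;       /* number of pixels pushed so far */
} window_gen_t;

/* Push one pixel of a row-major stream and rebuild the K x K window whose
   bottom-right corner is that pixel. Returns 1 once enough pixels have
   arrived for every window entry to hold real data. */
int window_push(window_gen_t *g, uint8_t px, uint8_t win[K][K])
{
    memmove(g->hist + 1, g->hist, HIST - 1);
    g->hist[0] = px;
    g->count++;
    for (int r = 0; r < K; r++)
        for (int c = 0; c < K; c++)
            win[r][c] = g->hist[(K - 1 - r) * IMG_W + (K - 1 - c)];
    return g->count >= (unsigned)HIST;
}
```

In this sketch the window entry at row `r`, column `c` is the pixel that arrived `(K-1-r)*IMG_W + (K-1-c)` cycles earlier, which is the same delay pattern the line buffers provide in hardware.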
/**********************************************************************
Function : ManageBuffers3x3
This function generates a 3x3 window on every clock cycle. Because the architecture
is pipelined, output is produced only once the FIFO buffers have been filled. The first
output occurs after W+2 clock cycles; the channel output is delayed until the pipeline
is primed. Once the FIFOs are full, a 3x3 window of pixels is produced on every clock cycle.
**********************************************************************/
void ManageBuffers3x3(chan PIXEL_8* InPixelChan, chan WINDOW_PIXEL_8* OutWindowChan)
{
    //3 by 3 window to be processed
    static WINDOW_PIXEL_8 Window;
    //2 FIFO buffers of pixels for the two rows of the image
    static FIFO_PIXEL_8 Buffers[2];
    //input pixel
    static PIXEL_8 Pixel;
    //signal indicating whether data has been sent on a clock cycle
    static signal unsigned 1 NoDataSent = 0;
    //delay in clock cycles from the input to the output
    macro expr PipeLatency = WIDTH;
    //counter to delay the output of data while the pipeline is primed
    unsigned 9 DelayCounter;

    //read in the first pixel (blocks until data is available)
    *InPixelChan ? Pixel;
    par
    {
        while(1)
        {
            //if data has been sent this clock cycle
            if (!NoDataSent)
            {
                par
                {
                    //signal for data read from the buffers
                    signal unsigned sReadBuffer[3];
                    //read in a new pixel from the input channel
                    *InPixelChan ? Pixel;
                    //read the input data
                    sReadBuffer[BOTTOM] = Pixel;
                    //read the data from the head of the buffer into the signal
                    par (y=0; y<2; y++)
                    {
                        sReadBuffer[y] = ReadFIFO(Buffers[y<-1]);
                        IncrementFIFOPointer(Buffers[y<-1]);
                    }
                    /*
                    move the leftmost element of each of the lower kernel
                    lines into the buffer for the line above
                    */
                    par (y=1; y<3; y++)
                    {
                        WriteFIFO(Buffers[(y-1)<-1], ReadDataWindow(Window, y, LEFT));
                    }
                    //construct the kernel from the registers and the buffers
                    par (y=0; y<3; y++)
                    {
                        WriteDataWindow(Window, y, LEFT, ReadDataWindow(Window, y, CENTRE));
                        WriteDataWindow(Window, y, CENTRE, ReadDataWindow(Window, y, RIGHT));
                        WriteDataWindow(Window, y, RIGHT, sReadBuffer[y]);
                    }
                }
            }
            //otherwise pause the pipeline for one clock cycle
            else
                delay;
        }
        {
            //delay the output until the pipeline is primed
            do
            {
                DelayCounter++;
            }
            while(DelayCounter != PipeLatency-1);
            /*
            Send the data to the next process in the pipeline.
            */
            while(1)
            {
                prialt
                {
                    case *OutWindowChan ! Window: break;
                    default: NoDataSent = 1; break;
                }
            }
        }
    }
}// end of ManageBuffers3x3
/**********************************************************************
Function : ProcessWindowMedian
Performs a median filter algorithm on the 3x3 window inputs. The median filter is
pipelined, so the first output appears after a latency of 14 clock cycles.
**********************************************************************/
void ProcessWindowMedian(WINDOW_PIXEL_8 *W, unsigned 8 *Pixel)
{
    //registers to store all the outputs
    unsigned 8 out1,out2,out3,out4,out5,out6,out7,out8,out9;
    //signals for intermediate values
    signal unsigned int 8 C11_L,C11_H,C12_L,C12_H,C13_L,C13_H,C14_L,C14_H,
        C21_L,C21_H,C22_L,C22_H,C23_L,C23_H,C24_L,C24_H,
        C31_L,C31_H,C32_L,C32_H,C33_L,C33_H,C34_L,C34_H,
        C41_L,C41_H,C42_L,C42_H,C43_L,C43_H,
        C4A1_L,C4A1_H,C4A2_L,C4A2_H,
        C4B0_L,C4B0_H,C4B1_L,C4B1_H,C4B2_L,C4B2_H,
        C51_L,C51_H,C61_L,C61_H,C71_L,C71_H,C81_L,C81_H,
        C91_L,C91_H,C101_L,C101_H,C111_L,C111_H;
    signal unsigned int 8 R11,R21,R31,R41,R42,R43,R4A1,R4A2,R4A5,R4A4,R4A3;
    signal unsigned int 8 R4B1,R4B4,R4B5,R51,R52,R53,R54,R55,R56,R57,
        R61,R62,R63,R64,R65,R66,R67,R71,R72,R73,R74,
        R75,R76,R77,R81,R82,R83,R84,R85,R86,R87,R91,R92,
        R93,R94,R95,R96,R97,R101,R102,R103,R104,R105,
        R106,R107,R111,R112,R113,R114,R115,R116,R117;
    // comparison of the signals for median filtering
    par{ // parallel execution
        //level 1
        par{
            if ((*W).YC[0][0] < (*W).YC[0][1]) { par{ C11_L = (*W).YC[0][0]; C11_H = (*W).YC[0][1]; } }
            else                               { par{ C11_L = (*W).YC[0][1]; C11_H = (*W).YC[0][0]; } }
            if ((*W).YC[0][2] < (*W).YC[1][0]) { par{ C12_L = (*W).YC[0][2]; C12_H = (*W).YC[1][0]; } }
            else                               { par{ C12_L = (*W).YC[1][0]; C12_H = (*W).YC[0][2]; } }
            if ((*W).YC[1][1] < (*W).YC[1][2]) { par{ C13_L = (*W).YC[1][1]; C13_H = (*W).YC[1][2]; } }
            else                               { par{ C13_L = (*W).YC[1][2]; C13_H = (*W).YC[1][1]; } }
            if ((*W).YC[2][0] < (*W).YC[2][1]) { par{ C14_L = (*W).YC[2][0]; C14_H = (*W).YC[2][1]; } }
            else                               { par{ C14_L = (*W).YC[2][1]; C14_H = (*W).YC[2][0]; } }
            R11 = (*W).YC[2][2];
        }//par
        //level 2
        par{
            if (C11_L < C12_L) { par{ C21_L = C11_L; C21_H = C12_L; } }
            else               { par{ C21_L = C12_L; C21_H = C11_L; } }
            if (C11_H < C12_H) { par{ C22_L = C11_H; C22_H = C12_H; } }
            else               { par{ C22_L = C12_H; C22_H = C11_H; } }
            if (C13_L < C14_L) { par{ C23_L = C13_L; C23_H = C14_L; } }
            else               { par{ C23_L = C14_L; C23_H = C13_L; } }
            if (C13_H < C14_H) { par{ C24_L = C13_H; C24_H = C14_H; } }
            else               { par{ C24_L = C14_H; C24_H = C13_H; } }
            R21 = R11;
        }//par
        //level 3
        par{
            if (C21_L < C23_L) { par{ C31_L = C21_L; C31_H = C23_L; } }
            else               { par{ C31_L = C23_L; C31_H = C21_L; } }
            if (C21_H < C23_H) { par{ C32_L = C21_H; C32_H = C23_H; } }
            else               { par{ C32_L = C23_H; C32_H = C21_H; } }
            if (C22_L < C24_L) { par{ C33_L = C22_L; C33_H = C24_L; } }
            else               { par{ C33_L = C24_L; C33_H = C22_L; } }
            if (C22_H < C24_H) { par{ C34_L = C22_H; C34_H = C24_H; } }
            else               { par{ C34_L = C24_H; C34_H = C22_H; } }
            R31 = R21;
        }//par
        //level 4
        par{
            R41 = C31_L;
            if (C31_H < C32_L) { par{ C41_L = C31_H; C41_H = C32_L; } }
            else               { par{ C41_L = C32_L; C41_H = C31_H; } }
            if (C32_H < C33_L) { par{ C42_L = C32_H; C42_H = C33_L; } }
            else               { par{ C42_L = C33_L; C42_H = C32_H; } }
            if (C33_H < C34_L) { par{ C43_L = C33_H; C43_H = C34_L; } }
            else               { par{ C43_L = C34_L; C43_H = C33_H; } }
            R42 = C34_H;
            R43 = R31;
        }//par
        //level 4a
        par{
            R4A1 = R41;
            if (C41_L < C42_H) { par{ C4A1_L = C41_L; C4A1_H = C42_H; } }
            else               { par{ C4A1_L = C42_H; C4A1_H = C41_L; } }
            if (C41_H < C42_L) { par{ C4A2_L = C41_H; C4A2_H = C42_L; } }
            else               { par{ C4A2_L = C42_L; C4A2_H = C41_H; } }
            par{
                R4A2 = C43_L;
                R4A3 = C43_H;
                R4A4 = R42;
                R4A5 = R43;
            }
        }//par
        //level 4b
        par{
            R4B1 = R4A1;
            if (C4A1_L < C4A2_L) { par{ C4B0_L = C4A1_L; C4B0_H = C4A2_L; } }
            else                 { par{ C4B0_L = C4A2_L; C4B0_H = C4A1_L; } }
            if (C4A2_H < R4A2)   { par{ C4B1_L = C4A2_H; C4B1_H = R4A2; } }
            else                 { par{ C4B1_L = R4A2;   C4B1_H = C4A2_H; } }
            if (C4A1_H < R4A3)   { par{ C4B2_L = C4A1_H; C4B2_H = R4A3; } }
            else                 { par{ C4B2_L = R4A3;   C4B2_H = C4A1_H; } }
            par{
                R4B4 = R4A4;
                R4B5 = R4A5;
            }
        }//par
        //level 5
        par{
            if (R4B1 < R4B5) { par{ C51_L = R4B1; C51_H = R4B5; } }
            else             { par{ C51_L = R4B5; C51_H = R4B1; } }
            par{
                R51 = C4B0_L;
                R52 = C4B0_H;
                R53 = C4B1_L;
                R54 = C4B1_H;
                R55 = C4B2_L;
                R56 = C4B2_H;
                R57 = R4B4;
            }
        }//par
        //level 6
        par{
            if (R51 < C51_H) { par{ C61_L = R51;   C61_H = C51_H; } }
            else             { par{ C61_L = C51_H; C61_H = R51; } }
            par{
                R61 = C51_L;
                R62 = R52;
                R63 = R53;
                R64 = R54;
                R65 = R55;
                R66 = R56;
                R67 = R57;
            }
        }//par
        //level 7
        par{
            if (R62 < C61_H) { par{ C71_L = R62;   C71_H = C61_H; } }
            else             { par{ C71_L = C61_H; C71_H = R62; } }
            par{
                R71 = R61;
                R72 = C61_L;
                R73 = R63;
                R74 = R64;
                R75 = R65;
                R76 = R66;
                R77 = R67;
            }
        }//par
        //level 8
        par{
            if (R73 < C71_H) { par{ C81_L = R73;   C81_H = C71_H; } }
            else             { par{ C81_L = C71_H; C81_H = R73; } }
            par{
                R81 = R71;
                R82 = R72;
                R83 = C71_L;
                R84 = R74;
                R85 = R75;
                R86 = R76;
                R87 = R77;
            }
        }//par
        //level 9
        par{
            if (R84 < C81_H) { par{ C91_L = R84;   C91_H = C81_H; } }
            else             { par{ C91_L = C81_H; C91_H = R84; } }
            par{
                R91 = R81;
                R92 = R82;
                R93 = R83;
                R94 = C81_L;
                R95 = R85;
                R96 = R86;
                R97 = R87;
            }
        }//par
        //level 10
        par{
            if (R95 < C91_H) { par{ C101_L = R95;   C101_H = C91_H; } }
            else             { par{ C101_L = C91_H; C101_H = R95; } }
            par{
                R101 = R91;
                R102 = R92;
                R103 = R93;
                R104 = R94;
                R105 = C91_L;
                R106 = R96;
                R107 = R97;
            }
        }//par
        //level 11
        par{
            if (R106 < C101_H) { par{ C111_L = R106;   C111_H = C101_H; } }
            else               { par{ C111_L = C101_H; C111_H = R106; } }
            par{
                R111 = R101;
                R112 = R102;
                R113 = R103;
                R114 = R104;
                R115 = R105;
                R116 = C101_L;
                R117 = R107;
            }
        }//par
        //level 12
        par{
            if (R117 < C111_H) { par{ out8 = R117;   out9 = C111_H; } }
            else               { par{ out8 = C111_H; out9 = R117; } }
            par{
                out1 = R111;
                out2 = R112;
                out3 = R113;
                out4 = R114;
                out5 = R115;
                out6 = R116;
                out7 = C111_L;
                *Pixel = R111;
            }
        }//end par
    }//par
}//end of Median filtering
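As a software reference against which a hardware median filter can be checked, the sketch below computes the median of nine pixels with compare-exchange elements in plain C. It deliberately uses a different and simpler network than the listing above, an odd-even transposition network, which sorts n values in n rounds of adjacent compare-exchanges. The function names `cmp_swap` and `median9` are assumptions for illustration.

```c
#include <stdint.h>

/* Compare-exchange: leaves the smaller value in *lo, the larger in *hi. */
void cmp_swap(uint8_t *lo, uint8_t *hi)
{
    if (*lo > *hi) {
        uint8_t t = *lo;
        *lo = *hi;
        *hi = t;
    }
}

/* Median of nine pixels. Nine rounds of odd-even transposition fully
   sort the copy, after which the middle element is the median. */
uint8_t median9(const uint8_t in[9])
{
    uint8_t p[9];
    for (int i = 0; i < 9; i++)
        p[i] = in[i];
    for (int round = 0; round < 9; round++)
        for (int i = round & 1; i + 1 < 9; i += 2)
            cmp_swap(&p[i], &p[i + 1]);
    return p[4];
}
```

Each round contains only independent compare-exchanges, so in hardware every round could be evaluated in parallel, in the same spirit as the `par` blocks of the listing.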
/***********************************************************************
Function: processWindowConvolveKCM_3x3

          | 21 31 21 |
(1/256) * | 31 48 31 |
          | 21 31 21 |

Convolution is done by 8-bit look-up-table-based multiplication. The multiplication
tables for the above convolution mask are stored in ROMs. To allow the same location
to be accessed in parallel, the multiplication table for each mask value is stored in
two different ROMs. The pixels of a 3x3 window are accessed at the same time. The first
output is produced after a latency of 7 cycles.
***********************************************************************/
//ROM Declaration 
rom unsigned int 16
MUL0[16]={0,21,42,63,84,105,126,147,168,189,210,231,252,273,294,315}; 
rom unsigned int 16
MUL1[16]={0,31,62,93,124,155,186,217,248,279,310,341,372,403,434,465};
rom unsigned int 16
MUL2[16]={0,21,42,63,84,105,126,147,168,189,210,231,252,273,294,315}; 
rom unsigned int 16
MUL3[16]={0,31,62,93,124,155,186,217,248,279,310,341,372,403,434,465}; 
rom unsigned int 16
MUL4[16]={0,48,96,144,192,240,288,336,384,432,480,528,576,624,672,720}; 
rom unsigned int 16
MUL5[16]={0,31,62,93,124,155,186,217,248,279,310,341,372,403,434,465}; 
rom unsigned int 16
MUL6[16]={0,21,42,63,84,105,126,147,168,189,210,231,252,273,294,315}; 
rom unsigned int 16
MUL7[16]={0,31,62,93,124,155,186,217,248,279,310,341,372,403,434,465}; 
rom unsigned int 16
MUL8[16]={0,21,42,63,84,105,126,147,168,189,210,231,252,273,294,315}; 
rom unsigned int 16
MUL10[16]={0,21,42,63,84,105,126,147,168,189,210,231,252,273,294,315}; 
rom unsigned int 16
MUL11[16]={0,31,62,93,124,155,186,217,248,279,310,341,372,403,434,465}; 
rom unsigned int 16
MUL12[16]={0,21,42,63,84,105,126,147,168,189,210,231,252,273,294,315}; 
rom unsigned int 16
MUL13[16]={0,31,62,93,124,155,186,217,248,279,310,341,372,403,434,465}; 
rom unsigned int 16
MUL14[16]={0,48,96,144,192,240,288,336,384,432,480,528,576,624,672,720}; 
rom unsigned int 16
MUL15[16]={0,31,62,93,124,155,186,217,248,279,310,341,372,403,434,465}; 
rom unsigned int 16
MUL16[16]={0,21,42,63,84,105,126,147,168,189,210,231,252,273,294,315}; 
rom unsigned int 16
MUL17[16]={0,31,62,93,124,155,186,217,248,279,310,341,372,403,434,465}; 
rom unsigned int 16
MUL18[16]={0,21,42,63,84,105,126,147,168,189,210,231,252,273,294,315};
//Function ProcessWindow
void ProcessWindow(WINDOW_PIXEL_8 *W, unsigned 8 *Pixel)
{
    //registers to store the values
    unsigned 16 r01,r11,r21,r31,r41,r51,r61,r71,r81;
    unsigned 16 r02,r12,r22,r32,r42,r52,r62,r72,r82;
    unsigned 16 c01,c11,c21,c31,c41,c51,c61,c71,c81;
    unsigned 16 c02,c12,c22,c32,c42,c52,c62,c72,c82;
    unsigned 16 t0,t1,t2,t3,t4,t5,t6,t7,t8;
    unsigned 16 s1,s2,s3;
    unsigned 16 m1,t;
    par{
        //1st cycle
        //high order four bits of input to access the ROM
        r01 = MUL0[(*W).YC[0][0]\\4];
        r11 = MUL1[(*W).YC[0][1]\\4];
        r21 = MUL2[(*W).YC[0][2]\\4];
        r31 = MUL3[(*W).YC[1][0]\\4];
        r41 = MUL4[(*W).YC[1][1]\\4];
        r51 = MUL5[(*W).YC[1][2]\\4];
        r61 = MUL6[(*W).YC[2][0]\\4];
        r71 = MUL7[(*W).YC[2][1]\\4];
        r81 = MUL8[(*W).YC[2][2]\\4];
        //low order four bits of input to access the ROM
        r02 = MUL10[(*W).YC[0][0]<-4];
        r12 = MUL11[(*W).YC[0][1]<-4];
        r22 = MUL12[(*W).YC[0][2]<-4];
        r32 = MUL13[(*W).YC[1][0]<-4];
        r42 = MUL14[(*W).YC[1][1]<-4];
        r52 = MUL15[(*W).YC[1][2]<-4];
        r62 = MUL16[(*W).YC[2][0]<-4];
        r72 = MUL17[(*W).YC[2][1]<-4];
        r82 = MUL18[(*W).YC[2][2]<-4];
        //2nd cycle
        //get first 12 bits of the high-nibble product and shift them left by four
        c01 = r01[11:0] @ (unsigned 4)0;
        c11 = r11[11:0] @ (unsigned 4)0;
        c21 = r21[11:0] @ (unsigned 4)0;
        c31 = r31[11:0] @ (unsigned 4)0;
        c41 = r41[11:0] @ (unsigned 4)0;
        c51 = r51[11:0] @ (unsigned 4)0;
        c61 = r61[11:0] @ (unsigned 4)0;
        c71 = r71[11:0] @ (unsigned 4)0;
        c81 = r81[11:0] @ (unsigned 4)0;
        //get first 12 bits of the low-nibble product
        c02 = 0 @ r02[11:0];
        c12 = 0 @ r12[11:0];
        c22 = 0 @ r22[11:0];
        c32 = 0 @ r32[11:0];
        c42 = 0 @ r42[11:0];
        c52 = 0 @ r52[11:0];
        c62 = 0 @ r62[11:0];
        c72 = 0 @ r72[11:0];
        c82 = 0 @ r82[11:0];
        //3rd cycle
        //combine the 12 bits to produce 16 bit number
        t0 = c01 + c02;
        t1 = c11 + c12;
        t2 = c21 + c22;
        t3 = c31 + c32;
        t4 = c41 + c42;
        t5 = c51 + c52;
        t6 = c61 + c62;
        t7 = c71 + c72;
        t8 = c81 + c82;
        //4th cycle
        s1 = t0 + t1 + t2;
        s2 = t3 + t4 + t5;
        s3 = t6 + t7 + t8;
        //5th cycle
        m1 = s1 + s2 + s3;
        //6th cycle
        t = (m1) >> 8;
        *Pixel = t<-8;
    }//par
    //run the sobel edge detection on each of the three components 
    //Convolution((*Window).YC, *Pixel);
}
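The look-up-table multiplication described above can be checked in software. The following plain C sketch (not Handel-C) builds one 16-entry table per constant, multiplies an 8-bit pixel by looking up its high and low nibbles separately, and recombines the two partial products with a shift by four. The function names `kcm_build`, `kcm_mul` and `kcm_convolve3x3` are assumptions for illustration, and the sketch builds tables on the fly instead of keeping two ROM copies per coefficient.

```c
#include <stdint.h>

/* Fill a 16-entry table with k*0 .. k*15, the role played by one ROM. */
void kcm_build(uint16_t lut[16], uint16_t k)
{
    for (unsigned i = 0; i < 16; i++)
        lut[i] = (uint16_t)(k * i);
}

/* Constant multiply: k*px = (k*high_nibble << 4) + k*low_nibble. */
uint16_t kcm_mul(const uint16_t lut[16], uint8_t px)
{
    uint16_t hi = lut[px >> 4];
    uint16_t lo = lut[px & 0x0F];
    return (uint16_t)((hi << 4) + lo);
}

/* 3x3 convolution with the mask {21,31,21; 31,48,31; 21,31,21} / 256. */
uint8_t kcm_convolve3x3(uint8_t w[3][3])
{
    static const uint16_t mask[3][3] = {
        {21, 31, 21}, {31, 48, 31}, {21, 31, 21}
    };
    uint32_t sum = 0;
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++) {
            uint16_t lut[16];
            kcm_build(lut, mask[r][c]);
            sum += kcm_mul(lut, w[r][c]);
        }
    return (uint8_t)(sum >> 8);   /* divide by 256 */
}
```

Because the mask entries sum to 256, a uniform window passes through unchanged, which gives a quick sanity check of the tables.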
/**********************************************************************
Function : Gaussian smoothing with sigma = 1.4
ProcessWindowConvolution5x5

          | 2  4  5  4  2 |
          | 4  9 12  9  4 |
(1/115) * | 5 12 15 12  5 |
          | 4  9 12  9  4 |
          | 2  4  5  4  2 |

This function processes the 5x5 Gaussian convolution. Because the design is pipelined,
the 5x5 window pixels are accessed at the same time. A single output is produced on every
clock cycle. The first output is produced after a latency of 7 clock cycles.
**********************************************************************/
void ProcessWindowConvolution5x5(WINDOW_PIXEL_8 *W, unsigned 8 *Pixel)
{
    //16 bit registers
    unsigned 16 t1,t2,t,m0,m1,m2,m3,m4,n0,n1,p0,p1,p2,p3,q2,q3;
    unsigned 16 m,n,p,q,temp;
    macro expr ext(n) = ((unsigned 8)0) @ n;
    unsigned 16 m11,m12,p11,p12,q11;
    par{
        //1st clock cycle
        //pixels under the mask value 5 (5 = 4 + 1)
        m0 = ext((*W).YC[0][2]) + ext((*W).YC[2][0]);
        m1 = ext((*W).YC[2][4]) + ext((*W).YC[4][2]);
        //pixels under the mask value 9 (9 = 8 + 1)
        m2 = ext((*W).YC[1][1]) + ext((*W).YC[1][3]);
        m3 = ext((*W).YC[3][1]) + ext((*W).YC[3][3]);
        //centre pixel, mask value 15 (15 = 8 + 4 + 2 + 1)
        m4 = ext((*W).YC[2][2]);
        //pixels under the mask value 2, plus the centre
        n0 = ext((*W).YC[0][0]) + ext((*W).YC[0][4]);
        n1 = ext((*W).YC[4][0]) + ext((*W).YC[4][4]) + ext((*W).YC[2][2]);
        //pixels under the mask value 4
        p0 = ext((*W).YC[0][1]) + ext((*W).YC[0][3]);
        p1 = ext((*W).YC[1][0]) + ext((*W).YC[1][4]);
        p2 = ext((*W).YC[3][0]) + ext((*W).YC[3][4]);
        p3 = ext((*W).YC[4][1]) + ext((*W).YC[4][3]);
        //pixels under the mask value 12 (12 = 8 + 4), plus the centre
        q2 = ext((*W).YC[1][2]) + ext((*W).YC[2][1]);
        q3 = ext((*W).YC[2][3]) + ext((*W).YC[3][2]) + ext((*W).YC[2][2]);
        //2nd clock cycle
        m11 = m0 + m1;
        m12 = m2 + m3;
        p11 = p0 + p1;
        p12 = p2 + p3;
        q11 = q2 + q3;
        //3rd clock cycle
        m = (m11 + m12 + m4);
        n = (n0 + n1) << 1;
        p = (p11 + m11 + p12 + q11) << 2;
        q = (m12 + q11) << 3;
        //4th clock cycle
        t1 = (m + n);
        t2 = (p + q);
        t = (t1 + t2) / 115;
        //5th clock cycle
        temp = (t > 255) ? 255 : t;
        //6th clock cycle
        *Pixel = (unsigned)((temp)<-8);
    }
}
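The shift-and-add evaluation of the 5x5 mask can be verified against a direct weighted sum. The sketch below, in plain C rather than Handel-C, groups the pixels by mask value and weights the groups by 1, 2, 4 and 8 using 5 = 4+1, 9 = 8+1, 12 = 8+4 and 15 = 8+4+2+1. The function names are assumptions for illustration. The divisor 115 is taken from the listing; the mask entries themselves sum to 159, so the quotient can exceed 255, which is why the result is clamped.

```c
#include <stdint.h>

static const uint8_t GAUSS[5][5] = {
    {2,  4,  5,  4, 2},
    {4,  9, 12,  9, 4},
    {5, 12, 15, 12, 5},
    {4,  9, 12,  9, 4},
    {2,  4,  5,  4, 2}
};

/* Reference: direct multiply-accumulate over the mask. */
uint32_t gauss_sum_direct(uint8_t w[5][5])
{
    uint32_t sum = 0;
    for (int r = 0; r < 5; r++)
        for (int c = 0; c < 5; c++)
            sum += (uint32_t)GAUSS[r][c] * w[r][c];
    return sum;
}

/* Same sum using only additions and shifts. */
uint32_t gauss_sum_shift_add(uint8_t w[5][5])
{
    uint32_t fives   = w[0][2] + w[2][0] + w[2][4] + w[4][2];
    uint32_t nines   = w[1][1] + w[1][3] + w[3][1] + w[3][3];
    uint32_t twelves = w[1][2] + w[2][1] + w[2][3] + w[3][2];
    uint32_t fours   = w[0][1] + w[0][3] + w[1][0] + w[1][4]
                     + w[3][0] + w[3][4] + w[4][1] + w[4][3];
    uint32_t twos    = w[0][0] + w[0][4] + w[4][0] + w[4][4];
    uint32_t centre  = w[2][2];

    uint32_t x1 = fives + nines + centre;            /* weight 1 */
    uint32_t x2 = twos + centre;                     /* weight 2 */
    uint32_t x4 = fours + fives + twelves + centre;  /* weight 4 */
    uint32_t x8 = nines + twelves + centre;          /* weight 8 */
    return x1 + (x2 << 1) + (x4 << 2) + (x8 << 3);
}

/* Normalise with the listing's divisor and clamp to 8 bits. */
uint8_t gauss_pixel(uint8_t w[5][5])
{
    uint32_t t = gauss_sum_shift_add(w) / 115u;
    return (uint8_t)(t > 255u ? 255u : t);
}
```

Comparing `gauss_sum_direct` with `gauss_sum_shift_add` on arbitrary windows is a quick way to catch an index slip in the grouping.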
/**********************************************************************
Function : ProcessWindowGradient
Calculates the gradient using the following convolutions:
[-1 0 1] for the horizontal gradient and its transpose for the vertical gradient.
3x3 pixels are given as input to this function, and a 9-bit horizontal and a 9-bit
vertical gradient are combined to produce an 18-bit pixel on every clock cycle.
**********************************************************************/
void ProcessWindowGradient(WINDOW_SM *W, int 18 *Pixel_dx_dy, unsigned 8 *count_row,
                           unsigned 8 *count_col)
{
    unsigned 8 YC;
    macro expr ext(n) = ((unsigned 1)0) @ n;
    int dy,dx;
    //the gradient is forced to zero at the image borders
    if ((*count_row < 4 || *count_row == 255) || (*count_col < 2 || *count_col == 255))
    {
        par{
            dy = 0;
            dx = 0;
        }
    }
    else
    {
        par
        {
            dy = (int)ext((*W).YC[0][1]) - (int)ext((*W).YC[2][1]);
            dx = (int)ext((*W).YC[1][0]) - (int)ext((*W).YC[1][2]);
        }
    }
    *Pixel_dx_dy = 0@dx@dy;
}
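The packing of the two 9-bit gradients into one 18-bit word can be illustrated in plain C (not Handel-C). In this sketch dx occupies the upper nine bits and dy the lower nine, matching the order of the concatenation in the listing; the function names `grad_dx`, `grad_dy`, `pack_dx_dy` and `unpack9` are assumptions for illustration.

```c
#include <stdint.h>

/* Differences of 8-bit pixels lie in -255..255 and fit in 9 signed bits. */
int grad_dx(uint8_t w[3][3]) { return (int)w[1][0] - (int)w[1][2]; }
int grad_dy(uint8_t w[3][3]) { return (int)w[0][1] - (int)w[2][1]; }

/* Pack dx into bits 17..9 and dy into bits 8..0 (two's complement). */
uint32_t pack_dx_dy(int dx, int dy)
{
    return (((uint32_t)dx & 0x1FFu) << 9) | ((uint32_t)dy & 0x1FFu);
}

/* Recover a signed value from a 9-bit two's complement field. */
int unpack9(uint32_t field)
{
    field &= 0x1FFu;
    return (field & 0x100u) ? (int)field - 512 : (int)field;
}
```

A later stage can recover dx with `unpack9(word >> 9)` and dy with `unpack9(word)`, which corresponds to dropping or taking the nine least significant bits.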
/**********************************************************************
Function : ProcessWindowNonMax
Performs directional non-maximum suppression of the gradient based on the sign and
magnitude of the gradient. 18-bit 3x3 window pixels are given as input, and an 8-bit
output is produced after an initial latency of 4 cycles.
**********************************************************************/
#define HIGH_THRESHOLD 120
#define LOW_THRESHOLD 40
void ProcessWindowNonMax(WINDOW_DX_DY *Wdxdy, unsigned 8 *Pixel)
{
    int dx,dy,mag;
    //absolute value
    macro expr abs(a) = (a<0 ? -a : a);
    //absolute value, tested on the sign bit
    macro expr abs1(b) = (b[width(b)-1] ? -b : b);
    //macro expr ext(n) = ((unsigned 8)0) @ n;
    int P1,P2,P3,P4,Pa,Pb;
    par{
        //1st cycle
        //drop first nine bits
        dx = (int)((*Wdxdy).YC[1][1]\\9);
        //take first nine bits
        dy = (int)((*Wdxdy).YC[1][1]<-9);
        mag = (int)(abs1((*Wdxdy).YC[1][1]\\9) + abs1((*Wdxdy).YC[1][1]<-9));
        if (abs(dx) > abs(dy))
        {
            //horizontal direction
            if (dy == 0)
            {
                par{
                    P1 = (int)(abs1((*Wdxdy).YC[1][0]\\9) + abs1((*Wdxdy).YC[1][0]<-9));
                    P2 = (int)(abs1((*Wdxdy).YC[1][0]\\9) + abs1((*Wdxdy).YC[1][0]<-9));
                    P3 = (int)(abs1((*Wdxdy).YC[1][2]\\9) + abs1((*Wdxdy).YC[1][2]<-9));
                    P4 = (int)(abs1((*Wdxdy).YC[1][2]\\9) + abs1((*Wdxdy).YC[1][2]<-9));
                }
            }
            else // between 0 & 45 degrees
            if ((dx > 0 && dy > 0) || (dx < 0 && dy < 0))
            {
                par{
                    P1 = (int)(abs1((*Wdxdy).YC[0][2]\\9) + abs1((*Wdxdy).YC[0][2]<-9));
                    P2 = (int)(abs1((*Wdxdy).YC[1][2]\\9) + abs1((*Wdxdy).YC[1][2]<-9));
                    P3 = (int)(abs1((*Wdxdy).YC[2][0]\\9) + abs1((*Wdxdy).YC[2][0]<-9));
                    P4 = (int)(abs1((*Wdxdy).YC[1][0]\\9) + abs1((*Wdxdy).YC[1][0]<-9));
                }
            }
            else // between 0 & -45 degrees
            if ((dx > 0 && dy < 0) || (dx < 0 && dy > 0))
            {
                par{
                    P1 = (int)(abs1((*Wdxdy).YC[2][2]\\9) + abs1((*Wdxdy).YC[2][2]<-9));
                    P2 = (int)(abs1((*Wdxdy).YC[1][2]\\9) + abs1((*Wdxdy).YC[1][2]<-9));
                    P3 = (int)(abs1((*Wdxdy).YC[0][0]\\9) + abs1((*Wdxdy).YC[0][0]<-9));
                    P4 = (int)(abs1((*Wdxdy).YC[1][0]\\9) + abs1((*Wdxdy).YC[1][0]<-9));
                }
            }
        }
        else
        if (abs(dy) > abs(dx))
        {
            // vertical direction
            if (dx == 0)
            {
                par{
                    P1 = (int)(abs1((*Wdxdy).YC[0][1]\\9) + abs1((*Wdxdy).YC[0][1]<-9));
                    P2 = (int)(abs1((*Wdxdy).YC[0][1]\\9) + abs1((*Wdxdy).YC[0][1]<-9));
                    P3 = (int)(abs1((*Wdxdy).YC[2][1]\\9) + abs1((*Wdxdy).YC[2][1]<-9));
                    P4 = (int)(abs1((*Wdxdy).YC[2][1]\\9) + abs1((*Wdxdy).YC[2][1]<-9));
                }
            }
            else // between 45 & 90 or 225 & 270
            if ((dx > 0 && dy > 0) || (dx < 0 && dy < 0))
            {
                par{
                    P1 = (int)(abs1((*Wdxdy).YC[0][1]\\9) + abs1((*Wdxdy).YC[0][1]<-9));
                    P2 = (int)(abs1((*Wdxdy).YC[0][2]\\9) + abs1((*Wdxdy).YC[0][2]<-9));
                    P3 = (int)(abs1((*Wdxdy).YC[2][1]\\9) + abs1((*Wdxdy).YC[2][1]<-9));
                    P4 = (int)(abs1((*Wdxdy).YC[2][0]\\9) + abs1((*Wdxdy).YC[2][0]<-9));
                }
            }
            else // between 90 & 135 or 270 & 315
            if ((dx < 0 && dy > 0) || (dx > 0 && dy < 0))
            {
                par{
                    P1 = (int)(abs1((*Wdxdy).YC[0][0]\\9) + abs1((*Wdxdy).YC[0][0]<-9));
                    P2 = (int)(abs1((*Wdxdy).YC[0][1]\\9) + abs1((*Wdxdy).YC[0][1]<-9));
                    P3 = (int)(abs1((*Wdxdy).YC[2][2]\\9) + abs1((*Wdxdy).YC[2][2]<-9));
                    P4 = (int)(abs1((*Wdxdy).YC[2][1]\\9) + abs1((*Wdxdy).YC[2][1]<-9));
                }
            }
        }
        else // 45 & 135
        if (abs(dx) == abs(dy))
        {
            if ((dx > 0 && dy > 0) || (dx < 0 && dy < 0))
            {
                par{
                    P1 = (int)(abs1((*Wdxdy).YC[0][2]\\9) + abs1((*Wdxdy).YC[0][2]<-9));
                    P2 = (int)(abs1((*Wdxdy).YC[0][2]\\9) + abs1((*Wdxdy).YC[0][2]<-9));
                    P3 = (int)(abs1((*Wdxdy).YC[2][0]\\9) + abs1((*Wdxdy).YC[2][0]<-9));
                    P4 = (int)(abs1((*Wdxdy).YC[2][0]\\9) + abs1((*Wdxdy).YC[2][0]<-9));
                }
            }
            else
            if ((dx < 0 && dy > 0) || (dx > 0 && dy < 0))
            {
                par{
                    P1 = (int)(abs1((*Wdxdy).YC[0][0]\\9) + abs1((*Wdxdy).YC[0][0]<-9));
                    P2 = (int)(abs1((*Wdxdy).YC[0][0]\\9) + abs1((*Wdxdy).YC[0][0]<-9));
                    P3 = (int)(abs1((*Wdxdy).YC[2][2]\\9) + abs1((*Wdxdy).YC[2][2]<-9));
                    P4 = (int)(abs1((*Wdxdy).YC[2][2]\\9) + abs1((*Wdxdy).YC[2][2]<-9));
                }
            }
            else
            if (dy == 0)
            {
                par{
                    P1 = (int)(abs1((*Wdxdy).YC[1][0]\\9) + abs1((*Wdxdy).YC[1][0]<-9));
                    P2 = (int)(abs1((*Wdxdy).YC[1][0]\\9) + abs1((*Wdxdy).YC[1][0]<-9));
                    P3 = (int)(abs1((*Wdxdy).YC[1][2]\\9) + abs1((*Wdxdy).YC[1][2]<-9));
                    P4 = (int)(abs1((*Wdxdy).YC[1][2]\\9) + abs1((*Wdxdy).YC[1][2]<-9));
                }
            }
        }
        //2nd cycle
        Pa = (P1 + P2) >> 1;
        Pb = (P3 + P4) >> 1;
        //3rd cycle
        if (mag > Pa && mag > Pb)
        {
            if (mag > HIGH_THRESHOLD)
                *Pixel = EDGE;       // definite edge
            else
            if (mag < LOW_THRESHOLD)
                *Pixel = NOEDGE;     // no edge
            else
                *Pixel = MAYBE_EDGE; // may be an edge
        }
        else
            *Pixel = 0;
    }//end par
}//end of NonMax
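The final classification step of the non-maximum suppression can be expressed compactly in software. The following plain C sketch (not Handel-C) takes the magnitude of the centre pixel and the four neighbour magnitudes already selected for the gradient direction, averages them in pairs, and applies the two thresholds. The threshold and label values are those of the listings; the function names `grad_mag` and `nonmax_classify` are assumptions for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

enum {
    EDGE = 255, MAYBE_EDGE = 200, NOEDGE = 0,
    HIGH_THRESHOLD = 120, LOW_THRESHOLD = 40
};

/* Gradient magnitude approximated as |dx| + |dy|. */
int grad_mag(int dx, int dy)
{
    return abs(dx) + abs(dy);
}

/* Keep the centre only if it exceeds both averaged neighbour magnitudes,
   then label it by comparing against the two thresholds. */
uint8_t nonmax_classify(int mag, int p1, int p2, int p3, int p4)
{
    int pa = (p1 + p2) >> 1;
    int pb = (p3 + p4) >> 1;
    if (mag > pa && mag > pb) {
        if (mag > HIGH_THRESHOLD)
            return EDGE;
        if (mag < LOW_THRESHOLD)
            return NOEDGE;
        return MAYBE_EDGE;
    }
    return NOEDGE;   /* not a local maximum: suppressed */
}
```

For example, a centre magnitude of 80 with neighbours 10, 20, 30 and 40 is a local maximum between the two thresholds and is therefore labelled `MAYBE_EDGE`.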
/**********************************************************************
Function : ProcessWindowHystersis
Description:
Reads input 3 by 3 windows from *WindowIn and performs hysteresis with two
thresholds, Th and Tl (Th > Tl). If the centre pixel is a possible edge and any of the
neighbouring pixels is a definite edge, then that pixel is considered an edge. The first
output down the channel *OutChan will not occur until the pipeline has been primed with
valid data.
**********************************************************************/
#define EDGE 255
#define MAYBE_EDGE 200
#define NOEDGE 0
void ProcessWindowHystersis(WINDOW_PIXEL_8 *W, unsigned 8 *Pixel)
{
    // check whether the centre pixel is MAYBE_EDGE (i.e., its value lies between
    // the high threshold and the low threshold)
    if ((*W).YC[1][1] == MAYBE_EDGE)
    {
        if ((*W).YC[0][0] == EDGE || (*W).YC[0][1] == EDGE || (*W).YC[0][2] == EDGE ||
            (*W).YC[1][0] == EDGE || (*W).YC[1][2] == EDGE || (*W).YC[2][0] == EDGE ||
            (*W).YC[2][1] == EDGE || (*W).YC[2][2] == EDGE)
        {
            *Pixel = EDGE; // centre pixel is an edge
        }
        else
            delay;
    }
    else
        delay;
}
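The hysteresis rule described above can be sketched in plain C (not Handel-C) as a function of one 3x3 window of labels. The listing leaves the output untouched when the centre is not a possible edge or has no definite-edge neighbour; this sketch instead makes two explicit assumptions for illustration: definite labels pass through unchanged, and an unsupported possible edge becomes `NOEDGE`. The function name `hysteresis3x3` is likewise hypothetical.

```c
#include <stdint.h>

enum { EDGE = 255, MAYBE_EDGE = 200, NOEDGE = 0 };

/* Promote a possible edge to a definite edge when any of its eight
   neighbours is a definite edge. */
uint8_t hysteresis3x3(uint8_t w[3][3])
{
    uint8_t centre = w[1][1];
    if (centre != MAYBE_EDGE)
        return centre;            /* assumption: pass definite labels through */
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++)
            if (!(r == 1 && c == 1) && w[r][c] == EDGE)
                return EDGE;
    return NOEDGE;                /* assumption: drop unsupported candidates */
}
```

Applying this function to every window of the non-maximum-suppressed image gives one pass of edge linking.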
VITA
Graduate College 
University of Nevada, Las Vegas
Venkateshwar Rao Daggu
Local Address:
1165, Maryland Circle #1 
Las Vegas, NV 89119.
Degree:
Bachelor of Technology
Computer Science and Engineering, 2000.
Kakatiya University, India
Thesis Title:
Efficient Design and Implementation of Image Processing Algorithms on 
Reconfigurable Hardware using Handel-C
Thesis Examination Committee:
Chairperson, Dr. Venkatesan Muthukumar, Ph. D.
Committee Member, Dr. Henry Selvaraj, Ph. D.
Committee Member, Dr. Emma Regentova, Ph. D.
Graduate Faculty Representative, Dr. Evangelos Yfantis, Ph. D.
