A special approach to the Hough transform by Parrella, Vincent
Lehigh University
Lehigh Preserve
Theses and Dissertations
2004
A special approach to the Hough transform
Vincent Parrella
Lehigh University
This Thesis is brought to you for free and open access by Lehigh Preserve. It has been accepted for inclusion in Theses and Dissertations by an
authorized administrator of Lehigh Preserve. For more information, please contact preserve@lehigh.edu.
Recommended Citation
Parrella, Vincent, "A special approach to the Hough transform" (2004). Theses and Dissertations. Paper 854.
Parrella, Vincent. A Special Approach to the Hough Transform. May 2004
A SPECIAL APPROACH TO THE HOUGH
TRANSFORM
by
Vincent Parrella
A Thesis
Presented to the Graduate and Research Committee
of Lehigh University
in Candidacy for the Degree of
Master of Science
in
Electrical Engineering
Lehigh University
April 2004

Acknowledgements
Jason Schlessman: For having unending patience with my questions and for co-
advising me throughout this research. Without his help and inspiration, this design
and paper would not have been possible.
Meghanad Wagh: For providing me with the knowledge to see this research from
start to finish and giving me the techniques to be a good engineer.
Prakash Krishnamoorthy: For teaching me about the Verilog HDL and ModelSim.
Ann Zawartkay: For having confidence in my ability even when I did not.
Betty Ann and Vincent Parrella: For their love and support of my academic pursuits
since kindergarten.
Table of Contents
Acknowledgements
Table of Contents
List of Tables
List of Figures
Abstract
1 Introduction
1.1 Technological trends
1.2 Importance of pattern recognition
1.3 Methods of pattern recognition
1.4 Organization of thesis
2 Previous Research
2.1 Cartesian representation
2.2 Complexity of the Hough Transform
2.3 Parallelism
2.4 Non-standard architectures
2.5 FPGA approach
2.6 Design tradeoffs
3 Proposed approach
3.1 Polar representation
3.2 Pitfalls of a Hough Transform system
3.3 Pipelining
3.4 Hough Transform
3.5 Proving the parameterization
3.6 Manipulating the expression
3.7 Scanning the image
3.8 Lookup tables
3.9 Calculation circuit
3.10 Memory processing and storage
3.11 Interaction of the Computational and Memory Update Units
4 Performance Evaluation
4.1 Analysis of Results
4.2 Error Analysis
5 Conclusions and Future Research
5.1 Conclusions
5.2 Future Research
Bibliography
List of Tables
4.1 ASIC and FPGA Area and Delay Estimates
List of Figures
2.1 Example of a straight line in Cartesian coordinates. Fundamental equation shown.
3.1 Example of a straight line in polar notation.
3.2 Method for proving parametrization of a line.
3.3 Lookup table and complementary circuitry
3.4 Calculation circuit that computes the radius value.
3.5 Block diagram of the adder circuit.
3.6 Storage and processing circuit
3.7 Control for storage and processing circuit.
3.8 Timing diagram for storage and processing circuit
3.9 Flowchart of the Hough system.
Abstract
The Hough transform is an exhaustive method for detecting patterns within images.
Its benefits include resistance to noise and adaptability to finding different patterns.
The major drawback of this method is its computational complexity, which is O(N^3) for an
N x N image. This makes the transform unsuitable for a Von Neumann architecture.
This thesis presents a special purpose architecture dedicated to performing the
Hough transform. Our method eliminates multipliers and uses parallelism and
pipelining to enhance time performance. Further, the calculated value is avail-
able at the same time as the image point is read. Thus it is unlike earlier methods
which did the operations of reading the image, computation and storage in serial
fashion. The hardware implementation of the circuit is given along with area and
delay estimates for its implementations on both ASIC and FPGA technologies.
Chapter 1
Introduction
1.1 Technological trends
The trend in current technological development is to make devices faster and smaller.
New approaches to old problems are constantly being investigated and applied in
order to achieve this goal. The focus tends to be on consumer products because
consumer demand for products that make life more convenient is high. However,
it is also necessary to improve technology for non-consumer applications. For
example, industrial automation is used every day to produce items such as cars, cellular
phones and printed circuit boards. If these products were assembled manually,
production capability would be substantially diminished. In the case of printed
circuit boards, components are too small to assemble manually. The inefficiencies
and inconsistencies of humans have been overcome by technological innovations.
Surgeries are safer because of the precision of minimally invasive procedures
that make use of laser scalpels and digital imagery. This is not to say that humans
should be replaced by machines; rather, human capabilities can be enhanced through
the creation and use of technology.
1.2 Importance of pattern recognition
Pattern recognition is a vital part of many industrial applications. Industrial au-
tomation relies heavily on robotics. Robotic devices need to know where to find a
part and then where to place it in the case of assembly. Further down an assembly
line another robotic device may fasten the part into place. This once again
requires knowledge of its local environment. Robots know where to place items and
how to assemble parts through pattern recognition. The pattern sought out may
be a circular screw hole, and once it is found, a screw can be placed in it. Another
important pattern in the application may be the Phillips markings of the screw, and
an automatic screwdriver may be aligned with them to tighten it.
Character recognition also relies heavily on finding patterns. Portable devices
such as PDAs and tablet PCs have areas that will let the user write with an optical
pen. This handwriting is then converted to a document which is stored on the
portable device and possibly copied to a desktop. An image of what the user writes
can be translated into typewritten text. The portable device does this by comparing
what the user wrote with a database of various ways to create different letters and
then choosing a match to the character.
The printing of books, newspapers, or circuit boards is also improved greatly
by the use of pattern recognition. In the case of books and newspapers, pattern
recognition is used to accurately align the printed text with the edges of the paper.
In the case of circuit boards, which are "printed" by etching designs into copper and
removing the excess, pattern recognition once again ensures proper alignment and
positioning.
A straight line is an example of a simple pattern that can be used to represent
more complex entities. Four straight lines connected properly can form a square.
Straight lines can also be placed tangentially to a circle to circumscribe it. If the
straight lines are numerous, short and oriented properly, they could form a circle.
All of the aforementioned examples would benefit greatly from straight line detection.
Since straight lines exist so frequently, they can be thought of as building blocks for
more complex patterns.
1.3 Methods of pattern recognition
There are three main types of pattern recognition. The first is a form of artificial
intelligence that looks for lines by scanning "neighborhoods" of points. The algorithm
examines one point in an image and tries to extend it to see what line it may
lie on. It does this by looking at the surrounding points and seeing which ones are
defined. When it finds one, it begins looking at that point and continues until it
finds a point that is not filled in. The drawback to this approach is its sensitivity
to the inherent noise in an image. It may possibly miss some data or draw wrong
conclusions. Further, it is unable to deal with quantization of the image space if
the line is not at a multiple of 45 degrees. In this case a line may not be exactly
straight and the algorithm reports incorrect values.
A second method for detecting patterns in images is by bit matching. This
method actually compares two quantized images and looks for similarities in the
way they are represented in memory. This method is useful if two images are very
much alike and the noise in the image is low. However, it is not an all-purpose
method for detecting patterns, because differences in pattern size and quantization are
a major hindrance to its performance.
The third method for general pattern recognition is the Hough transform. It
allows an image to be transformed to a domain with axes indicating distance from
the origin to the location and length of straight lines. The Hough transform can be
applied to detect any parameterized curve, not just straight lines. The mathematics
of the Hough transform deals with a parameterization of the image from a normal
Cartesian representation to a domain that tells the location of all edges in the image.
Each point in the transform domain is actually a parameterization which indicates
how many times an associated pattern has been detected. The resulting data can
then be scanned and a higher number at a particular location in the transform
domain indicates the presence of a pattern with parameters corresponding to that
location. This method is a general form of pattern detection because it can be
applied to any image regardless of its size. It also is resistant to noise in the samples
and noise created by quantization of the image. This robustness comes from the
use of statistical measures that may be used to determine the maximum in the
transform domain. The one major drawback of the Hough transform is that it is
very computationally intensive. If both image and transform spaces are N x N, then
the Hough transform typically needs O(N^3) computations. This thesis is focused on
developing a hardware implementation of the Hough transform that would improve
the time complexity.
1.4 Organization of thesis
This research attempts to show a fast and low power method for performing the
Hough transform. Since the trend of electronics is decreasing size and increasing
performance, this paper will show a design that meets these criteria. The design uses
less hardware because it exploits some mathematical properties that exist naturally.
It also hinges on simple but repetitive operations which digital systems are so good
at performing. This system also benefits from a parallel and highly pipelined
design which will be explained in detail in the forthcoming chapters. Designing
a system with these added features creates the possibility of upgradeability
in the future without having to perform a redesign.
This thesis focuses on a hardware implementation of a system to perform the
Hough transform. This examination concludes with a suggestion for further research.
Chapter 2 presents the theory behind the Hough transform with an explanation of
straight line parameterization, pipelining and parallel approaches. Advantages and
disadvantages to the techniques presented are discussed along with previous research
on the subject. Chapter 3 focuses on the proposed Hough transform implementation
and describes in detail the hardware and software developed to accomplish this task.
Chapter 4 presents the timing estimates for a synthesized version of the hardware
described in Chapter 3. Chapter 5 concludes with a discussion of the advantages
and disadvantages of the proposed research. It also presents future research ideas
for the associated system developed in this thesis.
Chapter 2
Previous Research
2.1 Cartesian representation
The Hough transform is a method for extracting curves in an image. It is particularly
useful in noisy images because it deals with one pixel at a time and determines what
curves the point has the possibility of lying on. A straight line is an example of a
simple curve.
The focus of this paper is the extraction of straight lines from images. Straight
lines are characterized by the equation
$$ y = mx + b \qquad (2.1) $$
The Hough transform can be thought of as a parameterization in which the coordi-
nates of the image are switched from Cartesian representation to a slope intercept
representation. The Hough transform is a two dimensional representation containing
image information in an alternate, more useful form. In particular, it plots the
slope and intercept of straight lines in the image. Each point in the resulting plot
corresponds to one candidate straight line in the image and contains the
number of image points found on that line. Since a single image point may lie on
more than one line, it contributes to multiple points corresponding to different slope
and intercept combinations. Once the calculation process is complete, the resulting
transform is examined for maxima. The coordinates of those maxima are the
parameters of the longest lines in the image.
Figure 2.1: Example of a straight line in Cartesian coordinates. Fundamental equation shown.
This research deals with black and white images only. If grayscale images
need to be processed, each image may be discretized with a properly selected
threshold before it is applied to the circuit. For example, any pixel below a
value of 127 could be considered white while any pixel above 127 could be considered
black. Once the image is pre-processed in this manner, the system could proceed as
usual.
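The thresholding step described above can be sketched in a few lines of Python. This is only an illustration and not part of the thesis hardware: the function name `binarize` and the choice to treat a pixel exactly equal to the threshold as white are assumptions, since the text leaves that case open.

```python
def binarize(gray, threshold=127):
    """Map a grayscale image (rows of 0..255 integers) to black and white.

    Pixels above the threshold become 1 (black); all others become 0 (white).
    Treating a pixel equal to the threshold as white is an assumption.
    """
    return [[1 if value > threshold else 0 for value in row] for row in gray]


gray = [
    [0, 126, 127],
    [128, 200, 255],
]
binary = binarize(gray)
```

The resulting 1 bit per pixel image is the kind of input the rest of the system is described as operating on.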
2.2 Complexity of the Hough Transform
Even though the approach seems easy enough, its one major flaw is the time complexity
of the system. The amount of time any system would take to perform the
Hough transform increases dramatically as the image size increases. For every pixel
in the image, one needs to find all the Hough transform domain coordinates that
correspond to lines through that pixel. Assume an angular resolution of (180/M)
degrees. This implies O(M) computations per pixel. Thus for an N x N pixel
image, a (180/M) degree angular resolution Hough transform has a computational
complexity of O(N^2 * M).
In a very simple example, a 64 x 64 image would have a total of 4096 pixels.
Each of these pixels would need to be examined to see if it fits on one of many
possible lines. If a 1° angular resolution is chosen (M = 180), then one needs to do as many as
4096 x 180 = 737280 computations. Note that this many operations are required just to process
a relatively small image. The last step of line detection requires one to scan the
parameterized Hough space for the maximum, and has complexity O(M^2).
2.3 Parallelism
Since each pixel is processed independently by a Hough system, one method of
speeding it up is to have more hardware working on the image at the same time.
An image can be split in half with two dedicated circuits working on each half
concurrently. This method effectively cuts the time necessary for the entire process
in half, but increases the cost. Similarly, an image may be divided into fourths,
effectively quartering the computational time necessary. This approach has been
attempted on a much larger scale to decrease the time necessary to process an image.
Parallel Hough transform algorithms, when implemented efficiently, can offer many
advantages over Von Neumann architectures. Examples of parallel architectures and
algorithms can be found in [1], [2], [3], [4], and [5]. In particular, the architecture
described in [5] can accomplish memory writes and searches completely in parallel,
which requires much less time. [5] also uses a flag that denotes whether a point has
single or multiple hits, where a hit represents how many lines it lies on. This system,
however, also requires 4096 processor elements for a 64 x 64 image size.
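The reason an image can be split between two circuits is that each pixel votes independently, so the partial results simply add. The following Python sketch illustrates this in software only. It borrows the polar line equation r = x*cos(theta) + y*sin(theta) that is introduced in Chapter 3. The function name `hough_votes`, the choice of 32 angles and the small test image are assumptions made for the illustration, not details of the architectures cited above.

```python
import math
from collections import Counter


def hough_votes(pixels, num_angles=32):
    """Accumulate one vote per (radius, angle index) for each edge pixel."""
    acc = Counter()
    for x, y in pixels:
        for k in range(num_angles):
            theta = math.pi * k / num_angles
            r = round(x * math.cos(theta) + y * math.sin(theta))
            acc[(r, k)] += 1
    return acc


# A short diagonal line plus two stray points, given as (x, y) coordinates.
pixels = [(x, x) for x in range(8)] + [(3, 0), (5, 1)]

# Split the image into a top half and a bottom half, as two circuits would.
top = [p for p in pixels if p[1] < 4]
bottom = [p for p in pixels if p[1] >= 4]

merged = hough_votes(top) + hough_votes(bottom)
whole = hough_votes(pixels)
```

In this sketch `merged` and `whole` hold the same votes, which is the property that lets the work be divided. The cost, as the text notes, is the duplicated hardware and the step of combining the two accumulators.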
2.4 Non-standard architectures
Another approach to the Hough transform is to develop a special purpose architec-
ture. This hardware is specifically designed to carry out the Hough transform in
an efficient method and can also be optimized for this task. In [6] the divide and
conquer approach is used with a hyper pyramid multiprocessor system. While this
approach is well suited for this application, it suffers from the intercommunication
time of the processors and from the amount of hardware required. The other drawback
of this approach is that the processors need to be connected in a linked list
fashion. [7] also suffers from long interprocessor communication and from needing
32,000 physical processors. This system, however, does shut off inactive processors
to save power. Instead of being shut off when not active, processors can actually
be made to handle the remaining workload, as demonstrated in [8]. One processor
acts as the director and assigns work to the other processors in the system. Unfortunately,
not all the processors do the same amount of computation, and processor
communication time is a bottleneck. In [9] the previous drawbacks of long interprocessor
communication and unevenly distributed loads have been overcome. This
system makes use of the PVM language which enables a group of computers to be
interconnected in a way that will allow them to work cooperatively to accomplish
parallel computations. [10]
[11] takes a different approach to executing the Hough transform. In this implementation
a 200 MHz operating frequency is achieved without pipelining. The
method relies on a large number of delay elements (D flip-flops) through which different
points are filtered. Points that have heavier weights experience longer
delays. This system deals with 512 delay elements. This architecture unfortunately
cannot exploit any of the advantages brought by pipelining or parallelism. Real-time
systems have also been created to perform the Hough transform. [12] has the
ability to process an image in less than one second from input to straight line parameter
reproduction. It contains a slightly parallelized version of the normal algorithm
because the generation of the resulting curve and the histogram plot are done at the
same time. Even though this system exploits pipelining techniques, it is not highly
parallel in that two copies of the system could not work side by side on the same
image. Other examples of non-standard architectures can be found in [13], [14], and
[15].
2.5 FPGA approach
It seems that an FPGA (Field Programmable Gate Array) is an ideal choice for im-
plementing a special purpose digital system to compute the Hough transform. It has
the advantages of programmability, speed, ability to modify the design quickly, and
portability. [16] uses a Xilinx 4010 series board to implement the Hough transform.
This method works for both black and white and gray level images. The authors
have developed both pipelined and non-pipelined versions of the system.
2.6 Design tradeoffs
The Hough transform is certainly a problem that can be solved more quickly by
adding more hardware working concurrently. Unfortunately most designs available
to date simply trade hardware complexity for time complexity. This is a
trivial tradeoff because it does not affect the product of hardware and time complexities.
It only exploits the inherently parallel nature of the Hough transform.
Consequently, to cause any real impact on time complexity, one needs to go through
an exorbitant increase in expense, size and power. Designing a system in a non-
standard architecture or with a large amount of hardware also makes it tough to
port and optimize. A designer has to weigh the tradeoffs and decide what is necessary
for the project at hand. This research focuses on a low power, pipelined and
parallel architecture to perform the Hough Transform. The next chapter will discuss
the design approach that is used in this thesis.
Chapter 3
Proposed approach
3.1 Polar representation
The major drawback to using the slope intercept representation is dealing with
vertical lines. Since slope is defined as
$$ m = \Delta y / \Delta x \qquad (3.1) $$
a vertical line would result in an infinite slope. Infinity exceeds the bounds of this
system because it cannot be represented in finite hardware. This thesis therefore uses
a polar representation. The equation of a line in polar notation using parameters r
and θ is
$$ r = x\cos(\theta) + y\sin(\theta) \qquad (3.2) $$
Figure 3.1: Example of a straight line in polar notation.
In fig 3.1, r is the orthogonal distance of the line from the origin and θ is the angle
made by the radius with the horizontal axis. This form of the line equation allows
for vertical lines because any vertical line would have an orthogonal distance from
the origin and a θ of zero degrees.
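As a quick numerical check of this claim, the short Python sketch below evaluates the polar line equation for points on a vertical line. The function name `line_radius` and the sample line x = 5 are assumptions chosen for illustration.

```python
import math


def line_radius(x, y, theta_deg):
    """Evaluate r = x*cos(theta) + y*sin(theta) for one point and one angle."""
    theta = math.radians(theta_deg)
    return x * math.cos(theta) + y * math.sin(theta)


# Four points on the vertical line x = 5, where the slope is undefined.
vertical = [(5, y) for y in range(4)]
radii = [line_radius(x, y, 0.0) for x, y in vertical]
```

Every point yields the same finite radius at θ = 0, so the vertical line maps to the single parameter pair (r, θ) = (5, 0) with no special case.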
3.2 Pitfalls of a Hough Transform system
Even though the approach to computing the Hough transform seems easy enough,
its major flaw is the number of cycles a system would take to complete the operation.
The amount of time any system would take to perform the Hough transform increases
as the image size increases. In a very simple example, a 64 x 64 image would have
a total of 4096 pixels. Each of these pixels would need to be examined to see if it
fits on one of many possible lines. The length of a possible line increases with image
size. In an N x N image, the longest possible length of a line is N√2. Since the
Hough transform involves calculations based on the number of pixels on a line, the
actual size in bits of the system needs to be large enough to accurately represent the
largest possible line. It is for this reason that system size increases with image size.
The next consideration is how many lines one wishes to look for inside the image.
Looking back at the polar representation of figure 3.1, one has the possibility of
looking for an infinite number of lines, but unfortunately the system size once again
comes into play. Since the angles that the radius (r) can make with the horizontal axis are between
zero and 180 degrees, i.e., 0 ≤ θ ≤ 180, the designer has to decide what angular resolution
he/she needs for the given application. The angular resolution will be one of the
most important factors in determining how long the system takes to transform an
image. For a standard, non-pipelined and non-parallel approach, the system would
need at least (number of pixels) x (number of lines) cycles to perform the transform.
The system may actually need more cycles because of the reading of the image and
updating (read, modify, store) of the transform. In the case of an angular resolution of
1°, a point has the possibility of lying on 180 different lines. If the angular step were
made smaller, the size of the system memory could also become an issue along with the
number of cycles.
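The two sizing quantities discussed above can be worked out with a small Python sketch. It is a rough estimate under stated assumptions: the function name `system_size` is hypothetical, and `radius_bits` counts only the bits needed for the unsigned integer part of the longest line N√2, ignoring any sign or fractional bits a real design would add.

```python
import math


def system_size(n, angular_step_deg):
    """Estimate cycle count and radius width for an n x n image."""
    num_lines = round(180 / angular_step_deg)      # angles tested per pixel
    min_cycles = n * n * num_lines                 # (pixels) x (lines)
    longest_line = n * math.sqrt(2)                # diagonal of the image
    radius_bits = int(longest_line).bit_length()   # integer part only
    return min_cycles, num_lines, radius_bits


small = system_size(64, 1.0)
large = system_size(256, 5.625)
```

For the 64 x 64 example with a 1° step this gives 737280 cycles and 180 lines per pixel. The 256 x 256 case with a 5.625° step is included only to show how both the cycle count and the radius width grow with image size.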
3.3 Pipelining
After the proper assessments are made for memory of the system, the next issue
to consider is the timing. Systems react quickly but the delay time between stages
is always an issue that designers struggle with. Since the process requires reading
from an image, performing a calculation and then storing the result, the amount of
time each operation takes is critical. Each part must be precisely calculated for the
system to work properly. The reading of the image pixels is generally fast since the
pixels are only 1 bit wide. The arithmetic circuits also have high speed. However,
updating memory presents a real bottleneck. The timing of the memory stage is
crucial because if it is not designed properly, values will be coming in at a faster
rate than they can be modified and stored. This will result in an incomplete data
set which is not a desirable result. One way to handle a slower system element is
through implementing a pipeline. Pipelining allows for partial steps to be completed
during one clock cycle. Pipelining increases a system's throughput by performing
multiple phases of the task on different data points concurrently. In a non-pipelined
system, not all of the hardware is working at the same time. The system normally
has states that it goes through to complete a process. For example, define a system
Alpha with three stages: A, B, and C that reads in two numbers, multiplies them
and produces the result. Stage A would handle the input, stage B would perform the
multiplication, and stage C would produce the result to either a display or store
it to memory. The last consideration to place on the system is that it takes 10
nanoseconds or one clock period for each stage. At this rate, a result would be
produced 30 nanoseconds after it is inputted. The drawback is that once a stage
has passed, it is not used again until the next full cycle. Once numbers are inputted,
stage A waits 20 nanoseconds to work again. The same is true with the other stages.
For a high number of inputs this amounts to a considerable amount of time wasted.
A pipelined implementation would produce one result every clock cycle. Stage A
would take input every clock cycle, stage B would always be multiplying and stage
C would always be producing a result. Even though it still takes 30 nanoseconds
for the full operation, there would be no lull between inputs.
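The timing of the example system Alpha can be worked through numerically. The Python sketch below assumes, as in the text, three stages of 10 nanoseconds each. The function names are hypothetical, and the pipelined figure assumes a new input is available on every clock cycle with no stalls.

```python
STAGE_NS = 10     # one clock period per stage, as in the example
NUM_STAGES = 3    # stages A, B and C


def non_pipelined_ns(num_inputs):
    """Each input passes through all stages before the next one starts."""
    return num_inputs * NUM_STAGES * STAGE_NS


def pipelined_ns(num_inputs):
    """The first result takes all stages; each later result takes one more cycle."""
    if num_inputs == 0:
        return 0
    return (NUM_STAGES + num_inputs - 1) * STAGE_NS


single = (non_pipelined_ns(1), pipelined_ns(1))
many = (non_pipelined_ns(1000), pipelined_ns(1000))
```

A single input takes 30 nanoseconds either way, but for 1000 inputs the non-pipelined system needs 30000 nanoseconds while the pipelined one needs 10020, approaching one result per clock cycle.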
3.4 Hough Transform
As stated earlier, the Hough transform is a method for extracting curves from an
image. This method, though highly effective, is incredibly repetitive, and its resource
needs increase with image size. Since the proposed research focuses on straight lines,
figure 2.1 shows a simple example of a straight line in Cartesian coordinates. The
problem in computing the slope of lines occurs when a straight line is completely
vertical, thus giving a slope of infinity. To get around this problem, a parametrization
from Cartesian coordinates to polar coordinates is used. This approach lends itself to
a realistic implementation because it works for all cases. Equation 3.2 characterizes
lines by their radius and theta, which cannot go outside the bounds of the image due
to the nature of polar coordinate representation.
3.5 Proving the parameterization
Figure 3.2 shows how the proof must be approached with some new variables defined.
Here x is the horizontal coordinate of the point where the perpendicular is drawn to the
line in question. Similarly, y is the vertical distance to the same point. If a straight
line is drawn down to the horizontal axis, the triangle that existed originally is
divided in two. θ measured from the horizontal axis remains the same but the other
angles are now modified.
Figure 3.2: Method for proving parametrization of a line.
We now define new variables r1 and r2, two segments that together make up the
original radius (r). Now through simple trigonometric equations the parametrization
can be established. x becomes the hypotenuse of the smaller triangle. Using right
triangle relationships we can write
$$ r_1 = x\cos(\theta) \qquad (3.3) $$
The next step is to define an equation for y. Looking at the remaining angle of
the large triangle, it can be seen that its value is (90 - θ). Making use of the same
relationships we can write
$$ r_2 = y\cos(90 - \theta) = y\sin(\theta) \qquad (3.4) $$
Now that we have developed two equations that each describe one part of the radius,
we can sum them to reach the final form.
$$ r = r_1 + r_2 = x\cos(\theta) + y\sin(\theta) \qquad (3.5) $$
3.6 Manipulating the expression
After arriving at equation 3.2 we can see that there are more modifications required
than just switching coordinate systems. Besides needing values for radius and theta,
the Cartesian variables x and y are still used. The new hurdle, however, comes from
the added mathematical operation. In equation 2.1 there is one multiply and one
addition. In equation 3.2 there is an extra multiply. This can cut performance
drastically because of the number of times this calculation needs to be repeated.
The proposed solution in part examines the mathematical operations involved in
this expression. It is easy enough to compute the product of 5 x 3. It requires
only knowing a basic multiplication table to arrive at the answer. Multiplication on
the other hand is a much more complicated process for a digital circuit. Further,
a multiplication of two n-bit operands results in a 2n bit product. This added bit
width needs to be accounted for in all subsequent computations. Thus multiplication
requires a large amount of hardware and a large amount of time to perform it.
A multiplication is necessary to solve the equation but a multiplier is not a
necessity. A multiplication merely involves a specified number of additions. The
statement 5 x 3 merely implies adding the number 5 three times. Similarly, the first
term of equation 3.2 is x*cos(θ), which can be performed through addition. If theta
were held constant, then it would only be a matter of adding cos(θ) a specified
number of times to reach the answer. If the expression happened to be 16*cos(48),
cos(48) would only need to be added 16 times to reach the same result. Additions can
be accomplished at a much faster rate than multiplications and yield the exact same
results.
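The substitution of addition for multiplication can be illustrated with a short Python sketch. It assumes the cosine is held as an integer scaled by 128, a stand-in for a fixed point hardware value, so that the comparison with true multiplication is exact. The function name `x_terms_by_addition` and the scale constant are assumptions for the illustration.

```python
import math

SCALE = 128  # hypothetical fixed point scale factor


def x_terms_by_addition(width, theta_deg):
    """Build x*cos(theta) for x = 0..width-1 using only additions."""
    step = round(math.cos(math.radians(theta_deg)) * SCALE)
    terms = []
    term = 0
    for _ in range(width):
        terms.append(term)   # value for the current x
        term += step         # moving one pixel right adds cos(theta) once
    return step, terms


step, terms = x_terms_by_addition(17, 48.0)
```

Here `terms[16]` holds the scaled value of 16*cos(48) after sixteen additions, and every entry matches the product `x * step` that a multiplier would have produced.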
3.7 Scanning the image
With the previous discussion in mind, a method for scanning the image can be
constructed so the above mathematical properties hold true. In thinking of an
image as a square, begin examining pixels at the top left corner. Each pixel in the
image can lie on infinitely many lines. Lines are represented by a distance from the
origin and an angle with the horizontal axis. The resulting histogram can be created
by scanning the image through one time for each theta in consideration. With theta
held constant, equation 3.2 can be solved quite easily for r because all other variables
in the expression are known. Each pixel in the image has an associated x and y value
and the cosine and sine values are read from a ROM holding tables of these values.
Inside this memory are the quantized values of all the cosines and sines necessary.
The generation of the lookup table will be discussed later.
Since the image is being scanned linearly, substituting addition for multiplica-
tion is valid. If the system were not scanning the pixels in a linear fashion then
this method would be invalid. The radius can now be computed for the associated
theta and be stored in a memory element. Since one radius and theta combination
can occur multiple times, the two parameters will represent an address in memory.
The contents of that address denote the number of times that radius and theta combination
was detected. This number also represents the number of pixels on the
line. After the processing of the image is finished the RAM will hold the Hough
transform data.
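The scanning scheme described above can be sketched in software as follows. This is a minimal Python model under several assumptions and is not the thesis circuit: the function name `hough_scan` is hypothetical, a dictionary keyed by (radius, angle index) stands in for the RAM addressing, floating point values stand in for the ROM contents, and the origin is taken at the top left corner with 32 angles per half turn.

```python
import math
from collections import defaultdict


def hough_scan(image, num_angles=32):
    """Scan the image once per angle, building r by repeated addition."""
    height, width = len(image), len(image[0])
    memory = defaultdict(int)          # address (r, angle index) -> vote count
    for k in range(num_angles):
        theta = math.pi * k / num_angles
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        y_term = 0.0                   # y*sin(theta), built by addition
        for y in range(height):
            r = y_term                 # x = 0 at the start of each row
            for x in range(width):
                if image[y][x]:
                    memory[(round(r), k)] += 1
                r += cos_t             # next pixel in the row: add cos(theta)
            y_term += sin_t            # next row: add sin(theta)
    return memory


# An 8 x 8 test image containing a single vertical line at x = 3.
image = [[1 if x == 3 else 0 for x in range(8)] for y in range(8)]
memory = hough_scan(image)
peak = max(memory, key=memory.get)
```

For this test image the largest count sits at radius 3 and angle index 0 with 8 votes, one per pixel on the line, which matches the statement that the stored number equals the number of pixels on the line.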
3.8 Lookup tables
Now that the method for transforming the image has been discussed, the values to
use in the equations need to be determined. The cosine and sine values are stored in
blocks of read-only memory (ROM) referred to as lookup tables (LUT). The LUT
can be thought of as a large matrix composed of cells that are organized by row and
column. The system contains one lookup table with an address size of 4 bits and a
data size of 16 bits.
The cosine and sine values that are stored have a binary representation of eight
bits. All of the stored cosine and sine values lie between zero and one. Eight bits of
representation were chosen to keep computation size low and system size minimal.
The system performs arithmetic based upon a fixed point number system with 7
fractional bits. A fixed point number system is analogous to how human beings
do arithmetic because the decimal place does not move. It also reduces system
complexity because it avoids the extra hardware that floating point arithmetic
would require.
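As an illustration of the fixed point format, the sketch below converts a value between zero and one into an integer code with 7 fractional bits and back. Rounding to the nearest code is an assumption made here, since the text does not state whether the table values were rounded or truncated, and the helper names are hypothetical.

```python
import math

FRAC_BITS = 7  # number of fractional bits stated in the text

def to_fixed(value, frac_bits=FRAC_BITS):
    """Quantize a value in [0, 1] to an unsigned fixed point integer code."""
    return round(value * (1 << frac_bits))

def to_float(code, frac_bits=FRAC_BITS):
    """Recover the real value represented by a fixed point code."""
    return code / (1 << frac_bits)

half = to_fixed(0.5)                             # 64
one = to_fixed(1.0)                              # 128, which still fits in eight bits
cos45 = to_fixed(math.cos(math.radians(45)))     # 91
```

With this format the largest quantization error of a stored value is half of 1/128.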
In order to increase performance, not every integer value of theta has been
accounted for. Incrementing theta by 1° at a time would give more accurate results
but the delay tradeoff would be quite large. Instead, θ is incremented by 5.625°,
which is equivalent to 180/32. This number was chosen to keep the lookup tables
small while still producing desirable results. Address i of the LUT contains the
cosine and sine values of angle i. This quantization allows us to use only 5 bits to
describe any angle between 0° and 180°.

Figure 3.3: Lookup table and complementary circuitry

The system stores only the cosine and sine values from 0° to 90°. For larger angles,
a small amount of hardware was added instead of having a larger LUT. Cosine and
sine are periodic functions with symmetry and satisfy
$$\cos(180^\circ - \theta) = -\cos(\theta), \qquad \sin(180^\circ - \theta) = \sin(\theta) \tag{3.6}$$
To access the sine and cosine values of consecutive angles (in steps of 5.625°), we
use an address counter that begins at 0 and increments to provide new sine/cosine
values. As a consequence of equation 3.6 this counter is designed to decrement back
to 0 when it reaches 31. The sign of the cosine value is corrected by using extra
hardware. This ensures that when the counter counts down from 31 to 0 the system
receives the proper values for sine and cosine as if the values were being brought
straight from memory. Figure 3.3 shows the two LUTs with the complementary
circuitry surrounding them. A LUT address counter labeled "theta" always holds
the current angle. It is incremented each time the entire image is processed, i.e.,
when all 256 x 256 pixels have been read. When the counter reaches a value of 31, a
multiplexor senses it and passes a logic "high" for as long as the counter is at 31.
This level is logically "anded" with the clock signal, and the result is the clock for
the D flip-flop. A delay (D) flip-flop is the simplest memory element; it holds
one bit of data for one clock period. The flip-flop then puts out a logic "high"
from its Q output. This signal controls both the inverter and the counter direction.
With a logic "high" the inverter complements the x values and the counter begins
to count down. When the counter reaches zero the system has finished processing
the image and the final results may be read from RAM.
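The use of equation 3.6 to avoid a larger table can be sketched as follows. This is a software illustration only: it assumes a quarter-wave table with 17 entries (0° through 90° inclusive, which differs from the 4-bit address size quoted earlier), floating point entries, and an index k running over all steps of 5.625° from 0° to 180°. The names `QUARTER` and `lookup` are hypothetical.

```python
import math

STEP_DEG = 180.0 / 32          # 5.625 degrees per step

# Quarter-wave table: (cos, sin) for 0, 5.625, ..., 90 degrees (17 entries in this sketch).
QUARTER = [(math.cos(math.radians(i * STEP_DEG)), math.sin(math.radians(i * STEP_DEG)))
           for i in range(17)]

def lookup(k):
    """Return (cos, sin) for the angle k * 5.625 degrees, with 0 <= k <= 32."""
    if k <= 16:
        return QUARTER[k]      # angle is at most 90 degrees: read the table directly
    c, s = QUARTER[32 - k]     # mirror the address about 90 degrees
    return (-c, s)             # cos(180 - a) = -cos(a), sin(180 - a) = sin(a)
```

Mirroring the address plays the role of the counter counting back down, and negating the cosine plays the role of the sign-correcting hardware.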
3.9 Calculation circuit
The calculation circuit is the section where the computation of equation 3.2 takes
place. Figure 3.4 shows the architecture of the calculation circuit. The cosine and
sine values from the LUT are brought in and latched into the XCOS and YSIN
registers respectively. These registers are comprised of multiple delay flip-flops with
their clocks tied together. The two registers are updated on the rising edge of each
clock cycle. The bus lines from the LUT to the XCOS and YSIN registers are eight
bits wide. Directly after the XCOS and YSIN registers, additional lines all with logic
"low" are added onto the bus for a width of 18 lines. The bus needs to be extended
to accommodate the largest possible radius of the image. Since the image is 256x256
pixels, the largest radius possible would be a diagonal of length 256 x √2 or 363
pixels.
The values coming in from the LUT have 8 bits of fractional point. The "decimal
point" would actually be between the 8th and 9th bit as counted from the right. We
then need an extra 10 bits to the left of that to account for the highest possible
integer sum. After the bus line extension, the lines are fed into one of the two inputs
of a ripple-carry adder. The ripple-carry adder computes the sum and the carry
and combines them in the end to form the answer. This architecture is also used
with the YSIN register and RCA2. The YSIN_UPDATE register is clocked only
when XCNT reaches a value of 255 and the negative edge of the clock occurs. The
signal that marks this condition is the bitwise AND of all the counter outputs and
is called x255. When 255 is reached the line becomes a logic "1". This signifies that
one horizontal line of the image has been processed for the current theta. The
multiplexer that switches between YSIN_UPDATE and the output of RCA1 depends
on which pixel is being processed. If the pixel being processed is at the beginning
of a horizontal line (XCNT=0) then the first half of equation 3.2 is zero. The signal
xzero is the bitwise NOR of each counter output and becomes a logic "1" when the
value is zero. The multiplexer then chooses the YSIN_UPDATE.

Figure 3.4: Calculation circuit that computes the radius value.

Figure 3.5: Block diagram of the adder circuit

The output of the YSIN_UPDATE register is also fed back to RCA2. This is
connected in such a way that the sum continues to be calculated, thus effectively
emulating a multiplication. The flow is controlled through the selective clocking of
the YSIN_UPDATE register. The output of RADIUS is also fed back to RCA1 to
emulate a multiplication. A multiplexer exists before YSIN_UPDATE which allows
the register to be filled with a value of zero in the case of y=0 or to latch in the
value from RCA2. A second similar multiplexer exists before the RADIUS register
which will zero out the register in the case of a new theta occurring. A signal
newtheta is the logical AND of the xzero and yzero signals. The signal yzero is
identically configured to the xzero signal and becomes a logic "1" when the YCNT
value is zero. The upper 10 bits of the RADIUS register are sent to the memory
circuit to store the values. A block diagram of the adder is shown in figure 3.5.
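The behaviour of the two feedback paths can be modelled at the register level. The sketch below is a simplified software model under stated assumptions: integer fixed point inputs `cos_fx` and `sin_fx`, one radius per loop iteration, and no modelling of clock edges or bus widths. The function name `scan_radii` is hypothetical; the variable names follow the register names in the text.

```python
def scan_radii(cos_fx, sin_fx, width, height):
    """Yield (x, y, radius) for one theta using only additions."""
    ysin_update = 0
    for y in range(height):
        if y == 0:
            ysin_update = 0              # multiplexer before YSIN_UPDATE loads zero when y = 0
        else:
            ysin_update += sin_fx        # RCA2 feedback: add one more sine value per row
        radius = 0
        for x in range(width):
            if x == 0:
                radius = ysin_update     # xzero asserted: the x*cos term is zero
            else:
                radius += cos_fx         # RCA1 feedback: add one more cosine value per pixel
            yield x, y, radius

# Fixed point codes for cos(45 deg) and sin(45 deg) with 7 fractional bits (assumed).
results = list(scan_radii(91, 91, width=4, height=3))
```

Every radius produced this way equals x times the cosine code plus y times the sine code, which is the product form of equation 3.2 obtained without a multiplier.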
3.10 Memory processing and storage
A separate circuit handles the storing and updating of the data from the adder
circuit. When a radius is calculated for an associated theta, the system sees this
pair as another possible point on the line that the radius and theta represent. Inside
the RAM, an integer value exists at each address. When an address is called upon,
the current value is read, incremented and written back. The radius and theta
combination is the RAM address. After the entire image has been scanned and the
data written, a location of the RAM contains the length of the line whose parameters
correspond to the location. Thus, by scanning the RAM at the end of the transform,
one can determine the longest lines in the image.
The memory manipulation circuit was designed to accept a radius and theta on
every clock cycle. The problem however is that the process of reading the memory
value, incrementing it and writing it back cannot be done in one cycle. It is for this
reason that the memory was split into two blocks, RAM1 and RAM2, and the circuit
is pipelined. When one value comes in, one half of the circuit begins to work with
that address. When the next value comes in, the other half of the circuit handles
that address. During this time, the previous operation on the left side of the circuit
is finished and it can accept the next value.
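The benefit of splitting the circuit into two halves can be illustrated with a small timing model. This sketch assumes that one request arrives on every clock cycle, that a read, increment and write back occupies a half for two cycles, and that requests simply alternate between the halves. It does not model two consecutive requests to the same address. The function name is hypothetical.

```python
def pipeline_finish_cycles(n_requests, latency=2, n_halves=2):
    """Return the cycle at which each request completes."""
    free_at = [0] * n_halves              # cycle at which each half becomes idle
    done = []
    for i in range(n_requests):           # request i arrives at cycle i
        half = i % n_halves               # requests alternate between the halves
        start = max(i, free_at[half])     # wait if that half is still busy
        free_at[half] = start + latency
        done.append(start + latency)
    return done

two_halves = pipeline_finish_cycles(4)              # [2, 3, 4, 5]
one_half = pipeline_finish_cycles(4, n_halves=1)    # [2, 4, 6, 8]
```

With two halves one update completes on every cycle once the pipeline is full, whereas a single half completes only one update every two cycles.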
The upper seven lines from the RADIUS register are fed in from the adder circuit.
These lines hold the integer value of the computed radius for the given angle. The
lines are joined here with 5 lines from the THETA register to form the complete
value of radius and theta. These 12 lines will be the memory address inside each of
the two blocks of RAM labeled RAM1 and RAM2. RAM1 holds odd address values
while RAM2 holds even address values. A block labeled CONTROL handles the
hardware interaction and timing for the system. The control block can be thought
of as two registers, STATE_REG and CONTROL_REG. STATE_REG is 1 bit wide
and CONTROL_REG is 6 bits wide. STATE_REG is incremented on every rising edge
of the clock and its value is monitored. Each bit of CONTROL_REG corresponds
to a signal in the system. Figure 3.7 shows the layout of the control signals.

Figure 3.6: Storage and processing circuit

Figure 3.7: Control for storage and processing circuit (CONTROL_REG bits, left to
right: wr2, wr1, rd2, rd1, ADD_REG2, ADD_REG)
When the value of STATE_REG is 0, CONTROL_REG is loaded with 100101.
When the value of STATE_REG is 1, CONTROL_REG is loaded with 011010.
When any of the control signals receives a 1 it is activated. In the first clock cycle,
ADD_REG is clocked, the rd1 signal is asserted and the wr2 signal is asserted. This
signifies that the new radius value coming in is latched into the ADD_REG register
and the previous value is being written back to the RAM2 block. The new value
from RAM1 is also read simultaneously and put out onto the bus. The value of the
rd1 signal also controls the multiplexor to choose either the value from RAM1 or
RAM2. The register DATA_REG is clocked on every negative clock edge so that
enough propagation time is allowed for a value to be read from RAM and latched in.
The value is then incremented by the adder RCA1 and RES_REG is clocked on the
next positive edge. By this time, the value is on the bus waiting to be written
back. The tristate buffers, which act as high impedance devices, are controlled by the
wr1 signal. When that signal is asserted, DATA1 allows the value to pass from the
output of RES_REG to the data lines of RAM1 where it is written back. This whole
process happens in 2 clock cycles. The same procedures take place with RAM2 but
on alternating clock cycles from RAM1. Figure 3.8 shows the pipelining schedule
that the storage circuit follows. In the beginning the registers contain undetermined
values but the pipeline fills up quickly.

Figure 3.8: Timing diagram for storage and processing circuit
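The two control words can be decoded with a few lines of software. The sketch below assumes that the bits of CONTROL_REG are listed from left to right in the order given in Figure 3.7 (wr2, wr1, rd2, rd1, ADD_REG2, ADD_REG), which agrees with the signals the text names for the first clock cycle. The function names are hypothetical.

```python
SIGNALS = ("wr2", "wr1", "rd2", "rd1", "ADD_REG2", "ADD_REG")  # bit order taken from Figure 3.7
CONTROL_WORDS = {0: "100101", 1: "011010"}                     # loaded according to STATE_REG

def control_signals(state):
    """Return the set of signals asserted for a given STATE_REG value."""
    bits = CONTROL_WORDS[state]
    return {name for name, bit in zip(SIGNALS, bits) if bit == "1"}

def schedule(n_cycles):
    """List the asserted signals for each clock cycle, starting from state 0."""
    state = 0
    cycles = []
    for _ in range(n_cycles):
        cycles.append(control_signals(state))
        state ^= 1                 # the 1-bit STATE_REG wraps around on every rising edge
    return cycles

first, second = schedule(2)
```

The two words are bitwise complements of each other, so each half of the circuit performs in one cycle what the other half performs in the next.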
The word size of each RAM block is 5 bits, which can hold a maximum value of
31. As discussed earlier, the longest possible line in the image can have a value of
363. If the adder were to keep adding a new value once 31 was reached, the memory
value would just wrap around to zero and the data would be invalid. To prevent
this, the five bits read from the RAM are logically "nanded" together and this
result is "anded" with the value to be added, which is read from the image. This
configuration allows the adder to work normally unless the value in memory is
already 31, in which case a zero is added to the result. This means that any radius
and theta combination having a value of 31 is considered a line.
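The gating logic can be expressed directly in software. The following sketch assumes a 5-bit stored count and a 1-bit pixel value, and mirrors the NAND and AND gates described above. The function name is hypothetical.

```python
WORD_BITS = 5
WORD_MAX = (1 << WORD_BITS) - 1              # 31, all five bits set

def saturating_increment(stored, pixel):
    """Add the pixel bit to the stored count unless the count is already 31."""
    all_ones = int(stored == WORD_MAX)       # AND of the five stored bits
    gate = 1 - all_ones                      # NAND of the five stored bits
    return stored + (pixel & gate)           # the adder sees zero once the count saturates

count = 0
for _ in range(363):                         # the longest possible line in the image
    count = saturating_increment(count, 1)
```

After 363 black pixels on the same line the stored count is 31 rather than a wrapped value.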
3.11 Interaction of the Computational and Memory Update Units
Figure 3.9 shows a functional block diagram of the Hough system. It has been
shown how the Lookup tables, the adder circuit and the memory and storage unit
perform the Hough Transform on an image. A black and white image is processed
pixel by pixel and equation 3.2 is solved many times over for every theta possible. A
new result is produced each clock cycle and it is stored in memory that is addressed
directly by the radius and theta combination. The architecture has been optimized
for throughput by using pipelining techniques.
Figure 3.9: Flowchart of the Hough system.
Chapter 4
Performance Evaluation
4.1 Analysis of Results
The implementation discussed in chapter 3 was coded in the Verilog hardware
description language and was synthesized using the Leonardo software package on
Sun workstations. Table 4.1 shows the area and delay estimates given by Leonardo
Spectrum for the Hough system design. The target technology column shows the
input libraries that Leonardo Spectrum used to map the designs. The scl05u library
denotes an Application Specific Integrated Circuit (ASIC) while the xi4xl denotes a
Xilinx 4005XL Field Programmable Gate Array (FPGA). Leonardo accepts a target
technology library and a specific design. It then simulates the design and produces
the relative performance data associated with constructing the circuit on the
specified technology. Relative parameters include size of the circuit (area) and
speed the circuit can run at (delay).
The results of table 4.1 show that the ASIC implementation using scl05u cuts
the delay to less than half of that of the FPGA implementation (9.25 ns versus
20.40 ns). FPGA technology is convenient because it takes little time to go from
concept to product, but clearly it is not an optimized implementation. ASICs, on
the other hand, are specifically tailored to a particular design. Even though they
require more time to build and are expensive, they run at much higher clock speeds
than FPGAs with the same design.
The gate count section of table 4.1 has been divided into RAM and Total to
demonstrate that the RAM section of the system accounts for more than half of the
gates used. The RAM was synthesized in this design in order to prove functionality
of the system. In an actual implementation a RAM chip would be used to cut cost
and usage of the FPGA. The synthesis results of Leonardo demonstrate that this
design can be implemented on a commercially available device such as the Xilinx
4005XL. The percent usage column denotes how much of the FPGA is used to
implement the Hough system design.
The FPGA statistics are broken up into gates and configurable logic blocks (CLBs).
The CLBs that are used on Xilinx FPGAs consist of lookup tables and flip-flops
to implement logic. The larger logic blocks are supposed to coincide with improved
performance. The Xilinx 4005XL FPGA can accommodate 5000 gates and 400
CLBs.
Area and Delay Results

Target        Gate Count        Delay       CLBs    Percent Usage
Technology    RAM     Total                         Gates    CLBs
scl05u        2713    4633      9.25 ns     n/a     n/a      n/a
xi4xl         n/a     303       20.40 ns    113     6.06     28.25

Table 4.1: ASIC and FPGA Area and Delay Estimates
The above data indicates this design can be realized on an FPGA platform. It
also suggests improved performance with an ASIC implementation and reduced area
with commercially available RAM.
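The percent usage figures follow directly from the quoted device capacity, and the delays give a rough bound on clock rate. The sketch below reproduces this arithmetic. Treating the reported delay as the minimum clock period is an assumption made here for illustration, and the helper names are hypothetical.

```python
def percent(used, capacity):
    """Share of a resource that is used, in percent."""
    return 100.0 * used / capacity

def max_clock_mhz(delay_ns):
    """Upper bound on clock frequency if the reported delay is the clock period."""
    return 1000.0 / delay_ns

gate_usage = percent(303, 5000)        # FPGA gates used out of 5000
clb_usage = percent(113, 400)          # FPGA CLBs used out of 400
ram_share = percent(2713, 4633)        # share of ASIC gates spent on the synthesized RAM
asic_mhz = max_clock_mhz(9.25)
fpga_mhz = max_clock_mhz(20.40)
```

Under that assumption the ASIC estimate corresponds to a clock of roughly 108 MHz and the FPGA estimate to roughly 49 MHz.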
4.2 Error Analysis
The valid data that is stored in the RAM has inherent error. This is due to the
quantized value of θ used in calculations. Since θ is quantized in multiples of 180/64,
this error in θ propagates through into the calculations. Recall that the Hough
transform equation in polar coordinates links r to θ as:

$$r = x\cos(\theta) + y\sin(\theta) \tag{4.1}$$

We can therefore obtain the error Δr in r in terms of the quantization error Δθ by
differentiating 4.1 with respect to θ to get

$$\Delta r = -x\sin(\theta)\,\Delta\theta + y\cos(\theta)\,\Delta\theta \tag{4.2}$$

Clearly, the error Δr depends on the value of θ as 4.2 shows. To find the maximum
error, set the derivative of Δr with respect to θ equal to zero as follows.

$$\frac{d(\Delta r)}{d\theta} = -\bigl(y\sin(\theta) + x\cos(\theta)\bigr)\,\Delta\theta = 0 \tag{4.3}$$

or

$$\tan(\theta) = -\frac{x}{y} \tag{4.4}$$

Substituting back into equation 4.2 we arrive at

$$\Delta r = \left(y\cos(\theta) + \frac{y\sin^2(\theta)}{\cos(\theta)}\right)\Delta\theta = y\sec(\theta)\,\Delta\theta \tag{4.5}$$

This equation shows that the error Δr is linearly proportional to y and increases
with θ. Thus the maximum error in r in the Hough transform would be for lines with
θ = π/2, i.e. vertical lines.
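The first-order estimate of equation 4.2 can be checked numerically against the exact change in r. The sketch below uses an arbitrary pixel and angle chosen only for illustration, and it assumes a worst-case angular error of half of a 5.625° step. The function names are hypothetical.

```python
import math

def radius(x, y, theta):
    """Equation 4.1."""
    return x * math.cos(theta) + y * math.sin(theta)

def delta_r(x, y, theta, dtheta):
    """First-order error estimate from equation 4.2."""
    return (-x * math.sin(theta) + y * math.cos(theta)) * dtheta

x, y = 100, 50                              # illustrative pixel coordinates
theta = math.radians(30)                    # illustrative angle
dtheta = math.radians(5.625 / 2)            # assumed error: half a quantization step
estimate = delta_r(x, y, theta, dtheta)
exact = radius(x, y, theta + dtheta) - radius(x, y, theta)
```

The difference between `estimate` and `exact` is the second-order term that the linearization ignores. Since the second derivative of r with respect to θ is −r, this difference is bounded by sqrt(x² + y²)·Δθ²/2.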
Chapter 5
Conclusions and Future Research
5.1 Conclusions
The previous chapters detail the design of a special purpose architecture to per-
form the Hough transform on black and white images. The discussion begins with
a description of the Hough transform and illustrates the problem using the Carte-
sian coordinate system. After a polar representation is developed, the architecture
is presented. The system has two main portions: the calculation circuit and the
memory storage and update circuit.
The calculation circuit is the portion of the system that actually performs the
mathematics of the Hough transform. It computes the possible radius and θ
associated with the current x and y values by solving equation 3.2 once per clock
cycle. The calculation circuit uses only adders to perform the mathematics. This
approach was used to reduce the critical path in the calculation portion of the
circuit. The quantization of θ added error to the calculations but ultimately
reduced the address size of the RAM.
The memory storage and update circuit accepts the calculated radius from the
calculation circuit. This circuit is pipelined because new values are calculated every
clock cycle but the procedure to update the memory takes two clock cycles. Together
the radius and the current theta value make up the address which is read from
RAM. The value of the current image point determines what is added to the current
memory address. A saturation adder is included to bound the line lengths to a
maximum value of 31 pixels. Quantizing the value of the lines saves on the word-
size of the RAM, but it does not accurately provide the lengths of long lines in the
image.
The design shows a proof of concept for a low power approach to the Hough
transform. The delay and area of the resultant architecture implemented in ASIC
technology and the Xilinx 4005XL FPGA were obtained. They verify the
possibility of implementing the circuit on two commercially available platforms.
5.2 Future Research
A topic for future study of the hardware implementation of the Hough Transform
would be to reduce the error in the system. In chapter 4, an approximation of the
inherent error in the system was given. The error of the system results from the
quantization of θ. In order to reduce error, the θ step could be decreased, thereby
increasing the number of angles covered in the sweep. While this approach would
cut down on error, both the hardware and delay increase linearly as the number
of distinct θ values increases. Another method of reducing error would be to remove
the saturation adder that converts larger lines to 31 pixels in length. Even though
this would increase precision by denoting exactly how long a line in the image is, it
would increase the RAM size necessary to store the values. This will ultimately add
cost to the system because a RAM with a larger word size would be necessary.
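The effect of both suggestions on memory size can be estimated with a short calculation. The sketch assumes the 7-bit radius and 5-bit word size described in chapter 3, a theta field just wide enough to index every angle, and a single RAM whose size is the number of addresses times the word size. The helper names are hypothetical.

```python
def ram_bits(n_theta, radius_bits=7, word_bits=5):
    """Total storage in bits for an accumulator addressed by (radius, theta)."""
    theta_bits = (n_theta - 1).bit_length()      # bits needed to index n_theta angles
    return (1 << (radius_bits + theta_bits)) * word_bits

def word_bits_for(max_count):
    """Word size needed to store counts up to max_count without saturation."""
    return max_count.bit_length()

current = ram_bits(32)                           # 32 angles, 5-bit words
finer_theta = ram_bits(64)                       # halving the theta step
full_counts = ram_bits(32, word_bits=word_bits_for(363))
```

Under these assumptions, halving the θ step doubles the storage, and removing the saturation requires 9-bit words instead of 5-bit words.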
The design makes use of parallelism in order to perform the Hough transform.
The speed of the system would scale linearly with each additional copy of the circuit.
Presently, one adder and one memory storage and update unit are implemented. If
a second copy of the circuit were added it could begin calculations with θ beginning
at 90° and proceeding through 180°. This would cut the time required to process the
image in half. Another possibility would be to have the calculations begin with a
y value that is half of the vertical distance of the image. This method would also
decrease the time required to perform the calculations by 50%. However, due to the
nature of the memory storage and update unit it would be more direct to implement
the former suggestion. It would not require two reads from the image at once. If it
is determined that the latter method is more desirable, the image could be stored
in two ROMs so that simultaneous reads could take place.
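The expected gain from a second copy of the circuit can be estimated as follows. This is a rough calculation under stated assumptions: one pixel is processed per clock cycle, 32 values of θ are swept per image, the clock period equals the 9.25 ns ASIC delay of table 4.1, and pipeline fill time is ignored. The function name is hypothetical.

```python
def processing_time_ms(width=256, height=256, n_theta=32, delay_ns=9.25, n_units=1):
    """Estimated time to transform one image, in milliseconds."""
    cycles = width * height * n_theta / n_units   # theta range is divided among the units
    return cycles * delay_ns * 1e-6               # nanoseconds to milliseconds

one_unit = processing_time_ms()
two_units = processing_time_ms(n_units=2)
```

With these numbers a single unit needs about 19.4 ms per image and two units need about 9.7 ms.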
Another approach to improving the performance of the Hough system would be
to make use of the non-standard architecture already implemented. The system
presently performs all calculations without reading the image. The memory storage
and update unit is the only portion of the hardware that reads the image. Since the
calculation circuit does not read from the image, a more efficient method would be
to precompute the values so that they are ready to be sent to the pipelined memory
unit. Storing all the possible values in a ROM is less than desirable because as
the image size increases so would the necessary ROM size. There are methods for
cosine and sine approximation which would allow for quicker retrieval of the values.
Optimized multiplier architectures could also be used to reduce the critical path
in the calculation circuit. By using a multiplier, the number of iterations required
would be fewer because the radius would not need to be calculated incrementally.
A final method for improving performance would be a system with variable
precision. Portions of the image may have little or no desirable patterns in which
case the precision of the system for that portion of the image could be relaxed. If a
large portion of the image had no desirable pattern then it could be processed less
exhaustively or skipped entirely, leading to an increase in performance. The tradeoff
to this method would be adding a new portion of "smart" circuitry to handle the
detecting of bare portions of the image and the control of the system.
Bibliography
[1] F. Ozbek and M. Wagh, "A parallel hough transform algorithm for nonuniform
images," Parallel Processing Letters, pp. 253-259, 1994.
[2] M. Nakanishi and T. Ogura, "Real-time line extraction using a highly parallel
Hough transform board," in Image Processing, 1997. Proceedings. International
Conference on, 26-29 Oct. 1997, vol. 2, pp. 582-585, 1997.
[3] L. Lin and V. Jain, "Parallel architectures for computing the Hough transform
and CT image reconstruction," in Application Specific Array Processors, 1994,
Proceedings, International Conference on, pp. 152-163, August 1994.
[4] 11. Meribout, M. Nakanishi, E. Hosova, and T. Ogura, "Hough tranform algo-
rithm for three-dimensional segment extraction and its parallel hardware imple-
mentation," Computer Fisio71 and Image Understanding, vol. 78, pp. 177-205,
~lay 2000.
[5] ~L ~lahmoud, ~L Nakanishi, and T. Ogura, "Hough transform implementa-
tion on a rcconfigurable highly parallel architecture:' in Proceedings Fourth
48
IEEE International Workshop on Computer Architecture for Machine Percep-
tion 1997, 1997.
[6] M. Akil, A. Dehili, E. DUjardin, K. Hamard, and S. Zahirzami, "Parallel Hough
transform on a hierarchial structure," in Proceedings of the 13th International
Conference on Pattern Recognition, 1996.
[7] R. Shankar and N. Asokan, "A parallel implementation of the hough transform
method to detect lines and curves in pictures," in Proceedings of the 32nd
Midwest Symposium on Circuits and Systems, pp. 321-324, August 1989.
[8] N. Guil and E. Zapata, "Fast Hough transform on multiprocessors: A branch
and bound approach," Journal of Parallel and Distributed Computing, vol. 45,
pp. 82-89, August 1997.
[9] L. Hopwood, W. Miller, and A. George, "Parallel implementations of the
Hough transform for the extraction of rectangular objects," in Southeastcon
'96. 'Bringing Together Education, Science and Technology' Proceedings of the
IEEE, pp. 261-264, April 1996.
[10] http://www.netlib.org/pvm3/book/node17.html, "The PVi\l System.".
[11] A. Epstein, G. Paul, B. VeUermann, C. Boulin, and K. F., "A parallel systolic
array ASIC for real-time execution of the hough transform," IEEE Transactions
on Nuclear Science, vol. ·19, pp. 339-346, April 2002.
49
[12] K. Hanahara, T. Maruyama, and T. Uchiyama, "A real-time processor for the
Hough transform," IEEE Transactions on Pattern Analysis, vol. 10, pp. 121-
125, January 1988.
[13] J. Vuillemin, "Fast linear hough transform," in International Conference on
Application Specific Array Processors, Proceedings., pp. 1-9, August 1994.
[14] R. Yip, D. Leung, and S. Harrold, "Line segment patterns Hough transform for
circles detection using a 2-dimensional array," in Industrial Electronics, Con-
trol and Instrumentation, 1993. Proceedings of the IECON '93. International
Conference on, vol. 3, pp. 1361-1365, November 1993.
[15] F. Rhodes, J. Dituri, G. Chapman, B. Emerson, A. Soares, and J. Raffel, "A
monolithic Hough transform processor based on restructurable VLSI," IEEE
'Iransactions on Pattern Analysis and Machine Intelligence, vol. 10, pp. 106-
110, January 1988.
[16] R. Cucchiara, G. Neri, and 11. Piccardi, "A real-time hardware implementation
of the Hough transform," Journal of Systems Architecture, vol. 45, pp. 31-45,
1998.
50
