Abstract-The Hough Transform is a pattern recognition tool commonly used in many image processing algorithms for detecting straight lines. Hough's original formulation of this transform, based on Cartesian coordinates, could not detect vertical lines, and thus it has become common to use Duda and Hart's approach, based on the Radon Transform, which uses polar coordinates and trigonometric functions. For a hardware implementation, this typically requires the use of CORDIC algorithms or lookup tables, thus adding overhead and reducing precision. In this paper we show that Hough's original method, slightly modified, competes well with the Radon transform formulation in terms of speed and required resources in an FPGA implementation. The architecture of our design is straightforward. And because it is implemented in Verilog on an FPGA, the design can also easily be tuned to the desired accuracy.
I. INTRODUCTION
In 1959 Paul Hough described a transform for detecting straight lines in bubble chamber pictures [1] .
Hough's procedure for determining straight lines and complex patterns, which has many applications in the fields of computational mathematics and computer vision, was originally awarded a US patent in 1962 [2] . A general version of the Hough Transform, applicable to parametrically defined image curves, is explained in [3] . The Hough Transform has been used, for example, in applications related to medical devices, to detect harmful growths or shapes within the human body [4] , [5] , and to sharpen and correct noisy still images and videos [6] . Hough's original version of this transform, which we will refer to as the Classical Hough Transform, represents a line L in terms of its slope m and y-intercept b. Because they mistakenly believed that there was no way to prevent the range of m from being the entire real line (-∞, ∞), Duda and Hart in 1972 proposed a version of the Hough Transform which replaced the parabolic dual by the Radon Transform, which uses polar coordinates [7] . In their formulation, a line L is represented by the angle that the normal to L makes with the positive x-axis and the distance ρ of L from the origin This is the version which is commonly referred to today as the Hough Transform. Here we will refer to this as the Duda-Hart Hough Transform. In both the Classical Hough Transform and the Duda-Hart Hough Transform, a voting scheme is used to count points on a given line, with the output being the set of lines containing more than two points. But in the Duda-Hart formulation the accuracy of the computations is related to the choice of the resolution for the angle and to computations involving sin( ) and cos( ). Thus the Duda-Hart Hough Transform is less suitable for an FPGA implementation, where numerical precision and area resources are limited, than Hough's original method.
Several researchers have developed FPGA implementations of the Duda-Hart Hough Transform. One of the first was by Tagzout et al. [8] . Their implementation was based on an improved Fast Incremental Hough Transform (FIHT2) model and required the use of a lookup table and periodic adjustments to deal with errors that accumulate during the computation. Karabernou and Terranti [9] used a CORDIC algorithm [11] in their architecture. More recently, Zhou et al. [10] have designed a version of the Duda-Hart Hough Transform which uses DSP slices in an FPGA designed specifically for signal processing applications. Using a CORDIC algorithm adds computational overhead, while relying on lookup tables reduces precision. But, for detecting lines, this overhead or precision reduction is not necessary, as we explain below. Our method is an improved version of the methods proposed in [13] and [14] , and our hardware implementation is simpler than the systolic algorithm proposed in [14] .
II. METHODOLOGY
Our implementation of the Classical Hough Transform uses parabolic duality, as opposed to Duda and Hart's parametric duality. Parabolic duality was not commonly used previously due to its presumed difficulty in detecting lines with a steep slope.
A. Parabolic Duality
If L is a line y = mx + b, then the parabolic dual of L, D(L) is the point (m, -b). The dual of the point P = (P x , P y ) is the line L = D (P) whose equation is L: y = P x x -P y. [12] Parabolic duality has the following properties:
The crucial observation is that if two points P1 and P2 lie on a line L with equation y = mx + b, then D(P1) and D(P2) will both contain the point (m, -b) (Fig. 1) . So counting how many of the lines in the dual space contain this point will tell us how many points in the original space lie on the line L. The only caveat is that the line L not be vertical, or, in computational terms, that m not be "too large". The geometric interpretation of this transform is as follows. Consider the parabola y = ½ x 2 . If (a,b) is a point on the parabola, then its dual line is the line tangent to the parabola at the point (a,b). Given any point of the form (a,b+c), its dual line is a line parallel to the line tangent to the parabola at (a,b), and the value -c is the vertical distance between the two lines.
As we see in the algorithm steps below, our approach not only avoids the use of CORDIC algorithms or lookup tables, but can be implemented in such a way that it does not even use multiplication.
B. Algorithm for the Classical Hough Transform
Input: n data points (x k , y k ).
Preprocessing: Normalize the data so that the n points lie in the square Q = [-1, 1] 2 .
Output: All lines L intersecting Q that contain, in some approximate sense, more than two of the given n points. Choose the number of iterations N and the resolution δ so that the value Nδ is close to 1. These choices define a grid of points P = (jδ, kδ), 1< j < N, -N < k < N, in the dual space.
Loop: For i = 1 to n do Note: Clearly the multiplications in steps (b1) and (b2) can be replaced by additions, further simplifying any hardware implementation.
A similar method of rotating the axis had been proposed by Mejdani et al. [12] but that method was restricted to only two states. Costa and Sandler [13] have proposed a similar method with four states, but in their approach they limit only the value of slope to m [0,1] and other parameters are still greater than 0. This forces them to add extra rounding equations to keep the y intercept within limits. With the advances in image processing one can now easily map all the pixels in an image to a domain space where the range is limited to [-1,1] x [-1,1], which would give each pixel a coordinate value (x,y)
[-1,1]. By doing this, we are not only reducing the complexity of the algorithm, but also improving the overall accuracy of the algorithm.
C. Computation Time and Precision
It is easy to determine the precision for this algorithm. The grid is determined by the parameter N and the resolution is determined by the choice of δ. These also determine the (sequential) running time of the algorithm, since the running time for each case is O(n*N 2 ).
D. Hardware Implementation
We have implemented a version of the algorithm in Verilog. This version can be ported to a wide range of hardware platforms. In this project we target Altera FPGAs.
FPGAs have become a powerful tool for hardware design implementations. Their low cost, reconfigurable nature, and ability to work in parallel with other systems within a larger design are the main reasons for their wide use. Fixed point arithmetic is used throughout the design, with 24 fractional bits, since all the calculations lie between -1 and 1. Since the algorithm deals with four different cases, a synchronous state machine is built, where the state can be selected by the user for the orientation of lines he/she wishes to detect.
Based on the algorithm, the design can be split into different blocks as shown in Fig. 2 . The Hough Transform and Address mapping units are designed behaviorally as one module and the Memory unit is a separate module. The Hough Transform unit is responsible for calculating the parameters m and b by applying the dual transfer function to the input coordinates based on the input state. The address mapping unit is responsible for two tasks. First, it does rounding and conversion of the data to an integer number which represents the two dimensional addresses of the (m, b) pairs. Second, it converts the two addresses for m and b into one single address using row-major order. Lastly, the memory unit is a dual port RAM used as the accumulator to count the votes for various (m, b) pairs. One set of RAM ports is used specifically for reading and one set of ports is used as a dedicated writing port. This makes reading and writing to the same memory cell more efficient.
For our experiments, we implemented an [800] x [400] grid, with b ranging from -1 to 1 with a precision of 0.0025 and m ranging from 0 to 1 with a precision of 0.0025. Thus, we would need 320,000 cells to store the vote counts. We assign 3 bits to each cell in the voting grid, which can be increased if need be.
III. RESULTS AND ANALYSIS
Two versions of the algorithm were designed. Prior to the hardware implementation, the algorithm was programmed in C. This was primarily done to functionally verify the working of the algorithm and to determine the ideal parameters with which it could detect all the lines in our test cases to some level of accuracy. Then the algorithm was coded in Verilog, simulated in ModelSim TM and finally simulated in Altera Quartus II TM for an Altera DSP development kit containing a Cyclone II EP2C70F672C6 chip. The design is not ported onto the board but rather simulated to do a resource and timing analysis.
A. Simulation Results
The first test case was to detect 6 different lines, each from different states, with a couple of states having two lines each. This data is meant to mirror the behavior within a bubble chamber where the movement of a particle is represented by points. Obviously, the same concept could be extended to line detection with respect to any image. There are 32 input points, with 5-6 points for each line. Fig. 3 illustrates the plot of the detected lines, along with the input data points. Ideally, the line detector should work in situations where the input data is noisy. Thus, the input data points were perturbed by adding random errors to replicate real world data. A portion of the results from this experiment is shown in Fig.  4 . A few of the detected lines have changed by a cell or two due to the perturbations in the input data. 5 shows results for 1000 random points plus 20 perturbed input points for each line. The detector finds all lines and no spurious lines. In these experiments the threshold was manually varied to find an optimal value. The threshold can be a user parameter or could use a heuristic to calculate an optimal value for a given application.
B. Resource and Timing Analysis
The design was simulated in Altera Quartus II for an EP2C70F672C6 chip. It uses 1093 logic elements, 978 of which are combinational functions, or about 2% of the total available logic blocks. It also uses 960,081 memory bits, or 83% of the available memory. To implement the design with 1000 input points, as in Fig. 5 , we need approximately 196 kB bits. The maximum frequency is 62.82 MHz. It takes 20.06 µs to perform calculations for each input and about 8 ms for searching and resetting the memory unit. So for 32 points we need about 34.56 ms to detect all possible lines. The line detector spends most of its time reading and resetting the RAM. If we increase the number of input points, the time taken for reading and resetting the RAM will remain the same, so the time won't increase much. It can be seen that our implementation is slower but does well in terms of computational units and memory (although the Xilinx and Altera devices are not directly comparable).
Relative accuracy still remains to be determined.
IV. CONCLUSIONS AND FUTURE WORK
The paper presents an algorithm and an architecture, described in Verilog, to calculate the Classical Hough Transform, based on parabolic duality, to detect straight lines. This method reduces errors due to approximations by avoiding the use of trigonometric functions. Results show the detector to be highly accurate for the cases tested. Computation time could be reduced by running the 4 cases treated here in parallel.
For a specific problem, speed and memory requirements depend on the accuracy desired. 
