In this article, an asynchronous array architecture for straight line Hough Transform (HT) is proposed using a scaling free modified CORDIC (Co-Ordinate Rotation Digital Computer) unit as a basic Processing Element (PE). It exhibits four-fold angle parallelism by dividing the Hough space into four subspaces to reduce the computation burden to 25% of the conventional requirements. A distributed accumulator arrangement scheme is adopted to ensure conflict free voting operation. The architecture is then extended to compute circular and elliptic HT given their centers and orientations.
1. Introduction:
Hough Transform (HT) is a well-known technique for efficient shape recognition (1, 2). High computational complexity and excessive memory requirements are the major obstacles to monolithic integration of HT (3). The memory requirement problem may be eased by the current level of memory integration technology (4). In this paper we restrict ourselves to speeding up the transformation part of the HT, i.e., the computation of the vote address in the parameter space.
Different architectures and algorithms have been proposed to speed up the HT computation (4, 5, 6, 7, 8, 9). Most Hough-based methods must evaluate implicit trigonometric and transcendental functions, which makes monolithic implementation of the entire algorithm rather difficult. To overcome this problem, CORDIC based architectures (3, 10) are used to generate the vote address in the parameter space.
The motivation of this work is to construct HT architectures suitable for VLSI implementation that exhibit a high throughput rate at reduced computational complexity. For this purpose, CORDIC based asynchronous array architectures are proposed. The total PE and angle scan range requirements are reduced by adopting an angle parallelization scheme. To overcome the scaling problem inherent to the conventional CORDIC unit, a scaling free modified CORDIC unit (11) is used, which can be implemented using cross-coupled bus connections and adders. A high throughput asynchronous array architecture for straight line HT is proposed first. The architecture is then extended and modified to compute circular and elliptic HT. While computing circular and elliptic HT, we focus only on the estimation of the radius (for a circle) and the semi-major and semi-minor radii (for an ellipse), as estimating these parameters requires extensive arithmetic operations such as multiplication, square root evaluation, division, addition/subtraction and squaring (12). To reduce the computation and hardware requirements for the estimation of these parameters, the problems are reformulated in terms of the CORDIC rotation.
The paper is structured as follows. In Section 2, a brief description of the scaling free modified CORDIC unit is provided, together with its design using Transmission Gate Logic (TGL), which shows 62 mW power consumption in 1.6 µm sea of gates technology.
In Section 3, theoretical formulation of the straight line HT using an angle parallelization scheme and the corresponding architecture are described. Comparison of this architecture with some other existing architectures is done in Section 4. In Section 5, theoretical formulation for circular and elliptic HT and the corresponding architectures are described. Conclusions are drawn in Section 6.
2. The CORDIC unit:
2.1 Brief description of modified CORDIC unit:
The CORDIC algorithm, first proposed by Volder (13) and unified by Walther (14) , is an iterative procedure to compute magnitude and phase or the rotation of a vector in circular, linear and hyperbolic co-ordinate systems, described by the parameter m shown in Table 1 .
An initial vector $[x\ y]^T$ undergoing a rotation through an angle ψ will generate the final vector $[x'\ y']^T$ according to the following relation,
The total rotation ψ can be expressed in steps of smaller angles $\alpha_i$, such that
where M is an integer.
Equation ( , equation 3 may be written as
The largest term that we are neglecting in the process of such approximation is
If the machine on which the operations are to be implemented has an accuracy of b bits, then multiplying any quantity by $\alpha_i^3/3!$ will have no effect if stage i. e., one section corresponding to $\alpha_i$, using this principle is shown in Figure 1. The detailed description of this modified CORDIC is given in reference (11).
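As a rough illustration of the approximation discussed above, the following sketch models one micro-rotation in which sin α is replaced by α and cos α by 1 − α²/2, with α = 2^(−i) so that both factors reduce to shifts and adds. The function names, the clockwise rotation convention and the choice i = 4 are assumptions made for this example; the exact datapath of the unit in reference (11) is not reproduced here.

```python
import math

def scaling_free_step(x, y, i):
    """One clockwise micro-rotation by alpha = 2**-i rad (assumed convention).

    sin(alpha) is approximated by alpha and cos(alpha) by 1 - alpha**2 / 2,
    so in hardware each product is a right shift followed by an add/subtract.
    """
    sin_approx = 2.0 ** -i                 # alpha
    cos_deficit = 2.0 ** -(2 * i + 1)      # alpha**2 / 2
    x_new = x - cos_deficit * x + sin_approx * y
    y_new = y - cos_deficit * y - sin_approx * x
    return x_new, y_new

def exact_step(x, y, alpha):
    """Reference rotation using library trigonometric functions."""
    return (x * math.cos(alpha) + y * math.sin(alpha),
            -x * math.sin(alpha) + y * math.cos(alpha))

i = 4                                      # assumed shift index, alpha = 2**-4 rad
alpha = 2.0 ** -i
approx = scaling_free_step(0.6, 0.8, i)    # unit-length test vector
exact = exact_step(0.6, 0.8, alpha)
max_err = max(abs(a - e) for a, e in zip(approx, exact))
neglected = alpha ** 3 / math.factorial(3) # largest neglected Taylor term
print(f"max component error = {max_err:.3e}, alpha^3/3! = {neglected:.3e}")
```

For a unit-length input the component error stays of the same order as the neglected term α³/3!, which is the quantity the text compares against the machine accuracy.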
2.2 Design of the low power CORDIC processor:
A 16-bit CORDIC processor for ψ = 3.583° is designed using the TGL methodology on the sea of gates semicustom design environment. The sea of gates image (15) .
A performance comparison of the TGL design style with the conventional CMOS, NMOS pass transistor and Domino CMOS logic styles is carried out using an XOR structure. The simulated results are shown in Table 2, which reveals that the TGL style exhibits somewhat better power and delay performance than the CMOS style. The NMOS pass transistor style consumes less power than TGL, but it is not suitable for the sea of gates design style because it wastes the prefabricated PMOS transistors. The critical sizing of the swing restoration buffer required for NMOS pass transistor logic is also difficult to carry out in the sea of gates environment. In contrast, from the layout point of view, implementing TGL on sea of gates minimizes the wastage of prefabricated PMOS transistors. Unlike NMOS logic, TGL does not require the swing restoration buffer, and the body effect can be made symmetrical for a long TGL chain (16). Since direct power line access is not required in the TGL style, the static power dissipation due to leakage current is expected to be low. Implementing logic circuits in TGL requires fewer transistors than the conventional CMOS design style, and thus the area consumption is lower. Considering these features, the TGL style is selected for our purpose.
The performance of the circuit is analyzed by the Switch Level timing Simulator, and the dynamic power consumption is estimated as

$$P = \sum_{i=1}^{n} \beta_i C_{Li} V_{DD}^2 f$$

where P is the power consumption, n is the number of internal nodes, $\beta_i$ is the switching probability of the i-th node, $C_{Li}$ is the i-th load capacitance, f is the operation frequency and $V_{DD}$ is the supply voltage. The switching probability is taken as 1 in order to include the glitching effect, which gives an upper limit on the worst case power consumption.
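The quantities listed above match the usual switched-capacitance estimate of dynamic power, P = Σ β_i · C_Li · V_DD² · f. The sketch below evaluates that sum under this assumption; the function name and the node capacitances are hypothetical and are not taken from the actual layout.

```python
def dynamic_power(load_caps_farads, v_dd, freq_hz, switching_probs=None):
    """Switched-capacitance power estimate: sum(beta_i * C_Li) * V_DD**2 * f.

    If no switching probabilities are given, beta_i = 1 is used for every
    node, which corresponds to the pessimistic upper bound described in the text.
    """
    if switching_probs is None:
        switching_probs = [1.0] * len(load_caps_farads)
    switched_cap = sum(b * c for b, c in zip(switching_probs, load_caps_farads))
    return switched_cap * v_dd ** 2 * freq_hz

# Hypothetical example: 1000 internal nodes of 50 fF each (illustrative values only).
caps = [50e-15] * 1000
p_upper = dynamic_power(caps, v_dd=5.0, freq_hz=44e6)
p_half = dynamic_power(caps, v_dd=5.0, freq_hz=44e6, switching_probs=[0.5] * 1000)
print(f"upper bound: {p_upper * 1e3:.1f} mW, with beta = 0.5: {p_half * 1e3:.1f} mW")
```

Setting every β_i to 1 can only overestimate the result, which is why it serves as a worst case bound.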
The design of the CORDIC processor is carried out using two levels of metallization. For some critical routing portions the prefabricated polysilicon gates of the fishbone structure are used. The individual cell isolation is done by connecting the polysilicon gates to the power rails. All the designs of the datapath elements have been carefully optimized.
The simulated circuit extracted from the layout shows that the worst case delay of the CORDIC processor is 22.72 nsec. At 5 V supply and 44 MHz operation frequency, the dynamic power consumption, power-delay product (PDP) and energy-delay product (EDP) of the CORDIC are 62 mW, 1.408 nJ and $3.2 \times 10^{-17}$ Jsec, respectively. With proper threshold voltage and device scaling, the supply voltage can be lowered further to achieve a quadratic improvement in power performance (16).
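The figures quoted above are mutually consistent if PDP is taken as power times worst case delay and EDP as PDP times delay. That reading is an assumption, although it is the conventional definition. A short check:

```python
delay_s = 22.72e-9        # worst case delay quoted in the text
power_w = 62e-3           # dynamic power quoted in the text

freq_hz = 1.0 / delay_s   # maximum operating frequency implied by the delay
pdp_j = power_w * delay_s # power-delay product (assumed definition)
edp_js = pdp_j * delay_s  # energy-delay product (assumed definition)

print(f"f   = {freq_hz / 1e6:.1f} MHz")
print(f"PDP = {pdp_j * 1e9:.3f} nJ")
print(f"EDP = {edp_js:.2e} J*s")
```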
3. The straight line HT:
The mathematical formulation:
The Duda-Hart parameterization for detecting straight lines in an edge image is defined as (17)

$$\rho = x\cos\theta + y\sin\theta \qquad (5)$$

where ρ is the normal distance of the straight line from the origin of the co-ordinate system and θ is the angle between the normal and the x-axis, as shown in Figure 2. Equation (5) can be implemented using CORDIC, which is evident from equation (1). From equation (1), one gets,
Equation (6) and (7) show that the CORDIC provides two concurrent outputs with their arguments lying π/2 angle apart.
Now replacing θ by (45° + θ) in equations (6) and (7), we obtain another two equations as follows:
These equations imply that a scan range of θ ∈ [0, π] can be divided into four subspaces. Thus, computing equations (6), (7), (8) and (9) in parallel
In equations (11) and (13) (11) and (13) in terms of ρ A and ρ C as follows,
Using CORDIC, equations (10) and (12) can be computed concurrently, and from these, equations (14) and (15) can also be computed.
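The angle parallelism can be illustrated numerically. Assuming ρ(θ) = x cos θ + y sin θ and a rotation that delivers ρ(θ) and ρ(θ + π/2) as its two outputs, the values at θ + π/4 and θ + 3π/4 follow from a sum and a difference of those outputs, up to a constant factor √2. The function names and the ordering of the four returned values are assumptions of this sketch, not the paper's labelling.

```python
import math

def rho(x, y, theta):
    """Normal-form line parameter for a point (x, y) at angle theta."""
    return x * math.cos(theta) + y * math.sin(theta)

def four_subspace_rhos(x, y, theta):
    """Return rho at theta, theta + pi/4, theta + pi/2 and theta + 3*pi/4.

    Only one rotation is evaluated; the two 45-degree-shifted values come
    from one addition and one subtraction of the rotation outputs.
    """
    c, s = math.cos(theta), math.sin(theta)
    x_rot = x * c + y * s          # rho(theta)
    y_rot = -x * s + y * c         # rho(theta + pi/2)
    sum_out = x_rot + y_rot        # sqrt(2) * rho(theta + pi/4)
    diff_out = y_rot - x_rot       # sqrt(2) * rho(theta + 3*pi/4)
    root2 = math.sqrt(2.0)
    return x_rot, sum_out / root2, y_rot, diff_out / root2

x, y, theta = 3.0, 4.0, 0.3
fast = four_subspace_rhos(x, y, theta)
direct = tuple(rho(x, y, theta + k * math.pi / 4) for k in range(4))
print(fast)
print(direct)
```

In hardware the division by √2 can be avoided by voting with the unscaled sum and difference, since a constant factor only rescales the ρ axis of the corresponding accumulators.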
Array architecture for straight line HT:
The array architecture for straight line HT has been constructed by suitable mapping of equations (10), (12), (14) and (15). N such PEs ($H_S$) are cascaded to realize the transform. The distributed accumulator arrangement with each PE ensures conflict free voting operation. Data transfer between adjacent PEs is done asynchronously. This suppresses data skewing and makes the computation data driven; however, a suitable handshaking protocol has to be adopted. Since the PEs are pipelined, in the steady state, parallel HT computation at different θ (= j$\theta_0$, j ∈ {1, 2, …, N}) can be done for N feature points. The peak detection can be carried out by checking the accumulator counts in parallel for all $H_S$.
The total architecture is shown in Figure 4. The whole operation is summarized in the following pseudocode,
3. Look for peaks in the accumulator array ∀ p.
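A sequential software model of the overall flow might look as follows. It is only a sketch: the hardware PEs operate concurrently and asynchronously, whereas this loop visits them one after another. The subspace offsets, the ρ quantization step and all names are assumptions introduced for the example.

```python
import math
from collections import Counter

THETA0 = 2.0 ** -4   # angular step in radians (value quoted in the text)
N_PE = 13            # number of cascaded PEs (value quoted in the text)

def vote(points, rho_step=1.0):
    """Accumulate votes keyed by (PE index j, subspace index k, rho bin)."""
    acc = Counter()
    root2 = math.sqrt(2.0)
    for x, y in points:
        for j in range(1, N_PE + 1):            # PE j handles theta = j * THETA0
            theta = j * THETA0
            c, s = math.cos(theta), math.sin(theta)
            x_rot = x * c + y * s               # rho at theta
            y_rot = -x * s + y * c              # rho at theta + pi/2
            rhos = (x_rot, (x_rot + y_rot) / root2,
                    y_rot, (y_rot - x_rot) / root2)
            for k, r in enumerate(rhos):        # k-th subspace: offset k * pi/4
                acc[(j, k, round(r / rho_step))] += 1
    return acc

def peak(acc):
    """Return the accumulator key with the highest vote count and that count."""
    return max(acc.items(), key=lambda item: item[1])

# Synthetic test line: normal angle 3 * THETA0 + pi/2, normal distance 20.
theta_true = 3 * THETA0 + math.pi / 2
rho_true = 20.0
line_points = [
    (rho_true * math.cos(theta_true) - t * math.sin(theta_true),
     rho_true * math.sin(theta_true) + t * math.cos(theta_true))
    for t in range(-30, 31)
]
peak_key, peak_count = peak(vote(line_points))
print(peak_key, peak_count)
```

Because each (j, k) pair owns its own set of ρ bins, no two angle computations ever write to the same counter, which mirrors the conflict free voting provided by the distributed accumulators.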
4. Performance of the architecture:
To evaluate the performance of the proposed architecture and to compare it with other proposed methods, we assume that the θ space is quantized in steps of $\theta_0$, where N$\theta_0$ = π/4 ± δ, that n is the number of edge pixels to be processed, and that m is the number of accumulators per subspace for the full set of ρ for each $\theta_0$.
Computational complexity:
The total number of operations required for ρ computation using the conventional method is 2nπ/$\theta_0$ trigonometric multiplications + nπ/$\theta_0$ additions, whereas the proposed method requires a total of 6nπ/4$\theta_0$ (= 1.5nπ/$\theta_0$) additions, which is much less than the conventional method because the θ scan range is restricted to [0, π/4 ± δ]. The total accumulator cell requirement in the proposed method is equal to mπ/$\theta_0$, which is the same as in the conventional one. The results are shown in Table 3. All the referenced architectures except that of reference (3) require a larger θ scan range than the proposed architecture, implying a higher computational requirement. Though the effective scan range of the architecture in reference (3) is approximately the same as that of our architecture, the total time requirement of the proposed one is less, as is evident from Table 3. Thus, the proposed architecture is superior to the others in speed and computational requirement. Quantitative measurements in Table 3 are done by considering $\theta_0 = 2^{-4} = 0.0625$ radians = 3.579545°, N = 13, δ = 1.534085° and $T_a$ = 7.1 nsec (in 1.6 µm sea of gates technology). Under these considerations, generating a full set of ρ values for one feature point takes 295.36 nsec, which is considerably low.
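The operation counts and the quoted numbers can be cross-checked with a few lines of arithmetic. The function names are hypothetical. The observation that 295.36 nsec equals N times the 22.72 nsec worst case CORDIC delay from Section 2.2 is an inference from the figures, not something the text states explicitly.

```python
import math

def conventional_ops(n, theta0):
    """Operation counts for direct evaluation over the scan range [0, pi)."""
    steps = math.pi / theta0
    return {"multiplications": 2 * n * steps, "additions": n * steps}

def proposed_ops(n, theta0):
    """Addition count for the four-fold angle-parallel scheme (no multiplications)."""
    return {"additions": 6 * n * math.pi / (4 * theta0)}

n_pixels = 1000                       # hypothetical number of edge pixels
theta0_rad = 2.0 ** -4
conv = conventional_ops(n_pixels, theta0_rad)
prop = proposed_ops(n_pixels, theta0_rad)
add_ratio = prop["additions"] / conv["additions"]

theta0_deg = 3.579545                 # degree value quoted in the text
N = 13
delta_deg = N * theta0_deg - 45.0     # overshoot beyond pi/4
cordic_delay_ns = 22.72               # worst case delay from Section 2.2
full_rho_set_ns = N * cordic_delay_ns # inferred: one pass through N cascaded PEs

print(f"additions ratio (proposed / conventional) = {add_ratio:.2f}")
print(f"delta = {delta_deg:.6f} deg, full rho set = {full_rho_set_ns:.2f} ns")
```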
Since this architecture uses CORDIC, unlike multiplier based designs, precomputation of the 'cos' and 'sin' values is not required, which in turn eliminates the need for RAM. This makes the architecture more time effective than the multiplier based designs, since in the latter case the RAM access time becomes a determining constraint for ρ computation, as is evident in reference (4).
In the proposed architecture, the CORDIC units require only adder-subtractors, and the architecture can simultaneously compute ρ for N angles in the θ scan range of [0, π/4 ± δ].
Being composed of the scaling free CORDIC (discussed in Section 2), the architecture is more hardware efficient compared to the other CORDIC based implementations and does not require the extra conversion unit like the architecture of reference (10) .
The distributed accumulator cell arrangement with each PE ensures conflict free voting operation. This facilitates a parallel approach for peak detection by simultaneously checking the count of the accumulators for all θ 0 , i. e. for all PE.
The proposed one is modular and shows better regularity than other architectures which is suitable for VLSI implementation. Being asynchronous and pipelined, it is advantageous from low power and fault tolerant application point of view. Since the computation is data driven, the PE synchronization problem (typical to the systolic arrays when the array size becomes large) does not occur. This, in turn, suppresses the data skewing and subsequent glitches which leads to power saving.
In light of the above results and discussion, it can be conjectured that this architecture can be considered as a potential candidate for low power high performance real time straight line HT using VLSI.
5. Circular and elliptic HT:
One common method applied for extraction of elliptic pattern from a given image data is the tristage (12) approach. In such an approach, the computation is carried out in three hierarchical stages namely, detection of the center, detection of orientation and the major and minor radii estimation. This method can be applied for detecting circular pattern as well where instead of three hierarchical stages only two hierarchical stages are required viz., the estimation of the center and the radius of the circle. In both the cases, the pattern detection procedure is computation intensive and one may require parallel processing array architectures corresponding to the different stages of the hierarchy where each array architecture can be considered as a subunit of the whole system.
Though all the stages of the hierarchical approach for detecting circles and ellipses are computation intensive, the maximum computation is involved in the final stage of the hierarchy, i.e., in estimating the radius of the circle and the major and minor radii of the ellipse. These stages demand diverse mathematical operations such as squaring, division, addition, square root evaluation and multiplication. From this point of view, in this section, we concentrate only on developing parallel processing array architectures corresponding to this stage of the hierarchy (which can be considered as a subunit of the entire system for circular or elliptic Hough transform, respectively). Our principal aim is to reduce the computational requirements for detecting the radius of the circle and the semi-major and semi-minor radii of the ellipse using their parametric representation.
Subsequently, CORDIC based array architectures are proposed for them. The analyses made here are based on two assumptions:
• The origin of the curves is already known.
• The orientation angle of the ellipse is known.
Circular HT:
The equation of a circle can be stated as,

$$x^2 + y^2 = r^2 \qquad (16)$$

where (x, y) is a point lying on the circle and 'r' is the radius. In parametric form the length of the radius is given by,

$$r = x\cos\theta + y\sin\theta \qquad (17)$$

where θ is the angle made by the radius vector with the positive x-axis, as shown in Figure 5. Equation (17) is exactly similar to equation (5).
where the suffix of r denotes its value in the appropriate subspace, and $r_b'$ and $r_d'$ are considered as modified parameters in the respective subspaces. It can be observed that only (18) and (19) need to be computed, which can readily be done using CORDIC.
Equations (20) and (21) can be derived from (18) and (19). The basic PE and the architecture for the circular HT are shown in Figure 6 (a) and (b) respectively.
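The similarity between the radius relation and the straight line parameterization can be made explicit with a short derivation. It assumes a circle centered at the origin with the point written as x = r cos θ, y = r sin θ. The shorthand r(θ) for the rotation output at angle θ is introduced only for this sketch, and the labelling of the subspaces is not taken from the paper.

```latex
% Radius from the parametric form (circle centered at the origin, assumed):
\begin{align*}
x\cos\theta + y\sin\theta
  &= r\cos^2\theta + r\sin^2\theta = r .
\end{align*}
% Writing r(\theta) = x\cos\theta + y\sin\theta for the rotation output,
% the 45-degree-shifted values follow from the two outputs of one rotation:
\begin{align*}
\sqrt{2}\, r\!\left(\theta + \tfrac{\pi}{4}\right)
  &= r(\theta) + r\!\left(\theta + \tfrac{\pi}{2}\right), \\
\sqrt{2}\, r\!\left(\theta + \tfrac{3\pi}{4}\right)
  &= r\!\left(\theta + \tfrac{\pi}{2}\right) - r(\theta).
\end{align*}
```

Under these assumptions the scaled left-hand sides play the role of modified parameters: they are obtained with one addition and one subtraction, without evaluating any further rotation.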
Elliptic HT:
The parametric equation of a point (x, y) lying on an ellipse with semi-major and semi-minor radii 'a' and 'b' respectively, is given by
where θ is the angle made by the radius vector (from the origin to the point (x, y)) with the positive x-axis. Now, defining $a' = 1/a$ and $b' = 1/b$, equations (26) and (27) can be written as
The other four addresses can be computed by changing the sign of the addresses given by equations (30) and (33). Finally, the votes of the same indexed accumulator cells for different PEs determine the shape of the ellipse, and the conversion from $a'$, $b'$ to a, b can be carried out using a look-up table. However, the nature of equations (32) and (33) suggests that each PE requires two CORDIC units operating in parallel. Each PE also requires eight 2-D accumulator arrays, each dedicated to a particular subspace. The basic PE, designated as $H_e$, and the architecture are shown in Figure 7.
Discussions on elliptic and circular HT architecture:
Compared to the conventional method, the proposed formulations require fewer arithmetic operations to detect the radius of the circle and the semi-major and semi-minor radii of the ellipse. In evaluating these parameters, the conventional method requires multiplication, squaring, subtraction, division and square root evaluation (12). The basic CORDIC unit has been designed using TGL in a 1.6 µm sea of gates semicustom environment and exhibits 62 mW power consumption at 5 V supply and 44 MHz operation frequency. With device scaling, this CORDIC unit is expected to operate at a lower supply voltage, which implies that a quadratic advantage in power consumption can be achieved.
Considering all these points, it can be conjectured that the proposed architectures can be considered as good candidates for low power high performance real time HT computation. 
