The pipelined CORDIC with linear approximation to rotation has been proposed to achieve reductions in delay, power and area; however, the schemes for rotation (multiplication) 
Introduction
The CORDIC algorithm is an arithmetic method to perform 2D vector rotations. The rotations are performed as a sequence of elementary rotations with a decreasing angle in a convergent linear process. In fact, to use only adders and shifters, the elementary rotations are implemented as similarities. Therefore, the vectors are scaled by a constant during the rotation process. The algorithm has two operating modes: rotation and vectoring.
Current applications include digital signal processing, 3D graphics, reconfigurable computing, speech and music synthesizers, and communication devices (OFDM, CDMA, etc.). CORDIC modules are offered by core vendors, especially for FPGAs. It is also being used in an FPGA-based supercomputer [10].
To achieve high performance, the algorithm is unfolded and pipelined. For small angles, a linear approximation to the rotation can be used, requiring multiplications and additions. This approach has a very significant effect on the latency of the conventional pipelined CORDIC, since about half of the stages (serially organized), each with a delay of about one carry-propagate adder, are replaced by a fast tree-like structure of carry-free counters with only one final carry-propagate addition. The drawback is that this method can only be applied to the rotation mode. The linear approximation in the vectoring mode leads to a division operation. Thus, a fully operational pipelined CORDIC (rotation and vectoring modes) cannot be efficiently implemented using this approach.

* E. Antelo has been partially supported by Xunta de Galicia under project PGIDT03TIC10502PR, and J. Villalba has been supported by the Ministry of Education & Science of Spain under project TIC2003-006623.
In this work we extend the final multiplication approach to the vectoring mode. Our approach is based on the concurrent computation of a reciprocal with the first half of the CORDIC stages. We also present an architecture that implements both modes of operation with final multiplication and with a concurrent compensation for the scale factor, so that further reductions of latency are obtained with respect to the conventional pipelined CORDIC.
CORDIC algorithm
In this section we present a brief description of the CORDIC algorithm. For more details and references see [5]. The algorithm consists of the following steps:
1.-Initialization:
For the rotation mode, $(x_f, y_f)$ are the new coordinates of the rotated vector; for the vectoring mode, $x_f$ gives the modulus of the vector and $z_f$ is the rotated angle.
It can be shown that using the sign of $z[j]$ (rotation mode) or $y[j]$ (vectoring mode) to obtain the direction of each elementary rotation ($\sigma_j$), the following conditions are verified:
which results in
where $x[j]$ is bounded by the initial modulus of the input vector ($M$) scaled by a factor $K_j$, that is
In this work we concentrate on an unfolded (parallel) pipelined implementation of the algorithm with carry-propagate adders. It consists of n hardware stages implementing Equations (1) and a final constant multiplication by $K_n$. Since each iteration of the algorithm is implemented in a separate hardware stage, the shifters are actually hardwired. Registers are introduced to achieve the desired clock cycle, resulting in a pipelined system.
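As a point of reference, the steps and the two operating modes can be sketched in floating-point Python. This is a minimal behavioural sketch, not the pipelined hardware: it assumes the standard radix-2 recurrences with iterations j = 0, ..., n−1, elementary angles $\tan^{-1}(2^{-j})$, and the sign conventions stated in the comments. The function name `cordic` and its arguments are illustrative and do not come from the paper.

```python
import math

def cordic(x, y, z, n, mode):
    """Run n radix-2 CORDIC iterations (j = 0 .. n-1), then compensate the scale.

    mode = "rotation":  rotate (x, y) counterclockwise by the angle z.
    mode = "vectoring": rotate (x, y) onto the x axis, accumulating the angle in z.
    Assumes the standard textbook recurrences; x must be positive in vectoring.
    """
    k = 1.0  # running scale factor K_n = prod 1/sqrt(1 + 2^(-2j))
    for j in range(n):
        if mode == "rotation":
            sigma = 1 if z >= 0 else -1   # drive z toward zero
        else:
            sigma = -1 if y >= 0 else 1   # drive y toward zero
        # Shift-and-add elementary rotation (a similarity, so the vector grows).
        x, y = x - sigma * y * 2.0**-j, y + sigma * x * 2.0**-j
        z -= sigma * math.atan(2.0**-j)
        k /= math.sqrt(1.0 + 4.0**-j)
    # Final constant multiplication by K_n.
    return k * x, k * y, z
```

For example, `cordic(1.0, 0.0, 0.6, 32, "rotation")` should return coordinates close to (cos 0.6, sin 0.6), and `cordic(3.0, 4.0, 0.0, 32, "vectoring")` should return a modulus close to 5 and an angle close to atan2(4, 3).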
The accuracy of the algorithm is determined by many parameters [7] [5]. To have a simpler presentation we assume the following: the approximation to the rotation angle is of the order of the last elementary rotation angle, which is $\tan^{-1}(2^{-(n-1)})$ for n iterations; the input operands have n − 1 fractional bits. The width of the data-path is roughly b = 3 + (n − 1) + m bits ($m = \log_2(n)$), including guard and overflow bits; the scale factor is rounded to n − 1 + m fractional bits. All the approximations that follow in this work have an error less than $2^{-(n-1)}$ and the resultant accuracy should be of $O(2^{-(n-1)})$. From iteration j = n/3 + 1 = t, the elementary rotation angles can be approximated to within n bits of precision by $\sigma_j \tan^{-1}(2^{-j}) \approx \sigma_j 2^{-j}$. Therefore for j ≥ t, after a recoding, it is not necessary to implement the z recurrence for both rotation and vectoring.
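The claim about iteration t can be checked numerically. Since $\tan^{-1}(u) = u - u^3/3 + \ldots$, replacing $\tan^{-1}(2^{-j})$ by $2^{-j}$ costs about $2^{-3j}/3$, which drops below $2^{-n}$ once 3j > n. The sketch below is only a floating-point check under the assumption that n/3 is rounded down; the function names are illustrative.

```python
import math

def recoding_start(n):
    """First iteration t = n/3 + 1 (integer division assumed)."""
    return n // 3 + 1

def max_linearization_error(n):
    """Largest value of 2^-j - atan(2^-j) over iterations j = t .. n-1."""
    t = recoding_start(n)
    return max(2.0**-j - math.atan(2.0**-j) for j in range(t, n))
```

For the precisions tried in the accompanying checks (n = 16, 24, 32, 48), the error stays below $2^{-n}$.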
Reducing latency through the linear approximation to rotation
Termination schemes perform linear approximations for the rotation when the remaining angle is small enough [1] [12] [13]. These linear approximations lead to a final multiplication (rotation) or division (vectoring) to complete the rotation. To have this approximation correct to within n bits of precision for both rotation and vectoring, it is necessary for the remaining rotation angle $\alpha$ to be bounded by $\alpha < 2^{-n/2}$. Therefore, it is possible to perform this simplified rotation after iteration j = n/2 [1] [12] [13]. The implications of this approximation are the following:
• Rotation: after CORDIC iteration j = n/2 the x and y coordinates (
Since $|z[q]| < 2^{-n/2}$, the multiplication by z is of about n/2 + m bits and can be performed with a tree of counters with logarithmic delay. The multiplication by $K_q$ is performed after the linear approximation to rotation. This scheme is illustrated in Figure 1(a).
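The rotation-mode termination can be sketched in floating point as follows. This is a behavioural sketch under stated assumptions, not the paper's datapath: it assumes iterations j = 0, ..., n/2 (so that the residual angle is below $2^{-n/2}$), the standard recurrences, and the first-order approximations cos z ≈ 1 and sin z ≈ z for the final step. The function name is illustrative.

```python
import math

def rotate_with_final_multiplication(x, y, z, n):
    """Rotate (x, y) by z using about n/2 CORDIC iterations and a final multiply."""
    q = n // 2 + 1          # iterations j = 0 .. n/2 (indexing is an assumption)
    k = 1.0
    for j in range(q):
        sigma = 1 if z >= 0 else -1
        x, y = x - sigma * y * 2.0**-j, y + sigma * x * 2.0**-j
        z -= sigma * math.atan(2.0**-j)
        k /= math.sqrt(1.0 + 4.0**-j)
    # Residual |z| < 2^(-n/2): finish with a linear approximation to the rotation.
    xr = x - z * y
    yr = y + z * x
    # Scale factor compensation after the linear step (conventional ordering).
    return k * xr, k * yr
```

The neglected term is of order $z^2/2$, that is, about $2^{-(n+1)}$ relative to the modulus, which is why roughly half of the iterations can be replaced by one short multiplication.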
• Vectoring: in this mode the modulus and the angle between the vector and the x axis are computed. The modulus and the angle are obtained by solving the following equations xf
, the computations required after the CORDIC iterations are a division of about n/2 bits to obtain the angle, and the scale factor compensation to obtain the modulus. Both processes can be performed concurrently. This scheme is illustrated in Figure 1(b).
Since division is a sequential process, the linear approximation leads to different latency schemes for vectoring and rotation. This fact leads to inefficient unified implementations of rotation and vectoring in a pipelined CORDIC processor with a latency determined by the division process. In addition, the scale factor compensation in the rotation mode is performed after the rotation, while in vectoring the compensation of the scale factor to obtain the modulus can be performed concurrently with the linear approximation to the rotation.
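For comparison with the rotation case, the vectoring-mode termination with a final division can be sketched as below. The sketch assumes the standard recurrences, iterations j = 0, ..., n/2, a positive input x, and the first-order approximations $\tan^{-1}(u) \approx u$ for the remaining angle and $x[q] \approx$ scaled modulus. The function name is illustrative.

```python
import math

def vector_with_final_division(x, y, n):
    """Return (modulus, angle) of (x, y), x > 0, using about n/2 iterations."""
    q = n // 2 + 1          # iterations j = 0 .. n/2 (indexing is an assumption)
    z, k = 0.0, 1.0
    for j in range(q):
        sigma = -1 if y >= 0 else 1      # rotate toward the x axis
        x, y = x - sigma * y * 2.0**-j, y + sigma * x * 2.0**-j
        z -= sigma * math.atan(2.0**-j)
        k /= math.sqrt(1.0 + 4.0**-j)
    modulus = k * x          # scale factor compensation
    angle = z + y / x        # division of about n/2 bits finishes the angle
    return modulus, angle
```

The two final operations are independent of each other, which matches the observation that they can be performed concurrently; the division is the part that sets the latency.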
In [2] a unified implementation with a linear approximation was proposed, using a radix-4 prescaled division algorithm to complete the vectoring operation. To have a unified implementation, the rotation is completed with an iterative radix-4 multiplication. As indicated, the division algorithm limits the use of fast parallel tree multipliers. We compare with this scheme in Section 6.
Multiplicative scheme for vectoring
In this section we show how to efficiently combine the architecture for vectoring and rotation when a linear approximation to the rotation is used. As shown in the previous section, the linear approximation to rotation leads to a multiplication scheme in the rotation mode and to a division scheme in the vectoring mode. We now show how to also use a multiplication scheme for vectoring.
For vectoring we need to compute zf
) which requires a division and add operation. This is transformed into a multiplication operation by first computing R = 1/x[q]. There seems to be no apparent advantage to doing this. However, two observations need to be taken into account: i) We know the bound
ii) The n/2 leading bits of R can be computed from x[j] with j < q. This is due to the fact that roughly two bits of the modulus M (stored as the x coordinate) are determined in each iteration. More specifically, from (2) we know that, at iteration j, the angle between the vector and the x axis is bounded by $\tan^{-1}(2^{-(j-1)})$. In addition, the modulus of the vector at this iteration is bounded by
Therefore, the following bound results:
Then from (3) and (4), a bound for
Thus, the maximum difference between x[j] and the scaled modulus $M/K_q$ is
This bound decreases at a rate of $2^{-2j}$, which means that approximately two bits of the modulus are determined in each iteration.
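The two-bits-per-iteration behaviour can be observed numerically. The sketch below runs the vectoring recurrences in floating point and records, at each iteration j, the relative gap between x[j] and the final scaled modulus. It assumes the standard recurrences and uses $M/K_n$ (all n iterations) as the reference value, which is a simplification of the bound with respect to $M/K_q$; the constant 4 used in the check is a loose illustrative bound, not the paper's.

```python
import math

def modulus_gaps(x, y, n):
    """Relative gap (M/K_n - x[j]) / (M/K_n) for j = 0 .. n-1, vectoring mode."""
    k_n = 1.0
    for j in range(n):
        k_n /= math.sqrt(1.0 + 4.0**-j)
    target = math.hypot(x, y) / k_n      # scaled modulus after n iterations
    gaps = []
    for j in range(n):
        gaps.append((target - x) / target)
        sigma = -1 if y >= 0 else 1
        x, y = x - sigma * y * 2.0**-j, y + sigma * x * 2.0**-j
    return gaps
```

Under these assumptions each entry `gaps[j]` should stay below about $4 \cdot 2^{-2j}$, so the gap shrinks by a factor of roughly four per iteration.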
We now fully develop both observations to determine the index j from which we can obtain an approximation $\hat{R}$ of $R = 1/x[q]$. The precision required for $\hat{R}$ is such that the error of the multiplication of $\hat{R}$ and y[q] is less than
We assume that we obtain an approximation of R from x[j] with j < n/2, which represents the f + 1 leading bits of
We proceed by following three steps: 1) obtain a bound of
2) determine the error produced by the reciprocal computation of x[j] using a specific method of computation; and finally 3) combine both items to obtain a bound on j.
Bound of x[q] − x[j]
We use the identity
From (5) and (6) we have:
To simplify the derivation of the bound of
we multiply by a factor to normalize to the range [1, 2). From (5) we obtain
Thus, the normalizing factor is $2K_q/M$ and then the normalized value has at most one integer bit. We scale the difference 
Reciprocal computation
To simplify the presentation we assume a direct table method to compute the reciprocal. The extension to other methods (say bipartite tables, linear or quadratic interpolation, or high-order methods) is straightforward.
From [3] we know the following result: $1a_1 a_2 \ldots a_{p+g}$ of the reciprocal of $d$ that satisfies 
De-normalizing this result, we obtain the bound in the error of estimation of the reciprocal of x[q]
Since the error of estimation of d is bounded by $2^{-p}$, the resultant allowed estimation error for x[q], considering the de-normalizing factor, results in
Bound on j
We now determine a bound for the index j so that we can obtain an estimation of R = 1/ −1) , we obtain the following condition for p,
To reduce table size we take p = n/2 − 1 and g = 0.
Therefore, from (10) the allowed error for the estimation
A bound for the value of x[q]− x[j] is given in (7)
. Therefore, from the bound of the allowed error given in (11), it is required that
Since f determines the number of bits of x[j] that input the reciprocal table, we select the minimum possible value of f = n/2. Then we obtain the condition on j:
which is verified for j ≥ n/4 + 1.
In summary, the computation of y[q]/x[q] is performed as follows (see Figure 2):
• Obtain a reciprocal R from a table addressed by f = n/2 bits of x[j] with j ≥ n/4 + 1 (it is not necessary to input the leading one into the table). The result is of about n/2 bits. The index j is selected so that enough time is provided for the delay of the table. The bound of about n/4 CORDIC iterations seems long enough so that the reciprocal computation is not in the critical path.
• After iteration j = n/2, perform an (n/2 + m) × (n/2 + m) multiplication of R × y[q]. Since only the leading n/2 bits of the multiplication are necessary, we use a truncated multiplier.
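The two steps above can be imitated in floating point to see that an early reciprocal is accurate enough. The sketch below is a behavioural model under several assumptions that are not taken from the paper: iterations j = 0, ..., n/2; the reciprocal is taken from x[j] at j = n/4 + 1; the table is modelled by truncating the normalized x[j] to f = n/2 fractional bits, using the interval midpoint, and rounding the reciprocal to f bits; and a full-precision multiply stands in for the truncated multiplier. The function name is illustrative.

```python
import math

def angle_with_early_reciprocal(x, y, n):
    """Vectoring angle of (x, y), x > 0, with R taken from x[j] at j = n/4 + 1."""
    q = n // 2 + 1            # iterations j = 0 .. n/2 (indexing is an assumption)
    j_tab = n // 4 + 1        # iteration whose x value feeds the reciprocal table
    f = n // 2                # fractional bits addressing the table
    z, r = 0.0, None
    for j in range(q):
        if j == j_tab:
            mant, exp = math.frexp(x)         # x = mant * 2**exp, 0.5 <= mant < 1
            scale = 2.0 ** (exp - 1)          # x = d * scale with d in [1, 2)
            d = math.floor(2.0 * mant * 2**f) / 2**f   # keep f fractional bits
            d += 2.0 ** -(f + 1)              # midpoint of the table interval
            r = round(2**f / d) / 2**f / scale  # f-bit reciprocal, de-normalized
        sigma = -1 if y >= 0 else 1
        x, y = x - sigma * y * 2.0**-j, y + sigma * x * 2.0**-j
        z -= sigma * math.atan(2.0**-j)
    # Final multiplication replaces the division y[q] / x[q].
    return z + r * y
```

With n = 32, the result should agree with `math.atan2(y, x)` to roughly n bits for inputs with x > 0. The reciprocal is ready many iterations before y[q] is, which is the point of the scheme: the table lookup overlaps with the remaining CORDIC stages.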
Figure 2. Proposed scheme for linear approximation to rotation (vectoring mode).
If n is large the table for reciprocal computation could be large. In this case alternative methods for reciprocal computation can be used. Depending on the precision required, we find the following among several alternatives [5] : bipartite tables, linear or quadratic interpolation, high-order polynomial methods and digit-by-digit methods.
Concurrent scale factor compensation
Constant scale factor compensation has been performed in the following ways by different authors [5]: pre- or post-multiplying the x and y coordinates by the constant scale factor, decomposing the scale factor into a product of shift-and-add terms and performing the compensation using iterations similar to the elementary rotations, concurrent compensation [14], etc.
For a unified rotation/vectoring pipelined architecture, the most suitable method might be pre- or post-constant multiplication. Using this scheme the compensation adds latency overhead, since in the rotation mode the compensation has to be performed after or before the elementary rotations but not concurrently (see Figure 1(a)).
We now show how to perform the compensation concurrently with the linear approximation to the rotation, to reduce the latency overhead of the compensation.
After the n/2 elementary rotations we perform the following computation
are constant multipliers with a multiplicand of b bits (result truncated to b bits). Since $|z[q]| < 2^{-n/2}$, it has n/2 + m significant bits (including the sign). Moreover $K_q < 1$ and therefore $P < 2^{-n/2}$. Then the multiplications $P\,y[q]$ and $P\,x[q]$ are of size (n/2 + m) × (n/2 + m) (truncated to n − 1 + m fractional bits).
Note that P should be computed before x[q] and y[q] for the scheme to be effective. In Section 2 we mentioned that, to reduce the complexity of the z iteration, from iteration j = n/3 + 1 = t the $\sigma_j$ values are obtained from a direct recoding of z[t]. Therefore, we obtain z[q] from z[t] by performing a reverse recoding (from digit set {−1, 1} to {0, 1} two's complement) of the digits with weights lower than $2^{-n/2}$. Figure 3(a) shows the implementation of the proposed method. This method introduces a latency overhead of about one constant multiplication for both the linear approximation to rotation and scale factor compensation, while the conventional scheme requires two multiplications in series: an (n/2 + m) × b multiplication and a constant multiplication. The effect on hardware complexity is analyzed in Section 6.
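The idea can be illustrated with a small floating-point sketch. It assumes that the concurrent computation has the form $x_f = K_q x[q] - P\,y[q]$ and $y_f = K_q y[q] + P\,x[q]$ with $P = K_q z[q]$, which is our reading of the surrounding text and of the sign convention of the standard recurrences, not a formula quoted from the paper. The function name and the indexing (iterations j = 0, ..., n/2) are likewise assumptions.

```python
import math

def rotate_concurrent_compensation(x, y, z, n):
    """Rotation with linear termination and scale compensation done side by side."""
    q = n // 2 + 1          # iterations j = 0 .. n/2 (indexing is an assumption)
    k = 1.0
    for j in range(q):
        sigma = 1 if z >= 0 else -1
        x, y = x - sigma * y * 2.0**-j, y + sigma * x * 2.0**-j
        z -= sigma * math.atan(2.0**-j)
        k /= math.sqrt(1.0 + 4.0**-j)
    # P depends only on z[q] and the constant K_q; in hardware z[q] is known
    # early (from the recoding of z[t]), so P can be ready before x[q], y[q].
    p = k * z
    # The constant multiplications K*x, K*y and the short products P*y, P*x
    # are independent of each other and can proceed in parallel.
    return k * x - p * y, k * y + p * x
```

Algebraically this equals $K_q(x[q] - z[q]\,y[q])$ and $K_q(y[q] + z[q]\,x[q])$, so the result is the same as compensating after the linear step, but no multiplication has to wait for another one.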
Combined architecture for rotation and vectoring
Figure 3(b) shows the unified architecture for vectoring and rotation incorporating the linear approximation to rotation for both operation modes and the concurrent compensation of the scale factor. The combination of both architectures is simple since the additional hardware required is only one multiplexer, one adder, and one row of AND gates.
Evaluation
In this section we compare our design and existing proposals in terms of hardware complexity and delay. There have been many proposals regarding variations on the CORDIC algorithm. However, we are only concerned about those schemes proposed for a CORDIC unit implementing both rotation and vectoring. Moreover, many techniques proposed to reduce CORDIC latency, such as redundant CORDIC and very-high radix CORDIC, are orthogonal to our proposal. Therefore, it is appropriate to only compare it to the conventional radix-2 CORDIC and to a combined radix-2/radix-4 CORDIC implementation [2] .
The radix-2/radix-4 proposal performs about the first half of the iterations in radix-2, and the rest of iterations in radix-4. The radix-4 iterations correspond to a multiplication operation for rotation and a division operation for vectoring with prescaling of operands in parallel to the scale factor compensation.
For the delay calculations we use a rough timing model based on logical effort [11] normalized to FO4 units (FO4 refers to the delay of a 1x inverter with a load of four 1x inverters). The effect of the interconnections in the delay was not considered.
Regarding hardware complexity, we determined the number of equivalent two-input NAND gates for each design. The relative complexity of each gate with respect to a two-input NAND gate was determined in terms of the size of the total active area of the transistors of the gate. This simple area-delay model provides area and delay ratios that should indicate the potential advantage of our proposal when actual implementations are considered. Table 1 shows the data of the model corresponding to the simple gates and basic hardware elements used in the evaluation.
For the comparison we considered additions implemented with fast parallel prefix adders. For the table lookup we assumed an implementation using a multiplexer tree (of 4-to-1 multiplexers) for each output bit. This represents a fast but costly implementation for a look-up table.
Assuming that the complexity of a look-up table varies according to an area × time$^2$ law, a slower but more area-efficient implementation results from doubling the delay, which reduces the hardware complexity by a factor of four.
The computation of R corresponds to the computation of a reciprocal with about n/2 bits of precision. We considered three types of implementations: linear approximation [4], quadratic approximation [9] [8], and a very-high radix digit-by-digit reciprocal [6]. Figure 4 shows the estimated delay and area for the three methods for the range 16 ≤ n ≤ 116. In addition, since the reciprocal should be computed in parallel to about n/4 CORDIC iterations, we show in Figure 4(a) the bound of delay corresponding to n/4 CORDIC iterations. In our design we used the method with minimum area and a delay less than the delay of n/4 CORDIC iterations. Figure 4(b) shows three regions A, B and C, each one corresponding to the method used for the corresponding range of n. For lower precisions (regions A and B) the linear or quadratic approximations are used. However, for higher precisions the very-high radix approach is more convenient due to better scalability.

In the delay and area estimations we only considered the combinational elements, since the contributions due to registers depend on the cycle time requirements. We performed the comparisons for a range of 16 ≤ n ≤ 116. Figure 5 shows the delay and hardware complexity for the compared designs. Figure 6 shows the corresponding delay and hardware complexity ratios taking as a reference the proposed design.
The conventional CORDIC requires 1.3 to 1.5 times the area and 1.7 to 2.0 times the delay of our proposal. The radix-2/radix-4 CORDIC requires about 1.2 times the area and 1.5 times the delay.
We conclude that our approach might lead to more efficient CORDIC modules when rotation and vectoring are implemented in the same unit. The reduction in delay can be "converted" into a reduction in dynamic power consumption through voltage scaling (increasing the delay). Specifically, if s is the speedup factor between two designs operating at the same voltage, then scaling the voltage of the faster design until both designs have the same delay reduces dynamic power consumption by a factor of roughly $[(0.3s + 0.7)/s]^2$, provided that both designs present similar activity factors and active capacitance. Since our design has less hardware complexity than the other designs we have compared it with, we can expect factors of dynamic power reduction of about 0.4-0.5 with respect to conventional CORDIC, and 0.6 with respect to radix-2/radix-4 CORDIC.
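The quoted power factors follow directly from the formula when the delay ratios reported above are used as the speedup s. The helper below simply evaluates the expression; the function name is illustrative, and the formula is taken as given, without modelling the area difference between designs.

```python
def dynamic_power_factor(s):
    """Rough dynamic power ratio after voltage scaling, for speedup factor s."""
    return ((0.3 * s + 0.7) / s) ** 2

# Delay ratios reported in the comparison, used here as speedup factors.
for label, s in [("conventional, s = 1.7", 1.7),
                 ("conventional, s = 2.0", 2.0),
                 ("radix-2/radix-4, s = 1.5", 1.5)]:
    print(f"{label}: {dynamic_power_factor(s):.2f}")
```

This gives about 0.51 and 0.42 for s = 1.7 and s = 2.0, and about 0.59 for s = 1.5, consistent with the 0.4-0.5 and 0.6 figures stated above.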
