Abstract
Introduction
The CORDIC algorithm in circular coordinates permits the calculation of rotations, as well as the trigonometric functions sin, cos, and arctan(b/a), and the modulus of a vector. A large body of work has been reported on variations of the algorithm, on implementations, and on applications. We refer the reader to [15] [19] for an overview of the algorithm, of the previous work, and for additional references. The original algorithm is radix 2 with non-redundant adders. This has been extended to the use of redundant adders and to radix 4 [4] [18] . We consider here the extension to a much higher radix, such as radix 512.
As the radix increases, the number of iterations for a given precision is reduced accordingly, resulting in a potentially faster execution. However, two problems appear: E. Antelo and J.D. Bruguera were supported in part by the Ministry of Education and Science (CICYT) of Spain under contract TIC96-1125 the complexity of the selection function and the compensation of a variable scale factor. The former problem also appears for other digit-recurrence algorithms, such as division. An effective solution that has been applied there is to perform the selection by rounding [8] [11] . To allow this selection the recurrence has to satisfy certain conditions. The following two methods have been used for this:
1. Scaling the recurrence: done for division [11] , square root [13] , and sqrt(x/d) [2] .
2. Performing selection by table in the first iterations until the conditions are satisfied: used for exponential and CORDIC rotation [1] .
Very-high radix CORDIC for rotation mode has been considered in [5] [1]. Here we consider circular CORDIC in vectoring mode, which is suitable for computation of arctan(b/a) and the modulus of vector (a,b). In this case, the appropriate method for selection by rounding is scaling of the recurrence. In this sense the algorithm is similar to division. However, there is an additional difficulty due to the variation of the coordinate x in each iteration. This has the effect that several recurrence-scalings might be required. We show that two recurrence-scalings are sufficient.
The variable CORDIC scale factor, as well as the factors of the recurrence-scalings, require that the overall scaling factor be calculated and then used for the scale-factor compensation. To perform this, we calculate the logarithm of the scaling factor (by adding the logarithms of the component factors) and perform the exponential function for the compensation.
In [20] an algorithm for arctan(b/a) is presented which has some similar aspects to the one presented here. Apparently, the selection is done by truncation instead of rounding, so that, for convergence, an over-redundant digit set must be used, resulting in a more complex implementation. In addition, the final steps are done by the computation of a polynomial, requiring a full size multiplier. Moreover, the algorithm does not permit the calculation of the modulus of the vector as discussed in Section 7.
Since for the calculation of arctan(b/a) no scale-factor compensation is needed, we present this case in Sections 3 to 5. Then, in Sections 6 and 7, we extend the previous algorithm and implementation. We present pipelined and serial implementations, and have performed a rough estimation of the delay for 32-bit precision showing a substantial speedup with respect to radix-2 and radix-4 cases.
Due to space limitations we omit some details. An extended description can be found in [3] .
Very-high radix CORDIC vectoring
The algorithm is an extension of the radix-2 algorithm (we use the y recurrence scaled by r j as in [14] to facilitate the selection). That is, for radix r = 2 b , the algorithm is x j + 1 = x j + j r ,2j w j w j + 1 = rw j , j x j (1) z j + 1 = z j + tan ,1 j r ,j where w j = r j y j , j is selected from the set f,r , 1; :::; ,1; 0; 1; :::r , 1g so that the following bound for convergence is satisfied jw j j r j tan max j x j = A j x j where A j = r j tan max j , and
N is the index of the last microrotation, and U is the maximum angular error. To achieve n bits of precision, the number of iterations is N = dn=be. For a faster iteration, we utilize carry-save representation for x, w, and z.
Selection by rounding
The selection is performed by rounding the residual w j .
However, since the residual is in carry-save representation, as done in division, to reduce the iteration time the rounding is performed on an estimate. We consider the case in which this estimate is obtained from a truncation to t fractional bits of the carry-save representation. That is,
whereŵ = bw j c t is the carry-save representation of w j truncated to t fractional bits.
Conditions for selection by rounding
When j is selected by rounding as indicated above, following a similar reasoning as in [11] for division, the condition for convergence is given by 1 2 +2 ,t +r,1j1,x j j min1, 1 2r ; A j+1 x j+1 r ,1
In [3] it is shown that A j + 1 r , and since x j + 1 x j , condition (3) 
Pre-scaling
We use pre-scaling of x and w, so that x j is close to one and satisfies (4) . Since, in contrast to division, the value of x j changes because of the iteration, even if the initial x is pre-scaled into the required range, a subsequent x j might get out of the range, requiring additional pre-scalings. We now show that it is sufficient to have two pre-scalings, one before the first iteration and another before the second. Moreover, since the first pre-scaling is only useful for the first iteration and the delay of pre-scaling might depend on the radix of the iteration, we develop the algorithm al-z j + 1 = z j + tan ,1 j r ,J where r ,J = R ,1 r ,j+1 .
Interval for second pre-scaling and bound on R 
From this expression we obtain (b = log 2 r)
The tradeoffs for the selection of R are A small R leads to a simpler pre-scaling for j = 1 , which might result in a smaller pre-scaling overhead.
On the other hand, a small value of R results in a more complex pre-scaling in j = 2 . R and r should be selected to minimize the number of cycles with a reasonable hardware complexity.
Interval for the first pre-scaling 
Determination of the pre-scaling factors
The pre-scaling factors should assure conditions (7) and (8), that is a x coordinate within an interval close to one after each pre-scaling. Therefore, these factors correspond to an approximation of the reciprocal of x. Since the method used may be constrained by the need to compensate the prescaling factors to obtain the modulus, we consider the following two cases:
Algorithm used for angle calculation only. In this case, it is not necessary to compensate the modulus of the vector (neither due to the pre-scaling factors nor to the CORDIC scale factor). Therefore, the method used to compute the pre-scaling factors is not constrained and any of the existing methods to obtain an approximation of the reciprocal may be used. Details of the use of linear interpolation for very-high radix division can be found in [11] .
Algorithm used for modulus computation. In this case, it is necessary to compensate the pre-scaling factors, which introduces certain constrains in the method of calculation. This case is discussed in Section 7.1.
Range extension and -set for first iteration
In this section we show how to extend the range in the angle to 0; = 2 to conform to the case of radix 2. In relation to this, we determine the digit set for the first iteration.
Since The range covered depends on the value of D. To reach an angle of =2 a large value of D could be used, but this would lead to a complex implementation. To avoid this, we have considered the following method: (1) Detect whether the angle of the initial vector is larger than =4 (y in x in ); (2) if that is the case, exchange x and y and subtract the computed angle from =2 (this can be accomplished by making z 1 = =2).
For the detection, we need to compare x in and y in . To make a limited comparison we take advantage of the fact that the range of convergence in the angle is larger than =4, depending on the value of D. Consequently, to perform a limited comparison we allow that after the interchange
where F is the number of fractional bits compared. We determined D R + 3 2 + 2 ,F R
For instance, comparing F = log 2 R bits of x and y, results in D = R + 3 . Therefore, with values of D slightly larger than R it is possible to make a limited comparison.
Implementation for ArcTan(b/a)
We now consider the implementation of arctanb=a.
This uses directly the algorithm presented in previous sections, since no scaling-factor compensation is needed.
Architecture
CORDIC modules are implemented both in pipelined and sequential (word-serial) fashion. The choice between these two approaches depends on the throughput requirements and on the available area. The pipelined implementation consists on unfolding the algorithm and placing latches to define the pipeline stages. Figure 1 shows the unfolded algorithm. Details of these parts are given in Figure 2 . There is considerable flexibility on where to place these latches, the actual placement depending on the required throughput and the clock characteristics. To allow for these different characteristics, we do not decide on the placement of the latches.
On the other hand, the word-serial implementation reuses the same data-path for all parts. The diagram is shown in Figure 3 . The cycles are Cycle 1: Comparison and exchange. Calculation of M1 and pre-scaling. This can be done in one cycle because the radix R is small, so that the calculation of M1 is fast. 
Evaluation and comparison
We now make a rough estimation of the execution time and of the area and compare with other algorithms. To make this evaluation more specific, we consider the case of 32 bits precision and an implementation for r = 512 and R = 3 2 .
We use as unit of delay the delay of a full adder (t fa .
We compare with the following schemes: radix-2 with non-redundant (fast carry-propagate) addition, radix-2 with redundant addition [9] [4], and radix-4 with redundant addition [18] [12] .
Pipelined implementation
We first give a rough estimation of the critical path of the CORDIC iteration. We have not included the delay of the latches, since their number depends on the desired throughput. This path is composed of the following parts:
Selection by rounding and recoding: 1:5 t fa Rectangular MAC:
-buffer + mux : 1:5 t fa -adder tree (2x5+2 inputs): 4:0 t fa Consequently, the critical path is 7:0 t fa . Now we estimate the delay of the other parts as follows:
First pre-scaling: same as CORDIC iteration.
Second pre-scaling: two times CORDIC iteration.
Since there are four microrotations (one radix 32 and three radix 512) the total delay is 7:0 1 + 1 + 2 + 3 50 t fa . Table 1 shows the delays of the compared schemes, indicating the corresponding speedup obtained by the very-high radix implementation.
With respect to area, for redundant representation, the part of the CORDIC iterations has essentially the same area independent of the radix, except for the tables of the angles whose size is proportional to the radix. In addition, the very-high radix scheme requires the area devoted to the pre-scaling. On the other hand, the area devoted to the wired shifts (which is substantial for low-radix implementations) is significantly reduced in the very-high radix implementation. 
Word-serial implementation
We estimate the delay of one cycle as that of a CORDIC iteration and estimate that the rest of the cycles can be implemented in this cycle time. Note that to achieve this in the first cycle the table of M1 is duplicated. The delay is (1.5) recoding + (0.5) mux + (6) MAC + (1.0) reg = 9.0 t fa In this case the size of the MAC is determined by the second pre-scaling factor, so that the MAC has larger latency than that for the pipelined implementation. Since there are 7 cycles the execution time is 63 t fa .
We compare with the same schemes as for the pipelined implementation. The results are given in Table 2 .
In the word-serial implementation the very-high radix scheme occupies a larger area mainly because of the MACs, the tables for the angles, and the calculation of the prescaling factors. The tradeoff between speed up and area can be varied by choosing the radix.
Scale-factor compensation for the modulus
The algorithm described in the above sections introduces a scale factor in the modulus of the vector. This factor is the result of the pre-scaling factors, used to make selection by rounding, and the scale factor introduced by the CORDIC algorithm. Note that all of these factors depend on the input data and, therefore, are not constant.
In this section we present an algorithm to compensate all the scale factors (pre-scaling factors and CORDIC scale factor). This is needed only for modulus computation.
To perform the compensation, we use a scheme that consist in computing the logarithm of the total scale factor to be compensated and perform the exponential function. This scheme was first used in [5] and later used in [1] . In [4] a similar scheme was used for a radix-2 algorithm. In this paper we improve the computation of the exponential function. In the following sections we describe how to perform these two steps.
We do not consider other methods to compensate a variable scale factor (for instance square-root and division [9] , or look-up tables and multiplications [18] ) since we estimate that they increase the latency and/or the area [3] .
Calculation of logarithm of scale factor
The expression of the scale factor to compensate is
where r ,J = R ,1 r ,j+1 . Therefore, the computation of the logarithm of the scale factor results in lnT = lnM 1 + lnM 2 + 1 2
r ,2J (10) This involves the computation of the logarithm of each factor and addition of all the logarithms. We have determined that the range of T is within 1=2; 5=2. Therefore, the range of the logarithm is within , ln2; ln5 , ln2.
Since the argument of the exponential function is ln1=T , its range should be , ln5 + ln2; ln2.
Compensation by exponential
In this section we propose an algorithm for the computation of He ln1=T , where H is the x coordinate after the vectoring operation. This algorithm corresponds to improvements over the algorithms proposed in [5] and [1] . The basic iteration to compute the exponential is [5] v i + 1 = v i + e i r ,i v i c i + 1 = rc i , B i e i (11) where i 1, v 1 = H, c 1 = , lnT , ,r , 1 e i r , 1 and B i e i = r i ln1 + e i r ,i
The selection of e i is performed so that r ,i c i tends to 0.
The result is v S + 1 = He ln1=T = H T where S is the number of iterations of the exponential algorithm. Note that we use r for the radix, but this does not imply the use of the same radix as in the vectoring. As in [1] , we consider selection by rounding. However, in this case we perform the selection as follows e i = round(ĉ) (13) whereĉ = dc i e t is the carry-save representation of c i + 2 ,t truncated to t fractional bits. Note that, in contrast to the vectoring case, for this algorithm, to perform the rounding we add always one in position t. This selection leads to a reduction in the maximum residual for the next iteration.
In [3] we obtain the conditions for convergence using selection by rounding. Using similar conditions of convergence as in the vectoring case, we determined that For iterations with i 3 convergence is assured by using selection by rounding and ,r,1 e i r,1.
For iteration i = 2 , we achieve convergence with selection by rounding if ,r , 2 e 2 r , 1.
Iteration i = 1 does not converge with selection by rounding.
For the first iteration, we propose to use a table for selection. The following two aspects determine the parameters of the table:
As we have seen in Section 6.1, the domain of the exponential function has a positive and negative part. Since the first selection is by This assures a positive argument for the exponential function, by just performing a right shift by two positions. The range to be covered by the exponential is now 3 ln2 , ln5; 3 ln2.
Since the selection of the second iteration is performed by rounding with ,r , 2 e 2 r , 1, the residual c 2 should verify ,r , 2 , 1=2 + 2 ,t c 2 r , 1=2. This assures convergence in the remaining iterations.
We report in Section 7.1 the results we have obtained for this table.
Calculation of ArcTan(b/a) and Modulus
We now combine the implementation of Section 5 with the scale-factor compensation of Section 6 to obtain a module for the calculation of both arctan(b/a) and the modulus of the vector (a,b).
Architecture
We first discuss the size of the required tables for the compensation and then we consider a pipelined and a wordserial implementation.
Size of the required tables
In this section we discuss the size for the required tables, for the compensation of the total scale-factor. The tables correspond to the calculation of the logarithm of the scale factor, to the selection in the first iteration of the exponential, and to the constants for the exponential iterations. Tables for the computation of the logarithm. As discussed in Section 6.1, we include the compensation of the pre-scaling factors as a part of the CORDIC scale-factor compensation using the logarithm-exponential method. This requires the computation of the logarithm of the prescaling factors with a precision of about n bits. Since the pre-scaling factors have about b = log 2 r bits, which is much smaller than n, an effective method for the computation of the logarithms is a table-lookup. We address the table directly with the bits of x. Moreover, since the error in the computation of the pre-scaling factor is minimum when a table look-up is used, this method reduces the number of address bits of the tables.
In summary, in this case we use two tables for each prescaling factor, one for the pre-scaling factor and one for the logarithm. In [3] , we determined the effect of the algorithm parameters (such as R, r and t) on the number of bits involved. There are various solutions to the number of input bits of these tables, the choice depending on a tradeoff among many characteristics. In any case the number is about B + 1 bits for lnM 1 , and b + 2 bits for lnM2.
Moreover, the computation of the logarithm requires the terms ln1 + 2 j r ,2J , which are obtained from a table of b input bits for each j. 
Pipelined implementation
As shown in Figure 4(a) , the pipeline of Section 5.1 is extended to include the scale-factor compensation of x. This extension consists of the following parts:
The calculation of the logarithm of the scale factor. As described by expression (10) this calculation is performed by the sum of the component terms. As indicated previously, these component terms are obtained from tables. In Figure 4 (b) we show the implementation of one iteration of this calculation.
The recurrence to obtain the digits of the decomposition of the logarithm according to expression (11) . The selection is done by rounding, except for the first digit, which is obtained from a table (with b + 2 input bits).
The implementation of one of these iterations is shown in Figure 4 (c).
The recurrence using the digits above to perform the compensation. This is not shown since it is similar to the x CORDIC data-path.
Note that since the scale factor of the required precision depends only on about the first half of the j digits, the calculation of the logarithm is performed only in those iterations.
Moreover, in the last iterations x does not change, making it possible to overlap the compensation with the rest of the CORDIC iterations. Specifically, we have determined in [3] that the last iteration in x to obtain H (the value of the scaled modulus within the precision) is h = n , 2B 2b + 1
To avoid the delay of the table for selection of e 1 , this selection is performed using an estimate of ln1=T . Finally, the last portion of the compensation can be performed by a rectangular multiplier (termination by linear approximation of the exponential).
Word-serial implementation
As shown in Figure 4 (d), to the implementation for the calculation of arctan(b/a) it is necessary to add a data-path for the summation of the logarithms. This same data-path is then used to obtain the digits e i . The actual multiplication by 1 + e i r ,i is done using the x data-path. Therefore, the shifter should be extended to perform also shifts by r ,i . The same considerations as for the pipelined implementation reduce the amount of tables required and allow the overlapping of the CORDIC iterations and the scalefactor compensation. 
Evaluation and Comparison
We now extend the evaluation performed for the arctan case to include also the modulus. As there, we consider the case for 32 bits precision, r = 512, and R = 3 2 .
For the pipelined implementation, after the third CORDIC iteration it is possible to initiate the compensation. Moreover, after the second iteration of the exponential the compensation is terminated by a multiplier (linear approximation). We assume that the delay of an iteration of the compensation as well as the delay of the multiplier is the same as that of the CORDIC iteration. Consequently, the total delay of the calculation of the modulus is equivalent to the delay of 9 CORDIC iterations, which corresponds to 9 7:0 = 6 3 t fa . The comparison of delays is as follows:
For the radix-2 implementation with non-redundant adder, the compensation can be done totally in parallel with the angle calculation. Therefore, the delay remains in 180 t fa , with a speed up of 2.9.
For the radix-2 implementation with redundant adder, to simplify the compensation, it is convenient to use an implementation with a constant-scaling factor, such as that of [14] . For 32 bits, this requires 6 additional CORDIC iterations, for a total of 39 iterations and the compensation can be performed in parallel with these iterations. The total delay is 39 4:5 = 175 t fa , resulting in a speedup of 2:8.
For radix 4 (with redundant adder) [18] , the scaling factor is variable. The scale factor is computed by an initial table and linear approximations, and the compensation is performed by a multiplier. Both operations are performed in parallel with the angle calculation. Therefore the total delay is 17 7:0 ' 120 t fa , resulting in a speedup of 1:9.
The word-serial implementation has about the same critical path as the architecture for angle calculation of Section 5.1. However in this case we have 10 iterations (for the word-serial we do not terminate the exponential with a multiplier, resulting in one additional iteration, as compared with the pipeline architecture). Therefore, this implementation presents a total delay of about 10 9:0 = 9 0 t fa .
Making similar considerations for the radix-2 algorithms, results in an speedup of about 3:0 with respect to the radix-2 algorithm with non-redundant adder, and 2:4 with respect to the radix-2 algorithm with redundant adder. For the radix-4 algorithm, the compensation is carried out by a radix-4 division algorithm, that begins when the scaled module is obtained within the precision, so that certain overlap exists between the microrotations and the compensation.
The resulting total number of iterations is 33 (25 microrotations and 8 iterations of the division used for compensation, which do not overlap with the microrotations), and a total delay of 200. The speedup in this case is of about 2:2.
Conclusions
We have presented a very-high radix algorithm and implementation for the circular CORDIC in vectoring mode, which is used to compute arctan(b/a) and the modulus of vector a; b. In order to use the simple selection by rounding, two pre-scalings are performed, one before the first iteration and another after it. Moreover, to reduce the overhead of the pre-scaling and adapt to the required precision, the first iteration is done using a smaller radix than the rest. The main requirement for this algorithm, as compared with radix 2 and radix 4, corresponds to an increase in the size of the tables for storing the elementary angles, the need for tables for the pre-scaling factors as well as the need of rectangular multipliers instead of adders in the serial implementation (for the pipelined implementation we estimate that the total cost of the rectangular multipliers and of the adders is comparable). Moreover, when the modulus is required, the scale-factor compensation is performed by a logarithmexponential approach, so that additional tables are needed. The number of input bits to each of these tables is about the logarithm of the radix, so that we estimate that the implementation is reasonable up to a radix around 1024.
We have performed a rough evaluation for 32-bit precision of both a pipelined and a word-serial implementation and have shown a substantial speed up with respect to radix-2 and radix-4 implementations. This module complements the unit for CORDIC rotation presented in [1] .
