Abstract: A very-high radix algorithm and implementation for circular CORDIC is presented. We first present in depth the algorithm for the vectoring mode in which the selection of the digits is performed by rounding of the control variable. To assure convergence with this kind of selection, the operands are prescaled. However, in the CORDIC algorithm, the coordinate x varies during the execution so several scalings might be needed; we show that two scalings are sufficient. Moreover, the compensation of the variable scale factor (including the CORDIC scale factor and the prescaling factors) is done by computing the logarithm of the scale factor and performing the compensation by an exponential. Then, we combine, in a unified unit, the proposed vectoring algorithm and the very-high radix rotation algorithm, which was previously proposed by the authors. We compare with low-radix implementations in terms of latency and hardware complexity. Estimations of the delay for 32-bit precision show a speedup of about two with respect to the radix-4 case with redundant addition. This speedup is obtained at the cost of an increase in the hardware complexity, which is moderate for the pipelined implementation. We also compare at the algorithmic level with other very-high radix proposals, demonstrating the advantages of our algorithms.
INTRODUCTION
As is well known, the CORDIC algorithm in circular coordinates permits the computation of rotations, as well as of the trigonometric functions sin and cos, using the rotation mode, and of ArcTan(y/x) and the modulus of a vector, using the vectoring mode. A large body of work has been reported on variations of the algorithm, on implementations, and on applications. We refer the reader to [12], [15] for an overview of the algorithm, of the previous work, and for additional references.
In this paper, we consider first the algorithm for vectoring in the circular mode. Moreover, since the modulus is scaled by a variable scale factor, we present the algorithm for scale factor computation and compensation. This material corresponds to an extended version of [2]. We then combine this algorithm with a compatible algorithm for rotation, presented by the authors in [4], in a unified architecture.
The original CORDIC algorithm is radix 2 with nonredundant adders and constant scaling factor. This has been extended to the use of redundant adders and to radix 4 [12]. We consider here the extension to a much higher radix, such as radix 512. We call an algorithm with such a radix "very-high radix," as has been done for other digit recurrences, such as division [8].
As the radix increases, the number of iterations for a given precision is reduced, resulting in a potentially faster execution. However, two problems appear: the complexity of the selection function and the compensation of a variable scale factor.
Several very-high radix CORDIC algorithms have been proposed. The different approaches followed for the selection are: selection by truncation [6] (rotation), [16] (vectoring); selection by table [11] (rotation and vectoring), [16] (rotation); and selection by rounding [1], [4] (both for rotation). For a very-high radix, selection by table is complex (in time and hardware). Selection by truncation leads to a simpler selection than using a table, but still requires more hardware resources than selection by rounding (we compare further in Section 5).
For the scale factor calculation and compensation, different solutions have been proposed. In one approach, the scale factor is computed using tables and full multiplications and the compensation is performed by full multiplication [16] (rotation). Another approach consists of computing the logarithm of the scale factor (using tables and addition) and compensating by a very-high radix exponential algorithm [6], [1], [4], [11] (all for rotation). In this case, the variations arise from the selection in the exponential algorithm in a similar way as discussed before for the very-high radix CORDIC algorithm.
The main body of this work is devoted to the description of a very-high radix CORDIC algorithm in the vectoring mode suitable for the computation of ArcTan(y/x) and the modulus of vector (x,y). The problem of the selection function using a very-high radix also appears for other digit-recurrence algorithms, such as division. An effective solution is to perform the selection by rounding [7], [8], [9]. To allow this selection, the recurrence has to satisfy certain conditions, which is achieved by scaling the recurrence, as was already done for division [8], square root [9], and sqrt(x/d). In the case of CORDIC vectoring, there is an additional difficulty due to the variation of the coordinate x in each iteration. This has the effect that several recurrence scalings might be required. We show that two recurrence scalings are sufficient. Moreover, to simplify the first prescaling, we perform the first iteration with a smaller radix.
The variable CORDIC scale factor, as well as the factors of the recurrence-scalings, require that the overall scaling factor be calculated and then used for the scale-factor compensation. To perform this, we calculate the logarithm of the scaling factor (by adding the logarithms of the component factors) and perform the exponential function for the compensation. This method is well-adapted to the very-high radix CORDIC algorithm.
As mentioned before, many applications require both modes of operation simultaneously. Consequently, we complete the work with the design of a unified unit to implement the proposed circular vectoring algorithm and the circular rotation algorithm proposed by the authors in [4].
In Section 2, we discuss in general the very-high radix CORDIC algorithm with selection by rounding. In Section 3, we present the algorithm for vectoring, including the scale factor calculation and compensation. Section 4 is devoted to the description of a unified unit for vectoring and rotation. In Section 5, we compare with low-radix implementations and with other very-high radix CORDIC algorithms. Additional details can be found in [3].
VERY-HIGH RADIX CIRCULAR CORDIC
The algorithm is an extension of the radix-2 algorithm. For radix $r = 2^b$, the iteration (microrotation) is
where we consider that $\sigma_j$ takes values in the digit-set $\{-(r-1), \ldots, -1, 0, 1, \ldots, r-1\}$ since, for a very-high radix, there is no significant advantage in using a smaller digit-set. The value of $\sigma_j$ determines the amount of rotation angle in each iteration and is determined as follows:
• In the vectoring mode, the initial values are $x[1] = x_{in}$, $y[1] = y_{in}$, and $z[1] = 0$, and $\sigma_j = \mathrm{SEL}(y[j], x[j])$ (SEL() is the selection function) so that $y$ tends to 0 or, equivalently, that the following bound for convergence is satisfied
The final $x$ accumulates the modulus of the initial vector (scaled by the CORDIC scale factor: $K\sqrt{x_{in}^2 + y_{in}^2}$) and $z$ accumulates the value $\tan^{-1}(y_{in}/x_{in})$.
• In the rotation mode, the initial values are $x[1] = x_{in}$, $y[1] = y_{in}$, and $z[1] = z_{in}$, and $\sigma_j = \mathrm{SEL}(z[j])$ so that $z$ tends to 0, satisfying the bound $|z[j]| \le \theta_{\max}[j]$. The final $x$ and $y$ accumulate the coordinates of the vector (scaled by the CORDIC scale factor) rotated by an angle $z_{in}$. The expression of the CORDIC scale factor is
$$K = \prod_{j=1}^{N} \left(1 + \sigma_j^2 r^{-2j}\right)^{1/2}$$
(where $N$ is the index of the last microrotation). This factor depends on the rotation angle since $\sigma_j$ may take values different from $\pm 1$. Therefore, it is necessary to compute a variable scale factor and then perform the compensation. The iterations begin with the index $j = 1$ and the final iteration must assure the required precision. For $n$ bits of precision (this refers to a precision of $2^{-n}$ in the angle), since $\sum_{i=N+1}^{\infty} \tan^{-1}\left((r-1)\,r^{-i}\right) < r^{-N}$, we obtain that the number of iterations is $N = \lceil n/b \rceil$. Observe that, for the radix-2 algorithm, the last iteration has index $n$.
For a faster iteration, we utilize carry-save representation for x, y, and z.
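The iteration count above can be checked numerically. The following is a minimal sketch, not part of the original algorithm description: the function names `num_iterations` and `tail_angle` are hypothetical, and the tail sum is truncated to a finite number of terms. The values $r = 512$ and $n = 32$ are the ones used later in the evaluation.

```python
import math


def num_iterations(n: int, b: int) -> int:
    """Number of radix-2^b microrotations for n bits: N = ceil(n / b)."""
    return -(-n // b)


def tail_angle(r: int, N: int, terms: int = 40) -> float:
    """Truncated tail sum of atan((r - 1) * r**-i) for i = N+1 .. N+terms.

    This is the total angle that the microrotations after index N could
    still contribute, which bounds the residual angle after N iterations.
    """
    return sum(
        math.atan((r - 1) * float(r) ** -i)
        for i in range(N + 1, N + 1 + terms)
    )


if __name__ == "__main__":
    r, n = 512, 32
    b = r.bit_length() - 1          # b = log2(r) for a power-of-two radix
    N = num_iterations(n, b)
    print("N =", N)
    print("tail <= r^-N:", tail_angle(r, N) <= float(r) ** -N)
    print("r^-N <= 2^-n:", float(r) ** -N <= 2.0 ** -n)
```

For very small arguments the arctangent is indistinguishable from its argument in double precision, so the comparison for $r = 512$ is written with `<=` rather than the strict inequality that holds mathematically.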
Selection by Rounding
We now consider the selection of $\sigma_j$ using rounding of the control variable, which produces a simple implementation for very-high radices.¹ To show a general description for both modes of operation,² we consider a generic recurrence $g$ of the form
where $g[j] = r^j y[j]$ for vectoring and $g[j] = r^j z[j]$ for rotation.³ In selection by rounding, the selection function is $\sigma_j = \mathrm{round}(\hat{g}[j])$ (3), where $\hat{g}[j]$ is obtained by truncating the (carry-save) representation of $g[j]$ to $t$ fractional bits and round is the function rounding-to-nearest integer.
Adding and subtracting $\sigma_j$ in (2) results in
Since the selection is done by rounding, we have for the carry-save representation of $g[j]$
1. This has been used previously for other digit recurrences [7], [8], [9]. In this development, we follow [8].
2. Actually, this kind of selection is also performed in the exponential algorithm used for the scale factor compensation.
3. As in [10], we consider a recurrence scaled by $r^j$ to simplify the presentation. Furthermore, it simplifies the implementation in a word-serial architecture since the most-significant bits of $y$ or $z$, which determine the value of $\sigma_j$, are always in the same position.
Two conditions have to be satisfied, namely
where $h[j+1]$ is the convergence bound of $g[j+1]$.
2. To have $|\sigma_{j+1}| \le r - 1$ requires $-r + 1/2 < \hat{g}[j+1] < r - 1/2$, which results in
Considering both conditions, and combined with the bounds on $g[j+1]$ given by (6), we obtain
From these bounds, it is possible to obtain the necessary conditions to perform selection by rounding for each of the modes of operation.
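Selection by rounding can be illustrated with a short sketch. This is an illustration under stated assumptions, not the hardware selection function: the control variable is modeled as an ordinary floating-point value truncated toward minus infinity (a single-word two's-complement truncation), whereas the text works with a carry-save representation, whose truncation error is larger. The names `truncate` and `select_by_rounding` are hypothetical.

```python
import math


def truncate(g: float, t: int) -> float:
    """Keep t fractional bits of g, discarding the rest (toward -infinity)."""
    scale = 1 << t
    return math.floor(g * scale) / scale


def select_by_rounding(g: float, t: int) -> int:
    """Digit selection: round the truncated control variable to the nearest integer."""
    return math.floor(truncate(g, t) + 0.5)


if __name__ == "__main__":
    t = 3
    for g in (2.49, 2.5, -0.3, 510.6):
        sigma = select_by_rounding(g, t)
        # With this truncation model, |g - sigma| stays below 1/2 + 2^-t.
        print(g, sigma, abs(g - sigma) <= 0.5 + 2.0 ** -t)
```

Under this model the distance between the control variable and the selected digit is bounded by $1/2 + 2^{-t}$, which is the quantity that the two conditions above constrain.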
VERY-HIGH RADIX CORDIC VECTORING
In this section, we present the vectoring algorithm for the computation of ArcTan(y/x) and the modulus of the vector.
For the selection, we use rounding combined with two prescalings (one before and one after the first microrotation). To obtain a faster implementation, we modify the basic algorithm to allow a different (lower) radix in the first iteration. Moreover, we extend the range in the angle of the basic algorithm without latency increase. Finally, for modulus computation, we present a scheme for scale factor calculation and compensation (including the CORDIC scale factor and the prescaling factors).
As we pointed out before, to simplify the presentation, as in [10], we use the scaled variable $w[j] = r^j y[j]$. The initial condition is $j = 1$, $x[1] = x_{in}$, $w[1] = r\,y_{in}$, and $z[1] = 0$. We consider $(x_{in}, y_{in})$ in the first quadrant.
Conditions for Selection by Rounding
The conditions for the vectoring case are obtained from (9) and (10) with $h[j+1] = r^{j+1} \tan(\theta_{\max}[j+1])\, x[j+1]$ (see (13)) and $f[j] = \sigma_j x[j]$. Replacing these values in (9) and (10) and taking into account that the worst case is for expression (10) and that $r^{j+1} \tan(\theta_{\max}[j+1])\, x[j+1] > r\,x[j]$, we obtain
This is similar to division [8], resulting in
and requires $t \ge 2$.
Prescaling
To satisfy (14), we use prescaling of $x$ and $w$. Since, in contrast to division, the value of $x[j]$ changes each iteration, even if the initial $x$ is prescaled into the required range, a subsequent $x[j]$ might get out of the range, requiring additional prescalings. We now show that it is sufficient to have two prescalings, one before the first iteration and another before the second. Moreover, since the first prescaling is only useful for the first iteration and the delay of prescaling might depend on the radix of the iteration, we develop the algorithm allowing a radix $R = 2^f$ ($R \le r$) in the first iteration; we later discuss the effects of the actual value of $R$.
We proceed as follows:
1. Perform the first iteration in radix $R$. Before this iteration, a prescaling is performed. The conditions for this prescaling are that the selection of $\sigma_1$ can be performed by rounding⁴ and that the second iteration produces $|\sigma_2| \le r - 1$. The remaining iterations are performed in radix $r$.
2. We determine a second prescaling range which is sufficient to allow selection by rounding in all remaining iterations. This prescaling interval has to accommodate the variation in $x[j]$ produced by the remaining iterations.
3. From the range of the second prescaling, we determine a lower bound on $R$.
The first prescaling produces ($M_1$ is the first prescaling factor) $d[1] = M_1 x_{in}$ and $w[1] = R\,M_1 y_{in}$ since this iteration is radix $R$.
Then, $\sigma_1$ is produced by rounding, as indicated in (12) and the first iteration is
(15)
Since $d'[2]$ can be out of the prescaling range, we need to perform a second prescaling to produce ($M_2$ is the second prescaling factor) $d[2] = M_2\,d'[2]$ and $w[2] = M_2\,w'[2]$. As we do not perform another prescaling, the iterations for $j \ge 2$ are
For this modified algorithm, to achieve an angle with a precision of $n$ bits, it is necessary to perform $\lceil (n - f)/b \rceil + 1$ microrotations [3].
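The sequence described above (first prescaling, one radix-$R$ microrotation, second prescaling, remaining radix-$r$ microrotations with selection by rounding) can be simulated in floating point. The sketch below is only an illustration and makes several assumptions that are not taken from the text: the prescaling factors are exact reciprocals instead of table approximations, the arithmetic is double precision instead of carry-save, and the sign convention of the recurrences is chosen so that $\sigma_1 \ge 0$ in the first quadrant. The function name `vectoring` is hypothetical, and the input is assumed to satisfy $0 \le y_{in} \le x_{in}$ with $x_{in} > 0$.

```python
import math


def vectoring(x_in: float, y_in: float, r: int = 512, R: int = 32, n: int = 32):
    """Return (z, x, y, count): accumulated angle, final coordinates, iterations."""
    b = r.bit_length() - 1               # b = log2(r)
    f = R.bit_length() - 1               # f = log2(R)

    # First prescaling: bring x close to 1 (exact reciprocal is an assumption).
    m1 = 1.0 / x_in
    x, y, z = m1 * x_in, m1 * y_in, 0.0

    # First microrotation, radix R, digit selected by rounding of R * y.
    shift = 1.0 / R
    sigma = round(y / shift)
    x, y = x + sigma * shift * y, y - sigma * shift * x
    z += math.atan(sigma * shift)

    # Second prescaling: x has changed, so bring it back close to 1.
    m2 = 1.0 / x
    x, y = m2 * x, m2 * y

    # Remaining microrotations, radix r, no further prescaling.
    count = -(-(n - f) // b) + 1         # ceil((n - f) / b) + 1
    for _ in range(2, count + 1):
        shift /= r
        sigma = round(y / shift)         # selection by rounding, assuming x ~ 1
        x, y = x + sigma * shift * y, y - sigma * shift * x
        z += math.atan(sigma * shift)
    return z, x, y, count


if __name__ == "__main__":
    z, x, y, count = vectoring(0.9, 0.5)
    print(count, z, math.atan2(0.5, 0.9), y)
```

In this simulation the residual $y$ shrinks by roughly a factor $r$ per iteration after the second prescaling, so the accumulated $z$ approaches $\tan^{-1}(y_{in}/x_{in})$ in the stated number of microrotations.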
Interval for Second Prescaling and Bound on R
We now determine the range of $d[2]$ so that no additional prescaling is required. For this, we first compute a bound on the variation of $d[j]$ with respect to $d[2]$. This bound is calculated using the recurrence for $d[j]$ with the restrictions $|\sigma_j| \le r - 1$ and $|w[j]| < r\,(1 - 1/(2r))$. That is,
which results in
Consequently, from (14) and (17), $d[j]$ for $j \ge 2$ is inside the required range if we prescale $d'[2]$ into the range
From this expression, we obtain ($b = \log_2 r$) a lower bound on $R$ (19), with one case for $t = 2$ and $b$ odd and another for $t = 2$ and $b$ even or $t > 2$.
In addition to satisfying this bound, the selection of $R$ is made taking into account the following considerations: 1) A small value of $R$ leads to a simpler prescaling for $j = 1$. This might result in a smaller prescaling overhead (smaller delay). 2) On the other hand, a small value of $R$ results in a more complex prescaling in $j = 2$ (small prescaling interval in (18)). In summary, $R$ and $r$ are chosen to minimize the execution time for a given precision with a reasonable hardware complexity.
Interval for the First Prescaling
To obtain the condition for the first prescaling, we use the following condition (obtained from (8) with $g[j] = w[j]$)
Taking into account that $w[2] = M_2\,w'[2]$ and that $M_2$ does not depend on $w'[2]$, we can take the upper bound of $M_2$, given in (18), to obtain a bound of $w'[2]$ from the above condition. The bound of $M_2$ is given by
Therefore, the condition is transformed into
On the other hand, $\sigma_1$ is obtained by rounding $w[1]$ so that
Note that to obtain this expression we have taken into account that $\sigma_1 \ge 0$. From these expressions, we obtained a bound for $d[1]$, resulting in [3]
Determination of the Prescaling Factors
The prescaling factors should assure conditions (18) and (20), that is, an $x$ coordinate within an interval close to one after each prescaling. Therefore, these factors correspond to an approximation of the reciprocal of $x$, for which several methods have been reported. However, for modulus calculation, it is necessary to compensate the prescaling factors, which introduces certain constraints in the method of calculation (see Section 4.2).
Range Extension and σ-Set for First Iteration
In this section, we show how to extend the range in the angle to cover $0$ to $\pi/2$, to conform to the case of radix 2. To achieve this extension, it is possible to allow larger values of $\sigma$ for the first iteration. However, this would result in a complex implementation. To avoid this, we have considered the following method: 1) Detect whether the angle of the initial vector is larger than $\pi/4$ ($y_{in} > x_{in}$); 2) if that is the case, exchange $x$ and $y$ and subtract the computed angle from $\pi/2$ (this can be accomplished by making $z[1] = \pi/2$).
For the detection, we need to compare $x_{in}$ and $y_{in}$. To make a limited comparison, we take advantage of the fact that the range of convergence in the angle is larger than $\pi/4$ and that this depends on $\max(\sigma_1)$. Consequently, to perform a limited comparison, we allow that, after the interchange, $y[1] < x[1] + 2^{-p}$, where $p$ is the number of fractional bits compared. In [3], we obtain the following bound⁶:
For instance, comparing $p = \log_2 R$ bits of $x_{in}$ and $y_{in}$ results in $\max(\sigma_1) = R + 3$. Therefore, with values of $\max(\sigma_1)$ slightly larger than $R$, it is possible to make a limited comparison.
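The range extension by exchange can be sketched as follows. This is a simplified illustration: it uses a full comparison of $x_{in}$ and $y_{in}$ rather than the limited $p$-bit comparison discussed above, and `math.atan2` stands in for the vectoring core, which is assumed to handle only angles up to $\pi/4$. The function name `extended_angle` is hypothetical.

```python
import math


def extended_angle(x_in: float, y_in: float) -> float:
    """Angle of a first-quadrant vector using a core restricted to y <= x."""
    swapped = y_in > x_in
    if swapped:
        # Exchange the coordinates so that the core sees an angle <= pi/4.
        x_in, y_in = y_in, x_in
    core = math.atan2(y_in, x_in)        # stand-in for the vectoring core
    assert core <= math.pi / 4 + 1e-15
    # Subtracting from pi/2 undoes the exchange.
    return math.pi / 2 - core if swapped else core


if __name__ == "__main__":
    print(extended_angle(0.3, 0.8), math.atan2(0.8, 0.3))
```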
Scale-Factor Compensation (for Modulus)
As is standard in the CORDIC algorithm, the modulus of the vector is scaled. Moreover, an additional scaling is due to the prescaling of the very-high radix algorithm. Since all of these factors depend on the input data, it is necessary to calculate the scaling factor and to compensate. We use a scheme that consists of computing the logarithm of the total scale factor and of performing the exponential function. More specifically: 1) Compute $\ln K$ (where $K$ is the total scale factor) and 2) compute $\rho \exp(\ln(1/K))$ (where $\rho$ is the value of $x$ after vectoring). This scheme is convenient for our purposes because Step 2 can be performed in a way that is similar to the very-high radix CORDIC algorithm. It was first used in [6], but the selection was by truncation instead of rounding. Selection by rounding was used in [1] (for compensation in very-high radix CORDIC rotation). Here, we improve this implementation, as discussed later. In [5], a similar scheme was used for a radix-2 algorithm. Depending on the type of architecture and the precision, other methods to compensate a variable scale factor may be used (for instance, square-root and division or look-up tables and multiplications).
Calculation of Logarithm of Scale Factor
The computation of the logarithm of the scale factor is as follows⁷:
This involves the determination of the logarithm of each factor and addition of all the logarithms. Our implementation obtains the logarithms by table look-up and performs the additions concurrently with the CORDIC iterations. We have determined that the range of $\ln(1/K)$ is within $(-\ln 5 + \ln 2,\ \ln 2)$.
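The accumulation of logarithms can be sketched as follows. This is an illustration only: the logarithms are computed with library calls rather than table look-up, and the function names `log_total_scale` and `compensate` as well as the representation of each microrotation as a `(sigma, shift)` pair are assumptions made for the example.

```python
import math


def log_total_scale(m1: float, m2: float, digits) -> float:
    """ln of the total scale factor as a sum of logarithms.

    digits is an iterable of (sigma, shift) pairs, where shift is the weight
    applied to sigma in that microrotation (for example 1/R, then 1/(R*r), ...).
    """
    total = math.log(m1) + math.log(m2)
    for sigma, shift in digits:
        # Each microrotation scales the vector by sqrt(1 + (sigma * shift)^2).
        total += 0.5 * math.log1p((sigma * shift) ** 2)
    return total


def compensate(x_scaled: float, ln_scale: float) -> float:
    """Remove the scale factor with one exponential."""
    return x_scaled * math.exp(-ln_scale)


if __name__ == "__main__":
    digits = [(18, 1.0 / 32), (-87, 2.0 ** -14)]
    ln_k = log_total_scale(1 / 0.9, 1 / 1.3125, digits)
    print(ln_k, compensate(2.0, ln_k))
```

Summing logarithms replaces the multiplications of the individual factors by additions, which is what allows the computation to proceed concurrently with the microrotations.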
Compensation by Exponential
In this section, we propose an algorithm for the computation of $\rho \exp(-\ln K)$. This algorithm corresponds to improvements over the algorithms proposed in [6] and [1].⁸ The basic iteration is as follows⁹:
$L[i+1] = L[i] + e_i L[i]\, r^{-i}$
where $i \ge 1$, $L[1] = \rho$, $E[1] = -\ln K$, $-(r-1) \le e_i \le r-1$, and $q_i(e_i) = r^i \ln(1 + e_i r^{-i})$. The selection of $e_i$ is performed so that $r^{-i} E[i]$ tends to 0. The result is $L[m+1] = \rho \exp(-\ln K) = \rho/K$, where $m$ is the number of iterations of the exponential algorithm ($m = \lceil l/b \rceil$ for a precision in the compensation of about $\max(\rho)\,2^{-l}$ [3]). As in [1], we use selection by rounding. However, in this case, we perform the selection as $e_i = \mathrm{round}(\lceil E[i] \rceil_t)$, where $\lceil E[i] \rceil_t$ is the carry-save representation of $E[i] + 2^{-t}$ truncated to $t$ fractional bits. That is, in contrast to the vectoring case (and [1]), to perform the rounding we always add $2^{-t}$. For this case, this selection leads to a reduction in the maximum residual for the next iteration [3].
Conditions for Selection by Rounding
From (9) and (10) and using similar reasoning as in the vectoring case, the conditions for convergence result in [3] $\frac{1}{2} + e_i - q_i(e_i) < 1 - \frac{1}{2r}$ (24) and
Since $e_i - q_i(e_i)$ is always positive, the worst case for convergence (condition for convergence) is (24); whereas (25) imposes a lower limit for $t$, $t \ge 2$ for $r \ge 4$. We evaluate (24) for $e_i > 0$ and $e_i < 0$. For $e_i > 0$, from Taylor's series expansion of $q_i(e_i)$ and $e_i \le r - 1$, convergence requires that $r^i > r(r-1)$. Therefore, convergence is assured for $i \ge 2$. For $e_i < 0$, we have the following bound
Since the best result we can obtain is that convergence is achieved for $i \ge 2$, we determine the most negative value of $e_2$ to achieve convergence in that iteration. Taking into account the bound for $e_i - q_i(e_i)$ and since $e_2 > -r$, we obtain a lower bound on $e_2$ [3].
A bound for this expression is $e_2 \ge -(r-2)$. Following a similar reasoning, it can be shown that, from iteration $i = 3$ on, convergence is assured with $e_i \ge -(r-1)$. In summary, we conclude that: 1) For iterations with $i \ge 3$, convergence is assured with $-(r-1) \le e_i \le r-1$. 2) For iteration $i = 2$, we achieve convergence if $-(r-2) \le e_2 \le r-1$. 3) Iteration $i = 1$ does not converge with selection by rounding.
For the first iteration, we use a table for selection. The following two aspects determine the parameters of the table:
• As we have seen in Section 3.4.1, the range of the argument of the exponential function has a positive and negative part. Since the first selection is by table, negative arguments lead to a large selection table due to the form of the logarithm function for argument less than one. To always have a positive argument, we compute
where $k = \lceil \log_2 \max(K) \rceil = 2$. This assures a positive argument for the exponential function by just performing a right shift by two positions. The range to be covered by the exponential is now $(3\ln 2 - \ln 5,\ 3\ln 2)$.
7. $K = M_1 M_2 \prod_{j=1}^{N} \left(1 + \sigma_j^2 r^{-2t_j}\right)^{1/2}$, where $r^{-t_j} = R^{-1} r^{-j+1}$.
8. We use selection by rounding instead of truncation as used in [6]. Furthermore, we have reduced by two iterations the exponential calculation with respect to the implementation of [1].
9. Note that we use $r$ for the radix, but this does not imply that the same radix as in the vectoring algorithm must be used.
• Since the selection in the second iteration is performed by rounding with $-(r-2) \le e_2 \le r-1$, the residual $E[2]$ should verify
Consequently, this is the condition used to obtain the parameters of the table.
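The multiplicative exponential iteration can be simulated in floating point. The sketch below is an illustration under assumptions that are not taken from the text: the residual exponent is kept unscaled, the argument is assumed to be positive (as obtained after the shift described above), and the first digit is computed directly with `math.expm1` as a stand-in for the selection table, so it may fall outside the digit set used by the remaining iterations. The function name `exp_compensate` is hypothetical.

```python
import math


def exp_compensate(rho: float, a: float, r: int = 512, iterations: int = 4) -> float:
    """Approximate rho * exp(a), a >= 0, with digits e_i and factors (1 + e_i * r^-i)."""
    L = rho
    resid = a                             # exponent still to be applied
    for i in range(1, iterations + 1):
        weight = float(r) ** i
        if i == 1:
            # Stand-in for the selection table of the first iteration.
            e = round(r * math.expm1(resid))
        else:
            # Selection by rounding of the scaled residual r^i * resid.
            e = round(weight * resid)
        L += e * L / weight               # L <- L * (1 + e * r^-i)
        resid -= math.log1p(e / weight)   # account for the factor just applied
    return L


if __name__ == "__main__":
    rho, a = 1.5, 0.9
    print(exp_compensate(rho, a), rho * math.exp(a))
```

Each iteration after the first reduces the remaining exponent by roughly a factor $r$, so a few iterations suffice for 32-bit precision with $r = 512$ in this simulation.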
Sequence of Operations
Fig. 1 shows the sequence of operations for the vectoring operation to obtain the ArcTan and the (compensated) modulus. In the x/y/z part, the operations are:
• Range extension, calculation of $M_1$, and first prescaling.
• First microrotation, with radix $R$ ($R \approx r^{1/2}$).
• Calculation of $M_2$ and second prescaling. This step is more complex than the first prescaling since the interval for the second prescaling is narrower.
• Remaining microrotations radix $r$, up to microrotation $j = \lceil (n - f)/b \rceil + 1$.
In parallel to the x/y/z part, the logarithm of the (whole) scale factor is computed (LC iterations). Note that, since the scale factor of the required precision depends only on about the first half of the $\sigma_j$ digits, the calculation of the logarithm is performed only in those iterations.
The iterations for the compensation of the modulus (EC iterations) partially overlap the microrotations. This is possible because $x$ does not change in the final iterations. Specifically, we have determined in [3] that the last iteration in $x$ to obtain $\rho$ (the value of the scaled modulus within the precision) is $h = \lceil (n - 2f)/(2b) \rceil + 1$.
The first EC iteration obtains the digit $e_1$ from a table (in the remaining iterations, the selection is by rounding). To avoid the delay of the table for selection of $e_1$, this selection is performed using an estimate of the logarithm. Finally, depending on the architecture, the last portion of the compensation can be performed by a rectangular multiplier (instead of about half of the last EC iterations) using a linear approximation of the exponential.
UNIFIED UNIT FOR VECTORING AND ROTATION
In this section, we combine in the same unit the proposed vectoring algorithm with the very-high radix CORDIC rotation algorithm developed by the authors in [4] . The resultant architecture may be useful for a wide variety of applications that require both modes of operation. We first briefly describe the rotation algorithm and then present a unified architecture.
Very-High Radix Circular CORDIC Rotation
This section is a summary of the main results obtained in [4]. We determined that selection by rounding can be used directly for iteration $j \ge 2$. Therefore, for the selection in $j = 1$, it is necessary to use a table. To avoid a large selection table in the first iteration, which might increase the latency, we use a lower radix $R$ for the first iteration. Therefore, introducing the scaling by $r^{t_j}$ in the $z$ recurrence ($u[j] = r^{t_j} z[j]$), the algorithm for rotation is as follows:
To have an efficient implementation, $R$ should be selected so that:
1) The selection in the first iteration is by table and therefore a small value of $R$ is desirable.
2) The remaining iterations are performed with a radix $r$ and selection by rounding. That is, for $j \ge 2$, $\sigma_j = \mathrm{round}(\hat{u}[j])$, where $\hat{u}[j]$ is the carry-save representation of $u[j]$ truncated to $t$ fractional bits. Therefore, it is necessary to obtain a condition on $R$ to perform selection by rounding from iteration $j = 2$ on. This condition (27) is obtained from (9) and (10) for $t = 2$ and $r \ge 8$.
Since our implementation is for a very-high radix, we may use a radix $R$ for the first iteration of about $r^{1/2}$, which is similar to the vectoring case. Therefore, for a unified unit, we use the same radices $R$ and $r$ for rotation and vectoring, which simplifies the implementation. The actual value of $R$ selected should simultaneously verify conditions (27) for rotation and (19) for vectoring and depends on the values of $r$ and $t$ used.
The table for selection of $\sigma_1$ has $\lceil \log_2 g \rceil + f + 1$ input bits ($f = \log_2 R$ and $g$ is the range in the angle of rotation).
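The rotation mode summarized above can also be simulated in floating point. The sketch below is an illustration under assumptions that are not taken from the text: the first digit is computed directly from the tangent of the angle as a stand-in for the selection table, the sign convention of the recurrences is the usual counterclockwise one, the number of microrotations is taken to be the same as in vectoring, and the scale factor is removed with library logarithm and exponential calls. The function name `rotate` is hypothetical, and $|z| \le \pi/4$ is assumed.

```python
import math


def rotate(x: float, y: float, z: float, r: int = 512, R: int = 32, n: int = 32):
    """Rotate (x, y) by the angle z; returns (x', y', residual angle)."""
    b = r.bit_length() - 1
    f = R.bit_length() - 1
    count = -(-(n - f) // b) + 1          # assumed: same count as in vectoring
    shift = 1.0 / R
    ln_k = 0.0                            # logarithm of the CORDIC scale factor
    for j in range(1, count + 1):
        if j == 1:
            # Stand-in for the table selection of the first (radix-R) digit.
            sigma = round(math.tan(z) / shift)
        else:
            shift /= r
            sigma = round(z / shift)      # selection by rounding of the scaled angle
        x, y = x - sigma * shift * y, y + sigma * shift * x
        z -= math.atan(sigma * shift)
        ln_k += 0.5 * math.log1p((sigma * shift) ** 2)
    k_inv = math.exp(-ln_k)               # compensation by exponential
    return x * k_inv, y * k_inv, z


if __name__ == "__main__":
    xr, yr, zr = rotate(1.0, 0.5, 0.6)
    print(xr, math.cos(0.6) - 0.5 * math.sin(0.6))
    print(yr, math.sin(0.6) + 0.5 * math.cos(0.6))
```

Only the CORDIC scale factor is accumulated here, since no prescaling factors appear in the rotation mode.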
Range Extension
As in the vectoring algorithm, it is necessary to perform a range extension to cover the range from $-\pi/2$ to $\pi/2$ in the rotation angle. The basic range is somewhat larger than $\pi/4$ so that, to extend the range, we perform a rotation by $\pi/2$ when necessary. Similarly to the vectoring case, we perform a limited comparison of $q$ fractional bits, which is now between $z_{in}$ and $\pi/4$. For instance, comparing $q = f$ bits, we determined that $|\sigma_1| \le R + 3$. In this case, the table for selection of $\sigma_1$ has $f + 2$ input bits and $f + 2$ output bits. Since $R$ is of the order of $r^{1/2}$, this table is small.
Scale Factor Compensation
The scale factor is computed and compensated using the logarithm-exponential scheme of Section 3.4. In this sense, this is useful for designing a unified unit for rotation and vectoring. The main differences with respect to the vectoring case are that, in the rotation case, it is only necessary to compensate the CORDIC scale factor since the rotation algorithm does not use prescaling factors as in vectoring. Therefore, in a unified unit, it is necessary to avoid the addition of the logarithms of the prescaling factors for the rotation mode. Moreover, the compensation is performed after the last microrotation, in contrast to vectoring, where certain overlap between the compensation of the modulus and the final microrotations exists.
Architecture
The unified algorithm can be executed in a pipelined or a serial unit, the choice depending on the throughput requirement and area constraint. Because of space limitations, we only describe the pipelined case and give some comments on the serial unit. More details are in [3].
Pipelined Architecture
A diagram of the unified pipelined architecture and details of its component units are shown in Fig. 2. It consists of the following steps¹⁰:
• The microrotations. These are the same for rotation and vectoring since both algorithms use the same values of $R$ and $r$. In the figure, we show the implementation of the first microrotation (radix $R$) and a generic microrotation radix $r$. For the pipelined architecture, we implement the recurrences without the scaling by $r^{t_j}$ since this simplifies the implementation when the processor operates in rotation and vectoring.
In the first microrotation, we initialize the values of $x$, $y$, and $z$ in the rotation mode or load the values from the first prescaling in the vectoring mode. The selection is done by rounding for vectoring and by table look-up for rotation. To speed up the computation of $\sigma_1$ in the rotation mode, we have two tables for the cases $|z_{in}| \le \pi/4 + 2^{-q}$ and $|z_{in}| > \pi/4$ that operate in parallel with the comparator of $q + 2$ bits.
• Two prescaling steps, used only for the vectoring algorithm. In the first prescaling step, the limited comparison ($p$ bits) and mux are used to extend the range. For the computation of the prescaling factors, we use tables since it is convenient to have $M_1$ and $M_2$ with a minimum number of bits to reduce the complexity in the computation of $\ln M_1$ and $\ln M_2$ (required for scale factor computation). Since these logarithms have to be computed with full wordlength, the only practical method is to use tables as well. We show in Fig. 2 the approximate sizes of the tables. The exact values depend on the parameters $r$, $R$, and $t$ [3]. Note that, before the table of $M_2$, a short assimilation is necessary.
• Steps for the calculation of the logarithm of the scaling factor (LC iterations). These steps are performed concurrently with the prescalings and microrotations. The logarithms of the prescaling factors are added only in the vectoring case. As discussed before, the computation of $\ln M_1$ and $\ln M_2$ is by table look-up.
• Steps for exponential calculation (compensation of scale factor). Although, in the vectoring mode, these steps can be placed after half of the microrotations, we place them after all microrotations since this is required for the rotation mode. We perform a linear approximation of the exponential instead of the (about) last half of the EC iterations.
The unified pipelined unit has a larger latency for both modes than dedicated units: in vectoring because of the compensation that is placed at the end of the microrotations and in rotation because of the prescaling steps. This increased latency is avoided in applications that require the evaluation of one mode of operation over a large data set by using a bypass in certain parts of the pipeline. In Fig. 2, we show the bypass of data for the rotation and vectoring modes: The bypass for rotation avoids the prescaling steps, whereas the bypass for vectoring avoids the last microrotations for modulus calculation and the exponential stages for ArcTan computation.
Serial Architecture
The serial architecture reuses the same hardware to perform the steps indicated for the pipelined case. We now briefly list the features of this implementation.¹¹
• As indicated before, the control variable for the selection ($y[j]$ for vectoring and $z[j]$ for rotation) is scaled by $r^{t_j}$ to simplify the selection of $\sigma_j$. In the unified architecture, this requires multiplexers to select the scaled or unscaled variables, depending on the mode.
• Variable shifters are required for the shift by $r^{-t_j}$, $r^{-2t_j}$, and $r^{-i}$.
• Multiplexers are required to adapt the shared datapath to the microrotation, the prescalings, and exponential iteration.
• All microrotations (the first and the rest) use the same multiplier/accumulator units.
• The latency adapts to the mode by performing only the required steps. That is, the bypassing of steps is performed by skipping the corresponding cycles.
10. These steps do not necessarily correspond to the stages of the pipeline since these stages are determined by the desired throughput.
11. We refer the reader to [3] for more details.
EVALUATION AND COMPARISON
In this section, we estimate the delay of the proposed very-high radix architecture and compare with standard low-radix implementations. From the large number of variations proposed, we have chosen those that seem fastest for each type (in any case, we do not expect large variations in the conclusions if other instances are used). We also give a qualitative comparison of the area. Due to space limitations, we only compare pipelined implementations. For word-serial implementations, we obtained similar speedups as in the pipelined case [3]. With respect to other very-high radix proposals, we compare at the algorithmic and architectural level.
To make the evaluation more specific, we consider the case of $n = 32$ bits and an implementation for $r = 512$, $R = 32$, and $t = 3$. Therefore, $\sigma_1$ and $\sigma_i$, $i > 1$, have 7 and 10 bits, respectively. For the computation of the prescaling factors $M_1$ and $M_2$,¹² we use tables of sizes $5 \times 6$ and $13 \times 12$, respectively [3].
The critical path delay depends on the delay of the different components in the path. To estimate the delay, we use the model proposed in [8], where the delay of a full adder ($t_{fa}$) is used as unit of delay. For the tables, we use an estimation obtained from implementations with standard cells. Since the delays are given in terms of the delay of a full adder, the conclusions should be rather independent of the technology and similar for any CMOS standard-cell libraries. The delay of the components is estimated as:
1. MAC. The addition of the partial products and the accumulation is carried out in a tree of 4-to-2 and 3-to-2 carry-save adders with delays $1.5\,t_{fa}$ and $1.0\,t_{fa}$, respectively. The multiplier is recoded to a radix-4 representation. Moreover, we include the delay of the multiplier buffer ($1.0\,t_{fa}$) and of partial products generation ($0.5\,t_{fa}$).
2. The delay of the table for the computation of $M_1$ is $1.0\,t_{fa}$, whereas the delay of $M_2$ (including the short assimilation and the table) is $4.0\,t_{fa}$.
3. The delay of the table for the computation of $\sigma_1$ for rotation is $1.0\,t_{fa}$.
4. Small comparators for range extension: $1.5\,t_{fa}$.
5. For other components, we use the estimates given in [8]: recoding and rounding ($1.5\,t_{fa}$), and muxes ($0.5\,t_{fa}$).
In the pipelined implementations, we do not consider the delay of the latches between stages since their number depends on the desired clock cycle.
First, we compare all the architectures for the case that there is a continuous issue of rotation and vectoring instructions and, therefore, the bypass scheme is not used.
We now estimate the latency of the very-high radix implementation. For n = 32 bits, we need one lower-radix microrotation (radix 32) and three radix-r microrotations (r = 512) to achieve a precision of 2^{-32} in the angle. Moreover, for scale factor compensation, we need two radix-r iterations and a final linear approximation (rectangular multiplication).
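As a quick sanity check of this iteration count, one can view a radix-2^b microrotation as contributing roughly b bits of angle resolution. The sketch below tallies the bits for the schedule described above; the function name and the "b bits per radix-2^b step" simplification are illustrative assumptions, not part of the original algorithm description.

```python
import math

def angle_bits_retired(radices):
    """Sum log2(radix) over a sequence of microrotations.

    Assumption (a simplification for illustration): a radix-2^b
    microrotation contributes about b bits of angle resolution.
    """
    return sum(int(math.log2(r)) for r in radices)

# One lower-radix microrotation (radix 32) followed by three radix-512 ones.
schedule = [32, 512, 512, 512]
print(angle_bits_retired(schedule))  # 5 + 3 * 9 bits
```

Under this simplification, the four microrotations account for the full 32-bit target precision.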
There are five different iterations: CORDIC iterations, first prescaling, second prescaling, iterations for the computation of the logarithm of the scale factor (logarithm iterations), and exponential iterations. We assume that, for the CORDIC iterations, the critical path is determined by the x/y datapath (this is reasonable since the z datapath is composed only of a table and a 3-to-2 carry-save adder). Following a similar reasoning, the logarithm iterations are not in the critical path.
The critical path of the first microrotation (Fig. 2) is composed of the following parts: determination of σ_1 (1.5 t_f), a multiplexer (0.5 t_f), and rectangular MAC. The number of bits in the multiplier is 7 (four radix-4 digits). As the multiplicand and accumulation operands are in carry-save representation, the tree has 10 operands (eight partial products plus two accumulation operands). The adder tree is composed of two 4-2 and one 3-2 carry-save levels and the delay is 2 × 1.5 + 1.0 = 4.0 t_f. Then, the delay of the MAC is 5.5 t_f. Consequently, the delay of this microrotation is 7.5 t_f.
The critical path of a radix-r microrotation (Fig. 2) is: multiplexer (0.5 t_f), round and recoding module (1.5 t_f), and rectangular MAC. The number of bits in the multiplier is 10 (five radix-4 digits), resulting in a tree with 12 operands (10 partial products plus two accumulation operands). The adder tree is composed of two 4-2 and one 3-2 carry-save levels, with a delay of 4.0 t_f, resulting in a delay of 5.5 t_f for the MAC. Therefore, the delay of the critical path is 7.5 t_f (in this particular case, it is the same as the delay of the first, lower-radix microrotation).
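The two microrotation delays above follow directly from the component delays listed earlier. The following sketch reproduces that arithmetic in units of t_f; the constant and function names are illustrative, and the number of carry-save levels is taken from the text rather than derived from the operand count.

```python
# All delays are in units of one full-adder delay (t_f).
CSA_4_2 = 1.5        # one level of 4-to-2 carry-save adders
CSA_3_2 = 1.0        # one level of 3-to-2 carry-save adders
MULT_BUFFER = 1.0    # multiplier buffer
PP_GEN = 0.5         # partial-product generation
MUX = 0.5            # multiplexer
ROUND_RECODE = 1.5   # rounding and recoding module
SIGMA1_SELECT = 1.5  # determination of sigma_1 in the first microrotation

def mac_delay(levels_4_2, levels_3_2):
    """Delay of a rectangular MAC: buffer + partial products + adder tree."""
    tree = levels_4_2 * CSA_4_2 + levels_3_2 * CSA_3_2
    return MULT_BUFFER + PP_GEN + tree

# Both trees (10 and 12 operands) use two 4-2 levels and one 3-2 level.
mac = mac_delay(levels_4_2=2, levels_3_2=1)
first_microrotation = SIGMA1_SELECT + MUX + mac
radix_r_microrotation = MUX + ROUND_RECODE + mac

print(mac, first_microrotation, radix_r_microrotation)
```

Both microrotations end up with the same delay here only because selection of σ_1 and the round-and-recode module happen to cost the same in this model.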
The critical path of the first prescaling (Fig. 2) is composed of limited comparison of x_in and y_in (1.5 t_f), mux (0.5 t_f), table of M_1 (1.0 t_f), and a small rectangular multiplier with a 6-bit multiplier operand. The multiplier is stored in radix-4 representation with four digits. Therefore, as the multiplicand is in nonredundant representation, there are only four partial products and the adder tree is composed of one 4-2 carry-save adder. With these considerations, the delay of the small multiplier is 3.0 t_f, resulting in a prescaling iteration delay of 6.0 t_f.
The delay of the second prescaling (Fig. 2) is the delay of the calculation of M_2 (4.0 t_f) plus the delay of the rectangular multiplier. The multiplier has 12 partial products and its delay is 5.5 t_f. This results in a delay of 9.5 t_f.
The delay of a radix-r iteration for exponential calculation (exponential iteration) is 7.5 t_f, the same as the delay of a radix-r CORDIC microrotation (see Fig. 2). The delay of the final linear approximation corresponds roughly to the delay of a rectangular multiplier (with a multiplier of about n/2 bits). The delay of this multiplier is composed of the delay of a multiplexer (0.5 t_f), three levels of 4-2 carry-save adders (4.5 t_f), plus the delay of a carry-propagate adder (4.0 t_f), resulting in a total delay of 9.0 t_f.
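The per-stage delays derived so far can be added up to obtain the total latency of the very-high radix pipeline. The sketch below does this under the assumption, stated earlier, that latch delays are ignored, so the latency is just the sum of the stage delays; the stage labels and data layout are illustrative.

```python
# (stage name, number of times the stage is traversed, delay in t_f)
STAGES = [
    ("first prescaling", 1, 6.0),
    ("first (radix-32) microrotation", 1, 7.5),
    ("second prescaling", 1, 9.5),
    ("radix-512 CORDIC microrotations", 3, 7.5),
    ("exponential iterations", 2, 7.5),
    ("linear approximation of exponential", 1, 9.0),
]

def total_latency(stages):
    """Sum count * delay over all pipeline stages (latches ignored)."""
    return sum(count * delay for _, count, delay in stages)

latency = total_latency(STAGES)
print(f"latency = {latency} t_f (about {round(latency)} t_f)")
```

The exact sum is 69.5 t_f, which rounds to about 70 t_f.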
Consequently, the latency of the very-high radix implementation is: 6.0 t_f (first prescaling) + 7.5 t_f (first microrotation) + 9.5 t_f (second prescaling) + 3 × 7.5 t_f (CORDIC microrotations) + 2 × 7.5 t_f (exponential iterations for scale factor compensation) + 9.0 t_f (linear approximation of exponential) ≈ 70 t_f. Table 1 shows the delay of the low-radix alternatives and the estimated speedup of the very-high radix implementation. The assumptions for the low-radix implementations are the following:
• Radix-2 nonredundant architecture: This corresponds to the conventional implementation of the CORDIC algorithm with a (fast) carry-propagate adder and constant scale factor. The number of iterations is n + 1 (see footnote 14), that is, 33 iterations for this evaluation, plus the iterations for scale factor compensation. We assumed that the scale factor is compensated with a parallel multiplier. Since the scale factor is constant, the tree may be adapted to the particular value of K. We consider that, after recoding, the number of partial products is n/3 (as opposed to n/2 in a conventional multiplier). The delay of this multiplier (for n = 32) is composed of three 4-2 carry-save adders (4.5 t_f) and a carry-propagate adder (4.0 t_f), resulting in 8.5 t_f.
• Radix-2&4 with redundant adders: We assumed an algorithm with constant scale factor with correction iterations [10], [13] for the first half of the microrotations. For this comparison, we estimated that, for 32 bits, four repetitions (necessary only in the first half of the iterations) result in a good trade-off between number of iterations and iteration delay. Moreover, for the second half of the microrotations (where the scale factor introduced by each microrotation is one within the precision), we assumed that a radix-4 algorithm [14] is implemented, resulting in a reduction of n/4 iterations with respect to a full radix-2 algorithm. Therefore, the number of radix-2 iterations is 17 (n/2 + 1 for n = 32) + 4 (repetitions) = 21, and the number of radix-4 iterations is eight (n/4 for n = 32). As before, for the (constant) scale factor compensation, we assumed an implementation with a constant multiplier, but with 2 × n/3 partial products since the input to the multiplier is in carry-save form. This increases the delay by one 4-2 carry-save level with respect to the nonredundant case, resulting in a delay of 10.0 t_f.
• Radix-4 with redundant adders [14]: For this architecture, the number of microrotations is roughly halved with respect to a full radix-2 algorithm (n/2 + 1 iterations). However, the scale factor is not constant. The scale factor is computed by an initial table and linear approximations, in parallel with the microrotations. The compensation is performed after the microrotations with a multiplier. Since the inputs to the multiplier are in redundant representation and this is a general multiplier, the number of partial products is about 2 × n/2. For n = 32, the multiplier requires a multiplexer (0.5 t_f), five 4-2 carry-save levels (7.5 t_f), and a carry-propagate adder (4.0 t_f), resulting in a delay for compensation of 12 t_f.

Therefore, as shown in Table 1, the rough comparison performed reveals a significant speedup for the very-high radix implementation. As with other very-high radix recurrences (such as division), this speedup should increase for higher precisions.
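The iteration counts and compensation delays assumed for the low-radix alternatives can be recomputed from n with a few lines of arithmetic. The sketch below is only a restatement of the assumptions listed above (variable names are illustrative); it does not reproduce the total delays of Table 1, which also depend on the per-iteration delay of each architecture.

```python
n = 32  # precision in bits

# Iteration counts assumed for each low-radix architecture.
radix2_iterations = n + 1                 # index j = 0 plus n iterations
radix2_4_radix2_part = (n // 2 + 1) + 4   # first half plus four repetitions
radix2_4_radix4_part = n // 4             # second half done in radix 4
radix4_iterations = n // 2 + 1            # full radix-4 algorithm

# Scale-factor compensation multipliers, in units of t_f.
CSA_4_2, CPA, MUX = 1.5, 4.0, 0.5
comp_radix2 = 3 * CSA_4_2 + CPA           # constant multiplier, n/3 products
comp_radix2_4 = 4 * CSA_4_2 + CPA         # one extra 4-2 level (carry-save input)
comp_radix4 = MUX + 5 * CSA_4_2 + CPA     # general multiplier, redundant inputs

print(radix2_iterations, radix2_4_radix2_part, radix2_4_radix4_part, radix4_iterations)
print(comp_radix2, comp_radix2_4, comp_radix4)
```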
Until now we have evaluated the unified architectures without bypassing because of the possible interleaving of rotations, ArcTan computations, and modulus-ArcTan computations. We now consider the effect of bypassing for the case in which only one of these operations is performed over a large dataset (which may be the case in many signal processing and graphics applications). Moreover, this evaluation permits comparing the vectoring and rotation algorithms separately.
The effects of the bypassing in the very-high radix implementation are as follows:
1. ArcTan: bypass of exponential iterations (scale factor compensation);
2. ArcTan&Modulus: overlap between last microrotations and scale factor compensation (for n = 32, there is one iteration of overlap);
3. Rotation: bypass of first and second prescaling.

For the low-radix implementations:

1. ArcTan: bypass of scale factor compensation;
2. ArcTan&Modulus: total overlap of last half of the microrotations and scale factor compensation;
3. Rotation: no bypassing is necessary.
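To make the effect of bypassing on the very-high radix pipeline concrete, the following sketch subtracts the bypassed or overlapped stages from the full latency. It rests on several assumptions that are not confirmed by the text: that the ArcTan bypass also skips the final linear approximation, that one iteration of overlap saves exactly one radix-512 iteration delay, and that latch delays are ignored. The resulting numbers are illustrative and are not taken from Table 2.

```python
# Stage delays of the very-high radix pipeline, in units of t_f:
# stage name -> (number of traversals, delay per traversal)
STAGE_DELAYS = {
    "first_prescaling": (1, 6.0),
    "first_microrotation": (1, 7.5),
    "second_prescaling": (1, 9.5),
    "cordic_microrotations": (3, 7.5),
    "exponential_iterations": (2, 7.5),
    "linear_approximation": (1, 9.0),
}

def latency(bypassed=(), overlap=0.0):
    """Sum the delays of all stages not bypassed, minus any overlapped time."""
    total = sum(count * delay
                for name, (count, delay) in STAGE_DELAYS.items()
                if name not in bypassed)
    return total - overlap

# Assumption: ArcTan skips the whole scale-factor compensation.
arctan = latency(bypassed=("exponential_iterations", "linear_approximation"))
# Assumption: one radix-512 iteration (7.5 t_f) is hidden by the overlap.
arctan_modulus = latency(overlap=7.5)
# Rotation skips both prescalings.
rotation = latency(bypassed=("first_prescaling", "second_prescaling"))

print(arctan, arctan_modulus, rotation)
```

Even under these rough assumptions, the ordering agrees with the qualitative expectation that ArcTan alone benefits most, since it avoids scale factor compensation entirely.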
In Table 2, we show the delay and speedups for each of the operations considered. For the computation of the delay, we have taken into account the stages bypassed in each case and the data for the delay of each stage given before.

Footnote 14: An iteration with index j = 0 is needed to achieve the range [−π/2, π/2] in the angle, plus n iterations to achieve a resolution of 2^{-n} in the angle. This is similar for the other low-radix schemes.
Comparing the data of Tables 1 and 2, we deduce that the speedup for the very-high radix implementation increases when only one type of operation is performed and the bypassing is used. Moreover, the best speedup is obtained for ArcTan computation, where no scale factor compensation is necessary.
With respect to area, for pipelined implementations using redundant representation, the CORDIC iterations have essentially the same area independent of the radix, except for the tables of the angles, whose number of input bits is proportional to the radix. In addition, the very-high radix scheme requires the area devoted to the prescaling. On the other hand, the area devoted to the wired shifts (which is substantial for low-radix implementations) is significantly reduced in the very-high radix implementation.
We now comment on the hardware complexity due to the scale factor compensation. In the pipelined implementations, we estimate that the hardware for exponential calculation has similar complexity to the multipliers used in the low-radix implementations, except for the logarithm iterations, which require adders and tables of logarithms. The radix-4 and the very-high radix schemes require hardware for scale factor computation, which we estimate has similar complexity in both schemes. The low-radix implementations with constant scale factor do not need this module.
Comparison at the Algorithm Level
For comparison with other very-high radix schemes [6], [11], [16], we prefer the algorithm level since, at this level, it is easier to establish the fundamental differences. The potential advantages of our algorithms with respect to the other schemes are:
• Selection function. We use selection by rounding (combined with selection by a small table in the first iteration of the rotation algorithm). This simplifies the implementation as compared to selection by truncation [6], [16] or selection by direct table lookup [11], [16]. When selection by truncation is used, it is necessary either to use an over-redundant digit set [16] (with larger tables and rectangular multipliers) or to perform a check of convergence in each iteration and introduce a correction iteration when necessary (variable-latency algorithm) [6]. Moreover, the direct implementation of the selection function leads to large tables (even different tables for different iterations) that increase the hardware complexity and the latency and limit the maximum radix for a practical implementation.
• Scale factor compensation. Two methods are used to compute and compensate the variable scale factor: 1) tables and full multiplications to compute the scale factor and full multiplication for compensation and 2) a logarithm-exponential scheme. We estimate that the logarithm-exponential scheme results in a faster and simpler implementation than the other method. Furthermore, among the logarithm-exponential schemes, our implementation uses selection by rounding (except in the first iteration, which is by table; however, due to the overlap with the CORDIC iterations, this table does not increase the latency), leading to a more efficient implementation as discussed before. An important point of our algorithm in the vectoring mode is that, for modulus calculation, the prescaling factors are compensated as a part of the CORDIC scale factor compensation. The vectoring algorithms proposed in [11] and [16] do not perform modulus calculation.
• Use of lower radix for first iteration. This technique is only used in our algorithms and reduces the latency of the implementation.
• Parametric algorithm description. We present our algorithms in terms of parameters, which permits us to select a good trade-off between hardware and latency through the selection of the radix (r and the radix of the first iteration). This is not possible in other descriptions, where the parameters of the algorithm are obtained after a complex process [11] or the algorithm is described only for a particular radix [16]. Furthermore, our algorithms are general purpose and their characteristics do not depend on the application (as in [11]).
• Range extensions. To cover the same range as in the conventional radix-2 CORDIC algorithms, we incorporated range extension schemes that do not introduce a latency penalty. This range extension is either not considered in the other implementations or introduces a latency penalty.
• Easy to combine rotation and vectoring in a unified unit. Some interesting applications for CORDIC require both modes of operation in the same architecture. In terms of selection, scale-factor compensation, and architecture, our algorithms facilitate the unified implementation as compared with other proposals. Furthermore, there is no other proposal for a unified unit for very-high radix CORDIC vectoring and rotation. Moreover, in [4], we proposed the extension of the rotation mode to hyperbolic coordinates and we estimate that the extension of the vectoring algorithm to hyperbolic coordinates is straightforward. The unified implementation would just require the addition of tables of angles for hyperbolic coordinates.
CONCLUSIONS
We have presented a very-high radix algorithm and implementation for the circular CORDIC in vectoring mode, which is used to compute ArcTan(y/x) and the modulus of the vector (x, y). In order to use the simple selection by rounding, two prescalings are performed, one before the first iteration and another after it. Moreover, to reduce the overhead of the prescaling and adapt to the required precision, the first iteration is done using a smaller radix than the rest. The main requirement for this algorithm, as compared with radix 2 and radix 4, corresponds to an increase in the size of the tables for storing the elementary angles, the need for tables for the prescaling factors, as well as the need for rectangular multipliers instead of adders in the serial implementation (for the pipelined implementation, we estimate that the total cost of the rectangular multipliers and of the adders is comparable). Moreover, when the modulus is required, the scale-factor (CORDIC scale factor and prescaling factors) compensation is performed by a logarithm-exponential approach so that additional tables are needed.
We proposed a unified unit to implement the vectoring algorithm developed in this work and the rotation algorithm developed by the authors in [4]. The rotation algorithm also uses a smaller radix for the first iteration so that both algorithms (rotation and vectoring) use the same sequence of iterations, which simplifies the implementation of the unified unit. The most interesting applications of the CORDIC algorithm require both the rotation and vectoring modes. Therefore, the proposed unified unit may be of great interest for improving the performance of these kinds of applications. Furthermore, the extension of the unit to support hyperbolic coordinates is straightforward.
We have performed a rough evaluation for 32-bit precision of a pipelined implementation and have shown a substantial speedup with respect to radix-2, radix-2&4, and full radix-4 implementations. We compared at the algorithmic level with other proposals for very-high radix CORDIC and concluded that our approach presents better characteristics in terms of the selection function, scale-factor compensation, algorithm description in terms of parameters, and design of a unified implementation for vectoring and rotation. 
