In this article we propose a novel scheme, based on the virtually scaling-free COordinate Rotation DIgital Computer (CORDIC) algorithm, for designing a hardware-efficient CORDIC rotator. Fewer than one third of the elementary rotational stages require classical CORDIC iteration to predict their rotation directions. The directions of the remaining iterations can be computed in parallel, so the corresponding z-datapath can be eliminated. A 16-bit implementation of the processor requires 0.23 mm² of silicon area and consumes 967.8 µW when synthesized in a 0.18 µm technology.
INTRODUCTION
The forward rotation mode of the CORDIC algorithm in circular coordinates (CORDIC rotator) [1] computes the sine and cosine of a given angle and carries out complex multiplication using iterative vector rotation in a to-and-fro (two-sided) manner through a set of elementary rotation angles. In hardware, these vector rotations are simply a series of shift-and-add operations, and the resulting structure is very economical. It has therefore been incorporated in several DSP and communication systems, and a large volume of research has been dedicated to improving its speed and hardware requirement [2-6, 9, 10]. Earlier we proposed a low-power, hardware-optimal architecture for the CORDIC rotator [5] in which we were able to eliminate all the arithmetic operations along the angle computation datapath (or z-datapath) by considering unidirectional vector rotation. However, the same formulation cannot be applied to reduce hardware for the conventional two-sided vector rotation, and for wordlengths larger than 18 bits the hardware requirement of that architecture exceeds that of the conventional CORDIC.
In this work, we propose a novel scheme to make our previously proposed architecture more hardware optimal. In essence, the adopted philosophy is very similar to that proposed in [6], where the target angle is partitioned into two halves. For the upper half, conventional CORDIC iterations are executed, whereas for the lower half the CORDIC rotator operation is approximated by a linear CORDIC rotator. The directions of rotation for this lower half are predicted in parallel by mapping a logic '1' and a logic '0' to +ve and -ve rotation respectively. In this way, it is possible to eliminate about two thirds of the required z-iterations. We will show that this simple mapping is not sufficient for maintaining accuracy, and consequently we propose a different mapping scheme that needs no additional hardware.
The rest of the paper is structured as follows. In Section 2 we formulate a novel scheme for mapping a logic '1' and '0' in the representation of the target angle to +ve and -ve rotation respectively; Section 3 describes the proposed architecture; its performance is analysed in Section 4; and conclusions are drawn in Section 5.
OPTIMIZATION OF Z-DATAPATH FOR TWO-SIDED ROTATION
In a scaling-free CORDIC [5] the $i$-th elementary rotational angle is defined as $\alpha_i = 2^{-i}$. In this formulation, when the direction of each elementary rotation is restricted to $\sigma_i \in \{0, 1\}$, there is a one-to-one correspondence between the position of a logic '1' in the binary representation of the target angle and the elementary CORDIC section that performs a rotation. The bit pattern of the target angle can therefore be used directly as the enable signals for the elementary rotational stages, and no real computation is needed along the z-datapath. For two-sided rotation ($\sigma_i \in \{-1, 1\}$), however, no such direct correspondence exists, and the method adopted in [5] cannot be exploited directly to optimize hardware. It has been proposed earlier that, if $\alpha_i = 2^{-i}$, mapping logic '1' and logic '0' in the representation of the target angle to +ve and -ve rotations respectively makes the vector rotate by the target angle [6]; here we will show that this is not sufficient for preserving accuracy.
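The unidirectional case described in this section, where each '1' bit of the target angle simply enables one stage, can be sketched as a floating-point model. This is a minimal illustration and not the hardware of [5]: the function name, the example bit pattern, and the second-order approximations $\sin\alpha_i \approx 2^{-i}$ and $\cos\alpha_i \approx 1 - 2^{-(2i+1)}$ used inside each stage are assumptions made for the example.

```python
import math

def scaling_free_rotate(x, y, angle_bits, first_stage):
    """Rotate (x, y) anticlockwise; angle_bits[k] enables stage i = first_stage + k.

    Hypothetical behavioural model: each enabled stage uses the assumed
    approximations sin(2^-i) ~ 2^-i and cos(2^-i) ~ 1 - 2^-(2i+1).
    """
    for k, bit in enumerate(angle_bits):
        if not bit:
            continue  # a '0' bit leaves the stage idle; no angle arithmetic is needed
        i = first_stage + k
        s = 2.0 ** -i
        c = 1.0 - 2.0 ** -(2 * i + 1)
        x, y = c * x - s * y, c * y + s * x
    return x, y

# Example bit pattern for stages i = 4 .. 15 (arbitrary illustrative value).
bits = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1]
theta = sum(b * 2.0 ** -(4 + k) for k, b in enumerate(bits))
x, y = scaling_free_rotate(1.0, 0.0, bits, first_stage=4)
print(x - math.cos(theta), y - math.sin(theta))  # both differences are small
```

The angle never has to be accumulated or compared: the bits of the target angle are the stage enables, which is why the z-datapath disappears in the unidirectional formulation.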
Therefore, in terms of , equation (2) Equation (4) can be written as
Adding $2^{-j} + 2^{-k} + 2^{-l} + 2^{-m}$ to both sides of equation (5) we get (6) can be written as
Corollary: For a -ve angle satisfying the conditions of Theorem 2, the following condition holds:
Proof: The proof directly follows from Theorem 2.
The physical meaning of equation (3) is as follows. If the $p$-th stage (more generally, the very first stage) of elementary rotation is always forced to do a positive rotation, and a '0' or '1' in the binary representation of the input angle $\theta$ represents a -ve or +ve rotation of the corresponding iteration stage respectively, then the input vector is effectively rotated by an angle $2\theta$ with an error of O(b). Thus, if the binary representation of $\theta$ is shifted right by one bit position (i.e., yielding $\theta/2$), and the elementary rotational stages (except the $p$-th stage) corresponding to the bit values '0' and '1' of this shifted bit pattern are treated as undergoing negative and positive rotation respectively, then a vector rotation through the angle $\theta$ is achieved.
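The angle bookkeeping behind this interpretation can be checked with exact rational arithmetic. The sketch below is an independent check and not a restatement of the theorem. It assumes that the angle is stored as an integer `n` whose bits occupy positions $p+1$ to $b$ (weight $2^{-i}$ at position $i$); the function name and this encoding are hypothetical. Under that assumption the swept angle equals $2\theta$ plus one unit in the last place, $2^{-b}$.

```python
from fractions import Fraction

def two_sided_angle(n, p, b):
    """Total angle swept when theta = n * 2^-b drives stages p+1 .. b.

    Stage p is forced +ve; stage i > p turns +ve for a '1' bit and -ve for a '0' bit.
    (Hypothetical helper; only the angles are tracked, not the vector.)
    """
    total = Fraction(1, 2 ** p)  # forced +ve rotation of stage p
    for i in range(p + 1, b + 1):
        bit = (n >> (b - i)) & 1  # bit of theta with weight 2^-i
        total += (1 if bit else -1) * Fraction(1, 2 ** i)
    return total

p, b = 4, 16
n = 0b101100100001  # arbitrary 12-bit example pattern
theta = Fraction(n, 2 ** b)
swept = two_sided_angle(n, p, b)
print(swept == 2 * theta + Fraction(1, 2 ** b))
```

Feeding the pattern shifted right by one bit, `two_sided_angle(n >> 1, p, b)`, gives an angle within $2^{-b}$ of $\theta$ in this model, which mirrors the shift described in the text.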
ARCHITECTURE
In the scaling-free CORDIC rotator of [5] the lowest value of $i$ has to be $i_{\min} = p = \lfloor (b - 2.585)/3 \rfloor$. Thus, to ensure convergence over the range $[0, \pi/8]$ (which was shown to be sufficient to cover the entire coordinate space [5]), the $i = p$ elementary rotational stage has to be repeated $N = (\pi/8)/2^{-p}$ times. As $b$ increases, $p$ increases and $N$ grows rapidly, so that the advantage over the classical CORDIC is lost. To overcome this problem we propose a scheme that integrates $K = (p - 2)$ classical elementary rotational stages (with $\alpha_i = \tan^{-1} 2^{-i}$) with $(b - p)$ scaling-free stages. This means that the target angle is partitioned into two halves, one with the definition $\alpha_i = \tan^{-1}(2^{-i})$ and the other with $\alpha_i = 2^{-i}$. The numerical value of $\pi/8$ is 0.392699. Thus the lowest value of $i$ required to cover the convergence range is $i = 2$, since $\tan^{-1}(2^{-2}) = 0.244979$ whereas $\tan^{-1}(2^{-1}) = 0.463648$, which is beyond the convergence range. Note that, because of our definition of $\alpha_i$, the formulation presented in Section 2 can be applied from the $i = p$ stage onwards. The problem of determining the rotation direction for the classical stages can be tackled easily by considering pipelined operation and noting that, since the convergence range is $[0, \pi/8]$, all the input angles are +ve and thus the $i = 2$ stage always executes a +ve rotation. The direction of rotation for the $i = 3$ stage can therefore be determined concurrently with the $i = 2$ rotation operation. Also note that the $(b - p)$ scaling-free stages need no real computation along the z-datapath, so a significant saving of hardware is possible.
To illustrate the whole design, we describe the design of a 16-bit processor as an example. Without any loss of generality, the method described here can be extended to other wordlengths as well. The complete processor consists of three separate units, viz., domain, basic pipeline and output unit.
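The stage partition described in this section can be tabulated for a given wordlength with a few lines of code. This is a rough sketch: the function names are hypothetical, and the rounding is an assumption (a floor for $p$, which reproduces $p = 4$ for $b = 16$, and a ceiling for the repetition count $N$). The helper `reach` sums the elementary angles of all stages, i.e. the largest angle reachable when every stage rotates positively.

```python
import math

def stage_partition(b):
    """Return (p, K, number of scaling-free stages, N) for a b-bit datapath."""
    p = math.floor((b - 2.585) / 3)                  # first scaling-free stage (floor assumed)
    k_classical = p - 2                              # classical stages i = 2 .. p-1
    n_free = b - p                                   # scaling-free stages i = p .. b-1
    n_repeat = math.ceil((math.pi / 8) / 2.0 ** -p)  # repeats of stage p without classical stages (ceil assumed)
    return p, k_classical, n_free, n_repeat

def reach(b):
    """Sum of all elementary angles of the hybrid scheme for wordlength b."""
    p, _, _, _ = stage_partition(b)
    classical = sum(math.atan(2.0 ** -i) for i in range(2, p))
    scaling_free = sum(2.0 ** -i for i in range(p, b))
    return classical + scaling_free

for b in (16, 24, 32):
    print(b, stage_partition(b), reach(b) > math.pi / 8)
```

For $b = 16$ the function returns `(4, 2, 12, 7)`: two classical stages ($i = 2, 3$) and twelve scaling-free stages ($i = 4$ to 15), against seven repetitions of the $i = 4$ stage in a purely scaling-free design under the assumed rounding.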
Domain: This unit is responsible for carrying out the domain folding operation (described in [5]), by which a vector lying in the range $\theta \in (\pi/8, \pi/2]$ is mapped to $\phi \in [0, \pi/8]$. It also generates two 2-bit signals, namely domain and quad. While the quad signal indicates the quadrant in which the original target angle lies, the domain signal indicates the corresponding domain in the first quadrant onto which it has been folded back. The entire operation in this unit is performed in one clock cycle.
Basic pipeline: The schematic diagram of the basic pipeline is shown in Figure 1. Since we consider a wordlength of 16 bits, the value of $p$ is 4. Thus the scheme of parallel prediction of the direction of rotation developed in Section 2 can be applied to the elementary rotational stages from $i = 4$ to 15. This means that we need conventional CORDIC iteration for the $i = 2$ and 3 stages. In Figure 1 the conventional stages are denoted as c_stage and the scaling-free stages are denoted as s_stage. The total length of the pipeline is 14 stages. $x_{in}$, $y_{in}$ and $z_{in}$ are the input data generated after the domain folding operation, and $x_{out}$ and $y_{out}$ are the data supplied to the output unit for final processing. Each of the conventional CORDIC sections consists of two two-operand adder/subtractors, and the corresponding z-datapath requires one adder/subtractor. Note that since the $i = 2$ stage always executes a +ve rotation, the corresponding z-datapath component is actually a subtractor. After processing through the conventional CORDIC stages, the data enter the scaling-free stages. At this point the residual angle computed in the z-datapath can be either +ve or -ve and, accordingly, the 4th bit position will be 0 or 1 respectively. If the residual angle is +ve then, from the theory derived in Theorem 2 of Section 2, the $i = 4$ stage needs to execute a +ve rotation. Similarly, for a -ve residual angle, the direction of rotation will be -ve.
Thus, the logic value of the 4th bit is inverted before it is applied to the $i = 4$ stage. Each of the bits from the 4th to the 15th position of the residual angle acts as the direction of rotation for the corresponding scaling-free elementary rotational stage. Each of the scaling-free stages requires four two-operand adder/subtractors. However, as shown in [5], the stages ranging from $i = 7$ to 15 require only two adder/subtractors each.
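The data flow of the basic pipeline can be mimicked by a short floating-point model. This is a behavioural sketch only and not the synthesized design. Several details are assumptions made for illustration: the residual angle is taken to be a 12-bit two's-complement word with its sign at bit position 4 and an LSB weight of $2^{-14}$, each scaling-free stage uses $\cos\alpha_i \approx 1 - 2^{-(2i+1)}$ and $\sin\alpha_i \approx 2^{-i}$, and the names `hybrid_rotate` and `GAIN` are hypothetical.

```python
import math

B = 16  # wordlength of the example processor
P = 4   # first scaling-free stage for B = 16

def hybrid_rotate(x, y, z):
    """Rotate (x, y) by z, 0 <= z <= pi/8; the result carries the gain of the classical stages."""
    # Classical stage i = 2: always a +ve rotation because z >= 0.
    x, y = x - y * 2.0 ** -2, y + x * 2.0 ** -2
    z -= math.atan(2.0 ** -2)
    # Classical stage i = 3: direction taken from the sign of the residual angle.
    s = 1.0 if z >= 0 else -1.0
    x, y = x - s * y * 2.0 ** -3, y + s * x * 2.0 ** -3
    z -= s * math.atan(2.0 ** -3)
    # Residual angle as a 12-bit two's-complement word (assumed encoding):
    # sign at bit position 4, LSB at position 15 with weight 2^-14.
    word = math.floor(z * 2 ** 14) & 0xFFF
    for i in range(P, B):
        bit = (word >> (B - 1 - i)) & 1
        if i == P:
            bit ^= 1  # the sign bit is inverted before driving stage i = 4
        s = 1.0 if bit else -1.0
        c = 1.0 - 2.0 ** -(2 * i + 1)  # assumed approximation of cos(2^-i)
        x, y = c * x - s * y * 2.0 ** -i, c * y + s * x * 2.0 ** -i
    return x, y

# Gain introduced by the two classical stages in this model.
GAIN = math.sqrt(1 + 2.0 ** -4) * math.sqrt(1 + 2.0 ** -6)

z_in = 0.3
x_out, y_out = hybrid_rotate(1.0, 0.0, z_in)
print(x_out / GAIN - math.cos(z_in), y_out / GAIN - math.sin(z_in))
```

Only the two classical stages update the angle; the twelve scaling-free stages read their directions straight from the bits of the residual word, so they could all be resolved in parallel.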
Output unit: This circuit is responsible for carrying out the scaling operation. In this particular case the scale factor to be multiplied is 0.9613526. This factor is coupled adaptively with the scaling of 2 when the quad and domain signals indicate that the initial position of the vector was in the range $(\pi/8, 3\pi/8]$. The complete circuit is realised using a shift-and-add technique with a couple of multiplexers for bypassing some of the stages.
PERFORMANCE EVALUATION
As the values of x and y approach 1 (i.e., the vector magnitude approaches $\sqrt{2}$), the error becomes more significant. Also, the error floor conforms to the error theory of classical CORDIC developed in [7, 8].
Implementation results: The 16-bit CORDIC rotator is synthesized using a 0.18 µm CMOS library. The synthesized area of the processor is 0.23 mm², of which the domain, basic pipeline and output unit require 0.017 mm², 0.135 mm² and 0.08 mm² respectively. Power dissipation has been analyzed using Synopsys PrimePower. At 20 MHz the processor consumes 967.8 µW, of which the basic pipeline consumes 765 µW, and the domain and output units consume about 49.1 µW and 153 µW respectively.
Comparison: We compare the proposed architecture with the architectures reported in [5, 6, 9, 10]. Since different CORDIC architectures use different adder structures, to make a uniform comparison we consider only the total number of required b-bit adders (each of these b-bit adders could be implemented using different design approaches). The hardware comparison for different wordlengths is shown in Table 1. In deriving the numbers we have assumed that one byte of ROM cell (applicable for [10]) occupies half the area of a full adder and that the area of a redundant adder is approximately equal to three times the area of a full adder. We have also excluded the scaling circuitry since it is common to all except for [5]. As the wordlength increases, the advantage of the proposed architecture over the other architectures becomes apparent, except with respect to [6]. However, as proved in Section 2, the proposed architecture shows better numerical accuracy than that of [6].
CONCLUSIONS
In this paper we propose a novel CORDIC rotator algorithm and architecture in which the scaling-free CORDIC algorithm is used in conjunction with the classical CORDIC algorithm. In this formulation, less than 30% of the computations of the classical CORDIC are required in the z-datapath, which makes it attractive from the point of view of hardware cost and low-power applications. The proposed algorithm shows an error characteristic similar to that of the classical CORDIC, albeit at lower hardware cost. A 16-bit processor following the proposed architectural methodology, synthesized in a 0.18 µm CMOS technology, occupies 0.23 mm² and consumes 967.8 µW at 20 MHz.
