Abstract Several algorithms have been proposed to approximate transcendental functions. Historically, polynomial interpolation, infinite series, and other algorithms based on +, ×, − and / were studied for this purpose.
Introduction
In 1959, Volder [17] introduced the CORDIC algorithm to compute approximations of trigonometric functions. This method is still used because it is well suited to hardware design. It is a recursive method using only shift-and-add operations.
A decade later, J. S. Walther [18] generalized this method to other transcendental functions used in engineering fields.
The development of CORDIC algorithms and architectures [8] has aimed at the highest throughput rate and at reducing hardware complexity as well as the computational latency of the implementation. Typical approaches for reducing implementation complexity minimize the use of the scaling operation and the complexity of the barrel shifters and adders in the CORDIC engine. However, one of the problems associated with the classical CORDIC formulation is that the scale factor depends on the angle and is not constant. The complexity of computing the scale factor is in principle comparable to that of the basic CORDIC process itself. In a recent work [6], a new algorithm, CORDIC II, is proposed that replaces the CORDIC micro-rotations by a new angle set.
To eliminate the scale multiplication of conventional CORDIC, the scale-free CORDIC was introduced; see the pioneering papers [3, 1] and also [12, 10, 9]. The scale-free CORDIC algorithm for cosine and sine functions has been shown to be faster and more efficient in terms of area and accuracy than conventional CORDIC.
In this paper we give a method to minimize the number of iterations in the CORDIC method. This is achieved by computing, at each iteration, the elementary angle closest to the residual one. Our second contribution is the correction of the Taylor series used for the composed functions. We will show that our polynomial approximation gives faster computation for the same accuracy. The CORDIC algorithm operates either in rotation mode or in vectoring mode, following linear, circular or hyperbolic coordinate trajectories. In this paper, we focus on rotation-mode CORDIC using circular trajectories.
2 The CORDIC algorithm.
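The idea behind the conventional CORDIC algorithm is the rotation of a vector $[x_{in}\ y_{in}]^T$ in Cartesian coordinates, which can be expressed as in (1), where $[x_{out}\ y_{out}]^T$ is the output vector produced after rotation and $\theta$ is the angle of rotation.
\[
\begin{pmatrix} x_{out} \\ y_{out} \end{pmatrix} =
\begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix}
\begin{pmatrix} x_{in} \\ y_{in} \end{pmatrix}
\tag{1}
\]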
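This can also be written as
\[
\begin{pmatrix} x_{out} \\ y_{out} \end{pmatrix} =
\cos(\theta)
\begin{pmatrix} 1 & -\tan(\theta) \\ \tan(\theta) & 1 \end{pmatrix}
\begin{pmatrix} x_{in} \\ y_{in} \end{pmatrix}
\tag{2}
\]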
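We split the rotation angle into a sum of angles and carry out the rotation as a series of so-called micro-rotations by these angles. The idea is to decompose any angle $\theta$ into a sum of "elementary" angles
\[
\theta = \sum_{k=0}^{n-1} \varepsilon_k \theta_k
\tag{3}
\]
where $\varepsilon_k \in \{-1, 1\}$.
If $R_\theta$ denotes the matrix of the 2D rotation of angle $\theta$, we can use the fact that
\[
R_{\theta + \theta'} = R_\theta R_{\theta'}
\]
We can then translate equation (3) into the matrix product:
\[
R_\theta = \prod_{k=0}^{n-1} R_{\varepsilon_k \theta_k}
\]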
The conventional CORDIC
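The conventional CORDIC method performs a sequence of rotations by elementary angles. Any rotation by $\theta$ in the plane can be decomposed into a composition (matrix product) of n elementary rotations. When taking $\theta_k = \arctan(2^{-k})$, equation (2) becomes:
\[
\begin{pmatrix} x_{out} \\ y_{out} \end{pmatrix} =
\cos(\theta_k)
\begin{pmatrix} 1 & -2^{-k} \\ 2^{-k} & 1 \end{pmatrix}
\begin{pmatrix} x_{in} \\ y_{in} \end{pmatrix}
\]
Using also the identity
\[
\cos(\arctan(x)) = \frac{1}{\sqrt{1 + x^2}}
\]
we obtain, for $\varepsilon_k = \pm 1$,
\[
\begin{pmatrix} x_{out} \\ y_{out} \end{pmatrix} =
\prod_{k=0}^{n-1} \frac{1}{\sqrt{1 + 2^{-2k}}}
\begin{pmatrix} 1 & -\varepsilon_k 2^{-k} \\ \varepsilon_k 2^{-k} & 1 \end{pmatrix}
\begin{pmatrix} x_{in} \\ y_{in} \end{pmatrix}
\]
The idea is that the angles used are constant, so we have a constant scale factor
\[
K = \prod_{k=0}^{n-1} \frac{1}{\sqrt{1 + 2^{-2k}}}
\]
which, according to the literature [13], approximately equals 0.60725. To this end, we construct a sequence of vectors $[x_k\ y_k]^T$ with $[x_0\ y_0]^T = [x_{in}\ y_{in}]^T$
and
\[
x_{k+1} = x_k - \varepsilon_k 2^{-k} y_k, \qquad
y_{k+1} = y_k + \varepsilon_k 2^{-k} x_k
\]
After the fixed number of iterations, we multiply the resulting vector by the constant K, that is, $[x_{out}\ y_{out}]^T = K\,[x_n\ y_n]^T$.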
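As a rough illustration of the iteration described above, the following floating-point sketch mimics rotation-mode CORDIC. The function names `scale_factor` and `cordic_rotate`, the default of 32 iterations, and the use of floating-point multiplications by powers of two in place of hardware shifts are assumptions made for this example only.

```python
import math

def scale_factor(n):
    """Product of cos(arctan(2^-k)) for k = 0..n-1, i.e. the constant K."""
    k_const = 1.0
    for k in range(n):
        k_const /= math.sqrt(1.0 + 2.0 ** (-2 * k))
    return k_const

def cordic_rotate(theta, n=32):
    """Approximate (cos(theta), sin(theta)) for |theta| up to about 1.74 rad."""
    x, y, z = 1.0, 0.0, theta          # start from the unit vector on the x axis
    for k in range(n):
        eps = 1.0 if z >= 0.0 else -1.0  # rotate toward a zero residual angle
        shift = 2.0 ** -k                # stands in for a right shift by k bits
        x, y = x - eps * shift * y, y + eps * shift * x
        z -= eps * math.atan(shift)      # update the residual angle
    k_const = scale_factor(n)            # single compensation at the end
    return k_const * x, k_const * y

if __name__ == "__main__":
    print(scale_factor(32))
    print(cordic_rotate(math.pi / 6))
```

The only true multiplication is the final compensation by K, which is the step the scale-free variants aim to remove.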
The essence of the CORDIC algorithm is that it is multiplication free (only shift-and-add operations). The scale multiplication, also called compensation, which is needed so that the output vector has the same norm as the input one, is at odds with this property.
This motivates the introduction of the scale-free CORDIC.
The correct scale free CORDIC for sine and cosine
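The scale-free CORDIC for circular functions is based on the Taylor series
\[
\sin(x) = x - \frac{x^3}{6} + \frac{x^5}{120} - \cdots, \qquad
\cos(x) = 1 - \frac{x^2}{2} + \frac{x^4}{24} - \cdots, \qquad
\arctan(x) = x - \frac{x^3}{3} + \frac{x^5}{5} - \cdots
\]
The Taylor polynomial approximations of the composed functions to order 5 are obtained by substituting $P(x) = x - \frac{x^3}{3} + \frac{x^5}{5}$ into the polynomials of sine and cosine:
\[
\sin(\arctan(x)) \approx P(x) - \frac{P(x)^3}{6} + \frac{P(x)^5}{120}, \qquad
\cos(\arctan(x)) \approx 1 - \frac{P(x)^2}{2} + \frac{P(x)^4}{24}
\]
When we truncate these polynomials to order 5, we obtain the correct equations:
\[
\sin(\arctan(x)) \approx x - \frac{x^3}{2} + \frac{3x^5}{8}, \qquad
\cos(\arctan(x)) \approx 1 - \frac{x^2}{2} + \frac{3x^4}{8}
\]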
We can observe that $\sin(\arctan(x)) = x \cos(\arctan(x))$. This is simply due to the fact that
\[
\frac{\sin(\arctan(x))}{\cos(\arctan(x))} = \tan(\arctan(x)) = x
\tag{8}
\]
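The truncated composition can be checked with exact rational arithmetic. The sketch below is only an illustration: the helper names `poly_mul` and `compose` and the coefficient-list representation (index i holds the coefficient of x^i) are assumptions of this example, not part of the original method.

```python
from fractions import Fraction as F

ORDER = 5  # truncation degree used throughout this example

def poly_mul(p, q, order=ORDER):
    """Multiply two coefficient lists and drop every term above `order`."""
    r = [F(0)] * (order + 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            if i + j <= order:
                r[i + j] += a * b
    return r

def compose(outer, inner, order=ORDER):
    """Coefficients of outer(inner(x)) truncated at `order` (Horner scheme)."""
    result = [F(0)] * (order + 1)
    for coeff in reversed(outer):
        result = poly_mul(result, inner, order)
        result[0] += coeff
    return result

# Order-5 Taylor polynomials of sin, cos and arctan around 0.
SIN = [F(0), F(1), F(0), F(-1, 6), F(0), F(1, 120)]
COS = [F(1), F(0), F(-1, 2), F(0), F(1, 24), F(0)]
ATAN = [F(0), F(1), F(0), F(-1, 3), F(0), F(1, 5)]

sin_atan = compose(SIN, ATAN)  # coefficients of sin(arctan(x)) up to x^5
cos_atan = compose(COS, ATAN)  # coefficients of cos(arctan(x)) up to x^5

if __name__ == "__main__":
    print([str(c) for c in sin_atan])
    print([str(c) for c in cos_atan])
```

Multiplying `cos_atan` by x with `poly_mul` gives back `sin_atan`, which mirrors identity (8) at the level of the truncated polynomials.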
We then obtain, for the elementary angles $\theta_k = \arctan(2^{-k})$, and remarking that
The rotation matrix $M_{\theta_k}$ becomes:
To our knowledge, all the works we have seen use the Taylor series of the sine and cosine functions and replace $\theta_k = \arctan(2^{-k})$ by $2^{-k}$; see [3, 1] and also [12, 10, 9]. The error is that, when using a Taylor polynomial of a composite function $f \circ g$, both polynomials must be taken to the same degree and the resulting polynomial must be truncated at the required degree; see [14].
To give empirical evidence, we will compare orders 3, 4 and 5 of our method with the recent works [12, 1].
Benchmark of scale free CORDIC for circular functions
To minimize the number of iterations of the CORDIC algorithm, we choose each micro-rotation to be the angle $\arctan(2^{-k})$ closest to the residual angle. This can be done by choosing the power of 2 closest to $\tan(\theta)$, where $\theta$ represents the residual angle.
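The closest-angle selection can be sketched in software as follows. This is an illustrative model only: it searches a table of elementary angles directly instead of using the power-of-2 shortcut, and the names `ANGLES`, `closest_index` and `greedy_decompose`, as well as the table size of 32 entries, are assumptions of the example.

```python
import math

MAX_INDEX = 31  # hypothetical table size: angles arctan(2^-k) for k = 0..31
ANGLES = [math.atan(2.0 ** -k) for k in range(MAX_INDEX + 1)]

def closest_index(z):
    """Index k whose elementary angle arctan(2^-k) is nearest to |z|."""
    return min(range(len(ANGLES)), key=lambda k: abs(abs(z) - ANGLES[k]))

def greedy_decompose(theta, iterations):
    """Return the chosen (sign, k) pairs and the final residual angle."""
    z = theta
    steps = []
    for _ in range(iterations):
        if z == 0.0:
            break                      # nothing left to rotate
        k = closest_index(z)
        sign = 1 if z > 0.0 else -1    # rotate toward a zero residual
        z -= sign * ANGLES[k]
        steps.append((sign, k))
    return steps, z

if __name__ == "__main__":
    print(greedy_decompose(0.5, 4))
```

For angles in [0, π/4] the residual at least halves at each step in this model, because consecutive elementary angles differ by at most the smaller of the two.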
Owing to the continuity of the arctan function, if $\tan(\theta)$ is close to $2^{-k}$, then $\theta$ is close to $\theta_k$. This leads us to choose k such that $\theta_k = \arctan(2^{-k})$ is the closest to $\theta$ in the following way:
we replace arctan(θ) by θ without any loss of accuracy because θ is very close to arctan(θ) for θ in [0,
. As an example:
For a hardware design, the translation of our method for the binary representation
- if for $i \geq 1$ we have $\forall j < i$, $\varepsilon_j = 0$ and $\varepsilon_i = 1$ and $\varepsilon_{i+1} = 0$, then $k = i$;
- if for $i \geq 1$ we have $\forall j < i$, $\varepsilon_j = 0$ and
In this section, we will compare our approximation to the one given in [1] and [12] for order 3 Taylor approximations. The range of angles used is [0,
]. Using trigonometric identities, this range is enough to calculate the sine or cosine of any angle.
In [1], $\cos(\arctan(2^{-k}))$ is approximated by $1 - 2^{-2k-1}$, and $\sin(\arctan(2^{-k}))$ is approximated by $2^{-k} - 2^{-3k-3}$. The authors of [12] use the approximation cos(arctan(2
The proposed method uses the approximations $\cos(\arctan(2^{-k})) \approx 1 - 2^{-2k-1}$ and $\sin(\arctan(2^{-k})) \approx 2^{-k} - 2^{-3k-1}$, which is the mathematically correct expansion.
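The two order-3 sine approximations quoted above can be compared numerically against the closed forms $\cos(\arctan(x)) = 1/\sqrt{1+x^2}$ and $\sin(\arctan(x)) = x/\sqrt{1+x^2}$. The sketch below is a floating-point illustration, not the benchmark behind the tables; the function names are hypothetical.

```python
import math

def exact(k):
    """Exact (cos, sin) of arctan(2^-k) from the closed forms."""
    x = 2.0 ** -k
    norm = math.sqrt(1.0 + x * x)
    return 1.0 / norm, x / norm

def approx_ref1(k):
    """Order-3 approximation as quoted in the text for reference [1]."""
    return 1.0 - 2.0 ** (-2 * k - 1), 2.0 ** -k - 2.0 ** (-3 * k - 3)

def approx_proposed(k):
    """Order-3 approximation of the proposed method."""
    return 1.0 - 2.0 ** (-2 * k - 1), 2.0 ** -k - 2.0 ** (-3 * k - 1)

def sine_errors(k):
    """Absolute sine errors (reference [1], proposed) for index k."""
    _, s = exact(k)
    return abs(approx_ref1(k)[1] - s), abs(approx_proposed(k)[1] - s)

if __name__ == "__main__":
    for k in range(1, 6):
        print(k, sine_errors(k))
```

In this model the proposed sine term is more accurate for every index $k \geq 1$, while for $k = 0$ (the angle π/4) the approximation of [1] happens to be closer. The two cosine approximations are identical.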
The quadratic errors of the cosine and sine functions for the three methods are summarized in Tables 1 and 2. In Table 3, we compare the quadratic errors of our method for Taylor approximations of different orders.
Figures 1 and 2 show a MATLAB simulation of our method. The graph of our method is in blue, the graph of the MATLAB built-in function is in red, and the difference between them is in green.
Hardware implementation
Common hardware implementations of CORDIC algorithms are either iterative or pipelined [16, 4]. The main CORDIC computation unit is iterated in both cases. It is unrolled in the first class and rolled in the latter, using pipeline registers to store intermediate computations [5, 15].
Table 3 Comparison of different orders of Taylor polynomials
Fig. 3 Design of the CORDIC architecture.
A new design of the main computation unit is proposed in this paper and compared with the conventional CORDIC one. This is mainly a study to check whether the theoretical results are feasible and simple to embed. Optimizations, complete CORDIC computation schemes, advanced CORDIC architectures and comparisons, all of which build on the underlying computing unit, are left for future work.
The proposed scale-free CORDIC algorithm is based on Taylor polynomials. Three orders are evaluated to benchmark the theoretical study (see Table 3). As the complexity of the hardware architecture grows with the order of the expansion, only order 3 is implemented in hardware. A study of the impact of the Taylor order on hardware performance is ongoing, but it is not the focus of this paper.
Hence, the implemented hardware architecture is restricted to the main computation unit and to order 3 of the Taylor series. It is composed of 4 blocks: the dynamic index predictor, the shifting processor, the angle-storage ROM and the FSM controller. Figure 3 gives a general description of the architecture. A detailed description is presented in the following subsections. The section ends with a summary of the main hardware results.
FSM controller
The controller is a finite state machine with three states. In the initial state, the load signal is set to initialize the values of the CORDIC core, namely $X_0 = 1$, $Y_0 = 0$ and the angle $\theta$. The second state is processed for $2^N$ cycles according to an N-bit counter, which fixes the number of CORDIC iterations.
For a given iteration, the new intermediate values $X_n$ and $Y_n$ are obtained by shifting the previous values $X_{n-1}$ and $Y_{n-1}$ according to the closest micro-rotation, as given in equation (10). The third and final state sets the done signal. The cosine and sine of the angle $\theta$ are computed and stored in the output registers.
Dynamic Index predictor
The main theoretical result proved in Section 2 is implemented in the index predictor. The computation of the next index is the main improvement of the proposed hardware architecture. It estimates the optimal index with which we address the ROM and read the closest CORDIC micro-rotation for a given iteration. The selected angle is compared with the current residual error Z in order to compute the direction of the next micro-rotation. $X_n$ and $Y_n$ are then shifted right by index positions. The next listing describes the behavior of this block. The hardware implementation is based on a 32-bit comparator. When the first sequence of '11' bits is detected, the block returns the corresponding K-index. Otherwise the first sequence of '10' is checked, and the (K − 1)-index is returned in this case. In this way, the most significant power-of-2 micro-rotation is obtained. The listing below shows how the stored arctangent values are addressed.
```vhdl
-- ROM lookup: maps the predicted index to the stored arctangent value.
IndexAccess: process(Index)
begin
  case Index is
    when "00000" => MicroRotI <= X"3243f6a8";
    -- _____ (intermediate entries elided)
    when "11000" => MicroRotI <= X"0000003f";
    -- _____ (intermediate entries elided)
    when "11111" => MicroRotI <= X"00000000";
  end case;
end process IndexAccess;
```
hardware performances of the proposed architecture. An in-depth analysis of the sub-blocks shows that the index predictor block alone consumes 20 mW and uses 240 slices. The shifting processor block also consumes 20 mW and uses 540 slices.
The index predictor is an extra block in our design, which explains the additional resources compared with the conventional architecture. However, we think that the main reason is the behavioral coding style of our VHDL design. Behavioral synthesis infers LUTs rather than basic logic. Hence, a more optimized implementation should require fewer logic slices.
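The behavior of the index predictor can be modeled in software as a leading-one detector that inspects the following bit. The sketch below is a hypothetical model, not the authors' VHDL: the name `predict_index`, the unsigned fixed-point format with `frac_bits` fractional bits, and in particular the rule that a '11' pattern rounds up to the index i − 1 while '10' keeps the index i are assumptions of this example.

```python
def predict_index(z_fixed, frac_bits=32):
    """Return k such that 2^-k approximates z_fixed / 2**frac_bits.

    Assumes 0 <= z_fixed < 2**frac_bits (a pure fraction); returns None
    when the residual is zero.
    """
    if z_fixed <= 0:
        return None
    lead = z_fixed.bit_length()   # 1-based position of the leading one
    i = frac_bits - lead + 1      # the leading one has weight 2^-i
    # Bit just below the leading one (assumed 0 if there is none).
    next_bit = (z_fixed >> (lead - 2)) & 1 if lead >= 2 else 0
    # Assumed rule: pattern '10' keeps index i, pattern '11' rounds up to i - 1.
    return i - 1 if next_bit else i

if __name__ == "__main__":
    print(predict_index(0b00100000, frac_bits=8))
    print(predict_index(0b00110000, frac_bits=8))
```

Under this assumption the predictor picks the power of 2 nearest to the residual, with exact midpoints rounded toward the larger power.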
Synthesis of the pipelined CORDIC design
A pipelined CORDIC architecture consists of rolling the main computation unit by storing intermediate computations in registers. In the conventional architecture, the main unit is rolled 32 times when the processed data is coded on 32 bits. The main improvement of our proposed architecture is that the same unit is rolled only 3 or 4 times, whatever the size of the processed data. More rolled units can be implemented if more precision is needed. A precision of $10^{-2}$ is reached with 3 units, and $10^{-3}$ with 4 units. Figure 4 shows the proposed pipelined architecture. The index predictor, which is resource-consuming, is instantiated only once. The FSM controller communicates with only one pipeline stage. As shown in Table 5, significant gains are obtained compared with the iterative architecture. We save almost 50% of area and power, with a speedup of 10 MHz, even against a conventional architecture of only 16 bits.
Conclusion
The CORDIC algorithm has applications in many domains; for an overview, see [11]. The popularity of this method is due to the simplicity of its hardware implementation; see [8] for example. In this paper, two improvements were made. First, we minimized the number of iterations for a fixed error by computing the elementary angle closest to the residual one. Second, we gave the correct polynomial approximation for the scale-free CORDIC. The comparison between our method and two other well-known methods empirically confirms our theoretical proof. In our simulation, we observe that the order of the Taylor approximation used meets the accuracy requirements.
In Section 3, we showed that these methods have a simple hardware implementation, in keeping with the original objectives of CORDIC. The iterative and pipelined architectures were implemented, and significant improvements in hardware performance were observed in the pipelined case. Future work will focus on improving the hardware architecture.
