Although several modified versions have been proposed, there is still considerable demand for an efficient implementation of CORDIC in signal processing applications. The methods proposed so far have drawbacks either in range of functionality or in complexity. This paper presents a speed-optimized scale-free CORDIC intended mainly for signal processing applications such as the FFT and DCT. The proposed CORDIC is flexible enough to carry out the vector rotation operation with the required accuracy in such applications. When implemented on a Xilinx Virtex-5, the design uses fewer slices and runs faster than the conventional version. The sine and cosine waveforms of the pipelined design are also shown for verification.
Introduction
The coordinate rotation digital computer (CORDIC) [6] has found applications [5] in many areas, such as RADAR signal processing, bomber jets, math coprocessors, hand-held calculators and software-defined radios, for generating sine and cosine functions and for calculating transforms such as the discrete sine/cosine transforms and the fast Fourier transform. Several modified versions of CORDIC have been proposed to reduce the complexity of the conventional algorithm and to improve its speed and area. The virtually scaling-free CORDIC [3], [4] achieves the desired region of convergence at the cost of a considerable increase in area. The enhanced scale-free version [2] extends the region of convergence using a hybrid method that combines the conventional and scale-free versions, which in turn degrades performance. The area-time efficient CORDIC [1] overcomes the scaling problem completely but at reduced speed, with a delay of seven clock cycles before the output is pushed out for each input. This paper presents a speed-optimized version of CORDIC that operates over the entire coordinate space and is flexible enough to be applied to any signal processing application that requires computation of rotation. The article is organised as follows. Section 2 reviews existing CORDIC algorithms. Section 3 describes the proposed architecture and Section 4 presents the field programmable gate array implementation. Section 5 compares the results with the existing architecture. Section 6 concludes the paper.
Concise review of existing CORDIC algorithms

Conventional algorithm
The conventional CORDIC was designed to reduce the complexity of the rotation operation by eliminating the use of multipliers in hardware. The rotation is performed iteratively, as a sequence of micro-rotations that require only shift and add operations:

$$x_{i+1} = x_i - \sigma_i 2^{-i} y_i, \qquad y_{i+1} = y_i + \sigma_i 2^{-i} x_i$$

Since the total number of iterations is equal to the word-length (b) of the inputs, the rotation angle $\theta$ is given by

$$\theta = \sum_{i=0}^{b-1} \sigma_i \tan^{-1}\left(2^{-i}\right)$$

where

$$\sigma_i = \pm 1$$

depending on the direction of rotation. Even though the multipliers are eliminated, the conventional CORDIC still has a complexity equal to that of a multiplier. This is because of the scale factor k, which must be multiplied at the end of the last iteration in order to get the exact result. The value of k depends on the number of iterations. It is given by

$$k = \prod_{i=0}^{b-1} \frac{1}{\sqrt{1 + 2^{-2i}}}$$
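The conventional iteration can be illustrated with a short floating-point model. This is only a sketch: the function name `cordic_rotate` is hypothetical, the direction of each micro-rotation is chosen from the sign of the residual angle (an assumption about the rotation-mode convention), and a hardware design would use fixed-point shifts and a stored constant for k rather than the operations shown here.

```python
import math

def cordic_rotate(x, y, theta, b=16):
    """Rotate (x, y) by theta radians using b conventional CORDIC iterations.

    Floating-point model: the multiplications by 2**-i stand in for right shifts.
    """
    k = 1.0
    for i in range(b):
        sigma = 1.0 if theta >= 0 else -1.0          # direction of this micro-rotation
        # Both updates use the values from the previous iteration.
        x, y = x - sigma * y * 2.0 ** -i, y + sigma * x * 2.0 ** -i
        theta -= sigma * math.atan(2.0 ** -i)        # residual angle still to rotate
        k *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))  # accumulate the scale factor
    # The final scaling is the multiplication that scale-free variants try to avoid.
    return k * x, k * y

c, s = cordic_rotate(1.0, 0.0, math.pi / 6)
print(c, s)  # close to cos(30 deg) and sin(30 deg)
```

The last line of the function is where the multiplier-equivalent cost arises: without the factor k, the result would be stretched by the accumulated gain of the micro-rotations.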
Existing scaling free versions
Initially, attempts were made to eliminate the scaling operation at the end of the iterations by using first-order sine and cosine series. However, this version suffered from a very narrow convergence range because of the restriction that the basic shift must be greater than a certain value.
To overcome this problem, a modified virtually scaling-free algorithm was proposed, but it still needs an adaptive scale factor to be multiplied in order to extend the convergence to the entire coordinate space. In the enhanced scaling-free CORDIC, radix-4 Booth recoding is used along with the conventional CORDIC, which makes it less flexible to implement in other applications because of its complexity.
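A Taylor series, truncated at the chosen order, is used for approximating the sine and cosine functions, which are given by

$$\sin\alpha = \alpha - \frac{\alpha^{3}}{3!} + \frac{\alpha^{5}}{5!} - \cdots, \qquad \cos\alpha = 1 - \frac{\alpha^{2}}{2!} + \frac{\alpha^{4}}{4!} - \cdots$$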
In the above version, the value of the basic shift depends on the order of the Taylor series and the input word-length. An algorithm calculates the angle and the shift value for the next iteration. The total number of iterations is equal to the sum of the number of iterations performed with the basic shift and the number of iterations performed with the shift calculated for each iteration. In this version, speed has been traded off in order to reduce the area: a recursive architecture has been proposed, the practical implementation of which is somewhat complex in terms of synchronisation.
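The dependence of the basic shift on the order and the word-length can be made concrete with a small calculation. The expression used below is an assumption, not taken from the text: it equates the first Taylor term neglected by an order-n approximation, 2^(-(n+1)s) / (n+1)! for an angle of 2^(-s), with one unit in the last place of a b-bit word, 2^(-b). The function name `shift_bound` is hypothetical.

```python
import math

def shift_bound(b, n):
    """Real-valued shift s at which 2**(-(n+1)*s) / (n+1)! equals 2**-b.

    b: input word-length in bits, n: order of the Taylor approximation.
    Assumed error criterion; the rounding to an integer shift is left open.
    """
    return (b - math.log2(math.factorial(n + 1))) / (n + 1)

print(shift_bound(16, 3))  # roughly 2.85 for a 16-bit word and third order
```

For a 16-bit word and a third-order approximation the bound lies between 2 and 3; which integer a given design adopts is a choice between range of convergence and accuracy and is not fixed by this sketch.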
Proposed architecture
In the proposed architecture, the critical path is reduced considerably by pipelining, which is also well suited to signal processing applications. Pipelining reduces the complexity of synchronisation and removes the considerable delay between each input and its output that exists in the existing version. The proposed architecture eliminates the counter, multiplexer, and input and output registers needed for synchronisation in the existing architecture. In addition, after an initial latency, it delivers an output on every clock cycle for a series of inputs. To pipeline the design, delays (registers) are added between multiple copies of the rotation block, which shortens the critical path of the previous scale-free CORDIC architecture.
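The throughput behaviour described above can be checked with a cycle-level model. The sketch below is an illustration under stated assumptions, not the implemented design: `run_pipeline` and `stage_fn` are hypothetical names, each stage is assumed to take exactly one clock cycle, and one new input is assumed to arrive on every cycle.

```python
def run_pipeline(inputs, stage_fn, stages=7):
    """Model a register-separated pipeline and record when each output appears.

    stage_fn(k, value) is the combinational work of stage k.
    Returns a list of (cycle, output) pairs; None marks an empty register.
    """
    regs = [None] * stages
    outputs = []
    stream = list(inputs) + [None] * (stages - 1)   # extra cycles to drain the pipe
    for cycle, item in enumerate(stream, start=1):
        new_regs = [None] * stages
        new_regs[0] = stage_fn(0, item) if item is not None else None
        for k in range(1, stages):
            prev = regs[k - 1]                       # value latched on the previous edge
            new_regs[k] = stage_fn(k, prev) if prev is not None else None
        regs = new_regs                              # all registers update together
        if regs[-1] is not None:
            outputs.append((cycle, regs[-1]))
    return outputs

# Each stage adds 1, so an input v leaves the seven stages as v + 7.
print(run_pipeline([0, 10, 20], lambda k, v: v + 1))
# [(7, 7), (8, 17), (9, 27)]
```

With seven stages, N inputs finish in N + 6 cycles in this model, whereas a recursive block that occupies seven cycles per input would need 7N cycles.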
The proposed architecture implements the third-order Taylor series approximation of CORDIC with an input bit width of 16, which requires only seven iterations. A seven-stage pipelined version of the scale-free CORDIC has therefore been implemented. The sine and cosine functions are approximated as follows

$$\sin\alpha_i \approx 2^{-s_i} - \frac{2^{-3 s_i}}{3!}, \qquad \cos\alpha_i \approx 1 - 2^{-(2 s_i + 1)}$$

Here $s_i$ is the shift value calculated at every iteration, and the micro-rotation angle is $\alpha_i = 2^{-s_i}$.
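A single scale-free micro-rotation under the third-order approximation can be sketched as follows. The sketch assumes a micro-rotation angle of 2^(-s) for shift s and uses the exact third-order Taylor terms in floating point; the name `micro_rotate` is hypothetical, and how the division by 3! is realised with shifts in hardware is not addressed here.

```python
import math

def micro_rotate(x, y, s):
    """Rotate (x, y) by the angle 2**-s using third-order Taylor terms, with no scale factor."""
    a = 2.0 ** -s
    cos_a = 1.0 - a * a / 2.0        # 1 - 2**-(2s+1)
    sin_a = a - a ** 3 / 6.0         # 2**-s - 2**(-3s) / 3!
    return x * cos_a - y * sin_a, y * cos_a + x * sin_a

# Three micro-rotations with shift 2 rotate by 3 * 0.25 = 0.75 rad.
x, y = 1.0, 0.0
for s in (2, 2, 2):
    x, y = micro_rotate(x, y, s)
print(x, y, math.cos(0.75), math.sin(0.75))
```

Because the approximated sine and cosine are used directly, no gain accumulates that would have to be removed by a final multiplication; the price is the truncation error of the series, which grows as the shift gets smaller.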
Modified scale free CORDIC architecture
The algorithm for calculating the micro-rotation and shift value is shown in Figure 3.
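One possible way to select the shifts is a greedy rule, sketched below. This is not a transcription of the algorithm in the figure: it assumes a non-negative target angle, a basic shift of 2, a 16-bit word, and that each iteration picks the largest power of two not exceeding the residual angle, limited by the basic shift. The name `select_shifts` and its parameters are hypothetical.

```python
import math

def select_shifts(theta, basic_shift=2, max_iter=7, word_length=16):
    """Greedily choose shifts s_i so that the sum of 2**-s_i approaches theta (theta >= 0).

    Returns the list of shifts and the residual angle left unrotated.
    """
    shifts = []
    for _ in range(max_iter):
        if theta < 2.0 ** -(word_length - 1):
            break                        # residual is below the assumed resolution
        _, e = math.frexp(theta)         # theta = m * 2**e with 0.5 <= m < 1
        s = max(basic_shift, 1 - e)      # 2**-(1 - e) is the largest power of two <= theta
        shifts.append(s)
        theta -= 2.0 ** -s               # angle remaining for the next iteration
    return shifts, theta

print(select_shifts(0.75))
# ([2, 2, 2], 0.0)
```

The use of `math.frexp` mirrors a leading-one detector on the residual angle, which is the kind of operation a hardware version of such a rule would rely on.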
Field programmable gate array implementation
The proposed architecture has been designed in Verilog HDL and synthesized with Xilinx ISE 14.7 for implementation on a Virtex-5 device. The RTL schematic of the implemented architecture is shown in Figure 4. So that the number of slices used by the proposed architecture can be compared on a uniform basis with that of a pipelined version of the conventional CORDIC, both architectures are implemented on the Virtex-5.
Comparison of results
The pipelined version of the conventional CORDIC uses 914 slices, whereas the proposed pipelined scale-free architecture uses only 325 slices, a reduction of about 64%.
Conclusion
The proposed architecture is capable of pushing out an output on every clock cycle for a series of inputs, whereas the existing version delays the output by seven clock cycles for every input. The proposed design has only an initial latency of seven clock cycles before the first output is pushed out. The proposed architecture is therefore fast enough to be implemented in signal processing applications.
