Abstract-This article presents a composed architecture for the CORDIC algorithm. CORDIC is a widely used technique for calculating basic trigonometric functions using only additions and shifts. The composed architecture combines an initial coarse stage that approximates the sine and cosine functions with a second stage that fine-tunes those values while CORDIC operates in rotation mode. Both stages help reduce the number of algorithmic steps required to fully execute the CORDIC algorithm. The Xilinx CORDIC logiCORE IP and previously reported research are used for comparison. The key objective of the proposed architecture is to reduce hardware resource usage.
I. INTRODUCTION
Complex digital systems use trigonometric functions as a fundamental component. These functions are widely used in many areas including digital image and signal processing, cryptography and watermarking. Several approaches have been developed to calculate trigonometric functions, most of them based on polynomial or rational approximations which are not easily mapped into hardware architectures.
CORDIC (Coordinate Rotation Digital Computer) is an iterative algorithm designed to calculate trigonometric functions using basic addition and shift operations, a characteristic that makes it suitable for hardware architecture design [1, 2]. The algorithm can be configured to operate in vectoring or rotation mode in several coordinate systems, which also makes it possible to calculate hyperbolic functions. In rotation mode, a vector is iteratively rotated to obtain a final vector whose components correspond to the sine and cosine of an input angle. A new vector is obtained at every iteration, after a small rotation (micro-rotation). Each micro-rotation has a specific direction, which is calculated at every step from the sign of an angular variable. In vectoring mode, the algorithm follows similar steps but it also allows calculating divisions and logarithmic functions.
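As an illustration of the rotation mode just described, the following is a minimal floating-point sketch of the textbook CORDIC iteration in circular coordinates. It is not the architecture proposed in this paper; the function name `cordic_rotation` and the default of 16 iterations are assumptions made for the example, and the multiplications by `2.0 ** -i` stand in for the right shifts a hardware implementation would use.

```python
import math

def cordic_rotation(angle, iterations=16):
    """Return (cos(angle), sin(angle)) for |angle| up to roughly 1.74 rad."""
    # Elementary angles arctan(2^-i) and the accumulated gain of all micro-rotations.
    alphas = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    # Start on the x axis, pre-scaled so that the final vector has unit length.
    x, y, z = 1.0 / gain, 0.0, angle
    for i in range(iterations):
        # The direction comes from the sign of the angular variable z.
        d = 1.0 if z >= 0.0 else -1.0
        # 2.0 ** -i models a right shift by i bits.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * alphas[i]
    return x, y
```

After the last iteration the residual angle is bounded by the smallest elementary angle, so each additional iteration contributes roughly one more bit of precision.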
At the application level, CORDIC is widely used in signal processing [3] [4] [5] [6]. In [3], CORDIC compensation is implemented instead of multiplications, which significantly reduces hardware complexity. In [4], the CORDIC algorithm is included in an efficient filtering design where power-of-two coefficients are calculated. In [5], the CORDIC algorithm is applied directly in the implementation of a Givens rotation module, improving computational time and decreasing complexity. In [6], an adaptive CORDIC rotator with a constant scaling factor is proposed with the aim of reducing resource usage.
This work was supported partially by the Mexican National Council for Science and Technology (CONACYT) through grant number 261243.
Researchers have focused on improving the CORDIC algorithmic core and its implementation. Algorithmic improvements commonly consist of reducing the number of rotations [7] [8], modifying the scaling factor [9] [10] or changing how rotation directions are determined [8, 11, 12]. On the other hand, the most popular improvements in the hardware arena are related to operational frequency, throughput and occupied area [13].
In this paper, a composed architecture is proposed. It combines a Lookup Table (LUT) with a rotation prediction module and a CORDIC pipeline module. The algorithmic approach takes an input angle and obtains a coarse initial approximation of its sine and cosine from the LUT. The remaining rotations are algorithmically predicted, and the final result is approximated with the CORDIC pipeline module. Results show a reduction in the number of rotations and in hardware resource usage per iteration.
The paper is organized as follows. Section II reviews related work. In Section III, the CORDIC algorithm is presented, including specifics for the proposed architectural approach. Section IV introduces the proposed composed hardware architecture design, followed by the results obtained in Section V. Section VI presents the conclusions of this research.
II. RELATED WORK
Related work is extensive and varied. Among the approaches aiming to improve CORDIC's algorithmic performance, Lakshmi et al. [7] proposed a pipelined architecture for radix-4 CORDIC rotations. This VLSI approach improves on previously proposed radix-2 techniques in terms of latency and throughput by using redundant arithmetic and higher-radix techniques. Originally, radix-4 rotations for a second stage of small rotations were used to accelerate the radix-2 CORDIC algorithm [9]. Lee et al.'s approach was modified to use radix-4 for the entire set of rotations, reducing the number of iterations and the hardware resource usage. In [8], a modification of the angle recoding method is presented that avoids an increase in cycle time and allows an arbitrary input angle. This approach calculates all angle constants in one step by comparing the input angle to several adjacent rotation ranges. The lower the number of comparison ranges, the lower the number of cycles during the iterative process. In [11], the authors follow a similar approach to improve CORDIC by reducing the number of required micro-rotations when input angles with a large bit-width (64 bits) are processed. The improvement comes from recoding two bits of the input angle concurrently, leading to a 21% reduction in area/delay. In [10], an adaptive approach is proposed that executes only the necessary iterations, achieving a 50% reduction while maintaining a constant scaling factor. A reduction in hardware resource usage is also reported.
The most popular improvements in the hardware arena are related to operational frequency, throughput and occupied area. In [14], an analysis of standard CORDIC implementations is carried out to reduce the interconnection delay. Several reconfigurable platforms are used to implement three sign pre-computation methods: Para-CORDIC [12], P-CORDIC [15] and Flat-CORDIC [16]. Results showed that P-CORDIC performs better on newer devices, whereas Flat-CORDIC and Para-CORDIC perform better on older FPGA devices.
III. CORDIC BASIS
CORDIC's algorithmic approach performs vector rotations by arbitrary angles using shifts and additions. The algorithm is based on a general rotation transformation with angles restricted to $\tan(\theta_i) = \pm 2^{-i}$, reducing multiplications to shift operations. Thus, arbitrary angles are obtained by applying a series of micro-rotations. Basic CORDIC equations are shown in (1) and (2):

[...] Table, the input value for this stage is the input angle ( ). On a second stage, rotation directions are obtained by a module implementing the P-CORDIC algorithm [15]. The pre-calculated rotations are then applied to determine the final trigonometric values using a CORDIC pipeline module.
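To make the idea of a coarse LUT stage followed by fine CORDIC rotations concrete, the sketch below shows one plausible software model of such a composition. It is an assumption-laden illustration, not the authors' design: the coarse resolution `COARSE_BITS`, the iteration count `ITERATIONS`, the restriction to the first quadrant, and the choice to skip the first `COARSE_BITS` micro-rotations are all hypothetical. The LUT holds pre-scaled sine and cosine values of the coarse angle, and only the small residual angle is handled by the remaining micro-rotations.

```python
import math

COARSE_BITS = 4    # hypothetical: coarse angle resolution of 2^-4 rad
ITERATIONS = 16    # hypothetical: index of the last micro-rotation plus one

def build_lut(coarse_bits=COARSE_BITS, iterations=ITERATIONS):
    """Tabulate pre-scaled (cos, sin) of the coarse angles in [0, pi/2]."""
    # Gain of the micro-rotations that remain after the coarse stage.
    gain = 1.0
    for i in range(coarse_bits, iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    step = 2.0 ** -coarse_bits
    entries = math.ceil((math.pi / 2) / step) + 1
    return [(math.cos(k * step) / gain, math.sin(k * step) / gain)
            for k in range(entries)]

def composed_sincos(angle, lut, coarse_bits=COARSE_BITS, iterations=ITERATIONS):
    """Return (cos, sin) of an angle in [0, pi/2] using LUT plus fine rotations."""
    step = 2.0 ** -coarse_bits
    k = int(angle / step)        # coarse index, obtained by truncation
    x, y = lut[k]                # coarse approximation from the table
    z = angle - k * step         # residual angle, in [0, step)
    # Only micro-rotations coarse_bits .. iterations-1 are executed.
    for i in range(coarse_bits, iterations):
        d = 1.0 if z >= 0.0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    return x, y
```

With these example parameters, 12 micro-rotations are executed instead of 16, at the cost of a 27-entry table for the first quadrant.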
B. Rotations Prediction
The second stage in the proposed architecture performs the P-CORDIC algorithm, which helps to speed up CORDIC computation by predicting the sequence of directions for all the rotations to be performed. Rotation directions are calculated in equation (4) by adding the input angle ( ), a constant (stored in the current stage) and an adjustment variable (  ), which is calculated previously and stored in the LUT module. A detailed view of the P-CORDIC module is shown in Figure 2(a). The P-CORDIC algorithm is what allows the z datapath to be removed from the CORDIC algorithm in the proposed architecture: since the sequence of rotation directions is known in advance, it is not necessary to check the sign of the angular variable at every iteration, and the z datapath can be eliminated.
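The reason the directions can be fixed ahead of time is that the angular recurrence depends only on the input angle and never on the vector components. The sketch below illustrates this property by simply running the angular recurrence on its own; it is not the P-CORDIC prediction of equation (4), which obtains the directions through additions rather than an iterative sign test. The function name `rotation_directions` and the default iteration count are assumptions for the example.

```python
import math

def rotation_directions(angle, iterations=16):
    """Return the +1/-1 direction of every micro-rotation for `angle`."""
    directions = []
    z = angle
    for i in range(iterations):
        # The direction depends only on the sign of the angular variable.
        d = 1 if z >= 0.0 else -1
        directions.append(d)
        z -= d * math.atan(2.0 ** -i)
    return directions
```

Once such a sequence is available, the stages that update the vector components no longer need an angular register or a per-iteration sign comparison.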
C. CORDIC Pipeline
The final stage in the proposed architecture implements a pipeline for the CORDIC algorithm; details are shown in Figure 2(b). This implementation only uses the x and y datapaths; every rotation direction is obtained from the previous module and [...]
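A pipeline that carries only the two vector components can be modelled in software as a chain of shift-and-add stages driven by an externally supplied direction sequence. The following fixed-point sketch is a hedged illustration of that idea, not a description of Figure 2(b): the 16 fractional bits in `FRAC_BITS`, the function name `xy_pipeline`, and the pre-scaled starting vector are assumptions.

```python
import math

FRAC_BITS = 16  # hypothetical fixed-point format with 16 fractional bits

def xy_pipeline(directions, frac_bits=FRAC_BITS):
    """Apply precomputed +1/-1 directions using integer shifts and adds only."""
    gain = 1.0
    for i in range(len(directions)):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    # Pre-scaled start vector; in hardware this would be a fixed constant.
    x = round((1.0 / gain) * (1 << frac_bits))
    y = 0
    for i, d in enumerate(directions):
        # One pipeline stage: no angular register and no sign test.
        if d > 0:
            x, y = x - (y >> i), y + (x >> i)
        else:
            x, y = x + (y >> i), y - (x >> i)
    return x, y
```

The returned integers represent the cosine and sine scaled by `2 ** frac_bits`; truncation in the shifts adds a small error on top of the residual angle error.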
V. RESULTS DISCUSSION
FPGA devices were chosen as the implementation platform because of their proven advantages, such as fast prototyping and advanced reconfigurability. The design was coded in VHDL, and the tools used to implement the architecture were Mentor's ModelSim, Xilinx System Generator, Matlab/Simulink and Xilinx's ISE 13.2.
Previously proposed approaches and the CORDIC logiCORE IP by Xilinx are used as references [13], [17] [18]. Table 1 shows results in terms of hardware resources and maximum operational frequency. The proposed architecture and the CORDIC logiCORE IP are synthesized for Xilinx Spartan 3 (XC3S50-5) and Spartan 6 (XC6SLX45-2) devices for direct comparison. However, the approach presented in [13] is synthesized for Spartan 2E, which is not available in Xilinx's ISE 13.2. On a Xilinx Spartan 3, the proposed architecture achieves a significant reduction in logic resources: 49% fewer slices and 75% fewer FFs than the logiCORE IP [18], and 26% fewer slices and 65% fewer FFs than the approach reported in [17]. A reasonable operational frequency, limited by the LUT access latency, is maintained. On a Spartan 6, the logiCORE IP [18] uses 40% more occupied slices and 70% more FFs than the proposed approach. In terms of frequency, the logiCORE IP achieves higher frequencies; however, its hardware resource usage is significantly greater.

An example of an application domain suitable for the proposed CORDIC architecture is a hard real-time problem such as GPS attitude determination for vehicle navigation [19]. In [20], a high-performance hardware architecture is proposed to tackle this problem. Its most expensive sub-module has the CORDIC algorithm at its core and takes 116 clock cycles to calculate the attitude parameters. The reported throughput is approximately 48.2 μs. The proposed architecture would take 72 clock cycles to calculate the attitude parameters, improving the overall throughput to 30.6 μs, considering the same implementation technology used in [20]. In the next section, conclusions of this research are presented.
VI. CONCLUSIONS
An efficient and novel hybrid architecture for the CORDIC algorithm was described. The design strategy combines a LUT, which provides a coarse initial approximation of the basic trigonometric functions, with the elimination of the z datapath by predicting the sequence of directions for all the rotations to be performed. It provides several advantages, such as a reduction in the number of necessary rotations and in the usage of hardware resources per iteration, while offering significant throughput and a reduction in the occupied area. Implementation results show that the architecture offers a good balance between high performance and low area complexity. The highly efficient resource usage achieved by the proposed architecture makes it suitable for low-precision systems with limited resources, such as mobile devices.
