Abstract: Although it was proposed more than 50 years ago, the COordinate Rotation DIgital Computer (CORDIC) remains one of the most effective algorithms for elementary function calculation. The original CORDIC, however, suffers from high latency because it always performs a fixed number of rotations. To address this, a low-latency hybrid adaptive (HA) CORDIC is proposed in this paper. Firstly, adaptive angle selection decreases the total number of iterations by up to 50% while also improving the accuracy of the results. Secondly, a hybrid architecture with fixed-point input and floating-point output reduces the total hardware utilization and enhances the dynamic range of the final results. Lastly, parallel and pipeline processing together with a resource sharing technique allow the design to operate at 175.7 MHz with low resource consumption: 1,139 LUTs and 489 registers.
Introduction
CORDIC, a simple and efficient iterative algorithm for computing elementary functions, was first introduced by Volder [1] in 1959 and later extended by Walther [2] in 1971. CORDIC requires only additions, subtractions, and shift operations, which makes it well suited to Very-Large-Scale Integration (VLSI) system design. For this reason, a vast amount of research on CORDIC algorithms and hardware solutions is still in progress, although CORDIC is more than 50 years old [3].
CORDIC-based Fast Fourier Transform (FFT) [4, 5] plays an essential role in most multimedia and wireless communication applications, where the evaluation of trigonometric functions is imperative. Fixed-point number representation is widely used in those systems because of its simple calculation and sufficient precision. However, some advanced applications, such as Synthetic Aperture Radar (SAR) data processing [6], require not only highly precise results but also a format suited to a wide dynamic range of numbers. In those systems, floating-point representation is deployed instead of fixed-point to retain the resolution and accuracy of real numbers. Many approaches to highly efficient floating-point CORDIC have therefore been proposed recently. D. M. Munoz et al. [7] described a floating-point CORDIC architecture with two operation modes that can compute the sine, cosine, or arctangent function. The design achieved an operating frequency of 86.1 MHz with an elapsed time of around 90 clock cycles in the single-precision configuration. P. Surapong et al. [8] proposed 8- and 16-stage pipelined floating-point CORDIC designs for a phase and magnitude detector. The 16-stage system uses as much as 23% of the slice registers and 38% of the slice lookup tables (LUTs) of a Xilinx Virtex-5, whereas its maximum frequency is only 133.8 MHz. Both Nikhil Dhume et al. [9] and Jie Zhou et al. [10] presented a hybrid approach that first converts the floating-point input into fixed-point format. A fixed-point CORDIC then computes the trigonometric functions, and the results are finally transformed into IEEE 754 floating-point format.
To reduce the error in the results of a CORDIC system, more iterations must be performed. However, the number of iterations and the number of clock cycles per iteration significantly affect the latency of the CORDIC algorithm. Approaches that enhance the precision without sacrificing latency have therefore become increasingly attractive. In 1993, Y. H. Hu et al. [11] proposed a method called angle recoding, which could reduce the number of iterations by 50%. K. R. Terence et al. [12] proposed a parallel angle recoding method that converges to the final result in the least number of iterations. This method chooses all angle constants in one step but requires a large amount of comparison logic. P. K. Meher et al. [13] used an angle recoding scheme for rotation by a fixed angle, which could be used in specific application areas such as robotics, graphics, games, and animation. R. Shukla et al. [14] proposed a new low-latency CORDIC algorithm that combines two existing algorithms and can reduce the number of iterations to (3n/8 + 1), where n is the bit precision. Likewise, another iteration-reducing algorithm was proposed by S. Aggarwal et al. [15], which could be applied to a waveform generator.
In this paper, a low-latency hybrid adaptive CORDIC with floating-point precision is proposed. The proposed scheme also achieves low resource consumption. The contributions of this research are as follows.
• Low-resource: The hybrid architecture takes a 24-bit fixed-point input angle in degrees and produces 32-bit IEEE 754 floating-point sine/cosine outputs. This architecture aims to balance calculation accuracy against resource utilization. Furthermore, in the floating-point arithmetic, a resource sharing technique reduces the logic utilization by around 70% in comparison with the Altera library implementation of the same function.
• Low-latency: Parallel processing is applied across the fixed- and floating-point components, and pipeline processing is deployed within each of them to improve both throughput and latency. Moreover, the adaptive technique, which reaches the final result with the least number of angle constants and thus fewer iterations, plays an important part in reducing the latency.
• High-precision: The adaptive technique and the hybrid architecture improve not only latency and resource consumption, respectively, but also the precision of the proposed system.
2 Proposed CORDIC Algorithm
2.1 Overview of conventional CORDIC algorithm
The CORDIC algorithm has two operation modes: vectoring and rotation. This paper focuses on the rotation mode to calculate the sine and cosine of an input angle Φ. The initial vector of the CORDIC algorithm is V_0(x_0, y_0). At each micro-rotation i, this vector is rotated by an angle constant θ_i so that the residual angle z_i approaches zero. The equation for x_{i+1} and y_{i+1} in each iteration of the rotation mode, which does not contain the gain factor k_i, is (1).
The gain factor K (2) eliminates the multiplication by k_i at each iteration i, where N is the total number of predetermined angle constants.
Finally, the trigonometric results are derived from x_{N−1}, y_{N−1}, and K through (3), and the sum of the rotations θ_i closely approximates the input angle Φ.
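The conventional rotation-mode iteration described above can be sketched in Python as follows. The recurrence and the gain factor are the standard textbook CORDIC forms, which are assumed here to correspond to (1) to (3); the function name, the initial vector (1, 0), and the use of degrees for the angles are illustrative choices rather than details taken from the design.

```python
import math

def cordic_rotation(phi_deg, n=16):
    """Conventional rotation-mode CORDIC: always performs n micro-rotations."""
    # Angle constants theta_i = atan(2^-i), expressed in degrees.
    thetas = [math.degrees(math.atan(2.0 ** -i)) for i in range(n)]

    # Gain factor K = product of k_i, with k_i = 1 / sqrt(1 + 2^(-2i)).
    K = 1.0
    for i in range(n):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))

    # Assumed initial vector V0 = (1, 0); the residual angle starts at phi.
    x, y, z = 1.0, 0.0, phi_deg
    for i in range(n):
        d = 1.0 if z >= 0 else -1.0          # rotate so that z moves towards zero
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * thetas[i]

    # Scale once by K at the end instead of by k_i in every iteration.
    return K * x, K * y, z                   # (cosine, sine, final residual)
```

Every input angle costs the same n rotations here, which is the fixed latency that the adaptive scheme aims to avoid.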
Proposed Hybrid Adaptive (HA) CORDIC Algorithm
The HA-CORDIC algorithm is proposed to reduce the number of iterations and thereby the latency. In other words, only a few angles from the set of N angle constants are used to form the closest result. The proposed four-step algorithm for sine/cosine calculation is shown in Fig. 1. First, the input angle is converted into a predefined range. Secondly, an optimum set of micro-rotations whose sum approximates the input angle is selected. The coordinate (x, y) and the factor K are then updated correspondingly. Finally, the sine and cosine values are obtained from the product of the latest x_t, y_t, and K, together with simple adjustments.
Angle Normalization
The CORDIC algorithm converges only over a limited range of input values. In rotation mode, convergence is guaranteed for angles below the sum of all N angle constants, i.
For example, when the input angle is 30°, the residual angle of the conventional CORDIC is 8.34e-4. The proposed method, however, requires only five iterations to achieve a smaller residual angle, 2.45e-4.
The key point of this method is that in each iteration i, the angle constant θ_i is selected so that the residual angle z_i converges to zero. The iteration stops once z_i is smaller than a predefined threshold.
To choose the angle constant in each iteration, this method uses a set of parameters, C, which expresses the range of residual angles around one angle constant, as defined in (5). The details of C are given in Table I. On this basis, the pseudocode that determines which angle constant is chosen is summarized in Fig. 2.
2.2.3 Pre-calculation: Coordinate, Residual, And Factor Adaption
The adaption of the coordinate (x, y), the residual z, and the factor K is described in (6). Because only a few angles are selected, the factors 2^{−i}, θ_i, and k_i are replaced by 2^{−j}, θ_j, and k_j, respectively. Both j and θ_j are obtained from Fig. 2, whereas k_j is listed in Table I. Once the residual is smaller than the defined threshold, the latest coordinate (x_t, y_t) and the gain factor K are available. Finally, the X and Y values are obtained by (7).
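The selection and adaption steps can be illustrated with a short Python sketch. Since (5), Table I, and Fig. 2 are not reproduced here, the sketch makes two assumptions that should not be read as the paper's exact rule: the constant nearest to |z| stands in for the lookup against the ranges C, and the stopping threshold defaults to half of the smallest angle constant. The function name `ha_cordic` and the initial vector (1, 0) are likewise illustrative.

```python
import math

def ha_cordic(phi_deg, n=16, threshold=None):
    """Adaptive rotation for a normalized angle.

    Returns (cosine, sine, residual angle, number of rotations performed).
    """
    thetas = [math.degrees(math.atan(2.0 ** -j)) for j in range(n)]
    if threshold is None:
        # Assumed stopping threshold: half of the smallest angle constant.
        threshold = thetas[-1] / 2.0

    x, y, z, K = 1.0, 0.0, phi_deg, 1.0
    rotations = 0
    # The rotation cap guards against a threshold that can never be reached.
    while abs(z) >= threshold and rotations < n:
        # Assumed selection rule: pick the angle constant nearest to |z|.
        j = min(range(n), key=lambda k: abs(abs(z) - thetas[k]))
        d = 1.0 if z >= 0 else -1.0
        # Adaption of coordinate, residual, and gain factor with index j.
        x, y = x - d * y * 2.0 ** -j, y + d * x * 2.0 ** -j
        z -= d * thetas[j]
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * j))   # k_j
        rotations += 1
    return K * x, K * y, z, rotations
```

Under these assumptions, `ha_cordic(30.0)` selects the constants with j = 1, 4, 9, 11, and 15, so it stops after five rotations with a residual of about 2.45e-4 degrees, in line with the 30° example given earlier.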
2.2.4 Post-calculation: Sine/cosine Recovery
Because the input angle is normalized before the coordinate (X, Y) is calculated, the final sine and cosine results of the input angle must be recovered from the (X, Y) values.
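Normalization and recovery can be sketched together in Python. The sketch assumes that the angle is folded into [0°, 45°] by octant, with odd octants mirrored, and that the recovery information is simply the octant index; the actual encoding of the recovery information is defined in Fig. 5a and may differ. The names `normalize` and `recover` are hypothetical.

```python
import math

def normalize(phi_deg):
    """Fold any angle into [0, 45] degrees; return (folded angle, octant 0..7)."""
    a = phi_deg % 360.0
    # min() guards the rare case where rounding makes a equal to 360.0.
    octant = min(int(a // 45.0), 7)
    r = a - 45.0 * octant
    # Odd octants are mirrored so that the folded angle stays in [0, 45].
    folded = 45.0 - r if octant % 2 else r
    return folded, octant

def recover(c, s, octant):
    """Map (cos, sin) of the folded angle back to (cos, sin) of the input."""
    table = [( c,  s), ( s,  c), (-s,  c), (-c,  s),
             (-c, -s), (-s, -c), ( s, -c), ( c, -s)]
    return table[octant]
```

Recovery needs only swaps and sign changes of the two raw results, so no additional arithmetic is required after the rotation.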
3 Proposed Hardware Architecture
Overview
The proposed design is composed of four main modules, namely ANGLE SELECTION (ASEL), FIFO, PRE CALCULATION (PREC), and POST CALCULATION (POSC), which are illustrated in Fig. 3a. The input is a 24-bit fixed-point (FIX) angle in 1.8.15 format, i.e. 1 sign bit, 8 magnitude bits, and 15 LSBs, and the outputs are two 32-bit floating-point (FLP) trigonometric results. The processing of HA-CORDIC is separated into two parallel threads, as can be seen in Fig. 3b. Because the two threads differ in operating cycles, a FIFO is inserted between ASEL and PREC to absorb the latency difference. Specifically, ASEL takes one clock cycle per FIX operation, whereas PREC/POSC take two clock cycles per FLP operation. In addition, pipeline processing is applied to all modules to increase the throughput. Because the number of iterations is determined in ASEL, the execution latency differs markedly from one angle to another. Each module is described in more detail below.
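To make the 1.8.15 input format concrete, the following Python sketch packs an angle in degrees into a 24-bit word and unpacks it again. It assumes a sign-magnitude layout with 8 integer bits and 15 fractional bits; whether the hardware uses sign-magnitude or two's complement is not stated, and the helper names are hypothetical.

```python
FRAC_BITS = 15   # fractional bits (the 15 LSBs)
MAG_BITS = 8     # integer magnitude bits

def encode_angle(phi_deg):
    """Pack an angle in degrees into a 24-bit 1.8.15 word (sign-magnitude assumed)."""
    sign = 1 if phi_deg < 0 else 0
    mag = round(abs(phi_deg) * (1 << FRAC_BITS))
    if mag >= 1 << (MAG_BITS + FRAC_BITS):
        raise ValueError("angle magnitude does not fit in 8.15 bits")
    return (sign << (MAG_BITS + FRAC_BITS)) | mag

def decode_angle(word):
    """Unpack a 24-bit 1.8.15 word back into degrees."""
    sign = (word >> (MAG_BITS + FRAC_BITS)) & 1
    mag = word & ((1 << (MAG_BITS + FRAC_BITS)) - 1)
    value = mag / (1 << FRAC_BITS)
    return -value if sign else value
```

With 15 fractional bits the input resolution is 2^-15 degrees, roughly 3.05e-5 degrees.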
Angle Selection (ASEL)
Module ASEL strives to obtain a precise result with the least number of iterations by reducing divergent pseudo-rotations. This module is composed of three main components, ANGLE NORMALIZER (ANOR), SET NEXT ROTATION (SNR), and CHECK LAST ROTATION (CLR), which are depicted in Fig. 4. First, ANOR converts the input angle iData into the normalization range [0°, 45°]. As can be seen in Fig. 5a, if a circle is split into eight pieces, any angle from the first to the seventh piece can be transformed into an equivalent angle in the zeroth piece. The recovery information (Rec. info) is transferred to the post-calculation to adjust the final result. By using the Correction equation, the final sine/cosine results can be obtained. After receiving the normalized angle, SNR determines the next angle in ROM THETA by using ROM C together with a priority encoder, as illustrated in Fig. 5b. Module ROM THETA holds 16 angle constants θ, while ROM C stores the range of residual angles C around each angle constant, as defined in (5). To eliminate the while loop in the pseudocode of Fig. 2, a set of comparators is deployed in parallel together with a priority encoder to find the next suitable angle θ in one cycle. The iteration completes as soon as the residual angle register Z is smaller than the threshold.
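The comparator bank and priority encoder in SNR can be modelled in Python as shown here. The contents of ROM C are an assumption: each entry is taken as the midpoint between an angle constant and the next smaller one, so that the encoder returns the constant nearest to the residual. The real values follow (5) and Table I.

```python
import math

N = 16
# ROM THETA: angle constants atan(2^-j) in degrees.
THETA = [math.degrees(math.atan(2.0 ** -j)) for j in range(N)]
# Assumed ROM C: lower edge of the range around each constant, taken as the
# midpoint to the next smaller constant.
C = [(THETA[j] + THETA[j + 1]) / 2.0 for j in range(N - 1)]

def select_index(z_abs):
    """Single-pass search for the next angle constant index."""
    # In hardware all comparisons happen in parallel in the same cycle.
    hits = [z_abs >= c for c in C]
    # Priority encoder: the lowest index with an asserted comparator wins.
    for j, hit in enumerate(hits):
        if hit:
            return j
    return N - 1          # below every edge: use the smallest constant
```

Because each comparator depends only on the current residual and a ROM entry, the sequential loop of the pseudocode collapses into one combinational step.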
At each iteration i, CLR checks whether the current rotation is the last one. If it is, CLR signals ASEL to stop calculating and to start rotating the next input angle in the following cycle. The pseudocode of this circuit is described in Fig. 6. If the input angle Φ is approximately zero, no rotation is executed (lines 5 and 6). If z is within the determined range, the updated z will become smaller than the threshold in the next rotation (lines 7 to 10). Because addition and subtraction affect the circuit frequency, two LUTs, ROM s (θ_i − threshold) and ROM a (θ_i + threshold), are implemented instead. In this hardware system, module SNR and CLR
Lastly, a collection of 4-bit norm info, 2-bit last rotation, 1-bit sign, and 4-bit phase addr is brought together and put into the FIFO. At the same time, CONTROL LOGIC gets the FIFO to accept data by asserting FF wrreq. If the FIFO is not available or the current angle is still in progress, oReady goes low and the CORDIC cannot accept a new input angle.
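The last-rotation check performed by CLR can be sketched in Python with two precomputed tables in place of run-time addition and subtraction. The threshold value, the three-way return value, and the function name are assumptions made for illustration; the actual behaviour is given by the pseudocode in Fig. 6.

```python
import math

N = 16
THETA = [math.degrees(math.atan(2.0 ** -j)) for j in range(N)]
THRESHOLD = THETA[-1] / 2.0                  # assumed threshold value
ROM_S = [t - THRESHOLD for t in THETA]       # theta_i - threshold
ROM_A = [t + THRESHOLD for t in THETA]       # theta_i + threshold

def check_last_rotation(z_abs, j):
    """Classify the pending rotation by THETA[j] using comparisons only."""
    if z_abs < THRESHOLD:
        return "none"     # residual already close to zero: no rotation needed
    if ROM_S[j] < z_abs < ROM_A[j]:
        return "last"     # after this rotation the residual is below threshold
    return "more"         # further rotations will follow
```

For instance, a residual of about 0.0015 degrees paired with j = 15 is classified as the last rotation, since rotating by the smallest constant leaves a residual below the assumed threshold.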
Pre-calculation (PREC)
PREC contains four main components: Floating-point Adder Subtractor (FADD SUB), Floating-point Multiplier k_i (FMUL ki), Floating-point Multiplier XY K (FMUL XYK), and CONTROL LOGIC, as shown in Fig. 7a.
FADD SUB calculates X and Y according to i and signZ in each iteration, as illustrated in Fig. 7b. The initial values of X and Y are set to one and zero, respectively, immediately after FADD SUB is reset. FADD SUB is activated by asserting start for two clock cycles while both signZ and i are held. At the same time, phase serves as the control signal for multiplexing the data path during the operation. The 4-stage shifter performs a right shift with zeros filled into the vacated MSBs, because it operates on the fraction parts of x and y. The sign decision block checks signZ, phase, and the signs of the previous X and Y data to decide whether the Carry Look Ahead (CLA) adder performs an addition or a subtraction. The two's complement block then corrects the result if it is a negative number. Finally, all of this information is combined to produce the sign of the result, which completes the operation of the module.
FMUL ki produces the gain factor K by multiplying in the step factor k_i at each iteration i, as shown in Fig. 7c. The register RegK in module FMUL ki is initialized to one and is sequentially updated with the product of the previous K and the ROM K entry listed in Table I. FMUL ki also requires start to be delayed by two clock cycles for its pipelined computation.
FMUL XYK calculates the products of the latest X/Y and K; its circuit is illustrated in Fig. 7d. The last signal is asserted for two clock cycles while X and Y are held in order to enable FMUL XYK. The raw cosine and sine results are ready in the third and fourth clock cycles, respectively. The pre-calculated sine and cosine results are stored in registers regX and regY. The proposed hardware uses parallel and pipeline processing together with resource sharing techniques, which improve latency, throughput, and hardware utilization significantly.
Post-calculation (POSC)
The raw sine and cosine values are combined with rec info to form the final trigonometric results. The recovery information rec info given in Fig. 5a is used to select the suitably adjusted pre sin or pre cos.
Experimental Results
The performance of the proposed CORDIC is evaluated in two aspects: algorithm and hardware design. The first assessment shows that the proposed algorithm requires fewer iterations but achieves higher precision than the original CORDIC. The second assessment compares this work with other floating-point systems in terms of latency and resource utilization.
Evaluation of Algorithm
To assess the algorithm, 9001 angles in the range [−45°, 45°] are generated. The total number of iterations is observed for three different numbers of angle constants N. As can be seen in Fig. 8, the proposed HA-CORDIC requires at most 4, 6, and 8 micro-rotations for N = 8, 12, and 16, respectively. In other words, in the worst case, the HA-CORDIC is still 2.7X,
The precision of the two algorithms is measured by the mean square error (MSE) of the residual angle. The closer the residual angle is to zero, the higher the precision of the sine/cosine results. As can be seen in Fig. 9, the precision of both methods increases as N varies from 8 to 16. Nevertheless, the proposed CORDIC always delivers a smaller MSE value, 2.51e-7, than the conventional one, 1.02e-6.
Evaluation of Hardware Design
The proposed HA-CORDIC is synthesized with Altera Quartus 14.0 for a Stratix IV FPGA target. The latency, resource utilization (LUT, Register, Memory), and operating frequency of HA-CORDIC and the other floating-point CORDIC systems are listed in Table II. Unlike the original CORDIC methods, the latency of HA-CORDIC varies because the number of rotations is dynamic, but it is much shorter than that of the other designs. At N = 16, it costs 12, 20, and 26 clock cycles in the best case (zero or one rotation), the typical case (five rotations), and the worst case (eight rotations), respectively. In addition, HA-CORDIC uses considerably fewer resources than the other designs while operating at 175.7 MHz.
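The reported cycle counts can be turned into wall-clock latency with a small Python calculation. The linear model below, 10 fixed cycles plus 2 cycles per rotation, is an assumption fitted to the three reported cases at N = 16 and is not a formula given by the design.

```python
F_MAX_MHZ = 175.7   # reported operating frequency

def latency_cycles(rotations):
    """Cycle count fitted to the best, typical, and worst cases at N = 16."""
    # Assumption: 10 fixed cycles plus 2 cycles per rotation; zero rotations
    # cost the same as one, since both are reported as the best case.
    return 10 + 2 * max(rotations, 1)

def latency_ns(rotations):
    """Latency in nanoseconds at the reported operating frequency."""
    return latency_cycles(rotations) * 1e3 / F_MAX_MHZ

for r in (0, 1, 5, 8):
    print(r, latency_cycles(r), round(latency_ns(r), 1))
```

At 175.7 MHz the best, typical, and worst cases correspond to roughly 68 ns, 114 ns, and 148 ns, respectively.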
Conclusion
In this paper, a low-latency hybrid adaptive CORDIC with floating-point precision is proposed. Because of adaptive angle selection, the proposed HA-CORDIC achieves lower latency and more precise trigonometric results than the original CORDIC. Furthermore, the hybrid architecture, which takes a fixed-point input angle in degrees and produces two floating-point sine/cosine outputs, not only reduces the total hardware resources but also enhances the precision of the results. The experiments show that the design is fully operational at 175.7 MHz and has a latency of 12 and 26 cycles in the best and worst case, respectively, for N = 16. In addition, HA-CORDIC can be integrated easily into other advanced systems owing to its acknowledge signals and low resource consumption: only 1,139 LUTs and 489 registers.
