I. INTRODUCTION

COordinate Rotation DIgital Computer is abbreviated as CORDIC. The key concept of CORDIC arithmetic is based on the simple and ancient principles of two-dimensional geometry, but the iterative formulation of a computational algorithm for its implementation was first described in 1959 by Jack E. Volder [1], [2] for the computation of trigonometric functions, multiplication and division. This year therefore marks the completion of 50 years of the CORDIC algorithm. Over these 50 years, not only has a wide variety of applications of CORDIC emerged, but a lot of progress has also been made in algorithm design and in the development of architectures for high-performance and low-cost hardware solutions for those applications. CORDIC-based computing received increased attention in 1971, when John Walther [3], [4] showed that, by varying a few simple parameters, it could be used as a single algorithm for unified implementation of a wide range of elementary transcendental functions involving logarithms, exponentials, and square roots along with those suggested by Volder [1]. During the same time, Cochran [5] benchmarked various algorithms, and showed that the CORDIC technique is a better choice for scientific calculator applications.
The popularity of CORDIC was very much enhanced thereafter, primarily due to its potential for efficient and low-cost implementation of a large class of applications which include: the generation of trigonometric, logarithmic and transcendental elementary functions; complex number multiplication, eigenvalue computation, matrix inversion, solution of linear systems and singular value decomposition (SVD) for signal processing, image processing, and general scientific computation. Some other popular and upcoming applications are: 1) direct frequency synthesis, digital modulation and coding for speech/music synthesis and communication; 2) direct and inverse kinematics computation for robot manipulation; 3) planar and three-dimensional vector rotation for graphics and animation. Although CORDIC may not be the fastest technique to perform these operations, it is attractive due to the simplicity of its hardware implementation, since the same iterative algorithm could be used for all these applications using the basic shift-add operations of the form $a \pm b \cdot 2^{-i}$. Keeping the requirements and constraints of different application environments in view, the development of CORDIC algorithms and architectures has aimed at achieving high throughput rate and reducing the hardware-complexity as well as the latency of implementation. Some of the typical approaches for reduced-complexity implementation are focussed on minimization of the complexity of the scaling operation and the complexity of the barrel-shifter in the CORDIC engine. Latency of implementation is an inherent drawback of the conventional CORDIC algorithm. Angle recoding schemes, mixed-grain rotation and higher radix CORDIC have been developed for reduced latency realization. Parallel and pipelined CORDIC have been suggested for high-throughput computation.
The objective of this article is not to present a detailed survey of the developments of algorithms, architectures and applications of CORDIC, which would require a few doctoral and masters level dissertations. Rather we aim at providing the key developments in algorithms and architectures along with an overview of the major application areas and upcoming applications. We shall however discuss here the basic principles of CORDIC operations for the benefit of general readers.
The remainder of this paper is organized as follows. In Section II, we discuss the principles of CORDIC operation, covering the elementary ideas from coordinate transformation to rotation mode and vectoring mode operations followed by design of the basic CORDIC cell and multidimensional CORDIC. The key developments in CORDIC algorithms and architectures are discussed in Section III, which covers the algorithms and architectures pertaining to higher-radix CORDIC, angle recoding, coarse-fine hybrid micro-rotations, redundant number representation, differential CORDIC, and pipeline implementation. In Section IV, we discuss the scaling and accuracy aspects including the scaling techniques, scaling-free CORDIC, quantization and area-delay-accuracy trade-off. The applications of CORDIC to scientific computations, signal processing, communications, robotics and graphics are discussed briefly in Section V. The conclusion along with future research directions is discussed in Section VI.
II. BASIC CORDIC TECHNIQUES
In this Section, we discuss the basic principle underlying the CORDIC-based computation, and present its iterative algorithm for different operating modes and planar coordinate systems. At the end of this section, we discuss the extension of two-dimensional rotation to multidimensional formulation.
A. The CORDIC Algorithm
As shown in Fig. 1, the rotation of a two-dimensional vector $p_0 = [x_0 \;\; y_0]^T$ through an angle $\theta$, to obtain a rotated vector $p_n = [x_n \;\; y_n]^T$, could be performed by the matrix product $p_n = R\,p_0$, where $R$ is the rotation matrix:
$$R = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \tag{1}$$
By factoring out the cosine term in (1), the rotation matrix $R$ can be rewritten as
$$R = (1 + \tan^2\theta)^{-1/2} \begin{bmatrix} 1 & -\tan\theta \\ \tan\theta & 1 \end{bmatrix} \tag{2}$$
and can be interpreted as a product of a scale-factor $K = (1 + \tan^2\theta)^{-1/2}$ with a pseudorotation matrix $R_c$, given by
$$R_c = \begin{bmatrix} 1 & -\tan\theta \\ \tan\theta & 1 \end{bmatrix} \tag{3}$$
The pseudorotation operation rotates the vector $p_0$ by an angle $\theta$ and changes its magnitude by a factor $K^{-1} = \sqrt{1 + \tan^2\theta}$, to produce a pseudo-rotated vector $p'_n = R_c\,p_0$. To achieve simplicity of hardware realization of the rotation, the key ideas used in CORDIC arithmetic are (i) to decompose the rotations into a sequence of elementary rotations through predefined angles that could be implemented with minimum hardware cost; and (ii) to avoid scaling, which might involve arithmetic operations such as square-root and division. The second idea is based on the fact that the scale-factor contains only the magnitude information but no information about the angle of rotation.
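The factorization of the rotation matrix into a scale-factor and a pseudorotation can be checked numerically. The following is a minimal sketch, not part of the original formulation; the function names `rotate` and `pseudo_rotate` and the sample values are hypothetical and chosen only for illustration.

```python
import math

def rotate(x, y, theta):
    """Exact planar rotation of (x, y) by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return c * x - s * y, s * x + c * y

def pseudo_rotate(x, y, theta):
    """Rotation with the cosine factored out: only tan(theta) is needed."""
    t = math.tan(theta)
    return x - t * y, y + t * x

theta = math.radians(30.0)
x0, y0 = 1.0, 0.5
xr, yr = rotate(x0, y0, theta)          # true rotated vector
xp, yp = pseudo_rotate(x0, y0, theta)   # pseudo-rotated vector
scale = math.cos(theta)                 # equals (1 + tan^2(theta))^(-1/2)
```

Multiplying the pseudo-rotated vector by `scale` recovers the true rotation, while the pseudo-rotated vector itself is longer than the input by the factor `1 / scale`.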
1) Iterative Decomposition of Angle of Rotation:
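The CORDIC algorithm performs the rotation iteratively by breaking down the angle of rotation into a set of small pre-defined angles, $\alpha_i = \tan^{-1}(2^{-i})$, so that $\tan\alpha_i = 2^{-i}$ could be implemented in hardware by shifting through $i$ bit locations. Instead of performing the rotation directly through an angle $\theta$, CORDIC performs it by a certain number of microrotations through angle $\alpha_i$, where
$$\theta = \sum_{i=0}^{n-1} \sigma_i \alpha_i \quad \text{and} \quad \sigma_i = \pm 1 \tag{4}$$
that satisfies the CORDIC convergence theorem [3]: $\alpha_i - \sum_{j=i+1}^{n-1} \alpha_j < \alpha_{n-1}$ for $i = 0, 1, \ldots, n-2$. But the decomposition according to (4) could be used only for $-1.74329 \le \theta \le 1.74329$ (called the "convergence range") since $\sum_{i=0}^{\infty} \alpha_i \approx 1.74329$. Therefore, the angular decomposition of (4) is applicable for angles in the first and fourth quadrants. To obtain on-the-fly decomposition of angles into the discrete base $\alpha_i$, one may otherwise use the nonrestoring decomposition [6]
$$\omega_{i+1} = \omega_i - \sigma_i \alpha_i \quad \text{and} \quad \omega_0 = \theta \tag{5}$$
with $\sigma_i = 1$ if $\omega_i \ge 0$ and $\sigma_i = -1$ otherwise, where the rotation matrix for the $i$th iteration corresponding to the selected angle $\alpha_i$ is given by
$$R(i) = K_i \begin{bmatrix} 1 & -\sigma_i 2^{-i} \\ \sigma_i 2^{-i} & 1 \end{bmatrix} \tag{6}$$
$K_i = 1/\sqrt{1 + 2^{-2i}}$ being the scale-factor, and the pseudorotation matrix
$$R_c(i) = \begin{bmatrix} 1 & -\sigma_i 2^{-i} \\ \sigma_i 2^{-i} & 1 \end{bmatrix} \tag{7}$$
Note that the pseudo-rotation matrix $R_c(i)$ for the $i$th iteration alters the magnitude of the rotated vector by a factor $K_i^{-1} = \sqrt{1 + 2^{-2i}}$ during the $i$th microrotation, which is independent of the value of $\sigma_i$ (direction of microrotation) used in the angle decomposition.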
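The nonrestoring decomposition and the convergence range can be illustrated in a few lines. This is a sketch under the assumption that the elementary angles are the arctangents of negative powers of two, indexed from zero; the function names and the choice of 16 iterations are hypothetical.

```python
import math

def elementary_angles(n):
    """Elementary angles atan(2^-i) for i = 0 .. n-1."""
    return [math.atan(2.0 ** -i) for i in range(n)]

def nonrestoring_decompose(theta, n):
    """Return the directions sigma_i (each +1 or -1) and the final residual.

    The residual is updated in every step, whatever its sign, which is
    what distinguishes the nonrestoring scheme from a restoring one.
    """
    omega = theta
    sigmas = []
    for alpha in elementary_angles(n):
        sigma = 1 if omega >= 0 else -1
        sigmas.append(sigma)
        omega -= sigma * alpha
    return sigmas, omega

n = 16
alphas = elementary_angles(n)
convergence_range = sum(alphas)                  # close to 1.7433 rad
sigmas, residual = nonrestoring_decompose(1.0, n)
```

For any input angle inside the convergence range, the magnitude of the residual after the last step does not exceed the smallest elementary angle.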
2) Avoidance of Scaling:
The other simplification performed by Volder's algorithm [1] is to remove the scale-factor $K_i = 1/\sqrt{1 + 2^{-2i}}$ from (6). The removal of scaling from the iterative microrotations leads to a pseudo-rotated vector $p'_n = K\,p_n$ instead of the desired rotated vector $p_n$, where the scale-factor $K$ is given by
$$K = \prod_{i=0}^{n} \sqrt{1 + 2^{-2i}} \tag{8}$$
Since the magnitude gain $\sqrt{1 + 2^{-2i}}$ of a microrotation does not depend on its direction and decreases monotonically toward unity, the final scale-factor converges to $K \approx 1.64676$. Therefore, instead of scaling during each microrotation, the magnitude of the final output could be scaled by $1/K$. The basic CORDIC iterations are thus obtained by applying the pseudo-rotation to the vector, $p_{i+1} = R_c(i)\,p_i$ with the pseudo-rotation matrix $R_c(i)$ of (7), together with the nonrestoring decomposition of the selected angles $\alpha_i$, as follows:
$$x_{i+1} = x_i - \sigma_i 2^{-i} y_i, \qquad y_{i+1} = y_i + \sigma_i 2^{-i} x_i, \qquad \omega_{i+1} = \omega_i - \sigma_i \alpha_i \tag{9}$$
CORDIC iterations of (9) could be used in two operating modes, namely the rotation mode (RM) and the vectoring mode (VM), which differ basically in how the directions of the microrotations are chosen. In the rotation mode, a vector $p_0$ is rotated by an angle $\theta$ to obtain a new vector $p'_n$. In this mode, the direction of each microrotation $\sigma_i$ is determined by the sign of $\omega_i$: if the sign of $\omega_i$ is positive, then $\sigma_i = 1$, otherwise $\sigma_i = -1$. In the vectoring mode, the vector $p_0$ is rotated towards the $x$-axis so that the $y$-component approaches zero. The sum of all angles of microrotations (output angle $\omega_n$) is equal to the angle of rotation of vector $p_0$, while output $x'_n$ corresponds to its magnitude. In this operating mode, the decision about the direction of the microrotation depends on the sign of $y_i$: if it is positive then $\sigma_i = -1$, otherwise $\sigma_i = 1$. CORDIC iterations are easily implemented in both software and hardware. Fig. 2 shows the basic hardware stage for a single CORDIC iteration. After each iteration the number of shifts is incremented by a pair of barrel-shifters. To have an $n$-bit output precision, $(n+1)$ CORDIC iterations are needed.
Note that the shifting could be implemented by a simple selection operation in serial architectures like the one proposed in the original work, while in fully parallel CORDIC architectures the shift operations could be hardwired, so that no barrel-shifters are involved.
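The two operating modes can be sketched in software as follows. This is a floating-point illustration only, in which multiplication by a power of two stands in for the hardware shift. The function names, the iteration count of 24 and the test vectors are hypothetical, and the vectoring branch assumes an input with positive $x$-component.

```python
import math

def cordic_circular(x, y, omega, n, mode):
    """Run n basic CORDIC iterations and return the unscaled (x, y, omega).

    mode "rotation" drives omega toward zero; mode "vectoring" drives y
    toward zero while omega accumulates the angle of the input vector.
    """
    for i in range(n):
        if mode == "rotation":
            sigma = 1 if omega >= 0 else -1
        else:
            sigma = -1 if y >= 0 else 1
        shift = 2.0 ** -i                      # stands in for a shift by i bits
        x, y = x - sigma * shift * y, y + sigma * shift * x
        omega -= sigma * math.atan(shift)
    return x, y, omega

def circular_gain(n):
    """Accumulated magnitude gain of n microrotations (i = 0 .. n-1)."""
    g = 1.0
    for i in range(n):
        g *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return g

n = 24
K = circular_gain(n)

# Rotation mode: pre-scaling the input by 1/K yields cosine and sine directly.
cos_v, sin_v, _ = cordic_circular(1.0 / K, 0.0, 0.6, n, "rotation")

# Vectoring mode: magnitude (after scaling by 1/K) and angle of (3, 4).
x_out, y_out, angle = cordic_circular(3.0, 4.0, 0.0, n, "vectoring")
magnitude = x_out / K
```

Pre-scaling the input by `1 / K`, as done for the rotation-mode call, is one common way to absorb the constant scale-factor when the input vector is fixed.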
Finally, to overcome the problem of the limited convergence range and, then to extend the CORDIC rotations to the complete 
B. Generalization of the CORDIC Algorithm
In 1971, Walther found how CORDIC iterations could be modified to compute hyperbolic functions [3] and reformulated the CORDIC algorithm into a generalized and unified form which is suitable to perform rotations in circular, hyperbolic and linear coordinate systems. The unified formulation includes a new variable $m$, which is assigned different values for different coordinate systems. The generalized CORDIC is formulated as follows:
$$x_{i+1} = x_i - m\,\sigma_i 2^{-i} y_i, \qquad y_{i+1} = y_i + \sigma_i 2^{-i} x_i, \qquad \omega_{i+1} = \omega_i - \sigma_i \alpha_i \tag{11}$$
where $\sigma_i = \operatorname{sign}(\omega_i)$ for rotation mode and $\sigma_i = -\operatorname{sign}(y_i)$ for vectoring mode. For $m = 1$, $0$ or $-1$, and $\alpha_i = \tan^{-1}(2^{-i})$, $2^{-i}$ or $\tanh^{-1}(2^{-i})$, the algorithm given by (11) works in circular, linear or hyperbolic coordinate systems, respectively. Table I summarizes the operations that can be performed in rotation and vectoring modes in each of these coordinate systems. The convergence ranges of linear and hyperbolic CORDIC are obtained, as in the case of circular coordinates, by the sum of all $\alpha_i$, given by $C = \sum_i \alpha_i$. The hyperbolic CORDIC requires the iterations for $i = 4, 13, 40, \ldots$ to be executed twice to ensure convergence. Consequently, these repetitions must be considered while computing the scale-factor $K_h = \prod_i \sqrt{1 - 2^{-2i}}$, which converges to 0.8281.
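A compact software sketch of the unified iteration is given below. It assumes that the hyperbolic iterations start at $i = 1$ (since $\tanh^{-1}(1)$ is unbounded) and that the repeated indices follow the sequence 4, 13, 40, ... obtained from the recurrence $k \to 3k + 1$. All function names and test values are hypothetical.

```python
import math

def hyperbolic_indices(n):
    """Indices 1 .. n, with i = 4, 13, 40, ... listed twice."""
    seq, repeat = [], 4
    for i in range(1, n + 1):
        seq.append(i)
        if i == repeat:
            seq.append(i)
            repeat = 3 * repeat + 1
    return seq

def cordic_general(x, y, omega, m, n, mode="rotation"):
    """Unified CORDIC: m = 1 circular, 0 linear, -1 hyperbolic.

    Returns the unscaled (x, y, omega).
    """
    if m == 1:
        indices, alpha = range(n), lambda i: math.atan(2.0 ** -i)
    elif m == 0:
        indices, alpha = range(n), lambda i: 2.0 ** -i
    else:
        indices, alpha = hyperbolic_indices(n), lambda i: math.atanh(2.0 ** -i)
    for i in indices:
        if mode == "rotation":
            sigma = 1 if omega >= 0 else -1
        else:
            sigma = -1 if y >= 0 else 1
        t = 2.0 ** -i
        x, y = x - m * sigma * t * y, y + sigma * t * x
        omega -= sigma * alpha(i)
    return x, y, omega

def hyperbolic_gain(n):
    """Scale-factor of the hyperbolic iterations, repetitions included."""
    return math.prod(math.sqrt(1.0 - 2.0 ** (-2 * i)) for i in hyperbolic_indices(n))

n = 24
K_h = hyperbolic_gain(n)
# Hyperbolic rotation mode: cosh and sinh of 0.5 from a pre-scaled input.
cosh_v, sinh_v, _ = cordic_general(1.0 / K_h, 0.0, 0.5, -1, n)
# Linear rotation mode: y accumulates the product 3.0 * 0.75.
_, product, _ = cordic_general(3.0, 0.0, 0.75, 0, n)
```

The linear mode involves no scale-factor, so the product is read directly from the $y$ output.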
C. Multidimensional CORDIC
The CORDIC algorithm was extended to higher dimensions using simple Householder reflection [7]. The Householder reflection matrix is defined as
$$H_m = I_m - 2\,\frac{u u^T}{u^T u} \tag{12}$$
where $u$ is an $m$-dimensional vector and $I_m$ is the $m \times m$ identity matrix. The product $H_m v$ reflects the $m$-dimensional vector $v$ with respect to the hyperplane with normal $u$ that passes through the origin. Basically, the Householder-based CORDIC performs the vectoring operation of an $m$-dimensional vector to one of the axes.
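The reflection itself can be illustrated with exact arithmetic. The sketch below applies a single Householder reflection that maps a 3-D vector onto the first axis; it is not the shift-add Householder CORDIC iteration, only the underlying reflection, and the function name and sample vector are hypothetical.

```python
def householder_reflect(u, v):
    """Apply H = I - 2 u u^T / (u^T u) to v without forming the matrix H."""
    uu = sum(a * a for a in u)
    uv = sum(a * b for a, b in zip(u, v))
    factor = 2.0 * uv / uu
    return [b - factor * a for a, b in zip(u, v)]

v = [3.0, 4.0, 12.0]
norm_v = sum(b * b for b in v) ** 0.5      # 13.0 for this vector

# Choosing u = v - |v| e1 makes the reflection send v onto the first axis.
u = [v[0] - norm_v, v[1], v[2]]
w = householder_reflect(u, v)
```

The reflected vector `w` has all its length in the first component, which is the vectoring operation that Householder CORDIC approximates iteratively with shift-add steps.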
For the sake of clarity, we consider here the case of 3-D vector projected on to the -axis in the Euclidean space. The rotation matrix for 3-D case, corresponding to the th iteration, , is given by the product of two simple Householder reflections as (13) where , and with , and and being the directions of microrotations.
One can write the $i$th rotation matrix in terms of the pseudo-rotation matrix as , where is the scale-factor and is the pseudo-rotation matrix which could be expressed as function of the shifting and decision variables as (14) Therefore, the $i$th iteration of 3-D Householder CORDIC rotation results , and, the vector is projected to -axis, such that after iterations gives the length of the vector scaled by with bit precision [8].
III. ADVANCED CORDIC ALGORITHMS AND ARCHITECTURES
CORDIC computation is inherently sequential due to two main bottlenecks: 1) the micro-rotation for any iteration is performed on the intermediate vector computed by the previous iteration and 2) the $(i+1)$th iteration could be started only after the completion of the $i$th iteration, since the value of the direction $\sigma_{i+1}$ which is required to start the $(i+1)$th iteration could be known only after the completion of the $i$th iteration. To alleviate the second bottleneck some attempts have been made for evaluation of $\sigma_i$ values corresponding to small micro-rotation angles [9], [10]. However, the CORDIC iterations still could not be performed in parallel due to the first bottleneck. A partial parallelization has been realized in [11] by combining a pair of conventional CORDIC iterations into a single merged iteration which provides better area-delay efficiency. But the accuracy is slightly affected by such merging, and the approach cannot be extended to a higher number of conventional CORDIC iterations since the induced error becomes unacceptable [11]. Parallel realization of CORDIC iterations to handle the first bottleneck by direct unfolding of micro-rotations is possible, but that would result in an increase in computational complexity, and the advantage of simplicity of the CORDIC algorithm gets degraded [12], [13]. Although no popular architectures are known to us for fully parallel implementation of CORDIC, different forms of pipelined implementation of CORDIC have been proposed for improving the computational throughput [14].
Since the CORDIC algorithm exhibits linear-rate convergence, it requires $(n+1)$ iterations to have $n$-bit precision of the output. Overall latency of the computation thus amounts to the product of the word-length and the CORDIC iteration period. The speed of CORDIC operations is therefore constrained either by the precision requirement (iteration count) or the duration of the clock period. The duration of the clock period, on the other hand, mainly depends on the large carry propagation time for the addition/subtraction during each micro-rotation. It is a straightforward choice to use fast adders for reducing the iteration period at the expense of large silicon area. Use of carry-save adders is a good option to reduce the iteration period and overall latency [15]. Timmermann and others have suggested a method of truncation of the CORDIC algorithm after about $n/2$ iterations (for $n$-bit precision), where the last iteration performs a single rotation for implementing the remaining angle. It lowers the latency time but involves one multiplication or division, respectively, in the rotation or vectoring mode [9].
To handle latency bottlenecks, various techniques have been developed and reported in the literature. Most of the well-known algorithms could be grouped under high-radix CORDIC, the angle-recoding method, the hybrid micro-rotation scheme, redundant CORDIC and differential CORDIC, which we discuss briefly in the following subsections.
A. Higher Radix CORDIC Algorithm
The radix-4 CORDIC algorithm [16] is given by
$$x_{i+1} = x_i - \sigma_i 4^{-i} y_i, \qquad y_{i+1} = y_i + \sigma_i 4^{-i} x_i, \qquad \omega_{i+1} = \omega_i - \alpha_i(\sigma_i) \tag{15}$$
where $\sigma_i \in \{-2, -1, 0, 1, 2\}$ and the elementary angles $\alpha_i(\sigma_i) = \tan^{-1}(\sigma_i 4^{-i})$. The scale-factor for the $i$th iteration is $K_i = 1/\sqrt{1 + \sigma_i^2 4^{-2i}}$. In order to preserve the norm of the vector, the output of the micro-rotations is required to be scaled by a factor
$$K = \prod_i \left(1 + \sigma_i^2 4^{-2i}\right)^{-1/2} \tag{16}$$
To have $n$-bit output precision, the radix-4 CORDIC algorithm requires $n/2$ micro-rotations, which is half that of the radix-2 algorithm. However, it requires more computation time for each iteration and involves more hardware compared to the radix-2 CORDIC to select the value of $\sigma_i$ out of five different possibilities. Moreover, the scale-factor, given by (16), also varies with the rotation angle since it depends on $\sigma_i$, which could have any of the five different values. Some techniques have therefore been suggested for scale-factor compensation through iterative shift-add operations [16], [17]. A high-radix CORDIC algorithm in vectoring mode is also suggested in [18], which can be used for reduced latency operation at the cost of larger tables for storing the elementary angles and pre-scaling factors than the radix-2 and radix-4 implementations.
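The dependence of the radix-4 scale-factor on the chosen digits can be seen in a small numerical sketch. It assumes micro-rotations of the form $\tan^{-1}(\sigma_i 4^{-i})$ with digits $\sigma_i \in \{-2, \ldots, 2\}$ indexed from $i = 0$, and it takes the digit sequence as given; the digit-selection logic of [16] is not modeled, and all names are hypothetical.

```python
import math

DIGITS = (-2, -1, 0, 1, 2)

def radix4_rotate(x, y, sigmas):
    """Apply unscaled radix-4 micro-rotations for the given digit sequence."""
    for i, s in enumerate(sigmas):
        t = s * 4.0 ** -i
        x, y = x - t * y, y + t * x
    return x, y

def radix4_gain(sigmas):
    """Magnitude gain accumulated by the digit sequence."""
    g = 1.0
    for i, s in enumerate(sigmas):
        assert s in DIGITS
        g *= math.sqrt(1.0 + (s * 4.0 ** -i) ** 2)
    return g

def radix4_angle(sigmas):
    """Total rotation angle realized by the digit sequence."""
    return sum(math.atan(s * 4.0 ** -i) for i, s in enumerate(sigmas))

seq_a = [1, -2, 0, 2]
seq_b = [0, 1, 1, -1]
gain_a, gain_b = radix4_gain(seq_a), radix4_gain(seq_b)
xa, ya = radix4_rotate(1.0, 0.0, seq_a)
```

Because `gain_a` and `gain_b` differ, a single constant cannot compensate the scaling for every angle, which is why a separate compensation step is needed.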
B. Angle Recoding (AR) Methods
The purpose of angle recoding (AR) is to reduce the number of CORDIC iterations by encoding the angle of rotation as a linear combination of a set of selected elementary angles of micro-rotations. AR methods are well-suited for many signal processing and image processing applications where the rotation angle is known a priori, such as when performing the discrete orthogonal transforms like discrete Fourier transform (DFT), the discrete cosine transform (DCT), etc.
1) Elementary-Angle-Set Recoding:
In the conventional CORDIC, any given rotation angle is expressed as a linear combination of $n$ values of elementary angles that belong to the set $S = \{\sigma \cdot \tan^{-1}(2^{-r}) : \sigma \in \{-1, 1\},\ r \in \{0, 1, \ldots, n-1\}\}$ in order to obtain an $n$-bit value as $\theta = \sum_{i=0}^{n-1} \sigma_i \tan^{-1}(2^{-i})$. However, in AR methods, this constraint is relaxed by adding zeros to the linear combination to obtain the desired angle using relatively fewer terms of the form $\sigma \cdot \tan^{-1}(2^{-r})$ for $\sigma \in \{1, 0, -1\}$. The elementary-angle-set (EAS) used by the AR scheme is given by $S_{EAS} = \{\sigma \cdot \tan^{-1}(2^{-r}) : \sigma \in \{-1, 0, 1\},\ r \in \{0, 1, \ldots, n-1\}\}$. One of the simplest forms of the angle recoding method, based on the greedy algorithm proposed by Hu and Naganathan [19], tries to represent the remaining angle using the closest elementary angle $\pm\tan^{-1}(2^{-i})$. The angle recoding algorithm of [19] is briefly stated in Table II. Using this recoding scheme the total number of iterations could be reduced by at least 50% while keeping the same $n$-bit accuracy. A similar method of angle recoding in vectoring mode, called backward angle recoding, is suggested in [20].
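The greedy recoding idea can be sketched as follows. This is an illustrative reading of the approach attributed to [19], not a transcription of Table II: at each step the elementary angle closest to the magnitude of the residual is selected, and the loop stops once the residual falls below the smallest elementary angle. The function name, the stopping rule and the safety cap on the loop are assumptions.

```python
import math

def greedy_angle_recoding(theta, n):
    """Recode theta as a short signed sum of angles atan(2^-i), 0 <= i < n.

    Returns a list of (sigma, i) terms and the final residual angle.
    """
    alphas = [math.atan(2.0 ** -i) for i in range(n)]
    residual = theta
    terms = []
    for _ in range(4 * n):                    # safety cap, an assumption
        if abs(residual) < alphas[-1]:
            break
        # Elementary angle closest to the magnitude of the residual.
        i = min(range(n), key=lambda j: abs(abs(residual) - alphas[j]))
        sigma = 1 if residual > 0 else -1
        terms.append((sigma, i))
        residual -= sigma * alphas[i]
    return terms, residual

n = 16
theta = 0.6
terms, residual = greedy_angle_recoding(theta, n)
```

For this angle the recoding needs well under the 16 micro-rotations of the conventional decomposition, because the indices that would contribute little are skipped.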
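2) Extended Elementary-Angle-Set Recoding: Wu et al. [21] have suggested an AR scheme based on an extended elementary-angle-set (EEAS), which provides a more flexible way of decomposing the target rotation angle. In the EEAS approach, the elementary-angle set is extended further to $S_{EEAS} = \{\tan^{-1}(\sigma_1 2^{-r_1} + \sigma_2 2^{-r_2}) : \sigma_1, \sigma_2 \in \{-1, 0, 1\}$ and $r_1, r_2 \in \{0, 1, \ldots, n-1\}\}$. EEAS has better recoding efficiency in terms of the number of iterations and can yield better error performance than the AR scheme based on EAS. The pseudo-rotation for the $i$th micro-rotation based on the EEAS scheme is given by
$$x_{i+1} = x_i - \left[\sigma_1(i) 2^{-r_1(i)} + \sigma_2(i) 2^{-r_2(i)}\right] y_i, \qquad y_{i+1} = y_i + \left[\sigma_1(i) 2^{-r_1(i)} + \sigma_2(i) 2^{-r_2(i)}\right] x_i \tag{17}$$
The pseudo-rotated vector $[x_{R_m} \;\; y_{R_m}]^T$, obtained after $R_m$ (the required number of micro-rotations) iterations according to (17), needs to be scaled by a factor $K = \prod_i K_i$, where $K_i = \left[1 + \left(\sigma_1(i) 2^{-r_1(i)} + \sigma_2(i) 2^{-r_2(i)}\right)^2\right]^{-1/2}$, to produce the rotated vector. For reducing the scaling approximation and for a more flexible implementation of scaling, a method similar to the EEAS scheme for the micro-rotation phase has also been suggested in [21], as given below
$$\tilde{x}_{i+1} = \tilde{x}_i + \left[k_1(i) 2^{-s_1(i)} + k_2(i) 2^{-s_2(i)}\right] \tilde{x}_i, \qquad \tilde{y}_{i+1} = \tilde{y}_i + \left[k_1(i) 2^{-s_1(i)} + k_2(i) 2^{-s_2(i)}\right] \tilde{y}_i \tag{18}$$
where $\tilde{x}_0 = x_{R_m}$ and $\tilde{y}_0 = y_{R_m}$, $k_1, k_2 \in \{-1, 0, 1\}$ and $s_1, s_2 \in \{0, 1, \ldots, n-1\}$. The iterations for the micro-rotation phase as well as the scaling phase could be implemented in the same architecture to reduce the hardware cost, as shown in Fig. 3.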
3) Parallel Angle Recoding: The AR methods [19], [21] could be used to reduce the number of iterations by more than 50% when the angle of rotation is known in advance. However, for unknown rotation angles, their hardware implementation involves more cycle time than the conventional implementation, which results in a reduction in overall efficacy of the algorithm. To reduce the cycle time of CORDIC iterations in such cases, a parallel angle selection scheme is suggested in [22], which can be used in conjunction with the AR method to gain the advantages of the reduction in iteration count without further increase in the cycle time. The parallel AR scheme in [22] is based on dynamic angle selection, where the elementary angles can be tested in parallel and the direction for the micro-rotations can be determined quickly to minimize the iteration period. During each iteration, the residual angle $\omega_i$ is passed to a set of adder-subtractor units that compute, in parallel, the difference $\Delta_j$ between $|\omega_i|$ and each elementary angle $\alpha_j$, and the differences $\Delta_j$ are then fed to a binary-tree like structure that compares them against each other to find the smallest difference. The $\alpha_j$ corresponding to the smallest difference is used as the angle of micro-rotation. The architecture for parallel angle recoding of [22] is shown in Fig. 4.
The parallel AR reduces the overall latency at the cost of high hardware-complexity of add/subtract-compare unit. For actual implementation, it is required to find a space-time trade-off and look at the relative performance in comparison with other approaches as well. The AR schemes based on EAS and EEAS however are useful for those cases where the angle of rotation is known in advance.
C. Hybrid or Coarse-Fine Rotation CORDIC
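Based on the radix-2 decomposition, any rotation angle $\theta$ with $n$-bit precision could be expressed as a linear combination of angles from the set $\{2^{-i} : i \in \{1, 2, \ldots, n-1\}\}$, given by $\theta = \sum_{i=1}^{n-1} b_i 2^{-i}$, where $b_i \in \{0, 1\}$ explicitly specifies whether there is need of a micro-rotation or not. But radix-2 decomposition is not used in the conventional CORDIC because that would not lead to simplicity of hardware realization. Instead, arctangents of the corresponding values of the radix-2 based set are used as the elementary-angle-set with a view to implementing the CORDIC operations only by shift-add operations. The key idea underlying the coarse-fine angular decomposition is that for the fine values of $\alpha_j = \tan^{-1}(2^{-j})$ (i.e., when $j > \lceil n/3 \rceil - 1$), $\tan^{-1}(2^{-j})$ could be replaced by $2^{-j}$ in the radix-set for expansion of $\theta$, since $\tan^{-1}(2^{-j}) \approx 2^{-j}$ when $j$ is sufficiently large.
1) Coarse-Fine Angular Decomposition: In the coarse-fine angular decomposition, the elementary-angle-set contains the arctangents of powers of two for the more-significant part while the less-significant part contains the power-of-two values, such that the radix-set is given by $S = S_1 \cup S_2$, where $S_1 = \{\tan^{-1}(2^{-i}) : i \in \{1, 2, \ldots, p-1\}\}$ and $S_2 = \{2^{-j} : j \in \{p, p+1, \ldots, n-1\}\}$, and $p$ is assumed to be sufficiently large such that $\tan^{-1}(2^{-j}) \approx 2^{-j}$ for $j \ge p$ [10]. For the hybrid decomposition scheme, the rotation angle could be partitioned into two terms expressed as
$$\theta = \theta_M + \theta_L \tag{19}$$
where $\theta_M$ and $\theta_L$ are said to be the coarse and fine subangles, respectively, given by
$$\theta_M = \sum_{i=1}^{p-1} \sigma_i \tan^{-1}(2^{-i}), \quad \sigma_i \in \{1, -1\} \tag{20a}$$
$$\theta_L = \sum_{j=p}^{n-1} d_j 2^{-j}, \quad d_j \in \{0, 1\} \tag{20b}$$
A combination of coarse and fine micro-rotations is used in hybrid CORDIC operations in two cascaded stages. Coarse rotations are performed in stage-1 to have an intermediate vector
$$\begin{bmatrix} x_M \\ y_M \end{bmatrix} = \begin{bmatrix} 1 & -\tan\theta_M \\ \tan\theta_M & 1 \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \tag{21}$$
and fine rotations are performed on the output of stage-1 to obtain the rotated output
$$\begin{bmatrix} x_n \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & -\tan\theta_L \\ \tan\theta_L & 1 \end{bmatrix} \begin{bmatrix} x_M \\ y_M \end{bmatrix} \tag{22}$$
2) Implementation of Hybrid CORDIC: To derive the efficiency of hybrid CORDIC, the coarse and fine rotations are performed by separate circuits as shown in Fig. 5. The coarse rotation phase is performed by the CORDIC processor-I and the fine rotation phase is performed by CORDIC processor-II.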
To have fast implementation, processor-I performs a pair of ROM look-up operations followed by addition to realize the rotation through angle $\theta_M$. Since $\theta_L$ could be expressed as a linear combination of angles of small enough magnitude $2^{-j}$, where $p \le j \le n-1$, the computation of the fine rotation phase can be realized by a sequence of shift-and-add operations. For implementation of the fine rotation phase, no computations are involved to decide the direction of micro-rotation, since the need of a micro-rotation is explicit in the radix-2 representation of $\theta_L$. The radix-2 representation could also be recoded to express $\theta_L = \sum_j \tilde{d}_j 2^{-j}$ where $\tilde{d}_j \in \{1, -1\}$ as shown in [9]. Since the directions of micro-rotations are explicit in such a representation of $\theta_L$, it would be possible to implement the fine rotation phase in parallel for low-latency realization.
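The point at which an arctangent can be replaced by its argument is easy to check numerically. The sketch below finds, for a few word-lengths, the smallest index $j$ for which $2^{-j} - \tan^{-1}(2^{-j})$ drops below $2^{-n}$; the function name and the chosen word-lengths are hypothetical, and the criterion (error below one unit in the last of $n$ fractional bits) is an assumption about what "sufficiently large" means.

```python
import math

def fine_threshold(n):
    """Smallest j >= 1 such that 2^-j - atan(2^-j) < 2^-n."""
    j = 1
    while 2.0 ** -j - math.atan(2.0 ** -j) >= 2.0 ** -n:
        j += 1
    return j

thresholds = {n: fine_threshold(n) for n in (12, 16, 24, 32)}
```

Since $x - \tan^{-1}x \approx x^3/3$ for small $x$, the error at index $j$ is roughly $2^{-3j}/3$, which explains why the threshold sits near one-third of the word-length.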
The hybrid decomposition could be used for reducing the latency by ROM-based realization of coarse operation. This can also be used for reducing the hardware complexity of fine rotation phase since there is no need to find the direction of microrotation. Several options are however possible for the implementation of these two stages. A form of hybrid CORDIC is suggested in [23] for very-high precision CORDIC rotation where the ROM size is reduced to nearly bits. The coarse rotations could be implemented as conventional CORDIC through shift-add operations of micro-rotations if the latency is tolerable.
3) Shift-Add Implementation of Coarse Rotation: Using the symmetry properties of the sine and cosine functions in different quadrants, the rotation through any arbitrary angle $\theta$ could be mapped from the full range $[0, 2\pi]$ to the first half of the first quadrant $[0, \pi/4]$. The coarse-fine partition could be applied thereafter for reducing the number of micro-rotations necessary for fine rotations. To implement the coarse rotations through shift-add operations, the coarse subangle $\theta_M$ is represented in [24] and [25] in terms of elementary rotations of the form $\tan^{-1}(2^{-i})$ as (23) where is a correction term. Using (23) on (19), one can find , where (24) It is shown in [25] that, based on the above decompositions using radix-2 representation, both coarse and fine rotations could be implemented by a sequence of shift-and-add operations in CORDIC iterations without a ROM lookup table or a real multiplication operation. One such implementation is shown in Fig. 6. Processor-I performs CORDIC operations like those of conventional CORDIC for nearly the first one-third of the iterations, and the residual angle as well as the intermediate rotated vector is passed to processor-II. Processor-II can perform the fine rotation in one of the possible ways as in the case of the circuit of Fig. 5.
The coarse-fine rotation approach in some modified forms has been applied for reduced-latency implementation of sine and cosine generation [24] - [28] , high-speed and high-precision rotation [24] , [26] , and conversion of rectangular to polar coordinates and vice versa [29] , [30] .
4) Parallel CORDIC Based on Coarse-Fine Decomposition:
In [31], the authors have proposed two angle recoding techniques for parallel detection of the direction of micro-rotations, namely binary to bipolar recoding (BBR) and micro-rotation angle recoding (MAR), to be used for the coarse part of the input angle $\theta$. BBR is used to obtain the polarity of each bit in the radix-2 representation of $\theta$ to determine the rotation direction. MAR is used to decompose each positional binary weight into a linear combination of arctangent terms. It is further shown in [32] that the rotation direction can be decided once the input angle is known, which enables parallel computation of the micro-rotations. Although the CORDIC rotation can be executed in parallel according to [32], the method for decomposition of each positional binary weight produces many extra stages of micro-rotation, especially when the bit-width of the input angle increases. A more efficient recoding scheme has been proposed in [33] for reducing the number of micro-rotations to be employed in parallel CORDIC rotations.
D. Redundant-Number-Based CORDIC Implementation
Addition/subtraction operations are faster in the redundant number system since, unlike the binary system, it does not involve carry propagation. The use of a redundant number system is therefore another way to speed up the CORDIC iterations. A CORDIC implementation based on the redundant number system, called redundant CORDIC, was proposed by Ercegovac and Lang and applied to matrix triangularization and singular value decomposition [34]. Rotation mode redundant CORDIC has been found to result in fast implementation of sinusoidal function generation, unitary matrix transformation, angle calculation and rotation [34]-[38]. Although redundant CORDIC can achieve a fast carry-free computation, the direction of the micro-rotation (the sign factor $\sigma_i$) cannot be determined directly unlike the case of the conventional CORDIC, since the redundant number system allows a choice $\sigma_i = 0$ along with the conventional choices 1 and $-1$ such that $\sigma_i \in \{-1, 0, 1\}$. Therefore, it requires a different formulation for selection of $\sigma_i$, which is different for binary signed-digit representation and carry-save implementation. In radix-2 signed-digit representation, assuming-, it is shown that [6]
$$\sigma_i = \begin{cases} 1 & \text{if } \tilde{\omega}_i > 0 \\ 0 & \text{if } \tilde{\omega}_i = 0 \\ -1 & \text{if } \tilde{\omega}_i < 0 \end{cases} \tag{25}$$
where $\tilde{\omega}_i$ is the value of $2^i \omega_i$ truncated after the first fractional digit. Similarly for carry-save implementation, it is
$$\sigma_i = \begin{cases} 1 & \text{if } \tilde{\omega}_i > -1/2 \\ 0 & \text{if } \tilde{\omega}_i = -1/2 \\ -1 & \text{if } \tilde{\omega}_i < -1/2 \end{cases} \tag{26}$$
It can be noted from (25) and (26) that in some of the iterations no rotations are performed, so that the scale-factor becomes a variable which depends on the angle of rotation. Since the redundant CORDIC of [34] uses a non-constant scale-factor, Takagi et al. [35] have proposed the double-rotation method and the correcting-rotation method to keep the value of the scale-factor constant. In the double-rotation method, two micro-rotations are performed in each iteration, such that when $\sigma_i = 0$, one positive and one negative micro-rotation are performed, and when $\sigma_i = 1$ or $\sigma_i = -1$, respectively, two positive or two negative micro-rotations are performed.
The scale-factor is retained constant in this case since the number of micro-rotations is fixed for any rotation angle, but it doubles the iteration count. The correcting-rotation method examines the sign of $\tilde{\omega}_i$ constituted by some most significant digits of $\omega_i$, and if $\tilde{\omega}_i \ne 0$ then $\sigma_i$ is taken to be $\operatorname{sign}(\tilde{\omega}_i)$, and $\sigma_i$ is taken to be 1 otherwise. It is shown that the error occurring in this algorithm could be corrected by repetition of the iterations for $i = p, 2p, 3p$, etc., where $p$ is the size of $\tilde{\omega}_i$. The branching CORDIC was proposed in [36] for fast on-line implementation of redundant CORDIC with a constant scale factor. The main drawback of this method, however, is the necessity of performing two conventional CORDIC iterations in parallel, which consumes more silicon area than classical methods [39]. The work proposed in [34] has also been extended to the vectoring mode [37], and correcting operations are included further to keep the scaling factor constant so as to eliminate the hardware for scaling.
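The constant scale-factor of the double-rotation method can be verified with a small floating-point sketch. It assumes that both micro-rotations of iteration $i$ use the shift $2^{-(i+1)}$, which is our reading of the method of [35] and is not stated in the text above; the function name and test values are hypothetical, and the redundant-arithmetic selection of the digit is not modeled.

```python
import math

def double_rotation_step(x, y, q, i):
    """One double-rotation iteration for a digit q in {-1, 0, 1}.

    q = 1 gives two positive micro-rotations, q = -1 two negative ones,
    and q = 0 one of each, so the net angle is zero but the gain is kept.
    """
    t = 2.0 ** -(i + 1)
    directions = {1: (1, 1), 0: (1, -1), -1: (-1, -1)}[q]
    for s in directions:
        x, y = x - s * t * y, y + s * t * x
    return x, y

x0, y0, i = 0.3, -0.8, 2
gains = {}
for q in (-1, 0, 1):
    x1, y1 = double_rotation_step(x0, y0, q, i)
    gains[q] = math.hypot(x1, y1) / math.hypot(x0, y0)
```

Whatever digit is selected, the vector grows by the same factor $1 + 2^{-2(i+1)}$, so the overall scale-factor no longer depends on the rotation angle.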
E. Pipelined CORDIC Architecture
Since the CORDIC iterations are identical, it is very convenient to map them into pipelined architectures. The main emphasis in efficient pipelined implementation lies in the minimization of the critical path. The earliest pipelined architecture that we find was suggested by Deprettere, Dewilde and Udo in 1984 [14]. Pipelined CORDIC circuits have been used thereafter for high-throughput implementation of sinusoidal wave generation, fixed and adaptive filters, discrete orthogonal transforms and other signal processing applications [40]-[44]. A generic architecture of a pipelined CORDIC circuit is shown in Fig. 7. It consists of $n$ stages of CORDIC units where each of the pipelined stages consists of a basic CORDIC engine of the kind shown in Fig. 2. Since the number of shifts to be performed by the shifters at different stages is fixed (a shift through $i$ bit positions is performed at the $i$th stage), in the case of pipelined CORDIC the shift operations could be hardwired with the adders, and therefore shifters are eliminated in the pipelined implementation. The critical path of pipelined CORDIC thus amounts to the time required by the add/subtract operations in each of the stages. When three adders are used in each stage as shown in Fig. 7, the critical path amounts to $T_{ADD} + T_{MUX} + T_{2C}$, where $T_{ADD}$, $T_{MUX}$ and $T_{2C}$ are the times required for addition, 2:1 multiplexing and 2's complement operation, respectively. For known and constant angle rotations the sign of micro-rotations could be predetermined, and the need for multiplexing could be avoided for reducing the critical path. The latency of computation thus depends primarily on the time required for an addition. Since there is very little room for reducing the critical path in the pipelined implementation of conventional CORDIC, digit-on-line pipelined CORDIC circuits based on the differential CORDIC (D-CORDIC) algorithm have been suggested to achieve higher throughput and lower pipeline latency.
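A behavioural model may help to picture the pipelined organisation. In the sketch below each stage has a fixed shift distance, one pipeline register sits after every stage, and a new sample can enter on every clock. The fixed-point format (16 fractional bits), the pipeline depth and all names are hypothetical; the model describes rotation mode only and says nothing about actual circuit timing.

```python
import math

FRAC_BITS = 16      # assumed format: real value = integer / 2**FRAC_BITS
N_STAGES = 16       # assumed pipeline depth

# Elementary angles quantized to the same fixed-point format.
ANGLES = [round(math.atan(2.0 ** -i) * (1 << FRAC_BITS)) for i in range(N_STAGES)]

def stage(i, x, y, w):
    """Stage i: the shift distance i is a constant of the stage (hardwired)."""
    if w >= 0:
        return x - (y >> i), y + (x >> i), w - ANGLES[i]
    return x + (y >> i), y - (x >> i), w + ANGLES[i]

def run_pipeline(samples):
    """Clock (x, y, w) samples through the stages, one new sample per clock."""
    regs = [None] * N_STAGES              # register after each stage
    results = []
    for sample in list(samples) + [None] * N_STAGES:   # extra clocks flush
        if regs[-1] is not None:
            results.append(regs[-1])
        for i in range(N_STAGES - 1, 0, -1):
            regs[i] = None if regs[i - 1] is None else stage(i, *regs[i - 1])
        regs[0] = None if sample is None else stage(0, *sample)
    return results

def to_fixed(v):
    return round(v * (1 << FRAC_BITS))

GAIN = math.prod(math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(N_STAGES))
thetas = [0.1, -0.7, 1.2]
inputs = [(to_fixed(1.0 / GAIN), 0, to_fixed(t)) for t in thetas]
outputs = [(x / 2 ** FRAC_BITS, y / 2 ** FRAC_BITS) for x, y, _ in run_pipeline(inputs)]
```

Each result appears `N_STAGES` clocks after its input was accepted, while the stream as a whole delivers one result per clock once the pipeline is full.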
F. Differential CORDIC Algorithm
The D-CORDIC algorithm is equivalent to the usual CORDIC in terms of accuracy as well as convergence, but it provides faster and more efficient redundant number-based implementation of both rotation mode and vectoring mode CORDIC. It introduces some temporary variables corresponding to the CORDIC variables $x$, $y$ and $\omega$, which are generically defined as
$$\hat{g}_{i+1} = \operatorname{sign}(g_i) \cdot g_{i+1} \tag{27}$$
which implies that $|\hat{g}_{i+1}| = |g_{i+1}|$ and $\operatorname{sign}(g_{i+1}) = \operatorname{sign}(g_i) \cdot \operatorname{sign}(\hat{g}_{i+1})$. The signs of $\hat{g}_i$ are, therefore, considered as being differentially encoded signs of $g_i$ in the differential CORDIC algorithm [45]. The rotation and vectoring mode D-CORDIC algorithms are outlined in Table III. The D-CORDIC algorithm is suitable for efficient pipelined implementation, which is utilized by Ercegovac and Lang [34] using on-line arithmetic based on the redundant number system. Since the output data in the redundant on-line arithmetic can be available in the most-significant-digit-first (MSD-first) fashion, the successive iterations could be implemented by a set of cascaded stages, where the processing time between the successive stages is overlapped with a single-digit time-skew, which results in a significant reduction in overall latency of computation. Moreover, in some redundant number representations, the absolute values and sign of the output are easily determined, e.g., in binary signed-digit (BSD) representation, the sign of a number corresponds to the sign of the first nonzero MSD, and negation of the number can be performed just by flipping the signs of nonzero digits. A two-dimensional systolic D-CORDIC architecture is derived in [46] where phase accumulation is performed for direct digital frequency synthesis in the digit-level pipelining framework.
IV. SCALING, QUANTIZATION AND ACCURACY ISSUES
As discussed in Section II-A, scaling is a necessary operation associated with the implementation of the CORDIC algorithm. Scaling in CORDIC can be of two types: 1) constant factor scaling and 2) variable factor scaling. In the case of variable factor scaling, the scale-factor changes with the rotation angle. It arises mainly because some of the iterations of conventional CORDIC are skipped (and the skipped set varies with the angle of rotation), as in the case of higher-radix CORDIC and most of the optimized CORDIC algorithms. The techniques for scaling compensation for each of these algorithms have been studied extensively to minimize the scaling overhead. In the case of conventional CORDIC, as given by (8), after a sufficiently large number of iterations the scale-factor converges to 1.6467603, which leads to constant factor scaling since the scale factor remains the same for all angles of rotation. Constant factor scaling can be efficiently implemented in a dedicated scaling unit designed by the canonical signed digit (CSD)-based technique [47] and the common sub-expression elimination (CSE) approach [48], [49]. When the sum of the outputs of more than one independent CORDIC operation is to be evaluated, one can perform only one scaling of the output sum [50] in the case of constant factor scaling. In the following subsections, we briefly discuss some interesting developments on the implementation of on-line scaling and the realization of scaling-free CORDIC. Besides, we outline here the sources of error that may arise in a CORDIC design and their impact on implementation.
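The two properties of constant factor scaling mentioned above can be checked numerically. The sketch below assumes the conventional radix-2 circular CORDIC, whose scale factor is the product of $\sqrt{1 + 2^{-2i}}$ over the iterations; the function names and the test vectors are hypothetical. It shows the convergence of the scale factor and the use of a single scaling on the sum of two unscaled CORDIC outputs.

```python
import math


def scale_factor(n):
    """Scale factor accumulated by n unscaled micro-rotations."""
    return math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(n))


def pseudo_rotate(x, y, theta, n=32):
    """Rotation-mode CORDIC without scale compensation (floating point)."""
    for i in range(n):
        d = 1.0 if theta >= 0 else -1.0
        x, y, theta = (x - d * y * 2.0 ** -i,
                       y + d * x * 2.0 ** -i,
                       theta - d * math.atan(2.0 ** -i))
    return x, y


for n in (4, 8, 16, 32):
    print(n, scale_factor(n))

# Two independent rotations, one scaling applied to the sum of the outputs.
K = scale_factor(32)
xa, _ = pseudo_rotate(1.0, 0.5, 0.3)
xb, _ = pseudo_rotate(-0.4, 2.0, -0.8)
x_sum = (xa + xb) / K

# Reference value computed with library trigonometric functions.
expected = (1.0 * math.cos(0.3) - 0.5 * math.sin(0.3)
            - 0.4 * math.cos(-0.8) - 2.0 * math.sin(-0.8))
print(x_sum, expected)
```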
A. Implementation of Mixed-Scaling-Rotation
Dewilde et al. [51] have suggested on-line scaling, where the shift-add operations for scaling and micro-rotations are interleaved in the same circuit. This approach has been used in [52] and improved further in [53]. In the mixed-scaling-rotation (MSR) approach, pioneered by Wu et al. [54]-[56], the micro-rotation and scaling phases are merged into a unified vector rotational model to minimize the overhead of the scaling operation. The MSR-CORDIC can be applied to DSP applications in which the rotation angles are usually known a priori, e.g., the twiddle factors in the fast Fourier transform (FFT) and the kernel components in other sinusoidal transforms. It is shown in [55] that the MSR technique can significantly reduce the total iteration count so as to improve the speed performance, and can enhance the signal-to-quantization-noise ratio (SQNR) performance by controlling the internal dynamic range. The MSR-CORDIC scheme has been applied to a variable-length FFT processor design [29], and found to result in significant hardware reduction in the implementation of twiddle-factor multiplications. Although the interleaved scaling and MSR-CORDIC provide hardware reduction, they also reduce the throughput. For high-throughput implementation, one should implement the micro-rotations and scaling in two separate pipelined stages.
B. Low-Complexity Scaling
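When the elementary angles pertaining to a rotation are "sufficiently small", defined by $\sin\alpha_i \cong \alpha_i = 2^{-i}$, and the rotations are only in one direction, the CORDIC rotation is given by the representation [57]
$$\begin{bmatrix} x_{i+1} \\ y_{i+1} \end{bmatrix} = \begin{bmatrix} 1 - 2^{-(2i+1)} & 2^{-i} \\ -2^{-i} & 1 - 2^{-(2i+1)} \end{bmatrix} \begin{bmatrix} x_i \\ y_i \end{bmatrix} \tag{28}$$
and $z_{i+1} = z_i - 2^{-i}$ (considering clockwise micro-rotations only), where $x_i$ and $y_i$ are the components of the vector after the $i$th micro-rotation, $L$ is the input wordlength and $\lfloor (L - 2.585)/3 \rfloor \le i \le L - 1$. The formulation of (28) performs the "actual" rotation where the norm of the vector is preserved at every micro-rotation.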
However, the problem with this formulation is that the overall range of angles for which it can be used is very small, because, for 16-bit wordlength, the largest such angle is , which obviously is quite small compared to the entire coordinate space. To overcome this problem, argument reduction is performed through "domain folding" [58] by mapping the target rotation-angles into the range . Besides, the elementary rotations are carried out in an adaptive manner to enhance the rate of convergence so as to force the approximation error of final angle below a specified limit [59] . But, the domain-folding in some cases, involves a rotation through which demands a scaling by a factor of . Besides, the target range is still much larger than the range of convergence of the scaling-free realization. The formulation of (28), therefore, could be effectively used when a rotation through is not required and angles of rotations could be folded to the range . Generalized algorithms, and their corresponding architectures to perform the scale-factor compensation in parallel with the CORDIC iterations, for both rotation and vectoring modes are proposed in [60] , where the compensation overhead is reduced to a couple of iterations. It is shown in [61] that since the scale-factor is known in advance, one can perform the minimal recoding of the bits of scaling-factor, and implement the multiplication thereafter by a Wallace tree. It is a good solution of low-latency scaling particularly for pipelined CORDIC architectures.
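The norm-preserving property of the scaling-free micro-rotation can be checked numerically. The sketch below is an illustration under stated assumptions: it takes the small-angle approximations $\sin\alpha_i \approx 2^{-i}$ and $\cos\alpha_i \approx 1 - 2^{-(2i+1)}$, and the lower limit $\lfloor (L - 2.585)/3 \rfloor$ on the shift index for wordlength $L$; the function names are hypothetical.

```python
import math


def smallest_shift(wordlength):
    """Assumed lower limit on the shift index for the given wordlength."""
    return math.floor((wordlength - 2.585) / 3)


def scaling_free_step(x, y, i):
    """One clockwise micro-rotation using only shift-and-add style terms."""
    c = 1.0 - 2.0 ** -(2 * i + 1)   # approximates cos(2^-i)
    s = 2.0 ** -i                   # approximates sin(2^-i)
    return c * x + s * y, -s * x + c * y


L = 16
x, y = 1.0, 0.0
for i in range(smallest_shift(L), L):
    x, y = scaling_free_step(x, y, i)

norm_error = abs(math.hypot(x, y) - 1.0)
print(smallest_shift(L), norm_error, 2.0 ** -L)
```

For a 16-bit wordlength the deviation of the norm from unity stays below one unit in the last place, so no scale-factor compensation is needed for these micro-rotations, at the cost of the small range of angles they can cover.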
C. Quantization and Numerical Accuracy
Errors in CORDIC are mainly of two types: 1) the angle approximation error which originates from quantization of rotation angle represented by a linear combination of finite numbers of elementary angles and 2) the finite wordlength of the datapath resulting in the rounding/truncation of output that increases cumulatively through the successive iterations of micro-rotations. A third source of error that also comes into the picture results from the scaling of pseudo-rotated outputs. The scaling error is, however, also due to the use of finite wordlength in the scaling circuitry and is predominantly a rounding/truncation error. A detailed discussion on rounding error due to fixed and floating point implementations is available in [62] . In his earlier work, Walther [3] concluded that the errors in the CORDIC output are bounded, and extra bits are required in the datapaths to take care of the errors. Hu [62] has provided more precise error bounds due to the angle approximation error for different CORDIC modes for fixed point as well as floating-point implementations. The error bound resulting for fixed point representation of arctangents is further analyzed by Kota and Cavallaro [63] and its impact on practical implementation has been discussed.
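The angle approximation error can be made concrete with a short experiment. The sketch below assumes the conventional rotation-mode recurrence for the residual angle, with elementary angles $\tan^{-1}(2^{-i})$; the function name and the sweep of test angles are hypothetical. Within the convergence range, the residual angle after $n$ iterations is bounded by the last elementary angle, $\tan^{-1}(2^{-(n-1)})$.

```python
import math


def residual_angle(theta, n):
    """Magnitude of the angle left unrotated after n CORDIC iterations."""
    z = theta
    for i in range(n):
        a = math.atan(2.0 ** -i)
        z = z - a if z >= 0 else z + a
    return abs(z)


# Sweep angles inside the convergence range (about +/-1.743 rad).
angles = [k * 0.001 for k in range(-1500, 1501)]
worst = {n: max(residual_angle(t, n) for t in angles) for n in (8, 12, 16)}
for n, err in worst.items():
    print(n, err, math.atan(2.0 ** -(n - 1)))
```

Each additional iteration roughly halves the bound, which is the source of the trade-off between accuracy and latency; rounding in a finite-wordlength datapath adds a further error that this floating-point sketch does not model.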
D. Area-Delay-Accuracy Trade-Off
The area, accuracy and latency of the CORDIC algorithm depend mainly on the iteration count and its implementation. To achieve $n$-bit accuracy, if fixed-point arithmetic is applied, the wordlength of the $x$ and $y$ data-path is $(n + 2 + \log_2 n)$ and, for the computation of the angle $\theta$, it is $(n + \log_2 n)$ [45], [63]. The hardware requirement therefore increases with the desired accuracy. Floating-point implementation naturally gives higher accuracy than its fixed-point counterpart, but at the cost of more complex hardware. To minimize the angle approximation error, the smallest elementary angle needs to be as small as possible [62]. This consequently demands a larger number of right-shifts and more hardware for the barrel-shifters and adders. Besides, better angle approximation requires more iterations, which increases the latency. The additional accuracy resulting from floating-point implementation or better angle approximation may not, however, be necessary in many applications. Thus, there is a need for a trade-off between hardware cost, latency and numerical accuracy for a particular application. The designer therefore has to check how much numerical accuracy is needed, along with the area and speed constraints of the application, and can accordingly decide on fixed-point or floating-point implementation and set the wordlength and the number of iterations.
V. APPLICATIONS OF CORDIC
The CORDIC technique is basically applied for rotation of a vector in circular, hyperbolic or linear coordinate systems, which in turn can also be used for generation of sinusoidal waveforms, multiplication and division operations, and evaluation of the angle of rotation, trigonometric functions, logarithms, exponentials and square roots [6], [64], [65]. Table IV shows some elementary functions and operations that can be directly implemented by CORDIC. The table also indicates whether the coordinate system is circular (CC), linear (LC), or hyperbolic (HC), whether the CORDIC operates in rotation mode (RM) or vectoring mode (VM), the initialization of the CORDIC, and the necessary pre- or post-processing step to perform the operation. The scale factors are, however, omitted from Table IV for simplicity of presentation. In this section, we discuss how CORDIC is used for some basic matrix problems like QR decomposition and singular-value decomposition. Moreover, we make a brief presentation on the applications of CORDIC to signal and image processing, digital communication, robotics and 3-D graphics.
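As an illustration of how a change of coordinate system changes the computed function, the following sketch shows multiplication and division with linear-coordinate CORDIC. It assumes the standard linear-mode recurrences $y_{i+1} = y_i + d_i x 2^{-i}$ and $z_{i+1} = z_i - d_i 2^{-i}$ with iterations starting at $i = 0$; it does not reproduce the exact initializations of Table IV, and the function names are hypothetical. No scale factor arises in the linear coordinate system.

```python
def linear_multiply(x, z, n=32):
    """Rotation mode in linear coordinates: returns x * z for |z| < 2."""
    y = 0.0
    for i in range(n):
        d = 1.0 if z >= 0 else -1.0     # drive z towards zero
        y += d * x * 2.0 ** -i
        z -= d * 2.0 ** -i
    return y


def linear_divide(y, x, n=32):
    """Vectoring mode in linear coordinates: returns y / x for x > 0, |y/x| < 2."""
    z = 0.0
    for i in range(n):
        d = -1.0 if y >= 0 else 1.0     # drive y towards zero
        y += d * x * 2.0 ** -i
        z -= d * 2.0 ** -i
    return z


print(linear_multiply(3.0, 0.75))   # close to 2.25
print(linear_divide(1.5, 2.0))      # close to 0.75
```

Both routines use only additions and multiplications by powers of two, which correspond to shift-add operations in a fixed-point datapath.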
A. Matrix Computation
1) QR Decomposition:
QR decomposition of a matrix can be performed through Givens rotations [66] that selectively introduce zeros into the matrix. A Givens rotation is an orthogonal transformation of the form
$$\begin{bmatrix} c & s \\ -s & c \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} r \\ 0 \end{bmatrix} \tag{29}$$
where $c = a/\sqrt{a^2 + b^2}$ and $s = b/\sqrt{a^2 + b^2}$. The QR decomposition requires two types of iterative operations to obtain an upper-triangular matrix using orthogonal transformations: (i) calculating the Givens rotation angle, and (ii) applying the calculated angle of rotation to the rest of the rows. Circular coordinate CORDIC is a good choice to implement both of these operations, where the first operation is performed by a VM CORDIC and the second one is performed by an RM CORDIC. The CORDIC-based QR decomposition can be implemented in VLSI with a suitable area-time trade-off using a systolic triangular array, a linear array or a single CORDIC processor that is reconfigurable for the rotation and vectoring modes of operation. A detailed explanation of these architectures is available in [64], [67].
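The two operations of the CORDIC-based QR decomposition can be sketched in software as follows. This is a floating-point illustration, not one of the array architectures of [64], [67]: the iteration count, the handling of a negative pivot by negating both rows, and the names `vectoring`, `rotation` and `qr_r_factor` are assumptions made for this example.

```python
import math

N_ITER = 32
ANGLES = [math.atan(2.0 ** -i) for i in range(N_ITER)]
K = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(N_ITER))


def vectoring(x, y):
    """VM CORDIC: drive y to zero; returns (magnitude, angle) for x >= 0."""
    z = 0.0
    for i, a in enumerate(ANGLES):
        d = -1.0 if y >= 0 else 1.0
        x, y, z = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i, z - d * a
    return x / K, z


def rotation(x, y, z):
    """RM CORDIC: rotate (x, y) through the angle z."""
    for i, a in enumerate(ANGLES):
        d = 1.0 if z >= 0 else -1.0
        x, y, z = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i, z - d * a
    return x / K, y / K


def qr_r_factor(A):
    """Upper-triangular factor R obtained by CORDIC Givens rotations."""
    R = [row[:] for row in A]
    m, n = len(R), len(R[0])
    for col in range(min(m - 1, n)):
        for row in range(col + 1, m):
            if R[col][col] < 0:
                # Rotation by pi in this plane keeps the pivot non-negative.
                R[col] = [-v for v in R[col]]
                R[row] = [-v for v in R[row]]
            # (i) VM CORDIC finds the angle and the new pivot.
            r, theta = vectoring(R[col][col], R[row][col])
            R[col][col], R[row][col] = r, 0.0
            # (ii) RM CORDIC applies the same rotation to the rest of the rows.
            for k in range(col + 1, n):
                R[col][k], R[row][k] = rotation(R[col][k], R[row][k], -theta)
    return R


A = [[-1.0, 2.0, 3.0], [4.0, -5.0, 6.0], [7.0, 8.0, -10.0]]
R = qr_r_factor(A)
print(R)
```

Since only orthogonal transformations are applied, $R^T R$ equals $A^T A$ up to the CORDIC approximation error, which gives a simple check of the result.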
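2) Singular Value Decomposition and Eigenvalue Estimation: Singular value decomposition of a matrix $A$ is given by $A = U \Sigma V^T$, where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix of singular values. For CORDIC-based implementation of SVD, it is decomposed into $2 \times 2$ SVD problems, and solved iteratively. To solve each $2 \times 2$ SVD problem, a two-sided Givens rotation is applied to each of the $2 \times 2$ matrices to nullify the off-diagonal elements, as described in the following:
$$\begin{bmatrix} \cos\theta_l & \sin\theta_l \\ -\sin\theta_l & \cos\theta_l \end{bmatrix}^T M \begin{bmatrix} \cos\theta_r & \sin\theta_r \\ -\sin\theta_r & \cos\theta_r \end{bmatrix} = \begin{bmatrix} \psi_1 & 0 \\ 0 & \psi_2 \end{bmatrix} \tag{30}$$
where $M$ is a $2 \times 2$ input matrix to be decomposed; and $\theta_l$ and $\theta_r$ are, respectively, the left and right rotation angles, calculated from the elements of $M$ using the following two relations:
$$\theta_r + \theta_l = \tan^{-1}\left(\frac{c + b}{d - a}\right) \quad \text{and} \quad \theta_r - \theta_l = \tan^{-1}\left(\frac{c - b}{d + a}\right) \quad \text{for } M = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \tag{31}$$
CORDIC-based architectures for SVD using this method were developed by Cavallaro and Luk [68]. A simplified design of array processor for the particular case $\theta_l = \theta_r$ (i.e., $b = c$) was developed further by Delosme [69] for the symmetric eigenvalue problem. In a relatively recent paper [70], Liu et al. have proposed an application-specific instruction set processor (ASIP) for the real-time implementation of QR decomposition and SVD, where circular coordinate CORDIC is used for efficient implementation of both these functions.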
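The two-sided rotation for a $2 \times 2$ block can be verified numerically. The sketch below assumes the rotation matrix convention $R(\theta) = [[\cos\theta, \sin\theta], [-\sin\theta, \cos\theta]]$ and the angle relations $\theta_r + \theta_l = \tan^{-1}((c+b)/(d-a))$ and $\theta_r - \theta_l = \tan^{-1}((c-b)/(d+a))$ for $M = [[a, b], [c, d]]$. The library functions `math.atan2`, `math.cos` and `math.sin` stand in for the VM and RM CORDIC units, and the helper names are hypothetical.

```python
import math


def rot(theta):
    """Plane rotation matrix [[cos, sin], [-sin, cos]]."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]


def matmul2(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose2(A):
    return [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]


def svd2x2(M):
    """Return (theta_l, theta_r, D) with D = R(theta_l)^T M R(theta_r)."""
    (a, b), (c, d) = M
    angle_sum = math.atan2(c + b, d - a)    # theta_r + theta_l
    angle_diff = math.atan2(c - b, d + a)   # theta_r - theta_l
    theta_r = 0.5 * (angle_sum + angle_diff)
    theta_l = 0.5 * (angle_sum - angle_diff)
    D = matmul2(matmul2(transpose2(rot(theta_l)), M), rot(theta_r))
    return theta_l, theta_r, D


theta_l, theta_r, D = svd2x2([[1.0, 2.0], [3.0, 4.0]])
print(theta_l, theta_r, D)
```

The off-diagonal elements of `D` vanish up to rounding, and the magnitudes of its diagonal elements are the singular values of the block. In a CORDIC realization the two arctangents would be evaluated by vectoring-mode units and the two-sided rotation by rotation-mode units.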
B. Signal Processing and Image Processing Applications
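CORDIC techniques have a wide range of DSP applications including fixed/adaptive filtering [8], and the computation of discrete sinusoidal transforms such as the DFT [50], [52], [71], [72], discrete Hartley transform (DHT) [53], [73], [74], discrete cosine transform (DCT) [75]-[78], discrete sine transform (DST) [76]-[78] and chirp $z$-transform (CZT) [79]. The DFT, DHT, and DCT [80] of an $N$-point input sequence $x(n)$ for $n = 0, 1, \ldots, N-1$, in general, are given by
$$X(k) = \sum_{n=0}^{N-1} C(k, n)\, x(n) \quad \text{for } k = 0, 1, \ldots, N-1 \tag{32}$$
where the transform kernel matrix is defined as
$$C(k, n) = \begin{cases} \cos(2\pi kn/N) - j \sin(2\pi kn/N) & \text{for DFT} \\ \cos(2\pi kn/N) + \sin(2\pi kn/N) & \text{for DHT} \\ \cos(\pi (2n+1) k / 2N) & \text{for DCT} \end{cases}$$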
The input sequence for the DFT is, in general, complex, and the computation of (32) can be partitioned into blocks of the form $x_r \cos\theta + x_i \sin\theta$ and $x_i \cos\theta - x_r \sin\theta$, where $x_r$ and $x_i$ are the real and imaginary parts of an input sample, which is the same form as the output of RM-CORDIC, for $\theta = 2\pi kn/N$. In the case of the DHT, the computation can similarly be transformed into computations of a rotation form to be implemented efficiently by RM-CORDIC units. These features of the DFT and DHT are used to design parallel and pipelined architectures for the computation of these two transforms [50], [52], [53], [71]-[74]. It is shown in [76], [77] that, by simple input-output modification, one can transform the DCT and DST kernels into the DHT form and then compute them by rotation mode CORDIC. Similarly, in [79], the CZT is represented by a DFT-like kernel using simple pre-processing and post-processing operations, and implemented through CORDIC rotations. The CORDIC technique has also been used in many image processing operations like spatial domain image enhancement for contrast stretching, logarithmic transformation and power-law transformation, image rotation, and the Hough transform for line detection [81], [82]. CORDIC implementations of some of these applications are discussed in [83], [84]. Several other signal processing applications are discussed in detail in [64], which we do not intend to repeat here.
C. Applications to Communication
CORDIC algorithm can be used for efficient implementation of various functional modules in a digital communication system [85] . Most applications of CORDIC in communications use the circular coordinate system in one or both CORDIC operating modes. The RM-CORDIC is mainly used to generate mixed signals, while the VM-CORDIC is mainly used to estimate phase and frequency parameters. We briefly outline here some of the important communication applications.
1) Direct Digital Synthesis: Direct digital synthesis is the process of generating sinusoidal waveforms directly in the digital domain. A direct digital synthesizer (DDS) (as shown in Fig. 8 ) consists of a phase accumulator and a phase-to-waveform converter [86] , [87] . The phase-generation circuit increments the phase according to , where is the normalized carrier frequency in every cycle and feeds the phase information to the phase-to-waveform converter. The phase-to-waveform converter could be realized by an RM-CORDIC [88] , [89] , as shown in Fig. 8 . The cosine and sine waveforms are obtained, respectively, by the CORDIC outputs and .
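A direct digital synthesizer of this kind can be modeled in a few lines. The sketch below is an illustration under stated assumptions: the phase accumulator adds $2\pi f_c$ per sample, with $f_c$ taken as the carrier frequency normalized to the sampling rate; the phase is folded into $[-\pi/2, \pi/2]$ before the rotation because the basic circular CORDIC converges only for angles of about $\pm 99.9^\circ$; and the names `cordic_cos_sin` and `dds` are hypothetical.

```python
import math

N_ITER = 24
ANGLES = [math.atan(2.0 ** -i) for i in range(N_ITER)]
K = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(N_ITER))


def cordic_cos_sin(phase):
    """Phase-to-waveform conversion by RM CORDIC; returns (cos, sin)."""
    phase = math.remainder(phase, 2.0 * math.pi)   # wrap into [-pi, pi]
    flip = 1.0
    if phase > math.pi / 2:
        phase, flip = phase - math.pi, -1.0
    elif phase < -math.pi / 2:
        phase, flip = phase + math.pi, -1.0
    # Pre-scaled start vector, so that no scaling is needed at the output.
    x, y, z = 1.0 / K, 0.0, phase
    for i, a in enumerate(ANGLES):
        d = 1.0 if z >= 0 else -1.0
        x, y, z = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i, z - d * a
    return flip * x, flip * y


def dds(f_c, n_samples):
    """Phase accumulator followed by the CORDIC phase-to-waveform converter."""
    phase, out = 0.0, []
    for _ in range(n_samples):
        out.append(cordic_cos_sin(phase))
        phase = math.remainder(phase + 2.0 * math.pi * f_c, 2.0 * math.pi)
    return out


samples = dds(0.05, 40)
print(samples[:3])
```

The output frequency is changed simply by changing the phase increment, which is what makes the accumulator plus rotator structure attractive for frequency synthesis.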
2) Analog and Digital Modulation:
A generic scheme to use CORDIC in RM for digital modulation is shown in Fig. 9 , where the phase-generation unit of Fig. 8 is changed to generate the phase according to , for and being the normalized carrier and the modulating frequencies, respectively, and is the phase of modulating component. By suitable selection of the parameters and and the CORDIC inputs and , the generic scheme of Fig. 9 it could be used for digital realization of analog amplitude modulation (AM), phase modulation (PM), and frequency modulation (FM), as well as the digital modulations, e.g., amplitude shift keying (ASK), phase-shift keying (PSK), and frequency-shift keying (FSK) modulators. It could also be used for the up/down converters for quadrature-amplitude modulators (QAM) and full mixers for complex signals or phase and frequency corrector circuits for synchronization [85] .
3) Other Communication Applications: By operating the CORDIC in vectoring mode, one can compute the magnitude and the angle of an input vector. The magnitude computation can be used for envelope-detection in an AM receiver or to detect FSK signal if it is placed after mark or space filters [90] . The angle computation in VM CORDIC, on the other hand, can be used to detect FM and FSK signals and to estimate phase and frequency parameters [91] . A single VM-CORDIC can be used to perform these computations for the implementation of a slicer for a high-order constellation like the 32-APSK used in DVB-S2.
CORDIC circuits operating in both modes are also required in digital receivers for the synchronization stage, to perform a phase or frequency estimation followed by a correction stage. This can be done by using two different CORDIC units to meet the high speed requirement of the Costas loop for phase recovery in QAM [92], [93]. On the other hand, a burst-based communication system that needs a preamble for synchronization purposes, e.g., an IEEE 802.11a WLAN-OFDM receiver, can use a single CORDIC unit configurable for both operating modes, since the estimation and correction are not performed simultaneously [94], [95]. Apart from these, the CORDIC-based QR decomposition has been used in multi-input-multi-output (MIMO) systems to implement V-BLAST detectors [96]-[98], and to implement a recursive-least-square (RLS) adaptive antenna beamformer [67], [99], [100].
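The magnitude and angle computation by vectoring-mode CORDIC, as used for envelope detection and phase estimation, can be sketched as follows. The example is a floating-point illustration; the initial half-turn rotation for inputs in the left half-plane, the iteration count, and the name `to_polar` are assumptions made for this example, and the in-phase and quadrature sample values are arbitrary.

```python
import math

N_ITER = 32
ANGLES = [math.atan(2.0 ** -i) for i in range(N_ITER)]
K = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(N_ITER))


def to_polar(i_sample, q_sample):
    """VM CORDIC: returns (magnitude, phase) of the vector (I, Q)."""
    x, y, offset = i_sample, q_sample, 0.0
    if x < 0:
        # Rotate by a half turn so the vector lies in the convergence range.
        x, y = -x, -y
        offset = math.pi if q_sample >= 0 else -math.pi
    z = 0.0
    for i, a in enumerate(ANGLES):
        d = -1.0 if y >= 0 else 1.0          # drive y towards zero
        x, y, z = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i, z - d * a
    return x / K, z + offset


# A sample with envelope 1.5 and phase 2.5 rad (arbitrary test values).
envelope, phase = to_polar(1.5 * math.cos(2.5), 1.5 * math.sin(2.5))
print(envelope, phase)
```

The first output corresponds to the envelope of the received sample and the second to its phase; differences of successive phase values would give a frequency estimate.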
D. Applications of CORDIC to Robotics and Graphics
Two of the key problems where CORDIC provides area and power-efficient solutions are: 1) direct kinematics and 2) inverse kinematics of serial robot manipulators. How CORDIC is applied in these applications is discussed below.
1) Direct Kinematics Solution (DKS) for Serial Robot Manipulators:
A robot manipulator consists of a sequence of links, connected typically by either revolute or prismatic joints. For an $n$-degrees-of-freedom manipulator, there are $n$ joint-link pairs, with link 0 being the supporting base and the last link attached to a tool. The joints and links are numbered outwardly from the base. The coordinates of the points on the $i$th link, represented by $(x_i, y_i, z_i)$, change successively for $1 \le i \le n$ due to successive rotations and translations of the links. The translation operations are realized by simple additions of coordinate values, while the new coordinates of any point due to rotation are computed by RM-CORDIC circuits.
2) Inverse Kinematics for Robot Manipulators: The inverse kinematics problem involves determination of the joint variables for a desired position and orientation of the tool. The CORDIC approach is valuable for finding the inverse kinematic solution when a closed form solution is possible (in particular, when the desired tool tip position is within the robot's work envelope and the joint angle limits are not violated). The authors in [101] present a maximally pipelined CORDIC-based architecture for efficient computation of the inverse kinematics solution. It is also shown [101], [102] that up to 25 CORDIC processors are required for the computation of the entire inverse kinematics solution for a six-link PUMA-type robotic arm. Apart from the implementation of rotation operations, CORDIC is used in the evaluation of the trigonometric functions and square root expressions involved in inverse kinematics problems [103].
3) CORDIC for Other Robotics Applications: CORDIC has also been applied to robot control [104] , [105] , where CORDIC circuits serve as the functional units of a programmable CPU co-processor. Another application of CORDIC is for kinematics of redundant manipulators [106] . It is shown in [106] that the case of inverse kinematics can be implemented efficiently in parallel by computing pseudo-inverse through singular value decomposition. Collision detection is another area where CORDIC has been applied to robotics [107] . A CORDIC-based highly parallel solution for collision detection between a robot manipulator and multiple obstacles in the workspace is suggested in [107] . The collision detection problem is formulated as one that involves a number of coordinate transformations. CORDIC-based processing elements are used to efficiently perform the coordinate transformations by shift-add operations.
4) CORDIC for 3-D Graphics:
Graphics operations such as 3-D vector rotation, lighting and vector interpolation are computation-intensive and geometric in nature. The CORDIC architecture is therefore a natural candidate for cost-effective implementation of these geometric computations in graphics. A systematic formulation to represent 3-D computer graphics operations in terms of CORDIC-type primitives is provided in [108]. An efficient stream processor based on CORDIC-type modules to implement the graphics operations is also suggested in [108]. 3-D vector interpolation is also an important function in graphics, which is required for good-quality shading [109] in graphics rendering. It is shown that the variable-precision capability of the CORDIC engine can be utilized to realize a power-aware implementation of the 3-D vector interpolator [110].
VI. CONCLUSION
The beauty of CORDIC is its potential for a unified solution to a large set of computational tasks involving the evaluation of trigonometric and transcendental functions, calculation of multiplication, division, square-root and logarithm, solution of linear systems, QR decomposition, SVD, etc. Moreover, CORDIC is implemented by simple hardware through repeated shift-add operations. These features of CORDIC have made it an attractive choice for a wide variety of applications. In the last fifty years, several algorithms and architectures have been developed to speed up CORDIC by reducing its iteration count and through its pipelined implementation. Moreover, its applications in several diverse areas including signal processing, image processing, communication, robotics and graphics, apart from general scientific and technical computations, have been explored. Latency of computation, however, continues to be the major drawback of the CORDIC algorithm, since we do not have efficient algorithms for its parallel implementation. On the other hand, CORDIC is inherently suitable for pipelined designs, due to its iterative behavior and its small cycle time compared with conventional arithmetic. For high-throughput applications, efficient pipelined architectures with multiple CORDIC units could be developed to take advantage of the pipelineability of CORDIC, because digital hardware is getting cheaper with progressive device scaling. Research on fast implementation of the shift-accumulation operation, exploration of new number systems for CORDIC, and optimization of CORDIC for constant rotations offer scope for further reduction of its latency. Another way to use CORDIC efficiently is to transform the computational algorithm into independent segments, and to implement the individual segments by different CORDIC processors.
With enhancement of its throughput and reduction of its latency, it is expected that CORDIC will be useful for many high-speed and real-time applications. The area-delay-accuracy trade-off for different advanced algorithms may be investigated in detail and compared in future work.
