50 Years of CORDIC: Algorithms, Architectures, and Applications by Meher, Pramod et al.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 56, NO. 9, SEPTEMBER 2009 1893
50 Years of CORDIC: Algorithms, Architectures,
and Applications
Pramod K. Meher, Senior Member, IEEE, Javier Valls, Member, IEEE, Tso-Bing Juang, Member, IEEE,
K. Sridharan, Senior Member, IEEE, and Koushik Maharatna, Member, IEEE
Abstract—Year 2009 marks the completion of 50 years of the
invention of CORDIC (COordinate Rotation DIgital Computer)
by Jack E. Volder. The beauty of CORDIC lies in the fact that
by simple shift-add operations, it can perform several computing
tasks such as the calculation of trigonometric, hyperbolic and
logarithmic functions, real and complex multiplications, division,
square-root, solution of linear systems, eigenvalue estimation,
singular value decomposition, QR factorization and many others.
As a consequence, CORDIC has been utilized for applications in
diverse areas such as signal and image processing, communication
systems, robotics and 3-D graphics apart from general scientiﬁc
and technical computation. In this article, we present a brief
overview of the key developments in the CORDIC algorithms and
architectures along with their potential and upcoming applica-
tions.
Index Terms—Arithmetic circuits, CORDIC, CORDIC algo-
rithms, digital signal processing chip, VLSI.
I. INTRODUCTION
COordinate Rotation DIgital Computer is abbreviated
as CORDIC. The key concept of CORDIC arithmetic is
based on the simple and ancient principles of two-dimensional
geometry. But the iterative formulation of a computational
algorithm for its implementation was first described in 1959 by Jack
E. Volder [1], [2] for the computation of trigonometric
functions, multiplication and division. This year therefore marks the
completion of 50 years of the CORDIC algorithm. Not only
has a wide variety of applications of CORDIC emerged in
the last 50 years, but a lot of progress has also been made in
the area of algorithm design and development of architectures
for high-performance and low-cost hardware solutions of those
Manuscript received August 22, 2008; revised November 26, 2008 and April
10, 2009. First published June 19, 2009; current version published September
02, 2009. This paper was recommended by Associate Editor V. Öwall.
P. K. Meher is with the Department of Communication Systems, Institute for
Infocomm Research, Singapore 138632 (e-mail: pkmeher@i2r.a-star.edu.sg).
J. Valls is with Instituto de Telecomunicaciones y Aplicaciones Multimedia,
Universidad Politécnica de Valencia, 46730 Grao de Gandia, Spain (e-mail:
jvalls@eln.upv.es).
T.-B. Juang is with the Department of Computer Science and Information
Engineering, National Pingtung Institute of Commerce, Pingtung City, Taiwan
900 (e-mail: tsobing@npic.edu.tw).
K. Sridharan is with the Department of Electrical Engineering, Indian Insti-
tute of Technology Madras, Chennai 600036, India (e-mail: sridhara@iitm.ac.
in).
K. Maharatna is with the School of Electronics and Computer Sci-
ence, University of Southampton, Southampton, SO17 1BJ, U.K. (e-mail:
km3@ecs.soton.ac.uk).
Digital Object Identiﬁer 10.1109/TCSI.2009.2025803
applications. CORDIC-based computing received increased
attention in 1971, when John Walther [3], [4] showed that, by
varying a few simple parameters, it could be used as a single
algorithm for unified implementation of a wide range of
elementary transcendental functions involving logarithms,
exponentials, and square roots along with those suggested by Volder
[1]. During the same time, Cochran [5] benchmarked various
algorithms, and showed that the CORDIC technique is a better choice
for scientific calculator applications.
The popularity of CORDIC was very much enhanced there-
after primarily due to its potential for efﬁcient and low-cost
implementation of a large class of applications which include:
the generation of trigonometric, logarithmic and transcendental
elementary functions; complex number multiplication, eigen-
value computation, matrix inversion, solution of linear systems
and singular value decomposition (SVD) for signal processing,
image processing, and general scientiﬁc computation. Some
other popular and upcoming applications are:
1) direct frequency synthesis, digital modulation and coding
for speech/music synthesis and communication;
2) direct and inverse kinematics computation for robot ma-
nipulation;
3) planar and three-dimensional vector rotation for graphics
and animation.
Although CORDIC may not be the fastest technique to perform
these operations, it is attractive due to the simplicity of
its hardware implementation, since the same iterative algorithm
could be used for all these applications using the basic shift-add
operations of the form $a \pm b \cdot 2^{-i}$.
Keeping the requirements and constraints of different ap-
plication environments in view, the development of CORDIC
algorithm and architecture has taken place for achieving high
throughput rate and reduction of hardware-complexity as well
as the latency of implementation. Some of the typical ap-
proaches for reduced-complexity implementation are focussed
on minimization of the complexity of scaling operation and the
complexity of barrel-shifter in the CORDIC engine. Latency
of implementation is an inherent drawback of the conventional
CORDIC algorithm. Angle recoding schemes, mixed-grain
rotation and higher radix CORDIC have been developed for
reduced latency realization. Parallel and pipelined CORDIC
have been suggested for high-throughput computation. The
objective of this article is not to present a detailed survey of
the developments of algorithms, architectures and applications
of CORDIC, which would require a few doctoral and masters
level dissertations. Rather we aim at providing the key develop-
ments in algorithms and architectures along with an overview
1549-8328/$26.00 © 2009 IEEE
Authorized licensed use limited to: UNIVERSITY OF SOUTHAMPTON. Downloaded on September 15, 2009 at 13:19 from IEEE Xplore.  Restrictions apply. 1894 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 56, NO. 9, SEPTEMBER 2009
of the major application areas and upcoming applications. We
shall however discuss here the basic principles of CORDIC
operations for the beneﬁt of general readers.
The remainder of this paper is organized as follows. In
Section II, we discuss the principles of CORDIC operation,
covering the elementary ideas from coordinate transformation
to rotation mode and vectoring mode operations followed
by design of the basic CORDIC cell and multidimensional
CORDIC. The key developments in CORDIC algorithms and
architectures are discussed in Section III, which covers the al-
gorithms and architectures pertaining to higher-radix CORDIC,
angle recoding, coarse-fine hybrid micro-rotations, redundant
number representation, differential CORDIC, and pipeline
implementation. In Section IV, we discuss the scaling and
accuracy aspects including the scaling techniques, scaling-free
CORDIC, quantization and area-delay-accuracy trade-off. The
applications of CORDIC to scientiﬁc computations, signal pro-
cessing, communications, robotics and graphics are discussed
briefly in Section V. The conclusion, along with future research
directions, is presented in Section VI.
II. BASIC CORDIC TECHNIQUES
In this Section, we discuss the basic principle underlying the
CORDIC-based computation, and present its iterative algorithm
for different operating modes and planar coordinate systems. At
the end of this section, we discuss the extension of two-dimensional
rotation to multidimensional formulation.
A. The CORDIC Algorithm
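As shown in Fig. 1, the rotation of a two-dimensional vector
$\mathbf{p}_0 = [x_0\ y_0]^T$ through an angle $\theta$, to obtain a rotated vector
$\mathbf{p}_n = [x_n\ y_n]^T$, could be performed by the matrix product
$\mathbf{p}_n = \mathbf{R}\mathbf{p}_0$, where $\mathbf{R}$ is the rotation matrix:
$$\mathbf{R} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \tag{1}$$
By factoring out the cosine term in (1), the rotation matrix $\mathbf{R}$
can be rewritten as
$$\mathbf{R} = \left[(1+\tan^2\theta)^{-1/2}\right] \begin{bmatrix} 1 & -\tan\theta \\ \tan\theta & 1 \end{bmatrix} \tag{2}$$
and can be interpreted as a product of a scale-factor
$K = (1+\tan^2\theta)^{-1/2} = \cos\theta$ with a pseudorotation matrix $\mathbf{R}_c$,
given by
$$\mathbf{R}_c = \begin{bmatrix} 1 & -\tan\theta \\ \tan\theta & 1 \end{bmatrix} \tag{3}$$
The pseudorotation operation rotates the vector $\mathbf{p}_0$ by an angle
$\theta$ and changes its magnitude by a factor $1/K = 1/\cos\theta$, to produce
a pseudo-rotated vector $\mathbf{p}'_n = \mathbf{R}_c\mathbf{p}_0$.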
To achieve simplicity of hardware realization of the rotation,
the key ideas used in CORDIC arithmetic are to (i) decompose
the rotations into a sequence of elementary rotations through
predefined angles that could be implemented with minimum
hardware cost; and (ii) to avoid scaling, that might involve
arithmetic operations such as square-root and division. The second
idea is based on the fact that the scale-factor contains only the
magnitude information but no information about the angle of rotation.
Fig. 1. Rotation of vector on a two-dimensional plane.
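1) Iterative Decomposition of Angle of Rotation: The
CORDIC algorithm performs the rotation iteratively by
breaking down the angle of rotation into a set of small pre-defined
angles¹, $\alpha_i = \tan^{-1}(2^{-i})$, so that $\tan\alpha_i = 2^{-i}$ could
be implemented in hardware by shifting through $i$ bit locations.
Instead of performing the rotation directly through an angle $\theta$,
CORDIC performs it by a certain number of microrotations
through angle $\alpha_i$, where
$$\theta = \sum_{i=0}^{n-1} \sigma_i \alpha_i, \quad \text{and} \quad \sigma_i = \pm 1 \tag{4}$$
that satisfies the CORDIC convergence theorem [3]:
$\alpha_i - \sum_{j=i+1}^{n-1} \alpha_j < \alpha_{n-1},\ \forall i,\ i = 0, 1, \ldots, n-2$. But
the decomposition according to (4) could be used only for
$-1.74329 \le \theta \le 1.74329$ (called the "convergence range")
since $\sum_{i=0}^{\infty} \alpha_i = 1.74329$. Therefore, the angular
decomposition of (4) is applicable for angles in the first and fourth
quadrants. To obtain on-the-fly decomposition of angles into
the discrete base $\alpha_i$, one may otherwise use the nonrestoring
decomposition [6]
$$\omega_0 = \theta \quad \text{and} \quad \omega_{i+1} = \omega_i - \sigma_i \alpha_i \tag{5}$$
with $\sigma_i = 1$ if $\omega_i \ge 0$ and $\sigma_i = -1$ otherwise, where the
rotation matrix for the $i$th iteration corresponding to the selected
angle $\alpha_i$ is given by
$$\mathbf{R}(i) = K_i \begin{bmatrix} 1 & -\sigma_i 2^{-i} \\ \sigma_i 2^{-i} & 1 \end{bmatrix} \tag{6}$$
$K_i = 1/\sqrt{1 + 2^{-2i}}$ being the scale-factor, and the pseudorotation
matrix
$$\mathbf{R}_c(i) = \begin{bmatrix} 1 & -\sigma_i 2^{-i} \\ \sigma_i 2^{-i} & 1 \end{bmatrix} \tag{7}$$
Note that the pseudo-rotation matrix $\mathbf{R}_c(i)$ for the $i$th iteration
alters the magnitude of the rotated vector by a factor
$1/K_i = \sqrt{1 + 2^{-2i}}$ during the $i$th microrotation, which is
independent of the value of $\sigma_i$ (direction of microrotation) used in
the angle decomposition.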
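The nonrestoring decomposition of (5) can be illustrated with a minimal Python sketch. This is a floating-point software model written for this overview, not a description of any hardware in the cited works; the function name `decompose_angle` and the default iteration count are hypothetical choices.

```python
import math

def decompose_angle(theta, n=24):
    """Nonrestoring decomposition of theta over alpha_i = atan(2^-i).

    Returns the directions sigma_i (each +1 or -1) and the residual
    angle omega_n that remains after n microrotations.
    """
    omega = theta                        # omega_0 = theta
    sigmas = []
    for i in range(n):
        alpha = math.atan(2.0 ** -i)     # elementary angle alpha_i
        sigma = 1 if omega >= 0 else -1  # direction of the i-th microrotation
        sigmas.append(sigma)
        omega -= sigma * alpha           # omega_{i+1} = omega_i - sigma_i * alpha_i
    return sigmas, omega

# Sum of the elementary angles: the convergence range of (4), in radians.
convergence_range = sum(math.atan(2.0 ** -i) for i in range(60))
```

For any input inside the convergence range, the residual returned by this sketch is bounded in magnitude by the last elementary angle, which is what makes a fixed iteration count sufficient.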
¹All angles are measured in radians unless otherwise stated.
Fig. 2. Hardware implementation of a CORDIC iteration.
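2) Avoidance of Scaling: The other simplification performed
by Volder's algorithm [1] is to remove the scale-factor
$K_i = 1/\sqrt{1 + 2^{-2i}}$ from (6). The removal of scaling from the
iterative microrotations leads to a pseudo-rotated vector
$\mathbf{p}'_n = \mathbf{R}_c(n-1) \cdots \mathbf{R}_c(0)\,\mathbf{p}_0$
instead of the desired rotated vector $\mathbf{p}_n = K\mathbf{p}'_n$, where the
scale-factor $K$ is given by
$$K = \prod_{i=0}^{n-1} K_i = \prod_{i=0}^{n-1} \frac{1}{\sqrt{1 + 2^{-2i}}} \tag{8}$$
Since the scale-factor of microrotations does not depend on
the direction of microrotations and decreases monotonically, the
final scale-factor converges to $K \simeq 0.60725$ (equivalently,
$1/K \simeq 1.64676$). Therefore, instead of scaling during each
microrotation, the magnitude of the final output could be scaled by $K$.
The basic CORDIC iterations are thus obtained by applying the
pseudo-rotation of the vector $\mathbf{p}_i = [x_i\ y_i]^T$, to have
$\mathbf{p}_{i+1} = \mathbf{R}_c(i)\mathbf{p}_i$, together with the nonrestoring
decomposition of the selected angles $\alpha_i$, as follows:
$$\begin{aligned} x_{i+1} &= x_i - \sigma_i 2^{-i} y_i \\ y_{i+1} &= y_i + \sigma_i 2^{-i} x_i \\ \omega_{i+1} &= \omega_i - \sigma_i \alpha_i \end{aligned} \tag{9}$$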
CORDIC iterations of (9) could be used in two operating modes,
namely the rotation mode (RM) and the vectoring mode (VM),
which differ basically on how the directions of the microrotations
are chosen. In the rotation mode, a vector $\mathbf{p}_0$ is rotated by
an angle $\theta$ to obtain a new vector $\mathbf{p}'_n$. In this mode, the direction
of each microrotation $\sigma_i$ is determined by the sign of $\omega_i$: if the sign
of $\omega_i$ is positive, then $\sigma_i = 1$, otherwise $\sigma_i = -1$. In the
vectoring mode, the vector $\mathbf{p}_0$ is rotated towards the $x$-axis so that
the $y$-component approaches zero. The sum of all angles of
microrotations (output angle $\omega_n$) is equal to the angle of rotation of
vector $\mathbf{p}_0$, while output $x'_n$ corresponds to its magnitude. In this
operating mode, the decision about the direction of the
microrotation depends on the sign of $y_i$: if it is positive then $\sigma_i = -1$,
otherwise $\sigma_i = 1$. CORDIC iterations are easily implemented
in both software and hardware. Fig. 2 shows the basic hardware
stage for a single CORDIC iteration. After each iteration the
number of shifts is incremented by a pair of barrel-shifters. To
have an $n$-bit output precision, $(n+1)$ CORDIC iterations are
needed. Note that the shift could be implemented by a simple
selection operation in serial architectures like the one proposed in
the original work, while in fully parallel CORDIC architectures the
shift operations could be hardwired, so that no barrel-shifters are
involved.
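The iterations of (9) and the two operating modes can be sketched in a few lines of Python. The following is a floating-point model under stated assumptions: the function names `cordic_circular` and `scale_factor` are hypothetical, shifts are modelled as multiplications by $2^{-i}$, and the vectoring mode assumes an input vector with a positive $x$-component.

```python
import math

def cordic_circular(x, y, omega, n=32, mode="rotation"):
    """Run n microrotations of (9); the outputs x, y are NOT yet scaled by K."""
    for i in range(n):
        t = 2.0 ** -i                        # models a right shift by i bits
        if mode == "rotation":
            sigma = 1 if omega >= 0 else -1  # drive the residual angle to zero
        else:                                # vectoring mode
            sigma = -1 if y >= 0 else 1      # drive the y-component to zero
        x, y = x - sigma * t * y, y + sigma * t * x
        omega -= sigma * math.atan(t)        # omega_{i+1} = omega_i - sigma_i*alpha_i
    return x, y, omega

def scale_factor(n=32):
    """Scale-factor K of (8) for n microrotations."""
    return math.prod(1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(n))

K = scale_factor()
# Rotation mode: rotate (1, 0) by 0.5 rad, then scale the result by K.
xr, yr, _ = cordic_circular(1.0, 0.0, 0.5)
cos_est, sin_est = K * xr, K * yr
# Vectoring mode: magnitude and phase of the vector (3, 4).
xv, yv, phase = cordic_circular(3.0, 4.0, 0.0, mode="vectoring")
magnitude = K * xv
```

In rotation mode the scaled outputs approximate the cosine and sine of the input angle; in vectoring mode the accumulated angle approximates the phase of the input vector and the scaled $x$ output approximates its magnitude.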
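Finally, to overcome the problem of the limited convergence
range and then to extend the CORDIC rotations to the complete
range of $\pm\pi$, an extra iteration is required to be performed. This
new iteration is shown in (10), which is required as an initial
rotation through $\pm\pi/2$.
$$\begin{aligned} x_0 &= -\sigma_{-1}\, y_{-1} \\ y_0 &= \sigma_{-1}\, x_{-1} \\ \omega_0 &= \omega_{-1} - \sigma_{-1}\, \alpha_{-1} \end{aligned} \qquad \text{where } \alpha_{-1} = \pi/2 \tag{10}$$
TABLE I
GENERALIZED CORDIC ALGORITHM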
B. Generalization of the CORDIC Algorithm
In 1971, Walther found how CORDIC iterations could be
modified to compute hyperbolic functions [3] and reformulated
the CORDIC algorithm into a generalized and unified form
which is suitable to perform rotations in circular, hyperbolic and
linear coordinate systems. The unified formulation includes a
new variable $m$, which is assigned different values for different
coordinate systems. The generalized CORDIC is formulated as
follows:
$$\begin{aligned} x_{i+1} &= x_i - m\,\sigma_i 2^{-i} y_i \\ y_{i+1} &= y_i + \sigma_i 2^{-i} x_i \\ \omega_{i+1} &= \omega_i - \sigma_i \alpha_i \end{aligned} \tag{11}$$
where
$$\sigma_i = \begin{cases} \mathrm{sign}(\omega_i), & \text{for rotation mode} \\ -\mathrm{sign}(y_i), & \text{for vectoring mode} \end{cases}$$
For $m = 1$, $0$ or $-1$, and $\alpha_i = \tan^{-1}(2^{-i})$, $2^{-i}$ or
$\tanh^{-1}(2^{-i})$, the algorithm given by (11) works in circular,
linear or hyperbolic coordinate systems, respectively. Table I
summarizes the operations that can be performed in rotation
and vectoring modes² in each of these coordinate systems.
The convergence ranges of linear and hyperbolic CORDIC are
obtained, as in the case of circular coordinates, by the sum of all
$\alpha_i$, given by $C = \sum_i \alpha_i$. The hyperbolic CORDIC requires
the iterations for $i = 4, 13, 40, \ldots$ to be executed twice to ensure
convergence. Consequently, these repetitions must be considered
while computing the scale-factor $K_h = \prod_i (1 - 2^{-2i})^{1/2}$,
which converges to 0.8281.
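A hedged sketch of the unified iteration (11) is given below as a floating-point Python model. The names `cordic_unified` and `hyperbolic_indices` are hypothetical; the sketch assumes that the hyperbolic iterations start at $i = 1$ (since $\tanh^{-1}(1)$ is unbounded) and that the repeated indices follow the sequence 4, 13, 40, ... mentioned above. It returns the accumulated magnitude gain of the pseudo-rotations so that the caller can compensate for it.

```python
import math

def hyperbolic_indices(n):
    """Shift indices 1..n, with i = 4, 13, 40, ... executed twice."""
    indices, repeat = [], 4
    for i in range(1, n + 1):
        indices.append(i)
        if i == repeat:
            indices.append(i)        # repeated iteration for convergence
            repeat = 3 * repeat + 1  # 4 -> 13 -> 40 -> ...
    return indices

def cordic_unified(x, y, omega, m, mode="rotation", n=40):
    """Unified CORDIC of (11): m = 1 circular, 0 linear, -1 hyperbolic.

    Returns (x, y, omega, gain); gain is the product of the magnitude
    changes sqrt(1 + m * 2^-2i) over the executed iterations.
    """
    indices = hyperbolic_indices(n) if m == -1 else range(n)
    gain = 1.0
    for i in indices:
        t = 2.0 ** -i
        alpha = math.atan(t) if m == 1 else (t if m == 0 else math.atanh(t))
        if mode == "rotation":
            sigma = 1 if omega >= 0 else -1
        else:
            sigma = -1 if y >= 0 else 1
        x, y = x - m * sigma * t * y, y + sigma * t * x
        omega -= sigma * alpha
        gain *= math.sqrt(1.0 + m * t * t)
    return x, y, omega, gain

# Hyperbolic rotation mode: cosh and sinh of 0.5.
xh, yh, _, gh = cordic_unified(1.0, 0.0, 0.5, m=-1)
cosh_est, sinh_est = xh / gh, yh / gh
# Linear rotation mode: the product 3 * 0.75 appears in y.
_, product, _, _ = cordic_unified(3.0, 0.0, 0.75, m=0)
```

The same loop body serves all three coordinate systems; only the elementary angle and the value of $m$ change, which is the point of Walther's unification.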
²In the rotation mode, the components of a vector resulting due to rotation of
a vector through a given angle are derived, while in the vectoring mode the
magnitude as well as the phase angle of a vector are estimated from the component
values. The rotation and vectoring modes are also known as the vector rotation
mode and the angle accumulation mode, respectively.
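C. Multidimensional CORDIC
The CORDIC algorithm was extended to higher dimensions
using simple Householder reflection [7]. The Householder
reflection matrix is defined as
$$\mathbf{H}_m = \mathbf{I}_m - 2\,\frac{\mathbf{u}\mathbf{u}^T}{\mathbf{u}^T\mathbf{u}} \tag{12}$$
where $\mathbf{u}$ is an $m$-dimensional vector and $\mathbf{I}_m$ is the $m \times m$
identity matrix. The product $\mathbf{H}_m\mathbf{v}$ reflects the $m$-dimensional
vector $\mathbf{v}$ with respect to the hyperplane with normal $\mathbf{u}$ that
passes through the origin. Basically, the Householder-based
CORDIC performs the vectoring operation of an $m$-dimensional
vector to one of the axes.
For the sake of clarity, we consider here the case of a 3-D vector
$\mathbf{p}_0 = [x_0\ y_0\ z_0]^T$ projected on to the $x$-axis in the Euclidean
space. The rotation matrix for the 3-D case, corresponding to the $i$th
iteration, $\mathbf{R}_{H3}(i)$, is given by the product of two simple
Householder reflections as
$$\mathbf{R}_{H3}(i) = \left[\mathbf{I}_3 - 2\,\frac{\mathbf{e}_1\mathbf{e}_1^T}{\mathbf{e}_1^T\mathbf{e}_1}\right] \left[\mathbf{I}_3 - 2\,\frac{\mathbf{u}_i\mathbf{u}_i^T}{\mathbf{u}_i^T\mathbf{u}_i}\right] \tag{13}$$
where $\mathbf{e}_1 = [1\ 0\ 0]^T$, and $\mathbf{u}_i = [1\ \ \sigma_{yi}t_i\ \ \sigma_{zi}t_i]^T$ with
$t_i = \tan\alpha_i = 2^{-i}$, and $\sigma_{yi} = \mathrm{sign}(x_i y_i)$ and
$\sigma_{zi} = \mathrm{sign}(x_i z_i)$ being the directions of microrotations.
One can write the $i$th rotation matrix in terms of the
pseudo-rotation matrix as $\mathbf{R}_{H3}(i) = K_{Hi}\mathbf{R}_{HC}(i)$, where
$K_{Hi} = 1/(1 + 2^{-2i+1})$ is the scale-factor and $\mathbf{R}_{HC}(i)$
is the pseudo-rotation matrix which could be expressed as a
function of the shifting and decision variables as
$$\mathbf{R}_{HC}(i) = \begin{bmatrix} 1 - 2^{-2i+1} & \sigma_{yi}2^{-i+1} & \sigma_{zi}2^{-i+1} \\ -\sigma_{yi}2^{-i+1} & 1 & -\sigma_{yi}\sigma_{zi}2^{-2i+1} \\ -\sigma_{zi}2^{-i+1} & -\sigma_{yi}\sigma_{zi}2^{-2i+1} & 1 \end{bmatrix} \tag{14}$$
Therefore, the $i$th iteration of 3-D Householder CORDIC
rotation results in $\mathbf{p}_{i+1} = \mathbf{R}_{HC}(i)\mathbf{p}_i$, and the vector $\mathbf{p}_0$ is projected
to the $x$-axis, such that after $n$ iterations $x_n$ gives the length of the
vector scaled by $1/K_H$, where $K_H = \prod_i K_{Hi}$, with $n$-bit precision [8].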
III. ADVANCED CORDIC ALGORITHMS AND ARCHITECTURES
CORDIC computation is inherently sequential due to two
main bottlenecks: 1) the micro-rotation for any iteration is
performed on the intermediate vector computed by the previous
iteration and 2) the $(i+1)$th iteration could be started only
after the completion of the $i$th iteration, since the value of
$\sigma_{i+1}$ which is required to start the $(i+1)$th iteration could be known
only after the completion of the $i$th iteration. To alleviate the
second bottleneck some attempts have been made for evaluation
of $\sigma_i$ values corresponding to small micro-rotation angles
[9], [10]. However, the CORDIC iterations still could not be
performed in parallel due to the first bottleneck. A partial
parallelization has been realized in [11] by combining a pair of
conventional CORDIC iterations into a single merged iteration
which provides better area-delay efficiency. But the accuracy
is slightly affected by such merging, and the approach cannot be
extended to a higher number of conventional CORDIC iterations since the
induced error becomes unacceptable [11]. Parallel realization
of CORDIC iterations to handle the first bottleneck by direct
unfolding of micro-rotation is possible, but that would result
in an increase in computational complexity, and the advantage of
simplicity of the CORDIC algorithm gets degraded [12], [13].
Although no popular architectures are known to us for fully parallel
implementation of CORDIC, different forms of pipelined
implementation of CORDIC have however been proposed for
improving the computational throughput [14].
Since the CORDIC algorithm exhibits linear-rate convergence,
it requires $(n+1)$ iterations to have $n$-bit precision of
the output. Overall latency of the computation thus amounts to the
product of the word-length and the CORDIC iteration period.
The speed of CORDIC operations is therefore constrained
either by the precision requirement (iteration count) or the
duration of the clock period. The duration of the clock period, on
the other hand, mainly depends on the large carry propagation
time for the addition/subtraction during each micro-rotation.
It is a straightforward choice to use fast adders for reducing
the iteration period at the expense of large silicon area. Use
of carry-save adders is a good option to reduce the iteration
period and overall latency [15]. Timmermann and others have
suggested a method of truncation of the CORDIC algorithm after
$(n-1)/2$ iterations (for $n$-bit precision), where the last iteration
performs a single rotation for implementing the remaining
angle. It lowers the latency time but involves one multiplication
or division, respectively, in the rotation or vectoring
mode [9].
To handle latency bottlenecks, various techniques have
been developed and reported in the literature. Most of the
well-known algorithms could be grouped under high-radix
CORDIC, the angle-recoding method, the hybrid micro-rotation
scheme, redundant CORDIC and differential CORDIC, which
we discuss briefly in the following subsections.
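A. Higher Radix CORDIC Algorithm
The radix-4 CORDIC algorithm [16] is given by
$$\begin{aligned} x_{i+1} &= x_i - \sigma_i 4^{-i} y_i \\ y_{i+1} &= y_i + \sigma_i 4^{-i} x_i \\ \omega_{i+1} &= \omega_i - \alpha_i(\sigma_i) \end{aligned} \tag{15}$$
where $\sigma_i \in \{-2, -1, 0, 1, 2\}$ and the elementary angles
$\alpha_i(\sigma_i) = \tan^{-1}(\sigma_i 4^{-i})$. The scale-factor for the $i$th iteration
is $K_i = 1/\sqrt{1 + \sigma_i^2 4^{-2i}}$. In order to preserve the norm of the
vector the output of micro-rotations is required to be scaled by
a factor
$$K = \prod_i K_i = \prod_i \frac{1}{\sqrt{1 + \sigma_i^2 4^{-2i}}} \tag{16}$$
To have $n$-bit output precision, the radix-4 CORDIC algorithm
requires $n/2$ micro-rotations, which is half that of the radix-2
algorithm. However, it requires more computation time for each
iteration and involves more hardware compared to the radix-2
CORDIC to select the value of $\sigma_i$ out of five different
possibilities. Moreover, the scale-factor, given by (16), also varies
with the rotation angles since it depends on $\sigma_i$, which could have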
any of the ﬁve different values. Some techniques have there-
fore been suggested for scale-factor compensation through iter-
ative shift-add operations [16], [17]. A high-radix CORDIC al-
gorithm in vectoring mode is also suggested in [18], which can
be used for reduced latency operation at the cost of larger size
tables for storing the elementary angles and pre-scaling factors
than the radix-2 and radix-4 implementation.
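The dependence of the radix-4 scale-factor on the selected digits can be seen in a small Python model of (15) and (16). This is only a sketch: the digit $\sigma_i$ is chosen here by a nearest-angle search over the five candidates using the exact residual angle, which is an assumption made for illustration and not the selection logic of [16]; the name `radix4_rotate` is hypothetical, and the input angle is assumed to satisfy $|\theta| \le \tan^{-1} 2$.

```python
import math

def radix4_rotate(x, y, theta, steps=12):
    """Radix-4 rotation per (15) with the angle-dependent scaling of (16).

    Returns the scaled outputs, the residual angle and the scale-factor K.
    """
    omega, scale = theta, 1.0
    for i in range(steps):
        t = 4.0 ** -i
        # Pick the digit whose elementary angle is closest to the residual.
        sigma = min(range(-2, 3), key=lambda s: abs(omega - math.atan(s * t)))
        x, y = x - sigma * t * y, y + sigma * t * x
        omega -= math.atan(sigma * t)                  # alpha_i(sigma_i)
        scale /= math.sqrt(1.0 + (sigma * t) ** 2)     # K_i depends on sigma_i
    return scale * x, scale * y, omega, scale

x1, y1, _, k1 = radix4_rotate(1.0, 0.0, 0.6)
x2, y2, _, k2 = radix4_rotate(1.0, 0.0, -1.0)
```

Twelve radix-4 steps here play the role of roughly twenty-four radix-2 micro-rotations, and the two calls return different values of the scale-factor, which is why scale-factor compensation needs extra attention in higher-radix designs.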
B. Angle Recoding (AR) Methods
The purpose of angle recoding (AR) is to reduce the number
of CORDIC iterations by encoding the angle of rotation as a
linear combination of a set of selected elementary angles of
micro-rotations. AR methods are well-suited for many signal
processing and image processing applications where the ro-
tation angle is known a priori, such as when performing the
discrete orthogonal transforms like discrete Fourier transform
(DFT), the discrete cosine transform (DCT), etc.
1) Elementary-Angle-Set Recoding: In the conventional
CORDIC, any given rotation angle is expressed as a linear
combination of $n$ values of elementary angles that belong to the set
$S = \{(\sigma \cdot \tan^{-1} 2^{-r}) : \sigma \in \{-1, 1\},\ r \in \{0, 1, \ldots, n-1\}\}$
in order to obtain an $n$-bit value as $\theta = \sum_{i=0}^{n-1} [\sigma_i \cdot \tan^{-1} 2^{-i}]$.
However, in AR methods, this constraint is relaxed by adding
zeros to the linear combination to obtain the desired angle
using relatively fewer terms of the form $(\sigma \cdot \tan^{-1} 2^{-r})$
for $\sigma \in \{1, 0, -1\}$. The elementary-angle-set (EAS) used
by the AR scheme is given by
$S_{EAS} = \{(\sigma \cdot \tan^{-1} 2^{-r}) : \sigma \in \{-1, 0, 1\},\ r \in \{0, 1, \ldots, n-1\}\}$.
One of the simplest forms
of the angle recoding method, based on the greedy algorithm
proposed by Hu and Naganathan [19], tries to represent the
remaining angle using the closest elementary angle $\pm\tan^{-1} 2^{-i}$.
The angle recoding algorithm of [19] is briefly stated in Table II.
Using this recoding scheme the total number of iterations could
be reduced by at least 50% keeping the same $n$-bit accuracy
unchanged. A similar method of angle recoding in vectoring
mode, called backward angle recoding, is suggested in
[20].
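The greedy idea described above can be sketched as follows. This Python model is written from the prose description only (repeatedly cancel the residual with the closest elementary angle until it falls below the smallest one); it is not a transcription of Table II, and the name `greedy_angle_recoding`, the stopping threshold and the safety bound on the loop are assumptions.

```python
import math

def greedy_angle_recoding(theta, n=16):
    """Greedy recoding of theta over the angles atan(2^-r), r = 0..n-1.

    Returns a list of (sigma, r) terms with sigma in {+1, -1} and the
    residual angle; indices r that are not listed correspond to sigma = 0.
    """
    alphas = [math.atan(2.0 ** -r) for r in range(n)]
    residual, terms = theta, []
    # Stop once the residual is below the smallest elementary angle.
    while abs(residual) >= alphas[-1] and len(terms) < 4 * n:
        r = min(range(n), key=lambda k: abs(abs(residual) - alphas[k]))
        sigma = 1 if residual >= 0 else -1
        terms.append((sigma, r))
        residual -= sigma * alphas[r]
    return terms, residual

terms, residual = greedy_angle_recoding(0.3)
```

For the example angle of 0.3 rad and $n = 16$, this sketch needs far fewer than 16 micro-rotations, in line with the reduction in iteration count discussed above.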
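2) Extended Elementary-Angle-Set Recoding: Wu et al. [21]
have suggested an AR scheme based on an extended
elementary-angle-set (EEAS), that provides a more flexible way of
decomposing the target rotation angle. In the EEAS approach,
the elementary-angle set $S_{EAS}$ is extended further
to
$S_{EEAS} = \{\tan^{-1}(\sigma_1 \cdot 2^{-r_1} + \sigma_2 \cdot 2^{-r_2}) : \sigma_1, \sigma_2 \in \{-1, 0, 1\}$
and $r_1, r_2 \in \{1, 2, \ldots, n-1\}\}$. EEAS has better
recoding efficiency in terms of the number of iterations and
can yield better error performance than the AR scheme based
on EAS. The pseudo-rotation for the $i$th micro-rotation based on
the EEAS scheme is given by
$$\begin{aligned} x_{i+1} &= x_i - \left[\sigma_1(i) \cdot 2^{-r_1(i)} + \sigma_2(i) \cdot 2^{-r_2(i)}\right] y_i \\ y_{i+1} &= y_i + \left[\sigma_1(i) \cdot 2^{-r_1(i)} + \sigma_2(i) \cdot 2^{-r_2(i)}\right] x_i \end{aligned} \tag{17}$$
The pseudo-rotated vector $[x_{R_m}\ y_{R_m}]^T$, obtained after $R_m$
(the required number of micro-rotations) iterations, according
to (17), needs to be scaled by a factor $K = \prod_i K_i$, where
$K_i = \left[1 + \left(\sigma_1(i) \cdot 2^{-r_1(i)} + \sigma_2(i) \cdot 2^{-r_2(i)}\right)^2\right]^{-1/2}$, to produce
the rotated vector. For reducing the scaling approximation and
for a more flexible implementation of scaling, similar to the
EEAS scheme for the micro-rotation phase, a method has also
been suggested in [21], as given below
$$\begin{aligned} \tilde{x}_{i+1} &= \tilde{x}_i + \left[k_1(i) \cdot 2^{-s_1(i)} + k_2(i) \cdot 2^{-s_2(i)}\right] \tilde{x}_i \\ \tilde{y}_{i+1} &= \tilde{y}_i + \left[k_1(i) \cdot 2^{-s_1(i)} + k_2(i) \cdot 2^{-s_2(i)}\right] \tilde{y}_i \end{aligned} \tag{18}$$
where $\tilde{x}_0 = x_{R_m}$ and $\tilde{y}_0 = y_{R_m}$; $k_1, k_2 \in \{-1, 0, 1\}$ and
$s_1, s_2 \in \{1, 2, \ldots, n-1\}$.
The iterations for the micro-rotation phase as well as the scaling
phase could be implemented in the same architecture to reduce
the hardware cost, as shown in Fig. 3.
TABLE II
ANGLE RECODING ALGORITHM
Fig. 3. EEAS-based CORDIC architecture. BS represents the Barrel Shifter,
and C denotes the control signals for the micro-rotations.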
3) Parallel Angle Recoding: The AR methods [19], [21]
could be used to reduce the number of iterations by more than
50%, when the angle of rotation is known in advance. However,
for unknown rotation angles, their hardware implementation
involves more cycle time than the conventional implementation,
which results in a reduction in overall efficacy of the algorithm.
To reduce the cycle time of CORDIC iterations in such cases,
a parallel angle selection scheme is suggested in [22], which
can be used in conjunction with the AR method, to gain the
advantages of the reduction in iteration count, without further
increase in the cycle time. The parallel AR scheme in [22] is
based on dynamic angle selection, where the elementary angles
$\alpha_j$ can be tested in parallel and the direction for the
micro-rotations can be determined quickly to minimize the iteration
period. During each iteration, the residual angle $\omega_i$ is passed to a
set of adder-subtractor units that compute $\bigl|\,|\omega_i| - \alpha_j\bigr|$
for each elementary angle $\alpha_j$ in parallel, and the
differences for $j = 0, 1, \ldots, n-1$ are then fed to a binary-tree like
structure to compare them against each other to find the smallest
difference. The $\alpha_j$ corresponding to the smallest difference
is used as the angle of micro-rotation. The architecture
for parallel angle recoding of [22] is shown in Fig. 4.
The parallel AR reduces the overall latency at the cost of high
hardware-complexity of the add/subtract-compare unit. For actual
implementation, it is required to find a space-time trade-off and
look at the relative performance in comparison with other
approaches as well. The AR schemes based on EAS and EEAS
however are useful for those cases where the angle of rotation
is known in advance.
Fig. 4. Architecture for parallel angle recoding.
C. Hybrid or Coarse-Fine Rotation CORDIC
Based on the radix-2 decomposition, any rotation angle $\theta$
with $n$-bit precision could be expressed as a linear combination
of angles from the set $\{2^{-i} : i \in \{0, 1, \ldots, n-1\}\}$, given
by $\theta = \sum_{i=0}^{n-1} b_i 2^{-i}$, where $b_i \in \{0, 1\}$ explicitly specifies whether
there is need of a micro-rotation or not. But radix-2
decomposition is not used in the conventional CORDIC because that
would not lead to simplicity of hardware realization. Instead,
arctangents of the corresponding values of the radix-2 based set are
used as the elementary-angle-set with a view to implement the
CORDIC operations only by shift-add operations. The key idea
underlying the coarse-fine angular decomposition is that for the
fine values of $\alpha_j = \tan^{-1}(2^{-j})$ (i.e., when $j > \lceil n/3 \rceil - 1$),
$\tan^{-1}(2^{-j})$ could be replaced by $2^{-j}$ in the radix-set for
expansion of $\theta$, since $\tan^{-1}(2^{-j}) \simeq 2^{-j}$ when $j$ is sufficiently large.
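1) Coarse-Fine Angular Decomposition: In the coarse-fine
angular decomposition, the elementary-angle-set contains
the arctangents of power-of-two for the more-significant part
while the less-significant part contains the power-of-two
values, such that the radix-set is given by $S = S_1 \cup S_2$,
where $S_1 = \{\tan^{-1} 2^{-i} : i \in \{1, 2, \ldots, p-1\}\}$ and
$S_2 = \{2^{-j} : j \in \{p, p+1, \ldots, n-1\}\}$, and $p$ is assumed
to be sufficiently large such that $\tan^{-1}(2^{-j}) \simeq 2^{-j}$ [10]. For
the hybrid decomposition scheme, the rotation angle could be
partitioned into two terms expressed as
$$\theta = \theta_M + \theta_L \tag{19}$$
where $\theta_M$ and $\theta_L$ are said to be the coarse and fine subangles,
respectively, given by
$$\theta_M = \sum_{i=1}^{p-1} \sigma_i \tan^{-1} 2^{-i}, \quad \text{for } \sigma_i \in \{1, -1\} \tag{20a}$$
$$\theta_L = \sum_{j=p}^{n-1} d_j 2^{-j}, \quad \text{for } d_j \in \{0, 1\} \tag{20b}$$
Fig. 5. Architecture for a Hybrid CORDIC algorithm [10].
A combination of coarse and fine micro-rotations is used
in hybrid CORDIC operations in two cascaded stages. Coarse
rotations are performed in stage-1 to have an intermediate vector
$$\begin{bmatrix} x_M \\ y_M \end{bmatrix} = \begin{bmatrix} 1 & -\tan\theta_M \\ \tan\theta_M & 1 \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \tag{21}$$
and fine rotations are performed on the output of stage-1 to
obtain the rotated output
$$\begin{bmatrix} x_n \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & -\tan\theta_L \\ \tan\theta_L & 1 \end{bmatrix} \begin{bmatrix} x_M \\ y_M \end{bmatrix} \tag{22}$$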
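The effect of replacing $\tan^{-1}(2^{-j})$ by $2^{-j}$ in the fine stage can be checked numerically with a short Python model. The sketch below makes several simplifying assumptions that are not taken from the cited designs: the coarse stage is run as conventional micro-rotations starting from $i = 0$, the fine stage uses signed digits (nonrestoring) rather than the digits $d_j \in \{0, 1\}$ of (20b), and the split point is set to $p = \lceil n/3 \rceil$. The function name `hybrid_rotate` is hypothetical.

```python
import math

def hybrid_rotate(x, y, theta, n=24, p=8):
    """Coarse-fine rotation model: exact arctangents for i < p, and the
    approximation atan(2^-i) ~ 2^-i for the fine micro-rotations i >= p."""
    omega, gain = theta, 1.0
    for i in range(n):
        t = 2.0 ** -i
        # Coarse stage tracks atan(2^-i); fine stage tracks 2^-i directly,
        # so its angle bookkeeping needs no arctangent table.
        alpha = math.atan(t) if i < p else t
        sigma = 1 if omega >= 0 else -1
        x, y = x - sigma * t * y, y + sigma * t * x
        omega -= sigma * alpha
        gain *= math.sqrt(1.0 + t * t)   # constant, since every i is executed
    return x / gain, y / gain

xh, yh = hybrid_rotate(1.0, 0.0, 0.7)
# Worst-case angle mismatch introduced by the fine-stage approximation.
mismatch = sum(2.0 ** -j - math.atan(2.0 ** -j) for j in range(8, 24))
```

With $n = 24$ and $p = 8$, the accumulated mismatch is below $2^{-24}$, so the approximation does not limit the 24-bit angular resolution in this model.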
2) Implementation of Hybrid CORDIC: To derive the
efficiency of hybrid CORDIC, the coarse and fine rotations are
performed by separate circuits as shown in Fig. 5. The coarse
rotation phase is performed by the CORDIC processor-I and the
fine rotation phase is performed by CORDIC processor-II.
To have fast implementation, processor-I performs a pair of
ROM look-up operations followed by addition to realize the
rotation through angle $\theta_M$. Since $\theta_L$ could be expressed as a linear
combination of angles of small enough magnitude $2^{-j}$, where
$\tan(2^{-j}) \simeq 2^{-j}$, the computation of the fine rotation phase can
be realized by a sequence of shift-and-add operations. For
implementation of the fine rotation phase, no computations are
involved to decide the direction of micro-rotation, since the need
of a micro-rotation is explicit in the radix-2 representation of
$\theta_L$. The radix-2 representation could also be recoded to express
$\theta_L = \sum_j \tilde{d}_j 2^{-j}$, where $\tilde{d}_j \in \{1, -1\}$, as shown in [9]. Since the
directions of micro-rotations are explicit in such a representation
of $\theta_L$, it would be possible to implement the fine rotation phase
in parallel for low-latency realization.
The hybrid decomposition could be used for reducing the
latency by ROM-based realization of the coarse operation. This can
also be used for reducing the hardware complexity of the fine
rotation phase since there is no need to find the direction of
micro-rotation. Several options are however possible for the
implementation of these two stages. A form of hybrid CORDIC is
suggested in [23] for very-high precision CORDIC rotation where
the ROM size is reduced to nearly $n \cdot 2^{n/5}$ bits. The coarse
rotations could be implemented as conventional CORDIC through
shift-add operations of micro-rotations if the latency is tolerable.
Fig. 6. Shift-add architecture for a Hybrid CORDIC algorithm.
3) Shift-Add Implementation of Coarse Rotation: Using
the symmetry properties of the sine and cosine functions in
different quadrants, the rotation through any arbitrary angle
$\theta$ could be mapped from the full range $[0, 2\pi]$ to the first half
of the first quadrant $[0, \pi/4]$. The coarse-fine partition could be
applied thereafter for reducing the number of micro-rotations
necessary for fine rotations. To implement the coarse rotations
through shift-add operations the coarse subangle $\theta_M$ is
represented in [24] and [25] in terms of elementary rotations of the
form $\tan^{-1}(2^{-i})$ as
(23)
where is a correction term.
Using (23) on (19), one can ﬁnd ,
where
(24)
It is shown [25] that, based on the above decompositions
using radix-2 representation, both coarse and ﬁne rotations
could be implemented by a sequence of shift-and-add oper-
ations in CORDIC iterations without ROM lookup table or
the real multiplication operation. One such implementation is
shown in Fig. 6. Processor-I performs CORDIC operations like
that of conventional CORDIC for nearly the ﬁrst one-third of
the iterations and the residual angle as well as the intermediate
rotated vector is passed to the processor-II. Processor-II can
perform the ﬁne rotation in one of the possible ways as in case
of the circuit of Fig. 5.
The coarse-fine rotation approach in some modified forms has
been applied for reduced-latency implementation of sine and
cosine generation [24]–[28], high-speed and high-precision
rotation [24], [26], and conversion of rectangular to polar
coordinates and vice versa [29], [30].
4) Parallel CORDIC Based on Coarse-Fine Decomposition:
In [31], the authors have proposed two angle recoding
techniques for parallel detection of direction of micro-rotations,
namely the binary to bipolar recoding (BBR) and micro-rotation
angle recoding (MAR) to be used for the coarse part of the
input angle $\theta$. BBR is used to obtain the polarity of each bit
in the radix-2 representation of $\theta$ to determine the rotation
direction. MAR is used to decompose each positional binary
weight $2^{-i}$ into a linear combination
of arctangent terms. It is further shown in [32] that the rotation
direction can be decided once the input angle is known, to
enable parallel computation of the micro-rotations. Although
the CORDIC rotation can be executed in parallel according to
[32], the method for decomposition of each positional binary
weight produces many extra stages of micro-rotation, especially
when the bit-width of the input angle increases. A more efficient
recoding scheme has been proposed in [33] for the reduction of
the number of micro-rotations to be employed in parallel CORDIC
rotations.
D. Redundant-Number-Based CORDIC Implementation
Addition/subtraction operations are faster in the redundant
number system, since unlike the binary system, it does not
involve carry propagation. The use of redundant number
system is therefore another way to speed up the CORDIC
iterations. A CORDIC implementation based on the redundant
number system called as redundant CORDIC was proposed
by Ercegovac and Lang and applied to matrix triangulariza-
tion and singular value decomposition [34]. Rotation mode
redundant CORDIC has been found to result in fast imple-
mentation of sinusoidal function generation, unitary matrix
transformation, angle calculation and rotation [34]–[38].
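Although redundant CORDIC can achieve a fast carry-free
computation, the direction of the micro-rotation (the sign
factor $\sigma_i$) cannot be determined directly unlike the case of the
conventional CORDIC, since the redundant number system
allows a choice $\sigma_i = 0$ along with the conventional choices
1 and $-1$ such that $\sigma_i \in \{-1, 0, 1\}$. Therefore, it requires a
different formulation for selection of $\sigma_i$, which is
different for binary signed-digit representation and carry-save
implementation. In radix-2 signed-digit representation,
assuming $\omega_{i+1} = 2(\omega_i - \sigma_i 2^{i} \tan^{-1} 2^{-i})$, it is
shown that [6]
$$\sigma_i = \begin{cases} 1 & \text{if } \hat{\omega}_i > 0 \\ 0 & \text{if } \hat{\omega}_i = 0 \\ -1 & \text{if } \hat{\omega}_i < 0 \end{cases} \tag{25}$$
where $\hat{\omega}_i$ is the value of $\omega_i$ truncated after the first fractional
digit. Similarly for carry-save implementation, it is
$$\sigma_i = \begin{cases} 1 & \text{if } \hat{\omega}_i \ge 0 \\ 0 & \text{if } \hat{\omega}_i = -1/2 \\ -1 & \text{if } \hat{\omega}_i < -1/2 \end{cases} \tag{26}$$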
It can be noted from (25) and (26) that in some of the iterations
no rotations are performed, so that the scale-factor becomes
a variable which depends on the angle of rotation. Since
the redundant CORDIC of [34] uses a non-constant scale-factor,
Takagi et al. [35] have proposed the double-rotation method and
the correcting-rotation method to keep the value of the scale-factor
constant. In the double-rotation method, two micro-rotations
are performed in each iteration, such that when $\sigma_i = 0$, one
positive and one negative micro-rotation are performed, and when
$\sigma_i = 1$ or $-1$, respectively, two positive or two negative
micro-rotations are performed. The scale-factor is retained constant in this
case since the number of micro-rotations is fixed for any rotation
angle, but it doubles the iteration count. The correcting-rotation
method examines the sign of $\bar{\omega}_i$ constituted by some most
significant digits of $\omega_i$, and if $\bar{\omega}_i \neq 0$ then $\sigma_i$ is taken to be
$\operatorname{sign}(\bar{\omega}_i)$, and is taken to be 1 otherwise. It is shown that the
error occurring in this algorithm could be corrected by repetition
of the iterations for $i = p, 2p, 3p$, etc., where $p$ is the
size of $\bar{\omega}_i$. The branching CORDIC was proposed in [36] for
fast on-line implementation of redundant CORDIC with a constant
scale factor. The main drawback of this method, however,
is its necessity of performing two conventional CORDIC iterations
in parallel, which consumes more silicon area than classical
methods [39]. The work proposed in [34] has also been
extended to the vectoring mode [37], and correcting operations
are included further to keep the scaling factor constant so as to
eliminate the hardware for scaling.
Fig. 7. Pipelined architecture for conventional CORDIC.
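The effect of zero-valued sign digits on the scale factor can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical, floating-point arithmetic stands in for redundant arithmetic, and the double-rotation variant assumes that each iteration applies two half-step rotations of equal magnitude. It measures the gain of a unit vector for two different sign sequences.

```python
import math

def redundant_rotation(x, y, sigmas):
    """Micro-rotations with sigma_i in {-1, 0, 1}; sigma_i = 0 skips iteration i."""
    for i, s in enumerate(sigmas):
        t = 2.0 ** -i
        x, y = x - s * t * y, y + s * t * x
    return x, y

def double_rotation(x, y, sigmas):
    """Two micro-rotations per iteration, so the number of rotations is fixed."""
    for i, s in enumerate(sigmas):
        t = 2.0 ** -(i + 1)          # assumed half-step elementary rotation
        first, second = (1, -1) if s == 0 else (s, s)
        for d in (first, second):
            x, y = x - d * t * y, y + d * t * x
    return x, y

def gain(rotate, sigmas):
    """Length of the rotated unit vector, i.e. the scale factor of the run."""
    return math.hypot(*rotate(1.0, 0.0, sigmas))

with_zeros = [1, 0, -1, 1, 0, 0, 1, -1]
without_zeros = [1, 1, -1, 1, -1, 1, 1, -1]
print(gain(redundant_rotation, with_zeros), gain(redundant_rotation, without_zeros))
print(gain(double_rotation, with_zeros), gain(double_rotation, without_zeros))
```

The first pair of gains differs because skipped iterations contribute no magnification, while the second pair agrees because every iteration performs the same number of micro-rotations regardless of the sign digit.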
E. Pipelined CORDIC Architecture
Since the CORDIC iterations are identical, it is convenient
to map them into pipelined architectures. The main
emphasis in efficient pipelined implementation lies with the
minimization of the critical path. The earliest pipelined
architecture that we find was suggested by Deprettere, Dewilde and
Udo in 1984 [14]. Pipelined CORDIC circuits have been used
thereafter for high-throughput implementation of sinusoidal
wave generation, fixed and adaptive filters, discrete orthogonal
transforms and other signal processing applications [40]–[44].
A generic architecture of a pipelined CORDIC circuit is shown in
Fig. 7. It consists of $n$ stages of CORDIC units, where each of the
pipelined stages consists of a basic CORDIC engine of the kind
shown in Fig. 2. Since the number of shifts to be performed by
the shifters at different stages is fixed (a shift through
$i$ bit positions is performed at the $i$th stage), in the case of pipelined
CORDIC the shift operations can be hardwired with the adders,
and therefore shifters are eliminated in the pipelined
implementation. The critical path of pipelined CORDIC thus amounts
to the time required by the add/subtract operations in each of
the stages. When three adders are used in each stage as shown
in Fig. 7, the critical path amounts to $T_{ADD} + T_{MUX} + T_{2C}$,
where $T_{ADD}$, $T_{MUX}$ and $T_{2C}$ are the time required for addition,
2:1 multiplexing and 2's complement operation, respectively.
For known and constant angle rotations the sign of the
micro-rotations could be predetermined, and the need for multiplexing
could be avoided to reduce the critical path. The latency of
computation thus depends primarily on the time required for
an addition. Since there is very little room for reducing the
critical path in the pipelined implementation of conventional
CORDIC, digit-on-line pipelined CORDIC circuits based on
the differential CORDIC (D-CORDIC) algorithm have been
suggested to achieve higher throughput and lower pipeline
latency.
TABLE III
DIFFERENTIAL CORDIC ALGORITHM
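The structure of such a pipeline can be mimicked in software by unrolling the iterations into stages whose shift amounts and arctangent constants are fixed. The sketch below is a minimal fixed-point model in Python, not a description of the circuit of Fig. 7; the stage count, the 16 fractional bits, and the names `stage` and `cordic_rotate` are assumptions made for illustration.

```python
import math

N_STAGES = 16
FRAC_BITS = 16                       # fixed-point fractional bits (assumed)
ONE = 1 << FRAC_BITS
# Constant fed to the angle adder of stage i.
ATAN = [round(math.atan(2.0 ** -i) * ONE) for i in range(N_STAGES)]
K = math.prod(math.sqrt(1 + 4.0 ** -i) for i in range(N_STAGES))

def stage(i, x, y, w):
    """One pipeline stage: three add/subtract operations, shift amount i is fixed."""
    if w >= 0:
        return x - (y >> i), y + (x >> i), w - ATAN[i]
    return x + (y >> i), y - (x >> i), w + ATAN[i]

def cordic_rotate(x, y, w):
    """Pass one operand through all stages (rotation mode)."""
    for i in range(N_STAGES):
        x, y, w = stage(i, x, y, w)
    return x, y

# Pre-scaling the input by 1/K makes the outputs approximate cos and sin.
x, y = cordic_rotate(round(ONE / K), 0, round(0.5 * ONE))
print(x / ONE, y / ONE)
```

In hardware each call to `stage` would be a separate register-bounded block, so a new operand can enter the first stage every clock cycle while earlier operands advance through the later stages.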
F. Differential CORDIC Algorithm
The D-CORDIC algorithm is equivalent to the usual CORDIC in
terms of accuracy as well as convergence, but it provides faster
and more efficient redundant number-based implementation of
both rotation mode and vectoring mode CORDIC. It introduces
some temporary variables corresponding to the CORDIC
variables $x$, $y$ and $\omega$, that are generically defined as

$$\hat{g}_{i+1} = \operatorname{sign}(g_i) \cdot g_{i+1} \tag{27}$$

which implies that $|\hat{g}_{i+1}| = |g_{i+1}|$ and
$\operatorname{sign}(g_{i+1}) = \operatorname{sign}(g_i) \cdot \operatorname{sign}(\hat{g}_{i+1})$. The signs of $\hat{g}_i$ are, therefore, considered as being
differentially encoded signs of $g_i$ in the differential CORDIC
algorithm [45]. The rotation and vectoring mode D-CORDIC
algorithms are outlined in Table III.
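The idea of differentially encoded signs can be illustrated with a few lines of Python. The sketch assumes that the differential variable is defined as $\hat{g}_{i+1} = \operatorname{sign}(g_i) \cdot g_{i+1}$ and that the first element is passed through unchanged; the function names are hypothetical, and the sequence is assumed to contain no zeros.

```python
def sign(v):
    return 1 if v >= 0 else -1

def differential_encode(g):
    """ghat[i+1] = sign(g[i]) * g[i+1]; ghat[0] is taken as g[0] (assumed start)."""
    return [g[0]] + [sign(g[i]) * g[i + 1] for i in range(len(g) - 1)]

def recover_signs(ghat):
    """Rebuild sign(g[i]) by accumulating the signs of the differential values."""
    signs = [sign(ghat[0])]
    for v in ghat[1:]:
        signs.append(signs[-1] * sign(v))
    return signs

g = [0.7, -0.2, 0.1, -0.05, -0.01, 0.004]
ghat = differential_encode(g)
print(ghat)
print(recover_signs(ghat))
```

The magnitudes are unchanged by the encoding, and the original signs are recovered as a running product of the encoded signs.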
D-CORDIC algorithm is suitable for efﬁcient pipelined
implementation which is utilized by Ercegovac and Lang [34]
using on-line arithmetic based on redundant number system.
Since the output data in the redundant on-line arithmetic can be
available in the most-signiﬁcant-digit-ﬁrst (MSD-ﬁrst) fashion,
the successive iterations could be implemented by a set of
cascaded stages, where processing time between the successive
stages is overlapped with a single-digit time-skew, that results
in a signiﬁcant reduction in overall latency of computation.
Moreover, in some redundant number representations, the
absolute values and sign of the output are easily determined,
e.g., in binary signed-digit (BSD) representation, the sign of a
number corresponds to the sign of the ﬁrst nonzero MSD, and
negation of the number can be performed just by ﬂipping signs
of nonzero digits. A two-dimensional systolic D-CORDIC
architecture is derived in [46] where phase accumulation is per-
formed for direct digital frequency synthesis in the digit-level
pipelining framework.
IV. SCALING, QUANTIZATION AND ACCURACY ISSUES
As discussed in Section II-A, scaling is a necessary opera-
tion associated with the implementation of CORDIC algorithm.
Scaling in CORDIC could be of two types: 1) constant factor
scaling and 2) variable factor scaling. In case of variable factor
scaling the scale-factor changes with the rotation angle. It arises
mainly because some of the iterations of conventional CORDIC
are ignored (and that varies with the angle of rotation), as in
the case of higher-radix CORDIC and most of the optimized
CORDIC algorithms. The techniques for scaling compensation
for each of these algorithms have been studied extensively for
minimizing the scaling overhead. In case of conventional CORDIC,
as given by (8), after a sufficiently large number of iterations, the
scale-factor $K$ converges to approximately 1.6467603, which leads to
constant factor scaling since the scale factor remains the same for
all angles of rotation. Constant factor scaling could be
efficiently implemented in a dedicated scaling unit designed by the
canonical signed digit (CSD)-based technique [47] and the common
sub-expression elimination (CSE) approach [48], [49]. When
the sum of the outputs of more than one independent CORDIC
operation is to be evaluated, one can perform only one scaling
of the output sum [50] in the case of constant factor scaling. In
the following subsections, we briefly discuss some interesting
developments on implementation of on-line scaling and
realization of scaling-free CORDIC. Besides, we outline here the
sources of error that may arise in a CORDIC design and their
impact on implementation.
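As a rough illustration of CSD-based constant scaling, the following Python sketch recodes the constant $1/K$ into canonical signed digits and applies it with shifts and add/subtract operations only. The 16-bit fraction and the helper names `to_csd` and `scale_shift_add` are assumptions for this example; it is not the scaling unit of [47].

```python
import math

def to_csd(n):
    """Canonical signed-digit form of a non-negative integer, LSB first."""
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)      # +1 when n % 4 == 1, -1 when n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

def scale_shift_add(x, digits, frac_bits):
    """Multiply x by the constant encoded in digits using shifts and add/subtract only."""
    acc = 0
    for j, d in enumerate(digits):
        if d > 0:
            acc += x << j
        elif d < 0:
            acc -= x << j
    return acc >> frac_bits

FRAC_BITS = 16
K = math.prod(math.sqrt(1 + 4.0 ** -i) for i in range(32))
INV_K = round((1 << FRAC_BITS) / K)      # 1/K as a 16-bit fraction
DIGITS = to_csd(INV_K)

print(K, DIGITS)
print(scale_shift_add(12345, DIGITS, FRAC_BITS))
```

Because no two adjacent CSD digits are nonzero, the number of add/subtract operations never exceeds that of the plain binary form of the same constant.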
A. Implementation of Mixed-Scaling-Rotation
Dewilde et al. [51] have suggested the on-line scaling where
shift-add operations for scaling and micro-rotations are in-
terleaved in the same circuit. This approach has been used in
[52] and improved further in [53]. In the mixed-scaling-rotation
(MSR) approach, pioneered by Wu et al. [54]–[56], the
micro-rotation and scaling phases are merged into a unified
vector rotational model to minimize the overhead of the scaling
operation. The MSR-CORDIC can be applied to
DSP applications, in which the rotation angles are usually
known a priori, e.g., the twiddle factor in fast Fourier transform
(FFT) and kernel components in other sinusoidal transforms.
It is shown in [55] that the MSR technique can signiﬁcantly
reduce the total iteration count so as to improve the speed
performance and enhance the signal-to-quantization-noise
ratio (SQNR) performance by controlling the internal dy-
namic range. The MSR-CORDIC scheme has been applied
to a variable-length FFT processor design [29], and found to
result in signiﬁcant hardware reduction in the implementation
of twiddle-factor multiplications. Although the interleaved
scaling and MSR-CORDIC provide hardware reduction, they
also lead to a reduction of throughput. For high-throughput
implementation, one should implement the micro-rotations and
scaling in two separate pipelined stages.
B. Low-Complexity Scaling
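When the elementary angles pertaining to a rotation are
“sufficiently small”, defined by $\sin\alpha_i \cong \alpha_i = 2^{-i}$, and the
rotations are only in one direction, the CORDIC rotation is given by
the representation [57]

$$\begin{bmatrix} x_{i+1} \\ y_{i+1} \end{bmatrix} = \begin{bmatrix} 1 - 2^{-(2i+1)} & 2^{-i} \\ -2^{-i} & 1 - 2^{-(2i+1)} \end{bmatrix} \begin{bmatrix} x_i \\ y_i \end{bmatrix} \tag{28}$$

and $\omega_{i+1} = \omega_i - 2^{-i}$ (considering clockwise micro-rotations
only), where $x_i$ and $y_i$ are the components of the vector
after the $i$th micro-rotation, $L$ is the input wordlength and
$i \geq \lfloor (L - 2.585)/3 \rfloor$. The formulation of (28) performs the
“actual” rotation where the norm of the vector is preserved at
every micro-rotation.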
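A small numerical experiment shows how closely such a micro-rotation preserves the norm. The Python sketch below assumes the approximations $\sin\alpha_i \approx 2^{-i}$ and $\cos\alpha_i \approx 1 - 2^{-(2i+1)}$ for a clockwise rotation; floating-point multiplications stand in for the shifts, and the function names and the index range 4 to 15 are chosen for illustration only.

```python
import math

def scaling_free_step(x, y, i):
    """One clockwise micro-rotation by roughly 2**-i rad; no scale-factor correction."""
    c = 1.0 - 2.0 ** -(2 * i + 1)    # cos(a) ~ 1 - a**2 / 2, a shift-and-subtract in hardware
    s = 2.0 ** -i                    # sin(a) ~ a, a plain right shift
    return c * x + s * y, -s * x + c * y

def rotate_scaling_free(x, y, first, last):
    """Apply the micro-rotations with indices first..last, all in one direction."""
    for i in range(first, last + 1):
        x, y = scaling_free_step(x, y, i)
    return x, y

x, y = rotate_scaling_free(1.0, 0.0, 4, 15)
print(math.hypot(x, y), math.atan2(-y, x))
```

The printed norm stays within a few parts per million of unity, and the accumulated clockwise angle is close to the sum of the powers of two, which is also why the usable angular range of this formulation is small.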
However, the problem with this formulation is that the overall
range of angles for which it can be used is very small, because,
for 16-bit wordlength, the largest such angle is $2^{-4}$ rad
(about $3.58^\circ$), which obviously is quite small compared to the entire
coordinate space. To overcome this problem, argument reduction
is performed through “domain folding” [58] by mapping
the target rotation-angles into the range $[0, \pi/8]$. Besides, the
elementary rotations are carried out in an adaptive manner to
enhance the rate of convergence so as to force the approximation
error of the final angle below a specified limit [59]. But the
domain-folding, in some cases, involves a rotation through $\pi/4$,
which demands a scaling by a factor of $1/\sqrt{2}$. Besides, the target
range $[0, \pi/8]$ is still much larger than the range of convergence
of the scaling-free realization. The formulation of (28),
therefore, could be effectively used when a rotation through $\pi/4$
is not required and the angles of rotation could be folded into
a suitably small range. Generalized algorithms, and their
corresponding architectures to perform the scale-factor compensation
in parallel with the CORDIC iterations, for both rotation
and vectoring modes are proposed in [60], where the compensation
overhead is reduced to a couple of iterations. It is shown
in [61] that since the scale-factor is known in advance, one can
perform the minimal recoding of the bits of the scaling-factor, and
implement the multiplication thereafter by a Wallace tree. It is
a good solution for low-latency scaling, particularly for pipelined
CORDIC architectures.
C. Quantization and Numerical Accuracy
Errors in CORDIC are mainly of two types: 1) the angle
approximation error, which originates from quantization of the
rotation angle represented by a linear combination of a finite number
of elementary angles, and 2) the finite wordlength of the datapath,
resulting in the rounding/truncation of the output that increases
cumulatively through the successive iterations of micro-rotations.
A third source of error that also comes into the picture results
from the scaling of pseudo-rotated outputs. The scaling error is,
however, also due to the use of finite wordlength in the scaling
circuitry and is predominantly a rounding/truncation error. A
detailed discussion on rounding error due to fixed and floating
point implementations is available in [62]. In his earlier work,
Walther [3] concluded that the errors in the CORDIC output
are bounded, and that extra bits are required in the datapaths
to take care of the errors. Hu [62] has provided more precise
error bounds due to the angle approximation error for different
CORDIC modes for fixed-point as well as floating-point
implementations. The error bound resulting from fixed-point
representation of arctangents is further analyzed by Kota and
Cavallaro [63] and its impact on practical implementation has been
discussed.
D. Area-Delay-Accuracy Trade-Off
Area, accuracy and latency of the CORDIC algorithm depend
mainly on the iteration count and its implementation. To
achieve $n$-bit accuracy, if fixed-point arithmetic is applied, the
wordlengths required for the $x$ and $y$ data-paths and for the
computation of the angle are given in [45], [63]. The hardware
requirement therefore increases accordingly with the desired
accuracy. Floating-point implementation
naturally gives higher accuracy than its fixed-point counterpart,
but at the cost of more complex hardware. To minimize
the angle approximation error, the smallest elementary angle
needs to be as small as possible [62]. This consequently
demands more right-shifts and more hardware for
the barrel-shifters and adders. Besides, to have a better angle
approximation, more iterations are required, which
increases the latency. The additional accuracy resulting from
floating-point implementation or better angle approximation
may not, however, be necessary in many applications. Thus,
there is a need for a trade-off between hardware cost, latency
and numerical accuracy subject to a particular application.
Therefore, the designer has to check how much numerical
accuracy is needed along with the area and speed constraints for
the particular application, and can accordingly decide on fixed-
or floating-point implementation and set the wordlength
and optimal number of iterations.
TABLE IV
COMPUTATIONS USING CORDIC ALGORITHM IN DIFFERENT CONFIGURATIONS
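The interplay between iteration count, wordlength and accuracy can be explored with a simple fixed-point model. The Python sketch below is an assumed test harness, not a result from the cited works: `cordic_sin_cos` and `max_error` are hypothetical names, and the error is measured against the floating-point sine and cosine over a small grid of angles.

```python
import math

def cordic_sin_cos(theta, n_iter, frac_bits):
    """Fixed-point RM CORDIC; returns (cos, sin) as floats for |theta| below about 1.7 rad."""
    one = 1 << frac_bits
    k = math.prod(math.sqrt(1 + 4.0 ** -i) for i in range(n_iter))
    x, y, w = round(one / k), 0, round(theta * one)
    for i in range(n_iter):
        alpha = round(math.atan(2.0 ** -i) * one)
        if w >= 0:
            x, y, w = x - (y >> i), y + (x >> i), w - alpha
        else:
            x, y, w = x + (y >> i), y - (x >> i), w + alpha
    return x / one, y / one

def max_error(n_iter, frac_bits):
    """Largest absolute error of cos or sin over angles from -1.5 to 1.5 rad."""
    angles = [0.1 * k for k in range(-15, 16)]
    return max(
        max(abs(c - math.cos(a)), abs(s - math.sin(a)))
        for a in angles
        for c, s in [cordic_sin_cos(a, n_iter, frac_bits)]
    )

for n_iter, frac_bits in [(8, 20), (16, 20), (16, 10)]:
    print(n_iter, frac_bits, max_error(n_iter, frac_bits))
```

Increasing the iteration count reduces the angle approximation error, while shortening the wordlength makes the rounding error dominate, which is the trade-off a designer has to balance for a given application.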
V. APPLICATIONS OF CORDIC
CORDIC technique is basically applied for rotation of a
vector in circular, hyperbolic or linear coordinate systems,
which in turn could also be used for generation of sinusoidal
waveform, multiplication and division operations, and evaluation
of angle of rotation, trigonometric functions, logarithms,
exponentials and square root [6], [64], [65]. Table IV shows
some elementary functions and operations that can be directly
implemented by CORDIC. The table also indicates whether
the coordinate system is circular (CC), linear (LC), or hyper-
bolic (HC), and whether the CORDIC operates in rotation
mode (RM) or vectoring mode (VM), the initialization of the
CORDIC and the necessary pre- or postprocessing step to
perform the operation. The scale factors are, however, omitted
in Table IV for simplicity of presentation. In this Section, we
discuss how CORDIC is used for some basic matrix problems
like QR decomposition and singular-value decomposition.
Moreover, we make a brief presentation on the applications of
CORDIC to signal and image processing, digital communica-
tion, robotics and 3-D graphics.
A. Matrix Computation
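1) QR Decomposition: QR decomposition of a matrix can
be performed through Givens rotation [66] that selectively
introduces zeros into the matrix. Givens rotation is an orthogonal
transformation of the form

$$\begin{bmatrix} c & s \\ -s & c \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} \sqrt{a^2 + b^2} \\ 0 \end{bmatrix} \tag{29}$$

where $c = a/\sqrt{a^2 + b^2}$ and
$s = b/\sqrt{a^2 + b^2}$. The QR decomposition requires two types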
of iterative operations to obtain an upper-triangular matrix
using orthogonal transformations. Those are: (i) to calculate
the Givens rotation angle, and (ii) to apply the calculated angle
of rotation to the rest of the rows. Circular coordinate CORDIC
is a good choice to implement both these Givens rotations,
where the ﬁrst operation is performed by a VM CORDIC
and the second one is performed by an RM CORDIC. The
CORDIC-based QR decomposition can be implemented in
VLSI with suitable area-time trade-off using a systolic trian-
gular array, a linear array or a single CORDIC processor that is
reconﬁgurable for rotation and vectoring modes of operations.
A detailed explanation of these architectures is available in [64],
[67].
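The two operations can be sketched with a floating-point model of a circular CORDIC that works in either mode. In the Python sketch below, the function names, the 24 iterations and the convention that the leading element of the upper row is positive are assumptions for illustration; the vectoring pass returns the Givens angle and the rotation pass applies it to the remaining columns.

```python
import math

N = 24
ANGLES = [math.atan(2.0 ** -i) for i in range(N)]
K = math.prod(math.sqrt(1 + 4.0 ** -i) for i in range(N))

def cordic(x, y, w, vectoring):
    """Circular CORDIC: VM drives y to zero, RM drives w to zero; outputs are scaled by 1/K."""
    for i in range(N):
        if vectoring:
            s = -1 if y >= 0 else 1
        else:
            s = 1 if w >= 0 else -1
        x, y, w = x - s * y * 2.0 ** -i, y + s * x * 2.0 ** -i, w - s * ANGLES[i]
    return x / K, y / K, w

def givens_step(row_p, row_q):
    """Zero row_q[0] against row_p[0] (assumed positive) and rotate the other columns."""
    r, _, theta = cordic(row_p[0], row_q[0], 0.0, vectoring=True)
    new_p, new_q = [r], [0.0]
    for xp, xq in zip(row_p[1:], row_q[1:]):
        u, v, _ = cordic(xp, xq, -theta, vectoring=False)
        new_p.append(u)
        new_q.append(v)
    return new_p, new_q

print(givens_step([3.0, 1.0, 2.0], [4.0, 2.0, -1.0]))
```

Repeating `givens_step` over the sub-diagonal entries, column by column, yields the upper-triangular factor; a systolic array performs the same steps concurrently on different row pairs.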
2) Singular Value Decomposition and Eigenvalue Estimation:
Singular value decomposition of a matrix $M$ is given by
$M = U \Sigma V^T$, where $U$ and $V$ are orthogonal matrices and
$\Sigma$ is a diagonal matrix of singular values. For CORDIC-based
implementation of SVD, it is decomposed into $2 \times 2$ SVD problems,
and solved iteratively. To solve each $2 \times 2$ SVD problem,
two-sided Givens rotation is applied to each of the $2 \times 2$ matrices
to nullify the off-diagonal elements, as described in the
following:

$$\begin{bmatrix} \cos\theta_l & \sin\theta_l \\ -\sin\theta_l & \cos\theta_l \end{bmatrix}^T \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} \cos\theta_r & \sin\theta_r \\ -\sin\theta_r & \cos\theta_r \end{bmatrix} = \begin{bmatrix} \psi_1 & 0 \\ 0 & \psi_2 \end{bmatrix} \tag{30}$$

where $M = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$ is a $2 \times 2$ input matrix to be decomposed; and $\theta_l$ and
$\theta_r$ are, respectively, the left and right rotation angles, calculated
from the elements of $M$ using the following two relations:

$$\theta_r + \theta_l = \tan^{-1}\!\left(\frac{c + b}{d - a}\right), \qquad \theta_r - \theta_l = \tan^{-1}\!\left(\frac{c - b}{d + a}\right) \tag{31}$$

CORDIC-based architectures for SVD using this method
were developed by Cavallaro and Luk [68]. A simplified design
of array processor for the particular case (i.e., $\theta_l = \theta_r$)
was developed further by Delosme [69] for the symmetric
eigenvalue problem. In a relatively recent paper [70], Liu et al.
have proposed an application-speciﬁc instruction set processor
(ASIP) for the real-time implementation of QR decomposition
and SVD where circular coordinate CORDIC is used for efﬁ-
cient implementation of both these functions.
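The two-sided rotation can be checked numerically for a single $2 \times 2$ block. The Python sketch below assumes the angle relations in the form $\theta_r + \theta_l = \tan^{-1}((c+b)/(d-a))$ and $\theta_r - \theta_l = \tan^{-1}((c-b)/(d+a))$ for $M = [[a, b], [c, d]]$; library trigonometric functions stand in for the vectoring-mode and rotation-mode CORDIC units, and all function names are hypothetical.

```python
import math

def rot(t):
    """Rotation matrix [[cos t, sin t], [-sin t, cos t]]."""
    c, s = math.cos(t), math.sin(t)
    return [[c, s], [-s, c]]

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(p):
    return [[p[0][0], p[1][0]], [p[0][1], p[1][1]]]

def svd2x2_angles(m):
    """Left and right angles from the sum and difference relations."""
    (a, b), (c, d) = m
    total = math.atan2(c + b, d - a)     # theta_r + theta_l
    diff = math.atan2(c - b, d + a)      # theta_r - theta_l
    return (total - diff) / 2, (total + diff) / 2

def diagonalize(m):
    """Apply R(theta_l)^T * M * R(theta_r)."""
    tl, tr = svd2x2_angles(m)
    return matmul(matmul(transpose(rot(tl)), m), rot(tr))

print(diagonalize([[3.0, 1.0], [-2.0, 4.0]]))
```

The off-diagonal entries of the result vanish to within rounding, and the magnitudes of the diagonal entries are the singular values of the block.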
B. Signal Processing and Image Processing Applications
CORDIC techniques have a wide range of DSP applications
including ﬁxed/adaptive ﬁltering [8], and the computation of
discrete sinusoidal transforms such as the DFT [50], [52], [71],
[72], discrete Hartley transform (DHT) [53], [73], [74], dis-
crete cosine transform (DCT) [75]–[78], discrete sine transform
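(DST) [76]–[78] and chirp $z$-transform (CZT) [79]. The DFT,
DHT, and DCT [80] of an $N$-point input sequence $x(n)$, for
$n = 0, 1, \ldots, N-1$, in general, are given by

$$X(k) = \sum_{n=0}^{N-1} C(k, n)\, x(n) \quad \text{for } k = 0, 1, \ldots, N-1 \tag{32}$$

where the transform kernel matrix is defined as

$$C(k, n) = \begin{cases} e^{-j 2\pi k n / N} & \text{for DFT} \\ \cos(2\pi k n / N) + \sin(2\pi k n / N) & \text{for DHT} \\ \cos\big(\pi (2n+1) k / (2N)\big) & \text{for DCT} \end{cases}$$

The input sequence for the DFT is, in general, complex, and
the computation of (32) can be partitioned into blocks of the
form $x_r \cos\phi + x_i \sin\phi$ and $x_i \cos\phi - x_r \sin\phi$, where $x_r$ and $x_i$
are the real and imaginary parts of an input sample,
which is in the same form as the output of RM-CORDIC,
for $\phi = 2\pi k n / N$. In the case of the DHT, the computation
can similarly be transformed into computations of the same
rotation form, to be implemented
efficiently by RM-CORDIC units. These features
of DFT and DHT are used to design parallel and pipelined
architectures for the computation of these two transforms [50],
[52], [53], [71]–[74]. It is shown in [76], [77] that, by simple
input-output modification, one can transform the DCT and
DST kernels into the DHT form to compute them by rotation
mode CORDIC. Similarly in [79], CZT is represented by a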
DFT-like kernel by simple pre-processing and post-processing
operations, and implemented through CORDIC rotations. The
CORDIC technique has also been used in many image pro-
cessing operations like spatial domain image enhancement for
contrast stretching, logarithmic transformation and power-law
transformation, image rotation, and Hough transform for line
detection [81], [82]. CORDIC implementation of some of these
applications is discussed in [83], [84]. Several other signal
processing applications are discussed in detail in [64], which
we do not intend to repeat here.
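The correspondence between a DFT term and a plane rotation can be made concrete with a short Python sketch. Here `rotate` is a hypothetical stand-in for an RM-CORDIC unit and uses library sine and cosine instead of shift-add iterations; the point is only that every product $x(n) e^{-j 2\pi k n / N}$ is a rotation of the vector formed by the real and imaginary parts of $x(n)$.

```python
import cmath
import math

def rotate(x, y, phi):
    """Stand-in for an RM-CORDIC unit: returns (x cos phi + y sin phi, y cos phi - x sin phi)."""
    c, s = math.cos(phi), math.sin(phi)
    return x * c + y * s, y * c - x * s

def dft_by_rotation(seq):
    """N-point DFT accumulated from one plane rotation per input sample and output bin."""
    n_pts = len(seq)
    out = []
    for k in range(n_pts):
        re = im = 0.0
        for n, v in enumerate(seq):
            r, i = rotate(v.real, v.imag, 2 * math.pi * k * n / n_pts)
            re += r
            im += i
        out.append(complex(re, im))
    return out

samples = [complex(1, 2), complex(-0.5, 0.25), complex(3, -1), complex(0, 0.75), complex(2, 2)]
print(dft_by_rotation(samples))
```

Since the rotation angles depend only on the indices and not on the data, they are known in advance, which is what makes constant-angle CORDIC optimizations applicable to such transforms.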
Fig. 8. CORDIC-based direct digital synthesizer.
Fig. 9. A generic scheme to use RM CORDIC for digital modulation, showing the
in-phase and quadrature signals to be modulated.
C. Applications to Communication
CORDIC algorithm can be used for efﬁcient implementa-
tion of various functional modules in a digital communication
system [85]. Most applications of CORDIC in communications
use the circular coordinate system in one or both CORDIC op-
erating modes. The RM-CORDIC is mainly used to generate
mixed signals, while the VM-CORDIC is mainly used to esti-
mate phase and frequency parameters. We brieﬂy outline here
some of the important communication applications.
1) Direct Digital Synthesis: Direct digital synthesis is the
process of generating sinusoidal waveforms directly in the
digital domain. A direct digital synthesizer (DDS) (as shown in
Fig. 8) consists of a phase accumulator and a phase-to-waveform
converter [86], [87]. The phase-generation circuit increments
the phase by $2\pi f_c$ in every cycle, where $f_c$ is the normalized
carrier frequency, and feeds the phase information
to the phase-to-waveform converter. The phase-to-waveform
converter could be realized by an RM-CORDIC [88], [89],
as shown in Fig. 8. The cosine and sine waveforms are obtained,
respectively, by the CORDIC outputs $x_n$ and $y_n$.
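A behavioural model of such a synthesizer is sketched below in Python. It assumes a 24-bit phase accumulator, a normalized frequency given in cycles per sample, and a floating-point rotation-mode CORDIC with 20 iterations as the phase-to-waveform converter; these parameters and the function names are illustrative choices, not those of the cited designs.

```python
import math

N = 20
ANGLES = [math.atan(2.0 ** -i) for i in range(N)]
K = math.prod(math.sqrt(1 + 4.0 ** -i) for i in range(N))

def cordic_cos_sin(theta):
    """RM CORDIC for theta in [-pi, pi]; folds the angle into [-pi/2, pi/2] first."""
    flip = 1.0
    if theta > math.pi / 2:
        theta, flip = theta - math.pi, -1.0
    elif theta < -math.pi / 2:
        theta, flip = theta + math.pi, -1.0
    x, y, w = 1.0 / K, 0.0, theta
    for i in range(N):
        s = 1 if w >= 0 else -1
        x, y, w = x - s * y * 2.0 ** -i, y + s * x * 2.0 ** -i, w - s * ANGLES[i]
    return flip * x, flip * y

def dds(fc, n_samples, acc_bits=24):
    """Phase accumulator stepping by fc (cycles per sample) feeding the converter."""
    step = round(fc * (1 << acc_bits))
    mask = (1 << acc_bits) - 1
    acc, out = 0, []
    for _ in range(n_samples):
        phase = 2 * math.pi * acc / (1 << acc_bits)     # in [0, 2*pi)
        if phase > math.pi:
            phase -= 2 * math.pi
        out.append(cordic_cos_sin(phase))
        acc = (acc + step) & mask                        # wrap-around models phase modulo 2*pi
    return out

for c, s in dds(1 / 16, 4):
    print(round(c, 6), round(s, 6))
```

The wrap-around of the accumulator implements the modulo-$2\pi$ phase, and the frequency resolution is set by the accumulator width rather than by the CORDIC unit.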
2) Analog and Digital Modulation: A generic scheme to
use CORDIC in RM for digital modulation is shown in Fig. 9,
where the phase-generation unit of Fig. 8 is changed to
generate the phase from the normalized carrier frequency $f_c$,
the normalized modulating frequency $f_m$, and the phase $\phi$
of the modulating component. By suitable selection of the
parameters $f_c$, $f_m$ and $\phi$ and the CORDIC inputs $x_0$ and $y_0$,
the generic scheme of Fig. 9 could be used for digital
realization of analog amplitude
modulation (AM), phase modulation (PM), and frequency
modulation (FM), as well as the digital modulations, e.g., am-
plitude shift keying (ASK), phase-shift keying (PSK), and fre-
quency-shift keying (FSK) modulators. It could also be used
for the up/down converters for quadrature-amplitude modula-
tors (QAM) and full mixers for complex signals or phase and
frequency corrector circuits for synchronization [85].
3) Other Communication Applications: By operating the
CORDIC in vectoring mode, one can compute the magnitude
and the angle of an input vector. The magnitude computation
can be used for envelope-detection in an AM receiver or to
detect FSK signal if it is placed after mark or space ﬁlters [90].
The angle computation in VM CORDIC, on the other hand, can
be used to detect FM and FSK signals and to estimate phase
and frequency parameters [91]. A single VM-CORDIC can be
used to perform these computations for the implementation of
a slicer for a high-order constellation like the 32-APSK used in
DVB-S2.
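The magnitude and angle computation of the vectoring mode can be sketched as follows. The Python model assumes a floating-point circular CORDIC with 24 iterations and an initial half-plane correction so that any input quadrant is handled; the function name `cordic_polar` and its arguments are hypothetical.

```python
import math

N = 24
ANGLES = [math.atan(2.0 ** -i) for i in range(N)]
K = math.prod(math.sqrt(1 + 4.0 ** -i) for i in range(N))

def cordic_polar(i_s, q_s):
    """VM CORDIC: return (magnitude, angle) of the vector (i_s, q_s)."""
    x, y, w = i_s, q_s, 0.0
    if x < 0:                          # pre-rotate by pi into the right half-plane
        x, y = -x, -y
        w = math.pi if q_s >= 0 else -math.pi
    for i in range(N):
        s = -1 if y >= 0 else 1        # always rotate towards the x-axis
        x, y, w = x - s * y * 2.0 ** -i, y + s * x * 2.0 ** -i, w - s * ANGLES[i]
    return x / K, w

for point in [(1.0, 1.0), (-2.0, 0.5), (-1.0, -3.0), (0.3, -0.7)]:
    print(point, cordic_polar(*point))
```

The magnitude output corresponds to envelope detection, while differences between successive angle outputs give an estimate of the instantaneous frequency.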
CORDIC circuits operating in both modes are also required
in digital receivers for the synchronization stage to perform a
phase or frequency estimation followed by a correction stage.
This can be done by using two different CORDIC units, to meet
the high speed requirement in Costas loop for phase recovery
in a QAM modulation [92], [93]. On the other hand the burst-
based communication system that needs a preamble for syn-
chronization purposes, e.g., in case of IEEE 802.11a WLAN-
OFDM receivers, can use a single CORDIC unit conﬁgurable
for both operating modes since the estimation and correction
are not performed simultaneously [94], [95]. Apart from these,
the CORDIC-based QR decomposition has been used in multi-
input-multi-output (MIMO) systems to implement V-BLAST
detectors [96]–[98], and to implement a recursive-least-square
(RLS) adaptive antenna beamformer [67], [99], [100].
D. Applications of CORDIC to Robotics and Graphics
Two of the key problems where CORDIC provides area- and
power-efficient solutions are: 1) direct kinematics and 2) inverse
kinematics of serial robot manipulators. How CORDIC is
applied in these applications is discussed below.
1) Direct Kinematics Solution (DKS) for Serial Robot
Manipulators: A robot manipulator consists of a sequence of links,
connected typically by either revolute or prismatic joints. For an
$n$-degrees-of-freedom manipulator, there are $n$ joint-link pairs,
with link 0 being the supporting base and the last link being attached
to a tool. The joints and links are numbered outwardly from
the base. The coordinates of the points on the $i$th link
change successively, for $i = 1, 2, \ldots, n$,
due to successive rotations and translations of the links. The
translation operations are realized by simple additions of
coordinate values, while the new coordinates of any point due to
rotation are computed by RM-CORDIC circuits.
2) Inverse Kinematics for Robot Manipulators: The inverse
kinematics problem involves determination of joint variables for
a desired position and orientation of the tool. The CORDIC
approach is valuable to find the inverse kinematic solution when
a closed-form solution is possible (when, in particular, the
desired tool tip position is within the robot's work envelope and
when joint angle limits are not violated). The authors in [101]
present a maximally pipelined CORDIC-based architecture for
efficient computation of the inverse kinematics solution. It is
also shown [101], [102] that up to 25 CORDIC processors are
required for the computation of the entire inverse kinematics
solution for a six-link PUMA-type robotic arm. Apart from
implementation of rotation operations, CORDIC is used in the
evaluation of trigonometric functions and square root expressions
involved in the inverse kinematics problems [103].
3) CORDIC for Other Robotics Applications: CORDIC has
also been applied to robot control [104], [105], where CORDIC
circuits serve as the functional units of a programmable CPU
co-processor. Another application of CORDIC is for kinematics
of redundant manipulators [106]. It is shown in [106] that the
case of inverse kinematics can be implemented efﬁciently in
parallel by computing pseudo-inverse through singular value
decomposition. Collision detection is another area where
CORDIC has been applied to robotics [107]. A CORDIC-based
highly parallel solution for collision detection between a robot
manipulator and multiple obstacles in the workspace is sug-
gested in [107]. The collision detection problem is formulated
as one that involves a number of coordinate transformations.
CORDIC-based processing elements are used to efﬁciently
perform the coordinate transformations by shift-add operations.
4) CORDIC for 3-D Graphics: The processing in graphics
such as 3-D vector rotation, lighting and vector interpolation is
computation-intensive and geometric in nature. CORDIC
architecture is therefore a natural candidate for cost-effective
implementation of these geometric computations in graphics.
A systematic formulation to represent 3D computer graphics
operations in terms of CORDIC-type primitives is provided in
[108]. An efﬁcient stream processor based on CORDIC-type
modules to implement the graphic operations is also suggested
in [108]. 3-D vector interpolation is also an important function
in graphics which is required for good-quality shading [109] for
graphic rendering. It is shown that the variable-precision capa-
bility of CORDIC engine could be utilized to realize a power-
aware implementation of the 3-D vector interpolator [110].
VI. CONCLUSION
The beauty of CORDIC is its potential for uniﬁed solution
for a large set of computational tasks involving the evaluation
of trigonometric and transcendental functions, calculation of
multiplication, division, square-root and logarithm, solution of
linear systems, QR-decomposition, and SVD, etc. Moreover,
CORDIC is implemented by simple hardware through
repeated shift-add operations. These features of CORDIC have
made it an attractive choice for a wide variety of applications.
In the last ﬁfty years, several algorithms and architectures
have been developed to speed up the CORDIC by reducing
its iteration counts and through its pipelined implementation.
Moreover, its applications in several diverse areas including
signal processing, image processing, communication, robotics
and graphics apart from general scientiﬁc and technical compu-
tations have been explored. Latency of computation, however,
continues to be the major drawback of the CORDIC algorithm,
since we do not have efﬁcient algorithms for its parallel im-
plementation. But, CORDIC on the other hand is inherently
suitable for pipelined designs, due to its iterative behavior, and
small cycle time compared with the conventional arithmetic.
For high-throughput applications, efﬁcient pipelined-archi-
tectures with multiple-CORDIC units could be developed to
take the advantage of pipelineability of CORDIC, because the
digital hardware is getting cheaper along with the progressive
device-scaling. Research on fast implementation of the
shift-accumulation operation, exploration of new number systems
for CORDIC, and optimization of CORDIC for constant rotation
have scope for further reduction of its latency. Another way
to use CORDIC efﬁciently, is to transform the computational
algorithm into independent segments, and to implement the
individual segments by different CORDIC processors. With
enhancement of its throughput and reduction of latency, it is
expected that CORDIC would be useful for many high-speed
and real-time applications. The area-delay-accuracy trade-off
for different advanced algorithms may be investigated in detail
and compared in future work.
REFERENCES
[1] J. E. Volder, “The CORDIC trigonometric computing technique,” IRE
Trans. Electron. Computers, vol. EC-8, pp. 330–334, Sept. 1959.
[2] J. E. Volder, “The birth of CORDIC,” J. VLSI Signal Process., vol. 25,
pp. 101–105, 2000.
[3] J. S. Walther, “A uniﬁed algorithm for elementary functions,” in
Proc. 38th Spring Joint Computer Conf., Atlantic City, NJ, 1971, pp.
379–385.
[4] J. S. Walther, “The story of uniﬁed CORDIC,” J. VLSI Signal Process.,
vol. 25, no. 2, pp. 107–112, June 2000.
[5] D. S. Cochran, “Algorithms and accuracy in the HP-35,” Hewlett-
Packard J., pp. 1–11, Jun. 1972.
[6] J.-M. Muller, Elementary Functions: Algorithms and Implementa-
tion. Boston, MA: Birkhauser Boston, 2006.
[7] S.-F. Hsiao and J.-M. Delosme, “Householder CORDIC algorithms,”
IEEE Trans. Computers, vol. 44, no. 8, pp. 990–1001, Aug. 1995.
[8] E. Antelo, J. Villalba, and E. L. Zapata, “A low-latency pipelined 2D
and 3D CORDIC processors,” IEEE Trans. Computers, vol. 57, no. 3,
pp. 404–417, Mar. 2008.
[9] D. Timmermann, H. Hahn, and B. J. Hosticka, “Low latency time
CORDIC algorithms,” IEEE Trans. Computers, vol. 41, no. 8, pp.
1010–1015, Aug. 1992.
[10] S. Wang, V. Piuri, and E. E. Swartzlander, Jr., “Hybrid CORDIC
algorithms,” IEEE Trans. Computers, vol. 46, no. 11, pp. 1202–1207, Nov.
1997.
[11] S. Wang and E. E. Swartzlander, Jr., “Merged CORDIC algorithm,”
in IEEE Int. Symp. on Circuits Syst. (ISCAS’95), 1995, vol. 3, pp.
1988–1991.
[12] B. Gisuthan and T. Srikanthan, “Pipelining ﬂat CORDIC based
trigonometric function generators,” Microelectron. J., vol. 33, pp.
77–89, 2002.
[13] S. Suchitra, S. Sukthankar, T. Srikanthan, and C. T. Clarke, “Elimina-
tion of sign precomputation in ﬂat CORDIC,” in IEEE Int. Symp. on
Circuits Syst., ISCAS’05, May 2005, vol. 4, pp. 3319–3322.
[14] E. Deprettere, P. Dewilde, and R. Udo, “Pipelined CORDIC architec-
tures for fast VLSI ﬁltering and array processing,” in IEEE Int. Conf.
on Acoust., Speech, Signal Process., ICASSP’84, Mar. 1984, vol. 9, pp.
250–253.
[15] H. Kunemund, S. Soldner, S. Wohlleben, and T. Noll, “CORDIC pro-
cessor with carry save architecture,” in Proc. ESSCIRC 90, Sept. 1990,
pp. 193–196.
[16] E. Antelo, J. Villalba, J. D. Bruguera, and E. L. Zapata, “High
performance rotation architectures based on the radix-4 CORDIC algorithm,”
IEEE Trans. Computers, vol. 46, no. 8, pp. 855–870, Aug. 1997.
[17] P. R. Rao and I. Chakrabarti, “High-performance compensation
technique for the radix-4 CORDIC algorithm,” Proc. IEE Comput. and
Digital Techn., vol. 149, no. 5, pp. 219–228, Sep. 2002.
[18] E. Antelo, T. Lang, and J. D. Bruguera, “Very-high radix circular
CORDIC: Vectoring and uniﬁed rotation/vectoring,” IEEE Trans.
Computers, vol. 49, no. 7, pp. 727–739, July 2000.
[19] Y. H. Hu and S. Naganathan, “An angle recoding method for CORDIC
algorithm implementation,” IEEE Trans. Comput., vol. 42, no. 1, pp.
99–102, Jan. 1993.
[20] Y. H. Hu and H. H. M. Chern, “A novel implementation of CORDIC algorithm
using backward angle recoding (BAR),” IEEE Trans. Comput.,
vol. 45, no. 12, pp. 1370–1378, Dec. 1996.
[21] C.-S. Wu, A.-Y. Wu, and C.-H. Lin, “A high-performance/low-latency
vector rotational CORDIC architecture based on extended elementary
angle set and trellis-based searching schemes,” IEEE Trans. Circuits
Syst. II: Anal. Digital Signal Process., vol. 50, no. 9, pp. 589–601, Sep.
2003.
[22] T. K. Rodrigues and E. E. Swartzlander, “Adaptive CORDIC:
Using parallel angle recoding to accelerate CORDIC rotations,” in
40th Asilomar Conf. on Signals, Syst. and Computers, ACSSC’06,
Oct.–Nov. 2006, pp. 323–327.
[23] M. Kuhlmann and K. K. Parhi, “P-CORDIC: A precomputation based
rotation CORDIC algorithm,” EURASIP J. Appl. Signal Process., vol.
2002, no. 9, pp. 936–943, 2002.
[24] D. Fu and A. N. Willson, Jr., “A high-speed processor for digital sine/
cosine generation and angle rotation,” in Conf. Rec. 32nd Asilomar
Conf. on Signals, Syst. & Computers, Nov. 1998, vol. 1, pp. 177–181.
[25] C.-Y. Chen and W.-C. Liu, “Architecture for CORDIC algorithm real-
ization without ROM lookup tables,” in Proc. 2003 Int. Symp. on Cir-
cuits Syst., ISCAS’03, May 2003, vol. 4, pp. 544–547.
[26] D. Fu and A. N. Willson, Jr., “A two-stage angle-rotation architecture
and its error analysis for efﬁcient digital mixer implementation,” IEEE
Trans. Circuits Syst. I: Reg. Papers, vol. 53, no. 3, pp. 604–614, Mar.
2006.
[27] S. Ravichandran and V. Asari, “Implementation of unidirectional
CORDIC algorithm using precomputed rotation bits,” in 45th Midwest
Symp. on Circuits Syst., 2002. MWSCAS 2002, Aug. 2002, vol. 3, pp.
453–456.
[28] C.-Y. Chen and C.-Y. Lin, “High-resolution architecture for CORDIC
algorithm realization,” in Proc. Int. Conf. on Commun., Circuits Syst.,
ICCCS’06, June 2006, vol. 1, pp. 579–582.
[29] D. D. Hwang, D. Fu, and A. N. Willson, Jr., “A 400-MHz processor
for the conversion of rectangular to polar coordinates in 0.25-μm
CMOS,” IEEE J. Solid-State Circuits, vol. 38, no. 10, pp. 1771–1775,
Oct. 2003.
[30] S.-W. Lee, K.-S. Kwon, and I.-C. Park, “Pipelined cartesian-to-polar
coordinate conversion based on SRT division,” IEEE Trans. Circuits
Syst. II: Express Briefs, vol. 54, no. 8, pp. 680–684, Aug. 2007.
[31] S.-F. Hsiao, Y.-H. Hu, and T.-B. Juang, “A memory-efﬁcient and
high-speed sine/cosine generator based on parallel CORDIC rota-
tions,” IEEE Signal Process. Lett., vol. 11, no. 2, 2004.
[32] T.-B. Juang, S.-F. Hsiao, and M.-Y. Tsai, “Para-CORDIC: Parallel
CORDIC rotation algorithm,” IEEE Trans. Circuits Syst. I: Regular Papers,
vol. 51, no. 8, 2004.
[33] T.-B. Juang, “Area/delay efﬁcient recoding methods for parallel
CORDIC rotations,” in IEEE Asia Paciﬁc Conf. on Circuits Syst.,
APCCAS’06, Dec. 2006, pp. 1539–1542.
[34] M. D. Ercegovac and T. Lang, “Redundant and on-line CORDIC:
Application to matrix triangularization and SVD,” IEEE Trans. Com-
puters, vol. 39, no. 6, pp. 725–740, June 1990.
[35] N. Takagi, T. Asada, and S. Yajima, “Redundant CORDIC methods
with a constant scale factor for sine and cosine computation,” IEEE
Trans. Computers, vol. 40, no. 9, pp. 989–995, Sept. 1991.
[36] J. Duprat and J.-M. Muller, “The CORDIC algorithm: New results for
fast VLSI implementation,” IEEE Trans. Computers, vol. 42, no. 2, pp.
168–178, Feb. 1993.
[37] J.-A. Lee and T. Lang, “Constant-factor redundant CORDIC for angle
calculation and rotation,” IEEE Trans. Computers, vol. 41, no. 8, pp.
1016–1025, Aug. 1992.
[38] N. D. Hemkumar and J. R. Cavallaro, “Redundant and on-line
CORDIC for unitary transformations,” IEEE Trans. Computers, vol.
43, no. 8, pp. 941–954, Aug. 1994.
[39] J. Valls, M. Kuhlmann, and K. K. Parhi, “Evaluation of CORDIC algo-
rithms for FPGA design,” J. VLSI Signal Process. Syst., vol. 32, no. 3,
pp. 207–222, Nov. 2002.
[40] D. E. Metafas and C. E. Goutis, “A ﬂoating point pipeline CORDIC
processor with extended operation set,” in IEEE Int. Symp. on Circuits
Syst., ISCAS’91, June 1991, vol. 5, pp. 3066–3069.
[41] Z. Feng and P. Kornerup, “High speed DCT/IDCT using a pipelined
CORDIC algorithm,” in 12th Symp. on Computer Arithmetic, July
1995, pp. 180–187.
[42] M. Jun, K. K. Parhi, G. J. Hekstra, and E. F. Deprettere, “Efﬁcient
implementations of pipelined CORDIC based IIR digital ﬁlters using
fast orthonormal μ-rotations,” IEEE Trans. Signal Process., vol. 48, no.
9, 2000.
[43] M. Chakraborty, A. S. Dhar, and M. H. Lee, “A trigonometric formu-
lation of the LMS algorithm for realization on pipelined CORDIC,”
IEEE Trans. Circuits Syst. II, Express Briefs, vol. 52, no. 9, 2005.
[44] E. I. Garcia, R. Cumplido, and M. Arias, “Pipelined CORDIC design
on FPGA for a digital sine and cosine waves generator,” in Int. Conf.
on Electr. Electron. Eng., ICEEE’06, Sept. 2006, pp. 1–4.
[45] H. Dawid and H. Meyr, “The differential CORDIC algorithm: Constant
scale factor redundant implementation without correcting iterations,”
IEEE Trans. Computers, vol. 45, no. 3, pp. 307–318, Mar. 1996.
[46] C. Y. Kang and E. E. Swartzlander, Jr., “Digit-pipelined direct digital
frequency synthesis based on differential CORDIC,” IEEE Trans. Cir-
cuits Syst. I, Reg. Papers, vol. 53, no. 5, pp. 1035–1044, May 2006.
[47] R. I. Hartley, “Subexpression sharing in ﬁlters using canonic signed
digit multipliers,” IEEE Trans. Circuits Syst. II: Analog Digital Signal
Process., vol. 43, no. 10, pp. 677–688, Oct. 1996.
[48] O. Gustafsson, A. G. Dempster, K. Johansson, M. D. Macleod, and L.
Wanhammar, “Simpliﬁed design of constant coefﬁcient multipliers,” J.
Circuits, Syst., Signal Process., vol. 25, no. 2, pp. 225–251, Apr. 2006.
[49] G. Gilbert, D. Al-Khalili, and C. Rozon, “Optimized distributed
processing of scaling factor in CORDIC,” in 3rd Int. IEEE-NEWCAS
Conf., June 2005, pp. 35–38.
[50] A. M. Despain, “Fourier transform computers using CORDIC itera-
tions,” IEEE Trans. Computers, vol. 23, no. C-10, pp. 993–1001, Oct.
1974.
[51] P. Dewilde, E. F. Deprettere, and R. Nouta, “Parallel and pipelined
VLSI implementation of signal processing algorithms,” in VLSI and
Modern Signal Processing, S. Y. Kung, H. J. Whitehouse, and T.
Kailath, Eds. Englewood Cliffs, NJ: Prentice-Hall, 1995.
[52] K. J. Jones, “High-throughput, reduced hardware systolic solution to
prime factor discrete Fourier transform algorithm,” Proc. IEE Computers
and Digital Techn., vol. 137, no. 3, pp. 191–196, May 1990.
[53] P. K. Meher, J. K. Satapathy, and G. Panda, “Efﬁcient systolic solution
for a new prime factor discrete Hartley transform algorithm,” Proc. IEE
Circuits, Devices & Syst., vol. 140, no. 2, pp. 135–139, Apr. 1993.
[54] Z.-X. Lin and A.-Y. Wu, “Mixed-scaling-rotation CORDIC
(MSR-CORDIC) algorithm and architecture for scaling-free high-performance
rotational operations,” in IEEE Int. Conf. on Acoust., Speech,
Signal Process., ICASSP’03, Apr. 2003, vol. 2, pp. 653–656.
[55] C.-H. Lin and A.-Y. Wu, “Mixed-scaling-rotation CORDIC (MSR-
CORDIC) algorithm and architecture for high-performance vector ro-
tational DSP applications,” IEEE Trans. Circuits Syst. I, Reg. Papers,
vol. 52, no. 11, pp. 2385–2396, Nov. 2005.
[56] C.-L. Yu, T.-H. Yu, and A.-Y. Wu, “On the ﬁxed-point properties of
mixed-scaling-rotation CORDIC algorithm,” in IEEE Workshop on
Signal Process. Syst., Oct. 2007, pp. 430–435.
[57] A. S. Dhar and S. Banerjee, “An array architecture for fast computation
of discrete Hartley transform,” IEEE Trans. Circuits Syst., vol. 38, no.
9, pp. 1095–1098, Sep. 1991.
[58] K. Maharatna, A. Troya, S. Banerjee, and E. Grass, “New virtually
scaling free adaptive CORDIC rotator,” Proc. IEE Computers and Digital
Techn., vol. 151, no. 6, pp. 448–456, Nov. 2004.
[59] K. Maharatna, S. Banerjee, E. Grass, M. Krstic, and A. Troya, “Modi-
ﬁed virtually scaling free adaptive CORDIC rotator algorithm and ar-
chitecture,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 11,
pp. 1463–1474, Nov. 2005.
[60] J. Villalba, T. Lang, and E. Zapata, “Parallel compensation of scale
factor for the CORDIC algorithm,” J. VLSI Signal Process., vol. 19,
no. 3, pp. 227–241, Aug. 1998.
[61] D. Timmermann, H. Hahn, B. J. Hosticka, and B. Rix, “A new addition
scheme and fast scaling factor compensation methods for CORDIC al-
gorithms,” Integration, the VLSI J., vol. 11, no. 1, pp. 85–100, Mar.
1991.
[62] Y. H. Hu, “The quantization effects of the CORDIC algorithm,” IEEE
Trans. Signal Process., vol. 40, no. 4, pp. 834–844, Apr. 1992.
[63] K. Kota and J. R. Cavallaro, “Numerical accuracy and hardware
tradeoffs for CORDIC arithmetic for special-purpose processors,”
IEEE Trans. Computers, vol. 42, no. 7, pp. 769–779, July 1993.
[64] Y. H. Hu, “CORDIC-based VLSI architectures for digital signal pro-
cessing,” IEEE Signal Process. Mag., vol. 9, no. 3, pp. 16–35, July
1992.
[65] F. Angarita, A. Perez-Pascual, T. Sansaloni, and J. Valls, “Efﬁcient
FPGA implementation of CORDIC algorithm for circular and linear coordinates,”
in Int. Conf. on Field Programmable Logic and Appl., Aug.
2005, pp. 535–538.
[66] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Bal-
timore, MD: Johns Hopkins Univ. Press, 1996.
[67] G. Lightbody, R. Woods, and R. Walke, “Design of a parameterizable
silicon intellectual property core for QR-based RLS ﬁltering,” IEEE
Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 4, pp. 659–678,
Aug. 2003.
[68] J. R. Cavallaro and F. T. Luk, “CORDIC arithmetic for a SVD pro-
cessor,” J. Parallel and Distributed Computing, vol. 5, pp. 271–290,
1988.
[69] J. M. Delosme, “A processor for two-dimensional symmetric eigenvalue
and singular value arrays,” in IEEE 21st Asilomar Conf. on Circuits,
Syst., and Computers, Nov. 1987, pp. 217–221.
[70] Z. Liu, K. Dickson, and J. V. McCanny, “Application-speciﬁc instruction
set processor for SoC implementation of modern signal processing
algorithms,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 4,
pp. 755–765, Apr. 2005.
[71] K. J. Jones, “2D systolic solution to discrete Fourier transform,” Proc.
IEE Computers and Digital Techn., vol. 136, no. 3, pp. 211–216, May
1989.
[72] T.-Y. Sung, “Memory-efﬁcient and high-speed split-radix FFT/IFFT
processor based on pipelined CORDIC rotations,” Proc. IEE Vision,
Image Signal Process., vol. 153, no. 4, pp. 405–410, Aug. 2006.
[73] L.-W. Chang and S.-W. Lee, “Systolic arrays for the discrete
Hartley transform,” IEEE Trans. Signal Process., vol. 39, no. 11, pp.
2411–2418, Nov. 1991.
[74] P. K. Meher and G. Panda, “Novel recursive algorithm and highly compact
semisystolic architecture for high throughput computation of 2-D
DHT,” Electron. Lett., vol. 29, no. 10, pp. 883–885, May 1993.
[75] W.-H. Chen, C. H. Smith, and S. C. Fralick, “A fast computational
algorithm for the discrete cosine transform,” IEEE Trans. Commun.,
vol. 25, no. 9, pp. 1004–1009, Sep. 1977.
[76] B. Das and S. Banerjee, “Uniﬁed CORDIC-based chip to realise DFT/
DHT/DCT/DST,” Proc. IEE Computers and Digital Techniques, vol.
149, no. 4, pp. 121–127, July 2002.
[77] J.-H. Hsiao, L.-G. Chen, T.-D. Chiueh, and C.-T. Chen, “High
throughput CORDIC-based systolic array design for the discrete
cosine transform,” IEEE Trans. Circuits Syst. Video Technol., vol. 5,
no. 3, pp. 218–225, June 1995.
[78] D. C. Kar and V. V. B. Rao, “A CORDIC-based uniﬁed systolic archi-
tecture for sliding window applications of discrete transforms,” IEEE
Trans. Signal Process., vol. 44, no. 2, pp. 441–444, Feb. 1996.
[79] Y. H. Hu and S. Naganathan, “A novel implementation of chirp
Z-transform using a CORDIC processor,” IEEE Trans. Acoust.,
Speech, Signal Process., vol. 38, no. 2, pp. 352–354, Feb. 1990.
[80] J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Prin-
ciples, Algorithms and Applications. Upper Saddle River, NJ: Pren-
tice-Hall, 1996.
[81] R. C. Gonzalez, Digital Image Processing, 3rd ed. Upper Saddle
River, NJ: Prentice-Hall, 2008.
[82] N. Guil, J. Villalba, and E. L. Zapata, “A fast Hough transform for
segment detection,” IEEE Trans. Image Process., vol. 4, no. 11, pp.
1541–1548, Nov. 1995.
[83] S. M. Bhandarkar and H. Yu, “VLSI implementation of real-time
image rotation,” in Int. Conf. on Image Process., Sept. 1996, vol. 2,
pp. 1015–1018.
[84] S. Sathyanarayana, S. R. Kumar, and S. Thambipillai, “Uniﬁed
CORDIC based processor for image processing,” in 15th Int. Conf. on
Digital Signal Process., July 2007, pp. 343–346.
[85] J. Valls, T. Sansaloni, A. Perez-Pascual, V. Torres, and V. Almenar,
“The use of CORDIC in software deﬁned radios: A tutorial,” IEEE
Commun. Mag., vol. 44, no. 9, 2006.
[86] L. Cordesses, “Direct digital synthesis: A tool for periodic wave gen-
eration (part 1),” IEEE Signal Process. Mag., vol. 21, no. 4, 2004.
[87] J. Vankka, Digital Synthesizers and Transmitters for Software Radio.
Dordrecht, Netherlands: Springer, 2005.
[88] J. Vankka, “Methods of mapping from phase to sine amplitude in di-
rect digital synthesis,” in 50th IEEE International Frequency Control
Symposium, Jun. 1996, pp. 942–950.
[89] F. Cardells-Tormo and J. Valls-Coquillat, “Optimisation of direct digital
frequency synthesisers based on CORDIC,” Electron. Lett., vol. 37,
no. 21, 2001.
[90] M. E. Frerking, Digital Signal Processing in Communication Sys-
tems. New York: Van Nostrand Reinhold, 1994.
[91] H. Meyr, M. Moeneclaey, and S. A. Fechtel, Digital Communication
Receivers: Synchronization, Channel Estimation, and Signal Pro-
cessing. New York: Wiley, 1998.
[92] F. Cardells, J. Valls, V. Almenar, and V. Torres, “Efﬁcient FPGA-based
QPSK demodulation loops: Application to the DVB standard,” Lecture
Notes in Computer Sci., vol. 2438, pp. 102–111, 2002.
[93] C. Dick, F. Harris, and M. Rice, “FPGA implementation of carrier synchronization
for QAM receivers,” J. VLSI Signal Process., vol. 36, pp.
57–71, 2004.
[94] M. J. Canet, F. Vicedo, V. Almenar, and J. Valls, “FPGA implementa-
tion of an IF transceiver for OFDM-based WLAN,” in IEEE Workshop
on Signal Process. Syst., SIPS’04, 2004, pp. 227–232.
[95] A. Troya, K. Maharatna, M. Krstic, E. Grass, U. Jagdhold, and R.
Kraemer, “Low-power VLSI implementation of the inner receiver for
OFDM-based WLAN systems,” IEEE Trans. Circuits Syst. I, Reg. Pa-
pers, vol. 55, no. 2, pp. 672–686, Mar. 2008.
[96] Z. Guo and P. Nilsson, “A VLSI architecture of the square root algo-
rithm for V-BLAST detection,” J. VLSI Signal Process. Syst., vol. 44,
no. 3, pp. 219–230, Sept. 2006.
[97] Z. Khan, T. Arslan, J. S. Thompson, and A. T. Erdogan, “Analysis and
implementation of multiple-input, multiple-output VBLAST receiver
from area and power efﬁciency perspective,” IEEE Trans. Very Large
Scale Integr. (VLSI) Syst., vol. 14, no. 11, pp. 1281–1286, Nov. 2006.
[98] F. Sobhanmanesh and S. Nooshabadi, “Parametric minimum hardware
QR-factoriser architecture for V-BLAST detection,” Proc. IEE Cir-
cuits, Devices and Syst., vol. 153, no. 5, pp. 433–441, Oct. 2006.
[99] C. M. Rader, “VLSI systolic arrays for adaptive nulling [radar],” IEEE
Signal Process. Mag., vol. 13, no. 4, pp. 29–49, July 1996.
[100] R. Hamill, J. V. McCanny, and R. L. Walke, “Online CORDIC algo-
rithm and VLSI architecture for implementing QR-array processors,”
IEEE Trans. Signal Process., vol. 48, no. 2, pp. 592–598, Feb. 2000.
[101] C. Lee and P. Chang, “A maximum pipelined CORDIC architecture for
inverse kinematic position computation,” IEEE Trans. Robot. Autom.,
vol. RA-3, no. 5, pp. 445–458, 1987.
[102] R. Harber, J. Li, X. Hu, and S. Bass, “The application of bit-serial
CORDIC computational units to the design of inverse kinematics pro-
cessors,” in Proc. IEEE Int. Conf. on Robot. and Autom., 1988, pp.
1152–1157.
[103] C. Krieger and B. Hosticka, “Inverse kinematics computations with
modiﬁed CORDIC iterations,” Proc. IEE Computers and Digital
Techn., vol. 143, pp. 87–92, Jan. 1996.
[104] Y. Wang and S. Butner, “A new architecture for robot control,” in Proc.
IEEE Int. Conf. on Robot. Autom., 1987, pp. 664–670.
[105] Y. Wang and S. Butner, “RIPS: A platform for experimental real-time
sensory-based robot control,” IEEE Trans. Syst., Man, Cybern., vol. 19,
pp. 853–860, 1989.
[106] I. Walker and J. Cavallaro, “Parallel VLSI architectures for real-time
kinematics of redundant robots,” in Proc. IEEE Int. Conf. on Robot.
Autom., 1993, pp. 870–877.
[107] M. Kameyama, T. Amada, and T. Higuchi, “Highly parallel collision
detection processor for intelligent robots,” IEEE J. Solid-State Circuits,
vol. 27, pp. 300–306, 1992.
[108] T. Lang and E. Antelo, “High-throughput CORDIC-based geometry
operations for 3D computer graphics,” IEEE Trans. Computers, vol.
54, no. 3, pp. 347–361, Mar. 2005.
[109] B. Phong, “Illumination for computer generated pictures,” Commun.
ACM, pp. 311–317, June 1975.
[110] J. Euh, J. Chittamuru, and W. Burleson, “CORDIC vector interpolator
for power-aware 3D computer graphics,” in IEEE Workshop on Signal
Process. Syst., SIPS’02, Oct. 2002, pp. 240–245.
Pramod Kumar Meher (SM’03) received the M.Sc.
and M.Phil. degrees in physics, in 1978 and 1981, re-
spectively, and the Ph.D. degree in science, in 1996,
all from Sambalpur University, Sambalpur, India.
Currently, he is a Senior Scientist with the Institute
for Infocomm Research, Singapore. Prior to this assignment,
he was a visiting faculty member with the School of
Computer Engineering, Nanyang Technological University,
Singapore. From 1997 to 2002, he was a Professor
of Computer Application with Utkal University,
Bhubaneswar, India, and from 1993 to 1997, he
was a Reader in electronics with Berhampur University, Berhampur, India. His
research interests mainly include the design and optimization of dedicated and
reconﬁgurable architectures for computation-intensive algorithms, arithmetic
circuits, and algorithm-architecture co-design for digital signal processing and
communication.
Dr. Meher is a Fellow of IET (UK) and of IETE (India). He is currently
serving as Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND
SYSTEMS—II: EXPRESS BRIEFS, IEEE TRANSACTIONS ON VERY LARGE SCALE
INTEGRATION (VLSI) SYSTEMS, and The Journal of Circuits, Systems, and
Signal Processing. He was the recipient of the Samanta Chandrasekhar Award for
excellence in research in engineering and technology for the year 1999.
Javier Valls (M’02) received the telecommunication
engineering degree from the Universidad Politec-
nica de Cataluna, Spain, and the Ph.D. degree in
telecommunication engineering from the Univer-
sidad Politecnica de Valencia, Spain, in 1993 and
1999, respectively.
He has been with the Department of Electronics at Universidad
Politecnica de Valencia, Valencia, Spain,
since 1993, where he is currently an Associate
Professor. His current research interests include the
design of FPGA-based systems, computer arith-
metic, VLSI signal processing, and digital communications.
Tso-Bing Juang (M’99) received the Ph.D. degree
in computer science and engineering from National
Sun Yat-sen University, Kaohsiung, Taiwan, in 2004.
Since 2006, he has been with the Department of Computer
Science and Information Engineering, National
Pingtung Institute of Commerce, Taiwan, and is currently
an Assistant Professor. His research interests
include computer arithmetic and VLSI systems.
Dr. Juang received the Best Thesis Award from
Taiwan Xerox’s foundation in 1995. Currently, he
serves as a reviewer for the IEEE TRANSACTIONS
ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, IEEE TRANSACTIONS
ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, IEEE TRANSACTIONS ON
COMPUTERS, and IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION
(VLSI) SYSTEMS.
K. Sridharan (S’84-M’96-SM’01) received the
Ph.D. degree from Rensselaer Polytechnic Institute,
Troy, NY, in 1995.
He was an Assistant Professor at the Indian
Institute of Technology (IIT) Guwahati, India, from
1996 to 2001. Since June 2001, he has been with
IIT Madras, Chennai, India, where he is presently an
Associate Professor. He was a visiting staff member
at NTU, Singapore, in 2000–2001 and 2006–2008.
He has supervised three Ph.D. candidates, and holds
one joint patent. He has published more than 60
papers in various international journals and conferences.
Dr. Sridharan received the Computer Engineering Division Prize for a paper
published in the Journal of I.E. (India) in 2002. He was also a joint recipient of
the IEEE Vincent Bendix Award.
Koushik Maharatna (M’02) received the M.Sc. de-
gree in electronic science from Calcutta University,
in 1995 and the Ph.D. degree from Jadavpur Univer-
sity, Calcutta, India, in 2002.
From 2000 to 2003, he was a Research Scientist
with IHP, Frankfurt (Oder), Germany, where he
was mainly involved in the design of a single-chip
modem for the IEEE 802.11a standard. He has been with
the Electronics Systems and Devices Group at the
School of Electronics and Computer Science of
the University of Southampton, U.K., as a Senior
Lecturer since 2006. His research interests include development of VLSI
architectures for DSP and communication applications, computer arithmetic,
low-power digital design, analog signal processing, and CNN.
Dr. Maharatna has served as session chair, reviewer and review committee
management member in several IEEE conferences and journals. He is currently
a member of the Engineering and Physical Sciences Research Council (EPSRC) college
in the U.K. He is also a member of VLSI System Application (VSA) Technical
Committee of IEEE Circuits and Systems Society.