New virtually scaling free adaptive CORDIC rotator by Maharatna, Koushik et al.
Virtually scaling-free adaptive CORDIC rotator
K. Maharatna, A. Troya, S. Banerjee and E. Grass
Abstract: The authors propose a coordinate rotation digital computer (CORDIC) rotator algorithm
that eliminates the problems of scale factor compensation and limited range of convergence
associated with the classical CORDIC algorithm. In the proposed scheme, depending on the target
angle or the initial coordinate of the vector, a scaling by 1 or $1/\sqrt{2}$ is needed that can be realised
with minimal hardware. The proposed CORDIC rotator adaptively selects the appropriate iteration
steps and converges to the ﬁnal result by executing on average only 50% of the number of iterations
required by the classical CORDIC. Unlike for the classical CORDIC, the value of the scale factor is
completely independent of the number of executed iterations. Based on the proposed algorithm, a
16-bit pipelined CORDIC rotator was implemented. The silicon area of the fabricated pipelined
CORDIC rotator core is 2.73 mm². This is equivalent to 38000 inverter gates in the 0.25 µm
BiCMOS technology used. The average dynamic power consumption of the fabricated CORDIC rotator
is 17 mW at a 2.5 V supply voltage and a 20 Ms/s throughput. Currently, this CORDIC rotator is
used as a part of the baseband processor for a project that aims to design a single-chip wireless
modem compliant with the IEEE 802.11a standard.
1 Introduction
The Coordinate Rotation Digital Computer (CORDIC)
algorithm [1, 2] provides an elegant way of computing
various transcendental and trigonometric functions [2–4]
by merely using table look-up, shift and addition operations.
Since its introduction, the CORDIC algorithm has been used
in several arithmetic processors as well as to formulate and
implement different modern digital signal processing (DSP)
algorithms [5–11]. In principle, a CORDIC accepts three
input variables x, y and z and generates the outputs $x'$, $y'$ and
$z'$. It can operate in two different modes: (i) rotation; and
(ii) vectoring, where the z or y variable, respectively, is
forced to zero through a series of iterations. Each of these
modes can be utilised in circular, linear and hyperbolic
coordinate systems to compute various functions as shown
in Table 1. The K terms ($K_h$, $K_c$) are the scale factors that are generated
during the CORDIC computation process.
The classical CORDIC approach [1, 2] suffers from three
principal drawbacks: (i) the requirement of a scale factor
compensation; (ii) the magnitude restriction of the input
variables; and (iii) its low speed of execution. Whereas the
ﬁrst drawback requires additional multiplication operations,
the second drawback incurs a signiﬁcant impact on the
accuracy of the computed function since it depends on how
closely the variables y (in vectoring mode) or z (in rotation
mode) can be driven to zero. They can be driven close to
zero if the initial inputs lie within a certain range called the
‘range of convergence’. Table 1 also lists the range of
convergence for different modes of CORDIC operation.
The third drawback comes from the iterative nature of the
CORDIC algorithm. These drawbacks have motivated
research on the development of high-performance special
purpose CORDIC processors and various implementations
of CORDIC processors have been suggested [12–17].
Circuit-level speed enhancements of the CORDIC
algorithm have been achieved by unfolding the iteration
stages and by realising it in a pipelined fashion. Redundant
arithmetic has also been used to speed-up the addition
operations that are at the heart of the CORDIC algorithm.
On the other hand, to achieve algorithm-level speed
enhancements, it is necessary for the CORDIC algorithm
to converge to the ﬁnal result by executing as small a
number of iterations as possible. However, the penalty
associated with this approach when applied to the classical
CORDIC formulation is that the scale factor becomes non-
constant and non-predictable and thus, extra post-processing
cycles and circuitry are needed for its compensation. The
complexity of the extra post-processing circuitry and cycles
is in principle comparable to that of the basic CORDIC
process itself.
Two different methods have been suggested to extend the
‘range of convergence’ of the CORDIC. The ﬁrst one is to
use mathematical identities to pre-process the CORDIC
input quantities [2, 18]. This process is broadly known as the
‘argument reduction technique’. Such mathematical iden-
tities do indeed help ease the present limitations but they are
cumbersome to use in hardware applications due to a
signiﬁcant processing time and large hardware overhead.
The second approach requires repetition of certain iteration
steps [19–21]. Depending on how these repeated iterations
are chosen, a large number of such iterations may be
necessary to obtain an accurate result which increases the
processing time signiﬁcantly [15]. Moreover, in this type
of approach, the scale factor does not remain constant and
once again requires extra processing hardware for its
compensation.
© IEE, 2004
IEE Proceedings online no. 20041107
doi: 10.1049/ip-cdt:20041107
K. Maharatna is with the Department of Electrical and Electronics
Engineering, University of Bristol, Bristol, UK
A. Troya and E. Grass are with Wireless Communication Systems, Institute
for High Performance Microelectronics (IHP), Im Technologie Park 25,
15236, Frankfurt (Oder), Germany
S. Banerjee is with the Department of Electronics and ECE, Indian Institute
of Technology, Kharagpur, India
Paper ﬁrst received 14th April and in revised form 25th August 2004
IEE Proc.-Comput. Digit. Tech., Vol. 151, No. 6, November 2004

The principal aim of the present work is to develop a
CORDIC processor that: (i) is power efficient; (ii) is free from
the scale factor compensation problem; and (iii) has a
convergence range over the entire coordinate space. More
specifically, we concentrate on the rotation and vectoring
modes of operation (i.e. $z \to 0$ or $y \to 0$ respectively) in the
Cartesian coordinate system. The current work is based on a
‘scaling-free’ rotational CORDIC algorithm [9] which has
been utilised to realise a number of DSP algorithms [9–11].
However, this CORDIC algorithm has a very small range of
convergence and can neither be applied for the rotation
operation with large angles nor for the vectoring mode of
operation. In this work we propose a ‘virtually scaling-free’
CORDIC rotator algorithm which requires a constant scale
factor and converges to the ﬁnal result executing a minimal
number of iterations. This potentially reduces the power
consumption since the number of arithmetic computations is
reduced. The main contributions of the present work are:
1. We develop a CORDIC algorithm where the scale factor
can assume only the values of either 1 or $1/\sqrt{2}$ depending
on the input vector. This means that the scale factor is
a priori predictable.
2. Unlike the classical CORDIC, the value of the scale
factor here is independent of the number of executed
iterations. Thus, algorithm-level speed enhancement
and power saving are possible in the proposed scheme
(by skipping iterations that are not actually needed) without
worrying about the scale factor compensation.
3. We show that a basic convergence range of [0°, 22.5°] is
absolutely sufﬁcient to compute the result of a CORDIC
rotation with the target angle (rotation mode) lying
anywhere in the coordinate space. This is also valid for
the vectoring operation. Utilising this property, the range of
convergence of the proposed CORDIC is extended over the
entire coordinate space.
4. In the proposed scheme, a 50% reduction in the number of
iterations, on average, is achieved compared to the classical
CORDIC.
5. A 16-bit pipelined CORDIC rotator is successfully
designed, fabricated and tested to validate the efficiency
of the proposed scheme.
2 The conventional CORDIC algorithm
The rotation of a vector $[x\ y]^T$ in the Cartesian coordinate
system can be described as:
$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \tag{1}$$
where $[x'\ y']^T$ is the final vector and $\theta$ is the angle of
rotation ($-99.9^\circ \le \theta \le 99.9^\circ$). In the above equation it
was assumed that the rotation takes place in a clockwise
direction. For the conventional CORDIC method, the target
angle $\theta$ is expressed as a summation of a number of
elementary rotational angles $\alpha_i$ so that:
$$\theta = \sum_{i=0}^{b-1} \sigma_i \alpha_i \tag{2}$$
where $b$ is the bit precision of the machine in which the
operation is to be implemented and $\sigma_i \in \{1, -1\}$ is
known as the direction of rotation. In the conventional
circular CORDIC operation $\alpha_i$ is expressed as:
$$\alpha_i = \tan^{-1}(2^{-i}) \tag{3}$$
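Substituting (2) into (1) and using (3) one may write:
$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \prod_{i=0}^{b-1} \cos\alpha_i \begin{bmatrix} 1 & \sigma_i 2^{-i} \\ -\sigma_i 2^{-i} & 1 \end{bmatrix} \begin{bmatrix} x_i \\ y_i \end{bmatrix} \tag{4}$$
and
$$\sigma_i = \mathrm{sign}\left[\theta - \sum_{r=0}^{i-1} \alpha_r\right] \tag{5}$$
$$z_{i+1} = z_i + \sigma_i 2^{-i} \tag{6}$$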
Equations (4)–(6) are the basic working equations of the
CORDIC unit operating in the circular mode. The physical
meaning of these equations is that the target angle, input
as variable z, is driven to zero by to-and-fro motion of the
vector. At each iteration step the sign $\sigma_i$ of the residual angle
is calculated and accordingly, the vector is rotated in the
clockwise or counter-clockwise direction. For the vectoring
mode of operation, the value of y is forced to zero in the
same iterative manner and the angles are accumulated in the
z variable. In essence, the mathematical operations involved
here are a sequence of multiply-and-add operations. From
the hardware point of view, this function can be
implemented using shift-and-add operations only.
However, the result obtained in (4) requires scaling by a
factor $\prod_{i=0}^{b-1} \cos\alpha_i$. The scale factor remains a machine
constant as long as $i$ runs through all the steps from zero to
$b - 1$. However, if $i$ changes in a different manner, i.e. if
some of the allowed iterations (i.e. elementary angle
rotations) are dropped or repeated, the scale factor will
not remain constant or predictable. For its compensation one
may require complicated hardware structures or comparable
post-processing cycles.
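The iteration described by (4)–(6) can be sketched in floating-point Python as follows. This is only an illustrative emulation under stated assumptions: the function name `classical_cordic_rotate` is hypothetical, the multiplications by `2.0 ** -i` stand in for hardware right-shifts, and the residual angle is updated by subtracting the table value $\tan^{-1}(2^{-i})$, which is the textbook form and is assumed here rather than taken from the paper.

```python
import math

def classical_cordic_rotate(x, y, theta, b=16):
    """Rotate (x, y) clockwise by theta (radians) using b classical CORDIC steps."""
    z = theta
    scale = 1.0
    for i in range(b):
        sigma = 1.0 if z >= 0 else -1.0        # direction of rotation, cf. (5)
        t = 2.0 ** -i                           # a right-shift by i bits in hardware
        x, y = x + sigma * t * y, y - sigma * t * x   # unscaled micro-rotation of (4)
        z -= sigma * math.atan(t)               # residual angle driven towards zero
        scale *= math.cos(math.atan(t))         # scale factor accumulated over all steps
    return scale * x, scale * y, z

xr, yr, zr = classical_cordic_rotate(1.0, 0.0, 0.5)
```

The final multiplication by `scale` is the compensation step discussed above: it is a fixed constant only because every index from 0 to b - 1 is executed exactly once.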
3 Architecture of a scaling-free CORDIC rotator
Table 1: Function of the CORDIC in different modes of operation

| Mode | Hyperbolic | Linear | Circular |
| --- | --- | --- | --- |
| $y \to 0$ (vectoring) | $x' = K_h\sqrt{x^2 - y^2}$, $z' = z + \tanh^{-1}(y/x)$, $\lvert\tanh^{-1}(y/x)\rvert \le 1.1182$ | $x' = x$, $z' = z + (y/x)$, $\lvert y/x\rvert \le 1$ | $x' = K_c\sqrt{x^2 + y^2}$, $z' = z + \tan^{-1}(y/x)$, $\lvert\tan^{-1}(y/x)\rvert \le 1.7433\ (99.9^\circ)$ |
| $z \to 0$ (rotation) | $x' = K_h[x\cosh(z) + y\sinh(z)]$, $y' = K_h[x\sinh(z) + y\cosh(z)]$, $\lvert z\rvert \le 1.1182$ | $x' = x$, $y' = y + xz$, $\lvert z\rvert \le 1$ | $x' = K_c[x\cos(z) - y\sin(z)]$, $y' = K_c[x\sin(z) + y\cos(z)]$, $\lvert z\rvert \le 1.7433\ (99.9^\circ)$ |

Unlike the classical CORDIC method, in the scaling-free
CORDIC [9] the target angle is achieved by rotating the
vector in one direction only (either clockwise or counter-clockwise)
through sufficiently small elementary angles
$\alpha_i$ so that the norm of the vector is always preserved at each
of the iteration steps. These elementary rotational angles can
be expressed by the following equation:
$$\sin\alpha_i \cong \alpha_i = 2^{-i} \tag{7}$$
The ‘smallness’ of $\alpha_i$ can be worked out by considering the
series expansions of the sine and cosine of $\alpha_i$, which are given by:
$$\sin\alpha_i = \alpha_i - \alpha_i^3/3! + \alpha_i^5/5! - \ldots \tag{8}$$
$$\cos\alpha_i = 1 - \alpha_i^2/2! + \alpha_i^4/4! - \ldots \tag{9}$$
Now, using the approximation $\alpha_i = 2^{-i}$, (8) and (9) can be
written as follows:
$$\sin\alpha_i = 2^{-i} - 2^{-3i}/6 + 2^{-5i}/120 - \ldots \tag{10}$$
$$\cos\alpha_i = 1 - 2^{-2i}/2 + 2^{-4i}/24 - \ldots \tag{11}$$
The largest term that is neglected during the approximation
$\sin\alpha_i \cong \alpha_i = 2^{-i}$ is the second term in the sine series,
which is:
$$2^{-3i}/6 = 2^{-(3i + \log_2 6)} = 2^{-(3i + 2.585)} \tag{12}$$
Now, let the machine in which the operation is supposed to
be implemented have a wordlength of $b$ bits. Then, if $(3i + 2.585)$
equals or exceeds $b$, multiplication of any quantity
by $2^{-(3i+2.585)}$ will essentially become machine zero since
this operation physically means that the multiplicand gets a
right-shift equal to or greater than $b$ bits. Therefore, the
condition for preserving accuracy during such approximation
becomes:
$$3i + 2.585 \ge b \quad \text{or} \quad i \ge \lceil (b - 2.585)/3 \rceil = p \quad (\text{since } i \text{ can adopt only integer values}) \tag{13a}$$
However, for practical purposes, the lower bound of $i$ can be
relaxed slightly and can be expressed as:
$$\lfloor (b - 2.585)/3 \rfloor = p \tag{13b}$$
The upper limit of $i$ is $b - 1$ since a right-shift of any $b$-bit
number by $b$ bits will result in machine zero.
Thus, under the above circumstances, the values of $\sin\alpha_i$
and $\cos\alpha_i$ can be expressed as:
$$\sin\alpha_i = 2^{-i} \tag{14}$$
$$\cos\alpha_i = 1 - 2^{-(2i+1)} \tag{15}$$
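Considering the above approximation, $\sigma_i = +1$ (since the
sign of the residual angle is always the same for all
iterations) and clockwise rotation of the vector, (1)
may be rewritten as:
$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \prod_{i=p}^{b-1} \begin{bmatrix} 1 - 2^{-(2i+1)} & 2^{-i} \\ -2^{-i} & 1 - 2^{-(2i+1)} \end{bmatrix} \begin{bmatrix} x_i \\ y_i \end{bmatrix} \tag{16}$$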
It can be noted that like (4), expression (16) can be realised
in practice using only shift-and-add operations. On the other
hand, contrary to (4), no scaling term appears in (16), which
implies a ‘scaling-free’ CORDIC operation. The elimin-
ation of the scaling term is a signiﬁcant step, since it not
only rules out the requirement of extra post-processing
circuitry but also saves extra processing time. The
elementary datapath component for implementing the
above stated operation is shown in Fig. 1 which can be
viewed as an elementary rotational stage rendering a
rotation of $\alpha_i$ to its input vector. It requires four shifters
and four adder/subtractors as the datapath components.
Thus, compared to the datapath of elementary rotational
stages of an unscaled CORDIC, it requires two additional
shifters and two more adder/subtractors. However, for a
pipelined implementation, the shifters are reduced to direct
hard-wired connections and the effective overhead becomes
two adder/subtractors per elementary rotational stage. On
the other hand, for the elementary rotational stages
corresponding to $i \ge b/2$, the extra shifters and the
adder/subtractors can be omitted as the right-shift of the
input quantity by $(2i + 1)$ becomes machine zero and thus
no longer affects the accuracy. Thus, for the elementary
rotational stages corresponding to $i \ge b/2$, the hardware
cost is the same as that of the elementary rotational stages of
the conventional CORDIC.
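As a rough numerical illustration of why (16) needs no scale factor, the sketch below applies one elementary stage per index for a 16-bit wordlength and checks the vector norm afterwards. It is a floating-point emulation only: the name `scaling_free_stage` is hypothetical, and multiplications by powers of two stand in for the hard-wired shifts of the actual datapath.

```python
def scaling_free_stage(x, y, i):
    """One clockwise elementary rotation by alpha_i = 2**-i, following (16)."""
    c = 1.0 - 2.0 ** -(2 * i + 1)   # cos(alpha_i) approximation from (15)
    s = 2.0 ** -i                   # sin(alpha_i) approximation from (14)
    return c * x + s * y, -s * x + c * y

x, y = 1.0, 0.0
for i in range(4, 16):              # i = p .. b-1 with p = 4, b = 16
    x, y = scaling_free_stage(x, y, i)
norm = (x * x + y * y) ** 0.5       # stays within 2**-16 of the initial norm 1.0
```

Each stage changes the squared norm by a factor $1 + 2^{-(4i+2)}$, which follows directly from the matrix entries in (16); for $i \ge 4$ this is below the resolution of a 16-bit word.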
4 Expansion of the angular convergence range in
the rotation mode
Since the CORDIC unit presented in the preceding Section
is scaling free it saves post-processing hardware as well as
the processing time. However, the principal drawback of
this CORDIC algorithm is that only computations with very
small target angles can be carried out using this principle.
As an example, let us consider the implementation of a
CORDIC unit having a wordlength of 16 bits. Then,
according to the restriction imposed by (13b), the iteration
index $i$ can assume the values 4, 5, ..., 15.
Since in our case the target angle $\theta = \sum_{i=p}^{b-1} \alpha_i$, and
according to (7), $\sin\alpha_i \cong \alpha_i = 2^{-i}$, the largest angle which
can be computed is only $\approx 7.16^\circ$. This is even less than the
range of convergence of the conventional CORDIC ($99.9^\circ$)
and is clearly insufficient for general-purpose usage. Thus,
methods should be invented to expand the angle computation
range while still preserving the ‘scaling-free property’
of the proposed CORDIC unit.
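The figures quoted in this example can be reproduced with a few lines of Python. The variable names are illustrative; the only inputs are the wordlength b = 16 and the bound (13b).

```python
import math

b = 16
p = math.floor((b - 2.585) / 3)                    # (13b): smallest allowed index
max_angle = sum(2.0 ** -i for i in range(p, b))    # each step i = p .. b-1 used once
max_angle_deg = math.degrees(max_angle)            # largest reachable angle in degrees
```

Here `p` evaluates to 4 and `max_angle` to $2^{-3} - 2^{-15}$ rad, which is the roughly 7.16° limit mentioned above.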
To expand the angular range of convergence for the
scaling-free CORDIC algorithm, both the ‘argument
reduction’ and the ‘repetition of certain iteration steps’
techniques are exploited with modiﬁcations. Firstly, an
argument reduction technique is used to reduce the total
angular range to be computed. Secondly, the elementary
rotational operations are carried out in an adaptive manner
to enhance the rate of convergence and to force the ﬁnal
angle approximation error below a certain prespeciﬁed
limit. In the following discussion this scheme is explained in
detail.
Fig. 1 Block diagram of the elementary CORDIC rotor stage
producing a rotation by an angle $\alpha_i$

The main objective of the argument reduction technique
is to uniquely map the results of CORDIC rotations with
large target angles $\theta$ to the results of CORDIC rotations with
relatively small target angles $\phi$. To do this, we divide the
coordinate space into 16 equal domains (i.e. four domains
per quadrant) each having a uniform angular span of $\pi/8$.
Any target angle must lie in one of these 16 domains. We
first examine the CORDIC rotation of an input vector with
target angle $\theta$ lying in the first quadrant. This essentially
means that $\theta$ lies in one of the four domains A ($[0, \pi/8)$),
B ($[\pi/8, \pi/4)$), C ($[\pi/4, 3\pi/8)$) or D ($[3\pi/8, \pi/2]$). In
each domain, $\theta$ can be redefined in terms of another angle
$\phi$ as described in the following equations:
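$$\theta = \phi \quad \text{in domain A} \tag{17}$$
$$\theta = \pi/4 - \phi \quad \text{in domain B} \tag{18}$$
$$\theta = \pi/4 + \phi \quad \text{in domain C} \tag{19}$$
$$\theta = \pi/2 - \phi \quad \text{in domain D} \tag{20}$$
It is to be noted that the angle $\phi$ is always bounded in the
interval $[0, \pi/8]$. Substituting (17)–(20) into (1), the
CORDIC operation on an input vector $[x\ y]^T$ can be
expressed in different domains as:
$$\begin{bmatrix} x_{fA} \\ y_{fA} \end{bmatrix} = \begin{bmatrix} \cos\phi & \sin\phi \\ -\sin\phi & \cos\phi \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \quad \text{in domain A} \tag{21}$$
$$\begin{bmatrix} x_{fB} \\ y_{fB} \end{bmatrix} = \frac{1}{\sqrt{2}} \begin{bmatrix} (\cos\phi + \sin\phi) & (\cos\phi - \sin\phi) \\ -(\cos\phi - \sin\phi) & (\cos\phi + \sin\phi) \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \quad \text{in domain B} \tag{22}$$
$$\begin{bmatrix} x_{fC} \\ y_{fC} \end{bmatrix} = \frac{1}{\sqrt{2}} \begin{bmatrix} (\cos\phi - \sin\phi) & (\cos\phi + \sin\phi) \\ -(\cos\phi + \sin\phi) & (\cos\phi - \sin\phi) \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \quad \text{in domain C} \tag{23}$$
$$\begin{bmatrix} x_{fD} \\ y_{fD} \end{bmatrix} = \begin{bmatrix} \sin\phi & \cos\phi \\ -\cos\phi & \sin\phi \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \quad \text{in domain D} \tag{24}$$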
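where $x_{f\ast}$ and $y_{f\ast}$ (with $\ast$ standing for A, B, C or D) denote the
components of the final vector resulting from CORDIC
operations with target angles lying in the different domains.
Now recall that the CORDIC rotations for $\phi$ and
$-\phi$ can be expressed as:
$$\begin{bmatrix} x'_+ \\ y'_+ \end{bmatrix} = \begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \tag{25}$$
$$\begin{bmatrix} x'_- \\ y'_- \end{bmatrix} = \begin{bmatrix} \cos\phi & \sin\phi \\ -\sin\phi & \cos\phi \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \tag{26}$$
where the $+$ and $-$ signs in the subscripts of $x'$ and $y'$ denote
positive and negative CORDIC rotations respectively. Thus,
(21)–(24) can be written as: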
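$$x_{fA} = x'_-, \qquad y_{fA} = y'_- \qquad (\phi = \theta) \tag{27}$$
$$x_{fB} = \frac{1}{\sqrt{2}}\left(x'_+ + y'_+\right), \qquad y_{fB} = \frac{1}{\sqrt{2}}\left(-x'_+ + y'_+\right) \qquad (\phi = \pi/4 - \theta) \tag{28}$$
$$x_{fC} = \frac{1}{\sqrt{2}}\left(x'_- + y'_-\right), \qquad y_{fC} = \frac{1}{\sqrt{2}}\left(-x'_- + y'_-\right) \qquad (\phi = \theta - \pi/4) \tag{29}$$
$$x_{fD} = y'_+, \qquad y_{fD} = -x'_+ \qquad (\phi = \pi/2 - \theta) \tag{30}$$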
Equations (27)–(30) show that the CORDIC operation with
target angles lying in different domains in the ﬁrst quadrant
can be computed from the results of CORDIC rotations with
target angle f by simple addition and subtraction oper-
ations. This essentially means that the domains B, C and D
are effectively ‘folded back’ to domain A. Hence, we call
this technique ‘domain folding’. A consequence of the
domain folding operation is the generation of a constant
scale factor $1/\sqrt{2}$ for the target angles lying either in
domain B or domain C. This constant scale factor can be
realised with minimal hardware using only shift-and-add
operations within a prespeciﬁed error margin.
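One possible shift-and-add realisation of the constant $1/\sqrt{2}$ is sketched below for non-negative integer operands. The particular set of shifts is an illustrative choice and is not taken from the paper; it corresponds to the truncated binary expansion $2^{-1} + 2^{-3} + 2^{-4} + 2^{-6} + 2^{-8} + 2^{-14} \approx 0.707092$, and the function name `scale_inv_sqrt2` is hypothetical.

```python
def scale_inv_sqrt2(v):
    """Approximate v / sqrt(2) for a non-negative integer v with shifts and adds.

    The shift set is an assumed, illustrative choice (not from the paper).
    """
    return (v >> 1) + (v >> 3) + (v >> 4) + (v >> 6) + (v >> 8) + (v >> 14)

full_scale = 1 << 15                 # example operand: 2**15
scaled = scale_inv_sqrt2(full_scale) # integer approximation of 2**15 / sqrt(2)
```

Each right-shift truncates, so for arbitrary operands the result can fall a few least significant bits below the exact value; a hardware design would choose the number of terms and the rounding to meet its own error margin.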
Up to now, only the target angle that lies in the ﬁrst
quadrant has been considered. By exploiting the symmetry
of the coordinate axes, the domain folding technique can be
easily employed to carry out CORDIC operations with
target angles lying in other quadrants as well. It is easy to
show that depending on the quadrant in which the target
angle lies, only the sign of the ﬁnal output vectors becomes
changed. Thus, a range of convergence spanning the entire
coordinate space is achieved. In summary, for the
computation of vector rotations with an arbitrary target
angle, the ﬁrst step is to detect the domain in which it lies.
Once the domain is detected, the computation can be carried
out applying an appropriate equation from those derived in
equations (27)–(30).
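The first-quadrant folding rules (27)–(30) can be checked numerically with the sketch below. The helper names are hypothetical, and exact trigonometric rotations are used as stand-ins for the basic CORDIC rotator, so the code only illustrates the folding arithmetic, not the hardware datapath.

```python
import math

def rotate_cw(x, y, phi):
    """Stand-in for the basic rotator: clockwise rotation by phi, as in (26)."""
    return x * math.cos(phi) + y * math.sin(phi), -x * math.sin(phi) + y * math.cos(phi)

def rotate_ccw(x, y, phi):
    """Counter-clockwise rotation by phi, as in (25)."""
    return x * math.cos(phi) - y * math.sin(phi), x * math.sin(phi) + y * math.cos(phi)

def folded_rotate(x, y, theta):
    """Clockwise rotation by theta in [0, pi/2] using only angles phi <= pi/8."""
    r = 1.0 / math.sqrt(2.0)                        # constant scale for domains B and C
    if theta < math.pi / 8:                         # domain A, (27)
        return rotate_cw(x, y, theta)
    if theta < math.pi / 4:                         # domain B, (28)
        xp, yp = rotate_ccw(x, y, math.pi / 4 - theta)
        return r * (xp + yp), r * (-xp + yp)
    if theta < 3 * math.pi / 8:                     # domain C, (29)
        xm, ym = rotate_cw(x, y, theta - math.pi / 4)
        return r * (xm + ym), r * (-xm + ym)
    xp, yp = rotate_ccw(x, y, math.pi / 2 - theta)  # domain D, (30)
    return yp, -xp
```

For any first-quadrant angle the folded result agrees with a direct evaluation of (1), while the inner rotation never exceeds $\pi/8$.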
Although, in using the domain folding technique, it is
sufficient to consider a modified range of convergence of
$[0, \pi/8]$ only, it is still beyond the range of convergence of
the scaling-free CORDIC unit presented in Section 3. To
eliminate this discrepancy, one approach is to repeat some
of the iteration steps more than once. The iteration index $i$ at
each stage can be chosen adaptively, in accordance with the
residual angle still to be computed. The process of adaptive
selection of $i$ is described in the flowchart shown in Fig. 2.
In this process, at the beginning of every iteration step $i$, the
residual angle $z_i$ is compared with the value of $2^{-i}$ (the $i$th
elementary rotation angle). If $z_i$ is less than $2^{-i}$, the $i$th
iteration step is skipped, since this means that
the residual angle to be computed is smaller than the $i$th
elementary rotational angle. In such a case the iteration
index $i$ is updated to $i + 1$ and the comparison is carried out
once again. However, if $z_i$ is equal to or greater than
$2^{-i}$, $x_{i+1}$ and $y_{i+1}$ are computed according to (16) and the
residual angle is updated accordingly. The process can be
repeated until a user-defined accuracy $R_{ref}$ is achieved.
Since in our algorithm we consider only one-sided rotation,
this process of selection of iteration steps ensures that only
the actually needed iteration steps are performed. This
reduces the number of computations signiﬁcantly.
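The adaptive selection of iteration steps described above can be sketched as follows. This is a floating-point emulation under stated assumptions: the function name `adaptive_rotate`, the default `r_ref = 2**-16` and the list `steps` (which records the executed indices) are illustrative and do not come from the paper, and Fig. 2 itself is not reproduced here.

```python
import math

def adaptive_rotate(x, y, theta, b=16, p=4, r_ref=2.0 ** -16):
    """Clockwise rotation by theta in [0, pi/8] with adaptively selected steps."""
    z, i, steps = theta, p, []
    while i < b and z >= r_ref:
        s = 2.0 ** -i                           # i-th elementary angle
        if z < s:                               # residual smaller than alpha_i: skip
            i += 1
            continue
        c = 1.0 - 2.0 ** -(2 * i + 1)
        x, y = c * x + s * y, -s * x + c * y    # one elementary stage of (16)
        z -= s                                  # update the residual angle
        steps.append(i)                         # index i is kept, so i = p may repeat
    return x, y, z, steps

x, y, z, steps = adaptive_rotate(1.0, 0.0, 0.3)
x_dev = abs(x - math.cos(0.3))                  # deviation from the exact rotation
```

Because the index is only advanced when a step is skipped, a large initial angle causes the step with the smallest index to be executed several times, while every later index is used at most once.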
Lemma 1: Only the iteration step corresponding to the
smallest allowable value of $i$ is repeated more than once,
with the other values of $i$ being non-repetitive.
Proof: Let the residual angle at the start of the $j$th iteration be $\theta_r$,
with $2^{-q} < \theta_r < 2^{-(q-1)} < 2^{-j}$, where $p < j < q$, $p$ being
the smallest allowable value of $i$. Then, according to the
flowchart shown in Fig. 2, the iteration steps (or the
elementary rotation operations) corresponding to $i = j$ to
$(q - 1)$ will be dropped and an elementary rotation
corresponding to $i = q$ will be performed. After this
elementary rotation operation, let the new residual angle
$\theta_{rq}$ still be greater than $2^{-q}$. This implies that another
elementary rotation operation corresponding to $i = q$
(repetition) is required. Moreover, according to the
proposed scheme, application of the rotation corresponding to
$i = q$ twice (considering at least one repetition is required to
bring down the value of the residual angle to $< 2^{-q}$)
essentially means that $\theta_r$ must be greater than or equal to
$2^{-(q-1)}$, which is in contradiction to our initial assumption.
Thus, it can be concluded that no elementary rotation steps
corresponding to $i > p$ can be repeated. However, if the
value of the target angle $\theta$ at the start of the first iteration is such
that $2^{-(p-n)} \le \theta$ and $\theta - 2^{-(p-n)} < 2^{-p}$, where $n < p$, the
elementary rotation operation corresponding to $i = p$ is to
be repeated $n$ times to cover that range. □
To examine the required number of iterations for different
target angles $\theta$ within the modified convergence range
$[0, \pi/8]$, a pseudo-random sequence of angles $\theta$ has been
generated and a Matlab simulation has been performed
using those angles. The result is shown in Fig. 3 considering
a 16-bit wordlength and an angle approximation error of
$O(2^{-16})$. It is apparent from the flowchart shown in Fig. 2
and the results of Fig. 3 that for any angle lying within this
range, the required number of iterations is always smaller
than the number of iterations required for the conventional
CORDIC when employed for the same operation.
On average, the proposed scheme needs 50% fewer
iterations compared to the conventional CORDIC method.
It can be easily shown analytically that according to the
proposed scheme, the number of required iterations in the
worst case is 15, which occurs when a CORDIC operation
corresponding to the angle $21.484^\circ$ is to be carried out.
Thus, using the proposed scheme, a faster convergence rate
compared to the classical CORDIC approach can be ensured
while keeping the scale factor virtually constant.
4.1 Error analysis for rotation mode
The potential sources of error for the hardware realisation of
an algorithm are two-fold: (i) the error due to the
quantisation of the input word; and (ii) the ﬁnite wordlength
of the arithmetic units. In the case of the CORDIC, another
potential error source is the error due to the approximation
of the target angles. In our scheme, the errors from the ﬁrst
two sources are the same as for the conventional
implementation of the circular CORDIC. Thus, a detailed
analysis of the third error source is presented here with the
consideration that the implementation has been done with a
16-bit wordlength. However, the result can be generalised
for any arbitrary implementation.
Let $\theta_c$ be the computed angle whereas $\theta$ is the ideal target
angle. Then, from (1):
$$x' = x\cos\theta_c + y\sin\theta_c \tag{31}$$
$$y' = -x\sin\theta_c + y\cos\theta_c \tag{32}$$
where $\theta_c = \theta - \varepsilon$ and $\varepsilon = O(2^{-16})$ in the present case. After
simplification, (31) and (32) yield:
$$x'_{error} = (x\cos\theta + y\sin\theta)(1 - \cos\varepsilon) + (-x\sin\theta + y\cos\theta)\sin\varepsilon \tag{33}$$
$$y'_{error} = (-x\sin\theta + y\cos\theta)(1 - \cos\varepsilon) - (x\cos\theta + y\sin\theta)\sin\varepsilon \tag{34}$$
Since $\varepsilon = O(2^{-16})$, $(1 - \cos\varepsilon)$ and $\sin\varepsilon$ will be $O(2^{-33})$ and
$O(2^{-16})$ respectively. Thus, the maximum error that can
occur due to the angle approximation according to the
proposed scheme is $O(2^{-16})$ which is similar to that of the
classical CORDIC [22, 23]. Equations (33) and (34) also
show that the final error due to angle approximation is
dependent on the initial values of the vector components $x$
and $y$ as well as on the ideal target angle $\theta$. It can also be
easily shown from the above mentioned equations that like
the case of the classical CORDIC [22, 23], in this proposed
method the ﬁnal datapath error also becomes unacceptably
high when non-normalised values of x and y are used at the
input.
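The orders of magnitude quoted for (33) and (34) follow from the leading terms of the series (8) and (9). As a worked case, taking $\varepsilon = 2^{-16}$ exactly (an assumption made only for illustration):

```latex
1 - \cos\varepsilon \approx \frac{\varepsilon^{2}}{2} = \frac{\left(2^{-16}\right)^{2}}{2} = 2^{-33},
\qquad
\sin\varepsilon \approx \varepsilon = 2^{-16},
\qquad\text{so that}\qquad
x'_{error} \approx (-x\sin\theta + y\cos\theta)\,\varepsilon,
\quad
y'_{error} \approx -(x\cos\theta + y\sin\theta)\,\varepsilon .
```

For normalised inputs the $(1 - \cos\varepsilon)$ terms are therefore negligible, and the error in each output is set by $\varepsilon$ multiplied by the other component of the ideally rotated vector.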
5 Expansion of the domain of convergence in the
vectoring mode
Fig. 2 Operation principle of the adaptive CORDIC

Fig. 3 Number of required iterations with respect to the target
angle lying in the range [0°, 22.5°] using the proposed algorithm

The concept of domain folding can be utilised once again to
achieve a convergence range over the entire coordinate
space for the vectoring mode of operation of the proposed
CORDIC. To explain the complete scheme, let us first
concentrate on the case where the input vector lies in the
first quadrant of the coordinate space, i.e. the signs of $x$ and $y$
are positive and they satisfy the condition $|\tan^{-1}(y/x)| \le 90^\circ$.
The first step in this scheme is to find out the
appropriate domain of the input vector. Using $\tan(22.5^\circ) = (\sqrt{2} - 1)$
and $\tan(67.5^\circ) = (\sqrt{2} + 1)$ it can be said that the
ratios $(y/(\sqrt{2} - 1)x)$ and $(y/(\sqrt{2} + 1)x)$ define the boundaries
of the domains A-B and C-D respectively. On the other
hand the boundary of domain B-C is defined by $x = y$ since
$\tan(45^\circ) = 1$. Thus, by checking the inequalities
$(\sqrt{2} - 1)x > y$, $(\sqrt{2} + 1)x > y$ and $x > y$, it is possible to determine
the domain in which the initial vector lies. Once the domain
is determined, the input vector can be preprocessed
accordingly. Since the range of convergence of the basic
CORDIC algorithm proposed here is only 22.5°, the
coordinate axes are first prerotated by an angle of 45° in a
counter-clockwise direction when the initial vectors are in
domain B or C. This prerotation, in essence, is simply an
addition and subtraction of $x$ and $y$ followed by a scaling of
$\sqrt{2}$. This ensures that the final angle to be accumulated
always lies within [0°, 22.5°]. For the vector lying in
domain D, the approach for preprocessing is to swap $x$ and $y$
at the input. This swapping means that in effect the $x$
component has to be forced to zero instead of $y$. This
preprocessing operation is shown in the flow chart of Fig. 4
where $(x_{in}, y_{in})$ denotes the output of the preprocessing
step.
After preprocessing, $(x_{in}, y_{in})$ are considered as inputs
to the basic CORDIC having a convergence range of
[0°, 22.5°], which operates in the angle accumulation
(vectoring) mode (shown as vectoring CORDIC in Fig. 4).
The actual angle can be computed at the output of the basic
CORDIC by adding/subtracting the accumulated angle
to/from 45° for domain C and domain B respectively. For
the vectors lying in domain A the accumulated angle during
the vectoring process is the desired angle. The result of the
absolute magnitude of the input vector (the square rooted
result) will be available at the $x'$ output in all these three
cases. For the vector lying in domain D, the final angle can
be computed by subtracting the accumulated angle from
90°. The square rooted result will be available at output $y'$ in
this case.
By exploiting the symmetry of the coordinate axes, this
procedure can be directly extended to calculate the
arctangent of any angle lying in the coordinate space and
thus ensuring a convergence range over the entire
coordinate space.
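The domain detection and preprocessing for a first-quadrant vector can be sketched as below. Several points are assumptions made for illustration because Fig. 4 is not reproduced here: the function names are hypothetical, `math.atan2` stands in for the basic vectoring CORDIC with its [0°, 22.5°] range, the 45° prerotation is written with a division by $\sqrt{2}$ so that the magnitude is preserved, and in domain B the sign of the prerotated y component is removed so that the accumulated angle is non-negative.

```python
import math

SQRT2 = math.sqrt(2.0)

def detect_domain(x, y):
    """First-quadrant domain of (x, y) from the three inequalities in the text."""
    if (SQRT2 - 1.0) * x > y:
        return "A"                  # angle below 22.5 degrees
    if x > y:
        return "B"                  # 22.5 to 45 degrees
    if (SQRT2 + 1.0) * x > y:
        return "C"                  # 45 to 67.5 degrees
    return "D"                      # 67.5 to 90 degrees

def vectoring_angle(x, y):
    """Angle of a first-quadrant vector using a core limited to [0, 22.5] degrees."""
    domain = detect_domain(x, y)
    if domain == "A":
        xin, yin = x, y
    elif domain == "D":
        xin, yin = y, x                                  # swap x and y at the input
    else:
        xin, yin = (x + y) / SQRT2, abs(y - x) / SQRT2   # 45 degree prerotation (assumed form)
    acc = math.atan2(yin, xin)      # stand-in for the basic vectoring CORDIC
    if domain == "A":
        return acc
    if domain == "B":
        return math.pi / 4 - acc    # subtract from 45 degrees
    if domain == "C":
        return math.pi / 4 + acc    # add to 45 degrees
    return math.pi / 2 - acc        # subtract from 90 degrees
```

In every branch the angle handed to the stand-in core lies between 0° and 22.5°, which is the property the preprocessing is meant to guarantee.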
5.1 Error analysis for the vectoring mode
Let us consider that in our formulation the actually
computed angle is $\theta_c$ and the ideal angle that should be
computed is $\theta$. These two angles are related by the following
equation:
$$\theta_c = \theta - \mu \tag{35}$$
where $\mu$ is the actual error that has to be determined. We
further consider that the final value of $y$ is approximated
within the precision $\varepsilon$. Ideally, the final value of $y$ (i.e. $y'$)
should be zero. Thus, in an ideal situation, the following
condition holds:
$$-x\sin\theta + y\cos\theta = 0 \tag{36}$$
However, since $y'$ is approximated with accuracy $\varepsilon$ and the
angle computed is $\theta_c$, (36) can be written as:
$$-x\sin\theta_c + y\cos\theta_c = \varepsilon \tag{37}$$
Substituting the value of $\theta_c$ from (35) and simplifying one
gets:
$$\cos\mu\,[-x\sin\theta + y\cos\theta] + \sin\mu\,[x\cos\theta + y\sin\theta] = \varepsilon \tag{38}$$
Using (36), the first term of (38) becomes zero and the angle
approximation error can be expressed as:
$$\mu = \sin^{-1}\left(\frac{\varepsilon}{x\cos\theta + y\sin\theta}\right) \tag{39}$$
In an ideal case, $x\cos\theta + y\sin\theta = r$, where $r$ is the
magnitude of the vector. Thus, (39) can be rewritten as:
$$\mu = \sin^{-1}\left(\frac{\varepsilon}{r}\right) \tag{40}$$
Thus, the angle approximation error is a function of the
magnitude of the vector as well as the precision by which
the y component of the vector is approximated to zero.
Fig. 4 Flow diagram for the preprocessing steps for expansion of
range of convergence for the proposed CORDIC in its vectoring
mode of operation

The error in the computation of the magnitude can be
derived by considering the following equation of the $x$
datapath:
$$x' = x\cos(\theta - \mu) + y\sin(\theta - \mu) \tag{41}$$
Simplifying the above equation, one gets:
$$x' = (x\cos\theta + y\sin\theta)\cos\mu - (y\cos\theta - x\sin\theta)\sin\mu \tag{42}$$
Using (36) and simplifying (42) yields:
$$x' = r\cos\mu = r\sqrt{1 - (\varepsilon/r)^2} \tag{43}$$
where $x\cos\theta + y\sin\theta = r$. Thus, the error for the magnitude
of the vector can be written as:
$$x'_{error} = r\left(1 - \sqrt{1 - (\varepsilon/r)^2}\right) \tag{44}$$
As in the case of the error in angle estimation, in this case
also, the final error depends on the initial magnitude of the
vector as well as on how closely the $y$ input is driven to zero.
From (40) and (44) it can be shown that for very small
values of the input vector, the errors in angle approximation
and magnitude calculation are too large to be acceptable.
The same can be said if the value of the input vector is not
normalised. This behaviour is identical to that of
the classical CORDIC vectoring operation as investigated
in [23].
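The dependence of both errors on the vector magnitude can be illustrated numerically from (40) and (44). In the sketch below the function name `vectoring_errors` is hypothetical and $\varepsilon$ is fixed at $2^{-16}$ as an assumed example value.

```python
import math

def vectoring_errors(r, eps=2.0 ** -16):
    """Angle error (40) and magnitude error (44) for a vector of magnitude r."""
    mu = math.asin(eps / r)                                # angle approximation error
    x_err = r * (1.0 - math.sqrt(1.0 - (eps / r) ** 2))    # magnitude error
    return mu, x_err

mu_unit, xerr_unit = vectoring_errors(1.0)           # well-scaled input
mu_small, xerr_small = vectoring_errors(2.0 ** -10)  # input 10 bits below full scale
```

For the unit-magnitude input the angle error is essentially $\varepsilon$ itself, whereas for the small input it grows to roughly $2^{-6}$ rad, which illustrates why small or non-normalised inputs are problematic.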
6 Design of the 16-bit CORDIC rotator
The effectiveness of the proposed algorithm is veriﬁed
through a design and implementation of a 16-bit pipelined
CORDIC rotator test chip. Conventional two’s complement
arithmetic has been used throughout the design. In this
Section, we provide the details of the design as well as the
fabrication results.
6.1 Architecture
The complete CORDIC rotator capable of operating in both
the rotation and vectoring modes over the entire coordinate
space has three parts: (i) sign and domain detection
circuitry; (ii) basic CORDIC section with a convergence
range of [0°, 22.5°]; and (iii) the output circuitry. The block
diagram of such a CORDIC rotator is shown in Fig. 5. It has
three principal inputs $x$ (16-bit), $y$ (16-bit) and $z$ (18-bit).
An 18-bit wordlength is selected for $z$ in order to cover the
entire coordinate space ($[0, 2\pi]$). Similarly, it has three
principal outputs $x'$ (16-bit), $y'$ (16-bit) and $z'$ (18-bit). The
sign detection circuitry (designated as ‘sign detect’ in
Fig. 5) detects the sign of input variables $x$ and $y$ to find out
the appropriate coordinate and domain in which the vector
lies. Accordingly, the input variables are processed as has
been described in Sections 4 and 5 and these processed
quantities are regarded as the modiﬁed inputs to the basic
CORDIC rotator. The detection of sign and domain,
preprocessing and appropriate inputting of data to the
basic CORDIC rotator require one clock cycle in total. The
basic CORDIC rotator (designated as ‘cord pipe’ in
Fig. 5) is a bit-parallel pipelined implementation with an
internal wordlength of 16 bits. The basic pipeline is 16
stages long, where the elementary rotational stage
corresponding to $2^{-4}$ is repeated six times. Our aim is to
approximate z and y for the rotation and the vectoring mode
respectively with an accuracy of $O(2^{-16})$. The quadrant and
domain of the initial vector detected in the sign detect
module are represented by two 2-bit signals, namely quad
and domain respectively. An additional single-bit signal
sign is also used to indicate the sign of the initial vector.
These signals are transferred between two consecutive
sections of the pipeline along with the data in a local register
transfer manner. Thus, the data in different sections of the
pipeline has a token attributed to it that carries its initial
quadrant and domain information. The output circuitry
(designated as ‘cord op’ in Fig. 5) consists of a fixed
scaling unit of 1/√2, a demultiplexer and a couple of
adder/subtractors. This circuit post-processes the output
vector emerging from the cord pipe section depending
on these attributes and generates the final output. All the
operations at the output are completed in one clock cycle.
A single-bit signal rot is also provided with the rotator:
its logic ‘high’ state makes the rotator operate in
rotation mode, whereas its logic ‘low’ state selects the
vectoring mode of operation. Apart from the rot signal, the
rotator is also equipped with a set of input and output data
valid signals that indicate the presence of valid data at the
input and output.
To reduce the input/output (I/O) pin count, we employed
pin multiplexing at the input and output of the core
CORDIC rotator. Two 8-bit datapaths are employed for
the input variables x and y respectively and a 9-bit datapath
is employed for input variable z. Thus, full-length data
(16-bit for x and y respectively and 18-bit for variable z) are
available to the core CORDIC after every two I/O clock cycles.
A similar arrangement is provided at the output. Here one
8-bit datapath is provided for the variable x′ and a 9-bit
datapath is provided for the variable y′ (for rotation)/z′ (for
vectoring). The I/O multiplexing circuitry uses a separate
clock, which is twice as fast as the core clock. In our design
we have a target core clock frequency of 20 MHz. Thus, the
I/O clock frequency is 40 MHz. This multiplexing
reduces the number of required I/O pins from 100 to 56. Of
those, 19 are output pins, four are power pins and
the remaining ones are input pins.
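The pin multiplexing described above can be illustrated with a small Python model that splits a full-width word into two half-width transfers and reassembles it on the core side. The transfer order (most significant half first) and the function names are assumptions made for illustration; the text does not specify them.

```python
from typing import Tuple

CORE_CLOCK_HZ = 20e6
IO_CLOCK_HZ = 2 * CORE_CLOCK_HZ  # I/O clock runs at twice the core clock


def split_word(value: int, width: int) -> Tuple[int, int]:
    """Split a `width`-bit word into (high, low) halves of width // 2 bits."""
    if width % 2 or not 0 <= value < (1 << width):
        raise ValueError("width must be even and value must fit in it")
    half = width // 2
    mask = (1 << half) - 1
    return (value >> half) & mask, value & mask


def join_halves(high: int, low: int, width: int) -> int:
    """Reassemble the full word from the two half-width transfers."""
    half = width // 2
    return (high << half) | low


if __name__ == "__main__":
    # 16-bit x or y travels over an 8-bit path, 18-bit z over a 9-bit path.
    for value, width in ((0xABCD, 16), (0x2FFFF, 18)):
        high, low = split_word(value, width)
        assert join_halves(high, low, width) == value
        print(f"{width}-bit {value:#x} -> halves {high:#x}, {low:#x}")
```

Because each word needs two transfers, running the I/O clock at twice the core clock delivers one complete operand set per core clock cycle.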
6.2 Hardware complexity and design flow
In our implementation, the basic CORDIC module having a
convergence range of [0°, 22.5°] requires 66 16-bit two-input
adders altogether. In the sign detect and cord op
modules, terms such as (√2 − 1)x and (√2 + 1)x and the
scaling by 1/√2 are realised using shift-and-add operations.
Although these operations can be realised using multiple-
input adders, in our implementation we have used two-input
adders only. The total complexity of the sign detect
and cord op modules is 18 and 12 16-bit two-input
adders respectively. Thus, the complexity of the complete
processor is 96 16-bit two-input adders. The total hardware
complexity of the proposed processor in its present form is
slightly higher than the classical CORDIC processor that
requires 80 16-bit two-input adders considering the scale
factor compensation circuitry. Despite a slightly higher
hardware complexity, the proposed circuit benefits from
potentially lower power consumption compared to the
classical CORDIC, since on average the computational
burden is reduced by 50%. However, our more recent work
reveals that it is possible to reduce the total hardware
cost associated with the proposed processor further, so
that it can be made more economic than the classical
CORDIC [24, 25].
Fig. 5 Block diagram of the pipelined CORDIC (without I/O pin
multiplexing)
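As an illustration of the shift-and-add realisation mentioned in this section, the Python sketch below approximates the 1/√2 scaling with right shifts and two-input additions. The five-term expansion used here is simply the leading part of the binary expansion of 1/√2; it is an assumption for illustration and not necessarily the decomposition used in the fabricated circuit.

```python
import math

# Leading terms of the binary expansion of 1/sqrt(2):
# 2^-1 + 2^-3 + 2^-4 + 2^-6 + 2^-8 = 0.70703125 (assumed truncation)
SHIFTS = (1, 3, 4, 6, 8)


def scale_inv_sqrt2(x: int) -> int:
    """Approximate x / sqrt(2) using only right shifts and additions."""
    acc = 0
    for s in SHIFTS:
        acc += x >> s  # each extra term costs one two-input adder
    return acc


if __name__ == "__main__":
    x = 1 << 14
    print("shift-and-add:", scale_inv_sqrt2(x))
    print("exact        :", x / math.sqrt(2))
    print("two-input adders needed:", len(SHIFTS) - 1)
```

With five shifted terms the sum needs four two-input adders; adding further terms of the expansion would improve accuracy at the cost of more adders.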
For the design of the rotator, the IHP in-house design kit
was used. The CORDIC rotator was first modelled in VHDL
and simulated using Mentor Graphics’ ModelSim simulator.
After functional verification, Synopsys’ Design Analyzer was
used to synthesise the circuit for the IHP in-house 2.5 V, 0.25 μm
BiCMOS technology with a target core clock frequency of
20 MHz. The synthesised cell areas of the different modules are
shown in Fig. 6. The figure shows that about 15% of the cell
area of the complete rotator is consumed by the sign and
domain detection circuitry, whereas 9.3% is consumed by
the output circuitry. The cell area required by the entire
rotator is equivalent to approximately 38000 inverter gates
in this technology.
After synthesis, the layout of the processor was
performed using Cadence’s Silicon Ensemble. The standard
cell approach with a row utilisation of 85% was deployed. The
area of the processor core after layout is 2.73 mm².
Including the 56 I/O pins, the complete chip area was measured
as 6.6 mm².
6.3 Test results
The fabricated CORDIC rotator was packaged in a QFP 100
package. Testing was carried out in two phases. First, a
pseudo-random sequence of angles and input vectors was
generated. For the rotation mode of operation, the value of z
was varied over the range [0°, 22.5°] while keeping x and y
constant for a particular set of z. For testing
the vectoring mode of operation, we kept z = 0 and the
values of x and y were varied over a wide range. These
vectors were used in Matlab simulations to generate our
reference output results. The same set of vectors was
applied to the processor and the outputs were cross-checked
against the Matlab results. In all cases the processor
exhibited correct behaviour. The latency of the complete
processor was 900 ns and the throughput was one set of results
every 50 ns at a 20 MHz core clock frequency.
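The reported latency and throughput can be related to the architecture of Section 6.1 with a short calculation. The sketch below assumes that the latency counts one cycle for preprocessing, 16 pipeline stages and one cycle for the output circuitry, and that the I/O multiplexing is not included; this is one reading that reproduces the reported figure, not a statement of how the measurement was defined.

```python
CORE_CLOCK_HZ = 20e6     # core clock frequency stated in the text
PREPROCESS_CYCLES = 1    # sign/domain detection and preprocessing
PIPELINE_STAGES = 16     # basic CORDIC pipeline
OUTPUT_CYCLES = 1        # output circuitry


def cycle_time_ns(clock_hz: float) -> float:
    """Clock period in nanoseconds; also the time between results."""
    return 1e9 / clock_hz


def latency_ns(clock_hz: float) -> float:
    """Total latency under the assumed cycle budget."""
    cycles = PREPROCESS_CYCLES + PIPELINE_STAGES + OUTPUT_CYCLES
    return cycles * cycle_time_ns(clock_hz)


if __name__ == "__main__":
    print("result interval (ns):", cycle_time_ns(CORE_CLOCK_HZ))
    print("latency (ns):", latency_ns(CORE_CLOCK_HZ))
```

Under these assumptions, 18 core clock cycles of 50 ns each give the 900 ns latency, and the fully pipelined design delivers one result set per cycle.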
In the second phase, the current consumption of the
chip was measured with a current meter while the chip
operated in continuous mode with a long vector set. The
average dynamic power consumption of the 26 fabricated
and measured chips is 17 mW at a supply voltage of
2.5 V. The die photograph of the CORDIC processor is
shown in Fig. 7.
After verification of the functional behaviour, the chips
were subjected to a maximum operating frequency test. We
found that all the chips showed correct functional behaviour
for an operating frequency of 50 MHz (I/O clock), i.e. a
25 MHz core clock, which is the limit of this tester.
However, Synopsys’ Design Analyzer predicts that the
circuit will operate correctly up to a core clock frequency of
45 MHz.
7 Conclusions
We have described a CORDIC rotator algorithm that is
virtually scaling free and has a convergence range over the
entire coordinate space. The algorithm converges to the final
result by adaptively selecting only the needed iteration steps
and hence requires 50% fewer computations on average
compared to known CORDIC implementations. An original
property of our algorithm is that the value of the scale factor
(1 or 1/√2) is independent of the adaptive selection of
iteration steps and thus, an algorithmic-level speed-up is
possible without affecting the final value of the scale factor.
The computational precision of the proposed processor is
similar to that of the classical CORDIC processor.
The hardware complexity of the proposed processor is
slightly higher than the classical CORDIC processor.
However, our more recent work reveals that the hardware
cost of the proposed processor can be reduced significantly.
Despite a slightly higher hardware cost, the proposed
processor consumes less power compared to the classical
CORDIC since the number of actually required arithmetic
operations is significantly reduced.
Based on this algorithm, a 16-bit pipelined CORDIC
rotator was implemented using the IHP in-house 0.25 μm
BiCMOS technology. The fabrication results confirm
the predicted performance, power and area parameters.
Currently, this CORDIC rotator is used as a part of our
baseband processor in a project that aims to design a single-
chip wireless modem compliant with the IEEE 802.11a
standard [26].
Without any loss of generality, redundant arithmetic can
be applied in the proposed scheme for circuit-level speed
enhancements. However, our main aim was to improve, at
the algorithmic level, the performance and power consumption
of the CORDIC processor. The proposed scheme has been
shown to be efficient for that purpose.
Fig. 6 Cell area of the different modules for the pipelined
CORDIC
Fig. 7 Die photograph of the pipelined CORDIC processor
8 Acknowledgments
The authors sincerely thank A.S. Dhar and U. Jagdhold for
their valuable suggestions on the theoretical and architec-
tural development of the CORDIC processor and C. Wolf,
J. Lehman and M. Krstic for their help in test and
measurement of the fabricated chip. Finally, the authors
would like to thank the technology team of the IHP for
fabricating the chip.
9 References
1 Volder, J.E.: ‘The CORDIC trigonometric computing technique’, IRE
Trans. Electron. Comput., 1959, 8, (3), pp. 330–334
2 Walther, J.S.: ‘A unified algorithm for elementary functions’. Proc.
Joint Spring Computer Conf., July 1971, vol. 38, pp. 379–385
3 Abruzzo, J.: ‘Applicability of CORDIC algorithm to arithmetic
processing’. Proc. 18th Asilomar Conf. on Circuits, Systems and
Computers, 1985, pp. 79–86
4 Andrews, M., and Eggerding, D.A.: ‘A pipelined computer architecture
for unified elementary function evaluation’, Comput. Electr. Eng.,
1978, 5, (2), pp. 189–202
5 Cochran, D.S.: ‘Algorithms and accuracy in the HP–35’, Hewlett-
Packard J., 1972, pp. 10–11
6 Despain, A.M.: ‘Very fast Fourier transform algorithms for implemen-
tation’, IEEE Trans. Comput., 1979, 28, (5), pp. 333–341
7 Lee, D.T., and Morf, M.: ‘Generalized CORDIC for digital signal
processing’. Proc. Int. Conf. on Acoustics Speech and Signal Processing
(ICASSP), May 1982, vol. 3, pp. 1748–1751
8 Mandal, M.C., Dhar, A.S., and Banerjee, S.: ‘Multiplierless array
architecture for computing discrete cosine transform’, Comput. Electr.
Eng., 1995, 21, (1), pp. 13–19
9 Dhar, A.S., and Banerjee, S.: ‘An array architecture for fast
computation of discrete Hartley transform’, IEEE Trans. Circuits
Syst., 1991, 38, (9), pp. 1095–1098
10 Maharatna, K., and Banerjee, S.: ‘A VLSI array architecture for Hough
transform’, Pattern Recognit., 2001, 34, pp. 1503–1512
11 Maharatna, K., Dhar, A.S., and Banerjee, S.: ‘A VLSI array architecture
for realization of DFT, DHT, DCT and DST’, Signal Process., 2001, 81,
pp. 1813–1822
12 Deprettere, E., and Udo, R.: ‘The pipelined CORDIC’. Internal Report
Network Theory Section, Delft University of Technology, 1983
13 Ercegovac, M.D., and Lang, T.: ‘Redundant and on-line CORDIC:
application to matrix triangularization and SVD’, IEEE Trans. Comput.,
1990, 38, (6), pp. 725–740
14 Takagi, N., Asada, T., and Yajima, S.: ‘Redundant CORDIC methods
with a constant scale factor for sine and cosine computation’, IEEE
Trans. Comput., 1991, 40, (9), pp. 989–995
15 Hu, X., Harber, R.G., and Bass, S.C.: ‘Expanding the range of
convergence of the CORDIC algorithm’, IEEE Trans. Comput., 1991,
40, (1), pp. 13–21
16 Wang, S., Piuri, V., and Swartzlander, E.E., Jr.: ‘Hybrid CORDIC
algorithms’, IEEE Trans. Comput., 1997, 46, (11), pp. 1202–1207
17 Phatak, D.S.: ‘Double step branching CORDIC: A new algorithm for
fast sine and cosine generation’, IEEE Trans. Comput., 1998, 47, (5),
pp. 587–602
18 Hitotumatu, S.: ‘Complex arithmetic through CORDIC’, Kodai Math.
Seminar Rep., 1974, (26), pp. 176–186
19 Delosme, J.M.: ‘VLSI implementation of rotations in pseudo-Euclidean
spaces’. Proc. Int. Conf. on Acoustics Speech and Signal Processing
(ICASSP), 1983, vol. 2, pp. 927–930
20 Muller, J.M.: ‘Discrete basis and computation of elementary functions’,
IEEE Trans. Comput., 1985, 34, (9), pp. 857–862
21 Sung, T., Parng, T., Hu, Y., and Chou, P.: ‘Design and implementation
of a VLSI CORDIC processor’. Proc. IEEE Int. Symp. on Circuits and
Systems (ISCAS), 1986, vol. 3, pp. 934–935
22 Hu, Y.H.: ‘The quantization effects of the CORDIC algorithm’, IEEE
Trans. Signal Process., 1992, pp. 834–844
23 Kota, K., and Cavallaro, J.R.: ‘Numerical accuracy and hardware
tradeoff for CORDIC arithmetic for special-purpose processors’, IEEE
Trans. Comput., 1993, 42, (7), pp. 769–779
24 Maharatna, K., Troya, A., Krstic, M., Grass, E., and Jagdhold, U.: ‘A
CORDIC like processor for computation of arctangent and absolute
magnitude of a vector’. Proc. IEEE Int. Symp. on Circuits and Systems
(ISCAS), 2004, pp. II-713–II-716
25 Maharatna, K., Troya, A., Banerjee, S., Grass, E., and Krstic, M.:
‘A 16-bit CORDIC rotator for high-performance wireless LAN’. Proc.
IEEE Personal Indoor and Mobile Radio Communication (PIMRC),
Barcelona, Spain, 5–8 September 2004
26 Grass, E., Tittelbach, K., Jagdhold, U., Troya, A., Lippert, G.,
Krueger, O., et al.: ‘On the single chip implementation of a Hiperlan/2
and IEEE802.11a capable modem’, IEEE Pers. Commun., 2001, 8, (6),
pp. 48–57
IEE Proc.-Comput. Digit. Tech., Vol. 151, No. 6, November 2004 456