shown that the precision can vary several bits using the same number of adders and subtracters, and, hence, the correct choice of rotator architecture is crucial for a low-complexity realization.
I. INTRODUCTION
A rotation is a complex multiplication in which the magnitude of the complex coefficient is equal to one, i.e., only the phase of the data is affected. Rotations appear in many DSP algorithms and, in particular, in many transforms, e.g., the Discrete Fourier Transform (DFT) [1] and the Discrete Cosine Transform (DCT) [2]. Many algorithms that implement the DFT, such as the Fast Fourier Transform (FFT) [3] and the Goertzel algorithm [1], and the DCT, such as the fast DCT [4], are also based on rotations.
A general rotation by an angle of $\alpha$ rad is written as:
$$x' = x \cos\alpha - y \sin\alpha, \qquad y' = y \cos\alpha + x \sin\alpha \tag{1}$$
where $x$ and $y$ are the real and imaginary parts of the input data, respectively. The result is represented by its real and imaginary components, $x'$ and $y'$, respectively. The rotation in (1) is shown in Fig. 1. This rotation can be computed in several different ways, including a general complex multiplication [5] and the CORDIC algorithm [6]. When the rotation is known in advance, the computation can be simplified, leading to an optimized shift-and-add realization. This holds for both the multiplication approach and the CORDIC one.
It is sometimes advantageous from an implementation point of view to introduce a scaling factor in (1). This means that the coefficient does not have unit gain. This is inherent in the CORDIC algorithm, as each sub-rotation introduces a gain. For multiplication-based approaches it can also be advantageous, as it can make one of the coefficients, or even both of them, very simple. For DCT algorithms this scaling factor can be compensated in later stages of the image coding process [7]. For DFT algorithms it is often required that the scaling is the same for several different rotators. Hence, here we will primarily consider rotations for the DCT, even though the results, with additional constraints, can be applied to the DFT as well.
978-1-4244-8157-6/10/$26.00 ©2010 IEEE
In this work, we consider the realization of constant rotations with a focus on low-complexity realizations based on shifts, adders, and subtracters. As the complexity of adders and subtracters is about the same, we will refer to both as adders. Also, as shifts can be hard-wired in bit-parallel arithmetic, we will focus on the number of adders as the cost to minimize.
The rest of the paper is arranged as follows. In the next section, different alternatives for rotators are presented. Then, in Section III the errors and complexity are presented for the different alternatives, and the obtained results are discussed. Finally, some conclusions are given in Section IV.
II. ROTATOR ALTERNATIVES
A. CORDIC
CORDIC (COordinate Rotation DIgital Computer) [6] is a popular algorithm for implementing multiplierless rotations. It realizes a rotation by means of a series of shifts and additions, which reduces the amount of hardware.
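The CORDIC algorithm decomposes the rotation angle, $\theta$, into a sum of $M$ predefined angles, $\alpha_i$, according to:
$$\theta = \sum_{i} \delta_i \alpha_i + \epsilon \tag{2}$$
where $\epsilon$ is the error of the approximation, $\delta_i$ indicates the direction of the so-called micro-rotation, and:
$$\alpha_i = \tan^{-1}(2^{-i}) \tag{3}$$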
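The angles that define the micro-rotations have the property that rotations by them can be computed with shifts and additions only, which significantly reduces the hardware resources. The micro-rotations are carried out as follows:
$$x_{i+1} = x_i - \delta_i \, y_i \, 2^{-i}, \qquad y_{i+1} = y_i + \delta_i \, x_i \, 2^{-i} \tag{4}$$
ICECS 2010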
The hardware circuit for the case $\delta_i = 1$ is depicted in Fig. 2. In Fig. 2, the angle $\alpha_i$ by which the input datum is rotated is chosen by setting the number of bits that are shifted before the additions and subtractions are carried out.
Usually $\delta_i \in \{-1, 1\}$. This forces every micro-rotation to be computed, either clockwise or counterclockwise, and ensures a constant gain of the CORDIC, which can be compensated by multiplying the outputs by:
$$K = \prod_{i} \frac{1}{\sqrt{1 + 2^{-2i}}} \tag{5}$$
This option is preferable when the circuit is used for rotating by several different angles and a constant gain is required for all of them, as happens in the rotators for the FFT [8]. However, in a constant rotator only a single angle $\theta$ must be rotated. In this case it is better to consider $\delta_i \in \{-1, 0, 1\}$. This approach is called redundant CORDIC [9] and allows certain micro-rotations to be removed, reducing the number of adders.
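The micro-rotation scheme can be illustrated with a short Python sketch. It is not the authors' implementation: the function names `cordic_rotate` and `cordic_angle` are hypothetical, floating-point arithmetic stands in for the fixed-point shifts and adders of Fig. 2, and skipped stages (direction 0) are assumed to contribute neither adders nor gain.

```python
import math


def cordic_rotate(x, y, deltas):
    """Apply CORDIC micro-rotations to (x, y).

    deltas[i] in {-1, 0, 1} is the direction of stage i (0 skips the stage).
    Returns the unscaled outputs and the accumulated gain.
    """
    gain = 1.0
    for i, d in enumerate(deltas):
        if d == 0:
            continue  # redundant CORDIC: stage removed
        shift = 2.0 ** -i  # models a hard-wired right shift by i bits
        # Tuple assignment evaluates the right-hand side with the old x and y.
        x, y = x - d * y * shift, y + d * x * shift
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return x, y, gain


def cordic_angle(deltas):
    """Angle realized by a sequence of micro-rotation directions."""
    return sum(d * math.atan(2.0 ** -i) for i, d in enumerate(deltas))
```

For example, `cordic_rotate(1.0, 0.0, [0, 0, 0, 1, 1])` returns a vector whose direction matches `cordic_angle([0, 0, 0, 1, 1])` and whose length equals the returned gain.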
B. Constant multiplication
For constant multiplication, it is possible to replace the general multiplier by shifts, adders, and subtracters. As before, adders and subtracters have the same complexity, so we refer to both as adders. When an input signal is multiplied by more than one constant, a simple method is to realize each multiplier individually, which can be done optimally for coefficients of up to 19 bits [10]. However, it is also possible to exploit redundancies between the constants in order to reduce the complexity of the hardware. Since shift operations are free in terms of complexity, the objective in multiple constant multiplication (MCM) is only to reduce the number of adders needed to implement the constant multiplications. In particular, for complex multiplications each input is multiplied by two constant coefficients. A dedicated algorithm for realizing MCM with two constants has been proposed in [11] and is used in this work.
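In fact, a rotation by any angle $\alpha$ can be realized with constant multiplications, so it is possible to implement complex rotations in this way. The general rotation is defined as:
$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} \tag{6}$$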
The constant multiplication approach for equation (6) is based on implementing the constant values of the sine and cosine functions. Its complexity depends on the precision required for the rotation.
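The dependence of the error on the coefficient word length can be sketched as follows. This is only an illustration under stated assumptions: the helper names `quantize` and `rotation_error` are hypothetical, plain rounding is used instead of the adder-optimal methods of [10] and [11], and the error is taken as the magnitude of the difference between the quantized and the ideal complex coefficient.

```python
import math


def quantize(value, frac_bits):
    """Round value to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def rotation_error(alpha, frac_bits):
    """Distance between the quantized coefficient and exp(j*alpha)."""
    c = quantize(math.cos(alpha), frac_bits)
    s = quantize(math.sin(alpha), frac_bits)
    ideal = complex(math.cos(alpha), math.sin(alpha))
    return abs(complex(c, s) - ideal)
```

With rounding, each coefficient is off by at most half a least significant bit, so `rotation_error(alpha, n)` cannot exceed `math.sqrt(2) * 2**-(n + 1)`.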
C. Scaled constant multiplication
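Scaling is a method to readjust the internal parameters of the system without changing the transfer function [5]. This procedure can be applied to a complex rotation in order to make one of the coefficients equal to 1. Thus, by extracting the term $\sin\alpha$ from equation (6), the following equation is obtained:
$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \sin\alpha \begin{pmatrix} \frac{\cos\alpha}{\sin\alpha} & -1 \\ 1 & \frac{\cos\alpha}{\sin\alpha} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} \tag{7}$$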
This scaling reduces the number of internal constant multiplications to two, in contrast to the four required in equation (6). The outputs are then scaled by a constant factor. In the case under study, this scaling can be incorporated into the corresponding constant of the quantizer, leading to savings in the number of adders. The resulting constant multiplications can be straightforwardly realized using the optimal approach in [10].
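The analogous case consists of taking $\cos\alpha$ as the common factor, according to:
$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \cos\alpha \begin{pmatrix} 1 & -\frac{\sin\alpha}{\cos\alpha} \\ \frac{\sin\alpha}{\cos\alpha} & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} \tag{8}$$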
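A minimal Python sketch of the $\sin\alpha$-scaled variant is given below. The names `scaled_rotator` and `apply_scaled` are hypothetical, rounding to a fixed number of fractional bits is an assumption, and the realized angle and scale are derived from the quantized coefficient rather than taken from the paper.

```python
import math


def scaled_rotator(alpha, frac_bits):
    """Quantize cos(alpha)/sin(alpha) to frac_bits fractional bits.

    Returns the coefficient k, the angle actually realized by (k + j),
    and the output scale 1/|k + j| that replaces the ideal sin(alpha).
    """
    scale = 1 << frac_bits
    k = round(scale / math.tan(alpha)) / scale
    return k, math.atan2(1.0, k), 1.0 / math.hypot(k, 1.0)


def apply_scaled(x, y, k):
    """Two constant multiplications by k; the outputs still carry the scale."""
    return k * x - y, k * y + x
```

In this form the only non-trivial products are `k * x` and `k * y`, which is why two constant multiplications suffice. Under these assumptions the quantization of `k` shows up as a small angle error and a slightly modified output scale.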
The scaling explained in the previous section can be further generalized. Thus, a scaling factor $R$ can be considered, which transforms equation (6) into:
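As a hedged illustration of this generalization, the rotation could be rewritten as shown below. The exact form is an assumption: $R$ is taken to multiply both coefficients and to be compensated by $1/R$ at the output.

```latex
\begin{pmatrix} x' \\ y' \end{pmatrix}
  = \frac{1}{R}
    \begin{pmatrix} R\cos\alpha & -R\sin\alpha \\ R\sin\alpha & R\cos\alpha \end{pmatrix}
    \begin{pmatrix} x \\ y \end{pmatrix}
```

Under this assumption, choosing $R = 1/\sin\alpha$ or $R = 1/\cos\alpha$ recovers the two scaled cases of the previous section.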
This general scaled constant multiplication makes it possible to search for the value of $R$ that leads to the lowest rotation error. However, finding the best value of $R$ requires an exhaustive search on a very fine grid. Hence, here we only note that this possibility would lead to results that are at least as good as those of the two previous methods based on constant multiplication. Due to the computational complexity of this search, no results are presented in the current work.
[Figure: plot with horizontal axis "Number of adders" (ticks 4 to 10).]
For the constant multiplication cases, it is often possible to reduce the error at the same complexity, compared with rounding, by using addition-aware quantization [12]. In [12], $E$ additional fractional bits are used, so that there are exactly $2^E$ different representable coefficients for which $\epsilon \le 2^{-(N+1)}$, including the one obtained by rounding to $N$ fractional bits. These $2^E$ candidates are searched for the best solution.
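The candidate set described above can be enumerated with a few lines of Python. This is a sketch of the enumeration step only, not of the method in [12]: the function name `candidate_coefficients` is hypothetical, and the subsequent selection by adder count is not modeled.

```python
import math


def candidate_coefficients(value, n_bits, extra_bits):
    """List the coefficients with n_bits + extra_bits fractional bits
    that lie within 2**-(n_bits + 1) of value."""
    total = n_bits + extra_bits
    half_lsb = 2.0 ** -(n_bits + 1)
    lo = math.ceil((value - half_lsb) * 2**total)
    hi = math.floor((value + half_lsb) * 2**total)
    return [k / 2**total for k in range(lo, hi + 1)]
```

For a generic value such as $\cos(\pi/16)$ with $N = 8$ and $E = 3$, the list has $2^3 = 8$ entries and contains the value rounded to 8 fractional bits. If an interval endpoint happens to fall exactly on the finer grid, one extra candidate appears.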
For each precision requirement, the solution with the smallest maximum quantization error among those solutions with the smallest addition count is selected.
III. RESULTS
The precision of each alternative is evaluated as a function of the number of adders. This precision, in terms of correct fractional bits, is defined as:
where $\epsilon$ is the error of the rotation with respect to the ideal rotation. This error is caused by the finite arithmetic used in the hardware circuits. Specifically, in the case of the constant multiplication and the scaled constant multiplication, it is due to the quantization of the sine and cosine coefficients of the angles. For the CORDIC algorithm, on the other hand, the error arises because the angle cannot be represented exactly by a finite sequence of micro-rotations.
The results for constant multiplication and scaled constant multiplication have been obtained by applying the addition-aware quantization methodology [12]. For the CORDIC, all the possible combinations of micro-rotations have been calculated considering $\delta_i \in \{-1, 0, 1\}$, and the sequence of micro-rotations that best approximates the angle has been chosen. From Fig. 3 it can be observed that the algorithm that best approximates the angle depends on the number of adders used for the rotation. When 4 or 6 adders are used, the best result is obtained by the CORDIC algorithm. However, if 8 or 10 adders are available, a better approximation is obtained with the scaled constant multiplication method. This shows that none of the algorithms is better than the others in all circumstances.
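The exhaustive CORDIC search described above can be sketched in Python as follows. Several points are assumptions made for illustration: the function name `best_cordic` is hypothetical, each non-zero micro-rotation is assumed to cost two adders (one adder and one subtracter), each stage is used at most once, and the error is measured as the angle difference in radians rather than with the paper's precision measure.

```python
import itertools
import math


def best_cordic(theta, num_stages, num_adders):
    """Search all direction sequences in {-1, 0, 1}**num_stages that fit
    the adder budget and return (angle_error, deltas) for the best one."""
    best = None
    for deltas in itertools.product((-1, 0, 1), repeat=num_stages):
        # Assumed cost model: two adders per micro-rotation that is carried out.
        if 2 * sum(d != 0 for d in deltas) > num_adders:
            continue
        angle = sum(d * math.atan(2.0 ** -i) for i, d in enumerate(deltas))
        err = abs(angle - theta)
        if best is None or err < best[0]:
            best = (err, deltas)
    return best
```

The search visits $3^M$ sequences for $M$ stages, which is manageable for the small stage counts relevant here. Because the budget is an upper bound, increasing `num_adders` can never increase the returned error.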
Figs. 4 and 5 show the same analysis for the angles $\pi/8$ and $3\pi/16$, respectively. In the first case, the CORDIC algorithm obtains the best results for 4, 6 and 8 adders, whereas the precision is higher for the scaled constant multiplication when 10 adders are used. In the case of the angle $3\pi/16$, depicted in Fig. 5, the CORDIC provides the most accurate results regardless of the number of adders.
By comparing all three graphs it can be noted that the optimum results depend strongly on the angle that must be rotated. Thus, the algorithms should be evaluated for each individual angle in order to obtain the optimum case. Table I summarizes the best results for each angle and each number of adders according to the graphs. The table indicates the best algorithm for each case and how to compute the rotation with it. For the CORDIC algorithm, the sequence of values $\delta_i$ is provided for $i = 0 \ldots M$, with $\alpha_i = \tan^{-1}(2^{-i})$. Note that $\bar{1}$ is used to indicate $\delta_i = -1$. For the cases of scaled constant multiplication (SCM), the coefficient indicates how the constant value $\cos\alpha / \sin\alpha$ is quantized. The hardware architecture that uses the indicated number of adders can be obtained from these values [10].
Moreover, the table shows the error of the approximation that leads to the precision bits, as well as the scaling factor. This scaling factor is equal to $\sin\alpha$ for the SCM, as can be observed in equation (7). For the CORDIC algorithm it is obtained according to equation (5), but considering only the scaling of the micro-rotations that are carried out.
Finally, it has been shown that it is possible to implement a DCT architecture by performing the rotations $\pi/16$, $\pi/8$ and $3\pi/16$ at the last stage of the algorithm [7]. This allows the scaling of these rotations to be incorporated into the corresponding constant of the quantizer after the DCT, leading to savings in hardware. These are the three rotations that have been studied in depth in this paper. Consequently, in order to obtain an optimized hardware architecture for the computation of the DCT, the study presented in this paper allows the number of adders for these rotations to be chosen freely, depending on the available hardware resources. The most efficient rotator, i.e., the one that minimizes the error, can then simply be selected from Table I.
IV. CONCLUSIONS
In this work we considered low-complexity realization of constant angle rotators based on shifts, adders, and subtracters.
The results show that redundant CORDIC and scaled constant multiplication provide the best results, depending on which angle is considered. It is also shown that the precision can vary by several bits using the same number of adders and subtracters, and, hence, the correct choice of rotator architecture is crucial for a low-complexity realization.
