Abstract: Rotations by angles that are fractions of the unit circle find applications in, e.g., fast Fourier transform (FFT) architectures. In this work we propose a new rotator that consists of a series of stages. Each stage calculates a micro-rotation by an angle corresponding to a power-of-three fractional part of the circle. Using a continuous range of powers of three, it is possible to carry out all the required rotations. In addition, the proposed rotators are compared to previous approaches based on shift-and-add algorithms, showing improvements in accuracy and number of adders.
I. INTRODUCTION
In fast Fourier transform (FFT) calculations [1], there is a need to compute rotations of complex numbers, which are carried out by complex rotators. Each complex rotation can be defined as a multiplication by a twiddle factor
$$W_N^{\phi} = e^{-j \frac{2\pi}{N} \phi}, \tag{1}$$
where N is the number of rotation angles in which the circumference is divided, and φ ∈ {0, 1, . . . , N − 1} is the index of the rotation angle. One approach to implement the rotator is to store the twiddle factors in a ROM, and use a complex multiplier to carry out the rotation [2] . The complex multiplier can be implemented with three or four real multipliers [3] .
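The three-multiplier option mentioned above can be illustrated with a short sketch. It uses the well-known rearrangement that trades one real multiplication for extra additions; the function names are hypothetical, and this is not necessarily the exact structure used in [3].

```python
def complex_mul_4(a, b, c, d):
    """(a + jb) * (c + jd) with four real multiplications."""
    return a * c - b * d, a * d + b * c


def complex_mul_3(a, b, c, d):
    """(a + jb) * (c + jd) with three real multiplications.

    The products k1, k2 and k3 are shared between the real and
    imaginary outputs, at the cost of additional additions.
    """
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2


# Both variants give the same result for (3 + 4j) * (5 - 2j).
print(complex_mul_4(3, 4, 5, -2))
print(complex_mul_3(3, 4, 5, -2))
```

When the coefficient $c + jd$ is a stored twiddle factor, the terms $d - c$ and $c + d$ can be precomputed, which is what makes the three-multiplier form attractive for rotators.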
In order to avoid those multipliers, many multiplierless shift-and-add methods have been developed. An early such algorithm was the CORDIC [4]-[6]. It is based on breaking down the rotation angle into a set of predefined angles. Accordingly, the rotation is calculated by a series of micro-rotations by those angles. The angles are chosen so that the micro-rotations can be easily implemented in hardware. Each micro-rotation stage only consists of an adder, a subtractor and some multiplexers. All micro-rotator stages introduce a scaling, which means that the magnitude is increased. In CORDIC, the total gain for all stages is $R = \prod_i |1 + 2^{-i} j| \approx 1.647$ [5]. Other CORDIC-based methods that consist of a cascade of micro-rotations have also been proposed [7], [8]. For instance, the redundant CORDIC algorithm [7] introduces a redundant phase representation, which removes long carry chains, leading to lower latency. This approach has the drawback that it gives a non-uniform scaling, or gain, that depends on the rotation angle. This was solved in [7] by using another set of adders. However, this increases the number of adders with respect to the conventional CORDIC.
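The CORDIC gain quoted above can be checked numerically. The following is a minimal sketch that multiplies the magnitudes $|1 + 2^{-i} j|$ of the first stages; the function name and the chosen stage counts are illustrative assumptions.

```python
def cordic_gain(num_stages):
    """Product of |1 + j*2**-i| over the first num_stages CORDIC stages."""
    gain = 1.0
    for i in range(num_stages):
        gain *= abs(complex(1.0, 2.0 ** -i))
    return gain


# The product converges quickly, so a handful of stages already
# gives a value close to the limit.
for stages in (4, 8, 16):
    print(stages, cordic_gain(stages))
```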
Recently, a method based on scaling the coefficients in order to reduce the rotation error was presented in [2]. That approach uses complex multipliers, so it is not comparable with shift-and-add algorithms, but the coefficient scaling it describes is used in the proposed method to find efficient coefficients.
When the rotator must only compute rotations by a reduced set of angles, the design can be optimized for those angles. This is the case for the proposed method. Thus, by optimizing the rotations only for the needed angles, the total number of stages can be kept much lower than in the CORDIC. The proposed approach also uses a "base 3"-inspired method, which lowers the number of stages even more. The base-3 structure uses, like the redundant CORDIC, the possibility of a rotation by 0° and, hence, rotates by either +α, 0, or −α. However, contrary to the redundant CORDIC, by using the coefficient scaling method [2], the scaling error can be considered in the selection of the coefficients. This ensures a uniform scaling, i.e., the same gain for all angles.
Finally, in the proposed method, the number of stages depends only on the number of points to rotate, N, and not on the precision. The precision is related to the complexity of the rotator, which can be arbitrarily chosen.
The proposed method has been compared with recent shift-and-add based methods. In [9], [10] complex rotators using trigonometric identities were proposed. They implement only a few constants, and then derive the rest from those by using, e.g., sin(π/8) = 2 cos(π/16) sin(π/16). In [11], a reduced Booth-like multiplier is presented. It is based on the observation that when the set of coefficients is known, it is enough to design the accumulation tree for the maximum number of actual partial products.
II. ERROR IN HARDWARE ROTATIONS
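A hardware rotator takes an input $x + jy$ and provides an output $X + jY$ given by
$$X + jY = (x + jy)(C_\alpha + jS_\alpha), \tag{2}$$
where $C_\alpha + jS_\alpha$ is the coefficient for the rotation by the angle $\alpha$. It approximates $R\cos\alpha + jR\sin\alpha$, which is the exact value of the rotation, scaled by the scaling factor $R$ [2]. According to this, the normalized rotation error is
$$\epsilon(\alpha) = \frac{\left| (C_\alpha + jS_\alpha) - R(\cos\alpha + j\sin\alpha) \right|}{R}. \tag{3}$$
According to (1), the angles of the FFT are given by
$$\alpha = -\frac{2\pi}{N}\phi, \tag{4}$$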
where α is the angle in radians and φ is the angle in steps. A step is here defined as the angle corresponding to 1/N of the circumference, or $2\pi/N$ radians. By substituting equations (1), (2) and (4) in (3), the rotation error can be obtained as
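The error for a rotator is the maximum error for any angle the rotator can carry out, i.e.,
$$\epsilon = \max_{\phi} \epsilon(\phi), \tag{6}$$
where $\epsilon(\phi)$ is the rotation error for the angle index φ.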
For the CORDIC, the maximum error has an upper limit given by ≤ tan
. The error can also be expressed in terms of effective word length, WL E , using the relation
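As a numerical illustration of the error measure used in this section, the sketch below assumes that the normalized rotation error is the distance between the integer coefficient and the ideal scaled rotation, divided by the scaling factor, and that the rotator error is the maximum over its angles. The coefficients are obtained by plain rounding with a hypothetical scaling R = 64, which is not the coefficient scaling method of [2]; the sign convention of the angle and all names are assumptions as well.

```python
import cmath
import math


def rotation_error(coeff, angle, scale):
    """Normalized distance between a coefficient and scale * exp(j*angle)."""
    ideal = scale * cmath.exp(1j * angle)
    return abs(coeff - ideal) / scale


def rotator_error(coeffs, scale):
    """Maximum rotation error over all (angle, coefficient) pairs."""
    return max(rotation_error(c, a, scale) for a, c in coeffs.items())


# Hypothetical example: N = 16, angles in the first octant, R = 64.
N = 16
R = 64
coeffs = {}
for phi in range(N // 8 + 1):
    alpha = 2 * math.pi * phi / N
    ideal = R * cmath.exp(1j * alpha)
    # Plain rounding of the real and imaginary parts (assumption).
    coeffs[alpha] = complex(round(ideal.real), round(ideal.imag))

print(rotator_error(coeffs, R))
```

With plain rounding, each component is off by at most 0.5, so the error is bounded by $\sqrt{0.5}/R$; selecting R and the coefficients jointly, as in [2], is what allows lower errors for the same word length.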
III. PROPOSED BASE-3 ROTATOR
The proposed approach carries out the rotation in a series of stages, as depicted in Fig. 1. Thus, the twiddle factors are calculated as
$$W_N^{\phi} = \prod_{i=0}^{M} W_N^{\phi_i}. \tag{8}$$
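This approach is similar to the CORDIC algorithm, but uses a different set of rotations. Specifically, the rotation angle φ is broken down into a set of angles, $\phi_i$, where
$$\phi = \sum_{i=0}^{M} \phi_i. \tag{9}$$
The first angle, $\phi_0 \in \{0, N/4, N/2, 3N/4\}$, is used for trivial rotations by the angles $\{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$. The remaining $M$ angles are used for micro-rotations by
$$\phi_i = \delta_i \, 3^{M-i}, \quad i = 1, \ldots, M, \tag{10}$$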
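where $\delta_i \in \{-1, 0, 1\}$. This set of angles is based on a base-3 number system, where the $\delta_i$ are the digits of the number. Together they span all steps in the range $\left[-\frac{3^M - 1}{2}, \frac{3^M - 1}{2}\right]$. They correspond to the required twiddle factors, which makes this approach very suitable for the FFT. Since the trivial rotations cover all multiples of $N/4$ steps, the micro-rotations must cover the range $\left[-\frac{N}{8}, \frac{N}{8}\right]$, and we get the requirement $\frac{3^M - 1}{2} \geq \frac{N}{8}$. Therefore, the minimum number of micro-rotators is
$$M = \left\lceil \log_3\left(\frac{N}{4} + 1\right) \right\rceil. \tag{11}$$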
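The stage count can be checked with a few lines of integer arithmetic. This sketch assumes that the micro-rotations must reach at least N/8 steps in either direction, i.e., $(3^M - 1)/2 \geq N/8$, which agrees with the values M = 2 for N = 32 and M = 4 for N = 128 quoted in the text; the function names are hypothetical.

```python
def min_micro_rotators(n):
    """Smallest M such that (3**M - 1) / 2 >= n / 8."""
    m = 0
    # Integer form of the requirement: 4 * (3**M - 1) >= n.
    while 4 * (3 ** m - 1) < n:
        m += 1
    return m


def span(m):
    """Largest step count reachable with m base-3 signed digits."""
    return (3 ** m - 1) // 2


for n in (16, 32, 64, 128):
    m = min_micro_rotators(n)
    print(n, m, span(m), n // 8)
```

Using integer comparisons avoids the rounding issues that a floating-point base-3 logarithm can show when N/4 + 1 is an exact power of three, as it is for N = 32.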
A. Phase Decoding
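The first stage of the phase decoding ($\mathrm{Dec}_0$) determines the trivial rotations, whereas the rest of the phase decoders ($\mathrm{Dec}_i$, $i \geq 1$) decode their incoming phases $z_i$, which must be in the range
$$|z_i| \leq \frac{3^{M-i+1} - 1}{2}, \tag{12}$$
and select which $\delta_i$ to pick. This selection is done so that requirement (12) also holds for $z_{i+1}$, according to
$$\delta_i = \begin{cases} 0, & |z_i| \leq \frac{3^{M-i} - 1}{2}, \\ \operatorname{sgn}(z_i), & \text{otherwise.} \end{cases} \tag{13}$$
The remaining angle, $z_{i+1}$, is calculated by subtracting the rotation angle of the $i$-th stage from its input angle, i.e.,
$$z_{i+1} = z_i - \phi_i. \tag{14}$$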
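To implement this in hardware, the sign of $z_i$ is first obtained from its most significant bit (MSB), and then a temporary value, $z'_i = |z_i| - (3^{M-i} + 1)/2$, is calculated. If $z'_i < 0$ (again using the MSB), $\delta_i = 0$ is selected. Otherwise $\delta_i = \pm 1$, depending on the sign of $z_i$. The offset $(3^{M-i} + 1)/2$ is needed so that the boundary case $|z_i| = (3^{M-i} - 1)/2$ still selects $\delta_i = 0$ and $z_{i+1}$ stays within the range of the next stage. For this operation, one adder is required.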
The operation z i+1 = z i − φ i requires another adder. Note that both adders operate on z i , which is typically a small number, so they only need a few bits. The last stage will only need an OR gate, and no adder at all.
In general, the decoder path uses 2M − 2 adders, with only a few bits per adder. Therefore, the complexity of the adders in the phase decoder is negligible compared to the decoders in conventional CORDIC processors. For instance, for W 32 , Dec 0 needs a 2-bit adder and Dec 1 needs a 2-bit and a 3-bit adder, and Dec 2 does not need any adders.
For example, consider the case $N = 32$, giving $M = \lceil \log_3(N/4 + 1) \rceil = 2$, where $x + jy$ should be rotated by $\phi$ steps. Figure 2 depicts the possible rotations, and highlights the rotations used when $\phi = 10$. Here, the trivial rotator is depicted as the innermost arrows (one arrow per possible value of $\phi_0 \in \{0, 8, 16, 24\}$). The next "layer" of arrows depicts the choices for $\phi_1 \in \{-3, 0, 3\}$, and the outermost "layer" depicts the choices for $\phi_2 \in \{-1, 0, 1\}$. The operations used in the different stages when $\phi = 10$ are also listed in Table I, where $z_0 = \phi$, $x_{M+1} = X$ and $y_{M+1} = Y$. As can be seen, in the case when $N = 32$ the proposed method is well utilized, since most "branches" in Fig. 2 have all three sub-branches (this is because $3^M \approx N/4$). In other cases, the utilization is not so good. For instance, if $N = 128$, then $M = 4$ and $3^M = 81$ is much larger than $N/4 = 32$, causing a significant overlap between the different quadrants.
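The phase decoding described in this subsection can be sketched in software as follows. The sketch assumes that Dec 0 picks the multiple of N/4 nearest to φ (ties resolved upwards) and that each later decoder selects $\delta_i = 0$ exactly when $|z_i| \leq (3^{M-i} - 1)/2$, which is the condition that keeps the residual within the range of the following stage. The function name and the tie-breaking rule are assumptions, and the code models behaviour only, not the adder-level hardware.

```python
def decode_phase(phi, n, m):
    """Split phi (in steps of 2*pi/n) into phi_0 and m base-3 digits.

    Returns (phi0, deltas) with phi0 in {0, n/4, n/2, 3n/4} and
    deltas[i-1] in {-1, 0, 1} weighting 3**(m - i), for i = 1..m.
    """
    quarter = n // 4
    phi %= n
    # Dec 0: nearest multiple of n/4 (assumed tie-breaking: upwards).
    k = (phi + quarter // 2) // quarter
    phi0 = (k % 4) * quarter
    z = phi - k * quarter  # residual, roughly within [-n/8, n/8]
    deltas = []
    for i in range(1, m + 1):
        weight = 3 ** (m - i)
        if abs(z) <= (weight - 1) // 2:
            delta = 0
        else:
            delta = 1 if z > 0 else -1
        deltas.append(delta)
        z -= delta * weight  # z_{i+1} = z_i - phi_i
    if z != 0:
        raise ValueError("m is too small for this n")
    return phi0, deltas


# The example from the text: N = 32, M = 2, phi = 10.
print(decode_phase(10, 32, 2))
```

For φ = 10 this yields $\phi_0 = 8$, $\delta_1 = 1$ and $\delta_2 = -1$, i.e., 8 + 3 − 1 = 10, matching the decomposition highlighted for Fig. 2.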
B. Micro Rotators
The micro-rotators, $\mathrm{Rot}_i$, $i \geq 1$, multiply $x_i + jy_i$ by a complex value, $P_i$, using shift-and-add methods according to
$$x_{i+1} + jy_{i+1} = P_i (x_i + jy_i). \tag{15}$$
The complex value P i depends on the given rotation angle, and includes a scaling, R i ≈ |P i |. It is defined as
where C i , S i and K i are positive integer constants. Therefore, the micro-rotations are calculated as
The values $C_i$, $S_i$ and $K_i$ are obtained by the coefficient scaling method [2], optimized for the sets of angles $\{0, \frac{2\pi}{N} 3^{M-i}\}$ radians, and modified to also calculate the number of adders for the kernels (where each kernel is a suggested set $\{K_i, C_i + jS_i\}$). The calculated number of adders for a kernel is the larger of the numbers needed to implement either of the two angles of the kernel; the other angle then reuses the same adders by means of multiplexers.
The cases when $\delta_i = \pm 1$ are implemented using multiple constant multiplication (MCM) [12], where the same MCM structure is applied to both $x_i$ and $y_i$. In addition to the adders needed by the MCM, one extra adder per output is needed for the additions in (17). This gives $2 + 2 \cdot \mathrm{MCM}(C_i, S_i)$ adders, where $\mathrm{MCM}(C_i, S_i)$ is the number of adders needed for a multiplication by $C_i$ and $S_i$ simultaneously. In the case when $\delta_i = 0$, only a single constant multiplication (SCM) [13] is used, applied to $x_i$ and $y_i$, giving $2 \cdot \mathrm{SCM}(K_i)$ adders, where $\mathrm{SCM}(K_i)$ is the number of adders needed for the multiplication by $K_i$.
For each micro-rotator, the modified coefficient scaling method outputs a list of the possible kernels, together with their errors and required numbers of adders.
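To make the role of a kernel $\{K_i, C_i + jS_i\}$ concrete, the following sketch applies one micro-rotation to integer data using shifts and additions only. The kernel K = 5, C + jS = 4 + 3j is a hypothetical choice made because $|4 + 3j| = 5$ exactly, so all three choices of $\delta_i$ give the same gain; it is not one of the kernels from the paper, and its angle is not an FFT angle. The sign convention (multiplication by $C + j\delta S$) is also an assumption.

```python
def times3(v):
    """3*v as one shift and one addition."""
    return (v << 1) + v


def times4(v):
    """4*v as a single shift (no adder)."""
    return v << 2


def times5(v):
    """5*v as one shift and one addition."""
    return (v << 2) + v


def micro_rotate(x, y, delta):
    """One micro-rotation with the hypothetical kernel {5, 4 + 3j}.

    delta = 0 only scales by K = 5; delta = +1 or -1 multiplies
    x + jy by 4 + j*delta*3 (assumed sign convention).
    """
    if delta == 0:
        return times5(x), times5(y)
    return times4(x) - delta * times3(y), times4(y) + delta * times3(x)


for delta in (-1, 0, 1):
    print(delta, micro_rotate(7, -2, delta))
```

Because the magnitude of the output is scaled by 5 in all three cases, a cascade of such stages has the same total gain for every rotation angle, which is the uniform scaling property that the coefficient selection aims for.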
C. Adders vs Errors
Once the list of precision versus number of adders is available for each micro-rotator in the design, all combinations are tested, the resulting precision is evaluated for each combination, and the best result for each total number of adders is kept. This brute force method may seem computationally demanding, but if the optimizer stores four different kernels for each micro-rotator (using 2, 4, 6 and 8 adders respectively), and there are $M = 4$ micro-rotators, then there are $4^M = 256$ combinations to test, which is done quickly by a computer.
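The exhaustive search can be sketched as follows. The per-kernel data and the way the errors of the stages are combined are placeholders: each kernel is represented only by a hypothetical error figure, and the combined error is taken as the sum of the stage errors, whereas the actual procedure evaluates the precision of the complete rotator. All names are assumptions.

```python
from itertools import product


def best_per_adder_count(stage_options, combined_error):
    """Keep the most accurate combination for each total adder count.

    stage_options: one list per micro-rotator of (adders, kernel) pairs.
    combined_error: function mapping a list of kernels to an error.
    Returns (best, tested), where best maps the total number of adders
    to (error, combination).
    """
    best = {}
    tested = 0
    for combo in product(*stage_options):
        tested += 1
        total = sum(adders for adders, _ in combo)
        err = combined_error([kernel for _, kernel in combo])
        if total not in best or err < best[total][0]:
            best[total] = (err, combo)
    return best, tested


# Hypothetical data: four kernels per stage (2, 4, 6 and 8 adders),
# each described only by a made-up error value.
options = [(2, 2.0 ** -4), (4, 2.0 ** -6), (6, 2.0 ** -8), (8, 2.0 ** -10)]
stages = [options] * 4  # M = 4 micro-rotators

best, tested = best_per_adder_count(stages, sum)
print(tested)
for total in sorted(best):
    print(total, best[total][0])
```

With four kernels for each of four stages, the loop visits 256 combinations, in line with the count given above.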
As an example, consider the generation of the $N = 32$ case. The optimization procedure gives the results shown in Table II for each of the $M = 2$ micro-rotators. The procedure did not find any 2-adder alternative for implementing $\mathrm{Rot}_1$ ($W_{32}^{\{0, \pm 3\}}$). The entire rotator is obtained by combining the alternatives for $\mathrm{Rot}_1$ and $\mathrm{Rot}_2$. This leads to the $3 \cdot 4 = 12$ cases for the entire rotator listed in Table III, where the best combination for each total number of adders is marked. This leads to a trade-off between accuracy and number of adders.
IV. COMPARISON
The $WL_E$ of all the rotators has been calculated according to (5)-(7), for the best sets of δ. Since the adders in the phase decoder in the proposed method only need a few bits, and the CORDIC phase decoder can be implemented in different ways, only the adders in the data path are considered. Figure 3 compares some of the proposed designs with CORDIC, showing that the $N = 16$ and $N = 32$ cases are significantly better than the CORDIC. This is due to the use of ad-hoc rotations. The $W_{64}$ case is better than the CORDIC for high precision ($WL_E > 10$). For $N = 128$ the results are similar to the CORDIC, with the advantage of a smaller number of stages and a much simpler phase decoder. For larger rotators the proposed method is not so suitable, due to the larger number of angles that must be considered.
It can be observed that the $W_{32}$ rotators are more efficient than the $W_{16}$ ones. Although both $W_{32}$ and $W_{16}$ need two micro-rotator stages, and so should have roughly the same efficiency, the angles used in the $W_{16}$ case, π/8 and 3π/8 radians, are much harder to implement than the angles used in the $W_{32}$ case, π/16 and 3π/16 radians.
Note that the $W_{16}$ rotator can be implemented using the coefficients from the $W_{32}$ rotator, if non-integer $\phi_1$ and $\phi_2$ can be accepted ({0, ±0.5} and {0, ±1.5} respectively). Ongoing work focuses on relaxing the requirements on $\phi_i$, to include, e.g., fractional values.

Figure 4 compares different rotators for $W_{32}$. The proposed architecture is shown to be more efficient than the Booth-like design [11], the CORDIC and the trigonometric identity solution [10]. Only the errors in the $W_{32}$ angles are considered in this case.
V. CONCLUSION
In this paper, we have presented a method to perform complex rotations, based on cascaded micro-rotators. The main idea is inspired by a base-3 number system with the digits −1, 0 and 1, which allows us to perform rotations in both directions. The focus is on using few and accurate stages, each of which rotates by an integer number of steps of size 2π/N radians.
