Abstract-In this paper, we propose algorithms equivalent to radix-$2^2$ and evaluate them based on twiddle factor switching activity for a single-path delay feedback pipelined FFT architecture. These equivalent pipeline FFT algorithms have the same number of complex multipliers, with the same resolution, as the radix-$2^2$ algorithm. It is shown that the twiddle factor switching activity is reduced by up to 40% for some of the equivalent algorithms derived for $N = 256$.
I. INTRODUCTION
Computation of the discrete Fourier transform (DFT) and inverse DFT is used in, e.g., orthogonal frequency-division multiplexing (OFDM) communication systems and spectrometers. An $N$-point DFT can be expressed as
$$X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \qquad (1)$$
where $W_N = e^{-j2\pi/N}$ is the twiddle factor, the $N$th primitive root of unity with its exponent evaluated modulo $N$, $n$ is the time index, and $k$ is the frequency index. Various methods for efficiently computing (1) have been the subject of a large body of published literature. They are commonly referred to as fast Fourier transform (FFT) algorithms. In addition, many different architectures for efficiently mapping the FFT algorithm to hardware have been proposed [1].
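As a small illustration of (1), the following Python sketch evaluates the DFT directly, reducing each twiddle factor exponent modulo $N$ as described above. The function names `twiddle` and `dft` are hypothetical, and the sign convention $W_N = e^{-j2\pi/N}$ is the usual forward-transform convention, assumed here.

```python
import cmath

def twiddle(N, e):
    """Return W_N^e = exp(-j*2*pi*e/N), with the exponent reduced modulo N."""
    return cmath.exp(-2j * cmath.pi * (e % N) / N)

def dft(x):
    """Direct O(N^2) evaluation of X(k) = sum_n x(n) * W_N^(n*k)."""
    N = len(x)
    return [sum(x[n] * twiddle(N, n * k) for n in range(N)) for k in range(N)]

# A unit impulse at n = 0 has a flat spectrum: every X(k) equals 1.
impulse = [1, 0, 0, 0, 0, 0, 0, 0]
spectrum = dft(impulse)
```

Because the exponent is taken modulo $N$, only $N$ distinct twiddle factor values ever occur, which is what makes a precomputed coefficient table feasible.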
A commonly used architecture for transforms of length $N = b^r$ is the pipelined FFT [2]. The pipeline architecture is characterized by continuous processing of input data. In addition, the pipeline architecture is highly regular, making it straightforward to automatically generate FFTs of various lengths. In a pipeline architecture, the complex multiplier is one of the most power-consuming units. Figure 1 outlines the architecture of a radix-$2^i$ single-path delay feedback (SDF) decimation in frequency (DIF) pipeline FFT of length $N$. This architecture is generic, while the required range of each complex twiddle factor multiplier is outlined in Table I for different values of $i$.
For the twiddle factor multipliers with small ranges, special methods have been proposed. In particular, for a $W_4$ multiplier the possible coefficients are $\{\pm 1, \pm j\}$; hence, the multiplication reduces to optionally interchanging the real and imaginary parts and possibly negating them (or replacing the addition with a subtraction in the subsequent stage). In [5], [8], twiddle factor multipliers for $W_8$, $W_{16}$, and $W_{32}$ based on constant multiplication were proposed. However, a common way to implement the twiddle factor multiplication is to use a general complex multiplier, with the twiddle factors precomputed and stored in a memory.
In integrated circuits, low-power design is always desirable. In digital CMOS circuits, dynamic power is the dominating part of the total power consumption, and it can be approximated by [9]
$$P_{\mathrm{dyn}} = \alpha C_L V_{DD}^2 f_C, \qquad (2)$$
where $V_{DD}$ is the supply voltage, $f_C$ is the clock frequency, $C_L$ is the load capacitance, and $\alpha$ is the switching activity. In this work we focus on the switching activity and on how to reduce it between two successive coefficients fed to the complex multiplier.
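The remark on the $W_4$ multiplier can be made concrete with a short sketch: since $W_4^k \in \{\pm 1, \pm j\}$, the product is obtained by swapping and negating the real and imaginary parts, with no actual multiplication. The function name `mul_w4` is hypothetical, and $W_4 = -j$ assumes the convention $W_N = e^{-j2\pi/N}$.

```python
def mul_w4(re, im, k):
    """Multiply (re + j*im) by W_4^k = (-j)^k using only swaps and negations."""
    k %= 4
    if k == 0:
        return re, im      # times  1: unchanged
    if k == 1:
        return im, -re     # times -j: swap, negate new imaginary part
    if k == 2:
        return -re, -im    # times -1: negate both parts
    return -im, re         # times +j: swap, negate new real part

# Compare with an ordinary complex multiplication for all four coefficients.
results = [mul_w4(3, 5, k) for k in range(4)]
```

In hardware, the negation can often be absorbed by turning a following addition into a subtraction, which is the alternative mentioned in the text.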
In [11]-[14], methods for reducing the size of the coefficient memory have been proposed. In [10], [15], methods for reducing the switching activity between successive twiddle factor coefficients have been proposed. However, these methods come with a hardware overhead. In this work we focus on algorithms derived from radix-$2^i$ that have the same memory complexity as the standard radix-$2^2$ algorithm, i.e., the same resolution of the twiddle factors. However, as will be seen, the twiddle factor memory switching activity differs between the algorithms.
The rest of the paper is organized as follows. In the next section, the radix-$2^2$ algorithm and equivalent algorithms derived from radix-$2^i$ are presented for $N = 256$. In Section III, results are presented and, finally, some conclusions are given in Section IV.
II. RADIX-$2^2$ FFT AND ITS EQUIVALENT ALGORITHMS
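The Cooley-Tukey FFT algorithm can be expressed, for $N = PQ$, as
$$X(k_1 + P k_2) = \sum_{n_2=0}^{Q-1} \left[ \left( \sum_{n_1=0}^{P-1} x(Q n_1 + n_2) W_P^{n_1 k_1} \right) W_N^{n_2 k_1} \right] W_Q^{n_2 k_2}. \qquad (3)$$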
In this algorithm, $N$, $P$, and $Q$ are powers of 2, i.e., $N = 2^{p+q}$, $P = 2^p$, and $Q = 2^q$, where $p$ and $q$ are positive integers. Here, the $N$-point DFT is decomposed into $Q$ $P$-point DFTs and $P$ $Q$-point DFTs. Between these DFTs we have twiddle factor multiplications. Typically, the $P$- and $Q$-point DFTs are again divided into smaller DFTs. An efficient representation of algorithms of this type is the binary tree representation [7]. An example of a binary tree, corresponding to (3), is shown in Fig. 2. The left branch corresponds to the $P = 2^p$-point DFT and the right branch to the $Q = 2^q$-point DFT. The resolution of the interconnecting twiddle factor is $N = 2^{p+q}$, i.e., a $W_N$ multiplier is required. A radix-$2^2$ decimation in frequency algorithm is shown in Fig. 3. In the remainder of this section we present the radix-$2^2$ algorithm and other algorithms that have the same intermediate node values as the radix-$2^2$ algorithm, but different binary trees. The naming of the resulting algorithms is shown in Table II.
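The decomposition described above can be checked numerically with one Cooley-Tukey step. The sketch below assumes the index maps $n = Q n_1 + n_2$ and $k = k_1 + P k_2$, which is one common choice and not necessarily the mapping used in the paper's tables. The names `w`, `dft`, and `cooley_tukey_step` are hypothetical.

```python
import cmath

def w(N, e):
    """Twiddle factor W_N^e with the exponent reduced modulo N."""
    return cmath.exp(-2j * cmath.pi * (e % N) / N)

def dft(x):
    """Direct DFT, used both as the reference and for the small sub-DFTs."""
    N = len(x)
    return [sum(x[n] * w(N, n * k) for n in range(N)) for k in range(N)]

def cooley_tukey_step(x, P, Q):
    """One decomposition of an N = P*Q point DFT (assumed index maps)."""
    N = P * Q
    assert len(x) == N
    X = [0j] * N
    # Stage 1: Q DFTs of length P over the samples x(Q*n1 + n2), one per n2.
    inner = [dft([x[Q * n1 + n2] for n1 in range(P)]) for n2 in range(Q)]
    for k1 in range(P):
        # Twiddle factor multiplication by W_N^(n2*k1) between the stages.
        column = [inner[n2][k1] * w(N, n2 * k1) for n2 in range(Q)]
        # Stage 2: one Q-point DFT per k1, i.e., P DFTs of length Q in total.
        outer = dft(column)
        for k2 in range(Q):
            X[k1 + P * k2] = outer[k2]
    return X

x = [complex(n % 5, -(n % 3)) for n in range(16)]
max_error = max(abs(a - b) for a, b in zip(dft(x), cooley_tukey_step(x, 4, 4)))
```

Applying the same step recursively to the $P$- and $Q$-point DFTs yields the binary trees discussed in this section. Different choices of $P$ and $Q$ at each level give different trees, and therefore different twiddle factor sequences.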
A. Case I
The radix-$2^2$ algorithm has a structure identical to that of radix-2 and is computationally identical to radix-4. In a pipeline FFT architecture, it has the structural advantage over the other algorithms that the non-trivial multiplications occur only after every other stage. Figure 3 shows the binary tree diagram, where each node corresponds to a twiddle factor multiplication. The twiddle factors are indexed by $n$ and $k$; the linear index map equations and the sequences of $n$ and $k$ are required to determine the index. The twiddle factors with their indices are tabulated in Table III.
B. Case II
In this algorithm, the 256-point DFT is decomposed based on radix-$2^2$ [3] for the first stages, and then the modified radix-$2^4$ [5] is applied to the remaining stages. The radix-($2^2$ & M.$2^4$) algorithm is characterized by having the same complex twiddle factor multiplier as radix-$2^2$ for the $W_N$ multiplier. For the 256-point FFT, the corresponding twiddle factors and the indices with $n$ and $k$ sequences are shown in Tables III and IV, respectively. The binary tree representation of the algorithm is shown in Fig. 4(a).
C. Case III
This algorithm is considered a balanced binary tree decomposition [7], shown in Fig. 4(b). It can also be seen as applying a modified radix-$2^4$ algorithm for the first stages, followed by a radix-$2^2$ algorithm for the remaining stages. It is worth noting that this algorithm is not exactly equivalent to radix-$2^2$, as the $W_{64}$ multiplier is replaced by a $W_{16}$ multiplier. This should in general lead to lower memory requirements. The twiddle factors with the required index sequences are tabulated in Tables III and IV.
[Table fragment: twiddle factor $W_{64}^{n_5(k_1+2k_2+4k_3+8k_4)}$; index maps $n = 128n_1 + 64n_2 + n_3$ and $n = 32n_1 + 16n_2 + 8n_3 + 4n_4 + n_5$.]
D. Cases IV and V
In these algorithms, the 256-point DFT is decomposed to obtain the modified radix-$2^6$ algorithm [6]. There are two decompositions that have the same complexity as the radix-$2^2$ algorithm. Figures 4(c) and 4(d) show the binary tree representations, which differ only in the position of the $W_{64}$ and $W_{16}$ twiddle factors. The twiddle factors of all stages for both cases are shown in Table IV. The sequences for $k$ and $n$ are needed to determine the index; these sequences and their ranges, together with the linear index map equations, are tabulated in Table III. It should be noted that Case V corresponds to a radix-$2^2$ decimation in time (DIT) algorithm.
III. RESULTS
The switching activity between successive coefficients fed to the complex multiplier is defined in terms of the Hamming distance of each coefficient transition. The Hamming distance is defined as the number of 1's in the bitwise XOR of two binary coefficients. Twiddle factors can be precomputed and stored in look-up tables instead of being calculated in real time. In a pipelined SDF architecture, these stored coefficients are fed to the complex multiplier in each cycle. The order in which the stored coefficients are read therefore affects the switching activity.
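The Hamming distance measure defined above can be sketched in a few lines of Python. The helper names `hamming_distance` and `switching_activity` are hypothetical, and the coefficients are assumed to be given as non-negative integers holding their bit patterns.

```python
def hamming_distance(a, b):
    """Number of 1's in the bitwise XOR of two bit patterns."""
    return bin(a ^ b).count("1")

def switching_activity(words):
    """Sum of Hamming distances over all successive coefficient pairs."""
    return sum(hamming_distance(p, q) for p, q in zip(words, words[1:]))

# Transitions: 0000 -> 1111 (4 bits), 1111 -> 1110 (1 bit), 1110 -> 0000 (3 bits).
total = switching_activity([0b0000, 0b1111, 0b1110, 0b0000])  # -> 8
```

The same set of words read in a different order generally gives a different total, which is why the choice of algorithm matters even when the stored coefficients are the same.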
The coefficient sequences of the complex multipliers in both the radix-$2^2$ algorithm and the equivalent algorithms are encoded with different word lengths in two's complement representation. The reading sequence is then simulated to obtain the resulting switching activity. The results for the different algorithms are shown in Table V, where it can be seen that the algorithms have significantly different twiddle factor switching activity. The algorithm with the lowest total switching activity is Case V.
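The simulation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulation setup: the scaling by $2^{wl-1} - 1$, the concatenation of real and imaginary parts into one word, and the two exponent orderings are all hypothetical choices, and the orderings do not reproduce the index sequences of Tables III and IV.

```python
import math

def to_twos_complement(value, wl):
    """Quantize value in [-1, 1] to a wl-bit two's complement bit pattern."""
    scaled = round(value * (2 ** (wl - 1) - 1))  # assumed scaling, avoids overflow at +1
    return scaled & ((1 << wl) - 1)

def twiddle_word(N, e, wl):
    """Bit pattern of W_N^e: real part in the upper wl bits, imaginary in the lower."""
    angle = -2.0 * math.pi * (e % N) / N
    re = to_twos_complement(math.cos(angle), wl)
    im = to_twos_complement(math.sin(angle), wl)
    return (re << wl) | im

def total_switching(exponents, N, wl):
    """Total Hamming distance along a reading sequence of twiddle exponents."""
    words = [twiddle_word(N, e, wl) for e in exponents]
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

N = 256
# Two hypothetical reading orders of the exponents n3 * k, with n3 = 0..63.
order_a = [(n3 * k) % N for k in (0, 1, 2, 3) for n3 in range(N // 4)]
order_b = [(n3 * k) % N for k in (0, 2, 1, 3) for n3 in range(N // 4)]
activity = {wl: (total_switching(order_a, N, wl), total_switching(order_b, N, wl))
            for wl in (8, 12, 16)}
```

Sweeping `wl` in this way corresponds to repeating the evaluation for several word lengths.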
Results using different word lengths are shown in Fig. 5. These results confirm that, for $N = 256$, Case V provides the lowest twiddle factor switching activity.
In [5], the $W_{16}$ multiplier is implemented using a dedicated constant multiplier. Hence, for this case it is of interest to know how often the multiplier coefficient changes. The results in Table VI show that the number of coefficient changes can be significantly reduced by using an equivalent algorithm.
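For a constant multiplier, the relevant figure is the number of times the selected coefficient changes rather than the number of toggled bits. A minimal sketch of that count is given below. The function name `coefficient_changes` and both example sequences are hypothetical and are not taken from Table VI.

```python
def coefficient_changes(exponents, N=16):
    """Count transitions where the W_N exponent differs from the previous one."""
    reduced = [e % N for e in exponents]
    return sum(1 for a, b in zip(reduced, reduced[1:]) if a != b)

# The same 16 coefficients read in two different orders.
fast_varying = [0, 1, 2, 3] * 4                     # changes at every step
slow_varying = [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4  # changes only between runs

changes_fast = coefficient_changes(fast_varying)  # -> 15
changes_slow = coefficient_changes(slow_varying)  # -> 3
```

Both sequences use each coefficient equally often, so the difference comes purely from the ordering.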
IV. CONCLUSIONS
In this work, we discuss different algorithms equivalent to radix-$2^2$ that have the same implementation complexity but possibly lower switching activity between successive twiddle factor coefficients. It is shown that the twiddle factor switching activity can be reduced by more than 40% for some of the equivalent algorithms. Although the corresponding effect on the power consumption remains to be evaluated, one would expect a significant reduction.
