Abstract-In this paper, we propose higher-point fast Fourier transform (FFT) algorithms for a single delay feedback pipelined FFT architecture, considering the 4096-point FFT. The algorithms differ from each other in their twiddle factor multiplications. A comparison of the twiddle factor multiplication complexity, when implemented on field-programmable gate arrays (FPGAs), is presented for all proposed algorithms. We also discuss the design criteria for the twiddle factor multiplication. Finally, it is shown that there is a trade-off between twiddle factor memory complexity and switching activity in the introduced algorithms.
I. INTRODUCTION
Computation of the discrete Fourier transform (DFT) and inverse DFT is used in, e.g., orthogonal frequency-division multiplexing (OFDM) communication systems, Digital Video Broadcasting (DVB), and spectrometers. Some of these systems require a large-point FFT, usually with more than 1K points.
An $N$-point DFT can be expressed as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{1}$$

where $W_N = e^{-j 2\pi / N}$ is the twiddle factor, the $N$:th primitive root of unity with its exponent being evaluated modulo $N$, $n$ is the time index, and $k$ is the frequency index. Various methods for efficiently computing (1) have been the subject of a large body of published literature. They are commonly referred to as fast Fourier transform (FFT) algorithms. Also, many different architectures to efficiently map the FFT algorithm to hardware have been proposed [1].
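As a reference point for the fast algorithms discussed later, the DFT sum can be evaluated directly. The following is a minimal sketch, assuming the common sign convention $W_N = e^{-j 2\pi / N}$; the function names `twiddle` and `dft` are illustrative and not taken from the paper.

```python
import cmath

def twiddle(N, e):
    """Return W_N^e = exp(-j*2*pi*e/N), with the exponent reduced modulo N."""
    return cmath.exp(-2j * cmath.pi * (e % N) / N)

def dft(x):
    """Direct O(N^2) evaluation of X(k) = sum_n x(n) * W_N^(n*k)."""
    N = len(x)
    return [sum(x[n] * twiddle(N, n * k) for n in range(N)) for k in range(N)]
```

A unit impulse, `dft([1, 0, 0, 0])`, yields a flat spectrum of ones, which is a convenient sanity check for any FFT implementation compared against this reference.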
A commonly used architecture for transforms of length $N = b^r$ is the pipelined FFT [2]. The pipeline architecture is characterized by continuous processing of input data. In addition, it is highly regular, making it straightforward to automatically generate FFTs of various lengths. Especially for large-point FFTs, it reduces the computational complexity as well as the hardware complexity. Figure 1 outlines a radix-$2^i$ single-path delay feedback (SDF) decimation-in-frequency (DIF) pipeline FFT architecture of length $N = 32$. This architecture is generic, while the required range of each complex twiddle factor multiplier is outlined in Table I for varying values of $i$. For twiddle factor multipliers with small ranges, special methods have been proposed. In particular, one can note that for a $W_4$ multiplier the possible coefficients are $\{\pm 1, \pm j\}$, so the multiplication reduces to swapping and negating the real and imaginary parts.

In digital CMOS circuits, dynamic power is the dominating part of the total power consumption, which can be approximated by [9]

$$P_{\mathrm{dyn}} = \frac{1}{2} V_{DD}^2 f_C C_L \alpha, \tag{2}$$

where $V_{DD}$ is the supply voltage, $f_C$ is the clock frequency, $C_L$ is the load capacitance and $\alpha$ is the switching activity. Low-complexity and low-power architecture designs are always desirable. Low power can be achieved by reducing either the switching activity or the resource utilization. In [10]-[13], methods for reducing the size of the coefficient memory have been proposed. In [7], the authors proposed a balanced binary tree decomposition and claimed an optimal twiddle factor memory requirement.
In this work, we propose algorithms to implement the 4096-point FFT. The butterfly structure of the proposed architectures is the same, but the twiddle factor multiplications differ. We also discuss the design criteria for the proposed algorithms on the basis of the implementation of the twiddle factor multiplication.
The rest of the paper is organized as follows. The next section describes the binary tree representation of the Cooley-Tukey algorithm. In Section III, we discuss the design criteria of the algorithms. In Section IV, we introduce the proposed architectures derived from radix-$2^i$, and in Section V some results are presented. Finally, some conclusions are presented.
II. BINARY TREE REPRESENTATION OF COOLEY-TUKEY ALGORITHM
The Cooley-Tukey FFT algorithm can be expressed as

$$X(k_1 + P k_2) = \sum_{n_2=0}^{Q-1} \left[ \left( \sum_{n_1=0}^{P-1} x(Q n_1 + n_2)\, W_P^{n_1 k_1} \right) W_N^{n_2 k_1} \right] W_Q^{n_2 k_2}, \tag{3}$$

where $N$, $P$ and $Q$ are powers of 2, i.e., $N = 2^{p+q}$, $P = 2^p$ and $Q = 2^q$, with $p$ and $q$ positive integers, $0 \le n_1, k_1 \le P-1$ and $0 \le n_2, k_2 \le Q-1$. Here, the $N$-point DFT is decomposed into $Q$ $P$-point DFTs and $P$ $Q$-point DFTs. These are named inner DFTs and outer DFTs, respectively. Between these DFTs we have the twiddle factor multiplications. Typically, the $P$- and $Q$-point DFTs are again divided into smaller DFTs. An efficient representation of algorithms of this type is the binary tree representation [7]. An example of a binary tree, corresponding to (3), is shown in Fig. 2. The left branch corresponds to the $P = 2^p$-point DFT and the right branch to the $Q = 2^q$-point DFT. The resolution of the interconnecting twiddle factor is $N = 2^{p+q}$, i.e., a $W_N$ multiplier is required.

[Fig. 2: binary tree with root node $p+q$, left child $p$ and right child $q$.]

An FFT algorithm is categorized by the way the Cooley-Tukey recursive decomposition is applied. The decompositions finally reach butterfly operations, which greatly influence the FFT architecture. A small radix is desirable because it has a simple butterfly operation, whereas a higher radix has fewer twiddle factor multiplications. The radix-$2^i$ algorithm has simple radix-2 butterfly operations, and its twiddle factor multiplications depend on the value of $i$. The generalized radix-2 signal flow graph for $N = 32$ is shown in Fig. 3. The multiplication after each butterfly operation is labeled with its row and column. The radix-$2^i$ algorithm can be obtained by applying the balanced decomposition to a small-point FFT.

[Fig. 3: signal flow graph of the generalized radix-2 FFT for $N = 32$, with samples $x(0), \ldots, x(31)$ and twiddle factor multiplications carrying two indices, e.g., $W_{3,25}$, $W_{0,25}$, $W_{0,27}$.]
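The decomposition of an $N$-point DFT into $Q$ inner $P$-point DFTs, a twiddle factor multiplication, and $P$ outer $Q$-point DFTs can be checked numerically. The sketch below is one possible instance, assuming the index mapping $n = Q n_1 + n_2$, $k = k_1 + P k_2$ and $W_N = e^{-j 2\pi / N}$; the names `W`, `dft` and `cooley_tukey_step` are hypothetical.

```python
import cmath

def W(N, e):
    """Twiddle factor W_N^e with the exponent taken modulo N."""
    return cmath.exp(-2j * cmath.pi * (e % N) / N)

def dft(x):
    """Direct DFT, used both as a building block and as a reference."""
    N = len(x)
    return [sum(x[n] * W(N, n * k) for n in range(N)) for k in range(N)]

def cooley_tukey_step(x, P, Q):
    """One Cooley-Tukey step for N = P*Q (assumed mapping n = Q*n1 + n2, k = k1 + P*k2)."""
    N = P * Q
    assert len(x) == N
    # Inner DFTs: Q DFTs of length P, one for each n2.
    inner = [dft([x[Q * n1 + n2] for n1 in range(P)]) for n2 in range(Q)]
    # Twiddle factor multiplication with resolution N between the two DFT layers.
    tw = [[inner[n2][k1] * W(N, n2 * k1) for k1 in range(P)] for n2 in range(Q)]
    # Outer DFTs: P DFTs of length Q, one for each k1.
    X = [0j] * N
    for k1 in range(P):
        outer = dft([tw[n2][k1] for n2 in range(Q)])
        for k2 in range(Q):
            X[k1 + P * k2] = outer[k2]
    return X
```

For example, `cooley_tukey_step(x, 4, 8)` on a 32-point input should agree with `dft(x)` up to rounding. Applying the same step recursively to the inner and outer DFTs is what the binary tree describes.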
III. CRITERIA FOR ALGORITHM SELECTION
Algorithm selection is the most important step in designing a low-power FFT. Twiddle factor multiplication is one of the major power contributors in the single delay feedback pipelined FFT architecture. It requires both memory and a complex multiplier, which consume power and area.
A. Complexity of the $W_N$ Multiplier
The simplest approach is to use a large look-up table (LUT) to store the twiddle factors. For a $W_N$ multiplier, $N$ words need to be stored. The twiddle factor multiplication is then implemented with one complex multiplier and LUTs storing the precomputed coefficients. It should be noted that this scheme possibly stores the same twiddle factor in several positions, as the mapping is from row to twiddle factor and, for radix-$2^i$ algorithms with $i \ge 2$, some twiddle factors appear more than once. The complexity of the LUTs depends on the size of the FFT and the resolution of the twiddle factor. It is also possible to use the well-known octave symmetry to store only the twiddle factors for $0 \le \alpha \le \pi/4$, at the additional cost of an address mapping circuit [13].
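The octave symmetry can be illustrated in software. The following is a minimal sketch, not the address mapping circuit of [13]: it stores cosine and sine only for angles in $[0, \pi/4]$ and recovers any $W_N^k$ by swaps and sign changes. It assumes $N$ is a multiple of 8 and $W_N = e^{-j 2\pi / N}$; the function names are hypothetical.

```python
import math

def build_octant_table(N):
    """Store (cos, sin) only for angles 0..pi/4, i.e. N/8 + 1 entries."""
    return [(math.cos(2 * math.pi * m / N), math.sin(2 * math.pi * m / N))
            for m in range(N // 8 + 1)]

def twiddle_from_table(table, N, k):
    """Reconstruct W_N^k = cos(a) - j*sin(a), a = 2*pi*k/N, from the octant table."""
    k %= N
    q, r = divmod(k, N // 4)          # q: quadrant index, r: offset inside the quadrant
    if r <= N // 8:
        c, s = table[r]               # first octant: read directly
    else:
        s, c = table[N // 4 - r]      # second octant: cos(a) = sin(pi/2 - a), and vice versa
    re, im = c, -s                    # W_N^r
    for _ in range(q):                # each quadrant step multiplies by W_N^(N/4) = -j
        re, im = im, -re              # (re + j*im) * (-j) = im - j*re
    return complex(re, im)
```

With `N = 4096` the table holds 513 entries instead of 4096, which shows the memory saving; the `divmod`, the comparison and the swaps stand in for the address mapping logic.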
For lower resolutions, $N \le 16$, the complex multiplier can be implemented with a dedicated constant multiplier [5], [8]. The constant multiplier can be realized using a minimum number of adders using the method in [14].
2) $W_{16}$ Multiplier: A $W_{16}$ multiplier is a low-resolution multiplier. This twiddle factor multiplication can be implemented with dedicated constant multipliers for the sine and cosine coefficients; the nontrivial values for $W_{16}$ are $\sin(\pi/8)$, $\cos(\pi/8)$ and $\sin(\pi/4)$. In [15], the authors proposed multipliers with low adder complexity and minimum error, based on an addition-aware quantization method. In the proposed architectures, we implement a dedicated constant multiplier for the $W_{16}$ twiddle factor multiplication.
B. Switching Activity
The switching activity between two successive coefficients fed to the complex multiplier affects the power consumption. A coefficient reordering technique was proposed in [16] to design low-power architectures. Algorithm-level changes also affect the switching activity, depending on how the FFT decomposition is recursively applied to form small-point FFTs. In [17], an equivalent radix-$2^2$ algorithm with low switching activity was proposed. Here, we discuss the switching activity of the $W_{64}$ multiplication. The different decompositions of the 64-point FFT block are shown in Fig. 4, and the switching activity is tabulated in Table II. The position of the twiddle factor affects the switching activity. Cases II and IV have the same twiddle factor complexity, but case II has lower switching activity. The switching activity also depends on whether a particular twiddle factor is located on the left or the right branch of the tree. It is shown that there is a trade-off between complex multiplier complexity and switching activity, both of which affect power consumption.

The decompositions are formulated with (3). Here we formulate the first decomposition of Fig. 5(a), expressed as
where $W_{4096}$ is the twiddle factor multiplication that connects the two decomposed DFTs. Similarly, we can apply the decomposition equation at each node of the binary tree representation of the FFT. A generalized index mapping for all stages of any radix-$2^i$ algorithm is presented in [18]. The twiddle factors of each algorithm, with their resolutions, are tabulated in Table III.
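Applying the decomposition at each node of the binary tree determines the twiddle factor resolution between consecutive butterfly stages. The sketch below illustrates this under two assumptions that are not taken from the paper: a tree is encoded as nested tuples with the integer `1` as a leaf (one radix-2 butterfly stage), and the stages appear in the pipeline in left-to-right (in-order) order. The function names are hypothetical.

```python
def log2_size(tree):
    """Number of radix-2 stages (log2 of the DFT length) below this node."""
    if tree == 1:
        return 1
    left, right = tree
    return log2_size(left) + log2_size(right)

def stage_twiddles(tree):
    """List the twiddle factor resolutions between butterfly stages, in pipeline order.

    A node covering p+q stages contributes a W_(2^(p+q)) multiplier between
    its left (p-stage) and right (q-stage) subtrees.
    """
    if tree == 1:
        return []
    left, right = tree
    return stage_twiddles(left) + [2 ** log2_size(tree)] + stage_twiddles(right)

# Illustrative 16-point trees (hypothetical encodings of known algorithms).
radix2_16 = (1, (1, (1, 1)))        # radix-2 DIF
radix22_16 = ((1, 1), (1, 1))       # radix-2^2
print(stage_twiddles(radix2_16))    # [16, 8, 4]
print(stage_twiddles(radix22_16))   # [4, 16, 4]
```

The two trees use the same four butterfly stages but place the $W_{16}$ multiplier at different positions, which is the kind of difference the proposed 4096-point algorithms exploit.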
V. RESULTS
We have analyzed the complexity and switching activity of the twiddle factor multiplications. Both factors influence low-power designs. The twiddle factor multiplication architectures have been coded in VHDL. For the higher-resolution twiddle factor multiplications, we use LUTs storing the precomputed twiddle factors together with a complex multiplier; for the others, a dedicated constant multiplier is used. The twiddle factor memories and complex multipliers were synthesized targeting a Virtex-4 FPGA. The twiddle factors are represented using 12 bits each for the real and imaginary parts, in two's complement representation. The resulting complexity for each stage is given in Table V.

The switching activity between successive coefficients fed to the complex multiplier is defined in terms of the Hamming distance of each coefficient transition. The Hamming distance is the number of 1's in the XOR of two successive binary coefficients. Twiddle factors can be precomputed and stored in look-up tables instead of being calculated in real time. In the pipelined SDF architecture, these stored coefficients are fed to the complex multiplier in each cycle. The sequence of the stored coefficients affects the switching activity. The reading sequence is therefore simulated to obtain the resulting switching activity. The results for the different algorithms are shown in Table IV.

The analysis of these results shows that there are several options for implementing the 4096-point FFT. The first proposed architecture requires two complex multipliers, while the other architectures need three. Its dedicated multipliers and twiddle factor memory have higher hardware complexity than those of the others, but its switching activity is lower. From the first to the third proposed architecture, the complexity of the dedicated constant multipliers and twiddle factor memory decreases, while the switching activity increases.
Low-power design is a trade-off between these parameters. The proposed architectures offer better options for selecting a low-power design than the balanced binary tree algorithms.
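The Hamming-distance measure of switching activity used in this section can be sketched in a few lines. This is an illustration only: the quantization scaling (1.0 mapped to $2^{11} - 1$), the sign convention $W_N = e^{-j 2\pi / N}$, and the function names are assumptions, and the exponent sequences are placeholders rather than the reading sequences of the proposed algorithms.

```python
import math

def to_twos_complement(value, bits=12):
    """Quantize a value in [-1, 1] to a two's complement word, returned as an unsigned int."""
    scale = (1 << (bits - 1)) - 1           # assumed scaling: 1.0 -> 2^(bits-1) - 1
    q = int(round(value * scale))
    return q & ((1 << bits) - 1)            # wrap negative values into 'bits' bits

def hamming(a, b):
    """Number of 1's in the XOR of two binary words."""
    return bin(a ^ b).count("1")

def switching_activity(exponents, N, bits=12):
    """Sum of Hamming distances (real plus imaginary part) over successive W_N^e."""
    words = []
    for e in exponents:
        angle = 2 * math.pi * (e % N) / N
        words.append((to_twos_complement(math.cos(angle), bits),
                      to_twos_complement(-math.sin(angle), bits)))
    return sum(hamming(r0, r1) + hamming(i0, i1)
               for (r0, i0), (r1, i1) in zip(words, words[1:]))
```

Calling `switching_activity` with two different orderings of the same set of exponents generally gives different totals, which is why the reading sequence of the coefficient memory matters for power.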
VI. CONCLUSIONS
In this work, we proposed different algorithms for the single delay feedback architecture for higher radices, considering the 4096-point FFT. The twiddle factor multiplications at each stage differ between the proposed algorithms. The low-power design of each algorithm depends on a few twiddle factor multiplication design parameters, and the design of the twiddle factor multiplication is a trade-off between these parameters.
It is shown that the proposed algorithms offer better choices for selecting a low-power architecture for the 4096-point FFT.
