Abstract-The Fast Fourier Transform (FFT) is among the most widely used algorithms for signal analysis in present-day communication systems (for example, OFDM). FFT algorithms generally use a divide-and-conquer approach that reduces the operation count. The FFT is typically computed using either a single-radix or a mixed-radix algorithm. Several optimization techniques have been proposed to make computation of the Fourier transform more efficient. This paper develops a reconfigurable FFT processor that accepts input sequences of any length and is not limited to lengths that are powers of the radix. The paper compares the factorized mixed-radix approach with the single-radix approach for computing the FFT of an arbitrary N-point sequence and shows that the mixed-radix approach yields better performance. The test designs were developed in VHDL, verified in Matlab and Modelsim, and implemented on a Xilinx Virtex-5 LX110T FPGA.
I. INTRODUCTION
Frequency domain analysis of signals has certain advantages over the time domain approach. The Fourier transform is the tool used to obtain the frequency domain counterparts of time domain signals. The number of computations involved in the direct calculation of the Discrete Fourier Transform (DFT) increases significantly with the length of the input sequence. FFT algorithms compute the DFT coefficients much faster because redundant calculations are eliminated [1]-[3], which increases the processing speed dramatically [4], [5]. Any one of several efficient FFT algorithms may be used to compute the DFT coefficients; each carries out the computations iteratively, resulting in considerable savings in computation time.
The method used in FFT algorithms involves a linear change of index variables to map the one-dimensional problem into a multi-dimensional problem [6]. Most FFT processors are either designed for fixed lengths or are programmable but can compute the FFT only for sequences whose lengths are powers of two. The design of a 3780-point FFT processor for applications such as TDS-OFDM, using a mixed-radix approach, is discussed in [7]. Ref. [8] describes the implementation of a radix-2^2 single-path delay feedback (SDF) pipelined FFT/IFFT processor. The merits of the R2^2SDF FFT and its limited resource utilization are discussed in [9]. The area of variable-length FFT processors has been largely unexplored. A variable-length FFT processor that integrates two radix-2 stages and three radix-2^3 stages for FFT sizes 512, 1024 and 2048 was proposed in [10]. Ref. [11] describes the design of a reconfigurable VLSI architecture for an FFT processor, but it is restricted to powers of 2. In these processors, zero padding is typically employed to first extend the sequence length to a power of 2, if it is not one already, before the FFT is computed. The disadvantage of this technique is that the number of zeros padded to the sequence can become very large as N increases, resulting in shifts in the frequency spectrum. For example, a sequence of length 514 would have to be zero-padded with 510 zeros and converted to a 1024-point sequence before its FFT is computed.
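To make the cost of power-of-two zero padding concrete, the following minimal Python sketch computes the padded length and the number of appended zeros for a given N. The function names are illustrative only and do not correspond to any of the cited designs.

```python
def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (n must be positive)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


def zeros_to_pad(n: int) -> int:
    """Number of zeros appended when an n-point sequence is padded to a power of two."""
    return next_power_of_two(n) - n


if __name__ == "__main__":
    # The 514-point case from the text pads to 1024 points (510 zeros).
    for n in (514, 1000, 3780):
        print(n, next_power_of_two(n), zeros_to_pad(n))
```

A length that sits just above a power of two, such as 514, is the worst case: almost half of the padded sequence consists of zeros.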
This paper presents a methodology to find the FFT of an N-point sequence, where N can be any integer and need not be a power of two or of the specific radix used. Highly optimized FFT blocks for smaller lengths (2 to 8) are designed. N is decomposed into factors, and the FFTs of these smaller lengths are then computed using the basic building blocks. Since a mixed-radix approach is used, zero padding is not required in the majority of cases. Even when the FFT has to be computed for sequences of prime length, far fewer zeros need to be appended to the sequence. The next section gives an overview of the Fast Fourier Transform and the radix-R butterfly based architecture. Section III discusses the proposed design in detail. Some synthesis results are presented in Section IV.
II. BACKGROUND

A. Fast Fourier Transform (FFT)
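The Discrete Fourier Transform of a sequence x(n) of length N is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1, \qquad W_N = e^{-j 2\pi / N} \tag{1}$$

As N increases, the number of computations grows rapidly, since direct evaluation of (1) requires on the order of N^2 complex multiplications, and this affects system efficiency. One technique used to improve system performance is index mapping, in which a one-dimensional FFT is converted to a higher-dimensional FFT. The Cooley-Tukey algorithm [12], which uses a divide-and-conquer approach, recursively partitions a sequence of length N = LM. With the input index written as n = l + mL, where 0 ≤ l ≤ (L-1) and 0 ≤ m ≤ (M-1), and the output index written as k = Mp + q, where 0 ≤ p ≤ (L-1) and 0 ≤ q ≤ (M-1), the FFT can be represented as in (2):

$$X(p,q) = \sum_{l=0}^{L-1} \left\{ W_N^{lq} \left[ \sum_{m=0}^{M-1} x(l,m)\, W_M^{mq} \right] \right\} W_L^{lp} \tag{2}$$

With (2), the number of complex multiplications is reduced from N^2 to N(M+L+1) and the number of complex additions is reduced to N(M+L-2). The properties of the twiddle factors can be used for more efficient computation of the DFT.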
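The two-factor decomposition can be checked numerically. The sketch below is a minimal Python illustration, not the processor's implementation: it assumes N = LM, takes the output map k = Mp + q from the text, and assumes the companion input map n = l + mL. The helper names (`dft`, `two_factor_dft`, `multiplication_count`) are hypothetical.

```python
import cmath


def twiddle(size: int, exponent: int) -> complex:
    """W_size^exponent = exp(-j*2*pi*exponent/size)."""
    return cmath.exp(-2j * cmath.pi * exponent / size)


def dft(x):
    """Direct N-point DFT, used here only as a reference."""
    n_points = len(x)
    return [sum(x[n] * twiddle(n_points, n * k) for n in range(n_points))
            for k in range(n_points)]


def two_factor_dft(x, L: int, M: int):
    """DFT of length N = L*M using n = l + m*L (assumed) and k = M*p + q."""
    N = L * M
    if len(x) != N:
        raise ValueError("len(x) must equal L*M")
    # Step 1: for every l, an M-point DFT over m.
    inner = [[sum(x[l + m * L] * twiddle(M, m * q) for m in range(M))
              for q in range(M)] for l in range(L)]
    # Step 2: intermediate twiddle factors W_N^(l*q).
    scaled = [[twiddle(N, l * q) * inner[l][q] for q in range(M)]
              for l in range(L)]
    # Step 3: for every q, an L-point DFT over l; result goes to k = M*p + q.
    X = [0j] * N
    for p in range(L):
        for q in range(M):
            X[M * p + q] = sum(scaled[l][q] * twiddle(L, l * p) for l in range(L))
    return X


def multiplication_count(L: int, M: int) -> int:
    """Complex multiplications for the two-factor form: N*(M + L + 1)."""
    return L * M * (M + L + 1)


if __name__ == "__main__":
    x = [complex(i, 1) for i in range(40)]
    error = max(abs(a - b) for a, b in zip(dft(x), two_factor_dft(x, 5, 8)))
    print("max deviation from direct DFT:", error)
    print("multiplications:", multiplication_count(5, 8), "vs direct", 40 * 40)
```

The three steps correspond to the bracketed inner sum, the twiddle term, and the outer sum of the decomposition, which is where the N·M, N and N·L terms of the multiplication count come from.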
Mixed-radix algorithms factorize a large data vector into multiple sequences of shorter lengths. The DFTs of these smaller blocks can in turn be combined to compute the DFT of the longer sequence. Hardware complexity can be minimized by reducing the number of multiplications involved in computing the Fourier coefficients. In addition, the use of a higher radix is regarded as more efficient because of the reduced number of multiplications [13]. While the Prime-Factor algorithm has the advantage that there are no intermediate twiddle factor multiplications, it can be used only when the factors of N are relatively prime. The Prime-Factor algorithm also involves a more complex reordering of the sequence at the output, based on an index mapping that makes use of the Euler totient function.
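The absence of intermediate twiddle factors in the Prime-Factor algorithm can be seen in a small Python sketch. It is an illustration under stated assumptions rather than the authors' design: the input map n = (N2·n1 + N1·n2) mod N and the output map k1 = k mod N1, k2 = k mod N2 are one common choice of index maps for relatively prime N1 and N2, and the function names are hypothetical.

```python
import cmath
from math import gcd


def twiddle(size: int, exponent: int) -> complex:
    """W_size^exponent = exp(-j*2*pi*exponent/size)."""
    return cmath.exp(-2j * cmath.pi * exponent / size)


def dft(x):
    """Direct DFT, used only as a reference."""
    n_points = len(x)
    return [sum(x[n] * twiddle(n_points, n * k) for n in range(n_points))
            for k in range(n_points)]


def prime_factor_dft(x, N1: int, N2: int):
    """N-point DFT for N = N1*N2 with gcd(N1, N2) = 1 and no twiddle stage."""
    N = N1 * N2
    if len(x) != N:
        raise ValueError("len(x) must equal N1*N2")
    if gcd(N1, N2) != 1:
        raise ValueError("N1 and N2 must be relatively prime")
    # Input reordering (assumed map): n = (N2*n1 + N1*n2) mod N.
    grid = [[x[(N2 * n1 + N1 * n2) % N] for n2 in range(N2)]
            for n1 in range(N1)]
    # N2-point DFTs along each row.
    rows = [[sum(grid[n1][n2] * twiddle(N2, n2 * k2) for n2 in range(N2))
             for k2 in range(N2)] for n1 in range(N1)]
    # N1-point DFTs along each column, applied directly to the row results.
    X = [0j] * N
    for k in range(N):
        k1, k2 = k % N1, k % N2  # output reordering via residues
        X[k] = sum(rows[n1][k2] * twiddle(N1, n1 * k1) for n1 in range(N1))
    return X


if __name__ == "__main__":
    x = [complex(i, -i) for i in range(40)]
    error = max(abs(a - b) for a, b in zip(dft(x), prime_factor_dft(x, 5, 8)))
    print("max deviation from direct DFT:", error)
```

The price for dropping the twiddle stage is visible in the two modular index maps, which replace the simple strided addressing of the Cooley-Tukey form.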
B. Radix-r Butterfly Based Architecture
Radix-2 algorithms can be used only for lengths that are powers of two. The basic module involves addition and subtraction. In general, a radix-R algorithm requires (N/R) log_R N multiplications and N log_R N additions. There are N/R butterflies per stage, and each butterfly requires (R-1) complex multiplications. Increasing the word length improves the precision of the computed values. A longer word length, however, also increases the size of the memory and the computational units, which in turn increases power consumption and area. Word-length selection is therefore a delicate trade-off between precision and complexity. The word length for fixed-point and floating-point implementations can be selected based on mixed-signal bit-true simulation of the whole system [14].
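The precision side of the word-length trade-off can be illustrated with a small Python sketch that rounds the twiddle factors of an N-point FFT to a given number of fractional bits and reports the worst-case rounding error. This is a simplified stand-in for the bit-true simulation mentioned above: it models only twiddle-factor quantization, and the function names and the chosen bit widths are assumptions made for illustration.

```python
import cmath


def quantize(value: float, frac_bits: int) -> float:
    """Round a real value to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def max_twiddle_error(n_points: int, frac_bits: int) -> float:
    """Largest |W - Q(W)| over the n_points twiddle factors W_N^k."""
    worst = 0.0
    for k in range(n_points):
        w = cmath.exp(-2j * cmath.pi * k / n_points)
        wq = complex(quantize(w.real, frac_bits), quantize(w.imag, frac_bits))
        worst = max(worst, abs(w - wq))
    return worst


if __name__ == "__main__":
    for bits in (8, 12, 16):
        print(bits, "fractional bits: worst error", max_twiddle_error(40, bits))
```

Each additional fractional bit halves the rounding step, while every stored coefficient and every multiplier operand grows by one bit, which is the complexity side of the same trade-off.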
C. Multi-Dimensional Index Mapping
Multidimensional index mapping is used to uncouple the calculations of the Discrete Fourier Transform. An algorithm expressed through index mapping is often easy to translate into an efficient program. Index maps are further classified according to whether the factors of N are mutually prime or not. Separate index maps, each of which follows specific criteria, may be defined for the n and k terms in the basic DFT expression.
III. PROPOSED DESIGN
The architecture proposed for computing the FFT of any N-point sequence uses the mixed-radix approach. The design involves three main stages: 1) factorization of the number N; 2) the choice between the Cooley-Tukey and Prime-Factor algorithms for FFT computation, depending on the factors of N; and 3) the choice of the sequential order of the factors in the computation of the N-point FFT. The basic building blocks used in the design are optimized for high speed and minimal area and are designed for radices 2 through 8. The implementation of the radix-5 block is shown in Fig. 1.
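The first two stages can be sketched in a few lines of Python. The sketch assumes that N is split into two or three factors, each between 2 and 8, and that the Prime-Factor algorithm is selected only when all factors are pairwise relatively prime; the function names and the string labels are hypothetical, and the ordering stage is not modeled.

```python
from itertools import combinations
from math import gcd


def factorizations(n: int, lo: int = 2, hi: int = 8, max_factors: int = 3):
    """List non-decreasing tuples of 2..max_factors factors in [lo, hi] whose product is n."""
    results = []

    def extend(remaining: int, start: int, chosen: list) -> None:
        if remaining == 1:
            if len(chosen) >= 2:
                results.append(tuple(chosen))
            return
        if len(chosen) == max_factors:
            return
        for f in range(start, hi + 1):
            if remaining % f == 0:
                extend(remaining // f, f, chosen + [f])

    extend(n, lo, [])
    return results


def choose_algorithm(factors) -> str:
    """Prime-Factor if every pair of factors is relatively prime, else Cooley-Tukey."""
    coprime = all(gcd(a, b) == 1 for a, b in combinations(factors, 2))
    return "prime-factor" if coprime else "cooley-tukey"


if __name__ == "__main__":
    for n in (8, 40):
        for factors in factorizations(n):
            print(n, factors, choose_algorithm(factors))
```

An empty result from `factorizations` signals a length that cannot be built from the available blocks as it stands, for example a prime greater than 8, which is the case where some zero padding is still needed.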
Any N-point sequence length is factorized as a combination of these basic radices, that is, decomposed into factors N1, N2, N3, etc. The order in which the basic building blocks are used depends on the order of the factors. The initial N-point sequence is given as input to the first block (radix N1). The output of this block is either multiplied by the appropriate twiddle factors (if the Cooley-Tukey approach is used) or passed on directly (if the Prime-Factor algorithm is used) as input to the next stage (radix N2). The FFT coefficients of the N-point sequence are obtained once the sequence has passed through all the building blocks, i.e., radices N1, N2, N3, etc. A reordering of the input may be required at some intermediate stages, depending on the radix used in the previous stage.
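The cascade of radix blocks with intermediate twiddle factors can be sketched for the Cooley-Tukey case as follows. This Python sketch is a software illustration only: `small_dft` stands in for the optimized radix-2 to radix-8 hardware blocks, the particular index maps (input n = N2·n1 + n2, output k = k1 + N1·k2) are an assumption chosen so that the radix-N1 block runs first, and all names are hypothetical.

```python
import cmath
from math import prod


def small_dft(x):
    """Direct DFT; placeholder for an optimized small-radix block."""
    n_points = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points)
                for n in range(n_points)) for k in range(n_points)]


def mixed_radix_fft(x, factors):
    """Cooley-Tukey FFT of len(x) = prod(factors), applying factors in the given order."""
    N = len(x)
    if not factors or prod(factors) != N:
        raise ValueError("product of factors must equal len(x)")
    if len(factors) == 1:
        return small_dft(x)
    N1 = factors[0]
    N2 = N // N1
    # Stage 1: radix-N1 block on each strided subsequence x[n2], x[n2 + N2], ...
    stage = [small_dft(x[n2::N2]) for n2 in range(N2)]  # stage[n2][k1]
    X = [0j] * N
    for k1 in range(N1):
        # Intermediate twiddle factors W_N^(n2*k1) before the next stage.
        seq = [stage[n2][k1] * cmath.exp(-2j * cmath.pi * n2 * k1 / N)
               for n2 in range(N2)]
        # Remaining stages: an N2-point FFT built from the remaining factors.
        sub = mixed_radix_fft(seq, factors[1:])
        # Output reordering: k = k1 + N1*k2.
        for k2 in range(N2):
            X[k1 + N1 * k2] = sub[k2]
    return X


if __name__ == "__main__":
    x = [complex(i % 7, i % 3) for i in range(40)]
    reference = small_dft(x)
    for factors in ((2, 4, 5), (5, 4, 2), (5, 8)):
        result = mixed_radix_fft(x, list(factors))
        print(factors, max(abs(a - b) for a, b in zip(reference, result)))
```

Passing the factors in a different order changes which block runs first and which twiddle factors are needed, but not the result, which is what makes the ordering a free design parameter.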
Since mixed radices are implemented, the probability of having to use zero padding is considerably reduced. Even for those sequence lengths that do require zero padding prior to FFT computation, the number of zeros to be padded is far smaller than in an approach involving a single radix. Moreover, the mixed-radix approach involves fewer computations than the single-radix approach, which further increases the efficiency of the proposed design.
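The reduction in padding can be estimated with a short Python sketch. It assumes, for illustration only, that a length is directly supported whenever it has no prime factor larger than 7 (so that it can be built from radix-2 to radix-8 blocks) and that any number of stages may be cascaded; the actual design may impose a limit on the number of factors. The function names are hypothetical.

```python
def is_supported_length(n: int) -> bool:
    """True if n >= 2 has no prime factor larger than 7 (assumed support criterion)."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7):
        while n % p == 0:
            n //= p
    return n == 1


def mixed_radix_length(n: int) -> int:
    """Smallest supported length >= n under the assumption above."""
    m = max(n, 2)
    while not is_supported_length(m):
        m += 1
    return m


def power_of_two_length(n: int) -> int:
    """Smallest power of two >= n, as used by a radix-2 processor."""
    return 1 << (max(n, 1) - 1).bit_length()


if __name__ == "__main__":
    for n in (11, 40, 514):
        print(n,
              "mixed-radix zeros:", mixed_radix_length(n) - n,
              "power-of-two zeros:", power_of_two_length(n) - n)
```

For the prime length 11, for instance, one zero suffices to reach 12 = 2 × 6, whereas a radix-2 processor would pad to 16.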
The design can also choose between using two or three factors to decompose a larger composite number and compute its FFT. While the Prime-Factor algorithm requires no intermediate twiddle factor multiplications, the Cooley-Tukey approach requires one intermediate twiddle factor multiplication when N is factorized as N = N1 × N2 and three intermediate twiddle factor multiplications when N is factorized as N = N1 × N2 × N3. Fig. 2 shows the block diagram of the stages involved in computing the FFT of an N-point sequence when N is factorized as N = N1 × N2 × N3.
IV. SYNTHESIS RESULTS
The basic building blocks were implemented in VHDL and synthesized for a Xilinx Virtex-5 LX110T FPGA.
The Xilinx results are shown in Table II. Table I shows the synthesis results of an 8-point FFT using radix-2 and radix-4 blocks. Since the factors are not relatively prime (gcd(2,4) ≠ 1), the Cooley-Tukey algorithm was chosen in this case. This involves one intermediate twiddle factor multiplication.
As an example, the synthesis result for the computation of a 40-point FFT is shown. Table III shows the synthesis results for the FFT of a 40-point sequence, with 40 factorized into several possible factor combinations whose factors range between 2 and 8. It can be observed from Table III that the use of higher radices yields better performance than the use of lower radices. It can also be seen that the ordering of the factors, that is, of the different stages in the FFT computation, does not have a significant impact on performance. Of the 69120 slices available in the given device, only 17280 were occupied, which is 25% of the total capacity.
V. CONCLUSION
An FFT processor design methodology yielding an optimal speed-area trade-off is proposed and explored by examining feasible factorization techniques and the use of mixed-radix FFT computation. The first advantage of the proposed processor over several existing reconfigurable processors is that it can compute the FFT of any N-length sequence, with no restriction that N be a power of 2 or of the specific radix used. The proposed design uses the Cooley-Tukey algorithm for FFT computation only when the factors of N are not relatively prime. The Prime-Factor algorithm is used to compute the FFT in all other cases.
As a demonstration, a comparison between the mixed-radix and single-radix approaches was made. The mixed-radix approach required fewer multiplications and gave better performance than conventional single-radix processors. The need to zero-pad sequences so that their length becomes a power of the radix is also significantly reduced when a mixed-radix approach is used. In turn, there is no shift in the frequency spectrum when the proposed design is used. The use of the Prime-Factor algorithm in those cases where the factors of N are relatively prime eliminates the need for any intermediate twiddle factor multiplications, which in turn results in a better design.
