Abstract-This paper presents the design and implementation of a radix-4 FFT on a Xilinx Spartan-6 FPGA. The decimation-in-time equations are reviewed, and several FPGA modules are then presented according to the algorithm architecture, aiming to optimize execution time and occupied device area. Several tests were performed to validate the algorithm's performance and FFT functionality and to analyze its timing. The proposed architecture is low cost and efficient for FFT computation.
I. INTRODUCTION
Several methodologies and techniques already offer hardware and software solutions for computing the fast Fourier transform (FFT), each with advantages for specific applications. These solutions are developed to run on several platforms, such as GPU, DSP, FPGA and ASIC, and they are usually described in C/C++ or in an HDL. Implementations intended for reconfigurable logic are usually described in an HDL, such as VHDL or Verilog.
Different FFT algorithms have been proposed to exploit certain signal properties to improve the trade-off between computation time and hardware requirements. Radix-4 based algorithms improve computation time by a factor of two, compared with radix-2 based algorithms, increasing hardware requirements by the same factor.
Considering that low-cost, high-density reconfigurable devices are already available, an optimized price/performance FPGA implementation of the 1024-point radix-4 FFT is feasible.
II. RADIX-4 FFT
The N-point discrete Fourier transform (DFT) is defined by [1]

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1,$$

where $W_N = e^{-j 2\pi / N}$. Computing the DFT directly is expensive (it requires $N^2$ complex multiplications and $N(N-1)$ complex additions), so a more efficient way to perform this calculation is needed. The fast Fourier transform (FFT) is a faster and more efficient algorithm for computing the DFT. We use the decimation-in-frequency (DIF) radix-4 algorithm, which is the most commonly used to calculate the FFT because of its reduced computational complexity.
The radix-4 DIF FFT divides an N-point DFT into four N/4-point DFTs, then into 16 N/16-point DFTs, and so on. In the radix-2 DIF FFT, the DFT equation is expressed as the sum of two calculations: one sum over the first half and one sum over the second half of the input sequence [1]. Similarly, the radix-4 DIF FFT expresses the DFT equation as four summations and then divides it into four equations, each of which computes every fourth output sample. The three twiddle factor coefficients can be expressed as follows:
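In the standard radix-4 DIF derivation, the four summations and the three twiddle factor coefficients take the following form. This is a sketch that assumes the convention $W_N = e^{-j 2\pi / N}$; the equation layout is illustrative and may differ from the original numbering.

```latex
% Assumption: W_N = e^{-j 2\pi / N}
\begin{align*}
X(k) &= \sum_{n=0}^{N/4-1} \Big[\, x(n)
        + W_N^{kN/4}\, x\big(n + \tfrac{N}{4}\big)
        + W_N^{kN/2}\, x\big(n + \tfrac{N}{2}\big)
        + W_N^{3kN/4}\, x\big(n + \tfrac{3N}{4}\big) \Big]\, W_N^{nk} \\[4pt]
W_N^{kN/4} &= (-j)^k, \qquad
W_N^{kN/2} = (-1)^k, \qquad
W_N^{3kN/4} = (j)^k
\end{align*}
```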
To arrive at a four-point DFT decomposition, let:
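A sketch of this step, assuming the usual substitution $W_N^{4} = W_{N/4}$ and the output index written as $k = 4r + p$ with $p = 0, 1, 2, 3$ and $r = 0, \ldots, N/4 - 1$ (the symbols $r$ and $p$ are introduced here for illustration):

```latex
% Assumptions: W_N = e^{-j 2\pi / N}, W_N^{4} = W_{N/4}, k = 4r + p
\begin{align*}
X(4r)   &= \sum_{n=0}^{N/4-1} \Big[ x(n) + x\big(n+\tfrac{N}{4}\big) + x\big(n+\tfrac{N}{2}\big) + x\big(n+\tfrac{3N}{4}\big) \Big]\, W_N^{0}\, W_{N/4}^{nr} \\
X(4r+1) &= \sum_{n=0}^{N/4-1} \Big[ x(n) - j\,x\big(n+\tfrac{N}{4}\big) - x\big(n+\tfrac{N}{2}\big) + j\,x\big(n+\tfrac{3N}{4}\big) \Big]\, W_N^{n}\, W_{N/4}^{nr} \\
X(4r+2) &= \sum_{n=0}^{N/4-1} \Big[ x(n) - x\big(n+\tfrac{N}{4}\big) + x\big(n+\tfrac{N}{2}\big) - x\big(n+\tfrac{3N}{4}\big) \Big]\, W_N^{2n}\, W_{N/4}^{nr} \\
X(4r+3) &= \sum_{n=0}^{N/4-1} \Big[ x(n) + j\,x\big(n+\tfrac{N}{4}\big) - x\big(n+\tfrac{N}{2}\big) - j\,x\big(n+\tfrac{3N}{4}\big) \Big]\, W_N^{3n}\, W_{N/4}^{nr}
\end{align*}
```

Each line is an N/4-point DFT of a sequence formed from four input samples and one twiddle factor, which is the operation the dragonfly performs.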
Then the Equation (6)
The points of the FFT Radix-2 algorithm are calculated using the following formula:
Compared with the radix-2 algorithm, the radix-4 algorithm is more complex to implement but has a lower computational cost.
Taking a 1024-point sequence, radix-2 would require 40960 additions and 20480 multiplications, whereas radix-4 requires 30720 additions and only 5120 multiplications. The comparison between the radix-4 and radix-2 implementations [4] is shown in Table I.
III. ARCHITECTURE
The 1024-point FFT processes 1024 complex samples, each 64 bits long (a 32-bit word for the real part and another for the imaginary part). The samples are stored in memory RAM1, each at its own address. The dragonfly takes 4 samples and operates on them; this process is repeated 256 times, and the results are stored in memory RAM2. The contents of RAM2 then replace those of RAM1, and the dragonfly pass is run again. This is done 5 times to complete the radix-4 FFT calculation. The flowchart is shown in Figure 3. The dragonfly accepts 64-bit words, each consisting of two IEEE-754 single-precision values of 32 bits (1-bit sign, 8-bit exponent, and 23-bit fraction). The first is the real part and the second is the imaginary part (see Figure 4).
Fig. 4. FFT Bit Length
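The pass structure described above (256 dragonflies per pass, 5 passes, results written to a second memory that then replaces the first) can be sketched in software. The following is a minimal model in Python, not the FPGA design itself: it assumes the DIF ordering of Section II, uses double-precision Python complex numbers rather than IEEE-754 single precision, and the function names `dragonfly`, `digit_reverse4` and `fft_radix4` are hypothetical.

```python
import cmath

def dragonfly(a, b, c, d, w1, w2, w3):
    """Radix-4 DIF butterfly: four inputs, three twiddle factors."""
    t0 = a + c
    t1 = a - c
    t2 = b + d
    t3 = -1j * (b - d)          # multiplication by -j
    return (t0 + t2,            # a + b + c + d
            (t1 + t3) * w1,     # (a - jb - c + jd) * W^i
            (t0 - t2) * w2,     # (a - b + c - d)   * W^2i
            (t1 - t3) * w3)     # (a + jb - c - jd) * W^3i

def digit_reverse4(k, digits):
    """Reverse the base-4 digits of k (DIF leaves outputs in this order)."""
    r = 0
    for _ in range(digits):
        r = (r << 2) | (k & 3)
        k >>= 2
    return r

def fft_radix4(samples):
    """Radix-4 DIF FFT using two buffers, modelled on RAM1 and RAM2."""
    n = len(samples)
    passes, m = 0, n
    while m > 1:
        if m % 4:
            raise ValueError("length must be a power of 4")
        m //= 4
        passes += 1             # 5 passes for n = 1024

    ram1 = list(samples)
    size = n                    # length of the sub-DFTs in this pass
    for _ in range(passes):
        ram2 = [0j] * n
        quarter = size // 4
        for start in range(0, n, size):
            for i in range(quarter):        # n/4 dragonflies per pass in total
                w1 = cmath.exp(-2j * cmath.pi * i / size)
                i0 = start + i
                i1, i2, i3 = i0 + quarter, i0 + 2 * quarter, i0 + 3 * quarter
                ram2[i0], ram2[i1], ram2[i2], ram2[i3] = dragonfly(
                    ram1[i0], ram1[i1], ram1[i2], ram1[i3],
                    w1, w1 * w1, w1 ** 3)
        ram1 = ram2             # RAM2 replaces RAM1 before the next pass
        size = quarter

    # Undo the base-4 digit-reversed output order.
    return [ram1[digit_reverse4(k, passes)] for k in range(n)]
```

For a 1024-point input this loop runs 5 passes of 256 dragonflies each, matching the counts given in the text. The final reordering step is an assumption of this sketch; the text does not say where the hardware performs it.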
The 1024 complex input samples are stored in RAM1. The input register combines the real part and the imaginary part, and the (complex) input receives a new sample every clock cycle. When the storage register is full, it generates a valid signal that triggers the start of the FFT process.
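The 64-bit sample format can be illustrated with a short Python sketch that packs two IEEE-754 single-precision values into one word. It assumes the real part occupies the upper 32 bits, which the text does not state explicitly; `pack_complex` and `unpack_complex` are hypothetical helper names.

```python
import struct

def pack_complex(re, im):
    """Pack real and imaginary parts as two float32 values in one 64-bit word.

    Assumption: real part in bits 63..32, imaginary part in bits 31..0.
    """
    return int.from_bytes(struct.pack(">ff", re, im), "big")

def unpack_complex(word):
    """Split a 64-bit word back into (real, imaginary) float32 values."""
    return struct.unpack(">ff", word.to_bytes(8, "big"))

word = pack_complex(1.0, -2.0)
print(hex(word))              # 0x3f800000c0000000
print(unpack_complex(word))   # (1.0, -2.0)
```

Values that are not exactly representable in single precision are rounded when packed, so a round trip returns the nearest float32 value rather than the original double.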
The program starts by defining a main unit called "FFTfinal". This unit runs two sub-stages, "cont_gen" and "unidadbasica", each with a different task, as shown in Figure 5. The first is the controller, which holds the states of the operation. The operation is divided into fifteen states, each of which sets the key parameters for the next stage. It is implemented with a switch-case statement.
The second stage carries out the states issued by the controller. It has counters to handle the synchronization and addressing of the data. The input and output addresses are loaded into two separate ROMs, and two units access the RAM that contains the original function: one unit to read and another to overwrite. The remaining parts execute the dragonfly. The program starts by loading the data into memory "RAM1"; the data are then copied into the registers "LoadA", "LoadB", "LoadC" and "LoadD". The twiddle coefficients are loaded from the ROM "LoadSinCos", and the dragonfly is run.
FFTfinal Flow Chart
The dragonfly produces four results, "LoadAF", "LoadBF", "LoadCF" and "LoadDF", which are stored in memory "RAM2"; once that memory is full, its contents overwrite memory "RAM1". The counter "DataCount" runs the same process 5 times and then terminates execution. The overall algorithm computation is supervised by a sequencer whose workflow is shown in Figure 6.
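One possible way to produce the contents of the address ROMs is sketched below in Python. It is an assumption-laden illustration, not the authors' table: it assumes a DIF addressing scheme in which each dragonfly reads and writes the same four addresses, and a sine/cosine table with one entry per N-th of a turn. The function name `address_tables` is hypothetical.

```python
def address_tables(n=1024):
    """Build, for each pass, the four data addresses and three twiddle
    indices used by every dragonfly (assumed DIF addressing)."""
    tables = []
    size = n                          # sub-DFT length in the current pass
    while size >= 4:
        quarter = size // 4
        step = n // size              # stride into an n-entry sin/cos table
        rows = []
        for start in range(0, n, size):
            for i in range(quarter):
                base = start + i
                addrs = (base, base + quarter,
                         base + 2 * quarter, base + 3 * quarter)
                twiddles = (i * step, 2 * i * step, 3 * i * step)
                rows.append((addrs, twiddles))
        tables.append(rows)
        size = quarter
    return tables

tables = address_tables(1024)
print(len(tables), len(tables[0]))    # 5 256
print(tables[0][1])                   # ((1, 257, 513, 769), (1, 2, 3))
```

Each pass touches every one of the 1024 addresses exactly once, which is why a full copy of RAM2 back to RAM1 is enough to prepare the next pass.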
The processes that require synchronization are the counters and the register writes, so these need a clock signal, "clk".
The main block for the FFT computation is the dragonfly processor, which contains complex single-precision floating-point multipliers and adders. These operations are implemented with the IP core Multiply Adder v2.0 [5], which is already available for the Spartan-6.
The dragonfly is divided into two parts, shown in Fig. 7. The first, "ArmaReI", performs the multiplications and additions with the twiddle coefficients in complex cosine and sine form. The second, "sumasfinales", performs the additions and subtractions and delivers the outputs in order. During the implementation we optimized the use of the clock in order to reduce the response time.
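The two-part split can be sketched with real arithmetic only, as a hardware datapath would use. The Python below is an illustration under assumptions: the twiddle multiplication is applied to three of the four inputs before the final sums, following the order in which the text describes the two parts, and the twiddle is passed as its real and imaginary parts. The function names are hypothetical and do not correspond to the actual ports of "ArmaReI" or "sumasfinales".

```python
def twiddle_multiply(xr, xi, wr, wi):
    """Complex multiply (xr + j*xi) * (wr + j*wi) with real operations:
    four multiplications and two additions."""
    return (xr * wr - xi * wi, xr * wi + xi * wr)

def final_sums(a, b, c, d):
    """Four-point sums using only additions and subtractions.
    Multiplying by -j or +j just swaps and negates real/imaginary parts."""
    (ar, ai), (br, bi), (cr, ci), (dr, di) = a, b, c, d
    y0 = (ar + br + cr + dr, ai + bi + ci + di)   # a + b + c + d
    y1 = (ar + bi - cr - di, ai - br - ci + dr)   # a - jb - c + jd
    y2 = (ar - br + cr - dr, ai - bi + ci - di)   # a - b + c - d
    y3 = (ar - bi - cr + di, ai + br - ci - dr)   # a + jb - c - jd
    return y0, y1, y2, y3

def dragonfly_real(a, b, c, d, w1, w2, w3):
    """First part: twiddle products; second part: final sums."""
    b = twiddle_multiply(*b, *w1)
    c = twiddle_multiply(*c, *w2)
    d = twiddle_multiply(*d, *w3)
    return final_sums(a, b, c, d)
```

Written this way, the dragonfly needs 12 real multiplications (three complex products) and no multipliers at all in the final-sum stage.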
IV. RESULTS
The timing performance of the FFT was estimated by the design software for the Spartan-6 FPGA, which gave the following result:
• Clock period: 7.127 ns (frequency: 140.308 MHz).
The device utilization is shown in the following estimated table:
The implementation achieves a result very close to that of the Xilinx FFT core [6] and a better result than another FFT core [7], as shown in Table IV. For validation purposes, we captured the output of the simulation and compared it with the FFT in MATLAB [8], as shown in Figure 8.
This work has described a pipelined radix-4 FFT algorithm in floating point. It was compared with a floating-point model in MATLAB and gave very accurate results. We also compared it with the 1024-Point Radix-4 FFT Computation core [7] and observed similar results.
Because the radix-4 FFT algorithm uses fewer complex multipliers than the radix-2 FFT algorithm, the radix-4 algorithm is preferable for hardware implementation. A parallel programming approach seems to be a good choice when a real-time system with a high sampling rate is desired. To reach an acceptable level of phase error, it is desirable to use 32-bit precision for the input signal and the twiddle factors.
