Abstract-This brief presents a novel 4096-point radix-4 memory-based fast Fourier transform (FFT). The proposed architecture follows a conflict-free strategy that only requires a total memory of size N and a few additional multiplexers. The control is also simple, as it is generated directly from the bits of a counter. In addition to its low complexity, the FFT has been implemented on a Virtex-5 field programmable gate array (FPGA) using DSP slices. The goal has been to reduce the use of distributed logic, which is scarce in the target FPGA. For this purpose, most of the hardware has been implemented in DSP48E slices. As a result, the proposed design is efficient in terms of hardware resources, as shown by the experimental results.
I. INTRODUCTION
The fast Fourier transform (FFT) is one of the most important algorithms in the field of digital signal processing, used to calculate the discrete Fourier transform efficiently. The FFT is part of numerous systems in a large variety of applications. Sometimes the system demands the computation of the FFT at a very high rate. For this purpose, pipelined FFTs are mainly used [1], [2]. In other systems, the demands in terms of performance are not so strict. Instead, there are demands in terms of area or hardware resources occupied by the architecture. Under these circumstances, designers usually resort to memory-based FFTs [3]-[11], also called in-place or iterative FFTs.
Memory-based FFTs consist of a memory or bank of memories that store the data. These data are read from memory, processed by butterflies and rotators, and stored again in memory. This process repeats iteratively until all the stages of the FFT algorithm are calculated. The advantage of memory-based FFTs is the reduction in the number of butterflies and rotators, as they are reused for different stages of the FFT.
There exist numerous memory-based FFT architectures in the literature. They mainly differ in the size of the processing element (PE), i.e., the butterflies and rotators. The most typical approach is to use a radix-2 butterfly [3]-[5], but there are cases of radix-4 [6], [7] and other radices [8]-[11]. Memory-based FFTs also differ in the access strategy to the memory, which in most cases provides conflict-free access. The amount of memory used in an N-point memory-based FFT is generally N or 2N. Apart from the memory, the access strategy may demand extra multiplexers [7], buffers, or cache memories.
This brief presents a novel radix-4 memory-based FFT. The proposed design has several advantages. With respect to the previous radix-4 approaches, it uses the minimum memory of N samples and a few additional multiplexers. Furthermore, the proposed approach has been implemented using DSP48E slices on a field programmable gate array (FPGA). The implementation integrates the components of the architecture in the DSP48E slices, which reduces the hardware, especially the amount of distributed logic. As a result, the proposed approach is a compact solution for the FPGA that takes advantage of the DSP48E slices, leaving room in the FPGA for other complex and area-demanding elements.
This brief is organized as follows. Section II describes the proposed memory-based FFT. Section III explains the implementation using DSP slices. Section IV compares the proposed FFT to the previous memory-based FFTs. Section V presents an application where the proposed FFT has been used. Section VI shows the experimental results on the FPGA. Finally, Section VII summarizes the main conclusions of this brief.
II. PROPOSED MEMORY-BASED FFT

A. Basic Architecture
The basic architecture of the proposed 4096-point FFT is shown in Fig. 1. The architecture uses radix-4 and computes the FFT algorithm iteratively in six iterations, which comes from the fact that in a radix-r memory-based FFT, the number of iterations is
$$\log_r N = \log_4 4096 = 6. \tag{1}$$
The proposed design includes four memories of N/4 samples in parallel instead of a single memory of N samples. This allows data to be read and written simultaneously in all the memories, which reduces the latency and increases the throughput of the circuit. Thus, at every clock cycle, the PE receives and provides four samples in parallel, one from and to each memory.
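As a quick check of these figures, the following sketch derives the iteration count and the memory organisation from N and the radix. The function name, the dictionary keys, and the `read_cycles` estimate (which ignores pipeline latency and the time to load and unload the memories) are illustrative assumptions, not part of the original design description.

```python
def memory_based_fft_plan(n_points, radix):
    """Iteration count and memory organisation of an N-point radix-r design.

    Assumes one memory per butterfly input (four memories for radix-4)
    and one butterfly operation per clock cycle.
    """
    iterations, size = 0, n_points
    while size > 1:
        if size % radix:
            raise ValueError("n_points must be a power of radix")
        size //= radix
        iterations += 1
    samples_per_memory = n_points // radix
    return {
        "iterations": iterations,                  # log_r(N)
        "memories": radix,
        "samples_per_memory": samples_per_memory,  # N / r
        # Valid because N / r is a power of two in the cases considered here.
        "address_bits": samples_per_memory.bit_length() - 1,
        # Rough estimate: N / r butterfly reads per iteration.
        "read_cycles": iterations * samples_per_memory,
    }


plan = memory_based_fft_plan(4096, 4)
print(plan)
```

For N = 4096 and radix 4, this yields six iterations over four memories of 1024 samples each, addressed with 10 bits.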
B. Conflict-Free Access
As all four memories are accessed simultaneously, it must be assured that the four samples processed in the PE every clock cycle come from different memories. This demands a conflict-free memory access strategy as shown next. The notation in this brief is the same one used in previous works [12], [13]. Initially, samples are stored in natural order in the memories.
Bits $b_{n-1}$ and $b_{n-2}$ indicate in which of the four memories the sample is stored, whereas bits $b_{n-3}, \ldots, b_0$ are the address. In a radix-2 FFT, the butterfly at stage $s$ operates on samples whose indexes differ in the bit $b_{n-s}$ [2]. For radix-4, the butterfly at stage $s$ operates on samples that differ in bits $b_{n-2s+1} b_{n-2s}$. At each iteration of the FFT, these samples must arrive in parallel at the PE. For the first stage/iteration, the samples that differ in bits $b_{n-1} b_{n-2}$ are already in different memories according to (2). For the second iteration, samples that differ in bits $b_{n-3} b_{n-4}$ need to arrive in parallel at the PE. This is achieved by calculating the permutation
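To illustrate why a permutation is needed after the first iteration, the sketch below lists the sample indices that one radix-4 butterfly processes at stage s, i.e., the indices that differ only in the two bits at positions n-2s+1 and n-2s (for s = 1 these are the two MSBs, as stated for the first iteration). The helper names are hypothetical, and the example uses a 16-point FFT (n = 4) with natural-order storage.

```python
def split_position(index, n_bits):
    """Natural-order storage: the two MSBs select the memory, the rest is the address."""
    memory = index >> (n_bits - 2)
    address = index & ((1 << (n_bits - 2)) - 1)
    return memory, address


def butterfly_group(index, stage, n_bits):
    """Indices processed together at `stage` (1-based).

    They differ only in bits b_{n-2s+1} b_{n-2s}.
    """
    shift = n_bits - 2 * stage
    base = index & ~(0b11 << shift)
    return [base | (digit << shift) for digit in range(4)]


N_BITS = 4  # 16-point example

stage1 = butterfly_group(1, 1, N_BITS)
stage2 = butterfly_group(1, 2, N_BITS)

print("stage 1:", stage1, [split_position(i, N_BITS)[0] for i in stage1])
print("stage 2:", stage2, [split_position(i, N_BITS)[0] for i in stage2])
```

With natural-order storage, the stage-1 group is spread over all four memories, whereas the stage-2 group sits entirely in one memory, which would serialize the accesses unless the data are rearranged between iterations.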
on the data in position $P_1$, which leads to the position
For the remaining iterations, the same permutation is carried out to deliver the correct samples to the PE. The permutation in (3) is carried out in three steps
The first permutation
is the permutation of the multiplexers before the memories, where ($\oplus$) represents the logic XOR function. The circuit that calculates this permutation is shown in Fig. 2(a). This permutation determines the memory in which data are going to be stored. The permutation $\sigma_1$ is controlled by the bits $c_{n-3}$ and $c_{n-4}$. These are the two MSBs of the control counter with bits $c_{n-3}, \ldots, c_0$. Thus, the control is simple, as it is taken directly from the bits of the counter. The second permutation is carried out by the memories. It only affects the content of the memories. The memory address is obtained as
where $m_1 m_0$ are the bits that indicate the memory, $c_i$ are the bits of the control counter, $W_{i+1}$ is the writing address at iteration $i + 1$, and $R_i$ is the reading address at iteration $i$. Note that $W_{i+1} = R_i$. This means that at each iteration, data are written in the addresses that were emptied in the previous iteration. This optimization makes it possible to use a total memory of only N. For the first iteration, $W_1 = R_0$ is equal to the control counter. The generation of the memory address is shown in Fig. 3 for MEM2, for which $m_1 m_0 = 10$. The third permutation
is the permutation of the multiplexers after the memories, shown in Fig. 2(b). Like $\sigma_1$, it only determines the memory in which data are stored. Finally, the first time that samples are read from the memory, the permutation $\sigma_3$ is disabled, i.e., the control signals are set to 00. This happens because samples in the memory are already in the order demanded by the PE.
Fig. 4 shows the data management for a 16-point FFT. The top part of Fig. 4 shows the data orders at the different stages of the circuit below. First, data are stored in memory. Fig. 4 shows the content of the different addresses in M0-M3. These data are read according to $R_0 = c_1 c_0$, i.e., data from the memories are read in order. The data bypass the multiplexer after the memory, which is disabled in this iteration, and enter the butterfly. The data order does not change until after the multiplexer that calculates $\sigma_1$. This multiplexer is controlled by $c_{n-3} c_{n-4} = c_1 c_0$ and only permutes parallel data that arrive in the same clock cycle according to $\sigma_1$. The data at the output of the multiplexer are stored again in memory. The writing address is $W_1 = c_1 c_0$, which is equal to $R_0$ to avoid memory access conflicts, as shown in (8). The memories are read according to $R_1 = (c_1 \oplus m_1, c_0 \oplus m_0)$. Note that the writing and the reading in the memories lead to a permutation $\sigma_2$ on data that arrive in series to the memories. Finally, a second parallel permutation, $\sigma_3$, provides the order required at the input of the butterfly for the second iteration. The output of the butterfly provides the output of the FFT.
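The 16-point data flow can be traced with a short simulation that tracks only sample indices, not arithmetic. The read and write addresses follow the text ($R_0 = W_1 = c$ and $R_1 = c \oplus m$). The exact forms of $\sigma_1$ and $\sigma_3$ are assumptions: they are modelled here as an XOR of the port or memory index with the two control bits, which is consistent with the XOR-based, counter-controlled multiplexers described above but is not taken from the original equations.

```python
N_MEM = 4   # memories M0..M3
DEPTH = 4   # N/4 addresses per memory for N = 16

# Natural-order load: memory = b3 b2, address = b1 b0.
mem = [[4 * m + a for a in range(DEPTH)] for m in range(N_MEM)]

# Iteration 1: read at R0 = c, sigma_3 disabled, butterfly, sigma_1, write at W1 = c.
first_groups = []
for c in range(DEPTH):
    ports = [mem[m][c] for m in range(N_MEM)]   # one sample from each memory
    first_groups.append(ports)
    # Butterfly arithmetic omitted: only the index labels are tracked.
    for p in range(N_MEM):
        mem[p ^ c][c] = ports[p]                # assumed sigma_1: memory = port XOR c

# Iteration 2: read at R1 = c XOR m, then assumed sigma_3: port = memory XOR c.
second_groups = []
for c in range(DEPTH):
    addresses = [c ^ m for m in range(N_MEM)]   # a different address in every memory
    ports = [None] * N_MEM
    for m in range(N_MEM):
        ports[m ^ c] = mem[m][addresses[m]]
    second_groups.append(ports)

print(first_groups)
print(second_groups)
```

Under these assumptions, every cycle of the second iteration reads exactly one sample from each memory, and the four samples differ only in bits $b_1 b_0$, as the second radix-4 stage requires. Writing at the address that was just read is what keeps the total memory at N samples.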
C. Rotations
The rotations of the FFT are performed by three complex multipliers. Each of them is connected to a rotation memory, which is a ROM of N/4 addresses that stores the sine and cosine components of the rotation angle $\phi = -m(2\pi/N)i$, where $m \in \{1, 2, 3\}$ is the memory and $i$ is the memory address. The memory address of the rotation memories is generated in a simple way from the control counter, as shown in Fig. 5. The address is the same for all three memories and is obtained by enabling the bits of the counter depending on the iteration.
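The contents of the three rotation memories follow directly from the angle formula. The sketch below builds them in floating point; the function name is hypothetical, and a hardware ROM would store quantized values whose word length is not specified here.

```python
import math


def rotation_rom(n_points, m):
    """ROM for multiplier m: entry i holds (cos(phi), sin(phi)), phi = -m * (2*pi/N) * i."""
    rom = []
    for i in range(n_points // 4):
        phi = -m * (2.0 * math.pi / n_points) * i
        rom.append((math.cos(phi), math.sin(phi)))
    return rom


N = 4096
roms = {m: rotation_rom(N, m) for m in (1, 2, 3)}

# Address 512 of the first ROM corresponds to phi = -pi/4.
cos_v, sin_v = roms[1][512]
print(len(roms[1]), cos_v, sin_v)
```

Because all three memories share the same address $i$, the multiplier connected to memory $m$ rotates by $m$ times the angle of the first one.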
III. IMPLEMENTATION USING DSP SLICES
The proposed architecture has been implemented on a Virtex-5 XC5VSX95T FPGA. The VSX family is characterized by a large number of DSP48E slices and a small amount of distributed logic. Thus, we have sought to maximize the use of DSP48E slices and minimize the use of distributed logic by implementing all the elements of the architecture except the memories on DSP48E slices. An advantage of using DSP slices is that they can be clocked at high clock frequencies. Furthermore, the implementation on DSP slices allows for large word lengths without reducing the clock frequency, in contrast to designs implemented in distributed logic, where the clock frequency may drop as the word length increases. Fig. 6 shows the architecture.
The memories MEM0 to MEM3 are implemented using block RAM (BRAM) memories. Each memory has 1024 addresses and each address stores a sample of 24 + 24 bits for the real and imaginary parts, respectively.
The module BTF0 consists of two DSP48E slices in which the multiplier has been disabled, which allows it to work with four inputs of 24 + 24 bits. The arithmetic logic unit (ALU) inside the DSP48E is configured in mode SIMD = TWO24. This way, the real and imaginary components of the data are operated on jointly.
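The effect of the TWO24 mode can be modelled in software as two independent 24-bit additions or subtractions packed into one 48-bit word, with no carry crossing the lane boundary. In the sketch below, the function names and the choice of placing the real part in the upper lane are assumptions made for illustration.

```python
LANE_BITS = 24
LANE_MASK = (1 << LANE_BITS) - 1


def pack(re, im):
    """Pack two signed 24-bit values into one 48-bit word (real part in the upper lane: assumed)."""
    return ((re & LANE_MASK) << LANE_BITS) | (im & LANE_MASK)


def unpack(word):
    """Recover the two signed 24-bit lanes from a 48-bit word."""
    def signed(value):
        return value - (1 << LANE_BITS) if value & (1 << (LANE_BITS - 1)) else value
    return signed((word >> LANE_BITS) & LANE_MASK), signed(word & LANE_MASK)


def simd_two24(a, b, subtract=False):
    """Two independent 24-bit add/sub operations: carries do not cross lanes."""
    hi_a, lo_a = a >> LANE_BITS, a & LANE_MASK
    hi_b, lo_b = b >> LANE_BITS, b & LANE_MASK
    if subtract:
        hi, lo = hi_a - hi_b, lo_a - lo_b
    else:
        hi, lo = hi_a + hi_b, lo_a + lo_b
    return ((hi & LANE_MASK) << LANE_BITS) | (lo & LANE_MASK)


x = pack(100, -3)
y = pack(-40, 5)
total = unpack(simd_two24(x, y))
diff = unpack(simd_two24(x, y, subtract=True))
plain = unpack((x + y) & ((1 << 48) - 1))   # ordinary 48-bit add, for contrast
print(total, diff, plain)
```

The plain 48-bit addition lets the carry of the imaginary lane leak into the real lane, which is why a lane-split mode is needed to process both components of a complex sample in a single ALU.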
Both sets of multiplexers in Fig. 2 would represent a significant cost if they were implemented in distributed logic. In order to avoid this, the connections between the memory and the PEs are static, i.e., the outputs of the memories are always connected to the PE without multiplexing. This requires modifying the PE. The module BTF0 calculates the first crossed terms of the radix-4 butterfly and incorporates the multiplexers in Fig. 2(b). Thus, BTF0 changes the operations of the DSP48E slices depending on the two LSBs of the control counter, according to Table I. The module BTF1 is analogous to BTF0. It consists of two DSP48E slices and calculates the second part of the radix-4 butterfly. The operations that are executed depend on the bits $c_{n-3} c_{n-4}$ of the control counter, as shown in Table II.
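The split of the radix-4 butterfly into two add/subtract stages can be sketched with the textbook decomposition below. The function names reuse the module names for readability only: the exact operations of Tables I and II, and the way the multiplexing is folded into the DSP48E operation modes, are not modelled.

```python
import cmath


def btf0(x):
    """First half (assumed split): sums and differences of the crossed pairs."""
    x0, x1, x2, x3 = x
    return [x0 + x2, x1 + x3, x0 - x2, x1 - x3]


def btf1(a):
    """Second half: combines the partial results.

    Multiplying by -j or +j only swaps real and imaginary parts with a sign
    change, so no multiplier is required.
    """
    a0, a1, a2, a3 = a
    return [a0 + a1, a2 - 1j * a3, a0 - a1, a2 + 1j * a3]


def dft4(x):
    """Direct 4-point DFT used as a reference."""
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / 4) for n in range(4))
        for k in range(4)
    ]


sample = [1 + 2j, -3 + 0.5j, 0.25 - 1j, 2 + 2j]
two_stage = btf1(btf0(sample))
reference = dft4(sample)
print(two_stage)
```

Both stages use only additions and subtractions, which is what allows them to be mapped to the ALUs of DSP48E slices with the multiplier disabled.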
The module TWD in Fig. 6 calculates the multiplications by the twiddle factors. The twiddle factors are stored in the ROM, which is used in all the iterations of the FFT. In the first iteration, the coefficients are read one by one in order, whereas in the remaining iterations, the LSBs of the control counter are canceled in order to determine the address [14], as shown in Fig. 5. As the outputs of BTF1 are shuffled, the twiddle factors are also provided in this shuffled order. During the last iteration, the module TWD does not need to calculate any rotation. Thus, the multipliers of this module are used to calculate the squared magnitude of the complex values, i.e., $|C^2 + S^2|$, in order to determine the power at each output frequency.
IV. COMPARISON
Table III compares various memory-based FFTs. In memory-based FFTs, there is a tradeoff between the amount of resources of the architecture and the processing time, $T_{PROC}$. This depends on the radix. The higher the radix, the larger the PE. A large PE increases the area, but reduces $T_{PROC}$
where $T_{MEM}$ is the access time to the memory. Thus, the radix-2 FFTs in Table III need fewer resources [3]-[5], whereas the high-radix FFTs [8]-[10] achieve higher throughput. Radix-4 lies between radix-2 and the high radices. Therefore, it presents a tradeoff between resources and performance. Among radix-4 designs [6], [7], [10], the proposed approach is characterized by the use of the least amount of resources, while keeping the same $T_{PROC}$ as other radix-4 designs. In particular, it only needs a total memory of N samples compared with 2N samples [6] or 2N(φ + 1) samples [10]. Note, however, that the approach in [6] is intended for continuous flow whereas our approach is not. Compared with [7], the proposed design significantly reduces the number of multiplexers to only 16 2-input multiplexers, compared with 16 16-input multiplexers and demultiplexers in [7], which is equivalent to 240 2-input multiplexers.
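The multiplexer equivalence quoted for [7] follows from building a k-input multiplexer as a tree of k - 1 two-input multiplexers. The short check below reproduces the arithmetic; the function name is hypothetical and the tree construction is the usual assumption behind such counts.

```python
def two_input_mux_equivalent(inputs):
    """A k-input multiplexer built as a binary tree needs k - 1 two-input multiplexers."""
    return inputs - 1


muxes_in_ref7 = 16 * two_input_mux_equivalent(16)   # 16 multiplexers with 16 inputs each
muxes_proposed = 16                                  # 2-input multiplexers
ratio = muxes_in_ref7 / muxes_proposed
print(muxes_in_ref7, muxes_proposed, ratio)
```

Under this counting, the proposed design uses 15 times fewer 2-input multiplexers than [7].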
V. APPLICATION CASE
The proposed FFT has been used for spectrum analysis. Fig. 7 shows the block diagram of the spectrum analyzer. The system includes four channels. Each channel consists of a finite-impulse response (FIR) filter, a decimation stage (DEC), a 4096-point iterative FFT, and a periodogram analysis.
The FIR filter is in charge of the bandwidth adaptation. The filtering is done with appropriate coefficients to avoid aliasing, and to reduce the band interferences and noise.
After the filter, data are decimated by a factor L = 8. The DEC block provides one sample every 20 ns to the FFT.
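The timing implied by the decimation stage can be worked out as follows. The input sample period is an inference that assumes the FIR filter does not change the sample rate, and the `decimate` helper is a simplified stand-in for the DEC block that keeps one sample out of every L.

```python
DECIMATION = 8            # L
OUTPUT_PERIOD_NS = 20.0   # one sample every 20 ns into the FFT
FFT_POINTS = 4096

# Assumption: the stage feeding DEC runs L times faster than its output.
input_period_ns = OUTPUT_PERIOD_NS / DECIMATION
output_rate_msps = 1e3 / OUTPUT_PERIOD_NS
frame_load_time_us = FFT_POINTS * OUTPUT_PERIOD_NS / 1e3


def decimate(samples, factor=DECIMATION):
    """Keep one sample out of every `factor` samples."""
    return samples[::factor]


print(input_period_ns, output_rate_msps, frame_load_time_us)
print(decimate(list(range(32))))
```

At one sample every 20 ns, loading a full 4096-point frame into the FFT memories takes 81.92 µs.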
Once all the samples are loaded into the FFT module, the calculation of the FFT starts. The FFT module applies a window to the input sequence, then calculates the FFT, and finally calculates the squared magnitude of the FFT (periodogram) to obtain the power of the signals.
VI. EXPERIMENTAL RESULTS
Table IV summarizes the figures of merit of the proposed 4096-point memory-based FFT shown in Fig. 7. It processes four
The implementation has been done on an FPGA. This differs from most memory-based FFTs in the literature, which have been implemented on application-specific integrated circuits (ASICs). A previous memory-based FFT implemented on FPGAs is shown in [5]. The work in [5] requires 2863 slice LUTs, 2992 slice flip-flops (FFs), 24 DSP48E, and 8 BRAM. Four instances of this memory-based FFT would use 11452 slice LUTs, 11968 slice FFs, 96 DSP48E, and 32 BRAM. The numbers of DSP48E and BRAM are comparable to the 98 DSP48E and 31 BRAM of the proposed design. However, in the proposed approach, the use of distributed logic is only 1407 slice LUTs and 1163 slice FFs, compared with the 11452 slice LUTs and 11968 slice FFs in [5]. This is an important saving in terms of distributed logic, and agrees with the goal of reducing distributed logic in the proposed design. Furthermore, with this hardware, the proposed architecture calculates a complex FFT, whereas [5] is only valid for real-valued signals.
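The resource comparison with [5] can be reproduced from the reported figures. The sketch below scales the single-FFT numbers of [5] by four and computes the relative saving in distributed logic; the dictionary layout and variable names are illustrative only.

```python
ref5_single = {"LUT": 2863, "FF": 2992, "DSP48E": 24, "BRAM": 8}
proposed = {"LUT": 1407, "FF": 1163, "DSP48E": 98, "BRAM": 31}

# Four instances of the design in [5], matching the four channels.
ref5_four = {name: 4 * count for name, count in ref5_single.items()}

# Relative saving in distributed logic (slice LUTs and flip-flops).
saving = {name: 1 - proposed[name] / ref5_four[name] for name in ("LUT", "FF")}

for name in ("LUT", "FF", "DSP48E", "BRAM"):
    print(name, ref5_four[name], proposed[name])
print({name: round(100 * value, 1) for name, value in saving.items()})
```

With these numbers, the proposed design uses roughly 88% fewer slice LUTs and 90% fewer slice flip-flops than four instances of [5], at a similar DSP48E and BRAM count.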
VII. CONCLUSION
The proposed 4096-point radix-4 memory-based FFT architecture presents a novel conflict-free access strategy. The new strategy requires the minimum amount of memory and a few multiplexers. This reduces the amount of hardware with respect to the previous radix-4 memory-based FFTs. Furthermore, the proposed FFT has been implemented efficiently on an FPGA making use of the DSP slices. The proposed design requires less distributed logic than the previous results on FPGA, while keeping a comparable amount of DSP slices and BRAM.
