Abstract-This paper presents the fastest fast Fourier transform (FFT) hardware architectures so far. The architectures are based on a fully parallel implementation of the FFT algorithm. In order to obtain the highest throughput while keeping the resource utilization low, we base our design on making use of advanced shift-and-add techniques to implement the rotators and on selecting the most suitable FFT algorithms for these architectures. Apart from high throughput and resource efficiency, we also guarantee high accuracy in the proposed architectures. For the implementation, we have developed an automatic tool that generates the architectures as a function of the FFT size, input word length and accuracy of the rotations. We provide experimental results covering various FFT sizes, FFT algorithms, and field-programmable gate array boards. These results show that it is possible to break the barrier of 100 GS/s for FFT calculation. In this work, we aim for the highest possible throughput. In order to achieve it, we consider the use of fully parallel FFTs [4], [12]-[16]. These architectures calculate an N-point FFT in a continuous flow of P = N samples in parallel. Thus, they correspond to the direct implementation of the FFT flow graph, i.e., each addition/rotation in the flow graph is directly translated into an adder/rotator in hardware. This represents the maximum parallelization that an N-point FFT can have.
I. INTRODUCTION

W
ITH the increasing demand on throughput of current signal processing applications, pipelined parallel FFT architectures have become very popular in the last years [1] - [11] . These architectures are able to process a continuous data flow of several samples in parallel. The main types of parallel pipelined FFTs are multi-path delay commutator (MDC) [1] - [8] and multi-path delay feedback (MDF) [9] - [11] . Both of them allow for high throughput in the range from hundreds of mega samples per second (MS/s) to tens of giga samples per second (GS/s).
In this work, we aim for the highest possible throughput. In order to achieve it, we consider the use of fully parallel FFTs [4] , [12] - [16] . These architectures calculate an N-point FFT in a continuous flow of P = N samples in parallel. Thus, they correspond to the direct implementation of the FFT flow graph, i.e., each addition/rotation in the flow graph is directly translated into an adder/rotator in hardware. This represents the maximum parallelization that an N-point FFT can have.
Fully parallel FFTs are already used in applications such as compensation of chromatic dispersion inherent in optical fibers [12] - [14] and radar [17] . For compensation of chromatic dispersion, filters with samples rates of tens of GS/s are required. The filter length scales more or less linearly with the fiber length. Therefore, for long fibers, hundreds or even thousands of taps are required. An efficient way to implement these filters is to make use of fully parallel FFTs [12] - [14] . For radar applications, the high throughput of fully parallel FFTs allow for object detection over large bandwidths [17] . Another area of interest of fully parallel FFTs is for applications where iterative FFTs are implemented [18] - [20] . Note that in iterative FFTs [18] - [20] , the butterfly of the processing element (PE) usually consists of an r -point fully parallel FFT, where r is the FFT radix. For high radices [19] , a hardware-efficient fully parallel FFT can reduce significantly the hardware cost of the PE in the iterative FFT. Finally, given the high demands of 5G systems due to the use of multiple antennas [21] , fully parallel architectures could be potentially used in future communication systems. Although fully parallel FFTs have been known for a long time, no previous work in the literature addresses in detail the technical challenges of designing fully parallel FFT architectures.
A first technical challenge is the implementation of the rotators as shift-and-add operations. The high parallelization of fully parallel FFTs demands the use of a large number of rotators. Without a proper design of these rotators, the area of the FFT can increase considerably. The fact that all rotators in a fully parallel FFT rotate by a constant angle allows for using advanced shift-and-add constant multiplication techniques to minimize their resource complexity. In our approach, we exploit the use of existing methods for constant multiplications, including single constant multiplication (SCM) [22] , [23] , multiple constant multiplication (MCM) [24] - [28] and constant matrix multiplication (CMM) [29] - [31] . With the aim of designing the most efficient rotators, we also exploit different approaches to implement rotators in hardware. A number of them are discussed in [32] .
A second challenge related to the design of the rotators is the accuracy of the FFT calculations. Here, the coefficient selection [33] , [34] plays an important role in the design of rotators. A good coefficient selection results in coefficients that calculate accurate rotations using few adders [34] . This guarantees high accuracy for the entire FFT.
A third challenge related to rotations is the selection of the FFT algorithms. FFT algorithms based on the Cooley-Tukey approach only differ in the rotations at different stages [35] .
A good selection of the FFT algorithm will reduce the number of rotations and, therefore, the area of the FFT architecture.
A fourth challenge in the design of very high-throughput FFTs is pipelining. High throughput demands deep pipelining. However, with the high parallelization of fully parallel FFTs, pipelining increases the area of the FFT. In order to minimize the amount of pipelining, we have taken into account the depth of the rotators in the FFT. As a result, our rotators only require three clock cycles to calculate the rotations, achieving a critical path of only one adder. This not only increases the clock frequency, but also keeps the pipelining at reasonable levels.
A final challenge is the generation of the FFT architectures automatically. This challenge is a consequence of the implementation of the rotators as shift-and-add. Due to the large number, complexity and variety of rotators, it is unfeasible to implement them by hand or using the generate command in VHDL. As a result, it is needed to create a tool that generates the architectures automatically.
In this paper, we address the previous design challenges in order to achieve very high-throughput FFT architectures. This results in fully parallel architectures with throughput over 100 GS/s, which is the highest throughput for FFTs reported so far. Furthermore, the architectures have high accuracy and a relatively small area cost for the provided throughput.
The paper is organized as follows. In Section II, we review the FFT algorithm and discuss different FFT radices. In Section III, we review key concepts related to rotations in fixed-point arithmetic. In Section IV, we explain the proposed fully parallel FFT architectures. In Section V, we provide experimental results and compare them to previous works. Finally, in Section VI we provide the main conclusions of the paper.
II. THE FFT ALGORITHM
The N-point DFT of an input sequence x[n] is defined as
where
N nk . To calculate the DFT, the FFT based on the Cooley-Tukey algorithmm [36] is mostly used. This reduces the number of operations from O(N 2 ) for the DFT to O(N · log 2 N) for the FFT. Figure 1 shows the flow graph of a 16-point radix-2 2 FFT according to the Cooley-Tukey algorithm, decomposed using decimation in frequency (DIF) [35] . The FFT consists of n = log 2 N stages. At each stage s ∈ {1, . . . , n} of the graph, butterflies and rotations are calculated. The lower edges of the butterflies are always multiplied by −1. These −1 are not depicted in order to simplify the graphs.
The numbers at the input represent the index of the input sequence, whereas those at the output are the frequencies, k, of the output signal X [k] . Finally, each number φ in between the stages indicates a rotation by: As a consequence, samples for which φ ∈ {0,
3N
4 } must be rotated by 0 • , 270 • , 180 • or 90 • , which corresponds to complex multiplications by 1, − j , −1 and j , respectively. These rotations are considered trivial, because they can be carried out by interchanging the real and imaginary components and/or changing the sign of the data.
Different radices only differ in the rotations at the FFT stages [35] , whereas the butterflies are the same. The most common algorithms are radix-2, radix-2 2 , radix-2 3 and radix-2 4 [35] , [37] in their decimation in time (DIT) and DIF versions. For a fully parallel FFT we also consider the split radix algorithm [38] - [40] . The advantage of split radix is its smaller number of non-trivial rotations, whereas the advantage of radices of 2 k , k ≥ 2 is that some of their stages only include trivial rotations. Table I shows the number of non-trivial rotations for split radix and radix-2 k algorithms for different FFT sizes. These numbers are equal to the number of rotators in a fully parallel FFT.
III. ROTATIONS IN FIXED-POINT ARITHMETIC
A rotation in a digital system can be described as [33] where X + jY is the result of the rotation and C and S are obtained from the rotation angle as
where α is the rotation angle, ε c and ε s are the relative quantization errors of the cosine and sine components, respectively, and R is the scaling factor. The output X + jY is also scaled by R. The rotation error for a single constant rotation (SCR) is calculated as [33] 
The effective word length (W L E ) is a measure of the rotator accuracy that indicates the number of output bits that are guaranteed to be accurate. It is obtained from the rotation error as [34] W L E = − log 2 ε
IV. PROPOSED FULLY PARALLEL FFT ARCHITECTURES FFTs do not include circuits for data management. However, it involves other design challenges related to the design of the rotators, the selection of the FFT algorithms and the implementation in VHDL. Next section present these challenges, as well as the proposed solutions.
A. Design of the Rotators
In fully parallel FFTs, rotators take the largest part of the area. As they consist of constant multiplications, the best way to reduce the area of the FFT is to implement them as shift-and-add. Low-depth shift-and-add implementations also reduce the number of adders in series in the rotators. This reduces the amount of pipelining required for high-throughput. Furthermore, the accuracy of the rotators has effect in the accuracy of the entire FFT. In order to achieve accurate FFTs it is necessary to select coefficients with small rotator error. To achieve these goals, we take into account the coefficient selection, we explore different architectures for the rotators and we make use of advanced shift-and-add algorithms. The details are explained next.
1) Coefficient Selection:
The rotation coefficients C + j S are selected by taking into account that a fully parallel FFT has N parallel edges and, therefore, it calculates N rotations in parallel. Some of them are by 0 • or other trivial rotations, and other ones are by non-trivial rotations. As the number of trivial rotations, specially those by 0 • , is large in FFT algorithms, it is convenient that the rotators have unity scaling [34] , which corresponds to scaling by a power-of-two. This scaling can be compensated by a hard-wired bit shift. By doing this, rotations by 0 • do not require any hardware. Fig. 3 shows the coefficient selection for SCR and unity scaling used in the proposed FFTs. For each angle α = −2πφ/N, it searches for the coefficients C + j S that approximate the complex number P O PT (α) = 2 q · (cos α + j sin α) with an accuracy W L E or higher, for q = 1, . . . , W L C − 1, where W L C represents the maximum word length of the rotation coefficients. This corresponds to a circle of radius 2 q ε around P O PT (α). Note that the maximum error ε that defines the circle is calculated from W L E according to (6) . Additionally, we include the condition that the the magnitude of the coefficients must be smaller than or equal to 2 q . This condition guarantees that the result of the rotators will not overflow. As a consequence, the selected coefficients are the complex numbers C + j S with C, S ∈ Z that lie in the shaded region of Fig. 3 .
The reason to do this selection is to have a set of coefficients with an accuracy W L E or higher. It would be easier to obtain a single coefficient by rounding the real and imaginary part of P O PT . However, to have a set of coefficients allows us to explore which of them can be calculated by using the smallest number of adders. This enables further optimization while meeting the accuracy requirements.
2) Exploration of Rotator Implementations: Equation (3) defined a rotation as a complex constant multiplication, which can be calculated in different ways. The most straightforward approach is to do the calculations according to (3) , which requires 4 multipliers and 2 adders. Other alternatives with 3 adders and 3 multipliers [41] are also possible. The 3-multiplier cases are shown in Figs. 4(a) and 4(b) , whereas the 4-multiplier one is shown in Fig. 4(c) . Table II shows how to calculate the number of adders for the different constant rotators in Fig. 4 . The most general case is a rotation by P = C + j S. However, when C = S, C = 0 or S = 0, the rotators can be simplified, leading to the costs in columns three to five in the table.
3) Shift-and-Add Implementation: The constant multiplications in the rotator architectures in Fig. 4 are implemented as shift-and-add. For the real constant multiplications in Figs. 4(a) and 4(b) we make use of the SCM algorithms in [22] and [23] . For the constant multiplications in Fig. 4(c) we evaluate different MCM techniques [24] - [28] . Finally, we address the calculation in Fig. 4(d) as a CMM problem [31] . The exact algorithms that we consider for the implementation of the rotators are shown in Table III .
4) Selection of the Best Rotators:
For each rotation angle in the FFT, the previous steps provide a number of coefficients Fig. 4 , which are evaluated according to a number of shift-and-add algorithms. Among all these cases, Table IV shows the best coefficients that we obtain for each rotation angle. The coefficients in the table are selected as those that require the smallest number of adders for an accuracy of at least W L E bits using coefficients of up to W L C bits. Different cases for W L E and W L C are presented in the table.
The angles φ = 0, . . . , 32 in the first column of Table IV  correspond The last row in Table IV shows the average number of adders among all the angles in the column. These numbers are compared in Table V which justifies the selection of the best case for each angle. Furthermore, we can get an estimate of the savings of the proposed approach in terms of adders with respect to conventional approaches. With respect to canonical-signed-digit (CSD), the proposed approach saves 25% 30%, 35% and 36% of the adders for W L E 8, 12, 16 and 20, respectively. Likewise, with respect to the sum of the bits that are equal to '1' in the binary representation (BIN), the proposed approach saves 52%, 55%, 57% and 60% of the adders for the same cases. These saving have direct impact on the area of the FFT.
B. Exploration of FFT Algorithms
As FFT algorithms differ in the rotations throughout the FFT stages, to design the fully parallel FFT we have explored all the FFT algorithms that can be represented by the binary tree representation [37] . Furthermore, we have considered the split radix algorithm [38] - [40] , as it leads to the smallest number of non-trivial rotations.
The exploration consist of obtaining the φ values required at the stages of each FFT algorithm, and then sum the adder cost of the rotators that each algorithm includes. The φ rotations are calculated according to [37] , and the adder costs are taken from Table IV. The exploration of the FFT algorithms leads to the following conclusions:
• The number of adders is approximately proportional to the number of non-trivial rotations (NTR), according to
Thus, the FFT algorithms with the smallest number of non-trivial rotations (refer to Table I ) also lead to the smallest number of adders. These algorithms are the split radix algorithm and the algorithms based on radix-2 4 and radix-2 2 .
• Among all the binary tree FFT algorithms, radix-2 is the algorithm with the highest cost in terms of non-trivial rotations and in terms of adders.
• The DIF and DIT decompositions of the algorithms require the same number of adders. The reason is that DIF and DIT use the same rotations: The rotations at stage s = 1, . . . , n − 1 of the DIF decomposition appear at stage n − s of the DIT version. Based on these observations, we have collected in Table VI  the By considering that the total number of real adders in the butterflies of the FFT is
the total number of adders in fully pipelined FFTs is obtained as the sum of equations (7) and (8), i.e.,
This equation may be useful for other FFT algorithms and/or to estimate other values of W L E that are not collected in Table VI .
C. Implementation
To implement the proposed architectures we have developed tools that automate this task. The reason for this is that it is not feasible to implement all the rotators by hand. Fig. 6 shows the block diagram of the complete tool flow used to implement the architectures. First, Matlab is used to calculate the φ values for all the rotations of a given FFT algorithm. These values are stored in a text file. Second, Matlab together with algorithms implemented in C++ are used to obtain the adder graphs of the best rotators in Table IV . An adder graph is a directed acyclic graph in which each vertex, except the input, represents an adder/subtractor computing a certain multiple of the input. The adder graphs are stored in a text file containing standardized strings for each required rotator. The text files for the φ values and the adder graphs only have to be generated once and serve as a database for the code generator.
For the implementation of the fully parallel FFT, we have developed a VHDL code generator with the help of the FloPoCo library [42] . FloPoCo provides an environment to easily develop VHDL code generators for arithmetic circuits in C++. We made our implementation available as open source in the "uni_ks" branch of the FloPoCo Git repository [43] .
The configuration parameters that have to be provided to the code generator are the FFT size, N, the input word length W L I , and the effective word length of the rotators, W L E . The rotation matrix with the φ values of the FFT algorithm, and the adder graph are then looked-up in the text files.
The output word length, W L O , has been selected to be the same as the input word length W L I . Therefore, the butterfly additions/subtractions are done by using W L I + 1 bits. The result of each butterfly is provided to the rotator after it. The rotation is calculated without truncating any bit. After each rotator, the data word length is reduced to the W L I most significant bits by truncation. These W L I bits are provided to the butterfly of the next stage.
The FFT output is generated in bit-reversed order. However, as the architecture is fully parallel, all the outputs are provided in parallel. Therefore, the bit reversal is hard wired during code generation and does not have any hardware cost.
A key to high-performance circuits on FPGAs is pipelining. Therefore, a register is placed after each butterfly adder. Moreover, the multiplierless rotators are fully pipelined, which means that each adder is followed by a register. To get a valid pipeline in stages with different rotator depths, balancing registers are placed after rotators with shorter pipeline depth. This is done by using the pipeline framework of FloPoCo.
As an example, the VHDL for the 16-point radix-2 2 fully parallel FFT architecture in Fig. 2 , which corresponds to the case marked with (b) in Tables VI and VII , is obtained by the following FloPoCo call flopoco FullyParallelFFT wIn=16 wE=16 N=16 alg=BT where wIn and wE stand for W L I and W L E , respectively, N specifies the FFT size and alg=BT indicates that the selected FFT algorithm is the best binary tree algorithm. Alternatively, the split radix algorithm is selected by setting alg=SR. For all designs and FPGA devices, the experimental results include area, performance in terms of f CLK , latency and throughput, dynamic power consumption and signal-toquantization-noise ration (SQNR) [44] . All power results are obtained from a worst-case 50% input activity using XPower (Virtex 5 and 6) and Vivado's power estimator (Ultrascale). To evaluate the accuracy, the SQNR was determined by running intensive simulations of the generated VHDL.
V. EXPERIMENTAL RESULTS
For the same N and the same FPGA, the results for split radix and for the best binary tree algorithm are very similar in terms of the number of slices. The main difference between them is in the smaller latency of the latter. The reason is that the binary tree algorithm only has trivial rotators in odd stages, which reduces the number of pipelining registers in those stages. Conversely, the split radix algorithm has nontrivial rotators in all the stages, which leads to extra pipelining registers, many of them forming shift registers. Based on this, it could be expected that the split radix FFT requires a larger number of FF. However, in the FPGAs used here, shift registers are implemented by using LUTs [45] . This also explains why some split radix FFTs use more LUTs than the corresponding binary tree FFT, even though the number of adders is smaller in split radix.
Compared to previous works for N = 16 to 64 and for the same FPGA, the main advantage of the proposed FFTs is the reduction of hardware resources. Whereas previous approaches require a large number of DSPs, the proposed designs do not use any DSP. The cost is an increase in the number of slices that is small compared to the amount of DPSs that are saved.
Compared to previous 256-point fully parallel FFTs, the proposed architectures achieve more than double throughput than previous approaches, reaching 138.5 GS/s. For a word length of 16 bits both for the real and imaginary parts of the data, this corresponds to 4.4 Tbit/s. To the best of the authors' knowledge, this is the highest throughput for FFT architectures reported so far. Regarding area, the proposed 256-point fully parallel FFTs are more efficient than previous approaches in terms of LUTs. Note that the word length of the proposed architectures is 33% larger than the word length in [15] , whereas the number of LUTs is only 2% larger. With respect to FFs, the proposed architectures use a large number of them due to the deep pipelining that they have, which allows for the extremely high values of throughput.
The power consumption of the proposed designs grows proportional to the size of the architectures. Although values around 30 W for N = 256 may seem large, they are not when normalizing by the number of bits and frequency. By doing this, we obtain:
Finally, the SQNR of the proposed architectures is high and approximates to the following equation:
Note also that the SQNR of the proposed 256-point FFTs is 17.5 dB larger than that in [15] . This improvement comes from the use of 16 bits for data and for W L E . 
VI. CONCLUSIONS
In this paper, we have presented the fastest FFT architectures so far. The architectures correspond to a fully parallel implementation of the FFT. The design of these architectures involves the design of hardware-efficient and accurate shiftand-add rotators, the selection of rotator-efficient FFT algorithms, a proper pipelining of the architecture and the design of a tool that generates the architectures automatically. Experimental results for the implemented FFTs lead to throughput rates up to 138.5 GS/s, which breaks the barrier of 100 GS/s. approaches for shift-and-add constant multiplications.
APPENDIX SYMMETRIES IN FFT ROTATIONS
Throughout the paper, all rotators were considered only for angles in the range α ∈ [−45 • , 0 • ]. The rotator for the full angle range α ∈ [−180, 180 • ] can be constructed from the solution of the limited range rotator by using either trivial rotations (by ±90 • or 180 • ) and/or mirroring the rotator among the real axis. As a consequence, some outputs of rotators have to be negated or swapped. Negation is done by shifting the negation of the output to the input of adders/subtractors of the subsequent butterfly. Swapping is done by simply re-wiring the outputs during VHDL generation. Table VIII shows the rules on how to obtain the complex rotator constant. From a given full range angle α = −2πφ/N or its corresponding φ, the reduced range α and φ are obtained as given in Table VIII . Next, the complex rotator coefficient for the limited range is obtained for a given accuracy from best rotators in Table IV and transformed according  to the last column of Table VIII. Take, e.g., a rotation by φ = 116 of an N = 256 point FFT which has a corresponding angle of α = −163.125 • . Its reduced angles are obtained by Table VIII to 
