Abstract-In this brief, we propose a novel approach to implement multiplierless unity-gain single-delay feedback fast Fourier transforms (FFTs). Previous methods achieve unity-gain FFTs by using either complex multipliers or nonunity-gain rotators with additional scaling compensation. Conversely, this brief proposes unity-gain FFTs without compensation circuits, even when using nonunity-gain rotators. This is achieved by a joint design of rotators, so that the entire FFT is scaled by a power of two, which is then shifted to unity. This reduces the amount of hardware resources of the FFT architecture, while having high accuracy in the calculations. The proposed approach can be applied to any FFT size, and various designs for different FFT sizes are presented.
I. INTRODUCTION
In today's digital signal processing (DSP) world, there is often a need to convert signals between time and frequency domains. For this reason, the fast Fourier transform (FFT) has become one of the most important algorithms in the field. In order to calculate the FFT efficiently, various hardware architectures have been proposed. When high performance is required, feedback [1] - [8] and feedforward [9] , [10] hardware FFT architectures are attractive options, as they offer high throughput capabilities.
Single-delay feedback (SDF) FFT architectures [1] - [8] consist of a series of stages that process one sample per clock cycle. Each stage contains a butterfly and a rotator. The butterfly calculates additions, and the rotator carries out rotations in the complex plane by given rotation angles, called twiddle factors [11] .
Compared with the additions of the butterfly, rotations are more costly operations. For this reason, different approaches to implement rotators have been proposed in the past. The most straightforward approach is to use a complex multiplier [12] , which consists of four real multipliers and two adders. In addition, it requires a memory to store the twiddle factors. Another option is to implement the rotators as shift-and-add operations. Following this idea, the CORDIC algorithm [13] - [16] breaks down the rotation angle into several successively smaller angles and rotates each of them with a fixed shift-and-add network. Another alternative is to use multiplier-based shift-and-add rotators [11] , [17] - [20] . By using techniques, such as multiple constant multiplication [21] - [23] , these rotators carry out the rotation by reducing the complex multiplier to shift-and-add operations.
Among all these alternatives, multiplier-based shift-and-add rotators [11] , [17] - [20] are the most efficient option for small sets of twiddle factors, whereas the CORDIC-based rotators [13] - [16] are the best alternative for large ones. However, CORDIC-based approaches and some multiplier-based shift-and-add rotators [11] , [19] scale the output by a scaling factor R ≠ 1. This scaling allows for more accurate and hardware-efficient rotations [12] . In order to achieve unity gain, previous works have compensated the scaling factor by adding a scaling stage to the rotator [14] - [16] . However, this increases the usage of hardware resources.
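As a toy illustration of a nonunity-gain rotator (it is not one of the designs of this brief, and the function name is hypothetical), multiplying by 1 − j rotates by exactly −45° with only two additions, at the price of a scaling R = √2. Cascading two such rotators gives a total scaling of 2, i.e., a power of two that can be removed by moving the binary point.

```python
import math

def rotate_minus_45_scaled(x, y):
    """Multiply x + jy by (1 - j) using two additions and no multiplier.

    (x + jy)(1 - j) = (x + y) + j(y - x): an exact -45 degree rotation
    whose magnitude is scaled by sqrt(2) instead of 1.
    """
    return x + y, y - x

x, y = 3, 4
x1, y1 = rotate_minus_45_scaled(x, y)      # one rotator: scaling sqrt(2)
x2, y2 = rotate_minus_45_scaled(x1, y1)    # two rotators: scaling 2

gain_one = math.hypot(x1, y1) / math.hypot(x, y)
gain_two = math.hypot(x2, y2) / math.hypot(x, y)
angle_one = math.degrees(math.atan2(y1, x1) - math.atan2(y, x))
```

Here gain_two is a power of two, so no compensation circuit is needed after the second rotator, whereas gain_one alone would require one.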
In this brief, we present novel multiplierless unity-gain SDF FFTs. They are obtained by designing the rotators in all FFT stages simultaneously, so that the output of the FFT has unity gain. Thus, the proposed approach requires neither costly unity-gain rotators nor circuits to compensate the scaling. This reduces the complexity of the FFT rotators and guarantees unity gain for the FFT. In this brief, we study different FFT sizes and propose suitable solutions for each size. A comparison in terms of the number of adders, together with experimental results, shows the advantages of the proposed approach with respect to previous works.
This brief is organized as follows. In Section II, we give an introduction to the SDF architecture and the FFT twiddle factors. In Section III, we explain the proposed approach and provide optimized architectures for different FFT sizes. In Section IV, we compare the proposed architectures to previous ones. In Section V, we present the experimental results, and in Section VI, we summarize the main conclusions of this brief.
II. BACKGROUND

A. SDF FFT Hardware Architecture
The SDF FFT is one of the most attractive and widely used FFT architectures. It allows for high throughput and requires a low amount of resources. An example of an SDF FFT for N = 64 points is shown in Fig. 1 . It consists of n = log_2(N) stages. Each stage includes a radix-2 (R2) butterfly and a rotator. The internal structure of an SDF stage is shown in Fig. 2 . It consists of a butterfly, two complex multiplexers, a buffer, a rotator, and a rotation memory. Additional pipelining registers can be used to reduce the critical path and increase the throughput.
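The stage structure (a butterfly followed by a rotator) can be sketched with a behavioral model. The code below is only an illustration under simplifying assumptions: it computes a plain radix-2 DIF FFT block by block, ignores the SDF buffers, multiplexers, and sample-serial timing, and uses floating-point rotations rather than shift-and-add rotators. All function names are hypothetical.

```python
import cmath

def dif_fft_stages(x):
    """Radix-2 DIF FFT computed stage by stage (butterfly, then rotation).

    len(x) must be a power of two. The output is in bit-reversed order.
    """
    n = len(x)
    data = list(x)
    half = n // 2
    while half >= 1:
        L = 2 * half  # number of angles handled by the rotator of this stage
        for start in range(0, n, L):
            for k in range(half):
                a = data[start + k]
                b = data[start + k + half]
                data[start + k] = a + b                   # butterfly sum
                w = cmath.exp(-2j * cmath.pi * k / L)     # twiddle factor W_L^k
                data[start + k + half] = (a - b) * w      # difference, then rotation
        half //= 2
    return data

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * f / n) for t in range(n))
            for f in range(n)]

samples = [complex(t, -t % 3) for t in range(8)]
staged = dif_fft_stages(samples)
reference = dft(samples)
max_err = max(abs(staged[bit_reverse(f, 3)] - reference[f]) for f in range(8))
```

For N = 8 the three stages use L = 8, 4, and 2, so the last stage performs no actual rotation, in line with an FFT that has n stages but only n − 1 rotators.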
B. FFT Twiddle Factors
In the FFT rotators, each input is rotated by a different angle. This angle is determined by the twiddle factor

$$W_L^{\phi} = e^{-j\frac{2\pi}{L}\phi}. \tag{1}$$

The parameter L is constant for each stage and represents the number of different rotation angles in that stage. These angles are the result of dividing the circumference into L equal parts. Among them, the specific rotation angle is determined by the parameter φ, which is a natural number in the range [0, L − 1].
The complexity of the rotator depends on the number of angles that it has to rotate, L. The simplest rotator is W L = W 4 . This twiddle factor only includes trivial rotations (0°, 90°, 180°, and 270°). Trivial rotations are characterized by the fact that they can be calculated by simply exchanging the real and imaginary parts of the inputs and/or changing their sign.
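The following sketch illustrates why W_4 rotations are trivial. It assumes the usual FFT sign convention, in which the twiddle factor is e^{-j2πφ/L}; the function names are hypothetical.

```python
import cmath

def twiddle(L, phi):
    """Twiddle factor W_L^phi under the convention exp(-j*2*pi*phi/L)."""
    return cmath.exp(-2j * cmath.pi * phi / L)

def trivial_rotate(x, y, phi):
    """Rotate x + jy by W_4^phi using only swaps and sign changes."""
    phi %= 4
    if phi == 0:
        return x, y        # multiply by  1
    if phi == 1:
        return y, -x       # multiply by -j
    if phi == 2:
        return -x, -y      # multiply by -1
    return -y, x           # multiply by +j

# Compare against a full complex multiplication for all four angles.
checks = [
    abs(complex(*trivial_rotate(3, 4, phi)) - (3 + 4j) * twiddle(4, phi))
    for phi in range(4)
]
```

No adder or multiplier is involved in trivial_rotate, which is why W_4 stages contribute no rotator cost.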
For power-of-two FFT sizes, L is also a power of two and its value ranges from 4 to N. The specific value for each stage depends on the radix and decomposition of the FFT. The most typical decompositions are decimation in time and decimation in frequency (DIF) [24] . Nowadays, the most common radices are radix-2^k [9] , [25] . Table I shows the typical layouts for the 64-point DIF FFT in Fig. 1 . Note that the twiddle factors for radix-2^2, 2^3, and 2^4 are simpler than those in radix-2, which leads to simpler rotators. Also note that the radix-2^k SDF architectures only differ in the twiddle factors. The rest of the SDF architecture is independent of the algorithm.
III. PROPOSED APPROACH
A. Problem Formulation
The research problem solved in this brief can be formulated as follows. On the one hand, when implementing an FFT, we can choose among a large number of FFT algorithms [25] . On the other hand, there are numerous approaches to implement the rotators that calculate each twiddle factor. These implementations need different numbers of adders and have different gains. However, we aim for unity gain, so that the FFT outputs are not scaled. The research question is then: How can an SDF FFT be designed with the smallest number of adders and unity gain?
B. Design Method
Based on the previous considerations, the following method obtains multiplierless unity-gain SDF FFTs. The method is based on a joint selection of the FFT algorithm and the twiddle factors of the FFT stages.
Step 1: Select suitable FFT algorithms. All the possible FFT algorithms are discussed in [25] . These algorithms differ in the twiddle factors at the different FFT stages, as shown in Table I . The selection of a suitable algorithm is done based on the twiddle factors and on the type of rotators that are involved. Regarding the type of rotators, nowadays the CORDIC-based approaches are the best for large twiddle factors (W 64 and larger), and CCSSI for small twiddle factors (W 8 , W 16 , and W 32 ) [11] . Both of them lead to multiplierless rotators, i.e., they only use shift-and-add operations.
The criterion used to select the algorithm is to minimize the number of large twiddle factors calculated by the CORDIC, which are the most costly ones, and maximize the number of trivial rotators (W 4 ), which are the cheapest ones. There are also tradeoffs between the number of large (W 64 and larger) and small (W 8 , W 16 , and W 32 ) twiddle factors. In such cases, various FFT algorithms are considered. For instance, a 1024-point radix-2^5 FFT includes one large twiddle factor and four small ones, and a 1024-point radix-2^4 FFT has two large twiddle factors and two small ones. Here, both cases are attractive.
Step 2: Determine the scaling of the rotators. This is done by requiring that the total scaling of the FFT is a power of two, which can be reduced to unity by just considering that the binary point is in a different position of the binary representation [11] , that is

$$R_{tot} = \prod_{s} R_s \approx 2^{k}, \quad k \in \mathbb{Z}. \tag{2}$$

The scaling of the CORDIC is constant and cannot be modified. Thus, the scaling of those stages in which the CORDIC is used is considered to be R_s = R_C, where R_C is a constant. Specifically, R_C = 1.647 in the conventional CORDIC [13] and R_C = 1.164 in the memoryless CORDIC [16] , when considering seven or more microrotation stages. For trivial rotators (W 4 ), the scaling is R_s = 1. As a result, in order to achieve unity gain for the FFT, the scaling for small rotators needs to be

$$\prod_{s} R_s \approx \frac{2^{k}}{R_C^{N_C}} \tag{3}$$

where N_C is the number of stages where CORDIC rotators are used, and the product over s refers to the stages in which small rotators are used.
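The constraint on the small rotators can be explored numerically. The sketch below assumes the standard gain formula of the conventional CORDIC, i.e., the product of sqrt(1 + 2^(-2i)) over the microrotations, which gives the value 1.647 quoted above for seven stages. The function names and the range of exponents k are illustrative choices, not part of the method.

```python
import math

def cordic_gain(num_microrotations):
    """Gain of a conventional CORDIC with microrotations i = 0, 1, ..."""
    gain = 1.0
    for i in range(num_microrotations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return gain

def small_rotator_targets(r_c, n_c, exponents=range(-2, 5)):
    """For each k, the product of small-rotator scalings that yields 2**k.

    r_c is the CORDIC scaling and n_c the number of CORDIC stages.
    """
    return {k: 2.0 ** k / r_c ** n_c for k in exponents}

r_c = cordic_gain(7)                       # approximately 1.647
targets = small_rotator_targets(r_c, n_c=1)
```

With one CORDIC stage, for example, targets[1] is the product of scalings that the small rotators must provide so that the whole FFT is scaled by 2.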
Step 3: Design the rotators for the small twiddle factors, so that the total scaling fulfills (2). This is done by applying the method for the CCSSI rotators [11] in the following way. First, each of the rotators is designed considering the case of a multiple constant rotator with uniform scaling. This means that the scaling of each of the rotators is not fixed yet, and a list of candidates for each rotator is obtained. Then, different combinations of rotators for the different stages are evaluated in terms of number of adders, rotation error, and R tot . This provides the most efficient designs.
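The evaluation of combinations in Step 3 can be sketched as an exhaustive search. This is only an illustration: the candidate lists below are HYPOTHETICAL placeholders (they are not taken from Table II or from [11]), the tolerance is an arbitrary choice, and the closest power of two is taken on a logarithmic scale.

```python
import itertools
import math

# HYPOTHETICAL candidates per stage: (name, scaling R_s, adders, rotation error).
CANDIDATES = {
    "W8": [("W8-a", 1.4142, 2, 0.0), ("W8-b", 1.0000, 4, 2.0e-3)],
    "W16": [("W16-a", 1.2300, 4, 1.0e-3), ("W16-b", 1.4100, 6, 3.0e-4)],
}

def gain_after_shift(r_tot):
    """Normalize r_tot by the closest power of two (on a log scale)."""
    k = round(math.log2(r_tot))
    return r_tot / 2.0 ** k

def rank_combinations(candidates, r_c=1.0, n_c=0, tol=0.02):
    """Keep combinations whose normalized gain is within tol of 1.

    Results are sorted by total adders, then by worst rotation error.
    """
    results = []
    for combo in itertools.product(*candidates.values()):
        r_tot = r_c ** n_c * math.prod(c[1] for c in combo)
        gain = gain_after_shift(r_tot)
        if abs(gain - 1.0) <= tol:
            adders = sum(c[2] for c in combo)
            worst_error = max(c[3] for c in combo)
            results.append((adders, worst_error, gain, [c[0] for c in combo]))
    return sorted(results)

ranked = rank_combinations(CANDIDATES)
```

With these placeholder values only one combination survives: the two scalings multiply to roughly 2, although neither rotator has unity gain on its own.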
Step 4: Select the architecture that offers the most suitable tradeoff between area and accuracy. Apart from the rotation error, in this step, other accuracy measures, such as the SQNR or the Frobenius norm (FN) [12] , may be evaluated depending on the demands of the target application.
C. Proposed Unity-Scaled FFT Architectures
In Table II, the rotators W 8 , W 16 , and W 32 are detailed in columns two to six. They show the size of the rotator, its coefficients, scaling (R_s), rotation error (ε), and number of adders of the rotator (Add.), as defined in [11] . The second last column shows the number of adders of the entire FFT architecture. The total number of real adders for the FFT is calculated as

$$\text{Adders} = 4 \log_2 N + \sum_{s} \text{Add.}_s. \tag{4}$$

The first term is the number of adders in the butterflies, i.e., four real adders per radix-2 butterfly. The second term is the number of adders in the rotators, as given in Table II . The last column of Table II shows the FFT gain, G_FFT. This is equal to R_tot normalized to unity by dividing it by the closest power of two

$$G_{FFT} = \frac{R_{tot}}{2^{k}}, \tag{5}$$

where 2^k is the power of two closest to R_tot.
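The two figures above can be sketched in a few lines. The code assumes that the total adder count is the sum of four real adders per radix-2 stage and the adders of every rotator, and that the closest power of two is taken on a logarithmic scale. The function names and the numeric inputs are hypothetical. The last helper illustrates how a scaling by 2^k is removed by reading the same fixed-point word with the binary point moved by k positions.

```python
import math

def total_adders(n_points, rotator_adders):
    """Four real adders per radix-2 butterfly plus the adders of every rotator."""
    stages = n_points.bit_length() - 1      # log2(N) for a power-of-two N
    return 4 * stages + sum(rotator_adders)

def fft_gain(r_tot):
    """Split r_tot into (G_FFT, k) so that r_tot = G_FFT * 2**k, G_FFT near 1."""
    k = round(math.log2(r_tot))
    return r_tot / 2.0 ** k, k

def to_real(raw, frac_bits):
    """Interpret the integer raw as fixed point with frac_bits fractional bits."""
    return raw / 2.0 ** frac_bits

# Hypothetical 64-point FFT with five rotators, two of them nontrivial.
adders = total_adders(64, [0, 4, 0, 6, 0])

# Hypothetical total scaling of 4.02: the gain is 1.005 after a shift by k = 2.
gain, k = fft_gain(4.02)

# The same output word read with k extra fractional bits is unscaled.
raw = 1029
scaled_value = to_real(raw, 8)        # includes the factor 2**k
unity_value = to_real(raw, 8 + k)     # binary point moved, no hardware needed
```

Moving the binary point is a matter of interpretation only, which is why a power-of-two total scaling costs no adders.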
In all the proposed designs, the FFT gain is approximately equal to 1, which meets the requirement of unity scaling. Furthermore, the proposed FFTs do not use complex multipliers, as all the rotators are implemented as shift-and-add. Table II offers a tradeoff between total number of adders and rotation error. For instance, the proposed radix-2^3 512-point FFT has low rotation error in all its rotators and requires a total of 94 adders. By contrast, the radix-2^4 512-point FFT has higher rotation error in some rotators, but only requires 80 adders. Thus, the proposed method offers a tradeoff between the area and the accuracy.
IV. COMPARISON
Table III compares the proposed designs to the previous SDF FFT architectures, where all of them process one sample per clock cycle in pipeline. Table III includes the hardware resources for a general N, as well as for the particular case of N = 1024. Some previous designs use advanced radices, such as radix-2^2 and radix-2^4, and use complex multipliers for the rotations [1] - [4] . This requires four real multipliers per rotator [1] , [2] . Alternatively, three real multipliers per rotator have been used at the cost of increasing the number of adders [3] , [4] . Other previous designs propose multiplierless solutions that remove the overhead of the multipliers by using only shift-and-add operations for the rotators [5] , [6] .
The comparison shows that the proposed multiplierless unity-gain FFTs need fewer adders than the previous SDF FFT architectures. This holds even when substituting multiplier-based rotators by CORDIC rotators in current efficient designs ( ).
V. IMPLEMENTATION
The two designs marked with a star (*) in Table II have been implemented. Fig. 3 shows the internal structure of the rotators used in both designs. Both of them consist of six adders, and 14 and 16 multiplexers, respectively. The shifts shown in Fig. 3 are hard wired, and thus, they do not have any hardware cost. Registers are used in order to pipeline the rotators and allow for a higher clock frequency.
Post place-and-route results for the Virtex-7 XC7VX330T field-programmable gate array (FPGA) (V7 in Table IV) are shown in Table IV . The proposed architectures achieve high clock frequencies, use a small number of slices, and do not need any BRAM or DSP block. For comparison to previous 1024-point FFTs, Table IV also presents the results on a Virtex-4 XC4VSX55 (V4 in Table IV) . These results agree with the conclusions from Table III. Compared with [5] , the proposed design reduces the number of slices due to the lower number of adders and multipliers. Compared with [4] , the proposed design uses more slices but removes the need for BRAMs and DSP blocks. Finally, the proposed architectures achieve higher clock frequency compared with the previous architectures.
Table V shows the experimental results on application-specific integrated circuits (ASICs). The proposed design shows good figures of merit in terms of clock frequency, area, power consumption, energy, and SQNR, where the proposed approach is superior in most parameters. Furthermore, the area and power consumption of the proposed design could be further reduced by removing the pipelining used to increase the clock frequency. Normalized area and energy are calculated as
Energy (J/sample) = Power consumption (Tech./65 nm) × f CLK × N .
In order to quantify the improvement with respect to a usual FFT, the radix-2^4 1024-point FFT has been compared with a radix-2^4 1024-point FFT that uses multipliers instead of the proposed rotators. On FPGAs, the latter requires 2235 slices and achieves a clock frequency of 385 MHz, i.e., 37% more area than the proposed approach and 35% less clock frequency. On ASICs, the comparison is shown in Table V . The proposed FFT takes up 0.28 mm^2 at 990 MHz, whereas the radix-2^4 FFT that uses multipliers takes up 0.37 mm^2 at 629 MHz. This is 34% less area and 57% more frequency for the proposed design. Power consumption and energy are 24% smaller in the proposed FFT. Finally, regarding accuracy, the SQNR of the proposed approach is 48.7 dB and the use of multipliers leads to 51.3 dB. Thus, there is a drop of half a bit in SQNR in the proposed design, but it is still at a very high level. Likewise, the FN [12] of both approaches is very similar: FN = −112.92 dB for the proposed approach and FN = −114.26 dB for the use of multipliers.
VI. CONCLUSION
This brief shows how to design multiplierless unity-gain SDF FFT architectures. The proposed architectures are not only multiplierless and achieve unity gain, but also require the smallest number of adders among current SDF FFTs.
The proposed architectures achieve good figures of merit in terms of clock frequency, area, power consumption, energy, SQNR, and FN.
