I. INTRODUCTION
THE fast Fourier transform (FFT) is one of the most important algorithms in signal processing. Many hardware FFT architectures have been proposed, with the aims of speeding up the calculation of the FFT and reducing the amount of hardware resources.
Pipelined FFT architectures are the most common ones [1]-[9]. They process a continuous flow of data using a relatively small amount of resources. There are two main types of pipelined FFTs: serial pipelined FFTs process one sample per clock cycle, whereas parallel pipelined FFTs process several samples in parallel per clock cycle.
Parallel FFT architectures have been widely developed. Nowadays, there exist multipath delay commutator FFT architectures that use the minimum amount of butterflies and memory, with 100% utilization ratio [1], as well as an efficient use of rotators.
Conversely, serial pipelined FFTs have not reached the efficiency of parallel ones yet. Typical radix-2 single-path delay feedback (SDF) FFTs [2] have a utilization ratio of 50% in butterflies and rotators. Other radices such as radix-4 [3], [4] and radix-$2^2$ [2] improve the use of rotators. However, they do not improve the efficiency of butterflies. Single-delay commutator (SDC) FFTs [5]-[8] improve the efficiency of butterflies at the cost of larger memory. The same happens to the locally pipelined FFT [9]. Therefore, in all cases, there is a trade-off among butterflies, rotators, and memory. This brief presents the serial commutator (SC) FFT. The SC FFT uses a novel data management based on circuits for bit-dimension permutation of serial data. The resulting SC FFT is the first one that requires the theoretical minimum amount of butterflies, rotators, and memory with 100% utilization.
This brief is organized as follows. Section II reviews the FFT algorithm. Section III studies the theoretical boundaries of the hardware resources. Section IV presents in detail the SC FFT. Section V shows the case of natural I/O order. Section VI compares the proposed architectures to previous ones. Section VII presents the experimental results. Finally, Section VIII summarizes the main conclusion of this brief.
II. FFT ALGORITHM
The $N$-point DFT of an input sequence $x[n]$ is defined as

$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1,$$

where $W_N^{nk} = e^{-j(2\pi/N)nk}$. In order to compute the DFT efficiently, the FFT based on the Cooley-Tukey algorithm [10] is used most of the time. The FFT reduces the number of operations from $O(N^2)$ for the DFT to $O(N \log N)$. Fig. 1 shows the flow graph of a 16-point radix-2 FFT decomposed according to decimation in frequency (DIF) [11]. The FFT is calculated in a series of $n = \log_\rho N$ stages, where $\rho$ is the base of the radix $r$ of the FFT, i.e., $r = \rho^\alpha$. In the figure, the numbers at the input represent the index of the input sequence, whereas those at the output are the frequencies $k$.
1549-7747 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
At each stage of the graph, $s \in \{1, \ldots, n\}$, butterflies and rotations are calculated. Specifically, each number $\phi$ in between the stages indicates a rotation by

$$W_N^{\phi} = e^{-j(2\pi/N)\phi}.$$

As a consequence, if $\phi = 0$, no rotation needs to be carried out. Likewise, rotations by $\phi \in \{N/4, N/2, 3N/4\}$ are trivial. This means that they can be carried out in hardware simply by interchanging the real and imaginary components and/or changing the sign of the data.
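The stage structure and the trivial rotations described in this section can be illustrated with a minimal software sketch. It is only a behavioural model of a radix-2 DIF FFT, not the hardware architecture; the function names `rotate` and `fft_dif` are hypothetical, and the rotation index `phi` follows the convention of a rotation by $e^{-j(2\pi/N)\phi}$.

```python
import cmath

def rotate(value, phi, N):
    """Multiply value by exp(-j*2*pi*phi/N)."""
    if (4 * phi) % N == 0:
        # Trivial rotation: a multiple of -90 degrees needs no multiplier.
        re, im = value.real, value.imag
        for _ in range((4 * phi // N) % 4):
            re, im = im, -re      # times -j: swap parts, negate the new imaginary part
        return complex(re, im)
    return value * cmath.exp(-2j * cmath.pi * phi / N)

def fft_dif(x):
    """Radix-2 DIF FFT computed stage by stage; len(x) must be a power of two."""
    N = len(x)
    n = N.bit_length() - 1        # number of stages, log2(N)
    data = list(x)
    for s in range(1, n + 1):
        span = N >> s             # distance between the two butterfly inputs
        for base in range(0, N, 2 * span):
            for i in range(span):
                a, b = data[base + i], data[base + i + span]
                data[base + i] = a + b
                data[base + i + span] = rotate(a - b, i << (s - 1), N)
    # DIF leaves the outputs in bit-reversed order; undo it here.
    return [data[int(format(k, f"0{n}b")[::-1], 2)] for k in range(N)]
```

In this sketch the last stage only sees `phi = 0` and the stage before it only sees `phi` in {0, N/4}, so neither of them reaches the general multiplication.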
III. THEORETICAL BOUNDARIES
The SC FFT is based on a simple observation. In Fig. 1, each stage calculates $N$ complex additions and $N/2$ rotations. Therefore, any radix-2 FFT architecture that processes one sample per clock cycle only needs a complex adder and half a rotator per stage. This leads to $\log_2 N$ butterflies and $\log_4 N - 1$ rotators for the entire FFT, considering that the last stage does not have a rotator. These are the theoretical minimum amounts of resources for any radix-2 FFT that processes one sample per clock cycle.
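As a quick arithmetic check of this counting argument, the following sketch divides the per-stage operation counts by the $N$ clock cycles that a serial architecture spends on each FFT. The function name and dictionary keys are illustrative only, and the totals simply restate the expressions quoted in the text.

```python
def serial_radix2_minimum(N):
    """Resource bounds for a radix-2 FFT that processes one sample per clock cycle."""
    n = N.bit_length() - 1                  # number of stages, log2(N)
    if N < 4 or (1 << n) != N:
        raise ValueError("N must be a power of two, at least 4")
    cycles = N                              # one sample per clock cycle
    return {
        "complex_adders_per_stage": N / cycles,     # N additions in N cycles
        "rotators_per_stage": (N // 2) / cycles,    # N/2 rotations in N cycles
        "butterflies_total": n,                     # log2(N), as stated in the text
        "rotators_total": n / 2 - 1,                # log4(N) - 1, as stated in the text
    }
```

For example, `serial_radix2_minimum(16)` reports one complex adder and half a rotator per stage, with totals of 4 and 1.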
Regarding memory, the theoretical minimum $N - P$ [1] for $P$-parallel data also holds for serial data, where $P = 1$. Thus, the minimum memory for a serial FFT is $N - 1$.
IV. SC FFT
Fig. 2(a) and (b) show the proposed radix-2 SC FFT for $N = 16$ points for DIF and DIT, respectively. The architectures consist of $n = \log_2 N = 4$ stages that include butterflies, rotators, and circuits for data management. Rotators that carry out trivial rotations are diamond-shaped, whereas general rotators are represented by a circle.
Both butterflies and rotators are marked with 1/2. This means that they require half of the components in conventional butterflies and rotators: butterflies only use a real adder and a real subtracter instead of complex ones, and rotators use two real multipliers and one adder instead of four real multipliers and two adders. The half butterfly and half rotator form the processing element (PE) of the architecture, which is explained in detail in Section IV-A.
The circuits for data permutation are circuits for elementary bit-exchange. These circuits have already been used for the calculation of the bit reversal [12]. However, this is the first time that this type of circuit is used in an FFT architecture. The data management of the SC FFT is explained in Section IV-B.
A. PE
Fig. 3 shows in detail the PE used to calculate the butterflies and rotations of the radix-2 DIF SC FFT in Fig. 2(a). The PE for the DIT SC FFT in Fig. 2(b) is analogous. The only difference is that the rotator is placed before the butterfly. The PE is composed of the half butterfly and the half rotator. The PE calculates a butterfly followed by a rotation:

$$Y_0 = X_0 + X_1, \qquad Y_1 = (X_0 - X_1)\, W_N^{\phi},$$

with the particularity that the inputs $X_0 = X_{R0} + jX_{I0}$ and $X_1 = X_{R1} + jX_{I1}$ arrive in consecutive clock cycles. Table I shows the timing diagram of the PE in Fig. 3. It can be observed that the butterfly operates first on the real part of the inputs and then on the imaginary part. The rotator also multiplexes the calculations in time. This allows for halving the adders and multipliers in the butterfly and rotator.
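The following sketch illustrates why time multiplexing halves the arithmetic of the PE. It is a functional model under stated assumptions, not a reproduction of the schedule in Table I: the name `pe_dif`, the split into four commented "cycles", and the argument `w` (the complex rotation coefficient) are all hypothetical. Each commented step uses at most one real adder plus one real subtracter, or two real multipliers plus one real adder/subtracter.

```python
import cmath

def pe_dif(x0, x1, w):
    """Butterfly followed by a rotation of the difference, one real-valued step at a time."""
    # Half butterfly, first cycle: one real adder and one real subtracter on the real parts.
    sum_re, dif_re = x0.real + x1.real, x0.real - x1.real
    # Half butterfly, second cycle: the same two units reused on the imaginary parts.
    sum_im, dif_im = x0.imag + x1.imag, x0.imag - x1.imag
    # Half rotator, first cycle: two real multiplications and one subtraction.
    rot_re = dif_re * w.real - dif_im * w.imag
    # Half rotator, second cycle: the same two multipliers and one addition.
    rot_im = dif_re * w.imag + dif_im * w.real
    return complex(sum_re, sum_im), complex(rot_re, rot_im)
```

A conventional implementation would evaluate all four steps in the same clock cycle, which is where the complex adders, four real multipliers, and two adders come from.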
B. Data Management
The PE calculates a butterfly and a rotation on pairs of data that arrive in consecutive clock cycles. In order to fulfill this, the data management of the SC FFT places samples that must be operated together in consecutive clock cycles. This happens at all of the stages of the FFT. Fig. 4 shows the data management of the SC FFTs in Fig. 2. The data management is the same for both DIF and DIT cases. Each column in Fig. 4 represents the input order to the corresponding stage. The order of arrival is from top to bottom. Therefore, $x[0]$ and $x[8]$ are the first and second inputs to the first stage, respectively. The figure shows that, at all of the stages, consecutive samples are operated together in a butterfly. This allows for the use of the PE with half of the resources.
In order to achieve the desired order, the SC FFT uses circuits for bit-dimension permutations of serial data, as shown in Fig. 5. These circuits interchange pairs of data delayed by $L$ clock cycles. In Fig. 4, the first, second, and third stages interchange data separated by 3, 1, and 7 clock cycles, respectively. These are equal to the lengths of the buffers of the first three stages in Fig. 2.
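The effect of these interchanges on the sample order can be checked with a small index-level model. This is a sketch under assumptions: Fig. 4 is not reproduced here, so the input order is assumed to be the natural order with the most and least significant index bits exchanged (which gives $x[0]$, $x[8]$ as the first two inputs), and each separation $L = 2^j - 1$ is modelled as an exchange of bits $j$ and 0 of the arrival position. The function names are hypothetical.

```python
def bit_exchange(seq, j, k):
    """Move the item at each position p to p with bits j and k swapped."""
    out = list(seq)
    for p in range(len(seq)):
        if ((p >> j) & 1) != ((p >> k) & 1):
            q = p ^ (1 << j) ^ (1 << k)   # positions p and q are 2**j - 2**k apart (j > k)
            out[q] = seq[p]
    return out

def stage_input_orders():
    """Assumed input order of each of the four stages of a 16-point DIF SC FFT."""
    n, N = 4, 16
    order = bit_exchange(list(range(N)), n - 1, 0)   # assumed: 0, 8, 2, 10, ...
    orders = [order]
    for j in (2, 1, 3):                              # separations 2**j - 1 = 3, 1, 7
        order = bit_exchange(order, j, 0)
        orders.append(order)
    return orders
```

In this model, the two samples of every consecutive pair at stage $s$ differ only in index bit $n - s$, which is the pairing that a radix-2 DIF butterfly at that stage requires.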
In a general case, for an SC FFT of length $N$, the length and delay of the buffers at stages $s = 1, \ldots, n-2$ are

$$L = 2^{n-s-1} - 1,$$

and $L = 2^{n-1} - 1$ at stage $s = n-1$. The control of the architecture is simple and obtained directly from the bits of an $n$-bit counter $c_{n-1}, \ldots, c_0$ that counts from 0 to $N-1$. For a buffer of length $L = 2^i - 1$, the control signal $S_i$ is
The control signals must be delayed according to the pipeline of the architecture, so that the count starts when the first sample arrives at the corresponding shuffling circuit. The total amount of memory for the shuffling circuits can be obtained by adding the delays at all of the stages. This leads to a total memory of

$$\sum_{s=1}^{n-2} \left(2^{n-s-1} - 1\right) + \left(2^{n-1} - 1\right) = N - \log_2 N - 1.$$

By adding the memory included in the PEs, the total memory of the architecture is approximately $N$, which is the minimum for an $N$-point FFT.
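The memory count can be reproduced numerically. The sketch below assumes buffer lengths of $2^{n-s-1} - 1$ at stages $s = 1, \ldots, n-2$ and $2^{n-1} - 1$ at stage $n-1$; these closed forms are an assumption chosen to match the lengths 3, 1, and 7 quoted for $N = 16$, and the function name is hypothetical.

```python
def shuffling_buffer_lengths(N):
    """Assumed buffer length of the shuffling circuit after each of stages 1 .. n-1."""
    n = N.bit_length() - 1                  # number of stages, log2(N)
    if N < 4 or (1 << n) != N:
        raise ValueError("N must be a power of two, at least 4")
    lengths = [2 ** (n - s - 1) - 1 for s in range(1, n - 1)]   # stages 1 .. n-2
    lengths.append(2 ** (n - 1) - 1)                            # stage n-1
    return lengths
```

Under these assumptions the lengths add up to $N - \log_2 N - 1$, for instance 11 for $N = 16$, so a small number of additional registers in the PEs brings the total close to $N$.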
As a result, the proposed SC FFT architectures use the minimum number of components for the butterflies, rotators, and memory, with a utilization of 100%.
V. SC FFT ARCHITECTURES FOR NATURAL I/O
The input and output orders of the SC FFT follow a sequence that is not in natural order, as shown in Fig. 4. In order to achieve natural I/O order, shuffling circuits can be added at the input and output. This is shown in Fig. 6 for a natural I/O 16-point SC FFT. The data management for the architecture in Fig. 6 is shown in Fig. 7. In the general case of a natural I/O $N$-point SC FFT, the input reordering circuit only needs to calculate the elementary bit-exchange $\sigma: x_{n-1} \leftrightarrow x_0$. As explained in [12], this permutation requires a shuffling circuit with a buffer of length $L = N/2 - 1$, as shown in Fig. 8. In our example in Fig. 6 for $N = 16$, the buffer length of the input reordering circuit is $L = 16/2 - 1 = 7$.
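A cycle-by-cycle model of such an input reordering circuit might look as follows. The structure is an assumption, since Figs. 5 and 8 are not reproduced here: a delay line of length $L$ whose oldest sample is either output or recirculated, with the incoming sample bypassing the buffer whenever a swap takes place. The function name and the swap condition on the arrival counter `t` are likewise hypothetical.

```python
from collections import deque

def serial_bit_exchange(stream, j, k=0):
    """Swap samples that arrive 2**j - 2**k cycles apart, one sample per cycle."""
    L = (1 << j) - (1 << k)
    buffer = deque([None] * L)              # delay line of length L
    out = []
    for t in range(len(stream) + L):        # L extra cycles flush the buffer
        sample = stream[t] if t < len(stream) else None
        oldest = buffer.popleft()
        # Swap when the arriving sample is the later one of a pair to exchange.
        swap = t < len(stream) and ((t >> j) & 1) == 1 and ((t >> k) & 1) == 0
        if swap:
            out.append(sample)              # later sample bypasses the buffer
            buffer.append(oldest)           # earlier sample waits another L cycles
        else:
            out.append(oldest)
            buffer.append(sample)
    return out[L:]                          # discard the L cycles of latency
```

For a single 16-sample frame in natural order, `serial_bit_exchange(list(range(16)), 3)` uses a buffer of length 7 and starts with 0, 8, 2, 10, so that $x[0]$ and $x[8]$ reach the first stage in consecutive cycles.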
The output reordering circuit is more complex and requires $(n + 1)/2$ elementary bit-exchanges in series, as shown in Fig. 9.
VI. COMPARISON AND ANALYSIS
Tables II and III compare the pipelined FFT architectures for serial data. Table II does not impose any specific order of inputs and outputs, whereas Table III compares the architectures for the natural I/O order.
In Table II, the first column shows the type of architecture. The second, third, and fourth columns show the resources used by the architecture: rotators, adders, and data memory. The last two columns compare the performance in terms of latency and throughput. As all of the architectures that are compared process serial data, the throughput of all of them is one sample per clock cycle.
In Table II, it can be observed that previous architectures require the minimum of some of the hardware resources but not all of them. Various SDF FFT architectures [2]-[4], [13] use the minimum amount of rotators and memory. Previous SDC FFTs [5], [6], [8] achieve the minimum number of adders. Moreover, the locally pipelined FFT [9] achieves the minimum number of rotators and adders. Finally, the proposed SC FFT is the first architecture that achieves the minimum amount in all hardware resources.
For natural I/O order, Table III compares previous SDC architectures to the proposed SC FFT. Compared to [5]-[7], the proposed architecture reduces the number of rotators by 50%. Furthermore, up to $N = 64$ points, the memory of the proposed approach is also smaller than that in [5]-[7]. Compared to [8], the proposed architecture has less memory for $N \leq 64$ and more for larger $N$, with the differences being small.
VII. EXPERIMENTAL RESULTS
The proposed SC FFT for $N = 1024$ points and a word length of 16 bits has been implemented in ASIC technology using a UMC 55-nm process library. Table IV compares the implementation with previous serial FFTs on ASICs. The proposed architecture improves the clock frequency of previous designs. At the same time, it achieves less area than previous 2048-point [16] and 256-point [17] SDF FFTs, high SQNR, and low power consumption. In the table, area and power are normalized to 55 nm and 0.9 V according to [18].
VIII. CONCLUSION
This brief has presented the SC FFT architecture. This architecture is the first FFT to use circuits to calculate bit-dimension permutation on serial data. This creates a data management that allows for using the theoretical minimum amount of hardware resources for a serial FFT with 100% utilization. Compared to previous designs, the proposed SC FFT reduces either the number of rotators or the number of adders or the memory of the design. A solution for natural I/O has also been presented, 
