High-Speed and Low-Power Split-Radix FFT
Wen-Chang Yeh and Chein-Wei Jen
Abstract-This paper presents a novel split-radix fast Fourier transform (SRFFT) pipeline architecture design. A mapping methodology has been developed to obtain a regular and modular pipeline for the split-radix algorithm. The pipeline is repartitioned to balance the latency between complex multiplication and butterfly operation by using carry-save addition. The number of complex multipliers is minimized via a bit-inverse and bit-reverse data scheduling scheme. One can also apply the design methodology described here to obtain regular and modular pipelines for other Cooley-Tukey-based algorithms.
For an (= 2 )-point FFT, the requirements are log 4 
I. INTRODUCTION
The fast Fourier transform (FFT) and its inverse (IFFT) are essential in the field of digital signal processing. Recently, due to the popularity of orthogonal frequency division multiplexing (OFDM) systems, demand for high-speed and low-power FFTs has emerged from various applications. According to the European digital video/audio broadcasting (DVB-T/DAB) standards, an OFDM system may require FFT lengths ranging from 256 to 8192 points. Wireless local area network (WLAN) and HIPERLAN/2 systems require high-speed and low-power FFT/IFFT designs [1], [2]. Fourth-generation cellular phones and forthcoming WLAN systems may also incorporate OFDM to deliver higher bandwidth [3]. Hence, it is important to design high-performance and low-power FFTs for these applications.
For a static CMOS circuit, the power consumption is usually determined by the dynamic power, which can be written as

$$P_{\text{dynamic}} = \alpha \, C_L \, V_{DD}^{2} \, f_{clk} \qquad (1)$$

where $\alpha$ is the switching probability, $C_L$ is the capacitance being charged or discharged when switching, $V_{DD}$ is the supply voltage, and $f_{clk}$ is the clock rate. At the architecture level, $\alpha$ can be regarded as the operation count for a specific module, and $C_L$ is proportional to the part of the module that is active. For example, for a multiplier, if the operation count is MUL# within $N$ clock cycles, then $\alpha$ is proportional to MUL#/$N$. As the $\alpha$ in (1) is usually determined by the algorithm and its corresponding architecture, it is desirable that the chosen FFT algorithm has the least computational complexity as well as the least corresponding hardware complexity. Among various FFT algorithms, the Cooley-Tukey algorithm [4] is very popular because it reduces the computational complexity from $O(N^2)$ to $O(N \log N)$, and the regularity of the algorithm makes it suitable for VLSI implementation. To further reduce the computational complexity, radix-4, split-radix [5], radix-$2^2$ [6], radix-2/4/8 [7], and higher radix versions have been proposed. In general, all of these algorithms decompose a length-$N$ ($= 2^n$) FFT into an odd half and an even half recursively and effectively reduce the number of complex multiplications by exploiting the symmetry properties of the FFT kernel. The split-radix algorithm is the best in terms of multiplicative complexity for a $2^n$-point FFT when the multiplications with $\pm 1$ and $\pm j$ are skipped. However, the split-radix algorithm is inherently irregular because radix-2 stages are used for the even-half components and radix-4 stages are used for the odd-half components, which results in an "L"-shaped butterfly unit.
Due to the irregularity of the butterfly unit, it is hard to design a regular and modular hardware pipeline for the split-radix algorithm. In [8], a two-dimensional (2-D) processor array was proposed to implement the split-radix algorithm. Its hardware complexity grows quickly with the transform length, which makes such a design impractical for large $N$. In [9], a one-dimensional (1-D) linear array design was proposed. Although its hardware complexity has been reduced to $O(\log N)$ only, the hardware requirement is more than twice as large as those of the other radix-4 FFT implementations proposed in [6], [10], and [11].
In this paper, we present an SRFFT pipeline architecture that implements the split-radix algorithm efficiently and is suitable for VLSI implementation. The number of multipliers has been minimized to $\log_4 N - 1$ by sharing a multiplier between two adjacent stages. The sharing is achieved by using the bit-inverse and bit-reverse (BIBR) data scheduling scheme proposed here. The hardware complexity is equivalent to that of the radix-$2^2$ pipeline architecture [6].
In order to balance the latency between complex multiplication and butterfly operation, the complex multiplier has been pipelined into two stages. The first stage is based on a Wallace tree and modified Booth recoding, and the second stage is the final addition of the multiplication. The second stage is then merged into the succeeding two butterfly units to balance the latency. The repartition transforms the carry-propagation additions at the end of the multiplication into carry-save additions, which further reduces power consumption and increases performance without increasing hardware cost.
The proposed design approach can also be generalized to other $2^n$-point FFT algorithms based on Cooley-Tukey decomposition to obtain a regular pipeline architecture. For comparison, we have implemented a 64-point SRFFT pipeline and a 64-point radix-$2^2$ FFT pipeline. The post-layout simulation shows that the SRFFT design can operate at 150 MHz, 3.3 V, and 25 °C. We have achieved a 15% power reduction and a 14.5% performance improvement compared with the radix-$2^2$ design under equal conditions.
The organization of this paper is as follows. In Section II, we will analyze the relationships between FFT algorithms and pipeline architecture. The proposed SRFFT architecture and the multiplier folding scheme will be discussed in Section III. In Section IV, we will present the design of delay-balanced pipeline architecture. Post-layout simulation and its analysis are given in Section V.
II. ANALYSIS OF RADIX-$2^n$ FFT ALGORITHMS

A. Basic Formulation and Low Radix Algorithms
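Given an input sequence $x(n)$, an $N$-point discrete Fourier transform is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \qquad (2)$$

where $n$ is the time index and $k$ is the frequency index. The coefficient $W_N^{nk}$ is defined as

$$W_N^{nk} = e^{-j 2\pi nk / N}. \qquad (3)$$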
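For the Cooley-Tukey radix-2 decimation-in-frequency (DIF) decomposition, (2) is decomposed into even and odd frequency components

$$X(2k) = \sum_{n=0}^{N/2-1} \left[ x(n) + x(n + N/2) \right] W_{N/2}^{nk} \qquad (4)$$

$$X(2k+1) = \sum_{n=0}^{N/2-1} \left[ x(n) - x(n + N/2) \right] W_N^{n}\, W_{N/2}^{nk}. \qquad (5)$$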
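In (5), $W_N^{n}$ is usually referred to as the twiddle factor. The SRFFT algorithm [5] further decomposes the odd frequency components into the $4k+1$ and $4k+3$ frequency components

$$X(4k+1) = \sum_{n=0}^{N/4-1} \left\{ \left[ x(n) - x(n + N/2) \right] - j \left[ x(n + N/4) - x(n + 3N/4) \right] \right\} W_N^{n}\, W_{N/4}^{nk} \qquad (6)$$

$$X(4k+3) = \sum_{n=0}^{N/4-1} \left\{ \left[ x(n) - x(n + N/2) \right] + j \left[ x(n + N/4) - x(n + 3N/4) \right] \right\} W_N^{3n}\, W_{N/4}^{nk}. \qquad (7)$$

By applying (4), (6), and (7) recursively, the split-radix FFT can be obtained. On the other hand, the radix-4 algorithm can be obtained by decomposing (4) and (5) into the $4k$, $4k+1$, $4k+2$, and $4k+3$ frequency components. Table I shows the multiplicative complexity of the radix-2, radix-4, and split-radix algorithms. For FFT lengths that are not a power of four, an additional radix-2 stage is used for the radix-4 algorithm. The split-radix algorithm shows a clear advantage over the other algorithms. To understand the relationship among the three algorithms intuitively, we can examine their signal flow graphs (SFGs), as shown in Fig. 1. From the figure, we can see the following.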
• Both the radix-4 and split-radix algorithms are superior to the radix-2 algorithm because "$-j$" terms are extracted. The complex multiplications with $-j$ are accomplished by exchanging the real and the imaginary parts of the incoming data and then inverting the sign of the imaginary part.
• The split-radix algorithm is superior to the radix-4 algorithm because more "$-j$" terms are extracted in the SFG. Note that four twiddle factors are moved from the end of the second butterfly stage to the end of the third butterfly stage, and two of them become trivial multiplications.
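As a rough cross-check of this comparison, the nontrivial twiddle multiplications of the three low-radix algorithms can be counted by enumerating the twiddle exponents stage by stage. The following Python sketch is illustrative only and is not part of the original design flow; the function names are hypothetical, and it assumes that a multiplication is trivial exactly when the twiddle factor equals $\pm 1$ or $\pm j$.

```python
def is_trivial(exponent, size):
    """True when W_size^exponent is 1, -1, j, or -j (a multiple of a quarter turn)."""
    return (4 * exponent) % size == 0


def radix2_count(n):
    """Nontrivial twiddle multiplications of an n-point radix-2 DIF FFT."""
    total, size, blocks = 0, n, 1
    while size >= 2:
        # Each block multiplies its differences by W_size^k, k = 0 .. size/2 - 1.
        total += blocks * sum(not is_trivial(k, size) for k in range(size // 2))
        size //= 2
        blocks *= 2
    return total


def radix4_count(n):
    """Same count for a radix-4 DIF FFT; n is assumed to be a power of four."""
    total, size, blocks = 0, n, 1
    while size >= 4:
        # Output branch q of each block is multiplied by W_size^(q*k).
        total += blocks * sum(
            not is_trivial(q * k, size) for q in range(4) for k in range(size // 4)
        )
        size //= 4
        blocks *= 4
    return total


def split_radix_count(n):
    """Same count for the split-radix recursion built from (4), (6), and (7)."""
    if n < 4:
        return 0
    # Twiddles W_n^k for the 4k+1 branch and W_n^(3k) for the 4k+3 branch.
    own = sum(not is_trivial(k, n) for k in range(n // 4))
    own += sum(not is_trivial(3 * k, n) for k in range(n // 4))
    return own + split_radix_count(n // 2) + 2 * split_radix_count(n // 4)


if __name__ == "__main__":
    for n in (16, 64):
        print(n, radix2_count(n), radix4_count(n), split_radix_count(n))
```

For a 64-point transform, this sketch gives 98, 76, and 72 multiplications for the radix-2, radix-4, and split-radix algorithms, respectively, so the radix-4 and split-radix counts differ by four.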
B. High Radix Algorithms
If a multiplicative complexity lower than that of the split-radix algorithm is desired, higher radix FFT algorithms should be used. However, high-radix FFT algorithms often increase the circuit complexity and are not easy to implement. To discuss and compare the efficiency of high-radix algorithms, the radix-8 and radix-2/8 algorithms are examined here. The radix-2/8 algorithm can be considered an extension of the split-radix (radix-2/4) algorithm in which (6) and (7) are decomposed by one more radix-2 stage. The SFGs of the radix-8 and radix-2/8 algorithms are shown in Fig. 2. Besides the $\pm 1$ and $\pm j$ terms, $W_8^1$ and $W_8^3$ are also extracted from the SFGs. Due to the symmetry properties of the cosine and sine functions, the values of $W_8^1$ and $W_8^3$ can be written as $\frac{\sqrt{2}}{2}(1 - j)$ and $-\frac{\sqrt{2}}{2}(1 + j)$, respectively. Thus, a complex multiplication with one of the two coefficients can be computed using a constant multiplier and additions. The number of complex multiplications and the number of constant multiplications are summarized in Table II.
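The constant-multiplier form of these two coefficients can be illustrated with a short Python sketch. It is a floating-point illustration only, under the assumption that $W_8^1 = \frac{\sqrt{2}}{2}(1 - j)$ and $W_8^3 = -\frac{\sqrt{2}}{2}(1 + j)$; the function names are hypothetical, and a hardware version would use a fixed-point constant multiplier instead.

```python
import cmath
import math

HALF_SQRT2 = math.sqrt(2.0) / 2.0  # the single real constant that is needed


def mul_w8_1(z):
    """z * W_8^1 using two additions and two multiplications by one constant."""
    a, b = z.real, z.imag
    # (a + jb)(1 - j) = (a + b) + j(b - a)
    return complex(HALF_SQRT2 * (a + b), HALF_SQRT2 * (b - a))


def mul_w8_3(z):
    """z * W_8^3 using two additions and two multiplications by one constant."""
    a, b = z.real, z.imag
    # -(a + jb)(1 + j) = (b - a) - j(a + b)
    return complex(HALF_SQRT2 * (b - a), -HALF_SQRT2 * (a + b))


if __name__ == "__main__":
    z = complex(3.0, -2.0)
    print(mul_w8_1(z), z * cmath.exp(-2j * math.pi * 1 / 8))
    print(mul_w8_3(z), z * cmath.exp(-2j * math.pi * 3 / 8))
```

Each result needs only two real multiplications by the same constant, compared with four real multiplications for a general complex multiplication.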
The radix-2/8 algorithm has lower multiplicative complexity than the radix-8 algorithm because it extracts more $W_8^1$ and $W_8^3$ coefficients in the SFG. In our hardware implementation, we found that the area of the constant multiplier is one and a half times that of a real multiplier. Because a complex multiplication uses four real multipliers, one constant multiplication is approximately equivalent to $1.5/4 \approx 0.4$ complex multiplications. To compare the multiplicative complexity of low radix algorithms and high radix algorithms, one can multiply the number of constant multiplications by 0.4 to obtain the equivalent number of complex multiplications.
It is interesting to observe that the multiplicative complexity of the radix-8 algorithm derived in this work is the same as that of the radix-2/4/8 algorithm reported in [7]. In fact, if we examine Fig. 2(a) more carefully, we find that the SFG of the radix-8 algorithm is equivalent to that of the radix-2/4/8 algorithm. The radix-2/4/8 algorithm implements the radix-8 butterfly using three radix-2 stages instead of one single butterfly. Therefore, we may term the radix-2/4/8 algorithm the radix-$2^3$ algorithm because of the equivalence at the algorithm level. The butterfly unit consists of three radix-2 butterfly stages. The radix-$2^3$ design proposed in [6] can be regarded as another radix-2/4/8 design.
Based on this observation, we will not evaluate the performance of the radix-2/4/8 algorithm any further. The difference equations used here to compute the number of complex multiplications for an $N$ ($= 2^n$)-point FFT are (8) for the radix-8 algorithm and (9) for the radix-2/8 algorithm. One can use these equations to verify the correctness of Table II.
C. Tradeoff Among the Algorithms
Based on the discussion in the previous sections, we can see that if the multiplications with $\pm 1$ and $\pm j$ are removed and the multiplications with $W_8^1$ and $W_8^3$ are realized using constant multiplication, the radix-2/8 algorithm has the lowest multiplicative complexity among all the algorithms discussed. On the other hand, the fixed-radix algorithms have more regular SFGs than the mixed-radix algorithms. However, the radix-4 and radix-8 algorithms can be applied to $4^n$-point and $8^n$-point FFTs only, unless a radix-2 stage is also employed in the pipeline. Such a limitation also exists in other fixed-radix FFT algorithms but not in the split-radix or the radix-2/8 algorithm.
The additive complexity has not been analyzed because it is basically the same for all the algorithms based on Cooley-Tukey decomposition, as discussed in [5] . Thus, the number of additions can be further reduced if we can minimize the number of multiplications and implement each multiplication using as few additions as possible. To summarize, mixed radix algorithms are quite attractive if regular and efficient hardware architecture can be found. We will present the proposed pipeline architecture for the split-radix algorithm in Section III and the delay-balanced pipeline to remove unnecessary computation in Section IV.
III. PIPELINE ARCHITECTURE FOR SRFFT
A. Previous Work
The one-dimensional linear array is very popular because it possesses regularity, modularity, local connection, and high throughput with moderate hardware complexity. Fig. 3 shows the commonly used radix-$r$ 1-D pipeline architecture [11]. The number of butterflies and the number of multipliers are proportional to $\log_r N$. However, the scheme shown in the figure cannot be applied to mixed-radix algorithms directly. The problem arises from the irregularity of the butterfly stage for mixed-radix algorithms. Take the split-radix algorithm as an example; both (6) and (7) can be rewritten in an alternative form by defining $[x(n) - x(n + N/2)]$ in (5) as $x'(n)$:

$$X(4k+1) = \sum_{n=0}^{N/4-1} \left[ x'(n) - j\, x'(n + N/4) \right] W_N^{n}\, W_{N/4}^{nk} \qquad (10)$$

$$X(4k+3) = \sum_{n=0}^{N/4-1} \left[ x'(n) + j\, x'(n + N/4) \right] W_N^{3n}\, W_{N/4}^{nk}. \qquad (11)$$
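Before considering hardware, the split-radix recursion can be sketched in software. The Python code below is a minimal floating-point illustration and not the pipeline proposed in this paper. It assumes a power-of-two input length, uses hypothetical function names, and forms the even outputs from the sums, as in (4), and the $4k+1$ and $4k+3$ outputs from the differences $x(n) - x(n + N/2)$, following the split-radix decomposition.

```python
import cmath


def dft(x):
    """Direct O(N^2) evaluation of the DFT definition (2), used as a reference."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def split_radix_fft(x):
    """Recursive split-radix DIF FFT; len(x) is assumed to be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    half, quarter = n // 2, n // 4
    # Even outputs: half-length FFT of the sums x(n) + x(n + N/2).
    sums = [x[i] + x[i + half] for i in range(half)]
    # Differences x(n) - x(n + N/2), shared by the two odd branches.
    diff = [x[i] - x[i + half] for i in range(half)]
    # 4k+1 branch: twiddle W_N^n; 4k+3 branch: twiddle W_N^(3n).
    odd1 = [
        (diff[i] - 1j * diff[i + quarter]) * cmath.exp(-2j * cmath.pi * i / n)
        for i in range(quarter)
    ]
    odd3 = [
        (diff[i] + 1j * diff[i + quarter]) * cmath.exp(-2j * cmath.pi * 3 * i / n)
        for i in range(quarter)
    ]
    out = [0j] * n
    out[0::2] = split_radix_fft(sums)   # X(2k)
    out[1::4] = split_radix_fft(odd1)   # X(4k+1)
    out[3::4] = split_radix_fft(odd3)   # X(4k+3)
    return out


if __name__ == "__main__":
    data = [complex(i % 5, -(i % 3)) for i in range(16)]
    error = max(abs(a - b) for a, b in zip(split_radix_fft(data), dft(data)))
    print("max deviation from direct DFT:", error)
```

The sums feed a half-length transform while the differences feed two quarter-length transforms, which is the irregularity that gives rise to the "L"-shaped butterfly.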
If (4), (10), and (11) are mapped directly into hardware, the shape of the butterfly unit would look like the one shown in Fig. 4. Such an "L"-shaped butterfly unit is difficult to integrate into a pipeline architecture, and a similar problem also arises in the radix-2/8 algorithm [7]-[9]. In [9], a delay-commutator (DC)-based design was proposed to implement the split-radix algorithm. Although the design achieves the multiplicative complexity of the split-radix algorithm, its hardware requirement is considerably higher than that of the other pipeline architectures. Table III compares the hardware requirement and the multiplicative complexity for several classical and new implementations [6], [9]-[13]. The taxonomy is adopted from [6]. Split-radix single-path delay-feedback is denoted SRSDF, and SRMDC denotes split-radix multiple-path delay-commutator. Because there are two different radix-$2^2$ SDF designs, we denote the first one, proposed in [6], as R2$^2$SDFI and the second one, proposed in this work, as R2$^2$SDFII. We will derive the SRSDF and R2$^2$SDFII architectures in the following sections.
B. Memory System Design
In order to compute the DFT via the FFT, the input data and the intermediate results have to be reordered using memory. The required memory size is directly proportional to $N$, and the number of memory accesses is proportional to $N \log_r N$. Therefore, it is important to reduce both the memory size and the number of memory accesses.
Take the radix-2 FFT algorithm as an example. The computation of (4) cannot start until both $x(n)$ and $x(n + N/2)$ are available. For word-sequential I/O, the two samples are separated by $N/2$ clock cycles if one sample is available per clock cycle. Consequently, the first $N/2$ samples have to be stored in a local memory until the other data samples arrive. Similar constraints also exist in the other FFT algorithms.
Two different buffering strategies have been developed for pipeline FFT architectures. One is the delay-commutator (DC) architecture, and the other is the delay-feedback (DF) architecture [6], as shown in Fig. 5. The DC approach is shown in Fig. 5(a). During the first $N/2$ cycles, the first $N/2$ samples are stored in "FIFO_I". During the next $N/2$ cycles, the butterfly receives $x(n + N/2)$ from the input and $x(n)$ from "FIFO_I" and generates outputs according to (4) and (5). Meanwhile, one of the results generated by the first butterfly is stored into "FIFO_II," and the other one is fed to the multiplier directly. During the $N$ cycles, data are stored into "FIFO_I" in the first $N/2$ cycles and are then read from the FIFO in the second $N/2$ cycles. As a result, the utilization rate of each FIFO is only 50%.
For the DF style shown in Fig. 5(b), the incoming samples are stored in the feedback FIFO during the first $N/2$ cycles. When $x(n + N/2)$ arrives, the radix-2 butterfly unit receives $x(n + N/2)$ from the input and $x(n)$ from the feedback FIFO for computation. One of the outputs of the butterfly unit is fed back to the FIFO again, which explains the name "delay-feedback." Data are both read from and written to each memory cell of the FIFO every clock cycle. The utilization rate of each FIFO is thus increased to 100%.
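The delay-feedback behavior can be sketched with a small cycle-by-cycle model. The Python code below is a behavioral illustration of a plain radix-2 single-path delay-feedback pipeline, not the proposed SRSDF design. The function names are hypothetical, the twiddle multiplication is folded into the stage model for brevity, and extra pipeline registers are ignored.

```python
import cmath
from collections import deque


def dft(x):
    """Direct evaluation of the DFT definition, used as a reference."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def bit_reverse(i, bits):
    """Reverse the 'bits'-bit binary representation of i."""
    return int(format(i, f"0{bits}b")[::-1], 2)


def sdf_stage(samples, depth):
    """One radix-2 delay-feedback stage with a feedback FIFO of 'depth' words."""
    fifo = deque([0j] * depth)
    out = []
    # 'depth' extra zero samples flush the last differences out of the FIFO.
    for t, x in enumerate(list(samples) + [0j] * depth):
        k = t % (2 * depth)
        old = fifo.popleft()          # every cell is read ...
        if k < depth:
            # Bypass phase: store the input and emit the difference saved
            # 'depth' cycles earlier, multiplied by its twiddle factor.
            fifo.append(x)            # ... and written in every cycle
            out.append(old * cmath.exp(-2j * cmath.pi * k / (2 * depth)))
        else:
            # Butterfly phase: emit the sum, feed the difference back.
            fifo.append(old - x)
            out.append(old + x)
    return out[depth:]                # drop the start-up latency of the stage


def r2sdf_fft(x):
    """Chain log2(N) stages with FIFO depths N/2, N/4, ..., 1."""
    data, depth = list(x), len(x) // 2
    while depth >= 1:
        data = sdf_stage(data, depth)
        depth //= 2
    return data


if __name__ == "__main__":
    x = [complex(i, 2 * i % 3) for i in range(8)]
    y, ref = r2sdf_fft(x), dft(x)
    print(max(abs(y[i] - ref[bit_reverse(i, 3)]) for i in range(8)))
```

With normal-order input, the model produces the spectrum in bit-reverse order, and the FIFO depths sum to $N - 1$ words.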
Table III also compares the memory requirement. The delay-feedback buffering strategy can implement the radix-2, radix-4, radix-$2^2$, and split-radix algorithms with only ($N - 1$) memory words. On the other hand, the DC buffering strategy requires $1.5N$ words or more. Note that each word actually comprises a real part and an imaginary part for a complex FFT. The DF strategy is preferred because the memory size is proportional to $N$, whereas the number of multipliers, pipeline registers, or butterfly units is of order $\log N$ only. Besides the size of the memory, it is also important to minimize the number of memory accesses in order to reduce the power consumption. In general, the number of memory accesses is related to the radix of the FFT algorithm because $\log_r N$ radix-$r$ stages are used for an $N$-point FFT pipeline, as shown in Fig. 3. Global memory access occurs only when data enter or leave each stage, and therefore, it is proportional to the number of stages in a pipeline. Taking the radix-$2^2$ pipeline as an example, we can see that it has the same number of global memory accesses as the radix-2 pipeline, although it performs the radix-4 FFT algorithm. The number of local memory accesses inside a radix-$r$ stage also depends on the design of the butterfly (BF). For example, a radix-4 BF operation can be implemented using one radix-4 stage or two radix-2 stages with a local memory buffer. Compared with a single radix-4 stage, the latter has lower cost but more local memory accesses. Table III also shows the number of global memory accesses for the pipeline-based architectures.
C. Proposed Delay-Feedback Pipeline Architecture
Based on the discussions in the previous sections, we can conclude that a good pipeline architecture for the FFT should possess the following features.
• At the algorithm level, it should achieve as low a multiplicative complexity as possible.
• It should be suitable for any power-of-two FFT length.
• At the architecture level, it should use the delay-feedback buffering strategy to minimize the memory size.
• It should have modular and regular modules, local routing, and low control complexity.
We achieve these features by projection mapping, which is a technique from systolic array design [14]. First, we observe that the conventional fixed-radix pipeline architecture can be obtained by folding the SFG in the vertical direction. Fig. 6(a) uses the radix-4 SFG as an example. The BF_I unit implements the common butterfly operation. The BF_II unit operates in two modes. In the first mode, it works just as the BF_I unit. In the second mode, the input data are multiplied by "$-j$" before the normal butterfly operation. Due to the spatial regularity of the SFG of the radix-4 algorithm, as shown in Fig. 1(b), the coefficient "$-j$" exists at the beginning of odd stages only. Therefore, BF_I is used at even stages, and BF_II is used at odd stages. The obtained radix-4 FFT pipeline can use either the DC or the DF buffering strategy to store intermediate data. When the single-path delay-feedback (SDF) buffering strategy is chosen, the R2$^2$SDFII FFT pipeline listed in Table III is obtained. It uses the same number of butterfly units and multipliers and the same memory size as the R2$^2$SDFI FFT pipeline. Although the R2$^2$SDFI architecture was obtained in a different manner, we believe that the two R2$^2$SDF designs are equivalent at both the algorithm and the architecture levels.
Based on the above procedure, we apply the projection mapping to the split-radix algorithm in the same way, and the result is shown in Fig. 6(b). The BF_II unit is used at every stage except for the first stage, and two multipliers are used. Compared with the R2$^2$SDFII pipeline, the obtained pipeline architecture for the split-radix algorithm uses more BF_II units and one more multiplier. The overall structure is still regular and suitable for VLSI implementation, and the control of the BF_II units and multipliers is similar to that of the R2$^2$SDFII pipeline. Note that the buffering style in the figure is not specified, as it can be determined by the designer. We will use the DF style throughout this paper.
The problem with the obtained pipeline architecture for the split-radix algorithm is that the number of multipliers is the same as that of the radix-2 pipeline architecture, as shown in Table III. The utilization rate of the second multiplier drops to only 25%, which is very inefficient. To reduce the number of multipliers, a multiplier sharing scheme is developed here. First, we try to use only one multiplier for two successive stages, as shown in Fig. 7. A resource conflict arises if we use a conventional data scheduling scheme, where the sum of a butterfly is sent to the next stage and the difference of a butterfly is fed into the buffer. The definitions of the sum and the difference of a butterfly are in accordance with a common radix-2 butterfly. With the conventional scheme, the data coming from the two butterfly units will in certain cases both require multiplication simultaneously.
To solve this problem, we propose a bit-inverse and bit-reverse (BIBR) data scheduling scheme. The BIBR scheme exchanges the order of output sequences from butterfly stages. Thus, for each butterfly unit, it stores the sum into the buffer and propagates the difference to the next stage first. The final output data sequence becomes BIBR order, instead of bit-reverse for the conventional scheme. Table IV compares the two data scheduling schemes. It is clear from the table that when the input data sequence is in normal order for both schemes, the output sequence will be in bit-reverse order for the conventional scheme and in BIBR order for the proposed scheme.
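The effect of this output exchange on the data order can be illustrated with a short behavioral model. The Python sketch below is an assumption-laden illustration rather than the SRSDF pipeline itself. It models plain radix-2 delay-feedback stages that emit the difference first and buffer the sum, it places the twiddle multiplication on the difference path, and it takes "bit-inverse and bit-reverse" to mean inverting every bit of the output position and then reversing the bit order. All function names are hypothetical.

```python
import cmath
from collections import deque


def dft(x):
    """Direct evaluation of the DFT definition, used as a reference."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def bibr_index(i, bits):
    """Invert all 'bits' bits of i, then reverse their order."""
    inverted = ~i & ((1 << bits) - 1)
    return int(format(inverted, f"0{bits}b")[::-1], 2)


def sdf_stage_difference_first(samples, depth):
    """Radix-2 delay-feedback stage that emits differences before sums."""
    fifo = deque([0j] * depth)
    out = []
    for t, x in enumerate(list(samples) + [0j] * depth):  # zeros flush the FIFO
        k = t % (2 * depth)
        old = fifo.popleft()
        if k < depth:
            # Bypass phase: store the input, emit the buffered sum unchanged.
            fifo.append(x)
            out.append(old)
        else:
            # Butterfly phase: emit the twiddled difference, buffer the sum.
            twiddle = cmath.exp(-2j * cmath.pi * (k - depth) / (2 * depth))
            fifo.append(old + x)
            out.append((old - x) * twiddle)
    return out[depth:]


def fft_bibr_order(x):
    """Chain the stages; the result appears in BIBR order."""
    data, depth = list(x), len(x) // 2
    while depth >= 1:
        data = sdf_stage_difference_first(data, depth)
        depth //= 2
    return data


if __name__ == "__main__":
    x = [complex(i, -i * i % 5) for i in range(8)]
    y, ref = fft_bibr_order(x), dft(x)
    print([bibr_index(i, 3) for i in range(8)])
    print(max(abs(y[i] - ref[bibr_index(i, 3)]) for i in range(8)))
```

In this model, output position $i$ carries the frequency component whose index is the bit-inverted and bit-reversed value of $i$, whereas the conventional sum-first schedule would give plain bit-reverse order.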
Finally, the SRSDF pipeline architecture described in Table III is obtained. A similar design procedure can be applied to obtain SDF pipeline for any algorithm based on Cooley-Tukey decomposition. However, the BIBR scheduling scheme does not eliminate the multiplier resource conflict for the R28SDF pipeline for the radix-2/8 algorithm when we try to use only multipliers. The R28SDF FFT pipeline still requires ( ) complex multipliers and constant multipliers. Further study is required on the issue of reducing the hardware requirement for the R28SDF pipeline.
IV. DESIGN OF DELAY-BALANCED PIPELINE
A. Two-Stage Pipelined Complex Multiplier Design
The design of a complex multiplier is essential for any complex FFT implementation because it consists of four real multiplications and two real additions. In addition to the large area, the latency of the FFT pipeline is often limited by the latency of the complex multiplication as well. Hence, many efforts have been devoted to reducing the latency of complex multiplication and minimizing the power consumption.
The equation for complex multiplication can be written as

$$Y = X \cdot W = (X_r W_r - X_i W_i) + j\,(X_r W_i + X_i W_r) \qquad (12)$$

where $X$ refers to the incoming data from the previous stage, $W$ is the twiddle factor, and $Y$ is the result of the multiplication. Real-part and imaginary-part data are denoted using the subscripts $r$ and $i$, respectively. To obtain either $Y_r$ or $Y_i$, two real multiplications and one real addition are required. Rather than designing a new complex multiplier, we decided to repartition the pipeline to balance the latency, because the basic problem of all the previous designs is that the latency of a complex multiplication is in general twice as long as that of a butterfly operation. Although a three-multiplication scheme can be used to reduce the area, it further increases the latency of the complex multiplication due to the encoding/recoding process [15]. Therefore, the repartition focuses on the balance between multiplication and butterfly operation, and it can be applied to either the common complex multiplication or the three-multiplication scheme.
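The two multiplication schemes can be compared with a short Python sketch on integer operands. The variable names (`xr`, `xi` for the data and `wr`, `wi` for the twiddle factor) are illustrative, and the three-multiplication form shown is one common variant, which is not necessarily the one used in [15].

```python
def cmul_four(xr, xi, wr, wi):
    """Complex product with four real multiplications and two additions."""
    return xr * wr - xi * wi, xr * wi + xi * wr


def cmul_three(xr, xi, wr, wi):
    """Complex product with three real multiplications and five additions.

    The terms (wi - wr) and (wr + wi) depend only on the twiddle factor,
    so they could be precomputed and stored in the coefficient table.
    """
    k1 = wr * (xr + xi)
    k2 = xr * (wi - wr)
    k3 = xi * (wr + wi)
    return k1 - k3, k1 + k2


if __name__ == "__main__":
    operands = (1234, -567, 2047, -1024)  # arbitrary fixed-point style integers
    print(cmul_four(*operands))
    print(cmul_three(*operands))
```

Both functions return the same real and imaginary parts. The three-multiplication form trades one multiplier for extra additions in front of the multipliers, which lengthens the critical path.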
First, the partial products of the two multiplications are generated using modified Booth encoding [16], [17] and then fed into a Wallace tree [18], [19]. The remaining two rows of partial products from the Wallace tree are converted to two's-complement format using a final adder. Fig. 8(a) shows the structure of the generated multiplier before repartition. CPA denotes carry-propagation addition, as the final addition has to be accomplished via a CPA adder.
The latency of each block in Fig. 8(a) is estimated as follows. Based on the multiplier optimization algorithms proposed in [20] and [21] , the latency of several multipliers with different wordlengths are listed in Table V. Note that is the wordlength of ( ), is the wordlength of ( ), and the latency is normalized with respect to the delay of two inputs XOR gate. As the latency of a Wallace tree is related to the height of the array formed by the partial products rather than the width, and are encoded using modified Booth encoding to generate partial products. The height is for the merged Wallace tree with modified Booth encoding. However, if is smaller than , then should be encoded to reduce the height. Because the wordlength of the twiddle factors is fixed at 12-bit ( ) in our design, the latency of the Wallace tree is fixed at 8.5 XOR gates. The latency of the final adder ranges from 5.0 to 5.5 XOR gates. If a butterfly unit uses fast CPAs for butterfly operation, its latency will be approximately XOR gates [22] . Therefore, the latency of the butterfly is around five or six XOR gates. The latency of each stage before partition is also shown in Fig. 8(a) based on the above estimation. If we partition the complex multiplier into two stages to balance the delay and do not modify the butterfly unit as shown in Fig. 8(b) , the repartition will cause the area to increase due to the insertion of additional pipeline registers and will increase the total latency of the FFT. To avoid these problems, the final addition of the multiplier is merged into the butterfly units, as shown in Fig. 8(c) . The butterfly operation has been modified to take the final addition into consideration.
As discussed in Section III-B, the butterfly operates in two phases. During the first phase, the incoming data from the previous stage are stored in a buffer. During the second phase, the data are computed according to (4) and (5), or (10) and (11), to complete the butterfly operation for split-radix algorithm. The modification here is done as follows. During the first phase, the carry-save format data from the previous stage are converted to two's-complement format using the adders and subtractors of the original butterfly unit. The data is then stored into the buffer as usual. During the second phase, the incoming data ( ) in carry-save format and the stored data ( ) in two's-complement format are sent to carry-save adders, i.e., one row of full-adders, to replace the original CPA with a carry-save addition (CSA). The CSA is modified such that it can be used to compute either ( ) or ( ). The superscript " " denotes that the data is in carry-save format. The outputs of the CSA are then fed to the original butterfly unit to complete the computation. Note that the latency of a CSA is fixed at two XOR-gate delays, regardless of the wordlength. Therefore, the latency becomes 8.5 XOR gates for the multiplier stage and 8 ( ) XOR gates for the butterfly unit.
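The role of the carry-save adder can be illustrated at the bit level. The Python sketch below is an assumption-based illustration, not the circuit of Fig. 8(c). It treats the multiplier output as a sum word and a carry word, combines them with the stored operand through one row of full adders, and leaves a single carry-propagation addition. The way the difference is formed, by inverting both words and adding a constant, is one possible realization, and the function names are hypothetical.

```python
def carry_save_add(a, b, c):
    """One row of full adders: reduce three operands to a sum and a carry word.

    No carry ripples between bit positions, so the delay does not depend
    on the wordlength.
    """
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return partial_sum, carry


def add_carry_save(stored, prod_sum, prod_carry):
    """stored + product, where product = prod_sum + prod_carry."""
    s, c = carry_save_add(stored, prod_sum, prod_carry)
    return s + c  # the only carry-propagation addition


def sub_carry_save(stored, prod_sum, prod_carry):
    """stored - product, using -v = ~v + 1 for each of the two words."""
    s, c = carry_save_add(stored, ~prod_sum, ~prod_carry)
    return s + c + 2  # the two "+1" corrections of the two inverted words


if __name__ == "__main__":
    stored, prod_sum, prod_carry = 1234, 77, 1000  # product = 1077
    print(add_carry_save(stored, prod_sum, prod_carry))  # 2311
    print(sub_carry_save(stored, prod_sum, prod_carry))  # 157
```

Python integers behave as sign-extended two's-complement words under bitwise operators, so the same identities hold for negative operands.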
B. Benefits of the Delay Balanced Pipeline Design
The benefits of the proposed design are manifold. First, the delays of the multiplier and the butterfly unit are now balanced. Second, the total number of clock cycles remains the same as that of the original design. Third, since the CPA of each multiplication has been converted to a CSA, the power consumption can be reduced by an amount of

$$P_{\text{Reduced}} = 2 \times (P_{CPA} - P_{CSA}) \times \text{CMul\#} \qquad (13)$$

where $P_{CPA}$ denotes the power consumed by a CPA operation, $P_{CSA}$ denotes the power consumed by a CSA, and CMul# denotes the number of complex multiplications for an FFT. Note that there are two CPAs for each complex multiplication. In addition to the power saving due to the conversion from CPA to CSA, the reduced latency can also be employed to reduce the power consumption via voltage scaling techniques.
Moreover, the repartition will not increase the total hardware cost. Table VI compares the cost of the original unbalanced design and the proposed balanced one. Assume that the cost of one bit full adder (FA) is equivalent to one bit register. For the original design, the CPA costs FAs, and there are 2 pipeline registers in our SRSDF architecture because each multiplier is connected to two butterfly units. When the CPA is merged into the butterfly units, we need two sets of CSA for the two butterfly units, and the number of pipeline register is doubled to 4 due to the carry-save format. The hardware cost is about the same for the two designs when .
C. Control Signals for Delay-Balanced SRSDF
A delay-balanced SRSDF pipeline for $N = 64$ is shown in Fig. 9. We use both local control signals and global control signals to synchronize the whole pipeline. The operation of a butterfly unit is determined by the received bf_mode and bf_bypass signals and is described as follows.
The bf_mode signal is a 2-bit local control signal that propagates with the data. The three different states indicated by the bf_mode signal are defined in Table VII. When a butterfly unit receives st_normal from the previous stage, it generates the sum according to (4) and the difference according to (5). It also generates a new bf_mode signal for the next stage according to Table VII. Similarly, if it receives st_mulj, it multiplies the input data by "$-j$" and then generates the sum and the difference. If an st_csa signal is received, the butterfly sends the incoming data into a carry-save adder and then generates the correct outputs, as explained in Section IV-A. Therefore, three different butterfly units, BF_I, BF_II, and BF_III, are used in this design. BF_I can accept st_normal, and BF_II can accept st_normal and st_mulj. BF_III can accept any one of the three states defined in Table VII. The bf_bypass signal in Table VII is generated by a $\log_2 N$-bit counter labeled "bf_counter" in Fig. 9. The MSB is connected to the first butterfly stage so that the first stage changes its state every $N/2$ clock cycles. Similarly, each successively lower bit of the counter is connected to the next butterfly stage, as shown in the figure. When bf_bypass is "1," the data are bypassed and stored into the buffer. When it is "0," the butterfly carries out one of the three possible BF operations specified by bf_mode.
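The counter-based generation of bf_bypass can be sketched as follows. The Python model below is illustrative only. It assumes an up-counter that starts at zero, assumes that a stage is in the bypass phase while its counter bit is zero, and ignores the extra pipeline registers of the multiplier stages. Neither the polarity nor the function names are taken from Fig. 9.

```python
def bf_bypass_schedule(n_points, cycles):
    """bf_bypass value of every butterfly stage for each clock cycle.

    Stage 0 is the first stage and observes the MSB of the counter;
    each later stage observes the next lower bit.
    """
    stages = n_points.bit_length() - 1          # log2(N) butterfly stages
    schedule = []
    for t in range(cycles):
        count = t % n_points                    # log2(N)-bit counter value
        row = []
        for s in range(stages):
            bit = (count >> (stages - 1 - s)) & 1
            row.append(1 - bit)                 # assumed polarity: 1 = bypass
        schedule.append(row)
    return schedule


def arrival_delay(n_points, stage):
    """Cycles before the first valid sample reaches a stage (FIFO depths only)."""
    return sum(n_points >> (i + 1) for i in range(stage))


if __name__ == "__main__":
    n = 8
    for t, row in enumerate(bf_bypass_schedule(n, n)):
        print(t, row)
    # The delay in front of each stage is a multiple of that stage's period,
    # so one free-running counter keeps all stages aligned.
    print([arrival_delay(n, s) % (n >> s) for s in range(3)])
```

In this model, the first stage toggles every $N/2$ cycles, and each following stage toggles twice as fast as the one before it.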
The operation of the multiplier is determined by the bf_mode signals from the previous and the next butterfly units and by the mul_mode signal from the twiddle factor table in Fig. 9. The bf_mode from the previous butterfly is connected to the pre_mode port, and the one from the next butterfly is connected to the nxt_mode port, as shown in the figure. When neither of the bf_mode signals is in the st_csa state, the data from both butterfly units are simply bypassed. When one of the bf_mode signals is in the st_csa state, the multiplier multiplies the corresponding incoming data by a twiddle factor, and the other data are bypassed. Note that the pre_mode and nxt_mode signals are never both in the st_csa state simultaneously because the proposed BIBR data scheduling scheme is used. The twiddle factor table follows similar rules to deliver the correct twiddle factor for multiplication. The mul_mode signal from the table is set to one whenever the current multiplication is trivial and can be skipped. Only if the trivial multiplications are eliminated can the multiplicative complexity of the SRSDF be as low as indicated in Table I.

TABLE VII: STATES INDICATED BY BF_MODE AND THE STATE TRANSITION TABLE

TABLE VIII: AREA PERCENTAGES OF THE MODULES OF THE 64-POINT SRFFT

For an SRSDF design without delay balancing, the bf_mode local control signal is the only additional control overhead compared with an R2$^2$SDF or R4SDF design. Note that the folding of the multiplier is achieved by using BIBR scheduling and therefore does not increase the control complexity, except for the multiplexers required at the input and output of a multiplication stage. The st_csa state is needed only when the delay-balanced architecture is employed. The other control signals, such as bf_counter or mul_mode, are commonly employed in pipeline-based designs. Thus, the overall control complexity increases only slightly due to the local control signal.
The area percentages of control, datapath, and feedback memory for the 64-point SRSDF FFT are summarized in Table VIII. The control logic is very small compared with the other components. If the hardware for data multiplexing in the multipliers and BF units is counted as control logic, the share of control logic rises to approximately 7%. Most of the increase comes from the multiplexers for data routing in the multipliers.
V. SIMULATION AND ANALYSIS
A. R2 SDFII and SRSDF FFT Pipelines
For comparison, we have implemented a 64-point ( ) R2 SDFII pipeline and a 64-point delay-balanced SRSDF FFT pipeline because they have similar hardware cost and performance, as shown in Table III. Note that although R2 SDFII is obtained through the procedure described in this work, its performance should be similar to that of the R2 SDF described in [6]. The input uses 12 bits ( ) for the real part and 12 bits for the imaginary part. To avoid overflow, the wordlength is adjusted by one bit at each stage, as shown in Fig. 9. The output is 20 bits wide for both the real and imaginary parts for a 64-point FFT pipeline. The wordlength of the twiddle factors is fixed at 12 bits ( ) for both the real part and the imaginary part. Any part of a multiplication result that exceeds the required wordlength is truncated directly. These parameters, including the FFT length and the input and twiddle-factor wordlengths, are configurable. The design is first described in C/C++ to verify the functionality and the effects of fixed-point arithmetic. The C/C++ program is then converted to Verilog and synthesized using Design Analyzer from Synopsys. Automatic placement and routing are done by Apollo from Avant!.
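The direct truncation of multiplication results mentioned above can be illustrated with a small fixed-point model. This is a sketch under assumptions rather than the authors' C/C++ model: the 12-bit twiddle factor is assumed to be a signed value with 11 fractional bits, the quantization of the twiddle (rounding with saturation) is an assumption, the twiddle angle is assumed to be -2πk/N, and the function names are hypothetical. Only the truncation of the product comes from the text.

```python
import math

FRAC_BITS = 11  # assumed: 12-bit signed twiddle = 1 sign bit + 11 fractional bits


def quantize_twiddle(k: int, n: int) -> tuple:
    """Quantize exp(-2*pi*j*k/n) to signed 12-bit (real, imag) integers."""
    angle = -2.0 * math.pi * k / n
    scale = 1 << FRAC_BITS

    def q(x: float) -> int:
        # Round to nearest, then saturate to the 12-bit range [-2048, 2047].
        return max(-scale, min(scale - 1, round(x * scale)))

    return q(math.cos(angle)), q(math.sin(angle))


def truncating_cmul(xr: int, xi: int, wr: int, wi: int) -> tuple:
    """Complex multiply of integer data by a quantized twiddle factor.

    The extra fractional bits of the product are dropped with an arithmetic
    right shift, i.e. truncated directly with no rounding.
    """
    pr = xr * wr - xi * wi
    pi = xr * wi + xi * wr
    return pr >> FRAC_BITS, pi >> FRAC_BITS


if __name__ == "__main__":
    wr, wi = quantize_twiddle(8, 64)
    print(truncating_cmul(1000, -500, wr, wi))
```

Note that the arithmetic shift truncates toward negative infinity, so in this model a product slightly below 3 becomes 2 while a product slightly above -3 becomes -3. This small negative bias is a known property of plain two's-complement truncation.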
B. Area, Power, and Timing Performance
Table IX summarizes the performance of the two designs. The SRSDF design can still function properly at 150 MHz at 3.3 V or 75 MHz at 2.7 V, whereas R2 SDFII cannot. The layout view of the 64-point SRSDF FFT chip is shown in Fig. 10. It is pad limited, with a core area of 1902 µm × 1820 µm. The gate count and the critical path are reported by Synopsys Design Analyzer, and the power consumption is reported by PowerMill. The functionality of both SRSDF and R2 SDFII is verified by post-layout simulation with TimeMill. The SRSDF FFT pipeline can achieve a higher clock rate because the multiplication and the butterfly operation are well balanced. The novel SRSDF FFT architecture achieves about 15% power saving and 14.5% speed improvement compared with the R2 SDFII FFT pipeline. The power consumption of both SRSDF and R2 SDFII agrees with the model in (1). For example, the power consumed by the SRSDF is 259 mW when the supply voltage is reduced to 2.7 V. If we take the area into consideration, the power reduction contributed by the delay-balanced pipeline and the split-radix algorithm is about 10%. However, when the FFT length is 64, the numbers of multiplications in the two designs differ by only four. This means that most of the 10% power reduction comes from the delay-balanced design. The advantage of the split-radix algorithm is not significant when the FFT length is small.
VI. CONCLUSION AND FUTURE WORK
This paper presents a novel delay-balanced SRSDF pipeline architecture, which is regular and extensible to any power-of-two FFT length. Most conventional radix-r FFT pipelines have the restriction that the FFT length must be a power of r. We remove this restriction by using the split-radix algorithm. Compared with the R2 SDFII design, the proposed design saves 15% power consumption and 6% hardware cost and reduces the critical path by 14.5%, according to the post-layout simulation based on the 0.35 µm Avant! cell library [23]. Thus, it not only achieves the minimum hardware requirement but also saves power and increases the maximum clock rate at the same time.
The 1-D linear array for other FFT algorithms can be obtained via a similar mapping procedure, and the delay-balanced pipeline architecture can also be used when a higher clock rate and lower power consumption are desirable. The comparison of the fixed-radix and mixed-radix algorithms also provides useful information for a designer.
For the radix-2/8 algorithm, we also propose an R28SDF pipeline architecture in this work. It has low computational complexity but a high hardware cost. We will develop a cost-effective solution for the R28SDF architecture in the future.
As mentioned in Section III-B, the number of global memory accesses for low-radix algorithms can be reduced by using a high-radix pipeline structure. This is also one of our future research directions because it may save significant power in OFDM systems with long FFTs.
