This paper presents a low-power, high-speed 4-data-path 128-point mixed-radix (radix-2 and radix-2²) FFT processor for MB-OFDM Ultra-WideBand (UWB) systems. The processor employs a single-path delay feedback (SDF) pipelined structure for the proposed algorithm, and it uses substructure-sharing multiplication units and shift-add structures rather than traditional complex multipliers. Furthermore, the word lengths are carefully chosen, so the hardware cost and power consumption of the proposed FFT processor are reduced. The proposed FFT processor is verified and synthesized in a 0.13 µm CMOS technology with a supply voltage of 1.32 V. The implementation results indicate that the proposed 128-point mixed-radix FFT architecture supports a throughput rate of 1 Gsample/s with lower power consumption than existing 128-point FFT architectures.
INTRODUCTION
UWB has recently attracted much attention as an indoor, short-range, high-speed wireless communication technology [1]. One of its most attractive characteristics is that it supports data rates from tens of Mb/s to hundreds of Mb/s, and thus satisfies most multimedia applications such as audio and video delivery. Multiband orthogonal frequency division multiplexing (MB-OFDM) is considered the leading choice by the 802.15.3a standardization group for establishing a physical-layer standard for UWB communications [2]. OFDM-based UWB not only provides reliable high-data-rate transmission in time-dispersive or frequency-selective channels without complex time-domain channel equalizers, but also offers high spectral efficiency [3]. Figure 1 shows the UWB physical layer in the MB-OFDM framework. In MB-OFDM systems, the FFT processor handles a total of 128 points, comprising 100 data tones, 12 pilot tones, 10 guard tones and 6 null tones. The FFT processor is one of the most important and complex modules in the UWB physical layer: the execution time for a 128-point transform must be at most 312.5 ns to satisfy the timing constraints. However, traditional FFT architectures cannot satisfy both the low-power and the high-throughput specifications, so a high-speed, power-efficient FFT processor is required.

Many works focus on FFT/IFFT processors. Y. W. Lin et al. [3] propose a mixed-radix FFT algorithm that combines radix-2 and radix-2⁶ FFTs. They divide the radix-2⁶ FFT into two radix-2³ FFT stages, and then realize each radix-2³ FFT with three radix-2 stages. The architecture is designed as four parallel SDF pipelines; the maximum operating frequency of the FFT chip reaches 250 MHz in a 180 nm CMOS technology, and the throughput rate is 1 Gsample/s with a power dissipation of 175 mW. Sang-In Cho et al. [4] employ a modified radix-2⁴ algorithm and a radix-2³ algorithm to significantly reduce the number of complex constant multipliers, also using a four-parallel-path pipelined architecture. Their simulation results indicate that the FFT processor supports a throughput of up to 1 Gsample/s with a power dissipation of 112 mW in a 180 nm CMOS technology. Z. Wang et al. [5] propose a two-stage radix-2² and radix-32 FFT algorithm that uses a radix-2⁵ algorithm instead of radix-32, with four parallel pipelines at the second stage. S. Qiao et al. [6] propose a radix-2 and radix-64 FFT algorithm as in [3], but develop a non-Cooley-Tukey radix-8 unit to save hardware cost. Their measurement results show a throughput rate of 409.6 Msample/s with an area saving of 20% to 63% compared with the radix-2 SDF architecture. Table 1 summarizes the features of each algorithm. In this paper, a novel mixed-radix 128-point FFT algorithm is presented and a multipath pipelined 128-point FFT architecture is designed. Since most of the power consumption and hardware complexity of an FFT processor comes from the complex multipliers, careful design of these multipliers not only lowers the power and raises the speed, but also guarantees a good signal-to-quantization-noise ratio (SQNR).
The paper is organized as follows. Section 2 describes the proposed 128-point mixed-radix FFT algorithm. Section 3 introduces the hardware architecture designed for the proposed algorithm. Section 4 presents the experimental results and compares them with existing FFT architectures. Finally, conclusions are drawn in Section 5.
128-POINT MIXED RADIX ALGORITHM
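A mixed radix-2 and radix-2² 128-point FFT algorithm is proposed in this section. An N-point discrete Fourier transform (DFT) is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1, \tag{1}$$

where x(n) and X(k) are complex values. The twiddle factor is expressed in (2):

$$W_N^{nk} = e^{-j 2\pi nk/N} = \cos\!\left(\frac{2\pi nk}{N}\right) - j \sin\!\left(\frac{2\pi nk}{N}\right). \tag{2}$$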
To derive the proposed algorithm, n and k are expressed through the four-dimensional linear index maps (3) and (4). Substituting (3) and (4) into (1) gives (5). As can be seen from (5), the 128-point FFT is decomposed into one radix-2 stage and three radix-4 stages. Stage 1 is expressed by (6), which contains a radix-2 butterfly and a twiddle-factor multiplication. The indices n₂ and k₂ are then further decomposed as in (7) and (8).
Using a similar method, stages three and four can be represented as (10) and (11), respectively. Figure 2 shows the signal flow of the proposed FFT algorithm.
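The decomposition into one radix-2 stage and three radix-4 stages can be checked numerically. The following is a minimal sketch, not the paper's own derivation: the index map `n = 64*n1 + 16*n2 + 4*n3 + n4`, `k = k1 + 2*k2 + 8*k3 + 32*k4` is an assumption chosen so that the inter-stage twiddle factors are of order 128, 64 and 16, and the function names are hypothetical. It compares the staged result against a DFT computed directly from the definition.

```python
import cmath
import random


def w(base, e):
    """Twiddle factor W_base^e = exp(-j*2*pi*e/base)."""
    return cmath.exp(-2j * cmath.pi * e / base)


def direct_dft(x):
    """Reference DFT computed straight from the definition."""
    n = len(x)
    return [sum(x[i] * w(n, i * k) for i in range(n)) for k in range(n)]


def mixed_radix_fft128(x):
    """128-point DFT as one radix-2 stage followed by three radix-4 stages.

    Assumed index map (an illustration, not taken from the paper):
      n = 64*n1 + 16*n2 + 4*n3 + n4,   k = k1 + 2*k2 + 8*k3 + 32*k4
    """
    r4 = range(4)
    # Stage 1: radix-2 butterfly over n1, then twiddle W_128^((16n2+4n3+n4)*k1).
    s1 = {}
    for n2 in r4:
        for n3 in r4:
            for n4 in r4:
                rest = 16 * n2 + 4 * n3 + n4
                for k1 in range(2):
                    bf = x[rest] + (-1) ** k1 * x[64 + rest]
                    s1[n2, n3, n4, k1] = bf * w(128, rest * k1)
    # Stage 2: radix-4 butterfly over n2, then twiddle W_64^((4n3+n4)*k2).
    s2 = {}
    for n3 in r4:
        for n4 in r4:
            for k1 in range(2):
                for k2 in r4:
                    bf = sum(s1[n2, n3, n4, k1] * w(4, n2 * k2) for n2 in r4)
                    s2[n3, n4, k1, k2] = bf * w(64, (4 * n3 + n4) * k2)
    # Stage 3: radix-4 butterfly over n3, then twiddle W_16^(n4*k3).
    s3 = {}
    for n4 in r4:
        for k1 in range(2):
            for k2 in r4:
                for k3 in r4:
                    bf = sum(s2[n3, n4, k1, k2] * w(4, n3 * k3) for n3 in r4)
                    s3[n4, k1, k2, k3] = bf * w(16, n4 * k3)
    # Stage 4: radix-4 butterfly over n4, no further twiddle.
    out = [0j] * 128
    for k1 in range(2):
        for k2 in r4:
            for k3 in r4:
                for k4 in r4:
                    k = k1 + 2 * k2 + 8 * k3 + 32 * k4
                    out[k] = sum(s3[n4, k1, k2, k3] * w(4, n4 * k4) for n4 in r4)
    return out


random.seed(0)
x = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(128)]
max_err = max(abs(a - b) for a, b in zip(mixed_radix_fft128(x), direct_dft(x)))
print("largest deviation from the direct DFT:", max_err)
```

Under this assumed map, only stage 1 needs general (order-128) twiddle multiplications, while stages 2 and 3 need order-64 and order-16 constants, which is the property the hardware modules exploit.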
Note that the inverse FFT (IFFT) of a length-N complex sequence can be obtained by (13):

$$x(n) = \frac{1}{N}\left[\sum_{k=0}^{N-1} X^{*}(k)\, W_N^{nk}\right]^{*}. \tag{13}$$

The IFFT can therefore be realized by taking the complex conjugate of the input without changing the coefficients, then taking the complex conjugate of the output and dividing the result by N.
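The conjugation trick can be illustrated in a few lines. This is a minimal sketch in which a direct DFT stands in for the forward FFT hardware; the function names `dft` and `ifft_via_forward` are hypothetical.

```python
import cmath


def dft(x):
    """Forward DFT from the definition (stand-in for the forward FFT processor)."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]


def ifft_via_forward(X, forward=dft):
    """Inverse transform using only the forward transform:
    conjugate the input, run the forward transform, conjugate the output, divide by N."""
    n = len(X)
    y = forward([v.conjugate() for v in X])
    return [v.conjugate() / n for v in y]


x = [1 + 2j, -0.5j, 3.0, 0.25 - 1j, 0j, 2 - 2j, -1.5, 1j]
recovered = ifft_via_forward(dft(x))
print(max(abs(a - b) for a, b in zip(recovered, x)))
```

Because the twiddle coefficients are unchanged, the same datapath serves both directions; only the conjugations and the final scaling differ.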
ARCHITECTURE OF THE PROPOSED FFT ALGORITHM
Based on the proposed algorithm, a 4-data-path SDF pipeline processor is designed. The block diagram of the proposed architecture is given in Figure 3. Modules 1 to 4 correspond to stages one to four of the signal flow, respectively. Module 1 realizes the radix-2 FFT algorithm, while Modules 2 to 4 realize the radix-2² FFT algorithm. We define three types of word lengths for our system: input/output word-length (IOWL), system word-length (SWL) and twiddle word-length (TWL). In this paper, we choose IOWL = SWL = 10 bits and TWL = 8 bits; the details are discussed in Section 4. In general, four complex multipliers are needed in a four-parallel approach to implement the radix-2 FFT algorithm, and the utilization rate of each complex multiplier is then only 50%. In the proposed architecture, since Module 2 needs 32 clock cycles to process x(i), the complex multipliers can be shared among the four paths during these 32 cycles. The detailed operation is as follows. When the y(i) are generated by the butterfly units (BUs), two of them, y(0) and y(1), are multiplied by the appropriate twiddle factors first, while y(2) and y(3) are stored in the register files. Sixteen clock cycles later, y(2) and y(3) are multiplied and then fed to Module 2 together with the results of y(0) and y(1). With this rescheduling, only two complex multipliers are needed, so 50% of the multipliers are saved and 100% multiplier utilization is achieved.
Module 2 consists of four parallel radix-2² SDF architectures and a complex multiplier module for $W_{64}^{nk}$, as shown in Figure 4. As can be seen in Figure 3, the output data generated by BU2 between the first step and the second step must be multiplied by j, which can be implemented efficiently by exchanging the real and imaginary parts and inverting the sign of one of them. To reduce the complexity of the complex multipliers, we further modify the approach proposed in [3]. The twiddle factors are divided into eight regions, as shown in Figure 5. Region A consists of the values with p from 0 to 8; the values in the other seven regions can be obtained by transforming the data of region A according to Table 2. Therefore, through this mapping, only nine sets of constant values are needed. In practice, only eight sets of constant values in region A have to be implemented, since the first pair of constant values (1, 0) is trivial. In addition, these constant values can be realized more efficiently with 8-bit shift-add multipliers [7]. Table 3 shows the schedule of the twiddle factors for the four paths. The table shows only 16 clock cycles because the twiddle-factor values repeat every 16 cycles. After mapping according to Table 2, the resulting coefficient values and corresponding regions are shown in Table 4 and Table 5. Table 4 shows that the twiddle factors of the four paths differ in each time slot, except for the first 4 cycles, in which no multiplications are needed. As can be seen from the block diagram of the complex multiplier in Figure 4, when the inputs enter the first module, the data of the four paths are routed to different complex constant multipliers according to the schedule in Table 4. After the multiplications, the regions are selected in the second module according to Table 5 and Table 2. With the modified complex multiplier, the system is simpler and more efficient in time and energy, since only a few registers and adders are needed.
to the proposed algorithm, which is one less than that in [4]. Furthermore, since the TWL of the proposed FFT is 2 bits shorter than that of [3] and [4], fewer additions are needed, which reduces the hardware cost and improves the speed.
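The region mapping for the order-64 twiddle factors can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of Table 2: the eight regions are taken to be the eight octants of the unit circle, region A is taken to hold cos and sin of 2πq/64 for q = 0 to 8, and the names `COS_A`, `SIN_A`, `OCTANT_MAP` and `twiddle64` are hypothetical.

```python
import cmath
import math

# Region A: the only stored constants, cos and sin of 2*pi*q/64 for q = 0..8.
COS_A = [math.cos(2 * math.pi * q / 64) for q in range(9)]
SIN_A = [math.sin(2 * math.pi * q / 64) for q in range(9)]

# For each octant, how (cos, sin) of the full angle follow from the
# region-A pair (a, b) by swapping and/or negating (assumed mapping).
OCTANT_MAP = [
    lambda a, b: (a, b),
    lambda a, b: (b, a),
    lambda a, b: (-b, a),
    lambda a, b: (-a, b),
    lambda a, b: (-a, -b),
    lambda a, b: (-b, -a),
    lambda a, b: (b, -a),
    lambda a, b: (a, -b),
]


def twiddle64(p):
    """Return W_64^p = cos(2*pi*p/64) - j*sin(2*pi*p/64) using only region-A constants."""
    p %= 64
    octant, q = divmod(p, 8)
    # Odd octants mirror the angle, so they index region A from the far end.
    idx = q if octant % 2 == 0 else 8 - q
    c, s = OCTANT_MAP[octant](COS_A[idx], SIN_A[idx])
    return complex(c, -s)


worst = max(abs(twiddle64(p) - cmath.exp(-2j * cmath.pi * p / 64)) for p in range(64))
print("largest mapping error:", worst)
```

Every one of the 64 twiddle factors is produced from nine stored pairs by a swap and sign changes only, which is why the constant multipliers can stay small.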
We name the three complex multipliers in Module 3 TCM_1, TCM_2 and TCM_3. Since $W_{16}^{0} = 1$ is trivial, only three twiddle factors need to be implemented in each multiplier. From Table 6, we can determine the coefficients needed to represent all the twiddle factors in each multiplier. Table 7 is the coefficient table; it shows that TCM_1 and TCM_3 require three coefficients, while TCM_2 needs only one.

Table 6. Time schedule for TCM_1, TCM_2 and TCM_3

TCM_1   0   2   3   1
TCM_2   0   4   6   2
TCM_3   0   6   9   3

Table 7. Coefficient table
The architecture of TCM_1 is shown in Figure 6. TCM1_S1 controls whether the real part is exchanged with the imaginary part at the input. TWD1 in Figure 6 is the coefficient multiplier, which multiplies the input by the coefficients of Table 7, such as cos(π/4). Table 8 shows the 2's complement values of the coefficients for TWD1 in TCM_1 and TCM_3. The binary values of a, b and c have some parts in common; therefore, to reduce the hardware cost, we decompose the binary values to find the shared parts. The proposed architectures of TWD1 and TWD2 are shown in Figure 9 and Figure 10. With the proposed shift-add method, the complex multipliers in Module 3 are even faster and consume less power than those in Module 2. Thus, a high-speed, low-power design is realized in this stage. Module 4 has two stages of BU_2s; only the twiddle factor j is needed, which can be realized by exchanging the real and imaginary parts of the complex data and inverting the sign of one of them.
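The substructure-sharing idea can be sketched with integer shift-add arithmetic. The constants here are assumptions, not the values of Table 8: the coefficients cos(π/4), cos(π/8) and sin(π/8) are quantized to 8-bit two's-complement values with 7 fractional bits (91, 118 and 49), and the shared term is the partial product 3x. The function names are hypothetical.

```python
FRAC_BITS = 7  # assumed format: 8-bit two's-complement coefficient, 7 fractional bits


def shared_shift_add(x):
    """Multiply integer x by three constants using shifts and adds only.

    All three products reuse the partial product 3*x (binary pattern '11'),
    which is the shared substructure in this sketch.
    """
    x3 = x + (x << 1)                             # shared term: 3*x
    p_cos_pi4 = (x << 6) + (x3 << 3) + x3         # 91*x,  91/128  ~ cos(pi/4)
    p_cos_pi8 = (x3 << 5) + (x << 4) + (x3 << 1)  # 118*x, 118/128 ~ cos(pi/8)
    p_sin_pi8 = (x3 << 4) + x                     # 49*x,  49/128  ~ sin(pi/8)
    # Drop the fractional bits (arithmetic shift, rounds toward minus infinity).
    return (p_cos_pi4 >> FRAC_BITS, p_cos_pi8 >> FRAC_BITS, p_sin_pi8 >> FRAC_BITS)


def mul_j(re, im):
    """Multiply (re + j*im) by j without a multiplier: swap parts, negate one."""
    return (-im, re)


print(shared_shift_add(100))
print(mul_j(3, 4))
```

In this sketch the three products cost seven additions in total, one of which (3x) is computed once and reused; a hardware version would additionally choose between the products with a multiplexer.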
EXPERIMENTAL RESULTS
Before implementing the hardware module, the proposed architecture is verified in Simulink. To choose the word lengths for the system properly, we use the fixed-point toolbox to estimate the relationship between the signal-to-quantization-noise ratio (SQNR) of the outputs and different IOWLs, SWLs and TWLs. As the word lengths increase, the average output SQNR increases, but so does the hardware cost. Choosing the word lengths is therefore a trade-off. The fixed-point toolbox in MATLAB is used to simulate the effect of the word lengths on GR. Three types of word lengths are defined: input/output word-length (IOWL), system word-length (SWL) and twiddle word-length (TWL). It is shown in [8] that the SQNR of the FFT module in UWB is sufficient when IOWL = 6 bits and SWL = TWL = 11 bits. We also note that in [4] the SQNR reaches 35 dB with SWL = TWL = IOWL = 10 bits.
We simulate 18 groups of test data with different word lengths applied to GR; Table 9 shows the results. The SQNR reaches 27.4 dB when the word lengths are the same as in [8], and, similarly to [4], the average SQNR is 36.7 dB when all word lengths equal 10 bits. As mentioned above, besides the design of the hardware architecture, the word-length decision also helps to reduce the power consumption. We therefore keep the IOWL and SWL at 10 bits and reduce the TWL to 8 bits, which lowers the SQNR only slightly: as shown in Figure 11, the average SQNR is 34.1 dB, which still satisfies the requirement of the MB-OFDM UWB protocol. After that, a manually optimized Verilog HDL implementation of the proposed FFT processor is obtained. The hardware module is verified and the output SQNRs for different input data are calculated; the average SQNR is 34 dB.
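The trend of SQNR against twiddle word length can be reproduced qualitatively with a small experiment. This is a simplified sketch, not the Simulink or MATLAB fixed-point model used in the paper: it quantizes only the twiddle factors (the data path stays in floating point, so IOWL and SWL are not modeled), uses a plain recursive radix-2 FFT instead of the proposed mixed-radix structure, and ignores saturation of the value 1.0. Its numbers are therefore not expected to match Table 9.

```python
import cmath
import math
import random


def quantize(z, bits):
    """Round real and imaginary parts to a signed grid with bits-1 fractional bits."""
    scale = 2 ** (bits - 1)
    return complex(round(z.real * scale) / scale, round(z.imag * scale) / scale)


def fft(x, twl=None):
    """Recursive radix-2 FFT; twiddles are quantized to twl bits when twl is given."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2], twl), fft(x[1::2], twl)
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n)
        if twl is not None:
            tw = quantize(tw, twl)
        t = tw * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def sqnr_db(twl, n=128, seed=0):
    """SQNR in dB of the quantized-twiddle FFT against the unquantized one."""
    rng = random.Random(seed)
    x = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    ref, test = fft(x), fft(x, twl)
    signal = sum(abs(v) ** 2 for v in ref)
    noise = sum(abs(a - b) ** 2 for a, b in zip(ref, test))
    return 10 * math.log10(signal / noise)


for bits in (6, 8, 10, 12):
    print(bits, "bit twiddles:", round(sqnr_db(bits), 1), "dB")
```

Shortening the twiddle word length lowers the SQNR while saving adders in the shift-add multipliers, which is the trade-off behind the 8-bit TWL choice.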
Finally, the proposed FFT architecture is synthesized using the UMCL130E 0.13 µm CMOS technology with a supply voltage of 1.32 V. Table 10 compares the implementation results of the proposed FFT processor with existing works. It indicates that the proposed FFT processor supports a data processing rate of 1 Gsample/s with a power dissipation of 43.79 mW at 250 MHz. To compare the power consumption with [3] and [4], which use 0.18 µm technology, the power consumption may be scaled by a factor of about 1.4, giving about 61 mW, which is around 50% of that of [4]. Table 11 compares the hardware cost, FFT algorithm and throughput rate with two existing 128-point four-parallel-datapath FFT architectures. During circuit testing, we found that the critical path of the FFT processor lies in the two multipliers at stage 1; we therefore employ Booth multipliers, which shortens the critical path and brings the hardware cost of these multipliers to 75% of that in [3]. In the proposed architecture, only 3 trivial multipliers are needed, and their complexity is reduced because of the properties of the data stream and the sharing structure. As a result, the hardware cost of the three trivial multipliers is 40% of that in [4] and 63.7% of that in [3]. The complex multiplier in stage 2 not only uses fewer constant complex multipliers than that of [3], but also has a shorter twiddle word length.
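The headline figures can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in the text (four paths, 250 MHz, the 312.5 ns budget, 43.79 mW, the scaling factor of about 1.4, and the 112 mW and 175 mW reported for [4] and [3]); the variable names are hypothetical, and the streaming time is the time to pass 128 samples through four paths, not a latency figure.

```python
PATHS = 4                  # parallel data paths
CLOCK_HZ = 250e6           # operating frequency
SYMBOL_BUDGET_S = 312.5e-9 # time allowed for one 128-point transform

throughput = PATHS * CLOCK_HZ    # samples per second
time_per_fft = 128 / throughput  # seconds to stream 128 samples

POWER_MW = 43.79                 # reported power in 0.13 um technology
SCALE_TO_180NM = 1.4             # approximate scaling factor quoted in the text
scaled_power = POWER_MW * SCALE_TO_180NM

ratio_vs_ref4 = scaled_power / 112.0  # [4] reports 112 mW
ratio_vs_ref3 = scaled_power / 175.0  # [3] reports 175 mW

print("throughput (sample/s):", throughput)
print("time per 128 samples (ns):", time_per_fft * 1e9)
print("scaled power (mW):", round(scaled_power, 1))
print("ratio to [4]:", round(ratio_vs_ref4, 3), " ratio to [3]:", round(ratio_vs_ref3, 3))
```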
CONCLUSIONS
In this paper, we propose a novel mixed-radix 128-point FFT algorithm that combines modified radix-2 and radix-2² algorithms, and build a low-power, high-throughput FFT processor based on it. The algorithm allows us to significantly reduce the number and complexity of the complex multipliers. The power consumption of the whole system is reduced by around 50% compared with existing work. The implementation results indicate that the proposed FFT processor, with 10-bit IOWL and SWL and an 8-bit twiddle word length, supports a throughput rate of 1 Gsample/s with a power consumption of 43.79 mW at 250 MHz in a 0.13 µm CMOS technology.
We are currently finishing the design of the other modules of the UWB system and, at the same time, refining the design method to ensure first-silicon success.
