Abstract: In this paper, we present a novel 128-point FFT/IFFT processor for ultrawideband (UWB) systems. The proposed pipelined FFT architecture, called mixed-radix multipath delay feedback (MRMDF), provides a higher throughput rate by using a multidata-path scheme. Furthermore, through the delay feedback and data scheduling approaches, the hardware costs of memory and complex multipliers in MRMDF are only 38.9% and 44.8%, respectively, of those in the known FFT processor. The high-radix FFT algorithm is also realized in our processor to reduce the number of complex multiplications. A test chip for the UWB system has been designed and fabricated in a 0.18-μm single-poly six-metal CMOS process with a core area of 1.76 × 1.76 mm², including an FFT/IFFT processor and a test module. The throughput rate of the fabricated FFT processor is up to 1 Gsample/s while it consumes 175 mW. Power dissipation is 77.6 mW when the throughput rate meets the UWB standard, in which the FFT throughput rate is 409.6 Msample/s.
I. INTRODUCTION
ULTRAWIDEBAND (UWB) communication systems, which can deliver data at rates from 110 Mb/s at a distance of 10 m to 480 Mb/s at a distance of 2 m in a realistic multipath environment while consuming very little power and silicon area, are currently the focus of research and development of wireless personal area networks (WPANs). Orthogonal frequency division multiplexing (OFDM) is considered the leading choice by the 802.15.3a standardization group for establishing a physical-layer standard for UWB communications [1]. OFDM-based UWB not only offers reliable high-data-rate transmission in time-dispersive or frequency-selective channels without complex time-domain channel equalizers but also provides high spectral efficiency. However, because the data sampling rate from the analog-to-digital converter to the physical layer is 528 Msample/s or more, it is a challenge to realize the physical layer of the UWB system in VLSI, especially the components with high computational complexity. The FFT/IFFT processor is one of the modules with high computational complexity in the physical layer of the UWB system, and the execution time of the 128-point FFT/IFFT in the UWB system is only 312.5 ns. Therefore, with the traditional approach, the FFT/IFFT processor would need a great deal of power and high hardware cost to meet the strict specifications of the UWB system. Thus, this paper proposes an FFT/IFFT processor with a novel multipath pipelined architecture for high-throughput-rate applications.
The power consumption and hardware cost are also reduced in our processor by using the higher radix FFT algorithm and less memory and fewer complex multipliers. This paper is organized as follows. Section II identifies the problems of implementing the FFT/IFFT processor in the UWB system. Section III describes the 128-point mixed-radix FFT algorithm, including the radix-2 FFT algorithm and the radix-8 FFT algorithm, and the IFFT algorithm. Section IV describes the proposed FFT/IFFT architecture and compares its hardware cost and throughput rate with those of some existing FFT architectures for the 128-point FFT. Some simulation results are shown in Section V. In Section VI, the microphotograph of the fabricated FFT/IFFT processor and the measurement results are presented. Conclusions are drawn in Section VII.
II. DESIGN ISSUE OF THE FFT PROCESSOR FOR THE OFDM-BASED UWB SYSTEM
A block diagram of the proposed physical layer of the OFDM-based UWB system is shown in Fig. 1. It contains a convolutional encoder, a Viterbi decoder, a pilot insertion block, a QPSK modulator/demodulator, a spreading/de-spreading block, a guard interval insertion/removal block, a 128-point FFT/IFFT, a serial-to-parallel (S/P) converter/parallel-to-serial (P/S) converter, an analog-to-digital converter (ADC), a digital-to-analog converter (DAC), and a synchronization and channel estimation block. In the UWB system, the data rate ranges from 53.3 to 480 Mb/s over several code rates. The bandwidth of the transmitted signal is 528 MHz and the OFDM symbol duration is 312.5 ns, including 60.61 ns for the cyclic prefix and 9.47 ns for the guard interval [1]. Thus, an FFT/IFFT has to compute one OFDM symbol within 312.5 ns, which for a 128-point FFT/IFFT corresponds to a throughput rate of 409.6 Msample/s. In order to implement the physical layer of the UWB system more efficiently, the four-data-path approach is adopted to reduce the data sampling rate from the ADC, as shown in Fig. 1, so that, after the S/P converter, the data sampling rate of each path is reduced to 132 Msample/s.
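The timing figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are ours, and the useful FFT period is inferred here as 128 samples at 528 MHz, which is an assumption rather than a figure quoted in the text.

```python
# Cross-check of the UWB timing budget quoted in Section II.
FFT_SIZE = 128              # points per OFDM symbol
SAMPLE_RATE_HZ = 528e6      # ADC sampling rate (528 Msample/s)
SYMBOL_NS = 312.5           # OFDM symbol duration
CYCLIC_PREFIX_NS = 60.61    # cyclic prefix duration
GUARD_INTERVAL_NS = 9.47    # guard interval duration
DATA_PATHS = 4              # parallel paths after the S/P converter

# Assumption: the useful part of the symbol is 128 samples at 528 MHz.
fft_period_ns = FFT_SIZE / SAMPLE_RATE_HZ * 1e9

# Required FFT throughput: 128 samples every 312.5 ns, in Msample/s.
throughput_msps = FFT_SIZE / SYMBOL_NS * 1e3

# Sampling rate seen by each of the four paths, in Msample/s.
per_path_msps = SAMPLE_RATE_HZ / 1e6 / DATA_PATHS

symbol_check_ns = fft_period_ns + CYCLIC_PREFIX_NS + GUARD_INTERVAL_NS

print(f"useful FFT period : {fft_period_ns:.2f} ns")
print(f"symbol duration   : {symbol_check_ns:.2f} ns")
print(f"FFT throughput    : {throughput_msps:.1f} Msample/s")
print(f"per-path rate     : {per_path_msps:.1f} Msample/s")
```

Under this assumption, the three components of the symbol add up to the 312.5-ns symbol duration to within rounding of the quoted values.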
Various FFT architectures, such as single-memory architecture, dual-memory architecture [2], pipelined architecture [3], array architecture [4], and cached-memory architecture [5], have been proposed in the last three decades. In our view, the pipelined architecture is the best choice for high-throughput-rate applications since it provides a high throughput rate at acceptable hardware cost. The pipelined FFT architecture typically falls into one of two categories: multipath delay commutator (MDC) and single-path delay feedback (SDF), as shown in Fig. 2(a) and (b), respectively [3]. In general, if appropriately reordered parallel input data can be supplied simultaneously to the MDC scheme, this scheme provides $r$ times the throughput rate of the SDF scheme. However, the MDC architecture places some limitations on the number of data paths, the FFT size, and the radix-$r$ FFT algorithm. Moreover, the MDC scheme requires more memory and more complex multipliers than the SDF scheme. In summary, the MDC scheme can achieve a higher throughput rate, while the SDF scheme needs less memory and hardware cost.
For high-throughput-rate applications such as UWB, the MDC architecture is more suitable than the SDF architecture if the input data are reordered in the input buffer before they are loaded into the MDC processor. Unfortunately, the traditional R2 (Radix-2) MDC architecture cannot provide the required throughput rate unless it raises the working frequency [6]; the R4 (Radix-4) MDC architecture, which requires the FFT size to be a power of four, is limited in FFT size [6]; and the Split-Radix (SR) MDC has higher hardware cost [7]. In addition, the higher radix FFT algorithm is difficult to implement in the traditional MDC architecture. In general, a higher throughput rate can be obtained by increasing the number of data paths in the MDC pipelined architecture. However, the hardware cost also increases significantly because more memory and complex multipliers are needed to operate on multiple data simultaneously. The main motivation of this paper is to design a novel four-data-path pipelined FFT architecture, called mixed-radix multipath delay feedback (MRMDF), by combining the features of the SDF and MDC architectures. The proposed FFT/IFFT processor not only suits the proposed UWB physical layer, as shown in Fig. 1, but also provides a throughput rate sufficient to meet the UWB specifications. The MRMDF architecture has lower hardware cost than the traditional MDC approach and adopts the high-radix FFT algorithm to save power. The approach is described in more detail in Sections III and IV.
III. ALGORITHM
Given a sequence $x(n)$, an $N$-point discrete Fourier transform (DFT) is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \tag{1}$$

where $x(n)$ and $X(k)$ are complex numbers. The twiddle factor is

$$W_N^{nk} = e^{-j(2\pi nk/N)} \tag{2}$$

In (1), the computational complexity is $O(N^2)$ if the required computation is performed directly. By using the FFT algorithm, the computational complexity can be reduced to $O(N \log_r N)$, where $r$ denotes the radix of the FFT. The radix-$r$ FFT algorithm can be easily derived from the DFT by decomposing the $N$-point DFT into a set of recursively related $r$-point transforms, if $N$ is a power of $r$. In general, a higher-radix FFT algorithm requires fewer complex multiplications than the radix-2 FFT algorithm, which is the simplest of all FFT algorithms. For example, for the 128-point FFT, the number of nontrivial complex multiplications of the radix-8 FFT algorithm is 152, which is only 59.3% of that of the radix-2 FFT algorithm [8]. Thus, in order to reduce the number of complex multiplications, we choose the radix-8 FFT algorithm. Because 128 is not a power of 8, a mixed-radix FFT algorithm, including radix-2 and radix-8 FFT algorithms, is needed. This is derived in detail below. First let

$$N = 128; \qquad n = 64 n_1 + n_2, \quad n_1 = 0, 1, \quad n_2 = 0, 1, \ldots, 63; \qquad k = k_1 + 2 k_2, \quad k_1 = 0, 1, \quad k_2 = 0, 1, \ldots, 63 \tag{3}$$

Using (3), (1) can be rewritten as

$$X(2k_2 + k_1) = \sum_{n_2=0}^{63} \left\{ \left[ \sum_{n_1=0}^{1} x(64 n_1 + n_2)\, W_2^{n_1 k_1} \right] W_{128}^{n_2 k_1} \right\} W_{64}^{n_2 k_2} \tag{4}$$

Equation (4) can be considered as a two-dimensional (2-D) DFT: one is a 64-point DFT and the other is a two-point DFT. Then, by decomposing the 64-point DFT into eight-point DFTs recursively two times, we complete the 128-point mixed-radix FFT algorithm. In order to implement the radix-8 FFT algorithm more efficiently, using the radix-$2^3$ FFT algorithm proposed by He and Torkelson [3], we further decompose the butterfly of the radix-8 FFT algorithm into three steps and apply the radix-2 index map to the radix-8 butterfly. By a three-dimensional (3-D) linear index map, $n_2$ and $k_2$ can be defined as in (5). By means of (5), (4) takes the form (6).
The twiddle factor in (6) is decomposed as in (7). Using (7), (6) becomes (8), in which the eight-point DFT is defined by (9).
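The two-dimensional split behind (4) can be checked numerically. The sketch below assumes the index map $n = 64 n_1 + n_2$, $k = k_1 + 2 k_2$ (a two-point DFT over $n_1$, a twiddle $W_{128}^{n_2 k_1}$, and a 64-point DFT over $n_2$). For brevity the 64-point DFT is evaluated directly rather than by the radix-8 decomposition, and the function names are ours.

```python
import cmath
import random

def dft(x):
    """Direct O(N^2) DFT, used as the reference and as the 64-point stage."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def mixed_radix_128(x):
    """128-point DFT as a radix-2 first stage followed by 64-point DFTs."""
    assert len(x) == 128
    out = [0j] * 128
    for k1 in range(2):
        stage1 = []
        for n2 in range(64):
            a, b = x[n2], x[n2 + 64]
            # Two-point DFT over n1: sum for k1 = 0, difference for k1 = 1.
            s = a + b if k1 == 0 else a - b
            # Twiddle factor W_128^(n2 * k1); it equals 1 when k1 = 0.
            stage1.append(s * cmath.exp(-2j * cmath.pi * n2 * k1 / 128))
        # 64-point DFT over n2 yields the outputs with index k = 2*k2 + k1.
        stage2 = dft(stage1)
        for k2 in range(64):
            out[2 * k2 + k1] = stage2[k2]
    return out

rng = random.Random(0)
x = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(128)]
max_err = max(abs(a - b) for a, b in zip(mixed_radix_128(x), dft(x)))
print(f"max deviation from direct DFT: {max_err:.2e}")
```

Only the difference branch ($k_1 = 1$) needs a twiddle multiplication after the first stage, which is why half of the first-stage outputs can bypass the multipliers.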
In (9), we use the radix-2 index map to divide the eight-point DFT into three steps. Fig. 3 shows the signal flow graph (SFG) of the 128-point mixed-radix FFT algorithm. In Fig. 3, the 128-point FFT is divided into three stages, where the radix-2 FFT algorithm is used in the first stage, and the radix-8 algorithm is applied in the second and third stages. Each black point in Fig. 3 indicates that a twiddle factor is multiplied at that point, and each butterfly of the radix-8 algorithm is further decomposed into three steps. The twiddle factors $W_8^1$ and $W_8^3$ at the first step are trivial complex multiplications, because they can be written as $\frac{\sqrt{2}}{2}(1 - j)$ and $-\frac{\sqrt{2}}{2}(1 + j)$, respectively. Thus, a complex multiplication with one of the two coefficients can be computed using additions and a real multiplication, whose hardware can be realized by six shifters and four adders [9].
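A short sketch shows why such coefficients are cheap. It assumes the two coefficients are $W_8^1 = \frac{\sqrt{2}}{2}(1 - j)$ and $W_8^3 = -\frac{\sqrt{2}}{2}(1 + j)$, the values that arise when an eight-point DFT is split into radix-2 steps. Each product then reduces to one addition, one subtraction, and a scaling by the single real constant $\sqrt{2}/2$. The function names are ours, and the shift-and-add realization of the constant in [9] is not modeled.

```python
import cmath
import math

HALF_SQRT2 = math.sqrt(2) / 2  # the only real constant that is needed

def mul_w8_1(z):
    """z * W_8^1, using (a + jb)(1 - j) = (a + b) + j(b - a)."""
    a, b = z.real, z.imag
    return complex(HALF_SQRT2 * (a + b), HALF_SQRT2 * (b - a))

def mul_w8_3(z):
    """z * W_8^3, using (a + jb)(1 + j) = (a - b) + j(a + b)."""
    a, b = z.real, z.imag
    return complex(-HALF_SQRT2 * (a - b), -HALF_SQRT2 * (a + b))

z = complex(0.3, -0.8)
err1 = abs(mul_w8_1(z) - z * cmath.exp(-2j * cmath.pi * 1 / 8))
err3 = abs(mul_w8_3(z) - z * cmath.exp(-2j * cmath.pi * 3 / 8))
print(err1, err3)
```

A general complex multiplication needs four real multiplications and two additions, so replacing it by a scaled add/subtract pair removes most of the multiplier hardware for these two coefficients.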
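The IFFT of an $N$-point sequence $X(k)$, $k = 0, 1, \ldots, N-1$, is defined as

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}, \qquad n = 0, 1, \ldots, N-1 \tag{10}$$

If we take the complex conjugate of (10) and multiply both sides by $N$, we find

$$N x^{*}(n) = \sum_{k=0}^{N-1} X^{*}(k)\, W_N^{nk} \tag{11}$$

The right-hand side of (11) is recognized to be the FFT of the sequence $X^{*}(k)$ and can be computed using any FFT algorithm [6]. By taking the complex conjugate of (11) and dividing both sides by $N$, the desired output sequence is given by

$$x(n) = \frac{1}{N} \left[ \sum_{k=0}^{N-1} X^{*}(k)\, W_N^{nk} \right]^{*} \tag{12}$$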
According to (12), the IFFT can be performed by taking the complex conjugate of the incoming data first and then the outgoing data without changing any coefficients in the original FFT algorithm so that the hardware implementation can be more efficient.
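The conjugation method can be sketched in a few lines. The forward transform below is a direct DFT standing in for any FFT implementation, and the function names are ours. The point is only that the inverse transform reuses the forward kernel unchanged, with a conjugation on each side and a final division by $N$.

```python
import cmath
import random

def fft(x):
    """Forward transform (direct DFT used as a stand-in for any FFT)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def ifft_via_fft(X):
    """Inverse transform: conjugate input, forward FFT, conjugate output, scale."""
    n = len(X)
    y = fft([v.conjugate() for v in X])
    return [v.conjugate() / n for v in y]

rng = random.Random(1)
x = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(16)]
roundtrip = ifft_via_fft(fft(x))
max_err = max(abs(a - b) for a, b in zip(x, roundtrip))
print(f"round-trip error: {max_err:.2e}")
```

In hardware, conjugation is a sign change of the imaginary part and division by a power of two is a shift, so the IFFT mode adds very little logic to the FFT datapath.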
IV. ARCHITECTURE

A. Proposed FFT Architecture for the UWB System
The block diagram of the proposed 128-point FFT/IFFT processor derived from (4), (8) , (12), and the SFG (Fig. 3) is depicted in Fig. 4 . The proposed MRMDF architecture combining the features of the SDF and MDC architectures consists of Module 1, Module 2, Module 3, conjugate blocks, a division block, and multiplexers. The features of the proposed MRMDF architecture are the following:
• higher throughput rate can be provided by using four parallel data paths;
• the minimum memory is required by using the delay feedback approach to reorder the input data and the intermediate results of each module;
• the 128-point mixed-radix FFT/IFFT algorithm is implemented to save power consumption;
• the number of complex multipliers is minimized by using the scheduling scheme and the specified constant multipliers.
In the MRMDF architecture, the input sequence and the output sequence are in the specified order. The order of the output sequence is the bit reversal of the order of the input sequence, as seen in Fig. 4. The FFT or IFFT operation is selected by the control signal FFT/IFFT, as shown in Fig. 4. When an IFFT is performed in our processor, the sign of the imaginary part of the input sequence is changed and the data are then processed exactly as in the FFT. The sign of the imaginary part of the FFT output data is changed again and the result is divided by 128. Because 128 is a power of two, the division is implemented by shifting the location of the binary point. The function of Module 1 is to implement the radix-2 FFT algorithm, corresponding to the first stage of the SFG, as shown in Fig. 3. Modules 2 and 3 realize the radix-8 FFT algorithm, corresponding to the second and third stages of the SFG, as displayed in Fig. 3. In order to minimize the memory requirement and to ensure the correctness of the FFT output data, two different structures are built in Modules 2 and 3 to implement the radix-8 FFT algorithm. In addition, the hardware complexity of the complex multiplier is also considered in the proposed architecture. In the next few subsections, each module is described in more detail.
1) Module 1:
Module 1 consists of a register file that can store 64 complex data, one butterfly unit (BU), two complex multipliers, two ROMs, and some multiplexers, as shown in Fig. 5. The ROMs store the twiddle factors. Only part of one period of the cosine and sine waveforms is stored in ROM; the rest of the period is reconstructed from the stored values. The BU consists of four BU 2s, each of which performs a complex addition and a complex subtraction on two input data. Because the radix-2 FFT algorithm is adopted in this module, the BU cannot start until both input data $x(n)$ and $x(n+64)$ are available. This corresponds to the first stage of the SFG, as shown in Fig. 3. The four parallel input sequences of Module 1 arrive in the order specified in Fig. 4, such that the two data needed by each BU 2 arrive on the same data path, separated by 16 cycles, when one input datum per path is available per clock cycle. During the first 16 cycles, the first 64 data are stored in the register file. During the next 16 cycles, the eight input data of the BU are received from the register file and from the input, respectively, as shown in Fig. 5. The BU then generates the output data according to the radix-2 FFT algorithm. Meanwhile, four of the output data generated by the BU are fed to Module 2 directly, and the other four output data are stored in the register file. After 32 cycles, these data are read from the register file and multiplied by the twiddle factors simultaneously before they are sent to Module 2. In general, four complex multipliers are needed in the four-parallel approach to implement the radix-2 FFT algorithm, and the utilization rate of the complex multipliers is then only 50%. This paper proposes a new approach to increase the utilization rate and to reduce the number of complex multipliers. The detailed operation is described below.
When the four output data to be stored are generated by the BU, two of them are multiplied by the appropriate twiddle factors before they are written into the register file. After 32 clock cycles, the other two are multiplied as they are read out, before the data are fed to Module 2. By rescheduling the complex multiplications in this way, only two complex multipliers are needed in our approach, as shown in Fig. 5, and the utilization of the complex multipliers reaches 100%.
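The utilization figures can be reproduced with a simple cycle count. The sketch below assumes steady-state streaming in which one 128-point symbol occupies 32 clock cycles on four paths, and in which the BU is active for 16 of those cycles and produces four outputs per active cycle that need a twiddle multiplication. These counts are our reading of the description above, and the names are hypothetical.

```python
CYCLES_PER_SYMBOL = 32          # 128 points / 4 data paths
BU_ACTIVE_CYCLES = 16           # cycles in which the BU produces outputs
PRODUCTS_PER_ACTIVE_CYCLE = 4   # BU outputs per active cycle needing a twiddle

# Total twiddle multiplications per symbol in Module 1.
total_products = BU_ACTIVE_CYCLES * PRODUCTS_PER_ACTIVE_CYCLE

def utilization(num_multipliers):
    """Fraction of multiplier-cycles that perform useful work."""
    return total_products / (num_multipliers * CYCLES_PER_SYMBOL)

# Direct approach: four multipliers, busy only while data leave the register file.
direct = utilization(4)

# Rescheduled approach: two products are computed when the data are written,
# the other two when the data are read, so two multipliers are always busy.
rescheduled = utilization(2)

print(f"products per symbol: {total_products}")
print(f"4 multipliers: {direct:.0%}, 2 multipliers: {rescheduled:.0%}")
```

Splitting the 64 products evenly between the write phase and the read phase is what lets two multipliers cover the whole 32-cycle symbol period.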
2) Module 2:
Module 2 consists of four BU 8 structures and one modified complex multiplier, as shown in Fig. 6. The four BU 8s operate in the same way. The architecture of the BU 8 is directly mapped from the three-step radix-8 FFT algorithm, whose SFG is shown in Fig. 3. The sizes of the three delay elements in the BU 8 are eight, four, and two points, respectively, as shown in Fig. 6. Each delay element stores the input data until the other input data required by the BU 2 operation are received. The output data generated by the BU 2 in the first and second steps are multiplied by a trivial twiddle factor before they are fed to the next step. These twiddle factors can be implemented efficiently, as mentioned in Section III. However, the four output data from the third step of the BU 8 need to be multiplied by nontrivial twiddle factors simultaneously in the modified complex multiplier.
It is inefficient to build four complex multipliers for multiplying different twiddle factors simultaneously. We therefore modified an approach proposed by Maharatna et al. [10] to reduce the complexity of the complex multipliers. The twiddle factors of the modified complex multiplier are $W_{64}^{p}$, each split into its real and imaginary parts, with the exponent $p$ ranging from 0 to 49, as shown in Fig. 7. However, only nine sets of constant values, those with $p$ from 0 to 8 in region A, are needed, because the twiddle factors in the other seven regions can be obtained by using the mapping table, as seen in Table I. In practice, we only need to implement eight sets of constant values in region A, since the first set of constant values (1, 0) is trivial. In addition, these constant values can be realized more efficiently by using several adders and shifters [10]. Table II shows the scheduling of the twiddle factors in each data path after the twiddle factors are mapped to region A. The twiddle factors of the four paths have different values in each time slot, except for time slots 2 and 3. In time slots 2 and 3, a hardware conflict would occur if only one constant multiplier 4 were built. Therefore, an additional constant multiplier 4 is used in our design to avoid spending one more cycle. The block diagram of the modified complex multiplier is shown in Fig. 8. First, the four output sequences from the third step of the BU 8 are separated into real and imaginary parts. The data of each path are fed to the appropriate constant multipliers according to the scheduling of the twiddle factors, as shown in Table II. The entire constant multiplication can thus be implemented with just eight sets of constant values by swapping the real and imaginary parts appropriately and choosing the appropriate sign, following the mapping table (Table I).
This approach saves about 38% of the gate count compared with the four-complex-multiplier approach, while its performance is equivalent to that of the four complex multipliers.
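The region-mapping idea can be illustrated with a small sketch. It assumes the twiddle factors are $W_{64}^{p} = e^{-j2\pi p/64}$ and that the unit circle is split into eight regions of eight steps each, so that only the cosine and sine values for exponents 0 to 8 are stored. The swap-and-sign table below is derived from standard trigonometric symmetries and is a hypothetical stand-in for Table I, whose actual entries are not reproduced here.

```python
import cmath
import math

N = 64
# Stored constants for region A only: (cos, sin) of 2*pi*q/64 for q = 0..8.
REGION_A = [(math.cos(2 * math.pi * q / N), math.sin(2 * math.pi * q / N))
            for q in range(9)]

def twiddle_from_region_a(p):
    """Return W_64^p using only the nine region-A constants."""
    region, r = divmod(p % N, 8)
    # Even regions count up from the region boundary, odd regions count down.
    q = r if region % 2 == 0 else 8 - r
    c, s = REGION_A[q]
    # (cos, sin) of the full angle, obtained by swapping and negating c and s.
    cos_t, sin_t = [
        (c, s), (s, c), (-s, c), (-c, s),
        (-c, -s), (-s, -c), (s, -c), (c, -s),
    ][region]
    # W_64^p = cos(theta) - j*sin(theta)
    return complex(cos_t, -sin_t)

max_err = max(abs(twiddle_from_region_a(p) - cmath.exp(-2j * cmath.pi * p / N))
              for p in range(50))
print(f"max error over p = 0..49: {max_err:.2e}")
```

Because every exponent maps to one of nine stored pairs, four paths can share a small bank of constant multipliers as long as no two paths need the same constant in the same time slot.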
3) Module 3:
The radix-8 FFT algorithm is realized in Module 3. The detailed block diagram of Module 3 is shown in Fig. 9. The structure of Module 3 is different from that of Module 2, because the two data required by the BU 2 in the second and third steps are in different data paths. Thus, a suitable structure is needed to ensure the correctness of the FFT output data. Some output data, generated by the BU 2 in the first and second steps, are multiplied by the nontrivial twiddle factors before they are fed to the next step.
[Table I: Mapping table used to determine the values of the twiddle factors in different regions]
B. Comparison
In general, the performance and hardware cost of the pipelined FFT architecture are increased by using the multiple data-path approach. Thus, the multipath-based architecture usually provides higher throughput rate with higher hardware cost if the parallel input data can be supported in this approach.
The proposed MRMDF architecture hardware costs in terms of 128-point FFT are as follows:
• number of registers: 124;
• complex multipliers: $2 + 0.62 \times 4$, where the complexity of the modified complex multiplier is only 62% of that of four complex multipliers;
• complex adders: 48.
Table III compares the hardware requirement, FFT algorithm, and throughput rate of several classical approaches and the proposed approach for the 128-point FFT. The known MDC architectures, such as R4MDC [6] and the architecture proposed by Jung et al. [11], are not suitable for the 128-point FFT in UWB applications, because the FFT size in their approaches is limited to a power of 4. To enable these two architectures to process the 128-point FFT, we modify both by adding the proposed Module 1 to them. In addition, the throughput rate of the traditional MDC architecture is raised by increasing the utilization of the butterfly units; this can be done by appropriately reordering the parallel input data in the input buffer before the data are loaded into the FFT processor. Consequently, the revised R4MDC and Jung's architectures can be compared with ours, as seen in Table III. It should be emphasized that the input buffer, whose size is usually proportional to the number of data paths, is needed in all FFT processors listed in Table III, except for our proposed MRMDF architecture and the R2 SDF architecture. By combining the features of the R2 SDF and the R4MDC approaches, the proposed FFT architecture not only implements the radix-8 FFT algorithm in a 128-point FFT to reduce the number of complex multiplications but also provides four times the throughput rate of the R2 SDF scheme, as listed in Table III. In addition, the numbers of registers (excluding the input buffer) and complex multipliers used in our scheme are only 38.9% and 44.8%, respectively, of those in the SRMDC architecture [7]. Although the number of complex adders in our design is greater than that in the others, the cost of complex adders is much less than that of registers and complex multipliers.
V. SIMULATION
The appropriate word length in an FFT/IFFT processor is determined by fixed-point simulation before hardware implementation. Fig. 10 shows the simulation results for the relation of SNR to the internal word length of the FFT/IFFT processor. In Fig. 10, IN SNR means the ratio of the signal to the AWGN coming from the channel at the FFT input; OUT SNR means the ratio of the signal to the AWGN plus the quantization noise, generated by finite word length and truncation in the FFT/IFFT processor, at the FFT output. As seen in the figure, with IN SNR fixed, the OUT SNR increases with increasing word length as the quantization noise is suppressed, but it saturates when the OUT SNR equals the IN SNR. This means that, once the OUT SNR reaches saturation, the quantization noise can be almost ignored. Based on the simulation results, we set the word length of the proposed FFT/IFFT to 10 bits in both the real and imaginary parts. The chosen word length keeps the quantization noise negligible while minimizing the hardware complexity. With the chosen word length of the proposed FFT/IFFT processor, system performance is evaluated on the complete UWB system platform. We consider different data rates (110, 160, 200, 320, and 480 Mb/s) with the corresponding code rates in the AWGN channel. The packet error rate (PER) plotted against SNR is shown in Fig. 11. The figure shows that the proposed FFT processor provides satisfactory coded receiver performance (PER below 0.08) in the AWGN channel.
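The kind of word-length sweep described above can be sketched as follows. This is a simplified illustration and not the simulation platform used for Fig. 10: it uses a plain radix-2 FFT with a scaling of 1/2 per stage, rounds every intermediate value to a given number of fractional bits, omits the channel AWGN, and reports only the signal-to-quantization-noise ratio. All names and parameters are ours.

```python
import cmath
import math
import random

def make_quantizer(frac_bits):
    """Return a function that rounds real and imaginary parts to frac_bits."""
    scale = 1 << frac_bits
    def q(z):
        return complex(round(z.real * scale) / scale,
                       round(z.imag * scale) / scale)
    return q

def fft_scaled(x, q):
    """Radix-2 FFT with 1/2 scaling per stage; q rounds each intermediate."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_scaled(x[0::2], q)
    odd = fft_scaled(x[1::2], q)
    out = [0j] * n
    for k in range(n // 2):
        w = q(cmath.exp(-2j * cmath.pi * k / n))   # quantized twiddle factor
        t = q(w * odd[k])
        out[k] = q((even[k] + t) / 2)
        out[k + n // 2] = q((even[k] - t) / 2)
    return out

def output_snr_db(frac_bits, n=128, seed=1):
    """Signal-to-quantization-noise ratio at the FFT output, in dB."""
    rng = random.Random(seed)
    x = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    ref = fft_scaled(x, lambda z: z)               # floating-point reference
    q = make_quantizer(frac_bits)
    got = fft_scaled([q(v) for v in x], q)
    signal = sum(abs(r) ** 2 for r in ref)
    noise = sum(abs(r - g) ** 2 for r, g in zip(ref, got))
    return 10 * math.log10(signal / noise)

for bits in (8, 10, 12, 16):
    print(f"{bits:2d} fractional bits: {output_snr_db(bits):6.1f} dB")
```

With channel noise added at the input, a curve of this kind would flatten once the quantization noise falls below the channel noise, which is the saturation behavior described for Fig. 10.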
VI. CHIP IMPLEMENTATION
After the appropriate word length of the proposed FFT/IFFT processor was chosen, the architecture of the processor was modeled in Verilog and functionally verified using the Verilog-XL simulator. The output data from the Verilog model agreed with the output data of the FFT/IFFT in our UWB platform, which is coded in MATLAB. The test chip of the 128-point FFT/IFFT processor is fabricated in a 0.18-μm one-poly six-metal (1P6M) CMOS process. The core size including the test module is 1.76 × 1.76 mm², as shown in Fig. 12. The test module, consisting of 3.072-kb SRAM, is used to save 24 chip pins. Input data are stored serially in the test module from the chip input pins before the operation of the processor. The test module provides four complex data in parallel to the FFT/IFFT processor core when the processor begins to work. The 140-pin chip is packaged in a 144-pin CQFP package, where 78 pins are signal pins and the others are power pins. Table IV lists the chip summary and measurement results. The highest throughput rate of our proposed architecture is 1 Gsample/s with a power dissipation of 175 mW at 250 MHz. According to the specifications of the UWB system, the required throughput rate of the FFT/IFFT is 409.6 Msample/s. At the working frequency of 110 MHz, the power consumption of the FFT/IFFT processor is only 77.6 mW for 480 Mb/s.
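The measured figures can be related to one another with a short calculation. The sketch assumes that each of the four data paths processes one sample per clock cycle, which is consistent with 1 Gsample/s at 250 MHz. The energy-per-sample values are derived here for illustration and are not quoted in the measurement results.

```python
DATA_PATHS = 4
MAX_CLOCK_MHZ = 250.0          # highest measured clock frequency
WORKING_CLOCK_MHZ = 110.0      # clock used for the UWB operating point
UWB_THROUGHPUT_MSPS = 409.6    # throughput required by the UWB specification
POWER_AT_MAX_MW = 175.0
POWER_AT_UWB_MW = 77.6

# Assumption: one sample per path per clock cycle.
peak_throughput_msps = DATA_PATHS * MAX_CLOCK_MHZ
working_throughput_msps = DATA_PATHS * WORKING_CLOCK_MHZ

# Minimum clock frequency that still meets the UWB throughput requirement.
min_clock_mhz = UWB_THROUGHPUT_MSPS / DATA_PATHS

# mW divided by Msample/s gives nJ per sample; multiply by 1e3 for pJ.
energy_peak_pj = POWER_AT_MAX_MW / peak_throughput_msps * 1e3
energy_uwb_pj = POWER_AT_UWB_MW / UWB_THROUGHPUT_MSPS * 1e3

print(f"peak throughput   : {peak_throughput_msps:.0f} Msample/s")
print(f"minimum UWB clock : {min_clock_mhz:.1f} MHz")
print(f"capacity @110 MHz : {working_throughput_msps:.0f} Msample/s")
print(f"energy per sample : {energy_peak_pj:.0f} pJ (peak), {energy_uwb_pj:.0f} pJ (UWB)")
```

Under this assumption, a 110-MHz clock leaves a small margin above the 102.4 MHz needed to sustain 409.6 Msample/s on four paths.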
VII. CONCLUSION
In this paper, a novel 128-point FFT/IFFT processor for OFDM-based UWB systems has been proposed. In the proposed MRMDF architecture, a high throughput rate is achieved by using four data paths. Furthermore, the hardware costs of memory and complex multipliers are reduced by adopting the delay feedback and data scheduling approaches. In addition, the number of complex multiplications is reduced effectively by using a higher radix algorithm. The test chip has been designed, fabricated, and tested in a 0.18-μm CMOS process. The measurement results show that the throughput rate of this test chip is up to 1 Gsample/s while it dissipates 175 mW. When the throughput rate of our processor meets the UWB standard, it consumes only 77.6 mW.
