A Low-Complexity 128-Point Mixed-Radix FFT Processor for MB-OFDM UWB Systems
Sang-In Cho and Kyu-Min Kang
I. Introduction
Ultra-wideband (UWB) systems supporting data rates from tens to hundreds of Mb/s are well suited to short-range wireless communications because they can share the frequency band with existing narrowband systems [1]-[3]. One of the candidate schemes for the high-speed UWB physical layer (PHY) is multiband orthogonal frequency-division multiplexing (MB-OFDM). One OFDM symbol in the MB-OFDM UWB system consists of 128 subcarriers and 37 zero samples. The 128 subcarriers comprise 100 data subcarriers, 12 pilot subcarriers, 10 guard subcarriers, and 6 null subcarriers. Therefore, the fast Fourier transform (FFT) processor of the MB-OFDM UWB system performs a 128-point FFT operation, where the sampling frequency is 528 MHz and the subcarrier frequency spacing is 4.125 MHz. Although the FFT period is 242.42 ns, the 128-point FFT operation may take up to 312.5 ns because a length-37 zero-padded suffix (70.08 ns) is appended to each OFDM symbol [3].
Many FFT architectures have been developed over the last three decades. Recently, several parallel data-path pipelined FFT processors for UWB applications have been developed [4]-[9]. A 128-point mixed-radix FFT algorithm with a four-data-path approach, including radix-2 and radix-2^3 FFT algorithms, was presented in [4] to reduce the number of complex multiplications. When the 128-point FFT algorithm is broken into three successive FFT algorithms, that is, one radix-2 FFT algorithm and two radix-2^3 FFT algorithms, the hardware cost of the complex multipliers in the mixed-radix multipath delay feedback (MRMDF) FFT processor is only 44.8% of that in a split-radix multipath delay commutator (SRMDC) FFT processor [4], [10]. By modifying the approach proposed by K. Maharatna and others in [11], Y.W. Lin and others in [4] efficiently realized nontrivial complex multipliers, at the fourth stage among seven stages for the 128-point FFT operation, with nine hard-wired constant units. Chakraborty and others proposed a hardware-efficient complex constant multiplier (CCM) structure in [7]. Although alternative FFT architectures for UWB applications have also been discussed in [8] and [9], the hardware cost is still high because several nontrivial complex multiplications are needed at two stages of the 128-point FFT operation.
To further reduce the hardware complexity and power consumption, Cho and others recently presented a four-parallel data-path 128-point mixed-radix decimation-in-frequency (DIF) FFT processor operating at over 132 MHz in [5]. In the proposed FFT processor, nontrivial complex multiplication operations are needed only at the fourth stage because the 128-point FFT algorithm is broken up into two FFT algorithms, namely, radix-2^4 FFT and radix-2^3 FFT algorithms. Because a relatively large number of constant multipliers are required to implement twiddle factors (TFs) at the end of each stage in a conventional radix-2^4 FFT architecture [12], a modified radix-2^4 FFT structure without constant multipliers at the third stage is presented. However, the proposed FFT architecture was not fully analyzed in [5], and there were also mistakes in the figures of [5]. In this paper, we present the mathematical formulation and analysis of the proposed 128-point mixed-radix FFT algorithm. Detailed characteristics of the proposed FFT processor are also analyzed. The amended figures of the signal flow graph, butterfly units (BUs), and CCMs of the proposed FFT processor are given. We compare the hardware complexity of the proposed FFT processor with that of several existing 128-point FFT architectures with four parallel data paths. Multiplication units using a substructure-sharing scheme are additionally suggested to efficiently implement the constant coefficient multipliers with shift operations and additions [13], [14].
The organization of this paper is as follows. The mathematical formulations of the 128-point mixed-radix FFT algorithm are given in section II. In section III, we describe the proposed FFT architecture with four parallel data paths. The hardware complexity of the proposed FFT architecture is compared with that of the existing 128-point FFT architectures for MB-OFDM UWB systems in section IV. Conclusions are given in section V.
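The symbol timing quoted at the start of this section follows directly from the sampling frequency and the sample counts. The short Python sketch below reproduces those figures as a consistency check; the variable names are illustrative and are not taken from the MB-OFDM specification.

```python
# Symbol timing of one MB-OFDM UWB OFDM symbol, using the figures quoted in the text.
SAMPLING_FREQUENCY_HZ = 528e6  # sampling frequency
FFT_SIZE = 128                 # subcarriers per OFDM symbol
ZERO_PADDED_SUFFIX = 37        # zero samples appended to each symbol

subcarrier_spacing_hz = SAMPLING_FREQUENCY_HZ / FFT_SIZE
fft_period_ns = FFT_SIZE / SAMPLING_FREQUENCY_HZ * 1e9
suffix_ns = ZERO_PADDED_SUFFIX / SAMPLING_FREQUENCY_HZ * 1e9
symbol_period_ns = (FFT_SIZE + ZERO_PADDED_SUFFIX) / SAMPLING_FREQUENCY_HZ * 1e9

print(f"subcarrier spacing: {subcarrier_spacing_hz / 1e6:.3f} MHz")
print(f"FFT period:         {fft_period_ns:.2f} ns")
print(f"suffix duration:    {suffix_ns:.2f} ns")
print(f"symbol period:      {symbol_period_ns:.1f} ns")
```

The symbol period is the time budget available to the FFT processor for each 128-point transform.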
II. 128-Point Mixed-Radix FFT Algorithm
Given a length-N complex input sequence x(n), its discrete Fourier transform (DFT) can be described as
$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{1}$$
where
$$W_N^{nk} = e^{-j 2\pi nk/N}$$
is the TF, k is a frequency index, and n is a time index. As reported in many works [4]-[12], a hardware-efficient mixed-radix FFT algorithm should be employed to reduce the number of complex multiplications because 128 is not a power of 4 or 8. In this section, we present a modified radix-2^4 
Using (2) and (3), (1) 
After some straightforward calculation, we have the fourth butterfly unit as 
where the third butterfly unit $H_{N/8}(n)$, the second butterfly unit $H_{N/4}(n)$, and the first butterfly unit $H_{N/2}(n)$ are obtained by 
In the conventional radix-2^4 FFT architecture [12], a relatively large number of multipliers are needed to implement the TFs, 
where the third butterfly unit 
Note that $\lfloor \cdot \rfloor$ is the floor function, which returns the largest integer less than or equal to its argument.
Radix-2^3 FFT Algorithm
In this subsection, we further decompose the radix-8 butterfly into three stages by adopting a radix-2^3 FFT algorithm. 
and
Using (11)- (13), (4) 
where 5 6 / 64 / 64 5 6 /32 /32 4 2
We break up the 128-point DFT into a 16-point DFT and an 8-point DFT, where the 16-point and 8-point DFTs are implemented by applying the modified radix-2^4 FFT algorithm and the radix-2^3 FFT algorithm, respectively.
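The 16-by-8 split can be checked numerically. The Python sketch below is a generic Cooley-Tukey decomposition, not the modified radix-2^4 and radix-2^3 butterfly structure derived in this section. It assumes the index mapping n = 8*n1 + n2 and k = k1 + 16*k2, under which the DFT kernel factors as $W_{128}^{nk} = W_{16}^{n_1 k_1}\, W_{128}^{n_2 k_1}\, W_{8}^{n_2 k_2}$. The function names are hypothetical.

```python
import cmath


def dft(x):
    """Direct DFT from (1): X(k) = sum_n x(n) * W_N^(nk), W_N = exp(-j*2*pi/N)."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points) for n in range(n_points))
        for k in range(n_points)
    ]


def dft_16_by_8(x):
    """128-point DFT as 16-point DFTs, inter-stage twiddles, then 8-point DFTs.

    Assumed index mapping: n = 8*n1 + n2 and k = k1 + 16*k2.
    """
    n1_size, n2_size = 16, 8
    total = n1_size * n2_size
    assert len(x) == total
    # Step 1: for each n2, a 16-point DFT over n1 of the sequence x(8*n1 + n2).
    inner = [dft([x[n2_size * n1 + n2] for n1 in range(n1_size)]) for n2 in range(n2_size)]
    # Step 2: multiply by the inter-stage twiddle factors W_128^(n2*k1).
    for n2 in range(n2_size):
        for k1 in range(n1_size):
            inner[n2][k1] *= cmath.exp(-2j * cmath.pi * n2 * k1 / total)
    # Step 3: for each k1, an 8-point DFT over n2 yields X(k1 + 16*k2).
    out = [0j] * total
    for k1 in range(n1_size):
        column = dft([inner[n2][k1] for n2 in range(n2_size)])
        for k2 in range(n2_size):
            out[k1 + n1_size * k2] = column[k2]
    return out


test_signal = [complex(n % 7, -(n % 5)) for n in range(128)]
max_difference = max(abs(a - b) for a, b in zip(dft(test_signal), dft_16_by_8(test_signal)))
print(f"max difference from the direct DFT: {max_difference:.2e}")
```

In this generic form, the only multiplications by general $W_{128}$ factors sit between the 16-point and 8-point parts, which is consistent with the text placing the nontrivial complex multiplications at a single stage.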
Note that the inverse FFT (IFFT) of a length-N complex sequence x(n) can be obtained by
The IFFT can be performed by taking the complex conjugate of the input data, applying the original FFT algorithm with its coefficients unchanged, and then taking the complex conjugate of the output data [4].
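The conjugation approach rests on the standard identity $x(n) = \frac{1}{N}\left[\sum_{k=0}^{N-1} X^{*}(k)\, W_N^{nk}\right]^{*}$. The Python sketch below checks it with a direct DFT standing in for the FFT hardware. The 1/N scaling is applied explicitly here, which is an assumption of the sketch, since the text does not say where the processor applies it.

```python
import cmath


def dft(x):
    """Forward DFT with the same coefficients W_N^(nk) as in (1)."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points) for n in range(n_points))
        for k in range(n_points)
    ]


def idft_via_forward_dft(spectrum):
    """Inverse DFT using only the forward transform plus two conjugations."""
    n_points = len(spectrum)
    conjugated_input = [value.conjugate() for value in spectrum]
    transformed = dft(conjugated_input)
    # Conjugate the output and apply the 1/N scaling.
    return [value.conjugate() / n_points for value in transformed]


signal = [complex(n, -2 * n + 1) for n in range(8)]
recovered = idft_via_forward_dft(dft(signal))
round_trip_error = max(abs(a - b) for a, b in zip(signal, recovered))
print(f"round-trip error: {round_trip_error:.2e}")
```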
III. Four-Parallel Data-Path FFT Architecture
Because the sampling rate of the analog-to-digital (A/D) converter is 528 MHz in the MB-OFDM UWB system, it is not easy to design a receiver structure with a single data path using current CMOS process technologies. A four-parallel data-path receiver structure including an FFT block and a Viterbi decoder can be considered to limit the system clock of the baseband modem core to a maximum of 132 MHz for practical VLSI implementation [15], [16]. In this paper, we propose a hardware-efficient 128-point mixed-radix FFT architecture with four data paths to meet the high-speed requirements. The signal flow graph of the proposed four-parallel data-path 128-point FFT processor is shown in Fig. 1, where the input sequence is broken into four parallel data streams. The order of the four parallel input sequences of the proposed FFT processor is x(4m), x(4m+1), x(4m+2), and x(4m+3), where m = 0, 1, ..., 31. The radix-2 butterfly unit is simplified as shown in Fig. 2. Figure 3 shows a block diagram of the proposed four-parallel data-path 128-point FFT processor. The proposed FFT architecture consists of butterfly units (BU1, BU2, and BU3), complex constant multipliers (CCM1, CCM2, and CCM3), complex Booth multipliers (CBMs), and registers [5], [17]. As discussed in section II, the proposed FFT architecture is based on both the modified radix-2^4 and the radix-2^3 DIF FFT algorithms in order to reduce the number of multipliers. The proposed FFT architecture requires multipliers in only three stages, namely, stages 2, 4, and 6. The other stages, which perform multiplication by -j, can be implemented without an actual multiplier by swapping the real and imaginary parts and taking the 2's complement of the original real part (see Fig. 4(b)).
Fig. 1. Signal flow graph of the proposed four-parallel data-path 128-point mixed-radix FFT processor.
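Two points from the paragraph above can be illustrated with a minimal Python sketch: the ordering of the four parallel input streams, and multiplication by -j without a multiplier. The function names are hypothetical. In hardware the negation would be a 2's complement operation, whereas here it is plain integer negation.

```python
def split_into_four_streams(samples):
    """Stream i carries x(4m + i) for m = 0, 1, ..., len(samples) // 4 - 1."""
    return [samples[i::4] for i in range(4)]


def multiply_by_minus_j(real, imag):
    """(real + j*imag) * (-j) = imag - j*real: a swap and one negation."""
    return imag, -real


streams = split_into_four_streams(list(range(128)))
print(streams[1][:3])             # indices of the first three samples on the second data path
print(multiply_by_minus_j(3, 5))  # compare with (3 + 5j) * (-1j)
```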
Butterfly Units
The proposed FFT architecture employs three kinds of butterfly units […] CCM1 is composed of six real multipliers, three 2's complement logics, two real adders, and ten multiplexers. In Fig. 5(a), when the twiddle factor is $\pm W_{16}^{1}$ or […] Fig. 5(c) is equivalent to the output of the CCM2 multiplied by -j. As discussed in [4], CCM2 or CCM3 with a 10-bit word length can be implemented by using ten real adders and two multiplexers. In CCM1, the six real multipliers can also be implemented using 24 real adders and shift operations. Accordingly, CCM1 can be implemented using 26 real adders and 10 multiplexers. The CCM1 architecture is approximately three times more complex than the CCM2 or CCM3 architecture.
In many FFT processors, multipliers are implemented so that the bit width of the multiplication output remains the same as that of the input. Accordingly, a round-off error may occur when the bit width of the multiplication output is shortened. A fixed-width modified Booth multiplier in [17] and a fixed-width canonic signed digit multiplier in [18] use error compensation bias schemes to efficiently compensate for the round-off error. Note that the CBM employed in stage 4 of the proposed FFT architecture is composed of two Booth encoders, four partial product generators, several adders, and a read-only memory (ROM), as detailed in [6] and [17].
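The effect of shortening the product width can be seen in a small numerical experiment. The Python sketch below is not the compensation scheme of [17] or [18]. It assumes a 6-bit signed word with 5 fractional bits and compares plain truncation against the simplest possible constant bias (half an output LSB added before the low bits are dropped).

```python
FRACTION_BITS = 5  # assumed word format: sign bit plus 5 fractional bits


def truncate_product(a, b):
    """Fixed-width product: drop the low FRACTION_BITS bits (floor)."""
    return (a * b) >> FRACTION_BITS


def biased_product(a, b):
    """Add a constant half-LSB bias before dropping the low bits."""
    return (a * b + (1 << (FRACTION_BITS - 1))) >> FRACTION_BITS


def mean_error(multiply):
    """Mean error, in output LSBs, over all pairs of signed input words."""
    values = range(-(1 << FRACTION_BITS), 1 << FRACTION_BITS)
    errors = [
        multiply(a, b) - (a * b) / (1 << FRACTION_BITS)
        for a in values
        for b in values
    ]
    return sum(errors) / len(errors)


truncation_bias = mean_error(truncate_product)
compensated_bias = mean_error(biased_product)
print(f"mean error with truncation:    {truncation_bias:+.4f} LSB")
print(f"mean error with constant bias: {compensated_bias:+.4f} LSB")
```

Truncation always rounds toward negative infinity, so its mean error is negative, and the added bias moves the mean error closer to zero.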
Substructure-Sharing Multiplication Units
Because six real multipliers are needed to implement CCM1 as shown in Fig. 5(a), the hardware complexity of CCM1 is rather high. In this paper, we propose an enhanced CCM1 with two substructure-sharing multiplication units (SMUs), shown in Fig. 6, to reduce the hardware complexity of CCM1. The SMU of Fig. 6(a) is utilized for the multiplication operations of a real input value and three constant coefficients, $\cos(\pi/8)$, […]. These three multiplication operations can be performed by simply using six additions and eight shift operations, as shown in Fig. 6(b), if the proposed FFT processor is implemented with a 10-bit word length. Figure 7 shows an SMU for the enhanced CCM2 and CCM3. In a 10-bit word length implementation, by employing the SMU scheme, CCM2 or CCM3 can be designed using only eight adders and two multiplexers. As such, the hardware complexity of CCM1, CCM2, and CCM3 can be significantly reduced using the proposed multiplierless multiplication units with the substructure-sharing scheme.
2) Power consumption is estimated by Synopsys' Power Compiler.
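The idea of sharing partial sums among several constant multiplications can be sketched in Python. This is not the SMU of Fig. 6: the decomposition below is an illustration that uses seven additions or subtractions rather than six. It assumes that the three constants are cos(π/8), sin(π/8), and cos(π/4), which are the values needed to form $W_{16}^{1}$, $W_{16}^{2}$, and $W_{16}^{3}$, and that they are quantized to multiples of 2^-9.

```python
import math

SCALE_BITS = 9  # assumed coefficient precision: multiples of 2**-9


def shared_constant_products(x):
    """Approximate (x*cos(pi/8), x*sin(pi/8), x*cos(pi/4)) with shifts and adds only."""
    shared = (x << 5) + (x << 4) + x                  # 49*x, reused twice below
    sin8_term = shared << 2                           # 196*x ~ 512*sin(pi/8)*x
    cos4_term = (shared + (x << 7) + (x << 2)) << 1   # 362*x ~ 512*cos(pi/4)*x
    cos8_term = (x << 9) - (x << 5) - (x << 3) + x    # 473*x ~ 512*cos(pi/8)*x
    # The right shift removes the 2**9 scaling and rounds toward negative infinity.
    return cos8_term >> SCALE_BITS, sin8_term >> SCALE_BITS, cos4_term >> SCALE_BITS


exact_coefficients = (math.cos(math.pi / 8), math.sin(math.pi / 8), math.cos(math.pi / 4))
worst_error = max(
    abs(approx - x * exact)
    for x in range(-512, 512)  # all 10-bit signed inputs
    for approx, exact in zip(shared_constant_products(x), exact_coefficients)
)
print(f"worst-case error over 10-bit inputs: {worst_error:.3f} LSB")
```

The partial sum 49*x is computed once and reused for two of the three products, which is the substructure-sharing principle in miniature.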
IV. Implementation Results
We determined the internal word length of the proposed FFT processor using a fixed-point simulation with MATLAB before hardware implementation. After the word length of the proposed FFT processor was chosen, the FFT architecture was modeled in Verilog HDL and functionally verified using a ModelSim simulator. Then, the FFT architecture was synthesized with the appropriate time and area constraints using the Synopsys Design Compiler. Note that the FFT processor was implemented and tested using Samsung 0.18 µm CMOS technology and a standard cell library. Table 2 compares the implementation results of the proposed FFT processor for three internal word lengths. The signal-to-quantization noise ratio (SQNR) of the proposed FFT processor is about 24 dB when the word length is 8 bits and about 47 dB when the word length is 12 bits. The hardware cost and power consumption of the proposed FFT processor increase as the internal word length increases, whereas the operating clock speed of the FFT processor decreases, as shown in Table 2. Implementation results indicate that the proposed FFT processor with a 10-bit internal word length can support a data processing rate of 1 Gsample/s with a power dissipation of 112 mW at 250 MHz. Note that the throughput rate of the MRMDF FFT processor in [4] is up to 1 Gsample/s, and it consumes 175 mW. The power consumption of the proposed FFT processor is approximately 36% lower than that of the MRMDF FFT processor. Figure 8 shows the output signal-to-noise ratio (SNR) for a fixed input SNR with various internal word lengths in the proposed FFT architecture. When the word length is 10 bits or more, the output SNR nearly saturates, and the quantization noise can therefore be neglected. Based on the simulation results, the proposed FFT processor is implemented with a 10-bit internal word length.
Table 3 compares the hardware complexity of the proposed FFT processor and the existing 128-point four-parallel data-path FFT architectures. Because the proposed FFT processor employs modified radix-2^4 and radix-2^3 FFT architectures, nontrivial multiplication operations are needed only at stage 4. In the proposed FFT architecture, four nontrivial complex multipliers at stage 4 are implemented with the CBMs presented in [17] at 60% of the hardware cost of conventional complex variable multipliers [12], [19]. In addition, the hardware complexities of the CCM1s at stage 2 and of CCM2 (or CCM3) at stage 6 are reduced by about 34% and 18%, respectively, by employing the proposed SMU architectures, as compared to those of conventional CCMs. Note that the trivial multiplication operations of the proposed FFT processor can be performed with approximately 53% of the hardware cost of the conventional radix-2^4 FFT processor in [12]. The proposed FFT processor reduces the hardware complexity of complex multipliers by about 31% as compared to the MRMDF FFT processor in [4].
1) The nontrivial multiplier is the conventional complex variable multiplier [12], [19].
2) In Table 3, the number of trivial multipliers is counted as the number of the complex constant multipliers for the twiddle factor $W_{8}^{1}$ or $W_{8}^{3}$, which is realized by shifters and adders in the existing FFT processors [4], [11].
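The throughput and power figures quoted in this section can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers stated in the text, and the variable names are illustrative.

```python
PARALLEL_DATA_PATHS = 4
CLOCK_HZ = 250e6                # reported operating clock, 10-bit word length
ADC_SAMPLING_RATE_HZ = 528e6    # A/D converter sampling rate
PROPOSED_POWER_MW = 112.0       # proposed FFT processor
MRMDF_POWER_MW = 175.0          # MRMDF FFT processor of [4]

# Four samples are processed per clock cycle.
throughput_samples_per_s = PARALLEL_DATA_PATHS * CLOCK_HZ
# Minimum clock at which four data paths keep up with the A/D converter.
required_clock_hz = ADC_SAMPLING_RATE_HZ / PARALLEL_DATA_PATHS
power_saving_percent = (MRMDF_POWER_MW - PROPOSED_POWER_MW) / MRMDF_POWER_MW * 100

print(f"throughput:     {throughput_samples_per_s / 1e9:.1f} Gsample/s")
print(f"required clock: {required_clock_hz / 1e6:.0f} MHz")
print(f"power saving:   {power_saving_percent:.0f}%")
```

The reported 250 MHz clock therefore leaves margin over the 132 MHz needed to sustain the 528 MHz sampling rate with four data paths.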
Table 3 indicates that the proposed 128-point mixed-radix FFT processor is a hardware-efficient structure and is therefore suitable for high-speed UWB applications. Figure 9 shows the floor plan of an MB-OFDM UWB system-on-a-chip (SoC) including the proposed low-complexity 128-point mixed-radix FFT processor. The implemented MB-OFDM UWB SoC consists of several modules, namely, a medium access control (MAC), a PHY, an analog front-end (AFE), a central processing unit (CPU), and memory blocks. In our implementation, the 128-point FFT/IFFT block occupies about 5.1% of the silicon area of the PHY module.
V. Conclusion
In this paper, we have proposed a hardware-efficient 128-point mixed-radix DIF FFT processor with four data paths for MB-OFDM UWB systems. We have derived a mixed-radix FFT algorithm composed of modified radix-2^4 FFT and radix-2^3 FFT algorithms. By employing the mixed-radix FFT algorithm in the proposed FFT architecture, we have significantly reduced the number of both CCMs and CBMs. In addition, the hardware complexity of the proposed CCMs for trivial multiplications has been reduced by approximately 32% when compared to that of the existing CCM structures by adopting multiplication units using a substructure-sharing scheme. Implementation results have shown that the proposed mixed-radix FFT processor with a 10-bit internal word length and four parallel data paths can support a data processing rate of up to 1.0 Gsample/s with a power dissipation of 112 mW at 250 MHz using 0.18 µm CMOS technology.
