This paper presents a novel high-speed, low-complexity two-parallel 128-point radix-2^4 FFT/IFFT processor for MB-OFDM ultrawideband (UWB) systems. The proposed FFT architecture provides a high throughput rate with low hardware complexity by combining a two-parallel data-path scheme with a single-path delay-feedback (SDF) structure. The radix-2^4 FFT algorithm is also realized in our processor to reduce the number of complex multiplications. The proposed FFT/IFFT processor has been designed and implemented in 0.18-μm CMOS technology with a supply voltage of 1.8 V. The proposed two-parallel FFT/IFFT processor has a throughput rate of up to 900 Msample/s at 450 MHz while requiring much lower hardware complexity and low power consumption.
key words: Fast Fourier transform (FFT), radix-2^4, SDF, multiband orthogonal frequency-division multiplexing (MB-OFDM), ultrawideband (UWB)
Introduction
Ultrawideband (UWB) communication systems, which deliver data at rates ranging from 110 Mb/s at a distance of 10 m to 480 Mb/s at a distance of 2 m in a realistic multipath environment, are well suited to short-range wireless communications because they can share a frequency band with existing narrowband systems and offer a higher data rate than 802.11 or Bluetooth [1], [2]. One of the communication methods proposed for the IEEE 802.15.3a standard is Multiband Orthogonal Frequency Division Multiplexing (MB-OFDM), which offers a 528 MHz bandwidth [3], [4]. The MB-OFDM UWB system has been proposed to minimize power consumption and to support multiple simultaneously operating piconets while satisfying Federal Communications Commission regulations. This system transmits OFDM symbols using a different carrier frequency from symbol to symbol according to time-frequency codes [3], [4]. MB-OFDM-based UWB not only provides reliable high-data-rate transmission in time-dispersive or frequency-selective channels without complex time-domain channel equalizers but also offers high spectral efficiency.
The FFT/IFFT processor is one of the modules with high computational complexity in the physical layer of the MB-OFDM UWB system. The present paper proposes a high-speed, low-complexity FFT/IFFT processor with a novel two data-path parallel pipelined architecture for high- 4 FFT/IFFT algorithm. Section 4 provides details of the proposed FFT/IFFT architecture. In Sect. 5, the hardware cost and throughput rate of the proposed FFT/IFFT architecture are compared with those of the existing 128-point FFT/IFFT architecture for MB-OFDM UWB applications. Conclusions are presented in Sect. 6.
Design Issue of the FFT Processor for the MB-OFDM UWB Systems
A block diagram of the proposed physical layer of the MB-OFDM UWB system is shown in Fig. 1 [3], [4]. In the MB-OFDM UWB system, the data rate ranges from 53.3 to 480 Mb/s with code rates of 1/3, 11/32, 1/2, 5/8, and 3/4. To implement the physical layer more efficiently, a four-data-path approach has been adopted to reduce the data sampling rate from the analog-to-digital converter such that, after the serial-to-parallel converter, the data sampling rate of each path is reduced to 132 Msamples/s [5], [6]. However, the hardware cost also increases significantly, because more memory and complex multipliers are needed to process multiple data simultaneously. Various FFT architectures, such as the multi-path delay commutator and the single-path delay feedback (SDF) in the radix-2 and radix-4 algorithms, have been proposed over the past three decades [6], [7]. These architectures have distinctive advantages and some common requirements, as described in [6], [7]. The main motivation of this paper is to design a novel two-parallel data-path pipelined radix-2^4 SDF FFT/IFFT architecture that offers high throughput and low hardware complexity. The proposed FFT processor is appropriate for the MB-OFDM UWB physical layer and provides sufficient throughput to meet the MB-OFDM UWB specifications. The approach is described in more detail in Sects. 3-4.
Radix-2^4 Algorithm
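A Discrete Fourier Transform (DFT) of length N is defined as follows:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1, \tag{1}$$

where $W_N = e^{-j2\pi/N}$, the so-called "twiddle factor," denotes the N-th primitive root of unity, with its exponent evaluated modulo N; k is the frequency index and n is the time index. In order to derive the radix-2^4 algorithm, consider the first four steps of decomposition. Applying a 5-dimensional linear index map

$$\begin{aligned}
n &= \left\langle \tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + \tfrac{N}{8} n_3 + \tfrac{N}{16} n_4 + n_5 \right\rangle_N, \\
k &= \left\langle k_1 + 2k_2 + 4k_3 + 8k_4 + 16k_5 \right\rangle_N,
\end{aligned} \tag{2}$$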
The Common Factor Algorithm (CFA) takes the form of
The twiddle factors can be expressed in the form of
The fourth butterfly unit has the expression of
where H(n) denotes the second butterfly unit, and where B(n, k_1) denotes the first butterfly unit as follows.
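For orientation, the factorization that gives the radix-2^4 algorithm its stage structure can be written out explicitly. The block below is a sketch that assumes the conventional binary index map, $n = \tfrac{N}{2}n_1 + \tfrac{N}{4}n_2 + \tfrac{N}{8}n_3 + \tfrac{N}{16}n_4 + n_5$ and $k = k_1 + 2k_2 + 4k_3 + 8k_4 + 16k_5$ with $n_1,\ldots,n_4,k_1,\ldots,k_4 \in \{0,1\}$; the grouping of terms is one common choice and its notation may differ from the numbered equations of this section.

```latex
\begin{aligned}
W_N^{nk}
&= W_N^{\left(\frac{N}{2}n_1+\frac{N}{4}n_2+\frac{N}{8}n_3+\frac{N}{16}n_4+n_5\right)
        \left(k_1+2k_2+4k_3+8k_4+16k_5\right)} \\
&= \underbrace{(-1)^{n_1 k_1}}_{\text{1st butterfly}}
   \,(-j)^{n_2 k_1}\,
   \underbrace{(-1)^{n_2 k_2}}_{\text{2nd butterfly}}
   \,W_{16}^{(2n_3+n_4)(k_1+2k_2)}\,
   \underbrace{(-1)^{n_3 k_3}}_{\text{3rd butterfly}}
   \,(-j)^{n_4 k_3}\,
   \underbrace{(-1)^{n_4 k_4}}_{\text{4th butterfly}}
   \,W_N^{n_5(k_1+2k_2+4k_3+8k_4)}\,
   W_{N/16}^{n_5 k_5}
\end{aligned}
```

Read from left to right, the $(-1)$ factors are absorbed into the butterfly additions and subtractions, the $(-j)$ factors are trivial multiplications, $W_{16}$ is the constant multiplier, $W_N^{n_5(\cdot)}$ requires a full complex multiplier, and $W_{N/16}^{n_5 k_5}$ is the kernel of the remaining DFT of length N/16 (an 8-point DFT when N = 128).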
The twiddle factor $W_{16}^{n}$ in Eq. (5) involves only four complex constants, so the algorithm can use a complex constant multiplier instead of a programmable complex multiplier. Hence, the complex multiplication by the twiddle factors $W_{16}^{n}$ can be implemented with a Canonic Signed Digit (CSD) constant multiplier, whose coefficients contain the fewest nonzero digits. As such, the area and power consumption can be reduced [8]. Figure 2 shows the signal flow graph (SFG) of the 128-point radix-2^4 SDF (R2^4SDF) FFT algorithm.
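The claim that the constant-multiplier stage needs only a handful of values can be checked numerically. The following is a minimal sketch, assuming the conventional binary index map n = (N/2)n1 + (N/4)n2 + (N/8)n3 + (N/16)n4 + n5 and k = k1 + 2k2 + 4k3 + 8k4 + 16k5; the function names are hypothetical and the grouping of factors is one common form, not necessarily the exact form of the numbered equations.

```python
import cmath
import itertools
import math

N = 128  # transform length used in this paper


def w(base, exponent):
    """Twiddle factor W_base^exponent = exp(-j*2*pi*exponent/base)."""
    return cmath.exp(-2j * math.pi * exponent / base)


def factored_twiddle(n1, n2, n3, n4, n5, k1, k2, k3, k4, k5):
    """Product of the per-stage factors of the assumed radix-2^4 decomposition."""
    sign = (-1) ** (n1 * k1 + n2 * k2 + n3 * k3 + n4 * k4)  # absorbed by butterflies
    trivial = (-1j) ** (n2 * k1 + n4 * k3)                  # multiplications by -j
    constant = w(16, (2 * n3 + n4) * (k1 + 2 * k2))         # W_16 constant multiplier
    full = w(N, n5 * (k1 + 2 * k2 + 4 * k3 + 8 * k4))        # programmable multiplier
    rest = w(N // 16, n5 * k5)                              # remaining N/16-point DFT
    return sign * trivial * constant * full * rest


def max_decomposition_error():
    """Largest |W_N^(nk) - factored form| over every (n, k) pair."""
    worst = 0.0
    for n1, n2, n3, n4, k1, k2, k3, k4 in itertools.product((0, 1), repeat=8):
        for n5 in range(N // 16):
            for k5 in range(N // 16):
                n = (N // 2) * n1 + (N // 4) * n2 + (N // 8) * n3 + (N // 16) * n4 + n5
                k = k1 + 2 * k2 + 4 * k3 + 8 * k4 + 16 * k5
                direct = w(N, (n * k) % N)
                factored = factored_twiddle(n1, n2, n3, n4, n5, k1, k2, k3, k4, k5)
                worst = max(worst, abs(direct - factored))
    return worst


def w16_exponents():
    """Distinct exponents of W_16 produced by the constant-multiplier stage."""
    return sorted({(2 * n3 + n4) * (k1 + 2 * k2)
                   for n3, n4, k1, k2 in itertools.product((0, 1), repeat=4)})


if __name__ == "__main__":
    print("max error:", max_decomposition_error())
    print("W_16 exponents:", w16_exponents())
```

Under these assumptions the exponents are 0, 1, 2, 3, 4, 6, and 9. Exponent 0 is unity, exponent 4 is the trivial factor -j, and exponent 9 is the negative of exponent 1, which leaves the four constants with exponents 1, 2, 3, and 6 for the constant multiplier.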
Proposed Radix-2^4 FFT Architecture
A block diagram of the proposed two-parallel data-path 128-point R2^4SDF FFT/IFFT processor is shown in Fig. 3. The proposed architecture consists of a memory block, butterfly Fig. 3 is a 64-point butterfly unit because it must wait for the even-time input data. The other BF units need only half as many points, because the odd-time and even-time input data are processed separately.

Two complex Booth multipliers are needed in the two-parallel approach to implement the radix-2^4 FFT algorithm. Figure 6 shows the complex Booth multiplier, which needs a ROM to store the multiplicand. Since only 1/8 of a period of the cosine and sine waveforms is needed, 16 twiddle factors, which is 1/8 of the 128 points, are stored in the ROM [9]. Thus, the ROM stores 32 bytes (16×8×2 bits) of twiddle coefficients. To reduce the truncation error, a fixed-width Dadda multiplier with an error compensation method was used: for the complex Booth multiplier, the Dadda reduction network with an error-compensation circuit [10], [11] was adopted in the proposed FFT processor, as shown in Fig. 6(b). The values in the first table of Fig. 6(b) are obtained from the Partial Product Generator (PPG), in which the sign-bit pre-calculation vector value is "10101010101011." The signals y" in the 6th column, which are the outputs of the error-compensation circuit, are the inversions of the zero signals from the Booth encoder. For the rounding operation, '1' must be added in the 6th column. The horizontal segments of the second table represent the carry and sum outputs from a full adder or half adder. In other words, each crosswise segment represents a (2, 2) or (3, 2) counter.

Table 1: The CSD binary representation of the twiddle factors (1̄ = −1).

The two-parallel data-path architecture based on the radix-2^4 FFT algorithm requires fewer full multipliers than architectures based on lower-radix FFT algorithms. For example, the radix-2^4 algorithm has the same number of multipliers as the radix-2^2 algorithm but reduces the multiplicative complexity by replacing half of the full complex multipliers with trivial constant multipliers [7].
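The one-eighth-period ROM storage described above for the complex Booth multiplier can be illustrated with a short sketch. It assumes floating-point table entries rather than the 8-bit words of the design, and it keeps the endpoint k = 16 for simplicity, so the hypothetical tables below hold 17 entries instead of the 16 quoted in the text; `twiddle_from_rom` is an illustrative name, not part of the proposed hardware.

```python
import cmath
import math

N = 128
OCTANT = N // 8   # 16 steps cover angles from 0 to pi/4
QUARTER = N // 4  # 32 steps cover angles from 0 to pi/2

# Hypothetical ROM contents: cosine and sine for k = 0..16 only.
COS_ROM = [math.cos(2 * math.pi * k / N) for k in range(OCTANT + 1)]
SIN_ROM = [math.sin(2 * math.pi * k / N) for k in range(OCTANT + 1)]


def twiddle_from_rom(k):
    """Rebuild W_N^k = cos(2*pi*k/N) - j*sin(2*pi*k/N) from the octant tables."""
    k %= N
    quadrant, m = divmod(k, QUARTER)
    if m <= OCTANT:
        # Angle lies in [0, pi/4]: read the tables directly.
        c, s = COS_ROM[m], SIN_ROM[m]
    else:
        # Angle lies in (pi/4, pi/2): cos(x) = sin(pi/2 - x), sin(x) = cos(pi/2 - x).
        c, s = SIN_ROM[QUARTER - m], COS_ROM[QUARTER - m]
    # Rotate by quadrant * pi/2 using sign changes and swaps only.
    if quadrant == 0:
        cos_t, sin_t = c, s
    elif quadrant == 1:
        cos_t, sin_t = -s, c
    elif quadrant == 2:
        cos_t, sin_t = -c, -s
    else:
        cos_t, sin_t = s, -c
    return complex(cos_t, -sin_t)


if __name__ == "__main__":
    worst = max(abs(twiddle_from_rom(k) - cmath.exp(-2j * math.pi * k / N))
                for k in range(N))
    print("largest reconstruction error:", worst)
```

Only swaps and sign changes are needed outside the table, which is why storing a single octant suffices. With 16 entries of two 8-bit words each, the storage amounts to 16×8×2 = 256 bits, i.e. the 32 bytes stated above.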
The twiddle factors W(8), W(16), W(24), and W(48) are formed from the trigonometric constants cos(π/8), sin(π/8), and cos(π/4). Table 1 shows these twiddle factors as 8-bit coefficients in decimal, 2's complement, and CSD formats. The radix-2^4 algorithm can use a complex constant multiplier instead of a programmable complex multiplier, and because the CSD representation contains the fewest nonzero digits, the area and power consumption can be reduced [8]. Figure 7 shows the structure of the CSD complex constant multipliers for cos(π/8), sin(π/8), and cos(π/4). To compensate efficiently for the quantization error, the truncated bits are divided into two groups (a major group and a minor group) according to their effect on the quantization error. The error compensation bias is first expressed in terms of the truncated bits in the major group. The effects of the remaining truncated bits in the minor group are then handled by a probabilistic estimation [8]. The circuit for the total compensation bias, C, is shown in Fig. 7.

Table 2 shows the scheduling of the twiddle factors in each data path; because of the two-parallel data-path operation, the twiddle-factor multiplications are split between the first and second data paths. The CSD complex constant multiplier block consists of six CSD constant multipliers, 2's complement logic, and multiplexers, as shown in Fig. 8. The real and imaginary parts of the output data are produced by the six CSD constant multipliers. When the real and imaginary parts of the twiddle factor are the same, two CSD constant multipliers are used and their two outputs are added to generate the output of the CSD complex multiplier. Otherwise, when the real and imaginary parts differ, four CSD constant multipliers are used to multiply the input by the twiddle factor. If an input does not need to be multiplied by a twiddle factor (marked 'X' in Table 2), the output is taken directly from the input.
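The CSD recoding behind these constant multipliers can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of Table 1: the 8-bit coefficients are assumed to use one sign bit and seven fractional bits, and `to_csd`, `csd_multiply`, and `quantize` are hypothetical helper names.

```python
import math

FRACTION_BITS = 7  # assumption: 8-bit coefficient = sign bit + 7 fractional bits


def quantize(coefficient):
    """Round a real coefficient to an integer with FRACTION_BITS fractional bits."""
    return round(coefficient * (1 << FRACTION_BITS))


def to_csd(value):
    """Return the CSD digits of an integer, least significant digit first.

    Each digit is -1, 0, or +1 and no two adjacent digits are nonzero.
    """
    digits = []
    while value != 0:
        if value % 2:
            digit = 2 - (value % 4)  # +1 if value = 1 (mod 4), -1 if value = 3 (mod 4)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value //= 2
    return digits


def csd_multiply(x, digits):
    """Multiply integer x by the CSD-encoded constant using shifts and adds only."""
    acc = 0
    for shift, digit in enumerate(digits):
        if digit == 1:
            acc += x << shift
        elif digit == -1:
            acc -= x << shift
    return acc


if __name__ == "__main__":
    constants = (("cos(pi/8)", math.cos(math.pi / 8)),
                 ("sin(pi/8)", math.sin(math.pi / 8)),
                 ("cos(pi/4)", math.cos(math.pi / 4)))
    for name, value in constants:
        q = quantize(value)
        digits = to_csd(q)
        text = "".join("+0-"[1 - d] for d in reversed(digits))  # most significant first
        nonzero = sum(1 for d in digits if d)
        print(name, q, format(q, "b"), text, "nonzero digits:", nonzero)
```

Each nonzero CSD digit costs one adder or subtractor in a shift-and-add multiplier, so a coefficient with fewer nonzero digits maps to less hardware; this is the saving that the CSD constant multipliers exploit.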
Implementation and Performance Evaluation
The appropriate word length of the proposed 128-point two-parallel pipelined R2^4SDF FFT/IFFT processor is determined by a fixed-point simulation before hardware implementation. Figure 9 shows the simulated relation between the SNR and the internal word length of the FFT/IFFT processor. A detailed explanation is given in our previous paper [12]. Based on the simulation results, we set the word length of the proposed FFT/IFFT processor to 10 bits for both the real and imaginary parts. The SQNR of the proposed 128-point R2^4SDF FFT/IFFT processor is then about 33 dB. After the word length was chosen, the FFT/IFFT processor was implemented using a standard-cell-based design methodology and the 0.18-μm MagnaChip Components library plus full-custom memory and register file blocks.
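The kind of fixed-point simulation used to select a word length can be sketched as follows. This is not the simulation of [12] or the model behind Fig. 9: it substitutes a plain radix-2 decimation-in-frequency FFT with a 1/2 scaling per stage for the R2^4SDF data path, rounds every butterfly output to the chosen word length, ignores saturation, and keeps the twiddle factors in floating point. All names are hypothetical, and the resulting figures should not be expected to match the 33 dB reported above.

```python
import cmath
import math
import random


def make_quantizer(word_length):
    """Round real and imaginary parts to word_length bits (1 sign + fraction bits)."""
    scale = 1 << (word_length - 1)

    def quant(v):
        return complex(round(v.real * scale) / scale, round(v.imag * scale) / scale)

    return quant


def fft_dif(x, quant):
    """Radix-2 DIF FFT with 1/2 scaling per stage; output is in bit-reversed order."""
    data = [quant(v) for v in x]
    n = len(data)
    span = n // 2
    while span >= 1:
        for start in range(0, n, 2 * span):
            for i in range(span):
                a = data[start + i]
                b = data[start + i + span]
                tw = cmath.exp(-2j * math.pi * i / (2 * span))
                data[start + i] = quant((a + b) / 2)
                data[start + i + span] = quant((a - b) / 2 * tw)
        span //= 2
    return data


def sqnr_db(word_length, n=128, seed=1):
    """SQNR of the quantized FFT against the same FFT in floating point."""
    rng = random.Random(seed)
    x = [complex(rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)) for _ in range(n)]
    reference = fft_dif(x, lambda v: v)
    fixed = fft_dif(x, make_quantizer(word_length))
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - f) ** 2 for r, f in zip(reference, fixed))
    return 10 * math.log10(signal / noise)


if __name__ == "__main__":
    for bits in range(8, 17, 2):
        print(bits, "bits:", round(sqnr_db(bits), 1), "dB")
```

Sweeping the word length in this way gives an SQNR-versus-word-length curve from which the shortest acceptable word length can be read off.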
The performance and hardware cost of a pipelined FFT/IFFT processor both increase when a multiple data-path approach is used. Conventional FFT architectures have generally used a four-parallel data-path approach [5], [6], which incurs a high hardware cost. In contrast, the proposed two-parallel data-path pipelined R2^4SDF FFT/IFFT processor provides a higher throughput rate at a higher clock frequency while significantly reducing the hardware cost. Table 3 compares the performance of the proposed two-parallel R2^4SDF FFT/IFFT processor with that of the existing 128-point FFT/IFFT processor [6]. The two-parallel R2^4SDF FFT/IFFT processor consists of 70,000 gates excluding memories, and the operating clock frequency is about 450 MHz. Although the number of registers in our design is greater than in the previous four-parallel architecture, they are implemented as two 190×10-bit RAMs, which require little area. The design also significantly reduces the number of complex multiplications and complex additions and reaches the highest clock frequency, 450 MHz, owing to the two-parallel data path and the pipelined complex Booth multiplier. The peak throughput rate of the proposed architecture is 900 Msample/s at 450 MHz.
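The throughput figures follow from simple arithmetic, sketched below. The sketch assumes that each data path accepts one complex sample per clock cycle and that the required rate equals four paths at 132 Msamples/s, as described for the four-path approach; the function and constant names are hypothetical.

```python
def throughput_msps(clock_mhz, parallel_paths):
    """Throughput in Msample/s when every path takes one sample per clock cycle."""
    return clock_mhz * parallel_paths


# Assumption: required rate = 4 paths x 132 Msamples/s from the four-path approach.
REQUIRED_MSPS = throughput_msps(132, 4)

proposed_msps = throughput_msps(450, 2)        # two paths at 450 MHz
min_two_path_clock_mhz = REQUIRED_MSPS / 2     # slowest clock that still keeps up
ram_bits = 2 * 190 * 10                        # two 190x10-bit RAMs

if __name__ == "__main__":
    print("required:", REQUIRED_MSPS, "Msample/s")
    print("proposed:", proposed_msps, "Msample/s")
    print("minimum two-path clock:", min_two_path_clock_mhz, "MHz")
    print("register RAM:", ram_bits, "bits")
```

Under these assumptions, the two-path design at 450 MHz delivers 900 Msample/s against a requirement of 528 Msample/s, so a two-path data path clocked at 264 MHz or above already meets the required rate.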
Conclusion
In this paper, a novel two-parallel data-path pipelined 128-point radix-2^4 SDF FFT/IFFT processor for an MB-OFDM UWB system has been proposed. The proposed architecture achieves high-speed data processing and low hardware complexity through its two-parallel data-path structure and high clock speed. Furthermore, the number of complex Booth multipliers is effectively reduced by using the radix-2^4 SDF FFT algorithm. The performance results show that the data processing rate is as high as 900 Msamples/s at 450 MHz with low hardware complexity. The proposed architecture is expected to be incorporated in high-speed, low-complexity MB-OFDM UWB systems.
