In this paper, we present a new fast Fourier transform (FFT) algorithm that reduces the size of the twiddle factor tables required in pipelined FFT processing. The proposed algorithm halves the table size compared to the radix-2² algorithm while retaining the simple structure. In addition, a new dynamic data scaling approach is presented to reduce hardware complexity without degrading the signal-to-quantization-noise ratio (SQNR). To verify the proposed algorithm, a 2048-point pipelined FFT processor is designed using a 0.18 μm CMOS process. By combining the proposed algorithm and the radix-2² algorithm, the table size is reduced to 35% and 53% compared to the radix-2 and radix-2² algorithms, respectively. The FFT processor occupies 1.95 mm² and, using the proposed dynamic data scaling, achieves an SQNR of more than 55 dB without progressively increasing the internal wordlength.
INTRODUCTION
The fast Fourier transform (FFT) is a major signal processing block widely used in communication systems, especially in orthogonal frequency division multiplexing (OFDM) systems such as digital video broadcasting, digital subscriber line, and WiMAX (IEEE 802.16). Because such systems require large-point FFT computation for multiple carrier modulation, usually more than 1024 points, it is desirable to reduce both computational complexity and hardware complexity.
To reduce the computational complexity, various FFT algorithms have been proposed, such as the radix-2² and radix-2³ algorithms in addition to the radix-2 and radix-4 algorithms [1, 2]. Although these algorithms reduce computational hardware resources such as multipliers and adders, they do not seriously take into account the number of twiddle factors to be stored in tables. In the implementation of a large-point FFT processor, however, the tables become large enough to account for significant area and power consumption [3]. In this paper, a new FFT algorithm is proposed to overcome the problem of the large table requirement. It not only reduces the table size by a factor of two compared to the radix-2² algorithm but also retains the simple structure of the radix-2 algorithm. Since the additional computations incurred by the proposed algorithm can be implemented with a few adders, the overall computational complexity is almost the same as that of the radix-2² algorithm.
When a fixed-point representation is employed to implement an FFT processor, the wordlength has a significant influence on accuracy and dynamic range. Although a long wordlength is required to achieve a high signal-to-quantization-noise ratio (SQNR), it results in large hardware complexity, because the word sizes of the memories and of computational units such as complex multipliers and complex adders must grow in proportion to the wordlength [4]. An efficient dynamic data scaling technique is also presented in this paper to lower the hardware complexity without degrading SQNR.
PROPOSED FFT ALGORITHM
The proposed algorithm can be derived by applying the Cooley and Tukey radix-2 decimation-in-frequency (DIF) decomposition twice [5]. The N-point discrete Fourier transform (DFT) of a sequence x(n) is defined as
$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad 0 \le k < N, \tag{1}$$
where x(n) and X(k) are complex numbers. The twiddle factor is defined as follows.
$$W_N^{nk} = e^{-j 2\pi nk / N} \tag{2}$$
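The DFT definition and the radix-2 DIF decomposition that the derivation starts from can be checked numerically. The following is a minimal sketch in Python, not the proposed algorithm itself; the function names `dft` and `dif_fft` are illustrative and do not appear in the paper.

```python
import cmath

def dft(x):
    """Direct N-point DFT: X[k] = sum_n x[n] * W_N^(n*k), W_N = exp(-2j*pi/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def dif_fft(x):
    """Recursive radix-2 DIF FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return list(x)
    half = N // 2
    # Butterfly: the sums feed the even-indexed outputs, the twiddled
    # differences feed the odd-indexed outputs.
    top = [x[n] + x[n + half] for n in range(half)]
    bot = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
           for n in range(half)]
    X = [0j] * N
    X[0::2] = dif_fft(top)
    X[1::2] = dif_fft(bot)
    return X
```

Applying the DIF step twice in a row, as the recursion does for its first two levels, is the starting point from which the radix-2² and the proposed decompositions are derived.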
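The two decompositions can be expressed by replacing n and k with the three-dimensional linear index maps shown below.
$$n = \frac{N}{2} n_1 + \frac{N}{4} n_2 + n_3, \qquad k = k_1 + 2 k_2 + 4 k_3, \tag{3}$$
where n₁, n₂, k₁, k₂ ∈ {0, 1} and n₃, k₃ ∈ {0, 1, ..., N/4 − 1}.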
Using the above index maps, Eq. (1) can be rewritten as
where B(·) represents the following butterfly structure.
The main idea of the proposed algorithm is to take the value of n₃ into account in the summation over n₂. For even n₃ (= 2m), the sum is arranged as follows. If n₃ is odd (= 2m + 1), the sum becomes
where m is an integer between 0 and N/8 − 1. Substituting Eqs. (6) and (7) into Eq. (4), we obtain the following expression.
In Eq. (8), the expression of H(·) also depends on the value of n₃. For even n₃, H(·) is expressed as
If n₃ is odd, then H(·) is arranged as follows.
As k₁ is either 0 or 1, Eq. (9) indicates that the butterfly has a trivial multiplication by −j at the input side if n₃ is even, and Eq. (10) implies that an additional constant multiplication by W … Figure 1 illustrates the signal flow graphs of the 16-point FFT corresponding to the radix-2² algorithm and the proposed algorithm.
DYNAMIC DATA SCALING
As a butterfly contains an adder and a subtractor, its result needs one additional bit to avoid overflow, which increases the hardware complexity of the memories and computational units. The simplest way to avoid growth of the internal wordlength is to scale the output value of each stage down to half. If all the internal wordlengths are set to the input wordlength, however, the resulting SQNR is very low because of severe information loss. To achieve an SQNR sufficient to meet the standard specification, this approach therefore needs an internal wordlength much longer than the input wordlength, which increases the overall hardware complexity significantly.
Another data scaling approach is to scale the internal data dynamically. One such approach is the block floating point (BFP) method [6]. When a pipelined architecture is used, however, the BFP method is not suitable because of the huge latency needed to normalize all outputs of a stage. Instead, a method called convergent block floating point (CBFP) has been proposed for pipelined architectures [7]. As shown in Figure 2, the CBFP method also suffers from large memory overhead and increased latency caused by the intermediate buffer, as well as from complex normalization. Furthermore, the intermediate buffer in the CBFP logic has to store full-precision values because the normalization can be performed only after the scaling factor is known [8]. Although a data scaling method that needs no additional buffers or latency has been proposed, it still requires complex normalization, implemented with a number of compare and shift units connected in series at the output of each stage [9].
The proposed data scaling technique is based on the observation that there is no need to scale down an internal value if no overflow occurs in computing it. Even if overflow occurs, scaling down to half, which can be achieved by a simple 1-bit right shift, is all that is needed to accommodate the overflow. Based on this observation, we present an efficient dynamic scaling technique. The main idea is to conditionally scale down the output value of a complex multiplication if overflow occurs in the computation, and to tag this information on the internal word. We examine the output value of the complex multiply unit to check whether it can be represented in n bits, as shown in Figure 3(a). The overflow can be easily detected by performing an exclusive-OR operation on the two most significant bits (MSBs), as shown in Figure 3(b).
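The overflow check and conditional 1-bit shift described above can be sketched as follows. This is a hedged illustration only: it assumes the multiplier output is held as an (n + 1)-bit two's-complement word, so that comparing the two MSBs tells whether the value fits in n bits. The function names `fits_in_n_bits` and `conditional_scale` are hypothetical, and the exact bit widths of Figure 3 are not reproduced here.

```python
def fits_in_n_bits(value, n):
    """True if a signed value held in an (n+1)-bit word also fits in n bits.

    Assumption: value lies in the (n+1)-bit two's-complement range.
    The value fits in n bits exactly when the two MSBs are equal,
    i.e. when their XOR is 0.
    """
    word = value & ((1 << (n + 1)) - 1)   # (n+1)-bit two's-complement pattern
    msb = (word >> n) & 1
    next_msb = (word >> (n - 1)) & 1
    return (msb ^ next_msb) == 0

def conditional_scale(re, im, n):
    """Halve both parts if either one overflows n bits.

    Returns (re, im, tag_increment), where tag_increment is 1 if a
    scaling was applied and 0 otherwise.
    """
    overflow = not fits_in_n_bits(re, n) or not fits_in_n_bits(im, n)
    if overflow:
        return re >> 1, im >> 1, 1        # arithmetic 1-bit right shift
    return re, im, 0
```

For example, with n = 12 the pair (3000, −100) overflows in its real part, so both parts are halved and the tag increment is 1, whereas (100, −100) passes through unchanged.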
If overflow occurs in either the real value or the imaginary value, both the real value and the imaginary value are scaled down to half, which leads to lower hardware complexity. The internal word format of the proposed dynamic scaling method is shown in Figure 4(a). The data field of the internal word represents the scaled data value, and the tag field indicates how many times scaling has been applied, starting from the original input values, to generate the corresponding data. If the proposed data scaling method is applied up to the L-th stage, at most log₂ L bits are enough for the tag field.
In general, the two data words participating in a butterfly computation have different tag values. As shown in Figures 4(b) and 4(c), the difference of the two tag values is calculated first, and then the data word with the smaller scale is shifted by the difference so that the scales of the two data words become equal. The tag value of the output word of a butterfly computation is initially set to the larger tag of the two input words. After the complex multiplication is completed, the output tag value is increased by one if overflow is detected.
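The tag alignment step before a butterfly can be sketched as below. This is an illustration under stated assumptions, not the circuit of Figure 4: a data word is modelled as an `(re, im)` pair of Python integers, a tag counts the halvings already applied to that word, and the word with the smaller tag is assumed to be the one that is shifted right. The function names are hypothetical, and the wordlength limit and the overflow check after the complex multiplication are omitted.

```python
def align_to_common_scale(a, tag_a, b, tag_b):
    """Shift the less-scaled word right so both words share the larger tag.

    a and b are (re, im) integer pairs; tag_a and tag_b count how many
    times each word has already been halved.
    """
    diff = abs(tag_a - tag_b)
    if tag_a < tag_b:
        a = (a[0] >> diff, a[1] >> diff)
    elif tag_b < tag_a:
        b = (b[0] >> diff, b[1] >> diff)
    return a, b, max(tag_a, tag_b)

def tagged_butterfly(a, tag_a, b, tag_b):
    """Align the two inputs, then form the butterfly sum and difference.

    The output tag starts as the larger of the two input tags.
    """
    a, b, tag = align_to_common_scale(a, tag_a, b, tag_b)
    upper = (a[0] + b[0], a[1] + b[1])
    lower = (a[0] - b[0], a[1] - b[1])
    return upper, lower, tag
```

For instance, `tagged_butterfly((40, -8), 1, (100, 20), 0)` first halves the second word to (50, 10) and then returns the sum (90, 2), the difference (−10, −18), and the output tag 1.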
At the final pipeline stage of the N-point FFT, the outputs generally have different tag values because each value experiences a different number of scalings. To obtain appropriate precision, each output is scaled up by the amount indicated by its tag value. As the proposed conditional scaling technique keeps the internal wordlength short, it leads to lower hardware complexity without severe information loss.
PROPOSED 2048-POINT PIPELINED FFT
In pipelined DIF FFT processing, the twiddle factor table is largest at the first stage and is reduced by a factor of two at each successive stage. Reducing the table sizes at the first several stages can be significant because the original tables are large enough to pay off the additional constant multipliers. At the later stages, however, the reduction is not considerable. At each pipeline stage, we therefore have to decide whether to apply the proposed algorithm, considering both the cost of the additional constant multiplier and the table size that the proposed algorithm can save. When the cost of the additional constant multiplier is not compensated by the table reduction at a certain stage, the radix-2² algorithm should be applied from that stage to the last stage.
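The halving of the table size from stage to stage can be illustrated with a small count. The sketch below assumes a plain radix-2 DIF pipeline in which stage s stores N/2^(s+1) twiddle factors and no symmetry is exploited. It is not the entry count used in Table II, and the function name `radix2_table_sizes` is illustrative.

```python
def radix2_table_sizes(n_points):
    """Twiddle entries per stage of a radix-2 DIF pipeline, no symmetry used.

    Stage 0 needs n_points/2 entries and each later stage needs half as
    many; the single entry of the last stage is the trivial factor W^0 = 1.
    """
    sizes = []
    entries = n_points // 2
    while entries >= 1:
        sizes.append(entries)
        entries //= 2
    return sizes

sizes = radix2_table_sizes(2048)
# Fraction of all entries held by the first four stages.
first_four_share = sum(sizes[:4]) / sum(sizes)
```

Under these assumptions the first four stages hold 1920 of 2047 entries, about 94% of the total, which is why table reductions pay off mainly at the early stages.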
By combining the proposed algorithm with the radix-2² algorithm, we designed a 2048-point pipelined FFT processor whose single-path delay feedback (SDF) structure is shown in Figure 5 [1]. By applying the proposed algorithm to the first four stages, the overall table size can be reduced to almost half of that of a structure using only the radix-2² algorithm. In this case, two constant multipliers are required to compute nontrivial multiplications by W … is not increased notably but the costs of multipliers and tables are increased significantly [4]. The complexity of a constant multiplier depends on the number of non-zero bits in the binary representation of the constant. To minimize the number of non-zero bits, the constants are expressed in the minimal signed digit (MSD) representation, as shown in Table I. Because the non-zero bits in the sine and cosine values are sparse, the constant multipliers can be implemented with a few adders. By employing these two simple constant multipliers, we can halve the sizes of the two largest tables. As these two tables take more than 75% of the total table size required in the radix-2² algorithm, the reduction plays a significant role in lowering the overall complexity of the 2048-point FFT processor.
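The effect of a signed-digit representation on the number of non-zero digits can be sketched as follows. The code computes the non-adjacent form (also called canonical signed digit form), which is one representation with the minimum number of non-zero digits. The constant used, cos(π/4) quantized to 12 fractional bits, is only an assumed example and is not taken from Table I; the function name `csd_digits` is illustrative.

```python
import math

def csd_digits(value):
    """Signed digits in {-1, 0, +1}, LSB first, in non-adjacent form.

    value must be a non-negative integer. Each non-zero digit costs one
    adder or subtractor in a shift-and-add constant multiplier.
    """
    digits = []
    while value != 0:
        if value & 1:
            d = 2 - (value & 3)   # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits

# Assumed example constant: cos(pi/4) with 12 fractional bits.
constant = round(math.cos(math.pi / 4) * (1 << 12))
digits = csd_digits(constant)
nonzero_signed = sum(1 for d in digits if d != 0)
nonzero_binary = bin(constant).count("1")
```

As a small check of the idea, 7 is 111 in binary (three non-zero bits) but 8 − 1 in signed digits (two non-zero digits), so a multiplication by 7 needs one subtractor instead of two adders.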
IMPLEMENTATIONS
We should the format of Sensor Letters. The hardware complexities of the proposed algorithm and the previous algorithms are compared in Table II for the case of a 2048-point FFT. The required table size indicates the total number of entries of the ROM tables. The π/2 symmetry of the twiddle factors is considered in counting the table size. The table size can be reduced further if the π/4 symmetry is employed. As the reduction ratio is independent of which symmetry is used, the reduction ratio shown in Table II also applies to the case of π/4 symmetry. As indicated in Table II, the proposed algorithm needs the smallest table size among the compared algorithms, and the overhead is just two constant multipliers, which can be implemented with a few adders.
Assuming that the input is represented in 12 bits, we compare the four scaling schemes shown in Table III. Table IV shows the internal wordlength configurations and the SQNR performance resulting from the scaling methods. If no scaling is used, the wordlength is increased progressively, one bit per stage, to avoid overflow. In contrast, both the proposed dynamic scaling technique and the scaling-to-half method keep the internal wordlengths constant. As denoted in Table IV, the proposed dynamic scaling technique (case III) improves SQNR by about 30 dB compared to the scaling-to-half method (case I), although the two configurations have similar hardware complexity. If the first stage is allowed to extend by one bit, as in case IV, we can obtain an SQNR of more than 55 dB using the proposed dynamic scaling technique. To achieve an SQNR of more than 55 dB by progressively increasing the internal wordlength, the wordlength has to be lengthened to 19 bits, leading to huge hardware complexity at the later stages, as in case II.
Compared to the CBFP method, which requires additional buffers to store a group of values to be normalized, the proposed dynamic scaling method requires less memory as well as less computational delay. Table V shows the memory sizes required to process a 2048-point FFT. The memory requirement indicates the total size of the FIFO memories and intermediate buffers. The memory overhead resulting from the CBFP method is enormous, as full-precision values have to be stored in the intermediate buffers and the size of the buffers is comparable with that of the FIFO memories. In addition, the latency is increased considerably by the intermediate buffers.
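For readers who want to reproduce SQNR comparisons of this kind, the usual definition can be sketched as below. The paper does not describe its measurement procedure, so this is only the standard ratio of signal power to quantization-noise power in decibels; the function name `sqnr_db` and its arguments are assumptions.

```python
import math

def sqnr_db(reference, quantized):
    """SQNR in dB between a reference output and a fixed-point output.

    reference: outputs of a high-precision (e.g. floating-point) FFT.
    quantized: outputs of the fixed-point FFT, rescaled to the same scale.
    Both may contain real or complex numbers. The quantized output must
    differ from the reference, otherwise the noise power is zero.
    """
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - q) ** 2 for r, q in zip(reference, quantized))
    return 10.0 * math.log10(signal / noise)
```

With outputs tagged by the proposed scheme, each fixed-point value would first be scaled up according to its tag before being compared with the reference.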
We designed a 2048-point pipelined FFT processor using a 0.18 μm four-metal CMOS process. The internal wordlengths are configured as indicated in case IV in Table III. The proposed FFT processor occupies 1.95 mm², and the gate count is 75,809 excluding memories and ROMs. The FIFO buffers are implemented using RAMs, and small RAMs and ROMs are replaced with registers and logic circuitry, respectively.
CONCLUSIONS
We have proposed a new FFT algorithm to reduce the size of twiddle factor tables and an efficient dynamic scaling method to lower the overall hardware complexity in the implementation of large-point pipelined FFT processors. By applying the proposed FFT algorithm to the first several stages, the table size required in pipelined FFT processing is reduced by approximately half compared to the radix-2² algorithm, at the cost of a few simple constant multipliers. Since the constant multipliers can be implemented with a few adders, the proposed algorithm is efficient in large-point FFT computation, especially in terms of area and power consumption. Based on the proposed FFT algorithm, we designed a 2048-point pipelined FFT processor that reduces the total size of the twiddle factor tables to 35% and 53% compared to the radix-2 and radix-2² algorithms, respectively. In addition, the proposed dynamic scaling technique enables the processor to achieve an SQNR of more than 55 dB without progressively increasing the internal wordlength.
