In this paper, we present a new fast Fourier transform (FFT) algorithm to reduce the table size of twiddle factors required in pipelined FFT processing. In long-point FFT processing, the table is large enough to account for significant area and power consumption. The proposed algorithm halves the table size compared to the radix-2² algorithm while retaining the simple structure. To verify the proposed algorithm, a 2048-point pipelined FFT processor is designed using a 0.18 µm CMOS process. By combining the proposed algorithm and the radix-2² algorithm, the table size is reduced to 34% and 51% of that required by the radix-2 and radix-2² algorithms, respectively. The FFT processor occupies 1.28 mm² and achieves a signal-to-quantization-noise ratio (SQNR) of more than 50 dB.
key words: discrete Fourier transform (DFT), fast Fourier transform (FFT), low-power design, orthogonal frequency division multiplexing (OFDM), pipelined processing
Introduction
The fast Fourier transform (FFT) is a major signal processing block widely used in communication systems, especially in orthogonal frequency division multiplexing (OFDM) systems such as digital video broadcasting, digital subscriber line and WiMAX (IEEE 802.16). As such a system requires long-point FFT computation for multiple carrier modulation, usually more than 1024 points, it is desirable to reduce computational complexity as well as hardware complexity.
To reduce the computational complexity of FFT processing, various FFT algorithms have been proposed in addition to the radix-2 and radix-4 algorithms, such as radix-2² [1], radix-2³ [2], radix-4+2 [3], and split-radix [4]. Although these algorithms reduce computational hardware resources such as multipliers and adders, they do not seriously take into account the number of twiddle factors required in FFT processing. The twiddle factors can be computed on the fly by using the CORDIC (COordinate Rotation DIgital Computer) algorithm [5]. As the CORDIC algorithm takes multiple cycles and requires a lot of hardware resources, the twiddle factors are usually stored in tables that are generally implemented with ROMs. In the implementation of a long-point FFT processor, however, the tables become large enough to account for significant area and power consumption [6].
In this paper, a new FFT algorithm is proposed to overcome the large table requirement. It not only reduces the table size by a factor of two compared to the radix-2² algorithm but also retains the simple structure of the radix-2 algorithm. Since the additional computations incurred by the proposed algorithm can be implemented with a few adders, the overall computational complexity is almost the same as that of the radix-2² algorithm.
For efficient FFT processing, a number of hardware architectures have been proposed, such as serial processing on a single processor [7], pipelined processing [8]-[10], and parallel processing [11]-[13]. Among them, pipelined processing is preferred, as it provides high performance at a moderate cost. Based on the proposed algorithm, a pipelined 2048-point FFT processor is designed, which also supports 1024/512/256-point FFT processing. By applying the proposed algorithm to the first several stages, the table size is reduced to 34% and 51% of that required by the radix-2 and radix-2² algorithms, respectively.
The rest of this paper is organized as follows. In Sect. 2, a new FFT algorithm is proposed. The hardware issues in the proposed algorithm are described in Sect. 3. A 2048-point FFT processor based on the proposed algorithm is presented in Sect. 4.
Proposed FFT Algorithm
The proposed algorithm can be derived by applying the radix-2 decimation-in-frequency (DIF) decomposition presented by Cooley and Tukey [14] two times. The N-point discrete Fourier transform (DFT) of a sequence x(n) is defined as
\[ X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1 \tag{1} \]
where x(n) and X(k) are complex numbers. The twiddle factor is defined as follows.
\[ W_N^{nk} = e^{-j 2\pi nk / N} \tag{2} \]
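The two decompositions can be expressed by replacing n and k with the three-dimensional linear index maps shown below.
\[ n = \frac{N}{2} n_1 + \frac{N}{4} n_2 + n_3, \qquad k = k_1 + 2 k_2 + 4 k_3, \qquad n_1, n_2, k_1, k_2 \in \{0, 1\}, \quad n_3, k_3 \in \{0, 1, \ldots, N/4 - 1\} \tag{3} \]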
Copyright © 2007 The Institute of Electronics, Information and Communication Engineers
Using the above index maps, Eq. (1) can be rewritten as
where B(•) represents the following butterfly structure.
The main idea of the proposed algorithm is to take into account the value of n_3 in the summation over n_2. For even n_3 (= 2m), the sum is arranged as follows.
If n_3 is odd (= 2m + 1), the sum becomes
where m is an integer between 0 and N/8 − 1. By substituting Eqs. (6) and (7) into Eq. (4), we obtain the following expression.
In Eq. (8), the expression of H(•) also depends on the value of n_3. For even n_3, H(•) is expressed as
If n_3 is odd, H(•) can be arranged as follows.
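The even/odd split described in this section can be sketched as the worked derivation below. It is a reconstruction under stated assumptions, not a reproduction of Eqs. (4)-(10): it assumes the usual radix-2² index maps n = (N/2)n_1 + (N/4)n_2 + n_3 and k = k_1 + 2k_2 + 4k_3, and the symbol B^{k_1}(p) for the first-stage butterfly is notation chosen here for illustration.

```latex
\begin{align*}
X(k_1 + 2k_2 + 4k_3)
  &= \sum_{n_3=0}^{N/4-1} \sum_{n_2=0}^{1}
     B^{k_1}\!\left(\tfrac{N}{4} n_2 + n_3\right)
     W_N^{\left(\frac{N}{4} n_2 + n_3\right)(k_1 + 2k_2 + 4k_3)}, \\
B^{k_1}(p) &= x(p) + (-1)^{k_1}\, x\!\left(p + \tfrac{N}{2}\right), \\
W_N^{\frac{N}{4} n_2 (k_1 + 2k_2 + 4k_3)} &= (-j)^{n_2 k_1}\, (-1)^{n_2 k_2}, \\
W_N^{n_3 (k_1 + 2k_2 + 4k_3)} &= W_N^{n_3 (k_1 + 2k_2)}\, W_{N/4}^{n_3 k_3}, \\
n_3 = 2m:\quad
W_N^{n_3 (k_1 + 2k_2)} &= W_N^{2m (k_1 + 2k_2)}, \\
n_3 = 2m + 1:\quad
W_N^{n_3 (k_1 + 2k_2)} &= W_N^{k_1}\; W_N^{2m (k_1 + 2k_2) + 2k_2}.
\end{align*}
```

In the odd case the factor W_N^{k_1} is either 1 or the single constant W_N^1, so it can be realized by a constant multiplier, and the exponent left for the table, 2m(k_1 + 2k_2) + 2k_2, is even. In the even case the exponent 2m(k_1 + 2k_2) is even without any extra factor.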
As k_1 is either 0 or 1, Eq. (9) indicates that the butterfly has a trivial multiplication by −j at the input side if n_3 is even, and Eq. (10) implies that an additional constant multiplication by W_N^1 is required at the input side if n_3 is odd. By performing the constant multiplications at the input side, all the exponents of the twiddle factors to be multiplied in Eq. (8) become even. This implies that, in the proposed algorithm, the table required in every second stage needs to store only the twiddle factors whose exponents are even. In the radix-2, radix-2², and radix-2³ algorithms, however, most of the twiddle factor tables hold twiddle factors with both even and odd exponents.
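The claim that only even exponents remain can be checked numerically. The following is a minimal sketch, assuming the index maps n = (N/2)n_1 + (N/4)n_2 + n_3 and k = k_1 + 2k_2 + 4k_3; the function names are hypothetical, and the constant W_N^1 is applied to the butterfly output rather than its inputs, which is mathematically equivalent. The remaining N/4-point transform is done with a direct DFT for brevity.

```python
import cmath


def twiddle(N, e):
    """Return W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)


def dft(x):
    """Direct O(N^2) DFT, used as the reference and for the N/4-point tail."""
    N = len(x)
    return [sum(x[n] * twiddle(N, n * k) for n in range(N)) for k in range(N)]


def first_two_stages(x):
    """Apply two radix-2 DIF stages with the even/odd split on n3.

    Returns the four N/4-point sequences indexed by (k1, k2) and the set
    of exponents that had to be read from the twiddle table.
    """
    N = len(x)
    w1 = twiddle(N, 1)  # the single extra constant W_N^1
    out = {}
    table_exponents = set()
    for k1 in (0, 1):
        for k2 in (0, 1):
            seq = []
            for n3 in range(N // 4):
                # First-stage butterflies for n2 = 0 and n2 = 1.
                b0 = x[n3] + (-1) ** k1 * x[n3 + N // 2]
                b1 = x[n3 + N // 4] + (-1) ** k1 * x[n3 + 3 * N // 4]
                # Second-stage butterfly with the trivial factor (-j)^k1.
                h = b0 + (-1) ** k2 * (-1j) ** k1 * b1
                m, odd = divmod(n3, 2)
                if odd and k1:
                    h *= w1  # constant multiplication for odd n3
                e = 2 * m * (k1 + 2 * k2) + (2 * k2 if odd else 0)
                table_exponents.add(e)
                seq.append(h * twiddle(N, e))
            out[(k1, k2)] = seq
    return out, table_exponents


def fft_via_proposed_split(x):
    """Full DFT built from the split above; returns (X, table exponents)."""
    N = len(x)
    out, exps = first_two_stages(x)
    X = [0j] * N
    for (k1, k2), seq in out.items():
        sub = dft(seq)  # N/4-point DFT over n3
        for k3 in range(N // 4):
            X[k1 + 2 * k2 + 4 * k3] = sub[k3]
    return X, exps
```

Calling `fft_via_proposed_split` on any sequence whose length is a multiple of 8 returns the same spectrum as `dft`, together with the set of table exponents, which can be inspected to confirm that none of them is odd.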
The N-point inverse DFT (IDFT) of a sequence X(k) is defined as
\[ x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}, \quad n = 0, 1, \ldots, N-1 \tag{11} \]
As shown in Eq. (11), the inverse fast Fourier transform (IFFT) can be processed in the same way as the FFT by changing only the sign of the imaginary part of the twiddle factors.
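The reuse of the forward datapath for the inverse transform can be illustrated with a short sketch. It assumes a direct O(N²) transform as a stand-in for the pipelined datapath and applies the 1/N factor of the IDFT definition at the output; the function names are hypothetical.

```python
import cmath


def twiddle_table(N):
    """Forward twiddle factors W_N^e for e = 0 .. N-1."""
    return [cmath.exp(-2j * cmath.pi * e / N) for e in range(N)]


def transform(x, table):
    """Shared datapath: sum of x(n) * table[(n*k) mod N] for every k."""
    N = len(x)
    return [sum(x[n] * table[(n * k) % N] for n in range(N)) for k in range(N)]


def forward(x):
    """Forward DFT using the stored table as is."""
    return transform(x, twiddle_table(len(x)))


def inverse(X):
    """Inverse DFT: same datapath, imaginary parts of the twiddles negated."""
    N = len(X)
    conj_table = [w.conjugate() for w in twiddle_table(N)]
    return [v / N for v in transform(X, conj_table)]
```

Only the sign of the imaginary part of each table entry differs between `forward` and `inverse`, so a single stored table serves both directions.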
Proposed Table Size Reduction in Implementation
In pipelined radix-2 DIF FFT processing, the number of required twiddle factors is largest at the first stage and decreases gradually at the successive stages. The number of twiddle factors required in a stage is determined by two factors. The first is the maximum exponent value (MEV) of the twiddle factors, and the other is the greatest common divisor (GCD) of the exponents of the twiddle factors, that is, the minimum stride. Considering these two factors, the number of required twiddle factors, N_required, can be represented as follows.
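The two quantities can be made concrete for the plain radix-2 DIF case with the sketch below. It assumes that stage s (counted from 0) of an N-point radix-2 DIF FFT uses the exponents 0, 2^s, 2·2^s, ... of W_N; for such an arithmetic progression the number of distinct exponents equals MEV/stride + 1. Whether the paper's expression counts the trivial exponent 0 is not assumed here, and the helper names are hypothetical.

```python
from functools import reduce
from math import gcd


def radix2_dif_stage_exponents(N, stage):
    """Exponents e of W_N^e used in `stage` (0-based) of a radix-2 DIF FFT."""
    stride = 1 << stage
    half = N >> (stage + 1)
    return [n * stride for n in range(half)]


def table_stats(exponents):
    """Return (MEV, minimum stride, number of distinct exponents)."""
    nonzero = [e for e in exponents if e]
    mev = max(exponents)
    stride = reduce(gcd, nonzero, 0)  # 0 when only the trivial exponent is used
    return mev, stride, len(set(exponents))


def rom_entries(n_required):
    """Round the entry count up to the next power of two."""
    size = 1
    while size < n_required:
        size *= 2
    return size


if __name__ == "__main__":
    for s in range(4):
        print(s, table_stats(radix2_dif_stage_exponents(16, s)))
```

A larger minimum stride for the same MEV means fewer distinct exponents, which is why confining a table to even exponents halves its entry count.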
However, the ROM size is usually larger than N_required since the number of ROM entries is usually constrained to a power of two. The ROM size, N_ROM, is also largest at the first stage. Figure 1 illustrates the signal flow graphs of the 16-point FFT corresponding to the radix-2, radix-2², radix-2³ and the proposed algorithms. As shown in Fig. 1, the twiddle factor table is needed at every stage in the radix-2 algorithm. Compared to the radix-2 algorithm, the radix-2² or radix-2³ algorithm reduces the number of tables by introducing constant multipliers.
To reduce the ROM table size, in general, the non-trivial twiddle factors should be regularly distributed. This regularity is determined by the value of the minimum stride. Table 1 shows the number of ROM entries required in the 16-point FFT processing shown in Fig. 1 for the various FFT algorithms. Since the minimum stride in the proposed algorithm is larger than one, the total number of ROM entries can be reduced by at least a factor of two compared to the well-known radix-2² algorithm, which has a twiddle factor table in every second stage. Taking into account the π/2 symmetric property of the twiddle factors shown in Fig. 2 when counting the entries, the total numbers of table entries required for 8192-point and 2048-point FFT processing are compared in Table 2. By swapping the real and imaginary parts and changing their signs where required to compensate for the π/2 symmetry, all twiddle factors can be obtained from the stored ones. As shown in Table 2, the proposed algorithm effectively reduces the number of table entries compared to the other FFT algorithms.
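The recovery of all twiddle factors from the stored quarter can be sketched as follows. This is an illustrative model rather than the processor's ROM organization: it assumes the table holds W_N^r for 0 ≤ r < N/4, and the `stride` parameter (a hypothetical name) models a table restricted to exponents that are multiples of the stride, as in a table holding even exponents only.

```python
import cmath


def build_quarter_table(N, stride=1):
    """Store W_N^r for r = 0, stride, 2*stride, ... below N/4."""
    return [cmath.exp(-2j * cmath.pi * r / N) for r in range(0, N // 4, stride)]


def lookup(table, N, e, stride=1):
    """Rebuild W_N^e from the quarter table by swapping and sign changes."""
    q, r = divmod(e % N, N // 4)
    if r % stride:
        raise ValueError("exponent is not stored in this table")
    w = table[r // stride]
    a, b = w.real, w.imag
    if q == 0:
        return complex(a, b)    # W^r
    if q == 1:
        return complex(b, -a)   # -j * W^r
    if q == 2:
        return complex(-a, -b)  # -W^r
    return complex(-b, a)       # +j * W^r
```

With N = 2048, `build_quarter_table(2048)` holds 512 entries while `build_quarter_table(2048, stride=2)` holds 256, which illustrates the factor-of-two saving when only even exponents have to be stored.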
In the proposed algorithm, the reduction of table entries is achieved at the cost of one additional constant multiplier in every second stage. Therefore, the effectiveness of the proposed algorithm is determined by the complexity of the additional constant multiplier, which depends on the number of non-zero bits in the binary representation of the constant. To minimize the number of non-zero bits, the rounded constants are expressed in the minimal signed digit (MSD) representation [15], as shown in Table 3. Because the non-zero bits in the sine and cosine values are sparse, the constant multipliers can be implemented with a few adders. The constant multiplier for W_8192^1 can be implemented with only two adders, as shown in Fig. 3(a). Similarly, the constant multipliers for W_2048^1 and W_512^1 can be implemented with four and six adders, respectively, as shown in Figs. 3(b) and (c). The number of adders needed to implement each constant multiplier is specified in Table 3. The constant multiplier for W_8^1 is used in every third stage in the radix-2³ algorithm.
Although the proposed algorithm is very effective in long-point FFT processing owing to the low complexity of the additional constant multiplier, the reduction in table size is not considerable at the later stages of the pipelined FFT processor. For each pair of consecutive pipeline stages, we have to decide whether or not to apply the proposed algorithm; once the saving in table size no longer justifies the additional constant multiplier, the radix-2² algorithm should be applied from that stage. The hardware complexity, defined by the number of adders in the constant multipliers and the number of general multipliers, is summarized in Table 2 for the 8192-point and 2048-point FFTs. The proposed algorithm is applied to the first 8 of the 13 stages for the 8192-point FFT and to the first 6 of the 11 stages for the 2048-point FFT. The remaining stages are processed by the radix-2² algorithm.
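The link between non-zero digits and adder count can be illustrated with the canonical signed digit (CSD) form, which is one minimal signed digit representation. The sketch below is an assumption-laden illustration: the 11 fractional bits, the rounding rule and the function names are hypothetical, and it is not meant to reproduce the entries of Table 3.

```python
import math


def to_csd(value):
    """Signed digits in {-1, 0, 1}, least significant first, no two adjacent non-zeros."""
    digits = []
    n = value
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # +1 when n mod 4 == 1, -1 when n mod 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def nonzero_digits(value):
    """Number of non-zero signed digits of an integer constant."""
    return sum(1 for d in to_csd(value) if d)


def quantize(x, frac_bits):
    """Round a real constant to an integer with `frac_bits` fractional bits."""
    return round(x * (1 << frac_bits))


def constant_cost(N, frac_bits=11):
    """Non-zero digit counts of the rounded cosine and sine of 2*pi/N."""
    angle = 2 * math.pi / N
    c = quantize(math.cos(angle), frac_bits)
    s = quantize(math.sin(angle), frac_bits)
    return nonzero_digits(c), nonzero_digits(s)
```

A shift-and-add multiplication by a constant with d non-zero signed digits needs at most d − 1 adders or subtractors, so sparse sine and cosine values translate directly into small constant multipliers.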
As indicated in Table 2, the proposed algorithm combined with the radix-2² algorithm can process long-point FFTs with hardware complexity comparable to that of other hardware-efficient FFT algorithms. In addition to the low hardware complexity, the total number of table entries required is smallest for the proposed algorithm combined with the radix-2² algorithm. The table size can be reduced further by employing the π/4 symmetric property [16]. As the reduction ratio is independent of which symmetric property is used, the reduction ratio shown in Table 2 also applies to the case of π/4 symmetry. Applying the proposed algorithm to the first several stages is worthwhile because the original tables at those stages are large enough for the saving to outweigh the cost of the additional constant multipliers.
Overall Structure of Pipelined 2048-Point FFT/IFFT Processor
By combining the proposed algorithm with the radix-2² algorithm as described in Sect. 3, we designed a 2048-point pipelined FFT/IFFT processor whose single-path delay feedback (SDF) structure [1] is shown in Fig. 4. It can also process 1024/512/256-point FFT/IFFT. Although the designed processor is based on the SDF structure, the proposed algorithm can be applied to other pipelined structures, such as the multi-path delay commutator (MDC) [1]. The π/2 symmetric property of the twiddle factors is taken into account in deciding the size of the ROM tables. In the FFT processor, three constant multipliers are required to compute the non-trivial multiplications by constant twiddle factors (see Table 3). In implementing the 2048-point FFT processor, the bit-width of the twiddle factors is set to 12 bits on the basis of several simulations. A longer wordlength is not cost-efficient, as the SQNR does not improve notably while the cost of the multipliers and tables increases significantly [17].
As a butterfly (BF) contains an adder and a subtractor, the wordlength of its output should be increased by one bit to avoid overflow, which increases the hardware complexity of the memories and computational units. The simplest way to avoid increasing the internal wordlength is to scale the output of each stage down by half. If all the internal wordlengths are set to the input wordlength, however, the resulting SQNR is very low because of severe information loss. To achieve an SQNR sufficient to meet the standard specification, this approach therefore needs an internal wordlength much longer than the input wordlength, which significantly increases the overall hardware complexity. In the designed FFT/IFFT processor, the dynamic data scaling method [18], a sort of semi-floating-point representation, is adopted to lower the complexity of the computational units. Table 4 shows the internal wordlength configuration and SQNR performance of the designed FFT/IFFT processor when the input is represented in 12 bits.
The proposed 2048/1024/512/256-point FFT/IFFT processor was described in a hardware description language and synthesized with a 0.18 µm 4-metal CMOS standard cell library. Figure 5 shows the layout of the proposed FFT/IFFT processor. The processor occupies 1.28 mm², and the gate count is 51,510 excluding RAM and ROM memories. The FIFO buffers are implemented using RAM memories, and small-sized RAM and ROM memories are replaced with registers and logic circuitry, respectively. The processor achieves an SQNR of more than 50 dB, as indicated in Table 4. For 2048-point FFT/IFFT processing, the latency is 2057 cycles. In addition, the processor can process the 1024/512/256-point FFT/IFFT by simply changing the input location.
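The effect of the simple scaling scheme, in which every stage output is halved and kept at a fixed wordlength, can be explored with the sketch below. It models only that fixed scaling, not the dynamic data scaling method of [18], and it leaves the twiddle factors unquantized; the rounding rule, the wordlengths and the function names are assumptions made for illustration.

```python
import cmath
import math
import random


def quantize(z, frac_bits):
    """Round the real and imaginary parts to `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    return complex(round(z.real * scale) / scale, round(z.imag * scale) / scale)


def scaled_fft(x, frac_bits=None):
    """Radix-2 DIF FFT with every stage output halved.

    Returns DFT(x)/N in bit-reversed order. When `frac_bits` is given,
    each butterfly output is rounded to that many fractional bits.
    """
    N = len(x)
    data = list(x)
    span = N // 2
    while span >= 1:
        for start in range(0, N, 2 * span):
            for i in range(span):
                a = data[start + i]
                b = data[start + i + span]
                w = cmath.exp(-2j * cmath.pi * i / (2 * span))
                upper = (a + b) / 2
                lower = (a - b) * w / 2
                if frac_bits is not None:
                    upper = quantize(upper, frac_bits)
                    lower = quantize(lower, frac_bits)
                data[start + i] = upper
                data[start + i + span] = lower
        span //= 2
    return data


def sqnr_db(reference, test):
    """Signal-to-quantization-noise ratio in dB between two output vectors."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - t) ** 2 for r, t in zip(reference, test))
    return 10 * math.log10(signal / noise)


if __name__ == "__main__":
    random.seed(0)
    x = [quantize(complex(random.uniform(-1, 1), random.uniform(-1, 1)), 11)
         for _ in range(256)]
    ref = scaled_fft(x)
    for bits in (11, 13, 15):
        print(bits, round(sqnr_db(ref, scaled_fft(x, bits)), 1))
```

Running the script prints the SQNR for several internal wordlengths, which shows how strongly the fixed-scaling approach depends on extra internal bits.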
The area occupied by the core, including small-sized RAM and ROM memories, is 40.3% of the total area, and the hardware costs of the various functional blocks are listed in Table 5. Table 6 shows the areas occupied by the ROM tables, where the large ROMs are generated by the memory compiler. It shows that, at the later stages, the hardware cost of the additional constant multiplier becomes larger than that of the ROM tables. Therefore, the radix-2² algorithm is applied to those stages. Based on the measurements shown in Tables 5 and 6, we have estimated the areas required for the various FFT algorithms, as listed in Table 7. We conclude that the proposed algorithm is the most area-efficient because its ROM tables occupy the smallest area and the overhead of the additional constant multipliers is negligible. Although the radix-2³ and radix-4+2 algorithms lead to comparable areas in Table 7, their total areas are in practice larger than that of the proposed algorithm because they need more complex control circuits. Furthermore, the proposed algorithm combined with the radix-2² algorithm can lead to a more area-efficient implementation for 8192-point processing than the other algorithms, since the area occupied by the ROM outweighs the area of the additional constant multipliers.
Conclusion
We have proposed a new FFT algorithm to reduce the size of the twiddle factor tables required in long-point pipelined FFT processors. By applying the proposed FFT algorithm to the first several stages, the table size required in pipelined FFT processing is reduced by approximately half compared to the radix-2² algorithm, at the cost of a few constant multipliers. Since the constant multipliers can be implemented with a few adders, the proposed algorithm is efficient in long-point FFT computation, especially in terms of area and power consumption. Based on the proposed FFT algorithm, we have designed a 2048/1024/512/256-point pipelined FFT/IFFT processor that reduces the total size of the twiddle factor tables to 34% and 51% of that required by the radix-2 and radix-2² algorithms, respectively. The designed FFT/IFFT processor achieves an SQNR of more than 50 dB for 2048-point FFT processing.
