This paper presents a low-complexity multi-mode fast Fourier transform (FFT) processor for Digital Video Broadcasting-Terrestrial 2 (DVB-T2) systems. DVB-T2 operation requires an FFT processor that supports 1K/2K/4K/8K/16K/32K-point modes. The proposed design employs a pipelined shared-memory architecture that exploits radix-2/2^2/2^3/2^4 FFT algorithms, a multi-path delay commutator (MDC), and a novel data scaling approach. Based on this architecture, a novel low-cost data scaling unit is proposed to increase area efficiency, and an elaborate memory configuration scheme is designed so that single-port SRAM can be used without degrading the throughput rate. In addition, a new twiddle factor scheduling method is proposed to reduce the area. The SQNR of the 32K-point FFT mode is about 45.3 dB at an 11-bit internal word length for 256QAM modulation. The proposed FFT processor has lower hardware complexity and a smaller memory size than conventional FFT processors.
Introduction
Digital Video Broadcasting-Terrestrial 2 (DVB-T2) is the next development of the DVB-T standard [1]. Building on the success and technology of DVB-T, it provides additional facilities and features in line with the developing Digital Terrestrial Television (DTT) market. DVB-T2 promises a 30% to 50% increase in capacity under conditions equivalent to those already used for DVB-T. It also enables additional features and services such as High Definition Television (HDTV). In the physical (PHY) layer design of DVB-T2, orthogonal frequency division multiplexing (OFDM) is one of the main technologies, and the fast Fourier transform (FFT) is the key component. In previous DVB-T/H systems, a 2048/4096/8192-point FFT processor is used, and the specified symbol durations are 224 μs/448 μs/896 μs. Previous DVB-T/H systems also used QPSK/16QAM/64QAM modulation [2], [3]. DVB-T2 adds three modes, 1K/16K/32K (112 μs/1792 μs/3584 μs), to the three DVB-T/H modes. Therefore, an FFT processor for DVB-T2 systems must be able to work in 1K/2K/4K/8K/16K/32K-point modes. DVB-T2 also adds 256QAM modulation.
The design of efficient FFT architectures has been actively investigated over the last decades. Generally, FFT architectures can be divided into two categories: pipelined structures, such as the multi-path delay commutator (MDC), multi-path delay feedback (MDF), or single-path delay feedback (SDF) architectures [4]-[6], and memory-based structures [7], [8]. Pipelined architectures have the advantage of high throughput but demand a high area cost, especially for long-length FFTs. Memory-based FFT architectures usually contain one butterfly processing element (PE), memory banks, and control logic. Although they have low area costs, their throughput is often limited by the available number of PEs and the memory access bandwidth. Memory-based architectures can in turn be divided into two categories: the cached-memory architecture [7] and the pipelined shared-memory architecture [8].
When designing a long-size FFT processor, one still has to consider its power consumption and hardware cost. Data access in memory and the operation of the complex multipliers together account for more than 75% of the total power consumption of an FFT processor [9]. A common and effective way to reduce the power consumption of the FFT memories and complex multipliers is to reduce the number of memory accesses, the internal word length, and the number of operations in the complex multipliers.
In this paper, we propose a new multi-mode memory-based FFT processor that minimizes hardware complexity and improves clock speed. It locally employs a pipelined architecture to realize the Butterfly Unit (BU) and globally uses this pipeline within a memory-based architecture to save area. This work uses a multi-path delay commutator (MDC) architecture. By doubling the pipelined processing path, the execution speed can be balanced efficiently against hardware cost. Also, the use of the radix-2^4 algorithm reduces the number of memory accesses. However, a longer word length is needed to maintain a sufficient signal-to-quantization noise ratio (SQNR) in a fixed-point long-size FFT processor. To overcome this problem, we propose a new scaling approach, called the indexed block scaling approach, which maintains a sufficient SQNR without increasing the word length. The memory occupies a large share of the chip area and increases power consumption, especially in long-size FFTs. The area occupied by the memory module is proportional not only to the amount of stored data and the word length but also to the number of ports. Thus, single-port memory is used in FFT processors to reduce the area significantly [10].
The rest of this paper is organized as follows. Section 2 describes the design considerations for the FFT processor and proposes a new data scaling approach. Section 3 presents the proposed pipelined shared-memory architecture and describes several techniques to reduce area in practical implementation. In Sect. 4, the implementation and comparison are presented. Finally, conclusions are drawn in Sect. 5.
Copyright © 2011 The Institute of Electronics, Information and Communication Engineers
Design Consideration for the FFT Processor

FFT Algorithms
In designing an FFT processor, one can first determine the required specifications of the target FFT processor as a function of the FFT radix r, and then decide on the most suitable radix. Table 1 shows the computational complexity of the complex multipliers for several FFT algorithms in 32K mode; the radix-2^4 algorithm has the lowest computational complexity. Thus, the proposed FFT processor employs the radix-2^4 decimation in frequency (DIF) algorithm for most operations. Depending on the FFT size, however, the radix-2^4 DIF FFT algorithm cannot cover all stages, so the radix-2/2^2/2^3 DIF FFT algorithms are used in some parts of the operation. The radix-2/2^2/2^3/2^4 algorithms are explained in detail in [4], [5], [11], [12].
Scaling Approach
To maintain data accuracy in a fixed-point FFT processor, the internal word length is usually made larger than the word length of the input data in order to achieve a higher SQNR, especially in a long-size FFT processor. To limit this word-length growth, long-size FFT processors use a scaling approach to minimize the quantization error.
The block floating point (BFP) approach is one of the scaling approaches commonly used in FFT implementations. In traditional BFP, the largest value in stage N is detected and all computational results of that stage are scaled by a common scale factor before the calculations of stage N + 1 start [13]. The BFP approach has the advantage of the lowest scaling overhead. However, it has the lowest SQNR performance of the scaling approaches considered here.
The hybrid floating point (HFP) approach is also a well-known scaling approach. The HFP approach is a hybrid and simplified scheme for floating point representation of a complex number. This approach uses a single exponent for the real and imaginary parts. Also, supporting HFP requires pre and post processing units in the arithmetic building blocks [14]. The HFP approach has the highest SQNR performance except for floating point. However, this approach has the disadvantage of high scaling overhead.
Table 1 Computational complexity of the complex multiplication for FFT algorithm (32K mode).
The block scaling (BS) approach is a co-optimized approach that combines the HFP approach with the BFP approach. It divides the data into blocks of an arbitrary size, and each block has a single exponent. The values in each block are scaled according to the corresponding exponent. The scale factor (exponent) is determined when the operation on each block is finished, and the data in the block are scaled before the operation on the next block starts. All scale factors need to be stored in a table, and the processed data in a block are stored in a cache [7]. The BS approach achieves an SQNR close to that of the HFP approach, and its scaling overhead lies between those of the BFP and HFP approaches.
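The difference between one stage-wide exponent (BFP) and per-block exponents (BS) can be illustrated with a minimal Python sketch. The function name `block_scale`, the integer (re, im) data layout, and the plain right shift are assumptions made for illustration. The sketch omits the cache and prefetch buffer and is not the hardware scheme of [7].

```python
def block_scale(values, block_size, word_length):
    """Scale integer (re, im) pairs block by block.

    Each block of `block_size` values shares one exponent (a right-shift
    count) chosen so that every part fits a signed `word_length`-bit word.
    Returns the scaled values and the table of per-block exponents.
    """
    limit = 1 << (word_length - 1)            # magnitudes must stay below this
    scaled, exponents = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        peak = max(max(abs(re), abs(im)) for re, im in block)
        shift = 0
        while (peak >> shift) >= limit:       # smallest shift without overflow
            shift += 1
        exponents.append(shift)
        scaled.extend((re >> shift, im >> shift) for re, im in block)
    return scaled, exponents


# Hypothetical stage output: one large value next to several small ones.
stage = [(1500, -20), (3, 700), (12, -9), (5, 40)]

# BFP behaviour: one exponent for the whole stage (block size = stage size).
bfp_values, bfp_exponents = block_scale(stage, len(stage), word_length=11)

# BS behaviour: one exponent per block of two values.
bs_values, bs_exponents = block_scale(stage, 2, word_length=11)
```

With per-block exponents, the second block keeps its full precision, whereas the stage-wide exponent discards one bit from every value. The price is a larger exponent table, which is the overhead trade-off described above.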
To improve on the SQNR of the BS approach, we propose the indexed block scaling (IBS) approach, which is based on an indexing technique. It improves SQNR by increasing the number of scale factors and indexes. The IBS approach differs from the BS approach in the following way. When the scaling block size is increased in the BS approach, the prefetch buffer size increases proportionally and the number of exponents decreases. The IBS approach, in contrast, uses an index table instead of a prefetch buffer. It is effective in terms of area cost if the block size is larger than 128, as shown in Fig. 1.
In the BS approach, the elements inside a block are determined sequentially. With the IBS approach, however, the sequence of elements inside a block can be determined flexibly. Thus, the IBS approach can achieve a higher SQNR than the BS approach with the same block size. Figure 2 shows an example of the differences between the scaling blocks of the BS and IBS approaches (block size = 4, radix-2^2, 16-point FFT). In the proposed architecture, the PE block consists of four butterfly units connected in series. Thus, when the BS approach is applied to this structure, the block size is restricted to values proportional to 2^(4n). The IBS approach achieves almost the same SQNR as a BS approach that uses 2^4 times as many exponents. In the proposed approach, we use a scaling block size of 256. As a result, the SQNR of the 32K-point FFT reaches 45.3 dB for 256QAM signals, which is better than that of the BS approach. Figure 3 shows the SQNR performance of each scaling approach as a function of the internal word length (IWL). The BFP approach features the lowest scaling overhead but the worst SQNR performance. In contrast, the HFP approach has the highest scaling overhead but the best SQNR performance. The proposed IBS approach has a better SQNR performance and less scaling overhead than the conventional BS approach. Thus, the IBS approach uses memory elements more efficiently than the BS approach.
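For readers who want to reproduce the kind of SQNR figure quoted above, the following Python sketch shows the usual definition: signal power divided by quantization-error power, in dB. The helper names `quantize` and `sqnr_db` and the cosine test signal are assumptions made for illustration. The sketch does not model the 256QAM input, the FFT data path, or any of the scaling approaches compared in Fig. 3.

```python
import math


def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (no saturation)."""
    step = 2.0 ** -frac_bits
    return round(x / step) * step


def sqnr_db(reference, quantized):
    """SQNR in dB: signal power over quantization-error power."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - q) ** 2 for r, q in zip(reference, quantized))
    if noise == 0:
        return math.inf                      # no error at all
    return 10.0 * math.log10(signal / noise)


# Hypothetical test signal: one period of a cosine, 64 samples.
reference = [math.cos(2 * math.pi * k / 64) for k in range(64)]
for bits in (8, 10):
    approx = [quantize(x, bits) for x in reference]
    print(bits, "fractional bits:", round(sqnr_db(reference, approx), 1), "dB")
```

In a full evaluation, `reference` would be a floating-point FFT output and `quantized` the fixed-point FFT output for the same input.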
Proposed FFT Architecture
The pipelined shared-memory FFT processor is described in this section. Figure 4(a) shows the top block diagram of the proposed architecture. It mainly consists of a processing element (PE), 8K-word (22-bit) single-port SRAM, and a control unit. The proposed FFT processor repeatedly performs an input/output (I/O) operation and a PE computation operation. The FFT mode is determined by a mode selection signal. In this section, we describe how each block can be implemented to reduce area.
Processing Element
Figure 4(b) shows the proposed PE based on the MDC architecture. By controlling the input sequence of the PE, the buffer size is reduced by half compared to the conventional MDC architecture. The proposed architecture is based on a radix-2^4 MDC folding architecture and requires a different number of PE computation iterations in each mode: the 1K/2K/4K-point FFT modes are completed in three iterations, and the 8K/16K/32K-point FFT modes are completed in four iterations. Figure 5 shows the operation of each radix; the proposed PE executes different operations according to the radix. Figure 6 shows the operation of each mode. In all modes, two iterations of the radix-2^4 DIF algorithm are performed during the final 8 stages. In addition, the 1K-point FFT mode performs the radix-2^2 operation during the first 2 BU stages, the 2K-point mode performs the radix-2^3 operation during the first 3 BU stages, and the 4K-point mode performs the radix-2^4 operation during the first 4 BU stages. The 8K-point mode performs radix-2 and radix-2^4 operations during the first 5 BU stages, the 16K-point mode performs radix-2^2 and radix-2^4 operations during the first 6 BU stages, and the 32K-point mode performs radix-2^3 and radix-2^4 operations during the first 7 BU stages.
Complex Constant Multiplier
Figure 7 shows the proposed complex constant multiplier, which consists of six constant multipliers, adders, and MUXs. It uses three types of constant multipliers, which multiply by the twiddle factor coefficients cos(2π/16) = 0.9239, cos(4π/16) = 0.7071, and cos(6π/16) = 0.3827. The coefficients are represented in canonic signed digit (CSD) form. A CSD representation contains the fewest nonzero digits, so it reduces the area and power consumption [4], [15]. The area cost of the proposed multiplier is only 70% of that of the complex Booth multiplier without the ROM table.
The conventional MDC structure using the radix-2^4 algorithm requires a larger constant multiplier than the SDF and MDF structures. However, hardware complexity can be reduced by adjusting the scheduling of the twiddle factors. Table 2 shows the scheduling of the twiddle factor W_16 using the proposed and conventional scheduling schemes. To avoid extra cycles and further reduce the hardware complexity, the following simplification scheme is proposed: 1) The proposed scheduling scheme reduces conflicts by using the idle Time Slot 0 in Table 2 and the feedback loop shown in Fig. 5. 2) The twiddle factors of the 2nd data-path in Time Slots 5 and 6 are shifted to Time Slots 6 and 7, and the twiddle factor of the 2nd data-path in Time Slot 7 is shifted to Time Slot 0. That is, in Time Slots 5, 6, and 7, the maximum number of complex constant multipliers used in the proposed scheduling scheme is 2, whereas the conventional scheduling method uses 4. Thus, the proposed scheduling method requires fewer complex constant multipliers than the conventional method.
Complex Booth Multiplier
Figure 8 shows a block diagram of the complex Booth multiplier. The proposed complex Booth multiplier is a non-pipelined structure because low area and low power are more important than clock speed in a memory-based design. In the multiplication block, the word length of the twiddle factor influences the SQNR performance, so an appropriate twiddle factor word length must be chosen. Figure 9 shows the SQNR as a function of the internal word length (IWL) and twiddle factor word length (TWL). If the internal word length is 11 bits, twiddle factor word lengths of 10 bits or more yield the same SQNR performance. For this reason, the twiddle factor word length is set to 10 bits.
The complex Booth multiplier needs a look-up table (LUT) implemented in read-only memory (ROM) to store the twiddle factor values. Figure 10 shows the twiddle factors for the ROM table. Only a 1/8th period of the cosine and sine waveforms is stored in the ROM, and the remaining portions of the waveforms can be reconstructed from these stored values. Also, if the word length of the twiddle factor is set to 10 bits, an FFT processor using 1/4 of the twiddle factors has almost the same SQNR performance as previous FFT processors using all twiddle factors. For this reason, this architecture uses only a 1/32nd period of the cosine and sine waveforms. As a result, the hardware complexity of the ROM table is reduced.
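The reconstruction of a full period from a 1/8-period table relies on the octant symmetry of sine and cosine, which the following Python sketch illustrates. The table layout (indices 0 to n/8 inclusive), the function names, and the floating-point values are assumptions made for illustration. The sketch does not model the 10-bit quantization or the further reduction to a 1/32nd period described above.

```python
import cmath
import math


def build_octant_table(n):
    """Store cos/sin only for angles 0..pi/4, i.e. indices 0..n/8."""
    m = n // 8
    cos_t = [math.cos(2 * math.pi * i / n) for i in range(m + 1)]
    sin_t = [math.sin(2 * math.pi * i / n) for i in range(m + 1)]
    return cos_t, sin_t


def twiddle(k, n, cos_t, sin_t):
    """Rebuild W_n^k = exp(-2j*pi*k/n) from the first-octant table.

    Assumes n is a multiple of 8.
    """
    quarter = n // 4
    quadrant, kq = divmod(k % n, quarter)
    if kq <= n // 8:                      # angle lies within the stored octant
        c, s = cos_t[kq], sin_t[kq]
    else:                                 # mirror about pi/4: swap cos and sin
        c, s = sin_t[quarter - kq], cos_t[quarter - kq]
    # Rotate the first-quadrant value by quadrant * 90 degrees.
    for _ in range(quadrant):
        c, s = -s, c
    return complex(c, -s)


# Hypothetical usage for a 64-point transform: 9 stored entries cover all 64.
cos_t, sin_t = build_octant_table(64)
w = [twiddle(k, 64, cos_t, sin_t) for k in range(64)]
```

In hardware, the swap and sign changes would typically be implemented with multiplexers and negation controlled by the upper bits of the twiddle index, so only the table itself consumes ROM.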
Overflow Detection and Scaling Unit
The overflow detection and scaling unit (ODSU) detects overflow in the input data and converts the overflow detection result into exponent form. Figure 11 shows the structures of ODSU1 and ODSU2. The exponent value of ODSU1 is propagated to ODSU2 and is also used as the scaling factor in the scaling unit of ODSU1. ODSU2 performs the index table update operation. That is, ODSU2 obtains the final exponent information by adding the exponent value propagated from ODSU1 to the exponent value output by its own overflow detection unit. Using this information, ODSU2 decides whether the index table needs to be updated and, if so, performs the index update operation. The location of the first overflow generated in each scaling block is recorded in the index table. This index table data is used to scale the input values at the equalizer block of the next PE operation.
Memory Unit
As mentioned in Sect. 1, the major part of the FFT processor's power consumption results from the main memory blocks. The large number of memory accesses is one of the most significant contributors to power consumption. Also, more than half of the chip area is occupied by the main memory blocks. Generally speaking, the main memory blocks are the most critical blocks in the FFT processor in terms of both power consumption and hardware complexity. Therefore, to reduce chip area and power consumption, single-port SRAM and a low internal word length are adopted in the proposed architecture. Single-port SRAM operates with fewer memory accesses and occupies less area than dual-port SRAM. The total memory size in our design is 704 Kbit, with each word consisting of 11-bit real and imaginary parts.
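The 704 Kbit figure can be checked with a few lines of Python. The split into four banks of 8K words is an assumption inferred from the 8K-word (22-bit) SRAM and the four memory banks mentioned in the text. The constant names are hypothetical.

```python
WORD_BITS = 2 * 11             # 11-bit real part + 11-bit imaginary part
POINTS = 32 * 1024             # largest FFT mode (32K points)

total_bits = POINTS * WORD_BITS
total_kbit = total_bits / 1024  # 1 Kbit taken as 1024 bits

# Assumed organisation: four single-port banks of 8K words each.
BANKS = 4
BANK_WORDS = 8 * 1024
banked_bits = BANKS * BANK_WORDS * WORD_BITS

print(total_kbit, "Kbit in total;", banked_bits == total_bits)
```

Under these assumptions, the memory holds exactly one 32K-point data set at 22 bits per complex word.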
The read/write scheduling scheme of the four memory banks and prefetch/prewrite buffers is shown in Fig. 12. The memory block in the FFT processor operates as follows: the four memory banks first perform read operations during the latency of the PE block, as shown in Fig. 12. After that, the memory banks perform write and read operations every clock cycle, iteratively.
Results and Comparison
Table 3 compares the storage requirements and SQNR of the proposed IBS approach and previous scaling approaches. The proposed scaling approach is more storage-efficient than the previous BS and HFP approaches. In terms of SQNR, the proposed approach achieves 45.3 dB at an 11-bit internal word length for 256QAM modulation, which is comparable to the BS and HFP approaches and much better than the BFP approach.
The proposed FFT architecture was modeled in Verilog HDL and simulated to verify its functionality. After the design functionality was fully verified, it was synthesized using appropriate timing and area constraints. Both the simulation and synthesis steps were carried out using the Synopsys synthesis tool and a 0.18-μm CMOS standard cell library. The total gate count from the synthesis results is 41,000 gates, excluding memories. According to pre-layout simulation, the proposed architecture can operate at a maximum clock frequency of 110 MHz. The execution time to compute a 32K-point FFT is 595.5 μs at 110 MHz, which is sufficient to meet the DVB-T2 specification (3584 μs).
In Table 4, the performance characteristics of the proposed processor are summarized and compared with previous works. Hardware complexity is compared in terms of the number of complex adders, complex multipliers, and memories, because the area is dominated by these components. The results show that the proposed processor requires fewer complex adders, complex Booth multipliers, and complex CSD multipliers than the other FFT processors. For a fair comparison with previous works, the memory elements are normalized to the 8K-point and 32K-point FFT modes. The proposed processor, which supports the 32K-point FFT mode, has a smaller memory size than the other processors that support the 8K-point FFT mode. The SQNR and execution time are sufficient to meet the DVB-T2 specification. As a result, the proposed processor achieves lower hardware complexity than the other FFT processors. Moreover, the proposed processor achieves a much higher operating frequency than the other memory-based FFT processor.
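The timing margin of the 32K-point mode can be checked with a short Python sketch. The elementary period of 7/64 μs is an assumption corresponding to an 8 MHz channel, which is consistent with the symbol durations listed in the introduction. The variable and function names are hypothetical.

```python
from fractions import Fraction

ELEMENTARY_PERIOD_US = Fraction(7, 64)   # assumed: 8 MHz channel raster


def symbol_duration_us(points):
    """Useful OFDM symbol duration for a given FFT size, in microseconds."""
    return points * ELEMENTARY_PERIOD_US


CLOCK_MHZ = 110                          # maximum clock frequency reported
EXEC_TIME_US = Fraction(5955, 10)        # reported 32K-point execution time

cycles = EXEC_TIME_US * CLOCK_MHZ        # microseconds x MHz = clock cycles
margin = symbol_duration_us(32 * 1024) / EXEC_TIME_US

print("clock cycles per 32K-point FFT:", cycles)
print("symbol duration / execution time:", float(margin))
```

Under this assumption, the reported execution time corresponds to roughly two clock cycles per point of the 32K-point transform, and the FFT completes in about one sixth of the symbol duration.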
Conclusion
In this paper, a new pipelined shared-memory FFT architecture has been proposed for DVB-T2 applications. By adjusting the scheduling of the twiddle factors, the size of the complex constant multiplier was reduced. To reduce the internal word length, a new indexed block scaling approach has been proposed that preserves SQNR at low scaling overhead. Also, single-port SRAM with a minimal word length is adopted to further reduce hardware complexity. As a result, the proposed processor achieves a shorter word length, less memory usage, and higher SQNR performance, which leads to low hardware complexity and a small memory size. The proposed architecture has potential applications in DVB-T2 systems.
