This paper presents digital signal processor (DSP) instructions and their data processing unit (DPU) architecture for high-speed fast Fourier transforms (FFTs) in orthogonal frequency division multiplexing (OFDM) systems. The proposed instructions jointly perform new operation flows that are more efficient than the operation flow of the multiply and accumulate (MAC) instruction on which existing DSP chips heavily depend. We further propose a DPU architecture that fully supports the instructions and show that the architecture is two times faster than existing DSP chips for FFTs. We simulated the proposed model with a Verilog HDL, performed a logic synthesis using the 0.35 µm standard cell library, and then verified the functions thoroughly.
I. INTRODUCTION
Various communication standards, such as WLAN, DTV, cable modem, WCDMA, and CDMA2000, have been developed rapidly. For these systems, once the algorithms have been fixed and verified, custom application-specific integrated circuit (ASIC) chips have been implemented to reduce cost, size, and power consumption. However, ASIC-based solutions may be inadequate for supporting multiple standards since they must be redesigned for each application. With the rapid increase in transistor density, it has become feasible to keep the functionality entirely in a programmable digital signal processor (DSP), allowing much faster changes and upgrades [1].
However, recent DSP technologies have not yet satisfied the requirements of high-speed communication standards. In particular, orthogonal frequency division multiplexing (OFDM) and discrete multitone (DMT) modem systems [2], which are necessary to achieve high-speed data transmission in narrow bands, need to perform fast Fourier transforms (FFTs) of several hundred or several thousand points within a few tens of microseconds. Commercial DSP chips have not yet met these requirements [3], [4].
High-speed FFT computation may be one of the main research topics for next-generation wired/wireless communications. To achieve high-speed FFT computation on DSP chips, this paper proposes instructions and their data processing unit (DPU) architecture, which can be embedded as the core in DSP chips. The proposed instructions support new FFT operation flows that are different from the multiply and accumulate (MAC) flow in typical DSP chips. The proposed architecture uses few additional data-path circuits, without modification or addition of the arithmetic units used in the existing DSPs [5]-[12]. We modeled the proposed architecture with Verilog HDL, synthesized it using the HYUNDAI™ 0.35 µm standard cell library with a SYNOPSYS™ tool, and performed a timing simulation.
A DSP Architecture for High-Speed FFT in OFDM Systems
Jaesung Lee, Jeonghoo Lee, Myung H. Sunwoo, Sangman Moh, and Seongkeun Oh
The proposed architecture performs the FFT operation flow about two times faster than the existing DSP chips [5]-[7] in terms of execution cycles. The rest of this paper is organized as follows. Section II describes the FFT algorithm and existing DSP-based FFT implementations. Section III presents the proposed instructions and their hardware architecture for high-speed FFT, and Section IV discusses the implementation and the performance comparisons with the existing DSP chips [5]-[9]. Finally, Section V contains concluding remarks.
II. EXISTING DSP-BASED FFT IMPLEMENTATIONS
We first describe an FFT algorithm and existing DSP-based FFT implementations. The radix-2 FFT is represented by (1), where $W_N = e^{-j2\pi/N}$ is the complex twiddle factor. This equation is computed by repeating the radix-2 butterfly operation [13]. Figure 1 shows how to compute the above butterfly with its DPU units on general DSP chips [14]. In Fig. 1, ①, ②, and ③ represent the first, second, and third clock cycles, respectively, assuming that one dual MAC instruction, which is generally used on existing DSP chips, is performed in one clock cycle. Because four multiplications are required, two clock cycles (i.e., cycles ② and ③) are needed when two multipliers are available. A deeper DPU pipeline permits a higher operating clock frequency. However, the computation of one butterfly requires the same number of clock cycles regardless of the operating clock frequency.
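For reference, the N-point DFT and one standard radix-2 butterfly can be written as below. This is a textbook form given under assumptions, not necessarily the exact notation of (1): the decimation-in-frequency butterfly is chosen because the flows described later subtract before multiplying, and the symbols $A$, $B$, $A'$, $B'$ are introduced here only for illustration.

```latex
X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad
W_N = e^{-j 2\pi / N}, \qquad k = 0, 1, \ldots, N-1

\begin{aligned}
A' &= A + B \\
B' &= (A - B)\, W_N^{k}
\end{aligned}
```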
The latest DSP chips have one dual-MAC and four or more ALU units in their DPUs [5]-[7], [9]-[12]. Accordingly, the DSP chips mainly depend on dual-MAC units with some adders/subtractors for computing the FFT butterfly. Because multipliers are larger than other arithmetic units, the number of multipliers is limited to two or four in the DPU [8]. Consequently, the flow graph in Fig. 1 is the appropriate solution for computing the FFT on dual MAC-based DSP chips.
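The conventional flow described for Fig. 1 can be sketched in software as follows. This is a minimal illustration, assuming a decimation-in-frequency butterfly (add/subtract first, then four real multiplications split over two cycles on two multipliers). The function name `butterfly_4mul` and the cycle annotations in the comments are illustrative assumptions, not taken from the figure.

```python
import cmath


def butterfly_4mul(a, b, w):
    """Radix-2 butterfly (assumed DIF form) using four real multiplications.

    a, b: complex inputs; w: complex twiddle factor.
    Returns the pair (a + b, (a - b) * w).
    """
    # Assumed cycle 1: additions and subtractions on the adders/ALUs.
    sum_re, sum_im = a.real + b.real, a.imag + b.imag
    d_re, d_im = a.real - b.real, a.imag - b.imag

    # Assumed cycles 2 and 3: four real multiplications in total,
    # two per cycle on a dual-MAC unit.
    p0 = d_re * w.real
    p1 = d_im * w.imag
    p2 = d_re * w.imag
    p3 = d_im * w.real

    # The accumulate step of the MAC combines the products.
    return complex(sum_re, sum_im), complex(p0 - p1, p2 + p3)
```

Calling `butterfly_4mul(1 + 2j, 3 - 1j, cmath.exp(-2j * cmath.pi / 8))` returns the same pair as computing `a + b` and `(a - b) * w` with Python's built-in complex arithmetic, up to rounding.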
III. THE PROPOSED DSP INSTRUCTIONS AND THEIR ARCHITECTURE
This section presents new FFT instructions based on the enhanced complex multiplication [15], the proposed operation flows, and their new DPU architecture. If (2) and the flow graph in Fig. 2 for complex multiplications [15] are used, then the flow graph of the general complex multiplication in Fig. 1 can be replaced.
In Fig. 2, one addition is performed first and then three multiplications are performed. Finally, one addition and one subtraction complete the complex multiplication. This scheme requires only three multiplications instead of the four in Fig. 1.
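The three-multiplication scheme described for Fig. 2 can be illustrated with a short sketch. It uses a common three-multiplication identity that matches the stated operation order (one addition, three multiplications, then one addition and one subtraction); whether this is the exact form of (2) is an assumption. The names `precompute_twiddle` and `cmul_3mul`, and the idea of storing the twiddle-derived constants in a table ahead of time, are hypothetical.

```python
import cmath


def precompute_twiddle(w):
    """Return (wr, wi - wr, wr + wi) for twiddle w = wr + j*wi.

    Assumption: these constants are stored with the twiddle table,
    so they cost no run-time additions.
    """
    return (w.real, w.imag - w.real, w.real + w.imag)


def cmul_3mul(d, tw):
    """Multiply complex d by a twiddle using three real multiplications."""
    wr, wi_minus_wr, wr_plus_wi = tw
    s = d.real + d.imag        # one addition first
    k1 = wr * s                # three multiplications
    k2 = d.real * wi_minus_wr
    k3 = d.imag * wr_plus_wi
    # one subtraction (real part) and one addition (imaginary part)
    return complex(k1 - k3, k1 + k2)
```

Expanding the terms shows why this works: `k1 - k3` equals `d.real*wr - d.imag*wi` and `k1 + k2` equals `d.real*wi + d.imag*wr`, which are the real and imaginary parts of the ordinary four-multiplication product.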
The flow graph shown in Fig. 3 can be obtained for two radix-2 butterflies using the complex multiplication of Fig. 2. In Fig. 3, ①, ②, and ③ represent the first, second, and third clock cycles, respectively, and together they compute two radix-2 butterflies. As shown in Fig. 3, the flow graph needs only six multiplications for the two butterflies, whereas two radix-2 butterflies using the scheme in Fig. 1 require eight multiplications.
The flow graph in Fig. 3 needs new instructions different from the MAC instruction to perform an addition first and then a multiplication next as in cycle ① or ②. To fulfill this requirement, we propose the new add and multiply (AMPY) instruction, which is a one cycle instruction. Since two AMPYs can be executed in parallel, we used dual AMPY instructions.
The consecutive-ADD instruction that performs one addition after two subtractions by the dual AMPY instruction in one cycle is needed to perform cycle ① or ②. The multiply and double-accumulate (MDAC) instruction performs one addition and one subtraction after one multiplication. Since two MDAC instructions are executed concurrently in cycle ③, we used dual MDAC instructions.
If another scheme (Fig. 4) is used, the add and double-multiply (ADMPY) instruction is needed to perform two multiplications after one subtraction as in cycle ①. In cycle ②, we need the add and dual MAC (ADMAC) instruction, which performs an addition first and then a dual MAC operation.
The scheme using the dual-AMPY instruction and the dual-MDAC instruction performs two butterflies in 3 cycles (Fig. 3), and thus takes 1.5 cycles per radix-2 butterfly. The other scheme, using the ADMPY instruction and the ADMAC instruction, takes 2 cycles per radix-2 butterfly (Fig. 4).
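To see what these per-butterfly figures imply for a whole transform, the sketch below counts butterfly cycles for an N-point radix-2 FFT. It assumes the usual (N/2)·log2(N) butterflies and counts butterfly cycles only; load/store, loop, and address-generation overhead are ignored, so the results are lower bounds rather than measured figures. The function names are hypothetical.

```python
import math
from fractions import Fraction


def butterfly_count(n_points):
    """Number of radix-2 butterflies in an N-point FFT: (N/2) * log2(N)."""
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two >= 2")
    stages = n_points.bit_length() - 1  # log2(N) for a power of two
    return (n_points // 2) * stages


def butterfly_cycles(n_points, cycles_per_butterfly):
    """Butterfly-only cycle count (assumption: no memory or loop overhead)."""
    return math.ceil(butterfly_count(n_points) * Fraction(cycles_per_butterfly))
```

For example, a 1024-point FFT has 5120 butterflies, so `butterfly_cycles(1024, Fraction(3, 2))` gives 7680 cycles for the 1.5-cycle scheme and `butterfly_cycles(1024, 2)` gives 10240 cycles for the 2-cycle scheme.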
Fig. 4. The proposed flow graph using the AMAC instruction.
These instructions can be used along with other instructions using the parallel '||' notation, as in very long instruction word (VLIW) DSP chips [7].
To perform the new instructions and other operations in Figs. 3 and 4 efficiently, the existing DPU architectures must be modified. However, we need neither to append more arithmetic units to existing DPUs nor to modify the internal architecture of the typical arithmetic units (adders or multipliers) themselves. The dual AMPY instructions in Fig. 5, which perform cycle ① or ② in Fig. 3, are executed using Adder1, Adder2, Adder3, Alu0, and Alu1, while the consecutive-ADD instruction that performs cycle ③ in Fig. 3 is executed using Alu0, Alu1, and Adder3. Cycle ① in Fig. 4 is performed using the ADMPY instruction, which uses Adder3, Mul0, and Mul1, while cycle ② is performed using the ADMAC instruction, which uses Adder0, Adder1, Alu0, Alu1, Mul0, and Mul1. Finally, the MDAC instruction is executed using Mul0, Mul1, Adder0, Adder1, Alu0, and Alu1.
Fig. 3. The proposed flow graph of two radix-2 butterflies using the enhanced complex multiplication.
The next figures illustrate the flows mentioned above. Figure 6 explains how to compute two radix-2 butterflies using dual-AMPY, consecutive ADD, and dual MDAC successively and describes the corresponding switched data-paths in the DPU. Here, "switched" means that the data-paths are changed by 2-by-1 multiplexers. In Fig. 6, (a) is performed at the first clock cycle, (b) is performed at the second clock cycle, and then (c) is performed at the third clock cycle. Figure 7 explains how to compute two radix-2 butterflies using ADMPY and ADMAC and describes the corresponding switched data-paths in the DPU. In Fig. 7, (a) is performed at the first clock cycle and then (b) is performed at the second cycle.
Fig. 7. The switched data-paths corresponding to the operations in Fig. 6. (b) ADMAC.
IV. IMPLEMENTATION
The timing simulation using the CADENCE™ Verilog-XL shows that the maximum delay path is about 6.92 ns, and thus, the maximum operating clock frequency is about 144.5 MHz. If a deeper pipeline is used, a higher operating clock frequency can be obtained, and hence the required FFT computation time can be reduced. Table 1 presents performance comparisons among the DSP architectures for FFT computation [5]-[8]. Note that the performance figures of commercial DSP chips are taken from their data sheets or references. The SC140 can complete the complex multiplication in one cycle and may have a performance similar to the proposed DPU. However, it requires larger hardware than the proposed DPU. Using only half the number of operation units, the proposed DPU can show a better performance than the SC140.
We used a one-stage pipelined carry look-ahead adder as an adder and a three-stage pipelined Wallace-tree multiplier as a multiplier. The proposed architecture was modeled in Verilog HDL. We performed a logic synthesis using the HYUNDAI™ 0.35 µm standard cell library with the SYNOPSYS™ Design Compiler and then performed a timing simulation.
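The relation between the reported critical-path delay, the clock frequency, and an FFT's butterfly time can be checked with a small sketch. The helper names are hypothetical, and the time estimate assumes butterfly cycles only, with no memory or control overhead, so it is an illustration rather than a reproduction of the values in Table 1.

```python
def max_clock_mhz(critical_path_ns):
    """Maximum clock frequency in MHz for a given critical-path delay in ns."""
    return 1000.0 / critical_path_ns


def fft_time_us(cycles, clock_mhz):
    """Execution time in microseconds for a given cycle count and clock in MHz."""
    return cycles / clock_mhz
```

For example, `max_clock_mhz(6.92)` gives roughly 144.5, in line with the figure quoted above. A hypothetical workload of 288 cycles (the 192 butterflies of a 64-point FFT at 1.5 cycles each, ignoring overhead) would take about 2 µs at that clock.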
V. CONCLUSIONS
This paper proposed DSP instructions and their DPU architecture for high-speed FFTs in OFDM systems. First, we proposed the novel instructions that are necessary to perform FFT computation and then a DPU architecture that can support the proposed instructions as well as general DSP instructions. The proposed architecture, having little hardware overhead, can perform FFTs about two times faster than the existing DSP chips in terms of execution cycles. In addition, it is clear that the power consumption of the proposed architecture is lower than existing architectures because it uses fewer function units.
