Abstract-Mobile
I. INTRODUCTION
The true Mobile WiMAX standard of 802.16e is divergent from Fixed WiMAX. While clearly based on the same OFDM base technology adopted in 802. , the 802.16e version is designed to deliver service across many more sub-channels than the OFDM 256-FFT. It is important to note that both standards support single carrier, OFDM 256-FFT and at least OFDMA 1K-FFT.
OFDM technology is used in many communication systems, such as Asymmetric Digital Subscriber Line (ADSL), Wireless Local Area Network (WLAN) and Multimedia Communication Services [1]. One of the key components of an OFDM system is the Fast Fourier Transform (FFT). More and more communication systems require FFTs with more points and higher symbol rates, which makes the low-power, high-speed design of large FFTs a challenge.
The FFT algorithm eliminates the redundant calculations needed to compute the Discrete Fourier Transform (DFT) and is thus very suitable for efficient hardware implementation [2]. In addition to computing the DFT efficiently, the FFT also finds applications in linear filtering, digital spectral analysis, correlation analysis, Ultra Wide Band (UWB) applications, etc. A hardware-oriented radix-2^2 algorithm [3] is developed by integrating a twiddle factor decomposition technique into the divide-and-conquer approach to form a spatially regular Signal Flow Graph (SFG). Mapping the algorithm to the cascading delay feedback structure leads to the proposed architecture [4].
Manuscript received July 22, 2010.
1 Dept. of Electronics and Communication Engineering, R.S.R Engineering College, Kadanuthala, Nellore, AP, India (email: kamathamhk@yahoo.com).
2 Dept. of Telecommunication Engineering, SRM University, Kattankulattur, Chennai, TN, India.
3 Dept. of Electrical and Computer Engineering, Gonzaga University, Spokane, WA, USA.
The next section describes the architecture and design methodology, followed by its implementation in the VERILOG Hardware Description Language (HDL) and its utilization, performance and implementation in OFDM systems. Finally, we conclude with a comparison of the hardware requirements of R2^2SDF and several other popular pipeline architectures.
II. ARCHITECTURE AND DESIGN METHODOLOGY

A. Radix-2^2 Decimation in Frequency FFT Algorithm
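A useful state-of-the-art review of hardware architectures for FFTs was given by He et al. [5], in which different approaches were put into functional blocks with unified terminology. From the definition of the DFT of size N [6]:

$$ X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad 0 \le k < N \tag{1} $$

where W_N denotes the primitive Nth root of unity, with its exponent evaluated modulo N, x(n) is the input sequence and X(k) is the DFT. He [5] applied a 3-dimensional linear index map,

$$ n = \left\langle \tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + n_3 \right\rangle_N, \quad k = \left\langle k_1 + 2 k_2 + 4 k_3 \right\rangle_N \tag{2} $$

and the Common Factor Algorithm (CFA) to derive a set of 4 DFTs of length N/4 as

$$ X(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{N/4-1} \left[ H(k_1, k_2, n_3)\, W_N^{n_3 (k_1 + 2k_2)} \right] W_{N/4}^{n_3 k_3} \tag{3} $$

where n_1, n_2, n_3 are the index terms of the input sample n, k_1, k_2, k_3 are the index terms of the output sample k, and H(k_1, k_2, n_3) is expressed in eqn (4):

$$ H(k_1, k_2, n_3) = \left[ x(n_3) + (-1)^{k_1} x\!\left(n_3 + \tfrac{N}{2}\right) \right] + (-j)^{(k_1 + 2k_2)} \left[ x\!\left(n_3 + \tfrac{N}{4}\right) + (-1)^{k_1} x\!\left(n_3 + \tfrac{3N}{4}\right) \right] \tag{4} $$

Note that the order of the twiddle factors in eqn (3) is different from that of the radix-4 algorithm.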
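The decomposition described above can be checked numerically. The following is a minimal software sketch, not the hardware design: it assumes the index map n = (N/2)n_1 + (N/4)n_2 + n_3, k = k_1 + 2k_2 + 4k_3 and N a power of 4, and the function names `dft` and `radix22_dif_fft` are illustrative only.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def radix22_dif_fft(x):
    """Recursive radix-2^2 DIF FFT (assumes len(x) is a power of 4)."""
    n_pts = len(x)
    if n_pts == 1:
        return list(x)
    if n_pts % 4 != 0:
        raise ValueError("length must be a power of 4 in this sketch")
    q = n_pts // 4
    out = [0j] * n_pts
    for k1 in (0, 1):
        for k2 in (0, 1):
            seq = []
            for n3 in range(q):
                sign = (-1) ** k1
                # Two butterfly stages with only trivial multiplications (+-1, -j).
                h = (x[n3] + sign * x[n3 + 2 * q]) + (-1j) ** (k1 + 2 * k2) * (
                    x[n3 + q] + sign * x[n3 + 3 * q]
                )
                # Full twiddle factor W_N^{n3 (k1 + 2 k2)}.
                w = cmath.exp(-2j * cmath.pi * n3 * (k1 + 2 * k2) / n_pts)
                seq.append(h * w)
            # Remaining DFT of length N/4, computed by the same procedure.
            sub = radix22_dif_fft(seq)
            for k3 in range(q):
                out[k1 + 2 * k2 + 4 * k3] = sub[k3]
    return out
```

For a 16-point input, `radix22_dif_fft(x)` should agree with `dft(x)` up to floating-point rounding. This sketch writes results in natural order, whereas the pipelined processor streams them out bit-reversed.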
Applying this CFA procedure recursively to the remaining DFTs of length N/4 in eqn (3), the complete radix-2^2 Decimation-in-Frequency (DIF) FFT algorithm is obtained. The corresponding FFT flow graph for N=16 is shown in Fig. 1 [3]. Fig 2 outlines an implementation of the R2^2SDF architecture for N=1024. Note the similarity of the data path to R2SDF and the reduced number of multipliers. The implementation uses two types of butterflies: one is identical to that in R2SDF, and the other also contains the logic to implement the trivial twiddle factor multiplication, as shown in Fig 3 (i) and (ii) respectively [3].
Due to the spatial regularity of the radix-2^2 algorithm, the synchronization control of the processor is very simple. A (log2 N)-bit binary counter serves two purposes: synchronization controller and address counter for twiddle factor reading in each stage. With the help of the butterfly structures shown in Fig 3, the scheduled operation of the R2^2SDF processor in Fig 2 is as follows. During the first N/2 cycles, the 2-to-1 multiplexers in the first butterfly module switch to position "0" and the butterfly is idle. The input data from the left is directed to the shift registers until they are filled. During the next N/2 cycles, the multiplexers turn to position "1", and the butterfly computes a 2-point DFT of the incoming data and the data stored in the shift registers.
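The schedule above can be illustrated with a cycle-by-cycle behavioral model of the first butterfly stage. This is a minimal sketch under stated assumptions: the function name `sdf_butterfly_stage` is hypothetical, the multiplexer select is modeled as the top bit of a cycle counter, and zeros are fed in after the frame to flush the stored differences.

```python
from collections import deque


def sdf_butterfly_stage(frame, depth):
    """Stream one frame of 2*depth samples through a delay-feedback butterfly."""
    if len(frame) != 2 * depth:
        raise ValueError("frame must hold exactly 2*depth samples")
    shift_reg = deque([0] * depth)  # feedback shift registers (N/2 deep)
    outputs = []
    # Append `depth` zeros so the stored differences are pushed out.
    for cycle, sample in enumerate(list(frame) + [0] * depth):
        delayed = shift_reg.popleft()
        select = (cycle // depth) % 2  # models the counter bit driving the mux
        if select == 0:
            # Position "0": butterfly idle, input goes into the shift registers
            # while their previous contents stream out.
            outputs.append(delayed)
            shift_reg.append(sample)
        else:
            # Position "1": 2-point DFT of stored and incoming samples.
            outputs.append(delayed + sample)   # sum goes to the next stage
            shift_reg.append(delayed - sample) # difference is fed back
    return outputs[depth:]  # drop the initial fill cycles


sums_then_diffs = sdf_butterfly_stage(list(range(8)), 4)
```

With the 8-sample ramp above, the stage first emits the four sums x(n) + x(n + 4) and then the four differences x(n) - x(n + 4), which matches the order in which the following stage expects them.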
The butterfly outputs Z1(n) and Z1(n + N/2) are computed according to eqn (5):

$$ Z1(n) = x(n) + x(n + N/2), \quad Z1(n + N/2) = x(n) - x(n + N/2), \quad 0 \le n < N/2 \tag{5} $$

Z1(n) is sent on to apply the twiddle factors, and Z1(n + N/2) is sent back to the shift registers to be multiplexed out during the following N/2 cycles, when the first half of the next frame of the time sequence is loaded in. The operation of the second butterfly is similar to that of the first one, except that the "distance" of the butterfly input sequence is just N/4 and the trivial twiddle factor multiplication is implemented by real-imaginary swapping with a commutator and controlled add/subtract operations, as in Fig 3 (i), (ii), which requires a two-bit control signal from the synchronizing counter. The data then goes through a full complex multiplier, working at 75% utilization, which completes the first level of the radix-4 DFT word by word. Further processing repeats this pattern, with the distance of the input data halving at each consecutive butterfly stage. After N-1 clock cycles, the result of the complete DFT streams out to the right in bit-reversed order. The next frame can be transformed without pausing, owing to the pipelined processing of each stage.

The introduction of Verilog-based synthesis tools in 1988 by then-fledgling Synopsys and the 1989 acquisition of Gateway by Cadence Design Systems were important events that led to widespread use of the language [7].
VERILOG synthesis tools can create logic-circuit structures directly from VERILOG behavioral descriptions, and target them to a selected technology for realization. Using VERILOG, you can design, simulate, and synthesize anything from a simple combinational circuit to a complete microprocessor system on a chip.
VERILOG started out with, and still has, the following features [7]:
1) Designs may be decomposed hierarchically.
2) Each design element has both a well-defined interface and a precise functional specification.
3) Functional specifications can use either a behavioral algorithm or an actual hardware structure to define an element's operation. For example, an element can be defined initially by an algorithm, to allow design verification of higher-level elements that use it; later, the algorithmic definition can be replaced by a preferred hardware structure.
4) Concurrency, timing, and clocking can all be modeled. VERILOG handles asynchronous as well as synchronous sequential-circuit synthesis.
5) The logical operation and timing behavior of a design can be simulated.
Thus, VERILOG started out as a documentation and modeling language, allowing the behavior of digital-system designs to be precisely specified and simulated. The VERILOG language specification allows multiple modules to be stored in a single text file. When one VERILOG module instantiates another, the compiler finds the other by searching the current workspace, as well as predefined libraries, for a module with the instantiated name. Thus, when using VERILOG-1995, there should be only one definition of each module, usually in a file with the same name as the module.
However, VERILOG-2001 allows you to define multiple versions of each module, and it provides a separate configuration management facility that allows you to specify which one to use for each instantiation during a particular compilation or synthesis run. This lets you try out different approaches without throwing away or renaming your other efforts. All these features of VERILOG help in the simulation and synthesis of our proposed architecture.
IV. IMPLEMENTATION IN VERILOG
The R2^2SDF architecture presented above has been fully coded in the VERILOG Hardware Description Language (HDL). Once the design is coded in VERILOG, the Modelsim XEIII 6.2c compiler [8] and the Xilinx Foundation ISE Environment 9.1i [9] generate a net-list for FPGA configuration. The net-list can then be downloaded into the FPGA using the same Xilinx tools and a Texas Instruments prototyping board.
From the architecture of R2^2SDF in Fig 2, the butterfly blocks BF2I and BF2II are described as building blocks in VERILOG code. The Booth multiplication algorithm for signed binary numbers is used for the complex multipliers. Thus, the overall latency of the real implementation varies as the processing word length changes [3]. Look-up-table (LUT) based Random Access Memories (RAMs) and Flip-Flops are used to implement the feedback memory of the very last stages, whereas the Block RAMs in the FPGA are used for the rest of the stages. Similarly, LUT-based Read Only Memories (ROMs) are used to implement the twiddle ROMs of the very last stages, whereas Block RAMs are used for the rest of the stages [5]. The FFT is heavily pipelined to achieve the highest possible clock frequency. Twiddle factors are generated by an external program and embedded in the VERILOG code.
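As an illustration of generating twiddle factors with an external program, the sketch below produces a fixed-point table of W_N^k values. It is not the authors' generator: the function names, the 16-bit two's-complement scaling by 2^15 - 1, and the packed real/imaginary hex layout are all assumptions made for the example.

```python
import math


def twiddle_rom(n_points, word_bits=16):
    """Return (real, imag) two's-complement words for W_N^k, k = 0..N-1."""
    scale = (1 << (word_bits - 1)) - 1  # largest positive value, e.g. 32767
    mask = (1 << word_bits) - 1         # wraps negatives to two's complement
    rows = []
    for k in range(n_points):
        angle = -2.0 * math.pi * k / n_points
        real = round(math.cos(angle) * scale)
        imag = round(math.sin(angle) * scale)
        rows.append((real & mask, imag & mask))
    return rows


def to_hex_lines(rows, word_bits=16):
    """Format each entry as one hex word: real part followed by imaginary part."""
    digits = (word_bits + 3) // 4
    return ["{:0{d}x}{:0{d}x}".format(real, imag, d=digits) for real, imag in rows]


rom_lines = to_hex_lines(twiddle_rom(8))
```

Each resulting line is one ROM word. Text in this one-word-per-line hex form is the kind of input that a memory initialization task such as `$readmemh` can load, although the ordering of entries needed by each pipeline stage would have to follow the twiddle sequence of that stage.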
The implementation results on the Xilinx Spartan3 FPGA (fig 4) show that our implementation outperforms other implementations of its kind. Its speed nearly matches that of the Xilinx core, but its throughput is more than 3 times higher due to its pipelined nature.

The fundamental principle of the OFDM system is to decompose the high-rate data stream (bandwidth = W) into N lower-rate data streams and then to transmit them simultaneously over a large number of subcarriers [13]. The IFFT and the FFT are used for, respectively, modulating and demodulating the data constellations on the orthogonal subcarriers [14].
In an OFDM system, the transmitter and receiver blocks contain the FFT modules, as shown in Fig 5 (a) and (b). The FFT processor must finish the transform within 312.5ns to serve its purpose in the OFDM system. Our FFT architecture fits into the system effectively, since it has a minimum required time period of 10.827ns (table 1(b)).
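The two timing figures quoted above can be related with simple arithmetic. The sketch below assumes that the 10.827ns figure is the minimum clock period reported by synthesis. The helper names are illustrative.

```python
def max_clock_mhz(min_period_ns):
    """Maximum clock frequency in MHz for a given minimum clock period in ns."""
    return 1e3 / min_period_ns


def whole_cycles_in(window_ns, period_ns):
    """Number of complete clock periods that fit in a time window."""
    return int(window_ns // period_ns)


f_max = max_clock_mhz(10.827)             # about 92.36 MHz
cycles = whole_cycles_in(312.5, 10.827)   # 28 complete clock periods
```

Under that assumption, the design could be clocked at roughly 92 MHz, and 28 complete clock periods fit into a 312.5ns window.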
An OFDM carrier signal is the sum of a number of orthogonal sub-carriers, with baseband data on each subcarrier being independently modulated commonly using some type of quadrature amplitude modulation (QAM) or phase-shift keying (PSK) [15] . This composite baseband signal is typically used to modulate a main RF carrier. s[n] is a serial stream of binary digits. By inverse multiplexing, these are first demultiplexed into N parallel streams, and each one mapped to a (possibly complex) symbol stream using some modulation constellation (QAM, PSK, etc.). Note that the constellations may be different, so some streams may carry a higher bit-rate than others.
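The mapping and transform steps can be sketched in software as a baseband round trip. This is a minimal sketch under simplifying assumptions: every subcarrier uses the same QPSK constellation, the channel is ideal, and direct O(N^2) transforms stand in for the IFFT and FFT. The constellation table and function names are illustrative.

```python
import cmath

# Assumed Gray-coded QPSK constellation: two bits per subcarrier.
QPSK = {(0, 0): 1 + 1j, (0, 1): -1 + 1j, (1, 1): -1 - 1j, (1, 0): 1 - 1j}


def map_bits(bits):
    """Split the serial bit stream into pairs and map each pair to a symbol."""
    return [QPSK[pair] for pair in zip(bits[0::2], bits[1::2])]


def idft(freq):
    """Inverse DFT: turns one symbol per subcarrier into time samples."""
    n_pts = len(freq)
    return [
        sum(freq[k] * cmath.exp(2j * cmath.pi * k * n / n_pts) for k in range(n_pts)) / n_pts
        for n in range(n_pts)
    ]


def dft(time):
    """Forward DFT: recovers the per-subcarrier symbols at the receiver."""
    n_pts = len(time)
    return [
        sum(time[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def demap(symbols):
    """Pick the nearest constellation point and emit its bits."""
    bits = []
    for sym in symbols:
        nearest = min(QPSK, key=lambda pair: abs(QPSK[pair] - sym))
        bits.extend(nearest)
    return bits


tx_bits = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1]
rx_bits = demap(dft(idft(map_bits(tx_bits))))
```

Sixteen input bits occupy eight subcarriers here, and over an ideal channel `rx_bits` equals `tx_bits`. A real system would add a cyclic prefix, RF up-conversion and channel equalization, which are omitted from this sketch.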
The receiver picks up the signal r(t), which is then quadrature-mixed down to baseband using cosine and sine waves at the carrier frequency. This also creates signals centered on 2f_c, so low-pass filters are used to reject these. The baseband signals are then sampled and digitized using analogue-to-digital converters (ADCs), and a forward FFT is used to convert back to the frequency domain [15].

OFDMA can be seen as an alternative to combining OFDM with time division multiple access (TDMA) or time-domain statistical multiplexing, i.e. packet mode communication. Low-data-rate users can send continuously with low transmission power instead of using a "pulsed" high-power carrier. Constant delay, and shorter delay, can be achieved. OFDMA can also be described as a combination of frequency domain and time domain multiple access, where the resources are partitioned in the time-frequency space, and slots are assigned along the OFDM symbol index as well as the OFDM sub-carrier index. OFDMA is considered highly suitable for broadband wireless networks, due to advantages including scalability, MIMO-friendliness, and the ability to take advantage of channel frequency selectivity [15]. The transceiver structure of OFDMA is shown in fig 7.
VII. PERFORMANCE AND IMPLEMENTATION
A. Hardware Requirement
The radix-4 butterfly needs 3 complex adders and 1 complex multiplier [16], while the proposed butterfly structure needs only 4 complex adders and 1 complex multiplier. This is because our design implements the constant multiplier with 4 reused complex adders. Fig. 8 shows the memory addressing of SRAM0 and SRAM1 for N=64. All of the above-mentioned designs split a single SRAM into 2 smaller SRAMs. This design can double the SRAM throughput with interleaved access. In table 2, the hardware requirement of the proposed design is compared with various pipelined designs.
B. Power Consumption
The power consumption is measured by the number of data transitions, which is proportional to the number of SRAM accesses. Here we assume that the adders and multipliers are active in every clock cycle because of the pipelined architecture. The more SRAM accesses there are, the higher the power consumption. Fig. 9 shows the number of SRAM accesses versus the number of FFT points N. The number of SRAM accesses is linear in the number of recursive iterations in the FFT, as described in Eq (6). The SRAM is accessed twice in each clock cycle, so Eq (6) is multiplied by 2.
It shows that the proposed design needs 20 to 40% fewer memory accesses than the radix-4 FFT. Therefore, the proposed architecture consumes correspondingly less power.
C. Speed
With a fixed clock frequency, the OFDM symbol rate that can be processed decreases as the FFT size N increases. A comparison at a fixed clock frequency of 50 MHz is shown in Fig. 10, based on Eq. (7). It shows that the proposed architecture is better than the radix-4 FFT by 25 to 66%.
$$ \text{symbol rate} = \frac{(\text{clock frequency}) \times N}{\text{total cycles of each } N\text{-point FFT}} \tag{7} $$

For a fixed analog-to-digital converter sampling rate, the OFDM symbol rate in the receiver is fixed too. A higher-point FFT then requires a higher chip clock frequency. We compare the clock frequency against the N-point FFT size as shown in Eq. (8).
D. Implementation
The Fujitsu Mobile WiMAX™ SoC, MB86K22, fully complies with the IEEE 802.16e standard using an OFDMA PHY [21] . This fully integrated baseband IC was built using the Fujitsu 65nm advanced CMOS low-leakage process technology. The operating power of the MB86K22 has been reduced by 36 percent from the previous generation. Power-gating technology shuts down the power supply in the unused blocks inside the device, so that the entire mobile WiMAX module consumes only 0.5mA, extending battery life. The chip supports multiple operating frequency and channel bandwidth profiles, as well as various subcarrier allocation schemes.
The module would be able to support multiple channel bandwidths such as 3.5MHz, 5MHz, 7MHz, 10MHz and 20MHz, and all the popular frequency bands such as 2.3GHz, 2.5GHz and 3.5GHz. The dual-core processor architecture best supports application-rich Mobile WiMAX operations. We have implemented the FFT, designed for efficient mobile WiMAX, with a 16-bit word length and synthesized it with the Xilinx ISE 9.1 design compiler.
VIII. CONCLUSIONS
We have proposed a memory-based recursive FFT design which has a much lower gate count, lower power consumption and higher speed. The proposed architecture has three main advantages: (1) fewer butterfly iterations, which reduce power consumption, (2) pipelining of the radix-2^2 butterfly, which raises the clock frequency, and (3) even distribution of memory accesses, which makes efficient use of the SRAM ports.
In summary, the speed performance of our design easily satisfies most application requirements of mobile WiMAX 802.16e, which uses an OFDMA-modulated wireless communication system. Our design also occupies less area, and hence has lower cost and power consumption.
