Abstract: OFDM systems are widely used in 4G and UWB communication systems. An OFDM system, which comprises an IFFT and a modulator, increases the complexity of hardware implementation. This paper discusses the design and implementation of an autocorrelator and a CORDIC processor for OFDM, optimized for power and delay. The autocorrelator is used for frame detection and carrier frequency offset estimation. The CORDIC is used to estimate the frequency offset and to compute the division in the channel estimation algorithm. HDL models and test benches are developed to simulate and verify the functionality of both modules. The design is implemented on a Xilinx FPGA and as a semicustom ASIC targeting 130 nm technology. The autocorrelator design is optimized to occupy 5684 cells (13969.5 μm² of cell area) with power reduced to 78.7 μW. The CORDIC architecture occupies 2288 μm² of total cell area, with a total area of 2558 μm² and a power consumption of 12.05 μW.
I. INTRODUCTION
Today's communication standards such as HiperLAN/2, IEEE 802.11a/g, IEEE 802.16 and digital video broadcasting (DVB) already use orthogonal frequency division multiplexing (OFDM) for digital modulation. OFDM is a modulation technique that addresses the reliability and speed of the network. With these advantages, OFDM has been adopted by many wireless standards such as DAB, DVB, wireless LAN, and WMAN. MIMO, on the other hand, employs multiple antennas at the transmitter and receiver to open up additional sub-channels in the spatial domain. Since parallel channels are established over the same time and frequency, high data rates are achieved without extra bandwidth. Due to this bandwidth efficiency, MIMO is included in the standards of future broadband wireless access (BWA). Overall, these benefits have made the combination of MIMO-OFDM an attractive technique for future high data rate systems [1], [2].
II. ORTHOGONAL FREQUENCY DIVISION MULTIPLEXING
Orthogonal frequency division multiplexing (OFDM) is a combination of modulation and multiplexing. The signal is first split into independent channels, each of which is modulated by data, and the channels are then re-multiplexed to create the OFDM carrier. Orthogonality of the sub-carriers is the central concept in OFDM: it allows simultaneous transmission on many sub-carriers in a tight frequency space without mutual interference, which is a distinct advantage of OFDM [3], [4], [5]. OFDM is therefore becoming the preferred modulation technique for wireless communication, since it provides high data rates with sufficient robustness to radio channel impairments. In an OFDM scheme, the available transmission bandwidth is divided among a large number of orthogonal, overlapping narrowband sub-channels or sub-carriers that are transmitted in parallel. The minimal separation of the sub-carriers gives compact and efficient spectral utilization. The main attraction of OFDM is how well the system handles multipath interference at the receiver. Fig. 1 shows the block diagram of the transmitter and receiver sections of an OFDM system [1]. The blocks circled in Fig. 1 are the focus of this work, and their details are provided in the following sections.
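The orthogonality property described above can be checked numerically. The following is a minimal sketch and not part of the original design: it assumes N = 8 sub-carriers modeled as sampled complex exponentials, and the helper names `subcarrier` and `inner_product` are hypothetical, introduced only for illustration.

```python
import cmath

def subcarrier(k, n_points):
    """Samples of sub-carrier k: exp(j*2*pi*k*n/N) for n = 0 .. N-1."""
    return [cmath.exp(2j * cmath.pi * k * n / n_points) for n in range(n_points)]

def inner_product(a, b):
    """Normalized inner product (1/N) * sum(a[n] * conj(b[n]))."""
    return sum(x * y.conjugate() for x, y in zip(a, b)) / len(a)

N = 8  # assumed number of sub-carriers, chosen only for illustration

# A sub-carrier correlated with itself gives unit magnitude.
same = abs(inner_product(subcarrier(3, N), subcarrier(3, N)))
# Two different sub-carriers give (numerically) zero: no mutual interference.
different = abs(inner_product(subcarrier(3, N), subcarrier(5, N)))

print(round(same, 6), round(different, 6))
```

The second value is zero to within floating-point error, which is why overlapping sub-carriers can be separated again at the receiver.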
III. AUTOCORRELATION
Autocorrelation is the cross-correlation of a signal with itself. It is a measure of how well a signal matches a time-shifted version of itself, as a function of the time shift. Autocorrelation is useful for finding repeating patterns in a signal, such as detecting a periodic signal buried in noise or identifying the missing fundamental frequency implied by the harmonic frequencies of a signal. It involves only one signal and provides information about the structure of the signal or its behavior in the time domain.
In the receiver, the radio frequency (RF) data is converted to digital data and down-converted to the intermediate frequency (IF) band. The down-converted signal needs to be synchronized with the carrier signal. Any mismatch between the transmitter carrier and the receiver carrier is rectified in the synchronizer unit, which consists of an autocorrelator. The autocorrelation is computed according to equation (1).
In this design, the samples of the incoming signal are autocorrelated, the peak amplitude point is detected, and the time delay between the transmitter and receiver carriers is computed. This time difference is used to synchronize the receiver with the transmitter module. The synchronized data is fed into the FFT module for demodulation. The incoming data, 128 samples each in 8-bit signed representation, is delayed and correlated with the incoming signal. Thus the two signals (the incoming signal and its time-shifted version) are stored in a memory and processed. Fig. 2 shows the block diagram of the autocorrelator design. The design is written in Verilog HDL and simulated using ModelSim.
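For reference, a standard form of the discrete autocorrelation of a real-valued sequence is given below. This is the textbook definition, stated here as an assumption about the form used in the design, with x(n) the incoming samples, N = 128 the number of samples and k the time shift (lag):

```latex
R_{xx}(k) = \sum_{n=k}^{N-1} x(n)\, x(n-k), \qquad k = 0, 1, \dots, N-1
```

Each value of k corresponds to one shift of the stored samples relative to the incoming signal.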
The autocorrelator is designed so that "N" inputs can be fed in. The output port, named "line", delivers the autocorrelated output. In this design a RAM is modeled to hold 128 samples, each 8 bits wide. The RAM address is generated by a counter, and the outputs of the RAM are "a" and "b". If the address of "a" is [addr], then the address of "b" is one position earlier, i.e. [addr-1]. Therefore, when the first sample enters the RAM, it is stored in register "a" and register "b" is 0. When the second sample enters the RAM, the previous value of register "a" is moved to "b" and "a" takes the new value. This is in effect a shift operation: each sample is shifted by one position, i.e. 8 bits. Writing takes place only while the write enable pin ("we") is high, and "we" goes low once the RAM is full. The reset pin is held high for 10 ns and then driven low. The clock period is 100 ns. Writing occurs on the falling edge of the clock to save write time. For 128 samples, the design therefore performs 127 forward shifts and produces 127 outputs serially on the single output port "line". The accumulator is fed from an intermediate register only on a clock event and accumulates the result; the results are shown in Fig. 3. The multiplier and the adder are the two major power-consuming blocks, so several multiplier architectures were analyzed for area and power.
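The shift, multiply and accumulate behavior described above can be sketched as a software model. This is a minimal behavioral sketch under stated assumptions, not the Verilog design itself: it assumes that each of the N-1 shifts yields one accumulated sum of products, and the function name `autocorrelate` is hypothetical. The names `ram`, `a`, `b` and `line` mirror the signals named in the text.

```python
def autocorrelate(samples):
    """Return one accumulated product sum per shift k = 1 .. N-1."""
    n_samples = len(samples)
    for s in samples:
        if not -128 <= s <= 127:
            raise ValueError("sample out of 8-bit signed range")

    ram = list(samples)   # models the RAM holding the 8-bit samples
    line = []             # models the serial output port "line"

    for shift in range(1, n_samples):
        acc = 0                        # accumulator cleared for each shift (assumption)
        for addr in range(shift, n_samples):
            a = ram[addr]              # current sample
            b = ram[addr - shift]      # time-shifted sample
            acc += a * b               # multiply-accumulate
        line.append(acc)
    return line

# Small example with four samples instead of 128.
print(autocorrelate([1, 2, 3, 4]))  # -> [20, 11, 4]
```

For 128 input samples the model returns 127 values, matching the 127 shifts and 127 serial outputs described for the hardware.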
A. FPGA Implementation of Multipliers and Adders
The designs were synthesized using Xilinx ISE version 10.1 and implemented on a Spartan-II board. The synthesis reports for the multipliers are presented in Table I. The FPGA implementation results show that the Wallace tree multiplier consumes less power with minimum delay, at the cost of increased area, and it was therefore chosen for the autocorrelator design. Ripple carry adders were employed because they consume less power and area than the other adders considered.
B. Simulation Results of Autocorrelator
The autocorrelator design uses 68 of the 768 slices available on the Spartan-3 device (about 8.9%), which is economical in area. It uses 35 of the 124 available bonded IOBs. The design operates at a frequency of 134.252 MHz. The design therefore occupies little area and performs the autocorrelation of 128 samples with 127 shifts. The results were compared with a MATLAB simulation and matched exactly. Fig. 3 and Fig. 4 present the MATLAB and HDL simulation results.
IV. COORDINATE ROTATION DIGITAL COMPUTER (CORDIC)
CORDIC is a simple and efficient algorithm for calculating hyperbolic and trigonometric functions. All of the trigonometric functions can be computed or derived from functions using vector rotations. Vector rotation can also be used for polar-to-rectangular and rectangular-to-polar conversions, for vector magnitude, and as a building block in certain transforms such as the DFT and DCT [7], [8]. The CORDIC algorithm provides an iterative method of performing vector rotations by arbitrary angles using only shifts and additions. The algorithm, credited to Volder [2], is derived from the general (Givens) rotation transform shown in equations (2) and (3). The basic equations required to implement CORDIC are:
$$x' = x\cos\phi - y\sin\phi \tag{2}$$
$$y' = y\cos\phi + x\sin\phi \tag{3}$$
Equations (2) and (3) can be rearranged and written in iterative form as follows:
$$x' = \cos\phi\,[x - y\tan\phi] \tag{4}$$
$$y' = \cos\phi\,[y + x\tan\phi] \tag{5}$$
$$x_{i+1} = K_i\,[x_i - y_i\,d_i\,2^{-i}] \tag{6}$$
$$y_{i+1} = K_i\,[y_i + x_i\,d_i\,2^{-i}] \tag{7}$$
where $K_i = \cos(\tan^{-1} 2^{-i}) = 1/\sqrt{1 + 2^{-2i}}$, $d_i = \pm 1$, the product of the $K_i$ terms approaches $K \approx 0.60725$ as the number of iterations increases, and
$$z_{i+1} = z_i - d_i \tan^{-1}(2^{-i}) \tag{8}$$
If the rotation angles are restricted so that $\tan\phi = \pm 2^{-i}$, the multiplication by the tangent term is reduced to a simple shift operation, which gives the iterative form in (6) and (7).
According to [4] and [5], the CORDIC processor can be configured to work in circular rotation mode, circular vectoring mode, and linear vectoring mode. The circular vectoring mode is used to calculate the coarse and fine carrier frequency offset (CFO) estimates. The circular rotation mode is used to correct the coarse CFO of the broadcast preamble, and to correct the received section C and the received OFDM symbols. The linear vectoring mode is used to calculate the division in the channel estimation stage.
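The three operating modes can be illustrated with a short floating-point model. This is a minimal sketch under stated assumptions rather than the pipelined hardware: it assumes 16 iterations, uses floating-point multiplication by 2^-i where the hardware would shift, and applies the scale factor K once at the end. The names `N_ITER`, `cordic_circular` and `cordic_divide` are hypothetical.

```python
import math

N_ITER = 16  # assumed number of iterations, not taken from the design

# Elementary rotation angles atan(2^-i) and the aggregate scale factor K.
ANGLES = [math.atan(2.0 ** -i) for i in range(N_ITER)]
K = 1.0
for i in range(N_ITER):
    K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))

def cordic_circular(x, y, z, vectoring):
    """Circular CORDIC; returns (x, y, z) with x and y scaled by K."""
    for i in range(N_ITER):
        if vectoring:
            d = 1.0 if y < 0 else -1.0   # vectoring: drive y toward zero
        else:
            d = 1.0 if z >= 0 else -1.0  # rotation: drive z toward zero
        # Both updates use the previous x and y (tuple assignment).
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x * K, y * K, z

def cordic_divide(y, x):
    """Linear vectoring mode: approximates y / x for x > 0, |y / x| < 2."""
    z = 0.0
    for i in range(N_ITER):
        d = 1.0 if y < 0 else -1.0
        y += d * x * 2.0 ** -i
        z -= d * 2.0 ** -i
    return z

# Rotation mode: rotate (1, 0) by 30 degrees, giving roughly (cos, sin).
cos30, sin30, _ = cordic_circular(1.0, 0.0, math.pi / 6, vectoring=False)
# Vectoring mode: magnitude and phase of (1, 1), as used for angle estimation.
mag, _, phase = cordic_circular(1.0, 1.0, 0.0, vectoring=True)
# Linear vectoring mode: division, as used in channel estimation.
quotient = cordic_divide(3.0, 4.0)
print(cos30, sin30, mag, phase, quotient)
```

In vectoring mode the accumulated angle in z is the phase of the input vector, which is the quantity needed for a frequency offset estimate. With 16 iterations the results agree with the exact values to roughly four decimal places.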
A. Pipelined CORDIC Architecture Design
Fig. 5 shows the fully pipelined fast iterative CORDIC architecture [8] adopted in this design. An iterative CORDIC architecture can be obtained simply by duplicating each of the difference equations given above in hardware. The decision function is driven by the sign of the "Z" register in rotation mode or of the "Y" register in vectoring mode. In operation, the initial values are loaded into the "X" and "Y" registers. Then, on each of the next "n" clock cycles, the values from the registers are passed through the shifters and adder-subtractors and the results are placed back in the registers.
B. Simulation Results of CORDIC Design
This section discusses the simulation results for the CORDIC block. The design uses 112 of the 960 slices available on the Spartan-3 device, a utilization of about 11.7%. The operating frequency of the pipelined fast CORDIC architecture is 219.443 MHz. Fig.5 shows the simulation result of the pipelined CORDIC architecture run with an exhaustive test bench.
V. ASIC IMPLEMENTATION RESULTS
Both the autocorrelator and the CORDIC were modeled in Verilog HDL and synthesized for implementation on a Xilinx Spartan-3 FPGA. Both architectures were then taken through the ASIC design flow to check setup and hold timing in 130 nm technology. The Design Compiler tool was used to generate the timing analysis, along with area and power details, for both architectures. The autocorrelator design uses 5684 cells with a cell area of 13969.5 μm² and a total area of 16609.6 μm², consuming 78.7 μW. The CORDIC design uses 552 cells with a total cell area of 2288 μm² and a total area of 2558 μm², consuming 12.05 μW. The SDC files created during the Design Compiler flow were taken as inputs to physical design. Prior to physical design, the PrimeTime tool was used to perform timing analysis and check setup and hold times.
Fig. 5 Fast CORDIC pipeline [9]
VI. CONCLUSION
An autocorrelator was designed to perform the autocorrelation of 128 samples, each 8 bits wide. The building blocks of the design were a 128x8 RAM, a multiplier, an accumulator and a counter. MATLAB simulations were performed prior to the Verilog HDL coding to verify the functionality.
A fully pipelined CORDIC processor was designed in Verilog HDL and synthesized. An exhaustive test bench was also written to simulate the functionality of the processor. The CORDIC processor worked as expected, giving exact values for high inputs. For future work, a bit-serial iterative CORDIC architecture can be adopted. In a bit-serial iterative CORDIC architecture, the interconnections can be made minimal and the logic between registers is simple, making it even more efficient.
