INTRODUCTION
Recently, orthogonal frequency division multiplexing (OFDM) has gained considerable interest due to its high spectral efficiency and its ability to cope with the severe channel impairments encountered in wireless transmission. Consequently, OFDM has been considered in several standards and applications [1] [2]. Despite these desirable features, OFDM systems are very sensitive to synchronization errors, particularly carrier frequency offsets (CFO). A CFO introduces inter-carrier interference (ICI) through the loss of orthogonality among the subcarriers, which dramatically degrades the performance of the whole system. Therefore, the inner receiver must estimate and compensate the unknown CFO. Accurate CFO estimation is critical for the communication performance of OFDM systems because only offsets that are a small fraction of the subcarrier spacing can be tolerated. For example, a signal-to-interference ratio (SIR) of 20 dB is achieved for frequency offsets less than 4% of the subcarrier spacing [6].
The CFO can be estimated by inserting known information at the transmitter, such as pilots or training sequences. Such methods, usually known as data-aided (DA), are power and bandwidth inefficient. On the other hand, non-data-aided (NDA) or blind methods are power and bandwidth efficient, since the estimation is performed using only the unknown data symbols or by exploiting the intrinsic structure of the OFDM symbol. Constructing a robust, accurate and efficient blind CFO estimator is a very challenging task, and it has been considered in many publications [5]-[8]. For example, the cyclic prefix based estimator (CPE) utilizes the CP redundancy in the OFDM symbol structure, where the CFO is estimated by maximizing the sliding correlation between the data points at the start and end of the symbol. Although this scheme has a low computational complexity, its performance is very sensitive to multipath fading channels because part of the CP is corrupted by inter-symbol interference (ISI). Recently, two blind CFO estimators for OFDM systems have been proposed in [7] and [8]. These estimators are based on minimizing the power difference between either adjacent subcarriers in the same OFDM symbol or the same subcarrier in consecutive OFDM symbols, and they are called power difference estimator (PDE)-Frequency and PDE-Time, respectively. As frequency domain methods, the PDE estimators have higher computational requirements than the CPE. However, they offer accurate and robust performance even in severe frequency selective fading channels compared with other blind CFO estimation techniques.
The rapid increase in the processing requirements of state-of-the-art wireless devices has exceeded the speed of current digital signal processors (DSPs), which cannot fulfil the system throughput requirements. Therefore, FPGAs have emerged as an attractive option for the implementation of highly complex signal processing due to their performance and configurability [4]. In addition, the implemented system can benefit from the highly parallel architecture and the embedded multipliers of an FPGA solution, which is far more cost efficient than an application specific integrated circuit (ASIC) implementation.
Due to their robust performance, and in contrast to the CPE, the PDE estimators can be utilized to estimate the fractional CFO in both the acquisition and tracking phases of the synchronization process. Hence, this paper investigates an efficient hardware architecture for prototyping the PDE CFO estimator on an FPGA device. A model based design using Xilinx System Generator (XSG) has been used. The high level design flow that maps the considered CFO algorithm into a flexible and configurable hardware architecture has been investigated. The design flow includes the mapping of the algorithmic steps to the XSG FPGA intellectual property (IP) components. The proposed architecture designs are evaluated in terms of estimation accuracy and the required hardware resources. The rest of this paper is organized as follows. Section II introduces the basic idea of the PDE-Frequency and PDE-Time CFO estimators for OFDM systems. Section III describes the structure of the hardware that is used to prototype the considered CFO estimation algorithm on FPGA. Section IV presents the evaluation results of the proposed CFO estimation architecture.
BLIND CFO ESTIMATOR FOR OFDM SYSTEMS
In this section, the algorithm of the PDE CFO estimators for OFDM systems is presented. Using the PDE method, the CFO is estimated from the frequency domain data by measuring the power difference either between adjacent subcarriers in the same OFDM symbol or between the same subcarrier in consecutive OFDM symbols. This section presents these two realizations of the PDE scheme for blind CFO estimation in OFDM systems. In the noise-free case and with perfect CFO compensation, i.e. $\hat{\epsilon} = \epsilon$, the FFT output on the kth subcarrier is the product of the channel frequency response and the data symbol on that subcarrier, so its power is the product of the channel power gain and the data symbol power. For constant modulus (CM) constellations, the data symbol power is constant and is assumed to be normalized to one, so the power of the kth FFT output reduces to the channel power gain at subcarrier k.
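As a quick numerical check of this relation, the following minimal sketch generates one QPSK OFDM symbol, passes it through a channel, applies a CFO, compensates it perfectly and compares the FFT output power with the channel power gain. The channel taps `h`, the seed and all variable names are illustrative assumptions; N = 64 and a CP of 16 samples follow the test setup reported in the prototyping results.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Ng = 64, 16                      # FFT size and cyclic prefix length
eps = 0.4                           # normalized CFO (in subcarrier spacings)
h = np.array([1.0, 0.5, 0.2j])      # hypothetical channel taps, shorter than the CP

# Unit-power QPSK symbols (constant modulus) on all N subcarriers
d = np.exp(1j * (np.pi / 4 + np.pi / 2 * rng.integers(0, 4, N)))
x = np.fft.ifft(d)
tx = np.concatenate([x[-Ng:], x])   # prepend the cyclic prefix

# Channel filtering followed by the CFO-induced phase rotation
n = np.arange(N + Ng)
rx = np.convolve(tx, h)[:N + Ng] * np.exp(2j * np.pi * eps * n / N)

# Perfect compensation (trial offset equal to the true CFO), CP removal, FFT
R = np.fft.fft((rx * np.exp(-2j * np.pi * eps * n / N))[Ng:])
H = np.fft.fft(h, N)                # channel frequency response

# With constant modulus data, |R_k|^2 should equal |H_k|^2 on every subcarrier
power_matches_channel = np.allclose(np.abs(R) ** 2, np.abs(H) ** 2)
print(power_matches_channel)
```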
A. PDE-Frequency
By assuming that the channel changes slowly in the frequency domain, the channel frequency response over any two adjacent subcarriers is approximately equal, and thus so are the powers of the corresponding FFT outputs. However, the approximation given in (4) becomes an equality only when the CFO is estimated and compensated perfectly. Consequently, by minimizing the power difference between all adjacent subcarriers over the OFDM block, the cost function in (5) can be formulated [10], where L is the number of OFDM blocks used in the estimation process. The CFO estimate is obtained by minimizing (5) over µ, where µ denotes the trial values of the CFO.
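A minimal sketch of how such a cost function could be evaluated in software is given below. It assumes that the cost is the sum of squared power differences between adjacent subcarriers over all blocks; the exact form used in [10] may differ. The helper `ofdm_rx` is a hypothetical test-signal generator, and the function and parameter names are illustrative.

```python
import numpy as np

def ofdm_rx(num_symbols, N, Ng, h, eps, rng):
    """Hypothetical test signal: QPSK OFDM symbols through channel h, then CFO eps."""
    d = np.exp(1j * (np.pi / 4 + np.pi / 2 * rng.integers(0, 4, (num_symbols, N))))
    x = np.fft.ifft(d, axis=1)
    tx = np.hstack([x[:, -Ng:], x]).ravel()          # add CP, serialize the symbols
    rx = np.convolve(tx, h)[:tx.size]                # multipath channel
    return rx * np.exp(2j * np.pi * eps * np.arange(tx.size) / N)

def fft_after_compensation(rx, mu, N, Ng):
    """Compensate the trial offset mu, drop the CP and take an N-point FFT per block."""
    comp = rx * np.exp(-2j * np.pi * mu * np.arange(rx.size) / N)
    return np.fft.fft(comp.reshape(-1, N + Ng)[:, Ng:], axis=1)

def pde_frequency_cost(rx, mu, N, Ng):
    """Assumed cost: squared power difference of adjacent subcarriers, summed over blocks."""
    P = np.abs(fft_after_compensation(rx, mu, N, Ng)) ** 2
    return np.sum((P[:, :-1] - P[:, 1:]) ** 2)

# Example: the cost is smallest when the trial offset matches the true CFO
rng = np.random.default_rng(1)
rx = ofdm_rx(50, 64, 16, np.array([1.0, 0.3]), 0.4, rng)
for mu in (0.0, 0.25, 0.4):
    print(mu, pde_frequency_cost(rx, mu, 64, 16))
```

In a frequency selective channel the cost does not vanish at the true offset, because adjacent channel gains are only approximately equal, but it should remain well below the values obtained at mismatched trial offsets.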
B. PDE-Time
This scheme is based on the assumption that the channel changes slowly in the time domain. Consequently, the channel frequency response at subcarrier k over two consecutive OFDM symbols is approximately constant, and therefore so is the power of the corresponding FFT outputs. Again, this is valid only if the CFO is perfectly compensated. Therefore, by minimizing the power difference between the same subcarrier in consecutive OFDM symbols, the cost function in (8) can be formulated [11], and the CFO estimate is obtained by minimizing (8) over µ. The cost functions given in (5) and (8) can be further simplified.
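The time-domain variant can be sketched in the same way. The block below assumes that the cost is the sum of squared power differences of the same subcarrier in consecutive OFDM symbols, which may differ in detail from the expression in [11]. The test signal, channel taps and names are illustrative assumptions.

```python
import numpy as np

def pde_time_cost(rx, mu, N, Ng):
    """Assumed cost: squared power difference of each subcarrier across consecutive symbols."""
    comp = rx * np.exp(-2j * np.pi * mu * np.arange(rx.size) / N)   # trial compensation
    P = np.abs(np.fft.fft(comp.reshape(-1, N + Ng)[:, Ng:], axis=1)) ** 2
    return np.sum((P[:-1] - P[1:]) ** 2)

# Hypothetical test signal: QPSK OFDM over a static multipath channel with CFO eps
rng = np.random.default_rng(2)
N, Ng, num_symbols, eps = 64, 16, 20, 0.4
h = np.array([1.0, 0.5, 0.2j])
d = np.exp(1j * (np.pi / 4 + np.pi / 2 * rng.integers(0, 4, (num_symbols, N))))
x = np.fft.ifft(d, axis=1)
tx = np.hstack([x[:, -Ng:], x]).ravel()
rx = np.convolve(tx, h)[:tx.size] * np.exp(2j * np.pi * eps * np.arange(tx.size) / N)

# Cost at the three curve-fitting trial points and at the true offset
costs = {mu: pde_time_cost(rx, mu, N, Ng) for mu in (-0.25, 0.0, 0.25, eps)}
print(costs)
```

Because the channel in this sketch does not change between symbols, the cost at the true offset is zero up to rounding, even though the channel is frequency selective.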
The simplified cost functions can be approximated by a simple sinusoid in µ whose constants A and C are independent of µ, with A<0. Since the cost function is sinusoidal, the minimization process, which is usually performed using methods such as line search or gradient descent, can be replaced by curve fitting [10] [11]. The curve fitting method leads to a closed-form estimate of $\epsilon$ by evaluating (5) and (8) at three special trial points, namely µ = −1/4, 0 and 1/4. The CFO estimate $\hat{\epsilon}$ is then computed from these three cost function values.
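The three-point fit can be illustrated with a short sketch. It assumes the sinusoid has the form J(µ) = A cos(2π(ε − µ)) + C with A < 0, which has period one in µ and is minimized at µ = ε; this form and the function name `three_point_cfo` are assumptions made for illustration. Under this model, J(1/4) − J(−1/4) = 2A sin(2πε) and 2J(0) − J(1/4) − J(−1/4) = 2A cos(2πε), so the angle 2πε follows from a four-quadrant arctangent.

```python
import math

def three_point_cfo(j_minus, j_zero, j_plus):
    """Closed-form CFO estimate from J(-1/4), J(0), J(1/4) under the assumed model."""
    s = j_plus - j_minus                  # equals 2*A*sin(2*pi*eps)
    c = 2.0 * j_zero - j_plus - j_minus   # equals 2*A*cos(2*pi*eps)
    # A < 0, so both terms are negated to recover the angle 2*pi*eps
    return math.atan2(-s, -c) / (2.0 * math.pi)

# Synthetic check on an exact sinusoid with hypothetical constants A and C
A, C, eps = -3.0, 5.0, 0.4
J = lambda mu: A * math.cos(2.0 * math.pi * (eps - mu)) + C
estimate = three_point_cfo(J(-0.25), J(0.0), J(0.25))
print(estimate)
```

The arctangent step is the operation that the CORDIC module performs in the hardware architecture. The estimate is unambiguous only for offsets with magnitude below one half of the subcarrier spacing.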
ARCHITECTURE DESIGN
In this section, the architecture design for prototyping the PDE CFO estimator on an FPGA device is described. The Xilinx System Generator (XSG) design flow has been used to map the considered algorithm into an FPGA hardware realization. Using Matlab/Simulink combined with XSG as the main design tool provides a simple and intuitive hardware design and simulation environment for DSP algorithms. XSG bridges the gap between simulation and implementation by rapidly prototyping Matlab co-simulated algorithms into blocks running on the FPGA, providing a balance between hardware abstraction level and real-time system implementation.
A. Parallel-Stream Architecture (PSA)
As described above, the PDE estimators require the cost function to be evaluated at three trial carrier frequency offsets, namely J(-1/4), J(0) and J(1/4). Exploiting the parallelism of the FPGA, the cost function for the three trial points can be computed in parallel. A direct mapping of the PDE therefore results in three streams that process the input data using three parallel FFT cores. The schematic diagram of the proposed parallel-stream architecture for the implementation of the PDE estimator is shown in Fig. 1. The CFO-compensated streams used to compute J(-1/4) and J(1/4) are generated by heterodyning the output of the Xilinx direct digital synthesizer (DDS) core with the received OFDM data. The DDS generates a complex sinusoid at the trial CFO, which is used to compensate the impaired received OFDM data using a complex multiplier. Both streams can be generated using a single DDS, where the second input stream is generated by negating the output of the DDS. The Xilinx FFT core is then used to generate the frequency domain data for the parallel input streams. The FFT core is configured to operate in a pipelined, streaming input/output mode based on the radix-2 butterfly algorithm, which reduces the time spent in the transform phase and provides high throughput for the output stream. Since the FFT block computes the discrete Fourier transform (DFT) in streaming mode, the input never stalls and the data path can be heavily pipelined: new inputs can enter the FFT block as soon as all inputs of the previous frame have entered. As a result, the streaming radix-2 FFT block has the shortest transform time among the FFT cores provided with XSG. The power difference module sequentially accumulates the subcarrier power difference values.
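The generation of the three parallel input streams can be modelled in software as follows. This is a behavioural sketch only: the DDS is modelled as an ideal complex exponential, and "negating the output" is interpreted here as negating the sine component (complex conjugation), which is an assumption about the intended operation. The function name `trial_streams` is hypothetical.

```python
import numpy as np

def trial_streams(rx, N, mu=0.25):
    """Return the streams compensated with trial offsets -mu, 0 and +mu."""
    n = np.arange(rx.size)
    dds = np.exp(-2j * np.pi * mu * n / N)   # ideal model of the DDS complex sinusoid
    plus = rx * dds                          # compensated with +mu, used for J(+mu)
    minus = rx * np.conj(dds)                # sine negated: compensated with -mu, used for J(-mu)
    return minus, rx, plus

# Example: a pure CFO of 1/4 subcarrier spacing is removed on the +1/4 stream
N = 64
n = np.arange(4 * N)
rx = np.exp(2j * np.pi * 0.25 * n / N)
minus, zero, plus = trial_streams(rx, N)
print(np.allclose(plus, 1.0))
```

Sharing one sinusoid between the two compensated streams mirrors the single-DDS arrangement described above and leaves only two complex multipliers in front of the FFT cores.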
The power difference module can be configured to compute the power difference between adjacent subcarriers or between the same subcarrier in consecutive OFDM symbols of the FFT output stream. The computed cost functions are then used to estimate the CFO, where the CORDIC module is configured in polar mode to compute the angle of the input vector. The CORDIC algorithm is implemented in three steps:
Step 1: Coarse Angle Rotation. The algorithm converges only for angles between -pi/2 and pi/2, so if x < 0, the input vector is reflected to the 1st or 4th quadrant by making the x-coordinate non-negative.
Step 2: Fine Angle Rotation. The resulting vector is rotated through progressively smaller angles, such that y goes to zero. In the i-th stage, the angular rotation is by +/-atan(1/2^i), with the sign depending on whether the input y is less than or greater than zero.
Step 3: Angle Correction. If there was a reflection applied in Step 1, this step applies the appropriate angle correction by subtracting it from +/-pi. 
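The three steps above can be sketched in floating point as follows. This is a behavioural model for illustration, not the XSG CORDIC block itself; the function name `cordic_angle` and its default of 10 iterations are assumptions, and a hardware implementation would use fixed-point shifts and a small arctangent lookup table instead of `math.atan`.

```python
import math

def cordic_angle(x, y, iterations=10):
    """Angle of the vector (x, y) computed with vectoring-mode CORDIC iterations."""
    y_in = y
    # Step 1: coarse rotation, reflect into the right half-plane if x is negative
    reflected = x < 0
    if reflected:
        x = -x
    # Step 2: fine rotations by +/- atan(2^-i) that drive y towards zero
    angle = 0.0
    for i in range(iterations):
        t = 2.0 ** -i
        if y > 0:
            x, y = x + y * t, y - x * t     # rotate clockwise
            angle += math.atan(t)
        else:
            x, y = x - y * t, y + x * t     # rotate counter-clockwise
            angle -= math.atan(t)
    # Step 3: undo the reflection by subtracting the result from +/- pi
    if reflected:
        angle = (math.pi - angle) if y_in >= 0 else (-math.pi - angle)
    return angle

# Example: compare against the library arctangent for a vector in the 2nd quadrant
print(cordic_angle(-1.0, 1.0, 16), math.atan2(1.0, -1.0))
```

Each additional iteration roughly halves the residual angle error, which is why the estimation accuracy stops improving once the CORDIC error falls below the error already present in the cost function values.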
B. Multiplexed-Stream Architecture (MSA)
In order to reduce the computational complexity, a resource efficient implementation of the PDE CFO estimator on FPGA using a single FFT core is proposed. The core design of the proposed multiplexed-stream architecture is shown in Fig. 2, where the three parallel input streams are multiplexed into one stream and a single FFT core is used to compute the frequency domain data for the multiplexed input streams. The easiest way to realize this architecture is to use the dual-port RAM block. The three input streams are stored in a dual-port RAM, which enables simultaneous access to the memory space at different sample rates using multiple data widths. The stored data are read from the dual-port RAM at a higher sampling rate and multiplexed into a single stream for FFT processing. The clock frequency of the FFT core in the multiplexed-stream design should be three times that of its counterpart in the parallel-stream design. The computed cost functions for the multiplexed streams are demultiplexed and then processed to estimate the CFO.
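The equivalence between the multiplexed and parallel data paths can be checked with a small behavioural model. The sketch below assumes that the streams are interleaved one FFT frame at a time, which the text does not specify; the function name `fft_multiplexed` is hypothetical, and the dual-port RAM and the clock-rate change are not modelled.

```python
import numpy as np

def fft_multiplexed(streams, N):
    """Interleave frames of several streams, run one FFT path, then demultiplex."""
    stacked = np.stack(streams, axis=1)          # shape (frames, num_streams, N)
    muxed = stacked.reshape(-1, N)               # frame order: s0[0], s1[0], s2[0], s0[1], ...
    spectra = np.fft.fft(muxed, axis=1)          # single FFT path at num_streams times the rate
    demuxed = spectra.reshape(-1, len(streams), N)
    return [demuxed[:, i, :] for i in range(len(streams))]

# Example: three random streams of 5 frames each (CP already removed)
rng = np.random.default_rng(4)
N = 64
streams = [rng.standard_normal((5, N)) + 1j * rng.standard_normal((5, N)) for _ in range(3)]
outputs = fft_multiplexed(streams, N)
print(all(np.allclose(o, np.fft.fft(s, axis=1)) for o, s in zip(outputs, streams)))
```

The single FFT path processes three times as many frames per unit time as each parallel core, which corresponds to the threefold clock frequency requirement stated above.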
FPGA PROTOTYPING RESULTS
This section presents the hardware evaluation and FPGA prototyping results for the proposed PDE CFO estimator architectures. For the hardware designer, it is important to tune the word-length of the designed system so that it has little impact on the accuracy of the output parameters. Moreover, the impact of the number of CORDIC iterations on the estimation accuracy is also reported. The design has been implemented using the XSG design flow, synthesized using the Xilinx ISE synthesis tools, and then placed and routed to obtain the hardware area utilization results. The target device is a Xilinx Virtex6 FPGA (xc6vsx315t) with package 3ff1156. To test the proposed system, 104 OFDM symbols impaired with a normalized CFO=0.4, N=64 and Ng=16 have been generated in Matlab and passed to the hardware architecture through the Gateway In block. The mean square error (MSE) has been used to evaluate the estimation accuracy of the hardware architecture. The word-length has a significant impact on the area consumption, throughput and accuracy. To configure the word-length for the proposed architecture, the MSE of the hardware estimated CFO is evaluated for different word-length values, as shown in Fig. 3. For a noiseless channel, the MSE decreases as the word-length increases. However, increasing the word-length requires more hardware resources. Therefore, to tune the word-length without degrading the estimation accuracy, a practical situation is considered where a noisy channel with SNR=40 dB is used. In this case, the MSE for both the PDE-Time and PDE-Frequency estimators starts to saturate for word-lengths greater than 12, and increasing the word-length further does not improve the estimation accuracy of the proposed architecture. Therefore, a word-length of 12 bits can be used to achieve minimal resource utilization while preserving the accuracy of the estimation algorithm. The impact of the number of CORDIC processor elements (PEs) on the accuracy of the estimated CFO is illustrated in Fig. 4.
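The effect of the word-length on the input samples can be explored in software before synthesis. The sketch below models only the input quantization, assuming a signed fixed-point format in which all bits except the sign bit are fractional, with rounding and saturation; the actual XSG data path formats are not given in the text, so the function names and the format are assumptions.

```python
import numpy as np

def quantize(x, word_length):
    """Round complex samples to signed fixed point with saturation to [-1, 1)."""
    scale = 2 ** (word_length - 1)
    def q(v):
        return np.clip(np.round(v * scale), -scale, scale - 1) / scale
    return q(np.real(x)) + 1j * q(np.imag(x))

def quantization_mse(x, word_length):
    """Mean square error introduced by quantizing x to the given word-length."""
    return np.mean(np.abs(x - quantize(x, word_length)) ** 2)

# Example sweep over word-lengths on hypothetical samples within the representable range
rng = np.random.default_rng(3)
samples = 0.5 * (rng.uniform(-1, 1, 4096) + 1j * rng.uniform(-1, 1, 4096))
mse_by_word_length = {w: quantization_mse(samples, w) for w in (8, 10, 12, 14, 16)}
print(mse_by_word_length)
```

In a noisy channel the estimation error stops improving once the quantization error falls well below the channel noise, which is consistent with the saturation of the MSE reported for longer word-lengths.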
For a noisy channel with SNR=40 dB, the MSE for both the PDE-Time and PDE-Frequency estimators starts to saturate for PEs>10. Therefore, using 10 PEs for the CORDIC is sufficient to preserve the estimation accuracy with minimal hardware utilization. The hardware resource utilization and the maximum frequency for the proposed architectures are illustrated in 
CONCLUSION
This work investigated an efficient FPGA prototyping architecture for CFO estimation in OFDM systems. Parallel-stream and multiplexed-stream architectures for the PDE CFO estimator have been designed and evaluated. The designed architectures were mapped into FPGA using the XSG design flow. Prototyping results confirmed that the proposed architectures lead to an efficient FPGA implementation that occupies a small portion of the device resources while providing high throughput. The prototyping results for the multiplexed-stream architecture showed that a saving of 50% in resource utilization can be achieved compared with the parallel-stream architecture.
