I. INTRODUCTION
Orthogonal Frequency Division Multiplexing (OFDM) has recently been applied widely in wireless communication systems because it supports high data rates with high bandwidth efficiency and is robust to multipath delay. Channel estimation (tracking) in OFDM systems is generally based on pilot sub-carriers placed at given positions of the frequency-time grid [1]. For fast-varying channels, non-negligible fluctuations of the channel gains are expected between consecutive OFDM symbols (or even within each symbol), so to ensure adequate tracking accuracy it is advisable to place pilot sub-carriers in each OFDM symbol [2], [3]. To eliminate the need for channel estimation and tracking, differential demodulation can be used in OFDM systems, at the expense of a 3 to 4 dB loss in signal-to-noise ratio (SNR) compared with coherent demodulation. Accurate channel estimation [4], [5], [6] can improve the performance of OFDM systems by allowing coherent demodulation.
The structure of OFDM signaling allows a channel estimator to use both time and frequency correlation. Such a two-dimensional estimator structure is generally too complex for practical implementation. To reduce the complexity, separating the use of time and frequency correlation has been proposed [8].
Traditional one-dimensional channel estimation techniques for OFDM systems can be summarized as follows: (a) least-squares (LS), (b) minimum mean-square error (MMSE) and (c) linear minimum mean-square error (LMMSE) estimators. LS estimators have low complexity, but they suffer from a high mean-square error (MSE), especially when the system operates at low SNR. MMSE estimators, on the other hand, are based on time-domain channel statistics and have high complexity. They perform well in sample-spaced channel environments, but their performance is limited for non-sample-spaced channels and at high SNRs [9]. Finally, LMMSE estimators give good performance for both sample-spaced and non-sample-spaced channels [10].
Although the LMMSE estimator using only frequency correlation has lower complexity than one using both time and frequency correlation, it still requires a large number of operations. A low-complexity approximation to the frequency-based LMMSE estimator, built on the theory of optimal rank reduction, was introduced in [5, 17, 18]. However, the rank there is selected empirically; neither a precise selection rule nor a hardware realization method was given.
In this paper, a rank selection criterion is presented and an FPGA implementation method is proposed.
The OFDM system model and our scenario are presented in Section 2, and the low-rank selection model is introduced in Section 3. The FPGA design method for the Hermitian matrix is analyzed in Section 4. The performance of the proposed estimator is discussed in Section 5, and the simulation results appear in Section 6. The conclusion is presented in Section 7.
II. SYSTEM DESCRIPTION

A. Channel Model
As shown in Fig. 1, the OFDM channel can be described as a set of parallel Gaussian channels. In matrix notation the OFDM system is written as [5]

$$\mathbf{y} = \mathbf{X}\mathbf{h} + \mathbf{n},$$

where $\mathbf{n} = [n_0, n_1, \cdots, n_{N-1}]^T$ is a vector of independent, identically distributed complex zero-mean Gaussian noise with variance $\sigma_n^2$. The noise $\mathbf{n}$ is assumed to be uncorrelated with the channel $\mathbf{h}$.
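Let $\mathbf{R}_{hh}$ denote the channel autocorrelation matrix. In [4], Beek described the LMMSE estimator as

$$\hat{\mathbf{h}}_{lmmse} = \mathbf{R}_{hh}\left(\mathbf{R}_{hh} + \frac{\beta}{\mathrm{SNR}}\mathbf{I}\right)^{-1}\hat{\mathbf{h}}_{ls},$$

where $\hat{\mathbf{h}}_{ls} = \mathbf{X}^{-1}\mathbf{y}$ is the LS estimate and $\beta$ is a constant depending on the signal constellation.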
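The system model and the LS estimate that the LMMSE estimator starts from can be illustrated with a minimal sketch. It assumes that $\mathbf{X}$ is diagonal (one transmitted symbol per sub-carrier, consistent with the parallel-channel description above), so the LS estimate reduces to a per-sub-carrier division. The function names, the pilot values and the channel taps are hypothetical.

```python
import random

def simulate_ofdm_symbol(x, h, noise_std, rng):
    """Return y[k] = x[k] * h[k] + n[k] for a diagonal X (assumption).

    noise_std is the standard deviation of the real and of the imaginary
    noise component on every sub-carrier.
    """
    y = []
    for xk, hk in zip(x, h):
        nk = complex(rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std))
        y.append(xk * hk + nk)
    return y

def ls_estimate(x, y):
    """Per-sub-carrier least-squares estimate h_ls[k] = y[k] / x[k]."""
    return [yk / xk for xk, yk in zip(x, y)]

rng = random.Random(0)
x = [1, -1, 1, 1]                                    # illustrative BPSK pilots
h = [1 + 0.5j, 0.8 - 0.2j, 0.3 + 0.1j, -0.4 + 0.9j]  # illustrative channel gains
y_clean = simulate_ofdm_symbol(x, h, 0.0, rng)       # noise-free reference
h_ls_clean = ls_estimate(x, y_clean)
y_noisy = simulate_ofdm_symbol(x, h, 0.1, rng)
h_ls_noisy = ls_estimate(x, y_noisy)
```

Without noise the LS estimate reproduces the channel; with noise every sub-carrier carries the full noise term, which is what the LMMSE weighting is meant to suppress.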
C. Optimal Low-Rank Approximations
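If $\mathbf{R}_{hh}$ and the SNR are known beforehand or are set to fixed nominal values, the rank reduction is carried out through the singular value decomposition of $\mathbf{R}_{hh}$ [9]:

$$\mathbf{R}_{hh} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^H,$$

where $\mathbf{U}$ is a unitary matrix containing the singular vectors and $\boldsymbol{\Lambda}$ is a diagonal matrix containing the singular values $\lambda_0 \ge \lambda_1 \ge \cdots \ge \lambda_{N-1} \ge 0$ on its diagonal.
The optimal rank-$p$ estimator, applied to the LS estimate $\hat{\mathbf{h}}_{ls}$, is then

$$\hat{\mathbf{h}}_p = \mathbf{U}\boldsymbol{\Delta}_p\mathbf{U}^H\hat{\mathbf{h}}_{ls},$$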
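where $\boldsymbol{\Delta}_p$ is a diagonal matrix with entries

$$\delta_k = \begin{cases} \dfrac{\lambda_k}{\lambda_k + \beta/\mathrm{SNR}}, & k = 0, 1, \ldots, p-1, \\ 0, & k = p, \ldots, N-1, \end{cases}$$

with $\beta$ the constellation-dependent constant. The column space of $\mathbf{X}$ is defined as the observation data space, and $\sigma_1^2, \sigma_2^2, \ldots$ denote the eigenvalues of its autocorrelation matrix $\mathbf{R}_X$.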
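The diagonal weighting of the rank-$p$ estimator can be illustrated with a short sketch. It assumes the usual low-rank LMMSE weights, $\lambda_k / (\lambda_k + \beta/\mathrm{SNR})$ for the first $p$ coefficients and zero for the rest, with the SNR given as a linear ratio. The function name and the sample eigenvalues are hypothetical.

```python
def delta_weights(eigenvalues, p, beta, snr):
    """Diagonal of Delta_p for eigenvalues sorted in descending order.

    Assumed form: lambda_k / (lambda_k + beta / snr) for k < p, else 0.
    snr is a linear ratio, not a value in dB.
    """
    if not 0 <= p <= len(eigenvalues):
        raise ValueError("p must lie between 0 and the number of eigenvalues")
    return [lam / (lam + beta / snr) if k < p else 0.0
            for k, lam in enumerate(eigenvalues)]

# Illustrative values: three eigenvalues, rank 2, beta = 1 (unit-modulus
# constellations such as BPSK), SNR = 10 (linear).
weights = delta_weights([4.0, 1.0, 0.1], p=2, beta=1.0, snr=10.0)
```

Strong coefficients receive a weight close to 1, weak ones are attenuated, and everything beyond rank $p$ is discarded.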
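Let $\mathbf{S}$ be the matrix of eigenvectors corresponding to the $r$ principal eigenvalues of $\mathbf{R}_X$. The column space of $\mathbf{S}$ is called the observation data space and is expressed as $\mathrm{Span}\{\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_r\}$.
Here $\sigma_i$ is the $i$th singular value, $p$ is the order of the rank reduction, and $\hat{\mathbf{h}}$ is the channel estimate. The rank $p$ largely determines the computational complexity of the LMMSE estimator in Eq. (5); see also [9].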
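With the normalized singular values $\bar{\sigma}_i = \sigma_i / \sigma_1$, the greatest integer $i$ satisfying the criterion $\bar{\sigma}_i \ge \varepsilon$ is selected as the estimate of the effective rank, denoted $\hat{r}$. This criterion is equivalent to selecting the greatest integer $i$ which satisfies

$$\sigma_i \ge \varepsilon\,\sigma_1,$$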
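where $\varepsilon$ is a small positive number that should be selected according to the computer precision and the data precision.

In the norm ratio method, let the $m \times n$ matrix $\mathbf{A}_k$ be the rank-$k$ approximation of the $m \times n$ matrix $\mathbf{A}$, and define the norm ratio $\nu(k) = \|\mathbf{A}_k\|_F / \|\mathbf{A}\|_F$; the effective rank is the smallest $k$ for which $\nu(k)$ reaches a threshold close to 1. Taking a 2-path delay-fading channel model as an example, the rank orders given by the two methods were calculated in MATLAB and are shown in Table I.

TABLE I

C. Estimator Complexity
Expressing Eq. (5) as a sum of rank-one terms gives

$$\hat{\mathbf{h}}_p = \sum_{k=0}^{p-1} \delta_k\,\mathbf{u}_k\mathbf{u}_k^H\,\hat{\mathbf{h}}_{ls},$$

where $\mathbf{u}_k$ is the $k$th column of $\mathbf{U}$ and $\delta_k$ is the $k$th diagonal entry of $\boldsymbol{\Delta}_p$.

The LMMSE estimator is designed for a wireless channel environment whose autocorrelation matrix is $\mathbf{R}_{hh}$ and whose signal-to-noise ratio is SNR, while the real channel $\tilde{\mathbf{h}}$ has autocorrelation matrix $\mathbf{R}_{\tilde{h}\tilde{h}}$ and signal-to-noise ratio $\widetilde{\mathrm{SNR}}$. The estimation error of the rank-$p$ channel estimator is $\mathbf{e}_p = \tilde{\mathbf{h}} - \hat{\mathbf{h}}_p$.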
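The two effective-rank criteria discussed above, the normalized singular value threshold and the norm ratio, can be sketched as follows. The thresholds `eps` and `alpha`, the function names and the sample singular values are illustrative assumptions, and the norm ratio is taken here in the Frobenius norm, which is one common choice.

```python
import math

def rank_by_normalized_threshold(sigmas, eps):
    """Greatest i with sigma_i >= eps * sigma_1 (sigmas sorted descending)."""
    if not sigmas or sigmas[0] <= 0:
        return 0
    # Because the values are sorted, counting the hits gives the greatest index.
    return sum(1 for s in sigmas if s >= eps * sigmas[0])

def rank_by_norm_ratio(sigmas, alpha):
    """Smallest k whose rank-k approximation keeps a norm ratio >= alpha.

    Uses ||A_k||_F / ||A||_F = sqrt(sum of first k sigma^2 / sum of all sigma^2).
    """
    total = sum(s * s for s in sigmas)
    if total == 0:
        return 0
    partial = 0.0
    for k, s in enumerate(sigmas, start=1):
        partial += s * s
        if math.sqrt(partial / total) >= alpha:
            return k
    return len(sigmas)

sigmas = [10.0, 5.0, 0.1, 0.01]          # illustrative singular values
r_threshold = rank_by_normalized_threshold(sigmas, eps=0.05)
r_ratio = rank_by_norm_ratio(sigmas, alpha=0.99)
```

For this example both rules keep the two dominant singular values; in general the two estimates need not agree, which is why the thresholds have to be tuned to the data precision.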
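If $\tilde{\mathbf{h}}$ and $\tilde{\mathbf{n}}$ are uncorrelated, applying the trace formula for matrices gives

$$\mathrm{MSE}(p) = \frac{1}{N}\sum_{k=0}^{p-1}\left[\mu_k(1-\delta_k)^2 + \frac{\beta}{\widetilde{\mathrm{SNR}}}\,\delta_k^2\right] + \frac{1}{N}\sum_{k=p}^{N-1}\mu_k,$$

where $\mu_k$ is the $k$th diagonal element of the matrix $\mathbf{U}^H\mathbf{R}_{\tilde{h}\tilde{h}}\mathbf{U}$ and represents the channel energy for the $k$th transform coefficient.
The lower bound of the mean square error is

$$\mathrm{MSE}(p) \ge \frac{1}{N}\sum_{k=p}^{N-1}\mu_k.$$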
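The dependence of the error on the rank $p$ can be explored numerically with a small sketch. It assumes the standard low-rank LMMSE error expression, in which each retained coefficient contributes $\mu_k(1-\delta_k)^2 + (\beta/\widetilde{\mathrm{SNR}})\delta_k^2$ and each discarded coefficient contributes $\mu_k$, all divided by $N$. The function names and sample values are hypothetical, and both SNRs are linear ratios.

```python
def mse_rank_p(mu, lam, p, beta, snr_design, snr_true):
    """Assumed MSE of the rank-p estimator under design/true mismatch.

    mu[k]  : channel energy of the true channel in coefficient k
    lam[k] : design eigenvalue used to build delta_k
    """
    n = len(mu)
    total = 0.0
    for k in range(n):
        if k < p:
            delta = lam[k] / (lam[k] + beta / snr_design)
            total += mu[k] * (1.0 - delta) ** 2 + (beta / snr_true) * delta ** 2
        else:
            total += mu[k]          # energy lost by discarding coefficient k
    return total / n

def mse_floor(mu, p):
    """Error floor: energy of the discarded coefficients divided by N."""
    return sum(mu[p:]) / len(mu)

lam = [4.0, 1.0, 0.1]               # illustrative, matched design (mu == lam)
mse2 = mse_rank_p(lam, lam, p=2, beta=1.0, snr_design=10.0, snr_true=10.0)
floor2 = mse_floor(lam, p=2)
```

Raising the SNR shrinks the first sum but leaves the floor untouched, so the floor is what limits a rank that is chosen too small.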
D. Simulation Results
A frequency-selective Rayleigh fading channel model is chosen to simulate a multi-tap delay line. As a performance measure, we use the uncoded SER for BPSK signaling. The SER in this case can be calculated from the MSE by Eq. (9) [12]. Assuming that the cyclic prefix eliminates ISI and that a training-sequence pilot is used, the MSE is defined as

$$\mathrm{MSE} = \|\mathbf{h} - \hat{\mathbf{h}}\|^2.$$
The MSE under different SNRs is presented in Fig. 3 for a 7-path Rayleigh fading channel and in Fig. 4 for a 2-path Rayleigh fading channel. Fig. 4 shows the performance of the low-rank LMMSE estimator versus SNR under the Rayleigh fading channel. The low-rank LMMSE estimator is better than the LS estimator and slightly worse than the LMMSE estimator without rank reduction. Fig. 5 gives the SER performance of the low-rank LMMSE estimator. The SER is obtained by averaging the symbol errors over 100 test runs.
Figure 5. SER before and after rank reduction
Compared with the traditional LMMSE estimator, the low-rank LMMSE estimator incurs a performance loss of about 4 dB, which is attributed to the channel estimation error caused by rank reduction. This loss is within an acceptable level, so the rank-reduced estimator is practical for use in real OFDM systems.
IV. FPGA DESIGN FOR HERMITIAN MATRIX
The channel autocorrelation matrix $\mathbf{R}_{hh}$ is Hermitian. The key problem in the FPGA realization of SVD-based channel estimation is how to speed up the SVD of a Hermitian matrix. Hemkumar presents a method that uses a two-step Q transform strategy to perform the SVD of a complex matrix [13]. In an FPGA realization, the two-step strategy requires complex arithmetic, which lowers the computation speed. To achieve real-time operation, a revised method is proposed that transforms the complex computation for the Hermitian matrix into a computation on a real symmetric matrix.
Let $\mathbf{H}$ be a Hermitian matrix. In this paper, the parallel Jacobi algorithm is used to realize the SVD of the real symmetric matrix [11]. Like the systolic array, the real matrix is symmetric, and rotating it again yields a real symmetric matrix. To save system resources, only the upper triangular part is used to realize the SVD of the real symmetric matrix [11].
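One standard way to turn a Hermitian problem into a real symmetric one is sketched below. It is an assumption for illustration, not necessarily the exact transform used in this design: writing $\mathbf{H} = \mathbf{A} + j\mathbf{B}$ with $\mathbf{A}$ real symmetric and $\mathbf{B}$ real antisymmetric, the $2n \times 2n$ block matrix $[\mathbf{A}, -\mathbf{B}; \mathbf{B}, \mathbf{A}]$ is real symmetric and has the eigenvalues of $\mathbf{H}$, each appearing twice. The function name and the sample matrix are hypothetical.

```python
def hermitian_to_real_symmetric(h):
    """Embed an n x n Hermitian matrix H = A + jB into [[A, -B], [B, A]]."""
    n = len(h)
    m = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            a, b = h[i][j].real, h[i][j].imag
            m[i][j] = a            # upper-left block:   A
            m[i][j + n] = -b       # upper-right block: -B
            m[i + n][j] = b        # lower-left block:   B
            m[i + n][j + n] = a    # lower-right block:  A
    return m

h_example = [[2 + 0j, 1 - 1j],
             [1 + 1j, 3 + 0j]]     # illustrative Hermitian matrix
m_example = hermitian_to_real_symmetric(h_example)
```

The price is a matrix of twice the order, which is consistent with the trade-off of slightly more resources for simpler real arithmetic.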
By systematically "zeroing" the off-diagonal entries, the matrix $\mathbf{A}$ can be made diagonal. To be specific, consider the iteration:
The parameter ζ is some small machine-dependent number. Each pass through the "until" loop is called a "sweep". The hardware realization block for the SVD of the matrix therefore includes two parts:
(1) Processing units in the matrix:
1) $2 \times 2$ sub-matrix rotation calculation unit
2) $2 \times 2$ sub-matrix bilateral rotation unit
These two units, both realized by the CORDIC algorithm, perform the $2 \times 2$ sub-matrix transformation.
(2) Global communication control unit
This unit controls the data exchange between processing units and the timing relationship between control signals, carries out the Jacobi iteration, and thereby realizes the matrix SVD algorithm.
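The sweep-based iteration with the threshold ζ described above can be sketched in software as a serial cyclic Jacobi method for a real symmetric matrix. This is a floating-point reference model under assumed conventions, not the hardware design: the stopping rule compares the off-diagonal norm with `zeta` times the Frobenius norm, and the names `jacobi_rotate`, `jacobi_sweeps`, `zeta` and `max_sweeps` are hypothetical.

```python
import math

def off_norm(a):
    """Frobenius norm of the off-diagonal part of a square matrix."""
    n = len(a)
    return math.sqrt(sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j))

def jacobi_rotate(a, p, q):
    """Two-sided rotation A <- J^T A J that zeroes a[p][q] and a[q][p] in place."""
    if a[p][q] == 0.0:
        return
    theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
    c, s = math.cos(theta), math.sin(theta)
    n = len(a)
    for k in range(n):                      # A <- A J (columns p and q)
        akp, akq = a[k][p], a[k][q]
        a[k][p] = c * akp - s * akq
        a[k][q] = s * akp + c * akq
    for k in range(n):                      # A <- J^T A (rows p and q)
        apk, aqk = a[p][k], a[q][k]
        a[p][k] = c * apk - s * aqk
        a[q][k] = s * apk + c * aqk

def jacobi_sweeps(a, zeta=1e-12, max_sweeps=50):
    """Repeat sweeps until off(A) <= zeta * ||A||_F; return (values, sweeps)."""
    a = [list(map(float, row)) for row in a]   # work on a copy
    n = len(a)
    total = math.sqrt(sum(x * x for row in a for x in row))
    sweeps = 0
    while off_norm(a) > zeta * total and sweeps < max_sweeps:
        for p in range(n - 1):
            for q in range(p + 1, n):
                jacobi_rotate(a, p, q)
        sweeps += 1
    return sorted((a[i][i] for i in range(n)), reverse=True), sweeps

values, sweeps = jacobi_sweeps([[4.0, 1.0, 0.0],
                                [1.0, 3.0, 1.0],
                                [0.0, 1.0, 2.0]])
```

A serial model like this is mainly useful as a golden reference against which fixed-point hardware results can be compared.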
A. CORDIC Unit
Because system resources are limited, a cyclic CORDIC structure is used both to calculate the rotation angle and to realize the matrix bilateral rotation unit. The CORDIC method is given by [11]

$$x_{i+1} = x_i - d_i\,y_i\,2^{-i},$$
$$y_{i+1} = y_i + d_i\,x_i\,2^{-i},$$
$$z_{i+1} = z_i - d_i\arctan(2^{-i}),$$

where $d_i$ is the sign that decides the direction of rotation and $z_{i+1}$ is the angle accumulator, which is used to calculate the variation of the rotation angle.
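The CORDIC recursion can be checked with a floating-point sketch of the rotation mode. It assumes the standard circular CORDIC updates with micro-rotation angles $\arctan(2^{-i})$; a hardware version would use fixed-point shifts, a lookup table for the angles and a precomputed gain constant. The function name and the iteration count are hypothetical.

```python
import math

def cordic_rotate(x, y, angle, iterations=24):
    """Rotate (x, y) by `angle` radians with CORDIC micro-rotations.

    Converges for |angle| up to about 1.74 rad (sum of all micro-angles).
    """
    z = angle                   # angle accumulator: rotation still to apply
    gain = 1.0                  # accumulated CORDIC scale factor
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # direction of this micro-rotation
        shift = 2.0 ** -i                    # a right shift by i bits in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)
        gain *= math.sqrt(1.0 + shift * shift)
    return x / gain, y / gain

xr, yr = cordic_rotate(1.0, 0.0, math.pi / 6)
```

Each additional iteration roughly adds one bit of angular precision, which is how the iteration count trades latency against accuracy in a cyclic CORDIC unit.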
B. Parallel Jacobi Algorithm
Through a series of bilateral rotations, the Jacobi algorithm reduces the norm of the off-diagonal elements while preserving the symmetric structure, transforming the matrix into a diagonal matrix that contains the singular values and thus realizing the matrix SVD. The iterative process follows [15].
The Brent and Luk ordering rule shows that the sub-rotations are taken column by column and that, after the elements move along the arrows, no collision occurs between the rotation matrices. The rotations can therefore be processed in parallel, and the parallel Jacobi algorithm can be carried out by a systolic array. Considering the symmetry of the real matrix, this paper uses an improved systolic array with an upper triangular structure to realize the SVD.
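The idea of a collision-free parallel ordering can be illustrated with a round-robin schedule. This sketch uses the classic circle (tournament) method as a stand-in; it has the same key property as the ordering described above, namely that every step consists of disjoint index pairs, but it is not claimed to reproduce the exact Brent and Luk permutation. The function name is hypothetical.

```python
def round_robin_schedule(n):
    """Return n - 1 steps, each holding n / 2 disjoint index pairs (n even).

    Over one full sweep every pair (i, j) with i < j appears exactly once.
    """
    if n < 2 or n % 2:
        raise ValueError("n must be an even integer >= 2")
    order = list(range(n))
    steps = []
    for _ in range(n - 1):
        pairs = [tuple(sorted((order[i], order[n - 1 - i]))) for i in range(n // 2)]
        steps.append(pairs)
        # Keep the first index fixed and rotate the others by one position.
        order = [order[0]] + [order[-1]] + order[1:-1]
    return steps

steps = round_robin_schedule(8)
```

For an $8 \times 8$ matrix this gives 7 steps of 4 independent rotations each, so one sweep covers all 28 off-diagonal pairs.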
C. Realization for Array Processing Units
(1) Realization of singular value matrix
To obtain the singular value matrix by the parallel Jacobi algorithm in the improved systolic array, two functional units are needed: a rotation angle calculation unit and a bilateral rotation processing unit, both implemented with the iterative CORDIC structure. Fig. 6 shows the block diagram of the rotation angle calculation unit, and Fig. 7 shows that of the bilateral rotation processing unit. We present a two-step strategy to realize the parallel Jacobi data scheduling algorithm in the systolic array.
2) Data exchange among processing units
In the systolic array, $\theta_l$ and $\theta_r$ are transferred only along the diagonal processing units. Each processing unit contains a data processing unit and a communication control unit, so no extra memory or read operations are needed during operation, which improves the data processing capability and the scalability of the system. The global communication control unit controls the initialization, timing relationship and iteration count of each processing unit, which keeps the processing orderly and coordinated [16].
(2) Realization of singular vector matrix
The SVD process of the matrix also yields the singular vectors. $\mathbf{U}$ is the right singular vector matrix of the matrix $\mathbf{A}$, which can be obtained by applying unilateral rotations to an identity matrix whose order equals that of $\mathbf{A}$.
The array structure for the singular vector matrix is similar to the systolic array and differs only in the function of the processing units and in the data scheduling algorithm. Take the right singular vectors as an example.
1) The processing units of the array are all unilateral rotation units.
2) The rotation angle $\theta_r$ is transferred to the units with the same column subscript by the systolic array used to obtain the singular value matrix.
3) The interconnections among processing units, which exchange elements only between adjacent processing units with the same row subscript, differ from those inside a processing unit.
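The accumulation of singular vectors by unilateral rotations of an identity matrix can be shown on a $2 \times 2$ example, where a single rotation diagonalizes the matrix. This is a minimal sketch under assumed conventions: the angle is computed as $\tfrac{1}{2}\operatorname{atan2}(2a_{pq},\, a_{qq} - a_{pp})$, and the function name and sample matrix are hypothetical.

```python
import math

def rotate_columns(v, p, q, theta):
    """One-sided update V <- V J(p, q, theta), applied in place."""
    c, s = math.cos(theta), math.sin(theta)
    for row in v:
        vp, vq = row[p], row[q]
        row[p] = c * vp - s * vq
        row[q] = s * vp + c * vq

a = [[2.0, 1.0],
     [1.0, 2.0]]                                   # illustrative symmetric matrix
theta_r = 0.5 * math.atan2(2.0 * a[0][1], a[1][1] - a[0][0])

v = [[1.0, 0.0],
     [0.0, 1.0]]                                   # start from the identity
rotate_columns(v, 0, 1, theta_r)                   # columns of v: singular vectors
```

In a larger array the same update is applied once per rotation angle received from the singular value array, so the vector array needs only one-sided rotation units.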
The data exchange algorithm is shown in Table IV.

TABLE IV. DATA EXCHANGE SCHEDULING ALGORITHM INSIDE PROCESSING
Compared with real matrix SVD, complex matrix SVD needs more sweeps; the empirical number of sweeps for a real matrix is log N. Although it occupies slightly more resources than the SVD of the Hermitian matrix, the SVD of the real matrix offers better real-time performance because it converges faster. In addition, real arithmetic is simpler than complex arithmetic, which greatly reduces the complexity of the system realization.
VI. SIMULATION RESULTS AND ANALYSIS
In this paper, the EP2C70F896C6 device of the Cyclone II series is selected, and the design is realized in VHDL on the Quartus II 8.0 design platform. With a simulation clock frequency of 100 MHz, the timing of the processing units is shown in Fig. 9. Fig. 9(a) shows the simulation result for the diagonal processing units of the SVD of an $8 \times 8$ real symmetric matrix.
A $2 \times 2$ diagonal matrix is obtained after one Jacobi rotation in each diagonal processing unit, and the simulation result converges after three sweep periods. The resources occupied by the SVD of the $8 \times 8$ real symmetric matrix are shown in Fig. 11.
VII. CONCLUSION
A new low-rank SVD channel estimator design is proposed in this paper. Based on signal/noise subspace theory, a rank selection criterion is given to reduce the complexity of the LMMSE channel estimation method. Simulation results for the MSE under a 7-path and a 2-path Rayleigh fading channel are presented. Compared with the full LMMSE estimator, there is only a small loss in performance up to an SNR of 25 dB, together with a reduction in complexity. To realize the SVD of a high-rank Hermitian matrix, which is the main technical difficulty in implementing the channel estimator, a simplified method is proposed. The proposed method transforms the complex computation for the Hermitian matrix into a real symmetric matrix computation to obtain good real-time behavior and higher efficiency. The parallel Jacobi data scheduling algorithm in the improved systolic array is used to calculate the SVD of the real matrix, which saves system resources. The design is realized in VHDL on the Quartus II 8.0 design platform, and its results agree with the MATLAB simulation results, which verifies the correctness of the design.
