In this paper, a novel timing synchronization algorithm and its implementation scheme are presented. Although the conventional correlation-based synchronization algorithm performs well, it is difficult to implement in hardware because of its high resource consumption and long capture time. The proposed method addresses these problems for practical application in orthogonal frequency division multiplexing (OFDM) systems. Instead of correlation, the novel preamble-based algorithm adopts convolution to accelerate the frame capture procedure. Implemented on a Xilinx Virtex-5 series FPGA, the algorithm saves at least 85% of block RAMs while achieving comparable or even better performance. Finally, the scheme is proven stable and efficient in a Gbps OFDM trial system.
Introduction
Because of its robustness against multipath fading, OFDM is widely used in various communication systems. However, OFDM is extremely sensitive to timing synchronization errors, which degrade system performance [1].
There have been numerous papers on timing synchronization for OFDM in recent years. In [2], a maximum likelihood (ML) estimator using the redundant information contained in the cyclic prefix (CP) is designed. A rapid and robust timing synchronization method is presented in [3]; it takes advantage of a preamble consisting of two identical halves to estimate symbol timing. By designing a special training symbol based on [3], the method in [4] obtains a sharper peak in the timing metric trajectory. The method in [5] combines auto correlation and cross correlation to achieve synchronization even at low signal-to-noise ratios (SNRs), with high frequency offset and in multipath environments. In those papers, theoretical analysis and simulation results are provided without considering implementation constraints.
In [6], the instantaneous power is compared with a long-term estimated power to acquire the frame start. Although the method has low complexity and could be implemented with modest resource consumption, its performance deteriorates rapidly as the SNR decreases. Most other practical solutions for timing synchronization are based on auto correlation [7] or cross correlation [8]. Whether auto correlation or cross correlation is employed, each correlation calculation checks only one sample of a frame but costs at least N taps, where N is the length of the training sequence. The detection rate is therefore much lower than the input data rate, and a large amount of storage space is needed to buffer a whole frame.
In this work, we propose a novel algorithm and its implementation to address these problems. Timing synchronization is achieved by a method that combines convolution with auto-correlation calculation. Computer simulations show that the performance of the proposed algorithm is comparable or even superior to that of conventional correlation methods. The structure of the proposed algorithm is divided into several submodules, which are implemented on a Xilinx Virtex-5 FPGA device. Some simplifications and modifications are adopted to save hardware resources. A Finite State Machine (FSM) is designed to decide how the estimator works before and after synchronization. The design is then applied to our Gbps OFDM trial system in a wireless environment, where the performance of the implementation is analyzed in terms of resource utilization and synchronization accuracy at high and low SNRs.
The remainder of this paper is organized as follows. Section 2 gives a brief description of the physical frame structure and the signal model. Section 3 explains the principle of the proposed algorithm. Section 4 shows the implementation of the algorithm on a Xilinx FPGA, and Section 5 analyzes the performance of the design. Finally, the paper is concluded in Section 6.
Frame structure and signal model
The physical frame structure of the Gbps OFDM trial system is shown in Figure 1 . The preamble is composed of two contiguous OFDM symbols, each having the same Constant Amplitude Zero AutoCorrelation (CAZAC) sequence. The CAZAC sequence has perfect auto correlation and cross correlation properties. It is defined in [9] :
(1)
where N is the length of the sequence.
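As an illustration of the two CAZAC properties mentioned above, the following minimal sketch generates a Zadoff-Chu sequence, which is one common CAZAC family. The exact definition used in [9] is not reproduced here, so the closed form in `zadoff_chu`, the `root` parameter and the length 64 are assumptions chosen for the example.

```python
import cmath
import math


def zadoff_chu(n, root=1):
    """Zadoff-Chu sequence of length n (assumed form, not taken from [9])."""
    if n % 2 == 0:
        return [cmath.exp(1j * math.pi * root * k * k / n) for k in range(n)]
    return [cmath.exp(1j * math.pi * root * k * (k + 1) / n) for k in range(n)]


def circular_autocorr(seq, lag):
    """Circular autocorrelation of seq at the given lag."""
    n = len(seq)
    return sum(seq[k] * seq[(k + lag) % n].conjugate() for k in range(n))


seq = zadoff_chu(64)

# Constant amplitude: every sample has unit magnitude.
max_amplitude_error = max(abs(abs(s) - 1.0) for s in seq)

# Zero autocorrelation: all non-zero circular lags vanish numerically.
max_sidelobe = max(abs(circular_autocorr(seq, lag)) for lag in range(1, 64))
zero_lag = abs(circular_autocorr(seq, 0))
```

In this sketch the zero-lag value equals the sequence length, while every other circular lag is zero up to floating-point error.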
The baseband expression of the received signal is:
where ε is the carrier frequency offset normalized to the subcarrier spacing, ω(k) represents the zero-mean complex additive white Gaussian noise (AWGN), and
where x(k) denotes the transmitted OFDM signal and h(l) is the impulse response of the wideband channel [10].
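The signal model can be made concrete with a short simulation. The sketch below assumes the commonly used form r(k) = exp(j2πεk/N) · Σ h(l) x(k − l) + ω(k); the paper's own equations are not reproduced here, so this form, the function name `received_signal` and its parameters are assumptions for illustration only.

```python
import cmath
import math
import random


def received_signal(x, h, eps, n_fft, noise_std, rng):
    """Pass x through channel h, apply a normalized CFO eps, and add AWGN.

    Assumed model: r(k) = exp(j*2*pi*eps*k/n_fft) * sum_l h(l) x(k-l) + w(k).
    """
    r = []
    for k in range(len(x)):
        # Multipath channel: linear convolution of x with the taps h.
        y = sum(h[l] * x[k - l] for l in range(len(h)) if k - l >= 0)
        # Carrier frequency offset appears as a linear phase rotation.
        rotation = cmath.exp(2j * math.pi * eps * k / n_fft)
        # Zero-mean complex Gaussian noise sample.
        noise = complex(rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std))
        r.append(rotation * y + noise)
    return r


rng = random.Random(0)
x = [1 + 0j, 1j, -1 + 0j, -1j]
ideal = received_signal(x, [1.0], 0.0, 4, 0.0, rng)       # no impairments
rotated = received_signal(x, [1.0], 0.5, 4, 0.0, rng)     # CFO only
two_path = received_signal([1, 1, 1], [1.0, 0.5], 0.0, 4, 0.0, rng)
```

With a single unit tap and no noise, the frequency offset changes only the phase of each sample, not its magnitude.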
Synchronization algorithm
In this section, we propose a two-stage convolution-based algorithm that achieves a high detection speed and a sharply peaked timing metric. The algorithm is presented in detail as follows.
Peak searching
The first step of the proposed synchronization is based on the convolution of the received signal with a known CAZAC sequence:
where r(n) denotes the sequence cut from the received signal by a sliding window, and t represents the time by which the sliding window has moved. In Figure 2, the dotted line indicates the timing metric of the conventional auto-correlation method. It has a smooth roll-off and a plateau caused by the cyclic prefix [3], which degrades synchronization performance. Compared with the auto-correlation method, the proposed convolution metric has sharper peaks. Owing to the perfect correlation properties of the CAZAC sequence, peaks appear at four fixed points (P1, P2, P3 and P4 in Figure 1), and the metric is small at all other points.
The performance of the convolution metric is comparable or even superior to that of the conventional cross-correlation method, as shown in Figure 3. Cross correlation has two higher peaks and one lower peak, while the convolution metric has four peaks. The first peak in the first plot of Figure 3 is enlarged in the second plot. It can easily be seen that the convolution metric is smoother than the cross-correlation metric at non-peak samples.
Since the position in a frame at which the sliding window starts working is random, the number of samples that the window cuts from the preamble is not constant. The values of the four peaks therefore change with the starting point of the sliding window. As illustrated in Figure 4, the first peak value is as large as 1000 when the start point is 1, but smaller than 500 when the start point is 512. It is thus uncertain whether the position where the magnitude exceeds the threshold is the first peak; it can be any one of P1, P2, P3 and P4. Consequently, a further detection operation is needed to determine the exact position of the found peak and acquire synchronization.
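The peak-searching idea can be sketched as follows. This is not the paper's metric, whose equation is not reproduced here. It is a minimal illustration assuming that one window of N samples is circularly convolved with the conjugated, time-reversed local sequence, and that the preamble consists of two identical Zadoff-Chu symbols without a cyclic prefix. The helper names `zadoff_chu`, `circular_convolve` and `window_metric` are hypothetical.

```python
import cmath
import math


def zadoff_chu(n, root=1):
    """Even-length Zadoff-Chu sequence (assumed CAZAC form)."""
    return [cmath.exp(1j * math.pi * root * k * k / n) for k in range(n)]


def circular_convolve(a, b):
    """Direct circular convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[m] * b[(k - m) % n] for m in range(n)) for k in range(n)]


def window_metric(window, local_seq):
    """Magnitude of the window convolved with the matched local sequence."""
    n = len(local_seq)
    # Conjugate and time-reverse the local sequence (assumed matched filter).
    matched = [local_seq[(-m) % n].conjugate() for m in range(n)]
    return [abs(v) for v in circular_convolve(window, matched)]


N = 32
seq = zadoff_chu(N)
# Toy stream: 5 idle samples, two identical preamble symbols, 5 idle samples.
stream = [0j] * 5 + seq + seq + [0j] * 5

start = 12                      # arbitrary window position inside the preamble
window = stream[start:start + N]
metric = window_metric(window, seq)

# One convolution yields N metric values; the peak marks the symbol boundary.
peak_index = max(range(N), key=lambda k: metric[k])
symbol_start = start + peak_index
```

In this toy case a single window locates the start of the second preamble symbol. This is the sense in which one convolution decides N samples, whereas one correlation decides only one.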
Peak confirming
Although an individual peak value is uncertain, the sum of the peak values at P1 and P2 is comparatively stable. As seen in Figure 5, the average peak value at P1 and P2 stays within a certain range. As long as the threshold is set below this average value, the found peak must be P1 or P2. The found peak is thus limited to two possible locations, which reduces the complexity of the second detection operation.
Hardware implementation scheme
The design targets a Xilinx Virtex-5 XC5VSX95T FPGA, with some modifications and simplifications. The block diagram of the proposed synchronization method is presented in Figure 6. It mainly comprises four parts: a convolution module, a division module, a correlation module and a comparator module.
Figure 6 Timing synchronization structure
Convolution module
In the estimator, the convolution between the received signal and the local CAZAC sequence is performed. The details of the convolution module are shown at the bottom left of Figure 6. Because a time-domain convolution design is highly complex, it is replaced by multiplication in the frequency domain. The multiplication unit is divided into four parts: a fast Fourier transform (FFT), a read-only memory (ROM), a multiplier and an inverse FFT (IFFT). All of them are implemented with IP cores provided by Xilinx, which improves the overall performance of the convolution. The ROM module stores the CAZAC sequence in the frequency domain. The FFT module transforms the received sequence from the time domain to the frequency domain, while the IFFT module transforms the multiplier output from the frequency domain back to the time domain. The transform size of the FFT and IFFT is N.
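The FFT, ROM, multiplier and IFFT chain relies on the fact that pointwise multiplication of N-point spectra corresponds to N-point circular convolution in the time domain. The sketch below illustrates this with a naive O(N²) DFT so that it needs no external library. The hardware uses FFT IP cores instead, and the names `dft`, `fft_convolve` and `rom_spectrum` are hypothetical.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive discrete Fourier transform (stand-in for the FFT/IFFT cores)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n)
                  for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def fft_convolve(received, rom_spectrum):
    """Circular convolution via transform, pointwise multiply, inverse."""
    spectrum = dft(received)                                   # FFT stage
    product = [a * b for a, b in zip(spectrum, rom_spectrum)]  # multiplier
    return dft(product, inverse=True)                          # IFFT stage


received = [1, 2, 3, 4]
local = [0, 1, 0, 0]           # a one-sample delay as the local sequence
rom_spectrum = dft(local)      # precomputed once, as the ROM would hold it
result = fft_convolve(received, rom_spectrum)
```

Convolving with a one-sample delay circularly shifts the input, so `result` is approximately `[4, 1, 2, 3]`.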
Division module
In (4) and (6), there are division operations that cost a large number of slices and LUTs in implementation. However, division by an integer power of 2 has no such hardware cost and can be implemented with a simple binary shift. A transform module is therefore designed to prune the divisor R into 2^t, and the rule is defined as follows:
Pruning the divisor may reduce the accuracy of the division. However, it does not affect the synchronization accuracy, since the average peak value is still much larger than the maximum value at non-peak positions, as shown in Figure 7. The pruning simplifies the division design and yields a good trade-off between performance and complexity.
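The shift-based division can be sketched in a few lines. The pruning rule itself is not reproduced here, so the example assumes that the divisor R is rounded down to the nearest power of two; the rule actually used in the design may differ. The function names are hypothetical.

```python
def prune_to_power_of_two(divisor):
    """Return t such that 2**t is the largest power of two <= divisor.

    Assumed rule (round down); the rule used in the design may differ.
    """
    if divisor <= 0:
        raise ValueError("divisor must be a positive integer")
    return divisor.bit_length() - 1


def shift_divide(numerator, divisor):
    """Approximate numerator / divisor with a right shift by t bits."""
    return numerator >> prune_to_power_of_two(divisor)


t = prune_to_power_of_two(1000)        # 512 <= 1000 < 1024, so t = 9
approx = shift_divide(4096, 1000)      # 4096 >> 9
exact = 4096 / 1000
```

Here the shift gives 8 while the exact quotient is 4.096. The pruned result can therefore deviate by up to a factor of two under this rule, which is acceptable only because the peak-to-non-peak margin is large.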
Correlation module
The detailed architecture of the correlation module is presented at the bottom right of Figure 6. It is composed of a complex multiplier and an accumulator. The accumulator has two output ports carrying the same value: one is exported from the module, and the other is fed back as an input of the accumulator, which is cleared every N taps.
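The multiply-accumulate behavior described above can be modeled in software as follows. This is a behavioral sketch only: the conjugation of the second operand and the choice to emit one value per N taps are assumptions, and `mac_correlator` is a hypothetical name.

```python
def mac_correlator(a, b, n):
    """Accumulate a[k] * conj(b[k]); emit and clear the sum every n taps."""
    outputs = []
    acc = 0j
    for tap, (x, y) in enumerate(zip(a, b), start=1):
        acc += x * y.conjugate()      # complex multiplier feeding accumulator
        if tap % n == 0:
            outputs.append(acc)       # value exported from the module
            acc = 0j                  # accumulator cleared every n taps
    return outputs


a = [1 + 1j] * 8
b = [1 + 1j] * 8
sums = mac_correlator(a, b, 4)
```

Each product equals 2, so two blocks of four taps yield two accumulated values of 8.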
Comparator module
The aim of comparator 1 in the first step (peak searching) is to find a peak that exceeds the threshold. When such a peak is found, a pulse signal is generated to halt the first step and reduce power consumption. Meanwhile, the second step (peak confirming) is started, and comparator 2 is used to verify the precise position of the found peak. The decision is made according to the result calculated by the division module. Timing synchronization is achieved once the location of the found peak is determined.
To demonstrate how the synchronization estimator works, a Finite State Machine (FSM) is designed. As shown in Figure 8 , the three main states and their working procedure are as follows: 
Scheme performance analysis
Compared with conventional correlation methods, the proposed implementation scheme has advantages in frame head capture time and resource utilization. The detection speed of the conventional correlation method is 1/N sample per tap, because the interval between two correlations is N taps and one correlation calculation determines only one sample. The proposed scheme has the same interval between two convolutions, but each convolution determines N samples. Its detection speed is therefore one sample per tap, which matches the input data rate and is N times that of correlation. The utilization of block RAMs and DSPs is estimated before implementation and shown in Table 1. Although the proposed scheme costs 7 times as many DSPs as correlation, it still occupies less than 9% of the available DSP resources. Owing to the rapid detection speed, the design saves 234 block RAMs, which is 96% of the total available RAM resources. The proposed scheme is thus more reasonable in terms of resource utilization. The design is implemented on a Xilinx Virtex-5 XC5VSX95T FPGA, and Table 2 shows the utilization of the major resources. As shown in the table, the design occupies no more than 20% of the available device resources, which can be considered an efficient implementation.
Figure 9 Frame-start and received signal with different SNR
The proposed scheme is applied to our experimental platform in a wireless environment. It is a TDD-OFDM system with a peak rate of 1 Gbps. Part of the frame observed by the logic analyzer in the time domain is shown in Figure 9. The first signal, "state", is the state in which the estimator is working. The second signal indicates the beginning of the frame. The third signal is the real part of the received signal. The signal in the bottom part is worse than that in the top part because of its higher noise level. At both high and low SNR, the found frame head is correct and the estimator always works in the tracking state. The synchronization mechanism is thus proven to be stable in real tests.
Conclusions
A novel timing synchronization algorithm is presented in this paper. By using convolution instead of correlation, the detection speed of the proposed algorithm is up to N times that of the conventional correlation method. In addition, when the scheme is implemented on a Xilinx Virtex-5 FPGA device, 85% of the block RAMs for frame storage are saved owing to the rapid processing speed and the low complexity. Despite the simplifications, the timing metric of the proposed algorithm has performance comparable to that of the cross-correlation algorithm. Moreover, the estimator applied to the Gbps OFDM trial system is proven efficient and stable in a wireless environment.
