Free space optical communication (FSOC) signals often exhibit a much larger dynamic range and a much lower OSNR than optical fiber communication signals, owing to the large-scale relative motion of the terminals, changing atmospheric channel conditions and large transmission loss. For this reason, flexible digital coherent receivers (DCRs) capable of adapting to different channel conditions have attracted much attention in the FSOC field. As timing recovery is indispensable for DCRs, a timing recovery algorithm (TRA) capable of processing such FSOC signals is of particular interest. In this paper we evaluate the performance of Gardner's timing error detector (TED) and propose a new blind TED with lower computational complexity and more stable output characteristics. Based on the proposed TED, we design a feedback parallel TRA suitable for FSOC signals and implement it in an FPGA to evaluate its real-time performance. Numerical simulations and experiments illustrate the merits of the TRA based on the new TED relative to the TRA based on Gardner's TED.
Introduction
Free space optical communication (FSOC) can exploit the unregulated and nearly unlimited bandwidth in the near-infrared band and provide higher data rates with lasercom terminals of lower size, weight and power (SWaP) than microwave communication [1] - [3]. FSOC systems often work in the photon-starved regime, so the optical signal entering the receiver often has a very low OSNR. Furthermore, the channel parameters of an FSOC system are time-varying and often change drastically due to large-scale relative motion of the terminals and variations in atmospheric channel conditions, so the signal OSNR varies over a large range. The optical receivers in FSOC systems must therefore adapt to signals with a large dynamic range and very low OSNR. In recent years the flexible digital coherent receivers (DCRs) widely used in optical fiber communication (OFC) systems have attracted much attention in the FSOC field because they support programmable adjustment of the sensitivity through different data rates, modulation formats and FEC rates [4]. Recent experiments have shown that DCRs in combination with digital coherent combining (DCC) algorithms can handle FSOC signals with extremely low OSNR (much lower than 0 dB) and very wide dynamic range (much larger than 30 dB) [5] - [7].
The timing recovery algorithm (TRA) resynchronizes the receiver with the incoming data stream and is thus indispensable for DCRs. The performance of a TRA relies on its timing error detector (TED). Various TED schemes have been proposed, such as the well-known "square-and-filter" scheme of Oerder & Meyr ("O&M") [8], and those proposed by Gardner [9], Godard [10] and Lee [11]. Four samples per symbol (SPS) are typically required for the "O&M" scheme because the squaring nonlinearity doubles the signal bandwidth. The others require only 2 SPS and are thus more appealing, especially when the DSP chip processing capacity is limited [12]. In terms of computational complexity, the TED requires 2 real multiplications per symbol for Gardner's, 16 for "O&M", 12 for Lee's and 4 for Godard's (after the signal spectrum has been calculated with an FFT) [13]. With regard to timing phase estimation accuracy, Lee's and Gardner's TEDs exhibit about 1 dB lower jitter than the other two [13]. A variety of modified algorithms and techniques have recently been reported [14] - [16], but most of them target OFC signals with fiber transmission distortions and a relatively high OSNR, so we limit ourselves to the classic TED algorithms. The TRAs proposed so far have two working modes: feedback and feedforward. The "O&M" and Lee's TEDs are often applied in a feedforward TRA, whereas Gardner's and Godard's TEDs are commonly used in a feedback TRA. Considering the computational complexity and accuracy of the TED, the feedback TRA based on Gardner's TED is regarded as a strong candidate for DCRs.
However, the performance of the TRA based on Gardner's TED has so far been investigated mainly for OFC signals with an OSNR around 10 dB and a dynamic range of less than 10 dB [13], whereas FSOC DCRs need to handle optical signals with an OSNR much lower than 0 dB and a dynamic range much larger than 30 dB [7]. Furthermore, because power consumption is more strictly limited in FSOC systems, a TRA with lower computational complexity is needed.
In this paper we first analyze and evaluate the performance of Gardner's TED and the feedback TRA based on it for FSOC signals, and then propose a multiplication-free blind TED algorithm with lower computational complexity and more stable output characteristics. Based on the proposed TED we design a parallel TRA suitable for FSOC signals with a large dynamic range and very low OSNR. To evaluate its real-time processing performance, the parallel TRA is implemented in an FPGA. Simulations and experiments demonstrate that the proposed TED and TRA outperform the Gardner technique for FSOC signals.
A Multiplication-Free TED With Stable Output Characteristics
For the Gardner's TED the timing phase error is evaluated using the following equation
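Gardner's detector is commonly written in the form sketched below, using the sample indexing of this section. The sign convention and the absence of a normalizing factor are assumptions of this sketch; both vary between references.

```latex
\varepsilon_{\mathrm{Gardner}}
  = \sum_{n} \operatorname{Re}\Bigl\{ \bigl[\, x(2n) - x(2n+2) \,\bigr]\, x^{*}(2n+1) \Bigr\}
```

Each term costs two real multiplications, since Re{a b*} = Re(a)Re(b) + Im(a)Im(b), which agrees with the per-symbol count quoted in the Introduction.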
Here x(2n), x(2n + 1) and x(2n + 2) are three adjacent output samples of the interpolator. The curve of the estimated timing phase error ε_Gardner versus the real one follows a sinusoidal shape and is thus commonly referred to as the S-curve [13]. Fig. 1(a) shows example S-curves of Gardner's TED obtained under different OSNRs, using 256 symbols of an NRZ-BPSK signal at 2 SPS for each point. As can be seen from Fig. 1(a), the peak-to-peak value of the S-curves decreases from about 0.5 to about 0.1 when the OSNR decreases from 18 to −3 dB. The slope of the S-curves around zero timing phase error (denoted by k_d) changes by a factor of about 6.7, as shown in Fig. 1(c). This is because the value of ε_Gardner depends on the magnitude of the input signal samples. For a fixed input power, the magnitude of the signal samples is smaller when the OSNR is lower, which leads to a smaller ε_Gardner. When no signal is present the S-curve turns into a horizontal straight line with k_d = 0. These unstable output characteristics, and more specifically the large variations of k_d, may lead to performance degradation or even failure of the feedback TRA, as will be shown later.
Recently Lee proposed a blind feedforward TED taking the following form [11]
Here x(n) and x(n + 1) are two neighboring output samples of the interpolator. The notation Re(·) stands for the real part of the operand within the brackets. Lee's TED exhibits the same accuracy as Gardner's, but it requires 12 real multiplications per symbol, 6 times as many as Gardner's TED [13]. To reduce the computational complexity the above equation can be rewritten as
When N is large, it can be further simplified to the following equation, because the last two terms on the right, which contain only the two samples x(0) and x(N − 1), are much smaller than the preceding summation over N − 1 terms and are thus negligible.
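One rewriting consistent with the description that follows can be sketched as below. It assumes Lee's detector in its unweighted form (weighting constant set to 1) and a block of N samples x(0), ..., x(N − 1); these are assumptions of this sketch, and a positive factor of 2 is dropped because it does not change the angle.

```latex
% Assumed starting point: unweighted form of Lee's detector
\varepsilon_{\mathrm{Lee}} = \frac{1}{2\pi}\arg\Bigl\{
    \sum_{n=0}^{N-1} |x(n)|^{2}(-1)^{n}
    + j\sum_{n=0}^{N-2} \operatorname{Re}\bigl[x(n)\,x^{*}(n+1)\bigr](-1)^{n} \Bigr\}

% Identity for one pair of neighbouring samples
\bigl(x(n)+jx(n+1)\bigr)\bigl(x^{*}(n)+jx^{*}(n+1)\bigr)
    = |x(n)|^{2} - |x(n+1)|^{2} + 2j\,\operatorname{Re}\bigl[x(n)\,x^{*}(n+1)\bigr]

% Telescoping the alternating sum of the magnitude terms
\sum_{n=0}^{N-2}\bigl(|x(n)|^{2}-|x(n+1)|^{2}\bigr)(-1)^{n}
    = 2\sum_{n=0}^{N-1}|x(n)|^{2}(-1)^{n} - |x(0)|^{2} - (-1)^{N-1}|x(N-1)|^{2}

% Resulting product form with two edge terms
\varepsilon_{\mathrm{Lee}} = \frac{1}{2\pi}\arg\Bigl\{
    \sum_{n=0}^{N-2}\bigl(x(n)+jx(n+1)\bigr)\bigl(x^{*}(n)+jx^{*}(n+1)\bigr)(-1)^{n}
    + |x(0)|^{2} + (-1)^{N-1}|x(N-1)|^{2} \Bigr\}
```

The two trailing terms involve only x(0) and x(N − 1), whereas the summation collects N − 1 products.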
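$$\varepsilon = \frac{1}{2\pi}\arg\left\{\sum_{n=0}^{N-2}\operatorname{sgn}\bigl(x(n)+jx(n+1)\bigr)\,\operatorname{sgn}\bigl(x^{*}(n)+jx^{*}(n+1)\bigr)\,(-1)^{n}\right\} \qquad (4)$$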
Here the notation sgn(·) represents the complex sign function defined by sgn(c) = sgn[Re(c)] + j · sgn[Im(c)], where Im(·) stands for the imaginary part of the operand within the brackets. As the complex numbers involved in the calculation only take the values ±1 ± j, no multiplication is required (each multiplication can be replaced by an addition and a simple sign-bit change). The proposed TED thus has lower computational complexity than Gardner's TED. Furthermore, because the timing phase error is estimated from an angle (a calculation that is also multiplication-free, as explained later), the TED output is no longer directly affected by the magnitude of the input signal samples and is thus more stable when the OSNR changes. Example S-curves of the new TED obtained under different OSNRs, using 256 symbols of an NRZ-BPSK signal at 2 SPS for each point, are shown in Fig. 1(b). The S-curves overlap very well when the OSNR varies by more than 20 dB, so k_d remains nearly the same, as shown in Fig. 1(c). This stable output eases the adaptation of the feedback TRA to FSOC signals, as will be shown later in the TRA design section.
To compare the accuracy of the proposed TED with that of Gardner's TED, the timing jitter in decibels (dB) is used as the metric [13]. It is defined as the variance of the zero-crossing positions of the S-curves normalized by the symbol period. Fig. 2(a) and (b) show 100 S-curves of Gardner's and the proposed TED, respectively, obtained by scanning the sampling phase on bursts of an NRZ-BPSK signal with a block size of 512 samples and an OSNR of 6 dB. As can be seen, the zero-crossing position of the S-curves drifts (jitters) over time. The timing jitters obtained under different OSNRs are shown in Fig. 2(c). The proposed TED has an accuracy similar to that of Gardner's TED. The jitter is the same around OSNR = 0 dB. At very low OSNR (−3 dB) the proposed TED exhibits only 0.8 dB higher jitter.
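The amplitude behaviour described above can be illustrated with a short Python sketch. The function names, the Gardner sign convention, the 1/(2π) scaling and the random stand-in samples are assumptions made for illustration; NumPy multiplications are used for brevity where hardware would use sign logic.

```python
import numpy as np


def csgn(c):
    """Complex sign: sgn(Re c) + j*sgn(Im c), as defined in the text."""
    return np.sign(c.real) + 1j * np.sign(c.imag)


def gardner_ted(x):
    """Averaged Gardner error from complex samples x at 2 SPS.

    The sign convention and the averaging are assumptions of this sketch.
    """
    early, mid, late = x[0:-2:2], x[1:-1:2], x[2::2]  # x(2n), x(2n+1), x(2n+2)
    return float(np.mean(((early - late) * np.conj(mid)).real))


def sign_products(x):
    """Terms sgn(x(n) + j x(n+1)) * sgn(x*(n) + j x*(n+1)) * (-1)^n."""
    a = csgn(x[:-1] + 1j * x[1:])
    b = csgn(np.conj(x[:-1]) + 1j * np.conj(x[1:]))
    alternating = np.where(np.arange(len(x) - 1) % 2 == 0, 1.0, -1.0)
    return a * b * alternating


def proposed_ted(x):
    """Angle of the summed sign products, in symbol periods (1/(2*pi) scaling assumed)."""
    return float(np.angle(np.sum(sign_products(x))) / (2 * np.pi))


# Stand-in complex samples (not a real NRZ-BPSK waveform), used only to
# show how each detector reacts to a change of input amplitude.
rng = np.random.default_rng(0)
x = rng.standard_normal(512) + 1j * rng.standard_normal(512)
print(gardner_ted(x), gardner_ted(2.0 * x))    # second value is 4x the first
print(proposed_ted(x), proposed_ted(4.0 * x))  # identical values
```

Each sign product takes one of the four values ±2 or ±2j, so the summation reduces to counting, and scaling the input by a positive constant leaves the output of the sign-based detector unchanged while the Gardner output scales with the square of the amplitude.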
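A minimal Python sketch of this jitter metric is given below. It assumes that each S-curve is sampled on a grid of timing offsets `tau` expressed in symbol periods, that the zero crossing nearest to zero offset is the relevant one, and that the conversion to decibels is 10·log10 of the variance; the function names are hypothetical.

```python
import numpy as np


def zero_crossing(tau, s):
    """Zero crossing of one S-curve closest to tau = 0, by linear interpolation."""
    i = np.where(np.diff(np.sign(s)) != 0)[0]  # intervals containing a sign change
    t0 = tau[i] - s[i] * (tau[i + 1] - tau[i]) / (s[i + 1] - s[i])
    return float(t0[np.argmin(np.abs(t0))])


def timing_jitter_db(tau, s_curves):
    """10*log10 of the variance of the zero-crossing positions (tau in symbol periods)."""
    zc = np.array([zero_crossing(tau, s) for s in s_curves])
    return float(10 * np.log10(np.var(zc)))


# Two synthetic sinusoidal S-curves whose zero crossings sit at -0.012 and +0.012.
tau = np.linspace(-0.5, 0.5, 201)
curves = [np.sin(2 * np.pi * (tau - d)) for d in (-0.012, 0.012)]
print(timing_jitter_db(tau, curves))
```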
Design of the Feedback Parallel TRA
We now briefly describe the working principle of the feedback all-digital TRA. The interested reader is referred to Refs. [17], [18] for further details. As shown in Fig. 3(a), the feedback all-digital TRA employs a phase locked loop (PLL) consisting of a TED, a loop filter (LF), a number-controlled oscillator (NCO) and an interpolator.
Real-time processing necessitates a parallel TRA to overcome the limitation imposed by the much lower working frequency of the DSP chips. The architecture of the proposed feedback parallel TRA is depicted in Fig. 4, where P stands for the parallelization factor. The same timing error is used to compensate all the samples arriving in one clock period, which reduces the required resources. To mitigate the strong noise, all of the P parallel samples output by the interpolator are used in the estimation of the timing error given by Eq. (4). In other words, N is equal to P.
According to Eq. (4), P should be as large as possible to average out the noise. However, the time required to calculate the timing phase error ε and the latency of the parallel TRA also increase with P. The total latency is L = P · l, where the positive integer l is the number of clock cycles required to compute all the operations of the loop shown in Fig. 3(a). When L is too large, the averaged ε calculated from the previous samples deviates from the current true value and leads to a convergence failure, especially when a large sampling clock offset (SCO) makes the timing phase error change quickly. P should therefore be chosen to balance the competing requirements of averaging out the noise and tracking the timing phase error fast enough.
Fig. 5 shows the performance of the parallel TRA when the input OSNR varies over a large range and SCO = 100 ppm. In Figs. 5-6 the LF is assumed to be optimal. The metric used to evaluate the TRA performance is the error vector magnitude (EVM) [19] of the timing recovered signal after being processed by the digital carrier algorithm. In Figs. 5(a) and (b), l is assumed to be 10 and 25 cycles, respectively. When l = 10 cycles, the curve with P = 8 deviates from the theoretical value when the OSNR is below 6 dB, which means that the TRA with P = 8 fails when the OSNR is lower than 6 dB. When the OSNR is lower than 3 dB the TRA with P = 16 also fails, whereas the TRAs with P = 32 to 256 keep working down to an OSNR of −3 dB. This is because when P is too small the noise cannot be averaged out, which leads to the failure of the TRA. Fig. 5(b) shows that when l = 25 cycles, the TRA with P = 256 also fails. This is because when the loop latency L = P · l becomes too large the TRA can no longer track the timing phase error.
In Fig. 6 the SCO is increased from 100 to 500 ppm. Fig. 6(a) shows that only the TRAs with P = 32 and 64 work when l = 10 cycles. Fig. 6(b) shows that only the TRA with P = 32 works when l = 25 cycles. From Figs. 5-6 we conclude that the upper limit of P is reduced when SCO or l is increased, while the lower limit of P is increased when the OSNR is decreased. In other words, the range of usable P is narrower when SCO or l is large and the OSNR is low. To work under very low OSNR we can reduce the SCO by using a better clock source or reduce l by improving the algorithm.
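The latency side of this trade-off can be made concrete with a rough calculation. The Python sketch below interprets L = P · l as a latency in samples at 2 SPS and treats the SCO as a constant phase slip per symbol; both are simplifying assumptions of this sketch, and the function name is hypothetical. It estimates how far the timing phase drifts, in symbol periods, while one estimate travels around the loop.

```python
def phase_drift_during_latency(p, l_cycles, sco_ppm, sps=2):
    """Timing-phase drift (symbol periods) accumulated over a latency of L = P*l samples."""
    latency_symbols = p * l_cycles / sps
    return latency_symbols * sco_ppm * 1e-6


# (P, l, SCO in ppm) combinations discussed in the text
for p, l_cycles, sco in [(256, 10, 100), (256, 25, 100),
                         (64, 10, 500), (32, 25, 500), (64, 25, 500)]:
    drift = phase_drift_during_latency(p, l_cycles, sco)
    print(f"P={p:3d} l={l_cycles:2d} SCO={sco} ppm -> drift {drift:.3f} symbol periods")
```

Under these assumptions the drift is 0.128, 0.16 and 0.2 symbol periods for the three listed combinations reported to work (P = 256 with l = 10 at 100 ppm, P = 64 with l = 10 at 500 ppm, P = 32 with l = 25 at 500 ppm), and 0.32 and 0.4 for the two reported to fail (P = 256 with l = 25 at 100 ppm, P = 64 with l = 25 at 500 ppm).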
The loop latency l comes mainly from the angle calculation in Eq. (4). It is commonly implemented using a Xilinx CORDIC IP-core, which has a large latency (more than ten cycles) [20]. To reduce l, the angle calculation is implemented using a look-up-table (LUT) scheme. The table contains arctangent values, so the angle can be retrieved from the table according to the ratio of the imaginary part to the real part. In this way the angle calculation is multiplication-free and requires only 1 cycle. The proposed TED given by Eq. (4) is thus truly multiplication-free.
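A table-based angle calculation of this kind might look like the following Python sketch. The table depth, the octant folding and the function name are assumptions chosen for illustration and are not taken from the FPGA design; the division used to form the table index is kept only for readability.

```python
import numpy as np

LUT_SIZE = 256  # hypothetical table depth
# Arctangent of ratios in [0, 1], i.e. angles in [0, pi/4]
ATAN_LUT = np.arctan(np.arange(LUT_SIZE) / (LUT_SIZE - 1))


def lut_angle(re, im):
    """Approximate atan2(im, re) with a table look-up and octant folding."""
    if re == 0 and im == 0:
        return 0.0
    a, b = abs(re), abs(im)
    if a >= b:
        # Angle in [0, pi/4]: index the table with |im| / |re|
        ang = ATAN_LUT[int(round(b / a * (LUT_SIZE - 1)))]
    else:
        # Angle in (pi/4, pi/2]: use atan(y/x) = pi/2 - atan(x/y)
        ang = np.pi / 2 - ATAN_LUT[int(round(a / b * (LUT_SIZE - 1)))]
    if re < 0:   # mirror into the left half-plane
        ang = np.pi - ang
    if im < 0:   # mirror into the lower half-plane
        ang = -ang
    return float(ang)


print(lut_angle(2, 2), lut_angle(-6, 0), lut_angle(0, -4))
```

With 256 entries the quantization of the ratio limits the angle error to about 2 mrad in this sketch; a deeper table trades memory for accuracy.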
The other important element determining the TRA performance is the LF. The internal structure of the commonly used proportional-plus-integral LF is shown in Fig. 3(b) [17], [18]. The values of the proportional and integral gains K_1 and K_2 determine the stability of the PLL and the convergence error. K_1 and K_2 should be set according to the following equations [18]
Here ξ is the damping factor and ω_n is the natural frequency. Eqs. (5) and (6) show that the optimal K_1 and K_2 are determined by k_d. Fig. 7 shows the EVM obtained by numerical simulations for different combinations of K_1 and K_2 when Gardner's TED (left column) and the proposed TED (right column) are adopted. In the simulation, the parallel TRA has P = 64 and l = 10 cycles, and the input optical signals are NRZ-BPSK signals with SPS = 2 and an OSNR of −3, 3 and 9 dB. The SCO is set to 500 ppm. The laser frequency offset and linewidth are 500 MHz and 500 kHz, respectively. Note that we let K_1 = 2^(−m) and K_2 = 2^(−n), so that the multiplications in the LF can be replaced by shift operations. K_1 and K_2 are therefore represented by m and n in these figures. When K_1 and K_2 deviate from the optimal range, the EVM increases sharply, and when the EVM degradation is more than 6 dB the TRA actually fails to converge. Within the regions encircled by the solid black lines the EVM degradation is below 1 dB. These regions are considered the safe zones. As can be seen from Fig. 7, when the input signal OSNR changes, the safe zone of the TRA based on Gardner's TED shifts noticeably, while the safe zone of the TRA based on the proposed TED is nearly fixed. This is because k_d changes significantly with OSNR for Gardner's TED but hardly changes for the proposed TED, as shown in Fig. 1(c).
Fig. 8 plots all of the safe zones corresponding to different OSNR values in the same figure. The safe zone corresponding to a specific OSNR value has a specific stripe pattern. A region where all of the safe zones overlap indicates that the TRA with that specific combination of K_1 and K_2 is suitable for all of the OSNR values. When Gardner's TED is used, no region common to all of the safe zones can be found when the OSNR varies from −3 to 18 dB. When the proposed TED is used, by contrast, the overlapped region is much larger and is outlined by the red solid line in Fig. 8(b). The result shows that with the proposed TED, the TRA can more easily adapt to FSOC signals with a large dynamic range and achieve better performance, because the range of usable K_1 and K_2 is wider.
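The shift-based loop filter mentioned above can be sketched in Python as follows. The integer fixed-point representation of the TED output, the update order of the integrator and the function name are assumptions of this sketch rather than details of the structure in Fig. 3(b).

```python
def pi_loop_filter(errors, m, n):
    """Proportional-plus-integral filter with K1 = 2**-m and K2 = 2**-n.

    errors: iterable of integer (fixed-point) TED outputs.
    Both gains are applied with arithmetic right shifts instead of multipliers.
    """
    integrator = 0
    outputs = []
    for e in errors:
        integrator += e >> n                   # integral path: K2 * e
        outputs.append((e >> m) + integrator)  # proportional path: K1 * e
    return outputs


# Constant error of 2**16 in fixed-point units with m = 4, n = 8
print(pi_loop_filter([1 << 16] * 3, m=4, n=8))  # [4352, 4608, 4864]
```

In a linearized loop model where the detector slope k_d multiplies both gains (an assumption of this note), a change of k_d by a factor of 6.7 acts like a change of m and n by log2(6.7) ≈ 2.7, which gives a feel for how far a safe zone can move in the (m, n) plane.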
Real-Time Platform Implementation and Tests
To evaluate the real-time processing performance of the feedback parallel TRA, we compile it using the Xilinx Vivado High-Level Synthesis (HLS) compiler, targeting the Xilinx Virtex 7 FPGA (XC7VX485T) as the execution fabric. The total PLL latency l is found to be 10 cycles. According to Fig. 6(a), the parallelization factor P is set to 64 so that the TRA can work when the OSNR is as low as −3 dB. The maximum clock frequency is found to be 312.5 MHz, so 312.5 MHz × 64 = 20 GSamples can be processed per second. This means that a 10 GBaud signal with SPS = 2 can be processed in real time. The FPGA resource utilization summary is shown in Table 1.
To test the real-time processing performance, the digital samples of an NRZ-BPSK signal with 6-bit resolution, a certain timing phase error and SCO are generated using Matlab and the commercial software VPI TransmissionMaker V. 9.0. The digital samples with timing errors are sent to the FPGA via a read-only memory. The output timing recovered signals are transferred to a computer via a FIFO memory and a USB cable. The EVM of the output signal is then calculated to evaluate the performance [21]. Fig. 9 shows the EVM of the timing recovered signal as a function of the input OSNR for the parallel TRAs based on Gardner's and the new TED, respectively. The dashed line represents the theoretical values of the EVM. When Gardner's TED is used we evaluate the performance of the TRAs adopting the three different combinations of K_1 and K_2 highlighted by the blue rectangles in Fig. 8(a), whereas for the TRA based on the new TED only the one combination of K_1 and K_2 highlighted by the blue rectangle in Fig. 8(b) is adopted. As we can see from Fig. 9(a), the three TRAs based on Gardner's TED work within the OSNR ranges of 9∼18 dB, 0∼6 dB and −3∼3 dB, respectively, but none of them works under all OSNRs. In contrast, the TRA based on the proposed TED works well under all OSNRs. Furthermore, the EVM obtained is very close to the theoretical value. Figs. 9(b) and (c) show a portion of the fractional interval output by the NCO of the proposed TRA when SCO = 500 ppm and OSNR = −3 and 18 dB, respectively. After the acquisition stage is completed, the fractional interval varies periodically with a steady period of 2000 symbols (corresponding to 1/SCO = 1/(500 ppm)). The results show that the TRA accomplishes the timing adjustment successfully for both OSNR = −3 and 18 dB, but the acquisition stage is slightly longer when OSNR = −3 dB.
Fig. 7. The EVM obtained by numerical simulations for different combinations of K_1 and K_2 when Gardner's (the left column) and the proposed TED (the right column) are adopted. The OSNR is set to be 9, 3 and −3 dB respectively.
To investigate the SCO tolerance, the SCO of the input sampled signals is varied from −500 to 500 ppm. The sampling phase is randomly selected. The EVM of the timing recovered signal versus SCO is shown in Fig. 10. The horizontal dashed lines represent the theoretical values of the EVM. As we can see, the EVM of the timing recovered signal is smaller than the theoretical value in all cases owing to the sampling diversity gain [13]. The results show that the proposed parallel TRA can work over the whole SCO variation range.
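The 2000-symbol period follows directly from the SCO, as the short Python sketch below illustrates. It models an idealized, noise-free steady state in which the fractional interval advances by the SCO every symbol and wraps modulo 1; the direction of the sawtooth and the function name are assumptions of this sketch.

```python
import numpy as np


def fractional_interval(num_symbols, sco_ppm, mu0=0.0):
    """Ideal steady-state fractional interval mu(k) in [0, 1) for a constant SCO."""
    k = np.arange(num_symbols)
    return (mu0 + k * sco_ppm * 1e-6) % 1.0


mu = fractional_interval(9000, 500)
wraps = np.where(np.diff(mu) < 0)[0] + 1  # symbol indices where mu wraps around
print(wraps, np.diff(wraps))              # wraps about every 2000 symbols
```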
Conclusions
To address the challenges that FSOC signals pose for timing recovery, which is indispensable for DCRs, we propose a multiplication-free blind TED algorithm with lower computational complexity and more stable output characteristics than Gardner's TED. Based on the proposed TED, we design a parallel TRA suitable for FSOC signals with a large dynamic range and very low OSNR. The effects of the parallelization factor P, the PLL latency l and the SCO on the performance are also investigated. Simulation and experimental results validate the advantages of the proposed TED and TRA over the classic Gardner technique.
