Abstract-In this paper a low-complexity time synchronization algorithm for optical orthogonal frequency division multiplexing (OFDM) is proposed. The algorithm is based on a repetitive preamble that allows the use of a short cross correlator with an exponential average filter for postprocessing before a threshold detection. The signals in the correlation have been quantized with 1 bit, and the correlations have been implemented as a hard-wired tree adder to reduce the hardware cost. This solution has been verified in a passive optical network (PON) system using a directly modulated distributed feedback (DFB) laser achieving excellent performance with low computing processing complexity even in low signal-to-noise ratio scenarios. Finally, a parallel hardware architecture has been proposed for this time synchronization algorithm, and it has been implemented in a field programmable gate array device reaching a sample rate throughput up to 7.4 Gs∕s.
Index Terms-FPGA implementation; OOFDM; PON; Real-time signal processing; Time synchronization.
I. INTRODUCTION
The fast-growing bandwidth demand in the access network market will not be supported by the current wired and wireless access techniques. Therefore, passive optical networks (PONs) are being widely adopted and implemented as a high-speed strategy for broadband access due to their low cost, high reliability, and easy maintenance. Orthogonal frequency division multiplexing (OFDM) has recently been introduced into fiber communications due to its flexible dynamic bandwidth allocation, high spectral efficiency, and strong resistance to chromatic dispersion (CD) [1, 2]. OFDM is a multicarrier modulation technique where each symbol is composed of $N$ samples that are generated by performing an $N$-point inverse fast Fourier transform (IFFT) on $N$ complex data symbols. OFDM systems are very sensitive to errors in time and frequency synchronization. A time synchronization algorithm (TSA) must estimate where the fast Fourier transform (FFT) window begins so that it covers only $N$ samples belonging to the same OFDM symbol, thereby avoiding inter-symbol interference (ISI) and inter-carrier interference (ICI).
Over the past several years, much research effort has been dedicated to developing TSAs for OFDM systems for the wireless environment; see [3] [4] [5] [6] and references therein. However, these algorithms cannot be directly applied to optical OFDM (OOFDM) systems due to their high hardware computational complexity, as OOFDM systems operate with throughputs of several gigasamples per second (Gs/s). To achieve such throughputs the algorithms must be implemented in hardware using highly parallelized architectures such as field programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). So, the design of tailored high-speed, low-complexity OOFDM synchronization solutions with good accuracy is important for real-time implementation of the OOFDM technique in next-generation, cost-effective, high-capacity transmission systems. Therefore, to make feasible the implementation of OOFDM systems using current technologies, it is mandatory to reduce as much as possible the computing complexity of the digital signal processing algorithms.
A simple symbol synchronization technique utilizing subtraction and Gaussian windowing has been proposed and implemented for OOFDM systems in [7, 8], where the authors exploit the periodic structure of the cyclic prefix (CP) of OFDM symbols. Although these techniques were originally designed for general OFDM systems, they can be adapted to preamble-based OFDM systems, where a known preamble is transmitted before the OFDM data symbols for synchronization and channel estimation purposes; nevertheless, better performance can be obtained by using algorithms that take advantage of the known preamble structure, as in [9] [10] [11] [12]. In [9], the authors make use of an autocorrelation of the incoming signal scaled by the signal power to detect the beginning of the training sequence. Another approach is to replace the autocorrelation by a cross correlation and eliminate the division by the received signal power [10] [11] [12].
In this paper we have developed a TSA for a real-time intensity-modulation and direct-detection (IM/DD) preamble-based OOFDM reception system. In order to achieve a hardware-efficient implementation of this algorithm, we propose a novel $N_p$-parallel symbol synchronization method based on a designed preamble with a repetitive structure in the time domain; this TSA performs a cross correlation between the known preamble and the received data without using multipliers. The use of a repetitive preamble allows us to employ a shorter cross correlator than the one used in the works cited above, which results in a lower hardware complexity implementation. To test our algorithm we implemented an experimental setup over 100 km standard single-mode fiber (SSMF) using a digital-analog converter (DAC) operating at 4 Gs/s to generate the test signals; the photoreceiver output is captured with a digital oscilloscope working at 20 Gs/s and stored for offline processing. Thanks to the parallel pipelined architecture and its low hardware cost, the proposed algorithm has been successfully implemented in a Xilinx Virtex-7 FPGA device working at more than 7 Gs/s.
The paper is organized as follows. In Section II the proposed TSA is described. Section III presents the experimental setup used to evaluate our TSA and the obtained results. Section IV describes the hardware implementation and comparisons with other algorithms from the literature. Finally, conclusions are given in Section V.
II. PROPOSED ALGORITHM
In a previous work we proposed a synchronization technique for wireless OFDM systems using cross correlation with a repetitive preamble [6]. The main problem of the solutions designed for wireless systems is the difficulty of working at sample rates of several Gs/s, which are usual in IM/DD OOFDM systems. Thus, we take the idea of employing a repetitive preamble from [6] to reduce the cross-correlation complexity, but we have developed a new solution for the postprocessing of the cross-correlator output that reduces the hardware cost while maintaining good performance. The block diagram of the proposed TSA is shown in Fig. 1. After quantization, the input signal is cross correlated with the known preamble, and the result is processed by an exponential average filter. Finally, there is a threshold detection block to find the peak of the filter output.
The preamble structure used in this work for OOFDM is shown in Fig. 2. It consists of eight identical short symbols (SS) of $N_{ss}$ samples and two identical long symbols (LS) of $N$ samples, each one preceded by a guard interval (GI) composed of $2N_{cp}$ samples from the last samples of the LS. Here $N_{cp}$ is the size of the cyclic prefix, $N_{ss}$ is the size of an SS and is equal to $N/8$, and $N$ is the size of the IFFT. The length of the cyclic prefix must be at least equal to the length of the channel dispersion, and this value does not affect the proposed TSA. In this work we have used an FFT size of 256 and a CP 32 samples long.
This structure is similar to the one used by WLAN standard IEEE 802.11a/g [13, 14]. The first part of the preamble, composed of eight SS, can be used to detect the presence of the signal, to estimate the beginning of the OFDM data symbols, to manage the distortion in the early samples of the preamble during the settling time of the automatic gain control (AGC) stage, and to estimate the carrier frequency offset (CFO). Although in IM/DD optical systems a real-valued baseband OFDM signal is usually employed and CFO is not present, the use of a repetitive preamble makes it feasible to use radiofrequency (RF) modulated OOFDM signals, where the RF sections in transmission and reception may introduce CFO. This transmission scheme, where the OFDM signal is generated by using quadrature and in-phase branches modulating an RF sin/cos carrier, can double the data rate by using a second ADC/DAC pair without doubling the sampling rate specifications of the converters. Finally, channel estimation can be accurately achieved using the long symbols.
The repetitive part of the preamble with eight SS, also called the training sequence (TS) in this paper, is generated by modulating one of every eight subcarriers (from subcarrier 1 to $N/2 - 1$) with quadrature phase shift keying (QPSK) symbols, while the remaining subcarriers are filled with zeros before using the IFFT. As the generated signal must be composed of real-valued samples, it is necessary that subcarriers from $-1$ to $-N/2 + 1$ have Hermitian symmetry around the direct current (DC) subcarrier; this implies that only $N/8$ of the $N$ total subcarriers are nonzero, and the DC carrier is used to bias the laser. The length of the training sequence $N_{ts}$ is equal to the FFT length $N$. In the experimental measurements we have used the values $N = 256$ and $N_{ss} = 32$ because they facilitate the parallel implementation of the algorithm, as will be discussed below.
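The construction of the training sequence can be illustrated with a short sketch. It is only an approximation under stated assumptions: the loaded subcarriers are taken to be the multiples of 8 between 1 and $N/2 - 1$, the QPSK symbols are drawn at random, the DC term is left at zero (the bias being applied elsewhere), and a direct inverse DFT replaces the IFFT. The function name `training_sequence` and its parameters are hypothetical.

```python
import cmath
import random


def training_sequence(n_fft=256, step=8, seed=0):
    """Build a periodic, real-valued training sequence (illustrative sketch)."""
    rng = random.Random(seed)
    qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
    spectrum = [0j] * n_fft
    # Assumption: load every `step`-th subcarrier on the positive half.
    for k in range(step, n_fft // 2, step):
        spectrum[k] = rng.choice(qpsk)
        # Hermitian symmetry: subcarrier -k (index n_fft - k) is the conjugate.
        spectrum[n_fft - k] = spectrum[k].conjugate()
    loaded = [k for k in range(n_fft) if spectrum[k] != 0]
    samples = []
    for n in range(n_fft):
        # Direct inverse DFT over the loaded subcarriers only.
        acc = sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * n / n_fft)
                  for k in loaded)
        samples.append(acc / n_fft)
    return samples


ts = training_sequence()
```

Because only multiples of 8 are loaded, the returned samples repeat every $N/8 = 32$ samples, and the Hermitian symmetry makes their imaginary parts vanish up to rounding error.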
From an implementation point of view, a cross correlator is a high-complexity block in optical communications, as it requires a large number of multipliers to process several Gs/s. To avoid this high computational cost, its implementation can be simplified with a multiplier-free scheme [6] based on a hard-wired tree adder, where the TS values are represented by their sign bit and the input signal $x(n)$ is quantized ($Q$) with as few bits as possible. Then, the cross correlation can be expressed as in Eq. (1), where sgn is the sign function. This solution, i.e., the replacement of adders and multipliers by a hard-wired tree adder, is possible because the TS values are known and they determine the structure of the tree adder. We will show that it is also possible to quantize the input signal with one bit. A similar approach has been employed in [11, 15], where the full multipliers have been substituted by XNOR multipliers and a tree adder, but that solution does not take into account that the TS is known by the receiver, and more hardware resources are needed:

$$P(n) = \sum_{m=0}^{N_{ss}-1} Q[x(n+m)] \cdot \mathrm{sgn}[SS(m)]. \tag{1}$$

Only $N_{ss} - 1$ real adders are needed to implement a cross correlator as a hard-wired tree adder, whereas the traditional implementation requires $N_{ss}$ real multipliers and $N_{ss} - 1$ real adders. For example, if we use $N_{ss} = 32$ and an SS with the signs {1, 1, 1, 1, 1, −1, −1, 1, 1, 1, 1, −1, −1, −1, −1, −1, −1, −1, −1, −1, −1, −1, 1, 1, 1, −1, −1, −1, 1, −1, −1, −1}, the cross correlator is implemented as shown in Fig. 3, where the box represents the last four terms of the cross correlator calculated as Eq. (2). If we regroup these four terms to implement two-step hard-wired tree adders, we obtain Eq. (3):

$$Q[x(n+28)] - Q[x(n+29)] - Q[x(n+30)] - Q[x(n+31)], \tag{2}$$

$$\{Q[x(n+28)] - Q[x(n+29)]\} - \{Q[x(n+30)] + Q[x(n+31)]\}. \tag{3}$$
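A software model may help to see why no multipliers are needed. The sketch below is a behavioral illustration of Eq. (1) with 1-bit quantization, not the hardware description itself; the names `SS_SIGNS`, `quantize_1bit`, `cross_correlate`, and `last_four_terms` are hypothetical, and mapping zero to +1 in the quantizer is an assumption.

```python
# Sign pattern of the short symbol quoted in the text (N_ss = 32).
SS_SIGNS = [1, 1, 1, 1, 1, -1, -1, 1, 1, 1, 1, -1, -1, -1, -1, -1,
            -1, -1, -1, -1, -1, -1, 1, 1, 1, -1, -1, -1, 1, -1, -1, -1]


def quantize_1bit(x):
    """1-bit quantizer: keep only the sign (assumption: 0 maps to +1)."""
    return 1 if x >= 0 else -1


def cross_correlate(x, signs=SS_SIGNS):
    """P(n) = sum_m Q[x(n+m)] * sgn[SS(m)] using additions/subtractions only."""
    q = [quantize_1bit(v) for v in x]
    out = []
    for n in range(len(q) - len(signs) + 1):
        acc = 0
        for m, s in enumerate(signs):
            # The sign is fixed at design time, so each term is hard-wired
            # as either an addition or a subtraction.
            acc = acc + q[n + m] if s > 0 else acc - q[n + m]
        out.append(acc)
    return out


def last_four_terms(q, n):
    """Two-step regrouping of terms 28..31, following Eq. (3)."""
    return (q[n + 28] - q[n + 29]) - (q[n + 30] + q[n + 31])


p = cross_correlate(SS_SIGNS * 3)
```

Feeding three noiseless repetitions of the sign pattern produces a maximum output of $N_{ss} = 32$ each time a full SS is aligned with the correlator.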
The cross correlator generates a periodic peak at its output when the TS is at its input; then, we make use of the exponential average filter in Eq. (4) to enhance these peaks and reduce the background noise: the amplitude of peaks generated when SS is present grows, and the amplitude of other spurious peaks decreases because the filter averages the current cross-correlator output $P(n)$ with the filter output delayed $N_{ss}$ samples, $M(n - N_{ss})$. The average filter also avoids false detections when the input is not the TS and the background noise is high. Moreover, thanks to this average filter, the peaks at the output vanish quickly once the TS ends, because $P(n)$ no longer contains a periodic peak. So, the implementation of a threshold detector to find the last peak is simplified, as the difference between the periodic peak and the spurious peaks is high. By using $\alpha = 0.5$, both products in the average filter can be implemented by a bit shift operation, and the hardware cost is reduced.
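The behavior of the averaging stage can be sketched in a few lines. The recursion used here, $M(n) = \alpha P(n) + (1 - \alpha) M(n - N_{ss})$, is an assumed form that matches the description in the text; which of the two terms carries $\alpha$ is not recoverable from it, although the choice is irrelevant for $\alpha = 0.5$. The function name `exp_average_filter` is hypothetical.

```python
def exp_average_filter(p, n_ss=32):
    """Assumed form: M(n) = P(n)/2 + M(n - n_ss)/2, i.e. alpha = 0.5."""
    m = []
    for n, value in enumerate(p):
        delayed = m[n - n_ss] if n >= n_ss else 0
        # Each product by 0.5 is a one-bit arithmetic right shift.
        m.append((value >> 1) + (delayed >> 1))
    return m


# Eight ideal correlation peaks of height 32, one every 32 samples.
p = [0] * 256
for i in range(31, 256, 32):
    p[i] = 32
m = exp_average_filter(p)
```

With this input the filtered peaks grow as 16, 24, 28, 30, 31 and then saturate, while any isolated spurious peak would be halved on every period.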
The output of this averaged cross correlation has eight major peaks, each one coinciding with the presence of the last SS sample at the cross-correlator input, as can be seen in Fig. 4. The last peak is used as a reference for time synchronization: once the TS has entered the cross correlator, we expect a peak every $N_{ss}$ samples; when it disappears, it means the GI has started and therefore the last peak has been found. Once its location is found, it is used to select the incoming $2N_{cp} + N$ samples from $x(n)$ that correspond to the LS part of the preamble; then the LS is employed to estimate the channel response. The reference peak can be detected by setting a threshold value at the output of the filter and using a small control logic; this threshold is selected in the same way as in [6].
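One possible form of the control logic is sketched below. It is an assumption about how the "last peak" rule could be realized, not the logic used in the paper: a peak is declared the last one when the filter output fails to reach the threshold $N_{ss}$ samples later. The function name `find_last_peak` and the threshold value of 20 are purely illustrative.

```python
def find_last_peak(m, threshold, n_ss=32):
    """Return the index of the last periodic peak, or None if not confirmed."""
    last = None
    for n, value in enumerate(m):
        if value >= threshold:
            last = n
        elif last is not None and n - last >= n_ss:
            # The expected next peak did not arrive: the GI has started.
            return last
    return None


# Filter output for eight peaks followed by the decaying tail.
m = [0] * 400
levels = [16, 24, 28, 30, 31, 31, 31, 31, 15, 7, 3, 1]
for idx, level in zip(range(31, 400, 32), levels):
    m[idx] = level
reference = find_last_peak(m, threshold=20)
```

In this example the reference is the eighth peak at index 255, so the LS window would start at the following sample.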
A. Finite-Precision Analysis
We have developed a fixed-point model of the TSA where signals are quantized with the following number of bits: $a$ for correlation input data, $b$ for filter input data, and $c$ for filter output data, as shown in Fig. 1. As the filter is an exponential recursive one, $c$ can have the same width as $b$. However, $b$ depends on the input width $a$ and the growth of the $N_{ss}$ terms to be added in the correlator: $b = a + \log_2 N_{ss}$. The performance of the TSA with different quantization values has been evaluated by simulation, and later these results have been validated in a real experiment. Thus, $10^4$ preambles have been generated and transmitted through an additive white Gaussian noise (AWGN) channel. At the receiver side, the probability of correct time detection (PCTD) has been computed as

$$\mathrm{PCTD} = \frac{\text{Number of correct timing offset acquisitions}}{\text{Number of total timing offset acquisitions}},$$

where correct time detection is considered when the TSA detects a peak at the last sample of the TS, or one sample before or one sample after.
Simulation results for $N_{ss} = 32$ are shown in Fig. 5. It is clearly observed that the PCTD of the proposed method does not depend on $a$ for a signal-to-noise ratio (SNR) above 3 dB; as a reference, an SNR of 3.6 dB is needed to obtain a BER of $10^{-2}$ when a QPSK modulation scheme is employed. So, we can expect good performance (PCTD ≈ 1) of the TSA using an input signal quantized with 1 bit ($a = 1$, that is, using the sign bit of the received signal) in practical scenarios where the SNR would be higher; then the growth in the number of quantization bits in the cross correlator would give $b = c = 6$ for $N_{ss} = 32$. As the hardware cost of digital signal processing has a strong dependence on the number of quantization bits, these low values allow us to obtain a low-cost hardware implementation. We have taken the BER value of $10^{-2}$ as a reference because it is commonly considered a forward error correction (FEC) threshold to obtain an error-free transmission when a soft-decision FEC coding with 20% redundancy is employed [16]; on the other hand, if a hard-decision FEC coding with 7% redundancy were employed, the FEC threshold would be $3.8 \times 10^{-3}$ [17], which corresponds to an SNR of 4.9 dB.
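The two quantities used in this analysis are simple to compute, as the following sketch suggests. The helper names `filter_input_width` and `pctd` are hypothetical, and the list of detected positions is made-up illustration data, not a simulation result.

```python
import math


def filter_input_width(a, n_ss):
    """b = a + log2(n_ss): bit growth after adding n_ss terms of a bits."""
    return a + math.ceil(math.log2(n_ss))


def pctd(detected, true_index, tolerance=1):
    """Fraction of detections within `tolerance` samples of the true index."""
    if not detected:
        return 0.0
    hits = sum(1 for d in detected
               if d is not None and abs(d - true_index) <= tolerance)
    return hits / len(detected)


b = filter_input_width(a=1, n_ss=32)
score = pctd([255, 256, 254, 250, None], true_index=255)
```

For $a = 1$ and $N_{ss} = 32$ the first helper gives $b = 6$; in the toy list, three of the five acquisitions fall within one sample of the true position, while the miss and the missed detection count as failures.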
III. EXPERIMENTAL SETUP AND RESULTS

A. Experimental Setup
In this experiment the number of OFDM subcarriers is set to $N = 256$, but due to the Hermitian symmetry only 128 are defined: 112 transport data, 15 are used for the frequency guard interval, and DC is used to bias the laser. The CP length is 1/8 of an OFDM period, i.e., 32 samples in every OFDM symbol. Data subcarriers are modulated with QPSK symbols. The OFDM samples are generated offline using MATLAB, and they are sent to a Maxim DAC MAX19693 (12 bits) operating at 4.0 Gs/s to produce the required analog electrical signal. As a result of these settings the bit rate in our experiment is 3.11 Gb/s. First, the electrical OFDM signal is low-pass filtered (LPF), and then it passes through an electrical amplifier (EA) before optical conversion. A single-mode 1550 nm directly modulated linear optically isolated distributed feedback (DFB) laser is driven by the amplified electrical OFDM signal with 3 dBm optical output power. Before the photoelectric conversion, the power of the detected optical OFDM signal can be changed by a variable optical attenuator (VOA). The received baseband OFDM signal is obtained via a high-performance InGaAs photodiode (PD) with a bandwidth of 3.0 GHz. The converted electrical OFDM signal is preamplified by an EA, sampled at 20 Gs/s by a Tektronix DPO TDS7154B (8 bits), and stored for offline processing in MATLAB. Figure 6 shows the experimental setup for the IM/DD OOFDM transmission system over 100 km SSMF.
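The quoted bit rate follows from the parameters above, as this short calculation shows. It assumes that preamble overhead is not counted, which is consistent with the stated figure; the function name `ofdm_bit_rate` is hypothetical.

```python
def ofdm_bit_rate(data_subcarriers, bits_per_subcarrier, n_fft, n_cp,
                  sample_rate_hz):
    """Raw bit rate of one OFDM stream, ignoring preamble overhead."""
    samples_per_symbol = n_fft + n_cp            # 256 + 32 = 288 samples
    bits_per_symbol = data_subcarriers * bits_per_subcarrier  # 112 * 2
    return bits_per_symbol * sample_rate_hz / samples_per_symbol


rate = ofdm_bit_rate(112, 2, 256, 32, 4.0e9)
print(f"{rate / 1e9:.2f} Gb/s")
```

Each OFDM symbol carries 224 bits in 288 samples, i.e., in 72 ns at 4.0 Gs/s, which gives approximately 3.11 Gb/s.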
B. Results
The proposed TSA with a preamble composed of eight SS of $N_{ss} = 32$ samples has been evaluated in the setup shown in Fig. 6, and its performance has been compared with that of two other TSAs: the first one was proposed by Park et al. [5] for wireless channels, and it is based on the autocorrelation of the received signal; the second one was proposed in [12] for OOFDM, and it is based on a cross correlation. Figures 7 and 8 show the probability of obtaining a correct time detection versus the received optical power for a back-to-back (BTB) connection and after transmission through 100 km SSMF, respectively. These results have been obtained transmitting 1200 OFDM test frames for each algorithm and for each received power. Figure 9 shows the curves of BER versus received optical power for back-to-back and after 100 km SSMF; in both curves the received optical power needed to obtain a FEC threshold BER of $10^{-2}$ has been highlighted. The BER results have been calculated for correctly detected frames; in this case all the TSAs give the same results and the figure characterizes the behavior of the complete system after synchronization. It can be seen that all three synchronization algorithms can accurately synchronize in time for the BER value of reference, although Park's method has poorer performance for low values of received optical power. Both methods using cross correlation have better performance, but Chen's is slightly better thanks to its larger correlation size. Nevertheless, the difference between both algorithms in the 100 km experiment appears only for received optical power levels below −23 dBm, corresponding to BER values poorer than $10^{-1}$, that is, to optical power levels that are not useful in a real scenario. The same behavior can be noticed in the BTB case.
IV. HARDWARE IMPLEMENTATION
The real-time OFDM receiver has been implemented on a Xilinx Virtex-7 VC707 evaluation board and a 4DSP FMC126 ADC card. The first board is equipped with a XC7VX485T-2 FPGA chip (with a maximum clock frequency of 650 MHz), and the second board is equipped with an E2V 10-bit EV10AQ190 ADC chip allowing a maximum sampling rate of 5 Gs/s. The EV10AQ190 ADC sends four sampled data at the same time to the FPGA in double data rate (DDR) mode via 40 low-voltage differential signaling (LVDS) lines. The FPGA has a dedicated serial-to-parallel converter (ISERDES), which can create a 4-, 6-, 8-, or 10-bit-wide parallel word. If the smallest parallel word width (4) is selected, 16 sampled data per clock cycle are obtained. Therefore, to achieve a throughput of 5 Gs/s we need to process 16 channels in parallel using a clock frequency of 312.5 MHz.
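The relation between sampling rate, parallelism, and clock frequency can be checked with a one-line calculation. The sketch assumes that the 16 samples per cycle result from the four simultaneous ADC samples deserialized by a factor of 4; the function name `required_clock_hz` is hypothetical.

```python
def required_clock_hz(sample_rate_hz, samples_per_cycle):
    """Clock frequency needed to sustain a given parallel throughput."""
    return sample_rate_hz / samples_per_cycle


# Assumption: 4 simultaneous ADC samples x 4-bit ISERDES word = 16 per cycle.
samples_per_cycle = 4 * 4
clock = required_clock_hz(5.0e9, samples_per_cycle)
print(f"{clock / 1e6:.1f} MHz")
```

This reproduces the 312.5 MHz figure, comfortably below the 650 MHz maximum quoted for the device.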
The performance and the hardware cost of the TSA are affected by finite-precision issues in the representation of inner variables and the input signal. The finite-precision analysis of our algorithm was presented in Subsection II.A. In the following sections we present its parallel implementation and compare the resources used by this implementation with other algorithms. Finally, we present the implementation results for Virtex-7 XC7VX485T-2 FPGA.
Fig. 7. Curves of probability of correct time detection versus the received power for the back-to-back case and the marker for $10^{-2}$ BER.
Fig. 8. Curves of probability of correct time detection versus the received power after transmission over 100 km SSMF and the marker for $10^{-2}$ BER.
Fig. 9. Curves of BER versus the received power for the back-to-back case and after transmission over 100 km SSMF.
A. Parallel TSA
The algorithm described in Eqs. (1) and (4) is a single-input single-output (SISO) system. To obtain a parallel processing structure, the SISO system must be converted into a multiple-input multiple-output (MIMO) system using the look-ahead technique described in [18]. Equation (6) describes a parallel cross-correlation system with $N_p$ inputs per clock cycle, where $k$ denotes the clock cycle.
The $N_p$-parallel exponential average filter is described by Eq. (7).
Implementing Eqs. (6) and (7) requires $N_p(N_{ss} - 1)$ and $N_p$ adders, respectively. For our purpose it is necessary to detect the last peak by setting a threshold value at the output of the $N_p$ exponential average filters. The correct timing location is determined by the position of the last peak. To implement this task, $N_p$ comparators and one priority decoder are needed.
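The comparator and priority decoder stage can be modeled in software as follows. This is a behavioral sketch only: the priority order (the most recent sample in the block wins) is an assumption, and the function name `detect_in_block` is hypothetical.

```python
def detect_in_block(filter_outputs, threshold):
    """Model N_p comparators followed by one priority decoder.

    Returns the index of the highest-numbered lane whose filter output
    reaches the threshold, or None when no lane does.
    """
    # One comparator per parallel lane.
    flags = [value >= threshold for value in filter_outputs]
    # Priority decoder (assumed order: the latest sample has priority).
    for lane in reversed(range(len(flags))):
        if flags[lane]:
            return lane
    return None


lane = detect_in_block([0, 5, 30, 2], threshold=20)
```

The absolute sample position would then be $N_p k$ plus the returned lane index, where $k$ is the clock cycle in which the block was processed.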
The parallel implementation of the TSA and the detailed signal processing flow are shown in Fig. 10 with $N_p = 16$ and $N_{ss} = 32$. This value of the $N_{ss}$ parameter has been chosen as a trade-off between the algorithm's performance and the hardware complexity to simplify the $N_p$-parallel hardware architecture of the exponential average filter. If $N_{ss} < N_p$, there exist long feedback loops in the parallel recursive filter. In this case, the filter output depends on a previous output that is being computed in parallel; so, combinational paths are generated among outputs. These combinational paths exhibit long delays and usually are the critical paths that limit the maximum operating frequency in the algorithm implementation. On the other hand, if $N_{ss} > N_p$ but $N_{ss}$ is not an integer multiple of $N_p$, an irregular hardware structure is obtained, introducing routing delays and limiting the operating frequency. However, these problems are avoided if $N_{ss}$ is chosen as an integer multiple of $N_p$. In such a case, each parallel output only depends on itself delayed by $N_{ss}/N_p$ clock cycles; so, the critical path is not increased and the hardware structure is regular, as the $N_p$-parallel average filters can be implemented as $N_p$ independent filters. For example, as in our case $N_p = 16$, one delay in a branch gives a total delay of 16 samples. Therefore, to obtain the term $M(16k - 32)$ in Eq. (7) we need to delay $M(16k)$ two times instead of 32 times. Samples $M(16k - 32)$ and $M(16k)$ are outputs from the same filter, avoiding dependences among the 16 parallelized branches of the filter.
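The independence of the parallel branches can be checked with a small behavioral model. The sketch assumes the same $\alpha = 0.5$ recursion with shift-based halving, $M(n) = P(n)/2 + M(n - N_{ss})/2$, which is an assumed form of the filter; the names `serial_filter` and `parallel_filter` and the test signal are hypothetical.

```python
def serial_filter(p, n_ss):
    """Sample-by-sample reference: M(n) = P(n)/2 + M(n - n_ss)/2."""
    m = []
    for n, value in enumerate(p):
        delayed = m[n - n_ss] if n >= n_ss else 0
        m.append((value >> 1) + (delayed >> 1))
    return m


def parallel_filter(p, n_ss, n_p):
    """Process n_p samples per clock cycle with n_p independent lanes."""
    assert n_ss % n_p == 0 and len(p) % n_p == 0
    depth = n_ss // n_p                         # registers per lane
    lanes = [[0] * depth for _ in range(n_p)]   # one delay line per lane
    out = []
    for k in range(len(p) // n_p):              # k is the clock cycle
        for i in range(n_p):                    # concurrent in hardware
            delayed = lanes[i].pop(0)           # own output, depth cycles ago
            value = (p[n_p * k + i] >> 1) + (delayed >> 1)
            lanes[i].append(value)
            out.append(value)
    return out


p = [((7 * n) % 65) - 32 for n in range(256)]   # arbitrary test signal
same = serial_filter(p, 32) == parallel_filter(p, 32, 16)
```

With $N_{ss} = 32$ and $N_p = 16$ each lane needs only two registers and never reads another lane's output, yet the result matches the serial filter sample by sample.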
B. Complexity Comparisons
The complexity of six hardware implementations of TSAs is presented next. References [7, 8] are based on subtraction and Gaussian windowing of the CP, Ref. [9] is based on autocorrelation of the TS, and Refs. [10] [11] [12] are based on cross correlation of the TS. These implementations are used for high-speed OOFDM receiver systems, and all algorithms process N p samples in parallel. For a fair comparison of various time synchronization techniques, they are evaluated in an IM/DD OOFDM system with real-valued OFDM signal generation and detection, in which only the symbol synchronization is considered. The algorithms presented in [9, 10] were designed for complex-valued OFDM signals in coherent optical communications. We have estimated their computational cost when they are used with real-valued OFDM signals to be able to compare them with the rest.
The algorithm proposed in [9] makes use of a repetitive preamble based on autocorrelation; it needs a special demultiplexed channel to reduce its computational cost. The size of the training sequence is equal to $8N$. In [10] a training sequence is generated with values randomly selected from the set {−1, 1}. At the receiver side, the received data are cross correlated with the training sequence. This correlation operation is implemented using additions and subtractions, so multiplication is removed and a large area is saved. In [11, 12] the sign of the received data is cross correlated with the sign of the training sequence, so multiplications are replaced by XNOR multipliers. In these algorithms the size of the TS is equal to $N_{cp} + N$.
The generation of the training sequence used by the proposed TSA has been described in Section II. At the receiver side, the quantized received data are cross correlated with the sign of the training sequence; this correlation is implemented using only additions and subtractions. It was shown in Subsection II.A that using only 1 bit for quantizing the received signal preamble is enough to obtain good performance for a BER value of $10^{-2}$; this reduction in the number of bits dramatically reduces the computational cost.
In Table I the complexity of these TSAs is shown; they can be classified in two groups depending on whether or not they use multipliers. Moreover, those that use multipliers [7] [8] [9] also need to find a maximum (Max) to determine the correct timing position of the start of the data symbol, which is more computationally expensive than using a threshold detection (Th). The algorithms based on cross correlation have been implemented without multipliers. In [11] and [12] the average bit-width of pipelined adders is smaller than in [10]. This is due to the fact that whereas [11] and [12] work with 1-bit quantization for input and reference signals, in [10] the input signal is quantized with 7 bits, as shown in Table I. Our algorithm has lower complexity than the rest because it only needs $N_{ss}$ adders, where $N_{ss} \ll N_{ts}$, to determine the correct timing location. It also benefits from using a hard-wired tree adder instead of using XNOR multipliers. For example, with $N = 256$, $N_{cp} = 32$, and $N_p = 16$, the TSA described in [12] needs 4624 real adders, 4608 XNORs, and two threshold detectors, while our algorithm needs 512 real adders and one threshold detector, which implies nine times fewer adders, zero XNORs, and half the hardware cost to detect the peak of the correlation. The small number and size of the adders in our TSA make the system latency lower than in the other algorithms.
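The adder count of the proposed TSA can be reproduced from the figures given for the parallel architecture, as in the following sketch. The count for [12] is taken as quoted in the text rather than derived; the names `proposed_adders` and `ADDERS_REF_12` are hypothetical.

```python
def proposed_adders(n_ss, n_p):
    """N_p(N_ss - 1) correlator adders plus N_p filter adders."""
    return n_p * (n_ss - 1) + n_p


ADDERS_REF_12 = 4624  # value quoted in the text for the TSA in [12]

ours = proposed_adders(n_ss=32, n_p=16)
ratio = ADDERS_REF_12 / ours
print(ours, round(ratio, 2))
```

The total simplifies to $N_p N_{ss} = 512$ adders, and the ratio of roughly 9 matches the "nine times fewer adders" stated above.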
C. FPGA Implementation
The time synchronization architecture has been modeled using the VHDL hardware description language and verified using the MATLAB finite-precision model. It has been implemented on a Xilinx Virtex-7 XC7VX485T-2 FPGA using the Xilinx ISE 14.7 software tool. The number of slice registers and slice LUTs used in our design is 1690 and 1773, respectively. The achieved maximum operating frequency is 464.253 MHz, which would allow it to work in real time at a sampling rate up to 7.4 Gs∕s.
V. CONCLUSION
In this work, a new TSA has been proposed for IM/DD optical systems using OFDM modulation. This synchronization algorithm makes use of a repetitive preamble, where the receiver cross correlates the received signal with the repetitive part of the preamble using only 
