Abstract-This paper presents the first demonstration of the use of a periodically poled lithium niobate device for signal processing at 640 Gbit/s. Clock recovery is performed successfully using the lithium niobate device, and the clock signal is used to control a nonlinear fiber-based demultiplexer. The full 640-Gbit/s system gives error-free performance with no pattern dependence and there is less than 1-dB power penalty after 50-km fiber transmission.
I. INTRODUCTION
THE demand for higher telecommunication bandwidth is continuously growing, and the challenge of providing it is met in research labs around the world by strong efforts to find new techniques and technologies to lift the burden. For instance, an impressive 25.6-Tbit/s data transmission using polarization multiplexed RZ-DQPSK in the C+L band was demonstrated recently [1]. In addition, at OFC 2008, plenary speaker Bob Metcalfe, inventor of the Ethernet, professed that 1 Terabit/s Ethernet (TE) will be needed in the near future and that it is essential to conduct fundamental research on new technologies to enable this, since current technologies cannot [2]. Whether 1 TE will be best created by a serial or a parallel approach is an open question, but to answer it, it is necessary to conduct research on multiple paths.
One path to increase the channel capacity is to increase the serial data rate. To go far beyond 100 Gbit/s, which is the current electrical state-of-the-art, e.g., [3], will require optical techniques. Using optical time-division multiplexing (OTDM), 640-Gbaud symbol rates have so far been demonstrated as the highest pulse rate carrying data by a few groups worldwide, first in [4] and then most notably in [5]. For almost twenty years now, OTDM has been explored as a possible route to generate high bit rates in the optical domain, but there has hitherto been no market penetration. There are several reasons for this, and apart from market circumstances, we believe the most important one is the lack of good, stable and practical solutions for essential functionalities. With the introduction of internet video transmission, the bit rates have exploded, and internet exchange office congestion is becoming a real limitation. Therefore, there is once again a need for basic research in solutions to congestion problems. There is currently a great focus on 100-Gb/s Ethernet (100 GE). Five to ten years from now, Internet exchange stations may have several 100 GE lines that need to be transmitted to the same destination, and to avoid congestion, it may be beneficial to employ ultrafast optical Ethernet multiplexing. This would result in an optical 1000 GE, or 1 Terabit/s Ethernet, 1 TE. One great concern with the parallel technologies developed so far is their massive power consumption. Serial solutions combined with circuit switched networks may help to reduce the power consumption. An essential functionality, very challenging at these ultrahigh bit rates, is that of clock recovery (CR), and it was not until very recently (2007) that this was demonstrated for the first time at 640 Gbaud [6], breaking the 1996 record of 400 Gbit/s [7].
Filter-assisted cross-phase modulation in a semiconductor optical amplifier (SOA) [8] was used, and this allowed for the 640-Gbit/s clock recovery [6], [9]. However, in addition to the fast effects in semiconductors, they also have slow recovery times, which will inevitably lead to patterning effects for OOK modulation formats. To avoid this, a truly ultrafast scheme would be preferable. Nonlinear wave mixing in lithium niobate is one such scheme, as proposed in [10]. In fact, over the last few years, periodically poled lithium niobate (PPLN) has proven to be a very promising candidate for a key component in optical signal processing [11], [12].
This paper reports on the use of a truly ultrafast, and thus pattern-independent, all-optical nonlinear effect for CR, namely sum-frequency generation in a quasi-phase-matched (QPM) adhered ridge waveguide periodically poled LiNbO3 module (ARW PPLN), and demonstrates its potential in a successful 640-Gbit/s transmission experiment. This is the first time PPLN is demonstrated at such high bit rates and the second time ever that a full 640-Gbit/s transmission experiment, i.e., including clock recovery on the aggregated line rate data signal, is conducted. In this case, no pattern dependence is theoretically expected, nor experimentally found. Error free and low-penalty 640-Gbit/s transmission is obtained and the PPLN requires only 1-mW average 640-Gbit/s data power.
II. EXPERIMENTAL PROCEDURE
The experimental setup used is sketched in Fig. 1. The optical signal is generated by an erbium glass mode-locked laser (ERGO) at 10 GHz and 1557 nm. The pulses are data modulated with a PRBS (MOD), and after pulse compression multiplexed in a passive fiber-based split-and-delay multiplexer (MUX) designed to preserve the PRBS sequence at the 640-Gbit/s data rate as well as keep the data in one single polarization. For pulse compression, the data pulses are chirped by self-phase modulation (SPM) in 400 m of dispersion-flattened highly nonlinear fiber (HNLF). The positive dispersion in the remainder of the transmitter, corresponding to 20 m SMF, linearly compresses the data pulses to 530 fs FWHM in the resulting 640-Gbit/s data signal. An eye diagram of the 640-Gbit/s data, obtained with an optical sampling oscilloscope, is also shown in Fig. 1. This shows clear, open, and well-equalized 640-Gbit/s data eyes.
The data signal is either sent directly to the receiver, which contains the clock recovery unit and a NOLM-based demultiplexer, or first guided through a dispersion and slope-compensated fiber span of 50-km SMF-IDF. The total 50-km span residual dispersion is 0.13 ps/nm and the slope is 0.07 ps/nm², which is within the general requirements for less than 1-dB transmission penalty for 640-Gbit/s transmission and leads to a pulse broadening of less than 100 fs. The PMD in this span is negligible, as there is no noticeable effect of tuning the polarization into the span. In the clock recovery setup, the data signal is injected into the ARW PPLN, which acts as a phase comparator between the data and a local clock signal, thus generating an error signal proportional to the sine of the phase difference between the clock and data. This error signal is then used to lock the loop to the data signal frequency.
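The locking behaviour of a loop driven by a sine-shaped error signal can be illustrated with a minimal sketch. It assumes a first-order loop, dφ/dt = Δω − K sin φ, which is a textbook simplification and not the loop filter used in this experiment; the function name, the loop gain and the frequency offsets are hypothetical values chosen only for illustration.

```python
import math

def simulate_pll(offset_hz, gain_hz, duration_s=200e-6, dt_s=1e-8):
    """Integrate d(phi)/dt = d_omega - K*sin(phi) with forward Euler.

    phi is the clock-data phase difference; K*sin(phi) models the VCO
    correction driven by the sine-shaped error signal.
    All parameter values are illustrative assumptions.
    """
    d_omega = 2 * math.pi * offset_hz  # initial VCO/data frequency offset (rad/s)
    k = 2 * math.pi * gain_hz          # hypothetical loop gain (rad/s)
    phi = 0.0
    for _ in range(int(round(duration_s / dt_s))):
        phi += (d_omega - k * math.sin(phi)) * dt_s
    return phi

# Offset smaller than the loop gain: phi settles at asin(offset/gain).
locked_phi = simulate_pll(offset_hz=50e3, gain_hz=200e3)
# Offset larger than the loop gain: phi keeps growing (cycle slipping).
slipping_phi = simulate_pll(offset_hz=400e3, gain_hz=200e3)
print(f"locked phase error: {locked_phi:.4f} rad")
print(f"unlocked phase after 200 us: {slipping_phi:.1f} rad")
```

In this simplified model the loop holds lock only while the frequency offset stays below the loop gain, where the steady-state phase error is asin(Δω/K). A practical loop with a loop filter has distinct hold-in and pull-in ranges, which this sketch does not reproduce.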
The ARW PPLN used in this experiment, sketched in Fig. 2, is designed for efficient processes from the 1550 nm wavelength range [13]. Active Mg-doped LiNbO3 is set on a low-index adhesive, and together with a ridge structure this gives an optical waveguide with a tight optical confinement with an index difference of [13]. The ridge (2.5 μm high and 8 μm wide) ensures good modal overlap between light at the fundamental wavelength (1550 nm) and the second harmonic (780 nm), which will enhance the conversion efficiency. The mode size at both wavelengths is about 3 μm. Traditionally LiNbO3 is attractive because of its high second-order nonlinearity with fs-timescale response, but drawbacks are pulse and beam walk-off due to group velocity dispersion (GVD) and birefringence in the crystal, respectively, which both tend to reduce the effective interaction length. Periodic poling greatly reduces the birefringence-induced beam walk-off by periodically changing the sign of the nonlinear coefficient, making the process quasi-phase-matched (QPM). The GVD-induced pulse walk-off is reduced simply by reducing the length of the device. This is made possible by increasing the normalized conversion efficiency per length, enabled by the ridge structure resulting in good modal overlap between the fundamental and second harmonic. The length of the device is 30 mm and the QPM period 17 μm, resulting in efficient sum-frequency generation (SFG) between the 1557 nm data and the 1567 nm clock at 782 nm. The normalized conversion efficiency for the 30 mm device is for the packaged module, and 900%/W for the naked chip. The process takes place on an fs-timescale, as mentioned above. However, due to dispersion in the material, the group velocity mismatch (GVM) between the fundamental 1560-nm wave and the second-harmonic 780-nm wave amounts to 0.29 ps/mm, giving 8.7 ps over the 30-mm length of the packaged module.
The clock and data pulses, both in the 1560-nm range, have negligible GVM between them (0.356 fs/mm giving 10.7 fs for the module), so for the purpose of clock recovery, the timing resolution is not limited by the module, but by the obtainable pulse widths for the data and clock pulses. Note that due to the GVM between the fundamental and the harmonic, the produced error signal will be broadened by the 8.7 ps, but this has no consequence for the application here, since the error signal will be varying only very slowly. The GVM only puts a limit on the repetition rate of the clock pulses, not on the data pulses. The clock pulse repetition rate should thus not exceed 1/(8.7 ps), i.e., roughly 100 GHz. Thus, the customary 10- or 40-GHz base rate is more than accommodated for. However, the data signal could in principle be extended to 1 Tbit/s. To summarize the design principle, a high normalized conversion efficiency per unit length in the ARW PPLN, obtained by the ridge structure, enables reduced device length, leading to a small group velocity mismatch and a large wavelength bandwidth, making it suitable for ultrafast operation.
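The numbers in this design argument can be checked with a short calculation. The sketch below assumes energy conservation for the SFG process (1/λ_SFG = 1/λ_data + 1/λ_clock) and takes the fundamental-to-harmonic GVM as 0.29 ps per mm of device length, the reading consistent with 8.7 ps over 30 mm; the variable and function names are illustrative only.

```python
def sfg_wavelength_nm(lambda_a_nm, lambda_b_nm):
    """Sum-frequency wavelength from energy conservation."""
    return lambda_a_nm * lambda_b_nm / (lambda_a_nm + lambda_b_nm)

DEVICE_LENGTH_MM = 30.0
GVM_FUND_TO_HARMONIC_PS_PER_MM = 0.29   # assumed per-mm reading of the quoted GVM
GVM_CLOCK_TO_DATA_FS_PER_MM = 0.356

sfg_nm = sfg_wavelength_nm(1557.0, 1567.0)
walkoff_ps = GVM_FUND_TO_HARMONIC_PS_PER_MM * DEVICE_LENGTH_MM
clock_data_walkoff_fs = GVM_CLOCK_TO_DATA_FS_PER_MM * DEVICE_LENGTH_MM
max_clock_rate_ghz = 1e3 / walkoff_ps   # 1/ps expressed in GHz

print(f"SFG wavelength: {sfg_nm:.1f} nm")
print(f"fundamental/harmonic walk-off: {walkoff_ps:.1f} ps")
print(f"clock/data walk-off: {clock_data_walkoff_fs:.1f} fs")
print(f"clock repetition-rate limit: {max_clock_rate_ghz:.0f} GHz")
```

The ideal sum-frequency wavelength comes out at about 781 nm, close to the 782 nm quoted for the measurement, and the walk-off limit of about 115 GHz matches the rough 100-GHz figure given in the text.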
The local clock signal consists of 10-GHz pulses from a semiconductor tunable mode-locked laser (TMLL) driven by a voltage-controlled oscillator (VCO), which in turn is tuned by the error signal. The overall bit-rate limitation is set by the pulsewidth of the local clock and the response time of the mixing process, so short pulses and fast mixers are required. The TMLL runs at 1567 nm, fulfilling the phase-matching condition with the 1557-nm data, and pulse compression by soliton compression in a high-power EDFA is employed. The TMLL pulse is amplified to 30 dBm and injected into a second 30-dBm amplifier where a soliton is excited in the Er-doped fiber, which has a positive gain. When the soliton travels through the Er-fiber, the peak power will increase due to the gain, and hence self-phase modulation (SPM) will dominate over GVD leading to an adiabatic compression. The compressed pulses have a 700-fs FWHM pulse width, and a pedestal at roughly 20% of the pulse peak power (bottom of Fig. 4), which is expected to reduce the contrast of the error signal. The overall loop length with pigtails and EDFAs is about 60 m, and, with a PLL bandwidth of 200 kHz, the loop is expected to be stable, according to [14].
Fig. 3 shows the spectra of the input and output signals to and from the ARW PPLN together with the generated error signal. When the clock and the data pulses overlap in time in the PPLN, a sum-frequency product at 782 nm will be generated. Based on the measurements shown here, the of the packaged module at 10-mW input power is dB.
Fig. 4 shows a cross-correlation trace of the 640-Gbit/s data signal with a 500-fs sampling pulse, confirming that the data signal is of good quality as indicated by the eye diagrams in Fig. 1. The individual channels are correctly separated by 1.57 ps, and they are well-equalized. The same transmitter is used in [15], where it is shown that all channels are error-free with about 3-dB variation in sensitivity. Fig. 4 (bottom) shows an autocorrelation trace of the local clock pulses, with the clearly visible pulse pedestal mentioned above. This pedestal is expected to reduce the contrast of the error signal, but not limit the clock recovery performance.
If the clock and data signals are not synchronised, they will scan across each other at the difference frequency, generating a slowly varying error signal, see Fig. 5. The contrast of the error signal for the 320-Gbit/s data signal is approximately 20%, which is expected to be due to the clock pulse pedestals and some background second-harmonic generation. The 640-Gbit/s error signal also clearly resolves the individual data pulses, although with a reduced contrast (about 10%), primarily owing to the width of and pedestals on the clock pulses. The error signals are perfectly suitable for locking, and successful clock recovery at both 320 and 640 Gbit/s is achieved. Please note that the polarization needs to be set very carefully on both the data and the clock pulses when they enter the waveguide, to achieve phase matching.
III. EXPERIMENTAL RESULTS
The produced error signals at 320 and 640 Gbit/s are used successfully for clock recovery both before and after transmission. To make the situation as realistic as possible, most of the characterization measurements presented in the following are performed after transmission. Fig. 6 shows the 640-Gbit/s locking performance after 50-km transmission in terms of the integrated timing jitter derived from the single-sideband-to-carrier ratio (SSCR) phase-noise (integration range: 1 kHz-1 GHz). Fig. 6 shows the timing jitter directly out of the VCO when locking to the 640-Gbit/s transmitted data, showing around 150-fs rms timing jitter.
Since SFG in the PPLN is almost instantaneous, the clock recovery is expected to be independent of the OOK data pattern it receives. To investigate this, the PRBS sequence into the multiplexer is changed. The multiplexer is PRBS-maintaining only for a word length, though, so in this characterization the PPLN does not receive pure PRBS sequences for the higher word lengths. However, the bit sequences still remain very different from each other for different input sequences. The rms timing jitter is around 150 fs, and this number only changes by 10 fs when changing the PRBS sequence input to the multiplexer in the transmitter, effectively confirming the theoretical expectation of no pattern dependence. So, no pattern dependence is expected, and within the limits of this setup, none is observed.
Fig. 7. Timing jitter for various average data input powers: less than 100-fs jitter is obtained for a dynamic range of more than 15 dB; 1 mW data power is sufficient.
When locked, the clock recovery is locked on one of the 64 tributaries, and even though the aggregate 640-Gbit/s signal is not a PRBS, each tributary is, and thus this investigation shows that the PLL can lock to long sequences of zeroes.
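The zero runs that a PRBS tributary presents to the loop can be made concrete with a small generator. The sketch below uses a PRBS-7 sequence (period 127) purely as an illustration; the word lengths actually used in the experiment are not assumed here, and the function names are hypothetical.

```python
def prbs7(n_bits, seed=0x7F):
    """Generate n_bits of a PRBS-7 sequence with a 7-bit shift register."""
    state = seed & 0x7F
    bits = []
    for _ in range(n_bits):
        # Feedback from the two oldest register bits.
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        bits.append(new_bit)
    return bits

def longest_run(bits, value):
    """Length of the longest run of consecutive `value` bits."""
    best = run = 0
    for b in bits:
        run = run + 1 if b == value else 0
        best = max(best, run)
    return best

period = prbs7(127)
two_periods = period + period  # catches runs that wrap around the period
print("ones per period:", sum(period))
print("longest run of zeros:", longest_run(two_periods, 0))
print("longest run of ones:", longest_run(two_periods, 1))
```

A maximal-length sequence of order n contains one run of n − 1 zeros per period, so longer PRBS words give correspondingly longer stretches without any data pulse in a tributary.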
The PLL is designed to have a hold-in range of 37.5 MHz, i.e., it can tolerate the data frequency to drift that much and still maintain locking. The pull-in range of the PLL is 8.8 MHz, so the VCO and data frequency offset should not exceed this, if the PLL is expected to pull into locking.
The rms timing jitter values in Fig. 6 are directly out of the VCO, but this is not what is fed to the demultiplexer. The VCO controls an ERGO laser, which in turn controls the demultiplexer. The ERGO laser itself has very low jitter and a quite low PLL bandwidth of 20 kHz. This means, in effect, that the noise above 20 kHz will be filtered away, yielding lower timing jitter on the actual control pulses. Fig. 7 shows the rms timing jitter from the control pulse source applied to the NOLM demultiplexer for various average data input powers, when the average clock power is 4 dBm. Because the ERGO filters away phase noise from the VCO above 20 kHz, this timing jitter is somewhat lower than the values straight out of the VCO in Fig. 6. This helps in getting error-free demultiplexing as shown in Fig. 8. Fig. 7 shows that less than 100 fs jitter can be obtained for this system, for average data input powers ranging from -1 to 15 dBm, giving an experimentally obtained dynamic range of 16 dB. This means that 1-mW average data power is enough for this scheme to work. When going below -1 dBm input power, the signal-to-noise ratio (SNR) out of the detector simply gets too low. For higher data input powers, the pump will eventually get depleted, and there will be more SHG from the data itself giving rise to a bigger offset in the error signal, inevitably leading to loss of locking. Where this occurs cannot be quantified with the present setup, since there is not enough power available.
Fig. 8. BER curves for 640-Gbit/s demultiplexing to 10 Gbit/s with the clock derived from the clock recovery before and after transmission compared with the 640 Gbit/s back-to-back. The CR penalty is only 2.7 dB and the transmission penalty is less than 1 dB. Insert: demultiplexed eye after 50-km transmission.
Fig. 8 shows the BER results when using the recovered clock to drive the control pulse source for the NOLM demultiplexer.
The receiver power is measured after the multiplexer, just before entering the pre-amplifier receiver. The 640-Gbit/s back-to-back (b-b) demultiplexing is error free (i.e., BER ) with no error floor and has a sensitivity (i.e., receiver power at BER ) of -30.3 dBm. Compared with the eye diagram in Fig. 1, which seems to show some intersymbol interference (ISI), there is no such sign in the BER curves. This is because there is no ISI, as the eye diagram interference is an artefact of the sampling oscilloscope used. It uses a 900-fs sampling pulse, and this will overlap with neighboring channels, giving the appearance of ISI. When comparing to the cross-correlation traces in Fig. 4, which uses a 500-fs sampling pulse and thus has a higher temporal resolution than the sampling oscilloscope, it is verified that there is no noticeable pulse overlap between channels. Using the CR without transmission (i.e., b-b), error-free performance is readily achieved with a sensitivity of -27.6 dBm, i.e., a penalty of only 2.7 dB. Using the CR after transmission is also successful and error-free performance is achieved, with an additional penalty of only 0.8 dB. These results clearly demonstrate that the PPLN module works satisfactorily in the full 640-Gbit/s transmission system. Please note that the b-b 640-Gbit/s test-bed is very stable and all 64 channels are error-free with a sensitivity spread of 3.3 dB, as more thoroughly described in [15]. The channels shown in Fig. 8 are typical channels taken from the middle of the 3.3-dB spread, as verified by scanning through a couple of channels and finding similar sensitivities. This is further corroborated by inspecting the cross-correlation traces of the 640-Gbit/s data signals, confirming that the channels are still narrow, equally spaced and well-equalized, like the original 640-Gbit/s data.
After transmission, the pulse broadening is less than 100 fs, effectively rendering the transmitted data signal very similar to the original one.
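The penalty bookkeeping above is plain decibel arithmetic, which the following sketch spells out. It assumes the two sensitivities are -30.3 dBm and -27.6 dBm (negative values, as is usual for a pre-amplified receiver); the helper name is illustrative.

```python
def dbm_to_mw(power_dbm):
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (power_dbm / 10)

B2B_SENSITIVITY_DBM = -30.3      # back-to-back, without clock recovery (assumed sign)
CR_B2B_SENSITIVITY_DBM = -27.6   # back-to-back, with clock recovery (assumed sign)
TRANSMISSION_PENALTY_DB = 0.8    # additional penalty after 50 km

cr_penalty_db = CR_B2B_SENSITIVITY_DBM - B2B_SENSITIVITY_DBM
total_penalty_db = cr_penalty_db + TRANSMISSION_PENALTY_DB

print(f"clock-recovery penalty: {cr_penalty_db:.1f} dB")
print(f"total penalty after transmission: {total_penalty_db:.1f} dB")
print(f"0 dBm data power = {dbm_to_mw(0):.1f} mW")
print(f"4 dBm clock power = {dbm_to_mw(4):.2f} mW")
```

A penalty in dB is the difference between two sensitivities in dBm, so penalties from successive stages simply add.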
In order to characterize further the requirement on timing jitter to obtain a BER , the BER corresponding to different phase noise curves is measured. Fig. 9 shows characterization results of the phase noise and derived timing jitter after transmission of the 640-Gbit/s data. The recovered clock signal straight out of the VCO is compared to the pulses out of the ERGO locked to the VCO.
Fig. 9. Jitter filtering by the control pulse source. Top: SSCR of the recovered clock for the 640-Gbit/s transmitted data straight out of the VCO and after the ERGO laser. Bottom: Integrated rms timing jitter as a function of the upper integration limit (integration from 1 kHz and upwards) for the VCO and the ERGO after transmission of the 640-Gbit/s data.
As seen in Fig. 9 (top), the ERGO cuts away excess phase noise above the ERGO PLL bandwidth of 20 kHz. In the VCO SSCR, there is a peak at about 200 kHz, which stems from the PLL bandwidth of about 200 kHz. It is at this frequency that the PLL shifts from tracking the data SSCR to following the VCO SSCR [14]. This peak and the phase noise associated with it are eliminated by the ERGO. Fig. 9 (bottom) shows the integrated rms timing jitter values with the lower integration limit fixed at 1 kHz and as a function of the upper integration limit. This plot enables one to see in which frequency range the phase noise gives rise to most timing jitter. A big difference is observed between the VCO and the ERGO phase noise and jitter. While the VCO jitter continues to increase to more than 100 fs beyond the 20 kHz point, the ERGO jitter remains well below 100 fs. This implies that below 20 kHz, the VCO noise will have a direct impact on the demultiplexing, but the VCO noise above 20 kHz is less important, as long as it is clean enough for the ERGO to be able to lock to it. According to the rule of thumb provided in [16], the timing jitter for the control pulse for 640 Gbit/s should be around 90 fs, which can be obtained with this scheme.
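The step from an SSCR curve to an integrated rms timing jitter can be sketched as follows, using the standard relation σ = sqrt(2 ∫ L(f) df) / (2π f_c) with L(f) in linear units. The phase-noise spectrum below is a flat, hypothetical curve and the ERGO is modelled as a first-order low-pass at 20 kHz; both are assumptions for illustration and do not represent the measured data in Fig. 9.

```python
import math

def integrated_jitter_s(freqs_hz, sscr_dbc_hz, carrier_hz):
    """RMS timing jitter (seconds) from a single-sideband phase-noise curve."""
    linear = [10 ** (level / 10) for level in sscr_dbc_hz]
    area = 0.0
    for i in range(len(freqs_hz) - 1):
        # Trapezoidal integration of L(f) over the offset frequency.
        area += 0.5 * (linear[i] + linear[i + 1]) * (freqs_hz[i + 1] - freqs_hz[i])
    return math.sqrt(2 * area) / (2 * math.pi * carrier_hz)

def low_pass_db(f_hz, corner_hz):
    """Attenuation (dB) of an assumed first-order low-pass filter."""
    return -10 * math.log10(1 + (f_hz / corner_hz) ** 2)

CARRIER_HZ = 10e9
# Log-spaced offsets from 1 kHz to 1 GHz, the integration range used in the paper.
freqs = [10 ** (3 + 6 * i / 600) for i in range(601)]
vco_sscr = [-130.0 for _ in freqs]  # hypothetical flat spectrum in dBc/Hz
ergo_sscr = [lvl + low_pass_db(f, 20e3) for f, lvl in zip(freqs, vco_sscr)]

vco_jitter_fs = integrated_jitter_s(freqs, vco_sscr, CARRIER_HZ) * 1e15
ergo_jitter_fs = integrated_jitter_s(freqs, ergo_sscr, CARRIER_HZ) * 1e15
print(f"jitter before filtering: {vco_jitter_fs:.1f} fs")
print(f"jitter after 20-kHz low-pass: {ergo_jitter_fs:.1f} fs")
```

Truncating the lists at a chosen upper frequency gives the jitter as a function of the upper integration limit, which is the quantity plotted in the bottom panel of Fig. 9.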
To quantify this rule of thumb, the BER is measured. Fig. 10 displays the relation between BER and VCO timing jitter and the SSCR at 10-kHz offset from the carrier, as this will be transferred to the ERGO. Fig. 10 (top) shows the SSCR spectrum for two cases: one for low integrated jitter (150 fs) and one for high jitter (170 fs). The biggest difference between the two traces is around and below 10 kHz where there is about 10 dBc/Hz difference. Comparing such SSCR traces to the obtained BER values at the demultiplexer output gives the link between BER and rms timing jitter (see the bottom of Fig. 10). As observed, there is a huge difference in BER performance when changing the SSCR by 10 dBc/Hz in the 10-kHz range. Demultiplexing only becomes error-free with VCO jitter below 160 fs, and this in turn corresponds to less than 100-fs jitter on the ERGO. It is thus experimentally found (for this setup) that a dBc/Hz is required on the VCO output at 10-kHz offset, corresponding to an rms jitter of less than 160 fs, to obtain error-free demultiplexing. The control ERGO laser then filters away the remaining excess phase noise to get below 100 fs, and this result therefore agrees well with the rule of thumb in [16]. As shown in Fig. 7, this setup can readily provide less than 100-fs timing jitter in a large dynamic range. Please note that with the pulse compression techniques used here, it is empirically found that, under optimum working conditions, only about 10-fs timing jitter is added from the compression stage.
IV. DISCUSSION
The timing jitter values obtained here are quite good considering the large phase-noise contribution from the TMLL, which has a free-running rms timing jitter of 400 fs. The reason for the low jitter values is a very low-noise VCO and the low bandwidth of the PLL. If the TMLL is replaced with a low-jitter laser as in [17], the overall jitter obtained in this setup is expected to become even lower, or at least more stably low. The free-running VCO has a very clean carrier peak with very low noise around it (60 dB SNR). When the loop is closed, however, the noise from the TMLL is circulated around in the loop and this is added to the locked VCO spectrum. The 200-kHz peak, corresponding to the PLL bandwidth, is also clearly observed here. Within 200 kHz, the VCO tracks the data, and beyond that the VCO follows its own noise (plus the noise added from the TMLL). The SNR in this case is about 50 dB, which is clearly lower than the 60-dB SNR of the free-running VCO. This again leads to the interpretation that a laser with lower noise will improve the performance.
Reducing the loop length from its present 60 m, allowing for an expansion of the PLL bandwidth, will also help to lower the timing jitter, as the influence of the low-jitter data signal in this setup will dominate [14] .
It is worth noting that using flat-top switching windows, as demonstrated in [18] , timing jitter up to 22% of the timeslot can be tolerated, which for 640 Gbit/s corresponds to 350 fs. In that case, the requirements on the presented scheme here would be greatly relaxed and should make this whole scheme even more stable than the present version.
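The timeslot arithmetic behind this tolerance is short enough to spell out. The sketch assumes only the 640-Gbit/s line rate, the 10-Gbit/s base rate and the 22% figure quoted from [18]; the variable names are illustrative.

```python
LINE_RATE_BIT_S = 640e9
BASE_RATE_BIT_S = 10e9
TOLERABLE_FRACTION = 0.22  # tolerable jitter as a fraction of the timeslot, from [18]

timeslot_fs = 1e15 / LINE_RATE_BIT_S
tributaries = round(LINE_RATE_BIT_S / BASE_RATE_BIT_S)
tolerable_jitter_fs = TOLERABLE_FRACTION * timeslot_fs

print(f"timeslot: {timeslot_fs:.1f} fs")
print(f"number of tributaries: {tributaries}")
print(f"tolerable jitter: {tolerable_jitter_fs:.0f} fs")
```

The 1562.5-fs timeslot is the channel spacing seen in the cross-correlation trace, and 22% of it is about 344 fs, in line with the roughly 350 fs stated above.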
Regarding stability, if this scheme is to be implemented in a real transmission system, the inherent polarization dependence of the PPLN will need to be addressed. Various polarization diversity schemes have already been demonstrated for other polarization-dependent switches, and we would speculate that these would also be applicable here. For instance, one could add a polarization beam-splitter (PBS) in front of the PPLN and then apply a half-wave plate in one arm before merging the two PBS outputs and injecting the signals into the PPLN. This should alleviate the polarization influence on the switch, but would need further investigation. Apart from that, the switch is already very compact and stable, being packaged into a fiber-pigtailed temperature-stabilized module.
In the demonstration here, the tolerance to transmission span parameters has not been directly investigated. However, there are some general requirements on dispersion for getting less than 1-dB power penalty at a BER , as stated in Section II [19] . These requirements are more to do with the demultiplexing, but there are also limitations on the clock recovery. If the data pulses get too broad, the error signal will become too small, and locking will not be possible. The requirement will be slightly less stringent than for demultiplexing, but on the same order of magnitude, as the data pulses still need to be sufficiently narrow to be properly distinguishable in the PPLN.
V. CONCLUSION
We have reported on a novel clock recovery scheme, relying on truly ultrafast sum-frequency generation in an ARW PPLN. The temporal resolution of the setup was sufficient to resolve a 640-Gbit/s OTDM data signal, and locking at bit rates up to 640 Gbit/s was successfully achieved before and after transmission over 50 km of SMF-IDF fiber. Timing jitter of less than 100 fs was obtained for this system with a dynamic range of 16 dB. No pattern dependence was expected and none was found. The clock recovery unit gave error-free performance with excellent quality and less than 1-dB transmission penalty. Only 0 dBm average power in the 640-Gbit/s data signal was needed and only 4 dBm clock power was used, so this is a low-power solution for clock recovery. The overall power usage of the full receiver in this proof-of-principle laboratory implementation is, however, not particularly low, as we here need to use various tricks for pulse compression and need about 20 dBm control power to the NOLM. Quantum-dot mode-locked lasers have been shown to generate sub-ps pulses with very low timing jitter, so in a future setup pulse compression could be avoided. The results presented in this paper constitute the first demonstration of the use of a PPLN at such high bit rates and are only the second demonstration of 640-Gbit/s clock recovery and the second full (i.e., including line rate clock recovery) 640-Gbit/s transmission demonstration ever.
