Simulation results and a first standard cell CMOS implementation are reported for efficient carrier and data recovery methods in the IF domain of digital I&Q receivers, for synchronous QPSK transmission with standard DFB lasers. 
Introduction
The desire to use existing fiber links as efficiently as possible has recently directed a lot of research towards multilevel phase modulation schemes. The goal is to increase spectral efficiency and to limit the symbol rate in order to reduce dispersion effects. Coherent transmission has the additional advantage of detecting field-proportional quantities, which enables electronic compensation of all linear fiber distortions. This includes polarization transformations and turns the normally undesired polarization selectivity of coherent receivers into an asset if polarization diversity and polarization division multiplex are used to double the bit rate.
DQPSK modulation lacks sensitivity compared to QPSK. Consequently the latter is preferable, but heterodyning would require excessive receiver bandwidth. Intradyne inphase and quadrature (I&Q) baseband receivers are therefore advantageous in this respect. The known feedforward carrier recovery strategy of PSK systems [1] can be extended to QPSK [2]. Analog implementations may suffer severely from bandwidth limitations and could introduce serious crosstalk in the several multiplier banks that are needed [3]. A digital implementation, on the other hand, is attractive in this context and has been demonstrated offline with oscilloscope-sampled experimental data [4, 5]. A major obstacle is the requirement to perform the main calculations of the digital carrier recovery scheme in parallel at the actual line rate, as proposed in [6]. Demultiplexing to lower clock rates is necessary to allow the use of standard cell CMOS technology, but leads to massive parallelization of the processed data. Until now, real-time synchronous QPSK transmission with commercial off-the-shelf DFB lasers has not been reported because the computational effort appeared prohibitively high. This paper therefore compares our earlier digital carrier and data recovery proposal [6] with two novel variants thereof by simulation, and presents a hardware-efficient CMOS standard cell implementation in 120 nm technology.
Simulation overview
The simulation program generates random 10 Gbaud QPSK transmitter data. The data is then impressed onto a DFB laser signal (S) in a QPSK modulator (see Fig. 1), in 1 or 2 polarization channels (20 or 40 Gbit/s aggregate data rate). The I&Q data is Gray-encoded to form a quadrant number, which is differentially encoded modulo 4 to determine the quadrant of the optical phase. The optical transmission model also contains the coherent receiver with a second DFB laser as its local oscillator (LO). The frequency difference between S and LO leads to an intermediate frequency (IF) carrier of the received signal. Other effects, e.g. thermal noise and shot noise in the receiver, can also be considered in the simulation. Quantization effects of the analog-to-digital converters (ADCs) and several internal bit resolutions of the subsequent digital signal processing unit are included to enable the optimization of key elements, e.g. look-up tables (LUTs) for phase determination of complex numbers. For the digital signal processing, the simulation program allows identical digital data to be fed into various filter and decoder concepts. The BER for each concept is determined by comparing the transmitted data with the output data after differential decoding. Based on the calculated BERs, a comparative evaluation of these concepts is possible. In the following section, simulation results for one polarization channel are presented and explained in detail. A second polarization channel can simply be added to the simulation model, but the current CMOS implementation incorporates only a single channel.
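The Gray encoding and modulo 4 differential encoding described above can be illustrated with a minimal sketch. The specific Gray table, the function names and the start state are assumptions made for illustration; the paper does not specify them.

```python
# Assumed Gray mapping from an (I, Q) bit pair to a quadrant number.
# Neighbouring quadrants differ in exactly one bit; the concrete
# assignment is hypothetical and not taken from the paper.
GRAY_TO_QUADRANT = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}


def differential_encode(quadrants, start=0):
    """Accumulate quadrant numbers modulo 4 to obtain the phase quadrants."""
    encoded = []
    state = start
    for q in quadrants:
        state = (state + q) % 4
        encoded.append(state)
    return encoded


def differential_decode(received, start=0):
    """Recover quadrant numbers as differences of successive phase quadrants."""
    decoded = []
    previous = start
    for r in received:
        decoded.append((r - previous) % 4)
        previous = r
    return decoded


bits = [(0, 0), (0, 1), (1, 1), (1, 0), (1, 1)]
quadrants = [GRAY_TO_QUADRANT[b] for b in bits]   # [0, 1, 2, 3, 2]
encoded = differential_encode(quadrants)          # [0, 1, 3, 2, 0]
decoded = differential_decode(encoded)

# A constant rotation by one quadrant (carrier phase ambiguity) only
# corrupts the first decoded symbol.
rotated = [(e + 1) % 4 for e in encoded]
decoded_rotated = differential_decode(rotated)
```

Because the information is carried in quadrant differences, a fixed quadrant offset of the recovered carrier affects only the first decoded symbol in this sketch, which is the usual reason for combining differential encoding with a carrier recovery that has a fourfold phase ambiguity.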
Phase recovery approaches
The digital carrier recovery method presented in [6] is based on raising the received complex signal to its 4th power and averaging. Two novel digital concepts have been developed in order to optimize this key element of the digital receiver. In a comparative Monte Carlo (MC) simulation, these two concepts have proven superior to the earlier concept [6], lowering the SNR requirements for low BERs by 1 dB (Fig. 2). For each BER value, up to 10^8 transmitted symbols were randomly generated. The IF is assumed to be 0.16% of the symbol rate, with a sum linewidth of 0.001 times the symbol rate. Curve 1 (Theory) is a theoretical reference curve that shows the BER of an ideal QPSK transmission system, neglecting the influence of phase noise and differential encoding. Curve 2 (Ideal Phase Recovery) is, like all other curves except curve 1, based on Monte Carlo simulation and BER statistics, but the simulated IF phase without modulation and noise is provided to the receiver for demodulation and decoding. This is only possible in simulation and provides a better estimate of the practical limitations of the phase recovery subsystem than the theoretical BER curve 1. Curve 3 (Original Concept) is obtained by averaging as described in [6]. After the averaging filter, a LUT is used to obtain the phase of the filtered signal.
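The principle of fourth-power averaging can be sketched as follows. This is a simplified illustration and not the implementation of [6]: it assumes constellation points at phase angles of 0, 90, 180 and 270 degrees, an unweighted sliding window that is truncated at the block edges, and a direct phase computation in place of a LUT. The function name and the window parameter n are hypothetical.

```python
import cmath
import math


def estimate_carrier_phase(samples, n=5):
    """Estimate the carrier phase per sample by fourth-power averaging.

    Raising each sample to the 4th power removes the QPSK modulation
    (assumed constellation angles: multiples of 90 degrees). The result
    is averaged over up to 2*n + 1 neighbouring samples, and the argument
    of the mean is divided by 4. Estimates lie between -pi/4 and +pi/4.
    """
    fourth = [s ** 4 for s in samples]
    estimates = []
    for k in range(len(fourth)):
        lo = max(0, k - n)                  # window truncated at the edges
        hi = min(len(fourth), k + n + 1)
        mean = sum(fourth[lo:hi]) / (hi - lo)
        estimates.append(cmath.phase(mean) / 4.0)
    return estimates


# Noise-free example: constant carrier phase of 0.2 rad, arbitrary quadrants.
true_phase = 0.2
quadrant_sequence = [0, 1, 2, 3, 2, 1, 0, 3, 1, 2, 0, 3]
samples = [cmath.exp(1j * (true_phase + q * math.pi / 2))
           for q in quadrant_sequence]
estimates = estimate_carrier_phase(samples, n=5)
```

Dividing the argument by 4 leaves a fourfold ambiguity in the estimate, so a practical receiver must also track quadrant jumps of the recovered phase; that step is omitted here.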
Curve 4 (NCF) results from the first improved concept referred to as nonlinear complex filter (NCF). The characteristic number of the filter is N = 5 which means that 2N + 1 = 11 values of the double-squared input signal are averaged. Compared to the original concept, nonlinear complex compensation functions are added. These are advantageous because raising samples of a noisy signal to its 4th power in order to remove the QPSK modulation causes increased noise. The complex filter is optimized by different weight coefficients for its input values.
Curve 5 (SMLPA) shows the BER result for a second concept that was developed to avoid the high resource requirements of the first approach. The extensive calculations were replaced by a heuristic method based on maximum likelihood estimation and an additional selectivity mechanism. It is therefore referred to as Selective Maximum Likelihood Phase Approximation (SMLPA). The BER results of the SMLPA are almost identical to those of the NCF, but the hardware effort for the realization is much lower. It was possible to implement a complete SMLPA filter and demodulator system with an internal demultiplex factor of 16 on a CMOS chip area of 0.308 mm² (excluding pads and power rings).
CMOS implementation
The described system was implemented on a CMOS chip to prove the functionality of the concept. Fig. 3 shows a block diagram of the realized system. The incoming I and Q signals are combined into a complex number, which is then raised to its fourth power. This value is used as the index of a LUT containing the argument. In this CMOS circuit, an SMLPA is used for filtering because it requires fewer resources than the NCF, although the NCF gives better results. The NCF might be further optimized and applied to future CMOS implementations. The filter uses a six-stage pipeline structure, which delays the output values by six clock cycles. The delay block contains register banks that align the appropriate values for the demodulator, which calculates the quadrant number of the received complex sample. The CMOS implementation decodes 16 data streams in parallel, which are generated by demultiplexing the input data stream. The system is synthesized from a VHDL hardware description for a target frequency of 625 MHz. Fig. 4 shows a photograph of the chip (1.02 mm by 1.02 mm), which has an average power consumption of 0.814 W and a complexity of 27377 NAND2 equivalent gates. It is manufactured in the HCMOS9 technology from STMicroelectronics, which provides a minimum feature size of 120 nm, six metal layers, and a supply voltage of 1.2 V.
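The relation between the 10 Gbaud symbol rate, the demultiplex factor of 16 and the 625 MHz target frequency can be checked with a short sketch. The round-robin distribution of samples over the parallel streams is an assumption for illustration; the paper does not describe the ordering used on the chip, and the function names are hypothetical.

```python
def demultiplex(samples, factor=16):
    """Distribute a serial sample stream round-robin over parallel streams.

    Assumed ordering: stream i receives samples i, i + factor, i + 2*factor, ...
    """
    return [samples[i::factor] for i in range(factor)]


def parallel_clock_hz(symbol_rate_hz, factor=16):
    """Clock rate of each parallel stream after demultiplexing."""
    return symbol_rate_hz / factor


streams = demultiplex(list(range(64)), factor=16)
clock = parallel_clock_hz(10e9, factor=16)   # 10 Gbaud over 16 streams
```

With these figures each of the 16 streams runs at 625 MHz, which matches the synthesis target frequency stated above.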
Conclusions
Two novel carrier and data recovery methods for synchronous QPSK are presented with Monte Carlo simulation results. Maintaining the general advantages of an earlier published feedforward carrier recovery concept, the SNR requirements at low BERs were improved by 1 dB with both concepts. The second concept is extremely hardware-efficient and was therefore chosen for a CMOS implementation that is also presented with a complexity of 27377 NAND2 equivalent gates on a die size of 1.04 mm².
