Abstract: A phase-locked clock and data recovery circuit incorporates a multiphase LC oscillator and a quarter-rate bang-bang phase detector. The oscillator is based on differential excitation of a closed-loop transmission line at evenly spaced points, providing half-quadrature phases. The phase detector employs eight flip-flops to sample the input every 12.5 ps, detecting data transitions while retiming and demultiplexing the data into four 10-Gb/s outputs. Fabricated in 0.18-µm CMOS technology, the circuit produces a clock jitter of 0.9 ps rms and 9.67 ps peak-to-peak with a PRBS of $2^{31}-1$ while consuming 144 mW from a 2-V supply.
I. INTRODUCTION
Clock and data recovery (CDR) circuits operating at tens of gigabits per second pose difficult challenges with respect to speed, jitter, signal distribution, and power consumption. Half-rate 40-Gb/s CDR circuits have been implemented in bipolar technology [1], [2], but they require 5-V supplies and draw 1.6-5 W of power. (The work in [1] uses an external oscillator and 90° phase shifter.) On the other hand, the recent integration of 10-Gb/s receivers in CMOS technology [3] encourages further research on CMOS solutions for higher speeds, especially if it leads to low-voltage low-power realizations.
This paper presents the design and experimental verification of a 40-Gb/s phase-locked CDR circuit fabricated in 0.18-µm CMOS technology. Realized as a quarter-rate architecture, the circuit incorporates a multiphase oscillator and a phase detector (PD) with inherent data retiming and 1-to-4 demultiplexing. Section II describes the design implications of the technology limitations, arriving at the CDR architecture. Section III deals with the design of each building block, and Section IV examines the effect of nonidealities. Section V summarizes the experimental results.
II. CDR ARCHITECTURE

A. General Considerations
We consider the technology limitations in the context of the full-rate deserializer shown in Fig. 1(a). The $f_T$ of an nMOS transistor with … µm and … mV is approximately equal to 50 GHz. If used in a differential pair, such a device bias requires a single-ended peak-to-peak input swing of about 700 mV to ensure relatively complete switching of the tail current, a value five times that necessary in bipolar counterparts. That is, a bipolar transistor having the same $f_T$ allows much faster operation. In practice, current-steering flip-flops (FFs) in 0.18-µm CMOS technology with a fanout of one fail at approximately 12 Gb/s even if large clock swings are used. Since broadband data is much more difficult to amplify than narrowband clocks, this observation suggests relaxing the data path design in exchange for more stringent clock generation. The speed of FFs can be improved by inductive peaking [4], but inductor parasitics in CMOS technology limit such an improvement to less than 40%, prohibiting even a half-rate CDR approach. More importantly, the large area consumed by inductors in the at least four latches required for a half-rate phase detector [5] would make it difficult to route the 40-Gb/s data and the 20-GHz clock with reasonable skews (Section IV). Similarly, the first and second ranks of demultiplexing in Fig. 1(a) face severe speed limitations.
Another critical issue in Fig. 1(a) relates to the design of high-speed frequency dividers. Due to the limited capture range of CDR circuits, the oscillator must initially be locked to an external reference [3], requiring a feedback divider. Moreover, dividers are necessary for generating clock frequencies used in demultiplexing of the data. Typical divide-by-two circuits fail at frequencies above 20 GHz in 0.18-µm CMOS technology. Also, LC oscillators having a reasonable tuning range at 40 GHz present another challenge.
0018-9200/03$17.00 © 2003 IEEE
Let us now consider the deserializer shown in Fig. 1(b). Here, a quarter-rate CDR circuit inherently retimes and demultiplexes the data, obviating full-rate FFs and frequency dividers.¹ The sensitivity of the clock phase margin (jitter tolerance) to oscillator phase mismatches can be alleviated through the use of large devices and careful layout.
Design of a 40-Gb/s system in 0.18-µm CMOS technology is motivated by two factors: 1) the mask and fabrication cost of the next generation (e.g., 0.13 µm) is substantially higher; and 2) many new techniques that relax the speed requirements here can be used in future generations to lower the power dissipation and complexity.
B. CDR Architecture
The CDR circuit employs a quarter-rate architecture to relax the issues described above. Shown in Fig. 2, the circuit incorporates a multiphase voltage-controlled oscillator (VCO), a quarter-rate phase detector (PD), a voltage-to-current (V/I) converter, and a simple loop filter. The PD uses the half-quadrature phases provided by the VCO to sample the input data every 12.5 ps, thereby detecting data edges and determining whether the clock is early or late. Four of these samples fall in the center of the data eye, retiming and demultiplexing the 40-Gb/s input into four 10-Gb/s outputs. In the absence of data transitions, the V/I converter generates no output current, leaving the oscillator control line undisturbed. The circuit is fully differential, except for the oscillator control line.
With quarter-rate sampling, the FFs' hold time can be four times that required in full-rate operation, but their acquisition speed must still guarantee correct sampling of the input bits in less than 50 ps. The FF design described in Section III accomplishes this goal.
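The sampling grid described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the names (`sample_instants_ps`, `classify`) are not from the paper, and it assumes an ideally locked loop in which data transitions fall on multiples of the 25-ps bit period.

```python
# Timing sketch for quarter-rate sampling (illustrative names, ideal lock assumed).
DATA_RATE = 40e9             # input data rate in b/s (from the text)
CLOCK_FREQ = DATA_RATE / 4   # quarter-rate clock: 10 GHz
NUM_PHASES = 8               # half-quadrature phases, 45 degrees apart

bit_period_ps = 1e12 / DATA_RATE                  # 25 ps per bit
clock_period_ps = 1e12 / CLOCK_FREQ               # 100 ps per clock cycle
sample_spacing_ps = clock_period_ps / NUM_PHASES  # one sample per phase


def sample_instants_ps(offset_ps=0.0):
    """Sampling instants within one clock period, one per VCO phase."""
    return [offset_ps + k * sample_spacing_ps for k in range(NUM_PHASES)]


def classify(instants, tol_ps=1e-6):
    """Label each instant 'edge' (on a bit boundary) or 'center' (mid-bit)."""
    labels = []
    for t in instants:
        remainder = t % bit_period_ps
        on_boundary = remainder < tol_ps or bit_period_ps - remainder < tol_ps
        labels.append("edge" if on_boundary else "center")
    return labels


labels = classify(sample_instants_ps())
print(sample_spacing_ps, labels)
```

Under these assumptions, four of the eight samples land on data edges and the other four in the middle of the eye, which is what allows the same flip-flops to serve both phase detection and 1-to-4 demultiplexing.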
It is interesting to compare the quarter-rate architecture of Fig. 2 with a full-rate system in terms of power dissipation, hardware, and clock load capacitance. The PD of Fig. 2 employs 16 latches to perform phase detection, data retiming, and 1-to-4 demultiplexing, with each clock phase driving two latches. Depicted in Fig. 3, the full-rate counterpart incorporates seven latches in an Alexander PD [6], four latches in two cascaded divide-by-two circuits, and a minimum of nine latches in the demultiplexer, i.e., nine latches operating at full rate and eleven at half rate. Note that the full-rate clock drives the input capacitance of nine latches. Thus, the two architectures consume comparable power levels in their digital sections, but the full-rate design presents a substantially larger capacitance to the full-rate clock.
¹ In applications requiring a full-rate recovered clock, a 40-GHz oscillator can be injection-locked to the 10-GHz VCO.
III. BUILDING BLOCKS
A. VCO
The speed, jitter, and driving capability required of the oscillator point to the use of an LC realization. A number of multiphase LC oscillators have been reported. Coupled oscillators [8]-[10] operate away from the resonance frequency of the tanks so as to create the required phase shift, thus bearing a tradeoff between reliability of oscillation and the phase noise [10]. The multiphase oscillator in [11] drives transmission lines by a resistively-loaded gain stage, incurring energy loss in each cycle.
The multiphase oscillator introduced here is based on the concept of differential stimulus of a closed-loop transmission line at evenly-spaced points, as illustrated conceptually in Fig. 4(a) with two differential negative-$G_m$ cells. Approximated in Fig. 4(b) by lumped inductors and capacitors, the circuit consists of eight inductors forming a loop and four differential negative-$G_m$ cells driving diagonally opposite nodes. In the steady state, every two such nodes sustain a phase separation of 180°, thus providing 45° phase steps in between. Unlike the topologies in [9] and [10], this oscillator does not operate away from the resonance frequency. Also, in contrast to the design in [11], the transmission line requires no termination resistors, thereby displaying lower phase noise and larger voltage swings for a given power dissipation and inductor $Q$. Fig. 5 plots the phase noise of three 10-GHz half-quadrature oscillator topologies simulated in Spectre RF and designed with the same power dissipation and inductor $Q$. The proposed oscillator achieves at least 7-dB lower phase noise and twice the voltage swing.
The oscillation frequency of the circuit is uniquely given by the travel time of the wave around the loop. Noting that the phase velocity in a transmission line is equal to $1/\sqrt{L_0 C_0}$, where $L_0$ and $C_0$ represent the inductance and capacitance per unit length, respectively, we write the oscillation frequency of this topology as
$$f_{osc} = \frac{1}{8\sqrt{LC}} \tag{1}$$
where $L$ and $C$, respectively, denote the lumped inductance and capacitance of each of the eight sections.
The topology of Fig. 4(b) necessitates long interconnects between the nodes and their corresponding negative-$G_m$ cells. However, recognizing that diagonally opposite inductors carry currents that are 180° out of phase, we modify the circuit as shown in Fig. 4(c), grouping inductor elements into differential structures and placing the negative-$G_m$ cells in close proximity of the oscillator nodes. Exploiting the higher $Q$ of differential inductors [12], the VCO incorporates the negative-$G_m$ cell shown in Fig. 4(d), shaping the rising and falling edges by the pMOS transistors and hence lowering the upconversion of noise [13]. The value of each differential inductor is 0.9 nH.
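As a quick numerical illustration of the travel-time argument, the sketch below reads Eq. (1) as $f_{osc} = 1/(8\sqrt{LC})$, i.e., one period equals eight section delays of $\sqrt{LC}$ each. The function names and the 0.45-nH per-section inductance are hypothetical placeholders; the text gives only the 0.9-nH value of each differential inductor, not how it maps onto a single section.

```python
import math


def ring_oscillation_frequency(l_section, c_section, n_sections=8):
    """Frequency of a lumped closed-loop line: one period = n_sections * sqrt(L*C)."""
    return 1.0 / (n_sections * math.sqrt(l_section * c_section))


def section_capacitance_for(f_osc, l_section, n_sections=8):
    """Per-section capacitance needed to reach f_osc with the given inductance."""
    return 1.0 / (l_section * (n_sections * f_osc) ** 2)


# Hypothetical per-section inductance (an assumption, not a value from the paper).
L_SECTION = 0.45e-9  # henries
c_needed = section_capacitance_for(10e9, L_SECTION)
f_check = ring_oscillation_frequency(L_SECTION, c_needed)
print(f"C per section = {c_needed * 1e15:.0f} fF, f_osc = {f_check / 1e9:.3f} GHz")
```

The per-section delay at 10 GHz works out to 12.5 ps, the same interval that separates adjacent sampling phases in the PD.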
Each node in the oscillator is loaded by a MOS varactor. With the oscillation common-mode level set by the negative-$G_m$ cells to around …, the MOS varactors can go from the accumulation mode to the depletion mode, maximizing the tuning range. Unlike the design in [11], the VCO avoids external control and, hence, exhibits a high gain, approximately 1 GHz/V, thus necessitating a smaller ripple on the control line to obtain low jitter.
Each differential port of the VCO in Fig. 4(b) is buffered by an inductively-loaded differential pair. These buffers serve to: 1) isolate the VCO from the long interconnects going to the PD that would otherwise introduce greater uncertainty in the oscillation frequency; 2) generate voltage swings above the supply voltage, thus driving the FFs in the PD efficiently (Section III); and 3) isolate the VCO from the data edges coupled through the phase detector.
Device mismatches in the circuit of Fig. 4(c) yield nonuniform separations between adjacent VCO phases and, hence, static phase error in the CDR. Circuit simulations predict a phase mismatch of 0.22 ps for a 5% mismatch between the LC products in two adjacent tanks. Such a mismatch also shifts the oscillation frequency by 0.3%, which is an error well within the tuning range.
An interesting issue in the proposed VCO is that, due to symmetry, the wave may propagate clockwise rather than counterclockwise. In the present prototype, this effect is not observed, perhaps because the $Q$ of the inductors is slightly higher when the current flows from the outer turns toward the inner turns. Nonetheless, to achieve a more robust design, a means of detecting the wave direction is necessary. Since nodes that are +90° apart in one case exhibit a phase difference of -90° in the other case, an FF sensing such nodes at its data and clock inputs generates a constant high or low level, thereby providing a dc quantity indicating the wave direction. As described in the next section, this result can be used to avoid corruption of data.
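The direction-detection idea can be illustrated with an idealized behavioral model. The sketch below assumes sinusoidal node voltages and an ideal edge-triggered FF that samples its data input at the rising zero crossings of its clock input; the function name and the ±90° convention are illustrative choices, not taken from the paper.

```python
import math


def direction_flag_samples(phase_offset_deg, n_cycles=16, f_osc=10e9):
    """Sample a node offset by phase_offset_deg at rising zero crossings of sin(wt).

    Returns the sequence of logic levels an ideal FF would capture.
    """
    period = 1.0 / f_osc
    captured = []
    for n in range(n_cycles):
        t = n * period  # rising zero crossing of the clock node, sin(2*pi*f*t)
        data_node = math.sin(2 * math.pi * f_osc * t + math.radians(phase_offset_deg))
        captured.append(data_node > 0.0)
    return captured


one_direction = direction_flag_samples(+90.0)    # data node leads the clock node
other_direction = direction_flag_samples(-90.0)  # data node lags the clock node
print(all(one_direction), any(other_direction))
```

In this model the FF output is constant for a given wave direction and flips when the direction reverses, which is the dc indication the text relies on.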
B. Phase Detector
1) Architecture: The PD employs eight FFs to strobe the data at 12.5-ps intervals (Fig. 6). In a manner similar to an Alexander topology, the PD compares every two consecutive samples by means of an XOR gate, generating a high level if an edge has occurred. To determine the polarity of the phase error from three consecutive samples, the outputs of two XORs are applied to a V/I converter, which produces a net current if its inputs are unequal. In lock, every other sample serves as a retimed and demultiplexed output.
It is important to note that, in the absence of data transitions, the FFs generate equal outputs, and each V/I converter produces a zero current, in essence presenting a tristate (high) impedance to the oscillator control. This is in contrast to other bang-bang topologies [5], [7] that continue to apply a high or low level to the VCO during long runs, creating a potentially high jitter.
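The early-late decision and the tristate behavior can be sketched with a simple behavioral model. The code below is an idealized abstraction, not the transistor-level circuit: it treats the data as ideal rectangular bits, ignores FF metastability, and uses an arbitrary sign convention (positive when the clock is late). The names `sample` and `pd_net_output` are hypothetical.

```python
BIT_PS = 25.0  # bit period of 40-Gb/s data, in picoseconds


def sample(bits, t_ps):
    """Ideal sampler: value of the bit that occupies time t_ps."""
    return bits[int(t_ps // BIT_PS)]


def pd_net_output(bits, phase_error_ps):
    """Sum of (first XOR - second XOR) over all bit boundaries.

    For each boundary, three consecutive samples are taken 12.5 ps apart:
    mid-bit, boundary, next mid-bit. Valid for |phase_error_ps| < 12.5.
    """
    net = 0
    for n in range(len(bits) - 1):
        center_a = sample(bits, n * BIT_PS + BIT_PS / 2 + phase_error_ps)
        edge = sample(bits, (n + 1) * BIT_PS + phase_error_ps)
        center_b = sample(bits, (n + 1) * BIT_PS + BIT_PS / 2 + phase_error_ps)
        xor_first = center_a ^ edge    # high if the edge sample already shows the new bit
        xor_second = edge ^ center_b   # high if the edge sample still shows the old bit
        net += xor_first - xor_second  # V/I converter: net current only if XORs differ
    return net


late = pd_net_output([0, 1, 0, 1], +3.0)    # clock samples after the transitions
early = pd_net_output([0, 1, 0, 1], -3.0)   # clock samples before the transitions
long_run = pd_net_output([0] * 8, +3.0)     # no transitions at all
print(late, early, long_run)
```

With no data transitions both XOR outputs stay low and the net output is zero, mirroring the high-impedance behavior described above; with transitions present, the sign of the output follows the sign of the phase error.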
The early-late phase detection method used here exhibits a bang-bang characteristic, forcing the CDR circuit to align every other edge of the clock with the zero crossings of data after the loop is locked. In reality, the metastable behavior of the FFs leads to a finite PD gain, allowing the clock edges to sustain some offset with respect to the data zero crossings. Shown in Fig. 7 is the input/output characteristic of the PD together with the V/I converter, obtained by transistor-level simulations while the circuit senses a 40-Gb/s random data stream and eight phases of the 10-GHz clock. For a phase error of less than 2.5 ps, the PD displays a relatively constant gain of 100 µA/ps. With an ideal V/I converter, a finite phase difference would still lead to injection of a finite current into the loop filter (similar to an ideal integrator), forcing the loop to lock with a zero static phase error. The output resistance of the V/I converter, on the other hand, results in lossy integration, necessitating a small change in the phase error as the control voltage varies from minimum to maximum. Nevertheless, simulations indicate that, with the present PD and V/I converter design, the static phase offset reaches 0.8 ps as the V/I output varies from near zero to near $V_{DD}$.
As mentioned in the previous section, the arbitrary wave direction in the oscillator yields two possible sets of phases, but an FF can detect the direction. Analysis of the PD and loop operation for the two possibilities reveals that in one case, the even-numbered PD outputs are metastable whereas in the other case, the odd-numbered outputs are. Thus, the dc level generated by the direction-detector FF can simply select and reroute the nonmetastable outputs to the next rank of demultiplexing.
2) FF Design: Even though the PD FFs operate with a 10-GHz clock, proper sampling of 40-Gb/s data still requires fast recovery from the previous state and rapid acquisition of the present input. To this end, both a wide sampling bandwidth and a short clock transition time are necessary. Fig. 8(a) depicts the master-slave FF used in the phase detector. Here, two nMOS switches sample the input on the parasitic capacitances at their output nodes when $CK$ is high and isolate these nodes from the input when $CK$ is low. Since the minimum input common-mode (CM) level is dictated by the gate-source voltage of … and the headroom required by …, the sampling switches experience an overdrive voltage of only 0.5 V even if $CK$ reaches $V_{DD}$, failing to provide fast sampling. This issue is remedied by setting the CM level of $CK$ and $\overline{CK}$ equal to $V_{DD}$, a choice afforded by the inductively-loaded stages following the VCO core. The peak value of $CK$ thus exceeds $V_{DD}$ by 0.8 V, more than doubling the sampling speed of the switches. The large clock swings also minimize the transition times. As the switches turn off, their channel charge injection and clock feedthrough introduce a CM pedestal of 200 mV at the gates of the devices they drive. Fig. 8(b) shows the simulated eye diagram at the sampling nodes.
With large clock swings available, the current switching in the clocked transistor pairs is accomplished by gate control rather than conventional source-coupled steering. The proposed topology offers two advantages: 1) since the tail current source is removed, the clocked transistors can be much narrower, presenting a smaller capacitance to the VCO buffer; and 2) since the drain currents of these transistors are not limited by a tail current source, they experience class-AB switching, drawing a large current at the peak of the clock swing and providing greater voltage swings and a higher gain in the data path. The coupling capacitors in Fig. 8 can potentially occupy a large area and/or present a large bottom-plate capacitance to the clock lines. To resolve this issue, these capacitors are realized as fringe structures [14] using metal-2 through metal-5 layers.
Fig. 9 compares the performance of the above FF with a standard current-steering topology consuming the same power and driven by the same clock swings with and without inductive peaking. Derived from simulations, the waveforms in Fig. 9(a)-(c) represent the 10-Gb/s output resulting from quarter-rate sampling of 40-Gb/s data. It can be seen that the proposed FF introduces much less intersymbol interference (ISI). The circuit can benefit from inductive peaking, but skew issues in the layout (Section IV) prohibit the use of inductors here.
The connection of the sources of the clocked transistors to ground rather than to tail current sources can potentially corrupt the data in the presence of noise on the supply voltage of the VCO buffer. Simulations reveal that a 10-mV 1-GHz sinusoid on the supply increases the jitter at the output of the FF by 0.03 ps.
Each FF in the PD must drive an XOR gate. Furthermore, four of the FFs must also drive output buffers or subsequent stages of demultiplexing. To avoid systematic delay mismatches resulting from loading disparity, all FFs are immediately followed by a Cherry-Hooper amplifier (Fig. 10) [15], tapered up in driving capability by a factor of 1.5 from … to ….
3) XOR Gate and V/I Converter: The XOR gates used in the PD must exhibit symmetry with respect to their two inputs and operate with a low supply voltage. Shown in Fig. 11 along with the V/I converter, the XOR gate is a modified version of that in [16], with a pair of transistors forming local positive-feedback loops and avoiding the reference voltage necessary in the earlier realization [16].
The V/I converter copies the output current of the XOR, providing nearly rail-to-rail swings for the oscillator control line. Unlike charge pumps, V/I converters need not switch after every phase comparison and are, therefore, free from the dead-zone issue.
IV. EFFECT OF NONIDEALITIES
A. Staggered Outputs
In order to save power and minimize the clock load capacitance, the PD uses only one FF in each data sampling path. As a result, the outputs of consecutive FFs in Fig. 6(a) are staggered by 12.5 ps, failing to provide the instantaneous phase-error information simultaneously. Since the low-pass filter extracts the average value of the V/I output, the CDR loop still locks properly, but the misalignment creates ripple on the oscillator control line.
To estimate the jitter resulting from this effect, we consider the worst case, illustrated in Fig. 12, where only half of the PD is shown for the sake of clarity. Assuming a run of at least seven zeros before sample 1, we observe that sample 2 drives both … and … high, still generating …. At sample 3, … goes low, … goes high, and both … and … assume their maximum value. This condition holds for 87.5 ps, leading to a peak jitter of … (2) where … denotes the oscillation period, and the VCO delay and the change in the voltage across … are neglected. In this design, the VCO gain is … Grad/s/V, the current is … mA, and …, yielding … ps. Note that for larger run lengths, the above integration still takes place for 87.5 ps.
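The order of magnitude of this ripple-induced jitter can be explored with a first-order estimate. The sketch below is not a transcription of Eq. (2) and may differ from it by constant factors: it assumes that a current step into the loop-filter resistor produces a control-voltage step, that the VCO integrates the resulting frequency offset over the 87.5-ps window, and that the capacitor voltage and VCO delay are neglected as in the text. The current and resistance values are hypothetical placeholders; only the 87.5-ps window, the 100-ps period, and the approximate 1-GHz/V gain come from the text.

```python
import math


def peak_jitter_ps(kvco_rad_per_s_per_v, i_step_a, r_filter_ohm, hold_time_s, period_s):
    """First-order jitter from a control-line step held for hold_time_s."""
    delta_v = i_step_a * r_filter_ohm                          # step on the control line
    delta_phi = kvco_rad_per_s_per_v * delta_v * hold_time_s   # accumulated phase, radians
    return delta_phi / (2 * math.pi) * period_s * 1e12         # radians -> picoseconds


K_VCO = 2 * math.pi * 1e9  # about 1 GHz/V, expressed in rad/s/V
I_STEP = 1e-3              # hypothetical V/I output current, amperes
R_FILTER = 100.0           # hypothetical loop-filter resistance, ohms

estimate = peak_jitter_ps(K_VCO, I_STEP, R_FILTER, 87.5e-12, 100e-12)
print(f"estimated peak jitter = {estimate:.3f} ps")
```

The estimate scales linearly with the VCO gain, the current, and the resistance, so halving any of them halves the ripple-induced jitter in this model.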
B. Finite Group Velocity
The performance of the PD is also influenced by the finite group velocity of the 40-Gb/s input data as it travels from one end of the FF array to the other end. As shown in Fig. 13(a), the input data flows on differential microstrip lines made of metal-6 on top of a metal-1 ground plane. The microstrip lines are designed to have a 100-Ω differential characteristic impedance with the input capacitance of the FFs included. This capacitance lowers the line group velocity to about 40% of the speed of light, resulting in a skew of nearly 2.9 ps between the inputs of the first and last FFs.
As shown in Fig. 13(b), the sampled zero crossings of the data before locking are shifted to the left by 0.75, 1.5, and 2.25 ps. However, upon lock, the loop tries to minimize the dc level generated by these samples, shifting the first sample to the right by 1 ps and the last to the left by 1 ps. This is equivalent to periodic phase modulation of the input by 1 ps at a rate of 10 GHz [Fig. 13(c)]. Fortunately, the limited bandwidth of the CDR rejects this jitter. Nonetheless, a skew of 1 ps is introduced in the sampling points.
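The relation between line length, group velocity, and skew is simple enough to check numerically. The sketch below assumes a uniform line with a group velocity of 0.4c, as stated in the text; the function names are illustrative, and the line length it reports is inferred from the quoted 2.9-ps skew rather than given in the paper.

```python
C_LIGHT = 299_792_458.0  # speed of light in vacuum, m/s


def skew_ps(length_m, velocity_fraction=0.4):
    """Propagation delay along a line of the given length, in picoseconds."""
    return length_m / (velocity_fraction * C_LIGHT) * 1e12


def length_for_skew_um(skew_ps_value, velocity_fraction=0.4):
    """Line length (micrometers) that produces the given skew."""
    return skew_ps_value * 1e-12 * velocity_fraction * C_LIGHT * 1e6


implied_length_um = length_for_skew_um(2.9)
print(f"2.9 ps of skew corresponds to about {implied_length_um:.0f} um of line")
print(f"a 1.5-mm line would give about {skew_ps(1.5e-3):.1f} ps of skew")
```

At this velocity each picosecond of skew corresponds to roughly 0.12 mm of line, which shows why stretching the FF array to accommodate inductors would push the skew toward a large fraction of the 25-ps bit period.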
Note that if the FFs use inductive peaking, the 32 inductors required in the PD would increase the distance between the first and last FF to several millimeters and the skew to 10-15 ps. In that case, the jitter tolerance of the circuit would be heavily compromised.
The linear placement of the PD FFs inevitably introduces some skew in the routing of the clock phases. However, the 
V. EXPERIMENTAL RESULTS
The CDR circuit has been fabricated in a 0.18-µm CMOS technology. Fig. 14 shows a photo of the die, which measures 1.0 × 1.4 mm². The spirals employ a linewidth commensurate with electromigration limitations (… mA/µm for metal-6). The input and output are designed as 50-Ω microstrip structures consisting of metal-6 atop metal-1, and skews are minimized through symmetry in layout. The circuit is tested on a high-speed probe station with a 40-Gb/s Anritsu random data generator providing the input.
Shown in Fig. 15 are the VCO tuning characteristic and free-running spectrum. The VCO provides a tuning range of 1.2 GHz with a phase noise of -105 dBc/Hz at 1-MHz offset. The $Q$ of the differential inductors used in the VCO is estimated as follows. In the measurement, the oscillator output is monitored on a spectrum analyzer while the tail current of the negative-$G_m$ cells is reduced so as to place the circuit at the edge of oscillation. Next, the tail current thus obtained is used in the simulation and the equivalent parallel resistance $R_p$ of each inductor is lowered until the circuit fails to oscillate. For such a value of $R_p$, we have $Q = R_p/(L\omega)$. Yielding $Q \approx$ …, this technique of course assumes that the value of the inductor and the oscillation frequency are predicted accurately. Note that other oscillator parameters such as phase noise and output swing are also functions of $Q$, but it is much easier to place the circuit at the edge of oscillation than to calculate the $Q$ from phase noise or output swing measurements.
Fig. 16(a) depicts the CDR input and output waveforms under locked condition in response to a pseudorandom sequence of length $2^{31}-1$. The bit-error rate (BER) is measured using the setup shown in Fig. 16(b), where a high-speed 4-to-1 multiplexer sensing two relatively uncorrelated binary sequences generates a 40-Gb/s stream. One of the demultiplexed channels produced by the CDR circuit corresponds to channel 1 and is applied to the BER tester.⁹ The resulting BER is equal to 10^(-…). As observed in Fig. 16(a), the output experiences some ISI due to the limited bandwidth of the output buffers, possibly degrading the BER. Other sources of the high BER relate to noise pickup in the probe station and static phase errors in the PD.
Fig. 17 shows the recovered clock, suggesting an rms jitter of 1.756 ps and a peak-to-peak jitter of 9.67 ps. However, as shown in the inset, the oscilloscope itself suffers from rms and peak-to-peak jitters of 1.508 and 8.89 ps, respectively. Thus, the CDR output contains a jitter of 0.9-ps rms and at most 9.67-ps peak-to-peak.¹⁰ The jitter transfer and tolerance have not been measured due to lack of necessary equipment. The performance of this work and some other previously published CDR circuits is summarized in Table I.
⁹ This arrangement may be somewhat optimistic in that the multiplexer output does not contain long runs.
¹⁰ It is unclear whether and how the peak-to-peak values can be subtracted.
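The 0.9-ps rms figure is consistent with removing the oscilloscope's contribution in quadrature. The sketch below assumes the CDR jitter and the instrument jitter are uncorrelated, so their variances add; the function name is illustrative, and the two input values are the measured numbers quoted above.

```python
import math


def deembed_rms_jitter(total_rms_ps, instrument_rms_ps):
    """Remove an uncorrelated instrument contribution from a measured rms jitter."""
    if instrument_rms_ps > total_rms_ps:
        raise ValueError("instrument jitter exceeds the measured total")
    return math.sqrt(total_rms_ps ** 2 - instrument_rms_ps ** 2)


cdr_rms_ps = deembed_rms_jitter(1.756, 1.508)
print(f"de-embedded CDR rms jitter = {cdr_rms_ps:.2f} ps")
```

The same quadrature rule does not carry over to peak-to-peak values, which is why the text quotes the measured 9.67 ps as an upper bound rather than a corrected figure.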
VI. CONCLUSION
This work demonstrates the potential of standard CMOS technology for CDR circuits operating at tens of gigabits per second. The proposed oscillator, PD, and FF topologies resolve a number of circuit and architecture issues. Furthermore, the use of inductors boosts the raw speed of the technology considerably. In addition to a high speed, this CDR circuit achieves a power dissipation that is substantially lower than that of previous work.
