Abstract: A 16-phase phase-locked loop (PLL) suitable for a 25 Gb/s, 1/8th-rate clock and data recovery (CDR) system is presented. Sub-rate CDRs have been shown to be reconfigurable to operate at fractions of their peak data rate. When operating below the maximum data rate, power dissipation can be reduced. In addition, the jitter that can be tolerated from the multi-phase PLL is higher, suggesting a power/jitter trade-off that can be used to minimize power across data rates. This PLL allows power dissipation to be traded against jitter performance on the fly by dynamically connecting between one and three identical sub-VCOs together to generate its 16 output phases. To minimize sudden phase excursions during VCO reconfiguration, a power-on sequence and a variable-capacitance load are used. Simulation results in TSMC 65 nm technology show that the power dissipation is reduced by 57% when switching from three VCOs to one. Conversely, the jitter decreases by a factor of 1.5 when the PLL is reconfigured from one to three VCOs. This can be done with a maximum absolute timing excursion of 16 ps, making the PLL suitable for on-the-fly reconfiguration of the CDR from 12.5 to 25 Gb/s.
I. INTRODUCTION
Receiving serial data transmitted over an electrical or optical channel, in the absence of a forwarded clock, requires a clock and data recovery (CDR) system that correctly samples and retimes the incoming data. As CDR implementations at >25 Gb/s are now commonplace, minimizing CDR power dissipation and chip area has become the focus of research. For example, [1] uses a wide-band phase-locked loop (PLL) in order to suppress voltage-controlled oscillator (VCO) phase noise from a compact ring VCO. Despite an energy efficiency of 1.8 pJ/bit, CDR power is not a negligible fraction of overall receiver power.
Other recent work has proposed CDRs capable of multi-rate operation using fixed clock rates. In [2], a method was shown to select different data rates by discarding samples taken from an Alexander phase detector. This discarding scheme made the system a quarter-rate, half-rate or full-rate receiver as the data rate was reduced. However, the reported energy per bit in this design was three times that in [1] at the highest data rate, and no power reduction scheme was implemented when the CDR operated at a lower data rate. The authors did propose some modifications to their work, such as shutting off unused samplers, that would reduce power dissipation, but not proportionally with the data rate.
A variable data rate link with proportional power dissipation gives the system the option of saving power when peak link capacity is not needed. However, the power benefits may not offset the system performance penalty if the time to increase the data rate is long. Another aspect of multi-rate operation that was not discussed in [2] was how quickly the data rate can be changed. In this work, multi-phase PLLs to support multi-rate low-power CDRs are investigated, with the goal of supporting on-the-fly changes in data rate.
In order to support the lowest power CDR and extend it to multiple data rates, the techniques in [2] can be applied to the architecture in [1]. A modified version of the CDR in [1] is shown in Fig. 1. This CDR operates as a 1/8th-rate architecture. By disabling clock signals and capturing fewer data/edge samples per PLL clock period, the input data rate can be reduced with modest power reduction. However, the power dissipation of the PLL will remain constant and limit power reduction at lower data rates.
The power dissipation of the PLL is set by the maximum tolerable jitter when the CDR is operating at its highest data rate. If the input data rate is lowered, assuming a constant eye opening relative to the unit interval (UI), the PLL's timing jitter can increase without degrading bit-error rate. This suggests that the jitter requirement of the fixed-frequency PLL can be relaxed when the data rate is lower.
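The scaling argument above can be illustrated with a short sketch. It assumes the tolerable jitter is a fixed fraction of the UI; the 5% fraction is a hypothetical placeholder chosen for illustration, not a figure from this work.

```python
def unit_interval_ps(data_rate_gbps: float) -> float:
    """Unit interval in picoseconds for a data rate given in Gb/s."""
    return 1e3 / data_rate_gbps


def jitter_budget_ps(data_rate_gbps: float, ui_fraction: float) -> float:
    """Absolute jitter budget when the budget is a fixed fraction of the UI."""
    return ui_fraction * unit_interval_ps(data_rate_gbps)


# Hypothetical budget fraction, for illustration only.
BUDGET_FRACTION = 0.05

for rate in (25.0, 12.5):
    ui = unit_interval_ps(rate)
    budget = jitter_budget_ps(rate, BUDGET_FRACTION)
    print(f"{rate:5.1f} Gb/s: UI = {ui:.1f} ps, budget = {budget:.2f} ps")
```

Halving the data rate doubles the UI, so the same relative budget allows twice the absolute jitter from the fixed-frequency PLL.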
In this work, a multi-phase PLL, similar to that in [1], has been designed for 3.125 GHz operation, intended for use in a 1/8th-rate, 25 Gb/s CDR that can also support lower data rates. It features a parallel VCO architecture based on techniques introduced in [3] that allows power dissipation and jitter performance to be traded off, so that less power is dissipated when the jitter requirements are relaxed. In [3], it was shown that phase noise (jitter) could be statically tuned, allowing a jitter/power trade-off. Multiple VCOs in parallel were connected as needed through pass transistors in order to increase the effective size of the transistors in the ring, dissipating more power but, as a consequence, reducing the phase noise. However, the effect on the output phase of on-the-fly reconfiguration of the VCO was not considered.

This work was supported by NSERC, CMC Microsystems, the Faculty of Engineering and Computer Science at Concordia University and the Regroupement Stratégique en Microsystèmes du Québec.
II. PROPOSED ARCHITECTURE
A. Voltage Controlled Oscillator
In this work, the overall VCO is composed of three sub-VCOs, each of which is a current-starved ring VCO. As shown in [3], ring oscillators can be placed in parallel in order to obtain phase-noise reduction. This has been shown to be equivalent to increasing the transistor sizes, which in turn increases capacitance and power but lowers phase noise. Because [3] is designed for steady-state operation, it does not address the power-on and connection transients that occur when additional sub-VCOs are connected dynamically, since the phases in separate rings are arbitrary. Connecting unmatched phases to the same node forces abrupt phase changes in the overall VCO. This effect can be avoided if the corresponding phases of different rings are aligned before they are connected together. The work presented in this paper addresses the need to have three sub-VCOs available that can be aligned and connected without significantly perturbing the PLL's output phase.
A portion of the proposed VCO architecture can be seen in Fig. 2. The overall eight-stage VCO uses three sub-VCOs, two of which are depicted in the figure. Also shown is the mechanism by which each sub-VCO can be connected to the overall VCO's output through buffers. Inverters in the main path propagate the signal around the loop and are the primary reason for oscillation. Cross-coupled inverters ensure the outputs of each stage are 180° out of phase.
Each of the sixteen phases has a set of inverters attached to it, acting as buffers (labeled as EN3). At the output of these buffers is a node referred to as a common phase node, or V_c. In this scheme, each of the three identical VCOs can drive V_c. In addition to the buffers mentioned above, one last inverter is added to connect V_c to the following stage of the ring (shown as From V_c and labeled with EN1), forming the "auxiliary path". For devices in the auxiliary path, the VSS' node is connected to ground, whereas those in the main path have the node connected to current-starving devices. This helps to keep the delay of the auxiliary path fixed. The strength and capacitance of each component in the auxiliary path determine where it must re-enter the ring, which may vary between designs using more VCOs. The re-entry point in this design was found by comparing the buffered ring output to the outputs of later stages of the same ring in order to find the best match.
The number of active sub-VCOs can be selected as needed depending on the instantaneous jitter and power requirements of the system. As shown in [4], if it is assumed that n VCOs in parallel behave as a single VCO dissipating n times more power and having transistors n times as wide, we expect a reduction in phase noise corresponding to:

$$L_n(f) = L_1(f) - 10\log_{10}(n)\ \text{dB} \qquad (1)$$

relative to a single VCO, where n is the number of active sub-VCOs in parallel, L_n(f) is the phase noise with n active sub-VCOs at a frequency offset f and L_1(f) is that of a single VCO. In the case of two and three active sub-VCOs, the phase noise is expected to drop by 3 dB and 4.8 dB, respectively, relative to a single VCO. With the sub-VCOs placed in a PLL, the closed-loop phase noise is expected to be shifted down by the same amount as in the open-loop case, assuming the VCOs' noise contributions are dominant. The rms jitter in the closed-loop case is expected to drop by a factor of √n, where n is the number of active sub-VCOs.
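As a quick numerical check of the expected scaling, the sketch below evaluates the 10·log10(n) phase-noise reduction and the corresponding √n rms jitter reduction. It assumes ideal scaling with VCO-dominated noise; the function names are illustrative.

```python
import math


def phase_noise_reduction_db(n: int) -> float:
    """Ideal phase-noise reduction in dB for n parallel sub-VCOs."""
    return 10.0 * math.log10(n)


def rms_jitter_reduction_factor(n: int) -> float:
    """Ideal rms jitter reduction factor, valid when VCO noise dominates."""
    return math.sqrt(n)


for n in (1, 2, 3):
    print(
        f"n = {n}: phase noise -{phase_noise_reduction_db(n):.1f} dB, "
        f"rms jitter / {rms_jitter_reduction_factor(n):.2f}"
    )
```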
When one VCO is running, there exists an auxiliary path that is designed to have the same delay as the forward path through the ring-oscillator stages. This auxiliary path contains inverters strong enough to drive the respective common node, and input inverters (EN1) weak enough so as to be in phase with the respective re-entry point of the ring. If multiple VCOs are enabled, the input current to the common nodes rises and thus the delay of the auxiliary path decreases, as predicted by:

$$I = C\,\frac{dV}{dt} \qquad (2)$$

where I is the current into the common node, C is the constant node capacitance and dV/dt is the rate of change of the node voltage, which sets the rise time of the node. This delay reduction in the auxiliary path causes a net reduction in the per-stage delay of the ring VCO, increasing its free-running frequency. The per-stage delay is the weighted average of the delay through the main and auxiliary paths, similar to a delay interpolator. This is not a significant problem if the VCO is in a PLL, as the feedback loop can compensate for the frequency change by adjusting the VCO's control voltage. However, the PLL cannot suppress sudden phase excursions occurring at the instant a sub-VCO is added.
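The effect can be sketched numerically using the capacitor relation I = C·dV/dt. All values below are hypothetical placeholders: the node capacitance, voltage swing, drive current and interpolation weight are not taken from the design, and the drive current is assumed to scale linearly with the number of enabled sub-VCOs, which is a simplification.

```python
def node_delay_s(c_farads: float, delta_v: float, i_amps: float) -> float:
    """Time to slew a node by delta_v with constant current: t = C * dV / I."""
    return c_farads * delta_v / i_amps


def per_stage_delay_s(t_main: float, t_aux: float, w_aux: float) -> float:
    """Weighted average of main-path and auxiliary-path delays."""
    return (1.0 - w_aux) * t_main + w_aux * t_aux


# Hypothetical values, chosen so that one sub-VCO gives a 20 ps stage delay.
C_NODE = 10e-15   # common-node capacitance, farads
DELTA_V = 0.5     # voltage swing to the switching threshold, volts
I_ONE = 250e-6    # drive current from a single sub-VCO, amperes
T_MAIN = 20e-12   # main-path delay per stage, seconds
W_AUX = 0.3       # assumed weight of the auxiliary path

for n in (1, 2, 3):
    t_aux = node_delay_s(C_NODE, DELTA_V, n * I_ONE)
    t_stage = per_stage_delay_s(T_MAIN, t_aux, W_AUX)
    print(f"n = {n}: aux delay = {t_aux * 1e12:.1f} ps, "
          f"stage delay = {t_stage * 1e12:.1f} ps")
```

With a constant node capacitance, each added sub-VCO shortens the auxiliary-path delay and therefore the averaged per-stage delay, which raises the free-running frequency.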
In order to have smooth transitions when activating a new VCO, the following procedure has been developed. In Fig. 2, one can observe several enable signals, labeled as ENx. It should be noted that the timing of the following procedure can vary depending on the implementation of the circuit, based on drive strengths and node capacitance. To activate a ring, EN1 is activated first. This allows the weak input inverters to start charging up the capacitive nodes inside the ring. Next, EN2 is activated to start up the ring. This takes the nodes partially charged by the EN1 step and pulls them further apart due to the cross-coupled inverters. At this point the ring will start oscillating and become phase aligned with V_c due to the coupling from the V_c node through the EN1 inverter. The final step is to activate the EN3 inverters, which create a buffered link between the newly aligned sub-VCO and V_c. As seen in Fig. 3, the whole aligning process can be completed in approximately one clock cycle. In this example, EN1 is activated at 8 ns, EN2 at 8.3 ns and EN3 at 8.5 ns in order to show a visible spread.
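The enable ordering can be sketched as a simple event schedule. This is a behavioral illustration only: the `EnableEvent` type and `power_on_sequence` function are hypothetical names, and the default delays reproduce the 8 ns, 8.3 ns and 8.5 ns example rather than required timings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnableEvent:
    time_ns: float
    signal: str
    purpose: str


def power_on_sequence(t_start_ns: float,
                      en2_delay_ns: float = 0.3,
                      en3_delay_ns: float = 0.5) -> list:
    """Return the EN1 -> EN2 -> EN3 schedule for activating one sub-VCO."""
    if not 0.0 <= en2_delay_ns <= en3_delay_ns:
        raise ValueError("enables must be asserted in the order EN1, EN2, EN3")
    return [
        EnableEvent(t_start_ns, "EN1",
                    "weak input inverters pre-charge the ring nodes"),
        EnableEvent(t_start_ns + en2_delay_ns, "EN2",
                    "ring starts and aligns to the common phase nodes"),
        EnableEvent(t_start_ns + en3_delay_ns, "EN3",
                    "buffers connect the aligned ring to the common nodes"),
    ]


for event in power_on_sequence(8.0):
    print(f"{event.time_ns:.1f} ns  {event.signal}: {event.purpose}")
```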
B. Phase-Locked Loop
The PLL was designed using loop parameters and circuits similar to [5]. As shown in that work, for a wideband PLL optimal results occur with ζ = 1.6 and ω_n/ω_0 = 0.04. The PLL block diagram can be seen in Fig. 4, where PD is the combination of the CML XOR phase detector and a voltage-to-current converter, which is a differential to single-ended transconductor.
The loop filter contains an RC branch as well as a parallel capacitance, and the values of these components can be calculated by neglecting the parallel capacitance for simplification. The closed-loop transfer function is found to be:

$$H(s) = \frac{K R s + K/C}{s^2 + K R s + K/C} \qquad (3)$$

where K is the product of the VCO gain, K_VCO, and the phase detector gain, K_PD, and R and C are the loop-filter components.
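A minimal sketch of how R and C could be chosen from the stated loop parameters follows. It assumes the standard second-order loop with a series RC filter, a current-output phase detector and no feedback divider, for which ω_n = √(K/C) and ζ = (R/2)·√(K·C). The gain values K_PD and K_VCO below are hypothetical placeholders, not values from this design.

```python
import math


def loop_filter_values(k: float, zeta: float, wn: float):
    """Solve wn = sqrt(K/C) and zeta = (R/2)*sqrt(K*C) for (R, C)."""
    c = k / wn ** 2
    r = 2.0 * zeta * wn / k
    return r, c


F0_HZ = 3.125e9                       # PLL output frequency from the text
ZETA = 1.6                            # damping factor from the text
WN = 0.04 * 2.0 * math.pi * F0_HZ     # natural frequency, rad/s

# Hypothetical gains, for illustration only.
K_PD = 100e-6                         # A/rad, PD plus V-to-I converter
K_VCO = 2.0 * math.pi * 1e9           # rad/s per volt
K = K_PD * K_VCO

R, C = loop_filter_values(K, ZETA, WN)
print(f"R = {R:.0f} ohm, C = {C * 1e12:.2f} pF")
```

With these placeholder gains the result is roughly 4 kΩ and 1 pF; different gains scale R and C accordingly.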
When additional sub-VCOs are connected, the abrupt event speeds up the VCO and advances its phase by Δt seconds. If a change from one to three VCOs is made in anticipation of an increase in data rate, the Δt change and the resulting settling time can be problematic. In order to solve this issue we revisit our initial assumptions in (2), where a constant capacitance value at the common node will no longer be assumed. If the capacitance increases along with the input current at the node, the excursion due to the change in the delay of the auxiliary path can be minimized. This variable capacitance must be able to be activated as needed, since reductions in the auxiliary path delay only occur when multiple sub-VCOs are enabled.
NMOS transistors were chosen to implement the variable capacitors. The V_c nodes are each connected to the gate of an NMOS transistor. Bulk terminals are grounded and source/drain (S/D) terminals are connected together. By lowering the S/D voltage from VDD to ground on the activation of EN3, the gate capacitance of the MOSFET increases from C_low, the oxide capacitance (C_ox) in series with the depletion-region capacitance, up to C_high, which equals C_ox.
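The two capacitance states can be sketched as follows, treating the gate as a simple two-state capacitor. The capacitance values are hypothetical placeholders, and real MOS capacitance varies continuously with bias rather than switching between two ideal values.

```python
def series_capacitance(c_ox: float, c_dep: float) -> float:
    """Oxide capacitance in series with the depletion capacitance (C_low)."""
    return c_ox * c_dep / (c_ox + c_dep)


def gate_capacitance(c_ox: float, c_dep: float, compensation_on: bool) -> float:
    """C_high = C_ox when S/D is pulled to ground, otherwise C_low."""
    return c_ox if compensation_on else series_capacitance(c_ox, c_dep)


# Hypothetical device values, for illustration only.
C_OX = 4e-15    # farads
C_DEP = 2e-15   # farads

c_low = gate_capacitance(C_OX, C_DEP, compensation_on=False)
c_high = gate_capacitance(C_OX, C_DEP, compensation_on=True)
print(f"C_low = {c_low * 1e15:.2f} fF, C_high = {c_high * 1e15:.2f} fF, "
      f"ratio = {c_high / c_low:.2f}")
```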
III. SIMULATION RESULTS
Simulations were carried out using TSMC 65 nm technology. Fig. 5 shows the reduction in open-loop phase noise as multiple VCOs are attached to the common phase node. The results follow the theory presented in (1), as reductions of 3 dB and 4.7 dB were observed with two and three active sub-VCOs, respectively.
The excursion theory and compensation method have been shown to reduce the deviation from the reference clock when sub-VCOs are activated. Fig. 6 illustrates the non-compensated case, where the variable capacitance was not activated, and the compensated case, where it was. This figure was obtained by examining the difference between the crossing points of the rising edges of the reference clock and the corresponding VCO output phase. This result was normalized in order to remove the static offset, which would be corrected by a secondary CDR alignment loop. The results show that the maximum excursion Δt is reduced by nearly 70% to 16 ps through the use of the variable capacitors. If it is assumed that the PLL output phases are aligned for optimum sampling of the data at the input of the CDR, Δt represents a sudden excursion from the desired sampling point. The CDR presented in [1] has a high-frequency jitter tolerance of 0.4 UI, implying that high-frequency changes in clock versus data alignment of up to 0.4 UI can be tolerated. The transition from one to three sub-VCOs occurs when the PLL is reconfigured while supporting 12.5 Gb/s in anticipation of an increase to 25 Gb/s. At 12.5 Gb/s, 16 ps is 0.2 UI, meaning that Δt is less than the CDR's jitter tolerance.
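The UI arithmetic above can be checked with a short sketch. The 16 ps excursion and the 0.4 UI tolerance are taken from the text; the function name is illustrative.

```python
def excursion_in_ui(excursion_ps: float, data_rate_gbps: float) -> float:
    """Express a timing excursion as a fraction of the unit interval."""
    return excursion_ps * data_rate_gbps / 1e3


EXCURSION_PS = 16.0      # maximum compensated excursion from the text
TOLERANCE_UI = 0.4       # high-frequency jitter tolerance reported for [1]

ui_fraction = excursion_in_ui(EXCURSION_PS, 12.5)
print(f"16 ps at 12.5 Gb/s = {ui_fraction:.2f} UI "
      f"(tolerance {TOLERANCE_UI} UI)")
```

The same excursion would correspond to a larger UI fraction at 25 Gb/s, which is why the reconfiguration is performed while the link is still running at 12.5 Gb/s.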
The results of the closed-loop simulation are presented in Fig. 7, with the open-loop VCO results also plotted. By integrating the noise power obtained from the closed-loop phase-noise plots, the random jitter was found to be 161.1 fs rms, 123.8 fs rms and 107.2 fs rms for one, two and three sub-VCOs, respectively. This corresponds to jitter reductions by factors of 1.30 and 1.50 for two and three sub-VCOs, respectively, relative to a single sub-VCO. These ratios are somewhat lower than the theoretical values of √2 ≈ 1.41 and √3 ≈ 1.73 because other noise sources in the PLL become more prominent when the VCO's contribution is reduced; in particular, the in-band non-VCO noise is non-negligible. Out of band, the reduction in phase noise was 4.2 dB, close to the theoretical value. In band, the reduction in phase noise was only 3.1 dB.
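The reported jitter values can be compared against the ideal √n scaling with the sketch below. The jitter figures are those quoted in the text; the ideal ratios assume VCO-dominated noise.

```python
import math

# Simulated closed-loop rms jitter in femtoseconds, from the text.
JITTER_FS = {1: 161.1, 2: 123.8, 3: 107.2}


def measured_ratio(n: int) -> float:
    """Jitter reduction factor relative to a single sub-VCO."""
    return JITTER_FS[1] / JITTER_FS[n]


for n in (2, 3):
    print(f"n = {n}: simulated ratio = {measured_ratio(n):.2f}, "
          f"ideal sqrt(n) = {math.sqrt(n):.2f}")
```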
Power dissipation is shown in Table I, where the average powers for two situations are listed. As can be seen, a power savings of approximately 57% is observed when reducing from three active sub-VCOs to one. Further power savings could come from turning off buffers, similar to the implementation in [2] where samples are discarded.
IV. CONCLUSION
A multi-VCO architecture for use in a multi-phase PLL has been designed and simulated in TSMC 65 nm technology. The results show that jitter and power can be traded off, making the PLL suitable as a fixed-frequency, multi-phase clock source for a variable-rate CDR. A 57% decrease in power dissipation is accompanied by a 1.5× increase in rms output jitter. A novel alignment scheme was introduced, along with a compensation scheme that keeps phase excursions during VCO reconfiguration to approximately 5% of the clock period.
This architecture may also find use in post-fabrication optimization of systems, where jitter specifications can be reached by calibration to account for mismatch or process variation.
