Abstract-A fully integrated, phase-locked loop (PLL) clock generator/phase aligner for the POWER3 microprocessor has been designed using a 2.5-V, 0.40-µm digital CMOS6S process.
II. INTRODUCTION
This paper describes a fully integrated PLL-based clock generator/phase aligner used for the POWER3 microprocessor. The microprocessor is fabricated in IBM CMOS6S technology and contains approximately 12 million transistors. With the microprocessor actively executing instructions, this PLL achieved cycle-cycle jitter of 10.0 ps rms, 80 ps P-P in its application environment and 8.4 ps rms, 62 ps P-P with the microprocessor in a reset state with a portion of the clock tree active.
A simplified block diagram of the PLL clock generator is shown in Fig. 1. The external reference, BUSCLK, enters a receiver and is divided by two by the input divider stage before entering the phase/frequency detector (PFD) as the reference input.
The internal feedback signal from the feedback divider is compared to the reference input by the PFD, which generates an error signal that is used by the charge pump and filter network to control the voltage-controlled oscillator (VCO). The output frequency of the VCO is divided down and used as the main processor clock (PCLK) after passing through four levels of clock buffering in an H-tree clock distribution network. The processor clock is passed through a delay-matching receiver before entering the feedback divider, completing the feedback path. Since at equilibrium the inputs of the PFD will be matched in frequency (and phase), the processor-to-bus frequency ratio is equal to the ratio of the feedback-path division to the reference-path division, allowing integer or noninteger frequency synthesis by changing divider ratios. Since the technique does not require clock choppers [2], the duty cycle and phase alignment are relatively insensitive to environment and process tolerances. The output of the VCO is also connected to a second frequency divider, which is used for the L2 cache clock (L2CLK). Since both clocks are derived from the same VCO through separate dividers, the processor-to-L2 clock-frequency ratio is also adjustable to integer or noninteger ratios. Other phase-synchronous clocks may be designed in similar fashion, and quadrature or interstitial clocks may be created by a polarity change at the divider input. Using the structure of Fig. 1, the VCO frequency is equal to the processor-clock divider ratio times the processor clock frequency. For cases when this divider ratio is even, the processor clock edges are generated from only one VCO clock edge; hence a nearly ideal 50% processor clock duty cycle may be achieved through its independence from the VCO duty cycle.
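The frequency relationships at lock can be illustrated with a short numerical sketch. The argument names below (ref_div, fb_div, pclk_div, l2_div) are hypothetical labels rather than the symbols of Fig. 1, and the divider settings in the example are assumed values chosen only because they reproduce the 85/170/340-MHz operating point reported in the measurements.

```python
from fractions import Fraction


def pll_clocks(busclk_mhz, ref_div, fb_div, pclk_div, l2_div):
    """Locked-state clock frequencies (MHz) for a divider-based PLL.

    All argument names are hypothetical labels, not symbols from Fig. 1.
    """
    # At lock the two PFD inputs have equal frequency:
    #   busclk / ref_div == pclk / fb_div
    pclk = Fraction(busclk_mhz) * fb_div / ref_div
    # PCLK is the VCO output divided by pclk_div, so the VCO runs faster.
    vco = pclk * pclk_div
    # L2CLK is derived from the same VCO through its own divider.
    l2clk = vco / l2_div
    return {"PCLK": pclk, "VCO": vco, "L2CLK": l2clk}


# Assumed settings: an 85-MHz bus clock with a 2:1 processor-to-bus ratio.
clocks = pll_clocks(busclk_mhz=85, ref_div=2, fb_div=4, pclk_div=2, l2_div=3)
```

With these assumed settings the processor-to-L2 ratio is 3:2, a noninteger ratio obtained purely from the divider choices.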
III. PROCESS TECHNOLOGY
The microprocessor and integral clock generator PLL are fabricated in a five-layer CMOS process with a 0.4-µm feature size. The PLL clock generator is shown in the microprocessor die photograph of Fig. 2(a). The dimensions of the entire PLL are 1040 µm × 640 µm. It is shown with the major features identified in Fig. 2(b).
0018-9200/99$10.00 © 1999 IEEE
IV. PLL CLOCK GENERATOR COMPONENTS

A. Phase/Frequency Detector
The digital PFD generates a signal that conveys relative phase and frequency error information about its inputs to the charge pump and filter. The PFD design is based on a three-state machine structure [8], as depicted in Fig. 3(a). From the initial reset state, a rising edge on one input will assert the UP output until the rising edge of the other input appears, which deasserts UP and forces a reset of both flip-flops [Fig. 3(b)].
A rising edge first appearing on the second input similarly asserts DOWN until a rising edge arrives at the first input, followed by a subsequent reset. Complementary outputs are generated by the PFD for use in the differential charge-pump stage that follows the PFD. The pulse width of the output varies proportionally with the phase error between the two inputs, except for the dead-zone region as the difference approaches zero. This dead zone exists when the phase error becomes small relative to the combined response time of the PFD, charge pump, and filter circuits. Circuit simulation results show a nominal dead zone of 25 ps. Concerns of current mismatch in the charge-pump and filter networks are reduced at the expense of increased dead zone by preventing simultaneous assertions of UP and DOWN.
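The three-state behavior can be sketched as an idealized, event-driven model. This is an illustration only: it assumes that the reference input is the one that asserts UP, it uses an instantaneous reset, and it has no dead zone, so it does not reproduce the 25-ps dead zone of the actual circuit. The function name and the edge-time lists are hypothetical.

```python
def pfd_pulses(ref_edges, fb_edges):
    """Idealized three-state PFD.

    ref_edges, fb_edges: rising-edge times (e.g., in ps) of the two inputs.
    Returns a list of (output_name, start_time, width) pulses.
    Assumption: an edge on the reference input asserts UP.
    """
    # Tag each edge with +1 (reference) or -1 (feedback) and merge by time.
    events = sorted([(t, +1) for t in ref_edges] + [(t, -1) for t in fb_edges])
    state = 0      # +1: UP asserted, -1: DOWN asserted, 0: reset state
    start = None   # time at which the current output was asserted
    pulses = []
    for t, direction in events:
        if state == 0:
            # First edge after reset asserts the matching output.
            state, start = direction, t
        elif direction == -state:
            # Edge on the other input deasserts the output and resets.
            if t > start:
                name = "UP" if state > 0 else "DOWN"
                pulses.append((name, start, t - start))
            state, start = 0, None
        # A repeated edge on the same input leaves the output asserted.
    return pulses


# Reference leads the feedback by 40 ps on every cycle of a 1000-ps clock.
leading = pfd_pulses([0, 1000, 2000], [40, 1040, 2040])
```

In this model the pulse width equals the phase error exactly, which is the proportional region described above.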
B. Power-Supply Isolation
A separate analog power connection (AVDD) is used for the analog circuits [current reference, charge pump, common-mode rejection (CMR), filter initialization, and VCO circuits] to increase the isolation of the sensitive circuits from the logic-induced switching noise present on the main power supply. To allow the detection of potential defects using conventional testing, the AVDD pin is held low, disabling the analog devices that normally draw dc current. Both on-chip and on-module decoupling are used on AVDD.
C. Reference Circuit
A thermal voltage-referenced current source is used to provide temperature- and supply-independent biasing for the analog circuits in the PLL. The circuit contains an array of P diffusions in the N-well connected to form two forward-biased diodes with areas that differ by a factor of ten. When connected as shown in Fig. 4, the current through each leg has two stable operating points: zero current, or a nonzero current set by the thermal voltage and the resistor.
The startup circuit prevents the zero-current state from occurring by injecting current into one leg during initial power-on. The resistor is implemented using the precision resistor available in the process, which has a temperature coefficient (TC) of 2000 ppm/°C. The positive TC's of the thermal voltage term and the resistor tend to cancel, providing a reference current TC of 785 ppm/°C at 85 °C. The reference current is used for subsequent generation of reference currents and the PMOS bias voltage through mirroring. Sensitivity to power-supply change is 1.7%/V for a 20% change on VDD.
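The partial cancellation of the two temperature coefficients can be checked with a first-order estimate. The sketch below assumes the usual thermal-voltage relation I = (kT/q) ln(N) / R for a diode area ratio N, which is a standard textbook form and is not spelled out in this text; the 1-kΩ resistor value is purely hypothetical. Under that assumption the current TC is 1/T minus the resistor TC, which gives about 792 ppm/°C at 85 °C, close to the reported 785 ppm/°C.

```python
import math

K_OVER_Q = 8.617333e-5  # Boltzmann constant over electron charge, V/K


def ptat_current(temp_c, r_ohms, area_ratio=10.0):
    """Reference current (A), assuming I = (kT/q) * ln(area_ratio) / R."""
    t_kelvin = temp_c + 273.15
    return K_OVER_Q * t_kelvin * math.log(area_ratio) / r_ohms


def current_tc_ppm(temp_c, resistor_tc_ppm=2000.0):
    """First-order TC of the current (ppm/degC): 1/T minus the resistor TC."""
    t_kelvin = temp_c + 273.15
    return 1e6 / t_kelvin - resistor_tc_ppm


tc_at_85 = current_tc_ppm(85.0)              # roughly 792 ppm/degC
i_example = ptat_current(85.0, r_ohms=1e3)   # hypothetical 1-kOhm resistor
```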
D. Process and Temperature Compensation
Process-induced variations are monitored using the circuit shown in Fig. 5(a). All of the current sources are generated directly from the reference circuit current. A constant current is passed through a branch containing short-channel NMOS devices, creating a monitoring voltage that is sensitive to NMOS device length variations. This voltage is compared to a reference voltage generated by a constant current through a long-channel NMOS device that is relatively insensitive to length variations. The devices and bias currents used for length sensing are sized so that the monitoring and reference voltages are equivalent for a nominal process. To minimize temperature sensitivity, the bias currents correspond to the zero-temperature coefficient (0-TC) region of the devices. The two voltages are compared using a differential amplifier, which generates a current proportional to the NMOS offset from nominal. This current is mirrored to produce a current that is injected into a precision resistor used for combining various process monitors to generate a compensating reference voltage. The compensating reference voltage is connected to the active load elements of the VCO, which control the VCO's voltage swing. A current generated from a similar PMOS circuit is also injected into the resistor.
Weighted combinations of standard bias circuits with differing voltage and temperature coefficients have been used previously to compensate reference circuits for VCO's [9]. In this case, however, temperature was monitored directly by comparing the voltage of two series-connected devices biased by current below their 0-TC operating point to the voltage of two parallel devices biased by current significantly above their effective 0-TC point [Fig. 5(b)]. The devices and bias currents are sized so that both branches of the differential amplifier are balanced for nominal temperature conditions. The inset shows the I-V characteristics as a function of temperature for the series (subscript 2) and parallel (subscript 1) connected devices; the 0-TC points correspond to the crossing point where the current is invariant with temperature. The current in one leg of the differential amplifier varies proportionally with temperature and is mirrored and added to the summing junction of the resistor. A constant bias current is also added to the summing junction to establish the correct weighting of the various compensating currents and to correct for the TC of the summing junction resistor.
Using a statistical process model, the process compensation was designed to favor the stabilization of the "best case" side of the distribution over the "worst case" side in anticipation of future process trends. Given the limited range over which a circuit may be practically compensated, the performance for the "best case" devices was not sacrificed for the sake of extensive compensation of the poorest performing devices. For the unsorted population, this approach allowed a reduction in the sensitivity of the VCO to process variability by a factor of 3.6 (from 55.4% to 15.2%) over the uncompensated VCO; temperature sensitivity was reduced by a factor of 4.7 (from 38.6% to 8.2%).
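The quoted reduction factors follow directly from the before and after sensitivities. A minimal arithmetic check, using only the percentages stated above (the function name is illustrative):

```python
def reduction_factor(before_pct, after_pct):
    """Ratio of uncompensated to compensated sensitivity."""
    return before_pct / after_pct


process_factor = reduction_factor(55.4, 15.2)      # process variability
temperature_factor = reduction_factor(38.6, 8.2)   # temperature sensitivity
```

Rounded to one decimal place, these evaluate to 3.6 and 4.7, respectively.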
E. Charge Pump
The reference circuit is used to generate the mirror currents used within the charge pump. The peak charge-pump current may be adjusted in 30-µA increments from 30 to 240 µA by scaling the mirror currents as shown in Fig. 6. The error signals generated by the PFD are used to switch the peak current selected. Adjusting the charge pump allows for optimization of the loop characteristics for different divider and VCO settings. Differential outputs (P and its complement) are included for high CMR in the subsequent analog circuits.
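The available peak-current settings, and the average current they deliver for a given phase error, can be sketched as follows. The 3-bit setting code is a hypothetical encoding (the text states only the range and step), and the average-current expression is the standard idealization for a charge pump driven by a three-state PFD, which ignores the dead zone and any current mismatch.

```python
import math

STEP_UA = 30  # increment stated in the text, in microamperes


def peak_current_ua(code):
    """Peak charge-pump current for a hypothetical setting code 0..7."""
    if not 0 <= code <= 7:
        raise ValueError("code must be in the range 0..7")
    return STEP_UA * (code + 1)


def average_current_ua(peak_ua, phase_error_rad):
    """Idealized average output current: peak current times the duty
    fraction phase_error / (2*pi)."""
    return peak_ua * phase_error_rad / (2 * math.pi)


settings = [peak_current_ua(code) for code in range(8)]
half_cycle = average_current_ua(90, math.pi)  # 90-uA setting, pi rad error
```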
F. Loop Filter
The differential loop filter and initialization circuits are shown in Fig. 7. Currents to and from the charge-pump circuit enter the filter at node P and its complement. The input to the filter contains NMOS transmission-gate clamping devices to limit the maximum filter voltage to a level set by the NMOS threshold voltage for a large source-bulk voltage. The filter output nodes are connected to the VCO control input. An initialization circuit activated during the initial system power-on-reset is used to precharge the filter capacitors to the nominal common-mode voltages at these nodes.
G. Common-Mode Control
It is possible for common-mode voltages to develop in the filter from leakage, drift, or device mismatch. Since the common-mode voltage can introduce frequency offsets in the VCO or even inhibit operation for extreme cases, the circuit shown in Fig. 8 was used in conjunction with the filter clamps described earlier. The common-mode voltage of the filter is sensed by generating currents proportional to the two filter node voltages and summing them across a load device to produce a sense voltage. A differential amplifier compares the sense voltage to a reference voltage and generates a current that is proportional to the common-mode voltage. The current is mirrored by two identical current sources, which bleed current from both filter capacitors simultaneously without affecting the differential voltage between them. The maximum drain currents for this structure, which correspond to the case when both clamps have activated, are approximately 16 µA. For typical cases where the common-mode voltage is below 600 mV, the bleed currents are 1 µA. Stability of the network is assured by heavy dominant-pole compensation.
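The reason identical bleed currents leave the differential control voltage untouched can be seen in a simple charge-balance sketch. It assumes two equal filter capacitors and ideal, matched current sources; the capacitor value, node voltages, and time step are hypothetical.

```python
def bleed(v_p, v_n, i_bleed_ua, dt_us, c_pf):
    """Remove the same charge from both filter capacitors.

    Units: uA * us / pF = V, so dv is in volts.
    Assumes equal capacitors and perfectly matched bleed currents.
    """
    dv = i_bleed_ua * dt_us / c_pf
    return v_p - dv, v_n - dv


# Hypothetical starting point: 1.0-V common mode, 0.4-V differential.
v_p0, v_n0 = 1.2, 0.8
v_p1, v_n1 = bleed(v_p0, v_n0, i_bleed_ua=1.0, dt_us=1.0, c_pf=100.0)

common_before = (v_p0 + v_n0) / 2
common_after = (v_p1 + v_n1) / 2
differential_after = v_p1 - v_n1
```

In this idealization the common-mode voltage drops by 10 mV while the differential voltage remains 0.4 V; any mismatch between the two current sources or capacitors would appear as a differential error.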
H. Voltage-Controlled Oscillator
The VCO design is based upon a delay-interpolating ring oscillator structure [9]-[11], as shown in Fig. 9. In contrast to the current-starved and current-modulated VCO's, which are very commonly used for microprocessor clock generators, delay-interpolating VCO's have relatively low-to-moderate VCO gains and are well suited to fully differential control and signal path circuit implementations. The lower VCO gain of the delay-interpolating VCO's produces significantly less jitter due to coupled noise than higher gain structures. The limited operating frequency range for delay-interpolating VCO's, which must be less than 2:1 to ensure monotonicity, may be effectively augmented by selecting suitable divider ratios or by adding programmability to the VCO signal paths.
The frequency limits of the VCO are determined by the longest and shortest path delays through the structure. Fig. 9 shows an example in which the high-frequency limit has a period set by a path of three delay units and one mixer unit, and the low-frequency limit has a period set by a path of six delay units and one mixer unit. These frequency limits also affect the VCO gain (for a given mixer design) as well as the center frequency. The frequency limits may be independently controlled using the multiplexers shown in Fig. 9, allowing flexible control of the VCO operating range and greater than ten-to-one adjustment range for VCO gain.
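The dependence of the frequency limits on the two path lengths can be sketched numerically. The model below assumes that the oscillation period is twice the delay around the loop (the usual relation for an inverting ring) and that the mixer blends the fast and slow path delays linearly; both are simplifying assumptions, and the 200-ps delay-unit and mixer delays are hypothetical values, not data from this design.

```python
def ring_frequency_mhz(path_delay_ps):
    """Frequency of a ring with the given loop delay.

    Assumption: one traversal of an inverting ring is half a period.
    """
    period_ps = 2 * path_delay_ps
    return 1e6 / period_ps


def vco_limits_mhz(t_delay_ps, t_mixer_ps, fast_units=3, slow_units=6):
    """Return (f_low, f_high) for the slow and fast paths of the example."""
    f_high = ring_frequency_mhz(fast_units * t_delay_ps + t_mixer_ps)
    f_low = ring_frequency_mhz(slow_units * t_delay_ps + t_mixer_ps)
    return f_low, f_high


def interpolated_frequency_mhz(weight, t_delay_ps, t_mixer_ps,
                               fast_units=3, slow_units=6):
    """Frequency when the mixer weights the fast path by `weight` (0..1).

    Assumption: the effective delay is a linear blend of the two paths.
    """
    fast = fast_units * t_delay_ps + t_mixer_ps
    slow = slow_units * t_delay_ps + t_mixer_ps
    return ring_frequency_mhz(weight * fast + (1 - weight) * slow)


f_low, f_high = vco_limits_mhz(t_delay_ps=200, t_mixer_ps=200)
f_mid = interpolated_frequency_mhz(0.5, t_delay_ps=200, t_mixer_ps=200)
```

Under this model the range ratio is (6 t_d + t_m) / (3 t_d + t_m), which stays below 2:1 for any nonzero mixer delay, in line with the monotonicity constraint noted above.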
The delay elements and mixer designs are based upon PMOS source-coupled pair differential amplifiers with NMOS load networks [Fig. 10(a) and (b)], which allow voltage-controlled swing adjustment through effective load-line translation by adjusting the load control voltage. The high impedance provided by the current source improves the supply noise rejection for the source-coupled pair, and the N-well improves the isolation from p-type bulk substrate noise. The variation of the threshold voltage due to bulk effect is eliminated using bulk-to-source biasing throughout the structure. Sensitivity of the VCO to low-repetition-rate, 100-mV steps on VDD and AVDD is 0.418 ps/mV. Center-frequency common-mode voltage sensitivity is 3.5% over the full input range dictated by the common-mode control circuit. Nominal VCO gain for the settings that produce the maximum VCO range is 185 MHz/V. The worst case VCO power dissipation is 30 mW.
I. Dividers and Receivers
The dividers (Fig. 1) may be individually programmed and support division by 2, 3, 4, 5, 6, 8, or 10. The dividers are placed in pairs within the layout to improve device matching between the dividers of each pair. The receivers shown in Fig. 1 are also placed together and are located near the I/O pad for BUSCLK.
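The set of frequency ratios available from a pair of such dividers can be enumerated directly. The sketch below assumes that any two dividers may be set independently to any of the listed values and ignores the VCO frequency range, which in practice limits the usable combinations; the names are illustrative.

```python
from fractions import Fraction

DIVIDER_SETTINGS = (2, 3, 4, 5, 6, 8, 10)  # division values listed in the text


def achievable_ratios(settings=DIVIDER_SETTINGS):
    """All distinct ratios n/d obtainable from two independent dividers."""
    return sorted({Fraction(n, d) for n in settings for d in settings})


ratios = achievable_ratios()
noninteger = [r for r in ratios if r.denominator != 1]
```

Ratios such as 3/2 and 5/2 appear in the noninteger list, which illustrates how noninteger clock ratios arise from integer divider settings.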
V. PLL MEASUREMENTS
The damping factor, loop gain, and natural frequency of the PLL may be adjusted over a wide range to match the application by changing the charge-pump and VCO gain as described above. System testing was conducted with 90-µA peak charge-pump current using the maximum frequency and range on the VCO with a variety of divider settings and BUSCLK frequencies. The processor clock was accessed from the clock tree through a series of inverters. A time-interval measurement (TIM) system was used to measure cycle-cycle period jitter statistics for a number of packaged die representing various process skews. The processor was operated using an array initialization program loop with the fixed-point and floating-point processors active for the "active" processor tests, and was also operated in a "quiet" mode reset state. All tests were performed at room temperature with ambient forced-air cooling. Conventional first-cycle oscilloscope-based jitter measurements were performed periodically and provided P-P jitter results that were consistent with those measured on the TIM system. The external clock was provided by a high-frequency pulse generator, with 7.3 ps rms, 36 ps P-P jitter.
Fig. 11(a) shows a histogram of cycle-cycle period measurements taken with the processor in an inactive reset state but with the clock tree active. The frequencies of the reference clock, processor clock, and VCO are 85, 170, and 340 MHz, respectively, which corresponds to a 3-dB loop bandwidth of 2 MHz. The distribution of samples in the histogram follows a Gaussian distribution with period jitter of 8.4 ps rms, 62 ps P-P. The minimum period measured for this sample size was 26.2 ps less than the mean (3.1 sigma away). Assuming that cycle-time failures only occur on the minimum period side, the worst case clock jitter penalty for this system (i.e., a "quiet" processor) is 26.2 ps at 3.1 sigma confidence (or 25.2 ps penalty at 3.0 sigma).
Since a peak-to-peak jitter approximately equal to the PFD dead zone can exist for the PLL, the 25-ps simulated value for the dead zone may be a significant component of the measured jitter. Fig. 11(b) shows a clock-jitter histogram for the processor executing the array initialization routine for a large population. A Gaussian curve has been superimposed on the histogram for comparison purposes. The frequencies of the reference clock, processor clock, and VCO are 90, 180, and 360 MHz, respectively. For this system (i.e., an "active" processor), the period jitter has increased to 10.0 ps rms, 80 ps P-P, and the worst case clock-jitter penalty is 37.1 ps at 3.7 sigma confidence (or 30.1 ps at 3.0 sigma). The effective noise penalty for running the array initialization routine is 4.9 ps at 3.0 sigma.
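The sigma distances and penalties quoted for the two histograms can be reproduced from the reported rms values. The sketch below uses only numbers stated in the text; the function names are illustrative. Note that 3.0 times the rounded 10.0-ps rms value gives 30.0 ps rather than the reported 30.1 ps, presumably because the reported figure was computed from an unrounded rms value, so the reported 30.1 ps is used as given.

```python
def sigma_distance(deviation_ps, rms_ps):
    """How many standard deviations a period deviation represents."""
    return deviation_ps / rms_ps


def penalty_at(n_sigma, rms_ps):
    """Jitter penalty (ps) at a chosen sigma level."""
    return n_sigma * rms_ps


quiet_sigma = sigma_distance(26.2, 8.4)      # quiet processor, Fig. 11(a)
active_sigma = sigma_distance(37.1, 10.0)    # active processor, Fig. 11(b)
quiet_penalty = penalty_at(3.0, 8.4)         # quiet penalty at 3.0 sigma
noise_penalty = 30.1 - quiet_penalty         # reported active minus quiet
```

Rounded to one decimal place, these give 3.1 sigma, 3.7 sigma, 25.2 ps, and 4.9 ps, matching the values quoted above.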
VI. CONCLUSION
This work demonstrates the viability of a low-jitter PLL design approach amenable to high-speed microprocessors. Measured jitter for the design was 8.4 ps rms, 62 ps P-P for quiet conditions and 10.0 ps rms, 80 ps P-P for the processor active. A tunable, moderate-gain VCO with active process and temperature compensation provides high power-supply rejection and low sensitivity to temperature and process variability. A differential design approach maintains noise immunity in both control and signal paths within the analog portions of the PLL.
ACKNOWLEDGMENT
The author wishes to thank J. Peter for layout of the PLL, N. James and H. Casal for the hardware characterization and divider implementation, R. Kodali for circuit simulation and specification, D. Woeste and J. Strom for the divider and lock detector circuits, and S. Dhong and M. Papermaster for their continuous support of this work.
