### Architectures and Circuits Leveraging Injection-Locked Oscillators for Ultra-Low Voltage Clock Synthesis and Reference-less Receivers for Dense Chip-to-Chip Communications

Gautam R. Gangasani

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY 2018

© 2018 Gautam R. Gangasani All rights reserved

#### Abstract

High performance computing is critical for the needs of scientific discovery and economic competitiveness. An extreme-scale computing system at 1000x the performance of today's petaflop machines will exhibit massive parallelism on multiple vertical fronts, from thousands of computational units on a single processor to thousands of processors in a single data center. A key challenge in enabling such massively parallel extreme-scale computing is power. The challenge is not the power associated with base computation but rather the problem of transporting data from one chip to another at high enough rates. This thesis presents architectures and techniques that achieve a low power and area footprint while sustaining high data rates in a dense very-short reach (VSR) chip-to-chip (C2C) communication network.

High-speed serial communication operating at ultra-low supplies improves the energy efficiency and lowers the power envelope of a system operating at exaflop scale. One focus area of this thesis is clock synthesis for such energy-efficient interconnect applications operating at high speeds and ultra-low supplies. A sub-integer clock-frequency synthesizer is presented that incorporates a multi-phase injection-locked ring-oscillator-based prescaler for operation at an ultra-low supply voltage of 0.5V, phase-switching based programmable division for sub-integer clock-frequency synthesis, and automatic calibration to ensure injection lock. A record speed of 9GHz has been demonstrated at 0.5V in 45nm SOI CMOS. It consumes 3.5mW of power at 9.12GHz and $0.05\,\mathrm{mm}^2$ of area, while showing an output phase noise of -100 dBc/Hz at 1MHz offset and RMS jitter of 325fs; it achieves a net $FOM_A$ of -186.5.

This thesis also describes a receiver with a reference-less clocking architecture for high-density VSR-C2C links. This architecture simplifies clock-tree planning in dense extreme-scale computing environments and provides a high-bandwidth CDR that enables SSC for suppressing EMI and relaxes TX jitter requirements. It features a clock-less DFE and a high-bandwidth CDR based on master-slave ILOs for phase generation and rotation. The RX is implemented in 14nm CMOS and characterized at 19Gb/s. It is 1.5x faster than previous reference-less embedded-oscillator based designs with greater than 100MHz jitter tolerance bandwidth and recovers error-free data over VSR-C2C channels. It achieves a power-efficiency of 2.9pJ/b while recovering error-free data (BER $< 10^{-12}$) across a 15dB loss channel. The jitter tolerance BW of the receiver is >200MHz and the INL of the ILO-based phase-rotator (32 steps/UI) is <1 LSB.
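To put the receiver figures above in more familiar units, the short sketch below converts them into total power, unit interval and phase-rotator step size. It uses only the numbers quoted in this abstract; it assumes that the 2.9pJ/b efficiency was measured at the 19Gb/s data rate, which the abstract does not state explicitly, and the function name is illustrative.

```python
# Figures quoted in the abstract.
DATA_RATE_BPS = 19e9          # 19 Gb/s
ENERGY_PER_BIT_J = 2.9e-12    # 2.9 pJ/b
STEPS_PER_UI = 32             # phase-rotator resolution


def link_budget(data_rate_bps, energy_per_bit_j, steps_per_ui):
    """Return (power in W, unit interval in s, phase-rotator LSB in s)."""
    # ASSUMPTION: the energy-per-bit figure applies at this data rate.
    power_w = energy_per_bit_j * data_rate_bps
    ui_s = 1.0 / data_rate_bps
    lsb_s = ui_s / steps_per_ui
    return power_w, ui_s, lsb_s


power_w, ui_s, lsb_s = link_budget(DATA_RATE_BPS, ENERGY_PER_BIT_J, STEPS_PER_UI)
print(f"RX power      : {power_w * 1e3:.1f} mW")
print(f"Unit interval : {ui_s * 1e12:.2f} ps")
print(f"Rotator LSB   : {lsb_s * 1e12:.2f} ps")
```

Under that assumption the receiver dissipates roughly 55mW, one UI is about 52.6ps, and an INL below 1 LSB corresponds to a phase-rotator error below about 1.6ps.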

Lastly, this thesis develops a time-domain delay-based model of injection locking to describe injection-locking phenomena in nonharmonic oscillators. The model is used to predict the locking bandwidth and the locking dynamics of the locked oscillator. The model predictions are verified against simulations and measurements of a four-stage differential ring oscillator. The model is further used to predict the injection-locking behavior of a single-ended CMOS inverter based ring oscillator, the lock range of a multi-phase injection-locked ring-oscillator-based prescaler, and the dynamics of tracking injection phase perturbations in injection-locked master-slave oscillators, demonstrating its versatility in application to any nonharmonic oscillator.
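The delay-based picture can be illustrated with numbers taken from the figure captions listed later in this front matter. The caption of Fig. 2-10 states that, when locked, the last three stages of the four-stage ring have a delay $t_d$ and the first stage has a delay $t_d + d$; since one period is two trips around the loop, this gives $f = 1/(2(4t_d + d))$. The sketch below is a rough consistency check, not the derivation used in the thesis: it takes the stage-1 delays quoted in the caption of Fig. 2-11 and assumes that stages 2 to 4 keep the mid-lock delay of 36.1 ns at both edges of the lock range, which the caption does not state. The function name is hypothetical.

```python
def locked_frequency_hz(stage_delay_s, extra_delay_s, n_stages=4):
    """Frequency of an N-stage ring whose first stage has an extra delay d.

    One oscillation period is two trips around the loop.
    """
    loop_delay_s = n_stages * stage_delay_s + extra_delay_s
    return 1.0 / (2.0 * loop_delay_s)


# Stage-1 delays quoted in the caption of Fig. 2-11 (alpha = 10).
T_D_MID = 36.1e-9      # middle of the lock range (quoted at 3.49 MHz)
T_D1_UPPER = 31.4e-9   # upper edge of the lock range (quoted at 3.61 MHz)
T_D1_LOWER = 39.1e-9   # lower edge of the lock range (quoted at 3.37 MHz)

# ASSUMPTION: stages 2-4 keep the mid-lock delay at both lock-range edges,
# and the extra delay d is zero in the middle of the lock range.
d_upper = T_D1_UPPER - T_D_MID   # negative: injection speeds up stage 1
d_lower = T_D1_LOWER - T_D_MID   # positive: injection slows down stage 1

f_mid = locked_frequency_hz(T_D_MID, 0.0)
f_upper = locked_frequency_hz(T_D_MID, d_upper)
f_lower = locked_frequency_hz(T_D_MID, d_lower)

for label, f_est, f_quoted in [("upper edge", f_upper, 3.61e6),
                               ("mid lock", f_mid, 3.49e6),
                               ("lower edge", f_lower, 3.37e6)]:
    error_pct = 100.0 * (f_est - f_quoted) / f_quoted
    print(f"{label:10s}: estimated {f_est / 1e6:.3f} MHz, "
          f"quoted {f_quoted / 1e6:.2f} MHz ({error_pct:+.2f} %)")
```

With these assumptions, the three estimated frequencies fall within about 1% of the frequencies quoted in the caption, which is consistent with the extra first-stage delay $d$ setting the locked frequency.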

## Contents

| Section | Title |
|---------|-------|
|         | **List of Figures** |
|         | **List of Tables** |
| **1**   | **Introduction** |
| 1.1     | Motivation and focus area |
| 1.2     | Thesis highlights |
| 1.2.1   | Singular performance boost leveraging ILOs |
| 1.2.2   | Time-delay based model for nonharmonic ILOs |
| 1.2.3   | A 0.5V, 9GHz Sub-Integer Clock-Frequency Synthesizer using Multi-Phase Injection-Locked Prescaler |
| 1.2.4   | RX for VSR C2C links with Clock-less DFE and high bandwidth CDR |
| 1.3     | Thesis Organization |
| **2**   | **Time-Domain Model for Injection Locking in Nonharmonic Oscillators** |
| 2.1     | Introduction |
| 2.2     | Models for Injection Locking |
| 2.3     | Quasi-Linear Model For Injection Locking in Differential Ring Oscillators |
| 2.4     | Time-Domain Model For Injection Locking in Differential Ring Oscillators |
| 2.4.1   | Analytical Expressions for the Oscillator Waveforms |
| 2.4.2   | Effect of an Injection Signal |
| 2.4.3   | Injection Locking Range |
| 2.4.4   | Injection Locking Dynamics |
| 2.5     | Time-Domain Model For Injection Locking in Single-Ended Inverter Based Ring Oscillator |
| 2.5.1   | $d$ vs. $\Delta$ Relationship |
| 2.5.2   | Injection Locking Range |
| 2.5.3   | Injection Locking Dynamics |
| 2.6     | Summary |
| **3**   | **A 9GHz Sub-Integer Clock-Frequency Synthesizer at Ultra-Low Supply** |
| 3.1     | Introduction |
| 3.2     | Architecture and circuit description |
| 3.2.1   | PFD, CP, and VCO |
| 3.2.2   | ILRO based Prescaler |
| 3.2.3   | Phase-Switching Programmable Divider |
| 3.2.4   | Automatic Injection-Lock Calibration |
| 3.3     | Experimental Results |
| 3.4     | Summary |
| **4**   | **A 19Gb/s Receiver for Chip-to-Chip Links with Clock-Less DFE and High-BW CDR based on Master-Slave ILOs** |
| 4.1     | Introduction |
| 4.2     | System-level Considerations |
| 4.2.1   | Channel Equalization |
| 4.2.2   | Receiver Architecture |
| 4.3     | Circuit Blocks and Descriptions |
| 4.3.1   | CTLE |
| 4.3.2   | Data Edge-Detection and Injection |
| 4.3.3   | Reference-less frequency acquisition |
| 4.3.4   | Resistively-Interpolated MILO-SILO based Phase-Rotation |
| 4.3.5   | Clock-less DFE |
| 4.3.6   | Jitter-tolerance BW using $d$ vs. $\Delta$ based time-delay model |
| 4.4     | Measurements |
| 4.4.1   | Experimental Setups |
| 4.4.2   | MILO-SILO, Phase Rotation and Recovered Clock |
| 4.4.3   | Receiver Performance |
| 4.4.4   | Performance summary and comparison |
| 4.5     | Summary |
| **5**   | **Conclusion** |
| 5.1     | Future Research |
|         | **Bibliography** |

## List of Figures

| 1-1 | Exascale performance needs to rely on massive parallelism                                                                                                                                                                                                                                      | 2  |
|-----|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|
| 1-2 | A strawman architecture for a massively-parallel exascale processor<br>running a billion parallel threads. Reprinted from [4]. $\ldots$ $\ldots$ $\ldots$                                                                                                                                      | 3  |
| 1-3 | Prior-art of various applications leveraging ILOs.                                                                                                                                                                                                                                             | 5  |
| 1-4 | Singular performance boost, such as highest reported clock-frequency<br>synthesizer speed at ultra-low supply of 0.5V and highest reported<br>chip-to-chip operation for links with $> 100MHz$ CDR bandwidth, is<br>reported when leveraging unique features of ILOs                           | 6  |
| 1-5 | Block diagram of the ultra-low supply clock-frequency synthesizer                                                                                                                                                                                                                              | 8  |
| 1-6 | Quarter-rate RX architecture for very short-reach chip-to-chip links<br>with clock-less DFE and high-bandwidth CDR based on Master-Slave<br>injection-locked oscillators.                                                                                                                      | 9  |
| 2-1 | (a) Frequency domain model for injection locking of resonator based oscillators; (b) Resonator amplitude and phase characteristic; the amplifier A is assumed to have a unity frequency response; (c) phasor diagram at $\omega_{INJ}$ for the signals in the locked oscillator when in steady |    |
|     | state                                                                                                                                                                                                                                                                                          | 13 |

| 2-2 | (a) Delay based, time-domain model for injection locking in non-harmonic                                                                                                 |    |
|-----|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|
|     | oscillators; the delay element, $D$ , has a delay $I_d$ whereas the inverter<br>is assumed ideal with zero delay: (b) the free running frequency of es                   |    |
|     | s assumed ideal with zero delay, (b) the free-fulfning frequency of os-<br>cillation $f_{\rm c}$ is $1/(2T_{\rm c})$ ; (c) assuming finite transition slope signals, the |    |
|     | addition of an injection signal $S_{max}$ to the oscillator signal $S_{r}$ ro                                                                                            |    |
|     | sults in an extra delay $d$ in the oscillation loop so that $f = -f_{TVI} -$                                                                                             |    |
|     | suits in an extra delay, $u$ , in the oscillation loop so that $f_{osc} = f_{INJ} = 1/(2(T_d + d))$ in the injection-locked state.                                       | 14 |
| 2-3 | Four stage differential ring oscillator, with an injection stage operating                                                                                               |    |
|     | on the first stage. The oscillator's delay stages (1-4) are identical;                                                                                                   |    |
|     | the injection stage's bias current and degeneration resistance are scaled                                                                                                |    |
|     | to scale the injection level                                                                                                                                             | 16 |
| 2-4 | Edges of the locking range for the differential 4-stage ring oscillator                                                                                                  |    |
|     | operating quasi linearly w.r.t. the ratio $\alpha$                                                                                                                       | 17 |
| 95  | $\theta$ w r t the injection frequency for the differential 4 stage ring oscillator                                                                                      |    |
| 2-0 | operating quasi-linearly with $\alpha = 10$                                                                                                                              | 18 |
|     | operating quasi-intearry with $\alpha = 10$                                                                                                                              | 10 |
| 2-6 | Measured waveforms for the differential 4-stage ring oscillator oper-                                                                                                    |    |
|     | ating quasi-linearly ( $R_E = 20 \ \Omega$ ): injection input $V_{inj}$ , stage 1 input                                                                                  |    |
|     | $V_{i1}$ and stage 1 output $V_{O1}$ with $\alpha = 10$ ; (top) $V_{inj}$ is $-99.84^{\circ}$ out                                                                        |    |
|     | of phase with $V_{i1}$ at 3.375 MHz, the upper edge of the locking band-                                                                                                 |    |
|     | width; (middle) $V_{inj}$ is in phase with $V_{i1}$ in the center of the lock range                                                                                      |    |
|     | at 3.21 MHz; (bottom) $V_{inj}$ is 79.75° out of phase with the oscillat-                                                                                                |    |
|     | ing input waveform $V_{i1}$ , at 3.098 MHz, the lower edge of the locking                                                                                                | 10 |
|     | bandwidth                                                                                                                                                                | 19 |
| 2-7 | For non-linear operation each stage of Fig. 2-3 is modeled as a hard                                                                                                     |    |
|     | amplitude limiting mechanism, whose output current drives a $R$ - $C$ load.                                                                                              | 21 |
| 2-8 | Differential output voltage $v_d(t)$ of a stage in the 4-stage ring-oscillator                                                                                           |    |
|     | of Fig. 2-3                                                                                                                                                              | 21 |
| 2-9 | Effect of the injection signal on the output voltage $v_d = v_{d,i} + v_{d,inj}$ .                                                                                       | 22 |

| 2-10 | Waveforms for the differential 4-stage ring oscillator when injection                                 |    |
|------|-------------------------------------------------------------------------------------------------------|----|
|      | locked; the last 3 stages have a delay $t_d$ and the first stage has a delay                          |    |
|      | $t_d + d$ due to the injection                                                                        | 24 |
| 2-11 | Measured waveforms for the differential 4-stage ring oscillator operat-                               |    |
|      | ing non linearly $(R_E = 0 \ \Omega)$ with $\alpha = 10$ : the injected signal, $V_{inj}$ ,           |    |
|      | the stage 1 input voltage, $V_{I1}$ , and the stage 1 output voltage, $V_{O1}$ , are                  |    |
|      | shown for varying $\Delta$ , the delay between $V_{INJ}$ and $V_{I1}$ ; (top) $\Delta = \Delta_{min}$ |    |
|      | at the upper edge of lock range at 3.61 MHz; $t_{d1}$ , the delay through                             |    |
|      | stage 1, i.e. the delay between $V_{O1}$ and $V_{I1}$ , is 31.4 ns; $\Delta = 0$ and                  |    |
|      | $t_{d1} = 36.1$ ns in the middle of the lock range at 3.49 MHz; (bottom)                              |    |
|      | $\Delta = \Delta_{max}$ and $t_{d1} = 39.1$ ns at the lower edge of the lock range at                 |    |
|      | 3.37 MHz                                                                                              | 25 |
| 2-12 | Edges of the locking range w.r.t. $\alpha$ for the differential 4-stage ring                          |    |
|      | oscillator operating non linearly                                                                     | 26 |
| 2-13 | Calculated, simulated and measured $d$ as a function of $\Delta$ for the dif-                         |    |
|      | ferential 4-stage ring oscillator operating non-linearly with (a) $\alpha=10$                         |    |
|      | and (b) $\alpha = 6.8$                                                                                | 27 |
| 2-14 | Injection lock transient waveforms for the differential 4-stage ring os-                              |    |
|      | cillator used for the derivation of $\Delta[n+1]$ from $\Delta[n]$                                    | 28 |
| 2-15 | Simulated and calculated injection lock dynamics of the differential                                  |    |
|      | 4-stage ring oscillator for a step change in frequency from 3.4MHz to                                 |    |
|      | 3.6MHz at $\alpha = 10$ (top) $\alpha = 6.8$ (bottom).                                                | 29 |
| 2-16 | Experimental setup used to observe the injection lock dynamics of the                                 |    |
|      | 4-stage differential ring oscillator.                                                                 | 30 |
| 2-17 | After an FM modulation step trigger, the injection frequency generator                                |    |
|      | settled to the new frequency in about 1.5 cycles; that time point is                                  |    |
|      | labeled as $T = 0$                                                                                    | 30 |
| 2-18 | Measured and calculated injection lock dynamics of the 4-stage differ-                                |    |
|      | ential ring oscillators for a step change in frequency from 3.4MHz to                                 |    |
|      | 3.6MHz at $\alpha = 10$ (top) $\alpha = 6.8$ (bottom).                                                | 31 |

| 2-19 | Single-ended 3-stage CMOS-inverter based ring oscillator, with an in-                 |    |
|------|---------------------------------------------------------------------------------------|----|
|      | jection stage operating on the the first stage. Each of the three stages is           |    |
|      | made of nine $(9x)$ identical inverters. For closed-loop operation switch             |    |
|      | S1 is closed. The injection level can be switched from $\alpha = 9$ to $\alpha = 4.5$ |    |
|      | by opening or closing switches (S2,S3)                                                | 32 |
| 2-20 | Waveforms and the definition of d and $\Delta$ for the single-ended 3-stage           |    |
|      | ring oscillator in Fig. 2-19                                                          | 33 |
| 2-21 | d vs. $\Delta$ relationship for the single-ended 3-stage ring oscillator obtained     |    |
|      | through open-loop simulations for different values of $\alpha$ . Also shown are       |    |
|      | the d vs. $\Delta$ relationships when the oscillator is operating in closed loop      |    |
|      | and injection locked.                                                                 | 34 |
| 2-22 | Open-loop d vs. $\Delta$ plots obtained through measurements and simula-              |    |
|      | tions at $\alpha = 9$ for the single-ended 3-stage ring oscillator                    | 35 |
| 2-23 | Open-loop d vs. $\Delta$ plots obtained through measurements and simula-              |    |
|      | tions at $\alpha = 4.5$ for the single-ended 3-stage ring oscillator                  | 35 |
| 2-24 | Closed-loop d vs. $\Delta$ plots obtained through measurements and simu-              |    |
|      | lations for the single-ended 3-stage ring oscillator                                  | 36 |
| 2-25 | Edges of the locking range w.r.t. $\alpha$ for the single-ended 3-stage ring          |    |
|      | oscillator                                                                            | 36 |
| 2-26 | Measured and calculated injection lock dynamics for the single-ended                  |    |
|      | 3-stage ring oscillator for a step change in frequency from 9.35MHz to                |    |
|      | 9.75MHz at (top) $\alpha = 9$ (bottom) $\alpha = 4.5.$                                | 38 |
| 2-27 | Simulated and calculated injection lock dynamics for the single-ended                 |    |
|      | 3-stage ring oscillator for a step change in frequency from 9.35MHz to                |    |
|      | 9.75MHz at (top) $\alpha = 9$ (bottom) $\alpha = 4.5.$                                | 38 |
| 3-1  | Block diagram of the ultra-low supply sub-integer clock-frequency syn-                |    |
|      | thesizer using ILRO based prescaler for divide-by-3 function, followed                |    |
|      | by a phase-switching based sub-integer programmable divider and an                    |    |
|      | automatic injection-lock calibration loop for ILRO and VCO                            | 41 |

| 3-2  | PFD with extra delay in reset path                                                                              | 43 |
|------|-----------------------------------------------------------------------------------------------------------------|----|
| 3-3  | Differential charge-pump with unity-gain buffer based architecture along                                        |    |
|      | with common-mode feedback circuit.                                                                              | 43 |
| 3-4  | VCO, using a cross-coupled inverter architecture                                                                | 45 |
| 3-5  | $General\ concept\ of\ odd-M\ stage\ multi-input\ injection\ to\ achieve\ modulo-$                              |    |
|      | M division and achieve wider injection lock range                                                               | 46 |
| 3-6  | $\label{eq:constraint} Ultra-low \ voltage \ pseudo-differential \ implementation \ of \ the \ ILRO \ prescale$ | r  |
|      | in a divide-by-3 configuration                                                                                  | 47 |
| 3-7  | Circuit block diagram for the phase-switching based programmable                                                |    |
|      | divider                                                                                                         | 48 |
| 3-8  | eq:automatic injection-lock calibration algorithm to coarsely set the ILRO                                      |    |
|      | free-running frequency.                                                                                         | 49 |
| 3-9  | Automatic injection-lock calibration algorithm to optimally select the                                          |    |
|      | VCO band.                                                                                                       | 50 |
| 3-10 | Fabricated chip micrograph and layout of the PLL                                                                | 51 |
| 3-11 | Measurement of (a) $V_{cilo}$ versus $F_{osc}$ (b) Input dBm versus Freq. lock                                  |    |
|      | range                                                                                                           | 52 |
| 3-12 | Linear-fit of measured and calculated lock ranges at different injection                                        |    |
|      | input levels and self-oscillation frequencies                                                                   | 53 |
| 3-13 | (a)<br>Vco gain curves. (b) Auto-calibration between ILRO and VCO                                               | 54 |
| 3-14 | Measurement of (a) Output spectra of the clock-frequency synthesizer                                            |    |
|      | at different sub-integer division ratios. (b) Phase noise plot at division                                      |    |
|      | ratio of 96                                                                                                     | 56 |
| 3-15 | Power consumption distribution in the sub-integer clock-frequency syn-                                          |    |
|      | thesizer                                                                                                        | 57 |
| 4-1  | RX equalization capabilities, such as CTLE peaking and 1-tap DFE                                                |    |
|      | are evaluated for channel performance margins.                                                                  | 63 |
|      |                                                                                                                 |    |

| 4-2  | Channel operating margin study with signal impairments at different                                 |    |
|------|-----------------------------------------------------------------------------------------------------|----|
|      | RX peaking and DFE settings. 1-tap DFE gives robustness to system solution in case of degradation due to crosstalk and PN-skew. To improve signal-to-noise ratio in face of crosstalk, peaking could be dialed down and $h_1$-tap could be used for post-cursor equalization | 65 |
| 4-3  | Quarter-rate RX architecture for very short-reach chip-to-chip links with clock-less DFE and high-bandwidth CDR based on Master-Slave injection-locked oscillators | 66 |
| 4-4  | RX CTLE equalization using a single-stage peaking amplifier. | 68 |
| 4-5  | Power spectrum of NRZ signalling for a L-bit repeating pattern, showing a null at data rate | 69 |
| 4-6  | (a) RZ data spectra with $T_b/2$ delay into the XOR cell (b) Simulated RZ injection level with 19Gbps NRZ input data rate | 70 |
| 4-7  | Edge-detection, clock signal extraction and injection scheme | 71 |
| 4-8  | Schematics of limiting-amplifier used for $\sim 20$dB differential gain | 71 |
| 4-9  | (a) The replica delay-line uses the regulated-voltage of the MILO-SILO block as well as the $C_L$ settings to track the data rate by maintaining $T_b/2$ delay (b) Simulated tracking variation due to mismatch in the replica buffer. | 72 |
| 4-10 | Schematics of CML XOR stage. | 73 |
| 4-11 | Shows the use of consecutive early-late transitions to discriminate between phase and frequency error for tracking and correction | 76 |
| 4-12 | Reference-less frequency lock algorithm which sets the MILO-SILO free running frequency to lock in the center of the injection lock range. This ensures optimal margin against drift and for jitter-tolerance | 77 |
| 4-13 | Master-Slave ILO-based $360^\circ$ phase-rotation using resistive-interpolated edges for injection. | 79 |
| 4-14 | Coarse phase-selections to fine resistive-interpolation settings at different phase-rotator positions over 2UI. | 80 |
| 4-15 | Clock-less direct-feedback DFE with variable delay replica-cell tied to the delay elements in ILOs to optimally meet DFE loop timing margins. | 81 |
| 4-16 | Discrete-time model of the dual-loop clock-data recovery loop | 83 |
| 4-17 | Relationship between input injection phase perturbation and output phase change and its impact on settling time constant | 83 |
| 4-18 | Open-loop $d$ vs. $\Delta$ values of MILO-SILO used to calculate JTOL and compared against closed-loop simulations. | 84 |
| 4-19 | Fabricated chip micrograph and layout | 86 |
| 4-20 | Measurement setup of the DUT. Data is generated in J-BERT N4903B and multiplied up using N4876A 2:1 multiplexer. The data then goes through a Megtron6 PCB before entering the DUT on the probe station. Serial-scan interface is controlled using National Instruments NI-2162 digital I/O accessory and NI PXI-1042, which is also used to interface with LabView GUI | 87 |
| 4-21 | Shows the setup used to measure the rotator INL/DNL. 1010 data pattern from the J-Bert is multiplied up using N4870A 2:1 Mux to injection-lock into the MILO-SILO in the device under test (D.U.T). The MILO-SILO phase-rotator output recovered clock from D.U.T is pattern-locked to a trigger in DCA-X 86100D sampling scope. MILO-SILO phase-rotator is rotated and its phase-step is calculated with reference to the previous waveform in scope memory. | 88 |
| 4-22 | Lock range of MILO-SILO at different frequency control voltages $(v_c)$ and switchable load cap $(C_L)$ with (a) 1010 pattern. (b) PRBS7 pattern. | 89 |
| 4-23 | Measured INL/DNL values of the MILO-SILO based phase-rotator over extremes of operating speed. | 90 |
| 4-24 | Measurement of random jitter $(RJ_{rms})$ on the recovered clock includes not only the jitter transfer from the injected data but also the output clock buffers and driver. | 91 |
| 4-25 | (a) Recovered clock random jitter as function of data baudrate post-BBFD (b) After initial frequency lock calibration, as supply voltage of the MILO-SILO regulator changes by $\pm 5\%$ or as temperature deviates between 0C and 100C, the recovered clock RJ shows less than 10fs of deviation with no discernible trends. | 92 |
| 4-26 | Measured channel insertion loss over 20-inch Megtron6 PCB and $Rx_{Input}$ eye diagram after channel at 19Gb/s | 93 |
| 4-27 | RX performance at 19Gbps over 20-inch MEG6 channel | 94 |
| 4-28 | (a) Measured JTOL BW at 19Gb/s for PRBS7 data at BER of $10^{-12}$ over a 10dB loss channel (b) JTOL BW as a function of temperature variation after frequency lock calibration. | 95 |
| 4-29 | Power consumption distribution in the RX | 97 |
| 4-30 | Comparison of RX CDR bandwidth, speed, power efficiency, and channel loss at Nyquist against other reference-less clock-data recovery designs | 98 |

# List of Tables

| Table | Caption | Page |
|-----|-------------------------------------------------------------------------|----|
| 2.1 | Lock range measurements and predictions for the 4-stage differential ring oscillator operating non linearly | 26 |
| 2.2 | Lock range predictions, measurements, and simulations for the single-ended 3-stage ring oscillator | 37 |
| 3.1 | Performance Summary and Comparison of Low-Supply PLLs | 55 |
| 4.1 | Shows the many parameters used for channel margin study | 62 |
| 4.2 | Summary of receiver performance | 96 |
| 4.3 | State of art comparison of energy-efficient dense VSR-C2C interconnects | 98 |

#### Acknowledgements

I would especially like to thank my Ph.D. advisor, Prof. Kinget, whose support and guidance made this work possible. His approach to research, from defining the problem in the broader context to approaching it from innovative angles while maintaining clarity, is inspiring, and hopefully I have learnt some of it to carry forward. I also wish to acknowledge Prof. Tsividis and Prof. Shepard at Columbia for teaching some of the best courses I ever had; these courses gave me a solid foundation in integrated circuit design. This work would not have been possible without material support and encouragement from my managers at IBM and at Globalfoundries, Kevin Kramer and Daniel Storaska. My deepest appreciation to them for showing faith in me through the years. Over the years I have had the pleasure of working with many talented individuals; some of my best learning, insights and successes were possible through this teamwork. I would especially like to acknowledge the collegial camaraderie I have enjoyed while working with Dr. Bulzacchelli, Dr. Meghelli, and Dr. C.-M. Hsu. Next I wish to thank all the people who helped me build the prototypes and test the chips, some of whom are: Kevin Guay, George May, Al Brouillette, Ruben Recinos, Peter Coutu, and Mike Wielgos. Finally, I wish to offer my sincere gratitude to my friends and family, who have been instrumental in my life, not only in influencing me but also in supporting me to complete this degree. I thank my entire family: my parents, sister and brother-in-law, parents-in-law and everyone else who has offered me so much support. I thank my lovely kids, Pallavi and Nikhil, for their love and for giving me a deeper perspective and the pleasure of leading a fuller life. And lastly, I offer my dearest thanks to my wife, Veena, without whose supermom-like abilities, in running a busy medical practice and a household through my unpredictable schedules, nothing would have been possible. I owe an eternal debt to her love, support, and kindness through it all.

### Chapter 1

### Introduction

#### 1.1 Motivation and focus area

In high-performance computing (HPC), the major milestones are the emergence of systems whose aggregate performance first crosses a threshold of $10^{3k}$ operations per second, for some k. Gigascale ( $10^9$ ) was achieved in 1985 and terascale ( $10^{12}$ ) in 1997. Today there are petascale ( $10^{15}$ ) systems deployed, and exascale ( $10^{18}$ ) systems are the next waypoint in HPC. Scientific frontiers demand faster and bigger computers to analyze an avalanche of data and advance our knowledge. A quest for answers to grand scientific challenges is the main motivation behind developing and building exascale supercomputers and beyond [1].

The semiconductor industry has been fueled by systems utilizing continuous improvement in the cost, performance, and power of semiconductor content. Historically these improvements have delivered the resultant Dennard scaling benefits in power and performance. These improved metrics translated in a straightforward manner from device to circuit to processors to systems. Moore's law scaling, interpreted with the paradigm above, is finished, and it is increasingly difficult to deliver power/performance/cost improvements at the device level to the circuit and system levels [2].

To achieve exascale operation, one has to rely on massive parallelism rather than Moore's law, as shown in Fig. 1-1 [3]. The key challenges to a massively-parallel exascale system are energy and power. Supply voltage scaling is the most effective means to reduce total power consumption, but interconnect delay and energy cannot be ignored either, since higher concurrency increases demand on the interconnect fabric. A traditional router-based interconnect would exceed the power budget due to increased concurrency, and hence a new hierarchical and possibly heterogeneous interconnect fabric is desired. It would employ simple busses for shorter interconnect, and complex routers to communicate over longer distances as shown in Fig. 1-2 [4]. Each group in such a strawman architecture consists of 12 multi-core processor chips, each having 16 optimally designed DRAM chips and 12 router chips. 32 of these groups would be housed in a rack and 583 racks would make a complete exaflop system. With each processor containing 742 cores, one has 166 million cores running a billion threads in parallel. The energy budget can be utilized by a large number of transistors for delivering throughput performance with extreme parallelism using a large number of small cores. These cores will use aggressive voltage scaling for energy efficiency, will be power managed at a fine grain, will be connected with energy-efficient hierarchical and heterogeneous interconnect networks, and the entire system will employ resiliency. With millions of cores and billions of threads, not only would clock tree planning be extremely challenging and constraining but the EMI radiation of all the interconnect communication could cause severe interference.
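The core count quoted for the strawman system follows directly from the per-group, per-rack, and per-processor figures above. The short sketch below only reproduces that arithmetic; the constant and function names are illustrative, and the threads-per-core figure is a derived estimate that assumes the "billion threads" are spread evenly over all cores.

```python
# Strawman exascale sizing, using only the counts quoted in the text.
PROCESSORS_PER_GROUP = 12
GROUPS_PER_RACK = 32
RACKS = 583
CORES_PER_PROCESSOR = 742
TOTAL_THREADS = 1e9  # "a billion threads" (round figure from the text)


def total_processors() -> int:
    """Number of processor chips in the complete system."""
    return PROCESSORS_PER_GROUP * GROUPS_PER_RACK * RACKS


def total_cores() -> int:
    """Number of cores in the complete system."""
    return total_processors() * CORES_PER_PROCESSOR


def threads_per_core() -> float:
    """Average threads per core, assuming an even spread (an assumption)."""
    return TOTAL_THREADS / total_cores()


if __name__ == "__main__":
    print(f"processors:       {total_processors():,}")
    print(f"cores:            {total_cores():,}")
    print(f"threads per core: {threads_per_core():.1f}")
```

The product comes to roughly 166 million cores, in line with the figure quoted above.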



Figure 1-1: Exascale performance needs to rely on massive parallelism.

Figure 1-2: A strawman architecture for a massively-parallel exascale processor running a billion parallel threads. Reprinted from [4].

This thesis presents design solutions for some of these challenges. As the cores might resort to aggressive voltage scaling, a high-performance sub-integer clock-frequency synthesizer operating off an ultra-low core supply for embedded sub-rate IO clocking applications is presented. Also, to reduce clock-tree planning complexity and interference among the billions of interconnect threads over very-short reach (VSR) channels, a reference-less architecture is presented for such chip-to-chip links; its high jitter-tolerance bandwidth enables SSC, mitigates EMI, and reduces transmitter power.

#### 1.2 Thesis highlights

#### 1.2.1 Singular performance boost leveraging ILOs

An oscillator can be pulled in to an injected signal frequency if periodic steady-state injection leads to a change in the average period of the oscillator. The constraint of periodicity in the steady-state behavior, along with the strength of injection, limits the lock range. Also, the synchronization effect of injection manifests itself as a correction of the oscillator zero crossings. The resultant reduction of phase noise depends on the injection level, the initial frequency delta, and the number of oscillator periods of jitter accumulation between periodic injection pulses. Combinations of these basic principles manifest themselves in many of today's transceiver and frequency synthesis techniques, as seen in Fig. 1-3.

Superharmonic ILOs achieve even/odd division at very high operating speeds [5–12], whereas subharmonic ILOs achieve frequency multiplication with multi-phase outputs [13–17]. Delay modulation between master and slave ILOs leads to 360° phase-rotation with jitter-filtering [18]. A grid of coupled ILOs is shown to produce a standing-wave oscillator for reduced clock skew across the chip [19]. A forwarded clock injection-locked to local oscillators aids clock-data recovery without a PLL or a full-fledged CDR, with good jitter-tolerance bandwidth and low power [20–23]. A popular technique is to inject a subharmonic reference clock into the oscillator in a PLL to lower the phase noise [24–38]. High-bandwidth CDR and burst-mode operation are possible by injecting the data edges into the oscillator that generates the recovered clock [39–41]. An ILO used as a prescaler increases the speed of operation [42, 43]. Injection-lock based carrier synchronization is demonstrated in a mm-wave intraconnect solution [44]. Finally, fast settling of an ILO is used to modulate a transmitter for direct FSK modulation [45].

Fig. 1-4 shows the record performances demonstrated in this thesis by leveraging two properties of injection-locked oscillators: low-supply operation at high relative speeds, and the high bandwidth of the locking process, which tracks input jitter on the injected signal. The subsequent sub-sections highlight these features.



Figure 1-3: Prior-art of various applications leveraging ILOs.



Figure 1-4: Singular performance boost, such as highest reported clock-frequency synthesizer speed at ultra-low supply of 0.5V and highest reported chip-to-chip operation for links with > 100MHz CDR bandwidth, is reported when leveraging unique features of ILOs.

#### 1.2.2 Time-delay based model for nonharmonic ILOs

A time-domain delay-based model was developed to predict the injection locking behavior of non-harmonic oscillators such as ring oscillators. The effect of the injection signal on the oscillator is modeled with a d versus $\Delta$ characteristic which captures the additional delay d in a stage due to the effect of the injection signal with a delay $\Delta$. Using this characteristic, the injection-locking range as well as the injection-locking dynamics can be accurately modeled and predicted. This modeling approach was applied to a differential four-stage ring oscillator where analytical expressions for the waveforms could be derived along with an analytical expression for the d versus $\Delta$ characteristic. The versatility of the modeling approach was demonstrated by analyzing the locking behavior of a single-ended three-stage CMOS-inverter-based ring oscillator. In this case the d versus $\Delta$ characteristic was derived from simulations and measurements. By simulating the d versus $\Delta$ characteristic, the model is also applied to predict the lock range of a multi-phase injection-locked ring-oscillator-based prescaler, as well as the dynamics of tracking injection phase perturbations in injection-locked master-slave oscillators. The presented time-domain delay-based modeling approach can be applied to any nonharmonic oscillator as long as the relationship between the extra delay, d, and the delay, $\Delta$, between the injection signal and the relevant internal oscillator signal is available.

#### 1.2.3 A 0.5V, 9GHz Sub-Integer Clock-Frequency Synthesizer using Multi-Phase Injection-Locked Prescaler

A 9-GHz sub-integer clock-frequency synthesizer, shown in Fig. 1-5, incorporates a multi-phase injection-locked ring-oscillator-based prescaler for operation at an ultra-low supply voltage of 0.5V, phase-switching based programmable division for sub-integer clock-frequency synthesis, and automatic calibration to ensure injection lock. The synthesizer consumes 3.5mW of power at 9.12GHz and occupies 0.05mm<sup>2</sup> of area, while showing an output phase noise of -100dBc/Hz at 1MHz offset and RMS jitter of 325fs; it achieves a net FOM<sub>A</sub> of -186.5 in a 45-nm SOI CMOS process. Key features are:



Figure 1-5: Block diagram of the ultra-low supply clock-frequency synthesizer.

- (a) A record speed of 9GHz has been demonstrated at 0.5V in 45nm SOI CMOS.
- (b) The proposed multi-phase multi-input ILRO-prescaler eliminates the speed bottleneck, while automatic injection-lock calibration ensures lock between the VCO and the ILRO-prescaler.
- (c) The phase-switching based programmable divider structure provides fine frequency resolution through sub-integer division.

#### 1.2.4 RX for VSR C2C links with Clock-less DFE and high bandwidth CDR

Figure 1-6: Quarter-rate RX architecture for very short-reach chip-to-chip links with clock-less DFE and high-bandwidth CDR based on Master-Slave injection-locked oscillators.

An RX with a reference-less clocking architecture, Fig. 1-6, for high-density VSR-C2C links is described. It features clock-less DFE and a high-bandwidth CDR based on master-slave ILOs for phase generation/rotation. The RX is implemented in 14nm CMOS and characterized at 19Gb/s. It achieves a power-efficiency of 2.9pJ/b while recovering error-free data (BER < $10^{-12}$ ) across a 15dB loss channel. The jitter tolerance BW of the receiver is 250MHz and the INL of the ILO-based phase-rotator (32 steps/UI) is < 1 LSB. Key highlights are:

- (a) A receiver architecture that simplifies clock-tree planning in dense extreme-scale computing environments and has a high-bandwidth CDR to enable SSC for suppressing EMI and to mitigate TX jitter requirements.
- (b) This receiver is 1.5x faster than previous reference-less embedded-oscillator based designs with greater than 100MHz jitter tolerance bandwidth, and recovers error-free data over VSR-C2C channels.
- (c) It has a linear, first-of-its-kind phase generator/interpolator based on master-slave ILOs.
- (d) It has a clock-less DFE that seamlessly (with no DFE-specific delay calibration) uses variable-delay information from the embedded ILO to maintain optimal DFE loop margins while feeding back directly into the CTLE output.
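A few derived quantities follow from the headline numbers above by simple arithmetic. The sketch below assumes that the 2.9pJ/b efficiency is quoted at the 19Gb/s operating point and that the quarter-rate clock runs at one quarter of the data rate; the function names are illustrative and the results are back-of-the-envelope figures, not measured values.

```python
# Derived receiver figures from the quoted headline numbers.
DATA_RATE_BPS = 19e9          # characterized data rate
ENERGY_PER_BIT_J = 2.9e-12    # quoted power efficiency
ROTATOR_STEPS_PER_UI = 32     # phase-rotator resolution
CLOCK_RATE_DIVISOR = 4        # quarter-rate architecture


def unit_interval_s(data_rate_bps: float) -> float:
    """Duration of one unit interval (one bit period)."""
    return 1.0 / data_rate_bps


def rx_power_w(energy_per_bit_j: float, data_rate_bps: float) -> float:
    """Power implied by an energy-per-bit figure at a given data rate."""
    return energy_per_bit_j * data_rate_bps


def rotator_lsb_s(data_rate_bps: float, steps_per_ui: int) -> float:
    """Nominal phase-rotator step size in seconds."""
    return unit_interval_s(data_rate_bps) / steps_per_ui


def sub_rate_clock_hz(data_rate_bps: float, divisor: int) -> float:
    """Sampling clock frequency for a 1/divisor-rate architecture."""
    return data_rate_bps / divisor


if __name__ == "__main__":
    print(f"UI:            {unit_interval_s(DATA_RATE_BPS) * 1e12:.2f} ps")
    print(f"implied power: {rx_power_w(ENERGY_PER_BIT_J, DATA_RATE_BPS) * 1e3:.1f} mW")
    print(f"rotator LSB:   {rotator_lsb_s(DATA_RATE_BPS, ROTATOR_STEPS_PER_UI) * 1e12:.2f} ps")
    print(f"clock:         {sub_rate_clock_hz(DATA_RATE_BPS, CLOCK_RATE_DIVISOR) / 1e9:.2f} GHz")
```

Under these assumptions, an INL below 1 LSB corresponds to a phase error of less than about 1.6ps.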

#### 1.3 Thesis Organization

This thesis focuses on presenting the advances in circuits and systems for serial communications in extreme-scale systems and some of the relevant modeling. Injection-locked oscillators are heavily leveraged to achieve high speed and performance at ultra-low scaled core supplies and to achieve high-bandwidth clock-data recovery using embedded reference-less oscillators. Chapter 2 takes a unique and essentially simplifying perspective on non-harmonic ILOs and develops a time-delay based model to predict an ILO's locking range and dynamics. The model is based on the correlation between the delay of the injected signal with respect to the oscillator signal at the input of a stage and its effect on the output delay of that oscillator stage.

Chapter 3 focuses on high-performance clock synthesis based off the ultra-low scaled core supply likely to be used in extreme-scale systems. It uses a minimal stack to have the highest possible speed for the injection-locked prescaler, a key speed bottleneck. Important techniques to automatically achieve lock between the VCO and prescaler, as well as to achieve programmable sub-integer division without compromising the loop bandwidth, are presented. Such techniques would be of interest when embedded sub-rate clocking is needed to work off the core supply with a small power/area signature.

Chapter 4 follows with a solution for the potential issues of clock-tree planning and interference in extreme-scale systems with billions of threads. It presents a reference-less RX for VSR chip-to-chip links which mitigates the complexity of clock-tree planning and improves resilience of the system. Also, the high bandwidth of the clock-data recovery lends the design to SSC, improved EMI, and potential TX power savings due to reduced jitter requirements. Finally, Chapter 5 summarizes the thesis and ends with a discussion of potential avenues for future research.

### Chapter 2

### Time-Domain Model for Injection Locking in Nonharmonic Oscillators

#### 2.1 Introduction

Injecting a signal into an oscillator leads to injection locking phenomena when the injected signal has frequency components close to the oscillator's frequency or its harmonics. Injection locking is useful to establish a relationship between a free running oscillator and a reference oscillator, without requiring a full frequency-synthesizer. Injection locking in harmonic oscillators has been used in applications such as frequency multiplication [46] and the generation of variable phase shifts [47]; injection locking in ring oscillators has been used for frequency division [48] and precision quadrature generation [49].

Theoretical studies of injection locking have focused on harmonic oscillators and mostly relied on narrow-band frequency-domain descriptions using phasors as in, e.g., [50–52]. Some studies have used a describing function for the nonlinear element of the oscillator, but assume a tuned resonator to feedback to the input of the nonlinear element, to arrive at the injection locked model [53, 54]. Non-harmonic oscillators such as ring or relaxation oscillators do not have a harmonic resonator and these narrow-band frequency-domain models do not apply. Numerical techniques to model non-harmonic oscillators have been presented in [55] and an analytical time-domain derivation to predict the injection-lock range for ring oscillators has been presented in [56]. In this chapter we develop a time-domain delay based model to describe injection locking in non-harmonic oscillators and to derive the injection locking bandwidth, as well as the injection locking dynamics.

#### 2.2 Models for Injection Locking

We briefly review two modeling approaches for injection locking: a frequency-domain, phase-shift based model and a time-domain, time-delay based model. This introductory discussion of the basic concepts uses simplified or idealized representations of the building blocks; some of the models are investigated in greater detail in later sections.

Fig. 2-1(a) shows a simplified block diagram of a resonator based harmonic oscillator in its locked state that can be used for modeling injection locking [50–52]. At the self resonance frequency,  $\omega_o$ , the phase shift through the tank is zero ( $\angle H = 0$ ), but at an injection frequency,  $\omega_{INJ}$ , the phase shift through the tank is non-zero ( $\angle H = -\phi$ ), as shown in Fig. 2-1(b). The effect of the addition of the injection signal to the oscillator signal is an additional phase shift,  $\angle (S_O, S_I) = \phi$ , in the loop which compensates the phase change ( $\angle H = -\phi$ ) in the resonator to obtain a total phase shift around the loop of zero so that the phase condition for oscillation is satisfied again. Varying the phase shift  $\theta$  between the injection signal  $S_{INJ}$  and the oscillator signal  $S_I$  leads to different phase shifts,  $\phi$ , in the summer, as shown in Fig. 2-1(c); for an injection signal with a frequency within the locking range for the oscillator, the injection locking transient dynamics adjust  $\theta$  to obtain the appropriate phase shift  $\phi$  [50–52].

Figure 2-1: (a) Frequency domain model for injection locking of resonator based oscillators; (b) Resonator amplitude and phase characteristic; the amplifier A is assumed to have a unity frequency response; (c) phasor diagram at $\omega_{INJ}$ for the signals in the locked oscillator when in steady state.

This frequency-domain model relies on the presence of a narrow-band resonator in the loop so that the signals have a single dominant frequency component. This enables the use of transfer functions and phasor analysis, and the phase balance around the loop can be used as a necessary oscillation condition. Such a model can be adapted for use in non-harmonic oscillators as long as the large signal behavior of the building blocks is close to their small signal response and they operate quasi-linearly. An equivalent resonator transfer function H can then be derived from the transfer function of the different stages in the oscillator [59]; we will work out an example of this quasi-linear analysis for a 4-stage differential ring oscillator in section 2.3 to provide a comparison point with the time-domain modeling approach.

For non-harmonic oscillators which operate in a strongly non-linear regime the frequency-domain model does not apply but the time-domain delay based model presented in Fig. 2-2 can be used. The delay through the loop,  $T_d$ , sets the free-running oscillation period  $T_0 = 2T_d$ . In order to change the oscillation period by injection locking to  $T_{INJ} = T_0 + 2d$ , the injection signal needs to introduce an additional delay d in the loop. Assuming signals with finite-transition slopes, the addition of the injection signal,  $S_{INJ}$ , with a delay  $\Delta$  compared to the oscillator signal  $S_I$  leads to an additional delay d around the oscillation loop. Varying  $\Delta$  leads to a different loop delay d; for a given  $f_{INJ}$  within the locking bandwidth, the injection locking transient dynamics will adjust  $\Delta$  so that appropriate d is generated.
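The relations $T_0 = 2T_d$ and $T_{INJ} = T_0 + 2d$ can be turned into a small numerical sketch. The code below is only an illustration of those two relations; the assumption that the lock range is bounded by the smallest and largest extra delay d that the injection can produce, as well as the function names and the example delay values, are ours and are not taken from the measurements in this chapter.

```python
def free_running_frequency(t_d: float) -> float:
    """f_0 = 1 / (2 * T_d) for loop delay T_d in seconds."""
    if t_d <= 0:
        raise ValueError("loop delay must be positive")
    return 1.0 / (2.0 * t_d)


def locked_frequency(t_d: float, d: float) -> float:
    """f_INJ = 1 / (2 * (T_d + d)) when injection adds an extra delay d."""
    if t_d + d <= 0:
        raise ValueError("total loop delay must be positive")
    return 1.0 / (2.0 * (t_d + d))


def lock_range(t_d: float, d_min: float, d_max: float) -> tuple[float, float]:
    """Lowest and highest lockable frequency for d in [d_min, d_max].

    Assumes the reachable extra delays are the extremes of the
    d versus Delta characteristic (hypothetical inputs here).
    The largest extra delay gives the lowest frequency.
    """
    return locked_frequency(t_d, d_max), locked_frequency(t_d, d_min)


if __name__ == "__main__":
    t_d = 100e-12                      # hypothetical 100 ps loop delay
    lo, hi = lock_range(t_d, -5e-12, 5e-12)
    print(f"f_0  = {free_running_frequency(t_d) / 1e9:.3f} GHz")
    print(f"lock = {lo / 1e9:.3f} GHz to {hi / 1e9:.3f} GHz")
```

A negative d corresponds to the injection speeding up the transition, so the oscillator locks above its free-running frequency; a positive d locks it below.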

Figure 2-2: (a) Delay based, time-domain model for injection locking in non-harmonic oscillators; the delay element, D, has a delay $T_d$ whereas the inverter is assumed ideal with zero delay; (b) the free-running frequency of oscillation, $f_0$, is $1/(2T_d)$; (c) assuming finite transition-slope signals, the addition of an injection signal, $S_{INJ}$, to the oscillator signal, $S_I$, results in an extra delay, d, in the oscillation loop so that $f_{osc} = f_{INJ} = 1/(2(T_d + d))$ in the injection-locked state.

In this chapter we derive the delay based model in detail for a differential 4-stage ring oscillator in section 2.4 as well as for a single-ended 3-stage ring oscillator built with standard CMOS digital inverters in section 2.5. Using the time-domain, delay based model, we derive the locking range and the dynamics of the locking transients of the locked oscillator, and compare analytical predictions, simulations using the Synopsys HSPICE circuit simulator, and measurements.

#### 2.3 Quasi-Linear Model For Injection Locking in Differential Ring Oscillators

In this section we derive the injection locking bandwidth of the 4-stage differential ring oscillator shown in Fig. 2-3 using the frequency-domain model, assuming quasi-linear operation of the circuit. Non-harmonic oscillators operate in a quasi-linear mode when the large signal operation of each stage is similar to its small signal AC behavior. The frequency-domain model introduced by Adler for harmonic oscillators can then be extended to non-harmonic oscillators since the phase shift through the oscillator can be derived from the small signal AC transfer function for each stage [59]. For example, by increasing $R_E$ in Fig. 2-3, the input pair of each delay stage becomes a linearized V-I converter and the oscillation waveforms are close to sinusoidal.

The model of Fig. 2-1 can be applied with the following signal choices in Fig. 2-3:  $S_{INJ} = I_{INJ,p} - I_{INJ,n}, S_I = I_{I1,p} - I_{I1,n}$ , and  $S_O = I_{O1,p} - I_{O1,n}$ . The loop transfer function H is then given by:

$$\frac{S_I}{S_O} = A \cdot H(jf) = -A \left(\frac{1}{1 + j\frac{f}{f_0} \cdot \tan(\frac{\pi}{4})}\right)^4 \tag{2.1}$$

where  $A = H_{DC}^4 = (G_m R_L)^4$  is the DC gain with  $G_m = g_m/(g_m R_E + 1)$  the effective transconductance of the V-I converter (Q1-Q2); at  $f_0 = 1/(2\pi R_L C_L)$  each stage contributes a phase shift of 45° and the oscillation conditions for the phase are satisfied. Assuming sufficient DC gain exists, i.e.  $H_{DC} \ge \sqrt{2}$ , the loop will self oscillate at  $f_0$ .
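The oscillation conditions stated above can be checked numerically from (2.1). The sketch below evaluates the loop gain $A \cdot H(jf)$ for the 4-stage ring and the nominal $f_0 = 1/(2\pi R_L C_L)$; the function names are illustrative, and the component values in the usage example are the nominal prototype values quoted in the experimental verification of this section, without tolerances or board parasitics.

```python
import cmath
import math


def nominal_f0(r_load_ohm: float, c_load_f: float) -> float:
    """Free-running frequency f_0 = 1 / (2*pi*R_L*C_L)."""
    return 1.0 / (2.0 * math.pi * r_load_ohm * c_load_f)


def loop_gain(f: float, f0: float, h_dc: float) -> complex:
    """A * H(jf) from (2.1) for the 4-stage ring, with A = h_dc**4."""
    stage = h_dc / (1.0 + 1j * (f / f0) * math.tan(math.pi / 4.0))
    return -(stage ** 4)


if __name__ == "__main__":
    f0 = nominal_f0(47.0, 1e-9)          # nominal R_L and C_L
    g = loop_gain(f0, f0, math.sqrt(2))  # minimum per-stage DC gain
    print(f"nominal f_0     = {f0 / 1e6:.3f} MHz")
    print(f"|loop gain|     = {abs(g):.6f}")
    print(f"loop phase      = {math.degrees(cmath.phase(g)):.3f} deg")
```

With a per-stage DC gain of exactly $\sqrt{2}$ the loop gain at $f_0$ evaluates to unity with zero phase, which is the boundary of the oscillation condition; any larger $H_{DC}$ gives a magnitude above one at the same phase.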

Figure 2-3: Four stage differential ring oscillator, with an injection stage operating on the first stage. The oscillator's delay stages (1-4) are identical; the injection stage's bias current and degeneration resistance are scaled to set the injection level.

Given (2.1), the phase shift $\phi$ at a frequency $f_{INJ}$ close to $f_0$ can now easily be computed using a first order Taylor series approximation, and the locking range can be determined using the observation that $\phi \approx \tan(\phi) = S_{INJ}/S_I = 1/\alpha$ at the edges of the locking range when $\theta$ is about $\pm 90^{\circ}$. Generalizing this derivation for N stages<sup>1</sup>, the locking range is calculated as [59]<sup>2</sup>:

$$\frac{\Delta f_m}{f_0} = \frac{1}{\alpha} \cdot \frac{4}{N \cdot \sin\frac{2\pi}{N}} \tag{2.2}$$
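The first-order step behind (2.2) can be sketched as follows. This is a reconstruction under the assumptions already stated (identical single-pole stages, each contributing a phase of $\pi/N$ at $f_0$, and large $\alpha$); it reads $\Delta f_m$ as the full width between the two edges of the locking range, an interpretation chosen because it agrees with (2.3). The symbol $\psi$ for the per-stage phase is introduced here for illustration only.

$$\psi(f) = -\arctan\left(\frac{f}{f_0}\tan\frac{\pi}{N}\right), \qquad \left.\frac{d\psi}{df}\right|_{f_0} = -\frac{1}{f_0}\cdot\frac{\tan\frac{\pi}{N}}{1+\tan^2\frac{\pi}{N}} = -\frac{1}{2f_0}\sin\frac{2\pi}{N}$$

$$|\phi| \approx N\left|\frac{d\psi}{df}\right|_{f_0}\left|f_{INJ}-f_0\right| = \frac{N}{2}\sin\frac{2\pi}{N}\cdot\frac{\left|f_{INJ}-f_0\right|}{f_0}$$

Setting $|\phi| = 1/\alpha$ at either edge gives $|f_{INJ}-f_0|/f_0 = 2/(\alpha N \sin\frac{2\pi}{N})$, and doubling this one-sided offset yields the expression in (2.2).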

This derivation assumes that the phase shift in the V-I converter (Q1-Q2) is negligible. Consequently, the phase shift,  $\theta$ , between the differential voltages  $V_{INJ}$  and  $V_{I1}$  in Fig. 2-3 varies between  $\pm 90^{\circ}$  over the locking range with injection frequency  $f_{inj}$  as follows [50, 59]:

$$\theta = \sin^{-1}\left(\frac{f_0 - f_{inj}}{f_0} \cdot \frac{N}{2}\sin\frac{2\pi}{N} \cdot \alpha\right) \tag{2.3}$$

<sup>1</sup> N is assumed even; for an odd number of stages a similar derivation can be performed but now the phase shift per stage becomes $(2\pi/N)$.

<sup>2</sup> In [59] a model for the lock range of ring oscillators with an injection signal at twice the oscillation frequency applied to the tail current source of the differential stages is introduced. Even though in the oscillator in Fig. 2-3 the injection signal is applied at the frequency of the fundamental with a differential injection stage connected in parallel with the first stage, an expression similar to that in [59] is obtained for the locking range $\Delta f_m$.

#### **Experimental Verification**

A prototype board of the 4-stage ring oscillator shown in Fig. 2-3, operating from 5 V, was built using discrete components with the following nominal values and $\pm 5 \%$ tolerances: $R_L = 47 \ \Omega$, $C_L = 1 \ nF$, and an 8mA bias current per stage. Matched 2N2222 bipolar NPN transistors on MPQ2222A chips [60] were used as the active elements. For simulations, we used an openly available model for NPN 2N2222 transistors [60]. We used the same value of $C_L$ for all stages in the netlist but, to account for board parasitics, adjusted that value so that the simulated self-oscillation frequency matched the measured frequency. For measurements, the injection signal from a generator was converted into a differential signal with a balun; the DC common mode bias for the injection stage was applied with bias-tees. An Agilent Infiniium 1.5GHz real-time oscilloscope was used to capture the time-domain waveforms.

Figure 2-4: Edges of the locking range for the differential 4-stage ring oscillator operating quasi-linearly w.r.t. the ratio $\alpha$

To obtain quasi-linear operation, the degeneration resistance  $R_E$  in the delay stages was set to 20  $\Omega$ ; the resistances in the injection stage were adjusted according to the desired injection level  $\alpha$ . The measured free running frequency  $f_0$  was 3.213 MHz. The calculated lock range using (2.2) as well as the simulated and measured values are plotted in Fig. 2-4 for varying  $\alpha$ ; the maximal error is less than 1.8%. Fig. 2-5 shows the theoretical, from (2.3), simulated and measured  $\theta$  over the locking range. Measured waveforms at the edges and in the center of the locking bandwidth are



Figure 2-5:  $\theta$  w.r.t. the injection frequency for the differential 4-stage ring oscillator operating quasi-linearly with  $\alpha = 10$ .

shown in Fig. 2-6.



Figure 2-6: Measured waveforms for the differential 4-stage ring oscillator operating quasi-linearly ( $R_E = 20 \ \Omega$ ): injection input  $V_{inj}$ , stage 1 input  $V_{i1}$  and stage 1 output  $V_{O1}$  with  $\alpha = 10$ ; (top)  $V_{inj}$  is  $-99.84^{\circ}$  out of phase with  $V_{i1}$  at 3.375 MHz, the upper edge of the locking bandwidth; (middle)  $V_{inj}$  is in phase with  $V_{i1}$  in the center of the lock range at 3.21 MHz; (bottom)  $V_{inj}$  is 79.75° out of phase with the oscillating input waveform  $V_{i1}$ , at 3.098 MHz, the lower edge of the locking bandwidth.
# 2.4 Time-Domain Model For Injection Locking in Differential Ring Oscillators

The 4-stage differential ring oscillator in Fig. 2-3 with zero degeneration resistors  $(R_E = 0)$  has output waveforms which do not have a single dominant frequency component. Hence, phasor analysis and the frequency domain injection locking model [50, 52] do not apply. We now derive the time-domain, delay based model to study the injection-locking phenomena in such oscillators. First, we derive analytical expressions for the oscillator time-domain waveforms and the effect of an injection signal in sections 2.4.1 and 2.4.2. They are used to arrive at a delay based model for the oscillator and expressions for injection-lock range in section 2.4.3. The time-domain model predictions are compared to measurements and simulations for an experimental prototype and to the predictions of the quasi-linear model from section 2.3. Using the delay based, time-domain model we further predict and experimentally verify the injection-locking dynamics in section 2.4.4.

# 2.4.1 Analytical Expressions for the Oscillator Waveforms

The operation of a delay stage can be modeled as shown in Fig. 2-7. The V-I converter (Q1-Q2) acts as a comparator on the differential input $(V_{I,p} - V_{I,n})$; its differential output current is a step waveform with amplitude $I_{BIAS}$, which is driven into the differential load $(2R_L // C_L/2)$. The differential output voltage $v_d = V_{O,p} - V_{O,n}$ is the step response of the R-C circuit and thus an exponential waveform, as shown in Fig. 2-8. Assuming an N-stage ring oscillator, the falling section of the output waveform of a stage for $0 \le t \le T/2$ is given by

$$v_d(t) = -V_{a,max} + (V_a + V_{a,max}) \cdot e^{\frac{-t}{\tau}}$$
(2.4)

where  $V_a$  is the amplitude of the oscillations;  $V_{a,max} = I_{BIAS}R_L$  is the maximum possible amplitude;  $\tau = R_L C_L$  is the load time constant. The next stage has  $v_d$  as an input and switches its current when  $v_d = 0$  so that the delay  $t_d$  through each stage



Figure 2-7: For non-linear operation each stage of Fig. 2-3 is modeled as a hard amplitude limiting mechanism, whose output current drives an R-C load.



Figure 2-8: Differential output voltage  $v_d(t)$  of a stage in the 4-stage ring-oscillator of Fig. 2-3.

is determined from $v_d(t_d) = 0$; the period T of the oscillation is $2Nt_d$. During T/2, $v_d$ goes from $V_a$ to $-V_a$ so that $v_d(T/2) = -V_a$ in (2.4). Combining these constraints, one obtains:

$$\frac{V_{a,max} + V_a}{V_{a,max} - V_a} = \left(\frac{V_{a,max} + V_a}{V_{a,max}}\right)^N.$$
(2.5)

Given N, (2.5) can be solved for  $V_a$ ; then  $t_d$  and T can be computed. E.g., for N = 4,  $V_a = 0.84V_{a,max}$  and  $t_d = 0.61\tau$ . As N becomes large  $V_a \to V_{a,max}$  and  $t_d \to \tau \ln(2)$ .
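The solution of (2.5) can be reproduced numerically. The sketch below is a minimal bisection solver, not part of the original derivation; the function names are illustrative. It uses the normalized amplitude $x = V_a/V_{a,max}$ and the relation $t_d = \tau \ln(1 + x)$ that follows from $v_d(t_d) = 0$ in (2.4).

```python
import math


def amplitude_ratio(n_stages, tol=1e-12):
    """Solve (2.5) for x = V_a / V_a,max by bisection.

    Dividing both sides of (2.5) by (1 + x) gives
    (1 - x) * (1 + x)**(N - 1) = 1. x = 0 is a trivial root,
    so the search starts slightly above zero.
    """
    if n_stages < 3:
        raise ValueError("(2.5) has no non-trivial root for fewer than 3 stages")

    def residual(x):
        return (1.0 - x) * (1.0 + x) ** (n_stages - 1) - 1.0

    lo, hi = 1e-6, 1.0  # residual(lo) > 0 and residual(hi) = -1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def stage_delay_over_tau(n_stages):
    """t_d / tau from v_d(t_d) = 0 in (2.4): t_d = tau * ln(1 + x)."""
    return math.log(1.0 + amplitude_ratio(n_stages))


for n in (4, 8, 16):
    x = amplitude_ratio(n)
    print(f"N = {n:2d}: V_a/V_a,max = {x:.3f}, t_d/tau = {stage_delay_over_tau(n):.3f}")
```

For N = 4 the solver returns values that round to the 0.84 and 0.61 quoted above, and for large N the delay ratio approaches $\ln(2)$.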



Figure 2-9: Effect of the injection signal on the output voltage  $v_d = v_{d,i} + v_{d,inj}$ 

# 2.4.2 Effect of an Injection Signal

We focus on the first stage's differential output $v_d = V_{O1,p} - V_{O1,n}$ when an injection signal $V_{INJ,p} - V_{INJ,n}$ is present. Since the output load is linear, the output voltage $v_d$ can be calculated as the superposition of the output voltage $v_{d,i}$ due to the current $i_{d,i} = I_{I1,p} - I_{I1,n}$ and the output voltage $v_{d,inj}$ due to the current $i_{d,inj} = I_{INJ,p} - I_{INJ,n}$, as shown in Fig. 2-9. When the zero crossings of the input voltage of the injection stage $V_{INJ,p} - V_{INJ,n}$ have a delay $\Delta$ compared to the zero crossings of the input of the first stage $V_{I1,p} - V_{I1,n}$, then the transitions in $i_{d,inj}$ have a delay $\Delta$ compared to the transitions in $i_{d,i}$. Now, when the output voltage component $v_{d,inj}$ adds to the component $v_{d,i}$, the zero-crossing of $v_d$ is delayed by an amount d compared to the zero-crossing of $v_{d,i}$, which corresponds to the non-injection case. As a result, due to the presence of the injection signal, the delay through the first stage is increased by an amount d and the oscillator can now oscillate with a period T + 2d. We can develop the relationship between d and $\Delta$ as follows. The exponentially decreasing part of $v_{d,i}$ is given by (2.4) and for $v_{d,inj}$ we obtain:

$$v_{d,inj}(t) = -V_{ainj,max} + (V_{ainj} + V_{ainj,max}) \cdot e^{\frac{-t+\Delta}{\tau}}.$$
(2.6)

and we define

$$\frac{1}{\alpha} = \frac{i_{d,inj}}{i_{d,i}} = \frac{V_{ainj}}{V_a} = \frac{V_{ainj,max}}{V_{a,max}}.$$
(2.7)

The extra delay d due to the injection is,

$$d = t_{zc}(\Delta) - t_{zc}(\Delta = 0).$$
(2.8)

where  $t_{zc}$  denotes the time of the zero-crossing of the falling part of the waveform  $v_d(t)$ . Using (2.8),(2.4) and (2.6), the following relationship is obtained:

$$d(\Delta) = \tau \ln\left(\frac{V_{a,max} + V_a + (V_{ainj,max} + V_{ainj}) \cdot e^{\frac{\Delta}{\tau}}}{V_{a,max} + V_a + V_{ainj,max} + V_{ainj}}\right).$$
(2.9)

# 2.4.3 Injection Locking Range

Fig. 2-10 shows the differential output for each stage in Fig. 2-3 during injection once lock has been achieved. As the delay  $\Delta$  is increased, the zero-crossing of  $v_d$  moves forward, and d keeps increasing, until  $\Delta = \Delta_{max}$  with  $v_{d,i}(\Delta_{max}) = -V_{ainj}$ ; if the injection waveform is delayed beyond  $\Delta_{max}$ , different waveform segments overlap and (2.9) is not valid anymore. Additionally, d starts decreasing as shown in Fig. 2-13 and we have reached the edge of the lock range. Then (2.4) gives,

$$e^{\frac{\Delta_{max}}{\tau}} = \left(\frac{V_{a,max} + V_a}{V_{a,max} - V_{ainj}}\right).$$
(2.10)

Combining (2.10) with (2.7) and (2.9), one obtains:

$$d_{max} = d(\Delta_{max}) = \tau \ln\left(\frac{V_{a,max}}{V_{a,max} - V_{ainj}}\right).$$
(2.11)



Figure 2-10: Waveforms for the differential 4-stage ring oscillator when injection locked; the last 3 stages have a delay  $t_d$  and the first stage has a delay  $t_d + d$  due to the injection.

Similarly, for negative  $\Delta$ , the zero-crossing of  $v_d$  keeps moving backward, and d keeps decreasing, until  $v_{d,i}(T/2 + \Delta_{min}) = V_{ainj}$ . If  $\Delta$  is decreased below  $\Delta_{min}$ , the assumptions behind the derivation of (2.9) are not valid anymore and d starts increasing again as shown in Fig. 2-13. The minimum d is then

$$d_{min} = d(\Delta_{min}) = \tau \ln\left(\frac{V_{a,max}}{V_{a,max} + V_{ainj}}\right).$$
(2.12)

Note that the maximum and minimum delays can be increased and decreased respectively by decreasing  $\alpha$  and thus increasing the injection current  $i_{d,inj}$  and  $V_{ainj}$ .

We conclude that the presence of the injection signal introduces an extra delay, d, in the oscillator's loop. For an injection signal with period $T_{inj}$, locking can occur if a d exists so that $d = (T_{inj} - T)/2$. Given that, for a given injection level, $d_{min} \leq d \leq d_{max}$, the following locking bandwidth exists:

$$T + 2d_{min} < T_{inj} < T + 2d_{max}.$$
 (2.13)
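The lock-range expressions (2.9) to (2.13) can be evaluated with a few lines of code. The sketch below is illustrative only: the function names are not from the original work, and $\tau$ is not taken from component values but is back-calculated from the measured free-running frequency (3.501 MHz) using the N = 4 results $V_a = 0.84V_{a,max}$ and $t_d = 0.61\tau$. The resulting numbers therefore depend on this assumption and are not claimed to reproduce the calculated values in Table 2.1.

```python
import math


def extra_delay(delta, tau, alpha):
    """d(Delta) from (2.9), with (2.7) substituted so only alpha remains."""
    return tau * math.log((alpha + math.exp(delta / tau)) / (alpha + 1.0))


def delta_max(tau, va_ratio, alpha):
    """Delta_max from (2.10); va_ratio = V_a / V_a,max."""
    return tau * math.log((1.0 + va_ratio) / (1.0 - va_ratio / alpha))


def delay_limits(tau, va_ratio, alpha):
    """(d_min, d_max) from (2.12) and (2.11)."""
    v_inj = va_ratio / alpha  # V_ainj / V_a,max, using (2.7)
    d_max = tau * math.log(1.0 / (1.0 - v_inj))
    d_min = tau * math.log(1.0 / (1.0 + v_inj))
    return d_min, d_max


def lock_range(f_o, tau, va_ratio, alpha):
    """Lower and upper lock frequencies from (2.13)."""
    period = 1.0 / f_o
    d_min, d_max = delay_limits(tau, va_ratio, alpha)
    return 1.0 / (period + 2.0 * d_max), 1.0 / (period + 2.0 * d_min)


# Assumed inputs: N = 4 results from the text and the measured f_o.
N_STAGES = 4
VA_RATIO = 0.84
TD_OVER_TAU = 0.61
F_O = 3.501e6
TAU = 1.0 / (F_O * 2 * N_STAGES * TD_OVER_TAU)  # from T = 2 * N * t_d

for alpha in (6.8, 10.0, 18.0):
    f_low, f_high = lock_range(F_O, TAU, VA_RATIO, alpha)
    print(f"alpha = {alpha:4.1f}: {f_low / 1e6:.3f} MHz to {f_high / 1e6:.3f} MHz")
```

Evaluating `extra_delay` at `delta_max` returns the same value as the closed form (2.11), which is a convenient consistency check on the algebra.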



Figure 2-11: Measured waveforms for the differential 4-stage ring oscillator operating non-linearly ($R_E = 0 \ \Omega$) with $\alpha = 10$: the injected signal, $V_{inj}$, the stage 1 input voltage, $V_{I1}$, and the stage 1 output voltage, $V_{O1}$, are shown for varying $\Delta$, the delay between $V_{INJ}$ and $V_{I1}$; (top) $\Delta = \Delta_{min}$ at the upper edge of the lock range at 3.61 MHz; $t_{d1}$, the delay through stage 1, i.e. the delay between $V_{O1}$ and $V_{I1}$, is 31.4 ns; (middle) $\Delta = 0$ and $t_{d1} = 36.1$ ns in the middle of the lock range at 3.49 MHz; (bottom) $\Delta = \Delta_{max}$ and $t_{d1} = 39.1$ ns at the lower edge of the lock range at 3.37 MHz.

#### **Experimental Verification**

The same oscillator prototype used for the simulations and measurements in quasi-linear operation was used for the simulations and measurements in non-linear operation. To verify the time-domain model, we operated the oscillator non-linearly by setting $R_E$ to 0; the resistors in the injection stage were again adjusted according to the desired ratio $\alpha$. The free-running frequency ($f_o$) was measured to be 3.501 MHz. Measured waveforms at the edges and in the center of the locking bandwidth are shown in Fig. 2-11.

Measurements, simulations and predictions [using (2.13)] of the edges of the locking range are plotted in Fig. 2-12 as a function of the ratio  $\alpha$ . In Table 2.1, the measurements for the locking range are compared to predictions using the time-domain



Figure 2-12: Edges of the locking range w.r.t.  $\alpha$  for the differential 4-stage ring oscillator operating non linearly

model and the quasi-linear model in (2.2). The predictions from the time-domain model are substantially more accurate than the quasi-linear model and their errors are close to the component tolerances. The dependence of d on  $\Delta$  obtained from measurements, simulations, and (2.9) is shown in Fig. 2-13 and good correspondence is obtained<sup>3</sup>.

<sup>3</sup>The deviations between measurements and calculations close to the edges of the lock range can be traced to the fact that the real waveforms are rounded off at their extremes (see Fig. 2-11) compared to the ideal waveforms (see Fig. 2-8).

Table 2.1: Lock range measurements and predictions for the 4-stage differential ring oscillator operating non linearly

| $\alpha$ | Measurement [MHz] | Time-domain Model Calc. [MHz] | Time-domain Model Error [%] | Quasi-linear Model Calc. [MHz] | Quasi-linear Model Error [%] |
|----------|-------------------|-------------------------------|-----------------------------|--------------------------------|------------------------------|
| 6.8      | 0.349             | 0.389                         | 11.4                        | 0.585                          | 67.62                        |
| 10       | 0.242             | 0.265                         | 9.5                         | 0.399                          | 64.87                        |
| 18       | 0.141             | 0.15                          | 6.3                         | 0.227                          | 60.09                        |



Figure 2-13: Calculated, simulated and measured d as a function of  $\Delta$  for the differential 4-stage ring oscillator operating non-linearly with (a)  $\alpha = 10$  and (b)  $\alpha = 6.8$ .

## 2.4.4 Injection Locking Dynamics

We now analyze the injection lock dynamics, i.e. the change of  $\Delta$  (and d) over time when the injection frequency changes. The update of  $\Delta$  (and thus d) during locking is a discrete-time process and happens for every zero-crossing of the injection signal which we use as our time-reference. If we know  $\Delta[n]$  at the *n*-th zero-crossing, we can find  $\Delta[n+1]$  at the (n+1)-th zero crossing using Fig. 2-14:

$$\Delta[n+1] = \Delta[n] - d(\Delta[n]) + \frac{(T_{inj} - T)}{2}.$$
(2.14)

These updates continue until  $d \to (T_{inj} - T)/2$ . Since  $d(\Delta)$  is a non-linear function, (2.14) is a non-linear difference equation.

To gain some insight, we approximate  $d(\Delta)$  around  $\Delta = 0$  in (2.9) as  $d(\Delta) = m \cdot \Delta$ ; note that  $m \leq 1$  and that m is larger at  $\Delta = 0$  for  $\alpha = 6.8$  as compared to  $\alpha = 10$ .



Figure 2-14: Injection lock transient waveforms for the differential 4-stage ring oscillator used for the derivation of  $\Delta[n+1]$  from  $\Delta[n]$ .

Substituting this linear approximation for  $d(\Delta)$  into (2.14), we obtain:

$$\Delta[n] = (1-m)^n (\Delta[0] - \frac{T_{inj} - T}{2m}) + \frac{T_{inj} - T}{2m}.$$
(2.15)

The locking dynamics are then of first order, and for larger injection levels (i.e., smaller $\alpha$ and larger m) the transient time of a step response is shorter.
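The difference equation (2.14) and its linearized solution (2.15) are straightforward to iterate numerically. The sketch below is an illustration under stated assumptions rather than the procedure used for the figures: $\tau$ is back-calculated from the measured $f_o = 3.501$ MHz with $t_d = 0.61\tau$, and the slope $m = 1/(\alpha + 1)$ is obtained by differentiating (2.9) at $\Delta = 0$ after substituting (2.7). All function names are hypothetical.

```python
import math


def extra_delay(delta, tau, alpha):
    """d(Delta) from (2.9), with (2.7) substituted so only alpha remains."""
    return tau * math.log((alpha + math.exp(delta / tau)) / (alpha + 1.0))


def delta_for_delay(d, tau, alpha):
    """Invert (2.9): the Delta that produces a given extra delay d."""
    return tau * math.log((alpha + 1.0) * math.exp(d / tau) - alpha)


def iterate_lock(delta0, t_inj, t_free, tau, alpha, steps):
    """Iterate the non-linear difference equation (2.14)."""
    deltas = [delta0]
    for _ in range(steps):
        d = extra_delay(deltas[-1], tau, alpha)
        deltas.append(deltas[-1] - d + (t_inj - t_free) / 2.0)
    return deltas


def linearized_delta(n, delta0, t_inj, t_free, alpha):
    """Closed form (2.15) with m = 1 / (alpha + 1)."""
    m = 1.0 / (alpha + 1.0)
    final = (t_inj - t_free) / (2.0 * m)
    return (1.0 - m) ** n * (delta0 - final) + final


# Assumed prototype values (see lead-in).
F_O = 3.501e6
T_FREE = 1.0 / F_O
TAU = T_FREE / (2 * 4 * 0.61)
ALPHA = 10.0

# Step from 3.4 MHz to 3.6 MHz, starting from the locked state at 3.4 MHz.
T_INJ = 1.0 / 3.6e6
delta_start = delta_for_delay((1.0 / 3.4e6 - T_FREE) / 2.0, TAU, ALPHA)
trace = iterate_lock(delta_start, T_INJ, T_FREE, TAU, ALPHA, steps=600)

print(f"initial Delta = {trace[0] * 1e9:.1f} ns, settled Delta = {trace[-1] * 1e9:.1f} ns")
print(f"settled d = {extra_delay(trace[-1], TAU, ALPHA) * 1e9:.2f} ns")
```

Because (2.15) relies on a slope evaluated at $\Delta = 0$, it tracks the iteration of (2.14) closely only for small frequency offsets; for a large step such as the one above, iterating (2.14) directly is the safer choice.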

#### **Experimental Verification**

Fig. 2-15 compares the simulated and calculated evolution of  $\Delta$  in response to the injection frequency step change for the prototype oscillator. The simulations and calculations agree very well.

To experimentally observe the injection lock dynamics, a square wave with the appropriate amplitude was fed into the FM modulation input of the generator to obtain the desired step change in the injection frequency and was used to trigger the real-time oscilloscope to capture the waveforms on the board, as shown in Fig. 2-16. However, the step change in the injection frequency was not instantaneous after the step trigger, and the injection frequency took a few cycles to settle to the new frequency, as can be seen in Fig. 2-17. In order to compare the measured and calculated dynamics, we wait for 1.5 cycles, take the initial value of $\Delta$ at the point shown as T = 0 in Fig. 2-17 and then use (2.14) to calculate the dynamics. The measured and calculated dynamics are plotted in Fig. 2-18 for $\alpha = 6.8$ and $\alpha = 10$ and a very good correspondence between measurements and model calculations is obtained. Close to first-order dynamics are observed and, as expected, larger injection levels (smaller $\alpha$) lead to faster settling.

Figure 2-15: Simulated and calculated injection lock dynamics of the differential 4-stage ring oscillator for a step change in frequency from 3.4 MHz to 3.6 MHz at $\alpha = 10$ (top) and $\alpha = 6.8$ (bottom).



Figure 2-16: Experimental setup used to observe the injection lock dynamics of the 4-stage differential ring oscillator.



Figure 2-17: After an FM modulation step trigger, the injection frequency generator settled to the new frequency in about 1.5 cycles; that time point is labeled as T = 0.



Figure 2-18: Measured and calculated injection lock dynamics of the 4-stage differential ring oscillator for a step change in frequency from 3.4 MHz to 3.6 MHz at $\alpha = 10$ (top) and $\alpha = 6.8$ (bottom).

# 2.5 Time-Domain Model For Injection Locking in Single-Ended Inverter Based Ring Oscillator

The delay based method can be used for other types of non-harmonic oscillators as long as a relationship between the extra stage delay (d) and the delay ($\Delta$) between the injection signal and the relevant internal oscillator signal is available. This d vs. $\Delta$ relationship needs to be developed specifically for the oscillator topology under study using analytical equations, computer simulations or experimental measurements. In section 2.4 we derived the d vs. $\Delta$ relationship analytically for the differential ring oscillator in Fig. 2-3.

We now demonstrate the versatility of the delay based method by applying it to a different non-harmonic oscillator, in particular, the single-ended inverter based ring oscillator shown in Fig. 2-19. The d vs. $\Delta$ relationship will be derived using simulations and the appropriate $(d_{min}, d_{max})$ will be determined to predict



Figure 2-19: Single-ended 3-stage CMOS-inverter based ring oscillator, with an injection stage operating on the first stage. Each of the three stages is made of nine (9x) identical inverters. For closed-loop operation switch S1 is closed. The injection level can be switched from  $\alpha = 9$  to  $\alpha = 4.5$  by opening or closing switches (S2,S3).

its lock range with (2.13), and its injection lock dynamics with (2.14).

#### **Experimental Prototype**

To obtain experimental data we built a prototype board for the single-ended 3-stage ring oscillator shown in Fig. 2-19. Matched CMOS inverters available on CD4007UB chips and operating from a 5 V supply were used to build a 10 MHz oscillator. For simulations, we used openly available models for the NMOS and PMOS transistors on the CD4007UB chip [61]. To account for the board parasitics loading the inverters, the parasitic resistance and capacitance at the inverter inputs were adjusted so that the measured self-oscillation frequency ($f_o$) matched the simulated frequency.

#### 2.5.1 d vs. $\Delta$ Relationship

The d vs. $\Delta$ relationship for this oscillator cannot be derived analytically due to the lack of sufficiently accurate equations describing the transient waveforms in a single-ended ring oscillator. The d vs. $\Delta$ relationship can, however, be obtained by simulating the delay through an inverter for different input and injection signal configurations. In Fig. 2-19, by opening S1 we obtain a circuit with two input signals IN1 and INJ. We can now measure the total inverter delay, $t_d + d$, for different values of the delay



Figure 2-20: Waveforms and the definition of d and  $\Delta$  for the single-ended 3-stage ring oscillator in Fig. 2-19.

 $\Delta$  between the inverter input and the injection signal, as shown in Fig. 2-20, and then obtain the d vs  $\Delta$  relationship shown in Fig. 2-21. For the presented prototype, when switches S2 and S3 are open, the injection ratio  $\alpha$  is 9, and when they are closed, the injection ratio  $\alpha$  is 4.5.

We also measured the d vs.  $\Delta$  relationship when the oscillator is operating in closed loop and injection locked. These graphs have been added to Fig. 2-21. Note the excellent correspondence between the results for both cases. This validates using the open-loop relationship to predict the injection locking characteristics of ring oscillators using the basic inverter stages.

We further verified the correspondence between the simulated d vs.  $\Delta$  characteristic and the characteristic measured on the experimental prototype; excellent correlation is obtained both for  $\alpha = 9$  in Fig. 2-22 and for  $\alpha = 4.5$  in Fig. 2-23. Also for closed-loop operation a very good correspondence between measurements and simulations is obtained for the d vs.  $\Delta$  relationship as shown in Fig. 2-24. We can now proceed with the calculations of the injection-locking range and injection-locking



Figure 2-21: d vs. $\Delta$ relationship for the single-ended 3-stage ring oscillator obtained through open-loop simulations for different values of $\alpha$. Also shown are the d vs. $\Delta$ relationships when the oscillator is operating in closed loop and injection locked.

dynamics using the inverter d vs  $\Delta$  characteristic.

# 2.5.2 Injection Locking Range

The $(\Delta_{max}, \Delta_{min})$ of the oscillator correspond to the points on the open-loop curves in Fig. 2-22 and Fig. 2-23 where the slope goes to 0. Indeed, for $\Delta$'s larger than $\Delta_{max}$ and $\Delta$'s smaller than $\Delta_{min}$ the injection signal cannot provide the delay increase or decrease required to lock the oscillator. The extremum points, where the slope of the curve $\partial d/\partial \Delta \rightarrow 0$, thus give us $(\Delta_{max}, \Delta_{min})$, which in turn give us a corresponding $(d_{max}, d_{min})$. The lock range of the single-ended 3-stage ring oscillator with $f_o = 9.76$ MHz can now be calculated using (2.13). In Fig. 2-25 the measured, calculated and simulated edges of the locking range are plotted for different injection levels; Table 2.2 compares the calculated, simulated and measured lock ranges for different injection levels $\alpha$. The errors are within the expected range due to component tolerances.
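The procedure described above (locate the extrema of a sampled d vs. $\Delta$ curve, then apply (2.13)) can be sketched as follows. The sample points below are hypothetical placeholder values chosen only to have a plausible shape; they are not the simulated or measured data of Fig. 2-22 or Fig. 2-23. Only $f_o = 9.76$ MHz is taken from the text, and the function names are illustrative.

```python
def delay_extrema(curve):
    """Return ((delta_at_min, d_min), (delta_at_max, d_max)) of a sampled curve.

    curve is a list of (delta, d) pairs in seconds. The extrema of d are the
    points where the slope of the d vs. Delta curve goes to zero.
    """
    lowest = min(curve, key=lambda point: point[1])
    highest = max(curve, key=lambda point: point[1])
    return lowest, highest


def lock_range_from_curve(curve, f_o):
    """Lock-range edges from (2.13) using the extrema of the sampled curve."""
    (_, d_min), (_, d_max) = delay_extrema(curve)
    period = 1.0 / f_o
    f_min = 1.0 / (period + 2.0 * d_max)
    f_max = 1.0 / (period + 2.0 * d_min)
    return f_min, f_max


NS = 1e-9
F_O = 9.76e6  # free-running frequency quoted in the text

# HYPOTHETICAL open-loop d vs. Delta samples (Delta, d), for illustration only.
CURVE = [
    (-20 * NS, -1.0 * NS),
    (-15 * NS, -1.4 * NS),
    (-10 * NS, -1.2 * NS),
    (-5 * NS, -0.6 * NS),
    (0 * NS, 0.0 * NS),
    (5 * NS, 0.9 * NS),
    (10 * NS, 1.9 * NS),
    (15 * NS, 2.7 * NS),
    (20 * NS, 2.5 * NS),
]

(delta_min, d_min), (delta_max, d_max) = delay_extrema(CURVE)
f_min, f_max = lock_range_from_curve(CURVE, F_O)
print(f"Delta_min = {delta_min / NS:.0f} ns, Delta_max = {delta_max / NS:.0f} ns")
print(f"lock range: {f_min / 1e6:.2f} MHz to {f_max / 1e6:.2f} MHz")
```

With real data, a finer sweep of $\Delta$ near the extrema (or interpolation between samples) would reduce the quantization of $d_{min}$ and $d_{max}$.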



Figure 2-22: Open-loop d vs.  $\Delta$  plots obtained through measurements and simulations at  $\alpha = 9$  for the single-ended 3-stage ring oscillator.



Figure 2-23: Open-loop d vs.  $\Delta$  plots obtained through measurements and simulations at  $\alpha = 4.5$  for the single-ended 3-stage ring oscillator.



Figure 2-24: Closed-loop d vs.  $\Delta$  plots obtained through measurements and simulations for the single-ended 3-stage ring oscillator.



Figure 2-25: Edges of the locking range w.r.t.  $\alpha$  for the single-ended 3-stage ring oscillator

|                      | $f_{min}$ | $f_{max}$ | Lock Range | Rel. error<br>w.r.t. Meas |
|----------------------|-----------|-----------|------------|---------------------------|
|                      | [MHz]     | [MHz]     | [MHz]      | [%]                       |
| $\alpha = 9$ Sims    | 9.18      | 10.10     | 0.92       | -8                        |
| $\alpha = 9$ Meas    | 9.10      | 10.10     | 1.00       | _                         |
| $\alpha = 9$ Calc.   | 9.26      | 10.04     | 0.78       | -22                       |
| $\alpha = 4.5$ Sims  | 8.80      | 10.56     | 1.75       | -5                        |
| $\alpha = 4.5$ Meas  | 8.57      | 10.42     | 1.85       | _                         |
| $\alpha = 4.5$ Calc. | 8.80      | 10.34     | 1.54       | -17                       |

Table 2.2: Lock range predictions, measurements, and simulations for the single-ended 3-stage ring oscillator

# 2.5.3 Injection Locking Dynamics

Using the d vs. $\Delta$ characteristic and (2.14), we can now also predict the injection lock dynamics of the oscillator. The simulated and calculated dynamics are shown in Fig. 2-27 and the measured and calculated dynamics are plotted in Fig. 2-26. A very good correspondence between measurements, simulations and model calculations is obtained. Close to first-order dynamics are indeed observed and, as expected, larger injection levels (smaller $\alpha$) lead to faster settling. Note that the difference in the final value of $\Delta$ between the measurements and calculations is of similar size to the difference between the d vs. $\Delta$ relationships obtained under open-loop and closed-loop conditions in Fig. 2-24.



Figure 2-26: Measured and calculated injection lock dynamics for the single-ended 3-stage ring oscillator for a step change in frequency from 9.35MHz to 9.75MHz at (top)  $\alpha = 9$  (bottom)  $\alpha = 4.5$ .



Figure 2-27: Simulated and calculated injection lock dynamics for the single-ended 3-stage ring oscillator for a step change in frequency from 9.35MHz to 9.75MHz at (top)  $\alpha = 9$  (bottom)  $\alpha = 4.5$ .

# 2.6 Summary

A time-domain delay based model is developed to predict the injection locking behavior of non-harmonic oscillators such as ring oscillators. The effect of the injection signal on the oscillator is modeled with the d vs. $\Delta$ characteristic, which captures the additional delay, d, in a stage due to the effect of the injection signal with a delay $\Delta$. Using this characteristic, the injection locking range as well as the injection locking dynamics can be accurately modeled and predicted.

This modeling approach is applied to a differential 4-stage ring oscillator where analytical expressions for the waveforms could be derived along with an analytical expression for the d vs.  $\Delta$  characteristic. Good correlation is shown between the predictions, simulations and measurements of the lock range and dynamics at different injection levels for a prototype oscillator.

Versatility of the modeling approach is demonstrated by analyzing the locking behavior of a single-ended 3-stage CMOS-inverter based ring oscillator. In this case accurate analytical expressions for the oscillator waveforms cannot be obtained and the d vs.  $\Delta$  characteristic is derived from simulations and measurements on a single inverter stage in open loop. Using this characteristic good correspondence between predictions for the locking bandwidth and dynamics and measurements and simulations for a prototype oscillator is obtained at different injection levels.

In summary, the presented time-domain delay based modeling approach can be applied to any non-harmonic oscillator as long as the relationship between the extra delay, d, and the delay, $\Delta$, between the injection signal and the relevant internal oscillator signal is available. As we have shown with examples in this chapter, this relationship can be obtained either analytically or through experimental measurements and computer simulations.

# Chapter 3

# A 9GHz Sub-Integer Clock-Frequency Synthesizer at Ultra-Low Supply

# 3.1 Introduction

Exascale computing capable of at least a million-trillion operations per second will be critical for a wide spectrum of applications in science and technology. Reaching such a 100-fold increase in speed over the fastest supercomputers in broad use today will require extreme parallelism. With such massive parallelism on multiple vertical levels, the energy required to communicate over billions of parallel threads will be the critical limitation to energy efficiency [62]. High speed serial communication operating at ultra-low supplies improves the energy-efficiency and lowers the power envelope of a system doing an exaflop of loops. The focus area of this chapter is clock synthesis for such energy-efficient interconnect applications operating at high speeds and ultra-low supplies.

At high data rates, embedded sub-rate synthesizers and clocking are frequently used to reduce power [63]. For use in chip-to-chip serial links they require low phase noise, fast settling and fractional division ratios [64]. Such embedded clock synthesizers when operated from ultra-low core supply could benefit from lower dynamic power due to voltage scaling. The ability to operate from the core supply further avoids the complexities associated with separate supply domains and DC-DC converters, and



Figure 3-1: Block diagram of the ultra-low supply sub-integer clock-frequency synthesizer using an ILRO based prescaler for the divide-by-3 function, followed by a phase-switching based sub-integer programmable divider and an automatic injection-lock calibration loop for the ILRO and VCO.

allows for better integration [65,66]. Earlier clock synthesizers for ultra-low supply voltage were either limited in speed due to the slow prescaler performance when using traditional dividers [66], or were limited by the lock range of the injection-locked frequency dividers used as prescalers [67,68]. Also, these approaches had to resort to fractional-N synthesis to achieve fine resolution [69].

In this chapter, a 0.5-V, 9-GHz sub-integer clock-frequency synthesizer is presented that demonstrates design techniques to increase the speed of operation at an ultra-low supply, such as a multi-phase injection-locked prescaler with automatic injection-lock calibration, and to obtain fine frequency resolution using a programmable sub-integer divider. On top of these design advances, it also takes advantage of the low $V_T$ and reduced junction capacitance in 45nm SOI-CMOS to achieve the highest reported speed in the literature at this ultra-low supply.

# **3.2** Architecture and circuit description

The programmable sub-integer synthesizer uses the top-level architecture shown in Fig. 3-1. It supports feedback division ratios of 96, 96.5, 97, 97.5, 98, and 98.5 over an output frequency ($F_{vco}$) range of 9 GHz ± 1 GHz with a reference frequency ($F_{ref}$) of 95 MHz. Using an injection-locked ring oscillator (ILRO) overcomes the speed roadblock of traditional prescalers at low supplies. Automatic calibration with a frequency counter and off-chip software control is used to ensure the ILRO is operating in injection-locking mode. The multiple phases available from the ILRO output make it possible to implement a fractional division ratio with a phase-switching programmable divider [70]. This offers the simplicity of an integer-N synthesizer while achieving fine frequency resolution without compromising loop bandwidth or settling times. Wider bandwidth and lower division ratios help in further suppression of VCO phase noise and less amplification of in-band noise. The programmable divider does not rely on time-varying modulus control to achieve sub-integer division and, unlike fractional-N synthesizers, does not create fractional spurs. The synthesizer uses a differential charge-pump (CP), similar in design to [71], with a nominal current of 1 mA. A standard differential 2nd-order loop filter is used with a series R-C ($R = 8\ \mathrm{k}\Omega$, $C = 80$ pF) in parallel with a 4 pF capacitor. The differential filter output voltage, $V_{cp}$, tunes the LC VCO. The VCO uses a cross-coupled inverter for low supply voltage operation and has a rail-to-rail output signal. For testing purposes, a 2:1 MUX has been inserted at the VCO output, which can select between the VCO output and an off-chip input $F_{inj}$, or can be tri-stated. The MUX output connects to the ILRO-based prescaler through AC coupling.
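The arithmetic behind these numbers can be checked with a short script. The sketch below assumes the usual locked-loop relation $F_{vco} = N \cdot F_{ref}$ and treats the loop filter as its single-ended equivalent (series R-C in parallel with a second capacitor), for which the zero and pole follow from the standard impedance expression. These corner frequencies are derived here for illustration and are not quoted in the original text; the names are hypothetical.

```python
import math

F_REF = 95e6                               # reference frequency from the text, Hz
RATIOS = [96, 96.5, 97, 97.5, 98, 98.5]    # supported feedback division ratios


def vco_frequency(ratio, f_ref=F_REF):
    """Locked output frequency, assuming F_vco = ratio * F_ref."""
    return ratio * f_ref


# Frequency step between adjacent sub-integer ratios (ratio step of 0.5).
FREQ_STEP = 0.5 * F_REF

# Loop filter values from the text: series R-C1 in parallel with C2.
R = 8e3
C1 = 80e-12
C2 = 4e-12


def loop_filter_corners(r, c1, c2):
    """Zero and pole frequencies (Hz) of the series R-C1 || C2 impedance."""
    f_zero = 1.0 / (2.0 * math.pi * r * c1)
    c_series = c1 * c2 / (c1 + c2)
    f_pole = 1.0 / (2.0 * math.pi * r * c_series)
    return f_zero, f_pole


for ratio in RATIOS:
    print(f"N = {ratio:5.1f} -> F_vco = {vco_frequency(ratio) / 1e9:.4f} GHz")
print(f"frequency step = {FREQ_STEP / 1e6:.1f} MHz")

f_zero, f_pole = loop_filter_corners(R, C1, C2)
print(f"filter zero = {f_zero / 1e3:.1f} kHz, filter pole = {f_pole / 1e6:.2f} MHz")
```

With a 95 MHz reference, the half-integer ratio step corresponds to a 47.5 MHz output step, half of what an integer-only divider would give at the same reference frequency.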

### 3.2.1 PFD, CP, and VCO

The phase-frequency detector (PFD) design, Fig. 3-2, uses extra delay in reset path to ensure minimum pulse width to avoid deadzone induced low loop gain and increased jitter. To increase noise immunity the entire loop including the charge-pump and the loop filter uses fully differential design. Fig. 3-3 shows this differential charge-pump



Figure 3-2: PFD with extra delay in reset path.



Figure 3-3: Differential charge-pump with unity-gain buffer based architecture along with common-mode feedback circuit.

where a unity-gain buffer based architecture is used along with common-mode control to keep voltage-controlled oscillator (VCO) voltage  $(V_{cp})$  at optimal value and reject common-mode noise. At 0.5V supply, this structure has  $V_{cp}$  range of <200 mV to have  $>300 \,\mathrm{mV}$  of voltage headroom for saturation region operation of the two stack MOS transistors. The voltage-controlled oscillator (VCO) uses a cross-coupled inverter architecture for low supply operation, as shown in Fig. 3-4. The oscillator swings rail-to-rail and is designed to cover  $\pm 1$ GHz band. It uses a closely-spaced digitally tuned coarse varactor bank that centers the VCO close to required frequency, and a finely controlled varactor using the filtered differential control voltage,  $V_{cp}$ . A high  $C_{max}/C_{min}$  ratio over a low voltage tuning range implies high varactor  $k_v$ , which is unfavourable to phase noise performance. Differential tuning provides a simple but effective solution to avoid the drawbacks of high  $k_v$  effect. All low frequency noise, such as flicker noise, can be considered to be common-mode noise and differentially tuned varactors can be used to suppress common-mode noise [72]. The VCO features differentially tuned MOS varactors to provide fine tuning while diminishing the adverse effect of high varactor sensitivity through rejection of common-mode noise.

### 3.2.2 ILRO based Prescaler

Simulations over process, voltage and temperature (PVT) for a nominal 0.5V supply show that the input frequency of the ILRO-based divider is up to 2X larger than that of the traditional flip-flop based divider using current-mode-logic latches of [73], while also operating with lower power and having a smaller area footprint. The traditional divider also has more demanding signal power requirements on the input clock to fully steer the currents. The need for voltage headroom of $2V_{DSSAT}$ plus the necessary output amplitude creates a performance ceiling at ultra-low supplies. For example, a $V_{DSSAT}$ of ~150mV and an output amplitude of ~300mVpp lead to a minimum supply requirement of 2 × 150mV + 300mV = 0.6V. The general multi-phase, multiple-input injection scheme in Fig. 3-5 increases the locking range of the ILRO based divider and makes it possible to implement an odd-M division modulus, where M is the odd number of ring-oscillator stages. Using the $\Delta \rightarrow d$ relation, described in [74], such a multiple-input injection widens



Figure 3-4: VCO, using a cross-coupled inverter architecture.



Figure 3-5: General concept of odd-M stage multi-input injection to achieve modulo-M division and achieve wider injection lock range.



Figure 3-6: Ultra-low voltage pseudo-differential implementation of the ILRO prescaler in a divide-by-3 configuration.

the lock range to  $(T - 2M \cdot d_{min} < M \cdot T_{inj} < T + 2M \cdot d_{max})$ , where T is the period of the free-running ring-oscillator,  $T_{inj}$  is the period of the injected signal  $S_{inj}$  and  $(d_{min}, d_{max})$  is the range of delay modulation in each of the oscillator stages due to steady-state multi-input injection action. In [75] a similar concept was used to obtain modulo-3 and modulo-7 division ratios at a 1.8V supply. In this work, a generalized time-domain delay-based approach is used to describe a widening of the lock range with multi-input injection and is leveraged to implement a modulo-3 ILRO (Fig. 3-6) for supplies as low as 0.5V. Each stage,  $G^*$ , of the 3-stage oscillator is inverting, with the transconductance of an NFET driving an active PFET load. The free-running frequency of the oscillator is set by controlling the load impedance using the bias voltage  $V_{cilo}$ . The injection signal superimposes on the  $V_{cilo}$  voltage to modulate the active load impedance to achieve injection-lock. The differential signal,  $F_{in}$ , from the 2:1 MUX is used to injection lock two coupled 3-stage ring oscillators that generate the complementary phase-shifted outputs  $C_0/C_{180}$ ,  $C_{60}/C_{240}$ , and  $C_{120}/C_{300}$ . The coupling inverters between the complementary phases correct for any phase deviations and maintain symmetry across this pseudo-differential circuit for an even output phase



Figure 3-7: Circuit block diagram for the phase-switching based programmable divider.

spacing. The minimal FET stack in the ILRO topology coupled with the lower  $V_T$  (without accompanying leakage) and lower junction capacitance benefit of the 45nm SOI CMOS technology [76], helps to push up the speed of the ILRO-prescaler, as well as the synthesizer at ultra-low supply. The higher substrate resistivity in this technology further helps with noise shielding in the pseudo-differential circuit.
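The lock-range inequality for the multi-input ILRO can be turned into a quick numeric check. The following is a minimal sketch only: the function name and the example values of $T$, $d_{min}$ and $d_{max}$ are hypothetical and are not measured values from this work.

```python
def ilro_lock_range(t_free, d_min, d_max, m=3):
    """Return (f_inj_low, f_inj_high) in Hz for an M-stage multi-input ILRO.

    Evaluates T - 2*M*d_min < M*T_inj < T + 2*M*d_max, where t_free is the
    free-running period T and (d_min, d_max) bound the per-stage delay
    modulation. All times are in seconds.
    """
    t_inj_min = (t_free - 2 * m * d_min) / m  # shortest lockable injection period
    t_inj_max = (t_free + 2 * m * d_max) / m  # longest lockable injection period
    if t_inj_min <= 0:
        raise ValueError("d_min is too large for the given free-running period")
    return 1.0 / t_inj_max, 1.0 / t_inj_min


# Hypothetical example: 3 GHz free-running ring with 5 ps of delay
# modulation available in each direction per stage (assumed numbers).
f_lo, f_hi = ilro_lock_range(t_free=1 / 3e9, d_min=5e-12, d_max=5e-12, m=3)
print(f"injection lock range: {f_lo / 1e9:.2f} GHz to {f_hi / 1e9:.2f} GHz")
```

With M = 3 the injected frequency sits near three times the free-running frequency, which is why a ring running at a few GHz can divide an input close to 9GHz.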

### 3.2.3 Phase-Switching Programmable Divider

Following the ILRO prescaler with a conventional multi-modulus divider would result in a division step size of 3, and fractional-N synthesis would have to be used to obtain fine frequency steps. In contrast, we use the multi-phase differential outputs from the ILRO prescaler to realize sub-integer programmable division ratios, as shown in Fig. 3-7. The programmable parameter k represents the number of  $T_{in}/2$  phase shifts in a single  $T_{fbclk}$  period. The programmable pulse generator output is used to clock a finite-state machine which controls the state of the phase-switching MUX. Glitch-free phase switching [70] is used.  $T_{fbclk}$  is periodic, but phase inaccuracies during phase-switching could modulate it and lead to deviations in the divider moduli and to sub-integer spurs. As an example, a 2ps change in  $T_{fbclk}$  on average would lead to a deviation of about 0.02% in frequency at  $F_{vco}$ .
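The arithmetic behind the sub-integer steps and the 0.02% figure can be checked with a few lines. This is an illustrative sketch: it assumes the loop is locked so that $F_{fbclk} = F_{ref}$, takes the 95MHz reference and the 96 to 98.5 division ratios from the measurement setup reported in this chapter, and uses hypothetical function names.

```python
F_REF_HZ = 95e6  # reference frequency used in the measurements


def vco_frequency(division_ratio, f_ref=F_REF_HZ):
    """Locked VCO frequency for a (possibly sub-integer) division ratio."""
    return division_ratio * f_ref


def fractional_frequency_error(delta_t, f_ref=F_REF_HZ):
    """First-order fractional frequency deviation for an average error
    delta_t (seconds) in the feedback clock period T_fbclk = 1 / f_ref."""
    return delta_t * f_ref


ratios = [96 + 0.5 * i for i in range(6)]  # 96, 96.5, ..., 98.5
freqs = [vco_frequency(d) for d in ratios]
step = freqs[1] - freqs[0]

for d, f in zip(ratios, freqs):
    print(f"/{d:<5} -> F_vco = {f / 1e9:.4f} GHz")
print(f"frequency step = {step / 1e6:.1f} MHz")
print(f"2 ps error -> {100 * fractional_frequency_error(2e-12):.3f} % deviation")
```

A half-integer step in the division ratio corresponds to half of $F_{ref}$ in output frequency, without lowering $F_{ref}$ itself.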

[Figure 3-8 flowchart, recovered from the figure text. Blocks: off-chip frequency controller, frequency counter (FC), ILRO prescaler (/3) biased by $V_{cilo}$, and programmable sub-integer divider /D (96, 96.5, 97, 97.5, 98, 98.5) producing $F_{fbclk}$. Sequence: start ILRO calibration; increment $V_{cilo}$; get FC; repeat while $|FC - 512|$ is decreasing; otherwise set $V_{cilo}$ and enter VCO band select.]

### 3.2.4 Automatic Injection-Lock Calibration

Figure 3-8: Automatic injection-lock calibration algorithm to coarsely set the ILRO free-running frequency.

Compared to traditional divider-based prescalers, ILRO prescalers can process inputs with higher frequencies while operating from lower supplies, but they have a limited lock range. For a PLL with ILRO prescalers to work reliably over PVT, the ILRO free-running frequency needs to be set within the lock range and the VCO band needs to be selected optimally. The ILRO free-running frequency is calibrated in [67], but the calibration scheme presented here in Fig. 3-8 and Fig. 3-9 performs



Figure 3-9: Automatic injection-lock calibration algorithm to optimally select the VCO band.

calibration for both ILRO free-running frequency and optimal VCO band. At startup, it tri-states the 2:1 MUX output that drives the ILRO, so the ILRO runs freely. Its output,  $F_{fbclk}$ , is compared against  $F_{ref}$  for different values of  $V_{cilo}$ . The  $V_{cilo}$  value whose frequency counter reading for  $F_{fbclk}$  is closest to the one for  $F_{ref}$  is selected for the ILRO. In the second step of the calibration, the 2:1 MUX selects the VCO output. The VCO is set such that the differential  $V_{cp}$  is zero and its bands ( $b_{\langle i \rangle}$ ) are stepped from bottom to top. For each band the frequency counter value is noted, and a search is performed for a maximal set of contiguous bands with monotonically increasing counter values. The average of the band values in this set is used to set the VCO band for optimal lock margin.
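The two calibration steps can be sketched in software as follows. This is a minimal illustration rather than the on-chip controller: the function names, the counter readings and the target count of 512 (taken from the Fig. 3-8 flowchart text) are assumptions made for the example.

```python
def select_vcilo(counts_by_setting, target_count):
    """Step 1: pick the V_cilo setting whose free-running counter value
    is closest to the count expected for F_ref."""
    return min(counts_by_setting,
               key=lambda s: abs(counts_by_setting[s] - target_count))


def select_vco_band(counts):
    """Step 2: return the average band index (rounded down) of the longest
    contiguous run of bands with strictly increasing counter values."""
    if not counts:
        raise ValueError("no band measurements supplied")
    best_start, best_len = 0, 1
    start = 0
    for i in range(1, len(counts) + 1):
        # A run ends at the end of the list or when the count stops rising.
        if i == len(counts) or counts[i] <= counts[i - 1]:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = i
    return best_start + (best_len - 1) // 2


# Hypothetical counter readings, keyed by V_cilo in mV.
vcilo_counts = {300: 180, 250: 300, 200: 430, 150: 505, 100: 590, 50: 700}
vcilo = select_vcilo(vcilo_counts, target_count=512)

# Hypothetical per-band readings: the count only tracks the VCO while the
# ILRO stays injection-locked, which produces the increasing stretch.
band_counts = [500, 500, 498, 501, 503, 506, 509, 512, 515, 518, 517, 517]
band = select_vco_band(band_counts)
print(f"V_cilo = {vcilo} mV, VCO band = {band}")
```

Picking the middle of the longest increasing run places the VCO band away from both edges of the ILRO lock range.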



Figure 3-10: Fabricated chip micrograph and layout of the PLL.

## 3.3 Experimental Results

The PLL was fabricated in a 45nm SOI CMOS technology. The die microphotograph is shown in Fig. 3-10. The area of the PLL is  $0.05 \text{mm}^2$  and its power consumption at 0.5V is 3.5mW, excluding output buffers. First, the free-running output frequency of the ring-oscillator based prescaler was measured to range from 1GHz to 3.5GHz when  $V_{cilo}$  varies from 300mV to 50mV, as shown in Fig. 3-11(a). Next, the 2:1 MUX was set to select the  $F_{inj}$  signal from an off-chip signal source to measure the ILRO lock range as a function of input power for different  $V_{cilo}$  settings. As seen in Fig. 3-11(b), the lock range is around 10% for a -3dBm input power.



Figure 3-11: Measurement of (a) $V_{cilo}$  versus  $F_{osc}$  (b) Input dBm versus Freq. lock range.

Process and back-end-of-line (BEOL) interconnect parasitic parameters are adjusted in simulation to match the measured self-oscillation frequency of the oscillator. These parameters are then used to simulate the open-loop d vs.  $\Delta$  relationship for the delay stages at different input signal levels. The extremum points in these curves, where  $slope \rightarrow 0$ , give the corresponding  $(d_{max}, d_{min})$  values, i.e. the range of delay modulation in each of the oscillator stages due to injection action. The lock ranges can then be calculated from the earlier relation with M = 3,  $(T - 6 \cdot d_{min} < 3 \cdot T_{inj} < T + 6 \cdot d_{max})$ , where T is the period of the free-running ring-oscillator. In Fig. 3-12, the linear fits of the measured and calculated lock range values are plotted at different input levels and self-oscillation frequencies, showing a good model-to-hardware correlation.



Figure 3-12: Linear-fit of measured and calculated lock ranges at different injection input levels and self-oscillation frequencies.

Fig. 3-13(a) shows the plot of min-max frequency in each VCO band, as well as the mid-band frequency with differential  $V_{cp}$  set to 0. The frequency counter values during



Figure 3-13: (a) VCO gain curves. (b) Auto-calibration between ILRO and VCO.

automatic injection-lock calibration are also shown in Fig. 3-13(b), and it converges to VCO band 18, the average of the maximal set of bands with monotonically increasing count values, for optimal injection-lock point. The ILRO lock range is large enough to maintain lock over supply and temperature drift, thus removing the need for dynamic calibration.

Fig. 3-14(a) shows the PLL output spectrum at different sub-integer division ratios using an  $F_{ref}$  of 95MHz; a frequency resolution of 47.5MHz is observed in the spectra. The phase noise at a single divider setting is shown in Fig. 3-14(b); at all divide ratios the phase noise is close to -100 dBc/Hz at 1MHz offset. While it is difficult to determine the source of the correlation for the spurs seen in the phase noise plot, the estimated jitter contribution due to these spurs is less than a few fs. The integrated RMS jitter beyond the clock-data recovery corner frequency (baud rate/1667) is 325fs, which compares favourably for use in high-speed serial communications [77].

|                      | This Work | [66]  | [67]  | [68]  | [78]   |
|----------------------|-----------|-------|-------|-------|--------|
| CMOS Tech.           | 45nm-SOI  | 65nm  | 65nm  | 65nm  | 180nm  |
| $F_{vco}$ (GHz)      | 9.12      | 2.4   | 5.49  | 5.54  | 1.9    |
| $F_{vco}/F_{ref}$    | 96        | 2400  | 160   | 160   | 126    |
| PLL-type             | Sub-Int-N | Int-N | Int-N | Int-N | Int-N  |
| VCO                  | LC        | LC    | LC    | LC    | LC     |
| Supply (V)           | 0.5       | 0.68  | 0.5   | 0.5   | 0.5    |
| Power (mW)           | 3.5       | 0.68  | 0.95  | 1.6   | 4.5    |
| Area (mm$^2$)        | 0.05      | 0.2   | 0.78  | 0.64  | 1.32   |
| PN (dBc/Hz)          | -100      | -110  | -106  | -105  | -120.4 |
| Ref. Spur (dBc)      | -61       | -50   | -65   | -65   | -44    |
| FOM <sup>a</sup>     | -173.5    | -179  | -181  | -179  | -179.4 |
| FOM$_A$ <sup>b</sup> | -186.5    | -186  | -183  | -181  | -178.2 |

Table 3.1: Performance Summary and Comparison of Low-Supply PLLs.

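$$\begin{split} \mathrm{FOM}^{a} &= PN - 20 \cdot \log_{10}(F_{vco}/1\,\mathrm{MHz}) + 10 \cdot \log_{10}(\mathrm{Power}/1\,\mathrm{mW}) \\ \mathrm{FOM}_A^{b} &= \mathrm{FOM} + 10 \cdot \log_{10}(\mathrm{Area}/1\,\mathrm{mm}^2) \end{split}$$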
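As a sanity check, the two figure-of-merit definitions can be evaluated directly from the rounded "This Work" column of Table 3.1. This is a sketch with hypothetical function names; because the table entries are rounded, the computed values land within a few tenths of a dB of the tabulated FOM figures rather than matching them exactly.

```python
import math


def pll_fom(pn_dbc_hz, f_vco_hz, power_w):
    """FOM = PN - 20*log10(F_vco / 1 MHz) + 10*log10(Power / 1 mW)."""
    return (pn_dbc_hz
            - 20 * math.log10(f_vco_hz / 1e6)
            + 10 * math.log10(power_w / 1e-3))


def pll_fom_area(fom, area_mm2):
    """FOM_A = FOM + 10*log10(Area / 1 mm^2)."""
    return fom + 10 * math.log10(area_mm2)


# Rounded inputs from the "This Work" column of Table 3.1.
fom = pll_fom(pn_dbc_hz=-100.0, f_vco_hz=9.12e9, power_w=3.5e-3)
fom_a = pll_fom_area(fom, area_mm2=0.05)
print(f"FOM = {fom:.1f}, FOM_A = {fom_a:.1f}")
```

The area term contributes about -13dB for a 0.05mm$^2$ macro, which is what separates this design from the larger comparison designs in the FOM$_A$ row.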

Fig. 3-15 shows the distribution of the power consumption over different macros


Figure 3-14: Measurement of (a) Output spectra of the clock-frequency synthesizer at different sub-integer division ratios. (b) Phase noise plot at division ratio of 96.



Figure 3-15: Power consumption distribution in the sub-integer clock-frequency synthesizer.

in the synthesizer. Table 3.1 summarizes the performance of the synthesizer and compares it against other ultra-low supply PLL implementations. The ultra-low voltage ILRO-prescaler topology used with automatic injection-lock calibration enabled the demonstration of a PLL with the highest speed at an ultra-low supply of 0.5V. The sub-integer programmable divider facilitates fine frequency resolution without requiring a decrease in  $F_{ref}$  or an increase in the division ratio or a lowering of the loop bandwidth. The design achieves an outstanding overall FOM<sub>A</sub> of -186.5.

## 3.4 Summary

This chapter presented a sub-integer clock-frequency synthesizer architecture that can operate at a high speed from an ultra-low supply. A record speed of 9GHz has been demonstrated at 0.5V in 45nm SOI CMOS. Key design features are described to achieve such high frequencies with fine resolution at an ultra-low supply. The proposed multi-phase multi-input ILRO-prescaler eliminates the speed bottleneck, while automatic injection-lock calibration ensures lock between the VCO and the ILRO-prescaler. The phase-switching based programmable divider structure provides fine frequency resolution through sub-integer division. The PLL power and area are 3.5mW and 0.05mm<sup>2</sup>, and the RMS jitter is 325fs, yielding a FOM<sub>A</sub> of -186.5.

## Chapter 4

# A 19Gb/s Receiver for Chip-to-Chip Links with Clock-Less DFE and High-BW CDR based on Master-Slave ILOs

## 4.1 Introduction

High performance computing (HPC) is an indispensable tool for fundamental understanding and for prediction of properties of materials and entire systems. HPC advancement is critical for needs of scientific discovery and economic competitiveness. Some of the key challenges in advancing to an exascale computing system at 1000x the performance of today's petaflop machines include: a thousand-fold increase in parallelism, memory storage and data movement requirement, reliability of the system, and energy consumption at this scale of on-die interconnect [79].

Energy-efficient circuits and architectures for high bandwidth, low latency, and error-free information transfer over very short-reach (VSR) copper interconnects are critically needed for chip-to-chip (C2C) communication in high-density, extreme-scale systems [80]. Source-synchronous links are used in HPC systems for low-power C2C interconnects [81–83]. In such links, because a clock lane exists, a fast CDR is not used, which leads to uncorrelated jitter between clock and data as a function of skew and to the resultant performance degradation [84]. Also, in extreme-scale systems with billions of threads in high-density VSR links, such a synchronous architecture stresses clock-tree planning, distribution, and resilience to failure, and increases the potential for electromagnetic interference (EMI). An asynchronous clock architecture with reference clocks and a high-BW CDR [85, 86] eases clock-tree distribution and enables the adoption of spread-spectrum clocking (SSC) to suppress EMI. However, in extreme-scale systems it would still be limited by its power efficiency and the need for reference clock-tree planning. Reference-less architectures remove the need for clock-tree planning but are usually limited by the data rate [87], the degree of RX equalization capability [87], or the CDR bandwidth for jitter tolerance [88,89].

In this chapter a receiver is proposed with an embedded reference-less clocking architecture that relaxes clock-tree planning in dense systems, while maintaining RX equalization capability for error-free operation over VSR channels (< 20-inch distance). The RX has been implemented in 14nm CMOS and characterized at 19Gb/s. The receiver features an embedded injection-locked oscillator (ILO) for a high-BW CDR, to be used with SSC to mitigate EMI and to potentially relax TX jitter specifications for improved power efficiency. It also has master-slave ILO based phase generation/rotation using resistively-interpolated injection edges for optimal placement of sampling clocks, and a clock-less DFE for residual first post-cursor equalization. The next section describes the RX architecture, followed by a description of the different circuit blocks to explain the unique features. The measurement section presents data on the quality of the recovered clock and on the RX performance.

## 4.2 System-level Considerations

#### 4.2.1 Channel Equalization

VSR-C2C links typically operate over a range of channel characteristics, ranging from C2C interconnects within multi-chip modules to relatively short (< 20-inch) channels across a PCB made of higher-quality material such as Megtron-6. Channel insertion losses of < 15dB at 10GHz are expected [90] for such VSR links. NRZ signaling is preferred over PAM4 for such links in extreme-scale systems, as they have no forward-error correction (FEC) protocols in order to minimize system complexity and decode latency. To support high IO density and stringent power requirements in extreme-scale systems, the proposed design envisions a simplified transmitter with no feed-forward equalizer (FFE) and relaxed amplitude and output jitter specifications. The design relies solely on the RX for channel equalization. The lack of de-emphasis on the TX (no FFE) increases the average signal level at the RX input. Continuous-time linear equalization (CTLE) on the RX side using a peaking amplifier can equalize pre- and post-cursor ISI over a wide time span by convolving with the impulse response of the channel. However, relying on the RX CTLE has one fundamental limitation: it provides no discrimination between the desired signal and noise. Boosting high-frequency signals relative to low-frequency ones not only compensates for the loss of the channel but also amplifies high-frequency crosstalk from other channels, a potential concern in the high-density environments of extreme-scale systems. A key advantage of a DFE is that it is able to compensate for ISI without amplifying noise. The RX equalization scheme shown in Fig. 4-1 has a 1-tap DFE and a CTLE with 8dB of peaking at the half-baud rate; both can be brought to bear on the channel for optimal performance [91].

Channel operating margins at high data rates are used to measure channel performance, including both signal impairments and the techniques used to compensate for these impairments [92,93]. Such a model is used to evaluate the planned RX CTLE and DFE equalization scheme (Fig. 4-1) to see the impact of the choice and its effectiveness for a VSR-C2C communication channel. The model includes a transmitter (with no FFE), channel-induced frequency-dependent attenuation, dispersion and discontinuities, as well as voltage noise (from devices), static noise (quantization errors, RX meta-stability, etc.), and jitter in the timing circuits and clock-data recovery loop. The parameters used are listed in Table 4.1.

| Parameter                         | Symbol                | Value             |
|-----------------------------------|-----------------------|-------------------|
| Number of signal Levels           | L                     | 2                 |
| Signaling rate                    | $f_b$                 | 20Gb/s            |
| Transmitter differential peak output | $A_v$              | 0.6V              |
| Single-ended termination resistor | $R_d$                 | $48\Omega$        |
| Rx 3dB bandwidth                  | $f_r$                 | $0.75 \times f_b$ |
| Tx FFE                            | $C_i$                 | i = 0             |
| CTLE DC gain                      | $g_{DC}$              | 0-1dB             |
| CTLE peaking at $f_b/2$           | $f_z, f_{p1}, f_{p2}$ | 0-8dB at $f_b/2$  |
| DFE length                        | $N_{DFE}$             | 1-UI              |
| RMS RJ                            | $\sigma_{RJ}$         | 470fs             |
| Amplitude noise RMS               | $A_m$                 | 3mV               |
| Sampler overdrive                 | $A_{ov}$              | 15mV              |
| Sinusoidal jitter                 | sj                    | 200ppm            |
| Target error rate                 | BER                   | $10^{-12}$        |

Table 4.1: Parameters used for the channel margin study.

The analysis, done at 20Gbps over a channel with > 16dB insertion loss at  $F_{baud}/2$ , shows adequate margins for horizontal and vertical eye opening at  $10^{-12}$  BER in Fig. 4-2. It also shows that, by including 1-tap DFE capability, the system can improve the signal-to-noise ratio in the presence of crosstalk by dialing down the peaking in the CTLE.



[Figure 4-1 annotation:  $h_1$  is used to cancel the 1<sup>st</sup> post-cursor ISI.]

Figure 4-1: RX equalization capabilities, such as CTLE peaking and 1-tap DFE are evaluated for channel performance margins.

#### 4.2.2 Receiver Architecture

The block diagram of the quarter-rate RX architecture is shown in Fig. 4-3. NRZ signaling was chosen over PAM4 for this architecture because the latency requirements of this application do not support FEC. The architecture further assumes that standard encoding techniques such as 8b10b are used to maintain a minimum transition density for minimal overhead. The RX input data path has a peaking amplifier to provide linear equalization, with a nominal range of 0-8dB at the half-baud frequency. The residual 1<sup>st</sup> post-cursor is then removed using a clock-less direct-feedback DFE, before feeding two quarter-rate sampling paths (Data/Edge). The samples at the center of the eye (Data) and at the transition edge (Edge) are de-multiplexed to bang-bang phase and frequency detectors (BB-PD, BB-FD) for digital phase and frequency control, similar to [87]. The reference-less BB-FD results in a wide capture range and, by setting the frequency control voltage of the ILOs to the middle of its dead-zone width, ensures an optimal lock point for the edge-detect injection.

To improve jitter tolerance (JTOL), the NRZ data sequence at the continuous-time linear-equalizer (CTLE) output is amplified and XORed with its delayed version to detect the transition edges. This edge-detect output resembles RZ data and has strong clock spectral lines at the data rate  $F_{baud}$ , providing a vigorous injection signal for the master injection-locked oscillator (MILO). The delay between the two XOR inputs is correlated to the frequency-control operation of the BB-FD for the ILO oscillators and hence maintains a  $T_{bit}/2$  spacing for strongest injection. Incidentally, this same delayed input to the XOR for edge detection is shared to feed back the  $X(z) \cdot z^{-1}$  symbol for DFE 1<sup>st</sup> post-cursor equalization. Such a clock-less DFE is possible due to the tight correlation of the delay cell to the MILO-SILO frequency control operation and the resultant  $T_{bit}/2$  spacing.

The MILO then injection locks the slave injection-locked oscillator (SILO); this improves phase-noise of the SILO and mitigates high frequency jitter transfer to the SILO's recovered output clock. The BB-PD ensures optimal timing margin for the eye center sampler by changing the phases of the SILO recovered clock using coarse selection of MILO phases and resistively-interpolating the edges finely for injection into the SILO, leading to linear 360° phase rotation of the recovered clock. Two synchronized dividers at SILO output generate quarter-rate clocks for the sampling latches. Quarter-rate clocking allows more time for critical operations such as sampling latch evaluation thereby avoiding the limitation caused by large over-head of self-capacitance. Reduced clock tree depth loading in this embedded clock architecture leads to reduced dynamic clocking power and minimal phase errors in the quarters, which in turn lessens the need for elaborate clock phase corrections in the quarters as in [94,95].



Figure 4-2: Channel operating margin study with signal impairments at different RX peaking and DFE settings; panel (b) shows the vertical eye opening (mV). 1-tap DFE gives robustness to the system solution in case of degradation due to crosstalk and PN-skew. To improve signal-to-noise ratio in the face of crosstalk, peaking could be dialed down and the  $h_1$ -tap could be used for post-cursor equalization.



Figure 4-3: Quarter-rate RX architecture for very short-reach chip-to-chip links with clock-less DFE and high-bandwidth CDR based on Master-Slave injection-locked oscillators.

## 4.3 Circuit Blocks and Descriptions

#### 4.3.1 CTLE

Fig. 4-4 shows the detailed schematic of the CTLE as well as a single-ended representation. It controls high-frequency gain peaking, uniquely functions as the current-summer node for the clock-less direct-DFE, and interfaces to the Data/Edge samplers. The peaking is adjusted by switching the value of the capacitor  $C_c$  between two parallel input stages  $g_{m1A}$  and  $g_{m1B}$ . The DC gain is  $g_{m1A} \cdot R_{1A}$ , while the maximum possible high frequency gain is  $(g_{m1A} + g_{m1B}) \cdot (R_{1A}//R_{1B})$ . Without inductor  $L_{1B}$ , the achievable high frequency gain is limited by the output pole. The inductor  $L_{1B}$  extends the bandwidth, thereby increasing the peaking at  $F_{baud}/2$ . The DFE feedback tap current is summed into the load resistor,  $R_{1A}$ . The clock-less DFE feedback relies on a replica buffer of the MILO-SILO delay cells to vary the delay according to the baud rate. It is described in detail later in the chapter, after a discussion of MILO-SILO frequency calibration.
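The two gain expressions bound the peaking the CTLE can provide. The sketch below evaluates them for one set of component values; the transconductances and resistances are hypothetical, chosen only so that the result lands near the 8dB peaking mentioned in this chapter, and they are not the values used in the design.

```python
import math


def parallel(r_a, r_b):
    """Equivalent resistance of two resistors in parallel."""
    return r_a * r_b / (r_a + r_b)


def ctle_gains_db(gm_1a, gm_1b, r_1a, r_1b):
    """Return (dc_gain_db, max_hf_gain_db, max_peaking_db) using
    DC gain = gm_1a * R_1a and
    max HF gain = (gm_1a + gm_1b) * (R_1a // R_1b)."""
    dc = gm_1a * r_1a
    hf = (gm_1a + gm_1b) * parallel(r_1a, r_1b)
    return (20 * math.log10(dc),
            20 * math.log10(hf),
            20 * math.log10(hf / dc))


# Hypothetical values: 2 mS and 8 mS input stages, 500 ohm loads.
dc_db, hf_db, peaking_db = ctle_gains_db(gm_1a=2e-3, gm_1b=8e-3,
                                         r_1a=500.0, r_1b=500.0)
print(f"DC gain {dc_db:.2f} dB, max HF gain {hf_db:.2f} dB, "
      f"max peaking {peaking_db:.2f} dB")
```

The expression ignores the output pole and the bandwidth extension from $L_{1B}$, so it gives an upper bound on the peaking rather than the value at $F_{baud}/2$.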

#### 4.3.2 Data Edge-Detection and Injection

For a random NRZ data stream, each bit in the sequence has an equal probability (50%) of being a one or a zero, regardless of the state of the preceding bit(s). It is therefore possible to have long sequences of consecutive identical digits (CIDs). Because of the very low frequency content produced by long sequences of CIDs in the data signal, designing high-speed systems that can work with random data can be difficult. Data encoding, or scrambling, is often used to format the random data into a more manageable form. This architecture utilizes 8b10b, a widely used encoding method in high-speed systems, to limit the pattern length and maintain a minimum transition density for minimal overhead. The power spectrum for an NRZ data stream in Fig. 4-5 shows an infinite sequence of discrete spectral lines (delta functions) scaled by a "sinc<sup>2</sup>(f)" envelope, where sinc(f) is defined as  $\sin(\pi f)/(\pi f)$ . Important observations that apply to test patterns in general include: (a) the nulls in the sinc<sup>2</sup>(f)



Figure 4-4: RX CTLE equalization using a single-stage peaking amplifier.

envelope occur at integer multiples of the data rate; (b) spectral lines are evenly spaced at an interval that is the inverse of the pattern length; and (c) the magnitude of the  $sinc^2(f)$  envelope decreases as the data rate and/or pattern length increase.



Figure 4-5: Power spectrum of NRZ signalling for an L-bit repeating pattern, showing a null at the data rate.

Since the transitions of the random data sequence are still random, the spectrum of the pulses generated from an NRZ data stream resembles that of return-to-zero (RZ) data. The RZ data spectrum is a squared sinc function with strong clock spectral lines at the data rate and its harmonics. Maintaining a  $T_b/2$  delay, where  $T_b$  is a bit interval, between the two inputs of the XOR gate yields a strong clock spectral line at the data rate, as shown in Fig. 4-6. In fact, the normalized magnitude of the  $1/T_b$  line can be expressed as  $\sin(x\pi)/\pi$ , where  $x$  ( $0 < x < 1$ ) represents the relative pulse width [96].
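The null of NRZ data at the data rate, and the clock line created by the XOR edge detector, can be reproduced with a small behavioral model. The sketch below is an idealized illustration with assumed parameters (256 random bits, 64 samples per bit, square pulses, circular wrap-around); it normalizes the measured line by the transition density and compares it with $\sin(x\pi)/\pi$.

```python
import cmath
import math
import random


def oversample(bits, spb):
    """Expand a bit list into a 0/1 waveform with spb samples per bit."""
    return [b for b in bits for _ in range(spb)]


def xor_edge_detect(wave, delay):
    """XOR the waveform with a copy delayed by `delay` samples (circular)."""
    n = len(wave)
    return [wave[i] ^ wave[(i - delay) % n] for i in range(n)]


def baud_line(wave, spb):
    """Magnitude of the Fourier coefficient at F_baud (one cycle per bit)."""
    acc = sum(v * cmath.exp(-2j * math.pi * i / spb)
              for i, v in enumerate(wave))
    return abs(acc) / len(wave)


rng = random.Random(0)  # fixed seed so the run is repeatable
bits = [rng.randint(0, 1) for _ in range(256)]
spb = 64  # samples per bit (assumed resolution)
nrz = oversample(bits, spb)
# Fraction of bit boundaries with a transition (circular, like the XOR wrap).
density = sum(bits[k] != bits[k - 1] for k in range(len(bits))) / len(bits)

nrz_line = baud_line(nrz, spb)
results = {}
for x in (0.25, 0.5, 0.75):
    pulses = xor_edge_detect(nrz, int(x * spb))
    results[x] = baud_line(pulses, spb) / density  # per-transition magnitude
    ideal = math.sin(x * math.pi) / math.pi
    print(f"x = {x}: measured {results[x]:.4f}, sin(x*pi)/pi = {ideal:.4f}")
print(f"NRZ line at F_baud: {nrz_line:.2e}")
```

In this model the line is strongest at x = 0.5, i.e. a $T_b/2$ delay, which is the operating point the replica delay cell is meant to hold.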

Recovered clocks that track data jitter through instantaneous locking techniques improve jitter tolerance and are useful in applications without strict specifications on jitter transfer. This design extracts the clock at the data rate for injection into the ILO to achieve high jitter tolerance. The proposed scheme is shown in Fig. 4-7, where the data edge-detector is used to reproduce the clock for injection. The pulse generated by the XOR gate not only indicates data transitions but also creates a strong spectral line at the data rate ( $F_{baud}$ ), facilitating the injection locking of the subsequent MILO-SILO oscillators (to  $F_{baud}/2$ ).



Figure 4-6: (a) RZ data spectra with  $T_b/2$  delay into the XOR cell (b) Simulated RZ injection level with 19Gbps NRZ input data rate.

The limiting amplifier (LA) stage is a cascade of common-source differential amplifiers with transimpedance gate input and active PMOS loads with common-mode feedback control [97], as shown in Fig. 4-8. The variable-delay stage following the LA is a replica of the delay stage in the MILO-SILO oscillators and is nominally set to  $T_b/2$  for maximum clock signal extraction for injection. The MILO-SILO oscillators track the data rate through the BB-FD loop; the deviation in this replica delay-line



Figure 4-7: Edge-detection, clock signal extraction and injection scheme.

(Fig. 4-9) is simulated to have a 3- $\sigma$  variation of  $\pm 1.5$ ps due to mismatch. The output of LA  $(D_{in})$  and the delay-line  $(D'_{in})$  are input into the CML XOR stage, Fig. 4-10, to extract the clock signal for injection into the MILO.



Figure 4-8: Schematics of limiting-amplifier used for  $\sim 20$  dB differential gain.

#### 4.3.3 Reference-less frequency acquisition

Reference-less here implies a receiver that can function without a physical external reference clock over a wide incoming data rate. By avoiding reference or forwarded clocks to the receiver, the clock-tree and frequency planning becomes simpler and



Figure 4-9: (a) The replica delay-line uses the regulated-voltage of the MILO-SILO block as well as the  $C_L$  settings to track the data rate by maintaining  $T_b/2$  delay (b) Simulated tracking variation due to mismatch in the replica buffer.



Figure 4-10: Schematics of CML XOR stage.

more flexible in a dense C2C network. It also reduces the cost of external components and the need for tight tolerances in matching frequencies between received data and clock. However, to function error-free, a reference-less receiver needs to adjust its clock phase and frequency to the incoming data automatically.

The reference-less receiver in [102] automatically tunes to the incoming data by using an Early/Late discrepancy index from the bang-bang phase detector output to infer the frequency offset, while the receiver in [103] performs digital calibration by injecting data edges into a gated oscillator and observing the phase realignment to infer the frequency offset for adjustment. Neither of these reference-less receivers injection-locks the oscillator to the incoming data frequency. In [87], as in this design, the input data edges are injected into an ILO to injection-lock it, and consecutive bang-bang phase detector outputs are used to discriminate between phase and frequency updates. However, as there is no frequency error within the injection-lock range, the full lock range could be a "dead-zone" with equal probability for frequency convergence. The algorithm could end up converging to the edges of the lock range and needs continuous adaptation to avoid losing lock with supply and temperature drift.

The frequency acquisition algorithm used here runs at startup, but it relies on convergence to the center of the injection-lock range and on large lock ranges (unlike [104]) to have tolerance against drift and maintain recovered clock performance metrics such as *Jitter<sub>rms</sub>* and JTOL bandwidth. The proposed frequency acquisition loop uses the properties of a conventional bang-bang detector, which provides Early (E) or Late (L) information based on the sign of the phase error. Consecutive E/L information is used to determine phase (BB-PD) or frequency (BB-FD) updates into the respective accumulators, as shown in Fig. 4-11. The sequence for frequency acquisition is shown in Fig. 4-12. Fig. 4-13 shows the circuit details of the MILO-SILO oscillators, where the natural frequency is set by controlling the reference voltage ( $v_c$ ) of their regulator and by selecting the switchable load capacitor ( $C_L$ ) at their outputs. At startup the MILO-SILO oscillators are reset to the lowest frequency using the coarse ( $C_L$ ) and fine ( $v_c$ ) frequency controls. The BB-PD phase control loop is run for a timed interval, before resetting the error accumulators, running the BB-FD loop and noting the frequency error  $(F_{err})$  count. The process is repeated as the frequency control settings  $(C_L, v_c)$  are swept. The end result is a set of bands of  $(C_L, v_c)$  settings with  $F_{err}$  counts below a threshold denoting frequency lock. The center of the largest such band is chosen for the frequency control settings. Choosing the largest band avoids the issue of harmonic locking. This setting for the MILO-SILO frequency ensures that the data is locked in the center of the ILO injection-lock range, and later measurements show that the lock range is sufficient to ensure tight tolerance for recovered clock *Jitter*<sub>rms</sub> and JTOL bandwidth over supply and temperature drifts.
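The band-selection step at the end of the sweep can be sketched as follows. This is an illustrative model rather than the on-chip state machine: it assumes the $(C_L, v_c)$ settings are visited in order of increasing oscillator frequency, and the setting codes, $F_{err}$ counts and threshold are hypothetical.

```python
def center_of_largest_lock_band(f_err_counts, threshold):
    """Return the index at the center of the widest contiguous run of
    settings whose frequency-error count is below `threshold`, or None
    if no setting achieves lock."""
    best = None   # (start_index, length) of the widest run found so far
    start = None  # start index of the run currently being scanned
    # A sentinel equal to the threshold closes a run that reaches the end.
    for i, count in enumerate(list(f_err_counts) + [threshold]):
        locked = count < threshold
        if locked and start is None:
            start = i
        elif not locked and start is not None:
            if best is None or i - start > best[1]:
                best = (start, i - start)
            start = None
    if best is None:
        return None
    return best[0] + (best[1] - 1) // 2


# Hypothetical sweep: 3 coarse C_L codes x 5 fine v_c codes, ordered from
# the lowest oscillator frequency to the highest.
settings = [(c_l, v_c) for c_l in (2, 1, 0) for v_c in range(5)]
f_err = [40, 38, 3, 2, 35, 41, 5, 2, 1, 0, 1, 3, 6, 37, 44]

idx = center_of_largest_lock_band(f_err, threshold=10)
print(f"chosen setting (C_L, v_c) = {settings[idx]}")
```

The narrow two-setting band in this example stands in for a spurious harmonic lock; it is passed over because only the widest band is considered.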

If BBPD-based frequency-detect logic were the sole mechanism to update the oscillator frequency, then the transition density, jitter, and number of consecutive E or L signals could all influence the frequency difference between the data and the recovered clock [102]. In our scheme, however, the BBPD-based frequency detect is run with data edges injected into the oscillator, so as to converge on an oscillator frequency setting in the middle of the largest injection-locked band. Assuming the middle of such a band is where the free-running frequency of the injection-locked oscillator is closest to  $f_{baud}/2$ , any deviation is then dictated by the quantization of the fine-frequency control  $(v_c)$  and is estimated to be of the order of  $\pm 20\,\mathrm{MHz}$ .

#### 4.3.4 Resistively-Interpolated MILO-SILO based Phase-Rotation

Fig. 4-13 shows the circuit details of the MILO-SILO used to achieve high BW for jitter-tolerance and linearity for the phase rotation of the recovered clock. The natural frequency of the MILO-SILO oscillators is set by controlling the reference voltage  $(v_c)$ of their regulator and by selecting the switchable load capacitor  $(C_L)$  at their outputs. The edge-detect output with its strong spectral content at  $F_{baud}$  is injected into the MILO to achieve injection lock at  $F_{baud}/2$  and leads to high jitter-tolerance BW. Since the pulling between MILO and SILO is quite strong, the overall lock range is primarily determined by the coupling between the edge-detect output and the MILO. Phase calibration of the recovered clock that allows for 360° phase rotation is needed for optimal link timing. In this design, phase-shifting at the MILO-SILO output is achieved by resistively-interpolating between coarse phases  $(S_o \text{ and } S_e)$  from



Figure 4-11: Use of consecutive early-late transitions to discriminate between phase and frequency error for tracking and correction.



Figure 4-12: Reference-less frequency lock algorithm which sets the MILO-SILO free running frequency to lock in the center of the injection lock range. This ensures optimal margin against drift and for jitter-tolerance.

the MILO and then injecting the finely interpolated edge  $(S_{inj})$  into the SILO for injection lock. Resistive elements are set so that nominally equispaced interpolation is achieved between  $S_o$  and  $S_e$ , while maintaining constant injection strength at  $S_{inj}$ . The mapping from coarse phase selections to fine resistive-interpolation settings at different phase-rotator positions over 2UI is shown in Fig. 4-14. The single-point injection into the SILO generates multi-phase outputs that are divided down using a set of synchronized dividers to generate  $F_{baud}/4$  clocking for the Data and Edge samplers. Unlike prior ILO-based phase rotators [98, 99], this scheme does not suffer from glitches due to time-modulated injection, nor from non-linearities that arise from offsetting the natural frequency of the SILO to achieve phase rotation or from relying on the mismatch characteristics of current DACs to achieve linearity.
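One way to picture the coarse-select plus fine-interpolate mapping is the idealized Python sketch below. It is an assumption-laden illustration rather than the implemented decoder: the 32 steps/UI resolution comes from the measurements, but the number of coarse phases (`N_COARSE = 8`), the function name `rotator_setting` and the strictly linear weights are hypothetical. The two weights always sum to one, which stands in for the constant injection strength at $S_{inj}$.

```python
STEPS_PER_UI = 32                 # rotator resolution quoted in the measurements
N_POSITIONS = 2 * STEPS_PER_UI    # one period at F_baud/2 spans 2 UI
N_COARSE = 8                      # hypothetical number of coarse MILO phases


def rotator_setting(position: int):
    """Map a rotator position to (leading phase, lagging phase, w_lead, w_lag).

    The two adjacent coarse phases play the role of S_o and S_e; the
    weights are idealized linear interpolation weights.
    """
    fine_steps = N_POSITIONS // N_COARSE   # fine steps between coarse phases
    pos = position % N_POSITIONS           # rotation wraps after 2 UI
    coarse = pos // fine_steps
    fine = pos % fine_steps
    w_lag = fine / fine_steps
    w_lead = 1.0 - w_lag                   # weights sum to 1: constant strength
    return coarse, (coarse + 1) % N_COARSE, w_lead, w_lag
```

Stepping `position` from 0 to 63 walks once around the full 360° of the half-rate clock, and position 64 lands back on position 0, so the rotation is endless.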

#### 4.3.5 Clock-less DFE

The CTLE output is fed back through a limiting amplifier (differential gain of 20dB) and a variable delay buffer (Fig. 4-15), that are also used in the edge-detect block, for residual first post-cursor equalization. In contrast to [100], this clock-less DFE allows minimizing the load of the sampling clock and the length of the clock-distribution tree. The edges from the previous bit at the CTLE output,  $d_1$ , transition ( $\alpha$ ) through the limiting AMP and replica-delay buffer with a delay window  $T_d$  (shown in blue). The decision from the feedback applies a post-cursor correction to the current bit  $d_0$ given by,  $d_0 - i_1 \cdot R_{1A}$ . This assumes the post-cursor weight  $i_1 \cdot R_{1A}$  is settled before the data sampler samples the  $d_0$  bit. Such a settling transition ( $\beta$ ) has a settling window  $T_{settle}$  (shown in purple).  $T_d$  and  $T_{settle}$  falling within these ranges give the necessary setup/hold margins for the direct-DFE first post-cursor equalization. The same frequency-control setting used by the reference-less MILO-SILO is used in the DFE feedback variable-delay replica-cell to seamlessly meet the setup/hold margins over a wideband operating range, without complex delay calibration as in [101].



Figure 4-13: Master-Slave ILO-based  $360^\circ$  phase-rotation using resistive-interpolated edges for injection.



Figure 4-14: Coarse phase-selections to fine resistive-interpolation settings at different phase-rotator positions over 2UI.



Figure 4-15: Clock-less direct-feedback DFE with variable delay replica-cell tied to the delay elements in ILOs to optimally meet DFE loop timing margins.

#### 4.3.6 Jitter-tolerance BW using d vs. $\Delta$ based time-delay model

The dual-loop clock-data recovery, with one loop based on the BB-PD and the other on edge-based injection locking into the MILO-SILO combination, is linearized as shown in Fig. 4-16.  $K_*$ ,  $\phi_*$ ,  $\phi_{n*}$ , and  $Q_*$  represent scalar gains, phases, phase noises and quantization errors, respectively, at different stages in the system. Using the linearized dual-loop model of the proposed architecture, the transfer function from input phase ( $\phi_{data}$ ) to output phase error ( $\phi_{err}$ ) can be written as in (4.1).

$$\frac{\phi_{err}}{\phi_{data}} = \left[1 - \frac{K_{BB}K_IK_{PR}z^{-M}\beta_2}{2(1 - z^{-1/16})(1 - z^{-1})} - \frac{\beta_1}{(1 - z^{-1})} \cdot \frac{\beta_1}{(1 - z^{-1})}\right]$$
(4.1)

Jitter on the injected pulses,  $\phi_i$ , pulls the phase of the ILO by  $\phi_o(\phi_i)$ , so that the next injection pulse sees a smaller  $\phi_i$ , as shown in Fig. 4-17. In steady state,  $\phi_o \to 0$ , and the settling behaviour of the phase perturbation,  $\phi_i$ , depends on  $\beta(\phi_i)$ . Let  $\beta_1$  and  $\beta_2$  denote the relationship between input injection phase and output phase (analogous to the slope of d vs.  $\Delta$  in the time-delay model [74]) at the MILO and SILO, respectively. Each injection-locked oscillator is then represented as a first-order low-pass filter  $\left(\frac{\beta_x}{1-z^{-1}}\right)$ , with  $x \in \{1, 2\}$ .
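The settling behaviour can be illustrated with a small Python sketch. It assumes a constant $\beta$ (a linearization of $\beta(\phi_i)$) and assumes that each injection event removes a fraction $\beta$ of the remaining phase perturbation, so that $\phi_i[n+1] = (1-\beta)\,\phi_i[n]$. The function names are hypothetical and the numbers are not taken from the measured design.

```python
import math


def settle(phi0: float, beta: float, n_events: int) -> list:
    """Phase perturbation after each injection event, for constant beta."""
    phis = [phi0]
    for _ in range(n_events):
        # Each injection pulls the oscillator by beta times the current error.
        phis.append(phis[-1] * (1.0 - beta))
    return phis


def events_to_settle(beta: float, fraction: float = math.exp(-1)) -> float:
    """Number of injection events for the perturbation to decay to `fraction`."""
    return math.log(fraction) / math.log(1.0 - beta)
```

A larger $\beta$ shortens the settling time, which is the time-domain counterpart of a wider jitter-tracking bandwidth.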

As the BB-PD loop has a low BW in this system, the jitter-tolerance BW of the dual-loop CDR is mostly determined by  $\beta_1$  and  $\beta_2$ . Open-loop d vs.  $\Delta$  simulations are used to generate the  $\beta_1$  and  $\beta_2$  relationships, which are then used to calculate the jitter tolerance as described in Fig. 4-18. Comparing the JTOL plot from model-based calculations with closed-loop Spectre simulations shows good correlation and validates the time-delay model's usefulness in predicting the dynamics of injection-locked oscillators tracking input phase deviations.



Figure 4-16: Discrete-time model of the dual-loop clock-data recovery loop.



Figure 4-17: Relationship between input injection phase perturbation and output phase change and its impact on settling time constant.



Figure 4-18: Open-loop d vs.  $\Delta$  values of MILO-SILO used to calculate JTOL and compared against closed-loop simulations.

## 4.4 Measurements

#### 4.4.1 Experimental Setups

The RX test chip (Fig. 4-19) was fabricated in 14nm CMOS, with a core RX area of  $225\mu m \times 275\mu m$ . A general setup for measurements is shown in Fig. 4-20. Data is generated in a J-BERT N4903B and multiplied up using an N4876A 2:1 multiplexer. The data then goes through a Megtron6 PCB before entering the device under test (DUT) on the probe station. The serial-scan interface is controlled using a National Instruments NI-2162 digital I/O accessory and an NI PXI-1042, which is also used to interface with the LabVIEW GUI. Fig. 4-21 shows the setup used to measure the rotator INL/DNL. A 1010 data pattern from the J-BERT is multiplied up using the N4870A 2:1 mux and injection-locked into the MILO-SILO in the DUT. The recovered clock at the MILO-SILO phase-rotator output of the DUT is pattern-locked to a trigger in a DCA-X 86100D sampling scope. The MILO-SILO phase rotator is then stepped, and each phase step is calculated with reference to the previous waveform in scope memory.

#### 4.4.2 MILO-SILO, Phase Rotation and Recovered Clock

Fig. 4-22(a-b) shows the measured lock-range of the MILO-SILO recovered clock as a function of the baud rate with 1010 and PRBS7 patterns, for a few discrete frequency control voltages ( $v_c$ ) and switchable load capacitance ( $C_L$ ) values. The lock-range is close to 7% of the baud rate. The apparent gaps in the plot are an artifact of the discrete control voltages used in the measurements to show the wideband range of the frequency lock. With finer frequency control voltage values the plot shows a continuum of lock-range between 8-19Gb/s. Fig. 4-23 shows the DNL and INL for the MILO-SILO based phase-rotator at the extreme ends of the performance range (8-19Gb/s), using the setup shown in Fig. 4-21. With one rotator step corresponding to 1/32 of a UI, the INL of < 1 LSB shows good linearity for the resistive-interpolative injection based 360° phase rotation.
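For readers reproducing this kind of measurement, the sketch below shows one common way to turn a list of measured phase steps into DNL and INL in LSB units. It assumes that the ideal LSB is the average measured step and that INL is the running sum of DNL; the thesis does not state which definition was used, and the function name `dnl_inl` is hypothetical.

```python
def dnl_inl(steps):
    """Compute (DNL, INL) in LSB from measured phase steps (any time unit).

    Assumes the ideal LSB equals the mean of the measured steps, and that
    INL is the cumulative sum of DNL.
    """
    lsb = sum(steps) / len(steps)
    dnl = [(s - lsb) / lsb for s in steps]
    inl, acc = [], 0.0
    for d in dnl:
        acc += d          # accumulate step errors along the rotation
        inl.append(acc)
    return dnl, inl
```

With 32 steps per UI, `steps` would hold 64 entries for one full 2UI rotation, each obtained as the time shift relative to the previous waveform held in scope memory.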

Fig. 4-24 shows a typical random jitter measurement for the recovered clock. It



Figure 4-19: Fabricated chip micrograph and layout.



Figure 4-20: Measurement setup of the DUT. Data is generated in J-BERT N4903B and multiplied up using N4876A 2:1 multiplexer. The data then goes through a Megtron6 PCB before entering the DUT on the probe station. Serial-scan interface is controlled using National Instruments NI-2162 digital I/O accessory and NI PXI-1042, which is also used to interface with LabView GUI.



Figure 4-21: Setup used to measure the rotator INL/DNL. A 1010 data pattern from the J-BERT is multiplied up using the N4870A 2:1 mux and injection-locked into the MILO-SILO in the device under test (DUT). The recovered clock at the MILO-SILO phase-rotator output of the DUT is pattern-locked to a trigger in a DCA-X 86100D sampling scope. The MILO-SILO phase rotator is then stepped, and each phase step is calculated with reference to the previous waveform in scope memory.

includes not only the jitter transfer from the injected data but also the jitter of the output clock buffers and driver. Fig. 4-25(a) shows the RJ of the recovered clock for different injection data rates within the lock range. For most of the lock range the RJ shows tight tolerance. Fig. 4-25(b) shows that, after the initial frequency-lock calibration, as the supply voltage of the MILO-SILO regulator changes by  $\pm 5\%$  or as the temperature varies between 0°C and 100°C, the recovered clock RJ shows less than 20fs of deviation with no discernible trends. It is likely that, despite any change in injection-lock range and bandwidth due to supply/temperature drift, the BW remains large enough to show no appreciable change in the net integrated jitter.



Figure 4-22: Lock range of MILO-SILO at different frequency control voltages  $(v_c)$  and switchable load cap  $(C_L)$  with (a) 1010 pattern and (b) PRBS7 pattern.



Figure 4-23: Measured INL/DNL values of the MILO-SILO based phase-rotator over extremes of operating speed.

#### 4.4.3 Receiver Performance

An experimental setup to measure RX performance over a channel is shown in Fig. 4-20. Over a 20-inch Megtron6 PCB, which has 15dB of loss at 9.5GHz and significantly distorts a 19Gb/s data eye (Fig. 4-26), the RX recovers error-free ( $BER < 10^{-12}$ ) 19Gb/s PRBS7 data with a horizontal eye opening of 44% (Fig. 4-27). Without DFE, the eye opening is 22%, showing the benefit of the clock-less direct-DFE scheme. The JTOL plot at 19Gb/s is shown in Fig. 4-28(a) for PRBS7 data (at a BER of  $10^{-12}$, including ISI), giving a CDR BW of 250MHz. Fig. 4-28(b) shows the JTOL BW change with supply/temperature drift. The JTOL BW remains > 200MHz with drift after the initial frequency-lock calibration.



Figure 4-24: Measurement of random jitter  $(RJ_{rms})$  on the recovered clock includes not only the jitter transfer from the injected data but also the output clock buffers and driver.


Figure 4-25: (a) Recovered clock random jitter as a function of data baud rate post-BBFD. (b) After initial frequency lock calibration, as supply voltage of the MILO-SILO regulator changes by  $\pm 5\%$  or as temperature deviates between 0°C and 100°C, the recovered clock RJ shows less than 10fs of deviation with no discernible trends.



Figure 4-26: Measured channel insertion loss over 20-inch Megtron6 PCB and  $Rx_{Input}$  eye diagram after channel at 19Gb/s.



Figure 4-27: RX performance at 19Gbps over 20-inch MEG6 channel.



Figure 4-28: (a) Measured JTOL BW at 19Gb/s for PRBS7 data at BER of  $10^{-12}$  over a 10dB loss channel. (b) JTOL BW as a function of temperature variation after frequency lock calibration.

#### 4.4.4 Performance summary and comparison

Other RX performance data are given in Table 4.2. Fig. 4-29 shows the distribution of the power consumption over the different macros in the RX.

| Item | Description                             | Value                                                     |
|------|-----------------------------------------|-----------------------------------------------------------|
| 1    | Technology                              | 14nm CMOS                                                 |
| 2    | Data Rate                               | 8-19Gb/s                                                  |
| 3    | RX Architecture                         | 1-stage Peaking Amp, Quarter-rate Clock-less DFE, ILO-CDR |
| 4    | CDR JTOL bandwidth                      | > 200MHz                                                  |
| 5    | Input Swing                             | 450mVppd                                                  |
| 6    | Channel loss @ Nyquist                  | ~15dB                                                     |
| 7    | Horizontal Eye Opening @ BER $10^{-12}$ | 44% (19Gb/s PRBS7 data pattern)                           |
| 8    | Area                                    | $225\mu m \times 275\mu m$                                |
| 9    | Power @ 19Gb/s                          | 56mW                                                      |
| 10   | FOM (pJ/b/dB of loss) @ 19Gb/s          | 0.29pJ/b/dB                                               |
| 11   | Supply Voltages                         | 0.9V                                                      |

Table 4.2: Summary of receiver performance.
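As a quick consistency check between the power and data-rate entries of Table 4.2 and the power efficiency quoted for this work, the energy per bit at the top data rate can be worked out directly (the symbols $E_{bit}$, $P$ and $R$ are introduced here only for this illustration):

$$E_{bit} = \frac{P}{R} = \frac{56\,\mathrm{mW}}{19\,\mathrm{Gb/s}} \approx 2.95\,\mathrm{pJ/b}$$

This agrees with the 2.9 pJ/b figure reported for the receiver.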

Table 4.3 shows a state-of-the-art comparison of energy-efficient interconnects for C2C applications. Reference [87] uses a reference-less embedded-clocking based design. It shows >200MHz JTOL BW at power comparable to this work. However, this work has 1.5x the maximum speed while equalizing channels with 3x the loss @ Nyquist. The design in reference [105] requires a reference clock and is PLL-based, requiring clock-tree planning, both of which constrain an extreme-scale system with a high density of C2C communication. Even though the operating speed of that design is slightly higher at 20Gb/s, its CDR BW is 20 times lower at <10MHz compared to this work. The design in reference [106] uses a source-synchronous forwarded-clock approach. Even though it shows better power efficiency at lower operating speeds, in extreme-scale systems with an extreme density of links this approach would be challenged by clock-tree planning and EMI. Like [106], reference [107] also uses a source-synchronous forwarded-clock approach. It shows better power efficiency at lower operating speeds. As with [106], this approach would be challenged by clock-tree planning and EMI in



Figure 4-29: Power consumption distribution in the RX.

extreme-scale systems. Even though [107] quotes a CDR BW of 25-300MHz, the source-synchronous nature of this link means that SSC loses its efficiency in suppressing EMI when, say, millions of threads in a dense extreme-scale system are all synchronous to each other. Comparing this design against other designs with reference-less clock-data recovery units, as in Fig. 4-30, shows that this receiver has the best FoM when considering the operating speed, channel loss @ Nyquist and the JTOL bandwidth. In summary, this work stands out when one considers the problem it is solving. By avoiding the need for a reference clock and a complex clock tree, it simplifies clock planning in dense extreme-scale systems. By maintaining large jitter tracking bandwidths, it also enables the use of SSC for EMI suppression. Other enhancements such as master-slave ILO-based phase rotation and the clock-less DFE improve RX equalization performance in a power-efficient manner.

| Item | Description                 | This Work              | [41]                   | [105]                              |
|------|-----------------------------|------------------------|------------------------|------------------------------------|
| 1    | Technology (CMOS)           | 14nm                   | 28nm                   | 28nm                               |
| 2    | Data Rate (Gb/s)            | 8-19                   | 1-12                   | 20                                 |
| 3    | Clocking Arch.              | $Ref_{less}$ -embedded | $Ref_{less}$ -embedded | w. Ref., PLL-based<sub>async</sub> |
| 4    | Jitter tracking bandwidth   | > 200MHz               | > 200MHz               | < 10MHz                            |
| 5    | Input Swing ($mV_{ppd}$)    | 450                    | 400                    | -                                  |
| 6    | Channel loss @ Nyquist (dB) | 15                     | 5                      | 20                                 |
| 7    | Power Efficiency (pJ/b)     | 2.9                    | 2.8                    | 6.5                                |
| 8    | Supply Voltages             | 0.9V                   | 0.9V                   | 1.35V/0.9V                         |

| Item | Description (cont'd)        | This Work              | [106]                                        | [107]                                        |
|------|-----------------------------|------------------------|----------------------------------------------|----------------------------------------------|
| 1    | Technology (CMOS)           | 14nm                   | 32nm                                         | 65nm                                         |
| 2    | Data Rate (Gb/s)            | 8-19                   | 12                                           | 4-7.4                                        |
| 3    | Clocking Arch.              | $Ref_{less}$ -embedded | Source<sub>sync</sub>-forwarded<sub>clk</sub> | Source<sub>sync</sub>-forwarded<sub>clk</sub> |
| 4    | Jitter tracking bandwidth   | > 200MHz               | -                                            | 25-300MHz                                    |
| 5    | Input Swing ($mV_{ppd}$)    | 450                    | 400                                          | -                                            |
| 6    | Channel loss @ Nyquist (dB) | 15                     | 14                                           | 5                                            |
| 7    | Power Efficiency (pJ/b)     | 2.9                    | 1.9                                          | 0.92                                         |

Table 4.3: State-of-the-art comparison of energy-efficient dense VSR-C2C interconnects.



Figure 4-30: Comparison of RX CDR bandwidth, speed, power efficiency, and channel loss at Nyquist against other reference-less clock-data recovery designs.

#### 4.5 Summary

This chapter presented a reference-less receiver architecture using embedded oscillators with high jitter tolerance bandwidth for VSR-C2C channels. This receiver is shown to be 1.5x faster than previous reference-less embedded-oscillator based designs with greater than 100MHz jitter tolerance bandwidth, while recovering error-free data over VSR-C2C channels. A reference-less high-bandwidth CDR simplifies clock-tree planning in dense extreme-scale computing environments and enables SSC for suppressing EMI and for mitigating TX jitter requirements. Key design features include a linear, first-of-its-kind phase generator/interpolator based on resistively-interpolated master-slave ILOs, and a clock-less DFE. The clock-less DFE reduces clock-tree load while boosting signal-to-noise ratio in the presence of crosstalk. It is implemented seamlessly (no DFE-specific delay calibration) using variable-delay information from the embedded ILO to maintain optimal DFE loop margins while feeding back directly into the CTLE output. The RX is implemented in 14nm CMOS and characterized at 19Gb/s. It achieves a power efficiency of 2.9 pJ/b while recovering error-free data ( $BER < 10^{-12}$ ) across a 15dB loss channel. The jitter tolerance bandwidth of the receiver over supply/temperature drift is > 200 MHz, and the INL of the ILO-based phase-rotator (32 steps/UI) is < 1 LSB.

### Chapter 5

# Conclusion

This thesis presented several new architectures and integrated circuits for the realization of low-power transceivers for extreme-scale systems. The hierarchical, heterogeneous interconnect envisioned in extreme-scale systems calls for IO interconnects with diverse requirements. Some IO might need embedded clock-frequency synthesizers functioning off the core power supply; others might have low power and area footprint requirements while functioning in extremely dense spaces. The architectures and singular solutions presented in this thesis aim to solve challenges in this diverse requirement space. The small footprint, low-power, low-supply, high-frequency operation of ILOs makes them very attractive for high data rate communication. The thesis showed how integration of such systems on silicon opens up several architectural and circuit possibilities that enable good system performance in non-traditional ways. Singular performance was demonstrated by taking advantage of the unique properties of non-harmonic injection-locked oscillators.

To begin with, the thesis developed a delay-based model to predict the injection-locking behavior of non-harmonic oscillators such as ring oscillators. The effect of the injection signal on the oscillator is modeled with a d versus  $\Delta$  characteristic, which captures the additional delay d in a stage due to the effect of the injection signal with a delay  $\Delta$ . Using this characteristic, the injection-locking range as well as the injection-locking dynamics can be accurately modeled and predicted. This modeling approach was applied to a differential four-stage ring oscillator, where analytical expressions for the waveforms could be derived along with an analytical expression for the d versus  $\Delta$  characteristic. The versatility of the modeling approach was demonstrated by analyzing the locking behavior of a single-ended three-stage CMOS-inverter-based ring oscillator. In this case the d versus  $\Delta$  characteristic was derived from simulations and measurements. By simulating the d versus  $\Delta$  characteristic, the model was also applied to predict the lock range of a multi-phase injection-locked ring-oscillator-based prescaler, as well as the dynamics of tracking injection phase perturbations in injection-locked master-slave oscillators. The presented time-domain delay-based modeling approach can be applied to any non-harmonic oscillator as long as the relationship between the extra delay d and the delay  $\Delta$  between the injection signal and the relevant internal oscillator signal is available.

The thesis then presented a sub-integer clock-frequency synthesizer architecture that can operate at high speed from an ultra-low supply. A record speed of 9GHz was demonstrated at 0.5V in 45nm SOI CMOS. Key design features were described that achieve such high frequencies with fine resolution at an ultra-low supply. The proposed multi-phase multi-input ILRO-prescaler eliminates the speed bottleneck, while automatic injection-lock calibration ensures lock between the VCO and the ILRO-prescaler. The phase-switching-based programmable divider structure provides fine frequency resolution through sub-integer division. The PLL consumes 3.5mW and occupies 0.05mm<sup>2</sup>; its RMS jitter is 325fs, yielding a FOM<sub>A</sub> of -186.5.

Finally, the thesis described a receiver with a reference-less clocking architecture for high-density VSR-C2C links. This architecture simplifies clock-tree planning in dense extreme-scale computing environments and has a high-bandwidth CDR to enable SSC for suppressing EMI and to relax TX jitter requirements. Several circuit and architecture features were described, including a phase rotator based on a resistively-interpolated injection-locked oscillator with <1-LSB INL, a clock-less DFE, and a high-BW JTOL with reference-less frequency lock. Measured results show 19Gb/s link operation over channels with up to 15dB of loss at Nyquist, while achieving a power efficiency of 2.9pJ/bit. The reported reach and power efficiency demonstrate the suitability of this architecture for power-critical, high-density I/O applications with short reach, as required for future high-performance extreme-scale systems.
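The energy-per-bit figure can be related to absolute power with a short calculation. The sketch below assumes that the 2.9pJ/bit efficiency is quoted at the 19Gb/s operating point, which this summary does not state explicitly; the function name `link_power_mw` is introduced only for illustration.

```python
def link_power_mw(energy_pj_per_bit, data_rate_gbps):
    """Convert energy per bit and data rate into power in milliwatts.

    pJ/bit * Gb/s = (1e-12 J/bit) * (1e9 bit/s) = 1e-3 W = 1 mW,
    so the product of the two numbers is already in mW.
    """
    return energy_pj_per_bit * data_rate_gbps


if __name__ == "__main__":
    # Assumed operating point: 2.9 pJ/bit at 19 Gb/s.
    print(f"{link_power_mw(2.9, 19.0):.1f} mW")
```

Under that assumption the receiver power works out to roughly 55mW.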

#### 5.1 Future Research

The sub-integer clock-frequency synthesizer and the reference-less receiver using embedded oscillators have been demonstrated to be robust against drift with startup calibration. With continued scaling of CMOS technologies, and potentially larger process variation and temperature sensitivity, a useful research avenue would be to introduce dynamic calibration and adaptation into the designs. There are also potential benefits in extending this clock-frequency synthesis architecture into the fractional-N space. Compared to a conventional multi-modulus divider following the ILRO-prescaler, the sub-integer division architecture shown in this thesis, if used in an ultra-low-supply fractional-N PLL, would reduce quantization noise by 15.5dB. Another potential avenue for research is energy-proportional operation of serial links for realizing energy-efficient data centers. Burst-mode communication, where the link is powered off when idle and powered on when needed, achieves energy-proportional operation. The main challenges in achieving a small power-on time and low off-state power include the design of fast-locking PLLs and CDRs and the fast settling of bias-node voltages. The reference-less embedded-clock scheme shown in this thesis could be improved to meet these power-on time challenges for potential use in burst-mode communication.
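The 15.5dB figure can be read as a ratio of division step sizes. The sketch below assumes that the quantization-noise amplitude of a fractional-N divider is proportional to its division step, so that noise power changes by $20\log_{10}$ of the step ratio. This scaling is a common first-order assumption and is not restated from the thesis, and both function names are hypothetical.

```python
import math


def noise_reduction_db(step_ratio):
    """Quantization-noise reduction when the division step shrinks by step_ratio.

    Assumes noise amplitude is proportional to the division step size.
    """
    return 20.0 * math.log10(step_ratio)


def step_ratio_for_reduction(reduction_db):
    """Inverse relation: step-size ratio implied by a given reduction in dB."""
    return 10.0 ** (reduction_db / 20.0)


if __name__ == "__main__":
    print(f"step ratio for 15.5 dB: {step_ratio_for_reduction(15.5):.2f}")
    print(f"reduction for a 6x finer step: {noise_reduction_db(6.0):.2f} dB")
```

Under this assumption, a 15.5dB reduction corresponds to a division step roughly six times finer than that of the conventional divider.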

# Bibliography

- [1] P. Kogge et al., "ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems", *ExaScale Study Group*, 2008.
- [2] G. Yeric, "Moore's Law at 50: Are we planning for retirement?", *IEEE Inter*national Electron Devices Meeting (IEDM), 2015.
- [3] S. Borkar, "The Exascale challenge", Proceedings of 2010 International Symposium on VLSI Design, Automation, and Test, 2010.
- [4] P. Kogge et al., "Facing the Exascale Energy Wall", International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems, 2010.
- [5] H. Wu et al., "A 19GHz 0.5mW 0.35μm CMOS Frequency Divider with Shunt-Peaking Locking-Range Enhancement", *IEEE International Solid-State Cir*cuits Conf., 2001.
- [6] K. Yamamoto et al., "70GHz CMOS Harmonic Injection-Locked Divider", IEEE International Solid-State Circuits Conf., 2006.
- [7] H. Wu et al., "A 16-to-18GHz 0.18μm Epi-CMOS Divide-by-3 Injection-Locked Frequency Divider", *IEEE International Solid-State Circuits Conf.*, 2006.
- [8] P. Mayr et al., "A 90GHz 65nm CMOS Injection-Locked Frequency Divider", IEEE International Solid-State Circuits Conf., 2007.
- [9] S. Rong et al., "0.9mW 7GHz and 1.6mW 60GHz Frequency Dividers with Locking-Range Enhancement in 0.13µm CMOS", *IEEE International Solid-State Circuits Conf.*, 2009.
- [10] B.-Y. Lin et al., "A 128.24-to-137.00GHz Injection-Locked Frequency Divider in 65nm CMOS", IEEE International Solid-State Circuits Conf., 2009.
- [11] H.-K. Chen et al., "A mm-Wave CMOS Multimode Frequency Divider", IEEE International Solid-State Circuits Conf., 2009.
- [12] Z. Huang et al., "A 70.5-to-85.5GHz 65nm Phase-Locked Loop with Passive Scaling of Loop Filter", *IEEE International Solid-State Circuits Conf.*, 2015.

- [13] S.Y. Yue et al., "A 17.1 to 17.3GHz Image-Reject Down-Converter with Phase-Tunable LO Using 3x Subharmonic Injection Locking", *IEEE International Solid-State Circuits Conf.*, 2004.
- [14] S.D. Toso et al., "UWB Fast-Hopping Frequency Generation Based on Sub-Harmonic Injection Locking", *IEEE International Solid-State Circuits Conf.*, 2008.
- [15] W.L. Chan et al., "A 56-to-65GHz Injection-Locked Frequency Tripler with Quadrature Outputs in 90nm CMOS", *IEEE International Solid-State Circuits* Conf., 2008.
- [16] A. Mazzanti et al., "A 13.1% Tuning Range 115GHz Frequency Generator Based on an Injection-Locked Frequency Doubler in 65nm CMOS", *IEEE International Solid-State Circuits Conf.*, 2010.
- [17] D. Shin et al., "A Mixed-Mode Injection Frequency-Locked Loop for Self-Calibration of Injection Locking Range and Phase Noise in 0.13μm CMOS", *IEEE International Solid-State Circuits Conf.*, 2016.
- [18] M.-J.E. Lee et al., "A Second-Order Semi-Digital Clock Recovery Circuit Based on Injection Locking", *IEEE International Solid-State Circuits Conf.*, 2003.
- [19] F. O'Mahony et al., "10GHz Clock Distribution Using Coupled Standing-Wave Oscillators", IEEE International Solid-State Circuits Conf., 2003.
- [20] F. O'Mahony et al., "A 27Gb/s Forwarded-Clock I/O Receiver Using an Injection-Locked LC-DCO in 45nm CMOS", *IEEE International Solid-State Circuits Conf.*, 2008.
- [21] M. Hossain et al., "A 6.8mW 7.4Gb/s Clock-Forwarded Receiver with up to 300MHz Jitter Tracking in 65nm CMOS", *IEEE International Solid-State Cir*cuits Conf., 2010.
- [22] J.-H. Seol et al., "An 8Gb/s 0.65mW/Gb/s Forwarded-Clock Receiver Using an ILO with Dual Feedback Loop and Quadrature Injection Scheme", *IEEE International Solid-State Circuits Conf.*, 2013.
- [23] M. Raj et al., "A 4-to-11GHz Injection-Locked Quarter-Rate Clocking for an Adaptive 153fJ/b Optical Receiver in 28nm FDSOI CMOS", *IEEE Interna*tional Solid-State Circuits Conf., 2015.
- [24] J. Lee et al., "Subharmonically Injection-Locked PLLs for Ultra- Low-Noise Clock Generation", IEEE International Solid-State Circuits Conf., 2009.
- [25] P. Park et al., "An All-Digital Clock Generator Using a Fractionally Injection-Locked Oscillator in 65nm CMOS", *IEEE International Solid-State Circuits* Conf., 2012.

- [26] Y.-C. Huang et al., "A 2.4GHz Sub-Harmonically Injection-Locked PLL With Self-Calibrated Injection Timing", *IEEE International Solid-State Circuits* Conf., 2012.
- [27] W. Deng et al., "A 0.022mm<sup>2</sup> 970μW Dual-Loop Injection-Locked PLL with -243dB FOM Using Synthesizable All-Digital PVT Calibration Circuits", *IEEE International Solid-State Circuits Conf.*, 2013.
- [28] I.-T. Lee et al., "A Divider-Less Sub-Harmonically Injection-Locked PLL with Self-Adjusted Injection Timing", *IEEE International Solid-State Circuits Conf.*, 2013.
- [29] J.-C. Chien et al., "A Pulse-Position-Modulation Phase-Noise-Reduction Technique for a 2-to-16GHz Injection-Locked Ring Oscillator in 20nm CMOS", IEEE International Solid-State Circuits Conf., 2014.
- [30] W. Deng et al., "A 0.048mm<sup>2</sup> 3mW Synthesizable Fractional-N PLL with a Soft Injection-Locking Technique", *IEEE International Solid-State Circuits Conf.*, 2015.
- [31] A. Elkholy et al., "A 6.75-to-8.25GHz 2.25mW 190 fs<sub>rms</sub> Integrated-Jitter PVT-Insensitive Injection-Locked Clock Multiplier Using All-Digital Continuous Frequency-Tracking Loop in 65nm CMOS", IEEE International Solid-State Circuits Conf., 2015.
- [32] A. Elkholy et al., "A 6.75-to-8.25GHz, 250 fs<sub>rms</sub>-Integrated-Jitter 3.25mW Rapid On/Off PVT-Insensitive Fractional-N Injection-Locked Clock Multiplier in 65nm CMOS", *IEEE International Solid-State Circuits Conf.*, 2016.
- [33] D. Coombs et al., "A 2.5-to-5.75GHz 5mW 0.3ps<sub>rms</sub>-Jitter Cascaded Ring-Based Digital Injection-Locked Clock Multiplier in 65nm CMOS", *IEEE International* Solid-State Circuits Conf., 2017.
- [34] S. Yoo et al., "A PVT-Robust -39dBc 1kHz-to-100MHz Integrated- Phase-Noise 29GHz Injection-Locked Frequency Multiplier with a 600μW Frequency-Tracking Loop Using the Averages of Phase Deviations for mm-Band 5G Transceivers", *IEEE International Solid-State Circuits Conf.*, 2017.
- [35] H.C. Ngo et al., "A 0.42ps-Jitter -241.7dB-FOM Synthesizable Injection-Locked PLL with Noise-Isolation LDO", *IEEE International Solid-State Circuits Conf.*, 2017.
- [36] A. Hussein et al., "A 50-to-66GHz 65nm CMOS All-Digital Fractional-N PLL with 220 fs<sub>rms</sub> Jitter", IEEE International Solid-State Circuits Conf., 2017.
- [37] S. Kim et al., "A 2.5GHz Injection-Locked ADPLL with 197fs<sub>rms</sub> Integrated Jitter and -65dBc Reference Spur Using Time-Division Dual Calibration", *IEEE International Solid-State Circuits Conf.*, 2017.

- [38] S. Yoo et al., "A PVT-Robust -39dBc 1kHz-to-100MHz Integrated- Phase-Noise 29GHz Injection-Locked Frequency Multiplier with a 600μW Frequency-Tracking Loop Using the Averages of Phase Deviations for mm-Band 5G Transceivers", *IEEE International Solid-State Circuits Conf.*, 2017.
- [39] J. Terada et al., "Jitter-Reduction and Pulse-Width-Distortion Compensation Circuits for a 10Gb/s Burst-Mode CDR Circuit", *IEEE International Solid-State Circuits Conf.*, 2009.
- [40] K. Maruko et al., "A 1.296-to-5.184Gb/s Transceiver with 2.4mW/(Gb/s) Burst-mode CDR using Dual-Edge Injection-Locked Oscillator", *IEEE International Solid-State Circuits Conf.*, 2010.
- [41] T. Masuda et al., "A 12Gb/s 0.9mW/Gb/s Wide-Bandwidth Injection- Type CDR in 28nm CMOS with Reference-Free Frequency Capture", *IEEE International Solid-State Circuits Conf.*, 2016.
- [42] K. Schier et al., "A 57-to-66GHz Quadrature PLL in 45nm Digital CMOS", IEEE International Solid-State Circuits Conf., 2009.
- [43] K.-T. Tsai et al., "A 43.7mW 96GHz PLL in 65nm CMOS", IEEE International Solid-State Circuits Conf., 2009.
- [44] K. Kawasaki et al., "A Millimeter-Wave Intra-Connect Solution", IEEE International Solid-State Circuits Conf., 2010.
- [45] S.-J. Cheng et al., "A 110pJ/b Multichannel FSK/GMSK/QPSK/4-DQPSK Transmitter with Phase-Interpolated Dual-Injection DLL-Based Synthesizer Employing Hybrid FIR", *IEEE International Solid-State Circuits Conf.*, 2013.
- [46] K. Kamogawa, T. Tokumitsu, and M. Aikawa, "Injection-locked oscillator chain: a possible solution to millimeter-wave MMIC synthesizers," *IEEE Transactions* on Microwave Theory and Techniques, vol. 45, pp. 1578-1584, September 1997.
- [47] R. A. York and T. Itoh, "Injection and phase locking techniques for beam control," *IEEE Transactions on Microwave Theory and Techniques*, vol. 46, pp. 1920-1929, November 1998.
- [48] S. Verma, H. Rategh, and T. Lee, "A unified model for injection-locked frequency dividers," *IEEE J. of Solid-State Circuits*, vol. 38, no. 6, pp. 1105-1027, 2003.
- [49] P. Kinget, R. Melville, D. Long, and V. Gopinathan, "An injection-locking scheme for precision quadrature generation," *IEEE J. of Solid-State Circuits*, vol. 37, pp. 845-851, July-2002.
- [50] R. Adler, "A study of locking phenomena in oscillators," Proc. IEEE, vol. 61, pp. 1380-1385, Oct. 1973.

- [51] L. J. Paciorek, "Injection locking of oscillators," Proc. IEEE, vol. 53, pp. 1723-1727, Nov. 1965.
- [52] B. Razavi, "A study of injection locking and pulling in oscillators," IEEE J. of Solid-State Circuits, vol. 39, no. 9, Sept. 2004.
- [53] M. T. Jezewski, "An approach to the analysis of injection locked oscillators," in *IEEE Transactions on Circuits and Systems*, vol. CAS-21, no. 3, May 1974, pp. 395-401.
- [54] X. Zhang, X, Zhou, A. S. Daryoush, "A theoretical and experimental study of the noise behavior of subharmonically injection locked local oscillators," in *IEEE Transactions on Microwave Theory and Techniques*, vol. 40, no. 5, May 1992, pp. 895-902.
- [55] X. Lai, J. Roychowdhury, "Capturing oscillator injection locking via nonlinear phase-domain macromodels," in *IEEE Transactions on Microwave Theory and Techniques*, vol. 52, no. 9, Sept. 2004, pp. 2251-2261.
- [56] X. Lai, J. Roychowdhury, "Analytical equation for predicting injection locking in LC and ring oscillators," in *IEEE Custom Integrated Circuits Conf.*, Sept. 2005.
- [57] G. R. Gangasani, P. Kinget, "A time-domain model for predicting the injection locking bandwidth of non-harmonic oscillators," in *IEEE Transactions on Circuits and Systems II*, vol. 53, no. 10, Oct. 2006.
- [58] G. R. Gangasani, P. Kinget, "Injection-Lock dynamics in non-harmonic oscillators," in *IEEE International Symposium on Circuits and Systems*, May 2006.
- [59] R. J. Betancourt-Zamora, S. Verma, and T. Lee, "1-GHz and 2.8-GHz injectionlocked ring oscillator prescalers," in *IEEE Symp. VLSI Circuits Dig. Tech. Papers*, June 2001, pp. 47-50.
- [60] MPQ2222A, NPN silicon quad chip: Central Semiconductor Corp.
- [61] CD4007UB, CMOS Dual complementary pair inverter: Texas Instruments.
- [62] P. Kogge et al., "ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems", ExaScale Study Group, 2008.
- [63] T. Toifl, et al., "A 0.94-ps-RMS-Jitter 0.016mm<sup>2</sup> 2.5-GHz Multiphase Generator PLL with 360° Digitally Programmable Phase Shift for 10-Gb/s Serial Links", *IEEE J. of Solid-State Circuits*, vol. 40, no. 12, Dec. 2005.
- [64] B. A. Floyd, "Sub-Integer Frequency Synthesis Using Phase-Rotating Frequency Dividers", *IEEE Transactions on Circuits and Systems I*, vol. 55, no. 7, Aug. 2008.

- [65] P. R. Kinget, "Scaling Analog Circuits into Deep Nanoscale CMOS: Obstacles and Ways to Overcome Them", *IEEE Custom Integrated Circuits Conf.*, Sept. 2015.
- [66] A. Paidimarri, N. Ickes and A. P. Chandrakasan, "A 0.68V 0.68mW 2.4GHz PLL for Ultra-Low Power RF Systems", *IEEE Radio Frequency Integrated Circuits* Symposium, May 2015.
- [67] S. Ikeda, et al., "A Sub-1mW 5.5-GHz PLL with Digitally-Calibrated ILFD and Linearized Varactor for Low Supply Voltage Operation", *IEEE Radio Frequency Integrated Circuits Symposium*, June 2013.
- [68] S. Ikeda, et al., "A 0.5-V 5.5-GHz Class-C-VCO-Based PLL with Ultra-Low-Power ILFD in 65nm CMOS", *IEEE Asian Solid-State Circuits Conf.*, Nov. 2012.
- [69] S.-A. Yu and P. R. Kinget, "A 0.65V 2.5GHz Fractional-N Frequency Synthesizer in 90nm CMOS", *IEEE International Solid-State Circuits Conf.*, Feb. 2007.
- [70] N. Krishnapura and P. R. Kinget, "A 5.3-GHz Programmable Divider for HiPer-LAN in 0.25-μm CMOS", *IEEE J. of Solid-State Circuits*, vol. 35, no. 7, July 2000.
- [71] A. Momtaz, et al., "Fully-Integrated SONET OC48 Transceiver in Standard CMOS", *IEEE International Solid-State Circuits Conf.*, Sept. 2001.
- [72] N. H. W. Fong, et al., "A 1-V 3.8-5.7-GHz Wide-Band VCO With Differentially Tuned Accumulation MOS Varactors for Common-Mode Noise Rejection in CMOS SOI Technology", *IEEE Transactions on Microwave Theory and Techniques*, vol. 51, no. 8, Aug. 2003.
- [73] J.-K. Kim, et al., "A 26.5-37.5 GHz Frequency Divider and a 73-GHz-BW CML Buffer in 0.13μm CMOS", *IEEE Asian Solid-State Circuits Conf.*, Nov. 2007.
- [74] G. R. Gangasani and P. R. Kinget, "Time-Domain Model for Injection Locking in Nonharmonic Oscillators", *IEEE Transactions on Circuits and Systems I*, vol. 55, no. 6, July 2008.
- [75] Y.-C. Lo, H.-P. Chen, J. Silva-Martinez and S. Hoyos, "A 1.8V, Sub-mW, Over 100% Locking Range, Divide-by-3 and 7 Complementary-Injection-Locked 4 GHz Frequency Divider", *IEEE Custom Integrated Circuits Conf.*, Sept. 2009.
- [76] G. G. Shahidi, et al., "Device and Circuit Design Issues in SOI Technology", IEEE Custom Integrated Circuits Conf., Sept. 1998.
- [77] J.F. Bulzacchelli, et al., "A 28-Gb/s 4-Tap FFE/15-Tap DFE Serial Link Transceiver in 32-nm SOI CMOS Technology", *IEEE J. of Solid-State Circuits*, vol. 47, no. 12, Dec. 2012

- [78] H.-H. Hsieh, C.-T. Lu and L.-H. Lu, "A 0.5-V 1.9-GHz Low-Power Phase-Locked Loop in 0.18-μm CMOS", *IEEE Symposium on VLSI Circuits*, June 2007.
- [79] P. Kogge et al., "ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems", ExaScale Study Group, 2008.
- [80] P. Kogge et al., "Facing the Exascale Energy Wall", International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems, 2010.
- [81] T. Dickson et al., "A 1.4 pJ/bit, Power-Scalable 16x12 Gb/s Source-Synchronous I/O With DFE Receiver in 32 nm SOI CMOS Technology", *IEEE J. of Solid-State Circuits*, vol. 50, no. 8, Aug. 2015.
- [82] T. Dickson et al., "A 1.8 pJ/bit 16x16 Gb/s Source-Synchronous Parallel Interface in 32 nm SOI CMOS with Receiver Redundancy for Link Recalibration", *IEEE J. of Solid-State Circuits*, vol. 51, no. 8, July 2016.
- [83] T. Toifl et al., "A 2.6 mW/Gbps 12.5 Gbps RX With 8-Tap Switched-Capacitor DFE in 32 nm CMOS", *IEEE J. of Solid-State Circuits*, vol. 47, no. 4, April 2012.
- [84] M. Hossain et al., "A 6.8mW 7.4Gb/s Clock-Forwarded Receiver with up to 300MHz Jitter Tracking in 65nm CMOS", *IEEE International Solid-State Cir*cuits Conf., 2010.
- [85] Pier Andrea Francese et al., "A 16 Gb/s 3.7 mW/Gb/s 8-Tap DFE Receiver and Baud-Rate CDR With 31 kppm Tracking Bandwidth", *IEEE J. of Solid-State Circuits*, vol. 49, no. 11, Nov. 2014.
- [86] G.R. Gangasani et al., "A 32 Gb/s Backplane Transceiver With On-Chip AC-Coupling and Low Latency CDR in 32 nm SOI CMOS Technology", *IEEE J.* of Solid-State Circuits, vol. 49, no. 11, Nov. 2014.
- [87] T. Masuda et al., "A 12Gb/s 0.9mW/Gb/s Wide-Bandwidth Injection-Type CDR in 28nm CMOS with Reference-Free Frequency Capture ", *IEEE International Solid-State Circuits Conf.*, 2016.
- [88] G. Shu et al., "A Reference-Less Clock and Data Recovery Circuit Using Phase-Rotating Phase-Locked Loop", *IEEE J. of Solid-State Circuits*, vol. 49, no. 4, April 2014.
- [89] W. Rahman et al., "A 22.5-to-32Gb/s 3.2pJ/b Referenceless Baud-Rate Digital CDR with DFE and CTLE in 28nm CMOS ", *IEEE International Solid-State Circuits Conf.*, 2017.
- [90] J. D'Ambrosia, "IEEE 802.3WG Closing Plenary Report, IEEE P802.3bj 100 Gb/s Backplane and Copper Cable Task Force", http://www.ieee802.org, 2012.

- [91] T. Beukema et al., "A 6.4-Gb/s CMOS SerDes Core With Feed-Forward and Decision-Feedback Equalization", *IEEE J. of Solid-State Circuits*, vol. 40, no. 12, Dec. 2005.
- [92] V. Stojanovic et al., "Modeling and Analysis of High-speed links ", IEEE Custom Integrated Circuits Conf., 2003.
- [93] V. Dmitriev-Zdorov et al., "BER- and COM-Way of Channel-Compliance Evaluation: What are the Sources of Differences", *DesignCon*, 2016.
- [94] G.R. Gangasani et al., "A 28.05Gb/s Transceiver using Quarter-Rate Triple-Speculation Hybrid-DFE Receiver with Calibrated Sampling Phases in 32nm CMOS", *IEEE Symposium on VLSI Circuits*, 2017.
- [95] B. Casper et al., "Clocking Analysis, Implementation and Measurement Techniques for High-Speed Data Links-A Tutorial", *IEEE Transactions on Circuits* and Systems I, vol. 56, no. 1, Jan. 2009.
- [96] J. Lee et al., "A 20-Gb/s Burst-Mode Clock and Data Recovery Circuit Using Injection-Locking Technique", *IEEE J. of Solid-State Circuits*, vol. 43, no. 3, Mar. 2008.
- [97] M.L. Schmatz et al., "A 40-Gb/s, Digitally Programmable Peaking Limiting Amplifier with 20-dB Differential Gain in 90-nm CMOS", *IEEE Radio Fre*quency Integrated Circuits Symposium, 2006.
- [98] M. Aleksic, "A 3.2-GHz 1.3-mW ILO Phase Rotator for Burst-Mode Mobile Memory I/O in 28-nm Low-Leakage CMOS", *IEEE European Solid-State Cir*cuits Conf., 2014.
- [99] F. O'Mahony et al., "A programmable Phase Rotator based on Time-Modulated Injection-Locking", *IEEE Symposium on VLSI Circuits*, 2010.
- [100] G.R. Gangasani et al., "A 16-Gb/s Backplane Transceiver With 12-Tap Current Integrating DFE and Dynamic Adaptation of Voltage Offset and Timing Drifts in 45-nm SOI CMOS Technology", *IEEE J. of Solid-State Circuits*, vol. 47, no. 8, Aug. 2012.
- [101] M. Pozzoni et al., "A 12Gb/s 39dB Loss-Recovery Unclocked-DFE Receiver with Bi-dimensional Equalization", *IEEE International Solid-State Circuits* Conf., 2010.
- [102] G. Shu et al., "A 4-to-10.5 Gb/s Continuous-Rate Digital Clock and Data Recovery With Automatic Frequency Acquisition", *IEEE J. of Solid-State Circuits*, vol. 51, no. 2, Feb. 2016.
- [103] C.-F. Liang et al., "A Reference-Free, Digital Background Calibration Technique for Gated-Oscillator-Based CDR/PLL", *IEEE Symposium on VLSI Cir*cuits, 2009.

- [104] A. Elkholy et al., "A 6.75-to-8.25GHz 2.25mW 190 fs<sub>rms</sub> Integrated-Jitter PVT-Insensitive Injection-Locked Clock Multiplier Using All-Digital Continuous Frequency-Tracking Loop in 65nm CMOS", IEEE International Solid-State Circuits Conf., 2015.
- [105] V. Balan et al., "A 130mW 20Gb/s Half-Duplex Serial Link in 28nm CMOS", IEEE International Solid-State Circuits Conf., 2014.
- [106] T. Dickson et al., "A 1.4 pJ/bit, Power-Scalable 16x12 Gb/s Source-Synchronous I/O With DFE Receiver in 32 nm SOI CMOS Technology", *IEEE J. of Solid-State Circuits*, vol. 50, no. 8, Aug. 2015.
- [107] M. Hossain et al., "A 6.8mW 7.4Gb/s Clock-Forwarded Receiver with up to 300MHz Jitter Tracking in 65nm CMOS", *IEEE International Solid-State Cir*cuits Conf., 2010.
- [108] W. Rahman et al., "A 22.5-to-32-Gb/s 3.2-pJ/b Referenceless Baud-Rate Digital CDR With DFE and CTLE in 28-nm CMOS", *IEEE J. of Solid-State Circuits*, vol. 52, no. 12, Dec. 2017.
- [109] M.S. Jalali et al., "A Reference-Less Single-Loop Half-Rate Binary CDR", IEEE J. of Solid-State Circuits, vol. 50, no. 9, Sept. 2015.
- [110] N. Kocaman et al., "An 8.5-11.5-Gbps SONET Transceiver With Referenceless Frequency Acquisition", *IEEE J. of Solid-State Circuits*, vol. 48, no. 8, Aug. 2013.
- [111] J. Lee et al., "A 20-Gb/s Full-Rate Linear Clock and Data Recovery Circuit With Automatic Frequency Acquisition", *IEEE J. of Solid-State Circuits*, vol. 44, no. 12, Dec. 2009.