Abstract: A near-threshold forwarded-clock I/O receiver architecture is presented. In the proposed receiver, most of the circuitry is designed to operate in the near-threshold region at a 0.6 V supply to save power; only the global clock buffer, test buffers, and synthesized digital circuits run at the nominal 1 V supply. To ensure that the quantizers work properly at this low supply, a 1:10 direct demultiplexing ratio is chosen to demonstrate low-supply operation through high parallelism. A novel low-power super-harmonic injection-locked ring oscillator is proposed to generate deskewable, symmetric multi-phase local clocks. The relative performance impact of including a per-data-lane sample-and-hold (S/H) to improve quantizer aperture time at low voltage is demonstrated with two receiver prototypes fabricated in a 65 nm CMOS technology. Including the amortized power of global clock distribution, the receiver without S/H consumes 1.3 mW and the one with S/H consumes 2 mW at an 8 Gb/s input data rate, which corresponds to 0.163 pJ/bit and 0.25 pJ/bit, respectively. Measurement results show both receivers get BER 10 across a 20-cm FR4 PCB channel.
chip-to-chip I/O bandwidth between cores and memories needs to scale at the same rate in order to keep the computation units well loaded and achieve the best performance. However, due to practical limitations such as channel loss and crosstalk, the data rate per pin is projected to rise by only about 10× in 2024 relative to 2009. Given that the maximum pin count increases by about 2× to 4× during the same period [1], there is a large gap in aggregate bandwidth that link designers must close to meet the performance trend.
As observed in Fig. 1(b), the energy per bit of recently reported I/O transceivers has been improving at a much slower rate than the projected requirements for off-chip bandwidth [2], [3]. As a result, without significant improvements in energy efficiency, I/O power dissipation is likely to limit the overall performance and thermal requirements of future processor systems.
0018-9200/$31.00 © 2012 IEEE
A large percentage of serial link power is often consumed in the receiver, which usually needs to successfully quantize and demultiplex the incoming signal at a bit-error rate (BER) less than 10 . This level of performance often demands that the receiver include some equalization to compensate for channel frequency-dependent loss, as well as the ability to properly deskew the receiver clocks in order to provide sufficient timing margin for the sampling of incoming data.
As clock generation and distribution consume a significant portion of total receiver power, recent low-power I/O transceivers have leveraged techniques such as sharing phase deskew circuitry among several bundled link channels [2] or resonant clock distribution [3] to decrease global clock dynamic power. In these designs, sample clock phase deskew is achieved using phase interpolators with delay-locked loops (DLLs) [2] or phase-rotating phase-locked loops (PLLs) [3]. Relative to conventional phase interpolators, injection-locked oscillators (ILOs) [4]-[6] have been introduced as an energy-efficient technique for deskewing the phase positions of sampling clocks.
Another key technique to improve serial link energy efficiency borrows from low-power processor design [7], and involves scaling the supply voltage to the minimum level required for the desired BER [8], [9]. Implementing circuit parallelism at the serial link receiver front-end by performing a high degree of input demultiplexing allows multiple receivers to operate at lower voltages, reducing the dynamic energy consumption quadratically [8]. However, several new challenges arise as the supply voltage is reduced to near the transistor threshold voltage, due to increased sensitivity to device variations and mismatches. As shown in Fig. 2, both oscillator phase mismatch and comparator offset degrade with supply voltage reduction.
This paper presents two low-power 8 Gb/s forwarded-clock receivers that improve upon a previous low-power receiver architecture [5] by leveraging new mixed-signal circuit techniques: a super-harmonic injection-locked ring oscillator, which allows a high input demultiplexing ratio of 1:10 to achieve operation near the threshold voltage with improved phase noise; bootstrapped sample/holds for improved quantizer aperture time at low $V_{DD}$; and digital calibration to compensate for timing and voltage offsets. The impact and limitations associated with an increased input demultiplexing factor and near-threshold operation of key receiver circuits, such as the multi-phase oscillator, quantizer, and continuous-time linear equalizer (CTLE), are discussed in Section II. Section III details the receiver architectures and key circuit blocks. Experimental results from a 65 nm CMOS prototype are presented in Section IV. Finally, Section V concludes the paper.
II. RECEIVER ARCHITECTURAL CONSIDERATIONS

A. Voltage Scaling
Higher circuit parallelism enables more aggressive supply voltage scaling, which reduces energy consumption in a quadratic fashion [8] . However, this methodology cannot be pursued indefinitely as circuit performance degrades non-linearly as transistors approach near-threshold operation. In this section, we explore several limiting factors that guide the optimal choice of scaled operation in a 65 nm technology with threshold voltages near 350 mV for LVT devices.
1) Voltage-Controlled Oscillator (VCO):
For a multi-phase ring VCO, the product of the number of phases and the oscillation frequency must remain constant for a given data rate. As $V_{DD}$ is reduced, the oscillation frequency drops, requiring a larger number of interpolated phases. Various methods have been proposed to increase the number of generated phases, such as tapped delay lines [2], coupled ring-based oscillators [10], poly-phase filters [11], and various forms of phase interpolation [12]. While the energy per stage improves quadratically with more delay stages running at a lower $V_{DD}$, two major sources of uncertainty prevent continued scaling: transistor mismatch and phase noise.
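The phase-count constraint stated above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the 8 Gb/s and 800 MHz values are the ones quoted elsewhere in this paper for the 0.6 V oscillator.

```python
def phases_required(data_rate_bps: float, f_osc_hz: float) -> int:
    """Number of sampling phases so that phases * f_osc equals the data rate."""
    ratio = data_rate_bps / f_osc_hz
    if abs(ratio - round(ratio)) > 1e-9:
        raise ValueError("data rate must be an integer multiple of f_osc")
    return round(ratio)


def phase_spacing_s(n_phases: int, f_osc_hz: float) -> float:
    """Time between adjacent phases of an evenly spaced multi-phase clock."""
    return 1.0 / (n_phases * f_osc_hz)


# 8 Gb/s data sampled by a ring oscillator running at 800 MHz
n = phases_required(8e9, 800e6)
spacing = phase_spacing_s(n, 800e6)
print(n, spacing * 1e12, "ps")
```

Lowering the oscillation frequency at a fixed data rate raises the required phase count in direct proportion, which is why phase mismatch between many stages becomes a central concern at low supply.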
As the gate overdrive is reduced at lower $V_{DD}$, susceptibility to process uncertainties, such as threshold voltage mismatch, increases substantially. Fig. 2(a) shows the simulated differential non-linearity (DNL) of the phase spacing of a 10-stage ring oscillator at two supply voltages. Phase mismatch degrades by more than 2× between these two operating conditions. Fortunately, sub-picosecond resolution capacitive phase vernier interpolators [13] can be applied to tune out the phase error. Alternatively, static phase mismatches arising in multi-phase generators can be measured and calibrated to less than 10 ps [12] and 2 ps [14] using phase binning and averaging.
While DC phase mismatches can be calibrated offline, the problem of intrinsic VCO phase noise is more difficult to address. Based on the phase noise model by Hajimiri [15], two degradations to phase noise arise as $V_{DD}$ is lowered. First, intrinsic thermal noise (kT/C) becomes proportionally larger because of the linear reduction in capacitor charge as a result of the scaled $V_{DD}$. Furthermore, the lower $V_{DD}$ results in slower inverter rise/fall edges, degrading the impulse sensitivity function and resulting in higher phase noise.
These effects are illustrated in the simulation results of Fig. 3, which compares the phase noise of a 4 GHz, 6-delay-stage injection-locked ring oscillator operating at 1.0 V with that of an 800 MHz, 10-delay-stage version operating at 0.6 V. When both are free-running, the 1 V VCO exhibits higher phase noise than the 0.6 V one. However, once injection-locked, the low-frequency noise is high-pass filtered in both cases and becomes comparable. This is because the 1 V VCO has a larger bandwidth (50 MHz) than the 0.6 V VCO (about 20 MHz), resulting in a wider band of phase noise filtering.
VCO jitter can be expressed as a function of its phase noise power spectral density , as derived from [15] and [16] ,
where $\sigma$ is the RMS jitter, $f_0$ is the free-running oscillation frequency, and $\mathcal{L}(f)$ is the phase noise. When injection-locked, the integrated jitter from simulation is 0.34 ps rms for the 1 V, 4 GHz oscillator and 1.69 ps rms for the 0.6 V, 800 MHz oscillator, excluding input reference clock jitter. If we assume that the bounding probability is 10 , the peak jitter amplitude for a Gaussian source is 8$\sigma$. For an 8 Gb/s data rate (1 UI = 125 ps), with jitter on both edges and no jitter tracking included in this hand calculation, an opening of 98 ps (0.78 UI) remains. Hence, for $V_{DD}$ = 0.6 V, the increase in phase noise is tolerable for an 8 Gb/s data rate, while still providing some margin for other factors such as power-supply-induced jitter (PSIJ).
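The hand calculation above can be reproduced directly. The following is a minimal sketch, assuming a purely Gaussian jitter distribution and a peak amplitude of 8 sigma on each of the two data edges; the function names are illustrative and not taken from the paper.

```python
import math


def gaussian_tail(n_sigma: float) -> float:
    """One-sided probability that a Gaussian variable exceeds n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


def eye_opening_ps(ui_ps: float, sigma_rms_ps: float, n_sigma: float = 8.0) -> float:
    """UI minus the peak jitter (n_sigma * sigma) removed from both edges."""
    return ui_ps - 2.0 * n_sigma * sigma_rms_ps


UI_PS = 1e12 / 8e9                          # 125 ps at 8 Gb/s
open_06v = eye_opening_ps(UI_PS, 1.69)      # 0.6 V, 800 MHz oscillator
open_1v = eye_opening_ps(UI_PS, 0.34)       # 1 V, 4 GHz oscillator

print(f"0.6 V: {open_06v:.2f} ps ({open_06v / UI_PS:.2f} UI)")
print(f"1.0 V: {open_1v:.2f} ps ({open_1v / UI_PS:.2f} UI)")
print(f"P(x > 8 sigma) = {gaussian_tail(8.0):.2e}")
```

With the simulated 1.69 ps rms figure, the 0.6 V case loses about 27 ps of the 125 ps UI, which matches the 98 ps opening quoted in the text.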
2) Sampler: At lower supply voltages, samplers² require more time to resolve each low-swing input signal to full-swing digital levels. As a result, a higher level of circuit parallelism is required to satisfy the I/O bandwidth. Fig. 4(a) shows the simulated comparator delay for varying supply voltages, using a comparator structure similar to [5]. Fig. 4(b) shows the improvement in energy per quantization as $V_{DD}$ is lowered. In this 65 nm CMOS process, as $V_{DD}$ is lowered from 0.6 V to 0.5 V, energy per bit is reduced by only 11%, but the delay increases enough to require twice as much circuit parallelism. This increase in circuit parallelism also increases the loading on the CTLE, as well as the area and wiring overhead, such that the benefit of $V_{DD}$ reduction diminishes quickly below 0.6 V for this process technology.
While a higher level of circuit parallelization at low $V_{DD}$ gives each comparator more time to resolve its decision, the sample/hold aperture time requirement is a fundamental limitation that cannot be relaxed. Due to the limited sampling bandwidth and degraded clock slew rate, the actual comparator input is a weighted average over a finite time period, and is characterized by its impulse sensitivity function (ISF) [17], [18]:
$$v_{\mathrm{eff}} = \int_{-\infty}^{\infty} v_{\mathrm{in}}(t)\,h(t)\,dt \tag{3}$$
where the integral of $h(t)$ from $-\infty$ to $\infty$ is normalized to 1. Fig. 5 shows the normalized comparator ISF at different supply voltages. At lower $V_{DD}$, the ISF spreads out (similar to an integrating receiver) and the aperture time, defined as the time period that accounts for 90% of the area under the ISF, increases. The effective aperture time increases to 7.5 ps (36%) as $V_{DD}$ is lowered from 1.0 V to 0.6 V, and to 11 ps (47%) as $V_{DD}$ is lowered from 0.6 V to 0.5 V. Again, the rate of performance degradation rises abruptly near a supply voltage of 0.6 V, which appears to be the lower limit for supply scaling in this process technology.
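The aperture-time definition above (the time window holding 90% of the ISF area) can be turned into a small numerical routine. This is a sketch under stated assumptions: the window is taken as the central 90% (5% of the area excluded on each side), the ISF is non-negative, and the Gaussian-shaped ISF with a 3 ps standard deviation is a made-up test input rather than data from the paper.

```python
import math


def aperture_time(t, h, fraction=0.9):
    """Width of the central window containing `fraction` of the ISF area."""
    # Cumulative area under h(t) by the trapezoidal rule
    cum = [0.0]
    for i in range(1, len(t)):
        cum.append(cum[-1] + 0.5 * (h[i] + h[i - 1]) * (t[i] - t[i - 1]))
    total = cum[-1]
    tail = 0.5 * (1.0 - fraction)

    def crossing(level):
        """Time at which the normalised cumulative area reaches `level`."""
        target = level * total
        for i in range(1, len(t)):
            if cum[i] >= target:
                span = cum[i] - cum[i - 1]
                frac = 0.0 if span == 0.0 else (target - cum[i - 1]) / span
                return t[i - 1] + frac * (t[i] - t[i - 1])
        return t[-1]

    return crossing(1.0 - tail) - crossing(tail)


# Hypothetical Gaussian-shaped ISF (sigma = 3 ps), sampled every 0.01 ps
sigma_ps = 3.0
t_ps = [-30.0 + 0.01 * i for i in range(6001)]
h = [math.exp(-0.5 * (x / sigma_ps) ** 2) for x in t_ps]
ap = aperture_time(t_ps, h)
print(f"aperture time: {ap:.2f} ps")
```

For a Gaussian shape the central 90% spans about ±1.645 standard deviations, so a wider ISF at lower supply translates linearly into a longer aperture time under this definition.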
² Also known as quantizers or sense amplifiers, such as the StrongArm latch.
3) Continuous-Time Linear Equalizer (CTLE): While the power-delay trade-off for a quantizer is relatively straightforward, this is not the case for the CTLE. As shown in Fig. 6 , its transfer function can be written as: (4) where (5) For this short-distance application, the desired peaking factor needs to be approximately 10 dB. Assuming that the combined voltage drop across input device and the current source is and can be related as follows:
Using a square-law approximation, the peak gain can be written as: (9) This peak gain increases with the load resistance for a given headroom. The maximum value of the load resistance is limited by the bandwidth at the output node: (10) where k indicates the distance from the second pole to the Nyquist frequency. Equations (9) and (10) indicate that, for a given transistor size, reducing $V_{DD}$ also reduces the CTLE peak gain. This is because at lower $V_{DD}$, a higher degree of sampler time-interleaving is required, adding to the CTLE load capacitance. In order to meet the bandwidth requirement, the load resistance must be reduced accordingly. Since the terms that set the peak gain in (9) decrease with $V_{DD}$, so does the peak gain. Although the peaking factor is not directly affected by $V_{DD}$ scaling, the reduced peak gain limits the CTLE output swing. Note that the peak gain can potentially be boosted by using larger device sizes. However, pushing this too far is counterproductive once the output node becomes dominated by CTLE self-loading.
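To make the load-capacitance argument concrete, the sketch below uses the common textbook model of a source-degenerated differential CTLE. It is not a reproduction of this paper's equations (4) to (10): the symbols `gm`, `r_d`, `r_s`, `c_s`, `c_l`, the rule that the output pole sits at `k` times the Nyquist frequency, and all component values are assumptions chosen for illustration.

```python
import math


def ctle_gain(f_hz, gm, r_d, r_s, c_s, c_l):
    """Complex gain of a textbook source-degenerated CTLE half-circuit."""
    s = 2j * math.pi * f_hz
    num = gm * r_d * (1 + s * r_s * c_s)
    den = (1 + gm * r_s / 2 + s * r_s * c_s) * (1 + s * r_d * c_l)
    return num / den


def peaking_db(gm, r_s):
    """Ratio of ideal peak gain (gm * r_d) to DC gain, in dB."""
    return 20 * math.log10(1 + gm * r_s / 2)


def max_load_resistance(k, f_nyq_hz, c_l):
    """Largest r_d that keeps the output pole at k times Nyquist (assumed rule)."""
    return 1 / (2 * math.pi * k * f_nyq_hz * c_l)


gm = 2e-3                                    # hypothetical transconductance, S
r_s = 2.0 * (10 ** (10 / 20) - 1) / gm       # degeneration giving 10 dB peaking
c_s = 100e-15                                # hypothetical degeneration cap, F
c_l_low, c_l_high = 50e-15, 100e-15          # load before/after doubling the interleaving

r_d_low = max_load_resistance(1.5, 4e9, c_l_low)
r_d_high = max_load_resistance(1.5, 4e9, c_l_high)
print(f"peaking: {peaking_db(gm, r_s):.1f} dB")
print(f"ideal peak gain: {gm * r_d_low:.2f} -> {gm * r_d_high:.2f}")
```

In this model, doubling the load capacitance at a fixed pole position halves the allowed load resistance and therefore the ideal peak gain, while the peaking factor, set by the degeneration network, is unchanged.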
To better understand the effect of supply voltage scaling on the CTLE, simulation results at different supply voltages are shown in Fig. 7. Throughout the simulation, k and W/L are kept constant. For the same demultiplexing ratio N, power consumption scales almost linearly, while peak gain decreases at a lower rate. However, once N increases to compensate for the rise in sampler delay, peak gain drops significantly, and the accompanying overhead outweighs the benefit of $V_{DD}$ scaling, resulting in higher power consumption. At 0.5 V, the CTLE provides the lowest peak gain while consuming the highest power. This fast performance roll-off below 0.6 V is consistent with that of the other building blocks.
From the analysis above, we conclude that although 0.6 V may not be the optimal operating point for every block, it provides an attractive trade-off among power, performance, and design complexity.
B. Trade-Offs in Forwarded-Clock Architecture Using ILO
The ILO bandwidth and the forwarded-clock frequency are two important design considerations in source-synchronous parallel links. One of the main advantages of forwarded-clock architectures is that the clock and data channels are clocked by the same transmit oscillator, so some of their jitter is correlated and tracks. On one hand, because of the delay mismatch between the clock and data channels, high-frequency jitter becomes harmful, as clock and data eventually fall out of phase [6], [19]; the ILO bandwidth should therefore be low enough not to track high-frequency jitter. On the other hand, since an ILO behaves like a first-order PLL, it low-pass filters the noise from the injection clock and high-pass filters its own noise [5]. The ILO bandwidth should therefore also be high enough to suppress its own phase noise. In practical designs, this direct trade-off leads to ILO bandwidths in the range of tens to hundreds of MHz [5], [6], depending on the environment and application.
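The bandwidth trade-off described above follows from the first-order behavior of the ILO. The sketch below models it with single-pole low-pass and high-pass magnitude responses; the function names and the `f_bw_hz` parameter are illustrative, and the 20 MHz and 50 MHz bandwidths are the simulated values quoted in Section II-A.

```python
import math


def injected_noise_gain(f_hz: float, f_bw_hz: float) -> float:
    """Low-pass magnitude applied to jitter arriving on the injected clock."""
    return 1.0 / math.sqrt(1.0 + (f_hz / f_bw_hz) ** 2)


def oscillator_noise_gain(f_hz: float, f_bw_hz: float) -> float:
    """High-pass magnitude applied to the oscillator's own phase noise."""
    x = f_hz / f_bw_hz
    return x / math.sqrt(1.0 + x * x)


for f_bw in (20e6, 50e6):
    for f in (1e6, 10e6, 100e6):
        print(f"bw={f_bw / 1e6:.0f} MHz  f={f / 1e6:.0f} MHz  "
              f"tracks injected jitter x{injected_noise_gain(f, f_bw):.2f}  "
              f"passes own noise x{oscillator_noise_gain(f, f_bw):.2f}")
```

A wider bandwidth suppresses more of the oscillator's own low-frequency noise but also tracks injected jitter up to a higher frequency, which is the tension the paragraph describes.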
In order to ease the design complexity while maintaining low power in the forwarded-clock channel, a typical choice is to forward the Nyquist frequency of the I/O baud rate or one of its sub-harmonics, such as a 1 GHz, 2 GHz, or 4 GHz forwarded clock for an 8 Gb/s data rate. After clock and data travel through a lossy channel, their jitter is enhanced by the jitter amplification effect [20], and the amplification differs for different Nyquist frequencies. For example, simulation results of jitter amplification based on the measured channel characteristics of a 20-cm PCB trace, shown in Fig. 8, are plotted in Fig. 9. They show that jitter amplification varies with the forwarded-clock frequency. Therefore, to maintain well-matched jitter between the clock and data lanes, a half-rate forwarded clock, whose frequency equals the Nyquist frequency of the data channel, is desirable for better jitter tracking, at the cost of more clock distribution power than lower-frequency sub-harmonic rates.
III. ARCHITECTURE AND CIRCUIT IMPLEMENTATION
A. Receiver Architecture
The architecture of the proposed forwarded-clock receiver is depicted in Fig. 10. A half-rate clock (4 GHz) is forwarded, buffered, and distributed to three data receivers and a standalone oscillator for test purposes in this prototype. Operating from a 1 V supply, the global CML clock buffer drives the 600 µm long clock distribution to the super-harmonic ILO in each receiver for multi-phase generation and local deskewing.
For the data path, two prototypes (RX1 and RX2) are realized to compare the performance of the data lane without and with an input S/H. As shown in Fig. 11(a), for RX1, the received 8 Gb/s data is first fed to the CTLE, and then directly sampled and demultiplexed by ten deskewed phases from the super-harmonic ILO into ten 800 Mb/s recovered data outputs. Finally, these outputs are multiplexed out for test purposes to reduce the number of pads. As mentioned previously, to maximize the timing margin for the quantizers, S/H circuits are placed in front of each quantizer in RX2, as shown in Fig. 11(b). As the on-resistance of conventional switches gets worse at lower supply voltages, the bootstrapped switches proposed in [21] are used in the S/H to reduce on-resistance and minimize signal-dependent distortion. Following the main switch, a dummy switch driven by a complementary clock phase is employed to minimize clock feedthrough and charge injection.
Except for the global clock buffer and test buffers, the other circuits, such as the super-harmonic ILO, CTLE, and quantizers, are designed to operate at a 0.6 V supply. To address slower transistor speed at low supply voltage, a highly parallel architecture using 1:10 demultiplexing is chosen, so that the sampling clock and quantizers of each lane can operate at a much lower frequency. The quantizer is a two-stage sense amplifier with only three stacked transistors [22] for low-supply operation. To guard against process variation, digital trimming bits are used throughout the receiver for quantizer offset calibration, oscillator frequency tuning, and phase deskew. These calibrations are done at startup.
B. Super-Harmonic ILO
Fig. 12 shows the schematic of the proposed near-threshold super-harmonic ILO. It contains a ring oscillator and an injection pair. The five-stage differential ring oscillator generates ten evenly spaced phases (P[0] to P[9]), with a free-running frequency designed to be 800 MHz so that the spacing between two adjacent phases equals 1 UI (125 ps for 8 Gb/s). Negatively-skewed phase interpolation [23] is employed to increase the ring oscillator frequency at the 0.6 V supply. The oscillator incorporates three sources of frequency control: supply voltage (fixed at 0.6 V in this design), 40-bit thermometer-encoded current starving for fine tuning, and a DC-biased PMOS load (Vc) in each delay cell for coarse tuning.
As conceptually demonstrated in Fig. 13, in first-harmonic injection-locked ring oscillators [5], the injection signal loads one particular stage more than the others. In the super-harmonic ILO, however, the differential clock signal is injected into the common-source nodes (CSP and CSN) instead of directly loading any output phase. This relieves the problem of asymmetric injection and the adjacent static phase error caused by unequal capacitive loading in first-harmonic injection-locked ring oscillators.
Following the principle of first-harmonic injection locking [4], [5], in the super-harmonic ILO the difference between the free-running frequency and the M-th sub-harmonic of the injection clock results in a phase shift at the output when locked (M = 5 for this design). The amount of phase shift depends on this frequency difference and on the locking range, according to Adler's equation [24]. The 40-bit thermometer-encoded fine frequency tuning bits are therefore used for deskew by detuning the free-running frequency. This gives about 0.4 UI of deskew range with fine steps. To extend the deskew range to a full UI, inversion-mode PMOS varactors provide coarse deskew tuning by adjusting the capacitive loading of the branches external to the oscillator, controlled by Vd (Fig. 12). Once Vd is set to roughly cover the phase difference between clock and data, the digitally controlled fine tuning further deskews the phases in fine steps.
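The dependence of the locked phase on detuning can be illustrated with the steady-state solution of Adler's equation, d(theta)/dt = delta_omega - omega_L * sin(theta), which settles at theta = arcsin(delta_omega / omega_L) inside the locking range. The sketch below is generic: the sign convention, the function name, and the 60 MHz single-sided locking range are assumptions for illustration, and no mapping from phase to picoseconds is attempted for this design.

```python
import math


def locked_phase_rad(delta_f_hz: float, lock_range_hz: float) -> float:
    """Steady-state phase offset of a locked oscillator from Adler's equation.

    delta_f_hz    : detuning between free-running and locked frequency
    lock_range_hz : single-sided locking range (same units as delta_f_hz)
    """
    if abs(delta_f_hz) > lock_range_hz:
        raise ValueError("detuning exceeds the locking range; oscillator unlocks")
    return math.asin(delta_f_hz / lock_range_hz)


lock_range = 60e6  # hypothetical single-sided locking range, Hz
for detune in (-60e6, -30e6, 0.0, 30e6, 60e6):
    theta = math.degrees(locked_phase_rad(detune, lock_range))
    print(f"detune {detune / 1e6:+5.0f} MHz -> phase {theta:+6.1f} deg")
```

The arcsine shape means the phase moves slowly with detuning near the center of the locking range and steeply near its edges, so step size in time is not uniform across the fine tuning codes.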
Each delay stage of this super-harmonic ILO can be modeled as depicted in Fig. 14. Taking the second stage as an example, clock phases P[4] and P[5] are first combined due to the negatively-skewed phase interpolation technique used here. The nonlinear function f(e) generates multiple harmonic products from the injection signal and the interpolated phase. They are then filtered by the delay stage transfer function H [25]. The single-sided locking range can be expressed as (11) where is the injection efficiency, is the M-th harmonic coefficient, M is also the number of stages, is the free-running frequency of the oscillator, and is the amplitude of the injection signal [26].
To compensate for any potential phase imbalance due to layout mismatch, a 4-bit switched capacitor bank on each phase is incorporated for individual phase trimming, with a measured resolution of 3-5 ps. A scan-chain feedback loop runs at startup to adjust the phase spacing, using a histogram calibration algorithm [14] . The calibrated ten phases are then used to demultiplex the incoming data.
IV. MEASUREMENT RESULTS
A 1 mm × 1 mm test chip has been fabricated in a 65 nm 1P9M CMOS technology. To evaluate the effectiveness of the bootstrapped S/H front-end, two versions of the receiver are built: RX2 in Fig. 11 uses the S/H while RX1 does not. The die photo and layout screen captures of the two receiver prototypes are shown in Fig. 15. The on-die clock channel is implemented as a global CML clock buffer that drives the differential load capacitance across a 600 µm distribution in top metal M9 to three data receivers (two RX1 and one RX2) and a stand-alone super-harmonic ILO for test purposes.
An HP 8648D signal generator with 1.2 ps intrinsic jitter is used to generate the 4 GHz injection clock. A Tektronix AWG 7122B generates the PRBS-7 8 Gb/s data, which passes through an FR4 channel consisting of 20 cm long PCB traces and 62 cm/15 cm SMA cables on each end. Fig. 8 shows that the measured loss of this channel is approximately 9.7 dB at Nyquist (4 GHz). A Tektronix DSA 8200 digital serial analyzer captures the demultiplexed receiver outputs and performs bit-error rate analysis.
The phase deskew range of the super-harmonic ILO is shown in Fig. 16. A total deskew range of 130 ps is achieved by combining the coarse and fine tuning controls, covering the full UI of 125 ps. Coarse deskew tuning is done by changing the varactor control voltage Vd, which provides about 82 ps of phase shift range with enough overlap margin between adjacent coarse settings. After one of the coarse settings is selected, the fine tuning bits of the super-harmonic ILO are varied to provide another 48 ps of deskew range with 1-3 ps step resolution. The proposed receiver can therefore cover the full UI without a dead zone, as shown in Fig. 16. Fig. 17 shows the deskewed clock edges overlaid on the oscilloscope when only the fine tuning bits are changed. Only every other one or two clock edges are overlaid for clarity.
The best-case jitter is measured below 4 ps and increases towards the far end of the ILO locking range, where jitter of up to 4.6 ps has been measured. Fig. 18 illustrates this slight degradation in jitter performance as the super-harmonic ILO is biased at the extremities, away from the center of the locking range. The locking range, measured by changing the free-running frequency, is from 40 MHz to 78 MHz depending on the injection strength, which is controlled by the 3-bit amplitude setting of the global clock buffer; this is consistent with (11). The jitter-tracking bandwidth is measured by modulating the 4 GHz injection clock with a low-frequency sinusoidal signal. When a 20 MHz sine-wave modulation is applied, the corresponding bimodal jitter distribution can be observed in Fig. 19. Modulation frequencies higher than 40 MHz start to be filtered out by the narrow bandwidth of the signal generator itself, but the modulation can still be observed up to 40 MHz. As there is no attenuation of output jitter up to this point, the jitter-tracking bandwidth of this super-harmonic ILO is greater than 40 MHz.
It can also be observed that the timing margin is improved for the S/H receiver with the equalization off, compared with the direct comparator-input receiver.
Power consumption for each block is listed in Table I. RX1 and RX2 consume 1.3 mW and 2 mW, respectively, at 8 Gb/s, translating into a figure of merit (FOM) of 0.163 pJ/bit and 0.25 pJ/bit. Table II compares this design with previously reported prototypes.
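The FOM figures above are simply power divided by data rate. A minimal sketch of that conversion, using the measured values quoted in this section (the function name is illustrative):

```python
def energy_per_bit_pj(power_w: float, data_rate_bps: float) -> float:
    """Energy per bit in pJ/bit from total power (W) and data rate (b/s)."""
    return power_w / data_rate_bps * 1e12


rx1 = energy_per_bit_pj(1.3e-3, 8e9)   # receiver without S/H
rx2 = energy_per_bit_pj(2.0e-3, 8e9)   # receiver with S/H
print(f"RX1: {rx1:.4f} pJ/bit, RX2: {rx2:.4f} pJ/bit")
```

The RX1 result is 0.1625 pJ/bit before rounding, which the text reports as 0.163 pJ/bit.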
V. CONCLUSION
A low-power forwarded-clock receiver prototype operating at a near-threshold supply voltage is proposed. Both architectural considerations and circuit design techniques are discussed. By employing a super-harmonic ILO, a 1:10 direct demultiplexing ratio, and near-threshold operation, the receiver achieves as low as 0.163 pJ/bit at an 8 Gb/s data rate. This work shows promising potential to relieve a key bottleneck in system performance scaling: the power constraint of future high-speed I/Os.
