#### AN ABSTRACT OF THE THESIS OF

<u>Kangmin Hu</u> for the degree of <u>Doctor of Philosophy</u> in <u>Electrical and Computer Engineering</u> presented on <u>June 8, 2011</u>.

Title: Analysis and Design on Low-Power Multi-Gb/s Serial Links.

Abstract approved:

Patrick Y. Chiang

High speed serial links are critical components for addressing the growing demand for I/O bandwidth in next-generation computing applications, such as many-core systems, backplane and optical data communications. Due to continued process scaling and circuit innovations, today's CMOS serial link transceivers can achieve tens of Gb/s per pin. However, most reported power efficiencies have improved much more slowly than data rates have risen. Therefore, aggregate I/O power is increasing and will exceed the power budget if the trend for more off-chip bandwidth is sustained. In this work, a system level statistical analysis of serial links is first described and used to compare the link performance of Non-Return-to-Zero (2-PAM) signaling with that of higher-order modulation (duobinary). This method enables fast and accurate BER distribution simulation of serial link transceivers that include channel and circuit imperfections, such as finite pulse rise/fall time, duty cycle variation, and both receiver and transmitter forwarded-clock jitter. Second, in order to address link power efficiency, two test chips have been implemented. The first is a quad-lane, 6.4-7.2 Gb/s serial link receiver prototype using a forwarded clock architecture. A novel phase deskew scheme using injection-locked ring oscillators (ILRO) is proposed that achieves greater than one UI of phase shift for multiple clock phases, eliminating the phase rotation and interpolation required in conventional architectures.
Each receiver, optimized for power efficiency, consists of a low-power linear equalizer, four offset-cancelled quantizers for 1:4 demultiplexing, and an injection-locked ring oscillator coupled to a low-voltage swing, global clock distribution. Measurement results show a 6.4-7.2Gb/s data rate with BER < 10<sup>-12</sup> across 14 cm of PCB, and an 8Gb/s data rate through 4cm of PCB. Designed in a 1.2V, 90nm CMOS process, the ILRO achieves a wide tuning range from 1.6-2.6GHz. The total area of each receiver is 0.0174mm², resulting in a measured power efficiency of 0.6mW/Gb/s. Improving upon the first test chip, a second test chip for 8Gb/s forwarded clock serial link receivers exploits a low-power super-harmonic injection-locked ring oscillator for symmetric multi-phase local clock generation and deskewing. Further power reduction is achieved by designing most of the receiver circuits in the near-threshold region (0.6V supply), with the exception of only the global clock buffer, test buffers and synthesized digital test circuits at nominal 1V supply. At the architectural level, a 1:10 direct demultiplexing rate is chosen to achieve low supply operation by exploiting high-parallelism. Fabricated in 65nm CMOS technology, two receiver prototypes are integrated in this test chip, one without and the other with front-end boot-strapped S/Hs. Including the amortized power of global clock distribution, the proposed serial link receivers consume 1.3mW and 2mW respectively at 8Gb/s input data rate, achieving a power efficiency of 0.163mW/Gb/s and 0.25mW/Gb/s. Measurement results show both receivers achieve BER $< 10^{-12}$ across a 20-cm FR4 PCB channel. 
©Copyright by Kangmin Hu

June 8, 2011

All Rights Reserved

### Analysis and Design on Low-Power Multi-Gb/s Serial Links

by Kangmin Hu

#### A THESIS

submitted to Oregon State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Presented June 8, 2011

Commencement June 2012

<u>Doctor of Philosophy</u> thesis of <u>Kangmin Hu</u> presented on <u>June 8, 2011</u>

APPROVED:

Major Professor, representing Electrical and Computer Engineering

Director of the School of Electrical Engineering and Computer Science

Dean of the Graduate School

I understand that my thesis will become part of the permanent collection of Oregon State University libraries. My signature below authorizes release of my thesis to any reader upon request.

Kangmin Hu, Author

#### ACKNOWLEDGEMENTS

This Ph.D. work would never have been accomplished without the kind guidance, generous help and significant collaboration of my advisors, colleagues, friends, family and sponsors.

First and foremost, I would like to express my deepest gratitude to Professor Patrick Yin Chiang, my doctoral advisor. It was his vision, broad knowledge and great encouragement that led me to the world of serial link design and eventually to the ideas in this thesis. I will not forget the days when we discussed papers, worked closely, and drew layout overnight together. I was touched by an energetic young professor with such a strong desire for academic achievement. I am deeply grateful to him for his advice and guidance throughout my research work.
Secondly, I would like to thank Professor Pavan Kumar Hanumolu for his advice on a collaborative Intel project as well as his teaching of analog circuit and PLL design. I also thank Professor Huaping Liu and Professor Thinh Nguyen for serving on my doctoral committee and being readers for this dissertation. I am indebted to Professor John Nairn for taking the time to serve as graduate council representative on my committee.

I would like to thank Professor Sam Palermo of Texas A&M University for his valuable advice on our joint SRC project. His insights into serial link design helped me to focus on this research. I am deeply indebted to Professor Zhiliang Hong of Fudan University, my master's advisor, for introducing me to analog circuit design and for sponsoring the forthcoming tapeout for our group. I also thank Professor Brian Otis and Dr. Julie Hu of the University of Washington for their advice on tapeout and joint testing.

I am extremely grateful to Dr. Frank O'Mahony, Dr. Ganesh Balamurugan and Bryan Casper of Intel's Circuit Research Laboratory for their advice on the serial link design used in this thesis and for permission to access their measurement equipment. I would like to thank Brent Close, Charlie Zhong and Freeman Zhong of LSI Corporation for generous support of lab and equipment time. I also thank Larry Wu, Yu Zhong, Dr. Leon Pu and Dr. Howard Yang of Montage Technology for their guidance and help in developing the serial link system simulator when I interned there.

I have been much honored to work in a research group with so many extraordinary students. I am deeply grateful for the help and collaboration of Jingguang Wang, Tao Jiang, Changhui Hu, Jacob Postman, Joe Crop, Rui Bai, Robert Pawlwski, Jiao Cheng, Lingli Xia, Nariman Moezzi, Chao Ma, Karthik Jayaraman, Divya Kesharwani, Sirikarn Woracheewan, Ben Goska, Eric Donkoh and Ryan Albright.
Also, I wish to thank many exceptional students from other groups including, but not limited to, Weilun Shen, Wenhuan Yu, Yan Wang, Xiaorao Gao, Chia-Hung Chen, Hurst Kuo, Jiaming Lin, Jinzhou Cao, Tao Wang, Tao Tong, Xin Meng, Wei Li and Sanghyeon Lee in Professor Gabor Temes' group, Wenjing Yin, Jeff Pai, Rajesh Inti, Amr Elshazly, Bangda Yang and Sachin Rao in Professor Pavan Kumar Hanumolu's group, Yue Hu, Jon Guerber, Dave Gubbins, Nima Maghari and Tawfik Musah in Professor Un-Ku Moon's group, Jinjin He, Ruiqing Ye and Stephen Redfield in Professor Huaping Liu's group, Huarong Ni, Chao Shi and Yuhan Xie in Professor Terri Fiez's group, Na An in Professor Albrecht Jander's group, Younghoon Song and Ahmed Ragab in Professor Sam Palermo's group at Texas A&M University and Yue Lu in Professor Elad Alon's group at U. C. Berkeley, for valuable technical discussions and for making my Ph.D. life so vivid and fantastic.

I would like to thank CDADIC, AFRL, SRC and Intel Corporation for funding my research and MOSIS, IBM and TSMC for chip fabrication.

Finally, I wish to offer my sincere gratitude to my entire family: my parents, girlfriend, cousins, grandparents, uncles and aunts, who have supported my graduate work with their hearts.

## TABLE OF CONTENTS

| Section |
|---|
| CHAPTER 1. INTRODUCTION |
| 1.1 Motivation |
| 1.2 Thesis Organization |
| CHAPTER 2. ANALYSIS AND MODELLING ON SERIAL LINKS |
| 2.1 Overview of the Architectures |
| 2.2 Modeling of Serial Link |
| 2.2.1 Overview of NRZ and Duobinary Signaling |
| 2.2.2 Background of Statistical Analysis |
| 2.2.3 Statistical Analysis for Duobinary |
| 2.2.4 Clock Non-idealities |
| 2.2.5 Sub-block Modeling of Serial Link |
| 2.3 Behavioral Simulations on Link Performance |
| 2.4 Clock Distribution Methods |
| 2.4.1 Inverter Chain |
| 2.4.2 CML Chain |
| 2.4.3 Transmission Line |
| 2.4.4 Inductive Load |
| 2.4.5 Capacitively Driven Wires (CDW) |
| 2.5 Summary |
| CHAPTER 3. A SERIAL LINK RECEIVER USING LOCAL INJECTION-LOCKED RING OSCILLATOR |
| 3.1 Forwarded Clock Receiver Architectures |
| 3.2 Analysis on Injection-Locked Ring Oscillators |
| 3.2.1 Previous Approaches |
| 3.2.2 Proposed Approach for ILRO Analysis |
| 3.3 Circuit Implementation |
| 3.3.1 Proposed Injection-Locked Ring Oscillator |
| 3.3.2 Other Building Blocks |
| 3.4 Experimental Results |
| 3.4.1 ILRO |
| 3.4.2 Entire Receiver |
| 3.5 Summary |
| CHAPTER 4. A NEAR-THRESHOLD SERIAL LINK RECEIVER |
| 4.1 Receiver Implementation |
| 4.2 Design of Super-harmonic ILO |
| 4.3 Experimental Results |
| 4.4 Summary |
| CHAPTER 5. CONCLUSION |
| 5.1 Summary |
| 5.2 Recommendations for Future Work |
| Bibliography |

## LIST OF FIGURES

| Figure |
|---|
| Figure 1.1. SoC networking projections (performance scaled from 8-core implementation in 2009) [1]. |
| Figure 1.2. I/O bandwidth projections [1]. |
| Figure 1.3. Power efficiency of recently published serial links. |
| Figure 2.1. Block diagram of serial link transceiver: (a) forwarded-clock architecture and (b) embedded-clock architecture. |
| Figure 2.2. Measured channel loss of 10cm to 80cm PCB traces (from top to bottom), showing -6.6dB, -18.9dB and -35dB loss at 10GHz for 10cm, 40cm and 80cm PCB traces respectively. |
| Figure 2.3. Pulse response to a 50ps (20-Gb/s) pulse before equalization, after NRZ equalization, and duobinary equalization of 40cm PCB trace. |
| Figure 2.4. Frequency response before equalization, after NRZ equalization and duobinary equalization of a 40cm PCB trace. |
| Figure 2.5. (a) Jitter impulse response and (b) jitter transfer function of a 40cm PCB trace at 20-Gb/s. |
| Figure 2.6. Schematic of receiver linear equalizer. |
| Figure 2.7. Eye diagram of 40cm trace after NRZ equalization, (a) transient simulation of 10k random bits and (b) statistical analysis. |
| Figure 2.8. Eye diagram of 40cm trace after duobinary equalization, (a) transient simulation of 10k random bits and (b) statistical analysis. |
| Figure 2.9. (a) Simulation result for NRZ and (b) Measured eye diagram from Fig. 29 in [36]. |
| Figure 2.10. (a) Simulation result for Duobinary and (b) Measured eye diagram from Fig. 29 in [36]. |
| Figure 2.11. Eye opening area for BER<10<sup>-12</sup> with different length of traces. |
| Figure 2.12. Eye opening area for BER<10<sup>-12</sup> with different rising and falling time for 40cm trace. |
| Figure 2.13. Eye opening area for BER<10<sup>-12</sup> with different duty cycle deviation for 40cm trace. |
| Figure 2.14. Eye opening area for BER<10<sup>-12</sup> with different receiver and transmitter RMS jitter for 40cm trace. |
| Figure 2.15. Eye opening area for BER<10<sup>-12</sup> with different jitter tracking bandwidth for 40cm trace with both 1ps RMS TX and RX jitter. |
| Figure 2.16. Eye opening area for BER<10<sup>-12</sup> with different FFE and DFE taps for 40cm trace (for FFE, with 1 precursor tap and varying no. of postcursor taps). |
| Figure 2.17. Inverter chain. |
| Figure 2.18. Modeling of chain. |
| Figure 2.19. CML chain. |
| Figure 2.20. Microstrip transmission line. |
| Figure 2.21. Inductive load. |
| Figure 2.22. CDW. |
| Figure 2.23. Modeling of CDW. |
| Figure 3.1. (a) Conventional forwarded clock receiver architecture and (b) proposed architecture using ILRO for multiple serial links. |
| Figure 3.2. (a) Phase vector diagram (injection signal $e_{inj}(x)$, free-running tank signal $e(x)$ and resultant output $e(x)$). (b) Deskew with different injection strength k, based on Adler's equation. |
| Figure 3.3. Superposition of waveforms. |
| Figure 3.4. Block diagram of proposed receiver. |
| Figure 3.5. Schematic of ILRO. |
| Figure 3.6. Simulated AC response of RX EQ under different settings. |
| Figure 3.7. Quantizer with offset control. |
| Figure 3.8. Die photo and layout screen capture. |
| Figure 3.9. (a) Deskew of ILRO (b) Overlaid waveforms by sweeping phase setting (vertical scale = 25mV/div, horizontal scale = 10ps/div). |
| Figure 3.10. Jitter performance of ILRO. |
| Figure 3.11. Measured phase noise performance of ILRO. |
| Figure 3.12. Measured jitter transfer of ILRO. |
| Figure 3.13. Phase spacing when injecting 2.5GHz clock: (a) $f_0$=2.49GHz, (b) $f_0$=2.58GHz (vertical scale = 25mV/div, horizontal scale = 50ps/div). |
| Figure 3.14. Eye diagrams of recovered data under (a) 4cm trace and (b) 14cm trace (input data rate=7.2Gb/s, vertical scale = 155mV/div, horizontal scale = 111ps/div). |
| Figure 3.15. BER measurements (a) by sweeping the delay in BERT and (b) by sweeping the phase setting of ILRO. |
| Figure 3.16. Receiver power breakdown. |
| Figure 4.1. Proposed receiver architecture. |
| Figure 4.2. Schematic of global clock buffer with parasitics. |
| Figure 4.3. Block diagram of the receiver data lane (a) RX1 and (b) RX2. |
| Figure 4.4. Schematic of super-harmonic ILO. |
| Figure 4.5. Model of single stage of super-harmonic ILO. |
| Figure 4.6. (a) Die photo, and layout screen capture of (b) RX1 and (c) RX2. |
| Figure 4.7. Measured deskew range and free-running frequency of super-harmonic ILO across fine freq tuning. |
| Figure 4.8. Overlaid waveform of clock rising edge by changing fine tuning alone (a) with oscilloscope average mode on for clarity and (b) with grade color mode on. |
| Figure 4.9. (a) RMS jitter of super-harmonic ILO output across fine tuning settings, and (b) one of zoomed jitter measurement. |
| Figure 4.10. Output of SH-ILO after phase modulating 4G clock source by (a) 20MHz deviation and (b) 30MHz deviation. |
| Figure 4.11. RX1: (a) 800Mb/s 1:10 recovered data output (x=250ps/div, y=100mV/div), (b) BER bathtub curve at 8Gb/s over 20cm FR4. |
| Figure 4.12. RX2: (a) 800Mb/s 1:10 recovered data output (x=200ps/div, y=100mV/div), (b) BER bathtub curve at 8Gb/s over 20cm FR4. |
| Figure 4.13. Channel response of a 20cm FR4 PCB trace. |

## LIST OF TABLES

| Table | Title |
|---|---|
| 2.1 | Comparison of Five Clock Distributions |
| 3.1 | Performance Summary |
| 4.1 | Power Breakdown |
| 4.2 | Comparison with Previous Works |

## Analysis and Design on Low-Power Multi-Gb/s Serial Links

#### **CHAPTER 1. INTRODUCTION**

### 1.1 Motivation

The fast-growing demand for higher data rates has pushed the bottleneck to off-chip I/O bandwidth in applications such as future many-core microprocessor systems. According to the ITRS roadmap [1], as shown in Fig. 1.1, the number of cores is projected to increase 1.4x per year, and each core frequency by 1.05x per year. This will result in about a 20x increase in system processing performance in 2016, with more than 80 cores in a 22nm process, relative to an 8-core system in 45nm in 2009, and in roughly a 1000x increase in 2024. These results indicate that the aggregate chip-to-chip I/O bandwidth between cores and memories needs to scale in the same fashion in order to keep the computation units well loaded and obtain the best performance.

Figure 1.1. SoC networking projections (performance scaled from 8-core implementation in 2009) [1].

However, due to practical limitations such as channel loss and crosstalk, the data rate per pin is projected in Fig. 1.2 to rise only about 10x by 2024 relative to 2009, although CMOS technology is expected to scale to 8nm. Given that the maximum pin count increases about 2x to 4x during the same period [1], link designers face a huge gap in aggregate bandwidth relative to the performance trend.

Figure 1.2. I/O bandwidth projections [1].

On the other hand, I/O power efficiency does not scale as aggressively as performance either. Based on the data in [2] and recently published serial link transceivers at ISSCC and the VLSI Symposium [3-34], power efficiency has improved by only about 20% per year on average, as shown in Fig. 1.3. If this tendency continues, with data rate increasing much faster than power efficiency improves, the total I/O power will increase. This will not only add to the cost of thermal dissipation, but also raise the power of the whole system to an unacceptable level.

Figure 1.3. Power efficiency of recently published serial links.

The purpose of this thesis is to explore these limitations and to develop new design techniques suitable for low-power multi-Gb/s CMOS serial links. The targeted data rate is around 8Gb/s, and the designs focus on power-efficient receivers for moderate-loss channels.

### 1.2 Thesis Organization

This dissertation begins in Chapter 2 with an overview of serial link transceiver architectures, followed by a statistical method for modeling link performance at the system level, a comparison of the NRZ and duobinary modulation schemes, and a discussion of clock distribution methods. Chapter 3 describes a quad-lane, 6.4-7.2 Gb/s serial link receiver test chip in a 90nm CMOS process with a novel phase deskew scheme that uses an injection-locked ring oscillator as the local deskew unit in each receiver. Significant power is saved by replacing the traditional power-hungry phase interpolators with this ring oscillator. This receiver achieves 0.6mW/Gb/s power efficiency from a nominal 1.2V supply. Some design issues and theoretical analysis of injection-locked ring oscillators are also discussed. After that, Chapter 4 presents 8Gb/s serial link receivers in 65nm CMOS that further reduce the power by using the proposed super-harmonic injection-locked ring oscillators for symmetric multi-phase local clock generation and deskewing.
Combining this technique with higher parallelism in the receiver structure, low-supply near-threshold operation and CMOS scaling, this receiver prototype improves power efficiency to 0.163 mW/Gb/s for a -9.7dB loss channel with BER $< 10^{-12}$. Finally, Chapter 5 summarizes the presented work and proposes some suggestions for further studies.

#### CHAPTER 2. ANALYSIS AND MODELLING ON SERIAL LINKS

#### 2.1 Overview of the Architectures

Depending on how the clock is generated on the receiver side, serial link transceivers can generally be divided into two classes: forwarded-clock architectures and embedded-clock architectures. As shown in Fig. 2.1 (a), in the forwarded-clock architecture, the clock on the receiver (RX) side is obtained directly from the transmitter (TX), which ensures that the clock frequencies of the TX and RX are exactly the same. That is, there is no frequency offset of the kind that would otherwise arise from small variations between separate crystal oscillators at the TX and RX. Therefore, the clock recovery block in the forwarded-clock architecture only needs to rotate the phase so that the clock samples the data at the center.

On the other hand, in the embedded-clock architecture, shown in Fig. 2.1 (b), the sampling clock of the RX is recovered from the received data instead. A typical clock data recovery (CDR) circuit in this architecture contains a local PLL to help recover the clock, and it needs to tolerate a small frequency difference between the TX and RX due to their different crystal oscillators. The jitter of the recovered clock will track the data jitter up to the CDR loop bandwidth. However, this bandwidth is usually smaller than the jitter tracking bandwidth of the forwarded-clock architecture [35]. Compared with the embedded-clock architecture, forwarded-clock architectures reduce the power of clock recovery at the expense of an additional forwarded link to deliver the transmitted mesochronous clock.
However, if the chip I/O interface requires many parallel serial lanes, the power and pin overhead of the additional forwarded clock can be amortized among all the data links.

For the data path, on the TX side, several data sequences are multiplexed, with the transmitted symbol pulse width and position determined by the clock shape and data multiplexing ratio. The multiplexed data sequence $\{d_k\}$ is equalized by feed-forward equalization (FFE) and sent to a channel by the output driver. The channel can be a combination of bonding wire, package trace, PCB trace on the board, connectors and co-axial cable. After passing through a receiver linear equalizer (LE) and/or nonlinear decision feedback equalizer (DFE), the data is then recovered and demultiplexed by the quantizer(s) and flip-flops. For NRZ signaling, the receiver uses the quantizer to slice the 2-level analog input into a single digital value. For higher order modulation such as duobinary, an LSB distiller or 3-level ADC [36] (not shown in Fig. 2.1 for simplicity) is necessary to convert the recovered sequence to NRZ, resulting in two digital outputs for each 3-level duobinary analog input.

Figure 2.1. Block diagram of serial link transceiver: (a) forwarded-clock architecture and (b) embedded-clock architecture.

### 2.2 Modeling of Serial Link

Chip-to-chip communications can show widely varying channel losses (e.g. -3dB to -30dB at Nyquist) due to variations in trace length, PCB material, connector type, via stubs, and proximity to aggressor signal coupling. Lossy channel bandwidth critically limits the maximum symbol rate due to inter-symbol interference (ISI), which leads designers to consider higher order modulation (e.g. PAM-4, or partial response such as duobinary) to achieve a higher bit rate. For next generation multi-Gb/s serial links, such as in short-range chip-to-chip applications [37], the channel typically exhibits moderate losses of -20dB or less. Fig. 2.2 depicts the measured channel losses of typical FR4 PCB traces from 10cm to 80cm long, showing that for a 40cm trace length, the measured channel loss at 10GHz is -18.9dB. While such channel losses may contribute to a reduced signal-to-noise ratio (SNR) in the eye opening, other non-ideal effects beyond channel losses may also contribute to performance degradation, such as PLL jitter, crosstalk, duty cycle distortion (DCD), jitter amplification, and finite rise/fall time of the data symbol.

Figure 2.2. Measured channel loss of 10cm to 80cm PCB traces (from top to bottom), showing -6.6dB, -18.9dB and -35dB loss at 10GHz for 10cm, 40cm and 80cm PCB traces respectively.

Besides the issue of channel and circuit impairments, another critical problem is the difficulty in achieving simulation accuracy at the transistor level for multi-Gb/s data rates. As the data rate goes up, the simulation time step becomes smaller. Therefore, excessive transient simulation time is required for the same accuracy; otherwise, simulation inaccuracy will appear due to the incomplete characterization of the link performance. For example, the simulation length of a random input sequence exhibiting error-free operation should be at least three times the inverse of the expected bit-error rate (BER), in order to obtain reasonable accuracy with a 95% confidence level [38]. For a typical serial link application with an expected BER of $10^{-12}$, the data sequence needs to be at least $3\times10^{12}$ symbols long, which requires a significant amount of simulation time even for a 64-bit workstation. Moreover, to accurately model the jitter, duty cycle variation and finite symbol rise/fall time, the time step of the simulator must be further reduced, again resulting in increased simulation time. Worst-case analysis has been proposed in [39], [40] for obtaining quick link estimation, but is unable to provide more complete link characteristics, such as BER versus eye sampling location.
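
The factor of three in the sequence-length requirement above can be checked with a short calculation. The sketch below assumes independent bit errors, so that the probability of observing no errors in N bits is (1 - BER)^N. The function name `required_bits` and the 20 Gb/s rate used to convert bits into link time are illustrative choices, not taken from [38].

```python
import math


def required_bits(ber, confidence=0.95):
    """Error-free bits needed to claim the true BER is at most `ber`.

    Assumes independent errors: (1 - ber)**N <= 1 - confidence.
    """
    return math.log(1.0 - confidence) / math.log1p(-ber)


n = required_bits(1e-12)
print(f"bits needed at 95% confidence: {n:.3e}")
# Real-time duration of that sequence at an assumed 20 Gb/s line rate.
print(f"link time at 20 Gb/s: {n / 20e9:.0f} s")
```

Since -ln(0.05) is very close to 3, the requirement reduces to roughly 3/BER. A transient simulator must step through every one of those symbols with many time points per symbol, which is why the sequence length becomes impractical.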
Statistical analysis techniques [39], [41]-[44] enable accurate and more efficient methods to estimate the performance of serial links beyond conventional transient simulations. These simulators calculate the BER distribution plot by convolving the probability density function (PDF) of all individual cursors of the pulse response. While the PDF of interference sources such as crosstalk can be easily added by summing the corresponding aggressor responses, timing uncertainty such as accurate analysis of transmitter jitter is more difficult to perform [42]. For example, the original work in [39] simply treats transmitter jitter similarly to receiver jitter, though it has travelled through the lossy channel which causes jitter amplification. In [41], jitter from both the transmitter and receiver are converted to an equivalent voltage noise, based on a jittered pulse decomposition model that gives accurate results in the voltage domain. Extending on the work in [39], a more accurate analysis of transmitter jitter was proposed in [42], [43], which requires extensive calculations to take almost every possible position of the transmitted pulse shapes into account according to the PDF of the transmitter jitter. However, this will degrade the efficiency of the statistical analysis. Furthermore, it treats individual transmit jitter shaped by the PDF separately as a time offset from an ideal pulse, regardless of its frequency content. This can be problematic, as the transmitted sequence will be fed to an ideal high-pass filter in order to capture the jitter amplification at high frequency [45], [35] -- resulting in the same inaccuracy problem as the conventional transient simulation mentioned above. For multi-Gb/s data rates or above, these timing uncertainties become even more critical for accurate analysis and prediction of link performance across various modulation schemes. 
A statistical analysis technique for multi-Gb/s serial links is discussed below. It not only includes the effects of channel loss, such as ISI, and of equalization, but also predicts the effects of transmitter jitter amplification, random receiver jitter, finite rise/fall time, and clock duty cycle variation. Based on this analysis, conventional NRZ signaling and duobinary modulation are compared at a 20-Gb/s data rate.

#### 2.2.1 Overview of NRZ and Duobinary Signaling

NRZ signaling is commonly used in high-speed chip-to-chip communications due to its simplicity, which leads to straightforward transmitter and receiver circuit architectures. In the frequency domain, its main spectral lobe occupies bandwidth up to its data rate of $1/T_b$, where $T_b$ is the period of a symbol. To relieve ISI due to channel loss, equalization is predominantly used to flatten the channel response. The tap coefficients of the equalization filter can be calculated by zero-forcing the nearby cursors except for the main cursor. For example, the coefficients $(c_{-1}\ c_0\ c_1\ c_2)^T$ of a 4-tap FFE for the channel pulse response shown in Fig. 2.3 can be solved from

$$\begin{pmatrix} g_{-1} \\ g_{0} \\ g_{1} \\ g_{2} \end{pmatrix} = \begin{pmatrix} g_{F,0} & g_{F,-1} & g_{F,-2} & g_{F,-3} \\ g_{F,1} & g_{F,0} & g_{F,-1} & g_{F,-2} \\ g_{F,2} & g_{F,1} & g_{F,0} & g_{F,-1} \\ g_{F,3} & g_{F,2} & g_{F,1} & g_{F,0} \end{pmatrix} \begin{pmatrix} c_{-1} \\ c_{0} \\ c_{1} \\ c_{2} \end{pmatrix}$$ (2.1)

where the targeted cursors $(g_{-1}\ g_0\ g_1\ g_2)^T$ are $(0\ 1\ 0\ 0)^T$ for NRZ, $g_F(t)$ is the pulse response of the channel, and $g_F(t-kT_b)$ is denoted $g_{F,k}$.

Duobinary is a partial response signaling scheme that introduces controlled ISI to reduce the transmitted bandwidth. Its main spectral lobe occupies bandwidth up to only half the data rate, or $1/(2T_b)$.
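
Returning to the zero-forcing solve in (2.1), the calculation can be sketched in a few lines. This is a minimal illustration, not the thesis simulator: the cursor values are hypothetical numbers chosen for the example, and the helper names `solve_linear` and `zero_forcing_taps` are assumptions introduced here.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def zero_forcing_taps(cursors, target):
    """Build the matrix of (2.1), entry (i, j) = g_F,(i-j), and solve for the taps."""
    n = len(target)
    matrix = [[cursors.get(i - j, 0.0) for j in range(n)] for i in range(n)]
    return solve_linear(matrix, target), matrix


# Hypothetical channel cursors g_F,k indexed by k (illustrative, not measured data).
cursors = {-1: 0.10, 0: 0.60, 1: 0.30, 2: 0.15, 3: 0.05}
nrz_target = [0.0, 1.0, 0.0, 0.0]      # (g_-1, g_0, g_1, g_2) for NRZ

taps, matrix = zero_forcing_taps(cursors, nrz_target)
equalized = [sum(g * c for g, c in zip(row, taps)) for row in matrix]
print("taps (c_-1, c_0, c_1, c_2):", [round(c, 4) for c in taps])
print("equalized cursors:", [round(g, 4) for g in equalized])
```

The solve only forces the four cursors inside the window to their targets; cursors outside the window are left as residual ISI. Passing a different `target` vector reuses the same matrix for another signaling scheme.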
In theory, a duobinary signal is formed as the sum of the current bit and the preceding one within an NRZ sequence, resulting in a 3-level signaling constellation [46]. In practice, it can be achieved by combining the channel low-pass characteristics with the transceiver equalization. For example, the coefficients of a 4-tap FFE used for duobinary can be calculated from (2.1) with targeted cursors $(g_{-1}\ g_0\ g_1\ g_2)^T$ equal to $(0\ 0.5\ 0.5\ 0)^T$. To prevent error propagation, a precoder and decoder must be implemented at baseband. Fig. 2.3 and Fig. 2.4 show the pulse and frequency responses before and after equalization for both NRZ and duobinary signaling, using a 4-tap FFE through a 40cm FR4 PCB trace at a 20-Gb/s data rate. The smaller bandwidth of duobinary modulation confirms its higher spectral efficiency, showing less loss than NRZ for the same data rate.

Figure 2.3. Pulse response to a 50ps (20-Gb/s) pulse before equalization, after NRZ equalization, and duobinary equalization of 40cm PCB trace.

Figure 2.4. Frequency response before equalization, after NRZ equalization and duobinary equalization of a 40cm PCB trace.

#### 2.2.2 Background of Statistical Analysis

As mentioned previously, since the BER of a typical serial link can be less than $10^{-12}$ and random noise is unbounded, transient simulations of eye diagrams and SNR are both excessively time-consuming and difficult to process (due to the large amount of sampled data). Statistical analysis, on the other hand, can give a detailed eye plot of the BER distribution across both different timing offsets and decision thresholds. Based on the transmitted pulse response through the channel, statistical analysis convolves all the PDFs of the residual ISI to produce the BER eye.
For NRZ signaling, the PDF of the ISI from the $k^{th}$ preceding bit can be expressed as

$$ISI_{k} = P_{0}\delta(x) + P_{1}\delta(x - g_{F,k}), \quad k \neq 0$$ (2.2)

where $P_0$ and $P_1$ are the probabilities of transmitting ZERO and ONE symbols, with typical values of 0.5 for equally likely ZERO and ONE, and $\delta(x)$ is the unit impulse function. When $k > 0$, the ISI results from the postcursor tails of previous bits, while when $k < 0$, the ISI arises from the precursors of succeeding bits. The total ISI is then calculated by convolving all the individual ISI terms:

$$ISI = \dots \otimes ISI_{-2} \otimes ISI_{-1} \otimes ISI_{1} \otimes ISI_{2} \otimes \dots$$ (2.3)

The PDFs of the main cursor with symbols ZERO and ONE are:

$$main_0 = \delta(x), \quad main_1 = \delta(x - g_{F,0})$$ (2.4)

Then the PDFs of ZERO and ONE in the presence of the ISI are:

$$pdf_0 = main_0 \otimes ISI, \quad pdf_1 = main_1 \otimes ISI$$ (2.5)

Hence, the BER of NRZ signaling for a given decision threshold $y_T$ can be written as:

$$\begin{aligned} BER_{NRZ}(y_T) &= P_0 \cdot P(D_1 | H_0) + P_1 \cdot P(D_0 | H_1) \\ &= P_0 \int_{y_T}^{\infty} pdf_0 \, dx + P_1 \int_{-\infty}^{y_T} pdf_1 \, dx \end{aligned}$$ (2.6)

where $P(D_1|H_0)$ is the probability of transmitting a ZERO but mistaking it for a ONE at the receiver, while $P(D_0|H_1)$ is the opposite scenario. The BER distribution at any single time instance is obtained by sweeping $y_T$ across the input dynamic range. After repeating the above steps across one complete symbol period, the entire BER eye plot can be derived.

#### 2.2.3 Statistical Analysis for Duobinary

Duobinary modulation introduces controlled ISI, implicit within the coding. Therefore, there exist two large distributions (one caused by the main cursor and the other by the first postcursor) in $pdf_0$ and $pdf_1$, instead of only one as in the case of NRZ.
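
For reference against the duobinary case, the NRZ calculation in (2.2)-(2.6) can be sketched numerically at one sampling instant. This is a minimal sketch that covers residual ISI only (no noise or jitter), stores each PDF as a set of weighted impulses, and uses hypothetical cursor values; the function names are assumptions introduced for illustration.

```python
from collections import defaultdict


def convolve(p, q):
    """Convolve two discrete PDFs stored as {voltage: probability}."""
    out = defaultdict(float)
    for x, px in p.items():
        for y, py in q.items():
            out[round(x + y, 12)] += px * py   # rounding merges equal voltages
    return dict(out)


def isi_pdf(cursor, p0, p1):
    """PDF of one ISI term as in (2.2): 0 with P0, the cursor value with P1."""
    pdf = defaultdict(float)
    pdf[0.0] += p0
    pdf[round(cursor, 12)] += p1
    return dict(pdf)


def nrz_ber(main, isi_cursors, threshold, p0=0.5, p1=0.5):
    """BER at one sampling instant for decision threshold y_T = threshold."""
    total_isi = {0.0: 1.0}                      # unit impulse: convolution identity
    for cursor in isi_cursors:                  # (2.3)
        total_isi = convolve(total_isi, isi_pdf(cursor, p0, p1))
    pdf0 = convolve({0.0: 1.0}, total_isi)      # (2.4) and (2.5) for ZERO
    pdf1 = convolve({main: 1.0}, total_isi)     # (2.4) and (2.5) for ONE
    zero_read_as_one = sum(p for x, p in pdf0.items() if x > threshold)
    one_read_as_zero = sum(p for x, p in pdf1.items() if x < threshold)
    return p0 * zero_read_as_one + p1 * one_read_as_zero   # (2.6)


# Hypothetical cursors: a well-equalized link and a heavily ISI-limited one.
open_eye = nrz_ber(main=1.0, isi_cursors=[0.2, -0.1, 0.05], threshold=0.5)
closed_eye = nrz_ber(main=0.5, isi_cursors=[0.3, 0.2, 0.1], threshold=0.35)
print("open eye BER:", open_eye)
print("closed eye BER:", closed_eye)
```

Sweeping `threshold` and repeating the calculation with the cursors taken at each sampling phase within one symbol period would give the BER eye described in the text. A practical implementation would typically bin the voltages onto a fixed grid so that the number of impulses does not double with every cursor.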
Also, because of the three-level signaling, two decision boundaries $v_{TH1}$ and $v_{TH2}$ need to be set in order to obtain the BER for duobinary:

$$BER_{duo}(y_{T}) = \begin{cases} \int_{y_{T}}^{v_{TH1}} (P_{0} \cdot pdf_{0} + P_{1} \cdot pdf_{1}) dx, & y_{T} \leq v_{TH1} \\ \int_{v_{TH1}}^{y_{T}} (P_{0} \cdot pdf_{0} + P_{1} \cdot pdf_{1}) dx, & v_{TH1} < y_{T} < v_{mid} \\ \int_{y_{T}}^{v_{TH2}} (P_{0} \cdot pdf_{0} + P_{1} \cdot pdf_{1}) dx, & v_{mid} \leq y_{T} < v_{TH2} \\ \int_{v_{TH2}}^{y_{T}} (P_{0} \cdot pdf_{0} + P_{1} \cdot pdf_{1}) dx, & v_{TH2} \leq y_{T} \end{cases}$$ (2.7)

where $v_{mid}$ is the position of the peak impulse in the sum of pdf<sub>0</sub> and pdf<sub>1</sub>. The decision boundaries $v_{TH1}$ and $v_{TH2}$ can be obtained by searching for the minimum BER located around $v_{mid} \pm 0.5\max(g_F(t))$, such that the BER can be low enough to open the eye near the boundaries.

#### 2.2.4 Clock Non-idealities

In addition to ISI, clock non-idealities such as transmitter jitter, receiver jitter, rise/fall time and duty cycle variation will also degrade the performance of a serial link receiver. When a jittery data sequence is transmitted through the channel and arrives at the input of the receiver, the jitter is increased, especially its high-frequency portion. This is typically referred to as jitter enhancement or jitter amplification [45], [35], and it worsens as the data rate increases. One way to quantify the amount of jitter amplification is to use the jitter impulse response (JIR) and jitter transfer function (JTF).

Figure 2.5. (a) Jitter impulse response and (b) jitter transfer function of a 40cm PCB trace at 20-Gb/s.

The JIR at a given data rate can be extracted by comparing the ideal zero-crossings with the zero-crossings of the response to a data sequence that contains a single small timing offset. The JTF can then be obtained by calculating the Fourier transform of the JIR. Fig.
2.5 shows the JIR and JTF of the 40cm FR4 PCB trace at a 20-Gb/s data rate. Assuming the transmitter jitter sequence $J_{TX}$ is wide-sense stationary (WSS), the mean of the jitter response at the input of the receiver, $J'_{TX}$, can be expressed as:

$$E\left[J_{TX}^{\prime}\right] = E\left[JIR \otimes J_{TX}\right] = E\left[J_{TX}\right] \int_{0}^{\infty} JIR \ dt = E\left[J_{TX}\right] JTF\left(0\right)$$ (2.8)

where $E[x]$ is the expected value or mean of x [47]. $S_{TX}$ and $S'_{TX}$, the power spectral densities (PSD) of $J_{TX}$ and $J'_{TX}$, are related by the well-known equation:

$$S_{TX}^{\prime} = \left| JTF(f) \right|^2 S_{TX} \tag{2.9}$$

Then the auto-covariance $C'_{TX}$ of $J'_{TX}$ is

$$C'_{TX}(\tau) = R'_{TX}(\tau) - E^2 \left[ J'_{TX} \right] = \mathcal{F}^{-1}(S'_{TX}) - E^2 \left[ J'_{TX} \right]$$ (2.10)

where $R'_{TX}$ is the auto-correlation of $J'_{TX}$, and the second equality follows from the Wiener-Khinchin theorem. From (2.8)-(2.10), if the distribution of $J_{TX}$ is known, we can obtain both the mean and auto-covariance of its response $J'_{TX}$ through the channel. Moreover, if the input process $J_{TX}$ is a Gaussian WSS random process, the output $J'_{TX}$ will also be a Gaussian WSS random process [47]. Thus, the mean and auto-covariance are sufficient to determine the distribution of $J'_{TX}$.

It should be noted that while the jitter is amplified as it passes through the channel, the sampling clock can track some of this jitter, so that the total degradation of the BER is mitigated. This jitter tracking is constrained by the bandwidth limitation of the CDR circuit [35] (which generates the clock for the receiver in Fig. 2.1) in an embedded clock architecture, or by the mismatch between the data and clock paths in a forwarded clock architecture.
To model this effect, we assume a first-order low-pass system that tracks the jitter up to its tracking bandwidth BW<sub>track</sub>, so that only the jitter outside this tracking bandwidth is integrated. The transfer functions of "jitter tracking" and "not tracking" can be expressed as:

$$H_{track} = \frac{1}{1 + \frac{j\omega}{BW_{track}}}, \quad H_{not\_track} = 1 - H_{track} = \frac{\frac{j\omega}{BW_{track}}}{1 + \frac{j\omega}{BW_{track}}}$$ (2.11)

By doing so, the transmitter jitter is converted to its equivalent jitter distribution at the receiver side.

The random timing jitter uncertainty at the receiver side can be modeled as a Gaussian distribution. Though a Gaussian distribution is unbounded, the probability that the random variable exceeds $7.0345\sigma$ is only $10^{-12}$, where $\sigma$ is its standard deviation [47]. We include the range $\pm N_s \sigma$ in the calculation, where $N_s$ is chosen as 8 in order to leave sufficient margin for a BER of $10^{-12}$. The time positions of the cursors in the pulse response $g_F(t)$, shown in Fig. 2.3, are disturbed by the presence of the jitter. Therefore, the PDFs of the ISI and the main cursor ONE are modified from (2.2) and (2.4) to:

$$ISI_{k,j} = P_0 \delta(x) + P_1 \sum_{\tau = -N_s \sigma}^{N_s \sigma} \left[ \delta(x - g_F(t - kT - \tau)) \cdot gs(\tau) \right]$$ (2.12)

$$main_{1,j} = \sum_{\tau = -N_s \sigma}^{N_s \sigma} \left[ \delta \left( x - g_F \left( t - \tau \right) \right) \cdot gs(\tau) \right]$$ (2.13)

where $gs(\tau)$ is the PDF of the jitter. The PDFs of uncorrelated jitter sources can be convolved together to obtain the total equivalent PDF at the receiver side.

The effects of finite rise/fall time and duty cycle variation are added to this analysis by directly shaping the input symbol pulse according to its rise/fall time and pulse width, and then regenerating the pulse response through the channel.

# 2.2.5 Sub-block Modeling of Serial Link

As shown in Fig.
2.1, the sub-blocks of a serial link transceiver include the channel, FFE, LE and DFE. The channel pulse response can be extracted from the inverse FFT of the S-parameters of the channel [39]. Because the FFE and DFE are discrete-time in nature, they are easily included in the analysis, as the tap coefficients calculated from (2.1) can be used directly as the coefficients of the FIR filter of the FFE or DFE.

The receiver front-end LE, on the other hand, is the analog component that works at the highest frequency of all the receiver blocks. It is usually implemented as a source-degenerated linear equalizer [2], as shown in Fig. 2.6. Its voltage gain can be written as:

$$A_{v} = G_{m}R_{out} \approx \frac{g_{m}}{1 + g_{m}\left(R_{s} / / \frac{1}{j\omega C}\right)} \left(R_{D} / / \frac{1}{j\omega C_{L}}\right) = \frac{g_{m}R_{D}}{1 + g_{m}R_{s}} \frac{1 + j\omega / \omega_{z}}{\left(1 + j\omega / \omega_{p}\right)\left(1 + j\omega / \omega_{p,out}\right)}$$ (2.14)

where $g_m$ is the transconductance of the input transistor pair, $\omega_z = 1/R_s C$, $\omega_p = (1+g_m R_s)/R_s C = (1+g_m R_s)\omega_z$, and the output pole is $\omega_{p,out} = 1/R_D C_L$. Therefore, $R_s$ and C introduce a zero $\omega_z$ before the pole $\omega_p$. If the output pole $\omega_{p,out}$ is designed to be larger than the zero, the gain is boosted between $\omega_z$ and the smaller of $\omega_p$ and $\omega_{p,out}$. Increasing the value of the degeneration resistor R<sub>s</sub> decreases the DC gain and lowers $\omega_z$. However, the locations of the two poles do not change significantly, resulting in an effective high-pass filtering effect with a constant peaking frequency that compensates for some of the channel loss.

Figure 2.6. Schematic of receiver linear equalizer.

Finally, the BER distribution plot with equalization can be obtained by convolving the resulting pulse response with the impulse responses of the equalizers.
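To make the pole/zero picture of (2.14) concrete, the following Python sketch evaluates the factored form of the linear equalizer gain and cross-checks it against the unfactored parallel-impedance form. The function names and the component values in the usage note are hypothetical and chosen only for illustration; they are not the values of the fabricated equalizer.

```python
import numpy as np


def le_corners(gm, rs, c, rd, cl):
    """Zero, degeneration pole, output pole (rad/s) and DC gain from (2.14)."""
    w_z = 1.0 / (rs * c)
    w_p = (1.0 + gm * rs) / (rs * c)
    w_p_out = 1.0 / (rd * cl)
    a_dc = gm * rd / (1.0 + gm * rs)
    return w_z, w_p, w_p_out, a_dc


def le_gain(f_hz, gm, rs, c, rd, cl):
    """Complex gain using the factored right-hand side of (2.14)."""
    w = 2.0 * np.pi * np.asarray(f_hz, dtype=float)
    w_z, w_p, w_p_out, a_dc = le_corners(gm, rs, c, rd, cl)
    return a_dc * (1 + 1j * w / w_z) / ((1 + 1j * w / w_p) * (1 + 1j * w / w_p_out))


def le_gain_direct(f_hz, gm, rs, c, rd, cl):
    """Same gain from the impedance form Gm * Rout, used as a cross-check."""
    w = 2.0 * np.pi * np.asarray(f_hz, dtype=float)
    z_s = rs / (1 + 1j * w * rs * c)    # Rs in parallel with 1/(jwC)
    z_l = rd / (1 + 1j * w * rd * cl)   # RD in parallel with 1/(jwCL)
    return gm / (1 + gm * z_s) * z_l
```

With the assumed values `gm=10e-3`, `rs=200`, `c=200e-15`, `rd=300` and `cl=50e-15`, the DC gain is 1 and the pole sits at three times the zero frequency, so `abs(le_gain(8e9, ...))` exceeds the DC gain, which is the peaking behavior described above. Raising `rs` in this model lowers both the DC gain and the zero frequency.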
# 2.3 Behavioral Simulations on Link Performance

The above analysis is verified using behavioral simulations in MATLAB. Several FR4 PCB traces with two SMA connectors, with lengths from 10cm to 80cm, were measured (Fig. 2.2). The impulse responses of the channels were derived from the measured S-parameters. Unless otherwise stated, the default settings for the simulations in this section are a 20-Gb/s data rate with 0.5V transmitter amplitude and 20mV tap coefficient resolution through the 40cm PCB trace.

Due to the large channel loss of the 40cm trace, the eye without equalization is closed. Therefore, the statistical analysis is verified against traditional transient simulation with 4-tap FFE equalization for both NRZ and duobinary. The two methods exhibit similar horizontal and vertical openings, as shown in Fig. 2.7 and Fig. 2.8. Note that for the transient results, 10k bits are simulated as a trade-off between accuracy and simulation time. The proposed statistical analysis not only provides a similar eye diagram in less simulation time but also includes sufficient BER information. This BER eye plot can easily be converted to the conventional bathtub curve for a given decision threshold. By rebuilding the pulse response, a comparison is also made between the statistical method and the measured results in [36], as shown in Fig. 2.9 and Fig. 2.10.

Figure 2.7. Eye diagram of 40cm trace after NRZ equalization, (a) transient simulation of 10k random bits and (b) statistical analysis.

Figure 2.8. Eye diagram of 40cm trace after duobinary equalization, (a) transient simulation of 10k random bits and (b) statistical analysis.

Figure 2.9. (a) Simulation result for NRZ and (b) Measured eye diagram from Fig. 29 in [36].

Figure 2.10. (a) Simulation result for Duobinary and (b) Measured eye diagram from Fig. 29 in [36].
To fairly compare the performance of NRZ without equalization, NRZ with equalization and duobinary equalization, the tap coefficients of the 4-tap FFE are normalized. Each modulation scheme is analyzed by comparing the area of the region where BER<10<sup>-12</sup> in the BER eye plot, in units of ps\*V. As there are two eye openings for duobinary signaling, only the smaller of the two is counted, as the worst case, when the duobinary eyes are uneven. As shown in Fig. 2.11, the eye is almost closed beyond 20cm if no equalization is performed. Because duobinary equalization consumes more voltage headroom than NRZ and relies on a faster roll-off of the channel response in the frequency domain, it is not as effective as NRZ equalization for small channel losses. However, for channels with severe loss, such as those longer than 40cm, its eye opens more than with NRZ equalization.

Figure 2.11. Eye opening area for BER<10<sup>-12</sup> with different length of traces.

Figure 2.12. Eye opening area for BER<10<sup>-12</sup> with different rising and falling times for 40cm trace.

Figure 2.13. Eye opening area for BER<10<sup>-12</sup> with different duty cycle deviations for 40cm trace.

Figure 2.14. Eye opening area for BER<10<sup>-12</sup> with different receiver and transmitter RMS jitter for 40cm trace.

Figure 2.15. Eye opening area for BER<10<sup>-12</sup> with different jitter tracking bandwidth for 40cm trace with both 1ps RMS TX and RX jitter.

The effects of finite rise/fall time and duty cycle deviation are shown in Fig. 2.12 and Fig. 2.13. Here it is observed that NRZ equalization does not degrade as much as duobinary under these variations. Interestingly, the eyes improve slightly with small rise/fall time, because the finite transition times smooth the pulse shape and excite less interference. Fig.
2.14 shows the BER eye openings of NRZ and duobinary signaling with different receiver and transmitter jitter values, where the eye opening of duobinary degrades faster than that of NRZ in the presence of jitter. Thus, while the eye opening of jitter-free duobinary is larger than that of NRZ, duobinary performs worse than NRZ as the jitter increases. Fig. 2.15 shows that a larger jitter tracking bandwidth helps to improve the BER performance.

The eye openings for different numbers of FFE and DFE taps are plotted in Fig. 2.16. As the number of taps increases, residual ISI becomes less severe, opening the eyes of both NRZ and duobinary. However, as duobinary requires 3-level signaling, its eye is more likely to be limited by its voltage headroom than by residual ISI. Therefore, when a large number of equalizing taps is used, the duobinary eye with limited voltage headroom may compare unfavorably with NRZ.

Figure 2.16. Eye opening area for BER<10<sup>-12</sup> with different FFE and DFE taps for 40cm trace (for FFE, with 1 precursor tap and varying no. of postcursor taps).

#### 2.4 Clock Distribution Methods

In the previous sections, the analysis and simulation of serial link performance focused on the data path, with the clock abstracted into jitter, rise/fall time, DCD, etc. In this section, clock distribution methods are discussed in more detail.

Clock distribution methods can be classified into full-swing propagation, such as the inverter chain, and low-swing propagation, such as current mode logic (CML), transmission line, inductive load and capacitively driven wires (CDW). The dynamic power dissipation at frequency f can be expressed as

$$P_{dyn} = C_L V^2 f \tag{2.15}$$

where $C_L$ is the load capacitance and V is the propagation swing [70]. Low-swing methods therefore benefit from lower dynamic power and are less aggressive toward the substrate and other circuits. However, they may suffer from large static power consumption, as in CML.
Also, auxiliary circuits such as level shifters are needed at the local receiver side to convert the clock to full swing.

#### 2.4.1 Inverter Chain

The inverter chain is the most traditional method of clock distribution. As shown in Fig. 2.17, it can be divided into several segments to minimize the propagation delay. For hand calculation, the inverter chain can be modeled as in Fig. 2.18. When an ideal step input excites a single-pole system, the output waveform can be written as

$$V_{out}(t) = V(\infty) + (V(0^+) - V(\infty))e^{-t/\tau}$$ (2.16)

where $\tau$ is the time constant of the system, which is the product of the effective resistance and capacitance, and $V(\infty)$ and $V(0^+)$ are the final and initial voltages respectively. The time to reach the 50% point is the well-known propagation delay:

$$t_p = \ln(2)\tau = 0.69\tau \tag{2.17}$$

Figure 2.17. Inverter chain.

Figure 2.18. Modeling of chain.

According to the Elmore model [70], the propagation delay of the inverter chain can be calculated as

$$\begin{aligned} t_{p} &= N \left[ 0.69 R_{eq} c_{o} + \left( 0.69 R_{eq} + 0.38 \frac{R_{w}L}{N} \right) \frac{C_{w}L}{N} + 0.69 \left( R_{eq} + \frac{R_{w}L}{N} \right) c_{i} \right] \\ &= N \left[ 0.69 \frac{R_{equ}}{m} m c_{ou} + \left( 0.69 \frac{R_{equ}}{m} + 0.38 \frac{R_{w}L}{N} \right) \frac{C_{w}L}{N} + 0.69 \left( \frac{R_{equ}}{m} + \frac{R_{w}L}{N} \right) m c_{iu} \right] \\ &= 0.69 N R_{equ} \left( c_{ou} + c_{iu} \right) + 0.38 \frac{R_{w}C_{w}L^{2}}{N} + 0.69 \frac{R_{equ}C_{w}L}{m} + 0.69 m R_{w}L c_{iu} \end{aligned}$$ (2.18)

where N is the number of segments; $c_i$, $c_o$ and $R_{eq}$ are the input capacitance, output capacitance and equivalent resistance of the inverter; $R_w$ and $C_w$ are the resistance and capacitance per unit length of the metal wire; and L is the total length of the metal wire. $c_{iu}$, $c_{ou}$ and $R_{equ}$ are the input capacitance, output capacitance and equivalent resistance of a unit strength inverter. Usually a unit strength inverter is defined, so that other inverters are m multiples of the unit inverter.
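As a quick numerical check of the hand model, the Python sketch below evaluates the last line of (2.18) and the segment count and sizing obtained by setting its partial derivatives with respect to N and m to zero, where $\sqrt{0.38/0.69} \approx 0.742$. The function names are illustrative assumptions; the parameter values are the ones quoted in this section for the 90nm example (5mm wire, $R_w$=0.083Ω/um, $C_w$=0.134fF/um, $R_{equ}$≈6kΩ, $c_{iu}$=1.3fF, $c_{ou}$=1fF). This is only a first-order estimate and is not expected to reproduce the simulated delays.

```python
import math


def chain_delay(n, m, length, r_w, c_w, r_equ, c_iu, c_ou):
    """Propagation delay from the last line of (2.18).

    n: number of segments, m: inverter size in unit inverters,
    length in um, r_w in ohm/um, c_w in F/um, capacitances in F.
    """
    return (0.69 * n * r_equ * (c_ou + c_iu)
            + 0.38 * r_w * c_w * length ** 2 / n
            + 0.69 * r_equ * c_w * length / m
            + 0.69 * m * r_w * length * c_iu)


def optimal_n_m(length, r_w, c_w, r_equ, c_iu, c_ou):
    """Stationary point of chain_delay with respect to n and m."""
    n_opt = (math.sqrt(0.38 / 0.69) * length
             * math.sqrt(r_w * c_w / (r_equ * (c_ou + c_iu))))
    m_opt = math.sqrt(r_equ * c_w / (r_w * c_iu))
    return n_opt, m_opt


# Values quoted in the text for the 90nm example (wire length in um).
PARAMS = dict(length=5000.0, r_w=0.083, c_w=0.134e-15,
              r_equ=6e3, c_iu=1.3e-15, c_ou=1.0e-15)

n_opt, m_opt = optimal_n_m(**PARAMS)
print(f"N = {n_opt:.2f}, m = {m_opt:.1f}")
```

Because N must be an integer in practice, the continuous optimum is rounded, and `chain_delay` can then be used to compare neighboring integer choices.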
The inverter chain achieves minimum propagation delay when

$$N = 0.742L\sqrt{\frac{R_{w}C_{w}}{R_{equ}(c_{ou} + c_{iu})}} \text{ and } m = \sqrt{\frac{R_{equ}C_{w}}{R_{w}c_{iu}}}$$ (2.19)

In a specific 90nm CMOS process, for example, the sheet resistance and unit capacitance of the top thick metal 8 are $R_\Box=47m\Omega$ and $C_\Box=0.25fF/um^2$, and its minimum pitch is 560nm. It is not necessary to use the minimum width for clock distribution. If a 2x wider metal is chosen, then $R_w=0.083\Omega/um$ and $C_w=0.134fF/um$. The metal wire is assumed to be 5mm long. A unit strength inverter is designed with an NMOS W/L of 0.4u/0.1u and a PMOS W/L of 1u/0.1u. The simulated $c_{iu}$ and $c_{ou}$ are 1.3fF and 1fF respectively. For hand calculation, $R_{equ}$ is estimated by averaging the on-resistance $R_{on}$ of the unit inverter at full supply and half supply, as $1/2(R_{on}(VDD)+R_{on}(VDD/2))\approx 6k\Omega$. The segment number N and multiple m for optimal delay can then be calculated from (2.19) as N=3.33, rounded to 3, and m≈86.

However, these numbers are not optimal for power and jitter performance, because the inverter is sensitive to supply noise. Fewer inverters are preferred with regard to jitter performance. Simulations with a 5mm long wire, a 2.5 GHz clock and $\pm 5\%$ supply variation show that a minimum jitter of ~36ps is achieved with N=3, m=256, a minimum delay of ~321ps with N=3, m=128, and minimum power with N=4, m=64. From the viewpoint of jitter-power product and delay-power product, N=3, m=128 gives the best performance.

#### 2.4.2 CML Chain

The CML chain, shown in Fig. 2.19, can propagate much faster signals due to its current-mode nature. It can be analyzed using the same model as in Fig. 2.18. In the same process as above, a unit CML stage is designed with a $200\Omega$ load resistor and a 1mA current source, so the output swing is 0.2V and $R_{equ}$ is $200\Omega$. The W/L of the input device pair is 20u/0.1u.
The simulated $c_{iu}$ and $c_{ou}$ for CML are 30fF and 6fF respectively. Repeating the analysis used for the inverter chain, again assuming a 5mm long metal wire, gives N $\approx$ 4 and m=3.3 for optimal delay from equation (2.19). Though the power supply rejection ratio (PSRR) of CML is much better than that of the inverter, CML is much more power hungry due to its static power dissipation. Therefore, a minimum number of segments is favorable for low-power design. Since the delay of CML is smaller than that of the inverter, its timing margin is relieved, and it is not necessary to reach the optimal propagation time to meet the timing requirement. Simulations show that the minimum jitter of the CML chain in this process is $\sim$0.5ps with N=2, m=8, the minimum delay is $\sim$182ps with N=4, m=8, and the minimum power is obtained with N=2, m=1. From the viewpoint of jitter-power product and delay-power product, N=2, m=1 achieves a good tradeoff.

Figure 2.19. CML chain.

# 2.4.3 Transmission Line

Transmission line effects become significant when the total resistance of the metal wire is small enough relative to the characteristic impedance $Z_0$. From this point, the distributed inductance of the wire starts to affect the delay behavior. The delay of the transmission line is the smallest among all the methods due to its speed-of-light propagation velocity. As a passive element, it does not introduce jitter itself, but the driving circuits do. The on-die transmission line can be realized as a coupled differential microstrip in the top thick metal 8 with a ground shield underneath in metal 6, as shown in Fig. 2.20. It is not necessary to realize exactly the standard $100\Omega$ differential characteristic impedance, as long as the line is matched on-die by its source and load. In fact, $Z_0$ is preferably designed to be large compared with the DC resistance of the metal, as mentioned above. A large $Z_0$ also saves power in the driving CML circuit for a given swing. However, $Z_0$ is limited by the parasitic capacitance on the die.
After several trials, we choose a transmission line with width W=6um, spacing d=2.5um, ground wire width Ws=4um, and spacing between signal wire and ground wire s=4um. With this choice, the simulated differential characteristic impedance is $120\Omega$ and the DC resistance is $42\Omega$. It behaves as a lossy transmission line, and the signal amplitude tapers slightly along the line. Simulation shows that the delay is as small as 43ps, with only 0.18ps of jitter. However, the driving CML circuit consumes 4mA.

Figure 2.20. Microstrip transmission line.

#### 2.4.4 Inductive Load

Instead of taking advantage of distributed inductance as in a transmission line, a lumped on-die inductor can be employed to improve both the propagation time and the voltage swing, as shown in Fig. 2.21 [2]. The inductance can be estimated from the resonant frequency and the total capacitance of the metal wire. Assuming the same 5mm wire as for the inverter chain, $C_{total} = C_w \cdot L = 670 \text{fF}$, so the inductance that cancels this capacitance at 2.5GHz is about $1/(\omega^2 C_{total}) \approx 6nH$. A larger inductance pushes the zero introduced by the inductor to a lower frequency, causing more boost at the desired frequency. However, this may cost more chip area. Since the area of an inductor is directly related to its inductance and quality factor, a careful choice is necessary to achieve these within a reasonable area. Die area can be saved by using a low inductance and a low quality factor Q. In fact, the quality factor of the on-die inductor has little effect on the performance, because the resistance of the long metal wire dominates the loaded Q anyway. Simulations show that the jitter is smaller than 1ps. As with the transmission line, the swing decreases along the wire.

Figure 2.21. Inductive load.

Figure 2.22. CDW.

Figure 2.23. Modeling of CDW.

# 2.4.5 Capacitively Driven Wires (CDW)

Fig. 2.22 shows the capacitively driven method [71].
The capacitance seen by the inverter is significantly reduced by the series capacitor, so the power consumption is reduced as well. The propagation time can be estimated from the model in Fig. 2.23:

$$t_p = 0.69R_{eq} \left( c_o + c_{p1} + c_c \right) + 0.19R_w C_w L^2$$ (2.20)

where $c_i$, $c_o$ and $R_{eq}$ are the input capacitance, output capacitance and equivalent resistance of the driving inverter, $R_w$ and $C_w$ are the resistance and capacitance per unit length of the metal wire, and L is the total length. Although a smaller series capacitor $c_c$ reduces the delay and power, it also reduces the swing because of the capacitive divider. Its minimum value is therefore limited by the sensitivity of the comparator at the receiver side, with some margin. Although $c_c$ blocks the DC level, many methods can be used to control the DC bias [71]. CDW achieves the minimum power dissipation among all five methods.

The performances of the five methods, simulated in a 90nm CMOS process, are summarized in Table 2.1. The transmission line gives the best jitter and delay performance. Based on the tabulated values, CDW gives the minimum delay-power product, while its jitter-power product is second only to that of the transmission line.

TABLE 2.1 COMPARISON OF FIVE CLOCK DISTRIBUTIONS (1.2V 90nm CMOS)

| Method (with optimal tradeoff) | Jitter (ps) | Delay (ps) | Power (mW) |
|-----------------------------|------|-----|------|
| Inverter chain (N=3, m=128) | 36 | 321 | 11.5 |
| CML chain (N=2, m=1) | 1 | 221 | 2 |
| Transmission line | 0.18 | 43 | 4 |
| Inductive load (L=6nH, Q=2) | 0.42 | 55 | 4 |
| CDW (Cc=50fF) | 1.98 | 116 | 0.62 |

# 2.5 Summary

A statistical method to analyze serial link systems for NRZ and duobinary signaling is presented, incorporating non-ideal effects such as transmitter jitter and receiver jitter, jitter tracking bandwidth, finite rise/fall time and duty cycle deviation. Using this analysis tool, a comparison of the performance between NRZ and duobinary at 20-Gb/s is then performed.
While duobinary achieves less channel loss due to the reduced Nyquist bandwidth, in general it suffers more than NRZ from non-idealities arising from the imperfect clock source. Only for long channels with significant attenuation does multi-level duobinary signaling have a BER advantage over NRZ, given the expected amount of clock uncertainty. The proposed statistical analysis can therefore give early insight for quick and accurate system design tradeoffs for multi-Gb/s interconnections.

On the clock path, a comparison and design analysis of five clock distribution methods for serial links is presented. Simulations in a 90nm CMOS process have been performed to verify the design tradeoffs and show that the transmission line achieves the least jitter and delay, while CDW consumes the lowest power with reasonable jitter and delay performance.

# CHAPTER 3. A SERIAL LINK RECEIVER USING LOCAL INJECTION-LOCKED RING OSCILLATOR

In the previous chapter, design considerations of serial links were addressed at the system level. In this chapter, circuit techniques to achieve high power efficiency are discussed. Recent serial link receivers have shown improvements in power efficiency by focusing on reducing dynamic clock power using resonantly-tuned LC oscillators, both in global clock distribution [2], [48] and local clock demultiplexing [49], [50]. In this chapter, a multi-channel serial link receiver architecture is presented that exhibits further improvements in dynamic clock power consumption by implementing a low-voltage swing, global clock distribution to multiple link locations, where locally-tapped, injection-locked ring oscillators (ILRO) are used to generate tunable quadrature sampling clocks for receiver demultiplexing [51].

#### 3.1 Forwarded Clock Receiver Architectures

A conventional, forwarded clock receiver architecture [35] is shown in Fig.
3.1(a), which consists of the global clock distribution, as well as the local delay/phase locked loop (DLL/PLL) to generate multiple time-interleaved phases. The subsequent phase rotator uses these phases to interpolate the appropriate phase position for the receiver to sample the incoming data. In this architecture, significant power is spent in the receiver clocking and phase generation, as each link needs a local, phase rotator-based PLL to deskew the clock phase for recovery of the data [52]. For example, phase rotation alone accounts for almost half the total receiver power in [2].

Figure 3.1. (a) Conventional forwarded clock receiver architecture and (b) proposed architecture using ILRO for multiple serial links.

As an alternative to the phase rotator, injection-locked LC oscillators (IL-LCO) can enable clock deskew with less power and a lower voltage swing for the global clock. Hence, injection-locking has recently been proposed for both clock distribution [48], [53] and serial link receivers [49], [54]. As shown in the phase vector diagram in Fig. 3.2(a), when the frequency of the injection signal $e_{inj}(x)$ is different from the free-running frequency of the signal in the LC tank, e(x), a phase deskew $\alpha$ is generated between the resulting output $e_g(x)$ and $e_{inj}(x)$. The value of $\alpha$ depends on the frequency difference and locking range, as given by Adler's equation [55]:

$$\frac{d\alpha}{dt} = -\omega_{SL} \sin \alpha + \Delta \omega_0 \quad , \quad \omega_{SL} = k / \frac{d\varphi}{d\omega} = k \frac{\omega_0}{2Q}$$ (3.1)

where $\alpha$ is the phase difference between the resultant output clock and the injection input clock, $\varphi$ is the phase difference between the free-running signal at $\omega_0$ and the resultant output, $\omega_{SL}$ is the single-sided locking range, k is the injection strength defined as the ratio of the injection current to the oscillator current, and $\Delta\omega_0$ is the frequency difference between $\omega_0$ and the injection clock. Fig. 3.2(b) plots an example of the deskew phase shift versus the normalized frequency difference for three injection strength values.

Figure 3.2. (a) Phase vector diagram (injection signal $e_{inj}(x)$, free-running tank signal e(x) and resultant output $e_g(x)$). (b) Deskew with different injection strength k, based on Adler's equation.

Monolithic LC oscillators typically have better phase noise and jitter performance than their ring oscillator counterparts due to the band-pass nature of LC tank resonators, which reject out-of-band frequencies and filter power supply induced noise [56]. However, for highly-parallel serial link applications, per-channel injection-locked LC oscillators are not desirable, as each receiver would require an individual on-chip inductor, resulting in a significant area penalty. In addition, LC-based oscillators have a very limited tuning range, may exhibit oscillator pulling due to magnetic coupling from adjacent LC oscillators [57], and do not scale well with continued technology scaling. Although the jitter performance of free-running ring oscillators is typically worse than that of LC oscillators, the large jitter transfer bandwidth of injection locking can suppress and high-pass filter a large amount of oscillator phase noise, as will be described in the next section.

Therefore, a new forwarded clock receiver architecture using injection-locked ring oscillators (ILRO) is proposed, as shown in Fig. 3.1(b), to deskew the clock used to sample the incoming data. First, compared with the conventional receiver architecture, the ILRO can achieve a large phase deskew range without the power overhead required for the combined DLL, PLL and phase interpolation. Second, it can lock to relatively small voltage swings of the injected global clock, saving power in the clock distribution.
Third, the ILRO can achieve faster phase locking than a conventional PLL because, while the loop bandwidth of a PLL is limited to approximately 1/10 of the reference clock frequency [57], injection-locking exhibits non-linear loop bandwidth characteristics.

As shown in Fig. 3.2(b), nonlinearity can be observed in the deskew steps at the edge of the locking range, when $\alpha$ reaches around $\pm 100^{\circ}$. However, the linear deskew region can be widened by increasing the injection strength k. To further avoid the use of the nonlinear deskew region, each receiver uses 1:4 demultiplexing, implemented with four quantizers clocked by quadrature sampling phases. Therefore, only a $\pm 45^{\circ}$ phase deskew range of the ring oscillator is required to enable each quadrature phase to cover the full UI range, limiting the deskew to the linear region.

Compared with the IL-LCO, the ILRO offers smaller silicon area, a larger tuning range, inherent multi-phase generation, and scalability to future CMOS processes. However, because previous analysis of the injection-locking phenomenon is applicable only to tank-based oscillators, new analysis is needed to further understand the behavior of the proposed ILRO.

# 3.2 Analysis on Injection-Locked Ring Oscillators

# 3.2.1 Previous Approaches

Several methods have been proposed in previous works to analyze injection locking in oscillators, including the phasor-based Adler's equation, the perturbation-based projection vector (PPV) method, and the waveform-based time-domain derivation.

The classic Adler's equation [55] expresses the oscillator behavior under injection locking by using a phasor vector diagram, as shown in (3.1) and Fig. 3.2. Various time-domain solutions to Adler's equation are discussed in [58]-[60]. However, two main factors prevent this approach from being applicable to ring oscillators.
First, the output waveform of a ring oscillator is usually not sinusoidal, whereas a vector-based analysis relies on the assumption that there is only a single dominant frequency component [61]. Second, the quality factor Q must be known in order to solve for $d\varphi/d\omega$ in equation (3.1), and Q is not well defined for nonharmonic ring oscillators.

The PPV [62] and the transient waveform-based methods [61] are capable of analyzing both LC and ring oscillators. However, the PPV method requires a full circuit description at both the transistor and numerical levels, and only the expression for the locking range is derived [62]. The analysis in [61] provides good insight into analyzing injection locking in the time domain. However, neither of these two methods gives an analytical expression for evaluating the jitter performance of injection-locked oscillators.

Figure 3.3. Superposition of waveforms.

#### 3.2.2 Proposed Approach for ILRO Analysis

Since Adler's equation is simple and has proven useful for capturing LC oscillator behavior in both the frequency and time domains, this work presents an extension of Adler's equation that overcomes the two limitations mentioned above, making it suitable for injection-locked ring oscillators. By revisiting Adler's derivation [55], it can be observed that:

$$\frac{d\alpha}{dt} = \varphi / \frac{d\varphi}{d\omega} + \Delta\omega_0 \tag{3.2}$$

Note that (3.2) holds for both LC and ring oscillators, as neither the assumption of Q nor a vector diagram approach has been applied yet. Next, alternative methods for finding $d\varphi/d\omega$ as well as the relationship between $\varphi$ and $\alpha$ are presented.

First, $d\varphi/d\omega$ can be solved directly from the small signal model of each delay cell.
Assuming each delay cell contributes one dominant 3dB pole, the loop transfer function H of an N-stage ring oscillator is:

$$H(j\omega) = -\left(\frac{A_0}{1 + j\omega/\omega_{3dB}}\right)^N \tag{3.3}$$

such that its phase and derivative are:

$$\varphi(j\omega) = N \tan^{-1}(\omega/\omega_{3dB}) \tag{3.4}$$

$$\left. \frac{d\varphi}{d\omega} \right|_{\omega = \omega_0} = \frac{N}{2\omega_0} \sin \frac{2\pi}{N} \tag{3.5}$$

Equation (3.5) is obtained by noting that each delay stage exhibits a phase shift equal to $\tan^{-1}(\omega_0/\omega_{3dB})=\pi/N$. Similar analysis can lead to an equivalent definition of Q for ring oscillators, as shown in [63].

Second, the phase relationship can be obtained by superposition of waveforms in the time domain rather than by using a vector diagram; this enables a general analysis for any arbitrary waveform. As shown in Fig. 3.3, the proposed derivation assumes that the small-swing injection clock $e_{inj}(x)$ remains sinusoidal, but the waveform of the free-running ring oscillator e(x) resembles a trapezoid. This trapezoidal model reflects the actual waveform of a nonharmonic ring oscillator with equal rise and fall times, where the slope is $k_f$. Signal $e_g(x)$ is the superposition of the injection and ring oscillator signals, and $x_0$ is the phase difference between $e_{inj}(x)$ and e(x), which is equal to $\alpha+\varphi$. Other symbols remain unchanged. The amplitude is normalized to the amplitude of the free-running oscillator, and the time axis x is normalized to $2\pi$.
During the rising edge of the oscillator waveform (in the dashed box of Fig. 3.3), superposition gives:

$$e_{g}(x) = e_{inj}(x) + e(x) = k \sin x + k_{f}(x - x_{0}) \tag{3.6}$$

Setting (3.6) equal to zero, and noting that $\sin x$ can be approximated as $\sin x_0 + (x-x_0)\cos x_0$ using a Taylor series expansion near $x_0$, we obtain:

$$\varphi = -\frac{k \sin x_0}{k_f - k \cos x_0} \approx -\frac{k \sin \alpha}{k_f - k \cos \alpha} \tag{3.7}$$

The relationship between $k_f$ and N for ring oscillators is:

$$k_f = N\eta / \pi \tag{3.8}$$

where $\eta$ is a proportionality constant close to one [64]. Substituting (3.5), (3.7) and (3.8) into equation (3.2), we obtain:

$$\frac{d\alpha}{dt} = -\frac{k}{N\eta/\pi - k\cos\alpha} \frac{2\omega_0}{N\sin(2\pi/N)} \sin\alpha + \Delta\omega_0 \tag{3.9}$$

Hence, the new expression for the single-sided locking range $\omega_{SL}$ becomes:

$$\omega_{SL} = \frac{k}{N\eta / \pi - k \cos \alpha} \frac{2\omega_0}{N \sin(2\pi / N)} \tag{3.10}$$

Thus, we have derived new equations for analyzing the behavior of injection-locked ring oscillators (with no requirement for Q) by applying small signal and time-domain waveform analysis to Adler's derivation. Further, by analogy between injection-locked oscillators and a 1st order PLL [65], jitter transfer and jitter generation functions can be derived as follows:

$$\left|\frac{\phi_{out}}{\phi_{inj}}\right| = \frac{1}{\sqrt{1 + \left(\omega/\omega_{SL}\right)^2}} \quad , \quad \left|\frac{\phi_{out}}{\phi_{vco}}\right| = \frac{1}{\sqrt{1 + \left(\omega_{SL}/\omega\right)^2}} \tag{3.11}$$

Therefore, the ILRO low-pass filters the noise from the injection clock, while it high-pass filters its own noise.
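Equations (3.5), (3.8), (3.10) and (3.11) above can be evaluated numerically. The following is a minimal Python sketch, not part of the original work; the function names and the example values (k = 0.1, a 2.5GHz free-running frequency) are illustrative assumptions.

```python
import math


def dphi_domega(n_stages, omega0):
    """Phase slope of the N-stage loop at omega0, equation (3.5)."""
    return n_stages / (2.0 * omega0) * math.sin(2.0 * math.pi / n_stages)


def locking_range(k, n_stages, omega0, alpha=0.0, eta=1.0):
    """Single-sided locking range omega_SL in rad/s, equation (3.10)."""
    k_f = n_stages * eta / math.pi  # edge slope of the trapezoid, equation (3.8)
    return k / (k_f - k * math.cos(alpha)) / dphi_domega(n_stages, omega0)


def jitter_transfer(omega, omega_sl):
    """Return (|phi_out/phi_inj|, |phi_out/phi_vco|) from equation (3.11).

    omega must be strictly positive, since the second term divides by it.
    """
    low_pass = 1.0 / math.sqrt(1.0 + (omega / omega_sl) ** 2)
    high_pass = 1.0 / math.sqrt(1.0 + (omega_sl / omega) ** 2)
    return low_pass, high_pass


if __name__ == "__main__":
    omega0 = 2.0 * math.pi * 2.5e9  # assumed free-running frequency
    w_sl = locking_range(k=0.1, n_stages=4, omega0=omega0)
    print(f"f_SL = {w_sl / (2.0 * math.pi) / 1e6:.1f} MHz")
    for f_mhz in (1.0, 10.0, 100.0, 1000.0):
        lp, hp = jitter_transfer(2.0 * math.pi * f_mhz * 1e6, w_sl)
        print(f"{f_mhz:7.1f} MHz  injection path {lp:.3f}  oscillator path {hp:.3f}")
```

At $\omega = \omega_{SL}$ both magnitudes equal $1/\sqrt{2}$, and their squares sum to one at every frequency, which is the complementary low-pass/high-pass behavior expected of a first-order loop.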
Since the jitter of the injection clock $(\sigma_{inj})$ and that of the oscillator $(\sigma_{vco})$ are usually uncorrelated, the total jitter can be expressed as:

$$\sigma_{out}^2 = \left| \frac{\phi_{out}}{\phi_{inj}} \right|^2 \sigma_{inj}^2 + \left| \frac{\phi_{out}}{\phi_{vco}} \right|^2 \sigma_{vco}^2 \tag{3.12}$$

Figure 3.4. Block diagram of proposed receiver.

# 3.3 Circuit Implementation

Fig. 3.4 shows the block diagram of the major sections of the test chip. Four links are integrated for the experimental demonstration of a multiple serial link architecture using ILROs. The forwarded quarter-rate clock is first buffered by a global CML clock buffer driving a 600um-long, ground-shielded, differential RC line to all four receivers with a 250mV clock signal. The clock input is then coupled to each receiver through a local CML buffer that injects into the injection-locked ring oscillator. The ILRO generates tunable quadrature phases for the quantizers to recover and demultiplex the data.

# 3.3.1 Proposed Injection-Locked Ring Oscillator

Each injection-locked oscillator consists of a voltage-to-current (V/I) converter and a four-stage, cross-coupled, pseudo-differential current-starved ring oscillator. Simple NMOS-only differential V/I converters without resistive loading are used here to mitigate the interaction with the DC bias at the injection point. The sizes of the NMOS differential pair are carefully chosen to reduce the parasitic loading while fully steering the current source. As shown in Fig. 3.5, all the delay cells in the ring oscillator share a single current source, implemented as a 32b thermometer-encoded DAC. The minimum DAC current step is 30uA, enabling fine tuning of the free-running frequency of the oscillator. For coarse tuning of the free-running VCO frequency, either the supply can be reduced or a 3b switched capacitor array can be utilized.

Figure 3.5. Schematic of ILRO.
Injection locking causes adjacent time-interleaved phases of the oscillator to be unevenly spaced. For example, the injection nodes CK135 and CK315 exhibit significant phase asymmetry relative to the other six phases, as these nodes are the summing nodes of the injection-locking interpolation. However, once the differential clocks have propagated to CK0 and CK180, they are decoupled from the injection-point phase asymmetry and exhibit conventional inverter loading and delay. Hence, the four alternating phases (every two inverter stages) CK180, CK270, CK0, CK90 maintain adequate quadrature accuracy (less than 4.5°), in both the simulated and the experimental results. Each of the eight phases is loaded with the same inverter buffering, to maintain the same capacitive loading. Further phase symmetry is obtained by using small cross-coupled inverters between complementary phases, as well as a 3b binary capacitor bank on each output clock phase to individually trim the phase imbalances that arise from process variations or layout mismatch. In addition, for a typical application (not implemented here), an offline, static phase calibration of the multiple time-interleaved phases at reset time would be included to reduce the maximum phase mismatch to several picoseconds [66], [67].

Figure 3.6. Simulated AC response of RX EQ under different settings.

Figure 3.7. Quantizer with offset control.

# 3.3.2 Other Building Blocks

The front-end receiver equalizer is the analog component that works at the highest frequency of all the receiver blocks. A source-degenerated linear equalizer, as discussed in the previous chapter, is implemented as shown in Fig. 2.6. Switching the value of the degeneration resistor $R_s$ changes the DC gain as shown in Fig. 3.6, resulting in an effective high-pass filtering effect. Each quantizer of the 1:4 demultiplexer is implemented using a two-stage sense amplifier [68] and an SR latch, as shown in Fig. 3.7.
A 6b binary current source can be injected into nodes a and b in order to cancel the quantizer offset by current imbalancing.

# 3.4 Experimental Results

The 1mm<sup>2</sup> test chip has been fabricated in a 90nm, 1.2V CMOS process and tested in a chip-on-board assembly. As shown in Fig. 3.8, it integrates four receivers, the global clock distribution network, a digital scan chain, test output buffers and a stand-alone ILRO for test purposes. Each receiver occupies 0.0174mm<sup>2</sup>. Due to the limitation in pad area, only the near-end (RX1) and far-end (RX4) I/Os are measured. In each receiver, the four-way demultiplexed output data can be individually selected to drive the output pads.

Figure 3.8. Die photo and layout screen capture.

# 3.4.1 ILRO

In this subsection, the analytical equations derived in Section 3.2.2 are examined and compared with measurement data. Note that the number of stages is N=4 and that $\eta$=1 is assumed. The ILRO can tune from 1.6GHz to 2.6GHz by coarse tuning its supply, while turning on all switches in the digital switched-capacitor bank provides an additional 250-300MHz frequency range. Hence, this large and fine tuning range can be used to compensate for possible variations. Note that a shared frequency-locked loop (not implemented in this work), initiated at reset time, can be used at startup to calibrate and compensate for initial oscillation frequency variations [69].

Figure 3.9. (a) Deskew of ILRO (b) Overlaid waveforms by sweeping phase settings (vertical scale = 25mV/div, horizontal scale = 10ps/div).

Fig. 3.9(a) shows both the measured deskew under different injection strengths $k_{\text{eff}}$ and the analytical model predicted from (3.9). In this measurement, the injection clock was kept constant at 2.5GHz and the current-DAC fine tuning control was swept until the oscillator operated beyond its locking range. The measured phase shift confirms the same deskew characteristics as the simulated ILRO shown previously in Fig. 3.2(b).
The measured results also show that the ILRO can achieve greater than $\pm 45^{\circ}$ linear deskew range. Fig. 3.9(b) plots the corresponding output waveforms of the ILRO, overlaid on the oscilloscope, giving an intuitive view of the fine interpolation steps of the ILRO. The measured injection-locking ranges for injection strengths $k_{eff}$=0.03, 0.06, 0.09 and 0.12 are 65, 115, 167 and 203MHz, respectively.

The jitter performance was measured by keeping the free-running frequency fixed at around 2.5GHz and sweeping the frequency of the injection clock. Fig. 3.10 shows that the jitter of the ILRO first gets slightly worse as the frequency of the injection clock moves away from the free-running frequency, and then gets dramatically worse at the edge of the locking range. By substituting the measured RMS jitter of the injection clock and of the free-running ring oscillator into (3.12), we can successfully predict the jitter degradation when the injection clock is near the free-running frequency. This is because $\omega_{SL}$ reduces slightly due to the change of $\alpha$ in equation (3.10), such that more of the oscillator noise passes through to the output. At the edge of the locking range, the jitter worsens because the oscillator is on the cusp of losing phase lock. However, across the linear deskew range, the jitter stays sufficiently low (below 1.5ps RMS when $k_{eff}$=0.09) that the jitter degradation will not affect the normal operation of the ILRO. Corresponding measurements of the phase noise show a similar tendency in the performance degradation, as shown in Fig. 3.11. The -3dB bandwidths at $k_{eff}$=0.03, 0.06, 0.09 and 0.12 are 31, 55, 80 and 100MHz respectively, as shown in Fig. 3.12. This measurement is done directly by injecting a stressed input clock with 5% UI of sine jitter generated by a BertScope 12500B.

Figure 3.10. Jitter performance of ILRO.

Figure 3.11. Measured phase noise performance of ILRO.

Figure 3.12. Measured jitter transfer of ILRO.
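As a rough illustration of how (3.10) can be set against the measured values quoted above, the sketch below evaluates the single-sided locking range for N=4 and $\eta$=1 at 2.5GHz. It assumes that the measured $k_{eff}$ can be used directly as k and that $\alpha$=0; both are simplifications made here and are not statements from the measurements. The passage does not say whether the reported locking ranges are single-sided or double-sided, so the script only prints the numbers side by side and draws no conclusion.

```python
import math

F0_HZ = 2.5e9   # injection clock frequency used in the measurement
N_STAGES = 4    # stage number stated in the text
ETA = 1.0       # proportionality constant assumed in the text

# k_eff -> (measured locking range in MHz, measured -3dB bandwidth in MHz)
MEASURED = {0.03: (65, 31), 0.06: (115, 55), 0.09: (167, 80), 0.12: (203, 100)}


def single_sided_locking_range_hz(k, alpha=0.0):
    """Equation (3.10) expressed in Hz (omega_SL / 2*pi), with k taken as k_eff."""
    k_f = N_STAGES * ETA / math.pi
    slope_term = 2.0 * F0_HZ / (N_STAGES * math.sin(2.0 * math.pi / N_STAGES))
    return k / (k_f - k * math.cos(alpha)) * slope_term


MODEL_MHZ = {k: single_sided_locking_range_hz(k) / 1e6 for k in MEASURED}

if __name__ == "__main__":
    for k, (lock_mhz, bw_mhz) in MEASURED.items():
        print(f"k_eff={k:.2f}  model f_SL={MODEL_MHZ[k]:6.1f} MHz  "
              f"measured locking range={lock_mhz} MHz  "
              f"measured -3dB bandwidth={bw_mhz} MHz")
```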
To verify the phase symmetry, quadrature output waveforms of the ILRO are overlaid in Fig. 3.13. Measurement across the linear deskew range shows a maximum I/Q phase and amplitude imbalance of 4.5° and 7mV respectively.

Figure 3.13. Phase spacing when injecting 2.5GHz clock: (a) $f_0$=2.49GHz, (b) $f_0$=2.58GHz (vertical scale = 25mV/div, horizontal scale = 50ps/div).

### 3.4.2 Entire Receiver

Experimental measurements are obtained from testing the far-end receiver (RX4). RX4 has the longest clock distribution distance among the four receivers, and therefore shows the worst-case performance of the four lanes. The results are measured from 6.4Gb/s to 8Gb/s with a 400mV swing PRBS 7 data sequence generated by the BertScope under two channel conditions: 1) chip-on-board bond-wire with a 4cm PCB trace on the test board plus two SMA connectors and cables, and 2) the channel in 1) plus an additional 10cm of FR4 PCB trace. They are denoted as the 4cm trace and the 14cm trace in this subsection, and exhibit approximately 1dB and 5dB loss at 4GHz respectively.

Figure 3.14. Eye diagrams of recovered data under (a) 4cm trace and (b) 14cm trace (input data rate=7.2Gb/s, vertical scale = 155mV/div, horizontal scale = 111ps/div).

Eye diagrams of one stream of recovered 1.8Gb/s data from the 7.2Gb/s input are shown in Fig. 3.14. The eye diagram is slightly bimodal when the 14cm PCB trace is used. The equalizer design is sub-optimal, with its peak placed 2x lower than desired for the Nyquist bandwidth of this PCB channel, and the channel plus component assembly results in more loss and reflection than expected; as a result, the equalization is not enough to compensate for the channel losses at higher data rates. Fig. 3.15 plots the measured BER bathtub curves for the two channel conditions for 6.4Gb/s to 8Gb/s data rates.

Figure 3.15. BER measurements (a) by sweeping the delay in BERT and (b) by sweeping the phase setting of ILRO.

Figure 3.16. Receiver power breakdown.
The receiver consumes 3.84mW, 4.3mW and 4.8mW at input data rates of 6.4Gb/s, 7.2Gb/s and 8Gb/s respectively. Fig. 3.16 shows the breakdown of the power for the receiver. The ILRO accounts for 23% of the total power consumption, which is about half the share of a phase rotator within a conventional link receiver, as in [2]. Therefore the clock-related power is reduced to about 53% of the total receiver power.

# 3.5 Summary

A four-lane, 6.4-7.2 Gb/s per link, parallel serial link receiver design has been presented. The proposed forwarded clock architecture using ILROs allows the test chip to obtain full UI deskew while consuming only 0.6mW/Gb/s under moderate channel losses. The use of ILROs also brings other benefits, including inherent multiphase generation and large jitter transfer bandwidth. Methods to avoid the nonlinearity of ILROs are also discussed. Simple analytical equations are derived to understand both the injection locking and the jitter performance of ILROs, and are verified with experimental measurements. The measured performance is summarized in Table 3.1.

TABLE 3.1 PERFORMANCE SUMMARY

| Technology | 1.2V 90nm CMOS | | |
|---|---|---|---|
| ILRO tuning range | 1.6-2.6GHz | | |
| ILRO locking range (k<sub>eff</sub>=0.12) | 203MHz | | |
| Phase deskew resolution (k<sub>eff</sub>=0.12) | 1.8°-3.6° | | |
| ILRO power | 0.88mW @1.6GHz | 1.08mW @1.8GHz | 1.3mW @2GHz |
| Total RX power (including amortized power of clock distr.) | 3.84mW @6.4Gb/s | 4.3mW @7.2Gb/s | 4.8mW @8Gb/s |

#### CHAPTER 4. A NEAR-THRESHOLD SERIAL LINK RECEIVER

State-of-the-art low-power multi-Gb/s chip-to-chip transceivers [19], [22] have demonstrated power efficiencies as low as nearly 1mW/Gb/s, or about 0.5mW/Gb/s for the receiver alone.
These designs focus on reducing the clocking power by either sharing it within a bundle of links or using resonant-clock distribution. Relative to traditional phase interpolators, the injection-locked oscillator (ILO) [49], [51] has been introduced as a power-efficient technique for deskewing the clock phase position. Building on the low-power architecture of the previous chapter, this chapter discusses a more power-efficient receiver that leverages additional mixed-signal circuit techniques, including a super-harmonic injection-locked ring oscillator (super-harmonic ILO), a lower operating supply voltage, and a higher demultiplexing ratio.

## 4.1 Receiver Implementation

Fig. 4.1 shows the architecture of the entire forwarded-clock receiver testchip. A half-rate clock source (4GHz) is forwarded and independently deskewed at each of the three data receivers, with the number of parallel receivers limited by the pad number, die area and driving strength of the clock buffer. The jitter of the forwarded clock and that of the received data are correlated, since they come from the same clock source on the transmitter side. A half-rate forwarded clock is desired in order to match the Nyquist frequency of the data channels, such that the data channels experience a loss/phase delay similar to the clock channel for better jitter tracking between clock and data. Since the received data is recovered from the correlated forwarded clock, jitter tolerance on the receiver side can be improved.

Figure 4.1. Proposed receiver architecture.

The received forwarded clock drives an on-chip CML clock buffer with cross-coupled resistor feedback, in order to extend its bandwidth (Fig. 4.2). Operating with a 1V supply, this CML buffer delivers a large-swing 4GHz clock waveform across a 600-um long clock distribution to the respective super-harmonic ILO in each receiver for multi-phase generation and deskewing.

Figure 4.2. Schematic of global clock buffer with parasitics.
For the data path, two prototypes (RX1 and RX2) are realized. As shown in Fig. 4.3(a), for RX1, the received 8Gb/s data is first fed to a conventional source-degenerated continuous-time linear equalizer (CTLE) [2], and then directly sampled and demultiplexed by ten deskewed phases from the super-harmonic ILO (discussed in detail in the following section) into ten-way 800Mb/s recovered data outputs. Finally, these outputs are multiplexed out for test purposes to save pads. In order to maximize the timing margin for the quantizers, sample-and-hold (S/H) circuits are employed in front of each quantizer in RX2, as shown in Fig. 4.3(b). Bootstrapped switches [72] are used in the S/H to obtain lower on-resistance and minimal signal-dependent distortion. Following the main switch, a conventional dummy switch is driven by the complementary clock to minimize clock feedthrough and charge injection.

Figure 4.3. Block diagram of the receiver data lane (a) RX1 and (b) RX2.

In order to minimize power consumption, the super-harmonic ILO, CTLE, and quantizer circuits are designed to operate at a 0.6V supply. A two-stage sense amplifier with only three stacks [68], similar to the one described in Chapter 3, is also used here as the quantizer for low supply operation. While sub/near-threshold operation can enable significant improvement in power consumption and has thus become popular for digital circuits, two disadvantages prevent its wide use in serial link applications: low operating frequency and worsened process variation. To address the low transistor speed, a highly parallel architecture using 1:10 demultiplexing is chosen at the system level, such that the sampling clock of each sub-lane quantizer can operate at a much lower frequency. To mitigate potential process variation, extensive digital trimming bits are utilized throughout the entire receiver (i.e. quantizer offset calibration, oscillator frequency and phase deskew tuning). These calibrations are done at startup.
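The clock-rate relaxation obtained from the 1:10 demultiplexing described above can be checked with a few lines of arithmetic. This is a minimal sketch using the 8Gb/s data rate, the 1:10 ratio and the 4GHz forwarded clock quoted in this section; it assumes each quantizer takes one sample per cycle of its own clock phase, and the variable names are illustrative.

```python
DATA_RATE_BPS = 8e9        # input data rate
DEMUX_RATIO = 10           # 1:10 direct demultiplexing
FORWARDED_CLOCK_HZ = 4e9   # half-rate forwarded clock

ui_ps = 1e12 / DATA_RATE_BPS                       # unit interval in ps
sub_lane_rate_bps = DATA_RATE_BPS / DEMUX_RATIO    # rate of each recovered stream
sampling_clock_hz = sub_lane_rate_bps              # assumed: one sample per clock cycle
phase_spacing_ps = 1e12 / (DEMUX_RATIO * sampling_clock_hz)  # spacing of ten even phases
clock_ratio = FORWARDED_CLOCK_HZ / sampling_clock_hz         # forwarded clock vs. sampling clock

if __name__ == "__main__":
    print(f"UI                  : {ui_ps:.1f} ps")
    print(f"Sub-lane data rate  : {sub_lane_rate_bps / 1e6:.0f} Mb/s")
    print(f"Sampling clock      : {sampling_clock_hz / 1e6:.0f} MHz")
    print(f"Phase spacing       : {phase_spacing_ps:.1f} ps")
    print(f"Forwarded/sampling  : {clock_ratio:.0f}x")
```

Under these assumptions, ten evenly spaced phases of the sampling clock are one UI apart, so consecutive quantizers sample consecutive bits while each runs at one tenth of the bit rate.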
# 4.2 Design of Super-harmonic ILO

The schematic of the proposed near-threshold super-harmonic ILO is shown in Fig. 4.4. It generates ten evenly-spaced phases (P[0] - P[9]) using five-stage differential delay cells, with a free-running frequency of around 800MHz. Negatively-skewed phase interpolation [73] is exploited to increase the ring oscillator frequency with a 0.6V supply. The ring incorporates three sources of frequency control: supply voltage, 40-b thermometer-encoded current starving for fine tuning, and a DC-biased PMOS load (V<sub>c</sub>) for coarse tuning.

Figure 4.4. Schematic of super-harmonic ILO.

Since the differential half-rate (4GHz) clock source is now injected into the common source nodes (CSP and CSN) of the oscillator instead of directly loading any output phases, the proposed super-harmonic ILO relieves the problem of asymmetric injection and adjacent static phase error caused by different capacitance loading in the 1st-harmonic injection-locked ring oscillators of Chapter 3. Following the principle of 1st-harmonic injection locking [49], [51], the frequency difference between the Nth sub-harmonic of the injection clock and the free-running frequency of the super-harmonic ILO results in phase deskew when locked (N=5 for this design). Therefore, the 40-b fine frequency digital tuning bits can be used for deskew purposes. To further extend the deskew range to a full UI, inversion-mode PMOS varactors are used for coarse deskew tuning (V<sub>d</sub>) by adjusting the capacitance loading of the branches external to the oscillator (Fig. 4.4). Once V<sub>d</sub> is set to roughly cover the phase difference between clock and data, the digitally controlled fine tuning further deskews the phase.

Fig. 4.5 shows the simplified model of each stage of this super-harmonic ILO. Taking the 2nd stage as an example, clock phases P[4] and P[5] are first interpolated due to the negatively-skewed phase technique used here. The nonlinear function f(e) generates multiple harmonic products from the injection signal and the interpolated phase. They are then filtered by the transfer function $H(\omega)$ of the delay stage [74]. The single-sided locking range $\omega_{SL}$ can be estimated as

$$\omega_{SL} = \eta \cdot \alpha_N \cdot \frac{2\omega_0}{N\sin(2\pi/N)} V_{inj} \tag{4.1}$$

where $\eta$ is the injection efficiency, $\alpha_N$ is the N-th harmonic coefficient, N is also the number of stages, $\omega_0$ is the free-running frequency of the oscillator, and $V_{inj}$ is the amplitude of the injection signal [75].

Figure 4.5. Model of single stage of super-harmonic ILO.

In order to compensate for any potential phase imbalance due to layout mismatch, a 4-b switched capacitor bank on each phase is incorporated for individual phase trimming, with a measured resolution of 3-5ps. A scan-chain feedback loop runs at startup to adjust the phase spacing, using a histogram calibration algorithm [66]. The calibrated ten phases are then used to demultiplex the input data.

# 4.3 Experimental Results

The 1mm x 1mm testchip (Fig. 4.6) has been fabricated in a 65nm, 1V CMOS process. It contains three receivers (two RX1 and one RX2), the global clock distribution network, and a stand-alone super-harmonic ILO for test purposes.

Figure 4.6. (a) Die photo, and layout screen capture of (b) RX1 and (c) RX2.

First, the performance of the super-harmonic ILO is examined. As shown in Fig. 4.7, by changing the fine tuning settings of the super-harmonic ILO alone, a 48ps deskew range with 1-3ps step resolution is achieved. Coarse deskew tuning, by changing the varactor control voltage $V_d$, provides another 82ps of deskew, with enough margin between adjacent traces. Therefore the proposed receiver can cover the full UI (125ps for 8Gb/s) without a dead zone. For clarity, Fig. 4.8 shows the deskewed clock edges overlaid on the oscilloscope when only the fine tuning bits are changed.
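The full-UI coverage claim above can be sanity-checked from the quoted numbers. The sketch below treats the 48ps fine range and the 82ps coarse range as simply additive, which is an assumption made here: the text notes that adjacent coarse traces keep some margin (overlap), so the sum is an upper bound on the usable range rather than a measured figure.

```python
UI_PS = 1e12 / 8e9          # unit interval at 8Gb/s, in ps
FINE_RANGE_PS = 48.0        # deskew from fine tuning alone
COARSE_RANGE_PS = 82.0      # additional deskew from varactor control V_d

# Assumed additive combination (upper bound, ignores overlap between traces).
combined_range_ps = FINE_RANGE_PS + COARSE_RANGE_PS
margin_ps = combined_range_ps - UI_PS
covers_full_ui = combined_range_ps >= UI_PS

if __name__ == "__main__":
    print(f"UI = {UI_PS:.0f} ps, combined deskew <= {combined_range_ps:.0f} ps, "
          f"margin <= {margin_ps:.0f} ps, covers full UI: {covers_full_ui}")
```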
When injecting a 4GHz clock with 800fs RMS jitter from the signal generator, the super-harmonic ILO clock output exhibits jitter from 3.8ps RMS to 4.6ps RMS at 800MHz across the deskew settings, as shown in Fig. 4.9(a). Fig. 4.9(b) shows one of the jitter histogram measurements. The measured locking range (obtained by changing the free-running frequency) is from 40 to 78MHz depending on the 3-b amplitude setting of the global clock buffer, which follows the trend of Eq. (4.1).

Figure 4.7. Measured deskew range and free-running frequency of super-harmonic ILO across fine freq tuning.

Figure 4.8. Overlaid waveform of clock rising edge by changing fine tuning alone (a) with oscilloscope average mode on for clarity and (b) with grade color mode on.

Figure 4.9. (a) RMS jitter of super-harmonic ILO output across fine tuning settings, and (b) one of zoomed jitter measurement.

Due to the bandwidth limitations of the external signal generator, jitter modulation of the 4GHz clock source can only be introduced up to 40MHz. Fig. 4.10 shows the measured output jitter induced by input 20MHz and 30MHz deviation respectively. As no attenuation is observed for sine-wave modulation up to this point, the bandwidth of the super-harmonic ILO jitter transfer is larger than 40MHz.

Figure 4.10. Output of SH-ILO after phase modulating 4G clock source by (a) 20MHz deviation and (b) 30MHz deviation.

Fig. 4.11 and Fig. 4.12 show the measured recovered eye diagrams of the 1:10 demultiplexed 800Mb/s data, and the measured bathtub curves at 8Gb/s with a 2<sup>7</sup>-1 PRBS data input across a 20cm FR4 PCB trace (~-9.7dB channel loss @ 4GHz, as shown in Fig. 4.13) for the two receivers, with CTLE equalization turned on or off. As expected, RX2 (the one with S/Hs) exhibits a little more timing margin. Table 4.1 presents the power breakdown of the two receiver prototypes. RX2 consumes more power than RX1, due to its ten S/Hs and the additional loading of the local clock buffers.
In total, RX1 and RX2 consume 1.3mW and 2mW respectively, which equals 0.163mW/Gb/s and 0.25mW/Gb/s at 8Gb/s. The measured results are summarized and compared to previous designs in Table 4.2.

Figure 4.11. RX1: (a) 800Mb/s 1:10 recovered data output (x=250ps/div, y=100mV/div), (b) BER bathtub curve at 8Gb/s over 20cm FR4.

Figure 4.12. RX2: (a) 800Mb/s 1:10 recovered data output (x=200ps/div, y=100mV/div), (b) BER bathtub curve at 8Gb/s over 20cm FR4.

Figure 4.13. Channel response of a 20cm FR4 PCB trace.

TABLE 4.1 Power Breakdown

| | Unit: mW | RX1 | RX2 |
|---|---|---|---|
| 0.6V supply | CTLE | 0.25 | 0.25 |
| 0.6V supply | 10 S/Hs | N/A | 0.55 |
| 0.6V supply | Quantizers, latches, local clock and data buffers, and others | 0.4 | 0.52 |
| 0.6V supply | Super-harmonic ILO | 0.24 | 0.25 |
| 1V supply | Amortized global clock buffer and bias | 0.41 | 0.41 |
| Total Power | | 1.3 | 1.98 |

# 4.4 Summary

A receiver architecture with a super-harmonic injection-locked oscillator is proposed. The super-harmonic ILO introduces less imbalance to the output phases than 1st-harmonic ILOs, and is also used to provide a full UI of deskew. The receiver uses a 1:10 demultiplexing ratio to ensure its operation under a low supply voltage of 0.6V. This choice of supply voltage, together with the use of the super-harmonic ILO, greatly improves the power efficiency. Measurement results show that one receiver prototype consumes 0.163mW/Gb/s at 8Gb/s, and the other receiver prototype, with a similar architecture and a sample-and-hold before each quantizer, consumes 0.25mW/Gb/s at the same data rate. Both designs show significantly better energy efficiency compared to previous similar designs.
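The totals and efficiencies quoted in this section can be reproduced from the entries of Table 4.1. In the sketch below, the 0.41mW global clock buffer entry, which appears once in the table, is assumed to apply to both receivers; this assumption is consistent with the reported totals. The dictionary keys are shorthand labels chosen here.

```python
DATA_RATE_GBPS = 8.0

# Power entries in mW, taken from Table 4.1. RX1 has no S/Hs, so that entry is 0.
POWER_MW = {
    "RX1": {"ctle": 0.25, "sample_hold": 0.0, "quantizers_etc": 0.4,
            "sh_ilo": 0.24, "global_clock": 0.41},
    "RX2": {"ctle": 0.25, "sample_hold": 0.55, "quantizers_etc": 0.52,
            "sh_ilo": 0.25, "global_clock": 0.41},
}

# Round to 0.01mW to remove floating-point noise from the summation.
total_mw = {rx: round(sum(parts.values()), 2) for rx, parts in POWER_MW.items()}
efficiency_mw_per_gbps = {rx: p / DATA_RATE_GBPS for rx, p in total_mw.items()}

if __name__ == "__main__":
    for rx in POWER_MW:
        print(f"{rx}: total = {total_mw[rx]:.2f} mW, "
              f"efficiency = {efficiency_mw_per_gbps[rx]:.4f} mW/Gb/s")
```

The RX2 efficiency computed from the 1.98mW table total is slightly below the 0.25mW/Gb/s quoted in the text, which uses the rounded 2mW figure.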
TABLE 4.2 Comparison with Previous Works

| | [2] | Design in Chapter 3 | [76] | This work (RX1) | This work (RX2) |
|---|---|---|---|---|---|
| Data rate | 6.25Gb/s | 7.2Gb/s | 7.4Gb/s | 8Gb/s | 8Gb/s |
| Architecture | Software CDR | Forwarded CK | Forwarded CK | Forwarded CK | Forwarded CK |
| Phase deskew method | PLL with PI | Ring-ILO | ILO | Super-harmonic ILO | Super-harmonic ILO |
| Technology | 90nm CMOS | 90nm CMOS | 65nm CMOS | 65nm CMOS | 65nm CMOS |
| RX power | 8.22mW | 4.3mW | 6.8mW | 1.3mW | 1.98mW |
| Power efficiency | 1.31 mW/Gb/s | 0.6 mW/Gb/s | 0.92 mW/Gb/s | 0.163 mW/Gb/s | 0.25 mW/Gb/s |
| RX area | 0.153mm<sup>2</sup> | 0.017mm<sup>2</sup> | 0.03mm<sup>2</sup> | 0.014mm<sup>2</sup> | 0.018mm<sup>2</sup> |

## **Chapter 5. CONCLUSION**

### 5.1 Summary

Serial links will continue to provide higher and higher I/O bandwidth for applications such as microprocessor systems. At the same time, power efficiency must improve on a scale similar to bandwidth so that the total power meets tight power budgets and thermal requirements.

This dissertation first analyzes link performance at the system level using a statistical approach. This approach gives quick insight into important factors in link design such as equalization, receiver and transmitter jitter, finite rise/fall time, and duty cycle variation.
Based on this approach, a comparison of NRZ and duobinary is made. Although duobinary outperforms NRZ in high-loss channels, in moderate channels simple NRZ is more immune to clock non-idealities. On the clock path, five clock distribution methods are investigated and compared in terms of jitter performance, propagation delay and power consumption.

Two test chips were implemented in 90nm and 65nm CMOS processes to demonstrate low-power serial link receivers. The first one demonstrates that power can be saved by using a local ILRO to obtain full UI deskew, replacing the power-hungry phase interpolator. A modified analysis technique is proposed to accurately predict the performance of the ILRO. A second test chip incorporates a super-harmonic ILO, near-threshold operation and high parallelism to further improve the power efficiency from 0.6mW/Gb/s down to 0.163mW/Gb/s. Design issues involved in these receivers have also been discussed. Together, they demonstrate that power efficiency can keep pace with the scaling of total bandwidth with the help of CMOS process and supply scaling, link parallelism, and potential innovations in system and circuit structures.

#### 5.2 Recommendations for Future Work

There are several practical considerations of this research that require further investigation. Probably the most important is the issue of PVT variation in near-threshold design. PVT variation can be compensated by a start-up or periodic offline calibration procedure. In practice, however, it is more convenient to run the calibration in the background with no interruption of normal link operation. One possible strategy is to compensate process variation first at startup, and then use a background logic loop to monitor possible voltage and temperature variation.
This logic can in fact be included in the clock recovery logic, because when voltage and temperature induce a change in VCO frequency and sampling phase position, the clock recovery loop will naturally try to move the phase back to the center of the data.

Second, a corresponding transmitter is necessary to pair with this receiver to complete the transceiver design. To save power on the transmitter side, a voltage-mode driver is preferred. However, how to realize FFE efficiently in the voltage domain needs further investigation. To reduce the complexity of equalization, it is possible for low-loss channels to eliminate FFE and use only a CTLE or DFE on the receiver side to remove ISI. This relaxes the transmitter design and also saves its power. A dynamic power scheme can be used to optimize the power division between transmitter and receiver. For example, when the transmitter sends data at a lower swing to save power, the receiver may have to burn more power to increase its sensitivity and reach a given BER, and vice versa. This trade-off is also non-linear: if the transmitter data swing is near zero, no amount of power spent on the receiver side can recover the data. Therefore, depending on the type of channel and the amount of crosstalk in the environment, there will be an optimum power breakdown between transmitter and receiver that gives the best combined power efficiency.

### **Bibliography**

- [1] International Roadmap Committee (IRC), International Technology Roadmap for Semiconductors, 2009 edition [Online]. Available: http://www.itrs.net/Links/2009ITRS/Home2009.htm.
- [2] J. Poulton, R. Palmer, A. M. Fuller *et al.*, "A 14-mW 6.25-Gb/s transceiver in 90-nm CMOS," *IEEE J. Solid-State Circuits*, vol. 42, no. 12, pp. 2745-2757, Dec. 2007.
- [3] G. Balamurugan, J. Kennedy, G. Banerjee *et al.*, "A scalable 5-15Gbps, 14-75mW low power I/O transceiver in 65nm CMOS," in *Symp. VLSI Circuits Dig.*, Jun. 2007, pp. 270-271.
- [4] K.
Fukuda, H. Yamashita, F. Yuki *et al.*, "An 8Gb/s transceiver with 3x-oversampling 2-threshold eye-tracking CDR circuit for -36.8dB-loss backplane," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2008, pp. 98-99. - [5] J. Lee, M.-S. Chen and H.-D. Wang, "A 20Gb/s duobinary transceiver in 90nm CMOS," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2008, pp. 102-103. - [6] S. Gupta, J. Tellado, S. Begur *et al.*, "A 10Gb/s IEEE 802.3an-compliant Ethernet transceiver for 100m UTP cable in 0.13um CMOS," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2008, pp. 106-107. - [7] S. Goswami, T. Copani, A. Jain et al., "A 96Gb/s-throughput transceiver for short-distance parallel optical links," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2008, pp. 230-231. - [8] K. Chang, H. Lee, J.-H. Chun *et al.*, "A 16Gb/s/link, 64GB/s bidirectional asymmetric memory interface cell," ," in *Symp. VLSI Circuits Dig.*, Jun. 2008, pp. 126-127. - [9] N. Nguyen, Y. Frans, B. Leibowitz *et al.*, "A 16-Gb/s differential I/O cell with 380fs RJ in an emulated 40nm DRAM process" in *Symp. VLSI Circuits Dig.*, Jun. 2008, pp. 128-129. - [10] J.-K. Kim, J. Kim, G. Kim *et al.*, "A 40-Gb/s transceiver in 0.13-μm CMOS technology," in *Symp. VLSI Circuits Dig.*, Jun. 2008, pp. 196-197. - [11] J. Nasrullah, A. Amin, W. Ahmadin *et al.*, "A TeraBit/s-throughput, SerDesbased interface for a third-generation 16 Core 32 thread chip-multithreading SPARC processor," in *Symp. VLSI Circuits Dig.*, Jun. 2008, pp. 200-201. - [12] A. Hayashi, M. Kuwata, K. Suzuki *et al.*, "A 21-Channel 8Gb/s transceiver macro with 3.6ns latency in 90nm CMOS for 80cm backplane communication," in *Symp. VLSI Circuits Dig.*, Jun. 2008, pp. 202-203. - [13] Y. Hidaka, W. Gai, T. Horie *et al.*, "A 4-Channel 10.3Gb/s backplane transceiver macro with 35dB equalizer and sign-based zero-forcing adaptive control," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2009, pp. 188-189. - [14] Y. Amamiya, S. Kaeriyama, H. 
Noguchi *et al.*, "A 40Gb/s multi-data-rate CMOS transceiver chipset with SFI-5 interface for optical transmission Systems," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2009, pp. 358-359. - [15] H. Wang, C.-C. Lee, A.-M. Lee and J. Lee, "A 21-Gb/s 87-mW transceiver with FFE/DFE/linear equalizer in 65-nm CMOS technology," in *Symp. VLSI Circuits Dig.*, Jun. 2009, pp. 50-51. - [16] S. Joshi, J. T.-S. Liao, Y. Fan *et al.*, "A 12-Gb/s transceiver in 32-nm bulk CMOS," in *Symp. VLSI Circuits Dig.*, Jun. 2009, pp. 52-53. - [17] Y.-C. Jang, J.-Y. Park, S. Shin *et al.*, "Self-calibrating transceiver for source synchronous clocking system with on-chip TDR and swing level control scheme," in *Symp. VLSI Circuits Dig.*, Jun. 2009, pp. 54-55. - [18] R. Palmer, J. Poulton, B. Leibowitz *et al.*, "A 4.3GB/s mobile memory interface with power-efficient bandwidth scaling," in *Symp. VLSI Circuits Dig.*, Jun. 2009, pp. 136-137. - [19] F. O'Mahony, J. Kennedy, J. E. Jaussi *et al.*, "A 47×10Gb/s 1.4mW/(Gb/s) parallel interface in 45nm CMOS," ," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2010, pp. 156-157. - [20] K. Maruko, T. Sugioka, H. Hayashi *et al.*, "A 1.296-to-5.184Gb/s transceiver with 2.4mW/(Gb/s) burst-mode CDR using dual-edge njection-locked oscillator," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2010, pp. 364-365. - [21] F. Spagna, L. Chen, M. Deshpande *et al.*, "A 78mW 11.8Gb/s serial link transceiver with adaptive RX equalization and baud-rate CDR in 32nm CMOS," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2010, pp. 366-367. - [22] K. Fukuda, H. Yamashita, G. Ono *et al.*, "A 12.3mW 12.5Gb/s complete transceiver in 65nm CMOS," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2010, pp. 368-369. - [23] G. Balamurugan, F. O'Mahony, M. Mansuri *et al.*, "A 5-to-25Gb/s 1.6-to-3.8mW/(Gb/s) reconfigurable transceiver in 45nm CMOS," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2010, pp. 372-373. - [24] S.-J. Bae, Y.-S. Sohn, T.-Y. 
Oh *et al.*, "A 40nm 7Gb/s/pin single-ended transceiver with jitter and ISI reduction techniques for high-speed DRAM interface," in *Symp. VLSI Circuits Dig.*, Jun. 2010, pp. 193-194. - [25] N. Kocaman, A. Garg, B. Raghavan *et al.*, "11.3Gb/s CMOS SONET-compliant transceiver for both RZ and NRZ applications," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2011, pp. 142-143. - [26] M.-S. Chen, Y.-N. Shih, C.-L. Lin, H.-W. Hung and J. Lee, "A 40Gb/s TX and RX chip set in 65nm CMOS," ," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2011, pp. 146-147. - [27] S. Fukuda, Y. Hino, S. Ohashi *et al.*, "A 12.5+12.5Gb/s full-duplex plastic waveguide interconnect," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2011, pp. 150-151. - [28] R. Inti, A. Elshazly, B. Young *et al.*, "A highly digital 0.5-to-4Gb/s 1.9mW/Gb/s serial-link transceiver using current-recycling in 90nm CMOS," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2011, pp. 152-153. - [29] Y. Hidaka, T. Horie, Y. Koyanagi *et al.*, "A 4-channel 10.3Gb/s transceiver with adaptive phase equalizer for 4-to-41dB loss PCB channel," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2011, pp. 346-347. - [30] S. Quan, F. Zhong, W. Liu *et al.*, "A 1.0625-to-14.025Gb/s multimedia transceiver with full-rate source-series-terminated transmit driver and floating-tap decision-feedback equalizer in 40nm CMOS," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2011, pp. 348-349. - [31] A. K. Joy, H. Mair, H.-C. Lee *et al.*, "Analog-DFE-based 16Gb/s SerDes in 40nm CMOS that operates across 34dB loss channels at Nyquist with a baud rate CDR and 1.2Vpp voltage-mode driver," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2011, pp. 350-351. - [32] M. Ramezani, M. Abdalla, A. Shoval *et al.*, "An 8.4mW/Gb/s 4-lane 48Gb/s multi-standard-compliant transceiver in 40nm digital CMOS technology," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2011, pp. 352-353. - [33] G.-S. Byun, Y. Kim, J. 
Kim *et al.*, "An 8.4Gb/s 2.5pJ/b mobile memory I/O interface using simultaneous bidirectional dual (Base+RF) band signaling," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2011, pp. 488-490. - [34] W.-Y. Shin, G.-M. Hong, H. Lee *et al.*, "A 4.8Gb/s impedance-matched bidirectional multi-drop transceiver for high-capacity memory interface," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2011, pp. 494-495. - [35] B. Casper and F. O'Mahony, "Clocking analysis, implementation and measurement techniques for high-speed data links a tutorial," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 56, no. 1, pp. 17-39, Jan. 2009. - [36] J. Lee, M.-S. Chen and H.-D. Wang, "Design and comparison of three 20-Gb/s backplane transceivers for duobinary, PAM4 and NRZ data," *IEEE J. Solid-State Circuits*, vol. 43 no. 9, pp. 2120-2133, Sep. 2008. - [37] B. Casper, J. Jaussi, F. O'Mahony *et al.*, "A 20Gb/s forwarded clock transceiver in 90nm CMOS," *in IEEE ISSCC Dig. Tech. Papers*, Feb. 2006, pp. 90-91. - [38] JitterTime Consulting LLC, Santa Clara, CA, 2006, "BER Calculator," [Online]. Available: http://www.jittertime.com/resources/bercalc.shtml. - [39] B. K. Casper, M. Haycock and R. Mooney, "An accurate and efficient analysis method for multi-Gb/s chip-to-chip signaling schemes," in *Symp. VLSI Circuits Dig.*, Jun. 2002, pp. 54-57. - [40] P. K. Hanumolu, B. Casper, R. Mooney, G.-Y. Wei and U.-K. Moon, "Analysis of PLL clock jitter in high-speed serial links", *IEEE Trans. Circuits Syst. II, Analog and digital signal processing*, vol. 50, no. 11, pp. 879-886, Jan. 2003. - [41] V. Stojanovic and M. Horowitz, "Modeling and analysis of high-speed links," in *IEEE Proc. CICC*, Sep. 2003, pp. 589-594. - [42] B. Casper, G. Balamurugan, J. E. Jaussi et al., "Future microprocessor interfaces: analysis, design and optimization," in *IEEE Proc. CICC*, Sep. 2007, pp. 479-486. - [43] G. Balamurugan, B. Casper, J. E. Jaussi et al., "Modeling and analysis of high-speed I/O links," *IEEE Trans. Adv. 
Packag.*, vol. 32, no. 2, pp. 237-246, May 2009. - [44] EDOTRONIK, München, Germany, 2007, "Stateye," [Online]. Available: http://www.stateye.org. - [45] W. Beyene, "Modeling and analysis techniques of jitter enhancement across high-speed interconnect systems," in *Proc. IEEE Elect. Perform. Electron. Packag.*, Oct. 2007, pp. 29–32. - [46] J. G. Proakis, *Digital Communications*, 4th ed. NY: McGraw-Hill, 2001. - [47] A. Leon-Garcia, *Probability, statistics, and random processes for electrical engineering*, 3rd ed., NJ: Pearson Prentice Hall, 2008. - [48] L. Zhang, B. Ciftcioglu, M. Huang and W. Hui, "Injection-locked clocking: a new GHz clock distribution scheme," in *IEEE Proc. CICC*, Sep. 2006, pp. 785-788. - [49] F. O'Mahony, S. Shekhar, M. Mansuri *et al.*, "A 27Gb/s forwarded-clock I/O receiver using an injection-locked LC-DCO in 45nm CMOS," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2008, pp. 452-453. - [50] P. Chiang, W. J. Dally, M.-J. E. Lee *et al.*, "A 20Gb/s 0.13um CMOS serial link transmitter using an LC-PLL to directly drive the output multiplexer." *IEEE J. Solid-State Circuits*, vol. 40, no. 4, pp. 1004-1011, Apr. 2005. - [51] K. Hu, T. Jiang, J. Wang, F. O'Mahony and P. Y. Chiang, "A 0.6mW/Gbps, 6.4-8.0Gbps serial link receiver using local injection-locked ring oscillators in 90nm CMOS," in *Symp. VLSI Circuits Dig.*, Jun. 2009, pp. 46-47. - [52] A. Agrawal, P. K. Hanumolu and G-Y. Wei, "A 8x5Gb/s source-synchronous receiver with clock generator phase error correction," in *IEEE Proc. CICC*, Sep.2008, pp. 459-462. - [53] Z. Xu and K. L. Shepard, "Low-jitter active deskewing through injection-locked resonant clocking," in *IEEE Proc. CICC*, Sep.2007, pp. 9-12. - [54] M. Hossain, A. C. Carusone, "CMOS oscillators for clock distribution and injection-locked deskew," *IEEE J. Solid-State Circuits*, vol. 44, no. 8, pp. 2138-2153, Aug. 2009. - [55] R. Alder, "A study of locking phenomena in oscillators," Proc. IRE, vol.34, pp. 
351-356, June 1946, reprinted in *Proc. IEEE*, vol. 61, pp. 1380-1385, Oct. 1973. - [56] T. H. Lee and A. Hajimiri, "Oscillator phase noise: a tutorial," *IEEE J. Solid-State Circuits*, vol. 35, no. 3, pp. 326-336, Mar. 2004. - [57] B. Razavi, *RF Microelectronics*. NJ: Prentice Hall, 1998, ch.7-ch.8. - [58] L. J. Paciorek, "Injection locking of oscillators," *Proc. IEEE*, vol. 53no.11, pp. 1723-1728, Nov. 1965. - [59] B. Razavi, "A study of injection locking and pulling in oscillators," *IEEE J. Solid-State Circuits*, vol. 39, no. 9, pp. 1415-1424, Sep. 2004. - [60] N. Lanka, S. Patnaik, R. Harjani, "Understanding the behavior of injection locked LC oscillators," in *IEEE Proc. CICC*, Sep.2007, pp. 667-670. - [61] G. R. Gangasani, P. R. Kinget, "Time-domain model for injection locking in nonharmonic oscillators," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 55, no. 6, pp. 1648-1658, July 2008. - [62] X. Lai, J. Roychowdhury, "Analytical equations for predicting injection locking in LC and ring oscillators," in *IEEE Proc. CICC*, Sep. 2005, pp. 461-464. - [63] B. Razavi, "A study of phase noise in CMOS oscillators," *IEEE J. Solid-State Circuits*, vol. 31, no. 3, pp.331-343, March. 1996. - [64] A. Hajimiri, S. Limotyrakis and T. H. Lee, "Jitter and phase noise in ring oscillators," *IEEE J. Solid-State Circuits*, vol. 34, no. 6, pp. 790-804, June 1999. - [65] V. F. Kroupa, *Phase lock loops and frequency synthesis*. John Wiley & Sons, Ltd, 2003, ch.1. - [66] L. Lee, D. Weinlader, and C.-K. K. Yang, "A Sub-10-ps multiphase sampling system using redundancy," *IEEE J. Solid-State Circuits*, vol. 41, no. 1, pp. 265-273, Jan. 2006. - [67] J. Wang, "Techniques for improving timing accuracy of multi-gigahertz track/hold circuits," M.S. dissertation, Dept. Elect. Eng., Oregon State Univ., Corvallis, OR, 2008. - [68] D. Schinkel, E. Mensink, E. Klumperink, E. van Tuijl and B. 
Nauta, "A double-tail latch-type voltage sense amplifier with 18ps setup+hold time," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2007, pp. 314-315. - [69] J. Kim and M. A. Horowitz; "Adaptive supply serial links with sub-1-V operation and per-pin clock recovery," *IEEE J. Solid-State Circuits*, vol. 37, no. 11, pp. 1403-1413, Nov. 2002. - [70] J. M. Rabaey, A. Chandrakasan and B. Nikolic, *Digital Integrated Circuits: A Design Perspective*, 2nd ed., Prentice Hall, 2003. - [71] R. Ho et al., "High speed and low energy capacitivly driven on-chip wires," *IEEE J. Solid-state Circuits*, vol.43, no. 1, pp. 52-60, Jan. 2008. - [72] M. Dessouky and A. Kaiser, "Very low-voltage digital-audio $\Delta\Sigma$ modulator with 88-dB dynamic range using local switch bootstrapping," *IEEE J. Solid-State Circuits*, vol. 36, no. 3, pp. 349-355, Mar. 2001. - [73] S.-J. Lee *et al.*, "A novel high-speed ring oscillator for multiphase clock generation using negative skewed delay scheme," *IEEE J. Solid-State Circuits*, vol. 32, no. 2, pp. 289-291, Feb. 1997. - [74] J. Hu and B. Otis, "A 3uW, 400 MHz Divide-by-5 Injection-Locked Frequency Divider with 56% Lock Range in 90nm CMOS," *IEEE Radio Frequency Integrated Circuit (RFIC) Symposium*, June 2008, pp. 665-668. - [75] W.-Z. Chen and C.-L. Kuo, "18GHz and 7 GHz superharmonic injection-loced dividers in 0.25 um CMOS technology," *Proc. of ESSCIRC*, Sep. 2002, pp. 89-92. - [76] M. Hossain and A. C. Carusone, "A 6.8mW 7.4Gb/s clock-forwarded receiver with up to 300MHz jitter tracking in 65nm CMOS," *IEEE ISSCC Dig. Tech. Papers*, Feb. 2010, pp. 158-159.
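As a rough illustration of the transmitter/receiver power trade-off discussed in Section 5.2, the Python sketch below sweeps the transmit swing and picks the value that minimizes total link power. It is a toy model, not a result from the test chips: the quadratic transmitter power law, the receiver power law that diverges near a noise/crosstalk floor, and every constant (`k_tx`, `k_rx`, `channel_gain`, `floor_v`) are assumptions chosen only to show why an interior optimum can exist.

```python
import math


def tx_power_mw(swing_v, k_tx=10.0):
    """Toy transmitter model (assumption): power grows with swing squared."""
    return k_tx * swing_v ** 2


def rx_power_mw(swing_v, channel_gain=0.5, floor_v=0.02, k_rx=0.01):
    """Toy receiver model (assumption): the power needed to reach the target
    BER rises sharply as the received swing approaches the noise/crosstalk
    floor, and no amount of power helps at or below that floor."""
    margin = swing_v * channel_gain - floor_v
    if margin <= 0:
        return math.inf
    return k_rx / margin ** 2


def best_split(swings):
    """Return (swing, tx_mw, rx_mw) for the swing with minimum total power."""
    best = min(swings, key=lambda v: tx_power_mw(v) + rx_power_mw(v))
    return best, tx_power_mw(best), rx_power_mw(best)


if __name__ == "__main__":
    swings = [0.01 * i for i in range(1, 101)]  # hypothetical 10 mV to 1 V sweep
    swing, tx_mw, rx_mw = best_split(swings)
    print(f"swing={swing:.2f} V  TX={tx_mw:.2f} mW  RX={rx_mw:.2f} mW")
```

With these hypothetical parameters the minimum falls strictly inside the sweep range: a very small swing makes the receiver cost unbounded, while a very large swing wastes transmitter power. A lossier channel or a higher crosstalk floor (smaller `channel_gain`, larger `floor_v`) moves the optimum toward larger swings.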