Energy-efficient wireline transceivers by Shu, Guanghua
© 2016 Guanghua Shu
ENERGY-EFFICIENT WIRELINE TRANSCEIVERS
BY
GUANGHUA SHU
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2016
Urbana, Illinois
Doctoral Committee:
Associate Professor Pavan Kumar Hanumolu, Chair
Associate Professor Rakesh Kumar
Professor José Schutt-Ainé
Professor Naresh Shanbhag
ABSTRACT
Power-efficient wireline transceivers are in high demand for many
applications in high-performance computation and communication systems.
Apart from supporting a wide range of data rates to satisfy the interconnect
bandwidth requirement, the transceivers have a very tight power budget and
are expected to be fully integrated. This thesis explores enabling techniques
to implement such transceivers at both the circuit and system levels.
Specifically, three prototypes are presented: (1) a 5Gb/s reference-less clock
and data recovery (CDR) circuit that uses a phase-rotating phase-locked loop
(PRPLL) for phase control so as to break several fundamental trade-offs in
conventional receivers; (2) a 4-10.5Gb/s continuous-rate CDR with a novel
frequency acquisition scheme based on a bang-bang phase detector (BBPD) and
a ring oscillator-based fractional-N PLL as the low-noise, wide-range DCO in
the CDR loop; (3) a source-synchronous energy-proportional link with dynamic
voltage and frequency scaling (DVFS) and rapid on/off (ROO) techniques
to cut link power wastage at the system level. The receiver/transceiver
architectures are highly digital and address the requirements of new receiver
architecture development, wide operating range, and low power/area
consumption while being fully integrated. Experimental results obtained from
the prototypes attest to the effectiveness of the proposed techniques.
ACKNOWLEDGMENTS
It is my great fortune to meet so many people who have made my Ph.D.
journey challenging, rewarding, and memorable. I give them my heartfelt
thanks for their contributions to my research and personal growth.
First and foremost, my gratitude to my advisor, Prof. Pavan Kumar Hanumolu,
comes from the bottom of my heart, for his excellent guidance,
encouragement, and patience. His vision in the field of integrated circuits keeps
my eyes open, and his passion for research will always inspire me. Through
the years, I have also been trying to pick up a little of his great ability to
explain new concepts with clarity. I consider myself truly fortunate to work
with him, the best research advisor one can wish for. I hope I can always
count on his advice and friendship in the future.
I would also like to extend my sincerest thanks to Prof. Rakesh Kumar,
Prof. José Schutt-Ainé, and Prof. Naresh Shanbhag for being on my doctoral
committee. Their feedback has helped to extend my research horizon and
improved the quality of this thesis. I am also indebted to Prof. Un-Ku
Moon at Oregon State University, not only for his insightful lectures on
circuit designs, but also for his kind support at various stages of my Ph.D.
study. Thanks are also due to Prof. Franke and Laurie Fisher for providing
a smooth transfer from Oregon State University to the University of Illinois.
I am also grateful to Rachel Glasa, and subsequently Jennifer Summers, for
their assistance in all kinds of administrative work, and James Hutchinson,
for his excellent editorial help to improve the thesis quality.
I am also greatly indebted for all the help and mentorship I have re-
ceived from: John Bulzacchelli (IBM), Mounir Meghelli (IBM), Dan Fried-
man (IBM), Jack Kenney (ADI), Ken Chang (Xilinx), Yong Liu (Broadcom),
Jon Proesel (IBM), Tod Dickson (IBM), Alexander Rylyakov (Coriant),
Mohamed Elzeftawi (Samsung), Jafar Savoj (Apple), and Ganesh Balamurugan
(Intel). Their help and advice kept my eyes open about the research and
development in industry, which is of great benefit to my graduate research.
Another important part of this journey is the great friends I have made
over these years. They not only introduced great fun into my school life, but
also played an essential role in my graduate research. Yue Hu, Xin Meng,
Rui Bai, Yichen Zhao, Hui Guo, Luyang Yang, and Xun Sun helped me
settle in the US smoothly, and made excellent company for my two years’
stay in Corvallis. Sincere thanks to senior group members Amr Elshazly,
Rajesh Inti, Qadeer Khan, Sachin Rao, Karthik Reddy, Wenjing Yin, and
Brian Young for their guidance at the early stage of my Ph.D. study and for
being forthcoming with their advice even after their graduations.
Ahmed Elkholy, Mrunmay Talegaonkar, Romesh Nandwana, Saurabh Sax-
ena, Seong-Joong Kim, Tejasvi Anand, and Woo-Seok Choi carried on the
journey to the University of Illinois; they are really brilliant and enthusiastic
colleagues and friends (including Praveen Prabha, who missed the party in
Illinois). These five years will always remain deep in my memory, be it the
late nights when we were fighting to catch up with the tape-out shuttle, or
the short sneak-out during the busy conference schedule. Especially, I want
to thank Woo-Seok Choi, Saurabh Saxena, and Mrunmay Talegaonkar. They
have been perfect intellectual walls off which to bounce all kinds of different
ideas. With a strong background in mathematics, Woo-Seok always finds the
insightful abstraction for the problems and often brings the ideas to a new level
or disproves them before it gets too late. Saurabh is a perfect detail-oriented
circuit designer. He always puts the discussion into circuit perspectives and
brings the ideas down to earth. Mrunmay is very knowledgeable and,
more importantly, he always gives very objective suggestions with his
insight, which has saved me many times from being biased by my own limited
view. Their help, along with that from other group members, is critical in
completing this thesis research. The excellence of Prof. Hanumolu’s group
has been preserved with many recent new members. Junheng Zhu, Dong-
wook Kim, Mostafa Ahmed, Ahmed Elmallah, Da Wei, Timir Nandi, Danniel
Coombs, Braedon Salz, Hyun-Jae Ko, and Tianyu Wang are a pleasure to
work with. I hope they will have as much fun as I did, or even more, in such
a great environment.
Last and most important, I wish to thank my family for their unconditional
love and support. My parents believe in the value of education and did their
best to provide me with the best possible education, even though it meant I needed to
study abroad far away from them. I thank my brother (Guangliang) for
constantly caring about my progress and taking the whole responsibility as
a son in the family when I am away. I am truly fortunate to have my
wife (Yanliang) with me in the US. She is so patient and considerate of my
extremely dynamic working hours, and even attempts to understand how
circuits work although she is from a totally different professional area. They
have my deepest gratitude for all their endless love and support. To them, I
dedicate this thesis.
TABLE OF CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . 2
CHAPTER 2 WIRELINE TRANSCEIVER OVERVIEW . . . . . . . 5
2.1 Transceiver Operation . . . . . . . . . . . . . . . . . . . . . . 5
2.2 CDR Performance Metrics . . . . . . . . . . . . . . . . . . . . 11
2.3 Conventional CDR Limitations . . . . . . . . . . . . . . . . . 16
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
CHAPTER 3 A REFERENCE-LESS CDR USING PHASE-ROTATING
PLL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Proposed CDR Architecture . . . . . . . . . . . . . . . . . . . 23
3.3 Circuit Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 36
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
CHAPTER 4 A CONTINUOUS-RATE DIGITAL CLOCK AND
DATA RECOVERY WITH AUTOMATIC FREQUENCY AC-
QUISITION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1 Automatic Frequency Acquisition . . . . . . . . . . . . . . . . 51
4.2 Overall CDR Architecture . . . . . . . . . . . . . . . . . . . . 55
4.3 Circuit Implementation . . . . . . . . . . . . . . . . . . . . . . 59
4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 64
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
CHAPTER 5 AN ENERGY-PROPORTIONAL SOURCE-SYNCHRONOUS
LINK WITH DVFS AND ROO TECHNIQUES . . . . . . . . . . . 76
5.1 Energy-Proportional Link with DVFS and ROO . . . . . . . . 77
5.2 Circuit Implementation . . . . . . . . . . . . . . . . . . . . . . 80
5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 90
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
CHAPTER 6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . 102
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
APPENDIX A RELIABILITY ANALYSIS OF PROPOSED FRE-
QUENCY ACQUISITION SCHEME . . . . . . . . . . . . . . . . . 104
A.1 FLL Locking Reliability with Conventional DCO . . . . . . . 106
A.2 FLL Locking Reliability with Fractional-N PLL-based DCO . 108
APPENDIX B ANALYSIS OF LINKS WITH DVFS AND ROO
TECHNIQUES USING QUEUE MODEL . . . . . . . . . . . . . . 112
B.1 Comparison between DVFS and ROO . . . . . . . . . . . . . . 115
B.2 Combine DVFS and ROO . . . . . . . . . . . . . . . . . . . . 117
B.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . 118
APPENDIX C DISCUSSION ON α-POWER LAW MODEL FOR
MOSFET AND ITS EFFECT ON DVFS . . . . . . . . . . . . . . . 124
C.1 Scaling of Supply Voltage and Data Rate . . . . . . . . . . . . 124
C.2 Scaling of Active Power of Different Circuits . . . . . . . . . 125
C.3 Power Scaling of Link Transceivers with α-Power Law Model . 126
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
LIST OF TABLES
3.1 PRPLL performance summary and comparison . . . . . . . . 40
3.2 RCK and SCK jitter versus different data sequences . . . . . . 42
3.3 Receiver performance summary and comparison . . . . . . . . 45
4.1 CDR performance summary and comparison with the state-
of-the-art designs . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.1 Transceiver performance summary and comparison with
the state-of-the-art designs . . . . . . . . . . . . . . . . . . . . 99
C.1 Power scaling of link transceiver building blocks . . . . . . . . 126
C.2 Power distribution of a source-synchronous link transceiver
@ 10Gb/s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
C.3 Power distribution of an embedded clock link transceiver
per channel @ 6.25Gb/s . . . . . . . . . . . . . . . . . . . . . 128
LIST OF FIGURES
1.1 Application scenario of wireline transceivers. . . . . . . . . . . 2
1.2 Wireline transceiver trends in last 15 years: (a) data rate,
(b) energy efficiency. . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Block diagram of a wireline transceiver. . . . . . . . . . . . . . 5
2.2 NRZ/RZ data waveforms and power spectral density for
random NRZ/RZ patterns. . . . . . . . . . . . . . . . . . . . . 7
2.3 Clocking schemes based on the relative switching rates be-
tween data (DIN) and clock (CK). . . . . . . . . . . . . . . . 8
2.4 Link classification based on clocking schemes. . . . . . . . . . 10
2.5 Recover clock and data with VCO-based CDR. . . . . . . . . . 10
2.6 Equalizer compensates for channel loss. . . . . . . . . . . . . . 11
2.7 Receiver performance consideration. . . . . . . . . . . . . . . . 12
2.8 Typical jitter transfer response. . . . . . . . . . . . . . . . . . 13
2.9 Typical jitter tolerance response. . . . . . . . . . . . . . . . . 14
2.10 Jitter contribution in data eye diagram. . . . . . . . . . . . . . 15
2.11 Transition from analog CDR to digital CDR. . . . . . . . . . . 16
2.12 Loop dynamics of VCO-based digital CDR. . . . . . . . . . . . 17
2.13 Relationship between JTRAN bandwidth and JTOL cor-
ner frequency. . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1 Phase interpolator-based sub-rate CDR. . . . . . . . . . . . . 21
3.2 Block diagram of a PRPLL. . . . . . . . . . . . . . . . . . . . 23
3.3 Evolution of the proposed CDR. . . . . . . . . . . . . . . . . . 25
3.4 Linearized phase-domain model of the proposed CDR. . . . . . 28
3.5 Detailed schematic of the proposed reference-less PRPLL-
based CDR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.6 Schematic of phase-rotating PLL with quadrant segmentation. 30
3.7 Phase-rotating process in PRPLL. . . . . . . . . . . . . . . . . 31
3.8 Schematic of combined XORPD and charge pump (XORPD-
CP). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.9 Schematic of the limiting amplifier. . . . . . . . . . . . . . . . 32
3.10 Half-rate bang-bang phase detector. . . . . . . . . . . . . . . . 33
3.11 Schematic of supply-regulated digitally controlled oscilla-
tor (DCO). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.12 Simulated CDR power supply noise rejection transfer func-
tions with and without the regulator. . . . . . . . . . . . . . . 36
3.13 Die micrograph. . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.14 Measured PRPLL phase noise plot. . . . . . . . . . . . . . . . 37
3.15 PRPLL output jitter histogram. . . . . . . . . . . . . . . . . . 38
3.16 Measured digital to phase transfer characteristics of the PRPLL. 39
3.17 Measured phase interpolation linearity (DNL and INL) of
the PRPLL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.18 Measured jitter transfer function with different gain settings. . 41
3.19 Measured JTRAN with different input jitter amplitudes. . . . 41
3.20 Measured jitter tolerance with a BER threshold of 10^-12
and PRBS7 input data. . . . . . . . . . . . . . . . . . . . . . . 42
3.21 Measured RCK and SCK jitter with PRBS31 input data:
(a) RCK jitter, (b) SCK jitter. . . . . . . . . . . . . . . . . . . 43
3.22 Measured BER as a function of supply noise amplitude at
different noise frequencies with PRBS31 input data. . . . . . . 44
3.23 Measured BER as a function of input amplitude for differ-
ent PRBS sequences. . . . . . . . . . . . . . . . . . . . . . . . 45
4.1 Block diagram of a continuous-rate CDR with automatic
frequency acquisition. . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 (a) Analog D/PLL architecture with large loop filter capac-
itor, and (b) jitter transfer (JTRAN) and jitter tolerance
(JTOL) in D/PLL. . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Operations of a bang-bang phase detector. . . . . . . . . . . . 52
4.4 Principle of proposed frequency acquisition scheme: (a)
diagram of a BBPD-based frequency locking loop, and (b)
operation of a BBPD-based frequency locking loop. . . . . . . 54
4.5 Residual frequency error dependence on transition density:
(a) w/o jitter, and (b) w/ jitter. . . . . . . . . . . . . . . . . . 56
4.6 Residue frequency error comparison between proposed scheme
and SRCG. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.7 Digital implementation of D/PLL CDR architecture. . . . . . 57
4.8 Complete schematic of the proposed continuous-rate CDR. . . 58
4.9 Schematic of the digitally controlled delay line. . . . . . . . . . 60
4.10 Schematic of ring oscillator-based fractional-N PLL as DCO. . 61
4.11 FCW synchronization from CDR to DCO. . . . . . . . . . . . 62
4.12 Schematic of the digital multiplying DLL (MDLL). . . . . . . 63
4.13 Charge pump with adaptation loop: (a) circuit schematic,
and (b) effectiveness on suppressing in-band fractional spurs. . 64
4.14 Die micrograph. . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.15 Power and area breakdowns of the CDR prototype. . . . . . . 66
4.16 Measured power spectrum of MDLL. . . . . . . . . . . . . . . 67
4.17 Measured phase noise performance of FNPLL (DCO). . . . . . 67
4.18 Measured frequency acquisition process from initial fre-
quency to 6Gb/s data rate. . . . . . . . . . . . . . . . . . . . 68
4.19 Measured frequency acquisition process with data rate switch-
ing from 6Gb/s to 9.5Gb/s. . . . . . . . . . . . . . . . . . . . 69
4.20 Measured residual frequency error versus locking threshold
NTH at different transition densities. . . . . . . . . . . . . . . 70
4.21 Measured residual frequency error versus locking threshold
NTH at different input jitter amplitudes with PRBS7 input
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.22 Measured JTRAN with different input jitter amplitudes. . . . 72
4.23 Measured jitter tolerance with PRBS7 input data at 10Gb/s
and 4Gb/s. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.24 Measured recovered clock jitter with PRBS31 input data:
(a) at 5Gb/s, and (b) at 10Gb/s. . . . . . . . . . . . . . . . . 73
5.1 Cut link power/bandwidth wastage with DVFS and ROO
techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2 Link energy efficiency with DVFS and ROO techniques. . . . . 79
5.3 Block diagram of source-synchronous link with DVFS and
rapid on/off capabilities. . . . . . . . . . . . . . . . . . . . . . 81
5.4 Wake-up process of the energy-proportional transceiver. . . . . 82
5.5 (a) Current-mode hysteretic converter. (b) Simulated line
transient response. . . . . . . . . . . . . . . . . . . . . . . . . 83
5.6 Block diagram of rapid on/off multiplying delay-locked loop
(MDLL) and timing diagram during wake-up process. . . . . . 84
5.7 (a) Source follower (SF) -based low dropout (LDO) voltage
regulator. (b) Simulated PSRR of LDO. . . . . . . . . . . . . 85
5.8 Schematic of 7-bit phase interpolator. . . . . . . . . . . . . . . 86
5.9 Block diagram of energy-proportional transmitter with 3-
tap FIR filter. . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.10 Schematic of segmented CML output driver. . . . . . . . . . . 88
5.11 Schematic and settling process of rapid replica biasing (RRB)
circuit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.12 Schematic of receiver data path. . . . . . . . . . . . . . . . . . 89
5.13 Schematic of Rx limiting amplifier with load resistor cali-
bration and offset cancellation. . . . . . . . . . . . . . . . . . . 90
5.14 Micrograph of the energy-proportional transceiver prototype. . 91
5.15 Measured transient response of DC-DC converter: w/ and
w/o shunt regulator. . . . . . . . . . . . . . . . . . . . . . . . 92
5.16 Measured power efficiency of DC-DC converter. . . . . . . . . 92
5.17 Measured settling behavior of low dropout (LDO) voltage
regulator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.18 Measured MDLL performance across different supply voltages. 93
5.19 Measured MDLL settling behavior with programmable divider. 94
5.20 Measured MDLL jitter settling during wake-up process. . . . . 95
5.21 Measured transceiver energy efficiency in DVFS mode. . . . . 95
5.22 Measured source-synchronous link bathtub curves. . . . . . . . 96
5.23 Measured power on and off process of transmitter driver. . . . 97
5.24 Measured power on and off behavior of complete link with
less than 14 ns wake up time. . . . . . . . . . . . . . . . . . . 97
5.25 Measured link power scaling capability with 500x range of
data rate (8Gb/s to 16Mb/s). . . . . . . . . . . . . . . . . . . 98
5.26 Measured link energy-proportional operation capability with
500x range of data rate (8Gb/s to 16Mb/s). . . . . . . . . . . 99
5.27 Comparison of measured transceiver on-state power and
off-state power at 8Gb/s and 3Gb/s. . . . . . . . . . . . . . . 100
A.1 Sampling instance between RCK and DIN in the presence
of random jitter. . . . . . . . . . . . . . . . . . . . . . . . . . 105
A.2 FLL locking reliability versus locking threshold NTH: (a)
one step reliability, (b) overall reliability. . . . . . . . . . . . . 107
A.3 FLL locking reliability versus period jitter: (a) one step
reliability, (b) overall reliability. . . . . . . . . . . . . . . . . . 109
A.4 FLL locking reliability versus locking threshold, NTH, with
fractional-N PLL-based DCO: (a) one step reliability, (b)
overall reliability. . . . . . . . . . . . . . . . . . . . . . . . . . 110
A.5 FLL locking reliability versus period jitter with fractional-N
PLL-based DCO: (a) one step reliability, (b) overall reliability. 111
B.1 Queue model for serial links. . . . . . . . . . . . . . . . . . . . 112
B.2 Queue model for serial links. . . . . . . . . . . . . . . . . . . . 113
B.3 Queue length versus the probability of overflow (expected
queue length is E[N]=10). . . . . . . . . . . . . . . . . . . . . 115
B.4 Expected waiting time comparison between DVFS and ROO
with different expected queue length (E[NQ]). . . . . . . . . . 116
B.5 Measured transceiver energy efficiency at different peak
data rates in DVFS mode (replot Fig. 5.21 for convenience). . 117
B.6 Energy delay product comparison between DVFS and ROO
with different expected queue length (E[NQ]). . . . . . . . . . 118
B.7 DVFS and ROO with different expected queue length (E[NQ]):
(a) ratio of expected waiting time, (b) ratio of energy delay
product. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
B.8 Expected waiting time with combined DVFS and ROO
with expected queue length E[NQ] = 1. . . . . . . . . . . . . . 120
B.9 Energy delay product with combined DVFS and ROO with
expected queue length E[NQ] = 1. . . . . . . . . . . . . . . . . 120
B.10 Simulated link energy delay product with M/M/1 queue model. 122
B.11 Simulated link performance with combined DVFS and ROO
using M/M/1 queue model: (a) normalized energy per bit,
(b) normalized energy delay product. . . . . . . . . . . . . . . 123
C.1 DVFS according to α-power law model for source-synchronous
transceiver in Chapter 5. . . . . . . . . . . . . . . . . . . . . . 125
C.2 Data rate and energy/bit scaling according to α-power law
model for source-synchronous transceiver. . . . . . . . . . . . . 127
C.3 Data rate and energy/bit scaling according to α-power law
model for embedded clock transceiver. . . . . . . . . . . . . . 128
CHAPTER 1
INTRODUCTION
1.1 Motivation
Thanks to the advancement of hardware and software technologies, gather-
ing information from all walks of life has become pervasive. The amount of
data generated has exploded exponentially, leading to the era of Big Data.
The ability to store, access, and process data determines the usefulness of
the acquired data. Memory subsystems, interconnection links, and proces-
sors perform data storage, communication, and computation, respectively.
Traditionally, energy consumed for computation has been the predominant
concern; however, with the explosion in data traffic, energy consumption
issues have been extended to the entire system. In particular, the energy
needed for data communication is becoming the bottleneck [1].
Wireline transceivers (also known as serial link transceivers) are the main
building blocks for digital data communication, as illustrated in Fig. 1.1.
They are commonly adopted to meet the data communication bandwidth
requirement in various applications, including CPU-to-CPU (or
CPU-to-peripheral) connections, network interfaces, backplanes, and optical
communication [2–5]. The achievable transceiver data rate (Gb/s), which
determines the interconnect bandwidth, is limited by the transistor speed in a
given technology, the channel bandwidth, or both. Though equalization
techniques for band-limited channels are well established, achieving a high
data rate and a low bit error rate (BER) within a tight energy efficiency
requirement (≤ 5mW/Gb/s, i.e., 5pJ/bit) continues to be a significant
challenge, and it has become the bottleneck in many complex
and fast computation and communication systems.
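The equivalence between mW/Gb/s and pJ/bit quoted above follows from dividing power by data rate: 1 mW / 1 Gb/s = 10^-3 J/s / 10^9 bit/s = 10^-12 J/bit. A minimal sketch of that arithmetic is given below; the function name and the 50 mW, 10 Gb/s operating point are hypothetical and chosen only to land on the 5 pJ/bit target.

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy efficiency in pJ/bit (numerically equal to mW/Gb/s)."""
    if data_rate_gbps <= 0:
        raise ValueError("data rate must be positive")
    # mW / (Gb/s) = 1e-3 J/s / 1e9 bit/s = 1e-12 J/bit = 1 pJ/bit
    return power_mw / data_rate_gbps

# Hypothetical operating point: a 50 mW transceiver running at 10 Gb/s.
print(energy_per_bit_pj(50.0, 10.0))  # 5.0
```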
The trends of wireline transceiver data rate and energy efficiency in Fig. 1.2
reveal this challenge. Over the last 15 years, the requirement for data
Figure 1.1: Application scenario of wireline transceivers.
rate (wireline transceiver bandwidth) has increased constantly to keep up with
the demand for data communication bandwidth (Fig. 1.2(a)). The link
energy efficiency in Fig. 1.2(b), however, is becoming increasingly difficult
to improve, especially in recent years, because the benefit from process
scaling is diminishing as the pace of technology scaling slows (denoted as the
“efficiency wall” in analogy to the “power wall” in processor design).
Therefore, both circuit- and system-level innovations are becoming increasingly
important to satisfy the demanding data communication bandwidth with
good energy efficiency, in both high-performance systems (such as data
centers and supercomputer facilities) and low-power systems (such as portable
devices and sensor nodes in the Internet of Things (IoT)).
1.2 Thesis Organization
This thesis aims to develop design techniques, at both the circuit and system
levels, to improve the link energy efficiency. At the circuit level, novel receiver
architectures are explored to break several inherent trade-offs in conventional
receivers and to extend receiver operation to a wide range of data rates within
Figure 1.2: Wireline transceiver trends in last 15 years: (a) data rate, (b)
energy efficiency.
a stringent power budget. At the system level, the thesis closely studies the
feasibility of energy-proportional links and aims to build a wireline transceiver
that can respond to the sparse data communication in many applications,
thus achieving energy proportionality over a wide range of utilization levels.
In both directions, a highly digital design philosophy is applied to leverage
the benefits of technology scaling. The thesis is organized as follows:
Chapter 2 reviews basic wireline transceiver operations, introduces various
jitter metrics of the receiver, and highlights the limitations and trade-offs in
conventional receivers.
Chapter 3 presents a highly digital receiver with a phase-rotating
phase-locked loop (PRPLL) that decouples the jitter transfer bandwidth from
the jitter tolerance corner frequency and eliminates the inherent peaking in
the jitter transfer function of the conventional receiver architecture.
As in the delay-locked/phase-locked loop (D/PLL) receiver architecture,
however, the bandwidth for oscillator phase noise suppression is reduced,
resulting in inadequate jitter performance at the recovered clock, especially
with ring oscillators. One solution to address this issue is detailed in the
next chapter.
Chapter 4 proposes a reference-less frequency acquisition scheme using a
bang-bang phase detector (BBPD), and demonstrates a digital implementation
of the D/PLL receiver that eliminates the bulky loop filter capacitor while
preserving the decoupled jitter transfer and jitter tolerance of its
analog counterpart. Furthermore, a fractional-N phase-locked loop (PLL) is
introduced as a digitally controlled oscillator (DCO) to improve the recovered
clock jitter performance, which resolves the remaining issue on clock jitter
from Chapter 3.
Chapter 5 explores the energy-proportional operation concept in serial
links, and demonstrates the first energy-proportional source-synchronous link
transceiver that combines dynamic voltage-and-frequency scaling (DVFS)
and rapid on/off (ROO) techniques with less than 14 ns exit latency.
Finally, the thesis is concluded in Chapter 6 with a summary of the con-
tributions and directions for further research.
CHAPTER 2
WIRELINE TRANSCEIVER OVERVIEW
2.1 Transceiver Operation
A basic wireline transceiver including a transmitter and a receiver is depicted
in Fig. 2.1. The transmitter (Tx) consists of four main blocks: transmitter
phase-locked loop (TxPLL), serializer, equalizer, and output driver. The
TxPLL generates a high-frequency on-chip clock using a low-frequency ex-
ternal crystal reference. The serializer multiplexes the data word input into a
serial stream using TxPLL clock output and its divided versions. The equal-
izer adds pre-emphasis to the data stream to compensate for the channel
dispersion and attenuation. The transmitter driver is responsible for driving
the high speed serializer output onto the channel.
Figure 2.1: Block diagram of a wireline transceiver.
The receiver (Rx) consists of three important blocks: the clock recovery
unit, the data samplers, and the equalizer. Usually, the clock recovery (CR)
unit and the data samplers together are referred to as the clock and data
recovery (CDR) circuit, which is the most critical component in a receiver
(receiver and CDR are used interchangeably hereafter in this thesis; the
exact meaning should be clear from the context). Based on system requirements,
deserialization might be adopted at the receiver side to provide the output
data stream at the required rate. Because of the serializer on the Tx side and
the deserializer on the Rx side, serial link transceivers are also called SerDes
systems. Similar to the Tx equalizer, the Rx equalizer helps to mitigate the effect
of channel imperfections. The basic operation of the transceivers can be
understood in four main parts: signaling, clocking, recovering, and equalizing
methods. Each part is briefly described here [6].
2.1.1 Signaling
The most widely used signaling method is the non-return-to-zero (NRZ)
format for the input data DIN. Fig. 2.2 illustrates transmitted waveforms
for a known NRZ data pattern 1001. Also shown is the waveform for the less
commonly used return-to-zero (RZ) format. Transmitting each bit requires
Tb seconds, or one unit interval (1 UI). NRZ data remains constant during the
interval, while RZ data has a 1-to-0 transition (usually at 0.5 Tb) if the
transmitted bit is 1. The reason why the NRZ format is preferred can be
better understood in the frequency domain, as shown in Fig. 2.2. Analyzing the
power spectral density (PSD) of a long binary random sequence with equal
transition density shows that the spectrum of NRZ data has its first spectral
null at 1/Tb, whereas the first null of RZ data is at 2/Tb [7, 8]. The spectra
of the NRZ and RZ data are:
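\[
S_{\mathrm{NRZ}} = T_b\left[\frac{\sin(\pi f T_b)}{\pi f T_b}\right]^2, \qquad
S_{\mathrm{RZ}} = \frac{T_b}{2}\left[\frac{\sin(0.5\pi f T_b)}{0.5\pi f T_b}\right]^2 \tag{2.1}
\]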
The wider PSD of RZ data requires a larger channel bandwidth,
thereby making NRZ the preferred format for binary data transmission. At
higher data rates (≥25Gb/s), a multi-level signaling scheme, such as PAM4,
is sometimes adopted to further confine the signal spectrum and thereby reduce
the burden of heavy equalization caused by channel impairments at high
frequencies [9, 10].
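The null locations stated for Eq. (2.1) can be checked numerically. The sketch below evaluates both spectra exactly as written in Eq. (2.1) at f = 1/Tb and f = 2/Tb; the function names and the bit period of 100 ps (10 Gb/s) are assumptions made only for illustration.

```python
import math

def sinc_sq(x):
    """Return [sin(x)/x]^2, using the limit value 1 at x = 0."""
    if x == 0.0:
        return 1.0
    return (math.sin(x) / x) ** 2

def psd_nrz(f, tb):
    """NRZ spectrum of Eq. (2.1): Tb * [sin(pi f Tb) / (pi f Tb)]^2."""
    return tb * sinc_sq(math.pi * f * tb)

def psd_rz(f, tb):
    """RZ spectrum of Eq. (2.1): (Tb/2) * [sin(0.5 pi f Tb) / (0.5 pi f Tb)]^2."""
    return (tb / 2.0) * sinc_sq(0.5 * math.pi * f * tb)

TB = 100e-12  # assumed bit period (10 Gb/s), for illustration only

for label, f in (("1/Tb", 1.0 / TB), ("2/Tb", 2.0 / TB)):
    # Normalize by Tb so the printed values are dimensionless.
    print(label, psd_nrz(f, TB) / TB, psd_rz(f, TB) / TB)
```

At 1/Tb the NRZ value collapses to numerical zero while the RZ value is still about 0.2 Tb; the RZ spectrum only reaches its first null at 2/Tb, which is the wider spread referred to above.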
2.1.2 Clocking
The link clocking scheme describes the relationship between the input data (DIN)
and the sampling clock (CK). As shown in Fig. 2.3, based on the relative
switching rates between data (DIN) and clock (CK), the majority of links operate
Figure 2.2: NRZ/RZ data waveforms and power spectral density for
random NRZ/RZ patterns.
in either a full-rate (FCK = FDIN), half-rate (FCK = FDIN/2), or quarter-rate
(FCK = FDIN/4) clocking scheme. Choosing a sub-rate clocking scheme (half
rate, quarter rate, or lower) reduces the maximum clock frequency for on-chip
distribution, which usually consumes more than 20% of the overall power,
a percentage that increases as the data rate goes higher [4]. The trade-off is
that sub-rate operation needs multiple clock phases, and achieving accurate
spacing among the phases is challenging. This is one main reason that
clocking schemes below quarter rate are not commonly used. There are also
receivers whose clock rate is higher than the data rate, which is
usually referred to as an oversampling clocking scheme [11], but these are rarely
adopted at high data rates (≥25Gb/s) due to the difficulty of high-frequency
clock generation and the excessive power for clock distribution.
Figure 2.3: Clocking schemes based on the relative switching rates between
data (DIN) and clock (CK).
Serial link transceivers can also be classified based on how the clock is
generated on the receiver (Rx) side, as shown in Fig. 2.4. If the link has a
dedicated channel to forward the clock from the Tx to the Rx side, it is
referred to as a source-synchronous (forwarded clock) link. If the Tx
transmits only data to the Rx and the Rx has no crystal reference, the scheme
is known as reference-less clocking. In such links, the receiver derives the
sampling clock from the random input data with special frequency detectors
[12, 13]. Reference-less transceivers are employed when a crystal reference
cannot be afforded on the Rx side, or when it is not practical to use a
dedicated clock channel. Repeaters are one such application, covered in more
detail in Chapters 3 and 4.
If the Rx does have a crystal reference, the link can be further classified
into two types. If the Rx uses a different crystal from the Tx, it is called a
plesiochronous (embedded clock) link. If the Rx shares the same crystal with
the Tx, the link is classified as a mesochronous link. The main difference
between the two is that plesiochronous receivers must cover the frequency
offset between the two crystals (typically measured in parts per million
(ppm)) [14].
2.1.3 Clock and Data Recovery
Clock and data recovery (CDR) is the most essential component of any
receiver. The diagram of a CDR based on a voltage-controlled oscillator (VCO)
is shown in Fig. 2.5 [15]. This CDR loop is very similar to a type-II
phase-locked loop, except that the phase detector (PD) operates on random
data DIN. Intuitively, the main task of the CDR loop is to drive the rising
edges of the VCO output, the recovered clock (RCK), to the center of the data
eye, which is the optimum sampling point for the samplers inside the PD to
retime DIN and generate the recovered data RDATA. Taking a full-rate system
as an example, to achieve optimum sampling, the negative feedback loop locks
the falling edge of RCK to the transitions of the input data DIN. Since the
rising edge is ideally 180 degrees away from the falling edge, it
automatically samples DIN at the optimum position to produce RDATA. Thus,
both the clock (RCK) and the data (RDATA) are recovered.
2.1.4 Equalization
A bandwidth-limited channel causes inter-symbol interference (ISI) [6], which
not only attenuates data amplitude but also introduces dispersion in phase
and amplifies jitter, especially at higher data rates. As shown in Fig. 2.6,
equalization is widely used to compensate for channel loss and minimize the
amplification of jitter due to ISI. Equalization can be done either in the
continuous-time domain or discrete-time domain (processing sampled data).
The goal in both domains is to approximate the reciprocal of the channel
frequency response, 1/Hch(s), using Heq(s), the frequency response of the
equalizer. In principle, equalization can be performed at the transmitter, the
receiver, or both, but the amount of equalization at the transmitter side is
usually limited by the achievable peak swing of the driver. A discussion of
various equalizer architectures and their trade-offs is presented in [6].
Figure 2.4: Link classification based on clocking schemes.
Figure 2.5: Recover clock and data with VCO-based CDR.
Figure 2.6: Equalizer compensates for channel loss.
2.2 CDR Performance Metrics
Many factors must be taken into account in CDR design, such as power
consumption, bit error rate (BER), jitter, operating range, and technology, as
shown in Fig. 2.7. BER is a high-level metric commonly used to characterize
CDR performance; a lower BER is better. In addition, CDRs are characterized
by the level of impairment (mostly on the input data and the recovered clock)
that can be tolerated while still achieving the required BER. This involves
three jitter-related metrics: jitter generation (JGEN), jitter transfer
(JTRAN), and jitter tolerance (JTOL) [16]. The three metrics are defined
below.
Figure 2.7: Receiver performance consideration.
2.2.1 Jitter Generation
Jitter generation (JGEN) evaluates the intrinsic jitter of a CDR. It is
measured as the output jitter on the recovered clock, RCK, with no jitter
present on the input data. ISI and other common impairments, apart from
jitter, should still be included in the input data. Taking the CDR in Fig. 2.5
as an example, the main contributors to JGEN include: (1) VCO phase noise;
(2) ripple on the control voltage (related to loop dynamics); (3) quantization
error in digital implementations (Fig. 2.11); (4) ISI or similar impairments
from the input data and inside the CDR; (5) supply and substrate noise.
JGEN is usually reported as a root-mean-square (RMS) jitter value. Depending
on the standard specification, a filter may be applied at the input while
measuring JGEN [17].
2.2.2 Jitter Transfer
Jitter transfer (JTRAN) quantifies the jitter magnitude at the output of a
CDR for a given amount of input jitter at different frequencies. It is
essentially the transfer function from the CDR input to its output. To measure
JTRAN, an input data sequence (usually a pseudorandom sequence) whose phase
is modulated by a sinusoidal signal at a given frequency is applied to the
CDR. The jitter on the recovered clock output is measured, and the ratio of
output jitter to input jitter across frequency gives the jitter transfer.
Generally, JTRAN exhibits a low-pass characteristic with 0 dB gain at low
frequency; a typical JTRAN is shown in Fig. 2.8. Note that jitter peaking
exists due to a zero in second- or higher-order systems. Beyond the JTRAN
bandwidth, the transfer function rolls off at a rate that depends on the order
of the CDR (20 dB/decade for second-order systems, since the zero cancels the
roll-off of one pole).
Figure 2.8: Typical jitter transfer response.
It is important to mention that the jitter transfer requirement differs from
application to application. For example, high-speed links for chip-to-chip
communication do not require specific jitter transfer performance and instead
focus on achieving sufficiently low BER, whereas in synchronous optical
network (SONET) systems, jitter transfer, especially the peaking value (≤
0.1 dB), is critical because the system must ensure that jitter does not build
up while traveling through multiple repeaters [18].
2.2.3 Jitter Tolerance
Jitter tolerance (JTOL) quantifies how much input jitter a CDR loop can
tolerate for a given BER threshold. The JTOL requirement is usually specified
as the minimum jitter amplitude, as a function of frequency, that must be
tolerated without exceeding a specified BER, shown as the JTOL mask in
Fig. 2.9. In the JTOL measurement, a pseudo-random input data sequence is
applied to the CDR, and the phase of the sequence is modulated by a sinusoidal
signal at a given frequency. The amplitude of the modulation is increased
until the measured BER exceeds the required BER level. Usually, the measured
jitter tolerance at different frequencies is compared with the JTOL mask
(Fig. 2.9) to see whether it satisfies the requirement.
Figure 2.9: Typical jitter tolerance response.
The JTOL mask is generated as follows. Fig. 2.10 shows the typical jitter
contributions within a data eye diagram, where 1UI is the overall timing
margin for the samplers, TJ stands for the CDR intrinsic jitter at a certain
BER level, and ΦE is the phase error caused by the applied sinusoidal jitter,
Φin, in the JTOL measurement.
Figure 2.10: Jitter contribution in data eye diagram.
Assuming the open-loop gain of the CDR is LG(s), the phase error is given as:
$$\Phi_E = \frac{\Phi_{in}}{1 + LG(s)} \tag{2.2}$$
In order for the CDR to meet the BER requirement, the phase error, α UI,
introduced by the input sinusoidal jitter should not exceed the available
sampling margin, which means:
$$\Phi_E = \alpha < 1 - TJ \tag{2.3}$$
Substituting Eq. (2.2) into Eq. (2.3), the JTOL mask (bottom curve in
Fig. 2.9) is given by:
$$\Phi_{in} = \alpha\,(1 + LG(s)) \tag{2.4}$$
and the measured JTOL performance (top curve in Fig. 2.9) is given by:
$$\Phi_{in} < (1 - TJ)\,(1 + LG(s)) \tag{2.5}$$
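To make Eq. (2.5) concrete, the sketch below evaluates the bound on the input jitter amplitude over frequency. It assumes a generic type-II loop gain of the form LG(s) = k_prop/s + k_int/s^2 and uses the magnitude |1 + LG(j2πf)|; the parameter values (`TJ_UI`, `K_PROP`, `K_INT`) are hypothetical and are not taken from any prototype in this work.

```python
import math

def loop_gain(s, k_prop, k_int):
    """Assumed type-II open-loop gain LG(s) = k_prop/s + k_int/s^2."""
    return k_prop / s + k_int / (s * s)

def jtol_ui(f_hz, tj_ui, k_prop, k_int):
    """Upper limit on sinusoidal jitter amplitude (in UI) from Eq. (2.5)."""
    s = complex(0.0, 2.0 * math.pi * f_hz)
    return (1.0 - tj_ui) * abs(1.0 + loop_gain(s, k_prop, k_int))

# Hypothetical loop parameters, for illustration only.
TJ_UI = 0.3                          # intrinsic jitter at the target BER (UI)
K_PROP = 2.0 * math.pi * 10e6        # proportional-path gain (rad/s)
K_INT = (2.0 * math.pi * 1e6) ** 2   # integral-path gain (rad^2/s^2)

for f in (1e4, 1e5, 1e6, 1e7, 1e8, 1e9):
    limit = jtol_ui(f, TJ_UI, K_PROP, K_INT)
    print(f"{f:12.0f} Hz  tolerable jitter <= {limit:10.2f} UI")
```

At low jitter frequencies the loop tracks the input and the tolerable amplitude is large; at high frequencies the loop gain vanishes and the bound settles at the static margin 1 − TJ.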
2.3 Conventional CDR Limitations
Although conventional analog CDRs (top of Fig. 2.11) can meet the
performance requirements of most applications, continued technology scaling
in deep-submicron CMOS processes imposes severe constraints such as current
leakage, poor analog transistor gain, low supply voltage, and process
variability. Overcoming these technology limitations in CDR designs often
incurs penalties in performance, area, power, time-to-market, and design
flexibility. For instance, the area of an analog CDR is historically large
due to the big capacitor in the loop filter. Transistor leakage in
deep-submicron technology mandates the use of metal capacitors in place of
high-density MOS capacitors, causing a more than threefold increase in
loop filter area. Moreover, in applications that require small peaking in the
jitter transfer, the loop filter capacitor is too large to be implemented on
chip, and full integration of the CDR becomes impossible [13].
Figure 2.11: Transition from analog CDR to digital CDR.
To overcome these drawbacks, digital CDRs (bottom of Fig. 2.11) are
emerging as attractive alternatives in high-speed wireline transceivers due
to their robustness to process-voltage-temperature (PVT) variations, design
flexibility, and good area and power efficiency. The key distinction is in the
implementation of the loop filter: an analog loop filter LF(s) versus a
digital loop filter LF(z). Both perform proportional and integral control to
stabilize the second-order loop. The digital loop filter, however, realizes
the function of the capacitor, which is essentially integration, with a
digital accumulator to reduce area and improve PVT robustness. Owing to the
reconfigurable nature of digital circuits, the digital loop filter also offers
more flexibility in controlling the CDR loop dynamics. The digital
implementation is also power efficient for two reasons. First, a digital
circuit can potentially operate at a low supply voltage without degrading
performance, especially in deep-submicron technology. Second, in the digital
domain, signals can be decimated to a lower rate for further processing to
reduce power consumption [13]. Of course, as with most digital circuits,
quantization error is introduced; techniques to mitigate this error are
addressed in detail later.
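As a rough behavioral illustration of the digital loop filter described above, the sketch below models proportional-plus-integral control with an accumulator standing in for the loop-filter capacitor. The class name, the gain values, and the sum-based `decimate` helper are assumptions made for this example; the decimation method used in the actual designs is not specified here.

```python
class DigitalLoopFilter:
    """Proportional-integral filter; the accumulator replaces the capacitor."""

    def __init__(self, kp, ki):
        self.kp = kp      # proportional gain
        self.ki = ki      # integral gain
        self.acc = 0.0    # integral-path state (digital accumulator)

    def update(self, pd_out):
        """Consume one phase-detector sample and return the control word."""
        self.acc += self.ki * pd_out
        return self.kp * pd_out + self.acc


def decimate(samples, factor):
    """Sum each full group of `factor` decisions (one assumed way to decimate)."""
    return [sum(samples[i:i + factor])
            for i in range(0, len(samples) - factor + 1, factor)]


lf = DigitalLoopFilter(kp=0.5, ki=0.25)
outputs = [lf.update(+1) for _ in range(4)]     # constant "late" decisions
decimated = decimate([1, 1, -1, 1, -1, -1, -1, 1], 4)
print(outputs, decimated)
```

With a constant phase-detector output, the proportional term stays fixed while the accumulator ramps, which is the digital counterpart of charging the loop-filter capacitor.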
Apart from the limitations of analog CDRs in deep-submicron processes,
both analog and digital CDRs with the conventional architecture have two main
inherent trade-offs. One is the tight coupling between jitter transfer (JTRAN)
bandwidth and jitter tolerance (JTOL) corner frequency, and the other is the
conflict between CDR jitter generation (JGEN) and JTRAN bandwidth.
This section explains both trade-offs in detail with a linear analysis of a
digital CDR based on the small-signal model in Fig. 2.12, and serves as one
motivation for the new CDR architectures.
Figure 2.12: Loop dynamics of VCO-based digital CDR.
In the first trade-off, the jitter transfer (JTRAN) bandwidth and jitter
tolerance (JTOL) corner frequency are decided by the cut-off frequency of
transfer functions HJTRAN(s) and HJTRACK(s), respectively.
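$$H_{JTRAN}(s) = \frac{\Phi_{RCK}}{\Phi_{DIN}}(s) = \frac{LG(s)}{1 + LG(s)} = \frac{s\rho K_{PD}K_{P}K_{DCO} + \rho K_{PD}K_{I}f_{ACC}K_{DCO}}{s^2 + s\rho K_{PD}K_{P}K_{DCO} + \rho K_{PD}K_{I}f_{ACC}K_{DCO}} \tag{2.6}$$
$$H_{JTRACK}(s) = \frac{\Phi_{E}}{\Phi_{DIN}}(s) = \frac{1}{1 + LG(s)} = \frac{s^2}{s^2 + s\rho K_{PD}K_{P}K_{DCO} + \rho K_{PD}K_{I}f_{ACC}K_{DCO}} \tag{2.7}$$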
where ρ is the transition density of the input data. For a heavily damped
system, both HJTRAN(s) and HJTRACK(s) have the same two real poles, located
at ωpL ≈ −KIfACC/KP and ωpH ≈ −ρKPDKPKDCO, respectively, as shown in
Fig. 2.13. In addition, HJTRAN(s) has a zero at ωz = −KIfACC/KP, which is
the reason for the inevitable peaking of the jitter transfer function in the
conventional architecture.
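The pole approximations and the peaking can be checked numerically from the denominators of Eqs. (2.6) and (2.7). The sketch below is only an illustration: all gain values are hypothetical, chosen to give a heavily damped loop, and the names `A` and `B` are shorthand introduced here for the s^1 and s^0 coefficients.

```python
import cmath
import math

# Hypothetical gains, for illustration only.
RHO = 0.5                      # transition density
K_PD = 1.0                     # phase detector gain
K_P = 1.0                      # proportional gain
K_I = 2.0 ** -10               # integral gain
F_ACC = 625e6                  # accumulator clock (Hz)
K_DCO = 2.0 * math.pi * 20e6   # DCO gain (rad/s per LSB)

A = RHO * K_PD * K_P * K_DCO           # coefficient of s in Eqs. (2.6)-(2.7)
B = RHO * K_PD * K_I * F_ACC * K_DCO   # constant term in Eqs. (2.6)-(2.7)

def h_jtran(s):
    """Jitter transfer function of Eq. (2.6)."""
    return (A * s + B) / (s * s + A * s + B)

def h_jtrack(s):
    """Jitter tracking (error) function of Eq. (2.7)."""
    return (s * s) / (s * s + A * s + B)

disc = cmath.sqrt(A * A - 4.0 * B)
pole_low = ((-A + disc) / 2.0).real    # exact low-frequency pole
pole_high = ((-A - disc) / 2.0).real   # exact high-frequency pole
approx_low = -K_I * F_ACC / K_P        # approximation quoted in the text
approx_high = -A                       # approximation for the higher pole

# Any frequency below sqrt(2*B) shows gain above 0 dB; sqrt(B) is one such point.
w_test = math.sqrt(B)
peak_db = 20.0 * math.log10(abs(h_jtran(complex(0.0, w_test))))
print(pole_low, approx_low, pole_high, approx_high, peak_db)
```

For these values the exact poles should agree with the approximations to within a few percent, and the gain at the test frequency is slightly above 0 dB, reflecting the peaking caused by the zero.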
Figure 2.13: Relationship between JTRAN bandwidth and JTOL corner
frequency.
From Fig. 2.13, it is important to note that both the JTRAN bandwidth and the
JTOL corner frequency are decided by the higher pole ωpH (JTOL is shown
as its scaled inversion, HJTRACK(s), for clear comparison). In other words,
whenever one lowers the JTRAN bandwidth to reduce the transfer of input
jitter to the output, the JTOL corner frequency is also compromised. Chapter 3
proposes a novel CDR architecture that decouples this trade-off, achieving a
low JTRAN bandwidth and a high JTOL corner frequency while also eliminating
the peaking in the jitter transfer function.
For the trade-off between jitter generation (JGEN) and jitter transfer
(JTRAN) bandwidth, the essential conflict lies in the bandwidth for filtering
the input noise and the VCO noise (the major contributor to JGEN).
Specifically, the transfer function from input noise to the output has a
low-pass characteristic, while that of the VCO noise is high-pass. The two
transfer functions have the same shapes as HJTRAN(s) and HJTRACK(s) shown in
Fig. 2.13, respectively. As before, the bandwidths of both transfer functions
are decided by ωpH, which gives rise to the trade-off.
The solution presented in Chapter 3 achieves a low JTRAN bandwidth and a
high JTOL corner frequency to break the first trade-off, and eliminates the
peaking in the jitter transfer function. However, this reduces the bandwidth
for VCO noise suppression, which is decided by the JTRAN bandwidth, and thus
degrades the JGEN of the CDR. Novel architectures that decouple both
trade-offs at the same time are left for future work.
2.4 Summary
The basic operations of wireline transceivers are described, including
signaling, clocking, clock and data recovery, and equalization. The CDR
performance metrics are then discussed in detail, with their definitions,
characterization setups, and variation across applications. A brief
comparison between analog and digital CDRs explains the limitations of
analog CDRs and how their digital counterparts overcome them. Finally, the
inherent trade-offs in conventional CDRs are discussed. These trade-offs and
the trends in transceiver energy efficiency motivate: (i) novel CDR
architectures that break the trade-offs of the conventional structure, in
Chapters 3 and 4; (ii) the concept of an energy-proportional link transceiver
that cuts link power wastage at the system level, thus improving energy
efficiency, in Chapter 5.
CHAPTER 3
A REFERENCE-LESS CDR USING
PHASE-ROTATING PLL
The receiver is a key building block in wireline communication where it per-
forms the crucial function of recovering clock and re-timing the received data.
It must recover data without errors and tolerate input jitter as quantified by
the jitter tolerance (JTOL) metric in a power- and cost-efficient manner. To
avoid the cost of a crystal oscillator needed for the CDR, frequency acqui-
sition without using a reference clock is desirable. Additionally, CDRs used
in repeater applications should have minimum peaking (≤ 0.1 dB) in the jit-
ter transfer (JTRAN) function, and must satisfy stringent jitter generation
(JGEN) requirement [15].
In this chapter, we demonstrate a CDR that employs a phase-rotating
PLL (PRPLL) as a phase interpolator and achieves reference-less frequency
acquisition [19]. The main features of the proposed CDR are discussed through
comprehensive linear and stability analyses, along with a detailed discussion
of the circuit implementation of the PRPLL and the CDR building blocks.
Fabricated in a 90 nm CMOS process, the prototype CDR consumes 13.1mW
at 5Gb/s and achieves a BER better than 10^−12, a 2MHz JTRAN bandwidth
with no peaking, a 16MHz JTOL corner frequency, and a recovered clock
long-term jitter of 5.0 ps rms/44.0 ps pp with PRBS31 input data. The CDR
can operate with negligible degradation in BER with 110mVpp supply noise at
the worst-case frequency (7MHz).
The rest of this chapter is organized as follows. Prior art on dual-loop
CDRs is briefly discussed in Section 3.1, serving as further motivation for
the proposed CDR presented in Section 3.2. The circuit implementation details
of the proposed CDR are described in Section 3.3. The measured results are
presented in Section 3.4, followed by a summary of the key contributions in
Section 3.5.
3.1 Background
The phase interpolator-based (PI-based) CDR shown in Fig. 3.1 is one of
the most commonly used CDR architectures [20–23]. Note that the phase
accumulator (ACCP) and the PI together play a role similar to that of the VCO
or DCO in Fig. 2.11, providing infinite phase shift modulo 2π.
Figure 3.1: Phase interpolator-based sub-rate CDR.
The whole CDR is composed of a cascade of a multiphase generator (MPG),
typically implemented using a PLL or a delay-locked loop (DLL), and a main
CDR, also referred to as the CDR loop; the two terms are used
interchangeably in this work. Using a local reference clock, the MPG
generates multiple equally spaced phases at approximately the data rate and
feeds them to the CDR loop. A bang-bang phase detector (BBPD) in the main CDR
loop detects the sign of the phase error, and a digital proportional-integral
loop filter processes the BBPD output and generates the frequency control
word, DF. A digital accumulator, ACCP, integrates DF and generates the phase
control word, DP, which controls the phase interpolator (PI). The PI
interpolates between MPG phases as governed by DP and generates the recovered
clock, RCK. By varying DP, the CDR loop drives the recovered clock phase to
the center of the input data eye. By designing the phase accumulator (ACCP)
to roll over (as opposed to saturating), infinite phase shifting can be
achieved to track the small frequency difference between the MPG output
frequency and the incoming data rate [20].
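The loop operation described above can be illustrated with a simple behavioral model. The sketch below is a heavily idealized, noise-free simulation written for this purpose: the BBPD is a sign function, the loop filter is proportional-integral, and ACCP wraps modulo the number of PI steps. The function name, gains, step count, and per-update phase drift are all assumed values, not parameters of any reported design.

```python
import math

def wrap(phi):
    """Wrap a phase difference into [-pi, pi)."""
    return (phi + math.pi) % (2.0 * math.pi) - math.pi

def simulate_pi_cdr(n_updates, drift_rad, pi_steps=64, kp=1.0, ki=1.0 / 64):
    """Track a constant frequency offset with a roll-over phase accumulator.

    drift_rad is the phase the data gains on the MPG clock per loop update.
    Returns (worst phase error over the second half in rad, net ACCP wraps).
    """
    data_phase = 0.0   # input data phase relative to the MPG clock (rad)
    integ = 0.0        # integral-path accumulator (frequency estimate)
    acc_p = 0.0        # phase accumulator ACCP, kept in [0, pi_steps)
    rollovers = 0      # net number of ACCP wrap-arounds
    worst_err = 0.0
    for n in range(n_updates):
        data_phase += drift_rad
        rck_phase = 2.0 * math.pi * int(acc_p) / pi_steps   # PI output phase
        err = wrap(data_phase - rck_phase)
        if n >= n_updates // 2:
            worst_err = max(worst_err, abs(err))
        early_late = 1.0 if err > 0.0 else -1.0             # BBPD decision
        integ += ki * early_late
        d_f = kp * early_late + integ                       # control word DF
        total = acc_p + d_f
        rollovers += math.floor(total / pi_steps)           # count wraps
        acc_p = total % pi_steps                            # roll over
    return worst_err, rollovers

worst_err, rollovers = simulate_pi_cdr(n_updates=5000, drift_rad=0.02)
print(f"worst phase error {worst_err:.3f} rad, ACCP wrapped {rollovers} times")
```

Because ACCP rolls over instead of saturating, the recovered clock phase keeps advancing through many full rotations while the phase error stays within a few PI steps.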
There are many tradeoffs that one must consider while designing a PI-based
CDR. First, because JTOL corner frequency is dictated by phase-tracking
slew rate, it can be increased only by increasing PI step size (assuming the
loop is operated at the maximum possible update rate). But this also in-
creases phase quantization error and degrades JGEN [14, 24]. Second, since
JTRAN and JTOL are both governed by the same loop parameters, it is
impossible to lower JTRAN to filter input jitter without reducing JTOL cor-
ner frequency [25]. Third, the non-linear transfer characteristic of BBPD
causes loop gain to depend on input jitter, which makes it difficult to control
JTRAN in a robust manner [15, 26]. Finally, the design of phase interpo-
lators is challenging due to the conflicting tradeoffs between linearity, noise
sensitivity, operating range, area, and power [20, 27]. Their power and area
penalty is further exacerbated in sub-rate CDRs which require many PIs to
generate multiple phases; for example, a half-rate CDR needs 4 phases and
a quarter-rate CDR needs 8 phases [21, 22].
Several techniques have been proposed to improve PI resolution and reduce
its impact on JGEN [14,24,28]. PI quantization error was suppressed in [24]
by filtering it using a PLL and the suppression was further improved in [14]
by using a delta-sigma modulator to shape the quantization error out of band.
Both these architectures are particularly amenable for sub-rate CDRs as they
can generate multiple phases using a single PI. However, their effectiveness
is limited by PI non-linearity and by the coupled PLL bandwidth tradeoff to
simultaneously suppress VCO phase noise and PI quantization error (Qn). A
low bandwidth is desirable to filter Qn while a large bandwidth is needed to
mitigate VCO phase noise [14, 27].
The phase-rotating PLL (PRPLL) proposed in [28] (shown in Fig. 3.2)
presents an interesting phase shifting technique without using an explicit
phase interpolator, and it overcomes the inherent non-linearity that comes
with implementing interpolation in phase domain [27]. Different from a con-
ventional charge-pump PLL, it consists of multiple XOR phase detectors
whose output currents are weighted, summed, and filtered to generate the
control voltage. By weighting the individual XOR outputs differently using
control word DP, the amount of output phase shift can be varied. Compared
to a conventional PI, thanks to the current-domain operation, the PRPLL
approach exhibits superior digital-to-phase conversion linearity. Further,
because the output frequency is the same as the input reference frequency, a
high-frequency reference clock is mandatory, which in turn allows the PRPLL
to have a very high bandwidth. This helps to suppress VCO phase noise and
reduce loop filter area. These advantages were exploited in [29,30], where
the PRPLL was used
to interpolate between phases and implement sub-rate CDRs. However, the
need for a high-frequency reference clock (at approximately the data rate)
has restricted the widespread usage of PRPLLs in PI-based CDRs. In view of
this, we seek to obviate the need for a reference clock and present a reference-
less PRPLL-based CDR.
Figure 3.2: Block diagram of a PRPLL.
3.2 Proposed CDR Architecture
To arrive at the proposed architecture, we start with the PRPLL-based CDR
(Fig. 3.3(a)). Note that, in steady state, ACCI output represents the fre-
quency error between the incoming data and the PRPLL output. There-
fore, we postulate that frequency locking can also be achieved by tuning the
PRPLL output frequency indirectly by tuning its reference clock frequency
using ACCI output. To this end, a digitally-controlled oscillator (DCO) is
used, as shown in Fig. 3.3(b), to generate reference clock for the PRPLL.
In steady state, the DCO would be tuned to the data rate. We further ob-
serve that the newly added DCO path also implements the frequency control
portion of the CDR and appears in parallel to the original frequency con-
trol path through phase accumulator ACCP. Thus, it is unnecessary to feed
ACCI output into the phase-tuning port of the PRPLL. Applying this mod-
ification leads to the CDR depicted in Fig. 3.3(c), which can be redrawn as
shown in Fig. 3.3(d). Looking at Fig. 3.3(d) reveals that the proposed CDR
can be simply viewed as a Type-II CDR in which the proportional path is
implemented in phase domain as opposed to digital (or analog) domain in
conventional CDRs. As will be illustrated shortly, this way of implement-
ing proportional control gives rise to many attractive features such as well
controlled JTRAN, decoupled JTRAN/JTOL and JTRAN/JGEN behavior.
3.2.1 Linear Analysis
A linear model of the proposed CDR is depicted in Fig. 3.4. BBPD is repre-
sented by its linearized gain, KBBPD, given by [23]:
$$K_{BBPD} = \frac{1}{\sigma_j\sqrt{2\pi}} \tag{3.1}$$
where input jitter is assumed to have normal distribution with zero mean
and a variance of $\sigma_j^2$. The loop gain of the CDR, LGCDR(s), is given by:
$$LG_{CDR}(s) = \rho K_{BBPD}\left(\frac{f_{ACC}}{s}K_{P}K_{PR} + \frac{f_{ACC}}{s}\,\frac{K_{I}K_{DCO}}{s}\right)\frac{LG_{PRPLL}(s)}{1 + LG_{PRPLL}(s)} \tag{3.2}$$
where ρ is input data transition density, KDCO is digitally-controlled oscil-
lator (DCO) gain, fACC is the frequency at which accumulators ACCP and
ACCI are clocked, and KPR is the phase interpolation gain of the PRPLL.
LGPRPLL(s) is the loop gain of the PRPLL and is equal to:
$$LG_{PRPLL}(s) = K_{PD}\,LF(s)\,\frac{K_{VCO2}}{s} \tag{3.3}$$
Figure 3.3: Evolution of the proposed CDR.
where KPD and KVCO2 are the gains of the PD and the oscillator, respectively.
Since the PRPLL bandwidth is designed to be much larger than that of the
CDR, LGPRPLL(s)/(1 + LGPRPLL(s)) ≈ 1 in the vicinity of the CDR bandwidth.
Using this simplification, the input-to-RCK transfer function, HIN2RCK(s),
can be calculated to be:
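$$H_{IN2RCK}(s) = \frac{\Phi_{RCK}(s)}{\Phi_{DIN}(s)} = \frac{\rho K_{BBPD}f_{ACC}K_{I}K_{DCO}/s^2}{1 + LG_{CDR}(s)} = \frac{\rho K_{BBPD}f_{ACC}K_{I}K_{DCO}}{s^2 + s\rho K_{BBPD}f_{ACC}K_{P}K_{PR} + \rho K_{BBPD}f_{ACC}K_{I}K_{DCO}} \tag{3.4}$$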
The above equation reveals that there are two poles and, importantly, no
zeros in the transfer function. Due to the absence of zeros, jitter peaking
can be completely eliminated simply by making the two poles real.
Under this condition, the locations of the two poles can be determined to be:
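$$\begin{aligned}
\omega_{p1} &= \frac{-\rho K_{BBPD}f_{ACC}K_{P}K_{PR} + \sqrt{(\rho K_{BBPD}f_{ACC}K_{P}K_{PR})^2 - 4\rho K_{BBPD}f_{ACC}K_{I}K_{DCO}}}{2} \\
&= \frac{K_{PP}}{2}\left(-1 + \sqrt{1 - \frac{4K_{INT}}{K_{PP}^2}}\right) \\
&\approx -\frac{K_{INT}}{K_{PP}} = -\frac{\rho K_{BBPD}f_{ACC}K_{I}K_{DCO}}{\rho K_{BBPD}f_{ACC}K_{P}K_{PR}} = -\frac{K_{I}K_{DCO}}{K_{P}K_{PR}}
\end{aligned} \tag{3.5}$$
$$\begin{aligned}
\omega_{p2} &= \frac{-\rho K_{BBPD}f_{ACC}K_{P}K_{PR} - \sqrt{(\rho K_{BBPD}f_{ACC}K_{P}K_{PR})^2 - 4\rho K_{BBPD}f_{ACC}K_{I}K_{DCO}}}{2} \\
&= \frac{K_{PP}}{2}\left(-1 - \sqrt{1 - \frac{4K_{INT}}{K_{PP}^2}}\right) \\
&\approx -K_{PP} = -\rho K_{BBPD}f_{ACC}K_{P}K_{PR}
\end{aligned} \tag{3.6}$$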
where KPP = ρKBBPDfACCKPKPR and KINT = ρKBBPDfACCKIKDCO, and ωp1 is the
lower of the two pole frequencies. Because |ωp1| ≪ |ωp2|, the jitter
transfer (JTRAN) bandwidth approximately equals |ωp1|.
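The following sketch evaluates Eqs. (3.1) and (3.4)-(3.6) numerically to illustrate two points: the approximate low-frequency pole does not depend on the BBPD gain, and the all-pole transfer function never exceeds unity gain when both poles are real. The parameter values in `PARAMS` and the two jitter levels are hypothetical and were chosen only to exercise the formulas; they are not the design values of the prototype.

```python
import cmath
import math

def k_bbpd(sigma_j):
    """Linearized BBPD gain of Eq. (3.1) for Gaussian jitter with std sigma_j."""
    return 1.0 / (sigma_j * math.sqrt(2.0 * math.pi))

def loop_constants(sigma_j, rho, f_acc, k_p, k_pr, k_i, k_dco):
    """Return (K_PP, K_INT), the s^1 and s^0 coefficients of Eq. (3.4)."""
    g = rho * k_bbpd(sigma_j) * f_acc
    return g * k_p * k_pr, g * k_i * k_dco

def poles(k_pp, k_int):
    """Exact roots of s^2 + K_PP*s + K_INT, i.e. Eqs. (3.5) and (3.6)."""
    disc = cmath.sqrt(k_pp * k_pp - 4.0 * k_int)
    return (-k_pp + disc) / 2.0, (-k_pp - disc) / 2.0

def h_in2rck(s, k_pp, k_int):
    """Input-to-RCK jitter transfer function of Eq. (3.4)."""
    return k_int / (s * s + k_pp * s + k_int)

# Hypothetical parameter values, for illustration only.
PARAMS = dict(rho=0.5, f_acc=625e6, k_p=0.5, k_pr=2.0 * math.pi / 64,
              k_i=0.25, k_dco=2.0 * math.pi * 1e5)

for sigma_j in (0.05, 0.1):    # assumed input jitter std (rad)
    k_pp, k_int = loop_constants(sigma_j, **PARAMS)
    p1, p2 = poles(k_pp, k_int)
    print(f"sigma_j={sigma_j}: p1={p1.real:.4g} (approx {-k_int / k_pp:.4g}), "
          f"p2={p2.real:.4g} (approx {-k_pp:.4g})")

# Sweep |H_IN2RCK| from 1e3 to about 8e9 rad/s and record the largest gain.
k_pp, k_int = loop_constants(0.1, **PARAMS)
peak = max(abs(h_in2rck(complex(0.0, 10.0 ** (e / 10.0)), k_pp, k_int))
           for e in range(30, 100))
print(f"largest |H_IN2RCK| over the sweep: {peak:.6f}")
```

In this model, changing the input jitter moves the higher pole (and hence the tracking bandwidth) but leaves the approximate JTRAN bandwidth KIKDCO/(KPKPR) unchanged.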
The transfer function from input to SCK can be similarly calculated and
is given by:
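$$H_{IN2SCK}(s) = \frac{\Phi_{SCK}(s)}{\Phi_{DIN}(s)} = \frac{LG_{CDR}(s)}{1 + LG_{CDR}(s)} = \frac{s\rho K_{BBPD}f_{ACC}K_{P}K_{PR} + \rho K_{BBPD}f_{ACC}K_{I}K_{DCO}}{s^2 + s\rho K_{BBPD}f_{ACC}K_{P}K_{PR} + \rho K_{BBPD}f_{ACC}K_{I}K_{DCO}} \tag{3.7}$$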
Note that the above transfer function has the same two poles as
HIN2RCK(s). However, much like in a conventional type-II PLL (and unlike
HIN2RCK(s)), the transfer function contains a pole-zero pair (ωp1 and
ωz1 = −KIKDCO/(KPKPR)). If ωp1 and ωz1 cancel perfectly, as desired in most
applications, the jitter tracking bandwidth (or equivalently the JTOL corner
frequency) equals |ωp2|. In the case of imperfect cancellation, the JTOL
corner frequency varies in proportion to the cancellation inaccuracy. It is
important to note that, in the proposed architecture, mismatch in the
pole-zero cancellation does not change the jitter transfer (JTRAN) bandwidth,
which is determined by the dominant pole (ωp1) as illustrated earlier.
Approximately, the JTOL corner frequency is given by:
$$\text{JTOL corner frequency} = |\omega_{p2}| \approx \rho K_{BBPD}f_{ACC}K_{P}K_{PR} \tag{3.8}$$
Based on the analysis presented thus far, two important observations can
be made: (1) unlike in conventional bang-bang CDRs, the JTRAN bandwidth
of the proposed CDR is independent of the BBPD gain (see Eq. (3.5)); (2) the
JTRAN and JTOL bandwidths are completely decoupled, unlike in a
conventional type-II CDR where they are both set by ωp2 [15]. CDRs reported
in [12,25,31], which use a voltage-controlled delay line (VCDL) as the phase
shifter in the data path, also possess this property. However, using a PRPLL
in the clock path as proposed offers two main advantages. First, the PRPLL
consumes significantly less power than a VCDL designed to minimize
inter-symbol interference in the input data; achieving a long delay without
attenuating the input signal requires a large number of power-hungry delay
buffers [12]. Second, the infinite phase shifting capability of the PRPLL
eliminates the mid-frequency JTOL limitation that comes with the limited
range of a VCDL [31].
3.2.2 Stability Analysis
Compared to a conventional type-II CDR, the stability analysis of the
proposed CDR is complicated since the PRPLL is embedded in the CDR loop.
A common strategy to stabilize systems with embedded feedback loops is
based on choosing widely separated individual loop bandwidths. However,
this approach is complicated by the unpredictability of the CDR loop
bandwidth caused by the non-linearity of the BBPD.
Figure 3.4: Linearized phase-domain model of the proposed CDR.
In this work, the impact of the PRPLL on the CDR loop stability is minimized
by making the slew rate of phase tracking in the CDR much smaller than that
of the PRPLL loop [28]. Mathematically, this condition is expressed as
follows:
$$\frac{K_{P}\Delta\Phi_{pp} + K_{I}\Delta\Phi_{int}}{2\pi \cdot T_{ACC}} \ll f_{PRPLL} \tag{3.9}$$
where ∆Φpp = KPR and ∆Φint = KDCO are the magnitudes of maximum
phase deviations caused by proportional and integral control, respectively,
fPRPLL is the bandwidth of PRPLL, and TACC (= 1/fACC) is the update pe-
riod of accumulators. Under this condition, the proposed CDR behaves much
like a conventional type-II CDR and its stability can be ensured by choosing
the proportional path gain to be much larger than the integral path gain.
To this end, stability factor, ξ as defined below, must be chosen to be much
greater than one [14].
$$\text{Stability factor } \xi = \frac{K_{P}\Delta\Phi_{pp}}{K_{I}\Delta\Phi_{int}} \gg 1 \tag{3.10}$$
In the proposed architecture, at an operating frequency of fVCO = 2.5GHz,
with ∆Φpp = 2π/64 rad/s, ∆Φint = 2π/(6.25 × 10^3) rad/s, KP = 1/2, KI = 1/4,
and fACC = fVCO/4, the lower bound on fPRPLL is about 5MHz while the upper
bound is approximately equal to 1/10 of the VCO frequency, which can be as
high as 250MHz [32]. The stability factor is approximately 195. With the key
features of the proposed CDR discussed, the circuit implementation details
are presented next.
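The numbers quoted for the stability analysis can be reproduced directly from Eqs. (3.9) and (3.10). The short check below uses the parameter values stated in the text; treating the phase deviations as radians per accumulator update is an interpretation made here, and the variable names are chosen only for this example.

```python
import math

F_VCO = 2.5e9                        # VCO frequency (Hz)
F_ACC = F_VCO / 4.0                  # accumulator clock (Hz)
T_ACC = 1.0 / F_ACC                  # accumulator update period (s)
K_P, K_I = 0.5, 0.25                 # proportional and integral gains
DPHI_PP = 2.0 * math.pi / 64         # proportional-path phase step per update
DPHI_INT = 2.0 * math.pi / 6.25e3    # integral-path phase step per update

# Left-hand side of Eq. (3.9): phase-tracking slew rate expressed in Hz.
slew_hz = (K_P * DPHI_PP + K_I * DPHI_INT) / (2.0 * math.pi * T_ACC)

# Stability factor of Eq. (3.10).
xi = (K_P * DPHI_PP) / (K_I * DPHI_INT)

# Upper bound on the PRPLL bandwidth quoted in the text (one tenth of f_VCO).
f_prpll_max = F_VCO / 10.0

print(f"slew-rate bound: {slew_hz / 1e6:.2f} MHz")
print(f"stability factor: {xi:.1f}")
print(f"PRPLL bandwidth upper bound: {f_prpll_max / 1e6:.0f} MHz")
```

The slew-rate term evaluates to roughly 4.9 MHz and the stability factor to about 195, consistent with the lower bound of about 5MHz and the value of 195 stated above.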
3.3 Circuit Design
Figure 3.5: Detailed schematic of the proposed reference-less PRPLL-based
CDR.
The detailed schematic of the CDR is shown in Fig. 3.5 [19]. Input data,
DIN, is buffered by a two-stage limiting amplifier and fed to a half-rate BBPD
(HR-BBPD) whose output is decimated by a factor of 4 to ease the speed
requirements of digital circuits such as accumulators [15]. The decimated
BBPD output is fed to integral and proportional paths, which control the
DCO and PRPLL separately.
The PRPLL provides four equally-spaced sampling clock phases (SCK)
for HR-BBPD, and the retimer compensates timing difference between SCK
and RCK to guarantee correct retimed data (RDATA). For frequency acqui-
sition, a divide-by-1024 stage divides the input data to generate a stochastic
reference clock for the frequency-locked loop (FLL) [13]. The FLL path con-
sists of a divider-based frequency detector (FD), a 10-bit digital accumulator
(ACCF), and a delta-sigma DAC whose gain is denoted as KF. The rest of
the section focuses on the circuit implementation details of the PRPLL and
key CDR building blocks.
3.3.1 Phase-Rotating PLL Design
Figure 3.6: Schematic of phase-rotating PLL with quadrant segmentation.
The block diagram of the PRPLL implemented in the prototype is shown
in Fig. 3.6. Compared to the PRPLL in [28], two new techniques to improve
phase interpolation linearity and power efficiency are proposed. The power
dissipation in a conventional PRPLL is dominated by the XOR phase detec-
tors and the voltage-to-current (V-I) converter needed to drive the passive
loop filter. Current-mode logic (CML) XOR gates and the high-frequency
V-to-I converter consume a significant portion of the PRPLL power in [29].
In view of this, we propose segmented phase interpolation to reduce the
number of phase detectors, and embed charge pumps into CMOS XOR gates
to eliminate the high-bandwidth V-to-I converter (see Fig. 3.8). As shown in
Fig. 3.6, segmentation is implemented by first using the two most significant
bits (MSBs) to select two adjacent clock phases, denoted as I/Q, corresponding
to the quadrant in which phase interpolation occurs, and then using the
remaining four least significant bits (LSBs) to vary currents II and IQ in the
two XOR phase detectors. To better illustrate the phase interpolation
behavior, the relationship between PD output current and input control
word, DP, is depicted in Fig. 3.7. Note that the exact locking position is 90°
apart from the quadrant decided by the I/Q phases as depicted in Fig. 3.6, due
to the behavior of XOR phase detectors. Further, this segmented approach of
using a quadrant multiplexer and only two phase detectors scales easily to a
larger number of VCO phases to achieve better phase resolution in the PRPLL.
Figure 3.7: Phase-rotating process in PRPLL.
The proposed circuit that combines XOR phase detectors with charge
pump (XORPD-CP) is shown in Fig. 3.8. It consists of four XOR phase
detectors, a current steering DAC, two fixed current sources and a balancing
amplifier. The current steering DAC controls the tail current of the XOR
phase detectors using the digital codes from the CDR logic. The DAC has
15 unit current source elements, each steering current ILSB. Each of the two
fixed current sources sinks 0.5 ILSB current and helps to improve the speed of
the DAC [33]. The DAC can sink a maximum of 15.5 ILSB current while the
fixed current sources each pump 8 ILSB current. The outputs of the two main
XOR phase detectors (XOR1, XOR4) are combined to generate the charge
pump current ICP. The complementary phase detectors (XOR2, XOR3) con-
duct when the main XOR phase detectors are off and steer current into
the virtual ground node N. This maintains constant current sink through
the DAC, thereby eliminating large voltage fluctuations on the DAC output
nodes II and IQ. A balancing amplifier further suppresses any residual volt-
age fluctuations and helps to improve PI linearity. It should be noted that
the balancing amplifier does not require a large bandwidth as it is used only
to maintain the steady state operating point of virtual ground node.
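To make the segmentation concrete, the following Python sketch gives an idealized behavioral model of how a 6-bit control word DP could map to a quadrant and to the two tail currents. The function names, the assumption that II decreases as IQ increases, the assignment of one 0.5 ILSB fixed source to each DAC output node, and the linear current-to-phase relation are illustrative assumptions and are not taken from the schematic in Fig. 3.8. The 90° locking offset introduced by the XOR phase detectors is ignored.

```python
def decode_dp(dp):
    """Split a 6-bit control word into (quadrant, I_I, I_Q), currents in units of I_LSB.

    Assumed mapping: the 2 MSBs pick the quadrant, the 4 LSBs steer the
    15 unit elements between the two DAC outputs, and each output also
    carries a fixed 0.5 I_LSB.
    """
    if not 0 <= dp < 64:
        raise ValueError("DP must be a 6-bit word (0..63)")
    quadrant = dp >> 4          # two MSBs
    k = dp & 0xF                # four LSBs
    i_q = k + 0.5               # units steered to the Q side plus fixed 0.5
    i_i = (15 - k) + 0.5        # remaining units on the I side plus fixed 0.5
    return quadrant, i_i, i_q


def ideal_phase_deg(dp):
    """Idealized output phase, assuming phase is linear in I_Q / (I_I + I_Q)."""
    quadrant, i_i, i_q = decode_dp(dp)
    return 90.0 * (quadrant + i_q / (i_i + i_q))


if __name__ == "__main__":
    for dp in (0, 15, 16, 63):
        print(dp, decode_dp(dp), ideal_phase_deg(dp))
```

Under these assumptions the model produces 64 equal steps of 5.625° (6.25 ps at 2.5GHz), including at the quadrant boundaries. Circuit non-idealities such as current mismatch are not modeled.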
3.3.2 Limiting Amplifier
The schematic of the limiting amplifier used to buffer the input data is shown
in Fig. 3.9. It is implemented using a cascade of two CML stages and a
CML-to-CMOS converter. The combined gain and bandwidth of the two
CML stages are about 22 dB and 2.6GHz, respectively. A CML-to-CMOS
converter serves the dual purpose of generating rail-to-rail CMOS outputs
and isolating the CML stages from the BBPD kick-back noise.
Figure 3.8: Schematic of combined XORPD and charge pump
(XORPD-CP).
Figure 3.9: Schematic of the limiting amplifier.
3.3.3 Half-Rate Bang-Bang Phase Detector
A half-rate bang-bang phase detector is implemented using the schematic
shown in Fig. 3.10. Rising edges of Φ0 and Φ180 sample the incoming data
to generate data samples DS0 and DS1, while rising edges of Φ90 and Φ270
sample data transitions to generate edge samples ES0 and ES1. Early/Late
(E/L) decisions are made by combining data and edge samples as illustrated
in Fig. 3.10. Note that the rising edge of the synchronization phase ΦSYN (a
delayed version of Φ270) has to fall between Φ0 and Φ90 to ensure that the
proper data and edge samples are used to generate the correct E/L informa-
tion. The DFFs are implemented by cascading two sense amplifiers and a
symmetric latch to achieve small aperture window and optimize the timing
margin of the overall phase detector [14].
Figure 3.10: Half-rate bang-bang phase detector.
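The exact gate-level Early/Late logic is given only in Fig. 3.10, so the sketch below assumes the textbook Alexander rule: a decision is produced only when two consecutive data samples differ, and the edge sample taken between them indicates whether the clock is early or late. The function names and the +1/-1 polarity convention are hypothetical and chosen for illustration only.

```python
def alexander(d_before, edge, d_after):
    """Return +1 (clock early), -1 (clock late), or 0 (no data transition)."""
    if d_before == d_after:
        return 0                      # no transition, no phase information
    # If the edge sample still equals the earlier data bit, the edge clock
    # sampled before the transition, i.e. the clock is early.
    return 1 if edge == d_before else -1


def half_rate_bbpd(ds0, es0, ds1, es1, ds0_next):
    """Two decisions per half-rate clock period.

    ds0/ds1 are data samples taken with phases 0 and 180 degrees, es0/es1 are
    edge samples taken with phases 90 and 270 degrees, and ds0_next is the
    phase-0 data sample of the following clock period.
    """
    return alexander(ds0, es0, ds1), alexander(ds1, es1, ds0_next)


if __name__ == "__main__":
    print(half_rate_bbpd(0, 0, 1, 1, 1))   # one early decision, one idle slot
    print(half_rate_bbpd(0, 1, 1, 0, 0))   # two late decisions
```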
3.3.4 Digitally Controlled Oscillator (DCO)
The schematic of the ring DCO is shown in Fig. 3.11. The oscillator is im-
plemented using four pseudo-differential current starved delay cells whose
output is level shifted to a rail-to-rail signal using an AC coupled output
buffer. The oscillator frequency is controlled by DACs in the integral path
Figure 3.11: Schematic of supply-regulated digitally controlled oscillator
(DCO).
(DACI) and in the FLL (DACF) [34]. Simulations indicate a DACI LSB cur-
rent of 1µA corresponds to a KDCO of 400 kHz/LSB and leads to a JTRAN of
2MHz. Note that high bandwidth is beneficial for suppressing DCO phase
noise, while a low bandwidth is desirable to filter input jitter. Assuming
1%UIrms input jitter (2 psrms at 5Gb/s data rate) and an input transition
density of 0.5, JTRAN bandwidth of 2MHz mandates DCO phase noise to be
-100 dBc/Hz at 1MHz frequency offset for input jitter and DCO phase noise
to contribute equally to the recovered clock jitter. The DCO power consumption
to achieve such phase noise performance is 6.9mW, which constitutes
more than 50% of overall CDR power.
Ring oscillators using inverter-based delay cells, such as the one used in
this work, are highly susceptible to supply noise. Thus, their supply needs to
be regulated to prevent jitter degradation of the CDR. In this prototype, a
simple self-biased source follower regulator, shown in Fig. 3.11, is used with
a supply voltage (VDD) of 1.3V, and a regulator output (VDDREG) of 1.0V.
The gate voltage of transistor M1 (a native NMOS transistor) is generated
by filtering the noisy supply voltage using resistor R1 and capacitor C1. The
cut-off frequency of the low-pass filter formed by R1 and C1 is about 6.4 kHz,
which is sufficiently lower than CDR’s JTRAN bandwidth. This ensures
that low-frequency supply noise leaking through the gate of M1 is adequately
suppressed by the CDR loop. The capacitor C2 = 200 pF is used to tightly
couple the gate-source voltages of controlled current sources (such as M3) to
further improve the DCO supply noise immunity.
The simulated power supply noise rejection (PSNR) curves are depicted
in Fig. 3.12. The PSNR is defined as [35]:
$$\mathrm{PSNR} = 20\log\left(\frac{T_j/T}{\Delta V_{DD}/V_{DD}}\right),$$
where T is the period of the DCO output and Tj is the amplitude of jitter
caused by a peak-to-peak supply-noise amplitude of ∆VDD. Without regulation,
the worst-case PSNR is about 60 dB and occurs at about 7MHz. The regulator provides
nearly 40 dB rejection and improves the PSNR of the regulated DCO by the
same amount. Note that poor regulation of the source-follower regulator at
low frequencies does not impact the PSNR as long as the pole frequency,
ωp = 1/(R1C1), is much smaller than the CDR jitter transfer bandwidth. In
our implementation, the ratio of JTRAN to ωp is more than 300.
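As a quick numerical check of the bandwidth separation quoted above, the following Python sketch uses the 6.4 kHz filter cut-off and the 2MHz JTRAN bandwidth from the text. Treating the 6.4 kHz figure as f = 1/(2πR1C1) is an assumption, and the variable names are illustrative.

```python
import math

f_pole_hz = 6.4e3      # cut-off frequency of the R1-C1 low-pass filter (from the text)
jtran_hz = 2.0e6       # simulated CDR jitter transfer bandwidth (from the text)

# Ratio of JTRAN bandwidth to the regulator pole frequency.
ratio = jtran_hz / f_pole_hz

# Time constant implied by the cut-off, assuming f = 1 / (2*pi*R1*C1).
rc_seconds = 1.0 / (2.0 * math.pi * f_pole_hz)

print(f"JTRAN / pole frequency = {ratio:.1f}")
print(f"R1*C1 = {rc_seconds * 1e6:.1f} us")
```

The ratio evaluates to about 312, consistent with the statement that it exceeds 300.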
[Plot: magnitude [dB] versus frequency [Hz]; curves: PSNR w/o REG, REG PSRR, PSNR w/ REG]
Figure 3.12: Simulated CDR power supply noise rejection transfer functions
with and without the regulator.
3.4 Experimental Results
The prototype CDR is implemented in a 90 nm CMOS process and occupies
an active area of 0.62mm². The die micrograph is shown in Fig. 3.13. The
standalone PRPLL performance was characterized first, and the complete
CDR results are presented subsequently.
3.4.1 PRPLL Measurement Results
The external reference clock to the PRPLL was provided by an arbitrary
waveform generator (AWG7122B), and a power spectrum analyzer (Agilent
PSA E4440A) and a communication signal analyzer (CSA8200) were used to
measure phase noise and long-term absolute jitter, respectively. All mea-
surements were performed at an output frequency of 2.5GHz and 1V supply
voltage. The measured phase noise plot of the PRPLL is shown in Fig. 3.14.
The spot phase noise at 1MHz frequency offset is -134 dBc/Hz and the inte-
grated jitter from 4 kHz to 200MHz is 615 fsrms. Such excellent phase noise
performance is attributed to aggressive suppression of the VCO phase noise
by a large PLL bandwidth (≈ 200MHz). The measured reference phase
noise is -135 dBc/Hz at 1MHz frequency offset [36], which dominates the
Figure 3.13: Die micrograph.
Figure 3.14: Measured PRPLL phase noise plot.
phase noise of the PRPLL. The jitter histogram displayed in Fig. 3.15 indicates
that the PRPLL achieves a long-term jitter of 1.1 psrms and 8.9 pspp
(>100k hits), including the scope jitter [28].
Figure 3.15: PRPLL output jitter histogram.
The phase rotation behavior of the PRPLL is evaluated by sweeping the
digital control word (DP in Fig. 3.6) and measuring the output phase. The
measured digital-to-phase transfer function depicted in Fig. 3.16 is monotonic
with a maximum deviation of about ±1.2 ps from the nominal phase step of
400 ps/64=6.25 ps.
The linearity of the phase rotation process is illustrated by the DNL/INL
plots shown in Fig. 3.17. No large jumps were observed during quadrant
switching. Since the amplifier employed in the XORPD-CP alleviates the
non-linearity caused by output resistance modulation, it was possible to
achieve an excellent linearity of DNL < ±0.2 LSB, and INL < ±0.4 LSB.
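The linearity figures can be related to the measured phase steps with the usual DAC-style definitions. The Python sketch below assumes that DNL is the deviation of each step from the nominal LSB of 400 ps/64 = 6.25 ps, and that INL is the running sum of DNL; the exact post-processing used for Fig. 3.17 is not stated, so this is an illustration only. The phase list in the example is hypothetical and is not measured data.

```python
def dnl_inl(phases_ps, lsb_ps):
    """Return (DNL, INL) in LSB for a list of output phases, one per code."""
    steps = [b - a for a, b in zip(phases_ps, phases_ps[1:])]
    dnl = [(s - lsb_ps) / lsb_ps for s in steps]
    inl, acc = [], 0.0
    for d in dnl:
        acc += d                 # INL taken as the cumulative sum of DNL
        inl.append(acc)
    return dnl, inl


if __name__ == "__main__":
    lsb = 400.0 / 64             # nominal phase step in ps
    # Hypothetical phases for four consecutive codes (not measured data).
    dnl, inl = dnl_inl([0.0, 6.25, 12.5, 20.0], lsb)
    print(dnl, inl)
    # A step error of 1.2 ps corresponds to about 0.19 LSB.
    print(1.2 / lsb)
```

With these definitions, the reported maximum step deviation of about ±1.2 ps corresponds to roughly ±0.19 LSB, in line with the DNL bound of ±0.2 LSB.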
At 2.5GHz, the PRPLL consumes 2.9mW, of which only 450µW is dissipated
by the XORPD-CP. The performance summary of the PRPLL and a comparison
with other PRPLL designs in the literature are shown in Table 3.1.
3.4.2 CDR Measurement Results
The BER performance of the CDR was characterized with different PRBS
sequences using Agilent BERT N4902B. Input phase modulation needed to
[Plot: phase step [ps] and phase shift [ps] versus digital control word DP]
Figure 3.16: Measured digital to phase transfer characteristics of the
PRPLL.
[Plot: DNL [LSB] and INL [LSB] versus digital control word DP]
Figure 3.17: Measured phase interpolation linearity (DNL and INL) of the
PRPLL.
Table 3.1: PRPLL performance summary and comparison
| | [28] | [30] | This work |
| Technology | 90 nm CMOS | 90 nm CMOS | 90 nm CMOS |
| XORPD power efficiency [mW/GHz] | 3.2 | 0.24 | 0.18 |
| PRPLL power efficiency [mW/GHz] | 7.92 | 1.34 | 1.16 |
| Long-term jitter [mUIrms/mUIpp] | 3.5/39.5 | N/A | 2.75/22.3 |
| Phase linearity, DNL/INL [LSB] | ±0.8/N/A | ±0.5/±0.8 | ±0.2/±0.4 |
| Phase noise at 1MHz offset [dBc/Hz] | -122 | N/A | -134 |
measure JTRAN and JTOL was provided by Agilent E4433B RF signal gen-
erator and the recovered clock jitter was measured using CSA8200. All the
measured results presented in this section were obtained at a data rate of
5Gb/s, and the channel used for characterizing the CDR contains 1-m coax-
ial SMA cable, 2-inch on-board FR4 PCB trace, and parasitics associated
with QFN48 package. The overall loss is about 3-to-4 dB at 5Gb/s.
The jitter transfer function, $\Phi_{RCK}(s)/\Phi_{IN}(s)$, measured by feeding a ‘1100’ data
pattern with about 1%UIrms jitter for different integral path gain settings, is
plotted in Fig. 3.18. With the nominal gain setting, JTRAN was measured to be
approximately 2.3MHz, and it varies from 1.1MHz to 3.8MHz as the gain
is scaled by a factor of 4. This illustrates that the JTRAN of the proposed
CDR loop is well controlled. No jitter peaking was observed under any of
these conditions.
The sensitivity of JTRAN to input jitter is evaluated by measuring JTRAN
for different input jitter amplitudes and the results are shown in Fig. 3.19.
Minimal variation is observed in JTRAN bandwidth as the input jitter ampli-
tude is varied from 0.01UI to more than 0.3UI (more than 30x), illustrating
that JTRAN is independent of input jitter. In other words, the proposed
architecture achieves linear loop dynamics even while using a BBPD and
digital control.
The JTOL plot measured with PRBS7 input data and a BER threshold of
$10^{-12}$ is shown in Fig. 3.20. The JTOL corner frequency is about 16MHz, and
low frequency JTOL is limited by the phase modulation range of the BERT.
A half-rate recovered data eye diagram, obtained at the HR-BBPD output,
is enclosed in Fig. 3.20, and data jitter is 6.2 psrms and 45.6 pspp.
[Plot: measured transfer function [dB] versus frequency [MHz]; curves: RCK with JTRAN of 1.1MHz, 2.3MHz, and 3.8MHz]
Figure 3.18: Measured jitter transfer function with different gain settings.
[Plot: |JTRAN| [dB] versus frequency [MHz]; curves: input jitter amplitude of 0.01UI, 0.05UI, and 0.30UI]
Figure 3.19: Measured JTRAN with different input jitter amplitudes.
[Plot: JTOL [UIpp] versus frequency [MHz]]
Figure 3.20: Measured jitter tolerance with a BER threshold of $10^{-12}$ and
PRBS7 input data.
Table 3.2: RCK and SCK jitter versus different data sequences
| Jitter [psrms/pspp] | ‘1100’ pattern | PRBS7 | PRBS15 | PRBS31 |
| RCK | 4.1/30.8 | 4.7/35.2 | 4.9/40.8 | 5.0/44.0 |
| SCK | 6.8/64.4 | 10.2/107.6 | 14.7/114.0 | 14.9/125.2 |
Long-term absolute jitter histograms of both RCK and SCK, when the
CDR is operating with PRBS31 input data, are shown in Fig. 3.21. Because
SCK contains the phase quantization error of the PRPLL, as expected, it
exhibits inferior jitter performance compared to that of RCK. The dependence
of jitter on the length of the PRBS sequence is reported in Table 3.2.
Supply noise sensitivity of RCK is measured using a setup similar to that
described in [35] without the on-chip supply-noise monitor. Since only a small
decoupling capacitor is used for VDDVCO node, similar to [35], the injected
on-chip supply noise has almost the same amplitude as that applied off-chip.
When a 7MHz (50mVpp) sinusoid was applied to the DCO supply voltage,
RCK jitter degraded to 9.65 psrms/61.2 pspp. Based on simulations, the CDR
is most sensitive to supply noise frequencies around 7MHz (see Fig. 3.12),
hence the reported jitter degradation represents the worst case. A more
meaningful measure of the CDR sensitivity to supply noise can be captured
by evaluating the BER performance in the presence of supply noise. To this
end, BER is measured at different supply noise frequencies and amplitudes
and the results are presented in Fig. 3.22. At 7MHz noise frequency, the CDR
Figure 3.21: Measured RCK and SCK jitter with PRBS31 input data: (a)
RCK jitter, (b) SCK jitter.
operates with a BER better than $10^{-12}$ for supply noise amplitudes smaller
than 110mVpp, while at 50MHz noise frequency, the CDR can tolerate supply
noise of 155mVpp.
[Plot: bit error rate versus sinusoidal supply noise amplitude [mVpp]; curves: 7MHz, 25MHz, 50MHz]
Figure 3.22: Measured BER as a function of supply noise amplitude at
different noise frequencies with PRBS31 input data.
Input sensitivity of the CDR is evaluated by measuring the BER as a
function of input amplitude for different PRBS sequences (see Fig. 3.23).
With PRBS7 input data, an input amplitude of about 10mV is needed to achieve
a BER better than $10^{-12}$, and this degrades to 13mV with PRBS31 input data.
At 5Gb/s, the CDR consumes 13.1mW, of which 6.9mW is dissipated by
the DCO. The limiting amplifier consumes an additional 5.5mW. The performance
summary and comparison of the proposed CDR with state-of-the-art
designs are shown in Table 3.3. The proposed CDR compares favorably both
in terms of power efficiency and jitter with CDRs implemented using ring
oscillators in [13, 37, 38]. Compared to LC oscillator-based CDRs in [12, 31],
the power efficiency is superior but jitter is higher. Even when the on-chip
limiting amplifier power is included, the proposed design achieves a much
better power efficiency of 3.72mW/Gb/s than the designs in [12, 31], where
limiting amplifiers are also implemented on-chip.
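The power figures above can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the numbers quoted in this paragraph; the function name is illustrative, and the figure of merit is taken to be power divided by data rate, as implied by the mW/Gb/s unit in Table 3.3.

```python
def fom_mw_per_gbps(power_mw, data_rate_gbps):
    """Energy-efficiency figure of merit: power divided by data rate."""
    return power_mw / data_rate_gbps


cdr_mw = 13.1                 # CDR power at 5 Gb/s
dco_mw = 6.9                  # DCO share of the CDR power
la_mw = 5.5                   # limiting amplifier power
total_mw = cdr_mw + la_mw     # overall receiver power

print(round(fom_mw_per_gbps(cdr_mw, 5.0), 2))     # CDR only
print(round(fom_mw_per_gbps(total_mw, 5.0), 2))   # including limiting amplifier
print(round(dco_mw / cdr_mw, 3), round(dco_mw / total_mw, 3))  # DCO power fractions
```

The two figures of merit evaluate to 2.62 and 3.72mW/Gb/s, and the DCO accounts for roughly 53% of the CDR power or 37% of the power including the limiting amplifier.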
[Plot: bit error rate versus input amplitude [mV]; curves: PRBS7, PRBS15, PRBS31]
Figure 3.23: Measured BER as a function of input amplitude for different
PRBS sequences.
Table 3.3: Receiver performance summary and comparison
| | [12] | [31] | [37] | [38] | [13] | This work |
| Technology | 0.35µm | 0.13µm | 0.13µm | 65nm | 0.13µm | 90nm |
| Supply [V] | 3.3 | 3.3/1.8 | 1.2 | 1.2 | 0.8/1.2 | 1.3/1.0 |
| JTRAN [MHz] | 0.5 | 1.2 | 1.4 | N/A | N/A | 2.3 |
| Oscillator | LC | LC | Ring | Ring | Ring | Ring |
| Jitter [psrms/pspp] | 0.5/8.0 | 0.6/5.1 | 7.2/47.2 | 9.7/53.3 | 5.4/44.0 | 5.0/44.0 |
| Input sensitivity [mV] | 6 | 10 | N/A∗ | N/A∗ | N/A∗ | 10 |
| Power [mW] | 775.5∗∗ | 800∗∗ | 13.2 | 20.6 | 6.1 | 13.1 (18.6∗∗) |
| Data rate [Gb/s] | 2.5 | 11.4 | 2.5 | 0.65 | 2.0 | 5.0 |
| FoM [mW/Gb/s] | 310.2∗∗ | 70.2∗∗ | 5.28 | 31.7 | 3.05 | 2.62 (3.72∗∗) |
| Architecture | Full-rate | Half-rate∗∗∗ | Full-rate∗∗∗ | Full-rate | Half-rate | Half-rate |
∗ No limiting amplifier is available on-chip
∗∗ Includes limiting amplifier power
∗∗∗ Requires a reference clock for acquisition
3.5 Summary
A PRPLL presents an attractive way to implement linear phase interpola-
tion which makes it well suited for implementing PI-based sub-rate CDRs.
Because the output frequency of a PRPLL is the same as its reference input,
PRPLL-based CDRs need a high-frequency external clock. This require-
ment makes them less appealing and has hindered their widespread usage.
In view of this, we presented design techniques to implement a reference-
less PRPLL-based CDR. Reference clock to the PRPLL is generated using a
digitally-controlled oscillator whose frequency is tuned to the data rate by the
CDR loop. Proportional control needed to stabilize Type-II CDR is imple-
mented in phase domain within the PRPLL. By doing so, we have illustrated
that the proposed CDR decouples jitter transfer (JTRAN) bandwidth from
jitter tolerance (JTOL) corner frequency, eliminates jitter peaking, and re-
moves JTRAN dependence on bang-bang phase detector gain. These features
are particularly attractive for repeater applications in which the recovered
clock is used to re-transmit the recovered data. The proposed techniques
are validated by measurement results obtained from the prototype CDR fabricated
in a 90 nm CMOS process. Error-free operation (BER < $10^{-12}$) is
achieved with 5Gb/s PRBS data sequences ranging from PRBS7 to PRBS31.
The measured JTRAN bandwidth is about 2.3MHz and JTOL corner frequency is
16MHz. The CDR is tolerant to 110mVpp of sinusoidal noise on the DCO
supply voltage at the worst case noise frequency of 7MHz. At 5Gb/s, the
CDR consumes 18.6mW power and achieves a recovered clock long-term jit-
ter of 5.0 psrms/44.0 pspp when operating with PRBS31 input data. Since
the DCO was implemented using a ring oscillator, it consumed more than
50% of the CDR core power (about 37% of the overall power including the
limiting amplifier) and contributed to a large portion of recovered clock
jitter. Using an LC-based oscillator can both reduce power and improve jitter
performance at the expense of area. Within the framework of using ring
oscillators, it is still possible to improve recovered
clock jitter performance with architecture-level innovations. One potential
solution is detailed in Chapter 4.
Circuit techniques to improve power efficiency and phase interpolation lin-
earity of the PRPLL are also presented. Power efficiency is improved by using
segmented phase interpolation that reduces the number of phase detectors
and embedding charge-pumps in CMOS XOR phase detectors to eliminate
the need for a high-frequency V-to-I converter. PI non-linearity is reduced
by minimizing current mismatch introduced by channel length modulation.
At 2.5GHz, the PRPLL consumes 2.9mW and achieves -134 dBc/Hz phase
noise at 1MHz frequency offset. The differential and integral non-linearity of
its digital-to-phase transfer characteristic are within ±0.2 LSB and ±0.4 LSB,
respectively.
CHAPTER 4
A CONTINUOUS-RATE DIGITAL CLOCK
AND DATA RECOVERY WITH
AUTOMATIC FREQUENCY ACQUISITION
Continuous-rate clock-and-data recovery (CDR) circuits capable of operat-
ing across a wide range of data rates offer flexibility in both optical and
electrical communication networks. They can help satisfy specifications of
multiple standards using a single chip solution and can reduce cost when
implemented using a minimal number of external components such as capac-
itors and voltage controlled crystal oscillators. However, it is very difficult to
meet these requirements using a classical analog CDR architecture depicted
in Fig. 4.1 [15,39]. First, extracting the bit rate (frequency information) from
Figure 4.1: Block diagram of a continuous-rate CDR with automatic
frequency acquisition.
the incoming random data stream is difficult because of the limited range
of conventional frequency detectors. Second, the design of a wide-tuning-
range low-noise oscillator in a power- and area-efficient manner is challenging.
Third, jitter transfer (JTRAN) and jitter tolerance (JTOL) characteristics
are set by the same loop parameters (as explained below), which complicates
the CDR design, especially in the context of repeater applications. Strin-
gent jitter peaking requirements in such applications also mandate a large
loop filter capacitor that is difficult to integrate on chip [12]. Finally, the
low JTRAN required in many standards such as SONET increases jitter
generation (JGEN) due to inadequate suppression of oscillator phase noise.
Alternatively, this translates to increased oscillator power dissipation. These
issues are further elaborated starting with frequency acquisition.
Automatic frequency acquisition loops are typically implemented using
either a rotational frequency detector (RFD) or Quadri-correlator frequency
detector (QFD) [12,40–43]. The main limitation of these frequency detectors
is their limited frequency acquisition range, which is usually less than 50%
of the target frequency. Therefore, dedicated coarse frequency detectors are
necessary to extend the range for continuous-rate applications [12]. Recently,
a divider-based stochastic reference clock generator (SRCG) approach that
provides unlimited frequency acquisition range (can lock to any frequency
within the tuning range of oscillators) was reported in [13,44]. However, the
accuracy with which the oscillator is tuned to the data rate strongly depends
on input data transition density, ρ, where 0 ≤ ρ ≤ 1. Any deviation of ρ
from 0.5 (a transition density of 50%) causes a residual frequency error of
2 × (ρ − 0.5) × $10^6$ ppm. For instance, a 7-bit pseudo-random binary sequence
(PRBS7) data pattern (with ρ ≈ 0.504) causes about 8000 ppm frequency
error, which is larger than the pull-in range of most conventional CDRs. In
this chapter, we present an automatic frequency acquisition scheme that:
(i) is insensitive to transition density, (ii) can achieve unlimited frequency
acquisition range, and (iii) is amenable to sub-rate CDR architectures.
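The sensitivity of the SRCG approach to transition density can be illustrated numerically. The Python sketch below evaluates the residual error expression 2 × (ρ − 0.5) × 10^6 ppm quoted above. The transition density of a PRBS of order N is assumed to be 2^(N−1)/(2^N − 1), which is the standard value for a maximal-length sequence, and the function names are illustrative.

```python
def prbs_transition_density(order):
    """Transition density of a maximal-length PRBS of the given order (assumed formula)."""
    return 2 ** (order - 1) / (2 ** order - 1)


def srcg_residual_error_ppm(rho):
    """Residual frequency error of the SRCG scheme for transition density rho."""
    return 2.0 * (rho - 0.5) * 1e6


if __name__ == "__main__":
    for order in (7, 15, 31):
        rho = prbs_transition_density(order)
        print(order, rho, srcg_residual_error_ppm(rho))
```

For PRBS7 this gives ρ of about 0.5039 and a residual error of about 7900 ppm, in line with the figure of about 8000 ppm quoted above, while longer sequences give a much smaller error.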
Achieving wide tuning range and low noise simultaneously is a challenging
design task. Ring oscillators can provide wide frequency range, but their
phase noise is not adequate for high performance CDR applications [13]. On
the other hand, LC oscillators offer excellent phase noise performance, but
their tuning range is limited. Carefully designed multiple LC tanks can cover
a wide frequency range [12, 31] at the expense of excessive power and area
consumption. In this chapter, we embed a wide tuning range ring oscillator
in a fractional-N PLL (FNPLL) and use the FNPLL as a digitally controlled
oscillator (DCO) to achieve both wide range and low noise. The FNPLL-
based DCO also helps decouple the trade-off between jitter transfer (JTRAN)
bandwidth and JGEN due to ring oscillator noise in conventional CDRs.
In addition to limited frequency acquisition range and finite tuning range
of the oscillator, classical CDRs also suffer from two other design trade-offs.
On one hand, the jitter transfer (JTRAN) bandwidth and jitter tolerance
(JTOL) corner frequency of a classical 2nd order CDR cannot be chosen
independently as they are both dictated by the higher of the two closed loop
poles [12]. This is undesirable because JTRAN cannot be lowered without
degrading JTOL. Intrinsic peaking resulting from placing the loop
stabilizing zero in the feed-forward path is also problematic, especially in
repeater applications. The delay/phase-locked loop (D/PLL) architecture,
reported in [5, 12, 25, 31, 45, 46] and shown in Fig. 4.2(a), removes the closed
loop zero and avoids jitter peaking. Furthermore, JTRAN bandwidth and
JTOL corner frequency are decoupled, with the JTRAN bandwidth governed
by the low pole (mainly from PLL), and the JTOL corner frequency decided
by the higher pole (mainly from DLL) [12]. On the other hand, classical
CDRs suffer from conflicting bandwidth requirements to meet jitter generation
(JGEN) and JTRAN specifications. Minimizing the amount of input
jitter transferred to CDR output (recovered clock) requires low JTRAN while
a high JTRAN is needed to suppress oscillator noise, which is a major
contributor of CDR jitter generation. Hence improving JGEN with low JTRAN
requires a low noise oscillator that consumes significant power and occupies
large area [12,31]. In this chapter, a digital D/PLL architecture is proposed
to overcome JTOL/JTRAN/JGEN trade-offs.
Figure 4.2: (a) Analog D/PLL architecture with large loop filter capacitor,
and (b) jitter transfer (JTRAN) and jitter tolerance (JTOL) in D/PLL.
The rest of this chapter is organized as follows. The automatic frequency
acquisition is detailed in Section 4.1. The overall digital CDR architecture
with proposed wide-range low-noise DCO is discussed in Section 4.2 followed
by circuit implementation details of the proposed CDR in Section 4.3. The
measured results are presented in Section 4.4, and a summary of the key
contributions is given in Section 4.5.
4.1 Automatic Frequency Acquisition
4.1.1 Review of BBPD Operation
The proposed frequency detection scheme uses the properties of a conventional
bang-bang phase detector (BBPD), so it is instructive to first review
the basic operation of a BBPD. A BBPD detects the sign of the phase error,
∆Φ, between incoming random data DIN and the local/recovered clock,
RCK. Based on the sign of the phase error, BBPD provides Early or Late
(E/L) information for the CDR loop to achieve phase locking. The input-
output transfer function of a BBPD, depicted in Fig. 4.3, illustrates that the
output changes sign whenever the input phase error crosses nπ radians. Due
to this behavior, BBPD output is usually considered to be valid only when
∆Φ lies between −π and π. This condition is violated in the presence of
frequency error since the phase error accumulates indefinitely, causing the BBPD
to produce Early and Late (E/L) signals alternately.
However, taking a closer look at the BBPD behavior reveals some interesting
properties (Fig. 4.3). We note that within each π interval of ∆Φ, the BBPD
outputs either consecutive E or consecutive L signals, and the number of
consecutive E (or L) signals, NP, is inversely proportional to the frequency
difference (∆F) between DIN and RCK. In other words, if NP = NP1 when
∆F = ∆F1, then NP = NP2 > NP1 when the frequency error ∆F2 is slightly
smaller than ∆F1. This is simply because it takes longer
for the phase error to accumulate π radians with a smaller frequency error.
Similarly, an even smaller frequency difference ∆F3 results in an even larger
number NP3 that is greater than both NP1 and NP2. The key observation
is that the frequency difference ∆Fn is inversely proportional to the number
of consecutive E/L signals NPn. This relationship is used in the proposed
frequency acquisition scheme as discussed next.
[Diagram: BBPD transfer function (VBBPD versus ∆Φ, changing sign at multiples of π) and example BBPD E/L output sequences versus time, with NP1 = 2 at ∆F1, NP2 = 3 at ∆F2, and NP3 = 8 at ∆F3]
Figure 4.3: Operations of a bang-bang phase detector.
4.1.2 Principle of Proposed Frequency Acquisition
The block diagram of the proposed BBPD-based frequency locking loop
(FLL) is shown in Fig. 4.4(a). Using E/L outputs of the BBPD, frequency
detection logic (FDL) generates frequency error information, which is inte-
grated by the accumulator ACCF and used to update DCO frequency (FDCO).
The process of frequency acquisition is illustrated in Fig. 4.4(b). At the be-
ginning of frequency acquisition, DCO is reset to its lowest frequency. Using
an accumulator, ACCE/L, FDL accumulates E/L signals from BBPD until
the sign of BBPD output changes polarity. When the sign changes, ACCE/L
resets and starts accumulating a new set of consecutive E/L information
again. FDL increments accumulator ACCF and updates the DCO frequency
FDCO when BBPD output changes sign and NP < NTH (the locking thresh-
old). Lock detector declares frequency lock when NP becomes greater than
or equal to NTH. After that, the phase tracking loop takes over and achieves
phase locking.
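The acquisition procedure described above can be illustrated with a small behavioral simulation. The Python sketch below is a simplified, jitter-free model and not the implemented logic: frequencies are normalized, RCK is treated as a full-rate clock, data transitions are drawn at random with probability ρ, and the DCO is stepped by a fixed amount on every sign change. The function name, parameters, step size, and starting frequency are all assumptions made for illustration.

```python
import random


def acquire_frequency(f_din, f_start, f_step, n_th, rho=0.5, seed=0,
                      max_bits=5_000_000):
    """Jitter-free behavioral model of the BBPD-based FLL.

    Returns (DCO frequency at lock, number of data bits processed).
    """
    rng = random.Random(seed)
    f_dco = f_start        # DCO starts from its lowest frequency
    phase = 0.0            # phase error in cycles (1.0 corresponds to 2*pi rad)
    run_sign = None        # polarity of the current run of E/L outputs
    run_len = 0            # ACC_E/L: number of consecutive E (or L) outputs, N_P
    for bit in range(1, max_bits + 1):
        # The phase error advances by dF/F_DIN cycles per data bit.
        phase = (phase + (f_din - f_dco) / f_din) % 1.0
        if rng.random() >= rho:
            continue       # no data transition, so the BBPD gives no decision
        sign = 1 if phase < 0.5 else -1   # BBPD output flips every pi radians
        if sign == run_sign:
            run_len += 1
            if run_len >= n_th:
                return f_dco, bit         # lock detector: N_P >= N_TH
        else:
            if run_sign is not None:
                f_dco += f_step           # sign change with N_P < N_TH
            run_sign, run_len = sign, 1
    raise RuntimeError("no lock within max_bits")


if __name__ == "__main__":
    f_lock, bits = acquire_frequency(f_din=1.0, f_start=0.98,
                                     f_step=1e-4, n_th=500)
    print(f"residual error = {(1.0 - f_lock) * 1e6:.0f} ppm after {bits} bits")
```

In this model the frequency step has to be smaller than the residual error implied by the locking threshold, otherwise the DCO could step past the target frequency. The safeguards against jitter-induced false updates and false locking are not modeled.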
In practice, jitter (Φj) may cause false updates of the DCO frequency since
the sign of BBPD output is alternating when the phase relationship between
DIN and RCK is within the jittery region (Fig. 4.5(b)). However, the jit-
tery region provides no valid information about the frequency error, thus the
false update can be prevented by not increasing ACCF when the peak value
of ACCE/L is smaller than its previous peak. Another common issue in auto-
matic frequency acquisition is harmonic locking where the steady state DCO
frequency equals K times the data rate. In this design, starting the DCO
from its lowest frequency ensures that the DCO locks to the target frequency
before it reaches any harmonic frequencies, thus avoiding the harmonic-lock
problem.
4.1.3 Analysis of Proposed Frequency Acquisition
The number of consecutive E/L signals (NP) not only depends on the fre-
quency error ∆F, but also on transition density ρ, and jitter Φj. First,
consider the case without jitter as shown in Fig. 4.5(a), where FDIN is input
data rate. One data bit of DIN spans 2π radians, and the BBPD output
changes sign when RCK and DIN phase difference exceeds π radians. In
Figure 4.4: Principle of proposed frequency acquisition scheme: (a) diagram
of a BBPD-based frequency locking loop, and (b) operation of a
BBPD-based frequency locking loop.
each π radians, the number of consecutive E/L signals is:
$$N_P = \rho\,\frac{F_{DIN}}{\Delta F}\,\frac{\pi}{2\pi} \qquad (4.1)$$
Therefore, the relative frequency error $\Delta F/F_{DIN}$, $N_P$, and $\rho$ are related by:
$$\frac{\Delta F}{F_{DIN}} = \frac{\rho}{2N_P} \qquad (4.2)$$
Tabulating the above equation for different values of NP and ρ reveals that
the relative frequency error is bounded within 1000 ppm for any transition
density ρ between 0 and 1 when the locking threshold NTH = NP is set
to 500. In other words, residual frequency error in the proposed frequency
acquisition scheme can be made to be well within the pull-in range of a CDR,
independent of the input transition density.
As shown in Fig. 4.5(b), the effect of input data jitter Φj can be incorporated
into the relative frequency error expression as given below:
$$\frac{\Delta F}{F_{DIN}} = \frac{\rho}{N_P}\,\frac{\pi - \Phi_j}{2\pi} \qquad (4.3)$$
Interestingly, as long as the jitter is not so large as to close the eye, jitter
reduces residual frequency error compared to the case when there is no jitter.
In other words, increasing jitter has the same effect as making the locking
threshold larger.
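The residual error expressions can be tabulated directly. The Python sketch below evaluates Eq. (4.3), which reduces to Eq. (4.2) when Φj = 0, for a few values of ρ, NP, and Φj; the function name is illustrative.

```python
import math


def residual_error_ppm(rho, n_p, phi_j=0.0):
    """Relative frequency error dF/F_DIN in ppm from Eq. (4.3).

    rho   : input transition density (0..1)
    n_p   : number of consecutive E/L signals (equal to N_TH at lock)
    phi_j : input jitter in radians; phi_j = 0 gives Eq. (4.2)
    """
    return rho / n_p * (math.pi - phi_j) / (2.0 * math.pi) * 1e6


if __name__ == "__main__":
    for n_th in (250, 500, 750):
        print(n_th, [round(residual_error_ppm(rho, n_th)) for rho in (0.1, 0.5, 1.0)])
    print([round(residual_error_ppm(rho, 500, math.pi / 4)) for rho in (0.1, 0.5, 1.0)])
```

With NTH = 500 the error stays at or below 1000 ppm for any ρ up to 1, and a jitter of Φj = π/4 scales every entry by 0.75.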
Compared to the frequency acquisition based on SRCG in [13], the pro-
posed scheme is much less sensitive to input transition density as shown in
Fig. 4.6. With PRBS7 input data (ρ ≈ 0.504) the residual frequency error is
as high as 8000 ppm in [13], while the error is stable around 500 ppm (with
NTH = 500) for any PRBS sequence in the proposed scheme. Please refer
to Appendix A for a more detailed analysis of the locking reliability for the
proposed frequency acquisition scheme.
4.2 Overall CDR Architecture
A simplified block diagram of the proposed digital D/PLL CDR architecture
is shown in Fig. 4.7 [47]. It consists of three loops: (i) a frequency locked loop
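[Figure content: without jitter,
$$N_{TH} = N_P = \rho\,\frac{F_{DIN}}{\Delta F}\,\frac{\pi}{2\pi} \;\Rightarrow\; \frac{\Delta F}{F_{DIN}} = \frac{\rho\,\pi}{2\pi N_P} = \frac{\rho}{2N_P}$$
and with jitter Φj,
$$N_{TH} = N_P = \rho\,\frac{F_{DIN}}{\Delta F}\,\frac{\pi - \Phi_j}{2\pi} \;\Rightarrow\; \frac{\Delta F}{F_{DIN}} = \frac{\rho}{N_P}\,\frac{\pi - \Phi_j}{2\pi}$$
Residual frequency error ∆F/FDIN [ppm]:
| ρ | 0.1 | 0.5 | 1.0 |
| NTH=250 | 200 | 1000 | 2000 |
| NTH=500 | 100 | 500 | 1000 |
| NTH=750 | 66 | 330 | 660 |
| ρ | 0.1 | 0.5 | 1.0 |
| Φj=0 | 100 | 500 | 1000 |
| Φj=π/4 | 75 | 375 | 750 |]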
Figure 4.5: Residual frequency error dependence on transition density: (a)
w/o jitter, and (b) w/ jitter.
[Plot: residual frequency error [ppm] versus N for PRBS 2^N−1 input data; curves: Inti 2011, NTH=250, NTH=500, NTH=750; annotation: ρ = 2^(N−1)/(2^N−1)]
Figure 4.6: Residual frequency error comparison between proposed scheme
and SRCG.
(FLL), (ii) a delay-locked loop (DLL), and (iii) a phase-locked loop (PLL).
Using the half-rate bang-bang phase detector (BBPD) outputs, as described
earlier, FLL brings the DCO frequency to be within 500 ppm of the target
frequency (half of the data rate). The DLL adjusts the phase of the input
data using a digitally controlled delay line (DCDL) and locks it to that of
the recovered clock (RCK). In other words, the DLL in itself can be viewed
as a Type-I CDR. The PLL integrates the BBPD output using accumulator
ACCI and drives the DCO toward frequency lock. This behavior is analogous
to that of integral control path in a classical Type-II CDR. In other words,
the DLL and PLL implement the proportional and integral control portions
of the CDR, respectively.
Figure 4.7: Digital implementation of D/PLL CDR architecture.
Similar to its analog D/PLL counterpart shown in Fig. 4.2, the proposed
digital CDR also decouples the trade-off between JTRAN bandwidth and
JTOL corner frequency. However, implementing the loop filter in the digital
domain eliminates the large loop filter capacitor needed in the analog D/PLL. It
is also interesting to note that JTRAN bandwidth of the D/PLL is governed
only by the ratio of DCO and DCDL gains [31, 45]. As a result, JTRAN
is independent of BBPD gain and hence it does not depend on input jitter.
This is a considerable advantage compared to conventional bang-bang CDRs.
The detailed schematic of the proposed CDR is shown in Fig. 4.8 [47].
Input data DIN is buffered using a two-stage limiting amplifier before feeding
it to the DCDL. BBPD output is demultiplexed by a factor of 4 in the DLL
after carefully evaluating the trade-off between increased loop delay caused
by larger demultiplexing factor and increased power dissipation of ACCP at
smaller demultiplexing ratio. It is important to reduce loop latency because
large loop delay severely limits JTOL performance [48]. By contrast, the
loop latency is not as critical in the PLL. Therefore, the BBPD output is
demultiplexed by a factor of 32 in the integral path and the FDL to reduce
digital logic power. The outputs of ACCI and ACCF are summed to generate the
frequency control word (FCW) for the DCO. The fractional-N PLL-based
DCO provides four equally spaced sampling clock phases (RCK) for the half-rate
BBPD (see footnote 1).
Because the CDR is designed to operate across a very wide range of data
rates, it is susceptible to false locking. We propose a false locking prevention
scheme that is based on the observation that the sum of Early and Late
outputs of the BBPD must equal the number of input data transitions in the
frequency-locked state. The number of data transitions (NDT) counted using
divider H and accumulator ACCH is compared to the number of Early/Late
outputs (NE/L) provided by ACCE/L. If NDT ≠ NE/L, FDL logic continues
to increase the frequency and drives the DCO away from false locking. Both
loss-of-lock detection (LOLD) and lock detection (LD) are implemented to
ensure seamless switching between data rates.
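The false-lock check described above can be illustrated with a short behavioral sketch. It is not the synthesized FDL logic: the function name `false_lock_suspected` and the optional `tolerance` argument are hypothetical, and both counts are assumed to be taken over the same observation window.

```python
def false_lock_suspected(n_dt: int, n_el: int, tolerance: int = 0) -> bool:
    """Return True when the transition count and the Early/Late count disagree.

    n_dt      -- number of input data transitions (NDT) in the window
    n_el      -- number of Early/Late outputs (NE/L) in the same window
    tolerance -- hypothetical allowance for counting skew (not from the text)
    """
    return abs(n_dt - n_el) > tolerance


# In the frequency-locked state the two counts agree.
print(false_lock_suspected(n_dt=1024, n_el=1024))  # False
# A mismatch means the FDL should keep increasing the DCO frequency.
print(false_lock_suspected(n_dt=1024, n_el=512))   # True
```

In such a model, a `True` result would simply keep the frequency ramp running, which matches the behavior described for the FDL logic.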
Figure 4.8: Complete schematic of the proposed continuous-rate CDR.
Furthermore, in order to maximize JTOL performance, the DCDL is bi-
ased at its mid-delay point in steady state by the path containing gain block
KO (with a value of 1/16) and accumulator ACCO. Since in steady state,
the average input to ACCO is zero, the DCDL operates around its mid-delay
point and provides a maximum possible delay range of about ±100 ps. This
technique is fairly straightforward to realize in a digital implementation compared
to an analog D/PLL [12], where an extra gm control path is required
to properly bias the delay line and has to be always on to compensate for
capacitor leakage.
Footnote 1: Interestingly, using a fractional-N PLL as a DCO also improves the locking
reliability of the frequency locking loop. Please refer to Appendix A for a detailed analysis.
4.3 Circuit Implementation
Thanks to the mostly digital nature of the proposed CDR, a large number
of circuit blocks are fully synthesized using standard cells. The half-rate
bang-bang phase detector is implemented using a conventional Alexander
phase detector with improved sense-amplifier flip-flops as data and edge sam-
plers [45, 49]. The front-end limiting amplifier incorporates two CML stages
and a CML-to-CMOS conversion stage [45]. Offset correction is performed
by independently controlling positive/negative side termination voltages. A
minimum input swing of 15mV is required to achieve BER < 10^-12. The
design details of other critical analog building blocks including the digitally
controlled delay line (DCDL) and the ring-oscillator-based fractional-N PLL
used as the digitally controlled oscillator (DCO) are presented next.
4.3.1 Digitally Controlled Delay Line (DCDL)
The schematic of digitally controlled delay line is shown in Fig. 4.9. A two
stage limiting amplifier converts low swing input data to full swing CMOS
levels and feeds it to delay line controlled by code DP. The delay line is
implemented using a cascade of 16 pseudo-differential CMOS delay stages
that provide a total delay of about 200 ps, which is 2UIpp at 10Gb/s input
data rate. Delay tuning is performed by varying the output capacitance
of delay stages. The DCDL control encoder is designed to distribute the
desired delay equally among all delay stages to improve the digital control
to delay output linearity [50]. Compared to CML-based delay buffers used
in [31], the CMOS delay stages consume lower power and occupy smaller
area. For instance, 17-stage CML-based delay line in [31] consumes about
60mW while achieving a delay of about 150 ps, while the proposed CMOS
delay line dissipates only about 5mW while providing 200 ps delay. However,
finite bandwidth of CMOS delay stages adds inter-symbol interference (ISI)
to the input data and their high sensitivity to power supply noise increases
jitter. Extensive transistor-level simulations indicated that, with a 16-stage
DCDL, the ISI degradation can be limited to within 5%UI with 10Gb/s
PRBS31 input data at the worst-case process, supply voltage, and temperature
(PVT) condition (about 1%UI additional ISI in the nominal condition). Supply
noise sensitivity is reduced by powering the delay line using a linear low-
dropout regulator operating from a 1.2V supply voltage. The simulated power
supply rejection ratio of the regulator is about -20 dB at 10MHz.
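As a rough illustration of how a control encoder might spread the desired delay equally across the 16 stages, the sketch below splits a control code into per-stage settings that differ by at most one unit. The even-split rule, the function names, and the example code value 37 are assumptions made for illustration; the actual encoder in [50] may differ. The helper `delay_in_ui` only restates the 200 ps to 2 UI conversion quoted above.

```python
NUM_STAGES = 16  # number of delay stages stated in the text


def distribute_code(dp: int, num_stages: int = NUM_STAGES) -> list[int]:
    """Split control code dp into per-stage settings that differ by at most 1."""
    base, extra = divmod(dp, num_stages)
    # The first `extra` stages take one additional unit of load capacitance.
    return [base + 1 if i < extra else base for i in range(num_stages)]


def delay_in_ui(delay_ps: float, data_rate_gbps: float) -> float:
    """Convert a delay in picoseconds to unit intervals at a given data rate."""
    ui_ps = 1000.0 / data_rate_gbps
    return delay_ps / ui_ps


stage_codes = distribute_code(37)   # 37 is an arbitrary example code
print(stage_codes)                  # five stages at 3, eleven stages at 2
print(delay_in_ui(200.0, 10.0))     # 2.0 UI at 10 Gb/s
```

Keeping the per-stage settings within one unit of each other avoids concentrating the whole delay change in a single stage, which is the intuition behind the linearity improvement mentioned above.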
Figure 4.9: Schematic of the digitally controlled delay line.
4.3.2 Digitally Controlled Oscillator (DCO)
Ring oscillators have wide tuning range and can provide multiple phases
but their relatively poor phase noise limits their usage in many applications.
This is especially the case in a D/PLL based CDR because DCO phase noise
suppression bandwidth (which is equal to the JTRAN bandwidth) is much
lower than that of a conventional CDR. In view of this, we seek to use a ring
oscillator based fractional-N PLL as a DCO wherein the output frequency is
varied by controlling the feedback division ratio using the frequency control
word (FCW) as illustrated in Fig. 4.10. Since ring oscillator is embedded
inside the PLL, its phase noise is suppressed by the feedback loop with much
higher bandwidth. The FCW is equal to the sum of control words gener-
ated by frequency acquisition control path, DF, and the integral path, DI.
Figure 4.10: Schematic of ring oscillator-based fractional-N PLL as DCO.
Because the clock domain (CLKCDR) in which FCW is generated has no fixed
phase relationship with the clock domain (CLKFB) in which the ∆Σ modulator
operates, FCW is synchronized to CLKFB by the synchronization block
shown in Fig. 4.11. Metastability is mitigated as long as the frequency of CLKFB
is more than twice that of CLKCDR. The fractional-N PLL is implemented
using the charge-pump based delta-sigma (∆Σ) architecture [51]. In addition
to a phase frequency detector (PFD), loop filter, charge-pump, and a voltage
controlled oscillator (VCO), it consists of a 4-to-15 multi-modulus divider
that is dithered by a ∆Σ modulator. The ∆Σ modulator truncates 17-bit
FCWSYN (which is equal to the sum of FLL and integral control words, DF
and DI, respectively) and generates a sequence of integers ranging from 4
to 15, with a running average equal to the desired fractional division ratio.
The quantization error introduced by the ∆Σ modulator is suppressed by the low-
pass filtering action of the PLL feedback loop. While it is possible to reduce
the impact of quantization error on output phase noise to negligible levels
by reducing the PLL bandwidth, the contribution of VCO phase noise increases,
resulting in a conflicting noise bandwidth trade-off. Consequently, choosing
a PLL bandwidth that suppresses both the ∆Σ quantization error and
VCO phase noise adequately becomes very challenging.
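The way a ∆Σ modulator turns a fractional division ratio into a sequence of integers can be sketched as follows. This is a minimal first-order (single accumulator) model chosen for simplicity; the order of the modulator in the prototype is not stated here, and the split of the 17-bit word into 4 integer and 13 fractional bits is an assumption.

```python
FRAC_BITS = 13  # assumed split of the 17-bit word: 4 integer + 13 fractional bits


def delta_sigma_divider(fcw: int, n_cycles: int) -> list[int]:
    """First-order delta-sigma truncation of a fixed-point division ratio."""
    mask = (1 << FRAC_BITS) - 1
    integer, frac = fcw >> FRAC_BITS, fcw & mask
    acc, ratios = 0, []
    for _ in range(n_cycles):
        acc += frac
        carry = acc >> FRAC_BITS   # 1 when the accumulator overflows
        acc &= mask                # keep the residue (quantization error)
        ratios.append(integer + carry)
    return ratios


# Example: divide by 10.25, i.e. 10 + 2048/8192.
fcw = (10 << FRAC_BITS) + (1 << (FRAC_BITS - 2))
seq = delta_sigma_divider(fcw, 8192)
print(min(seq), max(seq), sum(seq) / len(seq))  # 10 11 10.25
```

The divider only ever sees integers, yet their running average equals the fractional ratio; the instantaneous error is the quantization noise that the PLL loop must filter.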
Figure 4.11: FCW synchronization from CDR to DCO.
In this work, a 2-stage architecture is employed to alleviate this trade-
off [52]. The first stage, implemented using a digital multiplying DLL (MDLL)
[53], multiplies the 50MHz crystal oscillator output and generates a 500MHz output
clock that acts as the reference clock to the second stage ∆Σ fractional-N
PLL. Because oversampling ratio of the ∆Σ modulator is increased by a fac-
tor of 10, the PLL bandwidth can be increased to adequately suppress ring
oscillator phase noise without increasing the contribution of ∆Σ truncation
error to output jitter [52, 54]. An additional pole located at the drain of
current-source transistor is introduced to further suppress the ∆Σ truncation
error. It is important to note that the crystal oscillator does not aid frequency
acquisition, as its frequency has no relation to the input data rate. The digital
MDLL is adopted for reference multiplication due to its superior phase noise
performance compared to a conventional PLL [53,55]. As shown in Fig. 4.12,
each rising edge of the input reference clock (FREF) replaces every 10th rising edge
of the VCO output to reset phase noise accumulation and thus achieves good
phase noise performance. The frequency of the VCO is tuned by an integral
path consisting of a BBPD that detects the phase difference between the oscillator
output and the input reference clock, an accumulator, ACC, and a ∆Σ digital-
to-analog converter (DAC) clocked at 125MHz that drives the oscillator. A
4th-order low-pass filter is used to suppress the truncation error of the digital ∆Σ
modulator.
Figure 4.12: Schematic of the digital multiplying DLL (MDLL).
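The benefit of edge replacement can be illustrated with a toy random-walk model. The sketch below is a simplification under stated assumptions: each oscillator cycle adds independent Gaussian jitter with a hypothetical standard deviation of 0.1 ps, the injected reference edge is treated as ideal, and tuning-loop dynamics are ignored.

```python
import random


def accumulated_jitter(n_cycles, sigma_ps=0.1, reset_every=None, seed=1):
    """Accumulated timing error (ps) of a ring oscillator, cycle by cycle.

    Each cycle adds independent Gaussian jitter of sigma_ps (assumed value).
    With reset_every=N, every Nth edge is replaced by an ideal reference
    edge, which clears the accumulated error, as in an MDLL.
    """
    rng = random.Random(seed)
    error, trace = 0.0, []
    for k in range(1, n_cycles + 1):
        if reset_every and k % reset_every == 0:
            error = 0.0                      # clean reference edge injected
        else:
            error += rng.gauss(0.0, sigma_ps)
        trace.append(error)
    return trace


def rms(values):
    """Root-mean-square value of a sequence."""
    return (sum(v * v for v in values) / len(values)) ** 0.5


free_running = accumulated_jitter(10_000)
mdll = accumulated_jitter(10_000, reset_every=10)
print(f"free-running: {rms(free_running):.2f} ps rms")
print(f"MDLL (N=10):  {rms(mdll):.2f} ps rms")
```

In this model the free-running error grows without bound as a random walk, while the periodic reset confines it to the jitter gathered over at most nine cycles.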
In the fractional-N PLL, a single four-stage pseudo-differential ring oscil-
lator is chosen to support a data rate range from 4Gb/s to 10.5Gb/s. Since
more than 2x range is achieved, lower data rates can be supported by us-
ing dividers [12]. The control voltage, VC, needs to swing by more than
300mV to support such a wide frequency tuning range. In order to improve
the linearity of charge pump across a large control voltage range, a feedback
loop is used to adjust the bias for the up current source adaptively. This
adaptive biasing control reduces reference spur by about 3 dB, and is also
effective in suppressing in-band fractional spurs. With a PLL bandwidth
of about 5MHz, a minimum of 7 dB in-band fractional spur suppression is
observed as shown in Fig. 4.13. The intuition behind this improvement is
that the adaptation loop is fast enough to track the control voltage variation
caused by in-band fractional spurs, thereby suppressing the spur level. For
high-frequency perturbations, however, the adaptation loop cannot respond fast
enough, so the spur levels remain the same. Further, transistors M1 and M2
are included to minimize the current mismatch due to charge sharing [24].
To account for the drop across M3, M4 and M5 are introduced, which also
improve the current-mirroring accuracy [35]. The loop filter shares the same
supply as the oscillator to reduce supply noise sensitivity. The overall
power consumption of the DCO is about 7.5mW, of which MDLL and PLL
consume 2.5mW and 5mW, respectively.
Figure 4.13: Charge pump with adaptation loop: (a) circuit schematic, and
(b) effectiveness on suppressing in-band fractional spurs.
[Plot (b): peak fractional spur [dB] (-50 to -30) versus frequency [MHz] (0.05 to 15), w/o adaptation and with adaptation.]
4.4 Experimental Results
The prototype CDR was fabricated in a 65 nm CMOS process and occupies
an active area of 1.63 mm^2. The chip micrograph is shown in Fig. 4.14. The
die was packaged in an 88-pin QFN (QFN88) package. The area and power
breakdown of the prototype CDR are shown in Fig. 4.15. The DCO, including
MDLL and fractional-N PLL, takes about one half the area and one third
the power at 10Gb/s input data rate. Compared to using multiple LC tanks,
the proposed DCO is more efficient in both area and power [12,31]. Because
the area of the DCO is dominated by the loop filter capacitors in MDLL and
fractional-N PLL, recently reported digital implementations could further
reduce DCO area. In the rest of this section, we report the performance of a
standalone DCO followed by complete CDR results.
Figure 4.14: Die micrograph.
Figure 4.15: Power and area breakdowns of the CDR prototype.
  Block             Area [%]    Power [%]
  Amp+DCDL          16          31
  BBPD+DMUX         8           23
  FLL               5           4
  CDR Logic         8           9
  MDLL              21          11
  FNPLL             28          22
  Test Circuitry    14          -
  Total power: 22.5mW @ 10Gb/s
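The percentages reported in Fig. 4.15 can be converted into absolute power per block with a few lines of arithmetic. The sketch below uses only the chart percentages and the 22.5mW total; the dictionary keys are simply the chart labels.

```python
TOTAL_POWER_MW = 22.5  # at 10 Gb/s, from Fig. 4.15

power_share = {  # percentages read from the power chart in Fig. 4.15
    "Amp+DCDL": 31, "BBPD+DMUX": 23, "CDR Logic": 9,
    "FLL": 4, "MDLL": 11, "FNPLL": 22,
}

power_mw = {blk: TOTAL_POWER_MW * pct / 100 for blk, pct in power_share.items()}
dco_mw = power_mw["MDLL"] + power_mw["FNPLL"]

for blk, mw in power_mw.items():
    print(f"{blk:10s} {mw:5.2f} mW")
print(f"DCO (MDLL + FNPLL): {dco_mw:.2f} mW")  # about one third of the total
```

The DCO share obtained this way, roughly 7.4mW, agrees with the 7.5mW figure quoted for the DCO in Section 4.3.2.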
4.4.1 DCO Results
The fixed 50MHz reference clock to the DCO was provided by an off-chip
crystal with RMS jitter of 813 fs integrated from 1 kHz to 20MHz. A power
spectrum analyzer (PSA E4440A) and a signal source analyzer (SSA E5052B)
were used to measure spectrum and phase noise performance, respectively.
The measured operating range of the DCO is 2GHz to 7GHz. We present
measurement results obtained at an output frequency of 5GHz, which corre-
sponds to 10Gb/s CDR operation. Fig. 4.16 illustrates the power spectrum
of the MDLL at an output frequency of 500MHz. The reference spur is about
-57 dB, which translates to a deterministic jitter of 0.28 ps [34]. The mea-
sured MDLL and DCO output phase noise plots are shown in Fig. 4.17. The
phase noise of the MDLL at 1MHz frequency offset from 500MHz carrier
frequency is -126 dBc/Hz and the integrated jitter from 1 kHz to 40MHz is
1.06 psrms.
The phase noise of the overall DCO (measured at the output of FNPLL)
at 1MHz frequency offset is -104 dBc/Hz and the integrated jitter from 1 kHz
to 40MHz is 1.41 psrms. With a fractional division ratio of 99.998 (output
frequency at 4.9999GHz), the worst case integrated jitter of the DCO is
2.30 psrms. The 20 dB increase in phase noise from the MDLL output to
DCO output is due to frequency multiplication by about 10 in the FNPLL.
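The 20 dB figure follows from the usual 20 log10(N) scaling of phase noise under ideal frequency multiplication by N. The short check below applies that rule to the measured MDLL number and also verifies the fractional division ratio quoted above; the helper name is hypothetical.

```python
import math


def multiplied_phase_noise(pn_dbc_hz: float, multiplication: float) -> float:
    """Ideal phase-noise scaling when frequency is multiplied by `multiplication`."""
    return pn_dbc_hz + 20.0 * math.log10(multiplication)


mdll_pn = -126.0   # dBc/Hz at 1 MHz offset from the 500 MHz carrier (from the text)
ideal_dco_pn = multiplied_phase_noise(mdll_pn, 10)
print(ideal_dco_pn)          # -106.0 dBc/Hz for an ideal x10 multiplication

# Fractional-N case quoted in the text: ratio 99.998 relative to the 50 MHz crystal.
f_out_ghz = 99.998 * 50e6 / 1e9
print(round(f_out_ghz, 4))   # 4.9999 GHz
```

The measured DCO value of -104 dBc/Hz lies within about 2 dB of this ideal estimate.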
Figure 4.16: Measured power spectrum of MDLL (annotated spur levels: -57 dB and -68 dB).
Figure 4.17: Measured phase noise performance of FNPLL (DCO).
4.4.2 FLL Results
The transient behavior of the frequency acquisition process is captured with
the SSA E5052B and the result is shown in Fig. 4.18. Note that DCO re-
sets to its lowest frequency at the beginning of the acquisition and the FLL
monotonically increases the DCO frequency until it acquires locking to the
desired data rate of 6Gb/s. The update step size of the DCO frequency in
this design is fixed to about 50 ppm, which resulted in the frequency acquisi-
tion time of about 230µs. Faster acquisition can be achieved by controlling
the update step size adaptively according to residual frequency error, which
is readily available in the form of digital code.
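The acquisition behavior described above can be mimicked with a short behavioral model. It is only a sketch under several assumptions: the BBPD is replaced by the run-length relation of Fig. 4.5(a), N = ρ F_DIN / (2 ∆F); the step is taken as 50 ppm of the target frequency; and the start frequency (2GHz), target (3GHz, half of 6Gb/s), ρ = 0.5, and NTH = 500 are illustrative choices. The frequency update period is not modeled, so the sketch counts updates rather than predicting acquisition time.

```python
def consecutive_el(rel_error: float, rho: float) -> float:
    """Run length of identical E/L decisions, per the relation in Fig. 4.5(a)."""
    return float("inf") if rel_error == 0 else rho / (2.0 * abs(rel_error))


def acquire(f_start_hz, f_target_hz, n_th=500, rho=0.5, step_ppm=50.0):
    """Raise the DCO frequency in fixed steps until the run length reaches n_th."""
    step_hz = f_target_hz * step_ppm * 1e-6
    f, updates = f_start_hz, 0
    while consecutive_el((f_target_hz - f) / f_target_hz, rho) < n_th:
        f += step_hz
        updates += 1
    residual_ppm = (f_target_hz - f) / f_target_hz * 1e6
    return updates, residual_ppm


updates, residual_ppm = acquire(f_start_hz=2.0e9, f_target_hz=3.0e9)
print(updates, round(residual_ppm))
```

With a fixed step, the number of updates scales with the initial frequency error, which is why an adaptive step size driven by the residual error code would shorten acquisition.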
Figure 4.18: Measured frequency acquisition process from initial frequency
to 6Gb/s data rate.
The lock detector declares frequency locking when the number of consecutive
Early/Late signals reaches the locking threshold NTH. Thereafter, the
D/PLL takes over the control and achieves phase locking. The seamless data
rate switching capability of the CDR is verified by changing the input data
rate from 6Gb/s to 9.5Gb/s and measuring the acquisition behavior (see
Fig. 4.19). When the data rate is switched, the loss-of-lock detector (LOLD)
detects the frequency difference, and triggers a new frequency acquisition
process by resetting the DCO frequency to its lowest frequency and activat-
ing the FLL. As illustrated in Fig. 4.19, the FLL relocks to the new data
rate (9.5Gb/s), thus validating the proposed continuous-rate CDR’s ability
to detect data rate switching automatically. Note that the transient time
while locking to a new data rate is dominated by the loss of lock detection
time. This long time is due to the LOLD choice in this particular design,
which adopts a 27-bit counter for better detection accuracy of frequency error
before initiating a reacquisition. Figure 4.6 suggests a possible method to
reduce LOL detection time. Note that a frequency error of about 1000 ppm
leads to a peak ACCE/L value of about 250. Therefore, reacquisition can be
initiated when this condition is detected, thereby drastically reducing LOLD
time to the order of a few microseconds. Under this condition, the transient time
for locking to a new data rate will be dominated by the reacquisition time, which
is about 600µs in this design.
Figure 4.19: Measured frequency acquisition process with data rate
switching from 6Gb/s to 9.5Gb/s.
The sensitivity of the proposed frequency acquisition scheme to variations
in input transition density is quantified by plotting the residual frequency
error, ∆F, versus locking threshold, NTH, for different transition densities
ranging from ρ = 1 to ρ = 0.32 (see Fig. 4.20). ∆F is equal to the frequency
difference between the DCO frequency after the FLL has locked and the
desired DCO frequency (equal to half the data rate). As expected, based on
the analysis in Section II, ∆F is maximum when ρ = 1 and monotonically
decreases for smaller values of ρ. Furthermore, for NTH greater than 500, ∆F
is less than 1000 ppm, independent of the transition density. Because the pull-
in range of the D/PLL is more than 1000 ppm, the proposed CDR’s frequency
acquisition behavior is not affected by the transition density, unlike the scheme
in [13]. While it may appear that ∆F can be reduced to arbitrarily small
values simply by setting NTH to be very large, in practice, FLL may not
achieve locking for too large a NTH since there may not be NTH number of
consecutive E/L signals within the frequency update period.
Figure 4.20: Measured residual frequency error versus locking threshold
NTH at different transition densities.
[Plot: residual frequency error [ppm] (log scale, 10 to 5000) versus FLL locking threshold NTH (100 to 700), for "...1010..." (ρ=1.0), PRBS7 (ρ=0.503937), PRBS15 (ρ=0.50002), PRBS31 (ρ=0.50000), and "...110000110000..." (ρ=0.32).]
To avoid this, NTH should be set only large enough that the resulting ∆F
is well within the pull-in range of the CDR. Figure 4.21 shows the residual
frequency error, ∆F, versus locking threshold, NTH, at different input jitter
amplitudes with PRBS7 input data. With NTH = 500, the residual frequency
error is less than 500 ppm for input jitter less than 0.3UI. Note that, with
0.3UI of input jitter, the frequency acquisition process is less robust when
NTH is 700, because the region for consecutive E/L signals is greatly reduced.
Figure 4.21: Measured residual frequency error versus locking threshold
NTH at different input jitter amplitudes with PRBS7 input data.
[Plot: residual frequency error [ppm] (log scale, 10 to 3000) versus FLL locking threshold NTH (100 to 700), for input jitter amplitudes of 0, 0.05UI, 0.10UI, 0.20UI, and 0.30UI.]
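The transition densities listed for the PRBS patterns in Fig. 4.20, and the jitter-free residual error relation of Fig. 4.5(a), are easy to evaluate numerically. The sketch below assumes maximal-length PRBS sequences, for which ρ = 2^(N-1)/(2^N - 1); the function names are illustrative.

```python
def prbs_transition_density(n: int) -> float:
    """Transition density of a maximal-length PRBS of length 2**n - 1."""
    return 2 ** (n - 1) / (2 ** n - 1)


def residual_error_ppm(rho: float, n_th: int) -> float:
    """Jitter-free residual frequency error, Delta F / F_DIN = rho / (2 * N_TH)."""
    return rho / (2 * n_th) * 1e6


for n in (7, 15, 31):
    print(f"PRBS{n}: rho = {prbs_transition_density(n):.6f}")
# PRBS7: rho = 0.503937
# PRBS15: rho = 0.500015
# PRBS31: rho = 0.500000

for n_th in (250, 500, 750):
    print(n_th, [f"{residual_error_ppm(rho, n_th):.0f}" for rho in (0.1, 0.5, 1.0)])
```

For NTH = 500 the relation gives 100, 500, and 1000 ppm at ρ = 0.1, 0.5, and 1.0, in line with the values tabulated in Fig. 4.5.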
4.4.3 CDR Results
The bit error rate (BER) performance of the CDR was characterized with dif-
ferent PRBS sequences using Agilent BERT N4901B. Input phase modulation
needed to measure JTRAN and JTOL was provided by Agilent E4433B RF
signal generator and the recovered clock jitter was measured using sampling
oscilloscope DSA8200. The CDR achieves error-free operation (BER < 10^-12)
across data rates ranging from 4Gb/s to 10.5Gb/s. The channel used for
characterizing the CDR contains 1-m coaxial SMA cable, 2-inch on-board
FR4 PCB trace, and parasitics associated with QFN88 package. The over-
all loss is about 5-to-6 dB at 5GHz. The measured jitter transfer (JTRAN)
function $\left(\Phi_{DIN}(s)/\Phi_{REF}(s)\right)$ magnitude response is shown in Fig. 4.22. Because
JGEN due to oscillator phase noise is greatly suppressed by the wide-bandwidth
fractional-N PLL, a very low JTRAN bandwidth was chosen to suppress
input jitter. The measured JTRAN bandwidth is about 0.2MHz. JTRAN was
also measured with different input jitter amplitudes ranging from 0.01UI to
more than 0.2UI (more than 20x variation) and the results are shown in
Fig. 4.22. As expected, JTRAN bandwidth is almost independent of input
jitter even while using a BBPD [19,23,31]. No JTRAN peaking was observed
at any input jitter amplitude.
Figure 4.22: Measured JTRAN with different input jitter amplitudes.
[Plot: |JTRAN| [dB] (-35 to 0) versus frequency [MHz] (0.01 to 5), for input jitter amplitudes of 0.01UI, 0.05UI, and 0.20UI.]
Measured jitter tolerance (JTOL) plot at 10Gb/s and 4Gb/s with PRBS7
input data is shown in Fig. 4.23. JTOL corner frequency is about 9MHz
at 10Gb/s (4MHz at 4Gb/s), which is much larger than JTRAN band-
width of 0.2MHz. Thus, the proposed digital D/PLL preserves the ben-
efit of decoupled JTRAN bandwidth and JTOL corner frequency present
in its analog counterpart [31]. JTOL is limited by DCDL range in 1.1-to-
2.5MHz frequency band at 10Gb/s (0.8MHz to 2.0MHz at 4Gb/s) [31],
while the low-frequency JTOL is restricted to 2UIpp at 10Gb/s (1.2UIpp
at 4Gb/s) due to instrument limitation. Measured long-term absolute jit-
ter of the recovered clock when the CDR is operating with PRBS31 input
data is 2.9 psrms/25.1 pspp at 4Gb/s and 2.2 psrms/24.0 pspp at 10Gb/s (see
Fig. 4.24).
The performance summary of the proposed CDR and its comparison to
state-of-the-art designs are shown in Table 4.1. Only the proposed scheme
and [56] can perform frequency acquisition without using an explicit fre-
quency detector. However, [56] is not suited for digital implementation
and it is not amenable to sub-rate CDR architectures. Further, the linear PD
used in [56] is not the preferred choice at high data rates. The proposed
CDR achieves the best power efficiency and lowest jitter among CDRs implemented
with ring oscillators [13,38]. Compared to LC oscillator-based CDRs
in [12, 31, 56], the power efficiency is superior but jitter is higher.
Figure 4.23: Measured jitter tolerance with PRBS7 input data at 10Gb/s
and 4Gb/s.
[Plot: JTOL [UIpp] (0.3 to 2.5) versus frequency [MHz] (0.1 to 50).]
Figure 4.24: Measured recovered clock jitter with PRBS31 input data: (a)
at 5Gb/s, and (b) at 10Gb/s.
Table 4.1: CDR performance summary and comparison with the
state-of-the-art designs
                      [12]        [56]        [38]        [13]        This work
Technology            0.35µm      0.18µm      65 nm       0.13µm      65 nm
Supply [V]            3.3         1.8         1.2/0.8     1.2         1.2/1.0
FD type               RFD         Linear PD   SRCG        DLL         BBPD
Data rate [Gb/s]      0.0125-2.7  8.2-10.3    0.5-2.5     0.65-8      4-10.5
Acq. time [µs]        < 800       < 200       N/A         N/A         < 600
Architecture          Full-rate   Full-rate   Half-rate   Full-rate   Half-rate
JTRAN [MHz]           0.5         4           N/A         N/A         0.2
Oscillator            LC          LC          Ring        Ring        Ring
Jitter [psrms/pspp]   0.4/8.0     0.4/12.3    5.4/44.0    9.7/53.3    2.2/24.0
Power [mW@Gb/s]       775@2.5     174@10.3    6.1@2       88.6@8      22.5@10
FoM [mW/Gb/s]         310         16.8        3.05        11.1        2.25
Area [mm^2]           9.0         0.54        0.39        0.11        1.63
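The FoM row of Table 4.1 is simply power divided by the data rate at which it was measured, which is numerically equal to energy per bit in pJ/bit. A quick recomputation from the Power row is sketched below; small rounding differences from the table entries are possible.

```python
designs = {  # (power in mW, data rate in Gb/s), taken from the Power row of Table 4.1
    "[12]": (775.0, 2.5),
    "[56]": (174.0, 10.3),
    "[38]": (6.1, 2.0),
    "[13]": (88.6, 8.0),
    "This work": (22.5, 10.0),
}

# mW per Gb/s is numerically equal to pJ/bit.
fom = {name: power / rate for name, (power, rate) in designs.items()}

for name, value in fom.items():
    print(f"{name:10s} {value:6.2f} mW/Gb/s")

best = min(fom, key=fom.get)
print("lowest FoM:", best)
```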
4.5 Summary
A continuous-rate clock and data recovery (CDR) with automatic frequency
acquisition and ring-oscillator-based wide-range low-noise DCO is presented.
Frequency detection is performed by using only the early/late outputs pro-
vided by a conventional BBPD. It is based on the simple observation that fre-
quency error is inversely proportional to the number of consecutive early/late
signals. Hence, frequency acquisition is achieved by adjusting DCO frequency
until the number of consecutive early/late signals reaches the desired thresh-
old. In contrast to divider-based SRCG scheme [13], the proposed method
can lock the CDR to within 1000 ppm of the data rate independent of input
data transition density.
A digital D/PLL CDR architecture is proposed to reduce the area penalty
of large loop filter capacitors present in the analog counterpart. The digital
implementation preserves the benefits of the analog D/PLL CDR such as de-
coupled jitter transfer (JTRAN) bandwidth and jitter tolerance (JTOL) cor-
ner frequency. Furthermore, JTRAN peaking and JTRAN bandwidth depen-
dence on BBPD gain are also eliminated. A ring-oscillator-based fractional-N
phase-locked loop (PLL) is used as a DCO to achieve both wide range and
low noise. This DCO also helps to alleviate the conflict between jitter gen-
eration (JGEN) and JTRAN bandwidth in conventional CDRs. Fabricated
in 65-nm CMOS technology, the prototype CDR operates without any errors
from 4Gb/s to 10.5Gb/s. At 10Gb/s, the CDR consumes 22.5mW of power
and achieves a JTRAN bandwidth of 0.2MHz and a JTOL corner frequency
of 9MHz. The proposed DCO has an operating range of 2GHz
to 7GHz and provides a 2.2 psrms recovered clock with a 10Gb/s PRBS31
input data sequence.
CHAPTER 5
AN ENERGY-PROPORTIONAL
SOURCE-SYNCHRONOUS LINK WITH
DVFS AND ROO TECHNIQUES
Aggregate data communication bandwidth is continuously expanding for
servers in data centers and mobile devices driven by the explosive growth
of data traffic and demand for increasing computation capabilities [57–59].
However, the thermal dissipation constraints (related to cooling cost in data
centers) and battery energy density (translated to mobile devices’ battery
life) increase at a much slower rate, raising a challenge of improving the en-
ergy efficiency of data communication links to sustain the growing trend in
bandwidth. Over the past decade, significant progress has been made in
improving the energy efficiency (energy-per-bit), as shown in Fig. 1.2(b) with
published data from major conferences and journals. Along with the benefits
from supply voltage scaling and technology development, efforts in improv-
ing link energy efficiency mainly focus on circuit-level optimization to reduce
the power consumption of link building blocks. Techniques like voltage mode
(VM) line drivers [30], low-swing differential signaling [30], ground-referenced
signaling (GRS) [60], charge-based sampler [61, 62], and resonant clock dis-
tribution [63] have demonstrated attractive efficiency. However, Fig. 1.2(b)
also clearly suggests a saturation of energy efficiency in recent years, partially
due to the slowing down of technology scaling, and also due to the limitation
of only relying on circuit-level techniques to reduce the power dissipation of
already optimized designs.
At system level, dynamic voltage and frequency scaling (DVFS) [21, 57,
64, 65] and burst-mode operation [65–68] are two promising techniques to
greatly improve link energy efficiency. By varying the supply voltage in accor-
dance with the desired data rate/workload, DVFS scales link power almost
cubically with data rate. Because the time constant associated with chang-
ing the output of a DC-DC converter that provides the optimal link supply
voltage is on the order of several microseconds if not longer [21], DVFS is
effective only when the rate of workload variations is slow. On the other
hand, burst-mode communication, implemented using rapid on/off (ROO)
links, linearly scales power consumption with effective data rate and is well
suited for interfaces where link inactive periods are short, on the order of a few
hundred nanoseconds or less. However, the energy efficiency of ROO links
degrades considerably at lower utilization levels due to leakage and static power
consumed in the off state. Hence, DVFS and ROO techniques are best suited
for workload variations with large and small time constants, respectively. In
practice, their effectiveness also greatly depends on the integrity of supply
voltage as it is stressed considerably more compared to always-on links op-
erating at a fixed supply voltage. In this chapter, we seek to combine DVFS
and ROO approaches along with robust supply voltage generation and reg-
ulation techniques to achieve excellent energy efficiency across a wide range
of data rates. Specifically, the challenges of agile link power management,
rapid on/off clock generator and proper timing for transceiver operation are
elaborated, along with potential solutions to address them.
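The different scaling laws of the two techniques can be made concrete with an idealized model. The sketch below assumes dynamic power proportional to V^2 f with the supply voltage tracking the data rate linearly (hence the cubic law), and models ROO as full power while on plus a residual off-state power. The 100mW peak power and 5mW off-state power are hypothetical numbers, not measurements.

```python
def dvfs_power(rate_fraction: float, p_peak_mw: float = 100.0) -> float:
    """Ideal DVFS: supply tracks data rate, so P ~ V^2 * f ~ f^3."""
    return p_peak_mw * rate_fraction ** 3


def roo_power(utilization: float, p_peak_mw: float = 100.0,
              p_off_mw: float = 0.0) -> float:
    """ROO: full power while on, p_off_mw (leakage/static) while off."""
    return utilization * p_peak_mw + (1.0 - utilization) * p_off_mw


print("fraction  DVFS[mW]  ideal ROO[mW]  ROO with 5 mW off-state[mW]")
for x in (1.0, 0.5, 0.25):
    print(x, dvfs_power(x), roo_power(x), roo_power(x, p_off_mw=5.0))
```

At low utilization the off-state term dominates the ROO power, which is the efficiency degradation noted above; a real DVFS link also departs from the cubic law once the supply reaches its minimum usable value.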
The rest of the chapter is organized as follows. Section 5.1 illustrates the energy
efficiency benefit of links with both DVFS and ROO, and how DVFS helps
extend the energy-proportional operation range. The circuit implementation
details including link power management, rapid on/off clock generator, and
energy-proportional transceiver are described in Section 5.2. The experimen-
tal results of a prototype transceiver are presented in Section 5.3 followed by
a summary of key contributions in Section 5.4.
5.1 Energy-Proportional Link with DVFS and ROO
The opportunity to cut power wastage at system level is embedded in the
data traffic (workload) profile (see Fig. 5.1(a)). It has been observed by
many researchers that data traffic in many real world applications is bursty
in nature with active time often followed by idle periods. As a result, links
are actively used for only 15-30% of the time [69, 70]. Currently,
because links are always kept on, 70 to 85% of the link bandwidth is wasted.
This mismatch between the link bandwidth and desired effective bandwidth
translates to power wastage. When there is no data traffic, one way to
reduce the power wastage is to simply turn off the link and rapidly turn on
the link when requests for data transfer are made (see Fig. 5.1(b)). In such
a scenario, the links will operate in either the on state or the off state and
ideally consume power only when they are in the on state. In other words,
link power consumption scales linearly with link utilization level. We will
refer to such links as rapid on/off links, or ROO links for short. In terms of
energy efficiency, under ideal conditions such as zero power in the off state
and zero wake-up time, ROO links have constant energy efficiency across all
utilization levels. This behavior is known as energy-proportional operation,
as illustrated by each horizontal line in Fig. 5.2 [67, 68, 71].
Figure 5.1: Cut link power/bandwidth wastage with DVFS and ROO
techniques.
In addition to the bursty nature, workloads also exhibit dynamic behavior
in terms of their intensity (see Fig. 5.1(b)). Based on the tasks performed
by the system, workloads can be broadly classified as being either heavy,
medium, or light. Link data rate must be high to serve heavy workload
while it can be lower to serve light loads. ROO links cut power wastage
during the idle periods. But because ROO links operate with a fixed peak
data rate at a fixed supply, they cannot exploit the dynamic behavior in
workload to further lower the power consumption. By combining ROO and
DVFS, both bursty and dynamic behavior of workloads can be fully exploited
to improve energy efficiency. As illustrated in Fig. 5.1(b), DVFS scales the
peak data rate at which ROO is performed. Therefore, wastage of link power
is further reduced, as compared to only rapid on/off operation.
Figure 5.2: Link energy efficiency with DVFS and ROO techniques.
In terms of energy efficiency, interestingly, combining DVFS and rapid on/off techniques
allows links to achieve better than constant energy efficiency as described
in Fig. 5.2. With a fine-grained DVFS operation, the link energy efficiency
follows the dashed red line from 10Gb/s to 3Gb/s. The horizontal blue line
stands for the energy efficiency of rapid on/off operation with different link
utilization levels at each peak data rate, and effective data rate is defined as:
Effective data rate = D × Peak data rate    (5.1)
where D corresponds to utilization level. In order to demonstrate how DVFS
improves energy-proportional operation, one may consider how an effective
data rate of 0.5Gb/s can be achieved. It is clear that 0.5Gb/s data rate
can be achieved with any given peak data rate and appropriately chosen
utilization level. For instance, it can be achieved by operating the link at
VDD1 with 5% on time, or at VDD5 with 50% on time. The latter case
improves energy efficiency by more than 2 times, thanks to the power savings
provided by DVFS. Without DVFS, the best energy efficiency to achieve
0.5Gb/s effective data rate is fixed at 7 pJ/bit (see footnote 1).
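Equation (5.1) can be inverted to find the utilization level needed for a target effective data rate at each peak rate. The sketch below does this for the 0.5Gb/s example; the 5% and 50% on-time cases in the text correspond, by Eq. (5.1), to peak rates of 10Gb/s and 1Gb/s. The function name and the intermediate 5Gb/s case are illustrative.

```python
def required_utilization(effective_gbps: float, peak_gbps: float) -> float:
    """Solve Eq. (5.1) for the utilization level D."""
    if effective_gbps > peak_gbps:
        raise ValueError("effective rate cannot exceed the peak rate")
    return effective_gbps / peak_gbps


for peak in (10.0, 5.0, 1.0):
    d = required_utilization(0.5, peak)
    print(f"peak {peak:4.1f} Gb/s -> D = {d:.0%}")
# peak 10.0 Gb/s -> D = 5%
# peak  5.0 Gb/s -> D = 10%
# peak  1.0 Gb/s -> D = 50%
```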
5.2 Circuit Implementation
A simplified complete transceiver diagram is shown in Fig. 5.3, employing a
half-rate source-synchronous architecture [72]. The link power management
circuit, including a DC-DC converter and several LDOs, conducts DVFS
operation. The rapid on/off operation on the transmitter is controlled by an
external wake-up signal, WKPTx, while the receiver wakes up automatically
by detecting common mode changes in the differential clock signal.
Fig. 5.4 depicts the detailed wake-up sequence with more transmitter and
receiver details. The wake-up processes of data and clock paths are slightly
different. In the data path on the transmitter side, the wake-up signal,
WKPTx, is retimed by the reference clock, FREF, to delay the turn-on instant
of the data driver by one extra reference cycle (2 ns). This staggered turn-on
process helps reduce simultaneous switching noise (SSN). In the clock path,
1 Interestingly, combining DVFS and ROO techniques also improves the
links’ energy-delay product for data communication. Appendix B has a detailed
analysis with the help of a queue model.
Figure 5.3: Block diagram of source-synchronous link with DVFS and rapid
on/off capabilities.
the clock generator turns on when the wake-up signal occurs. As the clock
generator requires more than 2 ns to provide a clean clock, delaying the turn-on
instant of the data driver does not add any penalty in terms of overall transceiver
wake-up time. When the clock generator turns on, the common-mode voltage of
the differential clock signal drops from the supply voltage in the off state to the
driver output common-mode voltage in the on state. The wake-up detection circuit
on the receiver side detects this common-mode change and generates a power-on
signal, WKPRx, to wake up the whole receiver. Inside the receiver, the
digital circuitry including the PRBS checker operates at 1/16th of the data
rate. The receiver has a four-cycle latency to synchronize the PRBS checker
before starting to evaluate a new set of data bits. During this time, the Error
signal stays high. The wake-up time of the transceiver is pessimistically
defined as the time it takes for the Error signal to go low and stay low.
5.2.1 DC-DC Converter
The main purpose of the DC-DC converter is to provide appropriate supply
voltage for link peak data rate requirement, set by the input reference voltage
VREF from system level depending on applications (for instance, processors
Figure 5.4: Wake-up process of the energy-proportional transceiver.
give out VREF in memory controller application). In addition to the high
efficiency requirement, energy-proportional links also demand fast response
from such DC-DC converters. Generally, the converters are implemented
either using linear pulse-width modulation (PWM) based controller [73] or
non-linear hysteretic control mechanisms. While PWM-based converters op-
erate with fixed frequency, their reference tracking ability is governed by the
control loop bandwidth, which is limited to a maximum of about 1/10th the
switching frequency, FSW [32]. As a consequence, their bandwidth can only
be increased with higher FSW which comes at the expense of degraded power
efficiency [74]. On the other hand, hysteretic control is easy to implement,
needs no external components for compensation, and has fast transient re-
sponse.
The DC-DC converter in this work employs a simple current-mode non-
linear hysteretic controller combined with a window-based shunt regulator
shown in Fig. 5.5(a). The added shunt regulator provides a direct path from
input voltage (VREF) to output voltage (VO) or from output voltage to ground
when the output falls outside of regulation window: VO < VSHL or VO >
VSHH, respectively. To ensure high power conversion efficiency, a relatively
low switching frequency (about 2MHz) is chosen for low switching loss, at
some cost of its reference voltage (VREF) tracking speed. Shunt regulation
improves tracking by providing large current during line transient. With
FSW=2MHz, L=4.7µH and C=10µF, simulation results shown in Fig. 5.5(b)
illustrate more than 10x improvement in tracking speed while maintaining
peak efficiency above 90%.
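The window-based shunt decision described above can be sketched as a simple comparison against the two thresholds. This is a behavioral illustration only, not the circuit implementation; the function name, the returned labels, and the example voltages are assumptions.

```python
def shunt_action(v_out: float, v_shl: float, v_shh: float) -> str:
    """Behavioral model of the window-based shunt regulator decision.

    Returns "source" when the output is below the window (current is
    supplied to the output), "sink" when it is above the window (current
    is drawn from the output to ground), and "off" inside the window,
    where only the hysteretic controller is active.
    """
    if v_shl >= v_shh:
        raise ValueError("window requires VSHL < VSHH")
    if v_out < v_shl:
        return "source"
    if v_out > v_shh:
        return "sink"
    return "off"


if __name__ == "__main__":
    # Hypothetical window of 0.95 V to 1.05 V around a 1.0 V target.
    for v in (0.80, 1.00, 1.20):
        print(f"VO = {v:.2f} V -> shunt {shunt_action(v, 0.95, 1.05)}")
```

Because the shunt path is active only outside the window, it assists during large reference steps while leaving steady-state efficiency to the low-frequency hysteretic loop.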
Figure 5.5: (a) Current-mode hysteretic converter. (b) Simulated line
transient response.
5.2.2 Rapid On/Off Clock Generator
The clock generator is usually the bottleneck of the wake-up process for rapid
on/off (ROO) operation [75], especially when a limited-bandwidth PLL is chosen to
provide the sampling clock [76]. In this source-synchronous transceiver, a
multiplying delay-locked loop (MDLL) located on the transmitter side serves as
the clock generator (see Fig. 5.4). An MDLL replaces every Nth oscillator edge
with a reference edge, where N is the clock multiplication factor [67, 77, 78].
This feed-forward edge replacement, by definition, results in instantaneous
phase locking, independent of bandwidth. This feature makes the
MDLL particularly suitable for ROO applications.
Fig. 5.6 shows the details of ROO clock generator implemented using a
supply-regulated multiplying delay-locked loop (MDLL). An NMOS source
follower-based LDO suppresses power converter output voltage ripple and
provides clean supply voltage to the MDLL. The ripple frequency is the same
as the converter switching frequency of about 2MHz. A native device is used
Figure 5.6: Block diagram of rapid on/off multiplying delay-locked loop
(MDLL) and timing diagram during wake-up process.
for the NMOS pass transistor to minimize dropout voltage and achieve fast
settling. The gate voltage for the pass transistor is generated by low-pass
filtering the converter output voltage, VDC−DC (or input reference voltage
VREF), as shown in Fig. 5.7. Compared to a classical error amplifier feedback
based topology, this open loop architecture provides faster load transient
response and better high-frequency power supply noise rejection (24 dB at
0.5MHz and higher) at the expense of regulation accuracy. As illustrated in
Fig. 5.7, if a filtered version of VREF is available for the gate voltage of the
pass transistor, the LDO achieves larger than 16 dB suppression across full
spectrum range. The same architecture is used for all the other LDOs.
Power-on lock time of an MDLL, ideally, can be very small (≈ TREF) if
the oscillator starts in a frequency locked condition with its free-running fre-
quency, FOSC, equal to the target frequency, FOUT. In practice, however, any
small initial frequency error introduces supply voltage ripple and increases
Figure 5.7: (a) Source follower (SF) -based low dropout (LDO) voltage
regulator. (b) Simulated PSRR of LDO.
lock time to more than 3 or 4 TREF [67, 75]. For instance, if FOSC > FOUT,
the oscillator completes the desired N cycles before the reference edge ar-
rives and stops oscillating. During this stop time, node voltage, VCTRL, gets
charged to a higher potential by the positive side resistive digital-to-analog
converter (P-RDAC). Upon the arrival of reference clock edge, the oscillator
starts oscillating again and VCTRL starts discharging. The large settling time
penalty incurred by these disturbances on VCTRL node can be mitigated if
the oscillator is not allowed to stop. To this end, as shown in the timing
diagram in Fig. 5.6, a programmable multi-modulus divider is used and its
modulus value is set to be greater than N (or less than N) if FOSC > FOUT
(or FOSC < FOUT) such that the oscillator never stops regardless of its initial
FOSC. As a result, the ripple on node VCTRL is eliminated and the MDLL
settles within one to two TREF. Similar to [79], an automatic de-skew
calibration loop is used to correct the input static phase offset between the edge
replacement path and the integral path and to reduce deterministic jitter (DJ) (see
Fig. 5.6). A 15-bit pseudorandom binary sequence (PRBS15) is adopted as
the gating signal for demultiplexing the BBPD output to the integral path and
the auto-deskew loop in order to suppress potential spurs resulting from
periodic gating behavior [79].
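The divider modulus choice during the first reference period can be sketched as below. The text only states that the modulus is set greater than N or less than N depending on the sign of the initial frequency error; the step of exactly one and the function name are assumptions made for illustration.

```python
def first_period_modulus(n: int, f_osc_hz: float, f_out_hz: float) -> int:
    """Pick the divider modulus for the first reference period after wake-up.

    If the free-running oscillator is fast (FOSC > FOUT), a modulus above N
    keeps it from finishing its count early and stopping; if it is slow, a
    modulus below N is chosen. The +/-1 step is an assumption.
    """
    if n < 2:
        raise ValueError("multiplication factor N must be at least 2")
    if f_osc_hz > f_out_hz:
        return n + 1
    if f_osc_hz < f_out_hz:
        return n - 1
    return n


if __name__ == "__main__":
    # Hypothetical case: N = 10, 5.0 GHz target, oscillator starts 4% fast.
    print(first_period_modulus(10, 5.2e9, 5.0e9))
```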
5.2.3 Transmitter
As illustrated in Fig. 5.4, the transmitter employs the matched source-synchronous
(MSSC) architecture [80] in order to minimize the impact of cycle-to-cycle
jitter from the rapid on/off clock generator during the wake-up process. The
delay mismatch is nulled by two phase interpolators (PIs), one in the clock path
and one in the data path. Fig. 5.8 illustrates the schematic of the PI, with a 2-bit
MSB for quadrant control and a 5-bit LSB for interpolation weight control.
The PI is gated by the wake-up signal, WKPTx, to save power during idle
periods, when it consumes only 10µA of bias current. An additional half-unit
current (0.5I0) is assigned to all four quadrants to improve PI wake-up speed and
linearity [33].
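The 7-bit PI control word described above can be decoded as in the following sketch. The bit ordering (quadrant in the two most significant bits) and the ideal linear mapping of 32 weight steps per 90-degree quadrant are assumptions; the sketch ignores the 0.5I0 offset current and any interpolator nonlinearity.

```python
def decode_pi_code(code: int) -> tuple[int, int]:
    """Split a 7-bit PI code into (quadrant, weight).

    Assumes the 2-bit quadrant select occupies bits [6:5] and the 5-bit
    interpolation weight occupies bits [4:0].
    """
    if not 0 <= code < 128:
        raise ValueError("PI code must fit in 7 bits")
    quadrant = code >> 5      # upper 2 bits
    weight = code & 0x1F      # lower 5 bits
    return quadrant, weight


def ideal_phase_deg(code: int) -> float:
    """Ideal output phase, assuming perfectly linear interpolation."""
    quadrant, weight = decode_pi_code(code)
    return 90.0 * (quadrant + weight / 32.0)


if __name__ == "__main__":
    for code in (0, 16, 32, 127):
        print(code, decode_pi_code(code), ideal_phase_deg(code))
```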
Figure 5.8: Schematic of 7-bit phase interpolator.
Fig. 5.9 shows the complete data path of the half-rate transmitter, including
a parallel pattern generator [81], a 16-to-1 serializer, a 3-tap finite impulse
response (FIR) filter, and a rapid replica biasing (RRB) circuit for the output data
drivers. In this design, a current mode logic (CML) driver is preferred
over its voltage mode (VM) counterpart for three main reasons: (i) CML
drivers achieve termination with passive resistors and do not require additional
supply regulators for impedance control; (ii) pre-emphasis is convenient to
implement in CML circuits, while the current efficiency benefits of
VM drivers diminish with equalization options [64]; (iii) CML circuits are
much less sensitive to supply variations caused by rapid on/off operation,
and the output swing is more controllable than that of VM drivers. The
data path CML driver is segmented into the basic unit shown in Fig. 5.10,
including a CMOS-to-CML level converter [82], a pre-driver, and an output driver.
The same driver structure is applied in the clock path to match the delay
between the clock and data paths and to achieve similar delay variations over PVT
changes.
Figure 5.9: Block diagram of energy-proportional transmitter with 3-tap
FIR filter.
A fast biasing circuit is also developed to improve the rapid on/off (ROO)
process. During ROO operation, settling of bias circuit not only influences
the turn-on time of the transmitter driver, but also decides its output swing.
Instead of relying on the settling of diode-connected current mirror with
a fixed bias current [68], a rapid on-off bias circuit with digital control is
introduced to improve the bias settling time and thus the transmitter turn-
on time [67]. But both schemes still cannot control output swing considering
the PVT variations. In this work, a rapid replica bias (RRB) circuit is
introduced not only to provide the bias voltage, VB, abruptly, but also to
control the output swing through the replica circuit. As shown in Fig. 5.11(a),
the RRB circuit consists of an always-on bias section with IBIAS of 10µA, a
replica bias section, and bias applying section (for instance the Tx output
driver). The operation of the RRB circuit is illustrated in Fig. 5.11(b) with
the staggered turn-on sequence to minimize SSN. In addition to the always-on
bias to maintain the biasing voltage VBP1 and VBP2 in the replica amplifier,
the current injection circuit also provides sufficient current, IINJ, to make sure
that the bias voltage settles before retimed wake-up signal, WKPRT, turns
on the Tx driver. Therefore, the RRB circuit achieves both fast biasing and
output swing control through replica operation.
Figure 5.10: Schematic of segmented CML output driver.
Figure 5.11: Schematic and settling process of rapid replica biasing (RRB)
circuit.
5.2.4 Receiver
Fig. 5.12 illustrates the data path of the receiver consisting of a wake-up
detector, amplifiers, clock division/distribution, data samplers, and deserial-
izers. The wake-up detector senses the common-mode drop of the differen-
tial input clock sent from the transmitter side through a forwarding chan-
nel [68], and generates a wake-up signal, WKPRx, to pull the whole receiver
out of power-down state. In normal operation, charge-based sense amplifier
(CSA) [62,83], low-swing (LS) latch and charge-based flip flop (CFF) [62] are
used for data samplers and deserializers to save about 40% power compared
to the full-swing CMOS logic.
Figure 5.12: Schematic of receiver data path.
The input limiting amplifier depicted in Fig. 5.13 adopts offset cancellation
to cover an input-referred offset range of about ±20mV. Load resistor
calibration is also employed to stabilize the gain over PVT variation (see
Fig. 5.13). Three main considerations are involved in component sizing to
achieve an almost constant load across different conditions: (i) R0 is chosen to
be the target resistor value at the minimum-resistance process corner; (ii) the
length of the resistor unit, L0, is kept the same to ensure consistent contact
resistance; (iii) W, W0, and the transistor size are scaled to vary the resistance.
The calibrated resistor covers a variation range of ±20%, and the simulated
resistor accuracy after calibration is within 3%. This improves the gain variation
of two cascaded CML stages from about 4 dB to less than 1 dB. In practice, the
closed-loop resistor calibration can be implemented using a reference resistor that
is already available in links that include a transmitter impedance tuning loop.
For rapid on/off operation, the RRB circuit is also applied to speed up the
bias settling process in both clock and data paths.
Figure 5.13: Schematic of Rx limiting amplifier with load resistor
calibration and offset cancellation.
5.3 Experimental Results
The prototype energy-proportional transceiver was fabricated in a 65 nm
CMOS process and occupies an active area of 2.4mm². The chip micrograph
is shown in Fig. 5.14. The die was packaged in an 88-pin QFN (QFN88)
package for measurement. The rest of this section reports the performance
of the link power management circuit, the rapid on/off MDLL, the always-on link
with DVFS, and the rapid on/off transceiver, in that order.
Figure 5.14: Micrograph of the energy-proportional transceiver prototype.
5.3.1 Link Power Management
For the DC-DC converter, both transient response behavior and power ef-
ficiency are evaluated. The window-based shunt regulator provides large
current needed to charge the output capacitor during transitions. As shown
in Fig. 5.15, the converter responds to a negative step on the reference volt-
age much faster when the shunt regulator is enabled, and similar behavior
was also observed for positive step response. In terms of power efficiency,
operating at about 2MHz switching frequency, the converter achieves above
90 percent peak efficiency at all output voltages from 1.3V to 0.7V (see
Fig. 5.16). The settling behavior of the source-follower based LDO was also
characterized, and the LDO output settled within 3 ns as shown in Fig. 5.17.
5.3.2 Rapid On/Off MDLL
In always-on mode, the MDLL could operate across a wide range of sup-
ply voltages (0.6-1.3V) and provides output frequencies ranging from 1.5 to
5.0GHz. The absolute jitter across the entire range is less than 1.6psrms
(see Fig. 5.18). At 5GHz, it consumes 6.2mW of power and achieves a jit-
Figure 5.15: Measured transient response of DC-DC converter: w/ and w/o
shunt regulator.
Figure 5.16: Measured power efficiency of DC-DC converter.
ter performance of 1.1 psrms/10.2 pspp. During rapid on/off operation, the
effectiveness of the proposed programmable divider was characterized in
Fig. 5.19. With a fixed division factor of 10 as shown on top, the oscillator
was stopped and the MDLL took a couple of reference cycles to reach steady
Figure 5.17: Measured settling behavior of low dropout (LDO) voltage
regulator.
state. When the division factor is set to 11 during the first reference period
as shown in the bottom, MDLL output clock settles almost instantaneously.
The zero-crossing points of MDLL output are also captured using real time
sampling scope. Post-analysis of the zero-crossing information demonstrates
that MDLL jitter settles within 6pspp after the first reference cycle (2 ns) (see
Fig. 5.20).
Figure 5.18: Measured MDLL performance across different supply voltages.
Figure 5.19: Measured MDLL settling behavior with programmable divider.
5.3.3 Always-on Link with DVFS
The dynamic voltage and frequency scaling (DVFS) performance of the link
is summarized in Fig. 5.21. Scaling the supply voltage from 1.3V to 0.9V
reduces the maximum data rate from 10Gb/s to 3Gb/s and improves energy
efficiency by more than 2 times (from 7.2 pJ/bit at 10Gb/s to 3.5 pJ/bit at 3Gb/s).
Note that amortizing the power of the clock path (about 35mW) across 8
lanes substantially improves the transceiver energy efficiency to 4.0 pJ/bit
at 10Gb/s. The transceiver bathtub plots (see Fig. 5.22) indicate an eye
opening of 0.4UI and 0.1UI at 6Gb/s and 10Gb/s, respectively and it is
Figure 5.20: Measured MDLL jitter settling during wake-up process.
almost independent of whether the transceiver supply voltage is provided
externally or by the DC-DC converter.
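The clock-path amortization quoted above can be checked with a short calculation. The sketch assumes that the 7.2 pJ/bit figure at 10Gb/s covers one data lane plus the full clock path, and that the data-path power per lane stays constant as lanes are added; the function name is hypothetical.

```python
def amortized_energy_pj_per_bit(
    single_lane_pj_per_bit: float,
    rate_gbps: float,
    clock_path_mw: float,
    lanes: int,
) -> float:
    """Energy per bit when one clock path is shared by several data lanes.

    Uses the identity 1 pJ/bit * 1 Gb/s = 1 mW.
    """
    if lanes < 1:
        raise ValueError("need at least one lane")
    single_lane_total_mw = single_lane_pj_per_bit * rate_gbps
    data_path_mw = single_lane_total_mw - clock_path_mw
    total_mw = lanes * data_path_mw + clock_path_mw
    return total_mw / (lanes * rate_gbps)


if __name__ == "__main__":
    # 7.2 pJ/bit at 10 Gb/s with about 35 mW of clock-path power, 8 lanes.
    print(f"{amortized_energy_pj_per_bit(7.2, 10.0, 35.0, 8):.2f} pJ/bit")
```

With these rounded inputs the result is roughly 4.1 pJ/bit, in line with the approximately 4.0 pJ/bit quoted in the text.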
Figure 5.21: Measured transceiver energy efficiency in DVFS mode.
[Plot: log10(BER) versus Phase [UI]. Legend: w/ Converter (6Gb/s), w/o Converter (6Gb/s), w/o Converter (10Gb/s)]
Figure 5.22: Measured source-synchronous link bathtub curves.
5.3.4 Rapid On/Off Transceiver
The rest of this section presents rapid on/off transceiver performance, start-
ing with measured transmitter on/off behavior. Figure 5.23(a) shows that
the transmitter driver takes about 600 ps to settle. Since the eye is always
open, even the first data bit could be potentially detected if the receiver is
capable of operating with this amplitude and varying common mode voltage.
The transmitter power-off transient was captured in Fig. 5.23(b). The 2 ns
inactive period on driver output was due to the staggered turning on/off se-
quence, in which the serializer turns off earlier than the driver and no other
data bit is available to transmit thereafter.
The on/off behavior of the complete transceiver was also evaluated using
two separate test chips: one configured as a transmitter and the other as
receiver. Receiver side waveforms captured with a real time sampling scope
are shown in Fig. 5.24(a). About 40 billion on/off transactions are captured
to confirm the robustness of the link on/off behavior. Fig. 5.24(b) enlarges
the result in Fig. 5.24(a) and reveals that error signal goes low 14 ns after the
wake-up signal, indicating that transceiver takes about 14 ns to turn on. This
14 ns includes the PRBS checker latency which is about 10.7 ns at 6Gb/s,
equivalent to 4 cycles of PRBS checker clock at one sixteenth of data rate.
Further analysis of the data pattern indicates an error only appears in the
first 3 data bits.
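The PRBS checker latency quoted above follows directly from the checker clock rate, as the following sketch shows. The function name and default arguments are illustrative; the divide-by-16 ratio and four-cycle latency come from the text.

```python
def checker_latency_ns(
    data_rate_gbps: float, divide_ratio: int = 16, cycles: int = 4
) -> float:
    """Latency of the PRBS checker, which runs at data rate / divide_ratio."""
    if data_rate_gbps <= 0 or divide_ratio <= 0 or cycles < 0:
        raise ValueError("rates and ratios must be positive")
    checker_clock_ghz = data_rate_gbps / divide_ratio
    return cycles / checker_clock_ghz  # cycles / GHz = ns


if __name__ == "__main__":
    latency = checker_latency_ns(6.0)
    print(f"checker latency at 6 Gb/s: {latency:.2f} ns")
    # Remainder of the measured 14 ns wake-up time.
    print(f"remaining wake-up time: {14.0 - latency:.2f} ns")
```

At 6Gb/s the checker clock is 375MHz, so four cycles take about 10.7 ns, leaving a little over 3 ns of the measured 14 ns for the rest of the wake-up sequence.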
Figure 5.23: Measured power on and off process of transmitter driver.
Figure 5.24: Measured power on and off behavior of complete link with less
than 14 ns wake up time.
In addition to the transceiver transient behaviors during on/off process, the
power scaling feature and energy-proportionality of the transceiver are also
characterized. Fig. 5.25 shows the power scaling feature of the transceiver.
With 128-byte data burst, the transceiver power scales almost linearly with
effective data rate (utilization level times peak data rate at certain supply
voltage). Specifically, for 500x change in data rate the transceiver power is
scaled by 220 times. Energy efficiency is also measured at different peak
data rates as illustrated in Fig. 5.26. For the same 500x change in data
rate, energy efficiency only varies by 2.2 times, from 6.2 pJ/bit at 8Gb/s to
14.1 pJ/bit at 16Mb/s. DVFS helps achieve this wide data rate scaling range
by improving energy efficiency at peak data rate, and by also reducing the
leakage power in the off state, especially at low supply voltage. A detailed
comparison of transceiver power consumption at 8Gb/s and 3Gb/s is given
in Fig. 5.27, with on-state power on the left, and off-state leakage and bias
power on the right. The off-state power is the main reason for the increase in
energy per bit during on/off operation. DVFS helps reduce the link off-state
power consumption by about 4.5 times from 8Gb/s to 3Gb/s, and extends
the energy-proportional operation range to 500x, from about 100x when only
rapid on/off is available [67, 68, 75].
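Why off-state power limits energy proportionality can be illustrated with a simple two-state average power model. This is a sketch under stated assumptions: it ignores the energy spent during wake-up and turn-off transitions, and the 0.3mW off-state value in the example is a placeholder rather than a measured number. The 46.8mW on-state power at 8Gb/s is the value reported for the prototype.

```python
def average_power_mw(utilization: float, p_on_mw: float, p_off_mw: float) -> float:
    """Average power of a link that is on for a fraction `utilization` of the time."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return utilization * p_on_mw + (1.0 - utilization) * p_off_mw


def energy_per_bit_pj(
    utilization: float, peak_gbps: float, p_on_mw: float, p_off_mw: float
) -> float:
    """Energy per bit at the effective data rate D * peak (mW / Gb/s = pJ/bit)."""
    effective_gbps = utilization * peak_gbps
    return average_power_mw(utilization, p_on_mw, p_off_mw) / effective_gbps


if __name__ == "__main__":
    p_on_mw, p_off_mw = 46.8, 0.3  # off-state value is a placeholder
    for d in (1.0, 0.1, 0.01):
        e = energy_per_bit_pj(d, 8.0, p_on_mw, p_off_mw)
        print(f"D = {d:5.2f}: {e:6.2f} pJ/bit")
```

In this model the energy per bit stays close to the always-on value until the off-state term becomes comparable to the utilization-weighted on-state power, which is why lowering off-state power through DVFS widens the energy-proportional range.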
Figure 5.25: Measured link power scaling capability with 500x range of
data rate (8Gb/s to 16Mb/s).
Figure 5.26: Measured link energy-proportional operation capability with
500x range of data rate (8Gb/s to 16Mb/s).
The performance of the transceiver and its comparison to state-of-the-art
designs are summarized in Table 5.1.
Table 5.1: Transceiver performance summary and comparison with the
state-of-the-art designs

                        [76]     [66]      [68]     [21]       This Work
Technology              40 nm    40 nm     65 nm    0.25µm     65 nm
Supply [V]              1.1      1.1       1/1.1    0.9-2.5    0.7-1.4
Peak data rate [Gb/s]   5.6      2.7-4.3   7        0.65-5.0   3.0-10.0
Power-on time [ns]      241.8    8         20       N/A        14
Efficiency [pJ/bit]     3.3      2.4       9.1      14.9-76    3.6-7.2
On-state power [mW]     14.2     13.44     63.7     9.7-380    10.8-72.4
Off-state power [mW]    N/A      ≈ 0       740      N/A        78-466
Energy prop. range      N/A      N/A       100x     N/A        500x
Regulator efficiency    N/A      N/A       N/A      83-94      82-93
Area [mm2]              0.92     N/A       1.7      0.63       2.37
This work has demonstrated the first energy-proportional wireline transceiver
that combines DVFS and rapid on/off (ROO) techniques. Using high-efficiency
integrated link power management and a rapid on/off clock generator,
the prototype transceiver wakes up in less than 14 ns (the MDLL settles in 2 ns,
which equals a single reference cycle) and achieves energy-proportional
operation over a 500x data rate range (from 8Gb/s to 16Mb/s).
PI: 41W
Serializer+Driver:
115WMDLL: 
120W
Rx: 20W 
Bias: 48W
PI: 
2.34mW
Serializer+Driver: 
5.15mW
MDLL: 
0.62mW
Rx: 
2.63mW
Bias: 
34W PI: 3.8W
Serializer+Driver:
11.8W
MDLL:
20.1W
Rx: 8.6W
Bias: 
34µW
PI: 
6.4mW
Serializer+Driver: 
24.8mW
MDLL:
4.2mW
Rx: 
11.4mW
Bias: 
48W
Figure 5.27: Comparison of measured transceiver on-state power and
off-state power at 8Gb/s and 3Gb/s.
5.4 Summary
A source-synchronous link transceiver is presented to demonstrate the energy-
proportional data communication link concept. The transceiver combines the
DVFS and rapid on/off (ROO) techniques, and fits well in the growing I/O
applications that require superior energy efficiency over a very wide range
of data rates and agile responses. In such links, DVFS is responsible for
providing better-than-linear scaling of power consumption down to a certain
data rate (3Gb/s in this prototype), while ROO operation at a fixed peak
data rate responds to the bursty nature of data communication with almost
linear power scaling down to very low effective data rates or utilization levels.
This work focuses on providing potential solutions for challenges in link
power management, rapid on/off clock generator, and synchronization control
for on/off operation. For power management, the DC-DC converter adopts
current-mode hysteretic controller with an auxiliary window-based shunt reg-
ulator to achieve both high efficiency and fast response. A rapid on/off MDLL
with a single reference cycle (2 ns) settling time is proposed as the clock gener-
ator, by mitigating the effects of supply ripple during on/off operation with
a programmable divider. For complete link operation, a matched source-
synchronous (MSSC) architecture is used to reduce the dependence of link
performance on clock jitter during settling and also simplify the receiver de-
sign. Furthermore, staggered turn-on sequence is explored to alleviate power
supply variation induced by simultaneous switching behaviors. Fabricated in
65 nm CMOS process, the prototype transceiver features a DC-DC converter
with above 90% efficiency over supply range from 1.3V to 0.7V and a clock
generator with a power-on time of single reference cycle (2 ns). The com-
plete link transceiver achieves less than 14 ns wake-up time, 500x (8Gb/s to
16Mb/s) energy-proportional range with only about 2x variation of energy
efficiency (5.9 pJ/bit to 14.1 pJ/bit), and 220x (46.8mW to 0.21mW) power
scaling capability.
CHAPTER 6
CONCLUSION
6.1 Conclusions
This thesis explored design techniques, at both circuit and system level, to
improve the link energy efficiency and introduce novel architecture to break
the trade-offs in classic designs. The main contributions are summarized as
follows:
In Chapter 3, a phase domain proportional path is introduced into a highly
digital receiver to decouple the dependence between jitter transfer band-
width and jitter tolerance corner frequency and eliminate the inherent peak-
ing in jitter transfer function. A phase-rotating phase-locked loop is also
proposed to implement highly linear phase interpolation in current domain,
and achieves a linearity of DNL < ±0.2 LSB, and INL < ±0.4 LSB.
In Chapter 4, a reference-less frequency acquisition scheme using bang-
bang phase detector (BBPD) is demonstrated with minimal hardware penalty
in digital CDRs. A digital D/PLL CDR is implemented to eliminate the
bulky loop filter capacitor and preserve the feature of decoupled jitter transfer
and jitter tolerance in its analog counterpart. Furthermore, to improve the
jitter for recovery clock in the case of ring oscillator in Chapter 3, a fractional-
N phase-locked loop (PLL) acts as a digitally controlled oscillator (DCO) and
leverages the PLL bandwidth (not the jitter transfer bandwidth in CDR loop
as in conventional case) to suppress the oscillator phase noise.
In Chapter 5, the energy-proportional operation concept is explored in data
communication. The first energy-proportional source-synchronous link that
combines dynamic voltage and frequency scaling (DVFS) and rapid on/off
(ROO) techniques is prototyped with integrated link power management
and rapid on/off clock generator. The transceiver achieves 500x (8Gb/s to
16Mb/s) energy-proportional range and 220x (46.8mW to 0.21mW) power
scaling capability. The complete link transceiver achieves less than 14 ns
wake-up time.
6.2 Future Work
In receiver (or CDR) prototypes, the proposed CDR architecture can be
easily applied in a parallel receiver to share the integral path that contains
the DCO. This not only amortizes the power/area consumption of the DCO,
especially if LC tanks have to be used for low-jitter applications, but also nulls
the frequency offset between the DCO and the input data rate, which commonly
exists in conventional dual-loop CDRs.
With a transmitter to provide a wide range of data rates, the receiver
architecture can be extended into high-performance transceivers with the
capability of operating at a wide range of data rates, and with flexibility in
selection of jitter transfer bandwidth and jitter tolerance corner frequency.
Regarding energy-proportional links, starting from this source-synchronous
link prototype, extension in the following two directions can have very high
impact on optimizing link power efficiency at system level. (i) In many sparse
data communication scenarios such as mobile applications where power con-
sumption is critical, source-synchronous energy-proportional links can really
help links adapt to dynamic workload and achieve long battery life. Chal-
lenges here are fully integrated solutions with constrained area and power
consumption, and intelligent data management for dynamic workload. (ii)
Rapid on/off solutions can be further explored for embedded clock link systems,
especially with lossy channels. Fast phase acquisition for the CDR in
such a scenario is not only interesting and challenging, but also has profound
practical influence. For instance, in display applications, as the resolution
becomes higher and higher, the aggregated bandwidth of links increases
dramatically (144Gb/s for next generation 8K displays) and so does the power
consumption. Energy-proportional operation greatly cuts power wastage for
such links, because the scan control signal can naturally serve as a duty cycle
sequence for the links with proper arrangement.
APPENDIX A
RELIABILITY ANALYSIS OF PROPOSED
FREQUENCY ACQUISITION SCHEME
This appendix aims to analyze the reliability of the proposed frequency acqui-
sition scheme based on bang-bang phase detector (BBPD) in Chapter 4. In
other words, this analysis shows the probability that, when residual frequency
error is within the target, the frequency-locking loop (FLL) will declare fre-
quency locking. Recall from Chapter 4 that the FLL claims frequency locking
when the consecutive Early/Late (E/L) number hits the locking threshold
NTH (see Fig. 4.4). The appendix tries to understand how reliably the FLL is
able to declare frequency locking when the residual frequency error between
input data, DIN, and the recovery clock, RCK, is within the target, taking
the random jitter in the system into account. All through the frequency
acquisition process in Fig. 4.4, the last update of DCO frequency is most
vulnerable because it takes the longest time to count and the random jitter
has the largest influence. During this process, the residual frequency offset
is given by (inferred from Eq. (4.2)):
\[
\Delta F = F_{DIN} \cdot \frac{\rho}{N_{TH}} \cdot \frac{\pi}{2\pi} \tag{A.1}
\]
In order to guarantee that the FLL can declare lock with this frequency offset,
NTH number of consecutive E/L signals should be counted. In other words,
as shown in Fig. A.1, the edge of RCK should always be located within the
same 1UI of DIN with the presence of random jitter.
In addition to presenting the relationship between the FLL locking reliability
and period jitter, this analysis also compares the FLL reliability with a
conventional DCO to that with the proposed DCO, which is based on a
fractional-N PLL, starting with the case of the conventional DCO.
Figure A.1: Sampling instance between RCK and DIN in the presence of
random jitter.
A.1 FLL Locking Reliability with Conventional DCO
Assume the distance between the RCK edge and the transition of DIN is a random
variable X, and the aggregate jitter follows a white Gaussian distribution with a
standard deviation σ (see Fig. A.1). In order to achieve NTH consecutive E/L
decisions, X needs to be smaller than 1UI for all those Early or Late decisions.
Taking into account the fact that the residual frequency offset, ∆F, also reduces
the margin by α = ∆F/FDIN UI with each increase in the E/L count, the probability
of getting NTH consecutive E/L decisions is given by:
\[
\begin{aligned}
P_{N_{TH}} &= P\Big((X_1 < 1-\alpha) \cap (X_1 + X_2 < 1-2\alpha) \cap \dots \cap \Big(\sum_{i=1}^{N_{TH}} X_i < 1 - N_{TH}\alpha\Big)\Big) \\
&= P(A_1 \cap A_2 \cap \dots \cap A_{N_{TH}}) \\
&= 1 - P(\bar{A}_1 \cup \bar{A}_2 \cup \dots \cup \bar{A}_{N_{TH}}) \\
&> 1 - \sum_{i=1}^{N_{TH}} P(\bar{A}_i)
\end{aligned}
\tag{A.2}
\]
Here A_i denotes the event that the i-th cumulative constraint holds, and
\bar{A}_i is its complement. As the oscillator operates in an open-loop manner
during frequency acquisition, the period jitter of the DCO accumulates after each
step. Note that the union bound is applied in the last inequality, since direct
evaluation of the probability is difficult. The approximation is good only when
each P(\bar{A}_i) is much smaller than 1 and the sum of P(\bar{A}_i) is also less
than 1. Intuitively, a small residual frequency error makes the approximation more
accurate. Fig. A.2 describes the reliability of FLL locking behavior, in other
words, the possibility of reaching NTH consecutive E/L signals. Specifically,
Fig. A.2(a) illustrates the probability of PNTH in Eq. (A.2). This shows the
reliability of the frequency locking loop in one update step. Even if the fre-
quency acquisition logic fails to declare locking in one step and the DCO
frequency increases by one LSB (50 ppm in this design), the FLL can still
lock before the frequency goes beyond the target frequency. In other words,
the FLL is able to lock unless all the following steps fail. Therefore, the
overall reliability of the FLL is lower bounded by:
\[
P_{overall} \geq 1 - (1 - P_{N_{TH}})^{\frac{\Delta F}{\Delta_{LSB} \times F_{DIN}}} \tag{A.3}
\]
Fig. A.2(b) takes this into account and provides the FLL locking reliability.
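The overall reliability bound can be evaluated numerically as sketched below. The sketch reads Eq. (A.3) as one minus the probability that every remaining one-LSB update step fails, with the number of remaining steps given by the residual offset divided by the LSB frequency step. The function name and the example inputs other than the 50 ppm LSB are assumptions.

```python
def overall_lock_lower_bound(
    p_step: float, delta_f_hz: float, lsb_fraction: float, f_din_hz: float
) -> float:
    """Lower bound on overall FLL locking probability, per Eq. (A.3).

    p_step       : one-step locking probability (P_NTH)
    delta_f_hz   : residual frequency offset
    lsb_fraction : DCO LSB as a fraction of F_DIN (50e-6 for 50 ppm)
    f_din_hz     : input data rate expressed as a frequency
    """
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be a probability")
    remaining_steps = delta_f_hz / (lsb_fraction * f_din_hz)
    return 1.0 - (1.0 - p_step) ** remaining_steps


if __name__ == "__main__":
    # Hypothetical case: 10 GHz input, 50 ppm LSB, 2 MHz residual offset,
    # and a one-step locking probability of 0.9.
    print(overall_lock_lower_bound(0.9, 2e6, 50e-6, 10e9))
```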
Figure A.2: FLL locking reliability versus locking threshold NTH: (a) one
step reliability, (b) overall reliability.
Fig. A.2 also shows the relationship between FLL locking reliability and
data transition density ρ. Intuitively, with the same locking threshold NTH,
larger transition density ρ corresponds to larger residual frequency offset;
thus the sampling margin shrinks faster during locking process and leads to
lower locking reliability/probability.1 This intuition holds for all the results
hereafter as well. The above analysis assumes a period jitter of 1%UIrms
(1 psrms at a 10Gb/s data rate). The relationship between FLL locking reliability
and clock period jitter is also explored, and the results are summarized in
Fig. A.3. With locking threshold of NTH, the frequency acquisition logic can
reliably declare frequency locking in one step when period jitter is less than
0.5%UIrms (see Fig. A.3(a)). The overall FLL can lock with high probability
when period jitter is less than 1%UIrms (see Fig. A.3(b)).
A.2 FLL Locking Reliability with Fractional-N
PLL-based DCO
The key difference between these two kinds of DCO is that the fractional-N
PLL-based DCO does not suffer from continuous phase noise accumulation
as the frequency detection time increases. The PLL loop suppresses the jitter
accumulation once the acquisition time exceeds the time constant
corresponding to the PLL bandwidth. Therefore, the probability of getting
NTH consecutive E/L signals is given by:
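\[
\begin{aligned}
P_{N_{TH}} &= P\big((X_1 < 1-\alpha) \cap (X_2 < 1-2\alpha) \cap \dots \cap (X_{N_{TH}} < 1 - N_{TH}\alpha)\big) \\
&= P(X_1 < 1-\alpha)\, P(X_2 < 1-2\alpha) \cdots P(X_{N_{TH}} < 1 - N_{TH}\alpha) \\
&= \prod_{i=1}^{N_{TH}} P(X_i < 1 - i\alpha)
\end{aligned}
\tag{A.4}
\]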
where α = ∆F/FDIN is the reduction of sampling margin in each sample
due to the residual frequency error, and the period jitter of the DCO is only
the period jitter of the PLL, without accumulation. Fig. A.4 describes the FLL
locking reliability, as before, in terms of both one-step and overall
reliability. The relationship between FLL locking reliability and clock period
jitter is also explored, and the results are summarized in Fig. A.5. With a
locking threshold of NTH, the frequency acquisition logic can reliably declare
frequency locking in one step when the period jitter is less than 10%UIrms
(see Fig. A.5(a)). The overall FLL can lock with high probability when the
period jitter is less than 20%UIrms (see Fig. A.5(b)).

1 This analysis only takes into account the transition density's influence on
the residual frequency offset according to Chapter 4. Ideally, the statistical
properties of the data transitions should also be included; however, a
complete analysis would be too tedious. For empirical simulation, there is a
straightforward way to include data transition statistics by modeling them as
a Bernoulli random process with the transition density ρ as its parameter.

Figure A.3: FLL locking reliability versus period jitter: (a) one step
reliability, (b) overall reliability.

Figure A.4: FLL locking reliability versus locking threshold, NTH, with
fractional-N PLL-based DCO: (a) one step reliability, (b) overall reliability.
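The contrast between Eq. (A.2) and Eq. (A.4) can be sketched numerically. The Python snippet below is a minimal illustration under a loud assumption that the text does not state: the jitter samples Xi are taken to be independent zero-mean Gaussian variables with standard deviation `sigma`, normalized to the same unit as the sampling margin. The function names and the example values of `n_th`, `alpha`, and `sigma` are hypothetical.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) = P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def one_step_bound_accumulating(n_th, alpha, sigma):
    """Union-bound lower limit of Eq. (A.2), assuming i.i.d. Gaussian X_i.

    The accumulated jitter X_1 + ... + X_i then has standard deviation
    sigma * sqrt(i), and the margin after i samples is 1 - i * alpha.
    """
    fail_sum = sum(
        q_function((1.0 - i * alpha) / (sigma * math.sqrt(i)))
        for i in range(1, n_th + 1)
    )
    return max(0.0, 1.0 - fail_sum)


def one_step_non_accumulating(n_th, alpha, sigma):
    """Product of Eq. (A.4), assuming independent Gaussian X_i of std sigma."""
    prob = 1.0
    for i in range(1, n_th + 1):
        prob *= 1.0 - q_function((1.0 - i * alpha) / sigma)
    return prob


# Hypothetical settings: margin and jitter normalized to the same unit.
for sigma in (0.05, 0.1, 0.2):
    acc = one_step_bound_accumulating(n_th=8, alpha=0.01, sigma=sigma)
    pll = one_step_non_accumulating(n_th=8, alpha=0.01, sigma=sigma)
    print(f"sigma={sigma}: accumulating >= {acc:.4f}, PLL-based = {pll:.4f}")
```

Under these assumptions the non-accumulating case tolerates noticeably more jitter for the same threshold, which is the qualitative trend reported in Fig. A.3 and Fig. A.5; the sketch is not expected to reproduce the plotted values.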
To summarize, in addition to improving the recovered clock jitter, this analysis
shows that the fractional-N PLL-based DCO in Chapter 4 also increases the FLL
locking reliability. Specifically, with the fractional-N PLL-based DCO,
the FLL can declare lock reliably with 20x more jitter than the FLL
with a conventional DCO.
Figure A.5: FLL locking reliability versus period jitter with fractional-N
PLL-based DCO: (a) one step reliability, (b) overall reliability.
APPENDIX B
ANALYSIS OF LINKS WITH DVFS AND
ROO TECHNIQUES USING QUEUE
MODEL
Links with DVFS and/or ROO techniques are closely analyzed under the
framework of a queue model [84]. In analogy to the queue model for commu-
nication/computation networks, a queue model for serial links is constructed
in Fig. B.1. The model includes a queue with first-in-first-out (FIFO) prin-
ciple and a link to conduct the service for data communication.
Figure B.1: Queue model for serial links.
Assume that packets arrive according to a Poisson process with an average
arrival rate of λ, and that the service time of the link is exponentially
distributed with an average service rate of µ. The overall waiting time is
T = tQ + tL, including the waiting time in the queue (tQ) and the service time
of the link (tL). The number of packets in the system is N = NQ + NL, including
both packets waiting in the queue (NQ) and being served in the link (NL). The
state transition diagram of the queue, which has infinite length, is given in
Fig. B.2. In steady state, the state probabilities satisfy the following
relationships:
\[ \lambda p_0 = \mu p_1 \tag{B.1} \]
\[ (\lambda + \mu) p_i = \lambda p_{i-1} + \mu p_{i+1} \tag{B.2} \]
Figure B.2: Queue model for serial links.
where $p_i$ is the probability of State i in steady state. Eq. (B.2) suggests that:
\[ \lambda p_i - \mu p_{i+1} = \lambda p_{i-1} - \mu p_i = \text{constant} \quad \text{for } i = 1, 2, \dots \tag{B.3} \]
Together with Eq. (B.1), it implies that:
\[ \text{constant} = \lambda p_i - \mu p_{i+1} = 0 \quad \text{for } i = 0, 1, 2, \dots \tag{B.4} \]
Thus, the probability of each state is:
\[ p_i = p_0 \left(\frac{\lambda}{\mu}\right)^i = p_0 \rho^i \quad \text{for } i = 0, 1, 2, \dots \tag{B.5} \]
where ρ = λ/µ. Note that the probabilities of all states sum to unity,
which (for ρ < 1) leads to:
\[ 1 = \sum_{i=0}^{\infty} p_i = \frac{p_0}{1-\rho} \tag{B.6} \]
Combining Eqs. (B.5) and (B.6), the state probabilities depend only on ρ:
\[ p_i = (1-\rho)\rho^i \quad \text{for } i = 0, 1, 2, \dots \tag{B.7} \]
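With this, the expected number of packets, N, in the system is:
\[
\begin{aligned}
E[N] &= \sum_{i=0}^{\infty} i\, p_i && \text{(B.8)} \\
&= \sum_{i=0}^{\infty} i\,(1-\rho)\,\rho^i && \text{(B.9)} \\
&= \frac{\rho}{1-\rho} && \text{(B.10)}
\end{aligned}
\]
Applying Little's formula [84], the expected waiting time in the whole link
system can also be obtained:
\[ E[T] = \frac{E[N]}{\lambda} = \frac{1}{\mu - \lambda} \tag{B.11} \]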
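The closed-form results in Eqs. (B.7), (B.10), and (B.11) can be checked with a few lines of Python. This is a minimal sketch; the function names and the example rates (λ = 1, µ = 2 in arbitrary units) are assumptions made for illustration.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state M/M/1 results of Eqs. (B.5), (B.10), and (B.11)."""
    if not 0.0 < arrival_rate < service_rate:
        raise ValueError("a stable queue needs 0 < arrival_rate < service_rate")
    rho = arrival_rate / service_rate
    mean_packets = rho / (1.0 - rho)                 # E[N], Eq. (B.10)
    mean_wait = 1.0 / (service_rate - arrival_rate)  # E[T], Eq. (B.11)
    return rho, mean_packets, mean_wait


def state_probability(rho, i):
    """p_i = (1 - rho) * rho**i, Eq. (B.7)."""
    return (1.0 - rho) * rho ** i


rho, mean_packets, mean_wait = mm1_metrics(arrival_rate=1.0, service_rate=2.0)
print(rho, mean_packets, mean_wait)
# Little's formula check: E[N] = lambda * E[T]
print(abs(mean_packets - 1.0 * mean_wait) < 1e-12)
```

The stability check reflects the condition ρ < 1 that is implicit in the geometric sum of Eq. (B.6).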
Now, consider an example of the link queue model with a service rate of 1.1
times the arrival rate (µ = 1.1λ). Based on the above equations, ρ = 10/11,
E[N] = 10 packets, and E[T] = 10 units of time, where one unit of time is the
mean packet inter-arrival time 1/λ. One interesting point to consider is the
practical limitation on queue length and the probability of overflow in the
queue. In this example, if the overflow probability needs to be kept within
1%, the following probability must be at most 1%:
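\[
\begin{aligned}
P(N > N_{OW}) &= 1 - \sum_{i=0}^{N_{OW}} p_i && \text{(B.12)} \\
&= 1 - \sum_{i=0}^{N_{OW}} (1-\rho)\rho^i && \text{(B.13)} \\
&= \rho^{N_{OW}+1} && \text{(B.14)} \\
&\leq 1\% && \text{(B.15)}
\end{aligned}
\]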
where NOW is the limit on the number of packets before overflow happens.
Fig. B.3 describes the relationship between queue length and the probability
of overflow. It can be inferred that a length of about 50 is needed to
guarantee less than 1% overflow probability. Thus far, we have constructed the
basic queue model for link systems. The rest of the appendix uses this model
to evaluate the mean waiting time, E[T], and energy delay product (EDP) while
applying DVFS and ROO techniques. Specifically, a comparison between DVFS
and ROO is presented, followed by a discussion of the benefit of combining
DVFS and ROO techniques.

Figure B.3: Queue length versus the probability of overflow (expected
queue length is E[N]=10).
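The queue limit implied by Eqs. (B.14) and (B.15) can also be computed directly rather than read from Fig. B.3. The following Python sketch does so for the example ρ = 10/11; the function names are hypothetical.

```python
import math


def overflow_probability(rho, n_ow):
    """P(N > N_OW) = rho ** (N_OW + 1), Eq. (B.14)."""
    return rho ** (n_ow + 1)


def min_queue_limit(rho, target):
    """Smallest integer N_OW with rho ** (N_OW + 1) <= target."""
    n_ow = max(0, math.ceil(math.log(target) / math.log(rho)) - 1)
    # Guard against floating-point rounding right at the boundary.
    while overflow_probability(rho, n_ow) > target:
        n_ow += 1
    return n_ow


rho = 10.0 / 11.0  # example from the text: mu = 1.1 * lambda
n_ow = min_queue_limit(rho, target=0.01)
print(n_ow, overflow_probability(rho, n_ow))
```

For ρ = 10/11 and a 1% target this gives NOW = 48, in line with the value of about 50 quoted from Fig. B.3.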
B.1 Comparison between DVFS and ROO
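We first compare the mean waiting time of the DVFS and ROO techniques for
different FIFO queue lengths (in practice, the buffer size before the link).
According to Little's formula, the expected queue length E[NQ] can be
derived:
\[
\begin{aligned}
E[N_Q] &= \lambda E[t_Q] && \text{(B.16)} \\
&= \lambda (E[T] - E[t_L]) && \text{(B.17)} \\
&= E[N] - \lambda E[t_L] && \text{(B.18)} \\
&= \frac{\rho^2}{1-\rho} && \text{(B.19)}
\end{aligned}
\]
The last step uses E[N] = ρ/(1 − ρ) from Eq. (B.10) and E[tL] = 1/µ.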
With expected queue length E[NQ] = 10, 1.0, or 0.5, respectively, Fig. B.4
depicts the mean waiting time E[T] of the DVFS and ROO techniques versus the
effective data rate. As described in Chapter 5, in the ROO case, the
effective data rate is the product of the peak data rate and the utilization
level. In this calculation, the arrival rate is scaled with the utilization
level and the service rate is kept constant. In the DVFS case in Fig. B.4, the
mean waiting time is obtained by scaling both the arrival rate and the service
rate according to utilization.
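The two scaling rules described above can be sketched with Eq. (B.11). The Python snippet below is an illustration under stated assumptions: E[NQ] is taken to fix ρ at the peak data rate (by inverting Eq. (B.19)), the peak service rate is normalized to 1, and all function and variable names are hypothetical. It is not the script used to generate Fig. B.4.

```python
import math


def rho_from_queue_length(mean_queue_len):
    """Invert E[N_Q] = rho**2 / (1 - rho), Eq. (B.19), for 0 < rho < 1."""
    q = mean_queue_len
    return (-q + math.sqrt(q * q + 4.0 * q)) / 2.0


def wait_roo(utilization, peak_arrival, peak_service):
    """ROO: arrival rate scales with utilization, service rate stays at peak."""
    return 1.0 / (peak_service - utilization * peak_arrival)


def wait_dvfs(utilization, peak_arrival, peak_service):
    """DVFS: arrival and service rates both scale with utilization."""
    return 1.0 / (utilization * (peak_service - peak_arrival))


# Assumption: E[N_Q] fixes rho at the peak data rate; peak service rate = 1.
peak_service = 1.0
peak_arrival = rho_from_queue_length(1.0) * peak_service
for u in (0.2, 0.5, 1.0):
    t_roo = wait_roo(u, peak_arrival, peak_service)
    t_dvfs = wait_dvfs(u, peak_arrival, peak_service)
    print(f"u={u}: E[T] ROO={t_roo:.2f}, DVFS={t_dvfs:.2f}")
```

Because the DVFS service rate shrinks together with the arrival rate, its waiting time is never shorter than that of ROO under these assumptions, and the two coincide at full utilization.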
Figure B.4: Expected waiting time comparison between DVFS and ROO
with different expected queue length (E[NQ]).
As expected, E[T] for DVFS is larger than that of ROO in all three cases.
One trend to notice is that, as the buffer length decreases, the ratio of the
mean waiting time between DVFS and ROO decreases as well (see Fig. B.7(a)).
Energy delay product (EDP) is another important metric commonly used to
evaluate communication systems. One basic definition is:
\[ \text{EDP} = \text{Energy/bit (pJ/bit)} \times \text{Mean waiting time } (E[T]) \tag{B.20} \]
In order to obtain EDP, the measured link energy efficiency in Chapter 5
is used, replotted in Fig. B.5 for convenience. Specifically, for this
source-synchronous link transceiver, DVFS improves energy efficiency by 2x from
10Gb/s peak data rate to 3Gb/s. Fig. B.6 shows the comparison of EDP
between DVFS and ROO links. Due to the energy efficiency benefit of
DVFS, the ratio of EDP between DVFS and ROO is reduced, as shown by the
difference between Fig. B.7(a) and Fig. B.7(b). However, the absolute EDP with
the DVFS technique is still worse than with the ROO technique. This suggests
that ROO is beneficial for latency-sensitive applications, while the DVFS
technique has more potential to save power in latency-insensitive scenarios;
a similar conclusion has also been reached in [85].

Figure B.5: Measured transceiver energy efficiency at different peak data
rates in DVFS mode (replotted from Fig. 5.21 for convenience).
B.2 Combine DVFS and ROO
Most previous work has focused on leveraging the benefits of either DVFS
[21,57,64] or ROO [67, 68, 75] alone. In this section, the proposed scheme of
combining DVFS and ROO techniques [72] in Chapter 5 is evaluated in terms
of mean waiting time and EDP.
With expected queue length E[NQ] = 1, Fig. B.8 shows the mean waiting
time E[T] in the link system when DVFS and ROO techniques are combined.
Specifically, with the mean waiting time in the DVFS-only condition plotted
for reference, ROO is combined with DVFS from 0.2x to 1.0x of the maximum
data rate. Based on the analysis in the previous section, the mean waiting
time decreases as the peak data rate for ROO increases, approaching
the best case when the peak data rate is maximum.
The benefit of combining DVFS and ROO becomes clear in EDP, as
illustrated in Fig. B.9. The EDP in the DVFS-only case is also presented as a
reference. In contrast to the results for mean waiting time, the EDP is
not minimal when operating ROO under the maximum DVFS condition (i.e.,
DVFS(1.0)). The energy efficiency benefit of DVFS compensates for its long
waiting time, resulting in lower EDP in the region below the ROO-only curve
(i.e., DVFS(1.0)+ROO). This promises benefits in both energy efficiency and
EDP by combining DVFS and ROO techniques.

Figure B.6: Energy delay product comparison between DVFS and ROO
with different expected queue length (E[NQ]).
B.3 Simulation Results
In addition to the above analysis, this section conducts time-domain
simulation to compare the DVFS and ROO techniques, and further evaluates
the benefits of combining DVFS and ROO. The two main elements of the
simulation setup are:
Queue model: As in the above analysis, an M/M/1 queue with the FIFO
principle is used in simulation. Packet arrivals follow a Poisson process
(exponentially distributed inter-arrival times), and the service time is
exponentially distributed.

Figure B.7: DVFS and ROO with different expected queue length (E[NQ]):
(a) ratio of expected waiting time, (b) ratio of energy delay product.

Figure B.8: Expected waiting time with combined DVFS and ROO with
expected queue length E[NQ] = 1.

Figure B.9: Energy delay product with combined DVFS and ROO with
expected queue length E[NQ] = 1.

Link power model: Both the link power model in Chapter 5 for the
source-synchronous link [72] and the reported power details in [30] for an
embedded clock link (operating at 6.25Gb/s in 90 nm) are evaluated in
simulation. Similar trends are observed, and we will focus on the case with
power details in [30],1 and the following assumed power model.
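1) For DVFS, the α-power law model for MOSFETs [86] is assumed with the
following main features:2
\[ F_{CLK} \propto \frac{(V_{dd} - V_{th})^{\alpha}}{V_{dd}}, \quad (\alpha = 1 \text{ in simulation}) \tag{B.21} \]
\[ P_{active} = V_{dd}^{2} F_{CLK} \quad \text{for digital circuit} \tag{B.22} \]
\[ P_{active} = V_{dd} I_{DC} \quad \text{for analog circuit} \tag{B.23} \]
\[ V_{dd,max} = 1\,\text{V} \tag{B.24} \]
\[ V_{th} = 0.3\,\text{V} \tag{B.25} \]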
Practically, considering the limits on low voltages, we assume the data rate
range for DVFS is:
\[ \frac{\text{Minimum peak data rate}}{\text{Maximum peak data rate}} = \frac{1}{5} \tag{B.26} \]
Lastly, the DC-DC converter for link power management is assumed to have
90% power efficiency.
2) For ROO, the following conditions are assumed in simulation:
\[ P_{idle} = 0.01\,P_{active} \]
\[ \text{Exit latency} = \frac{10 \times \text{Mean packet size}}{\text{Peak data rate}} \tag{B.27} \]
A brief summary and discussion of the simulation is presented below.
Fig. B.10 compares the EDP of always-on, DVFS (in which DVFSEpB
is optimized for energy-per-bit and DVFSEDP is optimized for energy delay
product), and ROO links. As expected from the analysis in the previous
section, the ROO technique is more favorable in latency-sensitive applications.
(Some interesting simulation results with only DVFS and only ROO can also
be found in [85].)

1 Along with the analysis with the measured power/energy efficiency in the
previous section for the source-synchronous link, we show that similar results
can be obtained with both source-synchronous and embedded clock links.
2 See a more detailed analysis of power and energy efficiency scaling
according to the α-power law in Appendix C.
[Figure B.10 plot. Horizontal axis: Normalized Average Data Rate (0 to 1).
Vertical axis: Norm. Energy-Delay Product (log scale, 10^0 to 10^4).
Traces: Always On, Ideal ROO, ROO, DVFSEpB, DVFSEDP.]
Figure B.10: Simulated link energy delay product with M/M/1 queue
model.
Fig. B.11 summarizes the simulated link performance with combined DVFS
and ROO techniques using the M/M/1 queue model. The energy efficiency is
illustrated in Fig. B.11(a), with each trace corresponding to ROO operation
at a certain peak data rate controlled by DVFS. The rising trend in the
energy-per-bit at low average data rates is due to the power consumption in
the idle state, which is consistent with the measurement result in Fig. 5.26.
The simulated EDP is detailed in Fig. B.11(b). This result for an embedded
clock link is consistent with the analysis based on the measured energy
efficiency of the source-synchronous link in the previous section. The result
suggests a decision boundary: on one side, the ROO technique alone is better
because of its lower EDP, and on the other side, combining DVFS and ROO is
favorable.
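The rise in energy per bit at low average data rates follows from the idle-power assumption alone. The Python sketch below illustrates this with Pidle = 0.01 Pactive as assumed for ROO; it deliberately ignores the exit latency of Eq. (B.27) and any DVFS scaling, and the function name and normalization are hypothetical.

```python
def roo_energy_per_bit(utilization, p_active=1.0, idle_ratio=0.01, peak_rate=1.0):
    """Average energy per bit of an ROO link, ignoring exit latency.

    The link is active for a fraction `utilization` of the time and idles
    otherwise, with P_idle = idle_ratio * P_active as in the ROO assumptions.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    avg_power = p_active * (utilization + (1.0 - utilization) * idle_ratio)
    avg_rate = utilization * peak_rate
    return avg_power / avg_rate


# Normalized to the energy per bit at full utilization.
for u in (1.0, 0.5, 0.1, 0.01):
    print(f"utilization={u}: normalized energy/bit = {roo_energy_per_bit(u):.2f}")
```

With a 1% idle power, the energy per bit roughly doubles once utilization drops to 1%, because the idle power is then comparable to the average active power.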
In summary, this appendix developed a queue model for a general link
system and used the model to analyze the implications for mean waiting time
(E[T]) and energy delay product of DVFS and ROO techniques. This analysis
and the time-domain simulation based on the queue model are applied to
evaluate the benefits of combining DVFS and ROO techniques. The analysis
and conclusions in the appendix also serve as theoretical background for the
energy-proportional link prototype in Chapter 5.

[Figure B.11 plots. (a) Normalized Energy Per Bit (0 to 25) versus Normalized
Average Data Rate (0 to 1); traces: ROO, DVFS & ROO. (b) Normalized
Energy-Delay Product (4 to 20) versus Normalized Average Data Rate (0 to 0.8);
traces: ROO constant VDDLDO, DVFS & ROO adaptive VDDLDO.]
Figure B.11: Simulated link performance with combined DVFS and ROO
using M/M/1 queue model: (a) normalized energy per bit, (b) normalized
energy delay product.
APPENDIX C
DISCUSSION ON α-POWER LAW MODEL
FOR MOSFET AND ITS EFFECT ON
DVFS
This appendix explains the α-power law model for MOSFETs [86] in detail,
using the power consumption of both an embedded clock link [30] and a
source-synchronous link [72] as examples. Some design trade-offs and
limitations of the DVFS technique are also covered. Because measured
MOS transistor I-V curves, especially in short-channel processes, deviate from
the (VGS − VTH)^2 dependence (the quadratic relationship described by the
Shockley model), the key idea of the α-power law model is to describe a more
accurate I-V characteristic by making the transistor drain current
proportional to (VGS − VTH)^α. Physically, α is closely related to carrier
velocity saturation. It equals 2 for a very long-channel device, in which case
the model coincides with the Shockley model. For 65 nm CMOS it is
approximately 1.2, and it approaches 1 for even finer technologies. In this
analysis, for simplicity, α is assumed to be 1, which corresponds to full
velocity saturation.
C.1 Scaling of Supply Voltage and Data Rate
According to the α-power law model, the data rate (denoted as FCLK) and supply
voltage Vdd are related as follows:
\[ F_{CLK} \propto \frac{(V_{dd} - V_{th})^{\alpha}}{V_{dd}}, \quad (\alpha = 1 \text{ for this analysis}) \tag{C.1} \]
Assume the data rate at the maximum supply Vdd,max is FCLK,max. The supply
voltage for a target data rate, FCLK, can be derived:
\[ V_{dd} = \frac{V_{dd,max} V_{th}}{V_{dd,max} - \frac{F_{CLK}}{F_{CLK,max}} (V_{dd,max} - V_{th})} \tag{C.2} \]
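Eq. (C.2) is simple to evaluate numerically. The Python sketch below does so; the function name is hypothetical, the 1.1V at 10Gb/s operating point is the one reported for the source-synchronous link, and Vth = 0.3V is the value from Eq. (B.25), assumed here to apply to this link as well.

```python
def supply_for_rate(f_clk, f_max, vdd_max, v_th):
    """Supply voltage from Eq. (C.2), i.e. the alpha-power law with alpha = 1."""
    if not 0.0 < f_clk <= f_max:
        raise ValueError("f_clk must be in (0, f_max]")
    return vdd_max * v_th / (vdd_max - (f_clk / f_max) * (vdd_max - v_th))


# Example values: 1.1 V at 10 Gb/s as in Chapter 5; V_th = 0.3 V is the value
# used in Eq. (B.25) and is assumed here to apply to this link as well.
for rate_gbps in (10.0, 5.0, 3.0, 1.0):
    vdd = supply_for_rate(rate_gbps, f_max=10.0, vdd_max=1.1, v_th=0.3)
    print(f"{rate_gbps:4.1f} Gb/s -> Vdd = {vdd:.3f} V")
```

As the data rate goes to zero, the computed supply approaches Vth, which is why strict adherence to the model yields impractically low voltages at low rates.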
The source-synchronous link in Chapter 5 achieves a maximum data rate of
10Gb/s with a 1.1V supply voltage provided by an internal LDO. Performing
DVFS for this transceiver according to the α-power law model gives the
relationship between data rate and supply shown in Fig. C.1. Note that, when
following the α-power law strictly, Vdd drops to about 0.3V at a 1Gb/s data
rate, which is not a practical supply voltage for the link to operate at.
According to the demonstration in [87], it is possible to achieve reasonably
good performance while the link is operating at 0.45V, which is about
1.5×Vth.
Figure C.1: DVFS according to α-power law model for source-synchronous
transceiver in Chapter 5.
C.2 Scaling of Active Power of Different Circuits
Serial link transceivers contain both analog and digital circuitry, which have
very different power scaling behavior as the supply voltage Vdd and clock
frequency FCLK change. Table C.1 shows the power scaling characteristics
applied in this analysis to the most common building blocks in link
transceivers. In general, the power consumption of these circuits is
considered to depend on the load capacitance CL, supply voltage Vdd, and
operating frequency FCLK. Throughout the supply scaling process, the load
capacitance is assumed to be constant (the small variation due to supply
change is ignored). Given this, link power scaling is determined by the
scaling of voltage, kVdd, and the scaling of frequency, kFreq, respectively.
Table C.1: Power scaling of link transceiver building blocks

Building blocks            Power scaling ratio
-----------------------    -------------------
Serializer/Deserializer    kFreq × (kVdd)^2
CML driver                 kVdd
VM driver                  kFreq × (kVdd)^2
Clocking circuit           kFreq × (kVdd)^2
CTLE/VGA                   kFreq × kVdd
Digital circuits           kFreq × (kVdd)^2
C.3 Power Scaling of Link Transceivers with α-Power
Law Model
With the scaling relationship between supply voltage Vdd and operating
frequency FCLK derived from the α-power law, and the power scaling behavior of
each building block, we are ready to study the trend of energy efficiency
while performing DVFS for link transceivers. First, consider the
source-synchronous transceiver in Chapter 5; its power consumption at 10Gb/s
with a 1.1V supply voltage is summarized in Table C.2 [72].
Table C.2: Power distribution of a source-synchronous link transceiver @
10Gb/s

Building blocks     Power [mW]
----------------    ----------
Tx serializer       8.0
Tx pre-driver       10.2
Tx CML driver       20
Tx clocking         11.4
Rx amp              2.6
Rx samplers         5.2
Rx deserializer     5.6
Rx clocking         2.2
By scaling the data rate according to the α-power law model, we obtain
the relationship between data rate and supply, along with the trend of link
energy efficiency. It is interesting to note that a low Vdd does not
necessarily lead to better energy efficiency, because it becomes difficult to
scale the supply when the voltage approaches the threshold voltage Vth. The
power-saving benefit of supply scaling diminishes at low supply, especially
for a transceiver that contains significant analog circuitry. In the best
case, the transceiver energy efficiency improves by about 2x, from 7.1 pJ/bit
at 10Gb/s to 3.7 pJ/bit at 5Gb/s (see Fig. C.2).
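The interplay between Table C.1, Table C.2, and Eq. (C.2) can be sketched in Python as follows. Several things here are assumptions rather than the author's procedure: the scaling class assigned to each block of Table C.2, the use of Vth = 0.3V from Eq. (B.25), and the omission of any regulator overhead. The absolute numbers are therefore not expected to match Fig. C.2; the sketch only illustrates why energy per bit stops improving at low data rates.

```python
# Power at 10 Gb/s from Table C.2 [mW], paired with a Table C.1 scaling class.
# The class assigned to each block is an assumption made for this sketch.
BLOCKS = [
    ("Tx serializer", 8.0, "digital"),
    ("Tx pre-driver", 10.2, "digital"),
    ("Tx CML driver", 20.0, "cml"),
    ("Tx clocking", 11.4, "digital"),
    ("Rx amp", 2.6, "analog"),
    ("Rx samplers", 5.2, "digital"),
    ("Rx deserializer", 5.6, "digital"),
    ("Rx clocking", 2.2, "digital"),
]


def k_vdd(k_freq, vdd_max=1.1, v_th=0.3):
    """Supply scaling factor Vdd / Vdd,max from Eq. (C.2) with alpha = 1."""
    return v_th / (vdd_max - k_freq * (vdd_max - v_th))


def scaled_power_mw(k_freq):
    """Total power after scaling each block with its Table C.1 ratio."""
    kv = k_vdd(k_freq)
    ratio = {
        "digital": k_freq * kv ** 2,  # kFreq x (kVdd)^2
        "cml": kv,                    # kVdd
        "analog": k_freq * kv,        # kFreq x kVdd
    }
    return sum(power * ratio[kind] for _, power, kind in BLOCKS)


def energy_per_bit_pj(k_freq, max_rate_gbps=10.0):
    """Energy per bit in pJ/bit (mW divided by Gb/s)."""
    return scaled_power_mw(k_freq) / (k_freq * max_rate_gbps)


for k in (1.0, 0.7, 0.5, 0.3, 0.2):
    print(f"{10.0 * k:4.1f} Gb/s: {energy_per_bit_pj(k):.2f} pJ/bit")
```

Because the CML driver power scales only with kVdd, its energy per bit grows as the data rate falls, so under these assumptions the total energy per bit reaches a minimum at an intermediate rate rather than at the lowest one.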
Figure C.2: Data rate and energy/bit scaling according to α-power law
model for source-synchronous transceiver.
A similar trend is also observed for an embedded clock transceiver, with
the power consumption at 6.25Gb/s summarized in Table C.3 [30]. The
embedded clock transceiver shows about 3x improvement in energy efficiency,
from 2.4 pJ/bit at 6.25Gb/s to 0.8 pJ/bit at 3Gb/s (see Fig. C.3). The
improvement is somewhat larger than in the source-synchronous case since the
latter has more analog circuitry, especially the CML driver on the transmitter
side (the reason for using a CML driver is detailed in Chapter 5).
To summarize, according to the α-power law model for MOSFETs, DVFS
helps to improve transceiver energy efficiency in general, but care is
needed in choosing the proper range of data rate and supply voltage.
Furthermore, the comparison between the two transceivers confirms that
digital-intensive circuitry has more potential to benefit from DVFS operation.
Table C.3: Power distribution of an embedded clock link transceiver per
channel @ 6.25Gb/s

Building blocks     Power [mW]
----------------    ----------
Tx serializer/      2.0
Tx pre-driver       0.4
Tx VM driver        1.1
Tx clocking         1.1
Rx CTLE             2.3
Rx samplers         0.5
Rx deserializer     1.6
Rx clocking         4.2
Common clocking     1.1
Figure C.3: Data rate and energy/bit scaling according to α-power law
model for embedded clock transceiver.
REFERENCES
[1] P. Kogge, “Exascale computing study: technology challenges in
achieving exascale systems,” DARPA, Tech. Rep., Sept. 2008. [Online].
Available: http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf
[2] V. Balan, O. Oluwole, G. Kodani, C. Zhong, R. Dadi, A. Amin,
A. Ragab, and M.-J. Lee, “A 15-22 Gb/s serial link in 28 nm CMOS
with direct DFE,” IEEE J. Solid-State Circuits, vol. 49, no. 12, pp.
3104–3115, Dec. 2014.
[3] H.-W. Lee, J. Song, S.-A. Hyun, S. Baek, Y. Lim, J. Lee, M. Park,
H. Choi, C. Choi, J. Cha, J. Kim, H. Choi, S. Kwack, Y. Kang, J. Kim,
J. Park, J. Kim, J. Cho, C. Kim, Y. Kim, J. Lee, B. Chung, and S. Hong,
“A 1.35-V 5.0-Gb/s/pin GDDR5M with 5.4-mW standby power and an
error-adaptive duty-cycle corrector,” in IEEE ISSCC Dig. Tech. Papers,
Feb. 2014, pp. 434–435.
[4] J. Bulzacchelli, C. Menolfi, T. Beukema, D. Storaska, J. Hertle, D. Han-
son, P.-H. Hsieh, S. Rylov, D. Furrer, D. Gardellini, A. Prati, T. Morf,
V. Sharma, R. Kelkar, H. Ainspan, W. Kelly, L. Chieco, G. Ritter,
J. Sorice, J. Garlett, R. Callan, M. Brandli, P. Buchmann, M. Kossel,
T. Toifl, and D. Friedman, “A 28-Gb/s 4-Tap FFE/15-Tap DFE serial
link transceiver in 32-nm SOI CMOS technology,” IEEE J. Solid-State
Circuits, vol. 47, no. 12, pp. 3232–3248, Dec. 2012.
[5] J. Kenney, T. Chen, L. DeVito, D. Dalton, S. McCracken, R. Soenneker,
W. Titus, and T. Weigandt, “A 6.5-Mb/s to 11.3-Gb/s continuous-rate
clock and data recovery,” in IEEE Proc. IEEE Custom Integr. Circuits
Conf.(CICC), Sept. 2014, pp. 1–4.
[6] P. K. Hanumolu, G.-Y. Wei, and U.-K. Moon, “Equalizers for high-
speed serial links,” International Journal of High Speed Electronics and
Systems, vol. 15, no. 02, pp. 429–458, 2005.
[7] B. Razavi, Design of Integrated Circuits for Optical Communications.
McGraw-Hill, 2003.
[8] L. Couch, Digital and Analog Communication Systems. Pearson, 2013.
[9] J. Zerbe, C. Werner, V. Stojanovic, F. Chen, J. Wei, G. Tsang, D. Kim,
W. Stonecypher, A. Ho, T. Thrush, R. Kollipara, M. Horowitz, and
K. Donnelly, “Equalization and clock recovery for a 2.5-10-Gb/s 2-
PAM/4-PAM backplane transceiver cell,” IEEE J. Solid-State Circuits,
vol. 38, no. 12, pp. 2121–2130, Dec. 2003.
[10] J. Man, W. Chen, X. Song, and L. Zeng, “A low-cost 100GE optical
transceiver module for 2km SMF interconnect with PAM4 modulation,”
in IEEE Proc. Optical Fiber Communications Conference and Exhibition
(OFC), Mar. 2014, pp. 1–3.
[11] M. van Ierssel, A. Sheikholeslami, H. Tamura, and W. Walker, “A 3.2
Gb/s CDR using semi-blind oversampling to achieve high jitter
tolerance,” IEEE J. Solid-State Circuits, vol. 42, no. 10, pp. 2224–2234, Oct.
2007.
[12] D. Dalton, K. Chai, E. Evans, M. Ferriss, D. Hitchcox, P. Murray,
S. Selvanayagam, P. Shepherd, and L. DeVito, “A 12.5-Mb/s to 2.7-
Gb/s continuous-rate CDR with automatic frequency acquisition and
data-rate readback,” IEEE J. Solid-State Circuits, vol. 40, no. 12, pp.
2713–2725, Dec. 2005.
[13] R. Inti, W. Yin, A. Elshazly, N. Sasidhar, and P. Hanumolu, “A 0.5-to-
2.5-Gb/s reference-less half-rate digital CDR with unlimited frequency
acquisition range and improved input duty-cycle error tolerance,” IEEE
J. Solid-State Circuits, vol. 46, no. 12, pp. 3150–3162, Dec. 2011.
[14] P. Hanumolu, G.-Y. Wei, and U.-K. Moon, “A wide-tracking range clock
and data recovery circuit,” IEEE J. Solid-State Circuits, vol. 43, no. 2,
pp. 425–439, Feb. 2008.
[15] W. Yin, R. Inti, A. Elshazly, M. Talegaonkar, B. Young, and P. Hanu-
molu, “A TDC-Less 7-mW 2.5-Gb/s digital CDR with linear loop dy-
namics and offset-free data recovery,” IEEE J. Solid-State Circuits,
vol. 46, no. 12, pp. 3163–3173, Dec. 2011.
[16] Understanding Jitter and Wander Measurements and Standards, Agilent
Technologies Inc. [Online]. Available: www.agilent.com.
[17] M. Talegaonkar, R. Inti, and P. Hanumolu, “Digital clock and data
recovery circuit design: Challenges and trade-offs,” in IEEE Proc. IEEE
Custom Integr. Circuits Conf.(CICC), Sept 2011, pp. 1–8.
[18] Synchronous Optical Network (SONET) Transport Systems: Common
Generic Criteria, Telecordia Technologies Inc. [Online]. Available:
http://www.linkteltech.com.
[19] G. Shu, S. Saxena, W.-S. Choi, M. Talegaonkar, R. Inti, A. Elshazly,
B. Young, and P. Hanumolu, “A 5-Gb/s 2.6-mW/Gb/s Reference-less
Half-Rate PRPLL-based Digital CDR,” in IEEE Symp. VLSI Circuits
Dig. Tech. Papers, June 2013, pp. 278–279.
[20] S. Sidiropoulos and M. Horowitz, “A semidigital dual delay-locked loop,”
IEEE J. Solid-State Circuits, vol. 32, no. 11, pp. 1683–1692, Nov. 1997.
[21] J. Kim and M. Horowitz, “Adaptive supply serial links with sub-1-V
operation and per-pin clock recovery,” IEEE J. Solid-State Circuits,
vol. 37, no. 11, pp. 1403–1413, Nov. 2002.
[22] K.-Y. K. Chang, J. Wei, C. Huang, Y. Li, K. Donnelly, M. Horowitz,
Y. Li, and S. Sidiropoulos, “A 0.4-4-Gb/s CMOS quad transceiver cell
using on-chip regulated dual-loop PLLs,” IEEE J. Solid-State Circuits,
vol. 38, no. 5, pp. 747–754, May 2003.
[23] J. Sonntag and J. Stonick, “A digital clock and data recovery archi-
tecture for multi-gigabit/s binary links,” IEEE J. Solid-State Circuits,
vol. 41, no. 8, pp. 1867–1875, Aug. 2006.
[24] P. Larsson, “A 2-1600-MHz CMOS clock recovery PLL with low-Vdd
capability,” IEEE J. Solid-State Circuits, vol. 34, no. 12, pp. 1951–1960,
Dec. 1999.
[25] T. Lee and J. Bulzacchelli, “A 155-MHz clock recovery delay-and phase-
locked loop,” IEEE J. Solid-State Circuits, vol. 27, no. 12, pp. 1736–
1746, Dec. 1992.
[26] J. Lee, K. Kundert, and B. Razavi, “Analysis and modeling of bang-
bang clock and data recovery circuits,” IEEE J. Solid-State Circuits,
vol. 39, no. 9, pp. 1571–1580, Sept. 2004.
[27] P. Hanumolu, V. Kratyuk, G.-Y. Wei, and U.-K. Moon, “A sub-
picosecond resolution 0.5-1.5-GHz digital-to-phase converter,” IEEE J.
Solid-State Circuits, vol. 43, no. 2, pp. 414–424, Feb. 2008.
[28] T. Toifl, C. Menolfi, P. Buchmann, M. Kossel, T. Morf, R. Reutemann,
M. Ruegg, M. Schmatz, and J. Weiss, “A 0.94-ps-RMS-Jitter 0.016-
mm2 2.5-GHz multiphase generator PLL with 360 degree digitally pro-
grammable phase shift for 10-Gb/s serial links,” IEEE J. Solid-State
Circuits, vol. 40, no. 12, pp. 2700–2712, Dec. 2005.
[29] T. Toifl, C. Menolfi, P. Buchmann, C. Hagleitner, M. Kossel, T. Morf,
J. Weiss, and M. Schmatz, “A 72-mW 0.03-mm2 inductorless 40-Gb/s
CDR in 65-nm SOI CMOS,” in IEEE ISSCC Dig. Tech. Papers, vol. 38,
Feb. 2007, pp. 226–228.
[30] J. Poulton, R. Palmer, A. Fuller, T. Greer, J. Eyles, W. Dally, and
M. Horowitz, “A 14-mW 6.25-Gb/s transceiver in 90-nm CMOS,” IEEE
J. Solid-State Circuits, vol. 42, no. 12, pp. 2745–2757, Dec. 2007.
[31] J. Kenney, D. Dalton, E. Evans, M. Eskiyerli, B. Hilton, D. Hitch-
cox, T. Kwok, D. Mulcahy, C. McQuilkin, V. Reddy, S. Selvanayagam,
P. Shepherd, W. Titus, and L. DeVito, “A 9.95-11.3-Gb/s XFP
transceiver in 0.13-µm CMOS,” IEEE J. Solid-State Circuits, vol. 41,
no. 12, pp. 2901–2910, Dec. 2006.
[32] P. Hanumolu, M. Brownlee, K. Mayaram, and U.-K. Moon, “Analysis
of charge-pump phase-locked loops,” IEEE Trans. Circuits Syst. I, Reg.
Papers, vol. 51, no. 9, pp. 1665–1674, Sept. 2004.
[33] J. Bulzacchelli, M. Meghelli, S. Rylov, W. Rhee, A. Rylyakov,
H. Ainspan, B. Parker, M. Beakes, A. Chung, T. Beukema, P. Pepelju-
goski, L. Shan, Y. Kwark, S. Gowda, and D. Friedman, “A 10-Gb/s
5-tap DFE/4-tap FFE transceiver in 90-nm CMOS technology,” IEEE
J. Solid-State Circuits, vol. 41, no. 12, pp. 2885–2900, Dec. 2006.
[34] W. Yin, R. Inti, A. Elshazly, B. Young, and P. Hanumolu, “A 0.7-to-
3.5-GHz 0.6-to-2.8-mW highly digital phase-locked loop with bandwidth
tracking,” IEEE J. Solid-State Circuits, vol. 46, no. 8, pp. 1870–1880,
Aug. 2011.
[35] A. Arakali, S. Gondi, and P. Hanumolu, “Low-power supply-regulation
techniques for ring oscillators in phase-locked loops using a split-tuned
architecture,” IEEE J. Solid-State Circuits, vol. 44, no. 8, pp. 2169–
2181, Aug. 2009.
[36] AWG7000 Arbitrary Waveform Generator, Tektronix Inc. [Online].
Available: http://www.tektronix.com/awg7000.
[37] D.-H. Oh, D.-S. Kim, S. Kim, D.-K. Jeong, and W. Kim, “A 2.8-Gb/s
all-digital CDR with a 10-b monotonic DCO,” in IEEE ISSCC Dig.
Tech. Papers, Feb. 2007, pp. 12–14.
[38] S.-K. Lee, Y.-S. Kim, H. Ha, Y. Seo, H.-J. Park, and J.-Y. Sim, “A 650-
Mb/s-to-8-Gb/s referenceless CDR circuit with automatic acquisition of
data rate,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2009, pp. 184–185.
[39] L. M. DeVito, “A versatile clock recovery architecture and monolithic
implementation,” in Monolithic Phase-Locked Loops and Clock Recovery
Circuits: Theory and Design, B. Razavi, Ed. New York: IEEE Press,
1996, pp. 405–442.
[40] D. Richman, “Color-carrier reference phase synchronization accuracy
in NTSC color television,” Proceedings of the IRE, vol. 42, no. 1, pp.
106–133, Jan. 1954.
[41] F. M. Gardner, “Properties of frequency difference detectors,” IEEE
Trans. on Communications, vol. 33, no. 2, pp. 131–138, Feb. 1985.
[42] A. Pottbacker, U. Langmann, and H. Schreiber, “A Si bipolar phase
and frequency detector ic for clock extraction up to 8 Gb/s,” IEEE J.
Solid-State Circuits, vol. 27, no. 12, pp. 1747–1751, Dec. 1992.
[43] N. Kocaman, S. Fallahi, M. Kargar, M. Khanpour, A. Nazemi, U. Singh,
and A. Momtaz, “An 8.5-11.5-Gb/s SONET transceiver with reference-
less frequency acquisition,” IEEE J. Solid-State Circuits, vol. 48, no. 8,
pp. 1875–1884, Aug. 2013.
[44] J. Han, J. Yang, and H.-M. Bae, “Analysis of a frequency acquisition
technique with a stochastic reference clock generator,” IEEE Trans. Cir-
cuits Syst. II, Exp. Briefs, vol. 59, no. 6, pp. 336–340, June 2012.
[45] G. Shu, S. Saxena, W.-S. Choi, M. Talegaonkar, R. Inti, A. Elshazly,
B. Young, and P. Hanumolu, “A reference-less clock and data recov-
ery circuit using phase-rotating phase-locked loop,” IEEE J. Solid-State
Circuits, vol. 49, no. 4, pp. 1036–1047, Apr. 2014.
[46] H. Won, T. Yoon, J. Han, J.-Y. Lee, J.-H. Yoon, T. Kim, J.-S. Lee,
S. Lee, K. Han, J. Lee, J. Park, and H.-M. Bae, “A 0.87 W transceiver IC
for 100 Gigabit ethernet in 40 nm CMOS,” IEEE J. Solid-State Circuits,
vol. 50, no. 2, pp. 399–413, Feb. 2015.
[47] G. Shu, W.-S. Choi, S. Saxena, T. Anand, A. Elshazly, and P. Hanumolu,
“A 4-to-10.5-Gb/s 2.2-mW/Gb/s continuous-rate digital CDR with au-
tomatic frequency acquisition in 65nm CMOS,” in IEEE ISSCC Dig.
Tech. Papers, Feb. 2014, pp. 150–151.
[48] M.-J. Park and J. Kim, “Pseudo-linear analysis of bang-bang controlled
timing circuits,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60,
no. 6, pp. 1381–1394, June 2013.
[49] B. Nikolic, V. G. Oklobdzija, V. Stojanovic, W. Jia, J. K.-S. Chiu, and
M. Ming-Tak Leung, “Improved sense-amplifier-based flip-flop: design
and measurements,” IEEE J. Solid-State Circuits, vol. 35, no. 6, pp.
876–884, June 2000.
[50] A. Elkholy, A. Elshazly, S. Saxena, G. Shu, and P. Hanumolu, “A
20-to-1000 MHz 14 ps peak-to-peak jitter reconfigurable multi-output
all-digital clock generator using open-loop fractional dividers in 65 nm
CMOS,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2014, pp. 272–273.
[51] T. Riley, M. Copeland, and T. Kwasniewski, “Delta-sigma modulation in
fractional-n frequency synthesis,” IEEE J. Solid-State Circuits, vol. 28,
no. 5, pp. 553–559, May 1993.
[52] D. Park and S. Cho, “A 14.2-mW 2.55-to-3-GHz cascaded PLL with
reference injection, 800-MHz delta-sigma modulator and 255fsrms inte-
grated jitter in 0.13 µm CMOS,” in IEEE ISSCC Dig. Tech. Papers,
Feb. 2012, pp. 344–346.
[53] A. Elshazly, R. Inti, B. Young, and P. Hanumolu, “Clock multiplication
techniques using digital multiplying delay-locked loops,” IEEE J.
Solid-State Circuits, vol. 48, no. 6, pp. 1416–1428, June 2013.
[54] R. Nandwana, T. Anand, S. Saxena, S.-J. Kim, M. Talegaonkar,
A. Elkholy, W.-S. Choi, A. Elshazly, and P. Hanumolu, “A calibration-
free fractional-N ring PLL using hybrid phase/current-mode phase in-
terpolation method,” IEEE J. Solid-State Circuits, vol. 50, no. 4, Apr.
2015.
[55] S. Ye, L. Jansson, and I. Galton, “A multiple-crystal interface PLL with
VCO realignment to reduce phase noise,” IEEE J. Solid-State Circuits,
vol. 37, no. 12, pp. 1795–1803, Dec. 2002.
[56] S. Huang, J. Cao, and M. Green, “An 8.2-to-10.3-Gb/s full-rate linear
reference-less CDR without frequency detector in 0.18 µm CMOS,” in
IEEE ISSCC Dig. Tech. Papers, Feb. 2014, pp. 152–153.
[57] M. Mansuri, J. Jaussi, J. Kennedy, T.-C. Hsueh, S. Shekhar, G. Balamu-
rugan, F. O’Mahony, C. Roberts, R. Mooney, and B. Casper, “A scal-
able 0.128-1 Tb/s, 0.8-2.6 pJ/bit, 64-lane parallel I/O in 32-nm CMOS,”
IEEE J. Solid-State Circuits, vol. 48, no. 12, pp. 3229–3242, Dec. 2013.
[58] S. Rusu, H. Muljono, D. Ayers, S. Tam, W. Chen, A. Martin, S. Li,
S. Vora, R. Varada, and E. Wang, “Ivytown: A 22nm 15-core enterprise
Xeon processor family,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2014,
pp. 102–103.
[59] E. J. Fluhr, S. Baumgartner, D. Boerstler, J. F. Bulzacchelli, T. Diemoz,
D. Dreps, G. English, J. Friedrich, A. Gattiker, T. Gloekler, C. Gonzalez,
J. D. Hibbeler, K. A. Jenkins, Y. Kim, P. Muench, R. Nett, J. Pare-
des, J. Pille, D. Plass, P. Restle, R. Robertazzi, D. Shan, D. Siljen-
berg, M. Sperling, K. Stawiasz, G. Still, Z. Toprak-Deniz, J. Warnock,
G. Wiedemeier, and V. Zyuban, “The 12-core POWER8™ processor
with 7.6 Tb/s IO bandwidth, integrated voltage regulation, and resonant
clocking,” IEEE J. Solid-State Circuits, vol. 50, no. 1, pp. 10–23, Jan.
2015.
[60] J. W. Poulton, W. J. Dally, X. Chen, J. G. Eyles, T. H. Greer, S. G. Tell,
J. M. Wilson, and C. T. Gray, “A 0.54 pJ/b 20-Gb/s ground-referenced
single-ended short-reach serial link in 28 nm CMOS for advanced pack-
aging applications,” IEEE J. Solid-State Circuits, vol. 48, no. 12, pp.
3206–3218, Dec. 2013.
[61] A. Manian and B. Razavi, “A 40-Gb/s 14-mW CMOS wireline receiver,”
in IEEE ISSCC Dig. Tech. Papers, Jan. 2016, pp. 412–414.
[62] S. Saxena, G. Shu, R. K. Nandwana, M. Talegaonkar, A. Elkholy,
T. Anand, S. J. Kim, W. S. Choi, and P. K. Hanumolu, “A 2.8-
mW/Gb/s 14-Gb/s serial link transceiver in 65-nm CMOS,” in IEEE
VLSI Circuits Symp. Dig. Tech. Papers, June 2015, pp. 352–353.
[63] V. S. Sathe, S. Arekapudi, A. Ishii, C. Ouyang, M. C. Papaefthymiou,
and S. Naffziger, “Resonant-clock design for a power-efficient, high-
volume x86-64 microprocessor,” IEEE J. Solid-State Circuits, vol. 48,
no. 1, pp. 140–149, Jan. 2013.
[64] G. Balamurugan, J. Kennedy, G. Banerjee, J. E. Jaussi, M. Mansuri,
F. O’Mahony, B. Casper, and R. Mooney, “A scalable 5-15-Gb/s, 14-
75mW low power I/O transceiver in 65nm CMOS,” in IEEE VLSI Circuits
Symp. Dig. Tech. Papers, June 2007, pp. 270–271.
[65] F. O’Mahony, J. Kennedy, J. E. Jaussi, G. Balamurugan, M. Mansuri,
C. Roberts, S. Shekhar, R. Mooney, and B. Casper, “A 47×10Gb/s
1.4mW/Gb/s parallel interface in 45nm CMOS,” in IEEE ISSCC Dig.
Tech. Papers, Feb. 2010, pp. 156–157.
[66] J. Zerbe, B. Daly, W. Dettloff, T. Stone, W. Stonecypher, P. Venkatesan,
K. Prabhu, B. Su, J. Ren, B. Tsang, B. Leibowitz, D. Dunwell, A. C.
Carusone, and J. Eble, “A 5.6-Gb/s 2.4-mW/Gb/s bidirectional link
with 8ns power-on,” in IEEE VLSI Circuits Symp. Dig. Tech. Papers,
June 2011, pp. 82–83.
[67] M. Talegaonkar, A. Elshazly, K. Reddy, P. Prabha, T. Anand, and
P. K. Hanumolu, “An 8-Gb/s-to-64-Mb/s, 2.3-4.2-mW/Gb/s burst-
mode transmitter in 90-nm CMOS,” IEEE J. Solid-State Circuits, vol. 49,
no. 10, pp. 2228–2242, Oct. 2014.
[68] T. Anand, M. Talegaonkar, A. Elkholy, S. Saxena, A. Elshazly, and
P. K. Hanumolu, “A 7 Gb/s embedded clock transceiver for energy
proportional links,” IEEE J. Solid-State Circuits, vol. 50, no. 12, pp.
3101–3119, Dec. 2015.
[69] L. Barroso and U. Hölzle, “The case for energy-proportional computing,”
Computer, vol. 40, no. 12, pp. 33–37, Dec. 2007.
[70] D. Abts, M. R. Marty, P. M. Wells, P. Klausler, and H. Liu, “Energy
proportional data center networks,” in Proceedings of International Sym-
posium on Computer Architecture (ISCA), June 2010, pp. 338–347.
[71] J. C. Eble, S. Best, B. Leibowitz, L. Luo, R. Palmer, J. Wilson, J. Zerbe,
A. Amirkhany, and N. Nguyen, “Power-efficient I/O design considera-
tions for high-bandwidth applications,” in Proc. IEEE Custom
Integr. Circuits Conf. (CICC), Sept. 2011, pp. 1–8.
[72] G. Shu, W. S. Choi, S. Saxena, S. J. Kim, M. Talegaonkar, R. Nandwana,
A. Elkholy, D. Wei, T. Nandi, and P. K. Hanumolu, “A 16Mb/s-to-
8Gb/s 14.1-to-5.9pJ/bit source synchronous transceiver using DVFS and
rapid on/off in 65nm CMOS,” in IEEE ISSCC Dig. Tech. Papers, Jan.
2016, pp. 398–399.
[73] R. Redl and J. Sun, “Ripple-based control of switching regulators: an
overview,” IEEE Trans. Power Electron., vol. 24, no. 12, pp. 2669–
2680, Dec. 2009.
[74] H. Krishnamurthy, V. Vaidya, P. Kumar, G. Matthew, S. Weng,
B. Thiruvengadam, W. Proefrock, K. Ravichandran, and V. De, “A
500-MHz, 68% efficient, fully on-die digitally controlled buck voltage
regulator on 22nm Tri-Gate CMOS,” in IEEE VLSI Circuits Symp. Dig.
Tech. Papers, June 2014, pp. 1–2.
[75] T. Anand, A. Elshazly, M. Talegaonkar, B. Young, and P. K. Hanumolu,
“A 5 Gb/s, 10 ns power-on-time, 36 µW off-state power, fast power-on
transmitter for energy proportional links,” IEEE J. Solid-State Circuits,
vol. 49, no. 10, pp. 2243–2258, Oct. 2014.
[76] B. Leibowitz, R. Palmer, J. Poulton, Y. Frans, S. Li, J. Wilson,
M. Bucher, A. M. Fuller, J. Eyles, M. Aleksic, T. Greer, and N. M.
Nguyen, “A 4.3 GB/s mobile memory interface with power-efficient
bandwidth scaling,” IEEE J. Solid-State Circuits, vol. 45, no. 4, pp.
889–898, Apr. 2010.
[77] B. M. Helal, M. Z. Straayer, G. Y. Wei, and M. H. Perrott, “A highly
digital MDLL-based clock multiplier that leverages a self-scrambling time-
to-digital converter to achieve subpicosecond jitter performance,” IEEE
J. Solid-State Circuits, vol. 43, no. 4, pp. 855–863, Apr. 2008.
[78] A. Elshazly, R. Inti, B. Young, and P. K. Hanumolu, “Clock multiplica-
tion techniques using digital multiplying delay-locked loops,” IEEE J.
Solid-State Circuits, vol. 48, no. 6, pp. 1416–1428, June 2013.
[79] S. Levantino, G. Marucci, G. Marzin, A. Fenaroli, C. Samori, and A. L.
Lacaita, “A 1.7 GHz fractional-N frequency synthesizer based on a mul-
tiplying delay-locked loop,” IEEE J. Solid-State Circuits, vol. 50, no. 11,
pp. 2678–2691, Nov. 2015.
[80] J. Zerbe, B. Daly, L. Luo, W. Stonecypher, W. Dettloff, J. C. Eble,
T. Stone, J. Ren, B. Leibowitz, M. Bucher, P. Satarzadeh, Q. Lin, Y. Lu,
and R. Kollipara, “A 5-Gb/s link with matched source synchronous
and common-mode clocking techniques,” IEEE J. Solid-State Circuits,
vol. 46, no. 4, pp. 974–985, Apr. 2011.
[81] M. S. Chen and C. K. K. Yang, “A low-power highly multiplexed par-
allel PRBS generator,” in Proc. IEEE Custom Integr. Circuits
Conf. (CICC), Sept. 2012, pp. 1–4.
[82] C. Menolfi, T. Toifl, R. Reutemann, M. Ruegg, P. Buchmann, M. Kos-
sel, T. Morf, and M. Schmatz, “A 25-Gb/s PAM4 transmitter in 90nm
CMOS SOI,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2005, pp. 72–73.
[83] J. W. Jung and B. Razavi, “A 25-Gb/s 5-mW CMOS CDR/deserializer,”
IEEE J. Solid-State Circuits, vol. 48, no. 3, pp. 684–697, Mar. 2013.
[84] A. Leon-Garcia, Probability, statistics, and random processes for electri-
cal engineering. Pearson Education, Inc., 2008.
[85] M. V. Talegaonkar, “Design of energy efficient high speed I/O inter-
faces,” Ph.D. dissertation, University of Illinois, 2016.
[86] T. Sakurai and A. R. Newton, “Alpha-power law MOSFET model and
its applications to CMOS inverter delay and other formulas,” IEEE J.
Solid-State Circuits, vol. 25, no. 2, pp. 584–594, Apr. 1990.
[87] W.-S. Choi, G. Shu, M. Talegaonkar, Y. Liu, D. Wei, L. Benini, and
P. K. Hanumolu, “A 0.45-to-0.7V 1-to-6Gb/s 0.29-to-0.58pJ/b source-
synchronous transceiver using automatic phase calibration in 65nm
CMOS,” in IEEE ISSCC Dig. Tech. Papers, vol. 58, Mar. 2015, pp.
66–67.
