Clocking and Skew-Optimization For Source-Synchronous Simultaneous Bidirectional Links by Ankur Kumar, FNU





Submitted to the Office of Graduate and Professional Studies of
Texas A&M University
in partial fulfillment of the requirements for the degree of
MASTER OF SCIENCE
Chair of Committee, Samuel M. Palermo
Committee Members, Jose Silva-Martinez
Laszlo Kish
Rabi N. Mahapatra
Head of Department, Miroslav M. Begovic
December 2018
Major Subject: Electrical Engineering
Copyright 2018 Ankur Kumar
ABSTRACT
The continuous expansion of computing capabilities in mobile devices demands
higher I/O bandwidth and dense parallel links supporting higher data rates. High-speed
signaling leverages technology advancements to achieve higher data rates but is limited
by the bandwidth of the electrical copper channel, which has not scaled accordingly.
To meet the continuous data-rate demand, the Simultaneous Bi-directional (SBD) signaling
technique is an attractive alternative to uni-directional signaling, as it can work at
lower clock speeds, exhibits better spectral efficiency and provides higher throughput in
pad-limited PCBs.
For a low-power and more robust system, the SBD transceiver should utilize a
forwarded-clock system and per-pin de-skew circuits to correct the phase difference
developed between the data and clock. The system can be configured in two roles, master and
slave. To save more power, the system should have only one clock generator. The master
has its own clock source and shares its clock with the slave through the clock channel, and the
slave uses this forwarded clock to deserialize the inbound data and serialize the outbound
data. The resulting clock-to-data skew can be corrected with a phase-tracking CDR. This
thesis presents a low-power implementation of forwarded clocking and clock-to-data skew
optimization for a 40 Gbps SBD transceiver. The design is implemented in 28nm CMOS
technology and consumes 8.8mW of power for 20 Gbps NRZ data at 0.9 V supply. The
clocking circuitry occupies an area of 0.018 mm².
DEDICATION
To my mother and father,
Mamta and Alok,
My strengths, my faith, my joy
ACKNOWLEDGMENTS
Firstly, I would like to express my sincere gratitude to my advisor, Prof. Samuel
Palermo, for his patience, guidance and the continuous support during my thesis. I thank
him for believing in me and giving me the opportunity to work on this project. He has
always motivated me to do better and think outside the box. I am impressed with his
work ethics and group dynamics and would definitely incorporate these qualities in my
professional life.
Former and current members of Prof. Palermo’s group have been my teachers,
mentors and friends. I am grateful to Yang-Hang, Takayuki and Ashkan for their kindness,
time, help, and feedback on this project. This project would not have been completed without
their support. I would like to thank Shengchang, Shiva, Yuanming, Po-Hsuan and Noah
for their valuable inputs and discussions. I have learnt a lot about high-speed IC design
from all of you. I also thank Bo Sun, Thomas Kilpatrick and Danny Butterfield from
Qualcomm for supporting our research and for their inputs and feedback on this project.
I would like to thank Prof. Jose Silva-Martinez, Prof. Laszlo Kish and Prof. Rabi
N. Mahapatra for serving on my committee. Prof. Silva-Martinez taught me Network
Theory, which deepened my interest in Clocking Circuits. Prof. Kish’s course on
Fluctuations and Noise enlightened me about noise analysis of circuits and related research
in the field.
A special thanks to all my friends who made my time in College Station enjoyable
and fun. Life in graduate school would have been a lot harder without you all. Thank you Umang,
Ishan, Aalhad, Sudhanshu, Mitchell and Nick. You guys have always been there to lift up
my mood and spirits. Also, I thank my friends from India, Anand, Saurav, Bhupesh and
Neha. You all have kept the friendship boat sailing, always keeping in touch and
providing me encouragement and support.
Lastly, I thank my grandfather, Dr. Chandrika Prasad Gupta, for motivating me to do
research. I also thank my elder brother Ankit and sister-in-law Sonam for always cheering
me up. I thank my mother and father for believing in me and supporting my decision to
study abroad and pursue research. This thesis, for all its worth, is dedicated to them.
CONTRIBUTORS AND FUNDING SOURCES
Contributors
This work was supervised by a thesis committee consisting of Professor Samuel
Palermo, Professor Jose Silva-Martinez and Professor Laszlo Kish of the Department of
Electrical and Computer Engineering and Professor Rabi N. Mahapatra of the Department
of Computer Science.
The analyses depicted in Section 1.2 for the data-path of Simultaneous Bidirectional
Links were conducted by Yang-Hang Fan, Ph.D. student in the Department of Electrical and
Computer Engineering.
All other work conducted for the thesis was completed by the student independently.
Funding Sources
This work was supported by Qualcomm Technologies, Inc. San Diego, CA.
NOMENCLATURE
BBPD Bang-Bang Phase Detector
BER Bit-Error Rate








PCB Printed Circuit Board
PI Phase Interpolator
PLL Phase-Locked Loop









TABLE OF CONTENTS
ABSTRACT
DEDICATION
ACKNOWLEDGMENTS
CONTRIBUTORS AND FUNDING SOURCES
NOMENCLATURE
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
1.1 High Speed Serial Links
1.2 Electrical Channel
1.3 Simultaneous Bidirectional Links
1.4 Research Contribution
1.5 Thesis Organization
2. LITERATURE SURVEY
2.1 Source-Synchronous Forwarded Clock Systems
2.2 Clock and Data Recovery Circuits
2.2.1 Types of CDR
2.2.2 CDRs in Forwarded-Clock I/O Systems
2.3 Existing Simultaneous Bi-directional Link Architectures
2.4 Clocking Scheme in Existing SBD Links
3. SIMULTANEOUS BIDIRECTIONAL TRANSCEIVER ARCHITECTURE
4. PROPOSED CLOCKING SCHEME
4.1 Multiphase Clock Generation
4.2 Forwarded Clock Transmitter
4.3 Receiving Clock Buffer
4.4 De-Skew Circuit
4.4.1 CDR Loop Dynamics
4.4.2 Sampler
4.4.3 Phase Rotator
4.4.4 Oversampling Clock Generator
4.4.5 Phase Detector
4.4.6 Digital Filter and FSM
4.4.7 Bypass Clock Path
4.5 Layout
5. SIMULATION RESULTS
5.1 Ideal Model Simulations
5.2 Clocking Simulation Results
5.3 Link Performance
5.4 Proposed Test Plan
6. CONCLUSION




LIST OF FIGURES
1.1 High Speed Electrical Link System¹
1.2 A Typical Electrical Backplane²
1.3 Frequency Response of Channels With Different Lengths³
1.4 Basic Concept of SBD Link System
2.1 Conventional Forwarded Clock System⁴
2.2 Normalized Jitter vs Skew
2.3 Normalized Jitter vs JTB
2.4 Frequency Domain Model for Forwarded Clock Link⁵
2.5 Jitter tolerance v/s Jitter Frequency Plots For Different Clock Skew Values
2.6 Optimal Position for Sampling Data
2.7 (a) Embedded Clocking and (b) Forwarded Clocking Architecture⁶
2.8 Analog PLL-based CDR
2.9 Digital CDR
2.10 SBD Link Architecture with plesiochronous clocking⁷
2.11 SBD Link Architecture with uni-directional clocks forwarded in both directions⁸
3.1 Frequency Response for 6" FR4 Channel
3.2 System-Level Diagram for Proposed SBD Link
3.3 Data Transmission from Master-to-Slave: Skew ~0
3.4 Data Transmission from Slave-to-Master: Skew > 2*trace delay (~2160ps)
4.1 Injection-Locked Oscillator for IQ Generation⁹
4.2 Quadrature Locked Loop for Phase Correction
4.3 ILO Control Voltage v/s Time
4.4 ILO Frequency v/s Time
4.5 Low Swing Output Driver for Clock Transmitter
4.6 Receiving Clock Buffer and CML-to-CMOS Converter at Slave Side
4.7 Adaptation of ILO while tracking phase. Aqua curve: I output (CLK 0) at fixed position (−45°). Indigo curve: I output (CLK 0) varying with code (−45° to +45°)
4.8 ILO based Phase Rotator as described in¹⁰
4.9 Proposed Deskew Circuit
4.10 Proposed Block Diagram for an Edge-Rotating CDR
4.11 Closed-Loop Diagram for the CDR
4.12 Transfer Function for the CDR
4.13 Double-Tail Sampler with Regenerative Latch
4.14 Latch Resolution Comparison for two Double-Tail Comparators
4.15 Impulse Sensitivity Function Comparison for two Double-Tail Comparators
4.16 Schematic for Proposed Phase Rotator
4.17 Edge Rotation and Need for Oversampling Clocks
4.18 Oversampling Clock Generator
4.19 Linear Hogge Phase Detector
4.20 Non-Linear Alexander Phase Detector
4.21 BBPD Transfer Function for Unidirectional 20Gbps Data
4.22 BBPD Transfer Function for Simultaneous Bidirectional Data
4.23 Block Diagram of Digital Filter
4.24 Illustration of Digital Filter Depth
4.25 Bypass Clock Path
4.26 Layout of the Fabricated Chip
4.27 Layout of the Chip showing CDR
4.28 Die Photo of the Fabricated Chip
5.1 Eye Diagram showing Ideal CDR Lock Position
5.2 Eye Diagram showing CDR Lock Position for 3-bit dither bits
5.3 Eye Diagram showing CDR Lock Position for 6-bit dither bits
5.4 Free-running ILO at 5GHz @ Control Voltage of 504 mV
5.5 Free-Running Frequency v/s Control Voltage for ILO
5.6 Output of Forwarded Clock Transmitter
5.7 Output of CML to CMOS Buffer at clock RX side
5.8 Phase Rotator Transient Output for all code values
5.9 Phase Transfer Curve for Phase Rotator
5.10 Dynamic Non-Linearity Curve for Phase Rotator for 1UI
5.11 Output of 2X Oversampling CLK Generator
5.12 Output of Digital Filter for Always Early Clock
5.13 Output of Digital Filter for Always Late Clock
5.14 Schematic Level Simulation 5/4X CDR with Ideal PRBS Data
5.15 Eye Diagram for Schematic-Level CDR using 2" Channel and 6-bit LSB filtering
5.16 Digital Filter Code Convergence
5.17 Eye Diagram for 20Gbps Unidirectional Data After 6” FR4 Channel
5.18 Eye Diagram for 20Gbps Unidirectional Data After CTLE
5.19 Eye Diagram for 20Gbps Bidirectional Data After CTLE and FIR Echo Cancellation
5.20 PRBS15 Data is serialized and checked by PRBS15 Checker at TX Output
5.21 20 Gbps PRBS15 Data is sent over the channel and checked by PRBS15 Checker at Sampler Output




LIST OF TABLES
2.1 Transfer Function and JTOL For Different De-skew Elements
2.2 Comparison Between Digital and Analog CDRs
4.1 Loop Parameters for 5/4X CDR
4.2 Digital Code Table for 5-bit Phase Rotator
4.3 Selection Code and Corresponding Clock Outputs for 4:2 MUX in Phase Rotator
5.1 Power Summary of the CDR
5.2 Comparison of Digital CDRs
1. INTRODUCTION
1.1 High Speed Serial Links
The data processing capabilities of computer and mobile systems have increased
tremendously, enabled primarily by integrated circuit scaling and developments in
multi-core, multi-processor computer architectures [1]. However, the number of I/O
pins has not scaled proportionally. The pin count on chip packages is limited and
further constrained by printed circuit board wiring.
High-speed signaling techniques have leveraged semiconductor process technology
improvements to achieve high data rates. A typical high-speed electrical link system is
shown below in Figure 1.1.
Figure 1.1: High Speed Electrical Link System¹
¹ Figure reprinted with permission from CMOS Nanoelectronics: Analog and RF VLSI Circuits by
Krzysztof Iniewski, McGraw-Hill Publishing Co., New York, USA. Copyright © 2011 by McGraw-Hill
Education, LLC.
Low-speed parallel data streams are serialized by a transmitter to overcome the
constraint presented by the count of high-speed pads in chip packages and printed circuit
board (PCB) wiring, which have not scaled at the same pace as the MOSFET transistor. A
low-swing differential transmitter is used for better common-mode noise rejection and
reduced crosstalk [2].
The incoming signal is sampled at the receiver side, restored to CMOS levels,
and then deserialized to lower speeds. The data is synchronized via high-frequency clocks,
which are generated using a phase-locked loop (PLL) based frequency synthesizer at the
transmitter. The clocks used for sampling at the receiver are aligned to the data with the
help of a clock and data recovery system [2].
1.2 Electrical Channel
Figure 1.2: A Typical Electrical Backplane²
Copper-based electrical channels are commonly used in current computing systems.
The lengths of these channels can vary from a few inches (e.g. processor-to-memory
interconnection) to several meters (multi-layer backplanes) depending on the required
application. A typical backplane system showing electrical interconnects is shown in Figure
1.2.
² Figure reprinted with permission from CMOS Nanoelectronics: Analog and RF VLSI Circuits by
Krzysztof Iniewski, McGraw-Hill Publishing Co., New York, USA. Copyright © 2011 by McGraw-Hill
Education, LLC.
Electrical signals propagate through these copper interconnects. The bandwidth of
electrical channels is restricted by the loss that copper traces exhibit at higher frequencies,
by reflections caused by impedance discontinuities, and by crosstalk from adjacent signals [3].
The frequency response of these channels for different channel lengths is shown below in
Figure 1.3.
Figure 1.3: Frequency Response of Channels With Different Lengths³
As seen from the plots, the loss increases with channel length. Overall, all channels
exhibit a low-pass characteristic, resulting in a degraded received signal whose energy
is spread over multiple bit periods.
³ Figure reprinted with permission from CMOS Nanoelectronics: Analog and RF VLSI Circuits by
Krzysztof Iniewski, McGraw-Hill Publishing Co., New York, USA. Copyright © 2011 by McGraw-Hill
Education, LLC.
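The spreading of a received bit over several bit periods, noted above for low-pass channels, can be illustrated with a small numerical sketch. It assumes a single-pole RC low-pass as a stand-in for the channel (real copper traces have skin-effect and dielectric loss that this model ignores). The 5 GHz bandwidth, the 50 ps bit time and the function name `lowpass_pulse_response` are illustrative choices, not values taken from the channels in Figure 1.3.

```python
import math

def lowpass_pulse_response(bandwidth_hz, bit_time_s, samples_per_bit=32, n_bits=4):
    """Drive a one-bit-wide pulse into a first-order low-pass (hypothetical
    channel model) and return the output at the end of each bit period."""
    tau = 1.0 / (2 * math.pi * bandwidth_hz)   # RC time constant
    dt = bit_time_s / samples_per_bit          # simulation time step
    alpha = 1.0 - math.exp(-dt / tau)          # exact step update for a held input
    y = 0.0
    cursors = []
    for n in range(n_bits * samples_per_bit):
        x = 1.0 if n < samples_per_bit else 0.0  # pulse is high for the first bit only
        y += alpha * (x - y)
        if (n + 1) % samples_per_bit == 0:
            cursors.append(y)                    # value at the end of this bit period
    return cursors

# Assumed example: 5 GHz channel bandwidth, 50 ps bit time (20 Gb/s)
cursors = lowpass_pulse_response(5e9, 50e-12)
for k, c in enumerate(cursors):
    print(f"end of bit {k}: {c:.3f}")
```

The first entry is the main cursor and the later entries are residual energy landing in the following bit periods, which is the inter-symbol interference that the receiver has to tolerate or equalize.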
1.3 Simultaneous Bidirectional Links
As discussed in previous sections, even though process technology advancements
allow for high performance electrical links, the channel forms the bottleneck in the overall
system design, limiting the maximum data rate flowing through the interconnects.
The simultaneous bi-directional (SBD) signaling technique is an alternative which,
relative to unidirectional signaling, can work at lower clock speeds, exhibits better
spectral efficiency and provides higher throughput in pad-limited PCBs. In an SBD transceiver,
each side can transmit and receive data at the same time. An SBD system conceptual
diagram is shown in Figure 1.4.
Figure 1.4: Basic Concept of SBD Link System
The transmitter on the left side delivers the outbound signal, V1, on the channel
and receives the inbound signal, V2, generated by the transmitter of the right side. At the
receiver on the left side, the received signal is the superposition of V1 and V2, so
the SBD receiver should be able to separate V2 from V1+V2. The separation
needs a Vgen which can produce a replica of the outbound signal V1 and a subtractor
which can subtract the outbound signal component from the receive-end signal [4][5].
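The replica-and-subtract idea can be sketched with ideal samples. This is a behavioural illustration only: the function name `sbd_receive`, the `replica_gain` parameter and the ±0.5 signal levels are assumptions made for the example, and the channel is treated as lossless and delay-free.

```python
import random

def sbd_receive(v1, v2, replica_gain=1.0):
    """Recover the inbound signal from the superposed line signal.

    v1: outbound samples driven by the local transmitter
    v2: inbound samples driven by the remote transmitter
    replica_gain: hypothetical gain of the replica generator (1.0 = ideal)
    """
    line = [a + b for a, b in zip(v1, v2)]         # the receiver sees V1 + V2
    replica = [replica_gain * a for a in v1]       # Vgen output: copy of V1
    return [s - r for s, r in zip(line, replica)]  # subtractor output

random.seed(0)
levels = (-0.5, 0.5)                               # assumed NRZ levels
v1 = [random.choice(levels) for _ in range(16)]
v2 = [random.choice(levels) for _ in range(16)]

recovered = sbd_receive(v1, v2)
print(recovered == v2)
```

With `replica_gain` different from 1.0 the subtraction leaves a residue proportional to V1 on top of V2, which gives a feel for why the replica needs to match the outbound signal closely.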
1.4 Research Contribution
The SBD transceiver can achieve lower power and more robustness if
it utilizes a source-synchronous forwarded clock system and per-pin de-skew circuits to
correct the phase difference developed between the data and clock. This requires the clock
pattern to be sent over the channel. The SBD system can be configured for two roles,
master and slave. The system should have only one clock generator to save power. The
master has the clock generator and shares this clock with the slave through the clock channel,
and the slave uses this forwarded clock to deserialize the inbound data and serialize the
outbound data.
In a forwarded-clock system, the frequency at the slave side is exactly equal to
the frequency at the master side. Hence only the phase between clock and data needs to be
corrected, which can be achieved with a phase-tracking CDR. This thesis presents an
implementation of forwarded clocking and clock-to-data skew optimization for a 40 Gbps SBD
transceiver. A further aim of this thesis is a low-power implementation of the
overall clocking and skew-optimization for the SBD system.
1.5 Thesis Organization
The thesis is organized as follows. Section 2 summarizes the literature survey for
the SBD transceiver and the latest forwarded clock architectures. Section 3 introduces the
SBD system and clocking analysis in detail. Section 4 shows the proposed circuit-level
implementations. Section 5 discusses the simulation results. Section 6 concludes this
thesis. References used for analysis and comparison are listed at the end.
2. LITERATURE SURVEY
2.1 Source-Synchronous Forwarded Clock Systems
A source-synchronous (or forwarded) clock in a multi-channel system helps
to achieve higher data rates and allows jitter tracking from low to high frequencies [6].
Such a system is also termed a mesochronous system and has been used in processor-memory
interfaces like Intel QuickPath Interconnect (QPI) and multi-processor communication
links like HyperTransport [7].
Figure 2.1: Conventional Forwarded Clock System¹
¹ Figure reprinted with permission from CMOS Nanoelectronics: Analog and RF VLSI Circuits by
Krzysztof Iniewski, McGraw-Hill Publishing Co., New York, USA. Copyright © 2011 by McGraw-Hill
Education, LLC.
A block diagram of a conventional forwarded clock system is shown in Figure
2.1. An extra channel is used in the source-synchronous architecture to send a fixed clock
pattern from transmitter side to the receiver. A replica transmitter drives a fixed pattern
like 1,0,1,0 which results in a clock pattern at the output of this transmitter and is fed onto
this extra channel. This ensures maximum jitter correlation between data and clock. A
clock amplifier is used at the receiver side to account for any channel loss and distribute
the clock.
The clock and data channels will exhibit a mismatch in time delay due to different
driver strengths, loading or trace lengths. This degrades the overall system timing
margins and limits the effective jitter tracking. Moreover, as the frequency of the data jitter
changes from low to moderate frequencies, the phase shift becomes larger and the
differential jitter between the data path and clock path increases [8]. The following equation
relates the normalized differential jitter JNOR to the skew ∆T caused
by the difference in signal propagation delay [8]:
$$|J_{NOR}(\omega)| = \left|1 - e^{-j\omega \Delta T}\right| \tag{2.1}$$
Figure 2.2 plots the above expression for different jitter frequencies. Since
|1 − e^(−jω∆T)| = 2|sin(ω∆T/2)|, the normalized differential jitter grows approximately in
proportion to the clock-to-data skew while ω∆T is small, and a large skew causes a jitter
gain > 1 (the gain first exceeds 1 when ω∆T passes π/3).
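The dependence of the normalized differential jitter on skew and jitter frequency can be checked numerically. The sketch below evaluates equation (2.1) directly; the skew and frequency values in the loop are arbitrary illustrative points, not the ones used for Figure 2.2.

```python
import cmath
import math

def normalized_jitter(jitter_freq_hz, skew_s):
    """|J_NOR| = |1 - exp(-j*omega*dT)|, following equation (2.1)."""
    omega = 2 * math.pi * jitter_freq_hz
    return abs(1 - cmath.exp(-1j * omega * skew_s))

# Illustrative sweep (assumed values)
for skew_ps in (50, 200, 1000):
    for f_mhz in (10, 100, 500):
        j = normalized_jitter(f_mhz * 1e6, skew_ps * 1e-12)
        print(f"skew={skew_ps:5d} ps  f={f_mhz:4d} MHz  |J_NOR|={j:.3f}")
```

For small products of jitter frequency and skew the result scales almost linearly with either quantity, while at 500 MHz jitter and 1 ns skew the clock jitter arrives half a jitter period late and the differential jitter reaches its maximum of 2.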
Higher-frequency jitter also sees higher gain at the same skew value. Hence a
low-pass filter is needed to filter out high-frequency jitter [8]. For a given jitter
frequency, the skew between data and clock also affects the jitter tracking bandwidth (JTB),
as shown in Figure 2.3.
Figure 2.2: Normalized Jitter vs Skew
Figure 2.3: Normalized Jitter vs JTB
Commonly used de-skew mechanisms, which reduce the skew between incoming
data and clock and help in sampling the data pattern at the optimal point, include
delay-locked loops (DLL) or phase-locked loops (PLL) with phase interpolators (PI), or an
injection-locked oscillator (ILO). A frequency domain model is shown below in Figure 2.4.
Including the jitter transfer function $H_{CR}(\omega)$ of the de-skew circuitry [8], the
normalized differential jitter of equation (2.1) is modified as shown below:
$$|J_{NOR}(\omega)| = \left|1 - e^{-j\omega \Delta T}\,|H_{CR}(\omega)|\right| \tag{2.2}$$
Figure 2.4: Frequency Domain Model for Forwarded Clock Link²
A DLL with a PI shows an all-pass characteristic, as shown in Table 2.1, and
hence cannot filter the amplified high-frequency jitter. A PLL with a PI has an inherent
low-pass characteristic, but PLLs which combine high bandwidth with low power
are not easy to implement and consume significant silicon area [6]. Moreover,
PLL designs typically involve stability concerns. An ILO-based de-skew circuit exhibits
high jitter tolerance at a low complexity level when compared with other topologies [8].
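The contrast between an all-pass and a low-pass de-skew element can be made concrete by evaluating equation (2.2) for both. In the sketch below the low-pass response is a generic first-order filter used purely as a stand-in; it is not the PLL or ILO transfer function of the thesis, and the 150 MHz corner, the 500 ps skew and the function names are assumptions chosen for illustration.

```python
import cmath
import math

def h_allpass(f_hz):
    """|H_CR| of an idealised DLL+PI path: unity at every jitter frequency."""
    return 1.0

def h_lowpass(f_hz, f3db_hz=150e6):
    """|H_CR| of a first-order low-pass (stand-in model, assumed corner)."""
    return 1.0 / math.sqrt(1.0 + (f_hz / f3db_hz) ** 2)

def normalized_jitter(f_hz, skew_s, h_mag):
    """|J_NOR| = |1 - exp(-j*omega*dT) * |H_CR||, following equation (2.2)."""
    omega = 2 * math.pi * f_hz
    return abs(1 - cmath.exp(-1j * omega * skew_s) * h_mag(f_hz))

skew = 500e-12  # assumed clock-to-data skew
for f in (10e6, 100e6, 1e9):
    ap = normalized_jitter(f, skew, h_allpass)
    lp = normalized_jitter(f, skew, h_lowpass)
    print(f"f={f/1e6:7.0f} MHz  all-pass={ap:.3f}  low-pass={lp:.3f}")
```

At 1 GHz jitter with 500 ps skew the all-pass path doubles the differential jitter, whereas the low-pass path keeps it close to 1 because the forwarded-clock jitter is mostly filtered before it can add out of phase.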
² Figure reprinted with permission from "Receiver Jitter Tracking Characteristics in High-Speed Source
Synchronous Links" by Ahmed Ragab, Yang Liu, Kangmin Hu, Patrick Chiang, and Samuel Palermo, 2011.
Journal of Electrical and Computer Engineering, vol. 2011, Article ID 982314, Copyright © 2011 by
Hindawi Publishing Corporation.
Table 2.1 lists the jitter transfer function and the jitter tolerance (JTOL)
transfer function for the different de-skew elements.
Table 2.1: Transfer Function and JTOL For Different De-skew Elements
To compare the different de-skew architectures, the following parameters were
used for modelling:
• 20 Gb/s data rate, 5 GHz clock frequency
• PLL
– 3-dB bandwidth f3dB = 150 MHz
– Second-order transfer function with denominator $s^2 + 2\zeta\omega_n s + \omega_n^2$ (2.3)
– ζ = 1.2 (damping factor)
– σ = 0.2 ps @ BW > 10 MHz
• DLL
– All-pass transfer function
• ILO
– K = 0.5 (injection strength)
– n = 4 for a ring oscillator with 4 delay stages
– ωosc = free-running frequency, which can be adjusted to achieve the desired output
phase shift
– σ = 0.2 ps @ K = 0.025, σ = 0.1 ps @ K = 0.2, σ = 0.08 ps @ K = 0.35
The jitter tolerance expressions for the different de-skew architectures listed in
Table 2.1 are plotted in Figure 2.5 for different skew values, assuming the above parameters.
The plots confirm that a DLL with a PI has no low-pass filter characteristic, and its jitter
tolerance at high skew is worst at high frequency. The PLL jitter tracking bandwidth is
required to be smaller than one-tenth of the reference frequency, which is an inherent
limitation. Moreover, PLLs are susceptible to power-supply noise and have loop stability
concerns which need to be addressed carefully.
An ILO inherently performs low-pass filtering on the received clock signal.
ILOs too are susceptible to power-supply noise, but the severity is topology dependent [8].
The jitter tracking bandwidth of an ILO is a function of the injection strength K and the
free-running frequency, and spans a wider bandwidth range, so an ILO seems to be a better
choice for de-skew in forwarded clock systems. Recent forwarded-clock systems
[9][10][11][12] also select an ILO as the de-skew element. The theoretical de-skew range
of an ILO is ±90°. Multiple ILOs can be configured to achieve a complete de-skew range of 360°.
However, unlike PI-based systems, an ILO faces issues when it approaches the outer bounds
of this range, resulting in a more complex architecture as seen in [10]. Moreover, if the
clock architecture is quarter-rate, the required range is relaxed to ±45°, and ILOs can achieve
this range easily with more degrees of freedom, such as injection-strength tuning, to get
better jitter tolerance.
Figure 2.5: Jitter tolerance v/s Jitter Frequency Plots For Different Clock Skew Values
2.2 Clock and Data Recovery Circuits
A clock and data recovery (CDR) circuit is a closed-loop system which samples the
incoming data to extract clocking information from it and reconstructs the original
transmitted bit-stream at the receiver. The clocks from the CDR should have an effective
frequency equal to the incoming data rate and a proper phase relationship with the data to
provide enough timing margin for the desired bit-error rate (BER).
The eye diagram shown in Figure 2.6 is constructed by superimposing consecutive
data bits onto a single bit time. The arrow shows the optimal sampling point. The main
aim of the CDR is to produce clocks which sample the data at this point, resulting in as
few errors as possible or, in other words, the lowest bit-error rate [13].
Figure 2.6: Optimal Position for Sampling Data
The data transitions appear to wander in position. This can be caused by
either a deterministic phase offset between the TX and RX clocks or a timing uncertainty,
also known as jitter. The CDR design should be able to overcome these issues in order to
achieve correct data reconstruction at the receive side [13].
The CDR design also depends on which clocking architecture is used in the I/O system,
viz. common clocking, forwarded clocking or embedded clocking. In common clocking, the
clock is synchronized, hence there is no active de-skew. In forwarded clocking, the frequency
is exactly equal at the TX and RX sides, which implies that the CDR only has to correct a
deterministic skew between data and clock caused by driver strength, loading and trace
mismatches. In embedded clocking, both frequency and phase are unknown at the
receiver side. Here the CDR design is complex and involves extracting both frequency and
phase information. Moreover, the skew has a random component, which implies continuous
phase correction between data and clock.
Jitter refers to the timing uncertainty in the phase of clock edges caused by noise
(device noise, thermal and power supply variations) in the system. Jitter can be either
deterministic (DJ) or random (RJ). The common types of DJ found in real systems are
data-dependent jitter, duty cycle distortion (DCD), and uncorrelated (to the data) bounded
jitter such as supply-noise-induced jitter [13]. The dominant source of DJ is inter-symbol
interference (ISI). Thermal noise and flicker noise from active and passive components
contribute to the RJ. Often these jitter components are uncorrelated and should
be filtered by the CDR [13].
2.2.1 Types of CDR
CDRs can be either analog or digital. They can also be classified as single-loop
or dual-loop CDRs. Analog PLL-based CDRs are the most prevalent timing recovery
systems both in industry and research. The block diagram of an analog CDR is shown in
Figure 2.8.
A PLL-based CDR consists of a phase detector (PD), which characterizes the phase
difference between data and clock; a charge-pump, which converts the output of the phase
detector into a current; and an analog loop filter, which averages the phase-detector
output and sets the CDR bandwidth. The last stage is the voltage-controlled oscillator
(VCO), which adjusts its frequency based on the loop filter output to sample the data
optimally in the middle of the eye [14].
Figure 2.7: (a) Embedded Clocking and (b) Forwarded Clocking Architecture3
3*Figure reprinted with permission from CMOS Nanoelectronics: Analog and RF VLSI Circuits by
Krzysztof Iniewski, McGraw-Hill Publishing Co., New York, USA. Copyright ©2011 by McGraw-Hill
Education, LLC.
Figure 2.8: Analog PLL-based CDR
Digital CDRs replace the charge-pump and analog loop filter with their digital
counterparts, which can be a digital accumulator with phase mixers/interpolators [15]. A
common architecture is shown in Figure 2.9. Digital CDRs avoid the data-dependent
jitter caused by the loop filter and the offsets due to current mismatches in the
charge-pump. They also support supply scaling, and their loop dynamics are PVT invariant.
Figure 2.9: Digital CDR
2.2.2 CDRs in Forwarded-Clock I/O Systems
In a forwarded-clock system, the clock is sent on a separate channel. Hence, effectively,
the frequencies of the data and the sampling clock are exactly equal and correlated. However,
due to different driver strengths and routing, there will be a deterministic skew between data
and clock which needs to be corrected. A low-overhead CDR can accomplish this. The
implementations in [28]-[29] are analog DLL-based CDRs. These CDRs are process-sensitive and
consume large area. In [16][17][18], digital CDRs have been implemented which are process
invariant, consume less area and are compatible with supply scaling.
Reference                    Reutemann        Loke              Mansuri       Li
                             JSSC 2010        JSSC 2012         ISSCC 2013    VLSI 2014
CDR Type                     Analog PLL       Analog DLL        Digital       Digital
Data-Rate (Gb/s)             3.2-6.4          0.4-8.0           6.0-9.0       2.4-6.4
Area (mm^2)                  0.16             -                 0.025         0.36
Energy Efficiency4 (pJ/b)    4.5 @ 6.4Gb/s    1.95 @ 6.4Gb/s    1.02          0.56
Table 2.2: Comparison Between Digital and Analog CDRs
2.3 Existing Simultaneous Bi-directional Link Architectures
An SBD transceiver can double the data rate per pin compared to a conventional
uni-directional transceiver since it transmits and receives data on the same channel
[19]-[20]. At the receiver end of the SBD transceiver, some mechanism is needed to extract
the incoming signal. In the surveyed literature, SBD transceivers achieve this either
by changing the comparator reference voltage according to its output data [19]-[21][22], by
adopting a switched-capacitor hybrid (SCH) to subtract the outbound signal generated by the
replica driver [4][5], or by using a resistor-transconductor (R-gm) hybrid [20]. The analysis
of the data-path implementation is outside the scope of this thesis.
4*Estimation includes complete receiver.
While SBD has better spectral efficiency than uni-directional signaling, this does not
imply that it should be used for all high-speed systems. The major sources of interference
or uncertainty in the separated signal are the replica driver mismatch, long channels and
the echoes introduced by channel discontinuities, which restrict the SBD aggregate rate [5].
As chip-to-memory I/O transceiver designs target short channels, low power and high pin
density, an SBD signaling based transceiver can achieve a higher aggregate data rate.
2.4 Clocking Scheme in Existing SBD Links
For a low-power and more robust system, the SBD transceiver should utilize a
forwarded-clock system and per-pin de-skew circuits to correct the phase difference
developed between the data and clock. References [19]-[23][21][22][20] focus on the data-path
implementation of the SBD system and do not mention any clocking scheme for their
proposed SBD systems. In [24], a forwarded-clock system with an analog-type PLL is
implemented for correcting clock-to-data skew. Plesiochronous clocking with a PI-based
clock-recovery system is used in the SBD link described in [4].
As seen in Figure 2.10, the transmit and receive sides do not share the clock
generator, resulting in higher overall power. The complexity of the clock-recovery system is
also higher since it needs to determine the frequency of the clock and correct the phase
between data and clock. The power reported is. A similar clocking implementation is also
found in [25]. In the SBD link of [5], uni-directional clocks are forwarded from both sides on
separate channels, which implies the use of two dedicated clock channels and is also not
power efficient. Multiphase clocks are generated using a delay-locked loop, and the
clock-to-data skew is corrected using a phase interpolator.
Figure 2.10: SBD Link Architecture with plesiochronous clocking.5
Figure 2.11: SBD Link Architecture with uni-directional clocks forwarded in both
directions.6
5*Figure reprinted with permission from "A 1-Gb/s bidirectional I/O buffer using the current-mode
scheme" by Jae-Yoon Sim, Young-Soo Sohn, Seung-Chan Heo, Hong-June Park and Soo-In Cho, 1999.
IEEE Journal of Solid-State Circuits, vol. 34, Copyright ©1999 by IEEE.
6*Figure reprinted with permission from "An 8 Gb/s Simultaneous Bidirectional Link with On-die
Waveform Capture" by B. Casper, A. Martin, J. E. Jaussi, J. Kennedy, and R. Mooney, 2003. IEEE
Journal of Solid-State Circuits, vol. 38, Copyright ©2003 by IEEE.
3. SIMULTANEOUS BIDIRECTIONAL TRANSCEIVER ARCHITECTURE
The design goal is to build a 40 Gb/s simultaneous bidirectional source-synchronous
transceiver, which implies 20 Gb/s flowing in each direction simultaneously between the
two chips. The transceiver should support operation over the 6" FR4 channel with 12 dB loss
at the 10 GHz Nyquist frequency. Figure 3.1 shows the insertion loss and return loss of the
target channel. A source-synchronous (forwarded-clock) architecture should be utilized
for lower power consumption. A key objective is excellent power efficiency, with a target
of < 0.5 pJ/b.
Figure 3.1: Frequency Response for 6" FR4 Channel
A proposed system-level block diagram is shown in Figure 3.2. The system can be
configured as master or slave through its settings. The master side has its own
clock source, which is an externally supplied quarter-rate clock in this project, and it
forwards the synchronized differential quarter-rate clock through the clock channel to the
slave side as in a standard forwarded-clock system. In order to save a clock generator, the
slave should reuse the forwarded clock as its transmitter clock, but the skew between data
and clock at the master will then increase dramatically to twice the channel propagation time.
Figure 3.2: System-Level Diagram for Proposed SBD Link
As shown in the system-level block diagram of the entire SBD transceiver architecture,
both sides have identical blocks since either side can be configured as master or
slave. A PRBS pattern generated at 1.25 Gbps is multiplexed using the master-side
clock and a 16:1 multiplexer to transmit 20 Gbps data from master to slave. This data is
separated at the slave side using a subtractor for further equalization and de-multiplexing.
A separate channel forwards the differential quarter-rate clock from master to slave. The slave
receives and buffers the clock and feeds it to a de-skew circuit which aligns the clock to
sample the incoming data optimally. A first-order CDR based on edge rotation [18] helps
to track phase drifts due to PVT variations.
Another random stream of data is generated at the slave side and multiplexed
using the forwarded clock to simultaneously transmit 20 Gbps data to the master side. This
slave-side data is separated at the master using a subtractor circuit, equalized and
de-multiplexed with the master-side clock. Thus, simultaneous data transmission between
master and slave occurs on a single channel.
Figure 3.3: Data Transmission from Master-to-Slave: Skew ~0
Figure 3.3 shows a typical skew scenario in a conventional master-slave based SBD
link. When the data is transmitted from master to slave as shown above, skew is small since
driver strengths and trace distances are similar.
However, a critical path arises when data is received by the master from the slave, as
shown in Figure 3.4, since the skew is quite large due to different trace lengths and loading
(> 2 trace delays). We have a 6" FR4 channel whose propagation delay is 180 ps/inch [26].
With this estimate, the skew is larger than 2160 ps.
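The skew estimate above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes the skew is exactly two trace delays, which the text gives as a lower bound.

```python
def round_trip_skew_ps(length_in, delay_ps_per_in):
    """Lower-bound clock-to-data skew at the master: two trace delays."""
    return 2 * length_in * delay_ps_per_in


def unit_interval_ps(data_rate_bps):
    """Bit period (UI) in picoseconds for a given per-direction data rate."""
    return 1e12 / data_rate_bps


# Values taken from the text: 6" FR4 trace, 180 ps/inch, 20 Gb/s per direction.
skew_ps = round_trip_skew_ps(6.0, 180.0)
skew_ui = skew_ps / unit_interval_ps(20e9)
print(f"skew >= {skew_ps:.0f} ps, i.e. about {skew_ui:.1f} UI")
```

Expressed in unit intervals, the skew spans tens of bit periods, which is why the de-skew element at the master needs a wrap-around phase range rather than a small static delay adjustment.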
Figure 3.4: Data Transmission from Slave-to-Master: Skew > 2*trace delay (~2160ps)
4. PROPOSED CLOCKING SCHEME
The forwarded clock path is unidirectional, whereas the data path is bidirectional.
This path can be broadly broken down into the following blocks: clock generation,
forwarded-clock transmitter, clock buffering and de-skew circuit. These blocks are
described in detail in the following sections.
4.1 Multiphase Clock Generation
Quarter-rate transceiver architecture is the preferred choice in recent literature
[9][27][10][11] as it provides higher timing margins for the samplers at the receive side,
even though the average CV²f power is approximately the same as for half-rate architectures.
Quarter-rate clocks also allow a higher fan-out for buffers, leading to lower clock
distribution power. For these reasons, we have selected a quarter-rate architecture
for this project. However, for correct serialization and deserialization of data, quadrature
phase spacing is critical and requires additional circuitry for calibration, as seen in [11].
A 40 Gb/s SBD transceiver sends 20 Gbps data in both directions on the
same channel. Hence, the quarter-rate clock frequency for this system is 5 GHz. The clock is
uni-directional and forwarded from master to slave. Either side can be configured as master
or slave.
At the master side, a differential 5 GHz clock from an external source is injected
into an injection-locked oscillator (ILO) to generate four phases to multiplex the outgoing
data [27]. The clocks are injected into a two-stage differential injection-locked oscillator
using AC-coupled inverters with resistive feedback. The schematic of the ILO is shown in
Figure 4.1.
The output phases are balanced by adding dummy injection buffers. The drive
strength of the injection buffers is controlled using a 3-bit digital control, which helps in
optimizing the locking range. The ILO employs cross-coupled inverter delay cells which,
relative to current-starved delay cells [9], generate a rail-to-rail output swing with
better phase spacing over a wide frequency range [27]. The ILO's frequency is controlled
externally using the voltage signal EN-VCTL. This helps to finely tune the frequency of the
ILO by adjusting the strength of the pull-down transistor in the delay cell [27].
Figure 4.1: Injection-Locked Oscillator for IQ Generation1
Accurate quadrature phase spacing is important for the transmitter to achieve proper
data serialization. Injecting a clock with the same frequency as the ILO output
frequency leads to phase inaccuracies [12]. A simple solution is to inject all four phases
into the ring of the ILO, but this would lead to additional clock routing and significant
power consumption.
1*Figure reprinted with permission from "An 8-to-16Gb/s 0.65-to-1.05pJ/b, Voltage-Mode Transmitter
With Analog Impedance Modulation Equalization and Sub-3 ns Power-State Transitioning" by Y. Song,
et.al., 2014. IEEE Journal of Solid-State Circuits, vol. 49, Copyright ©2014 by IEEE.
Hence, for quadrature-error calibration of the IQ Gen phases, a Quadrature Locked
Loop or QLL [11] is added to the system. The QLL is a closed-loop system which takes
consecutive clock phases (i.e. I and Q) as inputs. An XOR-XNOR based quadrature
phase detector generates an UP or DN signal based on the quadrature error between these
clocks. This error is averaged by a simple charge pump and a loop filter, and is used to
adjust the oscillator's free-running frequency. In this way, the loop is closed and will
minimize any difference between the injected frequency (finj) and the natural frequency
(fo) of the oscillator. The loop locks when the difference |finj - fo| is close to zero. The
block diagram of the closed-loop QLL is shown below.
Figure 4.2: Quadrature Locked Loop for Phase Correction
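The behaviour of an XOR-XNOR quadrature detector can be illustrated with ideal square waves. The following is a minimal behavioural sketch, not the circuit used in this work: the names `square_wave` and `quadrature_error`, the 0.1° time grid and the sign convention (positive when the I/Q spacing exceeds 90°) are all assumptions made for illustration.

```python
TICKS = 3600  # samples per clock period, i.e. 0.1 degree per sample (assumed grid)


def square_wave(phase_deg):
    """One period of an ideal 50%-duty clock delayed by phase_deg."""
    shift = round(phase_deg * TICKS / 360)
    return [1 if (i - shift) % TICKS < TICKS // 2 else 0 for i in range(TICKS)]


def quadrature_error(iq_spacing_deg):
    """Averaged (XOR - XNOR) of I and Q; zero when they are 90 degrees apart."""
    i_clk = square_wave(0.0)
    q_clk = square_wave(iq_spacing_deg)
    up = sum(a ^ b for a, b in zip(i_clk, q_clk)) / TICKS  # XOR duty cycle
    dn = 1.0 - up                                          # XNOR duty cycle
    return up - dn


for spacing in (80.0, 90.0, 100.0):
    print(spacing, quadrature_error(spacing))
```

The averaged difference changes sign around 90°, which is the property a charge pump and loop filter can integrate to steer the free-running frequency.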
The following plots verify the functionality of the Quadrature Locked Loop. The
injected clock frequency is 5 GHz. The control voltage of the ILO settles to 504 mV after
50ns and frequency of oscillation is 5 GHz.
The four clock outputs from the ILO are fed into a 16:1 MUX to transmit 20Gbps
data from master to slave.
4.2 Forwarded Clock Transmitter
Figure 4.3: ILO Control Voltage v/s Time.
Figure 4.4: ILO Frequency v/s Time.
The TX driver used in the SBD link is a low-swing NMOS cross-coupled driver with a
programmable 100-200 mV pk-to-pk swing, similar to the one used in [9]. The schematic
of the output driver is shown in Figure 4.5. The output clock driver at the TX side, which
forwards the differential clock phases, is similar to the data-path output driver to maximize
jitter correlation between the data and clock paths.
Compared to the transmitter on the data channel, the clock transmitter does not need the
16:8 and 8:4 serializers, since the data is re-timed at the last-stage serializer and this last
stage contributes the most to the jitter. Hence, the quarter-rate differential clock pattern is
forwarded onto a separate channel via the output TX driver by serializing a fixed pattern
(1,1,0,0 in this case) at the last-stage 4:1 serializer.
Figure 4.5: Low Swing Output Driver for Clock Transmitter
Since any side can be configured as master or slave, the clock TX on the slave side is
disabled and provides the required 50Ω termination.
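Serializing the fixed pattern 1,1,0,0 at the 20 Gb/s bit rate produces a square wave at the quarter-rate clock frequency. The short sketch below checks this; the helper name `fundamental_frequency_hz` is hypothetical and the serializer is modelled simply as pattern repetition.

```python
def fundamental_frequency_hz(bits, bit_rate_hz):
    """Frequency of a periodic bit stream: bit rate / shortest repeat length."""
    n = len(bits)
    for period in range(1, n + 1):
        if n % period == 0 and all(bits[i] == bits[i % period] for i in range(n)):
            return bit_rate_hz / period
    return bit_rate_hz / n


# 4:1 serializer fed with the fixed word (1, 1, 0, 0) on every cycle.
stream = [1, 1, 0, 0] * 8
print(fundamental_frequency_hz(stream, 20e9))  # forwarded clock frequency in Hz
```

Because the pattern repeats every four bit periods with two high and two low bits, the forwarded clock has a 50% duty cycle and passes through the same last-stage serializer as the data.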
4.3 Receiving Clock Buffer
The loss at 5 GHz is approximately 6 dB for the 6" FR4 channel. The forwarded
unidirectional differential clock is first amplified at the slave side using a CML buffer. The
buffer provides 8-10 dB of gain. The output from the buffer is converted to full CMOS levels
before being distributed to the ILO. The schematic of the buffer is shown in Figure 4.6. The IQ
Gen at the slave side is similar to the one at the master side and generates four phases from
the differentially injected clock.
Figure 4.6: Receiving Clock Buffer and CML-to-CMOS Converter at Slave Side
4.4 De-Skew Circuit
The aim of the de-skew circuit is to reduce the skew between the data path and the clock
such that the clock samples the data at an optimal point. A phase-shifting mechanism in
high-speed I/Os should exhibit the following properties [10]:
1) It should exhibit fast and coherent stop and restart
2) It should have a one-to-one monotonic relationship between the digital code and
the output phase for gradual and continuous transfer curve
3) It should be able to cover complete 360◦ range.
Considering the low-power targets and based on the analysis shown in section 2.1, we
initially planned to use an injection-locked oscillator (ILO) to align data and clock at the
slave side for optimal sampling. The required de-skew range for a quarter-rate ILO is 90°
(±45°). However, with a single de-skew ILO, the wrap-around is not smooth and there
is a jump in phase when the code reaches the edge of the de-skew range. The ILO takes
significant time (~350 ps) to adapt to this sudden change while tracking the phase, as shown
below:
Figure 4.7: Adaptation of ILO while tracking phase. Aqua curve: I output (CLK 0) at
fixed position (−45°). Indigo curve: I output (CLK 0) varying with code (−45° to +45°)
As seen in the plot above, after 4 ns a code jump (0->31) occurs at the update
frequency (1.25 GHz), and the indigo curve (−45° position) should be able to track the phase
of the aqua curve immediately. However, there is a delay of 345 ps in this adaptation, which
can result in loss of bits. Multiple ILOs can be used to extend the de-skew range as in [10],
but this jump in phase cannot be avoided while the CDR tracks the phase. Moreover, a
replica ILO is additionally needed to calibrate the tuning range, which leads to more area and
stability overhead. Hence a single ILO cannot be used as the de-skew element for continuous
360° rotation in a CDR circuit.
Therefore, the de-skew circuit is modified to adopt a phase interpolator as the de-skew
element in the CDR architecture, which satisfies all the criteria mentioned above for
high-speed I/Os. The block diagram of the proposed de-skew circuit is shown in Figure
4.9.
The four phases generated by the IQ Gen are given as inputs to a low-overhead
5/4X first-order phase tracking CDR, which reduces power consumption by reducing the
number of samplers from eight to five [2]. The logic in this CDR rotationally selects two
consecutive data samples via a 4:2 MUX and the corresponding edge clock (and hence the
edge sample) via a 4:1 MUX. The three sampled outputs are fed into a bang-bang phase
detector (BBPD). The BBPD output is deserialized and sampled by a digital accumulator
with programmable depth. The 7-bit output of this digital accumulator is used by
phase-interpolator-based CMOS phase rotators, which can rotate through the full 360° and
independently provide appropriate de-skewed clocks to the samplers.
Figure 4.8: ILO based Phase Rotator as described in2
2*Figure reprinted with permission from "A 3.2-GHz 1.3-mW ILO phase rotator for burst-mode mobile
memory I/O in 28-nm low-leakage CMOS" by M. Aleksic, 2014. Proceedings of European Solid State
Circuits Conference, Copyright ©2014 by IEEE.
An oversampling static CMOS phase interpolator follows these phase rotators and
generates the equally spaced clocks for the quarter-rate data and edge samplers.
Delay-adjustable buffers precede all the samplers to compensate for any static phase
mismatches. The SBD data at the slave side is extracted and reconstructed using an R-gm
based subtractor, and is equalized using a CTLE. Details of data extraction, reconstruction
and equalization are beyond the scope of this thesis. The data is sampled at quarter-rate by
four data samplers and one edge sampler using the de-skewed clocks from the CDR. All the
components used in the CDR are described in detail in the sections below.
Figure 4.9: Proposed Deskew Circuit
4.4.1 CDR Loop Dynamics
Figure 4.10: Proposed Block Diagram for an Edge-Rotating CDR
Due to the forwarded-clock architecture, the clock frequency at the RX side is
exactly equal to the TX side frequency. Only the phase of the received clock needs to be
corrected. Hence, the proposed first-order 5/4X CDR is a suitable choice. The linearized
model is shown below:
Figure 4.11: Closed-Loop Diagram for the CDR
Here, KPD is the phase detector gain, KV is the decimation gain, K1 is the phase update
gain and KPI is the digital-to-phase converter (DPC) gain, with $K_{PD,eff} = K_{PD} \cdot K_V$.
There is an additional term, $z^{-N_{EL}}$, which models the total delay due to buffers or
analog circuitry in the control path of the DPC [15].
To track PVT variations, we have set a maximum target of 1 MHz for the phase tracking
bandwidth. We need to determine appropriate gain values to achieve this targeted
bandwidth. The transition density (TD) for random data is 0.5. In the 5/4X CDR, for every 4
data samples we have only 1 edge sample. Hence TD = 0.5 × 1/4 = 0.125. The effective
KPD is 0.86 per UI for unidirectional signaling and 0.52 per UI for bidirectional signaling.
The details for KPD are mentioned in section 4.5.2. The early/late information from
the BBPD is fed into the digital accumulator. To make sure the digital circuits meet
proper timing margins, the clock to the accumulator is decimated to 1.25
GHz. Hence there is an associated gain of 0.25, represented by KV.
We need to input 5 bits to the phase rotators, which translates to a resolution of
2.8° (90°/32) or 0.03 UI. Using the KPD value and reasonable gain values, the bandwidth
is estimated to be 667 kHz for unidirectional signaling and kHz for bidirectional signaling,
as shown in the table below:
Constant    Description                              Value
KPI         Digital to Phase Converter (DPC) Gain    2^-5 UI/bit
KPD         Phase Detector Gain                      2.06 per UI
K1          Proportional Gain                        2^-3
NEL         Loop Delay                               2.06 per UI
TD          Transition Density                       0.125
KV          Decimation Gain                          0.25
Table 4.1: Loop Parameters for 5/4X CDR
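The tracking bandwidth of a linearized first-order loop can be evaluated numerically. The sketch below is a generic model under stated assumptions, not the analysis used for the estimate above: it lumps all gains into a single per-update `loop_gain`, models the loop as G(z) = loop_gain · z^-N / (1 − z^-1) with closed-loop response G/(1+G), and uses an illustrative gain value. How the constants of Table 4.1 combine into one loop gain and which update rate applies are left open here, and all function names are hypothetical.

```python
import cmath
import math


def closed_loop_magnitude(f_hz, f_update_hz, loop_gain, delay_cycles=0):
    """|H(f)| for H = G/(1+G), with G(z) = loop_gain * z^-delay / (1 - z^-1)."""
    z = cmath.exp(2j * math.pi * f_hz / f_update_hz)
    g = loop_gain * z ** (-delay_cycles) / (1 - 1 / z)
    return abs(g / (1 + g))


def bandwidth_3db_hz(f_update_hz, loop_gain, delay_cycles=0, points=20000):
    """First frequency in a log sweep where |H| falls below 1/sqrt(2)."""
    lo, hi = f_update_hz * 1e-7, f_update_hz / 2
    for k in range(points + 1):
        f = lo * (hi / lo) ** (k / points)
        if closed_loop_magnitude(f, f_update_hz, loop_gain, delay_cycles) < 1 / math.sqrt(2):
            return f
    return None


# Illustrative numbers only: a 1.25 GHz update rate and an assumed loop gain of 2^-8.
bw = bandwidth_3db_hz(1.25e9, 2 ** -8)
print(f"-3 dB tracking bandwidth: {bw / 1e3:.0f} kHz")
```

For small loop gains the result approaches loop_gain · f_update / (2π), which is a convenient rule of thumb when choosing gain values for a target bandwidth.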
The corresponding closed-loop transfer function is shown below:
Figure 4.12: Transfer Function for the CDR
4.4.2 Sampler
A sampler, or comparator, is the decision-making circuit in a high-speed receiver.
The highest data rate of the receiver is constrained by the sampler's decision time.
Also, the sampler's gain and noise performance are significant contributors to the
system sensitivity and to the maximum channel loss the receiver can handle
without bit errors.
Commonly used comparator topologies include the strong-arm latch and its modified
version as mentioned in [28], or a CML-type comparator. While strong-arm comparators and
modified double-tail versions (Schinkel) exhibit no static power consumption and a smaller
aperture time with CMOS-level outputs when compared to CML-type comparators, they
have larger delays and low gain due to a missing regeneration stage. We have used a
regenerative latch which consists of a two-stage dynamic amplifier with an additional third
stage connected in parallel for regeneration [28], as shown in the figure below.
Figure 4.13: Double-Tail Sampler with Regenerative Latch
The latch proposed in [28] has a smaller resolution voltage and a smaller aperture
time than the Schinkel latch.
Figure 4.14: Latch Resolution Comparison for two Double-Tail Comparators
Figure 4.15: Impulse Sensitivity Function Comparison for two Double-Tail Comparators
4.4.3 Phase Rotator
A phase interpolator (PI) is a digital-to-phase conversion unit [15]. Two clock
phases are used as inputs and the output clock is a weighted sum of these inputs. A phase
rotator uses a phase interpolator to achieve a complete 360° rotation in phase.
Implementations of phase interpolator circuits include CML-based analog-type interpolators,
which consume static power and significant area. CMOS inverter based digital
interpolators are suitable for a low-power phase interpolator.
Digital phase interpolators use either multiple copies of inverters, MUX-based or
multiple-switch-based implementations. The schematic of the proposed CMOS PI based
phase rotator is shown in Figure 4.16.
Two 90°-spaced clocks are selected from the IQ Gen outputs by the 4:2 MUX and
fed into the interpolator. The clocks pass through a 5-bit slew-rate control inverter to
enable analog-type smooth mixing of the clocks. The inverter following the slew-rate
controlling inverter is the mixing stage. 31 NMOS-PMOS pairs control the current flowing
through the inverter, which interpolates the output clock between the two clocks. Table 4.2
illustrates the control bits for the two inverters in the phase rotator.
Figure 4.16: Schematic for Proposed Phase Rotator
One UI interval, or 90°, is covered by mixing two consecutive quarter-rate clocks
(like I and Q), providing a resolution of 2.81°. The 4:2 MUX switches to the next 90°
quadrant by selecting the next two consecutive 90°-spaced clocks when the digital code
overflows. Table 4.3 presents the input-output combinations for selecting the two 90°
clocks in the phase rotator.
4.4.4 Oversampling Clock Generator
A CDR needs to interpret data transitions to extract clocking information from the
incoming data. To achieve this, we sample data and edge information using samplers.
Ideally, data and edge information is spaced 0.5 UI apart, as shown in Figure 4.17. To
obtain this information, we need clocks which are spaced 0.5 UI apart. Hence, we use a
static phase interpolator to generate these clocks.
In the proposed system, the edge clocks are rotated. Hence, the outputs of the phase
SELI<31:0> SELQ<31:0>
0000 0000 0000 0000 0000 0000 0000 0000 1111 1111 1111 1111 1111 1111 1111 1111
0000 0000 0000 0000 0000 0000 0000 0001 1111 1111 1111 1111 1111 1111 1111 1110
0000 0000 0000 0000 0000 0000 0000 0011 1111 1111 1111 1111 1111 1111 1111 1100
0000 0000 0000 0000 0000 0000 0000 0111 1111 1111 1111 1111 1111 1111 1111 1000
0000 0000 0000 0000 0000 0000 0000 1111 1111 1111 1111 1111 1111 1111 1111 0000
0000 0000 0000 0000 0000 0000 0001 1111 1111 1111 1111 1111 1111 1111 1110 0000
0000 0000 0000 0000 0000 0000 0011 1111 1111 1111 1111 1111 1111 1111 1100 0000
0000 0000 0000 0000 0000 0000 0111 1111 1111 1111 1111 1111 1111 1111 1000 0000
0000 0000 0000 0000 0000 0000 1111 1111 1111 1111 1111 1111 1111 1111 0000 0000
0000 0000 0000 0000 0000 0001 1111 1111 1111 1111 1111 1111 1111 1110 0000 0000
0000 0000 0000 0000 0000 0011 1111 1111 1111 1111 1111 1111 1111 1100 0000 0000
0000 0000 0000 0000 0000 0111 1111 1111 1111 1111 1111 1111 1111 1000 0000 0000
0000 0000 0000 0000 0000 1111 1111 1111 1111 1111 1111 1111 1111 0000 0000 0000
0000 0000 0000 0000 0001 1111 1111 1111 1111 1111 1111 1111 1110 0000 0000 0000
0000 0000 0000 0000 0011 1111 1111 1111 1111 1111 1111 1111 1100 0000 0000 0000
0000 0000 0000 0000 0111 1111 1111 1111 1111 1111 1111 1111 1000 0000 0000 0000
0000 0000 0000 0000 1111 1111 1111 1111 1111 1111 1111 1111 0000 0000 0000 0000
0000 0000 0000 0001 1111 1111 1111 1111 1111 1111 1111 1110 0000 0000 0000 0000
0000 0000 0000 0011 1111 1111 1111 1111 1111 1111 1111 1100 0000 0000 0000 0000
0000 0000 0000 0111 1111 1111 1111 1111 1111 1111 1111 1000 0000 0000 0000 0000
0000 0000 0000 1111 1111 1111 1111 1111 1111 1111 1111 0000 0000 0000 0000 0000
0000 0000 0001 1111 1111 1111 1111 1111 1111 1111 1110 0000 0000 0000 0000 0000
0000 0000 0011 1111 1111 1111 1111 1111 1111 1111 1100 0000 0000 0000 0000 0000
0000 0000 0111 1111 1111 1111 1111 1111 1111 1111 1000 0000 0000 0000 0000 0000
0000 0000 1111 1111 1111 1111 1111 1111 1111 1111 0000 0000 0000 0000 0000 0000
0000 0001 1111 1111 1111 1111 1111 1111 1111 1110 0000 0000 0000 0000 0000 0000
0000 0011 1111 1111 1111 1111 1111 1111 1111 1100 0000 0000 0000 0000 0000 0000
0000 0111 1111 1111 1111 1111 1111 1111 1111 1000 0000 0000 0000 0000 0000 0000
0000 1111 1111 1111 1111 1111 1111 1111 1111 0000 0000 0000 0000 0000 0000 0000
0001 1111 1111 1111 1111 1111 1111 1111 1110 0000 0000 0000 0000 0000 0000 0000
0011 1111 1111 1111 1111 1111 1111 1111 1100 0000 0000 0000 0000 0000 0000 0000
0111 1111 1111 1111 1111 1111 1111 1111 1000 0000 0000 0000 0000 0000 0000 0000
1111 1111 1111 1111 1111 1111 1111 1111 0000 0000 0000 0000 0000 0000 0000 0000
Table 4.2: Digital Code Table for 5-bit Phase Rotator
rotators are the edge clocks, which are named CLK0, CLK90, CLK180 and CLK270. The data
clocks, viz. CLK45, CLK135, CLK225 and CLK315, are interpolated from these clocks. The
phase spacing between all 8 clocks should be exactly 0.5 UI, or 25 ps for 20 Gbps data.
Selection Inputs <1:0> Clock Phases Selected
00 I and Q
01 Q and IB
10 IB and QB
11 QB and I
Table 4.3: Selection Code and Corresponding Clock Outputs for 4:2 MUX in Phase Rotator
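The code mappings of Tables 4.2 and 4.3 can be expressed compactly. The sketch below is illustrative: the thermometer patterns and the quadrant table follow the two tables, while splitting a 7-bit phase code into a 2-bit quadrant select and a 5-bit interpolation step, the nominal linear phase estimate, and the function names are assumptions made here.

```python
N_STEPS = 32                   # interpolation steps per 90-degree quadrant
MASK = (1 << N_STEPS) - 1      # 32-bit control words, as in Table 4.2

# 4:2 MUX selection, as listed in Table 4.3.
QUADRANT_CLOCKS = {0b00: ("I", "Q"), 0b01: ("Q", "IB"),
                   0b10: ("IB", "QB"), 0b11: ("QB", "I")}


def mixer_controls(step):
    """Return (SELI, SELQ): complementary thermometer codes for a given step."""
    if not 0 <= step <= N_STEPS:
        raise ValueError("step must be in 0..32")
    sel_i = (1 << step) - 1    # 'step' ones at the LSB end
    sel_q = ~sel_i & MASK      # bitwise complement within 32 bits
    return sel_i, sel_q


def decode_phase_code(code7):
    """Assumed split: bits [6:5] pick the quadrant, bits [4:0] the step."""
    quadrant = (code7 >> 5) & 0b11
    step = code7 & 0b11111
    nominal_phase_deg = quadrant * 90 + step * 90 / N_STEPS  # ideal linear mixing
    return QUADRANT_CLOCKS[quadrant], step, nominal_phase_deg


print(decode_phase_code(0b0100011))
print([f"{word:032b}" for word in mixer_controls(4)])
```

Because the quadrant bits sit directly above the step bits, an overflow of the step field advances the 4:2 MUX to the next clock pair, which is what allows the rotator to wrap through the full 360°.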
Figure 4.17: Edge Rotation and Need for Oversampling Clocks
Statically-tuned delay-adjustable buffers with 4-bit digital control are inserted in every
path to account for any phase mismatches. The schematic is shown below in Figure 4.18.
4.4.5 Phase Detector
A CDR is a modified PLL with a changed phase detection mechanism which can
extract both data and phase information. There are two types of phase detector circuits:
linear [29], which provide the sign and magnitude of the phase error, and non-linear binary
[30], whose output is only the sign of the phase error.
Linear phase detectors are not suitable for higher data rates as they require high-speed
XOR gates and also exhibit dead-zones [33]. Non-linear phase detectors alleviate
this problem as they provide equal-width pulses for data and phase information and only
resolve sign information [31].
Figure 4.18: Oversampling Clock Generator
Figure 4.19: Linear Hogge Phase Detector
The non-linear binary, bang-bang or Alexander phase detector is often abbreviated
as BBPD. The basic principle of the BBPD involves subtracting two pulses generated from an
XOR to resolve the phase sign information.
Figure 4.20: Non-Linear Alexander Phase Detector
The first step is to ascertain whether a data transition occurred. The next step is to
determine whether the edge sample is the same as the first bit (early) or the second bit
(late). Because the ideal BBPD characteristic is a step, its gain is undefined, which makes
CDR analysis difficult. We assert that high-frequency noise due to jitter will linearize
the BBPD transfer function [13].
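The early/late decision described above can be sketched as a small truth-table function. This is a behavioral illustration only, not the transistor-level circuit; the function names and the sign convention (+1 for early, -1 for late) are assumptions made for this example.

```python
def bbpd(d_prev, edge, d_next):
    """Alexander (bang-bang) decision from three consecutive samples.

    d_prev and d_next are the data samples before and after the edge sample.
    Returns +1 for early, -1 for late, and 0 when no data transition occurred.
    """
    if d_prev == d_next:
        return 0                          # no transition, no phase information
    return 1 if edge == d_prev else -1    # edge equals first bit -> early


def bbpd_xor(d_prev, edge, d_next):
    """Same decision written as the difference of two XOR outputs."""
    early = edge ^ d_next                 # pulse when edge differs from second bit
    late = d_prev ^ edge                  # pulse when edge differs from first bit
    return early - late


# Print the full truth table for the three one-bit samples.
for a in (0, 1):
    for e in (0, 1):
        for b in (0, 1):
            print(a, e, b, "->", bbpd(a, e, b))
```

Both formulations return the same value for all eight input combinations, which is why subtracting the two XOR pulses is sufficient to recover the sign of the phase error.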
This transfer function can be estimated for unidirectional and bi-directional
signaling by transmitting random data over the proposed SBD model and averaging enough
early/late decisions from the BBPD at each sampling phase to determine whether that phase
is early or late. The transfer curves are shown in Figures 4.21 and 4.22.
The same values for the BBPD gain have been used in the CDR analysis discussed
in section 4.4.1.
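The linearization argument can be illustrated with a closed-form approximation. The sketch below assumes Gaussian jitter and a transition density of 0.5 for random data, which gives the familiar error-function shape. It is a textbook approximation and not the simulated SBD transfer curve used in this work; the jitter value and function names are hypothetical.

```python
import math


def bbpd_average(phase_error_ui, jitter_rms_ui, transition_density=0.5):
    """Expected BBPD output (early minus late) per bit for Gaussian jitter."""
    x = phase_error_ui / (math.sqrt(2.0) * jitter_rms_ui)
    return transition_density * math.erf(x)


def bbpd_gain(jitter_rms_ui, transition_density=0.5):
    """Slope of bbpd_average at zero phase error, in units of 1/UI."""
    return transition_density * math.sqrt(2.0 / math.pi) / jitter_rms_ui


sigma_ui = 0.02  # hypothetical RMS jitter, chosen only for illustration
for phase in (-0.05, -0.01, 0.0, 0.01, 0.05):
    print(f"{phase:+.2f} UI -> {bbpd_average(phase, sigma_ui):+.3f}")
print(f"linearized gain near lock: {bbpd_gain(sigma_ui):.1f} per UI")
```

Under these assumptions the gain near lock is inversely proportional to the RMS jitter, and the averaged output saturates at the transition density for large phase errors.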
Figure 4.21: BBPD Transfer Function for Unidirectional 20Gbps Data
Figure 4.22: BBPD Transfer Function for Simultaneous Bidirectional Data
4.4.6 Digital Filter and FSM
A 4:1 MUX in the edge-sampler clock path selects the clock for the edge sampler. This
MUX introduces an extra delay (~9 ps) in the clock path, so a dummy MUX is added in the
data-sampler clock path to compensate for it. All the samplers then receive equally spaced
clocks. The CDR logic rotationally selects two consecutive data samples using a 4:2 MUX.
These two data samples and the edge sample are fed into the bang-bang phase detector. The
early and late information from the BBPD output is further deserialized by a 1.25 GHz
clock to collect enough information for the digital circuitry, which runs at this sub-rate
clock frequency of 1.25 GHz. The early/late information is averaged by a digital filter,
which provides a 7-bit output to control the phase rotators. The structure of the digital
filter is shown below:
Figure 4.23: Block Diagram of Digital Filter
The CDR is first-order, since the forwarded-clock system guarantees that the
frequency is exactly equal at the transmitter and the receiver. Hence the filter requires
only one accumulator, which is essentially an integrator. The accumulator should have
enough resolution, or dither bits [15], so that it is not a significant source of noise.
The total depth of the accumulator is 15 bits. The top 7 MSBs constitute the output of the
filter and the remaining 9 bits serve as the dither bits, as shown in Figure 4.24. These
dither bits are programmable. Also, from our transfer function, we conclude that a minimum
dithering of 3 bits should be provided, which gives a gain of 2^3 in the system, for the
loop to lock effectively. Hence the number of dither bits ranges from 3 to 9.
The output of the digital accumulator is stored in 7-bit registers. This value is used
by the phase rotator: the 2 MSBs drive the select lines of the 4:2 MUX, and the remaining
5 bits are used to interpolate the clock.
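A behavioral sketch of the accumulator and the code split described above is given here. It is not the synthesized Verilog; the class and function names are hypothetical, and the modulo wrap-around of the accumulator is an assumption based on the phase rotator covering a full clock period.

```python
OUT_BITS = 7  # filter output width stated in the text


class DitherAccumulator:
    """First-order digital filter: one integrator with programmable dither bits."""

    def __init__(self, dither_bits=6, start_code=0):
        if not 3 <= dither_bits <= 9:
            raise ValueError("dither_bits must be between 3 and 9")
        self.dither_bits = dither_bits
        # Assumption: the accumulator wraps because the phase is circular.
        self.modulus = 1 << (OUT_BITS + dither_bits)
        self.acc = start_code << dither_bits

    @property
    def code(self):
        """7-bit phase code: the MSBs above the dither bits."""
        return self.acc >> self.dither_bits

    def update(self, early_minus_late):
        """Integrate one early/late decision and return the phase code."""
        self.acc = (self.acc + early_minus_late) % self.modulus
        return self.code


def split_code(code):
    """Split a 7-bit code into the 2-bit MUX select and 5-bit interpolator code."""
    mux_select = (code >> 5) & 0b11
    interp_code = code & 0b11111
    return mux_select, interp_code


filt = DitherAccumulator(dither_bits=3, start_code=27)
codes = [filt.update(+1) for _ in range(8)]  # eight consecutive early decisions
print(codes, split_code(filt.code))
```

With 3 dither bits, eight net early decisions are needed to advance the output by one code, which shows how the dither bits scale the loop gain by a power of two.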
The basic idea of the 5/4X CDR is to rotate the edge clock, which reduces the number
of samplers and saves power. The FSM performs the following tasks:
Figure 4.24: Illustration of Digital Filter Depth
1. It increments a two-bit counter, which controls the select lines of 4:2 MUX for
selecting data samples. The same select lines also select the corresponding edge
clock via the 4:1 MUX.
2. It provides an option to correct each of the four clock phases independently. This
allows the CDR to tolerate noise due to duty-cycle distortion in the data, as mentioned
in [18]. The FSM corrects the phase for the first clock phase and stores the resultant code
in the 7-bit register. As the FSM rotates, the next clock phase is corrected. The
registers hold the codes and provide the ability to independently tune the phase of
each data-sampler clock. There is an option to update all the clock phases with the
same digital code.
3. The FSM also provides a mechanism to reset codes stored in the registers.
4. It also disables the CDR loop for manual code update and bypass-clock path.
The code for the digital filter and FSM is written in Verilog and synthesized in a 28nm
process with the Synopsys place-and-route tool.
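The four FSM tasks listed above can be summarized in a behavioral sketch. This is not the Verilog source of the design; the class name, method names, and the choice to keep rotating while the loop is disabled are assumptions made for illustration.

```python
class EdgeRotationFSM:
    """Behavioral sketch of the edge-rotating CDR control logic."""

    NUM_PHASES = 4
    CODE_MASK = 0x7F  # 7-bit phase codes

    def __init__(self, independent=True):
        self.select = 0                       # two-bit counter for the MUX select lines
        self.codes = [0] * self.NUM_PHASES    # one 7-bit register per clock phase
        self.independent = independent        # per-phase or shared code update
        self.loop_enabled = True

    def step(self, new_code):
        """Store the filter output for the active phase, then rotate."""
        if self.loop_enabled:
            code = new_code & self.CODE_MASK
            if self.independent:
                self.codes[self.select] = code
            else:
                self.codes = [code] * self.NUM_PHASES
        self.select = (self.select + 1) % self.NUM_PHASES
        return self.select

    def reset(self):
        """Clear the stored codes and the rotation counter."""
        self.codes = [0] * self.NUM_PHASES
        self.select = 0

    def manual_update(self, phase, code):
        """Disable the CDR loop and write one phase code by hand."""
        self.loop_enabled = False
        self.codes[phase] = code & self.CODE_MASK


fsm = EdgeRotationFSM()
for code in (10, 20, 30, 40):
    fsm.step(code)
print(fsm.codes, fsm.select)
```

Keeping one register per phase is what allows each data-sampler clock to settle at its own code when the incoming data has duty-cycle distortion.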
4.4.7 Bypass Clock Path
To enable independent transmitter and receiver testing, an alternate bypass clock
path has been provided in the chip. The block diagram of the bypass clocking is shown
below:
Figure 4.25: Bypass Clock Path
The bypass clock path takes a 10 GHz differential clock as input and uses CMOS
dividers to generate 5 GHz four-phase clocks. These clocks bypass the IQ Generator and are
distributed to the transmitter. At the receive side, the clocks from the dividers bypass
the receive-side IQ Generator and are fed to the phase rotators. In bypass mode, the CDR
loop is disabled and the de-skew codes are set manually via digital control circuitry. A
5-bit duty-cycle and quadrature error correction DAC precedes the transmitter and receiver
for manual tuning of the four clock phases.
4.5 Layout
The chip layout of the SBD transceiver, fabricated in the TSMC 28nm CMOS HPC process,
is shown in Figure 4.26. The chip contains two data lanes, one clock lane, and common
circuitry. Each data lane has a transmitter, a receiver and a CDR. The transmitter's
serializer and driver are placed close to the I/O pads. Behind the TX driver, the SBD
receiver has the R-gm followed by the CTLE and samplers. The CDR is placed beside the
samplers to provide the required clocks while minimizing the critical clock distribution.
At the right side of the chip, the clock lane includes a driver identical to that of
the data lane and a clock receiver. The external reference clock is brought in from the
left side of the chip to the clock buffer, which is followed by the IQ generator, and the
four-phase clocks are distributed to each lane. The circuitry on the top side of the chip
is the common part, including on-chip resistor calibration, termination impedance control,
and current and voltage bias.
The layout is zoomed in to highlight the clocking portions in Figure 4.27. The
fabricated die photo is shown in Figure 4.28.
Figure 4.26: Layout of the Fabricated Chip
Figure 4.27: Layout of the Chip showing CDR
Figure 4.28: Die Photo of the Fabricated Chip
5. SIMULATION RESULTS
5.1 Ideal Model Simulations
An ideal model of the 5/4X CDR (as shown in Fig. 20), using the gain values from
the table, was formulated in Verilog-A. The AMS simulation is run for 150 ns. The input
data is an ideal PRBS pattern. The eye diagram when the CDR is locked with ideal input data
is shown below:
Figure 5.1: Eye Diagram showing Ideal CDR Lock Position
The digital filter output increases or decreases based on the early/late information
from the BBPD. The CDR locks when the edge clock samples the middle of the data
transition. Depending on the accumulator's depth, the code is either stable at one value
or jumps between two fixed codes. The ideal model is also simulated with PRBS data after
passing through a 2" channel (no equalization) to verify the depth of the accumulator. With
3-bit filtering, the clock locks between 2 codes. With 6-bit filtering, the clock locks
with a unique code.
Figure 5.2: Eye Diagram showing CDR Lock Position for 3-bit dither bits
Figure 5.3: Eye Diagram showing CDR Lock Position for 6-bit dither bits
5.2 Clocking Simulation Results
The ILO free-running frequency is 5GHz for Control Voltage = 504mV. The schematic
simulation results are shown below:
Figure 5.4: Free-running ILO at 5GHz @ Control Voltage of 504 mV
Figure 5.5: Free-Running Frequency v/s Control Voltage for ILO
The free-running curve of the ILO is plotted by disabling injection and sweeping
the control voltage of the ILO from 200mV to 900mV. The clocks from the IQ Gen are used in
the clock transmitter to serialize a fixed 1,1,0,0 pattern, which gives a 5GHz clock
pattern at the output of the TX as shown below:
Figure 5.6: Output of Forwarded Clock Transmitter
Figure 5.7: Output of CML to CMOS Buffer at clock RX side
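The relation between the serialized 1,1,0,0 pattern and the 5GHz forwarded clock can be verified with a few lines of arithmetic. The helper below is an illustrative sketch with a hypothetical name; it simply counts rising edges per repetition of the pattern at the 20 Gbps serializer rate.

```python
def pattern_clock_frequency_ghz(pattern, data_rate_gbps):
    """Average rising-edge rate of a repeating bit pattern, in GHz."""
    n = len(pattern)
    # pattern[i - 1] wraps to the last bit for i = 0 because the pattern repeats.
    rising = sum(1 for i in range(n) if pattern[i - 1] == 0 and pattern[i] == 1)
    if rising == 0:
        raise ValueError("pattern has no rising edge")
    period_ui = n / rising            # clock period measured in unit intervals
    return data_rate_gbps / period_ui


print(pattern_clock_frequency_ghz([1, 1, 0, 0], 20))
```

The pattern repeats every 4 UI, so at 50 ps per UI the period is 200 ps, matching the quarter-rate clock.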
The clock is forwarded onto the channel and received at the slave side. It is then
passed through a CML buffer and amplified to full swing by a CML-to-CMOS buffer
circuit, as shown in Figure 5.7. The clocks are injected into a similar IQ Gen block at
the slave side. The four-phase clocks are fed to the phase rotators. The transient
response of the phase rotator for all code values is shown below:
Figure 5.8: Phase Rotator Transient Output for all code values
The phase transfer curve of this phase rotator is shown below:
Figure 5.9: Phase Transfer Curve for Phase Rotator
The Differential Non-Linearity (DNL) curve over 1 UI is shown below. From the
curve we see that -0.55 LSB < DNL < 1 LSB. Since the DNL never reaches -1 LSB, the phase
transfer curve is monotonic.
Figure 5.10: Differential Non-Linearity Curve for Phase Rotator for 1UI
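The link between the DNL bound and monotonicity can be made concrete with a short sketch. It assumes an endpoint-fit LSB (the average step over the measured range), which may differ from the LSB definition used to produce Figure 5.10; the sample phase values are made up for illustration and are not simulation data.

```python
def dnl_lsb(phases_ps):
    """DNL of each code step in LSB, taking the average step as 1 LSB."""
    steps = [b - a for a, b in zip(phases_ps, phases_ps[1:])]
    lsb = (phases_ps[-1] - phases_ps[0]) / len(steps)
    return [step / lsb - 1.0 for step in steps]


def is_monotonic(dnl):
    """A transfer curve is strictly increasing when every DNL value exceeds -1 LSB."""
    return all(d > -1.0 for d in dnl)


# Hypothetical phase-versus-code samples in ps (not measured data).
example_phases = [0.0, 1.0, 3.0, 4.0, 6.0]
example_dnl = dnl_lsb(example_phases)
print(example_dnl, is_monotonic(example_dnl))
```

A DNL of -1 LSB corresponds to a step of zero width, so any value above that bound means each code increment moves the phase forward.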
The output of the oversampling clock circuit is shown below. The spacing between
all the eight phases is 25ps.
Figure 5.11: Output of 2X Oversampling CLK Generator
The CDR is first operated in open loop to verify the functionality of the digital
filter. When the BBPD output is always early, the digital code increases continuously.
Figure 5.12: Output of Digital Filter for Always Early Clock
When the clock is operated such that it is always late, the digital code decreases
continuously.
Figure 5.13: Output of Digital Filter for Always Late Clock
The schematic-level implementation of the complete closed-loop CDR is simulated
with an ideal PRBS pattern to verify the functionality.
Figure 5.14: Schematic Level Simulation 5/4X CDR with Ideal PRBS Data
The schematic-level CDR is also tested with PRBS data after passing through a 2"
channel with 6-bit LSB dithering, as shown below:
Figure 5.15: Eye Diagram for Schematic-Level CDR using 2" Channel and 6-bit LSB filtering
The digital code settles after 50ns and then toggles between two code values (27 and
28), as shown in the figure below:
Figure 5.16: Digital Filter Code Convergence
The power summary for the schematic-level CDR is shown in the table below:
Component Power @ 0.9 V
Oversampling CLK generator 1.7 mW
Phase Rotator (PI + 4:2 MUX + 2:4 Encoder) X4 3.4 mW
4:1 MUX 20 µW
4:2 MUX 30 µW
BBPD + DeSer 323 µW
Digital Accumulator + CDR Logic 600 µW
IQ Gen + QLL 2 mW
Data Samplers 800 µW
Total 8.8 mW
Table 5.1: Power Summary of the CDR
Table 5.2 compares previous CDR implementations in the literature with this work.
Metric Li [18] Mansuri[32] This Work
VDD 0.8V 1.08V 0.9V
Process 65nm 32nm 28nm
Data Rate 14Gbps 16Gbps 20Gbps
Clock Rate 3.5 GHz 4 GHz 5 GHz
Energy Efficiency 0.56 pJ/b 1.02 pJ/b 0.44 pJ/b
Table 5.2: Comparison of Digital CDRs
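The energy-efficiency entry for this work follows directly from the total CDR power in Table 5.1 and the data rate. The helper below is a small illustrative sketch with a hypothetical function name; it only restates the unit conversion.

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy per bit in pJ/b.

    mW / Gbps = (1e-3 J/s) / (1e9 b/s) = 1e-12 J/b, so the ratio is already in pJ/b.
    """
    return power_mw / data_rate_gbps


# Values for this work taken from Tables 5.1 and 5.2.
this_work = energy_per_bit_pj(power_mw=8.8, data_rate_gbps=20)
print(f"{this_work:.2f} pJ/b")
```

Dividing 8.8 mW by 20 Gbps gives the 0.44 pJ/b listed for this work.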
5.3 Link Performance
Unidirectional 20 Gbps data is simulated with half-rate bypass 10 GHz clocks. The
clocks are divided to get quarter-rate clocks for transmitting data at the master side.
Figure 5.17: Eye Diagram for 20Gbps Unidirectional Data After 6” FR4 Channel
Figure 5.18: Eye Diagram for 20Gbps Unidirectional Data After CTLE
20 Gbps data is sent simultaneously in both directions over the 6" channel. The phase
rotators are manually adjusted to get the optimal sampling point. The eye diagram is shown
below:
Figure 5.19: Eye Diagram for 20Gbps Bidirectional Data After CTLE and FIR Echo Cancellation
The TX and RX are individually verified with a PRBS15 checker, as shown below.
This verifies that the clock is sampling at the ideal position.
Figure 5.20: PRBS15 Data is serialized and checked by PRBS15 Checker at TX Output
Figure 5.21: 20 Gbps PRBS15 Data is sent over the channel and checked by PRBS15 Checker at Sampler Output
5.4 Proposed Test Plan
The proposed test procedure, shown below, optimizes the system settings before SBD
signaling starts. The first step before turning on the system is to auto-calibrate the
on-chip resistor against the external 400Ω reference resistor. The second step is to set
the termination impedance based on the calibrated on-chip resistor. Then, to compensate
for device mismatch and process variation, the RX offset and R-gm need to be calibrated by
balancing the number of 1s and 0s at the sampler outputs. Once the RX is ready to receive
the data and the forwarded clock, the CDR receives the uni-directional clock and data and
starts to track the phase. Finally, the echo cancellation taps auto-adapt their weights
while the link outputs uni-directional signaling. After this procedure, the SBD system can
start simultaneous bi-directional signaling on the channel.
Figure 5.22: Proposed Test Plan for SBD Link using CDR
6. CONCLUSION
This thesis describes a low-power implementation of a clocking and skew-optimization
mechanism for a source-synchronous simultaneous bi-directional link system. The
forwarded-clock architecture allows for maximum jitter correlation between the data path
and the clock path. Also, as the effective clock frequency is equal at both ends, only the
phase between clock and data needs to be corrected, which results in a simple first-order
CDR implementation.
The proposed SBD transceiver with forwarded clock system can be configured in
master and slave roles. The master has a quarter-rate clock source and forwards the clock
to the slave through the clock channel. The slave utilizes the forwarded clock as both
receiver and transmitter clock source. The impact of high-frequency jitter caused by the
skew of 2X channel propagation time is analyzed by JTF and JTOL, and the analysis shows
that the JTOL is still larger than 0.2UI. The transceiver was implemented in 28nm CMOS
technology. The power efficiency is 1.4 pJ/b at 40 Gbps, with the CDR contributing
approximately 8.8 mW of power at a 0.9 V supply. The forwarded clock TX consumes 5 mW of
power, which is amortized over the two data channels. The IQ Gen consumes this much power
and is amortized over 3 channels (2 data, 1 clock). The CDR occupies 0.018 mm² of the chip
area. The overall area occupied by clocking is 0.024 mm².
REFERENCES
[1] “International technology roadmap for semiconductors 2015.” Web, June 2015.
[2] S. Palermo, “High-speed serial i/o design for channel-limited and power-constrained
systems,” CMOS Nanoelectronic Analog and RF VLSI Circuits, K. Iniewski. New
York, NY: McGraw-Hill, pp. 289–336, June 2011.
[3] J. R. Broomall and H. V. Deusen, “Extending the useful range of copper intercon-
nects for high data rate signal transmission,” in 1997 Proceedings 47th Electronic
Components and Technology Conference, pp. 196–203, May 1997.
[4] H. Tamura, M. Kibune, Y. Takahashi, Y. Doi, T. Chiba, H. Higashi, H. Takauchi,
H. Ishida, and K. Gotoh, “5 gb/s bidirectional balanced-line link compliant with ple-
siochronous clocking,” in 2001 IEEE International Solid-State Circuits Conference.
Digest of Technical Papers. ISSCC (Cat. No.01CH37177), pp. 64–65, Feb 2001.
[5] B. Casper, A. Martin, J. E. Jaussi, J. Kennedy, and R. Mooney, “An 8-gb/s simulta-
neous bidirectional link with on-die waveform capture,” IEEE Journal of Solid-State
Circuits, vol. 38, pp. 2111–2120, Dec 2003.
[6] S. Shekhar, M. Mansuri, F. O’Mahony, G. Balamurugan, J. E. Jaussi, J. Kennedy,
D. J. Allstot, R. Mooney, and B. Casper, “Strong injection locking in low-q lc
oscillators: Modeling and application in a forwarded-clock i/o receiver,” IEEE Transactions
on Circuits and Systems I: Regular Papers, vol. 56, pp. 1818–1829, Aug 2009.
[7] A. L. S. Loke, B. A. Doyle, S. K. Maheshwari, D. M. Fischette, C. L. Wang, T. T.
Wee, and E. S. Fang, “An 8.0-gb/s hypertransport transceiver for 32-nm soi-cmos
server processors,” IEEE Journal of Solid-State Circuits, vol. 47, pp. 2627–2642,
Nov 2012.
[8] A. Ragab, Y. Liu, K. Hu, P. Chiang, and S. Palermo, “Receiver jitter tracking char-
acteristics in high-speed source synchronous links,” Journal of Electrical and Com-
puter Engineering, vol. 2011, pp. 1–15, August 2011.
[9] Y. Song, R. Bai, K. Hu, H. Yang, P. Y. Chiang, and S. Palermo, “A 0.47–0.66 pj/bit,
4.8–8 gb/s i/o transceiver in 65 nm cmos,” IEEE Journal of Solid-State Circuits,
vol. 48, pp. 1276–1289, May 2013.
[10] M. Aleksic, “A 3.2-ghz 1.3-mw ilo phase rotator for burst-mode mobile memory i/o
in 28-nm low-leakage cmos,” in ESSCIRC 2014 - 40th European Solid State Circuits
Conference (ESSCIRC), pp. 451–454, Sept 2014.
[11] M. Raj, S. Saeedi, and A. Emami, “A wideband injection locked quadrature clock
generation and distribution technique for an energy-proportional 16–32 gb/s optical
receiver in 28 nm fdsoi cmos,” IEEE Journal of Solid-State Circuits, vol. 51,
pp. 2446–2462, Oct 2016.
[12] K. Hu, T. Jiang, J. Wang, F. O’Mahony, and P. Y. Chiang, “A 0.6 mw/gb/s, 6.4–7.2
gb/s serial link receiver using local injection-locked ring oscillators in 90 nm cmos,”
IEEE Journal of Solid-State Circuits, vol. 45, pp. 899–908, April 2010.
[13] H.-C. Lee, “An estimation approach to clock and data recovery,” Ph.D. Thesis, Stan-
ford University, 2006.
[14] F. Gardner, “Charge-pump phase-lock loops,” IEEE Transactions on Communica-
tions, vol. 28, pp. 1849–1858, November 1980.
[15] J. L. Sonntag and J. Stonick, “A digital clock and data recovery architecture for multi-
gigabit/s binary links,” IEEE Journal of Solid-State Circuits, vol. 41, pp. 1867–1875,
Aug 2006.
[16] S. Chen, H. Li, L. Yang, Z. Yang, W. Hu, and P. Y. Chiang, “A 1.2 pj/b 6.4 gb/s
8+1-lane forwarded-clock receiver with pvt-variation-tolerant all-digital clock and
data recovery in 28nm cmos,” in Proceedings of the IEEE 2013 Custom Integrated
Circuits Conference, pp. 1–4, Sept 2013.
[17] E. Prete, D. Scheideler, and A. Sanders, “A 100mw 9.6gb/s transceiver in 90nm
cmos for next-generation memory interfaces,” in 2006 IEEE International Solid State
Circuits Conference - Digest of Technical Papers, pp. 253–262, Feb 2006.
[18] H. Li, S. Chen, L. Yang, R. Bai, W. Hu, F. Y. Zhong, S. Palermo, and P. Y. Chiang,
“A 0.8v, 560fj/bit, 14gb/s injection-locked receiver with input duty-cycle distortion
tolerable edge-rotating 5/4x sub-rate cdr in 65nm cmos,” in 2014 Symposium on VLSI
Circuits Digest of Technical Papers, pp. 1–2, June 2014.
[19] T. Takahashi, M. Uchida, T. Takahashi, R. Yoshino, M. Yamamoto, and N. Kitamura,
“A cmos gate array with 600 mb/s simultaneous bidirectional i/o circuits,” IEEE
Journal of Solid-State Circuits, vol. 30, pp. 1544–1546, Dec 1995.
[20] Y. Tomita, H. Tamura, M. Kibune, J. Ogawa, K. Gotoh, and T. Kuroda, “A 20-gb/s
simultaneous bidirectional transceiver using a resistor-transconductor hybrid in
0.11-µm cmos,” IEEE Journal of Solid-State Circuits, vol. 42, pp. 627–636, March 2007.
[21] H. Wilson and M. Haycock, “A six-port 30-gb/s nonblocking router component us-
ing point-to-point simultaneous bidirectional signaling for high-bandwidth intercon-
nects,” IEEE Journal of Solid-State Circuits, vol. 36, pp. 1954–1963, Dec 2001.
[22] J.-H. Kim, S. Kim, W.-S. Kim, J.-H. Choi, H.-S. Hwang, C. Kim, and S. Kim, “A
4-gb/s/pin low-power memory i/o interface using 4-level simultaneous bi-directional
signaling,” IEEE Journal of Solid-State Circuits, vol. 40, pp. 89–101, Jan 2005.
[23] R. Mooney, C. Dike, and S. Borkar, “A 900 mb/s bidirectional signaling scheme,” in
Proceedings ISSCC ’95 - International Solid-State Circuits Conference, pp. 38–39,
Feb 1995.
[24] J.-Y. Sim, Y.-S. Sohn, S.-C. Heo, H.-J. Park, and S.-I. Cho, “A 1-gb/s bidirectional i/o
buffer using the current-mode scheme,” IEEE Journal of Solid-State Circuits, vol. 34,
pp. 529–535, April 1999.
[25] R. J. Drost and B. A. Wooley, “An 8-gb/s/pin simultaneously bidirectional transceiver
in 0.35-µm cmos,” IEEE Journal of Solid-State Circuits, vol. 39, pp. 1894–1908,
Nov 2004.
[26] A. Weiler, A. Pakosta, and A. Verma, “High-speed layout guidelines,” Application
Report, Texas Instruments, pp. 1–21, November 2006.
[27] Y. Song, H. Yang, H. Li, P. Y. Chiang, and S. Palermo, “An 8–16 gb/s, 0.65–1.05
pj/b, voltage-mode transmitter with analog impedance modulation equalization and
sub-3 ns power-state transitioning,” IEEE Journal of Solid-State Circuits, vol. 49,
pp. 2631–2643, Nov 2014.
[28] A. Roshan-Zamir, O. Elhadidy, H. Yang, and S. Palermo, “A reconfigurable 16/32
gb/s dual-mode nrz/pam4 serdes in 65-nm cmos,” IEEE Journal of Solid-State Cir-
cuits, vol. 52, pp. 2430–2447, Sept 2017.
[29] C. R. Hogge, “A self correcting clock recovery circuit,” IEEE Transactions on Elec-
tron Devices, vol. 32, pp. 2704–2706, Dec 1985.
[30] J. D. H. Alexander, “Clock recovery from random binary signals,” Electronics Let-
ters, vol. 11, pp. 541–542, October 1975.
[31] Y. M. Greshishchev and P. Schvan, “Sige clock and data recovery ic with linear-type
pll for 10-gb/s sonet application,” IEEE Journal of Solid-State Circuits, vol. 35,
pp. 1353–1359, Sept 2000.
[32] M. Mansuri, J. E. Jaussi, J. T. Kennedy, T. Hsueh, S. Shekhar, G. Balamurugan,
F. O’Mahony, C. Roberts, R. Mooney, and B. Casper, “A scalable 0.128-to-1tb/s 0.8-
to-2.6pj/b 64-lane parallel i/o in 32nm cmos,” in 2013 IEEE International Solid-State
Circuits Conference Digest of Technical Papers, pp. 402–403, Feb 2013.
