



#### PH.D. DISSERTATION

# A DESIGN OF 20GBPS/LANE SERIAL LINK FOR MEMORY INTERFACE

# DESIGN OF A 20GBPS-CLASS SERIALIZING TRANSCEIVER FOR MEMORY INTERFACE

BY

HANKYU CHI

AUGUST, 2013

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING SEOUL NATIONAL UNIVERSITY

# A DESIGN OF 20GBPS/LANE SERIAL LINK FOR MEMORY INTERFACE

By

Hankyu Chi

# A Dissertation Submitted to the Department of Electrical and Computer Engineering In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

#### At

# SEOUL NATIONAL UNIVERSITY

August, 2013

**Committee in Charge** 

Professor Suhwan Kim, Chairman

Professor Deog-Kyoon Jeong, Vice-Chairman

Professor Young-June Park

Professor Joon-Sik Kih

Professor Yongsam Moon

# ABSTRACT

Various types of serial links for current and future memory interfaces are presented in this thesis.

First, a PHY design for commercial GDDR3 memory is proposed. The GDDR3 PHY consists of a read path, a write path, and a command path. The write path and command path calibrate skew using a VDL (variable delay line), while the read path calibrates skew using a DLL (delay locked loop) and a VDL. There are four data channels and one command/address channel. Each data channel consists of one clock signal (DQS) and eight data signals (DQ). The data channels operate at 1.2Gbps (1.08Gbps~1.2Gbps), and the command/address channel operates at 600Mbps (540Mbps~600Mbps). In particular, this thesis concentrates on a DLL design for high speed operation and for tolerance to SSN (simultaneous switching noise).

Second, a serial link design for silicon photonics is proposed. Silicon photonics is the strongest candidate for the next generation memory interface. The designs of a driver for the optical modulator, and of a TIA (trans-impedance amplifier) and an LA (limiting amplifier) for the photodiode, are discussed. The link operates above 12.5Gbps, but it consumes considerable power, 7.2mW/Gbps (transmitter core) and 2mW/Gbps (receiver core), because it is connected to optical devices that have large parasitic capacitance. An overall receiver that includes a CDR (clock and data recovery) circuit is also implemented. Multiple chips are fabricated in 65nm and 0.13um CMOS processes.

Finally, an electrical serial link for a 20Gbps memory link is proposed. The overall architecture uses forwarded clocking and is simple and intuitive. It does not need an additional synchronizer. This open loop delay matched stream line receiver finds the optimum sampling point with a DCDL (digitally controlled delay line) controller and is expected to consume low power because of its structure. Only a two phase half rate clock is transmitted through the clock channel, but half rate time interleaved sampling is performed with the aid of an initial value settable PRBS chaser. A CMOS chip is fabricated in a 65nm process, and the transceiver occupies 2500um x 2500um. The expected power consumption is about 2.6mW(2.4mW)/Gbps for the transmitter and 4.1mW(2.7mW)/Gbps for the receiver. Further improvement in power consumption is expected in a more advanced process.

Keywords : Delay locked loop (DLL), Phase locked loop (PLL), silicon photonics, source series termination (SST), trans-impedance amplifier, limiting amplifier, forwarded clocking, embedded clocking, Digitally controlled delay line (DCDL), clock and data recovery (CDR).

**Student number : 2008-30245** 

# **CONTENTS**

| SECTION   | TITLE |
|-----------|-------|
|           | ABSTRACT |
|           | CONTENTS |
|           | LIST OF FIGURES |
|           | LIST OF TABLES |
| CHAPTER 1 | INTRODUCTION |
| 1.1       | MOTIVATION |
| 1.2       | THESIS ORGANIZATION |
| CHAPTER 2 | A SERIAL LINK PHY DESIGN FOR GDDR3 MEMORY INTERFACE |
| 2.1       | INTRODUCTION |
| 2.2       | GDDR3 MEMORY INTERFACE ARCHITECTURE |
| 2.2.1     | READ PATH ARCHITECTURE |
| 2.2.2     | WRITE PATH ARCHITECTURE |
| 2.2.3     | COMMAND PATH ARCHITECTURE |
| 2.3       | DLL DESIGN FOR MEMORY INTERFACE |
| 2.3.1     | SSN (SIMULTANEOUS SWITCHING NOISE) |
| 2.3.2     | DLL ARCHITECTURE |
| 2.3.3     | VOLTAGE CONTROLLED DELAY LINE (VCDL) |
| 2.3.4     | HYSTERESIS COARSE LOCK DETECTOR (HCLD) |
| 2.3.5     | DYNAMIC PHASE DETECTOR AND CHARGE PUMP |
| 2.4       | SIMULATION RESULT |
| 2.5       | CONCLUSION |
| CHAPTER 3 | OPTICAL FRONT-END SERIAL LINK DESIGN FOR 20GBPS MEMORY INTERFACE |
| 3.1       | SILICON PHOTONICS INTRODUCTION |
| 3.2       | OPTICAL FRONT-END TRANSMITTER DESIGN |
| 3.2.1     | MODULATOR DRIVER REQUIREMENTS |
| 3.2.3     | MODULATOR DRIVER DESIGN - CURRENT MODE DRIVER |
| 3.3       | OPTICAL FRONT-END RECEIVER DESIGN |
| 3.3.1     | OPTICAL RECEIVER BACK END REQUIREMENTS |
| 3.3.2     | OPTICAL RECEIVER BACK END DESIGN – TIA |
| 3.3.3     | OPTICAL RECEIVER BACK END DESIGN – LA, DRIVER |
| 3.3.4     | OPTICAL RECEIVER BACK END DESIGN – CDR |
| 3.4       | MEASUREMENT AND SIMULATION RESULTS |
| 3.4.1     | MEASUREMENT AND SIMULATION ENVIRONMENTS |
| 3.4.2     | OPTICAL TX FRONT END MEASUREMENT AND SIMULATION |
| 3.4.3     | OPTICAL RX FRONT END MEASUREMENT AND SIMULATION |
| 3.4.4     | OPTICAL RX BACK END SIMULATION |
| 3.4.5     | OPTICAL-ELECTRICAL OVERALL MEASUREMENTS |
| 3.4.6     | DIE PHOTO AND LAYOUT |
| 3.5       | CONCLUSION |
| CHAPTER 4 | ELECTRICAL FRONT-END SERIAL LINK DESIGN FOR 20GBPS MEMORY INTERFACE |
| 4.1       | INTRODUCTION |
| 4.2       | CONVENTIONAL ELECTRICAL FRONT-END HIGH SPEED SERIAL LINK ARCHITECTURES |
| 4.3       | DESIGN CONCEPT AND PROPOSED SERIAL LINK ARCHITECTURE – OPEN LOOP DELAY MATCHED STREAM LINED RECEIVER |
| 4.3.1     | PROPOSED OVERALL ARCHITECTURE |
| 4.3.2     | DESIGN CONCEPT |
| 4.3.3     | PROPOSED PROTOCOL AND LOCKING PROCESS |
| 4.4       | OPTIMUM POINT SEARCH ALGORITHM BASED DCDL CONTROLLER |
| 4.5       | DCDL (DIGITALLY CONTROLLED DELAY LINE) DESIGN |
| 4.6       | DFE (DECISION FEEDBACK EQUALIZER) AND OTHER BLOCKS DESIGN |
| 4.7       | SIMULATION RESULTS |

#### **BIBLIOGRAPHY**

# LIST OF FIGURES

| FIGURE   | CAPTION |
|----------|---------|
| FIG 1.1  | SMART-PHONE AND TABLET SALES [1.1] |
| FIG 1.2  | APPLE I-DEVICE MEMORY BANDWIDTH [1.2] |
| FIG 1.3  | DRAM DATA RATE/PIN TREND [1.3] |
| FIG 1.4  | DDR INTERFACE IN PCB [SOURCE: RAMBUS] |
| FIG 1.5  | A CLASSIFICATION OF SERIAL LINK |
| FIG 2.1  | GDDR3 OVERALL INTERFACE ARCHITECTURE |
| FIG 2.2  | GDDR3 PHY ARCHITECTURE |
| FIG 2.3  | READ PATH ARCHITECTURE |
| FIG 2.4  | READ PATH TIMING REQUIREMENT |
| FIG 2.5  | WRITE PATH ARCHITECTURE |
| FIG 2.6  | CMD/ADDR PATH ARCHITECTURE |
| FIG 2.7  | CONVENTIONAL DELAY LOCKED LOOP |
| FIG 2.8  | ARCHITECTURE OF THE PROPOSED DLL |
| FIG 2.9  | TIMING DIAGRAM OF THE MODIFIED CLD |
| FIG 2.10 | HCLD BLOCK DIAGRAM |
| FIG 2.11 | BLOCK AND TIMING DIAGRAM OF A HYSTERESIS LOGIC |
| FIG 2.12 | (A) BLOCK DIAGRAM OF PHASE DETECTOR, (B) PD GAIN |
| FIG 2.13 | (A) BLOCK DIAGRAM OF A CHARGE PUMP, (B) CP CONTROL VOLTAGE-CURRENT CURVE |
| FIG 2.14 | SSN MODELING AND OVERALL DLL LOCKING PROCESS |
| FIG 2.15 | LOCKING PROCESS COMPARISON |
| FIG 2.16 | EYE DIAGRAM COMPARISON |
| FIG 2.17 | GDDR3 PHY LAYOUT |
| FIG 2.18 | GDDR3 PHY LAYOUT |

| FIGURE   | CAPTION |
|----------|---------|
| FIG 3.25 | TIA OVERALL ARCHITECTURE |
| FIG 3.26 | LIMITING AMPLIFIER [3.10],[3.11],[3.12],[3.13] |
| FIG 3.27 | CHERRY-HOOPER LA AND NEGATIVE CAPACITANCE CIRCUIT [3.12],[3.13] |
| FIG 3.28 | NEGATIVE MILLER CAPACITANCE LA [3.12],[3.13] |
| FIG 3.29 | OUTPUT DRIVER [3.10],[3.12] |
| FIG 3.30 | CLOCK AND DATA RECOVERY CONCEPT [3.16] |
| FIG 3.31 | OPTICAL FRONT-END OVERALL RECEIVER ARCHITECTURE |
| FIG 3.32 | SAMPLER ARCHITECTURE AND OFFSET CONTROL |
| FIG 3.33 | OPTICAL DEVICE MODELING FOR SIMULATION |
| FIG 3.34 | TRANSMITTER ELECTRICAL MEASUREMENT ENVIRONMENTS |
| FIG 3.35 | RECEIVER ELECTRICAL MEASUREMENT ENVIRONMENTS |
| FIG 3.36 | ELECTRICAL MEASUREMENT ENVIRONMENT |
| FIG 3.37 | OVERALL MEASUREMENT ENVIRONMENT |
| FIG 3.38 | PROPOSED INVERTER BASED VOLTAGE MODE DRIVER MEASUREMENT RESULTS |
| FIG 3.39 | PROPOSED MODIFIED STACKED FET CURRENT MODE DRIVER MEASUREMENT RESULTS |
| FIG 3.40 | MODIFIED SIMPLE HIGH SWING SST VOLTAGE MODE DRIVER |
| FIG 3.41 | PROPOSED MODIFIED SIMPLE HIGH SWING SST VOLTAGE MODE DRIVER |
| FIG 3.42 | GAIN BOOSTED CASCADE TIA MEASUREMENT RESULTS |
| FIG 3.43 | SHUNT AND SERIES PEAKING TIA |
| FIG 3.44 | NEGATIVE MILLER CAPACITANCE TIA (12.5GBPS) |
| FIG 3.45 | INVERTER BASED TIA (RMS JITTER: 4.14PS AT 10GBPS, 3.08PS AT 12.5GBPS) |
| FIG 3.46 | SIMULATION RESULTS (A) EDGE SAMPLING, (B) DATA SAMPLING, (C) FREQUENCY LOCKING |
| FIG 3.47 | OPTICAL MEASUREMENT. (A) ENVIRONMENTS AND (B) RESULTS [3.15] |
| FIG 3.48 | OPTICAL MEASUREMENT WITH COMMERCIAL PD |
| FIG 3.49 | GAIN BOOST CASCODE TIA, LA DIE PHOTO (5000UM X 5000UM) |
| FIG 3.50 | SHUNT AND SERIES PEAKING TIA, LA DIE PHOTO [3.10] (2700UM X 1800UM) |
| FIG 3.51 | NEGATIVE MILLER CAPACITANCE TIA, LA DIE PHOTO [3.12] (2400UM X 1100UM) |
| FIG 3.52 | MODIFIED SIMPLE SST DRIVER DIE PHOTO [3.12] (2500UM X 1100UM) |
| FIG 3.53 | STACKED FET MODULATOR DRIVER DIE PHOTO (950UM X 800UM) |
| FIG 3.54 | PROPOSED MODIFIED SIMPLE SST DRIVER DIE PHOTO (2350UM X 1000UM) |
| FIG 3.55 | INVERTER BASED TIA, LA DIE PHOTO (2400UM X 900UM) |
| FIG 4.1  | ELECTRICAL MEMORY LINK COMPOSITION [4.1] |
| FIG 4.2  | GLOBAL DLL AND PHASE INTERPOLATOR ARCHITECTURE [4.1] |
| FIG 4.3  | INJECTION LOCKED ARCHITECTURE [4.4] |
| FIG 4.4  | OVERALL ARCHITECTURE OF PROPOSED SERIAL LINK |
| FIG 4.5  | SKEW COMPENSATION METHOD |
| FIG 4.6  | DESIGN CONCEPT DIAGRAM OF PROPOSED SERIAL LINK RECEIVER |
| FIG 4.7  | BLOCK DIAGRAM OF PROPOSED ARCHITECTURE (A) TRANSMITTER, (B) RECEIVER |
| FIG 4.8  | PROPOSED PROTOCOL FOR PROPOSED ARCHITECTURE |
| FIG 4.9  | LOCKING PROCESS OF PROPOSED SERIAL LINK |
| FIG 4.10 | DCDL CONTROLLER TIMING DIAGRAM |
| FIG 4.11 | DCDL CLK_CODE - D_CODE TABLE |
| FIG 4.12 | OPTIMUM SAMPLING POINT SEARCH ALGORITHM |
| FIG 4.13 | DCDL TABLE IN NOISE ENVIRONMENT – PROBLEM |
| FIG 4.14 | DCDL TABLE IN NOISE ENVIRONMENT – SOLUTION |
| FIG 4.15 | DCDL TABLE OPTIMIZATION – ACTIVE AREA EXPANSION |
| FIG 4.16 | RECALIBRATION ALGORITHM |
| FIG 4.17 | DCDL CONTROLLER ALGORITHM FLOW CHART |
| FIG 4.18 | DESIGNED DCDL |
| FIG 4.19 | GLITCH FREE METHOD |
| FIG 4.20 | DCDL RESOLUTION |
| FIG 4.21 | DCDL POWER DISSIPATION |
| FIG 4.22 | DESIGNED CTLE (CONTINUOUS TIME LINEAR EQUALIZER) |

| FIGURE   | CAPTION |
|----------|---------|
| FIG 4.23 | DESIGNED HALF RATE PRBS CHASER |
| FIG 4.24 | DESIGNED DFE (DECISION FEEDBACK EQUALIZER) |
| FIG 4.25 | DCDL CONTROLLER SIMULATION ENVIRONMENT |
| FIG 4.26 | SIMULATION JITTER SPECIFICATION |
| FIG 4.27 | DCDL CONTROLLER SIMULATION – SAMPLING POINTS SEARCH ALGORITHM |
| FIG 4.28 | DCDL CONTROLLER SIMULATION – OPTIMUM SAMPLING POINT SEARCH ALGORITHM |
| FIG 4.29 | RECALIBRATION ALGORITHM |
| FIG 4.30 | RECALIBRATION ALGORITHM AND PROTOCOL |
| FIG 4.31 | RESOLUTION CONTROL AND INTERPOLATION TECHNIQUE |
| FIG 4.32 | OVERALL SIMULATION – LOCKING PROCESS |
| FIG 4.33 | OVERALL SIMULATION – LOCKING PROCESS |
| FIG 4.34 | CHIP FLOOR PLANNING |
| FIG 4.35 | CHIP LAYOUT |

# LIST OF TABLES

| TABLE     | CAPTION |
|-----------|---------|
| TABLE 2.1 | SUMMARY OF GDDR3 PHY KEY CHARACTERISTICS [2.7] |
| TABLE 3.1 | MODULATOR DRIVER SUMMARY |
| TABLE 3.2 | RECEIVER ELECTRICAL MEASUREMENTS SUMMARY |
| TABLE 4.1 | CONVENTIONAL MULTI CHANNEL LINK SUMMARY |
| TABLE 4.2 | POWER ESTIMATION |
| TABLE 4.3 | FIGURE OF MERIT |

# **CHAPTER 1**

# INTRODUCTION

## 1.1 Motivation

Over the last two decades, people have experienced a sea change in daily life. Smart-phones and tablet PCs came into general use, with the release of Apple's iPhone and iPad as the catalyst, so people can access the internet easily and quickly through these devices, anywhere and at any time. The popularization of the internet has made life more convenient and has sharply increased the desire to exchange content. For example, large numbers of high quality video files are uploaded to YouTube every day, and more than a billion people exchange photos and text in real time through SNS (social networking services) such as Facebook and Twitter. These trends cause heavy internet traffic, and demand for higher data communication bandwidth is increasing steadily. People constantly want larger and more vivid pictures and video. All of this is made possible by rapid data exchange between the memory controller or CPU (central processing unit) and the memory.



Fig 1.1 Smart-phone and tablet sales [1.1]

The memory industry has developed explosively around SDRAM and Flash. At first, memory was used mostly in PCs (personal computers) and laptops. After MP3 players became popular, memory demand began to grow. But audio data is much smaller than video data, so the core market and the market scale did not change much at that time. However, the appearance and spread of smart-phones and tablets created tremendous demand for memory, in data rate as well as in quantity. As a result, the memory market has grown and diversified sharply. Smart-phones' share of total DRAM consumption grew from 4.4 percent in 2011 to 7.6 percent in 2012, and it is projected to expand to 16.0 percent in 2015 [1.2].

Fig 1.2 shows the relationship between Apple i-device generations and their memory bandwidth. It shows that devices need not only memory capacity but also data rate for their processing. Surprisingly, the memory bandwidth requirement has doubled every year for 7 years, and this tendency is expected to continue for a while.



Fig 1.2 Apple i-device memory bandwidth [1.2]

One easy way to double bandwidth is to double the number of memory chips. But this solution brings two important problems. First, the memory controller processing speed must increase along with memory capacity. Second, consumer electronics strive to be thin and compact. So it is indispensable to double the bandwidth of the memory itself. Fortunately, memory process technology has advanced and memory bandwidth continues to increase. Fig 1.3 shows the DRAM data rate per pin for past DDRs and GDDRs. Although earlier DRAM data rates were about 1Gbps per pin, the latest DRAM data rates range from 5Gbps to 10Gbps. But memory bandwidth enhancement alone is not enough. As mentioned above, it must be accompanied by an increase in memory controller processing speed.



Fig 1.3 DRAM data rate/pin trend [1.3]

Fig 1.4 shows a DDR DRAM memory and its surroundings on a PCB (printed circuit board). It is easy to see that the memory interface is very complicated and is composed of many data and clock signal lines. For correct data exchange, the lengths of all data and clock signal lines must be matched to within one clock cycle. But to increase the data rate, the clock period must be decreased and the number of data lines must be increased. So the memory controller and the PCB must match more data lines within a shorter time. The increased routing complexity caused by strict line length matching also causes crosstalk between lines. Besides, PCB lines are made of copper, which shows sharp attenuation near 10GHz. As a result, memory controller and PCB design becomes more and more difficult. Beyond line length matching among data and clock signals, special techniques or complex circuits, such as equalization circuits that compensate for channel attenuation, are needed.
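To make the length matching constraint concrete, the following sketch converts a per-pin data rate into a bit period and then into an equivalent trace length mismatch. The propagation delay of 6.7 ps/mm is an assumed, typical value for an FR-4 trace and the 10% skew budget is an illustrative assumption; neither is taken from this thesis.

```python
# Assumed values for illustration only (not taken from this thesis).
PROP_DELAY_PS_PER_MM = 6.7   # assumed typical FR-4 propagation delay
SKEW_BUDGET_FRACTION = 0.1   # assumed fraction of a bit period allowed for skew


def unit_interval_ps(data_rate_gbps: float) -> float:
    """Return the bit period (unit interval) in picoseconds."""
    return 1000.0 / data_rate_gbps


def max_length_mismatch_mm(data_rate_gbps: float,
                           budget: float = SKEW_BUDGET_FRACTION,
                           delay_ps_per_mm: float = PROP_DELAY_PS_PER_MM) -> float:
    """Return the trace length mismatch that consumes the given skew budget."""
    return budget * unit_interval_ps(data_rate_gbps) / delay_ps_per_mm


if __name__ == "__main__":
    for rate in (1.2, 5.0, 10.0, 20.0):
        print(f"{rate:5.1f} Gbps: UI = {unit_interval_ps(rate):7.1f} ps, "
              f"mismatch budget = {max_length_mismatch_mm(rate):6.2f} mm")
```

Under these assumptions the allowed mismatch shrinks in proportion to the bit period, which is why matching becomes harder as the data rate rises.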



Fig 1.4 DDR interface in PCB[source : Rambus]

The front-end of the memory and the memory controller is the PHY layer, and it is implemented as a serial link. A serial link is a system that exchanges serialized data between two devices. Fig 1.5 shows a classification of serial links. Serial links are classified into two groups, embedded clock links and forwarded clock links. An embedded clock link is a serial link in the true sense of the word. Its transmitter transmits only serialized data to the receiver. The problem is that the transmitted data is synchronized to the transmitter clock, while the transmitter and receiver have different reference clock sources whose frequencies are slightly different. So after the receiver receives the data signal, it recovers a clock from the received data by using a PLL (phase locked loop) or DLL (delay locked loop) based CDR (clock and data recovery) circuit, and then it recovers the data using the recovered clock. For accurate clock recovery, the data must have sufficient transitions between '0' and '1', so a data encoding scheme such as 8b/10b encoding is used to ensure a short run length.
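The need for sufficient transitions can be checked with a simple run length measurement. The sketch below is not an 8b/10b encoder. It only measures the longest run of identical bits in a sequence, which is the quantity such encodings bound (8b/10b limits runs to five identical bits). The function name is hypothetical.

```python
from itertools import groupby
from typing import Iterable


def max_run_length(bits: Iterable[int]) -> int:
    """Return the length of the longest run of identical consecutive bits."""
    return max((sum(1 for _ in group) for _, group in groupby(bits)), default=0)


if __name__ == "__main__":
    sparse = [0] * 12 + [1] + [0] * 7       # few transitions: hard for a CDR
    dense = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]  # frequent transitions
    print(max_run_length(sparse))
    print(max_run_length(dense))
```

A long run gives the CDR no edges to track, so the recovered clock can drift until the next transition arrives.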



Fig 1.5 A Classification of serial link (a) Embedded clock link, (b) Forwarded clock link [1.4]

A forwarded clock link transmits a clock signal as well as data signals. It has strengths and weaknesses. The weak point is that it needs an additional wire relative to an embedded clock link. A cable is a much more expensive component than a CMOS (complementary metal-oxide-semiconductor) chip. But the larger the number of data channels, the smaller the cost overhead. The strong point is that the clock signal carries accurate frequency information. So there is no need to recover the clock with a complex block such as a PLL or DLL; only clock to data skew compensation (deskewing) is required. In addition, the received clock and data are closely correlated in jitter. As a result, a forwarded clock link has much better jitter performance than an embedded clock link.

With this background, this research deals with a serial link circuit design methodology for memory interface.

First, a memory controller PHY serial link is studied for the established graphics memory GDDR3, which operates from 500MHz to 1.2GHz. Fortunately, there is no need to consider channel attenuation in this frequency range. GDDR3 has 4 channels, and each channel has a DQS (clock) signal and 8 DQ (data) signals. Each of the 8 DQ signals has a different skew, and each skew must be calibrated. The proposed architecture is a kind of forwarded clock link and is based on a DLL and a VDL (variable delay line) for skew calibration. By using the DLL and VDL, the phase relationship between clock and data can be controlled between edge alignment and center alignment.
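As a rough illustration of the adjustment from edge alignment to center alignment, the sketch below computes the delay that moves a clock edge from the data transition to the middle of the data bit, that is, half a bit period. It then quantizes that delay to a delay line step. The 20 ps step is a hypothetical value chosen for the example, not a parameter of the proposed VDL.

```python
def center_align_delay_ps(data_rate_gbps: float) -> float:
    """Delay from a data edge to the center of the bit, in picoseconds."""
    unit_interval_ps = 1000.0 / data_rate_gbps
    return unit_interval_ps / 2.0


def delay_code(target_ps: float, step_ps: float) -> int:
    """Nearest delay line code for a target delay (step_ps is an assumed resolution)."""
    return round(target_ps / step_ps)


if __name__ == "__main__":
    STEP_PS = 20.0  # hypothetical delay line resolution
    for rate in (1.08, 1.2):  # data channel rates quoted in the abstract
        target = center_align_delay_ps(rate)
        code = delay_code(target, STEP_PS)
        print(f"{rate} Gbps: target {target:.1f} ps -> code {code} ({code * STEP_PS:.0f} ps)")
```

The residual between the target and the quantized delay is one reason a finer delay step, or a DLL that tracks the clock period, is useful for accurate sampling.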

Second, a memory controller PHY serial link is studied in preparation for next generation memory with a bandwidth of about 20Gbps. As mentioned, copper suffers serious attenuation above 10 GHz. So there is now a movement to replace copper with optical fiber. Originally, optical devices were made of compound semiconductors and were much more difficult to fabricate than CMOS. Moreover, CMOS performance was much poorer than that of compound semiconductors. But as the cost of compound semiconductors decreases and CMOS performance improves, silicon photonics is coming to the fore. The concept of silicon photonics is as follows. The front end of the transceiver is composed of optical devices, such as a photodiode and a modulator, which convert signals from light to electrical form or the reverse. The back end of the transceiver is composed of amplification circuits for the converted signal and general electrical CMOS link logic. So an overall silicon photonics link can take advantage of both the optical compound semiconductor link and the electrical CMOS link by partitioning the link properly between the two.

Finally, another memory controller PHY serial link for a next generation memory with a bandwidth of about 20Gbps is studied. The silicon photonics approach mentioned above has some limitations and difficulties, so this research proposes a realistic solution that can be implemented as an electrical link. It uses several techniques to overcome high speed signal attenuation on the copper medium and the increase in layout complexity caused by multi-phase clock distribution. A linear equalizer and a DFE (decision feedback equalizer) compensate channel loss adaptively. A real time PRBS (pseudo random binary sequence) chaser predicts the appointed PRBS data pattern, so the proposed architecture can operate in a half rate interleaved way with only a two phase clock. In addition, a DCDL controller performs data recovery and data synchronization among four channels at the same time. The proposed architecture is simple and intuitive to implement, so it appears to be well suited to a high speed serial link for a memory interface.

In conclusion, this research is an extensive study of serial links for memory interfaces. It covers serial link PHY design for the GDDR3 memory interface and proposes two serial link design methods for a 20Gbps next generation memory. One is silicon photonics, which replaces the electrical copper medium with optical fiber. It is a fundamental solution and is expected to spread to many high speed interfaces, but it has some limitations and difficulties. The other is an open loop, delay matched, streamlined receiver with a linear equalizer and a DFE. It is a practical and immediately applicable solution that needs no change of environment. Prototype chips are fabricated in 130nm, 90nm, and 65nm CMOS processes.

Simulation and measurement results show that the designed architectures are suitable for current and next generation memory serial link PHYs.

The remainder of this thesis focuses on the analysis and implementation issues of the serial link for memory interface, mainly on the receiver side. In Chapter 2, a practical memory PHY design for the existing GDDR3 memory is studied. It operates with a 1.2GHz clock and uses a DLL and VDLs for accurate sampling. The GDDR3 PHY is implemented in a 0.13um CMOS process. In Chapter 3, this research considers silicon photonics as a technique for reaching 20Gbps. An optical device (front end) and a CMOS circuit (back end) form the overall system, and they use an optical fiber medium instead of a copper medium. A 12.5Gbps back end of the serial link is studied, and a 10Gbps overall silicon photonics system is described. However, it is difficult to reach the final goal of silicon photonics, the integration of the optical device and the CMOS circuit in a CMOS process, and the parasitics of the optical device are still too large. So in Chapter 4, a realistic alternative for a 20Gbps next generation memory interface is proposed. It is a simple and intuitive open loop, delay matched, streamlined receiver. It uses an equalizer and a PRBS chaser for half rate interleaved communication with only two clock phases. Finally, Chapter 5 concludes this dissertation and summarizes the potential benefits of the proposed approach.

# **CHAPTER 2**

# A SERIAL LINK PHY DESIGN FOR GDDR3 MEMORY INTERFACE

# 2.1 Introduction

This chapter discusses GDDR3 PHY design. GDDR3 is one of the widely used memories in graphics cards and other consumer electronics such as 3D TVs. The overall architecture and the architecture of the three main paths (read path, write path, and command path) are described. Then the DLL, one of the important blocks of the memory controller, is discussed, and a novel DLL that uses an HCLD for SSN immunity and high speed operation is proposed.

[Figure: block diagram with blocks labeled Controller, GDDR3 PHY, GDDR3 SDRAM, Auto Skew Calibration, Skew Calibration Control, Memory Timing Controller, Initializer, Line Buffer, FIFO, Deserializer, Serializer, Read/write Controller, Control Path, and PLL (clock generator)]

# 2.2 GDDR3 memory interface architecture

Fig 2.1 GDDR3 overall interface architecture

Fig 2.1 shows the overall GDDR3 system, which is composed of a controller (TCON), a physical layer (PHY), and a memory (GDDR3). The controller is a digitally synthesized block and serves many functional operations for memory communication. The GDDR3 PHY transmits the commands issued by the TCON to the memory, and provides the basic functions for communicating data related to READ/WRITE operations between the TCON and the memory. The PHY consists of a serializer/deserializer, a latency control block, a PLL for clock generation, skew calibration logic, variable delay lines (VDL), and I/O buffers. The overall PHY architecture is shown in Fig 2.2. There are three main paths, one each for the READ, WRITE, and COMMAND operations. The command path transmits commands from the TCON to the memory. The read path transmits read data from the memory to the TCON after a READ command has been transmitted from the TCON to the memory through the PHY. Conversely, the write path transmits write data to the memory. Write data are received from the TCON together with a WRITE command.



Fig 2.2 GDDR3 PHY architecture

The skew calibration logic compensates the timing error caused by several timing mismatch factors, such as wire length mismatch and the difference in trip time from the TCON to the PHY and from the memory to the PHY. This logic controls the delay of each signal, so it can compensate the skew among signals. A training sequence provides the skew information, and the calculated skew is applied to the VDLs. This operation aligns the signals with one another by setting the VDLs and thereby determining each path delay, so the overall operation can proceed without skew mismatch problems. Meanwhile, the PLL generates the system clocks used in the TCON and the PHY. TCON operation is synchronized with a 135MHz clock signal provided by the PHY PLL. Read data (RDQ), write data (WDQ), command (CMD), and address (ADDR) signals are exchanged between the TCON and the PHY. Between the PHY and the memory, READ/WRITE data (IO\_DQ) are exchanged at 1080Mbps, and the strobe, clock (CK/CKB), and CMD/ADDR signals are exchanged at 540MHz.

The PHY is composed of four DQ channels and one CMD/ADDR channel. Each DQ channel exchanges 64-bit READ/WRITE data with the TCON and 8-bit IO\_DQ data with the memory. The write and read paths consist of an 8:1 serializer and a 1:8 deserializer, respectively. A VDL controls data and clock delays with constant resolution through a 4-bit code. The delay of every single wire that communicates with the memory is controlled by a VDL. The DLL in the read path shifts the phase of RDQS by 90°. This 90° phase shift is needed because the RDQ and RDQS signals are received edge aligned from the memory; the shift changes the edge alignment into a center alignment. The deserializer and the FIFO in the read path are normally held in reset and are activated only when a READ command is received. These states are controlled by the CMD/RD/WR control block.
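As an illustration of how a 4-bit VDL code could be chosen once the training sequence has measured the arrival time of each signal, the following Python sketch delays every signal toward the latest one. It is a minimal behavioural sketch, not the calibration logic of this PHY: the step size `VDL_STEP_PS`, the function names, and the example arrival times are hypothetical values introduced only for illustration.

```python
VDL_BITS = 4          # 4-bit control code, as stated in the text
VDL_STEP_PS = 30.0    # hypothetical delay per code step (not from the source)


def vdl_codes(arrival_ps):
    """Return one VDL code per signal so that each signal lines up with the latest one."""
    max_code = (1 << VDL_BITS) - 1
    latest = max(arrival_ps)
    codes = []
    for t in arrival_ps:
        # Quantize the required extra delay to the VDL resolution and clamp to 4 bits.
        code = round((latest - t) / VDL_STEP_PS)
        codes.append(min(max(code, 0), max_code))
    return codes


def residual_skew_ps(arrival_ps, codes):
    """Skew that remains after the quantized VDL delays are applied."""
    aligned = [t + c * VDL_STEP_PS for t, c in zip(arrival_ps, codes)]
    return max(aligned) - min(aligned)


# Hypothetical measured arrival times of eight DQ signals, in picoseconds.
arrivals = [0.0, 40.0, 95.0, 130.0, 20.0, 60.0, 110.0, 75.0]
codes = vdl_codes(arrivals)
print(codes, residual_skew_ps(arrivals, codes))
```

With these assumed numbers the residual skew stays within one VDL step, which is the best a fixed-resolution delay line can achieve as long as the codes do not saturate.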

#### 2.2.1 Read path architecture



Fig 2.3 Read path architecture

Without DQ VDL





Fig 2.4 Read path timing requirement

Fig 2.3 shows the read path of the GDDR3 PHY. After a READ command is received, the memory transmits read data aligned with RDQS through the DQ channel.

The right side of Fig 2.3 shows the RDQ (read data) and RDQS (read clock) signals from the memory. The transmission speed of RDQ is 1080Mbps and that of RDQS is 540MHz, and they are edge aligned with each other. So, to sample the DQs accurately in a memory controller system, it is necessary to shift the phase of DQS by 90 degrees. There are many methods for shifting the phase of a signal. With a VDL, a fixed shift can be obtained according to a digital code. In this design, VDL and DLL schemes are combined for effective skew compensation. The reference DLL generates the proper control voltage for a clock delayed by one cycle, and transfers this control voltage to the replica half delay line of each channel. The VDLs in the DQ and DQS paths change their delay according to their own control codes, so they can calibrate the timing mismatch between DQ and DQS that results from PCB trace differences and process variation. A training sequence is used to determine the VDL control codes for skew compensation. The read path timing requirement is shown in Fig 2.4. The sampler deserializes the 8 data signals at 1080 Mbps into 16 data signals at 540 Mbps, and the 16:64 deserializer produces 64 data signals at 135Mbps, which is suitable for the TCON. However, RDQ and its deserialized data are synchronized with RDQS, whereas the TCON signals are synchronized with the system clock provided by the PHY. So a FIFO moves RDQ into the system clock domain.

Another important block is the read latency control. After the TCON transmits a READ command to the PHY, it receives the corresponding data from the PHY a fixed latency later. The read latency control block disables the deserializer and the FIFO until a READ command arrives at the PHY. After a READ command, the latency control block enables the deserializer and the FIFO and transmits data to the TCON after the calculated read latency. The read latency is about 7~15 cycles (540MHz) and includes the CAS latency, the PHY transit time, and the memory to PHY trip time.
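The read path rates quoted above can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the rates given in the text (1080 Mbps, 540 MHz, 135 Mbps, and the 8/16/64 bit widths); the variable and function names are introduced here for illustration. It confirms that each deserialization stage carries the same throughput, and that a 90° shift of the 540 MHz strobe equals half of a DQ bit time, which is what moves the sampling point to the center of the data eye.

```python
DQ_RATE_MBPS = 1080    # per-pin DQ rate (from the text)
DQS_FREQ_MHZ = 540     # read strobe frequency (from the text)


def stage_throughput_mbps(width_bits, rate_mbps):
    """Aggregate throughput of one stage of the read path."""
    return width_bits * rate_mbps


# (bus width, per-wire rate) after the pins, the sampler, and the 16:64 deserializer.
stages = [(8, 1080), (16, 540), (64, 135)]
throughputs = [stage_throughput_mbps(w, r) for w, r in stages]

dqs_period_ps = 1e6 / DQS_FREQ_MHZ       # one RDQS cycle in picoseconds
quarter_shift_ps = dqs_period_ps / 4     # 90 degree phase shift of RDQS
bit_time_ps = 1e6 / DQ_RATE_MBPS         # one DQ unit interval

print(throughputs)
print(round(quarter_shift_ps, 1), round(bit_time_ps / 2, 1))
```

Both printed delays are about 463 ps, so the 90° shifted strobe edge falls in the middle of each 926 ps data bit.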

In this system, the levels of the DQ and DQS input signals are converted from 3.3V to 1.2V. A level converter consumes much power whenever a transition happens. Moreover, about forty signals change simultaneously across the four channels. These are the main sources of SSN. SSN causes voltage fluctuation and degrades the reliability of the sampled DQ signals. Therefore, SSN must be considered in PHY design.



### 2.2.2 Write path architecture

Fig 2.5 Write path architecture

Fig 2.5 shows the write path architecture of the GDDR3 memory interface. The write path transmits the WRITE command and WRITE data from the TCON to the memory. After the PHY receives the command and data, it must transmit them with a constant latency (1~7 clock cycles at 540MHz). A WL shifter controls the latency between the WRITE command and the WRITE data. The left side of Fig 2.5 shows the WDQ parallel data, which consist of WDQ and data mask signals. The WDQ parallel data are synchronized with the TCON system clock generated by the PHY. Data synchronized with the 135MHz TCON system clock pass through the 8:1 serializer, which moves them into the 1080MHz PHY system clock domain. The phase relationship between the WRITE command and the WDQ data is checked in the serializer (CMD-DQ match), and the WL shifter shifts CMD and WDQ by the selected write latency value. The fixed skew between data and CMD is compensated by VDLs, and the data are transmitted to the memory through the I/O buffer. The CMD decode block switches the I/O buffer between read mode and write mode. When the PHY transmits WDQ, the WDQS generator generates a strobe signal synchronized with WDQ. WDQS is generated from the 1080MHz PHY clock signal and the wdqs\_mask signal produced by the CMD decoder. WDQ and WDQS are center aligned and are transmitted to the memory together, each through its own WL shifter, VDL, and I/O buffer.
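The role of the WL shifter can be pictured as a programmable shift register that delays a stream by the selected number of clock cycles. The Python sketch below is a behavioural illustration under that assumption, not the circuit used in the PHY; the function name `wl_shift`, the `idle` fill value, and the example stream are hypothetical.

```python
from collections import deque


def wl_shift(stream, write_latency, idle=0):
    """Delay a stream of words by write_latency clock cycles (1 to 7, as in the text)."""
    if not 1 <= write_latency <= 7:
        raise ValueError("write latency must be between 1 and 7 cycles")
    # The pipeline starts filled with idle words, one per cycle of latency.
    pipe = deque([idle] * write_latency)
    shifted = []
    for word in stream:
        pipe.append(word)               # a new word enters on each clock
        shifted.append(pipe.popleft())  # the oldest word leaves on the same clock
    return shifted


print(wl_shift([1, 2, 3, 4, 5], write_latency=2))
```

Each word appears at the output exactly `write_latency` cycles after it entered, so applying the same setting to WDQ and WDQS keeps them aligned with each other while moving both relative to the command.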



#### 2.2.3 Command path architecture

Fig 2.6 CMD/ADDR path architecture

Fig 2.6 shows the CMD/ADDR path, which transmits commands and addresses to the memory. The incoming parallel command and address signals are synchronized with the system clock and pass through the serializer, and their skew is compensated by VDLs. As mentioned above, WDQ and WDQS are given the WL latency relative to the center of the CSB command signal and are then transmitted to the memory through the I/O buffer.

#### 2.3 DLL design for memory interface

#### 2.3.1 SSN(Simultaneous switching noise)

SSN is a voltage fluctuation between power and ground that occurs when multiple output drivers switch simultaneously. The amount of fluctuation is related to the inductance between the device ground (power) and the system ground (power), and can be expressed as the product of the inductance L and the rate of current change di/dt. L is composed of the package bond wire inductance, the package trace inductance, and the board inductance [2.1]. The other term, di/dt, is cumulative and proportional to the speed and the number of simultaneously switching I/Os. Therefore, as memory controllers scale down to meet increasing bandwidth requirements, the side effects of SSN become more apparent with the increasing speed and number of I/Os, and the SSN problem must be considered in current memory controller design.
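The relation described above can be written compactly. As a hedged sketch, assume $N$ identical drivers that switch at the same instant and share one effective ground (power) inductance $L_{eff}$; the symbols $N$, $L_{eff}$, and $i_k$ are introduced here for illustration and do not appear in the source:

$$V_{SSN} = L_{eff}\,\frac{d}{dt}\sum_{k=1}^{N} i_k \approx N\,L_{eff}\,\frac{di}{dt}$$

Under this assumption the noise grows linearly with both the number of simultaneously switching I/Os and the edge rate of each driver, which is why faster and wider memory interfaces make SSN worse.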

The voltage fluctuation caused by SSN produces delay variation and logic faults in a system, so it may make the system unstable and degrade its performance. Until now, however, SSN problems have been treated only at the external circuit level, for example by improving the package and board power wiring. Dispersing the power currents and shortening the distance to ground are simple and easy solutions. But the SSN problem is getting worse, and these measures have several limitations. So the SSN problem must be considered at the circuit design level, too.

A delay locked loop (DLL) is a logic block that has been widely used in microprocessors, memory interfaces, and communication ICs for generating on-chip clocks and suppressing skew and jitter in the clocks [2.2]. In designing a DLL, the effect of SSN ripples must be considered along with the harmonic lock and stuck problems that have been emphasized until now [2.8].

#### 2.3.2 DLL architecture

DLLs are widely used for generating on-chip clocks with low jitter. Fig 2.7 shows the architecture of a conventional DLL. The reference clock is delayed through the VCDL, and the two clock phases are compared in the PD. The result is applied to the CP and LF so that the phase difference between the reference clock and the delayed clock is held at zero. This type of DLL has the advantages that it does not accumulate jitter and that it locks quickly compared with a PLL. But it adjusts only phase, not frequency, and the operating frequency and the Voltage Controlled Delay Line (VCDL) range are severely limited by the harmonic lock problem [2.2], [2.5].
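The harmonic lock and stuck problems can be reproduced with a very small behavioural model. The Python sketch below is an idealized first-order loop, not the circuit of Fig 2.7: the phase detector is assumed to see the delay only modulo one reference period, the loop gain and the delay limits `d_min` and `d_max` are hypothetical, and all delays are normalized to the reference period.

```python
def settle_delay(initial_delay, period=1.0, gain=0.2, d_min=0.3, d_max=2.5, steps=200):
    """Return the delay a phase-only DLL settles to from a given initial delay."""
    d = initial_delay
    for _ in range(steps):
        # A phase detector cannot tell a delay of d from d + period, so the
        # error it sees is the distance to the nearest multiple of the period.
        err = d - period * round(d / period)
        # First-order correction, limited by the tuning range of the delay line.
        d = min(max(d - gain * err, d_min), d_max)
    return d


for start in (0.8, 1.7, 0.4):
    print(start, round(settle_delay(start), 3))
```

Starting at 0.8 the loop settles at one period (correct lock), starting at 1.7 it settles at two periods (harmonic lock), and starting at 0.4 it is pushed toward zero delay and stops at the minimum delay (stuck). This dependence on the initial state is what a coarse lock detector is meant to remove.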



Fig 2.7 Conventional delay locked loop

Fig 2.8 shows the block diagram of the proposed dual loop DLL. It consists of a VCDL, an HCLD, a dynamic phase detector (PD), and a current mismatch calibrated charge pump (CP). After frequency acquisition between the input clock and the delayed clock in the HCLD using the VCDL multi-phases, one-cycle phase lock occurs in the PD. By using the HCLD, the DLL can be immune to SSN without harmonic lock and stuck problems [2.8].



Fig 2.8 Architecture of the proposed DLL

#### 2.3.3 Voltage Controlled Delay Line (VCDL)

A VCDL consists of a single-to-differential converter and 15 differential delay cell stages. The single-to-differential converter is used to preserve the duty cycle of the input clock and to sample the DQs at half rate after DQS passes through the replica DLL. While a VDL shifts by a fixed delay (a different phase at each data rate), a DLL that uses VCDL feedback shifts by a constant phase (a different delay at each data rate). Only 12 stages are needed to drive the PD; two additional stages are used for the HCLD, and the final stage is added as a dummy. The VCDL delays the input clock according to the control voltage and provides the proper clock phases to the HCLD and the PD.

A VCDL range is directly related to a DLL's operating range. To cover overall frequency range from 500MHz to 1.2GHz, the delay of the VCDL must satisfy eq.(2.1).

 $12 \times T_{\text{VCDL, 1-stage maximum delay}} > 2\ \text{ns}\ (1/500\,\text{MHz}),$ 

 $12 \times T_{\text{VCDL, 1-stage minimum delay}} < 0.833\ \text{ns}\ (1/1.2\,\text{GHz}) \qquad (2.1)$ 
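To make eq.(2.1) concrete, the per-stage delay range it implies can be computed directly. The Python sketch below divides one clock period by the 12 stages that drive the PD; the function and variable names are introduced here for illustration.

```python
STAGES_TO_PD = 12   # number of VCDL stages that span one clock cycle (from the text)


def per_stage_delay_bounds_ps(f_min_hz, f_max_hz, stages=STAGES_TO_PD):
    """Per-stage delays the VCDL must be able to reach to lock at both band edges."""
    slowest_needed = 1e12 / f_min_hz / stages   # stage delay required at f_min
    fastest_needed = 1e12 / f_max_hz / stages   # stage delay required at f_max
    return fastest_needed, slowest_needed


fast_ps, slow_ps = per_stage_delay_bounds_ps(500e6, 1.2e9)
print(round(fast_ps, 1), round(slow_ps, 1))
```

In other words, each delay cell has to be tunable from below about 69.4 ps to above about 166.7 ps for the DLL to cover 500MHz to 1.2GHz.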

#### 2.3.4 Hysteresis Coarse Lock Detector (HCLD)

The HCLD is composed of a modified CLD and hysteresis logic. Fig 2.10 shows the modified CLD architecture, and Fig 2.9 shows its timing diagrams in the LOCK, UP, and DN states. The HCLD receives the input clock and the odd-numbered phases, i.e. PH[5], PH[7], ..., PH[15], from the VCDL. The HCLD generates a clock whose frequency is half that of the input, so the operating speed burden is halved. Its cycle is composed of an evaluation phase and a reset phase [2.6], [2.8].



Fig 2.9 Timing diagram of the modified CLD



Fig 2.10 HCLD block diagram



Fig 2.11 Block and timing diagram of a hysteresis logic

In the reset phase, the HCLD counter, QA[1]~QA[5], is cleared to zero. In the evaluation phase, all other edge detections are ignored until the rising edge of PH[5] is detected. So the HCLD counts the exact number of odd-numbered phase edges in every evaluation phase. The number of edges determines whether UP or DOWN is asserted. Thus the proposed DLL can avoid harmonic lock and stuck problems without requiring any external reset or arbitration logic [2.6], [2.8].

The conventional CLD [2.3] has some shortcomings in speed and area. To overcome these problems, the divided clock is delayed by the same amount as the delay of the counting logic before it enters a flip-flop, which gains some timing margin. This margin is represented by the shaded area in Fig 2.9. Fig 2.11 shows the hysteresis logic that controls the coarse lock range, together with its timing diagram. At first, the HCLD locks in a narrow mode. After the coarse lock has lasted for 3 cycles, it changes the coarse lock range from the narrow mode to a wide mode [2.6], [2.8].

In the SSN environment of a memory controller, power supply voltage fluctuations directly make the control voltage unstable even in the lock state. So if the HCLD range is fixed in a narrow mode, as in conventional CLDs, the SSN environment breaks the lock state and the CLD quickly recovers coarse lock again, and this repeats continually. This can be a jitter source because the frequency tracking loop and the phase tracking loop may interfere with each other whenever the CLD hands control over to the PD and vice versa. If, on the contrary, the HCLD range is fixed in a wide mode, the locking process completes properly, but locking slows down [2.6], [2.8]. The proposed HCLD takes advantage of both the narrow and wide modes. At first, the narrow mode is selected for fast locking. Once the lock state has been entered and held for 3 cycles, the coarse lock range becomes wide. The PD then stays in control, and hence jitter is reduced [2.6], [2.8].
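The narrow-then-wide behaviour can be sketched as a small state machine. The Python model below is only an illustration of the hysteresis idea: the window sizes, the normalized error values, and the rule that a lost lock returns the detector to the narrow mode are assumptions made here, while the 3-cycle hold before widening comes from the text.

```python
NARROW, WIDE = "narrow", "wide"


class HysteresisLock:
    """Coarse lock detector that widens its lock window after a stable lock."""

    def __init__(self, narrow_window=0.05, wide_window=0.20, hold_cycles=3):
        self.narrow_window = narrow_window   # hypothetical window sizes
        self.wide_window = wide_window
        self.hold_cycles = hold_cycles       # 3 cycles, as described in the text
        self.mode = NARROW
        self.count = 0

    def step(self, error):
        """Process one evaluation cycle and report whether coarse lock holds."""
        window = self.narrow_window if self.mode == NARROW else self.wide_window
        locked = abs(error) <= window
        if not locked:
            # Assumption: losing lock restarts acquisition in the narrow mode.
            self.mode = NARROW
            self.count = 0
        elif self.mode == NARROW:
            self.count += 1
            if self.count >= self.hold_cycles:
                self.mode = WIDE
        return locked


# Hypothetical error sequence: acquisition followed by SSN-induced ripple.
errors = [0.30, 0.04, 0.03, 0.02, 0.10, 0.02, 0.12, 0.03]
hcld = HysteresisLock()
with_hysteresis = [hcld.step(e) for e in errors]
narrow_only = [abs(e) <= 0.05 for e in errors]
print(with_hysteresis)
print(narrow_only)
```

With the fixed narrow window the lock indication drops every time the ripple exceeds the window, whereas the hysteresis version holds lock through the same ripple once it has widened.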

#### 2.3.5 Dynamic Phase Detector and Charge Pump

The high precision PD implemented here can operate with a smaller dead zone at high frequencies owing to the symmetry of its circuits, its small logic depth, and the small amount of pumped charge [2.2]. The widths of the UP and DOWN pulses are proportional to the phase difference of the inputs, as shown in Fig 2.12. The PD adjusts the delay between the differential input clock and the 12th VCDL output to one cycle. After coarse lock is attained, only the PD controls the VCDL delay. Consequently, the PD determines the overall precision of the DLL. The timing difference between the two input clocks of the PD after phase locking is less than 20 ps [2.6], [2.8].




Fig 2.12 (a) Block diagram of Phase Detector, (b) PD Gain

A current-mismatch-calibrated CP [2.4] is used. Most DLLs use a charge pump to implement an integrating loop filter [2.5], but conventional charge pumps have a current mismatch problem: a difference between the charging and discharging currents can cause a static phase offset as well as dynamic jitter. The implemented CP has not only low current mismatch but also a wide dynamic range. Its valid voltage range includes the VCDL range in eq.(2.1), so the DLL is stable over a wide range. Fig 2.13 shows the implemented CP block diagram and its control voltage versus current curve. The current mismatch in the valid control voltage range is below 5 uA, which is very small compared with a normal analog charge pump [2.4], [2.8].

The replica CP always delivers constant currents by calibrating out the mismatch between the UP and DOWN currents. Consequently, it aligns the multiple phases of the VCDL much more uniformly in the lock state and reduces the overall DLL jitter.





Fig 2.13 (a) Block diagram of a charge pump, (b) CP control voltage-current curve

### 2.4 Simulation result

As mentioned before, a DLL is designed for the read path, and a DLL in a GDDR3 PHY must be immune to SSN. Fig 2.14 shows the overall locking process of the proposed DLL at 1.2GHz. To model SSN, ripples consisting of a 1-MHz sinusoidal wave plus a 1.2-GHz signal AM modulated at 6GHz, each with a peak-to-peak amplitude of 2.5% of the nominal supply voltage, are added to the supply voltage. The first two waveforms show the modeled supply and an enlarged view of it. Whatever control voltage the DLL has in its initial state, it is able to reach the locked state without harmonic lock or stuck problems. Fig 2.14 shows that the proposed DLL first performs coarse lock by using the HCLD and then enters the phase locking steady state and remains stable. The lock time is shorter than 1 us because of the HCLD's narrow coarse locking range [2.6], [2.8].



Fig 2.14 SSN modeling and overall DLL locking process



Fig 2.15 Locking process comparison.

Fig 2.15 compares the frequency and phase locking process when a conventional CLD is used with that when the proposed HCLD is used. Whereas the CLD repeatedly issues coarse UP and DOWN signals under the influence of SSN, the HCLD keeps the coarse lock state in the given SSN environment. The two cases differ clearly in the degree of control voltage fluctuation. Fig 2.16 shows an eye diagram of the proposed DLL, simulated at 1.2V, 27 °C, and the typical corner at an operating frequency of 1.2GHz. With the aid of the HCLD, the DLL jitter is reduced by about 30%, and the phase difference between the two inputs of the PD is less than 20ps after all locking processes are completed. The DLL is fabricated in a 0.13µm CMOS process with an area of 0.04 mm<sup>2</sup>. The layout of the DLL is shown in Fig 2.17 [2.6], [2.8].



Fig 2.16 Eye diagram comparison.



Fig 2.17 DLL layout

# 2.5 Conclusion



Fig 2.18 GDDR3 PHY layout

POSIM power estimation

| Path            | Sub-block                                         | Avg I per channel, FFFF [mA] | Avg I per channel, FFSF [mA] | Total avg I, FFFF [mA] | Total avg I, FFSF [mA] | Power, FFFF [mW] | Power, FFSF [mW] |
|-----------------|---------------------------------------------------|------------------------------|------------------------------|------------------------|------------------------|------------------|------------------|
| Global          | (a) Clock Tree + PLL                              | 60.62                        | 63.6                         | 60.62                  | 63.6                   | 72.744           | 76.32            |
| Read @ DQ       | (b) I/O 1.8V                                      | 103.87                       | 120.17759                    | 415.48                 | 480.71036              | 747.864          | 865.27865        |
|                 | (c) Core (VDL+sampler+FIFO+DLL+DES+etc) + I/O 1.2V | 96.58                        | 111.74306                    | 386.32                 | 446.97224              | 463.584          | 536.36669        |
|                 | * DLL only : 13.92 mA                             |                              |                              |                        |                        |                  |                  |
|                 | * VDL only : 1 mA                                 |                              |                              |                        |                        |                  |                  |
| Write @ DQ      | (d) I/O 1.8V                                      | 73.6                         | 85.2                         | 294.4                  | 340.8                  | 529.92           | 613.44           |
|                 | (e) Core (VDL+SER+ETC) + I/O 1.2V                 | 86.67                        | 95.6                         | 346.68                 | 382.4                  | 416.016          | 458.88           |
| CMD/CK          | (f) I/O 1.8V                                      | 50                           | 52                           | 50                     | 52                     | 90               | 93.6             |
|                 | (g) Core + I/O 1.2V                               | 66.77                        | 72.47                        | 66.77                  | 72.47                  | 80.124           | 86.964           |
| Total RD @ 1.2V | (a)+(c)+(g)                                       |                              |                              | 513.71                 | 583.04224              | 616.452          | 699.65069        |
| Total RD @ 1.8V | (b)+(f)                                           |                              |                              | 465.48                 | 532.71036              | 837.864          | 958.87865        |
| Total WR @ 1.2V | (a)+(e)+(g)                                       |                              |                              | 474.07                 | 518.47                 | 568.884          | 622.164          |
| Total WR @ 1.8V | (d)+(f)                                           |                              |                              | 344.4                  | 392.8                  | 619.92           | 707.04           |
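The total rows of the table follow from the sub-block rows by summing the currents on each supply rail and multiplying by the rail voltage. The Python sketch below recomputes the FFFF totals from the total average currents listed in the table; the dictionary keys mirror the row labels (a) to (g), and the helper names are introduced here for illustration.

```python
CORE_V = 1.2   # core supply in volts
IO_V = 1.8     # I/O supply in volts

# Total average current per sub-block at the FFFF corner, in mA (from the table).
ffff_ma = {"a": 60.62, "b": 415.48, "c": 386.32, "d": 294.4,
           "e": 346.68, "f": 50.0, "g": 66.77}


def total_ma(rows):
    """Sum the current of the listed table rows, e.g. 'acg' for (a)+(c)+(g)."""
    return sum(ffff_ma[r] for r in rows)


def power_mw(current_ma, supply_v):
    """Power in mW for a current in mA drawn from a supply in volts."""
    return current_ma * supply_v


read_core_mw = power_mw(total_ma("acg"), CORE_V)    # Total RD @ 1.2V
read_io_mw = power_mw(total_ma("bf"), IO_V)         # Total RD @ 1.8V
write_core_mw = power_mw(total_ma("aeg"), CORE_V)   # Total WR @ 1.2V
write_io_mw = power_mw(total_ma("df"), IO_V)        # Total WR @ 1.8V

for value in (read_core_mw, read_io_mw, write_core_mw, write_io_mw):
    print(round(value, 3))
```

The four values reproduce the FFFF power column of the total rows (616.452, 837.864, 568.884, and 619.92 mW).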

<Power consumption in worst case>

- Read total power consumption @ FFSF 1.08Gbps : 1678.89 mW

- Write total power consumption @ FFSF 1.08Gbps : 1329.20 mW

<Mean power consumption in worst case>

- 853.36 mW @ FFFF 1.08Gbps

- 961.55 mW @ FFSF 1.08Gbps

| Item             | Detail            | Value                                 |
|------------------|-------------------|---------------------------------------|
| Process          |                   | 0.13 µm CMOS process                  |
| Supply           | Core              | 1.2 V (0.13 µm)                       |
| Supply           | I/O               | 1.8 V (0.35 µm)                       |
| Area             | Total             | $3380 \times 3380\ \mu\text{m}^2$     |
| Area             | Data channel      | $1000 \times 1300\ \mu\text{m}^2$     |
| Area             | Command channel   | $1220 \times 1300\ \mu\text{m}^2$     |
| Area             | PLL               | $270 \times 340\ \mu\text{m}^2$       |
| Operating range  |                   | 500 Mbps ~ 1.2 Gbps                   |
| Power estimation | Read @ 1.08 Gbps  | 853.36 mW                             |
| Power estimation | Write @ 1.08 Gbps | 961.55 mW                             |
| BER              |                   | $< 10^{-12}$ @ 1.2 Gbps               |
| PLL jitter       |                   | 17.4 ps<sub>RMS</sub> @ 1.2 Gbps      |

Table 2.1 Summary of GDDR3 PHY key characteristics [2.7]

The proposed GDDR3 PHY is fabricated in 0.13um CMOS technology. The entire chip occupies 3380um x 3380um. The layout of the overall GDDR3 PHY is shown in Fig 2.18. Power estimates for the overall PHY and its components at several process corners are shown in Fig 2.19.

This chapter discussed GDDR3 PHY design with a focus on DLL design. The DQ channel (read path and write path) and the command path of the GDDR3 PHY were described.

The DLL is an important component of the DQ channel. It shifts the read path data by 90° of the 1080MHz clock so that the read path DQ can be sampled safely. The design of a DLL in a memory interface must consider SSN because many data signals (DQ) and clock signals (DQS) may switch simultaneously. The proposed DLL operates in the frequency range from 500MHz to 1.2GHz and consumes 16.6mW at 1.2 GHz. The post-layout-simulated peak-to-peak jitter is less than 30 ps in an SSN environment, which is about a 30% improvement over a DLL using a conventional CLD.

The GDDR3 PHY operates correctly between the GDDR3 memory and the TCON (BER  $< 10^{-12}$  at 1.2Gbps), and the PLL jitter is under 17.4ps<sub>RMS</sub> at 1.2Gbps (Table 2.1).

# **CHAPTER 3**

# **OPTICAL FRONT-END SERIAL LINK DESIGN FOR 20 GBPS MEMORY INTERFACE**

### 3.1 Silicon photonics introduction

Silicon is the primary material used in semiconductor manufacturing today because it is plentiful, inexpensive, and well understood by the semiconductor industry [3.1]. So CMOS has been widely used for circuit design in many areas. Generally, a CMOS chip is mounted on a PCB, and the PCB traces are made of copper. But a copper medium attenuates sharply in the vicinity of 10GHz and with distance. In the past, using copper as a medium did not cause a problem because the operating speed was below the Gbps range regardless of length. But as mentioned above, the required bandwidth has increased explosively because of the spread of the internet, the demand for high definition and large video, and the popularity of smart phones and tablets. The required bandwidth is therefore close to 10Gbps and will soon exceed it. So the copper medium faces a crisis, and alternatives are needed.

The situation is the same in memory interfaces. A memory interface is composed of memory and a memory controller, which are placed on separate PCBs or on the backplane of a server, PC, tablet, or smart phone and communicate with the CPU (central processing unit). Fig 3.2(a) shows a current memory interface based on an electrical copper medium. A memory has many data and clock signals, and the number of signals on the PCB is several hundred. For accurate data communication, the line lengths of the signals must be matched to within one cycle. While the cycle time keeps decreasing, the PCB line length stays at a similar level. So PCB design complexity has increased, and this tendency will continue. Silicon photonics is one of the alternatives for overcoming these difficulties.

Until now, communication has been classified by signal medium into electrical communication and optical communication, and by distance into chip to chip, board to board, rack to rack, and enterprise communication. Optical communication is based on a fiber medium, and its devices are made in compound semiconductor processes. It is easy to implement a high speed transceiver, and the fiber medium has little attenuation over distance and a fast transmission speed, but the cost is relatively high. Electrical communication is based on a copper medium, and its devices are made in CMOS processes. Its cost is low, but the devices are slow and the copper medium attenuates sharply at high speed. So electrical links are mainly used for short distance communication such as chip to chip and board to board. Optical links, on the other hand, are used for long distance communication and applications that require high speed, such as enterprise and rack to rack communication. This tendency, and the reason why silicon photonics appears, is shown in Fig 3.1. Recently, CMOS processes have developed quickly, and the cost of optics has been getting lower and lower. So silicon photonics has emerged as an attempt to take advantage of both approaches in the transition zone. The data rate of the transition zone is expected to be from 10Gbps to 30Gbps, and this region can expand.



Fig 3.1 Electrical communication VS optical communication [source : Intel]

Silicon photonics is a novel method that implements the front end with optical devices, exchanges data through a fiber medium, and implements the back end in a CMOS process. A detailed description follows. The important points are that recent CMOS is reasonably fast at the chip level (10Gbps~20Gbps), that optical fiber has little attenuation at such data rates, and that several signals can be carried on a single fiber by changing the signal modulation wavelength.



 

[Figure annotations: 128bit DDR interface (65mm); trace length matching difficult for wide buses]

Fig 3.2 Current electrical memory interface [source : Rambus, Intel]

So interface complexity can be reduced sharply if both the transmitter and the receiver use an optical front end. This method can be applied to memory interfaces. Fig 3.2(b) shows a future memory interface that uses an optical front end and optical fiber. The number of signals (fibers) is very small, and the PCB becomes much simpler than in Fig 3.2(a). Moreover, unlike in a copper medium, there is little interference among adjacent signals in a fiber. Fig 3.3 shows various memory interfaces classified by the way the memory controller and the memories are connected. Traditionally, there are many common copper signal lines between them. Every memory in a slot shares the common command, read data, and write data signals from the memory controller. But the memory bus (multi-drop) connection has some disadvantages. First, it is very hard to increase the transmission speed of each line. Secondly, the number of memories in the slots changes from case to case. The development of memory interfaces began with increasing the transmission speed of a single line and decreasing the number of transmission lines, but this ran into the limitation mentioned above. Instead of the multi-drop connection, the point-to-point connection method appeared. It is a direct, independent, one to one connection between a memory and the memory controller. In a point-to-point connection, increasing the transmission speed of a single line is much easier than in a multi-drop connection. But because of copper medium attenuation, a memory interface using optical fiber is needed to increase both the single line transmission speed and the number of memories. Moreover, several optical signals, each with its own wavelength, can be carried in a single fiber by using DWDM (dense wavelength division multiplexing).



Fig 3.3 Memory and memory controller connection (multi-drop VS point-to-point) [source : IBM]

The general overall architecture of silicon photonics is shown in Fig 3.4. It is composed of a transmitter and a receiver. In the receiver, an optical signal is received through optical fiber and propagates into a CMOS chip. A photo-detector such as a photo diode converts the optical signal into an electrical signal, and then CMOS circuitry processes the signal. For transmission, CMOS circuitry generates a signal, and then a filter shapes the signal into a form suitable for driving the modulator and selects the wavelength of the laser light source. The laser only generates light as a source, and the modulator modulates the incident light according to the filtered signal. The modulated optical signal is coupled into the fiber and transmitted through it.




Fig 3.4 Silicon photonics based transceiver architecture [source : Intel]

Implementing such a transceiver requires many technologies. The needs for silicon photonics are presented in Fig 3.5. First, a light source is needed. In an electrical link, a signal is generated by a current or voltage difference, but in silicon photonics a light source is needed as the carrier to be modulated. Implementing a light source in CMOS technology is the most difficult of all the requirements for silicon photonics. Second, a wave guide is needed. A wave guide is a passage for light in silicon that connects the devices on silicon to the optical fiber with little loss. Third, modulation is needed. The modulator modulates the light source according to an electrical signal. Fourth, photo detection is needed. A photo diode detects incoming light and converts the light signal into an electrical current. Fifth, low cost assembly is needed. The external fiber must be aligned to the wave guide for low loss, and an assembly such as a package fixes the fiber position. To take advantage of CMOS, this assembly must be low cost. Last, CMOS intelligence is needed. The CMOS circuit must process the received current signal and must generate signals in a form compatible with the modulator.



Fig 3.5 Needs for silicon photonics [source : IBM]

It is certain that silicon photonics research is heading toward fully integrating all optical devices and CMOS circuits in a single CMOS chip. But until now, full integration has degraded overall performance in terms of data transmission speed and has required relatively large area. This research mainly deals with CMOS circuits for silicon photonics and with a hybrid silicon photonics implementation in which processing components are implemented in CMOS and optical devices are implemented separately in a compound process. The CMOS circuit chip and the optical device chip are connected by COB (Chip on board) bonding over a sub-millimeter length.



Fig 3.6 Conceptual memory interface based on silicon photonics in Backplane [source : IBM]

Fig 3.6 shows a detailed conceptual memory interface based on silicon photonics at the backplane level. The hybrid form of silicon photonics is applied to the memory interface. Two memory cards are attached to a backplane and optical signals are exchanged through a PCB wave guide. The connector only changes the direction of the optical signal. The optical device chip is placed between the CMOS chip and the optical connector and converts between optical and electrical signals. The CMOS chip processes and generates electrical signals and communicates with the optical device chip. The CMOS chip and the optical device chip can be connected by COB or by a PCB copper line. In this situation, the copper line does not degrade performance because the connection is very short and simple.



Fig 3.7 On chip optical network in a 3D IC [source : Intel]

Fig 3.7 shows the final target of silicon photonics. Recently, the 3D IC using TSV (Through silicon via) has been a hot issue. A 3D IC stacks multiple CMOS chips from bottom to top. If a 3D IC can be implemented with high reliability, an overall computer networking system can be constructed. The CPU is on the bottom CMOS layer and the memory is on the layer above it. Because most of the data traffic in a computer architecture is between CPU and memory, the two chips must be adjacent. They exchange many data signals through several hundred through-silicon vias. But for data communication efficiency of the overall system, a fast I/O (input and output) interface is needed, so a silicon photonics CMOS chip layer is essential on the top layer.

Finally, the overall computer architecture for the next generation can be implemented through silicon photonics techniques.



## 3.2 Optical front-end Transmitter design

Fig 3.8 Silicon photonics transceiver overall architecture

Fig 3.8 shows the overall architecture of a silicon photonics transceiver implemented as a hybrid of CMOS and compound processes. The transmitter and receiver are connected with a single optical fiber. In this section, the design of the optical front-end transmitter is studied. In the transmitter, a PLL (Phase locked loop) generates various clock signals, with the frequencies and phases suitable for each component in the transmitter. End users use low speed data and there are many users, so parallel low speed data enters the transmitter and the serializer must serialize it into single-ended or differential high speed data. The driver converts the serialized data into a form compatible with the modulator. Finally, the optical modulator modulates light from the laser diode using the electrical signal generated by the driver. In the future all components will be integrated in a single CMOS chip. But in this study, the modulator and laser diode are separated from the CMOS chip and implemented in a compound semiconductor process. The two chips are connected through COB, and the COB length is under 1mm.



### 3.2.1 Modulator driver requirements

Fig 3.9 CMOS driver and modulator connection

Fig 3.9 shows the connection between the CMOS driver and the MZ (Mach-Zehnder) modulator. The modulator modulates light entering from the laser diode. The light is divided into two paths and recombined after passing through them. While the light passes through each path, it is modulated according to the CMOS driver value. When the CMOS driver drives '1', the modulator does nothing and the original light is recovered. On the other hand, when the CMOS driver drives '0', the modulator modulates the light in only one path, so the recombined light has no amplitude. The CMOS driver must generate signals compatible with the modulator. The modulator is made in a compound semiconductor process and has a large input capacitance. Moreover, the COB has a large inductance, so the transmitted signal experiences large loss. The modulator also requires a large voltage swing above 2V to modulate the light. Finally, the overall operating speed must exceed 10Gbps for silicon photonics to be truly meaningful. In other words, the modulator driver must simultaneously satisfy three requirements: 1. large loading (above 500fF capacitance), 2. large voltage swing (above 3V differential), 3. high operating speed (above 10Gbps).
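To give a feeling for why the loading requirement conflicts with the speed requirement, the following minimal sketch estimates the single-pole RC bandwidth of the load. The 500fF value is taken from the requirement above; the 50 ohm driving resistance is only an assumption made for illustration and is not a value from this design.

```python
import math


def rc_bandwidth_hz(r_ohm: float, c_farad: float) -> float:
    """Return the -3 dB bandwidth of a single-pole RC network in Hz."""
    return 1.0 / (2.0 * math.pi * r_ohm * c_farad)


# 500 fF is the lower bound on the load quoted in the requirements.
C_LOAD = 500e-15
# ASSUMPTION: 50 ohm effective driving resistance, chosen for illustration.
R_DRIVE = 50.0

bw_hz = rc_bandwidth_hz(R_DRIVE, C_LOAD)
nyquist_hz = 10e9 / 2.0  # fundamental of a 1010... pattern at 10 Gbps

print(f"RC-limited bandwidth : {bw_hz / 1e9:.2f} GHz")
print(f"Nyquist at 10 Gbps   : {nyquist_hz / 1e9:.2f} GHz")
```

Under these assumptions the RC pole sits only slightly above the Nyquist frequency of a 10Gbps signal, before the COB inductance and the swing requirement are even considered.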

### 3.2.2 Modulator driver design - current mode driver

As mentioned above, the modulator driver must satisfy many conditions. The most difficult requirement is the high voltage swing of about 2V. As CMOS processes develop, the power supply voltage falls, while device operating speed increases and device size decreases. A 0.13um process usually uses a 1.2V supply voltage, and 90nm and 65nm processes usually use a 1.0V supply. This tendency is expected to continue. Generally, CML (Current mode logic) is used for the driver in most links such as memory controllers. But the output of a conventional CML buffer swings between VDD (power supply) and VDD-IR (I : current, R : resistance), so it is impossible to get an output voltage swing above VDD.



Fig 3.10 transformer boosted current mode driver [3.2]

Fig 3.10.(a) shows a transformer boosted current mode driver which has a voltage swing above VDD [3.2]. The driver consists of two current mode differential drivers driven by the pre-driver output, a main current mode driver and a boosting driver, and Fig 3.10 shows the single-ended equivalent of the drivers. The driver output equals the sum of the voltages across the load resistor and the load inductance ( $V_{out} = V_R + V_L$ ). In this equation, $V_L$ does not directly depend on the supply voltage, so this driver can achieve a large output swing that exceeds the power supply voltage [3.2]. The inductance consists of self inductance and mutual inductance: the self inductance equals $L_1$, and the mutual inductance equals $k\sqrt{L_1L_2}$. The overall voltage across inductor $L_1$ equals the sum of the voltages caused by the self inductance and the mutual inductance.

$$V_{L} = V_{L1} + V_{M1} = L_{1} \frac{dI_{SS1}}{dt} + k\sqrt{L_{1}L_{2}} \frac{dI_{SS2}}{dt} \qquad (3.1)$$
$$V_{OUT} = V_{R} + V_{L} = V_{R} + V_{L1} + V_{M1} \qquad (3.2)$$

As shown in (3.1), $V_{M1}$ depends on the boosting driver current $I_{SS2}$, not on the main driver current $I_{SS1}$, so the $V_{OUT}$ swing can exceed the power supply. Careful consideration is needed to choose the coupling coefficient k, the boosting driver inductance $L_2$, and the boosting driver current $I_{SS2}$. The boosting driver increases the overall driver bandwidth as well as raising the voltage swing above the power supply. Fig 3.10.(b) shows the overall frequency response of the transformer boosted current mode driver. This driver achieves a 1.42V<sub>PP</sub> differential output swing in a 0.13um 1.2V CMOS process at a data rate of 8Gbps, but this architecture consumes 137mA from the 1.2V supply [3.2]. There is another approach to get a high voltage swing in a current mode driver. Fig 3.11 shows a stacked FET current mode driver. The key idea of this design is to use high supply voltages $V_{G}$ and $V_{DD}$ and to stack 4 FETs while keeping $V_{GS}$, $V_{DS}$, and $V_{GD}$ of each transistor under VDD (1V) [3.3].
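The contribution of the two terms in (3.1) can be illustrated numerically. The sketch below evaluates the self and mutual inductance voltages for linear current ramps. All component values and edge rates are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from [3.2] or from this work.

```python
import math


def inductor_voltages(l1, l2, k, di1_dt, di2_dt):
    """Evaluate the two terms of equation (3.1).

    Returns (V_L1, V_M1): the self-inductance term L1 * dI_SS1/dt and the
    mutual-inductance term k * sqrt(L1 * L2) * dI_SS2/dt, both in volts.
    """
    v_self = l1 * di1_dt
    v_mutual = k * math.sqrt(l1 * l2) * di2_dt
    return v_self, v_mutual


# HYPOTHETICAL values, for illustration only.
L1 = 1e-9                  # main driver inductance [H]
L2 = 1e-9                  # boosting driver inductance [H]
K = 0.5                    # coupling coefficient
DI1_DT = 10e-3 / 50e-12    # main driver: 10 mA edge in 50 ps [A/s]
DI2_DT = 20e-3 / 50e-12    # boosting driver: 20 mA edge in 50 ps [A/s]

v_self, v_mutual = inductor_voltages(L1, L2, K, DI1_DT, DI2_DT)
v_l = v_self + v_mutual    # V_L of equation (3.1)

print(f"V_L1 = {v_self:.3f} V, V_M1 = {v_mutual:.3f} V, V_L = {v_l:.3f} V")
```

With these example numbers the boosting driver contributes as much voltage as the main driver, and neither term is bounded by the supply voltage, which is the property the text relies on.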



Fig 3.11 Stacked FET current mode driver [3.3]



Fig 3.12 shows a modified stacked FET current mode driver. The output termination is changed from a simple resistor termination to double series peaking to extend the bandwidth (asymmetric T-coil peaking [3.8]). The measurement result will be presented in the measurement section.

### 3.2.3 Modulator driver design - voltage mode driver

In link design, there are two kinds of driver: the current mode driver and the voltage mode driver. A voltage mode circuit is one whose signal states are completely and unambiguously determined by its node voltages; a current mode circuit is one whose signal states are completely and unambiguously determined by its branch currents [3.4]. Fig 3.13 shows an inverter based voltage mode driver. The simplest way to get a high voltage swing from an inverter based voltage mode driver is to use thick oxide transistors, but a thick oxide transistor has a lower operating frequency than a thin oxide transistor.

The proposed driver circuit is composed of a pre-driver which operates from a low power supply LVDD (1.2V) and a post driver which operates from a high power supply HVDD (3.3V). The pre-driver is a general inverter chain with latches for compensating the timing mismatch (-30ps~30ps) between the two differential signals. The post driver is made of both thick oxide and thin oxide transistors. Two thick oxide PMOS transistors are cascaded and the lower PMOS is biased, so it lowers the voltage applied to the NMOS. The NMOS is a thin oxide transistor. There is a feedback resistor between input and output; the resistor improves operating speed while decreasing the output voltage swing.



Fig 3.13 Proposed inverter based voltage mode driver

The final termination is composed of a resistor and inductors. The termination resistor reduces ISI (Inter symbol interference) and the termination inductors are used for inductive peaking. This driver has about 2V output voltage swing when operating at 9Gbps. The measurement result will be presented in the measurement section.



Fig 3.14 Concept of SST [3.6]

It is possible to get a high voltage swing above VDD from a current mode driver, but the current mode architecture fundamentally tends to have a low output voltage swing. The voltage mode driver in Fig 3.13 improves on a pure thick oxide transistor driver, but it is still basically limited in operating speed. SST (Source series termination) is the optimal solution to overcome these two difficulties. The key concept of SST is to get a high voltage swing using thin oxide transistors. Fig 3.14 shows the concept of SST. The driver output stage is subdivided into a pull-up and a pull-down branch, each implemented as a PMOS or NMOS switch transistor followed by a series termination resistor. Each of the two branches is impedance matched to the transmission line impedance, which is typically 50 Ohm. In a current mode driver, impedance is matched by tuning the load resistor, which combines a small fixed poly resistor with a large controllable transistor switch resistor. In SST, the original line driver output stage is impedance scaled and duplicated N times such that the parallel connection of the selectively enabled copies results in the desired transmission line impedance.

SST signaling supports many different termination voltages combined with a high signal swing. In addition to providing impedance tuning capability, the transmitter should perform appropriate channel equalization. In an SST design, de-emphasis is performed easily by distributing identical parallel SST structures, each composed of many binary scaled SST stages. In other words, both impedance matching and de-emphasis equalization are possible with a properly designed SST. Fig 3.15 shows a clocked SST driver. The output driver is operated from a high voltage supply Vddh, which can be flexibly set between 1.2V~2V and determines the available output swing. A low side 1V Vdd supply is used to drive the pull-down NMOS devices, whereas a high side 1V supply between Vddh-Vdd and Vddh is used for the PMOS pull-up branch. The pull-up and pull-down branches are protected by a thin oxide NMOS or PMOS cascode protection device biased at Vdd or Vddh-Vdd, respectively. With this driver, the output can have a Vddh~Gnd signal swing.
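The way parallel SST slices give both impedance matching and de-emphasis can be sketched with a simple Thevenin model. The sketch assumes ideal switches and identical slices; the slice count (16) and the number of slices assigned to the de-emphasis tap (4) are hypothetical, while the 50 Ohm line impedance and the 2V Vddh come from the text.

```python
import math


def sst_output(n_slices: int, n_high: int, z0: float = 50.0, vddh: float = 2.0):
    """Thevenin equivalent of n_slices identical parallel SST slices.

    Each slice has a series resistance of n_slices * z0 and is switched
    either to vddh (pull-up) or to ground (pull-down); switches are ideal.
    Returns (open-circuit voltage [V], output resistance [ohm]).
    """
    r_slice = n_slices * z0
    g_up = n_high / r_slice
    g_down = (n_slices - n_high) / r_slice
    g_total = g_up + g_down
    return vddh * g_up / g_total, 1.0 / g_total


N_SLICES = 16   # HYPOTHETICAL number of slices
N_POST = 4      # HYPOTHETICAL slices driven by the de-emphasis tap

# Transition bit '1': main and de-emphasis slices all pull up.
v_transition, r_transition = sst_output(N_SLICES, N_SLICES)
# Repeated bit '1': de-emphasis slices pull in the opposite direction.
v_repeated, r_repeated = sst_output(N_SLICES, N_SLICES - N_POST)

mid = 2.0 / 2.0  # mid-level of the 0..Vddh swing
de_emphasis_db = 20.0 * math.log10((v_repeated - mid) / (v_transition - mid))

print(f"transition: {v_transition:.2f} V, Rout = {r_transition:.1f} ohm")
print(f"repeated  : {v_repeated:.2f} V, Rout = {r_repeated:.1f} ohm")
print(f"de-emphasis = {de_emphasis_db:.2f} dB")
```

In this idealized model the output resistance stays at the line impedance regardless of how many slices pull up or down, so the de-emphasis level can be changed without disturbing the impedance match.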



Fig 3.15 High swing SST voltage mode driver [3.5]

Fig 3.16 and Fig 3.17 show architectures that combine the previously designed voltage mode driver and the SST driver. In Fig 3.16, series inductors are placed at every stage to extend the bandwidth and cross coupled latches are placed to compensate the timing mismatch. The signals are divided into two signal levels: one is the original signal and the other is a voltage level shifted signal. The two signals are transmitted to the SST driver to get a high output voltage swing. The level shifter is shown in Fig 3.17; the converted signal has an upper bound of mvdd and keeps its swing. The main difference between Fig 3.16 and Fig 3.17 is whether an inverter is used for the level shifted signal or not. The level shifted signal is the output of a capacitor, so it is unstable, and if it is connected to the SST driver PMOS directly, it may generate noise and may not transmit accurate signal information. The inverter for the level shifted signal is designed using triple well transistors and requires care in design. After passing through the inverter, the level shifted signal becomes stable, so accurate signal information is transmitted and the bandwidth is extended. The measurement result will be presented in the measurement section.



Fig 3.16 Simplified high swing SST voltage mode driver



Fig 3.17 Proposed modified Simple high swing SST voltage mode driver

## 3.3 Optical front-end Receiver design

Fig 3.8 shows the overall architecture of a silicon photonics transceiver implemented as a hybrid of CMOS and compound processes. The transmitter and receiver are connected with a single optical fiber. In this section, the design of the optical front-end receiver is studied. In the receiver, the optical signal is received through the fiber. A lens condenses the light to minimize signal loss. A photo diode converts the light signal into an electrical current. Because the current swing is very small (several tens of uA ~ hundreds of uA), the CMOS circuit must amplify this current signal. The TIA (Trans impedance amplifier) and LA (Limiting amplifier) convert the current signal into a voltage signal and amplify it to several hundred millivolts. The CDR (clock and data recovery) circuit recovers the clock from this amplified voltage signal and then recovers the data using the recovered clock. There are many end users and they use much slower signals, so the deserializer deserializes the fast serial recovered data into slow parallel data. The CDR provides a clock of the proper frequency and phase to the deserializer, as the PLL does in the transmitter.

In the future all components will be integrated in a single CMOS chip. But in this study, the optical device (photo diode) is separated from the CMOS chip and implemented in a compound semiconductor process. The two chips are connected through COB, and the COB length is under 1mm.



### 3.3.1 Optical receiver back end requirements

Fig 3.18 Optical front-end receiver block diagram [3.10], [3.11], [3.12], [3.13]

Fig 3.18 shows the overall architecture of an optical receiver without clock and data recovery. The received light signal is converted into an electrical current by the photodiode. The TIA converts the current signal into a voltage signal and amplifies it. The limiting amplifier amplifies differential voltage signals up to CMOS level for use throughout the chip. But the TIA output is a single-ended signal, so differential signals must be generated from the single-ended signal. An offset cancellation circuit is also needed to prevent signal saturation and to find the common voltage of the single-ended signal. It is better to use only one PD and generate the differential signals later because an optical device is an expensive component compared with CMOS logic. The driver translates the limiting amplifier output so that the output signals can be measured off chip and off board.

### 3.3.2 Optical receiver back end design – TIA

Generally, the most important factor that lowers TIA operating speed is the photodiode output capacitance. This capacitance was about 100fF~500fF in the 2000s, but nowadays it has been lowered to under 100fF. As a result, the design methods and issues have also changed. This section starts with the traditional approach to TIA design and develops the discussion toward modern TIA design methods.

The simplest TIA design is the common gate amplifier. As mentioned, the PD output capacitance is large and it generates a dominant pole, so the input impedance must be minimized. The input impedance of the CG amplifier is given in (3.3). The CG amplifier has a small input impedance, and its trans-impedance gain is determined by the load resistance.



Fig 3.19 (a) common gate amplifier (CG), (b) RGC (Regulated cascode)

$$R_{in} = \frac{1}{g_m}, \quad R_T = R_D \qquad (3.3)$$
$$R_{in} = \frac{1}{g_{m1}(1 + g_{m2}R_2)}, \quad R_T = R_1 \qquad (3.4)$$

RGC (Regulated cascode) in Fig 3.19 is an improved form of CG, and it has much smaller input impedance as shown in (3.4). A feedback amplifier composed of M2 and R2 is added between the CG gate and drain, so the input impedance is reduced by the factor of the feedback amplifier gain, which increases the bandwidth. The s-domain response function and the bandwidth are given in (3.5) and (3.6), respectively.

$$\frac{V_{OUT}}{I_{in}} = \frac{R_1}{\left(\frac{C_{in}}{g_{m1}g_{m2}R_2}s + 1\right)\left(R_1C_{OUT}s + 1\right)} \qquad (3.5)$$
$$f_{-3dB} = \frac{1}{2\pi} \frac{g_{m1}g_{m2}R_2}{C_{in}} \qquad (3.6)$$
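The effect of the regulating amplifier in (3.3) and (3.4) can be checked with a small numerical sketch. The transconductances and $R_2$ below are hypothetical; the 300fF input capacitance is simply picked from the 100fF~500fF range quoted above. Only the input pole is evaluated, so the output pole $R_1C_{OUT}$ of (3.5) is ignored.

```python
import math


def cg_rin(gm1: float) -> float:
    """Input resistance of the common gate stage, equation (3.3)."""
    return 1.0 / gm1


def rgc_rin(gm1: float, gm2: float, r2: float) -> float:
    """Input resistance of the regulated cascode stage, equation (3.4)."""
    return 1.0 / (gm1 * (1.0 + gm2 * r2))


def input_pole_hz(r_in: float, c_in: float) -> float:
    """Pole frequency formed by the input resistance and input capacitance."""
    return 1.0 / (2.0 * math.pi * r_in * c_in)


# HYPOTHETICAL device values, for illustration only.
GM1 = 20e-3    # [S]
GM2 = 20e-3    # [S]
R2 = 200.0     # [ohm]
C_IN = 300e-15 # [F], within the 100fF~500fF range quoted in the text

r_cg = cg_rin(GM1)
r_rgc = rgc_rin(GM1, GM2, R2)
f_cg = input_pole_hz(r_cg, C_IN)
f_rgc = input_pole_hz(r_rgc, C_IN)

print(f"CG : Rin = {r_cg:.1f} ohm, input pole = {f_cg / 1e9:.1f} GHz")
print(f"RGC: Rin = {r_rgc:.1f} ohm, input pole = {f_rgc / 1e9:.1f} GHz")
```

The sketch uses the exact factor $1 + g_{m2}R_2$ from (3.4), whereas (3.6) uses the approximation $g_{m2}R_2$, so the two agree only when the feedback gain is much larger than one.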

Fig 3.20 shows a gain boosted cascode (GBC) TIA, which increases the bandwidth of a TIA by decreasing the input impedance of the CG stage with a high gain feedback scheme. The bias circuit of the GBC TIA utilizes capacitance multiplication to reduce the area of the low-pass filter and to generate an appropriate bias voltage [3.9]. The GBC input stage is the RGC with one more feedback amplifier added.



Fig 3.20 Gain boosted cascode TIA [3.9]

The input impedance and bandwidth of GBC are given in (3.7) and (3.8), respectively. Compared with RGC, the input impedance is decreased and the bandwidth is increased by the factor of $g_{m3}R_3$.

$$R_{IN} \approx \frac{1}{g_{m1}(1 + g_{m2}R_2)(1 + g_{m3}R_3)} \qquad (3.7)$$
$$f_{-3dB} \approx \frac{1}{2\pi} \frac{g_{m1}g_{m2}R_2g_{m3}R_3}{C_{IN}} \qquad (3.8)$$

All of the CG, RGC, and GBC TIAs focus on reducing input impedance to extend bandwidth, because the pole generated by the large PD capacitance dominates the TIA bandwidth. The RGC and GBC TIAs use a CS (common source) feedback amplifier and lower the input impedance by the factor of the feedback amplifier gain. Actually, there are two poles in a TIA. One is induced by the input capacitance and input impedance; the other is induced by the feedback loop and other parasitic capacitance. The RGC and GBC approach raises the frequency of the input pole, so the pole induced by the input capacitance and input impedance is no longer dominant. Moreover, increasing the feedback CS amplifier gain ($1 + g_{m2}R_2$ or $1 + g_{m3}R_3$) causes the bandwidth to decrease. The input impedance increases near the –3dB point of the feedback CS amplifier, so the overall TIA bandwidth decreases. Therefore, not only the DC voltage gain of the feedback CS amplifier but also its bandwidth influences the overall TIA bandwidth. The measurement result of the GBC TIA will be presented in the next section.

Fig 3.21 shows a shunt and series peaking TIA which uses inductors to enhance the feedback path bandwidth. The shunt inductor $L_1$ compensates the decrease in output impedance caused by the load capacitance, so it extends the bandwidth. $L_1$ also delays the current through $R_2$, so the output capacitor current increases and the rise and fall times are reduced [3.10]. The series inductor $L_0$ decouples the load capacitance from the $M_2$ drain capacitance, so $L_0$ enhances the feedback path bandwidth. The measurement result of the shunt and series peaking TIA will be presented in the next section.



Fig 3.21 Shunt and series peaking TIA [3.10], [3.11]

On the other hand, there have been many attempts to reduce the input impedance further. Fig 3.22 shows two such attempts. Kromer's common gate feed forward TIA provides much smaller input impedance than the RGC TIA, and its bandwidth is further increased by inductive peaking. There is one more CG amplifier stage which shifts the DC level to a higher value, so M3 can operate with a high bias voltage. The input impedance is given in (3.9). In short, Kromer's TIA has a common-gate topology with a gain enhancing feed forward path and an input impedance reducing feedback [3.14].

The negative Miller capacitance TIA replaces the inductors in Kromer's TIA with a negative Miller capacitor. Both terminals of the negative Miller capacitor transition in the same direction, so the capacitor generates peaking and reduces the capacitance seen at node X. The additional capacitor thus creates a zero in the response function. Removing the inductors reduces area. The measurement result of the negative Miller capacitance TIA will be presented in the next section.



Fig 3.22 RGC modified TIA (a) Kromer's common gate feed forward TIA [3.14], (b) Negative miller capacitance TIA [3.12], [3.13]

$$R_{in} = \frac{1}{g_{m1}(1 + A_2 A_3)}, \quad A_2 = \frac{V_x}{V_i}, \quad A_3 = \frac{V_y}{V_x} \qquad (3.9)$$

As shown above, there have been many efforts to increase TIA bandwidth. But nowadays the PD capacitance has been lowered to under 100fF, and CMOS processes have developed very fast, so there is a movement toward simple TIA designs. Fig 3.23 shows a simple inverter based TIA. This architecture does not need a bias circuit and consumes relatively low power. Moreover, it is a shunt-shunt feedback structure, so both the input impedance and the output impedance are lowered. Inserting an inductor in series with the resistor generates a zero at high frequency and thus peaking, so the bandwidth is enhanced.



Fig 3.23 Inverter based TIA

The equivalent circuit is shown in Fig 3.24 and the input impedance is given in (3.10). The trans-impedance gain is determined by the feedback resistor. But if the feedback resistor becomes too large, the input impedance becomes large and the bandwidth is reduced. On the other hand, increasing $g_m$ by enlarging the transistors also increases the input capacitance. In this study, the resistance is chosen so that the trans-impedance gain is $45 dB\Omega$. The measurement result will be presented in the next section.



Fig 3.24 Small signal equivalent model - Inverter based TIA

$$Z_{in} \approx \frac{1}{\frac{1}{R_f} + \left(g_m - \frac{1}{R_f}\right)(R_f \parallel r_o)\frac{1}{R_f}} \approx \frac{2}{g_m} \quad \text{(at low frequencies)} \qquad (3.10)$$

$$R_T \approx R_f \qquad (3.11)$$
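Since (3.11) ties the trans-impedance gain to the feedback resistor, the $45 dB\Omega$ target can be converted into a resistance with a short calculation. The sketch below performs the conversion and estimates the TIA output voltage for a given photo current; the 56uA input current is a hypothetical value within the tens to hundreds of uA range mentioned in section 3.3.

```python
import math


def dbohm_to_ohm(gain_dbohm: float) -> float:
    """Convert a trans-impedance gain in dB-ohm to ohms."""
    return 10.0 ** (gain_dbohm / 20.0)


def ohm_to_dbohm(r_ohm: float) -> float:
    """Convert a trans-impedance gain in ohms to dB-ohm."""
    return 20.0 * math.log10(r_ohm)


GAIN_DBOHM = 45.0                 # target gain from the text
r_t = dbohm_to_ohm(GAIN_DBOHM)    # R_T ~ R_f by equation (3.11)

I_PD = 56e-6                      # HYPOTHETICAL photo current swing [A]
v_out = I_PD * r_t                # resulting TIA output swing [V]

print(f"R_T = {r_t:.1f} ohm")
print(f"V_out for {I_PD * 1e6:.0f} uA = {v_out * 1e3:.1f} mV")
```

A gain of $45 dB\Omega$ corresponds to a feedback resistor of roughly 178 Ohm, so a photo current of a few tens of uA gives an output swing of the order of 10mV.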

Fig 3.25 shows the overall TIA block diagram. Except for the inverter based TIA, every TIA design needs a bias voltage, so an LPF (Low pass filter) is added to the TIA output to generate the common mode voltage and the bias voltage. The TIA is also designed so that an external bias voltage can be injected in case the biasing circuit operates incorrectly.



Fig 3.25 TIA overall architecture

### 3.3.3 Optical receiver back end design – LA, Driver

The LA (Limiting amplifier) amplifies TIA output signals of about 10mV to a CMOS level of about 500mV~600mV. It needs large gain, so many stages (3~5) are used to acquire the proper gain. Because of the large gain, a small mismatch can cause one of the differential output signals to saturate. Fig 3.26 shows the offset cancellation LA and the overall LA array. The second stage is the offset cancellation LA and it receives the fourth stage output for offset cancellation. Fig 3.27 and Fig 3.28 show two LA implementations. Fig 3.27 shows a Cherry-Hooper LA and a negative capacitance circuit for bandwidth extension. The cross coupled NMOS pair and capacitor act as a negative capacitance in the equivalent circuit, but they increase power consumption.
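The gain budget stated above can be checked with a short calculation. The sketch below computes the total gain needed to amplify 10mV to 500mV and the gain each stage must provide if the stages are identical; assuming identical stages is a simplification made here, and the bandwidth reduction that comes with cascading stages is not modeled.

```python
import math


def gain_db(v_in: float, v_out: float) -> float:
    """Voltage gain in dB needed to amplify v_in to v_out."""
    return 20.0 * math.log10(v_out / v_in)


def per_stage_gain(total_gain: float, n_stages: int) -> float:
    """Linear gain of each stage when n_stages identical stages are cascaded."""
    return total_gain ** (1.0 / n_stages)


V_IN = 10e-3     # TIA output level from the text [V]
V_OUT = 500e-3   # lower end of the CMOS level from the text [V]

total_gain = V_OUT / V_IN
total_gain_db = gain_db(V_IN, V_OUT)
stage_gains = {n: per_stage_gain(total_gain, n) for n in (3, 4, 5)}

print(f"total gain = {total_gain:.0f} V/V ({total_gain_db:.1f} dB)")
for n, g in stage_gains.items():
    print(f"{n} stages: {g:.2f} V/V per stage")
```

Under this simplification a 3 stage LA needs about 3.7 V/V per stage and a 5 stage LA about 2.2 V/V per stage, which illustrates why several low gain stages are used.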



Fig 3.26 Limiting amplifier [3.10],[3.11],[3.12],[3.13]



Fig 3.27 Cherry-hooper LA and negative capacitance circuit [3.12], [3.13]



Fig 3.28 Negative miller capacitance LA[3.12],[3.13]

Fig 3.28 shows an LA which adopts a negative Miller capacitance as in the TIA. It extends the LA bandwidth in the same way as described for the negative Miller capacitance TIA. It adds only small capacitors instead of 4 NMOS transistors and a large capacitor, so area and design complexity are reduced.

On the other hand, the TIA and LA output signals must be measured off chip and off board. But the chip PAD capacitance is very large and the output impedance must be 50 $\Omega$ to prevent reflection. It is hard to realize such a function using only the LA, so an output driver with 3~5 stages is added. The final output stage must be matched to 50 $\Omega$ and must be able to drive a large capacitance.



Fig 3.29 Output driver [3.10], [3.12]

### 3.3.4 Optical receiver back end design – CDR

The modulator driver in the transmitter and the TIA, LA, and driver in the receiver have been described so far. The signal swing at the TIA is too small to operate other circuits, so the LA amplifies the small signals up to a CMOS level that can drive CMOS digital or analog circuits. This section describes how the amplified signals are used for accurate receiver operation. Clock and data recovery must be accomplished in the receiver for accurate communication. The input data is jittery and is based on the transmitter clock, which does not have the same frequency and phase as the receiver clock. As shown in Fig 3.30, the CDR recovers the clock from the received data and then resamples the input data with the recovered clock, so the jitter of the recovered data is eliminated. For clock recovery, the data must have sufficient transitions, so an encoding such as 8b/10b is needed.



Fig 3.30 Clock and data recovery concept [3.16]

Fig 3.31 shows the overall architecture of the proposed 12.5Gbps optical front end receiver. The inverter based TIA and the negative Miller capacitance LA are used for signal detection and amplification. Only two LA stages are used to reduce power dissipation, and two stages provide sufficient voltage swing (10mV~50mV) for the samplers. The CDR operates in a half rate, time interleaved manner with a 2X oversampling scheme. So four samplers are needed: two samplers sample edge information and the other two sample data information. The samplers may each have their own offset due to device mismatch. The offset is about 10mV~20mV and it can affect data sampling, so the sampler offset must be calibrated as shown in Fig 3.32.
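How edge and data samples can be turned into an early or late decision is sketched below with the well-known bang-bang (Alexander) phase detection rule. This is an assumption made for illustration: the text states only that 2X oversampling with edge and data samplers is used, not which decision logic the proposed CDR implements, and the sign convention is also chosen here.

```python
from typing import Iterable, Tuple


def bang_bang_pd(d_prev: int, edge: int, d_next: int) -> int:
    """Bang-bang phase decision from two data samples and the edge between them.

    Returns -1 if the sampling clock is early, +1 if it is late, and 0 when
    there is no data transition (no phase information).
    ASSUMED convention: the edge sample is taken between d_prev and d_next.
    """
    if d_prev == d_next:
        return 0
    # The edge sample still equals the previous bit, so the edge clock
    # arrived before the data transition: the clock is early.
    if edge == d_prev:
        return -1
    # The edge sample already equals the next bit: the clock is late.
    return +1


def vote(samples: Iterable[Tuple[int, int, int]]) -> int:
    """Accumulate decisions over several (d_prev, edge, d_next) triples."""
    return sum(bang_bang_pd(*s) for s in samples)


# Example triples: two early decisions, one without transition, one late.
example = [(0, 0, 1), (1, 1, 0), (1, 1, 1), (0, 1, 1)]
net = vote(example)
print("net decision:", net)
```

The case without a transition returns zero, which is why the data must contain sufficient transitions for the loop to receive phase information.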



Fig 3.31 Optical front-end overall receiver architecture

The sampler is composed of an offset cancellation sense amplifier, a normal sense amplifier, and a latch. The sense amplifier is a strong arm type and it amplifies the LA output signals to CMOS level. The sampler operates with a half rate clock, so the sampler bandwidth is not a problem.

The CDR operates in two steps, frequency acquisition and phase acquisition. First, frequency acquisition is performed without data sampling. The receiver has its own reference oscillator clock with a frequency of 195.313MHz. A four stage VCO (Voltage controlled oscillator) oscillates at a frequency set by the control voltage and generates eight phases. The VCO output clock is divided by 32 in the analog clock divider and its frequency is compared with the reference clock frequency in the PFD (Phase and frequency detector). The PFD generates an up or down signal according to the frequency and phase difference, and the pulse width of the up and down signals is proportional to that difference. The charge pump adjusts the control voltage according to the up and down pulse widths. A digital frequency lock detector checks the frequency difference between the reference clock and the divided VCO clock. If there is little frequency difference between the two clocks, the flock (Frequency lock) signal becomes '1'. This finishes frequency acquisition and the CDR begins phase acquisition. In phase acquisition, the sampled data and edge information are used to bring the VCO control voltage to the accurate value. There are two paths, a proportional path and an integral path. In the proportional path, the sampled data and edge information are converted into up and down signals in the phase detector and are reflected in the VCO frequency immediately. In the integral path, on the other hand, the sampled data and edge information are accumulated, the up or down decision is made on the accumulated result, and it is reflected in the VCO frequency slowly. Finally, the VCO can oscillate with accurate frequency and phase in a jittery environment.
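The frequency plan described above can be verified with simple arithmetic. The sketch below derives the VCO frequency from the reference clock and the divide ratio, and then the data rate of the half rate CDR; the phase spacing assumes that the eight VCO phases are equally spaced over one clock period.

```python
F_REF_HZ = 195.313e6   # reference oscillator frequency from the text
DIVIDER = 32           # analog clock divider ratio from the text
N_PHASES = 8           # phases generated by the four stage VCO

# In lock, the divided VCO clock equals the reference clock.
f_vco_hz = F_REF_HZ * DIVIDER

# Half rate operation: two bits are received per VCO clock period.
data_rate_bps = 2.0 * f_vco_hz

# Unit interval and phase spacing (ASSUMES equally spaced phases).
ui_s = 1.0 / data_rate_bps
phase_step_s = 1.0 / (f_vco_hz * N_PHASES)

print(f"VCO frequency : {f_vco_hz / 1e9:.4f} GHz")
print(f"Data rate     : {data_rate_bps / 1e9:.4f} Gbps")
print(f"Unit interval : {ui_s * 1e12:.1f} ps")
print(f"Phase spacing : {phase_step_s * 1e12:.1f} ps")
```

The 195.313MHz reference multiplied by 32 gives a VCO frequency of about 6.25GHz, which corresponds to the 12.5Gbps data rate of the proposed receiver in half rate operation.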



Fig 3.32 Sampler architecture and offset control

## 3.4 Measurement and simulation results

Measurement is performed in two steps. In the first step, electrical measurement, only the CMOS chip is measured. A capacitor is added on the transmitter board to model the modulator, and a resistor set and a capacitor are added on the receiver board to model the PD. Electrical measurements are made for many versions of the modulator driver and TIA. The second step, optical measurement, is performed with the optical device chip and the CMOS chip connected.

### 3.4.1 Measurement and simulation environments





Fig 3.33 shows the electrical models of the optical devices, the modulator and the PD, used for simulation and measurement. The PD and the modulator are basically the same form of diode, so their model structures are similar but the component values differ. The modulator has a larger parasitic capacitance than the PD.



Fig 3.34 Transmitter electrical measurement environments



Fig 3.35 Receiver electrical measurement environments

Fig 3.34 and Fig 3.35 show the electrical measurement environments. For the transmitter board measurement, $50\Omega$ termination is provided in the oscilloscope and the modulator capacitor is added selectively. The COB is modeled as an inductor with an inductance of about 1nH/mm. The COB length is expected to be under 1mm. For the receiver board measurement, a resistor set and a PD capacitor are used to model the PD. The PD capacitor is used selectively.
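The COB model can be combined with the modulator load to estimate where the bond inductance starts to matter. The sketch below computes the series LC resonance for a 1mm COB using the 1nH/mm figure above; the 500fF capacitance is the load bound from section 3.2.1 and is used here as an assumption, since the actual modulator capacitor value on the board is not stated.

```python
import math


def lc_resonance_hz(l_henry: float, c_farad: float) -> float:
    """Resonance frequency of an ideal LC network in Hz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_henry * c_farad))


L_PER_MM = 1e-9      # COB inductance per mm from the text [H/mm]
COB_LENGTH_MM = 1.0  # upper bound on the COB length from the text
C_MOD = 500e-15      # ASSUMED load, taken from the requirement in 3.2.1 [F]

l_cob = L_PER_MM * COB_LENGTH_MM
f_res_hz = lc_resonance_hz(l_cob, C_MOD)

print(f"COB inductance : {l_cob * 1e9:.1f} nH")
print(f"LC resonance   : {f_res_hz / 1e9:.2f} GHz")
```

With these values the resonance falls near 7GHz, close to the signal band of a 10Gbps~12.5Gbps link, which is consistent with keeping the COB length under 1mm.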





Fig 3.36 Electrical measurement environment (a) Electrical measurement overall setup, (b) PCB board for electrical measurement

Fig 3.36 shows the electrical measurement environment and Fig 3.37 shows the optical measurement environment, a hybrid measurement of the optical device chip and the CMOS chip.






Fig 3.37 Overall measurement environment (a)chip connection, (b) aligning



### 3.4.2 Optical TX front end measurement and simulation

Fig 3.38 Proposed inverter based voltage mode driver measurement results



Fig 3.39 Proposed Modified stacked FET current mode driver measurement results







Fig 3.41 Proposed modified Simple high swing SST voltage mode driver measurement results (RMS jitter : 3.64ps at 10Gbps, 4.57ps at 12.5Gbps)

| Driver                    | Max. operating<br>frequency | Power<br>dissipation                | Power supply voltages | Process |
|---------------------------|-----------------------------|-------------------------------------|-----------------------|---------|
| Inverter based VMD        | 8Gbps                       | 20mA, 85mA(@8Gbps)<br>24mW, 229.5mW | 1.2V, 2.7V            | 0.13um  |
| Stacked FET CMD           | 12.5Gbps(*)                 | 7mA, 30mA<br>18.2mW, 114mW          | 2.6V, 3.8V            | 65nm    |
| Modified SST VMD          | 12.5Gbps                    | 88mA, 65mA<br>106mW, 156mW          | 1.2V, 2.4V            | 0.13um  |
| Proposed modified SST VMD | 12.5Gbps(*)                 | 26mA, 32mA<br>90mW                  | 1V, 2V                | 65nm    |

(\*): JBERT supports up to 12.5Gbps

#### Table 3.1 Modulator driver summary

## 3.4.3 Optical RX front end measurement and simulation



Fig 3.42 Gain boosted cascode TIA measurement results



(a) 3Gbps





Fig 3.44 Negative miller capacitance TIA (12.5Gbps)





| TIA                             | Max. operating<br>frequency | Power<br>dissipation                     | Power supply<br>voltages | Process |
|---------------------------------|-----------------------------|------------------------------------------|--------------------------|---------|
| Gain boosted cascode TIA        | 8Gbps                       |                                          | 1.0V                     | 90nm    |
| Shunt and series peaking TIA    | 10Gbps                      | TIA : 5mW , LA : 28.4mW<br>DRV : 31.6mW  | 1.2V                     | 0.13um  |
| Negative miller capacitance TIA | 12.5Gbps                    | TIA : 2.7mW, LA : 25.4mW<br>DRV : 17.6mW | 1.2V                     | 0.13um  |
| Inverter based TIA              | 12.5Gbps(*)                 | TIA : 1.3mW, LA : 24mW,<br>DRV : 13mW    | 1.0V                     | 65nm    |

(\*) : JBERT supports up to 12.5Gbps

#### Table 3.2 Receiver electrical measurements summary



## 3.4.4 Optical RX back end simulation

Fig 3.46 Simulation results (a) edge sampling, (b) data sampling, (c) frequency locking

Fig 3.46 shows the overall receiver simulation results, including the CDR. Fig 3.46.(c) shows the frequency locking process, in which the control voltage settles to a fixed value. Phase locking is then attained, so stable data sampling is obtained (Fig 3.46.(b)). The phase margin is about 70 degrees and the PLL bandwidth is 3MHz.

## 3.4.5 Optical-electrical overall measurements

Fig 3.47 shows the optical measurement environment. Three chips are mounted on a single PCB. The top chip is a CMOS modulator driver chip, the bottom chip is a CMOS TIA, LA, and driver chip, and the middle chip is an optical device chip which contains the modulator, PD, and silicon waveguide [3.15]. An electrical signal is fed into the top chip, and the modulator driver generates the modulator driving signal.







Fig 3.47 Optical measurement. (a) environments and (b) results [3.15]

The modulator driving signal is transmitted through the COB, the modulator modulates the light signal, and the light is received by the TIA and LA of the receiver CMOS chip, which converts it into an electrical output. Eye diagrams of the signal after the whole path are shown in Fig 3.47.(b). The inverter based modulator driver and the shunt and series peaking TIA are implemented in the CMOS chips. The overall measurement result is better than that of the inverter based modulator driver alone because the COB length is much shorter in the optical measurement than in the electrical measurement. Fig 3.48 shows an optical measurement with a commercial PD. As expected, it shows much better performance than the electrical measurement with the same CMOS chip. The RMS jitters are 2.01ps and 4.09ps at 10Gbps and 20Gbps, respectively.



Fig 3.48 Optical measurement with commercial PD


## 3.4.6 Die photo and layout

Fig 3.49 Gain boost cascode TIA, LA die photo (5000um x 5000um)



Fig 3.50 Shunt and series peaking TIA, LA die photo [3.10] (2700um x 1800um)



Fig 3.51 Negative miller capacitance TIA, LA die photo [3.12] (2400um x 1100um)



Fig 3.52 Modified simple SST driver die photo [3.12] (2500um x 1100um)



Fig 3.53 Stacked FET modulator driver die photo (950um x 800um)



Fig 3.54 Proposed modified simple SST driver die photo. (2350um x 1000um)



Fig 3.55 Inverter based TIA, LA die photo (2400um x 900um)

## 3.5 Conclusion

Required bandwidth has increased explosively, and this trend is expected to continue. Although the electrical link is low in cost, the copper medium of a PCB shows sharp attenuation above several GHz, so silicon photonics has begun to emerge for short distance chip to chip communication links. Silicon photonics uses an optical fiber medium, which has little attenuation and loss, and uses optical devices as the front-end. These optical devices convert a light signal into an electrical signal, or the reverse. In the back end, silicon photonics uses its own characteristic circuits, such as the TIA and modulator driver, in addition to existing CMOS electrical circuits.

This research has presented several types of modulator driver for the transmitter, and of TIA and LA for the receiver. The proposed modified simple SST driver and the inverter based TIA and LA achieve communication above 12.5Gbps in electrical measurement. An early version of the modulator driver (inverter based modulator driver), TIA (shunt and series peaking TIA), and LA (Cherry-Hooper and negative capacitance LA) were connected to the optical device chip, and the overall system operates at 10Gbps. The optical measurement is better than the electrical measurement because the COB length is much shorter than in the electrical board environment, since the two chips have similar thickness.

Moreover, a 12.5Gbps overall receiver with an optical front end is designed. These circuits are fabricated in 0.13um and 65nm CMOS processes. In the near future, silicon photonics transceivers will be adopted in next generation memory interfaces.

## **CHAPTER 4**

## **ELECTRICAL FRONT-END SERIAL LINK DESIGN FOR 20GBPS MEMORY INTERFACE**

## 4.1 Introduction

Memory has been widely used in PCs and servers, mostly for communicating with the CPU. Generally, an electrical link on a PCB, made of a copper medium, is used to build the memory interface. Fig 4.1 shows the composition of an electrical memory link. Memory communicates with the memory controller or CPU over a short (about 5~10 inch) SB (Single board) channel or over a long (about 15~20 inch) BP (Back plane) channel. A general memory interface is composed of many channels, and each channel has a clock signal and many aligned data signals. In the past, the required bandwidth was below the Gbps range, so few problems arose. Recently, however, the required bandwidth has increased explosively, while PCB channel length has shortened only slightly. This causes great difficulty, because the copper medium has sharp attenuation at the recently required bandwidth (above 10GHz).



Fig 4.1.Electrical memory link composition [4.1]

In chapter 3, this research presented silicon photonics as a strong alternative for a next generation memory interface operating above 10Gbps. It uses an optical fiber medium, which has little attenuation and little interference between signals. Moreover, several signals can be carried in a single fiber by DWDM, so interface complexity can be reduced greatly. Silicon photonics looks very appropriate and is expected to be realized in the near future. So far, however, silicon photonics has difficulties and limitations in realization and commercialization. Hybrid silicon photonics, which consists of a CMOS chip and an optical device chip, can be implemented above 10Gbps, but it is not yet practical because the optical device is still expensive. The ultimate goal of silicon photonics is to integrate every component in a single CMOS chip, but it is very difficult to integrate optical devices in CMOS, and their operating speed is still low at present.

This research therefore suggests a practical and realistic solution for a next generation memory interface that needs an operating speed of 10Gbps $\sim$ 20Gbps per lane. As described, serial links fall into two major categories, embedded clocking and forwarded clocking, each with its own advantages and disadvantages. Because the demand for higher bandwidth continues, a multi channel link looks indispensable. In a multi channel link, forwarded clocking is better than embedded clocking because it has better jitter performance and the clock channel overhead falls as the number of data channels increases. Moreover, a high speed link over a copper medium is difficult, so complex circuits such as an equalizer, which compensates channel loss, are needed. A forwarded clocking serial link has a simpler architecture than an embedded clocking serial link, so it is better suited to a high speed electrical link.

This chapter studies forwarded clocking multi channel links and proposes a novel forwarded clocking multi channel link architecture, with a focus on the receiver.

# 4.2 Conventional electrical front-end high speed serial link architectures

There has been much research on forwarded clocking serial links. A skew between the clock and data channels cannot be avoided because of PCB line mismatch and layout mismatch. As mentioned (Fig 4.1), the distance between memory and memory controller is about 5 ~ 20 inches. Signal transmission time is approximated by $T = l \sqrt{LC}$ (l = length, L = inductance, C = capacitance), and for a general PCB trace the parameters can be approximated as L = 5nH / inch and C = 2pF / inch. When $\Delta l$ = 0.1 inch, $\Delta T$ is calculated to be 10ps, which equals 0.2UI at 20Gbps. When $\Delta l$ = 1 inch, $\Delta T$ is calculated to be 100ps, which equals 2UI at 20Gbps [4.1]. It is concluded that a skew of 2UI ~ 5UI should be expected when designing a multi channel link. As skew increases, the jitter correlation between the clock channel and the data channel decreases. Therefore, whether to perform multi UI skew calibration, in addition to skew calibration within 1UI, is also a design consideration [4.7].
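The skew estimate above can be reproduced with a few lines of arithmetic. The following is a minimal sketch using the per-inch inductance and capacitance quoted in the text; the function names are illustrative.

```python
import math

L_PER_INCH = 5e-9    # trace inductance in H/inch, value from the text
C_PER_INCH = 2e-12   # trace capacitance in F/inch, value from the text
DATA_RATE = 20e9     # bits per second


def skew_seconds(delta_length_inch: float) -> float:
    """Delay difference for a trace length mismatch, T = l * sqrt(LC)."""
    return delta_length_inch * math.sqrt(L_PER_INCH * C_PER_INCH)


def skew_ui(delta_length_inch: float, data_rate: float = DATA_RATE) -> float:
    """Skew expressed in unit intervals, where 1 UI = 1 / data_rate."""
    return skew_seconds(delta_length_inch) * data_rate


for delta_l in (0.1, 1.0):
    print(f"delta l = {delta_l} inch: "
          f"{skew_seconds(delta_l) * 1e12:.1f} ps, {skew_ui(delta_l):.2f} UI")
```

With these values the propagation delay is 100ps per inch, so a mismatch of a few inches is enough to produce the multi UI skew discussed above.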

A representative architecture uses a global DLL in the clock channel for multi phase generation and a PI (Phase interpolator) per lane for skew calibration and phase selection [4.1] (Fig 4.2). The selected phases are used to sample the incoming data in the data channel. However, this architecture needs multi phase distribution for PI phase selection. It is not appropriate for a multi lane link, because multi phase clock distribution consumes much power and is very difficult in layout. Fig 4.2 shows this architecture, and it makes clear that multi phase distribution becomes complex in a multi lane link.

Fig 4.2 Global DLL and Phase interpolator architecture [4.1]

Another architecture uses a DLL or PLL per data channel [4.2]. The clock channel distributes only a single ended or differential clock signal, and its frequency is also lowered, so clock distribution complexity and power dissipation are reduced. The DLL or PLL in each data channel generates the proper frequency and phase from the transmitted clock signal, but the number of complex components, DLLs and PLLs, increases with the number of lanes.



Fig 4.3 Injection locked architecture [4.4]

Recently, the ILO (Injection locked oscillator) architecture has begun to take center stage (Fig 4.3). An ILO uses only one clock phase as its input, so multi phase distribution is not necessary, and an ILO is much simpler than a DLL or PLL. The ILO output frequency is injection locked to the incoming data frequency when the frequency difference between the clock and data signals is small, and its phase is determined by the ILO free running frequency. Moreover, clock jitter is filtered by the ILO. The ILO architecture thus has many advantages. However, ILO design is tricky, because many parameters such as free running frequency, output phase, and locking range are related in a complex way.

An adjustable PLL architecture can be one solution for a forwarded clocking multi channel link [4.5]. Each data channel has one adjustable PLL, which uses one clock phase. A general PLL can synthesize a required frequency but not a required phase. An adjustable PLL, however, can synthesize a required phase by controlling the charge pump current or by multi phase control. Its disadvantage is that every channel needs a PLL and a VCO (Voltage controlled oscillator), which occupy a large area.

There is also an architecture which uses a common phase rotator for several adjacent lanes, with phase de-skew logic in each lane [4.6]. Each phase rotator controls the sampling phase for each sampler.

As mentioned, N-UI skew calibration is also a design consideration. Basically, an auxiliary channel is needed to report the receiver status. The receiver sends N-UI skew information through the auxiliary channel, and the transmitter shifts the clock or data so that N-UI shifted data is transmitted to the receiver. However, the auxiliary channel is a burden on the overall system, since it occupies one more interface


channel, similar to the other data and clock wire line interfaces. Some architectures do not perform N-UI skew compensation but instead adjust the JTB (Jitter tracking bandwidth) with a jitter filter to reflect the variation of jitter correlation with skew, so that each channel's data signal can be sampled with an appropriate phase [4.3].

Finally, the frequency relation between the clock and data channels is an important issue. Half rate clock transmission looks most efficient for maximizing jitter correlation. In forwarded clocking, however, clock routing is the most complex part of the overall design, and power consumption is proportional to clock frequency. Lowering the clock frequency, on the other hand, lowers jitter correlation and requires frequency synthesis logic such as a PLL.

#### 1. N-UI skew calibration (jitter transfer)

| N-UI skew calibration | Option | Description |
|---|---|---|
| Y | A | AUX channel and TX N-bit shifter are needed. |
| N | A | Each channel has its own optimized JTB (Jitter tracking bandwidth). |
| N | B | Every channel has a commonly optimized JTB. |

#### 2. Multi phase distribution (phase interpolator usage, mismatch)

| Multi phase distribution | Option | Architecture |
|---|---|---|
| Y | A | Global DLL + PI |
| N | A | Local DLL + PI |
| N | B | Local ILO (Injection locked oscillator) |
| N | C | Local adjustable PLL (Toifl architecture, no PI) |
| N | D | Local phase rotator + fine de-skew block |

#### 3. Time interleaving (power, jitter)

| Option | Clocking |
|---|---|
| A | Half rate (10GHz) |
| B | Quad rate (5GHz) |

#### **Table 4.1** Conventional multi channel link summary

Table 4.1 summarizes conventional forwarded clocking serial link designs. There are three main design considerations: (1) N-UI skew calibration, (2) multi phase distribution, and (3) time interleaving.

## 4.3 Design concept and proposed serial link architecture: open loop delay matched streamlined receiver

## 4.3.1 Proposed overall architecture

Fig 4.4 shows the proposed forwarded clocking serial link transceiver architecture. It is composed of one clock channel and four data channels, and the operating speed is 20Gbps.



Fig 4.4 Overall architecture of proposed serial link

In the transmitter, an ADPLL (All digital phase locked loop) in the clock channel generates the clock signals of the various frequencies needed throughout the transmitter. A PRBS (Pseudo random binary sequence) generator and protocol manager produce 40 bit parallel PRBS data and protocol data at 500Mbps per bit, which are selectively 8b/10b encoded. These signals are serialized with the PLL generated clock, and the resulting 20Gbps signals are transmitted through the driver. The driver is matched to $50\Omega$ and applies de-emphasis to prepare for channel loss.

In the receiver, the data signal attenuated by the channel is boosted in the CTLE (Continuous time linear equalizer). The DFE (Decision feedback equalizer) controller and the DCDL (Digitally controlled delay line) controller then find their optimum points for channel loss compensation and for skew calibration, respectively. The clock signal received in the clock channel is used as the receiver system clock. Received data is sampled and deserialized into 40 bit signals at 500MHz. The deserialized data are used in the DFE controller and DCDL controller for calibration, and in the BER (Bit error rate) checker for error checking. This chapter discusses the overall architecture and locking process briefly, and the DCDL and its controller design in more depth.



### 4.3.2 Design Concept

Fig 4.5 Skew compensation method

Fig 4.5 shows the skew compensation method of the proposed architecture. As shown in Fig 4.5, there are two lines, and the upper line is the data path. First, the received clock signal is delayed by the skew of data channel 2 (D2), and the data is sampled by the delayed clock. The sampled data is then delayed by T-D2, which depends on the data channel skew, and arrives at the second flip flop. On the other hand, the clock channel signal passes through two DCDLs with a constant total delay (D0+(T-D0)=T). In this way, not only is the data channel skew compensated, but the data of every data channel is also synchronized with the delayed global clock.

Fig 4.6 shows a more intuitive design concept diagram of the proposed serial link receiver. Every data channel has its own skew relative to the clock channel. Every data channel has a DCDL whose code value is clk\_code. If the optimum clk\_code of every data channel is found, each channel's data can be sampled at its own data phase, so every channel is sampled stably, each at a different phase. The sampled data is then delayed by a DCDL by the amount d\_code, which is the complement of clk\_code, so all data is finally aligned to the same phase. The delayed global clock can therefore sample all of the delayed data, and every data channel can perform deserialization with a single synchronized clock. Consequently, selecting clk\_code and d\_code for each data channel is the most important problem in the proposed serial link receiver.
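The delay matching idea can be illustrated numerically. The sketch below assumes an example total delay T and example channel skews, none of which come from the design; it only shows that delaying the sampling clock by D and the sampled data by T-D makes every channel arrive at the same time T, matching the fixed clock channel delay.

```python
# Sketch: open loop delay matching. All delays are in ps.
# T_TOTAL and the skew values are assumed example numbers, not design values.
T_TOTAL = 124.0


def data_channel_delays(channel_skew: float, total: float = T_TOTAL) -> tuple:
    """Return (sampling clock delay, sampled data delay) for one data channel.

    The clock is delayed by the channel skew D so that it samples inside the
    data eye, and the sampled data is then delayed by T - D.
    """
    if not 0.0 <= channel_skew <= total:
        raise ValueError("skew must lie inside the DCDL range")
    return channel_skew, total - channel_skew


def arrival_time(channel_skew: float, total: float = T_TOTAL) -> float:
    """Time at which the re-timed data reaches the second flip flop."""
    clk_delay, data_delay = data_channel_delays(channel_skew, total)
    return clk_delay + data_delay


skews = {"D1": 12.0, "D2": 40.0, "D3": 76.0, "D4": 100.0}
for name, skew in skews.items():
    print(name, data_channel_delays(skew), arrival_time(skew))
```

Every channel reports the same arrival time, so a single delayed global clock can capture all channels.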



Fig 4.6 Design concept diagram of proposed serial link receiver

Fig 4.7 shows a block diagram of the proposed architecture. Fig 4.7.(a) shows the transmitter. The PRBS generator and 8b/10b encoder generate 40 bit parallel data according to the proposed protocol, at an operating frequency of 500MHz. The serializer converts the 500MHz 40 bit data into 20Gbps 1 bit differential data, using the clock provided by the ADPLL. The serialized data is delayed by a DCDL for transmitter data skew control, and finally it is transmitted by the driver.





Fig 4.7.(b) shows the receiver. First, the Nyquist pattern is received, and the data is boosted by the CTLE. DCDL CTRL sets the DCDL value for stable sampling of the Nyquist pattern. Then the PRBS pattern is received and the half rate PRBS chaser starts working. Because the PRBS pattern is predefined, the real time PRBS chaser in the receiver can generate the data expected from the transmitter. The digital PRBS chaser also provides the predefined PRBS pattern in 40 bit de-serialized form. Only two clock phases are available for sampling and the receiver operates at half rate, so once PRBS data is being received, the receiver generates the expected values and the data path performs edge sampling. In this process, the edge type DFE operates and determines its coefficients. As a result, stable sampling is obtained. The de-serializer converts the sampled data into 40 bit data at 500MHz. The 8b/10b decoder and control signal generator detect protocol signals and generate control signals, which are sent to EQ CTRL and DCDL CTRL. Pure data signals are sent to the BER checker, which checks the error status.



#### 4.3.3 Proposed protocol and locking process

#### Fig 4.8 Proposed protocol for proposed architecture

As mentioned above, the proposed transceiver uses only two clock phases and operates in a half rate interleaved way, so it needs its own protocol. Fig 4.8 and Fig 4.9 show the proposed protocol and the overall locking process of the proposed serial link. First, the Nyquist pattern (11001100..) is received from the transmitter. The Nyquist pattern is a 5GHz signal, which is slow relative to the 20Gbps operating speed, so the data eye is certain to be open without any equalization. The DCDL controller operates during this period, and the optimum DCDL point is found within 40~60us.



Fig 4.9 Locking process of proposed serial link

After the DCDL optimum point is found, the protocol data changes to a PRBS7 signal. When PRBS is received, the data eye is closed because of channel attenuation, so the CTLE boosts the received signal until the correct data pattern is sampled. Once the correct signal is received, the CTLE stops boosting, because over boosting also amplifies jitter and noise. The DFE controller then adjusts the DFE toward edge alignment. DFE control is implemented by shifting the DCDL control code through the DCDL controller.

# 4.4 Optimum point search algorithm based DCDL controller design

As mentioned above, there are several DCDLs in both the data channels and the clock channel. Because the data cycle is 50ps and the half rate data cycle is 100ps, a DCDL resolution of 4ps and 32 stages are assumed. The sum of the two clock channel DCDL values is assumed to be fixed at the half code value, 31, and the skew between the data channel and the clock channel is assumed to be 0. The global clock is the clock at the front of the clock channel, amplified by the EQ or LA (Limiting amplifier). The sampler clock is the sampler input clock of a data channel, delayed by that channel's clk\_code. Fig 4.10 shows how the sampling point is decided for each clk\_code. In practice, clk\_code is determined by the skew between the clock channel and the data channel. The aligned clock is the clock delayed by all clock channel DCDLs. Because the clock channel DCDL codes are fixed, the aligned clock phase is also fixed. Fig 4.10.(a)~(c) show that (1) the sum of clk\_code and the optimum data\_code is constant, and (2) when clk\_code is large, data\_code has a narrow sampling margin.

Fig 4.11 shows the overall clk\_code - d\_code table for each skew between the clock channel and the data channel. It is easily checked that the set of clk\_code and d\_code values does not change even when the channel skew changes, and the edge position stays the same. Based on this table and these properties, a sampling point decision algorithm and an optimum point decision algorithm are proposed. The DCDL controller sweeps clk\_code upward and d\_code downward.



(a) DCDL controller timing diagram – clk\_code =3



(b) DCDL controller timing diagram - clk\_code =15








Fig 4.10 DCDL controller timing diagram

Fig 4.11 DCDL clk\_code - d\_code table

First, the DCDL controller sweeps d\_code downward with clk\_code fixed. For an accurate decision, the DCDL controller samples 8 times at each point and decides whether the sampled data is 1 or 0. Through this sweep, the DCDL controller finds the sampling point for the fixed clk\_code. If two edges occur in the d\_code sweep for a fixed clk\_code, the center of the two edges is the sampling point. If there is only one edge, the sampling point depends on where the edge d\_code lies and on the value of clk\_code. One cycle of half rate apart from the edge d\_code is then the sampling point. Only one or two edge regions are possible, because the DCDL range is from two cycles to four cycles. As expected, the d\_code of the sampling point decreases as clk\_code increases. Ideally, when clk\_code increases by one code, d\_code decreases by one code.
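The per-row search can be sketched as follows. This is a simplified illustration rather than the controller's actual logic: it covers the 8-sample decision and the two-edge case, resolves a 4-4 tie to 0 (an assumption), and leaves the one-edge rule out. The example row is made up.

```python
from typing import List, Optional


def majority_vote(samples: List[int]) -> int:
    """Decide 1 or 0 from repeated samples (the text uses 8 samples per point)."""
    return 1 if sum(samples) * 2 > len(samples) else 0


def find_edges(row: List[int]) -> List[int]:
    """Return the d_code indices at which the decided value changes."""
    return [i for i in range(1, len(row)) if row[i] != row[i - 1]]


def sampling_point_two_edges(row: List[int]) -> Optional[int]:
    """Center of two edges, or None when the sweep does not contain exactly two."""
    edges = find_edges(row)
    if len(edges) != 2:
        return None  # the one-edge case needs the extra rule described in the text
    return (edges[0] + edges[1]) // 2


# Made-up decided values for one clk_code row over 32 d_code steps.
example_row = [0] * 5 + [1] * 20 + [0] * 7
print(find_edges(example_row))
print(sampling_point_two_edges(example_row))
```

The sweep direction does not matter for this sketch, since the center of the two edges is the same whether d\_code is swept upward or downward.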



Fig 4.12 Optimum sampling point search algorithm

After all sampling points are decided, a sweep over the sampling points is performed. clk\_code has a narrower edge to edge distance (50ps) than d\_code (100ps), so the clk\_code decision is performed later, using stable sampling points. In the sampling point sweep there must be at least two transitions, because the DCDL range is larger than 2 cycles. The midpoint of two transitions is therefore chosen as the optimum sampling point.



Fig 4.13 DCDL table in noise environment – problem

However, a serious problem arises in a noisy environment. When the DCDL controller operates in a noisy environment, a random pattern can be sampled near a clock edge. Fig 4.13 shows the clk\_code - d\_code table under random and sinusoidal noise. Random values are sampled close to a transition (clk\_code = 12, 13, d\_code = almost all codes) and close to an edge ((clk\_code, d\_code) = (0,6), (2,5), (2,30), ...). As a result, the sampling points lose the expected tendency between clk\_code and d\_code. The algorithm therefore excludes confusing rows from consideration, which makes the decision simpler. As shown in Fig 4.13, there are several confusing rows (rows 0, 12, 13, 24, 25), and there are confusing points adjacent to the edges in each row. In both the sampling point decision algorithm and the optimum sampling point decision algorithm, the middle of the edges or transitions is decided only within the stable region, to maximize BER performance [4.8]. For example, the optimum sampling point row is decided as (1+11)/2 = 6.
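The row selection in the example can be sketched as below. The confusing rows are the ones listed in the text; how a row is classified as confusing is not modeled here, and the helper names are illustrative.

```python
from typing import List, Tuple


def stable_regions(num_rows: int, confusing_rows: List[int]) -> List[Tuple[int, int]]:
    """Return (first, last) row index of each run of rows not marked confusing."""
    confusing = set(confusing_rows)
    regions = []
    start = None
    for row in range(num_rows):
        if row not in confusing and start is None:
            start = row
        elif row in confusing and start is not None:
            regions.append((start, row - 1))
            start = None
    if start is not None:
        regions.append((start, num_rows - 1))
    return regions


def optimum_row(region: Tuple[int, int]) -> int:
    """Middle row of a stable region."""
    first, last = region
    return (first + last) // 2


regions = stable_regions(32, [0, 12, 13, 24, 25])
print(regions)
print(optimum_row(regions[0]))
```

For the first stable region, rows 1 to 11, the middle row is 6, which agrees with the (1+11)/2 = 6 example above.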



Fig 4.14 DCDL table in noise environment – solution

In the proposed architecture, however, the DCDL controller must send both the optimum sampling point for data sampling and the optimum edge point for edge sampling to the EQ controller. The second transition point is chosen as the optimum edge point. The clk\_code of the optimum edge point is easily decided, but its d\_code is difficult to decide because the second transition point lies in a confusing region. To decide the optimum edge point, the DCDL controller predicts the d\_code of the second transition from the optimum sampling point, since ideally the relationship between clk\_code and d\_code of the sampling points is linear.

Through these two algorithms, the DCDL controller can find the optimum sampling point and edge point. Because the optimum sampling point is changed by the EQ controller, the first row of the second transition is not a bad choice.
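The linear prediction can be illustrated with a short sketch. It assumes the ideal slope of one d\_code step down per clk\_code step up described earlier in this section, and uses the points (19,12) and (25,6) from the recalibration example later in this section as a check. The function name and the range check are illustrative.

```python
def predict_edge_d_code(opt_clk_code: int, opt_d_code: int, edge_clk_code: int,
                        slope: int = -1, num_codes: int = 32) -> int:
    """Extrapolate d_code along the line d = d_opt + slope * (clk - clk_opt).

    slope = -1 is the ideal relationship; a real DCDL may deviate from it.
    """
    d_code = opt_d_code + slope * (edge_clk_code - opt_clk_code)
    if not 0 <= d_code < num_codes:
        raise ValueError("predicted d_code falls outside the DCDL code range")
    return d_code


# Optimum sampling point (19, 12), second transition at clk_code = 25.
print(predict_edge_d_code(19, 12, 25))
```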



#### Global clock DCDL code correction

#### Fig 4.15 DCDL table optimization- active area expansion

As mentioned, the DCDL controller exchanges information with the EQ controller and must send it the optimum sampling point and optimum edge point. The EQ controller uses them to control the optimum edge point. When the EQ controller controls the optimum edge point, clk\_code is increased or decreased as up or dn signals are generated, and d\_code slides along the sampling points. The optimum edge point may therefore increase or decrease, but the region in which it can move is too narrow. To expand the active area, the horizontal axis of the table is shifted to the right. By increasing the clock channel DCDL value, the active area expands and the timing margin also increases.



Fig 4.16 Recalibration algorithm

Another consideration is that the calibration time is long. When the two algorithms are performed, it takes about 30us. The exact calibration time is given by (4.1) and (4.2).

$$2\,\text{ns} \times (8+8) \times 32 \times 32 = 32.768\,\mu\text{s} \qquad (4.1)$$

$$T_{\text{digital clock period}} \times (N_{\text{sampling times}} + N_{\text{waiting times}}) \times N_{\text{clk\_code}} \times N_{\text{d\_code}} \qquad (4.2)$$

This research therefore proposes a protocol suited to this architecture, which includes a recalibration part. The recalibration algorithm is described in Fig 4.16. When the transmitter generates a recalibration command, which consists of an 8b/10b encoded K character, the receiver starts the recalibration algorithm. The K character is received, followed by Nyquist patterns. In recalibration, only one row is swept, the row of the optimum sampling point, because that row must be the most stable one. After the d\_code sweep in that row, the sampling point is decided in the same way as in the sampling point search algorithm. If it has shifted from the optimum sampling point of the initial calibration, every sampling point is shifted by the same amount. For example, if the initial calibration optimum point is (19,12) and it shifts right by one code, then every sampling point shifts right by one code. Naturally, the optimum edge point also shifts, from (25,6) to (25,7). After recalibration, this information is passed to the EQ controller, so the up and dn control of the optimum edge point follows the recalibrated sampling points. In EQ control, the resolution is changed from 4.8ps to 2.4ps, so the final resolution becomes 2.4ps/50ps = 4.8% < 5% of the data cycle.
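The recalibration step can be sketched as follows. The point names, the dictionary layout, and the clamping to the code range are assumptions made for illustration; the shift rule and the example points are those given above. The sweep time helper applies equation (4.2) to a reduced number of rows.

```python
from typing import Dict, Tuple

Point = Tuple[int, int]  # (clk_code, d_code)


def recalibrate(points: Dict[str, Point], initial_opt_d: int, measured_opt_d: int,
                num_codes: int = 32) -> Dict[str, Point]:
    """Shift the d_code of every stored point by the drift seen on the optimum row."""
    shift = measured_opt_d - initial_opt_d
    shifted = {}
    for name, (clk_code, d_code) in points.items():
        # Clamping to the valid code range is an assumption of this sketch.
        new_d = min(max(d_code + shift, 0), num_codes - 1)
        shifted[name] = (clk_code, new_d)
    return shifted


def sweep_time_s(clock_period_s: float, n_sampling: int, n_waiting: int,
                 n_rows: int, n_d_code: int) -> float:
    """Sweep time from equation (4.2), with n_rows clk_code rows swept."""
    return clock_period_s * (n_sampling + n_waiting) * n_rows * n_d_code


points = {"optimum_sampling": (19, 12), "optimum_edge": (25, 6)}
print(recalibrate(points, initial_opt_d=12, measured_opt_d=13))
print(f"one row sweep: {sweep_time_s(2e-9, 8, 8, 1, 32) * 1e6:.3f} us")
```

Under the same timing assumptions as (4.1), sweeping a single row takes roughly 1us instead of about 33us for the full table, which is the motivation for recalibrating on one row only.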

Fig 4.17 shows a flowchart of sampling points decision algorithm and optimum point decision algorithm.




Fig 4.17 DCDL controller algorithm flow chart


## 4.5 DCDL (Digitally controlled delay line) design

The DCDL is an important factor in the overall operation of the DCDL controller. For the DCDL controller to operate correctly, monotonicity is indispensable, and linearity determines the operating performance. In the proposed architecture, every DCDL operates at half the data rate. The target data rate in this research is 20Gbps, and it is very difficult to design a 10GHz / 10Gbps DCDL with a resolution of 2~3ps and an overall range above 100ps in a 65nm process.







Fig 4.19 Glitch free method

The designed DCDL is shown in Fig 4.18 and is composed of CDUs (Coarse delay units) and FDUs (Fine delay units). A CDU is made of a NAND gate network, and its delay is set by the path delay of 2 NAND gates. An FDU is made of an inverter with a feedback resistor. Delay control is performed by digital logic which translates binary code to thermometer code. One CDU delay equals 8 FDU delays, but monotonicity can be broken because the CDU and FDU delays are independent, so PVT variation affects them differently. To prevent a break in monotonicity, FDU delay codes adjacent to a CDU transition are skipped selectively.
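The binary to thermometer translation mentioned above is a standard conversion and can be sketched in a few lines. How the DCDL code is actually split between coarse and fine units is not specified in the text, so the `split_code` helper below is an assumption based only on the stated 8 to 1 ratio between CDU and FDU delay.

```python
FDU_PER_CDU = 8  # one CDU delay equals 8 FDU delays, from the text


def binary_to_thermometer(value: int, width: int) -> list:
    """Return `width` bits with the lowest `value` bits set to 1."""
    if not 0 <= value <= width:
        raise ValueError("value out of range for thermometer width")
    return [1 if i < value else 0 for i in range(width)]


def split_code(code: int) -> tuple:
    """Split a delay code into (coarse steps, fine steps); an assumed mapping."""
    return divmod(code, FDU_PER_CDU)


coarse, fine = split_code(21)
print(coarse, fine, binary_to_thermometer(fine, FDU_PER_CDU - 1))
```

Thermometer coding changes only one control bit per code step, which is why it is commonly paired with delay lines where monotonicity matters.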



Fig 4.20 DCDL resolution




#### Fig 4.21 DCDL power dissipation

One special design consideration is the glitch free technique. In the proposed architecture, the DCDL code is changed by the DCDL controller or the EQ controller. If a glitch happens, the EQ controller cannot compare the sampled data accurately with the PRBS chaser data. A glitch happens when the DCDL changes from a long delay to a short delay, that is, when the CDU loop changes. In the glitch free technique, the longer delay loop signal changes its code earlier than the shorter delay loop, so the glitch is prevented.

The DCDL is actually designed with 128 stages, with a resolution of about 2.4 ~ 2.5ps. In 20Gbps operation, 64 stages of 2.4ps are used: the DCDL controller uses 32 steps of 4.8ps, and the EQ controller then adjusts the code within this range with 2.4ps resolution. In 10Gbps operation, 64 stages of 4.8ps are used: the DCDL controller uses 32 steps of 9.6ps, and the EQ controller then adjusts the code within this range with 4.8ps resolution. Fig 4.20 shows the INL and DNL of the DCDL code-delay characteristic. One DCDL consumes about 5mW, as shown in Fig 4.21.

# 4.6 DFE (decision feedback equalizer) and other blocks design

Fig 4.22 shows the designed CTLE. The CTLE is a general type, and it can boost the signal by up to 14dB under a digital control code (2dB per digital code).



Fig 4.22 Designed CTLE (Continuous time linear equalizer)

Fig 4.23 shows the half rate PRBS7 chaser. Generally, a PRBS pattern is generated by a single long flip flop chain with an XOR gate between two flip flop outputs. However, it is difficult to implement such a PRBS checker with a high speed clock. In this research, the flip flop and mux are merged to minimize gate delay, so the chaser can operate at 10Gbps. The PRBS chain can run selectively either from the received half rate data or from its internal data. When the equalizer needs data sampling, the external even data and external odd data are selected, and the PRBS checker chain is set by these values. The equalizer then changes from data sampling mode to edge sampling mode, and the internal PRBS checker provides the expected data to the equalizer.
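The chaser behavior, loading the chain from received data and then free running to supply expected data, can be modeled in software. The sketch below is a full rate behavioral model, not the merged flip flop and mux circuit, and it assumes the commonly used PRBS7 polynomial x^7 + x^6 + 1, which the text does not state.

```python
from itertools import islice


def prbs7(seed: int = 0x7F):
    """Yield PRBS7 bits from a 7-bit LFSR (polynomial x^7 + x^6 + 1, assumed)."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero")
    while True:
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        yield new_bit


def chase(received: list, n_predict: int) -> list:
    """Load the chain from the last 7 received bits, then predict the next bits.

    Uses the recurrence b[n] = b[n-7] XOR b[n-6] of the assumed polynomial.
    """
    window = list(received[-7:])
    if len(window) < 7:
        raise ValueError("need at least 7 received bits")
    predicted = []
    for _ in range(n_predict):
        next_bit = window[0] ^ window[1]
        predicted.append(next_bit)
        window = window[1:] + [next_bit]
    return predicted


bits = list(islice(prbs7(), 254))
even_bits, odd_bits = bits[0::2], bits[1::2]  # half rate even and odd streams
print(chase(bits[:7], 20) == bits[7:27])
```

Once seven error free bits have been loaded, the predicted stream matches the transmitted stream without further reference to the received data, which is what lets the data path switch to edge sampling.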



Fig 4.23 Designed Half rate PRBS chaser
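The seeding and free-running behavior described above can be illustrated with a small behavioral model. This is only a sketch: it assumes the common PRBS7 polynomial x^7 + x^6 + 1 (the text does not name the polynomial), and the function names `prbs7_serial`, `prbs7_half_rate`, and `chase` are hypothetical. It models the bit sequence only, not the merged flip-flop/mux circuit.

```python
def prbs7_serial(seed, n):
    """Generate n bits, one per step, from a 7-bit seed (oldest bit first)."""
    state = list(seed)
    out = []
    for _ in range(n):
        bit = state[0] ^ state[1]          # b[k] = b[k-7] XOR b[k-6]
        out.append(bit)
        state = state[1:] + [bit]
    return out


def prbs7_half_rate(seed, n_pairs):
    """Generate the same sequence two bits (even, odd) per step."""
    state = list(seed)
    pairs = []
    for _ in range(n_pairs):
        even = state[0] ^ state[1]         # b[k]
        odd = state[1] ^ state[2]          # b[k+1] = b[k-6] XOR b[k-5]
        pairs.append((even, odd))
        state = state[2:] + [even, odd]
    return pairs


def chase(received, n_expected):
    """Seed the chain with the first 7 received bits, then free-run to
    predict the bits that should follow them."""
    return prbs7_serial(received[:7], n_expected)


SEED = [1, 0, 0, 0, 0, 0, 0]               # any non-zero seed works
tx = prbs7_serial(SEED, 300)               # stand-in for the received data
expected = chase(tx[40:], 100)             # seeded from bits 40..46
half_rate_bits = [b for pair in prbs7_half_rate(SEED, 150) for b in pair]
```

Once seeded from seven correct received bits, the model predicts every later bit without looking at the input again, which is the property the equalizer relies on when it switches to edge sampling mode.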

Fig 4.24 shows an edge-type DFE combined with an integrating sampler. It is a 4-tap DFE: the 1<sup>st</sup> and 2<sup>nd</sup> taps are implemented by loop unrolling, and the 3<sup>rd</sup> and 4<sup>th</sup> taps are applied in the integrator. This DFE can operate selectively in data mode or edge mode.



Fig 4.24 Designed DFE (Decision feedback equalizer)
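The split between unrolled taps and directly fed-back taps can be sketched behaviorally. The model below is a minimal illustration, not the circuit of Fig 4.24: the tap weights in `TAPS_MV` are hypothetical integers in mV, and the integrating sampler is reduced to a plain subtraction. It shows that speculating on the two most recent decisions and selecting afterwards gives the same decisions as a direct 4-tap DFE.

```python
import random

TAPS_MV = (60, 30, 15, 8)   # hypothetical tap weights h1..h4 in mV


def slicer(v_mv):
    """Binary decision: +1 for a non-negative input, -1 otherwise."""
    return 1 if v_mv >= 0 else -1


def dfe_direct(samples_mv, taps=TAPS_MV):
    """Reference model: subtract all four feedback taps, then slice."""
    hist = [1, 1, 1, 1]                      # d[n-1], d[n-2], d[n-3], d[n-4]
    out = []
    for x in samples_mv:
        d = slicer(x - sum(h * p for h, p in zip(taps, hist)))
        out.append(d)
        hist = [d] + hist[:3]
    return out


def dfe_unrolled(samples_mv, taps=TAPS_MV):
    """Taps 1 and 2 by loop unrolling, taps 3 and 4 by direct feedback."""
    h1, h2, h3, h4 = taps
    hist = [1, 1, 1, 1]
    out = []
    for x in samples_mv:
        # Taps 3 and 4 have enough timing margin to be subtracted directly.
        base = x - h3 * hist[2] - h4 * hist[3]
        # Taps 1 and 2: slice speculatively for every possible (d[n-1], d[n-2]).
        candidates = {(d1, d2): slicer(base - h1 * d1 - h2 * d2)
                      for d1 in (1, -1) for d2 in (1, -1)}
        # Select the candidate that matches the decisions actually made.
        d = candidates[(hist[0], hist[1])]
        out.append(d)
        hist = [d] + hist[:3]
    return out


rng = random.Random(1)
samples = [rng.randint(-200, 200) for _ in range(1000)]   # arbitrary input in mV
decisions_direct = dfe_direct(samples)
decisions_unrolled = dfe_unrolled(samples)
```

Integer weights are used so that both models compute exactly the same slicer input; the benefit of unrolling in hardware is timing, since the select step replaces the feedback subtraction in the critical path.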

# 4.7 Simulation results

Fig 4.25 shows the DCDL controller simulation environment. In the data channel, a jittery clock is generated and this clock signal is used to sample the protocol data, so jittery data patterns are generated. In the clock channel, a jittery clock is generated and a skew controller shifts it to model the skew between the clock channel and the data channel. The jitter is composed of random jitter and sinusoidal jitter. The random jitter of the data channel and that of the clock channel are independent of each other. The sinusoidal jitter is the same for each channel, but the skew makes its effect differ between the channels. The DCDL model reflects the DCDL layout simulation results.



Fig 4.25 DCDL controller simulation environment

Fig 4.26 shows the jitter histogram and the jitter specifications of other applications. Random jitter has a Gaussian probability density function, and its specification is given by the standard deviation. Sinusoidal jitter is specified by its frequency and peak-to-peak value, and its distribution peaks at both ends of the amplitude. The sum of the two jitter components is injected into the channel.



(a) Jitter histogram

| DP frequency | DP Sj(PtoP) | DP Rj(RMS) | USB frequency | USB Sj(PtoP) | USB Rj(RMS) |
|--------------|-------------|------------|---------------|--------------|-------------|
| 2MHz         | 0.953UI     | 0.0178     | 0.5MHz        | 2UI          | 0.0121      |
| 10MHz        | 0.274UI     | 0.0178     | 1MHz          | 1UI          | 0.0121      |
| 20MHz        | 0.231UI     | 0.0178     | 2MHz          | 0.5UI        | 0.0121      |
| 2MHz         | 0.897UI     | 0.0132     | 4.9MHz        | 0.2UI        | 0.0121      |
| 10MHz        | 0.218UI     | 0.0132     | 33MHz         | 0.2UI        | 0.0121      |
| 20MHz        | 0.175UI     | 0.0132     |               |              |             |
| 100MHz       | 0.161UI     | 0.0132     |               |              |             |

| SJ              | RJ(RMS)     |
|-----------------|-------------|
| 5K~500MHz(1UI)  | 0.02~0.03UI |


(b) Display port and USB3.0 jitter specification

Fig 4.26 Simulation jitter specification

The jitter specifications of similar applications, Display port and USB, are shown in Fig 4.26; the jitter applied in the simulation is well above these specifications.
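The shape of the two jitter components can be reproduced with a short numerical sketch. It assumes a 1UI peak-to-peak sinusoidal component and a 0.02UI RMS Gaussian component, values chosen here for illustration; the helper names are hypothetical and the sketch is not the simulation environment of Fig 4.25.

```python
import math
import random


def sinusoidal_jitter(n, pk_pk_ui):
    """n samples of sinusoidal jitter taken uniformly over one period (in UI)."""
    amp = pk_pk_ui / 2.0
    return [amp * math.sin(2.0 * math.pi * k / n) for k in range(n)]


def random_jitter(n, rms_ui, rng):
    """n samples of Gaussian random jitter with the given standard deviation."""
    return [rng.gauss(0.0, rms_ui) for _ in range(n)]


def histogram(values, lo, hi, bins):
    """Count values into equal-width bins; out-of-range values are clamped."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(bins - 1, max(0, int((v - lo) / width)))
        counts[idx] += 1
    return counts


N = 20000
rng = random.Random(0)
sj = sinusoidal_jitter(N, pk_pk_ui=1.0)
rj = random_jitter(N, rms_ui=0.02, rng=rng)
total = [s + r for s, r in zip(sj, rj)]      # sum injected into the channel

sj_hist = histogram(sj, -0.5, 0.5, 10)
print(sj_hist)                               # largest counts in the outer bins
```

The sinusoidal histogram is highest in the outermost bins because a sine wave spends most of its time near its extremes, which matches the two-peaked distribution described for Fig 4.26(a).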

[Simulation waveform. Signals: ch0\_dcdl\_done, ret\_ch0\_clk\_code, ret\_ch0\_dat\_code, edge\_data\_dec, first\_edge, second\_edge, second\_edge\_flag, samp\_point, ch0\_samp\_point\_search\_done, ch0\_opt\_point\_search\_done, first\_tran\_clk, first\_tran\_dat, second\_tran\_clk, second\_tran\_dat. Listed values: ret\_ch0\_clk\_code = 19, ret\_ch0\_dat\_code = 0, first\_tran\_clk = 13, first\_tran\_dat = 6, second\_tran\_clk = 25, second\_tran\_dat = 0.]

Fig 4.27 DCDL controller simulation – sampling points search algorithm

Fig 4.27 shows the overall DCDL controller simulation. The sampling point decision algorithm is performed, and the clk\_code and d\_code sweeps are shown in it.

In Fig 4.28, the optimum sampling point decision algorithm is performed. Finally, the first transition point (13,6) and the second transition point (25,0) are determined, and the optimum sampling point (19,0) and the optimum edge point (25,0) are transmitted to the EQ controller. In this process, 1 code equals 4.8ps. The process takes about 30us, as expected. Fig 4.29 shows a recalibration operation. The initial calibration yields the optimum sampling point (13,17) and the optimum edge point (19,11), so in recalibration clk\_code is fixed to 13 and a d\_code sweep is performed. In this example, the optimum sampling point shifts right by one code as a result of recalibration, and the optimum edge point shifts by the same number of codes.



Fig 4.28 DCDL controller simulation – optimum sampling point search algorithm

[Simulation waveform. Signals: rstb, calib\_all\_done, recalib, recalib\_done, data\_in, edge\_on, ch0\_clk\_code, ch0\_d\_code, up, dn, final\_opt\_clk, final\_opt\_dat, final\_opt\_edge\_clk, final\_opt\_edge\_dat. Annotated points: optimum sampling point (13,17) before and (13,18) after recalibration; optimum edge point (19,11) before and (19,12) after recalibration.]

Fig 4.29 Recalibration algorithm



Fig 4.30 Recalibration algorithm and protocol

Fig 4.29 and Fig 4.30 show the recalibration operation and the protocol control signals. The protocol detector generates the 'recalib' and 'recalib\_finish' signals at the proper timing of the protocol.
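The recalibration update in the example of Fig 4.29 can be written as a small sketch. It assumes that points are (clk\_code, d\_code) pairs and that the d\_code shift has already been found by the sweep; the function name `recalibrate` is hypothetical, and the sweep itself is not modeled.

```python
def recalibrate(sampling, edge, d_shift):
    """Apply the d_code shift found by the recalibration sweep to both points.

    Points are (clk_code, d_code) pairs. clk_code stays fixed during
    recalibration, and the edge point moves by the same number of codes
    as the sampling point.
    """
    (s_clk, s_d), (e_clk, e_d) = sampling, edge
    return (s_clk, s_d + d_shift), (e_clk, e_d + d_shift)


# Values from the example: initial calibration result, then a one-code shift.
new_sampling, new_edge = recalibrate(sampling=(13, 17), edge=(19, 11), d_shift=1)
print(new_sampling, new_edge)
```

Because only d\_code is swept while clk\_code is held, the update is a single addition per point, which is consistent with recalibration being much shorter than the initial two-dimensional search.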

[Simulation waveform. Signals: edge\_on, up, dn, ch0\_clk\_code, ch0\_d\_code. ch0\_clk\_code steps down one code at a time from 63 to 53 (annotation: 2.4ps). Panels: 'Sampling point' (values below) and 'Sampling point shift'.]

Sampling point

| 0  | 2  | 4  | 6  | 8  | 10 | 12 | 14 | 16 | 18 | 20 | 22 | 24 | 26 | 28 | 30 | 32 | 34 | 36 | 38 | 40 | 42 | 44 | 46 | 48 | 50 | 52 | 54 | 56 | 58 | 60 | 62 |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 62 | 60 | 58 | 56 | 54 | 52 | 50 | 48 | 46 | 44 | 40 | 38 | 36 | 34 | 32 | 30 | 28 | 26 | 22 | 18 | 16 | 14 | 10 | 10 | 8  | 6  | 4  | 2  | 0  | 0  | 0  | 40 |

Fig 4.31 Resolution control and interpolation technique

Fig 4.31 shows how the EQ controller controls the DCDL code through the DCDL controller. The DCDL controller changes the resolution from 4.8ps to 2.4ps. Sampling points that were not found by the sampling point decision algorithm are interpolated from adjacent points.
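The two operations described above, doubling the code when the resolution is halved and filling in missing sampling points from their neighbors, can be sketched as follows. Linear interpolation is an assumption made for this illustration; the text does not state which interpolation rule is used, and the names `to_fine_code` and `interpolate_missing` as well as the sample data are hypothetical.

```python
def to_fine_code(coarse_code):
    """Map a code at 4.8ps resolution to the same delay at 2.4ps resolution."""
    return 2 * coarse_code


def interpolate_missing(points):
    """Fill None entries by linear interpolation between the nearest found
    neighbors. Entries before the first or after the last found point copy
    the nearest found value."""
    known = [i for i, p in enumerate(points) if p is not None]
    if not known:
        raise ValueError("no sampling point was found")
    filled = list(points)
    for i, p in enumerate(points):
        if p is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = points[right]
        elif right is None:
            filled[i] = points[left]
        else:
            frac = (i - left) / (right - left)
            filled[i] = round(points[left] + frac * (points[right] - points[left]))
    return filled


# Hypothetical search result: None marks a point the search did not find.
found = [62, 60, None, 56, None, None, 50]
print(interpolate_missing(found))
print(to_fine_code(19))
```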

Fig 4.32 and Fig 4.33 show the overall receiver top simulation. Models of the channel, the CTLE characteristic, the DFE, the PRBS chaser, the DCDL, the EQ controller, and the DCDL controller are used for the top simulation. As mentioned, Nyquist patterns are received first, and the DCDL controller finds the optimum sampling and edge points. The data then changes from the Nyquist pattern to the PRBS7 pattern. While PRBS7 patterns are received, the EQ controller finds, through the DCDL controller, the optimum point suited to the PRBS7 pattern. Then 8b/10b K character data is received; the receiver protocol detector detects the K character and can calculate the BER from the first pure 8b/10b encoded data.



Fig 4.32 Overall simulation – locking process



Fig 4.33 Overall simulation – locking process
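The locking sequence described above can be summarized as a small state machine. This is a sketch of the ordering only: the phase names and the event labels (`dcdl_done`, `eq_done`, `k_detected`) are hypothetical and are not signal names from the design.

```python
from enum import Enum, auto


class Phase(Enum):
    NYQUIST_CALIBRATION = auto()   # DCDL controller searches sampling and edge points
    PRBS7_EQUALIZATION = auto()    # EQ controller refines the point via the DCDL controller
    WAIT_K_CHARACTER = auto()      # 8b/10b data expected, K character not yet detected
    BER_MEASUREMENT = auto()       # BER counted on 8b/10b encoded data


# (current phase, event) -> next phase
TRANSITIONS = {
    (Phase.NYQUIST_CALIBRATION, "dcdl_done"): Phase.PRBS7_EQUALIZATION,
    (Phase.PRBS7_EQUALIZATION, "eq_done"): Phase.WAIT_K_CHARACTER,
    (Phase.WAIT_K_CHARACTER, "k_detected"): Phase.BER_MEASUREMENT,
}


def run(events, phase=Phase.NYQUIST_CALIBRATION):
    """Return the list of phases visited; events that do not apply are ignored."""
    trace = [phase]
    for event in events:
        phase = TRANSITIONS.get((phase, event), phase)
        trace.append(phase)
    return trace


trace = run(["dcdl_done", "eq_done", "k_detected"])
print([p.name for p in trace])
```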

# 4.8 Power expectation and chip layout

Fig 4.34 and Fig 4.35 present the overall chip floor plan and the chip layout. The overall chip size is 2500um x 2500um.



Fig 4.35 Chip layout

Table 4.2 presents the overall power estimation. Although this thesis describes only the receiver, both the transmitter and receiver power estimates are shown in Table 4.2. The proposed architecture has an advantage in power consumption because of its simple architecture. Better power efficiency may be obtained if the proposed architecture is implemented in a more advanced process.

| Component    | Power (per instance) | Instances /Data CH. | Instances /Clock CH. | Total instances |
|--------------|----------------------|---------------------|----------------------|-----------------|
| PRBS/ENC     | 4.11mW(1.0V)         | 1                   | 0                    | 4               |
| Serializer   | 8.162mW(1.1V)        | 1                   | 0                    | 4               |
| Driver       | 22.08mW(1.2V)        | 1                   | 1                    | 5               |
| Coarse DCDL  | 4.4mW(1.2V)          | 1                   | 0                    | 4               |
| PLL(TDC,DCO) | 16.17mW              | 0                   | 1                    | 1               |
| Overall      | 210.26mW             | Clock_dist : 17.77mW |                     |                 |

#### **Transmitter power**

4 x data\_ch + 1 x clock\_ch = 80Gbps; TX power = 210.26mW => 2.64mW/Gbps => 2.42mW/Gbps (active)

(a) Transmitter power estimation

| Component                       | Power (per instance) | Instances /Data CH. | Instances /Clock CH. | Total instances |
|---------------------------------|----------------------|---------------------|----------------------|-----------------|
| CTLE                            | 2.8mW(1.2V)          | 1                   | 0                    | 4               |
| DFE                             | 11.3mW(1.2V)         | 1                   | 0                    | 4               |
| LA                              | 16mW(1.0V)           | 0                   | 1                    | 1               |
| DCDL                            | 5.0mW(1.0V)          | 4                   | 4                    | 20              |
| De-serializer                   | 5mW(1.0V)            | 1                   | 0                    | 4               |
| DCDL CTRL<br>EQ CTRL<br>DEC/BER | 26.86mW(1.0V)        | 1                   | 0                    | 4               |
| Overall                         | 323.84mW             | Clock_dist : 24mW   |                      |                 |

#### **Receiver power**

4 x data\_ch + 1 x clock\_ch = 80Gbps; RX power = 323.84mW => 4.05mW/Gbps => 2.7mW/Gbps (active)

(b) Receiver power estimation

Table 4.2 Power estimation
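The totals in Table 4.2 can be cross-checked by summing the per-instance power times the instance count and adding the clock distribution power. The sketch below copies the component values from the table; reading the transmitter clock distribution entry as 17.77mW is an assumption, and the dictionary and function names are illustrative. The "active" figures are not recomputed because the text does not define which blocks they exclude.

```python
DATA_RATE_GBPS = 4 * 20   # four data channels at 20Gbps each

# component: (power per instance in mW, total instance count), from Table 4.2
TX_BLOCKS = {
    "PRBS/ENC": (4.11, 4),
    "Serializer": (8.162, 4),
    "Driver": (22.08, 5),
    "Coarse DCDL": (4.4, 4),
    "PLL(TDC,DCO)": (16.17, 1),
}
TX_CLOCK_DIST_MW = 17.77  # assumption: value read from the table's overall row

RX_BLOCKS = {
    "CTLE": (2.8, 4),
    "DFE": (11.3, 4),
    "LA": (16.0, 1),
    "DCDL": (5.0, 20),
    "De-serializer": (5.0, 4),
    "DCDL CTRL / EQ CTRL / DEC/BER": (26.86, 4),
}
RX_CLOCK_DIST_MW = 24.0


def total_mw(blocks, clock_dist_mw):
    """Sum per-instance power times instance count, plus clock distribution."""
    return sum(power * count for power, count in blocks.values()) + clock_dist_mw


tx_total = total_mw(TX_BLOCKS, TX_CLOCK_DIST_MW)
rx_total = total_mw(RX_BLOCKS, RX_CLOCK_DIST_MW)

print(f"TX: {tx_total:.2f} mW, {tx_total / DATA_RATE_GBPS:.2f} mW/Gbps")
print(f"RX: {rx_total:.2f} mW, {rx_total / DATA_RATE_GBPS:.2f} mW/Gbps")
print(f"TX+RX: {(tx_total + rx_total) / DATA_RATE_GBPS:.1f} mW/Gbps")
```

The receiver sum reproduces the 323.84mW and 4.05mW/Gbps given in the table. The transmitter components sum to about 211.03mW under the stated assumption, slightly above the 210.26mW listed, but still round to the quoted 2.64mW/Gbps; together the two links give about 6.7mW/Gbps.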

## 4.9 Conclusion

In this chapter, a 20Gbps electrical serial link for memory interface is studied. As mentioned, current memory interfaces operate below 10Gbps, but the required memory bandwidth keeps increasing and is expected to reach 20Gbps. The electrical link is widely used for memory interface, but it becomes less suitable at such speeds because the copper medium has sharp attenuation. Silicon photonics is an alternative, but it has some limitations and is difficult to realize at present. This research therefore proposes a novel architecture that is simple and suited to current conditions. The proposed architecture is a forwarded-clocking serial link that transmits two-phase half-rate clock signals. The receiver has no complex components such as a PLL or DLL and uses only several DCDLs. To compensate for channel loss and attenuation, it uses a CTLE and a DFE; the 4-tap DFE operates with only the two-phase half-rate clock signals with the aid of the PRBS chaser. The DCDL controller controls the DCDLs and finds the optimum sampling point and the optimum edge point. The DCDL resolution is flexible, and a compatible protocol is also proposed. To solve the problem of long initial calibration time, a short recalibration is performed after the initial calibration. The proposed architecture is expected to consume about 6.7mW(5.1mW)/Gbps. Owing to the simplicity of the architecture, the power consumption can likely be lowered further. The chip layout is implemented in a 65nm CMOS process and the area is 2500um x 2500um.

|              | ISSCC06 | JSSC08 | JSSC09 | JSSC11 | This work |
|--------------|---------|--------|--------|--------|-----------|
| Design       | Transceiver | Transceiver | Receiver | Transceiver | Transceiver |
| Architecture | Forwarded clock | Forwarded clock | Embedded clock(ref) | Forwarded clock | Forwarded clock |
| Process      | 90nm | 65nm | 0.13um | 40nm LP | 65nm |
| Supply       | 1.2V | 1.2V | 1.45V | 1.0V | 1.0V |
| Scalable     | 1X20Gb/s | 1X(5~15)Gb/s | 8X5Gb/s | 2X5Gb/s | 4X(7~20)Gb/s |
| Power        | 11.8mW/Gbps | 5mW/Gbps | 9.2mW/Gbps | 5mW/Gbps | 6.7(5.1)mW/Gbps |
| Clocking     | 1.Quad rate 2 phase<br>2.CTLE<br>3.ODT clk. dist.<br>(no buffer) | 1.ODT clk. dist.<br>2.Half rate clock<br>3.Multiphase<br>4.1 global DLL<br>5.Driver - 160mV | 1.PLL -> global clock<br>2.DLL-> clk-data skew<br>3.Buffering<br>(RX inner clk gen) | 1.RX clock delay<br>2.CM clocking(2-ch opt.)<br>3.Half rate<br>4. TX driver~RX | 1.RX clock delay<br>2.LA, buffering<br>3.Half rate 2 phase<br>4.Fanout 2<br>5.Driver - 200~600mV |
| Core         | 1.Local DLL+PI<br>2.1-channel | 1.Multiphase<br>2.PI+sampler(half rate)<br>3.CTLE<br>4.Driver 120mV | 1.Sampler+DLL(MP)<br>2.No EQ.<br>3.5-channel | 1.TX data delay<br>2.CTLE<br>3.Driver 150~250mV<br>4. 2-channel<br>5. AUX channel req. | 1.RX data delay<br>2.CTLE+DFE<br>3. 4-channel |
| Affiliation  | Intel | Intel | Oregon | Rambus | SNU |

Table 4.3 Figure of merit

# **CHAPTER 5**

# CONCLUSION

In this thesis, we implemented serial links for memory interface in various ways. First, a current commercial GDDR3 PHY is implemented in 0.13um CMOS technology and is composed of a DLL and VDLs. The DLL generates a constant phase between the clock signal and the data signals, and the VDL compensates the skew between the clock signal and the data signals. However, its operating speed is about 1.2Gbps, which is much lower than the currently required bandwidth. A copper-medium electrical link has sharp attenuation and loss above 10GHz, so increasing the operating speed becomes more difficult. This research therefore examines an innovative alternative, silicon photonics, which uses optical fiber as the medium. Silicon photonics uses optical fiber, optical devices (photo diode, modulator), and an electrical back end (CDR, PLL CMOS circuits). Because the compound semiconductor process is becoming cheaper and the CMOS process is improving in performance, silicon photonics is expected to substitute for the electrical serial link in the near future. In this thesis, we implemented electrical back-end CMOS circuits that are compatible with optical front-end devices. The transmitter is a modulator driver, which drives a high load (0.8pF) at high speed (above 12.5Gbps) with high swing (above 3V differential). The receiver is composed of a TIA, an LA, and a driver. The transmitter and receiver operate above 12.5Gbps and consume about 6mW/Gbps (transmitter) and 4mW/Gbps (receiver). They are implemented in various CMOS processes (65nm, 0.13um). However, the final purpose of silicon photonics is to integrate the optical devices and the CMOS circuits in a standard CMOS process; with present technology, full integration is difficult to realize and the circuits consume much power.

Therefore, we implemented a 20Gbps electrical serial link with a very simple and intuitive architecture. An open-loop delay-matched stream line receiver calibrates the clock and data skew in a simple way. It is a forwarded-clocking architecture composed of one clock channel and four 20Gbps data channels. Several DCDLs, a DFE, a CTLE, and a PRBS chaser are used to implement accurate data processing with only two clock phases. The chip occupies 2500um x 2500um and consumes 6.7mW(5.1mW)/Gbps, which can be reduced further in a more advanced technology.

# **BIBLIOGRAPHY**

[1.1] Strategy analytics, IT hardware industry, "Tablet wars : the rise of whitebox tablets", 29/05/2013,

[1.2] iPad 4 (Late 2012) Review, anandtech, Anand Lal Shimpi, December 6, 2012
[1.3] International Solid-State Circuits Conference 2013: Memory trends, Kevin
Zhang, Intel, OR, 02/15/2013

[1.4] B.J.Yoo, "A study on multichannel receivers with enhanced lane

expandability and loop linearity", ph.D thesis, seoul national university.

[2.1] Actel, "Simultaneous Switching Noise and Signal integrity Application Note : AC263"

[2.2] Yongsam Moon, "An All-Analog Multiphase Delay-Locked Loop Using a Replica Delay Line for Wide-Range Operation and Low-Jitter Performance" IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 35, NO. 3, MARCH 2000.
[2.3] Kyoeng-Ho Lee, "Delay locked loop for generating multi-phase clock ",

Korea Patent, 10-2000-0028609

[2.4] M.-S. Hwang, "Reduction of pump current mismatch in charge-pump PLL",ELECTRONICS LETTERS 29th, January 2009 Vol. 45, No. 3.

[2.5] C. -K. K. Yang, "Delay-Locked Loops-an overview" in Phase-Locking in High-Performance Systems From Devices to Architectures, B. Razavi, Ed. NewYork: Wiley/IEEE Press, pp.13-22.

[2.6] H.K.Chi, "A 500MHz-to-1.2GHz Reset Free Delay Locked Loop for Memory Controller with Hysteresis Coarse Lock Detector", International Technical Conference on Circuits/Systems, Computers and Communications, 2010.

[2.7] M.S.Hwang, "1.2 Gbps GDDR3 Physical Layer for 3D AMOLED Panel", The society for information display, 2011

[2.8] H.K.Chi, "A 500MHz-to-1.2GHz Reset Free Delay Locked Loop for Memory Controller with Hysteresis Coarse Lock Detector", Journal of Semiconductor Technology and Science, 2011, April.

[3.1] Silicon Photonics – Opportunities and Trials, Air technology.

[3.2] J.T.Kim, "A large-swing transformer-boosted serial link transmitter with > VDD swing", Journal of Solid-State Circuits, 2007, April.

[3.3] J.Kim, "A 40-Gb/s Optical Transceiver Front-End in 45 nm SOI CMOS", Journal of Solid-State Circuits, 2012

[3.4] B.Gilbert, "Current Mode, Voltage Mode, or Free Mode? A Few Sage Suggestions", Analog Integrated Circuits and Signal Processing, 2004, March.

[3.5] C.Menolfi, "A 16Gb/s Source-Series Terminated Transmitter in 65nm CMOS SOI", International Solid-State Circuits Conference 2007

[3.6] M.Kossel, "A T-Coil-Enhanced 8.5 Gb/s High-Swing SST Transmitter in 65 nm Bulk CMOS With < -16 dB Return Loss Over 10 GHz Bandwidth", Journal of Solid-State Circuits 2008.

[3.7] C.Menolfi, "A 14Gb/s High-Swing Thin-Oxide Device SST TX in 45nm CMOS SOI", International Solid-State Circuits Conference 2011.

[3.8] S.Shekhar, "Bandwidth Extension Techniques for CMOS Amplifiers", Journal of Solid-State Circuits 2006, November.

[3.9] Korean Patent – 1020100027745 – Gain boosted cascade common gate transimpedance amplifier circuit

[3.10] K.S.Park, "A design of 10-Gbps CMOS optical receiver front-end", M.S. thesis, Seoul National University.

[3.11] K.S.Park, "A 10-Gb/s Optical Receiver Front-End with 5-mW Transimpedance Amplifier", Asian Solid-State Circuits Conference 2010

[3.12] D.W.Kim, "A 12.5Gbps optical receiver front-end and a optical modulator driver in 0.13um CMOS", M.S. thesis, Seoul National University

[3.13] D.W.Kim, "12.5-Gb/s Analog Front-End of an Optical Transceiver in 0.13um CMOS", International Symposium on Circuits and Systems 2013

[3.14] C.Kromer, "A Low-Power 20-GHz 52-dB Transimpedance Amplifier in 80-nm CMOS", Journal of Solid-State Circuits 2004, June.

[3.15] G.O.Kim, "Low-voltage high-performance silicon photonic devices and photonic integrated circuits operating up to 30 Gb/s", Optics Express, 2011, December.

[3.16] M.S.Hwang, "A study on the design of wide range CDR without an external reference clock", Ph.D. thesis, Seoul National University.

[4.1] Ganesh Balamurugan, "A Scalable 5–15 Gbps, 14–75 mW Low-Power I/O Transceiver in 65 nm CMOS", Journal of Solid-State Circuits 2008, April

[4.2] B.Casper, "A 20Gb/s Forwarded Clock Transceiver in 90nm CMOS", International Solid-State Circuits Conference, 2006

[4.3] S.H.Chung, "An 8Gb/s Forwarded-Clock I/O Receiver with up to 1GHz Constant Jitter Tracking Bandwidth Using a Weak Injection-Locked Oscillator in 0.13um CMOS", Symposium on VLSI Circuits, 2011

[4.4] F. O'Mahony, "A 27Gb/s Forwarded-Clock I/O Receiver Using an Injection-Locked LC-DCO in 45nm CMOS", International Solid-State Circuits Conference, 2008

[4.5] T.Toifl, "A 0.94-ps-RMS-Jitter 0.016-mm² 2.5-GHz Multiphase Generator PLL with 360° Digitally Programmable Phase Shift for 10-Gb/s Serial Link", Journal of Solid-State Circuits, 2005, December.

[4.6] F. O'Mahony, "A 47 × 10 Gb/s 1.4 mW/Gb/s Parallel Interface in 45 nm CMOS", Journal of Solid-State Circuits, 2010, December.

[4.7] A.Ragab, "Receiver Jitter Tracking Characteristics in High-Speed Source Synchronous Links", Journal of Electrical and Computer Engineering, 2011, August.

[4.8] J.Zerbe, "A 5 Gb/s Link With Matched Source Synchronous and Common-Mode Clocking Techniques", Journal of Solid-State Circuits, 2011, April.

[4.9] B.J.Yoo, "A Highly Expandable Forwarded-Clock Receiver with Ultra-Slim Data Lane using Skew Calibration by Multi-Phase Edge Monitoring", Journal of Semiconductor Technology and Science, 2012, December.

## ABSTRACT (KOREAN)

This thesis proposes architectures and circuits of serial link transceivers for existing and next-generation memory interfaces.

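First, a PHY design for existing GDDR3 memory is presented; it consists of a read path, a write path, and a command path. The write path and the command path compensate skew by using a VDL (variable delay line), while the read path compensates skew by using a VDL and, at the same time, uses a DLL (delay locked loop) to maintain the phase regardless of the operating speed. The PHY is composed of four data channels and one command/address channel, and each data channel consists of eight data signals (DQ) and one clock signal. The data channels operate at 1.2Gbps, and the command/address channel operates at 600Mbps. In particular, this thesis discusses a DLL design that can operate at high speed under SSN (simultaneous switching noise).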

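Secondly, a serial link for silicon photonics is proposed. Silicon photonics is a technology in the spotlight as a next-generation memory interface. The circuit design of a driver for the modulator and of a TIA (trans-impedance amplifier) and an LA (limiting amplifier) for the photo diode is discussed. These circuits operate above 12.5Gbps, but because they are connected to the optical device chip through bonding, a large parasitic capacitance is present, which leads to a high power consumption of 6mW/Gbps (transmitter) and 4mW/Gbps (receiver). An overall receiver that includes a CDR (clock and data recovery) was also designed. These works were fabricated as many chips in 65nm and 0.13um CMOS processes.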

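Finally, the design of an electrical serial link for a 20Gbps memory link is discussed. It has a forwarded clocking architecture that is very simple and intuitive and requires no separate synchronizer. This open loop delay matched stream line receiver finds the optimum point by using a DCDL (digitally controlled delay line) and has structurally low power consumption. While delivering only two half-rate clock phases, it performs data sampling and edge sampling at half rate by using a real-time PRBS chaser. It is implemented in a 65nm CMOS process and occupies 2500um x 2500um (transceiver). The transmitter consumes 2.6mW(2.4mW)/Gbps and the receiver consumes 4.1mW(2.7mW)/Gbps, and these figures are expected to improve further in more advanced processes.

Keywords: delay locked loop, phase locked loop, silicon photonics, clock and data recovery, limiting amplifier, trans-impedance amplifier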

Student Number: 2008-30245