



**Ph.D. Dissertation**

# Design of Receiver with Offset Cancellation of Adaptive Equalizer and Multi-Level Baud-Rate Phase Detector

Design of a Receiver Using an Adaptive Equalizer with Offset Cancellation and a Baud-Rate Phase Detector

by

**Kwangho Lee** 

August, 2021

Department of Electrical and Computer Engineering, College of Engineering, Seoul National University

# Design of Receiver with Offset Cancellation of Adaptive Equalizer and Multi-Level Baud-Rate Phase Detector

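Advisor: Deog-Kyoon Jeong

Submitted as a dissertation for the degree of Doctor of Philosophy in Engineering, August 2021

Graduate School, Seoul National University

Department of Electrical and Computer Engineering

Kwangho Lee

The Ph.D. dissertation of Kwangho Lee is approved.

August 2021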



# Design of Receiver with Offset Cancellation of Adaptive Equalizer and Multi-Level Baud-Rate Phase Detector

by

Kwangho Lee

A Dissertation Submitted to the Department of Electrical and Computer Engineering in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

at

SEOUL NATIONAL UNIVERSITY

August, 2021

Committee in Charge:

Professor Dongsuk Jeon, Chairman

Professor Deog-Kyoon Jeong, Vice-Chairman

Professor Woo-Seok Choi

Professor Yongsam Moon

Professor Jaeduk Han

### Abstract

In this thesis, designs of high-speed, low-power wireline receivers (RX) are presented. Specifically, circuit techniques for DC offset cancellation, a merged-summer DFE, a stochastic Baud-rate CDR, and a phase detector (PD) for multi-level signals are proposed.

First, an RX with adaptive offset cancellation (AOC) and a merged-summer decision-feedback equalizer (DFE) is proposed. The proposed AOC engine removes the random DC offset of the data path by examining the sampled data and edge outputs of the random data stream. In addition, the proposed RX incorporates a shared-summer DFE in a half-rate structure to reduce the power dissipation and hardware complexity of the adaptive equalizer. A prototype chip fabricated in 40 nm CMOS technology occupies an active area of 0.083 mm<sup>2</sup>. Thanks to the AOC engine, the proposed RX achieves a BER below 10<sup>-12</sup> over a wide range of data rates, from 1.62 to 10 Gb/s. The proposed RX consumes 18.6 mW at 10 Gb/s over a channel with 27 dB loss at 5 GHz, achieving a figure-of-merit of 0.068 pJ/b/dB.

Second, a 40 nm CMOS RX with Baud-rate phase detectors (BRPDs) is proposed. The RX includes two PDs: a BRPD employing a stochastic technique and a BRPD suitable for multi-level signals. Because a Baud-rate CDR does not require an edge-sampling clock, the proposed CDR reduces power consumption by lowering hardware complexity. The proposed stochastic phase detector (SPD) tracks an optimal phase-locking point that maximizes the vertical eye opening. Furthermore, despite residual inter-symbol interference, the proposed BRPD for multi-level signals secures the vertical eye margin, which is especially vulnerable in multi-level signaling. In addition, unlike the conventional Mueller-Muller PD, the proposed BRPD has a unique lock point when operating with an adaptive DFE. A prototype chip fabricated in 40 nm CMOS technology occupies an active area of 0.24 mm<sup>2</sup>. The proposed PAM-4 RX achieves a bit-error rate (BER) below 10<sup>-11</sup> at 48 Gb/s with a power efficiency of 2.42 pJ/b.

**Keywords**: receiver, phase detector, decision-feedback equalizer, DC offset, Baud-rate CDR, stochastic phase detector, forwarded-clock receiver, SS-MM CDR, PAM-4.

**Student Number : 2016-20943** 

## Contents

- Abstract
- Contents
- List of Figures
- List of Tables
- Chapter 1 Introduction
  - 1.1 Motivation
  - 1.2 Thesis Organization
- Chapter 2 Backgrounds
  - 2.1 Basic Architecture in Serial Link
    - 2.1.1 Serial Communication
    - 2.1.2 Clock and Data Recovery
    - 2.1.3 Multi-Level Pulse-Amplitude Modulation
  - 2.2 Equalizer
    - 2.2.1 Equalizer Overview
    - 2.2.2 Decision-Feedback Equalizer
    - 2.2.3 Adaptive Equalizer
  - 2.3 Clock Recovery
    - 2.3.1 2x Oversampling PD – Alexander PD
    - 2.3.2 Baud-Rate PD – Mueller Muller PD
- Chapter 3 An Adaptive Offset Cancellation Scheme and Shared-Summer Adaptive DFE
  - 3.1 Overview
  - 3.2 An Adaptive Offset Cancellation Scheme and Shared-Summer Adaptive DFE for Low-Power Receiver
  - 3.3 Shared-Summer DFE
  - 3.4 Receiver Implementation
  - 3.5 Measurement Results
- Chapter 4 PAM-4 Baud-Rate Digital CDR
  - 4.1 Overview
  - 4.2 Overall Architecture
    - 4.2.1 Proposed Baud-Rate CDR Architecture
    - 4.2.2 Proposed Analog Front-End Structure
  - 4.3 Stochastic Phase Detection PAM-4 CDR
    - 4.3.1 Proposed Stochastic Phase Detection
    - 4.3.2 Comparison of the Stochastic PD with SS-MMPD
  - 4.4 Phase Detection for Multi-Level Signaling
    - 4.4.1 Proposed Baud-Rate Phase Detector for Multi-Level Signal
    - 4.4.2 Data Level and DFE Coefficient Adaptation
    - 4.4.3 Proposed Phase Detector
  - 4.5 Measurement Result
    - 4.5.1 Measurement of the Proposed Stochastic Baud-Rate Phase Detection
    - 4.5.2 Measurement of the Proposed Baud-Rate Phase Detection for Multi-Level Signal
- Chapter 5 Conclusion
- Bibliography
- Abstract (in Korean)

## **List of Figures**

- Fig. 1.1 Development of the multiple wireline standard
- Fig. 1.2 Trend of the modulation methods for recent publications
- Fig. 2.1 Simplified block diagram of the serial interface
- Fig. 2.2 Simplified block diagram of the CDR
- Fig. 2.3 Transmission channel of impulse, impulse response, single bit, and single-bit response
- Fig. 2.4 Block diagram of the n-tap DFE
- Fig. 2.5 Single-bit response with 3-tap DFE
- Fig. 2.6 Conceptual block diagram of the adaptive equalizer
- Fig. 2.7 Principle of 2x-oversampling PD (a) when the clock is "Early" and (b) "Late"
- Fig. 2.8 Gain of the Alexander PD (a) ideal transfer function, (b) PDF of the high-frequency noise, and (c) effective transfer function
- Fig. 3.6 Conceptual schematic of (a) conventional adaptive half-rate DFE and (b) proposed adaptive half-rate DFE
- Fig. 3.7 Conceptual block diagram of the merged summer DFE
- Fig. 3.8 Block diagram of the proposed RX
- Fig. 3.9 Chip photomicrograph of the implemented RX
- Fig. 3.10 Insertion loss of two channels used for testing
- Fig. 3.11 Power breakdown and power efficiency of the proposed RX
- Fig. 3.12 Measured BER versus offset cancellation code
- Fig. 3.13 Measured JTOL at BER of $10^{-12}$ with and without edge DFE
- Fig. 3.14 Figure-of-merit of the proposed RX
- Fig. 4.1 Overall circuit block diagram of the implemented RX
- Fig. 4.2 Schematic of the implemented sampler
- Fig. 4.3 Schematic of the proposed IDAC
- Fig. 4.4 Simulation result of the sampling threshold versus threshold control word
- Fig. 4.5 Block diagram of the half-rate inverter-based DFE
- Fig. 4.6 Timing diagram of the merged inverter-based DFE
- Fig. 4.7 Detailed operation and schematic of the proposed merged inverter-based DFE
- Fig. 4.8 Block diagram of the proposed SPD logic
- Fig. 4.9 Eye diagram of PAM-4 with Baud-rate samplers
- Fig. 4.10 Histogram of the $B_n$ with "Early" and "Late" phase
- Fig. 4.11 Raw and quantized weights of the $B_n$
- Fig. 4.12 Simulated SPD gain curve
- Fig. 4.13 Cases of detecting phase error for the SS-MMPD
- Fig. 4.14 Weights of the proposed SPD and the SS-MMPD
- Fig. 4.15 Lock point on SBR of the SS-MMPD and the proposed SPD, (a) w/o DFE and (b) w/ 1-tap DFE
- Fig. 4.16 Estimated VEM with DFE of the PAM-2 and PAM-4
- Fig. 4.17 Estimated VEM versus the cursor ratio of the $h_0$ and $h_{-1}$
- Fig. 4.18 Lock point of the proposed BRPD for multi-level signaling on SBR
- Fig. 4.19 Simulated M versus time with smooth-loss channel
- Fig. 4.20 (a) Equalized eye diagram of the PAM-4 and (b) data histogram of data +3
- Fig. 4.21 Comparison of the conventional data-level adaptation and uneven data-level adaptation
- Fig. 4.22 Generating phase error based on the consecutive data $(D_n, D_{n+1}) = (+3, -3)$ and the sign of the $V_{PD}$ error
- Fig. 4.23 Simulated lock point of the proposed BRPD when $N_T$ is 12
- Fig. 4.24 Detailed block diagram of the SDL which uses the BRPD suitable for multi-level signaling
- Fig. 4.25 Chip photomicrograph of the proposed RX
- Fig. 4.26 Detailed power breakdown of the proposed RX
- Fig. 4.27 Block diagram of the measurement setup
- Fig. 4.28 Measured insertion loss of the channel for PAM-2 and PAM-4 signaling
- Fig. 4.29 Measured eye diagram of the prototype RX at BER of $10^{-11}$
- Fig. 4.30 Behavior modeling simulation result of the proposed SPD CDR
- Fig. 4.31 Measured bathtub of the proposed SPD
- Fig. 4.32 Measured JTOL at BER of $10^{-11}$ with proposed SPD CDR
- Fig. 4.33 Measured data levels of the (a) PAM-4 and (b) PAM-2
- Fig. 4.34 Measured and calculated M value versus sampling clock phase and $N_t$ of the (a) PAM-4 and (b) PAM-2
- Fig. 4.35 Measured bathtub of (a) PAM-4 at BER of $10^{-11}$ and (b) PAM-2 at BER of $10^{-12}$
- Fig. 4.36 Measured JTOL at BER of $10^{-11}$ with Baud-rate CDR for multi-level signaling

## List of Tables

- Table II.I Comparison table of the PAM-2 and PAM-4 signals
- Table III.I Truth table of offset and phase detection logic
- Table III.II Comparison with previous RX
- Table IV.I Detailed area of the sub-block
- Table IV.II Comparison table for overall performance of the PAM-4 RX
- Table IV.III Comparison table for overall performance of the BR RX

### Chapter 1

### Introduction

### **1.1 Motivation**

The volume of transmitted data has recently been growing explosively. For example, servers upload and download large amounts of data, users watch high-resolution video on monitors and televisions, and chip-to-chip traffic keeps increasing [1]-[5]. As the demand for data bandwidth grows, electrical interconnects play a crucial role in many interface standards. Fig. 1.1 shows the development of several wireline standards [6]-[8]. The data rate has steadily increased in various I/O standards such as peripheral component interconnect express (PCIe), high-definition multimedia interface (HDMI), and DisplayPort (DP). Since commonly used cables have low-pass characteristics, high-speed wireline links must compensate for cable loss to attain the best bit error rate (BER) performance. Many high-speed transceivers have been studied to keep pace with these standards.

Furthermore, the demand for bandwidth has led to the use of multi-level signaling. Fig. 1.2 shows the trend of the modulation methods in recent publications. Four-level pulse-amplitude modulation (PAM-4) is a multi-level modulation format in which four signal levels represent 2 bits of logic information. PAM-4 therefore doubles the data rate while keeping the same Nyquist frequency. Many PAM-4 transceivers have recently been studied.



Fig. 1.1 Development of the multiple wireline standard

On the data path of the receiver (RX), sufficient vertical eye margin is needed to secure the BER. The DC offset induced by random mismatch and inter-symbol interference (ISI) significantly reduce the vertical eye margin. Conventional offset cancellation methods require a calibration period in which the RX input is set to the common-mode voltage, the output is monitored, and the random DC offset is calibrated out [9], [10], [12]. To reduce this calibration time, an offset cancellation method that operates with random data input is presented. Furthermore, a decision-feedback equalizer (DFE), which effectively eliminates post-cursor ISI, is used to obtain vertical eye margin.



Fig. 1.2 Trend of the modulation methods for recent publications

The DFE summers, which are conventionally separated in a time-interleaved structure, are merged to reduce power consumption and hardware overhead.

In addition, multi-level pulse amplitude modulation (PAM) has been adopted in transceivers. An RX that adopts multi-level PAM uses a Baud-rate (BR) clock and data recovery (CDR) circuit to reduce the number of samplers; the Mueller-Muller (MM) CDR is a well-studied example of a BR CDR. However, multi-level PAM is more susceptible to inter-symbol interference, which degrades the BER. Furthermore, in a sign-sign MM (SS-MM) CDR with a DFE, the lock point moves forward to the point where the first post-cursor ISI ($h_1$) and the first pre-cursor ISI ($h_{-1}$) are equal to zero. The SS-MM CDR with a DFE therefore has multiple lock points, which makes it hard to find a unique optimal point. To resolve these problems, two Baud-rate phase detectors (BRPDs) are presented: a BRPD with a stochastic phase detector (SPD) and a BRPD for multi-level signals.

### **1.2 Thesis Organization**

This thesis is organized as follows. Chapter 2 explains the background of the RX and describes the basic operations and critical blocks of a general CDR. In particular, serial links, equalization, and timing recovery techniques are introduced. The equalization topics include the equalizers themselves and adaptive equalization strategies, and the timing recovery topics include 2x-oversampling and BR algorithms.

In Chapter 3, an RX with adaptive offset cancellation (AOC) and merged summer DFE is proposed. The proposed AOC engine removes the random DC offset of the data path by examining the random data stream's sampled data and edge outputs. In addition, a shared-summer DFE in a half-rate structure is incorporated to reduce power dissipation and hardware complexity of the adaptive equalizer.

In Chapter 4, an RX with BRPDs is proposed. The RX includes two BRPDs: one employs a stochastic technique, and the other is suitable for multi-level signals. The SPD tracks an optimal phase-locking point that maximizes the vertical eye opening. The BRPD for multi-level signals secures the vertical eye margin, which is especially vulnerable in multi-level signaling. In addition, unlike the conventional Mueller-Müller PD (MMPD), the proposed BRPD has a unique lock point with an adaptive DFE.

Chapter 5 summarizes the proposed works and concludes this thesis.

### Chapter 2

### Backgrounds

### 2.1 Basic Architecture in Serial Link

#### 2.1.1 Serial Communication

Serial communication transmits and receives data sequentially over a single signal line. Each piece of logical information, 0 or 1, occupies a regular time interval called the unit interval (UI). Data transmission is divided into serial transmission and parallel transmission; in contrast to serial transmission, parallel transmission uses multiple signal lines, whose number is determined by the data width and the protocol. More signal lines transfer more data per unit time, but the wiring cost is high, which makes parallel transmission expensive. A serial interface therefore serializes the data streams to lower the cost of the signal lines.

The simplified block diagram of the serial interface is shown in Fig. 2.1. The parallel data is serialized in the transmitter (TX), transferred through the channel, and deserialized in the RX. Examples of serializer/deserializer (SerDes) applications include Ethernet, optical fiber links, PCIe, and HDMI. As the amount of data required by industry increases, the data rate of serial communication keeps rising. For example, DisplayPort has moved from version 1.0, which runs at 1.62 Gb/s/lane, to version 2.0, which runs at 20 Gb/s/lane. Higher data rates press against circuit bandwidth limits and raise more design issues; for example, the timing budget tightens and the channel bandwidth limitation becomes more severe. The timing circuits affect the performance of the serial link, and the limited bandwidth corrupts the signal integrity. The TX and RX employ various strategies to transmit and receive the serialized data, and a CDR circuit is primarily used on the RX side to recover the timing and data.
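The serialize and deserialize steps of Fig. 2.1 can be illustrated with a minimal behavioral sketch. It assumes 8-bit parallel words sent MSB first, one bit per UI; the word width, the bit order, and the function names `serialize` and `deserialize` are hypothetical choices for illustration, not details of any particular standard or of the circuits in this thesis.

```python
from typing import List


def serialize(words: List[int], width: int = 8) -> List[int]:
    """Flatten parallel words into a serial bit stream, one bit per UI (MSB first)."""
    bits: List[int] = []
    for word in words:
        if not 0 <= word < (1 << width):
            raise ValueError(f"word {word} does not fit in {width} bits")
        for i in reversed(range(width)):
            bits.append((word >> i) & 1)
    return bits


def deserialize(bits: List[int], width: int = 8) -> List[int]:
    """Regroup a serial bit stream back into parallel words of `width` bits."""
    if len(bits) % width:
        raise ValueError("stream length is not a multiple of the word width")
    words: List[int] = []
    for start in range(0, len(bits), width):
        word = 0
        for bit in bits[start:start + width]:
            word = (word << 1) | bit
        words.append(word)
    return words


if __name__ == "__main__":
    tx_words = [0xA5, 0x3C, 0xFF]
    stream = serialize(tx_words)            # what the TX drives onto the channel
    print(len(stream))                      # 24 (3 words x 8 UI)
    print(deserialize(stream) == tx_words)  # True
```

A real SerDes also needs the CDR to tell the deserializer where each UI and each word boundary lies; the sketch assumes both are already known.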



Fig. 2.1 Simplified block diagram of the serial interface

#### 2.1.2 Clock and Data Recovery

A CDR recovers the transmitted data without any additional timing reference. Implemented on the RX side, the CDR enables synchronous operation by recovering the timing and demultiplexing the data. As the operating speed increases, synchronization becomes hard to achieve without retiming blocks. In addition, because of the channel bandwidth limitation and noise, the signal-to-noise ratio (SNR) of the received signal is severely degraded. Therefore, the CDR should both recover a synchronous clock and improve signal integrity.

The clock recovery circuit and a re-timer generate the recovered clock and data, respectively [19]. Fig. 2.2 shows the simplified block diagram of the CDR.



Fig. 2.2 Simplified block diagram of the CDR

Reduced vertical and horizontal margins degrade the signal quality. From the disturbed data stream, the equalizer and the clock recovery circuit recover and retime the transmitted signal. The equalizer reduces the effect of the band-limited channel: because the channel attenuates high-frequency components, this loss must be compensated. A feed-forward equalizer (FFE) is commonly used on the TX side, and the continuous-time linear equalizer (CTLE) and the DFE are widely used on the RX side. The equalizers aim to flatten the overall system response. Section 2.2 introduces equalizers with a focus on the DFE, as well as an adaptive equalizer that adjusts its equalizing coefficients according to the condition of the received signal. The clock recovery circuit recovers a periodic clock for sampling. Section 2.3 presents timing recovery techniques, namely two PDs that adjust the sampling point toward the optimum: the Alexander PD for 2x oversampling and the BR MMPD.

#### 2.1.3 Multi-Level Pulse-Amplitude Modulation

As the demand for high-bandwidth data grows, faster systems are needed. Because of the limited channel bandwidth, multi-level signaling has been widely discussed recently [11], [20]-[23], [27], [31]. For example, consider a system transmitting data at 56 Gb/s. The Nyquist frequency is 28 GHz with non-return-to-zero (NRZ) modulation and 14 GHz with PAM-4 modulation. Because channel loss grows rapidly with frequency at such data rates, PAM-4 is advantageous in terms of the loss at the Nyquist frequency. Multi-level signaling is a way to pack more data bits into the same amount of time. NRZ signaling has difficulty supporting high data rates because of the channel insertion loss. NRZ signaling, also called PAM-2, is a modulation method with two voltage levels representing logic 0 and logic 1, and it encodes one bit of data per unit interval. More data are encoded by using more signaling levels. For example, PAM-4 has four distinct levels to encode two data bits, and PAM-8 encodes three bits of data into eight levels.

|                                            | PAM-2 | PAM-4  |
|--------------------------------------------|-------|--------|
| Bits per symbol                            | 1     | 2      |
| Electrical levels                          | 2     | 4      |
| Eyes per UI                                | 1     | 3      |
| Electrical SNR penalty (relative to PAM-2) | 0 dB  | 9.5 dB |

TABLE II.I Comparison table of the PAM-2 and PAM-4 signals
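The 56 Gb/s example and the 9.5 dB entry of Table II.I follow from simple arithmetic, sketched below. The sketch assumes that the Nyquist frequency is half the symbol rate and that the SNR penalty comes only from the reduced eye height at equal peak swing; the helper names are illustrative.

```python
import math


def nyquist_ghz(data_rate_gbps: float, bits_per_symbol: int) -> float:
    """Nyquist frequency = half of the symbol (baud) rate."""
    symbol_rate_gbaud = data_rate_gbps / bits_per_symbol
    return symbol_rate_gbaud / 2


def pam_snr_penalty_db(levels: int) -> float:
    """Eye-height SNR penalty of PAM-M relative to PAM-2 at equal peak swing.

    PAM-M stacks (M - 1) eyes in the same swing, so each eye is 1/(M - 1) as tall.
    """
    return 20 * math.log10(levels - 1)


print(nyquist_ghz(56, 1))               # NRZ (PAM-2): 28.0
print(nyquist_ghz(56, 2))               # PAM-4: 14.0
print(round(pam_snr_penalty_db(4), 2))  # 9.54
```

The same helper gives about 16.9 dB for PAM-8, which illustrates why higher-order PAM trades Nyquist loss for SNR.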

Table II.I compares PAM-2 and PAM-4 modulation. PAM-4 carries twice as many bits per symbol as PAM-2. PAM-4 encodes two bits, a most significant bit (MSB) and a least significant bit (LSB), where the LSB has half the voltage swing of the MSB. The four electrical levels are assigned to 00, 01, 11, 10 in a Gray-coded way or to 00, 01, 10, 11 in a binary-coded way. However, the doubled data rate comes at the cost of degraded signal integrity. Because PAM-4 has three eyes per unit interval, each eye is one-third as tall as the PAM-2 eye for the same peak swing, which degrades the SNR by about 9.5 dB ($20\log_{10}3$). ISI also affects PAM-4 about three times more than PAM-2. The reduced SNR and increased ISI sensitivity require precise equalization at both the TX and RX to achieve the target BER.
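The Gray-coded and binary-coded level assignments can be compared with a short sketch. The specific mapping of bit pairs to the levels -3, -1, +1, +3 below is an assumption for illustration. The sketch counts how many bits differ between neighboring levels, which shows why Gray coding limits a one-level slicing error to a single bit error.

```python
# Assumed level assignments (MSB, LSB) -> level.
GRAY_PAM4 = {(0, 0): -3, (0, 1): -1, (1, 1): +1, (1, 0): +3}
BINARY_PAM4 = {(0, 0): -3, (0, 1): -1, (1, 0): +1, (1, 1): +3}


def encode(bits, table):
    """Map consecutive (MSB, LSB) pairs to PAM-4 levels."""
    if len(bits) % 2:
        raise ValueError("PAM-4 needs an even number of bits")
    return [table[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


def decode(levels, table):
    """Inverse mapping: PAM-4 levels back to a flat bit list."""
    inverse = {v: k for k, v in table.items()}
    out = []
    for level in levels:
        out.extend(inverse[level])
    return out


def adjacent_bit_flips(table):
    """Bit differences between each pair of neighboring levels (-3/-1, -1/+1, +1/+3)."""
    inverse = {v: k for k, v in table.items()}
    ordered = [inverse[level] for level in (-3, -1, +1, +3)]
    return [sum(a != b for a, b in zip(p, q)) for p, q in zip(ordered, ordered[1:])]


print(encode([1, 0, 0, 1], GRAY_PAM4))  # [3, -1]
print(adjacent_bit_flips(GRAY_PAM4))    # [1, 1, 1]
print(adjacent_bit_flips(BINARY_PAM4))  # [1, 2, 1]
```

With binary coding, a slicing error across the middle eye flips both bits, whereas with Gray coding every neighboring-level error flips only one.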

### 2.2 Equalizer

#### 2.2.1 Equalizer Overview

In serial links, signals are transmitted from the source device to the sink device through the channel. The channels include traces or cables such as microstrip lines, twisted pairs, or coaxial cables. Due to the skin effect and dielectric losses, the channels behave as low-pass filters. As the data rate increases, the channel loss becomes more severe. The band-limited signals suffer from inter-symbol interference that considerably reduces the SNR. To remedy the problem, equalizers with high-pass characteristics are used to compensate for the band-limiting channel response; in other words, the equalizer makes the overall frequency response flat.

Assuming that the channel is a linear time-invariant (LTI) system, the single-bit response (SBR) is an effective way to visualize and observe signal integrity. The SBR is the time-domain response of the channel to a single transmitted bit. The SBR of the channel, *sbr*(t), is written as

$$sbr(t) = h(t) * \phi(t) \tag{2.1}$$

where h(t) is the channel's impulse response, and  $\phi(t)$  is the transmitted single-bit pulse. Fig. 2.3 shows the impulse, impulse response, single bit, and the SBR at the RX side.

The SBR is expressed as *sbr*(t) for a continuous-time signal and as *sbr*[n] for a discrete-time signal, such that

$$sbr[n] = sbr(T_0 + nT_b) \tag{2.2}$$

where $T_0$ is the sampling time of the main cursor ($h_0$) and $T_b$ is the bit period. The main cursor is the value sbr[0], commonly expressed as $h_0$. When n is not zero, the non-zero sbr[n] is called ISI; for negative and positive n, sbr[n] is called pre-cursor ISI and post-cursor ISI, respectively. The sbr[n] is commonly expressed as $h_n$. ISI is a distortion in which a symbol spreads into and interferes with neighboring symbol periods, and this spreading introduces decision errors on the RX side.

Fig. 2.3 Transmission channel of impulse, impulse response, single bit, and single-bit response
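Equations (2.1) and (2.2) can be illustrated numerically. The sketch below assumes a hypothetical channel made of two cascaded first-order (RC-like) sections, a 1-UI rectangular pulse for $\phi(t)$, 8 samples per UI, and $T_0$ placed at the SBR peak; none of these choices come from the channels measured in this thesis.

```python
import math

OSR = 8  # samples per UI (assumed oversampling factor)


def convolve(x, h):
    """Direct-form linear convolution y = x * h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def rc_impulse(tau_ui=1.0, length_ui=10):
    """Hypothetical first-order section: h(t) = exp(-t/tau)/tau, scaled by dt per sample."""
    dt = 1.0 / OSR
    return [math.exp(-k * dt / tau_ui) / tau_ui * dt for k in range(length_ui * OSR)]


def single_bit_response(h):
    """Eq. (2.1): sbr = h * phi, with phi a 1-UI rectangular pulse."""
    return convolve([1.0] * OSR, h)


def cursors(sbr, pre=1, post=4):
    """Eq. (2.2): sample sbr at T0 + n*Tb, with T0 taken at the SBR peak."""
    t0 = max(range(len(sbr)), key=lambda i: sbr[i])
    return {n: sbr[t0 + n * OSR]
            for n in range(-pre, post + 1)
            if 0 <= t0 + n * OSR < len(sbr)}


channel = convolve(rc_impulse(), rc_impulse())  # two cascaded RC-like sections
sbr = single_bit_response(channel)
h = cursors(sbr)
for n, value in sorted(h.items()):
    print(f"h[{n:+d}] = {value:.3f}")
```

The printed cursors show one pre-cursor ($h_{-1}$), the main cursor $h_0$, and a decaying tail of post-cursors, which is the shape the equalizers in this section must handle.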

Because the channel is assumed to be an LTI system, the signal on the RX side is expressed as a superposition of shifted SBRs. The received signal rx(t) is written as

$$rx(t) = \sum_{m=-\infty}^{\infty} tx[m] \cdot sbr(t - mT_b) \tag{2.3}$$

where tx[n] is the transmitted signal which is +1 or -1 in PAM-2 and +3, +1, -1, or -3 in PAM-4. Also, the sampled received signal rx[n] is expressed as

$$\begin{aligned} rx[n] &= rx(T_0 + nT_b) = \sum_{k=-\infty}^{\infty} tx[n-k] \cdot sbr[k] \\ &= tx[n] \cdot sbr[0] + \sum_{k \neq 0} tx[n-k] \cdot sbr[k] \\ &= tx[n] \cdot h_0 + \sum_{k \neq 0} tx[n-k] \cdot h_k \end{aligned} \tag{2.4}$$

The sampled signal rx[n] therefore consists of the desired term, $tx[n] \cdot h_0$, and unwanted disturbances called ISI. Equalizers aim to maximize $h_0$ and cancel the ISI to minimize the BER. The next subsection focuses on equalizers in detail, especially the DFE.
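Equation (2.4) can be checked with a small behavioral model. The sketch below uses a hypothetical cursor set $\{h_{-1}, h_0, h_1, h_2\}$ and splits each NRZ sample into the main-cursor term and the ISI term. The cursor values and function names are illustrative assumptions.

```python
import random


def received_samples(tx, h):
    """Eq. (2.4): rx[n] = sum over k of tx[n-k] * h_k (symbols outside tx are ignored)."""
    return [sum(hk * tx[n - k] for k, hk in h.items() if 0 <= n - k < len(tx))
            for n in range(len(tx))]


def split_main_and_isi(tx, h):
    """Return (main, isi) lists such that rx[n] = main[n] + isi[n]."""
    rx = received_samples(tx, h)
    main = [h[0] * t for t in tx]
    isi = [r - m for r, m in zip(rx, main)]
    return main, isi


def worst_case_eye(h):
    """Smallest half-eye opening for NRZ: h_0 minus the sum of all |ISI| cursors."""
    return h[0] - sum(abs(hk) for k, hk in h.items() if k != 0)


# Hypothetical cursors: one pre-cursor, the main cursor, and two post-cursors.
H = {-1: 0.10, 0: 1.00, 1: 0.40, 2: 0.15}

rng = random.Random(0)
tx = [rng.choice((-1, 1)) for _ in range(1000)]
main, isi = split_main_and_isi(tx, H)
print(max(abs(v) for v in isi))  # never exceeds 0.10 + 0.40 + 0.15
print(worst_case_eye(H))         # about 0.35
```

Feeding a single isolated "1" (for example `[0, 0, 1, 0, 0, 0]`) returns the cursors themselves, which is the discrete SBR of Eq. (2.2).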

#### 2.2.2 Decision-Feedback Equalizer

The DFE is an effective equalization technique for recovering signal integrity from distortion on the RX side. The DFE is a non-linear filter with a finite-impulse-response (FIR) feedback path. It consists of decision blocks and a feedback filter, as shown in Fig. 2.4. The decision blocks have non-linear bang-bang characteristics, which make the DFE a non-linear equalizer. A decision block, commonly called a sampler, decides whether the received data is higher or lower than the sampling threshold. The decision is logical 1 or -1, and the decided data is delayed and multiplied by the tap coefficients. In other words, for the n<sup>th</sup> tap, the decision data is delayed by n UI and multiplied by the n<sup>th</sup> tap coefficient $w_n$.

Fig. 2.4 Block diagram of the n-tap DFE

The DFE summer subtracts the resulting analog values from the received input, canceling the post-cursor ISI. For example, Fig. 2.5 shows the SBR of a signal equalized by a 3-tap DFE. Assume that the received signal has $h_{-1}$, $h_0$, and four post-cursor ISIs. Because the DFE can only subtract the ISI contributed by symbols that have already been decided, it cannot cancel the pre-cursor ISI. The 4<sup>th</sup> post-cursor ISI also remains because the 3-tap DFE, being an FIR filter, cancels only the first three post-cursors. Furthermore, since the DFE assumes that the previous decisions are correct, an incorrect decision can cause error propagation.
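The behavior described above, where a 3-tap DFE removes $h_1$ to $h_3$ but leaves $h_{-1}$ and $h_4$, can be reproduced with a minimal model. It assumes a hypothetical NRZ channel with $h_{-1}$, $h_0$, and four post-cursors, ideal tap weights $w_k = h_k$, no noise, and a sign sampler with a zero threshold; the function names are illustrative.

```python
import random


def received_samples(tx, h):
    """Eq. (2.4), repeated here so the block runs on its own."""
    return [sum(hk * tx[n - k] for k, hk in h.items() if 0 <= n - k < len(tx))
            for n in range(len(tx))]


def dfe(rx, taps):
    """Behavioral n-tap DFE: y[n] = rx[n] - sum_k w_k * D[n-k], D[n] = sign(y[n])."""
    summer, decisions = [], []
    for n, x in enumerate(rx):
        feedback = sum(w * decisions[n - k]
                       for k, w in enumerate(taps, start=1) if n - k >= 0)
        y = x - feedback                       # DFE summer output
        summer.append(y)
        decisions.append(1 if y >= 0 else -1)  # sampler with zero threshold
    return summer, decisions


# Hypothetical channel: h_-1, h_0, and four post-cursors, as assumed in the text.
H = {-1: 0.10, 0: 1.00, 1: 0.40, 2: 0.20, 3: 0.10, 4: 0.05}
TAPS = [H[1], H[2], H[3]]  # ideal 3-tap DFE: w_k = h_k

rng = random.Random(1)
tx = [rng.choice((-1, 1)) for _ in range(2000)]
y, d = dfe(received_samples(tx, H), TAPS)
residual = [yi - H[0] * ti for yi, ti in zip(y, tx)]

print(d == tx)                        # True: no decision errors in this noiseless case
print(max(abs(r) for r in residual))  # at most |h_-1| + |h_4| = 0.15
```

If an early decision were wrong (for example, after adding noise), the feedback would subtract the wrong value and could trigger the error propagation mentioned above.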

Fig. 2.5 Single-bit response with 3-tap DFE

One of the most critical issues in DFE design is the timing constraint of the feedback loop [30]. To meet this constraint, the feedback must settle before the next sampling instant arrives. The loop delay is the sum of the clock-to-Q (C-to-Q) delay of the sampler, the setup time of the sampler, and the settling time at the summing node, so the constraint is expressed as

$$t_{C\text{-}to\text{-}Q} + t_{setup} + t_{settle} < 1\ \text{UI} \tag{2.5}$$

where  $t_{C-to-Q}$ ,  $t_{setup}$ , and  $t_{settle}$  are C-to-Q delay, the setup time of the sampler, and the settling time at the summing node, respectively.
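The constraint in Eq. (2.5) amounts to a slack calculation. The helper below is a sketch for a full-rate feedback loop. The delay values in the usage lines are hypothetical and only illustrate a passing case and a failing case.

```python
def dfe_timing_slack_ps(data_rate_gbps: float, t_clk_to_q_ps: float,
                        t_setup_ps: float, t_settle_ps: float) -> float:
    """Slack of Eq. (2.5) for a full-rate DFE loop: 1 UI - (C-to-Q + setup + settle)."""
    ui_ps = 1e3 / data_rate_gbps  # 1 UI in picoseconds
    return ui_ps - (t_clk_to_q_ps + t_setup_ps + t_settle_ps)


# Hypothetical delays: 45 ps (C-to-Q), 20 ps (setup), and 30 ps (settling).
print(dfe_timing_slack_ps(10, 45, 20, 30))    # 5.0   -> meets the 100 ps UI
print(dfe_timing_slack_ps(12.5, 45, 20, 30))  # -15.0 -> violates the 80 ps UI
```

A negative slack means the first-tap feedback arrives too late at the summer, so the loop cannot meet Eq. (2.5) at that data rate.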

#### 2.2.3 Adaptive Equalizer

Equalizers compensate for the ISI caused by frequency-dependent channel loss. If the equalizer does not compensate by the correct amount, the residual ISI decreases the SNR. Because the ISI changes with the sampling timing and its magnitude is difficult to predict in advance, accurate ISI removal is difficult with fixed coefficients. Therefore, an adaptation method is required to achieve accurate equalization.

Fig. 2.6 shows the conceptual block diagram of the adaptive equalizer [31]. The RX input rx[n] and the recovered sample d[n] are expressed as

$$
\begin{aligned}
rx[n] &= \sum_{k=0}^{\infty} sbr[k] \cdot tx[n-k] \\
d[n] &= \sum_{k=0}^{\infty} w_k \cdot rx[n-k]
\end{aligned}
\tag{2.6}
$$

Dlev is the desired data level of the fully equalized signal. The goal of the equalizer is to make the recovered sample equal to the DC-scaled version of the transmitted symbol tx[n], i.e., $Dlev \cdot tx[n]$, by adjusting the equalizer coefficients $w_k$ so as to minimize the error $e[n] = d[n] - Dlev \cdot tx[n]$. When the least-mean-square (LMS) algorithm is employed, the update equation of $w_k$ is written as

$$
\begin{aligned}
w[i+1]_{k} &= w[i]_{k} - \left(\frac{\mu}{2}\right) \left(\frac{\partial e^{2}[n]}{\partial w_{k}}\right) \\
&= w[i]_{k} - \left(\frac{\mu}{2}\right) \left(2 \cdot e[n] \cdot \frac{\partial}{\partial w_{k}} \left(d[n] - Dlev \cdot tx[n]\right)\right) \\
&= w[i]_{k} - \left(\frac{\mu}{2}\right) \left(2 \cdot e[n] \cdot rx[n-k]\right)
\end{aligned}
\tag{2.7}
$$

where k is the tap index, i is the update index, and $\mu$ is the update coefficient. However, the LMS algorithm is difficult to implement because the analog values e[n] and rx[n-k] are hard to obtain. High-speed links therefore use the sign-sign LMS (SS-LMS) algorithm instead. SS-LMS uses only the polarities of the error and the data, which comparators detect easily, so its hardware is compact. The SS-LMS update is written as

$$w[i+1]_k = w[i]_k - \mu \cdot sgn(e[n]) \cdot D[n-k] \tag{2.8}$$

where D[n-k] is the recovered data by hard decision.
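The SS-LMS update of (2.8) can be sketched as a small decision-directed simulation. The function name `ss_lms_dfe`, the SBR `h`, the noise level, and the step size are assumptions for illustration. The sketch also assumes that Dlev is already known and that the feedback is added at the summer ($d[n] = rx[n] + \sum_k w_k \cdot D[n-k]$), so that with the minus sign of (2.8) each tap settles near $-h_k$; with a subtracting summer, the sign of the update flips and the taps settle near $+h_k$.

```python
import random


def sgn(x):
    return 1 if x >= 0 else -1


def ss_lms_dfe(tx, h, n_taps=2, mu=1e-3, dlev=1.0, sigma=0.05, seed=0):
    """Adapt DFE taps with the SS-LMS update of eq. (2.8).

    Assumptions (illustration only): h = [h0, h1, ...] is a hypothetical
    post-cursor-only SBR, Dlev is already known (= h0), noise is Gaussian, and
    the summer ADDS w_k*D[n-k], so each tap converges toward w_k = -h_k.
    """
    rng = random.Random(seed)
    w = [0.0] * n_taps
    D = []                                        # hard decisions in {-1, +1}
    for n in range(len(tx)):
        rx = sum(hk * tx[n - k] for k, hk in enumerate(h) if n - k >= 0)
        rx += rng.gauss(0.0, sigma)
        past = [D[n - k] if n - k >= 0 else 0 for k in range(1, n_taps + 1)]
        d = rx + sum(wk * pk for wk, pk in zip(w, past))   # equalized sample d[n]
        D.append(sgn(d))
        e = d - dlev * D[n]                       # e[n] = d[n] - Dlev*D[n]
        for k in range(n_taps):
            w[k] -= mu * sgn(e) * past[k]         # eq. (2.8), tap k+1 uses D[n-k-1]
    return w


random.seed(3)
tx = [random.choice((-1, 1)) for _ in range(20000)]
print(ss_lms_dfe(tx, h=[1.0, 0.45, 0.25]))   # taps settle near -h1 and -h2
```

Because only the signs of e[n] and D[n-k] enter the update, the taps end up dithering around the optimum by a few multiples of μ; a smaller μ reduces this dither at the cost of slower settling.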



Fig. 2.6 Conceptual block diagram of the adaptive equalizer

The adaptation of the equalizer requires the equalized data level. Ideally, Dlev equals the magnitude of $h_0$. When Dlev is adapted on symbols decided as "+1", the update equation is written as

$$Dlev[i+1] = \begin{cases} Dlev[i] + \mu_{Dlev} \cdot sign(e[n]), & D[n] = +1 \\ Dlev[i], & D[n] = -1 \end{cases} \tag{2.9}$$

where  $\mu_{Dlev}$  is the update coefficient for Dlev.
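A minimal sketch of the Dlev update in (2.9), assuming hypothetical equalized samples with $h_0 = 0.8$, Gaussian noise, and correct decisions; `adapt_dlev` is an illustrative name. Because the update uses only the sign of the error, Dlev converges to the median of the samples decided as "+1", which for symmetric noise equals $h_0$.

```python
import random


def adapt_dlev(samples, decisions, mu=1e-3, dlev0=0.0):
    """Eq. (2.9): Dlev moves by +/-mu only on symbols decided as +1."""
    dlev = dlev0
    for d, D in zip(samples, decisions):
        if D == 1:
            e = d - dlev                          # error against the current level
            dlev += mu * (1 if e >= 0 else -1)
    return dlev


# Hypothetical equalized samples: h0 = 0.8 with Gaussian noise (sigma = 0.05).
rng = random.Random(7)
tx = [rng.choice((-1, 1)) for _ in range(20000)]
samples = [0.8 * t + rng.gauss(0.0, 0.05) for t in tx]
print(adapt_dlev(samples, decisions=tx))   # settles near h0 = 0.8
```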

### **2.3 Clock Recovery**

In serial link communication, timing recovery is essential to extract the transmitted symbols from the received signal. Timing recovery aims to place the sampling point as far as possible from data transitions and other disturbances, i.e., at the phase that yields the lowest BER. The clock used to sample the received data stream is recovered from the data stream itself; in particular, the PD extracts phase information from the data stream.

The PD generates a signal representing the phase difference between the received signal and the recovered clock. PDs can be classified by the number of clock phases used per UI. For example, oversampling PDs use more than one clock phase per UI. The more clock phases per UI are used, the more information is available for finding the optimum phase, so phase detection becomes more accurate. However, each additional clock phase increases power consumption, and generating many phases becomes more difficult as the data rate increases. In particular, a PD using two clock phases per UI is called a 2x-oversampling PD, and a PD using one clock phase per UI is called a BRPD. These two types are the PDs most commonly implemented in high-speed CDRs. This section introduces PDs that can be used as non-linear (bang-bang) PDs because a digital loop filter is used: the Alexander PD for the 2x-oversampling method and the MMPD for the BR method.

#### 2.3.1 2x Oversampling PD – Alexander PD

The 2x-oversampling PD takes two samples per UI. In particular, the Alexander PD, one of the most widely used PDs owing to its robust operation, samples both the data edge and the data center. It generates "Early" and "Late" signals by monitoring two consecutive edge samples and the data sample. Fig. 2.7 shows the principle of the 2x-oversampling PD, which generates a timing signal [13]. When the sampling clock is early, the forward edge sample is compared with the data sample: if they differ, the comparison logic, implemented as an XOR, produces the "Early" signal. When the clock is late, comparing the data sample with the backward edge sample generates the "Late" signal. The Alexander PD locks when the edge clock ($clk_{edge}$) is located at the zero crossings of the data. At the lock point, $h_{-0.5}$ and $h_{0.5}$, the cursors located at -0.5 UI and 0.5 UI, are equal. The Alexander PD generates "Early" and "Late" information only at data transitions.

Fig. 2.7 Principle of 2x-oversampling PD (a) when the clock is "Early" and (b) "Late"
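The Early/Late decision of the Alexander PD can be written as a small function of one data triple. This is a sketch: `alexander_pd` is a hypothetical name, and it assumes that $E[n]$ is the edge sample taken between $D[n-1]$ and $D[n]$, so that an edge sample equal to $D[n-1]$ was taken before the crossing (Early) and one equal to $D[n]$ was taken after it (Late).

```python
def alexander_pd(d_prev, e, d_cur):
    """Bang-bang phase decision from D[n-1], E[n], D[n] (all +/-1).

    E[n] is the edge sample between the data samples D[n-1] and D[n].
    Early = XOR(E[n], D[n]): the edge was sampled before the crossing.
    Late  = XOR(D[n-1], E[n]): the edge was sampled after the crossing.
    Without a data transition, no phase information is produced.
    """
    if d_prev == d_cur:
        return None
    # With a transition, E[n] equals exactly one of its neighbors.
    return "EARLY" if e != d_cur else "LATE"


data = [-1, 1, 1, -1]
edges = [-1, 1, 1]            # edge sample between data[i] and data[i+1]
print([alexander_pd(data[i], edges[i], data[i + 1]) for i in range(3)])
```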



Fig. 2.8 Gain of the Alexander PD (a) ideal transfer function, (b) PDF of the high-frequency noise, and (c) effective transfer function

The Alexander PD outputs only the sign of the phase error, which means that it has a bang-bang characteristic: its gain is ideally infinite at zero phase error. In practice, however, noise, clock jitter, sampler metastability, and ISI make the gain finite. Fig. 2.8 shows the gain of the Alexander PD. The effective transfer function is the convolution of the ideal PD transfer characteristic with the noise probability density function (PDF). It is approximately linear around the lock point and keeps the bang-bang (saturated) characteristic far from it.
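The convolution described above can be checked numerically. The sketch assumes the only impairment is zero-mean Gaussian noise with standard deviation `sigma` (in the same units as the phase error); `pd_ideal` and `pd_effective` are illustrative names. Under that assumption, the effective characteristic is $\mathrm{erf}(\phi/(\sigma\sqrt{2}))$, whose slope at the lock point is $\sqrt{2/\pi}/\sigma$, so the finite PD gain shrinks as the noise grows.

```python
import math


def pd_ideal(phi):
    """Ideal bang-bang PD: sign of the phase error."""
    return (phi > 0) - (phi < 0)


def pd_effective(phi, sigma, n=8000, span=8.0):
    """Convolve sign(.) with a zero-mean Gaussian PDF of std sigma (midpoint rule)."""
    dx = 2.0 * span * sigma / n
    acc = 0.0
    for i in range(n):
        x = -span * sigma + (i + 0.5) * dx        # noise value at the bin center
        pdf = math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        acc += pd_ideal(phi - x) * pdf * dx
    return acc


# Compare the numerical convolution with the closed form erf(phi / (sigma*sqrt(2))).
for phi in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(phi, round(pd_effective(phi, sigma=1.0), 3),
          round(math.erf(phi / math.sqrt(2.0)), 3))
```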

#### 2.3.2 Baud-rate PD – Mueller Muller PD

The BRPD reduces clocking power because it needs one fewer sampling phase per UI than the 2x-oversampling PD. This section explains the MMPD, which is widely used as a BRPD [32]. Denoting the single-bit response by sbr(t), the bit interval by $T_b$, and the sampling phase by $t_k$, the MMPD locks at the point where $sbr(t_k-T_b) = sbr(t_k+T_b)$. As shown in Fig. 2.9, the MMPD compares $sbr(t_k-T_b)$ and $sbr(t_k+T_b)$ and produces the "Early" or "Late" signal. The received signal rx(t) can be expressed as

$$rx(t) = \sum_{m} tx_{m} \cdot sbr(t - mT_b) \tag{2.10}$$

where $tx_m$ is the m<sup>th</sup> data symbol. For the k<sup>th</sup> sample, taken at time $t = k \cdot T_b + t_k$,

$$
\begin{aligned}
rx_{k} &= rx(k \cdot T_{b} + t_{k}) \\
&= \sum_{m} tx_{m} \cdot sbr[(k - m)T_{b} + t_{k}] \\
&= \sum_{i} tx_{k-i} \cdot sbr[iT_{b} + t_{k}]
\end{aligned}
\tag{2.11}
$$

Fig. 2.9 Principle of the MM timing recovery

Multiplying both sides by  $tx_{k-1}$  and taking the expectation,

$$E[rx_k \cdot tx_{k-1}] = \sum_{i} E[tx_{k-i} \cdot tx_{k-1}] \cdot sbr(iT_b + t_k) \approx \overline{tx^2} \cdot sbr(t_k + T_b) \tag{2.12}$$

where $\overline{tx^2} = E[tx_{k-1}^2] = E[tx_k^2]$; for independent, zero-mean symbols, only the $i = 1$ term survives. In the same way,

$$E[rx_k \cdot tx_{k-1} - rx_{k-1} \cdot tx_k] \approx \overline{tx^2} \cdot [sbr(t_k + T_b) - sbr(t_k - T_b)] \tag{2.13}$$

The sign of (2.13) indicates whether the sampling phase is early or late, so the MMPD produces the "Early" or "Late" signal from it. Fig 2.4 shows a block diagram of the MMPD from eq. (2.13). Because (2.12) holds only for independent and equiprobable random data, the MMPD requires such data. If the data are not sufficiently random, the data pattern may cause the timing loop to lock at an incorrect phase.
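Equation (2.13) can be checked with a short Monte Carlo estimate. The sketch assumes a hypothetical SBR `sbr(t)` (time in UI, so $T_b = 1$), independent equiprobable ±1 symbols so that $E[tx^2] = 1$, and no noise; `mm_metric` is an illustrative name. The sample average of $rx_k \cdot tx_{k-1} - rx_{k-1} \cdot tx_k$ should approach $sbr(t_k + T_b) - sbr(t_k - T_b)$ at each sampling phase.

```python
import math
import random


def sbr(t):
    """Hypothetical single-bit response vs. time t in UI (main lobe plus a tail)."""
    return math.exp(-t * t / 0.5) + 0.3 * math.exp(-(t - 1.0) ** 2 / 0.5)


def mm_metric(tau, n_sym=30_000, seed=0):
    """Estimate E[rx_k*tx_{k-1} - rx_{k-1}*tx_k] at sampling phase tau (in UI)."""
    rng = random.Random(seed)
    tx = [rng.choice((-1, 1)) for _ in range(n_sym)]
    taps = {i: sbr(i + tau) for i in range(-2, 7)}   # truncated SBR samples

    def rx(k):                                       # eq. (2.11), truncated
        return sum(h * tx[k - i] for i, h in taps.items() if 0 <= k - i < n_sym)

    acc, rx_prev = 0.0, rx(0)
    for k in range(1, n_sym):
        rx_cur = rx(k)
        acc += rx_cur * tx[k - 1] - rx_prev * tx[k]
        rx_prev = rx_cur
    return acc / (n_sym - 1)


for tau in (-0.3, 0.0, 0.3):
    print(tau, round(mm_metric(tau), 3), round(sbr(tau + 1) - sbr(tau - 1), 3))
```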

However, (2.13) is difficult to implement directly because obtaining $rx_k$ requires an analog-to-digital converter (ADC). For implementation simplicity, the sign-sign MMPD (SS-MMPD) uses a data sampler and an error sampler with a threshold offset. The SS-MMPD timing recovery function $\tau$ can be expressed as follows:

$$
\begin{aligned}
\tau &= rx_{k} \cdot tx_{k-1} - rx_{k-1} \cdot tx_{k} \\
&= rx_{k} \cdot tx_{k-1} - tx_{k} \cdot tx_{k-1} - rx_{k-1} \cdot tx_{k} + tx_{k-1} \cdot tx_{k} \\
&= (rx_{k} - tx_{k}) \cdot tx_{k-1} - (rx_{k-1} - tx_{k-1}) \cdot tx_{k} \\
&\approx sign(rx_{k} - tx_{k}) \cdot tx_{k-1} - sign(rx_{k-1} - tx_{k-1}) \cdot tx_{k}
\end{aligned}
\tag{2.14}
$$

where $sign(rx_k - tx_k)$ and $sign(rx_{k-1} - tx_{k-1})$ are two sampled binary results. In the case of a falling transition ($tx_{k-1} = +A_{ref}$, $tx_k = -A_{ref}$), dropping the common positive factor $A_{ref}$,

$$\tau \approx sign(rx_k + A_{ref}) + sign(rx_{k-1} - A_{ref}) \tag{2.15}$$

where $sign(rx_k + A_{ref})$ is the output of an error sampler with a threshold of $-A_{ref}$, and $sign(rx_{k-1} - A_{ref})$ is the output of an error sampler with a threshold of $+A_{ref}$.
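The sign-sign metric in the last line of (2.14) can be sketched as follows, assuming $tx_k = A_{ref} \cdot d_k$ with data-sampler decisions $d_k \in \{-1, +1\}$. The names `sgn`, `ss_mmpd`, and `a_ref` are illustrative. For a falling transition, the function reduces to (2.15): it outputs +2 or -2 when the two error samplers agree and 0 when they disagree.

```python
def sgn(x):
    return 1 if x >= 0 else -1


def ss_mmpd(rx_prev, rx_cur, d_prev, d_cur, a_ref):
    """Sign-sign MM phase metric: last line of eq. (2.14) with tx_k = a_ref * d_k.

    d_prev, d_cur are data-sampler decisions (+/-1). sgn(rx - a_ref*d) is the
    output of an error sampler whose threshold is +a_ref or -a_ref; the common
    positive factor a_ref on tx_{k-1} and tx_k is dropped since only signs matter.
    """
    return sgn(rx_cur - a_ref * d_cur) * d_prev - sgn(rx_prev - a_ref * d_prev) * d_cur


# Falling transition (d_prev = +1, d_cur = -1) reproduces eq. (2.15).
a = 0.5
for rx_prev, rx_cur in [(0.7, -0.3), (0.3, -0.7), (0.6, -0.6)]:
    print(ss_mmpd(rx_prev, rx_cur, 1, -1, a), sgn(rx_cur + a) + sgn(rx_prev - a))
```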

## Chapter 3

# An Adaptive Offset Cancellation Scheme and Shared Summer Adaptive DFE

#### **3.1 Overview**

In RX design, the requirements for lower power consumption, higher bandwidth, and longer reach are steadily increasing. However, a high-loss channel degrades the vertical margin, which exacerbates the design difficulty of the RX. Various equalizers for improving the vertical margin have been proposed [9]-[15], [17], [18] to overcome this limitation.

Various offset cancellation schemes in the data path are employed in many designs [9], [10], [12] to increase the vertical margin. Conventional offset cancellation methods set the RX inputs to a common-mode value and search for the flip point of the sampler output while sweeping the compensation control words. However, such a calibration cannot track changes of the DC offset during the CTLE adaptation process, in which the DC gain changes. Furthermore, the calibration procedure takes a long time to monitor the sampler outputs and find appropriate control words. A dynamic offset cancellation scheme was proposed in [12] to shorten the calibration time. It detects the DC offset of the data path from the average of the analog front-end (AFE) outputs using a low-pass filter and reduces the DC offset to less than 1 mV with a random data pattern. Nevertheless, it suffers from the large hardware overhead of integrating the low-pass filter.

A DFE is adopted in many RX designs to increase the vertical margin by removing the post-cursor ISI [9]-[14]. In particular, an adaptive equalizer finds the optimum DFE coefficients, which vary with the channel characteristics. Furthermore, in time-interleaved designs such as half-rate or quarter-rate RXs, the adaptive equalizer employs separate adaptation loops for the interleaved data paths [10]. While the individual adaptation loops suppress the mismatch effects of the equalizer coefficients, identical hardware, including additional digital loop filters (DLFs) and error samplers, is replicated for each path. This additional hardware consumes excessive power in the AFE and the synthesized digital logic (SDL).

To remedy these issues, we propose two design techniques. First, we propose a method of canceling the DC offset caused by random mismatches along the entire data path. The DC offset is detected from the received random data pattern and removed. Furthermore, the offset cancellation runs simultaneously with the equalizer adaptation and requires no additional hardware in the AFE. Second, we propose a shared-summer DFE, in which a single summer serves multiple samplers, reducing the number of power-hungry DFE summers. Compared with using a dedicated summer for each sampler, this technique also eliminates loading mismatches. As a result, the shared-summer DFE reduces power consumption and hardware complexity by sharing the summer, the adaptation loop, and the error sampler.

#### **3.2 An Adaptive Offset Cancellation Scheme and Shared-Summer Adaptive DFE for Low Power Receiver**

The DC offset of the data path caused by random device mismatches reduces the vertical margin of the received input (= $h_0 - V_{offset}$), resulting in a higher BER. The proposed offset cancellation scheme detects the DC offset from a random data sequence. As shown in Fig. 3.1, with the 2x-oversampling PD, the edge information is sampled by the edge clock ($clk_{edge}$) and used to recover the clock phase. In the locked state, the outputs sampled at the zero-crossing edges also contain offset information of the edge-sampling path. When the DC offset of the data path ($V_{offset}$) is zero, the sampled output should be -1 or +1 with equal probability. On the other hand, when $V_{offset} > 0$, the sampled outputs at the edges are biased toward -1; conversely, when $V_{offset} < 0$, they are biased toward +1. As shown in the truth table of Table III.I, the offset detector determines the sign of the DC offset from the polarity of the edge samples and cancels it.

Fig. 3.1 Principle of extracting offset information at zero-crossing points
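The offset-detector and PD columns of Table III.I can be written as one function of the triple $(D[n-1], E[n], D[n])$. The sketch below is illustrative: `aoc_and_pd` is a hypothetical name, and the short simulation uses assumed numbers (2 mV rms noise at the zero crossing and a +5 mV offset modeled as raising the sampler threshold). It shows that a positive offset biases the edge samples toward -1 and therefore makes the offset detector output mostly "UP".

```python
import random


def aoc_and_pd(d_prev, e, d_cur):
    """Offset-detector and PD outputs for one (D[n-1], E[n], D[n]) triple, per Table III.I.

    Returns (offset, phase) with values "UP"/"DN", or (None, None) when there is no
    data transition (the edge sample then carries no zero-crossing information).
    """
    if d_prev == d_cur:
        return None, None
    offset = "DN" if e == 1 else "UP"        # depends only on the edge polarity
    phase = "UP" if e == d_cur else "DN"
    return offset, phase


# Hypothetical numbers: input at the zero crossing plus Gaussian noise (2 mV rms);
# a +5 mV data-path offset raises the effective threshold, biasing E[n] toward -1.
rng = random.Random(0)
v_offset_mv, counts = 5.0, {"UP": 0, "DN": 0}
for _ in range(1000):
    e = 1 if rng.gauss(0.0, 2.0) - v_offset_mv >= 0 else -1
    counts[aoc_and_pd(-1, e, 1)[0]] += 1
print(counts)   # mostly "UP" for a positive offset
```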

With the algorithm described above, the DC offset is detected only in the edge sampler; the DC offset of the data sampling path cannot be captured this way. Fig. 3.2 presents a conceptual block diagram of the 2-way interleaved swap controller and illustrates the concept of detecting the DC offsets of both paths. The proposed controller alternately swaps the edge and data samples to cancel the DC offsets of the even and odd data paths of the interleaved samplers. First, the swap controller uses the even-path samplers as edge samplers and the odd-path samplers as data samplers, and the even-path offset is canceled. Next, the edge and data samples are swapped, and the DC offset of the odd path is compensated. With this two-step process, both offsets are canceled. The offset cancellation runs simultaneously with the clock recovery and equalizer adaptation, after which normal operation starts. The swapping of the edge and data signals is performed in the SDL, which includes the AOC engine and the CDR engine. The AOC and CDR engines are implemented according to the truth table of the offset detector and PD presented in Table III.I. When the data and edges are swapped, a phase offset of 0.5 UI initially appears in the CDR. The CDR engine takes a few tens of cycles to cancel this phase offset, and then the AOC engine cancels the offset of the edge sampler that served as a data sampler in the previous step. Reusing the edge samples reduces design complexity and avoids the AFE bandwidth reduction that extra hardware would cause. Furthermore, as shown in Fig. 3.2, the swap controller is implemented with only a few MUXs, and the SDL operates at a low speed.

| D[n-1] | E[n] | D[n] | Offset<br>Detector | Phase<br>Detector |
|--------|------|------|--------------------|-------------------|
| -1     | +1   | +1   | DN                 | UP                |
| -1     | -1   | +1   | UP                 | DN                |
| +1     | +1   | -1   | DN                 | DN                |
| +1     | -1   | -1   | UP                 | UP                |

Table III.I Truth table of offset and phase detection logic

Fig. 3.2 Conceptual block diagram of the swap controller

Fig. 3.3 Schematic of the sampler with offset cancellation function
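The two-step swap sequence can be illustrated with a toy loop. In this sketch, `aoc_two_step` is a hypothetical name; the noise level, the one-update-per-transition rate, and the mapping of "UP" to an increase of $V_{comp}$ are assumptions, while the ±14 mV range and the 2 mV step follow the sampler description in this section. The CDR re-lock after the 0.5 UI phase jump is not modeled.

```python
import random


def aoc_two_step(offsets_mv, sigma_mv=2.0, step_mv=2.0, n_trans=2000,
                 comp_limit_mv=14.0, seed=0):
    """Toy model of the two-step swap: cancel the even-path offset, then the odd.

    offsets_mv = (even, odd) data-path DC offsets in mV. In each step the path
    under calibration acts as the edge sampler; its output at a zero crossing is
    sgn(noise - offset + comp). Per Table III.I, E = -1 -> UP (comp += step) and
    E = +1 -> DN (comp -= step). The UP/DN-to-voltage mapping, the noise level,
    and the step size are assumptions.
    """
    rng = random.Random(seed)
    comp = [0.0, 0.0]
    for path in (0, 1):                          # step 1: even; step 2: odd (swapped)
        for _ in range(n_trans):                 # one update per data transition
            v = rng.gauss(0.0, sigma_mv)
            e = 1 if v - offsets_mv[path] + comp[path] >= 0 else -1
            comp[path] += step_mv if e == -1 else -step_mv
            comp[path] = max(-comp_limit_mv, min(comp_limit_mv, comp[path]))
    return comp


print(aoc_two_step((5.0, -7.0)))   # each entry dithers around its path's offset
```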



Fig. 3.4 Simulated compensation voltage vs. offset control word


The DC offset includes the combined effects of random mismatches in the electrostatic discharge (ESD) diodes, on-die termination (ODT), CTLE, DFE summer, and samplers. In the implemented design, the samplers cancel the total accumulated DC offset at once: the compensation voltage ($V_{comp}$) at the sampler cancels the offset of the entire data path at a single point. The circuit schematic of the sampler with offset control is shown in Fig. 3.3. The sampler is a StrongArm latch with controlled varactors. Monte Carlo simulation of local variation estimates the DC offset of the data path at 3.4 mV<sub>rms</sub>. $V_{comp}$ is therefore designed to vary from -14 mV to +14 mV (about -4$\sigma$ to +4$\sigma$) to compensate for the DC offset of the data path. Since the sampler operates at half rate, it has a sufficient bandwidth margin. Fig. 3.4 shows the simulated $V_{comp}$ versus the offset control word, with a resolution of 2 mV/code.

#### **3.3 Shared Summer DFE**

An adaptive equalizer with a DFE is usually employed to reduce ISI over various lossy channels. Figs. 3.5 and 3.6 show the conceptual timing diagrams and schematics, respectively, of the conventional and the proposed adaptive half-rate DFE. In the traditional half-rate 2x-oversampling RX, six samplers (2 data, 2 edge, and 2 error) and four summers would be employed. In the proposed structure, on the other hand, the DFE summers are merged: two data samplers and one error sampler share one summer, and two edge samplers share another. In addition, the two error samplers are merged into one. The conventional DFE summer, which operates at half rate, holds the feedback data over 2 UI, as shown in Fig. 3.5(a). However, each summer provides a valid sampling window for only 1 UI and stays idle during the other UI. The proposed shared-DFE summer operates at full rate, as shown in Fig. 3.5(b), producing summer outputs alternately for the even and odd samplers. The summer is thus shared by two paths and provides valid data at full rate to the half-rate samplers. To achieve full-rate operation, each feedback signal must hold valid data while its path is being sampled and must be forced to null while the other path's feedback is in use.

The proposed shared-summer DFE offers three advantages. First, the number of summers is reduced from four to two, minimizing the hardware area and power consumption of the AFE. Moreover, fewer current-mode logic (CML) summers means less parasitic capacitance, which reduces the static current required for the same bandwidth. Second, the shared-summer DFE minimizes the number of adaptation loops. In the conventional DFE, mismatches of the DFE coefficients, bandwidth, and loading appear between the even and odd paths, so the DFE coefficients and Dlev have to be adapted separately for each summer ($h_{ev1}$, $h_{ev2}$, *Dlev<sub>ev</sub>*, $h_{od1}$, $h_{od2}$, *Dlev<sub>od</sub>*) to absorb the random mismatches, as shown in Fig. 3.6(a). In the proposed DFE, by contrast, there are no such mismatches because the summers are combined. This reduces the adaptation loops from six to three ($h_1$, $h_2$, $Dlev$) and significantly reduces the power consumption and hardware overhead of the synthesized digital loops. Third, the proposed DFE reduces the number of error samplers to one. The equalizer adaptation loop uses the output of the error sampler with a threshold of Dlev, and because the equalizer adaptation loops are merged into one, only one error sampler is needed. This reduces the loading capacitance of the summer, which widens the summer bandwidth and lowers the power consumption of the samplers. In summary, the proposed structure reduces the number of summers, adaptation loops, and samplers, significantly saving hardware and power.

Fig. 3.7 shows a detailed schematic of the proposed DFE. As mentioned above, the proposed DFE uses two summers in the half-rate structure: a data-DFE summer and an edge-DFE summer. The data-DFE summer is a CML-type buffer with two taps, and the edge-DFE summer has one tap. Each DFE tap consists of a current source and four switches, of which only one operates at a time. The DFE adaptation logic controls the current source according to the SS-LMS algorithm. The DFE latch consists of a StrongArm latch and an RS latch. Since the output of the StrongArm latch maintains a valid level for 1 UI, the first-tap data is fed back through a buffer only; because it is provided directly to the CML tap, its feedback delay is reduced. The second-tap data are fed back through clock gating, which converts the NRZ data to RZ. The merged-summer method therefore needs only two additional AND gates for clock gating, a small cost compared with the hardware saved by reducing the summers, samplers, and adaptation loops.
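The clock gating that converts the second-tap feedback from NRZ to RZ can be sketched at the UI level. This is a toy model with a hypothetical function name, not a circuit description: each path's latch holds its decision for 2 UI, and the gating passes it to the shared summer only in the UI when that path is sampled. The summed feedback in UI n then equals $D[n-2]$, and the even and odd feedback signals are never active at the same time.

```python
def shared_summer_tap2(decisions):
    """Tap-2 feedback seen by a shared summer with clock gating (toy model).

    decisions[n] is the decision of UI n; even UIs belong to the even path and odd
    UIs to the odd path. Each path latches its decision as NRZ (held for 2 UI);
    gating with the half-rate clock keeps it only in the UI where that path samples
    the shared summer (RZ) and forces it to 0 otherwise.
    Returns (even_rz, odd_rz) for every UI.
    """
    latch = [0, 0]                             # NRZ outputs of the even/odd latches
    out = []
    for n, d in enumerate(decisions):
        path = n % 2
        even_rz = latch[0] if path == 0 else 0     # gated by clk
        odd_rz = latch[1] if path == 1 else 0      # gated by clkb
        out.append((even_rz, odd_rz))
        latch[path] = d                        # new decision replaces the held value
    return out


print(shared_summer_tap2([1, -1, -1, 1, 1, -1]))
# [(0, 0), (0, 0), (1, 0), (0, -1), (-1, 0), (0, 1)]
```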



Fig. 3.5 Conceptual timing diagram of (a) conventional adaptive half-rate DFE and (b) proposed adaptive half-rate DFE



Fig. 3.6 Conceptual schematic of (a) conventional adaptive half-rate DFE and (b) proposed adaptive half-rate DFE



Fig. 3.7 Conceptual block diagram of the merged summer DFE

### **3.4 Receiver Implementation**

Fig. 3.8 shows the block diagram of the proposed RX, which consists of an adaptive equalizer, a de-serializer (DES), a digitally controlled oscillator (DCO), and an SDL that includes a data and edge swap controller, an equalizer adaptation engine, an AOC engine, and a CDR engine. The equalizer adaptation and the offset cancellation are performed while receiving a random data pattern. The equalizer adaptation determines the coefficients of the CTLE and the DFE and the value of Dlev. After the adaptation and the offset cancellation, the RX starts normal operation.

In the first step of the clock recovery phase, a clock pattern is used to attain frequency and phase lock through referenceless frequency acquisition. Frequency tracking exploits the fact that, with a clock-pattern input, a data transition occurs at every data sample. When the DCO operates above $f_{nyq}/3$, an algorithm that counts the number of data transitions performs frequency tracking [13]. The DCO operating frequency is adjusted for the target data rate so that the DCO works in the range where frequency tracking is possible. After frequency recovery, phase tracking uses an Alexander PD. A four-stage ring oscillator is used to support a wide range of data rates from 1.62 Gb/s to 10 Gb/s. The PD outputs are forwarded to the DCO through two paths. One is the integral path of the CDR through the SDL, which has a long latency because the DLF runs at $f_{nyq}/5$. The other connects directly to the second control input of the DCO to adjust the phase, which considerably reduces the loop latency [13]. This direct proportional path, with its short loop latency, improves the stability of the CDR loop.

The equalizer adaptation phase uses a random data pattern, and the adaptation engine searches for the optimal coefficients for various channel losses. A CTLE, a 2-tap data DFE, and a 1-tap edge DFE are employed to cancel the post-cursor ISI. The 1-stage CTLE with RC degeneration is controlled by changing its DC gain. The CTLE adaptation engine drives the average of $h_4$ and $h_5$ to zero using the SS-LMS algorithm [14]. The data-DFE adaptation is performed using SS-LMS, and the 1-tap edge DFE uses the coefficient $h_{1.5}$ to improve the accuracy of the edge samples and increase the CDR bandwidth. For simplicity, the edge-DFE coefficient is derived as the average of the two data-DFE coefficients: $h_{1.5}=(h_1+h_2)/2$ [13]. Dlev is adapted with the SS-LMS algorithm using an uneven data-level technique to find the exact equalizer coefficients in the presence of pre-cursor ISI [14]. The AOC engine is implemented in the SDL.

Fig. 3.8 Block diagram of the proposed RX

### **3.5 Measurement Results**

The prototype chip has been fabricated in 40 nm CMOS technology, and the chip photomicrograph is shown in Fig. 3.9. The RX occupies a chip area of 0.135 mm<sup>2</sup>. The RX is tested at data rates from 1.62 Gb/s to 10 Gb/s with a 2<sup>7</sup>-1 pseudorandom binary sequence (PRBS) pattern, and the BER is measured with a bit-error-rate tester (BERT). Fig. 3.10 illustrates the insertion loss of the two channels. The insertion losses are 27 dB at 5 GHz for channel 1 and 29 dB at 4.05 GHz for channel 2. Channel 1 is used for the 10 Gb/s data rate, and channel 2 for 8.1 Gb/s. Fig. 3.11 shows the power breakdown of the RX at 10 Gb/s. The RX, including the CTLE, the 2-tap data DFE, the 1-tap edge DFE, the DES, the SDL, and the DCO, consumes 18.6 mW. Notably, the DES and the SDL consume only 4.88 mW thanks to the shared summers and the reduced number of adaptation loops.

Fig. 3.9 Chip photomicrograph of the implemented RX

To confirm the performance of the AOC engine, the BER is measured while changing the offset cancellation control word, as shown in Fig. 3.12. As described in Section 3.2, the BER varies with the offset-cancellation control word. With a random data sequence, the AOC logic adaptively converges the control word to the optimum code. The AOC logic improves the BER from 10<sup>-8</sup> to below 10<sup>-12</sup> at 8.1 Gb/s over a cable with 29 dB loss at 4.05 GHz. The figure also clearly shows the effect of the DFE.

Fig. 3.10 Insertion loss of two channels used for testing

Fig. 3.13 shows the jitter tolerance (JTOL) at a BER of 10<sup>-12</sup> with 27 dB channel loss at 5 GHz. JTOL is measured with and without the edge DFE, and the JTOL curves show that the edge DFE improves the CDR bandwidth. Table III.II compares the proposed RX with prior art in the literature. The comparison shows that the proposed RX is highly power-efficient and offers the best FOM of 0.068 pJ/b/dB.

Fig. 3.11 Power breakdown and power efficiency of the proposed RX



Fig. 3.12 Measured BER versus offset cancellation code



Fig. 3.13 Measured JTOL at BER of 10<sup>-12</sup> with and without edge DFE



Fig. 3.14 Figure-of-merit of the proposed RX

|                     | ISSCC 2012<br>[9]       | VLSI 2015<br>[15] | ISSCC 2016<br>[12] | ISSCC 2017<br>[16] | ASSCC 2018<br>[17]     | VLSI 2019<br>[13] | This Work                                |
|---------------------|-------------------------|-------------------|--------------------|--------------------|------------------------|-------------------|------------------------------------------|
| Technology Node     | 32nm                    | 65nm              | 28nm               | mu59               | 28nm                   | 65nm              | 40nm                                     |
| Data Rate (Gb/s)    | 28                      | 14                | 25                 | 10                 | 6                      | 1.62 - 10.8       | 1.62 - 10                                |
| Link Loss (dB)      | 35                      | 12                | 50                 | N/A                | 22                     | 34                | 27                                       |
| Equalization        | 4-tap FFE<br>15-tap DFE | 2-tap FIR\*\*     | CTLE<br>14-tap DFE | CTLE               | 4-tap FSE<br>3-tap DFE | CTLE<br>2-tap DFE | CTLE<br>2-tap Data DFE<br>1-tap Edge DFE |
| Adaptive Equalizer  | YES                     | NO                | YES                | NO                 | YES                    | YES               | YES                                      |
| Offset Cancellation | Calibration             | NO                | Dynamic            | NO                 | Calibration            | NO                | Adaptive                                 |
| Summer Merging      | NO                      | NO                | NO                 | NO                 | NO                     | NO                | YES                                      |
| Efficiency (pJ/b)   | 14                      | 1.68              | 16.12              | 2.66               | 3.5                    | 3.4               | 1.86                                     |
| FOM (pJ/b/dB)\*     | 0.4                     | 0.14              | 0.316              | N/A                | 0.16                   | 0.1               | 0.068                                    |
| Power (mW)          | 392                     | 23.6              | 403                | 26.6               | 31.5                   | 37.7              | 18.6                                     |

\* FoM = (energy efficiency) / (channel loss @ Nyquist frequency)
\*\* Transmitter equalization only

Table III.II Comparison with previous RXs
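The FOM definition in the table footnote can be reproduced with a few lines of arithmetic; the function names are illustrative. Since mW divided by Gb/s equals pJ/b, this work's 18.6 mW at 10 Gb/s gives 1.86 pJ/b, and dividing by the 27 dB channel loss gives about 0.0689 pJ/b/dB, which agrees with the 0.068 in the table when truncated to three decimal places.

```python
def energy_efficiency_pj_per_bit(power_mw, rate_gbps):
    """mW divided by Gb/s equals pJ/b."""
    return power_mw / rate_gbps


def fom_pj_per_bit_per_db(power_mw, rate_gbps, loss_db):
    """FoM = (energy efficiency) / (channel loss at the Nyquist frequency)."""
    return energy_efficiency_pj_per_bit(power_mw, rate_gbps) / loss_db


# This work: 18.6 mW at 10 Gb/s over a channel with 27 dB loss at 5 GHz.
print(round(energy_efficiency_pj_per_bit(18.6, 10), 3))   # 1.86
print(round(fom_pj_per_bit_per_db(18.6, 10, 27), 4))      # 0.0689
```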


## **Chapter 4**

## **PAM-4 Baud-Rate Digital CDR**

#### 4.1 Overview

Because oversampling-based CDRs require additional clocking power [20], [21], BR CDRs have recently attracted attention for PAM-4 RXs. The most popular PAM-4 BR CDR is the MM CDR, which requires an ADC [22], [23]. However, these ADC-based PAM-4 CDR designs are power-hungry owing to high-speed, high-resolution ADCs and bulky digital back-ends, including DFEs and feed-forward equalizers (FFEs) in the digital domain. For simplicity, an SS-MMPD using two voltage references instead of an ADC was presented [24]. However, if the RX includes both a DFE and an SS-MMPD, the SS-MMPD moves the lock point to where $h_{-1}$ becomes zero. If the TX is not sufficiently equalized, this makes the RX vulnerable to noise or causes the lock point to drift. Two types of BRPD are presented to resolve these problems.

First, this chapter extends the design procedure to find the optimal weights for the PAM-4 BRPD. As a result, the proposed stochastic PD (SPD) achieves an optimal phase-locking capability that maximizes the PAM-4 vertical eye opening (VEO), compared with conventional logic-based approaches. In addition, the proposed SPD detects the current phase state in a machine-learning-based way.

Second, a BRPD better suited to multi-level signals is proposed. Since this BRPD locks when $h_0$ becomes $N_t \cdot h_{-1}$, where $N_t$ is the target cursor ratio, it controls both $h_{-1}$, which remains even with a CTLE and a DFE, and the resulting VEO. Furthermore, the lock point is independent of the post-cursor ISI, so the BRPD with an adaptive DFE has a unique lock point.

#### **4.2 Overall Architecture**

#### 4.2.1 Proposed Baud-rate CDR Architecture

The overall circuit block diagram of the implemented RX is shown in Fig. 4.1. The CDR uses a half-rate architecture with a forwarded clock. The RX consists of an SDL with the proposed SPD and BRPD, an AFE, a digital-to-analog converter (DAC), a DES, and a phase rotator (PR). The SDL contains two PDs: one uses the stochastic method, and the other is suited to multi-level signaling. Both are implemented in the SDL so that the two PDs can be verified with one prototype chip; when one PD is used, the other is turned off. Switching between the PDs changes only the sampling threshold of the error sampler; the operation of the other blocks does not change.

The AFE includes a 50 $\Omega$ resistor for impedance matching, a DFE summer, and ten samplers. The DFE summer is a single-stage inverter-based amplifier, which improves the DC linearity. The half-rate DFE is implemented with one tap and merges two summers to reduce the DFE adaptation logic and power consumption. There are five samplers per clock phase, for a total of ten samplers in the AFE: per clock phase, three data samplers for the PAM-4 signal ($V_H$, $V_0$, $V_L$) and two error samplers ($V_{Dlev1}$, $V_{Dlev2}$) for CDR operation and adaptation. The error samplers are used to find the magnitude of $h_0$ and to operate the adaptive equalizer. For the SPD, both error samplers adapt their sampling thresholds to $3h_0$ and $-3h_0$ and are used for both data-level adaptation and phase detection:

$$V_{Dlev1} = 3h_0, V_{Dlev2} = -3h_0 \tag{4.1}$$

On the other hand, for the BRPD for the multi-level signals, one error sampler is used to adapt data level, and the other sampler is used for phase detection.



Fig. 4.1 Overall circuit block diagram of the implemented RX

$$V_{Dlev1} = V_{Dlev} = 3h_0 + 3h_{-1}, \quad V_{Dlev2} = V_{PD} = \frac{V_{Dlev}(3N_t - 3)}{3N_t + 3} \tag{4.2}$$
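
As a quick numerical check of (4.1) and (4.2), the Python sketch below computes the error-sampler thresholds for both PD modes. The function names and example voltages are hypothetical. In the chip, this arithmetic is performed by the SDL's threshold calculator. The final check confirms that when $h_0 = N_t h_{-1}$ exactly, $V_{PD}$ equals $3h_0 - 3h_{-1}$, which is the level seen on a (+3, -3) transition.

```python
def spd_error_thresholds(h0):
    """Eq. (4.1): error-sampler thresholds used with the stochastic PD."""
    return 3.0 * h0, -3.0 * h0


def brpd_error_thresholds(v_dlev, n_t):
    """Eq. (4.2): V_Dlev1 = V_Dlev (adapted to 3*h0 + 3*h_-1); V_Dlev2 = V_PD."""
    v_pd = v_dlev * (3.0 * n_t - 3.0) / (3.0 * n_t + 3.0)
    return v_dlev, v_pd


if __name__ == "__main__":
    h0, n_t = 0.211, 12          # hypothetical example values (volts, ratio)
    h_pre = h0 / n_t             # h_-1 placed exactly at the target ratio
    v_dlev, v_pd = brpd_error_thresholds(3 * h0 + 3 * h_pre, n_t)
    # At the target ratio, V_PD matches the (+3, -3) input level 3*h0 - 3*h_-1.
    print(spd_error_thresholds(h0), v_dlev, v_pd, 3 * h0 - 3 * h_pre)
```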

The sampled outputs are de-serialized by the DES and delivered to the SDL. The sampler outputs are also decoded and monitored to measure the BER.

The SDL includes the DFE adaptation logic, the data-level adaptation logic, the sampling-threshold calculation, the SPD, and the BRPD. The DFE adaptation is implemented using the SS-LMS algorithm. The proposed BRPD in the SDL finds the lock point and generates a phase control word (PCW) that controls the PR. The PR sets the sampling clock phase from the forwarded clock, whose frequency equals that of the half-rate sampling clock.



Fig. 4.2 Schematic of the implemented sampler

A quadrature generator converts the input clock into four phases at the same frequency.

A StrongArm latch with an adjustable sampling threshold is designed. Its detailed schematic is shown in Fig. 4.2. The latch compares the voltages at the gates of its differential input pair, and an additional differential pair is added to adjust the sampling threshold. Assuming that the transceiver is a linear time-invariant (LTI) system, the sampling thresholds of the data samplers and the PD sampler are adjusted adaptively based on $V_{Dlev}$. If the sampling threshold is not linear in its control word, the thresholds derived from $V_{Dlev}$ deviate from their targets. For the threshold to be linear in the control word, the current of the input transistors must be adjusted linearly. Since the input-transistor current follows a square law in its gate voltage, the threshold is set through current mirroring of a current generated by a current DAC (IDAC). Fig. 4.3 shows the schematic of the IDAC.

Fig. 4.3 Schematic of the proposed IDAC

The IDAC generates a differential current in a thermometer-coded manner. It is controlled by an 8-bit word that sets the sampling threshold calculated by the SDL. In addition, a 6-bit mismatch calibration is added to cancel the offset caused by random process variation. The simulated sampling threshold versus threshold control word is shown in Fig. 4.4. Since there are five samplers per clock phase, ten IDACs are implemented in total. The differential current adjusts the sampling threshold through the diode-connected devices of the sampler. This threshold-generation scheme assumes that the channel and equalizer are LTI systems.

Fig. 4.4 Simulation result of the sampling threshold versus threshold control word

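
The mapping from a calculated threshold to the 8-bit, thermometer-coded IDAC can be sketched as follows. This is a minimal model under stated assumptions, not the chip's actual encoding. The threshold step per code (`LSB_V`), the mid-scale zero point, and the choice to add the 6-bit trim in the same LSB units are all hypothetical.

```python
N_BITS = 8            # threshold control-word width (from the text)
TRIM_BITS = 6         # offset-calibration word width (from the text)
LSB_V = 1.0e-3        # hypothetical threshold step per code, in volts


def threshold_to_code(v_th, trim=0):
    """Map a signed target threshold to an offset-binary IDAC code.

    Assumptions (not from the text): mid-scale 2**(N_BITS-1) means a zero
    threshold, and the signed 6-bit trim is added in the same LSB units.
    """
    t_min, t_max = -(1 << (TRIM_BITS - 1)), (1 << (TRIM_BITS - 1)) - 1
    if not t_min <= trim <= t_max:
        raise ValueError("trim outside the 6-bit signed range")
    code = round(v_th / LSB_V) + (1 << (N_BITS - 1)) + trim
    return max(0, min((1 << N_BITS) - 1, code))   # saturate at the code limits


def thermometer(code):
    """Expand a binary code into enables for 2**N_BITS - 1 unit current cells."""
    n_cells = (1 << N_BITS) - 1
    if not 0 <= code <= n_cells:
        raise ValueError("code out of range")
    return [1] * code + [0] * (n_cells - code)


if __name__ == "__main__":
    code = threshold_to_code(0.010)          # +10 mV target threshold
    print(code, sum(thermometer(code)))      # linear: one unit cell per LSB
```

Because each code step enables one more identical unit cell, the threshold remains linear in the code, provided that the mirrored current maps linearly to the threshold. This is the property that Fig. 4.4 checks in simulation.
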
### 4.2.2 Proposed analog front-end structure

The AFE in the RX consists of a single-stage amplifier, which also serves as the DFE summer and sets the DC gain. The DC gain is adjustable up to 4 dB. The DFE summers are merged to avoid the differences in bandwidth and DFE tap coefficient that random mismatches cause in time-interleaved summers. In a conventional time-interleaved adaptive DFE, an independent adaptation loop must be formed for each path to compensate for these random mismatches, which otherwise reduce the vertical eye margin and degrade the BER. Such independent equalizer adaptation loops increase power consumption and hardware overhead. A single merged DFE summer operating at full rate avoids this problem. Furthermore, PAM-4 signaling suffers an SNR degradation of 9.5 dB compared with PAM-2, so the PAM-4 AFE needs high linearity to prevent further SNR degradation. An inverter-based amplifier is therefore used to obtain a highly linear output-versus-input voltage characteristic.

Fig. 4.5 Block diagram of the half-rate inverter-based DFE

The detailed DFE block diagram and half-rate timing diagram are shown in Fig. 4.5. To prevent SNR reduction due to non-linearity in PAM-4 signaling, the AFE amplifier is inverter-based [33]. Because the NMOS and PMOS transistors of the Gm cell operate simultaneously, the Gm cell maintains a constant gain over a wider operating range. To preserve this linear operation, the DFE tap cells are implemented with the same inverter-based structure as the Gm and load cells. The inverter-based Gm and load cells also sustain the common-mode voltage of the AFE. By merging the CTLE and DFE tap cells, the AFE performs CTLE and DFE simultaneously. The Gm cell controls the bandwidth, low-frequency gain, and high-frequency peaking of the amplifier.

Fig. 4.6 Timing diagram of the merged inverter-based DFE

Fig. 4.6 shows the timing diagram of the merged inverter-based DFE. The DFE tap is controlled by thermometer-coded data from the samplers (000, 001, 011, 111). To merge the summers of the time-interleaved paths, the feedback is divided into valid and invalid data. When valid data are fed back, the DFE removes ISI from the SBR in the same way as a conventional DFE. According to the sampler outputs, 3-bit thermometer-coded data are fed back and set one of four tap strengths (-3, -1, +1, +3). Invalid data are also fed back as 3 bits but do not affect the summer output. The AFE, operating at half rate, has even and odd paths. While one path feeds back a valid symbol, the other feeds back an invalid symbol. By alternating the valid data between the paths, the multiple summers can be merged into one.
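
The valid/invalid feedback scheme can be modeled behaviorally. In the sketch below, which uses hypothetical names, the three data-sampler outputs form the thermometer code that selects a tap strength of -3, -1, +1, or +3. In each UI, only the sampler bank that holds the previous decision drives the merged summer, and the pre-charging bank contributes nothing. The model ignores all circuit effects and assumes that the summer correction is $-w_1 D[n-1]$.

```python
# 3-bit thermometer code (d_H, d_0, d_L) -> DFE tap strength, as in Fig. 4.6.
THERMO_TO_STRENGTH = {(0, 0, 0): -3, (0, 0, 1): -1, (0, 1, 1): +1, (1, 1, 1): +3}


def thermometer_code(v_in, v_h, v_0, v_l):
    """Outputs of the three data samplers (thresholds V_H > V_0 > V_L)."""
    return (int(v_in > v_h), int(v_in > v_0), int(v_in > v_l))


def merged_summer_feedback(codes, w1):
    """Correction injected into the single merged DFE summer at each UI.

    codes[n] is the thermometer code resolved for UI n. Even UIs are
    resolved by the even sampler bank and odd UIs by the odd bank. During
    UI n, the bank that resolved UI n-1 is holding its decision (valid) and
    drives the tap cells, while the other bank is pre-charging (invalid)
    with its switches off, so it adds nothing.
    """
    result = []
    for n in range(1, len(codes)):
        valid_bank = "even" if (n - 1) % 2 == 0 else "odd"
        correction = -w1 * THERMO_TO_STRENGTH[codes[n - 1]]
        result.append((valid_bank, correction))
    return result


if __name__ == "__main__":
    symbols = (3, -1, 1, -3)
    codes = [thermometer_code(0.1 * s, 0.2, 0.0, -0.2) for s in symbols]
    for bank, corr in merged_summer_feedback(codes, w1=0.02):
        print(bank, round(corr, 4))
```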

The detailed operation and schematic of the proposed inverter-based DFE with a merged summer are shown in Fig. 4.7. The proposed DFE tap consists of three cells, each with two current sources and eight switches, and is controlled in a thermometer-coded manner. The data samplers with sampling thresholds $V_H$, $V_0$, and $V_L$ each drive one DFE tap cell. The switches in each cell are divided into even-time and odd-time feedback switches. When the even-time switches are in the valid state, the odd-time switches are in the invalid state, and vice versa. In the invalid state, the corresponding four switches are turned off, as shown in the table in Fig. 4.7. In the valid state, each fed-back bit is 0 or 1, so the thermometer code takes one of four values (000, 001, 011, 111). When a valid sampler feeds back a 0, it turns on the NMOS at the $SUM_N$ node and the PMOS at the $SUM_P$ node. When it feeds back a 1, it turns on the NMOS at the $SUM_P$ node and the PMOS at the $SUM_N$ node. This feedback polarity compensates positive ISI. As mentioned in Section 4.2.1, the StrongArm latches are adjusted to sampling thresholds of $V_{Dlev1}$, $V_{Dlev2}$, $V_H$, $V_0$, and $V_L$. The sampler state follows its clock. The sampler amplifies the input at the rising edge of the clock, holds the data while the clock is high, and pre-charges while the clock is low. Data are fed back (valid state) during the hold phase, and the pre-charge phase corresponds to the invalid state.





## 4.3 Stochastic Phase Detection PAM-4 CDR

### 4.3.1 Proposed Stochastic Phase Detection

The proposed SPD logic is illustrated in Fig. 4.8. It uses the information from two consecutive symbols by collecting three data samples ($D_H$, $D_M$, $D_L$) and two error samples ($E_H$, $E_L$) per symbol with a half-rate clock. The resulting ten samples, which represent two consecutive PAM-4 symbols ($A_N$, $A_{N+1}$) as a single pair ($B_N$), are de-serialized and processed in the SDL. Since $A_N$ takes one of six values, 0 to 5, according to the outputs of the five samplers, $B_N$ takes an ordered pair from (0,0) to (5,5). The numbers of occurrences of the 36 cases are collected and multiplied by weights pre-determined by the machine-learning design procedure. The results are accumulated in a digital loop filter to adjust the sampling phase.

This section explains the proposed design procedure used to find the optimal weights for the SPD. The first step is to collect histograms of the sequential patterns using modeling simulation. Fig. 4.9 shows the eye diagram of the PAM-4 CDR with BR samplers. As indicated by the black dots in Fig. 4.9, the sampling thresholds of the BR samplers are adapted to $3h_{0}$, $2h_{0}$, 0, $-2h_{0}$, and $-3h_{0}$. The result of the five samplers is defined as $A_{N}$ and is expressed as a value from 0 to 5, with a higher input level mapped to a higher value. Two consecutive values of $A_{N}$, expressed as $B_{N}$, are used to detect the phase error.
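
The mapping from the five baud-rate comparator outputs to $A_N$ and $B_N$ can be written compactly. The helpers below are hypothetical. They assume ideal comparators at $\pm 3h_0$, $\pm 2h_0$, and 0, and flatten $B_N$ row-major into one of 36 histogram bins.

```python
def symbol_code(v_in, h0):
    """A_N in 0..5: how many of the five BR thresholds the sample exceeds."""
    thresholds = (-3 * h0, -2 * h0, 0.0, 2 * h0, 3 * h0)
    return sum(1 for th in thresholds if v_in > th)


def pair_index(a_n, a_next):
    """Flatten B_N = (A_N, A_N+1) into one of 36 histogram bins (row-major)."""
    if not (0 <= a_n <= 5 and 0 <= a_next <= 5):
        raise ValueError("A_N must be in 0..5")
    return 6 * a_n + a_next


def pattern_histogram(samples, h0):
    """Count the 36 B_N patterns over consecutive baud-rate samples."""
    counts = [0] * 36
    codes = [symbol_code(v, h0) for v in samples]
    for a_n, a_next in zip(codes, codes[1:]):
        counts[pair_index(a_n, a_next)] += 1
    return counts
```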



Fig. 4.8 Block diagram of the proposed SPD logic

To learn the "Early" and "Late" patterns, histograms of $B_N$ sampled at the "Early" and "Late" clock phases are used. First, modeling simulations with the "Early" or "Late" phase are performed, and $B_N$ is collected and counted over a sufficiently long time. For two consecutive symbols, $B_N$ has 36 possible values. The blue and red lines in Fig. 4.9, located where the PAM-4 eye begins to open, are designated as the "Early" and "Late" clock phases, respectively. The counts are normalized as shown in Fig. 4.10. For example, the (0,4) pattern has only a red line and no blue line. It does not occur in the early phase, so $B_N$ = (0,4) occurs only in the late phase.

Fig. 4.9 Eye diagram of PAM-4 with Baud-rate samplers

Then, the SPD weights in Fig. 4.11 are derived stochastically as the difference between the "Late" and "Early" conditional probabilities using Bayes' theorem, and are then quantized:

$$W(B_N) = \Pr(Late|B_N) - \Pr(Early|B_N) = \frac{\Pr(B_N|Late) \cdot \Pr(Late)}{\Pr(B_N)} - \frac{\Pr(B_N|Early) \cdot \Pr(Early)}{\Pr(B_N)} \tag{4.3}$$
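
Equation (4.3) can be evaluated directly from the Early and Late pattern counts. The sketch below assumes equal priors, $\Pr(Late) = \Pr(Early) = 0.5$, and sets the weight of any pattern that never occurs to zero. Both are assumptions. The counts would come from modeling simulations such as those behind Fig. 4.10. The function name is hypothetical.

```python
def spd_weights(early_counts, late_counts, p_late=0.5):
    """Eq. (4.3): W(B) = Pr(Late|B) - Pr(Early|B) for every pattern bin B."""
    n_e, n_l = sum(early_counts), sum(late_counts)
    if n_e == 0 or n_l == 0:
        raise ValueError("both histograms need at least one count")
    p_early = 1.0 - p_late
    weights = []
    for c_e, c_l in zip(early_counts, late_counts):
        lik_e = c_e / n_e                              # Pr(B | Early)
        lik_l = c_l / n_l                              # Pr(B | Late)
        evidence = lik_l * p_late + lik_e * p_early    # Pr(B), total probability
        if evidence == 0.0:
            weights.append(0.0)                        # never observed: no information
        else:
            weights.append((lik_l * p_late - lik_e * p_early) / evidence)
    return weights
```

A pattern seen only in the Late simulation gets weight +1, a pattern seen only in the Early simulation gets -1, and a pattern equally likely in both gets 0.
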

Weights are calculated for all $B_N$ patterns and quantized to 4 bits for the digital implementation. Fig. 4.11 shows the quantized weights. The SPD logic multiplies the occurrence count of each pattern by its quantized weight and sums all the results into $P_{ERR}$. $P_{ERR}$ is multiplied by the DLF gain $k_{INT}$, and the DLF generates the phase control word that controls the PR. In modeling simulation, multiplying and accumulating a random input sequence with the pre-determined weights yields a well-behaved SPD gain curve, as shown in Fig. 4.12.

Fig. 4.10 Histogram of the *B<sub>N</sub>* with "Early" and "Late" phase
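
The quantization, $P_{ERR}$ accumulation, and loop filter can be sketched as follows. The symmetric 4-bit range (±7), the accumulator width, the number of PR steps, and the sign convention (a positive $P_{ERR}$ moves the phase earlier) are hypothetical choices, not the chip's. The names are also illustrative.

```python
def quantize_weights(weights, bits=4):
    """Round weights in [-1, 1] to signed integers within +/-(2**(bits-1) - 1)."""
    full_scale = (1 << (bits - 1)) - 1          # 7 for 4-bit weights
    return [max(-full_scale, min(full_scale, round(w * full_scale))) for w in weights]


def spd_phase_error(pattern_counts, q_weights):
    """P_ERR: sum over patterns of (occurrence count x quantized weight)."""
    return sum(c * w for c, w in zip(pattern_counts, q_weights))


class IntegralDLF:
    """Integral-only digital loop filter driving a phase rotator (hypothetical sizes)."""

    def __init__(self, k_int, frac_bits=8, pr_steps=128):
        self.k_int = k_int          # integer loop gain k_INT
        self.frac_bits = frac_bits  # accumulator bits below one PR step
        self.pr_steps = pr_steps    # PR phase steps per clock period
        self.acc = 0

    def update(self, p_err):
        # A positive P_ERR (net "Late") is assumed to move the phase earlier.
        self.acc -= self.k_int * p_err
        return (self.acc >> self.frac_bits) % self.pr_steps   # phase control word
```

The fractional accumulator bits average many small $P_{ERR}$ values before the PR moves by one step, which keeps the recovered phase from dithering on every update.
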







Fig. 4.12 Simulated SPD gain curve

### 4.3.2 Comparison of the Stochastic PD with SS-MMPD

This section compares the SPD with the SS-MMPD. Since the proposed SPD uses weights derived from a large data set, it locks closer to the optimum than existing logic-based approaches, including the SS-MMPD. Assuming that the SS-MMPD in [24] is applied to PAM-4 with two error samplers ($E_H$, $E_L$) per symbol, the phase error is detected for only four cases of $B_N$ (Early: (0,4) or (5,1); Late: (1,5) or (4,0)), as shown in Fig. 4.13. Therefore, the SS-MMPD can be represented in the same weight space as the SPD. The weights of the SS-MMPD and SPD are shown in Fig. 4.14. The SPD weights cover all the cases in which the SS-MMPD has nonzero weights.
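
To place the SS-MMPD in the same weight space as the SPD, its four decision patterns can be entered into a 6×6 table indexed by $(A_N, A_{N+1})$. The sketch uses the sign convention of (4.3), positive for Late, and unit magnitudes. The magnitudes and helper names are assumptions.

```python
EARLY_PATTERNS = ((0, 4), (5, 1))   # SS-MMPD "Early" cases from the text
LATE_PATTERNS = ((1, 5), (4, 0))    # SS-MMPD "Late" cases from the text


def ss_mmpd_weight_table():
    """6x6 table indexed [A_N][A_N+1]; +1 = Late, -1 = Early, 0 = no decision."""
    table = [[0] * 6 for _ in range(6)]
    for a_n, a_next in EARLY_PATTERNS:
        table[a_n][a_next] = -1
    for a_n, a_next in LATE_PATTERNS:
        table[a_n][a_next] = +1
    return table


def covers(spd_table, mmpd_table):
    """True if the SPD assigns a nonzero weight wherever the SS-MMPD does."""
    return all(spd_table[i][j] != 0
               for i in range(6) for j in range(6) if mmpd_table[i][j] != 0)
```

All 32 remaining cells of the SS-MMPD table are zero, which is why that detector ignores most of the pattern information that the SPD can use.
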

Fig. 4.15 shows examples of the SBR and the corresponding PAM-4 VEO curves with and without a 1-tap DFE. Without DFE, the SS-MMPD locks at the point where $h_1 = h_{-1}$, where the VEO is insufficient due to ISI, as shown in Fig. 4.15 (a). With a 1-tap adaptive DFE that forces the residual $h_1$ to zero, the SS-MMPD locks wherever $h_{-1}$ is zero and eventually drifts, as shown in Fig. 4.15 (b). Consequently, it suffers from a severe multiple-locking problem with an adaptive DFE. In contrast, even with an adaptive DFE, the proposed SPD tracks a unique and optimal clock phase that maximizes the VEO, because its weights are determined from the input conditions.



Fig. 4.14 Weights of the proposed SPD and the SS-MMPD



Fig. 4.15 Lock point on SBR of the SS-MMPD and the proposed SPD, (a) w/o DFE and (b) w/ 1-tap DFE

## 4.4 Phase Detection for Multi-Level Signaling

### 4.4.1 Proposed Baud-rate Phase Detector for Multi-Level Signal

This section explains techniques for detecting the phase of multi-level signals with the BR method. The MMPD, which is widely used as a BRPD, has a lock point where $h_{-1}$ and $h_1$ are equal on the SBR. If $h_{-1}$ and $h_1$ are not equalized sufficiently close to zero, the eye diagram is difficult to open, and a sufficient BER cannot be obtained due to SNR degradation. In particular, $h_{-1}$ is hard to equalize unless the TX FFE performs sufficient equalization. In addition, in RXs using an adaptive DFE, the lock point of the MMPD slips [34]. Because the DFE adaptation loop drives $h_1$ to zero, the MMPD locks where $h_{-1}$ also becomes zero. Since such positions are not unique, the lock point drifts. An independent adaptation loop can be configured to solve this wandering problem, but it consumes more power and area.

The eye magnitude of the conventional PAM-2 can be estimated through the single-bit response as

$$V_{Eye,PAM-2} = 2 \cdot \left( h_0 - \sum_{k \neq 0}^n |h_k| \right) \tag{4.4}$$

where $h_k$ is the magnitude of the $k$-th cursor of the SBR. Conventional RXs adopt StrongArm latches as samplers and use a DFE to compensate post-cursor ISI. For an RX that uses an N-tap DFE, the equalized eye magnitude is

$$V_{EQ,Eye,PAM-2} = 2 \cdot \left( h_0 - \sum_{k \neq 0, 1, \dots, N}^n |h_k| \right) \tag{4.5}$$

The DFE effectively removes the post-cursor ISI, but it cannot equalize the pre-cursor ISI. In other words, even with DFE-based ISI compensation, an RX using StrongArm latch samplers cannot be fully equalized by the DFE.

Fig. 4.16 Estimated VEM with DFE of the PAM-2 and PAM-4

For a PAM-4 RX that uses an N-tap DFE, the equalized eye magnitude is

$$V_{EQ,Eye,PAM-4} = 2 \cdot \left(h_0 - 3 \cdot \sum_{k \neq 0, 1, \dots, N}^n |h_k|\right) \tag{4.6}$$
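
Equations (4.4) to (4.6) reduce to one expression once the ISI weight is written as $L-1$ for PAM-$L$. This gives 1 for PAM-2, 3 for PAM-4, and 7 for PAM-8, matching the factors used in this section. The sketch below evaluates the expression for the residual cursors that the DFE leaves behind. The function names are hypothetical.

```python
def eye_height(h0, residual_isi, levels):
    """Worst-case inner-eye height, Eqs. (4.4)-(4.6) generalized to PAM-L.

    residual_isi lists the cursors that the DFE cannot cancel (e.g., [h_-1]).
    The ISI sum is weighted by (levels - 1): 1 for PAM-2, 3 for PAM-4.
    """
    return 2.0 * (h0 - (levels - 1) * sum(abs(h) for h in residual_isi))


def min_cursor_ratio(levels):
    """Smallest h0/|h_-1| that still opens the eye when only h_-1 remains."""
    return levels - 1


if __name__ == "__main__":
    for levels in (2, 4, 8):
        print(levels, eye_height(0.2, [0.05], levels), min_cursor_ratio(levels))
```

For the same $h_0 = 0.2$ and $h_{-1} = 0.05$, the PAM-2 eye remains open, the PAM-4 eye shrinks to a third of that, and the PAM-8 eye closes. This is why the cursor ratio must be controlled more tightly as the number of levels grows.
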

With PAM-4 signaling, SNR degradation becomes a bigger problem. The SNR loss of PAM-4 reaches 11 dB when non-linearity is considered. Because of this SNR degradation, residual ISI seriously degrades the BER. In other words, ISI must be controlled more precisely because it has a greater influence on multi-level signals: compared with PAM-2, the ISI term is weighted three times for PAM-4 and seven times for PAM-8. Furthermore, since pre-cursor ISI is more difficult to remove in the RX, the ratio of $h_0$ to the pre-cursor ISI determines the maximum vertical eye margin.

Fig. 4.17 Estimated VEM versus the cursor ratio of the  $h_0$  and  $h_{-1}$ 

Fig. 4.16 shows the estimated vertical eye margin (VEM) with DFE for PAM-2 and PAM-4. As shown in Fig. 4.16, even with an adaptive DFE, the VEM of multi-level signaling is reduced by $h_{-1}$. Fig. 4.16 also shows that the MMPD lock point cannot secure sufficient VEM. The estimate assumes a smooth-loss channel whose post-cursor ISI is removed by the DFE. As (4.5) and (4.6) show, when only $h_{-1}$ remains, the eye opens only if the ratio of $h_0$ to $h_{-1}$ is greater than 3 for PAM-4 and greater than 1 for PAM-2.



Fig. 4.18 Lock point of the proposed BRPD for multi-level signaling on SBR

In conclusion, based on this analysis, we propose a PD with the following properties. First, $h_0$ and $h_{-1}$ determine the lock point. Second, the target ratio of $h_0$ to $h_{-1}$ is set by an externally adjustable coefficient $N_t$. Fig. 4.17 shows the estimated VEM versus the cursor ratio M (= $h_0/h_{-1}$). As shown in Fig. 4.18, the lock point of the proposed PD on the SBR is

$$h_0 = N_t \cdot h_{-1}. \tag{4.7}$$

The proposed PD has two advantages. First, it is suitable for the adaptive DFE commonly used in RXs. Its lock point is independent of the post-cursor ISI, so the lock point does not change even when the residual post-cursor ISI changes due to DFE adaptation. Second, the influence of $h_{-1}$ can be controlled directly. Since an externally adjustable factor sets the ratio of $h_0$ to $h_{-1}$, the factor can be chosen according to the modulation format. For example, in PAM-4, $h_0$ should be more than three times $h_{-1}$ to reduce the influence of $h_{-1}$, and in PAM-8, $N_t$ should be set larger than 7 to open the vertical eye.

Fig. 4.19 Simulated M versus time with smooth-loss channel

The proposed PD requires a smooth-loss channel to operate correctly: the ratio M of $h_0$ to $h_{-1}$ must increase or decrease monotonically within 1 UI. If it is not monotonic, multiple lock points may exist. Fig. 4.19 shows the simulated M for a channel with smooth loss. The ratio M gradually decreases over time. Hence, the sampling phase is early if M is greater than the externally set value $N_t$ and late if it is smaller than $N_t$.

### 4.4.2 Data level and DFE coefficient adaptation

One of the critical elements of clock recovery is determining the magnitudes of $h_0$ and $h_{-1}$, and hence the cursor ratio M (= $h_0/h_{-1}$). Data-level adaptation is used to find the cursor magnitudes. Data-level adaptation is part of adaptive equalization, and the conventional method, based on the SS-LMS algorithm, finds the magnitude of $h_0$. Furthermore, un-even data-level adaptation (UDA) enables accurate equalizer adaptation when $h_{-1}$ remains [14]. In that case, UDA adapts the data level to a combination of $h_0$ and $h_{-1}$. For the data level of $h_0+h_{-1}$, the following update equation is used:

$$V_{Dlev}[n+1] = \begin{cases} V_{Dlev}[n] + 3 \cdot \mu_{Dlev} \cdot E_n, & E_n > 0\\ V_{Dlev}[n] + 1 \cdot \mu_{Dlev} \cdot E_n, & E_n < 0 \end{cases}, \quad D_n > 0 \tag{4.8}$$

Also, for the data level of  $h_0$ - $h_{-1}$ , the following update equation is used

$$V_{Dlev}[n+1] = \begin{cases} V_{Dlev}[n] + 1 \cdot \mu_{Dlev} \cdot E_n, & E_n > 0\\ V_{Dlev}[n] + 3 \cdot \mu_{Dlev} \cdot E_n, & E_n < 0 \end{cases}, \quad D_n > 0 \tag{4.9}$$

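The lock point of $V_{Dlev}$ is set to $h_0+h_{-1}$ or $h_0-h_{-1}$ by adjusting the ratio of the "UP" and "DN" coefficients. With only $h_{-1}$ remaining and symmetric noise, the samples with $D_n > 0$ form two equiprobable peaks at $h_0 \pm h_{-1}$. A 3:1 UP:DN weighting reaches equilibrium when one quarter of the samples lie above $V_{Dlev}$, which occurs at the center of the upper peak. A 1:3 weighting places $V_{Dlev}$ at the center of the lower peak, where three quarters of the samples lie above it.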

By extending UDA to PAM-4 signaling, a data level consisting of a combination of $h_0$ and $h_{-1}$ can be obtained. Fig. 4.20 shows the eye diagram at the DFE summer and the data histogram for +3 data, assuming that the post-cursor ISI is removed. For $D_n = +3$, the PAM-4 histogram forms four equiprobable peaks at $3h_0 + D_{n+1}h_{-1}$, that is, at $3h_0 \pm 3h_{-1}$ and $3h_0 \pm h_{-1}$. The data level for each peak is obtained using the SS-LMS coefficients in the table in Fig. 4.20. For example, the data level corresponding to $3h_0+3h_{-1}$ is obtained with the adaptation coefficients $\mu_{up}:\mu_{dn} = 7:1$, since only one eighth of the samples lie above that level, making DN seven times more probable than UP. The data levels at the other peaks are obtained simply by adjusting the "UP" and "DN" coefficients. A data level consisting of $h_0$ and $h_{-1}$ is obtained through the following update equation:

$$V_{Dlev}[n+1] = \begin{cases} V_{Dlev}[n] + \mu_{up} \cdot \mu_{Dlev} \cdot E[n], & E[n] > 0\\ V_{Dlev}[n] + \mu_{dn} \cdot \mu_{Dlev} \cdot E[n], & E[n] < 0 \end{cases}, \quad \text{for } D_n = +3 \tag{4.10}$$

For example, the data level $3h_0-3h_{-1}$ is obtained with $\mu_{up}:\mu_{dn} = 1:7$ adaptation.
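
The peak-selection behavior of (4.10) can be reproduced with a short behavioral simulation. The sketch below relies on several assumptions and is not the chip's adaptation logic. Only $h_0$ and $h_{-1}$ are modeled, the noise is Gaussian, and all numeric values are hypothetical. With $h_0 = 0.1$ and $h_{-1} = 0.01$, the four peaks lie at 0.27, 0.29, 0.31, and 0.33, and the UP:DN ratio selects the peak on which $V_{Dlev}$ settles.

```python
import random


def uda_pam4(mu_up, mu_dn, h0=0.1, h_pre=0.01, sigma=0.002,
             mu=1e-5, n_samples=20000, seed=1):
    """Sign-sign un-even data-level adaptation, Eq. (4.10), for D_n = +3.

    Only h0 and the pre-cursor h_pre (h_-1) are modeled; post-cursor ISI is
    assumed to be removed by the DFE. The four equiprobable peaks lie at
    3*h0 + h_pre*D_(n+1), with D_(n+1) in {-3, -1, +1, +3}.
    """
    rng = random.Random(seed)
    v_dlev = 3 * h0                      # start from the conventional level
    for _ in range(n_samples):
        d_next = rng.choice((-3, -1, 1, 3))
        y = 3 * h0 + h_pre * d_next + rng.gauss(0.0, sigma)
        if y > v_dlev:                   # E[n] > 0: "UP"
            v_dlev += mu_up * mu
        else:                            # E[n] < 0: "DN"
            v_dlev -= mu_dn * mu
    return v_dlev


if __name__ == "__main__":
    for ratio in ((7, 1), (5, 3), (3, 5), (1, 7)):
        print(ratio, round(uda_pam4(*ratio), 4))
```

Equilibrium requires $\mu_{up}\Pr(UP) = \mu_{dn}\Pr(DN)$. The fraction of samples above the level is therefore $1/8$, $3/8$, $5/8$, or $7/8$ for the ratios 7:1, 5:3, 3:5, and 1:7, which places $V_{Dlev}$ at the center of the corresponding peak.
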

For NRZ signaling, DFE adaptation with an un-even data level has been shown to work with the following update:

$$w[n+1]_{k} = w[n]_{k} + \mu_{DFE} \cdot E[n] \cdot D[n-k], \quad \text{for } D[n] > 0 \tag{4.11}$$

where $w_k$ is the $k$-th DFE tap coefficient and $k$ is the tap index. Extending the equation to PAM-4 signaling gives

$$w[n+1]_{k} = w[n]_{k} + \mu_{DFE} \cdot E[n] \cdot \mathrm{sgn}(D[n-k]), \quad \text{for } D[n] = +3 \tag{4.12}$$

The direction of the DFE update is determined by the error E[n] and the sign of the data D[n-k]. Fig. 4.21 compares DFE adaptation using the conventional data level with DFE adaptation using the un-even data level for PAM-4 signals. The comparison divides the operation into regions according to the magnitudes of $h_{1}-w_{1}$ and $h_{-1}$. It is assumed that $h_{-1}$ exists and that a 1-tap DFE is used. Furthermore, assuming that all data patterns that update the data level appear with equal probability, the net update direction of the DFE tap coefficient follows from the UP and DN probabilities in each region. The conventional data level has a magnitude of $3h_{0}$. In the presence of $h_{-1}$, the SS-LMS algorithm with this level does not find the exact point where $h_{1}-w_{1}$ becomes zero, and the coefficient wanders. In the case that

$$0 < |D[n-1]| \cdot (h_1 - w_1) < |D[n+1]| \cdot h_{-1}$$
(4.12)

the conventional data-level adaptation yields equal UP and DN probabilities, so in the presence of $h_{-1}$ the DFE adaptation does not converge. With the un-even data level, on the other hand, $V_{Dlev}$ comprises a combination of $h_0$ and $h_{-1}$. As shown in Fig. 4.21, the UP and DN probabilities of the tap-coefficient update are equal only in the region where $h_1$ and $w_1$ are the same. Thus, with the un-even data level, as in NRZ signaling, the correct tap coefficient is found even if $h_{-1}$ exists.
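
The effect summarized in Fig. 4.21 can be illustrated with a behavioral simulation of (4.12). In this sketch, $V_{Dlev}$ is held at its converged value instead of being adapted jointly. Only $h_{-1}$, $h_1$, and Gaussian noise are modeled, and all values and names are hypothetical. With the un-even level, $w_1$ settles tightly around $h_1$. With the conventional level $3h_0$, $w_1$ random-walks inside a dead zone in which the error sign does not depend on $D[n-1]$.

```python
import random
import statistics


def adapt_w1(uneven, h0=0.1, h_pre=0.01, h1=0.03, sigma=5e-4,
             mu=1e-4, n_symbols=40000, seed=2):
    """Sign-sign LMS of a 1-tap DFE, Eq. (4.12), updated only when D[n] = +3.

    The DFE-summer output is modeled as
        y = h0*D[n] + h_pre*D[n+1] + (h1 - w1)*D[n-1] + noise,
    and E[n] = sgn(y - V_Dlev). If uneven is True, V_Dlev is held at the
    un-even level 3*h0 + 3*h_pre (assumed already adapted); otherwise, it is
    held at 3*h0. Returns the w1 trajectory, with one entry per update.
    """
    rng = random.Random(seed)
    v_dlev = 3 * h0 + (3 * h_pre if uneven else 0.0)
    d = [rng.choice((-3, -1, 1, 3)) for _ in range(n_symbols)]
    w1, traj = 0.0, []
    for n in range(1, n_symbols - 1):
        if d[n] != 3:                          # update only on D[n] = +3
            continue
        y = (h0 * d[n] + h_pre * d[n + 1]
             + (h1 - w1) * d[n - 1] + rng.gauss(0.0, sigma))
        e = 1 if y > v_dlev else -1            # error-sampler sign
        w1 += mu * e * (1 if d[n - 1] > 0 else -1)
        traj.append(w1)
    return traj


def tail_stats(traj):
    """Mean and population standard deviation of the second half of a trajectory."""
    tail = traj[len(traj) // 2:]
    return statistics.mean(tail), statistics.pstdev(tail)


if __name__ == "__main__":
    print("un-even     :", tail_stats(adapt_w1(True)))
    print("conventional:", tail_stats(adapt_w1(False)))
```

In the un-even case, only the transitions with $D[n+1] = +3$ carry information about $h_1 - w_1$, and those transitions always push $w_1$ toward $h_1$. The other transitions produce zero-mean updates.
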

Fig. 4.21 Comparison of the conventional data-level adaptation and un-even data-level adaptation

### 4.4.3 Proposed phase detector

The proposed PD secures a sufficient eye opening. In multi-level signaling in particular, the choice of lock point is critical because the signal is more affected by ISI and the SNR is reduced. Since $h_{-1}$ is difficult to remove in the RX, the PD locks at a point where $h_{-1}$ is small relative to $h_0$. To obtain $h_{-1}$ and $h_0$, their magnitudes are inferred using un-even data-level adaptation.

The analysis of the proposed PD assumes that the adaptive DFE has removed all post-cursor ISI. It also assumes a smooth-loss channel, so that the ratio of $h_0$ to $h_{-1}$ decreases monotonically. Fig. 4.22 illustrates how the phase error is generated from the consecutive data and the sign of the $V_{PD}$ error.

Fig. 4.22 Generating phase error based on the consecutive data  $(D_N, D_{N+1}) = (+3, -3)$ and the sign of the  $V_{PD}$  error

As explained in Section 4.4.2, un-even data-level adaptation sets the data level to the combination $3h_0+3h_{-1}$. $V_{PD}$ is calculated from $V_{Dlev}$ as follows:

$$V_{PD} = V_{Dlev} \cdot \left(\frac{3N_t - 3}{3N_t + 3}\right) \tag{4.13}$$

The SDL calculates (4.13), and the IDAC converts the result into a sampling threshold. The proposed BRPD detects the phase error on data transitions in which the current symbol is "+3" and the following symbol is "-3". For these two consecutive symbols, the input is

$$V_{IN}(t) = 3 \cdot h_0(t) - 3 \cdot h_{-1}(t), \quad \text{for } (D_N, D_{N+1}) = (+3, -3) \tag{4.14}$$



Fig. 4.23 Simulated lock point of the proposed BRPD when Nt is 12

$V_{IN}$ equals $V_{PD}$ at the point where $h_0$ is $N_t$ times $h_{-1}$. Substituting $V_{Dlev} = 3h_0 + 3h_{-1}$ into (4.13) and using $M = h_0/h_{-1}$ gives

$$V_{IN} - V_{PD} = 3h_{-1}\left[(M-1) - (M+1)\frac{N_t-1}{N_t+1}\right] = \frac{6h_{-1}(M - N_t)}{N_t + 1},$$

so for $h_{-1} > 0$, $V_{IN}$ is larger than $V_{PD}$ exactly when M is larger than $N_t$. Because a smooth-loss channel is assumed, M is greater than $N_t$ in the "Early" phase and smaller than $N_t$ in the "Late" phase. Therefore, for $(D_N, D_{N+1}) = (+3, -3)$, $V_{IN}$ is larger than $V_{PD}$ in the "Early" phase and smaller than $V_{PD}$ in the "Late" phase. The magnitude comparison of $V_{IN}$ and $V_{PD}$ is performed in the error sampler shown in Fig. 4.2.
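
The lock behavior can be checked with a toy bang-bang loop. The single-bit response $p(t) = (t/a)^2 e^{-t/a}$ used here is a hypothetical smooth-loss pulse, chosen only because its ratio $M$ decreases monotonically. It is not a model of the measured channel. For this pulse, $M = (\tau/(\tau-1))^2 e^{-1/a}$, so the expected lock phase can be computed in closed form and compared with the loop's result. The function names are illustrative.

```python
import math


def pulse(t, a=0.5):
    """Hypothetical smooth-loss single-bit response; t in UI, zero for t <= 0."""
    return (t / a) ** 2 * math.exp(-t / a) if t > 0 else 0.0


def brpd_lock(n_t=12, tau0=1.5, step=1e-3, n_iter=4000):
    """Bang-bang BRPD loop driven only by (D_N, D_N+1) = (+3, -3) transitions.

    tau is the sampling instant in UI from the start of the current symbol's
    pulse, so h0 = p(tau) and h_-1 = p(tau - 1). V_Dlev is assumed to have
    already converged to 3*h0 + 3*h_-1 at the current phase (ideal UDA),
    and each iteration represents one detected (+3, -3) transition.
    """
    tau = tau0
    for _ in range(n_iter):
        h0, h_pre = pulse(tau), pulse(tau - 1.0)
        v_dlev = 3 * h0 + 3 * h_pre
        v_pd = v_dlev * (3 * n_t - 3) / (3 * n_t + 3)   # Eq. (4.13)
        v_in = 3 * h0 - 3 * h_pre                        # Eq. (4.14)
        early = v_in > v_pd                              # error-sampler decision
        tau += step if early else -step                  # early -> sample later
    return tau


if __name__ == "__main__":
    tau = brpd_lock()
    print(tau, pulse(tau) / pulse(tau - 1.0))   # lock phase and M at lock
```

The loop dithers by one step around the phase where $M = N_t$. This matches the closed-form solution $\tau = r/(r-1)$ with $r = \sqrt{N_t e^{1/a}}$.
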

To verify the proposed BRPD, a behavioral model of the proposed system was simulated. Fig. 4.23 shows the lock point of the proposed BRPD. In the simulation, $N_t$ is set to 12, which is greater than the minimum of 3 required for PAM-4. The BRPD locks at the point where $h_{-1}$ and $h_0$ are 16.07 mV and 211 mV on the SBR. The resulting M (= $h_0/h_{-1}$) is 13.13, which is close to the target value of 12. This confirms that the proposed BRPD locks near the target point. Furthermore, $h_1$ is compensated by DFE adaptation and becomes almost zero.

The detailed block diagram of the SDL using the BRPD for multi-level signaling is shown in Fig. 4.24. The SDL includes the DFE adaptation, UDA, threshold-level calculator, PD, and DLF. For phase detection, a pattern filter extracts the data $(D_N, D_{N+1}) = (+3, -3)$, and $E_{PD,N}$ determines the phase state. The BRPD logic generates the phase error $P_{ERR}$, which is multiplied by the DLF gain $k_i$. The DLF then generates the phase control word that controls the PR.



Fig. 4.24 Detailed block diagram of the SDL using the BRPD suitable for multi-level signaling

## 4.5 Measurement Results

The proposed PAM-4 RX prototype is fabricated in 40 nm CMOS, occupies 0.24 mm<sup>2</sup>, and consumes 116.3 mW. The area of each sub-block is listed in Table IV.I, and the chip photomicrograph is shown in Fig. 4.25. The DAC, which is thermometer-coded for the data-level and equalizer adaptation, occupies a large area. The SDL, which contains the two types of PD, the data-level calculation logic, and the equalizer adaptation logic, occupies the largest area. The prototype chip achieves an energy efficiency of 2.42 pJ/b at 48 Gb/s. The detailed power breakdown is shown in Fig. 4.26. Most of the power is consumed by the AFE, DAC, DES, and SDL. The RX is tested with a PRBS-7 pattern.

Table IV.I Detailed area of the sub-blocks

| Block Description           | Area (µm²) |
|-----------------------------|------------|
| Analog Front-end            | 3626       |
| Digital-Analog Converter    | 82031      |
| Phase Rotator               | 81039      |
| De-serializer               | 17578      |
| Synthesizable Digital Logic | 134463     |



Fig. 4.25 Chip photomicrograph of the proposed RX



Fig. 4.26 Detailed power breakdown of the proposed RX (2.4 pJ/b at 48 Gb/s)

The block diagram of the measurement setup is shown in Fig. 4.27. The measurement is automated with Python scripts that control the chip through I2C; the scripts perform eye scanning, bathtub plotting, and jitter tolerance measurement. PAM-4 signals are generated by combining the PRBS-7 MSB and LSB streams from the pattern generator with passive power dividers, and a 6-dB attenuator halves the swing of the LSB stream. Two kinds of channels are used to verify the equalizer performance. PAM-2 and PAM-4 signals are used to verify the proposed phase detection method over channels with different losses, as shown in Fig. 4.28. The measured insertion losses at the Nyquist frequency are 19 dB for PAM-2 and 4 dB for PAM-4. The channels have a smooth $S_{21}$ response, as explained in Section IV. Because the system is synchronous, a 12 GHz clock is supplied from the pattern generator to the prototype chip. The prototype chip achieves a BER of $10^{-11}$; the measured eye diagram is shown in Fig. 4.29, where the sampling thresholds of the data samplers are also plotted.

Fig. 4.27 Block diagram of the measurement setup
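The arithmetic behind this setup can be sketched as follows, assuming an ideal combiner (its common loss scales all levels equally and is ignored) and taking "half-rate" to mean one clock cycle per two symbols. At 48 Gb/s PAM-4 the symbol rate is 24 GBd, so both the half-rate clock and the Nyquist frequency are 12 GHz, matching the supplied clock. A 6-dB LSB attenuation gives a gain of about 0.501, so the four levels are only approximately equally spaced. Function names are illustrative.

```python
from typing import List, Tuple


def pam4_levels(lsb_atten_db: float = 6.0) -> List[float]:
    """Levels from summing an MSB stream and an attenuated LSB stream (ideal combiner assumed)."""
    g = 10 ** (-lsb_atten_db / 20)  # amplitude gain of the LSB attenuator (~0.501 for 6 dB)
    return sorted(m + g * l for m in (-1, 1) for l in (-1, 1))


def clock_plan(data_rate_gbps: float, bits_per_symbol: int, rate_division: int) -> Tuple[float, float, float]:
    """Return (symbol rate in GBd, sampling clock in GHz, Nyquist frequency in GHz)."""
    baud_gbd = data_rate_gbps / bits_per_symbol
    clock_ghz = baud_gbd / rate_division
    nyquist_ghz = baud_gbd / 2
    return baud_gbd, clock_ghz, nyquist_ghz


levels = pam4_levels()
print([round(v, 3) for v in levels])      # roughly -1.5, -0.5, 0.5, 1.5
print(clock_plan(48, 2, 2))               # PAM-4, half-rate
```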



Fig. 4.28 Measured insertion loss of the channel for PAM-2 and PAM-4 signaling



Fig. 4.29 Measured eye diagram of the prototype RX at BER of 10<sup>-11</sup>

## 4.5.1 Measurement of the Proposed Stochastic Baud-Rate Phase Detection

The behavioral simulation of the SPD CDR locking in Fig. 4.30 shows that, with the optimal weights, the stochastic BRPD locks at a point closer to the SBR peak than the SS-MMPD does. For measurement, the test chip is wire-bonded to a test board. Fig. 4.31 shows the measured bathtub curves for the MSB and LSB with a BER criterion of 10<sup>-11</sup>; the proposed stochastic BRPD locks at the center of the eye opening. Fig. 4.32 shows the measured JTOL with the same BER criterion of 10<sup>-11</sup>. The overall performance of the RX is summarized and compared with other PAM-4 RXs in Table IV.II.
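For reference, the SS-MMPD baseline that the stochastic BRPD is compared against can be sketched as below. This is the textbook sign-sign form of the Mueller-Müller timing function [32], not the proposed stochastic PD; the PAM-2 channel with one pre-cursor and one post-cursor, the ideally known $h_0$, and all names are assumptions for illustration. The average vote is positive when $h_1$ dominates and negative when $h_{-1}$ dominates, so the loop settles where they balance. Because that balance point depends on the residual $h_1$, it can shift when an adaptive DFE changes $h_1$.

```python
import random
from typing import List, Sequence, Tuple


def sgn(x: float) -> int:
    """Slicer-style sign: +1 for x >= 0, otherwise -1."""
    return 1 if x >= 0 else -1


def ss_mm_average(d: Sequence[int], e: Sequence[float]) -> float:
    """Mean sign-sign Mueller-Muller vote: sgn(e[n]) * d[n-1] - sgn(e[n-1]) * d[n]."""
    votes = [sgn(e[n]) * d[n - 1] - sgn(e[n - 1]) * d[n] for n in range(1, len(d))]
    return sum(votes) / len(votes)


def pam2_channel(h_m1: float, h1: float, n: int, seed: int = 0) -> Tuple[List[int], List[float]]:
    """Toy PAM-2 link. Errors are y[k] - h0*d[k] with h0 known exactly, so h0 cancels."""
    rng = random.Random(seed)
    sym = [rng.choice((-1, 1)) for _ in range(n + 2)]
    d = sym[1:n + 1]
    # Pre-cursor comes from the next symbol, post-cursor from the previous one.
    e = [h_m1 * sym[k + 1] + h1 * sym[k - 1] for k in range(1, n + 1)]
    return d, e


print(ss_mm_average(*pam2_channel(h_m1=0.1, h1=0.3, n=2000)))  # post-cursor dominates
print(ss_mm_average(*pam2_channel(h_m1=0.3, h1=0.1, n=2000)))  # pre-cursor dominates
```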



Fig. 4.30 Behavioral modeling simulation result of the proposed SPD CDR


Fig. 4.31 Measured bathtub curves with the proposed SPD



Fig. 4.32 Measured JTOL at BER of 10<sup>-11</sup> with proposed SPD CDR

Table IV.II Comparison of overall performance with other PAM-4 RXs

| Reference       | Technology Node | Modulation | PD Type                   | Data Rate [Gb/s] | Clocking     | Number of Samplers | Equalizer                          | Area [mm²] | Power [mW] | Energy Efficiency [pJ/b] | BER                |
|-----------------|-----------------|------------|---------------------------|------------------|--------------|--------------------|------------------------------------|------------|------------|--------------------------|--------------------|
| ISSCC 2019 [20] | 28nm CMOS       | PAM-4      | Oversampling              | 25.6             | Quarter-rate | 12                 | CTLE                               | -          | 56.8 (RX)  | 2.22 (RX)                | <10<sup>-12</sup>  |
| JSSC 2019 [21]  | 65nm CMOS       | PAM-4      | Oversampling              | 56               | Quarter-rate | 20                 | CTLE, 1-tap FIR, 1-tap IIR DFE     | 0.51       | 259        | 4.63                     | <10<sup>-12</sup>  |
| VLSI 2019 [22]  | 7nm FinFET      | PAM-4      | Baud-rate (MM algorithm)  | 56               | 32-way       | 32 × 8b ADC        | CTLE, 20-tap FFE, 1-tap DFE (DSP)  | 0.37       | 500        | 8.92                     | <10<sup>-7</sup>   |
| ISSCC 2020 [23] | 10nm FinFET     | PAM-4      | Baud-rate (MM algorithm)  | 56               | 32-way       | 32 × 7b ADC        | CTLE, 9-tap FFE, 1-tap DFE (DSP)   | 0.72/lane  | 431.2      | 7.7                      | <10<sup>-10</sup>  |
| This work       | 40nm CMOS       | PAM-4      | Baud-rate (stochastic PD) | 48               | Half-rate    | 10                 | CTLE, 1-tap DFE                    | 0.24       | 116.3      | 2.42                     | <10<sup>-11</sup>  |

## 4.5.2 Measurement of the Proposed Baud-Rate Phase Detection for Multi-Level Signals

Figs. 4.33 to 4.36 show the measurement results of the proposed BRPD for multi-level signals. The proposed RX achieves a BER below 10<sup>-11</sup> for the PAM-4 signal and below 10<sup>-12</sup> for the PAM-2 signal. Fig. 4.33 shows the measured data levels of the PAM-4 and PAM-2 signals, from which the eye diagram can be estimated. In Fig. 4.33(a), the data levels $3h_0+3h_{-1}$, $3h_0-3h_{-1}$, $h_0+3h_{-1}$, and $h_0-3h_{-1}$, obtained by the uneven data-level adaptation, are plotted against the sampling clock phase, and the sampling threshold for data high is also plotted. The eye height can therefore be predicted from the adapted data levels. Similarly, for the PAM-2 signal, the eye diagram is estimated from the uneven data levels. For the channel with smooth loss, $h_{-1}$ increases as the sampling clock phase lags.

To confirm the performance of the proposed PD for multi-level signals, $M$ (= $h_0/h_{-1}$) is calculated and plotted against the sampling clock phase, as shown in Fig. 4.34. In Fig. 4.34(a), N<sub>t</sub> is set to 40, and the proposed PD locks at the phase where M equals N<sub>t</sub>. Likewise, for the PAM-2 signal, N<sub>t</sub> is set to 4, and the proposed PD locks at the phase where M equals N<sub>t</sub>. As explained in the previous chapter, N<sub>t</sub> should be larger than 3 for PAM-4 and larger than 1 for PAM-2. Fig. 4.35 shows the measured bathtub curves for the PAM-4 and PAM-2 signals, and Fig. 4.36 shows the jitter tolerance at a BER of 10<sup>-11</sup> for the PAM-4 signal. The overall performance of the RX is summarized and compared with other BR RXs in Table IV.III.
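A toy model can illustrate why the lock point is unique and how N<sub>t</sub> moves it. The sketch assumes a hypothetical Gaussian SBR (width 0.6 UI), ignores all ISI except $h_{-1}$, and abstracts the BRPD into an idealized bang-bang loop that steps the phase later while M > N<sub>t</sub> and earlier otherwise; none of this is the measured channel or the thesis's circuit. Because $h_{-1}$ grows as the phase lags, M falls monotonically, so the loop dithers around the single phase where M = N<sub>t</sub>, and a larger N<sub>t</sub> locks earlier (less pre-cursor).

```python
import math

UI = 1.0      # one unit interval; phases are in UI relative to the SBR peak
WIDTH = 0.6   # hypothetical SBR width parameter


def sbr(t: float) -> float:
    """Hypothetical single-bit response: a Gaussian pulse peaking at t = 0."""
    return math.exp(-(t / WIDTH) ** 2)


def m_at_phase(phase: float) -> float:
    """M = h0 / h_{-1}; h_{-1} is the next symbol's pulse seen at this sampling instant."""
    return sbr(phase) / sbr(phase - UI)


def bang_bang_lock(n_t: float, phase: float = -0.5, step: float = 1 / 64, iters: int = 500) -> float:
    """Idealized bang-bang loop: later while M > N_t, earlier otherwise. Returns the final phase."""
    for _ in range(iters):
        phase += step if m_at_phase(phase) > n_t else -step
    return phase


def analytic_lock(n_t: float) -> float:
    """For this Gaussian SBR, M = exp((UI**2 - 2*phase*UI) / WIDTH**2); solve M = N_t."""
    return (UI ** 2 - WIDTH ** 2 * math.log(n_t)) / (2 * UI)


for n_t in (4, 12, 40):
    print(n_t, round(bang_bang_lock(n_t), 3), round(analytic_lock(n_t), 3))
```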



Fig. 4.33 Measured data levels of the (a) PAM-4 and (b) PAM-2




Fig. 4.34 Measured and calculated M versus sampling clock phase, with N<sub>t</sub>, for (a) PAM-4 and (b) PAM-2



Fig. 4.35 Measured bathtub curves of (a) PAM-4 at BER of 10<sup>-11</sup> and (b) PAM-2



Fig. 4.36 Measured JTOL at BER of 10<sup>-11</sup> with Baud-rate CDR for multi-level signaling

Table IV.III Comparison of overall performance with other BR RXs

| Reference       | Technology Node | Modulation | PD Type                          | Data Rate (Gb/s) | Link Loss (dB) | Equalizer                  | Power (mW) | Efficiency (pJ/b) | BER                    |
|-----------------|-----------------|------------|----------------------------------|------------------|----------------|----------------------------|------------|-------------------|------------------------|
| JSSC 2016 [35]  | 65nm CMOS       | PAM-2      | Baud-rate (slope detection)      | 60               | -              | CTLE, 2-tap FFE, 2-tap DFE | 173        | 2.88              | <10<sup>-9</sup>       |
| JSSC 2017 [26]  | 28nm CMOS       | PAM-2      | Baud-rate (pattern-based)        | 32               | 14.8           | CTLE, 1-tap DFE            | 102        | 3.19              | <10<sup>-12</sup>      |
| ASSCC 2017 [27] | 65nm CMOS       | PAM-4      | Baud-rate (bang-bang)            | 51               | -              | -                          | 321        | 6.27              | 3.4×10<sup>-9</sup>    |
| JSSC 2019 [28]  | 40nm CMOS       | PAM-2      | Baud-rate (integrating)          | 25               | -              | -                          | 52         | 2.1               | <10<sup>-12</sup>      |
| VLSI 2020 [29]  | 40nm CMOS       | PAM-2      | Baud-rate (maximum-eye tracking) | 28               | 20             | CTLE, 2-tap DFE            | 56.6       | 2.02              | <10<sup>-12</sup>      |

## Chapter 5

## Conclusion

In this thesis, designs of high-speed, low-power wireline RXs are explained. Specifically, circuit techniques for DC offset cancellation, a merged-summer DFE, a stochastic BR CDR, and a PD for multi-level signals are proposed. First, an RX with AOC and a merged-summer DFE is proposed. The proposed AOC engine removes the random DC offset of the data path by examining the sampled data and edge outputs of a random data stream. In addition, the proposed RX incorporates a shared-summer DFE in a half-rate structure to reduce the power dissipation and hardware complexity of the adaptive equalizer. A prototype chip fabricated in 40 nm CMOS technology occupies an active area of 0.083 mm<sup>2</sup>. Thanks to the AOC engine, the proposed RX achieves a BER below 10<sup>-12</sup> over a wide range of data rates from 1.62 to 10 Gb/s. The proposed RX consumes 18.6 mW at 10 Gb/s over a channel with 27 dB loss at 5 GHz, exhibiting a figure-of-merit of 0.068 pJ/b/dB.

Second, a 40 nm CMOS RX with BRPDs is proposed. The RX includes two PDs: a BRPD employing a stochastic technique and a BRPD suitable for multi-level signals. Because a BR CDR does not require an edge clock (*clk<sub>edge</sub>*), the proposed CDR reduces hardware complexity and, in turn, power consumption. In addition, the proposed SPD tracks the optimal phase-locking point that maximizes the vertical eye opening. Furthermore, despite residual ISI, the proposed BRPD for multi-level signals secures vertical eye margin, which is especially limited in multi-level signaling. Moreover, unlike the conventional MMPD, the proposed BRPD has a unique lock point even with an adaptive DFE. A prototype chip fabricated in 40 nm CMOS technology occupies an active area of 0.24 mm<sup>2</sup>. The proposed PAM-4 RX achieves a BER below 10<sup>-11</sup> at 48 Gb/s with an energy efficiency of 2.42 pJ/b.

## **Bibliography**

- [1] S. Hwang, et al., "A 1.62–5.4-Gb/s Receiver for DisplayPort Version 1.2a With Adaptive Equalization and Referenceless Frequency Acquisition Techniques," IEEE TCAS I, vol. 64, no. 10, pp. 2691-2702, Oct. 2017.
- [2] W. Jung, et al., "A 8.4Gb/s Low Power Transmitter with 1.66 pJ/b using 40:1 Serializer for DisplayPort Interface," ISOCC, 2020, pp. 41-42.
- [3] Y. Moon, et al., "A 2.41-pJ/bit 5.4-Gb/s Dual-Loop Reference-Less CDR With Fully Digital Quarter-Rate Linear Phase Detector for Embedded DisplayPort," IEEE TCAS I, vol. 66, no. 8, pp. 2907-2920, Aug. 2019.
- [4] G. Mandal, et al., "A 2.68mW/Gbps, 1.62-8.1Gb/s Receiver for Embedded DisplayPort Version1.4b to Support 14dB Channel Loss," A-SSCC, 2020, pp. 1-4.
- [5] P. S. Sahni, et al., "An Equalizer With Controllable Transfer Function for 6-Gb/s HDMI and 5.4-Gb/s DisplayPort Receivers in 28-nm UTBB-FDSOI," T-VLSI, vol. 24, no. 8, pp. 2803-2807, Aug. 2016.
- [6] "HDMI Specifications and programs," Available: https://www.hdmi.org/spec/index.
- [7] "PCI-SIG Specifications," Available: https://pcisig.com/specifications.
- [8] "DisplayPort Technical Overview," Available: http://www.vesa.org/wp-content/uploads/2011/01/ICCE-Presentation-on-VESA-DisplayPort.pdf.
- [9] J. Bulzacchelli, et al., "A 28Gb/s 4-tap FFE/15-tap DFE serial link transceiver in 32nm SOI CMOS technology," ISSCC, 2012, pp. 324-326.
- [10] G. Balamurugan, et al., "A scalable 5-15 Gbps, 14-75 mW low-power I/O transceiver in 65 nm CMOS," JSSC, vol. 43, no. 4, pp. 1010-1019, Apr. 2008.
- [11] J. Im, et al., "A 40-to-56 Gb/s PAM-4 Receiver With Ten-Tap Direct Decision-Feedback Equalization in 16-nm FinFET," JSSC, vol. 52, no. 12, pp. 3486-3502, Dec. 2017.
- [12] T. Norimatsu, et al., "A 25Gb/s Multistandard Serial Link Transceiver for 50dB-Loss Copper Cable in 28nm CMOS," ISSCC, 2016, pp. 66-68.
- [13] J. Lee, et al., "A 2.44-pJ/b 1.62–10-Gb/s Receiver for Next Generation Video Interface Equalizing 23-dB Loss With Adaptive 2-Tap Data DFE and 1-Tap Edge DFE," TCAS II, vol. 65, no. 10, pp. 1295-1299, Oct. 2018.
- [14] J. Lee, et al., "A 0.1pJ/b/dB 1.62-to-10.8Gb/s Video Interface Receiver with Fully Adaptive Equalization Using Un-Even Data Level," VLSI, 2019, pp. C198-C199.
- [15] S. Saxena, et al., "A 2.8mW/Gb/s 14Gb/s serial link transceiver in 65nm CMOS," VLSI, 2015, pp. C352-C353.
- [16] R. K. Nandwana, et al., "29.6 A 3-to-10Gb/s 5.75pJ/b transceiver with flexible clocking in 65nm CMOS," ISSCC, 2017, pp. 492-493.
- [17] S. Son, et al., "A 2× Blind Oversampling FSE Receiver with Combined Adaptive Equalization and Infinite-Range Timing Recovery," ASSCC, 2018, pp. 201-204.
- [18] M. Park, et al., "A 7Gb/s 9.3mW 2-Tap Current-Integrating DFE Receiver," ISSCC, 2007, pp. 230-599.
- [19] B. Razavi, Design of integrated circuits for optical communication, McGraw-Hill Professional, 200.
- [20] T. Toi, et al., "A 25.6Gb/s Uplink-Downlink Interface Employing PAM-4-Based 4-Channel Multiplexing and Cascaded CDR Circuits in Ring Topology for High-Bandwidth and Large-Capacity Storage Systems," ISSCC, 2019, pp. 478-480.
- [21] A. Roshan-Zamir, et al., "A 56-Gb/s PAM4 Receiver With Low-Overhead Techniques for Threshold and Edge-Based DFE FIR- and IIR-Tap Adaptation in 65-nm CMOS," JSSC, vol. 54, no. 3, pp. 672-684, Mar. 2019.
- [22] D. Pfaff, et al., "A 56Gb/s Long Reach Fully Adaptive Wireline PAM-4 Transceiver in 7nm FinFET," VLSI, 2019, pp. C270-C271.
- [23] B. Yoo, et al., "6.4 A 56Gb/s 7.7mW/Gb/s PAM-4 Wireline Transceiver in 10nm FinFET Using MM-CDR-Based ADC Timing Skew Control and Low-Power DSP with Approximate Multiplier," ISSCC, 2020, pp. 122-124.
- [24] F. Spagna, et al., "A 78mW 11.8Gb/s serial link transceiver with adaptive RX equalization and baud-rate CDR in 32nm CMOS," ISSCC, 2010, pp. 366-367.
- [25] K. Park, et al., "6.5 A 6.4-to-32Gb/s 0.96pJ/b Referenceless CDR Employing ML-Inspired Stochastic Phase-Frequency Detection Technique in 40nm CMOS," ISSCC, 2020, pp. 124-126.
- [26] W. Rahman, et al., "A 22.5-to-32-Gb/s 3.2-pJ/b Referenceless Baud-Rate Digital CDR With DFE and CTLE in 28-nm CMOS," JSSC, vol. 52, no. 12, pp. 3517-3531, Dec. 2017.
- [27] N. Qi, et al., "A 51Gb/s, 320mW, PAM4 CDR with baud-rate sampling for high-speed optical interconnects," ASSCC, 2017, pp. 89-92.
- [28] Y. Lee, et al., "A 25-Gb/s, 2.1-pJ/bit, Fully Integrated Optical Receiver With a Baud-Rate Clock and Data Recovery," JSSC, vol. 54, no. 8, pp. 2243-2254, Aug. 2019.
- [29] M.-C. Choi, et al., "A 0.1-pJ/b/dB 28-Gb/s Maximum-Eye Tracking, Weight-Adjusting MM CDR and Adaptive DFE with Single Shared Error Sampler," VLSI, 2020, pp. 1-2.
- [30] S. Ibrahim, et al., "Low-Power CMOS Equalizer Design for 20-Gb/s Systems," JSSC, vol. 46, no. 6, pp. 1321-1336, 2011.
- [31] V. Stojanovic, et al., "Autonomous dual-mode (PAM2/4) serial link transceiver with adaptive equalization and data recovery," JSSC, vol. 40, no. 4, pp. 1012-1026, Apr. 2005.
- [32] K. Mueller, et al., "Timing Recovery in Digital Synchronous Data Receivers," IEEE Trans. Commun., vol. COM-24, no. 5, pp. 516-531, May 1976.
- [33] K. Zheng, et al., "An Inverter-Based Analog Front End for a 56 GB/S PAM4 Wireline Transceiver in 16NM CMOS," VLSI, 2018, pp. 269-270.
- [34] R. Dokania, et al., "10.5 A 5.9pJ/b 10Gb/s serial link with unequalized MM-CDR in 14nm tri-gate CMOS," ISSCC, 2015, pp. 1-3.
- [35] J. Han, et al., "Design Techniques for a 60 Gb/s 173 mW Wireline Receiver Front-end in 65 nm CMOS Technology," JSSC, vol. 51, no. 4, pp. 871-880, Apr. 2016.

# Abstract (Korean)

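This thesis describes the design of high-speed, low-power wireline receivers. Specifically, it proposes offset cancellation, a decision-feedback equalizer technique using a merged summer, a stochastic baud-rate clock and data recovery, and a phase detector suitable for multi-level signals.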

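First, a receiver with adaptive offset cancellation and a decision-feedback equalizer using a merged summer is proposed. The proposed adaptive offset cancellation engine removes the offset on the data path by examining the sampled data and edge outputs of a random data stream. In addition, the half-rate decision-feedback equalizer using a merged summer reduces power consumption and hardware complexity. The prototype chip fabricated in 40 nm CMOS technology occupies an area of 0.083 mm<sup>2</sup>. Thanks to the adaptive offset canceller, the proposed receiver achieves a BER below 10<sup>-12</sup>. The proposed receiver also consumes 18.6 mW at 10 Gb/s over a channel with 27 dB loss at 5 GHz, achieving an FoM of 0.068 pJ/b/dB.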

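Second, a 40 nm CMOS receiver with baud-rate phase detectors is proposed. The receiver includes two baud-rate phase detectors. One is a baud-rate phase detector using a stochastic technique. Thanks to the advantage of baud-rate clock and data recovery, eliminating the edge sampling clock reduces power consumption and hardware complexity. In addition, the stochastic phase detector finds the optimal phase point that maximizes the vertical eye opening. The other phase detector is suited to multi-level signals. Even though multi-level signals are highly vulnerable to inter-symbol interference, the proposed baud-rate phase detector for multi-level signals secures vertical eye margin. Moreover, unlike the conventional Mueller-Müller phase detector, the proposed baud-rate phase detector has a unique lock point even with an adaptive decision-feedback equalizer. The prototype chip occupies an area of 0.24 mm<sup>2</sup>. The proposed PAM-4 receiver achieves a BER below 10<sup>-11</sup> at 48 Gb/s with an FoM of 2.42 pJ/b.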

Keywords: receiver, decision-feedback equalizer, DC offset, baud-rate clock and data recovery, stochastic phase detector, forwarded-clock receiver, SS-MM CDR, PAM-4

Student Number: 2016-20943