#### Attribution-NonCommercial-NoDerivs 2.0 Korea

Users are free to copy, distribute, transmit, display, perform, and broadcast this work, provided that they follow the conditions below:

- Attribution. You must credit the original author.
- Noncommercial. You may not use this work for commercial purposes.
- No Derivative Works. You may not alter, transform, or build upon this work.
- For any reuse or distribution, you must make clear the license terms applied to this work.
- These conditions do not apply if you obtain separate permission from the copyright holder.

Users' rights under copyright law are not affected by the above. This is an easy-to-read summary of the Legal Code.

#### Ph.D. Dissertation

## Design of High-Density Power-Efficient Transceiver for Next-Generation HBM

by

#### Han-Gon Ko

A Dissertation Submitted to the Department of Electrical and Computer Engineering in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy at

#### SEOUL NATIONAL UNIVERSITY

August, 2020

School of Electrical Engineering and Computer Science, College of Engineering, Seoul National University

Advisor: Professor Deog-Kyoon Jeong

Submitted as a dissertation for the degree of Doctor of Philosophy in Engineering, August 2020, Graduate School of Seoul National University, Department of Electrical and Computer Engineering. The doctoral dissertation of Han-Gon Ko was approved in August 2020.

## Committee in Charge:

- Professor Jaeha Kim, Chairman
- Professor Deog-Kyoon Jeong, Vice-Chairman
- Professor Kang Yoon Lee
- Professor Jung-Hoon Chun
- Professor Woo-Seok Choi

## **Abstract**

This thesis presents design techniques for a high-density, power-efficient transceiver for the next-generation high-bandwidth memory (HBM). Unlike other memory interfaces, HBM uses a 3D-stacked package built on through-silicon vias (TSVs) and a silicon interposer. A transceiver for HBM must therefore solve the problems caused by the 3D-stacked package and the TSVs.
First, a data (DQ) receiver for HBM is proposed with a self-tracking loop that tracks the phase skew between DQ and the data strobe (DQS) caused by voltage or thermal drift. The self-tracking loop achieves low power and small area by using an analog-assisted baud-rate phase detector. The proposed pulse-to-charge (PC) phase detector (PD) converts the phase skew into a voltage difference and detects the phase skew from that voltage difference. An offset calibration scheme that compensates for the mismatch of the PD is also proposed. The calibration scheme operates without any additional sensing circuits by taking advantage of the write training of HBM. Fabricated in 65 nm CMOS, the DQ receiver shows a power efficiency of 370 fJ/b at 4.8 Gb/s and occupies 0.0056 mm². The experimental results show that the DQ receiver operates without any performance degradation under a ±10% supply variation.

In a second prototype IC, a high-density transceiver for HBM with a feed-forward-equalizer (FFE)-combined crosstalk (XT) cancellation scheme is presented. To compensate for the XT, the transmitter pre-distorts the amplitude of the FFE output according to the XT. Since the proposed XT cancellation (XTC) scheme reuses the FFE already implemented to equalize the channel loss, the additional circuits needed for XTC are minimized. Thanks to the XTC scheme, the channel pitch can be significantly reduced, allowing a high channel density. Moreover, the 3D-staggered channel structure removes the ground layer between vertically adjacent channels, which further reduces the cross-sectional area of the channel per lane. The test chip, which includes 6 data lanes, is fabricated in 65 nm CMOS technology. The 6-mm channels are implemented on chip to emulate the silicon interposer between the HBM and the processor.
The operation of the XTC scheme is verified by simultaneously transmitting 4-Gb/s data over 6 consecutive channels with a 0.5-µm pitch, and the XTC scheme reduces the XT-induced jitter by up to 78%. The measurement results show that the transceiver achieves a throughput of 8 Gb/s/µm. The transceiver occupies 0.05 mm² for 6 lanes and consumes 36.6 mW at 6 x 4 Gb/s.

Keywords: Baud-rate CDR, crosstalk cancellation, crosstalk-induced jitter, far-end crosstalk, feed-forward equalizer (FFE), forwarded-clock (FC) receiver, HBM, memory interface, on-chip interconnect, parallel links, phase detector, RC-dominant wire, single-ended signaling, silicon interposer, transceiver.

**Student Number : 2015-20883**

## **Contents**

- Abstract
- Contents
- List of Figures
- List of Tables
- Chapter 1 Introduction
  - 1.1 Motivation
  - 1.2 Thesis Organization
- Chapter 2 Background on High-Bandwidth Memory
  - 2.1 Overview
  - 2.2 Transceiver Architecture
  - 2.3 Read/Write Operation
    - 2.3.1 Read Operation
    - 2.3.2 Write Operation
- Chapter 3 Background on Coupled Wires
  - 3.1 Generalized Model
  - 3.2 Effect of Crosstalk
- Chapter 4 DQ Receiver with Baud-Rate Self-Tracking Loop
  - 4.1 Overview
  - 4.2 Features of DQ Receiver for HBM
  - 4.3 Proposed Pulse-to-Charge Phase Detector
    - 4.3.1 Operation of Pulse-to-Charge Phase Detector
    - 4.3.2 Offset Calibration
    - 4.3.3 Operation Sequence
  - 4.4 Circuit Implementation
  - 4.5 Measurement Result
- Chapter 5 High-Density Transceiver for HBM with 3D-Staggered Channel and Crosstalk Cancellation Scheme
  - 5.1 Overview
  - 5.2 Proposed 3D-Staggered Channel
    - 5.2.1 Implementation of 3D-Staggered Channel
    - 5.2.2 Channel Characteristics and Modeling
  - 5.3 Proposed Feed-Forward-Equalizer-Combined Crosstalk Cancellation Scheme
  - 5.4 Circuit Implementation
    - 5.4.1 Overall Architecture
    - 5.4.2 Transmitter with FFE-Combined XTC
    - 5.4.3 Receiver
  - 5.5 Measurement Result
- Chapter 6 Conclusion
- Bibliography
- Abstract in Korean

## **List of Figures**

- Fig. 1.1 Challenges of designing a transceiver for HBM and solutions proposed in this thesis.
- Fig. 2.1 Cross-section view of HBM package.
- Fig. 2.2 Bandwidth per chip of various DRAM interfaces.
- Fig. 2.3 Normalized power efficiency of various DRAM interfaces.
- Fig. 2.4 Single channel signal count.
- Fig. 2.5 Microbump layout of HBM for single DWORD.
- Fig. 2.6 Overall architecture of the HBM interface.
- Fig. 2.7 Timing diagram of burst mode read operation.
- Fig. 2.8 Timing diagram of seamless mode read operation.
- Fig. 2.9 Timing diagram of write operation.
- Fig. 3.1 Schematics used to derive the transfer function of the coupled wire.
- Fig. 4.1 Block diagram of DQ receiver and signal path from CPU to DQ receiver.
- Fig. 4.2 Timing diagrams of the DQ and WDQS before and after the voltage or thermal drift.
- Fig. 4.3 Comparison with the matched DQ receiver and the DQ receiver with self-tracking loop.
- Fig. 4.4 Block diagram and timing diagram of the bang-bang (BB) PD.
- Fig. 4.5 Block diagram and timing diagram of the phase-interpolator based CDR.
- Fig. 4.6 Conceptual diagram of the pulse-to-charge PD.
- Fig. 4.7 Timing diagram of the proposed pulse-to-charge PD.
- Fig. 4.8 Sources of the offset of the pulse-to-charge PD.
- Fig. 4.9 Simulated offset histogram of the pulse-to-charge PD.
- Fig. 4.10 Simulated offset histogram of the pulse-to-charge PD.
- Fig. 4.11 Operation sequence of the PD training and self-tracking loop.
- Fig. 4.12 Overall architecture of the proposed DQ receiver.
- Fig. 4.13 Circuit implementation of the DQS amplifier.
- Fig. 4.14 Circuit implementation of the sampler.
- Fig. 4.15 Circuit implementation of the DCDL.
- Fig. 4.16 Circuit implementation of the PC PD.
- Fig. 4.17 Chip photomicrograph of the DQ receiver.
- Fig. 4.18 Measurement setup for the DQ receiver.
- Fig. 4.19 Flowchart for the DQ receiver measurement.
- Fig. 4.20 Block diagram of the TX.
- Fig. 4.21 Timing diagram of the DQ and DQS when the burst length is 4 and the period of the write command is 8 UI.
- Fig. 4.22 Overall block diagram of the test chip including the test circuits.
- Fig. 4.23 Measured eye diagrams without the self-tracking loop.
- Fig. 4.24 Measured eye diagrams with the self-tracking loop and PC PD calibration.
- Fig. 4.25 Measured timing margin versus the supply voltage.
- Fig. 4.26 Power breakdown of the DQ receiver.
- Fig. 5.1 Definition of the throughput in the silicon interposer.
- Fig. 5.2 Cross-section view of the 3D-staggered channel.
- Fig. 5.22 Measured eye diagrams of the proposed transceiver with and without XTC scheme.
- Fig. 5.23 Measured XT-induced jitter with and without the XTC.
- Fig. 5.24 Measured bathtub curve at 4 Gb/s.
- Fig. 5.25 Measured XT-induced jitter.
- Fig. 5.26 Power breakdown of the transceiver.
- Fig. 5.27 Throughput versus channel length among the multi-channel on-chip serial link papers published over the past 10 years.

## **List of Tables**

- Table 4.1 Performance comparison with previous forwarded-clock receivers.
- Table 5.1 Parameters of 3D-staggered channel.
- Table 5.2 Performance summary and comparison with other multi-channel on-chip serial links.
- Table 5.3 Comparison with other multi-channel XTC schemes.

## Chapter 1

## Introduction

## 1.1 Motivation

With the dramatic increase in processor performance, the bandwidth of external DRAM has become a bottleneck for system performance. To meet the demand for memory bandwidth, high-bandwidth memory (HBM), which uses through-silicon via (TSV) and silicon interposer technologies, has been proposed [1]-[3]. HBM shows higher bandwidth and better energy efficiency than graphics DDR (GDDR) DRAM.

The packaging technology of HBM significantly increases the throughput, but it also introduces several challenges. One of the main challenges is thermal drift. HBM reduces the form factor by stacking the memory dies, but this increases the thermal density of the dynamic random access memory (DRAM) [4]. The thermal drift not only increases the leakage current of the DRAM cell, but also changes the logic delay of the DRAM I/O circuits. If the logic delay changes after the write training, the relative phase between the data and the sampling clock also changes, and the sampling timing margin is degraded. In this thesis, a baud-rate self-tracking loop that compensates for the timing skew caused by voltage and thermal drift is proposed. Since the self-tracking loop uses an analog-assisted baud-rate phase detector, it is efficient in terms of power and area.

Fig. 1.1 Challenges of designing a transceiver for HBM and solutions proposed in this thesis.
Another challenge is the crosstalk (XT) between the channels on the silicon interposer. Since the channel pitch of the silicon interposer is much smaller than that of other packaging technologies, the impact of the XT is more severe. In this thesis, a feed-forward-equalizer (FFE)-combined XT cancellation (XTC) scheme is proposed. Because the proposed XTC scheme reuses the FFE already implemented to equalize the channel loss, the additional circuits needed for XTC are minimized. Moreover, to increase the throughput further, a 3D-staggered channel structure is proposed. The 3D-staggered channel structure removes the ground layer between the vertically and horizontally adjacent channels to reduce the cross-sectional area of the channel per lane.

Figure 1.1 shows the challenges of designing a transceiver for HBM and the solutions proposed in this thesis. To compensate for the phase skew between the data and the sampling clock due to voltage and thermal drift, the DQ receiver with the baud-rate self-tracking loop is proposed. In addition, to increase the throughput, the FFE-combined XTC scheme and the 3D-staggered channel structure are proposed.

## 1.2 Thesis Organization

This thesis is organized as follows. Chapter 2 provides background on HBM. The structure and features of transceivers for HBM are introduced along with an overview of HBM. The read/write operation, which is the most basic operation of the memory interface, is also described.

Chapter 3 provides background on coupled wires. First, the generalized model for the wires is described. Then, how the XT occurs in coupled wires and the characteristics of RC-dominant channels are explained.

In Chapter 4, a self-tracking loop that compensates for the phase skew between the data and the strobe due to thermal and voltage drift is presented. First, the features that the data receiver for HBM should satisfy and the difficulties in applying conventional CDR structures are introduced.
Next, a pulse-to-charge (PC) baud-rate phase detector (PD) is proposed, and the concept and operation of the PC PD are presented. In addition, the offset calibration scheme and the training sequence used when the self-tracking loop is applied to the memory interface are described. Then, the circuit implementation of the proposed data receiver and the measurement results are shown.

In Chapter 5, a high-density transceiver with the 3D-staggered channel and a feed-forward-equalizer (FFE)-combined crosstalk (XT) cancellation scheme is proposed. The concept of the proposed FFE-combined XTC scheme and its effectiveness are described. The circuit implementation of the prototype transceiver and its measurement results are shown. The comparison tables compare this work with state-of-the-art on-chip serial links and XTC schemes.

Chapter 6 summarizes the proposed works and concludes this thesis.

## Chapter 2

## Background on High-Bandwidth Memory

## 2.1 Overview

Demand for high memory bandwidth keeps increasing to cope with technology trends such as big data, cloud computing, the internet of things (IoT), and artificial intelligence. The main reason for this trend is the growing disparity in speed between the central processing unit (CPU) and the memory outside the CPU chip. From 1986 to 2000, CPU speed improved at an annual rate of 55% while memory speed improved at only 10% [5]. If the memory bandwidth becomes insufficient to provide the CPU with enough instructions and data to continue computation, the CPU is effectively always stalled waiting on memory. This phenomenon is called the "memory wall". Therefore, to increase the overall system performance, the memory bandwidth should be increased.

Fig. 2.1 Cross-section view of HBM package.

Figure 2.1 shows the cross-section view of the HBM package [6]. HBM consists of a buffer die at the bottom and stacked DRAM dies on top of it. HBM can have three types of stack configuration: 2H, 4H, and 8H.
The total height of HBM is the same regardless of the number of stacked dies. The reason is that a thermal solution such as a heat spreader or an active cooler is required on top of the package to solve the thermal issues of the CPU and HBM. HBM uses through-silicon via (TSV) and silicon interposer technologies to provide multiple chip stacks and a large number of I/Os between the processor and HBM.

All pins between the processor and HBM, including 1024 data lanes, command, address, and clocks, are connected through channels implemented on the silicon interposer. Typically, one processor and two or four HBMs are packaged together in the same system-in-package (SiP). This is because of the limited length of the channel implemented on the silicon interposer: a channel length under 6 mm is recommended for good signal integrity. The buffer die provides not only routes from the TSVs to the micro-bumps but also test functionality to system makers and DRAM vendors.

Fig. 2.2 Bandwidth per chip of various DRAM interfaces.

Figure 2.2 shows the bandwidth per chip of various DRAM interfaces [7]. HBM2 shows 9.1 times higher bandwidth per chip than GDDR5. Although HBM has a lower data rate per pin than GDDR, its bandwidth per chip is much higher because of the large number of I/Os. Moreover, since the data rate per pin of GDDR is already more than 10 Gb/s and is hard to increase further, the performance gap between HBM and GDDR is expected to widen even more.

Fig. 2.3 Normalized power efficiency of various DRAM interfaces.

Figure 2.3 shows the normalized power efficiency of various DRAM interfaces. Thanks to the small pin capacitance of the short silicon interposer channel and the point-to-point configuration, HBM shows the highest power efficiency among the high-speed DRAM interfaces. HBM has the best bandwidth and power efficiency, making it a promising candidate for high-performance computing applications.
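The two quantitative claims in this overview can be checked with a short back-of-the-envelope calculation. The sketch below compounds the quoted annual growth rates over 1986 to 2000, and compares aggregate bandwidths from pin count and per-pin rate. The growth rates and the 1024 data lanes come from the text; the per-pin rates (2 Gb/s for HBM2, 7 Gb/s for GDDR5) and the 32-pin GDDR5 width are illustrative assumptions that are not stated in this section.

```python
def cumulative_speed_gap(cpu_rate, mem_rate, years):
    """Factor by which CPU speed outgrows memory speed after compounding for `years`."""
    return ((1.0 + cpu_rate) / (1.0 + mem_rate)) ** years


def bandwidth_gbytes_per_s(num_pins, gbits_per_s_per_pin):
    """Aggregate bandwidth in GB/s for `num_pins` data pins at the given per-pin rate."""
    return num_pins * gbits_per_s_per_pin / 8.0


# Rates and period quoted in the text: 55%/year vs 10%/year, 1986 to 2000.
gap = cumulative_speed_gap(0.55, 0.10, 2000 - 1986)

# ASSUMED per-pin rates and GDDR5 width (not given in this section).
hbm2 = bandwidth_gbytes_per_s(1024, 2.0)   # 1024 data lanes is from the text
gddr5 = bandwidth_gbytes_per_s(32, 7.0)
ratio = hbm2 / gddr5

print(f"CPU/memory speed gap after 14 years: about {gap:.0f}x")
print(f"HBM2 {hbm2:.0f} GB/s vs GDDR5 {gddr5:.0f} GB/s, ratio {ratio:.1f}x")
```

With these assumed rates the ratio lands close to the 9.1 times figure quoted above, which illustrates how a wide, slow interface can outperform a narrow, fast one.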
## 2.2 Transceiver Architecture

To understand the transceiver architecture for HBM, one should first understand the data transfer protocol of HBM. The HBM interface is divided into 8 channels, and each channel is completely independent of the others [8]. Each channel provides access to an independent set of DRAM banks. Requests from one channel may not access data attached to a different channel. The division of channels among the DRAM dies within a stack is left to the vendor.

Figure 2.4 shows the single channel signal count of HBM [8]. A single channel consists of one clock pair (CK\_t and CK\_c), a set of command and address pins, 4 data words (DWORDs), and redundant pins for lane repair. The command and address set includes Row, Column, CKE, and AERR and is synchronized to CK. The AERR pin is used when an error is detected in the command and address set. The redundant data, row, and column pins are used for the lane repair function. When a microbump is not connected due to a package fault, HBM detects the unconnected pins and remaps them to the redundant pins. This feature is introduced to prevent a yield drop due to package faults.

Each DWORD consists of 32 DQ, 4 DM, 4 DBI, one pair of WDQS (WDQS\_t and WDQS\_c), and one pair of RDQS (RDQS\_t and RDQS\_c). Similar to other DRAM interfaces, HBM uses strobes (WDQS and RDQS) to sample DQ. Each DM bit is used for data masking of one byte. Each DBI bit indicates whether the corresponding byte is inverted. This feature is introduced to save power by reducing the number of data transitions. Since all DQ pins of a single DWORD share the same strobes, a skew between DQs or a performance degradation due to the delay of the strobe buffer may occur.

| Function | # of uBumps | Notes |
|------------------------|-------------|----------------------------------------------|
| Data | 128 | DQ[127:0] |
| Column Command/Address | 9 | C[8:0] |
| Row Command/Address | 7 | R[6:0] |
| DBI | 16 | 1 DBI per 8 DQs |
| DM | 16 | 1 DM per 8 DQs |
| PAR | 4 | 1 PAR per 32 DQs |
| DERR | 4 | 1 DERR per 32 DQs |
| Strobes | 16 | 1 RDQS_t/RDQS_c,<br>WDQS_t/WDQS_c per 32 DQs |
| Clocks | 2 | CK_t/CK_c |
| CKE | 1 | CKE |
| AERR | 1 | AERR |
| Redundant Data | 8 | RD[7:0] |
| Redundant Row | 1 | RR |
| Redundant Column | 1 | RC |
| Total | 214 | |

Fig. 2.4 Single channel signal count.

Figure 2.5 shows the microbump layout of HBM for a single DWORD. HBM employs a staggered pattern in which a staggered bump is located halfway between the major rows and columns. The horizontal pitch of two adjacent microbumps is 96 µm and the vertical pitch of two adjacent microbumps is 55 µm. The footprint of the entire HBM ballout consists of 300 rows with a pitch of 27.5 µm and 68 columns with a pitch of 48 µm. The overall microbump array size is 3241 µm x 8247.5 µm. The strobe signals are located at the center of the DQ microbumps.

Fig. 2.5 Microbump layout of HBM for single DWORD.

Figure 2.6 shows the overall architecture of the HBM interface for a single channel. A memory controller transmits the current instruction and address through the ROW and COL pins, and then sends or receives the corresponding data through the DQ pins. For the HBM interface to operate properly, a training sequence is required before normal operation. Through the training, the skew between strobe and clock, the write latency, the read latency, the skew between DQ and strobe, and the skew between DQs are compensated.

Fig. 2.6 Overall architecture of the HBM interface.
## 2.3 Read/Write Operation

The read/write operation is the most basic operation of DRAM. The throughput and latency of the read/write operation greatly affect the overall system performance. In addition, like most DRAM interfaces, HBM should support burst mode operation to reduce power consumption in the idle state. Therefore, the transceiver for HBM should be able to operate in burst mode while reducing latency and power consumption.

## 2.3.1 Read Operation

Figure 2.7 shows the timing diagram of the burst mode read operation. To perform a read operation, the READ command and the corresponding address should first be transmitted from the controller. After the number of clock cycles corresponding to the read latency has elapsed from the READ command, HBM should output the valid data through the DQ pins. The length of the valid data, which is the burst length, is set by the controller through the mode register. The RDQS signals are toggled so that the controller can sample the valid data output from the DQ pins. To stabilize the valid strobes that sample DQ, a single cycle of pre-amble must be transmitted before, and a single cycle of post-amble after, each run of valid strobes. The last bit of valid data should be held for half a cycle more. After one burst transmission ends, RDQS\_t returns to 0 (RDQS\_c returns to 1) and HBM prepares for the next operation.

Figure 2.8 shows the timing diagram of the seamless mode read operation. As in the burst mode operation, HBM should output the valid data through the DQ pins after the number of clock cycles corresponding to the read latency has elapsed from the READ command. However, in seamless mode, because the READ commands arrive at the same interval as the burst length, the next burst arrives when the post-amble would otherwise occur. If READ commands are transmitted continuously, the pre-amble and the post-amble between bursts are omitted. Only the pre-amble of the first burst and the post-amble of the last burst remain. The interval at which the READ commands are transmitted is determined by the read latency.
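The difference between burst mode and seamless mode described above can be made concrete by counting strobe cycles. The sketch below is a simplified model, not a description of the actual interface timing: the function name is hypothetical, it assumes double-data-rate strobing (two data bits per strobe cycle), and it ignores the idle gaps between bursts and the extra half-cycle hold of the last bit.

```python
def read_strobe_cycles(num_bursts, burst_length, seamless):
    """Count RDQS cycles (pre-amble + data + post-amble) for a run of read bursts.

    Simplified model: one pre-amble cycle and one post-amble cycle per run of
    valid strobes, and two data bits per strobe cycle (ASSUMED DDR strobing).
    """
    if burst_length <= 0 or burst_length % 2 != 0:
        raise ValueError("burst_length must be a positive even number of bits")
    data_cycles = burst_length // 2
    if seamless:
        # Only the first pre-amble and the last post-amble remain.
        return 1 + num_bursts * data_cycles + 1
    # Burst mode: every burst carries its own pre-amble and post-amble.
    return num_bursts * (1 + data_cycles + 1)


burst = read_strobe_cycles(num_bursts=4, burst_length=4, seamless=False)
seamless = read_strobe_cycles(num_bursts=4, burst_length=4, seamless=True)
print(burst, seamless)
```

Under this model, four back-to-back bursts of length 4 need 16 strobe cycles in burst mode and 10 in seamless mode, which shows how much of the strobe activity in burst mode is pre-amble and post-amble overhead when the burst length is short.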
Fig. 2.7 Timing diagram of burst mode read operation (panels: burst length = 2 and burst length = 4).

Fig. 2.8 Timing diagram of seamless mode read operation (panels: burst length = 2 and burst length = 4).

## 2.3.2 Write Operation

Figure 2.9 shows the timing diagram of the write operation. The upper timing diagram shows the burst mode operation with a burst length of 2, and the lower timing diagram shows the seamless operation with a burst length of 4. As in the read operation, the WRITE command and the corresponding address should first be transmitted from the controller. The transceiver for HBM samples DQ using WDQS\_t and WDQS\_c. The phase skew between DQ and WDQS is calibrated by the controller during the training sequence. There is no post-amble in the write operation. In the seamless operation, the WRITE command is transmitted continuously at the period of the write latency. Since the read and write operations have different signal directions, a bus turnaround time, denoted tWTR, is required. A READ command can be issued any time after a WRITE command as long as tWTR is met.

Fig. 2.9 Timing diagram of write operation (panels: burst mode with burst length = 2; seamless mode with burst length = 4).

## Chapter 3

## **Background on Coupled Wires**

## 3.1 Generalized Model

Before designing the transceiver, it is essential to accurately analyze and model the characteristics of the channel. This chapter presents an analytical channel model for coupled wires.

Figure 3.1 shows the schematics used to derive the transfer function of the coupled wire. The driver is represented by its Thevenin equivalent model and the receiver is modeled as an impedance. The wire is modeled as a distributed general RLGC model of length L. Only 2 coupled channels are considered for simplicity.
Fig. 3.1 Schematics used to derive the transfer function of the coupled wire.

$V_1(x,w)$, $V_2(x,w)$, $I_1(x,w)$, and $I_2(x,w)$ are the traveling-wave voltages and currents of channel 1 and channel 2 at distance $x$ along the wire. The differential equations for the channel voltages and currents, together with the boundary conditions, can be written in vector form as [21]

$$-\frac{\partial}{\partial x} \begin{bmatrix} V_1(x, w) \\ V_2(x, w) \end{bmatrix} = (R + jwL) \begin{bmatrix} I_1(x, w) \\ I_2(x, w) \end{bmatrix} \tag{3.1}$$

$$-\frac{\partial}{\partial x} \begin{bmatrix} I_1(x, w) \\ I_2(x, w) \end{bmatrix} = (G + jwC) \begin{bmatrix} V_1(x, w) \\ V_2(x, w) \end{bmatrix} \tag{3.2}$$

$$\begin{bmatrix} V_{in,1}(w) \\ V_{in,2}(w) \end{bmatrix} = \begin{bmatrix} V_1(0,w) \\ V_2(0,w) \end{bmatrix} + Z_S \begin{bmatrix} I_1(0,w) \\ I_2(0,w) \end{bmatrix} \tag{3.3}$$

$$\begin{bmatrix} V_1(L, w) \\ V_2(L, w) \end{bmatrix} = Z_L \begin{bmatrix} I_1(L, w) \\ I_2(L, w) \end{bmatrix} \tag{3.4}$$

where R, L, G, and C are the 2-by-2 RLGC matrices, which are expressed as

$$R = \begin{bmatrix} r_0 & r_c \\ r_c & r_0 \end{bmatrix} \tag{3.5}$$

$$L = \begin{bmatrix} l_0 & l_c \\ l_c & l_0 \end{bmatrix} \tag{3.6}$$

$$G = \begin{bmatrix} g_0 & g_c \\ g_c & g_0 \end{bmatrix} \tag{3.7}$$

$$C = \begin{bmatrix} c_0 & c_c \\ c_c & c_0 \end{bmatrix}. \tag{3.8}$$

The off-diagonal terms are the crosstalk terms from the other channel. Note that the skin effect terms are ignored for simplicity. Equations (3.3) and (3.4) are the boundary equations defined by the output impedance of the TX and the input impedance of the RX.
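Because the RLGC matrices are symmetric with equal diagonal entries, the coupled equations can be decoupled by working with the sum and difference of the two channels. The following sketch makes that intermediate step explicit. The mode variables $V_\pm$, $I_\pm$, the symbols $\gamma_\pm$ and $Z_\pm$, and the shorthand $z_0$, $z_c$, $y_0$, $y_c$ for the entries of $R + jwL$ and $G + jwC$ are notation introduced here for illustration.

$$z_0 = r_0 + jwl_0, \quad z_c = r_c + jwl_c, \quad y_0 = g_0 + jwc_0, \quad y_c = g_c + jwc_c$$

$$V_\pm = V_1 \pm V_2, \qquad I_\pm = I_1 \pm I_2$$

Adding and subtracting the two rows of (3.1) and (3.2) gives two independent single-line problems:

$$-\frac{\partial V_\pm}{\partial x} = (z_0 \pm z_c)\, I_\pm, \qquad -\frac{\partial I_\pm}{\partial x} = (y_0 \pm y_c)\, V_\pm$$

Each mode therefore has its own propagation constant and characteristic impedance:

$$\gamma_\pm = \sqrt{(z_0 \pm z_c)(y_0 \pm y_c)}, \qquad Z_\pm = \sqrt{\frac{z_0 \pm z_c}{y_0 \pm y_c}}$$

The $+$ sign corresponds to the common mode (both channels driven identically) and the $-$ sign to the differential mode. Since the boundary conditions (3.3) and (3.4) use the same $Z_S$ and $Z_L$ on both channels, they also separate by mode, so each mode can be solved as an ordinary terminated line and the channel responses recovered from the sum and difference of the two mode responses.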
The solution of Equations (3.1)-(3.4) is given by

$$T_{common}(w) = \frac{e^{-l\sqrt{(z_0 + z_c)(y_0 + y_c)}}}{\left(\frac{\sqrt{\frac{z_0 + z_c}{y_0 + y_c}}}{Z_L} + 1\right)\left(1 + \frac{Z_S}{\sqrt{\frac{z_0 + z_c}{y_0 + y_c}}}\right)} \tag{3.9}$$

$$T_{diff}(w) = \frac{e^{-l\sqrt{(z_0 - z_c)(y_0 - y_c)}}}{\left(\frac{\sqrt{\frac{z_0 - z_c}{y_0 - y_c}}}{Z_L} + 1\right)\left(1 + \frac{Z_S}{\sqrt{\frac{z_0 - z_c}{y_0 - y_c}}}\right)} \tag{3.10}$$

$$T(w) = \frac{V_1(L, w)}{V_{in,1}} = \frac{V_2(L, w)}{V_{in,2}} \approx T_{common}(w) + T_{diff}(w) \tag{3.11}$$

$$T_{XT}(w) = \frac{V_1(L, w)}{V_{in,2}} = \frac{V_2(L, w)}{V_{in,1}} \approx T_{common}(w) - T_{diff}(w) \tag{3.12}$$

$$Z = R + jwL = \begin{bmatrix} z_0 & z_c \\ z_c & z_0 \end{bmatrix} \tag{3.13}$$

$$Y = G + jwC = \begin{bmatrix} y_0 & y_c \\ y_c & y_0 \end{bmatrix} \tag{3.14}$$

where $T(w)$ and $T_{XT}(w)$ are the through and crosstalk transfer functions of the channel, and $l$ in the exponents denotes the wire length (written as $L$ in the boundary condition (3.4)). Note that the voltage and current waves propagating in the $-x$ direction are omitted for simplicity.

Equations (3.9)-(3.14) are general and accurate, but too complex to provide design intuition. In the case of a lossless transmission line with ideal impedance matching at the receiver ($Z_L=Z_0$), only $l_0$ and $c_0$ remain among the elements of the RLGC matrices, in which case the through transfer function is expressed as

$$T(w) = \frac{e^{-jwl\sqrt{l_0 c_0}}}{1 + \frac{Z_S}{Z_0}} \tag{3.15}$$

where $Z_0$ is the characteristic impedance of the channel, given as

$$Z_0 = \sqrt{\frac{r_0 + jwl_0}{g_0 + jwc_0}}. \tag{3.16}$$

In the case of RC-dominant channels ($r_0 \gg wl_0$), the R matrix should be considered instead of the L matrix. Unlike the lossless transmission line, the characteristic impedance of the channel depends on frequency, which results in a difference between voltage-mode and current-mode signaling. The characteristic impedance and the transfer function of the channel are given as

$$Z_0(w) = \sqrt{\frac{r_0}{jwc_0}} \tag{3.17}$$

$$T(w) = \frac{V_1(L, w)}{V_{in,1}} = \frac{2e^{-l\sqrt{jwr_0c_0}}}{\left(\frac{Z_0(w)}{Z_L} + 1\right)\left(1 + \frac{Z_S}{Z_0(w)}\right)}. \tag{3.18}$$

For RC-dominant channels, the transfer function depends on the termination impedance. Therefore, the output resistance of the TX and the input impedance of the RX should be carefully designed.

## 3.2 Effect of Crosstalk

The next topic to be discussed is the effect of the crosstalk. Typically, when there is crosstalk between the coupled wires, the inductive and capacitive coupling through the electric and magnetic fields is dominant over the other coupling terms. This crosstalk can be represented using the $l_c$ and $c_c$ terms among the elements of the RLGC matrices. The L and C matrices are given as

$$L = \begin{bmatrix} L_s & L_m \\ L_m & L_s \end{bmatrix} \tag{3.6}$$

$$C = \begin{bmatrix} C_s & -C_m \\ -C_m & C_s \end{bmatrix} \tag{3.7}$$

where $L_s$ is the self inductance, $C_s$ is the self capacitance, $L_m$ is the mutual inductance, and $C_m$ is the mutual capacitance. In the case of a lossless transmission line with crosstalk, the through and crosstalk transfer functions are given as

$$T(w) = \frac{e^{-jwl\sqrt{(L_s + L_m)(C_s - C_m)}}}{\left(\frac{\sqrt{\frac{L_s + L_m}{C_s - C_m}}}{Z_L} + 1\right)\left(1 + \frac{Z_S}{\sqrt{\frac{L_s + L_m}{C_s - C_m}}}\right)} + \frac{e^{-jwl\sqrt{(L_s - L_m)(C_s + C_m)}}}{\left(\frac{\sqrt{\frac{L_s - L_m}{C_s + C_m}}}{Z_L} + 1\right)\left(1 + \frac{Z_S}{\sqrt{\frac{L_s - L_m}{C_s + C_m}}}\right)} \tag{3.19}$$

$$T_{XT}(w) = \frac{e^{-jwl\sqrt{(L_s + L_m)(C_s - C_m)}}}{\left(\frac{\sqrt{\frac{L_s + L_m}{C_s - C_m}}}{Z_L} + 1\right)\left(1 + \frac{Z_S}{\sqrt{\frac{L_s + L_m}{C_s - C_m}}}\right)} - \frac{e^{-jwl\sqrt{(L_s - L_m)(C_s + C_m)}}}{\left(\frac{\sqrt{\frac{L_s - L_m}{C_s + C_m}}}{Z_L} + 1\right)\left(1 + \frac{Z_S}{\sqrt{\frac{L_s - L_m}{C_s + C_m}}}\right)}. \tag{3.20}$$

The most dominant effect of the coupling is the difference in time of flight between the transfer functions of the common mode and the differential mode.
The times of flight of the common mode and the differential mode are given as

$$TOF_{common} = l\sqrt{(L_s + L_m)(C_s - C_m)} \tag{3.21}$$

$$TOF_{diff} = l\sqrt{(L_s - L_m)(C_s + C_m)} \tag{3.22}$$

Clearly, the times of flight of the two modes are equal if $L_m/L_s = C_m/C_s$. This condition is guaranteed when the channel is homogeneous. Therefore, if the homogeneity holds, the transfer function of the crosstalk becomes zero. This result forms the basis of directional couplers and occurs quite naturally. However, for microstrip lines, homogeneity is not guaranteed since the electric and magnetic fields above and below the channel are not symmetric [22]. In general, the channels are inhomogeneous and far-end crosstalk (FEXT) occurs.

Equation (3.20), however, is too complicated to provide design intuition and needs to be simplified. If the coupling is loose enough that there is negligible degradation of the driving signal as it propagates the full length of the line, the near-end crosstalk (NEXT) and the FEXT simplify to [23], [24]

$$V_{NEXT}(t) = \frac{1}{4} \cdot \left(\frac{C_m}{C_s} + \frac{L_m}{L_s}\right) \cdot \{V_{in}(t) - V_{in}(t - 2 \cdot TOF)\} \tag{3.23}$$

$$V_{FEXT}(t) = \frac{TOF}{2} \cdot \left(\frac{C_m}{C_s} - \frac{L_m}{L_s}\right) \cdot \frac{\partial V_{in}(t - TOF)}{\partial t}. \tag{3.24}$$

Since the peak of the NEXT is small, it is not critical for signal integrity. Moreover, in memory interfaces, the NEXT has no significant effect, since all data is transmitted in the same direction. On the other hand, the FEXT induces a significant peak at the receiver and can therefore be a factor that limits the data rate.

## Chapter 4

## DQ Receiver with Baud-Rate Self-Tracking Loop

## 4.1 Overview

HBM utilizes a 3D-stacked package and silicon interposer technology to increase the bandwidth. Figure 4.1 shows the block diagram of the DQ receiver and the signal path from the CPU to the DQ receiver.
The input ports of the DQ receiver for one DWORD consist of 32 DQ, 4 DM, 4 DBI, and one WDQS pair. The DQ receiver should sample the 40 data lanes at the optimum sampling point using the WDQS transmitted from the controller. During the write operation, the data lanes and the WDQS travel from the controller to the DQ receiver via the silicon interposer, the buffer die, and the TSVs. Because this write path is long, the signals arriving at the DQ receiver can undergo a large change in delay.

Fig. 4.1 Block diagram of DQ receiver and signal path from CPU to DQ receiver.

All of these delay changes should be compensated for HBM to operate normally. To do this, the controller performs write training before starting the normal operation. The phase skew between DQ and WDQS is calibrated during the write training. If the delays of the DQ path and the WDQS path differ inside the DQ receiver, the controller compensates for the total phase skew, including this internal mismatch, during the write training. As a result, after the write training, the WDQS samples DQ at the optimum sampling point. However, if a voltage or thermal drift occurs, the delays of the DQ path and the WDQS path change differently, resulting in additional phase skew. The timing diagrams of the DQ and WDQS before and after the voltage or thermal drift are shown in Figure 4.2. Typically, the WDQS path has an analog front end and clock buffers, and its delay is longer than that of the DQ path. Therefore, the timing margin of the DQ is decreased by the voltage or thermal drift.

Fig. 4.2 Timing diagrams of the DQ and WDQS before and after the voltage or thermal drift.

One possible solution for alleviating this additional phase skew is a matched DQ receiver. The matched DQ receiver has a delay line that mimics T<sub>DQS</sub> in front of the DQ samplers.
Therefore, in the matched DQ receiver, the delays of the WDQS path and the DQ path are the same, and the phase between the DQ and WDQS does not change when a voltage or thermal drift occurs. However, since HBM has many more DQ lanes than the other DRAM interfaces, the cost of the replica delay in terms of power and area is unacceptable. Therefore, a self-tracking loop that can compensate for the timing skew in the background is necessary. Figure 4.3 compares the matched DQ receiver with the DQ receiver with the self-tracking loop. Since the cost of the sensing and control circuits is amortized over all DQ lanes, the total cost of the self-tracking loop is much lower than that of the matched DQ receiver.

Unlike conventional clock and data recovery (CDR), the DQ receiver for HBM cannot directly adopt the 2x oversampling CDR architecture since only a baud-rate clock is available. In addition, since DQS toggles only during preambles and write bursts in the burst-mode operation, a phase-interpolator-based CDR architecture is also not acceptable. This thesis presents a low-power, small-area 4.8-Gb/s DQ receiver with a self-tracking loop that uses an analog-assisted baud-rate phase detector (PD). Since the proposed pulse-to-charge (PC) PD operates only with the baud-rate clock, it is suitable for HBM3 and is efficient in terms of power and area.

# 4.2 Features of DQ Receiver for HBM

The DQ receiver for HBM has several features that differ from typical receivers. The first feature is that the input clock runs at half the frequency of the data rate. If both the rising edge and the falling edge of the input clock are used, all input data can be sampled. However, the 2x oversampling bang-bang (BB) PD used in typical receivers requires an additional clock for the edge sample, and therefore the BBPD cannot be used. Figure 4.4 shows the block diagram and the timing diagram of the BBPD. A multi-phase generator can be used to create a clock for the edge samplers, but it is difficult to apply because of the large power and area overhead. Therefore, a power- and area-efficient baud-rate PD is required for the DQ receiver for HBM.

Fig. 4.4 Block diagram and timing diagram of the bang-bang (BB) PD.

Another feature of the DQ receiver is the strobe clock used for data sampling. Unlike in typical receivers, the input clock of the DQ receiver toggles only when the input data is valid. The phase interpolator (PI) cannot be used since it requires two adjacent edges to generate the output. Figure 4.5 shows the block diagram and timing diagram of the PI-based CDR. The DQ receiver for HBM should instead use a delay line to adjust the phase between the data and the sampling clock.

Fig. 4.5 Block diagram and timing diagram of the phase-interpolator based CDR.

# 4.3 Proposed Pulse-to-Charge Phase Detector

## 4.3.1 Operation of Pulse-to-Charge Phase Detector

The conceptual diagram and the timing diagram of the proposed pulse-to-charge (PC) phase detector (PD) are shown in Figure 4.6 and Figure 4.7, respectively. The proposed PC PD converts the phase skew to a voltage difference and detects the phase skew from the voltage difference using the sampler. If there is a rising edge of DQ between the rising edges of DQS\_I and DQS\_Q, the PC PD charges Vc,- from the rising edge of DQS\_I to the rising edge of DQ and charges Vc,+ from the rising edge of DQ to the rising edge of DQS\_Q. If DQS leads DQ, the time between the rising edge of DQS\_I and the rising edge of DQ is greater than the time between the rising edge of DQ and the rising edge of DQS\_Q, so Vc,- is greater than Vc,+ and UP becomes logic 0. If DQS lags DQ, the opposite holds and UP becomes logic 1. The output of the PC PD is valid only when there is a rising edge of DQ between the rising edges of DQS\_I and DQS\_Q.

Fig. 4.6 Conceptual diagram of the pulse-to-charge PD.

Fig. 4.7 Timing diagram of the proposed pulse-to-charge PD.
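The decision rule of the PC PD can be summarized in a small behavioral model. This is only a sketch: it assumes both capacitors charge linearly at the same rate (`slope`, a hypothetical parameter), and the function name `pc_pd` is illustrative rather than taken from the implementation.

```python
def pc_pd(t_dqs_i, t_dq, t_dqs_q, slope=1.0):
    """Behavioral pulse-to-charge PD.

    Arguments are rising-edge instants in arbitrary time units; `slope` is the
    assumed constant charging rate of both capacitors (volts per time unit).
    Returns (valid, up); `up` is None when the output is not valid.
    """
    # The output is valid only if the DQ edge falls between DQS_I and DQS_Q.
    valid = t_dqs_i < t_dq < t_dqs_q
    if not valid:
        return False, None
    v_minus = slope * (t_dq - t_dqs_i)   # charged from the DQS_I edge to the DQ edge
    v_plus = slope * (t_dqs_q - t_dq)    # charged from the DQ edge to the DQS_Q edge
    # UP is 0 when Vc,- exceeds Vc,+ (DQS leads DQ); a tie is resolved to 0 here.
    up = 1 if v_plus > v_minus else 0
    return True, up

print(pc_pd(0.0, 0.7, 1.0))  # DQ edge late in the window: DQS leads DQ
print(pc_pd(0.0, 0.3, 1.0))  # DQ edge early in the window: DQS lags DQ
print(pc_pd(0.0, 1.2, 1.0))  # no DQ edge inside the window: output invalid
```

In the real circuit the comparison is made by the sampler, so its offset shifts the decision threshold away from the midpoint of the window.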
#### 4.3.2 Offset Calibration

Since the PC PD calculates the phase error in the voltage domain, a random variation of a capacitance or a threshold voltage can induce an offset. If there is a mismatch between the capacitors or an offset in the sampler, the locking point of the self-tracking loop can deviate from the optimum point. Figure 4.8 shows the sources of the offset of the PC PD. In general, when a common-centroid layout is used, the mismatch of the capacitors is smaller than the offset of the sampler. The offset of the sampler is mostly due to the threshold mismatch of the input transistors. Figure 4.9 shows the simulated offset histogram of the pulse-to-charge PD. The mean of the offset is 16.8 mV and its standard deviation is 22.8 mV. The nonzero mean is caused by systematic mismatch in the layout, not by random mismatch. In this work, small transistors are used for the capacitor and the sampler to reduce the power consumption and area. This increases the offset, but the overall performance is not degraded because the offset is compensated before the normal operation.

Fig. 4.8 Sources of the offset of the pulse-to-charge PD.

Fig. 4.9 Simulated offset histogram of the pulse-to-charge PD.

Figure 4.10 shows the sequence of the proposed offset calibration scheme. The proposed offset calibration scheme operates without any additional sensing circuits by taking advantage of the write training of HBM. Because DQ and DQS are aligned to the optimum sampling point after the write training sequence, the PC PD can be calibrated using the optimum sampling point as a reference. If DQ and DQS are aligned to the optimum sampling point, the output of the PC PD represents the polarity of the offset. The optimum calibration code can be found by changing the calibration code according to the output of the PC PD and then checking the output of the PC PD again.
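The update-and-recheck procedure can be organized as a successive-approximation search. The sketch below assumes a 4-bit code with a 10-mV step, treats code 8 (4'b1000) as the neutral code, and models the PD output at the trained point as the sign of the residual offset. The polarity convention and the helper names are assumptions made for illustration.

```python
LSB_MV = 10.0   # calibration step in mV
MID_CODE = 8    # 4'b1000, assumed to cancel zero offset

def residual_mv(offset_mv, code):
    """Residual offset after applying a calibration code (assumed linear DAC)."""
    return offset_mv + (code - MID_CODE) * LSB_MV

def pd_says_positive(offset_mv, code):
    """Hypothetical PD read-out at the trained sampling point."""
    return residual_mv(offset_mv, code) > 0

def calibrate(offset_mv, bits=4):
    """Binary search, MSB first: keep a bit only if the residual stays non-positive."""
    code = 0
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)
        if not pd_says_positive(offset_mv, trial):
            code = trial
    return code

code = calibrate(16.8)  # 16.8 mV is the simulated mean offset
print(code, residual_mv(16.8, code))
```

Each bit needs one PD decision, so a 4-bit code is resolved with four decisions. In this model, the residual stays below one step for offsets inside the DAC range.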
In this thesis, the 4-bit calibration code is found using a binary search, so the offset calibration of the PC PD is completed in a few DQS cycles. Note that the calibration code has a resolution of 10 mV, which corresponds to a 0.03 UI offset in the PD gain curve.

## **4.3.3 Operation Sequence**

Figure 4.11 shows the operation sequence of the PD training and the self-tracking loop. Since the phase between DQ and WDQS cannot be changed while the self-tracking loop is enabled, all training sequences should be performed while the self-tracking loop is disabled. Since the proposed PD training scheme works only when the WDQS samples DQ at the optimum sampling point, it should be performed after the symbol training. The proposed self-tracking loop should be enabled after the PD training and before the normal operation. If the PD training is performed long after the symbol training, the eye margin is reduced by the skew that the voltage or thermal drift causes in the meantime.

Fig. 4.10 Simulated offset histogram of the pulse-to-charge PD.

Fig. 4.11 Operation sequence of the PD training and self-tracking loop.

# 4.4 Circuit Implementation

Figure 4.12 shows the overall architecture of the proposed DQ receiver. The amplifier amplifies the 400-mV 2.4-GHz DQS input to full swing, and the IQ divider divides the differential 2.4-GHz DQS into 1.2-GHz 4-phase quadrature clocks. The 4.8-Gb/s, 4-channel DQ inputs are sampled by the 1.2-GHz quadrature clocks. The proposed PC PD detects the phase error between DQ and DQS without an extra clock phase for edge sampling. The operation of the PC PD is described in detail in Chapter 4.3. The digital loop filter accumulates the output of the PC PD and adjusts the delay control word (DCW) toward the optimum sampling point.

Fig. 4.12 Overall architecture of the proposed DQ receiver.

Figure 4.13 shows the circuit implementation of the DQS amplifier. The amplifier consists of two differential stages. The first stage uses a resistive load for high bandwidth and the second stage uses an active load for large gain. Figure 4.14 shows the circuit implementation of the sampler. Since input voltages below the threshold voltage of the NMOS have to be sampled, a strong-arm latch with PMOS inputs is used. A NOR-gate-based RS latch follows the strong-arm latch.

Fig. 4.13 Circuit implementation of the DQS amplifier.

Fig. 4.14 Circuit implementation of the sampler.

Figure 4.15 shows the circuit implementation of the DCDL. The 6-bit DCDL is implemented using inverters, MOS capacitors, and NMOS switches. The delay of the DCDL is controlled by adjusting the load capacitance of the inverter. The binary-weighted MOS capacitor array is partially connected to the inverter through the NMOS switches. The delay tuning range of the DCDL is more than 1 UI in all PVT corners. The circuit implementation of the PC PD is shown in Figure 4.16. As in the DCDL, the capacitance digital-to-analog converter (DAC) is implemented using NMOS capacitors and NMOS switches. The capacitance at $V_{outp}$ is controlled by the 4-bit calibration code, and the control code of the capacitance DAC connected to $V_{outm}$ is fixed to the middle code (4'b1000).

Fig. 4.15 Circuit implementation of the DCDL.

Fig. 4.16 Circuit implementation of the PC PD.

## 4.5 Measurement Result

The prototype chip is fabricated in 65 nm CMOS technology. Figure 4.17 shows the chip photomicrograph of the DQ receiver. A pattern generator, an emulated silicon interposer channel, and a transmitter are implemented to emulate the environment in which the DQ receiver operates in HBM3. The pattern generator generates 4.8 Gb/s PRBS7 patterns for 4 DQ lanes with different seeds, together with the corresponding 2.4 GHz WDQS signal. The transmitter is designed to control the phase of each output signal independently to support the write training function. The emulated silicon interposer channel is implemented on chip using the metal wires of the standard CMOS process.
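For readers who want to reproduce a comparable stimulus in simulation, the following sketch generates PRBS7 streams with a 7-bit linear-feedback shift register using the common x^7 + x^6 + 1 polynomial. The seed values and the function name `prbs7` are hypothetical; the on-chip pattern generator may be organized differently.

```python
def prbs7(seed, n_bits):
    """Return n_bits of a PRBS7 stream (x^7 + x^6 + 1) from a nonzero 7-bit seed."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be nonzero")
    bits = []
    for _ in range(n_bits):
        # Feedback is the XOR of the two most significant register bits.
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        bits.append(new_bit)
    return bits

# Hypothetical per-lane seeds; each lane gets a different phase of the sequence.
SEEDS = [0x01, 0x2A, 0x55, 0x7F]
lanes = [prbs7(seed, 254) for seed in SEEDS]

for seed, lane in zip(SEEDS, lanes):
    print(hex(seed), "".join(map(str, lane[:32])))
```

Each lane repeats every 127 bits. Different seeds give shifted copies of the same sequence, which decorrelates the lanes.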
The simulated insertion loss of the channel is about 1 dB at the Nyquist frequency.

Fig. 4.17 Chip photomicrograph of the DQ receiver.

Figure 4.18 shows the measurement setup for the DQ receiver. The bit error rate tester (BERT) generates the 4.8-GHz TX clock and receives the 1.2 Gb/s recovered data from the prototype chip. The personal computer (PC) controls the prototype chip via the I2C interface and the BERT via the VISA\_PY interface to measure the eye diagram and the bathtub curve. The write training is performed by changing the phase of the output signals in the TX and measuring the eye diagram. The power supply is used to generate the voltage drift externally.

Fig. 4.18 Measurement setup for the DQ receiver.

Fig. 4.19 Flowchart for the DQ receiver measurement.

Figure 4.19 shows the flowchart for the DQ receiver measurement. First, the supply voltage is set to 1.1 V and the write training is performed using the PC and the BERT. Next, the PD calibration is performed and the self-tracking loop is activated. In this state, the voltage drift is applied externally and the effect of the self-tracking loop is verified through the eye diagram.

Figure 4.20 shows the block diagram of the TX. The TX consists of the pattern generator, serializers, a clock divider, the DCDL, and the driver. The pattern generator operates at 500 MHz and generates the parallelized data and the corresponding parallelized DQS pattern according to the burst length. The parallelized data is serialized using the 8:1 serializer. The phase of each DQ output is controlled independently using the DCDL. The TX is designed to emulate per-pin write training by adjusting the DCDL code, the strength of the driver, and the reference voltage of the RX. Figure 4.21 shows the timing diagram of the DQ and DQS when the burst length is 4 and the period of the WRITE command is 8 UI. The pattern transmitted on the DQ repeats an 8-bit pattern, which consists of 4 bits of valid data and 4 bits of the last valid data.
Figure 4.22 shows the overall block diagram of the test chip including the test circuits. The on-chip eye monitor measures the eye diagram by sweeping the phase between Clk1 and Clk2 using the BERT and by sweeping $V_{ref2}$ using the external power supply. The bathtub measurement sequence is also presented in Figure 4.22. The phase between the DQ and DQS is adjusted using the DCDL in the TX. To verify the self-tracking loop, a basic RX without the self-tracking loop is also implemented. The basic RX is identical to the proposed DQ receiver except that it omits the self-tracking loop.

Fig. 4.20 Block diagram of the TX.

Fig. 4.21 Timing diagram of the DQ and DQS when the burst length is 4 and the period of the WRITE command is 8 UI.

Fig. 4.22 Overall block diagram of the test chip including the test circuits.

Figure 4.23 and Figure 4.24 show the measured eye diagrams of the DQ receiver without and with the proposed self-tracking loop and offset calibration scheme, respectively. The timing step and the $V_{ref}$ voltage step are 3 ps (about 14 mUI at 4.8 Gb/s) and 25 mV, respectively. Assuming that the write training is performed with a 1.1-V supply, the margin is defined as the maximum tolerable DQS timing variation from the trained point for error-free operation. The measurement results show that, without the self-tracking loop, a 100-mV voltage drift significantly reduces the timing margin and results in errors. Figure 4.25 shows the measured timing margin versus the supply voltage. After the write training is performed, the timing margin is maximized to 0.38 UI both with and without the self-tracking loop. However, if a voltage drift occurs, the timing margin of the DQ receiver without the self-tracking loop decreases significantly. The margin of the DQ receiver with the self-tracking loop remains almost unchanged as the supply varies from 1 V to 1.2 V.
When the proposed offset calibration scheme is disabled, the lock point deviates from the optimum sampling point due to the offset and the timing margin is degraded by 0.07 UI. However, since the self-tracking loop still tracks the skew due to the voltage drift, the margin does not change with the supply variation. Figure 4.26 shows the power breakdown of the DQ receiver. The total power consumption is 7.04 mW when 4 DQ lanes operate at 4.8 Gb/s. The power overhead of the self-tracking loop is 1.47 mW, which is 21 % of the total power consumption. Table 4.1 compares this work with state-of-the-art forwarded-clock receivers. The proposed DQ receiver achieves the smallest area per lane and a comparable power efficiency among the recent works, since the proposed baud-rate PD detects the phase skew with a small power overhead.

Fig. 4.23 Measured eye diagrams without the self-tracking loop.

Fig. 4.24 Measured eye diagrams with the self-tracking loop and PC PD calibration.

Fig. 4.25 Measured timing margin versus the supply voltage.

Fig. 4.26 Power breakdown of the DQ receiver.

Table 4.1 Performance comparison with previous forwarded-clock receivers.

| | CICC'13 [9] | TCAS-I'16 [10] | SOVC'14 [11] | This work |
|---|---|---|---|---|
| Data rate | 6.4 Gb/s | 12.5 Gb/s | 14 Gb/s | 4.8 Gb/s |
| Clock rate | 3.2 GHz | 6.25 GHz | 3.5 GHz | 1.2 GHz |
| Technology | 28 nm | 65 nm | 65 nm | 65 nm |
| CDR arch. | DLL | DLL | PI | DLL |
| Area\* | 0.02 mm<sup>2</sup> | 0.025 mm<sup>2</sup> | 0.36 mm<sup>2</sup> | 0.0056 mm<sup>2</sup> |
| VDD | 0.85/1.1 V | 0.9 V | 0.8 V | 1 V / 1.1 V / 1.2 V |
| Power efficiency | 1.2 pJ/b | 0.36 pJ/b | 0.56 pJ/b | 0.28 / 0.37 / 0.47 pJ/b |

\* Normalized to the number of lanes

# Chapter 5

# High-Density Transceiver for HBM with 3D-Staggered Channel and Crosstalk Cancellation Scheme

### **5.1 Overview**

With dramatic increases in processor performance, the bandwidth of external DRAM has become a bottleneck for enhancing system performance. To meet the demand for memory bandwidth, HBM, which uses TSV and silicon interposer technology, has been proposed. Since the channel pitch of the silicon interposer is much smaller than that of other packaging technologies, HBM has a large number of I/Os and provides a higher throughput (Gb/s/µm) than the other DRAM interfaces.

To overcome the bandwidth limitation of these RC-dominant wires, various parallel links over on-chip wires have been proposed [13]-[20]. Capacitively-driven links [13]-[16] show high energy efficiency because of their intrinsic pre-emphasis characteristics. However, this structure has two problems: the DC droop of the coupling capacitor and the large area of the coupling capacitor. A DC-balanced coding scheme can remove the DC droop of the coupling capacitor, but it is difficult to apply to the DRAM interface because the excessive latency due to the coding is unacceptable. In addition, since HBM has a large number of I/Os, the area occupied by the coupling capacitors is too large. A combined AC/DC driver has been proposed to remove the DC droop of the coupling capacitor without the DC-balanced coding scheme, but the coupling capacitor still occupies too large an area.
A pulse-width-modulation (PWM)-based link using multi-bit signaling [17] shows high energy efficiency and high throughput, but the signal is attenuated too much over a high-loss channel since the PWM signal has higher frequency components than the NRZ signal.

Meanwhile, to increase the throughput further, either the per-pin data rate or the channel density should be increased. Figure 5.1 shows the definition of the throughput in the silicon interposer. Since increasing the per-pin data rate requires complex and power-hungry circuitry, increasing the channel density is an effective way to achieve high throughput. However, the main problem with reducing the channel pitch is the crosstalk (XT) between adjacent lanes [17]. If the channels are stacked vertically for high channel density, the vertically adjacent channels become additional XT sources. There has been much research on XT cancellation (XTC) [15], [17], [25]-[32], but only a few works consider the XT issues in multiple lanes [29]-[32]. An XTC scheme using an RC differentiator [29], [30] is power-efficient and can be easily extended to multiple lanes. However, as the number of lanes increases, the area of the passive elements becomes too large. A decision-feedback-based XT canceller [32] can compensate for the XT in multiple lanes without passive elements, but it consumes much power due to a large number of feedback taps.

Fig. 5.1 Definition of the throughput in the silicon interposer.

In this chapter, we propose an 8 Gb/s/µm transceiver for next-generation HBM [20]. The feed-forward-equalizer (FFE)-combined XTC scheme efficiently compensates for the XT in multiple lanes without passive elements by reusing the FFE. Thanks to the XTC scheme, no ground shield is required and the channel pitch is greatly reduced. Moreover, the 3D-staggered channel architecture is presented to improve the data density per cross-sectional area by removing the ground-shielding layer between vertically adjacent channels.
The 1-tap fractional-UI FFE and a low input-impedance receiver are used to equalize the channel loss.

## **5.2 Proposed 3D-Staggered Channel**

The characteristics of the channel greatly affect the link performance. These characteristics vary depending on the width, spacing, and thickness of the channels. When designing the channel, not only the channel loss and the XT but also the structure of the transmitter and the receiver should be considered. Because we propose an XT-cancelling TX, the ground-shielding layer can be removed, which increases the data density per cross-sectional area at the cost of larger XT. In Chapter 5.2.1, we explain the structure of the proposed 3D-staggered channel. The characteristics and the lumped element model of the channels are shown in Chapter 5.2.2.

### 5.2.1 Implementation of 3D-Staggered Channel

To emulate the silicon interposer channel used in HBM, a 6-mm-long on-chip channel is implemented in the standard CMOS process. Figure 5.2 shows the cross-section view of the 3D-staggered channel. The M3 and M5 layers are used for the signals, and the M1 and M7 layers are used for the ground. The even- and odd-numbered channels are implemented on two vertically adjacent metal layers and staggered without a ground-shielding layer. The resistance and inductance per unit length are denoted as R and L, and the capacitances per unit length between the signal and the ground, the vertically adjacent channel, and the horizontally adjacent channel are denoted as C<sub>G</sub>, C<sub>V</sub>, and C<sub>H</sub>, respectively. The parameters of the 3D-staggered channel are shown in Table 5.1. The simulated values of R, L, C<sub>G</sub>, C<sub>V</sub>, and C<sub>H</sub> are 212 Ω/mm, 296 pH/mm, 43.6 fF/mm, 19.0 fF/mm, and 19.7 fF/mm, respectively.

Fig. 5.2 Cross-section view of the 3D-staggered channel.

Fig. 5.3 Cross-section area of the 3D-staggered channel.

Fig. 5.4 Cross-section area of the single-ended channel with the ground shield.

Fig. 5.5 Cross-section area of the differential channel with the ground shield.

Table 5.1 Parameters of 3D-staggered channel.

| Parameter | Value |
|---|---|
| R | 212 Ω/mm |
| L | 296 pH/mm |
| C<sub>G</sub> | 43.6 fF/mm |
| C<sub>V</sub> | 19.0 fF/mm |
| C<sub>H</sub> | 19.7 fF/mm |

The width and the spacing of the channels are designed to be 0.5 µm such that the XT from the vertically and horizontally adjacent channels contributes almost equally. Making the contributions of all aggressors equal reduces the complexity of the XT canceller in the TX. This will be explained in detail in Chapter 5.4.2.

Figures 5.3, 5.4, and 5.5 show the occupied cross-sectional area per channel for the various channel configurations. The thickness of the metal and the oxide, and the width and the spacing of the channels, which are denoted as T<sub>M</sub>, T<sub>OX</sub>, W, and S, respectively, are assumed equal for simplicity. The occupied areas per channel of the 3D-staggered channel, the single-ended channel with the ground shield, and the differential channel with the ground shield are given as

$$AREA_{3D-staggered} = 1.5 \cdot (T_M + T_{ox}) \cdot (W + S) \tag{5.1}$$

$$AREA_{single} = 4 \cdot (T_M + T_{ox}) \cdot (W + S)$$ (5.2)

$$AREA_{diff} = 6 \cdot (T_M + T_{ox}) \cdot (W + S). \tag{5.3}$$

The occupied area of the 3D-staggered channel is 62.5 % and 75 % smaller than that of the single-ended channel and the differential channel, respectively. The data density increases because the cross-sectional area occupied by the ground layer is removed, but additional XT appears because the capacitance is now formed to the vertically adjacent signals rather than to the ground. If the additional XT can be compensated for without significant degradation in the eye margin, staggering the channels vertically increases the data density by 2.7 times compared with the single-ended channels.
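The quoted percentages follow directly from the factors in Equations (5.1) to (5.3). A short check, using unit dimensions because the common term $(T_M + T_{ox})(W + S)$ cancels in every ratio (the variable names are illustrative):

```python
def area_per_channel(factor, t_m=1.0, t_ox=1.0, w=1.0, s=1.0):
    """Occupied cross-sectional area per channel: factor * (T_M + T_OX) * (W + S)."""
    return factor * (t_m + t_ox) * (w + s)

# Leading factors of Equations (5.1), (5.2), and (5.3).
FACTORS = {"3d_staggered": 1.5, "single_ended": 4.0, "differential": 6.0}
areas = {name: area_per_channel(f) for name, f in FACTORS.items()}

reduction_vs_single = 1 - areas["3d_staggered"] / areas["single_ended"]
reduction_vs_diff = 1 - areas["3d_staggered"] / areas["differential"]
density_gain_vs_single = areas["single_ended"] / areas["3d_staggered"]

print(f"area reduction vs single-ended: {reduction_vs_single:.1%}")
print(f"area reduction vs differential: {reduction_vs_diff:.1%}")
print(f"density gain vs single-ended  : {density_gain_vs_single:.2f}x")
```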
### 5.2.2 Channel Characteristics and Modeling

The channel on the silicon interposer is very lossy due to the miniaturization of the wire cross section [12]. Since the series resistance is much larger than the series inductive reactance from DC to the Nyquist frequency (wL = 3.72 $\Omega$/mm << R = 212 $\Omega$/mm), the 3D-staggered channel operates in the RC-dominated region and is modeled as a distributed-RC channel. Figure 5.6 shows the lumped element model of the 3D-staggered channel, in which the distributed-RC channel is approximated by 30 lumped stages.

Fig. 5.6 Lumped element model of the 3D-staggered channel.

Since the channel operates in the RC-dominated region, reflection is negligible and impedance matching is not necessary. To determine the output impedance of the TX and the input impedance of the RX, the channel loss should be considered instead of the reflection [33]. In conventional voltage-mode signaling, the output resistance of the TX is low and the input resistance of the RX is high. In this work, however, a low-impedance resistive termination is used at the TX output and a low-impedance RX input is employed to increase the bandwidth of the channel at the cost of the signal swing. The equivalent circuit representations with the high input-impedance receiver and the low input-impedance receiver are shown in Fig. 5.7 and Fig. 5.8, respectively.
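The RC-dominance claim can be checked with a few lines. The sketch assumes a Nyquist frequency of 2 GHz, which corresponds to 4-Gb/s NRZ signaling and reproduces the reactance quoted above. The constant names are illustrative.

```python
import math

R_PER_MM = 212.0       # series resistance, ohm/mm (Table 5.1)
L_PER_MM = 296e-12     # series inductance, H/mm (Table 5.1)
F_NYQUIST_HZ = 2e9     # assumed: half of a 4-Gb/s NRZ data rate

# Inductive reactance per mm at the Nyquist frequency.
omega_l = 2 * math.pi * F_NYQUIST_HZ * L_PER_MM
ratio = R_PER_MM / omega_l

print(f"wL  = {omega_l:.2f} ohm/mm")
print(f"R/wL = {ratio:.0f}")
```

The resistance exceeds the inductive reactance by more than a factor of 50 even at the Nyquist frequency, so dropping the inductance from the channel model costs little accuracy.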
The approximate bandwidth extension obtained with the low input-impedance RX can be derived using the zero-value time constant (ZVTC) method [34] and the dominant-pole approximation, which gives

$$\begin{aligned} -\frac{1}{p_{hiZ}} &= \lim_{N \to \infty} \sum_{k=1}^{N} \tau_k^0 \\ &= \lim_{N \to \infty} \sum_{k=1}^{N} \left[ \left( R_{TX} + \frac{k}{N} R_{ch} \right) \cdot \frac{C_{ch}}{N} \right] \\ &= C_{ch} \left( R_{TX} + \frac{R_{ch}}{2} \right) \end{aligned}$$ (5.4)

$$\begin{aligned} -\frac{1}{p_{lowZ}} &= \lim_{N \to \infty} \sum_{k=1}^{N} \tau_k^0 \\ &= \lim_{N \to \infty} \sum_{k=1}^{N} \left[ \left\{ \left( R_{TX} + \frac{k}{N} R_{ch} \right) \parallel \left( \frac{N - k}{N} R_{ch} + R_{RX} \right) \right\} \cdot \frac{C_{ch}}{N} \right] \\ &= C_{ch} \lim_{N \to \infty} \sum_{k=1}^{N} \left[ \frac{R_{TX} R_{RX} + R_{ch} \left( \frac{k}{N} R_{RX} + \frac{N - k}{N} R_{TX} + \frac{k(N - k)}{N^2} R_{ch} \right)}{N(R_{TX} + R_{RX} + R_{ch})} \right] \\ &= C_{ch} \frac{R_{TX} R_{RX} + \frac{1}{2} R_{ch} \left(R_{TX} + R_{RX} + \frac{1}{3} R_{ch}\right)}{R_{TX} + R_{ch} + R_{RX}} \end{aligned}$$ (5.5)

where $R_{TX}$ is the output impedance of the TX, $R_{RX}$ is the input impedance of the RX, $R_{ch}$ is the total resistance of the channel, and $C_{ch}$ is the total capacitance of the channel.

Fig. 5.7 Equivalent circuit representation with the high input-impedance receiver.

Fig. 5.8 Equivalent circuit representation with the low input-impedance receiver.

Substituting the designed values ($R_{TX}$ = 200 $\Omega$, $R_{RX}$ = 300 $\Omega$, $R_{ch}$ = 1272 $\Omega$) into Equations (5.4) and (5.5), the bandwidth extension obtained by using the low input-impedance RX is about 2.3 times. If $R_{TX}$ and $R_{RX}$ are small enough compared to $R_{ch}$ ($R_{ch} \gg R_{TX}$, $R_{ch} \gg R_{RX}$), the bandwidth with the low input-impedance RX becomes 3 times larger than that with the high input-impedance RX. This is the maximum achievable bandwidth extension by using the low input-impedance RX.
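The closed forms of Equations (5.4) and (5.5) are easy to evaluate numerically. The sketch below computes the bandwidth extension for the designed resistances and cross-checks the closed form of (5.5) against the finite sum it was derived from. $C_{ch}$ is set to 1 because it cancels in the ratio, and the function names are illustrative.

```python
def tau_high_z(r_tx, r_ch, c_ch=1.0):
    """Closed form of Eq. (5.4): dominant time constant with a high-impedance RX."""
    return c_ch * (r_tx + r_ch / 2)

def tau_low_z(r_tx, r_rx, r_ch, c_ch=1.0):
    """Closed form of Eq. (5.5): dominant time constant with a low-impedance RX."""
    num = r_tx * r_rx + 0.5 * r_ch * (r_tx + r_rx + r_ch / 3)
    return c_ch * num / (r_tx + r_ch + r_rx)

def tau_low_z_sum(r_tx, r_rx, r_ch, c_ch=1.0, n=10000):
    """Finite-N ZVTC sum: each capacitor sees the two sides of the line in parallel."""
    total = 0.0
    for k in range(1, n + 1):
        left = r_tx + k / n * r_ch
        right = (n - k) / n * r_ch + r_rx
        total += (left * right / (left + right)) * c_ch / n
    return total

R_TX, R_RX, R_CH = 200.0, 300.0, 1272.0   # designed values in ohms
extension = tau_high_z(R_TX, R_CH) / tau_low_z(R_TX, R_RX, R_CH)
limit = tau_high_z(1e-6, 1.0) / tau_low_z(1e-6, 1e-6, 1.0)

print(f"bandwidth extension (designed values): {extension:.2f}x")
print(f"bandwidth extension (R_TX, R_RX -> 0): {limit:.3f}x")
```

The designed values give an extension of about 2.3, and the limit of vanishing terminations approaches 3, in agreement with the text.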
However, in this case the voltage swing also approaches zero, so this design point cannot be used. The simulated channel loss with the low and high input-impedance RX is presented in Figure 5.9. The insertion loss at DC is increased to 15.4 dB by using the low input-impedance RX. However, the bandwidth is increased 2.56 times, from 281 MHz to 720 MHz, which is close to the calculated value. The signal component at the Nyquist frequency is boosted by 7.7 dB by using the low input-impedance RX.

Fig. 5.9 Channel loss with the low and high input-impedance receiver.

Figure 5.10 shows the frequency response of the 3D-staggered channel. The frequency-dependent loss at the Nyquist frequency is 10.2 dB. The far-end XT (FEXT) from the immediately adjacent channels is denoted as FEXT1, the FEXT from the second-adjacent channels is denoted as FEXT2, and the FEXT from the third-adjacent channels is denoted as FEXT3, as shown in Figure 4.3. The signal-to-XT ratio decreases as the frequency increases. The signal-to-XT ratios of FEXT1 and FEXT2 at the Nyquist frequency are 6.2 dB and 7.4 dB, respectively. The signal-to-XT ratio of FEXT3 is 14.1 dB, and the effect of FEXT3 on the eye margin is negligible.

Fig. 5.10 Frequency response of the 3D-staggered channels.

## **5.3 Proposed Feed-Forward-Equalizer-Combined Crosstalk Cancellation Scheme**

Figure 5.11 shows the conceptual diagram of the proposed FFE-combined XTC scheme. The basic idea of the proposed method is to pre-distort the TX output to remove the XT. If we can calculate the amplitude of the XT and subtract it from the TX output, the XT is cancelled out at the channel output. This is similar to the way the FFE compensates for the channel loss. Since the XT shows up only on a transition in the aggressor, the amplitude of the XT can be calculated from the edge-detector outputs of the aggressors.
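The edge-detector idea can be illustrated as follows. The sketch assumes that every aggressor contributes with equal weight and that a rising edge counts as +1 and a falling edge as -1. The sign convention and the function names are assumptions made for illustration.

```python
def edge(prev_bit, curr_bit):
    """+1 for a rising edge, -1 for a falling edge, 0 when the bit holds."""
    return curr_bit - prev_bit

def net_aggressor_edges(prev_bits, curr_bits):
    """Sum of the edge-detector outputs over all aggressors (equal weights assumed)."""
    return sum(edge(p, c) for p, c in zip(prev_bits, curr_bits))

# Four aggressors: one rises, two fall, one holds.
prev_bits = [0, 1, 1, 0]
curr_bits = [1, 0, 0, 0]
print(net_aggressor_edges(prev_bits, curr_bits))
```

The sign and magnitude of the net value indicate the direction and size of the pre-distortion that the TX needs to apply on that bit.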
When the FFE is used at the TX to compensate for the channel loss, the TX can pre-distort the output by adjusting the amplitude of the FFE output. The amplitude of the FFE output is adjusted by summing the edge-detector outputs of the aggressors. Figure 5.12 shows the simplified waveforms of the input and output of the channel. To cancel the XT accurately, a compensation signal with the same shape as the XT but the opposite polarity should be added. It is well known that the waveform of the XT is close to the derivative of the waveform of the aggressor. Therefore, the waveform of the XT is confined within the transition time of the aggressor, which is denoted as $T_{TR}$, and the XTC signal should also be confined within $T_{TR}$. To generate this short XTC waveform, we use a fractional-UI FFE. With a fractional-UI FFE that sufficiently compensates for the channel loss, most of each transition is completed within the pulse width of the FFE, T<sub>PUL</sub>. Therefore, both the XT and the XT cancellation signal are confined within T<sub>PUL</sub>. As a result, the XT is cancelled out by adjusting the amplitude of the FFE output according to the amplitude of the XT.

Fig. 5.11 Conceptual diagram of the FFE-combined XTC scheme.

Fig. 5.12 Simplified waveforms of the input and output of the channel.

Figure 5.13 shows the effect of the fractional-UI FFE on the crosstalk-induced jitter (CIJ). The eye diagrams are simulated using a data rate of 4 Gb/s, a T<sub>PUL</sub> of 70 ps, and the channel model described in Chapter 5.2. In both cases, only one aggressor is enabled and the optimum coefficients are used. Since the fractional-UI FFE can amplify higher frequency components than the integer-UI FFE, the transition time of the integer-UI FFE is longer than that of the fractional-UI FFE. Therefore, with the integer-UI FFE, the effect of the XT spreads over a longer period of time and it is difficult to compensate for it accurately.

Fig. 5.13 Comparison of XT-induced jitter according to the FFE structure.
In the case of the integer-UI FFE, the residual CIJ is 93 ps and the vertical eye opening is 39.6 mV. With the fractional-UI FFE, the residual CIJ is reduced to 46 ps, but the vertical eye opening is also reduced to 20.2 mV. The vertical eye opening is smaller because the optimum FFE coefficient of the fractional-UI FFE is larger than that of the integer-UI FFE. When the XT is large, the fractional-UI FFE is more advantageous because the residual CIJ could otherwise close the eye.

## 5.4 Circuit Implementation

### 5.4.1 Overall Architecture

Figure 5.14 shows the overall architecture of the proposed transceiver with the FFE-combined XTC scheme. It consists of an encoder, the TX with the FFE-combined XTC, the 3D-staggered channel, and the RX. The encoder receives 1:4 de-serialized data from all channels and calculates the required amplitude of the FFE output of each channel as digital codes. These digital codes for the pull-up and pull-down amplitudes are serialized to 5 bits each, and each bit is converted to a 70-ps-wide pulse by the pulse generator for the fractional-UI FFE operation. The 3D-staggered channels are implemented on-chip to emulate the silicon interposer between the HBM and the processor. The RX amplifies the input signal and provides a low input impedance to increase the bandwidth of the channel.

### 5.4.2 Transmitter with FFE-Combined XTC

The TX uses the charge-injecting FFE driver proposed in [35] to eliminate the short-circuit current caused by the subtraction in the FFE. The charge-injecting FFE driver never turns on the pull-up and pull-down paths simultaneously, which eliminates the short-circuit current. However, this makes the relationship between the amplitude of the FFE output and the strength of the driver non-linear. The required strength of the driver for each data pattern is determined by numerical simulation and is shown in Figure 5.15.
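
As a rough illustration of the encoder's role, the sketch below maps a victim transition and an aggressor XT level to the pull-up and pull-down segment codes. The segment strengths (5.5x, 4.1x, 2.5x, 1.6x, 1x) and the table entries are copied from Figure 5.15. The names `SEGMENT_STRENGTH`, `to_code`, `strength`, and `encode` are hypothetical, and treating the XT level as a signed count (positive for rising aggressors, negative for falling ones) is an assumption. The actual encoder is a hardware block, so this is only a behavioral sketch.

```python
# Behavioral sketch of the Fig. 5.15 lookup (names are illustrative, not from the design).

# Strength of each driver segment, indexed by segment number (4 = strongest).
SEGMENT_STRENGTH = {4: 5.5, 3: 4.1, 2: 2.5, 1: 1.6, 0: 1.0}

# Switched-on segments per XT level, copied from Fig. 5.15.
# Assumption: level +k means "k rise", level -k means "k fall".
RISE = {-4: (4, 3, 1), -3: (4, 3), -2: (4, 2), -1: (4, 1), 0: (4,),
        1: (3,), 2: (2, 0), 3: (1,), 4: (0,)}          # pull-up segments
FALL = {-4: (0,), -3: (1,), -2: (2, 0), -1: (3,), 0: (4,),
        1: (4, 1), 2: (4, 2), 3: (4, 3), 4: (4, 3, 1)}  # pull-down segments
HOLD_PULL_UP = {-4: (3,), -3: (2, 0), -2: (2,), -1: (0,)}
HOLD_PULL_DOWN = {1: (0,), 2: (2,), 3: (2, 0), 4: (3,)}


def to_code(segments):
    """Return the 5-bit control word as a string, MSB = segment 4."""
    return "".join("1" if s in segments else "0" for s in (4, 3, 2, 1, 0))


def strength(segments):
    """Total driver strength (in units of 1x) of the switched-on segments."""
    return round(sum(SEGMENT_STRENGTH[s] for s in segments), 1)


def encode(victim, xt_level):
    """Return (P_CTRL, N_CTRL) for victim in {'rise', 'fall', 'hold'}."""
    if not -4 <= xt_level <= 4:
        raise ValueError("xt_level must be between -4 and 4")
    if victim == "rise":
        return to_code(RISE[xt_level]), to_code(())
    if victim == "fall":
        return to_code(()), to_code(FALL[xt_level])
    if victim == "hold":
        return (to_code(HOLD_PULL_UP.get(xt_level, ())),
                to_code(HOLD_PULL_DOWN.get(xt_level, ())))
    raise ValueError("victim must be 'rise', 'fall', or 'hold'")


if __name__ == "__main__":
    # Rising data with 4 falling aggressors: segments 4, 3, and 1 are switched on.
    print(encode("rise", -4), strength(RISE[-4]))
```

A lookup of this kind needs only 9 entries per victim transition because the XT from all aggressors is made equal.
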
Since the channel is designed such that the XT from each of the 4 aggressors is the same, the amplitude of the XT has a total of 9 levels. Note that if the XT from the 4 aggressors had different amplitudes, the amplitude of the XT could have up to 3<sup>4</sup> levels. By making the XT from the 4 aggressors equal, the number of output entries is significantly reduced, which reduces the complexity of the encoder.

| XT | Rise ctrl | Rise str. | Fall ctrl | Fall str. | Hold ctrl | Hold str. |
|--------|--------|-------|--------|-------|------|------|
| 4 fall | P4+3+1 | 11.2x | N0 | 1x | P3 | 4.1x |
| 3 fall | P4+3 | 9.6x | N1 | 1.6x | P2+0 | 3.5x |
| 2 fall | P4+2 | 8x | N2+0 | 3.5x | P2 | 2.5x |
| 1 fall | P4+1 | 7.1x | N3 | 4.1x | P0 | 1x |
| 0 | P4 | 5.5x | N4 | 5.5x | - | - |
| 1 rise | P3 | 4.1x | N4+1 | 7.1x | N0 | 1x |
| 2 rise | P2+0 | 3.5x | N4+2 | 8x | N2 | 2.5x |
| 3 rise | P1 | 1.6x | N4+3 | 9.6x | N2+0 | 3.5x |
| 4 rise | P0 | 1x | N4+3+1 | 11.2x | N3 | 4.1x |

Fig. 5.15 Required strength of the driver and the corresponding values of N<sub>CTRL</sub> and P<sub>CTRL</sub> according to the data pattern (Rise, Fall, and Hold refer to the data; XT is the number of rising or falling aggressors).

The driver is divided into 5 segments with strengths of 5.5x, 4.1x, 2.5x, 1.6x, and 1x, which correspond to segments 4 down to 0 in Figure 5.15. The required strength of the driver is produced by changing the combination of switched-on driver segments. For example, when the data is rising and 4 aggressors are falling, the required pull-up driver strength is 11.2x (5.5x + 4.1x + 1.6x). Therefore, P<sub>CTRL</sub> is encoded to 11010 and N<sub>CTRL</sub> is encoded to 00000.

### 5.4.3 Receiver

Figure 5.16 shows the circuit implementation of the RX and the simulated RX gain curve. The regulated-cascode (RGC) trans-impedance amplifier (TIA) with a positive amplifier [36] is used as the RX to provide a low input impedance. This RGC-TIA provides a lower input impedance than the conventional RGC-TIA.
The simulation results show that the input impedance of the RX is 200 $\Omega$ and the TIA gain is 2.22 k$\Omega$ at low frequency. The RX consumes 0.5 mW per channel from a 1.2-V supply.

Fig. 5.16 Circuit implementation of the RX and simulated RX gain curve.

## 5.5 Measurement Result

The prototype transceiver and the 3D-staggered channel have been fabricated in a 65-nm CMOS process. The die photomicrograph and the performance summary are shown in Figure 5.17 and Figure 5.18, respectively. The transceiver occupies 0.05 mm² and consumes 36.6 mW at 4 Gb/s when all 6 lanes are active.

Fig. 5.17 Die photomicrograph.

| | This work |
|---------------------------------|---------------------------------------------------------|
| Technology (nm) | 65 |
| VDD (V) | 1.2 |
| Link length (mm) | 6 |
| Data rate per wire (Gb/s/ch) | 4 |
| Throughput (Gb/s/um) | 8 |
| Area (mm²) | Encoder: 0.03<br>TX: 0.016<br>RX: 0.004<br>Total: 0.05 |
| Power (mW) | Encoder: 12<br>TX: 21.6<br>RX: 3<br>Total: 36.6 |

Fig. 5.18 Performance summary.

Figure 5.19 and Figure 5.20 show the measurement setup and its block diagram. The test chip receives the TX and RX clocks from the BERT (Agilent N4903A) and returns the sampled data to the BERT. The internal pattern generator generates a 4-Gb/s PRBS7 pattern, and the internal data sampler and the BERT are used to measure the eye diagram. The power supply (Agilent B2926A) generates the supply voltage for a linear regulator (LT3042) and the reference voltage used for the eye diagram measurement. The regulation board provides low-noise power to the test chip. The eye diagram measurement is automated using Python scripts. The PC controls the BERT and the power supply through the VISA interface and controls the test chip through the I²C interface.

Fig. 5.19 Measurement setup.

Fig. 5.20 Block diagram of the measurement setup.

Figure 5.21 and Figure 5.22 show the measured eye diagrams with and without the proposed XTC scheme at the output of CH3 (RX<sub>3,out</sub>). The pattern of the aggressors is PRBS7 with varying initial values. With one aggressor switched on, enabling the XTC increases the horizontal eye opening from 0.34 UI to 0.6 UI and the vertical eye opening from 18 mV to 44 mV. If two or more aggressors are switched on, the eye is completely closed without the XTC scheme. However, the eye width with the XTC is larger than 0.4 UI even when all the aggressors are switched on. The vertical and horizontal eye openings without the XT are 62 mV and 0.74 UI, respectively.

Fig. 5.21 Measured eye diagrams of the proposed transceiver with and without XTC scheme.

Fig. 5.22 Measured eye diagrams of the proposed transceiver with and without XTC scheme.

Figure 5.23 shows the measured CIJ with and without the XTC. The CIJ is obtained by subtracting the jitter without the XT from the jitter with the XT. The CIJ is measured at 1.24 Gb/s because the CIJ without the XTC is larger than 1 UI at 4 Gb/s. All other conditions are the same as in the eye diagram measurements. Without the XT, the peak-to-peak jitter, including the random jitter and ISI, is measured as 76 ps. When all aggressors are switched on, the proposed XTC scheme reduces the CIJ by 74 %, from 330 ps to 85 ps. The largest reduction ratio, 78 %, is obtained when 3 aggressors are switched on.

Fig. 5.23 Measured XT induced jitter with and without the XTC.

The bathtub curve at 4 Gb/s is plotted in Figure 5.24. When all aggressors are switched on and the XTC is disabled, the bit error rate (BER) is above 10<sup>-2</sup> at all sampling points. When the XTC is enabled, the eye is open with a 0.32 UI margin at a BER of 10<sup>-12</sup>.

Fig. 5.24 Measured bathtub curve at 4 Gb/s.

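
The CIJ values quoted in this section follow from the subtraction described in the text. The short sketch below reproduces that arithmetic from the measured peak-to-peak jitter values reported in Figure 5.25. The function and variable names are illustrative only and are not part of the measurement scripts used in this work.

```python
# Reproduce the CIJ arithmetic from the measured peak-to-peak jitter (values in ps).

JITTER_NO_XT_PS = 76  # peak-to-peak jitter without XT (random jitter and ISI)

# Number of aggressors -> (jitter with XT and XTC disabled, jitter with XT and XTC enabled).
JITTER_WITH_XT_PS = {1: (176, 111), 2: (321, 136), 3: (366, 141), 4: (406, 161)}


def cij(jitter_with_xt_ps, jitter_no_xt_ps=JITTER_NO_XT_PS):
    """CIJ = jitter with XT minus jitter without XT."""
    return jitter_with_xt_ps - jitter_no_xt_ps


def cij_summary(n_aggressors):
    """Return (CIJ without XTC, CIJ with XTC, CIJ reduction, reduction ratio in %)."""
    without_xtc, with_xtc = JITTER_WITH_XT_PS[n_aggressors]
    cij_off = cij(without_xtc)
    cij_on = cij(with_xtc)
    reduction = cij_off - cij_on
    return cij_off, cij_on, reduction, 100.0 * reduction / cij_off


if __name__ == "__main__":
    for n in sorted(JITTER_WITH_XT_PS):
        cij_off, cij_on, reduction, ratio = cij_summary(n)
        print(f"{n} aggressor(s): CIJ {cij_off} ps -> {cij_on} ps, "
              f"reduction {reduction} ps ({ratio:.0f} %)")
```
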
The measured CIJ and the CIJ reduction ratio for various numbers of aggressors, with and without the XTC scheme, are summarized in Figure 5.25.

| Number of agg. | Jitter<sub>p-p</sub> w/o XT | Jitter<sub>p-p</sub> w/ XT, w/o XTC | Jitter<sub>p-p</sub> w/ XT, w/ XTC | CIJ w/o XTC | CIJ w/ XTC | CIJ reduction | CIJ reduction ratio |
|---|-------|--------|--------|--------|-------|--------|------|
| 1 | 76 ps | 176 ps | 111 ps | 100 ps | 35 ps | 65 ps | 65 % |
| 2 | 76 ps | 321 ps | 136 ps | 245 ps | 60 ps | 185 ps | 76 % |
| 3 | 76 ps | 366 ps | 141 ps | 290 ps | 65 ps | 225 ps | 78 % |
| 4 | 76 ps | 406 ps | 161 ps | 330 ps | 85 ps | 245 ps | 74 % |

Fig. 5.25 Measured XT-induced jitter.

Table 5.2 summarizes the performance of the proposed transceiver and compares it with other multi-channel on-chip serial links. Since the bandwidth of the distributed-RC channel is inversely proportional to the square of the line length and proportional to the cross-sectional area, the data rate is multiplied by the square of the line length and divided by the cross-sectional area to calculate the first figure of merit, FoM<sub>1</sub> [33]. However, since many papers do not report T<sub>M</sub> and T<sub>ox</sub>, FoM<sub>2</sub> is also shown, which divides the data rate by the channel pitch rather than the cross-sectional area. The proposed transceiver provides the highest throughput of 8 Gb/s/um and the best FoM<sub>1</sub> and FoM<sub>2</sub> among the compared works. This is because the channel pitch is significantly reduced by using the proposed XTC scheme.

Figure 5.26 shows the power breakdown of the proposed transceiver. Note that most of the power is consumed by the XTC scheme to compensate for the XT of the 3D-staggered channels.

Fig. 5.26 Power breakdown of the transceiver.

Table 5.3 shows the comparison with other multi-channel XTC schemes.
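
Returning to the figures of merit defined above for Table 5.2, the throughput and FoM<sub>2</sub> entries can be reproduced from the data rate, channel pitch, and line length alone. The sketch below does this for three rows of Table 5.2, taking FoM<sub>2</sub> as throughput times the square of the line length. The function names and the `LINKS` dictionary are hypothetical. FoM<sub>1</sub> is left out because it also needs the cross-sectional area of the channel.

```python
# Throughput and FoM2 from data rate, channel pitch, and line length (Table 5.2 values).


def throughput_gbps_per_um(data_rate_gbps, pitch_um):
    """Throughput = data rate per wire divided by the channel pitch."""
    return data_rate_gbps / pitch_um


def fom2(data_rate_gbps, pitch_um, length_mm):
    """FoM2 = throughput * (line length)^2, in Gb/s/um*mm^2."""
    return throughput_gbps_per_um(data_rate_gbps, pitch_um) * length_mm ** 2


# Link -> (data rate per wire in Gb/s, channel pitch in um, line length in mm).
LINKS = {
    "This work": (4.0, 0.5, 6.0),
    "[17]": (2.2, 0.4, 5.0),
    "[15]": (2.5, 2.0, 10.0),
}

if __name__ == "__main__":
    for name, (rate, pitch, length) in LINKS.items():
        print(f"{name}: {throughput_gbps_per_um(rate, pitch):.2f} Gb/s/um, "
              f"FoM2 = {fom2(rate, pitch, length):.1f}")
```

The quadratic weighting of the line length reflects the statement above that the bandwidth of a distributed-RC channel falls with the square of its length.
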
The transceiver achieves a comparable CIJ reduction ratio and the highest energy efficiency of 1.5 pJ/b by utilizing the FFE-combined XTC scheme. Figure 5.27 shows the throughput versus the channel length for the multi-channel on-chip serial links published over the past 10 years. The proposed transceiver achieves a high throughput for its line length, resulting in the best FoM<sub>2</sub>.

Table 5.2 Performance summary and comparison with other multi-channel on-chip serial links.

| | [13] | [14] | [15] | [16] | [17] | [18] | This work |
|---|---|---|---|---|---|---|---|
| Technology (nm) | 65 | 65 | 130 | 28 | 65 | 130 | 65 |
| Number of channels | 9 | 16 | 3 | 3 | 4 | 8 | 6 |
| Line length (mm) | 6 | 5 | 10 | 2.5 | 5 | 5 | 6 |
| Data rate per wire (Gb/s/ch) | 10 | 4 | 2.5 | 20 | 2.2 | 2 | 4 |
| Channel pitch (um) | 3.9 | _ | 2 | 10.2 | 0.4 | 1.04 | 0.5 |
| Throughput (Gb/s/um) | 2.56 | 4 | 1.25 | 1.96 | 5.5 | 1.92 | 8 |
| T<sub>M</sub> (um) | 0.5* | 0.2* | 0.2* | 0.64 | N/A | N/A | 0.22 |
| Upper/lower T<sub>ox</sub> (um) | INF\*\*/0.5* | INF\*\*/INF\*\* | INF\*\* | 1/1 | N/A | N/A | 0.57/0.57 |
| Energy efficiency per length (fJ/b/mm) | 174 | 48.4 | 41 | 120 | 180 | 104 | 254 |
| XT compensation | No | No | Yes | No | Yes | No | Yes |
| FoM<sub>1</sub>\*\*\* (Gb/s/um<sup>2</sup>\*mm<sup>2</sup>) | 30.8 | 62.5 | 78.1 | 3.74 | N/A | N/A | 121.5 |
| FoM<sub>2</sub>\*\*\*\* (Gb/s/um\*mm²) | 92.3 | 100 | 125 | 12.3 | 137.5 | 48.1 | 288 |

\* Not given in the paper; estimated values based on typical technology data

\*\* Assumed $3 \times T_M$ when calculating FoM

\*\*\* Data rate / Cross-section area \* (Line length)²; proposed in [18]

\*\*\*\* Throughput \* (Line length)²

Table 5.3 Comparison with other multi-channel XTC schemes.

| | [29] | [30] | [32] | This work |
|-----------------------------|-------|------|-------|-----------|
| Tech. (nm) | 65 | 130 | 32 | 65 |
| # of channels | 4 | 3 | 8 | 6 |
| Data rate (Gb/s) | 12 | 5 | 7 | 4 |
| Energy efficiency (pJ/b) | 1.8* | 4.3 | 5.9* | 1.5 |
| CIJ reduction ratio | 90%\*\* | 75% | 63%\*\* | 78% |
| CIJ reduction (ps) | N/A | 36 | N/A | 245 |

\* RX only

\*\* Estimated from the reduction ratio of XT noise amplitude

Fig. 5.27 Throughput versus channel length among the multi-channel on-chip serial link papers published over the past 10 years.

# Chapter 6

## **Conclusion**

In this thesis, design techniques for a low-power and high-density transceiver for next-generation HBM are proposed.

First, a 4.8-Gb/s DQ receiver with a self-tracking loop for HBM3 is proposed. The self-tracking loop compensates for the phase skew between the DQ and DQS caused by voltage or thermal drift, allowing robust operation. An analog-assisted baud-rate PD with low power consumption and small area is proposed. The proposed PD converts the phase skew to a voltage difference and detects the phase skew from the voltage difference using a sampler. An offset calibration scheme that efficiently compensates for the offsets of the PD using the write training is also proposed. The proposed DQ receiver is fabricated in a 65-nm CMOS process and occupies 0.0056 mm². The DQ receiver achieves a power efficiency of 370 fJ/b and shows robust operation under a $\pm$ 10% supply variation.

Secondly, an 8-Gb/s/um link for the future HBM utilizing the FFE-combined XTC scheme is presented. The proposed 3D-staggered channel structure achieves a high data rate per cross-sectional area by removing the ground shield between the horizontally and vertically adjacent channels. A 1-tap fractional-UI FFE and a low input-impedance RX are used to compensate for the channel loss of the distributed-RC channel. An analysis of the effect of the low input-impedance RX is also presented.
In addition, the FFE-combined XTC scheme is proposed, which efficiently reduces the CIJ by reusing the existing FFE circuits, resulting in a higher energy efficiency than the other XTC schemes. The proposed transceiver achieves the highest throughput of 8 Gb/s/um and the best energy efficiency of 1.5 pJ/b among the comparison groups.

# **Bibliography**

- [1] D. U. Lee et al., "25.2 A 1.2V 8Gb 8-channel 128GB/s high-bandwidth memory (HBM) stacked DRAM with effective microbump I/O test methods using 29nm process and TSV," 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), San Francisco, CA, 2014, pp. 432-433.
- [2] K. Sohn et al., "A 1.2 V 20 nm 307 GB/s HBM DRAM With At-Speed Wafer-Level IO Test Scheme and Adaptive Refresh Considering Temperature Distribution," in IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 250-260, Jan. 2017.
- [3] High Bandwidth Memory (HBM) DRAM, document JESD235A, JEDEC Solid State Technology Association, Nov. 2015.
- [4] H. Jun et al., "HBM (High Bandwidth Memory) DRAM Technology and Architecture," 2017 IEEE International Memory Workshop (IMW), Monterey, CA, 2017, pp. 1-4.
- [5] Wikipedia, Random-access memory. [Online] [Accessed on 9<sup>th</sup> Jun. 2020] https://en.wikipedia.org/wiki/Random-access_memory.
- [6] K. Sohn et al., "A 1.2 V 20 nm 307 GB/s HBM DRAM With At-Speed Wafer-Level IO Test Scheme and Adaptive Refresh Considering Temperature Distribution," in IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 250-260, Jan. 2017.
- [7] K. Tran, "The era of high bandwidth memory," 2016 IEEE Hot Chips 28 Symposium (HCS), Cupertino, CA, 2016, pp. 1-22.
- [8] High Bandwidth Memory DRAM (HBM1, HBM2), JESD235C, JEDEC, Nov. 2018.
- [9] S. Chen, H. Li, L. Yang, Z. Yang, W. Hu and P. Y. Chiang, "A 1.2 pJ/b 6.4 Gb/s 8+1-lane forwarded-clock receiver with PVT-variation-tolerant all-digital clock and data recovery in 28nm CMOS," Proceedings of the IEEE 2013 Custom Integrated Circuits Conference, San Jose, CA, 2013, pp. 1-4.
- [10] W. Bae, G. Jeong, K. Park, S. Cho, Y. Kim and D. Jeong, "A 0.36 pJ/bit, 0.025 mm², 12.5 Gb/s Forwarded-Clock Receiver With a Stuck-Free Delay-Locked Loop and a Half-Bit Delay Line in 65-nm CMOS Technology," in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 63, no. 9, pp. 1393-1403, Sept. 2016.
- [11] Hao Li et al., "A 0.8V, 560fJ/bit, 14Gb/s injection-locked receiver with input duty-cycle distortion tolerable edge-rotating 5/4X sub-rate CDR in 65nm CMOS," 2014 Symposium on VLSI Circuits Digest of Technical Papers, Honolulu, HI, 2014, pp. 1-2.
- [12] X. Gu et al., "High-density silicon carrier transmission line design for chip-to-chip interconnects," 2011 IEEE 20th Conference on Electrical Performance of Electronic Packaging and Systems, San Jose, CA, 2011, pp. 27-30.
- [13] D. Walter et al., "A source-synchronous 90Gb/s capacitively driven serial on-chip link over 6mm in 65nm CMOS," 2012 IEEE International Solid-State Circuits Conference, San Francisco, CA, 2012, pp. 180-182.
- [14] M. Chen, M. F. Chang and C. K. Yang, "A low-PDP and low-area repeater using passive CTLE for on-chip interconnects," 2015 Symposium on VLSI Circuits (VLSI Circuits), Kyoto, 2015, pp. C244-C245.
- [15] J. Lee, W. Lee and S. Cho, "A 2.5-Gb/s On-Chip Interconnect Transceiver With Crosstalk and ISI Equalizer in 130 nm CMOS," in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 59, no. 1, pp. 124-136, Jan. 2012.
- [16] B. Dehlaghi and A. Chan Carusone, "A 0.3 pJ/bit 20 Gb/s/Wire Parallel Interface for Die-to-Die Communication," in IEEE Journal of Solid-State Circuits, vol. 51, no. 11, pp. 2690-2701, Nov. 2016.
- [17] J. Seo, D. Blaauw and D. Sylvester, "Crosstalk-Aware PWM-Based On-Chip Links With Self-Calibration in 65 nm CMOS," in IEEE Journal of Solid-State Circuits, vol. 46, no. 9, pp. 2041-2052, Sept. 2011.
- [18] J. You, J. Song and C. Kim, "A 2-Gb/s/ch Data-Dependent Swing-Limited On-Chip Signaling for Single-Ended Global I/O in SDRAM," in IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 64, no. 10, pp. 1207-1211, Oct. 2017.
- [19] S. Höppner et al., "An Energy Efficient Multi-Gbit/s NoC Transceiver Architecture With Combined AC/DC Drivers and Stoppable Clocking in 65 nm and 28 nm CMOS," in IEEE Journal of Solid-State Circuits, vol. 50, no. 3, pp. 749-762, March 2015.
- [20] H. Ko, S. Shin, J. Oh, K. Park and D. Jeong, "An 8Gb/s/μm FFE-Combined Crosstalk-Cancellation Scheme for HBM on Silicon Interposer with 3D-Staggered Channels," 2020 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 2020, pp. 128-130.
- [21] Byungsub Kim. (2010). Equalized On-Chip Interconnect: Modeling, Analysis, and Design (Doctoral dissertation). Retrieved from https://dspace.mit.edu/
- [22] D. Pozar, Microwave Engineering. Reading, MA: Addison-Wesley, 1990.
- [23] A. J. Rainal, "Transmission properties of various styles of printed wiring boards," in The Bell System Technical Journal, vol. 58, no. 5, pp. 995-1025, May-June 1979.
- [24] M. S. Lin, A. H. Engvik and J. S. Loos, "Measurements of crosstalk between closely-packed lossy microstrips on silicon substrates," in Electronics Letters, vol. 26, no. 11, pp. 714-716, 24 May 1990.
- [25] J. F. Buckwalter and A. Hajimiri, "Cancellation of crosstalk-induced jitter," in IEEE Journal of Solid-State Circuits, vol. 41, no. 3, pp. 621-632, March 2006.
- [26] H. Jung, I. Yi, S. Lee, J. Sim and H. Park, "A Transmitter to Compensate for Crosstalk-Induced Jitter by Subtracting a Rectangular Crosstalk Waveform From Data Signal During the Data Transition Time in Coupled Microstrip Lines," in IEEE Journal of Solid-State Circuits, vol. 47, no. 9, pp. 2068-2079, Sept. 2012.
- [27] M. H. Nazari and A. Emami-Neyestanak, "A 15-Gb/s 0.5-mW/Gbps Two-Tap DFE Receiver With Far-End Crosstalk Cancellation," in IEEE Journal of Solid-State Circuits, vol. 47, no. 10, pp. 2420-2432, Oct. 2012.
- [28] S. Kao and S. Liu, "A 7.5-Gb/s One-Tap-FFE Transmitter With Adaptive Far-End Crosstalk Cancellation Using Duty Cycle Detection," in IEEE Journal of Solid-State Circuits, vol. 48, no. 2, pp. 391-404, Feb. 2013.
- [29] T. Oh and R. Harjani, "A 12-Gb/s Multichannel I/O Using MIMO Crosstalk Cancellation and Signal Reutilization in 65-nm CMOS," in IEEE Journal of Solid-State Circuits, vol. 48, no. 6, pp. 1383-1397, June 2013.
- [30] S. Lee, B. Kim, H. Park and J. Sim, "A 5 Gb/s Single-Ended Parallel Receiver With Adaptive Crosstalk-Induced Jitter Cancellation," in IEEE Journal of Solid-State Circuits, vol. 48, no. 9, pp. 2118-2127, Sept. 2013.
- [31] K. Hwang and L. Kim, "A 5 Gbps 1.6 mW/Gbps/CH Adaptive Crosstalk Cancellation Scheme With Reference-less Digital Calibration and Switched Termination Resistors for Single-Ended Parallel Interface," in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 61, no. 10, pp. 3016-3024, Oct. 2014.
- [32] C. Aprile et al., "An Eight-Lane 7-Gb/s/pin Source Synchronous Single-Ended RX With Equalization and Far-End Crosstalk Cancellation for Backplane Channels," in IEEE Journal of Solid-State Circuits, vol. 53, no. 3, pp. 861-872, March 2018.
- [33] E. Mensink, D. Schinkel, E. A. M. Klumperink, E. van Tuijl and B. Nauta, "Power Efficient Gigabit Communication Over Capacitively Driven RC-Limited On-Chip Interconnects," in IEEE Journal of Solid-State Circuits, vol. 45, no. 2, pp. 447-457, Feb. 2010.
- [34] A. Hajimiri, "Generalized Time- and Transfer-Constant Circuit Analysis," in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 57, no. 6, pp. 1105-1121, June 2010.
- [35] B. Kim and V. Stojanovic, "A 4Gb/s/ch 356fJ/b 10mm equalized on-chip interconnect with nonlinear charge-injecting transmit filter and transimpedance receiver in 90nm CMOS," 2009 IEEE International Solid-State Circuits Conference - Digest of Technical Papers, San Francisco, CA, 2009, pp. 66-67.
- [36] Y. Kim, E. Jung and S. Lee, "Bandwidth enhancement technique for CMOS RGC transimpedance amplifier," in Electronics Letters, vol. 50, no. 12, pp. 882-884, 5 June 2014.

# Abstract in Korean

This thesis proposes design techniques for a high-density, low-power transceiver for next-generation HBM. First, a data receiver with a self-tracking loop that compensates for the phase skew between the data and the clock caused by voltage and temperature variations is proposed. The proposed self-tracking loop reduces the power consumption and area by using a phase detector that operates at the same rate as the data rate. In addition, a method that effectively compensates for the offset of the phase detector by using the write training of the memory is proposed. The proposed data receiver is fabricated in a 65 nm process and consumes 370 fJ/b at 4.8 Gb/s. It is also verified to operate stably under a 10 % supply voltage variation.

Second, a high-density transceiver utilizing a crosstalk compensation scheme combined with a feed-forward equalizer is proposed. The proposed transmitter compensates for the crosstalk by distorting the transmitter output by an amount corresponding to the crosstalk. The proposed crosstalk compensation scheme minimizes the additional circuits by reusing the feed-forward equalizer implemented to compensate for the channel loss. Because the crosstalk can be compensated, the proposed transceiver greatly reduces the channel pitch and realizes high-density communication. To further increase the density, a stacked channel structure that removes the shielding layer between vertically adjacent channels is also proposed. The prototype chip including 6 transceivers is fabricated in a 65 nm process. The 6-mm channels are implemented on chip to emulate the silicon interposer channel between the HBM and the processor. The proposed crosstalk compensation scheme is verified by transmitting data simultaneously over 6 adjacent channels with a 0.5-um pitch, and it reduces the crosstalk-induced jitter by up to 78 %. The proposed transceiver has a throughput of 8 Gb/s/um, and the 6 transceivers consume a total of 36.6 mW.

Keywords: Baud-rate CDR, crosstalk cancellation, crosstalk-induced jitter, far-end crosstalk, feed-forward equalizer (FFE), forwarded-clock (FC) receiver, HBM, memory interface, on-chip interconnect, parallel links, phase detector, RC-dominant wire, single-ended signaling, silicon interposer, transceiver.

Student Number: 2015-20883