Search CORE

80 research outputs found

Ring-Based Resonant Standing Wave Oscillators for 3D Clocking Applications

Author: Douglass Andrew James
Publication venue
Publication date: 16/10/2019
Field of study

Ring-based resonant standing wave oscillators have been shown to be a useful clocking tech-nique that can distribute and generate a high frequency, low skew, low power, and stable clock signal. By using through-silicon-vias, this type of standing wave oscillator can be used to gener-ate the clocking scheme for 3D integrated circuits. In this thesis, we propose the use of such 3D standing wave oscillators and show how independent 3D oscillators in different stacks can syn-chronize through the use of a redistribution layer stub. Inter-chip clock synchronization is then accomplished without the need for a PLL. In addition, we propose the first 3D ring-based resonant standing wave oscillator bootstrap and reset circuit to initialize and stop oscillation. Using a 3D ring-based resonant standing wave oscillator, we propose a ring-based data fabric for 3D stacked DRAM and compare the results with existing approaches such as High Bandwidth Memory (HBM) or Wide I/O memory. We show that our Memory Architecture using a Ring-based Scheme (MARS) can provide the increases in speed necessary to overcome current memory bottlenecks, and can scale effectively as future 3D stacks become larger. Our MARS can trade off power, throughput, and latency to match different application requirements. By using a narrow bus, and connecting it to all channels, the MARS8 can provide an alternative memory configuration with ∼ 6.9× lower power consumption than HBM, and ∼ 2.7× faster speeds than Wide I/O. Using multiple ring topologies in the same stack, the channel count can double from 8 to 16, and then to 32. This is possible since MARS uses about 4× fewer TSVs per channel than HBM or Wide I/O. This provides speeds up to ∼ 4.2× faster than traditional HBM. This scalable architecture allows higher throughput and faster system performance for next-generation DRAM. The MARS topology proposed in this thesis can be used in a variety of computing systems, from lightweight IoT to large-scale data centers

Texas A&M Repository

고속 DRAM 인터페이스를 위한 전압 및 온도에 둔감한 클록 패스와 위상 오류 교정기 설계

Author: 신소영
Publication venue: 서울대학교 대학원
Publication date: 01/02/2021
Field of study

학위논문 (박사) -- 서울대학교 대학원 : 공과대학 전기·정보공학부, 2021. 2. 정덕균.To cope with problems caused by the high-speed operation of the dynamic random access memory (DRAM) interface, several approaches are proposed that are focused on the clock path of the DRAM. Two delay-locked loop (DLL) based schemes, a forwarded-clock (FC) receiver (RX) with self-tracking loop and a quadrature error corrector, are proposed. Moreover, an open-loop based scheme is presented for drift compensation in the clock distribution. The open-loop scheme consumes less power consumption and reduces design complexity. The FC RX uses DLLs to compensate for voltage and temperature (VT) drift in unmatched memory interfaces. The self-tracking loop consists of two-stage cascaded DLLs to operate in a DRAM environment. With the write training and the proposed DLL, the timing relationship between the data and the sampling clock is always optimal. The proposed scheme compensates for delay drift without relying on data transitions or re-training. The proposed FC RX is fabricated in 65-nm CMOS process and has an active area containing 4 data lanes of 0.0329 mm2. After the write training is completed at the supply voltage of 1 V, the measured timing margin remains larger than 0.31-unit interval (UI) when the supply voltage drifts in the range of 0.94 V and 1.06 V from the training voltage, 1 V. At the data rate of 6.4 Gb/s, the proposed FC RX achieves an energy efficiency of 0.45 pJ/bit. Contrary to the aforementioned scheme, an open-loop-based voltage drift compensation method is proposed to minimize power consumption and occupied area. The overall clock distribution is composed of a current mode logic (CML) path and a CMOS path. In the proposed scheme, the architecture of the CML-to-CMOS converter (C2C) and the inverter is changed to compensate for supply voltage drift. The bias generator provides bias voltages to the C2C and inverters according to supply voltage for delay adjustment. The proposed clock tree is fabricated in 40 nm CMOS process and the active area is 0.004 mm2. When the supply voltage is modulated by a sinusoidal wave with 1 MHz, 100 mV peak-to-peak swing from the center of 1.1 V, applying the proposed scheme reduces the measured root-mean-square (RMS) jitter from 3.77 psRMS to 1.61 psRMS. At 6 GHz output clock, the power consumption of the proposed scheme is 11.02 mW. A DLL-based quadrature error corrector (QEC) with a wide correction range is proposed for the DRAM whose clocks are distributed over several millimeters. The quadrature error is corrected by adjusting delay lines using information from the phase error detector. The proposed error correction method minimizes increased jitter due to phase error correction by setting at least one of the delay lines in the quadrature clock path to the minimum delay. In addition, the asynchronous calibration on-off scheme reduces power consumption after calibration is complete. The proposed QEC is fabricated in 40 nm CMOS process and has an active area of 0.048 mm2. The proposed QEC exhibits a wide correctable error range of 101.6 ps and the remaining phase errors are less than 2.18° from 0.8 GHz to 2.3 GHz clock. At 2.3 GHz, the QEC contributes 0.53 psRMS jitter. Also, at 2.3 GHz, the power consumption is reduced from 8.89 mW to 3.39 mW when the calibration is off.본 논문에서는 동적 랜덤 액세스 메모리 (DRAM)의 속도가 증가함에 따라 클록 패스에서 발생할 수 있는 문제에 대처하기 위한 세 가지 회로들을 제안하였다. 제안한 회로들 중 두 방식들은 지연동기루프 (delay-locked loop) 방식을 사용하였고 나머지 한 방식은 면적과 전력 소모를 줄이기 위해 오픈 루프 방식을 사용하였다. DRAM의 비정합 수신기 구조에서 데이터 패스와 클록 패스 간의 지연 불일치로 인해 전압 및 온도 변화에 따라 셋업 타임 및 홀드 타임이 줄어드는 문제를 해결하기 위해 지연동기루프를 사용하였다. 제안한 지연동기루프 회로는 DRAM 환경에서 동작하도록 두 개의 지연동기루프로 나누었다. 또한 초기 쓰기 훈련을 통해 데이터와 클록을 타이밍 마진 관점에서 최적의 위치에 둘 수 있다. 따라서 제안하는 방식은 데이터 천이 정보가 필요하지 않다. 65-nm CMOS 공정을 이용하여 만들어진 칩은 6.4 Gb/s에서 0.45 pJ/bit의 에너지 효율을 가진다. 또한 1 V에서 쓰기 훈련 및 지연동기루프를 고정시키고 0.94 V에서 1.06 V까지 공급 전압이 바뀌었을 때 타이밍 마진은 0.31 UI보다 큰 값을 유지하였다. 다음으로 제안하는 회로는 클록 분포 트리에서 전압 변화로 인해 클록 패스의 지연이 달라지는 것을 앞서 제시한 방식과 달리 오픈 루프 방식으로 보상하였다. 기존 클록 패스의 인버터와 CML-to-CMOS 변환기의 구조를 변경하여 바이어스 생성 회로에서 생성한 공급 전압에 따라 바뀌는 바이어스 전압을 가지고 지연을 조절할 수 있게 하였다. 40-nm CMOS 공정을 이용하여 만들어진 칩의 6 GHz 클록에서의 전력 소모는 11.02 mW로 측정되었다. 1.1 V 중심으로 1 MHz, 100 mV 피크 투 피크를 가지는 사인파 성분으로 공급 전압을 변조하였을 때 제안한 방식에서의 지터는 기존 방식의 3.77 psRMS에서 1.61 psRMS로 줄어들었다. DRAM의 송신기 구조에서 다중 위상 클록 간의 위상 오차는 송신된 데이터의 데이터 유효 창을 감소시킨다. 이를 해결하기 위해 지연동기루프를 도입하게 되면 증가된 지연으로 인해 위상이 교정된 클록에서 지터가 증가한다. 본 논문에서는 증가된 지터를 최소화하기 위해 위상 교정으로 인해 증가된 지연을 최소화하는 위상 교정 회로를 제시하였다. 또한 유휴 상태에서 전력 소모를 줄이기 위해 위상 오차를 교정하는 회로를 입력 클록과 비동기식으로 끌 수 있는 방법 또한 제안하였다. 40-nm CMOS 공정을 이용하여 만들어진 칩의 위상 교정 범위는 101.6 ps이고 0.8 GHz 부터 2.3 GHz까지의 동작 주파수 범위에서 위상 교정기의 출력 클록의 위상 오차는 2.18°보다 작다. 제안하는 위상 교정 회로로 인해 추가된 지터는 2.3 GHz에서 0.53 psRMS이고 교정 회로를 껐을 때 전력 소모는 교정 회로가 켜졌을 때인 8.89 mW에서 3.39 mW로 줄어들었다.Chapter 1 Introduction 1 1.1 Motivation 1 1.2 Thesis Organization 4 Chapter 2 Background on DRAM Interface 5 2.1 Overview 5 2.2 Memory Interface 7 Chapter 3 Background on DLL 11 3.1 Overview 11 3.2 Building Blocks 15 3.2.1 Delay Line 15 3.2.2 Phase Detector 17 3.2.3 Charge Pump 19 3.2.4 Loop filter 20 Chapter 4 Forwarded-Clock Receiver with DLL-based Self-tracking Loop for Unmatched Memory Interfaces 21 4.1 Overview 21 4.2 Proposed Separated DLL 25 4.2.1 Operation of the Proposed Separated DLL 27 4.2.2 Operation of the Digital Loop Filter in DLL 31 4.3 Circuit Implementation 33 4.4 Measurement Results 37 4.4.1 Measurement Setup and Sequence 38 4.4.2 VT Drift Measurement and Simulation 40 Chapter 5 Open-loop-based Voltage Drift Compensation in Clock Distribution 46 5.1 Overview 46 5.2 Prior Works 50 5.3 Voltage Drift Compensation Method 52 5.4 Circuit Implementation 57 5.5 Measurement Results 61 Chapter 6 Quadrature Error Corrector with Minimum Total Delay Tracking 68 6.1 Overview 68 6.2 Prior Works 70 6.3 Quadrature Error Correction Method 73 6.4 Circuit Implementation 82 6.5 Measurement Results 88 Chapter 7 Conclusion 96 Bibliography 98 초록 102Docto

SNU Open Repository and Archive

A chip multiprocessor for a large-scale neural simulator

Author: Painkras Eustace
Publication venue
Publication date: 01/08/2013
Field of study

The University of Manchester - Institutional Repository

VLSI Implementation of a Spiking Neural Network

Author: Grübl Andreas
Publication venue
Publication date: 01/01/2007
Field of study

Im Rahmen der vorliegenden Arbeit wurden Konzepte und dedizierte Hardware entwickelt, die es erlauben, großskalige pulsgekoppelte neuronale Netze in Hardware zu realisieren. Die Arbeit basiert auf dem analogen VLSI-Modell eines pulsgekoppelten neuronalen Netzes, welches synaptische Plastizität (STPD) in jeder einzelnen Synapse beinhaltet. Das Modell arbeitet analog mit einem Geschwindigkeitszuwachs von bis zu 10^5 im Vergleich zur biologischen Echtzeit. Aktionspotentiale werden als digitale Ereignisse übertragen. Inhalt dieser Arbeit sind vornehmlich die digitale Hardware und die Übertragung dieser Ereignisse. Das analoge VLSI-Modell wurde in Verbindung mit Digitallogik, welche zur Verarbeitung neuronaler Ereignisse und zu Konfigurationszwecken dient, in einen gemischt analog-digitalen ASIC integriert, wobei zu diesem Zweck ein automatisierter Arbeitsablauf entwickelt wurde. Außerdem wurde eine entsprechende Kontrolleinheit in programmierbarer Logik implementiert und eine Hardware-Plattform zum parallelen Betrieb mehrerer neuronaler Netzwerkchips vorgestellt. Um das VLSI-Modell auf mehrere neuronale Netzwerkchips ausdehnen zu können, wurde ein Routing-Algorithmus entwickelt, welcher die Übertragung von Ereignissen zwischen Neuronen und Synapsen auf unterschiedlichen Chips ermöglicht. Die zeitlich korrekte Übertragung der Ereignisse, welche eine zwingende Bedingung für das Funktionieren von Plastizitätsmechanismen ist, wird durch diesen Algorithmus sichergestellt. Die Funktionalität des Algorithmus wird mittels Simulationen verifiziert. Weiterhin wird die korrekte Realisierung des gemischt analog-digitalen ASIC in Verbindung mit dem zugehörigen Hardware-System demonstriert und die Durchführbarkeit biologisch realistischer Experimente gezeigt. Das vorgestellte großskalige physikalische Modell eines neuronalen Netzwerks wird aufgrund seiner schnellen und parallelen Arbeitsweise für Experimentierzwecke in den Neurowissenschaften einsetzbar sein. Als Ergänzung zu numerischen Simulationen bietet es vor allem die Möglichkeit der intuitiven und umfangreichen Suche nach geeigneten Modellparametern

Heidelberger Dokumentenserver

Source-synchronous I/O Links using Adaptive Interface Training for High Bandwidth Applications

Author: Jaiswal Ashok Kumar
Publication venue
Publication date: 16/07/2014
Field of study

Mobility is the key to the global business which requires people to be always connected to a central server. With the exponential increase in smart phones, tablets, laptops, mobile traffic will soon reach in the range of Exabytes per month by 2018. Applications like video streaming, on-demand-video, online gaming, social media applications will further increase the traffic load. Future application scenarios, such as Smart Cities, Industry 4.0, Machine-to-Machine (M2M) communications bring the concepts of Internet of Things (IoT) which requires high-speed low power communication infrastructures. Scientific applications, such as space exploration, oil exploration also require computing speed in the range of Exaflops/s by 2018 which means TB/s bandwidth at each memory node. To achieve such bandwidth, Input/Output (I/O) link speed between two devices needs to be increased to GB/s. The data at high speed between devices can be transferred serially using complex Clock-Data-Recovery (CDR) I/O links or parallely using simple source-synchronous I/O links. Even though CDR is more efficient than the source-synchronous method for single I/O link, but to achieve TB/s bandwidth from a single device, additional I/O links will be required and the source-synchronous method will be more advantageous in terms of area and power requirements as additional I/O links do not require extra hardware resources. At high speed, there are several non-idealities (Supply noise, crosstalk, Inter- Symbol-Interference (ISI), etc.) which create unwanted skew problem among parallel source-synchronous I/O links. To solve these problems, adaptive trainings are used in time domain to synchronize parallel source-synchronous I/O links irrespective of these non-idealities. In this thesis, two novel adaptive training architectures for source-synchronous I/O links are discussed which require significantly less silicon area and power in comparison to state-of-the-art architectures. First novel adaptive architecture is based on the unit delay concept to synchronize two parallel clocks by adjusting the phase of one clock in only one direction. Second novel adaptive architecture concept consists of Phase Interpolator (PI)-based Phase Locked Loop (PLL) which can adjust the phase in both direction and achieve faster synchronization at the expense of added complexity. With an increase in parallel I/O links, clock skew which is generated by the improper clock tree, also affects the timing margin. Incorrect duty cycle further reduces the timing margin mainly in Double Data Rate (DDR) systems which are generally used to increase the bandwidth of a high-speed communication system. To solve clock skew and duty cycle problems, a novel clock tree buffering algorithm and a novel duty cycle corrector are described which further reduce the power consumption of a source-synchronous system

TUbiblio

tuprints

Active Buffer Development in CBM Experiment

Author: Gao Wenxue
Publication venue
Publication date: 01/01/2012
Field of study

Die Anforderungen an das Datenerfassungssystem (DAQ) des CBM Experiments an der GSI sind mit einer Datenrate von 1TB/s und einer Ereignisrate von 100 kHz sehr hoch und stellen auch im Vergleich zu anderen Experimenten in der Hochenergiephysik eine Herausforderung dar. Bei der Datennahme wird daher ein aktiver Zwischenspeicher („active buffer“) eingesetzt, der durch eine Vorsortierung der Datenfragmente und eine intelligente Übertragung in den Hostrechner den Aufbau der Datenstrukturen zur Ereignisverarbeitung unterstützt. Das Projekt erfordert ein modulares Framework und die Arbeit umfasst die Entwicklung, Verifikation und Test von FPGA Modulen zum effizienten Datentransfer, zur Zwischenspeicherung und zur Rekonfiguration, sowie von Software zur automatischen Transformation von HDL Beschreibungen. Die zentralen Bauteile dieses Zwischenspeichers sind ein leistungsfähiges FPGA zur Datenflusssteuerung und ein DDR2 SDRAM Modul mit einer Kapazität von 512MB. Durch eine spezielle Ansteuerungsmethode kann das Speichermodul zusammen mit den FPGA-internen Speicherelementen als leistungsfähiges, großes FIFO betrieben werden. Den Datantransfer vom Zwischenspeicher zum PC übernimmt eine spezielle DMA Einheit, die an den PCIe-Kern im FPGA angeschlossen ist. Die zwei DMA Kanäle arbeiten mit Scatter-Gather Unterstützung und erreichen beim Transfer zum PC 543 MB/s und in der Gegenrichtung 790MB/s. Die für die Vorsortierung wichtige Übertragung der Zeitstempel („epoch marker“) erfolgt ebenfalls mit einem DMA Kanal. Die Verifikation ist eine wichtige Stufe bei der Entwicklung einer umfangreichen FPGA Anwendungen wie des aktiven Zwischenspeichers. Daher wurden die HDL Module der Funktionen für das PCI Express „transaction layer“ mit einer Reihe unterschiedlicher Simulationsumgebungen verifiziert. Auf dieser Grundlage können Verbesserungen an der Funktionalität schnell und zuverlässig umgesetzt werden, womit eine konsistente Weiterentwicklung gewährleistet ist. Aufgrund der typischen PC-Architektur muss die PCIe-Einheit im FPGA bereits während des Startvorgangs funktionsfähig sein, wohingegen die eigentliche aktive Zwischenspeicherfunktion erst zusammen mit der entsprechenden Anwendungssoftware verfügbar sein muss. Strikte Modularisierung zusammen mit dynamischer, partieller Rekonfigurierung („DPR“) ermöglichen Veränderungen in der Zwischenspeicherfunktion zur Laufzeit. Ein weiter Grund für die Nutzung der DPR sind die Lizenzbedingungen der PCIe-Core-Implementierung mit Virtex4-FPGAs. DPR kann bei den FPGA Familien Virtex-4, -5 und -6 im Rahmen der „PlanAhead“ Software von Xilinx benutzt werden. DPR wird im Projekt im Sinne eines allgemeinen Coprozessors eingesetzt, indem die FPGA Konfiguration über die PCIe und die interne Konfigurationsschnittstelle („ICAP“) im FPGA nachgeladen wird. Um DPR bei hohen Taktgeschwindigkeiten einsetzen zu können, muss die Verbindungslogik zwischen den statischen und dynamischen Modulen speziellen Anforderungen genügen. Da die manuelle Anpassung existierenden Module an diese Anforderungen aufwändig und fehleranfällig ist, wurde das Programm „Logro“ entwickelt, das HDL Beschreibungen mittels einer speziellen Pipeline- Neustrukturierung automatisch so transformiert, dass die DPR Anforderungen erfüllt werden. Mit Logro V1.0 wurden dabei gute Ergebnisse erzielt, die hier vorgestellt werden

Heidelberger Dokumentenserver

GSI Repository

A LPDDR4 MEMORY CONTROLLER DESIGN WITH EYE CENTER DETECTION ALGORITHM

Author: 홍기문
Publication venue: 서울대학교 대학원
Publication date: 01/02/2016
Field of study

학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2016. 2. 김수환.The demand for higher bandwidth with reduced power consumption in mobile memory is increasing. In this thesis, architecture of the LPDDR4 memory controller, operated with a LPDDR4 memory, is proposed and designed, and efficient training algorithm, which is appropriate for this architecture, is proposed for memory training and verification. The operation speed range of the LPDDR4 memory specification is from 533Mbps to 4266Mbps, and the LPDDR4 memory controller is designed to support that range of the LPDDR4 memory. The phase-locked loop in the LPDDR4 memory controller is designed to operate between 1333MHz and 2133MHz. To cover the range of the LPDDR4 memory, the selectable frequency divider is used to provide operation clock. The output frequency of the phase-locked loop with divider is from 266MHz to 2133MHz. The delay-locked loop in the LPDDR4 memory controller is designed to operate between 266MHz and 2133MHz with 180˚ phase locking. The delay-locked loop is used each training operation, which is command training, data read and write training. To complete training in each training stage, eye center detection algorithm is used. The circuits for the proposed eye center detection algorithm such as delay line, phase interpolator and reference generator are designed and validated. The proposed 1x2y3x eye center detection algorithm is 23 times faster than conventional two-dimensional eye center detection algorithm and it can be implemented simply. Using 65nm CMOS process, the proposed LPDDR4 memory controller occupies 12mm2. The verification of the LPDDR4 memory controller is performed with commodity LPDDR4 memory. The verification of all training sequence, which is power on, initializing, boot up, command training, write leveling, read training, write training, is performed in this environment. The low voltage swing terminated logic driver and other several functions, including write leveling and data transmission, are verified at 4266Mbps and the entire LPDDR4 memory controller operations from 566Mbps to 1600Mbps are verified. The proposed eye center detection algorithm is verified from 566Mbps to 2843Mbps.CHAPTER 1 INTRODUCTION 1 1.1 MOTIVATION 1 1.2 INTRODUCTION 5 1.3 THESIS ORGANIZATION 7 CHAPTER 2 LPDDR4 MEMORY CONTROLLER DESIGN 8 2.1 DIFFERENCE BETWEEN LPDDR3 AND LPDDR4 MEMORY 8 2.1.1 ARCHITECTURAL DIFFERENCE BETWEEN LPDDR3 AND LPDDR4 MEMORY 10 2.1.2 SOURCE SYNCHRONOUS MATCHED SCHEME AND UNMATCHED SCHEME 11 2.1.3 LOW VOLTAGE SWING TERMINATED LOGIC DRIVER AND TERMINATION SCHEME 12 2.2 LPDDR4 MEMORY CONTROLLER SPECIFICATION 15 2.3 DESIGN PROCEDURE 18 CHAPTER 3 LPDDR4 MEMORY CONTROLLER ARCHITECTURE BASED ON MEMORY TRAINING 20 3.1 LPDDR4 MEMORY TRAINING SEQUENCE 20 3.2 LPDDR4 MEMORY TRAINING EYE DETECTION ALGORITHM 24 3.2.1 EYE CENTER DETECTION 24 3.2.2 1X2Y3X EYE CENTER DETECTION ALGORITHM 27 3.3. LPDDR4 MEMORY CONTROLLER DESIGN BASED ON MEMORY TRAINING 31 3.3.1 ARCHITECTURE FOR MEMORY BOOT UP AND POWER UP 31 3.3.2 CLOCK PATH ARCHITECTURE AND CLOCK TREE 34 3.3.3 COMMAND TRAINING AND COMMAND PATH ARCHITECTURE 35 3.3.4 WRITE LEVELING AND DATA STROBE TRANSMISSION PATH ARCHITECTURE 39 3.3.5 READ TRAINING AND READ PATH ARCHITECTURE 41 3.3.6 WRITE TRAINING AND WRITE PATH ARCHITECTURE 43 3.3.7 NORMAL READ/WRITE OPERATION AND MARGIN TEST 46 CHAPTER 4 LPDDR4 MEMORY CONTROLLER ARCHITECTURE MODELING AND CIRCUIT DESIGN 48 4.1 OVERALL LPDDR4 MEMORY CONTROLLER ARCHITECTURE MODELING 48 4.2 SIMULATION RESULT OF LPDDR4 MEMORY CONTROLLER MODELING 51 4.3 LPDDR4 MEMORY CONTROLLER CIRCUIT DESIGN 61 4.3.1 PHASE-LOCKED LOOP 61 4.3.2 DELAY-LOCKED LOOP 65 4.3.3 TRANSMITTER OF LPDDR4 MEMORY CONTROLLER: WRITE PATH 70 4.3.4 DE-SERIALIZER WITH CLOCK DOMAIN CROSSING 75 CHAPTER 5 MEASUREMENT RESULT OF LPDDR4 MEMORY CONTROLLER 77 5.1 LPDDR4 MEMORY CONTROLLER MEASUREMENT SETUP 77 5.1.1 LPDDR4 MEMORY CONTROLLER FLOOR PLAN AND LAYOUT 77 5.1.2 PACKAGE AND TEST BOARD 79 5.2 LPDDR4 MEMORY CONTROLLER SUB-BLOCK MEASUREMENT 81 5.2.1 PHASE-LOCKED LOOP 81 5.2.2 DELAY-LOCKED LOOP 83 5.2.3 200PS AND 800PS DELAY LINE 85 5.2.4 VOLTAGE REFERENCE GENERATOR 86 5.2.5 PHASE INTERPOLATOR 87 5.3 LPDDR4 MEMORY SYSTEM OPERATION MEASUREMENT 90 CHAPTER 6 CONCLUSION 93 APPENDIX OPERATION FLOW CHART OF THE PROPOSED LPDDR4 MEMORY CONTROLLER 95 BIBLIOGRAPHY 118 KOREAN ABSTRACT 124Docto

SNU Open Repository and Archive

Efficient Design and Communication for 3D Stacked Dynamic Memory

Author: Douglass Andrew J
Publication venue
Publication date: 09/07/2019
Field of study

As computer memory increases in size and processors continue to get faster, memory becomes an increasing bottleneck to system performance. To mitigate the slow DRAM memory chip speeds, a new generation of 3D stacked DRAM will allow lower power consumption and higher bandwidth. To communicate between these chips, this paper proposes the use of ring based standing wave oscillators for fast data transfer. With a fast clocking scheme, multiple channels can share the same bus to reduce TSVs and maintain similar memory latencies. Simulations with the new clocking scheme and data transfers are performed to show the improvements that can be made in memory communication. Experimental results show that a ring based clocking scheme can obtain two times the speed up of current stacked memory chips. Variations of this clocking scheme can also provide half the power consumption with comparable speeds. These ring-based architectures allow higher memory speeds without compromising the complexity of the hardware. This allows the ring-based memory architecture to trade off power, throughput, and latency to improve system performance for different applications

Texas A&M Repository

통계적 주파수 검출기 기반 기준 주파수를 사용하지 않는 클록 및 데이터 복원 회로의 설계 방법론

Author: 최홍석
Publication venue: 서울대학교 대학원
Publication date: 01/08/2022
Field of study

학위논문(박사) -- 서울대학교대학원 : 공과대학 전기·정보공학부, 2022. 8. 정덕균.In this thesis, a design of a high-speed, power-efficient, wide-range clock and data recovery (CDR) without a reference clock is proposed. A frequency acquisition scheme using a stochastic frequency detector (SFD) based on the Alexander phase detector (PD) is utilized for the referenceless operation. Pat-tern histogram analysis is presented to analyze the frequency acquisition behavior of the SFD and verified by simulation. Based on the information obtained by pattern histogram analysis, SFD using autocovariance is proposed. With a direct-proportional path and a digital integral path, the proposed referenceless CDR achieves frequency lock at all measurable conditions, and the measured frequency acquisition time is within 7μs. The prototype chip has been fabricated in a 40-nm CMOS process and occupies an active area of 0.032 mm2. The proposed referenceless CDR achieves the BER of less than 10-12 at 32 Gb/s and exhibits an energy efficiency of 1.15 pJ/b at 32 Gb/s with a 1.0 V supply.본 논문은 기준 클럭이 없는 고속, 저전력, 광대역으로 동작하는 클럭 및 데이터 복원회로의 설계를 제안한다. 기준 클럭이 없는 동작을 위해서 알렉산더 위상 검출기에 기반한 통계적 주파수 검출기를 사용하는 주파수 획득 방식이 사용된다. 통계적 주파수 검출기의 주파수 추적 양상을 분석하기 위해 패턴 히스토그램 분석 방법론을 제시하였고 시뮬레이션을 통해 검증하였다. 패턴 히스토그램 분석을 통해 얻은 정보를 바탕으로 자기공분산을 이용한 통계적 주파수 검출기를 제안한다. 직접 비례 경로와 디지털 적분 경로를 통해 제안된 기준 클럭이 없는 클럭 및 데이터 복원회로는 모든 측정 가능한 조건에서 주파수 잠금을 달성하는 데 성공하였고, 모든 경우에서 측정된 주파수 추적 시간은 7μs 이내이다. 40-nm CMOS 공정을 이용하여 만들어진 칩은 0.032 mm2의 면적을 차지한다. 제안하는 클럭 및 데이터 복원회로는 32 Gb/s의 속도에서 비트에러율 10-12 이하로 동작하였고, 에너지 효율은 32Gb/s의 속도에서 1.0V 공급전압을 사용하여 1.15 pJ/b을 달성하였다.CHAPTER 1 INTRODUCTION 1 1.1 MOTIVATION 1 1.2 THESIS ORGANIZATION 13 CHAPTER 2 BACKGROUNDS 14 2.1 CLOCKING ARCHITECTURES IN SERIAL LINK INTERFACE 14 2.2 GENERAL CONSIDERATIONS FOR CLOCK AND DATA RECOVERY 24 2.2.1 OVERVIEW 24 2.2.2 JITTER 26 2.2.3 CDR JITTER CHARACTERISTICS 33 2.3 CDR ARCHITECTURES 39 2.3.1 PLL-BASED CDR – WITH EXTERNAL REFERENCE CLOCK 39 2.3.2 DLL/PI-BASED CDR 44 2.3.3 PLL-BASED CDR – WITHOUT EXTERNAL REFERENCE CLOCK 47 2.4 FREQUENCY ACQUISITION SCHEME 50 2.4.1 TYPICAL FREQUENCY DETECTORS 50 2.4.1.1 DIGITAL QUADRICORRELATOR FREQUENCY DETECTOR 50 2.4.1.2 ROTATIONAL FREQUENCY DETECTOR 54 2.4.2 PRIOR WORKS 56 CHAPTER 3 DESIGN OF THE REFERENCELESS CDR USING SFD 58 3.1 OVERVIEW 58 3.2 PROPOSED FREQUENCY DETECTOR 62 3.2.1 MOTIVATION 62 3.2.2 PATTERN HISTOGRAM ANALYSIS 68 3.2.3 INTRODUCTION OF AUTOCOVARIANCE TO STOCHASTIC FREQUENCY DETECTOR 75 3.3 CIRCUIT IMPLEMENTATION 83 3.3.1 IMPLEMENTATION OF THE PROPOSED REFERENCELESS CDR 83 3.3.2 CONTINUOUS-TIME LINEAR EQUALIZER (CTLE) 85 3.3.3 DIGITALLY-CONTROLLED OSCILLATOR (DCO) 87 3.4 MEASUREMENT RESULTS 89 CHAPTER 4 CONCLUSION 99 APPENDIX A DETAILED FREQUENCY ACQUISITION WAVEFORMS OF THE PROPOSED SFD 100 BIBLIOGRAPHY 108 초 록 122박

SNU Open Repository and Archive

Digital radar receiver design based on highly efficient bandpass sampling FPGA architecture

Author: Meier John
Publication venue
Publication date: 01/01/2009
Field of study

Thesis (M.S.)--University of Oklahoma, 2009.Includes bibliographical references (leaves 62-64).As digital electronics become faster and more efficient, it becomes possible to move the analog/digital interface in a radar downconversion system further towards the antenna. Instead of digitizing radar echoes at the end of the down-conversion process, digital logic can perform the same operations previously performed by analog components. Taking full advantage of this opportunity will result in a more highly integrated and reconfigurable design. By removing unnecessary analog components, the error from component variability and noise injected into the signal of interest is reduced, the size of the receiver and the power required for operation is minimized, and the overall cost of the system can be lowered. This research is focused on employing software defined radio concepts for weather observation, thus creating a low-cost digital radar receiver at the University of Oklahoma for use in radar projects as a way of obviating the need for commmercial radar receivers, which can be many times more expensive. Software-defined radio techniques, such as bandpass sampling, are used to achieve a high data processing bandwidth and oversampling ratio with the smallest logic resource utilization. Two novel digital receiver designs are discussed in this thesis. A prototype compact single-channel digital receiver based on a 14-bit analog-to-digital converter and a hand-solderable Xilinx FPGA was built and tested both in the laboratory and at the National Weather Radar Testbed (NWRT). Building on the lessons learned from testing the single-channel digital radar receiver, a second digital receiver was designed for expanded capabilities. Through the utilization of a low-power, simultaneous-sampling eight channel ADC with high-speed serial data links and a cost-efficient FPGA with integrated DSP slices, eight data channels can be digitized, processed and transferred at the same time in a compact form factor. An ethernet interface has been included which allows for a scalable control channel so that the digital receiver's operations can be quickly modified. This also makes it possible to remotely change the firmware of the FPGA in seconds, without the need for physical access. Development of host computer platforms to store and process each digital receiver's output are also discussed

SHAREOK repository