80 research outputs found

    Ring-Based Resonant Standing Wave Oscillators for 3D Clocking Applications

    Get PDF
    Ring-based resonant standing wave oscillators have been shown to be a useful clocking tech-nique that can distribute and generate a high frequency, low skew, low power, and stable clock signal. By using through-silicon-vias, this type of standing wave oscillator can be used to gener-ate the clocking scheme for 3D integrated circuits. In this thesis, we propose the use of such 3D standing wave oscillators and show how independent 3D oscillators in different stacks can syn-chronize through the use of a redistribution layer stub. Inter-chip clock synchronization is then accomplished without the need for a PLL. In addition, we propose the first 3D ring-based resonant standing wave oscillator bootstrap and reset circuit to initialize and stop oscillation. Using a 3D ring-based resonant standing wave oscillator, we propose a ring-based data fabric for 3D stacked DRAM and compare the results with existing approaches such as High Bandwidth Memory (HBM) or Wide I/O memory. We show that our Memory Architecture using a Ring-based Scheme (MARS) can provide the increases in speed necessary to overcome current memory bottlenecks, and can scale effectively as future 3D stacks become larger. Our MARS can trade off power, throughput, and latency to match different application requirements. By using a narrow bus, and connecting it to all channels, the MARS8 can provide an alternative memory configuration with โˆผ 6.9ร— lower power consumption than HBM, and โˆผ 2.7ร— faster speeds than Wide I/O. Using multiple ring topologies in the same stack, the channel count can double from 8 to 16, and then to 32. This is possible since MARS uses about 4ร— fewer TSVs per channel than HBM or Wide I/O. This provides speeds up to โˆผ 4.2ร— faster than traditional HBM. This scalable architecture allows higher throughput and faster system performance for next-generation DRAM. The MARS topology proposed in this thesis can be used in a variety of computing systems, from lightweight IoT to large-scale data centers

    ๊ณ ์† DRAM ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ์œ„ํ•œ ์ „์•• ๋ฐ ์˜จ๋„์— ๋‘”๊ฐํ•œ ํด๋ก ํŒจ์Šค์™€ ์œ„์ƒ ์˜ค๋ฅ˜ ๊ต์ •๊ธฐ ์„ค๊ณ„

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2021. 2. ์ •๋•๊ท .To cope with problems caused by the high-speed operation of the dynamic random access memory (DRAM) interface, several approaches are proposed that are focused on the clock path of the DRAM. Two delay-locked loop (DLL) based schemes, a forwarded-clock (FC) receiver (RX) with self-tracking loop and a quadrature error corrector, are proposed. Moreover, an open-loop based scheme is presented for drift compensation in the clock distribution. The open-loop scheme consumes less power consumption and reduces design complexity. The FC RX uses DLLs to compensate for voltage and temperature (VT) drift in unmatched memory interfaces. The self-tracking loop consists of two-stage cascaded DLLs to operate in a DRAM environment. With the write training and the proposed DLL, the timing relationship between the data and the sampling clock is always optimal. The proposed scheme compensates for delay drift without relying on data transitions or re-training. The proposed FC RX is fabricated in 65-nm CMOS process and has an active area containing 4 data lanes of 0.0329 mm2. After the write training is completed at the supply voltage of 1 V, the measured timing margin remains larger than 0.31-unit interval (UI) when the supply voltage drifts in the range of 0.94 V and 1.06 V from the training voltage, 1 V. At the data rate of 6.4 Gb/s, the proposed FC RX achieves an energy efficiency of 0.45 pJ/bit. Contrary to the aforementioned scheme, an open-loop-based voltage drift compensation method is proposed to minimize power consumption and occupied area. The overall clock distribution is composed of a current mode logic (CML) path and a CMOS path. In the proposed scheme, the architecture of the CML-to-CMOS converter (C2C) and the inverter is changed to compensate for supply voltage drift. The bias generator provides bias voltages to the C2C and inverters according to supply voltage for delay adjustment. The proposed clock tree is fabricated in 40 nm CMOS process and the active area is 0.004 mm2. When the supply voltage is modulated by a sinusoidal wave with 1 MHz, 100 mV peak-to-peak swing from the center of 1.1 V, applying the proposed scheme reduces the measured root-mean-square (RMS) jitter from 3.77 psRMS to 1.61 psRMS. At 6 GHz output clock, the power consumption of the proposed scheme is 11.02 mW. A DLL-based quadrature error corrector (QEC) with a wide correction range is proposed for the DRAM whose clocks are distributed over several millimeters. The quadrature error is corrected by adjusting delay lines using information from the phase error detector. The proposed error correction method minimizes increased jitter due to phase error correction by setting at least one of the delay lines in the quadrature clock path to the minimum delay. In addition, the asynchronous calibration on-off scheme reduces power consumption after calibration is complete. The proposed QEC is fabricated in 40 nm CMOS process and has an active area of 0.048 mm2. The proposed QEC exhibits a wide correctable error range of 101.6 ps and the remaining phase errors are less than 2.18ยฐ from 0.8 GHz to 2.3 GHz clock. At 2.3 GHz, the QEC contributes 0.53 psRMS jitter. Also, at 2.3 GHz, the power consumption is reduced from 8.89 mW to 3.39 mW when the calibration is off.๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋™์  ๋žœ๋ค ์•ก์„ธ์Šค ๋ฉ”๋ชจ๋ฆฌ (DRAM)์˜ ์†๋„๊ฐ€ ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ ํด๋ก ํŒจ์Šค์—์„œ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋Š” ๋ฌธ์ œ์— ๋Œ€์ฒ˜ํ•˜๊ธฐ ์œ„ํ•œ ์„ธ ๊ฐ€์ง€ ํšŒ๋กœ๋“ค์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ์ œ์•ˆํ•œ ํšŒ๋กœ๋“ค ์ค‘ ๋‘ ๋ฐฉ์‹๋“ค์€ ์ง€์—ฐ๋™๊ธฐ๋ฃจํ”„ (delay-locked loop) ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜์˜€๊ณ  ๋‚˜๋จธ์ง€ ํ•œ ๋ฐฉ์‹์€ ๋ฉด์ ๊ณผ ์ „๋ ฅ ์†Œ๋ชจ๋ฅผ ์ค„์ด๊ธฐ ์œ„ํ•ด ์˜คํ”ˆ ๋ฃจํ”„ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. DRAM์˜ ๋น„์ •ํ•ฉ ์ˆ˜์‹ ๊ธฐ ๊ตฌ์กฐ์—์„œ ๋ฐ์ดํ„ฐ ํŒจ์Šค์™€ ํด๋ก ํŒจ์Šค ๊ฐ„์˜ ์ง€์—ฐ ๋ถˆ์ผ์น˜๋กœ ์ธํ•ด ์ „์•• ๋ฐ ์˜จ๋„ ๋ณ€ํ™”์— ๋”ฐ๋ผ ์…‹์—… ํƒ€์ž„ ๋ฐ ํ™€๋“œ ํƒ€์ž„์ด ์ค„์–ด๋“œ๋Š” ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ง€์—ฐ๋™๊ธฐ๋ฃจํ”„๋ฅผ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ์ œ์•ˆํ•œ ์ง€์—ฐ๋™๊ธฐ๋ฃจํ”„ ํšŒ๋กœ๋Š” DRAM ํ™˜๊ฒฝ์—์„œ ๋™์ž‘ํ•˜๋„๋ก ๋‘ ๊ฐœ์˜ ์ง€์—ฐ๋™๊ธฐ๋ฃจํ”„๋กœ ๋‚˜๋ˆ„์—ˆ๋‹ค. ๋˜ํ•œ ์ดˆ๊ธฐ ์“ฐ๊ธฐ ํ›ˆ๋ จ์„ ํ†ตํ•ด ๋ฐ์ดํ„ฐ์™€ ํด๋ก์„ ํƒ€์ด๋ฐ ๋งˆ์ง„ ๊ด€์ ์—์„œ ์ตœ์ ์˜ ์œ„์น˜์— ๋‘˜ ์ˆ˜ ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ์ œ์•ˆํ•˜๋Š” ๋ฐฉ์‹์€ ๋ฐ์ดํ„ฐ ์ฒœ์ด ์ •๋ณด๊ฐ€ ํ•„์š”ํ•˜์ง€ ์•Š๋‹ค. 65-nm CMOS ๊ณต์ •์„ ์ด์šฉํ•˜์—ฌ ๋งŒ๋“ค์–ด์ง„ ์นฉ์€ 6.4 Gb/s์—์„œ 0.45 pJ/bit์˜ ์—๋„ˆ์ง€ ํšจ์œจ์„ ๊ฐ€์ง„๋‹ค. ๋˜ํ•œ 1 V์—์„œ ์“ฐ๊ธฐ ํ›ˆ๋ จ ๋ฐ ์ง€์—ฐ๋™๊ธฐ๋ฃจํ”„๋ฅผ ๊ณ ์ •์‹œํ‚ค๊ณ  0.94 V์—์„œ 1.06 V๊นŒ์ง€ ๊ณต๊ธ‰ ์ „์••์ด ๋ฐ”๋€Œ์—ˆ์„ ๋•Œ ํƒ€์ด๋ฐ ๋งˆ์ง„์€ 0.31 UI๋ณด๋‹ค ํฐ ๊ฐ’์„ ์œ ์ง€ํ•˜์˜€๋‹ค. ๋‹ค์Œ์œผ๋กœ ์ œ์•ˆํ•˜๋Š” ํšŒ๋กœ๋Š” ํด๋ก ๋ถ„ํฌ ํŠธ๋ฆฌ์—์„œ ์ „์•• ๋ณ€ํ™”๋กœ ์ธํ•ด ํด๋ก ํŒจ์Šค์˜ ์ง€์—ฐ์ด ๋‹ฌ๋ผ์ง€๋Š” ๊ฒƒ์„ ์•ž์„œ ์ œ์‹œํ•œ ๋ฐฉ์‹๊ณผ ๋‹ฌ๋ฆฌ ์˜คํ”ˆ ๋ฃจํ”„ ๋ฐฉ์‹์œผ๋กœ ๋ณด์ƒํ•˜์˜€๋‹ค. ๊ธฐ์กด ํด๋ก ํŒจ์Šค์˜ ์ธ๋ฒ„ํ„ฐ์™€ CML-to-CMOS ๋ณ€ํ™˜๊ธฐ์˜ ๊ตฌ์กฐ๋ฅผ ๋ณ€๊ฒฝํ•˜์—ฌ ๋ฐ”์ด์–ด์Šค ์ƒ์„ฑ ํšŒ๋กœ์—์„œ ์ƒ์„ฑํ•œ ๊ณต๊ธ‰ ์ „์••์— ๋”ฐ๋ผ ๋ฐ”๋€Œ๋Š” ๋ฐ”์ด์–ด์Šค ์ „์••์„ ๊ฐ€์ง€๊ณ  ์ง€์—ฐ์„ ์กฐ์ ˆํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•˜์˜€๋‹ค. 40-nm CMOS ๊ณต์ •์„ ์ด์šฉํ•˜์—ฌ ๋งŒ๋“ค์–ด์ง„ ์นฉ์˜ 6 GHz ํด๋ก์—์„œ์˜ ์ „๋ ฅ ์†Œ๋ชจ๋Š” 11.02 mW๋กœ ์ธก์ •๋˜์—ˆ๋‹ค. 1.1 V ์ค‘์‹ฌ์œผ๋กœ 1 MHz, 100 mV ํ”ผํฌ ํˆฌ ํ”ผํฌ๋ฅผ ๊ฐ€์ง€๋Š” ์‚ฌ์ธํŒŒ ์„ฑ๋ถ„์œผ๋กœ ๊ณต๊ธ‰ ์ „์••์„ ๋ณ€์กฐํ•˜์˜€์„ ๋•Œ ์ œ์•ˆํ•œ ๋ฐฉ์‹์—์„œ์˜ ์ง€ํ„ฐ๋Š” ๊ธฐ์กด ๋ฐฉ์‹์˜ 3.77 psRMS์—์„œ 1.61 psRMS๋กœ ์ค„์–ด๋“ค์—ˆ๋‹ค. DRAM์˜ ์†ก์‹ ๊ธฐ ๊ตฌ์กฐ์—์„œ ๋‹ค์ค‘ ์œ„์ƒ ํด๋ก ๊ฐ„์˜ ์œ„์ƒ ์˜ค์ฐจ๋Š” ์†ก์‹ ๋œ ๋ฐ์ดํ„ฐ์˜ ๋ฐ์ดํ„ฐ ์œ ํšจ ์ฐฝ์„ ๊ฐ์†Œ์‹œํ‚จ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ง€์—ฐ๋™๊ธฐ๋ฃจํ”„๋ฅผ ๋„์ž…ํ•˜๊ฒŒ ๋˜๋ฉด ์ฆ๊ฐ€๋œ ์ง€์—ฐ์œผ๋กœ ์ธํ•ด ์œ„์ƒ์ด ๊ต์ •๋œ ํด๋ก์—์„œ ์ง€ํ„ฐ๊ฐ€ ์ฆ๊ฐ€ํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ฆ๊ฐ€๋œ ์ง€ํ„ฐ๋ฅผ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์œ„์ƒ ๊ต์ •์œผ๋กœ ์ธํ•ด ์ฆ๊ฐ€๋œ ์ง€์—ฐ์„ ์ตœ์†Œํ™”ํ•˜๋Š” ์œ„์ƒ ๊ต์ • ํšŒ๋กœ๋ฅผ ์ œ์‹œํ•˜์˜€๋‹ค. ๋˜ํ•œ ์œ ํœด ์ƒํƒœ์—์„œ ์ „๋ ฅ ์†Œ๋ชจ๋ฅผ ์ค„์ด๊ธฐ ์œ„ํ•ด ์œ„์ƒ ์˜ค์ฐจ๋ฅผ ๊ต์ •ํ•˜๋Š” ํšŒ๋กœ๋ฅผ ์ž…๋ ฅ ํด๋ก๊ณผ ๋น„๋™๊ธฐ์‹์œผ๋กœ ๋Œ ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ• ๋˜ํ•œ ์ œ์•ˆํ•˜์˜€๋‹ค. 40-nm CMOS ๊ณต์ •์„ ์ด์šฉํ•˜์—ฌ ๋งŒ๋“ค์–ด์ง„ ์นฉ์˜ ์œ„์ƒ ๊ต์ • ๋ฒ”์œ„๋Š” 101.6 ps์ด๊ณ  0.8 GHz ๋ถ€ํ„ฐ 2.3 GHz๊นŒ์ง€์˜ ๋™์ž‘ ์ฃผํŒŒ์ˆ˜ ๋ฒ”์œ„์—์„œ ์œ„์ƒ ๊ต์ •๊ธฐ์˜ ์ถœ๋ ฅ ํด๋ก์˜ ์œ„์ƒ ์˜ค์ฐจ๋Š” 2.18ยฐ๋ณด๋‹ค ์ž‘๋‹ค. ์ œ์•ˆํ•˜๋Š” ์œ„์ƒ ๊ต์ • ํšŒ๋กœ๋กœ ์ธํ•ด ์ถ”๊ฐ€๋œ ์ง€ํ„ฐ๋Š” 2.3 GHz์—์„œ 0.53 psRMS์ด๊ณ  ๊ต์ • ํšŒ๋กœ๋ฅผ ๊ป์„ ๋•Œ ์ „๋ ฅ ์†Œ๋ชจ๋Š” ๊ต์ • ํšŒ๋กœ๊ฐ€ ์ผœ์กŒ์„ ๋•Œ์ธ 8.89 mW์—์„œ 3.39 mW๋กœ ์ค„์–ด๋“ค์—ˆ๋‹ค.Chapter 1 Introduction 1 1.1 Motivation 1 1.2 Thesis Organization 4 Chapter 2 Background on DRAM Interface 5 2.1 Overview 5 2.2 Memory Interface 7 Chapter 3 Background on DLL 11 3.1 Overview 11 3.2 Building Blocks 15 3.2.1 Delay Line 15 3.2.2 Phase Detector 17 3.2.3 Charge Pump 19 3.2.4 Loop filter 20 Chapter 4 Forwarded-Clock Receiver with DLL-based Self-tracking Loop for Unmatched Memory Interfaces 21 4.1 Overview 21 4.2 Proposed Separated DLL 25 4.2.1 Operation of the Proposed Separated DLL 27 4.2.2 Operation of the Digital Loop Filter in DLL 31 4.3 Circuit Implementation 33 4.4 Measurement Results 37 4.4.1 Measurement Setup and Sequence 38 4.4.2 VT Drift Measurement and Simulation 40 Chapter 5 Open-loop-based Voltage Drift Compensation in Clock Distribution 46 5.1 Overview 46 5.2 Prior Works 50 5.3 Voltage Drift Compensation Method 52 5.4 Circuit Implementation 57 5.5 Measurement Results 61 Chapter 6 Quadrature Error Corrector with Minimum Total Delay Tracking 68 6.1 Overview 68 6.2 Prior Works 70 6.3 Quadrature Error Correction Method 73 6.4 Circuit Implementation 82 6.5 Measurement Results 88 Chapter 7 Conclusion 96 Bibliography 98 ์ดˆ๋ก 102Docto

    VLSI Implementation of a Spiking Neural Network

    Get PDF
    Im Rahmen der vorliegenden Arbeit wurden Konzepte und dedizierte Hardware entwickelt, die es erlauben, groรŸskalige pulsgekoppelte neuronale Netze in Hardware zu realisieren. Die Arbeit basiert auf dem analogen VLSI-Modell eines pulsgekoppelten neuronalen Netzes, welches synaptische Plastizitรคt (STPD) in jeder einzelnen Synapse beinhaltet. Das Modell arbeitet analog mit einem Geschwindigkeitszuwachs von bis zu 10^5 im Vergleich zur biologischen Echtzeit. Aktionspotentiale werden als digitale Ereignisse รผbertragen. Inhalt dieser Arbeit sind vornehmlich die digitale Hardware und die รœbertragung dieser Ereignisse. Das analoge VLSI-Modell wurde in Verbindung mit Digitallogik, welche zur Verarbeitung neuronaler Ereignisse und zu Konfigurationszwecken dient, in einen gemischt analog-digitalen ASIC integriert, wobei zu diesem Zweck ein automatisierter Arbeitsablauf entwickelt wurde. AuรŸerdem wurde eine entsprechende Kontrolleinheit in programmierbarer Logik implementiert und eine Hardware-Plattform zum parallelen Betrieb mehrerer neuronaler Netzwerkchips vorgestellt. Um das VLSI-Modell auf mehrere neuronale Netzwerkchips ausdehnen zu kรถnnen, wurde ein Routing-Algorithmus entwickelt, welcher die รœbertragung von Ereignissen zwischen Neuronen und Synapsen auf unterschiedlichen Chips ermรถglicht. Die zeitlich korrekte รœbertragung der Ereignisse, welche eine zwingende Bedingung fรผr das Funktionieren von Plastizitรคtsmechanismen ist, wird durch diesen Algorithmus sichergestellt. Die Funktionalitรคt des Algorithmus wird mittels Simulationen verifiziert. Weiterhin wird die korrekte Realisierung des gemischt analog-digitalen ASIC in Verbindung mit dem zugehรถrigen Hardware-System demonstriert und die Durchfรผhrbarkeit biologisch realistischer Experimente gezeigt. Das vorgestellte groรŸskalige physikalische Modell eines neuronalen Netzwerks wird aufgrund seiner schnellen und parallelen Arbeitsweise fรผr Experimentierzwecke in den Neurowissenschaften einsetzbar sein. Als Ergรคnzung zu numerischen Simulationen bietet es vor allem die Mรถglichkeit der intuitiven und umfangreichen Suche nach geeigneten Modellparametern

    Source-synchronous I/O Links using Adaptive Interface Training for High Bandwidth Applications

    Get PDF
    Mobility is the key to the global business which requires people to be always connected to a central server. With the exponential increase in smart phones, tablets, laptops, mobile traffic will soon reach in the range of Exabytes per month by 2018. Applications like video streaming, on-demand-video, online gaming, social media applications will further increase the traffic load. Future application scenarios, such as Smart Cities, Industry 4.0, Machine-to-Machine (M2M) communications bring the concepts of Internet of Things (IoT) which requires high-speed low power communication infrastructures. Scientific applications, such as space exploration, oil exploration also require computing speed in the range of Exaflops/s by 2018 which means TB/s bandwidth at each memory node. To achieve such bandwidth, Input/Output (I/O) link speed between two devices needs to be increased to GB/s. The data at high speed between devices can be transferred serially using complex Clock-Data-Recovery (CDR) I/O links or parallely using simple source-synchronous I/O links. Even though CDR is more efficient than the source-synchronous method for single I/O link, but to achieve TB/s bandwidth from a single device, additional I/O links will be required and the source-synchronous method will be more advantageous in terms of area and power requirements as additional I/O links do not require extra hardware resources. At high speed, there are several non-idealities (Supply noise, crosstalk, Inter- Symbol-Interference (ISI), etc.) which create unwanted skew problem among parallel source-synchronous I/O links. To solve these problems, adaptive trainings are used in time domain to synchronize parallel source-synchronous I/O links irrespective of these non-idealities. In this thesis, two novel adaptive training architectures for source-synchronous I/O links are discussed which require significantly less silicon area and power in comparison to state-of-the-art architectures. First novel adaptive architecture is based on the unit delay concept to synchronize two parallel clocks by adjusting the phase of one clock in only one direction. Second novel adaptive architecture concept consists of Phase Interpolator (PI)-based Phase Locked Loop (PLL) which can adjust the phase in both direction and achieve faster synchronization at the expense of added complexity. With an increase in parallel I/O links, clock skew which is generated by the improper clock tree, also affects the timing margin. Incorrect duty cycle further reduces the timing margin mainly in Double Data Rate (DDR) systems which are generally used to increase the bandwidth of a high-speed communication system. To solve clock skew and duty cycle problems, a novel clock tree buffering algorithm and a novel duty cycle corrector are described which further reduce the power consumption of a source-synchronous system

    Active Buffer Development in CBM Experiment

    Get PDF
    Die Anforderungen an das Datenerfassungssystem (DAQ) des CBM Experiments an der GSI sind mit einer Datenrate von 1TB/s und einer Ereignisrate von 100 kHz sehr hoch und stellen auch im Vergleich zu anderen Experimenten in der Hochenergiephysik eine Herausforderung dar. Bei der Datennahme wird daher ein aktiver Zwischenspeicher (โ€žactive bufferโ€œ) eingesetzt, der durch eine Vorsortierung der Datenfragmente und eine intelligente รœbertragung in den Hostrechner den Aufbau der Datenstrukturen zur Ereignisverarbeitung unterstรผtzt. Das Projekt erfordert ein modulares Framework und die Arbeit umfasst die Entwicklung, Verifikation und Test von FPGA Modulen zum effizienten Datentransfer, zur Zwischenspeicherung und zur Rekonfiguration, sowie von Software zur automatischen Transformation von HDL Beschreibungen. Die zentralen Bauteile dieses Zwischenspeichers sind ein leistungsfรคhiges FPGA zur Datenflusssteuerung und ein DDR2 SDRAM Modul mit einer Kapazitรคt von 512MB. Durch eine spezielle Ansteuerungsmethode kann das Speichermodul zusammen mit den FPGA-internen Speicherelementen als leistungsfรคhiges, groรŸes FIFO betrieben werden. Den Datantransfer vom Zwischenspeicher zum PC รผbernimmt eine spezielle DMA Einheit, die an den PCIe-Kern im FPGA angeschlossen ist. Die zwei DMA Kanรคle arbeiten mit Scatter-Gather Unterstรผtzung und erreichen beim Transfer zum PC 543 MB/s und in der Gegenrichtung 790MB/s. Die fรผr die Vorsortierung wichtige รœbertragung der Zeitstempel (โ€žepoch markerโ€œ) erfolgt ebenfalls mit einem DMA Kanal. Die Verifikation ist eine wichtige Stufe bei der Entwicklung einer umfangreichen FPGA Anwendungen wie des aktiven Zwischenspeichers. Daher wurden die HDL Module der Funktionen fรผr das PCI Express โ€žtransaction layerโ€œ mit einer Reihe unterschiedlicher Simulationsumgebungen verifiziert. Auf dieser Grundlage kรถnnen Verbesserungen an der Funktionalitรคt schnell und zuverlรคssig umgesetzt werden, womit eine konsistente Weiterentwicklung gewรคhrleistet ist. Aufgrund der typischen PC-Architektur muss die PCIe-Einheit im FPGA bereits wรคhrend des Startvorgangs funktionsfรคhig sein, wohingegen die eigentliche aktive Zwischenspeicherfunktion erst zusammen mit der entsprechenden Anwendungssoftware verfรผgbar sein muss. Strikte Modularisierung zusammen mit dynamischer, partieller Rekonfigurierung (โ€žDPRโ€œ) ermรถglichen Verรคnderungen in der Zwischenspeicherfunktion zur Laufzeit. Ein weiter Grund fรผr die Nutzung der DPR sind die Lizenzbedingungen der PCIe-Core-Implementierung mit Virtex4-FPGAs. DPR kann bei den FPGA Familien Virtex-4, -5 und -6 im Rahmen der โ€žPlanAheadโ€œ Software von Xilinx benutzt werden. DPR wird im Projekt im Sinne eines allgemeinen Coprozessors eingesetzt, indem die FPGA Konfiguration รผber die PCIe und die interne Konfigurationsschnittstelle (โ€žICAPโ€œ) im FPGA nachgeladen wird. Um DPR bei hohen Taktgeschwindigkeiten einsetzen zu kรถnnen, muss die Verbindungslogik zwischen den statischen und dynamischen Modulen speziellen Anforderungen genรผgen. Da die manuelle Anpassung existierenden Module an diese Anforderungen aufwรคndig und fehleranfรคllig ist, wurde das Programm โ€žLogroโ€œ entwickelt, das HDL Beschreibungen mittels einer speziellen Pipeline- Neustrukturierung automatisch so transformiert, dass die DPR Anforderungen erfรผllt werden. Mit Logro V1.0 wurden dabei gute Ergebnisse erzielt, die hier vorgestellt werden

    A LPDDR4 MEMORY CONTROLLER DESIGN WITH EYE CENTER DETECTION ALGORITHM

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2016. 2. ๊น€์ˆ˜ํ™˜.The demand for higher bandwidth with reduced power consumption in mobile memory is increasing. In this thesis, architecture of the LPDDR4 memory controller, operated with a LPDDR4 memory, is proposed and designed, and efficient training algorithm, which is appropriate for this architecture, is proposed for memory training and verification. The operation speed range of the LPDDR4 memory specification is from 533Mbps to 4266Mbps, and the LPDDR4 memory controller is designed to support that range of the LPDDR4 memory. The phase-locked loop in the LPDDR4 memory controller is designed to operate between 1333MHz and 2133MHz. To cover the range of the LPDDR4 memory, the selectable frequency divider is used to provide operation clock. The output frequency of the phase-locked loop with divider is from 266MHz to 2133MHz. The delay-locked loop in the LPDDR4 memory controller is designed to operate between 266MHz and 2133MHz with 180หš phase locking. The delay-locked loop is used each training operation, which is command training, data read and write training. To complete training in each training stage, eye center detection algorithm is used. The circuits for the proposed eye center detection algorithm such as delay line, phase interpolator and reference generator are designed and validated. The proposed 1x2y3x eye center detection algorithm is 23 times faster than conventional two-dimensional eye center detection algorithm and it can be implemented simply. Using 65nm CMOS process, the proposed LPDDR4 memory controller occupies 12mm2. The verification of the LPDDR4 memory controller is performed with commodity LPDDR4 memory. The verification of all training sequence, which is power on, initializing, boot up, command training, write leveling, read training, write training, is performed in this environment. The low voltage swing terminated logic driver and other several functions, including write leveling and data transmission, are verified at 4266Mbps and the entire LPDDR4 memory controller operations from 566Mbps to 1600Mbps are verified. The proposed eye center detection algorithm is verified from 566Mbps to 2843Mbps.CHAPTER 1 INTRODUCTION 1 1.1 MOTIVATION 1 1.2 INTRODUCTION 5 1.3 THESIS ORGANIZATION 7 CHAPTER 2 LPDDR4 MEMORY CONTROLLER DESIGN 8 2.1 DIFFERENCE BETWEEN LPDDR3 AND LPDDR4 MEMORY 8 2.1.1 ARCHITECTURAL DIFFERENCE BETWEEN LPDDR3 AND LPDDR4 MEMORY 10 2.1.2 SOURCE SYNCHRONOUS MATCHED SCHEME AND UNMATCHED SCHEME 11 2.1.3 LOW VOLTAGE SWING TERMINATED LOGIC DRIVER AND TERMINATION SCHEME 12 2.2 LPDDR4 MEMORY CONTROLLER SPECIFICATION 15 2.3 DESIGN PROCEDURE 18 CHAPTER 3 LPDDR4 MEMORY CONTROLLER ARCHITECTURE BASED ON MEMORY TRAINING 20 3.1 LPDDR4 MEMORY TRAINING SEQUENCE 20 3.2 LPDDR4 MEMORY TRAINING EYE DETECTION ALGORITHM 24 3.2.1 EYE CENTER DETECTION 24 3.2.2 1X2Y3X EYE CENTER DETECTION ALGORITHM 27 3.3. LPDDR4 MEMORY CONTROLLER DESIGN BASED ON MEMORY TRAINING 31 3.3.1 ARCHITECTURE FOR MEMORY BOOT UP AND POWER UP 31 3.3.2 CLOCK PATH ARCHITECTURE AND CLOCK TREE 34 3.3.3 COMMAND TRAINING AND COMMAND PATH ARCHITECTURE 35 3.3.4 WRITE LEVELING AND DATA STROBE TRANSMISSION PATH ARCHITECTURE 39 3.3.5 READ TRAINING AND READ PATH ARCHITECTURE 41 3.3.6 WRITE TRAINING AND WRITE PATH ARCHITECTURE 43 3.3.7 NORMAL READ/WRITE OPERATION AND MARGIN TEST 46 CHAPTER 4 LPDDR4 MEMORY CONTROLLER ARCHITECTURE MODELING AND CIRCUIT DESIGN 48 4.1 OVERALL LPDDR4 MEMORY CONTROLLER ARCHITECTURE MODELING 48 4.2 SIMULATION RESULT OF LPDDR4 MEMORY CONTROLLER MODELING 51 4.3 LPDDR4 MEMORY CONTROLLER CIRCUIT DESIGN 61 4.3.1 PHASE-LOCKED LOOP 61 4.3.2 DELAY-LOCKED LOOP 65 4.3.3 TRANSMITTER OF LPDDR4 MEMORY CONTROLLER: WRITE PATH 70 4.3.4 DE-SERIALIZER WITH CLOCK DOMAIN CROSSING 75 CHAPTER 5 MEASUREMENT RESULT OF LPDDR4 MEMORY CONTROLLER 77 5.1 LPDDR4 MEMORY CONTROLLER MEASUREMENT SETUP 77 5.1.1 LPDDR4 MEMORY CONTROLLER FLOOR PLAN AND LAYOUT 77 5.1.2 PACKAGE AND TEST BOARD 79 5.2 LPDDR4 MEMORY CONTROLLER SUB-BLOCK MEASUREMENT 81 5.2.1 PHASE-LOCKED LOOP 81 5.2.2 DELAY-LOCKED LOOP 83 5.2.3 200PS AND 800PS DELAY LINE 85 5.2.4 VOLTAGE REFERENCE GENERATOR 86 5.2.5 PHASE INTERPOLATOR 87 5.3 LPDDR4 MEMORY SYSTEM OPERATION MEASUREMENT 90 CHAPTER 6 CONCLUSION 93 APPENDIX OPERATION FLOW CHART OF THE PROPOSED LPDDR4 MEMORY CONTROLLER 95 BIBLIOGRAPHY 118 KOREAN ABSTRACT 124Docto

    Efficient Design and Communication for 3D Stacked Dynamic Memory

    Get PDF
    As computer memory increases in size and processors continue to get faster, memory becomes an increasing bottleneck to system performance. To mitigate the slow DRAM memory chip speeds, a new generation of 3D stacked DRAM will allow lower power consumption and higher bandwidth. To communicate between these chips, this paper proposes the use of ring based standing wave oscillators for fast data transfer. With a fast clocking scheme, multiple channels can share the same bus to reduce TSVs and maintain similar memory latencies. Simulations with the new clocking scheme and data transfers are performed to show the improvements that can be made in memory communication. Experimental results show that a ring based clocking scheme can obtain two times the speed up of current stacked memory chips. Variations of this clocking scheme can also provide half the power consumption with comparable speeds. These ring-based architectures allow higher memory speeds without compromising the complexity of the hardware. This allows the ring-based memory architecture to trade off power, throughput, and latency to improve system performance for different applications

    ํ†ต๊ณ„์  ์ฃผํŒŒ์ˆ˜ ๊ฒ€์ถœ๊ธฐ ๊ธฐ๋ฐ˜ ๊ธฐ์ค€ ์ฃผํŒŒ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š๋Š” ํด๋ก ๋ฐ ๋ฐ์ดํ„ฐ ๋ณต์› ํšŒ๋กœ์˜ ์„ค๊ณ„ ๋ฐฉ๋ฒ•๋ก 

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2022. 8. ์ •๋•๊ท .In this thesis, a design of a high-speed, power-efficient, wide-range clock and data recovery (CDR) without a reference clock is proposed. A frequency acquisition scheme using a stochastic frequency detector (SFD) based on the Alexander phase detector (PD) is utilized for the referenceless operation. Pat-tern histogram analysis is presented to analyze the frequency acquisition behavior of the SFD and verified by simulation. Based on the information obtained by pattern histogram analysis, SFD using autocovariance is proposed. With a direct-proportional path and a digital integral path, the proposed referenceless CDR achieves frequency lock at all measurable conditions, and the measured frequency acquisition time is within 7ฮผs. The prototype chip has been fabricated in a 40-nm CMOS process and occupies an active area of 0.032 mm2. The proposed referenceless CDR achieves the BER of less than 10-12 at 32 Gb/s and exhibits an energy efficiency of 1.15 pJ/b at 32 Gb/s with a 1.0 V supply.๋ณธ ๋…ผ๋ฌธ์€ ๊ธฐ์ค€ ํด๋Ÿญ์ด ์—†๋Š” ๊ณ ์†, ์ €์ „๋ ฅ, ๊ด‘๋Œ€์—ญ์œผ๋กœ ๋™์ž‘ํ•˜๋Š” ํด๋Ÿญ ๋ฐ ๋ฐ์ดํ„ฐ ๋ณต์›ํšŒ๋กœ์˜ ์„ค๊ณ„๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ๊ธฐ์ค€ ํด๋Ÿญ์ด ์—†๋Š” ๋™์ž‘์„ ์œ„ํ•ด์„œ ์•Œ๋ ‰์‚ฐ๋” ์œ„์ƒ ๊ฒ€์ถœ๊ธฐ์— ๊ธฐ๋ฐ˜ํ•œ ํ†ต๊ณ„์  ์ฃผํŒŒ์ˆ˜ ๊ฒ€์ถœ๊ธฐ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์ฃผํŒŒ์ˆ˜ ํš๋“ ๋ฐฉ์‹์ด ์‚ฌ์šฉ๋œ๋‹ค. ํ†ต๊ณ„์  ์ฃผํŒŒ์ˆ˜ ๊ฒ€์ถœ๊ธฐ์˜ ์ฃผํŒŒ์ˆ˜ ์ถ”์  ์–‘์ƒ์„ ๋ถ„์„ํ•˜๊ธฐ ์œ„ํ•ด ํŒจํ„ด ํžˆ์Šคํ† ๊ทธ๋žจ ๋ถ„์„ ๋ฐฉ๋ฒ•๋ก ์„ ์ œ์‹œํ•˜์˜€๊ณ  ์‹œ๋ฎฌ๋ ˆ์ด์…˜์„ ํ†ตํ•ด ๊ฒ€์ฆํ•˜์˜€๋‹ค. ํŒจํ„ด ํžˆ์Šคํ† ๊ทธ๋žจ ๋ถ„์„์„ ํ†ตํ•ด ์–ป์€ ์ •๋ณด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ์ž๊ธฐ๊ณต๋ถ„์‚ฐ์„ ์ด์šฉํ•œ ํ†ต๊ณ„์  ์ฃผํŒŒ์ˆ˜ ๊ฒ€์ถœ๊ธฐ๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์ง์ ‘ ๋น„๋ก€ ๊ฒฝ๋กœ์™€ ๋””์ง€ํ„ธ ์ ๋ถ„ ๊ฒฝ๋กœ๋ฅผ ํ†ตํ•ด ์ œ์•ˆ๋œ ๊ธฐ์ค€ ํด๋Ÿญ์ด ์—†๋Š” ํด๋Ÿญ ๋ฐ ๋ฐ์ดํ„ฐ ๋ณต์›ํšŒ๋กœ๋Š” ๋ชจ๋“  ์ธก์ • ๊ฐ€๋Šฅํ•œ ์กฐ๊ฑด์—์„œ ์ฃผํŒŒ์ˆ˜ ์ž ๊ธˆ์„ ๋‹ฌ์„ฑํ•˜๋Š” ๋ฐ ์„ฑ๊ณตํ•˜์˜€๊ณ , ๋ชจ๋“  ๊ฒฝ์šฐ์—์„œ ์ธก์ •๋œ ์ฃผํŒŒ์ˆ˜ ์ถ”์  ์‹œ๊ฐ„์€ 7ฮผs ์ด๋‚ด์ด๋‹ค. 40-nm CMOS ๊ณต์ •์„ ์ด์šฉํ•˜์—ฌ ๋งŒ๋“ค์–ด์ง„ ์นฉ์€ 0.032 mm2์˜ ๋ฉด์ ์„ ์ฐจ์ง€ํ•œ๋‹ค. ์ œ์•ˆํ•˜๋Š” ํด๋Ÿญ ๋ฐ ๋ฐ์ดํ„ฐ ๋ณต์›ํšŒ๋กœ๋Š” 32 Gb/s์˜ ์†๋„์—์„œ ๋น„ํŠธ์—๋Ÿฌ์œจ 10-12 ์ดํ•˜๋กœ ๋™์ž‘ํ•˜์˜€๊ณ , ์—๋„ˆ์ง€ ํšจ์œจ์€ 32Gb/s์˜ ์†๋„์—์„œ 1.0V ๊ณต๊ธ‰์ „์••์„ ์‚ฌ์šฉํ•˜์—ฌ 1.15 pJ/b์„ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค.CHAPTER 1 INTRODUCTION 1 1.1 MOTIVATION 1 1.2 THESIS ORGANIZATION 13 CHAPTER 2 BACKGROUNDS 14 2.1 CLOCKING ARCHITECTURES IN SERIAL LINK INTERFACE 14 2.2 GENERAL CONSIDERATIONS FOR CLOCK AND DATA RECOVERY 24 2.2.1 OVERVIEW 24 2.2.2 JITTER 26 2.2.3 CDR JITTER CHARACTERISTICS 33 2.3 CDR ARCHITECTURES 39 2.3.1 PLL-BASED CDR โ€“ WITH EXTERNAL REFERENCE CLOCK 39 2.3.2 DLL/PI-BASED CDR 44 2.3.3 PLL-BASED CDR โ€“ WITHOUT EXTERNAL REFERENCE CLOCK 47 2.4 FREQUENCY ACQUISITION SCHEME 50 2.4.1 TYPICAL FREQUENCY DETECTORS 50 2.4.1.1 DIGITAL QUADRICORRELATOR FREQUENCY DETECTOR 50 2.4.1.2 ROTATIONAL FREQUENCY DETECTOR 54 2.4.2 PRIOR WORKS 56 CHAPTER 3 DESIGN OF THE REFERENCELESS CDR USING SFD 58 3.1 OVERVIEW 58 3.2 PROPOSED FREQUENCY DETECTOR 62 3.2.1 MOTIVATION 62 3.2.2 PATTERN HISTOGRAM ANALYSIS 68 3.2.3 INTRODUCTION OF AUTOCOVARIANCE TO STOCHASTIC FREQUENCY DETECTOR 75 3.3 CIRCUIT IMPLEMENTATION 83 3.3.1 IMPLEMENTATION OF THE PROPOSED REFERENCELESS CDR 83 3.3.2 CONTINUOUS-TIME LINEAR EQUALIZER (CTLE) 85 3.3.3 DIGITALLY-CONTROLLED OSCILLATOR (DCO) 87 3.4 MEASUREMENT RESULTS 89 CHAPTER 4 CONCLUSION 99 APPENDIX A DETAILED FREQUENCY ACQUISITION WAVEFORMS OF THE PROPOSED SFD 100 BIBLIOGRAPHY 108 ์ดˆ ๋ก 122๋ฐ•

    Digital radar receiver design based on highly efficient bandpass sampling FPGA architecture

    Get PDF
    Thesis (M.S.)--University of Oklahoma, 2009.Includes bibliographical references (leaves 62-64).As digital electronics become faster and more efficient, it becomes possible to move the analog/digital interface in a radar downconversion system further towards the antenna. Instead of digitizing radar echoes at the end of the down-conversion process, digital logic can perform the same operations previously performed by analog components. Taking full advantage of this opportunity will result in a more highly integrated and reconfigurable design. By removing unnecessary analog components, the error from component variability and noise injected into the signal of interest is reduced, the size of the receiver and the power required for operation is minimized, and the overall cost of the system can be lowered. This research is focused on employing software defined radio concepts for weather observation, thus creating a low-cost digital radar receiver at the University of Oklahoma for use in radar projects as a way of obviating the need for commmercial radar receivers, which can be many times more expensive. Software-defined radio techniques, such as bandpass sampling, are used to achieve a high data processing bandwidth and oversampling ratio with the smallest logic resource utilization. Two novel digital receiver designs are discussed in this thesis. A prototype compact single-channel digital receiver based on a 14-bit analog-to-digital converter and a hand-solderable Xilinx FPGA was built and tested both in the laboratory and at the National Weather Radar Testbed (NWRT). Building on the lessons learned from testing the single-channel digital radar receiver, a second digital receiver was designed for expanded capabilities. Through the utilization of a low-power, simultaneous-sampling eight channel ADC with high-speed serial data links and a cost-efficient FPGA with integrated DSP slices, eight data channels can be digitized, processed and transferred at the same time in a compact form factor. An ethernet interface has been included which allows for a scalable control channel so that the digital receiver's operations can be quickly modified. This also makes it possible to remotely change the firmware of the FPGA in seconds, without the need for physical access. Development of host computer platforms to store and process each digital receiver's output are also discussed
    • โ€ฆ
    corecore