39 research outputs found

    High-Performance, Energy-Efficient CMOS Arithmetic Circuits

    Get PDF
    In a modern microprocessor, datapath/arithmetic circuits have always been an important building block in delivering high-performance, energy-efficient computing, because arithmetic operations such as addition and binary number comparison are two of the most commonly used computing instructions. Besides the manufacturing CMOS process, the two most critical design considerations for arithmetic circuits are the logic style and micro-architecture. In this thesis, a constant-delay (CD) logic style is proposed targeting full-custom high-speed applications. The constant delay characteristic of this logic style (regardless of the logic type) makes it suitable for implementing complicated logic expressions such as addition. CD logic exhibits a unique characteristic where the output is pre-evaluated before the inputs from the preceding stage are ready. This feature enables a performance advantage over static and dynamic domino logic styles in a single cycle, multi-stage circuit block. Several design considerations including timing window width adjustment and clock distribution are discussed. Using a 65-nm general-purpose CMOS technology, the proposed logic style demonstrates an average speedup of 94% and 56% over static and dynamic domino logic, respectively, in five different logic gates. Simulation results of 8-bit ripple carry adders conclude that CD logic is 39% and 23% faster than the static and dynamic-based adders, respectively. CD logic also demonstrates 39% speedup and 64% (22%) energy-delay product reduction from static logic at 100% (10%) data activity in 32-bit carry lookahead adders. To confirm CD logic's potential, a 148 ps, single-cycle 64-bit adder with CD logic implemented in the critical path is fabricated in a 65-nm, 1-V CMOS process. A new 64-bit Ling adder micro-architecture, which utilizes both inversion and absorption properties to minimize the number of CD logic and the number of logic stage in the critical path, is also proposed. At 1-V supply, this adder's measured worst-case power and leakage power are 135 mW and 0.22 mW, respectively. A single-cycle 64-bit binary comparator utilizing a radix-2 tree structure is also proposed. This comparator architecture is specifically designed for static logic to achieve both low-power and high-performance operation, especially in low input data activity environments. At 65-nm technology with 25% (10%) data activity, the proposed design demonstrates 2.3x (3.5x) and 3.7x (5.8x) power and energy-delay product efficiency, respectively. This comparator is also 2.7x faster at iso-energy (80 fJ) or 3.3x more energy-efficient at iso-delay (200 ps) than existing designs. An improved comparator, where CD logic is utilized in the critical path to achieve high performance without sacrificing the overall energy efficiency, is also realized in a 65-nm 1-V CMOS process. At 1-V supply, the proposed comparator's measured delay is 167 ps, and has an average power and a leakage power of 2.34 mW and 0.06 mW, respectively. At 0.3-pJ iso-energy or 250-ps iso-delay budget, the proposed comparator with CD logic is 20% faster or 17% more energy-efficient compared to a comparator implemented with just the static logic

    고속 DRAM μΈν„°νŽ˜μ΄μŠ€λ₯Ό μœ„ν•œ μ „μ•• 및 μ˜¨λ„μ— λ‘”κ°ν•œ 클둝 νŒ¨μŠ€μ™€ μœ„μƒ 였λ₯˜ ꡐ정기 섀계

    Get PDF
    ν•™μœ„λ…Όλ¬Έ (박사) -- μ„œμšΈλŒ€ν•™κ΅ λŒ€ν•™μ› : κ³΅κ³ΌλŒ€ν•™ 전기·정보곡학뢀, 2021. 2. 정덕균.To cope with problems caused by the high-speed operation of the dynamic random access memory (DRAM) interface, several approaches are proposed that are focused on the clock path of the DRAM. Two delay-locked loop (DLL) based schemes, a forwarded-clock (FC) receiver (RX) with self-tracking loop and a quadrature error corrector, are proposed. Moreover, an open-loop based scheme is presented for drift compensation in the clock distribution. The open-loop scheme consumes less power consumption and reduces design complexity. The FC RX uses DLLs to compensate for voltage and temperature (VT) drift in unmatched memory interfaces. The self-tracking loop consists of two-stage cascaded DLLs to operate in a DRAM environment. With the write training and the proposed DLL, the timing relationship between the data and the sampling clock is always optimal. The proposed scheme compensates for delay drift without relying on data transitions or re-training. The proposed FC RX is fabricated in 65-nm CMOS process and has an active area containing 4 data lanes of 0.0329 mm2. After the write training is completed at the supply voltage of 1 V, the measured timing margin remains larger than 0.31-unit interval (UI) when the supply voltage drifts in the range of 0.94 V and 1.06 V from the training voltage, 1 V. At the data rate of 6.4 Gb/s, the proposed FC RX achieves an energy efficiency of 0.45 pJ/bit. Contrary to the aforementioned scheme, an open-loop-based voltage drift compensation method is proposed to minimize power consumption and occupied area. The overall clock distribution is composed of a current mode logic (CML) path and a CMOS path. In the proposed scheme, the architecture of the CML-to-CMOS converter (C2C) and the inverter is changed to compensate for supply voltage drift. The bias generator provides bias voltages to the C2C and inverters according to supply voltage for delay adjustment. The proposed clock tree is fabricated in 40 nm CMOS process and the active area is 0.004 mm2. When the supply voltage is modulated by a sinusoidal wave with 1 MHz, 100 mV peak-to-peak swing from the center of 1.1 V, applying the proposed scheme reduces the measured root-mean-square (RMS) jitter from 3.77 psRMS to 1.61 psRMS. At 6 GHz output clock, the power consumption of the proposed scheme is 11.02 mW. A DLL-based quadrature error corrector (QEC) with a wide correction range is proposed for the DRAM whose clocks are distributed over several millimeters. The quadrature error is corrected by adjusting delay lines using information from the phase error detector. The proposed error correction method minimizes increased jitter due to phase error correction by setting at least one of the delay lines in the quadrature clock path to the minimum delay. In addition, the asynchronous calibration on-off scheme reduces power consumption after calibration is complete. The proposed QEC is fabricated in 40 nm CMOS process and has an active area of 0.048 mm2. The proposed QEC exhibits a wide correctable error range of 101.6 ps and the remaining phase errors are less than 2.18Β° from 0.8 GHz to 2.3 GHz clock. At 2.3 GHz, the QEC contributes 0.53 psRMS jitter. Also, at 2.3 GHz, the power consumption is reduced from 8.89 mW to 3.39 mW when the calibration is off.λ³Έ λ…Όλ¬Έμ—μ„œλŠ” 동적 랜덀 μ•‘μ„ΈμŠ€ λ©”λͺ¨λ¦¬ (DRAM)의 속도가 증가함에 따라 클둝 νŒ¨μŠ€μ—μ„œ λ°œμƒν•  수 μžˆλŠ” λ¬Έμ œμ— λŒ€μ²˜ν•˜κΈ° μœ„ν•œ μ„Έ 가지 νšŒλ‘œλ“€μ„ μ œμ•ˆν•˜μ˜€λ‹€. μ œμ•ˆν•œ νšŒλ‘œλ“€ 쀑 두 방식듀은 지연동기루프 (delay-locked loop) 방식을 μ‚¬μš©ν•˜μ˜€κ³  λ‚˜λ¨Έμ§€ ν•œ 방식은 면적과 μ „λ ₯ μ†Œλͺ¨λ₯Ό 쀄이기 μœ„ν•΄ μ˜€ν”ˆ 루프 방식을 μ‚¬μš©ν•˜μ˜€λ‹€. DRAM의 λΉ„μ •ν•© μˆ˜μ‹ κΈ° κ΅¬μ‘°μ—μ„œ 데이터 νŒ¨μŠ€μ™€ 클둝 패슀 κ°„μ˜ 지연 뢈일치둜 인해 μ „μ•• 및 μ˜¨λ„ 변화에 따라 μ…‹μ—… νƒ€μž„ 및 ν™€λ“œ νƒ€μž„μ΄ μ€„μ–΄λ“œλŠ” 문제λ₯Ό ν•΄κ²°ν•˜κΈ° μœ„ν•΄ 지연동기루프λ₯Ό μ‚¬μš©ν•˜μ˜€λ‹€. μ œμ•ˆν•œ 지연동기루프 νšŒλ‘œλŠ” DRAM ν™˜κ²½μ—μ„œ λ™μž‘ν•˜λ„λ‘ 두 개의 μ§€μ—°λ™κΈ°λ£¨ν”„λ‘œ λ‚˜λˆ„μ—ˆλ‹€. λ˜ν•œ 초기 μ“°κΈ° ν›ˆλ ¨μ„ 톡해 데이터와 클둝을 타이밍 λ§ˆμ§„ κ΄€μ μ—μ„œ 졜적의 μœ„μΉ˜μ— λ‘˜ 수 μžˆλ‹€. λ”°λΌμ„œ μ œμ•ˆν•˜λŠ” 방식은 데이터 천이 정보가 ν•„μš”ν•˜μ§€ μ•Šλ‹€. 65-nm CMOS 곡정을 μ΄μš©ν•˜μ—¬ λ§Œλ“€μ–΄μ§„ 칩은 6.4 Gb/sμ—μ„œ 0.45 pJ/bit의 μ—λ„ˆμ§€ νš¨μœ¨μ„ 가진닀. λ˜ν•œ 1 Vμ—μ„œ μ“°κΈ° ν›ˆλ ¨ 및 지연동기루프λ₯Ό κ³ μ •μ‹œν‚€κ³  0.94 Vμ—μ„œ 1.06 VκΉŒμ§€ 곡급 전압이 λ°”λ€Œμ—ˆμ„ λ•Œ 타이밍 λ§ˆμ§„μ€ 0.31 UI보닀 큰 값을 μœ μ§€ν•˜μ˜€λ‹€. λ‹€μŒμœΌλ‘œ μ œμ•ˆν•˜λŠ” νšŒλ‘œλŠ” 클둝 뢄포 νŠΈλ¦¬μ—μ„œ μ „μ•• λ³€ν™”λ‘œ 인해 클둝 패슀의 지연이 λ‹¬λΌμ§€λŠ” 것을 μ•žμ„œ μ œμ‹œν•œ 방식과 달리 μ˜€ν”ˆ 루프 λ°©μ‹μœΌλ‘œ λ³΄μƒν•˜μ˜€λ‹€. κΈ°μ‘΄ 클둝 패슀의 인버터와 CML-to-CMOS λ³€ν™˜κΈ°μ˜ ꡬ쑰λ₯Ό λ³€κ²½ν•˜μ—¬ λ°”μ΄μ–΄μŠ€ 생성 νšŒλ‘œμ—μ„œ μƒμ„±ν•œ 곡급 전압에 따라 λ°”λ€ŒλŠ” λ°”μ΄μ–΄μŠ€ 전압을 가지고 지연을 μ‘°μ ˆν•  수 있게 ν•˜μ˜€λ‹€. 40-nm CMOS 곡정을 μ΄μš©ν•˜μ—¬ λ§Œλ“€μ–΄μ§„ 칩의 6 GHz ν΄λ‘μ—μ„œμ˜ μ „λ ₯ μ†Œλͺ¨λŠ” 11.02 mW둜 μΈ‘μ •λ˜μ—ˆλ‹€. 1.1 V μ€‘μ‹¬μœΌλ‘œ 1 MHz, 100 mV 피크 투 피크λ₯Ό κ°€μ§€λŠ” μ‚¬μΈνŒŒ μ„±λΆ„μœΌλ‘œ 곡급 전압을 λ³€μ‘°ν•˜μ˜€μ„ λ•Œ μ œμ•ˆν•œ λ°©μ‹μ—μ„œμ˜ μ§€ν„°λŠ” κΈ°μ‘΄ λ°©μ‹μ˜ 3.77 psRMSμ—μ„œ 1.61 psRMS둜 μ€„μ–΄λ“€μ—ˆλ‹€. DRAM의 솑신기 κ΅¬μ‘°μ—μ„œ 닀쀑 μœ„μƒ 클둝 κ°„μ˜ μœ„μƒ μ˜€μ°¨λŠ” μ†‘μ‹ λœ λ°μ΄ν„°μ˜ 데이터 유효 창을 κ°μ†Œμ‹œν‚¨λ‹€. 이λ₯Ό ν•΄κ²°ν•˜κΈ° μœ„ν•΄ 지연동기루프λ₯Ό λ„μž…ν•˜κ²Œ 되면 μ¦κ°€λœ μ§€μ—°μœΌλ‘œ 인해 μœ„μƒμ΄ κ΅μ •λœ ν΄λ‘μ—μ„œ 지터가 μ¦κ°€ν•œλ‹€. λ³Έ λ…Όλ¬Έμ—μ„œλŠ” μ¦κ°€λœ 지터λ₯Ό μ΅œμ†Œν™”ν•˜κΈ° μœ„ν•΄ μœ„μƒ κ΅μ •μœΌλ‘œ 인해 μ¦κ°€λœ 지연을 μ΅œμ†Œν™”ν•˜λŠ” μœ„μƒ ꡐ정 회둜λ₯Ό μ œμ‹œν•˜μ˜€λ‹€. λ˜ν•œ 유휴 μƒνƒœμ—μ„œ μ „λ ₯ μ†Œλͺ¨λ₯Ό 쀄이기 μœ„ν•΄ μœ„μƒ 였차λ₯Ό κ΅μ •ν•˜λŠ” 회둜λ₯Ό μž…λ ₯ 클둝과 λΉ„λ™κΈ°μ‹μœΌλ‘œ 끌 수 μžˆλŠ” 방법 λ˜ν•œ μ œμ•ˆν•˜μ˜€λ‹€. 40-nm CMOS 곡정을 μ΄μš©ν•˜μ—¬ λ§Œλ“€μ–΄μ§„ 칩의 μœ„μƒ ꡐ정 λ²”μœ„λŠ” 101.6 ps이고 0.8 GHz λΆ€ν„° 2.3 GHzκΉŒμ§€μ˜ λ™μž‘ 주파수 λ²”μœ„μ—μ„œ μœ„μƒ κ΅μ •κΈ°μ˜ 좜λ ₯ 클둝의 μœ„μƒ μ˜€μ°¨λŠ” 2.18°보닀 μž‘λ‹€. μ œμ•ˆν•˜λŠ” μœ„μƒ ꡐ정 회둜둜 인해 μΆ”κ°€λœ μ§€ν„°λŠ” 2.3 GHzμ—μ„œ 0.53 psRMS이고 ꡐ정 회둜λ₯Ό 껐을 λ•Œ μ „λ ₯ μ†Œλͺ¨λŠ” ꡐ정 νšŒλ‘œκ°€ μΌœμ‘Œμ„ λ•ŒμΈ 8.89 mWμ—μ„œ 3.39 mW둜 μ€„μ–΄λ“€μ—ˆλ‹€.Chapter 1 Introduction 1 1.1 Motivation 1 1.2 Thesis Organization 4 Chapter 2 Background on DRAM Interface 5 2.1 Overview 5 2.2 Memory Interface 7 Chapter 3 Background on DLL 11 3.1 Overview 11 3.2 Building Blocks 15 3.2.1 Delay Line 15 3.2.2 Phase Detector 17 3.2.3 Charge Pump 19 3.2.4 Loop filter 20 Chapter 4 Forwarded-Clock Receiver with DLL-based Self-tracking Loop for Unmatched Memory Interfaces 21 4.1 Overview 21 4.2 Proposed Separated DLL 25 4.2.1 Operation of the Proposed Separated DLL 27 4.2.2 Operation of the Digital Loop Filter in DLL 31 4.3 Circuit Implementation 33 4.4 Measurement Results 37 4.4.1 Measurement Setup and Sequence 38 4.4.2 VT Drift Measurement and Simulation 40 Chapter 5 Open-loop-based Voltage Drift Compensation in Clock Distribution 46 5.1 Overview 46 5.2 Prior Works 50 5.3 Voltage Drift Compensation Method 52 5.4 Circuit Implementation 57 5.5 Measurement Results 61 Chapter 6 Quadrature Error Corrector with Minimum Total Delay Tracking 68 6.1 Overview 68 6.2 Prior Works 70 6.3 Quadrature Error Correction Method 73 6.4 Circuit Implementation 82 6.5 Measurement Results 88 Chapter 7 Conclusion 96 Bibliography 98 초둝 102Docto
    corecore