143 research outputs found

    HPC Accelerators with 3D Memory

    Invited article, published in the conference proceedings by IEEE Society Press, pages 320 to 328. ISBN: 978-1-5090-3593-9. DOI: 10.1109/CSE-EUC-DCABES-2016.203.
    After a decade evolving in the High Performance Computing arena, GPU-equipped supercomputers have conquered the TOP500 and Green500 lists, providing unprecedented levels of computational power and memory bandwidth. This year, major vendors have introduced new accelerators based on 3D memory, such as Intel's Xeon Phi Knights Landing and Nvidia's Pascal architecture. This paper reviews the hardware features of these new HPC accelerators and assesses their potential performance for scientific applications, with an emphasis on the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM) used by commercial products according to already-announced roadmaps.
    Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.
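
    To make the bandwidth-bound performance argument concrete, here is a minimal roofline-style sketch; the peak-compute and bandwidth figures are illustrative assumptions, not numbers from the paper.

```python
# Roofline estimate of how memory bandwidth caps kernel performance.
# All peak numbers below are placeholders, not figures from the paper.

def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Performance is limited either by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# Hypothetical accelerator fed by a GDDR-class vs. an HBM/HMC-class memory.
configs = {
    "GDDR-class (~300 GB/s)": (5000.0, 300.0),
    "HBM-class  (~700 GB/s)": (5000.0, 700.0),
}

# STREAM-triad-like kernel: roughly 2 FLOPs per 24 bytes moved.
intensity = 2.0 / 24.0

for name, (peak_gflops, bw) in configs.items():
    perf = attainable_gflops(peak_gflops, bw, intensity)
    print(f"{name}: attainable {perf:.0f} GFLOP/s for a triad-like kernel")
```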

    Practical Partial Row Activation for 3D Stacked DRAM and Its Application to Deep Learning Workloads

    Master's thesis -- Seoul National University Graduate School, Department of Computer Science and Engineering, College of Engineering, February 2019. Advisor: Jae W. Lee.
    GPUs are widely used to run deep learning applications. Today's high-end GPUs adopt 3D stacked DRAM technologies like High-Bandwidth Memory (HBM) to provide massive bandwidth, which consumes a lot of power. Thousands of concurrent threads on a GPU cause frequent row buffer conflicts that waste a significant amount of DRAM energy. To reduce this waste, we propose a practical partial row activation scheme for 3D stacked DRAM. Exploiting the latency tolerance of deep learning workloads with abundant memory-level parallelism, we trade DRAM latency for energy savings. The proposed design demonstrates substantial savings of DRAM activation energy with minimal performance degradation for both deep learning and conventional GPU workloads. This benefit comes with a very low area cost and only minimal adjustments of DRAM timing parameters relative to the standard HBM2 DRAM interface.
    Contents: 1. Introduction; 2. Background and Motivation (Deep Learning Workloads; DRAM Access Patterns on GPU; Partial Row Activation; Performance/Area Trade-off in Partial Activation; Latency Tolerance of Deep Learning Workloads on GPU); 3. Practical Partial Row Activation (Overview; Bank Structure; Delayed Activation); 4. Evaluation (Methodology; Energy Improvement; Performance Degradation; Area Overhead); 5. Conclusion; Bibliography.
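
    As a rough illustration of the latency-for-energy trade described in the abstract, the sketch below models activation energy when only a fraction of a row is opened; every constant is an assumed placeholder, not a value from the thesis.

```python
# Back-of-the-envelope model of DRAM activation-energy savings from partial
# row activation. Constants are assumptions for illustration only.

E_ACT_FULL_PJ = 900.0   # energy of one full-row activation (assumed)

def activation_energy_pj(num_accesses, fraction_activated, extra_act_ratio):
    """Energy when only `fraction_activated` of the row is opened.

    extra_act_ratio models the additional activations needed when a later
    access touches a row segment that was not opened (the latency cost that
    is traded for energy savings).
    """
    e_partial = E_ACT_FULL_PJ * fraction_activated
    activations = num_accesses * (1.0 + extra_act_ratio)
    return activations * e_partial

baseline = activation_energy_pj(1_000_000, 1.0, 0.0)
partial = activation_energy_pj(1_000_000, 0.25, 0.10)  # quarter row, 10% extra activations
print(f"estimated activation-energy saving: {1 - partial / baseline:.1%}")
```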

    Signal and Power Integrity Challenges for High Density System-on-Package

    As the increasing desire for more compact, portable devices outpaces Moore's law, innovation in packaging and system design has played a significant role in the continued miniaturization of electronic systems. Integrating more active and passive components into the package itself, as in system-on-package (SoP), has shown very promising results in reducing the overall size and increasing the performance of electronic systems. With this ability to shrink electrical systems come the many challenges of sustaining, let alone improving, reliability and performance. The fundamental signal, power, and thermal integrity issues are discussed in detail, along with techniques published across the industry to mitigate these issues in SoP applications.
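
    As one concrete example of the power-integrity side of these challenges, the sketch below computes a first-order PDN target impedance, a sizing rule commonly used in package and SoP design discussions; the supply, ripple budget, and current step are illustrative assumptions, not values from the text.

```python
# First-order power-delivery-network (PDN) sizing rule:
#   Z_target = allowed ripple voltage / worst-case transient current.
# All numbers are illustrative assumptions.

def pdn_target_impedance(vdd, ripple_fraction, transient_current_a):
    return (vdd * ripple_fraction) / transient_current_a

z_ohm = pdn_target_impedance(vdd=0.8, ripple_fraction=0.05, transient_current_a=10.0)
print(f"PDN target impedance ~ {z_ohm * 1000:.1f} mOhm")  # 0.8 V, 5% ripple, 10 A step
```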

    Design of High-Density, Low-Power Transceivers for Next-Generation HBM

    Doctoral dissertation -- Seoul National University Graduate School, Department of Electrical and Computer Engineering, College of Engineering, August 2020. Advisor: Deog-Kyoon Jeong.
    This thesis presents design techniques for a high-density, power-efficient transceiver for next-generation high bandwidth memory (HBM). Unlike other memory interfaces, HBM uses a 3D-stacked package with through-silicon vias (TSVs) and a silicon interposer, so the transceiver for HBM must address the problems caused by the 3D-stacked package and the TSVs. First, a data (DQ) receiver for HBM is proposed with a self-tracking loop that tracks the phase skew between DQ and the data strobe (DQS) caused by voltage or thermal drift. The self-tracking loop achieves low power and small area by utilizing an analog-assisted baud-rate phase detector. The proposed pulse-to-charge (PC) phase detector (PD) converts the phase skew to a voltage difference and detects the phase skew from that voltage difference. An offset calibration scheme that compensates for mismatch in the PD is also proposed. The calibration scheme operates without any additional sensing circuits by taking advantage of the write training of HBM. Fabricated in 65 nm CMOS, the DQ receiver shows a power efficiency of 370 fJ/b at 4.8 Gb/s and occupies 0.0056 mm2. The experimental results show that the DQ receiver operates without any performance degradation under a ±10% supply variation. In a second prototype IC, a high-density transceiver for HBM with a feed-forward-equalizer (FFE)-combined crosstalk (XT) cancellation scheme is presented. To compensate for the XT, the transmitter pre-distorts the amplitude of the FFE output according to the XT. Since the proposed XT cancellation (XTC) scheme reuses the FFE implemented to equalize the channel loss, the additional circuitry for the XTC is minimized. Thanks to the XTC scheme, the channel pitch can be significantly reduced, allowing for high channel density. Moreover, the 3D-staggered channel structure removes the ground layer between vertically adjacent channels, which further reduces the cross-sectional area of the channel per lane. The test chip, including 6 data lanes, is fabricated in 65 nm CMOS technology. The 6-mm channels are implemented on chip to emulate the silicon interposer between the HBM and the processor. The operation of the XTC scheme is verified by simultaneously transmitting 4-Gb/s data over the 6 consecutive channels with 0.5-um pitch, and the XTC scheme reduces the XT-induced jitter by up to 78%. The measurement results show that the transceiver achieves a throughput of 8 Gb/s/um. The transceiver occupies 0.05 mm2 for 6 lanes and consumes 36.6 mW at 6 x 4 Gb/s.
    Contents: 1. Introduction (Motivation; Thesis Organization); 2. Background on High-Bandwidth Memory (Overview; Transceiver Architecture; Read/Write Operation); 3. Background on Coupled Wires (Generalized Model; Effect of Crosstalk); 4. DQ Receiver with Baud-Rate Self-Tracking Loop (Overview; Features of DQ Receiver for HBM; Proposed Pulse-to-Charge Phase Detector; Circuit Implementation; Measurement Result); 5. High-Density Transceiver for HBM with 3D-Staggered Channel and Crosstalk Cancellation Scheme (Overview; Proposed 3D-Staggered Channel; Proposed FFE-Combined Crosstalk Cancellation Scheme; Circuit Implementation; Measurement Result); 6. Conclusion; Bibliography.
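
    The sketch below gives a minimal behavioural picture of the FFE-combined crosstalk cancellation idea: the transmitter reuses its feed-forward equalizer output to pre-distort the victim lane against a neighbouring aggressor. Tap weights and the coupling coefficient are arbitrary assumptions, not values from the thesis.

```python
# Behavioural sketch: 2-tap FFE whose output is pre-distorted to cancel
# crosstalk coupled from a neighbouring lane. Coefficients are assumptions.
import numpy as np

def ffe(bits, main=1.0, post=-0.25):
    """Two-tap feed-forward equalizer (main cursor plus one post-cursor)."""
    symbols = 2.0 * np.asarray(bits, dtype=float) - 1.0   # NRZ mapping
    delayed = np.concatenate(([0.0], symbols[:-1]))
    return main * symbols + post * delayed

def tx_with_xtc(victim_bits, aggressor_bits, k_xt=0.15):
    """Subtract a scaled replica of the aggressor's FFE output (FFE reuse)."""
    return ffe(victim_bits) - k_xt * ffe(aggressor_bits)

victim = np.random.randint(0, 2, 16)
aggressor = np.random.randint(0, 2, 16)
print(tx_with_xtc(victim, aggressor).round(3))
```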

    Architectural Techniques to Enable Reliable and Scalable Memory Systems

    High capacity and scalable memory systems play a vital role in enabling our desktops, smartphones, and pervasive technologies like the Internet of Things (IoT). Unfortunately, memory systems are becoming increasingly prone to faults. This is because we rely on technology scaling to improve memory density, and at small feature sizes, memory cells tend to break easily. Today, memory reliability is seen as the key impediment towards using high-density devices, adopting new technologies, and even building the next Exascale supercomputer. To ensure even a bare-minimum level of reliability, present-day solutions tend to incur high performance, power, and area overheads. Ideally, we would like memory systems to remain robust, scalable, and implementable while keeping the overheads to a minimum. This dissertation describes how simple cross-layer architectural techniques can provide orders of magnitude higher reliability and enable seamless scalability for memory systems while incurring negligible overheads. Comment: PhD thesis, Georgia Institute of Technology (May 2017).
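
    To give a sense of why memory reliability becomes the key impediment at scale, here is an illustrative (and entirely assumed) failure-rate calculation; it is not a technique or figure from the dissertation.

```python
# Illustrative compounding of per-device failure rates at system scale.
# Both numbers are assumptions, not results from the dissertation.

FIT_PER_DEVICE = 66.0      # failures per 10^9 device-hours (assumed)
NUM_DEVICES = 100_000      # DRAM devices in a hypothetical large system

system_fit = FIT_PER_DEVICE * NUM_DEVICES
mttf_hours = 1e9 / system_fit
print(f"system MTTF ~ {mttf_hours:.0f} hours (~{mttf_hours / 24:.1f} days)")
```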

    Demystifying the Characteristics of High Bandwidth Memory for Real-Time Systems

    The number of functionalities controlled by software on every critical real-time product is on the rise in domains like automotive, avionics, and space. To implement these advanced functionalities, software applications increasingly adopt artificial intelligence algorithms that manage massive amounts of data transmitted from various sensors. This translates into unprecedented memory performance requirements in critical systems that the commonly used DRAM memories struggle to provide. High-Bandwidth Memory (HBM) can satisfy these requirements, offering high bandwidth, low power, and high-integration capacity. However, it remains unclear whether the predictability and isolation properties of HBM are compatible with the requirements of critical embedded systems. In this work, we perform, to our knowledge, the first timing analysis of HBM. We show the unique structural and timing characteristics of HBM with respect to DRAM memories and how they can be exploited for better time predictability, with emphasis on increased isolation among tasks and reduced worst-case memory latency. This work has been partially supported by the Spanish Ministry of Science and Innovation under grant PID2019-107255GB-C21/AEI/10.13039/501100011033; the European Union's Horizon 2020 Framework Programme under grant agreement No. 878752 (MASTECS) and agreement No. 779877 (Mont-Blanc 2020); the European Research Council (ERC) grant agreement No. 772773 (SuPerCom); and the Natural Sciences and Engineering Research Council of Canada (NSERC). Peer reviewed. Postprint (author's final draft).
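
    As a minimal illustration of the kind of bound a memory timing analysis builds on, the sketch below adds a closed-row access latency to a per-competitor interference term; the timing values and the interference model are generic placeholders, not HBM figures from the paper.

```python
# Simplified worst-case read-latency bound for a DRAM/HBM request.
# Timing parameters and the interference model are assumptions.

def worst_case_read_ns(t_rp, t_rcd, t_cl, interfering_requests, t_interference):
    """Closed-row worst case plus one interference slot per competing request."""
    own_latency = t_rp + t_rcd + t_cl
    return own_latency + interfering_requests * t_interference

# e.g. three co-running tasks, each able to delay us by one row cycle (assumed)
print(worst_case_read_ns(t_rp=14, t_rcd=14, t_cl=14,
                         interfering_requests=3, t_interference=45), "ns")
```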

    Performance Implications of NoCs on 3D-Stacked Memories: Insights from the Hybrid Memory Cube

    Memories that exploit three-dimensional (3D)-stacking technology, which integrates memory and logic dies in a single stack, are becoming popular. These memories, such as the Hybrid Memory Cube (HMC), use a network-on-chip (NoC) design to connect their internal structural organizations. This novel usage of NoC, in addition to aiding processing-in-memory capabilities, enables numerous benefits such as high bandwidth and memory-level parallelism. However, the implications of NoCs on the characteristics of 3D-stacked memories in terms of memory access latency and bandwidth have not been fully explored. This paper addresses this knowledge gap by (i) characterizing an HMC prototype on the AC-510 accelerator board and revealing its access latency behaviors, and (ii) investigating the implications of such behaviors on system and software designs.
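
    For readers unfamiliar with how such latency characterizations are done, the sketch below shows the dependent-load (pointer-chasing) pattern they typically rely on; it is written in Python purely for illustration, whereas real characterizations use native code so interpreter overhead does not dominate the measurement.

```python
# Pointer-chasing latency microbenchmark pattern (illustration only).
import random
import time

def sattolo(n):
    """Random single-cycle permutation (Sattolo's algorithm) to defeat prefetching."""
    p = list(range(n))
    for i in range(n - 1, 0, -1):
        j = random.randrange(i)      # j < i keeps the permutation a single cycle
        p[i], p[j] = p[j], p[i]
    return p

def pointer_chase(n, iters=1_000_000):
    nxt = sattolo(n)
    i = 0
    start = time.perf_counter()
    for _ in range(iters):
        i = nxt[i]                   # each access depends on the previous one
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1e9     # ns per dependent access

print(f"~{pointer_chase(1 << 20):.1f} ns per dependent access (interpreter cost included)")
```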