1,037 research outputs found

    LLM: Realizing Low-Latency Memory by Exploiting Embedded Silicon Photonics for Irregular Workloads

    As emerging workloads exhibit irregular memory access patterns with poor data reuse and locality, they would benefit from a DRAM that achieves low latency without sacrificing bandwidth or energy efficiency. We propose LLM (Low Latency Memory), a co-design of the DRAM microarchitecture, the memory controller, and the LLC/DRAM interconnect that leverages embedded silicon photonics in a 2.5D/3D-integrated system-on-chip. LLM relies on Wavelength Division Multiplexing (WDM)-based photonic interconnects to reduce contention throughout the memory subsystem. LLM also increases bank-level parallelism, eliminates bus conflicts by using dedicated optical data paths, and reduces the access energy per bit with shorter global bitlines and smaller row buffers. We evaluate the design space of LLM for a variety of synthetic benchmarks and representative graph workloads on a full-system simulator (gem5). LLM exhibits low memory access latency for traffic with both regular and irregular access patterns. For irregular traffic, LLM achieves high bandwidth utilization (over 80% of peak throughput, compared to 20% for HBM2.0). For real workloads, LLM achieves 3× and 1.8× lower execution time compared to HBM2.0 and a state-of-the-art memory system with high memory-level parallelism, respectively. This study also demonstrates that, by reducing queuing on the data path, LLM achieves on average 3.4× lower memory latency variation compared to HBM2.0.
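    The claim that dedicated data paths reduce queuing can be illustrated with a toy queuing model (a minimal sketch of my own, not from the paper; the 16-bank count, Poisson arrivals, and unit service time are all assumptions). A single shared bus serializes every transfer, while per-bank paths serialize only same-bank transfers, so irregular traffic spread randomly across banks waits far less:

```python
import random

def mean_wait(num_banks, dedicated_paths, num_requests=200_000,
              arrival_rate=0.8, service_time=1.0):
    """Toy queuing sketch: requests arrive as a Poisson stream and each
    targets a random bank. A shared bus serializes every transfer;
    dedicated per-bank paths only serialize same-bank transfers."""
    random.seed(0)
    free_at = [0.0] * (num_banks if dedicated_paths else 1)
    t = total_wait = 0.0
    for _ in range(num_requests):
        t += random.expovariate(arrival_rate)      # next arrival time
        path = random.randrange(num_banks) if dedicated_paths else 0
        start = max(t, free_at[path])              # wait if path is busy
        total_wait += start - t
        free_at[path] = start + service_time
    return total_wait / num_requests

print("shared bus    :", mean_wait(16, dedicated_paths=False))  # ~2.0
print("per-bank paths:", mean_wait(16, dedicated_paths=True))   # ~0.03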

    3D ์ ์ธต DRAM์„ ์œ„ํ•œ ์‹ค์šฉ์ ์ธ Partial Row Activation ๋ฐ ๋”ฅ ๋Ÿฌ๋‹ ์›Œํฌ๋กœ๋“œ์—์˜ ์ ์šฉ

    Master's thesis, Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, February 2019. Advisor: 이재욱 (Jae W. Lee).
    GPUs are widely used to run deep learning applications. Today's high-end GPUs adopt 3D stacked DRAM technologies like High-Bandwidth Memory (HBM) to provide massive bandwidth, which consumes a great deal of power. Thousands of concurrent threads on a GPU cause frequent row buffer conflicts, wasting a significant amount of DRAM energy. To reduce this waste, we propose a practical partial row activation scheme for 3D stacked DRAM. Exploiting the latency tolerance of deep learning workloads with abundant memory-level parallelism, we trade DRAM latency for energy savings. The proposed design demonstrates substantial savings in DRAM activation energy with minimal performance degradation for both deep learning and conventional GPU workloads. This benefit comes at a very low area cost and requires only minimal adjustments to the DRAM timing parameters of the standard HBM2 interface.
    Contents: Introduction; Background and Motivation (Deep Learning Workloads, DRAM Access Patterns on GPU, Partial Row Activation, Performance/Area Trade-off in Partial Activation, Latency Tolerance of Deep Learning Workloads on GPU); Practical Partial Row Activation (Overview, Bank Structure, Delayed Activation); Evaluation (Methodology, Energy Improvement, Performance Degradation, Area Overhead); Conclusion.
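    A back-of-envelope model (my own assumption, not the thesis's evaluation) shows why partial activation pays off when row buffer conflicts are frequent: if activation energy scales with the fraction of the row opened, the savings approach the partitioning factor when only a few column accesses hit each opened row, and shrink as row locality grows:

```python
def activation_energy(accesses_per_row, segments, e_full_activate=1.0):
    """Hypothetical cost model: activating 1/segments of a row costs
    e_full_activate/segments, but each access landing in a segment that
    is not yet open triggers another partial activation. With k column
    accesses spread uniformly over the row, the expected number of
    distinct segments touched gives the expected activation cost."""
    k, s = accesses_per_row, segments
    expected_segments = s * (1 - (1 - 1 / s) ** k)
    return expected_segments * (e_full_activate / s)

for k in (1, 2, 4, 16):
    full = activation_energy(k, segments=1)
    part = activation_energy(k, segments=4)
    print(f"{k:>2} accesses/row: full={full:.2f}  partial(1/4)={part:.2f}")
```

    Under these assumptions, one access per activation (the frequent-conflict case the abstract describes) cuts activation energy by roughly 4× with quarter-row activation, while 16 accesses per row nearly erase the advantage.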