
    3D ์ ์ธต DRAM์„ ์œ„ํ•œ ์‹ค์šฉ์ ์ธ Partial Row Activation ๋ฐ ๋”ฅ ๋Ÿฌ๋‹ ์›Œํฌ๋กœ๋“œ์—์˜ ์ ์šฉ

    Master's thesis (M.S.) -- Seoul National University Graduate School, Department of Computer Science and Engineering, College of Engineering, February 2019. Advisor: ์ด์žฌ์šฑ (Jae W. Lee). GPUs are widely used to run deep learning applications. Today's high-end GPUs adopt 3D stacked DRAM technologies such as High-Bandwidth Memory (HBM) to provide massive bandwidth, which consumes a large amount of power. Thousands of concurrent threads on a GPU cause frequent row buffer conflicts, wasting a significant amount of DRAM energy. To reduce this waste, we propose a practical partial row activation scheme for 3D stacked DRAM. Exploiting the latency tolerance of deep learning workloads, which offer abundant memory-level parallelism, we trade DRAM latency for energy savings. The proposed design demonstrates substantial savings in DRAM activation energy with minimal performance degradation for both deep learning and conventional GPU workloads. This benefit comes at a very low area cost and requires only minimal adjustments of DRAM timing parameters relative to the standard HBM2 DRAM interface.
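
    As a rough illustration of the trade-off this abstract describes, the following sketch models how activation energy and timing might change when only a fraction of a DRAM row is opened. It is a minimal sketch: every constant (row size, per-byte activation energy, timing values) is an assumed placeholder, not a figure from the thesis.

```python
# Illustrative model of the energy/latency trade-off in partial row activation.
# All constants below are assumed placeholder values, not numbers from the thesis.

FULL_ROW_BYTES = 1024          # assumed HBM2-like row segment size
ACT_ENERGY_PER_BYTE_PJ = 0.9   # assumed activation energy per byte, in picojoules
BASE_TRCD_NS = 14.0            # assumed ACT-to-READ delay for a full-row activation

def activation_energy_pj(fraction: float) -> float:
    """Activation energy scales roughly with the number of bits opened."""
    return FULL_ROW_BYTES * fraction * ACT_ENERGY_PER_BYTE_PJ

def effective_trcd_ns(fraction: float, extra_ns: float = 2.0) -> float:
    """Partial activation may need a slightly relaxed timing parameter; here we
    charge a fixed assumed penalty whenever less than a full row is opened."""
    return BASE_TRCD_NS + (extra_ns if fraction < 1.0 else 0.0)

for frac in (1.0, 0.5, 0.25, 0.125):
    e = activation_energy_pj(frac)
    t = effective_trcd_ns(frac)
    print(f"fraction={frac:5.3f}  ACT energy={e:7.1f} pJ  tRCD={t:4.1f} ns")
```

    Under these assumed numbers, opening a quarter of a row cuts activation energy roughly fourfold at the cost of a small fixed latency penalty, which latency-tolerant workloads with high memory-level parallelism can absorb.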

    New Logic-In-Memory Paradigms: An Architectural and Technological Perspective

    Processing systems are in continuous evolution thanks to constant technological advancement and architectural progress. Over the years, computing systems have become more and more powerful, providing support for applications, such as machine learning, that require high computational power. However, the growing complexity of modern computing units and applications has had a strong impact on power consumption. In addition, the memory plays a key role in the overall power consumption of the system, especially for data-intensive applications, which require a lot of data movement between the memory and the computing unit. The consequence is twofold: memory accesses are expensive in terms of energy, and a lot of time is wasted accessing the memory rather than processing, because of the performance gap that exists between memories and processing units. This gap is known as the memory wall or the von Neumann bottleneck and is due to the different rates of progress of complementary metal-oxide-semiconductor (CMOS) technology and memories. Moreover, CMOS scaling is reaching a limit beyond which further progress will not be possible. This work addresses these problems from an architectural and technological point of view by: (1) proposing a novel Configurable Logic-in-Memory Architecture that exploits the in-memory computing paradigm to mitigate the memory wall problem while also providing high performance thanks to its flexibility and parallelism; (2) exploring a non-CMOS technology as a candidate technology for the Logic-in-Memory paradigm.
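
    To make the in-memory computing idea concrete, here is a minimal software analogue: stored rows are treated as bit-vectors and logic is applied across them without streaming words to a separate compute unit. The row width and operation set are illustrative assumptions; the paper's Configurable Logic-in-Memory Architecture is a hardware design, not this code.

```python
# Minimal software analogue of a logic-in-memory operation: rows stored as
# Python integers act as bit-vectors, and logic (AND/OR/XOR) is applied
# across rows "inside the array" rather than in a separate ALU.
# The 64-bit row width and the operation set are illustrative assumptions.

ROW_BITS = 64

memory = {
    "row0": 0b1010_1100_1111_0000,
    "row1": 0b0110_1010_0101_1111,
}

def lim_op(dst: str, src_a: str, src_b: str, op: str) -> None:
    """Compute src_a <op> src_b and store the result back into the array."""
    ops = {"and": lambda a, b: a & b,
           "or":  lambda a, b: a | b,
           "xor": lambda a, b: a ^ b}
    memory[dst] = ops[op](memory[src_a], memory[src_b]) & ((1 << ROW_BITS) - 1)

lim_op("row2", "row0", "row1", "and")
print(f"{memory['row2']:016b}")  # result row produced without moving operands out
```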

    A survey of near-data processing architectures for neural networks

    Data-intensive workloads and applications, such as machine learning (ML), are fundamentally limited by traditional computing systems based on the von Neumann architecture. As data movement operations and energy consumption become key bottlenecks in the design of computing systems, interest in unconventional approaches such as Near-Data Processing (NDP), machine learning, and especially neural network (NN)-based accelerators has grown significantly. Emerging memory technologies, such as ReRAM and 3D-stacked memory, are promising for efficiently architecting NDP-based accelerators for NNs due to their ability to serve both as high-density/low-energy storage and as in/near-memory computation/search engines. In this paper, we present a survey of techniques for designing NDP architectures for NNs. By classifying the techniques based on the memory technology employed, we underscore their similarities and differences. Finally, we discuss open challenges and future perspectives that need to be explored in order to improve and extend the adoption of NDP architectures for future computing platforms. This paper will be valuable for computer architects, chip designers, and researchers in the area of machine learning. This work has been supported by the CoCoUnit ERC Advanced Grant of the EU's Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00, and the ICREA Academia program.

    Exploring New Computing Paradigms for Data-Intensive Applications

    The abstract is provided in the attachment.

    Data processing and information classification — an in-memory approach

    To live in the information society means to be surrounded by billions of electronic devices full of sensors that constantly acquire data. This enormous amount of data must be processed and classified. A commonly adopted solution is to send these data to server farms to be processed remotely. The drawback is a huge battery drain due to the large amount of information that must be exchanged. To avoid this problem, data must be processed locally, near the sensor itself, but this requires substantial computational capability. While microprocessors, even mobile ones, nowadays have enough computational power, their performance is severely limited by the memory wall problem: memories are too slow, so microprocessors cannot fetch data from them fast enough, which greatly limits performance. A solution is the Processing-In-Memory (PIM) approach, in which memories are designed to process data internally, eliminating the memory wall problem. In this work we present an example of such a system, using the bitmap indexing algorithm as a case study. This algorithm is used to classify data coming from many sources in parallel. We propose a hardware accelerator designed around the Processing-In-Memory approach that implements this algorithm and can also be reconfigured to perform other tasks or to work as a standard memory. The architecture has been synthesized using CMOS technology. The results we have obtained highlight that it is not only possible to process and classify huge amounts of data locally, but also to do so with very low power consumption. Andrighetti, M.; Turvani, G.; Santoro, G.; Vacca, M.; Marchesin, A.; Ottati, F.; Roch, M. R.; Graziano, M.; Zamboni, M.
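
    Since the case study is bitmap indexing, a short software sketch may help: one bit-vector per attribute value, with each record setting its bit in the bitmap of its value, so that classification queries reduce to bitwise operations over whole bitmaps. The sample data and attributes below are invented for illustration; the paper realizes this algorithm in a reconfigurable PIM fabric, not in Python.

```python
# Bitmap indexing sketch: one bit-vector per attribute value; record i sets
# bit i in the bitmap of its value. Queries become bitwise ops over bitmaps,
# which is what a PIM array can evaluate row-parallel inside the memory.
# The sample records are invented for illustration.

records = [
    {"color": "red",  "size": "small"},
    {"color": "blue", "size": "large"},
    {"color": "red",  "size": "large"},
    {"color": "red",  "size": "small"},
]

def build_bitmaps(rows, attr):
    """Build one integer bitmap per distinct value of the given attribute."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps.setdefault(row[attr], 0)
        bitmaps[row[attr]] |= 1 << i
    return bitmaps

color = build_bitmaps(records, "color")
size = build_bitmaps(records, "size")

# "red AND small" is a single AND over two bitmaps, regardless of record count.
hits = color["red"] & size["small"]
print([i for i in range(len(records)) if hits >> i & 1])  # -> [0, 3]
```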

    Hyperdrive: A Multi-Chip Systolically Scalable Binary-Weight CNN Inference Engine

    Deep neural networks have achieved impressive results in computer vision and machine learning. Unfortunately, state-of-the-art networks are extremely compute- and memory-intensive, which makes them unsuitable for mW-devices such as IoT end-nodes. Aggressive quantization of these networks dramatically reduces their computation and memory footprint. Binary-weight neural networks (BWNs) follow this trend, pushing weight quantization to the limit. Hardware accelerators for BWNs presented up to now have focused on core efficiency, disregarding the I/O bandwidth and system-level efficiency that are crucial for deploying accelerators in ultra-low-power devices. We present Hyperdrive: a BWN accelerator that dramatically reduces I/O bandwidth by exploiting a novel binary-weight streaming approach. It supports convolutional neural networks of arbitrary size and input resolution by exploiting the natural scalability of the compute units at both chip level and system level, arranging Hyperdrive chips systolically in a 2D mesh that processes the entire feature map in parallel. Hyperdrive achieves a system-level efficiency of 4.3 TOp/s/W (i.e., including I/Os), 3.1x higher than state-of-the-art BWN accelerators, even though its core uses resource-intensive FP16 arithmetic for increased robustness.
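
    A binary-weight layer constrains weights to {-1, +1} (commonly with a per-filter scale), so every multiplication reduces to an addition or subtraction. The sketch below shows a 1-D binary-weight convolution in plain Python to make that concrete; the shapes and the scaling factor alpha are illustrative assumptions, and the code does not model Hyperdrive's weight streaming or systolic arrangement.

```python
# 1-D binary-weight convolution sketch: weights are restricted to {-1, +1},
# so each tap is an add or subtract instead of a multiply. alpha is an
# assumed per-filter scaling factor, as in common BWN formulations.

def bwn_conv1d(x, w_bin, alpha):
    """x: input samples; w_bin: list of +1/-1 taps; alpha: per-filter scale."""
    k = len(w_bin)
    out = []
    for i in range(len(x) - k + 1):
        acc = 0.0
        for j, w in enumerate(w_bin):
            acc += x[i + j] if w > 0 else -x[i + j]  # no multiplication needed
        out.append(alpha * acc)
    return out

print(bwn_conv1d([0.5, -1.0, 2.0, 0.25], [+1, -1, +1], alpha=0.8))
```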