5,309 research outputs found

    Vector computers, Monte Carlo simulation, and regression analysis: An introduction (Version 2)

    Get PDF
    Monte Carlo Technique; Supercomputer; computer science

    A DRAM-based Processing-in-Memory Microarchitecture for Memory-intensive Machine Learning Applications

    Get PDF
    Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Convergence Science (Intelligent Convergence Systems Major), 2022.2. Jung Ho Ahn.
    Recently, as research on neural networks has gained significant traction, a number of memory-intensive neural network models such as recurrent neural network (RNN) models and recommendation models have been introduced to process various tasks. RNN models and recommendation models spend most of their execution time processing matrix-vector multiplication (MV-mul) and processing embedding layers, respectively. A fundamental primitive of embedding layers, tensor gather-and-reduction (GnR), gathers embedding vectors and then reduces them to a new embedding vector. Because the matrices in RNNs and the embedding tables in recommendation models have poor reusability, and their ever-increasing sizes have grown too large to fit in the on-chip storage of devices, the performance and energy efficiency of MV-mul and GnR are determined by those of main-memory DRAM. Therefore, computing these operations within DRAM draws significant attention.
    In this dissertation, we first propose a main-memory architecture called MViD, which performs MV-mul by placing MAC units inside DRAM banks. For higher computational efficiency, we use a sparse matrix format and exploit quantization. Because of the limited power budget for DRAM devices, we implement the MAC units on only a portion of the DRAM banks. We architect MViD to slow down or pause MV-mul for concurrently processing memory requests from processors while satisfying the limited power budget. Our results show that MViD provides 7.2× higher throughput than the baseline system with four DRAM ranks (performing MV-mul in a chip-multiprocessor) while running inference of Deep Speech 2 with a memory-intensive workload. We then propose TRiM, an NDP architecture for accelerating recommendation systems. Based on the observation that the DRAM datapath has a hierarchical tree structure, TRiM augments the DRAM datapath with "in-DRAM" reduction units at the DDR4/5 rank/bank-group/bank level. We modify the interface of DRAM to provide commands effectively to multiple reduction units running in parallel. We also propose a host-side architecture with hot embedding-vector replication to alleviate the load imbalance that arises across the reduction units. An optimal TRiM design based on DDR5 achieves up to a 7.7× and 3.9× speedup and reduces the energy consumption of embedding-vector gather and reduction by 55% and 50% over the baseline and a state-of-the-art NDP architecture, respectively, with a minimal area overhead equivalent to 2.66% of the DRAM chip.
    Contents:
    Abstract
    List of Tables
    List of Figures
    1 Introduction
      1.1 Accelerating RNNs on Edge
      1.2 Accelerating Recommendation Model
      1.3 Research Contributions
      1.4 Outline
    2 Background
      2.1 Memory-intensive Machine Learning Applications
      2.2 DRAM Organization and Operations
    3 MViD: Sparse Matrix-Vector Multiplication in Mobile DRAM for Accelerating Recurrent Neural Networks
      3.1 Background and Motivation
        3.1.1 Energy-efficient RNN Mobile Inference
        3.1.2 How to Improve the Energy Efficiency and Bandwidth of DRAM Accesses in MV-mul
      3.2 MV-mul in DRAM
        3.2.1 Exploiting Quantization and Sparsity in RNN's Matrix Elements
        3.2.2 The Operation Sequence of MV-mul in DRAM
        3.2.3 Concurrently Serving Requests from Processors and Performing MV-mul in DRAM
        3.2.4 Put It All Together: MViD Architecture
        3.2.5 Additional Optimization Schemes
      3.3 Evaluation
        3.3.1 Power/Area/Timing Analysis
        3.3.2 Performance/Energy Evaluation
      3.4 Discussion
    4 TRiM: Enhancing Processor-Memory Interfaces with Scalable Tensor Reduction in Memory
      4.1 Prior NDP Architectures for Accelerating Tensor Gather-and-Reduction
        4.1.1 Tensor Gather-and-Reduction in RecSys
        4.1.2 Prior NDP Accelerators for GnR
        4.1.3 Quantitative Analysis
        4.1.4 Additional Schemes for Accelerating GnR
      4.2 Tensor Reduction in Memory
        4.2.1 Basic Concept for TRiM
        4.2.2 How to Provision C/A Bandwidth
        4.2.3 Exploring NDP Unit Placement
        4.2.4 TRiM-G Organization and Operations
        4.2.5 Host-side Architecture for TRiM
        4.2.6 Schemes for Improving Reliability
      4.3 Experimental Setup
      4.4 Evaluation
        4.4.1 Performance and Energy Efficiency
        4.4.2 Sensitivity Study of Hot-entry Replication
        4.4.3 Design Overhead
      4.5 Discussion
    5 Discussion
    6 Related Work
    7 Conclusion
    References
    Abstract in Korean
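    Since GnR is the primitive both proposals revolve around, a minimal reference version helps fix its semantics. The C sketch below is illustrative only (the function name and the row-major table layout are assumptions, not taken from the dissertation): it gathers the table rows selected by an index list and reduces them element-wise into one output vector, which is exactly the memory traffic that TRiM's in-DRAM reduction units absorb.

        #include <stddef.h>

        /* Tensor gather-and-reduction (GnR), the core primitive of an
         * embedding layer: gather the embedding vectors selected by
         * `indices` from a row-major table and sum them into `out`.
         * On a conventional system every gathered row crosses the
         * memory bus, so GnR throughput is bound by main-memory DRAM. */
        void gnr(const float *table, size_t dim,
                 const size_t *indices, size_t n, float *out)
        {
            for (size_t d = 0; d < dim; d++)
                out[d] = 0.0f;
            for (size_t i = 0; i < n; i++) {
                const float *row = table + indices[i] * dim;
                for (size_t d = 0; d < dim; d++)
                    out[d] += row[d];   /* element-wise reduction */
            }
        }

    TRiM's idea, in these terms, is to perform the `out[d] += row[d]` accumulation at the rank/bank-group/bank level of the DRAM datapath, so that only the reduced vector, not every gathered row, travels to the host.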

    Enlarging instruction streams

    Get PDF
    The stream fetch engine is a high-performance fetch architecture based on the concept of an instruction stream. We call a sequence of instructions from the target of a taken branch to the next taken branch, potentially containing multiple basic blocks, a stream. The long length of instruction streams makes it possible for the stream fetch engine to provide a high fetch bandwidth and to hide the branch predictor access latency, leading to performance close to that of a trace cache at a lower implementation cost and complexity. Therefore, enlarging instruction streams is an excellent way to improve the stream fetch engine. In this paper, we present several hardware and software mechanisms focused on enlarging those streams that finalize at particular branch types. However, our results point out that focusing on particular branch types is not a good strategy due to Amdahl's law. Consequently, we propose the multiple-stream predictor, a novel mechanism that deals with all branch types by combining single streams into long virtual streams. This proposal tolerates the prediction-table access latency without requiring the complexity caused by additional hardware mechanisms like prediction overriding. Moreover, it provides high performance comparable to state-of-the-art fetch architectures but with a simpler design that consumes less energy.
    Peer Reviewed. Postprint (published version).
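    To make the stream definition concrete, the following C sketch, under an assumed trace-record layout, cuts a dynamic branch trace into streams at every taken branch; not-taken branches stay inside the current stream, which is why a stream can span multiple basic blocks.

        #include <stdbool.h>
        #include <stdio.h>

        /* One retired branch from a dynamic trace (assumed layout). */
        struct branch_rec {
            unsigned long pc;     /* address of the branch          */
            unsigned long target; /* target if the branch is taken  */
            bool taken;           /* branch outcome                 */
        };

        /* Emit one stream per taken branch: each stream runs from the
         * target of the previous taken branch to the next taken branch. */
        static void split_into_streams(const struct branch_rec *trace, size_t n)
        {
            unsigned long start = trace[0].pc;  /* assume the trace opens a stream */
            for (size_t i = 0; i < n; i++) {
                if (trace[i].taken) {
                    printf("stream: %#lx..%#lx\n", start, trace[i].pc);
                    start = trace[i].target;  /* next stream begins at the target */
                }
            }
        }

    Enlarging streams, in these terms, means reducing the number of taken-branch cut points; the multiple-stream predictor achieves this by stitching consecutive single streams into one long virtual stream.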

    Achieving Efficient Strong Scaling with PETSc using Hybrid MPI/OpenMP Optimisation

    Full text link
    The increasing number of processing elements and decreasing memory-to-core ratio in modern high-performance platforms make efficient strong scaling a key requirement for numerical algorithms. In order to achieve efficient scalability on massively parallel systems, scientific software must evolve across the entire stack to exploit the multiple levels of parallelism exposed in modern architectures. In this paper we demonstrate the use of hybrid MPI/OpenMP parallelisation to optimise parallel sparse matrix-vector multiplication in PETSc, a widely used scientific library for the scalable solution of partial differential equations. Using large matrices generated by Fluidity, an open source CFD application code which uses PETSc as its linear solver engine, we evaluate the effect of explicit communication overlap using task-based parallelism and show how to further improve performance by explicitly load balancing threads within MPI processes. We demonstrate a significant speedup over the pure-MPI mode and efficient strong scaling of sparse matrix-vector multiplication on Fujitsu PRIMEHPC FX10 and Cray XE6 systems.
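    As a rough illustration of the hybrid scheme, not PETSc's actual MatMult code, the sketch below threads the local part of a distributed CSR sparse matrix-vector product with OpenMP inside each MPI rank; the MPI halo exchange that fills the off-process entries of x, which the paper overlaps with computation using task-based parallelism, is elided.

        #include <omp.h>

        /* y = A_local * x for the block of rows owned by this MPI rank,
         * stored in CSR format. The row loop is threaded with OpenMP;
         * schedule(static) mirrors a fixed row partition, whereas the
         * paper goes further and balances nonzeros across threads. */
        void spmv_local(int nrows, const int *rowptr, const int *colind,
                        const double *val, const double *x, double *y)
        {
            #pragma omp parallel for schedule(static)
            for (int i = 0; i < nrows; i++) {
                double sum = 0.0;
                for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
                    sum += val[j] * x[colind[j]];
                y[i] = sum;
            }
        }

    Running one multi-threaded rank per NUMA domain instead of one single-threaded rank per core reduces the number of MPI messages and the memory footprint of halo buffers, which is where the strong-scaling benefit of the hybrid mode comes from.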