813 research outputs found

    ์ด์ข… ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ชจ๋ธ์„ ์œ„ํ•œ ํ™•์žฅํ˜• ์ปดํ“จํ„ฐ ์‹œ์Šคํ…œ ์„ค๊ณ„

    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2021. 2. ๊น€์žฅ์šฐ.Modern neural-network (NN) accelerators have been successful by accelerating a small number of basic operations (e.g., convolution, fully-connected, feedback) comprising the specific target neural-network models (e.g., CNN, RNN). However, this approach no longer works for the emerging full-scale natural language processing (NLP)-based neural network models (e.g., Memory networks, Transformer, BERT), which consist of different combinations of complex and heterogeneous operations (e.g., self-attention, multi-head attention, large-scale feed-forward). Existing acceleration proposals cover only the proposal-specific basic operations and/or customize them for specific models only, which leads to the low performance improvement and the narrow model coverage. Therefore, an ideal NLP accelerator should first identify all performance-critical operations required by different NLP models and support them as a single accelerator to achieve a high model coverage, and can adaptively optimize its architecture to achieve the best performance for the given model. To address these scalability and model/config diversity issues, the dissertation introduces two novel projects (i.e., MnnFast and NLP-Fast) to efficiently accelerate a wide spectrum of full-scale NLP models. First, MnnFast proposes three novel optimizations to resolve three major performance problems (i.e., high memory bandwidth, heavy computation, and cache contention) in memory-augmented neural networks. Next, NLP-Fast adopts three optimization techniques to resolve the huge performance variation due to the model/config diversity in emerging NLP models. We implement both MnnFast and NLP-Fast on different hardware platforms (i.e., CPU, GPU, FPGA) and thoroughly evaluate their performance improvement on each platform.์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์˜ ์ค‘์š”์„ฑ์ด ๋Œ€๋‘๋จ์— ๋”ฐ๋ผ ์—ฌ๋Ÿฌ ๊ธฐ์—… ๋ฐ ์—ฐ๊ตฌ์ง„๋“ค์€ ๋‹ค์–‘ํ•˜๊ณ  ๋ณต์žกํ•œ ์ข…๋ฅ˜์˜ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ชจ๋ธ๋“ค์„ ์ œ์‹œํ•˜๊ณ  ์žˆ๋‹ค. ์ฆ‰ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ชจ๋ธ๋“ค์€ ํ˜•ํƒœ๊ฐ€ ๋ณต์žกํ•ด์ง€๊ณ ,๋กœ๊ทœ๋ชจ๊ฐ€ ์ปค์ง€๋ฉฐ, ์ข…๋ฅ˜๊ฐ€ ๋‹ค์–‘ํ•ด์ง€๋Š” ์–‘์ƒ์„ ๋ณด์—ฌ์ค€๋‹ค. ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์€ ์ด๋Ÿฌํ•œ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ชจ๋ธ์˜ ๋ณต์žก์„ฑ, ํ™•์žฅ์„ฑ, ๋‹ค์–‘์„ฑ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์—ฌ๋Ÿฌ ํ•ต์‹ฌ ์•„์ด๋””์–ด๋ฅผ ์ œ์‹œํ•˜์˜€๋‹ค. ๊ฐ๊ฐ์˜ ํ•ต์‹ฌ ์•„์ด๋””์–ด๋“ค์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. (1) ๋‹ค์–‘ํ•œ ์ข…๋ฅ˜์˜ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ ์˜ค๋ฒ„ํ—ค๋“œ ๋ถ„ํฌ๋„๋ฅผ ์•Œ์•„๋‚ด๊ธฐ ์œ„ํ•œ ์ •์ /๋™์  ๋ถ„์„์„ ์ˆ˜ํ–‰ํ•œ๋‹ค. (2) ์„ฑ๋Šฅ ๋ถ„์„์„ ํ†ตํ•ด ์•Œ์•„๋‚ธ ์ฃผ๋œ ์„ฑ๋Šฅ ๋ณ‘๋ชฉ ์š”์†Œ๋“ค์˜ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ์„ ์ตœ์ ํ™” ํ•˜๊ธฐ ์œ„ํ•œ ์ „์ฒด๋ก ์  ๋ชจ๋ธ ๋ณ‘๋ ฌํ™” ๊ธฐ์ˆ ์„ ์ œ์‹œํ•œ๋‹ค. (3) ์—ฌ๋Ÿฌ ์—ฐ์‚ฐ๋“ค์˜ ์—ฐ์‚ฐ๋Ÿ‰์„ ๊ฐ์†Œํ•˜๋Š” ๊ธฐ์ˆ ๊ณผ ์—ฐ์‚ฐ๋Ÿ‰ ๊ฐ์†Œ๋กœ ์ธํ•œ skewness ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•œ dynamic scheduler ๊ธฐ์ˆ ์„ ์ œ์‹œํ•œ๋‹ค. (4) ํ˜„ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ ๋‹ค์–‘์„ฑ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๊ฐ ๋ชจ๋ธ์— ์ตœ์ ํ™”๋œ ๋””์ž์ธ์„ ์ œ์‹œํ•˜๋Š” ๊ธฐ์ˆ ์„ ์ œ์‹œํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ํ•ต์‹ฌ ๊ธฐ์ˆ ๋“ค์€ ์—ฌ๋Ÿฌ ์ข…๋ฅ˜์˜ ํ•˜๋“œ์›จ์–ด ๊ฐ€์†๊ธฐ (์˜ˆ: CPU, GPU, FPGA, ASIC) ์—๋„ ๋ฒ”์šฉ์ ์œผ๋กœ ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๋งค์šฐ ํšจ๊ณผ์ ์ด๋ฏ€๋กœ, ์ œ์‹œ๋œ ๊ธฐ์ˆ ๋“ค์€ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ชจ๋ธ์„ ์œ„ํ•œ ์ปดํ“จํ„ฐ ์‹œ์Šคํ…œ ์„ค๊ณ„ ๋ถ„์•ผ์— ๊ด‘๋ฒ”์œ„ํ•˜๊ฒŒ ์ ์šฉ๋  ์ˆ˜ ์žˆ๋‹ค. 
๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ํ•ด๋‹น ๊ธฐ์ˆ ๋“ค์„ ์ ์šฉํ•˜์—ฌ CPU, GPU, FPGA ๊ฐ๊ฐ์˜ ํ™˜๊ฒฝ์—์„œ, ์ œ์‹œ๋œ ๊ธฐ์ˆ ๋“ค์ด ๋ชจ๋‘ ์œ ์˜๋ฏธํ•œ ์„ฑ๋Šฅํ–ฅ์ƒ์„ ๋‹ฌ์„ฑํ•จ์„ ๋ณด์—ฌ์ค€๋‹ค.1 INTRODUCTION 1 2 Background 6 2.1 Memory Networks 6 2.2 Deep Learning for NLP 9 3 A Fast and Scalable System Architecture for Memory-Augmented Neural Networks 14 3.1 Motivation & Design Goals 14 3.1.1 Performance Problems in MemNN - High Off-chip Memory Bandwidth Requirements 15 3.1.2 Performance Problems in MemNN - High Computation 16 3.1.3 Performance Problems in MemNN - Shared Cache Contention 17 3.1.4 Design Goals 18 3.2 MnnFast 19 3.2.1 Column-Based Algorithm 19 3.2.2 Zero Skipping 22 3.2.3 Embedding Cache 25 3.3 Implementation 26 3.3.1 General-Purpose Architecture - CPU 26 3.3.2 General-Purpose Architecture - GPU 28 3.3.3 Custom Hardware (FPGA) 29 3.4 Evaluation 31 3.4.1 Experimental Setup 31 3.4.2 CPU 33 3.4.3 GPU 35 3.4.4 FPGA 37 3.4.5 Comparison Between CPU and FPGA 39 3.5 Conclusion 39 4 A Fast, Scalable, and Flexible System for Large-Scale Heterogeneous NLP Models 40 4.1 Motivation & Design Goals 40 4.1.1 High Model Complexity 40 4.1.2 High Memory Bandwidth 41 4.1.3 Heavy Computation 42 4.1.4 Huge Performance Variation 43 4.1.5 Design Goals 43 4.2 NLP-Fast 44 4.2.1 Bottleneck Analysis of NLP Models 44 4.2.2 Holistic Model Partitioning 47 4.2.3 Cross-operation Zero Skipping 51 4.2.4 Adaptive Hardware Reconfiguration 54 4.3 NLP-Fast Toolkit 56 4.4 Implementation 59 4.4.1 General-Purpose Architecture - CPU 59 4.4.2 General-Purpose Architecture - GPU 61 4.4.3 Custom Hardware (FPGA) 62 4.5 Evaluation 64 4.5.1 Experimental Setup 65 4.5.2 CPU 65 4.5.3 GPU 67 4.5.4 FPGA 69 4.6 Conclusion 72 5 Related Work 73 5.1 Various DNN Accelerators 73 5.2 Various NLP Accelerators 74 5.3 Model Partitioning 75 5.4 Approximation 76 5.5 Improving Flexibility 78 5.6 Resource Optimization 78 6 Conclusion 80 Abstract (In Korean) 106Docto

    An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics

    Near-sensor data analytics is a promising direction for IoT endpoints, as it minimizes energy spent on communication and reduces network load, but it also poses security concerns, as valuable data is stored or sent over the network at various stages of the analytics pipeline. Using encryption to protect sensitive data at the boundary of the on-chip analytics engine is one way to address these data-security issues. To cope with the combined workload of analytics and encryption in a tight power envelope, we propose Fulmine, a System-on-Chip based on a tightly-coupled multi-core cluster augmented with specialized blocks for compute-intensive data processing and encryption functions, while supporting software programmability for regular computing tasks. The Fulmine SoC, fabricated in 65 nm technology, consumes less than 20 mW on average at 0.8 V, achieving an efficiency of up to 70 pJ/B in encryption, 50 pJ/px in convolution, or up to 25 MIPS/mW in software. As a strong argument for flexible real-life application of our platform, we show experimental results for three secure analytics use cases: secure autonomous aerial surveillance with a state-of-the-art deep CNN consuming 3.16 pJ per equivalent RISC op; local CNN-based face detection with secured remote recognition at 5.74 pJ/op; and seizure detection with encrypted data collection from EEG within 12.7 pJ/op.

    Comment: 15 pages, 12 figures; accepted for publication in IEEE Transactions on Circuits and Systems I: Regular Papers.
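
    To make the quoted efficiency figures concrete, the back-of-the-envelope Python estimate below combines the 50 pJ/px convolution and 70 pJ/B encryption numbers from the abstract for a single image frame. The frame size and bytes-per-pixel are illustrative assumptions, not values taken from the paper.

```python
# Back-of-the-envelope energy estimate using the efficiency figures quoted
# in the abstract (70 pJ/B for encryption, 50 pJ/px for convolution).
# The frame size and bytes-per-pixel below are assumptions chosen only to
# make the arithmetic concrete; they are not taken from the paper.

PJ_PER_BYTE_ENC = 70          # encryption efficiency, pJ/B
PJ_PER_PX_CONV = 50           # convolution efficiency, pJ/px
AVG_POWER_MW = 20             # average SoC power at 0.8 V, mW

frame_px = 320 * 240          # assumed QVGA frame
bytes_per_px = 1              # assumed 8-bit grayscale

conv_energy_uj = frame_px * PJ_PER_PX_CONV / 1e6
enc_energy_uj = frame_px * bytes_per_px * PJ_PER_BYTE_ENC / 1e6
total_uj = conv_energy_uj + enc_energy_uj

# Time this energy corresponds to at the average power budget (uJ / W = us).
time_ms = total_uj / (AVG_POWER_MW * 1e-3) / 1e3

print(f"convolution: {conv_energy_uj:.2f} uJ, encryption: {enc_energy_uj:.2f} uJ")
print(f"total {total_uj:.2f} uJ  (~{time_ms:.2f} ms at {AVG_POWER_MW} mW)")
```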

    EIE: Efficient Inference Engine on Compressed Deep Neural Network

    State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps with the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations and dominates the required power. The previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning redundant connections and having multiple connections share the same weight. We propose an energy-efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE a 120x energy saving; exploiting sparsity saves 10x; weight sharing gives 8x; and skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster than CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes the FC layers of AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600 mW. It is 24,000x and 3,400x more energy efficient than a CPU and GPU, respectively. Compared with DaDianNao, EIE has 2.9x, 19x, and 3x better throughput, energy efficiency, and area efficiency.

    Comment: External links: TheNextPlatform: http://goo.gl/f7qX0L ; O'Reilly: https://goo.gl/Id1HNT ; Hacker News: https://goo.gl/KM72SV ; Embedded-vision: http://goo.gl/joQNg8 ; Talk at NVIDIA GTC'16: http://goo.gl/6wJYvn ; Talk at Embedded Vision Summit: https://goo.gl/7abFNe ; Talk at Stanford University: https://goo.gl/6lwuer. Published as a conference paper at ISCA 2016.
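
    The core computation EIE accelerates, a sparse matrix-vector product over a compressed, weight-shared matrix with zero activations skipped, can be sketched in software as follows. The storage layout and names below are illustrative assumptions in the spirit of a compressed sparse column format, not EIE's exact hardware encoding.

```python
import numpy as np

def compressed_spmv(n_rows, activations, col_ptr, row_idx, weight_idx, codebook):
    """Sketch of an EIE-style sparse matrix-vector product (assumed layout).

    The weight matrix (n_rows x len(activations)) is stored column by
    column in compressed form: the nonzeros of column j sit at positions
    col_ptr[j]:col_ptr[j+1], each described by a row index plus a small
    index into a shared `codebook` of weight values (weight sharing).
    Columns whose input activation is zero (e.g., killed by ReLU) are
    skipped entirely, mirroring the zero-activation skipping above.
    This is a software sketch of the dataflow, not the hardware design.
    """
    out = np.zeros(n_rows)
    for j, a in enumerate(activations):
        if a == 0.0:                          # skip zero activations
            continue
        start, end = col_ptr[j], col_ptr[j + 1]
        rows = row_idx[start:end]             # rows where column j is nonzero
        w = codebook[weight_idx[start:end]]   # decode the shared weights
        out[rows] += a * w                    # accumulate partial products
    return out

# Toy usage with a 4x3 matrix holding nonzeros only in columns 0 and 2.
codebook = np.array([-0.5, 0.25, 1.0])        # shared weight values
col_ptr = np.array([0, 2, 2, 4])              # column start offsets
row_idx = np.array([0, 3, 1, 2])              # rows of the nonzeros
weight_idx = np.array([2, 0, 1, 1])           # codebook indices
x = np.array([1.0, 0.0, 2.0])                 # column 1 is skipped
print(compressed_spmv(4, x, col_ptr, row_idx, weight_idx, codebook))
# -> [ 1.   0.5  0.5 -0.5]
```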

    FOS: A Modular FPGA Operating System for Dynamic Workloads

    With FPGAs now being deployed in the cloud and at the edge, there is a need for scalable design methods which can incorporate the heterogeneity present in the hardware and software components of FPGA systems. Moreover, these FPGA systems need to be maintainable and adaptable to changing workloads while improving accessibility for application developers. However, current FPGA systems fail to achieve modularity and support for multi-tenancy due to dependencies between system components and the lack of standardised abstraction layers. To solve this, we introduce a modular FPGA operating system, FOS, which adopts a modular FPGA development flow to allow each system component to be changed while remaining agnostic to the heterogeneity of EDA tool versions and hardware and software layers. Further, to dynamically maximise utilisation transparently to the users, FOS employs resource-elastic scheduling to arbitrate the FPGA resources in both the time and spatial domains for any type of accelerator. Our evaluation on different FPGA boards shows that FOS can provide performance improvements in both single-tenant and multi-tenant environments while substantially reducing development time and, at the same time, improving flexibility.
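
    As a loose illustration of resource-elastic scheduling, the Python sketch below models the FPGA fabric as a pool of interchangeable partially reconfigurable slots that are shared spatially while any are free and time-multiplexed otherwise. The class, method names and policy are assumptions made for illustration, not the FOS interface.

```python
from collections import deque

class ElasticScheduler:
    """Toy model of resource-elastic scheduling (names and policy assumed).

    An arriving accelerator asks for an ideal number of slots but can run,
    more slowly, on fewer; the scheduler shrinks grants under contention
    (spatial sharing) and queues requests when no slot is free (time
    sharing).
    """

    def __init__(self, n_slots):
        self.free = n_slots
        self.waiting = deque()

    def request(self, name, ideal_slots):
        if self.free == 0:
            self.waiting.append((name, ideal_slots))   # time-multiplex later
            return 0
        granted = min(ideal_slots, self.free)          # elastic spatial share
        self.free -= granted
        print(f"{name}: granted {granted}/{ideal_slots} slot(s)")
        return granted

    def release(self, granted):
        self.free += granted
        # Hand freed slots to the oldest waiting accelerator(s), if any.
        while self.waiting and self.free > 0:
            name, ideal = self.waiting.popleft()
            self.request(name, ideal)

# Toy usage: three accelerators competing for four slots.
sched = ElasticScheduler(n_slots=4)
aes = sched.request("aes", 2)     # granted 2/2
cnn = sched.request("cnn", 4)     # granted 2/4, shrunk under contention
fft = sched.request("fft", 1)     # granted 0/1, queued
sched.release(aes)                # fft resumes with 1 slot
```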

    Proceedings of the First International Workshop on HyperTransport Research and Applications (WHTRA2009) (revised 08/2009)

    Proceedings of the First International Workshop on HyperTransport Research and Applications (WHTRA2009), held on Feb. 12th, 2009 in Mannheim, Germany. The 1st International Workshop for Research on HyperTransport is an international, high-quality forum for scientists, researchers and developers working in the area of HyperTransport. This includes not only developments and research in HyperTransport itself, but also work which is based on or enabled by HyperTransport. HyperTransport (HT) is an interconnection technology which is typically used as the system interconnect in modern computer systems, connecting the CPUs among each other and with the I/O bridges. Primarily designed as an interconnect between high-performance CPUs, it provides extremely low latency, high bandwidth and excellent scalability. The definition of the HTX connector allows the use of HT even for add-in cards. In contrast to other peripheral interconnect technologies like PCI-Express, no protocol conversion or intermediate bridging is necessary: HT is a direct connection between device and CPU with minimal latency. Another advantage is the possibility of cache-coherent devices. Because of these properties, HT is of high interest for high-performance I/O like networking and storage, but also for co-processing and acceleration based on ASIC or FPGA technologies. Acceleration in particular is seeing a resurgence of interest today; one reason is the possibility of reducing power consumption by the use of accelerators. In the area of parallel computing, the low-latency communication allows for fine-grained communication schemes and is perfectly suited for scalable systems. In summary, HT technology offers key advantages and great performance for any research related to or based on interconnects. For more information please consult the workshop website (http://whtra.uni-hd.de).

    • โ€ฆ
    corecore