
    An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration

    We empirically evaluate an undervolting technique, i.e., underscaling the circuit supply voltage below the nominal level, to improve the power-efficiency of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing faults due to excessive circuit latency increase. We evaluate the reliability-power trade-off for such accelerators. Specifically, we experimentally study the reduced-voltage operation of multiple components of real FPGAs, characterize the corresponding reliability behavior of CNN accelerators, propose techniques to minimize the drawbacks of reduced-voltage operation, and combine undervolting with architectural CNN optimization techniques, i.e., quantization and pruning. We investigate the effect of environmental temperature on the reliability-power trade-off of such accelerators. We perform experiments on three identical samples of modern Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification CNN benchmarks. This setup allows us to study the effects of our undervolting technique across both software and hardware variability. We achieve more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain is the result of eliminating the voltage guardband region, i.e., the safe voltage region below the nominal level that is set by the FPGA vendor to ensure correct functionality under worst-case environmental and circuit conditions. 43% of the power-efficiency gain is due to further undervolting below the guardband, which comes at the cost of accuracy loss in the CNN accelerator. We evaluate an effective frequency underscaling technique that prevents this accuracy loss, and find that it reduces this additional power-efficiency gain from 43% to 25%. Comment: To appear at the DSN 2020 conference
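
    One consistent reading of how these figures compose is multiplicative: the 2.6X guardband gain times a further 43% (or 25% with frequency underscaling) yields the reported more-than-3X overall improvement. The short Python sketch below only illustrates that arithmetic; the baseline throughput and power numbers in it are made-up placeholders, not measurements from the paper.

        # Illustrative arithmetic for composing the reported power-efficiency gains.
        # The baseline numbers below are hypothetical placeholders, NOT paper data.

        baseline_gops = 200.0   # hypothetical throughput at nominal voltage (GOPs)
        baseline_watts = 20.0   # hypothetical board power at nominal voltage (W)
        baseline_eff = baseline_gops / baseline_watts   # GOPs/W at nominal voltage

        guardband_gain = 2.6          # gain from removing the vendor voltage guardband
        below_guardband_gain = 1.43   # further gain below the guardband (accuracy loss)
        with_freq_underscaling = 1.25 # same region, frequency underscaled, accuracy kept

        eff_guardband = baseline_eff * guardband_gain
        eff_aggressive = eff_guardband * below_guardband_gain   # roughly 3.7X overall
        eff_safe = eff_guardband * with_freq_underscaling       # roughly 3.25X overall

        print(f"nominal:                 {baseline_eff:.1f} GOPs/W")
        print(f"guardband removed:       {eff_guardband:.1f} GOPs/W ({guardband_gain:.2f}X)")
        print(f"below guardband:         {eff_aggressive:.1f} GOPs/W ({eff_aggressive / baseline_eff:.2f}X)")
        print(f"below guardband + f-scal: {eff_safe:.1f} GOPs/W ({eff_safe / baseline_eff:.2f}X)")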

    Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators

    We show that DNN accelerator micro-architectures and their program mappings represent specific choices of loop order and hardware parallelism for computing the seven nested loops of DNNs, which enables us to create a formal taxonomy of all existing dense DNN accelerators. Surprisingly, the loop transformations needed to create these hardware variants can be precisely and concisely represented by Halide's scheduling language. By modifying the Halide compiler to generate hardware, we create a system that can fairly compare these prior accelerators. As long as proper loop blocking schemes are used, and the hardware can support mapping replicated loops, many different hardware dataflows yield similar energy efficiency with good performance. This is because the loop blocking can ensure that most data references stay on-chip with good locality and the processing units have high resource utilization. How resources are allocated, especially in the memory system, has a large impact on energy and performance. By optimizing hardware resource allocation while keeping throughput constant, we achieve up to 4.2X energy improvement for Convolutional Neural Networks (CNNs), and 1.6X and 1.8X improvements for Long Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively. Comment: Published as a conference paper at ASPLOS 2020
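
    The core observation, that an accelerator dataflow is just a choice of ordering, blocking, and parallelization of the seven convolution loops, can be sketched in plain Python. The blocked variant below is one illustrative schedule, not a mapping taken from the paper, and Python loops stand in for what the paper expresses with Halide's scheduling primitives.

        import numpy as np

        # The seven nested loops of a dense convolutional layer. A DNN accelerator's
        # dataflow corresponds to a particular ordering / blocking / parallelization
        # of exactly these loops. The blocked variant is one illustrative schedule.

        def conv_naive(x, w):
            N, C, H, W_ = x.shape          # batch, input channels, input height/width
            K, _, R, S = w.shape           # output channels, filter height/width
            OH, OW = H - R + 1, W_ - S + 1
            y = np.zeros((N, K, OH, OW), dtype=x.dtype)
            for n in range(N):                     # loop 1: batch
                for k in range(K):                 # loop 2: output channels
                    for c in range(C):             # loop 3: input channels
                        for oh in range(OH):       # loop 4: output rows
                            for ow in range(OW):   # loop 5: output cols
                                for r in range(R):         # loop 6: filter rows
                                    for s in range(S):     # loop 7: filter cols
                                        y[n, k, oh, ow] += x[n, c, oh + r, ow + s] * w[k, c, r, s]
            return y

        def conv_blocked(x, w, kt=4, ct=4):
            """Same computation, with output/input channels tiled (blocked).
            The tile loops (k0, c0) model data held in an on-chip buffer; the
            inner loops model the work a PE array would execute per tile."""
            N, C, H, W_ = x.shape
            K, _, R, S = w.shape
            OH, OW = H - R + 1, W_ - S + 1
            y = np.zeros((N, K, OH, OW), dtype=x.dtype)
            for n in range(N):
                for k0 in range(0, K, kt):          # blocked output-channel loop
                    for c0 in range(0, C, ct):      # blocked input-channel loop
                        for k in range(k0, min(k0 + kt, K)):
                            for c in range(c0, min(c0 + ct, C)):
                                for oh in range(OH):
                                    for ow in range(OW):
                                        for r in range(R):
                                            for s in range(S):
                                                y[n, k, oh, ow] += x[n, c, oh + r, ow + s] * w[k, c, r, s]
            return y

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            x = rng.standard_normal((1, 8, 10, 10)).astype(np.float32)
            w = rng.standard_normal((8, 8, 3, 3)).astype(np.float32)
            assert np.allclose(conv_naive(x, w), conv_blocked(x, w), atol=1e-4)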

    Reliable and Energy Efficient MLC STT-RAM Buffer for CNN Accelerators

    We propose a lightweight scheme in which the formation of a data block is changed in such a way that it tolerates soft errors significantly better than the baseline. The key insight behind our work is that CNN weights are normalized between -1 and 1 after each convolutional layer, which leaves one bit unused in the half-precision floating-point representation. By taking advantage of the unused bit, we create a backup of the most significant bit to protect it against soft errors. Also, considering that in MLC STT-RAMs the cost of memory operations (read and write) and the reliability of a cell are content-dependent (some patterns draw larger current, take longer, and are more susceptible to soft errors), we rearrange the data block to minimize the number of costly bit patterns. Combining these two techniques provides the same level of accuracy as an error-free baseline while improving the read and write energy by 9% and 6%, respectively.
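
    The spare-bit observation can be demonstrated in a few lines of NumPy, under the assumption that the unused bit is the exponent MSB (which is always 0 for FP16 values in [-1, 1]) and that the protected most significant bit is the sign bit. The sketch only shows the backup and a mismatch check; the paper's full recovery and MLC bit-pattern rearrangement scheme is not reproduced here.

        import numpy as np

        SIGN_BIT = np.uint16(1 << 15)    # most significant bit of the FP16 word
        SPARE_BIT = np.uint16(1 << 14)   # exponent MSB, always 0 for |w| <= 1

        def encode(weights_fp16):
            """Copy the sign bit into the spare (exponent-MSB) bit before storing."""
            bits = weights_fp16.view(np.uint16).copy()
            assert not np.any(bits & SPARE_BIT), "expects weights normalized to [-1, 1]"
            backup = (bits & SIGN_BIT) >> 1          # move sign (bit 15) into bit 14
            return bits | backup

        def decode(stored_bits):
            """Check the backup against the sign bit, then clear the spare bit."""
            sign = (stored_bits & SIGN_BIT) >> 1
            backup = stored_bits & SPARE_BIT
            mismatch = sign != backup                # a soft error hit bit 15 or bit 14
            restored = (stored_bits & ~SPARE_BIT).view(np.float16)
            return restored, mismatch

        if __name__ == "__main__":
            w = np.array([-0.75, 0.5, 1.0, -1.0, 0.0], dtype=np.float16)
            restored, mismatch = decode(encode(w))
            assert np.array_equal(restored, w) and not mismatch.any()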

    A Method for On-Device Personalization of Deep Neural Networks

    Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2019. 2. Egger, Bernhard. There exist several deep neural network (DNN) architectures suitable for embedded inference; however, little work has focused on training neural networks on-device. User customization of DNNs is desirable due to the difficulty of collecting a training set representative of real-world scenarios. Additionally, inter-user variation means that a general model has a limitation on its achievable accuracy. In this thesis, a DNN architecture that allows for low-power on-device user customization is proposed. This approach is applied to handwritten character recognition of both the Latin and the Korean alphabets. Experiments show a 3.5-fold reduction of the prediction error after user customization for both alphabets compared to a DNN trained with general data. The architecture is additionally evaluated on a number of embedded processors, demonstrating its practical applicability.
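
    A generic sketch of the customization idea, using a frozen pre-trained feature extractor and a small trainable output layer fine-tuned on a user's own samples, is shown below. The layer sizes, the PyTorch stand-in backbone, and the training loop are illustrative placeholders, not the specific design evaluated in the thesis.

        import torch
        import torch.nn as nn

        class PersonalizedClassifier(nn.Module):
            def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
                super().__init__()
                self.backbone = backbone                 # pre-trained, frozen on device
                for p in self.backbone.parameters():
                    p.requires_grad = False
                self.head = nn.Linear(feat_dim, num_classes)  # the only trainable part

            def forward(self, x):
                with torch.no_grad():
                    feats = self.backbone(x)
                return self.head(feats)

        def personalize(model, user_x, user_y, epochs=20, lr=1e-2):
            """Few-shot on-device fine-tuning of the head on user samples."""
            opt = torch.optim.SGD(model.head.parameters(), lr=lr)
            loss_fn = nn.CrossEntropyLoss()
            for _ in range(epochs):
                opt.zero_grad()
                loss = loss_fn(model(user_x), user_y)
                loss.backward()
                opt.step()
            return model

        if __name__ == "__main__":
            # Stand-in backbone: a small MLP over 28x28 grayscale characters.
            backbone = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU())
            model = PersonalizedClassifier(backbone, feat_dim=64, num_classes=26)
            user_x = torch.randn(32, 1, 28, 28)          # a user's handwriting samples
            user_y = torch.randint(0, 26, (32,))
            personalize(model, user_x, user_y)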

    HAQ: Hardware-Aware Automated Quantization with Mixed Precision

    Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emergent DNN hardware accelerators are beginning to support mixed precision (1-8 bits) to further improve computation efficiency, which raises a great challenge: finding the optimal bitwidth for each layer requires domain experts to explore a vast design space, trading off among accuracy, latency, energy, and model size, which is both time-consuming and sub-optimal. Conventional quantization algorithms ignore the different hardware architectures and quantize all layers in a uniform way. In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ) framework, which leverages reinforcement learning to automatically determine the quantization policy and takes the hardware accelerator's feedback into the design loop. Rather than relying on proxy signals such as FLOPs and model size, we employ a hardware simulator to generate direct feedback signals (latency and energy) to the RL agent. Compared with conventional methods, our framework is fully automated and can specialize the quantization policy for different neural network architectures and hardware architectures. Our framework effectively reduces latency by 1.4-1.95x and energy consumption by 1.9x with negligible loss of accuracy compared with fixed-bitwidth (8-bit) quantization. Our framework reveals that the optimal policies on different hardware architectures (i.e., edge and cloud architectures) under different resource constraints (i.e., latency, energy, and model size) are drastically different. We interpret the implications of different quantization policies, which offer insights for both neural network architecture design and hardware architecture design. Comment: CVPR 2019. The first three authors contributed equally to this work. Project page: https://hanlab.mit.edu/projects/haq
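
    The shape of HAQ's hardware-in-the-loop search can be sketched as follows. The real framework uses a DDPG reinforcement-learning agent and a hardware simulator returning measured latency and energy; in this sketch a random search and two stub cost models stand in for both, and the layer names, bitwidth range, and latency budget are hypothetical.

        import random

        LAYERS = ["conv1", "conv2", "conv3", "fc"]     # hypothetical network layers
        BITWIDTHS = list(range(1, 9))                  # mixed precision: 1-8 bits
        LATENCY_BUDGET_MS = 10.0                       # hypothetical deployment constraint

        def simulated_latency_ms(policy):
            """Stub for the hardware simulator's direct latency feedback."""
            return sum(0.5 * bits for bits in policy.values())

        def estimated_accuracy(policy):
            """Stub accuracy proxy: lower bitwidths cost more accuracy."""
            return 0.95 - sum(0.01 * (8 - bits) for bits in policy.values())

        def search(trials=500, seed=0):
            """Random search stand-in for the RL agent: propose per-layer bitwidths,
            query the (stub) hardware feedback, keep the best policy under budget."""
            rng = random.Random(seed)
            best = None
            for _ in range(trials):
                policy = {layer: rng.choice(BITWIDTHS) for layer in LAYERS}
                latency = simulated_latency_ms(policy)
                if latency > LATENCY_BUDGET_MS:        # reject policies over budget
                    continue
                acc = estimated_accuracy(policy)
                if best is None or acc > best[1]:
                    best = (policy, acc, latency)
            return best

        if __name__ == "__main__":
            policy, acc, latency = search()
            print(f"best per-layer bitwidths: {policy}")
            print(f"proxy accuracy: {acc:.3f}, simulated latency: {latency:.1f} ms")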