An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration
We empirically evaluate an undervolting technique, i.e., underscaling the
circuit supply voltage below the nominal level, to improve the power-efficiency
of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable
Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing
faults due to excessive circuit latency increase. We evaluate the
reliability-power trade-off for such accelerators. Specifically, we
experimentally study the reduced-voltage operation of multiple components of
real FPGAs, characterize the corresponding reliability behavior of CNN
accelerators, propose techniques to minimize the drawbacks of reduced-voltage
operation, and combine undervolting with architectural CNN optimization
techniques, i.e., quantization and pruning. We investigate the effect of
environmental temperature on the reliability-power trade-off of such
accelerators. We perform experiments on three identical samples of modern
Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification
CNN benchmarks. This setup allows us to study the effects of our undervolting
technique under both software variability (across CNN benchmarks) and hardware variability (across FPGA samples). We achieve
more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain
is the result of eliminating the voltage guardband region, i.e., the safe
voltage region below the nominal level that the FPGA vendor sets to ensure
correct functionality in worst-case environmental and circuit conditions. 43%
of the power-efficiency gain is due to further undervolting below the
guardband, which comes at the cost of accuracy loss in the CNN accelerator. We
evaluate an effective frequency underscaling technique that prevents this
accuracy loss, and find that it reduces the power-efficiency gain from 43% to
25%.
Comment: To appear at the DSN 2020 conference.
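As a hedged back-of-the-envelope illustration of how these percentages compose (my arithmetic on the reported figures, not an additional result from the paper), one consistent reading treats the guardband elimination and the below-guardband undervolting as multiplicative factors on GOPs/W:
# Hedged example arithmetic, not from the paper: treat the reported gains as
# multiplicative factors on power efficiency (GOPs/W).

def total_gain(guardband_factor, below_guardband_extra):
    # guardband_factor: gain from eliminating the voltage guardband (e.g. 2.6X)
    # below_guardband_extra: additional fractional gain from undervolting
    # below the guardband (e.g. 0.43 for +43%)
    return guardband_factor * (1.0 + below_guardband_extra)

aggressive = total_gain(2.6, 0.43)              # ~3.7X, but with CNN accuracy loss
with_freq_underscaling = total_gain(2.6, 0.25)  # ~3.25X, accuracy preserved

print(f"aggressive undervolting: ~{aggressive:.2f}X")
print(f"with frequency underscaling: ~{with_freq_underscaling:.2f}X")
# Under this reading, both cases exceed the >3X overall gain stated in the abstract.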
Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators
We show that DNN accelerator micro-architectures and their program mappings
represent specific choices of loop order and hardware parallelism for computing
the seven nested loops of DNNs, which enables us to create a formal taxonomy of
all existing dense DNN accelerators. Surprisingly, the loop transformations
needed to create these hardware variants can be precisely and concisely
represented by Halide's scheduling language. By modifying the Halide compiler
to generate hardware, we create a system that can fairly compare these prior
accelerators. As long as proper loop blocking schemes are used, and the
hardware can support mapping replicated loops, many different hardware
dataflows yield similar energy efficiency with good performance. This is
because the loop blocking can ensure that most data references stay on-chip
with good locality and the processing units have high resource utilization. How
resources are allocated, especially in the memory system, has a large impact on
energy and performance. By optimizing hardware resource allocation while
keeping throughput constant, we achieve up to 4.2X energy improvement for
Convolutional Neural Networks (CNNs), and 1.6X and 1.8X improvements for Long
Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively.
Comment: Published as a conference paper at ASPLOS 2020.
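To make the seven-loop framing concrete, below is a hedged plain-Python sketch (my illustration, not the authors' Halide code; function names and the tile size are invented): the first function spells out the seven nested loops of a 2D convolutional layer, and the second applies one loop-blocking choice, splitting the output-channel loop into an outer tile loop and an inner on-chip loop. Choices like which loops to split, how to order them, and which to unroll into parallel hardware are what Halide scheduling primitives such as split, reorder, and unroll express.
import numpy as np

def conv_seven_loops(inp, weights):
    # inp: (N, C, H, W), weights: (K, C, R, S); 'valid' convolution, stride 1.
    N, C, H, W = inp.shape
    K, _, R, S = weights.shape
    out = np.zeros((N, K, H - R + 1, W - S + 1))
    for n in range(N):                          # 1: batch
        for k in range(K):                      # 2: output channel
            for y in range(H - R + 1):          # 3: output row
                for x in range(W - S + 1):      # 4: output column
                    for c in range(C):          # 5: input channel
                        for r in range(R):      # 6: kernel row
                            for s in range(S):  # 7: kernel column
                                out[n, k, y, x] += inp[n, c, y + r, x + s] * weights[k, c, r, s]
    return out

def conv_blocked_output_channels(inp, weights, kt=4):
    # Same arithmetic, but the K loop is split into tiles of size kt: the outer
    # loop walks tiles (off-chip traffic), while the inner loop stays within a
    # tile (the data a blocked dataflow would keep on-chip for reuse).
    N, C, H, W = inp.shape
    K, _, R, S = weights.shape
    out = np.zeros((N, K, H - R + 1, W - S + 1))
    for n in range(N):
        for k0 in range(0, K, kt):
            for k in range(k0, min(k0 + kt, K)):
                for y in range(H - R + 1):
                    for x in range(W - S + 1):
                        for c in range(C):
                            for r in range(R):
                                for s in range(S):
                                    out[n, k, y, x] += inp[n, c, y + r, x + s] * weights[k, c, r, s]
    return out

# The two variants compute identical results; only the loop structure differs.
inp, w = np.random.rand(1, 3, 8, 8), np.random.rand(8, 3, 3, 3)
assert np.allclose(conv_seven_loops(inp, w), conv_blocked_output_channels(inp, w))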
Reliable and Energy Efficient MLC STT-RAM Buffer for CNN Accelerators
We propose a lightweight scheme in which the format of a data block is changed so that it tolerates soft errors significantly better than the baseline. The key insight behind our work is that CNN weights are normalized between -1 and 1 after each convolutional layer, which leaves one bit unused in the half-precision floating-point representation. By taking advantage of this unused bit, we create a backup of the most significant bit to protect it against soft errors. Also, since in MLC STT-RAMs the cost of memory operations (read and write) and the reliability of a cell are content-dependent (some bit patterns draw larger currents, take longer, and are more susceptible to soft errors), we rearrange the data block to minimize the number of costly bit patterns. Combining these two techniques provides the same level of accuracy as an error-free baseline while reducing read and write energy by 9% and 6%, respectively.
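The following is a hedged Python sketch of the unused-bit idea (my reading of the abstract, not the paper's exact encoding or bit assignment): for IEEE half precision, any weight with magnitude at most 1 has a biased exponent of at most 01111, so the top exponent bit is always zero and can hold a backup copy of the sign bit, which can then be restored after a soft error flips the original.
import numpy as np

SIGN_BIT, SPARE_BIT = 15, 14   # SPARE_BIT is the top exponent bit; it is 0 whenever |w| <= 1

def to_bits(w):
    # IEEE half-precision bit pattern of w, as a Python int.
    return int(np.frombuffer(np.float16(w).tobytes(), dtype=np.uint16)[0])

def from_bits(bits):
    return float(np.frombuffer(np.uint16(bits).tobytes(), dtype=np.float16)[0])

def encode(w):
    # Store w with its sign bit duplicated into the spare (top exponent) bit.
    bits = to_bits(w)
    sign = (bits >> SIGN_BIT) & 1
    return bits | (sign << SPARE_BIT)

def decode(bits):
    # Recover w, restoring the sign bit from its backup copy if it was flipped.
    # (A real design also needs a policy for flips of the backup bit itself.)
    sign_copy = (bits >> SPARE_BIT) & 1
    bits &= ~(1 << SPARE_BIT)                       # clear the spare bit again
    bits = (bits & 0x7FFF) | (sign_copy << SIGN_BIT)
    return from_bits(bits)

w = -0.3125
stored = encode(w)
corrupted = stored ^ (1 << SIGN_BIT)                # simulate a soft error on the sign bit
assert decode(corrupted) == w                       # sign restored from the backup bit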
A Method for On-Device Personalization of Deep Neural Networks
Master's thesis, Seoul National University Graduate School, Department of Computer Science and Engineering, College of Engineering, February 2019 (advisor: Bernhard Egger).
There exist several deep neural network (DNN) architectures suitable for embedded inference; however, little work has focused on training neural networks on-device.
User customization of DNNs is desirable because it is difficult to collect a training set representative of real-world scenarios.
Additionally, inter-user variation limits the accuracy a general model can achieve.
In this thesis, a DNN architecture that allows for low power on-device user customization is proposed.
This approach is applied to handwritten character recognition of both the Latin and the Korean alphabets.
Experiments show a 3.5-fold reduction of the prediction error after user customization for both alphabets compared to a DNN trained with general data.
This architecture is additionally evaluated on a number of embedded processors, demonstrating its practical applicability.
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Motivation
Chapter 3 Background
3.1 Deep Neural Networks
3.1.1 Inference
3.1.2 Training
3.2 Convolutional Neural Networks
3.3 On-Device Acceleration
3.3.1 Hardware Accelerators
3.3.2 Software Optimization
Chapter 4 Methodology
4.1 Initialization
4.2 On-Device Training
Chapter 5 Implementation
5.1 Pre-processing
5.2 Latin Handwritten Character Recognition
5.2.1 Dataset and BIE Selection
5.2.2 AE Design
5.3 Korean Handwritten Character Recognition
5.3.1 Dataset and BIE Selection
5.3.2 AE Design
Chapter 6 On-Device Acceleration
6.1 Architecture Optimizations
6.2 Compiler Optimizations
Chapter 7 Experimental Setup
Chapter 8 Evaluation
8.1 Latin Handwritten Character Recognition
8.2 Korean Handwritten Character Recognition
8.3 On-Device Acceleration
Chapter 9 Related Work
Chapter 10 Conclusion
Bibliography
Abstract (in Korean)
Acknowledgements
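The abstract above does not spell out the customization procedure, so the following is only a generic, hedged PyTorch sketch of one common way to keep on-device adaptation cheap (not necessarily the thesis's architecture; all module names and sizes are invented): a feature extractor trained off-device is frozen, and only a small user-specific classifier head is trained on the device from the user's own handwriting samples.
import torch
import torch.nn as nn

backbone = nn.Sequential(              # stand-in for a feature extractor pretrained off-device
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
head = nn.Linear(32, 26)               # e.g. 26 Latin character classes

for p in backbone.parameters():        # frozen on-device: no gradients through the backbone
    p.requires_grad = False

optimizer = torch.optim.SGD(head.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def customize(user_images, user_labels, epochs=5):
    # On-device fine-tuning loop over the user's own handwriting samples.
    for _ in range(epochs):
        feats = backbone(user_images)          # forward pass only through the frozen backbone
        loss = loss_fn(head(feats), user_labels)
        optimizer.zero_grad()
        loss.backward()                        # gradients computed only for the small head
        optimizer.step()

# Usage with dummy tensors standing in for a user's handwriting samples.
customize(torch.randn(8, 1, 28, 28), torch.randint(0, 26, (8,)))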
HAQ: Hardware-Aware Automated Quantization with Mixed Precision
Model quantization is a widely used technique to compress and accelerate deep
neural network (DNN) inference. Emerging DNN hardware accelerators have begun
to support mixed precision (1-8 bits) to further improve computation
efficiency, which raises a great challenge: finding the optimal bitwidth for
each layer requires domain experts to explore a vast design space, trading off
accuracy, latency, energy, and model size, which is both time-consuming and
sub-optimal. Conventional quantization algorithms ignore the underlying
hardware architecture and quantize all layers in a uniform way.
In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ)
framework, which leverages reinforcement learning to automatically determine
the quantization policy and takes the hardware accelerator's feedback into the
design loop. Rather than relying on proxy signals such as FLOPs and model size,
we employ a hardware simulator to generate direct feedback signals (latency and
energy) to the RL agent. Compared with conventional methods, our framework is
fully automated and can specialize the quantization policy for different neural
network architectures and hardware architectures. Our framework reduces latency
by 1.4-1.95x and energy consumption by 1.9x with negligible accuracy loss
compared with fixed-bitwidth (8-bit) quantization. Our framework reveals that
the optimal policies on different
hardware architectures (i.e., edge and cloud architectures) under different
resource constraints (i.e., latency, energy and model size) are drastically
different. We interpret the implications of different quantization policies,
which offer insights for both neural network architecture design and hardware
architecture design.
Comment: CVPR 2019. The first three authors contributed equally to this work.
Project page: https://hanlab.mit.edu/projects/haq
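To make the notion of a per-layer quantization policy concrete, here is a minimal numpy sketch (my illustration, not the HAQ implementation; the policy values and layer shapes are made up): a policy assigns one bitwidth to each layer, and each layer's weights are linearly quantized to that bitwidth. HAQ's contribution is having an RL agent choose these bitwidths using latency and energy feedback from a hardware simulator instead of proxy metrics such as FLOPs or model size.
import numpy as np

def quantize_uniform(w, bits):
    # Symmetric linear quantization of a weight tensor to `bits` bits.
    levels = 2 ** (bits - 1) - 1                  # e.g. 127 for 8 bits
    scale = np.abs(w).max() / levels
    return np.round(w / scale).clip(-levels, levels) * scale

# A hypothetical policy: one bitwidth per layer, which is the quantity an RL
# agent would search over in a HAQ-style framework.
policy = [8, 6, 4, 6, 8]
layers = [np.random.randn(64, 64).astype(np.float32) for _ in policy]

quantized = [quantize_uniform(w, b) for w, b in zip(layers, policy)]
for b, w, q in zip(policy, layers, quantized):
    print(f"{b}-bit layer, max quantization error {np.abs(w - q).max():.4f}")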
- …