Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics, which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.
Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 2018
Energy-Efficient Neural Network Design
Thesis (Ph.D.) -- Seoul National University, College of Engineering, Department of Electrical and Computer Engineering, February 2019. Advisor: Kiyoung Choi.
Recently, deep learning has shown astounding performance on specific tasks such as image classification, speech recognition, and reinforcement learning. Some state-of-the-art deep neural networks have already surpassed human-level ability. However, neural networks involve a tremendous number of high-precision computations and frequent off-chip memory accesses for their millions of parameters. The resulting chip area and energy consumption hinder neural networks from being deployed in embedded systems. To cope with these problems, this dissertation proposes techniques for designing energy-efficient neural networks.
The first part of this dissertation addresses the design of spiking neural networks with weighted spikes, which offer shorter inference latency and lower energy consumption than conventional spiking neural networks. Spiking neural networks are regarded as one of the promising alternatives for overcoming the high energy cost of artificial neural networks, supported by many studies showing that a deep convolutional neural network can be converted into a spiking neural network with near-zero accuracy loss. However, the energy advantage of spiking neural networks comes at the cost of long classification latency due to the use of Poisson-distributed spike trains (rate coding), especially in deep networks.
We propose to use weighted spikes, which can greatly reduce the latency by assigning a different weight to a spike depending on the time phase to which it belongs. Experimental results on MNIST, SVHN, CIFAR-10, and CIFAR-100 show that the proposed spiking neural networks with weighted spikes achieve significant reductions in classification latency and spike count, leading to faster and more energy-efficient operation than conventional rate-coded spiking neural networks. We also show that one of the state-of-the-art networks, the deep residual network, can be converted into a spiking neural network without accuracy loss.
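To make the latency difference concrete, below is a minimal NumPy sketch (ours, not the authors' code) contrasting rate coding with weighted spikes; the phase-weight schedule 2^-(i+1) and the greedy encoder are illustrative assumptions.

```python
import numpy as np

def rate_encode(x, timesteps, rng):
    """Rate coding: emit a spike with probability x at each step.
    The decoded value is the spike count / timesteps, so resolution
    scales only as 1/timesteps (long latency for fine precision)."""
    spikes = rng.random(timesteps) < x
    return spikes, spikes.sum() / timesteps

def weighted_encode(x, phases):
    """Weighted spikes (illustrative schedule): a spike in phase i
    carries weight 2^-(i+1), so k phases give k-bit precision in
    just k timesteps instead of ~2^k with rate coding."""
    weights = 2.0 ** -np.arange(1, phases + 1)
    spikes = np.zeros(phases, dtype=bool)
    residual = x
    for i, w in enumerate(weights):   # greedy binary expansion of x
        if residual >= w:
            spikes[i] = True
            residual -= w
    return spikes, float(spikes @ weights)

rng = np.random.default_rng(0)
x = 0.8125  # activation in [0, 1)
_, x_rate = rate_encode(x, timesteps=16, rng=rng)  # noisy, ~1/16 resolution
_, x_ws = weighted_encode(x, phases=4)             # exact 4-bit value in 4 steps
print(f"rate coding (16 steps): {x_rate:.4f}, weighted spikes (4 steps): {x_ws:.4f}")
```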
The second part of this dissertation focuses on the design of highly energy-efficient analog neural networks in the presence of variations. Analog hardware accelerators for deep neural networks have taken center stage thanks to their high parallelism and energy efficiency. A critical weakness of analog hardware systems, however, is their vulnerability to noise. One of the biggest noise sources is process variation, a major obstacle to using analog circuits: the variation shifts various parameters of analog circuits away from their correct operating points, causing severe performance degradation or even malfunction.
To achieve high energy efficiency with analog neural networks, we propose a resistive random access memory (ReRAM) based analog implementation of binarized neural networks (BNNs) with a novel variation compensation technique through activation matching (VCAM). The proposed architecture consists of 1-transistor-1-resistor (1T1R) structured ReRAM synaptic arrays and differential-amplifier-based neurons, which enables high-density integration and high energy efficiency. To cope with the vulnerability of analog neurons to process variation, the bias of every neuron is adjusted in the direction that matches the average output activation of the ideal, variation-free neuron. The technique effectively restores the classification accuracy degraded by the variation. Experimental results on 32nm technology show that the proposed architecture achieves classification accuracies of 98.55% on MNIST and 89.63% on CIFAR-10 in the presence of 50% threshold voltage variation and 15% resistance variation at the 3-sigma point. It also achieves 970 TOPS/W energy efficiency with an MLP on MNIST.
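The activation-matching idea can be sketched with a toy scalar model (hypothetical; the thesis works with transistor-level neuron circuits): shift each variated neuron's bias until its average output activation over a calibration set matches that of the ideal, variation-free neuron.

```python
import numpy as np

rng = np.random.default_rng(1)

def neuron_act(x, w, b, offset=0.0):
    """Binary neuron: outputs 1 if the pre-activation clears the
    (possibly variation-shifted) threshold, else 0."""
    return (x @ w + b + offset > 0).astype(float)

# Calibration inputs and one neuron's weights/bias (all illustrative).
X = rng.standard_normal((1000, 64))
w = rng.standard_normal(64)
b = 0.1
offset = 0.8  # process-variation-induced threshold shift (assumed model)

target = neuron_act(X, w, b).mean()  # ideal average output activation

# Activation matching: search for a bias correction that reproduces the
# ideal neuron's average activation on the calibration set.
corrections = np.linspace(-2.0, 2.0, 401)
errors = [abs(neuron_act(X, w, b + c, offset).mean() - target) for c in corrections]
b_fix = b + corrections[int(np.argmin(errors))]

print(f"ideal mean activation : {target:.3f}")
print(f"with variation        : {neuron_act(X, w, b, offset).mean():.3f}")
print(f"after compensation    : {neuron_act(X, w, b_fix, offset).mean():.3f}")
```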
1 Introduction
1.1 Deep Neural Networks with Weighted Spikes
1.2 VCAM: Variation Compensation through Activation Matching for Analog Binarized Neural Networks
2 Background
2.1 Spiking neural network
2.2 Spiking neuron model
2.3 Rate coding in SNNs
2.4 Binarized neural networks
2.5 Resistive random access memory
3 Related Work
3.1 Training SNNs
3.2 SNNs with various spike coding schemes
3.3 BNN implementations
4 Deep Neural Networks with Weighted Spikes
4.1 SNN with weighted spikes
4.1.1 Weighted spikes
4.1.2 Spiking neuron model for weighted spikes
4.1.3 Noise spike
4.1.4 Approximation of the ReLU activation
4.1.5 ANN-to-SNN conversion
4.2 Optimization techniques
4.2.1 Skipping initial input currents in the output layer
4.2.2 The number of phases in a period
4.2.3 Accuracy-energy trade-off by early decision
4.2.4 Consideration on hardware implementation
4.3 Experimental setup
4.4 Results
4.4.1 Comparison between SNN-RC and SNN-WS
4.4.2 Trade-off by early decision
4.4.3 Comparison with other algorithms
4.5 Summary
5 VCAM: Variation Compensation through Activation Matching for Analog Binarized Neural Networks
5.1 Modification of Binarized Neural Network
5.1.1 Binarized Neural Network
5.1.2 Use of 0 and 1 Activations
5.1.3 Removal of Batch Normalization Layer
5.2 Hardware Architecture
5.2.1 ReRAM Synaptic Array
5.2.2 Neuron Circuit
5.2.3 Issues with Neuron Circuit
5.3 Variation Compensation
5.3.1 Variation Modeling
5.3.2 Impact of VT Variation
5.3.3 Variation Compensation Techniques
5.4 Experimental Results
5.4.1 Experimental Setup
5.4.2 Accuracy of the Modified BNN Algorithm
5.4.3 Variation Compensation
5.4.4 Performance Comparison
5.5 Summary
6 Conclusion
Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs
Using FPGAs to accelerate ConvNets has attracted significant attention in
recent years. However, FPGA accelerator design has not leveraged the latest
progress of ConvNets. As a result, the key application characteristics such as
frames-per-second (FPS) are ignored in favor of simply counting GOPs, and
results on accuracy, which is critical to application success, are often not
even reported. In this work, we adopt an algorithm-hardware co-design approach
to develop a ConvNet accelerator called Synetgy and a novel ConvNet model
called DiracDeltaNet. Both the accelerator and ConvNet are tailored
to FPGA requirements. DiracDeltaNet, as the name suggests, is a ConvNet with
only 1x1 convolutions, while 3x3 spatial convolutions are replaced by more
efficient shift operations. DiracDeltaNet achieves competitive accuracy on
ImageNet (88.7\% top-5), but with 42x fewer parameters and 48x
fewer OPs than VGG16. We further quantize DiracDeltaNet's weights to 4 bits and
activations to 4 bits, with less than 1\% accuracy loss. These quantizations
exploit well the nature of FPGA hardware. In short, DiracDeltaNet's small model
size, low computational OP count, low precision and simplified operators allow
us to co-design a highly customized computing unit for an FPGA. We implement
the computing units for DiracDeltaNet on an Ultra96 SoC system through
high-level synthesis. Our accelerator's final top-5 accuracy of 88.1\% on
ImageNet is higher than that of all previously reported embedded FPGA
accelerators. In addition, the accelerator reaches an inference speed of 66.3
FPS on the ImageNet classification task, surpassing prior works with similar
accuracy by at least 11.6x.
Comment: Update to the latest result
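To illustrate why shift operations suit FPGAs, here is a rough NumPy sketch of a parameter-free spatial shift layer (assumed semantics along the lines of the shift operator DiracDeltaNet adopts; the function name and 3x3 offset grouping are ours): spatial mixing costs only data movement, leaving every multiply-accumulate to the surrounding 1x1 convolutions.

```python
import numpy as np

def shift2d(x):
    """Parameter-free spatial shift (illustrative): split channels into 9
    groups and displace each group by one of the 3x3 offsets, zero-padding
    at the borders. Stands in for the spatial mixing of a 3x3 convolution."""
    n, c, h, w = x.shape
    out = np.zeros_like(x)
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    groups = np.array_split(np.arange(c), len(offsets))
    for (dy, dx), idx in zip(offsets, groups):
        src = x[:, idx,
                max(0, -dy):h - max(0, dy),
                max(0, -dx):w - max(0, dx)]
        out[:, idx,
            max(0, dy):h - max(0, -dy),
            max(0, dx):w - max(0, -dx)] = src
    return out

x = np.random.default_rng(2).standard_normal((1, 18, 8, 8)).astype(np.float32)
y = shift2d(x)  # same shape; zero multiply-accumulates spent on spatial mixing
assert y.shape == x.shape
```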
- β¦