
    Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated into the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics, including the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.
    Comment: Accepted for publication in ACM Computing Surveys (CSUR), 2018.

    μ—λ„ˆμ§€ 효율적 인곡신경망 섀계

    Doctoral dissertation, Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2019. Advisor: Kiyoung Choi.
    Recently, deep learning has shown astounding performance on tasks such as image classification, speech recognition, and reinforcement learning, and some state-of-the-art deep neural networks have already surpassed human ability. However, neural networks involve a tremendous number of high-precision computations and frequent off-chip memory accesses for millions of parameters. This incurs large chip area and high energy consumption, which hinder neural networks from being deployed in embedded systems. To cope with these problems, this dissertation proposes techniques for designing energy-efficient neural networks.
    The first part of this dissertation addresses the design of spiking neural networks with weighted spikes, which offer shorter inference latency and lower energy consumption than conventional spiking neural networks. Spiking neural networks are regarded as one of the promising alternatives for overcoming the high energy costs of artificial neural networks, and many studies have shown that a deep convolutional neural network can be converted into a spiking neural network with near-zero accuracy loss. However, the energy advantage of spiking neural networks comes at the cost of long classification latency due to the use of Poisson-distributed spike trains (rate coding), especially in deep networks. We propose to use weighted spikes, which greatly reduce the latency by assigning a different weight to a spike depending on the time phase it belongs to. Experimental results on MNIST, SVHN, CIFAR-10, and CIFAR-100 show that the proposed spiking neural networks with weighted spikes achieve significant reductions in classification latency and number of spikes, leading to faster and more energy-efficient operation than conventional rate-coded spiking neural networks. We also show that a state-of-the-art network, the deep residual network, can be converted into a spiking neural network without accuracy loss.
    The second part of this dissertation focuses on the design of highly energy-efficient analog neural networks in the presence of variations. Analog hardware accelerators for deep neural networks offer high parallelism and energy efficiency, but a critical weakness of analog hardware is its vulnerability to noise. One of the largest noise sources is process variation, which shifts the operating points of analog circuits and causes severe performance degradation or even malfunction. To achieve high energy efficiency with analog neural networks, we propose a resistive random access memory (ReRAM) based analog implementation of binarized neural networks (BNNs) with a novel variation compensation technique through activation matching (VCAM). The proposed architecture consists of 1-transistor-1-resistor (1T1R) structured ReRAM synaptic arrays and differential-amplifier-based neurons, which enable high-density integration and energy-efficient operation. To cope with the vulnerability of analog neurons to process variation, the biases of all neurons are adjusted so that their average output activations match those of ideal, variation-free neurons. This technique effectively restores the classification accuracy degraded by the variation. Experimental results on a 32nm technology show that the proposed architecture achieves classification accuracies of 98.55% on MNIST and 89.63% on CIFAR-10 in the presence of 50% threshold voltage variation and 15% resistance variation at the 3-sigma point, and reaches an energy efficiency of 970 TOPS/W with an MLP on MNIST.
    Table of contents:
    1 Introduction
      1.1 Deep Neural Networks with Weighted Spikes
      1.2 VCAM: Variation Compensation through Activation Matching for Analog Binarized Neural Networks
    2 Background
      2.1 Spiking neural network
      2.2 Spiking neuron model
      2.3 Rate coding in SNNs
      2.4 Binarized neural networks
      2.5 Resistive random access memory
    3 Related Work
      3.1 Training SNNs
      3.2 SNNs with various spike coding schemes
      3.3 BNN implementations
    4 Deep Neural Networks with Weighted Spikes
      4.1 SNN with weighted spikes (weighted spikes; spiking neuron model for weighted spikes; noise spike; approximation of the ReLU activation; ANN-to-SNN conversion)
      4.2 Optimization techniques (skipping initial input currents in the output layer; the number of phases in a period; accuracy-energy trade-off by early decision; consideration on hardware implementation)
      4.3 Experimental setup
      4.4 Results (comparison between SNN-RC and SNN-WS; trade-off by early decision; comparison with other algorithms)
      4.5 Summary
    5 VCAM: Variation Compensation through Activation Matching for Analog Binarized Neural Networks
      5.1 Modification of Binarized Neural Network (binarized neural network; use of 0 and 1 activations; removal of batch normalization layer)
      5.2 Hardware Architecture (ReRAM synaptic array; neuron circuit; issues with neuron circuit)
      5.3 Variation Compensation (variation modeling; impact of VT variation; variation compensation techniques)
      5.4 Experimental Results (experimental setup; accuracy of the modified BNN algorithm; variation compensation; performance comparison)
      5.5 Summary
    6 Conclusion
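    The phase-weighted coding summarized above lends itself to a compact illustration. Below is a minimal Python sketch (our own illustration with hypothetical names, not the dissertation's code): an activation in [0, 1) is encoded over K phases, a spike in phase p carrying weight 2^-(p+1), so K phases give K-bit precision where rate coding would need on the order of 2^K time steps for the same resolution.

```python
import numpy as np

def encode_weighted_spikes(activation, num_phases=8):
    """Encode an activation in [0, 1) as one binary spike per phase.

    A spike in phase p carries weight 2^-(p+1), so the spike train is
    the fixed-point binary expansion of the activation (illustrative
    sketch; names and details are ours, not the thesis code).
    """
    spikes = np.zeros(num_phases, dtype=np.uint8)
    residual = activation
    for p in range(num_phases):
        weight = 2.0 ** -(p + 1)
        if residual >= weight:
            spikes[p] = 1
            residual -= weight
    return spikes

def decode_weighted_spikes(spikes):
    """Reconstruct the encoded value as the weighted sum of spikes."""
    weights = 2.0 ** -(np.arange(len(spikes)) + 1)
    return float(np.dot(spikes, weights))

s = encode_weighted_spikes(0.7)
print(s, decode_weighted_spikes(s))                    # [1 0 1 1 0 0 1 1] 0.69921875
print(int(s.sum()), "spikes vs ~", int(0.7 * 2**8),
      "expected spikes under 8-bit rate coding")       # 5 vs ~179
```

    The joint reduction in time steps and spike count is what the abstract credits for the latency and energy savings over rate coding.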
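    The bias adjustment behind VCAM can be sketched just as briefly. The toy model below uses a behavioral threshold-shift model of process variation and a simple proportional update rule, both our own simplifications rather than the thesis circuit: each neuron's bias is nudged until its average activation over a calibration batch matches that of an ideal, variation-free neuron.

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_neuron(preact, bias, vt_shift=0.0):
    """Behavioral model of a differential-amplifier BNN neuron: it fires
    (outputs 1) when the pre-activation plus bias clears a threshold,
    which process variation shifts by vt_shift."""
    return (preact + bias - vt_shift > 0.0).astype(np.float32)

def compensate_bias(preacts, bias, vt_shift, steps=200, lr=0.1):
    """Adjust the bias so the mean activation over a calibration batch
    matches the ideal (variation-free) neuron's mean activation."""
    target = binary_neuron(preacts, bias).mean()   # ideal average activation
    b = bias
    for _ in range(steps):
        current = binary_neuron(preacts, b, vt_shift).mean()
        b += lr * (target - current)               # proportional correction
    return b

preacts = rng.normal(0.0, 1.0, 10_000)             # calibration pre-activations
bias, vt_shift = 0.1, 0.35                         # 0.35 simulates Vt variation
b_new = compensate_bias(preacts, bias, vt_shift)
print("ideal mean:      ", binary_neuron(preacts, bias).mean())
print("drifted mean:    ", binary_neuron(preacts, bias, vt_shift).mean())
print("compensated mean:", binary_neuron(preacts, b_new, vt_shift).mean())
```

    Matching only the average activation, rather than every individual output, is what makes the compensation cheap enough to apply per neuron after fabrication.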

    Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs

    Using FPGAs to accelerate ConvNets has attracted significant attention in recent years. However, FPGA accelerator design has not leveraged the latest progress of ConvNets. As a result, key application characteristics such as frames per second (FPS) are ignored in favor of simply counting GOPs, and results on accuracy, which is critical to application success, are often not even reported. In this work, we adopt an algorithm-hardware co-design approach to develop a ConvNet accelerator called Synetgy and a novel ConvNet model called DiracDeltaNet. Both the accelerator and the ConvNet are tailored to FPGA requirements. DiracDeltaNet, as the name suggests, is a ConvNet with only 1×1 convolutions; spatial convolutions are replaced by more efficient shift operations. DiracDeltaNet achieves competitive accuracy on ImageNet (88.7% top-5) with 42× fewer parameters and 48× fewer OPs than VGG16. We further quantize DiracDeltaNet's weights and activations to 4 bits with less than 1% accuracy loss. These quantizations exploit the nature of FPGA hardware well. In short, DiracDeltaNet's small model size, low OP count, low precision and simplified operators allow us to co-design a highly customized computing unit for an FPGA. We implement the computing units for DiracDeltaNet on an Ultra96 SoC system through high-level synthesis. Our accelerator's final top-5 accuracy of 88.1% on ImageNet is higher than that of all previously reported embedded FPGA accelerators. In addition, the accelerator reaches an inference speed of 66.3 FPS on the ImageNet classification task, surpassing prior works with similar accuracy by at least 11.6×.
    Comment: Updated to the latest results.
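    The shift-plus-pointwise substitution at the heart of DiracDeltaNet is easy to picture in code. The NumPy sketch below is our own illustration (not the released Synetgy implementation): each of nine channel groups is shifted one pixel in a fixed direction, a 1×1 convolution then mixes channels, and a uniform quantizer stands in for the 4-bit weights and activations mentioned above.

```python
import numpy as np

# Nine shift directions: the eight neighbours plus the identity.
DIRECTIONS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def shift(x):
    """Parameter-free shift op: split channels into 9 groups and move each
    group one pixel in a fixed direction (edges wrap here for brevity;
    zero padding is the more common choice)."""
    out = np.empty_like(x)
    groups = np.array_split(np.arange(x.shape[0]), len(DIRECTIONS))
    for (dy, dx), idx in zip(DIRECTIONS, groups):
        out[idx] = np.roll(x[idx], shift=(dy, dx), axis=(1, 2))
    return out

def conv1x1(x, weights):
    """1x1 convolution is per-pixel channel mixing: (c_out, c_in) weights
    applied at every spatial location of a (c_in, h, w) feature map."""
    return np.einsum('oc,chw->ohw', weights, x)

def quantize(x, bits=4):
    """Uniform symmetric quantizer sketch (assumes a nonzero tensor)."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

x = np.random.randn(18, 8, 8)                  # (channels, height, width)
w = quantize(np.random.randn(32, 18))          # 4-bit 1x1 conv weights
y = conv1x1(quantize(shift(x)), w)             # shift + 1x1 replaces a 3x3 conv
print(y.shape)                                 # (32, 8, 8)
```

    A 3×3 convolution costs c_in·c_out·9 multiplies per pixel, while the shift is free and the 1×1 convolution costs c_in·c_out, which is one source of the OP reduction the abstract reports.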