
    An Adaptive Network Compression Technique Based on Layer Sensitivity for Deep Neural Network FPGA Accelerators

    Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2020. 8. Bernhard Egger.
    Deep neural network (DNN) accelerators based on systolic arrays have been shown to achieve high throughput at low energy consumption. The regular architecture of the systolic array, however, makes it difficult to effectively apply network pruning and compression, two important optimization techniques that can significantly reduce the computational complexity and the storage requirements of a network. This work presents AIX, an FPGA-based high-speed accelerator for DNN inference, and explores effective methods for pruning systolic arrays. The techniques consider the execution model of the AIX and prune the individual convolutional layers of a network in fixed-size blocks, which not only reduces the number of weights in the network but also translates directly into a reduction of the execution time of a convolutional neural network (CNN) on the AIX. Applied to representative CNNs such as YOLOv1, YOLOv2, and Tiny-YOLOv2, the presented techniques achieve state-of-the-art compression ratios and are able to reduce inference latency by a factor of two at a minimal loss of accuracy.
    Contents: Chapter 1 Introduction and Motivation; Chapter 2 Background (Object Detection: mean Average Precision (mAP), YOLOv2; AIX Accelerator: Overview of AIX Architecture, Dataflow of AIX Architecture); Chapter 3 Implementation of Pruning on AIX Accelerator (Convolutional Neural Network (CNN); Granularity of Sparsity for Pruning CNNs; Network Compression for Channel Pruning; CNN Pruning on AIX Accelerator: Block Granularity for Pruning, Network Compression for Block Pruning); Chapter 4 Adaptive Layer Sensitivity Pruning (Overview; Layer Sensitivity Graph; Concept of the Adaptive Layer Sensitivity Pruning Algorithm; Discussion of the Adaptive Layer Sensitivity Pruning Algorithm; Compression for YOLOv2 Multi-Branches; Fine-Tuning); Chapter 5 Experimental Setup; Chapter 6 Experimental Results (Overall Results; Effect of Adaptive Layer Sensitivity Pruning; Comparison of Adaptive vs. Static Layer Sensitivity Pruning); Chapter 7 Related Work; Chapter 8 Conclusion and Future Work; Bibliography.
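
    To make the block-pruning idea concrete, the following minimal sketch zeroes fixed-size blocks of output channels of a convolutional weight tensor according to their L2 norm. The function name, the 16-channel block size, and the 50% sparsity target are illustrative assumptions, not the actual AIX pruning implementation described in the thesis.

        import numpy as np

        def prune_conv_layer_in_blocks(weights, block_size=16, sparsity=0.5):
            """Zero out the fixed-size blocks of output channels with the smallest L2 norm.

            weights:    conv kernel of shape (out_channels, in_channels, kH, kW)
            block_size: number of consecutive output channels pruned as one unit
                        (illustrative stand-in for the accelerator's block granularity)
            sparsity:   fraction of blocks to remove
            """
            out_ch = weights.shape[0]
            n_blocks = out_ch // block_size
            # L2 norm of each block of output channels.
            block_norms = np.array([
                np.linalg.norm(weights[b * block_size:(b + 1) * block_size])
                for b in range(n_blocks)
            ])
            # Blocks with the smallest norms are assumed to be the least important.
            n_prune = int(n_blocks * sparsity)
            prune_idx = np.argsort(block_norms)[:n_prune]
            pruned = weights.copy()
            for b in prune_idx:
                pruned[b * block_size:(b + 1) * block_size] = 0.0
            return pruned

        # Example: prune half of the 16-channel blocks of a 128-channel 3x3 layer.
        w = np.random.randn(128, 64, 3, 3).astype(np.float32)
        w_pruned = prune_conv_layer_in_blocks(w, block_size=16, sparsity=0.5)
        print("zeroed fraction:", float(np.mean(w_pruned == 0.0)))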

    Data Visualization for Benchmarking Neural Networks in Different Hardware Platforms

    The computational complexity of Convolutional Neural Networks has increased enormously; hence numerous algorithmic optimization techniques have been proposed. In such a complex design space, however, it is challenging to choose which optimization benefits which type of hardware platform. QuTiBench, a recently proposed benchmarking methodology, provides clarity into this design space. With measurements amounting to more than nine thousand data points, it became difficult to extract useful and rich information quickly and intuitively from the collected data. This work therefore describes the creation of a web portal where all of the data is exposed and can be adequately visualized. All code developed in this project resides in a public GitHub repository and is open to contributions. Visualizations that grab the reader's interest and keep the focus on the message are an effective way to understand the data and spot trends; hence several types of plots were used: rooflines, heatmaps, line plots, bar plots, and box-and-whisker plots. Furthermore, since level 0 of QuTiBench performs a theoretical analysis that requires no measurements, its performance predictions were evaluated. We concluded that the predictions capture performance trends successfully, although they are somewhat optimistic and become inaccurate as pruning and quantization increase. The theoretical analysis could be improved by better accounting for which data is stored in on-chip versus off-chip memory. Moreover, for FPGAs, performance predictions can be further enhanced by taking the actual resource utilization and the achieved clock frequency of the FPGA circuit into account. With these improvements to level 0, the benchmarking methodology can become more accurate, reliable, and useful to designers. In addition, further measurements were taken: power, performance, and accuracy were measured for Google's USB Accelerator running EfficientNet S, EfficientNet M, and EfficientNet L. In general, the performance measurements were reproduced; however, it was not possible to reproduce the accuracy measurements.
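
    As an illustration of one of the plot types mentioned, the sketch below draws a simple roofline with matplotlib; the peak-performance and bandwidth limits and the two plotted workloads are made-up values, not QuTiBench measurements.

        import numpy as np
        import matplotlib.pyplot as plt

        # Illustrative platform limits (not QuTiBench data).
        peak_perf = 4000.0   # GOPS
        peak_bw = 25.0       # GB/s

        oi = np.logspace(-1, 3, 200)                 # operational intensity (OPs/byte)
        roof = np.minimum(peak_perf, peak_bw * oi)   # roofline: min(compute roof, bandwidth * OI)

        plt.figure()
        plt.loglog(oi, roof, label="roofline")
        # Hypothetical measured workloads plotted against the roof.
        plt.scatter([8, 120], [150, 2600], label="measured networks")
        plt.xlabel("Operational intensity (OPs/byte)")
        plt.ylabel("Performance (GOPS)")
        plt.legend()
        plt.savefig("roofline.png")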

    TinyVers: A Tiny Versatile System-on-chip with State-Retentive eMRAM for ML Inference at the Extreme Edge

    Extreme edge devices or Internet-of-Things nodes require both ultra-low-power always-on processing and the ability to do on-demand sampling and processing. Moreover, support for IoT applications like voice recognition, machine monitoring, etc., requires the ability to execute a wide range of ML workloads. This poses a hardware-design challenge: building flexible processors that operate in the ultra-low-power regime. This paper presents TinyVers, a tiny versatile ultra-low-power ML system-on-chip that enables enhanced intelligence at the extreme edge. TinyVers exploits dataflow reconfiguration to enable multi-modal support, and aggressive on-chip power management for duty cycling to enable smart sensing applications. The SoC combines a RISC-V host processor, a 17 TOPS/W dataflow-reconfigurable ML accelerator, a 1.7 µW deep-sleep wake-up controller, and an eMRAM for boot code and ML parameter retention. The SoC can perform up to 17.6 GOPS while covering a power consumption range from 1.7 µW to 20 mW. Multiple ML workloads aimed at diverse applications are mapped onto the SoC to showcase its flexibility and efficiency. All the models achieve 1-2 TOPS/W of energy efficiency with power consumption below 230 µW in continuous operation. In a duty-cycling use case for machine monitoring, this power is reduced to below 10 µW. Comment: Accepted in IEEE Journal of Solid-State Circuits.
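
    The duty-cycling figure follows from a simple weighted average of active and deep-sleep power. The sketch below reproduces that back-of-the-envelope estimate; the 3% duty cycle is an assumed value chosen only so that the result is consistent with the power figures quoted in the abstract.

        def average_power_uw(active_uw, sleep_uw, duty_cycle):
            """Average power of a duty-cycled workload.

            active_uw:  power while the accelerator is running (µW)
            sleep_uw:   deep-sleep power, dominated by the wake-up controller (µW)
            duty_cycle: fraction of time spent active (0..1)
            """
            return duty_cycle * active_uw + (1.0 - duty_cycle) * sleep_uw

        # Assumed example: 230 µW continuous inference, 1.7 µW deep sleep,
        # active 3% of the time for periodic machine monitoring.
        print(average_power_uw(230.0, 1.7, 0.03))   # ~8.5 µW, i.e. below 10 µW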

    HMC-Based Accelerator Design For Compressed Deep Neural Networks

    Deep Neural Networks (DNNs) offer remarkable classification and regression performance on many high-dimensional problems and have been widely utilized in real-world cognitive applications. In such applications, the high computational cost of DNNs greatly hinders their deployment in resource-constrained settings, real-time systems, and edge computing platforms. Moreover, the energy and performance cost of moving data between the memory hierarchy and the computational units is higher than that of the computation itself. To overcome the memory bottleneck, accelerator designs improve data locality and temporal data reuse. In an attempt to further improve data locality, memory manufacturers have introduced 3D-stacked memory, in which multiple layers of memory arrays are stacked on top of each other. Inheriting the concept of Processing-In-Memory (PIM), some 3D-stacked memory architectures also include a logic layer that integrates general-purpose computational logic directly within main memory to take advantage of the high internal bandwidth during computation. In this dissertation, we investigate hardware/software co-design for a neural network accelerator. Specifically, we introduce a two-phase filter pruning framework for model compression and an accelerator tailored for efficient DNN execution on the Hybrid Memory Cube (HMC), which can dynamically offload primitives and functions to the PIM logic layer through a latency-aware scheduling controller. In our compression framework, we formulate the filter pruning process as an optimization problem and propose a filter selection criterion measured by conditional entropy. The key idea of our proposed approach is to establish a quantitative connection between filters and model accuracy. We define the connection as conditional entropy over the filters in a convolutional layer, i.e., the distribution of entropy conditioned on the network loss. Based on this definition, the pruning efficiencies of global and layer-wise pruning strategies are compared, and a two-phase pruning method is proposed. The proposed pruning method removes 88% of the filters and reduces inference time by 46% on VGG16 within a 2% accuracy degradation.
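
    The abstract does not spell out how the conditional-entropy criterion is evaluated; the sketch below is one plausible reading, ranking filters by the entropy of their per-sample activation statistics conditioned on the network loss. All function names and the binning scheme are assumptions made for illustration, not the dissertation's implementation.

        import numpy as np

        def conditional_entropy(x, y, bins=8):
            """H(X | Y) estimated from discretized samples x and y."""
            x_d = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
            y_d = np.digitize(y, np.histogram_bin_edges(y, bins=bins))
            joint = np.zeros((bins + 2, bins + 2))
            for xi, yi in zip(x_d, y_d):
                joint[xi, yi] += 1
            joint /= joint.sum()
            p_y = joint.sum(axis=0)
            h_xy = -np.sum(joint[joint > 0] * np.log2(joint[joint > 0]))
            h_y = -np.sum(p_y[p_y > 0] * np.log2(p_y[p_y > 0]))
            return h_xy - h_y   # H(X, Y) - H(Y)

        def rank_filters(filter_activations, losses):
            """filter_activations: (n_samples, n_filters) mean activation per filter.
            losses: (n_samples,) per-sample network loss.
            Returns filter indices sorted by H(activation | loss); whether to prune the
            lowest- or highest-entropy filters is a design choice and ascending order is
            used here purely for illustration."""
            scores = [conditional_entropy(filter_activations[:, f], losses)
                      for f in range(filter_activations.shape[1])]
            return np.argsort(scores)

        # Hypothetical usage with random data standing in for real activations/losses.
        acts = np.random.rand(256, 32)   # 256 samples, 32 filters
        losses = np.random.rand(256)
        print(rank_filters(acts, losses)[:5])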