
    An Adaptive Network Compression Technique Based on Layer Sensitivity for Deep Neural Network FPGA Accelerators

    Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2020. 8. Bernhard Egger.
    Deep neural network (DNN) accelerators based on systolic arrays have been shown to achieve high throughput at low energy consumption. The regular architecture of the systolic array, however, makes it difficult to effectively apply network pruning and compression, two important optimization techniques that can significantly reduce the computational complexity and the storage requirements of a network. This work presents AIX, an FPGA-based high-speed accelerator for DNN inference, and explores effective methods for pruning systolic arrays. The techniques consider the execution model of the AIX and prune the individual convolutional layers of a network in fixed-size blocks, which not only reduces the number of weights in the network but also translates directly into a reduction of the execution time of a convolutional neural network (CNN) on the AIX. Applied to representative CNNs such as YOLOv1, YOLOv2, and Tiny-YOLOv2, the presented techniques achieve state-of-the-art compression ratios and are able to reduce inference latency by a factor of two at a minimal loss of accuracy.
    Contents: Chapter 1 Introduction and Motivation; Chapter 2 Background (Object Detection: mean Average Precision (mAP), YOLOv2; AIX Accelerator: Overview of AIX Architecture, Dataflow of AIX Architecture); Chapter 3 Implementation of Pruning on AIX Accelerator (Convolutional Neural Network (CNN); Granularity of Sparsity for Pruning CNNs; Network Compression for Channel Pruning; CNN Pruning on AIX Accelerator: Block Granularity for Pruning, Network Compression for Block Pruning); Chapter 4 Adaptive Layer Sensitivity Pruning (Overview; Layer Sensitivity Graph; Concept of the Adaptive Layer Sensitivity Pruning Algorithm; Discussion of the Adaptive Layer Sensitivity Pruning Algorithm; Compression for YOLOv2 Multi-Branches; Fine-Tuning); Chapter 5 Experimental Setup; Chapter 6 Experimental Results (Overall Results; Effect of Adaptive Layer Sensitivity Pruning; Comparison of Adaptive vs. Static Layer Sensitivity Pruning); Chapter 7 Related Work; Chapter 8 Conclusion and Future Work; Bibliography.
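
    To make the block-pruning idea concrete, the following minimal sketch zeroes fixed-size blocks of output channels of a convolutional weight tensor according to their L2 norm. The function name, the 16-channel block size, and the 50% sparsity target are illustrative assumptions, not the actual AIX pruning implementation described in the thesis.

        import numpy as np

        def prune_conv_layer_in_blocks(weights, block_size=16, sparsity=0.5):
            """Zero out the fixed-size blocks of output channels with the smallest L2 norm.

            weights:    conv kernel of shape (out_channels, in_channels, kH, kW)
            block_size: number of consecutive output channels pruned as one unit
                        (illustrative stand-in for the accelerator's block granularity)
            sparsity:   fraction of blocks to remove
            """
            out_ch = weights.shape[0]
            n_blocks = out_ch // block_size
            # L2 norm of each block of output channels.
            block_norms = np.array([
                np.linalg.norm(weights[b * block_size:(b + 1) * block_size])
                for b in range(n_blocks)
            ])
            # Blocks with the smallest norms are assumed to be the least important.
            n_prune = int(n_blocks * sparsity)
            prune_idx = np.argsort(block_norms)[:n_prune]
            pruned = weights.copy()
            for b in prune_idx:
                pruned[b * block_size:(b + 1) * block_size] = 0.0
            return pruned

        # Example: prune half of the 16-channel blocks of a 128-channel 3x3 layer.
        w = np.random.randn(128, 64, 3, 3).astype(np.float32)
        w_pruned = prune_conv_layer_in_blocks(w, block_size=16, sparsity=0.5)
        print("zeroed fraction:", float(np.mean(w_pruned == 0.0)))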

    Data Visualization for Benchmarking Neural Networks in Different Hardware Platforms

    The computational complexity of Convolutional Neural Networks has increased enormously; hence numerous algorithmic optimization techniques have been proposed. In such a complex design space, however, it is challenging to choose which optimization benefits which type of hardware platform. QuTiBench, a recently proposed benchmarking methodology, provides clarity into this design space. With measurements amounting to more than nine thousand data points, it became difficult to extract useful and rich information quickly and intuitively from the collected data. This work therefore describes the creation of a web portal where all of the data is exposed and can be adequately visualized. All code developed in this project resides in a public GitHub repository and is open to contributions. Visualizations that grab the reader's interest and keep the focus on the message are an effective way to understand the data and spot trends; hence several types of plots were used: rooflines, heatmaps, line plots, bar plots, and box-and-whisker plots. Furthermore, since level 0 of QuTiBench performs a theoretical analysis that requires no measurements, its performance predictions were evaluated. We concluded that the predictions capture performance trends successfully, although they are somewhat optimistic and become inaccurate as pruning and quantization increase. The theoretical analysis could be improved by better accounting for which data is stored in on-chip versus off-chip memory. Moreover, for FPGAs, performance predictions can be further enhanced by taking the actual resource utilization and the achieved clock frequency of the FPGA circuit into account. With these improvements to level 0, the benchmarking methodology can become more accurate, reliable, and useful to designers. In addition, further measurements were taken: power, performance, and accuracy were measured for Google's USB Accelerator running EfficientNet S, EfficientNet M, and EfficientNet L. In general, the performance measurements were reproduced; however, it was not possible to reproduce the accuracy measurements.
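
    As an illustration of one of the plot types mentioned, the sketch below draws a simple roofline with matplotlib; the peak-performance and bandwidth limits and the two plotted workloads are made-up values, not QuTiBench measurements.

        import numpy as np
        import matplotlib.pyplot as plt

        # Illustrative platform limits (not QuTiBench data).
        peak_perf = 4000.0   # GOPS
        peak_bw = 25.0       # GB/s

        oi = np.logspace(-1, 3, 200)                 # operational intensity (OPs/byte)
        roof = np.minimum(peak_perf, peak_bw * oi)   # roofline: min(compute roof, bandwidth * OI)

        plt.figure()
        plt.loglog(oi, roof, label="roofline")
        # Hypothetical measured workloads plotted against the roof.
        plt.scatter([8, 120], [150, 2600], label="measured networks")
        plt.xlabel("Operational intensity (OPs/byte)")
        plt.ylabel("Performance (GOPS)")
        plt.legend()
        plt.savefig("roofline.png")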

    TinyVers: A Tiny Versatile System-on-chip with State-Retentive eMRAM for ML Inference at the Extreme Edge

    Extreme edge devices or Internet-of-Things nodes require both ultra-low-power always-on processing and the ability to do on-demand sampling and processing. Moreover, support for IoT applications like voice recognition, machine monitoring, etc., requires the ability to execute a wide range of ML workloads. This poses a hardware-design challenge: building flexible processors that operate in the ultra-low-power regime. This paper presents TinyVers, a tiny versatile ultra-low-power ML system-on-chip that enables enhanced intelligence at the extreme edge. TinyVers exploits dataflow reconfiguration to enable multi-modal support, and aggressive on-chip power management for duty cycling to enable smart sensing applications. The SoC combines a RISC-V host processor, a 17 TOPS/W dataflow-reconfigurable ML accelerator, a 1.7 µW deep-sleep wake-up controller, and an eMRAM for boot code and ML parameter retention. The SoC can perform up to 17.6 GOPS while covering a power consumption range from 1.7 µW to 20 mW. Multiple ML workloads aimed at diverse applications are mapped onto the SoC to showcase its flexibility and efficiency. All the models achieve 1-2 TOPS/W of energy efficiency with power consumption below 230 µW in continuous operation. In a duty-cycling use case for machine monitoring, this power is reduced to below 10 µW. Comment: Accepted in IEEE Journal of Solid-State Circuits.
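
    The duty-cycling figure follows from a simple weighted average of active and deep-sleep power. The sketch below reproduces that back-of-the-envelope estimate; the 3% duty cycle is an assumed value chosen only so that the result is consistent with the power figures quoted in the abstract.

        def average_power_uw(active_uw, sleep_uw, duty_cycle):
            """Average power of a duty-cycled workload.

            active_uw:  power while the accelerator is running (µW)
            sleep_uw:   deep-sleep power, dominated by the wake-up controller (µW)
            duty_cycle: fraction of time spent active (0..1)
            """
            return duty_cycle * active_uw + (1.0 - duty_cycle) * sleep_uw

        # Assumed example: 230 µW continuous inference, 1.7 µW deep sleep,
        # active 3% of the time for periodic machine monitoring.
        print(average_power_uw(230.0, 1.7, 0.03))   # ~8.5 µW, i.e. below 10 µW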

    HMC-Based Accelerator Design For Compressed Deep Neural Networks

    Deep Neural Networks (DNNs) offer remarkable classification and regression performance on many high-dimensional problems and have been widely utilized in real-world cognitive applications. In such applications, the high computational cost of DNNs greatly hinders their deployment in resource-constrained settings, real-time systems, and edge computing platforms. Moreover, the energy and performance cost of moving data between the memory hierarchy and the computational units is higher than that of the computation itself. To overcome the memory bottleneck, accelerator designs improve data locality and temporal data reuse. In an attempt to further improve data locality, memory manufacturers have introduced 3D-stacked memory, in which multiple layers of memory arrays are stacked on top of each other. Inheriting the concept of Processing-In-Memory (PIM), some 3D-stacked memory architectures also include a logic layer that integrates general-purpose computational logic directly within main memory to take advantage of the high internal bandwidth during computation. In this dissertation, we investigate hardware/software co-design for a neural network accelerator. Specifically, we introduce a two-phase filter pruning framework for model compression and an accelerator tailored for efficient DNN execution on the Hybrid Memory Cube (HMC), which can dynamically offload primitives and functions to the PIM logic layer through a latency-aware scheduling controller. In our compression framework, we formulate the filter pruning process as an optimization problem and propose a filter selection criterion measured by conditional entropy. The key idea of our proposed approach is to establish a quantitative connection between filters and model accuracy. We define the connection as conditional entropy over the filters in a convolutional layer, i.e., the distribution of entropy conditioned on the network loss. Based on this definition, the pruning efficiencies of global and layer-wise pruning strategies are compared, and a two-phase pruning method is proposed. The proposed pruning method removes 88% of the filters and reduces inference time by 46% on VGG16 within a 2% accuracy degradation.
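
    The abstract does not spell out how the conditional-entropy criterion is evaluated; the sketch below is one plausible reading, ranking filters by the entropy of their per-sample activation statistics conditioned on the network loss. All function names and the binning scheme are assumptions made for illustration, not the dissertation's implementation.

        import numpy as np

        def conditional_entropy(x, y, bins=8):
            """H(X | Y) estimated from discretized samples x and y."""
            x_d = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
            y_d = np.digitize(y, np.histogram_bin_edges(y, bins=bins))
            joint = np.zeros((bins + 2, bins + 2))
            for xi, yi in zip(x_d, y_d):
                joint[xi, yi] += 1
            joint /= joint.sum()
            p_y = joint.sum(axis=0)
            h_xy = -np.sum(joint[joint > 0] * np.log2(joint[joint > 0]))
            h_y = -np.sum(p_y[p_y > 0] * np.log2(p_y[p_y > 0]))
            return h_xy - h_y   # H(X, Y) - H(Y)

        def rank_filters(filter_activations, losses):
            """filter_activations: (n_samples, n_filters) mean activation per filter.
            losses: (n_samples,) per-sample network loss.
            Returns filter indices sorted by H(activation | loss); whether to prune the
            lowest- or highest-entropy filters is a design choice and ascending order is
            used here purely for illustration."""
            scores = [conditional_entropy(filter_activations[:, f], losses)
                      for f in range(filter_activations.shape[1])]
            return np.argsort(scores)

        # Hypothetical usage with random data standing in for real activations/losses.
        acts = np.random.rand(256, 32)   # 256 samples, 32 filters
        losses = np.random.rand(256)
        print(rank_filters(acts, losses)[:5])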