Automatic Nested Loop Acceleration on FPGAs Using Soft CGRA Overlay
Session 1: HLS Tooling
A Method for On-Device Personalization of Deep Neural Networks
Thesis (M.S.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2019. 2. Egger, Bernhard. There exist several deep neural network (DNN) architectures suitable for embedded inference; however, little work has focused on training neural networks on-device.
User customization of DNNs is desirable because collecting a training set representative of real-world scenarios is difficult. Additionally, inter-user variation limits the accuracy a general model can achieve.
In this thesis, a DNN architecture that allows low-power, on-device user customization is proposed. The approach is applied to handwritten character recognition of both the Latin and Korean alphabets. Experiments show a 3.5-fold reduction in prediction error after user customization for both alphabets, compared to a DNN trained on general data.
This architecture is additionally evaluated on a number of embedded processors, demonstrating its practical applicability.
Abstract i
Contents iii
List of Figures vii
List of Tables ix
Chapter 1 Introduction 1
Chapter 2 Motivation 4
Chapter 3 Background 6
3.1 Deep Neural Networks 6
3.1.1 Inference 6
3.1.2 Training 7
3.2 Convolutional Neural Networks 8
3.3 On-Device Acceleration 9
3.3.1 Hardware Accelerators 9
3.3.2 Software Optimization 10
Chapter 4 Methodology 12
4.1 Initialization 13
4.2 On-Device Training 14
Chapter 5 Implementation 16
5.1 Pre-processing 16
5.2 Latin Handwritten Character Recognition 17
5.2.1 Dataset and BIE Selection 17
5.2.2 AE Design 17
5.3 Korean Handwritten Character Recognition 21
5.3.1 Dataset and BIE Selection 21
5.3.2 AE Design 21
Chapter 6 On-Device Acceleration 26
6.1 Architecture Optimizations 27
6.2 Compiler Optimizations 29
Chapter 7 Experimental Setup 30
Chapter 8 Evaluation 33
8.1 Latin Handwritten Character Recognition 33
8.2 Korean Handwritten Character Recognition 38
8.3 On-Device Acceleration 40
Chapter 9 Related Work 44
Chapter 10 Conclusion 47
Bibliography 47
Abstract (Korean) 55
Acknowledgements 56
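The customization scheme the abstract describes can be sketched as fine-tuning a small per-user part on top of a frozen, generally-trained model. The sketch below is an illustrative assumption, not the thesis's actual architecture: a fixed random projection stands in for the pre-trained DNN body, and a plain softmax head stands in for the user-specific part trained on-device.

```python
import numpy as np

# Hypothetical sketch of on-device user customization: only the small
# per-user head is trained; the generally-trained body stays frozen.
# All names, sizes, and the softmax head are illustrative assumptions.

rng = np.random.default_rng(0)
D_IN, D_FEAT, N_CLASSES = 64, 32, 4

# "General" part: fixed after factory training; never updated on-device.
W_frozen = rng.standard_normal((D_IN, D_FEAT)) / np.sqrt(D_IN)

def features(x):
    """Frozen feature extractor (stand-in for the pre-trained DNN body)."""
    return np.maximum(x @ W_frozen, 0.0)  # ReLU

# Per-user head: the only parameters updated during customization.
W_head = np.zeros((D_FEAT, N_CLASSES))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Synthetic "user handwriting": noise plus a class-dependent shift.
X = rng.standard_normal((200, D_IN))
y = rng.integers(0, N_CLASSES, size=200)
X += 6.0 * np.eye(D_IN)[y * (D_IN // N_CLASSES)]

def accuracy():
    p = softmax(features(X) @ W_head)
    return float((p.argmax(axis=1) == y).mean())

before = accuracy()
for _ in range(500):                 # cheap full-batch SGD: feasible on-device
    f = features(X)
    p = softmax(f @ W_head)
    grad = f.T @ (p - np.eye(N_CLASSES)[y]) / len(X)
    W_head -= 0.1 * grad
after = accuracy()
print(before, after)  # customization should raise accuracy well above chance
```

Because gradients are computed only for the small head, the memory and compute footprint of training stays far below that of full backpropagation, which is what makes the low-power on-device setting plausible.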
A Micro Power Hardware Fabric for Embedded Computing
Field Programmable Gate Arrays (FPGAs) mitigate many of the problems encountered in ASIC development by offering flexibility, faster time-to-market, and amortized NRE costs, among other benefits. While FPGAs are increasingly used for complex computational applications such as signal and image processing, networking, and cryptology, they are far from ideal for these tasks due to their relatively high power consumption and silicon-area overheads compared to direct ASIC implementations. A reconfigurable device that exhibits ASIC-like power characteristics with FPGA-like costs and tool support is desirable to fill this void. In this research, a parameterized, reconfigurable fabric model called the domain-specific fabric (DSF) is developed that exhibits ASIC-like power characteristics for digital signal processing (DSP) style applications. Using this model, the impact of varying different design parameters on power and performance is studied. Optimization techniques such as local search and simulated annealing are used to determine an appropriate interconnect for a specific set of applications, and a design-space exploration tool has been developed to automatically generate a tailored architectural instance of the fabric. The fabric has been synthesized using a 160 nm cell-based ASIC process from OKI and a 130 nm process from IBM. A detailed power-performance analysis has been completed using signal and image processing benchmarks from the MediaBench benchmark suite and elsewhere, with comparisons to other hardware and software implementations. The optimized fabric implemented in the 130 nm process yields energy within 3X of a direct ASIC implementation, 330X better than a Virtex-II Pro FPGA, and 2016X better than an Intel XScale processor.
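The kind of interconnect search the abstract mentions can be sketched with a toy simulated-annealing loop: pick a subset of candidate links that covers the connections a benchmark set needs while minimizing a power cost. The cost model, candidate set, and penalty weight below are illustrative assumptions, not the actual DSF design-space exploration tool.

```python
import math
import random

# Toy simulated annealing for interconnect selection: minimize link power
# while keeping every connection the "applications" require routable.
# All numbers here are made up for illustration.

random.seed(1)

CANDIDATE_LINKS = [(i, j) for i in range(6) for j in range(6) if i < j]
LINK_POWER = {l: 1.0 + random.random() for l in CANDIDATE_LINKS}

# Connections that the benchmark applications require to be present.
REQUIRED = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5)}

def cost(selected):
    power = sum(LINK_POWER[l] for l in selected)
    missing = len(REQUIRED - selected)        # unroutable connections
    return power + 100.0 * missing            # heavy penalty for missing links

def anneal(steps=5000, t0=5.0):
    state = set(CANDIDATE_LINKS)              # start from a full crossbar
    best, best_c = set(state), cost(state)
    for k in range(steps):
        t = t0 * (1.0 - k / steps) + 1e-3     # linear cooling schedule
        nbr = set(state)
        flip = random.choice(CANDIDATE_LINKS) # toggle one link in or out
        nbr.symmetric_difference_update({flip})
        d = cost(nbr) - cost(state)
        if d < 0 or random.random() < math.exp(-d / t):
            state = nbr
            if cost(state) < best_c:
                best, best_c = set(state), cost(state)
    return best, best_c

links, c = anneal()
print(len(links), round(c, 2))
```

Because removing an unneeded link always lowers the cost while dropping a required one incurs the large penalty, the search settles on a sparse interconnect that still covers the required connections, which is the essence of tailoring the fabric to an application set.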
LTE implementation on CGRA based SiLago Platform
Abstract. This thesis implements the long term evolution (LTE) transmission layer on a coarse-grained reconfigurable architecture called the dynamically reconfigurable resource array (DRRA). Specifically, we implement the physical downlink shared channel (PDSCH) baseband signal-processing blocks at a high level. The overall implementation follows the silicon large grain object (SiLago) design methodology, which employs SiLago blocks instead of mainstream standard cells. The main ambition of this thesis was to prove that a standard as complex as LTE can be implemented using the in-house SiLago framework. The work aims to show that a customized design with efficiency close to an application-specific integrated circuit (ASIC) can be generated for LTE with the programming ease of MATLAB. During this thesis, we have produced a completely parameterizable high-level implementation of the LTE transmission chain.
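To make concrete what "PDSCH baseband signal-processing blocks" involves, here is a minimal sketch of two such blocks, QPSK symbol mapping and OFDM modulation via an IFFT, in the spirit of the MATLAB-level modelling the abstract describes. The toy FFT size, the omitted blocks (scrambling, layer mapping, precoding), and all function names are simplifying assumptions, not the SiLago/DRRA implementation.

```python
import numpy as np

# Two illustrative PDSCH-style baseband blocks: QPSK mapping and OFDM
# modulation with a cyclic prefix.  Toy numerology, not an LTE-standard size.
N_FFT, CP = 64, 16

def qpsk_mod(bits):
    """Map bit pairs to QPSK symbols with unit average power."""
    b = bits.reshape(-1, 2)
    return ((1 - 2 * b[:, 0]) + 1j * (1 - 2 * b[:, 1])) / np.sqrt(2)

def ofdm_mod(symbols):
    """One OFDM symbol: IFFT over subcarriers, then prepend a cyclic prefix."""
    x = np.fft.ifft(symbols, n=N_FFT) * np.sqrt(N_FFT)
    return np.concatenate([x[-CP:], x])

def ofdm_demod(samples):
    """Strip the cyclic prefix and return to the frequency domain."""
    return np.fft.fft(samples[CP:], n=N_FFT) / np.sqrt(N_FFT)

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=2 * N_FFT)
tx = ofdm_mod(qpsk_mod(bits))
rx = ofdm_demod(tx)                          # ideal channel: exact recovery
recovered = np.stack([(rx.real < 0).astype(int),
                      (rx.imag < 0).astype(int)], axis=1).ravel()
print(np.array_equal(recovered, bits))
```

Each block here is a pure dataflow stage with fixed-size inputs and outputs, which hints at why such a chain maps naturally onto a coarse-grained reconfigurable array of the DRRA kind.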
- β¦