Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.
Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201
ESPecIaL an Embedded Systems Programming Language
Embedded systems, now available at very low cost, are increasingly present in fields such as industry, automotive and education. This master's thesis presents a prototype implementation of an embedded systems programming language. The report focuses on a high-level language, based on the dataflow paradigm, that was developed specifically for building embedded applications. Using ready-to-use blocks, the user describes the block diagram of the application, and the corresponding C++ code is generated automatically for a specific target embedded system. This prototype Domain Specific Language (DSL), implemented in the Scala programming language, makes embedded applications easy to build, and writing low-level C/C++ code is no longer necessary. Real-world applications based on the developed Embedded Systems Programming Language are presented at the end of the document.
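The block-diagram-to-C++ idea described above can be illustrated with a minimal sketch. This is not the ESPecIaL implementation (which is written in Scala): the `Block` class, the `generate_cpp` function, and the `readPin`/`writePin` C++ helpers are all hypothetical names chosen for illustration. The sketch only shows the general pattern of ordering dataflow blocks topologically and emitting one C++ statement per block.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """A hypothetical ready-to-use block described by a C++ expression template."""
    name: str
    template: str                 # C++ expression; "{0}", "{1}" stand for inputs
    inputs: list = field(default_factory=list)   # upstream Block objects
    is_sink: bool = False         # sinks produce no value (e.g. an output pin)


def topo_order(sinks):
    """Return the blocks so that every block appears after its inputs."""
    order, seen = [], set()

    def visit(block):
        if block.name in seen:
            return
        seen.add(block.name)
        for upstream in block.inputs:
            visit(upstream)
        order.append(block)

    for sink in sinks:
        visit(sink)
    return order


def generate_cpp(sinks):
    """Emit one C++ statement per block inside a loop() function."""
    lines = ["void loop() {"]
    for block in topo_order(sinks):
        expr = block.template.format(*(b.name for b in block.inputs))
        if block.is_sink:
            lines.append(f"    {expr};")
        else:
            lines.append(f"    auto {block.name} = {expr};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical diagram: button -> inverter -> LED (pin numbers are made up).
button = Block("button", "readPin(2)")
inverted = Block("inverted", "!{0}", [button])
led = Block("led", "writePin(13, {0})", [inverted], is_sink=True)
cpp_source = generate_cpp([led])
print(cpp_source)
```

The user only wires blocks together; the statement order in the generated code follows from the dataflow dependencies rather than from the order in which the blocks were declared.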
A Model-Based Code Generation Framework for Parallel and Distributed Embedded Systems
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2020. 2. Soonhoi Ha.
While various software development methodologies have been proposed to increase the design productivity and maintainability of software, they usually focus on the development of application software running on a single processing element and do not consider the non-functional requirements of an embedded system, such as latency and resource requirements.
In this thesis, we present a model-based software development method for parallel and distributed embedded systems. An application is specified hierarchically as a set of tasks that follow given rules for communication and synchronization, independently of the hardware platform. These rules enable static analysis that detects some software errors at compile time, which reduces the verification difficulty. A platform-specific program is synthesized automatically once the mapping of tasks onto processing elements is determined.
A program synthesizer is also proposed that generates code satisfying the platform requirements of parallel and distributed embedded systems. Because multiple models that express dynamic behavior can be composed hierarchically, the synthesizer manages multiple task graphs at different levels of the hierarchy so that tasks run in parallel. The synthesizer also provides methods for managing code for heterogeneous platforms and for generating various communication methods. The viability of the proposed software development method is verified with a real-life surveillance application that runs on six processing elements with three remote communication methods, and a remote deep learning example demonstrates the use of heterogeneous multiprocessing components on distributed systems. Measured and estimated development costs also show that supporting a new platform or network requires only a small effort.
Since tolerance to unexpected errors is a required feature of many embedded systems, we also support automatic fault-tolerant code generation. Fault tolerance is applied by modifying the task graph according to the selected fault tolerance configuration, so an application developer can easily adopt the non-functional requirement of fault tolerance. To compare the effort of supporting fault tolerance, a manual implementation of fault tolerance is also carried out. The fault tolerance method is tested with the fault injection tool, which emulates fault scenarios and injects faults randomly.
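As a rough illustration of applying fault tolerance by modifying the task graph, the sketch below replaces one task with several replicas followed by a voter. This is a generic, well-known transformation (N-modular redundancy) chosen as an assumption; the abstract does not say which fault tolerance techniques the thesis supports, and the function name, graph encoding, and task names are hypothetical.

```python
def replicate_with_voter(graph, task, copies=3):
    """Return a new task graph in which `task` is replaced by replicas and a voter.

    `graph` maps each task name to the list of its successor task names.
    The input graph is left unmodified.
    """
    replicas = [f"{task}_r{i}" for i in range(copies)]
    voter = f"{task}_voter"
    new_graph = {}
    for node, successors in graph.items():
        if node == task:
            continue
        # Redirect every edge that pointed at the original task to all replicas.
        rewired = []
        for succ in successors:
            rewired.extend(replicas if succ == task else [succ])
        new_graph[node] = rewired
    # Each replica feeds the voter, which compares the replica outputs.
    for replica in replicas:
        new_graph[replica] = [voter]
    # The voter inherits the outgoing edges of the original task.
    new_graph[voter] = list(graph[task])
    return new_graph


# Hypothetical three-task pipeline; only "filter" is hardened.
task_graph = {"sensor": ["filter"], "filter": ["actuator"], "actuator": []}
hardened = replicate_with_voter(task_graph, "filter")
print(hardened)
```

Because the transformation works purely on the graph, the task code itself stays unchanged, which is the property that lets a developer switch fault tolerance on through configuration rather than by rewriting the application.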
Our fault injection tool, which was used to test our fault tolerance method, is another contribution of this thesis. Emulating fault scenarios by intentionally injecting faults is commonly used to test and verify the robustness of a system. To emulate faults on an embedded system, we present a run-time fault injection framework that can inject faults into both the kernel and application layers of Linux-based systems. For the kernel layer, two complementary fault injection techniques are used: one is based on the Kernel GNU Debugger, and the other uses a hardware breakpoint supported by the ARM architecture. For application-level fault injection, the GDB-based fault injection method is used to inject a fault into a remote application. The viability of the proposed fault injection tool is demonstrated by real-life experiments on an ODROID-XU4 system.
Chapter 1 Introduction
1.1 Motivation
1.2 Contribution
1.3 Dissertation Organization
Chapter 2 Background
2.1 HOPES: Hope of Parallel Embedded Software
2.1.1 Software Development Procedure
2.1.2 Components of HOPES
2.2 Universal Execution Model
2.2.1 Task Graph Specification
2.2.2 Dataflow specification of an Application
2.2.3 Task Code Specification and Generic APIs
2.2.4 Meta-data Specification
Chapter 3 Program Synthesis for Parallel and Distributed Embedded Systems
3.1 Motivational Example
3.2 Program Synthesis Overview
3.3 Program Synthesis from Hierarchically-mixed Models
3.4 Platform Code Synthesis
3.5 Communication Code Synthesis
3.6 Experiments
3.6.1 Development Cost of Supporting New Platforms and Networks
3.6.2 Program Synthesis for the Surveillance System Example
3.6.3 Remote GPU-accelerated Deep Learning Example
3.7 Document Generation
3.8 Related Works
Chapter 4 Model Transformation for Fault-tolerant Code Synthesis
4.1 Fault-tolerant Code Synthesis Techniques
4.2 Applying Fault Tolerance Techniques in HOPES
4.3 Experiments
4.3.1 Development Cost of Applying Fault Tolerance
4.3.2 Fault Tolerance Experiments
4.4 Random Fault Injection Experiments
4.5 Related Works
Chapter 5 Fault Injection Framework for Linux-based Embedded Systems
5.1 Background
5.1.1 Fault Injection Techniques
5.1.2 Kernel GNU Debugger
5.1.3 ARM Hardware Breakpoint
5.2 Fault Injection Framework
5.2.1 Overview
5.2.2 Architecture
5.2.3 Fault Injection Techniques
5.2.4 Implementation
5.3 Experiments
5.3.1 Experiment Setup
5.3.2 Performance Comparison of Two Fault Injection Methods
5.3.3 Bit-flip Fault Experiments
5.3.4 eMMC Controller Fault Experiments
Chapter 6 Conclusion
Bibliography
Summary (Korean)
Docto
Methodology for complex dataflow application development
This thesis addresses problems inherent to the development of complex applications for reconfigurable systems. Many projects that rely on the traditional iterative software development processes typically used with conventional computers fail to complete or take much longer than originally estimated. Even though designer productivity can be increased by abstract programming and execution models, e.g., dataflow, development methodologies that consider the specific properties of reconfigurable systems do not exist.
The first contribution of this thesis is a design methodology to facilitate systematic development of complex applications using reconfigurable hardware in the context of High-Performance Computing (HPC). The proposed methodology is built upon a careful analysis of the original application, a software model of the intended hardware system, an analytical prediction of performance and on-chip area usage, and an iterative architectural refinement to resolve identified bottlenecks before writing a single line of code targeting the reconfigurable hardware. It is successfully validated using two real applications, both of which achieve state-of-the-art performance.
The second contribution extends this methodology to provide portability between devices in two steps. First, additional tool support for contemporary multi-die Field-Programmable Gate Arrays (FPGAs) is developed. An algorithm is proposed that automatically maps logical memories to heterogeneous physical memories with special attention to die boundaries. In the evaluation, only the proposed algorithm managed to place and route all designs successfully, while the second-best algorithm failed on one third of all large applications. Second, best practices for performance portability between different FPGA devices are collected and evaluated on a financial use case, showing efficient resource usage on five different platforms.
The third contribution applies the extended methodology to a real, highly demanding emerging application from the radiotherapy domain. A Monte-Carlo based simulation of dose accumulation in human tissue is accelerated using the proposed methodology to meet the real-time requirements of adaptive radiotherapy.
Open Acces
SdrLift: A Domain-Specific Intermediate Hardware Synthesis Framework for Prototyping Software-Defined Radios
Modern Software-Defined Radio (SDR) applications are designed around Field Programmable Gate Arrays (FPGAs) because FPGAs can be configured into solution architectures that suit domain-specific problems while achieving the best trade-off between performance, power, area, and flexibility. FPGAs are well known for rich computational resources, which traditionally include logic, register, and routing resources. Technological advances have led FPGAs to incorporate more complex components, including sophisticated memory blocks, Digital Signal Processing (DSP) blocks, and high-speed interfaces to Gigabit Ethernet (GbE) and the Peripheral Component Interconnect Express (PCIe) bus. Gateware for programming FPGAs is described at a low level of design abstraction, the Register Transfer Level (RTL), typically in either VHSIC-HDL (VHDL) or Verilog code. In practice, these low-level description languages have a very steep learning curve, provide low productivity for hardware designers, and lack readily available open-source library support for fundamental designs, which consequently limits design work to hardware experts. These limitations have led to the adoption of High-Level Synthesis (HLS) tools that raise the design abstraction using syntax, semantics, and software development notations that are well known to most software developers. However, while HLS has made FPGA programming more accessible and can increase design productivity, HLS tools are still not widely adopted in the design community because low-level skills are still required to produce efficient designs. Additionally, the RTL code produced by HLS tools is often difficult to decipher, modify and optimize because functionality and micro-architecture are coupled together in a single High-Level Language (HLL).
In order to alleviate these problems, Domain-Specific Languages (DSLs) have been introduced to capture algorithms at a high level of abstraction with more expressive power, providing domain-specific optimizations that factor in new transformations and the trade-off between resource utilization and system performance. The problem with existing DSLs is that they are designed around imperative languages whose instruction sequence does not match the hardware structure and intrinsics, leading to hardware designs whose system properties do not conform to the high-level specifications and constraints. The aim of this thesis is, therefore, to design and implement an intermediate-level framework, SdrLift, for high-level rapid prototyping of FPGA-based SDR applications. The SdrLift input is an HLL developed using functional language constructs and design patterns that specify the structural behavior of the application design. The SdrLift language serves two purposes: first, a designer can use it directly to develop SDR applications; second, it can serve as the Intermediate Representation (IR) step generated by a higher-level language or a DSL. The SdrLift compiler uses the dataflow graph as an IR to structurally represent the accelerator micro-architecture, in which the components correspond to fine-level and coarse-level hardware blocks (HW Blocks) that are either auto-synthesized or integrated from existing reusable Intellectual Property (IP) core libraries. Another IR takes the form of a dataflow model and is used for composition and global interconnection of the HW Blocks while making efficient interfacing decisions in an attempt to satisfy speed and resource usage objectives. Moreover, the dataflow model provides rules and properties that form a theoretical framework for formally analyzing the characteristics of SDR applications (e.g. the throughput, sample rate, latency, and buffer size, among other factors). Using both the dataflow graph (DFG) and the dataflow model in the SdrLift compiler provides two benefits: it abstracts the micro-architecture from the high-level algorithm specifications, and it decouples the micro-architecture from the low-level RTL implementation. IR creation and model analysis are followed by VHDL code generation, which employs low-level optimizations that ensure optimal hardware design results. The code generation process performs analysis to ensure that the resulting hardware system conforms to the high-level design specifications and constraints. SdrLift is evaluated by developing representative SDR case studies, in which the VHDL code for eight different SDR applications is generated. The experimental results show that SdrLift achieves the desired performance and flexibility, while also conserving the hardware resources utilized.
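To make the kind of formal analysis a dataflow model permits more concrete, the sketch below solves the balance equations of a synchronous dataflow (SDF) graph to obtain its repetition vector, from which relative firing rates and per-iteration token counts follow. Treating the model as SDF is an assumption; the abstract only says "dataflow model". The function name, the edge encoding, and the example actors and rates are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, edges):
    """Solve the SDF balance equations q[src] * produced == q[dst] * consumed.

    `edges` is a list of (src, produced, dst, consumed) tuples. Returns the
    smallest positive integer firing counts. Raises ValueError if the rates
    are inconsistent or the graph is not connected.
    """
    rate = {actors[0]: Fraction(1)}
    pending = list(edges)
    progress = True
    while pending and progress:
        progress = False
        for edge in list(pending):
            src, produced, dst, consumed = edge
            if src in rate and dst in rate:
                if rate[src] * produced != rate[dst] * consumed:
                    raise ValueError("inconsistent rates: no bounded schedule")
            elif src in rate:
                rate[dst] = rate[src] * produced / consumed
            elif dst in rate:
                rate[src] = rate[dst] * consumed / produced
            else:
                continue  # neither end is known yet; retry on a later pass
            pending.remove(edge)
            progress = True
    if pending or len(rate) != len(actors):
        raise ValueError("graph is not connected")
    # Scale the fractional rates to the smallest integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    counts = {a: int(r * scale) for a, r in rate.items()}
    divisor = reduce(gcd, counts.values())
    return {a: c // divisor for a, c in counts.items()}


# Hypothetical SDR chain: a source emits 2 samples per firing, a filter
# consumes 3 and emits 1, and a sink consumes 2.
sdr_edges = [("source", 2, "filter", 3), ("filter", 1, "sink", 2)]
q = repetition_vector(["source", "filter", "sink"], sdr_edges)
# Tokens crossing each edge during one full iteration of the schedule.
tokens_per_iteration = {(s, d): q[s] * p for s, p, d, _ in sdr_edges}
print(q, tokens_per_iteration)
```

In this example the source must fire three times and the filter twice for every firing of the sink, so six tokens cross the first edge per iteration; quantities like these are what bound buffer sizes and relate sample rates between blocks.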