470 research outputs found

    LIPIcs, Volume 251, ITCS 2023, Complete Volume

    Get PDF
    LIPIcs, Volume 251, ITCS 2023, Complete Volum

    Database System Acceleration on FPGAs

    Get PDF
    Relational database systems provide various services and applications with an efficient means for storing, processing, and retrieving their data. The performance of these systems has a direct impact on the quality of service of the applications that rely on them. Therefore, it is crucial that database systems are able to adapt and grow in tandem with the demands of these applications, ensuring that their performance scales accordingly. In the past, Moore's law and algorithmic advancements have been sufficient to meet these demands. However, with the slowdown of Moore's law, researchers have begun exploring alternative methods, such as application-specific technologies, to satisfy the more challenging performance requirements. One such technology is field-programmable gate arrays (FPGAs), which provide ideal platforms for developing and running custom architectures for accelerating database systems. The goal of this thesis is to develop a domain-specific architecture that can enhance the performance of in-memory database systems when executing analytical queries. Our research is guided by a combination of academic and industrial requirements that seek to strike a balance between generality and performance. The former ensures that our platform can be used to process a diverse range of workloads, while the latter makes it an attractive solution for high-performance use cases. Throughout this thesis, we present the development of a system-on-chip for database system acceleration that meets our requirements. The resulting architecture, called CbMSMK, is capable of processing the projection, sort, aggregation, and equi-join database operators and can also run some complex TPC-H queries. CbMSMK employs a shared sort-merge pipeline for executing all these operators, which results in an efficient use of FPGA resources. This approach enables the instantiation of multiple acceleration cores on the FPGA, allowing it to serve multiple clients simultaneously. CbMSMK can process both arbitrarily deep and wide tables efficiently. The former is achieved through the use of the sort-merge algorithm which utilizes the FPGA RAM for buffering intermediate sort results. The latter is achieved through the use of KeRRaS, a novel variant of the forward radix sort algorithm introduced in this thesis. KeRRaS allows CbMSMK to process a table a few columns at a time, incrementally generating the final result through multiple iterations. Given that acceleration is a key objective of our work, CbMSMK benefits from many performance optimizations. For instance, multi-way merging is employed to reduce the number of merge passes required for the execution of the sort-merge algorithm, thus improving the performance of all our pipeline-breaking operators. Another example is our in-depth analysis of early aggregation, which led to the development of a novel cache-based algorithm that significantly enhances aggregation performance. Our experiments demonstrate that CbMSMK performs on average 5 times faster than the state-of-the-art CPU-based database management system MonetDB.:I Database Systems & FPGAs 1 INTRODUCTION 1.1 Databases & the Importance of Performance 1.2 Accelerators & FPGAs 1.3 Requirements 1.4 Outline & Summary of Contributions 2 BACKGROUND ON DATABASE SYSTEMS 2.1 Databases 2.1.1 Storage Model 2.1.2 Storage Medium 2.2 Database Operators 2.2.1 Projection 2.2.2 Filter 2.2.3 Sort 2.2.4 Aggregation 2.2.5 Join 2.2.6 Operator Classification 2.3 Database Queries 2.4 Impact of Acceleration 3 BACKGROUND ON FPGAS 3.1 FPGA 3.1.1 Logic Element 3.1.2 Block RAM (BRAM) 3.1.3 Digital Signal Processor (DSP) 3.1.4 IO Element 3.1.5 Programmable Interconnect 3.2 FPGADesignFlow 3.2.1 Specifications 3.2.2 RTL Description 3.2.3 Verification 3.2.4 Synthesis, Mapping, Placement, and Routing 3.2.5 TimingAnalysis 3.2.6 Bitstream Generation and FPGA Programming 3.3 Implementation Quality Metrics 3.4 FPGA Cards 3.5 Benefits of Using FPGAs 3.6 Challenges of Using FPGAs 4 RELATED WORK 4.1 Summary of Related Work 4.2 Platform Type 4.2.1 Accelerator Card 4.2.2 Coprocessor 4.2.3 Smart Storage 4.2.4 Network Processor 4.3 Implementation 4.3.1 Loop-based implementation 4.3.2 Sort-based Implementation 4.3.3 Hash-based Implementation 4.3.4 Mixed Implementation 4.4 A Note on Quantitative Performance Comparisons II Cache-Based Morphing Sort-Merge with KeRRaS (CbMSMK) 5 OBJECTIVES AND ARCHITECTURE OVERVIEW 5.1 From Requirements to Objectives 5.2 Architecture Overview 5.3 Outlineof Part II 6 COMPARATIVE ANALYSIS OF OPENCL AND RTL FOR SORT-MERGE PRIMITIVES ON FPGAS 6.1 Programming FPGAs 6.2 RelatedWork 6.3 Architecture 6.3.1 Global Architecture 6.3.2 Sorter Architecture 6.3.3 Merger Architecture 6.3.4 Scalability and Resource Adaptability 6.4 Experiments 6.4.1 OpenCL Sort-Merge Implementation 6.4.2 RTLSorters 6.4.3 RTLMergers 6.4.4 Hybrid OpenCL-RTL Sort-Merge Implementation 6.5 Summary & Discussion 7 RESOURCE-EFFICIENT ACCELERATION OF PIPELINE-BREAKING DATABASE OPERATORS ON FPGAS 7.1 The Case for Resource Efficiency 7.2 Related Work 7.3 Architecture 7.3.1 Sorters 7.3.2 Sort-Network 7.3.3 X:Y Mergers 7.3.4 Merge-Network 7.3.5 Join Materialiser (JoinMat) 7.4 Experiments 7.4.1 Experimental Setup 7.4.2 Implementation Description & Tuning 7.4.3 Sort Benchmarks 7.4.4 Aggregation Benchmarks 7.4.5 Join Benchmarks 7. Summary 8 KERRAS: COLUMN-ORIENTED WIDE TABLE PROCESSING ON FPGAS 8.1 The Scope of Database System Accelerators 8.2 Related Work 8.3 Key-Reduce Radix Sort(KeRRaS) 8.3.1 Time Complexity 8.3.2 Space Complexity (Memory Utilization) 8.3.3 Discussion and Optimizations 8.4 Architecture 8.4.1 MSM 8.4.2 MSMK: Extending MSM with KeRRaS 8.4.3 Payload, Aggregation and Join Processing 8.4.4 Limitations 8.5 Experiments 8.5.1 Experimental Setup 8.5.2 Datasets 8.5.3 MSMK vs. MSM 8.5.4 Payload-Less Benchmarks 8.5.5 Payload-Based Benchmarks 8.5.6 Flexibility 8.6 Summary 9 A STUDY OF EARLY AGGREGATION IN DATABASE QUERY PROCESSING ON FPGAS 9.1 Early Aggregation 9.2 Background & Related Work 9.2.1 Sort-Based Early Aggregation 9.2.2 Cache-Based Early Aggregation 9.3 Simulations 9.3.1 Datasets 9.3.2 Metrics 9.3.3 Sort-Based Versus Cache-Based Early Aggregation 9.3.4 Comparison of Set-Associative Caches 9.3.5 Comparison of Cache Structures 9.3.6 Comparison of Replacement Policies 9.3.7 Cache Selection Methodology 9.4 Cache System Architecture 9.4.1 Window Aggregator 9.4.2 Compressor & Hasher 9.4.3 Collision Detector 9.4.4 Collision Resolver 9.4.5 Cache 9.5 Experiments 9.5.1 Experimental Setup 9.5.2 Resource Utilization and Parameter Tuning 9.5.3 Datasets 9.5.4 Benchmarks on Synthetic Data 9.5.5 Benchmarks on Real Data 9.6 Summary 10 THE FULL PICTURE 10.1 System Architecture 10.2 Benchmarks 10.3 Meeting the Objectives III Conclusion 11 SUMMARY AND OUTLOOK ON FUTURE RESEARCH 11.1 Summary 11.2 Future Work BIBLIOGRAPHY LIST OF FIGURES LIST OF TABLE

    Transactional memory for high-performance embedded systems

    Get PDF
    The increasing demand for computational power in embedded systems, which is required for various tasks, such as autonomous driving, can only be achieved by exploiting the resources offered by modern hardware. Due to physical limitations, hardware manufacturers have moved to increase the number of cores per processor instead of further increasing clock rates. Therefore, in our view, the additionally required computing power can only be achieved by exploiting parallelism. Unfortunately writing parallel code is considered a difficult and complex task. Hardware Transactional Memories (HTMs) are a suitable tool to write sophisticated parallel software. However, HTMs were not specifically developed for embedded systems and therefore cannot be used without consideration. The use of conventional HTMs increases complexity and makes it more difficult to foresee implications with other important properties of embedded systems. This thesis therefore describes how an HTM for embedded systems could be implemented. The HTM was designed to allow the parallel execution of software and to offer functionality which is useful for embedded systems. Hereby the focus lay on: elimination of the typical limitations of conventional HTMs, several conflict resolution mechanisms, investigation of real time behavior, and a feature to conserve energy. To enable the desired functionalities, the structure of the HTM described in this work strongly differs from a conventional HTM. In comparison to the baseline HTM, which was also designed and implemented in this thesis, the biggest adaptation concerns the conflict detection. It was modified so that conflicts can be detected and resolved centrally. For this, the cache hierarchy as well as the cache coherence had to be adapted and partially extended. The system was implemented in the cycle-accurate gem5 simulator. The eight benchmarks of the STAMP benchmark suite were used for evaluation. The evaluation of the various functionalities shows that the mechanisms work and add value for the operation in embedded systems.Der immer größer werdende Bedarf an Rechenleistung in eingebetteten Systemen, der für verschiedene Aufgaben wie z. B. dem autonomen Fahren benötigt wird, kann nur durch die effiziente Nutzung der zur Verfügung stehenden Ressourcen erreicht werden. Durch physikalische Grenzen sind Prozessorhersteller dazu übergegangen, Prozessoren mit mehreren Prozessorkernen auszustatten, statt die Taktraten weiter anzuheben. Daher kann die zusätzlich benötigte Rechenleistung aus unserer Sicht nur durch eine Steigerung der Parallelität gelingen. Hardwaretransaktionsspeicher (HTS) erlauben es ihren Nutzern schnell und einfach parallele Programme zu schreiben. Allerdings wurden HTS nicht speziell für eingebettete Systeme entwickelt und sind daher nur eingeschränkt für diese nutzbar. Durch den Einsatz herkömmlicher HTS steigt die Komplexität und es wird somit schwieriger abzusehen, ob andere wichtige Eigenschaften erreicht werden können. Um den Einsatz von HTS in eingebettete Systeme besser zu ermöglichen, beschreibt diese Arbeit einen konkreten Ansatz. Der HTS wurde hierzu so entwickelt, dass er eine parallele Ausführung von Programmen ermöglicht und Eigenschaften besitzt, welche für eingebettete Systeme nützlich sind. Dazu gehören unter anderem: Wegfall der typischen Limitierungen herkömmlicher HTS, Einflussnahme auf den Konfliktauflösungsmechanismus, Unterstützung einer abschätzbaren Ausführung und eine Funktion, um Energie einzusparen. Um die gewünschten Funktionalitäten zu ermöglichen, unterscheidet sich der Aufbau des in dieser Arbeit beschriebenen HTS stark von einem klassischen HTS. Im Vergleich zu dem Referenz HTS, der ebenfalls im Rahmen dieser Arbeit entworfen und implementiert wurde, betrifft die größte Anpassung die Konflikterkennung. Sie wurde derart verändert, dass die Konflikte zentral erkannt und aufgelöst werden können. Hierfür mussten die Cache-Hierarchie und Cache-Kohärenz stark angepasst und teilweise erweitert werden. Das System wurde in einem taktgenauen Simulator, dem gem5-Simulator, umgesetzt. Zur Evaluation wurden die acht Benchmarks der STAMP-Benchmark-Suite eingesetzt. Die Evaluation der verschiedenen Funktionen zeigt, dass die Mechanismen funktionieren und somit einen Mehrwert für eingebettete Systeme bieten

    A multi-level functional IR with rewrites for higher-level synthesis of accelerators

    Get PDF
    Specialised accelerators deliver orders of magnitude higher energy-efficiency than general-purpose processors. Field Programmable Gate Arrays (FPGAs) have become the substrate of choice, because the ever-changing nature of modern workloads, such as machine learning, demands reconfigurability. However, they are notoriously hard to program directly using Hardware Description Languages (HDLs). Traditional High-Level Synthesis (HLS) tools improve productivity, but come with their own problems. They often produce sub-optimal designs and programmers are still required to write hardware-specific code, thus development cycles remain long. This thesis proposes Shir, a higher-level synthesis approach for high-performance accelerator design with a hardware-agnostic programming entry point, a multi-level Intermediate Representation (IR), a compiler and rewrite rules for optimisation. First, a novel, multi-level functional IR structure for accelerator design is described. The IRs operate on different levels of abstraction, cleanly separating different hardware concerns. They enable the expression of different forms of parallelism and standard memory features, such as asynchronous off-chip memories or synchronous on-chip buffers, as well as arbitration of such shared resources. Exposing these features at the IR level is essential for achieving high performance. Next, mechanical lowering procedures are introduced to automatically compile a program specification through Shir’s functional IRs until low-level HDL code for FPGA synthesis is emitted. Each lowering step gradually adds implementation details. Finally, this thesis presents rewrite rules for automatic optimisations around parallelisation, buffering and data reshaping. Reshaping operations pose a challenge to functional approaches in particular. They introduce overheads that compromise performance or even prevent the generation of synthesisable hardware designs altogether. This fundamental issue is solved by the application of rewrite rules. The viability of this approach is demonstrated by running matrix multiplication and 2D convolution on an Intel Arria 10 FPGA. A limited design space exploration is conducted, confirming the ability of the IR to exploit various hardware features. Using rewrite rules for optimisation, it is possible to generate high-performance designs that are competitive with highly tuned OpenCL implementations and that outperform hardware-agnostic OpenCL code. The performance impact of the optimisations is further evaluated showing that they are essential to achieving high performance, and in many cases also necessary to produce hardware that fits the resource constraints

    Co-designing reliability and performance for datacenter memory

    Get PDF
    Memory is one of the key components that affects reliability and performance of datacenter servers. Memory in today’s servers is organized and shared in several ways to provide the most performant and efficient access to data. For example, cache hierarchy in multi-core chips to reduce access latency, non-uniform memory access (NUMA) in multi-socket servers to improve scalability, disaggregation to increase memory capacity. In all these organizations, hardware coherence protocols are used to maintain memory consistency of this shared memory and implicitly move data to the requesting cores. This thesis aims to provide fault-tolerance against newer models of failure in the organization of memory in datacenter servers. While designing for improved reliability, this thesis explores solutions that can also enhance performance of applications. The solutions build over modern coherence protocols to achieve these properties. First, we observe that DRAM memory system failure rates have increased, demanding stronger forms of memory reliability. To combat this, the thesis proposes Dvé, a hardware driven replication mechanism where data blocks are replicated across two different memory controllers in a cache-coherent NUMA system. Data blocks are accompanied by a code with strong error detection capabilities so that when an error is detected, correction is performed using the replica. Dvé’s organization offers two independent points of access to data which enables: (a) strong error correction that can recover from a range of faults affecting any of the components in the memory and (b) higher performance by providing another nearer point of memory access. Dvé’s coherent replication keeps the replicas in sync for reliability and also provides coherent access to read replicas during fault-free operation for improved performance. Dvé can flexibly provide these benefits on-demand at runtime. Next, we observe that the coherence protocol itself requires to be hardened against failures. Memory in datacenter servers is being disaggregated from the compute servers into dedicated memory servers, driven by standards like CXL. CXL specifies the coherence protocol semantics for compute servers to access and cache data from a shared region in the disaggregated memory. However, the CXL specification lacks the requisite level of fault-tolerance necessary to operate at an inter-server scale within the datacenter. Compute servers can fail or be unresponsive in the datacenter and therefore, it is important that the coherence protocol remain available in the presence of such failures. The thesis proposes Āpta, a CXL-based, shared disaggregated memory system for keeping the cached data consistent without compromising availability in the face of compute server failures. Āpta architects a high-performance fault-tolerant object-granular memory server that significantly improves performance for stateless function-as-a-service (FaaS) datacenter applications

    Feasibility Study of High-Level Synthesis : Implementation of a Real-Time HEVC Intra Encoder on FPGA

    Get PDF
    High-Level Synthesis (HLS) on automatisoitu suunnitteluprosessi, joka pyrkii parantamaan tuottavuutta perinteisiin suunnittelumenetelmiin verrattuna, nostamalla suunnittelun abstraktiota rekisterisiirtotasolta (RTL) käyttäytymistasolle. Erilaisia kaupallisia HLS-työkaluja on ollut markkinoilla aina 1990-luvulta lähtien, mutta vasta äskettäin ne ovat alkaneet saada hyväksyntää teollisuudessa sekä akateemisessa maailmassa. Hidas käyttöönottoaste on johtunut pääasiassa huonommasta tulosten laadusta (QoR) kuin mitä on ollut mahdollista tavanomaisilla laitteistokuvauskielillä (HDL). Uusimmat HLS-työkalusukupolvet ovat kuitenkin kaventaneet QoR-aukkoa huomattavasti. Tämä väitöskirja tutkii HLS:n soveltuvuutta videokoodekkien kehittämiseen. Se esittelee useita HLS-toteutuksia High Efficiency Video Coding (HEVC) -koodaukselle, joka on keskeinen mahdollistava tekniikka lukuisille nykyaikaisille mediasovelluksille. HEVC kaksinkertaistaa koodaustehokkuuden edeltäjäänsä Advanced Video Coding (AVC) -standardiin verrattuna, saavuttaen silti saman subjektiivisen visuaalisen laadun. Tämä tyypillisesti saavutetaan huomattavalla laskennallisella lisäkustannuksella. Siksi reaaliaikainen HEVC vaatii automatisoituja suunnittelumenetelmiä, joita voidaan käyttää rautatoteutus- (HW ) ja varmennustyön minimoimiseen. Tässä väitöskirjassa ehdotetaan HLS:n käyttöä koko enkooderin suunnitteluprosessissa. Dataintensiivisistä koodaustyökaluista, kuten intra-ennustus ja diskreetit muunnokset, myös enemmän kontrollia vaativiin kokonaisuuksiin, kuten entropiakoodaukseen. Avoimen lähdekoodin Kvazaar HEVC -enkooderin C-lähdekoodia hyödynnetään tässä työssä referenssinä HLS-suunnittelulle sekä toteutuksen varmentamisessa. Suorituskykytulokset saadaan ja raportoidaan ohjelmoitavalla porttimatriisilla (FPGA). Tämän väitöskirjan tärkein tuotos on HEVC intra enkooderin prototyyppi. Prototyyppi koostuu Nokia AirFrame Cloud Server palvelimesta, varustettuna kahdella 2.4 GHz:n 14-ytiminen Intel Xeon prosessorilla, sekä kahdesta Intel Arria 10 GX FPGA kiihdytinkortista, jotka voidaan kytkeä serveriin käyttäen joko peripheral component interconnect express (PCIe) liitäntää tai 40 gigabitin Ethernettiä. Prototyyppijärjestelmä saavuttaa reaaliaikaisen 4K enkoodausnopeuden, jopa 120 kuvaa sekunnissa. Lisäksi järjestelmän suorituskykyä on helppo skaalata paremmaksi lisäämällä järjestelmään käytännössä minkä tahansa määrän verkkoon kytkettäviä FPGA-kortteja. Monimutkaisen HEVC:n tehokas mallinnus ja sen monipuolisten ominaisuuksien mukauttaminen reaaliaikaiselle HW HEVC enkooderille ei ole triviaali tehtävä, koska HW-toteutukset ovat perinteisesti erittäin aikaa vieviä. Tämä väitöskirja osoittaa, että HLS:n avulla pystytään nopeuttamaan kehitysaikaa, tarjoamaan ennen näkemätöntä suunnittelun skaalautuvuutta, ja silti osoittamaan kilpailukykyisiä QoR-arvoja ja absoluuttista suorituskykyä verrattuna olemassa oleviin toteutuksiin.High-Level Synthesis (HLS) is an automated design process that seeks to improve productivity over traditional design methods by increasing design abstraction from register transfer level (RTL) to behavioural level. Various commercial HLS tools have been available on the market since the 1990s, but only recently they have started to gain adoption across industry and academia. The slow adoption rate has mainly stemmed from lower quality of results (QoR) than obtained with conventional hardware description languages (HDLs). However, the latest HLS tool generations have substantially narrowed the QoR gap. This thesis studies the feasibility of HLS in video codec development. It introduces several HLS implementations for High Efficiency Video Coding (HEVC) , that is the key enabling technology for numerous modern media applications. HEVC doubles the coding efficiency over its predecessor Advanced Video Coding (AVC) standard for the same subjective visual quality, but typically at the cost of considerably higher computational complexity. Therefore, real-time HEVC calls for automated design methodologies that can be used to minimize the HW implementation and verification effort. This thesis proposes to use HLS throughout the whole encoder design process. From data-intensive coding tools, like intra prediction and discrete transforms, to more control-oriented tools, such as entropy coding. The C source code of the open-source Kvazaar HEVC encoder serves as a design entry point for the HLS flow, and it is also utilized in design verification. The performance results are gathered with and reported for field programmable gate array (FPGA) . The main contribution of this thesis is an HEVC intra encoder prototype that is built on a Nokia AirFrame Cloud Server equipped with 2.4 GHz dual 14-core Intel Xeon processors and two Intel Arria 10 GX FPGA Development Kits, that can be connected to the server via peripheral component interconnect express (PCIe) generation 3 or 40 Gigabit Ethernet. The proof-of-concept system achieves real-time. 4K coding speed up to 120 fps, which can be further scaled up by adding practically any number of network-connected FPGA cards. Overcoming the complexity of HEVC and customizing its rich features for a real-time HEVC encoder implementation on hardware is not a trivial task, as hardware development has traditionally turned out to be very time-consuming. This thesis shows that HLS is able to boost the development time, provide previously unseen design scalability, and still result in competitive performance and QoR over state-of-the-art hardware implementations

    MemPool: A Scalable Manycore Architecture with a Low-Latency Shared L1 Memory

    Full text link
    Shared L1 memory clusters are a common architectural pattern (e.g., in GPGPUs) for building efficient and flexible multi-processing-element (PE) engines. However, it is a common belief that these tightly-coupled clusters would not scale beyond a few tens of PEs. In this work, we tackle scaling shared L1 clusters to hundreds of PEs while supporting a flexible and productive programming model and maintaining high efficiency. We present MemPool, a manycore system with 256 RV32IMAXpulpimg "Snitch" cores featuring application-tunable functional units. We designed and implemented an efficient low-latency PE to L1-memory interconnect, an optimized instruction path to ensure each PE's independent execution, and a powerful DMA engine and system interconnect to stream data in and out. MemPool is easy to program, with all the cores sharing a global view of a large, multi-banked, L1 scratchpad memory, accessible within at most five cycles in the absence of conflicts. We provide multiple runtimes to program MemPool at different abstraction levels and illustrate its versatility with a wide set of applications. MemPool runs at 600 MHz (60 gate delays) in typical conditions (TT/0.80V/25{\deg}C) in 22 nm FDX technology and achieves a performance of up to 229 GOPS or 192 GOPS/W with less than 2% of execution stalls.Comment: 14 pages, 17 figures, 2 table

    Fundamentals

    Get PDF
    Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, their summary and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters

    Reconfigurable acceleration of Recurrent Neural Networks

    Get PDF
    Recurrent Neural Networks (RNNs) have been successful in a wide range of applications involving temporal sequences such as natural language processing, speech recognition and video analysis. However, RNNs often require a significant amount of memory and computational resources. In addition, the recurrent nature and data dependencies in RNN computations can lead to system stall, resulting in low throughput and high latency. This work describes novel parallel hardware architectures for accelerating RNN inference using Field-Programmable Gate Array (FPGA) technology, which considers the data dependencies and high computational costs of RNNs. The first contribution of this thesis is a latency-hiding architecture that utilizes column-wise matrix-vector multiplication instead of the conventional row-wise operation to eliminate data dependencies and improve the throughput of RNN inference designs. This architecture is further enhanced by a configurable checkerboard tiling strategy which allows large dimensions of weight matrices, while supporting element-based parallelism and vector-based parallelism. The presented reconfigurable RNN designs show significant speedup over CPU, GPU, and other FPGA designs. The second contribution of this thesis is a weight reuse approach for large RNN models with weights stored in off-chip memory, running with a batch size of one. A novel blocking-batching strategy is proposed to optimize the throughput of large RNN designs on FPGAs by reusing the RNN weights. Performance analysis is also introduced to enable FPGA designs to achieve the best trade-off between area, power consumption and performance. Promising power efficiency improvement has been achieved in addition to speeding up over CPU and GPU designs. The third contribution of this thesis is a low latency design for RNNs based on a partially-folded hardware architecture. It also introduces a technique that balances initiation interval of multi-layer RNN inferences to increase hardware efficiency and throughput while reducing latency. The approach is evaluated on a variety of applications, including gravitational wave detection and Bayesian RNN-based ECG anomaly detection. To facilitate the use of this approach, we open source an RNN template which enables the generation of low-latency FPGA designs with efficient resource utilization using high-level synthesis tools.Open Acces
    corecore