91 research outputs found

    Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    Get PDF
    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201

    FPGA Accelerators on Heterogeneous Systems: An Approach Using High Level Synthesis

    Get PDF
    La evolución de las FPGAs como dispositivos para el procesamiento con alta eficiencia energética y baja latencia de control, comparada con dispositivos como las CPUs y las GPUs, las han hecho atractivas en el ámbito de la computación de alto rendimiento (HPC).A pesar de las inumerables ventajas de las FPGAs, su inclusión en HPC presenta varios retos. El primero, la complejidad que supone la programación de las FPGAs comparada con dispositivos como las CPUs y las GPUs. Segundo, el tiempo de desarrollo es alto debido al proceso de síntesis del hardware. Y tercero, trabajar con más arquitecturas en HPC requiere el manejo y la sintonización de los detalles de cada dispositivo, lo que añade complejidad.Esta tesis aborda estos 3 problemas en diferentes niveles con el objetivo de mejorar y facilitar la adopción de las FPGAs usando la síntesis de alto nivel(HLS) en sistemas HPC.En un nivel próximo al hardware, en esta tesis se desarrolla un modelo analítico para las aplicaciones limitadas en memoria, que es una situación común en aplicaciones de HPC. El modelo, desarrollado para kernels programados usando HLS, puede predecir el tiempo de ejecución con alta precisión y buena adaptabilidad ante cambios en la tecnología de la memoria, como las DDR4 y HBM2, y en las variaciones en la frecuencia del kernel. Esta solución puede aumentar potencialmente la productividad de las personas que programan, reduciendo el tiempo de desarrollo y optimización de las aplicaciones.Entender los detalles de bajo nivel puede ser complejo para las programadoras promedio, y el desempeño de las aplicaciones para FPGA aún requiere un alto nivel en las habilidades de programación. Por ello, nuestra segunda propuesta está enfocada en la extensión de las bibliotecas con una propuesta para cómputo en visión artificial que sea portable entre diferentes fabricantes de FPGAs. La biblioteca se ha diseñado basada en templates, lo que permite una biblioteca que da flexibilidad a la generación del hardware y oculta decisiones de diseño críticas como la comunicación entre nodos, el modelo de concurrencia, y la integración de las aplicaciones en el sistema heterogéneo para facilitar el desarrollo de grafos de visión artificial que pueden ser complejos.Finalmente, en el runtime del host del sistema heterogéneo, hemos integrado la FPGA para usarla de forma trasparente como un dispositivo acelerador para la co-ejecución en sistemas heterogéneos. Hemos hecho una serie propuestas de altonivel de abstracción que abarca los mecanismos de sincronización y políticas de balanceo en un sistema altamente heterogéneo compuesto por una CPU, una GPU y una FPGA. Se presentan los principales retos que han inspirado esta investigación y los beneficios de la inclusión de una FPGA en rendimiento y energía.En conclusión, esta tesis contribuye a la adopción de las FPGAs para entornos HPC, aportando soluciones que ayudan a reducir el tiempo de desarrollo y mejoran el desempeño y la eficiencia energética del sistema.---------------------------------------------The emergence of FPGAs in the High-Performance Computing domain is arising thanks to their promise of better energy efficiency and low control latency, compared with other devices such as CPUs or GPUs.Albeit these benefits, their complete inclusion into HPC systems still faces several challenges. First, FPGA complexity means its programming more difficult compared to devices such as CPU and GPU. Second, the development time is longer due to the required synthesis effort. And third, working with multiple devices increments the details that should be managed and increase hardware complexity.This thesis tackles these 3 problems at different stack levels to improve and to make easier the adoption of FPGAs using High-Level Synthesis on HPC systems. At a close to the hardware level, this thesis contributes with a new analytical model for memory-bound applications, an usual situation for HPC applications. The model for HLS kernels can anticipate application performance before place and route, reducing the design development time. Our results show a high precision and adaptable model for external memory technologies such as DDR4 and HBM2, and kernel frequency changes. This solution potentially increases productivity, reducing application development time.Understanding low-level implementation details is difficult for average programmers, and the development of FPGA applications still requires high proficiency program- ming skills. For this reason, the second proposal is focused on the extension of a computer vision library to be portable among two of the main FPGA vendors. The template-based library allows hardware flexibility and hides design decisions such as the communication among nodes, the concurrency programming model, and the application’s integration in the heterogeneous system, to develop complex vision graphs easily.Finally, we have transparently integrated the FPGA in a high level framework for co-execution with other devices. We propose a set of high level abstractions covering synchronization mechanism and load balancing policies in a highly heterogeneous system with CPU, GPU, and FPGA devices. We present the main challenges that inspired this research and the benefits of the FPGA use demonstrating performance and energy improvements.<br /

    Efficient Implementation of MIMO Decoders

    Get PDF

    Fast and energy-efficient derivatives risk analysis: Streaming option Greeks on Xilinx and Intel FPGAs

    Full text link
    Whilst FPGAs have enjoyed success in accelerating high-frequency financial workloads for some time, their use for quantitative finance, which is the use of mathematical models to analyse financial markets and securities, has been far more limited to-date. Currently, CPUs are the most common architecture for such workloads, and an important question is whether FPGAs can ameliorate some of the bottlenecks encountered on those architectures. In this paper we extend our previous work accelerating the industry standard Securities Technology Analysis Center's (STAC\textregistered) derivatives risk analysis benchmark STAC-A2\texttrademark{}, by first porting this from our previous Xilinx implementation to an Intel Stratix-10 FPGA, exploring the challenges encountered when moving from one FPGA architecture to another and suitability of techniques. We then present a host-data-streaming approach that ultimately outperforms our previous version on a Xilinx Alveo U280 FPGA by up to 4.6 times and requiring 9 times less energy at the largest problem size, while outperforming the CPU and GPU versions by up to 8.2 and 5.2 times respectively. The result of this work is a significant enhancement in FPGA performance against the previous version for this industry standard benchmark running on both Xilinx and Intel FPGAs, and furthermore an exploration of optimisation and porting techniques that can be applied to other HPC workloads.Comment: Accepted version of paper that appeared in 2022 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC). arXiv admin note: text overlap with arXiv:2206.0371

    Fast and energy-efficient derivatives risk analysis: Streaming option Greeks on Xilinx and Intel FPGAs

    Get PDF
    Whilst FPGAs have enjoyed success in accelerating high-frequency financial workloads for some time, their use for quantitative finance, which is the use of mathematical models to analyse financial markets and securities, has been far more limited to-date. Currently, CPUs are the most common architecture for such workloads, and an important question is whether FPGAs can ameliorate some of the bottlenecks encountered on those architectures. In this paper we extend our previous work accelerating the industry standard Securities Technology Analysis Center's (STAC (R)) derivatives risk analysis benchmark STAC-A2TM, by first porting this from our previous Xilinx implementation to an Intel Stratix-10 FPGA, exploring the challenges encountered when moving from one FPGA architecture to another and suitability of techniques. We then present a host-data-streaming approach that ultimately outperforms our previous version on a Xilinx Alveo U280 FPGA by up to 4.6 times and requiring 9 times less energy at the largest problem size, while outperforming the CPU and GPU versions by up to 8.2 and 5.2 times respectively. The result of this work is a significant enhancement in FPGA performance against the previous version for this industry standard benchmark running on both Xilinx and Intel FPGAs, and furthermore an exploration of optimisation and porting techniques that can be applied to other HPC workloads

    A CAD Tool for Synthesizing Optimized Variants of Altera\u27s Nios II Soft-Core Processor

    Get PDF
    Soft-core processors offer embedded system designers the benefits of customization, flexibility and reusability. Altera\u27s NIOS II soft-core processor is a popular, commercially available soft-core processor that can be implemented on a variety of Altera FPGAs. In this thesis, the Nios II soft-core processor from Altera Corporation was studied and a VHDL implementation, called UW_Nios II, was developed. UW_Nios II was developed to enable us to perform design space exploration (DSE) for the Nios II processor. It was evaluated and compared with Altera Nios II and shown to be competitive. SCBuild is an existing CAD tool that was developed to enable DSE of soft-core processors. We modified SCBuild to automatically explore the design space of the UW_Nios II using a genetic algorithm. This tool can accurately estimate the area and critical path delay of different variants of the UW_Nios II on a field programmable gate array. Through experiments conducted using SCBuild, it was shown that employing a genetic algorithm to explore the design space of parameterized Nios II core, with a large design space, helps designers find optimized variants of UW_Nios II

    Database System Acceleration on FPGAs

    Get PDF
    Relational database systems provide various services and applications with an efficient means for storing, processing, and retrieving their data. The performance of these systems has a direct impact on the quality of service of the applications that rely on them. Therefore, it is crucial that database systems are able to adapt and grow in tandem with the demands of these applications, ensuring that their performance scales accordingly. In the past, Moore's law and algorithmic advancements have been sufficient to meet these demands. However, with the slowdown of Moore's law, researchers have begun exploring alternative methods, such as application-specific technologies, to satisfy the more challenging performance requirements. One such technology is field-programmable gate arrays (FPGAs), which provide ideal platforms for developing and running custom architectures for accelerating database systems. The goal of this thesis is to develop a domain-specific architecture that can enhance the performance of in-memory database systems when executing analytical queries. Our research is guided by a combination of academic and industrial requirements that seek to strike a balance between generality and performance. The former ensures that our platform can be used to process a diverse range of workloads, while the latter makes it an attractive solution for high-performance use cases. Throughout this thesis, we present the development of a system-on-chip for database system acceleration that meets our requirements. The resulting architecture, called CbMSMK, is capable of processing the projection, sort, aggregation, and equi-join database operators and can also run some complex TPC-H queries. CbMSMK employs a shared sort-merge pipeline for executing all these operators, which results in an efficient use of FPGA resources. This approach enables the instantiation of multiple acceleration cores on the FPGA, allowing it to serve multiple clients simultaneously. CbMSMK can process both arbitrarily deep and wide tables efficiently. The former is achieved through the use of the sort-merge algorithm which utilizes the FPGA RAM for buffering intermediate sort results. The latter is achieved through the use of KeRRaS, a novel variant of the forward radix sort algorithm introduced in this thesis. KeRRaS allows CbMSMK to process a table a few columns at a time, incrementally generating the final result through multiple iterations. Given that acceleration is a key objective of our work, CbMSMK benefits from many performance optimizations. For instance, multi-way merging is employed to reduce the number of merge passes required for the execution of the sort-merge algorithm, thus improving the performance of all our pipeline-breaking operators. Another example is our in-depth analysis of early aggregation, which led to the development of a novel cache-based algorithm that significantly enhances aggregation performance. Our experiments demonstrate that CbMSMK performs on average 5 times faster than the state-of-the-art CPU-based database management system MonetDB.:I Database Systems & FPGAs 1 INTRODUCTION 1.1 Databases & the Importance of Performance 1.2 Accelerators & FPGAs 1.3 Requirements 1.4 Outline & Summary of Contributions 2 BACKGROUND ON DATABASE SYSTEMS 2.1 Databases 2.1.1 Storage Model 2.1.2 Storage Medium 2.2 Database Operators 2.2.1 Projection 2.2.2 Filter 2.2.3 Sort 2.2.4 Aggregation 2.2.5 Join 2.2.6 Operator Classification 2.3 Database Queries 2.4 Impact of Acceleration 3 BACKGROUND ON FPGAS 3.1 FPGA 3.1.1 Logic Element 3.1.2 Block RAM (BRAM) 3.1.3 Digital Signal Processor (DSP) 3.1.4 IO Element 3.1.5 Programmable Interconnect 3.2 FPGADesignFlow 3.2.1 Specifications 3.2.2 RTL Description 3.2.3 Verification 3.2.4 Synthesis, Mapping, Placement, and Routing 3.2.5 TimingAnalysis 3.2.6 Bitstream Generation and FPGA Programming 3.3 Implementation Quality Metrics 3.4 FPGA Cards 3.5 Benefits of Using FPGAs 3.6 Challenges of Using FPGAs 4 RELATED WORK 4.1 Summary of Related Work 4.2 Platform Type 4.2.1 Accelerator Card 4.2.2 Coprocessor 4.2.3 Smart Storage 4.2.4 Network Processor 4.3 Implementation 4.3.1 Loop-based implementation 4.3.2 Sort-based Implementation 4.3.3 Hash-based Implementation 4.3.4 Mixed Implementation 4.4 A Note on Quantitative Performance Comparisons II Cache-Based Morphing Sort-Merge with KeRRaS (CbMSMK) 5 OBJECTIVES AND ARCHITECTURE OVERVIEW 5.1 From Requirements to Objectives 5.2 Architecture Overview 5.3 Outlineof Part II 6 COMPARATIVE ANALYSIS OF OPENCL AND RTL FOR SORT-MERGE PRIMITIVES ON FPGAS 6.1 Programming FPGAs 6.2 RelatedWork 6.3 Architecture 6.3.1 Global Architecture 6.3.2 Sorter Architecture 6.3.3 Merger Architecture 6.3.4 Scalability and Resource Adaptability 6.4 Experiments 6.4.1 OpenCL Sort-Merge Implementation 6.4.2 RTLSorters 6.4.3 RTLMergers 6.4.4 Hybrid OpenCL-RTL Sort-Merge Implementation 6.5 Summary & Discussion 7 RESOURCE-EFFICIENT ACCELERATION OF PIPELINE-BREAKING DATABASE OPERATORS ON FPGAS 7.1 The Case for Resource Efficiency 7.2 Related Work 7.3 Architecture 7.3.1 Sorters 7.3.2 Sort-Network 7.3.3 X:Y Mergers 7.3.4 Merge-Network 7.3.5 Join Materialiser (JoinMat) 7.4 Experiments 7.4.1 Experimental Setup 7.4.2 Implementation Description & Tuning 7.4.3 Sort Benchmarks 7.4.4 Aggregation Benchmarks 7.4.5 Join Benchmarks 7. Summary 8 KERRAS: COLUMN-ORIENTED WIDE TABLE PROCESSING ON FPGAS 8.1 The Scope of Database System Accelerators 8.2 Related Work 8.3 Key-Reduce Radix Sort(KeRRaS) 8.3.1 Time Complexity 8.3.2 Space Complexity (Memory Utilization) 8.3.3 Discussion and Optimizations 8.4 Architecture 8.4.1 MSM 8.4.2 MSMK: Extending MSM with KeRRaS 8.4.3 Payload, Aggregation and Join Processing 8.4.4 Limitations 8.5 Experiments 8.5.1 Experimental Setup 8.5.2 Datasets 8.5.3 MSMK vs. MSM 8.5.4 Payload-Less Benchmarks 8.5.5 Payload-Based Benchmarks 8.5.6 Flexibility 8.6 Summary 9 A STUDY OF EARLY AGGREGATION IN DATABASE QUERY PROCESSING ON FPGAS 9.1 Early Aggregation 9.2 Background & Related Work 9.2.1 Sort-Based Early Aggregation 9.2.2 Cache-Based Early Aggregation 9.3 Simulations 9.3.1 Datasets 9.3.2 Metrics 9.3.3 Sort-Based Versus Cache-Based Early Aggregation 9.3.4 Comparison of Set-Associative Caches 9.3.5 Comparison of Cache Structures 9.3.6 Comparison of Replacement Policies 9.3.7 Cache Selection Methodology 9.4 Cache System Architecture 9.4.1 Window Aggregator 9.4.2 Compressor & Hasher 9.4.3 Collision Detector 9.4.4 Collision Resolver 9.4.5 Cache 9.5 Experiments 9.5.1 Experimental Setup 9.5.2 Resource Utilization and Parameter Tuning 9.5.3 Datasets 9.5.4 Benchmarks on Synthetic Data 9.5.5 Benchmarks on Real Data 9.6 Summary 10 THE FULL PICTURE 10.1 System Architecture 10.2 Benchmarks 10.3 Meeting the Objectives III Conclusion 11 SUMMARY AND OUTLOOK ON FUTURE RESEARCH 11.1 Summary 11.2 Future Work BIBLIOGRAPHY LIST OF FIGURES LIST OF TABLE
    corecore