
    AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture

    CPU-FPGA heterogeneous architectures are attracting ever-increasing attention in an attempt to advance computational capabilities and energy efficiency in today's datacenters. These architectures provide programmers with the ability to reprogram the FPGAs for flexible acceleration of many workloads. Nonetheless, this advantage is often overshadowed by the poor programmability of FPGAs, which have conventionally been programmed as an RTL design practice. Although recent advances in high-level synthesis (HLS) significantly improve FPGA programmability, programmers are still left facing the challenge of identifying the optimal design configuration in a tremendous design space. This paper aims to address this challenge and pave the path from software programs to high-quality FPGA accelerators. Specifically, we first propose the composable, parallel and pipeline (CPP) microarchitecture as a template for accelerator designs. Such a well-defined template is able to support efficient accelerator designs for a broad class of computation kernels and, more importantly, drastically reduces the design space. Also, we introduce an analytical model to capture the performance and resource trade-offs among different design configurations of the CPP microarchitecture, which lays the foundation for fast design space exploration. On top of the CPP microarchitecture and its analytical model, we develop the AutoAccel framework to fully automate accelerator generation. AutoAccel accepts a software program as input and performs a series of code transformations, based on the results of the analytical-model-based design space exploration, to construct the desired CPP microarchitecture. Our experiments show that the AutoAccel-generated accelerators outperform their corresponding software implementations by an average of 72x for a broad class of computation kernels.
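
    To make analytical-model-based design space exploration concrete, the following is a minimal C++ sketch of how a tool of this kind can sweep a reduced design space against a cost model. The cost function, the DSP budget, and the per-PE resource count below are invented assumptions for illustration only; they are not AutoAccel's actual model.

        // Illustrative sketch of analytical-model-based DSE. The cost model and
        // resource numbers are invented assumptions, NOT AutoAccel's real model.
        #include <algorithm>
        #include <cstdio>

        struct Design { int pe; int tile; long cycles; };

        int main() {
            const long N = 1L << 20;      // assumed total workload size
            const int DSP_BUDGET = 2048;  // assumed FPGA resource budget
            Design best{0, 0, -1};
            for (int pe = 1; pe <= 256; pe *= 2) {             // parallel PEs
                for (int tile = 64; tile <= 4096; tile *= 2) { // on-chip tile size
                    if (pe * 8 > DSP_BUDGET) continue;         // assumed 8 DSPs/PE
                    long tiles   = (N + tile - 1) / tile;
                    long load    = tile / 16;  // assumed 16 words/cycle DRAM burst
                    long compute = tile / pe;  // work divided over the PEs
                    long store   = tile / 16;
                    // With double buffering, load/compute/store overlap, so the
                    // slowest stage dominates each tile iteration.
                    long cycles = tiles * std::max({load, compute, store});
                    if (best.cycles < 0 || cycles < best.cycles)
                        best = {pe, tile, cycles};
                }
            }
            std::printf("best: %d PEs, tile %d, ~%ld cycles\n",
                        best.pe, best.tile, best.cycles);
        }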

    Comprehensive Evaluation of OpenCL-based Convolutional Neural Network Accelerators in Xilinx and Altera FPGAs

    Deep learning has significantly advanced the state of the art in artificial intelligence, gaining wide popularity in both industry and academia. Special interest surrounds Convolutional Neural Networks (CNNs), which take inspiration from the hierarchical structure of the visual cortex to form deep layers of convolutional operations, along with fully connected classifiers. Hardware implementations of these deep CNN architectures are challenged by memory bottlenecks, as the many convolutional and fully connected layers demand a large amount of communication for parallel computation. Multi-core CPU based solutions have demonstrated their inadequacy for this problem due to the memory wall and low parallelism. Many-core GPU architectures show superior performance, but they consume high power and also have memory constraints due to inconsistencies between cache and main memory. FPGA design solutions are also actively being explored, as they allow implementing the memory hierarchy using embedded BlockRAM. This boosts the parallel use of shared memory elements between multiple processing units, avoiding data replication and inconsistencies, and makes FPGAs potentially powerful solutions for real-time classification with CNNs. Both Altera and Xilinx have adopted the OpenCL co-design framework from the GPU world as a pseudo-automatic development solution for FPGA designs. In this paper, a comprehensive evaluation and comparison of the Altera and Xilinx OpenCL frameworks for a 5-layer deep CNN is presented. Hardware resources, temporal performance and the OpenCL architecture for CNNs are discussed. Xilinx demonstrates faster synthesis, better FPGA resource utilization and more compact boards. Altera provides multi-platform tools, a mature design community and better execution times.
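
    As a flavor of the kernel-side code such frameworks consume, here is a minimal single-channel 2D convolution in HLS-style C++, with directives of the kind the Xilinx and Altera flows expose. All dimensions are invented for illustration; this is a sketch, not the paper's benchmark kernel.

        // Minimal 2D convolution in HLS-style C++ (illustrative sizes). The
        // pragmas ask the tool to pipeline the pixel loop and keep the small
        // weight array in registers / BlockRAM rather than external memory.
        #define H 28
        #define W 28
        #define K 5

        void conv2d(const float in[H][W], const float weight[K][K],
                    float out[H - K + 1][W - K + 1]) {
        #pragma HLS ARRAY_PARTITION variable=weight complete dim=0
            for (int r = 0; r < H - K + 1; ++r) {
                for (int c = 0; c < W - K + 1; ++c) {
        #pragma HLS PIPELINE II=1  // request one output pixel per cycle
                    float acc = 0.0f;
                    for (int i = 0; i < K; ++i)       // fully unrollable K x K
                        for (int j = 0; j < K; ++j)   // multiply-accumulate
                            acc += in[r + i][c + j] * weight[i][j];
                    out[r][c] = acc;
                }
            }
        }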

    Comprehensive Evaluation of OpenCL-Based CNN Implementations for FPGAs

    Deep learning has significantly advanced the state of the art in artificial intelligence, gaining wide popularity in both industry and academia. Special interest surrounds Convolutional Neural Networks (CNNs), which take inspiration from the hierarchical structure of the visual cortex to form deep layers of convolutional operations, along with fully connected classifiers. Hardware implementations of these deep CNN architectures are challenged by memory bottlenecks, as the many convolutional and fully connected layers demand a large amount of communication for parallel computation. Multi-core CPU based solutions have demonstrated their inadequacy for this problem due to the memory wall and low parallelism. Many-core GPU architectures show superior performance, but they consume high power and also have memory constraints due to inconsistencies between cache and main memory. OpenCL is commonly used to describe these architectures for execution on GPGPUs or FPGAs. FPGA design solutions are also actively being explored, as they allow implementing the memory hierarchy using embedded parallel BlockRAMs. This boosts the parallel use of shared memory elements between multiple processing units, avoiding data replication and inconsistencies, and makes FPGAs potentially powerful solutions for real-time classification with CNNs. In this paper, the OpenCL co-design frameworks adopted by both Altera and Xilinx as pseudo-automatic development solutions are evaluated. A comprehensive evaluation and comparison for a 5-layer deep CNN is presented. Hardware resources, temporal performance and the OpenCL architecture for CNNs are discussed. Xilinx demonstrates faster synthesis, better FPGA resource utilization and more compact boards. Altera provides multi-platform tools, a mature design community and better execution times. Funding: Ministerio de Economía y Competitividad TEC2016-77785-
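
    Complementing the kernel sketch above, the host side of both vendor flows uses the standard OpenCL API, with the FPGA-specific twist that kernels are compiled offline and loaded as a binary rather than built at runtime as on GPUs. The sketch below shows that mechanism; error handling is elided, and the binary file and kernel names are placeholders.

        // Minimal OpenCL host-side sketch for an FPGA flow. The .xclbin file
        // name and the "conv2d" kernel name are invented placeholders.
        #include <CL/cl.h>
        #include <fstream>
        #include <vector>

        int main() {
            cl_platform_id platform; cl_device_id device;
            clGetPlatformIDs(1, &platform, nullptr);
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, nullptr);
            cl_context ctx = clCreateContext(nullptr, 1, &device,
                                             nullptr, nullptr, nullptr);
            cl_command_queue q = clCreateCommandQueue(ctx, device, 0, nullptr);

            // Load the offline-compiled bitstream container (placeholder name).
            std::ifstream f("cnn_kernel.xclbin", std::ios::binary);
            std::vector<unsigned char> bin(
                (std::istreambuf_iterator<char>(f)),
                std::istreambuf_iterator<char>());
            const unsigned char* p = bin.data();
            size_t sz = bin.size();
            cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, &sz, &p,
                                                        nullptr, nullptr);
            clBuildProgram(prog, 1, &device, "", nullptr, nullptr);
            cl_kernel k = clCreateKernel(prog, "conv2d", nullptr);
            // ... create buffers, set kernel args, enqueue, read results ...
            clReleaseKernel(k); clReleaseProgram(prog);
            clReleaseCommandQueue(q); clReleaseContext(ctx);
        }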

    High Performance Computing via High Level Synthesis

    As more and more powerful integrated circuits appear on the market, more and more applications, with very different requirements and workloads, are making use of the available computing power. This thesis is devoted in particular to High Performance Computing applications, where those trends are carried to the extreme. In this domain, the primary aspects to be taken into consideration are (1) performance (by definition) and (2) energy consumption (since operational costs dominate over procurement costs). These requirements can be satisfied more easily by deploying heterogeneous platforms, which include CPUs, GPUs and FPGAs to provide a broad range of performance and energy-per-operation choices. In particular, as we will see, FPGAs clearly dominate both CPUs and GPUs in terms of energy, and can provide comparable performance. An important aspect of this trend is of course design technology, because these applications were traditionally programmed in high-level languages, while FPGAs required low-level RTL design. OpenCL (Open Computing Language), developed by the Khronos Group, enables developers to program CPUs, GPUs and, recently, FPGAs using functionally portable (but sadly not performance portable) source code, which creates new possibilities and challenges for both research and industry. FPGAs have always been used for mid-size designs and ASIC prototyping thanks to their energy-efficient and flexible hardware architecture, but their usage requires hardware design knowledge and laborious design cycles. Several approaches have been developed and deployed to address this issue and shorten the gap between software and hardware in the FPGA design flow, in order to enable FPGAs to capture a larger portion of the hardware acceleration market in data centers. Moreover, FPGA usage in data centers is already growing, regardless of and in addition to their use as computational accelerators, because they can be used as high-performance, low-power and secure switches inside data centers. High-Level Synthesis (HLS) is the methodology that enables designers to map their applications onto FPGAs (and ASICs). It synthesizes parallel hardware from a model originally written in C-based programming languages, e.g. C/C++, SystemC and OpenCL. Design space exploration of the variety of implementations that can be obtained from this C model is possible through a wide range of optimization techniques and directives, e.g. to pipeline loops and partition memories into multiple banks, which guide RTL generation toward application-dependent hardware and let designers benefit from the flexible parallel architecture of FPGAs. Model Based Design (MBD) is a high-level and visual process used to generate implementations that solve mathematical problems through a varied set of IP blocks. MBD enables developers with different expertise, e.g. control theory, embedded software development and hardware design, to share a common design framework and contribute to a shared design using the same tool. Simulink, developed by MathWorks, is a model-based design tool for the simulation and development of complex dynamical systems. Moreover, Simulink embedded code generators can produce verified C/C++ and HDL code from the graphical model. This code can be used to program microcontrollers and FPGAs. This PhD thesis presents a study using the automatic code generators of Simulink to target Xilinx FPGAs with both HDL and C/C++ code, to demonstrate the capabilities and challenges of the high-level synthesis process.
    To do so, firstly, the digital signal processing unit of a real-time radar application is developed using Simulink blocks. Secondly, the generated C-based model is used for the high-level synthesis process, and finally the implementation cost of HLS is compared to traditional HDL synthesis using the Xilinx tool chain. As an alternative to the model-based design approach, this work also presents an analysis of FPGA programming via high-level synthesis techniques for computationally intensive algorithms, and demonstrates the importance of HLS by comparing the performance-per-watt of GPUs (NVIDIA) and FPGAs (Xilinx) manufactured in the same node running standard OpenCL benchmarks. We conclude that generating high-quality RTL from an OpenCL model requires a stronger hardware background than the MBD approach; however, the availability of fast and broad design space exploration and the portability of the OpenCL code, e.g. to CPUs and GPUs, motivates FPGA industry leaders to provide users with an OpenCL software development environment that promises FPGA programming in a CPU/GPU-like fashion. Our experiments, through extensive design space exploration (DSE), suggest that FPGAs have higher performance-per-watt than two high-end GPUs manufactured in the same technology (28 nm). Moreover, FPGAs with more available resources and a more modern process (20 nm) can outperform the tested GPUs while consuming much less power, at the cost of more expensive devices. The two directive families mentioned above, loop pipelining and memory banking, are illustrated in the sketch that follows.
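
    A minimal HLS C++ sketch of those two design-space knobs: the unroll factor of 8 is an arbitrary example value that a DSE run would sweep, and the achieved initiation interval ultimately depends on what the tool reports for the floating-point accumulation.

        // Loop pipelining plus memory banking, the two directive families the
        // thesis sweeps during DSE. The factor 8 is an example knob value.
        const int N = 1024;

        float dot(const float a[N], const float b[N]) {
        // Split each array over 8 BlockRAM banks so 8 reads happen per cycle.
        #pragma HLS ARRAY_PARTITION variable=a cyclic factor=8
        #pragma HLS ARRAY_PARTITION variable=b cyclic factor=8
            float acc = 0.0f;
            for (int i = 0; i < N; i += 8) {
        #pragma HLS PIPELINE II=1  // request a new iteration every cycle
                for (int u = 0; u < 8; ++u) {
        #pragma HLS UNROLL         // 8 parallel multiply-accumulates
                    acc += a[i + u] * b[i + u];
                }
            }
            return acc;
        }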

    Distribution of Low Latency Machine Learning Algorithm

    Mobile networks are evolving towards centralization and cloudification while bringing computing power to the edge, opening their scope to a new range of applications. Ultra-low latency is one of the requirements of such applications in the next generation of mobile networks (5G), where deep learning is expected to play a big role. Hence, to enable the usage of deep learning solutions on the edge cloud, ultra-low latency inference must be investigated. The study presented here relies on the usage of an in-house framework (CRUN) that enables the distribution of acceleration in a data center environment. The objective of this thesis is to identify the best solution for the inference of a machine learning algorithm for an anomaly detection application using neural networks in the edge cloud context. To evaluate the results obtained with CRUN, a comparison is also carried out. Five inference solutions were compared using CPUs, GPUs and FPGAs. The results show superior performance in terms of latency for all CRUN experiments, which comprise three cases: the first uses the RTL anomaly detection neural network as a baseline solution, the second uses the same baseline code but unrolls the biggest layer to reduce latency, and the third distributes the neural network across two FPGAs. The requirements for this solution were to obtain a latency between 20 µs and 40 µs for inference time and at least 20000 inferences per second. These goals were categorically fulfilled for all CRUN experiments, which provided 30 µs latency on average, while the second-best solution provided 272 µs.
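
    CRUN itself is an in-house framework, so its API is not reproduced here; the sketch below only illustrates the partitioning idea behind the third case, splitting one fully connected layer across two accelerators by output neurons, with std::thread standing in for the two FPGAs and all sizes invented.

        // Illustrative split of a dense layer across two devices by output
        // neuron range. Threads stand in for the two FPGAs; sizes are invented.
        #include <thread>

        constexpr int IN = 128, OUT = 64;

        void half_layer(const float* x, const float (*w)[IN], float* y,
                        int lo, int hi) {
            for (int o = lo; o < hi; ++o) {      // each device owns a neuron range
                float acc = 0.0f;
                for (int i = 0; i < IN; ++i) acc += w[o][i] * x[i];
                y[o] = acc > 0.0f ? acc : 0.0f;  // ReLU activation
            }
        }

        void layer_on_two_devices(const float* x, const float (*w)[IN], float* y) {
            // In hardware, each half runs on its own FPGA and both halves
            // proceed in parallel; the host only gathers the two result slices.
            std::thread f0(half_layer, x, w, y, 0, OUT / 2);
            std::thread f1(half_layer, x, w, y, OUT / 2, OUT);
            f0.join(); f1.join();
        }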

    Performance and Power Optimization of Multi-kernel Applications on Multi-FPGA Platforms

    The abstract is provided in the attachment.

    Energy-efficient hardware design based on high-level synthesis

    This dissertation describes research activities broadly concerning the area of high-level synthesis (HLS) and, more specifically, the HLS-based design of energy-efficient hardware (HW) accelerators. HW accelerators, mostly implemented on FPGAs, are integral to the heterogeneous architectures employed in modern high performance computing (HPC) systems due to their ability to speed up execution while dramatically reducing the energy consumption of computationally challenging portions of complex applications. Hence, the first activity concerned an HLS-based approach to execute an OpenCL code directly on an FPGA instead of its traditional GPU-based counterpart. Modern FPGAs offer considerable computational capabilities while consuming significantly less power than high-end GPUs. Several different implementations of the K-Nearest Neighbor algorithm were considered on both FPGA- and GPU-based platforms and their performance was compared. The FPGAs were more energy-efficient than the GPUs in all the test cases. Eventually, we were also able to obtain a faster (in terms of execution time) FPGA implementation by using an FPGA-specific OpenCL coding style and suitable HLS directives. The second activity targeted the development of a methodology, complementing HLS, to automatically derive power optimization directives (also known as "power intent") from a system-level design description and use them to drive the design steps after HLS, by producing a directive file written in the Common Power Format (CPF) to achieve power shut-off (PSO) in the case of an ASIC design. The proposed LP-HLS methodology reduces the design effort by enabling designers to infer low-power information from the system-level description of a design rather than at the RTL. This methodology required a SystemC description of a generic power management module to describe the design context of a HW module also modeled in SystemC, along with the development of a tool to automatically produce the CPF file to accomplish PSO. Several test cases were considered to validate the proposed methodology, and the results demonstrated its ability to correctly extract the low-power information and apply it to achieve power optimization in the backend flow.
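
    Since SystemC is a C++ library, the flavor of such a generic power management module can be sketched directly; the module, ports, and idle threshold below are invented for illustration and are not the dissertation's actual model.

        // Minimal SystemC sketch of a generic power-management module that
        // raises a power-shut-off request for an idle HW block. Names and the
        // idle threshold are invented; a tool would map pso_req to CPF intent.
        #include <systemc.h>

        SC_MODULE(PowerManager) {
            sc_in<bool>  clk;
            sc_in<bool>  busy;      // asserted while the managed module works
            sc_out<bool> pso_req;   // power-shut-off request

            int idle_cycles = 0;

            void control() {
                if (busy.read()) idle_cycles = 0;   // activity resets the counter
                else if (++idle_cycles > 16) {      // assumed idle threshold
                    pso_req.write(true);            // request shut-off
                    return;
                }
                pso_req.write(false);
            }

            SC_CTOR(PowerManager) {
                SC_METHOD(control);
                sensitive << clk.pos();             // evaluate on each clock edge
            }
        };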

    Accelerating Deterministic and Stochastic Binarized Neural Networks on FPGAs Using OpenCL

    Recent technological advances have greatly increased the available computing power, memory, and speed of modern Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Field Programmable Gate Arrays (FPGAs). Consequently, the performance and complexity of Artificial Neural Networks (ANNs) are burgeoning. While GPU-accelerated Deep Neural Networks (DNNs) currently offer state-of-the-art performance, they consume large amounts of power. Training such networks on CPUs is inefficient, as data throughput and parallel computation are limited. FPGAs are considered a suitable candidate for performance-critical, low-power systems, e.g. Internet of Things (IoT) edge devices. Using the Xilinx SDAccel or Intel FPGA SDK for OpenCL development environments, networks described using the high-level OpenCL framework can be accelerated on heterogeneous platforms. Moreover, the resource utilization and power consumption of DNNs can be further reduced by utilizing regularization techniques that binarize network weights. In this paper, we introduce, to the best of our knowledge, the first FPGA-accelerated stochastically binarized DNN implementations, and compare them to implementations accelerated on both GPUs and FPGAs. Our networks are trained and benchmarked using the popular MNIST and CIFAR-10 datasets, and achieve near state-of-the-art performance, while offering a >16-fold improvement in power consumption compared to conventional GPU-accelerated networks. Both our FPGA-accelerated deterministic and stochastic BNNs reduce inference times on MNIST and CIFAR-10 by >9.89x and >9.91x, respectively.
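
    The hardware appeal of binarization comes from the XNOR-popcount trick: with weights and activations constrained to {-1, +1} and packed as bits, a 64-element dot product collapses to one XNOR and one population count. The sketch below shows that trick plus the hard-sigmoid stochastic binarization used in BinaryConnect-style training; it is an illustration of the general technique, not the paper's kernels.

        // XNOR-popcount dot product for binarized values packed into a word:
        // bit 1 encodes +1, bit 0 encodes -1.
        #include <algorithm>
        #include <cstdint>
        #include <random>

        int bnn_dot64(uint64_t a, uint64_t w) {
            uint64_t agree = ~(a ^ w);              // XNOR: 1 where signs match
            int pop = __builtin_popcountll(agree);  // GCC/Clang popcount builtin
            return 2 * pop - 64;                    // matches minus mismatches
        }

        // Stochastic binarization at training time: P(bit = 1) follows the
        // hard sigmoid clip((x + 1) / 2, 0, 1).
        int binarize_stochastic(float x, std::mt19937& rng) {
            float p = std::min(1.0f, std::max(0.0f, (x + 1.0f) * 0.5f));
            std::bernoulli_distribution coin(p);
            return coin(rng) ? 1 : 0;
        }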

    Hardware-accelerated image features with subpixel accuracy for SLAM localization and object detection

    The navigation of autonomous systems is becoming more and more complex due to advances in technology and the increasing demands of applications. One of the most critical open issues is the accuracy and robustness of feature-based SLAM localization for three-dimensional SLAM applications. In this work, the optimization of feature detection with subpixel-accurate feature positions for feature-based 6-DoF SLAM methods is investigated. In addition, an extension of the feature descriptor with color information and a subpixel-accurate rotation of the descriptor pattern is evaluated. This work develops the Subpixel-accurate Oriented AGAST and Rotated BRIEF (SOARB) feature extraction method which, despite its efficient and resource-optimized implementation, improves localization and mapping compared to other comparable algorithms. Using a PCIe FPGA accelerator and the Xilinx SDAccel HW-SW co-design environment with OpenCL support, an FPGA-based version of the SOARB algorithm for interfacing with SLAM systems is demonstrated. The hardware implementation uses high-throughput pipeline processing and parallel computation units. For faster processing, the subpixel and bilinear interpolation are performed in fixed-point arithmetic, and the angle calculation is implemented using a CORDIC method. The FPGA implementation of the SOARB algorithm achieves frame rates of 41 frames/s, making it a factor of 2.6x faster than the fastest tested GPU-based OpenCV implementation with subpixel-accurate feature positions. With a low power consumption of 13.7 W for the FPGA component, the overall system power efficiency (frames/s per watt) is increased by a factor of 1.28x compared to a SOARB GPU reference implementation that was also developed. For evaluation, the SOARB algorithm is integrated into the RTAB-Map SLAM system. It achieves average improvements of 22% and 19% in translational and rotational error, respectively, compared to the commonly used ORB feature extraction in tests with image sequences from road traffic. The maximum improvement in root mean square error (RMSE) is 50% for translation and 40% for rotation. To analyze the impact of a descriptor with color information, the SOARB-RGB method is evaluated using the Oxford dataset for affine covariant features, achieving a very good inlier ratio of 99.2% over the first three image comparisons of all datasets.
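
    CORDIC computes an angle with only shifts, adds, and a small arctangent table, which is why it maps so well to FPGA logic. The following is a minimal vectoring-mode sketch of the general method in C++, written in floating point for readability: std::ldexp stands in for the hardware right-shift, and the step constants would be a precomputed table in logic. It illustrates the technique, not the thesis's fixed-point implementation.

        // Vectoring-mode CORDIC: rotate (x, y) toward the positive x-axis
        // while accumulating the applied rotations; the total approximates
        // atan2(y, x) without any multiplier.
        #include <cmath>
        #include <cstdio>

        double cordic_atan2(double y, double x) {
            const double PI = std::acos(-1.0);
            double angle = 0.0;
            // Pre-rotate by +/- pi so the remaining vector lies in x > 0.
            if (x < 0.0) { angle = (y >= 0.0) ? PI : -PI; x = -x; y = -y; }
            for (int i = 0; i < 16; ++i) {          // ~16 bits of angle accuracy
                double dx = std::ldexp(x, -i);      // x >> i in fixed-point HW
                double dy = std::ldexp(y, -i);      // y >> i
                double step = std::atan(std::ldexp(1.0, -i)); // table constant
                if (y >= 0.0) { x += dy; y -= dx; angle += step; }
                else          { x -= dy; y += dx; angle -= step; }
            }
            return angle;
        }

        int main() {
            std::printf("cordic: %f  libm: %f\n",
                        cordic_atan2(1.0, -2.0), std::atan2(1.0, -2.0));
        }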