388 research outputs found

    LEGaTO: first steps towards energy-efficient toolset for heterogeneous computing

    LEGaTO is a three-year EU H2020 project which started in December 2017. The LEGaTO project will leverage task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC. Peer Reviewed. Postprint (author's final draft).

    A scalable matrix computing unit architecture for FPGA and SCUMO user design interface

    High-dimensional matrix algebra is essential in numerous signal processing and machine learning algorithms. This work describes a scalable square matrix-computing unit designed on the basis of circulant matrices. It optimizes the data flow for any sequence of matrix operations, removing the need to move intermediate results, and supports each operation in direct or transposed form (the transpose operation only requires a data-addressing modification). The supported matrix operations are: matrix-by-matrix addition, subtraction, dot product and multiplication; matrix-by-vector multiplication; and matrix-by-scalar multiplication. The proposed architecture is fully scalable, with the maximum matrix dimension limited only by the available resources. In addition, a design environment is developed that assists the designer, through a friendly interface, from the customization of the hardware computing unit to the generation of the final synthesizable IP core. For N x N matrices, the architecture requires N ALU-RAM blocks and performs O(N^2) operations, requiring N^2 + 7 and N + 7 clock cycles for matrix-matrix and matrix-vector operations, respectively. For the tested Virtex-7 FPGA device, computation on 500 x 500 matrices allows a maximum clock frequency of 346 MHz, achieving an overall performance of 173 GOPS. This architecture shows higher performance than other state-of-the-art matrix computing units.
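
    As a quick sanity check on the figures quoted above, the following sketch reproduces the 173 GOPS number for N = 500 at 346 MHz. The N^2 + 7 and N + 7 latencies come from the abstract; the one-operation-per-ALU-per-cycle assumption is our reading, not a statement from the paper.

        # Back-of-the-envelope model of the unit's timing claims.
        def matrix_matrix_cycles(n: int) -> int:
            return n * n + 7      # stated latency of an N x N matrix-matrix operation

        def matrix_vector_cycles(n: int) -> int:
            return n + 7          # stated latency of an N x N matrix-vector operation

        def throughput_gops(n: int, clock_mhz: float) -> float:
            return n * clock_mhz / 1e3    # N ALU-RAM blocks, assumed 1 op/cycle each

        print(matrix_matrix_cycles(500))       # 250007 cycles per 500 x 500 matmul
        print(throughput_gops(500, 346.0))     # 173.0 GOPS, matching the abstract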

    Predictive control using an FPGA with application to aircraft control

    Alternative and more efficient computational methods can extend the applicability of MPC to systems with tight real-time requirements. This paper presents a "system-on-a-chip" MPC system, implemented on a field programmable gate array (FPGA), consisting of a sparse structure-exploiting primal-dual interior point (PDIP) QP solver for MPC reference tracking and a fast gradient QP solver for steady-state target calculation. A parallel reduced-precision iterative solver is used to accelerate the solution of the set of linear equations forming the computational bottleneck of the PDIP algorithm. A numerical study of the effect of reducing the number of iterations highlights the effectiveness of the approach. The system is demonstrated with an FPGA-in-the-loop testbench controlling a nonlinear simulation of a large airliner. This study considers many more manipulated inputs than any previous FPGA-based MPC implementation, yet the design comfortably fits into a mid-range FPGA, and the controller compares well in terms of solution quality and latency to state-of-the-art QP solvers running on a standard PC. This work was supported by EPSRC (Grants EP/G030308/1, EP/G031576/1 and EP/I012036/1) and the EU FP7 Project EMBOCON (grant agreement FP7-ICT-2009-4 248940), as well as industrial support from Xilinx, The MathWorks, and the European Space Agency. This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at: http://dx.doi.org/10.1109/TCST.2013.2271791. Copyright (c) 2014 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].
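
    The reduced-precision idea can be illustrated with mixed-precision iterative refinement: solve cheaply in low precision, then correct the residual in high precision. The NumPy sketch below shows only this precision/iteration trade-off; it is not the paper's parallel solver, and the test matrix and refinement count are our own choices.

        import numpy as np

        def mixed_precision_solve(A, b, refinements=3):
            # Low-precision "factorization" (an explicit inverse as a stand-in).
            A32 = A.astype(np.float32)
            lu = np.linalg.inv(A32)
            x = (lu @ b.astype(np.float32)).astype(np.float64)
            for _ in range(refinements):
                r = b - A @ x                                   # residual in double precision
                x += (lu @ r.astype(np.float32)).astype(np.float64)  # cheap correction
            return x

        A = np.random.randn(50, 50) + 50 * np.eye(50)   # well-conditioned test system
        b = np.random.randn(50)
        x = mixed_precision_solve(A, b)
        print(np.linalg.norm(A @ x - b))                # residual shrinks per refinement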

    Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing

    Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lend themselves well to high-performance implementations. Many applications that depend on matrix multiplication can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different application phases or depend on input data, rendering constant-precision solutions ineffective. BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing, previously utilized the excellent binary-operation performance of FPGAs to offer matrix multiplication performance that scales with the required precision and parallelism. We show how BISMO can be scaled up on Xilinx FPGAs using an arithmetic architecture that better utilizes 6-LUTs. The improved BISMO achieves a peak performance of 15.4 binary TOPS on the Ultra96 board with a Xilinx UltraScale+ MPSoC. Comment: Invited paper at ACM TRETS as an extension of the FPL'18 paper arXiv:1806.0886
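
    The bit-serial decomposition BISMO builds on can be sketched in a few lines: an integer matrix product is a weighted sum of one-bit matrix products, so work scales with the operand bit widths. A minimal NumPy sketch for unsigned operands follows; signs, tiling, and the hardware mapping are out of scope here.

        import numpy as np

        def bit_serial_matmul(A, B, a_bits, b_bits):
            # Sum of binary matmuls, one per pair of bit planes, shifted into place.
            acc = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
            for i in range(a_bits):
                Ai = (A >> i) & 1              # i-th bit plane of A
                for j in range(b_bits):
                    Bj = (B >> j) & 1          # j-th bit plane of B
                    acc += (Ai @ Bj) << (i + j)
            return acc

        A = np.random.randint(0, 8, (4, 6))    # 3-bit operands
        B = np.random.randint(0, 4, (6, 5))    # 2-bit operands
        assert np.array_equal(bit_serial_matmul(A, B, 3, 2), A @ B)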

    HPIPE: Heterogeneous Layer-Pipelined and Sparse-Aware CNN Inference for FPGAs

    We present both a novel Convolutional Neural Network (CNN) accelerator architecture and a network compiler for FPGAs that outperforms all prior work. Instead of having generic processing elements that together process one layer at a time, our network compiler statically partitions the available device resources and builds custom-tailored hardware for each layer of a CNN. By building hardware for each layer, we can pack our controllers into fewer lookup tables and use dedicated routing. These efficiencies enable our accelerator to utilize 2x the DSPs and operate at more than 2x the frequency of prior work on sparse CNN acceleration on FPGAs. We evaluate the performance of our architecture on both sparse ResNet-50 and dense MobileNet ImageNet classifiers on a Stratix 10 2800 FPGA. We find that the sparse ResNet-50 model achieves a throughput of 4550 images/s at a batch size of 1, which is nearly 4x the throughput of NVIDIA's fastest machine-learning-targeted GPU, the V100, and outperforms all prior work on FPGAs. Comment: 8 pages, 11 figures
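
    A toy software analogue of the sparse-aware principle: because hardware is built per layer, zero weights can be eliminated at compile time rather than skipped at run time. The sketch below is our illustration, not the HPIPE compiler; it precomputes each filter's nonzero positions once, and only those positions do work during inference.

        import numpy as np

        def compile_filter(w):
            idx = np.flatnonzero(w)            # decided once, like static hardware
            return idx, w.flat[idx]

        def sparse_dot(compiled, x):
            idx, vals = compiled
            return x.flat[idx] @ vals          # only nonzero weights consume work

        w = np.array([[0.0, 1.5, 0.0], [0.0, 0.0, -2.0], [0.5, 0.0, 0.0]])
        x = np.random.randn(3, 3)
        c = compile_filter(w)
        assert np.isclose(sparse_dot(c, x), np.sum(w * x))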

    Productively scaling hardware designs over increasing resources using a systematic design analysis approach

    As processor development shifts from strict single-core frequency scaling to heterogeneous resource scaling, two important considerations require evaluation: first, how to design systems with an increasing amount of heterogeneous resources, and second, how to maintain a designer's productivity as the number of possible configurations grows. It is therefore necessary to determine what useful information can be gathered from existing designs to help predict or identify a design's potential scalability, as well as to identify which routine tasks can be automated to improve a designer's productivity. Moreover, once this information is collected, how can it be conveyed to the designer such that it increases overall productivity when implementing the design over increasing amounts of resources? This research examines various approaches to analyzing designs and attempts to distribute an application efficiently across a heterogeneous cluster of computing resources through the use of a Systematic Design Analysis flow and an assortment of productivity tools. These tools provide the designer with projections of the amount of resources needed to scale an existing design to a specified performance, as well as projections of the performance attainable with a specified amount of resources. This is accomplished through the combination of static HDL profiling, component synthesis resource utilization, and runtime performance monitoring. For evaluation, four case studies demonstrate the proposed flow's scalability on a small-scale cluster of FPGAs. The results are highly favorable, providing orders-of-magnitude speedup with minimal intervention from the designer.
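
    The kind of projection such a flow produces can be sketched with simple arithmetic: given one kernel instance's measured throughput and synthesized resource cost, estimate how many instances and devices a performance target requires, assuming the near-linear scaling the case studies report. All concrete numbers below are made up for illustration.

        def project_resources(target_perf, perf_per_unit, luts_per_unit, luts_per_fpga):
            # Ceiling division: how many instances, then how many devices.
            units = -(-target_perf // perf_per_unit)
            fpgas = -(-(units * luts_per_unit) // luts_per_fpga)
            return units, fpgas

        # e.g. target 10 Gop/s; one instance gives 0.8 Gop/s and uses 21k LUTs
        units, fpgas = project_resources(10_000, 800, 21_000, 300_000)
        print(units, fpgas)    # 13 instances -> 1 FPGA at these hypothetical numbers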

    On detection of OFDM signals for cognitive radio applications

    As the requirement for wireless telecommunications services continues to grow, it has become increasingly important to ensure that the Radio Frequency (RF) spectrum is managed efficiently. As a result of the current spectrum allocation policy, it has been found that portions of RF spectrum belonging to licensed users are often severely underutilised at particular times and geographical locations. Awareness of this problem has led to the development of Dynamic Spectrum Access (DSA) and Cognitive Radio (CR) as possible solutions. In one variation of the shared-use model for DSA, it is proposed that the inefficient use of licensed spectrum could be overcome by enabling unlicensed users to opportunistically access the spectrum when the licensed user is not transmitting. In order for an unlicensed device to make such decisions, it must be aware of its own RF environment, and it has therefore been proposed that DSA could be enabled using CR.
    One approach that has been identified to allow the CR to gain information about its operating environment is spectrum sensing. An interesting solution that has been identified for spectrum sensing is cyclostationary detection, which exploits the inherent periodicity of the second-order statistics of many communications signals. One of the most common modulation formats in use today is Orthogonal Frequency Division Multiplexing (OFDM), which exhibits cyclostationarity due to the addition of a Cyclic Prefix (CP). This thesis examines several statistical tests for cyclostationarity in OFDM signals that may be used for spectrum sensing in DSA and CR. In particular, focus is placed on statistical tests that rely on estimation of the Cyclic Autocorrelation Function (CAF). Based on splitting the CAF into two complex component functions, several new statistical tests are introduced and are shown to improve detection performance compared to existing algorithms. The performance of each new algorithm is assessed in Additive White Gaussian Noise (AWGN), in impulsive noise, and when subjected to impairments such as multipath fading and Carrier Frequency Offset (CFO). Finally, each algorithm is targeted for Field Programmable Gate Array (FPGA) implementation using a Xilinx 7 series device. To keep resource costs to a minimum, it is suggested that the new algorithms be implemented on the FPGA using hardware sharing, and a simple mathematical rearrangement of certain test statistics is proposed to circumvent a costly division operation.
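
    The core quantity behind these tests, the CAF of a CP-OFDM signal, is easy to sketch: at a lag equal to the useful symbol length, the CAF peaks at cyclic frequencies that are multiples of the inverse symbol duration. The NumPy sketch below uses a simple CAF estimator and our own toy parameters; the thesis's actual test statistics are built on top of such estimates.

        import numpy as np

        def caf_estimate(x, alpha, lag):
            # Sample CAF: time-average of the lag product modulated at cyclic freq alpha.
            n = np.arange(len(x) - lag)
            prod = x[n + lag] * np.conj(x[n])
            return np.mean(prod * np.exp(-2j * np.pi * alpha * n))

        nu, ncp, nsym = 64, 16, 200                 # useful length, CP length, symbols
        syms = (np.random.randn(nsym, nu) + 1j * np.random.randn(nsym, nu)) / np.sqrt(2)
        td = np.fft.ifft(syms, axis=1) * np.sqrt(nu)
        x = np.hstack([td[:, -ncp:], td]).ravel()   # prepend cyclic prefix, serialize
        a0 = 1.0 / (nu + ncp)                       # fundamental cyclic frequency
        print(abs(caf_estimate(x, a0, nu)))         # clear peak when OFDM is present
        print(abs(caf_estimate(np.random.randn(len(x)) + 0j, a0, nu)))  # ~0 for noise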

    Methodology for complex dataflow application development

    This thesis addresses problems inherent to the development of complex applications for reconfigurable systems. Many projects fail to complete or take much longer than originally estimated when they rely on traditional iterative software development processes typically used with conventional computers. Even though designer productivity can be increased by abstract programming and execution models, e.g., dataflow, development methodologies that consider the specific properties of reconfigurable systems do not exist. The first contribution of this thesis is a design methodology to facilitate systematic development of complex applications using reconfigurable hardware in the context of High-Performance Computing (HPC). The proposed methodology is built upon a careful analysis of the original application, a software model of the intended hardware system, an analytical prediction of performance and on-chip area usage, and an iterative architectural refinement to resolve identified bottlenecks before writing a single line of code targeting the reconfigurable hardware. It is successfully validated using two real applications, both of which achieve state-of-the-art performance. The second contribution extends this methodology to provide portability between devices in two steps. First, additional tool support for contemporary multi-die Field-Programmable Gate Arrays (FPGAs) is developed, including an algorithm to automatically map logical memories to heterogeneous physical memories with special attention to die boundaries. Only the proposed algorithm managed to successfully place and route all designs used in the evaluation, while the second-best algorithm failed on one third of all large applications. Second, best practices for performance portability between different FPGA devices are collected and evaluated on a financial use case, showing efficient resource usage on five different platforms. The third contribution applies the extended methodology to a real, highly demanding emerging application from the radiotherapy domain: a Monte Carlo based simulation of dose accumulation in human tissue is accelerated using the proposed methodology to meet the real-time requirements of adaptive radiotherapy. Open Access.
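
    The analytical-prediction step can be illustrated with a first-order dataflow performance model: a kernel is bounded either by compute or by memory bandwidth, so an estimate needs only the operation count, data volume, clock, and link bandwidth. The sketch below and its numbers are illustrative, not taken from the thesis.

        def predict_runtime_s(ops, bytes_moved, ops_per_cycle, clock_hz, bw_bytes_per_s):
            compute_s = ops / (ops_per_cycle * clock_hz)   # fully pipelined compute time
            memory_s = bytes_moved / bw_bytes_per_s        # streaming transfer time
            return max(compute_s, memory_s)                # dataflow overlaps the two

        # e.g. 1e12 ops, 0.5 TB streamed, 512 ops/cycle at 300 MHz, 38 GB/s DDR link
        t = predict_runtime_s(1e12, 5e11, 512, 3e8, 38e9)
        print(f"{t:.2f} s")                                # memory-bound: ~13.2 s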