663 research outputs found

    Towards Lattice Quantum Chromodynamics on FPGA devices

    Get PDF
    In this paper we describe a single-node, double precision Field Programmable Gate Array (FPGA) implementation of the Conjugate Gradient algorithm in the context of Lattice Quantum Chromodynamics. As a benchmark of our proposal we invert numerically the Dirac-Wilson operator on a 4-dimensional grid on three Xilinx hardware solutions: Zynq Ultrascale+ evaluation board, the Alveo U250 accelerator and the largest device available on the market, the VU13P device. In our implementation we separate software/hardware parts in such a way that the entire multiplication by the Dirac operator is performed in hardware, and the rest of the algorithm runs on the host. We find out that the FPGA implementation can offer a performance comparable with that obtained using current CPU or Intel's many core Xeon Phi accelerators. A possible multiple node FPGA-based system is discussed and we argue that power-efficient High Performance Computing (HPC) systems can be implemented using FPGA devices only.Comment: 17 pages, 4 figure

    High Performance Reconfigurable Computing for Linear Algebra: Design and Performance Analysis

    Get PDF
    Field Programmable Gate Arrays (FPGAs) enable powerful performance acceleration for scientific computations because of their intrinsic parallelism, pipeline ability, and flexible architecture. This dissertation explores the computational power of FPGAs for an important scientific application: linear algebra. First of all, optimized linear algebra subroutines are presented based on enhancements to both algorithms and hardware architectures. Compared to microprocessors, these routines achieve significant speedup. Second, computing with mixed-precision data on FPGAs is proposed for higher performance. Experimental analysis shows that mixed-precision algorithms on FPGAs can achieve the high performance of using lower-precision data while keeping higher-precision accuracy for finding solutions of linear equations. Third, an execution time model is built for reconfigurable computers (RC), which plays an important role in performance analysis and optimal resource utilization of FPGAs. The accuracy and efficiency of parallel computing performance models often depend on mean maximum computations. Despite significant prior work, there have been no sufficient mathematical tools for this important calculation. This work presents an Effective Mean Maximum Approximation method, which is more general, accurate, and efficient than previous methods. Together, these research results help address how to make linear algebra applications perform better on high performance reconfigurable computing architectures

    An Algebraic Framework for the Real-Time Solution of Inverse Problems on Embedded Systems

    Full text link
    This article presents a new approach to the real-time solution of inverse problems on embedded systems. The class of problems addressed corresponds to ordinary differential equations (ODEs) with generalized linear constraints, whereby the data from an array of sensors forms the forcing function. The solution of the equation is formulated as a least squares (LS) problem with linear constraints. The LS approach makes the method suitable for the explicit solution of inverse problems where the forcing function is perturbed by noise. The algebraic computation is partitioned into a initial preparatory step, which precomputes the matrices required for the run-time computation; and the cyclic run-time computation, which is repeated with each acquisition of sensor data. The cyclic computation consists of a single matrix-vector multiplication, in this manner computation complexity is known a-priori, fulfilling the definition of a real-time computation. Numerical testing of the new method is presented on perturbed as well as unperturbed problems; the results are compared with known analytic solutions and solutions acquired from state-of-the-art implicit solvers. The solution is implemented with model based design and uses only fundamental linear algebra; consequently, this approach supports automatic code generation for deployment on embedded systems. The targeting concept was tested via software- and processor-in-the-loop verification on two systems with different processor architectures. Finally, the method was tested on a laboratory prototype with real measurement data for the monitoring of flexible structures. The problem solved is: the real-time overconstrained reconstruction of a curve from measured gradients. Such systems are commonly encountered in the monitoring of structures and/or ground subsidence.Comment: 24 pages, journal articl

    Vector support for multicore processors with major emphasis on configurable multiprocessors

    Get PDF
    It recently became increasingly difficult to build higher speed uniprocessor chips because of performance degradation and high power consumption. The quadratically increasing circuit complexity forbade the exploration of more instruction-level parallelism (JLP). To continue raising the performance, processor designers then focused on thread-level parallelism (TLP) to realize a new architecture design paradigm. Multicore processor design is the result of this trend. It has proven quite capable in performance increase and provides new opportunities in power management and system scalability. But current multicore processors do not provide powerful vector architecture support which could yield significant speedups for array operations while maintaining arealpower efficiency. This dissertation proposes and presents the realization of an FPGA-based prototype of a multicore architecture with a shared vector unit (MCwSV). FPGA stands for Filed-Programmable Gate Array. The idea is that rather than improving only scalar or TLP performance, some hardware budget could be used to realize a vector unit to greatly speedup applications abundant in data-level parallelism (DLP). To be realistic, limited by the parallelism in the application itself and by the compiler\u27s vectorizing abilities, most of the general-purpose programs can only be partially vectorized. Thus, for efficient resource usage, one vector unit should be shared by several scalar processors. This approach could also keep the overall budget within acceptable limits. We suggest that this type of vector-unit sharing be established in future multicore chips. The design, implementation and evaluation of an MCwSV system with two scalar processors and a shared vector unit are presented for FPGA prototyping. The MicroBlaze processor, which is a commercial IP (Intellectual Property) core from Xilinx, is used as the scalar processor; in the experiments the vector unit is connected to a pair of MicroBlaze processors through standard bus interfaces. The overall system is organized in a decoupled and multi-banked structure. This organization provides substantial system scalability and better vector performance. For a given area budget, benchmarks from several areas show that the MCwSV system can provide significant performance increase as compared to a multicore system without a vector unit. However, a MCwSV system with two MicroBlazes and a shared vector unit is not always an optimized system configuration for various applications with different percentages of vectorization. On the other hand, the MCwSV framework was designed for easy scalability to potentially incorporate various numbers of scalar/vector units and various function units. Also, the flexibility inherent to FPGAs can aid the task of matching target applications. These benefits can be taken into account to create optimized MCwSV systems for various applications. So the work eventually focused on building an architecture design framework incorporating performance and resource management for application-specific MCwSV (AS-MCwSV) systems. For embedded system design, resource usage, power consumption and execution latency are three metrics to be used in design tradeoffs. The product of these metrics is used here to choose the MCwSV system with the smallest value

    Precision analysis for hardware acceleration of numerical algorithms

    No full text
    The precision used in an algorithm affects the error and performance of individual computations, the memory usage, and the potential parallelism for a fixed hardware budget. However, when migrating an algorithm onto hardware, the potential improvements that can be obtained by tuning the precision throughout an algorithm to meet a range or error specification are often overlooked; the major reason is that it is hard to choose a number system which can guarantee any such specification can be met. Instead, the problem is mitigated by opting to use IEEE standard double precision arithmetic so as to be ‘no worse’ than a software implementation. However, the flexibility in the number representation is one of the key factors that can be exploited on reconfigurable hardware such as FPGAs, and hence ignoring this potential significantly limits the performance achievable. In order to optimise the performance of hardware reliably, we require a method that can tractably calculate tight bounds for the error or range of any variable within an algorithm, but currently only a handful of methods to calculate such bounds exist, and these either sacrifice tightness or tractability, whilst simulation-based methods cannot guarantee the given error estimate. This thesis presents a new method to calculate these bounds, taking into account both input ranges and finite precision effects, which we show to be, in general, tighter in comparison to existing methods; this in turn can be used to tune the hardware to the algorithm specifications. We demonstrate the use of this software to optimise hardware for various algorithms to accelerate the solution of a system of linear equations, which forms the basis of many problems in engineering and science, and show that significant performance gains can be obtained by using this new approach in conjunction with more traditional hardware optimisations

    Gate-Level Simulation of Quantum Circuits

    Get PDF
    While thousands of experimental physicists and chemists are currently trying to build scalable quantum computers, it appears that simulation of quantum computation will be at least as critical as circuit simulation in classical VLSI design. However, since the work of Richard Feynman in the early 1980s little progress was made in practical quantum simulation. Most researchers focused on polynomial-time simulation of restricted types of quantum circuits that fall short of the full power of quantum computation. Simulating quantum computing devices and useful quantum algorithms on classical hardware now requires excessive computational resources, making many important simulation tasks infeasible. In this work we propose a new technique for gate-level simulation of quantum circuits which greatly reduces the difficulty and cost of such simulations. The proposed technique is implemented in a simulation tool called the Quantum Information Decision Diagram (QuIDD) and evaluated by simulating Grover's quantum search algorithm. The back-end of our package, QuIDD Pro, is based on Binary Decision Diagrams, well-known for their ability to efficiently represent many seemingly intractable combinatorial structures. This reliance on a well-established area of research allows us to take advantage of existing software for BDD manipulation and achieve unparalleled empirical results for quantum simulation

    Nonlinear predictive control on a heterogeneous computing platform

    Get PDF
    Nonlinear Model Predictive Control (NMPC) is an advanced control technique that often relies on computationally demanding optimization and integration algorithms. This paper proposes and investigates a heterogeneous hardware implementation of an NMPC controller based on an interior point algorithm. The proposed implementation provides flexibility of splitting the workload between a general-purpose CPU with a fixed architecture and a field-programmable gate array (FPGA) to trade off contradicting design objectives, namely performance and computational resource usage. A new way of exploiting the structure of the Karush-Kuhn-Tucker (KKT) matrix yields significant memory savings, which is crucial for reconfigurable hardware. For the considered case study, a 10x memory savings compared to existing approaches and a 10x speedup over a software implementation are reported. The proposed implementation can be tested from Matlab using a new release of the Protoip software tool, which is another contribution of the paper. Protoip abstracts many low-level details of heterogeneous hardware programming and allows quick prototyping and processor-in-the-loop verification of heterogeneous hardware implementations

    Custom optimization algorithms for efficient hardware implementation

    No full text
    The focus is on real-time optimal decision making with application in advanced control systems. These computationally intensive schemes, which involve the repeated solution of (convex) optimization problems within a sampling interval, require more efficient computational methods than currently available for extending their application to highly dynamical systems and setups with resource-constrained embedded computing platforms. A range of techniques are proposed to exploit synergies between digital hardware, numerical analysis and algorithm design. These techniques build on top of parameterisable hardware code generation tools that generate VHDL code describing custom computing architectures for interior-point methods and a range of first-order constrained optimization methods. Since memory limitations are often important in embedded implementations we develop a custom storage scheme for KKT matrices arising in interior-point methods for control, which reduces memory requirements significantly and prevents I/O bandwidth limitations from affecting the performance in our implementations. To take advantage of the trend towards parallel computing architectures and to exploit the special characteristics of our custom architectures we propose several high-level parallel optimal control schemes that can reduce computation time. A novel optimization formulation was devised for reducing the computational effort in solving certain problems independent of the computing platform used. In order to be able to solve optimization problems in fixed-point arithmetic, which is significantly more resource-efficient than floating-point, tailored linear algebra algorithms were developed for solving the linear systems that form the computational bottleneck in many optimization methods. These methods come with guarantees for reliable operation. We also provide finite-precision error analysis for fixed-point implementations of first-order methods that can be used to minimize the use of resources while meeting accuracy specifications. The suggested techniques are demonstrated on several practical examples, including a hardware-in-the-loop setup for optimization-based control of a large airliner.Open Acces
    • …
    corecore