663 research outputs found
Towards Lattice Quantum Chromodynamics on FPGA devices
In this paper we describe a single-node, double precision Field Programmable
Gate Array (FPGA) implementation of the Conjugate Gradient algorithm in the
context of Lattice Quantum Chromodynamics. As a benchmark of our proposal we
invert numerically the Dirac-Wilson operator on a 4-dimensional grid on three
Xilinx hardware solutions: Zynq Ultrascale+ evaluation board, the Alveo U250
accelerator and the largest device available on the market, the VU13P device.
In our implementation we separate software/hardware parts in such a way that
the entire multiplication by the Dirac operator is performed in hardware, and
the rest of the algorithm runs on the host. We find out that the FPGA
implementation can offer a performance comparable with that obtained using
current CPU or Intel's many core Xeon Phi accelerators. A possible multiple
node FPGA-based system is discussed and we argue that power-efficient High
Performance Computing (HPC) systems can be implemented using FPGA devices only.Comment: 17 pages, 4 figure
High Performance Reconfigurable Computing for Linear Algebra: Design and Performance Analysis
Field Programmable Gate Arrays (FPGAs) enable powerful performance acceleration for scientific computations because of their intrinsic parallelism, pipeline ability, and flexible architecture. This dissertation explores the computational power of FPGAs for an important scientific application: linear algebra. First of all, optimized linear algebra subroutines are presented based on enhancements to both algorithms and hardware architectures. Compared to microprocessors, these routines achieve significant speedup. Second, computing with mixed-precision data on FPGAs is proposed for higher performance. Experimental analysis shows that mixed-precision algorithms on FPGAs can achieve the high performance of using lower-precision data while keeping higher-precision accuracy for finding solutions of linear equations. Third, an execution time model is built for reconfigurable computers (RC), which plays an important role in performance analysis and optimal resource utilization of FPGAs. The accuracy and efficiency of parallel computing performance models often depend on mean maximum computations. Despite significant prior work, there have been no sufficient mathematical tools for this important calculation. This work presents an Effective Mean Maximum Approximation method, which is more general, accurate, and efficient than previous methods. Together, these research results help address how to make linear algebra applications perform better on high performance reconfigurable computing architectures
An Algebraic Framework for the Real-Time Solution of Inverse Problems on Embedded Systems
This article presents a new approach to the real-time solution of inverse
problems on embedded systems. The class of problems addressed corresponds to
ordinary differential equations (ODEs) with generalized linear constraints,
whereby the data from an array of sensors forms the forcing function. The
solution of the equation is formulated as a least squares (LS) problem with
linear constraints. The LS approach makes the method suitable for the explicit
solution of inverse problems where the forcing function is perturbed by noise.
The algebraic computation is partitioned into a initial preparatory step, which
precomputes the matrices required for the run-time computation; and the cyclic
run-time computation, which is repeated with each acquisition of sensor data.
The cyclic computation consists of a single matrix-vector multiplication, in
this manner computation complexity is known a-priori, fulfilling the definition
of a real-time computation. Numerical testing of the new method is presented on
perturbed as well as unperturbed problems; the results are compared with known
analytic solutions and solutions acquired from state-of-the-art implicit
solvers. The solution is implemented with model based design and uses only
fundamental linear algebra; consequently, this approach supports automatic code
generation for deployment on embedded systems. The targeting concept was tested
via software- and processor-in-the-loop verification on two systems with
different processor architectures. Finally, the method was tested on a
laboratory prototype with real measurement data for the monitoring of flexible
structures. The problem solved is: the real-time overconstrained reconstruction
of a curve from measured gradients. Such systems are commonly encountered in
the monitoring of structures and/or ground subsidence.Comment: 24 pages, journal articl
Vector support for multicore processors with major emphasis on configurable multiprocessors
It recently became increasingly difficult to build higher speed uniprocessor chips because of performance degradation and high power consumption. The quadratically increasing circuit complexity forbade the exploration of more instruction-level parallelism (JLP). To continue raising the performance, processor designers then focused on thread-level parallelism (TLP) to realize a new architecture design paradigm. Multicore processor design is the result of this trend. It has proven quite capable in performance increase and provides new opportunities in power management and system scalability. But current multicore processors do not provide powerful vector architecture support which could yield significant speedups for array operations while maintaining arealpower efficiency.
This dissertation proposes and presents the realization of an FPGA-based prototype of a multicore architecture with a shared vector unit (MCwSV). FPGA stands for Filed-Programmable Gate Array. The idea is that rather than improving only scalar or TLP performance, some hardware budget could be used to realize a vector unit to greatly speedup applications abundant in data-level parallelism (DLP). To be realistic, limited by the parallelism in the application itself and by the compiler\u27s vectorizing abilities, most of the general-purpose programs can only be partially vectorized. Thus, for efficient resource usage, one vector unit should be shared by several scalar processors. This approach could also keep the overall budget within acceptable limits. We suggest that this type of vector-unit sharing be established in future multicore chips.
The design, implementation and evaluation of an MCwSV system with two scalar processors and a shared vector unit are presented for FPGA prototyping. The MicroBlaze processor, which is a commercial IP (Intellectual Property) core from Xilinx, is used as the scalar processor; in the experiments the vector unit is connected to a pair of MicroBlaze processors through standard bus interfaces. The overall system is organized in a decoupled and multi-banked structure. This organization provides substantial system scalability and better vector performance. For a given area budget, benchmarks from several areas show that the MCwSV system can provide significant performance increase as compared to a multicore system without a vector unit.
However, a MCwSV system with two MicroBlazes and a shared vector unit is not always an optimized system configuration for various applications with different percentages of vectorization. On the other hand, the MCwSV framework was designed for easy scalability to potentially incorporate various numbers of scalar/vector units and various function units. Also, the flexibility inherent to FPGAs can aid the task of matching target applications. These benefits can be taken into account to create optimized MCwSV systems for various applications. So the work eventually focused on building an architecture design framework incorporating performance and resource management for application-specific MCwSV (AS-MCwSV) systems. For embedded system design, resource usage, power consumption and execution latency are three metrics to be used in design tradeoffs. The product of these metrics is used here to choose the MCwSV system with the smallest value
Precision analysis for hardware acceleration of numerical algorithms
The precision used in an algorithm affects the error and performance of individual computations, the
memory usage, and the potential parallelism for a fixed hardware budget. However, when migrating
an algorithm onto hardware, the potential improvements that can be obtained by tuning the precision
throughout an algorithm to meet a range or error specification are often overlooked; the major reason
is that it is hard to choose a number system which can guarantee any such specification can be met.
Instead, the problem is mitigated by opting to use IEEE standard double precision arithmetic so as to be
‘no worse’ than a software implementation. However, the flexibility in the number representation is one
of the key factors that can be exploited on reconfigurable hardware such as FPGAs, and hence ignoring
this potential significantly limits the performance achievable.
In order to optimise the performance of hardware reliably, we require a method that can tractably
calculate tight bounds for the error or range of any variable within an algorithm, but currently only a
handful of methods to calculate such bounds exist, and these either sacrifice tightness or tractability,
whilst simulation-based methods cannot guarantee the given error estimate. This thesis presents a new
method to calculate these bounds, taking into account both input ranges and finite precision effects,
which we show to be, in general, tighter in comparison to existing methods; this in turn can be used to
tune the hardware to the algorithm specifications.
We demonstrate the use of this software to optimise hardware for various algorithms to accelerate
the solution of a system of linear equations, which forms the basis of many problems in engineering
and science, and show that significant performance gains can be obtained by using this new approach in
conjunction with more traditional hardware optimisations
Gate-Level Simulation of Quantum Circuits
While thousands of experimental physicists and chemists are currently trying
to build scalable quantum computers, it appears that simulation of quantum
computation will be at least as critical as circuit simulation in classical
VLSI design. However, since the work of Richard Feynman in the early 1980s
little progress was made in practical quantum simulation. Most researchers
focused on polynomial-time simulation of restricted types of quantum circuits
that fall short of the full power of quantum computation. Simulating quantum
computing devices and useful quantum algorithms on classical hardware now
requires excessive computational resources, making many important simulation
tasks infeasible. In this work we propose a new technique for gate-level
simulation of quantum circuits which greatly reduces the difficulty and cost of
such simulations. The proposed technique is implemented in a simulation tool
called the Quantum Information Decision Diagram (QuIDD) and evaluated by
simulating Grover's quantum search algorithm. The back-end of our package,
QuIDD Pro, is based on Binary Decision Diagrams, well-known for their ability
to efficiently represent many seemingly intractable combinatorial structures.
This reliance on a well-established area of research allows us to take
advantage of existing software for BDD manipulation and achieve unparalleled
empirical results for quantum simulation
Nonlinear predictive control on a heterogeneous computing platform
Nonlinear Model Predictive Control (NMPC) is an advanced control technique that often relies on computationally demanding optimization and integration algorithms. This paper proposes and investigates a heterogeneous hardware implementation of an NMPC controller based on an interior point algorithm. The proposed implementation provides flexibility of splitting the workload between a general-purpose CPU with a fixed architecture and a field-programmable gate array (FPGA) to trade off contradicting design objectives, namely performance and computational resource usage. A new way of exploiting the structure of the Karush-Kuhn-Tucker (KKT) matrix yields significant memory savings, which is crucial for reconfigurable hardware. For the considered case study, a 10x memory savings compared to existing approaches and a 10x speedup over a software implementation are reported. The proposed implementation can be tested from Matlab using a new release of the Protoip software tool, which is another contribution of the paper. Protoip abstracts many low-level details of heterogeneous hardware programming and allows quick prototyping and processor-in-the-loop verification of heterogeneous hardware implementations
Custom optimization algorithms for efficient hardware implementation
The focus is on real-time optimal decision making with application in advanced control
systems. These computationally intensive schemes, which involve the repeated solution of
(convex) optimization problems within a sampling interval, require more efficient computational
methods than currently available for extending their application to highly dynamical
systems and setups with resource-constrained embedded computing platforms.
A range of techniques are proposed to exploit synergies between digital hardware, numerical
analysis and algorithm design. These techniques build on top of parameterisable
hardware code generation tools that generate VHDL code describing custom computing
architectures for interior-point methods and a range of first-order constrained optimization
methods. Since memory limitations are often important in embedded implementations we
develop a custom storage scheme for KKT matrices arising in interior-point methods for
control, which reduces memory requirements significantly and prevents I/O bandwidth
limitations from affecting the performance in our implementations. To take advantage of
the trend towards parallel computing architectures and to exploit the special characteristics
of our custom architectures we propose several high-level parallel optimal control
schemes that can reduce computation time. A novel optimization formulation was devised
for reducing the computational effort in solving certain problems independent of the computing
platform used. In order to be able to solve optimization problems in fixed-point
arithmetic, which is significantly more resource-efficient than floating-point, tailored linear
algebra algorithms were developed for solving the linear systems that form the computational
bottleneck in many optimization methods. These methods come with guarantees
for reliable operation. We also provide finite-precision error analysis for fixed-point implementations
of first-order methods that can be used to minimize the use of resources while
meeting accuracy specifications. The suggested techniques are demonstrated on several
practical examples, including a hardware-in-the-loop setup for optimization-based control
of a large airliner.Open Acces
- …