A space-efficient quantum computer simulator suitable for high-speed FPGA implementation
Conventional vector-based simulators for quantum computers are quite limited
in the size of the quantum circuits they can handle, due to the worst-case
exponential growth of even sparse representations of the full quantum state
vector as a function of the number of quantum operations applied. However, this
exponential-space requirement can be avoided by using general space-time
tradeoffs long known to complexity theorists, which can be appropriately
optimized for this particular problem in a way that also illustrates some
interesting reformulations of quantum mechanics. In this paper, we describe the
design and empirical space-time complexity measurements of a working software
prototype of a quantum computer simulator that avoids excessive space
requirements. Due to its space-efficiency, this design is well-suited to
embedding in single-chip environments, permitting especially fast execution
that avoids access latencies to main memory. We plan to prototype our design on
a standard FPGA development board.Comment: 12 pages, 6 figures, presented at Quantum Information and Computation
VII, Orlando, April 2009. Author reprint of final submitted manuscrip
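The space-time tradeoff the abstract alludes to can be illustrated in a few lines: rather than storing the full 2^n-entry state vector, a single output amplitude can be evaluated as a Feynman-style sum over intermediate basis states, recursing one gate at a time. The sketch below is a generic illustration of that idea (single-qubit gates only), assumed here for exposition; it is not the authors' optimized design.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)  # Hadamard gate

def amplitude(gates, out_state, in_state=0):
    """<out_state| G_k ... G_1 |in_state> for gates = [(matrix, target_qubit), ...].

    Uses O(depth) space instead of O(2^n), at the cost of exponential time:
    the defining characteristic of the space-time tradeoff described above.
    """
    if not gates:
        return 1.0 if out_state == in_state else 0.0
    gate, target = gates[-1]                      # peel off the last gate
    out_bit = (out_state >> target) & 1
    total = 0.0
    for bit in (0, 1):                            # sum over intermediate states
        mid = (out_state & ~(1 << target)) | (bit << target)
        total += gate[out_bit, bit] * amplitude(gates[:-1], mid, in_state)
    return total

# Two Hadamards on qubit 0 compose to the identity: amplitude ~ 1.0.
print(amplitude([(H, 0), (H, 0)], out_state=0))
```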
Complexity-reduced hardware-based track-trigger for CMS upgrade
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London.
The Compact Muon Solenoid (CMS) detector at the Large Hadron Collider (LHC)
is designed to study the results of proton-proton collisions. The Tracker
sub-detector is designed to detect and reconstruct the trajectories of charged
particles produced by the collisions. During the lifetime of the CMS detector,
there have been several upgrades aimed at increasing the chance of discovering
new physics through increased luminosity levels and instrumentation of
advanced technology. The High-Luminosity upgrade optimises the LHC to
accelerate high-energy particles with an average of 200 proton-proton
interactions per bunch crossing. The Level-1 Trigger system promptly analyses
and filters collisions using hardware to reduce the data volume in real time.
For the upgrade, the trigger mechanism will use a particle trajectory estimator
that discriminates between particles based on their transverse momentum (pT).
Particles with pT ≥ 2 GeV/c will be transmitted to the Level-1 Track-Trigger
system for trajectory reconstruction within a fixed 3 μs latency. This thesis
presents a novel Hardware-based Multivariate Linear Fitter (MVLF) system
focusing on robustness in tracking efficiency and reduction in logic resource
usage within the specified latency. The system components are implemented in
Field Programmable Gate Arrays (FPGA), targeting 16 nm FinFET UltraScale+
silicon technology. The development was performed using the High-Level
Synthesis (HLS) automation tools and the Hardware acceleration platform for
Application-Specific Integrated Circuits (ASIC). A firmware demonstrator has
been assembled to verify the feasibility and compatibility of the scaled system
with the CMS Level-1 Track-Trigger infrastructure. The system’s performance is
compared to past and current system developments, and the results are
presented accordingly.
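To make the pT-based discrimination concrete, here is a minimal software sketch of a linear track fit in the small-angle helix approximation, where a hit's φ coordinate is roughly linear in its radius with slope proportional to q/pT. The thesis's MVLF is a hardware multivariate fitter, so this least-squares version is only an assumed illustration of the principle.

```python
import numpy as np

B_TESLA = 3.8  # CMS solenoid field

def fit_pt(r_m, phi_rad):
    """Straight-line fit phi = phi0 + m*r; returns (phi0, estimated pT in GeV/c).

    In the small-angle helix approximation, |m| = 0.3*B / (2*pT) for a
    unit-charge particle, so the fitted slope yields the momentum estimate.
    """
    A = np.column_stack([np.ones_like(r_m), r_m])
    (phi0, slope), *_ = np.linalg.lstsq(A, phi_rad, rcond=None)
    return phi0, 0.3 * B_TESLA / (2 * abs(slope))

# Synthetic hits from a pT = 3 GeV/c track; the fit recovers (0.1, ~3.0),
# so a pT >= 2 GeV/c threshold would keep this candidate.
r = np.array([0.3, 0.5, 0.7, 0.9, 1.1])          # hit radii [m]
phi = 0.1 + (0.3 * B_TESLA / (2 * 3.0)) * r
print(fit_pt(r, phi))
```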
Using reconfigurable computing technology to accelerate matrix decomposition and applications
Matrix decomposition plays an increasingly significant role in many scientific and engineering applications. Among numerous techniques, Singular Value Decomposition (SVD) and Eigenvalue Decomposition (EVD) are widely used as factorization tools to perform Principal Component Analysis for dimensionality reduction and pattern recognition in image processing, text mining and wireless communications, while QR Decomposition (QRD) and sparse LU Decomposition (LUD) are employed to solve dense or sparse linear systems of equations in bioinformatics, power systems and computer vision. Matrix decompositions are computationally expensive, and their sequential implementations often fail to meet the requirements of many time-sensitive applications.
The emergence of reconfigurable computing has provided a flexible and low-cost opportunity to pursue high-performance parallel designs, and the use of FPGAs has shown promise in accelerating this class of computation. In this research, we have proposed and implemented several highly parallel FPGA-based architectures to accelerate matrix decompositions and their applications in data mining and signal processing. Specifically, in this dissertation we describe the following contributions:
• We propose an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices.
• We implement a floating-point Hestenes-Jacobi architecture for SVD, which is capable of analyzing arbitrary sized matrices.
• We introduce a novel deeply pipelined reconfigurable architecture for QRD, which can be dynamically configured to perform either Householder transformation or Givens rotation in a manner that takes advantage of the strengths of each.
• We design a configurable architecture for sparse LUD that supports both symmetric and asymmetric sparse matrices with arbitrary sparsity patterns.
• By further extending the proposed hardware solution for SVD, we parallelize a popular text mining tool, Latent Semantic Indexing, with an FPGA-based architecture.
• We present a configurable architecture to accelerate Homotopy l1-minimization, in which a modified version of the proposed FPGA architecture for sparse LUD is used at its core to parallelize both Cholesky decomposition and rank-1 update.
Our experimental results using an FPGA-based acceleration system indicate the efficiency of our proposed novel architectures, with application- and dimension-dependent speedups over an optimized software implementation that range from 1.5× to 43.6× in terms of computation time.
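As an illustration of why the Hestenes-Jacobi approach from the second bullet maps well onto parallel hardware, the sketch below gives a plain software one-sided Jacobi SVD: each column pair is orthogonalized by a plane rotation, and the rotations within a sweep are independent, which is what a systolic FPGA array can execute concurrently. This is a generic textbook formulation, assumed here for illustration, not the dissertation's architecture.

```python
import numpy as np

def hestenes_jacobi_sv(A, sweeps=30, tol=1e-12):
    """One-sided (Hestenes) Jacobi: rotate column pairs of A until mutually
    orthogonal; the column norms then equal the singular values."""
    A = A.astype(float).copy()
    n = A.shape[1]
    for _ in range(sweeps):
        off = 0.0
        for i in range(n - 1):
            for j in range(i + 1, n):             # pairs are independent per sweep
                g = A[:, i] @ A[:, j]
                off = max(off, abs(g))
                if abs(g) < tol:
                    continue
                tau = (A[:, j] @ A[:, j] - A[:, i] @ A[:, i]) / (2 * g)
                t = 1.0 if tau == 0 else np.sign(tau) / (abs(tau) + np.hypot(1, tau))
                c = 1 / np.hypot(1, t)
                s = c * t
                A[:, [i, j]] = A[:, [i, j]] @ np.array([[c, s], [-s, c]])
        if off < tol:
            break
    return np.sort(np.linalg.norm(A, axis=0))[::-1]

M = np.random.default_rng(0).normal(size=(6, 4))
print(np.allclose(hestenes_jacobi_sv(M), np.linalg.svd(M, compute_uv=False)))
```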
Custom optimization algorithms for efficient hardware implementation
The focus is on real-time optimal decision making with applications in advanced control
systems. These computationally intensive schemes, which involve the repeated solution of
(convex) optimization problems within a sampling interval, require more efficient computational
methods than are currently available if their application is to be extended to highly
dynamical systems and to setups with resource-constrained embedded computing platforms.
A range of techniques are proposed to exploit synergies between digital hardware, numerical
analysis and algorithm design. These techniques build on top of parameterisable
hardware code generation tools that generate VHDL code describing custom computing
architectures for interior-point methods and a range of first-order constrained optimization
methods. Since memory limitations are often important in embedded implementations we
develop a custom storage scheme for KKT matrices arising in interior-point methods for
control, which reduces memory requirements significantly and prevents I/O bandwidth
limitations from affecting the performance in our implementations. To take advantage of
the trend towards parallel computing architectures and to exploit the special characteristics
of our custom architectures we propose several high-level parallel optimal control
schemes that can reduce computation time. A novel optimization formulation was devised
for reducing the computational effort in solving certain problems independently of the computing
platform used. In order to be able to solve optimization problems in fixed-point
arithmetic, which is significantly more resource-efficient than floating-point, tailored linear
algebra algorithms were developed for solving the linear systems that form the computational
bottleneck in many optimization methods. These methods come with guarantees
for reliable operation. We also provide finite-precision error analysis for fixed-point implementations
of first-order methods that can be used to minimize the use of resources while
meeting accuracy specifications. The suggested techniques are demonstrated on several
practical examples, including a hardware-in-the-loop setup for optimization-based control
of a large airliner.
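As a concrete taste of the fixed-point first-order methods analyzed above, here is a minimal software sketch (an assumption for illustration, not the generated VHDL): a projected-gradient solver for a box-constrained QP in which every intermediate value is rounded to a Q-format fixed-point grid, so the effect of finite precision on the iterates can be observed directly.

```python
import numpy as np

def to_fixed(x, frac_bits=16):
    """Model fixed-point storage: round to a multiple of 2^-frac_bits."""
    scale = 2.0 ** frac_bits
    return np.round(x * scale) / scale

def projected_gradient(H, f, lb, ub, iters=200, frac_bits=16):
    """Solve min 0.5 x'Hx + f'x s.t. lb <= x <= ub with rounded arithmetic."""
    step = to_fixed(1.0 / np.linalg.eigvalsh(H).max(), frac_bits)
    x = np.zeros_like(f)
    for _ in range(iters):
        grad = to_fixed(H @ x + f, frac_bits)         # rounded matrix-vector op
        x = np.clip(to_fixed(x - step * grad, frac_bits), lb, ub)
    return x

H = np.array([[2.0, 0.5], [0.5, 1.0]])
f = np.array([-1.0, -1.0])
# The unconstrained optimum H^-1 @ [1, 1] ~ [0.286, 0.857] lies inside the
# box, so the fixed-point iterates should settle very close to it.
print(projected_gradient(H, f, lb=-np.ones(2), ub=np.ones(2)))
```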
Superscalar RISC-V Processor with SIMD Vector Extension
With the increasing number of digital products in the market, the need for robust and highly configurable processors rises. This demand is met by the stable and extensible open-source RISC-V instruction set architecture. RISC-V processors are becoming popular in many fields of application and research.
This thesis presents a dual-issue superscalar RISC-V processor design with dynamic execution. The proposed design employs the global sharing (gshare) scheme for branch prediction and the Tomasulo algorithm for out-of-order execution. The processor is capable of speculative execution with five checkpoints. Data flow in the instruction dispatch and commit stages is optimized to achieve higher instruction throughput.
The superscalar processor is extended with a customized vector instruction set of single-instruction-multiple-data computations to specifically improve the performance on machine learning tasks. According to the definition of the proposed vector instruction set, the scratchpad memory and element-wise arithmetic units are implemented in the vector co-processor.
Different test programs are evaluated on the fully tested superscalar processor. Compared to the reference work, the proposed design improves average instruction throughput by 18.9% and average prediction hit rate by 4.92%, with a 16.9% higher operating clock frequency when synthesized on the Intel Arria 10 FPGA board.
The forward propagation of a convolutional neural network model is evaluated on the standalone superscalar processor and on its integration with the vector co-processor. The vector program with software-level optimizations achieves a 9.53× improvement in instruction throughput and a 10.18× improvement in real-time throughput. Moreover, the integration also provides 2.22× the energy efficiency of the superscalar processor alone.
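The "global sharing" predictor mentioned above is commonly known as gshare: the global branch-history register is XORed with low PC bits to index a table of 2-bit saturating counters. A minimal behavioral model follows (field widths are assumptions; this is not the thesis's RTL).

```python
class GsharePredictor:
    def __init__(self, index_bits=10):
        self.index_bits = index_bits
        self.history = 0                            # global history register
        self.table = [1] * (1 << index_bits)        # 2-bit counters, weakly not-taken

    def _index(self, pc):
        mask = (1 << self.index_bits) - 1
        return ((pc >> 2) ^ self.history) & mask    # gshare: PC bits XOR history

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2     # counter in {2, 3} means taken

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.index_bits) - 1)

# A loop branch that is taken three times and then falls through: once the
# history warms up, each repeating history pattern trains its own counter.
bp = GsharePredictor()
hits = 0
for taken in [True, True, True, False] * 64:
    hits += bp.predict(0x4000) == taken
    bp.update(0x4000, taken)
print(hits / 256)                                   # approaches 1.0
```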
Acceleration Techniques for Sparse Recovery Based Plane-wave Decomposition of a Sound Field
Plane-wave decomposition by sparse recovery is a reliable and accurate technique that can be used for source localization, beamforming, etc. In this work, we introduce techniques to accelerate plane-wave decomposition by sparse recovery. The method consists of two main algorithms: spherical Fourier transformation (SFT) and sparse recovery. Of the two, sparse recovery is the more computationally intensive. We implement the SFT on an FPGA and the sparse recovery on a multithreaded computing platform, so that the multithreaded platform can be fully utilized for the sparse recovery. Implementing the SFT on an FPGA also helps to flexibly integrate the microphones and improves the portability of the microphone array. For the FPGA implementation of the SFT, we develop a scalable design model that enables quick design of SFT architectures on FPGAs. The model takes the number of microphones, the number of SFT channels and the cost of the FPGA as inputs, and provides the design of a resource-optimized, cost-effective FPGA architecture as the output. We then investigate the performance of the sparse recovery algorithm executed on various multithreaded computing platforms (i.e., chip-multiprocessor, multiprocessor, GPU, manycore). Finally, we investigate the influence of the dictionary size on the computational performance and the accuracy of the sparse recovery algorithms, and introduce novel sparse-recovery techniques that use non-uniform dictionaries to improve the performance of the sparse recovery on a parallel architecture.
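To make the sparse-recovery stage concrete: the sound field at the microphones is modeled as y = D s for a dictionary D of candidate plane waves, and a sparse coefficient vector s is recovered. The sketch below uses orthogonal matching pursuit over a hypothetical free-field dictionary; the specific algorithm, array geometry and frequency are illustrative assumptions, not necessarily those used in the work.

```python
import numpy as np

def plane_wave_dictionary(mic_pos, angles, k):
    """Columns are plane waves exp(-jk d.x) for candidate arrival directions."""
    d = np.stack([np.cos(angles), np.sin(angles)])      # unit directions (2D)
    return np.exp(-1j * k * mic_pos @ d)                # shape: (mics, angles)

def omp(D, y, n_waves):
    """Orthogonal matching pursuit: greedily pick the best-correlated column,
    then re-fit all selected coefficients by least squares."""
    residual, support = y.copy(), []
    for _ in range(n_waves):
        support.append(int(np.argmax(np.abs(D.conj().T @ residual))))
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    return support, coeffs

rng = np.random.default_rng(1)
mics = rng.uniform(-0.25, 0.25, size=(32, 2))           # 32-mic planar array [m]
angles = np.linspace(0, np.pi, 36)
D = plane_wave_dictionary(mics, angles, k=2 * np.pi * 4000 / 343)  # 4 kHz
y = D[:, 10] + 0.5 * D[:, 25]                           # two incident plane waves
print(sorted(omp(D, y, 2)[0]))                          # expected: [10, 25]
```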
Late-bound code generation
Each time a function or method is invoked during the execution of a program, a stream of instructions is issued to some underlying hardware platform. But exactly what underlying hardware, and which instructions, are usually left implicit. However, in certain situations it becomes important to control these decisions. For example, particular problems can only be solved in real time when scheduled on specialised accelerators, such as graphics coprocessors or computing clusters.
We introduce a novel operator for hygienically reifying the behaviour of a runtime function instance as a syntactic fragment, in a language which may in general differ from the source function definition. Translation and optimisation are performed by recursively invoked, dynamically dispatched code generators. Side-effecting operations are permitted, and their ordering is preserved.
We compare our operator with other techniques for pragmatic control, observing that: the use of our operator supports lifting arbitrary mutable objects, and neither requires rewriting sections of the source program in a multi-level language, nor interferes with the interface to individual software components. Due to its lack of interference at the abstraction level at which software is composed, we believe that our approach poses a significantly lower barrier to practical adoption than current methods.
The practical efficacy of our operator is demonstrated by using it to offload the user interface rendering of a smartphone application to an FPGA coprocessor, including both statically and procedurally defined user interface components. The generated pipeline is an application-specific, statically scheduled processor-per-primitive rendering pipeline, suitable for place-and-route style optimisation.
To demonstrate the compatibility of our operator with existing languages, we show how it may be defined within the Python programming language. We introduce a transformation for weakening mutable to immutable named bindings, termed let-weakening, to solve the problem of propagating information pertaining to named variables between modular code-generating units.
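Purely as a loose illustration of what "reifying a runtime function instance as a syntactic fragment" and "recursively invoked, dynamically dispatched code generators" can mean in Python, consider the sketch below; the names and the translation target are invented for exposition and are not the thesis's operator or API.

```python
import ast
import inspect

def reify(fn):
    """Lift a live function object into a syntax fragment (its AST)."""
    return ast.parse(inspect.getsource(fn)).body[0]

def emit_c_expr(node):
    """A tiny recursively invoked, dynamically dispatched generator:
    translates a Python expression AST into C-like source text."""
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Mult):
        return f"({emit_c_expr(node.left)} * {emit_c_expr(node.right)})"
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        return f"({emit_c_expr(node.left)} + {emit_c_expr(node.right)})"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise NotImplementedError(type(node).__name__)

def scale(x):
    return 3 * x + 1

fragment = reify(scale)              # a FunctionDef syntax node
returned = fragment.body[0].value    # the expression behind `return`
print(emit_c_expr(returned))         # -> ((3 * x) + 1)
```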
Computing and Compressing Electron Repulsion Integrals on FPGAs
The computation of electron repulsion integrals (ERIs) over Gaussian-type
orbitals (GTOs) is a challenging problem in quantum-mechanics-based atomistic
simulations. In practical simulations, several trillion ERIs may have to be
computed for every time step.
In this work, we investigate FPGAs as accelerators for the ERI computation.
We use template parameters, here within the Intel oneAPI tool flow, to create
customized designs for 256 different ERI quartet classes, based on their
orbitals. To maximize data reuse, all intermediates are buffered in FPGA
on-chip memory with customized layout. The pre-calculation of intermediates
also helps to overcome data dependencies caused by multi-dimensional recurrence
relations. The involved loop structures are partially or even fully unrolled
for high throughput of FPGA kernels. Furthermore, a lossy compression algorithm
utilizing arbitrary-bitwidth integers is integrated into the FPGA kernels. To
the best of our knowledge, this is the first work on ERI computation on FPGAs
that supports more than just the single most basic quartet class. Moreover,
the integration of ERI computation and compression is a novelty not yet
covered by CPU or GPU libraries.
Our evaluation shows that, using 16-bit integers for the ERI compression, the
fastest FPGA kernels exceed a performance of 10 GERIS (10^10 ERIs per second)
on one Intel Stratix 10 GX 2800 FPGA, with maximum absolute errors around
- Hartree. The measured throughput can be accurately explained by a
performance model. The FPGA kernels deployed on 2 FPGAs outperform similar
computations using the widely used libint reference on a two-socket server
with 40 Xeon Gold 6148 CPU cores of the same process technology by factors of
up to 6.0x, and on a newer two-socket server with 128 EPYC 7713 CPU cores by
up to 1.9x.
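The integer compression scheme can be modeled in a few lines. The sketch below is an assumed block-scaling formulation (scale each batch by its peak magnitude, round to signed 16-bit integers), chosen to show why the absolute error stays bounded; the paper's in-kernel algorithm and its arbitrary bitwidths are not reproduced here.

```python
import numpy as np

def compress(eris, bits=16):
    """Scale a batch by its peak magnitude and round to signed integers."""
    scale = np.abs(eris).max() / (2 ** (bits - 1) - 1)
    return np.round(eris / scale).astype(np.int16), scale

def decompress(q, scale):
    return q.astype(np.float64) * scale

rng = np.random.default_rng(0)
batch = rng.normal(scale=1e-3, size=4096)     # stand-in values, not real ERIs
q, s = compress(batch)
err = np.abs(decompress(q, s) - batch).max()
print(err <= 0.5 * s)                          # True: error <= half a quantum
```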