1,193 research outputs found

    Improving Numerical Accuracy for Non-Negative Matrix Multiplication on GPUs using Recursive Algorithms

    Get PDF
    ABSTRACT Scientific computing is only bound by the limits of Moore's Law and the scalability of high performance mathematical library implementations. Most mathematical libraries however tend to focus only on general inputs, limiting their potential performance and scalability by not tailoring their implementation to specific inputs, such as non-negative inputs. By removing this limitation it is possible to improve the performance and accuracy of a range of problems. In this paper we explore the limitations of hardware to improve accuracy of non-negative matrix multiply by specifically comparing implementations on the GPU and CPU and propose algorithmic solutions to improve accuracy. Next, we demonstrate a matrix multiply implementation that takes advantage of asymptotically fast matrix multiply algorithms, which have been shown to scale better than O(N 3 ) matrix multiply implementations, and improve accuracy by up to a whole digit while increasing performance by up to 27% for matrices where the input is positive. Finally, we propose to extend the BLAS level 3 specification to non-negative matrices to allow easy integration of our solution and allow other library authors to implement their own solutions as part of an existing standard

    Mixed-Precision Numerical Linear Algebra Algorithms: Integer Arithmetic Based LU Factorization and Iterative Refinement for Hermitian Eigenvalue Problem

    Get PDF
    Mixed-precision algorithms are a class of algorithms that uses low precision in part of the algorithm in order to save time and energy with less accurate computation and communication. These algorithms usually utilize iterative refinement processes to improve the approximate solution obtained from low precision to the accuracy we desire from doing all the computation in high precision. Due to the demand of deep learning applications, there are hardware developments offering different low-precision formats including half precision (FP16), Bfloat16 and integer operations for quantized integers, which uses integers with a shared scalar to represent a set of equally spaced numbers. As new hardware architectures focus on bringing performance in these formats, the mixed-precision algorithms have more potential leverage on them and outmatch traditional fixed-precision algorithms. This dissertation consists of two articles. In the first article, we adapt one of the most fundamental algorithms in numerical linear algebra---LU factorization with partial pivoting--- to use integer arithmetic. With the goal of obtaining a low accuracy factorization as the preconditioner of generalized minimal residual (GMRES) to solve systems of linear equations, the LU factorization is adapted to use two different fixed-point formats for matrices L and U. A left-looking variant is also proposed for matrices with unbounded column growth. Finally, GMRES iterative refinement has shown that it can work on matrices with condition numbers up to 10000 with the algorithm that uses int16 as input and int32 accumulator for the update step. The second article targets symmetric and Hermitian eigenvalue problems. In this section we revisit the SICE algorithm from Dongarra et al. By applying the Sherman-Morrison formula on the diagonally-shifted tridiagonal systems, we propose an updated SICE-SM algorithm. By incorporating the latest two-stage algorithms from the PLASMA and MAGMA software libraries for numerical linear algebra, we achieved up to 3.6x speedup using the mixed-precision eigensolver with the blocked SICE-SM algorithm for iterative refinement when compared with full double complex precision solvers for the cases with a portion of eigenvalues and eigenvectors requested

    Rigorous numerical approaches in electronic structure theory

    Get PDF
    Electronic structure theory concerns the description of molecular properties according to the postulates of quantum mechanics. For practical purposes, this is realized entirely through numerical computation, the scope of which is constrained by computational costs that increases rapidly with the size of the system. The significant progress made in this field over the past decades have been facilitated in part by the willingness of chemists to forego some mathematical rigour in exchange for greater efficiency. While such compromises allow large systems to be computed feasibly, there are lingering concerns over the impact that these compromises have on the quality of the results that are produced. This research is motivated by two key issues that contribute to this loss of quality, namely i) the numerical errors accumulated due to the use of finite precision arithmetic and the application of numerical approximations, and ii) the reliance on iterative methods that are not guaranteed to converge to the correct solution. Taking the above issues in consideration, the aim of this thesis is to explore ways to perform electronic structure calculations with greater mathematical rigour, through the application of rigorous numerical methods. Of which, we focus in particular on methods based on interval analysis and deterministic global optimization. The Hartree-Fock electronic structure method will be used as the subject of this study due to its ubiquity within this domain. We outline an approach for placing rigorous bounds on numerical error in Hartree-Fock computations. This is achieved through the application of interval analysis techniques, which are able to rigorously bound and propagate quantities affected by numerical errors. Using this approach, we implement a program called Interval Hartree-Fock. Given a closed-shell system and the current electronic state, this program is able to compute rigorous error bounds on quantities including i) the total energy, ii) molecular orbital energies, iii) molecular orbital coefficients, and iv) derived electronic properties. Interval Hartree-Fock is adapted as an error analysis tool for studying the impact of numerical error in Hartree-Fock computations. It is used to investigate the effect of input related factors such as system size and basis set types on the numerical accuracy of the Hartree-Fock total energy. Consideration is also given to the impact of various algorithm design decisions. Examples include the application of different integral screening thresholds, the variation between single and double precision arithmetic in two-electron integral evaluation, and the adjustment of interpolation table granularity. These factors are relevant to both the usage of conventional Hartree-Fock code, and the development of Hartree-Fock code optimized for novel computing devices such as graphics processing units. We then present an approach for solving the Hartree-Fock equations to within a guaranteed margin of error. This is achieved by treating the Hartree-Fock equations as a non-convex global optimization problem, which is then solved using deterministic global optimization. The main contribution of this work is the development of algorithms for handling quantum chemistry specific expressions such as the one and two-electron integrals within the deterministic global optimization framework. This approach was implemented as an extension to an existing open source solver. Proof of concept calculations are performed for a variety of problems within Hartree-Fock theory, including those in i) point energy calculation, ii) geometry optimization, iii) basis set optimization, and iv) excited state calculation. Performance analyses of these calculations are also presented and discussed

    Using reconfigurable computing technology to accelerate matrix decomposition and applications

    Get PDF
    Matrix decomposition plays an increasingly significant role in many scientific and engineering applications. Among numerous techniques, Singular Value Decomposition (SVD) and Eigenvalue Decomposition (EVD) are widely used as factorization tools to perform Principal Component Analysis for dimensionality reduction and pattern recognition in image processing, text mining and wireless communications, while QR Decomposition (QRD) and sparse LU Decomposition (LUD) are employed to solve the dense or sparse linear system of equations in bioinformatics, power system and computer vision. Matrix decompositions are computationally expensive and their sequential implementations often fail to meet the requirements of many time-sensitive applications. The emergence of reconfigurable computing has provided a flexible and low-cost opportunity to pursue high-performance parallel designs, and the use of FPGAs has shown promise in accelerating this class of computation. In this research, we have proposed and implemented several highly parallel FPGA-based architectures to accelerate matrix decompositions and their applications in data mining and signal processing. Specifically, in this dissertation we describe the following contributions: • We propose an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices. • We implement a floating-point Hestenes-Jacobi architecture for SVD, which is capable of analyzing arbitrary sized matrices. • We introduce a novel deeply pipelined reconfigurable architecture for QRD, which can be dynamically configured to perform either Householder transformation or Givens rotation in a manner that takes advantage of the strengths of each. • We design a configurable architecture for sparse LUD that supports both symmetric and asymmetric sparse matrices with arbitrary sparsity patterns. • By further extending the proposed hardware solution for SVD, we parallelize a popular text mining tool-Latent Semantic Indexing with an FPGA-based architecture. • We present a configurable architecture to accelerate Homotopy l1-minimization, in which the modification of the proposed FPGA architecture for sparse LUD is used at its core to parallelize both Cholesky decomposition and rank-1 update. Our experimental results using an FPGA-based acceleration system indicate the efficiency of our proposed novel architectures, with application and dimension-dependent speedups over an optimized software implementation that range from 1.5ÃÂ to 43.6ÃÂ in terms of computation time

    Parallel computing 2011, ParCo 2011: book of abstracts

    Get PDF
    This book contains the abstracts of the presentations at the conference Parallel Computing 2011, 30 August - 2 September 2011, Ghent, Belgiu
    • …
    corecore