74 research outputs found

    A bibliography on parallel and vector numerical algorithms

    Get PDF
    This is a bibliography of numerical methods. It also includes a number of other references on machine architecture, programming language, and other topics of interest to scientific computing. Certain conference proceedings and anthologies which have been published in book form are listed also

    Using reconfigurable computing technology to accelerate matrix decomposition and applications

    Get PDF
    Matrix decomposition plays an increasingly significant role in many scientific and engineering applications. Among numerous techniques, Singular Value Decomposition (SVD) and Eigenvalue Decomposition (EVD) are widely used as factorization tools to perform Principal Component Analysis for dimensionality reduction and pattern recognition in image processing, text mining and wireless communications, while QR Decomposition (QRD) and sparse LU Decomposition (LUD) are employed to solve the dense or sparse linear system of equations in bioinformatics, power system and computer vision. Matrix decompositions are computationally expensive and their sequential implementations often fail to meet the requirements of many time-sensitive applications. The emergence of reconfigurable computing has provided a flexible and low-cost opportunity to pursue high-performance parallel designs, and the use of FPGAs has shown promise in accelerating this class of computation. In this research, we have proposed and implemented several highly parallel FPGA-based architectures to accelerate matrix decompositions and their applications in data mining and signal processing. Specifically, in this dissertation we describe the following contributions: • We propose an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices. • We implement a floating-point Hestenes-Jacobi architecture for SVD, which is capable of analyzing arbitrary sized matrices. • We introduce a novel deeply pipelined reconfigurable architecture for QRD, which can be dynamically configured to perform either Householder transformation or Givens rotation in a manner that takes advantage of the strengths of each. • We design a configurable architecture for sparse LUD that supports both symmetric and asymmetric sparse matrices with arbitrary sparsity patterns. • By further extending the proposed hardware solution for SVD, we parallelize a popular text mining tool-Latent Semantic Indexing with an FPGA-based architecture. • We present a configurable architecture to accelerate Homotopy l1-minimization, in which the modification of the proposed FPGA architecture for sparse LUD is used at its core to parallelize both Cholesky decomposition and rank-1 update. Our experimental results using an FPGA-based acceleration system indicate the efficiency of our proposed novel architectures, with application and dimension-dependent speedups over an optimized software implementation that range from 1.5ÃÂ to 43.6ÃÂ in terms of computation time

    Solution of partial differential equations on vector and parallel computers

    Get PDF
    The present status of numerical methods for partial differential equations on vector and parallel computers was reviewed. The relevant aspects of these computers are discussed and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations as well as explicit and implicit methods for initial boundary value problems. The intent is to point out attractive methods as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. Application areas utilizing these computers are briefly discussed

    Reliable and Efficient Parallel Processing Algorithms and Architectures for Modern Signal Processing

    Get PDF
    Least-squares (LS) estimations and spectral decomposition algorithms constitute the heart of modern signal processing and communication problems. Implementations of recursive LS and spectral decomposition algorithms onto parallel processing architectures such as systolic arrays with efficient fault-tolerant schemes are the major concerns of this dissertation. There are four major results in this dissertation. First, we propose the systolic block Householder transformation with application to the recursive least-squares minimization. It is successfully implemented on a systolic array with a two-level pipelined implementation at the vector level as well as at the word level. Second, a real-time algorithm-based concurrent error detection scheme based on the residual method is proposed for the QRD RLS systolic array. The fault diagnosis, order degraded reconfiguration, and performance analysis are also considered. Third, the dynamic range, stability, error detection capability under finite-precision implementation, order degraded performance, and residual estimation under faulty situations for the QRD RLS systolic array are studied in details. Finally, we propose the use of multi-phase systolic algorithms for spectral decomposition based on the QR algorithm. Two systolic architectures, one based on triangular array and another based on rectangular array, are presented for the multiphase operations with fault-tolerant considerations. Eigenvectors and singular vectors can be easily obtained by using the multi-pase operations. Performance issues are also considered

    Design and analysis of numerical algorithms for the solution of linear systems on parallel and distributed architectures

    Get PDF
    The increasing availability of parallel computers is having a very significant impact on all aspects of scientific computation, including algorithm research and software development in numerical linear algebra. In particular, the solution of linear systems, which lies at the heart of most calculations in scientific computing is an important computation found in many engineering and scientific applications. In this thesis, well-known parallel algorithms for the solution of linear systems are compared with implicit parallel algorithms or the Quadrant Interlocking (QI) class of algorithms to solve linear systems. These implicit algorithms are (2x2) block algorithms expressed in explicit point form notation. [Continues.

    Using GPU to Accelerate Linear Computations in Power System Applications

    Get PDF
    With the development of advanced power system controls, the industrial and research community is becoming more interested in simulating larger interconnected power grids. It is always critical to incorporate advanced computing technologies to accelerate these power system computations. Power flow, one of the most fundamental computations in power system analysis, converts the solution of non-linear systems to that of a set of linear systems via the Newton method or one of its variants. An efficient solution to these linear equations is the key to improving the performance of power flow computation, and hence to accelerating other power system applications based on power flow computation, such as optimal power flow, contingency analysis, etc. This dissertation focuses on the exploration of iterative linear solvers and applicable preconditioners, with graphic processing unit (GPU) implementations to achieve performance improvement on the linear computations in power flow computations. An iterative conjugate gradient solver with Chebyshev preconditioner is studied first, and then the preconditioner is extended to a two-step preconditioner. At last, the conjugate gradient solver and the two-step preconditioner are integrated with MATPOWER to solve the practical fast decoupled load flow (FDPF), and an inexact linear solution method is proposed to further save the runtime of FDPF. Performance improvement is reported by applying these methods and GPU-implementation. The final complete GPU-based FDPF with inexact linear solving can achieve nearly 3x performance improvement over the MATPOWER implementation for a test system with 11,624 buses. A supporting study including a quick estimation of the largest eigenvalue of the linear system which is required by the Chebyshev preconditioner is presented as well. This dissertation demonstrates the potential of using GPU with scalable methods in power flow computation

    Error Analysis of the Cholesky QR-Based Block Orthogonalization Process for the One-Sided Block Jacobi SVD Algorithm

    Get PDF
    The one-sided block Jacobi method (OSBJ) has attracted attention as a fast and accurate algorithm for the singular value decomposition (SVD). The computational kernel of OSBJ is orthogonalization of a column block pair, which amounts to computing the SVD of this block pair. Hari proposes three methods for this partial SVD, and we found through numerical experiments that the variant named "V2", which is based on the Cholesky QR method, is the fastest variant and achieves satisfactory accuracy. While it is a good news from a practical viewpoint, it seems strange considering the well-known instability of the Cholesky QR method. In this paper, we perform a detailed error analysis of the V2 variant and explain why and when it can be used to compute the partial SVD accurately. Thus, our results provide a theoretical support for using the V2 variant safely in the OSBJ method

    High-performance computing with PetaBricks and Julia

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Mathematics, 2011.Cataloged from PDF version of thesis.Includes bibliographical references (p. 163-170).We present two recent parallel programming languages, PetaBricks and Julia, and demonstrate how we can use these two languages to re-examine classic numerical algorithms in new approaches for high-performance computing. PetaBricks is an implicitly parallel language that allows programmers to naturally express algorithmic choice explicitly at the language level. The PetaBricks compiler and autotuner is not only able to compose a complex program using fine-grained algorithmic choices but also find the right choice for many other parameters including data distribution, parallelization and blocking. We re-examine classic numerical algorithms with PetaBricks, and show that the PetaBricks autotuner produces nontrivial optimal algorithms that are difficult to reproduce otherwise. We also introduce the notion of variable accuracy algorithms, in which accuracy measures and requirements are supplied by the programmer and incorporated by the PetaBricks compiler and autotuner in the search of optimal algorithms. We demonstrate the accuracy/performance trade-offs by benchmark problems, and show how nontrivial algorithmic choice can change with different user accuracy requirements. Julia is a new high-level programming language that aims at achieving performance comparable to traditional compiled languages, while remaining easy to program and offering flexible parallelism without extensive effort. We describe a problem in large-scale terrain data analysis which motivates the use of Julia. We perform classical filtering techniques to study the terrain profiles and propose a measure based on Singular Value Decomposition (SVD) to quantify terrain surface roughness. We then give a brief tutorial of Julia and present results of our serial blocked SVD algorithm implementation in Julia. We also describe the parallel implementation of our SVD algorithm and discuss how flexible parallelism can be further explored using Julia.by Yee Lok Wong.Ph.D

    Implicit Hari–Zimmermann algorithm for the generalized SVD on the GPUs

    Get PDF
    A parallel, blocked, one-sided Hari–Zimmermann algorithm for the generalized singular value decomposition (GSVD) of a real or a complex matrix pair (F,G) is here proposed, where F and G have the same number of columns, and are both of the full column rank. The algorithm targets either a single graphics processing unit (GPU), or a cluster of those, performs all non-trivial computation exclusively on the GPUs, requires the minimal amount of memory to be reasonably expected, scales acceptably with the increase of the number of GPUs available, and guarantees the reproducible, bitwise identical output of the runs repeated over the same input and with the same number of GPUs
    • …
    corecore