5,739 research outputs found

    Massively parallel Poisson and QR factorization solvers

    Get PDF
    The paper presents a massively parallel Poisson solver for a rectangular domain and parallel algorithms for computing the QR factorization of a dense matrix A by means of Householder reflections and Givens rotations. The computer model under consideration is a SIMD mesh-connected toroidal n × n processor array. The Dirichlet problem is replaced by its finite-difference analog on an M × N grid (M + 1 and N are powers of two). The algorithm is composed of parallel fast sine transform and cyclic odd-even reduction blocks and runs in a fully parallel fashion. Its computational complexity is O(MN log L / n²), where L = max(M + 1, N). The parallel QR factorization by the Householder method zeros all subdiagonal elements in each column and updates all elements of the given submatrix in parallel. For the second method, based on Givens rotations, the parallel scheme of Sameh and Kuck was chosen, in which disjoint rotations can be computed simultaneously. The algorithms were coded in the MPF and MPL parallel programming languages, and results of computational experiments on the MasPar MP-1 system are also presented.
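    The sketch below illustrates the column-by-column structure of Householder QR that the abstract describes: each step zeros all subdiagonal entries of one column and updates the whole trailing submatrix at once, which is the operation the paper distributes over the processor array. This is a minimal serial NumPy rendering, not the authors' MPF/MPL implementation.

        import numpy as np

        def householder_qr(A):
            """Return Q, R with A = Q @ R, via Householder reflections."""
            A = A.astype(float).copy()
            m, n = A.shape
            Q = np.eye(m)
            for k in range(min(m - 1, n)):
                x = A[k:, k]
                v = x.copy()
                v[0] += np.copysign(np.linalg.norm(x), x[0])
                vnorm = np.linalg.norm(v)
                if vnorm == 0.0:
                    continue  # column already zero below the diagonal
                v /= vnorm
                # One step: zero column k below the diagonal and update
                # every entry of the trailing submatrix at once -- this
                # rank-1 update is what runs in parallel on the mesh array.
                A[k:, k:] -= 2.0 * np.outer(v, v @ A[k:, k:])
                Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
            return Q, A

        A = np.random.rand(6, 4)
        Q, R = householder_qr(A)
        print(np.allclose(A, Q @ R))  # True up to rounding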

    Stable Sparse Orthogonal Factorization of Ill-Conditioned Banded Matrices for Parallel Computing

    Get PDF
    Sequential and parallel algorithms based on the LU factorization or the QR factorization have been intensively studied and widely used for computations with large-scale ill-conditioned banded matrices. Major concerns with existing methods include ill-conditioning, sparsity of the factor matrices, computational complexity, and scalability. In this dissertation, we study a sparse orthogonal factorization of a banded matrix motivated by parallel computing. Specifically, we develop a process that factorizes a banded matrix as a product of a sparse orthogonal matrix and a sparse matrix that can be transformed to an upper triangular matrix by column permutations. We prove that the proposed process has low complexity and is numerically stable, with stability results similar to those of the modified Gram-Schmidt process. On this basis, we develop a parallel algorithm for the factorization in a distributed computing environment. Through an analysis of its performance, we show that the communication costs reach their theoretical lower bounds, while the parallel complexity, or speedup, approaches the optimal bound. For an ill-conditioned banded system, we construct a sequential solver that breaks it down into small-scale underdetermined systems, which are solved by the proposed factorization with high accuracy. We also implement a parallel solver with strategies to treat the memory issues that appear in extra-large linear systems of size over one billion. Numerical experiments confirm the theoretical results derived in this thesis and demonstrate the superior accuracy and scalability of the proposed solvers for ill-conditioned linear systems compared to the most commonly used direct solvers.
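    The proposed factorization itself involves banded structure and column permutations beyond a few lines, but the stability benchmark the abstract cites, the modified Gram-Schmidt (MGS) process, is easy to state. A minimal MGS QR sketch for reference (assuming full column rank), not the dissertation's algorithm:

        import numpy as np

        def mgs_qr(A):
            """Modified Gram-Schmidt: A = Q @ R (A assumed full column rank)."""
            A = A.astype(float).copy()
            m, n = A.shape
            Q = np.zeros((m, n))
            R = np.zeros((n, n))
            for k in range(n):
                R[k, k] = np.linalg.norm(A[:, k])
                Q[:, k] = A[:, k] / R[k, k]
                # Orthogonalize the remaining columns against q_k immediately;
                # this reordering is what makes MGS more stable than classical GS.
                R[k, k + 1:] = Q[:, k] @ A[:, k + 1:]
                A[:, k + 1:] -= np.outer(Q[:, k], R[k, k + 1:])
            return Q, R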

    Minimizing Communication in Linear Algebra

    Full text link
    In 1981 Hong and Kung proved a lower bound on the amount of communication needed to perform dense matrix multiplication using the conventional O(n^3) algorithm, where the input matrices are too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and extended it to the parallel case. In both cases the lower bound may be expressed as Ω(#arithmetic operations / √M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL^T factorization, QR factorization, and algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth), we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph-theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain. Comment: 27 pages, 2 tables.
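    For intuition, the classical algorithm that attains the Ω(#arithmetic operations / √M) bandwidth bound for matrix multiplication is blocked (tiled) multiplication with tile size on the order of √M, so that one tile each of A, B and C fits in fast memory at a time. A minimal sketch, with the tile size b as an illustrative parameter:

        import numpy as np

        def blocked_matmul(A, B, b):
            """C = A @ B computed with b-by-b tiles (choose b ~ sqrt(M/3))."""
            n = A.shape[0]
            C = np.zeros((n, n))
            for i in range(0, n, b):
                for j in range(0, n, b):
                    # The C tile stays resident in fast memory for the whole
                    # k-loop; only tiles of A and B stream through, so each
                    # word of A and B is reloaded O(n/b) times.
                    for k in range(0, n, b):
                        C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
            return C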

    Fast linear algebra is stable

    Full text link
    In an earlier paper, we showed that a large class of fast recursive matrix multiplication algorithms is stable in a normwise sense, and that in fact if multiplication of n-by-n matrices can be done by any algorithm in O(n^{ω+η}) operations for any η > 0, then it can be done stably in O(n^{ω+η}) operations for any η > 0. Here we extend this result to show that essentially all standard linear algebra operations, including LU decomposition, QR decomposition, linear equation solving, matrix inversion, solving least squares problems, (generalized) eigenvalue problems and the singular value decomposition, can also be done stably (in a normwise sense) in O(n^{ω+η}) operations. Comment: 26 pages; final version; to appear in Numerische Mathematik.
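    Strassen's method is the classic member of the class of fast recursive multiplication algorithms covered by this result: it replaces 8 recursive half-size products with 7, giving O(n^{log2 7}) ≈ O(n^{2.81}) operations. A minimal sketch for n a power of two; the cutoff for falling back to conventional multiplication is an illustrative choice:

        import numpy as np

        def strassen(A, B, cutoff=64):
            """C = A @ B by Strassen recursion; n must be a power of two."""
            n = A.shape[0]
            if n <= cutoff:
                return A @ B  # conventional multiplication at the base
            h = n // 2
            A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
            B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
            # Seven recursive products instead of eight.
            M1 = strassen(A11 + A22, B11 + B22, cutoff)
            M2 = strassen(A21 + A22, B11, cutoff)
            M3 = strassen(A11, B12 - B22, cutoff)
            M4 = strassen(A22, B21 - B11, cutoff)
            M5 = strassen(A11 + A12, B22, cutoff)
            M6 = strassen(A21 - A11, B11 + B12, cutoff)
            M7 = strassen(A12 - A22, B21 + B22, cutoff)
            C = np.empty((n, n))
            C[:h, :h] = M1 + M4 - M5 + M7
            C[:h, h:] = M3 + M5
            C[h:, :h] = M2 + M4
            C[h:, h:] = M1 - M2 + M3 + M6
            return C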

    An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling

    Full text link
    We present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination, and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups of up to 7-fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK -- STRUctured Matrices PACKage, which also has a distributed memory component for dense rank-structured matrices.
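    The randomized-sampling idea behind the HSS compression can be illustrated in a few lines: multiplying a matrix block by a small Gaussian random matrix captures its numerical range without visiting every entry. The sketch below follows the generic randomized range-finder pattern (in the style of Halko, Martinsson and Tropp), not STRUMPACK's actual interpolative-decomposition routine; the rank and oversampling parameters are illustrative:

        import numpy as np

        def randomized_low_rank(A, rank, oversample=10):
            """Return Q, B with A ~= Q @ B and Q having orthonormal columns."""
            n = A.shape[1]
            Omega = np.random.randn(n, rank + oversample)  # random test matrix
            Y = A @ Omega                       # sample the range of A
            Q, _ = np.linalg.qr(Y)              # orthonormal basis for the samples
            B = Q.T @ A                         # compress A onto that basis
            return Q, B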

    Gradient type optimization methods for electronic structure calculations

    Full text link
    The density functional theory (DFT) approach to electronic structure calculations can be formulated as either a nonlinear eigenvalue problem or a direct minimization problem. The most widely used approach for solving the former is the so-called self-consistent field (SCF) iteration. A common observation is that the convergence of SCF is unclear theoretically, while approaches with convergence guarantees for solving the latter are often not competitive with SCF numerically. In this paper, we study gradient-type methods for solving the direct minimization problem by constructing new iterates along the gradient on the Stiefel manifold. Global convergence (i.e., convergence to a stationary point from any initial solution) as well as the local convergence rate follow directly from the standard theory of optimization on manifolds. A major computational advantage is that solving linear eigenvalue problems is no longer needed. The main costs of our approaches arise from assembling the total energy functional and its gradient and from the projection onto the manifold. These tasks are cheaper than eigenvalue computation, and they are often more amenable to parallelization as long as the evaluation of the total energy functional and its gradient is efficient. Numerical results show that our methods can outperform SCF consistently on many practically large systems. Comment: 24 pages, 11 figures, 59 references, and 1 acknowledgement.
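    A single iteration of such a gradient method can be sketched directly: project the Euclidean gradient onto the tangent space of the Stiefel manifold {X : XᵀX = I}, take a step, and retract back onto the manifold with a QR factorization. The quadratic energy trace(XᵀAX) below is a stand-in for the DFT total-energy functional, and the step size is illustrative:

        import numpy as np

        def qf(Y):
            """Retract onto the Stiefel manifold: Q factor of a thin QR."""
            Q, R = np.linalg.qr(Y)
            return Q * np.sign(np.diag(R))  # fix column signs for uniqueness

        def stiefel_gradient_step(X, grad_f, tau):
            G = grad_f(X)
            # Project the Euclidean gradient onto the tangent space at X.
            PG = G - X @ ((X.T @ G + G.T @ X) / 2.0)
            return qf(X - tau * PG)

        n, p = 20, 4
        A = np.random.rand(n, n)
        A = (A + A.T) / 2.0                     # symmetric stand-in "Hamiltonian"
        grad_f = lambda X: 2.0 * A @ X          # gradient of f(X) = trace(X^T A X)
        X = qf(np.random.randn(n, p))
        for _ in range(200):
            X = stiefel_gradient_step(X, grad_f, tau=0.05)
        print(np.allclose(X.T @ X, np.eye(p)))  # iterates stay on the manifold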