Sparse Householder QR factorization on a mesh
This is a post-peer-review, pre-copyedit version of an article published in Proceedings of the 4th Euromicro Workshop on Parallel and Distributed Processing. The final authenticated version is available online at: http://dx.doi.org/10.1109/EMPDP.1996.500566. [Abstract] We analyze the parallelization of QR factorization by means of Householder transformations. The parallelization is carried out on a machine with a mesh topology (more precisely, a 2-D torus). We use a cyclic distribution over the processors of the elements of the sparse matrix M we want to decompose. Each processor represents the nonzero elements of its part of the matrix with a one-dimensional doubly linked list data structure. We then describe the different procedures that constitute the parallel algorithm. As an application of QR factorization, we concentrate on the least squares problem, and finally we present an evaluation of the efficiency of this algorithm on a set of test matrices from the Harwell-Boeing sparse matrix collection.
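For reference, the dense sequential Householder kernel that the paper parallelizes can be sketched in a few lines of NumPy (a minimal illustration only; the paper's sparse, cyclically distributed variant with linked-list storage is not reproduced here):

```python
import numpy as np

def householder_qr(A):
    """Dense Householder QR: returns orthogonal Q (m x m) and upper
    triangular R (m x n) with A = Q R."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(min(m, n)):
        x = R[k:, k]
        # Householder vector v reflecting x onto a multiple of e_1
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])
        nv = np.linalg.norm(v)
        if nv == 0.0:  # column is already zero below the diagonal
            continue
        v = v / nv
        # Apply the reflector H = I - 2 v v^T to the trailing submatrix
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])
        # Accumulate Q by applying the same reflector from the right
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
    return Q, R

A = np.array([[4.0, 1.0], [2.0, 3.0], [0.0, 5.0]])
Q, R = householder_qr(A)
```

Each step annihilates one column below the diagonal with a reflector H = I - 2vv^T; in the parallel algorithm it is these reflector applications that are distributed over the torus.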
Minimizing Communication in Linear Algebra
In 1981 Hong and Kung proved a lower bound on the amount of communication
needed to perform dense matrix multiplication using the conventional
algorithm, where the input matrices are too large to fit in the small, fast
memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and
extended it to the parallel case. In both cases the lower bound may be
expressed as Ω(#arithmetic operations / √M), where M is the size
of the fast memory (or local memory in the parallel case). Here we generalize
these results to a much wider variety of algorithms, including LU
factorization, Cholesky factorization, LDL^T factorization, QR factorization,
and algorithms for eigenvalues and singular values, i.e., essentially all direct
methods of linear algebra. The proof works for dense or sparse matrices, and
for sequential or parallel algorithms. In addition to lower bounds on the
amount of data moved (bandwidth) we get lower bounds on the number of messages
required to move it (latency). We illustrate how to extend our lower bound
technique to compositions of linear algebra operations (like computing powers
of a matrix), to decide whether it is enough to call a sequence of simpler
optimal algorithms (like matrix multiplication) to minimize communication, or
if we can do better. We give examples of both. We also show how to extend our
lower bounds to certain graph theoretic problems.
We point out recently designed algorithms for dense LU, Cholesky, QR,
eigenvalue and the SVD problems that attain these lower bounds; implementations
of LU and QR show large speedups over conventional linear algebra algorithms in
standard libraries like LAPACK and ScaLAPACK. Many open problems remain.
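As a back-of-envelope illustration (not from the paper), the bandwidth lower bound Ω(#arithmetic operations / √M) can be evaluated for conventional n x n matrix multiplication to see how the fast-memory size M enters:

```python
import math

def bandwidth_lower_bound(n, M):
    """Words that must move between slow and fast memory, up to a constant,
    for classical n x n matrix multiplication with fast-memory size M."""
    flops = 2 * n**3  # multiply-adds in the conventional algorithm
    return flops / math.sqrt(M)

n = 4096
for M in (2**15, 2**20, 2**25):  # fast-memory sizes in words
    print(f"M = {M:>9} words -> at least "
          f"{bandwidth_lower_bound(n, M):.3e} words moved")
```

Enlarging the fast memory reduces the required traffic only by a factor of √M, which is why communication-optimal blocked algorithms choose tile sizes on the order of √M.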
Stable Sparse Orthogonal Factorization of Ill-Conditioned Banded Matrices for Parallel Computing
Sequential and parallel algorithms based on the LU factorization or the QR factorization have been intensively studied and widely used for computations with large-scale ill-conditioned banded matrices. Major concerns with existing methods include ill-conditioning, sparsity of the factor matrices, computational complexity, and scalability. In this dissertation, we study a sparse orthogonal factorization of a banded matrix motivated by parallel computing. Specifically, we develop a process that factorizes a banded matrix as the product of a sparse orthogonal matrix and a sparse matrix that can be transformed to an upper triangular matrix by column permutations. We prove that the proposed process has low complexity and is numerically stable, with stability results similar to those of the modified Gram-Schmidt process. On this basis, we develop a parallel algorithm for the factorization in a distributed computing environment. Through an analysis of its performance, we show that the communication costs reach the theoretical least upper bounds, while its parallel complexity or speedup approaches the optimal bound. For an ill-conditioned banded system, we construct a sequential solver that breaks it down into small-scale underdetermined systems, which are solved by the proposed factorization with high accuracy. We also implement a parallel solver with strategies to handle the memory issues that appear in extra-large-scale linear systems of size over one billion. Numerical experiments confirm the theoretical results derived in this thesis and demonstrate the superior accuracy and scalability of the proposed solvers for ill-conditioned linear systems compared with the most commonly used direct solvers.
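For context, the modified Gram-Schmidt process that serves as the stability yardstick above can be sketched as follows (a textbook dense version, not the dissertation's sparse banded factorization):

```python
import numpy as np

def mgs_qr(A):
    """Modified Gram-Schmidt QR for A (m x n, full column rank):
    returns Q (m x n, orthonormal columns) and R (n x n, upper triangular)."""
    m, n = A.shape
    Q = A.astype(float).copy()
    R = np.zeros((n, n))
    for k in range(n):
        R[k, k] = np.linalg.norm(Q[:, k])
        Q[:, k] /= R[k, k]
        # Immediately orthogonalize the remaining columns against q_k;
        # this one-at-a-time update is what distinguishes MGS from
        # classical Gram-Schmidt and gives its better stability.
        for j in range(k + 1, n):
            R[k, j] = Q[:, k] @ Q[:, j]
            Q[:, j] -= R[k, j] * Q[:, k]
    return Q, R

A = np.array([[3.0, 1.0], [4.0, 2.0]])
Q, R = mgs_qr(A)
```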
Parallel minimum norm solution of sparse block diagonal column overlapped underdetermined systems
Underdetermined systems of equations in which the minimum norm solution needs to be computed arise in many applications, such as geophysics, signal processing, and biomedical engineering. In this article, we introduce a new parallel algorithm for obtaining the minimum 2-norm solution of an underdetermined system of equations. The proposed algorithm is based on the Balance scheme, which was originally developed for the parallel solution of banded linear systems. The scheme assumes a generalized banded form in which the coefficient matrix has a column-overlapped block structure whose blocks can be dense or sparse; here we implement the more general sparse case. The blocks can be handled independently by any existing sequential or parallel QR factorization library. A smaller reduced system is formed and solved before the minimum norm solution of the original system is obtained in parallel. We experimentally compare and confirm the error bound of the proposed method against QR factorization based techniques using true single-precision arithmetic. We implement the proposed algorithm using the message passing paradigm, and demonstrate its numerical effectiveness as well as its parallel scalability on both shared and distributed memory architectures for various types of problems. © 2017 ACM
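The core numerical task, the minimum 2-norm solution of an underdetermined system, can be illustrated sequentially via a QR factorization of A^T (a small NumPy sketch; the parallel Balance-scheme machinery is not shown):

```python
import numpy as np

def min_norm_solution(A, b):
    """Minimum 2-norm solution of A x = b for A (m x n, m < n, full row
    rank) via the thin QR factorization A^T = Q R, so that
    x = Q @ solve(R^T, b) satisfies A x = R^T Q^T Q y = R^T y = b."""
    Q, R = np.linalg.qr(A.T)      # Q: n x m with orthonormal columns, R: m x m
    y = np.linalg.solve(R.T, b)   # solve the small triangular system R^T y = b
    return Q @ y

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([3.0, 2.0])
x = min_norm_solution(A, b)
```

Since x = Qy lies in the row space of A, it is the unique minimum-norm solution; NumPy's `np.linalg.lstsq` returns the same vector for an underdetermined system.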
Efficient Randomized Algorithms for the Fixed-Precision Low-Rank Matrix Approximation
Randomized algorithms for low-rank matrix approximation are investigated,
with the emphasis on the fixed-precision problem and computational efficiency
for handling large matrices. The algorithms are based on the so-called QB
factorization, where Q is a matrix with orthonormal columns. First, a mechanism for
calculating the approximation error in the Frobenius norm is proposed, which
enables efficient adaptive rank determination for large and/or sparse matrices.
It can be combined with any QB-form factorization algorithm in which B's rows
are incrementally generated. Based on the blocked randQB algorithm of P.-G.
Martinsson and S. Voronin, this results in an algorithm called randQB_EI. We
then further revise the algorithm to obtain a pass-efficient variant, randQB_FP,
which is mathematically equivalent to the existing randQB algorithms and
also suited to the fixed-precision problem. In particular, randQB_FP can serve
as a single-pass algorithm for calculating leading singular values under
certain conditions. With large and/or sparse test matrices, we have empirically
validated the merits of the proposed techniques, which exhibit remarkable
speedup and memory saving over the blocked randQB algorithm. We have also
demonstrated that the single-pass algorithm derived from randQB_FP is much more
accurate than an existing single-pass algorithm. With data from a scenic
image and an information retrieval application, we have shown the advantages of
the proposed algorithms over the adaptive range finder algorithm for solving
the fixed-precision problem.
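The error-indicator idea can be sketched as follows (a simplified, hypothetical single-loop QB illustration; the re-orthogonalization schedule, power iterations, and pass-efficiency refinements of the paper's randQB-type algorithms are omitted, and all names here are illustrative):

```python
import numpy as np

def rand_qb_fixed_precision(A, tol, block=8, max_rank=None):
    """Grow Q, B block by block until ||A - Q B||_F <= tol, tracking the
    error incrementally via the identity (for Q with orthonormal columns
    and B = Q^T A):  ||A - Q B||_F^2 = ||A||_F^2 - ||B||_F^2."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    if max_rank is None:
        max_rank = min(m, n)
    Q = np.empty((m, 0))
    B = np.empty((0, n))
    err2 = np.linalg.norm(A, "fro") ** 2
    while Q.shape[1] < max_rank and err2 > tol**2:
        Omega = rng.standard_normal((n, block))
        Y = A @ Omega - Q @ (B @ Omega)      # sample the residual's range
        Qi, _ = np.linalg.qr(Y)
        Qi -= Q @ (Q.T @ Qi)                 # re-orthogonalize against Q
        Qi, _ = np.linalg.qr(Qi)
        Bi = Qi.T @ A
        Q = np.hstack([Q, Qi])
        B = np.vstack([B, Bi])
        err2 -= np.linalg.norm(Bi, "fro") ** 2  # cheap incremental error update
    return Q, B

rng2 = np.random.default_rng(1)
A = rng2.standard_normal((60, 5)) @ rng2.standard_normal((5, 40))  # rank 5
Q, B = rand_qb_fixed_precision(A, tol=1e-4)
```

The incremental update is what makes the scheme efficient: the stopping test costs only the Frobenius norm of the newest rows of B, never a fresh evaluation of the full residual.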
An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling
We present a sparse linear system solver that is based on a multifrontal
variant of Gaussian elimination, and exploits low-rank approximation of the
resulting dense frontal matrices. We use hierarchically semiseparable (HSS)
matrices, which have low-rank off-diagonal blocks, to approximate the frontal
matrices. For HSS matrix construction, a randomized sampling algorithm is used
together with interpolative decompositions. The combination of the randomized
compression with a fast ULV HSS factorization leads to a solver with lower
computational complexity than the standard multifrontal method for many
applications, resulting in speedups of up to 7-fold for problems in our test
suite. The implementation targets many-core systems by using task parallelism
with dynamic runtime scheduling. Numerical experiments show performance
improvements over state-of-the-art sparse direct solvers. The implementation
achieves high performance and good scalability on a range of modern shared
memory parallel systems, including the Intel Xeon Phi (MIC). The code is part
of a software package called STRUMPACK -- STRUctured Matrices PACKage, which
also has a distributed memory component for dense rank-structured matrices.
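The randomized-sampling compression underlying the HSS construction can be sketched at a single level (a hypothetical minimal illustration; a real HSS construction applies this recursively to off-diagonal blocks together with interpolative decompositions, which are not shown):

```python
import numpy as np

def compress_offdiag(F, rank, oversample=5, seed=0):
    """One-level randomized low-rank compression of a block F:
    F ~= U @ (U.T @ F), where U is an orthonormal basis obtained from a
    small number of random samples of F's range."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((F.shape[1], rank + oversample))
    U, _ = np.linalg.qr(F @ Omega)  # orthonormal basis for the sampled range
    return U, U.T @ F

# An off-diagonal block from a smooth kernel on well-separated intervals
# is numerically low rank, the setting where this compression pays off.
x = np.linspace(0.0, 1.0, 60)
F = 1.0 / (1.0 + np.subtract.outer(x, x + 2.0) ** 2)
U, V = compress_offdiag(F, rank=10)
rel_err = np.linalg.norm(F - U @ V) / np.linalg.norm(F)
```

The point of the sampling approach is that it only needs matrix-vector products with F, which is what allows HSS compression of frontal matrices that are never formed explicitly.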