Massively parallel Poisson and QR factorization solvers
The paper presents a massively parallel Poisson solver for a rectangular domain and parallel algorithms for computing the QR factorization of a dense matrix A by means of Householder reflections and Givens rotations. The computer model under consideration is a SIMD mesh-connected toroidal n × n processor array. The Dirichlet problem is replaced by its finite-difference analog on an M × N grid (M + 1 and N are powers of two). The algorithm is composed of parallel fast sine transform and cyclic odd-even reduction blocks and runs in a fully parallel fashion. Its computational complexity is O(MN log L / n^2), where L = max(M + 1, N). The parallel QR factorization by the Householder method zeros all subdiagonal elements in each column and updates all elements of the given submatrix in parallel. For the second method, based on Givens rotations, the parallel scheme of Sameh and Kuck was chosen, in which disjoint rotations can be computed simultaneously. The algorithms were coded in the MPF and MPL parallel programming languages, and results of computational experiments on the MasPar MP-1 system are also presented.
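As a rough illustration of the Givens-rotation approach mentioned in this abstract, the following is a minimal sequential sketch in plain Python. It is not the paper's MPF/MPL code and does not implement the Sameh and Kuck ordering; the function name `givens_qr` is hypothetical. In a parallel scheme, rotations that act on disjoint row pairs could be applied at the same time, whereas here they are applied one after another.

```python
import math

def givens_qr(a):
    """Return (q, r) with a = q * r, built from Givens rotations (sequential sketch)."""
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    # q accumulates the transposed rotations; start from the identity.
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero the subdiagonal entries of column j, working from the bottom up.
        for i in range(m - 1, j, -1):
            x, y = r[i - 1][j], r[i][j]
            h = math.hypot(x, y)
            if h == 0.0:
                continue  # nothing to eliminate
            c, s = x / h, y / h
            # Rotate rows i-1 and i of r so that r[i][j] becomes zero.
            for k in range(n):
                u, v = r[i - 1][k], r[i][k]
                r[i - 1][k] = c * u + s * v
                r[i][k] = -s * u + c * v
            # Apply the transposed rotation to columns i-1 and i of q.
            for k in range(m):
                u, v = q[k][i - 1], q[k][i]
                q[k][i - 1] = c * u + s * v
                q[k][i] = -s * u + c * v
    return q, r
```

Each rotation touches only two rows, which is why rotations on non-overlapping row pairs are independent and can be scheduled together on a processor array.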
Stable Sparse Orthogonal Factorization of Ill-Conditioned Banded Matrices for Parallel Computing
Sequential and parallel algorithms based on the LU factorization or the QR factorization have been intensely studied and widely used in problems of computation with large-scale ill-conditioned banded matrices. Major concerns with existing methods include ill-conditioning, sparsity of factor matrices, computational complexity, and scalability. In this dissertation, we study a sparse orthogonal factorization of a banded matrix motivated by parallel computing. Specifically, we develop a process to factorize a banded matrix as a product of a sparse orthogonal matrix and a sparse matrix which can be transformed to an upper triangular matrix by column permutations. We prove that the proposed process has low complexity and is numerically stable, with stability results similar to those of the modified Gram-Schmidt process. On this basis, we develop a parallel algorithm for the factorization in a distributed computing environment. Through an analysis of its performance, we show that the communication costs reach the theoretical least upper bounds, while its parallel complexity or speedup approaches the optimal bound. For an ill-conditioned banded system, we construct a sequential solver that breaks it down into small-scale underdetermined systems, which are solved by the proposed factorization with high accuracy. We also implement a parallel solver with strategies to treat the memory issue appearing in extra large-scale linear systems of size over one billion. Numerical experiments confirm the theoretical results derived in this thesis, and demonstrate the superior accuracy and scalability of the proposed solvers for ill-conditioned linear systems, compared with the most commonly used direct solvers.
Minimizing Communication in Linear Algebra
In 1981 Hong and Kung proved a lower bound on the amount of communication
needed to perform dense matrix multiplication using the conventional
algorithm, where the input matrices were too large to fit in the small, fast
memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and
extended it to the parallel case. In both cases the lower bound may be
expressed as $\Omega(\#\text{arithmetic operations} / \sqrt{M})$, where M is the size
of the fast memory (or local memory in the parallel case). Here we generalize
these results to a much wider variety of algorithms, including LU
factorization, Cholesky factorization, $LDL^T$ factorization, QR factorization,
algorithms for eigenvalues and singular values, i.e., essentially all direct
methods of linear algebra. The proof works for dense or sparse matrices, and
for sequential or parallel algorithms. In addition to lower bounds on the
amount of data moved (bandwidth) we get lower bounds on the number of messages
required to move it (latency). We illustrate how to extend our lower bound
technique to compositions of linear algebra operations (like computing powers
of a matrix), to decide whether it is enough to call a sequence of simpler
optimal algorithms (like matrix multiplication) to minimize communication, or
if we can do better. We give examples of both. We also show how to extend our
lower bounds to certain graph theoretic problems.
We point out recently designed algorithms for dense LU, Cholesky, QR,
eigenvalue and the SVD problems that attain these lower bounds; implementations
of LU and QR show large speedups over conventional linear algebra algorithms in
standard libraries like LAPACK and ScaLAPACK. Many open problems remain. Comment: 27 pages, 2 tables
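To make the form of the bound concrete, here is a small arithmetic sketch in Python. It evaluates the quantity #arithmetic operations / sqrt(M) with the constant factor hidden by the asymptotic notation omitted, so the result shows only the scaling, not an exact word count. The function names are hypothetical. The message-count estimate divides the bandwidth figure by M, under the assumption (not stated in the abstract) that a single message carries at most M words.

```python
import math

def bandwidth_lower_bound(flops, fast_memory_words):
    """Words moved, up to a constant factor: flops / sqrt(M)."""
    return flops / math.sqrt(fast_memory_words)

def latency_lower_bound(flops, fast_memory_words):
    """Messages sent, up to a constant factor.

    Assumption: each message holds at most M words, so the message
    count is at least the bandwidth bound divided by M.
    """
    return bandwidth_lower_bound(flops, fast_memory_words) / fast_memory_words

if __name__ == "__main__":
    n = 4096                # matrix dimension (illustrative value)
    M = 2 ** 20             # fast memory size in words (illustrative value)
    flops = 2 * n ** 3      # conventional dense matrix multiplication
    print(bandwidth_lower_bound(flops, M))
    print(latency_lower_bound(flops, M))
```

Quadrupling the fast memory size M halves the bandwidth estimate, which is the sense in which a larger fast memory reduces the communication the bound requires.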
Fast linear algebra is stable
In an earlier paper, we showed that a large class of fast recursive matrix
multiplication algorithms is stable in a normwise sense, and that in fact if
multiplication of $n$-by-$n$ matrices can be done by any algorithm in
$O(n^{\omega + \eta})$ operations for any $\eta > 0$, then it can be done
stably in $O(n^{\omega + \eta})$ operations for any $\eta > 0$. Here we extend
this result to show that essentially all standard linear algebra operations,
including LU decomposition, QR decomposition, linear equation solving, matrix
inversion, solving least squares problems, (generalized) eigenvalue problems
and the singular value decomposition can also be done stably (in a normwise
sense) in $O(n^{\omega + \eta})$ operations. Comment: 26 pages; final version; to appear in Numerische Mathematik
An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling
We present a sparse linear system solver that is based on a multifrontal
variant of Gaussian elimination, and exploits low-rank approximation of the
resulting dense frontal matrices. We use hierarchically semiseparable (HSS)
matrices, which have low-rank off-diagonal blocks, to approximate the frontal
matrices. For HSS matrix construction, a randomized sampling algorithm is used
together with interpolative decompositions. The combination of the randomized
compression with a fast ULV HSS factorization leads to a solver with lower
computational complexity than the standard multifrontal method for many
applications, resulting in speedups of up to 7-fold for problems in our test
suite. The implementation targets many-core systems by using task parallelism
with dynamic runtime scheduling. Numerical experiments show performance
improvements over state-of-the-art sparse direct solvers. The implementation
achieves high performance and good scalability on a range of modern shared
memory parallel systems, including the Intel Xeon Phi (MIC). The code is part
of a software package called STRUMPACK -- STRUctured Matrices PACKage, which
also has a distributed memory component for dense rank-structured matrices.
Gradient type optimization methods for electronic structure calculations
The density functional theory (DFT) in electronic structure calculations can
be formulated as either a nonlinear eigenvalue or direct minimization problem.
The most widely used approach for solving the former is the so-called
self-consistent field (SCF) iteration. A common observation is that the
convergence of SCF is not well understood theoretically, while approaches with
convergence guarantees for solving the latter are often not competitive with SCF numerically.
In this paper, we study gradient type methods for solving the direct
minimization problem by constructing new iterations along the gradient on the
Stiefel manifold. Global convergence (i.e., convergence to a stationary point
from any initial solution) as well as the local convergence rate follow directly from the
standard theory for optimization on manifolds. A major computational
advantage is that the computation of linear eigenvalue problems is no longer
needed. The main costs of our approaches arise from the assembling of the total
energy functional and its gradient and the projection onto the manifold. These
tasks are cheaper than eigenvalue computation and they are often more suitable
for parallelization as long as the evaluation of the total energy functional
and its gradient is efficient. Numerical results show that they can outperform
SCF consistently on many practically large systems. Comment: 24 pages, 11 figures, 59 references, and 1 acknowledgement
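The idea of stepping along the gradient and then returning to the manifold can be sketched on the simplest Stiefel manifold, the unit sphere (a single column). The example below is only an illustration under strong assumptions: the energy is the quadratic form x^T A x rather than a DFT total energy functional, the step size is fixed, and the function name `minimize_on_sphere` is hypothetical. Each iteration projects the Euclidean gradient onto the tangent space, takes a step, and renormalizes, so no eigenvalue solver is called.

```python
import math

def minimize_on_sphere(a, x0, step=0.1, iters=200):
    """Gradient descent for x^T A x subject to ||x|| = 1 (illustrative sketch)."""
    n = len(x0)

    def matvec(x):
        return [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]

    def normalize(x):
        norm = math.sqrt(sum(v * v for v in x))
        return [v / norm for v in x]

    x = normalize(x0)
    for _ in range(iters):
        ax = matvec(x)
        rho = sum(x[i] * ax[i] for i in range(n))       # current energy
        # Gradient projected onto the tangent space of the sphere at x.
        grad = [2.0 * (ax[i] - rho * x[i]) for i in range(n)]
        # Step along the negative gradient, then retract by normalizing.
        x = normalize([x[i] - step * grad[i] for i in range(n)])
    ax = matvec(x)
    rho = sum(x[i] * ax[i] for i in range(n))
    return x, rho

if __name__ == "__main__":
    A = [[2.0, 1.0], [1.0, 2.0]]    # eigenvalues 1 and 3
    x, energy = minimize_on_sphere(A, [1.0, 0.0])
    print(x, energy)
```

For several columns, the normalization step would be replaced by an orthogonalization of the iterate, which is the projection cost the abstract refers to.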