Massively parallel Poisson and QR factorization solvers

Author: Lucká M.
Vajteršic M.
Viktorinová E.
Publication venue: Published by Elsevier Ltd.
Publication date: 31/03/1996

The paper presents a massively parallel Poisson solver for a rectangular domain and parallel algorithms for computing the QR factorization of a dense matrix A by means of Householder reflections and Givens rotations. The computer model under consideration is a SIMD mesh-connected toroidal n × n processor array. The Dirichlet problem is replaced by its finite-difference analog on an M × N grid (M + 1 and N are powers of two). The algorithm is composed of parallel fast sine transform and cyclic odd-even reduction blocks and runs in a fully parallel fashion. Its computational complexity is O(MN log L / n^2), where L = max(M + 1, N). The parallel QR factorization by the Householder method zeros all subdiagonal elements in each column and updates all elements of the given submatrix in parallel. For the second method, with Givens rotations, the parallel scheme of Sameh and Kuck was chosen, in which disjoint rotations can be computed simultaneously. The algorithms were coded in the MPF and MPL parallel programming languages, and results of computational experiments on the MasPar MP-1 system are also presented.
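As a rough illustration of the Givens-rotation approach mentioned in the abstract, the following is a minimal sequential sketch in Python. It is not the authors' MPF/MPL code and it does not reproduce the Sameh and Kuck schedule; the function name `givens_qr` and the list-of-lists matrix layout are assumptions made for this example. Each rotation touches only two rows, which is why rotations acting on disjoint row pairs could in principle be applied at the same time.

```python
import math

def givens_qr(a):
    """Return (q, r) with a = q * r, built from Givens rotations.

    `a` is an m x n matrix stored as a list of rows (hypothetical layout).
    """
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    # q starts as the identity and accumulates the transposed rotations
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # zero column j from the bottom row up to the first subdiagonal entry
        for i in range(m - 1, j, -1):
            x, y = r[i - 1][j], r[i][j]
            h = math.hypot(x, y)
            if h == 0.0:
                continue  # nothing to eliminate
            c, s = x / h, y / h
            # rotate rows i-1 and i of r so that r[i][j] becomes zero
            for k in range(n):
                u, v = r[i - 1][k], r[i][k]
                r[i - 1][k] = c * u + s * v
                r[i][k] = -s * u + c * v
            # apply the transposed rotation to columns i-1 and i of q
            for k in range(m):
                u, v = q[k][i - 1], q[k][i]
                q[k][i - 1] = c * u + s * v
                q[k][i] = -s * u + c * v
    return q, r
```

Calling `givens_qr([[3.0, 1.0], [4.0, 2.0], [0.0, 5.0]])` returns a 3 × 3 orthogonal factor and a 3 × 2 factor whose subdiagonal entries are zero up to rounding.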

Elsevier - Publisher Connector

Stable Sparse Orthogonal Factorization of Ill-Conditioned Banded Matrices for Parallel Computing

Author: Huang Qian
Publication venue: SURFACE at Syracuse University
Publication date: 25/08/2017

Sequential and parallel algorithms based on the LU factorization or the QR factorization have been intensely studied and widely used for computations with large-scale ill-conditioned banded matrices. Major concerns about existing methods include ill-conditioning, sparsity of the factor matrices, computational complexity, and scalability. In this dissertation, we study a sparse orthogonal factorization of a banded matrix motivated by parallel computing. Specifically, we develop a process to factorize a banded matrix as a product of a sparse orthogonal matrix and a sparse matrix which can be transformed to an upper triangular matrix by column permutations. We prove that the proposed process has low complexity and is numerically stable, with stability results similar to those of the modified Gram-Schmidt process. On this basis, we develop a parallel algorithm for the factorization in a distributed computing environment. Through an analysis of its performance, we show that the communication costs reach the theoretical least upper bounds, while its parallel complexity or speedup approaches the optimal bound. For an ill-conditioned banded system, we construct a sequential solver that breaks it down into small-scale underdetermined systems, which are solved by the proposed factorization with high accuracy. We also implement a parallel solver with strategies to treat the memory issue appearing in extra-large-scale linear systems of size over one billion. Numerical experiments confirm the theoretical results derived in this thesis and demonstrate the superior accuracy and scalability of the proposed solvers for ill-conditioned linear systems, compared with the most commonly used direct solvers.

Syracuse University Research Facility and Collaborative Environment

Minimizing Communication in Linear Algebra

Author: Blackford L. S.
Grey Ballard
James Demmel
Oded Schwartz
Olga Holtz
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 01/01/2009

In 1981 Hong and Kung proved a lower bound on the amount of communication needed to perform dense matrix multiplication using the conventional $O(n^3)$ algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and extended it to the parallel case. In both cases the lower bound may be expressed as $\Omega(\#\text{arithmetic operations} / \sqrt{M})$, where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, $LDL^T$ factorization, QR factorization, algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and the SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain. Comment: 27 pages, 2 table
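To make the bandwidth bound concrete, here is a small worked computation in Python. It drops all constant factors, so the numbers only indicate orders of magnitude. The function names are hypothetical, and the latency estimate assumes that a single message carries at most M words, which is an assumption of this sketch and not something quoted from the abstract.

```python
import math

def bandwidth_lower_bound(flops, fast_memory_words):
    """Words moved, up to a constant factor: flops / sqrt(M)."""
    return flops / math.sqrt(fast_memory_words)

def latency_lower_bound(flops, fast_memory_words):
    """Messages sent, up to a constant factor.

    Assumes no message is larger than the fast memory (M words),
    so at least (words moved) / M messages are needed.
    """
    return bandwidth_lower_bound(flops, fast_memory_words) / fast_memory_words

n = 1024          # matrix dimension (illustrative value)
M = 4096          # fast memory size in words (illustrative value)
flops = n ** 3    # conventional matrix multiplication, constants dropped

words = bandwidth_lower_bound(flops, M)    # 2**30 / 2**6  = 16777216.0
messages = latency_lower_bound(flops, M)   # 2**24 / 2**12 = 4096.0
print(words, messages)
```

Quadrupling the fast memory in this model halves the bandwidth bound, since the bound scales with the inverse square root of M.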

arXiv.org e-Print Archive

CiteSeerX

Crossref

Fast linear algebra is stable

Author: A. Borodin
A. Edelman
A. Schönhage
A.N. Malyshev
A.Ya. Bulgakov
C. Bischof
D. Bini
D. Coppersmith
D. Heller
E. Elmroth
G. Golub
G.W. Stewart
Ioana Dumitriu
J. Demmel
J. Roberts
J. Varah
James Demmel
M. Gu
N. Higham
N.J. Higham
Olga Holtz
P. Bürgisser
P. Hong
R. Cormen
R. Schreiber
R.J. Muirhead
S. Chandrasekaran
S. Huss
S. Toledo
S.K. Godunov
T. Chan
T.W. Anderson
V. Strassen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2006

In an earlier paper, we showed that a large class of fast recursive matrix multiplication algorithms is stable in a normwise sense, and that in fact if multiplication of $n$-by-$n$ matrices can be done by any algorithm in $O(n^{\omega + \eta})$ operations for any $\eta > 0$, then it can be done stably in $O(n^{\omega + \eta})$ operations for any $\eta > 0$. Here we extend this result to show that essentially all standard linear algebra operations, including LU decomposition, QR decomposition, linear equation solving, matrix inversion, solving least squares problems, (generalized) eigenvalue problems and the singular value decomposition can also be done stably (in a normwise sense) in $O(n^{\omega + \eta})$ operations. Comment: 26 pages; final version; to appear in Numerische Mathemati
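Strassen's algorithm is one well-known example of a fast recursive matrix multiplication algorithm. The abstract does not name it, so the sketch below is only an illustration of the class, not the construction analyzed in the paper. It assumes square matrices whose size is a power of two, stored as lists of rows, and it recurses all the way down to 1 × 1 blocks for brevity. The helper names are hypothetical.

```python
def _add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def _sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def _split(a):
    """Split a square matrix into its four equally sized quadrants."""
    h = len(a) // 2
    return ([row[:h] for row in a[:h]], [row[h:] for row in a[:h]],
            [row[:h] for row in a[h:]], [row[h:] for row in a[h:]])

def strassen(a, b):
    """Multiply two n x n matrices (n a power of two) with 7 recursive products."""
    n = len(a)
    if n == 1:
        return [[a[0][0] * b[0][0]]]
    a11, a12, a21, a22 = _split(a)
    b11, b12, b21, b22 = _split(b)
    # the seven Strassen products replace the eight products of the block formula
    m1 = strassen(_add(a11, a22), _add(b11, b22))
    m2 = strassen(_add(a21, a22), b11)
    m3 = strassen(a11, _sub(b12, b22))
    m4 = strassen(a22, _sub(b21, b11))
    m5 = strassen(_add(a11, a12), b22)
    m6 = strassen(_sub(a21, a11), _add(b11, b12))
    m7 = strassen(_sub(a12, a22), _add(b21, b22))
    c11 = _add(_sub(_add(m1, m4), m5), m7)
    c12 = _add(m3, m5)
    c21 = _add(m2, m4)
    c22 = _add(_add(_sub(m1, m2), m3), m6)
    # stitch the quadrants back into one matrix
    top = [r1 + r2 for r1, r2 in zip(c11, c12)]
    bottom = [r1 + r2 for r1, r2 in zip(c21, c22)]
    return top + bottom
```

Seven half-size products per level give an operation count of order n^(log2 7), roughly n^2.81. With integer inputs the result is exact, so this sketch says nothing about the normwise stability that the paper addresses.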

arXiv.org e-Print Archive

CiteSeerX

Crossref

eScholarship - University of California

An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling

Author: Ghysels Pieter
Li Xiaoye S.
Napov Artem
Rouet Francois-Henry
Williams Samuel
Publication date: 25/02/2015

We present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination, and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups of up to 7-fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK (STRUctured Matrices PACKage), which also has a distributed memory component for dense rank-structured matrices.

arXiv.org e-Print Archive

eScholarship - University of California

DI-fusion

Gradient type optimization methods for electronic structure calculations

Author: Wen Zaiwen
Zhang Xin
Zhou Aihui
Zhu Jinwei
Publication date: 13/08/2013

Density functional theory (DFT) in electronic structure calculations can be formulated as either a nonlinear eigenvalue problem or a direct minimization problem. The most widely used approach for solving the former is the so-called self-consistent field (SCF) iteration. A common observation is that the convergence of SCF is not clear theoretically, while approaches with convergence guarantees for solving the latter are often not competitive with SCF numerically. In this paper, we study gradient type methods for solving the direct minimization problem by constructing new iterations along the gradient on the Stiefel manifold. Global convergence (i.e., convergence to a stationary point from any initial solution) as well as the local convergence rate follow directly from the standard theory for optimization on manifolds. A major computational advantage is that the computation of linear eigenvalue problems is no longer needed. The main costs of our approaches arise from the assembling of the total energy functional and its gradient and the projection onto the manifold. These tasks are cheaper than eigenvalue computation and they are often more suitable for parallelization as long as the evaluation of the total energy functional and its gradient is efficient. Numerical results show that they can outperform SCF consistently on many practically large systems. Comment: 24 pages, 11 figures, 59 references, and 1 acknowledgement
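The idea of iterating along the gradient while staying on the manifold can be sketched on a toy problem. The example below is not the authors' method. It assumes the simplest Stiefel manifold (a single column, i.e. the unit sphere), a quadratic energy x^T A x in place of the nonlinear DFT energy, a fixed step size, and normalization as the projection back onto the manifold. All names are hypothetical.

```python
import math

def _matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def _dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def _normalize(x):
    nrm = math.sqrt(_dot(x, x))
    return [xi / nrm for xi in x]

def sphere_gradient_descent(a, x0, step=0.1, iters=200):
    """Minimize x^T A x over unit vectors x by projected gradient steps.

    Toy stand-in for a gradient method on the Stiefel manifold:
    no eigenvalue problem is solved at any point.
    """
    x = _normalize(x0)
    for _ in range(iters):
        ax = _matvec(a, x)
        energy = _dot(x, ax)
        # project the Euclidean gradient 2*A*x onto the tangent space at x
        grad = [2.0 * (axi - energy * xi) for axi, xi in zip(ax, x)]
        # step along the negative gradient, then map back onto the sphere
        x = _normalize([xi - step * gi for xi, gi in zip(x, grad)])
    return x, _dot(x, _matvec(a, x))

a = [[2.0, 1.0], [1.0, 2.0]]   # symmetric test matrix with eigenvalues 1 and 3
x, energy = sphere_gradient_descent(a, [1.0, 0.0])
print(x, energy)
```

For this matrix the iteration settles on the direction of the smallest eigenvalue, so the final energy approaches 1. Each step costs one matrix-vector product plus a normalization.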

arXiv.org e-Print Archive