6 research outputs found
Householder orthogonalization with a non-standard inner product
Householder orthogonalization plays an important role in numerical linear
algebra. It attains perfect orthogonality regardless of the conditioning of the
input. However, in the context of a non-standard inner product, it becomes
difficult to apply Householder orthogonalization due to the lack of an initial
orthogonal basis. We propose strategies to overcome this obstacle and discuss
algorithms and variants of Householder orthogonalization with a non-standard
inner product. Rounding error analysis and numerical experiments demonstrate
that our approach is numerically stable.
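For context, a non-standard inner product usually means <x, y>_B = x^T B y for a symmetric positive definite matrix B. The sketch below is not the paper's Householder-based method; it only illustrates B-orthogonalization with a modified Gram-Schmidt baseline and checks the resulting B-orthogonality (all names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 5

# Symmetric positive definite B defining the inner product <x, y>_B = x^T B y
C = rng.standard_normal((m, m))
B = C @ C.T + m * np.eye(m)

X = rng.standard_normal((m, n))

def b_mgs(X, B):
    """Modified Gram-Schmidt in the B-inner product (a baseline, not Householder)."""
    Q = X.copy()
    k = Q.shape[1]
    for j in range(k):
        q = Q[:, j]
        q /= np.sqrt(q @ B @ q)               # B-normalize column j (in place)
        for i in range(j + 1, k):
            Q[:, i] -= (q @ B @ Q[:, i]) * q  # remove the B-component along q
    return Q

Q = b_mgs(X, B)
err = np.linalg.norm(Q.T @ B @ Q - np.eye(n))  # loss of B-orthogonality
print(err)
```

For well-conditioned X this simple baseline already gives B-orthogonality near roundoff; the paper's point is that a Householder-type approach keeps that property regardless of the conditioning of X.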
Two-Stage Block Orthogonalization to Improve Performance of s-step GMRES
On current computer architectures, GMRES' performance can be limited by its
communication cost to generate orthonormal basis vectors of the Krylov
subspace. To address this performance bottleneck, its s-step variant
orthogonalizes a block of basis vectors at a time, potentially reducing the
communication cost by a factor of s. Unfortunately, for a large step size
s, the solver can generate extremely ill-conditioned basis vectors, and to
maintain stability in practice, a conservatively small step size is used, which
limits the performance of the s-step solver. To enhance the performance using
a small step size, in this paper, we introduce a two-stage block
orthogonalization scheme. Similar to the original scheme, the first stage of
the proposed method operates on a block of basis vectors at a time, but its
objective is to maintain the well-conditioning of the generated basis vectors
with a lower cost. The orthogonalization of the basis vectors is delayed until
the second stage when enough basis vectors are generated to obtain higher
performance.
Our analysis shows the stability of the proposed two-stage scheme. Performance
improves because, while the same amount of computation as the original scheme
is required, most of the communication is moved to the second stage of the
proposed scheme, reducing the overall communication requirements.
Our performance results with up to 192 NVIDIA V100 GPUs on the Summit
supercomputer demonstrate that when solving a 2D Laplace problem, the two-stage
approach can reduce the orthogonalization time and the total time-to-solution
by the respective factors of up to and over the
original s-step GMRES, which had already obtained the respective speedups of
and over the standard GMRES. Similar speedups were
obtained for 3D problems and for matrices from the SuiteSparse Matrix
Collection.
Comment: Accepted for publication in IPDPS'2
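The two-stage idea can be caricatured in a few lines of NumPy. In this sketch (an illustration of the general idea, not the authors' exact algorithm), stage one applies a cheap intra-block Cholesky QR to keep each new block well conditioned, and stage two delays full orthogonalization until all blocks are available:

```python
import numpy as np

rng = np.random.default_rng(1)
m, s, nblocks = 200, 4, 3

def chol_qr(X):
    """Intra-block Cholesky QR: one Gram-matrix reduction, cheap in communication."""
    R = np.linalg.cholesky(X.T @ X).T      # X^T X = R^T R, R upper triangular
    return np.linalg.solve(R.T, X.T).T, R  # Q = X R^{-1}

# Stage 1: as each block of s basis vectors is generated, keep it
# well conditioned at low cost (here: one Cholesky QR per block).
Qs = [chol_qr(rng.standard_normal((m, s)))[0] for _ in range(nblocks)]

# Stage 2 (delayed): once enough blocks have accumulated, orthogonalize
# them all at once in a single, higher-performance operation.
Q, _ = np.linalg.qr(np.hstack(Qs))
loss = np.linalg.norm(Q.T @ Q - np.eye(s * nblocks))
print(loss)
```

The communication pattern is the point: stage one needs only one reduction per block, while the expensive all-to-all orthogonalization happens once, over many vectors at a time.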
A robust, open-source implementation of the locally optimal block preconditioned conjugate gradient for large eigenvalue problems in quantum chemistry
We present two open-source implementations of the locally optimal block preconditioned conjugate gradient (LOBPCG) algorithm to find a few eigenvalues and eigenvectors of large, possibly sparse matrices. We then test LOBPCG on various quantum chemistry problems, encompassing medium to large, dense to sparse, and well-behaved to ill-conditioned ones, where the standard method typically used is Davidson's diagonalization. Numerical tests show that while Davidson's method remains the best choice for most applications in quantum chemistry, LOBPCG represents a competitive alternative, especially when memory is an issue, and can even outperform Davidson for ill-conditioned, non-diagonally dominant problems.
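For readers who want to try the algorithm, SciPy ships a widely used open-source LOBPCG implementation, scipy.sparse.linalg.lobpcg. The matrix below is a simple diagonally dominant stand-in, not a quantum chemistry Hamiltonian:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import lobpcg

# Sparse, diagonally dominant test matrix with well-separated eigenvalues
n = 200
A = diags([0.01 * np.ones(n - 1), np.arange(1.0, n + 1), 0.01 * np.ones(n - 1)],
          [-1, 0, 1], format="csr")

rng = np.random.default_rng(2)
X = rng.standard_normal((n, 4))  # block of 4 random initial guess vectors
w, v = lobpcg(A, X, largest=False, tol=1e-6, maxiter=500)
print(np.sort(w))                # approximately the 4 smallest diagonal entries
```

A preconditioner can be supplied via the M argument, which is where LOBPCG's advantage over unpreconditioned methods typically comes from on ill-conditioned problems.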
Shifted Cholesky QR for computing the QR factorization of ill-conditioned matrices
The Cholesky QR algorithm is an efficient communication-minimizing algorithm for computing the QR factorization of a tall-skinny matrix X ∈ R^(m×n), where m >> n. Unfortunately, it is inherently unstable and often breaks down when the matrix is ill-conditioned. A recent work [Yamamoto et al., ETNA, 44, pp. 306--326 (2015)] establishes that the instability can be cured by repeating the algorithm twice (called CholeskyQR2). However, the applicability of CholeskyQR2 is still limited by the requirement that the Cholesky factorization of the Gram matrix X^T X runs to completion, which means that it does not always work for matrices X with the 2-norm condition number κ₂(X) roughly greater than u^(−1/2), where u is the unit roundoff. In this work we extend the applicability to κ₂(X) = O(u^(−1)) by introducing a shift to the computed Gram matrix so as to guarantee that the Cholesky factorization R^T R = A^T A + sI succeeds numerically. We show that the computed AR^(−1) has a reduced condition number that is roughly bounded by u^(−1/2), for which CholeskyQR2 safely computes the QR factorization, yielding a computed Q with orthogonality ‖Q^T Q − I‖₂ and residual ‖A − QR‖_F / ‖A‖_F both of the order of u. Thus we obtain the required QR factorization by essentially running Cholesky QR three times. We extensively analyze the resulting algorithm, shiftedCholeskyQR3, to reveal its excellent numerical stability. The shiftedCholeskyQR3 algorithm is also highly parallelizable, and is applicable and effective when working with an oblique inner product. We illustrate our findings through experiments, in which we achieve significant speedups over alternative methods.
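A minimal NumPy sketch of the three-pass scheme described above: one shifted Cholesky QR to tame the condition number, followed by two ordinary Cholesky QR passes. The particular shift constant below is an assumption for illustration, not necessarily the paper's exact choice:

```python
import numpy as np

def chol_qr(A):
    """One Cholesky QR pass: A^T A = R^T R, Q = A R^{-1}."""
    R = np.linalg.cholesky(A.T @ A).T
    return np.linalg.solve(R.T, A.T).T, R

def shifted_chol_qr3(A):
    m, n = A.shape
    u = np.finfo(A.dtype).eps / 2  # unit roundoff
    # Shift large enough that Cholesky of the computed Gram matrix succeeds;
    # the constant here is illustrative, chosen proportional to u * ||A||_2^2.
    s = 11 * (m * n + n * (n + 1)) * u * np.linalg.norm(A, 2) ** 2
    R1 = np.linalg.cholesky(A.T @ A + s * np.eye(n)).T  # shifted Cholesky
    Q = np.linalg.solve(R1.T, A.T).T   # A R1^{-1}: condition number now reduced
    Q, R2 = chol_qr(Q)                 # CholeskyQR2 on the
    Q, R3 = chol_qr(Q)                 # well-conditioned iterate
    return Q, R3 @ R2 @ R1

rng = np.random.default_rng(3)
A = rng.standard_normal((500, 20)) * np.logspace(0, -8, 20)  # kappa_2(A) ~ 1e8
Q, R = shifted_chol_qr3(A)
orth = np.linalg.norm(Q.T @ Q - np.eye(20))
resid = np.linalg.norm(A - Q @ R) / np.linalg.norm(A)
print(orth, resid)
```

On this deliberately ill-conditioned example, plain Cholesky QR would break down at the very first Cholesky factorization, while the shifted variant completes with orthogonality and residual near roundoff.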
An overview of block Gram-Schmidt methods and their stability properties
Block Gram-Schmidt algorithms serve as essential kernels in many scientific
computing applications, but for many commonly used variants, a rigorous
treatment of their stability properties remains open. This survey provides a
comprehensive categorization of block Gram-Schmidt algorithms, particularly
those used in Krylov subspace methods to build orthonormal bases one block
vector at a time. All known stability results are assembled, and new results
are summarized or conjectured for important communication-reducing variants.
Additionally, new block versions of low-synchronization variants are derived,
and their efficacy and stability are demonstrated for a wide range of
challenging examples. Low-synchronization variants appear remarkably stable for
s-step-like matrices built with Newton polynomials, pointing towards a new
stable and efficient backbone for Krylov subspace methods. Numerical examples
are computed with a versatile MATLAB package hosted at
https://github.com/katlund/BlockStab, and scripts for reproducing all results
in the paper are provided. Block Gram-Schmidt implementations in popular
software packages are discussed, along with a number of open problems. An
appendix containing all algorithms typeset in a uniform fashion is provided.
Comment: 42 pages, 5 tables, 17 figures, 20 algorithm
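To give a flavor of the algorithm family the survey categorizes, here is a minimal block classical Gram-Schmidt (BCGS) sketch in NumPy; the survey's own reference implementations live in the MATLAB BlockStab package linked above:

```python
import numpy as np

def bcgs(blocks):
    """Block classical Gram-Schmidt: build an orthonormal basis one
    block vector at a time."""
    Q = np.empty((blocks[0].shape[0], 0))
    for Xk in blocks:
        Yk = Xk - Q @ (Q.T @ Xk)   # project against all previous blocks at once
        Qk, _ = np.linalg.qr(Yk)   # intra-block orthogonalization ("IntraOrtho")
        Q = np.hstack([Q, Qk])
    return Q

rng = np.random.default_rng(4)
blocks = [rng.standard_normal((100, 4)) for _ in range(5)]
Q = bcgs(blocks)
loss = np.linalg.norm(Q.T @ Q - np.eye(20))  # loss of orthogonality
print(loss)
```

The stability questions the survey studies arise exactly here: the inter-block projection and the choice of IntraOrtho each contribute to the loss of orthogonality, which for plain BCGS can grow with the conditioning of the inputs.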