3 research outputs found
Two-Stage Block Orthogonalization to Improve Performance of s-step GMRES
On current computer architectures, GMRES' performance can be limited by its
communication cost to generate orthonormal basis vectors of the Krylov
subspace. To address this performance bottleneck, its s-step variant
orthogonalizes a block of s basis vectors at a time, potentially reducing the
communication cost by a factor of s. Unfortunately, for a large step size
s, the solver can generate extremely ill-conditioned basis vectors, and to
maintain stability in practice, a conservatively small step size is used, which
limits the performance of the s-step solver. To enhance the performance using
a small step size, in this paper, we introduce a two-stage block
orthogonalization scheme. Similar to the original scheme, the first stage of
the proposed method operates on a block of basis vectors at a time, but its
objective is to maintain the well-conditioning of the generated basis vectors
with a lower cost. The orthogonalization of the basis vectors is delayed until
the second stage when enough basis vectors are generated to obtain higher
performance.
Our analysis shows the stability of the proposed two-stage scheme. The
performance is improved because while the same amount of computation as the
original scheme is required, most of the communication is done at the second
stage of the proposed scheme, reducing the overall communication requirements.
Our performance results with up to 192 NVIDIA V100 GPUs on the Summit
supercomputer demonstrate that when solving a 2D Laplace problem, the two-stage
approach can reduce both the orthogonalization time and the total
time-to-solution relative to the original s-step GMRES, which had itself
already obtained speedups in both measures over the standard GMRES. Similar
speedups were obtained for 3D problems and for matrices from the SuiteSparse
Matrix Collection.
Comment: Accepted for publication in IPDPS'2
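The ill-conditioning that motivates the small step size can be illustrated with a short NumPy sketch. This is not the paper's code: the dense 1D Laplacian, the monomial basis [v, Av, ..., A^s v], and the helper names `laplace_1d` and `monomial_basis` are assumptions chosen only to show how the condition number of an s-step basis grows with s.

```python
import numpy as np

def laplace_1d(n):
    """Dense 1D Laplacian (tridiagonal 2, -1); a small stand-in test matrix."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def monomial_basis(A, v, s):
    """Return the n x (s+1) matrix [v, Av, ..., A^s v] with unit-norm columns."""
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(s):
        w = A @ V[:, j]              # one matrix-vector product per new vector
        V[:, j + 1] = w / np.linalg.norm(w)
    return V

rng = np.random.default_rng(0)
A = laplace_1d(200)
v = rng.standard_normal(200)

# Condition number of the unorthogonalized basis for several step sizes.
conds = {s: np.linalg.cond(monomial_basis(A, v, s)) for s in (2, 5, 10, 20)}
for s, c in conds.items():
    print(f"s = {s:2d}: cond(V) = {c:.2e}")
```

Because each larger basis contains the smaller one as its leading columns, the condition number can only grow as s increases, which is why a block of many vectors is harder to orthogonalize stably than a block of a few.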
Performance of random sampling for computing low-rank approximations of a dense matrix on GPUs
A low-rank approximation of a dense matrix plays an important role in many applications. To compute such an approximation, a common approach uses the QR factorization with column pivoting (QRCP). Though the reliability and efficiency of QRCP have been demonstrated, this deterministic approach requires costly communication at each step of the factorization. Since such communication is becoming increasingly expensive on modern computers, an alternative approach based on random sampling, which can be implemented using communication-optimal kernels, is becoming attractive. To study its potential, in this paper, we compare the performance of random sampling with that of QRCP on an NVIDIA Kepler GPU. Our performance results demonstrate that random sampling can be up to 12.8× faster than the deterministic approach for computing an approximation of the same accuracy. We also present the parallel scaling of random sampling over multiple GPUs on a single compute node, showing a speedup of 3.8× over three Kepler GPUs. These results demonstrate the potential of random sampling as an excellent computational tool for many applications, and its potential is likely to grow on emerging computers with increasing communication costs.
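A generic random-sampling scheme of the kind the abstract refers to can be sketched in a few lines of NumPy. This is a textbook-style illustration and not necessarily the variant benchmarked in the paper: the Gaussian test matrix, the oversampling parameter `p`, and the function name `random_sampling_lowrank` are assumptions.

```python
import numpy as np

def random_sampling_lowrank(A, k, p=10, seed=0):
    """Approximate A by Q @ B, where Q has k + p orthonormal columns.

    The only dense kernels are a matrix-matrix product and an unpivoted QR,
    which avoids the per-column pivoting decisions of QRCP.
    """
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k + p))  # random sampling matrix
    Y = A @ Omega                                     # sample the range of A
    Q, _ = np.linalg.qr(Y)                            # orthonormal basis of the sample
    B = Q.T @ A                                       # project A onto that basis
    return Q, B

# Test matrix that has rank exactly k, so the approximation should be
# accurate to roughly machine precision.
rng = np.random.default_rng(1)
m, n, k = 300, 200, 15
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))

Q, B = random_sampling_lowrank(A, k)
rel_err = np.linalg.norm(A - Q @ B) / np.linalg.norm(A)
print(f"relative error: {rel_err:.2e}")
```

For matrices whose singular values decay slowly rather than dropping to zero, the error depends on the decay and on the oversampling `p`, so this exact-rank example is the most favorable case.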
An overview of block Gram-Schmidt methods and their stability properties
Block Gram-Schmidt algorithms serve as essential kernels in many scientific
computing applications, but for many commonly used variants, a rigorous
treatment of their stability properties remains open. This survey provides a
comprehensive categorization of block Gram-Schmidt algorithms, particularly
those used in Krylov subspace methods to build orthonormal bases one block
vector at a time. All known stability results are assembled, and new results
are summarized or conjectured for important communication-reducing variants.
Additionally, new block versions of low-synchronization variants are derived,
and their efficacy and stability are demonstrated for a wide range of
challenging examples. Low-synchronization variants appear remarkably stable for
s-step-like matrices built with Newton polynomials, pointing towards a new
stable and efficient backbone for Krylov subspace methods. Numerical examples
are computed with a versatile MATLAB package hosted at
https://github.com/katlund/BlockStab, and scripts for reproducing all results
in the paper are provided. Block Gram-Schmidt implementations in popular
software packages are discussed, along with a number of open problems. An
appendix containing all algorithms type-set in a uniform fashion is provided.
Comment: 42 pages, 5 tables, 17 figures, 20 algorithm
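As a concrete instance of building an orthonormal basis one block vector at a time, the following is a minimal sketch of block classical Gram-Schmidt with a Householder QR inside each block. It is written for illustration and is not taken from the BlockStab package; the function name `bcgs` and the block size parameter `s` are assumptions.

```python
import numpy as np

def bcgs(X, s):
    """Block classical Gram-Schmidt: factor X = Q @ R, s columns at a time."""
    n, m = X.shape
    Q = np.zeros((n, m))
    R = np.zeros((m, m))
    for start in range(0, m, s):
        cols = slice(start, min(start + s, m))
        W = X[:, cols].copy()
        if start > 0:
            prev = slice(0, start)
            # Inter-block step: project out all previous blocks at once.
            R[prev, cols] = Q[:, prev].T @ W
            W -= Q[:, prev] @ R[prev, cols]
        # Intra-block step: orthogonalize the columns within the block.
        Q[:, cols], R[cols, cols] = np.linalg.qr(W)
    return Q, R

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 12))   # well-conditioned test matrix
Q, R = bcgs(X, s=4)

orth_err = np.linalg.norm(Q.T @ Q - np.eye(12))
resid = np.linalg.norm(X - Q @ R) / np.linalg.norm(X)
print(f"loss of orthogonality: {orth_err:.2e}, relative residual: {resid:.2e}")
```

The random test matrix here is well conditioned, so the loss of orthogonality stays small. For ill-conditioned inputs this single-pass variant can lose orthogonality, which is the kind of behavior the survey's stability analysis addresses.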