An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling
We present a sparse linear system solver that is based on a multifrontal
variant of Gaussian elimination, and exploits low-rank approximation of the
resulting dense frontal matrices. We use hierarchically semiseparable (HSS)
matrices, which have low-rank off-diagonal blocks, to approximate the frontal
matrices. For HSS matrix construction, a randomized sampling algorithm is used
together with interpolative decompositions. The combination of the randomized
compression with a fast ULV HSS factorization leads to a solver with lower
computational complexity than the standard multifrontal method for many
applications, resulting in speedups of up to 7-fold for problems in our test
suite. The implementation targets many-core systems by using task parallelism
with dynamic runtime scheduling. Numerical experiments show performance
improvements over state-of-the-art sparse direct solvers. The implementation
achieves high performance and good scalability on a range of modern shared
memory parallel systems, including the Intel Xeon Phi (MIC). The code is part
of a software package called STRUMPACK -- STRUctured Matrices PACKage, which
also has a distributed memory component for dense rank-structured matrices.
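The randomized sampling idea behind the HSS compression above can be sketched with a basic randomized range finder; the function name, oversampling parameter, and test matrix below are illustrative only, not STRUMPACK's API:

```python
import numpy as np

def randomized_range_finder(A, rank, oversample=10, seed=0):
    """Approximate the column space of A from a few random products,
    the core idea behind randomized low-rank compression
    (illustrative sketch, not STRUMPACK's implementation)."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    # Gaussian test matrix: A @ Omega samples the range of A.
    Omega = rng.standard_normal((n, rank + oversample))
    Y = A @ Omega
    # Orthonormal basis Q of the sample, so A ~= Q @ (Q.T @ A).
    Q, _ = np.linalg.qr(Y)
    return Q

# A rank-5 test matrix stands in for a compressible frontal block.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 150))
Q = randomized_range_finder(A, rank=5)
err = np.linalg.norm(A - Q @ (Q.T @ A)) / np.linalg.norm(A)
```

In an HSS setting this sampling is applied to the off-diagonal blocks, and interpolative decompositions extract the actual bases; the sketch only shows why a handful of random products suffices to capture a low-rank block.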
Augmented Block-Arnoldi Recycling CFD Solvers
One of the limitations of recycled GCRO methods is the large amount of
computation required to orthogonalize the basis vectors of the newly generated
Krylov subspace for the approximate solution when combined with those of the
recycle subspace. Recent advancements in low synchronization Gram-Schmidt and
generalized minimal residual algorithms, Swirydowicz et
al.~\cite{2020-swirydowicz-nlawa}, Carson et al.~\cite{Carson2022}, and Lund
\cite{Lund2022}, can be incorporated, thereby mitigating the loss of
orthogonality of the basis vectors. An augmented Arnoldi formulation of
recycling leads to a matrix decomposition and the associated algorithm can also
be viewed as a {\it block} Krylov method. Generalizations of both classical and
modified block Gram-Schmidt algorithms have been proposed, Carson et
al.~\cite{Carson2022}. Here, an inverse compact modified Gram-Schmidt
algorithm is applied for the inter-block orthogonalization scheme, with a block
lower triangular correction matrix applied at each iteration. When combined with a
weighted (oblique inner product) projection step, the inverse compact
scheme leads to significant (over 10-fold in certain cases) reductions in
the number of solver iterations per linear system. The weight is also
interpreted in terms of the angle between restart residuals in LGMRES, as
defined by Baker et al.~\cite{Baker2005}. In many cases, the eigen-spectrum of
the recycle subspace can substitute for a preconditioner.
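The triangular-correction idea behind low-synchronization Gram-Schmidt can be illustrated in a few lines. The sketch below folds all inner products of one orthogonalization step into a single block reduction and applies a strictly lower-triangular correction, in the spirit of Swirydowicz et al.; it is a simplified illustration, not the inverse compact scheme of the abstract:

```python
import numpy as np

def low_sync_mgs_step(Q, w):
    """Orthogonalize w against the columns of Q using one global
    reduction (Q.T @ [Q, w]) plus a small triangular solve, instead
    of k sequential MGS projections.  Illustrative sketch only."""
    k = Q.shape[1]
    # Single block reduction: Q^T Q and Q^T w in one communication.
    R = Q.T @ np.column_stack([Q, w])
    L = np.tril(R[:, :k], -1)            # strictly lower part of Q^T Q
    r = np.linalg.solve(np.eye(k) + L, R[:, k])
    w = w - Q @ r                        # lagged MGS projection
    return w / np.linalg.norm(w)

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((50, 5)))
q = low_sync_mgs_step(Q, rng.standard_normal(50))
```

The triangular solve reproduces the effect of the sequential MGS projectors from quantities gathered in one reduction, which is what cuts the synchronization count on distributed machines.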
Adaptively restarted block Krylov subspace methods with low-synchronization skeletons
With the recent realization of exascale performance by Oak Ridge National
Laboratory's Frontier supercomputer, reducing communication in kernels like QR
factorization has become even more imperative. Low-synchronization Gram-Schmidt
methods, first introduced in [K. \'{S}wirydowicz, J. Langou, S. Ananthan, U.
Yang, and S. Thomas, Low Synchronization Gram-Schmidt and Generalized Minimum
Residual Algorithms, Numer. Lin. Alg. Appl., Vol. 28(2), e2343, 2020], have
been shown to improve the scalability of the Arnoldi method in high-performance
distributed computing. Block versions of low-synchronization Gram-Schmidt show
further potential for speeding up algorithms, as column-batching allows for
maximizing cache usage with matrix-matrix operations. In this work,
low-synchronization block Gram-Schmidt variants from [E. Carson, K. Lund, M.
Rozlo\v{z}n\'{i}k, and S. Thomas, Block Gram-Schmidt algorithms and their
stability properties, Lin. Alg. Appl., 638, pp. 150--195, 2022] are transformed
into block Arnoldi variants for use in block full orthogonalization methods
(BFOM) and block generalized minimal residual methods (BGMRES). An adaptive
restarting heuristic is developed to handle instabilities that arise with the
increasing condition number of the Krylov basis. The performance, accuracy, and
stability of these methods are assessed via a flexible benchmarking tool
written in MATLAB. The modularity of the tool additionally permits generalized
block inner products, like the global inner product.
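A minimal skeleton of block Gram-Schmidt with a pluggable block inner product shows how such modularity can accommodate generalized inner products; the interface below is an illustrative sketch, not the MATLAB benchmarking tool's API:

```python
import numpy as np

def block_cgs(blocks, inner=lambda U, V: U.T @ V):
    """Block classical Gram-Schmidt "skeleton": inter-block
    projections use the pluggable `inner`, and NumPy's QR serves as
    the intra-block "muscle".  Illustrative sketch only."""
    Q = []
    for X in blocks:
        for Qk in Q:
            X = X - Qk @ inner(Qk, X)    # project out earlier blocks
        Qj, _ = np.linalg.qr(X)          # orthonormalize within block
        Q.append(Qj)
    return np.hstack(Q)

rng = np.random.default_rng(3)
Q = block_cgs([rng.standard_normal((100, 4)) for _ in range(3)])
```

Swapping `inner` for a scaled-trace variant would give a global-style block inner product, though the intra-block factorization would then also need to respect it; the point of the skeleton is that only these two slots change.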
- …