66 research outputs found
The impact of global communication latency at extreme scales on Krylov methods
Krylov Subspace Methods (KSMs) are popular numerical tools for solving large linear systems of equations. We consider their role in solving sparse systems on future massively parallel distributed memory machines, by estimating the future performance of their constituent operations. To this end we construct a model that is simple but takes topology and network acceleration into account, as both are important considerations. We show that, as the number of nodes of a parallel machine increases to very large numbers, the increasing latency cost of reductions may well become a problematic bottleneck for traditional formulations of these methods. Finally, we discuss how pipelined KSMs can be used to tackle this potential problem, and which pipeline depths are appropriate.
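The scaling argument in this abstract can be illustrated with a far cruder model than the one the authors describe. The sketch below assumes a binary-tree reduction followed by a broadcast, ignores topology and network acceleration entirely, and uses made-up values for `hop_latency`, `total_work` and `reductions_per_iter`; all function names are hypothetical.

```python
import math

def allreduce_latency(nodes, hop_latency=1e-6):
    """Latency of a tree reduction plus broadcast: 2 * ceil(log2(P)) hops (assumed model)."""
    if nodes < 1:
        raise ValueError("nodes must be positive")
    return 2 * math.ceil(math.log2(nodes)) * hop_latency

def iteration_time(nodes, total_work=1.0, reductions_per_iter=2, hop_latency=1e-6):
    """Strong scaling: local compute shrinks as 1/P, reduction latency grows as log P."""
    compute = total_work / nodes
    return compute + reductions_per_iter * allreduce_latency(nodes, hop_latency)

def reduction_share(nodes, **kwargs):
    """Fraction of one iteration spent waiting on global reductions."""
    reductions = kwargs.get("reductions_per_iter", 2)
    latency = reductions * allreduce_latency(nodes, kwargs.get("hop_latency", 1e-6))
    return latency / iteration_time(nodes, **kwargs)

for p in (2**10, 2**15, 2**20):
    print(p, round(reduction_share(p), 3))
```

Under these assumptions the reduction share grows from a few percent to nearly all of the iteration time as the node count increases, which is the situation pipelined KSMs aim to avoid by overlapping reductions with other work.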
A distributed-memory package for dense Hierarchically Semi-Separable matrix computations using randomization
We present a distributed-memory library for computations with dense
structured matrices. A matrix is considered structured if its off-diagonal
blocks can be approximated by a rank-deficient matrix with low numerical rank.
Here, we use Hierarchically Semi-Separable representations (HSS). Such matrices
appear in many applications, e.g., finite element methods, boundary element
methods, etc. Exploiting this structure allows for fast solution of linear
systems and/or fast computation of matrix-vector products, which are the two
main building blocks of matrix computations. The compression algorithm that we
use, that computes the HSS form of an input dense matrix, relies on randomized
sampling with a novel adaptive sampling mechanism. We discuss the
parallelization of this algorithm and also present the parallelization of
structured matrix-vector product, structured factorization and solution
routines. The efficiency of the approach is demonstrated on large problems from
different academic and industrial applications, on up to 8,000 cores.
This work is part of a more global effort, the STRUMPACK (STRUctured Matrices
PACKage) software package for computations with sparse and dense structured
matrices. Hence, although useful in their own right, the routines also
represent a step in the direction of a distributed-memory sparse solver.
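As a rough illustration of why randomized sampling can compress an off-diagonal block, the sketch below applies a basic randomized range finder to a single block with low numerical rank. It is not the package's algorithm: the sample count is fixed rather than adaptive, there is no hierarchy, and the kernel, point sets and the name `randomized_low_rank` are assumptions chosen for the example.

```python
import numpy as np

def randomized_low_rank(block, samples, rng):
    """Return (Q, B) with block ~= Q @ B, built from `samples` random probe vectors."""
    omega = rng.standard_normal((block.shape[1], samples))
    # Orthonormal basis for the range of the sampled block.
    q, _ = np.linalg.qr(block @ omega)
    return q, q.T @ block

rng = np.random.default_rng(0)
# Two well-separated point clusters: their interaction block is numerically low rank.
x = np.linspace(0.0, 1.0, 200)
y = np.linspace(2.0, 3.0, 200)
block = 1.0 / np.abs(x[:, None] - y[None, :])

q, b = randomized_low_rank(block, 20, rng)
rel_err = np.linalg.norm(block - q @ b) / np.linalg.norm(block)
print(q.shape, b.shape, rel_err)
```

Storing `q` and `b` takes 2 x 200 x 20 numbers instead of 200 x 200, and a matrix-vector product with the block can be applied as `q @ (b @ v)`.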
A multi-level preconditioned Krylov method for the efficient solution of algebraic tomographic reconstruction problems
Classical iterative methods for tomographic reconstruction include the class
of Algebraic Reconstruction Techniques (ART). Convergence of these stationary
linear iterative methods is however notably slow. In this paper we propose the
use of Krylov solvers for tomographic linear inversion problems. These advanced
iterative methods feature fast convergence at the expense of a higher
computational cost per iteration, causing them to be generally uncompetitive
without the inclusion of a suitable preconditioner. Combining elements from
standard multigrid (MG) solvers and the theory of wavelets, a novel
wavelet-based multi-level (WMG) preconditioner is introduced, which is shown to
significantly speed up Krylov convergence. The performance of the
WMG-preconditioned Krylov method is analyzed through a spectral analysis, and
the approach is compared to existing methods like the classical Simultaneous
Iterative Reconstruction Technique (SIRT) and unpreconditioned Krylov methods
on a 2D tomographic benchmark problem. Numerical experiments are promising,
showing the method to be competitive with the classical Algebraic
Reconstruction Techniques in terms of convergence speed and overall performance
(CPU time) as well as precision of the reconstruction.
Comment: Journal of Computational and Applied Mathematics (2014), 26 pages, 13 figures, 3 tables
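For reference, the SIRT baseline mentioned in this abstract can be sketched in a few lines. This is the standard stationary iteration with inverse row-sum and column-sum scaling, not the proposed WMG-preconditioned Krylov method; the 4 x 3 matrix is a toy stand-in for a tomographic projection operator, and the function name `sirt` is an assumption.

```python
import numpy as np

def sirt(a, b, iterations):
    """SIRT: x <- x + C A^T R (b - A x), with R and C the inverse row and column sums."""
    row_sums = a.sum(axis=1)
    col_sums = a.sum(axis=0)
    x = np.zeros(a.shape[1])
    for _ in range(iterations):
        residual = b - a @ x
        x = x + (a.T @ (residual / row_sums)) / col_sums
    return x

# Toy nonnegative system with full column rank and consistent data.
a = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
x_true = np.array([1.0, 2.0, 3.0])
b = a @ x_true

x_sirt = sirt(a, b, 200)
print(x_sirt)
```

Each iteration costs only two sparse products, but the error contracts by a fixed factor per step, which is the slow stationary convergence the abstract contrasts with Krylov solvers.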
Using fast and accurate simulation to explore hardware/software trade-offs in the multi-core era
Writing well-performing parallel programs is challenging in the multi-core processor era. In addition to achieving good per-thread performance, which in itself is a balancing act between instruction-level parallelism, pipeline effects and good memory performance, multi-threaded programs complicate matters even further. These programs require synchronization, and are affected by the interactions between threads through sharing of both processor resources and the cache hierarchy.
At the Intel Exascience Lab, we are developing an architectural simulator called Sniper for simulating future exascale-era multi-core processors. Its goal is twofold: Sniper should assist hardware designers in making design decisions, while simultaneously providing software designers with a tool to gain insight into the behavior of their algorithms and allow for optimization. By taking architectural features into account, our simulator can provide more insight into parallel programs than what can be obtained from existing performance analysis tools. This unique combination of hardware simulator and software performance analysis tool makes Sniper a useful tool for a simultaneous exploration of the hardware and software design space for future high-performance multi-core systems.
An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling
We present a sparse linear system solver that is based on a multifrontal
variant of Gaussian elimination, and exploits low-rank approximation of the
resulting dense frontal matrices. We use hierarchically semiseparable (HSS)
matrices, which have low-rank off-diagonal blocks, to approximate the frontal
matrices. For HSS matrix construction, a randomized sampling algorithm is used
together with interpolative decompositions. The combination of the randomized
compression with a fast ULV HSS factorization leads to a solver with lower
computational complexity than the standard multifrontal method for many
applications, resulting in speedups of up to 7-fold for problems in our test
suite. The implementation targets many-core systems by using task parallelism
with dynamic runtime scheduling. Numerical experiments show performance
improvements over state-of-the-art sparse direct solvers. The implementation
achieves high performance and good scalability on a range of modern shared
memory parallel systems, including the Intel Xeon Phi (MIC). The code is part
of a software package called STRUMPACK -- STRUctured Matrices PACKage, which
also has a distributed memory component for dense rank-structured matrices.
Sparse Approximate Multifrontal Factorization with Butterfly Compression for High Frequency Wave Equations
We present a fast and approximate multifrontal solver for large-scale sparse
linear systems arising from finite-difference, finite-volume or finite-element
discretization of high-frequency wave equations. The proposed solver leverages
the butterfly algorithm and its hierarchical matrix extension for compressing
and factorizing large frontal matrices via graph-distance guided entry
evaluation or randomized matrix-vector multiplication-based schemes. Complexity
analysis and numerical experiments demonstrate […] computation and memory
complexity when applied to a sparse system arising from 3D high-frequency Helmholtz and Maxwell problems.
A graphics processing unit accelerated sparse direct solver and preconditioner with block low rank compression
We present the GPU implementation efforts and challenges of the sparse solver package STRUMPACK. The code is made publicly available on GitHub with a permissive BSD license. STRUMPACK implements an approximate multifrontal solver, a sparse LU factorization which makes use of compression methods to accelerate time to solution and reduce memory usage. Multiple compression schemes based on rank-structured and hierarchical matrix approximations are supported, including hierarchically semi-separable, hierarchically off-diagonal butterfly, and block low rank. In this paper, we present the GPU implementation of the block low rank (BLR) compression method within a multifrontal solver. Our GPU implementation relies on highly optimized vendor libraries such as cuBLAS and cuSOLVER for NVIDIA GPUs, rocBLAS and rocSOLVER for AMD GPUs and the Intel oneAPI Math Kernel Library (oneMKL) for Intel GPUs. Additionally, we rely on external open source libraries such as SLATE (Software for Linear Algebra Targeting Exascale), MAGMA (Matrix Algebra on GPU and Multi-core Architectures), and KBLAS (KAUST BLAS). SLATE is used as a GPU-capable ScaLAPACK replacement. From MAGMA we use variable sized batched dense linear algebra operations such as GEMM, TRSM and LU with partial pivoting. KBLAS provides efficient (batched) low rank matrix compression for NVIDIA GPUs using an adaptive randomized sampling scheme. The resulting sparse solver and preconditioner run on NVIDIA, AMD and Intel GPUs. Interfaces are available from PETSc, Trilinos and MFEM, or the solver can be used directly in user code. We report results for a range of benchmark applications, using the Perlmutter system from NERSC, Frontier from ORNL, and Aurora from ALCF. For a high frequency wave equation on a regular mesh, using 32 Perlmutter compute nodes, the factorization phase of the exact GPU solver is about 6.5× faster compared to the CPU-only solver. The BLR-enabled GPU solver is about 13.8× faster than the CPU exact solver.
For a collection of SuiteSparse matrices, the STRUMPACK exact factorization on a single GPU is on average 1.9× faster than NVIDIA’s cuDSS solver.
An effective preconditioning strategy for volume penalized incompressible/low Mach multiphase flow solvers
The volume penalization (VP) or the Brinkman penalization (BP) method is a
diffuse interface method for simulating multiphase fluid-structure interaction
(FSI) problems in ocean engineering and/or phase change problems in thermal
sciences. The method relies on a penalty factor (which is inversely related to
the body's permeability) that must be large to enforce rigid body velocity
in the solid domain. When the penalty factor is large, the discrete system of
equations becomes stiff and difficult to solve numerically. In this paper, we
propose a projection method-based preconditioning strategy for solving volume
penalized (VP) incompressible and low-Mach Navier-Stokes equations. The
projection preconditioner enables the monolithic solution of the coupled
velocity-pressure system in both single phase and multiphase flow settings. In
this approach, the penalty force is treated implicitly and is allowed to
take arbitrarily large values without affecting the solver's convergence rate or
causing numerical stiffness/instability. It is made possible by including the
penalty term in the pressure Poisson equation. Solver scalability under grid
refinement is demonstrated. A manufactured solution in a single phase setting
is used to determine the spatial accuracy of the penalized solution.
Second-order pointwise accuracy is achieved for both velocity and pressure
solutions. Two multiphase fluid-structure interaction (FSI) problems from the
ocean engineering literature are also simulated to evaluate the solver's
robustness and performance. The proposed solver allows us to investigate the
effect of the permeability on the motion of the contact line over the surface of the
immersed body. It also allows us to investigate the dynamics of the free
surface of a solidifying meta
- …
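The benefit of treating the penalty force implicitly, as described in the volume penalization abstract above, can be seen on a scalar model problem. The sketch below is only an illustration under strong assumptions: it integrates du/dt = -penalty * (u - u_body) with explicit and backward Euler, and does not represent the projection preconditioner or the pressure Poisson treatment of the paper. All names and parameter values are hypothetical.

```python
def explicit_step(u, dt, penalty, u_body):
    """Forward Euler: the penalty force is evaluated at the old velocity."""
    return u + dt * (-penalty * (u - u_body))

def implicit_step(u, dt, penalty, u_body):
    """Backward Euler: solve u_new = u + dt * (-penalty * (u_new - u_body)) for u_new."""
    return (u + dt * penalty * u_body) / (1.0 + dt * penalty)

def march(step, u0, dt, penalty, u_body, steps):
    """Apply `step` repeatedly, starting from velocity u0."""
    u = u0
    for _ in range(steps):
        u = step(u, dt, penalty, u_body)
    return u

penalty = 1.0e8   # large penalty factor, i.e. very low permeability (assumed value)
dt = 1.0e-2       # time step far above the explicit stability limit 2 / penalty
u_explicit = march(explicit_step, 1.0, dt, penalty, 0.0, 10)
u_implicit = march(implicit_step, 1.0, dt, penalty, 0.0, 10)
print(u_explicit, u_implicit)
```

With these values the explicit update diverges, while the implicit update relaxes to the body velocity for any penalty size.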