1,567 research outputs found
Recommended from our members
Preparing sparse solvers for exascale computing.
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'
Algorithmic patterns for -matrices on many-core processors
In this work, we consider the reformulation of hierarchical ()
matrix algorithms for many-core processors with a model implementation on
graphics processing units (GPUs). matrices approximate specific
dense matrices, e.g., from discretized integral equations or kernel ridge
regression, leading to log-linear time complexity in dense matrix-vector
products. The parallelization of matrix operations on many-core
processors is difficult due to the complex nature of the underlying algorithms.
While previous algorithmic advances for many-core hardware focused on
accelerating existing matrix CPU implementations by many-core
processors, we here aim at totally relying on that processor type. As main
contribution, we introduce the necessary parallel algorithmic patterns allowing
to map the full matrix construction and the fast matrix-vector
product to many-core hardware. Here, crucial ingredients are space filling
curves, parallel tree traversal and batching of linear algebra operations. The
resulting model GPU implementation hmglib is the, to the best of the authors
knowledge, first entirely GPU-based Open Source matrix library of
this kind. We conclude this work by an in-depth performance analysis and a
comparative performance study against a standard matrix library,
highlighting profound speedups of our many-core parallel approach
A scalable H-matrix approach for the solution of boundary integral equations on multi-GPU clusters
In this work, we consider the solution of boundary integral equations by
means of a scalable hierarchical matrix approach on clusters equipped with
graphics hardware, i.e. graphics processing units (GPUs). To this end, we
extend our existing single-GPU hierarchical matrix library hmglib such that it
is able to scale on many GPUs and such that it can be coupled to arbitrary
application codes. Using a model GPU implementation of a boundary element
method (BEM) solver, we are able to achieve more than 67 percent relative
parallel speed-up going from 128 to 1024 GPUs for a model geometry test case
with 1.5 million unknowns and a real-world geometry test case with almost 1.2
million unknowns. On 1024 GPUs of the cluster Titan, it takes less than 6
minutes to solve the 1.5 million unknowns problem, with 5.7 minutes for the
setup phase and 20 seconds for the iterative solver. To the best of the
authors' knowledge, we here discuss the first fully GPU-based
distributed-memory parallel hierarchical matrix Open Source library using the
traditional H-matrix format and adaptive cross approximation with an
application to BEM problems
Geometry-Oblivious FMM for Compressing Dense SPD Matrices
We present GOFMM (geometry-oblivious FMM), a novel method that creates a
hierarchical low-rank approximation, "compression," of an arbitrary dense
symmetric positive definite (SPD) matrix. For many applications, GOFMM enables
an approximate matrix-vector multiplication in or even time,
where is the matrix size. Compression requires storage and work.
In general, our scheme belongs to the family of hierarchical matrix
approximation methods. In particular, it generalizes the fast multipole method
(FMM) to a purely algebraic setting by only requiring the ability to sample
matrix entries. Neither geometric information (i.e., point coordinates) nor
knowledge of how the matrix entries have been generated is required, thus the
term "geometry-oblivious." Also, we introduce a shared-memory parallel scheme
for hierarchical matrix computations that reduces synchronization barriers. We
present results on the Intel Knights Landing and Haswell architectures, and on
the NVIDIA Pascal architecture for a variety of matrices.Comment: 13 pages, accepted by SC'1
Lecture 03: Hierarchically Low Rank Methods and Applications
As simulation and analytics enter the exascale era, numerical algorithms, particularly implicit solvers that couple vast numbers of degrees of freedom, must span a widening gap between ambitious applications and austere architectures to support them. We present fifteen universals for researchers in scalable solvers: imperatives from computer architecture that scalable solvers must respect, strategies towards achieving them that are currently well established, and additional strategies currently being developed for an effective and efficient exascale software ecosystem. We consider recent generalizations of what it means to âsolveâ a computational problem, which suggest that we have often been âoversolvingâ them at the smaller scales of the past because we could afford to do so. We present innovations that allow to approach lin-log complexity in storage and operation count in many important algorithmic kernels and thus create an opportunity for full applications with optimal scalability
Lecture 03: Hierarchically Low Rank Methods and Applications
As simulation and analytics enter the exascale era, numerical algorithms, particularly implicit solvers that couple vast numbers of degrees of freedom, must span a widening gap between ambitious applications and austere architectures to support them. We present fifteen universals for researchers in scalable solvers: imperatives from computer architecture that scalable solvers must respect, strategies towards achieving them that are currently well established, and additional strategies currently being developed for an effective and efficient exascale software ecosystem. We consider recent generalizations of what it means to âsolveâ a computational problem, which suggest that we have often been âoversolvingâ them at the smaller scales of the past because we could afford to do so. We present innovations that allow to approach lin-log complexity in storage and operation count in many important algorithmic kernels and thus create an opportunity for full applications with optimal scalability
SlabLU: A Two-Level Sparse Direct Solver for Elliptic PDEs
The paper describes a sparse direct solver for the linear systems that arise
from the discretization of an elliptic PDE on a two dimensional domain. The
solver is designed to reduce communication costs and perform well on GPUs; it
uses a two-level framework, which is easier to implement and optimize than
traditional multi-frontal schemes based on hierarchical nested dissection
orderings. The scheme decomposes the domain into thin subdomains, or "slabs".
Within each slab, a local factorization is executed that exploits the geometry
of the local domain. A global factorization is then obtained through the LU
factorization of a block-tridiagonal reduced coefficient matrix. The solver has
complexity for the factorization step, and for each
solve once the factorization is completed.
The solver described is compatible with a range of different local
discretizations, and numerical experiments demonstrate its performance for
regular discretizations of rectangular and curved geometries. The technique
becomes particularly efficient when combined with very high-order convergent
multi-domain spectral collocation schemes. With this discretization, a
Helmholtz problem on a domain of size (for
which N=100 \mbox{M}) is solved in 15 minutes to 6 correct digits on a
high-powered desktop with GPU acceleration
- âŠ