1,853 research outputs found
Recommended from our members
Preparing sparse solvers for exascale computing.
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'
GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
While many of the architectural details of future exascale-class high
performance computer systems are still a matter of intense research, there
appears to be a general consensus that they will be strongly heterogeneous,
featuring "standard" as well as "accelerated" resources. Today, such resources
are available as multicore processors, graphics processing units (GPUs), and
other accelerators such as the Intel Xeon Phi. Any software infrastructure that
claims usefulness for such environments must be able to meet their inherent
challenges: massive multi-level parallelism, topology, asynchronicity, and
abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a
collection of building blocks that targets algorithms dealing with sparse
matrix representations on current and future large-scale systems. It implements
the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel
numerical kernels, intelligent resource management, and truly heterogeneous
parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We
describe the details of its design with respect to the challenges posed by
modern heterogeneous supercomputers and recent algorithmic developments.
Implementation details which are indispensable for achieving high efficiency
are pointed out and their necessity is justified by performance measurements or
predictions based on performance models. The library code and several
applications are available as open source. We also provide instructions on how
to make use of GHOST in existing software packages, together with a case study
which demonstrates the applicability and performance of GHOST as a component
within a larger software stack.Comment: 32 pages, 11 figure
Adapting the interior point method for the solution of LPs on serial, coarse grain parallel and massively parallel computers
In this paper we describe a unified scheme for implementing an interior point algorithm (IPM) over a range of computer architectures. In the inner iteration of the IPM a search direction is computed using Newton's method. Computationally this involves solving a sparse symmetric positive definite (SSPD) system of equations. The choice of direct and indirect methods for the solution of this system, and the design of data structures to take advantage of serial, coarse grain parallel and massively parallel computer architectures, are considered in detail. We put forward arguments as to why integration of the system within a sparse simplex solver is important and outline how the system is designed to achieve this integration
Using a multifrontal sparse solver in a high performance, finite element code
We consider the performance of the finite element method on a vector supercomputer. The computationally intensive parts of the finite element method are typically the individual element forms and the solution of the global stiffness matrix both of which are vectorized in high performance codes. To further increase throughput, new algorithms are needed. We compare a multifrontal sparse solver to a traditional skyline solver in a finite element code on a vector supercomputer. The multifrontal solver uses the Multiple-Minimum Degree reordering heuristic to reduce the number of operations required to factor a sparse matrix and full matrix computational kernels (e.g., BLAS3) to enhance vector performance. The net result in an order-of-magnitude reduction in run time for a finite element application on one processor of a Cray X-MP
An Experimental Study of Two-Level Schwarz Domain Decomposition Preconditioners on GPUs
The generalized Dryja--Smith--Widlund (GDSW) preconditioner is a two-level
overlapping Schwarz domain decomposition (DD) preconditioner that couples a
classical one-level overlapping Schwarz preconditioner with an
energy-minimizing coarse space. When used to accelerate the convergence rate of
Krylov subspace iterative methods, the GDSW preconditioner provides robustness
and scalability for the solution of sparse linear systems arising from the
discretization of a wide range of partial different equations. In this paper,
we present FROSch (Fast and Robust Schwarz), a domain decomposition solver
package which implements GDSW-type preconditioners for both CPU and GPU
clusters. To improve the solver performance on GPUs, we use a novel
decomposition to run multiple MPI processes on each GPU, reducing both solver's
computational and storage costs and potentially improving the convergence rate.
This allowed us to obtain competitive or faster performance using GPUs compared
to using CPUs alone. We demonstrate the performance of FROSch on the Summit
supercomputer with NVIDIA V100 GPUs, where we used NVIDIA Multi-Process Service
(MPS) to implement our decomposition strategy.
The solver has a wide variety of algorithmic and implementation choices,
which poses both opportunities and challenges for its GPU implementation. We
conduct a thorough experimental study with different solver options including
the exact or inexact solution of the local overlapping subdomain problems on a
GPU. We also discuss the effect of using the iterative variant of the
incomplete LU factorization and sparse-triangular solve as the approximate
local solver, and using lower precision for computing the whole FROSch
preconditioner. Overall, the solve time was reduced by factors of about
using GPUs, while the GPU acceleration of the numerical setup time
depend on the solver options and the local matrix sizes.Comment: Accepted for publication in IPDPS'2
Sparse matrix-vector multiplication on GPGPU clusters: A new storage format and a scalable implementation
Sparse matrix-vector multiplication (spMVM) is the dominant operation in many
sparse solvers. We investigate performance properties of spMVM with matrices of
various sparsity patterns on the nVidia "Fermi" class of GPGPUs. A new "padded
jagged diagonals storage" (pJDS) format is proposed which may substantially
reduce the memory overhead intrinsic to the widespread ELLPACK-R scheme. In our
test scenarios the pJDS format cuts the overall spMVM memory footprint on the
GPGPU by up to 70%, and achieves 95% to 130% of the ELLPACK-R performance.
Using a suitable performance model we identify performance bottlenecks on the
node level that invalidate some types of matrix structures for efficient
multi-GPGPU parallelization. For appropriate sparsity patterns we extend
previous work on distributed-memory parallel spMVM to demonstrate a scalable
hybrid MPI-GPGPU code, achieving efficient overlap of communication and
computation.Comment: 10 pages, 5 figures. Added reference to other recent sparse matrix
format
- …