Preparing sparse solvers for exascale computing
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing Project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices, where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current successes and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.
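The sparse kernels this abstract refers to revolve around operations such as the sparse matrix-vector product at the heart of Krylov solvers. As a minimal illustrative sketch (not the ECP codebase; names and the pure-Python formulation are for exposition only), a Compressed Sparse Row (CSR) matrix-vector product looks like this:

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """y = A @ x for a matrix stored in Compressed Sparse Row form.

    row_ptr[i]:row_ptr[i+1] delimits the nonzeros of row i. On a
    many-core node, each row (or slice of rows) would typically be
    mapped to its own thread to expose concurrency.
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# 2x2 example: A = [[2, 0], [1, 3]]
row_ptr, col_idx, vals = [0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0]
print(csr_spmv(row_ptr, col_idx, vals, [1.0, 1.0]))  # [2.0, 4.0]
```

The irregular, indirect access through `col_idx` is exactly what makes latency hiding and algorithmic restructuring necessary on accelerator-based systems.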
Parallel Unsmoothed Aggregation Algebraic Multigrid Algorithms on GPUs
We design and implement a parallel algebraic multigrid (AMG) method for isotropic graph Laplacian problems on multicore Graphical Processing Units (GPUs). The proposed AMG method is based on the aggregation framework. The setup phase of the algorithm uses a parallel maximal independent set algorithm to form aggregates, and the resulting coarse-level hierarchy is then used in a K-cycle iteration solve phase with a Jacobi smoother. Numerical tests of a parallel implementation of the method for graphics processors are presented to demonstrate its effectiveness.
Comment: 18 pages, 3 figures
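The setup phase hinges on a maximal independent set (MIS) whose members seed the aggregates. A minimal sequential sketch conveys the idea (the paper's GPU variant selects many vertices in parallel, typically with randomized tie-breaking; this greedy loop is illustrative only):

```python
def maximal_independent_set(adj):
    """Greedy MIS on a graph given as {vertex: set(neighbors)}.

    Each selected vertex excludes its neighbors from the set, so no
    two selected vertices are adjacent, and every vertex is either
    selected or adjacent to a selected one (maximality). The selected
    vertices can then serve as roots of AMG aggregates.
    """
    in_set, excluded = set(), set()
    for v in adj:
        if v not in excluded:
            in_set.add(v)
            excluded.add(v)
            excluded.update(adj[v])  # neighbors can no longer join
    return in_set

# path graph 0-1-2-3: the greedy insertion order yields {0, 2}
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(maximal_independent_set(adj))
```

Replacing the sequential scan with independent randomized rounds is what makes the setup phase amenable to GPUs.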
Status and Future Perspectives for Lattice Gauge Theory Calculations to the Exascale and Beyond
In this and a set of companion whitepapers, the USQCD Collaboration lays out a program of science and computing for lattice gauge theory. These whitepapers describe how calculations using lattice QCD (and other gauge theories) can aid the interpretation of ongoing and upcoming experiments in particle and nuclear physics, as well as inspire new ones.
Comment: 44 pages. One of the USQCD whitepapers
Matrix-free multigrid block-preconditioners for higher order Discontinuous Galerkin discretisations
Efficient and suitably preconditioned iterative solvers for elliptic partial differential equations (PDEs) of the convection-diffusion type are used in all fields of science and engineering. To achieve optimal performance, solvers have to exhibit high arithmetic intensity and need to exploit every form of parallelism available in modern manycore CPUs. The computationally most expensive components of the solver are the repeated applications of the linear operator and the preconditioner. For discretisations based on higher-order Discontinuous Galerkin methods, sum-factorisation results in a dramatic reduction of the computational complexity of the operator application while, at the same time, the matrix-free implementation can run at a significant fraction of the theoretical peak floating point performance. Multigrid methods for high-order discretisations often rely on block-smoothers to reduce high-frequency error components within one grid cell. Traditionally, this requires the assembly of, and an expensive dense matrix solve in, each grid cell, which counteracts any improvements achieved in the fast matrix-free operator application. To overcome this issue, we present a new matrix-free implementation of block-smoothers. Inverting the block matrices iteratively avoids storage and factorisation of the matrix and makes it possible to harness the full power of the CPU. We implemented a hybrid multigrid algorithm with matrix-free block-smoothers in the high-order DG space, combined with a low-order coarse grid correction using algebraic multigrid, where only the low-order components are explicitly assembled. The effectiveness of this approach is demonstrated by solving a set of representative elliptic PDEs of increasing complexity, including a convection-dominated problem and the stationary SPE10 benchmark.
Comment: 28 pages, 10 figures, 10 tables; accepted for publication in Journal of Computational Physics
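The key idea, inverting each block iteratively instead of assembling and factorising it, can be sketched in a few lines. This is a toy illustration under simplifying assumptions (a fixed 2x2 block standing in for a cell-local DG block, inner Richardson iterations standing in for the paper's inner solver), not the authors' implementation:

```python
def apply_block(i, v):
    """Matrix-free action of the i-th diagonal block.

    Illustrative stand-in: every block is the 2x2 matrix
    [[2, -1], [-1, 2]]; a real DG block-smoother would apply the
    cell-local operator via sum-factorised kernels instead.
    """
    return [2.0 * v[0] - v[1], -v[0] + 2.0 * v[1]]

def block_smooth(residual_blocks, n_inner=20, omega=0.4):
    """One block-Jacobi smoothing step with inner Richardson sweeps.

    The block solve D_i z = r_i is approximated by n_inner Richardson
    iterations that need only the block's matrix-vector action -- no
    block is ever stored or factorised (the matrix-free idea).
    """
    updates = []
    for i, r in enumerate(residual_blocks):
        z = [0.0, 0.0]
        for _ in range(n_inner):
            Az = apply_block(i, z)
            z = [z[k] + omega * (r[k] - Az[k]) for k in range(2)]
        updates.append(z)
    return updates

# for residual [1, 0] the exact block solve gives [2/3, 1/3]
print(block_smooth([[1.0, 0.0]]))
```

Because the inner iteration only needs operator applications, it inherits the high floating-point throughput of the matrix-free operator rather than the memory-bound profile of a stored dense factor.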
Automatic Performance Optimization of Stencil Codes
Stencil codes are a widely used class of codes. Their general structure is very simple: data points in a large grid are repeatedly recomputed from neighboring values. This predefined neighborhood is the so-called stencil. Despite their very simple structure, stencil codes are hard to optimize, since only a few computations are performed while a comparatively large number of values have to be accessed; i.e., stencil codes usually have a very low computational intensity. Moreover, the set of optimizations and their parameters also depend on the hardware on which the code is executed.
In short, current production compilers are not able to fully optimize this class of codes, and optimizing each application by hand is not practical. As a remedy, we propose a set of optimizations and describe how they can be applied automatically by a code generator for the domain of stencil codes. A combination of space and time tiling is able to increase data locality, which significantly reduces the memory-bandwidth requirements: a standard three-dimensional 7-point Jacobi stencil can be accelerated by a factor of 3. This optimization can target essentially any stencil code, while others are more specialized. For example, support for arbitrary linear data layout transformations is especially beneficial for colored kernels, such as a Red-Black Gauss-Seidel smoother. On the one hand, an optimized data layout for such kernels reduces the bandwidth requirements; on the other hand, it simplifies an explicit vectorization.
Other notable optimizations described in detail are redundancy elimination techniques that eliminate common subexpressions both in a sequence of statements and across loop boundaries, arithmetic simplifications and normalizations, and the vectorization mentioned previously. In combination, these optimizations are able to increase the performance not only of the model problem given by Poisson's equation, but also of real-world applications: an optical flow simulation and the simulation of a non-isothermal and non-Newtonian fluid flow.
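The 7-point Jacobi stencil discussed above, written naively, makes the low computational intensity concrete: each update reads seven values but performs only about seven floating-point operations. A minimal untuned sketch (the kind of input the described code generator would tile and vectorize; layout and naming are illustrative):

```python
def jacobi_7pt(u, n):
    """One sweep of the 3D 7-point Jacobi stencil on an n^3 grid
    stored as a flat list, with the boundary layer held fixed.

    Each interior point is replaced by the average of its six axis
    neighbors: 7 loads for ~7 flops, i.e. very low computational
    intensity, which is why space/time tiling pays off.
    """
    idx = lambda i, j, k: (i * n + j) * n + k
    v = u[:]  # write into a copy: Jacobi uses only old values
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                v[idx(i, j, k)] = (u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)]
                                 + u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)]
                                 + u[idx(i, j, k - 1)] + u[idx(i, j, k + 1)]) / 6.0
    return v
```

Time tiling fuses several such sweeps so that a grid block is reused from cache across iterations instead of being streamed from memory every sweep.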
Toward Performance-Portable PETSc for GPU-based Exascale Systems
The Portable, Extensible Toolkit for Scientific Computation (PETSc) library delivers scalable solvers for nonlinear time-dependent differential and algebraic equations and for numerical optimization. The PETSc design for performance portability addresses fundamental GPU accelerator challenges and stresses flexibility and extensibility by separating the programming model used by the application from that used by the library; it enables application developers to use their preferred programming model, such as Kokkos, RAJA, SYCL, HIP, CUDA, or OpenCL, on upcoming exascale systems. A blueprint for using GPUs from PETSc-based codes is provided, and case studies emphasize the flexibility and high performance achieved on current GPU-based systems.
Comment: 15 pages, 10 figures, 2 tables