229 research outputs found
Asynchronous and Multiprecision Linear Solvers - Scalable and Fault-Tolerant Numerics for Energy Efficient High Performance Computing
Asynchronous methods minimize idle times by removing synchronization barriers, and therefore allow the efficient usage of computer systems. The implied high tolerance with respect to communication latencies improves the fault tolerance. As asynchronous methods also enable the usage of the power and energy saving mechanisms provided by the hardware, they are suitable candidates for the highly parallel and heterogeneous hardware platforms that are expected for the near future
Porting Batched Iterative Solvers onto Intel GPUs with SYCL
Batched linear solvers play a vital role in computational sciences,
especially in the fields of plasma physics and combustion simulations. With the
imminent deployment of the Aurora Supercomputer and other upcoming systems
equipped with Intel GPUs, there is a compelling demand to expand the
capabilities of these solvers for Intel GPU architectures. In this paper, we
present our efforts in porting and optimizing the batched iterative solvers on
Intel GPUs using the SYCL programming model. The SYCL-based implementation
exhibits impressive performance and scalability on the Intel GPU Max 1550s
(Ponte Vecchio GPUs). The solvers outperform our previous CUDA implementation
on NVIDIA H100 GPUs by an average of 2.4x for the PeleLM application inputs.
The batched solvers are ready for production use in real-world scientific
applications through the Ginkgo library. Comment: 9 pages, 8 figures, submitted to the P3HPC Workshop at SC2
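As a rough illustration of what "batched" means here, the sketch below solves several small, independent linear systems in one call and lets each system stop as soon as it has converged. It is a minimal sketch under stated assumptions, not the Ginkgo or SYCL implementation: the function names (`jacobi_step`, `residual_norm`, `batched_jacobi`) are hypothetical, a plain Jacobi iteration stands in for the paper's solvers, and the two example systems are made up and chosen to be strictly diagonally dominant so that Jacobi converges.

```python
def jacobi_step(A, b, x):
    """One Jacobi sweep for a single dense system; every entry uses the old x."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def residual_norm(A, b, x):
    """Infinity norm of the residual b - A x."""
    n = len(b)
    return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) for i in range(n))


def batched_jacobi(batch_A, batch_b, tol=1e-10, max_iter=500):
    """Iterate all systems of the batch together, with per-system convergence."""
    xs = [[0.0] * len(b) for b in batch_b]
    iters = [0] * len(batch_b)
    active = set(range(len(batch_b)))  # indices of systems still iterating
    for _ in range(max_iter):
        if not active:
            break
        # On a GPU each system would map to its own work-group; here we loop.
        for s in list(active):
            xs[s] = jacobi_step(batch_A[s], batch_b[s], xs[s])
            iters[s] += 1
            if residual_norm(batch_A[s], batch_b[s], xs[s]) < tol:
                active.discard(s)
    return xs, iters


# Two illustrative 2x2 systems (made-up data, strictly diagonally dominant).
batch_A = [
    [[4.0, 1.0], [1.0, 3.0]],
    [[10.0, 2.0], [3.0, 5.0]],
]
batch_b = [
    [1.0, 2.0],
    [12.0, 8.0],
]
xs, iters = batched_jacobi(batch_A, batch_b)
```

Tracking convergence per system matters because the systems in a batch generally need different iteration counts; a system that has converged does no further work while the others continue.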
A factored sparse approximate inverse preconditioned conjugate gradient solver on graphics processing units
Graphics Processing Units (GPUs) exhibit significantly higher peak performance than conventional CPUs. However, in general only highly parallel algorithms can exploit their potential. In this scenario, the iterative solution to sparse linear systems of equations could be carried out quite efficiently on a GPU as it requires only matrix-by-vector products, dot products, and vector updates. However, to be really effective, any iterative solver needs to be properly preconditioned and this represents a major bottleneck for a successful GPU implementation. Due to its inherent parallelism, the factored sparse approximate inverse (FSAI) preconditioner represents an optimal candidate for the conjugate gradient-like solution of sparse linear systems. However, its GPU implementation requires a nontrivial recasting of multiple computational steps. We present our GPU version of the FSAI preconditioner along with a set of results that show how a noticeable speedup with respect to a highly tuned CPU counterpart is obtained
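The structure described above can be sketched as follows: a preconditioned conjugate gradient loop built only from matrix-by-vector products, dot products, and vector updates, with the preconditioner applied in factored form as z = G^T (G r). This is a minimal sketch, not the authors' GPU code. The helper names are hypothetical, the matrices are dense Python lists for brevity, and the factor G is assumed to have a purely diagonal sparsity pattern, in which case it reduces to Jacobi scaling with G_ii = 1 / sqrt(a_ii). A practical FSAI factor would use a richer lower-triangular pattern.

```python
import math


def matvec(A, x):
    """Dense matrix-by-vector product."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def apply_factored_inverse(G, r):
    """Apply the factored approximate inverse: z = G^T (G r)."""
    t = matvec(G, r)
    G_transposed = [list(col) for col in zip(*G)]
    return matvec(G_transposed, t)


def pcg(A, b, G, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradient for a symmetric positive definite A."""
    x = [0.0] * len(b)
    r = list(b)  # residual of the zero initial guess
    z = apply_factored_inverse(G, r)
    p = list(z)
    rz = dot(r, z)
    for k in range(max_iter):
        if math.sqrt(dot(r, r)) < tol:
            return x, k
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]    # vector update
        r = [ri - alpha * api for ri, api in zip(r, Ap)]  # vector update
        z = apply_factored_inverse(G, r)
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Small symmetric positive definite example (made-up data).
A = [
    [4.0, 1.0, 0.0],
    [1.0, 3.0, 1.0],
    [0.0, 1.0, 2.0],
]
b = [1.0, 2.0, 3.0]
n = len(b)
# Assumed diagonal-pattern factor: G_ii = 1 / sqrt(a_ii), zeros elsewhere.
G = [[1.0 / math.sqrt(A[i][i]) if i == j else 0.0 for j in range(n)] for i in range(n)]
x, iters = pcg(A, b, G)
```

Because applying the preconditioner costs only two matrix-by-vector products, it parallelizes in the same way as the rest of the loop, which is the property the abstract highlights.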
An Evaluation and Comparison of GPU Hardware and Solver Libraries for Accelerating the OPM Flow Reservoir Simulator
Realistic reservoir simulation is known to be prohibitively expensive in
terms of computation time when increasing the accuracy of the simulation or
enlarging the model grid size. One method to address this issue is to
parallelize the computation by dividing the model into several partitions and
using multiple CPUs to compute the result using techniques such as MPI and
multi-threading. Alternatively, GPUs are also a good candidate to accelerate
the computation due to their massively parallel architecture that allows many
floating point operations per second to be performed. The numerical iterative
solver takes the most computational time and is challenging to execute
efficiently due to the dependencies that exist in the model between cells. In
this work, we evaluate the OPM Flow simulator and compare several
state-of-the-art GPU solver libraries as well as custom developed solutions for
a BiCGStab solver using an ILU0 preconditioner and benchmark their performance
against the default DUNE library implementation running on multiple CPU
processors using MPI. The evaluated GPU software libraries include a manual
linear solver in OpenCL and the integration of several third party sparse
linear algebra libraries, such as cuSparse, rocSparse, and amgcl. To perform
our benchmarking, we use small, medium, and large use cases, starting with the
public test case NORNE that includes approximately 50k active cells and ending
with a large model that includes approximately 1 million active cells. We find
that a GPU can accelerate a single dual-threaded MPI process up to 5.6 times,
and that it is comparable to around 8 dual-threaded MPI processes.
Paraiso: An Automated Tuning Framework for Explicit Solvers of Partial Differential Equations
We propose Paraiso, a domain-specific language embedded in the functional
programming language Haskell, for automated tuning of explicit solvers of
partial differential equations (PDEs) on GPUs as well as multicore CPUs. In
Paraiso, one can describe PDE solving algorithms succinctly using tensor
equation notation. Hydrodynamic properties, interpolation methods and other
building blocks are described in abstract, modular, reusable and combinable
forms, which lets us generate versatile solvers from a small amount of Paraiso
source code.
We demonstrate Paraiso by implementing a compressive hydrodynamics solver. A
single source code of less than 500 lines can be used to generate solvers of
arbitrary dimensions, for both multicore CPUs and GPUs. We demonstrate both
manual annotation based tuning and evolutionary computing based automated
tuning of the program. Comment: 52 pages, 14 figures, accepted for publication in Computational
Science and Discover
GPU acceleration for evolutionary topology optimization of continuum structures using isosurfaces
Evolutionary topology optimization of three-dimensional continuum structures is a computationally demanding task in terms of memory consumption and processing time. This work aims to alleviate these constraints by proposing a well-suited strategy for Graphics Processing Unit (GPU) computing. The proposal adopts a fine-grained GPU instance of a matrix-free iterative solver for structural analysis and an efficient GPU implementation for isosurface extraction and volume fraction calculation. The performance of the solving stage is evaluated using two preconditioning techniques, including a comparison with the sparse-matrix CPU implementation. The proposal is evaluated using topology optimization problems for real-world applications. We gratefully acknowledge the support of NVIDIA Corporation with the donation of some of the GPUs used for this research. This work has also been supported by the research support programmes of the Ministry of Economy and Competitiveness under the contract DPI2016-77538-R and the "Fundación Séneca Agencia de Ciencia y Tecnología de la Región de Murcia" under the contract 19274/PI/14