229 research outputs found

    Asynchronous and Multiprecision Linear Solvers - Scalable and Fault-Tolerant Numerics for Energy Efficient High Performance Computing

    Get PDF
    Asynchronous methods minimize idle times by removing synchronization barriers, and therefore allow the efficient usage of computer systems. The implied high tolerance with respect to communication latencies improves the fault tolerance. As asynchronous methods also enable the usage of the power and energy saving mechanisms provided by the hardware, they are suitable candidates for the highly parallel and heterogeneous hardware platforms that are expected for the near future

    Porting Batched Iterative Solvers onto Intel GPUs with SYCL

    Full text link
    Batched linear solvers play a vital role in computational sciences, especially in the fields of plasma physics and combustion simulations. With the imminent deployment of the Aurora Supercomputer and other upcoming systems equipped with Intel GPUs, there is a compelling demand to expand the capabilities of these solvers for Intel GPU architectures. In this paper, we present our efforts in porting and optimizing the batched iterative solvers on Intel GPUs using the SYCL programming model. The SYCL-based implementation exhibits impressive performance and scalability on the Intel GPU Max 1550s (Ponte Vecchio GPUs). The solvers outperform our previous CUDA implementation on NVIDIA H100 GPUs by an average of 2.4x for the PeleLM application inputs. The batched solvers are ready for production use in real-world scientific applications through the Ginkgo library.Comment: 9 pages, 8 figures, submitted to the P3HPC Workshop at SC2

    A factored sparse approximate inverse preconditioned conjugate gradient solver on graphics processing units

    Get PDF
    Graphics Processing Units (GPUs) exhibit significantly higher peak performance than conventional CPUs. However, in general only highly parallel algorithms can exploit their potential. In this scenario, the iterative solution to sparse linear systems of equations could be carried out quite efficiently on a GPU as it requires only matrix-by-vector products, dot products, and vector updates. However, to be really effective, any iterative solver needs to be properly preconditioned and this represents a major bottleneck for a successful GPU implementation. Due to its inherent parallelism, the factored sparse approximate inverse (FSAI) preconditioner represents an optimal candidate for the conjugate gradient-like solution of sparse linear systems. However, its GPU implementation requires a nontrivial recasting of multiple computational steps. We present our GPU version of the FSAI preconditioner along with a set of results that show how a noticeable speedup with respect to a highly tuned CPU counterpart is obtained

    An Evaluation and Comparison of GPU Hardware and Solver Libraries for Accelerating the OPM Flow Reservoir Simulator

    Full text link
    Realistic reservoir simulation is known to be prohibitively expensive in terms of computation time when increasing the accuracy of the simulation or by enlarging the model grid size. One method to address this issue is to parallelize the computation by dividing the model in several partitions and using multiple CPUs to compute the result using techniques such as MPI and multi-threading. Alternatively, GPUs are also a good candidate to accelerate the computation due to their massively parallel architecture that allows many floating point operations per second to be performed. The numerical iterative solver takes thus the most computational time and is challenging to solve efficiently due to the dependencies that exist in the model between cells. In this work, we evaluate the OPM Flow simulator and compare several state-of-the-art GPU solver libraries as well as custom developed solutions for a BiCGStab solver using an ILU0 preconditioner and benchmark their performance against the default DUNE library implementation running on multiple CPU processors using MPI. The evaluated GPU software libraries include a manual linear solver in OpenCL and the integration of several third party sparse linear algebra libraries, such as cuSparse, rocSparse, and amgcl. To perform our bench-marking, we use small, medium, and large use cases, starting with the public test case NORNE that includes approximately 50k active cells and ending with a large model that includes approximately 1 million active cells. We find that a GPU can accelerate a single dual-threaded MPI process up to 5.6 times, and that it can compare with around 8 dual-threaded MPI processes

    Paraiso : An Automated Tuning Framework for Explicit Solvers of Partial Differential Equations

    Full text link
    We propose Paraiso, a domain specific language embedded in functional programming language Haskell, for automated tuning of explicit solvers of partial differential equations (PDEs) on GPUs as well as multicore CPUs. In Paraiso, one can describe PDE solving algorithms succinctly using tensor equations notation. Hydrodynamic properties, interpolation methods and other building blocks are described in abstract, modular, re-usable and combinable forms, which lets us generate versatile solvers from little set of Paraiso source codes. We demonstrate Paraiso by implementing a compressive hydrodynamics solver. A single source code less than 500 lines can be used to generate solvers of arbitrary dimensions, for both multicore CPUs and GPUs. We demonstrate both manual annotation based tuning and evolutionary computing based automated tuning of the program.Comment: 52 pages, 14 figures, accepted for publications in Computational Science and Discover

    GPU acceleration for evolutionary topology optimization of continuum structures using isosurfaces

    Get PDF
    Evolutionary topology optimization of three-dimensional continuum structures is a computationally demanding task in terms of memory consumption and processing time. This work aims to alleviate these constraints proposing a well-suited strategy for Graphics Processing Unit (GPU) computing. Such a proposal adopts a fine-grained GPU instance of matrix-free iterative solver for structural analysis and an efficient GPU implementation for isosurface extraction and volume fraction calculation. The performance of the solving stage is evaluated using two preconditioning techniques, including the comparison with the sparse-matrix CPU implementation. The proposal is evaluated using topology optimization problems for real-world applications.We gratefully acknowledge the support of NVIDIA Corporation with the donation of some of the GPUs used for this research. Such a work has also been supported by the research support programmes of Ministry of Economy and Competitiveness under the contract DPI2016-77538-R and \Fundación Séneca Agencia de Ciencia y Tecnología de la Región de Murcia" under the contract 19274/PI/14
    • …
    corecore