    Algebraic, Block and Multiplicative Preconditioners based on Fast Tridiagonal Solves on GPUs

    This thesis contributes to the field of sparse linear algebra, graph applications, and preconditioners for Krylov iterative solvers of sparse linear equation systems by providing a (block) tridiagonal solver library, a generalized sparse matrix-vector implementation, a linear forest extraction, and a multiplicative preconditioner based on tridiagonal solves. The tridiagonal library, which supports (scaled) partial pivoting, outperforms cuSPARSE's tridiagonal solver by a factor of five while fully utilizing the available GPU memory bandwidth. For performance-optimized solving of multiple right-hand sides, the explicit factorization of the tridiagonal matrix can be precomputed. The extraction of a weighted linear forest (a union of disjoint paths) from a general graph is used to build algebraic (block) tridiagonal preconditioners and deploys the generalized sparse matrix-vector implementation of this thesis for preconditioner construction. During linear forest extraction, a new parallel bidirectional scan pattern, which can operate on doubly linked list structures, identifies the path ID and the position of each vertex. The algebraic preconditioner construction is also used to build more advanced preconditioners, containing multiple tridiagonal factors, based on generalized ILU factorizations. Additionally, other preconditioners based on tridiagonal factors are presented and evaluated in comparison to ILU and ILU incomplete sparse approximate inverse (ILU-ISAI) preconditioners for the solution of large sparse linear equation systems from the Sparse Matrix Collection. For all problems presented in this thesis, an efficient parallel algorithm and its CUDA implementation for single-GPU systems are provided.
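
    The core operation throughout the thesis is the tridiagonal solve. As a point of reference for what the GPU library accelerates, here is a minimal serial Thomas-algorithm sketch; it shows only the textbook forward-elimination/back-substitution sweep, not the thesis's (scaled) partial pivoting or reusable factorizations:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Serial Thomas algorithm for a tridiagonal system Ax = d with
// sub-diagonal a, diagonal b, super-diagonal c (a[0] and c[n-1] unused).
// Textbook baseline only; no pivoting, so it assumes a well-conditioned
// (e.g. diagonally dominant) matrix.
std::vector<double> thomas_solve(std::vector<double> a, std::vector<double> b,
                                 std::vector<double> c, std::vector<double> d) {
    const size_t n = b.size();
    for (size_t i = 1; i < n; ++i) {        // forward elimination
        const double w = a[i] / b[i - 1];
        b[i] -= w * c[i - 1];
        d[i] -= w * d[i - 1];
    }
    std::vector<double> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];         // back substitution
    for (size_t i = n - 1; i-- > 0;)
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    return x;
}

int main() {
    // 1-D Poisson-style test: -x_{i-1} + 2 x_i - x_{i+1} = 1, n = 5.
    std::vector<double> a(5, -1.0), b(5, 2.0), c(5, -1.0), d(5, 1.0);
    for (double v : thomas_solve(a, b, c, d)) std::printf("%.4f\n", v);
}
```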

    Parallel accelerated cyclic reduction preconditioner for three-dimensional elliptic PDEs with variable coefficients

    We present a robust and scalable preconditioner for the solution of large-scale linear systems that arise from the discretization of elliptic PDEs amenable to rank compression. The preconditioner is based on hierarchical low-rank approximations and the cyclic reduction method. The setup and application phases of the preconditioner achieve log-linear complexity in memory footprint and number of operations, and numerical experiments exhibit good weak and strong scalability at large processor counts in a distributed-memory environment. Numerical experiments with linear systems that feature symmetry and nonsymmetry, definiteness and indefiniteness, and constant and variable coefficients demonstrate the preconditioner's applicability and robustness. Furthermore, the number of iterations can be controlled via the accuracy threshold of the hierarchical matrix approximations and their arithmetic operations, and via the tuning of the admissibility condition parameter. Together, these parameters allow the memory requirements and performance of the preconditioner to be optimized.
    Comment: 24 pages, Elsevier Journal of Computational and Applied Mathematics, Dec 201
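
    To make the cyclic reduction recursion concrete, the sketch below solves a scalar tridiagonal system: even-indexed equations eliminate their odd-indexed neighbours, the half-size system is solved recursively, and the odd unknowns are recovered by back substitution. The preconditioner above applies this recursion to block tridiagonal systems from 3-D discretizations, with the fill-in kept compressed as hierarchical low-rank matrices; none of that machinery appears in this scalar sketch:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Scalar cyclic reduction: each level halves the system, giving
// log2(n) levels in total. a, b, c are the sub-, main, and super-diagonals.
std::vector<double> cr_solve(std::vector<double> a, std::vector<double> b,
                             std::vector<double> c, std::vector<double> d) {
    const size_t n = b.size();
    if (n == 1) return {d[0] / b[0]};
    const size_t m = (n + 1) / 2;       // reduced system: even-indexed unknowns
    std::vector<double> ra(m), rb(m), rc(m), rd(m);
    for (size_t k = 0; k < m; ++k) {
        const size_t i = 2 * k;
        const double al = (i > 0)     ? a[i] / b[i - 1] : 0.0;
        const double be = (i + 1 < n) ? c[i] / b[i + 1] : 0.0;
        ra[k] = (i > 0)     ? -al * a[i - 1] : 0.0;   // couples to x[i-2]
        rc[k] = (i + 2 < n) ? -be * c[i + 1] : 0.0;   // couples to x[i+2]
        rb[k] = b[i] - (i > 0     ? al * c[i - 1] : 0.0)
                     - (i + 1 < n ? be * a[i + 1] : 0.0);
        rd[k] = d[i] - (i > 0     ? al * d[i - 1] : 0.0)
                     - (i + 1 < n ? be * d[i + 1] : 0.0);
    }
    std::vector<double> xe = cr_solve(ra, rb, rc, rd);  // recurse on half size
    std::vector<double> x(n);
    for (size_t k = 0; k < m; ++k) x[2 * k] = xe[k];
    for (size_t i = 1; i < n; i += 2) {                 // recover odd unknowns
        double rhs = d[i] - a[i] * x[i - 1];
        if (i + 1 < n) rhs -= c[i] * x[i + 1];
        x[i] = rhs / b[i];
    }
    return x;
}

int main() {
    // 1-D Poisson-style test: -x_{i-1} + 2 x_i - x_{i+1} = 1, n = 7.
    std::vector<double> a(7, -1.0), b(7, 2.0), c(7, -1.0), d(7, 1.0);
    for (double v : cr_solve(a, b, c, d)) std::printf("%.4f\n", v);
}
```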

    A parallel hybrid implementation of the 2D acoustic wave equation

    In this paper, we propose a hybrid parallel programming approach to the numerical solution of a two-dimensional acoustic wave equation on a single computer, using an implicit finite-difference scheme. First, we transform the differential equation into an implicit finite-difference equation, and then, using the ADI method, we split the equation into two sub-equations. An approximate solution is computed with the cyclic reduction algorithm. Finally, we parallelize this algorithm for GPU, GPU+OpenMP, and hybrid (GPU+OpenMP+MPI) computing platforms. Special focus is placed on improving the performance of the parallel algorithms and on the speedup achieved, measured by execution time. We show that the hybrid code gives the expected results by comparing them to those obtained by running the same simulation on a classical processor core and to the CUDA and CUDA+OpenMP implementations.
    Comment: 10 pages; 1 Chart; 1 Table; 1 Listing; 1 Algorithm
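
    The row/column splitting at the heart of the method can be illustrated with a short serial sketch. For brevity it shows the classic Peaceman-Rachford ADI step for the 2-D heat equation rather than the paper's implicit wave-equation scheme, and a plain Thomas sweep stands in for the cyclic reduction solver; the overall structure, one batch of tridiagonal solves per direction per half step, is the same:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Constant-coefficient tridiagonal solve in place:
//   off*x[i-1] + diag*x[i] + off*x[i+1] = d[i];  d is overwritten with x.
static void tri_solve(double diag, double off, std::vector<double>& d) {
    const size_t n = d.size();
    std::vector<double> b(n, diag);
    for (size_t i = 1; i < n; ++i) {
        const double w = off / b[i - 1];
        b[i] -= w * off;
        d[i] -= w * d[i - 1];
    }
    d[n - 1] /= b[n - 1];
    for (size_t i = n - 1; i-- > 0;) d[i] = (d[i] - off * d[i + 1]) / b[i];
}

// One Peaceman-Rachford ADI step for u_t = k*(u_xx + u_yy). u holds the
// nx*ny interior values, row-major; the boundary is held at zero.
// r = k*dt / (2*h*h).
void adi_step(std::vector<double>& u, size_t nx, size_t ny, double r) {
    auto val = [&](long i, long j) {        // zero outside the grid
        return (i < 0 || j < 0 || i >= (long)nx || j >= (long)ny)
                   ? 0.0 : u[(size_t)j * nx + (size_t)i];
    };
    std::vector<double> half(u.size());
    for (size_t j = 0; j < ny; ++j) {       // implicit in x: one solve per row
        std::vector<double> rhs(nx);
        for (size_t i = 0; i < nx; ++i)
            rhs[i] = r * val(i, (long)j - 1) + (1.0 - 2.0 * r) * val(i, j)
                   + r * val(i, (long)j + 1);
        tri_solve(1.0 + 2.0 * r, -r, rhs);
        for (size_t i = 0; i < nx; ++i) half[j * nx + i] = rhs[i];
    }
    u = half;
    std::vector<double> next(u.size());
    for (size_t i = 0; i < nx; ++i) {       // implicit in y: one solve per column
        std::vector<double> rhs(ny);
        for (size_t j = 0; j < ny; ++j)
            rhs[j] = r * val((long)i - 1, j) + (1.0 - 2.0 * r) * val(i, j)
                   + r * val((long)i + 1, j);
        tri_solve(1.0 + 2.0 * r, -r, rhs);
        for (size_t j = 0; j < ny; ++j) next[j * nx + i] = rhs[j];
    }
    u = next;
}

int main() {
    const size_t nx = 5, ny = 5;
    std::vector<double> u(nx * ny, 0.0);
    u[2 * nx + 2] = 1.0;                    // unit spike in the middle
    adi_step(u, nx, ny, 0.25);
    std::printf("centre after one step: %.6f\n", u[2 * nx + 2]);
}
```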

    A novel approach to evaluating compact finite differences and similar tridiagonal schemes on GPU-accelerated clusters

    Compact finite difference schemes are widely used in the direct numerical simulation of fluid flows for their ability to better resolve the small scales of turbulence. However, they can be expensive to evaluate and difficult to parallelize. In this work, we present an approach for the computation of compact finite differences and similar tridiagonal schemes on graphics processing units (GPUs). We present a variant of the cyclic reduction algorithm for solving the tridiagonal linear systems that arise in such numerical schemes. We study the impact of the matrix structure on the cyclic reduction algorithm and show that precomputing forward reduction coefficients can be especially effective for obtaining good performance. Our tridiagonal solver outperforms the NVIDIA cuSPARSE and the multithreaded Intel MKL tridiagonal solvers on the GPU and CPU, respectively. In addition, we present a parallelization strategy for GPU-accelerated clusters, and show scalability of a 3-D compact finite difference application on up to 64 GPUs on Clemson's Palmetto cluster.
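
    As a concrete instance of such a tridiagonal scheme, the sketch below evaluates the classic fourth-order compact (Padé) first derivative, which reduces to a single tridiagonal solve per grid line. A serial Thomas sweep stands in for the paper's tuned cyclic reduction solver, and the third-order one-sided boundary closures shown are one common choice, not necessarily the paper's:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Fourth-order compact first derivative on a uniform grid (n >= 3):
//   (1/4) f'_{i-1} + f'_i + (1/4) f'_{i+1} = 3 (f_{i+1} - f_{i-1}) / (4h),
// closed with third-order one-sided relations at the two boundary points.
std::vector<double> compact_d1(const std::vector<double>& f, double h) {
    const size_t n = f.size();
    std::vector<double> a(n, 0.25), b(n, 1.0), c(n, 0.25), d(n);
    for (size_t i = 1; i + 1 < n; ++i)
        d[i] = 3.0 * (f[i + 1] - f[i - 1]) / (4.0 * h);
    // Boundary closures:  f'_0 + 2 f'_1 = (-5 f_0 + 4 f_1 + f_2) / (2h), mirrored.
    c[0] = 2.0;
    d[0] = (-5.0 * f[0] + 4.0 * f[1] + f[2]) / (2.0 * h);
    a[n - 1] = 2.0;
    d[n - 1] = (5.0 * f[n - 1] - 4.0 * f[n - 2] - f[n - 3]) / (2.0 * h);
    // Thomas sweep; d is overwritten with the derivative values.
    for (size_t i = 1; i < n; ++i) {
        const double w = a[i] / b[i - 1];
        b[i] -= w * c[i - 1];
        d[i] -= w * d[i - 1];
    }
    d[n - 1] /= b[n - 1];
    for (size_t i = n - 1; i-- > 0;)
        d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
    return d;
}

int main() {
    const size_t n = 21;
    const double h = 3.141592653589793 / (n - 1);
    std::vector<double> f(n);
    for (size_t i = 0; i < n; ++i) f[i] = std::sin(i * h);
    std::printf("d/dx sin at 0: %.6f (exact 1)\n", compact_d1(f, h)[0]);
}
```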

    Batch solution of small PDEs with the OPS DSL

    In this paper we discuss the challenges and optimisation opportunities that arise when solving a large number of small, equally sized discretised PDEs on regular grids. We present an extension of the OPS (Oxford Parallel library for Structured meshes) embedded Domain Specific Language, and show how support can be added for solving multiple systems, and how OPS makes it easy to deploy a variety of transformations and optimisations. The new capabilities in OPS allow data structure transformations, as well as execution schedule transformations, to be applied automatically to deliver high performance on a variety of hardware platforms. We evaluate our work on an industrially representative finance simulation on Intel CPUs, as well as on NVIDIA GPUs.
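
    One illustration of the data-layout transformations involved: when many small, equally sized tridiagonal systems are solved together, storing element i of every system contiguously (an interleaved, SoA-style layout) turns the inner loop of a batched Thomas sweep into unit-stride memory traffic. The sketch below is a minimal serial illustration of that layout, not code generated by OPS; on a GPU the loop over systems would map to threads:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Batched Thomas solve over an interleaved layout: element i of system s
// lives at index i*nsys + s, so the inner loop over s touches contiguous
// memory. All nsys systems have the same size n; d is overwritten with
// the solutions.
void batched_thomas(std::vector<double>& a, std::vector<double>& b,
                    std::vector<double>& c, std::vector<double>& d,
                    size_t n, size_t nsys) {
    auto idx = [nsys](size_t i, size_t s) { return i * nsys + s; };
    for (size_t i = 1; i < n; ++i)
        for (size_t s = 0; s < nsys; ++s) {       // unit stride in s
            const double w = a[idx(i, s)] / b[idx(i - 1, s)];
            b[idx(i, s)] -= w * c[idx(i - 1, s)];
            d[idx(i, s)] -= w * d[idx(i - 1, s)];
        }
    for (size_t s = 0; s < nsys; ++s)
        d[idx(n - 1, s)] /= b[idx(n - 1, s)];
    for (size_t i = n - 1; i-- > 0;)
        for (size_t s = 0; s < nsys; ++s)
            d[idx(i, s)] =
                (d[idx(i, s)] - c[idx(i, s)] * d[idx(i + 1, s)]) / b[idx(i, s)];
}

int main() {
    const size_t n = 4, nsys = 3;
    std::vector<double> a(n * nsys, -1.0), b(n * nsys, 2.0),
                        c(n * nsys, -1.0), d(n * nsys, 1.0);
    batched_thomas(a, b, c, d, n, nsys);
    std::printf("x0 of system 0: %.4f\n", d[0]);
}
```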