15 research outputs found

    Acceleration of a Full-scale Industrial CFD Application with OP2

    Finite element algorithms and data structures on graphical processing units

    The finite element method (FEM) is one of the most commonly used techniques for the solution of partial differential equations on unstructured meshes. This paper discusses both the assembly and the solution phases of the FEM, with special attention to the balance of computation and data movement. We present a GPU assembly algorithm that scales to arbitrary-degree polynomials used as basis functions, at the expense of redundant computations. We show how the storage of the stiffness matrix affects the performance of both the assembly and the solution. We investigate two approaches: global assembly into the CSR and ELLPACK matrix formats, and matrix-free algorithms, and show the trade-off between the amount of indexing data and stiffness data. We discuss the performance of the different approaches in light of the implicit caches on Fermi GPUs, and show a speedup over a two-socket 12-core CPU of up to 10 times in the assembly and up to 6 times in the solution phase. We present our sparse matrix-vector multiplication algorithms that are part of a conjugate gradient iteration, and show that a matrix-free approach may be up to 2 times faster than the global assembly approaches and up to 4 times faster than NVIDIA's cuSPARSE library, depending on the preconditioner used.
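
    As an illustration of the storage trade-off above, here is a minimal sketch of the CSR and ELLPACK formats with serial reference SpMV loops. The struct and function names are made up for this sketch, and the loops are CPU code: the GPU kernels the paper discusses would assign one thread per row, which is why ELLPACK's padded, column-major layout (coalesced loads across neighbouring rows) is attractive there.

        #include <vector>

        // CSR: row_ptr[i]..row_ptr[i+1] indexes the nonzeros of row i.
        struct CSR {
            std::vector<int>    row_ptr;   // size n+1
            std::vector<int>    col_idx;   // size nnz
            std::vector<double> val;       // size nnz
        };

        // ELLPACK: every row padded to max_row entries; column-major storage
        // so that thread i of a GPU kernel reads val[k*n + i] coalesced.
        struct ELL {
            int n, max_row;
            std::vector<int>    col_idx;   // n * max_row, padded with -1
            std::vector<double> val;       // n * max_row, padded with 0.0
        };

        void spmv_csr(const CSR& A, const double* x, double* y, int n) {
            for (int i = 0; i < n; ++i) {
                double s = 0.0;
                for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
                    s += A.val[k] * x[A.col_idx[k]];
                y[i] = s;
            }
        }

        void spmv_ell(const ELL& A, const double* x, double* y) {
            for (int i = 0; i < A.n; ++i) {
                double s = 0.0;
                for (int k = 0; k < A.max_row; ++k) {
                    int c = A.col_idx[k * A.n + i];          // stride-n access
                    if (c >= 0) s += A.val[k * A.n + i] * x[c];
                }
                y[i] = s;
            }
        }

    CSR stores exactly nnz values plus n+1 row pointers; ELLPACK trades padding (wasted storage on irregular rows) for a regular access pattern. A matrix-free variant would drop the stored stiffness values entirely and recompute them from element data inside the multiply, trading indexing data for computation.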

    Loop tiling in large-scale stencil codes at run-time with OPS

    The key common bottleneck in most stencil codes is data movement, and prior research has shown that optimisations which improve data locality by scheduling across loops do particularly well. However, in many large PDE applications it is not possible to apply such optimisations through compilers, because there are many options, execution paths and data per grid point, many of them dependent on run-time parameters, and the code is distributed across different compilation units. In this paper, we adapt the data locality improving optimisation called iteration space slicing for use in large OPS applications, both in shared-memory and distributed-memory systems, relying on run-time analysis and delayed execution. We evaluate our approach on a number of applications, observing speedups of 2× on the CloverLeaf 2D/3D proxy applications, which contain 83 and 141 loops respectively, 3.5× on the linear solver TeaLeaf, and 1.7× on the compressible Navier-Stokes solver OpenSBLI. We demonstrate strong and weak scalability up to 4608 cores of CINECA's Marconi supercomputer. We also evaluate our algorithms on Intel's Knights Landing, demonstrating maintained throughput as the problem size grows beyond 16 GB, and we carry out scaling studies up to 8704 cores. The approach is generally applicable to any stencil DSL that provides per-loop data access information.
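
    The cross-loop scheduling idea can be illustrated with a toy 1D version of the technique. This is not OPS's run-time slicing machinery (which handles arbitrary stencils, MPI halos and delayed execution), just the underlying locality argument: two dependent sweeps are executed tile by tile, with the first sweep's range extended by one point so that the second sweep's reads are already available, and the intermediate array stays cache-resident.

        #include <algorithm>
        #include <vector>

        // Two dependent 3-point stencil sweeps over a grid of n points.
        // Untiled, each sweep streams the whole grid through memory; tiled,
        // sweep 2 consumes b[] while it is still hot in cache. Boundary
        // values b[0] and b[n-1] are assumed to be initialised.
        void two_sweeps_tiled(const std::vector<double>& a,
                              std::vector<double>& b,
                              std::vector<double>& c, int n, int tile) {
            for (int t = 1; t < n - 1; t += tile) {
                int hi = std::min(t + tile, n - 1);
                // Sweep 1, extended one point to the right so that sweep 2
                // at i = hi-1 can read b[hi] (a small redundant overlap).
                int hi1 = std::min(hi + 1, n - 1);
                for (int i = t; i < hi1; ++i)
                    b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
                // Sweep 2, restricted to the tile proper.
                for (int i = t; i < hi; ++i)
                    c[i] = (b[i - 1] + b[i] + b[i + 1]) / 3.0;
            }
        }

    The same skewing generalises to 2D/3D tiles and long chains of loops; in OPS the per-tile loop ranges are constructed automatically at run time from the recorded per-loop data access information.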

    Batch solution of small PDEs with the OPS DSL

    In this paper we discuss the challenges and optimisation opportunities that arise when solving a large number of small, equally sized discretised PDEs on regular grids. We present an extension of the OPS (Oxford Parallel library for Structured meshes) embedded domain-specific language, showing how support can be added for solving multiple systems, and how OPS makes it easy to deploy a variety of transformations and optimisations. The new capabilities in OPS allow data structure transformations as well as execution schedule transformations to be applied automatically, delivering high performance on a variety of hardware platforms. We evaluate our work on an industrially representative finance simulation on Intel CPUs as well as NVIDIA GPUs.
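
    A sketch of the kind of data-structure transformation involved is given below; the OPS extension itself is not reproduced here, and the indexing helpers are invented. The point is the layout choice for a batch of independent systems: batch-outer (one system after another) suits a CPU core sweeping one system with unit stride, while batch-inner (systems interleaved) lets consecutive GPU threads, each handling one system, access consecutive memory.

        // nbatch independent 1D grids of n points each, in one flat array.
        inline int idx_outer(int sys, int i, int n)      { return sys * n + i; }
        inline int idx_inner(int sys, int i, int nbatch) { return i * nbatch + sys; }

        // One Jacobi-style relaxation step applied to every system at once,
        // batch-inner layout: the inner loop over systems is unit-stride,
        // which is also the natural mapping for one GPU thread per system.
        void batched_sweep(const double* u, double* v, int n, int nbatch) {
            for (int i = 1; i < n - 1; ++i)
                for (int s = 0; s < nbatch; ++s)
                    v[idx_inner(s, i, nbatch)] =
                        0.5 * (u[idx_inner(s, i - 1, nbatch)] +
                               u[idx_inner(s, i + 1, nbatch)]);
        }

    Because OPS owns the data layout behind its API, switching between the two layouts is a library decision rather than an application rewrite.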

    The VOLNA-OP2 tsunami code (version 1.5)

    In this paper, we present the VOLNA-OP2 tsunami model and implementation: a finite-volume nonlinear shallow-water equation (NSWE) solver built on the OP2 domain-specific language (DSL) for unstructured mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high-performance computing platforms: central processing units (CPUs), the Intel Xeon Phi, and graphics processing units (GPUs). This is achieved by keeping the scientific code separate from the various parallel implementations, which makes the code easy to maintain. It has already been used in production for several years; here we discuss how it can be integrated into various workflows, such as a statistical emulator. The scalability of the code is demonstrated on three supercomputers, built with classical Xeon CPUs, the Intel Xeon Phi, and NVIDIA P100 GPUs. VOLNA-OP2 demonstrates an ability to deliver productivity as well as performance and portability to its users across a number of platforms.
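
    The separation of concerns works roughly as sketched below: the science code supplies a small per-edge kernel, and a single OP2 parallel-loop call describes its data accesses, from which the library generates the CPU, Xeon Phi and GPU variants. The mesh names and the flux kernel here are invented placeholders; the call shapes follow OP2's published API style.

        #include "op_seq.h"  // OP2; the same source builds for MPI/OpenMP/CUDA

        // User kernel: pure per-edge arithmetic. No parallelism, no mesh
        // indexing -- OP2 supplies the two cells via the edge->cell mapping.
        // (A made-up stand-in for a real NSWE flux computation.)
        void flux(const double* cellL, const double* cellR,
                  double* resL, double* resR) {
            double f = 0.5 * (cellL[0] - cellR[0]);
            resL[0] -= f;
            resR[0] += f;
        }

        // After op_decl_set/op_decl_map/op_decl_dat have declared the sets
        // "edges" and "cells", the edge->cell map "ecell" and the op_dats
        // "values"/"residual", the loop over all edges is a single call:
        //
        //   op_par_loop(flux, "flux", edges,
        //       op_arg_dat(values,   0, ecell, 1, "double", OP_READ),
        //       op_arg_dat(values,   1, ecell, 1, "double", OP_READ),
        //       op_arg_dat(residual, 0, ecell, 1, "double", OP_INC),
        //       op_arg_dat(residual, 1, ecell, 1, "double", OP_INC));
        //
        // The OP_INC increments race on shared cells; OP2's generated code
        // resolves this (e.g. by colouring) differently on each platform.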

    Comparison of Parallelisation Approaches, Languages, and Compilers for Unstructured Mesh Algorithms on GPUs

    Efficiently exploiting GPUs is increasingly essential in scientific computing, as many current and upcoming supercomputers are built using them. To facilitate this, there are a number of programming approaches, such as CUDA, OpenACC and OpenMP 4, supporting different programming languages (mainly C/C++ and Fortran). There are also several compiler suites (clang, nvcc, PGI, XL), each supporting different combinations of languages. In this study, we take a detailed look at some of the currently available options, and carry out a comprehensive analysis and comparison using computational loops and applications from the domain of unstructured mesh computations. Beyond runtimes and performance metrics (GB/s), we explore factors that influence performance, such as register counts, occupancy, usage of different memory types, instruction counts, and algorithmic differences. The results show that clang's CUDA compiler frequently outperforms NVIDIA's nvcc, that directive-based approaches suffer performance issues on complex kernels, and that OpenMP 4 support is maturing in clang and XL, currently around 10% slower than CUDA.
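
    To make the comparison concrete, here is the same indirect (gather-type) loop, characteristic of unstructured mesh codes, written once in CUDA and once with OpenMP 4.x target directives; the variable names are illustrative. The CUDA version fixes the thread mapping explicitly, while in the directive version the compiler (clang, XL, ...) chooses the GPU mapping, which is where the performance differences discussed above arise.

        // CUDA: one thread per edge; the caller picks the launch configuration.
        __global__ void gather_cuda(const double* __restrict__ x,
                                    const int* __restrict__ emap,
                                    double* __restrict__ y, int nedges) {
            int e = blockIdx.x * blockDim.x + threadIdx.x;
            if (e < nedges)
                y[e] = x[emap[2 * e]] + x[emap[2 * e + 1]];
        }

        // OpenMP 4.x: the same loop, offloaded via directives; the compiler
        // decides the teams/threads mapping and generates the data transfers.
        void gather_omp(const double* x, const int* emap, double* y,
                        int nnodes, int nedges) {
            #pragma omp target teams distribute parallel for \
                map(to: x[0:nnodes], emap[0:2*nedges]) map(from: y[0:nedges])
            for (int e = 0; e < nedges; ++e)
                y[e] = x[emap[2 * e]] + x[emap[2 * e + 1]];
        }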

    Beyond 16GB: Out-of-core stencil computations

    Stencil computations are a key class of applications, widely used in the scientific computing community, and a class that has particularly benefited from performance improvements on architectures with high memory bandwidth. Unfortunately, such architectures come with only a limited amount of fast memory, which limits the size of the problems that can be solved efficiently. In this paper, we address this challenge by applying the well-known cache-blocking tiling technique to large-scale stencil codes implemented using the OPS domain-specific language, such as CloverLeaf 2D, CloverLeaf 3D, and OpenSBLI. We introduce a number of techniques and optimisations to help manage data resident in fast memory and to minimise data movement. Evaluating our work on Intel's Knights Landing platform as well as NVIDIA P100 GPUs, we demonstrate that it is possible to solve problems 3 times larger than the on-chip memory size with at most a 15% loss in efficiency.
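
    The staging pattern behind this can be sketched as follows. This is a toy 1D version with invented names, not the OPS implementation, which additionally chains many loops over each tile so that every byte staged into fast memory is reused across the whole loop sequence: a slab of the grid, plus a one-point halo, is copied into fast (device) memory, processed, and written back.

        #include <cuda_runtime.h>
        #include <algorithm>

        __global__ void relax(const double* in, double* out, int len) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i > 0 && i < len - 1)
                out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
        }

        // The full grid (h_in/h_out, n points, possibly much larger than
        // device memory) stays on the host; only slab+halo lives on the GPU.
        void out_of_core_sweep(const double* h_in, double* h_out,
                               long n, long slab) {
            double *d_in, *d_out;
            cudaMalloc(&d_in,  (slab + 2) * sizeof(double));
            cudaMalloc(&d_out, (slab + 2) * sizeof(double));
            for (long lo = 1; lo < n - 1; lo += slab) {
                long len = std::min(slab, n - 1 - lo) + 2;   // slab + halo
                cudaMemcpy(d_in, h_in + lo - 1, len * sizeof(double),
                           cudaMemcpyHostToDevice);
                relax<<<(len + 255) / 256, 256>>>(d_in, d_out, (int)len);
                cudaMemcpy(h_out + lo, d_out + 1, (len - 2) * sizeof(double),
                           cudaMemcpyDeviceToHost);
            }
            cudaFree(d_in);
            cudaFree(d_out);
        }

    This sketch moves each slab for a single sweep only; the low overheads reported above depend on reusing each staged slab across many consecutive loops.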