15 research outputs found

    Acceleration of a Full-scale Industrial CFD Application with OP2

    Finite element algorithms and data structures on graphical processing units

    The finite element method (FEM) is one of the most commonly used techniques for the solution of partial differential equations on unstructured meshes. This paper discusses both the assembly and the solution phases of the FEM, with special attention to the balance of computation and data movement. We present a GPU assembly algorithm that scales to arbitrary-degree polynomials used as basis functions, at the expense of redundant computations. We show how the storage of the stiffness matrix affects the performance of both the assembly and the solution. We investigate two approaches: global assembly into the CSR and ELLPACK matrix formats, and matrix-free algorithms, and show the trade-off between the amount of indexing data and stiffness data. We discuss the performance of the different approaches in light of the implicit caches on Fermi GPUs, and show a speedup over a two-socket 12-core CPU of up to 10 times in the assembly and up to 6 times in the solution phase. We present our sparse matrix-vector multiplication algorithms that are part of a conjugate gradient iteration, and show that a matrix-free approach may be up to 2 times faster than the global assembly approaches and up to 4 times faster than NVIDIA's cuSPARSE library, depending on the preconditioner used.
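
    As an illustration of the storage trade-off above, here is a minimal sketch of the CSR and ELLPACK formats with serial reference SpMV loops. The struct and function names are made up for this sketch, and the loops are CPU code: the GPU kernels the paper discusses would assign one thread per row, which is why ELLPACK's padded, column-major layout (coalesced loads across neighbouring rows) is attractive there.

        #include <vector>

        // CSR: row_ptr[i]..row_ptr[i+1] indexes the nonzeros of row i.
        struct CSR {
            std::vector<int>    row_ptr;   // size n+1
            std::vector<int>    col_idx;   // size nnz
            std::vector<double> val;       // size nnz
        };

        // ELLPACK: every row padded to max_row entries; column-major storage
        // so that thread i of a GPU kernel reads val[k*n + i] coalesced.
        struct ELL {
            int n, max_row;
            std::vector<int>    col_idx;   // n * max_row, padded with -1
            std::vector<double> val;       // n * max_row, padded with 0.0
        };

        void spmv_csr(const CSR& A, const double* x, double* y, int n) {
            for (int i = 0; i < n; ++i) {
                double s = 0.0;
                for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
                    s += A.val[k] * x[A.col_idx[k]];
                y[i] = s;
            }
        }

        void spmv_ell(const ELL& A, const double* x, double* y) {
            for (int i = 0; i < A.n; ++i) {
                double s = 0.0;
                for (int k = 0; k < A.max_row; ++k) {
                    int c = A.col_idx[k * A.n + i];          // stride-n access
                    if (c >= 0) s += A.val[k * A.n + i] * x[c];
                }
                y[i] = s;
            }
        }

    CSR stores exactly nnz values plus n+1 row pointers; ELLPACK trades padding (wasted storage on irregular rows) for a regular access pattern. A matrix-free variant would drop the stored stiffness values entirely and recompute them from element data inside the multiply, trading indexing data for computation.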

    Loop tiling in large-scale stencil codes at run-time with OPS

    The key common bottleneck in most stencil codes is data movement, and prior research has shown that optimisations which improve data locality by scheduling across loops do particularly well. However, in many large PDE applications it is not possible to apply such optimisations through compilers, because there are many options, execution paths and data per grid point, many of them dependent on run-time parameters, and the code is distributed across different compilation units. In this paper, we adapt the data locality improving optimisation called iteration space slicing for use in large OPS applications, both in shared-memory and distributed-memory systems, relying on run-time analysis and delayed execution. We evaluate our approach on a number of applications, observing speedups of 2× on the CloverLeaf 2D/3D proxy applications, which contain 83 and 141 loops respectively, 3.5× on the linear solver TeaLeaf, and 1.7× on the compressible Navier-Stokes solver OpenSBLI. We demonstrate strong and weak scalability up to 4608 cores of CINECA's Marconi supercomputer. We also evaluate our algorithms on Intel's Knights Landing, demonstrating maintained throughput as the problem size grows beyond 16 GB, and we carry out scaling studies up to 8704 cores. The approach is generally applicable to any stencil DSL that provides per-loop data access information.
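
    The cross-loop scheduling idea can be illustrated with a toy 1D version of the technique. This is not OPS's run-time slicing machinery (which handles arbitrary stencils, MPI halos and delayed execution), just the underlying locality argument: two dependent sweeps are executed tile by tile, with the first sweep's range extended by one point so that the second sweep's reads are already available, and the intermediate array stays cache-resident.

        #include <algorithm>
        #include <vector>

        // Two dependent 3-point stencil sweeps over a grid of n points.
        // Untiled, each sweep streams the whole grid through memory; tiled,
        // sweep 2 consumes b[] while it is still hot in cache. Boundary
        // values b[0] and b[n-1] are assumed to be initialised.
        void two_sweeps_tiled(const std::vector<double>& a,
                              std::vector<double>& b,
                              std::vector<double>& c, int n, int tile) {
            for (int t = 1; t < n - 1; t += tile) {
                int hi = std::min(t + tile, n - 1);
                // Sweep 1, extended one point to the right so that sweep 2
                // at i = hi-1 can read b[hi] (a small redundant overlap).
                int hi1 = std::min(hi + 1, n - 1);
                for (int i = t; i < hi1; ++i)
                    b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
                // Sweep 2, restricted to the tile proper.
                for (int i = t; i < hi; ++i)
                    c[i] = (b[i - 1] + b[i] + b[i + 1]) / 3.0;
            }
        }

    The same skewing generalises to 2D/3D tiles and long chains of loops; in OPS the per-tile loop ranges are constructed automatically at run time from the recorded per-loop data access information.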

    Batch solution of small PDEs with the OPS DSL

    In this paper we discuss the challenges and optimisation opportunities that arise when solving a large number of small, equally sized discretised PDEs on regular grids. We present an extension of the OPS (Oxford Parallel library for Structured meshes) embedded domain-specific language, showing how support can be added for solving multiple systems, and how OPS makes it easy to deploy a variety of transformations and optimisations. The new capabilities in OPS allow data structure transformations as well as execution schedule transformations to be applied automatically, delivering high performance on a variety of hardware platforms. We evaluate our work on an industrially representative finance simulation on Intel CPUs as well as NVIDIA GPUs.
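
    A sketch of the kind of data-structure transformation involved is given below; the OPS extension itself is not reproduced here, and the indexing helpers are invented. The point is the layout choice for a batch of independent systems: batch-outer (one system after another) suits a CPU core sweeping one system with unit stride, while batch-inner (systems interleaved) lets consecutive GPU threads, each handling one system, access consecutive memory.

        // nbatch independent 1D grids of n points each, in one flat array.
        inline int idx_outer(int sys, int i, int n)      { return sys * n + i; }
        inline int idx_inner(int sys, int i, int nbatch) { return i * nbatch + sys; }

        // One Jacobi-style relaxation step applied to every system at once,
        // batch-inner layout: the inner loop over systems is unit-stride,
        // which is also the natural mapping for one GPU thread per system.
        void batched_sweep(const double* u, double* v, int n, int nbatch) {
            for (int i = 1; i < n - 1; ++i)
                for (int s = 0; s < nbatch; ++s)
                    v[idx_inner(s, i, nbatch)] =
                        0.5 * (u[idx_inner(s, i - 1, nbatch)] +
                               u[idx_inner(s, i + 1, nbatch)]);
        }

    Because OPS owns the data layout behind its API, switching between the two layouts is a library decision rather than an application rewrite.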

    The VOLNA-OP2 tsunami code (version 1.5)

    In this paper, we present the VOLNA-OP2 tsunami model and implementation: a finite-volume nonlinear shallow-water equation (NSWE) solver built on the OP2 domain-specific language (DSL) for unstructured mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high-performance computing platforms: central processing units (CPUs), the Intel Xeon Phi, and graphics processing units (GPUs). This is achieved by keeping the scientific code separate from the various parallel implementations, which makes the code easy to maintain. It has already been used in production for several years; here we discuss how it can be integrated into various workflows, such as a statistical emulator. The scalability of the code is demonstrated on three supercomputers, built with classical Xeon CPUs, the Intel Xeon Phi, and NVIDIA P100 GPUs. VOLNA-OP2 demonstrates an ability to deliver productivity as well as performance and portability to its users across a number of platforms.
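
    The separation of concerns works roughly as sketched below: the science code supplies a small per-edge kernel, and a single OP2 parallel-loop call describes its data accesses, from which the library generates the CPU, Xeon Phi and GPU variants. The mesh names and the flux kernel here are invented placeholders; the call shapes follow OP2's published API style.

        #include "op_seq.h"  // OP2; the same source builds for MPI/OpenMP/CUDA

        // User kernel: pure per-edge arithmetic. No parallelism, no mesh
        // indexing -- OP2 supplies the two cells via the edge->cell mapping.
        // (A made-up stand-in for a real NSWE flux computation.)
        void flux(const double* cellL, const double* cellR,
                  double* resL, double* resR) {
            double f = 0.5 * (cellL[0] - cellR[0]);
            resL[0] -= f;
            resR[0] += f;
        }

        // After op_decl_set/op_decl_map/op_decl_dat have declared the sets
        // "edges" and "cells", the edge->cell map "ecell" and the op_dats
        // "values"/"residual", the loop over all edges is a single call:
        //
        //   op_par_loop(flux, "flux", edges,
        //       op_arg_dat(values,   0, ecell, 1, "double", OP_READ),
        //       op_arg_dat(values,   1, ecell, 1, "double", OP_READ),
        //       op_arg_dat(residual, 0, ecell, 1, "double", OP_INC),
        //       op_arg_dat(residual, 1, ecell, 1, "double", OP_INC));
        //
        // The OP_INC increments race on shared cells; OP2's generated code
        // resolves this (e.g. by colouring) differently on each platform.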

    Comparison of Parallelisation Approaches, Languages, and Compilers for Unstructured Mesh Algorithms on GPUs

    Efficiently exploiting GPUs is increasingly essential in scientific computing, as many current and upcoming supercomputers are built using them. To facilitate this, there are a number of programming approaches, such as CUDA, OpenACC and OpenMP 4, supporting different programming languages (mainly C/C++ and Fortran). There are also several compiler suites (clang, nvcc, PGI, XL), each supporting different combinations of languages. In this study, we take a detailed look at some of the currently available options, and carry out a comprehensive analysis and comparison using computational loops and applications from the domain of unstructured mesh computations. Beyond runtimes and performance metrics (GB/s), we explore factors that influence performance, such as register counts, occupancy, usage of different memory types, instruction counts, and algorithmic differences. The results show that clang's CUDA compiler frequently outperforms NVIDIA's nvcc, that directive-based approaches suffer performance issues on complex kernels, and that OpenMP 4 support is maturing in clang and XL, currently around 10% slower than CUDA.
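
    To make the comparison concrete, here is the same indirect (gather-type) loop, characteristic of unstructured mesh codes, written once in CUDA and once with OpenMP 4.x target directives; the variable names are illustrative. The CUDA version fixes the thread mapping explicitly, while in the directive version the compiler (clang, XL, ...) chooses the GPU mapping, which is where the performance differences discussed above arise.

        // CUDA: one thread per edge; the caller picks the launch configuration.
        __global__ void gather_cuda(const double* __restrict__ x,
                                    const int* __restrict__ emap,
                                    double* __restrict__ y, int nedges) {
            int e = blockIdx.x * blockDim.x + threadIdx.x;
            if (e < nedges)
                y[e] = x[emap[2 * e]] + x[emap[2 * e + 1]];
        }

        // OpenMP 4.x: the same loop, offloaded via directives; the compiler
        // decides the teams/threads mapping and generates the data transfers.
        void gather_omp(const double* x, const int* emap, double* y,
                        int nnodes, int nedges) {
            #pragma omp target teams distribute parallel for \
                map(to: x[0:nnodes], emap[0:2*nedges]) map(from: y[0:nedges])
            for (int e = 0; e < nedges; ++e)
                y[e] = x[emap[2 * e]] + x[emap[2 * e + 1]];
        }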

    Beyond 16GB: Out-of-core stencil computations

    Stencil computations are a key class of applications, widely used in the scientific computing community, and a class that has particularly benefited from performance improvements on architectures with high memory bandwidth. Unfortunately, such architectures come with only a limited amount of fast memory, which limits the size of the problems that can be solved efficiently. In this paper, we address this challenge by applying the well-known cache-blocking tiling technique to large-scale stencil codes implemented using the OPS domain-specific language, such as CloverLeaf 2D, CloverLeaf 3D, and OpenSBLI. We introduce a number of techniques and optimisations to help manage data resident in fast memory and to minimise data movement. Evaluating our work on Intel's Knights Landing platform as well as NVIDIA P100 GPUs, we demonstrate that it is possible to solve problems 3 times larger than the on-chip memory size with at most a 15% loss in efficiency.
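
    The staging pattern behind this can be sketched as follows. This is a toy 1D version with invented names, not the OPS implementation, which additionally chains many loops over each tile so that every byte staged into fast memory is reused across the whole loop sequence: a slab of the grid, plus a one-point halo, is copied into fast (device) memory, processed, and written back.

        #include <cuda_runtime.h>
        #include <algorithm>

        __global__ void relax(const double* in, double* out, int len) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i > 0 && i < len - 1)
                out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
        }

        // The full grid (h_in/h_out, n points, possibly much larger than
        // device memory) stays on the host; only slab+halo lives on the GPU.
        void out_of_core_sweep(const double* h_in, double* h_out,
                               long n, long slab) {
            double *d_in, *d_out;
            cudaMalloc(&d_in,  (slab + 2) * sizeof(double));
            cudaMalloc(&d_out, (slab + 2) * sizeof(double));
            for (long lo = 1; lo < n - 1; lo += slab) {
                long len = std::min(slab, n - 1 - lo) + 2;   // slab + halo
                cudaMemcpy(d_in, h_in + lo - 1, len * sizeof(double),
                           cudaMemcpyHostToDevice);
                relax<<<(len + 255) / 256, 256>>>(d_in, d_out, (int)len);
                cudaMemcpy(h_out + lo, d_out + 1, (len - 2) * sizeof(double),
                           cudaMemcpyDeviceToHost);
            }
            cudaFree(d_in);
            cudaFree(d_out);
        }

    This sketch moves each slab for a single sweep only; the low overheads reported above depend on reusing each staged slab across many consecutive loops.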