Search CORE

14 research outputs found

Optimization of Tensor-product Operations in Nekbone on GPUs

Author: Jansson Niclas
Karp Martin
Markidis Stefano
Podobas Artur
Schlatter Philipp
Publication venue
Publication date: 01/01/2020
Field of study

In the CFD solver Nek5000, the computation is dominated by the evaluation of small tensor operations. Nekbone is a proxy app for Nek5000 and has previously been ported to GPUs with a mixed OpenACC and CUDA approach. In this work, we continue this effort and optimize the main tensor-product operation in Nekbone further. Our optimization is done in CUDA and uses a different, 2D, thread structure to make the computations layer by layer. This enables us to use loop unrolling as well as utilize registers and shared memory efficiently. Our implementation is then compared on both the Pascal and Volta GPU architectures to previous GPU versions of Nekbone as well as a measured roofline. The results show that our implementation outperforms previous GPU Nekbone implementations by 6-10%. Compared to the measured roofline, we obtain 77 - 92% of the peak performance for both Nvidia P100 and V100 GPUs for inputs with 1024 - 4096 elements and polynomial degree 9.Comment: 4 pages, 4 figure

arXiv.org e-Print Archive

Publikationer från KTH

Digitala Vetenskapliga Arkivet - Academic Archive On-line

NekBone with Optimized OpenACC directives

Author: Cebamanos Luis
Fischer Paul
Gong Jing
Hart Alistair
Laure Erwin
Markidis Stefano
Min Misun
Schliephake Michael
Publication venue
Publication date: 01/01/2015
Field of study

Proceedings of: Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015). Krakow (Poland), September 10-11, 2015.Accelerators and, in particular, Graphics Processing Units (GPUs) have emerged as promising computing technologies which may be suitable for the future Exascale systems. Here, we present performance results of NekBone, a benchmark of the Nek5000 code, implemented with optimized OpenACC directives and GPUDirect communications. Nek5000 is a computational fluid dynamics code based on the spectral element method used for the simulation of incompressible flow. Results of an optimized NekBone version lead to 78 Gflops performance on a single node. In addition, a performance result of 609 Tflops has been reached on 16, 384 GPUs of the Titan supercomputer at Oak Ridge National Laboratory.This work is partially supported by EU under the COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS) and the Swedish e-Science Research Center (SeRC). We acknowledge PRACE for awarding us access to resource CURIE based in France at CEA as well as the computing time on the Raven system at Cray and the Titan supercomputer at Oak Ridge National Laboratory. We would also like to thank Brent Leback for the CUDA FORTRAN code used in the paper

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Universidad Carlos III de Madrid e-Archivo

Master of Science

Author: Rivera Axel Y.
Publication venue: University of Utah
Publication date: 01/12/2014
Field of study

thesisTensors are mathematical representations of physical entities that have magnitude with multiple directions. Tensor contraction is a form of creating these objects using the Einstein summation equation. It is commonly used in physics and chemistry for solving problems like spectral elements and coupled cluster computation. Mathematically, tensor contraction operations can be reduced to expressions similar to matrix multiplications. However, linear algebra libraries (e.g., BLAS and LAPACK) perform poorly on the small matrix sizes that commonly arise in certain tensor contraction computations. Another challenge seen in the computation of tensor contraction is the dierence between the mathematical representation and an ecient implementation. This thesis proposes a framework that allows users to express a tensor contraction problem in a high-level mathematical representation and transform it into a linear algebra expression that is mapped to a high-performance implementation. The framework produces code that takes advantage of the parallelism that graphics processing units (GPUs) provide. It relies on autotuning to nd the preferred implementation that achieves high performance on the available device. Performance results from the benchmarks tested, nekbone and NWChem, show that the output of the framework achieves a speedup of 8.56x and 14.25x, respectively, on an NVIDIA Tesla C2050 GPU against the sequential version; while using an NVIDIA Tesla K20c GPU it achieved speedups of 8.87x and 17.62x. The parallel decompositions found by the tool were also tested with an OpenACC implementation and achieved a speedup of 8.87x and 10.42x for nekbone, while NWChem obtained a speedup of 7.25x and 10.34x compared to the choices made by default in the OpenACC compiler. The contributions of this work are: (1) a simplied interface that allows the user to express tensor contraction using a high-level representation and transform it into high-performance code; (2) a decision algorithm that explores a set of optimization strategies for achieving performance; and, (3) a demonstration that this approach can achieve better performance than OpenACC and can be used to accelerate OpenACC

The University of Utah: J. Willard Marriott Digital Library

Comparison of Parallelisation Approaches, Languages, and Compilers for Unstructured Mesh Algorithms on GPUs

Author: A Hart
D Dutykh
G Ruetsch
I Karlin
IZ Reguly
J Gong
J Nickolls
JE Stone
M Martineau
M Norman
MB Giles
S Wienke
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017
Field of study

Efficiently exploiting GPUs is increasingly essential in scientific computing, as many current and upcoming supercomputers are built using them. To facilitate this, there are a number of programming approaches, such as CUDA, OpenACC and OpenMP 4, supporting different programming languages (mainly C/C++ and Fortran). There are also several compiler suites (clang, nvcc, PGI, XL) each supporting different combinations of languages. In this study, we take a detailed look at some of the currently available options, and carry out a comprehensive analysis and comparison using computational loops and applications from the domain of unstructured mesh computations. Beyond runtimes and performance metrics (GB/s), we explore factors that influence performance such as register counts, occupancy, usage of different memory types, instruction counts, and algorithmic differences. Results of this work show how clang's CUDA compiler frequently outperform NVIDIA's nvcc, performance issues with directive-based approaches on complex kernels, and OpenMP 4 support maturing in clang and XL; currently around 10% slower than CUDA

arXiv.org e-Print Archive

Crossref

Warwick Research Archives Portal Repository

Repository of the Academy's Library

Productivity, performance, and portability for computational fluid dynamics applications

Author: Mudalige Gihan R.
Reguly Istvan Z.
Publication venue: 'Elsevier BV'
Publication date: 01/01/2020
Field of study

Hardware trends over the last decade show increasing complexity and heterogeneity in high performance computing architectures, which presents developers of CFD applications with three key challenges; the need for achieving good performance, being able to utilise current and future hardware by being portable, and doing so in a productive manner. These three appear to contradict each other when using traditional programming approaches, but in recent years, several strategies such as template libraries and Domain Specific Languages have emerged as a potential solution; by giving up generality and focusing on a narrower domain of problems, all three can be achieved. This paper gives an overview of the state-of-the-art for delivering performance, portability, and productivity to CFD applications, ranging from high-level libraries that allow the symbolic description of PDEs to low-level techniques that target individual algorithmic patterns. We discuss advantages and challenges in using each approach, and review the performance benchmarking literature that compares implementations for hardware architectures and their programming methods, giving an overview of key applications and their comparative performance

Warwick Research Archives Portal Repository

Repository of the Academy's Library

Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015) Krakow, Poland

Author: Carretero Pérez Jesús
García Blas Francisco Javier
Jeannot Emmanuel
Wyrzykowski Roman
Publication venue
Publication date: 01/10/2015
Field of study

Proceedings of: Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015). Krakow (Poland), September 10-11, 2015

Universidad Carlos III de Madrid e-Archivo

Applications for Ultrascale Computing

Author: Bongo Lars Ailo
Ciegis Raimondas
Frasheri Neki
Gong Jing
Kimovski Dragi
Kropf Peter
Margenov Svetozar
Mihajlovic Milan
Neytcheva Maya
Rauber Thomas
Runger Gudula
Trobec Roman
Wuyts Roel
Wyrzykowski Roman
Publication venue: 'FSAEIHE South Ural State University (National Research University)'
Publication date: 01/01/2015
Field of study

The University of Manchester - Institutional Repository

Accelerating advection for atmospheric modelling on Xilinx and Intel FPGAs

Author: Brown Nick
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 13/10/2021
Field of study

Edinburgh Research Explorer