474 research outputs found

High Performance Stencil Code Generation with LIFT

Author: Dubach Christophe
Gorlatch Sergei
Hagedorn Bastian
Steuwer Michel
Stoltzfus Larisa
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2018

Stencil computations are widely used in applications ranging from physical simulations to machine learning. They are embarrassingly parallel and fit modern hardware such as Graphics Processing Units well. Although stencil computations have been extensively studied, optimizing them for increasingly diverse hardware remains challenging. Domain Specific Languages (DSLs) have raised the programming abstraction and offer good performance. However, this places the burden on DSL implementers, who have to write almost full-fledged parallelizing compilers and optimizers. Lift has recently emerged as a promising approach to achieve performance portability and is based on a small set of reusable parallel primitives that DSL or library writers can build upon. Lift’s key novelty is in its encoding of optimizations as a system of extensible rewrite rules which are used to explore the optimization space. However, Lift has mostly focused on linear algebra operations and it remains to be seen whether this approach is applicable to other domains. This paper demonstrates how complex multidimensional stencil code and optimizations such as tiling are expressible using compositions of simple 1D Lift primitives. By leveraging existing Lift primitives and optimizations, we only require the addition of two primitives and one rewrite rule to do so. Our results show that this approach outperforms existing compiler approaches and hand-tuned codes.
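The idea of building a multidimensional stencil from 1D building blocks can be sketched in plain Python. This is only an illustration, not Lift code: the helper names `pad`, `slide`, `pad_2d`, `slide_2d` and `stencil_2d`, the clamping boundary behaviour, and the 3-point/3x3 shapes are all assumptions made for this example and are not taken from the abstract.

```python
def pad(left, right, xs):
    """Hypothetical 1D padding: repeat the edge elements (clamp boundary)."""
    return [xs[0]] * left + list(xs) + [xs[-1]] * right


def slide(size, step, xs):
    """Hypothetical 1D sliding window: all windows of `size`, `step` apart."""
    return [xs[i:i + size] for i in range(0, len(xs) - size + 1, step)]


def stencil_1d(f, xs):
    """3-point stencil: pad, create neighbourhoods, then map f over them."""
    return [f(window) for window in slide(3, 1, pad(1, 1, xs))]


def pad_2d(left, right, grid):
    """2D padding composed from the 1D pad: pad each row, then pad the rows."""
    return pad(left, right, [pad(left, right, row) for row in grid])


def slide_2d(size, step, grid):
    """2D neighbourhoods composed from the 1D slide plus a transpose."""
    rows = [slide(size, step, row) for row in grid]   # windows inside each row
    bands = slide(size, step, rows)                   # groups of adjacent rows
    # Transposing each band pairs up the row windows at the same column.
    return [[list(nb) for nb in zip(*band)] for band in bands]


def stencil_2d(f, grid):
    """3x3 stencil: f receives a 3x3 neighbourhood for every grid point."""
    return [[f(nb) for nb in line] for line in slide_2d(3, 1, pad_2d(1, 1, grid))]


result_1d = stencil_1d(sum, [1, 2, 3, 4])
result_2d = stencil_2d(lambda nb: sum(map(sum, nb)), [[1, 2], [3, 4]])
```

Here the 2D version introduces no new kind of operation: it only nests the 1D helpers and adds a transpose, which is the compositional flavour the abstract describes.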

Crossref

Edinburgh Research Explorer

Enlighten

Position-Dependent Arrays and Their Application for High Performance Code Generation

Author: Dubach Christophe
Pizzuti Federico
Steuwer Michel
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 18/08/2019

Edinburgh Research Explorer

High-Level Hardware Feature Extraction for GPU Performance Prediction of Stencils

Author: Chen Tianqi
Henriksen Troels
Lee Seyong
Leissa Roland
McDonell Trevor L.
Steuwer Michel
Tartara Michele
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 23/02/2020

Crossref

Edinburgh Research Explorer

Tiling Optimizations for Stencil Computations Using Rewrite Rules in Lift

Author: Dubach Christophe
Gorlatch Sergei
Hagedorn Bastian
Steuwer Michel
Stoltzfus Larisa
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/12/2019

Stencil computations are a widely used type of algorithm, found in applications from physical simulations to machine learning. Stencils are embarrassingly parallel and therefore fit modern hardware such as Graphics Processing Units well. Although stencil computations have been extensively studied, optimizing them for increasingly diverse hardware remains challenging. Domain-specific Languages (DSLs) have raised the programming abstraction and offer good performance; however, this method places the burden on DSL implementers to write almost full-fledged parallelizing compilers and optimizers. Lift has recently emerged as a promising approach to achieve performance portability by using a small set of reusable parallel primitives that DSL or library writers utilize. Lift’s key novelty is in its encoding of optimizations as a system of extensible rewrite rules which are used to explore the optimization space. This article demonstrates how complex multi-dimensional stencil code and optimizations are expressed using compositions of simple 1D Lift primitives and rewrite rules. We introduce two optimizations that provide high performance for stencils in particular: classical overlapped tiling for multi-dimensional stencils and 2.5D tiling specifically for 3D stencils. We provide an in-depth analysis of how the tiling optimizations affect stencils of different shapes and sizes across different applications. Our experimental results show that our approach outperforms existing compiler approaches and hand-tuned codes.
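Overlapped tiling can be illustrated in one dimension with a short Python sketch. It is an assumption-laden illustration rather than the article's implementation: the helpers `slide`, `join` and `tiled_windows` are hypothetical names, and the sketch assumes the input length lets the tiles cover the array exactly.

```python
def slide(size, step, xs):
    """Hypothetical 1D sliding window: all windows of `size`, `step` apart."""
    return [xs[i:i + size] for i in range(0, len(xs) - size + 1, step)]


def join(xss):
    """Flatten one level of nesting."""
    return [x for xs in xss for x in xs]


def tiled_windows(n, tile, xs):
    """Produce the same size-n windows as slide(n, 1, xs), but tile by tile.

    Neighbouring tiles overlap by n - 1 elements so that no window which
    straddles a tile boundary is lost.
    """
    step = tile - (n - 1)
    # Assumption of this sketch: the tiles cover the input exactly.
    assert (len(xs) - tile) % step == 0, "tiles must cover the input exactly"
    tiles = slide(tile, step, xs)
    return join(slide(n, 1, t) for t in tiles)


data = list(range(10))
untiled = slide(3, 1, data)
tiled = tiled_windows(3, 4, data)
```

With these inputs both `untiled` and `tiled` contain the same eight windows in the same order, which is what lets a tiled form be substituted for an untiled one; on a GPU each tile could then be staged in fast local memory, although that part is not modelled here.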

Edinburgh Research Explorer

Enlighten

A Modular Approach to Performance, Portability and Productivity for 3D Wave Models

Author: Bilbao Stefan
Dubach Christophe
Gray Alan
Steuwer Michel
Stoltzfus Larisa
Publication venue
Publication date: 01/01/2017

No abstract available

Edinburgh Research Explorer

Enlighten

Automatic Generation of Specialized Direct Convolutions for Mobile GPUs

Author: Abadi Martín
Chen Tianqi
Crowley Elliot J
Fukushima Kunihiko
Jia Yangqing
Leary Chris
Tsai Yaohung
Tschannen Michael
Zhang Jiyuan
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 23/02/2020

Crossref

Edinburgh Research Explorer

Code generation for room acoustics simulations with complex boundary conditions using LIFT

Author: Dubach Christophe
Hamilton Brian
Li Lily
Steuwer Michel
Stoltzfus Larisa
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 28/06/2021

Edinburgh Research Explorer

Automatic matching of legacy code to heterogeneous APIs: An idiomatic approach

Author: Bodin Bruno
Dubach Christophe
Ginsbach Philip
O'Boyle Michael
Remmelg Toomas
Steuwer Michel
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 19/03/2018

Heterogeneous accelerators often disappoint. They provide the prospect of great performance, but only deliver it when using vendor-specific optimized libraries or domain-specific languages. This requires considerable legacy code modifications, hindering the adoption of heterogeneous computing. This paper develops a novel approach to automatically detect opportunities for accelerator exploitation. We focus on calculations that are well supported by established APIs: sparse and dense linear algebra, stencil codes, and generalized reductions and histograms. We call them idioms and use a custom constraint-based Idiom Description Language (IDL) to discover them within user code. Detected idioms are then mapped to BLAS libraries, cuSPARSE and clSPARSE, and two DSLs: Halide and Lift. We implemented the approach in LLVM and evaluated it on the NAS and Parboil sequential C/C++ benchmarks, where we detect 60 idiom instances. In those cases where idioms are a significant part of the sequential execution time, we generate code that achieves 1.26× to over 20× speedup on integrated and external GPUs.

Crossref

Edinburgh Research Explorer

Enlighten

Automatic performance optimisation of parallel programs for GPUs via rewrite rules

Author: Remmelg Toomas
Publication venue: The University of Edinburgh
Publication date: 11/12/2019

Graphics Processing Units (GPUs) are now commonplace in computing systems and are the most successful parallel accelerators. Their performance is orders of magnitude higher than traditional Central Processing Units (CPUs), making them attractive for many application domains with high computational demands. However, achieving their full performance potential is extremely hard, even for experienced programmers, as it requires specialised software tailored for specific devices written in low-level languages such as OpenCL. Differences in device characteristics between manufacturers and even hardware generations often lead to large performance variations when different optimisations are applied. This inevitably leads to code that is not performance portable across different hardware. This thesis demonstrates that achieving performance portability is possible using LIFT, a functional data-parallel language which allows programs to be expressed at a high level in a hardware-agnostic way. The LIFT compiler is empowered to automatically explore the optimisation space using a set of well-defined rewrite rules to transform programs seamlessly between different high-level algorithmic forms before translating them to a low-level OpenCL-specific form. The first contribution of this thesis is the development of techniques to compile functional LIFT programs that have optimisations explicitly encoded into efficient imperative OpenCL code. Producing efficient code is non-trivial as many performance-sensitive details such as memory allocation, array accesses or synchronisation are not explicitly represented in the functional LIFT language. The thesis shows that the newly developed techniques are essential for achieving performance on par with manually optimised code for GPU programs with the exact same complex optimisations applied.
The second contribution of this thesis is the presentation of techniques that enable the LIFT compiler to perform complex optimisations that usually require from tens to hundreds of individual rule applications by grouping them as macro-rules that cut through the optimisation space. Using matrix multiplication as an example, starting from a single high-level program the compiler automatically generates highly optimised and specialised implementations for desktop and mobile GPUs with very different architectures, achieving performance portability. The final contribution of this thesis is the demonstration of how low-level and GPU-specific features are extracted directly from the high-level functional LIFT program, enabling the construction of a statistical performance model that makes accurate predictions about the performance of differently optimised program variants. This performance model is then used to drastically speed up the optimisation space exploration by ranking the different variants based on their predicted performance. Overall, this thesis demonstrates that performance portability is achievable using LIFT.
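As a rough illustration of what a semantics-preserving rewrite rule looks like, the Python sketch below applies the classic map-fusion rule, map(f) after map(g) becomes map(f composed with g), to a tiny pipeline of stages. The `Map` class and the `fuse` and `run` helpers are hypothetical and invented for this example; they are not part of the LIFT compiler described in the thesis.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Map:
    """A pipeline stage that applies `f` to every element."""
    f: Callable


def fuse(pipeline):
    """Rewrite adjacent Map stages into one: map(g) then map(f) -> map(f . g)."""
    out = []
    for stage in pipeline:
        if out and isinstance(out[-1], Map) and isinstance(stage, Map):
            prev = out.pop()
            # Bind both functions as defaults so each fused stage keeps its own pair.
            out.append(Map(lambda x, g=prev.f, f=stage.f: f(g(x))))
        else:
            out.append(stage)
    return out


def run(pipeline, xs):
    """Execute the stages left to right; each Map is one pass over the data."""
    for stage in pipeline:
        xs = [stage.f(x) for x in xs]
    return xs


program = [Map(lambda x: x + 1), Map(lambda x: x * 2), Map(lambda x: x - 3)]
optimised = fuse(program)

before = run(program, [1, 2, 3])
after = run(optimised, [1, 2, 3])
```

The rewritten pipeline has a single stage and makes one pass over the data instead of three while producing the same result. An optimiser in this style would apply many such rules in different orders and keep the variants that are predicted or measured to run fastest.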

Edinburgh Research Archive