165 research outputs found
Patterns and Rewrite Rules for Systematic Code Generation (From High-Level Functional Patterns to High-Performance OpenCL Code)
Computing systems have become increasingly complex with the emergence of
heterogeneous hardware combining multicore CPUs and GPUs. These parallel
systems exhibit tremendous computational power at the cost of increased
programming effort. This results in a tension between achieving performance and
code portability. Code is either tuned using device-specific optimizations to
achieve maximum performance or is written in a high-level language to achieve
portability at the expense of performance.
We propose a novel approach that offers high-level programming, code
portability and high-performance. It is based on algorithmic pattern
composition coupled with a powerful, yet simple, set of rewrite rules. This
enables systematic transformation and optimization of a high-level program into
a low-level hardware specific representation which leads to high performance
code.
We test our design in practice by describing a subset of the OpenCL
programming model with low-level patterns and by implementing a compiler which
generates high performance OpenCL code. Our experiments show that we can
systematically derive high-performance device-specific implementations from
simple high-level algorithmic expressions. The performance of the generated
OpenCL code is on par with highly tuned implementations for multicore CPUs and
GPUs written by experts. Comment: Technical Report
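The central mechanism of this work — composing algorithmic patterns and transforming them with rewrite rules — can be illustrated with a minimal sketch (not the paper's implementation, which targets OpenCL). A classic rule is map fusion, map f ∘ map g → map (f ∘ g), which removes an intermediate array; the expression encoding below is purely illustrative:

```python
# Patterns are encoded as tuples; fuse() applies the map-fusion
# rewrite rule and run() gives the expressions their meaning.

def compose(f, g):
    return lambda x: f(g(x))

def fuse(expr):
    """Rewrite ("seq", ("map", f), ("map", g)) into ("map", f∘g)."""
    if expr[0] == "seq" and expr[1][0] == "map" and expr[2][0] == "map":
        return ("map", compose(expr[1][1], expr[2][1]))
    return expr

def run(expr, xs):
    if expr[0] == "map":
        return [expr[1](x) for x in xs]
    if expr[0] == "seq":          # apply the right pattern first, then the left
        return run(expr[1], run(expr[2], xs))

prog = ("seq", ("map", lambda x: x + 1), ("map", lambda x: x * 2))
fused = fuse(prog)
# Both versions compute the same result; the fused one traverses the data once.
assert run(prog, [1, 2, 3]) == run(fused, [1, 2, 3]) == [3, 5, 7]
```

The rule is semantics-preserving, which is what makes systematic, automatic application safe.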
Generating Performance Portable Code using Rewrite Rules: From High-Level Functional Expressions to High-Performance OpenCL Code
Computers have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort resulting in a tension between performance and code portability. Typically, code is either tuned in a low-level imperative language using hardware-specific optimizations to achieve maximum performance or is written in a high-level, possibly functional, language to achieve portability at
the expense of performance.
We propose a novel approach aiming to combine high-level programming, code portability, and high-performance. Starting from a high-level functional expression we apply a simple set of rewrite rules to transform it into a low-level functional representation, close to the OpenCL programming model, from which OpenCL code is generated. Our rewrite rules define a space of possible implementations which we automatically explore to generate hardware-specific OpenCL implementations. We formalize our system with a core dependently-typed λ-calculus along with a denotational semantics
which we use to prove the correctness of the rewrite rules.
We test our design in practice by implementing a compiler which generates high-performance imperative OpenCL code. Our experiments show that we can automatically derive hardware-specific implementations from simple functional high-level algorithmic expressions, offering performance on a par with highly tuned code for multicore CPUs and GPUs written by experts.
Automatic performance optimisation of parallel programs for GPUs via rewrite rules
Graphics Processing Units (GPUs) are now commonplace in computing systems and are the
most successful parallel accelerators. Their performance is orders of magnitude higher than
traditional Central Processing Units (CPUs) making them attractive for many application domains
with high computational demands. However, achieving their full performance potential
is extremely hard, even for experienced programmers, as it requires specialised software tailored
for specific devices written in low-level languages such as OpenCL. Differences in device
characteristics between manufacturers and even hardware generations often lead to large performance
variations when different optimisations are applied. This inevitably leads to code that
is not performance portable across different hardware.
This thesis demonstrates that achieving performance portability is possible using LIFT, a
functional data-parallel language which allows programs to be expressed at a high-level in a
hardware-agnostic way. The LIFT compiler is empowered to automatically explore the optimisation
space using a set of well-defined rewrite rules to transform programs seamlessly between
different high-level algorithmic forms before translating them to a low-level OpenCL-specific
form.
The first contribution of this thesis is the development of techniques to compile functional
LIFT programs that have optimisations explicitly encoded into efficient imperative OpenCL
code. Producing efficient code is non-trivial as many performance-sensitive details such as
memory allocation, array accesses or synchronisation are not explicitly represented in the functional
LIFT language. The thesis shows that the newly developed techniques are essential for
achieving performance on par with manually optimised code for GPU programs with the exact
same complex optimisations applied.
The second contribution of this thesis is the presentation of techniques that enable the
LIFT compiler to perform complex optimisations that usually require tens to hundreds of
individual rule applications by grouping them as macro-rules that cut through the optimisation
space. Using matrix multiplication as an example, starting from a single high-level program
the compiler automatically generates highly optimised and specialised implementations for
desktop and mobile GPUs with very different architectures achieving performance portability.
The final contribution of this thesis is the demonstration of how low-level and GPU-specific
features are extracted directly from the high-level functional LIFT program, enabling the
construction of a statistical performance model that makes accurate predictions about the
performance of differently optimised program variants. This performance model is then used to
drastically reduce the time spent on optimisation space exploration by ranking the different
variants based on their predicted performance.
Overall, this thesis demonstrates that performance portability is achievable using LIFT.
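The macro-rules mentioned above group many individual rewrites into one semantics-preserving step. A well-known building block in LIFT-style systems is the split/join rule, map f → join ∘ map (map f) ∘ split n, which introduces tiling; a minimal Python sketch (illustrative only, not the LIFT implementation) shows why the rule is safe to apply automatically:

```python
def split(n, xs):
    """Partition xs into chunks of size n (last chunk may be shorter)."""
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def join(xss):
    """Flatten one level of nesting."""
    return [x for xs in xss for x in xs]

def tile_map(f, n, xs):
    # macro-rule: map f  ->  join . map (map f) . split n
    return join([[f(x) for x in chunk] for chunk in split(n, xs)])

f = lambda x: x * x
xs = list(range(8))
# The tiled version is observably identical to the plain map...
assert tile_map(f, 4, xs) == [f(x) for x in xs]
```

...but the nested structure it introduces is what a compiler can later map onto work-groups and threads on a GPU.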
A multi-level functional IR with rewrites for higher-level synthesis of accelerators
Specialised accelerators deliver orders of magnitude higher energy-efficiency than
general-purpose processors. Field Programmable Gate Arrays (FPGAs) have become
the substrate of choice, because the ever-changing nature of modern workloads, such
as machine learning, demands reconfigurability. However, they are notoriously hard
to program directly using Hardware Description Languages (HDLs). Traditional High-Level Synthesis (HLS) tools improve productivity, but come with their own problems.
They often produce sub-optimal designs and programmers are still required to write
hardware-specific code, thus development cycles remain long.
This thesis proposes Shir, a higher-level synthesis approach for high-performance
accelerator design with a hardware-agnostic programming entry point, a multi-level
Intermediate Representation (IR), a compiler and rewrite rules for optimisation.
First, a novel, multi-level functional IR structure for accelerator design is described.
The IRs operate on different levels of abstraction, cleanly separating different hardware
concerns. They enable the expression of different forms of parallelism and standard
memory features, such as asynchronous off-chip memories or synchronous on-chip
buffers, as well as arbitration of such shared resources. Exposing these features at the
IR level is essential for achieving high performance.
Next, mechanical lowering procedures are introduced to automatically compile
a program specification through Shir’s functional IRs until low-level HDL code for
FPGA synthesis is emitted. Each lowering step gradually adds implementation details.
Finally, this thesis presents rewrite rules for automatic optimisations around parallelisation, buffering and data reshaping. Reshaping operations pose a challenge to
functional approaches in particular. They introduce overheads that compromise performance or even prevent the generation of synthesisable hardware designs altogether.
This fundamental issue is solved by the application of rewrite rules.
The viability of this approach is demonstrated by running matrix multiplication
and 2D convolution on an Intel Arria 10 FPGA. A limited design space exploration is
conducted, confirming the ability of the IR to exploit various hardware features. Using
rewrite rules for optimisation, it is possible to generate high-performance designs
that are competitive with highly tuned OpenCL implementations and that outperform
hardware-agnostic OpenCL code. The performance impact of the optimisations is
further evaluated showing that they are essential to achieving high performance, and
in many cases also necessary to produce hardware that fits the resource constraints.
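The reshaping overheads mentioned above can often be eliminated purely by rewriting: for example, joining immediately after splitting is the identity, so the rule join (split n e) → e removes the reshape entirely. A toy Python sketch (the expression encoding is an assumption for illustration; Shir works on its own IRs) makes the idea concrete:

```python
def eval_(expr, env):
    """Give toy reshape expressions their meaning over Python lists."""
    op = expr[0]
    if op == "var":
        return env[expr[1]]
    if op == "split":
        n, xs = expr[1], eval_(expr[2], env)
        return [xs[i:i + n] for i in range(0, len(xs), n)]
    if op == "join":
        return [x for xs in eval_(expr[1], env) for x in xs]

def simplify(expr):
    # rewrite rule: join (split n e)  ->  e
    if expr[0] == "join" and expr[1][0] == "split":
        return simplify(expr[1][2])
    return expr

e = ("join", ("split", 2, ("var", "xs")))
env = {"xs": [1, 2, 3, 4]}
assert simplify(e) == ("var", "xs")                      # reshape removed
assert eval_(e, env) == eval_(simplify(e), env) == [1, 2, 3, 4]
```

In hardware terms, the rewritten design needs no buffer or wiring for the intermediate reshaped array at all.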
Refactoring for introducing and tuning parallelism for heterogeneous multicore machines in Erlang
This research has been generously supported by the European Union Framework 7 ParaPhrase project (IST-288570); the EU Horizon 2020 projects RePhrase (H2020-ICT-2014-1), agreement number 644235, and Teamplay (H2020-ICT-2017-1), agreement number 779882; EPSRC Discovery, EP/P020631/1; EU COST Action IC1202: Timing Analysis On Code-Level (TACLe); and a travel grant from EU HiPEAC.
This paper presents semi-automatic software refactorings to introduce and tune structured parallelism in sequential Erlang code, as well as to generate code for running computations on GPUs and possibly other accelerators. Our refactorings are based on the Lapedo framework for programming heterogeneous multi-core systems in Erlang. Lapedo builds on the PaRTE refactoring tool and also contains (1) a set of hybrid skeletons that target both CPU and GPU processors, (2) novel refactorings for introducing and tuning parallelism, and (3) a tool to generate the GPU offloading and scheduling code in Erlang, which is used as a component of the hybrid skeletons. We demonstrate, on four realistic use-case applications, that we are able to refactor sequential code and produce heterogeneous parallel versions that achieve significant and scalable speedups of up to 220 over the original sequential Erlang program on a 24-core machine with a GPU.
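Lapedo's hybrid skeletons are written in Erlang; purely as an illustration of the idea, a hybrid "farm" can be sketched in Python as a skeleton that splits a task list between a CPU worker and an accelerator worker according to a tunable ratio (the `gpu_ratio` parameter here is a hypothetical stand-in for the tuning the paper describes):

```python
def hybrid_farm(cpu_fn, gpu_fn, tasks, gpu_ratio=0.5):
    """Apply gpu_fn to the first gpu_ratio share of tasks and cpu_fn
    to the rest, modelling a CPU/GPU work split. Sequential sketch;
    a real skeleton would run the two parts concurrently."""
    cut = int(len(tasks) * gpu_ratio)
    gpu_part, cpu_part = tasks[:cut], tasks[cut:]
    return [gpu_fn(t) for t in gpu_part] + [cpu_fn(t) for t in cpu_part]

square = lambda x: x * x
# With identical workers the split is invisible in the result:
assert hybrid_farm(square, square, [1, 2, 3, 4], gpu_ratio=0.5) == [1, 4, 9, 16]
```

Tuning then amounts to choosing the ratio (and chunk sizes) that balance the two devices for a given workload.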
A domain-extensible compiler with controllable automation of optimisations
In high performance domains like image processing, physics simulation or machine learning, program performance is critical. Programmers called performance engineers are responsible for the challenging task of optimising programs. Two major challenges prevent modern compilers targeting heterogeneous architectures from reliably automating optimisation. First, domain specific compilers such as Halide for image processing and TVM for machine learning are difficult to extend with the new optimisations required by new algorithms and hardware. Second, automatic optimisation is often unable to achieve the required performance, and performance engineers often fall back to painstaking manual optimisation.
This thesis shows the potential of the Shine compiler to achieve domain-extensibility, controllable automation, and the generation of high-performance code. Domain-extensibility facilitates adapting compilers to new algorithms and hardware. Controllable automation enables performance engineers to gradually take control of the optimisation process. The first research contribution is to add 3 code generation features to Shine, namely: synchronisation barrier insertion, kernel execution, and storage folding. Adding these features requires making novel design choices in terms of compiler extensibility and controllability. The rest of this thesis builds on these features to generate code with competitive runtime compared to established domain-specific compilers. The second research contribution is to demonstrate how extensibility and controllability are exploited to optimise a standard image processing pipeline for corner detection. Shine implements 6 well-known image processing optimisations, 2 of which are not supported by Halide. Our results on 4 ARM multi-core CPUs show that the code generated by Shine for corner detection runs up to 1.4× faster than the Halide code. However, we observe that controlling rewriting is tedious, motivating the need for more automation.
The final research contribution is to introduce sketch-guided equality saturation, a semi-automated technique that allows performance engineers to guide program rewriting by specifying rewrite goals as sketches: program patterns that leave details unspecified. We evaluate this approach by applying 7 realistic optimisations of matrix multiplication. Without guidance, the compiler fails to apply the 5 most complex optimisations even given an hour and 60GB of RAM. With the guidance of at most 3 sketch guides, each 10 times smaller than the complete program, the compiler applies the optimisations in seconds using less than 1GB of RAM.
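The key object here is the sketch: a program pattern with holes that any matching program variant may fill in. A minimal sketch-matching check (illustrative only; the thesis formalises sketches over its own IR, and the tuple encoding below is an assumption) can be written as:

```python
def matches(sketch, term):
    """True if term is an instance of sketch, where "?" is a hole
    that matches any subterm."""
    if sketch == "?":
        return True
    if isinstance(sketch, tuple) and isinstance(term, tuple):
        return len(sketch) == len(term) and all(
            matches(s, t) for s, t in zip(sketch, term))
    return sketch == term

# Goal: "some tiled matmul", whatever the tile sizes turn out to be.
sketch = ("tile", "?", "?", ("matmul", "A", "B"))
assert matches(sketch, ("tile", 32, 8, ("matmul", "A", "B")))
assert not matches(sketch, ("matmul", "A", "B"))   # untiled: not a goal
```

During guided equality saturation, such a predicate lets the search stop as soon as any variant satisfying the current sketch goal has been found, instead of saturating the whole rewrite space.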
Automatic matching of legacy code to heterogeneous APIs: An idiomatic approach
Heterogeneous accelerators often disappoint. They offer the prospect of great performance, but only deliver it when vendor-specific optimized libraries or domain-specific languages are used. This requires considerable legacy code modification, hindering the adoption of heterogeneous computing.
This paper develops a novel approach to automatically
detect opportunities for accelerator exploitation. We focus
on calculations that are well supported by established APIs:
sparse and dense linear algebra, stencil codes and generalized
reductions and histograms. We call them idioms and use a
custom constraint-based Idiom Description Language (IDL)
to discover them within user code. Detected idioms are then
mapped to BLAS libraries, cuSPARSE and clSPARSE and two
DSLs: Halide and Lift.
We implemented the approach in LLVM and evaluated
it on the NAS and Parboil sequential C/C++ benchmarks,
where we detect 60 idiom instances. In those cases where
idioms are a significant part of the sequential execution time,
we generate code that achieves 1.26× to over 20× speedup
on integrated and external GPUs.
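Idiom detection works by checking constraints over the code's structure rather than matching syntax. As a toy illustration (IDL itself is a constraint language over LLVM IR; the loop-summary dictionary below is a hypothetical simplification), a generalized reduction can be recognised by requiring a single accumulator updated with an associative operator and no other loop-carried dependence:

```python
def is_reduction(loop):
    """Constraint check on a toy loop summary: the only write is to
    the accumulator, the update operator is associative, and no other
    value is carried across iterations."""
    return (loop["writes"] == [loop["accumulator"]]
            and loop["op"] in {"+", "*", "max", "min"}
            and not loop["loop_carried_reads"])

# Summary of:  s = 0; for i in range(n): s += a[i] * b[i]   (dot product)
dot = {"accumulator": "s", "writes": ["s"], "op": "+",
       "loop_carried_reads": []}
assert is_reduction(dot)
```

Once an idiom instance is recognised this way, the matched region can be replaced wholesale by a call into an optimized library or DSL.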
Tools for improving performance portability in heterogeneous environments
Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01
[Abstract]
Parallel computing is currently partially dominated by the availability of heterogeneous
devices. These devices differ from each other in aspects such as the
instruction set they execute, the number and the type of computing devices that
they offer, or the structure of their memory systems. In recent years, languages,
libraries and extensions have appeared that allow a parallel code to be written once and
run on a wide variety of devices, OpenCL being the most widespread solution of
this kind. However, functional portability does not imply performance portability.
Thus, one of the problems that is still open in this field is achieving automatic
performance portability, that is, the ability to automatically tune a given code for
any device where it will be executed so that it obtains good performance. This
thesis develops three different solutions to tackle this problem. All three
are based on typical source-to-source optimizations for heterogeneous devices. Both
the set of optimizations to apply and the way they are applied depend on different
optimization parameters, whose values have to be tuned for each specific device.
The first solution is OCLoptimizer, a source-to-source optimizer that can optimize
annotated OpenCL kernels with the help of configuration files that guide the
optimization process. The tool optimizes kernels for a specific device, and it is also
able to automate the generation of functional host codes when only a single kernel
is optimized.
The two remaining solutions are built on top of the Heterogeneous Programming
Library (HPL), a C++ framework that provides an easy and portable way to exploit
heterogeneous computing systems. The first of these solutions uses the run-time
code generation capabilities of HPL to generate a self-optimizing version of a matrix
multiplication that can optimize itself at run-time for a specific device. The last
solution is the development of a built-in just-in-time optimizer for HPL, which can
optimize, at run-time, an HPL code for a specific device. While the first two solutions
use search processes to find the best values for the optimization parameters, this last
alternative relies on heuristics based on general optimization strategies.
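The search processes used by the first two solutions amount to evaluating candidate values of an optimization parameter and keeping the best. A minimal sketch (the cost model here is a synthetic stand-in for on-device timing, and the tile-size scenario is hypothetical) of such an exhaustive search:

```python
def tune(candidates, cost):
    """Exhaustive search: return the candidate parameter value with
    the lowest cost. In practice cost() would time the generated
    kernel on the target device."""
    return min(candidates, key=cost)

# Hypothetical cost model: heavily penalise tile sizes that do not
# divide the matrix dimension, and prefer tiles near 64.
N = 512
cost = lambda tile: (N % tile) * 1000 + abs(tile - 64)
best = tune([8, 16, 32, 48, 64, 128], cost)
assert best == 64
```

A just-in-time optimizer replaces this search with heuristics, trading some peak performance for the cost of not running the search at all.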
- …