Code generation for 3D partial differential equation models from a high-level functional intermediate language
Partial Differential Equation (PDE) modelling is an important tool in scientific domains for bridging
theory with reality; however, such models can be complex to program and even more difficult to abstract. The
evolving parallel computing landscape is also making it increasingly difficult to write and maintain codes
(such as PDE models) which retain performance across different parallel platforms. Computational
scientists should be able to focus on their science instead of also having to become high performance
computing experts in order to take advantage of faster parallel hardware. Current methods targeting this
problem either concentrate on very niche applications, are too simplistic for real-world problems, or are
too low-level to be easily programmable. Domain Specific Languages (DSLs) are a popular approach,
but they have two opposing goals: improving programmability, while also providing high performance.
This thesis presents a solution for developing performance portable 3D PDE models, using room
acoustics simulations as a case study, by raising the abstraction level in the existing hardware-agnostic,
intermediary language LIFT. This functional language and compiler is designed for DSLs to compile into
and provides a separation of concerns for developing parallel applications. This separation enables DSL
writers to focus on developing high-level abstractions providing productivity to the user, while LIFT turns
the intermediary parallel representation these abstractions compile down to into hardware-optimised
code. A suite of composable, algorithmic primitives enables LIFT to reuse functionality across domains
and an exploratory search space provides a way to find the best optimisations for a given platform.
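The flavour of such composable algorithmic primitives can be sketched in a few lines of Python. This is an illustrative stand-in only: LIFT is a functional intermediate language embedded in Scala, and the names lift_map, lift_reduce, and slide below are invented for this sketch, not LIFT's actual API.

```python
from functools import reduce as fold

def lift_map(f, xs):
    """Apply f independently to every element (parallelisable)."""
    return [f(x) for x in xs]

def lift_reduce(f, init, xs):
    """Combine elements with an associative operator."""
    return fold(f, xs, init)

def slide(size, step, xs):
    """Produce overlapping neighbourhoods, the basis of stencils."""
    return [xs[i:i + size] for i in range(0, len(xs) - size + 1, step)]

# A 1D 3-point stencil expressed purely by composing the primitives:
def jacobi3(xs):
    return lift_map(lambda nbh: sum(nbh) / 3.0, slide(3, 1, xs))

print(jacobi3([1.0, 2.0, 3.0, 4.0, 5.0]))                    # [2.0, 3.0, 4.0]
print(lift_reduce(lambda a, b: a + b, 0.0, jacobi3([1.0, 2.0, 3.0, 4.0, 5.0])))  # 9.0
```

Because each primitive is a pure function, compositions like jacobi3 can be rewritten and re-mapped to hardware without changing their meaning, which is what enables the rewrite-rule-driven search described above.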
As this thesis shows, room acoustic simulations are expressible in LIFT with only a few small
changes to the framework. These expressions achieve performance comparable to, or better than, that of
the original hand-written benchmarks. Furthermore, such expressions enable room acoustics models to
run across multiple platforms and easily swap in optimisations. Being able to test out what optimisations
give the best performance for a given platform, without rewriting or retuning, allows computational
scientists to focus on their own work.
Optimisations previously inaccessible in LIFT are developed that target 3D stencils generally, including 3D PDE models. In particular, 2.5D Tiling and compiler passes to inline private arrays and structs
are added to the LIFT ecosystem, giving high performance to various 3D stencil codes. The 2.5D Tiling
optimisation is coded functionally for the first time in LIFT and is selected automatically by additional
rewrite rules. These rewrite rules, such as the one for 2.5D Tiling, are explored in a search space to find
the best set of optimisations for an application on a given platform.
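As a rough illustration of the optimisation itself, the following plain-Python model (not the functional LIFT formulation or its generated OpenCL) shows the idea of 2.5D Tiling for a 7-point 3D stencil: the x-y plane is tiled, while the z dimension is streamed, keeping only the three planes needed at any moment resident in fast memory (modelled here as local variables).

```python
def stencil7(u):
    """Reference: average of the centre point and its 6 neighbours."""
    n = len(u)
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for z in range(1, n - 1):
        for y in range(1, n - 1):
            for x in range(1, n - 1):
                out[z][y][x] = (u[z][y][x] + u[z-1][y][x] + u[z+1][y][x]
                                + u[z][y-1][x] + u[z][y+1][x]
                                + u[z][y][x-1] + u[z][y][x+1]) / 7.0
    return out

def stencil7_tiled(u, tile=2):
    """Same computation, but x-y tiled with z streamed plane by plane."""
    n = len(u)
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for y0 in range(1, n - 1, tile):               # 2D tiling over x-y
        for x0 in range(1, n - 1, tile):
            below, mid, above = u[0], u[1], u[2]   # three resident planes
            for z in range(1, n - 1):              # stream through z
                for y in range(y0, min(y0 + tile, n - 1)):
                    for x in range(x0, min(x0 + tile, n - 1)):
                        out[z][y][x] = (mid[y][x] + below[y][x] + above[y][x]
                                        + mid[y-1][x] + mid[y+1][x]
                                        + mid[y][x-1] + mid[y][x+1]) / 7.0
                if z + 2 < n:                      # rotate planes upward
                    below, mid, above = mid, above, u[z + 2]
    return out
```

The tiled variant performs the identical arithmetic in the identical order per point; on a GPU, the resident planes would live in registers or shared memory, which is where the speedup comes from.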
Building on previous work, LIFT is extended to enable complex boundary conditions and room
shapes for room acoustics models. This is the first intermediate representation in a high-level code generator to do so, and the first high-level framework to support frequency-dependent
boundary handling for room acoustics simulations. Combined, these contributions show that high-level
abstractions for 3D PDE models are possible, enabling computational scientists to optimise and parallelise their codes more easily across different parallel platforms.
Deep learning for compilers
Constructing compilers is hard. Optimising compilers are multi-million dollar projects
spanning years of development, yet remain unable to fully exploit the available performance,
and are prone to bugs. The rapid transition to heterogeneous parallelism and
diverse architectures has raised demand for aggressively-optimising compilers to an
all-time high, leaving compiler developers struggling to keep up. What is needed are
better tools to simplify compiler construction.
This thesis presents new techniques that dramatically lower the cost of compiler
construction, while improving robustness and performance. The enabling insight for
this research is that deep learning can model the correlations between
source code and program behaviour, allowing tasks which previously required significant
engineering effort to be automated. This is demonstrated in three domains:
First, a generative model for compiler benchmarks is developed. The model requires
no prior knowledge of programming languages, yet produces output of such
quality that professional software developers cannot distinguish generated from hand-written
programs. The efficacy of the generator is demonstrated by supplementing the
training data of predictive models for compiler optimisations. The generator yields an
automatic improvement in heuristic performance, and exposes weaknesses in state-of-the-
art approaches which, when corrected, yield further performance improvements.
Second, a compiler fuzzer is developed which is far simpler than prior techniques.
By learning a generative model rather than engineering a generator from scratch, it is
implemented in 100× fewer lines of code than the state-of-the-art, yet is capable of
exposing bugs which prior techniques cannot. An extensive testing campaign reveals
67 new bugs in OpenCL compilers, many of which have now been fixed.
Finally, this thesis addresses the challenge of feature design. A methodology for
learning compiler heuristics is presented that, in contrast to prior approaches, learns
directly over the raw textual representation of programs. The approach outperforms
state-of-the-art models with hand-engineered features in two challenging optimisation
domains, without requiring any expert guidance. Additionally, the methodology enables
models trained in one task to be adapted to perform another, permitting the novel
transfer of information between optimisation problem domains.
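The "raw textual representation" idea can be illustrated with a minimal sketch: program source is consumed as a sequence of integer-encoded characters, with no hand-engineered features in between. The vocabulary scheme and the toy kernel string below are invented for illustration; the thesis itself targets OpenCL programs.

```python
def build_vocab(corpus):
    """Assign a stable integer id to every distinct character."""
    return {ch: i for i, ch in enumerate(sorted(set(corpus)))}

def encode(program, vocab):
    """Turn raw source text into the integer sequence a model consumes."""
    return [vocab[ch] for ch in program]

kernel = "int f(int a) { return a + 1; }"
vocab = build_vocab(kernel)
ids = encode(kernel, vocab)
print(ids[:3])   # [9, 10, 12] -- the ids for 'i', 'n', 't'
```

A sequence model trained over such encodings sees only the text itself, which is what removes the need for expert feature design.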
The techniques developed in these three contrasting domains demonstrate the exciting
potential of deep learning to simplify and improve compiler construction. The
outcomes of this thesis enable new lines of research to equip compiler developers to
keep up with the rapidly evolving landscape of heterogeneous architectures.
From constraint programming to heterogeneous parallelism
The scaling limitations of multi-core processor development have led to a diversification of the processor cores used within individual computers. Heterogeneous computing has become widespread, involving the cooperation of several structurally different processor cores. Central processing unit (CPU) cores are most frequently complemented with graphics processors (GPUs), which despite their name are suitable for many highly parallel computations besides computer graphics. Furthermore, deep learning accelerators are rapidly gaining relevance.
Many applications could profit from heterogeneous computing but are held back by the surrounding software ecosystems. Heterogeneous systems are a challenge for compilers in particular, which usually target only the increasingly marginalised homogeneous CPU cores. Therefore, heterogeneous acceleration is primarily accessible via libraries and domain-specific languages (DSLs), requiring application rewrites and resulting in vendor lock-in.
This thesis presents a compiler method for automatically targeting heterogeneous hardware from existing sequential C/C++ source code. A new constraint programming method enables the declarative specification and automatic detection of computational idioms within compiler intermediate representation code. Examples of computational idioms are stencils, reductions, and linear algebra. Computational idioms denote algorithmic structures that commonly occur in performance-critical loops. Consequently, well-designed accelerator DSLs and libraries support computational idioms with their programming models and function interfaces. The detection of computational idioms in their middle end enables compilers to incorporate DSL and library backends for code generation. These backends leverage domain knowledge for the efficient utilisation of heterogeneous hardware.
The constraint programming methodology is first derived on an abstract model and then implemented as an extension to LLVM. Two constraint programming languages are designed to target this implementation: the Compiler Analysis Description Language (CAnDL), and the extended Idiom Detection Language (IDL). These languages are evaluated on a range of different compiler problems, culminating in a complete heterogeneous acceleration pipeline integrated with the Clang C/C++ compiler. This pipeline was evaluated on the established benchmark collections NPB and Parboil. The approach was applicable to 10 of the benchmark programs, resulting in significant speedups from 1.26× on "histo" to 275× on "sgemm" when starting from sequential baseline versions.
In summary, this thesis shows that the automatic recognition of computational idioms during compilation enables the heterogeneous acceleration of sequential C/C++ programs. Moreover, the declarative specification of computational idioms is derived in novel declarative programming languages, and it is demonstrated that constraint programming on Static Single Assignment intermediate code is a suitable method for their automatic detection.
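To convey the flavour of declarative idiom detection, here is a minimal sketch over an invented toy SSA-like IR. CAnDL and IDL express such constraints declaratively over real LLVM IR; this miniature only shows the shape of the constraint being solved.

```python
# Each instruction: (result, opcode, operands). A 'phi' merges the value
# flowing around the loop back edge -- the classic reduction shape.
loop_body = [
    ("acc1", "phi",  ["acc0", "acc2"]),   # loop-carried value
    ("x",    "load", ["ptr"]),
    ("acc2", "add",  ["acc1", "x"]),      # acc2 feeds back into the phi
]

def find_reductions(insts):
    """Constraint: an add whose result flows into a phi that is itself
    one of the add's operands (a loop-carried cycle)."""
    phis = {r: ops for r, op, ops in insts if op == "phi"}
    found = []
    for r, op, ops in insts:
        if op == "add" and any(r in pops and p in ops
                               for p, pops in phis.items()):
            found.append(r)
    return found

print(find_reductions(loop_body))   # ['acc2']
```

Once such a match is found in the middle end, the matched loop can be handed to a reduction-aware library or DSL backend instead of the generic CPU code generator.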
Polyhedral+Dataflow Graphs
This research presents an intermediate compiler representation that is designed for optimization, and emphasizes the temporary storage requirements and execution schedule of a given computation to guide optimization decisions. The representation is expressed as a dataflow graph that describes computational statements and data mappings within the polyhedral compilation model. The targeted applications include both the regular and irregular scientific domains.
The intermediate representation can be integrated into existing compiler infrastructures. A specification language, implemented as a domain-specific language in C++, describes the graph components and the transformations that can be applied. The visual representation allows users to reason about optimizations. Graph variants can be translated into source code or other representations. The language, intermediate representation, and associated transformations have been applied to improve the performance of differential equation solvers, sparse matrix operations, tensor decomposition, and structured multigrid methods.
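A minimal sketch of the dataflow-graph view: statements are nodes, data dependences are edges, and any topological order of the graph is a legal execution schedule. The statement names below are invented for illustration; the actual representation additionally attaches polyhedral iteration domains and data mappings to each node.

```python
from graphlib import TopologicalSorter

# edges: consumer -> set of producers it depends on
graph = {
    "interpolate": {"solve_coarse"},
    "solve_coarse": {"restrict"},
    "restrict": {"smooth"},
    "smooth": set(),
}

# A legal schedule respects every producer-consumer edge.
schedule = list(TopologicalSorter(graph).static_order())
print(schedule)   # ['smooth', 'restrict', 'solve_coarse', 'interpolate']
```

Reordering or fusing nodes within the constraints of the graph is exactly the space of transformations the representation is designed to expose, alongside the temporary storage each edge implies.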
Automatic Code Generation for Massively Parallel Applications in Computational Fluid Dynamics
Solving partial differential equations (PDEs) is a fundamental challenge in many application domains in industry and academia alike. With increasingly large problems, efficient and highly scalable implementations become more and more crucial. Today, facing this challenge is more difficult than ever due to the increasingly heterogeneous hardware landscape. One promising approach is developing domain-specific languages (DSLs) for a set of applications. Using code generation techniques then allows targeting a range of hardware platforms while concurrently applying domain-specific optimizations in an automated fashion. The present work aims to further the state of the art in this field. As domain, we choose PDE solvers and, in particular, those from the group of geometric multigrid methods. To avoid having a focus too broad, we restrict ourselves to methods working on structured and patch-structured grids.
We face the challenge of handling a domain as complex as ours, while providing different abstractions for diverse user groups, by splitting our external DSL ExaSlang into multiple layers, each specifying different aspects of the final application. Layer 1 is designed to resemble LaTeX and allows inputting continuous equations and functions. Their discretization is expressed on layer 2. It is complemented by algorithmic components which can be implemented in a Matlab-like syntax on layer 3. All information provided to this point is summarized on layer 4, enriched with particulars about data structures and the employed parallelization. Additionally, we support automated progression between the different layers. All ExaSlang input is processed by our jointly developed Scala code generation framework to ultimately emit C++ code. We particularly focus on how to generate applications parallelized with, e.g., MPI and OpenMP that are able to run on workstations and large-scale clusters alike.
We showcase the applicability of our approach by implementing simple test problems, like Poisson's equation, as well as relevant applications from the field of computational fluid dynamics (CFD). In particular, we implement scalable solvers for the Stokes, Navier-Stokes and shallow water equations (SWE) discretized using finite differences (FD) and finite volumes (FV). For the case of Navier-Stokes, we also extend our implementation towards non-uniform grids, thereby enabling static mesh refinement, and advanced effects such as the simulated fluid being non-Newtonian and non-isothermal.
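The algorithmic core of the targeted domain, a geometric multigrid V-cycle, can be sketched in Python for the 1D Poisson equation -u'' = f with zero Dirichlet boundaries. This is a hand-written illustration of the method only, not ExaSlang input or generated C++ (which target 3D, patch-structured grids).

```python
import math

def vcycle(u, f, h, nu=3, w=2.0/3.0):
    """One V-cycle for -u'' = f on interior points u (Dirichlet 0)."""
    n = len(u)
    def smooth(v, times):                  # weighted Jacobi smoother
        for _ in range(times):
            v = [(1 - w) * v[i]
                 + w * 0.5 * ((v[i-1] if i > 0 else 0.0)
                              + (v[i+1] if i < n - 1 else 0.0)
                              + h * h * f[i])
                 for i in range(n)]
        return v
    u = smooth(u, nu)                      # pre-smooth
    if n > 1:
        # residual r = f - A u
        r = [f[i] - (2*u[i] - (u[i-1] if i > 0 else 0.0)
                     - (u[i+1] if i < n - 1 else 0.0)) / (h*h)
             for i in range(n)]
        # full-weighting restriction to the coarse grid
        rc = [(r[2*j] + 2*r[2*j+1] + r[2*j+2]) / 4.0
              for j in range((n - 1) // 2)]
        ec = vcycle([0.0] * len(rc), rc, 2 * h)   # coarse-grid correction
        e = [0.0] * n                      # linear interpolation back
        for j, v in enumerate(ec):
            e[2*j+1] += v
            e[2*j]   += 0.5 * v
            e[2*j+2] += 0.5 * v
        u = [ui + ei for ui, ei in zip(u, e)]
    return smooth(u, nu)                   # post-smooth

# Demo: f = pi^2 sin(pi x), whose exact solution is sin(pi x).
n, h = 15, 1.0 / 16.0
xs = [(i + 1) * h for i in range(n)]
f = [math.pi**2 * math.sin(math.pi * x) for x in xs]
u = [0.0] * n
for _ in range(12):
    u = vcycle(u, f, h)
err = max(abs(ui - math.sin(math.pi * x)) for ui, x in zip(u, xs))
print(err)   # only the O(h^2) discretisation error remains
```

Each ingredient (smoother, restriction, prolongation, coarse solve) is exactly the kind of algorithmic component expressed on ExaSlang's layer 3, with the code generator producing the parallel 3D equivalents.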
A Survey on Compiler Autotuning using Machine Learning
Since the mid-1990s, researchers have been trying to use machine-learning
based approaches to solve a number of different compiler optimization problems.
These techniques primarily enhance the quality of the obtained results and,
more importantly, make it feasible to tackle two main compiler optimization
problems: optimization selection (choosing which optimizations to apply) and
phase-ordering (choosing the order of applying optimizations). The compiler
optimization space continues to grow due to the advancement of applications,
increasing number of compiler optimizations, and new target architectures.
Generic optimization passes in compilers cannot fully leverage newly introduced
optimizations and, therefore, cannot keep up with the pace of increasing
options. This survey summarizes and classifies the recent advances in using
machine learning for the compiler optimization field, particularly on the two
major problems of (1) selecting the best optimizations and (2) the
phase-ordering of optimizations. The survey highlights the approaches taken so
far, the obtained results, the fine-grain classification among different
approaches and finally, the influential papers of the field.
(Accepted at ACM Computing Surveys (CSUR), 2018. Received November 2016; revised August 2017 and February 2018; accepted March 2018.)
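The phase-ordering problem the survey covers can be illustrated with a toy sketch: with an invented cost function standing in for real compilation and measurement, even four passes yield 24 orderings, and the space grows factorially with the number of passes, which is why learned search guidance pays off.

```python
import itertools

passes = ["inline", "unroll", "vectorize", "dce"]

def cost(ordering):
    """Toy stand-in for a measured runtime (purely illustrative):
    vectorize only pays off after unroll, and dce is cheapest last."""
    c = 10.0
    if ordering.index("unroll") < ordering.index("vectorize"):
        c -= 4.0
    if ordering[-1] == "dce":
        c -= 1.0
    return c

# Exhaustive search is feasible here; real pipelines have far too many
# passes for that, which is where learned models come in.
best = min(itertools.permutations(passes), key=cost)
print(best, cost(best))
```

A machine-learned model replaces the exhaustive `min` with a predictor that proposes promising orderings directly from program features.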
Machine Learning in Compiler Optimization
In the last decade, machine learning based compilation has moved from an obscure research niche to a mainstream activity. In this article, we describe the relationship between machine learning and compiler optimisation and introduce the main concepts of features, models, training and deployment. We then provide a comprehensive survey and a road map for the wide variety of different research areas. We conclude with a discussion of open issues in the area and potential research directions. This paper provides both an accessible introduction to the fast-moving area of machine learning based compilation and a detailed bibliography of its main achievements.
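The feature/model/training workflow described here can be sketched minimally: a program is summarised as a numeric feature vector, and a trained model predicts the best decision for it. The hand-rolled nearest-neighbour model and the training data below are invented for illustration.

```python
def nearest(train, query):
    """Return the label of the training point closest to the query."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda point: dist(point[0], query))[1]

# (features, best_device); features = (arithmetic ops, memory ops)
train = [((200.0, 10.0), "gpu"),   # compute-bound kernels
         ((50.0, 80.0), "cpu")]    # memory-bound kernels

print(nearest(train, (180.0, 20.0)))   # 'gpu'
```

In deployment, the feature extractor runs inside the compiler and the model's prediction replaces a hand-written heuristic; training happens offline on measured runtimes.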
GPU Array Access Auto-Tuning
GPUs have been used for years in compute-intensive applications. Their massively parallel processing capabilities can speed up calculations significantly. However, to leverage this speedup it is necessary to rethink and develop new algorithms that allow parallel processing. These algorithms are only one piece needed to achieve high performance. Nearly as important as suitable algorithms is the actual implementation and the usage of special hardware features such as intra-warp communication, shared memory, caches, and memory access patterns. Optimizing these factors is usually a time-consuming task that requires deep understanding of the algorithms and the underlying hardware. Unlike that of CPUs, the internal structure of GPUs has changed significantly over the years and will likely change even more. Therefore, it does not suffice to optimize the code once during development; it has to be re-optimized for each new GPU generation that is released. To efficiently (re-)optimize code towards the underlying hardware, auto-tuning tools have been developed that perform these optimizations automatically, taking this burden from the programmer.
In particular, NVIDIA -- the leading manufacturer for GPUs today -- applied significant changes to the memory hierarchy over the last four hardware generations. This makes the memory hierarchy an attractive objective for an auto-tuner.
In this thesis we introduce the MATOG auto-tuner, which automatically optimizes array access for NVIDIA CUDA applications. In order to achieve these optimizations, MATOG has to analyze the application to determine optimal parameter values. The analysis relies on empirical profiling combined with a prediction method and a data post-processing step. This allows MATOG to find nearly optimal parameter values in a minimal amount of time. Further, MATOG is able to automatically detect varying application workloads and can apply different optimization parameter settings at runtime.
To show MATOG's capabilities, we evaluated it on a variety of different applications, ranging from simple algorithms up to complex applications, on the last four hardware generations with a total of 14 GPUs. MATOG is able to achieve equal or even better performance than hand-optimized code. Further, it is able to provide performance portability across different GPU types (low-, mid-, high-end and HPC) and generations. In some cases it is able to exceed the performance of hand-crafted code that has been specifically optimized for the tested GPU by dynamically changing data layouts throughout the execution.
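One concrete layout decision that array-access auto-tuners of this kind explore is array-of-structs (AoS) versus struct-of-arrays (SoA): on GPUs, SoA lets consecutive threads read consecutive addresses (memory coalescing). The Python sketch below, with invented particle data, shows only the transformation itself, not the performance effect.

```python
def aos_to_soa(aos):
    """[(x0,y0,z0), (x1,y1,z1), ...] -> ([x...], [y...], [z...])"""
    return tuple(list(field) for field in zip(*aos))

particles_aos = [(0.0, 1.0, 2.0), (3.0, 4.0, 5.0), (6.0, 7.0, 8.0)]
xs, ys, zs = aos_to_soa(particles_aos)
print(xs)   # [0.0, 3.0, 6.0]
```

An auto-tuner profiles both layouts (and intermediates such as array-of-structs-of-arrays) per kernel and GPU generation, then picks the fastest, exactly the kind of per-hardware decision that is tedious to redo by hand.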