12 research outputs found
A Survey on Compiler Autotuning using Machine Learning
Since the mid-1990s, researchers have been trying to use machine-learning-based
approaches to solve a number of different compiler optimization problems.
These techniques primarily enhance the quality of the obtained results and,
more importantly, make it feasible to tackle two main compiler optimization
problems: optimization selection (choosing which optimizations to apply) and
phase-ordering (choosing the order of applying optimizations). The compiler
optimization space continues to grow due to the advancement of applications,
increasing number of compiler optimizations, and new target architectures.
Generic optimization passes in compilers cannot fully leverage newly introduced
optimizations and, therefore, cannot keep up with the pace of increasing
options. This survey summarizes and classifies the recent advances in using
machine learning for the compiler optimization field, particularly on the two
major problems of (1) selecting the best optimizations and (2) the
phase-ordering of optimizations. The survey highlights the approaches taken so
far, the results obtained, a fine-grained classification of the different
approaches and, finally, the influential papers in the field.
Comment: version 5.0 (updated September 2018). Preprint of the version accepted
at ACM CSUR 2018 (42 pages). History: Received November 2016; Revised August 2017;
Revised February 2018; Accepted March 2018.
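To make the two problems concrete, here is a minimal sketch (with hypothetical pass names, not taken from the survey) of how the two search spaces differ: optimization selection explores subsets of optimizations, while phase-ordering explores sequences of them.

```python
# Illustrative only: contrast the size of the two search spaces.
from itertools import combinations, permutations

passes = ["inline", "unroll", "vectorize", "licm"]  # hypothetical pass set

# Optimization selection: every subset of passes is a candidate (2^n configurations).
selection_space = [subset
                   for r in range(len(passes) + 1)
                   for subset in combinations(passes, r)]

# Phase-ordering: every ordering of the passes is a distinct candidate (n! orderings).
ordering_space = list(permutations(passes))

print(len(selection_space))  # 16 = 2^4 subsets
print(len(ordering_space))   # 24 = 4! orderings of the full set
```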
Identifying Compiler Options to Minimise Energy Consumption for Embedded Platforms
This paper presents an analysis of the energy consumption of an extensive
number of the optimisations a modern compiler can perform. Using GCC as a test
case, we evaluate a set of ten carefully selected benchmarks across five different
embedded platforms.
A fractional factorial design is used to systematically explore the large
optimisation space (2^82 possible combinations), whilst still accurately
determining the effects of optimisations and optimisation combinations.
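As a rough illustration of this screening idea, the sketch below estimates per-flag main effects from a balanced two-level sample. The flag list, the 64-run random design (a stand-in for a properly constructed fractional factorial), and the energy-measurement stub are illustrative assumptions, not the paper's actual design.

```python
# Rough sketch: estimate each flag's main effect on measured energy.
import random

flags = ["-ftree-vectorize", "-funroll-loops", "-finline-functions", "-fgcse"]

def measure_energy(setting):
    # Placeholder for a hardware power measurement of one benchmark run,
    # compiled with the given on/off flag setting.
    return random.uniform(1.0, 2.0)

# Balanced two-level design: each run turns each flag on (+1) or off (-1).
runs = [{f: random.choice([-1, +1]) for f in flags} for _ in range(64)]
energies = [measure_energy(r) for r in runs]

# Main effect of a flag: mean energy with the flag on minus mean with it off.
for f in flags:
    on = [e for r, e in zip(runs, energies) if r[f] == +1]
    off = [e for r, e in zip(runs, energies) if r[f] == -1]
    if on and off:
        effect = sum(on) / len(on) - sum(off) / len(off)
        print(f"{f}: estimated main effect on energy = {effect:+.3f}")
```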
Hardware power measurements on each platform are taken to ensure all
architectural effects on the energy consumption are captured.
We show that fractional factorial design can find better combinations
than relying on the built-in compiler settings. We explore the relationship between
run-time and energy consumption, and identify scenarios where they are and are
not correlated.
A further conclusion of this study is that the structure of the benchmark has a
larger effect than the hardware architecture on whether an optimisation will
be effective, and that no single optimisation is universally beneficial for
execution time or energy consumption.
Comment: 14 pages, 7 figures.
A methodology for speeding up loop kernels by exploiting the software information and the memory architecture
It is well known that today's compilers and state-of-the-art libraries have three major drawbacks. First, the compiler sub-problems are optimized separately; this is inefficient because optimizing the sub-problems separately gives a different schedule for each sub-problem, and these schedules cannot coexist, since refining one causes the degradation of another. Second, they take into account only part of the specific algorithm's information. Third, they take into account only a few hardware architecture parameters. These approaches cannot give an optimal solution.
In this paper, a new methodology/pre-compiler is introduced, which speeds up loop kernels by overcoming the above problems. This methodology solves four of the major scheduling sub-problems together as one problem and not separately; these are the sub-problems of finding the schedules with the minimum numbers of (i) L1 data cache accesses, (ii) L2 data cache accesses, (iii) main memory data accesses, (iv) addressing instructions. First, the exploration space (possible solutions) is found according to the algorithm's information, e.g. array subscripts. Then, the exploration space is decreased by orders of magnitude by applying constraint propagation to the software and hardware parameters.
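A minimal sketch of the constraint-propagation step, assuming a tiled matrix-multiply kernel, an L1 data cache size, and a simple working-set model (none of which come from the paper itself): hardware constraints discard infeasible schedules before any search.

```python
# Illustrative only: prune a tile-size exploration space with an L1 capacity constraint.
L1_BYTES = 32 * 1024      # assumed L1 data cache size
ELEM_BYTES = 8            # double-precision elements

def working_set_bytes(ti, tj, tk):
    # Tiles of A (ti x tk), B (tk x tj) and C (ti x tj) that must fit in L1
    # for the innermost loop nest of a tiled matrix multiply.
    return (ti * tk + tk * tj + ti * tj) * ELEM_BYTES

# Full exploration space: all tile-size triples in a given range.
candidates = [(ti, tj, tk)
              for ti in range(8, 257, 8)
              for tj in range(8, 257, 8)
              for tk in range(8, 257, 8)]

# Propagate the L1 capacity constraint to discard infeasible schedules early.
feasible = [c for c in candidates if working_set_bytes(*c) <= L1_BYTES]

print(f"{len(candidates)} candidate tilings, {len(feasible)} remain after the L1 constraint")
```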
We take the C-code and the memory architecture parameters as input and automatically produce a new, faster C-code; this code cannot be obtained by applying the existing compiler transformations to the original code. The proposed methodology has been evaluated for five well-known algorithms on both general-purpose and embedded processors; it is compared with the gcc and clang compilers and also with iterative compilation.
Mono-parametric tiling is a polyhedral transformation
Tiling is a crucial program transformation with many benefits: it improves locality, exposes parallelism, allows for adjusting the ops-to-bytes balance of codes, and can be applied at multiple levels. Allowing tile sizes to be symbolic parameters at compile time has many benefits, including efficient autotuning and run-time adaptability to system variations. For polyhedral programs, parametric tiling in its full generality is known to be non-linear, breaking the mathematical closure properties of the polyhedral model. Most compilation tools therefore either avoid it by only performing fixed-size tiling, or apply it only in the final code generation step. Both strategies have limitations. We first introduce mono-parametric partitioning, a restricted, parametric, tiling-like transformation which can be used to express a tiling. We show that, despite being parametric, it is a polyhedral transformation. We prove that applying mono-parametric partitioning (i) to a polyhedron yields a union of polyhedra, and (ii) to an affine function produces a piecewise-affine function. We then use these properties to show how to partition an entire polyhedral program, including one with reductions. Next, we generalize this transformation to tiles with arbitrary shapes that can tessellate the iteration space (e.g., hexagonal, trapezoidal, etc.). We show how mono-parametric tiling can be applied at multiple levels, and how it enables a wide range of polyhedral analyses and transformations to be applied.
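As a toy illustration of the index split behind mono-parametric tiling (the problem size N, size parameter b, and ratio k below are arbitrary assumptions, and the scalar loop stands in for the actual polyhedral machinery): each tile size is a fixed multiple of a single size parameter, so every index decomposes into an outer tile index and an inner offset.

```python
# Illustrative only: check that the mono-parametric index split covers the
# original iteration space exactly once.
N, b, k = 100, 8, 2          # problem size, size parameter, fixed ratio
tile = k * b                 # mono-parametric tile size

visited = []
for it in range((N + tile - 1) // tile):        # tile (outer) loop
    for ii in range(tile):                      # intra-tile (inner) loop
        i = tile * it + ii
        if i < N:                               # boundary tiles are partial
            visited.append(i)

assert visited == list(range(N))   # every original iteration visited once, in order
```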
Reducing the cost of heuristic generation with machine learning
The space of compile-time transformations and/or run-time options which can improve
the performance of a given code is usually so large as to be virtually impossible
to search in any practical time-frame. Thus, heuristics are leveraged which can suggest
good, but not necessarily the best, configurations. Unfortunately, since such heuristics are
tightly coupled to the processor architecture, performance is not portable; heuristics must
be tuned, traditionally manually, for each device in turn. This is extremely laborious
and the result is often outdated heuristics and less effective optimisation.
Ideally, to keep up with changes in hardware and run-time environments, a fast and
automated method of generating heuristics is needed. Recent works have shown that
machine learning can be used to produce mathematical models or rules in their place,
which is automated but not necessarily fast. This thesis proposes the use of active
machine learning, sequential analysis, and active feature acquisition to accelerate the
training process in an automatic way, thereby tackling this timely and substantive issue.
First, a demonstration of the efficiency of active learning over the previously standard
supervised machine learning technique is presented in the form of an ensemble
algorithm. This algorithm learns a model capable of predicting the best processing
device to use in a heterogeneous system, per workload size and per kernel. Active machine
learning is a methodology which is sensitive to the cost of training; specifically, it is
able to reduce the time taken to construct a model by predicting how much is expected
to be learnt from each new training instance and then only choosing to learn from those
most profitable examples. The exemplar heuristic is constructed on average 4x faster
than a baseline approach, whilst maintaining comparable quality.
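A minimal sketch of uncertainty-driven active learning for such a device-selection heuristic, assuming synthetic workload features, a placeholder profiling oracle, and a random-forest model rather than the thesis's actual ensemble algorithm:

```python
# Illustrative only: profile next whichever workload the current model is least sure about.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
pool = rng.uniform(size=(500, 4))            # unlabelled kernel/workload features

def profile_best_device(x):
    # Placeholder for running the workload on each device and timing it.
    return int(x[0] + x[1] > 1.0)            # 0 = CPU, 1 = GPU (synthetic rule)

labelled = list(rng.choice(len(pool), size=10, replace=False))
labels = {i: profile_best_device(pool[i]) for i in labelled}

model = RandomForestClassifier(n_estimators=50, random_state=0)
for _ in range(20):                          # active-learning rounds
    model.fit(pool[labelled], [labels[i] for i in labelled])
    confidence = model.predict_proba(pool).max(axis=1)
    confidence[labelled] = np.inf            # never re-query a measured point
    nxt = int(np.argmin(confidence))         # most uncertain unmeasured workload
    labels[nxt] = profile_best_device(pool[nxt])
    labelled.append(nxt)
```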
Next, a combination of active learning and sequential analysis is presented which
reduces both the number of samples per training example and the number of
training examples overall. This allows for the creation of models based on noisy information,
sacrificing accuracy per training instance for speed, without having a significant
effect on the quality of the final product. In particular, the runtime of high-performance
compute kernels is predicted from code transformations one may want to
apply, using a heuristic which was generated up to 26x faster than with active learning
alone.
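The sequential-analysis idea can be sketched as an early-stopping measurement loop; the noise model and stopping rule below are illustrative assumptions, not the thesis's statistical test.

```python
# Illustrative only: keep re-running a noisy measurement only until its mean
# is known tightly enough, instead of taking a fixed large number of samples.
import random
import statistics

def measure_runtime():
    # Placeholder for one noisy timing of a kernel under a given transformation.
    return random.gauss(mu=10.0, sigma=1.0)

def sequential_mean(rel_error=0.02, min_samples=3, max_samples=100):
    samples = []
    while len(samples) < max_samples:
        samples.append(measure_runtime())
        if len(samples) >= min_samples:
            mean = statistics.mean(samples)
            sem = statistics.stdev(samples) / len(samples) ** 0.5
            if sem / mean < rel_error:      # stop once the estimate is tight enough
                break
    return statistics.mean(samples), len(samples)

mean, n = sequential_mean()
print(f"estimated runtime {mean:.2f} after only {n} samples")
```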
Finally, preliminary work demonstrates that an automated system can be created
which optimises both the number of training examples and which features to
select during training, further substantially accelerating learning in cases where each
feature value that is revealed comes at some cost.
Machine learning based mapping of data and streaming parallelism to multi-cores
Multi-core processors are now ubiquitous and are widely seen as the most viable means
of delivering performance with increasing transistor densities. However, this potential
can only be realised if the application programs are suitably parallel. Applications
can either be written in parallel from scratch or converted from existing sequential
programs. Regardless of how applications are parallelised, the code must be efficiently
mapped onto the underlying platform to fully exploit the hardware’s potential.
This thesis addresses the problem of finding the best mappings of data and streaming
parallelism—two types of parallelism that exist in broad and important domains
such as scientific, signal processing and media applications. Despite significant
progress having been made over the past few decades, state-of-the-art mapping approaches
still largely rely upon hand-crafted, architecture-specific heuristics. Developing
a heuristic by hand, however, often requires months of development time. As multi-core
designs become increasingly diverse and complex, manually tuning a heuristic
for a wide range of architectures is no longer feasible. What are needed are innovative
techniques that can automatically scale with advances in multi-core technologies.
In this thesis two distinct areas of computer science, namely parallel compiler design
and machine learning, are brought together to develop new compiler-based mapping
techniques. Using machine learning, it is possible to automatically build high-quality
mapping schemes, which adapt to evolving architectures, with little human
involvement.
First, two techniques are proposed to find the best mapping of data parallelism.
The first technique predicts whether parallel execution of a data parallel candidate is
profitable on the underlying architecture. On a typical multi-core platform, it achieves
almost the same (and sometimes a better) level of performance when compared to the
manually parallelised code developed by independent experts. For a profitable candidate,
the second technique predicts how many threads should be used to execute
the candidate across different program inputs. The second technique achieves, on average,
over 96% of the maximum available performance on two different multi-core
platforms.
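A minimal sketch of this two-stage prediction; the feature names and the synthetic training data below are assumptions made for illustration, not the thesis's actual feature set or models.

```python
# Illustrative only: classify profitability, then pick a thread count for profitable candidates.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
# Features per parallel candidate: [log2(iterations), computation/memory ratio]
X = rng.uniform(low=[0, 0], high=[24, 8], size=(400, 2))

# Synthetic ground truth: small loops are not worth parallelising; larger,
# compute-heavy loops want more threads (capped at 8 cores).
profitable = (X[:, 0] > 10).astype(int)
threads = np.clip((X[:, 0] - 8) // 4 + X[:, 1] // 4, 1, 8).astype(int)

profit_model = RandomForestClassifier(random_state=0).fit(X, profitable)
thread_model = RandomForestClassifier(random_state=0).fit(X[profitable == 1],
                                                          threads[profitable == 1])

candidate = np.array([[18.0, 5.0]])          # a new, unseen loop's features
if profit_model.predict(candidate)[0] == 1:
    print("parallelise with", thread_model.predict(candidate)[0], "threads")
else:
    print("run sequentially")
```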
Next, a new approach is developed for partitioning stream applications. This approach
predicts the ideal partitioning structure for a given stream application. Based
on the prediction, a compiler can rapidly search the program space (without executing
any code) to generate a good partition. It achieves, on average, a 1.90x speedup over
the already tuned partitioning scheme of a state-of-the-art streaming compiler.
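A rough sketch of prediction-guided partition search, with made-up structural features (pipeline depth, parallel width) and a synthetic training set standing in for the thesis's stream-graph representation:

```python
# Illustrative only: predict an "ideal" partition structure, then rank concrete
# candidate partitions by closeness to it, without executing any code.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
# Program features of already-tuned stream benchmarks -> their best-known
# partition structure, summarised here as (pipeline depth, parallel width).
train_X = rng.uniform(size=(200, 3))
train_y = np.column_stack([1 + 7 * train_X[:, 0], 1 + 15 * train_X[:, 1]])

model = RandomForestRegressor(random_state=0).fit(train_X, train_y)

new_program = rng.uniform(size=(1, 3))
ideal = model.predict(new_program)[0]        # predicted (depth, width) target

# Only the candidates closest to the predicted structure need to be refined further.
candidates = [(d, w) for d in range(1, 9) for w in range(1, 17)]
best = min(candidates, key=lambda c: np.hypot(c[0] - ideal[0], c[1] - ideal[1]))
print("search around partition structure", best)
```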
Enabling aggressive compiler optimization for the mobile environment
Aggressive code optimization in the mobile environment is a difficult endeavor. Billions of users rely on mobile devices for their daily computing tasks. Yet, they mostly run poorly optimized code, under-utilizing their already limited processing and energy resources. Existing optimization approaches, like iterative compilation, perform well in other domains but fall short in the mobile environment. They either rely on representative inputs that are hard to reconstruct, or expose users to slowdowns and crashes.
An ideal solution must be able to perform an optimization search by repeatedly evaluating different optimization decisions on the same input. That input should be representative of actual user usage without requiring developers to artificially create it. Finally, users should never be exposed to slow or crashing evaluations, a quite common side-effect of iterative compilation. This thesis presents a novel approach with all of the above in mind, bringing aggressive code optimization to the mobile environment.
With a transparent capture mechanism, real user inputs can be stored. This mechanism is invoked infrequently and remains unnoticeable to the users. A single capture is enough to enable offline, input-driven code optimization. It supports C functions as well as code regions of interactive Android applications. It allows controlling the timing and frequency of captures, bails out on imminent high-impact runtime events, and excludes some immutable data from captures.
A replay-based evaluation mechanism is able to repeatedly restore a captured input while changing the underlying code. For C programs, it employs compile- and link-time strategies to work consistently despite code transformations. For Android apps, a novel mechanism was developed, able to replay using different code types: the original Android-compiled code, interpretation, and LLVM-generated code. Additionally, it works well even in the presence of memory-shuffling security mechanisms.
Capture and replay is fused into an iterative compilation system that uses offline, replay-based evaluations. Initially, real inputs are captured online, without noticeably affecting the users. For C and interactive apps, captures required on average 2ms and 15ms respectively. Then, an optimization search is performed by repeatedly replaying the inputs using different code transformations. As this happens offline, any crashing or erroneous executions do not affect the users. C programs became 29% faster using a random search, while interactive apps became 44% faster using a genetic algorithm and a novel Android backend based on LLVM. Finally, with crowd-sourcing, the offline evaluation effort was significantly accelerated. Specifically, for the user with the highest workload the search was accelerated by 7 times.
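A minimal sketch of such an offline, replay-based search loop; the file names, the replay command line, and the random-search strategy are illustrative assumptions rather than the thesis's actual toolchain.

```python
# Illustrative only: evaluate candidate flag sets by replaying the same captured
# input offline, so crashes and slowdowns never reach the user.
import random
import subprocess
import time

FLAG_POOL = ["-O3", "-funroll-loops", "-fvectorize", "-finline-functions"]

def compile_with(flags):
    # Placeholder: rebuild the captured code region with the candidate flags.
    subprocess.run(["clang", "-o", "region", "region.c", *flags], check=True)

def replay_captured_input():
    # Placeholder: restore the captured input and time one execution of the region.
    start = time.perf_counter()
    subprocess.run(["./region", "--replay", "capture.bin"], check=True, timeout=60)
    return time.perf_counter() - start

best_flags, best_time = [], float("inf")
for _ in range(50):                               # offline random search
    flags = random.sample(FLAG_POOL, k=random.randint(1, len(FLAG_POOL)))
    try:
        compile_with(flags)
        t = replay_captured_input()
    except Exception:
        continue                                  # crashing or erroneous replays are simply discarded
    if t < best_time:
        best_flags, best_time = flags, t

print("best flags found offline:", best_flags)
```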