Removing and restoring control flow with the Value State Dependence Graph
This thesis studies the practicality of compiling with only data flow information.
Specifically, we focus on the challenges that arise when using the Value
State Dependence Graph (VSDG) as an intermediate representation (IR).
We perform a detailed survey of IRs in the literature in order to discover
trends over time, and we classify them by their features in a taxonomy. We
see how the VSDG fits into the IR landscape, and look at the divide between
academia and the 'real world' in terms of compiler technology. Since most
data flow IRs cannot be constructed for irreducible programs, we perform an
empirical study of irreducibility in current versions of open source software,
and then compare them with older versions of the same software. We also
study machine-generated C code from a variety of different software tools.
We show that irreducibility is no longer a problem, and is becoming less so
with time. We then address the problem of constructing the VSDG. Since
previous approaches in the literature have been poorly documented or ignored
altogether, we give our approach to constructing the VSDG from a common
IR: the Control Flow Graph. We show how our approach is independent of
the source and target language, how it is able to handle unstructured control
flow, and how it is able to transform irreducible programs on the fly. Once the
VSDG is constructed, we implement Lawrence's proceduralisation algorithm
in order to encode an evaluation strategy whilst translating the program into
a parallel representation: the Program Dependence Graph. From here, we
implement scheduling and then code generation using the LLVM compiler.
We compare our compiler framework against several existing compilers, and
show how removing control flow with the VSDG and then restoring it later
can produce high quality code. We also examine specific situations where the
VSDG can put pressure on existing code generators. Our results show that the
VSDG represents a radically different, yet practical, approach to compilation.
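The control-flow-free style of representation described above can be pictured with a minimal sketch: nodes carry operations, edges carry value dependences, and any topological order of the graph is a valid evaluation order. The class and node names below are illustrative only, not the thesis's actual data structures.

```python
# Minimal sketch of a value-dependence graph in the spirit of the VSDG:
# nodes carry operations, edges carry value dependences, and there is no
# explicit control flow -- any topological order is a valid schedule.
# (Illustrative only; not the thesis's actual data structures.)

class Node:
    def __init__(self, op, *deps):
        self.op, self.deps = op, deps

    def eval(self, env, cache):
        # Demand-driven evaluation: visit dependences, memoise shared nodes.
        if self not in cache:
            args = [d.eval(env, cache) for d in self.deps]
            cache[self] = env[self.op] if not args else self.op(*args)
        return cache[self]

# Graph for (a + b) * (a - b); 'a' and 'b' are leaf (parameter) nodes
# shared by both arithmetic nodes, so sharing is explicit in the graph.
a, b = Node("a"), Node("b")
add = Node(lambda x, y: x + y, a, b)
sub = Node(lambda x, y: x - y, a, b)
mul = Node(lambda x, y: x * y, add, sub)

print(mul.eval({"a": 5, "b": 3}, {}))  # 8 * 2 = 16
```

Because the graph records only data dependences, a scheduler is free to emit `add` and `sub` in either order, which is exactly the freedom proceduralisation must later resolve into an evaluation strategy.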
Compilation Techniques for High-Performance Embedded Systems with Multiple Processors
Institute for Computing Systems Architecture
Despite the progress made in developing more advanced compilers for embedded systems,
programming of embedded high-performance computing systems based on Digital
Signal Processors (DSPs) is still a highly skilled manual task. This is true for
single-processor systems, and even more for embedded systems based on multiple
DSPs. Compilers often fail to optimise existing DSP codes written in C due to the
employed programming style. Parallelisation is hampered by the complex multiple address
space memory architecture, which can be found in most commercial multi-DSP
configurations.
This thesis develops an integrated optimisation and parallelisation strategy that can
deal with low-level C codes and produces optimised parallel code for a homogeneous
multi-DSP architecture with distributed physical memory and multiple logical address
spaces. In a first step, low-level programming idioms are identified and recovered. This
enables the application of high-level code and data transformations well-known in the
field of scientific computing. Iterative, feedback-driven search for “good” transformation
sequences is investigated. A novel approach to parallelisation based on a
unified data and loop transformation framework is presented and evaluated. Performance
optimisation is achieved through exploitation of data locality on the one hand,
and utilisation of DSP-specific architectural features such as Direct Memory Access
(DMA) transfers on the other hand.
The proposed methodology is evaluated against two benchmark suites (DSPstone
& UTDSP) and four different high-performance DSPs, one of which is part of a commercial
four-processor multi-DSP board also used for evaluation. Experiments confirm
the effectiveness of the program recovery techniques as enablers of high-level transformations
and automatic parallelisation. Source-to-source transformations of DSP
codes yield an average speedup of 2.21 across four different DSP architectures. The
parallelisation scheme is – in conjunction with a set of locality optimisations – able to
produce linear and even super-linear speedups on a number of relevant DSP kernels
and applications.
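The idiom-recovery step described above can be caricatured as a source-to-source rewrite that turns a low-level C pointer-walk loop into an explicitly array-indexed loop, the form that dependence analysis and high-level loop transformations can handle. The regex-based rewrite below is a deliberately naive, hypothetical illustration of the idea; a real implementation works on an AST or IR, not on text.

```python
# Naive illustration of programming-idiom recovery: rewrite a C pointer-walk
# loop into an explicit array-indexed loop. Once accesses are expressed as
# array subscripts of the induction variable, standard dependence tests and
# loop transformations (tiling, parallelisation) become applicable.
# (A real implementation works on an AST/IR, not on raw text.)
import re

low_level = "for (i = 0; i < n; i++) { *p++ = *q++ + *r++; }"

def recover(src: str) -> str:
    # Replace each '*ptr++' with 'ptr[i]', assuming 'i' is the loop's
    # induction variable and the pointers advance in lock-step with it.
    return re.sub(r"\*(\w+)\+\+", r"\1[i]", src)

print(recover(low_level))
# for (i = 0; i < n; i++) { p[i] = q[i] + r[i]; }
```

The recovered form makes it obvious that the loop is a vectorisable element-wise addition, which is precisely what the thesis's high-level transformation and parallelisation stages rely on.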
Profile Guided Dataflow Transformation for FPGAs and CPUs
This paper proposes a new high-level approach for optimising field programmable gate array (FPGA) designs. FPGA designs are commonly implemented in low-level hardware description languages (HDLs), which lack the abstractions necessary for identifying opportunities for significant performance improvements. Using a computer vision case study, we show that modelling computation with dataflow abstractions enables substantial restructuring of FPGA designs before lowering to the HDL level, and can also improve CPU performance. Using the CPU transformations, runtime is reduced by 43%. Using the FPGA transformations, clock frequency is increased from 67MHz to 110MHz. Our results outperform commercial low-level HDL optimisations, showcasing dataflow program abstraction as an amenable computation model for highly effective FPGA optimisation.
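The restructuring opportunity the abstract refers to can be sketched with a toy streaming pipeline: once stages are expressed as dataflow operators, they can be fused into a single stage (removing intermediate buffers) or repartitioned across a hardware/software boundary before any lowering to HDL. The stage names below are invented for illustration and are not the paper's toolchain.

```python
# Toy sketch of the dataflow view of a vision pipeline: stages as streaming
# operators over pixels. Expressed this way, stages can be fused (removing
# intermediate buffers) or split across CPU/FPGA boundaries before lowering.
# (Illustrative only; stage names are hypothetical.)

def threshold(pixels, t=128):
    for p in pixels:
        yield 255 if p >= t else 0

def invert(pixels):
    for p in pixels:
        yield 255 - p

image = [10, 200, 128, 90]

# Unfused pipeline: two separate stages composed together.
unfused = list(invert(threshold(image)))

# Fused pipeline: one stage performing both operations per element,
# the kind of restructuring a dataflow-level optimiser can apply safely.
def fused(pixels, t=128):
    for p in pixels:
        yield 255 - (255 if p >= t else 0)

assert list(fused(image)) == unfused
print(unfused)  # [255, 0, 0, 255]
```

Fusion is trivially legal here because the dataflow abstraction guarantees each stage is a pure element-wise function; an HDL-level optimiser sees only registers and wires and cannot recover that guarantee.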
Approaches to the determination of parallelism in computer programs
Indexed dependence metadata and its applications in software performance optimisation
To achieve continued performance improvements, modern microprocessor design is tending to concentrate
an increasing proportion of hardware on computation units with less automatic management
of data movement and extraction of parallelism. As a result, architectures increasingly include multiple
computation cores and complicated, software-managed memory hierarchies. Compilers have
difficulty characterising the behaviour of a kernel in a general enough manner to enable automatic
generation of efficient code in any but the most straightforward of cases.
We propose the concept of indexed dependence metadata to improve application development and
mapping onto such architectures. The metadata represent both the iteration space of a kernel and the
mapping of that iteration space from a given index to the set of data elements that iteration might
use: thus the dependence metadata is indexed by the kernel’s iteration space. This explicit mapping
allows the compiler or runtime to optimise the program more efficiently, and improves the program
structure for the developer. We argue that this form of explicit interface specification reduces the need
for premature, architecture-specific optimisation. It improves program portability, supports intercomponent
optimisation and enables generation of efficient data movement code.
We offer the following contributions: an introduction to the concept of indexed dependence metadata
as a generalisation of stream programming, a demonstration of its advantages in a component
programming system, the decoupled access/execute model for C++ programs, and how indexed dependence
metadata might be used to improve the programming model for GPU-based designs. Our
experimental results with prototype implementations show that indexed dependence metadata supports
automatic synthesis of double-buffered data movement for the Cell processor and enables aggressive
loop fusion optimisations in image processing, linear algebra and multigrid application case
studies.
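The core idea, a map from a kernel's iteration index to the set of data elements that iteration may touch, can be sketched directly, along with the double-buffered data movement it enables: while iteration i computes from a staged buffer, the runtime fetches iteration i+1's working set. The stencil kernel and function names below are hypothetical illustrations, not the thesis's interface.

```python
# Sketch of indexed dependence metadata: a function from a kernel's
# iteration index to the set of data elements that iteration may use.
# A runtime can consult it to stage each iteration's working set ahead of
# time -- the basis of double-buffered data movement on machines like Cell.
# (Names and kernel are illustrative, not the thesis's actual interface.)

def stencil_deps(i, n):
    """Dependence metadata for a 3-point stencil over n elements."""
    return {j for j in (i - 1, i, i + 1) if 0 <= j < n}

def run_kernel(data):
    n = len(data)
    out = [0] * n
    # Stage iteration 0's working set up front.
    buf = {j: data[j] for j in stencil_deps(0, n)}
    for i in range(n):
        # "Prefetch" the next iteration's working set (the second buffer)
        # while the current iteration computes from the staged one.
        nxt = ({j: data[j] for j in stencil_deps(i + 1, n)}
               if i + 1 < n else {})
        out[i] = sum(buf.values())  # compute only from staged elements
        buf = nxt                   # swap buffers
    return out

print(run_kernel([1, 2, 3, 4]))  # [3, 6, 9, 7]
```

Because the metadata is indexed by the iteration space, the staging code above is generated mechanically from `stencil_deps` alone, without the compiler having to re-derive the kernel's access pattern.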
Compilers that learn to optimise: a probabilistic machine learning approach
Compiler optimisation is the process of making a compiler produce better code, i.e. code that,
for example, runs faster on a target architecture. Although numerous program transformations
for optimisation have been proposed in the literature, these transformations are not always beneficial and they can interact in very complex ways. Traditional approaches adopted by compiler
writers fix the order of the transformations and decide when and how these transformations
should be applied to a program by using hard-coded heuristics. However, these heuristics require a lot of time and effort to construct and may sacrifice performance on programs they have
not been tuned for.
This thesis proposes a probabilistic machine learning solution to the compiler optimisation problem that automatically determines "good" optimisation strategies for programs. This
approach uses predictive modelling in order to search the space of compiler transformations.
Unlike most previous work that learns when/how to apply a single transformation in isolation or
a fixed-order set of transformations, the techniques proposed in this thesis are capable of tackling the general problem of predicting "good" sequences of compiler transformations. This is
achieved by exploiting transference across programs with two different techniques: Predictive
Search Distributions (PSD) and multi-task Gaussian process prediction (multi-task GP). While
the former directly addresses the problem of predicting "good" transformation sequences, the
latter learns regression models (or proxies) of the performance of the programs in order to
rapidly scan the space of transformation sequences.
Both methods, PSD and multi-task GP, are formulated as general machine learning techniques. In particular, the PSD method is proposed in order to speed up search in combinatorial
optimisation problems by learning a distribution over good solutions on a set of problem instances and using that distribution to search the optimisation space of a problem that has not
been seen before. Likewise, multi-task GP is proposed as a general method for multi-task learning that directly models the correlation between several machine learning tasks, exploiting the
shared information across the tasks.
Additionally, this thesis presents an extension to the well-known analysis of variance
(ANOVA) methodology in order to deal with sequence data. This extension is used to address the problem of optimisation space characterisation by identifying and quantifying the
main effects of program transformations and their interactions.
Finally, the machine learning methods proposed are successfully applied to a data set that
has been generated as a result of the application of source-to-source transformations to 12 C
programs from the UTDSP benchmark suite.
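The "distribution over good solutions" idea behind PSD can be caricatured in a few lines: collect the best transformation sequences found on training programs, fit a per-position distribution over transformations, and sample from it to bias the search on an unseen program. The transformation names and training data below are invented for illustration; the thesis's actual method conditions the distribution on program features.

```python
# Caricature of a Predictive Search Distribution (PSD): learn a per-position
# distribution over transformations from the best sequences seen on training
# programs, then bias the search for a new program by sampling from it.
# (Hypothetical data; the real method conditions on program features.)
import random
from collections import Counter

# Best-found sequences on previously seen programs (invented examples).
good_seqs = [
    ["tile", "unroll", "vectorise"],
    ["tile", "fuse", "vectorise"],
    ["tile", "unroll", "fuse"],
]

# Empirical distribution over transformations at each sequence position.
dists = [Counter(seq[pos] for seq in good_seqs) for pos in range(3)]

def sample_sequence(rng):
    # Draw one transformation per position, weighted by observed frequency.
    return [rng.choices(list(d), weights=list(d.values()))[0] for d in dists]

rng = random.Random(0)
candidates = [sample_sequence(rng) for _ in range(5)]
for c in candidates:
    print(c)  # every candidate starts with "tile": it dominated position 0
```

Sampling candidate sequences from this distribution, rather than uniformly from the full combinatorial space, is what lets the search concentrate on regions that were profitable on related programs.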