Compilation Techniques for High-Performance Embedded Systems with Multiple Processors
Institute for Computing Systems Architecture
Despite the progress made in developing more advanced compilers for embedded systems,
programming of embedded high-performance computing systems based on Digital
Signal Processors (DSPs) is still a highly skilled manual task. This is true for
single-processor systems, and even more for embedded systems based on multiple
DSPs. Compilers often fail to optimise existing DSP codes written in C due to the
employed programming style. Parallelisation is hampered by the complex multiple address
space memory architecture, which can be found in most commercial multi-DSP
configurations.
This thesis develops an integrated optimisation and parallelisation strategy that can
deal with low-level C codes and produces optimised parallel code for a homogeneous
multi-DSP architecture with distributed physical memory and multiple logical address
spaces. In a first step, low-level programming idioms are identified and recovered. This
enables the application of high-level code and data transformations well-known in the
field of scientific computing. Iterative feedback-driven search for “good” transformation
sequences is being investigated. A novel approach to parallelisation based on a
unified data and loop transformation framework is presented and evaluated. Performance
optimisation is achieved through exploitation of data locality on the one hand,
and utilisation of DSP-specific architectural features such as Direct Memory Access
(DMA) transfers on the other hand.
The proposed methodology is evaluated against two benchmark suites (DSPstone
& UTDSP) and four different high-performance DSPs, one of which is part of a commercial
four processor multi-DSP board also used for evaluation. Experiments confirm
the effectiveness of the program recovery techniques as enablers of high-level transformations
and automatic parallelisation. Source-to-source transformations of DSP
codes yield an average speedup of 2.21 across four different DSP architectures. The
parallelisation scheme is – in conjunction with a set of locality optimisations – able to
produce linear and even super-linear speedups on a number of relevant DSP kernels
and applications.
Profile-driven parallelisation of sequential programs
Traditional parallelism detection in compilers is performed by means of static analysis
and more specifically data and control dependence analysis. The information that
is available at compile time, however, is inherently limited and therefore restricts the
parallelisation opportunities. Furthermore, applications written in C – which represent
the majority of today’s scientific, embedded and system software – utilise many low-level
features and an intricate programming style that forces the compiler to make even more
conservative assumptions. Despite the numerous proposals to handle this uncertainty
at compile time using speculative optimisation and parallelisation, the software industry
still lacks any pragmatic approach that extracts coarse-grain parallelism to exploit
the multiple processing units of modern commodity hardware.
This thesis introduces a novel approach for extracting and exploiting multiple forms
of coarse-grain parallelism from sequential applications written in C. We utilise profiling
information to overcome the limitations of static data and control-flow analysis,
enabling more aggressive parallelisation. Profiling is performed using an instrumentation
scheme operating at the Intermediate Representation (IR) level of the compiler.
In contrast to existing approaches that depend on low-level binary tools and debugging
information, IR-profiling provides precise and direct correlation of profiling information
back to the IR structures of the compiler. Additionally, our approach is orthogonal to
existing automatic parallelisation approaches and additional fine-grain parallelism may
be exploited.
We demonstrate the applicability and versatility of the proposed methodology using
two studies that target different forms of parallelism. First, we focus on the exploitation
of loop-level parallelism that is abundant in many scientific and embedded
applications. We evaluate our parallelisation strategy against the NAS and SPEC FP
benchmarks and two different multi-core platforms (a shared-memory Intel Xeon SMP
and a heterogeneous distributed-memory IBM Cell blade). Empirical evaluation shows
that our approach not only yields significant improvements when compared with
state-of-the-art parallelising compilers, but comes close to and sometimes exceeds the performance
of manually parallelised codes. On average, our methodology achieves 96%
of the performance of the hand-tuned parallel benchmarks on the Intel Xeon platform,
and a significant speedup for the Cell platform. The second study addresses
the problem of partially sequential loops, typically found in implementations of multimedia
codecs. We develop a more powerful whole-program representation based on the
Program Dependence Graph (PDG) that supports profiling, partitioning and code generation
for pipeline parallelism. In addition, we demonstrate how this enhances
conventional pipeline parallelisation by incorporating support for multi-level loops and
pipeline stage replication in a uniform and automatic way. Experimental results using a
set of complex multimedia and stream processing benchmarks confirm the effectiveness
of the proposed methodology, which yields speedups of up to 4.7 on an eight-core Intel Xeon
machine.
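The idea of using profiling to find parallel loops can be illustrated with a minimal sketch. It is not the thesis's IR-level instrumentation: the `AccessProfiler` class, the `profile_loop` helper and the two loop bodies are hypothetical names introduced here, and the "instrumentation" is written by hand. The sketch records, for each memory location, the iteration that last wrote it, and reports a loop-carried flow dependence whenever a later iteration reads that value. Anti and output dependences are deliberately ignored in this simplified version.

```python
class AccessProfiler:
    """Tracks cross-iteration read-after-write dependences seen at run time."""

    def __init__(self):
        self.last_write = {}   # location -> iteration that last wrote it
        self.carried = set()   # (location, writer_iteration, reader_iteration)

    def read(self, loc, iteration):
        writer = self.last_write.get(loc)
        if writer is not None and writer != iteration:
            # The value was produced by an earlier iteration of the same loop.
            self.carried.add((loc, writer, iteration))

    def write(self, loc, iteration):
        self.last_write[loc] = iteration


def profile_loop(body, n):
    """Run `body` for iterations 0..n-1 under a fresh profiler."""
    prof = AccessProfiler()
    for i in range(n):
        body(i, prof)
    return prof


def scale_body(a):
    """a[i] = 2 * a[i]: every iteration touches only its own element."""
    def body(i, prof):
        prof.read(("a", i), i)
        a[i] = a[i] * 2
        prof.write(("a", i), i)
    return body


def prefix_sum_body(a):
    """a[i] = a[i] + a[i-1]: each iteration consumes the previous result."""
    def body(i, prof):
        if i > 0:
            prof.read(("a", i - 1), i)
            prof.read(("a", i), i)
            a[i] = a[i] + a[i - 1]
            prof.write(("a", i), i)
    return body
```

A loop whose profile shows no carried dependences (such as `scale_body`) is a candidate for parallel execution, whereas `prefix_sum_body` is not. A profile only reflects the inputs that were actually run, so an empty `carried` set is evidence rather than proof of independence.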
An FPGA implementation of an investigative many-core processor, Fynbos: in support of a Fortran autoparallelising software pipeline
Includes bibliographical references. In light of the power, memory, ILP, and utilisation walls facing the computing industry, this work examines the hypothetical many-core approach to finding greater compute performance and efficiency. In order to achieve greater efficiency in an environment in which Moore’s law continues but TDP has been capped, a means of deriving performance from dark and dim silicon is needed. The many-core hypothesis is one approach to exploiting these available transistors efficiently. As understood in this work, it involves trading in hardware control complexity for hundreds to thousands of parallel simple processing elements, and operating at a clock speed sufficiently low as to allow the efficiency gains of near-threshold voltage operation. Performance is therefore dependent on exploiting a new degree of fine-grained parallelism such as is currently only found in GPGPUs, but in a manner that is not as restrictive in application domain range. While removing the complex control hardware of traditional CPUs provides space for more arithmetic hardware, a basic level of control is still required. For a number of reasons this work chooses to replace this control largely with static scheduling. This pushes the burden of control primarily to the software and specifically the compiler, rather than to the programmer or to an application-specific means of control simplification. A legacy tool chain capable of autoparallelising sequential Fortran code to the degree of parallelism necessary for many-core already exists. This work implements a many-core architecture to match it. By prototyping the design on an FPGA, it is possible to examine the real-world performance of the compiler-architecture system to a greater degree than simulation alone would allow.
Comparing theoretical peak performance and real performance in a case study application, the system is found to be more efficient than any other reviewed, but also to underperform significantly relative to current competing architectures. This failing is attributed to taking the need for simple hardware too far, and to an inability to implement static-scheduling mitigation tactics because the compiler lacks support for them.
A metadata-enhanced framework for high performance visual effects
This thesis is devoted to reducing the interactive latency of image processing computations in
visual effects. Film and television graphic artists depend upon low-latency feedback to receive
a visual response to changes in effect parameters. We tackle latency with a domain-specific optimising
compiler which leverages high-level program metadata to guide key computational and
memory hierarchy optimisations. This metadata encodes static and dynamic information about
data dependence and patterns of memory access in the algorithms constituting a visual effect –
features that are typically difficult to extract through program analysis – and presents it to the
compiler in an explicit form. By using domain-specific information as a substitute for program
analysis, our compiler is able to target a set of complex source-level optimisations that a vendor
compiler does not attempt, before passing the optimised source to the vendor compiler for
lower-level optimisation.
Three key metadata-supported optimisations are presented. The first is an adaptation of
space and schedule optimisation – based upon well-known compositions of the loop fusion and
array contraction transformations – to the dynamic working sets and schedules of a runtime-parameterised
visual effect. This adaptation sidesteps the costly solution of runtime code generation
by specialising static parameters in an offline process and exploiting dynamic metadata to
adapt the schedule and contracted working sets at runtime to user-tunable parameters. The second
optimisation comprises a set of transformations to generate SIMD ISA-augmented source code.
Our approach differs from autovectorisation by using static metadata to identify parallelism, in
place of data dependence analysis, and runtime metadata to tune the data layout to user-tunable
parameters for optimal aligned memory access. The third optimisation comprises a related set
of transformations to generate code for SIMT architectures, such as GPUs. Static dependence
metadata is exploited to guide large-scale parallelisation for tens of thousands of in-flight threads.
Optimal use of the alignment-sensitive, explicitly managed memory hierarchy is achieved by identifying
inter-thread and intra-core data sharing opportunities in memory access metadata.
A detailed performance analysis of these optimisations is presented for two industrially developed
visual effects. In our evaluation we demonstrate up to 8.1x speed-ups on Intel and AMD
multicore CPUs and up to 6.6x speed-ups on NVIDIA GPUs over our best hand-written implementations
of these two effects. Programmability is enhanced by automating the generation of
SIMD and SIMT implementations from a single programmer-managed scalar representation.
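The loop fusion and array contraction transformations mentioned in this abstract can be shown on a toy two-stage image pipeline. The sketch below is an illustration only, under assumed names: the `brighten_then_blur_*` functions, the one-dimensional "image" and the three-tap blur are all hypothetical and are not taken from the thesis. The unfused version materialises a full-size intermediate array; the fused version computes both stages in one loop and contracts the intermediate to a three-value rolling window.

```python
def brighten_then_blur_unfused(src, gain):
    """Two separate loops with a full-size temporary between them."""
    n = len(src)                      # assumes n >= 3
    tmp = [0.0] * n                   # intermediate working set: n values
    for i in range(n):
        tmp[i] = src[i] * gain
    out = [0.0] * (n - 2)
    for i in range(1, n - 1):
        out[i - 1] = (tmp[i - 1] + tmp[i] + tmp[i + 1]) / 3.0
    return out


def brighten_then_blur_fused(src, gain):
    """One fused loop; the temporary is contracted to three scalars."""
    n = len(src)                      # assumes n >= 3
    out = [0.0] * (n - 2)
    w0 = src[0] * gain                # rolling window replaces tmp[i-1], tmp[i]
    w1 = src[1] * gain
    for i in range(1, n - 1):
        w2 = src[i + 1] * gain        # produce the next brightened value on demand
        out[i - 1] = (w0 + w1 + w2) / 3.0
        w0, w1 = w1, w2               # slide the window by one element
    return out
```

Both functions perform the same floating-point operations in the same order, so they return identical results; the fused form simply keeps the intermediate values in a working set of constant size rather than one proportional to the image.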
A data dependency recovery system for a heterogeneous multicore processor
Multicore processors often increase the performance of applications. However, with their deeper pipelining, they have proven increasingly difficult to improve. In an attempt to deliver enhanced performance at lower power requirements, semiconductor microprocessor manufacturers have progressively utilised chip-multicore processors. Existing research has utilised a very common technique known as thread-level speculation. This technique attempts to compute results before the actual result is known. However, thread-level speculation impacts operation latency and circuit timing, and confounds data cache behaviour and code generation in the compiler. We describe a software framework codenamed Lyuba that handles low-level data hazards and automatically recovers the application from data hazards without programmer or speculation intervention for an asymmetric chip-multicore processor. The problem of determining correct execution of multiple threads when data hazards occur on conventional symmetrical chip-multicore processors is a significant and on-going challenge. However, there has been very little focus on the use of asymmetrical (heterogeneous) processors with applications that have complex data dependencies. The purpose of this thesis is to: (i) define the development of a software framework for an asymmetric (heterogeneous) chip-multicore processor; (ii) present an optimal software control of hardware for distributed processing and recovery from violations; (iii) provide performance results of five applications using three datasets. Applications with a small dataset showed an improvement of 17% and a larger dataset showed an improvement of 16%, giving an overall 11% improvement in performance.
The exploitation of parallelism on shared memory multiprocessors
PhD Thesis
With the arrival of many general purpose shared memory multiple processor
(multiprocessor) computers into the commercial arena during the mid-1980s, a
rift has opened between the raw processing power offered by the emerging
hardware and the relative inability of its operating software to effectively deliver
this power to potential users. This rift stems from the fact that, currently, no
computational model with the capability to elegantly express parallel activity is
mature enough to be universally accepted, and used as the basis for programming
languages to exploit the parallelism that multiprocessors offer. To add to this,
there is a lack of software tools to assist programmers in the processes of designing
and debugging parallel programs.
Although much research has been done in the field of programming languages,
no undisputed candidate for the most appropriate language for programming
shared memory multiprocessors has yet been found. This thesis examines why this
state of affairs has arisen and proposes programming language constructs,
together with a programming methodology and environment, to close the ever
widening hardware to software gap.
The novel programming constructs described in this thesis are intended for use
in imperative languages even though they make use of the synchronisation
inherent in the dataflow model by using the semantics of single assignment when
operating on shared data, so giving rise to the term shared values. As there are
several distinct parallel programming paradigms, matching flavours of shared
value are developed to permit the concise expression of these paradigms.
The Science and Engineering Research Council
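The abstract does not give the syntax of the proposed constructs, so the following is only a minimal sketch of the underlying idea: a shared datum with single-assignment semantics, where a reader is implicitly synchronised with the writer as in the dataflow model. The `SharedValue` class and its `assign` and `read` methods are hypothetical names chosen for this illustration.

```python
import threading


class SharedValue:
    """A write-once cell: readers block until the single assignment happens."""

    def __init__(self):
        self._ready = threading.Event()
        self._lock = threading.Lock()
        self._value = None

    def assign(self, value):
        # Single assignment: a second write is an error, not an update.
        with self._lock:
            if self._ready.is_set():
                raise RuntimeError("shared value already assigned")
            self._value = value
            self._ready.set()

    def read(self, timeout=None):
        # Dataflow-style synchronisation: wait until the value exists.
        if not self._ready.wait(timeout):
            raise TimeoutError("shared value was never assigned")
        return self._value


def run_demo():
    """A consumer thread started before the producer has written anything."""
    x = SharedValue()
    results = []
    consumer = threading.Thread(
        target=lambda: results.append(x.read(timeout=5) + 1)
    )
    consumer.start()
    x.assign(41)          # the consumer is released by this assignment
    consumer.join()
    return results
```

Because a value can be assigned only once, a reader never observes a partially updated or stale value, and no explicit lock or signal appears in the code that uses the cell.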
A parallel transformations framework for cluster environments.
In recent years program transformation technology has matured into a practical solution for many software reengineering and migration tasks.
FermaT, an industrial-strength program transformation system, has demonstrated that legacy systems can be successfully transformed into efficient and maintainable structured C or COBOL code. Its core, a transformation engine, is based on mathematically proven program transformations and ensures that transformed programs are semantically equivalent to their originals. The engine uses a Wide Spectrum Language (WSL), with low-level as well as high-level constructs, to capture as much information as possible during transformation steps. FermaT’s methodology and techniques lack provision for concurrent migration and analysis. This provision is crucial if the transformation process is to be further automated. As constraint-based program migration theory has demonstrated, trying to satisfy the enormous computational demands of the generated transformation sequence search-space and its constraints is inefficient and time-consuming.
With the objective of solving the above problems and extending the operating range of the FermaT transformation system, this thesis proposes a Parallel Transformations Framework which makes parallel transformations processing within the FermaT environment not only possible but also beneficial for its migration process. During a migration process, many thousands of program transformations have to be applied. For example, migrating 1 million lines of assembler to C takes over 21 hours on a single PC. Various search approaches, prediction techniques and a constraint-based approach already exist to address these issues, but they solve them unsatisfactorily. To remedy this situation, this dissertation proposes a framework to extend transformation processing systems with parallel processing capabilities. The parallel system can analyse specified parallel transformation tasks and produce appropriate parallel transformations processing outlines. To underpin the goal of automation, a formal language is introduced. This language can be used to describe and outline parallel transformation tasks, while parallel processing constraints underpin the parallel objective.
This thesis addresses and explains how transformation processing steps can be automatically parallelised within a reengineering domain. It presents search and prediction tactics within this field. The decomposition and parallelisation of transformation sequence search-spaces is outlined. Finally, the presented work is evaluated on practical case studies to demonstrate different parallel transformations processing techniques, and conclusions are drawn.
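Decomposing a transformation sequence search-space for parallel exploration can be sketched in a few lines. This is a toy stand-in and not FermaT or WSL: the "program" is just an integer, the three entries in `TRANSFORMS` and the `cost` function are invented for illustration, and the search is a plain exhaustive one. The space of sequences is partitioned by first transformation, and each partition is searched by a separate worker.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical transformations on a toy "program" (an integer).
TRANSFORMS = {
    "double": lambda x: x * 2,
    "inc": lambda x: x + 1,
    "square": lambda x: x * x,
}


def apply_sequence(seq, program):
    """Apply the named transformations in order."""
    for name in seq:
        program = TRANSFORMS[name](program)
    return program


def cost(program, target=37):
    """Stand-in cost: distance of the result from an assumed target."""
    return abs(program - target)


def search_subspace(first, program, depth):
    """Search every sequence of length 1..depth that starts with `first`."""
    best = None
    for extra in range(depth):
        for rest in product(sorted(TRANSFORMS), repeat=extra):
            seq = (first,) + rest
            candidate = (cost(apply_sequence(seq, program)), len(seq), seq)
            if best is None or candidate < best:
                best = candidate
    return best


def parallel_search(program, depth, workers=3):
    """Partition the search-space by first step and search the parts concurrently."""
    firsts = sorted(TRANSFORMS)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(
            pool.map(lambda f: search_subspace(f, program, depth), firsts)
        )
    return min(partial)   # lowest cost, then shortest sequence
```

For example, `parallel_search(3, 4)` returns `(0, 3, ('double', 'square', 'inc'))`. Threads are used here only to keep the sketch self-contained; in CPython they do not speed up CPU-bound work, so a process pool or a cluster of machines would take their place in practice.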
Tools for efficient Deep Learning
In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges for designing DNNs that are efficient in time, in resources and in power consumption.
We first present Aegis and SPGC to address the challenges in improving the memory efficiency of DL training and inference. Aegis makes mixed precision training (MPT) more stable through layer-wise gradient scaling. Empirical experiments show that Aegis can improve MPT accuracy by up to 4%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC have up to 1% higher accuracy than prior work.
This thesis also addresses the challenges lying in the gap between DNN descriptions and executables by Polygeist for software and POLSCA for hardware. Many novel techniques, e.g. statement splitting and memory partitioning, are explored and used to expand polyhedral optimisation. Polygeist can speed up sequential and parallel software execution by 2.53 and 9.47 times, respectively, on Polybench/C. POLSCA achieves a 1.5 times speedup over hardware designs directly generated from high-level synthesis on Polybench/C.
Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators of streaming architectures with advanced pipelining techniques to address the challenges from heterogeneous convolution and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon shows resource/power consumption efficiency improvement of 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets.
All these tools are open source, some of which have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.
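Why gradient scaling helps in mixed precision can be seen with a small numeric sketch. This is not the Aegis algorithm, which the abstract does not describe in detail; it is a simplified illustration under stated assumptions. The helper names (`to_half`, `layer_scale`, `scaled_half_grads`), the power-of-two scale and the `target` magnitude are all hypothetical. The sketch rounds values to IEEE 754 half precision using the standard `struct` module and picks one scale per layer so that small gradients survive the rounding.

```python
import math
import struct


def to_half(x):
    """Round a float to IEEE 754 binary16 and convert it back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def layer_scale(grads, target=1024.0):
    """Largest power of two that keeps the layer's biggest gradient <= target."""
    largest = max(abs(g) for g in grads)
    if largest == 0.0:
        return 1.0
    return 2.0 ** math.floor(math.log2(target / largest))


def scaled_half_grads(layers):
    """Scale each layer separately, round to half precision, then unscale."""
    result = {}
    for name, grads in layers.items():
        scale = layer_scale(grads)
        # Dividing by a power of two afterwards is exact in double precision.
        result[name] = [to_half(g * scale) / scale for g in grads]
    return result
```

Without scaling, a gradient of `1e-8` rounds to zero in half precision, so the layer would stop learning. With a per-layer scale, a layer with tiny gradients and a layer with gradients near `0.5` each receive a factor suited to their own range, which a single global scale cannot provide. In real training the scale would have to be applied before the half-precision backward pass produces the gradient; here that step is modelled by scaling a high-precision value before rounding.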
Analysis and transformation of legacy code
Hardware evolves faster than software. While a hardware system might need replacement
every one to five years, the average lifespan of a software system is a decade,
with some instances living up to several decades. Inevitably, code outlives the platform
it was developed for and may become legacy: development of the software stops,
but maintenance has to continue to keep up with the evolving ecosystem. No new features
are added, but the software is still used to fulfil its original purpose. Even in the
cases where it is still functional (which discourages its replacement), legacy code is
inefficient, costly to maintain, and a risk to security.
This thesis proposes methods to leverage the expertise put in the development of
legacy code and to extend its useful lifespan, rather than to throw it away. A novel
methodology is proposed, for automatically exploiting platform specific optimisations
when retargeting a program to another platform. The key idea is to leverage the optimisation
information embedded in vector processing intrinsic functions. The performance
of the resulting code is shown to be close to the performance of manually
retargeted programs, but with the human labour removed.
Building on top of that, the question of discovering optimisation information when
there are no hints in the form of intrinsics or annotations is investigated. This thesis
postulates that such information can potentially be extracted from profiling the data
flow during executions of the program. A context-aware data dependence profiling
system is described, detailing previously overlooked aspects in related research. The
system is shown to be essential in surpassing the information that can be inferred statically,
in particular about loop iterators.
Loop iterators are the controlling part of a loop. This thesis describes and evaluates
a system for extracting the loop iterators in a program. It is found to significantly
outperform previously known techniques and further increases the amount of information
about the structure of a program that is available to a compiler. Combining this
system with data dependence profiling improves its results even more. Loop iterator
recognition enables other code modernising techniques, like source code rejuvenation
and commutativity analysis. The former increases the use of idiomatic code and as
a result increases the maintainability of the program. The latter can potentially drive
parallelisation and thus dramatically improve runtime performance.