Compiler and runtime techniques for bulk-synchronous programming models on CPU architectures
The rising pressure to simultaneously improve performance and reduce power consumption is driving more heterogeneity into all aspects of computing devices.
However, wide adoption of specialized computing devices such as GPUs and Xeon Phis comes with a programming challenge. A carefully optimized program that is well matched to the target hardware can run many times faster and more energy efficiently than one that is not.
Ideally, programmers should write their code using a single programming model, and the compiler would transform the program to run optimally on the target architecture.
In practice, however, programmers have to expend great effort to translate performance enjoyed on one platform to another.
As a result, single-source portability has gained substantial momentum, and OpenCL, a bulk-synchronous programming language, has become a popular choice for meeting this need. The computing model assumed by such languages is inevitably only loosely coupled to the underlying architecture, so the compiler and runtime together must find an efficient execution mapping from the input program onto the architecture that best exploits the hardware for performance.
In this dissertation, I argue and demonstrate that obtaining high performance from executing OpenCL programs on CPUs is feasible. To achieve this goal, I present compiler and runtime techniques for executing OpenCL programs on CPU architectures.
First, I propose a compiler technique in which the execution of fine-grained parallel threads, called work-items, is collectively analyzed to consider the impact of scheduling them with respect to data locality.
By analyzing the memory addresses accessed in a kernel, the technique can make better decisions on how to schedule work-items to construct better memory access patterns, thereby improving performance.
The approach achieves geomean speedups of 3.32x over AMD's and 1.71x over Intel's state-of-the-art implementations on Parboil and Rodinia benchmarks.
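The locality idea can be illustrated with a deliberately simplified sketch (the function names and the greedy heuristic below are invented for illustration and are not the dissertation's actual compiler analysis): given the address each work-item accesses, scheduling work-items in address order shortens the stride between consecutively executed accesses.

```python
# Toy sketch (invented names; not the dissertation's actual analysis): given the
# memory address each work-item touches, order the work-items by address so that
# consecutively scheduled work-items access nearby memory.

def locality_cost(order, addresses):
    """Total address distance between consecutively scheduled work-items."""
    return sum(abs(addresses[order[i + 1]] - addresses[order[i]])
               for i in range(len(order) - 1))

def schedule_work_items(addresses):
    """Greedy locality heuristic: schedule work-items in address order."""
    return sorted(range(len(addresses)), key=lambda wi: addresses[wi])

# Work-item i accesses address ((i * 3) % 8) * 64 -- a strided, cache-unfriendly pattern.
addrs = [((i * 3) % 8) * 64 for i in range(8)]
naive = list(range(8))
tuned = schedule_work_items(addrs)
assert locality_cost(tuned, addrs) < locality_cost(naive, addrs)
```

The real analysis works on symbolic address expressions inside the kernel rather than concrete traces, but the objective is the same: a schedule whose consecutive work-items form better memory access patterns.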
Second, I propose a runtime that allows the compiler to supply multiple differently optimized kernels, mitigating the pressure on the compiler to derive the single most optimal code. The runtime systematically deploys the candidate kernels on a small portion of the actual data to determine which achieves the best performance for the given hardware-data combination. It exploits the fact that OpenCL programs typically come with a large number of independent work-groups, a feature that amortizes the cost of the profiling execution of a few work-items, while the overhead is further reduced by retaining the profiling results as part of the final execution output.
The proposed runtime incurs an average execution-time overhead of 3% compared to an ideal, oracular runtime.
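A minimal sketch of this selection scheme, under assumed names and a toy "kernel" interface (none of this is the actual runtime's API): each candidate is timed on a few work-groups, its partial output is retained so the profiling work contributes to the final result, and the winner processes the remaining work-groups.

```python
import time

# Hedged sketch (illustrative names, toy "kernel" interface -- not the actual
# runtime's API). Assumes there are many more work-groups than probe runs.

def select_and_run(candidates, work_groups, probe_per_candidate=2):
    results = {}
    timings = []
    probe_iter = iter(work_groups)
    for kernel in candidates:
        start = time.perf_counter()
        for _ in range(probe_per_candidate):
            wg = next(probe_iter)
            results[wg] = kernel(wg)   # retained as part of the final output
        timings.append((time.perf_counter() - start, kernel))
    best = min(timings, key=lambda entry: entry[0])[1]
    for wg in probe_iter:              # remaining work runs on the winner
        results[wg] = best(wg)
    return best, results
```

Because every candidate computes the same function, the partial outputs produced during profiling are valid pieces of the final answer, which is what keeps the measured overhead low.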
Predictive analysis and optimisation of pipelined wavefront applications using reusable analytic models
Pipelined wavefront computations are a ubiquitous class of high performance parallel algorithms used for the solution of many scientific and engineering applications. In order to aid the design and optimisation of these applications, and to ensure that platforms best suited to these codes are chosen during procurement, there has been considerable research in analysing and evaluating their operational performance.
Wavefront codes exhibit complex computation, communication, and synchronisation patterns, and as a result there exists a large variety of such codes and possible optimisations. The
problem is compounded by each new generation of high performance computing system,
which has often introduced a previously unexplored architectural trait, requiring previous
performance models to be rewritten and reevaluated.
In this thesis, we address the performance modelling and optimisation of this class of
application, as a whole. This differs from previous studies in which bespoke models are applied
to specific applications. The analytic performance models are generalised and reusable,
and we demonstrate their application to the predictive analysis and optimisation of pipelined
wavefront computations running on modern high performance computing systems.
The performance model is based on the LogGP parameterisation, and uses a small
number of input parameters to specify the particular behaviour of most wavefront codes. The
new parameters and model equations capture the key structural and behavioural differences
among different wavefront application codes, providing a succinct summary of the operations
for each application and insights into alternative wavefront application design.
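As a rough illustration of how a LogGP-parameterised wavefront cost might be assembled (a drastically simplified one-dimensional pipeline with invented parameter values, not the thesis's full model):

```python
# Drastically simplified illustration (1-D pipeline; the thesis's model covers
# 2-D processor arrays and richer behaviour): each of P processors computes
# `tiles` tiles of cost W, forwarding an m-byte boundary message downstream
# after each tile. LogGP message cost: L + 2o + (m - 1)G.

def wavefront_time(P, tiles, W, L, o, G, m):
    msg = L + 2 * o + (m - 1) * G     # point-to-point message cost in LogGP
    fill = (P - 1) * (W + msg)        # pipeline fill: wave reaches processor P
    drain = tiles * W                 # last processor streams its own tiles
    return fill + drain
```

Even in this stripped-down form, the pipeline-fill term grows linearly in P, which is the basic trade-off the full models quantify across processor-array shapes, sweep counts and message sizes.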
The models are applied to three industry-strength wavefront codes and are validated
on several systems including a Cray XT3/XT4 and an InfiniBand commodity cluster. Model
predictions show high quantitative accuracy (less than 20% error) for all high performance
configurations and excellent qualitative accuracy.
The thesis presents applications, projections and insights for optimisations using the
model, which show the utility of reusable analytic models for performance engineering of
high performance computing codes. In particular, we demonstrate the use of the model for:
(1) evaluating application configuration and resulting performance; (2) evaluating hardware platform issues including platform sizing and configuration; (3) exploring hardware platform design alternatives and system procurement; and (4) considering possible code and algorithmic optimisations.
Wave Front Sensing and Correction Using Spatial Modulation and Digitally Enhanced Heterodyne Interferometry
This thesis is about light. Specifically, it explores a new way of sensing the spatial distribution of amplitude and phase across the wavefront of a propagating laser. It uses spatial light modulators to tag spatially distinct regions of the beam, a single diode to collect the resulting light, and digitally enhanced heterodyne interferometry to decode the phase and amplitude information across the wavefront. It also demonstrates how these methods can be used to maximise the transmission of light through a cavity and shows how minor aberrations in the beam can be corrected in real time. Finally, it demonstrates the preferential transmission of higher-order modes.
Wavefront sensing is becoming increasingly important as the
demands on modern interferometers
increase. Land-based systems such as the Laser Interferometer
Gravitational-Wave
Observatory (LIGO) use it to maximise the amount of power in the
arm cavities during
operation and reduce noise, while space-based missions such as
the Laser Interferometer
Space Antenna (LISA) will use it to align distant partner
satellites and ensure that the
maximum amount of signal is exchanged. Conventionally, wavefront sensing is accomplished using either Hartmann sensors or multi-element diodes. These are well-proven and very effective techniques but bring with them a number of well-understood limitations.
Critically, while they can map a wavefront in detail, they are
strictly sensors and
can do nothing to correct it.
Our new technique is based on a single-element photo-diode and
the spatial modulation
of the local oscillator beam. We encode orthogonal codes
spatially onto this light and use
these to separate the phases and amplitudes of different parts of
the signal beam in post
processing. This technique shifts complexity from the optical
hardware into deterministic
digital signal processing. Notably, the use of a single analogue
channel (photo-diode,
connections and analogue to digital converter) avoids some
low-frequency error sources.
The technique can also sense the wavefront phase at many points,
limited only by the
number of actuators on the spatial light modulator, in contrast to the standard four points from a quadrant photo-diode. For ground-based systems, our
technique could be used to
identify and eliminate higher-order modes, while, for space-based
systems, it provides a
measure of wavefront tilt which is less susceptible to low
frequency noise.
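The decoding step can be sketched with a toy model (the codes and field values below are invented for illustration): tag each region of the local oscillator with an orthogonal Walsh code, sum everything into one detector signal, and correlate against each code to recover that region's complex amplitude.

```python
# Toy model of the decoding idea (codes and field values invented): each region
# of the local-oscillator wavefront is tagged with an orthogonal +/-1 Walsh code;
# the single photo-diode sees the sum, and correlating that one signal against a
# region's code recovers the region's complex field (amplitude and phase).

WALSH = [                      # 4 mutually orthogonal length-4 codes
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]

def encode(region_fields):
    """Single-detector signal: code-tagged region fields summed at each chip time."""
    return [sum(f * WALSH[r][t] for r, f in enumerate(region_fields))
            for t in range(len(WALSH[0]))]

def decode(signal, region):
    """Correlate the detector signal with one region's code to isolate its field."""
    return sum(s * WALSH[region][t] for t, s in enumerate(signal)) / len(signal)
```

Orthogonality of the codes is what lets a single analogue channel carry many spatially distinct measurements, with the separation deferred to deterministic digital post-processing.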
In the future it may be possible to couple the technique with an
artificial intelligence
engine to automate more of the beam alignment process in
arrangements involving multiple
cavities, preferentially select (or reject) specific higher order
modes and start to reduce
the burgeoning requirements for human control of these complex instruments.
Efficient Algorithms for Coastal Geographic Problems
The increasing performance of computers has made it possible to solve algorithmically problems for which manual, and possibly inaccurate, methods were previously used. Nevertheless, one must still pay attention to the performance of an algorithm if huge datasets are used or if the problem is computationally difficult.
Two geographic problems are studied in the articles included in this thesis. In the first problem the goal is to determine distances from points, called study points, to shorelines in predefined directions. Together with other information, mainly related to wind, these distances can be used to estimate wave exposure at different areas. In the second problem the input consists of a set of sites where water quality observations have been made and of the results of the measurements at the different sites. The goal is to select a subset of the observational sites in such a manner that water quality is still measured with sufficient accuracy when monitoring at the other sites is stopped to reduce economic cost.
Most of the thesis concentrates on the first problem, known as the fetch length problem. The main challenge is that the two-dimensional map is represented as a set of polygons with millions of vertices in total, and the distances may also be computed for millions of study points in several directions. Efficient algorithms are developed for the problem, one of them approximate and the others exact except for rounding errors. The solutions also differ in that three of them are targeted for serial operation or for a small number of CPU cores, whereas one, together with its further developments, is suitable also for parallel machines such as GPUs.
In the water quality problem, the given set of stations has a very large number of possible subsets. In addition, the task involves time-consuming operations such as linear regression, which further limits how many subsets can be examined. The solution therefore uses heuristics, which do not necessarily produce an optimal result.
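The core geometric step of the fetch length problem can be sketched as a ray-versus-segment intersection (a naive reference version for illustration, not one of the thesis's optimised algorithms):

```python
import math

# Naive illustration of the fetch-length primitive: the distance from a study
# point along a fixed direction to the nearest crossing with a shoreline
# segment. The thesis's algorithms avoid testing every segment like this.

def fetch_distance(point, angle, segments):
    """Smallest positive t with point + t*(cos a, sin a) on some segment."""
    px, py = point
    dx, dy = math.cos(angle), math.sin(angle)
    best = math.inf
    for (x1, y1), (x2, y2) in segments:
        sx, sy = x2 - x1, y2 - y1
        denom = dx * sy - dy * sx
        if abs(denom) < 1e-12:          # ray parallel to segment
            continue
        t = ((x1 - px) * sy - (y1 - py) * sx) / denom   # distance along the ray
        u = ((x1 - px) * dy - (y1 - py) * dx) / denom   # position along the segment
        if t > 0 and 0 <= u <= 1:
            best = min(best, t)
    return best
```

Scanning all segments costs O(n) per query; with millions of shoreline vertices and millions of study points, avoiding exactly this brute-force scan is what the efficient serial and GPU algorithms are about.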
Toward commercial realisation of whole field interferometric analysis
The objective of this work was to produce an instrument which could
undertake wholefield inspection and displacement measurement utilising a
non-contacting technology. The instrument has been designed to permit
operation by engineers not necessarily familiar with the underlying
technology and produce results in a meaningful form. Of the possible
techniques considered Holographic Interferometry was originally identified
as meeting these objectives. Experimental work undertaken provides data which confirms the potential of the technique for solving problems but also highlights some difficulties.
In order to perform a complete three dimensional displacement analysis a
number of holographic views must be recorded. Considerable effort is
required to extract quantitative data from the holograms. Error analysis
of the experimental arrangement has highlighted a number of practical
restrictions which lead to data uncertainties. Qualitative analysis of
engineering components using Holographic Interferometry has been
successfully undertaken and results in useful analytical data which is
used in three different engineering design programmes. Unfortunately, attempts to quantify the data to provide strain values rely upon double differentiation of the fringe field, a process that is highly sensitive to fringe position errors. In spite of this, these experiments provided the confidence that optical interferometry is able to produce data of suitable displacement sensitivity, with results acceptable to other engineers.
Autotuning wavefront patterns for heterogeneous architectures
Manual tuning of applications for heterogeneous parallel systems is tedious and complex.
Optimizations are often not portable, and the whole process must be repeated when moving
to a new system, or sometimes even to a different problem size.
Pattern-based parallel programming models were originally designed to provide programmers with an abstract layer, hiding tedious parallel boilerplate code and allowing a focus on application-specific issues only. However, the constrained algorithmic model associated with
each pattern also enables the creation of pattern-specific optimization strategies. These can
capture more complex variations than would be accessible by analysis of equivalent unstructured
source code. These variations create complex optimization spaces. Machine learning
offers well established techniques for exploring such spaces.
In this thesis we use machine learning to create autotuning strategies for heterogeneous
parallel implementations of applications which follow the wavefront pattern. In a wavefront,
computation starts from one corner of the problem grid and proceeds diagonally like a wave
to the opposite corner in either two or three dimensions. Our framework partitions and
optimizes the work created by these applications across systems comprising multicore CPUs
and multiple GPU accelerators. The tuning opportunities for a wavefront include controlling
the amount of computation to be offloaded onto GPU accelerators, choosing the number of
CPU and GPU threads to process tasks, tiling for both CPU and GPU memory structures,
and trading redundant halo computation against communication for multiple GPUs.
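The dependency structure that creates these diagonal waves can be sketched as follows (a minimal serial reference; the `combine` callback is a hypothetical stand-in for the application-specific cell update):

```python
# Minimal 2-D wavefront sweep (illustrative only): cell (i, j) depends on its
# north and west neighbours, so all cells on an anti-diagonal i + j = d are
# independent and could be processed in parallel -- e.g. split between CPU
# threads and GPU work-groups, as the framework described above does.

def wavefront_sweep(n, m, combine):
    grid = [[0] * m for _ in range(n)]
    for d in range(n + m - 1):                       # anti-diagonals, wave order
        for i in range(max(0, d - m + 1), min(n, d + 1)):
            j = d - i
            north = grid[i - 1][j] if i > 0 else 0
            west = grid[i][j - 1] if j > 0 else 0
            grid[i][j] = combine(north, west, i, j)  # app-specific cell update
    return grid
```

The parallelism per step grows from one cell at the corner to min(n, m) cells mid-sweep and shrinks again, which is precisely why the CPU/GPU split and tiling choices listed above are so sensitive to problem size.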
Our exhaustive search of the problem space shows that these parameters are very sensitive
to the combination of architecture, wavefront instance and problem size. We design and
investigate a family of autotuning strategies, targeting single and multiple CPU + GPU
systems, and both two- and three-dimensional wavefront instances. These yield an average of 87% of the performance found by offline exhaustive search, with up to 99% in some cases.
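The offline exhaustive search that these autotuners are measured against can be sketched with a synthetic cost model (all names and numbers below are invented for illustration):

```python
import itertools

# Sketch of the tuning-space search: score every (gpu_fraction, tile_size)
# configuration and keep the cheapest, standing in for the offline exhaustive
# search used as the baseline above.

def exhaustive_tune(cost, gpu_fractions, tile_sizes):
    """Return the (gpu_fraction, tile_size) pair with the lowest modelled cost."""
    return min(itertools.product(gpu_fractions, tile_sizes),
               key=lambda cfg: cost(*cfg))

# Synthetic cost model (invented numbers): a 4x-faster GPU takes the offloaded
# fraction f, the CPU takes the rest, and tiles far from 64 pay a penalty.
def toy_cost(f, t):
    return max(f / 4.0, 1.0 - f) + 0.01 * abs(t - 64)
```

In practice the measured cost of a real configuration replaces `toy_cost`, which makes exhaustive search expensive and motivates the machine-learned autotuners that approximate its result.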
Development of High Resolution Tools for Investigating Cardiac Arrhythmia Dynamics
Every year 300,000 Americans die due to sudden cardiac death. There are many pathologies, acquired and genetic, that can lead to sudden cardiac death. Regardless of the underlying pathology, death is frequently the result of ventricular tachycardia and/or fibrillation (VT/VF). Despite decades of research, the mechanisms of ventricular arrhythmia initiation and maintenance are still incompletely understood.
A contributing factor to this lack of understanding is the limitations of the investigative tools used to study VT/VF. Arrhythmias are organ-level phenomena that are governed by cellular interactions and, as such, near-cellular levels of resolution are needed to tease out their intricacies. They are also behaviors that are not limited by region, but dynamically affect the entirety of the heart. For these reasons, high-resolution methodologies capable of measuring the electrophysiology of the entirety of the ventricles will play an important role in gaining a complete understanding of the principles that govern ventricular arrhythmia dynamics. They will also be essential in the development of novel therapies for arrhythmia management.
In this dissertation, I first present the validation and characterization of a novel capacitive electrode design that overcomes key limitations faced by modern implantable cardiac devices. I then outline the construction, methodologies, and open-source tools of an improved optical panoramic mapping system for small mammalian cardiac electrophysiology studies. I conclude with a small-mammal study of the relationship between action potential duration restitution dynamics and the mechanisms of maintenance in ventricular arrhythmias.
Logic, parallelism and semantic networks : the binary predicate execution model
This thesis develops the Binary Predicate Execution Model: a distributed, massively parallel system for semantic networks and knowledge bases that is built on a subset of first-order predicate logic. The use of logic gives the model an easily understood programming paradigm and a well-defined semantics of execution.

When expressed in binary predicates, a simple graphical interpretation can be used. All program facts are represented in an assertion graph. Each vertex is associated with a term appearing in a fact, and the edges are labeled with the predicate names. Similar graphs are also associated with each rule body and the query. Finding all possible solutions corresponds to finding all possible matches between the query graph and the assertion graph. Invoking a rule corresponds to substituting the graph of its body, constrained by the dependencies between its arguments. This can be implemented in a parallel, message-passing fashion in which the assertion graph vertices are active processing elements that asynchronously exchange messages identifying the parts of the query that remain to be matched, along with any binding information from previous matching required to accomplish this.

The model is data-driven, since every message can be immediately processed without the need for any centralized control or centralized memory. By restricting how functional terms can occur, distributed data structures and remote data look-ups for unification are eliminated. Thus, in most cases the model's performance scales up to increasingly larger problems given increasingly larger machines. Architectural support for the model is investigated and simulation results of a relatively simple software implementation are reported. These suggest performance on the order of 10^5 logical inferences per second for 256 processing elements in an n-cube configuration. Further research directions, including that of increasing efficiency, are discussed.
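The query-matching idea can be sketched sequentially (a toy stand-in for the model's asynchronous message passing; the fact and query encodings are illustrative, with `?`-prefixed strings as variables):

```python
# Toy rendition of the assertion-graph idea: binary facts are predicate-labeled
# edges, and answering a query means finding every vertex binding that matches
# all query edges. The real model does this with asynchronous messages between
# active vertices; here a sequential recursion stands in for that.

def match(assertions, query, binding=None):
    """assertions: set of (subject, predicate, object) facts.
    query: list of (var_or_const, predicate, var_or_const) edges."""
    binding = binding or {}
    if not query:
        return [binding]
    (s, p, o), rest = query[0], query[1:]
    out = []
    for fs, fp, fo in assertions:
        if fp != p:
            continue
        b, ok = dict(binding), True
        for q, f in ((s, fs), (o, fo)):
            if q.startswith('?'):          # variable: bind or check consistency
                if b.get(q, f) != f:
                    ok = False
                    break
                b[q] = f
            elif q != f:                   # constant: must match exactly
                ok = False
                break
        if ok:
            out.extend(match(assertions, rest, b))
    return out
```

Each recursive step corresponds to one message hop in the model: a partially matched query edge plus the bindings accumulated so far, forwarded to the vertices that might extend it.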