Modern computers are based on manycore architectures, with multiple processors on
a single silicon chip. In this environment, programmers are required to make use of
parallelism to fully exploit the available cores. This can be done either within a single chip,
normally using shared-memory programming, or at a larger scale on a cluster of chips,
normally using message passing.
Legacy programs written using either paradigm face issues when run on modern
manycore architectures. In message passing, the problem is performance-related:
clusters based on manycores necessarily introduce tiered topologies that topology-unaware
programs may not fully exploit. In shared memory, it is a correctness problem:
modern systems employ more relaxed memory consistency models, for which
legacy programs were not designed. Solutions to this correctness problem
exist, but they introduce a performance problem because they are necessarily conservative. This
thesis focuses on addressing these problems, largely through compile-time analysis
and transformation.
The first technique proposed is a method for statically determining the communication
graph of an MPI program. This graph is then used to optimise process placement in
a cluster of chip multiprocessors (CMPs). Using the 64-process versions of the NAS parallel benchmarks,
we see an average improvement in communication localisation of 28% (7%) over by-rank
scheduling for 8-core (12-core) CMP-based clusters, which represents the maximum
possible improvement.
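To make the notion of communication localisation concrete, the following is a minimal sketch rather than the analysis used in the thesis. It assumes that localisation is measured as the fraction of communication volume exchanged between processes placed on the same CMP. The communication graph, the function names `localisation` and `by_rank`, and the hand-written `aware` placement are all hypothetical.

```python
def localisation(comm, placement):
    """Fraction of total communication volume that stays within one node.

    comm maps (rank_a, rank_b) pairs to a volume (assumed metric);
    placement maps each rank to the node (CMP) it runs on.
    """
    total = sum(comm.values())
    local = sum(w for (a, b), w in comm.items() if placement[a] == placement[b])
    return local / total if total else 0.0


def by_rank(num_procs, cores_per_node):
    """By-rank placement: consecutive ranks fill one node after another."""
    return {r: r // cores_per_node for r in range(num_procs)}


# Hypothetical communication graph for 8 processes: each rank r exchanges
# a large volume with rank r + 4 and a small volume with rank r + 1.
comm = {}
for r in range(4):
    comm[(r, r + 4)] = 100
for r in range(7):
    comm[(r, r + 1)] = 10

# Hypothetical graph-aware placement on two 4-core nodes: heavy partners
# (r, r + 4) are kept on the same node.
aware = {0: 0, 4: 0, 1: 0, 5: 0, 2: 1, 6: 1, 3: 1, 7: 1}

baseline = localisation(comm, by_rank(8, 4))
improved = localisation(comm, aware)
print(f"by-rank: {baseline:.2%}, graph-aware: {improved:.2%}")
```

In this toy graph, by-rank placement separates every heavy pair, so a placement informed by the communication graph keeps far more of the traffic on-chip. How the graph is derived statically from the MPI source is not modelled here.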
Secondly, we move into the shared-memory paradigm, identifying and proving
necessary conditions for a read to be an acquire. These conditions can be used to improve solutions
in several application areas, two of which we then explore.
We apply our acquire signatures to the problem of fence placement for legacy well-synchronised
programs. We find that, by applying our signatures, we can reduce the number
of fences placed by an average of 62%, leading to a speedup of up to 2.64x over an
existing practical technique.
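The effect of filtering on fence count can be illustrated with a toy sketch. It assumes a conservative baseline that places a fence after every shared read, and a refined policy that fences only reads that might be acquires. The `matches_signature` flag is supplied as hypothetical input data; the actual conditions that make up an acquire signature are not modelled here, and the read names are invented for illustration.

```python
def place_fences(reads, may_be_acquire=None):
    """Return the indices of reads after which a fence is placed.

    With no predicate, every shared read is fenced (assumed conservative
    baseline). With a predicate, only reads it accepts are fenced.
    """
    return [
        i
        for i, read in enumerate(reads)
        if may_be_acquire is None or may_be_acquire(read)
    ]


# Hypothetical shared reads in a legacy program. The flag stands in for the
# outcome of a static check against the acquire signatures.
reads = [
    {"name": "flag", "matches_signature": True},
    {"name": "data[0]", "matches_signature": False},
    {"name": "data[1]", "matches_signature": False},
    {"name": "ready", "matches_signature": True},
    {"name": "counter", "matches_signature": False},
]

conservative = place_fences(reads)
refined = place_fences(reads, lambda read: read["matches_signature"])

# Relative reduction in the number of fences placed.
reduction = (len(conservative) - len(refined)) / len(conservative)
print(f"{len(conservative)} fences reduced to {len(refined)} ({reduction:.0%})")
```

Because the conditions are necessary rather than sufficient, a read that fails them cannot be an acquire and can safely go unfenced, while a read that satisfies them is still fenced conservatively.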
Finally, we develop a dynamic synchronisation detection tool known as SyncDetect.
This proof-of-concept tool leverages our acquire signatures to more accurately
detect ad hoc synchronisations in running programs and provides the programmer with
a report of their locations in the source code. The tool aims to assist programmers with
the notoriously difficult problem of parallel debugging and with manually porting legacy
programs to more modern (relaxed) memory consistency models.