Improving program performance through the use of multiple homogeneous processing
elements, or cores, is commonplace. However, these architectures increase the
complexity required at the software level. Existing work focuses on optimising
programs that run in isolation on these systems, but ignores the fact that, in practice,
these systems run multiple parallel programs concurrently, with programs competing
for system resources. To improve performance in this shared environment,
cooperative tuning of multiple, concurrently running parallel programs is required.
Moreover, the set of programs running on the system – the system workload – changes
rapidly. This makes cooperative tuning a challenge, as it must
react quickly to changes in the system workload.
This thesis explores the scope for performance improvement from cooperatively
tuning skeleton parallel programs, and techniques that can be used to cooperatively
auto-tune parallel programs. Parallel skeletons provide a clear separation between
algorithm description and implementation, and provide tuning knobs that the system
can use to make high-level changes to a program's implementation. This work
is in three parts: (i) how many threads should be allocated to each program running
on the system, (ii) on which cores should a program's threads be executed and
(iii) what values should be chosen for high-level parameters of the parallel skeletons.
We demonstrate that significant performance improvements are available in each of
these areas, compared to the current state of the art.