2 research outputs found

    Automatic performance optimisation of parallel programs for GPUs via rewrite rules

    Get PDF
    Graphics Processing Units (GPUs) are now commonplace in computing systems and are the most successful parallel accelerators. Their performance is orders of magnitude higher than traditional Central Processing Units (CPUs) making them attractive for many application domains with high computational demands. However, achieving their full performance potential is extremely hard, even for experienced programmers, as it requires specialised software tailored for specific devices written in low-level languages such as OpenCL. Differences in device characteristics between manufacturers and even hardware generations often lead to large performance variations when different optimisations are applied. This inevitably leads to code that is not performance portable across different hardware. This thesis demonstrates that achieving performance portability is possible using LIFT, a functional data-parallel language which allows programs to be expressed at a high-level in a hardware-agnostic way. The LIFT compiler is empowered to automatically explore the optimisation space using a set of well-defined rewrite rules to transform programs seamlessly between different high-level algorithmic forms before translating them to a low-level OpenCL-specific form. The first contribution of this thesis is the development of techniques to compile functional LIFT programs that have optimisations explicitly encoded into efficient imperative OpenCL code. Producing efficient code is non-trivial as many performance sensitive details such as memory allocation, array accesses or synchronisation are not explicitly represented in the functional LIFT language. The thesis shows that the newly developed techniques are essential for achieving performance on par with manually optimised code for GPU programs with the exact same complex optimisations applied. The second contribution of this thesis is the presentation of techniques that enable the LIFT compiler to perform complex optimisations that usually require from tens to hundreds of individual rule applications by grouping them as macro-rules that cut through the optimisation space. Using matrix multiplication as an example, starting from a single high-level program the compiler automatically generates highly optimised and specialised implementations for desktop and mobile GPUs with very different architectures achieving performance portability. The final contribution of this thesis is the demonstration of how low-level and GPU-specific features are extracted directly from the high-level functional LIFT program, enabling building a statistical performance model that makes accurate predictions about the performance of differently optimised program variants. This performance model is then used to drastically speed up the time taken by the optimisation space exploration by ranking the different variants based on their predicted performance. Overall, this thesis demonstrates that performance portability is achievable using LIFT

    Pattern discovery for parallelism in functional languages

    Get PDF
    No longer the preserve of specialist hardware, parallel devices are now ubiquitous. Pattern-based approaches to parallelism, such as algorithmic skeletons, simplify traditional low-level approaches by presenting composable high-level patterns of parallelism to the programmer. This allows optimal parallel configurations to be derived automatically, and facilitates the use of different parallel architectures. Moreover, parallel patterns can be swap-replaced for sequential recursion schemes, thus simplifying their introduction. Unfortunately, there is no guarantee that recursion schemes are present in all functional programs. Automatic pattern discovery techniques can be used to discover recursion schemes. Current approaches are limited by both the range of analysable functions, and by the range of discoverable patterns. In this thesis, we present an approach based on program slicing techniques that facilitates the analysis of a wider range of explicitly recursive functions. We then present an approach using anti-unification that expands the range of discoverable patterns. In particular, this approach is user-extensible; i.e. patterns developed by the programmer can be discovered without significant effort. We present prototype implementations of both approaches, and evaluate them on a range of examples, including five parallel benchmarks and functions from the Haskell Prelude. We achieve maximum speedups of 32.93x on our 28-core hyperthreaded experimental machine for our parallel benchmarks, demonstrating that our approaches can discover patterns that produce good parallel speedups. Together, the approaches presented in this thesis enable the discovery of more loci of potential parallelism in pure functional programs than currently possible. This leads to more possibilities for parallelism, and so more possibilities to take advantage of the potential performance gains that heterogeneous parallel systems present
    corecore