7 research outputs found
Automatically deriving cost models for structured parallel processes using hylomorphisms
This work has been partially supported by the EU Horizon 2020 grant âRePhrase: Refactoring Parallel Heterogeneous Resource-Aware Applications - a Software Engineering Approachâ (ICT-644235), by COST Action IC1202 (TACLe), supported by COST (European Cooperation on Science and Technology), and by EPSRC grant EP/M027317/1 âC33: Scalable & Verified Shared Memory via Consistency-directed Cache Coherenceâ.Structured parallelism using nested algorithmic skeletons can greatly ease the task of writing parallel software, since common, but hard-to-debug, problems such as race conditions are eliminated by design. However, choosing the best combination of algorithmic skeletons to yield good parallel speedups for a specific program on a specific parallel architecture is still a difficult problem. This paper uses the unifying notion of hylomorphisms, a general recursion pattern, to make it possible to reason about both the functional correctness properties and the extra-functional timing properties of structured parallel programs. We have previously used hylomorphisms to provide a denotational semantics for skeletons, and proved that a given parallel structure for a program satisfies functional correctness. This paper expands on this theme, providing a simple operational semantics for algorithmic skeletons and a cost semantics that can be automatically derived from that operational semantics. We prove that both semantics are sound with respect to our previously defined denotational semantics. This means that we can now automatically and statically choose a provably optimal parallel structure for a given program with respect to a cost model for a (class of) parallel architecture. By deriving an automatic amortised analysis from our cost model, we can also accurately predict parallel runtimes and speedups.PostprintPeer reviewe
Structured arrows : a type-based framework for structured parallelism
This thesis deals with the important problem of parallelising sequential code. Despite the importance of parallelism in modern computing, writing parallel software still relies on many low-level and often error-prone approaches. These low-level approaches can lead to serious execution problems such as deadlocks and race conditions. Due to the non-deterministic behaviour of most parallel programs, testing parallel software can be both tedious and time-consuming. A way of providing guarantees of correctness for parallel programs would therefore provide significant benefit. Moreover, even if we ignore the problem of correctness, achieving good speedups is not straightforward, since this generally involves rewriting a program to consider a (possibly large) number of alternative parallelisations.
This thesis argues that new languages and frameworks are needed. These language and frameworks must not only support high-level parallel programming constructs, but must also provide predictable cost models for these parallel constructs. Moreover, they need to be built around solid, well-understood theories that ensure that: (a) changes to the source code will not change the functional behaviour of a program, and (b) the speedup obtained by doing the necessary changes is predictable. Algorithmic skeletons are parametric implementations of common patterns of parallelism that provide good abstractions for creating new high-level languages, and also support frameworks for parallel computing that satisfy the correctness and predictability requirements that we require.
This thesis presents a new type-based framework, based on the connection between structured parallelism and structured patterns of recursion, that provides parallel structures as type abstractions that can be used to statically parallelise a program. Specifically, this thesis exploits hylomorphisms as a single, unifying construct to represent the functional behaviour of parallel programs, and to perform correct code rewritings between alternative parallel implementations, represented as algorithmic skeletons. This thesis also defines a mechanism for deriving cost models for parallel constructs from a queue-based operational semantics. In this way, we can provide strong static guarantees about the correctness of a parallel program, while simultaneously achieving predictable speedups.âThis work was supported by the University of St Andrews (School of Computer
Science); by the EU FP7 grant âParaPhrase:Parallel Patterns Adaptive Heterogeneous Multicore Systemsâ (n. 288570); by the EU H2020
grant âRePhrase: Refactoring Parallel Heterogeneous Resource-Aware Applications
- a Software Engineering Approachâ (ICT-644235), by COST
Action IC1202 (TACLe), supported by COST (European Cooperation Science and Technology); and by EPSRC grant âDiscovery: Pattern Discovery
and Program Shaping for Manycore Systemsâ (EP/P020631/1)â -- Acknowledgement
Pattern discovery for parallelism in functional languages
No longer the preserve of specialist hardware, parallel devices
are now ubiquitous. Pattern-based approaches to parallelism,
such as algorithmic skeletons, simplify traditional low-level
approaches by presenting composable high-level patterns of
parallelism to the programmer. This allows optimal parallel
configurations to be derived automatically, and facilitates the
use of different parallel architectures. Moreover, parallel patterns
can be swap-replaced for sequential recursion schemes,
thus simplifying their introduction. Unfortunately, there is no
guarantee that recursion schemes are present in all functional
programs. Automatic pattern discovery techniques can be used
to discover recursion schemes. Current approaches are limited
by both the range of analysable functions, and by the range of
discoverable patterns. In this thesis, we present an approach
based on program slicing techniques that facilitates the analysis
of a wider range of explicitly recursive functions. We then
present an approach using anti-unification that expands the
range of discoverable patterns. In particular, this approach is
user-extensible; i.e. patterns developed by the programmer can
be discovered without significant effort. We present prototype
implementations of both approaches, and evaluate them on
a range of examples, including five parallel benchmarks and
functions from the Haskell Prelude. We achieve maximum
speedups of 32.93x on our 28-core hyperthreaded experimental
machine for our parallel benchmarks, demonstrating
that our approaches can discover patterns that produce good
parallel speedups. Together, the approaches presented in this
thesis enable the discovery of more loci of potential parallelism
in pure functional programs than currently possible.
This leads to more possibilities for parallelism, and so more
possibilities to take advantage of the potential performance
gains that heterogeneous parallel systems present
Optimal program variant generation for hybrid manycore systems
Field Programmable Gate Arrays promise to deliver superior energy efficiency in heterogeneous high performance computing, as compared to multicore CPUs and GPUs. The rate of adoption is however hampered by the relative difficulty of programming FPGAs. High-level synthesis tools such as Xilinx Vivado, Altera OpenCL or Intel's HLS address a large part of the programmability issue by synthesizing a Hardware Description Languages representation from a high-level specification of the application, given in programming languages such as OpenCL C, typically used to program CPUs and GPUs. Although HLS solutions make programming easier, they fail to also lighten the burden of optimization. Application developers must rely on expert knowledge to manually optimize their applications for each target device, meaning that traditional HLS solutions do not offer a solution to the issue of performance portability. This state of fact prompted the development of compiler frameworks such as TyTra that operate at an even higher level of abstraction that is amenable to the use of Design Space Exploration (DSE). With DSE the initial program specification can be seen as the starting location in a search-space of correct-by-construction program transformations. In TyTra the search-space is generated from the transitive-closure of term-level transformations derived from type-level transformations. Compiler frameworks such as TyTra theoretically solve the issue of performance portability by providing a way to automatically generate alternative correct program variants. They however suffer from the very practical issue that the generated space is often too large to fully explore. As a consequence, the globally optimal solution may be overlooked.
In this work we provide a novel solution to issue performance portability by deriving an efficient yet effective DSE strategy for the TyTra compiler framework. We make use of categorical data types to derive categorical semantics for the formal languages that describe the terms, types, cost-performance estimates and their transformations. From these we define a category of interpretations for TyTra applications, from which we derive a DSE strategy that finds the globally optimal transformation sequence in polynomial time. This is achieved by reducing the size of the generated search space. We formally state and prove a theorem for this claim and then show that the polynomial run-time for our DSE strategy has practically negligible coefficients leading to sub-second exploration times for realistic applications
Finding parallel functional pearls : automatic parallel recursion scheme detection in Haskell functions via anti-unification
This work has been partially supported by the EU H2020 grant âRePhrase: Refactoring Parallel Heterogeneous Resource-Aware Applicationsâa Software Engineering Approachâ (ICT-644235), by COST Action IC1202 (TACLe), supported by COST (European Cooperation in Science and Technology) , by EPSRC grant âDiscovery: Pattern Discovery and Program Shaping for Manycore Systemsâ (EP/P020631/1), and by Scottish Enterprise PS7305CA44.This paper describes a new technique for identifying potentially parallelisable code structures in functional programs. Higher-order functions enable simple and easily understood abstractions that can be used to implement a variety of common recursion schemes, such as maps and folds over traversable data structures. Many of these recursion schemes have natural parallel implementations in the form of algorithmic skeletons. This paper presents a technique that detects instances of potentially parallelisable recursion schemes in Haskell 98 functions. Unusually, we exploit anti-unification to expose these recursion schemes from source-level definitions whose structures match a recursion scheme, but which are not necessarily written directly in terms of maps, folds, etc. This allows us to automatically introduce parallelism, without requiring the programmer to structure their code a priori in terms of specific higher-order functions. We have implemented our approach in the Haskell refactoring tool, HaRe, and demonstrated its use on a range of common benchmarking examples. Using our technique, we show that recursion schemes can be easily detected, that parallel implementations can be easily introduced, and that we can achieve real parallel speedups (up to 23 . 79 Ă the sequential performance on 28 physical cores, or 32 . 93 Ă the sequential performance with hyper-threading enabled).PostprintPeer reviewe
AutoPar: automating the parallelization of functional programs
As the pervasiveness of parallel architectures in computing increases, so does the need for efficiently implemented parallel software. However, the development of parallel software is inherently more difficult than that of sequential software and is fraught with many pitfalls, such as race conditions and locking issues, amongst others. Developers are typically more comfortable developing sequentially, yet as the limitations of single-core processor speeds are reached, they have no choice but to reach for parallel implementations to obtain the required performance increases.
An obvious solution to the parallelisation problem is to allow developers to continue to develop sequentially and generate efficient parallel programs automatically from these sequential ones. There are many existing techniques
which automate the parallelisation process, however these techniques place many constraints upon the programs they are applicable to.
This thesis defines a fully automatic parallelisation technique which places no restriction on its input programs and is applicable to programs defined using any data-type. The technique consists of two components: the first allows a given program to be redefined in terms of well-partitioned data. The second then explicitly parallelises the resulting program using Glasgow parallel Haskell.
The technique is applied to several Haskell programs, the results of which have then been benchmarked with respect to the performance of handparallelised versions of the original programs. The benchmarking process has recorded the execution time and parallel performance of each benchmark program. The evaluation of the benchmark results has allowed for the merit of the automated parallelisation technique to be shown
Symmetric Edit Lenses: A New Foundation for Bidirectional Languages
Lenses are bidirectional transformations between pairs of connected structures capable of translating an edit on one structure into an edit on the other. Most of the extensive existing work on lenses has focused on the special case of asymmetric lenses, where one structures is taken as primary and the other is thought of as a projection or view. Some symmetric variants exist, where each structure contains information not present in the other, but these all lack the basic operation of composition. Additionally, existing accounts do not represent edits carefully, making incremental operation difficult or producing unsatisfactory synchronization candidates. We present a new symmetric formulation which works with descriptions of changes to structures, rather than with the structures themselves. We construct a semantic space of edit lenses between âeditable structuresââmonoids of edits with a partial monoid action for applying editsâwith natural laws governing their behavior. We present generalizations of a number of known constructions on asymmetric lenses and settle some longstanding questions about their propertiesâin particular, we prove the existence of (symmetric monoidal) tensor products and sums and the non-existence of full categorical products and sums in a category of lenses. Universal algebra shows how to build iterator lenses for structured data such as lists and trees, yielding lenses for operations like mapping, filtering, and concatenation from first principles. More generally, we provide mapping combinators based on the theory of containers. Finally, we present a prototype implementation of the core theory and take a first step in addressing the challenge of translating between user gestures and the internal representation of edits