10 research outputs found
Architecture aware parallel programming in Glasgow parallel Haskell (GPH)
General purpose computing architectures are evolving quickly to become manycore
and hierarchical: i.e. a core can communicate more quickly locally than
globally. To be effective on such architectures, programming models must be
aware of the communications hierarchy. This thesis investigates a programming
model that aims to share the responsibility of task placement, load balance, thread
creation, and synchronisation between the application developer and the runtime
system.
The main contribution of this thesis is the development of four new architectureaware
constructs for Glasgow parallel Haskell that exploit information about task
size and aim to reduce communication for small tasks, preserve data locality, or to
distribute large units of work. We define a semantics for the constructs that specifies the sets of PEs that each construct identifies, and we check four properties
of the semantics using QuickCheck.
We report a preliminary investigation of architecture aware programming
models that abstract over the new constructs. In particular, we propose architecture
aware evaluation strategies and skeletons. We investigate three common
paradigms, such as data parallelism, divide-and-conquer and nested parallelism,
on hierarchical architectures with up to 224 cores. The results show that the
architecture-aware programming model consistently delivers better speedup and
scalability than existing constructs, together with a dramatic reduction in the
execution time variability.
We present a comparison of functional multicore technologies and it reports
some of the first ever multicore results for the Feedback Directed Implicit Parallelism
(FDIP) and the semi-explicit parallelism (GpH and Eden) languages. The
comparison reflects the growing maturity of the field by systematically evaluating
four parallel Haskell implementations on a common multicore architecture.
The comparison contrasts the programming effort each language requires with
the parallel performance delivered.
We investigate the minimum thread granularity required to achieve satisfactory
performance for three implementations parallel functional language on a
multicore platform. The results show that GHC-GUM requires a larger thread
granularity than Eden and GHC-SMP. The thread granularity rises as the number
of cores rises
Parallel evaluation strategies for lazy data structures in Haskell
Conventional parallel programming is complex and error prone. To improve programmer
productivity, we need to raise the level of abstraction with a higher-level
programming model that hides many parallel coordination aspects. Evaluation
strategies use non-strictness to separate the coordination and computation aspects
of a Glasgow parallel Haskell (GpH) program. This allows the specification of high
level parallel programs, eliminating the low-level complexity of synchronisation and
communication associated with parallel programming.
This thesis employs a data-structure-driven approach for parallelism derived through
generic parallel traversal and evaluation of sub-components of data structures. We
focus on evaluation strategies over list, tree and graph data structures, allowing
re-use across applications with minimal changes to the sequential algorithm.
In particular, we develop novel evaluation strategies for tree data structures, using
core functional programming techniques for coordination control, achieving more
flexible parallelism. We use non-strictness to control parallelism more flexibly. We
apply the notion of fuel as a resource that dictates parallelism generation, in particular,
the bi-directional flow of fuel, implemented using a circular program definition,
in a tree structure as a novel way of controlling parallel evaluation. This is the first
use of circular programming in evaluation strategies and is complemented by a lazy
function for bounding the size of sub-trees.
We extend these control mechanisms to graph structures and demonstrate performance
improvements on several parallel graph traversals. We combine circularity
for control for improved performance of strategies with circularity for computation
using circular data structures. In particular, we develop a hybrid traversal strategy
for graphs, exploiting breadth-first order for exposing parallelism initially, and
then proceeding with a depth-first order to minimise overhead associated with a full
parallel breadth-first traversal.
The efficiency of the tree strategies is evaluated on a benchmark program, and
two non-trivial case studies: a Barnes-Hut algorithm for the n-body problem and
sparse matrix multiplication, both using quad-trees. We also evaluate a graph search
algorithm implemented using the various traversal strategies.
We demonstrate improved performance on a server-class multicore machine with
up to 48 cores, with the advanced fuel splitting mechanisms proving to be more
flexible in throttling parallelism. To guide the behaviour of the strategies, we develop
heuristics-based parameter selection to select their specific control parameters
Reliable massively parallel symbolic computing : fault tolerance for a distributed Haskell
As the number of cores in manycore systems grows exponentially, the number of failures is
also predicted to grow exponentially. Hence massively parallel computations must be able to
tolerate faults. Moreover new approaches to language design and system architecture are needed
to address the resilience of massively parallel heterogeneous architectures.
Symbolic computation has underpinned key advances in Mathematics and Computer Science,
for example in number theory, cryptography, and coding theory. Computer algebra software
systems facilitate symbolic mathematics. Developing these at scale has its own distinctive
set of challenges, as symbolic algorithms tend to employ complex irregular data and control
structures. SymGridParII is a middleware for parallel symbolic computing on massively parallel
High Performance Computing platforms. A key element of SymGridParII is a domain specific
language (DSL) called Haskell Distributed Parallel Haskell (HdpH). It is explicitly designed for
scalable distributed-memory parallelism, and employs work stealing to load balance dynamically
generated irregular task sizes.
To investigate providing scalable fault tolerant symbolic computation we design, implement
and evaluate a reliable version of HdpH, HdpH-RS. Its reliable scheduler detects and handles
faults, using task replication as a key recovery strategy. The scheduler supports load balancing
with a fault tolerant work stealing protocol. The reliable scheduler is invoked with two fault
tolerance primitives for implicit and explicit work placement, and 10 fault tolerant parallel
skeletons that encapsulate common parallel programming patterns. The user is oblivious to
many failures, they are instead handled by the scheduler.
An operational semantics describes small-step reductions on states. A simple abstract machine
for scheduling transitions and task evaluation is presented. It defines the semantics of
supervised futures, and the transition rules for recovering tasks in the presence of failure. The
transition rules are demonstrated with a fault-free execution, and three executions that recover
from faults.
The fault tolerant work stealing has been abstracted in to a Promela model. The SPIN
model checker is used to exhaustively search the intersection of states in this automaton to
validate a key resiliency property of the protocol. It asserts that an initially empty supervised
future on the supervisor node will eventually be full in the presence of all possible combinations
of failures.
The performance of HdpH-RS is measured using five benchmarks. Supervised scheduling
achieves a speedup of 757 with explicit task placement and 340 with lazy work stealing when
executing Summatory Liouville up to 1400 cores of a HPC architecture. Moreover, supervision
overheads are consistently low scaling up to 1400 cores. Low recovery overheads are observed in
the presence of frequent failure when lazy on-demand work stealing is used. A Chaos Monkey
mechanism has been developed for stress testing resiliency with random failure combinations.
All unit tests pass in the presence of random failure, terminating with the expected results
Structured Parallelism by Composition - Design and implementation of a framework supporting skeleton compositionality
This thesis is dedicated to the efficient compositionality of algorithmic skeletons, which are abstractions of common parallel programming patterns. Skeletons can be implemented in the functional parallel language Eden as mere parallel higher order functions. The use of algorithmic skeletons facilitates parallel programming massively. This is because they already implement the tedious details of parallel programming and can be specialised for concrete applications by providing problem specific functions and parameters. Efficient skeleton compositionality is of particular importance because complex, specialised skeletons can be compound of simpler base skeletons. The resulting modularity is especially important for the context of functional programming and should not be missing in a functional language. We subdivide composition into three categories:
-Nesting: A skeleton is instantiated from another skeleton instance. Communication is tree shaped, along the call hierarchy. This is directly supported by Eden.
-Composition in sequence: The result of a skeleton is the input for a succeeding skeleton. Function composition is expressed in Eden by the ( . ) operator. For performance reasons the processes of both skeletons should be able to exchange results directly instead of using the indirection via the caller process. We therefore introduce the remote data concept.
-Iteration: A skeleton is called in sequence a variable number of times. This can be defined using recursion and composition in sequence. We optimise the number of skeleton instances, the communication in between the iteration steps and the control of the loop. To this end, we developed an iteration framework where iteration skeletons are composed from control and body skeletons.
Central to our composition concept is remote data. We send a remote data handle instead of ordinary data, the data handle is used at its destination to request the referenced data. Remote data can be used inside arbitrary container types for efficient skeleton composition similar to ordinary distributed data types. The free combinability of remote data with arbitrary container types leads to a high degree of flexibility. The programmer is not restricted by using a predefined set of distributed data types and (re-)distribution functions. Moreover, he can use remote data with arbitrary container types to elegantly create process topologies.
For the special case of skeleton iteration we prevent the repeated construction and deconstruction of skeleton instances for each single iteration step, which is common for the recursive use of skeletons. This minimises the parallel overhead for process and channel creation and allows to keep data local on persistent processes. To this end we provide a skeleton framework. This concept is independent of remote data, however the use of remote data in combination with the iteration framework makes the framework more flexible.
For our case studies, both approaches perform competitively compared to programs with identical parallel structure but which are implemented using monolithic skeletons - i.e. skeleton not composed from simpler ones.
Further, we present extensions of Eden which enhance composition support: generalisation of overloaded communication, generalisation of process instantiation, compositional process placement and extensions of Box types used to adapt communication behaviour
Implementation and Evaluation of Algorithmic Skeletons: Parallelisation of Computer Algebra Algorithms
This thesis presents design and implementation approaches for the parallel algorithms of computer algebra. We use algorithmic skeletons and also further approaches, like data parallel arithmetic and actors. We have implemented skeletons for divide and conquer algorithms and some special parallel loops, that we call ‘repeated computation with a possibility of premature termination’. We introduce in this thesis a rational data parallel arithmetic. We focus on parallel symbolic computation algorithms, for these algorithms our arithmetic provides a generic parallelisation approach.
The implementation is carried out in Eden, a parallel functional programming language based on Haskell. This choice enables us to encode both the skeletons and the programs in the same language. Moreover, it allows us to refrain from using two different languages—one for the implementation and one for the interface—for our implementation of computer algebra algorithms.
Further, this thesis presents methods for evaluation and estimation of parallel execution times. We partition the parallel execution time into two components. One of them accounts for the quality of the parallelisation, we call it the ‘parallel penalty’. The other is the sequential execution time. For the estimation, we predict both components separately, using statistical methods. This enables very confident estimations, although using drastically less measurement points than other methods. We have applied both our evaluation and estimation approaches to the parallel programs presented in this thesis. We haven also used existing estimation methods.
We developed divide and conquer skeletons for the implementation of fast parallel multiplication. We have implemented the Karatsuba algorithm, Strassen’s matrix multiplication algorithm and the fast Fourier transform. The latter was used to implement polynomial convolution that leads to a further fast multiplication algorithm. Specially for our implementation of Strassen algorithm we have designed and implemented a divide and conquer skeleton basing on actors. We have implemented the parallel fast Fourier transform, and not only did we use new divide and conquer skeletons, but also developed a map-and-transpose skeleton. It enables good parallelisation of the Fourier transform. The parallelisation of Karatsuba multiplication shows a very good performance. We have analysed the parallel penalty of our programs and compared it to the serial fraction—an approach, known from literature. We also performed execution time estimations of our divide and conquer programs.
This thesis presents a parallel map+reduce skeleton scheme. It allows us to combine the usual parallel map skeletons, like parMap, farm, workpool, with a premature termination property. We use this to implement the so-called ‘parallel repeated computation’, a special form of a speculative parallel loop. We have implemented two probabilistic primality tests: the Rabin–Miller test and the Jacobi sum test. We parallelised both with our approach. We analysed the task distribution and stated the fitting configurations of the Jacobi sum test. We have shown formally that the Jacobi sum test can be implemented in parallel. Subsequently, we parallelised it, analysed the load balancing issues, and produced an optimisation. The latter enabled a good implementation, as verified using the parallel penalty. We have also estimated the performance of the tests for further input sizes and numbers of processing elements. Parallelisation of the Jacobi sum test and our generic parallelisation scheme for the repeated computation is our original contribution.
The data parallel arithmetic was defined not only for integers, which is already known, but also for rationals. We handled the common factors of the numerator or denominator of the fraction with the modulus in a novel manner. This is required to obtain a true multiple-residue arithmetic, a novel result of our research. Using these mathematical advances, we have parallelised the determinant computation using the Gauß elimination. As always, we have performed task distribution analysis and estimation of the parallel execution time of our implementation. A similar computation in Maple emphasised the potential of our approach. Data parallel arithmetic enables parallelisation of entire classes of computer algebra algorithms.
Summarising, this thesis presents and thoroughly evaluates new and existing design decisions for high-level parallelisations of computer algebra algorithms
Adaptive architecture-transparent policy control in a distributed graph reducer
The end of the frequency scaling era occured around 2005 as the clock frequency
has stalled for commodity architectures. Thus performance improvements that could
in the past be expected with each new hardware generation needed to originate
elsewhere. Almost all computer architectures exhibit substantial and growing levels
of parallelism, exploiting which became one of the key sources of performance and
scalability improvements. Alas, parallel programming proved much more difficult
than sequential, due to the need to specify coordination and parallelism management
aspects. Whilst low-level languages place the burden on the programmers reducing
productivity and portability, semi-implicit approaches delegate the responsibility to
sophisticated compilers and run-time systems.
This thesis presents a study of adaptive load distribution based on work stealing
using history and ancestry information in a distributed graph reducer for a nonstrict functional language. The results contribute to the exploration of more flexible
run-time-system-level parallelism control implementing a semi-explicit model of parallelism, which offers productivity and high level of abstraction by delegating the
responsibility of coordination to the run-time system.
After characterising a set of parallel functional applications, we study the use of
historical information to adapt the choice of the victim to steal from in a work stealing scheduler. We observe substantially lower numbers of messages for data-parallel
and nested applications. However, this heuristic fails in cases where past application behaviour is not resembling future behaviour, for instance for Divide-&-Conquer
applications with a large number of very fine-grained threads and generators of parallelism that move dynamically across processing elements. This mechanism is not
specific to the language and the run-time system, and applies to other work stealing
schedulers.
Next, we focus on the other key work stealing decision of which sparks that represent potential parallelism to donate, investigating the effect of Spark Colocation
on the performance of five Divide-&-Conquer programs run on a cluster of up to
256 PEs. When using Spark Colocation, the distributed graph reducer shares related
work resulting in a higher degree of both potential and actual parallelism, and more
fine-grained and less variable thread size. We validate this behaviour by observing
a reduction in average fetch times, but increased amounts of FETCH messages and
of inter-PE pointers for colocation, which nevertheless results in improved load balance for three of the five benchmark programs. The results show high speedups and
speedup improvements for Spark Colocation for the three more regular and nested
applications and performance degradation for two programs: one that is excessively
fine-grained and one exhibiting limited scalability. Overall, Spark Colocation appears most beneficial for higher numbers of PEs, where improved load balance and
higher degree of parallelism have more opportunities to pay off.
In more general terms, we show that a run-time system can beneficially use historical information on past stealing successes that is gathered dynamically and used
within the same run and the ancestry information dynamically reconstructed at run
time using annotations. Moreover, the results support the view that different heuristics are beneficial for applications using different parallelism patterns, underlining
the advantages of a flexible architecture-transparent approach.The Scottish Informatics and Computer Science Alliance (SICSA
Distributing abstract machines
Today's distributed programs are often written using either explicit message passing or Remote Procedure Calls (RPCs) that are not natively integrated in the language. It is difficult to establish the correctness of programs written this way compared to programs written for a single computer.
We propose a generalisation of RPCs that are natively integrated in a functional programming language meaning that they have support for higher-order calls across node boundaries. Our focus is on how such languages can be compiled correctly and efficiently.
We present four different solutions. Two of them are based on interaction semantics --- the Geometry of Interaction and game semantics --- and two are extensions of conventional abstract machines --- the Krivine machine and the SECD machine. To target as general distributed systems as possible our solutions support RPCs without sending code.
We prove the correctness of the abstract machines with respect to their single-node execution, and show their viability for use for compilation by implementing prototype compilers based on them. The conventionally based machines are shown to enable efficient programs.
Our intention is that these abstract machines can form the foundation for future programming languages that use the idea of higher-order RPCs
Datenparallele algorithmische Skelette:Erweiterungen und Anwendungen der Münster Skelettbibliothek Muesli
Die Arbeit thematisiert den datenparallelen Bestandteil der Münster Skelettbibliothek Muesli und beschreibt neben einer Reihe implementierter Erweiterungen auch mit Muesli parallelisierte Anwendungen. Eine der wichtigsten Neuerungen ist die Unterstützung von Mehrkernprozessoren durch die Verwendung von OpenMP, infolgedessen mit Muesli entwickelte Programme auch auf Parallelrechnern mit hybrider Speicherarchitektur skalieren. Eine zusätzliche Erweiterung stellt die Neuentwicklung einer verteilten Datenstruktur für dünnbesetzte Matrizen dar. Letztere implementiert ein flexibles Designkonzept, was die Verwendung benutzerdefinierter Kompressions- sowie Lastverteilungsmechanismen ermöglicht. Darüber hinaus werden mit dem LM OSEM-Algorithmus und den ART 2-Netzen zwei Anwendungen vorgestellt, die mit Muesli parallelisiert wurden. Neben einer Beschreibung der Funktionsweise sowie der Eigenschaften und Konzepte von MPI und OpenMP wird darüber hinaus der aktuelle Forschungsstand skizziert. <br/