654 research outputs found
Towards an Adaptive Skeleton Framework for Performance Portability
The proliferation of widely available, but very different, parallel architectures
makes the ability to deliver good parallel performance
on a range of architectures, or performance portability, highly desirable.
Irregularly-parallel problems, where the number and size
of tasks is unpredictable, are particularly challenging and require
dynamic coordination.
The paper outlines a novel approach to delivering portable parallel
performance for irregularly parallel programs. The approach
combines declarative parallelism with JIT technology, dynamic
scheduling, and dynamic transformation.
We present the design of an adaptive skeleton library, with a task
graph implementation, JIT trace costing, and adaptive transformations.
We outline the architecture of the protoype adaptive skeleton
execution framework in Pycket, describing tasks, serialisation,
and the current scheduler.We report a preliminary evaluation of the
prototype framework using 4 micro-benchmarks and a small case
study on two NUMA servers (24 and 96 cores) and a small cluster
(17 hosts, 272 cores). Key results include Pycket delivering good
sequential performance e.g. almost as fast as C for some benchmarks;
good absolute speedups on all architectures (up to 120 on
128 cores for sumEuler); and that the adaptive transformations do
improve performance
On Designing Multicore-aware Simulators for Biological Systems
The stochastic simulation of biological systems is an increasingly popular
technique in bioinformatics. It often is an enlightening technique, which may
however result in being computational expensive. We discuss the main
opportunities to speed it up on multi-core platforms, which pose new challenges
for parallelisation techniques. These opportunities are developed in two
general families of solutions involving both the single simulation and a bulk
of independent simulations (either replicas of derived from parameter sweep).
Proposed solutions are tested on the parallelisation of the CWC simulator
(Calculus of Wrapped Compartments) that is carried out according to proposed
solutions by way of the FastFlow programming framework making possible fast
development and efficient execution on multi-cores.Comment: 19 pages + cover pag
Understanding Spark System Performance for Image Processing in a Heterogeneous Commodity Cluster
In recent years, Apache Spark has seen a widespread adoption in industries and institutions due to its
cache mechanism for faster Big Data analytics. However, the speed advantage Spark provides, especially in
a heterogeneous cluster environment, is not obtainable out-of-the-box; it requires the right combination of
configuration parameters from the myriads of parameters provided by Spark developers. Recognizing this
challenge, this thesis undertakes a study to provide insight on Spark performance particularly, regarding
the impact of choice parameter settings. These are parameters that are critical to fast job completion and
effective utilization of resources.
To this end, the study focuses on two specific example applications namely, flowerCounter and imageClustering,
for processing still image datasets of Canola plants collected during the Summer of 2016 from selected
plot fields using timelapse cameras in a heterogeneous Spark-clustered environments. These applications
were of initial interest to the Plant Phenotyping and Imaging Research Centre (P2IRC) at the University of
Saskatchewan. The P2IRC is responsible for developing systems that will aid fast analysis of large-scale seed
breeding to ensure global food security. The flowerCounter application estimates the count of flowers from
the images while the imageClustering application clusters images based on physical plant attributes. Two
clusters are used for the experiments: a 12-node and 3-node cluster (including a master node), with Hadoop
Distributed File System (HDFS) as the storage medium for the image datasets.
Experiments with the two case study applications demonstrate that increasing the number of tasks does
not always speed-up job processing due to increased communication overheads. Findings from other experiments
show that numerous tasks with one core per executor and small allocated memory limits parallelism
within an executor and result in inefficient use of cluster resources. Executors with large CPU and memory,
on the other hand, do not speed-up analytics due to processing delays and threads concurrency. Further
experimental results indicate that application processing time depends on input data storage in conjunction
with locality levels and executor run time is largely dominated by the disk I/O time especially, the read
time cost. With respect to horizontal node scaling, Spark scales with increasing homogeneous computing
nodes but the speed-up degrades with heterogeneous nodes. Finally, this study shows that the effectiveness
of speculative tasks execution in mitigating the impact of slow nodes varies for the applications
Implementation and Evaluation of Algorithmic Skeletons: Parallelisation of Computer Algebra Algorithms
This thesis presents design and implementation approaches for the parallel algorithms of computer algebra. We use algorithmic skeletons and also further approaches, like data parallel arithmetic and actors. We have implemented skeletons for divide and conquer algorithms and some special parallel loops, that we call ârepeated computation with a possibility of premature terminationâ. We introduce in this thesis a rational data parallel arithmetic. We focus on parallel symbolic computation algorithms, for these algorithms our arithmetic provides a generic parallelisation approach.
The implementation is carried out in Eden, a parallel functional programming language based on Haskell. This choice enables us to encode both the skeletons and the programs in the same language. Moreover, it allows us to refrain from using two different languagesâone for the implementation and one for the interfaceâfor our implementation of computer algebra algorithms.
Further, this thesis presents methods for evaluation and estimation of parallel execution times. We partition the parallel execution time into two components. One of them accounts for the quality of the parallelisation, we call it the âparallel penaltyâ. The other is the sequential execution time. For the estimation, we predict both components separately, using statistical methods. This enables very confident estimations, although using drastically less measurement points than other methods. We have applied both our evaluation and estimation approaches to the parallel programs presented in this thesis. We haven also used existing estimation methods.
We developed divide and conquer skeletons for the implementation of fast parallel multiplication. We have implemented the Karatsuba algorithm, Strassenâs matrix multiplication algorithm and the fast Fourier transform. The latter was used to implement polynomial convolution that leads to a further fast multiplication algorithm. Specially for our implementation of Strassen algorithm we have designed and implemented a divide and conquer skeleton basing on actors. We have implemented the parallel fast Fourier transform, and not only did we use new divide and conquer skeletons, but also developed a map-and-transpose skeleton. It enables good parallelisation of the Fourier transform. The parallelisation of Karatsuba multiplication shows a very good performance. We have analysed the parallel penalty of our programs and compared it to the serial fractionâan approach, known from literature. We also performed execution time estimations of our divide and conquer programs.
This thesis presents a parallel map+reduce skeleton scheme. It allows us to combine the usual parallel map skeletons, like parMap, farm, workpool, with a premature termination property. We use this to implement the so-called âparallel repeated computationâ, a special form of a speculative parallel loop. We have implemented two probabilistic primality tests: the RabinâMiller test and the Jacobi sum test. We parallelised both with our approach. We analysed the task distribution and stated the fitting configurations of the Jacobi sum test. We have shown formally that the Jacobi sum test can be implemented in parallel. Subsequently, we parallelised it, analysed the load balancing issues, and produced an optimisation. The latter enabled a good implementation, as verified using the parallel penalty. We have also estimated the performance of the tests for further input sizes and numbers of processing elements. Parallelisation of the Jacobi sum test and our generic parallelisation scheme for the repeated computation is our original contribution.
The data parallel arithmetic was defined not only for integers, which is already known, but also for rationals. We handled the common factors of the numerator or denominator of the fraction with the modulus in a novel manner. This is required to obtain a true multiple-residue arithmetic, a novel result of our research. Using these mathematical advances, we have parallelised the determinant computation using the GauĂ elimination. As always, we have performed task distribution analysis and estimation of the parallel execution time of our implementation. A similar computation in Maple emphasised the potential of our approach. Data parallel arithmetic enables parallelisation of entire classes of computer algebra algorithms.
Summarising, this thesis presents and thoroughly evaluates new and existing design decisions for high-level parallelisations of computer algebra algorithms
The parallel event loop model and runtime: a parallel programming model and runtime system for safe event-based parallel programming
Recent trends in programming models for server-side development have shown an increasing popularity of event-based single- threaded programming models based on the combination of dynamic languages such as JavaScript and event-based runtime systems for asynchronous I/O management such as Node.JS. Reasons for the success of such models are the simplicity of the single-threaded event-based programming model as well as the growing popularity of the Cloud as a deployment platform for Web applications. Unfortunately, the popularity of single-threaded models comes at the price of performance and scalability, as single-threaded event-based models present limitations when parallel processing is needed, and traditional approaches to concurrency such as threads and locks don't play well with event-based systems. This dissertation proposes a programming model and a runtime system to overcome such limitations by enabling single-threaded event-based applications with support for speculative parallel execution. The model, called Parallel Event Loop, has the goal of bringing parallel execution to the domain of single-threaded event-based programming without relaxing the main characteristics of the single-threaded model, and therefore providing developers with the impression of a safe, single-threaded, runtime. Rather than supporting only pure single-threaded programming, however, the parallel event loop can also be used to derive safe, high-level, parallel programming models characterized by a strong compatibility with single-threaded runtimes. We describe three distinct implementations of speculative runtimes enabling the parallel execution of event-based applications. The first implementation we describe is a pessimistic runtime system based on locks to implement speculative parallelization. The second and the third implementations are based on two distinct optimistic runtimes using software transactional memory. Each of the implementations supports the parallelization of applications written using an asynchronous single-threaded programming style, and each of them enables applications to benefit from parallel execution
Dependable mapreduce in a cloud-of-clouds
Tese de doutoramento, InformĂĄtica (Engenharia InformĂĄtica), Universidade de Lisboa, Faculdade de CiĂȘncias, 2017MapReduce is a simple and elegant programming model suitable for loosely coupled parallelization problemsâproblems that can be decomposed into subproblems. Hadoop MapReduce has become the most popular framework for performing large-scale computation on off-the-shelf clusters, and it is widely used to process these problems in a parallel and distributed fashion. This framework is highly scalable, can deal efficiently with large volumes of unstructured data, and it is a platform for many other applications. However, the framework has limitations concerning dependability. Namely, it is solely prepared to tolerate crash faults by re-executing tasks in case of failure, and to detect file corruptions using file checksums. Unfortunately, there is evidence that arbitrary faults do occur and can affect the correctness of MapReduce execution. Although such Byzantine faults are considered to be rare, particular MapReduce applications are critical and intolerant to this type of fault. Furthermore, typical MapReduce implementations are constrained to a single cloud environment. This is a problem as there is increasing evidence of outages on major cloud offerings, raising concerns about the dependence on a single cloud. In this thesis, I propose techniques to improve the dependability of MapReduce systems. The proposed solutions allow MapReduce to scale out computations to a multi-cloud environment, or cloud of-clouds, to tolerate arbitrary and malicious faults and cloud outages. The proposals have three important properties: they increase the dependability of MapReduce by tolerating the faults mentioned above; they require minimal or no modifications to usersâ applications; and they achieve this increased level of fault tolerance at reasonable cost. To achieve these goals, I introduce three key ideas: minimizing the required replication; applying context-based job scheduling based on cloud and network conditions; and performing fine-grained replication. I evaluated all proposed solutions in real testbed environments running typical MapReduce applications. The results demonstrate interesting trade-offs concerning resilience and performance when compared to traditional methods. The fundamental conclusion is that the cost introduced by our solutions is small, and thus deemed acceptable for many critical applications.O MapReduce Ă© um modelo de programação adequado para processar grandes volumes de dados em paralelo, executando um conjunto de tarefas independentes, e combinando os resultados parciais na solução final. OHadoop MapReduce Ă© uma plataforma popular para processar grandes quantidades de dados de forma paralela e distribuĂda. Do ponto de vista da confiabilidade, a plataforma estĂĄ preparada exclusivamente para tolerar faltas de paragem, re-executando tarefas, e detectar corrupçÔes de ficheiros usando somas de verificação. Esta Ă© uma importante limitação dado haver evidĂȘncia de que faltas arbitrĂĄrias ocorrem e podem afetar a execução do MapReduce. Embora estas faltas Bizantinas sejam raras, certas aplicaçÔes de MapReduce sĂŁo crĂticas e nĂŁo toleram faltas deste tipo. AlĂ©m disso, o nĂșmero de ocorrĂȘncias de interrupçÔes em infraestruturas da nuvem tem vindo a aumentar ao longo dos anos, levantando preocupaçÔes sobre a dependĂȘncia dos clientes num fornecedor Ășnico de serviços de nuvem. Nesta tese proponho vĂĄrias tĂ©cnicas para melhorar a confiabilidade do sistema MapReduce. As soluçÔes propostas permitem processar tarefas MapReduce num ambiente de mĂșltiplas nuvens para tolerar faltas arbitrĂĄrias, maliciosas e faltas de paragem nas nuvens. Estas soluçÔes oferecem trĂȘs importantes propriedades: toleram os tipos de faltas mencionadas; nĂŁo exigem modificaçÔes Ă s aplicaçÔes dos clientes; alcançam esta tolerĂąncia a faltas a um custo razoĂĄvel. Estas tĂ©cnicas sĂŁo baseadas nas seguintes ideias: minimizar a replicação, desenvolver algoritmos de escalonamento para o MapReduce baseados nas condiçÔes da nuvem e da rede, e criar um sistema de tolerĂąncia a faltas com granularidade fina no que respeita Ă replicação. Avaliei as minhas propostas em ambientes de teste real com aplicaçÔes comuns do MapReduce, que me permite demonstrar compromissos interessantes em termos de resiliĂȘncia e desempenho, quando comparados com mĂ©todos tradicionais. Em particular, os resultados mostram que o custo introduzido pelas soluçÔes sĂŁo aceitĂĄveis para muitas aplicaçÔes crĂticas
Automatic skeleton-driven performance optimizations for transactional memory
The recent shift toward multi -core chips has pushed the burden of extracting performance to the programmer. In fact, programmers now have to be able to uncover more
coarse -grain parallelism with every new generation of processors, or the performance
of their applications will remain roughly the same or even degrade. Unfortunately,
parallel programming is still hard and error prone. This has driven the development of
many new parallel programming models that aim to make this process efficient.This thesis first combines the skeleton -based and transactional memory programming models in a new framework, called OpenSkel, in order to improve performance
and programmability of parallel applications. This framework provides a single skeleton that allows the implementation of transactional worklist applications. Skeleton or
pattern-based programming allows parallel programs to be expressed as specialized instances of generic communication and computation patterns. This leaves the programmer with only the implementation of the particular operations required to solve the
problem at hand. Thus, this programming approach simplifies parallel programming
by eliminating some of the major challenges of parallel programming, namely thread
communication, scheduling and orchestration. However, the application programmer
has still to correctly synchronize threads on data races. This commonly requires the
use of locks to guarantee atomic access to shared data. In particular, lock programming
is vulnerable to deadlocks and also limits coarse grain parallelism by blocking threads
that could be potentially executed in parallel.Transactional Memory (TM) thus emerges as an attractive alternative model to simplify parallel programming by removing this burden of handling data races explicitly.
This model allows programmers to write parallel code as transactions, which are then
guaranteed by the runtime system to execute atomically and in isolation regardless of
eventual data races. TM programming thus frees the application from deadlocks and
enables the exploitation of coarse grain parallelism when transactions do not conflict
very often. Nevertheless, thread management and orchestration are left for the application programmer. Fortunately, this can be naturally handled by a skeleton framework.
This fact makes the combination of skeleton -based and transactional programming a
natural step to improve programmability since these models complement each other.
In fact, this combination releases the application programmer from dealing with thread
management and data races, and also inherits the performance improvements of both
models. In addition to it, a skeleton framework is also amenable to skeleton - driven
iii
performance optimizations that exploits the application pattern and system information.This thesis thus also presents a set of pattern- oriented optimizations that are automatically selected and applied in a significant subset of transactional memory applications that shares a common pattern called worklist. These optimizations exploit the
knowledge about the worklist pattern and the TM nature of the applications to avoid
transaction conflicts, to prefetch data, to reduce contention etc. Using a novel autotuning mechanism, OpenSkel dynamically selects the most suitable set of these patternoriented performance optimizations for each application and adjusts them accordingly.
Experimental results on a subset of five applications from the STAMP benchmark suite
show that the proposed autotuning mechanism can achieve performance improvements
within 2 %, on average, of a static oracle for a 16 -core UMA (Uniform Memory Access) platform and surpasses it by 7% on average for a 32 -core NUMA (Non -Uniform
Memory Access) platform.Finally, this thesis also investigates skeleton -driven system- oriented performance
optimizations such as thread mapping and memory page allocation. In order to do
it, the OpenSkel system and also the autotuning mechanism are extended to accommodate these optimizations. The conducted experimental results on a subset of five
applications from the STAMP benchmark show that the OpenSkel framework with the
extended autotuning mechanism driving both pattern and system- oriented optimizations can achieve performance improvements of up to 88 %, with an average of 46 %,
over a baseline version for a 16 -core UMA platform and up to 162 %, with an average
of 91 %, for a 32 -core NUMA platform
- âŠ