654 research outputs found

    Towards an Adaptive Skeleton Framework for Performance Portability

    The proliferation of widely available, but very different, parallel architectures makes the ability to deliver good parallel performance on a range of architectures, or performance portability, highly desirable. Irregularly parallel problems, where the number and size of tasks are unpredictable, are particularly challenging and require dynamic coordination. The paper outlines a novel approach to delivering portable parallel performance for irregularly parallel programs. The approach combines declarative parallelism with JIT technology, dynamic scheduling, and dynamic transformation. We present the design of an adaptive skeleton library, with a task graph implementation, JIT trace costing, and adaptive transformations. We outline the architecture of the prototype adaptive skeleton execution framework in Pycket, describing tasks, serialisation, and the current scheduler. We report a preliminary evaluation of the prototype framework using 4 micro-benchmarks and a small case study on two NUMA servers (24 and 96 cores) and a small cluster (17 hosts, 272 cores). Key results include Pycket delivering good sequential performance, e.g. almost as fast as C for some benchmarks; good absolute speedups on all architectures (up to 120 on 128 cores for sumEuler); and that the adaptive transformations do improve performance.

    On Designing Multicore-aware Simulators for Biological Systems

    The stochastic simulation of biological systems is an increasingly popular technique in bioinformatics. It is often an enlightening technique, but it can be computationally expensive. We discuss the main opportunities to speed it up on multi-core platforms, which pose new challenges for parallelisation techniques. These opportunities are developed in two general families of solutions involving both the single simulation and a bulk of independent simulations (either replicas or derived from a parameter sweep). The proposed solutions are tested by parallelising the CWC (Calculus of Wrapped Compartments) simulator with the FastFlow programming framework, which makes fast development and efficient execution on multi-cores possible. Comment: 19 pages + cover pag

    Understanding Spark System Performance for Image Processing in a Heterogeneous Commodity Cluster

    In recent years, Apache Spark has seen widespread adoption in industries and institutions due to its cache mechanism for faster Big Data analytics. However, the speed advantage Spark provides, especially in a heterogeneous cluster environment, is not obtainable out-of-the-box; it requires the right combination of configuration parameters from the myriad parameters provided by Spark developers. Recognizing this challenge, this thesis undertakes a study to provide insight into Spark performance, particularly regarding the impact of the choice of parameter settings. These are parameters that are critical to fast job completion and effective utilization of resources. To this end, the study focuses on two specific example applications, namely flowerCounter and imageClustering, for processing still image datasets of Canola plants collected during the Summer of 2016 from selected plot fields using timelapse cameras, in a heterogeneous Spark-clustered environment. These applications were of initial interest to the Plant Phenotyping and Imaging Research Centre (P2IRC) at the University of Saskatchewan. The P2IRC is responsible for developing systems that will aid fast analysis of large-scale seed breeding to ensure global food security. The flowerCounter application estimates the count of flowers from the images, while the imageClustering application clusters images based on physical plant attributes. Two clusters are used for the experiments: a 12-node and a 3-node cluster (including a master node), with Hadoop Distributed File System (HDFS) as the storage medium for the image datasets. Experiments with the two case study applications demonstrate that increasing the number of tasks does not always speed up job processing, due to increased communication overheads. Findings from other experiments show that numerous tasks with one core per executor and small allocated memory limit parallelism within an executor and result in inefficient use of cluster resources.
Executors with large CPU and memory, on the other hand, do not speed up analytics, due to processing delays and thread concurrency. Further experimental results indicate that application processing time depends on input data storage in conjunction with locality levels, and that executor run time is largely dominated by the disk I/O time, especially the read time cost. With respect to horizontal node scaling, Spark scales with increasing homogeneous computing nodes, but the speed-up degrades with heterogeneous nodes. Finally, this study shows that the effectiveness of speculative task execution in mitigating the impact of slow nodes varies for the applications.

    Implementation and Evaluation of Algorithmic Skeletons: Parallelisation of Computer Algebra Algorithms

    This thesis presents design and implementation approaches for the parallel algorithms of computer algebra. We use algorithmic skeletons and also further approaches, like data parallel arithmetic and actors. We have implemented skeletons for divide and conquer algorithms and some special parallel loops that we call ‘repeated computation with a possibility of premature termination’. We introduce in this thesis a rational data parallel arithmetic. We focus on parallel symbolic computation algorithms; for these algorithms our arithmetic provides a generic parallelisation approach. The implementation is carried out in Eden, a parallel functional programming language based on Haskell. This choice enables us to encode both the skeletons and the programs in the same language. Moreover, it allows us to refrain from using two different languages (one for the implementation and one for the interface) for our implementation of computer algebra algorithms. Further, this thesis presents methods for evaluation and estimation of parallel execution times. We partition the parallel execution time into two components. One of them accounts for the quality of the parallelisation; we call it the ‘parallel penalty’. The other is the sequential execution time. For the estimation, we predict both components separately, using statistical methods. This enables very confident estimations, although using drastically fewer measurement points than other methods. We have applied both our evaluation and estimation approaches to the parallel programs presented in this thesis. We have also used existing estimation methods. We developed divide and conquer skeletons for the implementation of fast parallel multiplication. We have implemented the Karatsuba algorithm, Strassen’s matrix multiplication algorithm and the fast Fourier transform. The latter was used to implement polynomial convolution that leads to a further fast multiplication algorithm.
Especially for our implementation of the Strassen algorithm, we have designed and implemented a divide and conquer skeleton based on actors. We have implemented the parallel fast Fourier transform, and not only did we use new divide and conquer skeletons, but we also developed a map-and-transpose skeleton. It enables good parallelisation of the Fourier transform. The parallelisation of Karatsuba multiplication shows very good performance. We have analysed the parallel penalty of our programs and compared it to the serial fraction, an approach known from the literature. We also performed execution time estimations of our divide and conquer programs. This thesis presents a parallel map+reduce skeleton scheme. It allows us to combine the usual parallel map skeletons, like parMap, farm, workpool, with a premature termination property. We use this to implement the so-called ‘parallel repeated computation’, a special form of a speculative parallel loop. We have implemented two probabilistic primality tests: the Rabin–Miller test and the Jacobi sum test. We parallelised both with our approach. We analysed the task distribution and stated the fitting configurations of the Jacobi sum test. We have shown formally that the Jacobi sum test can be implemented in parallel. Subsequently, we parallelised it, analysed the load balancing issues, and produced an optimisation. The latter enabled a good implementation, as verified using the parallel penalty. We have also estimated the performance of the tests for further input sizes and numbers of processing elements. Parallelisation of the Jacobi sum test and our generic parallelisation scheme for the repeated computation are our original contributions. The data parallel arithmetic was defined not only for integers, which is already known, but also for rationals. We handled the common factors of the numerator or denominator of the fraction with the modulus in a novel manner.
This is required to obtain a true multiple-residue arithmetic, a novel result of our research. Using these mathematical advances, we have parallelised the determinant computation using Gauß elimination. As always, we have performed task distribution analysis and estimation of the parallel execution time of our implementation. A similar computation in Maple emphasised the potential of our approach. Data parallel arithmetic enables parallelisation of entire classes of computer algebra algorithms. Summarising, this thesis presents and thoroughly evaluates new and existing design decisions for high-level parallelisations of computer algebra algorithms.
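
The divide and conquer skeleton idea described in this abstract can be illustrated with a minimal sketch. It is written in Python rather than Eden, it runs sequentially, and every name in it (`divide_and_conquer`, `karatsuba`, the threshold of 16) is a hypothetical choice for illustration, not taken from the thesis. The skeleton fixes the recursion pattern; the caller supplies only the problem-specific operations, here those of Karatsuba multiplication.

```python
from typing import Callable, List, Tuple, TypeVar

P = TypeVar("P")  # problem type
S = TypeVar("S")  # solution type


def divide_and_conquer(
    is_trivial: Callable[[P], bool],
    solve: Callable[[P], S],
    divide: Callable[[P], List[P]],
    combine: Callable[[P, List[S]], S],
    problem: P,
) -> S:
    """Generic divide and conquer skeleton (sequential illustration)."""
    if is_trivial(problem):
        return solve(problem)
    # A parallel skeleton would evaluate these sub-results concurrently;
    # this sketch simply recurses one subproblem at a time.
    sub_results = [
        divide_and_conquer(is_trivial, solve, divide, combine, sub)
        for sub in divide(problem)
    ]
    return combine(problem, sub_results)


def karatsuba(x: int, y: int) -> int:
    """Multiply two non-negative integers by instantiating the skeleton."""
    Pair = Tuple[int, int]

    def split_point(p: Pair) -> int:
        # Split both operands at half the bit length of the larger one.
        return max(p[0].bit_length(), p[1].bit_length()) // 2

    def is_trivial(p: Pair) -> bool:
        return p[0] < 16 or p[1] < 16  # arbitrary small threshold

    def solve(p: Pair) -> int:
        return p[0] * p[1]

    def divide(p: Pair) -> List[Pair]:
        m = split_point(p)
        mask = (1 << m) - 1
        a_hi, a_lo = p[0] >> m, p[0] & mask
        b_hi, b_lo = p[1] >> m, p[1] & mask
        # Three sub-multiplications instead of the naive four.
        return [(a_hi, b_hi), (a_lo, b_lo), (a_hi + a_lo, b_hi + b_lo)]

    def combine(p: Pair, results: List[int]) -> int:
        m = split_point(p)
        hi, lo, mid = results
        return (hi << (2 * m)) + ((mid - hi - lo) << m) + lo

    return divide_and_conquer(is_trivial, solve, divide, combine, (x, y))
```

Because the three sub-multiplications returned by `divide` are independent, they are the natural unit of parallel work; the sketch leaves that scheduling decision entirely to the skeleton and none of it to `karatsuba`.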

    The parallel event loop model and runtime: a parallel programming model and runtime system for safe event-based parallel programming

    Recent trends in programming models for server-side development have shown an increasing popularity of event-based single-threaded programming models based on the combination of dynamic languages such as JavaScript and event-based runtime systems for asynchronous I/O management such as Node.js. Reasons for the success of such models are the simplicity of the single-threaded event-based programming model as well as the growing popularity of the Cloud as a deployment platform for Web applications. Unfortunately, the popularity of single-threaded models comes at the price of performance and scalability, as single-threaded event-based models present limitations when parallel processing is needed, and traditional approaches to concurrency such as threads and locks don't play well with event-based systems. This dissertation proposes a programming model and a runtime system to overcome such limitations by enabling single-threaded event-based applications with support for speculative parallel execution. The model, called Parallel Event Loop, has the goal of bringing parallel execution to the domain of single-threaded event-based programming without relaxing the main characteristics of the single-threaded model, and therefore providing developers with the impression of a safe, single-threaded runtime. Rather than supporting only pure single-threaded programming, however, the parallel event loop can also be used to derive safe, high-level, parallel programming models characterized by a strong compatibility with single-threaded runtimes. We describe three distinct implementations of speculative runtimes enabling the parallel execution of event-based applications. The first implementation we describe is a pessimistic runtime system based on locks to implement speculative parallelization. The second and the third implementations are based on two distinct optimistic runtimes using software transactional memory.
Each of the implementations supports the parallelization of applications written using an asynchronous single-threaded programming style, and each of them enables applications to benefit from parallel execution

    Dependable mapreduce in a cloud-of-clouds

    Doctoral thesis, Informatics (Informatics Engineering), Universidade de Lisboa, Faculdade de Ciências, 2017. MapReduce is a simple and elegant programming model suitable for loosely coupled parallelization problems, that is, problems that can be decomposed into subproblems. Hadoop MapReduce has become the most popular framework for performing large-scale computation on off-the-shelf clusters, and it is widely used to process these problems in a parallel and distributed fashion. This framework is highly scalable, can deal efficiently with large volumes of unstructured data, and it is a platform for many other applications. However, the framework has limitations concerning dependability. Namely, it is solely prepared to tolerate crash faults by re-executing tasks in case of failure, and to detect file corruptions using file checksums. Unfortunately, there is evidence that arbitrary faults do occur and can affect the correctness of MapReduce execution. Although such Byzantine faults are considered to be rare, particular MapReduce applications are critical and intolerant to this type of fault. Furthermore, typical MapReduce implementations are constrained to a single cloud environment. This is a problem as there is increasing evidence of outages on major cloud offerings, raising concerns about the dependence on a single cloud. In this thesis, I propose techniques to improve the dependability of MapReduce systems. The proposed solutions allow MapReduce to scale out computations to a multi-cloud environment, or cloud-of-clouds, to tolerate arbitrary and malicious faults and cloud outages. The proposals have three important properties: they increase the dependability of MapReduce by tolerating the faults mentioned above; they require minimal or no modifications to users’ applications; and they achieve this increased level of fault tolerance at reasonable cost.
To achieve these goals, I introduce three key ideas: minimizing the required replication; applying context-based job scheduling based on cloud and network conditions; and performing fine-grained replication. I evaluated all proposed solutions in real testbed environments running typical MapReduce applications. The results demonstrate interesting trade-offs concerning resilience and performance when compared to traditional methods. The fundamental conclusion is that the cost introduced by our solutions is small, and thus deemed acceptable for many critical applications. MapReduce is a programming model suited to processing large volumes of data in parallel by executing a set of independent tasks and combining the partial results into the final solution. Hadoop MapReduce is a popular platform for processing large amounts of data in a parallel and distributed fashion. From a dependability standpoint, the platform is prepared only to tolerate crash faults, by re-executing tasks, and to detect file corruption using checksums. This is an important limitation, given the evidence that arbitrary faults occur and can affect MapReduce execution. Although such Byzantine faults are rare, certain MapReduce applications are critical and cannot tolerate faults of this kind. Moreover, the number of outages in cloud infrastructures has been increasing over the years, raising concerns about customers' dependence on a single cloud service provider. In this thesis I propose several techniques to improve the dependability of the MapReduce system. The proposed solutions allow MapReduce tasks to be processed in a multi-cloud environment in order to tolerate arbitrary faults, malicious faults, and crash faults in the clouds.
These solutions offer three important properties: they tolerate the types of faults mentioned; they require no modifications to customers' applications; and they achieve this fault tolerance at a reasonable cost. The techniques are based on the following ideas: minimizing replication, developing scheduling algorithms for MapReduce based on cloud and network conditions, and creating a fault tolerance system with fine-grained replication. I evaluated my proposals in real test environments with common MapReduce applications, which allows me to demonstrate interesting trade-offs in terms of resilience and performance when compared with traditional methods. In particular, the results show that the cost introduced by the solutions is acceptable for many critical applications.

    Automatic skeleton-driven performance optimizations for transactional memory

    The recent shift toward multi-core chips has pushed the burden of extracting performance to the programmer. In fact, programmers now have to be able to uncover more coarse-grain parallelism with every new generation of processors, or the performance of their applications will remain roughly the same or even degrade. Unfortunately, parallel programming is still hard and error prone. This has driven the development of many new parallel programming models that aim to make this process efficient. This thesis first combines the skeleton-based and transactional memory programming models in a new framework, called OpenSkel, in order to improve performance and programmability of parallel applications. This framework provides a single skeleton that allows the implementation of transactional worklist applications. Skeleton or pattern-based programming allows parallel programs to be expressed as specialized instances of generic communication and computation patterns. This leaves the programmer with only the implementation of the particular operations required to solve the problem at hand. Thus, this programming approach simplifies parallel programming by eliminating some of the major challenges of parallel programming, namely thread communication, scheduling and orchestration. However, the application programmer still has to correctly synchronize threads on data races. This commonly requires the use of locks to guarantee atomic access to shared data. In particular, lock programming is vulnerable to deadlocks and also limits coarse-grain parallelism by blocking threads that could potentially be executed in parallel. Transactional Memory (TM) thus emerges as an attractive alternative model to simplify parallel programming by removing this burden of handling data races explicitly. This model allows programmers to write parallel code as transactions, which are then guaranteed by the runtime system to execute atomically and in isolation regardless of eventual data races.
TM programming thus frees the application from deadlocks and enables the exploitation of coarse-grain parallelism when transactions do not conflict very often. Nevertheless, thread management and orchestration are left for the application programmer. Fortunately, this can be naturally handled by a skeleton framework. This fact makes the combination of skeleton-based and transactional programming a natural step to improve programmability, since these models complement each other. In fact, this combination releases the application programmer from dealing with thread management and data races, and also inherits the performance improvements of both models. In addition, a skeleton framework is also amenable to skeleton-driven performance optimizations that exploit the application pattern and system information. This thesis thus also presents a set of pattern-oriented optimizations that are automatically selected and applied in a significant subset of transactional memory applications that share a common pattern called worklist. These optimizations exploit the knowledge about the worklist pattern and the TM nature of the applications to avoid transaction conflicts, to prefetch data, to reduce contention, etc. Using a novel autotuning mechanism, OpenSkel dynamically selects the most suitable set of these pattern-oriented performance optimizations for each application and adjusts them accordingly. Experimental results on a subset of five applications from the STAMP benchmark suite show that the proposed autotuning mechanism can achieve performance improvements within 2%, on average, of a static oracle for a 16-core UMA (Uniform Memory Access) platform and surpasses it by 7% on average for a 32-core NUMA (Non-Uniform Memory Access) platform. Finally, this thesis also investigates skeleton-driven system-oriented performance optimizations such as thread mapping and memory page allocation.
To do so, the OpenSkel system and the autotuning mechanism are extended to accommodate these optimizations. Experimental results on a subset of five applications from the STAMP benchmark suite show that the OpenSkel framework with the extended autotuning mechanism driving both pattern-oriented and system-oriented optimizations can achieve performance improvements of up to 88%, with an average of 46%, over a baseline version for a 16-core UMA platform and up to 162%, with an average of 91%, for a 32-core NUMA platform.
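
The worklist pattern described in this abstract can be sketched as follows. This is a Python illustration only and is not OpenSkel's API: the names `run_worklist` and `process_item` are hypothetical, and since Python has no transactional memory, a single `threading.Lock` stands in for the atomic transaction. The skeleton owns the threads, the queue and termination; the application supplies only the per-item operation, which may generate new work.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List


def run_worklist(
    initial_items: Iterable[Any],
    process_item: Callable[[Any, dict], List[Any]],
    shared_state: dict,
    num_workers: int = 4,
) -> dict:
    """Process items until the worklist is empty; items must not be None."""
    work: queue.Queue = queue.Queue()
    for item in initial_items:
        work.put(item)

    # Assumption: a global lock replaces the transaction a TM runtime would
    # provide, so item bodies are serialized rather than run speculatively.
    atomic = threading.Lock()

    def worker() -> None:
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work
                work.task_done()
                return
            try:
                with atomic:
                    new_items = process_item(item, shared_state)
                # New work is enqueued before this item is marked done,
                # so the queue never looks finished prematurely.
                for new_item in new_items:
                    work.put(new_item)
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    work.join()  # returns once every item, including generated ones, is done
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return shared_state


def halve(n: int, state: dict) -> List[int]:
    """Example operation: accumulate n, then schedule n // 2 while n > 1."""
    state["sum"] += n
    return [n // 2] if n > 1 else []
```

Calling `run_worklist([8, 5], halve, {"sum": 0})` processes 8, 4, 2, 1 and 5, 2, 1. Replacing the lock with real transactions is what would let non-conflicting items run in parallel, which is the case the abstract's optimizations target.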
