
    Coupling Memory and Computation for Locality Management

    We articulate the need for managing (data) locality automatically rather than leaving it to the programmer, especially in parallel programming systems. To this end, we propose techniques for tightly coupling the computation (including the thread scheduler) and the memory manager so that data and computation can be positioned close together in hardware. Such tight coupling of computation and memory management is in sharp contrast with the prevailing practice of considering each in isolation. For example, memory-management techniques usually abstract the computation as an unknown "mutator", which is treated as a "black box". As an example of the approach, in this paper we consider a specific class of parallel computations: nested-parallel computations, which dynamically create a nesting of parallel tasks. We propose a method for organizing memory as a tree of heaps that reflects the structure of the nesting. More specifically, our approach creates a heap for a task if it is separately scheduled on a processor. This allows us to couple garbage collection with the structure of the computation and the way in which it is dynamically scheduled on the processors. This coupling makes it possible to exploit locality in the program by mapping it to the locality of the hardware. For example, for improved locality, a heap can be garbage collected immediately after its task finishes, when the heap's contents are likely still in cache.
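
    As a rough illustration of the tree-of-heaps idea, the toy Haskell model below assigns each separately scheduled task its own heap and mirrors task nesting in the heap tree. This is a sketch only; the names Heap, fork, and finish are illustrative, not the paper's API.

        -- A toy model: each separately scheduled task owns a Heap of
        -- the objects it has allocated; nesting of tasks is mirrored
        -- by nesting of heaps.
        data Heap a = Heap
          { owner    :: String      -- the task this heap belongs to
          , objects  :: [a]         -- objects the task has allocated
          , children :: [Heap a]    -- heaps of nested (forked) subtasks
          }

        -- Forking a subtask adds a fresh child heap under the parent's
        -- heap, so the heap tree mirrors the dynamic task nesting.
        fork :: String -> Heap a -> Heap a
        fork childName parent =
          parent { children = Heap childName [] [] : children parent }

        -- When a task finishes, its heap can be collected immediately,
        -- while its contents are likely still in cache; survivors are
        -- promoted to the parent heap (modelled here as a list merge).
        finish :: String -> Heap a -> Heap a
        finish childName parent =
          case break ((== childName) . owner) (children parent) of
            (before, done : after) ->
              parent { objects  = objects done ++ objects parent
                     , children = before ++ after }
            _ -> parent  -- no such child; nothing to collect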

    Retrofitting parallelism onto OCaml

    OCaml is an industrial-strength, multi-paradigm programming language, widely used in industry and academia. OCaml is also one of the few modern managed system programming languages to lack support for shared-memory parallel programming. This paper describes the design, full-fledged implementation, and evaluation of a mostly-concurrent garbage collector (GC) for the multicore extension of the OCaml programming language. Given that we propose to add parallelism to a widely used programming language with millions of lines of existing code, we face the challenge of maintaining backwards compatibility--not just in terms of language features but also the performance of single-threaded code running under the new GC. To this end, the paper presents a series of novel techniques and demonstrates that the new GC strikes a balance between performance and feature backwards compatibility for sequential programs and scales admirably on modern multicore processors.

    PAEAN: portable and scalable runtime support for parallel Haskell dialects

    Over time, several competing approaches to parallel Haskell programming have emerged. Different approaches support parallelism at various scales, ranging from small multicores to massively parallel high-performance computing systems. They also provide varying degrees of control, ranging from completely implicit approaches to ones offering full programmer control. Most current designs assume a shared-memory model at the programmer, implementation, and hardware levels. This is, however, becoming increasingly divorced from the reality at the hardware level. It also imposes significant unwanted runtime overheads, such as garbage-collection synchronisation. What is needed is an easy way to abstract over the implementation and hardware levels while presenting a simple parallelism model to the programmer. The PArallEl shAred Nothing (PAEAN) runtime-system design aims to provide a portable, high-level, shared-nothing implementation platform for parallel Haskell dialects. It abstracts over major issues such as work distribution and data serialisation, consolidating existing, successful designs into a single framework. It also provides an optional virtual shared-memory programming abstraction for (possibly) shared-nothing parallel machines, such as modern multicore/manycore architectures or cluster/cloud computing systems. It builds on, unifies, and extends the existing, well-developed support for shared-memory parallelism provided by the widely used GHC Haskell compiler. This paper summarises the state of the art in shared-nothing parallel Haskell implementations, introduces the PArallEl shAred Nothing abstractions, shows how they can be used to implement three distinct parallel Haskell dialects, and demonstrates that good scalability can be obtained on recent parallel machines.
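
    To make the shared-nothing idea concrete, here is a minimal Haskell sketch (not the PAEAN implementation) in which a spawned worker communicates only through serialised messages, so no heap structure is ever shared between tasks. The names Wire and spawnNode are hypothetical, and Show/Read stand in for a real serialisation layer.

        import Control.Concurrent (forkIO)
        import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)

        -- Only serialised Strings cross a Wire, imitating the copy
        -- semantics a distributed, shared-nothing heap would enforce.
        type Wire = Chan String

        -- Spawn a worker "node" that serves a single request; it
        -- deserialises its input and serialises its reply.
        spawnNode :: (Read a, Show b) => (a -> b) -> IO (Wire, Wire)
        spawnNode worker = do
          inbox  <- newChan
          outbox <- newChan
          _ <- forkIO $ do
                msg <- readChan inbox
                let input = read msg                      -- deserialise
                writeChan outbox (show (worker input))    -- serialise
          return (inbox, outbox)

        main :: IO ()
        main = do
          (inbox, outbox) <- spawnNode (\n -> sum [1 .. n :: Integer])
          writeChan inbox (show (1000 :: Integer))  -- only copies cross
          reply <- readChan outbox
          putStrLn ("result = " ++ reply)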

    Hierarchical Memory Management for Parallel Programs

    An important feature of functional programs is that they are parallel by default. Implementing an efficient parallel functional language, however, is a major challenge, in part because the high rate of allocation and freeing associated with functional programs requires an efficient and scalable memory manager. In this paper, we present a technique for parallel memory management for strict functional languages with nested parallelism. At the highest level of abstraction, the approach consists of a technique to organize memory as a hierarchy of heaps, and an algorithm for performing automatic memory reclamation by taking advantage of a disentanglement property of parallel functional programs. More specifically, the idea is to assign to each parallel task its own heap in memory and to organize the heaps in a hierarchy/tree that mirrors the hierarchy of tasks. We present a nested-parallel calculus that specifies hierarchical heaps and prove in this calculus a disentanglement property, which prohibits a task from accessing objects allocated by another task that might execute in parallel. Leveraging the disentanglement property, we present a garbage-collection technique that can operate on any subtree in the memory hierarchy concurrently while other tasks (and/or other collections) proceed in parallel. We prove the safety of this collector by formalizing it in the context of our parallel calculus. In addition, we describe how the proposed techniques can be implemented on modern shared-memory machines and present a prototype implementation as an extension to MLton, a high-performance compiler for the Standard ML language. Finally, we evaluate the performance of this implementation on a number of parallel benchmarks.
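
    A toy Haskell rendering of the disentanglement property (an illustration, not the paper's formal calculus): a task may access objects in its own heap or any ancestor heap, but never in a sibling subtree whose task might run in parallel. The names HeapTree, ancestorsOf, and disentangled are hypothetical.

        -- A task tree in which each node stands for a task and its heap.
        data HeapTree = HeapTree
          { taskId :: String
          , kids   :: [HeapTree]
          }

        -- The heaps a task may legally access: every heap on the path
        -- from the root down to the task's own heap.
        ancestorsOf :: String -> HeapTree -> Maybe [String]
        ancestorsOf t (HeapTree name cs)
          | name == t = Just [name]
          | otherwise =
              case [path | Just path <- map (ancestorsOf t) cs] of
                (path : _) -> Just (name : path)
                []         -> Nothing

        -- Disentanglement as a runtime assertion: task `reader`
        -- touching an object owned by `ownerTask` is legal iff
        -- `ownerTask` lies on `reader`'s root-to-leaf path.
        disentangled :: HeapTree -> String -> String -> Bool
        disentangled tree reader ownerTask =
          case ancestorsOf reader tree of
            Just path -> ownerTask `elem` path
            Nothing   -> False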

    Doctor of Philosophy

    Places and distributed places bring new support for message-passing parallelism to Racket. This dissertation describes the programming model and how Racket's sequential runtime system was modified to support places and distributed places. The freedom to design the places programming model helped make the implementation tractable; specifically, the conventional pain of adding just the right amount of locking to a big, legacy runtime system was avoided. The dissertation presents an evaluation of the places design that includes both real-world applications and standard parallel benchmarks. Distributed places are introduced as a language extension of the places design and architecture. The distributed places extension augments places with the features of remote process launch, remote place invocation, and distributed message passing. Distributed places provide a foundation for constructing higher-level distributed frameworks. Example implementations of RPC, MPI, MapReduce, and nested data parallelism demonstrate the extensibility of the distributed places API.
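
    As an analogy in Haskell (Racket's places API itself is not shown here), the sketch below layers a synchronous, RPC-style call on top of an isolated, message-passing task, much as the dissertation layers higher-level frameworks on distributed places. Place, startPlace, and call are hypothetical names.

        import Control.Concurrent (forkIO)
        import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
        import Control.Monad (forever)

        -- A request carries its payload plus a private reply channel.
        data Req a b = Req a (Chan b)

        -- A "place": an isolated task that owns its state and serves
        -- requests arriving on a single channel.
        newtype Place a b = Place (Chan (Req a b))

        startPlace :: (a -> b) -> IO (Place a b)
        startPlace handler = do
          reqs <- newChan
          _ <- forkIO $ forever $ do
                 Req payload replyTo <- readChan reqs
                 writeChan replyTo (handler payload)
          return (Place reqs)

        -- A synchronous remote call, built purely from message passing:
        -- send the request, then block on the private reply channel.
        call :: Place a b -> a -> IO b
        call (Place reqs) payload = do
          replyTo <- newChan
          writeChan reqs (Req payload replyTo)
          readChan replyTo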

    Composable Scheduler Activations for Haskell


    Partial Aborts for Transactions via First Class Continuations

    Software transactional memory (STM) has proven to be a useful abstraction for developing concurrent applications, where programmers denote transactions with an atomic construct that delimits a collection of reads and writes to shared mutable references. The runtime system then guarantees that all transactions are observed to execute atomically with respect to each other. Traditionally, when the runtime system detects that one transaction conflicts with another, it aborts one of the transactions and restarts its execution from the beginning. This can lead to problems with both execution time and throughput. This thesis presents a novel approach that uses first-class continuations to restart a conflicting transaction at the point of the conflict, avoiding the re-execution of any earlier work in the transaction that has not been compromised. In practice, this allows transactions to complete more quickly, decreasing execution time and increasing throughput. The ideas presented in this thesis have been implemented in the context of the Manticore project, an ML-family language with support for parallelism and concurrency. Crucially, this work relies on constant-time continuation capture via a continuation-passing-style (CPS) transformation and heap-allocated continuations. The partial-abort scheme has been implemented as part of three modern STM implementations: TL2, TinySTM, and NOrec. Experimental results show that, while no base STM implementation is universally best, each partial-abort implementation compares favorably to its full-abort counterpart. In addition to an implementation, this thesis presents a formal semantics for partial aborts. A proof of correctness is given by relating the partial-abort semantics to an analogous full-abort semantics via a simulation. All proofs have been formally verified using the Coq theorem prover.
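
    For context, the following example shows the conventional full-abort behaviour the thesis improves on, using GHC's stock stm library (not Manticore's STM): if a conflicting transaction modifies a variable this block has read before it commits, the entire block re-executes from the top, including any uncompromised work done before the conflict.

        import Control.Concurrent.STM

        -- A classic atomic transfer between two shared references.
        -- Under full-abort STM, a conflict on `from` or `to` restarts
        -- the whole transaction from the beginning.
        transfer :: TVar Int -> TVar Int -> Int -> STM ()
        transfer from to amount = do
          balance <- readTVar from
          if balance < amount
            then retry                        -- block until funds change
            else do
              writeTVar from (balance - amount)
              modifyTVar' to (+ amount)

        main :: IO ()
        main = do
          a <- newTVarIO 100
          b <- newTVarIO 0
          atomically (transfer a b 30)
          readTVarIO b >>= print   -- prints 30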

    Parallel evaluation strategies for lazy data structures in Haskell

    Conventional parallel programming is complex and error-prone. To improve programmer productivity, we need to raise the level of abstraction with a higher-level programming model that hides many parallel coordination aspects. Evaluation strategies use non-strictness to separate the coordination and computation aspects of a Glasgow parallel Haskell (GpH) program. This allows the specification of high-level parallel programs, eliminating the low-level complexity of synchronisation and communication associated with parallel programming. This thesis employs a data-structure-driven approach to parallelism, derived through generic parallel traversal and evaluation of sub-components of data structures. We focus on evaluation strategies over list, tree, and graph data structures, allowing re-use across applications with minimal changes to the sequential algorithm. In particular, we develop novel evaluation strategies for tree data structures, using core functional programming techniques for coordination control and achieving more flexible parallelism. We use non-strictness to control parallelism more flexibly. We apply the notion of fuel as a resource that dictates parallelism generation; in particular, the bi-directional flow of fuel through a tree structure, implemented using a circular program definition, is a novel way of controlling parallel evaluation. This is the first use of circular programming in evaluation strategies, and it is complemented by a lazy function for bounding the size of sub-trees. We extend these control mechanisms to graph structures and demonstrate performance improvements on several parallel graph traversals. We combine circularity for control, which improves the performance of strategies, with circularity for computation, which uses circular data structures. In particular, we develop a hybrid traversal strategy for graphs, exploiting breadth-first order to expose parallelism initially and then proceeding in depth-first order to minimise the overhead associated with a full parallel breadth-first traversal. The efficiency of the tree strategies is evaluated on a benchmark program and two non-trivial case studies: a Barnes-Hut algorithm for the n-body problem and sparse matrix multiplication, both using quad-trees. We also evaluate a graph search algorithm implemented using the various traversal strategies. We demonstrate improved performance on a server-class multicore machine with up to 48 cores, with the advanced fuel-splitting mechanisms proving more flexible in throttling parallelism. To guide the behaviour of the strategies, we develop heuristics-based parameter selection to choose their specific control parameters.
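
    In the spirit of these tree strategies (though far simpler than the fuel mechanism), the sketch below uses GHC's Control.Parallel.Strategies to spark subtree evaluation in parallel down to a bounded depth, throttling parallelism much as fuel does. parTreeDepth is an illustrative name, not the thesis's actual strategy.

        import Control.Parallel.Strategies

        data Tree a = Leaf a | Node (Tree a) (Tree a)

        -- Evaluate subtrees in parallel down to depth d, then fall
        -- back to sequential evaluation, bounding spark creation.
        parTreeDepth :: Int -> Strategy a -> Strategy (Tree a)
        parTreeDepth _ s (Leaf x) = Leaf <$> s x
        parTreeDepth d s (Node l r)
          | d <= 0    = Node <$> parTreeDepth 0 s l
                             <*> parTreeDepth 0 s r
          | otherwise = Node <$> rparWith (parTreeDepth (d - 1) s) l
                             <*> rparWith (parTreeDepth (d - 1) s) r

        -- Usage: force every leaf to normal form, sparking subtrees in
        -- parallel for the top three levels only; the sequential sum
        -- is untouched, as with evaluation strategies generally.
        sumTree :: Tree Int -> Int
        sumTree t = go (t `using` parTreeDepth 3 rdeepseq)
          where
            go (Leaf x)   = x
            go (Node l r) = go l + go r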