The best multicore-parallelization refactoring you've never heard of
In this short paper, we explore a new way to refactor a simple but
tricky-to-parallelize tree-traversal algorithm to harness multicore
parallelism. Crucially, the refactoring draws from some classic techniques from
programming-languages research, such as the continuation-passing-style
transform and defunctionalization. The algorithm we consider faces a
particularly acute granularity-control challenge, owing to the wide range of
inputs it has to deal with. Our solution derives its efficiency from heartbeat
scheduling, a recent approach to automatic granularity control. We present our
solution in a series of individually simple refactoring steps, starting from a
high-level, recursive specification of the algorithm. As such, our approach may
prove useful as a teaching tool, and perhaps be used for one-off
parallelizations, as the technique requires no special compiler support.
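The abstract gives no code, but the two classic transforms it names can be illustrated on a generic traversal. The OCaml sketch below is not the paper's algorithm: a recursive sum over a hypothetical binary-tree type is rewritten by applying the continuation-passing-style transform and then defunctionalizing the continuations, so that the pending work becomes first-order data.

```ocaml
(* A recursive sum over a binary tree, refactored via CPS and
   defunctionalization into a loop whose pending work is an explicit
   stack of first-order frames. *)

type tree = Leaf of int | Node of tree * tree

(* Direct-style recursive specification. *)
let rec sum t =
  match t with
  | Leaf n -> n
  | Node (l, r) -> sum l + sum r

(* Defunctionalized continuations: each constructor stands for one of
   the closures that a CPS version of [sum] would allocate. *)
type kont =
  | Done
  | SumRight of tree * kont  (* after the left subtree, sum the right one *)
  | AddLeft of int * kont    (* after the right subtree, add the left result *)

let sum_defun t =
  let rec eval t k =
    match t with
    | Leaf n -> apply k n
    | Node (l, r) -> eval l (SumRight (r, k))
  and apply k acc =
    match k with
    | Done -> acc
    | SumRight (r, k) -> eval r (AddLeft (acc, k))
    | AddLeft (left, k) -> apply k (left + acc)
  in
  eval t Done

let () =
  let t = Node (Node (Leaf 1, Leaf 2), Leaf 3) in
  assert (sum t = sum_defun t)
```

Once the continuation is represented as plain data, the remaining work of the traversal is explicit and can in principle be split across cores; how heartbeat scheduling promotes that work at controlled intervals is the paper's contribution and is not shown here.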
Efficient Representations for Large Dynamic Sequences in ML
The use of sequence containers, including stacks, queues, and double-ended queues, is ubiquitous in programming. When the maximal number of elements is not known in advance, containers need to grow dynamically. For this purpose, most ML programs rely on either lists or vectors. These structures are inefficient in terms of both time and space usage. We investigate the use of chunk-based data structures. Such structures save a lot of memory and may deliver better performance than classic container data structures. We observe a 2x speedup compared with vectors, and up to a 3x speedup compared with long lists.
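As a rough illustration of the chunk-based idea described above, here is a minimal OCaml sketch of a chunked stack. It is not the paper's data structure; the chunk capacity and the dummy-element interface are assumptions made for the sketch. Elements are packed into fixed-capacity arrays and only the chunks themselves are linked, which is where the memory savings over cons lists come from.

```ocaml
(* A minimal chunk-based stack: elements live in fixed-size arrays, so
   per-element overhead is far lower than with a cons list. *)

let chunk_capacity = 256

type 'a chunk = { data : 'a array; mutable size : int }

(* [dummy] is used only to initialize fresh arrays. *)
type 'a t = { mutable chunks : 'a chunk list; dummy : 'a }

let create dummy = { chunks = []; dummy }

let push s x =
  match s.chunks with
  | c :: _ when c.size < chunk_capacity ->
      c.data.(c.size) <- x;
      c.size <- c.size + 1
  | _ ->
      let c = { data = Array.make chunk_capacity s.dummy; size = 1 } in
      c.data.(0) <- x;
      s.chunks <- c :: s.chunks

let pop s =
  match s.chunks with
  | [] -> None
  | c :: rest ->
      let x = c.data.(c.size - 1) in
      c.size <- c.size - 1;
      if c.size = 0 then s.chunks <- rest;
      Some x

let () =
  let s = create 0 in
  for i = 1 to 1000 do push s i done;
  assert (pop s = Some 1000)
```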
Efficient Tree-Traversals: Reconciling Parallelism and Dense Data Representations
Recent work showed that compiling functional programs to use dense,
serialized memory representations for recursive algebraic datatypes can yield
significant constant-factor speedups for sequential programs. But serializing
data in a maximally dense format consequently serializes the processing of that
data, yielding a tension between density and parallelism. This paper shows that
a disciplined, practical compromise is possible. We present Parallel Gibbon, a
compiler that obtains the benefits of dense data formats and parallelism. We
formalize the semantics of the parallel location calculus underpinning this
novel implementation strategy, and show that it is type-safe. Parallel Gibbon
exceeds the parallel performance of existing compilers for purely functional
programs that use recursive algebraic datatypes, including, notably,
abstract-syntax-tree traversals as in compilers.
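The density/parallelism tension described above is easy to see on a small example. The OCaml sketch below illustrates serialized tree representations in general, not Parallel Gibbon's generated code or its actual layout; the tag values and one-byte payloads are assumptions. A binary tree is stored in preorder in a byte buffer and summed with a cursor, with no pointers to chase.

```ocaml
(* Preorder-serialized binary tree of small ints: a Leaf is tag 0
   followed by a payload byte, a Node is tag 1 followed by its two
   subtrees.  [sum_packed] walks the buffer with a cursor and returns
   the sum together with the cursor position just past the subtree. *)

let leaf_tag = 0

let rec sum_packed buf pos =
  if Bytes.get_uint8 buf pos = leaf_tag then
    (Bytes.get_uint8 buf (pos + 1), pos + 2)
  else begin
    let left, pos = sum_packed buf (pos + 1) in
    let right, pos = sum_packed buf pos in
    (left + right, pos)
  end

let () =
  (* Node (Node (Leaf 1, Leaf 2), Leaf 3) serialized in preorder. *)
  let buf = Bytes.of_string "\001\001\000\001\000\002\000\003" in
  let total, _ = sum_packed buf 0 in
  assert (total = 6)
```

Note that the position of the right subtree is known only after the left subtree has been fully traversed, which is exactly why a maximally dense format serializes the processing and why a disciplined compromise is needed to recover parallelism.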
Contention in Structured Concurrency: Provably Efficient Dynamic Non-Zero Indicators for Nested Parallelism
Over the past two decades, many concurrent data structures have been designed and implemented. Nearly all such work analyzes concurrent data structures empirically, omitting asymptotic bounds on their efficiency, partly because of the complexity of the analysis needed, and partly because of the difficulty of obtaining relevant asymptotic bounds: when the analysis takes into account important practical factors, such as contention, it is difficult or even impossible to prove desirable bounds. In this paper, we show that considering structured or relaxed concurrency models can enable establishing strong bounds, including bounds on contention. To this end, we first present a dynamic relaxed counter data structure that indicates the non-zero status of the counter. Our data structure extends a recently proposed data structure, called SNZI, allowing our structure to grow dynamically in response to the increasing degree of concurrency in the system. Using the dynamic SNZI data structure, we then present a concurrent data structure for series-parallel directed acyclic graphs (sp-dags), a key data structure widely used in the implementation of modern parallel programming languages. The key component of sp-dags is an in-counter data structure that is an instance of our dynamic SNZI. We analyze the efficiency of our concurrent sp-dags and in-counter data structures under the nested-parallel computing paradigm, which offers a structured model of concurrency. Under this model, we prove that our data structures require amortized O(1) shared-memory steps, including contention. We present an implementation and an experimental evaluation suggesting that the sp-dags data structure is practical and performs well in practice.
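For readers unfamiliar with non-zero indicators, the OCaml sketch below shows only the interface, using a naive single atomic counter from the standard Atomic module. It is not the paper's data structure: a SNZI-style indicator spreads arrivals over a tree of counters to reduce contention, and the paper's dynamic variant additionally grows that tree as the degree of concurrency increases.

```ocaml
(* Non-zero indicator interface: arrive/depart adjust a logical count
   and query reveals only whether it is non-zero.  This naive version
   keeps a single atomic counter, which is correct but makes every
   operation contend on one cache line. *)

type t = { count : int Atomic.t }

let create () = { count = Atomic.make 0 }

let arrive t = ignore (Atomic.fetch_and_add t.count 1)

let depart t = ignore (Atomic.fetch_and_add t.count (-1))

let query t = Atomic.get t.count <> 0

let () =
  let c = create () in
  arrive c; arrive c; depart c;
  assert (query c);
  depart c;
  assert (not (query c))
```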
Dag-calculus: a calculus for parallel computation
Increasing availability of multicore systems has led to greater focus on the design and implementation of languages for writing parallel programs. Such languages support various abstractions for parallelism, such as fork-join, async-finish, and futures. While they may seem similar, these abstractions lead to different semantics, language design and implementation decisions, and can significantly impact the performance of end-user applications. In this paper, we consider the question of whether it would be possible to unify various paradigms of parallel computing. To this end, we propose a calculus, called dag calculus, that can encode fork-join, async-finish, and futures, and possibly others. We describe dag calculus and its semantics, and give translations from the aforementioned paradigms into dag calculus. These translations establish that dag calculus is sufficiently powerful for encoding programs written in prevailing paradigms of parallelism. We present concurrent algorithms and data structures for realizing dag calculus on multicore hardware and prove that the proposed techniques are consistent with the semantics. Finally, we present an implementation of the calculus and evaluate it empirically by comparing its performance to highly optimized code from prior work. The results show that the calculus is expressive and that it competes well with, and sometimes outperforms, the state of the art.
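The encodings mentioned above all revolve around building a DAG of dependent computation steps. The sketch below is a single-threaded OCaml toy, not the calculus itself or its concurrent multicore implementation; the primitive names (new_vertex, new_edge, release) are assumptions for the sketch. Each vertex carries an in-counter of unfinished dependencies, and fork-join is encoded by making a join vertex depend on both branches.

```ocaml
(* A toy DAG scheduler: a vertex runs once its counter of unfinished
   incoming edges reaches zero. *)

type vertex = {
  mutable deps : int;          (* unfinished incoming edges *)
  body : unit -> unit;         (* work to run when the vertex is ready *)
  mutable succs : vertex list; (* outgoing edges *)
}

let ready : vertex Queue.t = Queue.create ()

let new_vertex body = { deps = 0; body; succs = [] }

let new_edge src dst =
  dst.deps <- dst.deps + 1;
  src.succs <- dst :: src.succs

let release v = if v.deps = 0 then Queue.push v ready

let run_all () =
  while not (Queue.is_empty ready) do
    let v = Queue.pop ready in
    v.body ();
    List.iter
      (fun s ->
        s.deps <- s.deps - 1;
        if s.deps = 0 then Queue.push s ready)
      v.succs
  done

(* Fork-join: run [f] and [g] as independent vertices, then the
   continuation [k] once both have finished. *)
let fork_join f g k =
  let join = new_vertex k in
  let left = new_vertex f and right = new_vertex g in
  new_edge left join;
  new_edge right join;
  release left;
  release right;
  run_all ()

let () =
  let a = ref 0 and b = ref 0 in
  fork_join (fun () -> a := 1) (fun () -> b := 2)
    (fun () -> assert (!a + !b = 3))
```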
LoCal: a language for programs operating on serialized data
In a typical data-processing program, the representation of data in memory is distinct from its representation in a serialized form on disk. The former has pointers and arbitrary, sparse layout, facilitating easy manipulation by a program, while the latter is packed contiguously, facilitating easy I/O. We propose a language, LoCal, to unify in-memory and serialized formats. LoCal extends a region calculus into a location calculus, employing a type system that tracks the byte-addressed layout of all heap values. We formalize LoCal and prove type safety, and show how LoCal programs can be inferred from unannotated source terms.
We transform the existing Gibbon compiler to use LoCal as an intermediate language, with the goal of achieving a balance between code speed and data compactness by introducing just enough indirection into heap layouts, preserving the asymptotic complexity of traditional representations, but working with mostly or completely serialized data. We show that our approach yields significant performance improvement over prior approaches to operating on packed data, without abandoning idiomatic programming with recursive functions.
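One way to picture "just enough indirection" is to record, next to each node of a packed tree, the size of its first child, so a consumer can skip over it without traversing it. The OCaml sketch below illustrates that general idea only; it is not LoCal syntax nor the layout Gibbon actually emits, and the single-byte size fields are a simplification.

```ocaml
(* Packed tree with a skip field: a Node is tag 1, one byte giving the
   size of the packed left subtree, then both subtrees; a Leaf is tag 0
   and a payload byte.  (One-byte fields keep the example small.) *)

type tree = Leaf of int | Node of tree * tree

let rec pack t =
  match t with
  | Leaf n -> Bytes.cat (Bytes.make 1 '\000') (Bytes.make 1 (Char.chr n))
  | Node (l, r) ->
      let pl = pack l and pr = pack r in
      Bytes.concat (Bytes.of_string "")
        [ Bytes.make 1 '\001';
          Bytes.make 1 (Char.chr (Bytes.length pl));
          pl; pr ]

(* Reach the rightmost leaf in constant work per level, skipping every
   left subtree via its recorded size. *)
let rec rightmost buf pos =
  if Bytes.get_uint8 buf pos = 0 then Bytes.get_uint8 buf (pos + 1)
  else
    let left_size = Bytes.get_uint8 buf (pos + 1) in
    rightmost buf (pos + 2 + left_size)

let () =
  let t = Node (Node (Leaf 1, Leaf 2), Leaf 3) in
  assert (rightmost (pack t) 0 = 3)
```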
Garbage collection for mostly serialized heaps
Over the years, traditional tracing garbage collectors have accumulated assumptions that may not hold in new language designs. For instance, we usually assume that run-time objects do not hold addressable sub-parts and have a size of at least one pointer. These assumptions fail in systems striving to eliminate pointers and represent data in a dense, serialized form, such as the Gibbon compiler. We propose a new memory management strategy for language runtimes with mostly serialized heaps. It uses a hybrid, generational collector, where regions are bump-allocated into the young generation and objects are bump-allocated within those regions. Minor collections copy data into larger regions in the old generation, compacting it further. The old generation uses region-level reference counting. The resulting system maintains high performance for data-traversal programs, while significantly improving performance on other kinds of allocation patterns.
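The region-level strategy can be pictured with a small model. The OCaml sketch below is a toy illustration rather than the Gibbon runtime; the record fields and fixed buffer size are assumptions made for the sketch. Objects are bump-allocated into a region's buffer, and the region as a whole carries a reference count, so reclamation happens per region rather than per object.

```ocaml
(* Toy region-level memory management: bump allocation inside a region,
   reference counting of the region as a whole. *)

type region = {
  buffer : Bytes.t;
  mutable next : int;      (* bump-allocation offset *)
  mutable refcount : int;  (* references from roots or other regions *)
  mutable live : bool;
}

let create_region size =
  { buffer = Bytes.create size; next = 0; refcount = 1; live = true }

(* Reserve [n] bytes in [r]; returns the offset of the new object, or
   None if the region is full (a real system would chain or grow it). *)
let alloc r n =
  if r.next + n > Bytes.length r.buffer then None
  else begin
    let off = r.next in
    r.next <- r.next + n;
    Some off
  end

let incref r = r.refcount <- r.refcount + 1

let decref r =
  r.refcount <- r.refcount - 1;
  if r.refcount = 0 then r.live <- false  (* whole region reclaimed at once *)

let () =
  let r = create_region 1024 in
  (match alloc r 16 with Some off -> assert (off = 0) | None -> assert false);
  incref r;
  decref r;
  decref r;
  assert (not r.live)
```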