22 research outputs found
Lock-free Concurrent Data Structures
Concurrent data structures are the data sharing side of parallel programming.
Data structures give the means to the program to store data, but also provide
operations to the program to access and manipulate these data. These operations
are implemented through algorithms that have to be efficient. In the sequential
setting, data structures are crucially important for the performance of the
respective computation. In the parallel programming setting, their importance
becomes more crucial because of the increased use of data and resource sharing
for utilizing parallelism.
The first and main goal of this chapter is to provide a sufficient background
and intuition to help the interested reader to navigate in the complex research
area of lock-free data structures. The second goal is to offer the programmer
familiarity to the subject that will allow her to use truly concurrent methods.Comment: To appear in "Programming Multi-core and Many-core Computing
Systems", eds. S. Pllana and F. Xhafa, Wiley Series on Parallel and
Distributed Computin
Efficient Multi-Word Compare and Swap
Atomic lock-free multi-word compare-and-swap (MCAS) is a powerful tool for designing concurrent algorithms. Yet, its widespread usage has been limited because lock-free implementations of MCAS make heavy use of expensive compare-and-swap (CAS) instructions. Existing MCAS implementations indeed use at least 2k+1 CASes per k-CAS. This leads to the natural desire to minimize the number of CASes required to implement MCAS.
We first prove in this paper that it is impossible to "pack" the information required to perform a k-word CAS (k-CAS) in less than k locations to be CASed. Then we present the first algorithm that requires k+1 CASes per call to k-CAS in the common uncontended case. We implement our algorithm and show that it outperforms a state-of-the-art baseline in a variety of benchmarks in most considered workloads. We also present a durably linearizable (persistent memory friendly) version of our MCAS algorithm using only 2 persistence fences per call, while still only requiring k+1 CASes per k-CAS
LL/SC and Atomic Copy: Constant Time, Space Efficient Implementations Using Only Pointer-Width CAS
When designing concurrent algorithms, Load-Link/Store-Conditional (LL/SC) is
often the ideal primitive to have because unlike Compare and Swap (CAS), LL/SC
is immune to the ABA problem. However, the full semantics of LL/SC are not
supported by any modern machine, so there has been a significant amount of work
on simulations of LL/SC using Compare and Swap (CAS), a synchronization
primitive that enjoys widespread hardware support. All of the algorithms so far
that are constant time either use unbounded sequence numbers (and thus base
objects of unbounded size), or require space for LL/SC object
(where is the number of processes). We present a constant time
implementation of LL/SC objects using space, where is
the maximum number of overlapping LL/SC operations per process (usually a
constant), and requiring only pointer-sized CAS objects. Our implementation can
also be used to implement -word objects in time (for
both and ) and space. To achieve these bounds, we
begin by implementing a new primitive called Single-Writer Copy which takes a
pointer to a word sized memory location and atomically copies its contents into
another object. The restriction is that only one process is allowed to
write/copy into the destination object at a time. We believe this primitive
will be very useful in designing other concurrent algorithms as well
Anonymous Readers Counting: A Wait-Free Multi-Word Atomic Register Algorithm for Scalable Data Sharing on Multi-Core Machines
In this article we present Anonymous Readers Counting (ARC), a multi-word atomic (1,N) register algorithm for multi-core machines. ARC exploits Read-Modify-Write (RMW) instructions to coordinate the writer and reader threads in a wait-free manner and enables large-scale data sharing by admitting up to (232-2) concurrent readers on off-the-shelf 64-bit machines, as opposed to the most advanced RMW-based approach which is limited to 58 readers on the same kind of machines. Further, ARC avoids multiple copies of the register content when accessing it - this is a problem that affects classical register algorithms based on atomic read/write operations on single words. Thus it allows for higher scalability with respect to the register size. Moreover, ARC explicitly reduces the overall power consumption, via a proper limitation of RMW instructions in case of read operations re-accessing a still-valid snapshot of the register content, and by showing constant time for read operations and amortized constant time for write operations. Our proposal has therefore a strong focus on real-world off-the-shelf architectures, allowing us to capture properties which benefit both performance and power consumption. A proof of correctness of our register algorithm is also provided, together with experimental data for a comparison with literature proposals. Beyond assessing ARC on physical platforms, we carry out as well an experimentation on virtualized infrastructures, which shows the resilience of wait-free synchronization as provided by ARC with respect to CPU-steal times, proper of modern paradigms such as cloud computing. Finally, we discuss how to extend ARC for scenarios with multiple writers and multiple readers - the so called (M,N) register. This is achieved not by changing the operations (and their wait-free nature) executed along the critical path of the threads, rather only changing the ratio between the number of buffers keeping the register snapshots and the number of threads to coordinate, as well as the number of bits used for counting readers within a 64-bit mask accessed via RMW instructions - just depending on the target balance between the number of readers and the number of writers to be supported
The Disclosure Power of Shared Objects
Shared objects are the means by which processes gather and exchange information about the state of a distributed system. Objects that disclose more information about the systemâand thus provide a more centralized viewâare therefore more desirable. In this paper, we propose the schedule reconstruction (SR) problem as a new metric for the disclosure power of shared memory objects. In schedule reconstruction, processes take steps which are interleaved to form a schedule; each process needs to be able to reconstruct the schedule up to its last step. We show that objects can be ranked in a hierarchy according to their ability to solve SR. In this hierarchy, stronger objects can implement weaker objects via a SR-based universal construction. We identify a connection between SR and consensus and prove that SR is at least as hard as consensus. Perhaps surprisingly, we show that objects that are powerful in solving consensusâsuch as compare-and-swapâare not always powerful in their ability to solve SR
Efficient and Practical Constructions of LL/SC Variables
Over the past decade, LL/SC have emerged as the most suitable synchronization instructions for the design of lock-free algorithms. However, current architectures do not support these instructions; instead, they support either CAS or RLL/RSC (e.g. POWER4, MIPS, SPARC, IA-64). To bridge this gap, this paper presents two efficient wait-free algorithms for implementing 64-bit LL/SC objects from 64-bit CAS or RLL/RSC objects. Our first algorithm is practical: it has a small, constant time complexity (of 4 for LL and 5 for SC) and a space overhead of only 4 words per process. This algorithm uses unbounded sequence numbers. For theoretical interest, we also present a more complex bounded algorithm that still guarantees constant time complexity and O(1) space overhead per process. The LL/SC primitive is free of the well-known ABA problem that afflicts CAS. By efficiently implementing LL/SC words from CAS words, this work presents an efficient general solution to the ABA problem