25 research outputs found
A Scalable, Portable, and Memory-Efficient Lock-Free FIFO Queue
We present a new lock-free multiple-producer and multiple-consumer (MPMC) FIFO queue design which is scalable and, unlike existing high-performant queues, very memory efficient. Moreover, the design is ABA safe and does not require any external memory allocators or safe memory reclamation techniques, typically needed by other scalable designs. In fact, this queue itself can be leveraged for object allocation and reclamation, as in data pools. We use FAA (fetch-and-add), a specialized and more scalable than CAS (compare-and-set) instruction, on the most contended hot spots of the algorithm. However, unlike prior attempts with FAA, our queue is both lock-free and linearizable.
We propose a general approach, SCQ, for bounded queues. This approach can easily be extended to support unbounded FIFO queues which can store an arbitrary number of elements. SCQ is portable across virtually all existing architectures and flexible enough for a wide variety of uses. We measure the performance of our algorithm on the x86-64 and PowerPC architectures. Our evaluation validates that our queue has exceptional memory efficiency compared to other algorithms and its performance is often comparable to, or exceeding that of state-of-the-art scalable algorithms
LIPIcs
The semantics of concurrent data structures is usually given by a sequential specification and a consistency condition. Linearizability is the most popular consistency condition due to its simplicity and general applicability. Nevertheless, for applications that do not require all guarantees offered by linearizability, recent research has focused on improving performance and scalability of concurrent data structures by relaxing their semantics. In this paper, we present local linearizability, a relaxed consistency condition that is applicable to container-type concurrent data structures like pools, queues, and stacks. While linearizability requires that the effect of each operation is observed by all threads at the same time, local linearizability only requires that for each thread T, the effects of its local insertion operations and the effects of those removal operations that remove values inserted by T are observed by all threads at the same time. We investigate theoretical and practical properties of local linearizability and its relationship to many existing consistency conditions. We present a generic implementation method for locally linearizable data structures that uses existing linearizable data structures as building blocks. Our implementations show performance and scalability improvements over the original building blocks and outperform the fastest existing container-type implementations
Concurrent Deterministic Skiplist and Other Data Structures
Skiplists are used in a variety of applications for storing data subject to
order criteria. In this article we discuss the design, analysis and performance
of a concurrent deterministic skip list on many-core NUMA nodes. We also
evaluate the performance of a concurrent lock-free unbounded queue
implementation and three implementations of multi-writer, multi-reader~(MWMR)
hash tables and compare their performance with equivalent implementations from
Intel's Thread Building Blocks~(TBB) library. We focus on strategies for memory
management that reduce page faults and cache misses for the memory access
patterns in these data structures. This paper proposes hierarchical usage of
concurrent data structures in programs to improve memory latencies by reducing
memory accesses from remote NUMA nodes
The Power of Choice in Priority Scheduling
Consider the following random process: we are given queues, into which
elements of increasing labels are inserted uniformly at random. To remove an
element, we pick two queues at random, and remove the element of lower label
(higher priority) among the two. The cost of a removal is the rank of the label
removed, among labels still present in any of the queues, that is, the distance
from the optimal choice at each step. Variants of this strategy are prevalent
in state-of-the-art concurrent priority queue implementations. Nonetheless, it
is not known whether such implementations provide any rank guarantees, even in
a sequential model.
We answer this question, showing that this strategy provides surprisingly
strong guarantees: Although the single-choice process, where we always insert
and remove from a single randomly chosen queue, has degrading cost, going to
infinity as we increase the number of steps, in the two choice process, the
expected rank of a removed element is while the expected worst-case
cost is . These bounds are tight, and hold irrespective of the
number of steps for which we run the process.
The argument is based on a new technical connection between "heavily loaded"
balls-into-bins processes and priority scheduling.
Our analytic results inspire a new concurrent priority queue implementation,
which improves upon the state of the art in terms of practical performance
Efficient Wait-Free Queue Algorithms with Multiple Enqueuers and Multiple Dequeuers
Despite the widespread usage of FIFO queues in distributed applications, designing efficient wait-free implementations of queues remains a challenge. The majority of wait-free queue implementations restrict either the number of dequeuers or the number of enqueuers that can operate on the queue, even when they use strong synchronization primitives, like the Compare&Swap. If we do not limit the number of processes that can perform enqueue and dequeue operations, the best-known upper bound on the worst case step complexity for a wait-free queue is given by [Khanchandani and Wattenhofer, 2018]. In particular, they present an implementation of a multiple dequeuer multiple enqueuer wait-free queue whose worst case step complexity is in O(?n), where n is the number of processes. In this work, we investigate whether it is possible to improve this bound. In particular, we present a wait-free FIFO queue implementation that supports n enqueuers and k dequeuers where the worst case step complexity of an Enqueue operation is in O(log n) and of a Dequeue operation is in O(k log n).
Then, we show that if the semantics of the queue can be relaxed, by allowing concurrent Dequeue operations to retrieve the same element, then we can achieve O(log n) worst-case step complexity for both the Enqueue and Dequeue operations
Concurrent Data Structures Linked in Time
Arguments about correctness of a concurrent data structure are typically
carried out by using the notion of linearizability and specifying the
linearization points of the data structure's procedures. Such arguments are
often cumbersome as the linearization points' position in time can be dynamic
(depend on the interference, run-time values and events from the past, or even
future), non-local (appear in procedures other than the one considered), and
whose position in the execution trace may only be determined after the
considered procedure has already terminated.
In this paper we propose a new method, based on a separation-style logic, for
reasoning about concurrent objects with such linearization points. We embrace
the dynamic nature of linearization points, and encode it as part of the data
structure's auxiliary state, so that it can be dynamically modified in place by
auxiliary code, as needed when some appropriate run-time event occurs. We name
the idea linking-in-time, because it reduces temporal reasoning to spatial
reasoning. For example, modifying a temporal position of a linearization point
can be modeled similarly to a pointer update in separation logic. Furthermore,
the auxiliary state provides a convenient way to concisely express the
properties essential for reasoning about clients of such concurrent objects. We
illustrate the method by verifying (mechanically in Coq) an intricate optimal
snapshot algorithm due to Jayanti, as well as some clients
Highly-Efficient Persistent FIFO Queues
In this paper, we study the question whether techniques employed, in a
conventional system, by state-of-the-art concurrent algorithms to avoid
contended hot spots are still efficient for recoverable computing in settings
with Non-Volatile Memory (NVM). We focus on concurrent FIFO queues that have
two end-points, head and tail, which are highly contended.
We present a persistent FIFO queue implementation that performs a pair of
persistence instructions per operation (enqueue or dequeue). The algorithm
achieves to perform these instructions on variables of low contention by
employing Fetch&Increment and using the state-of-the-art queue implementation
by Afek and Morrison (PPoPP'13). These result in performance that is up to 2x
faster than state-of-the-art persistent FIFO queue implementations