39 research outputs found
Efficient Lock-free Binary Search Trees
In this paper we present a novel algorithm for concurrent lock-free internal
binary search trees (BST) and implement a Set abstract data type (ADT) based on
that. We show that in the presented lock-free BST algorithm the amortized step
complexity of each set operation - {\sc Add}, {\sc Remove} and {\sc Contains} -
is , where, is the height of BST with number of nodes
and is the contention during the execution. Our algorithm adapts to
contention measures according to read-write load. If the situation is
read-heavy, the operations avoid helping pending concurrent {\sc Remove}
operations during traversal, and, adapt to interval contention. However, for
write-heavy situations we let an operation help pending {\sc Remove}, even
though it is not obstructed, and so adapt to tighter point contention. It uses
single-word compare-and-swap (\texttt{CAS}) operations. We show that our
algorithm has improved disjoint-access-parallelism compared to similar existing
algorithms. We prove that the presented algorithm is linearizable. To the best
of our knowledge this is the first algorithm for any concurrent tree data
structure in which the modify operations are performed with an additive term of
contention measure.Comment: 15 pages, 3 figures, submitted to POD
Parallel Finger Search Structures
In this paper we present two versions of a parallel finger structure FS on p processors that supports searches, insertions and deletions, and has a finger at each end. This is to our knowledge the first implementation of a parallel search structure that is work-optimal with respect to the finger bound and yet has very good parallelism (within a factor of O(log p)^2) of optimal). We utilize an extended implicit batching framework that transparently facilitates the use of FS by any parallel program P that is modelled by a dynamically generated DAG D where each node is either a unit-time instruction or a call to FS.
The work done by FS is bounded by the finger bound F_L (for some linearization L of D), i.e. each operation on an item with distance r from a finger takes O(log r+1) amortized work. Running P using the simpler version takes O((T_1+F_L)/p + T_infty + d * ((log p)^2 + log n)) time on a greedy scheduler, where T_1, T_infty are the size and span of D respectively, and n is the maximum number of items in FS, and d is the maximum number of calls to FS along any path in D. Using the faster version, this is reduced to O((T_1+F_L)/p + T_infty + d *(log p)^2 + s_L) time, where s_L is the weighted span of D where each call to FS is weighted by its cost according to F_L. FS can be extended to a fixed number of movable fingers.
The data structures in our paper fit into the dynamic multithreading paradigm, and their performance bounds are directly composable with other data structures given in the same paradigm. Also, the results can be translated to practical implementations using work-stealing schedulers
The Power of Choice in Priority Scheduling
Consider the following random process: we are given queues, into which
elements of increasing labels are inserted uniformly at random. To remove an
element, we pick two queues at random, and remove the element of lower label
(higher priority) among the two. The cost of a removal is the rank of the label
removed, among labels still present in any of the queues, that is, the distance
from the optimal choice at each step. Variants of this strategy are prevalent
in state-of-the-art concurrent priority queue implementations. Nonetheless, it
is not known whether such implementations provide any rank guarantees, even in
a sequential model.
We answer this question, showing that this strategy provides surprisingly
strong guarantees: Although the single-choice process, where we always insert
and remove from a single randomly chosen queue, has degrading cost, going to
infinity as we increase the number of steps, in the two choice process, the
expected rank of a removed element is while the expected worst-case
cost is . These bounds are tight, and hold irrespective of the
number of steps for which we run the process.
The argument is based on a new technical connection between "heavily loaded"
balls-into-bins processes and priority scheduling.
Our analytic results inspire a new concurrent priority queue implementation,
which improves upon the state of the art in terms of practical performance
Parallel Working-Set Search Structures
In this paper we present two versions of a parallel working-set map on p
processors that supports searches, insertions and deletions. In both versions,
the total work of all operations when the map has size at least p is bounded by
the working-set bound, i.e., the cost of an item depends on how recently it was
accessed (for some linearization): accessing an item in the map with recency r
takes O(1+log r) work. In the simpler version each map operation has O((log
p)^2+log n) span (where n is the maximum size of the map). In the pipelined
version each map operation on an item with recency r has O((log p)^2+log r)
span. (Operations in parallel may have overlapping span; span is additive only
for operations in sequence.)
Both data structures are designed to be used by a dynamic multithreading
parallel program that at each step executes a unit-time instruction or makes a
data structure call. To achieve the stated bounds, the pipelined data structure
requires a weak-priority scheduler, which supports a limited form of 2-level
prioritization. At the end we explain how the results translate to practical
implementations using work-stealing schedulers.
To the best of our knowledge, this is the first parallel implementation of a
self-adjusting search structure where the cost of an operation adapts to the
access sequence. A corollary of the working-set bound is that it achieves work
static optimality: the total work is bounded by the access costs in an optimal
static search tree.Comment: Authors' version of a paper accepted to SPAA 201
A Template for Implementing Fast Lock-free Trees Using HTM
Algorithms that use hardware transactional memory (HTM) must provide a
software-only fallback path to guarantee progress. The design of the fallback
path can have a profound impact on performance. If the fallback path is allowed
to run concurrently with hardware transactions, then hardware transactions must
be instrumented, adding significant overhead. Otherwise, hardware transactions
must wait for any processes on the fallback path, causing concurrency
bottlenecks, or move to the fallback path. We introduce an approach that
combines the best of both worlds. The key idea is to use three execution paths:
an HTM fast path, an HTM middle path, and a software fallback path, such that
the middle path can run concurrently with each of the other two. The fast path
and fallback path do not run concurrently, so the fast path incurs no
instrumentation overhead. Furthermore, fast path transactions can move to the
middle path instead of waiting or moving to the software path. We demonstrate
our approach by producing an accelerated version of the tree update template of
Brown et al., which can be used to implement fast lock-free data structures
based on down-trees. We used the accelerated template to implement two
lock-free trees: a binary search tree (BST), and an (a,b)-tree (a
generalization of a B-tree). Experiments show that, with 72 concurrent
processes, our accelerated (a,b)-tree performs between 4.0x and 4.2x as many
operations per second as an implementation obtained using the original tree
update template
Concurrent Deterministic Skiplist and Other Data Structures
Skiplists are used in a variety of applications for storing data subject to
order criteria. In this article we discuss the design, analysis and performance
of a concurrent deterministic skip list on many-core NUMA nodes. We also
evaluate the performance of a concurrent lock-free unbounded queue
implementation and three implementations of multi-writer, multi-reader~(MWMR)
hash tables and compare their performance with equivalent implementations from
Intel's Thread Building Blocks~(TBB) library. We focus on strategies for memory
management that reduce page faults and cache misses for the memory access
patterns in these data structures. This paper proposes hierarchical usage of
concurrent data structures in programs to improve memory latencies by reducing
memory accesses from remote NUMA nodes