5,016 research outputs found
DeltaTree: A Practical Locality-aware Concurrent Search Tree
As other fundamental programming abstractions in energy-efficient computing,
search trees are expected to support both high parallelism and data locality.
However, existing highly-concurrent search trees such as red-black trees and
AVL trees do not consider data locality while existing locality-aware search
trees such as those based on the van Emde Boas layout (vEB-based trees), poorly
support concurrent (update) operations.
This paper presents DeltaTree, a practical locality-aware concurrent search
tree that combines both locality-optimisation techniques from vEB-based trees
and concurrency-optimisation techniques from non-blocking highly-concurrent
search trees. DeltaTree is a -ary leaf-oriented tree of DeltaNodes in which
each DeltaNode is a size-fixed tree-container with the van Emde Boas layout.
The expected memory transfer costs of DeltaTree's Search, Insert, and Delete
operations are , where are the tree size and the unknown
memory block size in the ideal cache model, respectively. DeltaTree's Search
operation is wait-free, providing prioritised lanes for Search operations, the
dominant operation in search trees. Its Insert and {\em Delete} operations are
non-blocking to other Search, Insert, and Delete operations, but they may be
occasionally blocked by maintenance operations that are sometimes triggered to
keep DeltaTree in good shape. Our experimental evaluation using the latest
implementation of AVL, red-black, and speculation friendly trees from the
Synchrobench benchmark has shown that DeltaTree is up to 5 times faster than
all of the three concurrent search trees for searching operations and up to 1.6
times faster for update operations when the update contention is not too high
A Template for Implementing Fast Lock-free Trees Using HTM
Algorithms that use hardware transactional memory (HTM) must provide a
software-only fallback path to guarantee progress. The design of the fallback
path can have a profound impact on performance. If the fallback path is allowed
to run concurrently with hardware transactions, then hardware transactions must
be instrumented, adding significant overhead. Otherwise, hardware transactions
must wait for any processes on the fallback path, causing concurrency
bottlenecks, or move to the fallback path. We introduce an approach that
combines the best of both worlds. The key idea is to use three execution paths:
an HTM fast path, an HTM middle path, and a software fallback path, such that
the middle path can run concurrently with each of the other two. The fast path
and fallback path do not run concurrently, so the fast path incurs no
instrumentation overhead. Furthermore, fast path transactions can move to the
middle path instead of waiting or moving to the software path. We demonstrate
our approach by producing an accelerated version of the tree update template of
Brown et al., which can be used to implement fast lock-free data structures
based on down-trees. We used the accelerated template to implement two
lock-free trees: a binary search tree (BST), and an (a,b)-tree (a
generalization of a B-tree). Experiments show that, with 72 concurrent
processes, our accelerated (a,b)-tree performs between 4.0x and 4.2x as many
operations per second as an implementation obtained using the original tree
update template
Techniques for Constructing Efficient Lock-free Data Structures
Building a library of concurrent data structures is an essential way to
simplify the difficult task of developing concurrent software. Lock-free data
structures, in which processes can help one another to complete operations,
offer the following progress guarantee: If processes take infinitely many
steps, then infinitely many operations are performed. Handcrafted lock-free
data structures can be very efficient, but are notoriously difficult to
implement. We introduce numerous tools that support the development of
efficient lock-free data structures, and especially trees.Comment: PhD thesis, Univ Toronto (2017
Deletion without Rebalancing in Non-Blocking Binary Search Trees
We present a provably linearizable and lock-free relaxed AVL tree called the non-blocking ravl tree. At any time, the height of a non-blocking ravl tree is upper bounded by log_d (2m) + c, where d is the golden ratio, m is the total number of successful INSERT operations performed so far and c is the number of active concurrent processes that have inserted new keys and are still rebalancing the tree at this time. The most significant feature of the non-blocking ravl tree is that it does not rebalance itself after DELETE operations. Instead, it performs rebalancing only after INSERT operations. Thus, the non-blocking ravl tree is much simpler to implement than other self-balancing concurrent binary search trees (BSTs) which typically introduce a large number of rebalancing cases after DELETE operations, while still providing a provable non-trivial bound on its height. We further conduct experimental studies to compare our solution with other state-of-the-art concurrent BSTs using randomly generated data sequences under uniform distributions, and find that our solution achieves the best performance among concurrent self-balancing BSTs. As the keys in access sequences are likely to be partially sorted in system software, we also conduct experiments using data sequences with various degrees of presortedness to better simulate applications in practice. Our experimental results show that, when there are enough degrees of presortedness, our solution achieves the best performance among all the concurrent BSTs used in our studies, including those that perform self-balancing operations and those that do not, and thus is potentially the best candidate for many real-world applications
Parallel Working-Set Search Structures
In this paper we present two versions of a parallel working-set map on p
processors that supports searches, insertions and deletions. In both versions,
the total work of all operations when the map has size at least p is bounded by
the working-set bound, i.e., the cost of an item depends on how recently it was
accessed (for some linearization): accessing an item in the map with recency r
takes O(1+log r) work. In the simpler version each map operation has O((log
p)^2+log n) span (where n is the maximum size of the map). In the pipelined
version each map operation on an item with recency r has O((log p)^2+log r)
span. (Operations in parallel may have overlapping span; span is additive only
for operations in sequence.)
Both data structures are designed to be used by a dynamic multithreading
parallel program that at each step executes a unit-time instruction or makes a
data structure call. To achieve the stated bounds, the pipelined data structure
requires a weak-priority scheduler, which supports a limited form of 2-level
prioritization. At the end we explain how the results translate to practical
implementations using work-stealing schedulers.
To the best of our knowledge, this is the first parallel implementation of a
self-adjusting search structure where the cost of an operation adapts to the
access sequence. A corollary of the working-set bound is that it achieves work
static optimality: the total work is bounded by the access costs in an optimal
static search tree.Comment: Authors' version of a paper accepted to SPAA 201
- …