2,001 research outputs found

    DeltaTree: A Practical Locality-aware Concurrent Search Tree

    Full text link
    As other fundamental programming abstractions in energy-efficient computing, search trees are expected to support both high parallelism and data locality. However, existing highly-concurrent search trees such as red-black trees and AVL trees do not consider data locality while existing locality-aware search trees such as those based on the van Emde Boas layout (vEB-based trees), poorly support concurrent (update) operations. This paper presents DeltaTree, a practical locality-aware concurrent search tree that combines both locality-optimisation techniques from vEB-based trees and concurrency-optimisation techniques from non-blocking highly-concurrent search trees. DeltaTree is a kk-ary leaf-oriented tree of DeltaNodes in which each DeltaNode is a size-fixed tree-container with the van Emde Boas layout. The expected memory transfer costs of DeltaTree's Search, Insert, and Delete operations are O(log⁑BN)O(\log_B N), where N,BN, B are the tree size and the unknown memory block size in the ideal cache model, respectively. DeltaTree's Search operation is wait-free, providing prioritised lanes for Search operations, the dominant operation in search trees. Its Insert and {\em Delete} operations are non-blocking to other Search, Insert, and Delete operations, but they may be occasionally blocked by maintenance operations that are sometimes triggered to keep DeltaTree in good shape. Our experimental evaluation using the latest implementation of AVL, red-black, and speculation friendly trees from the Synchrobench benchmark has shown that DeltaTree is up to 5 times faster than all of the three concurrent search trees for searching operations and up to 1.6 times faster for update operations when the update contention is not too high

    Optimal Hierarchical Layouts for Cache-Oblivious Search Trees

    Full text link
    This paper proposes a general framework for generating cache-oblivious layouts for binary search trees. A cache-oblivious layout attempts to minimize cache misses on any hierarchical memory, independent of the number of memory levels and attributes at each level such as cache size, line size, and replacement policy. Recursively partitioning a tree into contiguous subtrees and prescribing an ordering amongst the subtrees, Hierarchical Layouts generalize many commonly used layouts for trees such as in-order, pre-order and breadth-first. They also generalize the various flavors of the van Emde Boas layout, which have previously been used as cache-oblivious layouts. Hierarchical Layouts thus unify all previous attempts at deriving layouts for search trees. The paper then derives a new locality measure (the Weighted Edge Product) that mimics the probability of cache misses at multiple levels, and shows that layouts that reduce this measure perform better. We analyze the various degrees of freedom in the construction of Hierarchical Layouts, and investigate the relative effect of each of these decisions in the construction of cache-oblivious layouts. Optimizing the Weighted Edge Product for complete binary search trees, we introduce the MinWEP layout, and show that it outperforms previously used cache-oblivious layouts by almost 20%.Comment: Extended version with proofs added to the appendi

    Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers

    Full text link
    The nested parallel (a.k.a. fork-join) model is widely used for writing parallel programs. However, the two composition constructs, i.e. "βˆ₯\parallel" (parallel) and ";;" (serial), are insufficient in expressing "partial dependencies" or "partial parallelism" in a program. We propose a new dataflow composition construct "⇝\leadsto" to express partial dependencies in algorithms in a processor- and cache-oblivious way, thus extending the Nested Parallel (NP) model to the \emph{Nested Dataflow} (ND) model. We redesign several divide-and-conquer algorithms ranging from dense linear algebra to dynamic-programming in the ND model and prove that they all have optimal span while retaining optimal cache complexity. We propose the design of runtime schedulers that map ND programs to multicore processors with multiple levels of possibly shared caches (i.e, Parallel Memory Hierarchies) and provide theoretical guarantees on their ability to preserve locality and load balance. For this, we adapt space-bounded (SB) schedulers for the ND model. We show that our algorithms have increased "parallelizability" in the ND model, and that SB schedulers can use the extra parallelizability to achieve asymptotically optimal bounds on cache misses and running time on a greater number of processors than in the NP model. The running time for the algorithms in this paper is O(βˆ‘i=0hβˆ’1Qβˆ—(t;Οƒβ‹…Mi)β‹…Cip)O\left(\frac{\sum_{i=0}^{h-1} Q^{*}({\mathsf t};\sigma\cdot M_i)\cdot C_i}{p}\right), where Qβˆ—Q^{*} is the cache complexity of task t{\mathsf t}, CiC_i is the cost of cache miss at level-ii cache which is of size MiM_i, Οƒβˆˆ(0,1)\sigma\in(0,1) is a constant, and pp is the number of processors in an hh-level cache hierarchy
    • …
    corecore