
    A Template for Implementing Fast Lock-free Trees Using HTM

    Algorithms that use hardware transactional memory (HTM) must provide a software-only fallback path to guarantee progress. The design of the fallback path can have a profound impact on performance. If the fallback path is allowed to run concurrently with hardware transactions, then hardware transactions must be instrumented, adding significant overhead. Otherwise, hardware transactions must either wait for any processes on the fallback path, causing concurrency bottlenecks, or move to the fallback path themselves. We introduce an approach that combines the best of both worlds. The key idea is to use three execution paths: an HTM fast path, an HTM middle path, and a software fallback path, such that the middle path can run concurrently with each of the other two. The fast path and fallback path do not run concurrently, so the fast path incurs no instrumentation overhead. Furthermore, fast path transactions can move to the middle path instead of waiting or moving to the software path. We demonstrate our approach by producing an accelerated version of the tree update template of Brown et al., which can be used to implement fast lock-free data structures based on down-trees. We used the accelerated template to implement two lock-free trees: a binary search tree (BST) and an (a,b)-tree (a generalization of a B-tree). Experiments show that, with 72 concurrent processes, our accelerated (a,b)-tree performs between 4.0x and 4.2x as many operations per second as an implementation obtained using the original tree update template.
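The three-path idea in this abstract can be pictured as a path-selection policy. The following is only a minimal sketch under stated assumptions, not the authors' implementation: Python has no HTM, so the hardware transactions are stood in for by hypothetical callables (`try_fast`, `try_middle`) that return `True` on commit, `fallback_in_use` is an assumed check for whether any process is on the software path, and the retry budget `max_attempts` is invented for illustration.

```python
from typing import Callable


def run_operation(
    try_fast: Callable[[], bool],
    try_middle: Callable[[], bool],
    run_fallback: Callable[[], None],
    fallback_in_use: Callable[[], bool],
    max_attempts: int = 3,  # hypothetical retry budget per path
) -> str:
    """Return the name of the path on which the operation completed."""
    # Fast path: uninstrumented, so it must never overlap the fallback path.
    for _ in range(max_attempts):
        if fallback_in_use():
            # Rather than waiting, move on to the middle path.
            break
        if try_fast():
            return "fast"

    # Middle path: instrumented, may run alongside either other path.
    for _ in range(max_attempts):
        if try_middle():
            return "middle"

    # Software fallback: always completes, which guarantees progress.
    run_fallback()
    return "fallback"
```

The point of the sketch is the ordering: an operation that finds the fallback path occupied drops to the middle path instead of blocking, and only exhausts to software when both transactional paths keep failing.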

    A Concurrency-Optimal Binary Search Tree

    The paper presents the first concurrency-optimal implementation of a binary search tree (BST). The implementation, based on a standard sequential implementation of an internal tree, ensures that every schedule, i.e., every interleaving of steps of the sequential code, is accepted unless it violates linearizability. To ensure this property, we use a novel read-write locking scheme that protects tree edges in addition to nodes. Our implementation outperforms the state-of-the-art BSTs on most basic workloads, which suggests that optimizing the set of accepted schedules of the sequential code can be an adequate design principle for efficient concurrent data structures.

    Fast Quantum Modular Exponentiation

    We present a detailed analysis of the impact on modular exponentiation of architectural features and possible concurrent gate execution. Various arithmetic algorithms are evaluated for execution time, potential concurrency, and space tradeoffs. We find that, to exponentiate an n-bit number, for storage space 100n (twenty times the minimum 5n), we can execute modular exponentiation two hundred to seven hundred times faster than optimized versions of the basic algorithms, depending on architecture, for n=128. Addition on a neighbor-only architecture is limited to O(n) time, whereas non-neighbor architectures can reach O(log n), demonstrating that physical characteristics of a computing device have an important impact on both real-world running time and asymptotic behavior. Our results will help guide experimental implementations of quantum algorithms and devices.
    Comment: to appear in PRA 71(5); RevTeX, 12 pages, 12 figures; v2 revision is substantial, with new algorithmic variants, much shorter and clearer text, and revised equation formatting
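The abstract does not spell out the arithmetic being accelerated. As orientation only, here is a classical sketch of modular exponentiation by repeated squaring, a common way to organize the computation. It is not the paper's quantum circuit, and the function name `mod_exp` is illustrative.

```python
def mod_exp(base: int, exponent: int, modulus: int) -> int:
    """Compute (base ** exponent) % modulus by square-and-multiply."""
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    if exponent < 0:
        raise ValueError("exponent must be non-negative")

    result = 1 % modulus  # handles modulus == 1, where every result is 0
    base %= modulus
    while exponent > 0:
        # Multiply in the current power of the base when the low bit is set.
        if exponent & 1:
            result = (result * base) % modulus
        # Square the base for the next bit of the exponent.
        base = (base * base) % modulus
        exponent >>= 1
    return result
```

For an n-bit exponent the loop runs n times, so the cost is dominated by the modular multiplications, which in turn are built from additions. That is why the speed of the adder matters in the abstract's comparison between neighbor-only and non-neighbor architectures.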

    Compiling vector pascal to the XeonPhi

    Intel's XeonPhi is a highly parallel x86 architecture chip. It has a number of novel features which make it a particularly challenging target for the compiler writer. This paper describes the techniques used to port the Glasgow Vector Pascal Compiler to this architecture and assesses its performance by comparing the XeonPhi with three other machines running the same algorithms.