16,883 research outputs found
A Template for Implementing Fast Lock-free Trees Using HTM
Algorithms that use hardware transactional memory (HTM) must provide a
software-only fallback path to guarantee progress. The design of the fallback
path can have a profound impact on performance. If the fallback path is allowed
to run concurrently with hardware transactions, then hardware transactions must
be instrumented, adding significant overhead. Otherwise, hardware transactions
must wait for any processes on the fallback path, causing concurrency
bottlenecks, or move to the fallback path. We introduce an approach that
combines the best of both worlds. The key idea is to use three execution paths:
an HTM fast path, an HTM middle path, and a software fallback path, such that
the middle path can run concurrently with each of the other two. The fast path
and fallback path do not run concurrently, so the fast path incurs no
instrumentation overhead. Furthermore, fast path transactions can move to the
middle path instead of waiting or moving to the software path. We demonstrate
our approach by producing an accelerated version of the tree update template of
Brown et al., which can be used to implement fast lock-free data structures
based on down-trees. We used the accelerated template to implement two
lock-free trees: a binary search tree (BST), and an (a,b)-tree (a
generalization of a B-tree). Experiments show that, with 72 concurrent
processes, our accelerated (a,b)-tree performs between 4.0x and 4.2x as many
operations per second as an implementation obtained using the original tree
update template
A Concurrency-Optimal Binary Search Tree
The paper presents the first \emph{concurrency-optimal} implementation of a
binary search tree (BST). The implementation, based on a standard sequential
implementation of an internal tree, ensures that every \emph{schedule} is
accepted, i.e., interleaving of steps of the sequential code, unless
linearizability is violated. To ensure this property, we use a novel read-write
locking scheme that protects tree \emph{edges} in addition to nodes.
Our implementation outperforms the state-of-the art BSTs on most basic
workloads, which suggests that optimizing the set of accepted schedules of the
sequential code can be an adequate design principle for efficient concurrent
data structures
Fast Quantum Modular Exponentiation
We present a detailed analysis of the impact on modular exponentiation of
architectural features and possible concurrent gate execution. Various
arithmetic algorithms are evaluated for execution time, potential concurrency,
and space tradeoffs. We find that, to exponentiate an n-bit number, for storage
space 100n (twenty times the minimum 5n), we can execute modular exponentiation
two hundred to seven hundred times faster than optimized versions of the basic
algorithms, depending on architecture, for n=128. Addition on a neighbor-only
architecture is limited to O(n) time when non-neighbor architectures can reach
O(log n), demonstrating that physical characteristics of a computing device
have an important impact on both real-world running time and asymptotic
behavior. Our results will help guide experimental implementations of quantum
algorithms and devices.Comment: to appear in PRA 71(5); RevTeX, 12 pages, 12 figures; v2 revision is
substantial, with new algorithmic variants, much shorter and clearer text,
and revised equation formattin
Compiling vector pascal to the XeonPhi
Intel's XeonPhi is a highly parallel x86 architecture chip made by Intel. It has a number of novel features which make it a particularly challenging target for the compiler writer. This paper describes the techniques used to port the Glasgow Vector Pascal Compiler to this architecture and assess its performance by comparisons of the XeonPhi with 3 other machines running the same algorithms
- …