9 research outputs found
Brief Announcement: Persistent Software Combining
We study the power of software combining in designing recoverable algorithms and data structures. We present two recoverable synchronization protocols, one blocking and one wait-free, which illustrate how software combining can achieve both low persistence cost and low synchronization cost. Our experiments show that these protocols far outperform state-of-the-art recoverable universal constructions and transactional memory systems. Based on these protocols, we built recoverable queues and stacks that exhibit much better performance than previous such implementations.
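For readers unfamiliar with the combining technique these protocols build on, the sketch below shows a minimal blocking software-combining counter: threads announce operations in per-thread slots, and whichever thread holds the combiner lock applies all announced operations in one pass. This is an illustration only, not the paper's recoverable protocol; persist() is a placeholder for the cache-line write-back and fence that a recoverable variant would issue once per combined batch rather than once per operation.

```cpp
#include <atomic>
#include <array>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kThreads = 4;

struct alignas(64) Slot {
    std::atomic<bool> pending{false};  // operation announced but not yet applied
    std::atomic<long> arg{0};          // increment requested by the announcer
    std::atomic<long> result{0};       // value handed back by the combiner
};

static std::array<Slot, kThreads> slots;      // announcement array, one slot per thread
static std::atomic<bool> combiner_lock{false};
static long counter = 0;                      // shared state, touched only by the combiner

static void persist(const void*, std::size_t) {
    // Placeholder: a recoverable variant would write back and fence the dirty
    // cache lines here, once per combined batch rather than once per operation.
}

long fetch_and_add(int tid, long delta) {
    slots[tid].arg.store(delta, std::memory_order_relaxed);
    slots[tid].pending.store(true, std::memory_order_release);     // announce
    while (true) {
        if (!combiner_lock.exchange(true, std::memory_order_acquire)) {
            // We hold the lock: act as the combiner and apply all announced ops.
            for (auto& s : slots) {
                if (s.pending.load(std::memory_order_acquire)) {
                    s.result.store(counter, std::memory_order_relaxed);
                    counter += s.arg.load(std::memory_order_relaxed);
                    s.pending.store(false, std::memory_order_release);
                }
            }
            persist(&counter, sizeof counter);  // one persistence point per batch
            combiner_lock.store(false, std::memory_order_release);
        }
        if (!slots[tid].pending.load(std::memory_order_acquire))   // someone served us
            return slots[tid].result.load(std::memory_order_relaxed);
    }
}

int main() {
    std::vector<std::thread> ts;
    for (int t = 0; t < kThreads; ++t)
        ts.emplace_back([t] { for (int i = 0; i < 10000; ++i) fetch_and_add(t, 1); });
    for (auto& t : ts) t.join();
    std::printf("counter = %ld (expected %d)\n", counter, kThreads * 10000);
}
```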
Optimization of thread affinity and memory affinity for remote core locking synchronization in multithreaded programs for multicore computer systems
This paper proposes algorithms for optimizing the Remote Core Locking (RCL) synchronization method in multithreaded programs. An algorithm for initializing RCL locks and algorithms for optimizing thread affinity are developed. The algorithms take into account the structure of hierarchical computer systems and non-uniform memory access (NUMA) in order to minimize the execution time of RCL programs. Experimental results on multi-core computer systems presented in the paper show a reduction in the execution time of RCL programs.
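As a concrete illustration of the mechanism such optimizations control, the sketch below (my own, not the paper's algorithm) pins a dedicated RCL-style lock-server thread to a chosen core with pthread_setaffinity_np. The core id here is a hypothetical choice; a real implementation would pick it from the NUMA topology (e.g., via libnuma) so the server sits close to the data it serves. Linux-specific, compile with -pthread.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for pthread_setaffinity_np / CPU_SET on glibc
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin the calling thread to a single core.
static bool pin_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

// Body of the dedicated server thread that, in RCL, executes critical sections
// on behalf of client threads; here it only pins itself and reports.
static void* server_thread(void* arg) {
    int core = *static_cast<int*>(arg);
    if (pin_to_core(core))
        std::printf("lock server pinned to core %d\n", core);
    // ... serve critical-section requests posted by client threads here ...
    return nullptr;
}

int main() {
    int server_core = 0;   // hypothetical: a core on the NUMA node holding the protected data
    pthread_t tid;
    pthread_create(&tid, nullptr, server_thread, &server_core);
    pthread_join(tid, nullptr);
}
```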
The Power of Choice in Priority Scheduling
Consider the following random process: we are given n queues, into which
elements of increasing labels are inserted uniformly at random. To remove an
element, we pick two queues at random, and remove the element of lower label
(higher priority) among the two. The cost of a removal is the rank of the label
removed, among labels still present in any of the queues, that is, the distance
from the optimal choice at each step. Variants of this strategy are prevalent
in state-of-the-art concurrent priority queue implementations. Nonetheless, it
is not known whether such implementations provide any rank guarantees, even in
a sequential model.
We answer this question, showing that this strategy provides surprisingly
strong guarantees: Although the single-choice process, where we always insert
and remove from a single randomly chosen queue, has degrading cost, going to
infinity as we increase the number of steps, in the two choice process, the
expected rank of a removed element is O(n) while the expected worst-case
cost is O(n log n). These bounds are tight, and hold irrespective of the
number of steps for which we run the process.
The argument is based on a new technical connection between "heavily loaded"
balls-into-bins processes and priority scheduling.
Our analytic results inspire a new concurrent priority queue implementation,
which improves upon the state of the art in terms of practical performance.
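To make the process concrete, here is a small sequential simulation of it (my own illustration, not the paper's code). The values n = 16 and 10,000 steps are arbitrary; the printed cost of a removal is the rank of the removed label among all labels still present in any queue.

```cpp
#include <cstdio>
#include <deque>
#include <random>
#include <vector>

int main() {
    const int n = 16;                 // number of queues
    const int steps = 10000;          // number of insertions, then removals
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> pick(0, n - 1);

    std::vector<std::deque<int>> q(n);
    for (int label = 0; label < steps; ++label)
        q[pick(rng)].push_back(label);        // labels inserted in increasing order

    long long total_rank = 0;
    int removals = 0;
    for (int step = 0; step < steps; ++step) {
        int a = pick(rng), b = pick(rng);     // two-choice: sample two queues
        // Remove from the non-empty queue whose head has the smaller label.
        int chosen = -1;
        if (!q[a].empty() && (q[b].empty() || q[a].front() < q[b].front())) chosen = a;
        else if (!q[b].empty()) chosen = b;
        if (chosen < 0) continue;

        int removed = q[chosen].front();
        // Rank = 1 + number of remaining labels smaller than the removed one.
        int rank = 1;
        for (const auto& qq : q)
            for (int x : qq)
                if (x < removed) ++rank;
        q[chosen].pop_front();
        total_rank += rank;
        ++removals;
    }
    std::printf("average removal rank with n=%d queues: %.2f\n",
                n, removals ? double(total_rank) / removals : 0.0);
}
```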
Parallel Working-Set Search Structures
In this paper we present two versions of a parallel working-set map on p
processors that supports searches, insertions and deletions. In both versions,
the total work of all operations when the map has size at least p is bounded by
the working-set bound, i.e., the cost of an item depends on how recently it was
accessed (for some linearization): accessing an item in the map with recency r
takes O(1+log r) work. In the simpler version each map operation has O((log
p)^2+log n) span (where n is the maximum size of the map). In the pipelined
version each map operation on an item with recency r has O((log p)^2+log r)
span. (Operations in parallel may have overlapping span; span is additive only
for operations in sequence.)
Both data structures are designed to be used by a dynamic multithreading
parallel program that at each step executes a unit-time instruction or makes a
data structure call. To achieve the stated bounds, the pipelined data structure
requires a weak-priority scheduler, which supports a limited form of 2-level
prioritization. At the end we explain how the results translate to practical
implementations using work-stealing schedulers.
To the best of our knowledge, this is the first parallel implementation of a
self-adjusting search structure where the cost of an operation adapts to the
access sequence. A corollary of the working-set bound is that it achieves work
static optimality: the total work is bounded by the access costs in an optimal
static search tree.
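The sketch below only illustrates the working-set bound that these structures are measured against, not the parallel data structures themselves: for each access it computes the recency r (the number of distinct items touched since that item was last accessed) and the O(1 + log r) charge. Treating a first access as having recency equal to one plus the number of items seen so far is my simplification.

```cpp
#include <cmath>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

int main() {
    std::vector<std::string> accesses = {"a", "b", "a", "c", "b", "a", "a"};
    std::unordered_map<std::string, long> last_pos;   // last index at which each item was accessed
    double total = 0.0;

    for (long i = 0; i < (long)accesses.size(); ++i) {
        const std::string& x = accesses[i];
        long r;
        auto it = last_pos.find(x);
        if (it == last_pos.end()) {
            r = (long)last_pos.size() + 1;    // first access: assumed recency (see note above)
        } else {
            std::unordered_set<std::string> distinct;
            for (long j = it->second + 1; j < i; ++j) distinct.insert(accesses[j]);
            r = (long)distinct.size() + 1;    // distinct items since the last access to x
        }
        double charge = 1.0 + std::log2((double)r);
        total += charge;
        std::printf("access %-2s recency=%ld charge=%.2f\n", x.c_str(), r, charge);
        last_pos[x] = i;
    }
    std::printf("working-set bound on total work: %.2f\n", total);
}
```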
Highly-Efficient Persistent FIFO Queues
In this paper, we study the question whether techniques employed, in a
conventional system, by state-of-the-art concurrent algorithms to avoid
contended hot spots are still efficient for recoverable computing in settings
with Non-Volatile Memory (NVM). We focus on concurrent FIFO queues that have
two end-points, head and tail, which are highly contended.
We present a persistent FIFO queue implementation that performs a pair of
persistence instructions per operation (enqueue or dequeue). The algorithm
manages to perform these instructions on variables of low contention by
employing Fetch&Increment and building on the state-of-the-art queue
implementation by Afek and Morrison (PPoPP'13). This results in performance
that is up to 2x faster than state-of-the-art persistent FIFO queue
implementations.
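The sketch below illustrates the general pattern the abstract refers to, not the authors' algorithm: an operation reserves a slot with Fetch&Increment on a shared counter, fills in its own (uncontended) cell, and then issues one cache-line write-back plus one fence, the "pair of persistence instructions", on that low-contention cell. It assumes an x86 CPU with CLWB (compile with -mclwb) and simulates the NVM-resident array with ordinary memory.

```cpp
#include <immintrin.h>
#include <atomic>
#include <cstdint>
#include <cstdio>

constexpr std::size_t kCapacity = 1024;
struct alignas(64) Cell { uint64_t value; uint64_t valid; };   // one cache line per cell

static Cell log_area[kCapacity];            // stand-in for an NVM-resident array
static std::atomic<uint64_t> tail{0};       // slots handed out by Fetch&Increment

static void persist_cell(Cell* c) {
    _mm_clwb(c);       // write the cell's cache line back to memory...
    _mm_sfence();      // ...and order it before later stores
}

// Durable "enqueue" of a value into the next free slot.
bool enqueue(uint64_t v) {
    uint64_t pos = tail.fetch_add(1, std::memory_order_relaxed);  // reserve a slot
    if (pos >= kCapacity) return false;
    log_area[pos].value = v;
    log_area[pos].valid = 1;                 // marks the record as complete
    persist_cell(&log_area[pos]);            // the pair of persistence instructions,
                                             // issued on an uncontended cell
    return true;
}

int main() {
    enqueue(42);
    std::printf("slot 0: value=%llu valid=%llu\n",
                (unsigned long long)log_area[0].value,
                (unsigned long long)log_area[0].valid);
}
```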
Monotonically relaxing concurrent data-structure semantics for performance: An efficient 2D design framework
There has been a significant amount of work in the literature proposing
semantic relaxation of concurrent data structures for improving scalability and
performance. Relaxing the semantics of a data structure unveils a bigger design
space that allows weaker synchronization and more useful parallelism.
Investigating new data structure designs, capable of trading semantics for
better performance in a monotonic way, is a major
challenge in the area. We algorithmically address this challenge in this paper.
We present an efficient, lock-free, concurrent data structure design framework
for out-of-order semantic relaxation. Our framework introduces a new
two-dimensional algorithmic design that uses multiple instances of a given data
structure. The first dimension of our design is the number of data structure
instances that operations are spread across, in order to benefit from parallelism
through disjoint memory access. The second dimension is the number of
consecutive operations that try to use the same data structure instance in
order to benefit from data locality. Our design can flexibly explore this
two-dimensional space to relax concurrent data structure semantics monotonically
in exchange for better throughput, within a tight deterministic relaxation
bound, as we prove in the paper. We
show how our framework can instantiate lock-free out-of-order queues, stacks,
counters and deques. We provide implementations of these relaxed data
structures and evaluate their performance and behaviour on two parallel
architectures. The experimental evaluation shows that our two-dimensional data
structures significantly outperform the respective previously proposed ones in
terms of scalability and throughput. Moreover, their throughput increases
monotonically as relaxation increases.
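The sketch below shows the two dimensions on the simplest possible instance, a relaxed counter (my own illustration; the paper's lock-free queue, stack and deque designs are substantially more involved). "width" is how many sub-instances operations are spread over (parallelism through disjoint memory), and "depth" is how many consecutive operations a thread performs on the same sub-instance before switching (locality); the chosen values are arbitrary.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

struct RelaxedCounter {
    RelaxedCounter(int width, int depth) : width_(width), depth_(depth), subs_(width) {}

    void increment(int& local_index, int& local_budget) {
        if (local_budget == 0) {                       // run on this instance is used up
            local_index = (local_index + 1) % width_;  // hop to another sub-counter
            local_budget = depth_;
        }
        subs_[local_index].v.fetch_add(1, std::memory_order_relaxed);
        --local_budget;
    }

    long total() const {                               // exact only when quiescent
        long sum = 0;
        for (const auto& s : subs_) sum += s.v.load(std::memory_order_relaxed);
        return sum;
    }

private:
    struct alignas(64) Sub { std::atomic<long> v{0}; };  // padded to avoid false sharing
    int width_, depth_;
    std::vector<Sub> subs_;
};

int main() {
    RelaxedCounter c(/*width=*/8, /*depth=*/4);
    std::vector<std::thread> ts;
    for (int t = 0; t < 4; ++t)
        ts.emplace_back([&c, t] {
            int idx = t, budget = 4;                   // start threads on different instances
            for (int i = 0; i < 100000; ++i) c.increment(idx, budget);
        });
    for (auto& t : ts) t.join();
    std::printf("total = %ld (expected 400000)\n", c.total());
}
```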
Efficient Communication and Synchronization on Manycore Processors
The increased number of cores integrated on a chip has brought about a number of challenges. Concerns about the scalability of cache coherence protocols have urged both researchers and practitioners to explore alternative programming models, where cache coherence is not a given. Message passing, traditionally used in distributed systems, has surfaced as an appealing alternative to shared memory, commonly used in multiprocessor systems. In this thesis, we study how basic communication and synchronization primitives on manycore processors can be improved, with an emphasis on taking advantage of message passing. We do this in two different contexts: (i) message passing is the only means of communication and (ii) it coexists with traditional cache-coherent shared memory.

In the first part of the thesis, we analytically and experimentally study collective communication on a message-passing manycore processor. First, we devise broadcast algorithms for the Intel SCC, an experimental manycore platform without coherent caches. Our ideas are captured by OC-Bcast (on-chip broadcast), a tree-based broadcast algorithm. Two versions of OC-Bcast are presented: one for synchronous communication, suitable for use in high-performance libraries implementing the Message Passing Interface (MPI), and another for asynchronous communication, for use in distributed algorithms and general-purpose software. Both OC-Bcast flavors are based on one-sided communication and significantly outperform (by up to 3x) state-of-the-art two-sided algorithms. Next, we develop an analytical communication model for the SCC. By expressing the latency and throughput of different broadcast algorithms through this model, we reveal that the advantage of OC-Bcast comes from greatly reducing the number of off-chip memory accesses on the critical path.

The second part of the thesis focuses on lock-based synchronization. We start by introducing the concept of hybrid mutual exclusion algorithms, which rely on both cache-coherent shared memory and message passing. The hybrid algorithms we present, HybLock and HybComb, are shown to significantly outperform (by as much as 4x) their shared-memory-only counterparts when used to implement concurrent counters, stacks and queues on a hybrid Tilera TILE-Gx processor. The advantage of our hybrid algorithms comes from the fact that their most critical parts rely on message passing, thereby avoiding the overhead of the cache coherence protocol. Still, we take advantage of shared memory, as shared state makes the implementation of certain mechanisms much more straightforward. Next, we try to profit from these insights even on processors without hardware support for message passing. Taking two classic x86 processors from Intel and AMD, we come up with cache-aware optimizations that improve the performance of executing contended critical sections by as much as 6x.
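As a toy illustration of the tree-based broadcast idea behind OC-Bcast (my sketch, not the thesis code): cores are arranged in a k-ary tree, the root "sends" to its children, and every core forwards to its own children, so a broadcast to p cores completes in O(log_k p) forwarding rounds instead of p-1 sequential sends. The fanout of 3 is an arbitrary choice; in a real implementation each hop would be a one-sided write into the child's on-chip message-passing buffer.

```cpp
#include <cstdio>
#include <vector>

constexpr int kFanout = 3;   // tree arity; on the SCC this would be tuned to the mesh layout

// In a real implementation this would be a one-sided write into the child's
// message-passing buffer; here we just record the round at which each core receives.
void forward(int core, int num_cores, int round, std::vector<int>& recv_round) {
    recv_round[core] = round;
    for (int i = 1; i <= kFanout; ++i) {
        int child = core * kFanout + i;
        if (child < num_cores) forward(child, num_cores, round + 1, recv_round);
    }
}

int main() {
    const int p = 48;                         // e.g., the 48 cores of the Intel SCC
    std::vector<int> recv_round(p, -1);
    forward(/*root=*/0, p, /*round=*/0, recv_round);
    int rounds = 0;
    for (int r : recv_round) if (r > rounds) rounds = r;
    std::printf("broadcast to %d cores finishes after %d forwarding rounds\n", p, rounds);
}
```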