
    Single-Producer/Single-Consumer Queues on Shared Cache Multi-Core Systems

    Efficient point-to-point communication channels are critical for implementing fine-grained parallel programs on modern shared-cache multi-core architectures. This report discusses in detail several implementations of wait-free Single-Producer/Single-Consumer (SPSC) queues, and presents a novel and efficient algorithm for implementing an unbounded wait-free SPSC queue (uSPSC). A correctness proof of the new algorithm and several performance measurements, based on a simple synthetic benchmark and a microbenchmark, are also discussed.
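    To make the idea of a wait-free SPSC channel concrete, here is a minimal sketch of a bounded, Lamport-style ring buffer using C++11 atomics. It is not the uSPSC algorithm from the report; the class name `SpscRing`, its capacity parameter, and the choice of memory orderings are assumptions made for illustration. Each operation finishes in a bounded number of steps because the producer only writes `tail_` and the consumer only writes `head_`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical bounded SPSC ring buffer (Lamport-style), for illustration only.
// One slot is always left empty, so at most Capacity - 1 items are stored.
template <typename T, std::size_t Capacity>
class SpscRing {
public:
    // Call from the single producer thread. Returns false if the ring is full.
    bool push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) {
            return false;  // full: the consumer has not freed a slot yet
        }
        slots_[tail] = value;
        tail_.store(next, std::memory_order_release);  // publish the new item
        return true;
    }

    // Call from the single consumer thread. Returns false if the ring is empty.
    bool pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) {
            return false;  // empty: nothing has been published
        }
        out = slots_[head];
        head_.store((head + 1) % Capacity, std::memory_order_release);  // free the slot
        return true;
    }

private:
    T slots_[Capacity];
    std::atomic<std::size_t> head_{0};  // next slot to read; written only by the consumer
    std::atomic<std::size_t> tail_{0};  // next slot to write; written only by the producer
};
```

    In use, one thread would loop on `push` and another on `pop`, retrying (or doing other work) when either returns false. No locks or read-modify-write instructions are involved.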

    Survey of Distributed Decision

    We survey the recent distributed computing literature on checking whether a given distributed system configuration satisfies a given boolean predicate, i.e., whether the configuration is legal or illegal w.r.t. that predicate. We consider classical distributed computing environments, mostly synchronous fault-free network computing (LOCAL and CONGEST models), but also asynchronous crash-prone shared-memory computing (WAIT-FREE model) and mobile computing (FSYNC model).

    Fast spectroscopy and imaging with the FORS2 HIT mode

    The HIgh-Time resolution (HIT) mode of FORS2 has 3 sub-modes that allow for imaging and spectroscopy over a range of timescales from milliseconds up to seconds. It is the only high time resolution spectroscopy mode available on an 8m class telescope. In imaging mode, it can be used to measure the pulse of pulsars and spinning white dwarfs in a variety of high throughput broad- and narrow-band filters. In spectroscopy mode it can take up to 10 spectra per second using a novel "shift-and-wait" clocking pattern for the CCD. It takes advantage of the user-designed masks which can be inserted into FORS2 to allow any two targets within the 6.8' x 6.8' field of view of FORS2 to be selected. A number of integration, or more precisely "wait", times are available, which together with the high throughput GRISMs can observe the entire optical spectrum on a range of timescales. Comment: 19 pages, 7 figures, to appear in "High Time Resolution Astrophysics, Galway 2006"

    An OpenSHMEM Implementation for the Adapteva Epiphany Coprocessor

    This paper reports the implementation and performance evaluation of the OpenSHMEM 1.3 specification for the Adapteva Epiphany architecture within the Parallella single-board computer. The Epiphany architecture exhibits massive many-core scalability with a physically compact 2D array of RISC CPU cores and a fast network-on-chip (NoC). While fully capable of MPMD execution, the physical topology and memory-mapped capabilities of the core and network translate well to Partitioned Global Address Space (PGAS) programming models and SPMD execution with SHMEM. Comment: 14 pages, 9 figures, OpenSHMEM 2016: Third workshop on OpenSHMEM and Related Technologies

    On the Complexity of Implementing Certain Classes of Shared Objects

    We consider shared memory systems in which asynchronous processes cooperate with each other by communicating via shared data objects, such as counters, queues, stacks, and priority queues. The common approach to implementing such shared objects is based on locking: to perform an operation on a shared object, a process obtains a lock, accesses the object, and then releases the lock. Locking, however, has several drawbacks, including convoying, priority inversion, and deadlocks. Furthermore, lock-based implementations are not fault-tolerant: if a process crashes while holding a lock, other processes can end up waiting forever for the lock. Wait-free linearizable implementations were conceived to overcome most of the above drawbacks of locking. A wait-free implementation guarantees that if a process repeatedly takes steps, then its operation on the implemented data object will eventually complete, regardless of whether other processes are slow, or fast, or have crashed. In this thesis, we first present an efficient wait-free linearizable implementation of a class of object types, called closed and closable types, and then prove time and space lower bounds on wait-free linearizable implementations of another class of object types, called perturbable types. (1) We present a wait-free linearizable implementation of n-process closed and closable types (such as swap, fetch&add, fetch&multiply, and fetch&L, where L is any of the boolean operations and, or, or complement) using registers that support load-link (LL) and store-conditional (SC) as base objects. The time complexity of the implementation grows linearly with contention, but is never more than O(log^2 n). We believe that this is the first implementation of a class of types (as opposed to a specific type) to achieve a sub-linear time complexity.
    (2) We prove linear time and space lower bounds on the wait-free linearizable implementations of n-process perturbable types (such as increment, fetch&add, modulo k counter, LL/SC bit, k-valued compare&swap (for any k >= n), single-writer snapshot) that use resettable consensus and historyless objects (such as registers that support read and write) as base objects. This improves on some previously known Omega(sqrt(n)) space complexity lower bounds. It also shows the near space optimality of some known wait-free linearizable implementations.

    Concurrent Search Data Structures Can Be Blocking and Practically Wait-Free

    We argue that there is virtually no practical situation in which one should seek a "theoretically wait-free" algorithm at the expense of a state-of-the-art blocking algorithm in the case of search data structures: blocking algorithms are simple, fast, and can be made "practically wait-free". We draw this conclusion based on the most exhaustive study of blocking search data structures to date. We consider (a) different search data structures of different sizes, (b) numerous uniform and non-uniform workloads, representative of a wide range of practical scenarios, with different percentages of update operations, (c) with and without delayed threads, (d) on different hardware technologies, including processors providing HTM instructions. We explain our claim that blocking search data structures are practically wait-free through an analogy with the birthday paradox, revealing that, in state-of-the-art algorithms implementing such data structures, the probability of conflicts is extremely small. When conflicts occur as a result of context switches and interrupts, we show that HTM-based locks enable blocking algorithms to cope with the

    DeltaTree: A Practical Locality-aware Concurrent Search Tree

    Like other fundamental programming abstractions in energy-efficient computing, search trees are expected to support both high parallelism and data locality. However, existing highly-concurrent search trees such as red-black trees and AVL trees do not consider data locality, while existing locality-aware search trees, such as those based on the van Emde Boas layout (vEB-based trees), poorly support concurrent (update) operations. This paper presents DeltaTree, a practical locality-aware concurrent search tree that combines both locality-optimisation techniques from vEB-based trees and concurrency-optimisation techniques from non-blocking highly-concurrent search trees. DeltaTree is a $k$-ary leaf-oriented tree of DeltaNodes in which each DeltaNode is a size-fixed tree-container with the van Emde Boas layout. The expected memory transfer costs of DeltaTree's Search, Insert, and Delete operations are $O(\log_B N)$, where $N, B$ are the tree size and the unknown memory block size in the ideal cache model, respectively. DeltaTree's Search operation is wait-free, providing prioritised lanes for Search operations, the dominant operation in search trees. Its Insert and Delete operations are non-blocking to other Search, Insert, and Delete operations, but they may be occasionally blocked by maintenance operations that are sometimes triggered to keep DeltaTree in good shape. Our experimental evaluation using the latest implementation of AVL, red-black, and speculation friendly trees from the Synchrobench benchmark has shown that DeltaTree is up to 5 times faster than all of the three concurrent search trees for searching operations and up to 1.6 times faster for update operations when the update contention is not too high.

    On the Space Complexity of Set Agreement

    The $k$-set agreement problem is a generalization of the classical consensus problem in which processes are permitted to output up to $k$ different input values. In a system of $n$ processes, an $m$-obstruction-free solution to the problem requires termination only in executions where the number of processes taking steps is eventually bounded by $m$. This family of progress conditions generalizes wait-freedom ($m=n$) and obstruction-freedom ($m=1$). In this paper, we prove upper and lower bounds on the number of registers required to solve $m$-obstruction-free $k$-set agreement, considering both one-shot and repeated formulations. In particular, we show that repeated $k$-set agreement can be solved using $n+2m-k$ registers and establish a nearly matching lower bound of $n+m-k$.
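    As a quick worked reading of the two register bounds quoted in this abstract, the expressions can be evaluated at the two endpoint progress conditions it names. Only the formulas $n+2m-k$ and $n+m-k$ come from the abstract; the labels $U$ and $L$ are introduced here for convenience.

```latex
\begin{align*}
U(m) &= n + 2m - k, & L(m) &= n + m - k, & U(m) - L(m) &= m, \\
m = 1 \text{ (obstruction-freedom)}: \quad U &= n + 2 - k, & L &= n + 1 - k, \\
m = n \text{ (wait-freedom)}: \quad U &= 3n - k, & L &= 2n - k.
\end{align*}
```

    The gap between the stated upper and lower bounds is therefore exactly $m$ registers.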

    Efficient and Practical Constructions of LL/SC Variables

    Over the past decade, LL/SC have emerged as the most suitable synchronization instructions for the design of lock-free algorithms. However, current architectures do not support these instructions; instead, they support either CAS or RLL/RSC (e.g., POWER4, MIPS, SPARC, IA-64). To bridge this gap, this paper presents two efficient wait-free algorithms for implementing 64-bit LL/SC objects from 64-bit CAS or RLL/RSC objects. Our first algorithm is practical: it has a small, constant time complexity (of 4 for LL and 5 for SC) and a space overhead of only 4 words per process. This algorithm uses unbounded sequence numbers. For theoretical interest, we also present a more complex bounded algorithm that still guarantees constant time complexity and O(1) space overhead per process. The LL/SC primitive is free of the well-known ABA problem that afflicts CAS. By efficiently implementing LL/SC words from CAS words, this work presents an efficient general solution to the ABA problem.
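    The role sequence numbers play in avoiding ABA can be sketched with the familiar tagged-word technique below. This is not either of the paper's algorithms (which provide full 64-bit LL/SC values); as an assumption for illustration, it packs a 32-bit value and a 32-bit sequence number into one 64-bit CAS word, so the tag can in principle wrap around. The class name `TaggedLLSC` and its methods are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical tagged word: low 32 bits hold the value,
// high 32 bits hold a sequence number bumped on every successful SC.
class TaggedLLSC {
public:
    explicit TaggedLLSC(std::uint32_t initial) : word_(initial) {}

    // LL: snapshot the packed word; the snapshot serves as the caller's link.
    std::uint64_t load_link() const {
        return word_.load(std::memory_order_acquire);
    }

    // Extract the 32-bit value from a link returned by load_link().
    static std::uint32_t value_of(std::uint64_t link) {
        return static_cast<std::uint32_t>(link);
    }

    // SC: succeeds only if no successful SC has occurred since `link` was taken,
    // even if the value itself has meanwhile returned to what it was (A -> B -> A).
    bool store_conditional(std::uint64_t link, std::uint32_t new_value) {
        const std::uint64_t next_seq = (link >> 32) + 1;
        const std::uint64_t desired = (next_seq << 32) | new_value;
        return word_.compare_exchange_strong(link, desired,
                                             std::memory_order_acq_rel,
                                             std::memory_order_acquire);
    }

private:
    std::atomic<std::uint64_t> word_;
};
```

    A plain CAS on the value alone would accept a stale expected value after an A -> B -> A sequence; here the stale link carries an old sequence number, so the SC fails.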