
    Progressive Transactional Memory in Time and Space

    Transactional memory (TM) allows concurrent processes to organize sequences of operations on shared \emph{data items} into atomic transactions. A transaction may commit, in which case it appears to have executed sequentially, or it may \emph{abort}, in which case no data item is updated. The TM programming paradigm emerged as an alternative to conventional fine-grained locking techniques, offering ease of programming and compositionality. Though typically implemented using locks themselves, TMs hide the inherent issues of lock-based synchronization behind a clean transactional programming interface. In this paper, we explore the inherent time and space complexity of lock-based TMs, with a focus on the most popular class of \emph{progressive} lock-based TMs. We show that a progressive TM may force a read-only transaction to perform a quadratic (in the number of data items it reads) number of steps and to access a linear number of distinct memory locations, closing the question of the inherent cost of \emph{read validation} in TMs. We then show that the total number of \emph{remote memory references} (RMRs) that take place in an execution of a progressive TM in which n concurrent processes perform transactions on a single data item may reach \Omega(n \log n), which appears to be the first RMR complexity lower bound for transactional memory.
    Comment: Model of Transactional Memory identical with arXiv:1407.6876, arXiv:1502.0272
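
    To make the cost of read validation concrete, the following is a minimal C++ sketch of incremental read-set validation in a versioned, invisible-reads TM design. The per-item version counters and the validation loop are illustrative assumptions in the spirit of such TMs, not the paper's model or its lower-bound construction; the point is only that the k-th read re-validates the k-1 earlier reads, so a read-only transaction of k reads performs on the order of k^2 validation steps.

    #include <atomic>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    // Each shared data item carries a version number that writers bump on update.
    struct DataItem {
        std::atomic<std::uint64_t> version{0};
        std::atomic<int>           value{0};
    };

    struct ReadEntry {
        DataItem*     item;
        std::uint64_t seen_version;
    };

    // A read-only transaction that re-validates its entire read set on every new
    // read: if any previously read item has changed version, the snapshot is no
    // longer consistent and the transaction aborts. After k reads, the transaction
    // has performed 1 + 2 + ... + k validation steps, i.e. Theta(k^2) in total.
    struct ReadOnlyTx {
        std::vector<ReadEntry> read_set;

        int read(DataItem& d) {
            std::uint64_t v = d.version.load();
            int           x = d.value.load();
            for (const ReadEntry& e : read_set) {
                if (e.item->version.load() != e.seen_version)
                    throw std::runtime_error("abort: read validation failed");
            }
            read_set.push_back({&d, v});
            return x;
        }
    };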

    An Almost Tight RMR Lower Bound for Abortable Test-And-Set

    We prove a lower bound of Omega(log n / log log n) for the remote memory reference (RMR) complexity of abortable test-and-set (leader election) in the cache-coherent (CC) and the distributed shared memory (DSM) models. This separates the complexities of abortable and non-abortable test-and-set, as the latter has constant RMR complexity [Wojciech Golab et al., 2010]. Golab, Hendler, Hadzilacos, and Woelfel [Wojciech M. Golab et al., 2012] showed that compare-and-swap can be implemented from registers and test-and-set objects with constant RMR complexity. We observe that a small modification to that implementation is abortable, provided that the test-and-set objects used are atomic (or abortable). As a consequence, using existing efficient randomized wait-free implementations of test-and-set [George Giakkoupis and Philipp Woelfel, 2012], we obtain randomized abortable compare-and-swap objects with almost constant (O(log^* n)) RMR complexity.

    A complexity separation between the cache-coherent and distributed shared memory models


    Fast Arrays: Atomic Arrays with Constant Time Initialization

    Some algorithms require a large array, but only operate on a small fraction of its indices. Examples include adjacency matrices for sparse graphs, hash tables, and van Emde Boas trees. For such algorithms, array initialization can be the most time-consuming operation. Fast arrays were invented to avoid this costly initialization. A fast array is a software implementation of an array such that the entire array can be initialized in just constant time. While algorithms for sequential fast arrays have been known for a long time, to the best of our knowledge there are no previous algorithms for concurrent fast arrays. We present the first such algorithms in this paper. Our first algorithm is linearizable and wait-free, uses only linear space, and supports all operations (initialize, read, and write) in constant time. Our second algorithm enhances the first to additionally support all the read-modify-write operations available in hardware (such as compare-and-swap) in constant time.
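
    For background, the sketch below shows the folklore sequential fast array with constant-time initialize, read, and write; it illustrates only the classic trick the paper builds on, not the concurrent, linearizable, wait-free algorithms that are the paper's contribution. std::vector is used purely to keep the sketch memory-safe; a real fast array would allocate the three arrays as uninitialized raw memory, which is exactly what the stack-based membership test makes safe to do.

    #include <cstddef>
    #include <vector>

    template <typename T>
    class FastArray {
        std::vector<T>           data;   // stored values
        std::vector<std::size_t> when;   // when[i]: stack slot that vouches for index i
        std::vector<std::size_t> stack;  // indices written since the last initialize
        std::size_t              top = 0;
        T                        default_value{};

    public:
        explicit FastArray(std::size_t n) : data(n), when(n), stack(n) {}

        // O(1): forget all previous writes by resetting the stack pointer.
        void initialize(const T& def) { default_value = def; top = 0; }

        // Index i has been written since the last initialize iff its stack slot
        // is live and points back at i.
        bool written(std::size_t i) const {
            return when[i] < top && stack[when[i]] == i;
        }

        T read(std::size_t i) const { return written(i) ? data[i] : default_value; }

        void write(std::size_t i, const T& v) {
            if (!written(i)) {           // first write since initialize: register i
                when[i] = top;
                stack[top++] = i;
            }
            data[i] = v;
        }
    };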

    Efficient Communication and Synchronization on Manycore Processors

    The increased number of cores integrated on a chip has brought about a number of challenges. Concerns about the scalability of cache coherence protocols have urged both researchers and practitioners to explore alternative programming models, where cache coherence is not a given. Message passing, traditionally used in distributed systems, has surfaced as an appealing alternative to shared memory, commonly used in multiprocessor systems. In this thesis, we study how basic communication and synchronization primitives on manycore processors can be improved, with an emphasis on taking advantage of message passing. We do this in two different contexts: (i) message passing is the only means of communication, and (ii) it coexists with traditional cache-coherent shared memory.

    In the first part of the thesis, we analytically and experimentally study collective communication on a message-passing manycore processor. First, we devise broadcast algorithms for the Intel SCC, an experimental manycore platform without coherent caches. Our ideas are captured by OC-Bcast (on-chip broadcast), a tree-based broadcast algorithm. Two versions of OC-Bcast are presented: one for synchronous communication, suitable for use in high-performance libraries implementing the Message Passing Interface (MPI), and another for asynchronous communication, for use in distributed algorithms and general-purpose software. Both OC-Bcast flavors are based on one-sided communication and significantly outperform (by up to 3x) state-of-the-art two-sided algorithms. Next, we develop an analytical communication model for the SCC. By expressing the latency and throughput of different broadcast algorithms through this model, we show that the advantage of OC-Bcast comes from greatly reducing the number of off-chip memory accesses on the critical path.

    The second part of the thesis focuses on lock-based synchronization. We start by introducing the concept of hybrid mutual exclusion algorithms, which rely on both cache-coherent shared memory and message passing. The hybrid algorithms we present, HybLock and HybComb, are shown to significantly outperform (by as much as 4x) their shared-memory-only counterparts when used to implement concurrent counters, stacks, and queues on a hybrid Tilera TILE-Gx processor. The advantage of our hybrid algorithms comes from the fact that their most critical parts rely on message passing, thereby avoiding the overhead of the cache coherence protocol. Still, we take advantage of shared memory, as shared state makes the implementation of certain mechanisms much more straightforward. Finally, we carry these insights over to processors without hardware support for message passing: taking two classic x86 processors from Intel and AMD, we devise cache-aware optimizations that improve the performance of executing contended critical sections by as much as 6x.
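
    As a rough illustration of the tree-based broadcast idea behind OC-Bcast, the sketch below has every core forward the message to at most k children, so the critical path grows logarithmically with the number of cores. The send primitive is a hypothetical stand-in for the SCC's one-sided message-passing buffers; the actual OC-Bcast algorithm is not reproduced here.

    #include <cstdio>

    // Hypothetical stand-in for a one-sided on-chip send (e.g. a direct write into
    // the destination core's message-passing buffer). Assumed for this sketch only.
    static void send(int dest_core, const char* msg) {
        std::printf("forwarding to core %d: %s\n", dest_core, msg);
    }

    // Minimal k-ary tree broadcast: core 0 is the root, and every core that receives
    // the message forwards it to its (at most) k children. With p cores the critical
    // path is O(log_k p) forwarding steps instead of the O(p) sends of a naive
    // root-sends-to-all scheme.
    static void tree_forward(int my_core, int num_cores, int k, const char* msg) {
        for (int c = 1; c <= k; ++c) {
            int child = my_core * k + c;
            if (child < num_cores)
                send(child, msg);
        }
    }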

    Recoverable and Detectable Fetch&Add

    The emergence of systems with non-volatile main memory (NVRAM) increases the need for persistent concurrent objects. Of specific interest are recoverable implementations that, in addition to being robust to crash failures, are also detectable. Detectability ensures that upon recovery it is possible to infer whether the failed operation took effect and, if it did, to obtain its response. This work presents two recoverable and detectable Fetch&Add (FAA) algorithms that are self-implementations, i.e., they use only a fetch&add base object, in addition to read/write registers. The algorithms target two different models for recovery: the global-crash model and the individual-crash model. In both algorithms, operations are wait-free when there are no crashes, but the recovery code may block if there are repeated failures. We also prove that in the individual-crash model there is no implementation of recoverable and detectable FAA using only read, write, and fetch&add primitives in which all operations, including recovery, are lock-free.
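
    To make detectability concrete, the sketch below shows only the per-process bookkeeping a detectable operation might keep in NVRAM: announce a sequence number before applying the operation, record the response afterwards, and let recovery compare the two. The persist primitives and record layout are assumptions for illustration, not the paper's algorithms; in particular, the sketch deliberately leaves open the crash window between the fetch&add and persisting its response, which is exactly what the paper's self-implementations resolve.

    #include <atomic>
    #include <cstdint>

    // No-op stand-ins for persist instructions (cache-line write-back and a persist
    // fence); on real NVRAM hardware these would be e.g. CLWB followed by SFENCE.
    inline void pwb(const void*) {}
    inline void pfence() {}

    // Per-process announcement record, assumed to reside in NVRAM.
    struct Announcement {
        std::uint64_t seq      = 0;  // sequence number of the current invocation
        std::uint64_t response = 0;  // response of the last completed invocation
        std::uint64_t done_seq = 0;  // invocation that 'response' belongs to
    };

    std::atomic<std::uint64_t> counter{0};  // the fetch&add base object

    std::uint64_t detectable_faa(Announcement& a, std::uint64_t my_seq, std::uint64_t delta) {
        a.seq = my_seq; pwb(&a.seq); pfence();        // announce the attempt
        std::uint64_t r = counter.fetch_add(delta);   // the operation takes effect here
        a.response = r; a.done_seq = my_seq;          // record and persist the response
        pwb(&a.response); pwb(&a.done_seq); pfence();
        return r;
    }

    // Recovery: if a.done_seq == a.seq, the announced operation completed and
    // a.response is its result. A crash between the fetch_add and persisting the
    // response leaves it unknown whether the FAA took effect; closing that window
    // is the hard part addressed by the paper.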

    Randomized Mutual Exclusion with Constant Amortized RMR Complexity on the DSM

    In this paper we settle an open question by determining the remote memory reference (RMR) complexity of randomized mutual exclusion in the distributed shared memory (DSM) model with atomic registers, in a weak but natural (and stronger than oblivious) adversary model. In particular, we present a mutual exclusion algorithm that has constant expected amortized RMR complexity and is deterministically deadlock-free. Prior to this work, no randomized algorithm with o(log n / log log n) RMR complexity was known for the DSM model. Our algorithm is fairly simple, and compares favorably with one by Bender and Gilbert (FOCS 2011) for the CC model, which has expected amortized RMR complexity O(log^2 log n) and provides only probabilistic deadlock freedom.
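
    For context on RMR counting in the DSM model, the classic MCS queue lock below illustrates local spinning: each waiting process busy-waits on a flag in its own queue node, assumed to be allocated in that process's local memory segment, so the spin incurs no remote memory references and each passage issues only a constant number of RMRs. This is background only; it relies on read-modify-write primitives (swap and compare-and-swap), unlike the register-only randomized algorithm of the paper.

    #include <atomic>

    // Each process supplies its own queue node, allocated in its local memory
    // segment so that spinning on it is local in the DSM model.
    struct QNode {
        std::atomic<QNode*> next{nullptr};
        std::atomic<bool>   locked{false};
    };

    struct MCSLock {
        std::atomic<QNode*> tail{nullptr};

        void lock(QNode* me) {
            me->next.store(nullptr, std::memory_order_relaxed);
            QNode* pred = tail.exchange(me, std::memory_order_acq_rel);  // one remote access
            if (pred != nullptr) {
                me->locked.store(true, std::memory_order_relaxed);
                pred->next.store(me, std::memory_order_release);         // one remote access
                // Local spin: these reads hit the process's own memory segment.
                while (me->locked.load(std::memory_order_acquire)) { }
            }
        }

        void unlock(QNode* me) {
            QNode* succ = me->next.load(std::memory_order_acquire);
            if (succ == nullptr) {
                QNode* expected = me;
                if (tail.compare_exchange_strong(expected, nullptr,
                                                 std::memory_order_acq_rel))
                    return;                      // no one is waiting
                // A successor is linking itself in; wait for the link to appear (local spin).
                while ((succ = me->next.load(std::memory_order_acquire)) == nullptr) { }
            }
            succ->locked.store(false, std::memory_order_release);
        }
    };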