66 research outputs found

    Lock-Free and Practical Deques using Single-Word Compare-And-Swap

    Full text link
    We present an efficient and practical lock-free implementation of a concurrent deque that is disjoint-parallel accessible and uses atomic primitives available in modern computer systems. Previously known lock-free deque algorithms are either based on unavailable atomic synchronization primitives, implement only a subset of the functionality, or are not designed for disjoint accesses. Our algorithm is based on a doubly linked list and requires only single-word compare-and-swap atomic primitives, even for dynamic memory sizes. We have performed an empirical study using full implementations of the most efficient lock-free deque algorithms known. For systems with low concurrency, the algorithm by Michael shows the best performance; however, as our algorithm is designed for disjoint accesses, it performs significantly better on systems with high concurrency and a non-uniform memory architecture.
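    The single-word CAS primitive that the algorithm assumes corresponds to `std::atomic<T>::compare_exchange_*` in C++. A minimal sketch of the retry-loop pattern it enables, a lock-free LIFO push/pop rather than the paper's deque:

```cpp
#include <atomic>

// Minimal illustration of single-word CAS, the only atomic primitive the
// paper assumes: a lock-free LIFO push/pop. This is NOT the paper's deque;
// it only demonstrates the CAS retry loop such structures are built from.
struct Node { int value; Node* next; };

std::atomic<Node*> head{nullptr};

void push(int v) {
    Node* n = new Node{v, head.load()};
    // On failure, compare_exchange_weak reloads the current head into
    // n->next, so the node is relinked automatically before the retry.
    while (!head.compare_exchange_weak(n->next, n)) {}
}

bool pop(int* out) {
    Node* old = head.load();
    while (old && !head.compare_exchange_weak(old, old->next)) {}
    if (!old) return false;   // stack was empty
    *out = old->value;
    // Freeing 'old' here is only safe because this demo runs single-threaded;
    // a concurrent version needs safe memory reclamation (e.g. hazard pointers).
    delete old;
    return true;
}
```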

    Lock-free Concurrent Data Structures

    Full text link
    Concurrent data structures are the data-sharing side of parallel programming. Data structures give a program the means to store data, and also provide operations to access and manipulate these data. These operations are implemented through algorithms that have to be efficient. In the sequential setting, data structures are crucially important for the performance of the respective computation. In the parallel programming setting, their importance becomes even more crucial because of the increased use of data and resource sharing to exploit parallelism. The first and main goal of this chapter is to provide sufficient background and intuition to help the interested reader navigate the complex research area of lock-free data structures. The second goal is to offer the programmer enough familiarity with the subject to use truly concurrent methods. Comment: To appear in "Programming Multi-core and Many-core Computing Systems", eds. S. Pllana and F. Xhafa, Wiley Series on Parallel and Distributed Computing

    The Parallel Persistent Memory Model

    Full text link
    We consider a parallel computational model that consists of P processors, each with a fast local ephemeral memory of limited size, sharing a large persistent memory. The model allows each processor to fault with bounded probability, and possibly restart. On faulting, all processor state and local ephemeral memory are lost, but the persistent memory remains. This model is motivated by upcoming non-volatile memories that are as fast as existing random access memory, are accessible at the granularity of cache lines, and have the capability of surviving power outages. It is further motivated by the observation that in large parallel systems, failure of processors and their caches is not unusual. Within the model we develop a framework for designing locality-efficient parallel algorithms that are resilient to failures. There are several challenges, including the need to recover from failures, the desire to do this in an asynchronous setting (i.e., not blocking other processors when one fails), and the need for synchronization primitives that are robust to failures. We describe approaches to solve these challenges based on breaking computations into what we call capsules, which have certain properties, and developing a work-stealing scheduler that functions properly in the presence of failures. The scheduler guarantees a time bound of O(W/P_A + D(P/P_A)\lceil\log_{1/f} W\rceil) in expectation, where W and D are the work and depth of the computation (in the absence of failures), P_A is the average number of processors available during the computation, and f \le 1/2 is the probability that a capsule fails. Within the model and using the proposed methods, we develop efficient algorithms for parallel sorting and other primitives. Comment: This paper is the full version of a paper at SPAA 2018 with the same name
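    To make the bound concrete, a small numeric sketch (all parameter values below are made up for illustration; constants hidden by the O() are ignored):

```cpp
#include <cmath>

// Evaluates the scheduler's expected time bound
//   W/P_A + D * (P/P_A) * ceil(log_{1/f} W)
// for concrete, made-up parameters.
double bound(double W, double D, double P, double Pa, double f) {
    double restarts = std::ceil(std::log(W) / std::log(1.0 / f)); // log_{1/f} W
    return W / Pa + D * (P / Pa) * restarts;
}
```

    With W = 10^6, D = 100, P = 64, P_A = 32 and f = 1/2, the first term W/P_A = 31250 dominates the second term 100 * 2 * 20 = 4000, i.e. faults add only a logarithmic-factor overhead to the work term.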

    Implementation of the Low-Cost Work Stealing Algorithm for parallel computations

    Get PDF
    For quite a while, CPUs' clock speed has stagnated while the number of cores keeps increasing. Because of this, parallel computing rose as a paradigm for programming on multi-core architectures, making it critical to control the costs of communication. Achieving this is hard, creating the need for tools that facilitate this task. Work Stealing (WSteal) became a popular option for scheduling multithreaded computations. It ensures scalability and can achieve high performance by spreading work across processors. Each processor owns a double-ended queue (deque) where it stores its work. When such a deque is empty, the processor becomes a thief, attempting to steal work, at random, from other processors' deques. This strategy has been proved efficient and is still used in state-of-the-art WSteal algorithms. However, due to the concurrent nature of the deque, local operations require expensive memory fences to ensure correctness. This means that even when a processor is not stealing work from others, it still incurs excessive overhead due to the local accesses to the deque. Moreover, the pure receiver-initiated approach to load balancing, as well as the randomness in the choice of a victim, makes it unsuitable for scheduling computations with little or unbalanced parallelism. In this thesis, we explore the various limitations of WSteal, in addition to solutions proposed by related work. This is necessary to help decide on possible optimizations for the Low-Cost Work Stealing (LCWS) algorithm, proposed by Paulino and Rito, which we implemented in C++. This algorithm is proven to have exponentially less overhead than state-of-the-art WSteal algorithms. The implementation will be tested against the canonical WSteal and other variants that we implemented so that we can quantify the gains of the algorithm.
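    The deque at the heart of WSteal, and the fence that makes its local operations expensive, can be sketched as a simplified fixed-size, Chase-Lev-style deque (the capacity and names are illustrative; overflow handling and dynamic resizing are omitted):

```cpp
#include <atomic>

// Simplified sketch of the owner/thief deque used by WSteal schedulers.
// Note the seq_cst fence in take(): paying it on every local pop is
// precisely the per-operation overhead that LCWS is designed to avoid.
struct Deque {
    static const int CAP = 1024;
    int buf[CAP];
    std::atomic<long> top{0}, bottom{0};

    void push(int task) {                  // owner only: add at the bottom
        long b = bottom.load(std::memory_order_relaxed);
        buf[b % CAP] = task;
        bottom.store(b + 1, std::memory_order_release);
    }

    bool take(int* out) {                  // owner only: pop from the bottom
        long b = bottom.load(std::memory_order_relaxed) - 1;
        bottom.store(b, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst); // the costly fence
        long t = top.load(std::memory_order_relaxed);
        if (t > b) {                       // deque was already empty
            bottom.store(b + 1, std::memory_order_relaxed);
            return false;
        }
        *out = buf[b % CAP];
        if (t == b) {                      // last element: may race a thief
            bool won = top.compare_exchange_strong(t, t + 1);
            bottom.store(b + 1, std::memory_order_relaxed);
            return won;
        }
        return true;
    }

    bool steal(int* out) {                 // thieves: remove from the top
        long t = top.load(std::memory_order_acquire);
        long b = bottom.load(std::memory_order_acquire);
        if (t >= b) return false;          // nothing to steal
        *out = buf[t % CAP];
        return top.compare_exchange_strong(t, t + 1);
    }
};
```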

    Easier Parallel Programming with Provably-Efficient Runtime Schedulers

    Get PDF
    Over the past decade processor manufacturers have pivoted from increasing uniprocessor performance to multicore architectures. However, utilizing this computational power has proved challenging for software developers. Many concurrency platforms and languages have emerged to address parallel programming challenges, yet writing correct and performant parallel code retains a reputation of being one of the hardest tasks a programmer can undertake. This dissertation studies how runtime scheduling systems can be used to make parallel programming easier. We address the difficulty of writing parallel data structures, automatically finding shared-memory bugs, and reproducing non-deterministic synchronization bugs. Each of the systems presented depends on a novel runtime system which provides strong theoretical performance guarantees and performs well in practice.

    Efficient Lock-free Binary Search Trees

    Full text link
    In this paper we present a novel algorithm for concurrent lock-free internal binary search trees (BST) and implement a Set abstract data type (ADT) based on it. We show that in the presented lock-free BST algorithm the amortized step complexity of each set operation - Add, Remove and Contains - is O(H(n) + c), where H(n) is the height of a BST with n nodes and c is the contention during the execution. Our algorithm adapts its contention measure according to the read-write load. If the situation is read-heavy, the operations avoid helping pending concurrent Remove operations during traversal, and thus adapt to interval contention. However, for write-heavy situations we let an operation help a pending Remove, even though it is not obstructed, and so adapt to the tighter point contention. The algorithm uses single-word compare-and-swap (CAS) operations. We show that our algorithm has improved disjoint-access-parallelism compared to similar existing algorithms. We prove that the presented algorithm is linearizable. To the best of our knowledge this is the first algorithm for any concurrent tree data structure in which the modify operations are performed with an additive term of contention measure. Comment: 15 pages, 3 figures, submitted to PODC
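    The single-word CAS step that such lock-free BSTs rely on, publishing a node by swinging a child pointer, can be sketched as follows. This is a generic insert skeleton under assumed names, not the paper's algorithm; marking, helping, and the Remove path are omitted:

```cpp
#include <atomic>
#include <climits>

// Generic skeleton of the single-CAS publication step used by lock-free
// BSTs: a new node becomes visible via one CAS on a parent's child pointer.
struct Node {
    int key;
    std::atomic<Node*> left{nullptr}, right{nullptr};
    explicit Node(int k) : key(k) {}
};

// 'root' is a sentinel with key INT_MIN, so every real key descends right.
bool insert(Node* root, int key) {
    Node* cur = root;
    while (true) {
        std::atomic<Node*>& child = (key < cur->key) ? cur->left : cur->right;
        Node* next = child.load();
        if (next) {
            if (next->key == key) return false;   // duplicate key
            cur = next;                           // keep descending
            continue;
        }
        Node* n = new Node(key);
        Node* expected = nullptr;
        if (child.compare_exchange_strong(expected, n))
            return true;                          // the CAS published the node
        delete n;                                 // lost a race; retry from cur
    }
}
```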

    Efficient Multi-Word Compare and Swap

    Get PDF
    Atomic lock-free multi-word compare-and-swap (MCAS) is a powerful tool for designing concurrent algorithms. Yet, its widespread usage has been limited because lock-free implementations of MCAS make heavy use of expensive compare-and-swap (CAS) instructions. Existing MCAS implementations indeed use at least 2k+1 CASes per k-CAS. This leads to the natural desire to minimize the number of CASes required to implement MCAS. We first prove in this paper that it is impossible to "pack" the information required to perform a k-word CAS (k-CAS) in fewer than k locations to be CASed. Then we present the first algorithm that requires k+1 CASes per call to k-CAS in the common uncontended case. We implement our algorithm and show that it outperforms a state-of-the-art baseline in a variety of benchmarks in most considered workloads. We also present a durably linearizable (persistent-memory friendly) version of our MCAS algorithm using only 2 persistence fences per call, while still requiring only k+1 CASes per k-CAS.
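    The classic 2k+1-CAS scheme the abstract refers to can be sketched with descriptors: k CASes install a tagged descriptor pointer into the k target words, one CAS on the descriptor's status decides the outcome for all words atomically, and up to k CASes replace the descriptor with the final (or rolled-back) values. The sketch below is single-threaded for clarity and uses made-up names; a real lock-free version adds helping, so that any thread reading a tagged word completes that descriptor, and a proper value/pointer disambiguation scheme (here we simply assume all stored values are even, freeing the low bit for the tag):

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Single-threaded sketch of descriptor-based k-CAS using 2k+1 CASes:
// k installs + 1 status CAS + k final replacements.
struct Entry { std::atomic<uintptr_t>* addr; uintptr_t expect, desired; };

struct Descriptor {
    std::vector<Entry> entries;
    std::atomic<int> status{0};   // 0 = ACTIVE, 1 = SUCCEEDED, -1 = FAILED
};

static uintptr_t tagged(Descriptor* d) {
    return reinterpret_cast<uintptr_t>(d) | 1u;   // low bit marks "descriptor"
}

bool mcas(Descriptor& d) {
    // Phase 1: install the tagged descriptor into each word (up to k CASes).
    bool installed_all = true;
    for (auto& e : d.entries) {
        uintptr_t expect = e.expect;
        if (!e.addr->compare_exchange_strong(expect, tagged(&d))) {
            installed_all = false;
            break;
        }
    }
    // One CAS on the status decides the outcome for all k words atomically.
    int active = 0;
    d.status.compare_exchange_strong(active, installed_all ? 1 : -1);
    bool success = (d.status.load() == 1);
    // Phase 2: replace the descriptor with the final values on success,
    // or roll back to the expected values on failure (up to k CASes).
    for (auto& e : d.entries) {
        uintptr_t self = tagged(&d);
        e.addr->compare_exchange_strong(self, success ? e.desired : e.expect);
    }
    return success;
}
```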

    Non-Blocking Doubly-Linked Lists with Good Amortized Complexity

    Get PDF
    We present a new non-blocking doubly-linked list implementation for an asynchronous shared-memory system. It is the first such implementation for which an upper bound on amortized time complexity has been proved. In our implementation, operations access the list via cursors. Each cursor is located at an item in the list and is local to a process. Cursors can be used to traverse and update the list, even as concurrent operations modify it. The implementation supports two update operations, insertBefore and delete, and two move operations, moveRight and moveLeft. An insertBefore(c, x) operation inserts an item x into the list immediately before the cursor c's location. A delete(c) operation removes the item at the cursor c's location and sets the cursor to the next item in the list. The move operations move the cursor one position to the right or left. Update operations use single-word Compare&Swap instructions. Move operations only read shared memory and never change the state of the data structure. If all update operations modify different parts of the list, they run completely concurrently. A cursor is active if it is initialized, but not yet removed from the process's set of cursors. Let c(op) be the maximum number of active cursors at any one time during the operation op. The amortized step complexity is O(c(op)) for each update op and O(1) for each move. We provide a detailed correctness proof and amortized analysis of our implementation.
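    The cursor interface described above can be sketched sequentially; in the paper's non-blocking version each plain pointer update below becomes a single-word Compare&Swap. Names are illustrative, and the paper's delete is renamed erase because delete is a reserved word in C++:

```cpp
// Sequential sketch of the cursor API: insertBefore, delete (here 'erase'),
// moveRight and moveLeft. Sentinel head/tail items bound the list.
struct Item { int value; Item *prev, *next; };

struct List {
    Item head{0, nullptr, nullptr}, tail{0, nullptr, nullptr};
    List() { head.next = &tail; tail.prev = &head; }
};

struct Cursor { Item* at; };   // a cursor is located at one item

void insertBefore(Cursor& c, int x) {
    Item* n = new Item{x, c.at->prev, c.at};
    c.at->prev->next = n;      // the non-blocking version does this with CAS
    c.at->prev = n;
}

void erase(Cursor& c) {        // the paper's 'delete' operation
    Item* victim = c.at;
    victim->prev->next = victim->next;
    victim->next->prev = victim->prev;
    c.at = victim->next;       // cursor moves to the next item, as specified
    // (safely reclaiming 'victim' is the hard part in the concurrent setting)
}

void moveRight(Cursor& c) { c.at = c.at->next; }
void moveLeft(Cursor& c)  { c.at = c.at->prev; }
```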

    Real-Time Wait-Free Queues using Micro-Transactions

    Get PDF