Lock-Free and Practical Deques using Single-Word Compare-And-Swap
We present an efficient and practical lock-free implementation of a
concurrent deque that is disjoint-parallel accessible and uses atomic
primitives which are available in modern computer systems. Previously known
lock-free algorithms of deques are either based on non-available atomic
synchronization primitives, only implement a subset of the functionality, or
are not designed for disjoint accesses. Our algorithm is based on a doubly
linked list, and only requires single-word compare-and-swap atomic primitives,
even for dynamic memory sizes. We have performed an empirical study using full
implementations of the most efficient algorithms of lock-free deques known. For
systems with low concurrency, the algorithm by Michael shows the best
performance. However, as our algorithm is designed for disjoint accesses, it
performs significantly better on systems with high concurrency and non-uniform
memory architectures.
Lock-free Concurrent Data Structures
Concurrent data structures are the data sharing side of parallel programming. Data structures provide the means for a program to store data, as well as operations to access and manipulate these data. These operations are implemented through algorithms that have to be efficient. In the sequential setting, data structures are crucially important for the performance of the respective computation. In the parallel programming setting, their importance becomes even more crucial because of the increased use of data and resource sharing for utilizing parallelism.
The first and main goal of this chapter is to provide sufficient background and intuition to help the interested reader navigate the complex research area of lock-free data structures. The second goal is to offer the programmer enough familiarity with the subject to use truly concurrent methods.
Comment: To appear in "Programming Multi-core and Many-core Computing Systems", eds. S. Pllana and F. Xhafa, Wiley Series on Parallel and Distributed Computing
The Parallel Persistent Memory Model
We consider a parallel computational model that consists of processors,
each with a fast local ephemeral memory of limited size, and sharing a large
persistent memory. The model allows for each processor to fault with bounded
probability, and possibly restart. On faulting, all processor state and local ephemeral memory are lost, but the persistent memory remains. This model is
motivated by upcoming non-volatile memories that are as fast as existing random
access memory, are accessible at the granularity of cache lines, and have the
capability of surviving power outages. It is further motivated by the
observation that in large parallel systems, failure of processors and their
caches is not unusual.
Within the model we develop a framework for designing locality-efficient
parallel algorithms that are resilient to failures. There are several
challenges, including the need to recover from failures, the desire to do this
in an asynchronous setting (i.e., not blocking other processors when one
fails), and the need for synchronization primitives that are robust to
failures. We describe approaches to solve these challenges based on breaking
computations into what we call capsules, which have certain properties, and
developing a work-stealing scheduler that functions properly within the context
of failures. The scheduler guarantees a time bound of in expectation, where and are the work and
depth of the computation (in the absence of failures), is the average
number of processors available during the computation, and is the
probability that a capsule fails. Within the model and using the proposed
methods, we develop efficient algorithms for parallel sorting and other
primitives.Comment: This paper is the full version of a paper at SPAA 2018 with the same
nam
Implementation of the Low-Cost Work Stealing Algorithm for parallel computations
For quite a while, CPU clock speeds have stagnated while the number of cores keeps increasing. Because of this, parallel computing rose as a paradigm for programming on multi-core architectures, making it critical to control the costs of communication. Achieving this is hard, creating the need for tools that facilitate this task.
Work Stealing (WSteal) became a popular option for scheduling multithreaded computations. It ensures scalability and can achieve high performance by spreading work across processors. Each processor owns a double-ended queue (deque) where it stores its work. When this deque is empty, the processor becomes a thief, attempting to steal work, at random, from other processors' deques. This strategy has been proved efficient and is still used in state-of-the-art WSteal algorithms. However, due to the concurrent nature of the deque, local operations require expensive memory fences to ensure correctness. This means that even when a processor is not stealing work from others, it still incurs excessive overhead due to the local accesses to the deque. Moreover, the purely receiver-initiated approach to load balancing, as well as the random choice of victim, makes it unsuitable for scheduling computations with little or unbalanced parallelism.
In this thesis, we explore the various limitations of WSteal in addition to solutions proposed by related work. This is necessary to help decide on possible optimizations for the Low-Cost Work Stealing (LCWS) algorithm, proposed by Paulino and Rito, which we implemented in C++. This algorithm is proven to have exponentially less overhead than state-of-the-art WSteal algorithms. Our implementation is tested against the canonical WSteal and other variants that we implemented, so that we can quantify the gains of the algorithm.
Easier Parallel Programming with Provably-Efficient Runtime Schedulers
Over the past decade, processor manufacturers have pivoted from increasing uniprocessor performance to multicore architectures. However, utilizing this computational power has proved challenging for software developers. Many concurrency platforms and languages have emerged to address parallel programming challenges, yet writing correct and performant parallel code retains a reputation of being one of the hardest tasks a programmer can undertake.
This dissertation will study how runtime scheduling systems can be used to make parallel programming easier. We address the difficulty of writing parallel data structures, automatically finding shared-memory bugs, and reproducing non-deterministic synchronization bugs. Each of the systems presented depends on a novel runtime system which provides strong theoretical performance guarantees and performs well in practice.
Efficient Lock-free Binary Search Trees
In this paper we present a novel algorithm for concurrent lock-free internal
binary search trees (BST) and implement a Set abstract data type (ADT) based on
that. We show that in the presented lock-free BST algorithm the amortized step complexity of each set operation - {\sc Add}, {\sc Remove} and {\sc Contains} - is $O(H(n) + c)$, where $H(n)$ is the height of the BST with $n$ nodes and $c$ is the contention during the execution. Our algorithm adapts to contention measures according to the read-write load. If the situation is read-heavy, the operations avoid helping pending concurrent {\sc Remove} operations during traversal and so adapt to interval contention. However, for write-heavy situations we let an operation help a pending {\sc Remove}, even though it is not obstructed, and so adapt to the tighter point contention. The algorithm uses single-word compare-and-swap (\texttt{CAS}) operations. We show that our algorithm has improved disjoint-access-parallelism compared to similar existing algorithms. We prove that the presented algorithm is linearizable. To the best of our knowledge, this is the first algorithm for any concurrent tree data structure in which the modify operations are performed with an additive term of contention measure.
Comment: 15 pages, 3 figures, submitted to PODC
Efficient Multi-Word Compare and Swap
Atomic lock-free multi-word compare-and-swap (MCAS) is a powerful tool for designing concurrent algorithms. Yet, its widespread usage has been limited because lock-free implementations of MCAS make heavy use of expensive compare-and-swap (CAS) instructions. Existing MCAS implementations indeed use at least 2k+1 CASes per k-CAS. This leads to the natural desire to minimize the number of CASes required to implement MCAS.
We first prove in this paper that it is impossible to "pack" the information required to perform a k-word CAS (k-CAS) in less than k locations to be CASed. Then we present the first algorithm that requires k+1 CASes per call to k-CAS in the common uncontended case. We implement our algorithm and show that it outperforms a state-of-the-art baseline in a variety of benchmarks in most considered workloads. We also present a durably linearizable (persistent memory friendly) version of our MCAS algorithm using only 2 persistence fences per call, while still only requiring k+1 CASes per k-CAS
Non-Blocking Doubly-Linked Lists with Good Amortized Complexity
We present a new non-blocking doubly-linked list implementation for an asynchronous shared-memory system. It is the first such implementation for which an upper bound on amortized time complexity has been proved. In our implementation, operations access the list via cursors. Each cursor is located at an item in the list and is local to a process. In our implementation, cursors can be used to traverse and update the list, even as concurrent operations modify the list. The implementation supports two update operations, insertBefore and delete, and two move operations, moveRight and moveLeft. An insertBefore(c, x) operation inserts an item x into the list immediately before the cursor c's location. A delete(c) operation removes the item at the cursor c's location and sets the cursor to the next item in the list. The move operations move the cursor one position to the right or left. Update operations use single-word Compare&Swap instructions. Move operations only read shared memory and never change the state of the data structure. If all update operations modify different parts of the list, they run completely concurrently. A cursor is active if it is initialized, but not yet removed from the process's set of cursors. Let c̄(op) be the maximum number of active cursors at any one time during the operation op. The amortized step complexity is O(c̄(op)) for each update op and O(1) for each move. We provide a detailed correctness proof and amortized analysis of our implementation.