Lock cohorting: A general technique for designing NUMA locks
Multicore machines are quickly shifting to NUMA and CC-NUMA architectures, making scalable NUMA-aware locking algorithms, ones that take into account the machines' non-uniform memory and caching hierarchy, ever more important. This paper presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful.
Lock cohorting allows one to transform any spin-lock algorithm, with minimal non-intrusive changes, into a scalable NUMA-aware spin-lock. Our new cohorting technique allows us to easily create NUMA-aware versions of the TATAS-Backoff, CLH, MCS, and ticket locks, to name a few. Moreover, it allows us to derive an abortable CLH-based cohort lock, the first NUMA-aware queue lock to support abortability.
We empirically compared the performance of cohort locks with prior NUMA-aware and classic NUMA-oblivious locks on a synthetic micro-benchmark, a real-world key-value store application (memcached), and the libc memory allocator. Our results demonstrate that cohort locks perform as well as or better than known locks when the load is low, and significantly outperform them as the load increases.
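The construction can be illustrated with a small sketch. The following is a minimal, hypothetical cohort lock built from two simple test-and-set spinlocks (one global, one per NUMA node); CohortLock, kHandoffLimit, and the waiter-count-based alone() hint are assumptions for illustration, not the paper's code. A thread first acquires its node's local lock; on release it hands the global lock to a waiting same-node thread when the hand-off bound allows, and otherwise releases both locks.

```cpp
// Minimal sketch of lock cohorting, assuming simple test-and-set spinlocks for
// both the global and the per-node local locks (the paper's construction is
// generic over the component locks). CohortLock, kHandoffLimit, and the
// waiter-count-based alone() hint are illustrative names, not the paper's code.
#include <atomic>
#include <thread>

constexpr int kNumNodes = 2;        // number of NUMA nodes (assumed)
constexpr int kHandoffLimit = 64;   // bound on consecutive in-cohort hand-offs (assumed)

struct TasLock {
    std::atomic<bool> held{false};
    std::atomic<int>  waiters{0};
    void lock() {
        waiters.fetch_add(1, std::memory_order_relaxed);
        while (held.exchange(true, std::memory_order_acquire)) std::this_thread::yield();
        waiters.fetch_sub(1, std::memory_order_relaxed);
    }
    void unlock() { held.store(false, std::memory_order_release); }
    bool alone() const { return waiters.load(std::memory_order_relaxed) == 0; } // cohort detection
};

class CohortLock {
    TasLock global_;                        // thread-oblivious global lock
    struct alignas(64) Node {
        TasLock local;                      // per-NUMA-node local lock
        bool global_held_by_cohort = false; // protected by `local`
        int  handoffs = 0;                  // protected by `local`
    } nodes_[kNumNodes];

public:
    void lock(int node) {
        Node& n = nodes_[node];
        n.local.lock();                      // contend only with same-node threads
        if (n.global_held_by_cohort) return; // global lock was handed to our cohort
        global_.lock();                      // otherwise compete for the global lock
    }

    void unlock(int node) {
        Node& n = nodes_[node];
        // If a same-node thread is waiting and the hand-off bound allows, keep the
        // global lock inside the cohort and release only the local lock.
        if (!n.local.alone() && n.handoffs < kHandoffLimit) {
            ++n.handoffs;
            n.global_held_by_cohort = true;
            n.local.unlock();
            return;
        }
        n.handoffs = 0;
        n.global_held_by_cohort = false;
        global_.unlock();
        n.local.unlock();
    }
};
```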
Efficient Multi-Word Compare and Swap
Atomic lock-free multi-word compare-and-swap (MCAS) is a powerful tool for designing concurrent algorithms. Yet, its widespread usage has been limited because lock-free implementations of MCAS make heavy use of expensive compare-and-swap (CAS) instructions. Existing MCAS implementations indeed use at least 2k+1 CASes per k-CAS. This leads to the natural desire to minimize the number of CASes required to implement MCAS.
We first prove in this paper that it is impossible to "pack" the information required to perform a k-word CAS (k-CAS) into fewer than k locations to be CASed. Then we present the first algorithm that requires k+1 CASes per call to k-CAS in the common uncontended case. We implement our algorithm and show that it outperforms a state-of-the-art baseline in most considered workloads across a variety of benchmarks. We also present a durably linearizable (persistent-memory-friendly) version of our MCAS algorithm that uses only 2 persistence fences per call, while still requiring only k+1 CASes per k-CAS.
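To make the interface concrete, here is a hedged sketch of what a k-word CAS call looks like and how a concurrent algorithm might use it. The mcas() function below is a semantics-only stand-in (it serializes with a mutex and is not lock-free); McasEntry, DNode, and unlink() are illustrative names, not the paper's API.

```cpp
// Hedged sketch of the multi-word CAS interface and a typical use. mcas() below
// is a semantics-only stand-in (it serializes with a mutex and is NOT lock-free);
// McasEntry, DNode, and unlink() are illustrative names, not the paper's API.
#include <atomic>
#include <cstdint>
#include <mutex>
#include <vector>

struct McasEntry {
    std::atomic<uint64_t>* addr;  // word to update
    uint64_t expected;            // value it must currently hold
    uint64_t desired;             // value to install
};

// Atomically: if every addr holds its expected value, install all desired values.
// A real implementation (such as the paper's) is lock-free and needs only k+1
// CASes per k-CAS in the uncontended case.
inline bool mcas(std::vector<McasEntry>& entries) {
    static std::mutex m;                       // illustration only
    std::lock_guard<std::mutex> g(m);
    for (auto& e : entries)
        if (e.addr->load(std::memory_order_relaxed) != e.expected) return false;
    for (auto& e : entries)
        e.addr->store(e.desired, std::memory_order_relaxed);
    return true;
}

// Example use: atomically splice a node out of a doubly linked list by updating
// both neighbour pointers in a single 2-CAS.
struct DNode { std::atomic<uint64_t> next{0}, prev{0}; };

bool unlink(DNode* pred, DNode* node, DNode* succ) {
    std::vector<McasEntry> ops = {
        { &pred->next, reinterpret_cast<uint64_t>(node), reinterpret_cast<uint64_t>(succ) },
        { &succ->prev, reinterpret_cast<uint64_t>(node), reinterpret_cast<uint64_t>(pred) },
    };
    return mcas(ops);  // succeeds only if both words still point at `node`
}
```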
Virtual Machine Support for Many-Core Architectures: Decoupling Abstract from Concrete Concurrency Models
The upcoming many-core architectures require software developers to exploit concurrency to utilize the available computational power. Today's high-level language virtual machines (VMs), which are a cornerstone of software development, do not provide sufficient abstraction for concurrency concepts. We analyze concrete and abstract concurrency models and identify the challenges they impose for VMs. To provide sufficient concurrency support in VMs, we propose to integrate concurrency operations into VM instruction sets.

Since there will always be VMs optimized for special purposes, our goal is to develop a methodology for designing instruction sets with concurrency support. Therefore, we also propose a list of trade-offs that have to be investigated to inform the design of such instruction sets.

As a first experiment, we implemented one instruction set extension for shared-memory concurrency and one for non-shared-memory concurrency. From our experimental results, we derived a list of requirements for a full-grown experimental environment for further research.
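As a purely illustrative sketch of what "concurrency operations in a VM instruction set" can mean, the toy interpreter below exposes one shared-memory instruction (an atomic compare-and-swap on a VM cell) and one non-shared-memory pair (send/receive on a mailbox). All opcode names and data structures here are assumptions, not the paper's instruction set extensions.

```cpp
// Toy illustration (not the paper's instruction set) of exposing concurrency
// operations as VM instructions: one shared-memory opcode (atomic compare-and-swap
// on a VM cell) and one non-shared-memory pair (send/receive on a mailbox).
// All opcode names and structures are hypothetical.
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

enum class Op : uint8_t { ATOMIC_CAS, SEND, RECEIVE, HALT };

struct Instr { Op op; int a = 0, b = 0, c = 0; };   // operands index cells, mailboxes, registers

struct Mailbox {                                    // non-shared-memory (message-passing) side
    std::mutex m; std::condition_variable cv; std::deque<int64_t> q;
    void send(int64_t v) { { std::lock_guard<std::mutex> g(m); q.push_back(v); } cv.notify_one(); }
    int64_t receive() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&]{ return !q.empty(); });
        int64_t v = q.front(); q.pop_front(); return v;
    }
};

struct VM {
    std::vector<std::atomic<int64_t>> cells;        // shared-memory side
    std::vector<Mailbox> mailboxes;
    VM(size_t ncells, size_t nboxes) : cells(ncells), mailboxes(nboxes) {}

    void run(const std::vector<Instr>& code, std::vector<int64_t>& regs) {
        for (const Instr& i : code) {
            switch (i.op) {
            case Op::ATOMIC_CAS: {  // if cells[a] == regs[b], set cells[a] = regs[c]; regs[b] <- success flag
                int64_t expected = regs[i.b];
                regs[i.b] = cells[i.a].compare_exchange_strong(expected, regs[i.c]) ? 1 : 0;
                break;
            }
            case Op::SEND:    mailboxes[i.a].send(regs[i.b]); break;       // post regs[b] to mailbox a
            case Op::RECEIVE: regs[i.b] = mailboxes[i.a].receive(); break; // blocking receive into regs[b]
            case Op::HALT:    return;
            }
        }
    }
};
```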
Efficient Nonblocking Software Transactional Memory
Foundational transactional memory research grew out of research …
Transactional memory is a powerful programming abstraction that enables a programmer to turn a complex, composite collection of statements into an atomic operation. Previous work usually expresses this abstraction as an atomic block, which offers mutual …
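For readers unfamiliar with the abstraction, the following hedged sketch shows the shape of such an atomic block. The stm::atomic helper here is a hypothetical stand-in that merely serializes transactions with one global lock to convey the all-or-nothing semantics; it is not the paper's nonblocking STM, which runs transactions speculatively and commits or retries them.

```cpp
// Hedged sketch of the atomic-block abstraction. stm::atomic below is a
// hypothetical stand-in that serializes transactions with one global lock to
// convey the all-or-nothing semantics; it is not the paper's nonblocking STM.
#include <mutex>

namespace stm {
    inline std::mutex global_tx_lock;                  // stand-in for real STM machinery
    template <class F> void atomic(F&& body) {
        std::lock_guard<std::mutex> g(global_tx_lock);
        body();                                        // composite statements appear atomic
    }
}

struct Account { long balance = 0; };

// The read-modify-write on two accounts executes as one atomic operation.
void transfer(Account& from, Account& to, long amount) {
    stm::atomic([&] {
        from.balance -= amount;
        to.balance   += amount;
    });
}
```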
Toward High Performance Nonblocking Software Transactional Memory
Substantial advances in STM performance in recent years have mostly focused on blocking systems. We describe our work integrating the most important techniques and optimizations emerging from the recent work on blocking STMs into several variants of a nonblocking STM. In particular, our design is based on the philosophy of keeping the common, contention-free execution path as simple (and consequently as fast) as possible, while resorting to the more expensive data displacement and metadata management only in situations where transactions have problems making forward progress. We employ novel ownership “stealing” and metadata management techniques in our nonblocking STM to enable several recent blocking STM optimizations, such as timestamp-based validation and ownership release via store instructions, all leading to a more streamlined and efficient fast path. We present an undo log (eager versioning) variant of our STM, as well as two redo log (lazy versioning) variants, the latter based on the two ownership acquisition techniques (namely eager and lazy) for writes made by transactions. Experimental results show that our efforts have improved the performance of nonblocking STMs to the point of being competitive with state-of-the-art blocking STMs such as TL2.
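The ownership machinery the abstract refers to can be sketched as follows. This is a simplified illustration of eager (encounter-time, undo-log) versus lazy (commit-time, redo-log) ownership acquisition; TxDesc, OwnershipRec, and the helper functions are assumed names, and the paper's stealing, validation, and nonblocking-progress details are omitted.

```cpp
// Simplified sketch (not the paper's algorithm) of per-location ownership records
// and the two acquisition styles named above: eager (encounter-time, undo log)
// and lazy (commit-time, redo log). TxDesc, OwnershipRec, and the helpers are
// assumed names; stealing, validation, and nonblocking progress are omitted.
#include <atomic>
#include <cstdint>
#include <utility>
#include <vector>

enum class TxStatus : uint32_t { ACTIVE, COMMITTED, ABORTED };

struct TxDesc { std::atomic<TxStatus> status{TxStatus::ACTIVE}; };

struct OwnershipRec { std::atomic<TxDesc*> owner{nullptr}; };

// Eager (encounter-time) acquisition: claim the ownership record at the first
// write, log the old value for undo, and update the location in place.
bool eager_write(TxDesc* tx, OwnershipRec& orec,
                 std::atomic<uint64_t>& word, uint64_t new_val,
                 std::vector<std::pair<std::atomic<uint64_t>*, uint64_t>>& undo_log) {
    TxDesc* expected = nullptr;
    if (!orec.owner.compare_exchange_strong(expected, tx) && expected != tx)
        return false;                                   // owned by another transaction: abort, help, or steal
    undo_log.emplace_back(&word, word.load());          // eager versioning: remember old value
    word.store(new_val);                                // write in place
    return true;
}

// Lazy (commit-time) acquisition: buffer writes in a redo log, claim the
// ownership records only while committing, then write back the buffered values.
struct RedoEntry { OwnershipRec* orec; std::atomic<uint64_t>* word; uint64_t new_val; };

bool lazy_commit(TxDesc* tx, std::vector<RedoEntry>& redo_log) {
    for (auto& e : redo_log) {
        TxDesc* expected = nullptr;
        if (!e.orec->owner.compare_exchange_strong(expected, tx) && expected != tx)
            return false;                               // conflict: caller aborts or retries
    }
    tx->status.store(TxStatus::COMMITTED);
    for (auto& e : redo_log) e.word->store(e.new_val);  // lazy versioning: write back at commit
    return true;
}
```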