14 research outputs found
Brief Announcement: Jiffy: A Fast, Memory Efficient, Wait-Free Multi-Producers Single-Consumer Queue
In applications such as sharded data processing systems, data flow programming and load sharing applications, multiple concurrent data producers are feeding requests into the same data consumer. This can be naturally realized through concurrent queues, where each consumer pulls its tasks from its dedicated queue. For scalability, wait-free queues are often preferred over lock based structures.
The vast majority of wait-free queue implementations, and even lock-free ones, support the multi-producer multi-consumer model. Yet, this comes at a premium, since implementing wait-free multi-producer multi-consumer queues requires utilizing complex helper data structures. The latter increases the memory consumption of such queues and limits their performance and scalability. Additionally, many such designs employ (hardware) cache unfriendly memory access patterns.
In this work we study the implementation of wait-free multi-producer single-consumer queues. Specifically, we propose Jiffy, an efficient memory frugal novel wait-free multi-producer single-consumer queue and formally prove its correctness. We then compare the performance and memory requirements of Jiffy with other state of the art lock-free and wait-free queues. We show that indeed Jiffy can maintain good performance with up to 128 threads, delivers better throughput than other constructions we compared against, and consumes less memory
Recoverable, Abortable, and Adaptive Mutual Exclusion with Sublogarithmic RMR Complexity
We present the first recoverable mutual exclusion (RME) algorithm that is simultaneously abortable, adaptive to point contention, and with sublogarithmic RMR complexity. Our algorithm has O(min(K,log_W N)) RMR passage complexity and O(F + min(K,log_W N)) RMR super-passage complexity, where K is the number of concurrent processes (point contention), W is the size (in bits) of registers, and F is the number of crashes in a super-passage. Under the standard assumption that W = ?(log N), these bounds translate to worst-case O((log N)/(log log N)) passage complexity and O(F + (log N)/(log log N)) super-passage complexity. Our key building blocks are:
- A D-process abortable RME algorithm, for D ? W, with O(1) passage complexity and O(1+F) super-passage complexity. We obtain this algorithm by using the Fetch-And-Add (FAA) primitive, unlike prior work on RME that uses Fetch-And-Store (FAS/SWAP).
- A generic transformation that transforms any abortable RME algorithm with passage complexity of B < W, into an abortable RME lock with passage complexity of O(min(K,B))
An Almost Tight RMR Lower Bound for Abortable Test-And-Set
We prove a lower bound of Omega(log n/log log n) for the remote memory reference (RMR) complexity of abortable test-and-set (leader election) in the cache-coherent (CC) and the distributed shared memory (DSM) model. This separates the complexities of abortable and non-abortable test-and-set, as the latter has constant RMR complexity [Wojciech Golab et al., 2010].
Golab, Hendler, Hadzilacos and Woelfel [Wojciech M. Golab et al., 2012] showed that compare-and-swap can be implemented from registers and test-and-set objects with constant RMR complexity. We observe that a small modification to that implementation is abortable, provided that the used test-and-set objects are atomic (or abortable). As a consequence, using existing efficient randomized wait-free implementations of test-and-set [George Giakkoupis and Philipp Woelfel, 2012], we obtain randomized abortable compare-and-swap objects with almost constant (O(log^* n)) RMR complexity
Abortable Reader-Writer Locks are No More Complex Than Abortable Mutex Locks
When a process attempts to acquire a mutex lock, it may be forced to wait if another process currently holds the lock. In certain applications, such as real-time operating systems and databases, indefinite waiting can cause a process to miss an important deadline. Hence, there has been research on designing abortable mutual exclusion locks, and fairly efficient algorithms of O(log n) RMR complexity have been discovered (n denotes the number of processes for which the algorithm is designed). The abort feature is just as important for a reader-writer lock as it is for a mutual exclusion lock, but to the best of our knowledge there are currently no abortable reader-writer locks that are starvation-free. We show the surprising result that any abortable, starvation-free mutual exclusion algorithm of RMR complexity t(n) can be transformed into an abortable, starvation-free reader-writer exclusion algorithm of RMR complexity O(t(n)). Thus, we obtain the first abortable, starvation-free reader-writer exclusion algorithm of O(log n) RMR complexity. Our results apply to the Cache-Coherent (CC) model of multiprocessors
Lock cohorting: A general technique for designing NUMA locks
Multicore machines are quickly shifting to NUMA and CC-NUMA architectures, making scalable NUMA-aware locking algorithms, ones that take into account the machines' non-uniform memory and caching hierarchy, ever more important. This paper presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful.
Lock cohorting allows one to transform any spin-lock algorithm, with minimal non-intrusive changes, into scalable NUMA-aware spin-locks. Our new cohorting technique allows us to easily create NUMA-aware versions of the TATAS-Backoff, CLH, MCS, and ticket locks, to name a few. Moreover, it allows us to derive a CLH-based cohort abortable lock, the first NUMA-aware queue lock to support abortability.
We empirically compared the performance of cohort locks with prior NUMA-aware and classic NUMA-oblivious locks on a synthetic micro-benchmark, a real world key-value store application memcached, as well as the libc memory allocator. Our results demonstrate that cohort locks perform as well or better than known locks when the load is low and significantly out-perform them as the load increases
Aether: A scalable approach to logging
The shift to multi-core hardware brings new challenges to database systems, as the software parallelism determines performance. Even though database systems traditionally accommodate simultaneous requests, a multitude of synchronization barriers serialize execution. Write-ahead logging is a fundamental, omnipresent component in ARIES-style concurrency and recovery, and one of the most important yet-to-be addressed potential bottlenecks, especially in OLTP workloads with small and frequent changes to data. In this paper, we identify four logging-related impediments to database system scalability. Each issue challenges different level in the software architecture: (a) the high volume of small-sized I/O requests may saturate the disk, (b) transactions hold locks while waiting for the log flush, (c) extensive context switching overwhelms the OS scheduler with threads executing log I/Os, and (d) contention appears as transactions serialize accesses to in-memory log data structures. We demonstrate these problems and address them with techniques that, when combined, comprise a holistic, scalable approach to logging. Our solution achieves a 20%-69% speedup over a modern database system when running log-intensive workloads, such as the TPC-B and TATP benchmarks. Moreover, it achieves a log insert throughput of small-sized log records of over 1.8GB/s on a modern single socket server, an order of magnitude higher than the traditional way of accessing the log using a single mutex
SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures
Near-Data-Processing (NDP) architectures present a promising way to alleviate
data movement costs and can provide significant performance and energy benefits
to parallel applications. Typically, NDP architectures support several NDP
units, each including multiple simple cores placed close to memory. To fully
leverage the benefits of NDP and achieve high performance for parallel
workloads, efficient synchronization among the NDP cores of a system is
necessary. However, supporting synchronization in many NDP systems is
challenging because they lack shared caches and hardware cache coherence
support, which are commonly used for synchronization in multicore systems, and
communication across different NDP units can be expensive.
This paper comprehensively examines the synchronization problem in NDP
systems, and proposes SynCron, an end-to-end synchronization solution for NDP
systems. SynCron adds low-cost hardware support near memory for synchronization
acceleration, and avoids the need for hardware cache coherence support. SynCron
has three components: 1) a specialized cache memory structure to avoid memory
accesses for synchronization and minimize latency overheads, 2) a hierarchical
message-passing communication protocol to minimize expensive communication
across NDP units of the system, and 3) a hardware-only overflow management
scheme to avoid performance degradation when hardware resources for
synchronization tracking are exceeded.
We evaluate SynCron using a variety of parallel workloads, covering various
contention scenarios. SynCron improves performance by 1.27 on average
(up to 1.78) under high-contention scenarios, and by 1.35 on
average (up to 2.29) under low-contention real applications, compared
to state-of-the-art approaches. SynCron reduces system energy consumption by
2.08 on average (up to 4.25).Comment: To appear in the 27th IEEE International Symposium on
High-Performance Computer Architecture (HPCA-27