Search CORE

16 research outputs found

ThreadScan: Automatic and Scalable Memory Reclamation

Author: Alistarh Dan
Leiserson William Mitchell
Matveev Alexander
Shavit Nir N.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/06/2015

The concurrent memory reclamation problem is that of devising a way for a deallocating thread to verify that no other concurrent threads hold references to a memory block being deallocated. To date, in the absence of automatic garbage collection, there is no satisfactory solution to this problem. Existing tracking methods like hazard pointers, reference counters, or epoch-based techniques like RCU are either prohibitively expensive or require significant programming expertise, to the extent that implementing them efficiently can be worthy of a publication. None of the existing techniques are automatic or even semi-automated. In this paper, we take a new approach to concurrent memory reclamation: instead of manually tracking access to memory locations as done in techniques like hazard pointers, or restricting shared accesses to specific epoch boundaries as in RCU, our algorithm, called ThreadScan, leverages operating system signaling to automatically detect which memory locations are being accessed by concurrent threads. Initial empirical evidence shows that ThreadScan scales surprisingly well and requires negligible programming effort beyond the standard use of Malloc and Free.
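The reclamation problem stated in this abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the ThreadScan implementation: `SimThread`, `roots`, and `reclaim` are hypothetical names, a plain method call stands in for signal delivery, and a Python set stands in for the contents of a thread's stack and registers.

```python
from dataclasses import dataclass, field


@dataclass
class SimThread:
    """Hypothetical stand-in for a thread. `roots` models the block
    addresses currently held in its stack and registers."""
    name: str
    roots: set = field(default_factory=set)

    def scan(self, candidates):
        # In a signal-based scheme this would run in a signal handler;
        # here it is an ordinary call (assumption of this sketch).
        return candidates & self.roots


def reclaim(delete_buffer, threads):
    """Split retired blocks into (freed, retained).

    A block is freed only if no thread reports holding a reference to it.
    """
    candidates = set(delete_buffer)
    marked = set()
    for t in threads:  # models signalling every thread in turn
        marked |= t.scan(candidates)
    return candidates - marked, marked


threads = [SimThread("t1", {0x10, 0x20}), SimThread("t2", {0x30})]
freed, retained = reclaim([0x10, 0x40, 0x50], threads)
print(sorted(freed), sorted(retained))
```

Block `0x10` stays in the retained set because `t1` still references it; it would be reconsidered in a later reclamation round.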

DSpace@MIT

Crossref

Hardware extensions to make lazy subscription safe

Author: Alex Kogan
Dave Dice
Mark Moir
Timothy L Harris
Yossi Lev
Publication date: 01/05/2020

Transactional Lock Elision (TLE) uses Hardware Transactional Memory (HTM) to execute unmodified critical sections concurrently, even if they are protected by the same lock. To ensure correctness, the transactions used to execute these critical sections "subscribe" to the lock by reading it and checking that it is available. A recent paper proposed using the tempting "lazy subscription" optimization for a similar technique in a different context, namely transactional systems that use a single global lock (SGL) to protect all transactional data. We identify several pitfalls that show that lazy subscription is not safe for TLE because unmodified critical sections executing before subscribing to the lock may behave incorrectly in a number of subtle ways. We also show that recently proposed compiler support for modifying transaction code to ensure subscription occurs before any incorrect behavior could manifest is not sufficient to avoid all of the pitfalls we identify. We further argue that extending such compiler support to avoid all pitfalls would add substantial complexity and would usually limit the extent to which subscription can be deferred, undermining the effectiveness of the optimization. Hardware extensions suggested in the recent proposal also do not address all of the pitfalls we identify. In this extended version of our WTTM 2014 paper, we describe hardware extensions that make lazy subscription safe, both for SGL-based transactional systems and for TLE, without the need for special compiler support. We also explain how nontransactional loads can be exploited, if available, to further enhance the effectiveness of lazy subscription.

CiteSeerX

Drop the anchor: lightweight memory management for non-blocking data structures

Author: Alex Kogan
Anastasia Braginsky
Erez Petrank
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2013

Efficient memory management of dynamic non-blocking data structures remains an important open question. Existing methods either sacrifice the ability to deallocate objects or reduce performance notably. In this paper, we present a novel technique, called Drop the Anchor, which significantly reduces the overhead associated with memory management while reclaiming objects even in the presence of thread failures. We demonstrate this memory management scheme on the common linked list data structure. Using extensive evaluation, we show that Drop the Anchor significantly outperforms Hazard Pointers, the widely used technique for non-blocking memory management.

CiteSeerX

Fast and Robust Memory Reclamation for Concurrent Data Structures

Author: Balmau Oana Maria
Guerraoui Rachid
Herlihy Maurice
Zablotchi Mihail Igor
Publication date: 17/05/2016

In concurrent systems without automatic garbage collection, it is challenging to determine when it is safe to reclaim memory, especially for lock-free data structures. Existing concurrent memory reclamation schemes are either fast but do not tolerate process delays, robust to delays but with high overhead, or both robust and fast but narrowly applicable. This paper proposes QSense, a novel concurrent memory reclamation technique. QSense is a hybrid technique with a fast path and a fallback path. In the common case (without process delays), a high-performing memory reclamation scheme is used (fast path). If process delays block memory reclamation through the fast path, a robust fallback path is used to guarantee progress. The fallback path uses hazard pointers, but avoids their notorious need for frequent and expensive memory fences. QSense is widely applicable, as we illustrate through several lock-free data structure algorithms. Our experimental evaluation shows that QSense has an overhead comparable to the fastest memory reclamation techniques, while still tolerating prolonged process delays.
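The fast-path/fallback structure described in this abstract can be sketched as a single-threaded simulation. Everything here is an assumption made for illustration rather than the QSense design: the class `HybridReclaimer`, a quiescence-based fast path, the backlog threshold `stall_limit` as the switching trigger, and hazard sets that are always kept up to date so the fallback can rely on them.

```python
class HybridReclaimer:
    """Toy hybrid reclaimer: quiescence-based fast path, hazard-based fallback."""

    def __init__(self, thread_ids, stall_limit):
        self.epoch = 0
        self.seen = {t: 0 for t in thread_ids}        # epoch at last quiescent state
        self.hazards = {t: set() for t in thread_ids}  # nodes each thread protects
        self.retired = []                              # (node, epoch when retired)
        self.freed = []
        self.stall_limit = stall_limit                 # hypothetical backlog threshold

    def quiescent(self, tid):
        # The thread declares that it holds no references to shared nodes.
        self.seen[tid] = self.epoch

    def protect(self, tid, node):
        self.hazards[tid].add(node)

    def retire(self, node):
        self.retired.append((node, self.epoch))
        self.epoch += 1

    def collect(self):
        if len(self.retired) <= self.stall_limit:
            # Fast path: keep nodes retired at or after the oldest
            # quiescent state announced by any thread.
            oldest = min(self.seen.values())
            keep = [(n, e) for n, e in self.retired if e >= oldest]
            path = "fast"
        else:
            # Fallback: a delayed thread is blocking the fast path, so
            # free everything that no hazard set protects.
            protected = set().union(*self.hazards.values())
            keep = [(n, e) for n, e in self.retired if n in protected]
            path = "fallback"
        kept = {n for n, _ in keep}
        self.freed += [n for n, _ in self.retired if n not in kept]
        self.retired = keep
        return path


r = HybridReclaimer(["t1", "t2"], stall_limit=2)
r.retire("A")
r.quiescent("t1")
r.quiescent("t2")
first = r.collect()            # both threads were quiescent after A was retired

r.protect("t2", "C")
for node in ["B", "C", "D"]:
    r.retire(node)
r.quiescent("t1")              # t2 is delayed and never announces quiescence
second = r.collect()
print(first, second, r.freed, [n for n, _ in r.retired])
```

In the second round the backlog exceeds `stall_limit`, so the sketch frees `B` and `D` through the hazard check even though `t2` never reaches a quiescent state; only the protected node `C` is kept.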

Infoscience - École polytechnique fédérale de Lausanne

Asynchronized Concurrency: The Secret to Scaling Concurrent Search Data Structures

Author: Arcangeli Andrea
Boyd-Wickizer Silas
David Tudor
Fan Bin
Herlihy Maurice
Intel
McKenney Paul E
Nishtala Rajesh
Timothy
Triplett Josh
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 13/04/2015

We introduce "asynchronized concurrency (ASCY)," a paradigm consisting of four complementary programming patterns. ASCY calls for the design of concurrent search data structures (CSDSs) to resemble that of their sequential counterparts. We argue that ASCY leads to implementations which are portably scalable: they scale across different types of hardware platforms, including single and multi-socket ones, for various classes of workloads, such as read-only and read-write, and according to different performance metrics, including throughput, latency, and energy. We substantiate our thesis through the most exhaustive evaluation of CSDSs to date, involving 6 platforms, 22 state-of-the-art CSDS algorithms, 10 re-engineered state-of-the-art CSDS algorithms following the ASCY patterns, and 2 new CSDS algorithms designed with ASCY in mind. We observe up to 30% improvements in throughput in the re-engineered algorithms, while our new algorithms out-perform the state-of-the-art alternatives.

Infoscience - École polytechnique fédérale de Lausanne

CiteSeerX

Crossref

On the Performance of Software Transactional Memory

Author: Dragojevic Aleksandar
Publication venue: Lausanne, EPFL
Publication date: 29/05/2012

The recent proliferation of multi-core processors has moved concurrent programming into the mainstream by forcing increasingly more programmers to write parallel code. Using traditional concurrency techniques, such as locking, is notoriously difficult and has been considered the domain of a few experts for a long time. This discrepancy between the established techniques and typical programmers' skills raises a pressing need for new programming paradigms. A particularly appealing concurrent programming paradigm is transactional memory: it enables programmers to write correct concurrent code in a simple manner, while promising scalable performance. Software implementations of transactional memory (STM) have attracted a lot of attention for their ability to support dynamic transactions of any size and execute on existing hardware. This is in contrast to hardware implementations that typically support only transactions of limited size and are not yet commercially available. Surprisingly, prior work has largely neglected software support for transactions of arbitrary size, despite them being an important target for STM. Consequently, existing STMs have not been optimized for large transactions, which results in poor performance of those STMs, and sometimes even program crashes, when dealing with large transactions. In this thesis, I contribute to changing the current state of affairs by improving performance and scalability of STM, in particular with dynamic transactions of arbitrary size. I propose SwissTM, a novel STM design that efficiently supports large transactions, while not compromising on performance with smaller ones. SwissTM features: (1) mixed conflict detection, that detects write-write conflicts eagerly and read-write conflicts lazily, and (2) a two-phase contention manager, that imposes little overhead on small transactions and effectively manages conflicts between larger ones.
SwissTM indeed achieves good performance across a range of workloads: it outperforms several state-of-the-art STMs on a representative large-scale benchmark by at least 55% with eight threads, while matching their performance or outperforming them across a wide range of smaller-scale benchmarks. I also present a detailed empirical analysis of the SwissTM design, individually evaluating each of the chosen design points and their impact on performance. This "dissection" of SwissTM is particularly valuable for STM designers as it helps them understand which parts of the design are well-suited to their own STMs, enabling them to reuse just those parts. Furthermore, I address the question of whether STM can perform well enough to be practical by performing the most extensive comparison of performance of STM-based and sequential, non-thread-safe code to date. This comparison demonstrates that SwissTM indeed outperforms sequential code, often with just a handful of threads: with four threads it outperforms sequential code in 80% of cases, by up to 4x. Furthermore, the performance scales well when increasing thread counts: with 64 threads it outperforms sequential code by up to 29x. These results suggest that STM is indeed a viable alternative for writing concurrent code today.
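The "mixed conflict detection" named in this abstract (write-write conflicts detected eagerly, read-write conflicts lazily) can be illustrated with a toy, single-threaded model. This is a sketch under assumptions, not SwissTM itself: `ToySTM`, `Tx`, and `Abort` are hypothetical names, per-address version counters replace whatever metadata the real design uses, and the transaction that encounters a conflict simply aborts instead of consulting a contention manager.

```python
class Abort(Exception):
    pass


class ToySTM:
    def __init__(self):
        self.values = {}    # address -> committed value
        self.versions = {}  # address -> number of commits to that address
        self.owners = {}    # address -> transaction holding the write lock


class Tx:
    def __init__(self, stm):
        self.stm = stm
        self.read_set = {}   # address -> version observed at first read
        self.write_set = {}  # address -> pending (uncommitted) value

    def read(self, addr):
        if addr in self.write_set:
            return self.write_set[addr]
        # No conflict check here: read-write conflicts are found at commit.
        self.read_set.setdefault(addr, self.stm.versions.get(addr, 0))
        return self.stm.values.get(addr)

    def write(self, addr, value):
        owner = self.stm.owners.get(addr)
        if owner is not None and owner is not self:
            self._release()
            raise Abort("write-write conflict")  # eager: detected at write time
        self.stm.owners[addr] = self
        self.write_set[addr] = value

    def commit(self):
        for addr, seen in self.read_set.items():
            if self.stm.versions.get(addr, 0) != seen:
                self._release()
                raise Abort("read-write conflict")  # lazy: detected at commit
        for addr, value in self.write_set.items():
            self.stm.values[addr] = value
            self.stm.versions[addr] = self.stm.versions.get(addr, 0) + 1
        self._release()

    def _release(self):
        for addr in self.write_set:
            if self.stm.owners.get(addr) is self:
                del self.stm.owners[addr]
        self.write_set = {}
        self.read_set = {}


stm = ToySTM()
t1, t2 = Tx(stm), Tx(stm)
t1.read("x")
t2.write("x", 1)   # allowed: the reader is not checked at this point
t2.commit()
try:
    t1.commit()
    lazy = None
except Abort as e:
    lazy = str(e)

t3, t4 = Tx(stm), Tx(stm)
t3.write("y", 1)
try:
    t4.write("y", 2)
    eager = None
except Abort as e:
    eager = str(e)

print(lazy, "|", eager)
```

The reader `t1` runs undisturbed until its own commit, while the second writer `t4` is stopped at its first conflicting write, which is the asymmetry the abstract describes.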

Infoscience - École polytechnique fédérale de Lausanne

Crafting Concurrent Data Structures

Author: Liu Yujie
Publication venue: Lehigh Preserve

Concurrent data structures lie at the heart of modern parallel programs. The design and implementation of concurrent data structures can be challenging due to the demand for good performance (low latency and high scalability) and strong progress guarantees. In this dissertation, we enrich the knowledge of concurrent data structure design by proposing new implementations, as well as general techniques to improve the performance of existing ones. The first part of the dissertation presents an unordered linked list implementation that supports nonblocking insert, remove, and lookup operations. The algorithm is based on a novel "enlist" technique that greatly simplifies the task of achieving wait-freedom. The value of our technique is also demonstrated in the creation of other wait-free data structures such as stacks and hash tables. The second data structure presented is a nonblocking hash table implementation which solves a long-standing design challenge by permitting the hash table to dynamically adjust its size in a nonblocking manner. Additionally, our hash table offers strong theoretical properties such as supporting unbounded memory. In our algorithm, we introduce a new "freezable set" abstraction which allows us to achieve atomic migration of keys during a resize. The freezable set abstraction also enables highly efficient implementations which maximally exploit the processor cache locality. In experiments, we found our lock-free hash table performs consistently better than state-of-the-art implementations, such as the split-ordered list. The third data structure we present is a concurrent priority queue called the "mound". Our implementations include nonblocking and lock-based variants. The mound employs randomization to reduce contention on concurrent insert operations, and decomposes a remove operation into smaller atomic operations so that multiple remove operations can execute in parallel within a pipeline.
In experiments, we show that the mound can provide excellent latency at low thread counts. Lastly, we discuss how hardware transactional memory (HTM) can be used to accelerate existing nonblocking concurrent data structure implementations. We propose optimization techniques that can significantly improve the performance (1.5x to 3x speedups) of a variety of important concurrent data structures, such as binary search trees and hash tables. The optimizations also preserve the strong progress guarantees of the original implementations.

Lehigh University: Lehigh Preserve

Tailoring Transactional Memory to Real-World Applications

Author: Zhou Tingzhe
Publication venue: Lehigh Preserve

Transactional Memory (TM) promises to provide a scalable mechanism for synchronization in concurrent programs, and to offer ease-of-use benefits to programmers. Since multiprocessor architectures have dominated CPU design, exploiting parallelism in program

Lehigh University: Lehigh Preserve