    Hardware extensions to make lazy subscription safe

    Transactional Lock Elision (TLE) uses Hardware Transactional Memory (HTM) to execute unmodified critical sections concurrently, even if they are protected by the same lock. To ensure correctness, the transactions used to execute these critical sections "subscribe" to the lock by reading it and checking that it is available. A recent paper proposed using the tempting "lazy subscription" optimization for a similar technique in a different context, namely transactional systems that use a single global lock (SGL) to protect all transactional data. We identify several pitfalls that show that lazy subscription is not safe for TLE because unmodified critical sections executing before subscribing to the lock may behave incorrectly in a number of subtle ways. We also show that recently proposed compiler support for modifying transaction code to ensure subscription occurs before any incorrect behavior could manifest is not sufficient to avoid all of the pitfalls we identify. We further argue that extending such compiler support to avoid all pitfalls would add substantial complexity and would usually limit the extent to which subscription can be deferred, undermining the effectiveness of the optimization. Hardware extensions suggested in the recent proposal also do not address all of the pitfalls we identify. In this extended version of our WTTM 2014 paper, we describe hardware extensions that make lazy subscription safe, both for SGL-based transactional systems and for TLE, without the need for special compiler support. We also explain how nontransactional loads can be exploited, if available, to further enhance the effectiveness of lazy subscription. Comment: 6 pages, extended version of WTTM 2014 paper.
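
    To make the subscription mechanism concrete, here is a minimal sketch of conventional (eager) lock subscription in TLE, written against Intel's RTM intrinsics; the lock layout and names are illustrative, not from the paper. Lazy subscription would move the lock check from the start of the transaction to just before the commit, which is exactly the step the paper shows to be unsafe for unmodified critical sections.

        #include <immintrin.h>   // Intel RTM intrinsics: _xbegin, _xend, _xabort
        #include <pthread.h>

        pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        volatile int lock_held = 0;                   // published by the slow path
        void critical_section(void) { /* unmodified user code */ }

        void tle_execute(void) {                      // compile with -mrtm on x86
            if (_xbegin() == _XBEGIN_STARTED) {
                // Eager subscription: reading the lock inside the transaction
                // adds it to the read-set, so a concurrent acquisition aborts
                // us. Lazy subscription defers this read until just before
                // _xend(), letting the body run while the lock may be held.
                if (lock_held)
                    _xabort(0xff);
                critical_section();
                _xend();
                return;
            }
            // Fallback slow path: take the lock and run non-transactionally.
            pthread_mutex_lock(&lock);
            lock_held = 1;
            critical_section();
            lock_held = 0;
            pthread_mutex_unlock(&lock);
        }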

    The Influence of Malloc Placement on TSX Hardware Transactional Memory

    The hardware transactional memory (HTM) implementation in Intel's i7-4770 "Haswell" processor tracks the transactional read-set in the L1 (level-1), L2 (level-2) and L3 (level-3) caches and the write-set in the L1 cache. Displacement or eviction of read-set entries from the cache hierarchy, or of write-set entries from the L1, results in an abort. We show that the placement policies of dynamic storage allocators -- such as those found in common "malloc" implementations -- can influence the conflict miss rate in the L1. Conflict misses -- sometimes called mapping misses -- arise because of less-than-ideal associativity and represent an imbalanced distribution of active memory blocks over the set of available L1 indices. Under transactional execution, conflict misses may manifest as aborts, representing wasted or futile effort instead of the simple stall that would occur in normal execution mode. Furthermore, when HTM is used for transactional lock elision (TLE), persistent aborts arising from conflict misses can force the offending thread through the so-called "slow path". The slow path is undesirable because the thread must acquire the lock and run the critical section in normal execution mode, precluding the concurrent execution of threads in the "fast path" that monitor that same lock and run their critical sections in transactional mode. For a given lock, multiple threads can concurrently use the transactional fast path, but at most one thread can use the non-transactional slow path at any given time. Threads in the slow path preclude safe concurrent fast-path execution. Aborts arising from placement policies and L1 index imbalance can thus result in loss of concurrency and reduced aggregate throughput.
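
    The effect is easy to visualize. Haswell's 32 KB, 8-way L1 data cache with 64-byte lines has 64 sets, and a block's set index is simply bits 6-11 of its address; if an allocator hands out blocks whose addresses cluster on a few indices, those sets overflow their 8 ways. A small sketch (cache geometry assumed, not queried) that histograms the set indices of allocated blocks:

        #include <cstdio>
        #include <cstdlib>
        #include <cstdint>

        // Assumed Haswell L1D geometry: 32 KB, 8-way, 64 B lines -> 64 sets.
        const unsigned LINE = 64;
        const unsigned SETS = 64;

        int main() {
            enum { N = 1024 };
            void* blocks[N];
            int hist[SETS] = {0};
            for (int i = 0; i < N; ++i) {
                blocks[i] = malloc(96);                 // a typical small node
                uintptr_t a = (uintptr_t)blocks[i];
                hist[(a / LINE) % SETS]++;              // L1 set of the first line
            }
            for (unsigned s = 0; s < SETS; ++s)
                printf("set %2u: %d\n", s, hist[s]);    // imbalance => conflict misses
            for (int i = 0; i < N; ++i) free(blocks[i]);
        }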

    Tailoring Transactional Memory to Real-World Applications

    Transactional Memory (TM) promises to provide a scalable mechanism for synchronization in concurrent programs, and to offer ease-of-use benefits to programmers. Since multiprocessor architectures have dominated CPU design, exploiting parallelism in programs…

    Observationally Cooperative Multithreading

    Despite widespread interest in multicore computing, concurrency models in mainstream languages often lead to subtle, error-prone code. Observationally Cooperative Multithreading (OCM) is a new approach to shared-memory parallelism. Programmers write code using the well-understood cooperative (i.e., nonpreemptive) multithreading model for uniprocessors. OCM then allows threads to run in parallel, so long as results remain consistent with the cooperative model. Programmers benefit because they can reason largely sequentially. Remaining interthread interactions are far less chaotic than in other models, permitting easier reasoning and debugging. Programmers can also defer the choice of concurrency-control mechanism (e.g., locks or transactions) until after they have written their programs, at which point they can compare concurrency-control strategies and choose the one that offers the best performance. Implementers and researchers also benefit from the agnostic nature of OCM -- it provides a level of abstraction to investigate, compare, and combine a variety of interesting concurrency-control techniques.
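
    The deferred choice of concurrency control can be pictured as programming against an abstract atomic-region interface and binding the mechanism afterwards; the sketch below is our illustration of that idea, with invented names, not OCM's actual API.

        #include <mutex>
        #include <functional>

        // Hypothetical interface: the program is written once against
        // atomic_region(); the concurrency-control mechanism is chosen later.
        struct ConcurrencyControl {
            virtual void atomic_region(const std::function<void()>& body) = 0;
            virtual ~ConcurrencyControl() {}
        };

        // One binding: a single global lock, trivially consistent with the
        // cooperative model. A transactional binding could be swapped in
        // without touching the program text.
        struct GlobalLockCC : ConcurrencyControl {
            std::mutex m;
            void atomic_region(const std::function<void()>& body) override {
                std::lock_guard<std::mutex> g(m);
                body();
            }
        };

        int counter = 0;
        void deposit(ConcurrencyControl& cc) {
            cc.atomic_region([] { ++counter; });  // reasons as if uninterrupted
        }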

    Cost of Concurrency in Hybrid Transactional Memory

    State-of-the-art software transactional memory (STM) implementations achieve good performance by carefully avoiding the overhead of incremental validation (i.e., re-reading previously read data items to avoid inconsistency) while still providing progressiveness (allowing transactional aborts only due to data conflicts). Hardware transactional memory (HTM) implementations promise even better performance, but offer no progress guarantees. Thus, they must be combined with STMs, leading to hybrid TMs (HyTMs) in which hardware transactions must be instrumented (i.e., access metadata) to detect contention with software transactions. We show that, unlike in progressive STMs, software transactions in progressive HyTMs cannot avoid incremental validation. In fact, this result holds even if hardware transactions can read metadata non-speculatively. We then present opaque HyTM algorithms providing progressiveness for a subset of transactions that are optimal in terms of hardware instrumentation. We explore the concurrency vs. hardware instrumentation vs. software validation trade-offs for these algorithms. Our experiments with Intel and IBM POWER8 HTMs suggest that (i) the cost of concurrency also exists in practice, (ii) it is important to implement HyTMs that provide progressiveness for a maximal set of transactions without incurring high hardware instrumentation overhead or using global contending bottlenecks, and (iii) there is no easy way to derive more efficient HyTMs by taking advantage of non-speculative accesses within hardware.
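
    The validation cost at issue can be seen in a toy version-based STM read; the field names and structure here are illustrative, not taken from the paper's algorithms. On every new read, an incrementally validating transaction re-checks the version of everything it has read so far, making the k-th read cost O(k):

        #include <atomic>
        #include <cstdint>
        #include <vector>

        struct TVar { std::atomic<uint64_t> version{0}; std::atomic<int> value{0}; };
        struct ReadEntry { TVar* var; uint64_t seen_version; };

        struct Tx {
            std::vector<ReadEntry> read_set;
            bool aborted = false;

            int read(TVar& v) {
                uint64_t ver = v.version.load();
                int val = v.value.load();
                // Incremental validation: every previously read location must
                // still carry the version we first observed; otherwise a
                // concurrent commit may have made our snapshot inconsistent.
                for (const ReadEntry& e : read_set)
                    if (e.var->version.load() != e.seen_version) {
                        aborted = true;   // caller rolls back and retries
                        return 0;
                    }
                read_set.push_back({&v, ver});
                return val;
            }
        };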

    A Survey on Thread-Level Speculation Techniques

    Thread-Level Speculation (TLS) is a promising technique that allows the parallel execution of sequential code without relying on a prior, compile-time dependence analysis. In this work, we introduce the technique, present a taxonomy of TLS solutions, and summarize and put into perspective the most relevant advances in this field. Funding: MICINN (Spain) and the ERDF program of the European Union: HomProg-HetSys project (TIN2014-58876-P), CAPAP-H5 network (TIN2014-53522-REDT), and COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS).

    Practical Condition Synchronization for Transactional Memory

    Few transactional memory implementations allow for condition synchronization among transactions. The problems are many, most notably the lack of consensus about a single appropriate linguistic construct, and the lack of mechanisms that are compatible with hardware transactional memory. In this thesis, we introduce a broadly useful mechanism for supporting condition synchronization among transactions. Our mechanism supports a number of linguistic constructs for coordinating transactions, and does so without introducing overhead on in-flight hardware transactions. Experiments show that our mechanisms work well, and that the diversity of linguistic constructs allows programmers to choose the technique that is best suited to a particular application.
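
    A pattern consistent with the stated goal, sketched below with invented names (and not necessarily the thesis's mechanism), is to evaluate the predicate inside the atomic region and to park a failed waiter outside any transaction, so that in-flight hardware transactions carry no extra instrumentation:

        #include <mutex>
        #include <condition_variable>
        #include <functional>

        std::mutex m;
        std::condition_variable cv;

        // Run body atomically once pred holds; the lock_guard stands in for
        // a (hardware) transaction. A failed check waits non-transactionally,
        // so transactions in flight are never instrumented for waiting.
        void atomic_when(const std::function<bool()>& pred,
                         const std::function<void()>& body) {
            for (;;) {
                {
                    std::lock_guard<std::mutex> g(m);
                    if (pred()) { body(); break; }
                }
                std::unique_lock<std::mutex> u(m);
                cv.wait(u, pred);        // park outside any transaction
            }
        }

        // Writers that may change the predicate wake the parked waiters.
        void signal_waiters() { cv.notify_all(); }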

    An Efficient Graph Accelerator with Parallel Data Conflict Management

    Graph-specific computing with the support of dedicated accelerators has greatly boosted graph processing in both efficiency and energy. Nevertheless, data conflict management in these accelerators is still essentially sequential when a vertex needs a large number of conflicting updates at the same time, leading to prohibitive performance degradation. This is particularly true for processing natural graphs. In this paper, we make the observation that the atomic operations for vertex updating in many graph algorithms (e.g., BFS, PageRank and WCC) are typically incremental and simplex. This allows us to parallelize conflicting vertex updates in an accumulative manner. We architect a novel graph-specific accelerator that can simultaneously process atomic vertex updates, achieving massive parallelism on conflicting data accesses while ensuring correctness. A parallel accumulator is designed to remove the serialization in the atomic protection of conflicting vertex updates by merging their results in parallel. Our implementation on a Xilinx Virtex UltraScale+ XCVU9P with a wide variety of typical graph algorithms shows that our accelerator achieves an average throughput of 2.36 GTEPS and up to a 3.14x speedup over the state-of-the-art ForeGraph (single-chip version).
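
    The enabling observation is that updates such as PageRank contributions are plain accumulations, so conflicting updates to one vertex can be merged before a single write instead of being serialized behind one atomic operation each. A software analogue of the accelerator's parallel accumulator (the hardware uses a reduction network, not a hash map):

        #include <unordered_map>
        #include <vector>
        #include <cstdio>

        struct Update { int vertex; float delta; };

        // Merge same-vertex deltas first, then apply one write per vertex,
        // instead of one atomic read-modify-write per conflicting update.
        void apply_batch(std::vector<float>& rank, const std::vector<Update>& batch) {
            std::unordered_map<int, float> merged;
            for (const Update& u : batch) merged[u.vertex] += u.delta;
            for (const auto& [v, d] : merged) rank[v] += d;
        }

        int main() {
            std::vector<float> rank(4, 0.0f);
            apply_batch(rank, {{2, 0.5f}, {2, 0.25f}, {1, 1.0f}});
            printf("%f %f\n", rank[1], rank[2]);   // 1.000000 0.750000
        }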

    HeTM: Transactional Memory for Heterogeneous Systems

    Modern heterogeneous computing architectures, which couple multi-core CPUs with discrete many-core GPUs (or other specialized hardware accelerators), enable unprecedented peak performance and energy efficiency levels. Unfortunately, though, developing applications that can take full advantage of the potential of heterogeneous systems is a notoriously hard task. This work takes a step towards reducing the complexity of programming heterogeneous systems by introducing the abstraction of Heterogeneous Transactional Memory (HeTM). HeTM provides programmers with the illusion of a single memory region, shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with support for atomic transactions. Besides introducing the abstract semantics and programming model of HeTM, we present the design and evaluation of a concrete implementation of the proposed abstraction, which we name Speculative HeTM (SHeTM). SHeTM makes use of a novel design that leverages speculative techniques, aiming to hide the inherently large communication latency between CPUs and discrete GPUs and to minimize inter-device synchronization overhead. SHeTM is based on a modular and extensible design that allows for easily integrating alternative TM implementations on the CPU and GPU sides, providing the flexibility to adopt, on either side, the TM implementation (e.g., in hardware or software) that best fits the application's workload and the architectural characteristics of the processing unit. We demonstrate the efficiency of SHeTM via an extensive quantitative study based both on synthetic benchmarks and on a port of a popular object caching system. Comment: accepted at the 28th International Conference on Parallel Architectures and Compilation Techniques (PACT'19).
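
    At the API level, the modularity described above can be pictured as a per-device TM backend behind one region abstraction; the sketch below uses invented names and reflects only our reading of the design, not SHeTM's actual interface.

        // Hypothetical sketch, not SHeTM's API: each device plugs its own TM
        // implementation (hardware or software) behind a common interface.
        struct TmBackend {
            virtual void begin()  = 0;
            virtual bool commit() = 0;   // false => speculation failed, retry
            virtual ~TmBackend() {}
        };

        struct HeTmRegion {
            TmBackend* cpu_tm;   // e.g., HTM or an STM on the CPU side
            TmBackend* gpu_tm;   // e.g., a software TM running on the GPU
            // Both devices see one logical memory region; the design overlaps
            // the costly CPU<->GPU transfers with speculative execution and
            // synchronizes the devices only at batch boundaries.
        };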