
253 research outputs found


Software lock elision for x86 machine code

Author: Roy Amitabha
Publication venue: University of Cambridge
Publication date: 12/07/2011

More than a decade after becoming a topic of intense research there is no transactional memory hardware nor any examples of software transactional memory use outside the research community. Using software transactional memory in large pieces of software needs copious source code annotations and often means that standard compilers and debuggers can no longer be used. At the same time, overheads associated with software transactional memory fail to motivate programmers to expend the needed effort to use software transactional memory. The only way around the overheads in the case of general unmanaged code is the anticipated availability of hardware support. On the other hand, architects are unwilling to devote power and area budgets in mainstream microprocessors to hardware transactional memory, pointing to transactional memory being a "niche" programming construct. A deadlock has thus ensued that is blocking transactional memory use and experimentation in the mainstream. This dissertation covers the design and construction of a software transactional memory runtime system called SLE_x86 that can potentially break this deadlock by decoupling transactional memory from programs using it. Unlike most other STM designs, the core design principle is transparency rather than performance. SLE_x86 operates at the level of x86 machine code, thereby becoming immediately applicable to binaries for the popular x86 architecture. The only requirement is that the binary synchronise using known locking constructs or calls such as those in Pthreads or OpenMP libraries. SLE_x86 provides speculative lock elision (SLE) entirely in software, executing critical sections in the binary using transactional memory. Optionally, the critical sections can also be executed without using transactions by acquiring the protecting lock. 
The dissertation makes a careful analysis of the impact on performance due to the demands of the x86 memory consistency model and the need to transparently instrument x86 machine code. It shows that both of these problems can be overcome to reach a reasonable level of performance, where transparent software transactional memory can perform better than a lock. SLE_x86 can ensure that programs are ready for transactional memory in any form, without being explicitly written for it.

Apollo (Cambridge)

Hardware Transactional Memory Optimization Guidelines, Applied to Ordered Maps

Author: Bonnichsen Lars Frydendal
Karlsson Sven
Probst Christian W.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2015

Crossref

Online Research Database In Technology

An Adaptive Middleware for Improved Computational Performance

Author: Bonnichsen Lars Frydendal
Publication venue: Technical University of Denmark
Publication date: 01/01/2016

Online Research Database In Technology

Achieving High Performance and High Productivity in Next Generational Parallel Programming Languages

Author: Kumar Vivek
Publication date: 01/01/2014

Processor design has turned toward parallelism and heterogeneous cores to achieve performance and energy efficiency. Developers find high-level languages attractive because they use abstraction to offer productivity and portability over hardware complexities. To achieve performance, some modern implementations of high-level languages use work-stealing scheduling for load balancing of dynamically created tasks. Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle hardware busy while relieving overloaded hardware of its burden. However, work-stealing comes with substantial overheads. These overheads arise as a necessary side effect of the implementation and hamper parallel performance. In addition to runtime-imposed overheads, there is a substantial cognitive load associated with ensuring that parallel code is data-race free. This dissertation explores the overheads associated with achieving high performance parallelism in modern high-level languages. My thesis is that, by exploiting existing underlying mechanisms of managed runtimes and by extending existing language design, high-level languages will be able to deliver productivity and parallel performance at the levels necessary for widespread uptake.
The key contributions of my thesis are: 1) a detailed analysis of the key sources of overhead associated with a work-stealing runtime, namely sequential and dynamic overheads; 2) novel techniques to reduce these overheads that use rich features of managed runtimes such as the yieldpoint mechanism, on-stack replacement, dynamic code-patching, exception handling support, and return barriers; 3) comprehensive analysis of the resulting benefits, which demonstrate that work-stealing overheads can be significantly reduced, leading to substantial performance improvements; and 4) a small set of language extensions that achieve both high performance and high productivity with minimal programmer effort. A managed runtime forms the backbone of any modern implementation of a high-level language. Managed runtimes enjoy the benefits of a long history of research and their implementations are highly optimized. My thesis demonstrates that converging these highly optimized features together with the expressiveness of high-level languages gives further hope for achieving high performance and high productivity on modern parallel hardware.

The Australian National University

Analyzing the Impact of Concurrency on Scaling Machine Learning Programs Using TensorFlow

Author: Denizov Sheyn
Publication venue: Lehigh Preserve

In recent times, computer scientists and technology companies have quickly begun to realize the potential of machine learning and of creating computer software that is capable of reasoning for itself (at least in theory). What was once only considered science fiction lore is now becoming a reality in front of our very eyes. With this type of computational capability at our disposal, we are left with the question of how best to use it and where to start in creating models that can help us best utilize it. TensorFlow is an open source software library used in machine learning, developed and released by Google. It was created by the company in order to help them meet their expanding needs to train systems that can build and detect neural networks for pattern recognition that could be used in their services. It was first released by the Google Brain Team in November 2015 and, at the time of the preparation of this research, the project is still being heavily developed by programmers and researchers both inside of Google and around the world. Thus, it is very possible that some future releases of the software package could remove and/or replace some current capabilities. The point of this thesis is to examine how machine learning programs written with TensorFlow that do not scale well (such as large-scale neural networks) can be made more scalable by using concurrency and distribution of computation among threads. To do this, we will be using lock elision using condition variables and locking mechanisms (such as semaphores) to allow for smooth distribution of resources to be used by the architecture. We present the trial runs and results of the added implementations and where the results fell short of optimistic expectations. Although TensorFlow is still a work in progress, we will also address where this framework was insufficient in addressing the needs of programmers attempting to write scalable code and whether this type of implementation is sustainable.

Lehigh University: Lehigh Preserve

LIPIcs

Author: Alistarh Dan-Adrian
Fedorov Alexander
Koval Nikita
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 14/11/2019

Union-Find (or Disjoint-Set Union) is one of the fundamental problems in computer science; it has been well-studied from both theoretical and practical perspectives in the sequential case. Recently, there has been mounting interest in analyzing this problem in the concurrent scenario, and several asymptotically-efficient algorithms have been proposed. Yet, to date, there is very little known about the practical performance of concurrent Union-Find. This work addresses this gap. We evaluate and analyze the performance of several concurrent Union-Find algorithms and optimization strategies across a wide range of platforms (Intel, AMD, and ARM) and workloads (social, random, and road networks, as well as integrations into more complex algorithms). We first observe that, due to the limited computational cost, the number of induced cache misses is the critical determining factor for the performance of existing algorithms. We introduce new techniques to reduce this cost by storing node priorities implicitly and by using plain reads and writes in a way that does not affect the correctness of the algorithms. Finally, we show that Union-Find implementations are an interesting application for Transactional Memory (TM): one of the fastest algorithm variants we discovered is a sequential one that uses coarse-grained locking with the lock elision optimization to reduce synchronization cost and increase scalability.
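
As a rough illustration of the coarse-grained variant mentioned in the abstract, the following Python sketch wraps a sequential Union-Find in a single lock. Hardware lock elision cannot be requested from Python, so an ordinary `threading.Lock` stands in for the elidable lock. The class name `CoarseUnionFind`, its method names, and the use of the node index as an implicit priority are assumptions made for this sketch and are not taken from the paper.

```python
import threading


class CoarseUnionFind:
    """Sequential Union-Find guarded by one global lock (illustrative only)."""

    def __init__(self, n):
        self.parent = list(range(n))
        # Stand-in for a lock that real hardware could elide.
        self.lock = threading.Lock()

    def _find(self, x):
        # Path halving: point each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the sets of a and b; return False if already merged."""
        with self.lock:
            ra, rb = self._find(a), self._find(b)
            if ra == rb:
                return False
            # Assumption of this sketch: the node index doubles as the
            # priority, so no separate rank array is stored.
            if ra < rb:
                ra, rb = rb, ra
            self.parent[rb] = ra
            return True

    def same_set(self, a, b):
        with self.lock:
            return self._find(a) == self._find(b)
```

Every operation takes the same lock, so the data structure itself stays sequential. Any scalability in the paper's setting would come from the elision mechanism, which this sketch does not model.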

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

IST Austria: PubRep (Institute of Science and Technology)

Contention Adapting Search Trees

Author: Kjell Winblad
Konstantinos Sagonas
Publication date: 11/04/2020

Abstract: With multicores being ubiquitous, concurrent data structures are becoming increasingly important. This paper proposes a novel approach to concurrent data structure design where the data structure collects statistics about contention and adapts dynamically according to these statistics. We use this approach to create a contention adapting binary search tree (CA tree) that can be used to implement concurrent ordered sets and maps. Our experimental evaluation shows that CA trees scale similarly to recently proposed algorithms on a big multicore machine in various scenarios with a larger set size, and outperform the same data structures in more contended scenarios and in sequential performance. We also show that CA trees are well suited for optimization with hardware lock elision. In short, we propose a practically useful concurrent search tree that is easy to implement and show correct, and that naturally adapts to the level of contention.
I. INTRODUCTION With multicores being widespread, the need for efficient concurrent data structures has increased. In this paper we propose a novel adaptive technique for creating concurrent data structures. Our technique collects statistics about contention in locks and does local adaptations dynamically to reduce the contention or to optimize for low contention. This is the first contribution of this paper. Previous research on adapting to the level of contention has focused on objects where access cannot be easily distributed, such as locks. We demonstrate the benefits of our contention adapting technique by describing and evaluating a data structure for concurrent ordered sets or maps. We call this data structure contention adapting search tree or CA tree for short. The design of CA trees is the second contribution of this paper.
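
The idea of collecting contention statistics in a lock can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class `StatLock`, the helper `suggested_adaptation`, and all numeric constants (the cost of a failed try-lock, the gain of a successful one, and the two thresholds) are hypothetical values chosen for the example.

```python
import threading


class StatLock:
    """Lock that keeps a running contention statistic (illustrative only)."""

    def __init__(self, fail_cost=250, success_gain=1):
        self._lock = threading.Lock()
        self.stat = 0                      # high value means contended
        self.fail_cost = fail_cost         # hypothetical constant
        self.success_gain = success_gain   # hypothetical constant

    def acquire(self):
        # Try without blocking first; a failure signals contention.
        if self._lock.acquire(blocking=False):
            self.stat -= self.success_gain
        else:
            self._lock.acquire()
            self.stat += self.fail_cost
        # In both branches the lock is held while stat is updated.

    def release(self):
        self._lock.release()


def suggested_adaptation(lock, high=1000, low=-1000):
    """Map the statistic to an action; call while holding the lock."""
    if lock.stat > high:
        return "split"   # spread the protected data over more locks
    if lock.stat < low:
        return "join"    # merge with a neighbour to cut locking overhead
    return "none"
```

A data structure built on such locks could check `suggested_adaptation` after each operation and restructure locally. How the split and join steps work in a CA tree is not described in the text above, so they are left out here.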

CiteSeerX