856 research outputs found
Lock-free Concurrent Data Structures
Concurrent data structures are the data sharing side of parallel programming.
Data structures give the means to the program to store data, but also provide
operations to the program to access and manipulate these data. These operations
are implemented through algorithms that have to be efficient. In the sequential
setting, data structures are crucially important for the performance of the
respective computation. In the parallel programming setting, their importance
becomes more crucial because of the increased use of data and resource sharing
for utilizing parallelism.
The first and main goal of this chapter is to provide a sufficient background
and intuition to help the interested reader to navigate in the complex research
area of lock-free data structures. The second goal is to offer the programmer
familiarity to the subject that will allow her to use truly concurrent methods.Comment: To appear in "Programming Multi-core and Many-core Computing
Systems", eds. S. Pllana and F. Xhafa, Wiley Series on Parallel and
Distributed Computin
Recommended from our members
Software lock elision for x86 machine code
More than a decade after becoming a topic of intense research there is no
transactional memory hardware nor any examples of software transactional memory
use outside the research community. Using software transactional memory in large
pieces of software needs copious source code annotations and often means
that standard compilers and debuggers can no longer be used. At the same time,
overheads associated with software transactional memory fail to motivate
programmers to expend the needed effort to use software transactional
memory. The only way around the overheads in the case of general unmanaged code
is the anticipated availability of hardware support. On the other hand, architects
are unwilling to devote power and area budgets in mainstream microprocessors to
hardware transactional memory, pointing to transactional memory being a
"niche" programming construct. A deadlock has thus ensued that is blocking
transactional memory use and experimentation in the mainstream.
This dissertation covers the design and construction of a software transactional
memory runtime system called SLE_x86 that can potentially break this
deadlock by decoupling transactional memory from programs using it. Unlike most
other STM designs, the core design principle is transparency rather than
performance. SLE_x86 operates at the level of x86 machine code, thereby
becoming immediately applicable to binaries for the popular x86
architecture. The only requirement is that the binary synchronise using known
locking constructs or calls such as those in Pthreads or OpenMP
libraries. SLE_x86 provides speculative lock elision (SLE) entirely in
software, executing critical sections in the binary using transactional
memory. Optionally, the critical sections can also be executed without using
transactions by acquiring the protecting lock.
The dissertation makes a careful analysis of the impact on performance due to
the demands of the x86 memory consistency model and the need to transparently
instrument x86 machine code. It shows that both of these problems can be
overcome to reach a reasonable level of performance, where transparent
software transactional memory can perform better than a lock. SLE_x86 can
ensure that programs are ready for transactional memory in any form, without
being explicitly written for it
Boosting performance of transactional memory through transactional read tracking and set associative locks
Multi-core processors have become so prevalent in server, desktop, and even embedded systems that they are considered the norm for modem computing systems. The trend is likely toward many-core processors with many more than just 2, 4, or 8 cores per CPU. To benefit from
the increasing number of cores per chip, application developers have to develop parallel programs [1]. Traditional lock-based programming is too difficult and error prone for most of programmers and is the domain of experts. Deadlock, race, and other synchronization bugs are some of the challenges of lock-based programming. To make parallel programming mainstream, it is necessary to adapt parallel programming by the majority of programmers and not just experts, and thus simplifying parallel programming has become an important challenge.
Transactional Memory (TM) is a promising programming model for managing concurrent accesses to the shared memory locations. Transactional memory allows a programmer to specify a section of a code to be "'transactional", and the underlying system guarantees atomic execution of the code. This simplifies parallel programming and reduces the possibility of synchronization bugs.
This thesis develops several software- and hardware-based techniques to improve performance of existing transactional memory systems. The first technique is Transactional Read Tracking (TRT). TRT is a software-based approach that employs a locking mechanism for transactional read and write operations. The performance of TRT depends on memory access patterns of applications. In some cases, TRT falls behind the baseline scheme. To further improve performance of TRT, we introduce two hybrid methods that dynamically switches between TRT and the baseline scheme based on applications’ behavior.
The second optimization technique is Set Associative Lock (SAL). Memory locations are mapped to a lock table in order to synchronize accesses to the shared memory locations. Direct mapped lock tables usually result in collision which leads to false aborts. In SAL, we increase
associativity of the lock table to reduce false abort. While SAL improves performance in most of the applications, in some cases, it increases execution time due to overhead of lock tables in software. To cope with this problem, we propose Hardware-SAL (HW-SAL) which moves the
set associative lock table to the hardware. As such, true power of set associativity will be harnessed without sacrificing performance
Recommended from our members
Galois : a system for parallel execution of irregular algorithms
textA programming model which allows users to program with high productivity and which produces high performance executions has been a goal for decades. This dissertation makes progress towards this elusive goal by describing the design and implementation of the Galois system, a parallel programming model for shared-memory, multicore machines. Central to the design is the idea that scheduling of a program can be decoupled from the core computational operator and data structures. However, efficient programs often require application-specific scheduling to achieve best performance. To bridge this gap, an extensible and abstract scheduling policy language is proposed, which allows programmers to focus on selecting high-level scheduling policies while delegating the tedious task of implementing the policy to a scheduler synthesizer and runtime system. Implementations of deterministic and prioritized scheduling also are described. An evaluation of a well-studied benchmark suite reveals that factoring programs into operators, schedulers and data structures can produce significant performance improvements over unfactored approaches. Comparison of the Galois system with existing programming models for graph analytics shows significant performance improvements, often orders of magnitude more, due to (1) better support for the restrictive programming models of existing systems and (2) better support for more sophisticated algorithms and scheduling, which cannot be expressed in other systems.Computer Science
HeTM: Transactional Memory for Heterogeneous Systems
Modern heterogeneous computing architectures, which couple multi-core CPUs
with discrete many-core GPUs (or other specialized hardware accelerators),
enable unprecedented peak performance and energy efficiency levels.
Unfortunately, though, developing applications that can take full advantage of
the potential of heterogeneous systems is a notoriously hard task. This work
takes a step towards reducing the complexity of programming heterogeneous
systems by introducing the abstraction of Heterogeneous Transactional Memory
(HeTM). HeTM provides programmers with the illusion of a single memory region,
shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with
support for atomic transactions. Besides introducing the abstract semantics and
programming model of HeTM, we present the design and evaluation of a concrete
implementation of the proposed abstraction, which we named Speculative HeTM
(SHeTM). SHeTM makes use of a novel design that leverages on speculative
techniques and aims at hiding the inherently large communication latency
between CPUs and discrete GPUs and at minimizing inter-device synchronization
overhead. SHeTM is based on a modular and extensible design that allows for
easily integrating alternative TM implementations on the CPU's and GPU's sides,
which allows the flexibility to adopt, on either side, the TM implementation
(e.g., in hardware or software) that best fits the applications' workload and
the architectural characteristics of the processing unit. We demonstrate the
efficiency of the SHeTM via an extensive quantitative study based both on
synthetic benchmarks and on a porting of a popular object caching system.Comment: The current work was accepted in the 28th International Conference on
Parallel Architectures and Compilation Techniques (PACT'19
Enhancing the efficiency and practicality of software transactional memory on massively multithreaded systems
Chip Multithreading (CMT) processors promise to deliver higher performance by running more than one stream of instructions in parallel. To exploit CMT's capabilities, programmers have to parallelize their applications, which is not a trivial task. Transactional Memory (TM) is one of parallel programming models that aims at simplifying synchronization by raising the level of abstraction between semantic atomicity and the means by which that atomicity is achieved. TM is a promising programming model but there are still important challenges that must be addressed to make it more practical and efficient in mainstream parallel programming.
The first challenge addressed in this dissertation is that of making the evaluation of TM proposals more solid with realistic TM benchmarks and being able to run the same benchmarks on different STM systems. We first introduce a benchmark suite, RMS-TM, a comprehensive benchmark suite to evaluate HTMs and STMs. RMS-TM consists of seven applications from the Recognition, Mining and Synthesis (RMS) domain that are representative of future workloads. RMS-TM features current TM research issues such as nesting and I/O inside transactions, while also providing various TM characteristics. Most STM systems are implemented as user-level libraries: the programmer is expected to manually instrument not only transaction boundaries, but also individual loads and stores within transactions. This library-based approach is increasingly tedious and error prone and also makes it difficult to make reliable performance comparisons. To enable an "apples-to-apples" performance comparison, we then develop a software layer that allows researchers to test the same applications with interchangeable STM back ends.
The second challenge addressed is that of enhancing performance and scalability of TM applications running on aggressive multi-core/multi-threaded processors. Performance and scalability of current TM designs, in particular STM desings, do not always meet the programmer's expectation, especially at scale. To overcome this limitation, we propose a new STM design, STM2, based on an assisted execution model in which time-consuming TM operations are offloaded to auxiliary threads while application threads optimistically perform computation. Surprisingly, our results show that STM2 provides, on average, speedups between 1.8x and 5.2x over state-of-the-art STM systems. On the other hand, we notice that assisted-execution systems may show low processor utilization. To alleviate this problem and to increase the efficiency of STM2, we enriched STM2 with a runtime mechanism that automatically and adaptively detects application and auxiliary threads' computing demands and dynamically partition hardware resources between the pair through the hardware thread prioritization mechanism implemented in POWER machines.
The third challenge is to define a notion of what it means for a TM program to be correctly synchronized. The current definition of transactional data race requires all transactions to be totally ordered "as if'' serialized by a global lock, which limits the scalability of TM designs. To remove this constraint, we first propose to relax the current definition of transactional data race to allow a higher level of concurrency. Based on this definition we propose the first practical race detection algorithm for C/C++ applications (TRADE) and implement the corresponding race detection tool. Then, we introduce a new definition of transactional data race that is more intuitive, transparent to the underlying TM implementation, can be used for a broad set of C/C++ TM programs. Based on this new definition, we proposed T-Rex, an efficient and scalable race detection tool for C/C++ TM applications. Using TRADE and T-Rex, we have discovered subtle transactional data races in widely-used STAMP applications which have not been reported in the past
- …