4,521 research outputs found
Improving Transactional Memory Performance for Irregular Applications
Postprint de autor publicado posteriormente con este DOI:http://dx.doi.org/10.1016/j.procs.2015.05.398Transactional memory (TM) offers optimistic concurrency support in modern multicore archi-
tectures, helping the programmers to extract parallelism in irregular applications when data
dependence information is not available before runtime. In fact, recent research focus on ex-
ploiting thread-level parallelism using TM approaches. However, the proposed techniques are
of general use, valid for any type of application.
This work presents ReduxSTM, a software TM system specially designed to extract maxi-
mum parallelism from irregular applications. Commit management and conflict detection are
tailored to take advantage of both, sequential transaction ordering to assure correct results,
and privatization of reduction patterns, a very frequent memory access pattern in irregular
applications. Both techniques are used to avoid unnecessary transaction aborts.
A function in 300.twolf package from SPEC CPU2000 was taken as a motivating irregular
program. This code was parallelized using ReduxSTM and an ordered version of TinySTM,
a state-of-the-art TM system. Experimental evaluation shows that ReduxTM exploits more
parallelism from the sequential program and obtains better performance than the other system.Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
A Comparative Analysis of STM Approaches to Reduction Operations in Irregular Applications
As a recently consolidated paradigm for optimistic concurrency in modern multicore architectures, Transactional Memory (TM)
can help to the exploitation of parallelism in irregular applications when data dependence information is not available up to run-
time. This paper presents and discusses how to leverage TM to exploit parallelism in an important class of irregular applications, the class that exhibits irregular reduction patterns. In order to test and compare our techniques with other solutions, they were implemented in a software TM system called ReduxSTM, that acts as a proof of concept. Basically, ReduxSTM combines two major ideas: a sequential-equivalent ordering of transaction commits that assures the correct result, and an extension of the underlying TM privatization mechanism to reduce unnecessary overhead due to reduction memory updates as well as unnecesary aborts and rollbacks. A comparative study of STM solutions, including ReduxSTM, and other more classical approaches to the parallelization of reduction operations is presented in terms of time, memory and overhead.Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
TMbarrier: speculative barriers using hardware transactional memory
Barrier is a very common synchronization method used in parallel programming. Barriers are used typically to enforce a partial thread execution order, since there may be dependences between code sections before and after the barrier. This work proposes TMbarrier, a new design of a barrier intended to be used in transactional applications. TMbarrier allows threads to continue executing speculatively after the barrier assuming that there are not dependences with safe threads that have not yet reached the barrier. Our design leverages transactional memory (TM) (specifically, the implementation offered by the IBM POWER8 processor) to hold the speculative updates and to detect possible conflicts between speculative and safe threads. Despite the limitations of the best-effort hardware TM implementation present in current processors, experiments show a reduction in wasted time due to synchronization compared to standard barriers.Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
Automatic skeleton-driven performance optimizations for transactional memory
The recent shift toward multi -core chips has pushed the burden of extracting performance to the programmer. In fact, programmers now have to be able to uncover more
coarse -grain parallelism with every new generation of processors, or the performance
of their applications will remain roughly the same or even degrade. Unfortunately,
parallel programming is still hard and error prone. This has driven the development of
many new parallel programming models that aim to make this process efficient.This thesis first combines the skeleton -based and transactional memory programming models in a new framework, called OpenSkel, in order to improve performance
and programmability of parallel applications. This framework provides a single skeleton that allows the implementation of transactional worklist applications. Skeleton or
pattern-based programming allows parallel programs to be expressed as specialized instances of generic communication and computation patterns. This leaves the programmer with only the implementation of the particular operations required to solve the
problem at hand. Thus, this programming approach simplifies parallel programming
by eliminating some of the major challenges of parallel programming, namely thread
communication, scheduling and orchestration. However, the application programmer
has still to correctly synchronize threads on data races. This commonly requires the
use of locks to guarantee atomic access to shared data. In particular, lock programming
is vulnerable to deadlocks and also limits coarse grain parallelism by blocking threads
that could be potentially executed in parallel.Transactional Memory (TM) thus emerges as an attractive alternative model to simplify parallel programming by removing this burden of handling data races explicitly.
This model allows programmers to write parallel code as transactions, which are then
guaranteed by the runtime system to execute atomically and in isolation regardless of
eventual data races. TM programming thus frees the application from deadlocks and
enables the exploitation of coarse grain parallelism when transactions do not conflict
very often. Nevertheless, thread management and orchestration are left for the application programmer. Fortunately, this can be naturally handled by a skeleton framework.
This fact makes the combination of skeleton -based and transactional programming a
natural step to improve programmability since these models complement each other.
In fact, this combination releases the application programmer from dealing with thread
management and data races, and also inherits the performance improvements of both
models. In addition to it, a skeleton framework is also amenable to skeleton - driven
iii
performance optimizations that exploits the application pattern and system information.This thesis thus also presents a set of pattern- oriented optimizations that are automatically selected and applied in a significant subset of transactional memory applications that shares a common pattern called worklist. These optimizations exploit the
knowledge about the worklist pattern and the TM nature of the applications to avoid
transaction conflicts, to prefetch data, to reduce contention etc. Using a novel autotuning mechanism, OpenSkel dynamically selects the most suitable set of these patternoriented performance optimizations for each application and adjusts them accordingly.
Experimental results on a subset of five applications from the STAMP benchmark suite
show that the proposed autotuning mechanism can achieve performance improvements
within 2 %, on average, of a static oracle for a 16 -core UMA (Uniform Memory Access) platform and surpasses it by 7% on average for a 32 -core NUMA (Non -Uniform
Memory Access) platform.Finally, this thesis also investigates skeleton -driven system- oriented performance
optimizations such as thread mapping and memory page allocation. In order to do
it, the OpenSkel system and also the autotuning mechanism are extended to accommodate these optimizations. The conducted experimental results on a subset of five
applications from the STAMP benchmark show that the OpenSkel framework with the
extended autotuning mechanism driving both pattern and system- oriented optimizations can achieve performance improvements of up to 88 %, with an average of 46 %,
over a baseline version for a 16 -core UMA platform and up to 162 %, with an average
of 91 %, for a 32 -core NUMA platform
A Survey on Thread-Level Speculation Techniques
Producción CientíficaThread-Level Speculation (TLS) is a promising technique that allows the parallel execution of sequential code without relying on a prior, compile-time-dependence analysis. In this work, we introduce the technique, present a taxonomy of TLS solutions, and summarize and put into perspective the most relevant advances in this field.MICINN (Spain) and ERDF program of the European Union: HomProg-HetSys project (TIN2014-58876-P), CAPAP-H5 network (TIN2014-53522-REDT), and COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS)
08241 Abstracts Collection -- Transactional Memory : From Implementation to Application
From 08.06. to 13.06.2008, the Dagstuhl Seminar 08241 ``Transactional Memory: From Implementation to Application\u27\u27 was held in Schloss Dagstuhl -- Leibniz Center for Informatics.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts of
the presentations given during the seminar as well as abstracts of
seminar results and ideas are put together in this paper. The first section
describes the seminar topics and goals in general.
Links to extended abstracts or full papers are provided, if available
A key-based adaptive transactional memory executor
Software transactional memory systems enable a programmer to easily write concurrent data structures such as lists, trees, hashtables, and graphs, where nonconflicting operations proceed in parallel. Many of these structures take the abstract form of a dictionary, in which each transaction is associated with a search key. By regrouping transactions based on their keys, one may improve locality and reduce conflicts among parallel transactions. In this paper, we present an executor that partitions transactions among available processors. Our keybased adaptive partitioning monitors incoming transactions, estimates the probability distribution of their keys, and adaptively determines the (usually nonuniform) partitions. By comparing the adaptive partitioning with uniform partitioning and round-robin keyless partitioning on a 16-processor SunFire 6800 machine, we demonstrate that key-based adaptive partitioning significantly improves the throughput of finegrained parallel operations on concurrent data structures
- …