68,991 research outputs found
TMbarrier: speculative barriers using hardware transactional memory
Barrier is a very common synchronization method used in parallel programming. Barriers are used typically to enforce a partial thread execution order, since there may be dependences between code sections before and after the barrier. This work proposes TMbarrier, a new design of a barrier intended to be used in transactional applications. TMbarrier allows threads to continue executing speculatively after the barrier assuming that there are not dependences with safe threads that have not yet reached the barrier. Our design leverages transactional memory (TM) (specifically, the implementation offered by the IBM POWER8 processor) to hold the speculative updates and to detect possible conflicts between speculative and safe threads. Despite the limitations of the best-effort hardware TM implementation present in current processors, experiments show a reduction in wasted time due to synchronization compared to standard barriers.Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
Compiler-assisted Adaptive Program Scheduling in big.LITTLE Systems
Energy-aware architectures provide applications with a mix of low (LITTLE)
and high (big) frequency cores. Choosing the best hardware configuration for a
program running on such an architecture is difficult, because program parts
benefit differently from the same hardware configuration. State-of-the-art
techniques to solve this problem adapt the program's execution to dynamic
characteristics of the runtime environment, such as energy consumption and
throughput. We claim that these purely dynamic techniques can be improved if
they are aware of the program's syntactic structure. To support this claim, we
show how to use the compiler to partition source code into program phases:
regions whose syntactic characteristics lead to similar runtime behavior. We
use reinforcement learning to map pairs formed by a program phase and a
hardware state to the configuration that best fit this setup. To demonstrate
the effectiveness of our ideas, we have implemented the Astro system. Astro
uses Q-learning to associate syntactic features of programs with hardware
configurations. As a proof of concept, we provide evidence that Astro
outperforms GTS, the ARM-based Linux scheduler tailored for heterogeneous
architectures, on the parallel benchmarks from Rodinia and Parsec
Recommended from our members
Percolation scheduling for non-VLIW machines
Percolation Scheduling, a technique for compile-time code parallelization, has proven very successful for exploiting fine-grain irregular parallelism in ordinary programs. Currently, this technology is targeted only to VLIW (Very Long Instruction Word) machines, which have the advantages of 'free' synchronization and communication. Shared memory multi-processors can simulate the execution characteristics of VLIW machines with the use of static barriers. Preliminary results show that Percolation Scheduling can be used with good results on this type of architecture by increasing the granularity from operation level to source statement level, removing any redundant synchronization, and providing an efficient implementation of multi-way jumps
Relaxed Operational Semantics of Concurrent Programming Languages
We propose a novel, operational framework to formally describe the semantics
of concurrent programs running within the context of a relaxed memory model.
Our framework features a "temporary store" where the memory operations issued
by the threads are recorded, in program order. A memory model then specifies
the conditions under which a pending operation from this sequence is allowed to
be globally performed, possibly out of order. The memory model also involves a
"write grain," accounting for architectures where a thread may read a write
that is not yet globally visible. Our formal model is supported by a software
simulator, allowing us to run litmus tests in our semantics.Comment: In Proceedings EXPRESS/SOS 2012, arXiv:1208.244
System software for the finite element machine
The Finite Element Machine is an experimental parallel computer developed at Langley Research Center to investigate the application of concurrent processing to structural engineering analysis. This report describes system-level software which has been developed to facilitate use of the machine by applications researchers. The overall software design is outlined, and several important parallel processing issues are discussed in detail, including processor management, communication, synchronization, and input/output. Based on experience using the system, the hardware architecture and software design are critiqued, and areas for further work are suggested
Comparing barrier algorithms
A barrier is a method for synchronizing a large number of concurrent computer processes. After considering some basic synchronization mechanisms, a collection of barrier algorithms with either linear or logarithmic depth are presented. A graphical model is described that profiles the execution of the barriers and other parallel programming constructs. This model shows how the interaction between the barrier algorithms and the work that they synchronize can impact their performance. One result is that logarithmic tree structured barriers show good performance when synchronizing fixed length work, while linear self-scheduled barriers show better performance when synchronizing fixed length work with an imbedded critical section. The linear barriers are better able to exploit the process skew associated with critical sections. Timing experiments, performed on an eighteen processor Flex/32 shared memory multiprocessor, that support these conclusions are detailed
Synchronising C/C++ and POWER
Shared memory concurrency relies on synchronisation primitives: compare-and-swap, load-reserve/store-conditional (aka LL/SC), language-level mutexes, and so on. In a sequentially consistent setting, or even in the TSO setting of x86 and Sparc, these have well-understood semantics. But in the very relaxed settings of IBM®, POWER®, ARM, or C/C++, it remains surprisingly unclear exactly what the programmer can depend on.
This paper studies relaxed-memory synchronisation. On the hardware side, we give a clear semantic characterisation of the load-reserve/store-conditional primitives as provided by POWER multiprocessors, for the first time since they were introduced 20 years ago; we cover their interaction with relaxed loads, stores, barriers, and dependencies. Our model, while not officially sanctioned by the vendor, is validated by extensive testing, comparing actual implementation behaviour against an oracle generated from the model, and by detailed discussion with IBM staff. We believe the ARM semantics to be similar.
On the software side, we prove sound a proposed compilation scheme of the C/C++ synchronisation constructs to POWER, including C/C++ spinlock mutexes, fences, and read-modify-write operations, together with the simpler atomic operations for which soundness is already known from our previous work; this is a first step in verifying concurrent algorithms that use load-reserve/store-conditional with respect to a realistic semantics. We also build confidence in the C/C++ model in its own terms, fixing some omissions and contributing to the C standards committee adoption of the C++11 concurrency model
- …