93 research outputs found
A Survey on Thread-Level Speculation Techniques
Thread-Level Speculation (TLS) is a promising technique that allows the parallel execution of sequential code without relying on a prior compile-time dependence analysis. In this work, we introduce the technique, present a taxonomy of TLS solutions, and summarize and put into perspective the most relevant advances in this field. Funding: MICINN (Spain) and the ERDF program of the European Union (HomProg-HetSys project TIN2014-58876-P and CAPAP-H5 network TIN2014-53522-REDT), and COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS).
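To make the TLS idea concrete, here is a minimal C++ sketch, not taken from the survey, of the scheme in software terms: chunks of a loop run speculatively in parallel with their writes buffered, and an in-order commit phase squashes and re-executes any chunk whose reads overlap an earlier chunk's writes. All names (Task, run_speculative, the example loop) are hypothetical.

```cpp
#include <cstdio>
#include <functional>
#include <map>
#include <set>
#include <thread>
#include <vector>

struct Task {                      // one speculative chunk of loop iterations
    int begin, end;                // iteration range [begin, end)
    std::set<int> reads, writes;   // indices touched speculatively
    std::map<int, int> write_buf;  // buffered index -> value updates
};

// Speculatively execute a[idx[i]] += a[i] for one chunk without touching
// shared state: writes are buffered, and reads forward from that buffer.
void run_speculative(Task& t, const std::vector<int>& a, const std::vector<int>& idx) {
    auto load = [&](int j) {                 // read-your-own-writes forwarding
        t.reads.insert(j);
        auto it = t.write_buf.find(j);
        return it != t.write_buf.end() ? it->second : a[j];
    };
    for (int i = t.begin; i < t.end; ++i) {
        int dst = idx[i];                    // data-dependent target, unknown statically
        int sum = load(dst) + load(i);
        t.writes.insert(dst);
        t.write_buf[dst] = sum;
    }
}

int main() {
    std::vector<int> a   = {1, 2, 3, 4, 5, 6, 7, 8};
    std::vector<int> idx = {4, 5, 6, 7, 0, 1, 2, 3};

    std::vector<Task> tasks = {{0, 4}, {4, 8}};   // two speculative chunks
    std::vector<std::thread> pool;
    for (auto& t : tasks)
        pool.emplace_back(run_speculative, std::ref(t), std::cref(a), std::cref(idx));
    for (auto& th : pool) th.join();

    // Commit in program order; squash a chunk if it read a location that an
    // earlier chunk wrote (a cross-chunk flow dependence was violated).
    for (size_t k = 0; k < tasks.size(); ++k) {
        bool violated = false;
        for (size_t j = 0; j < k && !violated; ++j)
            for (int r : tasks[k].reads)
                if (tasks[j].writes.count(r)) { violated = true; break; }
        if (violated) {                                // squash and re-execute
            for (int i = tasks[k].begin; i < tasks[k].end; ++i)
                a[idx[i]] += a[i];                     // safe: runs after earlier commits
        } else {
            for (auto [d, v] : tasks[k].write_buf) a[d] = v;   // commit buffered writes
        }
    }
    for (int v : a) std::printf("%d ", v);
    std::printf("\n");
}
```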
Putting checkpoints to work in thread level speculative execution
With the advent of Chip Multiprocessors (CMPs), improving performance relies on programmers and compilers to expose thread-level parallelism to the underlying hardware. Unfortunately, this is a difficult and error-prone process for programmers, while state-of-the-art compiler techniques are unable to provide significant benefits for many classes of applications. An interesting alternative is offered by systems that support Thread-Level Speculation (TLS), which relieve the programmer and compiler from checking for thread dependencies and instead use the hardware to enforce them. Unfortunately, data misspeculation carries a high cost, since all intermediate results have to be discarded and threads have to roll back to the beginning of the speculative task. For this reason, intermediate checkpointing of the state of TLS threads has been proposed: when a violation does occur, execution rolls back only to a checkpoint before the violating instruction rather than to the start of the task. However, previous work omits the microarchitectural details and implementation issues that are essential for effective checkpointing, and checkpoints have only been proposed and evaluated for a narrow class of benchmarks.
This thesis studies checkpoints on a state-of-the-art TLS system running a variety of benchmarks. The mechanisms required for checkpointing and their associated costs are described. Hardware modifications required to make checkpointed execution efficient in time and power are proposed and evaluated. Further, the need to accurately identify suitable points for placing checkpoints is established, and various techniques for identifying these points are analysed in terms of both effectiveness and viability, including an extensive evaluation of data dependence prediction techniques. The results show that checkpointing thread-level speculative execution yields consistent power savings and, for many benchmarks, leads to speedups as well.
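The sketch below, a purely illustrative software analogue rather than the thesis's hardware design, shows the payoff the abstract describes: with periodic checkpoints of the speculative state, a dependence violation only forces re-execution from the newest checkpoint before the violating instruction instead of from the start of the task. The checkpoint interval, task length, and violating iteration are made up.

```cpp
#include <cstdio>
#include <map>
#include <vector>

struct Checkpoint {
    int next_iter;                  // first iteration not yet executed
    std::map<int, int> write_buf;   // snapshot of the speculative state
};

int main() {
    const int task_len = 16, ckpt_interval = 4;
    std::map<int, int> write_buf;                 // speculative write buffer
    std::vector<Checkpoint> ckpts = {{0, {}}};    // implicit checkpoint at task start

    auto do_iter = [&](int i) { write_buf[i % 8] = i; };   // stand-in for real work

    // Speculative execution with a checkpoint every ckpt_interval iterations.
    for (int i = 0; i < task_len; ++i) {
        if (i > 0 && i % ckpt_interval == 0)
            ckpts.push_back({i, write_buf});      // snapshot the speculative state
        do_iter(i);
    }

    // A dependence violation is flagged at this iteration (assumed for the demo).
    const int violating_iter = 10;

    // Roll back to the newest checkpoint at or before the violation,
    // instead of discarding the whole task and restarting from iteration 0.
    Checkpoint restore = ckpts.front();
    for (const Checkpoint& c : ckpts)
        if (c.next_iter <= violating_iter) restore = c;

    write_buf = restore.write_buf;
    std::printf("re-executing iterations %d..%d instead of 0..%d\n",
                restore.next_iter, task_len - 1, task_len - 1);
    for (int i = restore.next_iter; i < task_len; ++i)
        do_iter(i);                               // replay only the tail
}
```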
Compiler-Driven Software Speculation for Thread-Level Parallelism
Current parallelizing compilers can tackle applications exercising regular access patterns on arrays or affine indices, where data dependencies can be expressed in a linear form. Unfortunately, there are cases in which independence between statements of code cannot be guaranteed, and the compiler then conservatively produces sequential code. Programs that involve extensive pointer use, irregular access patterns, and loops with an unknown number of iterations are examples of such cases. This limits the extraction of parallelism in cases where dependencies are rarely or never triggered at runtime. Speculative parallelism refers to methods employed during program execution that aim to produce a valid parallel execution schedule for programs immune to static parallelization. The motivation for this article is to review recent developments in the area of compiler-driven software speculation for thread-level parallelism and how they came about. The article is divided into two parts. In the first part, the fundamentals of speculative parallelization for thread-level parallelism are explained, along with a categorization of the design choices involved in implementing such systems. Design choices include the way speculative data is handled, how data dependence violations are detected and resolved, how the correct data are made visible to other threads, and how speculative threads are scheduled. The second part is structured around those design choices, presenting the advances and trends in the literature with reference to key developments in the area. Although the focus of the article is on software speculative parallelization, a section is dedicated to providing the interested reader with pointers and references for exploring similar topics such as hardware thread-level speculation, transactional memory, and automatic parallelization.
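As an example of one of the design choices listed above, the following sketch, which is illustrative and not from the article, shows an "eager" way of handling speculative data: writes go to shared memory in place while an undo log records the overwritten values, so a detected dependence violation is resolved by rolling the log back before re-execution. (The lazy alternative would buffer writes privately and publish them only at commit.)

```cpp
#include <cstdio>
#include <vector>

struct UndoEntry { int index; int old_value; };

int main() {
    std::vector<int> a = {10, 20, 30, 40};
    std::vector<UndoEntry> undo_log;

    // Speculative writes go in place, logging the overwritten values.
    auto spec_store = [&](int i, int v) {
        undo_log.push_back({i, a[i]});
        a[i] = v;
    };
    spec_store(1, 99);
    spec_store(3, 77);

    bool violation_detected = true;   // assume the runtime flags a violation
    if (violation_detected) {
        // Resolve the violation: undo in reverse order to restore the last
        // committed state; the speculative work can then be re-executed.
        for (auto it = undo_log.rbegin(); it != undo_log.rend(); ++it)
            a[it->index] = it->old_value;
    }
    undo_log.clear();
    for (int v : a) std::printf("%d ", v);  // prints the original values
    std::printf("\n");
}
```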
Exploiting semantic commutativity in hardware speculation
Hardware speculative execution schemes such as hardware transactional memory (HTM) enjoy low run-time overheads but suffer from limited concurrency because they rely on reads and writes to detect conflicts. By contrast, software speculation schemes can exploit semantic knowledge of concurrent operations to reduce conflicts. In particular, they often exploit the fact that many operations on shared data, like insertions into sets, are semantically commutative: they produce semantically equivalent results when reordered. However, software techniques often incur unacceptable run-time overheads. To resolve this dichotomy, we present CommTM, an HTM that exploits semantic commutativity. CommTM extends the coherence protocol and conflict detection scheme to support user-defined commutative operations. Multiple cores can perform commutative operations to the same data concurrently and without conflicts. CommTM preserves transactional guarantees and can be applied to arbitrary HTMs. CommTM scales on many operations that serialize in conventional HTMs, like set insertions, reference counting, and top-K insertions, and retains the low overhead of HTMs. As a result, at 128 cores, CommTM outperforms a conventional eager-lazy HTM by up to 3.4× and reduces or eliminates aborts. Funding: National Science Foundation (U.S.) (Grant CAREER-1452994).
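The following sketch is a software analogue of the key idea only; CommTM itself implements this in the coherence protocol and conflict-detection hardware. It shows how a conflict detector that knows certain user-declared operations commute (here, an "add" as used for reference counting) can let two transactions update the same location without conflicting, while an ordinary read of that location still conflicts. The types and conflict rules are hypothetical.

```cpp
#include <cstdio>
#include <vector>

enum class OpKind { Read, Write, CommutativeAdd };
struct Op { OpKind kind; int addr; int operand; };
using Txn = std::vector<Op>;

// Two ops on the same address conflict unless both are the same
// labeled-commutative operation (or both are plain reads).
bool conflicts(const Op& x, const Op& y) {
    if (x.addr != y.addr) return false;
    if (x.kind == OpKind::CommutativeAdd && y.kind == OpKind::CommutativeAdd)
        return false;                        // commutative updates reorder freely
    if (x.kind == OpKind::Read && y.kind == OpKind::Read) return false;
    return true;                             // any other sharing is a conflict
}

bool txns_conflict(const Txn& a, const Txn& b) {
    for (const Op& x : a)
        for (const Op& y : b)
            if (conflicts(x, y)) return true;
    return false;
}

int main() {
    // Both transactions bump a shared reference count at address 7.
    Txn t1 = {{OpKind::CommutativeAdd, 7, +1}};
    Txn t2 = {{OpKind::CommutativeAdd, 7, +1}};
    // A third transaction reads the count, so it must still serialize.
    Txn t3 = {{OpKind::Read, 7, 0}};

    std::printf("t1 vs t2: %s\n", txns_conflict(t1, t2) ? "conflict" : "no conflict");
    std::printf("t1 vs t3: %s\n", txns_conflict(t1, t3) ? "conflict" : "no conflict");
}
```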
Complementing user-level coarse-grain parallelism with implicit speculative parallelism
Multi-core and many-core systems are the norm in contemporary processor technology and are expected to remain so for the foreseeable future. Parallel programming is therefore here to stay, and programmers have to embrace it if they are to exploit such systems for their applications. Programs using parallel programming primitives like PThreads or OpenMP often exploit coarse-grain parallelism, because it offers a good trade-off between programming effort and performance gain. Some parallel applications, however, show limited or no scaling beyond a certain number of cores. Given the abundant number of cores expected in future many-cores, several cores would remain idle in such cases while execution performance stagnates. This thesis proposes using cores that do not contribute to performance improvement for running implicit fine-grain speculative threads. In particular, we present a many-core architecture and protocols that allow applications with coarse-grain explicit parallelism to further exploit implicit speculative parallelism within each thread. We show that complementing parallel programs with implicit speculative mechanisms offers significant performance improvements for a large and diverse set of parallel benchmarks. Implicit speculative parallelism frees the programmer from the additional effort of explicitly partitioning the work into finer, properly synchronized tasks. Our results show that, for a many-core comprising 128 cores supporting implicit speculative parallelism in clusters of 2 or 4 cores, performance improves on top of the highest scalability point by 44% on average for the 4-core cluster and by 31% on average for the 2-core cluster. We also show that this approach often leads to better performance and energy efficiency compared to existing alternatives such as Core Fusion and Turbo Boost. Moreover, we present a dynamic mechanism to choose the number of explicit and implicit threads, which performs within 6% of the static oracle selection of threads.
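As a toy illustration only, since the abstract does not spell out the mechanism, the sketch below chooses between "no implicit speculation", 2-core clusters, and 4-core clusters on a 128-core machine by comparing hypothetical measured runtimes, mimicking in spirit a dynamic selection of explicit versus implicit threads. All numbers are invented.

```cpp
#include <cstdio>
#include <vector>

struct Config { int cluster_size; double measured_runtime_s; };

int main() {
    const int total_cores = 128;
    // Hypothetical runtimes for one program region under three configurations.
    std::vector<Config> candidates = {
        {1, 9.8},   // 128 explicit threads, no implicit helpers
        {2, 8.1},   // 64 explicit threads + 1 implicit helper core each
        {4, 7.4},   // 32 explicit threads + 3 implicit helper cores each
    };

    const Config* best = &candidates[0];
    for (const Config& c : candidates)
        if (c.measured_runtime_s < best->measured_runtime_s) best = &c;

    int explicit_threads = total_cores / best->cluster_size;
    std::printf("chosen: %d explicit threads in clusters of %d "
                "(%d implicit helper core(s) per cluster)\n",
                explicit_threads, best->cluster_size, best->cluster_size - 1);
}
```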
To improve energy efficiency, processors allow for Dynamic Voltage and Frequency Scaling (DVFS), which enables changing their performance and power consumption on the fly. We evaluate the amenability of the proposed explicit-plus-implicit threads scheme to traditional power management techniques for multithreaded applications and identify room for improvement. We therefore augment prior schemes and introduce a novel multithreaded power management scheme that accounts for implicit threads and aims to minimize the Energy Delay² product (ED²). Our scheme comprises two components: a “local” component that tries to adapt to the different program phases on a per-explicit-thread basis, taking into account implicit thread behavior, and a “global” component that augments the local components with information regarding inter-thread synchronization. Experimental results show a reduction in ED² of 8% compared to having no power management, with an average reduction in power of 15% that comes at a minimal performance loss of less than 3% on average.
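For reference, ED² is simply energy multiplied by the square of execution time. The sketch below, with invented numbers, shows how a power manager could compare candidate DVFS operating points by this metric and pick the one that minimizes it.

```cpp
#include <cstdio>
#include <vector>

struct OperatingPoint { double freq_ghz, power_w, delay_s; };

int main() {
    // Hypothetical measurements of one program phase at three V/f settings.
    std::vector<OperatingPoint> pts = {
        {1.0,  4.0, 2.00},   // low frequency: low power, long delay
        {2.0,  9.0, 1.05},
        {3.0, 16.0, 0.80},   // high frequency: high power, short delay
    };

    int best = 0;
    double best_ed2 = 1e300;
    for (size_t i = 0; i < pts.size(); ++i) {
        double energy = pts[i].power_w * pts[i].delay_s;        // E = P * D
        double ed2 = energy * pts[i].delay_s * pts[i].delay_s;  // ED^2 = E * D^2
        std::printf("%.1f GHz: E=%.2f J  ED2=%.2f\n", pts[i].freq_ghz, energy, ed2);
        if (ed2 < best_ed2) { best_ed2 = ed2; best = (int)i; }
    }
    std::printf("chosen setting: %.1f GHz\n", pts[best].freq_ghz);
}
```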