Runtime asynchronous fault tolerance via speculation
Transient faults are emerging as a critical reliability concern in modern microprocessors. Redundant hardware solutions are commonly deployed to detect transient faults, but they are less flexible and cost-effective than software solutions. Software solutions, however, have been rendered impractical by their high performance overheads. To address this problem, this paper presents Runtime Asynchronous Fault Tolerance via Speculation (RAFT), the fastest transient fault detection technique known to date. Serving as a layer between the application and the underlying platform, RAFT automatically generates two symmetric program instances from a program binary. It detects transient faults in a non-invasive way and exploits high-confidence value speculation to achieve low runtime overhead. Evaluation on a commodity multicore system demonstrates that RAFT delivers a geomean performance overhead of 2.83% on a set of 30 SPEC CPU and STAMP benchmarks. Compared with existing transient fault detection techniques, RAFT exhibits the best performance and fault coverage, without requiring any change to the hardware or to the software applications.
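The dual-instance idea behind RAFT can be illustrated with a toy sketch (Python stands in for RAFT's binary-level machinery and value speculation; the function names below are assumptions for the sketch, not RAFT's API): run two symmetric copies of a computation and flag a transient fault when their outputs diverge.

```python
# Toy sketch of redundant-execution fault detection: run two symmetric
# instances of a computation and compare their outputs. A mismatch
# signals a transient fault. (Illustrative only; RAFT duplicates at the
# binary level and keeps the instances in sync with value speculation.)
def detect_transient_fault(compute, data, inject_fault=None):
    leading = compute(data)                 # leading instance
    trailing = compute(data)                # trailing instance
    if inject_fault is not None:
        trailing = inject_fault(trailing)   # simulate a bit flip
    return trailing == leading              # True -> outputs agree

checksum = lambda xs: sum(xs) & 0xFFFF

fault_free = detect_transient_fault(checksum, range(100))
faulty = detect_transient_fault(checksum, range(100),
                                inject_fault=lambda v: v ^ 0x4)
# fault_free -> True, faulty -> False
```

Comparing only final outputs keeps the check non-invasive, at the cost of detecting a fault later than instruction-level comparison would.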
Exploiting semantic commutativity in hardware speculation
Hardware speculative execution schemes such as hardware transactional memory (HTM) enjoy low run-time overheads but suffer from limited concurrency because they rely on reads and writes to detect conflicts. By contrast, software speculation schemes can exploit semantic knowledge of concurrent operations to reduce conflicts. In particular, they often exploit the fact that many operations on shared data, like insertions into sets, are semantically commutative: they produce semantically equivalent results when reordered. However, software techniques often incur unacceptable run-time overheads. To resolve this dichotomy, we present CommTM, an HTM that exploits semantic commutativity. CommTM extends the coherence protocol and conflict detection scheme to support user-defined commutative operations. Multiple cores can perform commutative operations to the same data concurrently and without conflicts. CommTM preserves transactional guarantees and can be applied to arbitrary HTMs. CommTM scales on many operations that serialize in conventional HTMs, like set insertions, reference counting, and top-K insertions, and retains the low overhead of HTMs. As a result, at 128 cores, CommTM outperforms a conventional eager-lazy HTM by up to 3.4× and reduces or eliminates aborts.
National Science Foundation (U.S.) (Grant CAREER-1452994)
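A toy conflict model (an illustrative assumption, not CommTM's actual coherence-protocol mechanism) shows why labeling operations as commutative reduces conflicts: two transactions that both insert into the same set conflict under plain read/write detection but are allowed to proceed under commutativity-aware detection.

```python
# Toy conflict detector contrasting plain read/write conflict detection
# with commutativity-aware detection. (Illustrative model only; names
# and the operation encoding are assumptions for this sketch.)
def conflicts(ops_a, ops_b, semantic=False):
    """ops: tuples (kind, location, label), kind in {read, write, commute}."""
    for kind_a, loc_a, label_a in ops_a:
        for kind_b, loc_b, label_b in ops_b:
            if loc_a != loc_b:
                continue                  # different data: no interaction
            if kind_a == kind_b == "read":
                continue                  # concurrent reads never conflict
            if semantic and kind_a == kind_b == "commute" and label_a == label_b:
                continue                  # same commutative op: reorderable
            return True                   # otherwise: must serialize
    return False

# Two transactions inserting into the same set:
insert_a = [("commute", "set1", "insert")]
insert_b = [("commute", "set1", "insert")]
# conflicts(insert_a, insert_b)                -> True  (treated as writes)
# conflicts(insert_a, insert_b, semantic=True) -> False (they commute)
```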
Putting checkpoints to work in thread level speculative execution
With the advent of Chip Multiprocessors (CMPs), improving performance relies on
programmers and compilers to expose thread-level parallelism to the underlying hardware.
Unfortunately, this is a difficult and error-prone process for programmers,
while state-of-the-art compiler techniques are unable to provide significant benefits
for many classes of applications. An interesting alternative is offered by systems that
support Thread Level Speculation (TLS), which relieve the programmer and compiler
from checking for thread dependencies and instead use the hardware to enforce them.
Unfortunately, data misspeculation incurs a high cost, since all the intermediate
results have to be discarded and threads have to roll back to the beginning of the
speculative task. For this reason, intermediate checkpointing of the state of the TLS
threads has been proposed. When a violation does occur, execution need only roll back
to a checkpoint before the violating instruction rather than to the start of the task. However,
previous work omits study of the microarchitectural details and implementation
issues that are essential for effective checkpointing. Further, checkpoints have only
been proposed and evaluated for a narrow class of benchmarks.
This thesis studies checkpoints on a state-of-the-art TLS system running a variety
of benchmarks. The mechanisms required for checkpointing and their associated costs
are described. Hardware modifications required to make checkpointed execution
efficient in time and power are proposed and evaluated. Further, the need to accurately
identify suitable points for placing checkpoints is established. Various techniques
for identifying these points are analysed in terms of both effectiveness and viability.
This includes an extensive evaluation of data dependence prediction techniques. The
results show that checkpointing thread-level speculative execution yields consistent
power savings and, for many benchmarks, speedups as well.
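The benefit that checkpointing targets can be captured in a back-of-envelope cost model (the numbers below are illustrative assumptions, not results from the thesis): on a misspeculation, a plain TLS task re-executes everything from its start, while a checkpointed task re-executes only from the last checkpoint before the violating instruction.

```python
# Toy model of re-execution cost after a misspeculation: a plain TLS
# task rolls back to its start, while a checkpointed task rolls back
# only to the last checkpoint before the violating instruction.
def reexecution_cost(violation_points, checkpoint_interval=None):
    cost = 0
    for k in violation_points:        # k = instructions completed so far
        if checkpoint_interval is None:
            cost += k                 # discard the whole task so far
        else:
            cost += k % checkpoint_interval   # discard back to checkpoint
    return cost

# Violations at instructions 430 and 910 of a task, checkpoint every 100:
plain = reexecution_cost([430, 910])        # re-execute 1340 instructions
ckpt = reexecution_cost([430, 910], 100)    # re-execute only 40
```

The gap between the two costs is what must pay for the time and power overhead of taking the checkpoints themselves, which is why checkpoint placement matters.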
Mixed Speculative Multithreaded Execution Models
Institute for Computing Systems Architecture
The current trend toward chip multiprocessor architectures has placed great pressure
on programmers and compilers to generate thread-parallel programs. Improved execution
performance can no longer be obtained via traditional single-thread instruction
level parallelism (ILP), but, instead, via multithreaded execution. One notable technique
that facilitates the extraction of parallel threads from sequential applications is
thread-level speculation (TLS). This technique allows programmers/compilers to generate
threads without checking for inter-thread data and control dependences, which
are then transparently enforced by the hardware. Most prior work on TLS has concentrated
on thread selection and mechanisms to efficiently support the main TLS operations,
such as squashes, data versioning, and commits.
This thesis seeks to enhance TLS functionality by combining it with other speculative
multithreaded execution models. The main idea is that TLS already requires
extensive hardware support, which, when slightly augmented, can accommodate other
speculative multithreaded techniques. Recognizing that for different applications, or
even program phases, the application bottlenecks may be different, it is reasonable to
assume that the more versatile a system is, the more efficiently it will be able to execute
the given program.
As mentioned above, generating thread-parallel programs is hard and TLS has
been suggested as an execution model that can speculatively exploit thread-level parallelism
(TLP) even when thread independence cannot be guaranteed by the programmer/
compiler. Alternatively, the helper threads (HT) execution model has been proposed
where subordinate threads are executed in parallel with a main thread in order to
improve the execution efficiency (i.e., ILP) of the latter. Yet another execution model,
runahead execution (RA), has also been proposed where subordinate versions of the
main thread are dynamically created especially to cope with long-latency operations,
again with the aim of improving the execution efficiency of the main thread (ILP).
Each one of these multithreaded execution models works best for different applications
and application phases. We combine these three models into a single execution
model and single hardware infrastructure such that the system can dynamically adapt
to find the most appropriate multithreaded execution model. More specifically, TLS is favored whenever successful parallel execution of instructions in multiple threads
(i.e., TLP) is possible and the system can seamlessly transition at run-time to the other
models otherwise. In order to understand the tradeoffs involved, we also develop a performance
model that allows one to quantitatively attribute overall performance gains
to either TLP or ILP in such a combined multithreaded execution model.
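One simple form such an attribution can take (an illustrative assumption, not the thesis's actual performance model) is to factor overall speedup into an ILP term, the per-thread improvement in cycles per instruction, and a residual TLP term for overlap across threads.

```python
# Factor overall speedup into an ILP gain (CPI improvement) and a TLP
# gain (the remaining cross-thread overlap). Illustrative sketch only.
def decompose_speedup(seq_cycles, par_cycles, seq_cpi, par_cpi):
    overall = seq_cycles / par_cycles
    ilp_gain = seq_cpi / par_cpi          # per-thread efficiency
    tlp_gain = overall / ilp_gain         # what remains is overlap
    return overall, ilp_gain, tlp_gain

# A run that halves total cycles while improving CPI from 1.0 to 0.8:
overall, ilp, tlp = decompose_speedup(1e9, 5e8, 1.0, 0.8)
# overall = 2.0, ilp = 1.25, tlp = 1.6
```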
Experimental results show that our combined execution model achieves speedups
of up to 41.2%, with an average of 10.2%, over an existing state-of-the-art TLS system
and speedups of up to 35.2%, with an average of 18.3%, over a flavor of runahead
execution for a subset of the SPEC2000 Integer benchmark suite.
We then investigate how a common ILP-enhancing microarchitectural feature, namely
branch prediction, interacts with TLS. We show that branch prediction is even
more important for TLS than it is for single-core machines. Unfortunately, branch
prediction for TLS systems is also inherently harder: code partitioning and
re-execution of squashed threads pollute the branch history, making it harder for
predictors to be accurate.
We thus propose to augment the hardware, so as to accommodate Multi-Path (MP)
execution within the existing TLS protocol. Under the MP execution model, all paths
following a number of hard-to-predict conditional branches are followed. MP execution
thus removes branches that would otherwise have been mispredicted, helping the
processor exploit more ILP. We show that with only minimal hardware
support, one can combine these two execution models into a unified one, which can
achieve far better performance than both TLS and MP execution.
Experimental results show that our combined execution model achieves speedups of
up to 20.1%, with an average of 8.8%, over an existing state-of-the-art TLS system and
speedups of up to 125%, with an average of 29.0%, when compared with multi-path
execution for a subset of the SPEC2000 Integer benchmark suite.
Finally, since systems that support speculative multithreading usually treat all
threads equally, they are energy-inefficient. This inefficiency stems from the fact that
speculation occasionally fails and, thus, power is spent on threads that will have to
be discarded. We propose a profitability-based power allocation scheme, where we
“steal” power from non-profitable threads and use it to speed up more useful ones. We
evaluate our techniques for a state-of-the-art TLS system and show that, with minimal hardware support, we achieve improvements in energy-delay (ED) of up to 25.5%, with an average of
18.9%, for a subset of the SPEC 2000 Integer benchmark suite.
The parallel event loop model and runtime: a parallel programming model and runtime system for safe event-based parallel programming
Recent trends in programming models for server-side development have shown an increasing popularity of event-based single-threaded programming models based on the combination of dynamic languages such as JavaScript and event-based runtime systems for asynchronous I/O management such as Node.js. Reasons for the success of such models are the simplicity of the single-threaded event-based programming model as well as the growing popularity of the Cloud as a deployment platform for Web applications. Unfortunately, the popularity of single-threaded models comes at the price of performance and scalability, as single-threaded event-based models present limitations when parallel processing is needed, and traditional approaches to concurrency such as threads and locks don't play well with event-based systems. This dissertation proposes a programming model and a runtime system to overcome such limitations by enabling single-threaded event-based applications with support for speculative parallel execution. The model, called Parallel Event Loop, has the goal of bringing parallel execution to the domain of single-threaded event-based programming without relaxing the main characteristics of the single-threaded model, thereby providing developers with the impression of a safe, single-threaded runtime. Rather than supporting only pure single-threaded programming, however, the parallel event loop can also be used to derive safe, high-level parallel programming models characterized by strong compatibility with single-threaded runtimes. We describe three distinct implementations of speculative runtimes enabling the parallel execution of event-based applications. The first is a pessimistic runtime system based on locks to implement speculative parallelization. The second and third are based on two distinct optimistic runtimes using software transactional memory.
Each of the implementations supports the parallelization of applications written in an asynchronous single-threaded programming style, and each of them enables applications to benefit from parallel execution.
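The pessimistic variant can be sketched in a few lines (a toy analogue in Python; the dissertation's runtimes target JavaScript engines, and the class and method names here are assumptions): handlers are dispatched to a thread pool, but a lock serializes their effects on shared state, preserving the illusion of atomic, single-threaded handlers.

```python
# Toy analogue of the pessimistic runtime: handlers run on a thread
# pool, but a lock serializes their effects on shared state, so each
# handler appears atomic, as under a single-threaded event loop.
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

class ParallelEventLoop:
    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.lock = Lock()
        self.state = {}                    # shared, loop-visible state

    def emit(self, handler, *args):
        def guarded():
            with self.lock:                # pessimistic: one handler at a time
                handler(self.state, *args)
        return self.pool.submit(guarded)

loop = ParallelEventLoop()

def count(state, key):                     # an ordinary event handler
    state[key] = state.get(key, 0) + 1

futures = [loop.emit(count, "clicks") for _ in range(100)]
for f in futures:
    f.result()                             # wait for all handlers
loop.pool.shutdown()
# loop.state["clicks"] == 100, with no lost updates
```

The optimistic variants replace the global lock with transactional execution of each handler, rolling back and retrying on conflict instead of blocking.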
Logical partitioning of parallel system simulations
Simulation has been a fundamental tool to prototype, hypothesize, and evaluate
new ideas to continue improving system performance. However, increasing levels
of processor parallelism and heterogeneity have introduced additional
constraints when evaluating new designs. The work embodied in this dissertation
explores how to leverage novel ideas in simulator partitioning to improve
simulator speed and flexibility for simulating these new types of systems.
The contribution of this work includes the introduction of optimistic
partitioned simulation to improve parallelization, and the introduction of
warped partitioned simulation for improved flexibility. These ideas are refined
and demonstrated through the use of prototypes to demonstrate their benefits
compared to state-of-the-art approaches. By leveraging partitioning in a
structured manner, it is possible to design simulators that better address the
open challenges of parallel and heterogeneous systems design.
Electrical and Computer Engineering
Functional programming abstractions for weakly consistent systems
In recent years, there has been widespread adoption of both multicore and cloud computing. Traditionally, concurrent programmers have relied on the underlying system providing strong memory consistency, where there is a semblance of concurrent tasks operating over a shared global address space. However, providing scalable strong consistency guarantees as the scale of the system grows is an increasingly difficult endeavor. In a multicore setting, the increasing complexity and the lack of scalability of hardware mechanisms such as cache coherence deter scalable strong consistency. In geo-distributed compute clouds, availability concerns in the presence of partial failures prohibit strong consistency. Hence, modern multicore and cloud computing platforms eschew strong consistency in favor of weakly consistent memory, where each task's memory view is incomparable with those of other tasks. As a result, programmers on these platforms must tackle the full complexity of concurrent programming for an asynchronous distributed system. This dissertation argues that functional programming language abstractions can simplify scalable concurrent programming for weakly consistent systems. Functional programming espouses mutation-free programming, and the rare mutations, when present, are explicit in their types. By controlling and explicitly reasoning about shared-state mutations, functional abstractions simplify concurrent programming. Building upon this intuition, this dissertation presents three major contributions, each focused on addressing a particular challenge associated with weakly consistent, loosely coupled systems. First, it describes ANERIS, a concurrent functional programming language and runtime for the Intel Single-chip Cloud Computer, and shows how to provide an efficient cache-coherent virtual address space on top of a non-cache-coherent multicore architecture.
Next, it describes RxCML, a distributed extension of MULTIMLTON and shows that, with the help of speculative execution, synchronous communication can be utilized as an efficient abstraction for programming asynchronous distributed systems. Finally, it presents QUELEA, a programming system for eventually consistent distributed stores, and shows that the choice of correct consistency level for replicated data type operations and transactions can be automated with the help of high-level declarative contracts
Reducing exception management overhead with software restart markers
Thesis (Ph. D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (p. 181-196).
Modern processors rely on exception handling mechanisms to detect errors and to implement various features such as virtual memory. However, these mechanisms are typically hardware-intensive because of the need to buffer partially-completed instructions to implement precise exceptions and enforce in-order instruction commit, often leading to issues with performance and energy efficiency. The situation is exacerbated in highly parallel machines with large quantities of programmer-visible state, such as VLIW or vector processors. As architects increasingly rely on parallel architectures to achieve higher performance, the problem of exception handling is becoming critical. In this thesis, I present software restart markers as the foundation of an exception handling mechanism for explicitly parallel architectures. With this model, the compiler is responsible for delimiting regions of idempotent code. If an exception occurs, the operating system will resume execution from the beginning of the region. One advantage of this approach is that instruction results can be committed to architectural state in any order within a region, eliminating the need to buffer those values. Enabling out-of-order commit can substantially reduce the exception management overhead found in precise exception implementations, and enable the use of new architectural features that might be prohibitively costly with conventional precise exception implementations. Additionally, software restart markers can be used to reduce context switch overhead in a multiprogrammed environment. This thesis demonstrates the applicability of software restart markers to vector, VLIW, and multithreaded architectures.
It also contains an implementation of this exception handling approach that uses the Trimaran compiler infrastructure to target the Scale vector-thread architecture. I show that using software restart markers incurs very little performance overhead for vector-style execution on Scale. Finally, I describe the Scale compiler flow developed as part of this work and discuss how it targets certain features facilitated by the use of software restart markers.
by Mark Jerome Hampton. Ph.D.
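The core mechanism can be mimicked in miniature (a toy analogue only; the actual mechanism is a compiler/OS contract, and the names below are assumptions): a region that writes only its own outputs is idempotent, so on a fault it can simply be re-run from its restart marker with no buffered intermediate state.

```python
# Toy analogue of software restart markers: re-run an idempotent
# region from its start when a fault interrupts it mid-execution.
class TransientFault(Exception):
    """Stands in for an exception delivered partway through a region."""

def restart_region(fn, retries=3):
    def run(*args):
        for _ in range(retries):
            try:
                return fn(*args)          # results may commit in any order
            except TransientFault:
                continue                  # resume from the restart marker
        raise RuntimeError("region kept faulting")
    return run

faults_left = [1]                         # inject exactly one fault

@restart_region
def squares(xs):                          # idempotent: writes only its output
    if faults_left[0]:
        faults_left[0] -= 1
        raise TransientFault()
    return [x * x for x in xs]

result = squares([1, 2, 3])               # faults once, then succeeds
# result == [1, 4, 9]
```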
Identifying, Quantifying, Extracting and Enhancing Implicit Parallelism
The shift of the microprocessor industry towards multicore architectures has
placed a huge burden on the programmers by requiring explicit parallelization
for performance. Implicit Parallelization is an alternative that could ease the
burden on programmers by parallelizing applications “under the covers” while
maintaining sequential semantics externally. This thesis develops a novel
approach for thinking about parallelism, by casting the problem of
parallelization in terms of instruction criticality. Using this approach,
parallelism in a program region is readily identified when certain conditions
about fetch-criticality are satisfied by the region. The thesis formalizes this
approach by developing a criticality-driven model of task-based
parallelization. The model can accurately predict the parallelism that would be
exposed by potential task choices by capturing a wide set of sources of
parallelism as well as costs to parallelization.
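A standard proxy for the parallelism such a model must predict (an assumption for illustration; the thesis's criticality-driven model is richer, accounting for dependence-enforcement and task costs) is the work/span ratio of a region's dependence DAG: total instruction cost divided by the critical-path cost.

```python
# Available parallelism estimated as work divided by critical-path
# length over a dependence DAG. Illustrative sketch only.
def available_parallelism(costs, deps):
    """costs: {node: cost}; deps: {node: [predecessor nodes]}."""
    finish = {}
    def path(n):                           # longest path ending at n
        if n not in finish:
            preds = deps.get(n, [])
            finish[n] = costs[n] + max((path(p) for p in preds), default=0)
        return finish[n]
    work = sum(costs.values())
    span = max(path(n) for n in costs)     # critical-path length
    return work / span

# Diamond DAG a -> {b, c} -> d with unit costs: work 4, span 3.
costs = {"a": 1, "b": 1, "c": 1, "d": 1}
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
ratio = available_parallelism(costs, deps)   # 4/3
```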
The criticality-driven model enables the development of two key components for
Implicit Parallelization: a task selection policy, and a bottleneck analysis
tool. The task selection policy can partition a single-threaded program into
tasks that will profitably execute concurrently on a multicore architecture in
spite of the costs associated with enforcing data-dependences and with
task-related actions. The bottleneck analysis tool gives feedback to the
programmers about data-dependences that limit parallelism. In particular, there
are several “accidental dependences” that can be easily removed with large
improvements in parallelism. These tools combine into a systematic methodology
for performance tuning in Implicit Parallelization. Finally, armed with the
criticality-driven model, the thesis revisits several architectural design
decisions, and finds several encouraging ways forward to increase the scope of
Implicit Parallelization.
Unpublished; not peer reviewed.