Executing requests concurrently in state machine replication
State machine replication is one of the most popular ways to achieve fault tolerance. In
a nutshell, the state machine replication approach maintains multiple replicas that both
store a copy of the system’s data and execute operations on that data. When requests
to execute operations arrive, an “agree-execute” protocol keeps replicas synchronized:
they first agree on an order to execute the incoming operations, and then execute the
operations one at a time in the agreed upon order, so that every replica reaches the same
final state.
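The "agree-execute" discipline described above can be sketched in a few lines (the names here are illustrative, not from the dissertation): every replica applies the same operations, one at a time, in the agreed-upon order, so all replicas deterministically reach the same state.

```python
# Minimal sketch of "agree-execute" state machine replication: replicas
# apply an agreed total order of operations sequentially, so they all
# converge to the same final state.

class Replica:
    def __init__(self):
        self.state = {}

    def execute(self, ordered_ops):
        # Apply operations one at a time, in the agreed order.
        for op, key, value in ordered_ops:
            if op == "put":
                self.state[key] = value
            elif op == "add":
                self.state[key] = self.state.get(key, 0) + value

# The agreement protocol (e.g. consensus) yields one total order of requests.
agreed_order = [("put", "x", 1), ("add", "x", 2), ("put", "y", 5)]

r1, r2 = Replica(), Replica()
r1.execute(agreed_order)
r2.execute(agreed_order)
assert r1.state == r2.state == {"x": 3, "y": 5}
```

Because execution is sequential and deterministic, agreement on the order alone suffices for consistency; the dissertation's contribution is recovering parallelism without giving up this guarantee.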
Multi-core processors are the norm, but taking advantage of the available processor
cores to execute operations simultaneously is at odds with the “agree-execute” protocol:
simultaneous execution is inherently unpredictable, so in the end replicas may arrive
at different final states and the system becomes inconsistent. On one hand, we want to
take advantage of the available processor cores to execute operations simultaneously and
improve performance. But on the other hand, replicas must abide by the operation order
that they agreed upon for the system to remain consistent. This dissertation proposes
a solution to this dilemma. At a high level, we propose to use speculative execution
techniques to execute operations simultaneously while nonetheless ensuring that their
execution is equivalent to having executed the operations sequentially in the order the
replicas agreed upon. To achieve this, we: (1) propose to execute operations as serializable
transactions, and (2) develop a new concurrency control protocol that ensures that the
concurrent execution of a set of transactions respects the serialization order the replicas
agreed upon. Since speculation is only effective if it is successful, we also (3) propose
a modification to the typical API to declare transactions, which allows transactions to
execute their logic over an abstract replica state, resulting in fewer conflicts between
transactions and thus improving the effectiveness of the speculative executions.
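One way to picture order-respecting speculation is the following sketch (my own construction, not the dissertation's protocol): transactions run speculatively while recording what they read, then commit in the agreed serialization order; a transaction whose reads were overwritten by an earlier-ordered transaction is re-executed before it commits, so the final state matches a sequential execution in the agreed order.

```python
# Illustrative sketch of speculation that respects an agreed serialization
# order, using per-key version counters to detect conflicts.

def make_counter_txn(key, delta):
    """A transaction that reads `key` and writes back key + delta."""
    def body(read, write):
        write(key, read(key) + delta)
    return body

def speculate(body, store, versions):
    """Run a transaction body, recording read versions and buffering writes."""
    read_vers, writes = {}, {}
    def read(k):
        read_vers[k] = versions.get(k, 0)
        return writes.get(k, store.get(k, 0))  # read-your-own-writes
    def write(k, v):
        writes[k] = v
    body(read, write)
    return read_vers, writes

def commit_in_order(store, versions, bodies):
    """Commit speculative results following the agreed serialization order."""
    for body in bodies:  # bodies listed in the agreed order
        read_vers, writes = speculate(body, store, versions)
        # Conflict check: did an earlier-ordered commit overwrite our reads?
        if any(versions.get(k, 0) != v for k, v in read_vers.items()):
            read_vers, writes = speculate(body, store, versions)  # re-execute
        for k, v in writes.items():
            store[k] = v
            versions[k] = versions.get(k, 0) + 1

store, versions = {"x": 0}, {}
commit_in_order(store, versions,
                [make_counter_txn("x", 1), make_counter_txn("x", 2)])
assert store["x"] == 3  # equivalent to sequential execution in agreed order
```

The key property is that the conflict check is anchored to the agreed order, not to arbitrary commit timing, so speculation never changes the externally visible serialization.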
An experimental evaluation shows that the contributions in this dissertation can
improve the performance of a state-machine-replicated server by up to 4×, reaching up to 75% of the performance of a concurrent fault-prone server.
Operating System Support for Redundant Multithreading
Failing hardware is a fact and trends in microprocessor design indicate that the fraction of hardware suffering from permanent and transient faults will continue to increase in future chip generations. Researchers proposed various solutions to this issue with different downsides: Specialized hardware components make hardware more expensive in production and consume additional energy at runtime. Fault-tolerant algorithms and libraries enforce specific programming models on the developer. Compiler-based fault tolerance requires the source code for all applications to be available for recompilation. In this thesis I present ASTEROID, an operating system architecture that integrates applications with different reliability needs.
ASTEROID is built on top of the L4/Fiasco.OC microkernel and extends the system with Romain, an operating system service that transparently replicates user applications. Romain supports single- and multi-threaded applications without requiring access to the application's source code. Romain replicates applications and their resources completely and thereby does not rely on hardware extensions, such as ECC-protected memory. In my thesis I describe how to efficiently implement replication as a form of redundant multithreading in software. I develop mechanisms to manage replica resources and to make multi-threaded programs behave deterministically for replication.
I furthermore present an approach to handle applications that use shared-memory channels with other programs. My evaluation shows that Romain provides 100% error detection and more than 99.6% error correction for single-bit flips in memory and general-purpose registers. At the same time, Romain's execution time overhead is below 14% for single-threaded applications running in triple-modular redundant mode. The last part of my thesis acknowledges that software-implemented fault tolerance methods often rely on the correct functioning of a certain set of hardware and software components, the Reliable Computing Base (RCB).
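The correction figures above rest on majority voting across replicas. The following generic sketch (illustrative, not Romain's actual implementation) shows the voting step of triple-modular redundancy: three replicas run the same computation, and a single faulty result is outvoted by the other two.

```python
# Majority vote behind triple-modular redundancy: with three replicas,
# any single corrupted result is both detected and corrected.

from collections import Counter

def tmr_vote(results):
    """Return the majority result among replica outputs.

    Raises if no majority exists (an uncorrectable divergence)."""
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("replicas diverged with no majority")
    return value

assert tmr_vote([42, 42, 42]) == 42   # fault-free run
assert tmr_vote([42, 7, 42]) == 42    # one bit-flipped replica outvoted
```

In a real redundant-multithreading system such comparisons happen at externalization points (e.g. system calls), where replica states are expected to agree.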
I introduce the concept of the RCB and discuss what constitutes the RCB of the ASTEROID system and of other fault tolerance mechanisms. Thereafter I present three case studies that evaluate approaches to protecting RCB components, aiming for a software stack that is fully protected against hardware errors.
Reliable and Efficient Multithreading
The advent of multicore architecture has increased the demand for multithreaded programs. It is notoriously far more challenging to write parallel programs correctly and efficiently than sequential ones because of the wide range of concurrency errors and performance problems. In this thesis, I developed a series of runtime systems and tools to combat concurrency errors and performance problems of multithreaded programs.
The first system, Dthreads, automatically ensures determinism for unmodified C/C++ applications using the pthreads library, without requiring programmer intervention or hardware support. Dthreads greatly simplifies the understanding and debugging of multithreaded programs, and it often matches or even exceeds the performance of standard thread libraries, making deterministic multithreading a practical alternative for the first time.
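Deterministic multithreading can be illustrated with a small sketch (my own simplification, not Dthreads' actual algorithm): threads enter their critical sections in a fixed logical order, so the interleaving, and hence the result, is the same on every run regardless of OS scheduling.

```python
# Toy deterministic scheduler: threads take turns in a fixed order,
# so the observable interleaving is reproducible across runs.

import threading

class DeterministicTurns:
    def __init__(self, num_threads):
        self.turn = 0
        self.num = num_threads
        self.cond = threading.Condition()

    def critical(self, my_id, fn):
        with self.cond:
            # Block until it is deterministically this thread's turn.
            while self.turn % self.num != my_id:
                self.cond.wait()
            fn()                      # critical section runs in fixed order
            self.turn += 1
            self.cond.notify_all()

turns = DeterministicTurns(3)
log = []
threads = [threading.Thread(target=turns.critical,
                            args=(i, lambda i=i: log.append(i)))
           for i in range(3)]
for t in threads: t.start()
for t in threads: t.join()
assert log == [0, 1, 2]   # identical on every run
```

Real systems like Dthreads apply this idea at synchronization points rather than serializing all work, which is why they can still match native performance.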
The second system attacks one notorious performance problem of multithreaded programs: false sharing. We provide the first accurate and precise detection tool, Sheriff-Detect, which can pinpoint the names of global variables or the allocation contexts of heap objects involved in false sharing, without false positives. However, rewriting a program to fix false sharing can be infeasible when source code is unavailable, or undesirable when padding objects excessively increases memory consumption or further worsens runtime performance. To resolve this problem, we provide a runtime system, Sheriff-Protect, which automatically boosts the performance of programs that suffer from false sharing.
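False sharing arises when independently updated fields land on the same cache line, forcing cores to ping-pong that line; the classic manual fix is padding each field out to its own line. This sketch (assuming 64-byte cache lines, and unrelated to Sheriff's actual mechanism) only checks the struct layouts; the slowdown itself is visible only on real hardware.

```python
# Layout view of false sharing: two hot counters on one cache line
# versus padded onto separate lines.

import ctypes

CACHE_LINE = 64  # assumed cache-line size in bytes

class Shared(ctypes.Structure):          # both counters on one line
    _fields_ = [("a", ctypes.c_long), ("b", ctypes.c_long)]

class Padded(ctypes.Structure):          # each counter on its own line
    _fields_ = [("a", ctypes.c_long),
                ("pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_long))),
                ("b", ctypes.c_long)]

same_line = Shared.a.offset // CACHE_LINE == Shared.b.offset // CACHE_LINE
diff_line = Padded.a.offset // CACHE_LINE != Padded.b.offset // CACHE_LINE
assert same_line and diff_line
```

Sheriff-Protect's appeal is precisely that it avoids such manual padding, which trades memory for speed and requires source access.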
The third system, Predator, improves the effectiveness of false sharing detection. It can detect an additional type of false sharing, read-write false sharing, and it can even detect false sharing problems that have not yet manifested, thus overcoming a shortcoming of all existing tools: they can only detect false sharing that is actually observed.
Extracting Safe Thread Schedules from Incomplete Model Checking Results
Model checkers frequently fail to completely verify a concurrent program, even if partial-order reduction is applied. The verification engineer is left in doubt whether the program is safe, and the effort spent on verifying the program is wasted. We present a technique that uses the results of such incomplete verification attempts to construct a (fair) scheduler that allows the safe execution of the partially verified concurrent program. This scheduler restricts the execution to schedules that have been proven safe (and prevents executions that were found to be erroneous). We evaluate the performance of our technique and show how it can be improved using partial-order reduction. While constraining the scheduler results in a considerable performance penalty in general, we show that in some cases our approach, somewhat surprisingly, even leads to faster executions.
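A scheduler restricted to verified schedules can be pictured as follows (an illustrative construction of mine, not the paper's): the set of safe schedules is kept as a prefix tree, and at each step only threads whose next step keeps the execution inside the tree may be scheduled.

```python
# Prefix-tree view of a "safe schedules only" scheduler: an execution is
# permitted exactly when it stays on a branch of the verified-schedule tree.

def build_prefix_tree(safe_schedules):
    root = {}
    for sched in safe_schedules:
        node = root
        for tid in sched:
            node = node.setdefault(tid, {})
    return root

def allowed_next(tree_node):
    """Thread ids the scheduler may pick without leaving the verified set."""
    return set(tree_node)

safe = [("t1", "t2", "t1"), ("t2", "t1", "t1")]
root = build_prefix_tree(safe)
assert allowed_next(root) == {"t1", "t2"}
assert allowed_next(root["t1"]) == {"t2"}   # after t1, only t2 stays safe
```

Restricting choices this way is what produces the performance penalty mentioned above, but it can also eliminate contention-heavy interleavings, which is one plausible reason some executions get faster.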
Stable Multithreading: A New Paradigm for Reliable and Secure Threads
Multithreaded programs have become pervasive and critical due to the rise of multicore hardware and accelerating computational demand. Unfortunately, despite decades of research and engineering effort, these programs remain notoriously difficult to get right, and they are plagued with harmful concurrency bugs that can cause wrong outputs, program crashes, security breaches, and so on. Our research reveals that a root cause of this difficulty is that multithreaded programs have too many possible thread interleavings (or schedules) at runtime. Even given only a single input, a program may run into a great number of schedules, depending on factors such as hardware timing and OS scheduling. Considering all inputs, the number of schedules is greater still. It is extremely challenging to understand, test, analyze, or verify this huge number of schedules for a multithreaded program and make sure that all of them are free of concurrency bugs. Thus, multithreaded programs are extremely difficult to get right.
To reduce the number of possible schedules for all inputs, we looked into the relation between inputs and schedules of real-world programs, and made an exciting discovery: many programs need only a small set of schedules to efficiently process a wide range of inputs! Leveraging this discovery, we have proposed a new idea called Stable Multithreading (or StableMT) that reuses each schedule on a wide range of inputs, greatly reducing the number of possible schedules for all inputs. By addressing the root cause that makes multithreading difficult to get right, StableMT makes understanding, testing, analyzing, and verifying multithreaded programs much easier. To realize StableMT, we have built three StableMT systems, TERN, PEREGRINE, and PARROT, each addressing a distinct research challenge. Evaluation of our latest StableMT system, PARROT, on 108 popular multithreaded programs shows that StableMT is simple, fast, and deployable. All of PARROT's source code, the entire benchmark suite, and raw evaluation results are available at http://github.com/columbia/smt-mc.
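The core StableMT idea of reusing one schedule across many inputs can be sketched as schedule memoization (a toy illustration of mine; the classifier and names are hypothetical, not the systems' real machinery): the first input in a class records a schedule, and every later input in that class replays it.

```python
# Toy schedule memoization: many inputs map to one memoized schedule,
# shrinking the set of schedules that ever runs.

schedule_cache = {}

def input_class(n_items):
    # Hypothetical classifier: all "small" inputs share one schedule.
    return "small" if n_items <= 1000 else "large"

def get_schedule(n_items, compute_schedule):
    cls = input_class(n_items)
    if cls not in schedule_cache:
        schedule_cache[cls] = compute_schedule(n_items)  # first run: record
    return schedule_cache[cls]                           # later runs: reuse

s1 = get_schedule(10,  lambda n: ("t1", "t2"))
s2 = get_schedule(999, lambda n: ("t2", "t1"))  # reuses the memoized schedule
assert s1 is s2 and s1 == ("t1", "t2")
```

In the real systems the "class" is derived from input constraints computed by program analysis, and the replayed schedule constrains synchronization order rather than being a simple tuple.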
To encourage deployment, we have applied StableMT to improve several reliability techniques, including: (1) making it much easier to reproduce real-world concurrency bugs; (2) greatly improving the precision of static program analysis, leading to the detection of several new harmful data races in heavily tested programs; and (3) greatly increasing the coverage of model checking, a systematic testing technique, by many orders of magnitude. StableMT has attracted the research community's interest, and some techniques and ideas in our StableMT systems have been leveraged by other researchers to compute a small set of schedules that covers all or most inputs for multithreaded programs.