1,113 research outputs found
Design and development of algorithms for fault tolerant distributed systems
PhD Thesis
This thesis describes the design and development of algorithms for fault
tolerant distributed systems. The development of such algorithms requires
making assumptions about the types of component faults for which tolerance
is to be provided. Such assumptions must be specified accurately. To
this end, this thesis develops a classification of faults in systems. This fault
classification identifies a range of fault types from the most restricted to the
least restricted. For each fault type, an algorithm for reaching distributed
agreement in the presence of a bounded number of faulty processors is
developed, and thus a family of agreement algorithms is presented. The
influence of the various fault types on the complexities of these algorithms
is discussed. Early stopping algorithms are also developed for selected fault
types and the influence of fault types on the early stopping conditions of the
respective algorithms is analysed. The problem of evaluating the performance
of distributed replicated systems which will require agreement algorithms
is considered next. As a first step in the direction of meeting this
challenging task, a pipeline triple modular redundant system is considered
and analytical methods are derived to evaluate the performance of such a
system. Finally, the accuracy of these methods is examined using computer
simulations.
UK Science and Engineering Research Council (SERC),
DELTA-4 consortium of ESPRIT
Distributed eventual leader election in the crash-recovery and general omission failure models.
102 p.
Distributed applications are present in many aspects of everyday life; banking, healthcare and transportation are examples. These applications are built on top of distributed systems. Roughly speaking, a distributed system is composed of a set of processes that collaborate to achieve a common goal. When building such systems, designers have to cope with several issues, such as differing synchrony assumptions and the occurrence of failures. Distributed systems must ensure that the delivered service is trustworthy.
Agreement problems form a fundamental class of problems in distributed systems. All agreement problems follow the same pattern: all processes must agree on some common decision. Most agreement problems can be considered a particular instance of the Consensus problem and can therefore be solved by reduction to consensus. However, a fundamental impossibility result, known as FLP, states that in an asynchronous distributed system it is impossible to achieve consensus deterministically when at least one process may fail. One way to circumvent this obstacle is to use unreliable failure detectors. A failure detector encapsulates the synchrony assumptions of the system and provides (possibly incorrect) information about process failures. A particular failure detector, called Omega, has been shown to be the weakest failure detector for solving consensus with a majority of correct processes. Informally, Omega provides an eventual leader election mechanism
Distributed algorithms for hard real-time systems
viii + 124 pp.; 24c
Replication and fault-tolerance in real-time systems
PhD Thesis
The increased availability of sophisticated computer hardware and the corresponding
decrease in its cost have led to a widespread growth in the use of computer systems for real-time
plant and process control applications. Such applications typically place very high
demands upon computer control systems and the development of appropriate control
software for these application areas can present a number of problems not normally
encountered in other applications.
First of all, real-time applications must be correct in the time domain as well as the value
domain: returning results which are not only correct but also delivered on time. Further,
since the potential for catastrophic failures can be high in a process or plant control
environment, many real-time applications also have to meet high reliability requirements.
These requirements will typically be met by means of a combination of fault avoidance and
fault tolerance techniques.
This thesis is intended to address some of the problems encountered in the provision of fault
tolerance in real-time application programs. Specifically, it considers the use of replication
to ensure the availability of services in real-time systems. In a real-time environment,
providing support for replicated services can introduce a number of problems. In particular,
the scope for non-deterministic behaviour in real-time applications can be quite large and
this can lead to difficulties in maintaining consistent internal states across the members of a
replica group. To tackle this problem, a model is proposed for fault tolerant real-time
objects which not only allows such objects to perform application specific recovery
operations and real-time processing activities such as event handling, but which also allows
objects to be replicated. The architectural support required for such replicated objects is
also discussed and, to conclude, the run-time overheads associated with the use of such
replicated services are considered.
The Science and Engineering Research Council
Fault injection testing of software implemented fault tolerance mechanisms of distributed systems
PhD Thesis
One way of gaining confidence in the adequacy of fault tolerance mechanisms of a
system is to test the system by injecting faults and observing how the system performs under
faulty conditions. This thesis investigates the issues of testing software-implemented
fault tolerance mechanisms of distributed systems through fault injection.
A fault injection method has been developed. The method requires that the target
software system be structured as a collection of objects interacting via messages. This
enables easy insertion of fault injection objects into the target system to emulate
incorrect behaviour of faulty processors by manipulating messages. This approach
allows one to inject specific classes of faults while not requiring any significant changes
to the target system. The method differs from the previous work in that it exploits an
object oriented approach of software implementation to support the injection of specific
classes of faults at the system level.
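
As a rough illustration of the idea described above, the sketch below places a fault injection object between a sender and a receiver so that it can drop or alter messages in transit. It is not the thesis's implementation: the names `FaultInjector`, `omission`, and `corrupt_value`, and the dictionary message format, are assumptions made for this example.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, object]
Handler = Callable[[Message], None]
Fault = Callable[[Message], Optional[Message]]


class FaultInjector:
    """Hypothetical interposer: sits on a message path and applies a fault."""

    def __init__(self, deliver: Handler, fault: Fault) -> None:
        self.deliver = deliver  # the real receiver's message handler
        self.fault = fault      # function emulating one class of fault

    def send(self, msg: Message) -> None:
        # Work on a copy so the sender's own message is left unchanged.
        mutated = self.fault(dict(msg))
        if mutated is not None:  # None models a lost (omitted) message
            self.deliver(mutated)


def omission(msg: Message) -> Optional[Message]:
    """Emulate an omission fault: the message never arrives."""
    return None


def corrupt_value(msg: Message) -> Optional[Message]:
    """Emulate a value fault: the payload is replaced with a wrong value."""
    msg["value"] = "corrupted"
    return msg


received: List[Message] = []
FaultInjector(received.append, corrupt_value).send({"seq": 1, "value": 42})
FaultInjector(received.append, omission).send({"seq": 2, "value": 43})
print(received)
```

Because the injector exposes the same `send` interface as a direct connection would, the sender and receiver in this sketch need no changes, which mirrors the property claimed for the method.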
The proposed fault injection method has been applied to test software-implemented
reliable node systems: a TMR (triple modular redundant) node and a fail-silent node.
The nodes have integrated fault tolerance mechanisms and are expected to exhibit
certain behaviour in the presence of a failure. The thesis describes how various such
mechanisms (for example, clock synchronisation protocol, and atomic broadcast
protocol) were tested. The testing revealed flaws in implementation that had not been
discovered before, thereby demonstrating the usefulness of the method. Application of
the approach to other distributed systems is also described in the thesis.
CEC ESPRIT programme,
UK Engineering and Physical Sciences Research Council (EPSRC)
Constructing fail-controlled nodes for distributed systems: a software approach
PhD Thesis
Designing and implementing distributed systems which continue to provide specified services
in the presence of processing site and communication failures is a difficult task. To facilitate
their development, distributed systems have been built assuming that their underlying hardware
components are fail-controlled, i.e. present a well defined failure mode. However, if conventional
hardware cannot provide the assumed failure mode, there is a need to build processing sites
or nodes, and communication infra-structure that present the fail-controlled behaviour assumed.
Coupling a number of redundant processors within a replicated node is a well known way
of constructing fail-controlled nodes. Computation is replicated and executed simultaneously at
each processor, and by applying suitable validation techniques to the outputs generated by processors
(e.g. majority voting, comparison), outputs from faulty processors can be prevented from
appearing at the application level.
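
The two validation techniques named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's protocol: the function names `majority_vote` and `compare_pair` are hypothetical, replica outputs are assumed to be hashable values, and `None` is used to mean "no output released", so `None` cannot itself be a legitimate output here.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of replicas, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority needs more than half of the replicas to agree.
    return value if count > len(outputs) // 2 else None


def compare_pair(a: Hashable, b: Hashable) -> Optional[Hashable]:
    """Comparison for a two-processor node: release output only on agreement."""
    return a if a == b else None


# Three replicas, one faulty: the faulty output is masked.
print(majority_vote([7, 7, 9]))
# Two replicas that disagree: nothing is released, so the node stays silent.
print(compare_pair(4, 5))
```

With three replicas, voting masks a single faulty output; with two, comparison can only detect a disagreement and suppress the output.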
One way of constructing replicated nodes is by introducing hardwired mechanisms to
couple replicated processors with specialised validation hardware circuits. Processors are tightly
synchronised at the clock cycle level, and have their outputs validated by a reliable validation
hardware. Another approach is to use software mechanisms to perform synchronisation of processors
and validation of the outputs. The main advantage of hardware based nodes is the minimum
performance overhead incurred. However, the introduction of special circuits may increase
the complexity of the design tremendously. Further, every new microprocessor architecture requires
considerable redesign overhead. Software based nodes do not present these problems; on
the other hand, they introduce much bigger performance overheads to the system.
In this thesis we investigate alternative ways of constructing efficient fail-controlled, software
based replicated nodes. In particular, we present much more efficient order protocols, which
are necessary for the implementation of these nodes. Our protocols, unlike others published to
date, do not require processors' physical clocks to be explicitly synchronised. The main contribution
of this thesis is the precise definition of the semantics of a software based fail-silent node,
along with its efficient design, implementation and performance evaluation.
The Brazilian National Research Council (CNPq/Brasil)
Proactive resilience
Doctoral thesis in Informatics (Computer Science), presented to the Universidade de Lisboa through the Faculdade de Ciências, 2007
Available in the document
- …