Fault-tolerant software: dependability/performance trade-offs, concurrency and system support
PhD Thesis
As the use of computer systems becomes more and more widespread in applications
that demand high levels of dependability, these applications themselves are growing in
complexity at a rapid rate, especially in areas that require concurrent and distributed
computing. Such complex systems are very prone to faults and errors. No matter how
rigorously fault avoidance and fault removal techniques are applied, software design
faults often remain in systems when they are delivered to the customers. In fact,
residual software faults are becoming a significant underlying cause of system
failures and the lack of dependability. There is a tremendous need for systematic
techniques for building dependable software, including fault tolerance techniques
that enable software-based systems to operate dependably even when potential faults
are present. However, although there has been a large amount of research in the area of
fault-tolerant software, existing techniques are not yet sufficiently mature as a practical
engineering discipline for realistic applications. In particular, they are often inadequate
when applied to highly concurrent and distributed software.
This thesis develops new techniques for building fault-tolerant software, addresses the
problem of achieving high levels of dependability in concurrent and distributed object
systems, and studies system-level support for implementing dependable software. Two
schemes are developed: the t/(n-1)-VP approach is aimed at increasing software
reliability and controlling additional complexity, while the SCOP approach presents an
adaptive way of dynamically adjusting software reliability and efficiency aspects. As a
more general framework for constructing dependable concurrent and distributed
software, the Coordinated Atomic (CA) Action scheme is examined thoroughly. Key
properties of CA actions are formalized, a conceptual model and mechanisms for
handling application level exceptions are devised, and object-based diversity
techniques are introduced to cope with potential software faults. These three schemes
are evaluated analytically and validated by controlled experiments. System-level
support is also addressed with a multi-level system architecture. An architectural
pattern for implementing fault-tolerant objects is documented in detail to capture
existing solutions and our previous experience. An industrial safety-critical application,
the Fault-Tolerant Production Cell, is used as a case study to examine most of the
concepts and techniques developed in this research.
ESPRIT
Implementing fault tolerant applications using reflective object-oriented programming
Abstract: Shows how reflection and object-oriented programming can be used to ease the implementation of classical fault tolerance mechanisms in distributed applications. When the underlying runtime system does not provide fault tolerance transparently, classical approaches to implementing fault tolerance mechanisms often imply mixing functional programming with non-functional programming (e.g. error processing mechanisms). The use of reflection improves the transparency of fault tolerance mechanisms to the programmer and more generally provides a clearer separation between functional and non-functional programming. The implementations of some classical replication techniques using a reflective approach are presented in detail and illustrated by several examples, which have been prototyped on a network of Unix workstations. Lessons learnt from our experiments are drawn and future work is discussed.
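The separation between functional and non-functional code described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: `ReplicatedProxy`, `Account` and `CrashingAccount` are hypothetical names, the replicas live in one process rather than on a network, and majority voting over results is only one assumed replication policy. The point is that `Account` contains no error-handling code; the proxy intercepts every method call reflectively and applies the replication logic.

```python
from collections import Counter


class ReplicatedProxy:
    """Hypothetical metaobject: intercepts method calls and runs them on every replica."""

    def __init__(self, replicas):
        if not replicas:
            raise ValueError("at least one replica is required")
        self._replicas = list(replicas)

    def __getattr__(self, name):
        # __getattr__ is only consulted for attributes the proxy itself lacks,
        # so every application-level method call passes through here.
        def invoke(*args, **kwargs):
            results = []
            for replica in self._replicas:
                try:
                    results.append(getattr(replica, name)(*args, **kwargs))
                except Exception:
                    # Assumed fault model: a failed replica is skipped.
                    continue
            if not results:
                raise RuntimeError(f"all replicas failed on {name!r}")
            # Return the most common result (results must be hashable).
            value, _count = Counter(results).most_common(1)[0]
            return value

        return invoke


class Account:
    """Purely functional application code: no fault tolerance logic here."""

    def __init__(self):
        self.balance = 0

    def deposit(self, amount):
        self.balance += amount
        return self.balance


class CrashingAccount(Account):
    """Hypothetical faulty replica used to exercise the proxy."""

    def deposit(self, amount):
        raise RuntimeError("replica crashed")


# The caller uses the proxy exactly as it would use a plain Account.
account = ReplicatedProxy([Account(), CrashingAccount(), Account()])
balance = account.deposit(10)
```

Swapping the voting policy for, say, primary-backup failover would only change `ReplicatedProxy`, leaving `Account` untouched.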
A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems
Supercomputing systems today often come in the form of large numbers of
commodity systems linked together into a computing cluster. These systems, like
any distributed system, can have large numbers of independent hardware
components cooperating or collaborating on a computation. Unfortunately, any of
this vast number of components can fail at any time, resulting in potentially
erroneous output. In order to improve the robustness of supercomputing
applications in the presence of failures, many techniques have been developed
to provide resilience to these kinds of system faults. This survey provides an
overview of these various fault-tolerance techniques.
Comment: 11 pages
FRIENDS - A flexible architecture for implementing fault tolerant and secure distributed applications
FRIENDS is a software-based architecture for implementing fault-tolerant and, to some extent, secure applications. This architecture is composed of sub-systems and libraries of metaobjects. Transparency and separation of concerns are provided not only to the application programmer but also to the programmers implementing metaobjects for fault tolerance, secure communication and distribution. Common services required for implementing metaobjects are provided by the sub-systems. Metaobjects are implemented using object-oriented techniques and can be reused and customised according to the application needs, the operational environment and its related fault assumptions. Flexibility is increased by a recursive use of metaobjects. Examples and experiments are also described.
A metaobject architecture for fault-tolerant distributed systems : the FRIENDS approach
The FRIENDS system developed at LAAS-CNRS is a metalevel architecture providing libraries of metaobjects for fault tolerance, secure communication, and group-based distributed applications. The use of metaobjects provides a nice separation of concerns between mechanisms and applications. Metaobjects can be used transparently by applications and can be composed according to the needs of a given application, a given architecture, and its underlying properties. In FRIENDS, metaobjects are used recursively to add new properties to applications. They are designed using an object-oriented design method and implemented on top of basic system services. This paper describes the FRIENDS software-based architecture, the object-oriented development of metaobjects, the experiments that we have done, and summarizes the advantages and drawbacks of a metaobject approach for building fault-tolerant systems.
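The recursive use of metaobjects mentioned above can be sketched as a chain of wrappers, each adding one property to whatever it wraps. This is an illustration only and does not reproduce the FRIENDS libraries or their interfaces: `Metaobject`, `RetryMeta`, `AuditMeta` and `FlakySensor` are hypothetical names, and retry and call logging stand in for real fault tolerance and communication mechanisms.

```python
class Metaobject:
    """Hypothetical base metaobject: forwards a call to the object it wraps."""

    def __init__(self, target):
        self._target = target

    def handle(self, method, args):
        # The target may itself be a metaobject, which is what makes
        # the composition recursive.
        if isinstance(self._target, Metaobject):
            return self._target.handle(method, args)
        return getattr(self._target, method)(*args)


class RetryMeta(Metaobject):
    """Adds a simple fault tolerance property: retry on any exception."""

    def __init__(self, target, attempts=3):
        if attempts < 1:
            raise ValueError("attempts must be at least 1")
        super().__init__(target)
        self._attempts = attempts

    def handle(self, method, args):
        last_error = None
        for _ in range(self._attempts):
            try:
                return super().handle(method, args)
            except Exception as exc:
                last_error = exc
        raise last_error


class AuditMeta(Metaobject):
    """Adds a second, independent property: a record of every call."""

    def __init__(self, target):
        super().__init__(target)
        self.log = []

    def handle(self, method, args):
        self.log.append((method, args))
        return super().handle(method, args)


class FlakySensor:
    """Application object that fails twice before succeeding."""

    def __init__(self):
        self.calls = 0

    def read(self, channel):
        self.calls += 1
        if self.calls < 3:
            raise IOError("transient failure")
        return channel * 2


# Compose two metaobjects around one application object.
sensor = FlakySensor()
audit = AuditMeta(RetryMeta(sensor, attempts=3))
reading = audit.handle("read", (21,))
```

Because each metaobject only knows about the object directly beneath it, properties can be added, removed or reordered per application without touching `FlakySensor`.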