56,973 research outputs found

    Fault-tolerant software: dependability/performance trade-offs, concurrency and system support

    Get PDF
    PhD ThesisAs the use of computer systems becomes more and more widespread in applications that demand high levels of dependability, these applications themselves are growing in complexity in a rapid rate, especially in the areas that require concurrent and distributed computing. Such complex systems are very prone to faults and errors. No matter how rigorously fault avoidance and fault removal techniques are applied, software design faults often remain in systems when they are delivered to the customers. In fact, residual software faults are becoming the significant underlying cause of system failures and the lack of dependability. There is tremendous need for systematic techniques for building dependable software, including the fault tolerance techniques that ensure software-based systems to operate dependably even when potential faults are present. However, although there has been a large amount of research in the area of fault-tolerant software, existing techniques are not yet sufficiently mature as a practical engineering discipline for realistic applications. In particular, they are often inadequate when applied to highly concurrent and distributed software. This thesis develops new techniques for building fault-tolerant software, addresses the problem of achieving high levels of dependability in concurrent and distributed object systems, and studies system-level support for implementing dependable software. Two schemes are developed - the t/(n-l)-VP approach is aimed at increasing software reliability and controlling additional complexity, while the SCOP approach presents an adaptive way of dynamically adjusting software reliability and efficiency aspects. As a more general framework for constructing dependable concurrent and distributed software, the Coordinated Atomic (CA) Action scheme is examined thoroughly. Key properties of CA actions are formalized, conceptual model and mechanisms for handling application level exceptions are devised, and object-based diversity techniques are introduced to cope with potential software faults. These three schemes are evaluated analytically and validated by controlled experiments. System-level support is also addressed with a multi-level system architecture. An architectural pattern for implementing fault-tolerant objects is documented in detail to capture existing solutions and our previous experience. An industrial safety-critical application, the Fault-Tolerant Production Cell, is used as a case study to examine most of the concepts and techniques developed in this research.ESPRIT

    System support for object replication in distributed systems

    Get PDF
    Distributed systems are composed of a collection of cooperating but failure prone system components. The number of components in such systems is often large and, despite low probabilities of any particular component failing, the likelihood that there will be at least a small number of failures within the system at a given time is high. Therefore, distributed systems must be able to withstand partial failures. By being resilient to partial failures, a distributed system becomes more able to offer a dependable service and therefore more useful. Replication is a well known technique used to mask partial failures and increase reliability in distributed computer systems. However, replication management requires sophisticated distributed control algorithms, and is therefore a labour intensive and error prone task. Furthermore, replication is in most cases employed due to applications' non-functional requirements for reliability, as dependability is generally an orthogonal issue to the problem domain of the application. If system level support for replication is provided, the application developer can devote more effort to application specific issues. Distributed systems are inherently more complex than centralised systems. Encapsulation and abstraction of components and services can be of paramount importance in managing their complexity. The use of object oriented techniques and languages, providing support for encapsulation and abstraction, has made development of distributed systems more manageable. In systems where applications are being developed using object-oriented techniques, system support mechanisms must recognise this, and provide support for the object-oriented approach. The architecture presented exploits object-oriented techniques to improve transparency and to reduce the application programmer involvement required to use the replication mechanisms. This dissertation describes an approach to implementing system support for object replication, which is distinct from other approaches such as replicated objects in that objects are not specially designed for replication. Additionally, object replication, in contrast to data replication, is a function-shipping approach and deals with the replication of both operations and data. Object replication is complicated by objects' encapsulation of local state and the arbitrary interaction patterns that may exist among objects. Although fully transparent object replication has not been achieved, my thesis is that partial system support for replication of program-level objects is practicable and assists the development of certain classes of reliable distributed applications. I demonstrate the usefulness of this approach by describing a prototype implementation and showing how it supports the development of an example toy application. To increase their flexibility, the system support mechanisms described are tailorable. The approach adopted in this work is to provide partial support for object replication, relying on some assistance from the application developer to supply application dependent functionality within particular collators for dealing with processing of results from object replicas. Care is taken to make the programming model as simple and concise as possible

    FRIENDS - A flexible architecture for implementing fault tolerant and secure distributed applications

    Get PDF
    FRIENDS is a software-based architecture for implementing fault-tolerant and, to some extent, secure applications. This architecture is composed of sub-systems and libraries of metaobjects. Transparency and separation of concerns is provided not only to the application programmer but also to the programmers implementing metaobjects for fault tolerance, secure communication and distribution. Common services required for implementing metaobjects are provided by the sub-systems. Metaobjects are implemented using object-oriented techniques and can be reused and customised according to the application needs, the operational environment and its related fault assumptions. Flexibility is increased by a recursive use of metaobjects. Examples and experiments are also described

    Replica determinism and flexible scheduling in hard real-time dependable systems

    Get PDF
    Fault-tolerant real-time systems are typically based on active replication where replicated entities are required to deliver their outputs in an identical order within a given time interval. Distributed scheduling of replicated tasks, however, violates this requirement if on-line scheduling, preemptive scheduling, or scheduling of dissimilar replicated task sets is employed. This problem of inconsistent task outputs has been solved previously by coordinating the decisions of the local schedulers such that replicated tasks are executed in an identical order. Global coordination results either in an extremely high communication effort to agree on each schedule decision or in an overly restrictive execution model where on-line scheduling, arbitrary preemptions, and nonidentically replicated task sets are not allowed. To overcome these restrictions, a new method, called timed messages, is introduced. Timed messages guarantee deterministic operation by presenting consistent message versions to the replicated tasks. This approach is based on simulated common knowledge and a sparse time base. Timed messages are very effective since they neither require communication between the local scheduler nor do they restrict usage of on-line flexible scheduling, preemptions and nonidentically replicated task sets

    Introduction to the special section on dependable network computing

    Get PDF
    Dependable network computing is becoming a key part of our daily economic and social life. Every day, millions of users and businesses are utilizing the Internet infrastructure for real-time electronic commerce transactions, scheduling important events, and building relationships. While network traffic and the number of users are rapidly growing, the mean-time between failures (MTTF) is surprisingly short; according to recent studies, in the majority of Internet backbone paths, the MTTF is 28 days. This leads to a strong requirement for highly dependable networks, servers, and software systems. The challenge is to build interconnected systems, based on available technology, that are inexpensive, accessible, scalable, and dependable. This special section provides insights into a number of these exciting challenges

    Adaptive service discovery on service-oriented and spontaneous sensor systems

    Get PDF
    Service-oriented architecture, Spontaneous networks, Self-organisation, Self-configuration, Sensor systems, Social patternsNatural and man-made disasters can significantly impact both people and environments. Enhanced effect can be achieved through dynamic networking of people, systems and procedures and seamless integration of them to fulfil mission objectives with service-oriented sensor systems. However, the benefits of integration of services will not be realised unless we have a dependable method to discover all required services in dynamic environments. In this paper, we propose an Adaptive and Efficient Peer-to-peer Search (AEPS) approach for dependable service integration on service-oriented architecture based on a number of social behaviour patterns. In the AEPS network, the networked nodes can autonomously support and co-operate with each other in a peer-to-peer (P2P) manner to quickly discover and self-configure any services available on the disaster area and deliver a real-time capability by self-organising themselves in spontaneous groups to provide higher flexibility and adaptability for disaster monitoring and relief
    • …
    corecore