5,645 research outputs found

    Maintaining consistency in distributed systems

    Get PDF
    In systems designed as assemblies of independently developed components, concurrent access to data or data structures normally arises within individual programs, and is controlled using mutual exclusion constructs, such as semaphores and monitors. Where data is persistent and/or sets of operation are related to one another, transactions or linearizability may be more appropriate. Systems that incorporate cooperative styles of distributed execution often replicate or distribute data within groups of components. In these cases, group oriented consistency properties must be maintained, and tools based on the virtual synchrony execution model greatly simplify the task confronting an application developer. All three styles of distributed computing are likely to be seen in future systems - often, within the same application. This leads us to propose an integrated approach that permits applications that use virtual synchrony with concurrent objects that respect a linearizability constraint, and vice versa. Transactional subsystems are treated as a special case of linearizability

    Implementing atomic actions in Ada 95

    Get PDF
    Atomic actions are an important dynamic structuring technique that aid the construction of fault-tolerant concurrent systems. Although they were developed some years ago, none of the well-known commercially-available programming languages directly support their use. This paper summarizes software fault tolerance techniques for concurrent systems, evaluates the Ada 95 programming language from the perspective of its support for software fault tolerance, and shows how Ada 95 can be used to implement software fault tolerance techniques. In particular, it shows how packages, protected objects, requeue, exceptions, asynchronous transfer of control, tagged types, and controlled types can be used as building blocks from which to construct atomic actions with forward and backward error recovery, which are resilient to deserter tasks and task abortion

    CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

    Get PDF
    In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance Computing (HPC) community. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most effective CR technique in terms of overhead efficiency but it takes a lot of implementation effort. This work presents the implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extendable library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be directly used out of the box. The library can be easily extended to add more data types. As means of overhead reduction, the library offers a build-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node level checkpointing. Second, CRAFT provides an easier interface for User-Level Failure Mitigation (ULFM) based dynamic process recovery, which significantly reduces the complexity and effort of failure detection and communication recovery mechanism. By utilizing both functionalities together, applications can write application-level checkpoints and recover dynamically from process failures with very limited programming effort. This work presents the design and use of our library in detail. The associated overheads are thoroughly analyzed using several benchmarks

    Distributed operating systems

    Get PDF
    In the past five years, distributed operating systems research has gone through a consolidation phase. On a large number of design issues there is now considerable consensus between different research groups.\ud \ud In this paper, an overview of recent research in distributed systems is given. In turn, the paper discusses overall system structure, protection issues, file system designs, problems and solutions for fault tolerance and a mechanism that is rapidly becoming very important for efficient distributed systems design: hints.\ud \ud An attempt was made to provide sufficient references to interesting research projects for the reader to find material for more detailed study

    The implementation and use of Ada on distributed systems with high reliability requirements

    Get PDF
    The use and implementation of Ada in distributed environments in which reliability is the primary concern were investigated. In particular, the concept that a distributed system may be programmed entirely in Ada so that the individual tasks of the system are unconcerned with which processors they are executing on, and that failures may occur in the software or underlying hardware was examined. Progress is discussed for the following areas: continued development and testing of the fault-tolerant Ada testbed; development of suggested changes to Ada so that it might more easily cope with the failure of interest; and design of new approaches to fault-tolerant software in real-time systems, and integration of these ideas into Ada

    A study of System Interface Sets (SIS) for the host, target and integration environments of the Space Station Program (SSP)

    Get PDF
    System interface sets (SIS) for large, complex, non-stop, distributed systems are examined. The SIS of the Space Station Program (SSP) was selected as the focus of this study because an appropriate virtual interface specification of the SIS is believed to have the most potential to free the project from four life cycle tyrannies which are rooted in a dependance on either a proprietary or particular instance of: operating systems, data management systems, communications systems, and instruction set architectures. The static perspective of the common Ada programming support environment interface set (CAIS) and the portable common execution environment (PCEE) activities are discussed. Also, the dynamic perspective of the PCEE is addressed

    EOS: A project to investigate the design and construction of real-time distributed embedded operating systems

    Get PDF
    The EOS project is investigating the design and construction of a family of real-time distributed embedded operating systems for reliable, distributed aerospace applications. Using the real-time programming techniques developed in co-operation with NASA in earlier research, the project staff is building a kernel for a multiple processor networked system. The first six months of the grant included a study of scheduling in an object-oriented system, the design philosophy of the kernel, and the architectural overview of the operating system. In this report, the operating system and kernel concepts are described. An environment for the experiments has been built and several of the key concepts of the system have been prototyped. The kernel and operating system is intended to support future experimental studies in multiprocessing, load-balancing, routing, software fault-tolerance, distributed data base design, and real-time processing
    corecore