916 research outputs found

    FADI: a fault-tolerant environment for open distributed computing

    Get PDF
    FADI is a complete programming environment that serves the reliable execution of distributed application programs. FADI encompasses all aspects of modern fault-tolerant distributed computing. The built-in user-transparent error detection mechanism covers processor node crashes and hardware transient failures. The mechanism also integrates user-assisted error checks into the system failure model. The nucleus non-blocking checkpointing mechanism combined with a novel selective message logging technique delivers an efficient, low-overhead backup and recovery mechanism for distributed processes. FADI also provides means for remote automatic process allocation on the distributed system nodes

    A development framework for artificial intelligence based distributed operations support systems

    Get PDF
    Advanced automation is required to reduce costly human operations support requirements for complex space-based and ground control systems. Existing knowledge based technologies have been used successfully to automate individual operations tasks. Considerably less progress has been made in integrating and coordinating multiple operations applications for unified intelligent support systems. To fill this gap, SOCIAL, a tool set for developing Distributed Artificial Intelligence (DAI) systems is being constructed. SOCIAL consists of three primary language based components defining: models of interprocess communication across heterogeneous platforms; models for interprocess coordination, concurrency control, and fault management; and for accessing heterogeneous information resources. DAI applications subsystems, either new or existing, will access these distributed services non-intrusively, via high-level message-based protocols. SOCIAL will reduce the complexity of distributed communications, control, and integration, enabling developers to concentrate on the design and functionality of the target DAI system itself

    An occam Style Communications System for UNIX Networks

    Get PDF
    This document describes the design of a communications system which provides occam style communications primitives under a Unix environment, using TCP/IP protocols, and any number of other protocols deemed suitable as underlying transport layers. The system will integrate with a low overhead scheduler/kernel without incurring significant costs to the execution of processes within the run time environment. A survey of relevant occam and occam3 features and related research is followed by a look at the Unix and TCP/IP facilities which determine our working constraints, and a description of the T9000 transputer's Virtual Channel Processor, which was instrumental in our formulation. Drawing from the information presented here, a design for the communications system is subsequently proposed. Finally, a preliminary investigation of methods for lightweight access control to shared resources in an environment which does not provide support for critical sections, semaphores, or busy waiting, is made. This is presented with relevance to mutual exclusion problems which arise within the proposed design. Future directions for the evolution of this project are discussed in conclusion

    Practical Distributed Control Synthesis

    Full text link
    Classic distributed control problems have an interesting dichotomy: they are either trivial or undecidable. If we allow the controllers to fully synchronize, then synthesis is trivial. In this case, controllers can effectively act as a single controller with complete information, resulting in a trivial control problem. But when we eliminate communication and restrict the supervisors to locally available information, the problem becomes undecidable. In this paper we argue in favor of a middle way. Communication is, in most applications, expensive, and should hence be minimized. We therefore study a solution that tries to communicate only scarcely and, while allowing communication in order to make joint decision, favors local decisions over joint decisions that require communication.Comment: In Proceedings INFINITY 2011, arXiv:1111.267

    The Watchdog Task: Concurrent error detection using assertions

    Get PDF
    The Watchdog Task, a software abstraction of the Watchdog-processor, is shown to be a powerful error detection tool with a great deal of flexibility and the advantages of watchdog techniques. A Watchdog Task system in Ada is presented; issues of recovery, latency, efficiency (communication) and preprocessing are discussed. Different applications, one of which is error detection on a single processor, are examined

    Integrated analysis of error detection and recovery

    Get PDF
    An integrated modeling and analysis of error detection and recovery is presented. When fault latency and/or error latency exist, the system may suffer from multiple faults or error propagations which seriously deteriorate the fault-tolerant capability. Several detection models that enable analysis of the effect of detection mechanisms on the subsequent error handling operations and the overall system reliability were developed. Following detection of the faulty unit and reconfiguration of the system, the contaminated processes or tasks have to be recovered. The strategies of error recovery employed depend on the detection mechanisms and the available redundancy. Several recovery methods including the rollback recovery are considered. The recovery overhead is evaluated as an index of the capabilities of the detection and reconfiguration mechanisms
    • …
    corecore