Search CORE

916 research outputs found

FADI: a fault-tolerant environment for open distributed computing

Author: A. Bargiela
BARGIELA
ELNOZAHY
JANAKIRAMAN
LAMOTTE
OSMAN
OSMAN
PLANK
POWELL
RIBERIO
SENS
SILVA
T. Osman
Publication venue: 'Institution of Engineering and Technology (IET)'
Publication date: 01/01/2000
Field of study

FADI is a complete programming environment that serves the reliable execution of distributed application programs. FADI encompasses all aspects of modern fault-tolerant distributed computing. The built-in user-transparent error detection mechanism covers processor node crashes and hardware transient failures. The mechanism also integrates user-assisted error checks into the system failure model. The nucleus non-blocking checkpointing mechanism combined with a novel selective message logging technique delivers an efficient, low-overhead backup and recovery mechanism for distributed processes. FADI also provides means for remote automatic process allocation on the distributed system nodes

Crossref

Nottingham Trent Institutional Repository (IRep)

Recommended from our members

Computing infrastructure issues in distributed communications systems : a survey of operating system transport system architectures

Author: Schmidt Douglas C.
Suda Tatsuya
Publication venue: eScholarship, University of California
Publication date: 01/01/1992
Field of study

The performance of distributed applications (such as file transfer, remote login, tele-conferencing, full-motion video, and scientific visualization) is influenced by several factors that interact in complex ways. In particular, application performance is significantly affected both by communication infrastructure factors and computing infrastructure factors. Several communication infrastructure factors include channel speed, bit-error rate, and congestion at intermediate switching nodes. Computing infrastructure factors include (among other things) both protocol processing activities (such as connection management, flow control, error detection, and retransmission) and general operating system factors (such as memory latency, CPU speed, interrupt and context switching overhead, process architecture, and message buffering). Due to a several orders of magnitude increase in network channel speed and an increase in application diversity, performance bottlenecks are shifting from the network factors to the transport system factors.This paper defines an abstraction called an "Operating System Transport System Architecture" (OSTSA) that is used to classify the major components and services in the computing infrastructure. End-to-end network protocols such as TCP, TP4, VMTP, XTP, and Delta-t typically run on general-purpose computers, where they utilize various operating system resources such as processors, virtual memory, and network controllers. The OSTSA provides services that integrate these resources to support distributed applications running on local and wide area networks.A taxonomy is presented to evaluate OSTSAs in terms of their support for protocol processing activities. We use this taxonomy to compare and contrast five general-purpose commercial and experimental operating systems including System V UNIX, BSD UNIX, the x-kernel, Choices, and Xinu

eScholarship - University of California

A development framework for artificial intelligence based distributed operations support systems

Author: Adler Richard M.
Cottman Bruce H.
Publication venue
Publication date
Field of study

Advanced automation is required to reduce costly human operations support requirements for complex space-based and ground control systems. Existing knowledge based technologies have been used successfully to automate individual operations tasks. Considerably less progress has been made in integrating and coordinating multiple operations applications for unified intelligent support systems. To fill this gap, SOCIAL, a tool set for developing Distributed Artificial Intelligence (DAI) systems is being constructed. SOCIAL consists of three primary language based components defining: models of interprocess communication across heterogeneous platforms; models for interprocess coordination, concurrency control, and fault management; and for accessing heterogeneous information resources. DAI applications subsystems, either new or existing, will access these distributed services non-intrusively, via high-level message-based protocols. SOCIAL will reduce the complexity of distributed communications, control, and integration, enabling developers to concentrate on the design and functionality of the target DAI system itself

NASA Technical Reports Server

An occam Style Communications System for UNIX Networks

Author: Vella Kevin J.
Publication venue: University of Kent, Computing Laboratory
Publication date: 01/12/1995
Field of study

This document describes the design of a communications system which provides occam style communications primitives under a Unix environment, using TCP/IP protocols, and any number of other protocols deemed suitable as underlying transport layers. The system will integrate with a low overhead scheduler/kernel without incurring significant costs to the execution of processes within the run time environment. A survey of relevant occam and occam3 features and related research is followed by a look at the Unix and TCP/IP facilities which determine our working constraints, and a description of the T9000 transputer's Virtual Channel Processor, which was instrumental in our formulation. Drawing from the information presented here, a design for the communications system is subsequently proposed. Finally, a preliminary investigation of methods for lightweight access control to shared resources in an environment which does not provide support for critical sections, semaphores, or busy waiting, is made. This is presented with relevance to mutual exclusion problems which arise within the proposed design. Future directions for the evolution of this project are discussed in conclusion

Kent Academic Repository

Practical Distributed Control Synthesis

Author: Chao Wang
Doron Peled
Fang Yu
Sven Schewe
Publication venue: 'Open Publishing Association'
Publication date: 01/11/2011
Field of study

Classic distributed control problems have an interesting dichotomy: they are either trivial or undecidable. If we allow the controllers to fully synchronize, then synthesis is trivial. In this case, controllers can effectively act as a single controller with complete information, resulting in a trivial control problem. But when we eliminate communication and restrict the supervisors to locally available information, the problem becomes undecidable. In this paper we argue in favor of a middle way. Communication is, in most applications, expensive, and should hence be minimized. We therefore study a solution that tries to communicate only scarcely and, while allowing communication in order to make joint decision, favors local decisions over joint decisions that require communication.Comment: In Proceedings INFINITY 2011, arXiv:1111.267

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

The Watchdog Task: Concurrent error detection using assertions

Author: Andrews D. M.
Ersoz A.
Mccluskey E. J.
Publication venue
Publication date
Field of study

The Watchdog Task, a software abstraction of the Watchdog-processor, is shown to be a powerful error detection tool with a great deal of flexibility and the advantages of watchdog techniques. A Watchdog Task system in Ada is presented; issues of recovery, latency, efficiency (communication) and preprocessing are discussed. Different applications, one of which is error detection on a single processor, are examined

NASA Technical Reports Server

Integrated analysis of error detection and recovery

Author: Lee Y. H.
Shin K. G.
Publication venue
Publication date
Field of study

An integrated modeling and analysis of error detection and recovery is presented. When fault latency and/or error latency exist, the system may suffer from multiple faults or error propagations which seriously deteriorate the fault-tolerant capability. Several detection models that enable analysis of the effect of detection mechanisms on the subsequent error handling operations and the overall system reliability were developed. Following detection of the faulty unit and reconfiguration of the system, the contaminated processes or tasks have to be recovered. The strategies of error recovery employed depend on the detection mechanisms and the available redundancy. Several recovery methods including the rollback recovery are considered. The recovery overhead is evaluated as an index of the capabilities of the detection and reconfiguration mechanisms

NASA Technical Reports Server