Dynamic FTSS in Asynchronous Systems: the Case of Unison

Authors: Dubois Swan; Potop-Butucaru Maria; Tixeuil Sébastien
Publication date: 10/02/2011

Distributed fault-tolerance can mask the effect of a limited number of permanent faults, while self-stabilization provides forward recovery after an arbitrary number of transient faults hit the system. FTSS protocols combine the best of both worlds since they are simultaneously fault-tolerant and self-stabilizing. To date, FTSS solutions either consider static (i.e., fixed-point) tasks or assume synchronous scheduling of the system components. In this paper, we present the first study of dynamic tasks in asynchronous systems, considering the unison problem as a benchmark. Unison can be seen as a local clock synchronization problem: neighbors must maintain digital clocks at most one time unit away from each other and increment their own clock value infinitely often. We present many impossibility results for this difficult problem and, when the problem is solvable, propose an FTSS solution that exhibits optimal fault containment.
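The unison condition described in this abstract can be illustrated with a small sketch. It only checks the safety predicate (neighboring clocks differ by at most one) and a simple local rule under which a tick preserves it. It is not the FTSS protocol from the paper, and the names `unison_safe` and `can_tick` are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]

def unison_safe(clocks: Dict[str, int], edges: List[Edge]) -> bool:
    """Return True when every pair of neighbors differs by at most one tick."""
    return all(abs(clocks[u] - clocks[v]) <= 1 for u, v in edges)

def can_tick(node: str, clocks: Dict[str, int], edges: List[Edge]) -> bool:
    """Illustrative rule (an assumption, not the paper's): a node may
    increment only if none of its neighbors is behind it."""
    neighbors = [v if u == node else u for u, v in edges if node in (u, v)]
    return all(clocks[n] >= clocks[node] for n in neighbors)

# A three-node path a - b - c with clocks that satisfy unison.
clocks = {"a": 3, "b": 4, "c": 4}
edges = [("a", "b"), ("b", "c")]

if can_tick("a", clocks, edges):
    clocks["a"] += 1  # safe: a catches up with b
```

Starting from a configuration that satisfies the predicate, a tick allowed by this rule keeps it satisfied, because every neighbor was either level with the node or one tick ahead of it.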

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Validation of a fault-tolerant clock synchronization system

Authors: Butler R. W.; Johnson S. C.

A validation method for the synchronization subsystem of a fault-tolerant computer system is investigated. The method combines formal design verification with experimental testing. The design proof reduces the correctness of the clock synchronization system to the correctness of a set of axioms, which are experimentally validated. Since the reliability requirements are often extreme, requiring the estimation of extremely large quantiles, an asymptotic approach to estimation in the tail of a distribution is employed.

NASA Technical Reports Server

Fault tolerant architectures for integrated aircraft electronics systems

Authors: Levitt K. N.; Melliar-Smith P. M.; Schwartz R. L.

Work on possible architectures for future flight control computer systems is described. Ada for Fault-Tolerant Systems, the NETS (Network Error-Tolerant System) architecture, and voting in asynchronous systems are covered.

NASA Technical Reports Server

Fault Tolerant Gradient Clock Synchronization

Authors: Bund Johannes; Lenzen Christoph; Rosenbaum Will
Publication date: 01/01/2019

Synchronizing clocks in distributed systems is well-understood, both in terms of fault-tolerance in fully connected systems and the dependence of local and global worst-case skews (i.e., maximum clock difference between neighbors and arbitrary pairs of nodes, respectively) on the diameter of fault-free systems. However, so far nothing non-trivial is known about the local skew that can be achieved in topologies that are not fully connected even under a single Byzantine fault. Put simply, in this work we show that the most powerful known techniques for fault-tolerant and gradient clock synchronization are compatible, in the sense that the best of both worlds can be achieved simultaneously. Concretely, we combine the Lynch-Welch algorithm [Welch1988] for synchronizing a clique of $n$ nodes despite up to $f < n/3$ Byzantine faults with the gradient clock synchronization (GCS) algorithm by Lenzen et al. [Lenzen2010] in order to render the latter resilient to faults. As this is not possible on general graphs, we augment an input graph $\mathcal{G}$ by replacing each node by $3f+1$ fully connected copies, which execute an instance of the Lynch-Welch algorithm. We then interpret these clusters as supernodes executing the GCS algorithm, where for each cluster its correct nodes' Lynch-Welch clocks provide estimates of the logical clock of the supernode in the GCS algorithm. By connecting clusters corresponding to neighbors in $\mathcal{G}$ in a fully bipartite manner, supernodes can inform each other about (estimates of) their logical clock values. This way, we achieve asymptotically optimal local skew, granted that no cluster contains more than $f$ faulty nodes, at factor $O(f)$ and $O(f^2)$ overheads in terms of nodes and edges, respectively. Note that tolerating $f$ faulty neighbors trivially requires degree larger than $f$, so this is asymptotically optimal as well.
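The graph augmentation in this abstract can be sketched as follows: each node becomes a clique of 3f+1 copies, and clusters of adjacent nodes are joined by a complete bipartite graph. This covers only the topology construction, not the Lynch-Welch or GCS algorithms. The function name `augment` and the `(node, index)` labeling of copies are assumptions made for illustration.

```python
from itertools import combinations, product

def augment(nodes, edges, f):
    """Replace each node by a clique of 3f+1 copies and connect the
    clusters of adjacent nodes in a fully bipartite manner."""
    k = 3 * f + 1
    copies = {v: [(v, i) for i in range(k)] for v in nodes}
    new_nodes = [c for v in nodes for c in copies[v]]
    new_edges = set()
    # Intra-cluster edges: every pair of copies of the same node.
    for v in nodes:
        for a, b in combinations(copies[v], 2):
            new_edges.add(frozenset((a, b)))
    # Inter-cluster edges: every copy of u to every copy of v.
    for u, v in edges:
        for a, b in product(copies[u], copies[v]):
            new_edges.add(frozenset((a, b)))
    return new_nodes, new_edges

# A single edge x - y with f = 1 yields two clusters of four copies each.
aug_nodes, aug_edges = augment(["x", "y"], [("x", "y")], f=1)
```

Each original edge turns into (3f+1)^2 edges and each original node into 3f+1 nodes, which matches the O(f^2) and O(f) overhead factors stated in the abstract.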

arXiv.org e-Print Archive

MPG.PuRe

Fault-Tolerant Adaptive Parallel and Distributed Simulation

Authors: Armaroli Lorenzo; D'Angelo Gabriele; Ferretti Stefano; Marzolla Moreno
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2016

Discrete Event Simulation is a widely used technique for modeling and analyzing complex systems in many fields of science and engineering. The increasingly large size of simulation models poses a serious computational challenge, since the time needed to run a simulation can be prohibitively large. For this reason, Parallel and Distributed Simulation techniques have been proposed to take advantage of the multiple execution units found in multicore processors, clusters of workstations, or HPC systems. The current generation of HPC systems includes hundreds of thousands of computing nodes and a vast amount of ancillary components. Despite improvements in manufacturing processes, failures of some components are frequent, and the situation will get worse as larger systems are built. In this paper we describe FT-GAIA, a software-based fault-tolerant extension of the GAIA/ARTÌS parallel simulation middleware. FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash failures of computing nodes; furthermore, FT-GAIA offers some protection against Byzantine failures since synchronization messages are replicated as well, so that the receiving entity can identify and discard corrupted messages. We provide an experimental evaluation of FT-GAIA on a running prototype. Results show that a high degree of fault tolerance can be achieved, at the cost of a moderate increase in the computational load of the execution units. Comment: Proceedings of the IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT 2016)

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Development and analysis of the Software Implemented Fault-Tolerance (SIFT) computer

Authors: Goldberg J.; Green M. W.; Kautz W. H.; Levitt K. N.; Melliar-Smith P. M.; Schwartz R. L.; Weinstock C. B.

SIFT (Software Implemented Fault Tolerance) is an experimental, fault-tolerant computer system designed to meet the extreme reliability requirements for safety-critical functions in advanced aircraft. Errors are masked by performing a majority voting operation over the results of identical computations, and faulty processors are removed from service by reassigning computations to the nonfaulty processors. This scheme has been implemented in a special architecture using a set of standard Bendix BDX930 processors, augmented by a special asynchronous-broadcast communication interface that provides direct, processor-to-processor communication among all processors. Fault isolation is accomplished in hardware; all other fault-tolerance functions, together with scheduling and synchronization, are implemented exclusively by executive system software. The system reliability is predicted by a Markov model. Mathematical consistency of the system software with respect to the reliability model has been partially verified, using recently developed tools for machine-aided proof of program correctness.
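The error-masking idea in this abstract, majority voting over the results of identical computations, can be sketched in a few lines. This is a generic illustration and not SIFT's executive software. The function name `majority_vote` and the processor labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple

def majority_vote(results: Dict[str, Hashable]) -> Tuple[Hashable, List[str]]:
    """Return the strict-majority value and the processors that disagree.

    `results` maps a processor name to the value it computed. A ValueError
    is raised when no value is reported by more than half of the processors.
    """
    counts = Counter(results.values())
    value, votes = counts.most_common(1)[0]
    if votes * 2 <= len(results):
        raise ValueError("no strict majority among replicated results")
    suspects = sorted(p for p, r in results.items() if r != value)
    return value, suspects

# Three replicas of the same computation; p3 returns a corrupted result.
value, suspects = majority_vote({"p1": 42, "p2": 42, "p3": 7})
```

The list of disagreeing processors is what a reconfiguration step could use to decide which processors to remove from service; that step is not shown here.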

NASA Technical Reports Server

Rapid Recovery for Systems with Scarce Faults

Authors: Huang Chung-Hao; Peled Doron; Schewe Sven; Wang Farn
Publication venue: Open Publishing Association
Publication date: 01/10/2012

Our goal is to achieve a high degree of fault tolerance through the control of safety-critical systems. This reduces to solving a game between a malicious environment that injects failures and a controller who tries to establish a correct behavior. We suggest a new control objective for such systems that offers a better balance between complexity and precision: we seek systems that are k-resilient. In order to be k-resilient, a system needs to be able to rapidly recover from a small number, up to k, of local faults infinitely many times, provided that blocks of up to k faults are separated by short recovery periods in which no fault occurs. k-resilience is a simple but powerful abstraction from the precise distribution of local faults, but much more refined than the traditional objective of maximizing the number of local faults. We argue why we believe this to be the right level of abstraction for safety-critical systems when local faults are few and far between. We show that the computational complexity of constructing optimal control with respect to resilience is low and demonstrate the feasibility through an implementation and experimental results. Comment: In Proceedings GandALF 2012, arXiv:1210.202
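The environment assumption behind k-resilience (blocks of at most k faults, separated by fault-free recovery periods) can be illustrated with a small trace checker. This is only a sketch of that assumption, not the game-based synthesis from the paper. The explicit `recovery` length and the function name are assumptions, since the abstract does not fix how long a recovery period is.

```python
from typing import Iterable

def within_fault_budget(trace: Iterable[bool], k: int, recovery: int) -> bool:
    """Check a fault trace against the k-resilience environment assumption.

    `trace[i]` is True when a local fault occurs at step i. Faults belong
    to the same block until at least `recovery` consecutive fault-free
    steps have passed; every block may contain at most k faults.
    """
    faults_in_block = 0
    quiet_steps = 0
    for fault in trace:
        if fault:
            if quiet_steps >= recovery:
                faults_in_block = 0  # previous block was followed by a full recovery
            faults_in_block += 1
            quiet_steps = 0
            if faults_in_block > k:
                return False
        else:
            quiet_steps += 1
    return True

# Two faults, a recovery period of two quiet steps, then one more fault.
ok = within_fault_budget([True, True, False, False, True], k=2, recovery=2)
# The third fault arrives before the recovery period has elapsed.
too_dense = within_fault_budget([True, True, False, True], k=2, recovery=2)
```

A trace that passes this check is one on which a k-resilient controller, in the sense of the abstract, would be expected to keep recovering.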

arXiv.org e-Print Archive

Directory of Open Access Journals