7,185 research outputs found
Dynamic FTSS in Asynchronous Systems: the Case of Unison
Distributed fault-tolerance can mask the effect of a limited number of
permanent faults, while self-stabilization provides forward recovery after an
arbitrary number of transient fault hit the system. FTSS protocols combine the
best of both worlds since they are simultaneously fault-tolerant and
self-stabilizing. To date, FTSS solutions either consider static (i.e. fixed
point) tasks, or assume synchronous scheduling of the system components. In
this paper, we present the first study of dynamic tasks in asynchronous
systems, considering the unison problem as a benchmark. Unison can be seen as a
local clock synchronization problem as neighbors must maintain digital clocks
at most one time unit away from each other, and increment their own clock value
infinitely often. We present many impossibility results for this difficult
problem and propose a FTSS solution when the problem is solvable that exhibits
optimal fault containment
Validation of a fault-tolerant clock synchronization system
A validation method for the synchronization subsystem of a fault tolerant computer system is investigated. The method combines formal design verification with experimental testing. The design proof reduces the correctness of the clock synchronization system to the correctness of a set of axioms which are experimentally validated. Since the reliability requirements are often extreme, requiring the estimation of extremely large quantiles, an asymptotic approach to estimation in the tail of a distribution is employed
Fault tolerant architectures for integrated aircraft electronics systems
Work into possible architectures for future flight control computer systems is described. Ada for Fault-Tolerant Systems, the NETS Network Error-Tolerant System architecture, and voting in asynchronous systems are covered
Fault Tolerant Gradient Clock Synchronization
Synchronizing clocks in distributed systems is well-understood, both in terms
of fault-tolerance in fully connected systems and the dependence of local and
global worst-case skews (i.e., maximum clock difference between neighbors and
arbitrary pairs of nodes, respectively) on the diameter of fault-free systems.
However, so far nothing non-trivial is known about the local skew that can be
achieved in topologies that are not fully connected even under a single
Byzantine fault. Put simply, in this work we show that the most powerful known
techniques for fault-tolerant and gradient clock synchronization are
compatible, in the sense that the best of both worlds can be achieved
simultaneously.
Concretely, we combine the Lynch-Welch algorithm [Welch1988] for
synchronizing a clique of nodes despite up to Byzantine faults with
the gradient clock synchronization (GCS) algorithm by Lenzen et al.
[Lenzen2010] in order to render the latter resilient to faults. As this is not
possible on general graphs, we augment an input graph by
replacing each node by fully connected copies, which execute an instance
of the Lynch-Welch algorithm. We then interpret these clusters as supernodes
executing the GCS algorithm, where for each cluster its correct nodes'
Lynch-Welch clocks provide estimates of the logical clock of the supernode in
the GCS algorithm. By connecting clusters corresponding to neighbors in
in a fully bipartite manner, supernodes can inform each other
about (estimates of) their logical clock values. This way, we achieve
asymptotically optimal local skew, granted that no cluster contains more than
faulty nodes, at factor and overheads in terms of nodes and
edges, respectively. Note that tolerating faulty neighbors trivially
requires degree larger than , so this is asymptotically optimal as well
Fault-Tolerant Adaptive Parallel and Distributed Simulation
Discrete Event Simulation is a widely used technique that is used to model
and analyze complex systems in many fields of science and engineering. The
increasingly large size of simulation models poses a serious computational
challenge, since the time needed to run a simulation can be prohibitively
large. For this reason, Parallel and Distributes Simulation techniques have
been proposed to take advantage of multiple execution units which are found in
multicore processors, cluster of workstations or HPC systems. The current
generation of HPC systems includes hundreds of thousands of computing nodes and
a vast amount of ancillary components. Despite improvements in manufacturing
processes, failures of some components are frequent, and the situation will get
worse as larger systems are built. In this paper we describe FT-GAIA, a
software-based fault-tolerant extension of the GAIA/ART\`IS parallel simulation
middleware. FT-GAIA transparently replicates simulation entities and
distributes them on multiple execution nodes. This allows the simulation to
tolerate crash-failures of computing nodes; furthermore, FT-GAIA offers some
protection against byzantine failures since synchronization messages are
replicated as well, so that the receiving entity can identify and discard
corrupted messages. We provide an experimental evaluation of FT-GAIA on a
running prototype. Results show that a high degree of fault tolerance can be
achieved, at the cost of a moderate increase in the computational load of the
execution units.Comment: Proceedings of the IEEE/ACM International Symposium on Distributed
Simulation and Real Time Applications (DS-RT 2016
Development and analysis of the Software Implemented Fault-Tolerance (SIFT) computer
SIFT (Software Implemented Fault Tolerance) is an experimental, fault-tolerant computer system designed to meet the extreme reliability requirements for safety-critical functions in advanced aircraft. Errors are masked by performing a majority voting operation over the results of identical computations, and faulty processors are removed from service by reassigning computations to the nonfaulty processors. This scheme has been implemented in a special architecture using a set of standard Bendix BDX930 processors, augmented by a special asynchronous-broadcast communication interface that provides direct, processor to processor communication among all processors. Fault isolation is accomplished in hardware; all other fault-tolerance functions, together with scheduling and synchronization are implemented exclusively by executive system software. The system reliability is predicted by a Markov model. Mathematical consistency of the system software with respect to the reliability model has been partially verified, using recently developed tools for machine-aided proof of program correctness
Rapid Recovery for Systems with Scarce Faults
Our goal is to achieve a high degree of fault tolerance through the control
of a safety critical systems. This reduces to solving a game between a
malicious environment that injects failures and a controller who tries to
establish a correct behavior. We suggest a new control objective for such
systems that offers a better balance between complexity and precision: we seek
systems that are k-resilient. In order to be k-resilient, a system needs to be
able to rapidly recover from a small number, up to k, of local faults
infinitely many times, provided that blocks of up to k faults are separated by
short recovery periods in which no fault occurs. k-resilience is a simple but
powerful abstraction from the precise distribution of local faults, but much
more refined than the traditional objective to maximize the number of local
faults. We argue why we believe this to be the right level of abstraction for
safety critical systems when local faults are few and far between. We show that
the computational complexity of constructing optimal control with respect to
resilience is low and demonstrate the feasibility through an implementation and
experimental results.Comment: In Proceedings GandALF 2012, arXiv:1210.202
- …