Doing-it-All with Bounded Work and Communication
We consider the Do-All problem, where cooperating processors need to
complete similar and independent tasks in an adversarial setting. Here we
deal with a synchronous message passing system with processors that are subject
to crash failures. Efficiency of algorithms in this setting is measured in
terms of work complexity (also known as total available processor steps) and
communication complexity (total number of point-to-point messages). When work
and communication are considered to be comparable resources, then the overall
efficiency is meaningfully expressed in terms of effort defined as work +
communication. We develop and analyze a constructive algorithm that has work
and a nonconstructive
algorithm that has work . The latter result is close to the
lower bound on work. The effort of each of
these algorithms is proportional to its work when the number of crashes is
bounded above by , for some positive constant . We also present a
nonconstructive algorithm that has effort
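The cost measures named in this abstract (work, communication, and effort = work + communication) can be illustrated with a toy tally. The sketch below is not any of the paper's algorithms. It assumes a naive baseline in which every live processor performs task `rnd` in round `rnd` and sends no messages, and the names `Tally`, `run_naive_do_all`, and `crash_round` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    work: int = 0      # total available processor steps
    messages: int = 0  # point-to-point messages sent

    @property
    def effort(self) -> int:
        # Effort as defined in the abstract: work + communication.
        return self.work + self.messages


def run_naive_do_all(n_tasks, n_procs, crash_round):
    """Toy synchronous baseline (an assumption, not the paper's method).

    In round rnd every live processor performs task rnd; no messages are sent.
    crash_round maps a processor id to the first round in which it is crashed.
    Processors absent from the map never crash.
    """
    tally = Tally()
    done = set()
    for rnd in range(n_tasks):
        live = [p for p in range(n_procs)
                if crash_round.get(p, n_tasks + 1) > rnd]
        if not live:
            break  # every processor has crashed; remaining tasks stay undone
        tally.work += len(live)  # each live processor spends one step
        done.add(rnd)
    return tally, done


tally, done = run_naive_do_all(n_tasks=4, n_procs=3, crash_round={0: 2})
print(tally.work, tally.messages, tally.effort)
```

With three processors, four tasks, and processor 0 crashing at round 2, this baseline spends 3 + 3 + 2 + 2 = 10 steps and no messages, so its effort is 10. The redundancy in that figure is what work-efficient Do-All algorithms aim to reduce, at the price of some communication.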
Distributed algorithms for hard real-time systems
viii + 124 pp.; 24 cm
Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication
This paper presents FT-GAIA, a software-based fault-tolerant parallel and
distributed simulation middleware. FT-GAIA has been designed to reliably
handle Parallel And Distributed Simulation (PADS) models, which are needed to
properly simulate and analyze complex systems arising in any kind of scientific
or engineering field. PADS takes advantage of multiple execution units running on
multicore processors, clusters of workstations, or HPC systems. However, large
computing systems, such as HPC systems that include hundreds of thousands of
computing nodes, have to handle frequent failures of some components. To cope
with this issue, FT-GAIA transparently replicates simulation entities and
distributes them on multiple execution nodes. This allows the simulation to
tolerate crash-failures of computing nodes. Moreover, FT-GAIA offers some
protection against Byzantine failures, since interaction messages among the
simulated entities are replicated as well, so that the receiving entity can
identify and discard corrupted messages. Results from an analytical model and
from an experimental evaluation show that FT-GAIA provides a high degree of
fault tolerance, at the cost of a moderate increase in the computational load
of the execution units. Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731
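One way a receiving entity could identify and discard corrupted copies of a replicated message is a strict-majority vote. The sketch below assumes that rule. The abstract does not state which filtering rule FT-GAIA uses, and the function name `accept_message` is hypothetical.

```python
from collections import Counter


def accept_message(copies):
    """Return the payload carried by a strict majority of copies, else None.

    `copies` holds the payloads received from the replicas of one sender.
    Majority voting is an assumption made for illustration only.
    """
    if not copies:
        return None
    payload, count = Counter(copies).most_common(1)[0]
    # Require more than half of the copies to agree before accepting.
    return payload if count > len(copies) // 2 else None


print(accept_message([b"move:3", b"move:3", b"move:9"]))
```

With three replicas, one corrupted copy is outvoted by the two matching ones. With only two replicas that disagree, no payload reaches a strict majority and the message is discarded.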
Fast Deterministic Consensus in a Noisy Environment
It is well known that the consensus problem cannot be solved
deterministically in an asynchronous environment, but that randomized solutions
are possible. We propose a new model, called noisy scheduling, in which an
adversarial schedule is perturbed randomly, and show that in this model
randomness in the environment can substitute for randomness in the algorithm.
In particular, we show that a simplified, deterministic version of Chandra's
wait-free shared-memory consensus algorithm (PODC, 1996, pp. 166-175) solves
consensus in time at most logarithmic in the number of active processes. The
proof of termination is based on showing that a race between independent
delayed renewal processes produces a winner quickly. In addition, we show that
the protocol finishes in constant time using quantum and priority-based
scheduling on a uniprocessor, suggesting that it is robust against the choice
of model over a wide range. Comment: Typographical errors fixed
A technique for evaluating the application of the pin-level stuck-at fault model to VLSI circuits
Accurate fault models are required to conduct the experiments defined in validation methodologies for highly reliable fault-tolerant computers (e.g., computers with a probability of failure of 10^-9 for a 10-hour mission). This work describes a technique by which a researcher can evaluate how well the pin-level stuck-at fault model simulates true error behavior in very large scale integrated (VLSI) digital circuits. The technique is based on a statistical comparison of the error behavior resulting from faults applied at the pin level of a VLSI circuit with that resulting from faults applied internally. As an example application, the error behavior of a microprocessor simulation subjected to internal stuck-at faults is compared with the error behavior that results from pin-level stuck-at faults. The error behavior is characterized by the time between errors and the duration of errors. Based on this example data, the pin-level stuck-at fault model is found to deliver less than ideal performance. However, for the class of faults that cause a system crash, the pin-level stuck-at fault model is found to provide a good modeling capability.
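The two error-behavior measures named in this abstract, time between errors and duration of errors, can be sketched on a toy trace. The code below is only an illustration, not the paper's technique. It models a pin as one bit of an integer bus value, measures time between errors from the start of one burst to the start of the next, and uses hypothetical names (`stuck_at`, `error_bursts`, `time_between_errors`).

```python
def stuck_at(value, bit, level):
    """Force one bit (a modelled pin) of `value` to 0 or 1."""
    mask = 1 << bit
    return value | mask if level else value & ~mask


def error_bursts(golden, observed):
    """Return (start, duration) for each run of consecutive mismatches."""
    bursts = []
    start = None
    n = min(len(golden), len(observed))
    for t in range(n):
        mismatch = golden[t] != observed[t]
        if mismatch and start is None:
            start = t                          # a burst begins
        elif not mismatch and start is not None:
            bursts.append((start, t - start))  # the burst ended at t - 1
            start = None
    if start is not None:
        bursts.append((start, n - start))      # burst runs to end of trace
    return bursts


def time_between_errors(bursts):
    """Gaps between starts of successive bursts (an assumed definition)."""
    return [b[0] - a[0] for a, b in zip(bursts, bursts[1:])]


golden = [0b0101, 0b0111, 0b0001, 0b0100, 0b0110]
observed = [stuck_at(v, bit=1, level=0) for v in golden]
bursts = error_bursts(golden, observed)
print(bursts, time_between_errors(bursts))
```

Collecting these two statistics once for pin-level faults and once for internal faults would give the pair of distributions that a statistical comparison needs.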
Adaptive control in rollforward recovery for extreme scale multigrid
With the increasing number of compute components, failures in future
exa-scale computer systems are expected to become more frequent. This motivates
the study of novel resilience techniques. Here, we extend a recently proposed
algorithm-based recovery method for multigrid iterations by introducing an
adaptive control. After a fault, the healthy part of the system continues the
iterative solution process, while the solution in the faulty domain is
re-constructed by an asynchronous on-line recovery. The computations in both
the faulty and healthy subdomains must be coordinated carefully; in
particular, both under- and over-solving must be avoided. Both of these waste
computational resources and will therefore increase the overall
time-to-solution. To control the local recovery and guarantee an optimal
re-coupling, we introduce a stopping criterion based on a mathematical error
estimator. It involves hierarchical weighted sums of residuals within the
context of uniformly refined meshes and is well-suited in the context of
parallel high-performance computing. The re-coupling process is steered by
local contributions of the error estimator. We propose and compare two criteria
which differ in their weights. Failure scenarios when solving up to
unknowns on more than 245,766 parallel processes will be
reported on a state-of-the-art peta-scale supercomputer demonstrating the
robustness of the method.
Critical fault patterns determination in fault-tolerant computer systems
The proposed method attempts to enumerate all the critical fault patterns (successive occurrences of failures) without analyzing every single possible fault. The conditions for the system to be operating in a given mode can be expressed in terms of the static states. Thus, one can find all the system states that correspond to a given critical mode of operation. The next step consists of analyzing the fault-detection mechanisms, the diagnosis algorithm, and the process of switch control. From them, one can find all the possible system configurations that can result from a failure occurrence. Thus, one can list all the characteristics, with respect to detection, diagnosis, and switch control, that failures must have to constitute critical fault patterns. Such an enumeration of the critical fault patterns can be used directly to evaluate the overall system tolerance to failures. Present research is focused on how to use these system-level characteristics efficiently to enumerate all the failures that satisfy them.
Algorithms For Extracting Timeliness Graphs
We consider asynchronous message-passing systems in which some links are
timely and processes may crash. Each run defines a timeliness graph among
correct processes: (p; q) is an edge of the timeliness graph if the link from p
to q is timely (that is, there is a bound on communication delays from p to q).
The main goal of this paper is to approximate this timeliness graph by graphs
having some properties (such as being trees, rings, ...). Given a family S of
graphs, in every run whose timeliness graph contains at least one graph in S,
an extraction algorithm must make each correct process converge to
the same graph in S that is, in a precise sense, an approximation of the
timeliness graph of the run. For example, if the timeliness graph contains a
ring, then using an extraction algorithm, all correct processes eventually
converge to the same ring and in this ring all nodes will be correct processes
and all links will be timely. We first present a general extraction algorithm
and then a more specific extraction algorithm that is communication efficient
(i.e., eventually all the messages of the extraction algorithm use only links
of the extracted graph).
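The definition of the timeliness graph, and what it means for that graph to contain a ring, can be illustrated on a finite snapshot. The sketch below is not the paper's extraction algorithm. It assumes a known delay bound and a known set of correct processes, neither of which a real process has, and the names `timeliness_graph` and `contains_ring` are hypothetical.

```python
def timeliness_graph(max_delay, correct, bound):
    """Directed edges (p, q) among correct processes with delay <= bound.

    max_delay maps a link (p, q) to the largest delay observed on it.
    Treating a fixed, known `bound` as the timeliness test is an assumption.
    """
    return {
        (p, q)
        for (p, q), delay in max_delay.items()
        if p in correct and q in correct and delay <= bound
    }


def contains_ring(edges, ring):
    """Check that every consecutive link of `ring` (wrapping around) is timely."""
    return all(
        (ring[i], ring[(i + 1) % len(ring)]) in edges
        for i in range(len(ring))
    )


max_delay = {
    ("a", "b"): 2, ("b", "c"): 3, ("c", "a"): 1,  # timely links
    ("a", "c"): 50,                               # too slow to count as timely
    ("c", "d"): 1,                                # d is not a correct process
}
edges = timeliness_graph(max_delay, correct={"a", "b", "c"}, bound=5)
print(contains_ring(edges, ["a", "b", "c"]), contains_ring(edges, ["a", "c", "b"]))
```

In this snapshot the ring a, b, c consists only of correct processes and timely links, which is the property the abstract requires of the ring that all correct processes eventually agree on.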