2,550 research outputs found
Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication
This paper presents FT-GAIA, a software-based fault-tolerant parallel and
distributed simulation middleware. FT-GAIA has being designed to reliably
handle Parallel And Distributed Simulation (PADS) models, which are needed to
properly simulate and analyze complex systems arising in any kind of scientific
or engineering field. PADS takes advantage of multiple execution units run in
multicore processors, cluster of workstations or HPC systems. However, large
computing systems, such as HPC systems that include hundreds of thousands of
computing nodes, have to handle frequent failures of some components. To cope
with this issue, FT-GAIA transparently replicates simulation entities and
distributes them on multiple execution nodes. This allows the simulation to
tolerate crash-failures of computing nodes. Moreover, FT-GAIA offers some
protection against Byzantine failures, since interaction messages among the
simulated entities are replicated as well, so that the receiving entity can
identify and discard corrupted messages. Results from an analytical model and
from an experimental evaluation show that FT-GAIA provides a high degree of
fault tolerance, at the cost of a moderate increase in the computational load
of the execution units.Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731
Fault-Tolerant Adaptive Parallel and Distributed Simulation
Discrete Event Simulation is a widely used technique that is used to model
and analyze complex systems in many fields of science and engineering. The
increasingly large size of simulation models poses a serious computational
challenge, since the time needed to run a simulation can be prohibitively
large. For this reason, Parallel and Distributes Simulation techniques have
been proposed to take advantage of multiple execution units which are found in
multicore processors, cluster of workstations or HPC systems. The current
generation of HPC systems includes hundreds of thousands of computing nodes and
a vast amount of ancillary components. Despite improvements in manufacturing
processes, failures of some components are frequent, and the situation will get
worse as larger systems are built. In this paper we describe FT-GAIA, a
software-based fault-tolerant extension of the GAIA/ART\`IS parallel simulation
middleware. FT-GAIA transparently replicates simulation entities and
distributes them on multiple execution nodes. This allows the simulation to
tolerate crash-failures of computing nodes; furthermore, FT-GAIA offers some
protection against byzantine failures since synchronization messages are
replicated as well, so that the receiving entity can identify and discard
corrupted messages. We provide an experimental evaluation of FT-GAIA on a
running prototype. Results show that a high degree of fault tolerance can be
achieved, at the cost of a moderate increase in the computational load of the
execution units.Comment: Proceedings of the IEEE/ACM International Symposium on Distributed
Simulation and Real Time Applications (DS-RT 2016
A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems
Supercomputing systems today often come in the form of large numbers of
commodity systems linked together into a computing cluster. These systems, like
any distributed system, can have large numbers of independent hardware
components cooperating or collaborating on a computation. Unfortunately, any of
this vast number of components can fail at any time, resulting in potentially
erroneous output. In order to improve the robustness of supercomputing
applications in the presence of failures, many techniques have been developed
to provide resilience to these kinds of system faults. This survey provides an
overview of these various fault-tolerance techniques.Comment: 11 page
Rapid Recovery for Systems with Scarce Faults
Our goal is to achieve a high degree of fault tolerance through the control
of a safety critical systems. This reduces to solving a game between a
malicious environment that injects failures and a controller who tries to
establish a correct behavior. We suggest a new control objective for such
systems that offers a better balance between complexity and precision: we seek
systems that are k-resilient. In order to be k-resilient, a system needs to be
able to rapidly recover from a small number, up to k, of local faults
infinitely many times, provided that blocks of up to k faults are separated by
short recovery periods in which no fault occurs. k-resilience is a simple but
powerful abstraction from the precise distribution of local faults, but much
more refined than the traditional objective to maximize the number of local
faults. We argue why we believe this to be the right level of abstraction for
safety critical systems when local faults are few and far between. We show that
the computational complexity of constructing optimal control with respect to
resilience is low and demonstrate the feasibility through an implementation and
experimental results.Comment: In Proceedings GandALF 2012, arXiv:1210.202
FairLedger: A Fair Blockchain Protocol for Financial Institutions
Financial institutions are currently looking into technologies for
permissioned blockchains. A major effort in this direction is Hyperledger, an
open source project hosted by the Linux Foundation and backed by a consortium
of over a hundred companies. A key component in permissioned blockchain
protocols is a byzantine fault tolerant (BFT) consensus engine that orders
transactions. However, currently available BFT solutions in Hyperledger (as
well as in the literature at large) are inadequate for financial settings; they
are not designed to ensure fairness or to tolerate selfish behavior that arises
when financial institutions strive to maximize their own profit.
We present FairLedger, a permissioned blockchain BFT protocol, which is fair,
designed to deal with rational behavior, and, no less important, easy to
understand and implement. The secret sauce of our protocol is a new
communication abstraction, called detectable all-to-all (DA2A), which allows us
to detect participants (byzantine or rational) that deviate from the protocol,
and punish them. We implement FairLedger in the Hyperledger open source
project, using Iroha framework, one of the biggest projects therein. To
evaluate FairLegder's performance, we also implement it in the PBFT framework
and compare the two protocols. Our results show that in failure-free scenarios
FairLedger achieves better throughput than both Iroha's implementation and PBFT
in wide-area settings
Study of a unified hardware and software fault-tolerant architecture
A unified architectural concept, called the Fault Tolerant Processor Attached Processor (FTP-AP), that can tolerate hardware as well as software faults is proposed for applications requiring ultrareliable computation capability. An emulation of the FTP-AP architecture, consisting of a breadboard Motorola 68010-based quadruply redundant Fault Tolerant Processor, four VAX 750s as attached processors, and four versions of a transport aircraft yaw damper control law, is used as a testbed in the AIRLAB to examine a number of critical issues. Solutions of several basic problems associated with N-Version software are proposed and implemented on the testbed. This includes a confidence voter to resolve coincident errors in N-Version software. A reliability model of N-Version software that is based upon the recent understanding of software failure mechanisms is also developed. The basic FTP-AP architectural concept appears suitable for hosting N-Version application software while at the same time tolerating hardware failures. Architectural enhancements for greater efficiency, software reliability modeling, and N-Version issues that merit further research are identified
Recommended from our members
Analysis of operating system diversity for intrusion tolerance
One of the key benefits of using intrusion-tolerant systems is the possibility of ensuring correct behavior in the presence of attacks and intrusions. These security gains are directly dependent on the components exhibiting failure diversity. To what extent failure diversity is observed in practical deployment depends on how diverse are the components that constitute the system. In this paper, we present a study with operating system's (OS's) vulnerability data from the NIST National Vulnerability Database (NVD). We have analyzed the vulnerabilities of 11 different OSs over a period of 18 years, to check how many of these vulnerabilities occur in more than one OS. We found this number to be low for several combinations of OSs. Hence, although there are a few caveats on the use of NVD data to support definitive conclusions, our analysis shows that by selecting appropriate OSs, one can preclude (or reduce substantially) common vulnerabilities from occurring in the replicas of the intrusion-tolerant system
- …