819 research outputs found
A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems
Supercomputing systems today often come in the form of large numbers of
commodity systems linked together into a computing cluster. These systems, like
any distributed system, can have large numbers of independent hardware
components cooperating or collaborating on a computation. Unfortunately, any of
this vast number of components can fail at any time, resulting in potentially
erroneous output. In order to improve the robustness of supercomputing
applications in the presence of failures, many techniques have been developed
to provide resilience to these kinds of system faults. This survey provides an
overview of these various fault-tolerance techniques.Comment: 11 page
06371 Abstracts Collection -- From Security to Dependability
From 10.09.06 to 15.09.06, the Dagstuhl Seminar 06371 ``From Security to Dependability\u27\u27 was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts of
the presentations given during the seminar as well as abstracts of
seminar results and ideas are put together in this paper. The first section
describes the seminar topics and goals in general.
Links to extended abstracts or full papers are provided, if available
Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication
This paper presents FT-GAIA, a software-based fault-tolerant parallel and
distributed simulation middleware. FT-GAIA has being designed to reliably
handle Parallel And Distributed Simulation (PADS) models, which are needed to
properly simulate and analyze complex systems arising in any kind of scientific
or engineering field. PADS takes advantage of multiple execution units run in
multicore processors, cluster of workstations or HPC systems. However, large
computing systems, such as HPC systems that include hundreds of thousands of
computing nodes, have to handle frequent failures of some components. To cope
with this issue, FT-GAIA transparently replicates simulation entities and
distributes them on multiple execution nodes. This allows the simulation to
tolerate crash-failures of computing nodes. Moreover, FT-GAIA offers some
protection against Byzantine failures, since interaction messages among the
simulated entities are replicated as well, so that the receiving entity can
identify and discard corrupted messages. Results from an analytical model and
from an experimental evaluation show that FT-GAIA provides a high degree of
fault tolerance, at the cost of a moderate increase in the computational load
of the execution units.Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731
The Bedrock of Byzantine Fault Tolerance: A Unified Platform for BFT Protocol Design and Implementation
Byzantine Fault-Tolerant (BFT) protocols have recently been extensively used
by decentralized data management systems with non-trustworthy infrastructures,
e.g., permissioned blockchains. BFT protocols cover a broad spectrum of design
dimensions from infrastructure settings such as the communication topology, to
more technical features such as commitment strategy and even fundamental social
choice properties like order-fairness. The proliferation of different BFT
protocols has rendered it difficult to navigate the BFT landscape, let alone
determine the protocol that best meets application needs. This paper presents
Bedrock, a unified platform for BFT protocols design, analysis, implementation,
and experiments. Bedrock proposes a design space consisting of a set of design
choices capturing the trade-offs between different design space dimensions and
providing fundamentally new insights into the strengths and weaknesses of BFT
protocols. Bedrock enables users to analyze and experiment with BFT protocols
within the space of plausible choices, evolve current protocols to design new
ones, and even uncover previously unknown protocols. Our experimental results
demonstrate the capability of Bedrock to uniformly evaluate BFT protocols in
new ways that were not possible before due to the diverse assumptions made by
these protocols. The results validate Bedrock's ability to analyze and derive
BFT protocols
- …