27,952 research outputs found
A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems
Supercomputing systems today often come in the form of large numbers of
commodity systems linked together into a computing cluster. These systems, like
any distributed system, can have large numbers of independent hardware
components cooperating or collaborating on a computation. Unfortunately, any of
this vast number of components can fail at any time, resulting in potentially
erroneous output. In order to improve the robustness of supercomputing
applications in the presence of failures, many techniques have been developed
to provide resilience to these kinds of system faults. This survey provides an
overview of these various fault-tolerance techniques.Comment: 11 page
Fault-Tolerant Adaptive Parallel and Distributed Simulation
Discrete Event Simulation is a widely used technique that is used to model
and analyze complex systems in many fields of science and engineering. The
increasingly large size of simulation models poses a serious computational
challenge, since the time needed to run a simulation can be prohibitively
large. For this reason, Parallel and Distributes Simulation techniques have
been proposed to take advantage of multiple execution units which are found in
multicore processors, cluster of workstations or HPC systems. The current
generation of HPC systems includes hundreds of thousands of computing nodes and
a vast amount of ancillary components. Despite improvements in manufacturing
processes, failures of some components are frequent, and the situation will get
worse as larger systems are built. In this paper we describe FT-GAIA, a
software-based fault-tolerant extension of the GAIA/ART\`IS parallel simulation
middleware. FT-GAIA transparently replicates simulation entities and
distributes them on multiple execution nodes. This allows the simulation to
tolerate crash-failures of computing nodes; furthermore, FT-GAIA offers some
protection against byzantine failures since synchronization messages are
replicated as well, so that the receiving entity can identify and discard
corrupted messages. We provide an experimental evaluation of FT-GAIA on a
running prototype. Results show that a high degree of fault tolerance can be
achieved, at the cost of a moderate increase in the computational load of the
execution units.Comment: Proceedings of the IEEE/ACM International Symposium on Distributed
Simulation and Real Time Applications (DS-RT 2016
NASA space station automation: AI-based technology review
Research and Development projects in automation for the Space Station are discussed. Artificial Intelligence (AI) based automation technologies are planned to enhance crew safety through reduced need for EVA, increase crew productivity through the reduction of routine operations, increase space station autonomy, and augment space station capability through the use of teleoperation and robotics. AI technology will also be developed for the servicing of satellites at the Space Station, system monitoring and diagnosis, space manufacturing, and the assembly of large space structures
DeSyRe: on-Demand System Reliability
The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design, in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect and fault-free system, would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault-tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints
Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication
This paper presents FT-GAIA, a software-based fault-tolerant parallel and
distributed simulation middleware. FT-GAIA has being designed to reliably
handle Parallel And Distributed Simulation (PADS) models, which are needed to
properly simulate and analyze complex systems arising in any kind of scientific
or engineering field. PADS takes advantage of multiple execution units run in
multicore processors, cluster of workstations or HPC systems. However, large
computing systems, such as HPC systems that include hundreds of thousands of
computing nodes, have to handle frequent failures of some components. To cope
with this issue, FT-GAIA transparently replicates simulation entities and
distributes them on multiple execution nodes. This allows the simulation to
tolerate crash-failures of computing nodes. Moreover, FT-GAIA offers some
protection against Byzantine failures, since interaction messages among the
simulated entities are replicated as well, so that the receiving entity can
identify and discard corrupted messages. Results from an analytical model and
from an experimental evaluation show that FT-GAIA provides a high degree of
fault tolerance, at the cost of a moderate increase in the computational load
of the execution units.Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731
- …