Search CORE

28,725 research outputs found

Design diversity: an update from research on reliability modelling

Author: B Littlewood
B Littlewood
B Littlewood
B Littlewood
DE Eckhardt
JC Knight
L Hatton
P Popov
P Popov
P Popov
Publication venue
Publication date: 01/01/2001
Field of study

Diversity between redundant subsystems is, in various forms, a common design approach for improving system dependability. Its value in the case of software-based systems is still controversial. This paper gives an overview of reliability modelling work we carried out in recent projects on design diversity, presented in the context of previous knowledge and practice. These results provide additional insight for decisions in applying diversity and in assessing diverseredundant systems. A general observation is that, just as diversity is a very general design approach, the models of diversity can help conceptual understanding of a range of different situations. We summarise results in the general modelling of common-mode failure, in inference from observed failure data, and in decision-making for diversity in development.

CiteSeerX

City Research Online

Crossref

Recommended from our members

Fault Tolerance Against Design Faults

Author: Strigini L.
Publication venue: 'Wiley'
Publication date: 01/01/2005
Field of study

City Research Online

Choosing effective methods for design diversity - How to progress from intuition to science

Author: Popov P. T.
Romanovsky A.
Strigini L.
Publication venue
Publication date: 01/01/1999
Field of study

Design diversity is a popular defence against design faults in safety critical systems. Design diversity is at times pursued by simply isolating the development teams of the different versions, but it is presumably better to "force" diversity, by appropriate prescriptions to the teams. There are many ways of forcing diversity. Yet, managers who have to choose a cost-effective combination of these have little guidance except their own intuition. We argue the need for more scientifically based recommendations, and outline the problems with producing them. We focus on what we think is the standard basis for most recommendations: the belief that, in order to produce failure diversity among versions, project decisions should aim at causing "diversity" among the faults in the versions. We attempt to clarify what these beliefs mean, in which cases they may be justified and how they can be checked or disproved experimentally

CiteSeerX

City Research Online

Computing in the RAIN: a reliable array of independent nodes

Author: Bohossian Vasken
Bruck Jehoshua
Fan Chenggong C.
LeMahieu Paul S.
Riedel Marc D.
Xu Lihao
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1998
Field of study

The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN-technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly-available video server, a highly-available Web server, and a distributed checkpointing system. Also, we describe a commercial product, Rainwall, built with the RAIN technology

CiteSeerX

Caltech Authors

Recommended from our members

Modeling software design diversity

Author: ANDERSON T.
BABBAGE C.
Bev Littlewood
BISHOP P. G.
BISHOP P.G.
FAA
HAGELIN G.
KNIGHT J.C.
LARYD A.
LINDEBERG J. F.
Lorenzo Strigini
Peter Popov
POPOV P.
TRAVERSE P. J.
VOGES U.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2001
Field of study

Design diversity has been used for many years now as a means of achieving a degree of fault tolerance in software-based systems. Whilst there is clear evidence that the approach can be expected to deliver some increase in reliability compared with a single version, there is not agreement about the extent of this. More importantly, it remains difficult to evaluate exactly how reliable a particular diverse fault-tolerant system is. This difficulty arises because assumptions of independence of failures between different versions have been shown not to be tenable: assessment of the actual level of dependence present is therefore needed, and this is hard. In this tutorial we survey the modelling issues here, with an emphasis upon the impact these have upon the problem of assessing the reliability of fault tolerant systems. The intended audience is one of designers, assessors and project managers with only a basic knowledge of probabilities, as well as reliability experts without detailed knowledge of software, who seek an introduction to the probabilistic issues in decisions about design diversity

City Research Online

Crossref

Resource Requirements for Fault-Tolerant Quantum Simulation: The Transverse Ising Model Ground State

Author: A. Y. Kitaev
C. M. Dawson
Craig R. Clark
D. Gottesman
J. Chiaverini
K. M. Svore
Kenneth R. Brown
M. A. Nielsen
P. Aliferis
S. Beauregard
S. Sachdev
Samuel D. Gasster
Tzvetan S. Metodi
Publication venue: 'American Physical Society (APS)'
Publication date: 14/02/2009
Field of study

We estimate the resource requirements, the total number of physical qubits and computational time, required to compute the ground state energy of a 1-D quantum Transverse Ising Model (TIM) of N spin-1/2 particles, as a function of the system size and the numerical precision. This estimate is based on analyzing the impact of fault-tolerant quantum error correction in the context of the Quantum Logic Array (QLA) architecture. Our results show that due to the exponential scaling of the computational time with the desired precision of the energy, significant amount of error correciton is required to implement the TIM problem. Comparison of our results to the resource requirements for a fault-tolerant implementation of Shor's quantum factoring algorithm reveals that the required logical qubit reliability is similar for both the TIM problem and the factoring problem.Comment: 19 pages, 8 figure

arXiv.org e-Print Archive

Crossref

Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication

Author: D'Angelo Gabriele
Ferretti Stefano
Marzolla Moreno
Publication venue: 'Elsevier BV'
Publication date: 01/01/2019
Field of study

This paper presents FT-GAIA, a software-based fault-tolerant parallel and distributed simulation middleware. FT-GAIA has being designed to reliably handle Parallel And Distributed Simulation (PADS) models, which are needed to properly simulate and analyze complex systems arising in any kind of scientific or engineering field. PADS takes advantage of multiple execution units run in multicore processors, cluster of workstations or HPC systems. However, large computing systems, such as HPC systems that include hundreds of thousands of computing nodes, have to handle frequent failures of some components. To cope with this issue, FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash-failures of computing nodes. Moreover, FT-GAIA offers some protection against Byzantine failures, since interaction messages among the simulated entities are replicated as well, so that the receiving entity can identify and discard corrupted messages. Results from an analytical model and from an experimental evaluation show that FT-GAIA provides a high degree of fault tolerance, at the cost of a moderate increase in the computational load of the execution units.Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università di Urbino

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna