158 research outputs found

Optimal discrimination between transient and permanent faults

Author: Bondavalli A.
Di Giandomenico F.
Pizza M.
Strigini L.
Publication date: 01/01/1998

An important practical problem in fault diagnosis is discriminating between permanent faults and transient faults. In many computer systems, the majority of errors are due to transient faults. Many heuristic methods have been used for discriminating between transient and permanent faults; however, we have found no previous work stating this decision problem in clear probabilistic terms. We present an optimal procedure for discriminating between transient and permanent faults, based on applying Bayesian inference to the observed events (correct and erroneous results). We describe how the assessed probability that a module is permanently faulty must vary with observed symptoms. We describe and demonstrate our proposed method on a simple application problem, building the appropriate equations and showing numerical examples. The method can be implemented as a run-time diagnosis algorithm at little computational cost; it can also be used to evaluate any heuristic diagnostic procedure by compariso

CiteSeerX

City Research Online

Florence Research

Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes

Author: Cores González Iván
González Patricia
Martín María J.
Osorio Roberto
Rodríguez Gabriel
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

This is a post-peer-review, pre-copyedit version of an article published in New Generation Computing. The final authenticated version is available online at: https://doi.org/10.1007/s00354-013-0302-4 [Abstract] The execution times of large-scale parallel applications on nowadays multi/many-core systems are usually longer than the mean time between failures. Therefore, parallel applications must tolerate hardware failures to ensure that not all computation done is lost on machine failures. Checkpointing and rollback recovery is one of the most popular techniques to implement fault-tolerant applications. However, checkpointing parallel applications is expensive in terms of computing time, network utilization and storage resources. Thus, current checkpoint-recovery techniques should minimize these costs in order to be useful for large scale systems. In this paper three different and complementary techniques to reduce the size of the checkpoints generated by application-level checkpointing are proposed and implemented. Detailed experimental results obtained on a multicore cluster show the effectiveness of the proposed methods to reduce checkpointing cost. Ministerio de Ciencia e Innovación; TIN2010-16735. Galicia. Consellería de Economía e Industria; 10PXIB105180P

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

A Dual Digraph Approach for Leaderless Atomic Broadcast (Extended Version)

Author: Poke Marius
Glass Colin W.
Publication date: 05/02/2019

Many distributed systems work on a common shared state; in such systems, distributed agreement is necessary for consistency. With an increasing number of servers, these systems become more susceptible to single-server failures, increasing the relevance of fault-tolerance. Atomic broadcast enables fault-tolerant distributed agreement, yet it is costly to solve. Most practical algorithms entail linear work per broadcast message. AllConcur -- a leaderless approach -- reduces the work, by connecting the servers via a sparse resilient overlay network; yet, this resiliency entails redundancy, limiting the reduction of work. In this paper, we propose AllConcur+, an atomic broadcast algorithm that lifts this limitation: During intervals with no failures, it achieves minimal work by using a redundancy-free overlay network. When failures do occur, it automatically recovers by switching to a resilient overlay network. In our performance evaluation of non-failure scenarios, AllConcur+ achieves comparable throughput to AllGather -- a non-fault-tolerant distributed agreement algorithm -- and outperforms AllConcur, LCR and Libpaxos both in terms of throughput and latency. Furthermore, our evaluation of failure scenarios shows that AllConcur+'s expected performance is robust with regard to occasional failures. Thus, for realistic use cases, leveraging redundancy-free distributed agreement during intervals with no failures improves performance significantly. Comment: Overview: 24 pages, 6 sections, 3 appendices, 8 figures, 3 tables. Modifications from previous version: extended the evaluation of AllConcur+ with a simulation of a multiple datacenters deploymen

arXiv.org e-Print Archive

FigShare

Polar Microbiology: Recent Advances and Future Perspectives

Publication venue: MDPI - Multidisciplinary Digital Publishing Institute
Publication date: 12/05/2016

Directory of Open Access Books (DOAB)

Improving the reliability of commodity operating systems

Author: Brian N. Bershad
Chen P.
Custer H.
Forin A.
Gosling J.
Hand S. M.
Henry M. Levy
Intel Corporation
Levy H. M.
Michael M. Swift
Ng W. T.
Organick E. I.
UDI.
Young M.
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref

A survey on online monitoring approaches of computer-based systems

Author: Stankovic V.
Strigini L.
Publication venue: Centre for Software Reliability, City University London

This report surveys forms of online data collection that are in current use (as well as being the subject of research to adapt them to changing technology and demands), and can be used as inputs to assessment of dependability and resilience, although they are not primarily meant for this use

City Research Online