
18,241 research outputs found

Recommended from our members

Enhancing Fault / Intrusion Tolerance through Design and Configuration Diversity

Author: Bessani A. N.
Daidone A.
Gashi I.
Obelheiro R. R.
Sousa P.
Stankovic V.
Publication date: 01/01/2009

Fault/intrusion tolerance is usually the only viable way of improving system dependability and security in the presence of continuously evolving threats. Many of the solutions in the literature concern a specific snapshot in the production or deployment of a fault-tolerant system, and give no immediate consideration to how the system should evolve to deal with novel threats. In this paper we outline and evaluate a set of operating systems’ and applications’ reconfiguration rules which can be used to modify the state of a system replica prior to deployment or in between recoveries, and hence increase the replica’s chance of a longer intrusion-free operation.

City Research Online

Universidade de Lisboa: Repositório.UL

Parallelizing Deadlock Resolution in Symbolic Synthesis of Distributed Programs

Author: A. Arora
A. Ebnenasir
B. Bonakdarpour
Borzoo Bonakdarpour
E. A. Emerson
F. Abujarad
F. Somenzi
Fuad Abujarad
J. Ezekiel
Jaco van de Pol
K. Milvang-Jensen
L. Lamport
Lubos Brim
Maurice Herlihy
O. Grumberg
S. S. Kulkarni
Sandeep S. Kulkarni
T. Stornetta
Publication venue: Open Publishing Association
Publication date: 01/12/2009

Previous work has shown that there are two major complexity barriers in the synthesis of fault-tolerant distributed programs: (1) generation of the fault-span, the set of states reachable in the presence of faults, and (2) resolving deadlock states, from where the program has no outgoing transitions. Of these, the former closely resembles model checking and, hence, techniques for efficient verification are directly applicable to it. We therefore focus on expediting the latter with the use of multi-core technology. We present two approaches for parallelization by considering different design choices. The first approach is based on the computation of equivalence classes of program transitions (called group computation) that are needed due to the issue of distribution (i.e., the inability of processes to atomically read and write all program variables). We show that in most cases the speedup of this approach is close to the ideal speedup and in some cases it is superlinear. The second approach uses the traditional technique of partitioning deadlock states among multiple threads. However, our experiments show that the speedup for this approach is small. Consequently, our analysis demonstrates that the simple approach of parallelizing the group computation is likely to be the more effective method for using multi-core computing in the context of deadlock resolution.

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

Pinwheel Scheduling for Fault-tolerant Broadcast Disks in Real-time Database Systems

Author: Baruah Sanjoy
Bestavros Azer
Publication venue: Boston University Computer Science Department
Publication date: 22/08/1996

The design of programs for broadcast disks which incorporate real-time and fault-tolerance requirements is considered. A generalized model for real-time fault-tolerant broadcast disks is defined. It is shown that designing programs for broadcast disks specified in this model is closely related to the scheduling of pinwheel task systems. Some new results in pinwheel scheduling theory are derived, which facilitate the efficient generation of real-time fault-tolerant broadcast disk programs. National Science Foundation (CCR-9308344, CCR-9596282)

Boston University Institutional Repository (OpenBU)

Fault-Tolerant Adaptive Parallel and Distributed Simulation

Author: Armaroli Lorenzo
D'Angelo Gabriele
Ferretti Stefano
Marzolla Moreno
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2016

Discrete Event Simulation is a widely used technique for modeling and analyzing complex systems in many fields of science and engineering. The increasingly large size of simulation models poses a serious computational challenge, since the time needed to run a simulation can be prohibitively large. For this reason, Parallel and Distributed Simulation techniques have been proposed to take advantage of the multiple execution units found in multicore processors, clusters of workstations or HPC systems. The current generation of HPC systems includes hundreds of thousands of computing nodes and a vast amount of ancillary components. Despite improvements in manufacturing processes, failures of some components are frequent, and the situation will get worse as larger systems are built. In this paper we describe FT-GAIA, a software-based fault-tolerant extension of the GAIA/ARTÌS parallel simulation middleware. FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash-failures of computing nodes; furthermore, FT-GAIA offers some protection against Byzantine failures since synchronization messages are replicated as well, so that the receiving entity can identify and discard corrupted messages. We provide an experimental evaluation of FT-GAIA on a running prototype. Results show that a high degree of fault tolerance can be achieved, at the cost of a moderate increase in the computational load of the execution units. Comment: Proceedings of the IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT 2016)
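One simple way a receiver could discard corrupted copies of a replicated message is strict majority voting over the copies it receives. The sketch below is only an illustration under that assumption; it is not taken from FT-GAIA, and the name `vote` is hypothetical.

```python
from collections import Counter


def vote(copies):
    """Return the payload held by a strict majority of copies, else None.

    `copies` is the list of payloads received from the replicas of one
    sender (hypothetical model: one hashable payload per replica).
    """
    if not copies:
        raise ValueError("no copies received")
    # most_common(1) yields the single most frequent payload and its count.
    value, count = Counter(copies).most_common(1)[0]
    # Require more than half of the copies to agree before accepting.
    if 2 * count <= len(copies):
        return None
    return value


# Three replicas, one of which delivered a corrupted payload.
accepted = vote(["step=42", "step=42", "step=4?"])
```

With three copies this tolerates one corrupted message; with only two copies that disagree, no majority exists and the function returns `None`.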

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Damage Tolerant Active Control: Concept and State of the Art

Author: MECHBAL Nazih
NOBREGA Euripedes
Publication venue: Elsevier BV
Publication date: 01/01/2012

Damage tolerant active control is a new research area relating to fault tolerant control design applied to mechanical structures. It encompasses several techniques already used to design controllers and to detect and diagnose faults, as well as to monitor structural integrity. Brief reviews of the common intersections of these areas are presented, in order to clarify their relations and also to justify the new controller design paradigm. Some examples help to better understand the role of the new area.

HAL Descartes

SAM : Science Arts et Métiers

Hal-Diderot

Algorithmic Based Fault Tolerance Applied to High Performance Computing

Author: Bosilca George
Delmas Remi
Dongarra Jack
Langou Julien
Publication date: 01/01/2008

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flips) on the fly during a computation. To assess the viability of our approach, we have developed a fault tolerant matrix-matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result even when one process failure occurs. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly.
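As a rough illustration of the checksum idea behind Algorithmic Based Fault Tolerance (Huang and Abraham, 1984), the sketch below multiplies a column-checksum matrix by a row-checksum matrix and uses the resulting checksums to locate and repair a single corrupted entry of the product. It is a sequential, integer-only toy and not the parallel subroutine described in the abstract; all function names are hypothetical.

```python
def matmul(a, b):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def checksum_matmul(a, b):
    """Multiply A (with an extra row of column sums) by B (with an extra
    column of row sums). The result is C = A*B bordered by its own row and
    column checksums."""
    a_c = a + [[sum(col) for col in zip(*a)]]   # column-checksum matrix
    b_r = [row + [sum(row)] for row in b]       # row-checksum matrix
    return matmul(a_c, b_r)


def locate_and_correct(c_full):
    """Detect and repair one wrong entry in the data part of c_full.

    Returns None if all checksums agree, or (row, col) of the repaired
    entry. Any other inconsistency pattern raises ValueError.
    """
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # Rebuild the entry from the row checksum and the other entries.
        others = sum(c_full[i][k] for k in range(m) if k != j)
        c_full[i][j] = c_full[i][m] - others
        return (i, j)
    raise ValueError("inconsistency is not a single data-entry error")


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = checksum_matmul(a, b)
c_full[0][1] += 1                    # simulate a corrupted entry
repaired = locate_and_correct(c_full)
```

Integers are used so the checksum comparisons are exact; with floating-point data a tolerance would be needed instead of `!=`.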

arXiv.org e-Print Archive

CiteSeerX

MIMS EPrints

The University of Manchester - Institutional Repository

Recommended from our members

Fault Tolerance Against Design Faults

Author: Strigini L.
Publication venue: Wiley
Publication date: 01/01/2005

City Research Online