9,103 research outputs found
A Graph Transformation-Based Approach for the Validation of Checkpointing Algorithms in Distributed Systems
International audience—Autonomic Computing Systems are oriented to pre-vente the human intervention and to enable distributed systems to manage themselves. One of their challenges is the efficient monitoring at runtime oriented to collect information from which the system can automatically repair itself in case of failure. Quasi-Synchronous Checkpointing is a well-known technique, which allows processes to recover in spite of failures. Based on this technique, several checkpointing algorithms have been developed. According to the checkpoint properties detected and ensured, they are classified into: Strictly Z-Path Free (SZPF), Z-Path Free (ZPF) and Z-Cycle Free (ZCF). In the literature, the simulation has been the method adopted for the performance evaluation of checkpointing algorithms. However, few works have been designed to validate their correctness. In this paper, we propose a validation approach based on graph transformation oriented to automatically detect the previous mentioned checkpointing properties. To achieve this, we take the vector clocks resulting from the algorithm execution, and we model it into a causal graph. Then, we design and use transformation rules oriented to verify if in such a causal graph, the algorithm is exempt from non desirable patterns, such as Z-paths or Z-cycles, according to the case
Automatic Generation of Minimal Cut Sets
A cut set is a collection of component failure modes that could lead to a
system failure. Cut Set Analysis (CSA) is applied to critical systems to
identify and rank system vulnerabilities at design time. Model checking tools
have been used to automate the generation of minimal cut sets but are generally
based on checking reachability of system failure states. This paper describes a
new approach to CSA using a Linear Temporal Logic (LTL) model checker called BT
Analyser that supports the generation of multiple counterexamples. The approach
enables a broader class of system failures to be analysed, by generalising from
failure state formulae to failure behaviours expressed in LTL. The traditional
approach to CSA using model checking requires the model or system failure to be
modified, usually by hand, to eliminate already-discovered cut sets, and the
model checker to be rerun, at each step. By contrast, the new approach works
incrementally and fully automatically, thereby removing the tedious and
error-prone manual process and resulting in significantly reduced computation
time. This in turn enables larger models to be checked. Two different
strategies for using BT Analyser for CSA are presented. There is generally no
single best strategy for model checking: their relative efficiency depends on
the model and property being analysed. Comparative results are given for the
A320 hydraulics case study in the Behavior Tree modelling language.Comment: In Proceedings ESSS 2015, arXiv:1506.0325
Machine learning and its applications in reliability analysis systems
In this thesis, we are interested in exploring some aspects of Machine Learning (ML) and its application in the Reliability Analysis systems (RAs). We begin by investigating some ML paradigms and their- techniques, go on to discuss the possible applications of ML in improving RAs performance, and lastly give guidelines of the architecture of learning RAs. Our survey of ML covers both levels of Neural Network learning and Symbolic learning. In symbolic process learning, five types of learning and their applications are discussed: rote learning, learning from instruction, learning from analogy, learning from examples, and learning from observation and discovery. The Reliability Analysis systems (RAs) presented in this thesis are mainly designed for maintaining plant safety supported by two functions: risk analysis function, i.e., failure mode effect analysis (FMEA) ; and diagnosis function, i.e., real-time fault location (RTFL). Three approaches have been discussed in creating the RAs. According to the result of our survey, we suggest currently the best design of RAs is to embed model-based RAs, i.e., MORA (as software) in a neural network based computer system (as hardware). However, there are still some improvement which can be made through the applications of Machine Learning. By implanting the 'learning element', the MORA will become learning MORA (La MORA) system, a learning Reliability Analysis system with the power of automatic knowledge acquisition and inconsistency checking, and more. To conclude our thesis, we propose an architecture of La MORA
A Dual Digraph Approach for Leaderless Atomic Broadcast (Extended Version)
Many distributed systems work on a common shared state; in such systems,
distributed agreement is necessary for consistency. With an increasing number
of servers, these systems become more susceptible to single-server failures,
increasing the relevance of fault-tolerance. Atomic broadcast enables
fault-tolerant distributed agreement, yet it is costly to solve. Most practical
algorithms entail linear work per broadcast message. AllConcur -- a leaderless
approach -- reduces the work, by connecting the servers via a sparse resilient
overlay network; yet, this resiliency entails redundancy, limiting the
reduction of work. In this paper, we propose AllConcur+, an atomic broadcast
algorithm that lifts this limitation: During intervals with no failures, it
achieves minimal work by using a redundancy-free overlay network. When failures
do occur, it automatically recovers by switching to a resilient overlay
network. In our performance evaluation of non-failure scenarios, AllConcur+
achieves comparable throughput to AllGather -- a non-fault-tolerant distributed
agreement algorithm -- and outperforms AllConcur, LCR and Libpaxos both in
terms of throughput and latency. Furthermore, our evaluation of failure
scenarios shows that AllConcur+'s expected performance is robust with regard to
occasional failures. Thus, for realistic use cases, leveraging redundancy-free
distributed agreement during intervals with no failures improves performance
significantly.Comment: Overview: 24 pages, 6 sections, 3 appendices, 8 figures, 3 tables.
Modifications from previous version: extended the evaluation of AllConcur+
with a simulation of a multiple datacenters deploymen
Algorithms for Replica Placement in High-Availability Storage
A new model of causal failure is presented and used to solve a novel replica
placement problem in data centers. The model describes dependencies among
system components as a directed graph. A replica placement is defined as a
subset of vertices in such a graph. A criterion for optimizing replica
placements is formalized and explained. In this work, the optimization goal is
to avoid choosing placements in which a single failure event is likely to wipe
out multiple replicas. Using this criterion, a fast algorithm is given for the
scenario in which the dependency model is a tree. The main contribution of the
paper is an dynamic programming algorithm for placing
replicas on a tree with vertices. This algorithm exhibits the
interesting property that only two subproblems need to be recursively
considered at each stage. An greedy algorithm is also briefly
reported.Comment: 22 pages, 7 figures, 4 algorithm listing
- …