5,273 research outputs found
A Short Counterexample Property for Safety and Liveness Verification of Fault-tolerant Distributed Algorithms
Distributed algorithms have many mission-critical applications ranging from
embedded systems and replicated databases to cloud computing. Due to
asynchronous communication, process faults, or network failures, these
algorithms are difficult to design and verify. Many algorithms achieve fault
tolerance by using threshold guards that, for instance, ensure that a process
waits until it has received an acknowledgment from a majority of its peers.
Consequently, domain-specific languages for fault-tolerant distributed systems
offer language support for threshold guards.
We introduce an automated method for model checking of safety and liveness of
threshold-guarded distributed algorithms in systems where the number of
processes and the fraction of faulty processes are parameters. Our method is
based on a short counterexample property: if a distributed algorithm violates a
temporal specification (in a fragment of LTL), then there is a counterexample
whose length is bounded and independent of the parameters. We prove this
property by (i) characterizing executions depending on the structure of the
temporal formula, and (ii) using commutativity of transitions to accelerate and
shorten executions. We extended the ByMC toolset (Byzantine Model Checker) with
our technique, and verified liveness and safety of 10 prominent fault-tolerant
distributed algorithms, most of which were out of reach for existing
techniques.Comment: 16 pages, 11 pages appendi
Rapid Recovery for Systems with Scarce Faults
Our goal is to achieve a high degree of fault tolerance through the control
of a safety critical systems. This reduces to solving a game between a
malicious environment that injects failures and a controller who tries to
establish a correct behavior. We suggest a new control objective for such
systems that offers a better balance between complexity and precision: we seek
systems that are k-resilient. In order to be k-resilient, a system needs to be
able to rapidly recover from a small number, up to k, of local faults
infinitely many times, provided that blocks of up to k faults are separated by
short recovery periods in which no fault occurs. k-resilience is a simple but
powerful abstraction from the precise distribution of local faults, but much
more refined than the traditional objective to maximize the number of local
faults. We argue why we believe this to be the right level of abstraction for
safety critical systems when local faults are few and far between. We show that
the computational complexity of constructing optimal control with respect to
resilience is low and demonstrate the feasibility through an implementation and
experimental results.Comment: In Proceedings GandALF 2012, arXiv:1210.202
Communication Efficient Checking of Big Data Operations
We propose fast probabilistic algorithms with low (i.e., sublinear in the
input size) communication volume to check the correctness of operations in Big
Data processing frameworks and distributed databases. Our checkers cover many
of the commonly used operations, including sum, average, median, and minimum
aggregation, as well as sorting, union, merge, and zip. An experimental
evaluation of our implementation in Thrill (Bingmann et al., 2016) confirms the
low overhead and high failure detection rate predicted by theoretical analysis
Verification and Synthesis of Symmetric Uni-Rings for Leads-To Properties
This paper investigates the verification and synthesis of parameterized
protocols that satisfy leadsto properties on symmetric
unidirectional rings (a.k.a. uni-rings) of deterministic and constant-space
processes under no fairness and interleaving semantics, where and are
global state predicates. First, we show that verifying for
parameterized protocols on symmetric uni-rings is undecidable, even for
deterministic and constant-space processes, and conjunctive state predicates.
Then, we show that surprisingly synthesizing symmetric uni-ring protocols that
satisfy is actually decidable. We identify necessary and
sufficient conditions for the decidability of synthesis based on which we
devise a sound and complete polynomial-time algorithm that takes the predicates
and , and automatically generates a parameterized protocol that
satisfies for unbounded (but finite) ring sizes. Moreover, we
present some decidability results for cases where leadsto is required from
multiple distinct predicates to different predicates. To demonstrate
the practicality of our synthesis method, we synthesize some parameterized
protocols, including agreement and parity protocols
Rigorous Design of Fault-Tolerant Transactions for Replicated Database Systems using Event B
System availability is improved by the replication of data objects in a distributed database system. However, during updates, the complexity of keeping replicas identical arises due to failures of sites and race conditions among conflicting transactions. Fault tolerance and reliability are key issues to be addressed in the design and architecture of these systems. Event B is a formal technique which provides a framework for developing mathematical models of distributed systems by rigorous description of the problem, gradually introducing solutions in refinement steps, and verification of solutions by discharge of proof obligations. In this paper, we present a formal development of a distributed system using Event B that ensures atomic commitment of distributed transactions consisting of communicating transaction components at participating sites. This formal approach carries the development of the system from an initial abstract specification of transactional updates on a one copy database to a detailed design containing replicated databases in refinement. Through refinement we verify that the design of the replicated database confirms to the one copy database abstraction
- âŠ