13,472 research outputs found
Pinwheel Scheduling for Fault-tolerant Broadcast Disks in Real-time Database Systems
The design of programs for broadcast disks which incorporate real-time and fault-tolerance requirements is considered. A generalized model for real-time fault-tolerant broadcast disks is defined. It is shown that designing programs for broadcast disks specified in this model is closely related to the scheduling of pinwheel task systems. Some new results in pinwheel scheduling theory are derived, which facilitate the efficient generation of real-time fault-tolerant broadcast disk programs.National Science Foundation (CCR-9308344, CCR-9596282
Parallelizing Deadlock Resolution in Symbolic Synthesis of Distributed Programs
Previous work has shown that there are two major complexity barriers in the
synthesis of fault-tolerant distributed programs: (1) generation of fault-span,
the set of states reachable in the presence of faults, and (2) resolving
deadlock states, from where the program has no outgoing transitions. Of these,
the former closely resembles with model checking and, hence, techniques for
efficient verification are directly applicable to it. Hence, we focus on
expediting the latter with the use of multi-core technology.
We present two approaches for parallelization by considering different design
choices. The first approach is based on the computation of equivalence classes
of program transitions (called group computation) that are needed due to the
issue of distribution (i.e., inability of processes to atomically read and
write all program variables). We show that in most cases the speedup of this
approach is close to the ideal speedup and in some cases it is superlinear. The
second approach uses traditional technique of partitioning deadlock states
among multiple threads. However, our experiments show that the speedup for this
approach is small. Consequently, our analysis demonstrates that a simple
approach of parallelizing the group computation is likely to be the effective
method for using multi-core computing in the context of deadlock resolution
EOS: A project to investigate the design and construction of real-time distributed embedded operating systems
The EOS project is investigating the design and construction of a family of real-time distributed embedded operating systems for reliable, distributed aerospace applications. Using the real-time programming techniques developed in co-operation with NASA in earlier research, the project staff is building a kernel for a multiple processor networked system. The first six months of the grant included a study of scheduling in an object-oriented system, the design philosophy of the kernel, and the architectural overview of the operating system. In this report, the operating system and kernel concepts are described. An environment for the experiments has been built and several of the key concepts of the system have been prototyped. The kernel and operating system is intended to support future experimental studies in multiprocessing, load-balancing, routing, software fault-tolerance, distributed data base design, and real-time processing
A Short Counterexample Property for Safety and Liveness Verification of Fault-tolerant Distributed Algorithms
Distributed algorithms have many mission-critical applications ranging from
embedded systems and replicated databases to cloud computing. Due to
asynchronous communication, process faults, or network failures, these
algorithms are difficult to design and verify. Many algorithms achieve fault
tolerance by using threshold guards that, for instance, ensure that a process
waits until it has received an acknowledgment from a majority of its peers.
Consequently, domain-specific languages for fault-tolerant distributed systems
offer language support for threshold guards.
We introduce an automated method for model checking of safety and liveness of
threshold-guarded distributed algorithms in systems where the number of
processes and the fraction of faulty processes are parameters. Our method is
based on a short counterexample property: if a distributed algorithm violates a
temporal specification (in a fragment of LTL), then there is a counterexample
whose length is bounded and independent of the parameters. We prove this
property by (i) characterizing executions depending on the structure of the
temporal formula, and (ii) using commutativity of transitions to accelerate and
shorten executions. We extended the ByMC toolset (Byzantine Model Checker) with
our technique, and verified liveness and safety of 10 prominent fault-tolerant
distributed algorithms, most of which were out of reach for existing
techniques.Comment: 16 pages, 11 pages appendi
Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 1: Army fault tolerant architecture overview
Digital computing systems needed for Army programs such as the Computer-Aided Low Altitude Helicopter Flight Program and the Armored Systems Modernization (ASM) vehicles may be characterized by high computational throughput and input/output bandwidth, hard real-time response, high reliability and availability, and maintainability, testability, and producibility requirements. In addition, such a system should be affordable to produce, procure, maintain, and upgrade. To address these needs, the Army Fault Tolerant Architecture (AFTA) is being designed and constructed under a three-year program comprised of a conceptual study, detailed design and fabrication, and demonstration and validation phases. Described here are the results of the conceptual study phase of the AFTA development. Given here is an introduction to the AFTA program, its objectives, and key elements of its technical approach. A format is designed for representing mission requirements in a manner suitable for first order AFTA sizing and analysis, followed by a discussion of the current state of mission requirements acquisition for the targeted Army missions. An overview is given of AFTA's architectural theory of operation
On model checking data-independent systems with arrays without reset
A system is data-independent with respect to a data type X iff the operations
it can perform on values of type X are restricted to just equality testing. The
system may also store, input and output values of type X. We study model
checking of systems which are data-independent with respect to two distinct
type variables X and Y, and may in addition use arrays with indices from X and
values from Y . Our main interest is the following parameterised model-checking
problem: whether a given program satisfies a given temporal-logic formula for
all non-empty nite instances of X and Y . Initially, we consider instead the
abstraction where X and Y are infinite and where partial functions with finite
domains are used to model arrays. Using a translation to data-independent
systems without arrays, we show that the u-calculus model-checking problem is
decidable for these systems. From this result, we can deduce properties of all
systems with finite instances of X and Y . We show that there is a procedure
for the above parameterised model-checking problem of the universal fragment of
the u-calculus, such that it always terminates but may give false negatives. We
also deduce that the parameterised model-checking problem of the universal
disjunction-free fragment of the u-calculus is decidable. Practical motivations
for model checking data-independent systems with arrays include verification of
memory and cache systems, where X is the type of memory addresses, and Y the
type of storable values. As an example we verify a fault-tolerant memory
interface over a set of unreliable memories.Comment: Appeared in Theory and Practice of Logic Programming, vol. 4, no.
5&6, 200
Blazes: Coordination Analysis for Distributed Programs
Distributed consistency is perhaps the most discussed topic in distributed
systems today. Coordination protocols can ensure consistency, but in practice
they cause undesirable performance unless used judiciously. Scalable
distributed architectures avoid coordination whenever possible, but
under-coordinated systems can exhibit behavioral anomalies under fault, which
are often extremely difficult to debug. This raises significant challenges for
distributed system architects and developers. In this paper we present Blazes,
a cross-platform program analysis framework that (a) identifies program
locations that require coordination to ensure consistent executions, and (b)
automatically synthesizes application-specific coordination code that can
significantly outperform general-purpose techniques. We present two case
studies, one using annotated programs in the Twitter Storm system, and another
using the Bloom declarative language.Comment: Updated to include additional materials from the original technical
report: derivation rules, output stream label
- …