6,365 research outputs found
Byzantine Approximate Agreement on Graphs
Consider a distributed system with n processors out of which f can be Byzantine faulty. In the approximate agreement task, each processor i receives an input value x_i and has to decide on an output value y_i such that
1) the output values are in the convex hull of the non-faulty processors\u27 input values,
2) the output values are within distance d of each other.
Classically, the values are assumed to be from an m-dimensional Euclidean space, where m >= 1.
In this work, we study the task in a discrete setting, where input values with some structure expressible as a graph. Namely, the input values are vertices of a finite graph G and the goal is to output vertices that are within distance d of each other in G, but still remain in the graph-induced convex hull of the input values. For d=0, the task reduces to consensus and cannot be solved with a deterministic algorithm in an asynchronous system even with a single crash fault. For any d >= 1, we show that the task is solvable in asynchronous systems when G is chordal and n > (omega+1)f, where omega is the clique number of G. In addition, we give the first Byzantine-tolerant algorithm for a variant of lattice agreement. For synchronous systems, we show tight resilience bounds for the exact variants of these and related tasks over a large class of combinatorial structures
The Impact of RDMA on Agreement
Remote Direct Memory Access (RDMA) is becoming widely available in data
centers. This technology allows a process to directly read and write the memory
of a remote host, with a mechanism to control access permissions. In this
paper, we study the fundamental power of these capabilities. We consider the
well-known problem of achieving consensus despite failures, and find that RDMA
can improve the inherent trade-off in distributed computing between failure
resilience and performance. Specifically, we show that RDMA allows algorithms
that simultaneously achieve high resilience and high performance, while
traditional algorithms had to choose one or another. With Byzantine failures,
we give an algorithm that only requires processes (where
is the maximum number of faulty processes) and decides in two (network)
delays in common executions. With crash failures, we give an algorithm that
only requires processes and also decides in two delays. Both
algorithms tolerate a minority of memory failures inherent to RDMA, and they
provide safety in asynchronous systems and liveness with standard additional
assumptions.Comment: Full version of PODC'19 paper, strengthened broadcast algorith
Computing in the RAIN: a reliable array of independent nodes
The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN-technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly-available video server, a highly-available Web server, and a distributed checkpointing system. Also, we describe a commercial product, Rainwall, built with the RAIN technology
Machine-checked proofs of the design and implementation of a fault-tolerant circuit
A formally verified implementation of the 'oral messages' algorithm of Pease, Shostak, and Lamport is described. An abstract implementation of the algorithm is verified to achieve interactive consistency in the presence of faults. This abstract characterization is then mapped down to a hardware level implementation which inherits the fault-tolerant characteristics of the abstract version. All steps in the proof were checked with the Boyer-Moore theorem prover. A significant results is the demonstration of a fault-tolerant device that is formally specified and whose implementation is proved correct with respect to this specification. A significant simplifying assumption is that the redundant processors behave synchronously. A mechanically checked proof that the oral messages algorithm is 'optimal' in the sense that no algorithm which achieves agreement via similar message passing can tolerate a larger proportion of faulty processor is also described
Algorithms For Extracting Timeliness Graphs
We consider asynchronous message-passing systems in which some links are
timely and processes may crash. Each run defines a timeliness graph among
correct processes: (p; q) is an edge of the timeliness graph if the link from p
to q is timely (that is, there is bound on communication delays from p to q).
The main goal of this paper is to approximate this timeliness graph by graphs
having some properties (such as being trees, rings, ...). Given a family S of
graphs, for runs such that the timeliness graph contains at least one graph in
S then using an extraction algorithm, each correct process has to converge to
the same graph in S that is, in a precise sense, an approximation of the
timeliness graph of the run. For example, if the timeliness graph contains a
ring, then using an extraction algorithm, all correct processes eventually
converge to the same ring and in this ring all nodes will be correct processes
and all links will be timely. We first present a general extraction algorithm
and then a more specific extraction algorithm that is communication efficient
(i.e., eventually all the messages of the extraction algorithm use only links
of the extracted graph)
Agreement in wider environments with weaker assumptions.
The set agreement problem states that from n proposed values at most n?1 can be decided. Traditionally, this problem is solved using a failure detector in asynchronous systems where processes may crash but do not recover, where processes have different identities, and where all processes initially know the membership. In this paper we study the set agreement problem and the weakest failure detector L used to solve it in asynchronous message passing systems where processes may crash and recover, with homonyms (i.e., processes may have equal identities) and without a complete initial knowledge of the membership
k-Set Agreement in Communication Networks with Omission Faults
We consider an arbitrary communication network G where at most f messages can be lost at each round, and consider the classical k-set agreement problem in this setting. We characterize exactly for which f the k-set agreement problem can be solved on G.
The case with k = 1, that is the Consensus problem, has first been introduced by Santoro and Widmayer in 1989, the characterization is already known from [Coulouma/Godard/Peters, TCS, 2015]. As a first contribution, we present a detailed and complete characterization for the 2-set problem. The proof of the impossibility result uses topological methods. We introduce a new subdivision approach for these topological methods that is of independent interest.
In the second part, we show how to extend to the general case with k in N. This characterization is the first complete characterization for this kind of synchronous message passing model, a model that is a subclass of the family of oblivious message adversaries
- …