1,819 research outputs found
Coordination-Free Byzantine Replication with Minimal Communication Costs
State-of-the-art fault-tolerant and federated data management systems rely on fully-replicated designs in which all participants have equivalent roles. Consequently, these systems have only limited scalability and are ill-suited for high-performance data management. As an alternative, we propose a hierarchical design in which a Byzantine cluster manages data, while an arbitrary number of learners can reliable learn these updates and use the corresponding data.
To realize our design, we propose the delayed-replication algorithm, an efficient solution to the Byzantine learner problem that is central to our design. The delayed-replication algorithm is coordination-free, scalable, and has minimal communication cost for all participants involved. In doing so, the delayed-broadcast algorithm opens the door to new high-performance fault-tolerant and federated data management systems. To illustrate this, we show that the delayed-replication algorithm is not only useful to support specialized learners, but can also be used to reduce the overall communication cost of permissioned blockchains and to improve their storage scalability
Hosting Byzantine Fault Tolerant Services on a Chord Ring
In this paper we demonstrate how stateful Byzantine Fault Tolerant services
may be hosted on a Chord ring. The strategy presented is fourfold: firstly a
replication scheme that dissociates the maintenance of replicated service state
from ring recovery is developed. Secondly, clients of the ring based services
are made replication aware. Thirdly, a consensus protocol is introduced that
supports the serialization of updates. Finally Byzantine fault tolerant
replication protocols are developed that ensure the integrity of service data
hosted on the ring.Comment: Submitted to DSN 2007 Workshop on Architecting Dependable System
A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems
Supercomputing systems today often come in the form of large numbers of
commodity systems linked together into a computing cluster. These systems, like
any distributed system, can have large numbers of independent hardware
components cooperating or collaborating on a computation. Unfortunately, any of
this vast number of components can fail at any time, resulting in potentially
erroneous output. In order to improve the robustness of supercomputing
applications in the presence of failures, many techniques have been developed
to provide resilience to these kinds of system faults. This survey provides an
overview of these various fault-tolerance techniques.Comment: 11 page
- …