7,175 research outputs found

    Coordination-Free Byzantine Replication with Minimal Communication Costs

    Get PDF
    State-of-the-art fault-tolerant and federated data management systems rely on fully-replicated designs in which all participants have equivalent roles. Consequently, these systems have only limited scalability and are ill-suited for high-performance data management. As an alternative, we propose a hierarchical design in which a Byzantine cluster manages data, while an arbitrary number of learners can reliable learn these updates and use the corresponding data. To realize our design, we propose the delayed-replication algorithm, an efficient solution to the Byzantine learner problem that is central to our design. The delayed-replication algorithm is coordination-free, scalable, and has minimal communication cost for all participants involved. In doing so, the delayed-broadcast algorithm opens the door to new high-performance fault-tolerant and federated data management systems. To illustrate this, we show that the delayed-replication algorithm is not only useful to support specialized learners, but can also be used to reduce the overall communication cost of permissioned blockchains and to improve their storage scalability

    Fault tolerant architectures for integrated aircraft electronics systems

    Get PDF
    Work into possible architectures for future flight control computer systems is described. Ada for Fault-Tolerant Systems, the NETS Network Error-Tolerant System architecture, and voting in asynchronous systems are covered

    Total order broadcast for fault tolerant exascale systems

    Full text link
    In the process of designing a new fault tolerant run-time for future exascale systems, we discovered that a total order broadcast would be necessary. That is, nodes of a supercomputer should be able to broadcast messages to other nodes even in the face of failures. All messages should be seen in the same order at all nodes. While this is a well studied problem in distributed systems, few researchers have looked at how to perform total order broadcasts at large scales for data availability. Our experience implementing a published total order broadcast algorithm showed poor scalability at tens of nodes. In this paper we present a novel algorithm for total order broadcast which scales logarithmically in the number of processes and is not delayed by most process failures. While we are motivated by the needs of our run-time we believe this primitive is of general applicability. Total order broadcasts are used often in datacenter environments and as HPC developers begins to address fault tolerance at the application level we believe they will need similar primitives

    CATS: linearizability and partition tolerance in scalable and self-organizing key-value stores

    Get PDF
    Distributed key-value stores provide scalable, fault-tolerant, and self-organizing storage services, but fall short of guaranteeing linearizable consistency in partially synchronous, lossy, partitionable, and dynamic networks, when data is distributed and replicated automatically by the principle of consistent hashing. This paper introduces consistent quorums as a solution for achieving atomic consistency. We present the design and implementation of CATS, a distributed key-value store which uses consistent quorums to guarantee linearizability and partition tolerance in such adverse and dynamic network conditions. CATS is scalable, elastic, and self-organizing; key properties for modern cloud storage middleware. Our system shows that consistency can be achieved with practical performance and modest throughput overhead (5%) for read-intensive workloads
    • …
    corecore