54,600 research outputs found
A Failure Detector for k-Set Agreement in Asynchronous Dynamic Systems
The k-set agreement problem is a generalization of the consensus problem where processes can decide up to k different values. Very few papers have tackled this problem in dynamic networks, and to the best of our knowledge, every algorithm proposed so far for k-set agreement in dynamic networks assumed synchronous communications. Exploiting the formalism of the Time-Varying Graph model, this paper proposes a new failure detector Sigma-bottom-k, based on Sigma-k and Sigma-bottom, for solving k-set agreement in asynchronous dynamic networks. We present two algorithms that implement this new failure detector, making assumptions on the number of process failures and graph connectivity. We also provide an algorithm for solving k-set agreement using Sigma-bottom-z , under an assumption on the relative values of k and z
Towards Accountable AI: Hybrid Human-Machine Analyses for Characterizing System Failure
As machine learning systems move from computer-science laboratories into the
open world, their accountability becomes a high priority problem.
Accountability requires deep understanding of system behavior and its failures.
Current evaluation methods such as single-score error metrics and confusion
matrices provide aggregate views of system performance that hide important
shortcomings. Understanding details about failures is important for identifying
pathways for refinement, communicating the reliability of systems in different
settings, and for specifying appropriate human oversight and engagement.
Characterization of failures and shortcomings is particularly complex for
systems composed of multiple machine learned components. For such systems,
existing evaluation methods have limited expressiveness in describing and
explaining the relationship among input content, the internal states of system
components, and final output quality. We present Pandora, a set of hybrid
human-machine methods and tools for describing and explaining system failures.
Pandora leverages both human and system-generated observations to summarize
conditions of system malfunction with respect to the input content and system
architecture. We share results of a case study with a machine learning pipeline
for image captioning that show how detailed performance views can be beneficial
for analysis and debugging
The Weakest Failure Detector for Eventual Consistency
In its classical form, a consistent replicated service requires all replicas
to witness the same evolution of the service state. Assuming a message-passing
environment with a majority of correct processes, the necessary and sufficient
information about failures for implementing a general state machine replication
scheme ensuring consistency is captured by the {\Omega} failure detector. This
paper shows that in such a message-passing environment, {\Omega} is also the
weakest failure detector to implement an eventually consistent replicated
service, where replicas are expected to agree on the evolution of the service
state only after some (a priori unknown) time. In fact, we show that {\Omega}
is the weakest to implement eventual consistency in any message-passing
environment, i.e., under any assumption on when and where failures might occur.
Ensuring (strong) consistency in any environment requires, in addition to
{\Omega}, the quorum failure detector {\Sigma}. Our paper thus captures, for
the first time, an exact computational difference be- tween building a
replicated state machine that ensures consistency and one that only ensures
eventual consistency
Reconfigurable Lattice Agreement and Applications
Reconfiguration is one of the central mechanisms in distributed systems. Due to failures and connectivity disruptions, the very set of service replicas (or servers) and their roles in the computation may have to be reconfigured over time. To provide the desired level of consistency and availability to applications running on top of these servers, the clients of the service should be able to reach some form of agreement on the system configuration. We observe that this agreement is naturally captured via a lattice partial order on the system states. We propose an asynchronous implementation of reconfigurable lattice agreement that implies elegant reconfigurable versions of a large class of lattice abstract data types, such as max-registers and conflict detectors, as well as popular distributed programming abstractions, such as atomic snapshot and commit-adopt
The Impact of RDMA on Agreement
Remote Direct Memory Access (RDMA) is becoming widely available in data
centers. This technology allows a process to directly read and write the memory
of a remote host, with a mechanism to control access permissions. In this
paper, we study the fundamental power of these capabilities. We consider the
well-known problem of achieving consensus despite failures, and find that RDMA
can improve the inherent trade-off in distributed computing between failure
resilience and performance. Specifically, we show that RDMA allows algorithms
that simultaneously achieve high resilience and high performance, while
traditional algorithms had to choose one or another. With Byzantine failures,
we give an algorithm that only requires processes (where
is the maximum number of faulty processes) and decides in two (network)
delays in common executions. With crash failures, we give an algorithm that
only requires processes and also decides in two delays. Both
algorithms tolerate a minority of memory failures inherent to RDMA, and they
provide safety in asynchronous systems and liveness with standard additional
assumptions.Comment: Full version of PODC'19 paper, strengthened broadcast algorith
- …