4,642 research outputs found
Wait-Freedom with Advice
We motivate and propose a new way of thinking about failure detectors which
allows us to define, quite surprisingly, what it means to solve a distributed
task \emph{wait-free} \emph{using a failure detector}. In our model, the system
is composed of \emph{computation} processes that obtain inputs and are supposed
to output in a finite number of steps and \emph{synchronization} processes that
are subject to failures and can query a failure detector. We assume that, under
the condition that \emph{correct} synchronization processes take sufficiently
many steps, they provide the computation processes with enough \emph{advice} to
solve the given task wait-free: every computation process outputs in a finite
number of its own steps, regardless of the behavior of other computation
processes. Every task can thus be characterized by the \emph{weakest} failure
detector that allows for solving it, and we show that every such failure
detector captures a form of set agreement. We then obtain a complete
classification of tasks, including ones that evaded comprehensible
characterization so far, such as renaming or weak symmetry breaking
The Weakest Failure Detector for Eventual Consistency
In its classical form, a consistent replicated service requires all replicas
to witness the same evolution of the service state. Assuming a message-passing
environment with a majority of correct processes, the necessary and sufficient
information about failures for implementing a general state machine replication
scheme ensuring consistency is captured by the {\Omega} failure detector. This
paper shows that in such a message-passing environment, {\Omega} is also the
weakest failure detector to implement an eventually consistent replicated
service, where replicas are expected to agree on the evolution of the service
state only after some (a priori unknown) time. In fact, we show that {\Omega}
is the weakest to implement eventual consistency in any message-passing
environment, i.e., under any assumption on when and where failures might occur.
Ensuring (strong) consistency in any environment requires, in addition to
{\Omega}, the quorum failure detector {\Sigma}. Our paper thus captures, for
the first time, an exact computational difference be- tween building a
replicated state machine that ensures consistency and one that only ensures
eventual consistency
On the Hardness of the Strongly Dependent Decision Problem
We present necessary and sufficient conditions for solving the strongly
dependent decision (SDD) problem in various distributed systems. Our main
contribution is a novel characterization of the SDD problem based on point-set
topology. For partially synchronous systems, we show that any algorithm that
solves the SDD problem induces a set of executions that is closed with respect
to the point-set topology. We also show that the SDD problem is not solvable in
the asynchronous system augmented with any arbitrarily strong failure
detectors.Comment: Appeared in ICDCN 201
Towards Accountable AI: Hybrid Human-Machine Analyses for Characterizing System Failure
As machine learning systems move from computer-science laboratories into the
open world, their accountability becomes a high priority problem.
Accountability requires deep understanding of system behavior and its failures.
Current evaluation methods such as single-score error metrics and confusion
matrices provide aggregate views of system performance that hide important
shortcomings. Understanding details about failures is important for identifying
pathways for refinement, communicating the reliability of systems in different
settings, and for specifying appropriate human oversight and engagement.
Characterization of failures and shortcomings is particularly complex for
systems composed of multiple machine learned components. For such systems,
existing evaluation methods have limited expressiveness in describing and
explaining the relationship among input content, the internal states of system
components, and final output quality. We present Pandora, a set of hybrid
human-machine methods and tools for describing and explaining system failures.
Pandora leverages both human and system-generated observations to summarize
conditions of system malfunction with respect to the input content and system
architecture. We share results of a case study with a machine learning pipeline
for image captioning that show how detailed performance views can be beneficial
for analysis and debugging
Failure Detectors in Homonymous Distributed Systems (with an Application to Consensus).
This paper is on homonymous distributed systems where processes are prone to crash failures and have no initial knowledge of the system membership (?homonymous? means that several processes may have the same identi?er). New classes of failure detectors suited to these systems are ?rst de?ned. Among them, the classes H? and H? are introduced that are the homonymous counterparts of the classes ? and ?, respectively. (Recall that the pair h?,?i de?nes the weakest failure detector to solve consensus.) Then, the paper shows how H? and H? can be implemented in homonymous systems without membership knowledge (under different synchrony requirements). Finally, two algorithms are presented that use these failure detectors to solve consensus in homonymous asynchronous systems where there is no initial knowledge ofthe membership. One algorithm solves consensus with hH?, H?i, while the other uses only H?, but needs a majority of correct processes. Observe that the systems with unique identi?ers and anonymous systems are extreme cases of homonymous systems from which follows that all these results also apply to these systems. Interestingly, the new failure detector class H? can be implemented with partial synchrony, while the analogous class A? de?ned for anonymous systems can not be implemented (even in synchronous systems). Hence, the paper provides us with the ?rst proof showing that consensus can be solved in anonymous systems with only partial synchrony (and a majority of correct processes)
Distributed eventual leader election in the crash-recovery and general omission failure models.
102 p.Distributed applications are present in many aspects of everyday life. Banking, healthcare or transportation are examples of such applications. These applications are built on top of distributed systems. Roughly speaking, a distributed system is composed of a set of processes that collaborate among them to achieve a common goal. When building such systems, designers have to cope with several issues, such as different synchrony assumptions and failure occurrence. Distributed systems must ensure that the delivered service is trustworthy.Agreement problems compose a fundamental class of problems in distributed systems. All agreement problems follow the same pattern: all processes must agree on some common decision. Most of the agreement problems can be considered as a particular instance of the Consensus problem. Hence, they can be solved by reduction to consensus. However, a fundamental impossibility result, namely (FLP), states that in an asynchronous distributed system it is impossible to achieve consensus deterministically when at least one process may fail. A way to circumvent this obstacle is by using unreliable failure detectors. A failure detector allows to encapsulate synchrony assumptions of the system, providing (possibly incorrect) information about process failures. A particular failure detector, called Omega, has been shown to be the weakest failure detector for solving consensus with a majority of correct processes. Informally, Omega lies on providing an eventual leader election mechanism
- …