Optimal Gossip with Direct Addressing
Gossip algorithms spread information by having nodes repeatedly forward
information to a few random contacts. By their very nature, gossip algorithms
tend to be distributed and fault tolerant. If done right, they can also be fast
and message-efficient. A common model for gossip communication is the random
phone call model, in which in each synchronous round each node can PUSH or PULL
information to or from a random other node. For example, Karp et al. [FOCS
2000] gave algorithms in this model that spread a message to all nodes in
O(log n) rounds while sending only O(log log n) messages per node
on average.
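The random phone call model described above is simple enough to simulate directly. The sketch below is illustrative only and is not code from any of the papers listed here (`push_pull_gossip` is a made-up name); it spreads a rumor from a single source via synchronous PUSH and PULL contacts, and the observed round count is consistent with the logarithmic round complexity discussed above:

```python
import math
import random

def push_pull_gossip(n, seed=0):
    """Simulate rumor spreading in the random phone call model:
    in each synchronous round every node contacts one uniformly
    random other node, PUSHes the rumor if it knows it, and PULLs
    the rumor if the contacted node knows it."""
    rng = random.Random(seed)
    informed = [False] * n
    informed[0] = True                # a single initial rumor source
    rounds = 0
    while not all(informed):
        rounds += 1
        snapshot = list(informed)     # contacts within a round act on the
        for u in range(n):            # state at the start of the round
            v = rng.randrange(n - 1)
            if v >= u:                # pick a random node other than u
                v += 1
            if snapshot[u]:           # PUSH: u tells v
                informed[v] = True
            if snapshot[v]:           # PULL: u learns from v
                informed[u] = True
    return rounds

n = 1 << 10
print(f"{n} nodes informed after {push_pull_gossip(n)} rounds"
      f" (log2 n = {int(math.log2(n))})")
```

Reading from a snapshot while writing to the live array keeps the rounds synchronous, matching the model's assumption that all contacts of a round happen in parallel.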
Recently, Avin and Elsässer [DISC 2013] studied the random phone call
model with the natural and commonly used assumption of direct addressing.
Direct addressing allows nodes to directly contact nodes whose ID (e.g., IP
address) was learned before. They show that in this setting, one can "break the
log n barrier" and achieve a gossip algorithm running in O(√(log n))
rounds, albeit while using O(√(log n)) messages per node.
We study the same model and give a simple gossip algorithm which spreads a
message in only O(log log n) rounds. We also prove a matching lower bound
which shows that this running time is best possible. In particular we show
that any gossip algorithm takes with high probability at least Ω(log log n)
rounds to terminate. Lastly, our algorithm can be tweaked to send only O(1)
messages per node on average with only O(log n) bits per message. Our
algorithm therefore simultaneously achieves the optimal
round-, message-, and bit-complexity for this setting. Like all prior gossip
algorithms, our algorithm is also robust against failures. In particular, if in
the beginning an oblivious adversary fails any F nodes, our algorithm still,
with high probability, informs all but o(F) of the surviving nodes.
Rapid Recovery for Systems with Scarce Faults
Our goal is to achieve a high degree of fault tolerance through the control
of safety-critical systems. This reduces to solving a game between a
malicious environment that injects failures and a controller who tries to
establish correct behavior. We suggest a new control objective for such
systems that offers a better balance between complexity and precision: we seek
systems that are k-resilient. In order to be k-resilient, a system needs to be
able to rapidly recover from a small number, up to k, of local faults
infinitely many times, provided that blocks of up to k faults are separated by
short recovery periods in which no fault occurs. k-resilience is a simple but
powerful abstraction from the precise distribution of local faults, yet much
more refined than the traditional objective of maximizing the number of
tolerated local faults. We argue that this is the right level of abstraction for
safety critical systems when local faults are few and far between. We show that
the computational complexity of constructing optimal control with respect to
resilience is low and demonstrate the feasibility through an implementation and
experimental results.
Comment: In Proceedings GandALF 2012, arXiv:1210.202
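The fault assumption behind k-resilience — blocks of at most k local faults separated by short fault-free recovery periods — can be made concrete with a small checker over a fault trace. The sketch below is an illustration under assumed names (`admissible`, `recovery`), not the paper's formalization:

```python
def admissible(trace, k, recovery):
    """Check whether a fault trace conforms to the k-resilience
    assumption: faults arrive in blocks of at most k, and consecutive
    blocks are separated by at least `recovery` fault-free steps.
    `trace` is a sequence of booleans, True marking a local fault."""
    block = 0        # faults counted in the current block
    calm = recovery  # fault-free steps since the last fault
    for fault in trace:
        if fault:
            if calm >= recovery:
                block = 0        # a full recovery period has passed:
            block += 1           # this fault starts a new block
            calm = 0
            if block > k:
                return False     # more than k faults in one block
        else:
            calm += 1
    return True

print(admissible([True, True, False, False, True], k=2, recovery=2))  # True
print(admissible([True, False, True], k=1, recovery=2))               # False
```

In the second call the single calm step is shorter than the recovery period, so both faults fall into one block and exceed k = 1.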
Middleware Fault Tolerance Support for the BOSS Embedded Operating System
Critical embedded systems need a dependable operating system and application. Despite all efforts to prevent and remove faults in system development, residual software faults usually persist. Therefore, critical systems need some sort of fault tolerance to deal with these faults and also with hardware faults at operation time.
This work proposes fault-tolerant support mechanisms for the BOSS embedded operating system, based on the application of proven fault tolerance strategies by middleware control software which transparently delivers the added functionality to the application software. Special attention is paid to complexity control and resource constraints, targeting the needs of the embedded market.
Fundação para a Ciência e a Tecnologia (FCT)
Blazes: Coordination Analysis for Distributed Programs
Distributed consistency is perhaps the most discussed topic in distributed
systems today. Coordination protocols can ensure consistency, but in practice
they cause undesirable performance unless used judiciously. Scalable
distributed architectures avoid coordination whenever possible, but
under-coordinated systems can exhibit behavioral anomalies under fault, which
are often extremely difficult to debug. This raises significant challenges for
distributed system architects and developers. In this paper we present Blazes,
a cross-platform program analysis framework that (a) identifies program
locations that require coordination to ensure consistent executions, and (b)
automatically synthesizes application-specific coordination code that can
significantly outperform general-purpose techniques. We present two case
studies, one using annotated programs in the Twitter Storm system, and another
using the Bloom declarative language.
Comment: Updated to include additional materials from the original technical
report: derivation rules, output stream label
Passive Fault-Tolerance Management in Component-Based Embedded Systems
It is imperative to accept that failures can and will occur even in meticulously designed distributed systems and to design proper measures to counter those failures. Passive replication minimizes resource consumption by only activating redundant replicas in case of failures, as typically, providing and applying state updates is less resource demanding than requesting execution. However, most existing solutions for passive fault tolerance are designed and configured at design time, explicitly and statically identifying the most critical components and their number of replicas, lacking the needed flexibility to handle the runtime dynamics of distributed component-based embedded systems. This paper proposes a cost-effective adaptive fault tolerance solution with a significantly lower overhead compared to a strict active redundancy-based approach, achieving high error coverage with a minimum amount of redundancy. The activation of passive replicas is coordinated through a feedback-based coordination model that reduces the complexity of the needed interactions among components until a new collective global service solution is determined, hence improving the overall maintainability and robustness of the system.
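Passive (primary-backup) replication as described above — a primary that executes requests and merely checkpoints state to dormant replicas — can be sketched in a few lines. This is a generic illustration under assumed names (`Primary`, `PassiveReplica`, `failover`); it is not the paper's component model or any BOSS API:

```python
class PassiveReplica:
    """A dormant backup: it applies state updates instead of
    re-executing requests, which is the cheaper operation the
    passive-replication argument above relies on."""
    def __init__(self):
        self.state = 0
        self.active = False

    def apply_update(self, new_state):
        # track the primary's state without re-executing requests
        self.state = new_state

class Primary:
    """The only replica that actually executes requests."""
    def __init__(self, backups):
        self.state = 0
        self.backups = backups

    def execute(self, request):
        self.state += request       # full execution happens here only
        for b in self.backups:      # checkpoint: push a state update
            b.apply_update(self.state)
        return self.state

def failover(backups):
    """Promote the first surviving backup after a primary failure."""
    replacement = backups[0]
    replacement.active = True
    return replacement
```

On failover the promoted replica resumes from the last checkpointed state instead of replaying the request history, which is the resource saving the abstract refers to.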