4,547 research outputs found
Uncovering Bugs in Distributed Storage Systems during Testing (not in Production!)
Testing distributed systems is challenging due to multiple sources of nondeterminism. Conventional testing techniques, such as unit, integration and stress testing, are ineffective in preventing serious but subtle bugs from reaching production. Formal techniques, such as TLA+, can only verify high-level specifications of systems at the level of logic-based models, and fall short of checking the actual executable code. In this paper, we present a new methodology for testing distributed systems. Our approach applies advanced systematic testing techniques to thoroughly check that the executable code adheres to its high-level specifications, which significantly improves coverage of important system behaviors. Our methodology has been applied to three distributed storage systems in the Microsoft Azure cloud computing platform. In the process, numerous bugs were identified, reproduced, confirmed and fixed. These bugs required a subtle combination of concurrency and failures, making them extremely difficult to find with conventional testing techniques. An important advantage of our approach is that a bug is uncovered in a small setting and witnessed by a full system trace, which dramatically increases the productivity of debugging
A framework for proving the self-organization of dynamic systems
This paper aims at providing a rigorous definition of self- organization, one
of the most desired properties for dynamic systems (e.g., peer-to-peer systems,
sensor networks, cooperative robotics, or ad-hoc networks). We characterize
different classes of self-organization through liveness and safety properties
that both capture information re- garding the system entropy. We illustrate
these classes through study cases. The first ones are two representative P2P
overlays (CAN and Pas- try) and the others are specific implementations of
\Omega (the leader oracle) and one-shot query abstractions for dynamic
settings. Our study aims at understanding the limits and respective power of
existing self-organized protocols and lays the basis of designing robust
algorithm for dynamic systems
A Protocol for the Atomic Capture of Multiple Molecules at Large Scale
With the rise of service-oriented computing, applications are more and more
based on coordination of autonomous services. Envisioned over largely
distributed and highly dynamic platforms, expressing this coordination calls
for alternative programming models. The chemical programming paradigm, which
models applications as chemical solutions where molecules representing digital
entities involved in the computation, react together to produce a result, has
been recently shown to provide the needed abstractions for autonomic
coordination of services. However, the execution of such programs over large
scale platforms raises several problems hindering this paradigm to be actually
leveraged. Among them, the atomic capture of molecules participating in concur-
rent reactions is one of the most significant. In this paper, we propose a
protocol for the atomic capture of these molecules distributed and evolving
over a large scale platform. As the density of possible reactions is crucial
for the liveness and efficiency of such a capture, the protocol proposed is
made up of two sub-protocols, each of them aimed at addressing different levels
of densities of potential reactions in the solution. While the decision to
choose one or the other is local to each node participating in a program's
execution, a global coherent behaviour is obtained. Proof of liveness, as well
as intensive simulation results showing the efficiency and limited overhead of
the protocol are given.Comment: 13th International Conference on Distributed Computing and Networking
(2012
ARES: Adaptive, Reconfigurable, Erasure coded, atomic Storage
Atomicity or strong consistency is one of the fundamental, most intuitive,
and hardest to provide primitives in distributed shared memory emulations. To
ensure survivability, scalability, and availability of a storage service in the
presence of failures, traditional approaches for atomic memory emulation, in
message passing environments, replicate the objects across multiple servers.
Compared to replication based algorithms, erasure code-based atomic memory
algorithms has much lower storage and communication costs, but usually, they
are harder to design. The difficulty of designing atomic memory algorithms
further grows, when the set of servers may be changed to ensure survivability
of the service over software and hardware upgrades, while avoiding service
interruptions. Atomic memory algorithms for performing server reconfiguration,
in the replicated systems, are very few, complex, and are still part of an
active area of research; reconfigurations of erasure-code based algorithms are
non-existent.
In this work, we present ARES, an algorithmic framework that allows
reconfiguration of the underlying servers, and is particularly suitable for
erasure-code based algorithms emulating atomic objects. ARES introduces new
configurations while keeping the service available. To use with ARES we also
propose a new, and to our knowledge, the first two-round erasure code based
algorithm TREAS, for emulating multi-writer, multi-reader (MWMR) atomic objects
in asynchronous, message-passing environments, with near-optimal communication
and storage costs. Our algorithms can tolerate crash failures of any client and
some fraction of servers, and yet, guarantee safety and liveness property.
Moreover, by bringing together the advantages of ARES and TREAS, we propose an
optimized algorithm where new configurations can be installed without the
objects values passing through the reconfiguration clients
S+Net: extending functional coordination with extra-functional semantics
This technical report introduces S+Net, a compositional coordination language
for streaming networks with extra-functional semantics. Compositionality
simplifies the specification of complex parallel and distributed applications;
extra-functional semantics allow the application designer to reason about and
control resource usage, performance and fault handling. The key feature of
S+Net is that functional and extra-functional semantics are defined
orthogonally from each other. S+Net can be seen as a simultaneous
simplification and extension of the existing coordination language S-Net, that
gives control of extra-functional behavior to the S-Net programmer. S+Net can
also be seen as a transitional research step between S-Net and AstraKahn,
another coordination language currently being designed at the University of
Hertfordshire. In contrast with AstraKahn which constitutes a re-design from
the ground up, S+Net preserves the basic operational semantics of S-Net and
thus provides an incremental introduction of extra-functional control in an
existing language.Comment: 34 pages, 11 figures, 3 table
Reconfigurable Lattice Agreement and Applications
Reconfiguration is one of the central mechanisms in distributed systems. Due to failures and connectivity disruptions, the very set of service replicas (or servers) and their roles in the computation may have to be reconfigured over time. To provide the desired level of consistency and availability to applications running on top of these servers, the clients of the service should be able to reach some form of agreement on the system configuration. We observe that this agreement is naturally captured via a lattice partial order on the system states. We propose an asynchronous implementation of reconfigurable lattice agreement that implies elegant reconfigurable versions of a large class of lattice abstract data types, such as max-registers and conflict detectors, as well as popular distributed programming abstractions, such as atomic snapshot and commit-adopt
- âŠ