Incremental Consistency Guarantees for Replicated Objects
Programming with replicated objects is difficult. Developers must face the
fundamental trade-off between consistency and performance head on, while
struggling with the complexity of distributed storage stacks. We introduce
Correctables, a novel abstraction that hides most of this complexity, allowing
developers to focus on the task of balancing consistency and performance. To
aid developers with this task, Correctables provide incremental consistency
guarantees, which capture successive refinements on the result of an ongoing
operation on a replicated object. In short, applications receive both a
preliminary---fast, possibly inconsistent---result, as well as a
final---consistent---result that arrives later.
We show how to leverage incremental consistency guarantees by speculating on
preliminary values, trading throughput and bandwidth for improved latency. We
experiment with two popular storage systems (Cassandra and ZooKeeper) and three
applications: a Twissandra-based microblogging service, an ad serving system,
and a ticket selling system. Our evaluation on the Amazon EC2 platform with
YCSB workloads A, B, and C shows that we can reduce the latency of strongly
consistent operations by up to 40% (from 100ms to 60ms) at little cost (10%
bandwidth increase, 6% throughput drop) in the ad system. Even if the
preliminary result is frequently inconsistent (25% of accesses), incremental
consistency incurs a bandwidth overhead of only 27%.
Comment: 16 total pages, 12 figures. OSDI'16 (to appear)
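The preliminary/final pattern described in this abstract can be sketched roughly as follows. This is a minimal illustration only, not the paper's API: the class `Correctable`, its methods `on_update`, `on_final`, `deliver_preliminary`, and `deliver_final`, and the helper `speculative_read` are all hypothetical names chosen for this example. The sketch assumes speculation means computing on the preliminary value right away and redoing the work only if the final value differs.

```python
from typing import Any, Callable, Dict, List


class Correctable:
    """Hypothetical holder for a result that is refined over time."""

    def __init__(self) -> None:
        self._update_callbacks: List[Callable[[Any], None]] = []
        self._final_callbacks: List[Callable[[Any], None]] = []

    def on_update(self, callback: Callable[[Any], None]) -> "Correctable":
        # Called with the fast, possibly inconsistent result.
        self._update_callbacks.append(callback)
        return self

    def on_final(self, callback: Callable[[Any], None]) -> "Correctable":
        # Called with the consistent result that arrives later.
        self._final_callbacks.append(callback)
        return self

    def deliver_preliminary(self, value: Any) -> None:
        for callback in self._update_callbacks:
            callback(value)

    def deliver_final(self, value: Any) -> None:
        for callback in self._final_callbacks:
            callback(value)


def speculative_read(correctable: Correctable,
                     compute: Callable[[Any], Any]) -> Dict[str, Any]:
    """Compute on the preliminary value; recompute only on a mismatch."""
    state: Dict[str, Any] = {
        "speculated": False,
        "speculated_on": None,
        "result": None,
        "recomputed": False,
    }

    def handle_preliminary(value: Any) -> None:
        state["speculated"] = True
        state["speculated_on"] = value
        state["result"] = compute(value)

    def handle_final(value: Any) -> None:
        # Speculation pays off when the final value confirms the guess.
        if not state["speculated"] or value != state["speculated_on"]:
            state["result"] = compute(value)
            state["recomputed"] = True

    correctable.on_update(handle_preliminary).on_final(handle_final)
    return state
```

When the preliminary value is confirmed, the work done on it is already finished by the time the final value arrives, which is where the latency saving would come from. A mismatch costs one extra computation.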
A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems
Supercomputing systems today often come in the form of large numbers of
commodity systems linked together into a computing cluster. These systems, like
any distributed system, can have large numbers of independent hardware
components cooperating or collaborating on a computation. Unfortunately, any of
this vast number of components can fail at any time, resulting in potentially
erroneous output. In order to improve the robustness of supercomputing
applications in the presence of failures, many techniques have been developed
to provide resilience to these kinds of system faults. This survey provides an
overview of these various fault-tolerance techniques.
Comment: 11 pages
Replica determinism and flexible scheduling in hard real-time dependable systems
Fault-tolerant real-time systems are typically based on active replication, where replicated entities are required to deliver their outputs in an identical order within a given time interval. Distributed scheduling of replicated tasks, however, violates this requirement if on-line scheduling, preemptive scheduling, or scheduling of dissimilar replicated task sets is employed. This problem of inconsistent task outputs has been solved previously by coordinating the decisions of the local schedulers such that replicated tasks are executed in an identical order. Global coordination results either in an extremely high communication effort to agree on each schedule decision or in an overly restrictive execution model where on-line scheduling, arbitrary preemptions, and nonidentically replicated task sets are not allowed. To overcome these restrictions, a new method, called timed messages, is introduced. Timed messages guarantee deterministic operation by presenting consistent message versions to the replicated tasks. This approach is based on simulated common knowledge and a sparse time base. Timed messages are very effective since they neither require communication between the local schedulers nor restrict the use of on-line flexible scheduling, preemptions, and nonidentically replicated task sets.
DataWarp: Building Applications which Make Progress in an Inconsistent World
The usual approach to dealing with imperfections in data is to attempt to eliminate them. However, the nature of modern systems means this is often futile. This paper describes an approach which permits applications to operate notwithstanding inconsistent data. Instead of attempting to extract a single, correct view of the world from its data, a DataWarp application constructs a collection of interpretations. It adopts one of these and continues work. Since it acts on assumptions, the DataWarp application considers its recent work to be provisional, expecting that most of these actions will eventually become definitive. Should the application decide to adopt an alternative data view, it may then need to void provisional actions before resuming work. We describe the DataWarp architecture, discuss its implementation, and describe an experiment in which a DataWarp application in an environment containing inconsistent data achieves better results than its conventional counterpart.
LightChain: A DHT-based Blockchain for Resource Constrained Environments
As an append-only distributed database, blockchain is used in a wide
variety of applications, including cryptocurrency and the Internet of Things
(IoT). Existing blockchain solutions have downsides in communication and
storage efficiency, convergence to centralization, and consistency problems. In
this paper, we propose LightChain, which is the first blockchain architecture
that operates over a Distributed Hash Table (DHT) of participating peers.
LightChain is a permissionless blockchain that provides addressable blocks and
transactions within the network, which makes them efficiently accessible by all
the peers. Each block and transaction is replicated within the DHT of peers and
is retrieved in an on-demand manner. Hence, peers in LightChain are not
required to retrieve or keep the entire blockchain. LightChain is fair as all
of the participating peers have a uniform chance of being involved in the
consensus regardless of their influence such as hashing power or stake.
LightChain provides a deterministic fork-resolving strategy as well as a
blacklisting mechanism, and it is secure against colluding adversarial peers
attacking the availability and integrity of the system. We provide mathematical
analysis and experimental results on scenarios involving 10K nodes to
demonstrate the security and fairness of LightChain. As we experimentally show
in this paper, compared to the mainstream blockchains like Bitcoin and
Ethereum, LightChain requires around 66 times less per-node storage and is
around 380 times faster at bootstrapping a new node into the system, while each
LightChain node is equally likely to be rewarded for participating in the protocol.
PaRiS: Causally Consistent Transactions with Non-blocking Reads and Partial Replication
Geo-replicated data platforms are at the backbone of several large-scale
online services. Transactional Causal Consistency (TCC) is an attractive
consistency level for building such platforms. TCC avoids many anomalies of
eventual consistency, eschews the synchronization costs of strong consistency,
and supports interactive read-write transactions. Partial replication is
another attractive design choice for building geo-replicated platforms, as it
increases the storage capacity and reduces update propagation costs. This paper
presents PaRiS, the first TCC system that supports partial replication and
implements non-blocking parallel read operations, whose latency is paramount
for the performance of read-intensive applications. PaRiS relies on a novel
protocol to track dependencies, called Universal Stable Time (UST). By means of
a lightweight background gossip process, UST identifies a snapshot of the data
that has been installed by every DC in the system. Hence, transactions can
consistently read from such a snapshot on any server in any replication site
without having to block. Moreover, PaRiS requires only one timestamp to track
dependencies and define transactional snapshots, thereby achieving resource
efficiency and scalability. We evaluate PaRiS on a large-scale AWS deployment
composed of up to 10 replication sites. We show that PaRiS scales well with the
number of DCs and partitions, while being able to handle larger data-sets than
existing solutions that assume full replication. We also demonstrate a
performance gain of non-blocking reads vs. a blocking alternative (up to 1.47x
higher throughput with 5.91x lower latency for read-dominated workloads and up
to 1.46x higher throughput with 20.56x lower latency for write-heavy
workloads).
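The snapshot idea in this abstract can be illustrated with a small sketch. It rests on a simplifying assumption that is not stated in the abstract: that each DC reports a single integer timestamp up to which it has installed all updates, and that the stable time is the minimum of those reports. The function names `universal_stable_time` and `read_at_snapshot` are hypothetical, and the gossip process that would spread the reports is left out.

```python
from typing import Dict, List, Optional, Tuple


def universal_stable_time(installed_up_to: Dict[str, int]) -> int:
    """Return the largest timestamp installed by every DC (assumed: the minimum)."""
    if not installed_up_to:
        raise ValueError("at least one DC report is required")
    return min(installed_up_to.values())


def read_at_snapshot(versions: List[Tuple[int, str]],
                     stable_time: int) -> Optional[str]:
    """Return the newest version whose timestamp is within the stable snapshot."""
    visible = [(ts, value) for ts, value in versions if ts <= stable_time]
    if not visible:
        return None
    # max() compares timestamps first, so it picks the newest visible version.
    return max(visible)[1]
```

Because every DC has already installed everything at or below the stable time, a read at that snapshot never has to wait for a missing update. The price is that reads may see slightly older data than the newest version written.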
Semiconductor manufacturing simulation design and analysis with limited data
This paper discusses simulation design and analysis for Silicon Carbide (SiC) manufacturing operations management at the New York Power Electronics Manufacturing Consortium (PEMC) facility. Prior work has addressed the development of manufacturing system simulation as decision support for solving the strategic equipment portfolio selection problem for the SiC fab design [1]. As we move into the phase of collecting data from the equipment purchased for the PEMC facility, we discuss how to redesign our manufacturing simulations and analyze their outputs to overcome the challenges that naturally arise in the presence of limited fab data. We conclude with insights on how an approach aimed at reflecting learning from data can enable our discrete-event stochastic simulation to accurately estimate the performance measures for SiC manufacturing at the PEMC facility.