A Robust Fault-Tolerant and Scalable Cluster-wide Deduplication for Shared-Nothing Storage Systems
Deduplication has been widely employed in distributed storage systems to improve space efficiency. Traditional deduplication research ignores the design constraints of shared-nothing distributed storage systems, such as the absence of a central metadata bottleneck, scalability, and storage rebalancing. Further, deduplication introduces transactional changes, which are prone to errors in the event of a system failure, resulting in inconsistencies between data and deduplication metadata. In this paper, we propose a robust, fault-tolerant, and scalable cluster-wide deduplication scheme that eliminates duplicate copies across the cluster. We design a distributed deduplication metadata shard that guarantees performance scalability while preserving the design constraints of shared-nothing storage systems. Chunks and deduplication metadata are placed cluster-wide based on the content fingerprints of the chunks. To ensure transactional consistency and garbage identification, we employ a flag-based asynchronous consistency mechanism. We implement the proposed deduplication scheme on Ceph. The evaluation shows high disk-space savings with minimal performance degradation as well as high robustness in the event of
sudden server failure.
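To make the fingerprint-based placement concrete, here is a minimal Python sketch of the idea described above: a chunk's content fingerprint, rather than its logical position, determines which node stores the chunk and its deduplication metadata. The shard count, hash choice, and function names are illustrative assumptions, not details taken from the paper or from Ceph.

    import hashlib

    NUM_SHARDS = 8  # assumed cluster size, for illustration only

    def fingerprint(chunk: bytes) -> str:
        """Content fingerprint of a chunk (SHA-256 assumed here)."""
        return hashlib.sha256(chunk).hexdigest()

    def placement_shard(chunk: bytes) -> int:
        """Map a chunk to the node that stores it and its dedup metadata."""
        return int(fingerprint(chunk), 16) % NUM_SHARDS

    chunk = b"example chunk contents"
    print(fingerprint(chunk)[:16], "-> shard", placement_shard(chunk))

Because identical chunks hash to the same shard, duplicate detection never requires consulting a central metadata server, which is the shared-nothing constraint the abstract emphasizes.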
An Information-Theoretic Analysis of Deduplication
Deduplication finds and removes long-range data duplicates. It is commonly
used in cloud and enterprise server settings and has been successfully applied
to primary, backup, and archival storage. Despite its practical importance as a
source-coding technique, its analysis from the point of view of information
theory is missing. This paper provides such an information-theoretic analysis
of data deduplication. It introduces a new source model adapted to the
deduplication setting. It formalizes the two standard fixed-length and
variable-length deduplication schemes, and it introduces a novel multi-chunk
deduplication scheme. It then provides an analysis of these three deduplication
variants, emphasizing the importance of boundary synchronization between source
blocks and deduplication chunks. In particular, under fairly mild assumptions,
the proposed multi-chunk deduplication scheme is shown to be order optimal.
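As a concrete reference point for the fixed-length scheme the paper formalizes, the following Python sketch cuts the source into equal-size chunks and stores each distinct chunk once. The chunk length, hash-keyed store, and recipe representation are illustrative assumptions rather than the paper's formal construction.

    import hashlib

    def dedup_fixed(data: bytes, chunk_len: int = 4):
        store = {}   # fingerprint -> chunk (each distinct chunk stored once)
        recipe = []  # sequence of fingerprints that reconstructs the input
        for i in range(0, len(data), chunk_len):
            chunk = data[i:i + chunk_len]
            fp = hashlib.sha256(chunk).hexdigest()
            store.setdefault(fp, chunk)
            recipe.append(fp)
        return store, recipe

    store, recipe = dedup_fixed(b"abcdabcdabcdwxyz")
    print(len(recipe), "chunks,", len(store), "stored")  # 4 chunks, 2 stored

The boundary-synchronization issue the analysis emphasizes is visible even in this toy version: if the source were shifted by one byte, the same content would fall into entirely different chunks and no duplicates would be found.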
Exploring heterogeneity of unreliable machines for p2p backup
A P2P architecture is a viable option for enterprise backup. In contrast to dedicated backup servers, the standard solution today, making backups directly on an organization's workstations should be cheaper (existing hardware is used), more efficient (there is no single bottleneck server), and more reliable (the machines are geographically dispersed).
We present the architecture of a p2p backup system that uses pairwise replication contracts between a data owner and a replicator. In contrast to standard p2p storage systems that place data directly through a DHT, the contracts allow our system to optimize replica placement according to a specific optimization strategy, and thus to take advantage of the heterogeneity of the machines and of the network. Such optimization is particularly appealing in the context of backup: replicas can be geographically dispersed, the load sent over the network can be minimized, or the goal can be to minimize the backup/restore time. However, managing the contracts, keeping them consistent, and adjusting them in response to a dynamically changing environment is challenging.
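A hypothetical sketch of what such a pairwise contract and placement strategy might look like is given below; the data structures, the availability-based selection rule, and the replication factor are assumptions chosen for illustration, not the system's actual policy.

    from dataclasses import dataclass

    @dataclass
    class Machine:
        name: str
        availability: float  # fraction of time the machine is online

    def choose_replicators(owner: Machine, peers: list, k: int = 2):
        """Form k pairwise contracts for `owner`, preferring available peers."""
        candidates = [m for m in peers if m.name != owner.name]
        candidates.sort(key=lambda m: m.availability, reverse=True)
        return [(owner.name, m.name) for m in candidates[:k]]

    peers = [Machine("ws1", 0.13), Machine("ws2", 0.42), Machine("ws3", 0.25)]
    print(choose_replicators(Machine("ws1", 0.13), peers))
    # [('ws1', 'ws2'), ('ws1', 'ws3')]

Other strategies (geographic dispersion, network load, backup/restore time) would only change the sorting key in this sketch, which is the flexibility the contracts are meant to provide.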
We built a scientific prototype and ran the experiments on 150 workstations
in the university's computer laboratories and, separately, on 50 PlanetLab
nodes. We found that the main factor affecting the quality of the system is the availability of the machines. Yet, our main conclusion is that it is possible to build an efficient and reliable backup system on highly unreliable machines (our computers had just 13% average availability).
Generalized Deduplication: Bounds, Convergence, and Asymptotic Properties
We study a generalization of deduplication that enables lossless deduplication of highly similar data and show that standard deduplication with a fixed chunk length is a special case. We provide bounds on the expected length
of coded sequences for generalized deduplication and show that the coding has
asymptotic near-entropy cost under the proposed source model. More importantly,
we show that generalized deduplication allows for multiple orders of magnitude
faster convergence than standard deduplication. This means that generalized
deduplication can provide compression benefits much earlier than standard
deduplication, which is key in practical systems. Numerical examples
demonstrate our results, showing that our lower bounds are achievable, and
illustrating the potential gain of using the generalization over standard
deduplication. In fact, we show that even for a simple case of generalized
deduplication, the gain in convergence speed is linear with the size of the
data chunks.
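One common way to picture generalized deduplication, which may help make the abstract concrete, is the base/deviation view: each fixed-length chunk is split into a base that is deduplicated and a small deviation stored per chunk, so highly similar chunks collapse onto a single base. The split used below (masking the low bits of each byte) is an illustrative transform chosen for brevity, not the construction analyzed in the paper.

    import hashlib

    def split_chunk(chunk: bytes):
        base = bytes(b & 0xF0 for b in chunk)       # deduplicated part
        deviation = bytes(b & 0x0F for b in chunk)  # kept verbatim per chunk
        return base, deviation

    def generalized_dedup(data: bytes, chunk_len: int = 4):
        bases = {}   # fingerprint -> base (each distinct base stored once)
        recipe = []  # (base fingerprint, deviation) pairs
        for i in range(0, len(data), chunk_len):
            base, dev = split_chunk(data[i:i + chunk_len])
            fp = hashlib.sha256(base).hexdigest()
            bases.setdefault(fp, base)
            recipe.append((fp, dev))
        return bases, recipe

    bases, recipe = generalized_dedup(b"AAABAAACAAADAAAB")
    print(len(recipe), "chunks,", len(bases), "unique bases")  # 4 chunks, 1 base

In this toy construction, standard fixed-length deduplication is recovered when the transform keeps the whole chunk as the base and the deviation is always empty, mirroring the special-case relationship stated in the abstract.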