2,290 research outputs found

    A Robust Fault-Tolerant and Scalable Cluster-wide Deduplication for Shared-Nothing Storage Systems

    Full text link
    Deduplication has been widely employed in distributed storage systems to improve space efficiency. Traditional deduplication research ignores the design requirements of shared-nothing distributed storage systems, such as the absence of a central metadata bottleneck, scalability, and storage rebalancing. Further, deduplication introduces transactional changes, which are prone to errors in the event of a system failure, resulting in inconsistencies between data and deduplication metadata. In this paper, we propose a robust, fault-tolerant and scalable cluster-wide deduplication scheme that can eliminate duplicate copies across the cluster. We design a distributed deduplication metadata shard that guarantees performance scalability while preserving the design constraints of shared-nothing storage systems. Chunks and deduplication metadata are placed cluster-wide based on the content fingerprint of the chunks. To ensure transactional consistency and garbage identification, we employ a flag-based asynchronous consistency mechanism. We implement the proposed deduplication on Ceph. The evaluation shows high disk-space savings with minimal performance degradation, as well as high robustness in the event of a sudden server failure. Comment: 6 pages including references
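    A minimal Python sketch of the fingerprint-based placement idea described in this abstract: both a chunk and its deduplication metadata land on a shard chosen purely from the chunk's content fingerprint, so no central metadata server is consulted. The shard count, the SHA-256 fingerprint, and the reference-count bookkeeping are illustrative assumptions, not the paper's Ceph implementation.

import hashlib

NUM_SHARDS = 8  # assumed number of deduplication metadata shards in the cluster

def fingerprint(chunk: bytes) -> str:
    # Content fingerprint of a chunk.
    return hashlib.sha256(chunk).hexdigest()

def shard_for(fp: str) -> int:
    # Deterministic, fingerprint-based placement: every node computes the same
    # shard from the fingerprint alone, with no central lookup.
    return int(fp, 16) % NUM_SHARDS

# Per-shard deduplication metadata: fingerprint -> reference count.
shards = [dict() for _ in range(NUM_SHARDS)]

def write_chunk(chunk: bytes) -> tuple[int, str]:
    fp = fingerprint(chunk)
    meta = shards[shard_for(fp)]
    if fp in meta:
        meta[fp] += 1   # duplicate: only the reference count changes
    else:
        meta[fp] = 1    # new chunk: store the data and its metadata on this shard
    return shard_for(fp), fp

if __name__ == "__main__":
    print(write_chunk(b"hello world"))
    print(write_chunk(b"hello world"))  # same shard, reference count is now 2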

    An Information-Theoretic Analysis of Deduplication

    Full text link
    Deduplication finds and removes long-range data duplicates. It is commonly used in cloud and enterprise server settings and has been successfully applied to primary, backup, and archival storage. Despite its practical importance as a source-coding technique, its analysis from the point of view of information theory is missing. This paper provides such an information-theoretic analysis of data deduplication. It introduces a new source model adapted to the deduplication setting. It formalizes the two standard fixed-length and variable-length deduplication schemes, and it introduces a novel multi-chunk deduplication scheme. It then provides an analysis of these three deduplication variants, emphasizing the importance of boundary synchronization between source blocks and deduplication chunks. In particular, under fairly mild assumptions, the proposed multi-chunk deduplication scheme is shown to be order optimal. Comment: 27 pages
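    A small Python sketch contrasting the two standard schemes this paper formalizes, fixed-length and variable-length (content-defined) chunking, and the boundary-synchronization effect it emphasizes. The window size and the toy boundary test are assumptions standing in for a real rolling hash.

def fixed_length_chunks(data: bytes, size: int = 8):
    # Cut every `size` bytes, regardless of content.
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_length_chunks(data: bytes, window: int = 4):
    # Content-defined cuts: declare a boundary whenever a toy rolling value over
    # the last `window` bytes hits a fixed pattern, so boundaries resynchronize
    # with the content after an insertion or deletion.
    chunks, start = [], 0
    for i in range(window, len(data)):
        if sum(data[i - window:i]) % 6 == 0:   # toy stand-in for a rolling hash
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

if __name__ == "__main__":
    text = b"abcdefgh" * 4
    shifted = b"X" + text   # a single inserted byte shifts every fixed-length boundary
    print(fixed_length_chunks(text) == fixed_length_chunks(shifted))                 # False: no chunk survives
    print(set(variable_length_chunks(text)) & set(variable_length_chunks(shifted)))  # shared chunks survive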

    Exploring heterogeneity of unreliable machines for p2p backup

    Full text link
    P2P architecture is a viable option for enterprise backup. In contrast to dedicated backup servers, nowadays the standard solution, making backups directly on an organization's workstations should be cheaper (as existing hardware is used), more efficient (as there is no single bottleneck server), and more reliable (as the machines are geographically dispersed). We present the architecture of a p2p backup system that uses pairwise replication contracts between a data owner and a replicator. In contrast to standard p2p storage systems that use a DHT directly, the contracts allow our system to optimize replica placement according to a specific optimization strategy, and thus to take advantage of the heterogeneity of the machines and the network. Such optimization is particularly appealing in the context of backup: replicas can be geographically dispersed, the load sent over the network can be minimized, or the optimization goal can be to minimize the backup/restore time. However, managing the contracts, keeping them consistent, and adjusting them in response to a dynamically changing environment is challenging. We built a scientific prototype and ran experiments on 150 workstations in the university's computer laboratories and, separately, on 50 PlanetLab nodes. We found that the main factor affecting the quality of the system is the availability of the machines. Yet our main conclusion is that it is possible to build an efficient and reliable backup system on highly unreliable machines (our computers had just 13% average availability).
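    A toy Python sketch of availability-driven contract selection in the spirit of the system described above. The greedy strategy, the availability target, and the availability figures are assumptions for illustration, not the prototype's actual placement optimizer.

from dataclasses import dataclass

@dataclass
class Machine:
    name: str
    availability: float   # estimated fraction of time the machine is online

def choose_replicators(owner: Machine, peers: list[Machine], target: float = 0.99):
    # Greedily sign pairwise replication contracts with the most available peers
    # until the chance that at least one replica is online reaches the target.
    contracts, p_all_offline = [], 1.0
    for peer in sorted(peers, key=lambda m: m.availability, reverse=True):
        if peer.name == owner.name:
            continue
        contracts.append(peer)
        p_all_offline *= (1.0 - peer.availability)
        if 1.0 - p_all_offline >= target:
            break
    return contracts

if __name__ == "__main__":
    peers = [Machine(f"ws{i}", a) for i, a in enumerate([0.13, 0.25, 0.40, 0.13, 0.60])]
    owner = Machine("owner", 0.13)
    picked = choose_replicators(owner, peers)
    # With availabilities this low, the owner ends up contracting every peer.
    print([m.name for m in picked])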

    Generalized Deduplication: Bounds, Convergence, and Asymptotic Properties

    Full text link
    We study a generalization of deduplication that enables lossless deduplication of highly similar data, and we show that standard deduplication with a fixed chunk length is a special case. We provide bounds on the expected length of coded sequences for generalized deduplication and show that the coding has asymptotic near-entropy cost under the proposed source model. More importantly, we show that generalized deduplication allows for convergence that is multiple orders of magnitude faster than standard deduplication. This means that generalized deduplication can provide compression benefits much earlier than standard deduplication, which is key in practical systems. Numerical examples demonstrate our results, showing that our lower bounds are achievable and illustrating the potential gain of using the generalization over standard deduplication. In fact, we show that even for a simple case of generalized deduplication, the gain in convergence speed is linear in the size of the data chunks. Comment: 15 pages, 4 figures. This is the full version of a paper accepted for GLOBECOM 201
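    A minimal Python sketch of the generalized-deduplication idea: each fixed-length chunk is transformed into a base, which is deduplicated, and a small deviation, which is stored verbatim. The nibble-mask transformation and chunk size are illustrative assumptions; a zero-size deviation recovers standard fixed-length deduplication as the special case noted above.

CHUNK = 8          # bytes per chunk (assumed)
DEV_MASK = 0x0F    # low nibble of the last byte is treated as the deviation

def split(chunk: bytes):
    # Transform a chunk into (base, deviation); highly similar chunks share a base.
    base = chunk[:-1] + bytes([chunk[-1] & ~DEV_MASK])
    deviation = chunk[-1] & DEV_MASK
    return base, deviation

def store(data: bytes):
    # Base pool (deduplicated) plus a per-chunk recipe of (base_id, deviation).
    bases, recipe = {}, []
    for i in range(0, len(data), CHUNK):
        base, dev = split(data[i:i + CHUNK].ljust(CHUNK, b"\0"))
        base_id = bases.setdefault(base, len(bases))
        recipe.append((base_id, dev))
    return bases, recipe

if __name__ == "__main__":
    # Two near-identical chunks differing only in the low nibble of one byte
    # deduplicate to a single stored base under generalized deduplication.
    bases, recipe = store(b"ABCDEFG0" + b"ABCDEFG7")
    print(len(bases), recipe)   # 1 base, two recipe entries with different deviations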