Generalized Deduplication: Bounds, Convergence, and Asymptotic Properties
We study a generalization of deduplication that enables lossless
deduplication of highly similar data, and we show that standard deduplication
with a fixed chunk length is a special case. We provide bounds on the expected length
of coded sequences for generalized deduplication and show that the coding has
asymptotic near-entropy cost under the proposed source model. More importantly,
we show that generalized deduplication allows for multiple orders of magnitude
faster convergence than standard deduplication. This means that generalized
deduplication can provide compression benefits much earlier than standard
deduplication, which is key in practical systems. Numerical examples
demonstrate our results, showing that our lower bounds are achievable, and
illustrating the potential gain of using the generalization over standard
deduplication. In fact, we show that even for a simple case of generalized
deduplication, the gain in convergence speed is linear with the size of the
data chunks.

Comment: 15 pages, 4 figures. This is the full version of a paper accepted for GLOBECOM 201
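To make the contrast concrete, here is a minimal Python sketch of fixed-length deduplication next to a toy "generalized" variant that deduplicates only a base part of each chunk and stores a small per-chunk deviation. The chunk length, the tail-byte split, and the example data are illustrative assumptions; the transform analysed in the paper is different.

```python
# Minimal sketch: fixed-length deduplication vs. a toy "generalized" variant.
# The split of each chunk into a deduplicated base plus a per-chunk deviation
# is a simplified illustration, not the paper's actual transform.

def fixed_length_dedup(data: bytes, chunk_len: int):
    """Standard deduplication: store each distinct chunk once."""
    store = {}   # chunk -> id
    refs = []    # sequence of chunk ids
    for i in range(0, len(data), chunk_len):
        chunk = data[i:i + chunk_len]
        refs.append(store.setdefault(chunk, len(store)))
    return store, refs

def generalized_dedup(data: bytes, chunk_len: int, dev_len: int = 2):
    """Toy generalized deduplication: deduplicate the chunk's base and
    keep its last dev_len bytes as a per-chunk deviation."""
    bases = {}   # base -> id
    refs = []    # (base id, deviation) per chunk
    for i in range(0, len(data), chunk_len):
        chunk = data[i:i + chunk_len]
        base, dev = chunk[:-dev_len], chunk[-dev_len:]
        refs.append((bases.setdefault(base, len(bases)), dev))
    return bases, refs

if __name__ == "__main__":
    # Chunks that differ only in their last bytes deduplicate under the
    # generalized scheme but not under fixed-length deduplication.
    data = b"sensorAA" + b"sensorAB" + b"sensorAC"
    print(len(fixed_length_dedup(data, 8)[0]))  # 3 distinct chunks stored
    print(len(generalized_dedup(data, 8)[0]))   # 1 distinct base stored
```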
An Information-Theoretic Analysis of Deduplication
Deduplication finds and removes long-range data duplicates. It is commonly
used in cloud and enterprise server settings and has been successfully applied
to primary, backup, and archival storage. Despite its practical importance as a
source-coding technique, its analysis from the point of view of information
theory is missing. This paper provides such an information-theoretic analysis
of data deduplication. It introduces a new source model adapted to the
deduplication setting. It formalizes the two standard fixed-length and
variable-length deduplication schemes, and it introduces a novel multi-chunk
deduplication scheme. It then provides an analysis of these three deduplication
variants, emphasizing the importance of boundary synchronization between source
blocks and deduplication chunks. In particular, under fairly mild assumptions,
the proposed multi-chunk deduplication scheme is shown to be order optimal.

Comment: 27 page
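As an illustration of the boundary-synchronization issue, the sketch below contrasts fixed-length chunking with a simple content-defined (variable-length) chunker whose cut points depend only on local content. The window size, modulus, and hash choice are arbitrary assumptions for the demonstration and are not the constructions analysed in the paper.

```python
# Sketch: fixed-length vs. content-defined (variable-length) chunking.
# Boundary rule: cut whenever a hash of the trailing byte window is
# 0 mod `modulus`; this is a common illustration, not the paper's scheme.

import hashlib
import random

def fixed_chunks(data: bytes, chunk_len: int):
    return [data[i:i + chunk_len] for i in range(0, len(data), chunk_len)]

def variable_chunks(data: bytes, window: int = 4, modulus: int = 64):
    """Cut a chunk where the hash of the trailing window is 0 mod modulus,
    so boundaries depend on content rather than byte position."""
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        h = int.from_bytes(hashlib.sha1(data[i - window:i]).digest()[:4], "big")
        if h % modulus == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return chunks

if __name__ == "__main__":
    random.seed(0)
    base = bytes(random.randrange(256) for _ in range(2000))
    shifted = b"XYZ" + base  # a small insertion shifts every later byte
    # Fixed-length chunks lose alignment after the insertion; content-defined
    # chunks re-synchronize, so most chunks still match across the two versions.
    common_fixed = set(fixed_chunks(base, 32)) & set(fixed_chunks(shifted, 32))
    common_var = set(variable_chunks(base)) & set(variable_chunks(shifted))
    print(len(common_fixed), len(common_var))  # typically 0 vs. nearly all
```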
A Robust Fault-Tolerant and Scalable Cluster-wide Deduplication for Shared-Nothing Storage Systems
Deduplication has been largely employed in distributed storage systems to
improve space efficiency. Traditional deduplication research ignores the design
specifications of shared-nothing distributed storage systems such as no central
metadata bottleneck, scalability, and storage rebalancing. Further,
deduplication introduces transactional changes, which are prone to errors in
the event of a system failure, resulting in inconsistencies in data and
deduplication metadata. In this paper, we propose a robust, fault-tolerant and
scalable cluster-wide deduplication that can eliminate duplicate copies across
the cluster. We design a distributed deduplication metadata shard which
guarantees performance scalability while preserving the design constraints of
shared-nothing storage systems. The placement of chunks and deduplication
metadata is made cluster-wide based on the content fingerprint of chunks. To
ensure transactional consistency and garbage identification, we employ a
flag-based asynchronous consistency mechanism. We implement the proposed
deduplication on Ceph. The evaluation shows high disk-space savings with
minimal performance degradation as well as high robustness in the event of
sudden server failure.

Comment: 6 Pages including reference
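A rough sketch of the placement idea, assuming a toy in-memory cluster: both the chunk and its deduplication-metadata entry are routed by the chunk's content fingerprint, so no central metadata service is consulted. The modulo shard mapping and reference-count bookkeeping below are simplifications; the paper's Ceph-based design relies on the cluster's own placement machinery and a flag-based asynchronous consistency protocol that this sketch omits.

```python
# Sketch of content-fingerprint-based placement across metadata shards.
# Shard selection and storage structures are illustrative assumptions only.

import hashlib

NUM_SHARDS = 8
shards = [dict() for _ in range(NUM_SHARDS)]  # fingerprint -> (chunk, refcount)

def fingerprint(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

def shard_for(fp: str) -> int:
    """Deterministically map a fingerprint to a chunk/metadata shard."""
    return int(fp, 16) % NUM_SHARDS

def write_chunk(chunk: bytes) -> str:
    fp = fingerprint(chunk)
    shard = shards[shard_for(fp)]
    if fp in fp_shard_keys(shard):
        data, refs = shard[fp]
        shard[fp] = (data, refs + 1)   # duplicate: bump the reference count
    else:
        shard[fp] = (chunk, 1)         # first copy: store the chunk once
    return fp

def fp_shard_keys(shard):
    return shard

if __name__ == "__main__":
    a = write_chunk(b"hello world")
    b = write_chunk(b"hello world")    # cluster-wide duplicate is detected
    assert a == b and shards[shard_for(a)][a][1] == 2
```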
Better duplicate detection for systematic reviewers: Evaluation of Systematic Review Assistant-Deduplication Module
BACKGROUND: A major problem arising from searching across bibliographic databases is the retrieval of duplicate citations. Removing such duplicates is an essential task to ensure systematic reviewers do not waste time screening the same citation multiple times. Although reference management software uses algorithms to remove duplicate records, this is only partially successful and necessitates removing the remaining duplicates manually. This time-consuming task leads to wasted resources. We sought to evaluate the effectiveness of a newly developed deduplication program against EndNote.

METHODS: A literature search of 1,988 citations was manually inspected, and duplicate citations were identified and coded to create a benchmark dataset. The Systematic Review Assistant-Deduplication Module (SRA-DM) was iteratively developed and tested using the benchmark dataset and compared with EndNote’s default one-step auto-deduplication process matching on (‘author’, ‘year’, ‘title’). The accuracy of deduplication was reported by calculating the sensitivity and specificity. Further validation tests, with three additional benchmarked literature searches comprising a total of 4,563 citations, were performed to determine the reliability of the SRA-DM algorithm.

RESULTS: The sensitivity (84%) and specificity (100%) of the SRA-DM were superior to EndNote (sensitivity 51%, specificity 99.83%). Validation testing on three additional biomedical literature searches demonstrated that SRA-DM consistently achieved higher sensitivity than EndNote (90% vs 63%, 84% vs 73%, and 84% vs 64%). Furthermore, the specificity of SRA-DM was 100%, whereas the specificity of EndNote was imperfect (average 99.75%), with some unique records wrongly assigned as duplicates. Overall, there was a 42.86% increase in the number of duplicate records detected with SRA-DM compared with EndNote auto-deduplication.

CONCLUSIONS: The Systematic Review Assistant-Deduplication Module offers users a reliable program to remove duplicate records with greater sensitivity and specificity than EndNote. This application will save researchers and information specialists time and avoid research waste. The deduplication program is freely available online.
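For reference, a small sketch of how sensitivity and specificity can be computed against a manually coded benchmark of duplicate/unique labels. The record identifiers and toy labels are purely illustrative and are not drawn from the study's datasets.

```python
# Sketch: sensitivity and specificity of a deduplication tool against a
# hand-coded benchmark. All names and data below are hypothetical.

def sensitivity_specificity(benchmark, predicted):
    """benchmark, predicted: dicts mapping record id -> True if flagged duplicate."""
    tp = sum(1 for r, dup in benchmark.items() if dup and predicted.get(r, False))
    fn = sum(1 for r, dup in benchmark.items() if dup and not predicted.get(r, False))
    tn = sum(1 for r, dup in benchmark.items() if not dup and not predicted.get(r, False))
    fp = sum(1 for r, dup in benchmark.items() if not dup and predicted.get(r, False))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

if __name__ == "__main__":
    benchmark = {"r1": True, "r2": True, "r3": False, "r4": False}
    tool      = {"r1": True, "r2": False, "r3": False, "r4": False}
    print(sensitivity_specificity(benchmark, tool))  # (0.5, 1.0)
```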