An Information-Theoretic Analysis of Deduplication
Deduplication finds and removes long-range data duplicates. It is commonly
used in cloud and enterprise server settings and has been successfully applied
to primary, backup, and archival storage. Despite its practical importance as a
source-coding technique, its analysis from the point of view of information
theory is missing. This paper provides such an information-theoretic analysis
of data deduplication. It introduces a new source model adapted to the
deduplication setting. It formalizes the two standard fixed-length and
variable-length deduplication schemes, and it introduces a novel multi-chunk
deduplication scheme. It then provides an analysis of these three deduplication
variants, emphasizing the importance of boundary synchronization between source
blocks and deduplication chunks. In particular, under fairly mild assumptions,
the proposed multi-chunk deduplication scheme is shown to be order optimal.
Comment: 27 pages
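The role of boundary synchronization can be illustrated with a small sketch. It is not taken from the paper: the function names, the toy source blocks, the 8-byte chunk size, and the use of a `.` byte as a chunk delimiter are all assumptions chosen for illustration. The sketch compares fixed-length chunking with a simple delimiter-based variable-length chunking on a sequence of repeated source blocks of unequal length.

```python
def fixed_chunks(data: bytes, size: int = 8) -> list[bytes]:
    """Split data into consecutive chunks of `size` bytes (last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes, marker: int = ord(".")) -> list[bytes]:
    """Cut after every occurrence of `marker` (a toy content-defined boundary)."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte == marker:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def stored_bytes(chunks: list[bytes]) -> int:
    """Bytes kept when every distinct chunk is stored only once (pointer cost ignored)."""
    return sum(len(c) for c in set(chunks))


# Hypothetical source: three blocks of different lengths, repeated in varying order.
a, b, c = b"the cat sat.", b"a dog ran off.", b"hi."
data = a + b + c + a + b + a + c + b

fixed_cost = stored_bytes(fixed_chunks(data))        # chunks drift out of alignment
variable_cost = stored_bytes(variable_chunks(data))  # chunks coincide with source blocks
print(len(data), fixed_cost, variable_cost)
```

In this toy input the fixed-length chunks never line up with the repeated blocks, so no chunk repeats and nothing is saved, whereas the delimiter-based chunks coincide with the source blocks and each block is stored once.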
File Updates Under Random/Arbitrary Insertions And Deletions
A client/encoder edits a file, as modeled by an insertion-deletion (InDel)
process. An old copy of the file is stored remotely at a data-centre/decoder,
and is also available to the client. We consider the problem of throughput- and
computationally-efficient communication from the client to the data-centre, to
enable the server to update its copy to the newly edited file. We study two
models for the source files/edit patterns: the random pre-edit sequence
left-to-right random InDel (RPES-LtRRID) process, and the arbitrary pre-edit
sequence arbitrary InDel (APES-AID) process. In both models, we consider the
regime in which the number of insertions/deletions is a small (but constant)
fraction of the original file. For both models we prove information-theoretic
lower bounds on the best possible compression rates that enable file updates.
Conversely, our compression algorithms use dynamic programming (DP) and entropy
coding, and achieve rates that are approximately optimal.
Comment: The paper is an extended version of our paper to appear at ITW
201
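As a rough illustration of how dynamic programming can recover a short insertion-deletion description of an edit, the following sketch computes a minimal InDel script through a longest-common-subsequence table and replays it on the old file. It is a generic textbook construction rather than the algorithm of the paper; the function names and the `(op, position, symbol)` script format are assumptions, and the entropy-coding stage is omitted.

```python
def indel_script(old: str, new: str) -> list[tuple[str, int, str]]:
    """Return a minimal list of ("ins" | "del", position in old, symbol) operations."""
    n, m = len(old), len(new)
    # lcs[i][j] = length of the longest common subsequence of old[i:] and new[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    script, i, j = [], 0, 0
    while i < n or j < m:
        if i < n and j < m and old[i] == new[j]:
            i, j = i + 1, j + 1                      # symbol kept, nothing to send
        elif j < m and (i == n or lcs[i][j + 1] >= lcs[i + 1][j]):
            script.append(("ins", i, new[j]))        # insert new[j] before old[i]
            j += 1
        else:
            script.append(("del", i, ""))            # delete old[i]
            i += 1
    return script


def apply_script(old: str, script: list[tuple[str, int, str]]) -> str:
    """Replay the script on the decoder's copy of the old file."""
    out, i = [], 0
    for op, pos, symbol in script:
        out.append(old[i:pos])                       # copy the unchanged run
        i = pos
        if op == "ins":
            out.append(symbol)
        else:
            i += 1                                   # skip the deleted symbol
    out.append(old[i:])
    return "".join(out)


old_file, new_file = "kitten", "sitting"
script = indel_script(old_file, new_file)
print(script, apply_script(old_file, script))
```

Only the script needs to be communicated, since the decoder already holds the old file; when the number of edits is a small fraction of the file length, the script is much shorter than the new file itself.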