The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space
An indexed sequence of strings is a data structure for storing a string
sequence that supports random access, searching, range counting and analytics
operations, both for exact matches and prefix search. String sequences lie at
the core of column-oriented databases, log processing, and other storage and
query tasks. In these applications each string can appear several times and the
order of the strings in the sequence is relevant. The prefix structure of the
strings is relevant as well: common prefixes are sought in strings to extract
interesting features from the sequence. Moreover, space-efficiency is highly
desirable as it translates directly into higher performance, since more data
can fit in fast memory.
We introduce and study the problem of compressed indexed sequence of strings,
representing indexed sequences of strings in nearly-optimal compressed space,
both in the static and dynamic settings, while preserving provably good
performance for the supported operations.
We present a new data structure for this problem, the Wavelet Trie, which
combines the classical Patricia Trie with the Wavelet Tree, a succinct data
structure for storing a compressed sequence. The resulting Wavelet Trie
smoothly adapts to a sequence of strings that changes over time. It improves on
the state-of-the-art compressed data structures by supporting a dynamic
alphabet (i.e. the set of distinct strings) and prefix queries, both crucial
requirements in the aforementioned applications, and on traditional indexes by
reducing space occupancy to close to the entropy of the sequence.
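The operations listed in this abstract (random access, searching, and counting for exact matches and prefixes, on a sequence that changes over time) can be made concrete with a naive, uncompressed reference. This is only a sketch of the query interface, not the Wavelet Trie itself: every query here scans the list, and the class and method names (`NaiveStringSequence`, `rank`, `select`, `rank_prefix`) are illustrative assumptions rather than names taken from the paper.

```python
class NaiveStringSequence:
    """Uncompressed reference for the query interface; NOT a Wavelet Trie.

    Every query is a linear scan. The point is only to pin down what
    the operations mean, not how to answer them efficiently.
    """

    def __init__(self, strings=()):
        self._items = list(strings)

    def access(self, pos):
        """Return the string stored at position pos."""
        return self._items[pos]

    def rank(self, s, pos):
        """Count occurrences of s among the first pos strings."""
        return sum(1 for x in self._items[:pos] if x == s)

    def select(self, s, idx):
        """Position of the idx-th (0-based) occurrence of s, or -1."""
        seen = 0
        for i, x in enumerate(self._items):
            if x == s:
                if seen == idx:
                    return i
                seen += 1
        return -1

    def rank_prefix(self, p, pos):
        """Count strings with prefix p among the first pos strings."""
        return sum(1 for x in self._items[:pos] if x.startswith(p))

    def insert(self, s, pos):
        """Insert s before position pos; s may be a previously unseen string."""
        self._items.insert(pos, s)

    def delete(self, pos):
        """Remove the string at position pos."""
        del self._items[pos]


seq = NaiveStringSequence(["/home", "/img/a.png", "/home", "/img/b.png"])
print(seq.rank("/home", 3))        # occurrences of "/home" in the first 3 entries
print(seq.rank_prefix("/img", 4))  # entries starting with "/img" in the first 4
```

Range counting over positions `[i, j)` follows from two rank calls, for example `seq.rank_prefix(p, j) - seq.rank_prefix(p, i)`.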
Crucial and bicrucial permutations with respect to arithmetic monotone patterns
A pattern is a permutation, and an arithmetic occurrence of in
(another) permutation is a subsequence
of that is order isomorphic to
where the numbers form an arithmetic progression. A
permutation is -crucial if it avoids arithmetically the patterns
and but its extension to the right by any element
does not avoid arithmetically these patterns. A -crucial permutation
that cannot be extended to the left without creating an arithmetic occurrence
of or is called -bicrucial.
In this paper we prove that arbitrarily long -crucial and
-bicrucial permutations exist for any . Moreover, we
show that the minimal length of a -crucial permutation is
, while the minimal length of a
-bicrucial permutation is at most ,
again for
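The notion of an arithmetic occurrence can be checked by brute force: try every starting position and every common difference, and test whether the selected entries are order isomorphic to the pattern. The following is a minimal sketch, assuming permutations and patterns are given as Python lists of distinct integers; the function names are illustrative and not taken from the paper.

```python
def order_isomorphic(a, b):
    """True if a and b have the same relative order of entries."""
    if len(a) != len(b):
        return False

    def ranks(seq):
        # Entries are assumed distinct, so index() is unambiguous.
        ordered = sorted(seq)
        return [ordered.index(x) for x in seq]

    return ranks(a) == ranks(b)


def has_arithmetic_occurrence(perm, pattern):
    """True if some subsequence of perm whose positions form an
    arithmetic progression is order isomorphic to pattern."""
    n, m = len(perm), len(pattern)
    if m == 0:
        return True
    if m == 1:
        return n >= 1
    for start in range(n):
        step = 1
        # Keep the last selected position inside perm.
        while start + (m - 1) * step < n:
            chosen = [perm[start + j * step] for j in range(m)]
            if order_isomorphic(chosen, pattern):
                return True
            step += 1
    return False


# 1, 2, 4 sit at positions 0, 2, 4 (common difference 2).
print(has_arithmetic_occurrence([1, 3, 2, 5, 4], [1, 2, 3]))
# 1, 3, 4 is increasing, but its positions 0, 1, 3 are not an arithmetic progression.
print(has_arithmetic_occurrence([1, 3, 2, 4], [1, 2, 3]))
```

The second call illustrates why arithmetic avoidance is weaker than ordinary avoidance: a permutation can contain a pattern as a subsequence and still avoid it arithmetically.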
Cell-Probe Bounds for Online Edit Distance and Other Pattern Matching Problems
We give cell-probe bounds for the computation of edit distance, Hamming
distance, convolution and longest common subsequence in a stream. In this
model, a fixed string of symbols is given and one -bit symbol
arrives at a time in a stream. After each symbol arrives, the distance between
the fixed string and a suffix of most recent symbols of the stream is reported.
The cell-probe model is perhaps the strongest model of computation for showing
data structure lower bounds, subsuming in particular the popular word-RAM
model.
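The streaming setup described above can be illustrated with a naive baseline for online Hamming distance: keep the most recent symbols in a sliding window and recompute the distance after every arrival. This sketch only shows the input/output behaviour of the problem; it does time linear in the pattern length per output and says nothing about the bounds discussed in the abstract. The function name and the choice to report `None` until the window is full are assumptions made for illustration.

```python
from collections import deque


def online_hamming(pattern, stream):
    """After each arriving symbol, yield the Hamming distance between
    pattern and the most recent len(pattern) symbols of the stream.
    Yields None while fewer than len(pattern) symbols have arrived."""
    m = len(pattern)
    window = deque(maxlen=m)  # oldest symbol is dropped automatically
    for symbol in stream:
        window.append(symbol)
        if len(window) < m:
            yield None
        else:
            # Recompute from scratch: O(m) work per output.
            yield sum(a != b for a, b in zip(pattern, window))


print(list(online_hamming("abc", "xabcab")))
```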
* We first give an lower bound for
the time to give each output for both online Hamming distance and convolution,
where is the word size. This bound relies on a new encoding scheme and for
the first time holds even when is as small as a single bit.
* We then consider the online edit distance and longest common subsequence
problems in the bit-probe model () with a constant sized input alphabet.
We give a lower bound of which
applies for both problems. This second set of results relies both on our new
encoding scheme as well as a carefully constructed hard distribution.
* Finally, for the online edit distance problem we show that there is an
upper bound in the cell-probe model. This bound gives a
contrast to our new lower bound and also establishes an exponential gap between
the known cell-probe and RAM model complexities.
Comment: 32 pages, 4 figures
File Updates Under Random/Arbitrary Insertions And Deletions
A client/encoder edits a file, as modeled by an insertion-deletion (InDel)
process. An old copy of the file is stored remotely at a data-centre/decoder,
and is also available to the client. We consider the problem of throughput- and
computationally-efficient communication from the client to the data-centre, to
enable the server to update its copy to the newly edited file. We study two
models for the source files/edit patterns: the random pre-edit sequence
left-to-right random InDel (RPES-LtRRID) process, and the arbitrary pre-edit
sequence arbitrary InDel (APES-AID) process. In both models, we consider the
regime in which the number of insertions/deletions is a small (but constant)
fraction of the original file. For both models we prove information-theoretic
lower bounds on the best possible compression rates that enable file updates.
Conversely, our compression algorithms use dynamic programming (DP) and entropy
coding, and achieve rates that are approximately optimal.
Comment: The paper is an extended version of our paper to appear at ITW
201
Tree Contractions and Evolutionary Trees
An evolutionary tree is a rooted tree where each internal vertex has at least
two children and where the leaves are labeled with distinct symbols
representing species. Evolutionary trees are useful for modeling the
evolutionary history of species. An agreement subtree of two evolutionary trees
is an evolutionary tree which is also a topological subtree of the two given
trees. We give an algorithm to determine the largest possible number of leaves
in any agreement subtree of two trees T_1 and T_2 with n leaves each. If the
maximum degree d of these trees is bounded by a constant, the time complexity
is O(n log^2(n)) and is within a log(n) factor of optimal. For general d, this
algorithm runs in O(n d^2 log(d) log^2(n)) time or alternatively in O(n d
sqrt(d) log^3(n)) time.
Edit Distance: Sketching, Streaming and Document Exchange
We show that in the document exchange problem, where Alice holds and Bob holds , Alice can send Bob a message of
size bits such that Bob can recover using the
message and his input if the edit distance between and is no more
than , and output "error" otherwise. Both the encoding and decoding can be
done in time . This result significantly
improves the previous communication bounds under polynomial encoding/decoding
time. We also show that in the referee model, where Alice and Bob hold and
respectively, they can compute sketches of and of sizes
bits (the encoding), and send to the referee, who can
then compute the edit distance between and together with all the edit
operations if the edit distance is no more than , and output "error"
otherwise (the decoding). To the best of our knowledge, this is the first
result for sketching edit distance using bits.
Moreover, the encoding phase of our sketching algorithm can be performed by
scanning the input string in one pass. Thus our sketching algorithm also
implies the first streaming algorithm for computing edit distance and all the
edits exactly using bits of space.
Comment: Full version of an article to be presented at the 57th Annual IEEE
Symposium on Foundations of Computer Science (FOCS 2016)
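The decoding contract described in this abstract (report the edit distance together with all edit operations when the distance is within a threshold, and output "error" otherwise) can be illustrated with the textbook quadratic dynamic program. This sketch is not the sketching or streaming algorithm of the paper: it reads both strings in full. It assumes unit-cost insertions, deletions, and substitutions; the function name `edit_script`, the threshold parameter `k`, and the tuple format of the operations are hypothetical choices made here.

```python
def edit_script(x, y, k):
    """Return (distance, operations) turning x into y if the edit
    distance is at most k, and the string "error" otherwise."""
    n, m = len(x), len(y)
    # dist[i][j] = edit distance between x[:i] and y[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete x[i-1]
                             dist[i][j - 1] + 1,         # insert y[j-1]
                             dist[i - 1][j - 1] + cost)  # match or substitute
    if dist[n][m] > k:
        return "error"

    # Walk back from the bottom-right corner to recover one optimal script.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and x[i - 1] == y[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("substitute", i - 1, y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", i - 1, x[i - 1]))
            i -= 1
        else:
            ops.append(("insert", i, y[j - 1]))
            j -= 1
    ops.reverse()
    return dist[n][m], ops


print(edit_script("kitten", "sitting", 3))
print(edit_script("kitten", "sitting", 2))
```

Positions in the returned operations refer to indices in `x`; an `insert` at index `i` places the new symbol before `x[i]`.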