
    The Wavelet Trie: Maintaining an Indexed Sequence of Strings in Compressed Space

    An indexed sequence of strings is a data structure for storing a string sequence that supports random access, searching, range counting and analytics operations, both for exact matches and prefix search. String sequences lie at the core of column-oriented databases, log processing, and other storage and query tasks. In these applications each string can appear several times and the order of the strings in the sequence is relevant. The prefix structure of the strings is relevant as well: common prefixes are sought in strings to extract interesting features from the sequence. Moreover, space-efficiency is highly desirable as it translates directly into higher performance, since more data can fit in fast memory. We introduce and study the problem of the compressed indexed sequence of strings: representing indexed sequences of strings in nearly-optimal compressed space, both in the static and dynamic settings, while preserving provably good performance for the supported operations. We present a new data structure for this problem, the Wavelet Trie, which combines the classical Patricia Trie with the Wavelet Tree, a succinct data structure for storing a compressed sequence. The resulting Wavelet Trie smoothly adapts to a sequence of strings that changes over time. It improves on the state-of-the-art compressed data structures by supporting a dynamic alphabet (i.e., the set of distinct strings) and prefix queries, both crucial requirements in the aforementioned applications, and on traditional indexes by reducing space occupancy to close to the entropy of the sequence.
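To make the operations named in this abstract concrete, here is a minimal uncompressed baseline, not the Wavelet Trie itself. The class name and the method names (`access`, `rank`, `rank_prefix`, `select`, `insert`, `delete`) are illustrative assumptions; every query is a linear scan, which is exactly the cost a compressed index is meant to avoid.

```python
class NaiveIndexedSequence:
    """Hypothetical uncompressed baseline: a plain list with linear-scan queries."""

    def __init__(self, strings=()):
        self._items = list(strings)

    def __len__(self):
        return len(self._items)

    def access(self, pos):
        # Random access: the string stored at position pos.
        return self._items[pos]

    def rank(self, s, pos):
        # Range counting: exact occurrences of s among the first pos strings.
        return sum(1 for t in self._items[:pos] if t == s)

    def rank_prefix(self, prefix, pos):
        # Prefix counting: strings starting with prefix among the first pos strings.
        return sum(1 for t in self._items[:pos] if t.startswith(prefix))

    def select(self, s, idx):
        # Searching: position of the idx-th (0-based) occurrence of s, or -1.
        seen = 0
        for pos, t in enumerate(self._items):
            if t == s:
                if seen == idx:
                    return pos
                seen += 1
        return -1

    def insert(self, s, pos):
        # Dynamic setting: s may be a string never seen before.
        self._items.insert(pos, s)

    def delete(self, pos):
        del self._items[pos]


# Example data (made up for illustration): a short log of URLs.
log = NaiveIndexedSequence(["a.com/x", "b.org/", "a.com/y", "a.com/x"])
```

With this example data, `log.rank_prefix("a.com/", 3)` counts the URLs under `a.com/` among the first three entries.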

    Crucial and bicrucial permutations with respect to arithmetic monotone patterns

    A pattern $\tau$ is a permutation, and an arithmetic occurrence of $\tau$ in (another) permutation $\pi=\pi_1\pi_2\ldots\pi_n$ is a subsequence $\pi_{i_1}\pi_{i_2}\ldots\pi_{i_m}$ of $\pi$ that is order isomorphic to $\tau$, where the numbers $i_1<i_2<\ldots<i_m$ form an arithmetic progression. A permutation is $(k,\ell)$-crucial if it avoids arithmetically the patterns $12\ldots k$ and $\ell(\ell-1)\ldots 1$ but its extension to the right by any element does not avoid arithmetically these patterns. A $(k,\ell)$-crucial permutation that cannot be extended to the left without creating an arithmetic occurrence of $12\ldots k$ or $\ell(\ell-1)\ldots 1$ is called $(k,\ell)$-bicrucial. In this paper we prove that arbitrarily long $(k,\ell)$-crucial and $(k,\ell)$-bicrucial permutations exist for any $k,\ell\geq 3$. Moreover, we show that the minimal length of a $(k,\ell)$-crucial permutation is $\max(k,\ell)(\min(k,\ell)-1)$, while the minimal length of a $(k,\ell)$-bicrucial permutation is at most $2\max(k,\ell)(\min(k,\ell)-1)$, again for $k,\ell\geq 3$.
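The definitions above can be checked by brute force for small cases. The sketch below is an illustration, not the paper's method. It assumes that "extension to the right by an element $x$" means appending $x \in \{1,\ldots,n+1\}$ and incrementing every existing value that is at least $x$; all function names are hypothetical.

```python
from itertools import permutations


def has_arithmetic_monotone(seq, m, increasing):
    """True if seq has a monotone run of length m at equally spaced positions."""
    n = len(seq)
    for start in range(n):
        for step in range(1, n):
            if start + (m - 1) * step >= n:
                break
            vals = [seq[start + j * step] for j in range(m)]
            pairs = zip(vals, vals[1:])
            if all((a < b) if increasing else (a > b) for a, b in pairs):
                return True
    return False


def avoids(seq, k, l):
    # Avoids arithmetic occurrences of 12...k and of l(l-1)...1.
    return (not has_arithmetic_monotone(seq, k, True)
            and not has_arithmetic_monotone(seq, l, False))


def extend_right(perm, x):
    # Assumed convention: shift values >= x up by one, then append x.
    return [v + 1 if v >= x else v for v in perm] + [x]


def is_crucial(perm, k, l):
    perm = list(perm)
    if not avoids(perm, k, l):
        return False
    return all(not avoids(extend_right(perm, x), k, l)
               for x in range(1, len(perm) + 2))


def min_crucial_length(k, l, max_len):
    # Exhaustive search; only feasible for very small max_len.
    for n in range(1, max_len + 1):
        for p in permutations(range(1, n + 1)):
            if is_crucial(p, k, l):
                return n
    return None
```

For $k=\ell=3$ the formula in the abstract gives $\max(3,3)(\min(3,3)-1)=6$, and `is_crucial([2, 1, 6, 4, 5, 3], 3, 3)` confirms one witness of that length under the convention assumed here.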

    Cell-Probe Bounds for Online Edit Distance and Other Pattern Matching Problems

    We give cell-probe bounds for the computation of edit distance, Hamming distance, convolution and longest common subsequence in a stream. In this model, a fixed string of $n$ symbols is given and one $\delta$-bit symbol arrives at a time in a stream. After each symbol arrives, the distance between the fixed string and a suffix of most recent symbols of the stream is reported. The cell-probe model is perhaps the strongest model of computation for showing data structure lower bounds, subsuming in particular the popular word-RAM model.
    * We first give an $\Omega((\delta \log n)/(w+\log\log n))$ lower bound for the time to give each output for both online Hamming distance and convolution, where $w$ is the word size. This bound relies on a new encoding scheme and for the first time holds even when $w$ is as small as a single bit.
    * We then consider the online edit distance and longest common subsequence problems in the bit-probe model ($w=1$) with a constant sized input alphabet. We give a lower bound of $\Omega(\sqrt{\log n}/(\log\log n)^{3/2})$ which applies for both problems. This second set of results relies both on our new encoding scheme as well as a carefully constructed hard distribution.
    * Finally, for the online edit distance problem we show that there is an $O((\log n)^2/w)$ upper bound in the cell-probe model. This bound gives a contrast to our new lower bound and also establishes an exponential gap between the known cell-probe and RAM model complexities.
    Comment: 32 pages, 4 figures
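The streaming model described in this abstract can be illustrated with a naive baseline for online Hamming distance. This is only a sketch of the problem statement, not of the paper's bounds or techniques: it spends time linear in $n$ per arriving symbol. The function name is hypothetical, and reporting `None` before $n$ symbols have arrived is an assumption made here.

```python
from collections import deque


def online_hamming(pattern, stream):
    """Yield, after each arriving symbol, the Hamming distance between
    pattern and the most recent len(pattern) symbols of the stream."""
    n = len(pattern)
    window = deque(maxlen=n)  # keeps only the n most recent symbols
    for symbol in stream:
        window.append(symbol)
        if len(window) < n:
            # Assumption: no output is defined until the window is full.
            yield None
        else:
            yield sum(a != b for a, b in zip(pattern, window))


# Example: fixed string "abc", stream "xabcab".
outputs = list(online_hamming("abc", "xabcab"))
# outputs == [None, None, 3, 0, 3, 3]
```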

    File Updates Under Random/Arbitrary Insertions And Deletions

    A client/encoder edits a file, as modeled by an insertion-deletion (InDel) process. An old copy of the file is stored remotely at a data-centre/decoder, and is also available to the client. We consider the problem of throughput- and computationally-efficient communication from the client to the data-centre, to enable the server to update its copy to the newly edited file. We study two models for the source files/edit patterns: the random pre-edit sequence left-to-right random InDel (RPES-LtRRID) process, and the arbitrary pre-edit sequence arbitrary InDel (APES-AID) process. In both models, we consider the regime in which the number of insertions/deletions is a small (but constant) fraction of the original file. For both models we prove information-theoretic lower bounds on the best possible compression rates that enable file updates. Conversely, our compression algorithms use dynamic programming (DP) and entropy coding, and achieve rates that are approximately optimal.
    Comment: The paper is an extended version of our paper to appear at ITW 201

    Tree Contractions and Evolutionary Trees

    An evolutionary tree is a rooted tree where each internal vertex has at least two children and where the leaves are labeled with distinct symbols representing species. Evolutionary trees are useful for modeling the evolutionary history of species. An agreement subtree of two evolutionary trees is an evolutionary tree which is also a topological subtree of the two given trees. We give an algorithm to determine the largest possible number of leaves in any agreement subtree of two trees T_1 and T_2 with n leaves each. If the maximum degree d of these trees is bounded by a constant, the time complexity is O(n log^2(n)) and is within a log(n) factor of optimal. For general d, this algorithm runs in O(n d^2 log(d) log^2(n)) time or alternatively in O(n d sqrt(d) log^3(n)) time.

    Edit Distance: Sketching, Streaming and Document Exchange

    We show that in the document exchange problem, where Alice holds $x \in \{0,1\}^n$ and Bob holds $y \in \{0,1\}^n$, Alice can send Bob a message of size $O(K(\log^2 K+\log n))$ bits such that Bob can recover $x$ using the message and his input $y$ if the edit distance between $x$ and $y$ is no more than $K$, and output "error" otherwise. Both the encoding and decoding can be done in time $\tilde{O}(n+\mathsf{poly}(K))$. This result significantly improves the previous communication bounds under polynomial encoding/decoding time. We also show that in the referee model, where Alice and Bob hold $x$ and $y$ respectively, they can compute sketches of $x$ and $y$ of sizes $\mathsf{poly}(K \log n)$ bits (the encoding), and send to the referee, who can then compute the edit distance between $x$ and $y$ together with all the edit operations if the edit distance is no more than $K$, and output "error" otherwise (the decoding). To the best of our knowledge, this is the first result for sketching edit distance using $\mathsf{poly}(K \log n)$ bits. Moreover, the encoding phase of our sketching algorithm can be performed by scanning the input string in one pass. Thus our sketching algorithm also implies the first streaming algorithm for computing edit distance and all the edits exactly using $\mathsf{poly}(K \log n)$ bits of space.
    Comment: Full version of an article to be presented at the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2016)
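The "answer if the edit distance is at most $K$, otherwise report an error" contract in this abstract can be illustrated on a single machine holding both strings. The sketch below is a standard banded dynamic program, not the paper's communication protocol or sketching scheme. The function name is hypothetical, and returning `None` stands in for the "error" output.

```python
def bounded_edit_distance(x, y, k):
    """Return the edit distance between x and y if it is at most k, else None.

    Only cells with |i - j| <= k are filled, so the work is O(len(x) * k).
    Assumes k >= 0.
    """
    n, m = len(x), len(y)
    if abs(n - m) > k:
        return None  # the length gap alone already exceeds k
    inf = k + 1  # every value is clamped here: "more than k"
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        lo, hi = max(0, i - k), min(m, i + k)
        if lo == 0:
            cur[0] = i  # delete the first i symbols of x (here i <= k)
        for j in range(max(1, lo), hi + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            best = min(prev[j - 1] + cost,  # match or substitute
                       prev[j] + 1,         # delete from x
                       cur[j - 1] + 1)      # insert into x
            cur[j] = min(best, inf)
        prev = cur
    return prev[m] if prev[m] <= k else None
```

For example, `bounded_edit_distance("10110", "1011", 1)` returns 1, while `bounded_edit_distance("kitten", "sitting", 2)` returns `None` because the true distance is 3.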