1,753 research outputs found

    Fingerprints in Compressed Strings

    Get PDF
    The Karp-Rabin fingerprint of a string is a type of hash value that due to its strong properties has been used in many string algorithms. In this paper we show how to construct a data structure for a string S of size N compressed by a context-free grammar of size n that answers fingerprint queries. That is, given indices i and j, the answer to a query is the fingerprint of the substring S[i,j]. We present the first O(n) space data structures that answer fingerprint queries without decompressing any characters. For Straight Line Programs (SLP) we get O(logN) query time, and for Linear SLPs (an SLP derivative that captures LZ78 compression and its variations) we get O(log log N) query time. Hence, our data structures has the same time and space complexity as for random access in SLPs. We utilize the fingerprint data structures to solve the longest common extension problem in query time O(log N log l) and O(log l log log l + log log N) for SLPs and Linear SLPs, respectively. Here, l denotes the length of the LCE

    Fingerprints in compressed strings

    Get PDF
    Abstract. The Karp-Rabin fingerprint of a string is a type of hash value that due to its strong properties has been used in many string algorithms. In this paper we show how to construct a data structure for a string S of size N compressed by a context-free grammar of size n that answers fingerprint queries. That is, given indices i and j, the answer to a query is the fingerprint of the substring S[i, j]. We present the first O(n) space data structures that answer fingerprint queries without decompressing any characters. For Straight Line Programs (SLP) we get O(logN) query time, and for Linear SLPs (an SLP derivative that captures LZ78 compression and its variations) we get O(log logN) query time. Hence, our data structures has the same time and space complexity as for random access in SLPs. We utilize the fingerprint data structures to solve the longest common extension problem in query time O(logN log `) and O(log ` log log `+ log logN) for SLPs and Linear SLPs, respectively. Here, ` denotes the length of the LCE.

    Finger Search in Grammar-Compressed Strings

    Get PDF
    Grammar-based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. Given a grammar, the random access problem is to compactly represent the grammar while supporting random access, that is, given a position in the original uncompressed string report the character at that position. In this paper we study the random access problem with the finger search property, that is, the time for a random access query should depend on the distance between a specified index ff, called the \emph{finger}, and the query index ii. We consider both a static variant, where we first place a finger and subsequently access indices near the finger efficiently, and a dynamic variant where also moving the finger such that the time depends on the distance moved is supported. Let nn be the size the grammar, and let NN be the size of the string. For the static variant we give a linear space representation that supports placing the finger in O(logN)O(\log N) time and subsequently accessing in O(logD)O(\log D) time, where DD is the distance between the finger and the accessed index. For the dynamic variant we give a linear space representation that supports placing the finger in O(logN)O(\log N) time and accessing and moving the finger in O(logD+loglogN)O(\log D + \log \log N) time. Compared to the best linear space solution to random access, we improve a O(logN)O(\log N) query bound to O(logD)O(\log D) for the static variant and to O(logD+loglogN)O(\log D + \log \log N) for the dynamic variant, while maintaining linear space. As an application of our results we obtain an improved solution to the longest common extension problem in grammar compressed strings. To obtain our results, we introduce several new techniques of independent interest, including a novel van Emde Boas style decomposition of grammars

    String Indexing with Compressed Patterns

    Get PDF
    Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In this paper we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server scenario, where a client submits a query and communicates it in compressed form to a server. Instead of the server decompressing the query before processing it, we consider how to efficiently process the compressed query directly. Our main result is a novel linear space data structure that achieves near-optimal query time for patterns compressed with the classic Lempel-Ziv 1977 (LZ77) compression scheme. Along the way we develop several data structural techniques of independent interest, including a novel data structure that compactly encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern

    Universal Compressed Text Indexing

    Get PDF
    The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform and on the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let γ\gamma be the size of a string attractor for a text of length nn. Our index takes O(γlog(n/γ))O(\gamma\log(n/\gamma)) words of space and supports locating the occocc occurrences of any pattern of length mm in O(mlogn+occlogϵn)O(m\log n + occ\log^{\epsilon}n) time, for any constant ϵ>0\epsilon>0. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries.Comment: Fixed with reviewer's comment

    Practical Evaluation of Lempel-Ziv-78 and Lempel-Ziv-Welch Tries

    Full text link
    We present the first thorough practical study of the Lempel-Ziv-78 and the Lempel-Ziv-Welch computation based on trie data structures. With a careful selection of trie representations we can beat well-tuned popular trie data structures like Judy, m-Bonsai or Cedar

    Optimal-Time Text Indexing in BWT-runs Bounded Space

    Full text link
    Indexing highly repetitive texts --- such as genomic databases, software repositories and versioned text collections --- has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is rr, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r)O(r) space and was able to efficiently count the number of occurrences of a pattern of length mm in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of rr. Since then, a number of other indexes with space bounded by other measures of repetitiveness --- the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors --- have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occocc occurrences efficiently within O(r)O(r) space (in loglogarithmic time each), and reaching optimal time O(m+occ)O(m+occ) within O(rlog(n/r))O(r\log(n/r)) space, on a RAM machine of w=Ω(logn)w=\Omega(\log n) bits. Within O(rlog(n/r))O(r\log (n/r)) space, our index can also count in optimal time O(m)O(m). Raising the space to O(rwlogσ(n/r))O(r w\log_\sigma(n/r)), we support count and locate in O(mlog(σ)/w)O(m\log(\sigma)/w) and O(mlog(σ)/w+occ)O(m\log(\sigma)/w+occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(rlog(n/r))O(r\log(n/r)) space that replaces the text and extracts any text substring of length \ell in almost-optimal time O(log(n/r)+log(σ)/w)O(\log(n/r)+\ell\log(\sigma)/w). (...continues...
    corecore