2,270 research outputs found

    Finger Search in Grammar-Compressed Strings

    Get PDF
    Grammar-based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. Given a grammar, the random access problem is to compactly represent the grammar while supporting random access, that is, given a position in the original uncompressed string report the character at that position. In this paper we study the random access problem with the finger search property, that is, the time for a random access query should depend on the distance between a specified index ff, called the \emph{finger}, and the query index ii. We consider both a static variant, where we first place a finger and subsequently access indices near the finger efficiently, and a dynamic variant where also moving the finger such that the time depends on the distance moved is supported. Let nn be the size the grammar, and let NN be the size of the string. For the static variant we give a linear space representation that supports placing the finger in O(logN)O(\log N) time and subsequently accessing in O(logD)O(\log D) time, where DD is the distance between the finger and the accessed index. For the dynamic variant we give a linear space representation that supports placing the finger in O(logN)O(\log N) time and accessing and moving the finger in O(logD+loglogN)O(\log D + \log \log N) time. Compared to the best linear space solution to random access, we improve a O(logN)O(\log N) query bound to O(logD)O(\log D) for the static variant and to O(logD+loglogN)O(\log D + \log \log N) for the dynamic variant, while maintaining linear space. As an application of our results we obtain an improved solution to the longest common extension problem in grammar compressed strings. To obtain our results, we introduce several new techniques of independent interest, including a novel van Emde Boas style decomposition of grammars

    Universal Compressed Text Indexing

    Get PDF
    The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform and on the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let γ\gamma be the size of a string attractor for a text of length nn. Our index takes O(γlog(n/γ))O(\gamma\log(n/\gamma)) words of space and supports locating the occocc occurrences of any pattern of length mm in O(mlogn+occlogϵn)O(m\log n + occ\log^{\epsilon}n) time, for any constant ϵ>0\epsilon>0. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries.Comment: Fixed with reviewer's comment

    Algorithms and data structures for grammar-compressed strings

    Get PDF

    Linear Compressed Pattern Matching for Polynomial Rewriting (Extended Abstract)

    Full text link
    This paper is an extended abstract of an analysis of term rewriting where the terms in the rewrite rules as well as the term to be rewritten are compressed by a singleton tree grammar (STG). This form of compression is more general than node sharing or representing terms as dags since also partial trees (contexts) can be shared in the compression. In the first part efficient but complex algorithms for detecting applicability of a rewrite rule under STG-compression are constructed and analyzed. The second part applies these results to term rewriting sequences. The main result for submatching is that finding a redex of a left-linear rule can be performed in polynomial time under STG-compression. The main implications for rewriting and (single-position or parallel) rewriting steps are: (i) under STG-compression, n rewriting steps can be performed in nondeterministic polynomial time. (ii) under STG-compression and for left-linear rewrite rules a sequence of n rewriting steps can be performed in polynomial time, and (iii) for compressed rewrite rules where the left hand sides are either DAG-compressed or ground and STG-compressed, and an STG-compressed target term, n rewriting steps can be performed in polynomial time.Comment: In Proceedings TERMGRAPH 2013, arXiv:1302.599

    Fingerprints in Compressed Strings

    Get PDF
    The Karp-Rabin fingerprint of a string is a type of hash value that due to its strong properties has been used in many string algorithms. In this paper we show how to construct a data structure for a string S of size N compressed by a context-free grammar of size n that answers fingerprint queries. That is, given indices i and j, the answer to a query is the fingerprint of the substring S[i,j]. We present the first O(n) space data structures that answer fingerprint queries without decompressing any characters. For Straight Line Programs (SLP) we get O(logN) query time, and for Linear SLPs (an SLP derivative that captures LZ78 compression and its variations) we get O(log log N) query time. Hence, our data structures has the same time and space complexity as for random access in SLPs. We utilize the fingerprint data structures to solve the longest common extension problem in query time O(log N log l) and O(log l log log l + log log N) for SLPs and Linear SLPs, respectively. Here, l denotes the length of the LCE

    Fingerprints in compressed strings

    Get PDF
    Abstract. The Karp-Rabin fingerprint of a string is a type of hash value that due to its strong properties has been used in many string algorithms. In this paper we show how to construct a data structure for a string S of size N compressed by a context-free grammar of size n that answers fingerprint queries. That is, given indices i and j, the answer to a query is the fingerprint of the substring S[i, j]. We present the first O(n) space data structures that answer fingerprint queries without decompressing any characters. For Straight Line Programs (SLP) we get O(logN) query time, and for Linear SLPs (an SLP derivative that captures LZ78 compression and its variations) we get O(log logN) query time. Hence, our data structures has the same time and space complexity as for random access in SLPs. We utilize the fingerprint data structures to solve the longest common extension problem in query time O(logN log `) and O(log ` log log `+ log logN) for SLPs and Linear SLPs, respectively. Here, ` denotes the length of the LCE.

    Compression by Contracting Straight-Line Programs

    Get PDF
    In grammar-based compression a string is represented by a context-free grammar, also called a straight-line program (SLP), that generates only that string. We refine a recent balancing result stating that one can transform an SLP of size gg in linear time into an equivalent SLP of size O(g)O(g) so that the height of the unique derivation tree is O(logN)O(\log N) where NN is the length of the represented string (FOCS 2019). We introduce a new class of balanced SLPs, called contracting SLPs, where for every rule Aβ1βkA \to \beta_1 \dots \beta_k the string length of every variable βi\beta_i on the right-hand side is smaller by a constant factor than the string length of AA. In particular, the derivation tree of a contracting SLP has the property that every subtree has logarithmic height in its leaf size. We show that a given SLP of size gg can be transformed in linear time into an equivalent contracting SLP of size O(g)O(g) with rules of constant length. We present an application to the navigation problem in compressed unranked trees, represented by forest straight-line programs (FSLPs). We extend a linear space data structure by Reh and Sieber (2020) by the operation of moving to the ii-th child in time O(logd)O(\log d) where dd is the degree of the current node. Contracting SLPs are also applied to the finger search problem over SLP-compressed strings where one wants to access positions near to a pre-specified finger position, ideally in O(logd)O(\log d) time where dd is the distance between the accessed position and the finger. We give a linear space solution where one can access symbols or move the finger in time O(logd+log(t)N)O(\log d + \log^{(t)} N) for any constant tt where log(t)N\log^{(t)} N is the tt-fold logarithm of NN. This improves a previous solution by Bille, Christiansen, Cording, and G{\o}rtz (2018) with access/move time O(logd+loglogN)O(\log d + \log \log N)

    Fully dynamic data structure for LCE queries in compressed space

    Get PDF
    A Longest Common Extension (LCE) query on a text TT of length NN asks for the length of the longest common prefix of suffixes starting at given two positions. We show that the signature encoding G\mathcal{G} of size w=O(min(zlogNlogM,N))w = O(\min(z \log N \log^* M, N)) [Mehlhorn et al., Algorithmica 17(2):183-198, 1997] of TT, which can be seen as a compressed representation of TT, has a capability to support LCE queries in O(logN+loglogM)O(\log N + \log \ell \log^* M) time, where \ell is the answer to the query, zz is the size of the Lempel-Ziv77 (LZ77) factorization of TT, and M4NM \geq 4N is an integer that can be handled in constant time under word RAM model. In compressed space, this is the fastest deterministic LCE data structure in many cases. Moreover, G\mathcal{G} can be enhanced to support efficient update operations: After processing G\mathcal{G} in O(wfA)O(w f_{\mathcal{A}}) time, we can insert/delete any (sub)string of length yy into/from an arbitrary position of TT in O((y+logNlogM)fA)O((y+ \log N\log^* M) f_{\mathcal{A}}) time, where fA=O(min{loglogMloglogwlogloglogM,logwloglogw})f_{\mathcal{A}} = O(\min \{ \frac{\log\log M \log\log w}{\log\log\log M}, \sqrt{\frac{\log w}{\log\log w}} \}). This yields the first fully dynamic LCE data structure. We also present efficient construction algorithms from various types of inputs: We can construct G\mathcal{G} in O(NfA)O(N f_{\mathcal{A}}) time from uncompressed string TT; in O(nloglognlogNlogM)O(n \log\log n \log N \log^* M) time from grammar-compressed string TT represented by a straight-line program of size nn; and in O(zfAlogNlogM)O(z f_{\mathcal{A}} \log N \log^* M) time from LZ77-compressed string TT with zz factors. On top of the above contributions, we show several applications of our data structures which improve previous best known results on grammar-compressed string processing.Comment: arXiv admin note: text overlap with arXiv:1504.0695
    corecore