    Fast Longest Common Extensions in Small Space

    In this paper we address the longest common extension (LCE) problem: to compute the length $\ell$ of the longest common prefix between any two suffixes of $T\in \Sigma^n$ with $\Sigma = \{0, \ldots, \sigma-1\}$. We present two fast and space-efficient solutions based on (Karp-Rabin) \textit{fingerprinting} and \textit{sampling}. Our first data structure exploits properties of Mersenne prime numbers when used as moduli of the Karp-Rabin hash function and takes $n\lceil \log_2\sigma\rceil$ bits of space. Our second structure works with any prime modulus and takes $n\lceil \log_2\sigma\rceil + n/w + w\log_2 n$ bits of space ($w$ is the memory-word size). Both structures support $\mathcal O\left(m\log\sigma/w \right)$-time extraction of any length-$m$ text substring, $\mathcal O(\log\ell)$-time LCE queries with high probability, and can be built in optimal $\mathcal O(n)$ time. In the first case, ours is the first result showing that it is possible to answer LCE queries in $o(n)$ time while using only $\mathcal O(1)$ words on top of the space required to store the text. Our results improve the state of the art in space usage, query times, and preprocessing times and are extremely practical: we present a C++ implementation that is very fast and space-efficient in practice.
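To make the fingerprinting idea in the abstract above concrete, here is a minimal Python sketch of an LCE query answered with Karp-Rabin fingerprints and an exponential-then-binary search, which uses $\mathcal O(\log\ell)$ fingerprint comparisons. It is not the paper's data structure: this illustration stores a full array of prefix fingerprints ($O(n)$ words of extra space), whereas the paper's contribution is precisely to avoid that overhead. The class name `FingerprintLCE`, the modulus $2^{61}-1$, and the random base are assumptions chosen for the example.

```python
import random

# A Mersenne prime used as the modulus (an illustrative choice, not the paper's parameter).
MOD = (1 << 61) - 1


class FingerprintLCE:
    """Hypothetical helper: LCE queries via Karp-Rabin prefix fingerprints."""

    def __init__(self, text, seed=None):
        rng = random.Random(seed)
        self.n = len(text)
        self.base = rng.randrange(256, MOD - 1)  # random base for the hash
        # pref[i] = fingerprint of text[:i]; pw[i] = base**i mod MOD
        self.pref = [0] * (self.n + 1)
        self.pw = [1] * (self.n + 1)
        for i, ch in enumerate(text):
            self.pref[i + 1] = (self.pref[i] * self.base + ord(ch) + 1) % MOD
            self.pw[i + 1] = (self.pw[i] * self.base) % MOD

    def fp(self, i, length):
        """Fingerprint of text[i:i+length], computed in O(1) from the prefix arrays."""
        return (self.pref[i + length] - self.pref[i] * self.pw[length]) % MOD

    def lce(self, i, j):
        """Length of the longest common prefix of text[i:] and text[j:].

        Correct with high probability: equal fingerprints are treated as equal
        substrings, so a hash collision could overestimate the answer.
        """
        limit = self.n - max(i, j)
        # Exponential search: double the candidate length while fingerprints agree.
        lo, hi = 0, 1
        while hi <= limit and self.fp(i, hi) == self.fp(j, hi):
            lo, hi = hi, hi * 2
        hi = min(hi, limit + 1)
        # Invariant: length-lo prefixes match; length-hi prefixes do not (or hi = limit + 1).
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.fp(i, mid) == self.fp(j, mid):
                lo = mid
            else:
                hi = mid
        return lo


def naive_lce(text, i, j):
    """Reference implementation by direct character comparison."""
    k = 0
    while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
        k += 1
    return k
```

For example, on the text `abracadabra` the suffixes starting at positions 0 and 7 share the prefix `abra`, so `FingerprintLCE("abracadabra").lce(0, 7)` should agree with `naive_lce("abracadabra", 0, 7)`.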

    Small-space encoding LCE data structure with constant-time queries

    The \emph{longest common extension} (\emph{LCE}) problem is to preprocess a given string $w$ of length $n$ so that the length of the longest common prefix between suffixes of $w$ that start at any two given positions is answered quickly. In this paper, we present a data structure of $O(z \tau^2 + \frac{n}{\tau})$ words of space which answers LCE queries in $O(1)$ time and can be built in $O(n \log \sigma)$ time, where $1 \leq \tau \leq \sqrt{n}$ is a parameter, $z$ is the size of the Lempel-Ziv 77 factorization of $w$ and $\sigma$ is the alphabet size. This is an \emph{encoding} data structure, i.e., it does not access the input string $w$ when answering queries and thus $w$ can be deleted after preprocessing. On top of this main result, we obtain further results using (variants of) our LCE data structure, which include the following:
    - For highly repetitive strings where the $z\tau^2$ term is dominated by $\frac{n}{\tau}$, we obtain a \emph{constant-time and sub-linear space} LCE query data structure.
    - Even when the input string is not well compressible via Lempel-Ziv 77 factorization, we still can obtain a \emph{constant-time and sub-linear space} LCE data structure for suitable $\tau$ and for $\sigma \leq 2^{o(\log n)}$.
    - The time-space trade-off lower bounds for the LCE problem by Bille et al. [J. Discrete Algorithms, 25:42-50, 2014] and by Kosolobov [CoRR, abs/1611.02891, 2016] can be "surpassed" in some cases with our LCE data structure.
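As a rough illustration of when the $z\tau^2$ term is dominated by $\frac{n}{\tau}$, the short calculation below balances the two terms of the stated space bound. This is a back-of-the-envelope reading of the bound quoted in the abstract, ignoring hidden constants and the rounding of $\tau$ to an integer; it is not a statement taken from the paper.

```latex
z\tau^2 \le \frac{n}{\tau}
\iff \tau^3 \le \frac{n}{z}
\iff \tau \le \left(\frac{n}{z}\right)^{1/3},
\qquad
\tau = \left(\frac{n}{z}\right)^{1/3}
\implies z\tau^2 + \frac{n}{\tau} = 2\, n^{2/3} z^{1/3}.
```

Under this choice the bound is $o(n)$ exactly when $z = o(n)$, and the constraint $1 \leq \tau \leq \sqrt{n}$ holds for any $1 \leq z \leq n$.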