
    Tight Upper and Lower Bounds on Suffix Tree Breadth

    The suffix tree - the compacted trie of all the suffixes of a string - is the most important and widely-used data structure in string processing. We consider a natural combinatorial question about suffix trees: for a string S of length n, how many nodes ν_S(d) can there be at (string) depth d in its suffix tree? We prove that ν(n, d) = max_{S ∈ Σ^n} ν_S(d) is O((n/d) log(n/d)), and show that this bound is asymptotically tight, describing strings for which ν_S(d) is Ω((n/d) log(n/d)).
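
    As a purely illustrative aside (not taken from the paper), the quantity ν_S(d) can be inspected by brute force without building a suffix tree: with a terminal sentinel, the internal nodes at string depth d correspond to length-d substrings that can be extended to the right by at least two distinct characters, and the unique suffix of length d contributes one leaf at that depth. The function name below is ours; a real experiment would use a linear-time suffix tree construction instead of this quadratic sketch.

        # Brute-force sketch of nu_S(d), the number of suffix tree nodes at string depth d.
        def nodes_at_depth(s: str, d: int) -> int:
            t = s + "$"                          # sentinel: every suffix becomes a leaf
            n = len(t)
            right_ext: dict[str, set[str]] = {}  # length-d substring -> characters that follow it
            for i in range(n - d):
                right_ext.setdefault(t[i:i + d], set()).add(t[i + d])
            branching = sum(1 for ext in right_ext.values() if len(ext) >= 2)
            leaf = 1 if 1 <= d <= n else 0       # the one suffix of t that has length d
            return branching + leaf

        print(nodes_at_depth("abaababaabaab", 3))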

    Checking whether a word is Hamming-isometric in linear time

    A finite word f is Hamming-isometric if, for any two words u and v of the same length avoiding f, u can be transformed into v by changing one by one all the letters on which u differs from v, in such a way that all of the new words obtained in this process also avoid f. Words which are not Hamming-isometric have been characterized as words having a border with two mismatches. We derive from this characterization a linear-time algorithm to check whether a word is Hamming-isometric. It is based on pattern matching algorithms with k mismatches. Lee-isometric words over a four-letter alphabet have been characterized as words having a border with two Lee errors. We derive from this characterization a linear-time algorithm to check whether a word over an alphabet of size four is Lee-isometric. Comment: a second algorithm for checking whether a word is Hamming-isometric is added, using the result given in reference [5].
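
    For illustration only, here is a quadratic brute-force check of the criterion as quoted above (not the paper's linear-time algorithm, which relies on pattern matching with k mismatches, and possibly omitting refinements of the characterization): a word is flagged as non-isometric when some proper prefix and the suffix of the same length differ in exactly two positions. The function names are ours.

        # Brute-force test of the "border with two mismatches" criterion; O(n^2) time.
        def has_two_mismatch_border(f: str) -> bool:
            n = len(f)
            for ell in range(2, n):              # proper border lengths that can carry two mismatches
                mismatches = sum(a != b for a, b in zip(f[:ell], f[n - ell:]))
                if mismatches == 2:
                    return True
            return False

        def is_hamming_isometric(f: str) -> bool:
            # Per the characterization above: isometric iff no border with two mismatches.
            return not has_two_mismatch_border(f)

        for w in ["aba", "abba", "abab"]:
            print(w, is_hamming_isometric(w))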

    On Suffix Tree Breadth

    The suffix tree - the compacted trie of all the suffixes of a string - is the most important and widely-used data structure in string processing. We consider a natural combinatorial question about suffix trees: for a string S of length n, how many nodes ν_S(d) can there be at (string) depth d in its suffix tree? We prove that ν(n, d) = max_{S ∈ Σ^n} ν_S(d) is O((n/d) log n), and show that this bound is almost tight, describing strings for which ν_S(d) is Ω((n/d) log(n/d)).

    Dichotomic Selection on Words: A Probabilistic Analysis

    The paper studies the behaviour of selection algorithms that are based on dichotomy principles. Given as input an ordered list L and a searched element x not in L, they return the interval of L to which x belongs. We focus here on the case of words, where dichotomy principles lead to a selection algorithm designed by Crochemore, Hancart and Lecroq, which appears to be "quasi-optimal". We perform a probabilistic analysis of this algorithm that exhibits its quasi-optimality on average.
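
    A minimal sketch of the selection task itself, assuming a plain binary search (this is not the Crochemore-Hancart-Lecroq variant analysed in the paper, which additionally reuses longest-common-prefix information so that letters of x are not recompared at every step):

        # Locate the interval of a sorted list L of words that contains a word x not in L.
        # Indices -1 and len(L) act as sentinels for the two outer intervals.
        from bisect import bisect_left

        def locate(L: list[str], x: str) -> tuple[int, int]:
            j = bisect_left(L, x)    # first index with L[j] >= x
            return j - 1, j          # x lies strictly between L[j-1] and L[j]

        L = ["ab", "abc", "ba", "bab", "cab"]
        print(locate(L, "b"))        # (1, 2): "b" lies between "abc" and "ba"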

    Longest Common Abelian Factors and Large Alphabets

    Two strings X and Y are considered Abelian equal if the letters of X can be permuted to obtain Y (and vice versa). Recently, Alatabbi et al. (2015) considered the longest common Abelian factor problem, in which we are asked to find the length of the longest Abelian-equal factor present in a given pair of strings. They provided an algorithm that uses O(σn²) time and O(σn) space, where n is the length of the strings and σ is the alphabet size. In this paper we describe an algorithm that uses O(n² log² n log* n) time and O(n log² n) space, significantly improving Alatabbi et al.'s result unless the alphabet is small. Our algorithm makes use of techniques for maintaining a dynamic set of strings under split, join, and equality testing (Mehlhorn et al., Algorithmica 17(2), 1997).
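
    For concreteness, a naive baseline (our own illustration, nowhere near the bounds above and not the paper's method): for every candidate length, slide a window over both strings and compare Parikh vectors, i.e. letter-count tuples.

        # Naive longest common Abelian factor via sliding-window Parikh vectors.
        from collections import Counter

        def parikh_vectors(s: str, ell: int, alphabet: list[str]):
            """Yield the Parikh vector of every length-ell window of s as a tuple."""
            counts = Counter(s[:ell])
            yield tuple(counts[a] for a in alphabet)
            for i in range(ell, len(s)):
                counts[s[i]] += 1
                counts[s[i - ell]] -= 1
                yield tuple(counts[a] for a in alphabet)

        def longest_common_abelian_factor(x: str, y: str) -> int:
            alphabet = sorted(set(x) | set(y))
            best = 0
            for ell in range(1, min(len(x), len(y)) + 1):
                seen = set(parikh_vectors(x, ell, alphabet))
                if any(v in seen for v in parikh_vectors(y, ell, alphabet)):
                    best = ell
            return best

        print(longest_common_abelian_factor("abcabb", "bbacba"))  # 6: the two strings are anagrams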

    Substring Complexity in Sublinear Space

    Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of strings, e.g., the size z of the Lempel-Ziv parse or the number r of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size γ of a smallest string attractor. Let T be a string of length n. A string attractor of T is a set of positions of T capturing the occurrences of all the substrings of T. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing γ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure of compressibility that is based on the function S_T(k) counting the number of distinct substrings of length k of T, also known as the substring complexity of T. This new measure is defined as δ = sup{S_T(k)/k, k ≥ 1} and lower bounds all the relevant ad hoc measures previously considered. In particular, δ ≤ γ always holds and δ can be computed in O(n) time using Θ(n) working space. Kociumaka et al. showed that one can construct an O(δ log(n/δ))-sized representation of T supporting efficient direct access and efficient pattern matching queries on T. Given that for highly compressible strings, δ is significantly smaller than n, it is natural to pose the following question: Can we compute δ efficiently using sublinear working space? It is straightforward to show that in the comparison model, any algorithm computing δ using O(b) space requires Ω(n^{2-o(1)}/b) time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We thus wanted to investigate whether we can indeed match this lower bound. We address this algorithmic challenge by showing the following bounds to compute δ:
    - O((n³ log b)/b²) time using O(b) space, for any b ∈ [1,n], in the comparison model.
    - Õ(n²/b) time using Õ(b) space, for any b ∈ [√n,n], in the word RAM model.
    This gives an Õ(n^{1+ε})-time and Õ(n^{1-ε})-space algorithm to compute δ, for any 0 < ε ≤ 1/2. Let us remark that our algorithms compute S_T(k), for all k, within the same complexities.
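
    To make the measure concrete, here is a brute-force sketch of the definition (our own illustration, storing all substrings explicitly and therefore unrelated to the sublinear-space algorithms above): compute S_T(k) for every k by collecting the distinct length-k substrings, then take the maximum of S_T(k)/k.

        # Brute-force substring complexity and delta, straight from the definition.
        def substring_complexity(t: str) -> list[int]:
            """Return S_T(k) for k = 1..n (index k-1 holds S_T(k))."""
            n = len(t)
            return [len({t[i:i + k] for i in range(n - k + 1)}) for k in range(1, n + 1)]

        def delta(t: str) -> float:
            # S_T(k) = 0 for k > n, so the supremum is attained for some k in 1..n.
            return max(s_k / k for k, s_k in enumerate(substring_complexity(t), start=1))

        print(delta("abaababaabaab"))   # a highly repetitive example string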

    Substring Complexity in Sublinear Space

    Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad-hoc measures are employed to estimate the repetitiveness of strings, e.g., the size z of the Lempel-Ziv parse or the number r of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size γ of a smallest string attractor. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing γ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure that is based on the function S_T counting the cardinalities of the sets of substrings of each length of T, also known as the substring complexity. This new measure is defined as δ = sup{S_T(k)/k, k ≥ 1} and lower bounds all the measures previously considered. In particular, δ ≤ γ always holds and δ can be computed in O(n) time using Ω(n) working space. Kociumaka et al. showed that if δ is given, one can construct an O(δ log(n/δ))-sized representation of T supporting efficient direct access and efficient pattern matching queries on T. Given that for highly compressible strings, δ is significantly smaller than n, it is natural to pose the following question: Can we compute δ efficiently using sublinear working space? It is straightforward to show that any algorithm computing δ using O(b) space requires Ω(n^{2-o(1)}/b) time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We present the following results: an O(n³/b²)-time and O(b)-space algorithm to compute δ, for any b ∈ [1,n]; and an Õ(n²/b)-time and O(b)-space algorithm to compute δ, for any b ∈ [n^{2/3},n].