3 research outputs found

    Constructing Antidictionaries in Output-Sensitive Space

    A word $x$ that is absent from a word $y$ is called minimal if all its proper factors occur in $y$. Given a collection of $k$ words $y_1,y_2,\ldots,y_k$ over an alphabet $\Sigma$, we are asked to compute the set $\mathrm{M}^{\ell}_{y_1\#\cdots\#y_k}$ of minimal absent words of length at most $\ell$ of the word $y=y_1\#y_2\#\cdots\#y_k$, $\#\notin\Sigma$. In data compression, this corresponds to computing the antidictionary of $k$ documents. In bioinformatics, it corresponds to computing words that are absent from a genome of $k$ chromosomes. This computation generally requires $\Omega(n)$ space for $n=|y|$ using any of the many available $\mathcal{O}(n)$-time algorithms. This is because an $\Omega(n)$-sized text index is constructed over $y$, which can be impractical for large $n$. We perform the same computation incrementally using output-sensitive space. This goal is reasonable when $\|\mathrm{M}^{\ell}_{y_1\#\cdots\#y_N}\|=o(n)$ for all $N\in[1,k]$. For instance, in the human genome, $n\approx 3\times 10^{9}$ but $\|\mathrm{M}^{12}_{y_1\#\cdots\#y_k}\|\approx 10^{6}$. We consider a constant-sized alphabet for stating our results. We show that all $\mathrm{M}^{\ell}_{y_1},\ldots,\mathrm{M}^{\ell}_{y_1\#\cdots\#y_k}$ can be computed in $\mathcal{O}(kn+\sum_{N=1}^{k}\|\mathrm{M}^{\ell}_{y_1\#\cdots\#y_N}\|)$ total time using $\mathcal{O}(\mathrm{MaxIn}+\mathrm{MaxOut})$ space, where $\mathrm{MaxIn}$ is the length of the longest word in $\{y_1,\ldots,y_k\}$ and $\mathrm{MaxOut}=\max\{\|\mathrm{M}^{\ell}_{y_1\#\cdots\#y_N}\|:N\in[1,k]\}$. Proof-of-concept experimental results are also provided, confirming our theoretical findings and justifying our contribution.
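    To make the definition of a minimal absent word concrete, here is a brute-force sketch in Python. It is not the output-sensitive algorithm of the paper: it enumerates every candidate word up to the length bound, so it is only usable on tiny inputs. The function name `minimal_absent_words` and its parameters are illustrative assumptions. It relies on the fact that every proper factor of x is a factor of x without its first letter or of x without its last letter, so checking those two suffices.

```python
from itertools import product


def minimal_absent_words(y: str, alphabet: str, max_len: int) -> set[str]:
    """Brute-force minimal absent words of y with length at most max_len.

    Illustrative only: runs in time exponential in max_len.
    The separator '#' may appear in y but must not be in `alphabet`.
    """
    # Collect every factor of y of length at most max_len, plus the empty word.
    factors = {""}
    for i in range(len(y)):
        for j in range(i + 1, min(len(y), i + max_len) + 1):
            factors.add(y[i:j])

    result = set()
    for length in range(1, max_len + 1):
        for letters in product(alphabet, repeat=length):
            x = "".join(letters)
            # x is absent, but its longest proper prefix and suffix both occur.
            if x not in factors and x[1:] in factors and x[:-1] in factors:
                result.add(x)
    return result


print(sorted(minimal_absent_words("abaab", "ab", 3)))
print(sorted(minimal_absent_words("ab#ba", "ab", 2)))
```

    The second call shows the effect of the separator: factors spanning `#` never match a candidate over the alphabet, so words are only counted as present when they occur inside a single input word.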

    Substring Complexity in Sublinear Space

    Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad-hoc measures are employed to estimate the repetitiveness of strings, e.g., the size $z$ of the Lempel-Ziv parse or the number $r$ of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size $\gamma$ of a smallest string attractor. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing $\gamma$ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure that is based on the function $S_T$ counting the cardinalities of the sets of substrings of each length of $T$, also known as the substring complexity. This new measure is defined as $\delta=\sup\{S_T(k)/k, k\geq 1\}$ and lower bounds all the measures previously considered. In particular, $\delta\leq\gamma$ always holds and $\delta$ can be computed in $\mathcal{O}(n)$ time using $\Omega(n)$ working space. Kociumaka et al. showed that if $\delta$ is given, one can construct an $\mathcal{O}(\delta\log\frac{n}{\delta})$-sized representation of $T$ supporting efficient direct access and efficient pattern matching queries on $T$. Given that for highly compressible strings, $\delta$ is significantly smaller than $n$, it is natural to pose the following question: Can we compute $\delta$ efficiently using sublinear working space? It is straightforward to show that any algorithm computing $\delta$ using $\mathcal{O}(b)$ space requires $\Omega(n^{2-o(1)}/b)$ time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We present the following results: an $\mathcal{O}(n^3/b^2)$-time and $\mathcal{O}(b)$-space algorithm to compute $\delta$, for any $b\in[1,n]$; and an $\tilde{\mathcal{O}}(n^2/b)$-time and $\mathcal{O}(b)$-space algorithm to compute $\delta$, for any $b\in[n^{2/3},n]$.
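    As an illustration of the definition of δ only, the following Python sketch evaluates the supremum directly by counting distinct substrings of every length. It uses quadratic time and linear or more space, so it is neither the linear-time method nor the sublinear-space algorithms mentioned in the abstract. The function name `substring_complexity` is an assumption, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def substring_complexity(t: str) -> Fraction:
    """Return delta = max over k in [1, n] of S_T(k) / k.

    S_T(k) is the number of distinct substrings of length k of t.
    For k > n there are no substrings, so the supremum is attained for k <= n.
    """
    n = len(t)
    best = Fraction(0)
    for k in range(1, n + 1):
        distinct = {t[i:i + k] for i in range(n - k + 1)}
        best = max(best, Fraction(len(distinct), k))
    return best


print(substring_complexity("abaab"))
print(substring_complexity("0001011100"))
```

    In the second string, every binary word of length 3 occurs, so the maximum is reached at k = 3 rather than at k = 1.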

    Internal Shortest Absent Word Queries in Constant Time and Linear Space

    Given a string $T$ of length $n$ over an alphabet $\Sigma\subset\{1,2,\ldots,n^{\mathcal{O}(1)}\}$ of size $\sigma$, we are to preprocess $T$ so that given a range $[i,j]$, we can return a representation of a shortest string over $\Sigma$ that is absent in the fragment $T[i]\cdots T[j]$ of $T$. We present an $\mathcal{O}(n)$-space data structure that answers such queries in constant time and can be constructed in $\mathcal{O}(n\log_{\sigma} n)$ time.
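    The query itself can be pinned down with a naive Python sketch that answers each range from scratch, with no preprocessing. This is only a reference for what the query returns, not the constant-time data structure of the paper. The function name `shortest_absent_word`, the 0-based inclusive indexing, and passing the alphabet as a string are all assumptions made for the example.

```python
from itertools import product


def shortest_absent_word(t: str, i: int, j: int, alphabet: str) -> str:
    """Return a shortest word over `alphabet` absent from t[i..j] (0-based, inclusive).

    Naive reference: tries lengths 1, 2, ... and scans candidates in the
    order given by `alphabet`. Requires a non-empty alphabet.
    """
    fragment = t[i:j + 1]
    length = 1
    while True:
        # All substrings of the fragment with the current length.
        present = {fragment[p:p + length]
                   for p in range(len(fragment) - length + 1)}
        for letters in product(alphabet, repeat=length):
            word = "".join(letters)
            if word not in present:
                return word
        length += 1


print(shortest_absent_word("abaab", 0, 4, "ab"))
print(shortest_absent_word("abaab", 0, 1, "ab"))
```

    The loop always terminates: once the length exceeds the fragment length, no substring of that length is present, so the first candidate is returned.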