19,774 research outputs found

    Space-efficient detection of unusual words

    Full text link
    Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of O(σ2log2n)O(\sigma^2\log^2 n) bits, where nn is the length of the string and σ\sigma is the size of the alphabet. The size of the stack is o(n)o(n) except for very large values of σ\sigma. We further improve the algorithm by removing its time dependency on σ\sigma, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that do not occur\textit{do not occur} in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale.Comment: arXiv admin note: text overlap with arXiv:1502.0637

    Lightweight LCP Construction for Very Large Collections of Strings

    Full text link
    The longest common prefix array is a very advantageous data structure that, combined with the suffix array and the Burrows-Wheeler transform, allows to efficiently compute some combinatorial properties of a string useful in several applications, especially in biological contexts. Nowadays, the input data for many problems are big collections of strings, for instance the data coming from "next-generation" DNA sequencing (NGS) technologies. In this paper we present the first lightweight algorithm (called extLCP) for the simultaneous computation of the longest common prefix array and the Burrows-Wheeler transform of a very large collection of strings having any length. The computation is realized by performing disk data accesses only via sequential scans, and the total disk space usage never needs more than twice the output size, excluding the disk space required for the input. Moreover, extLCP allows to compute also the suffix array of the strings of the collection, without any other further data structure is needed. Finally, we test our algorithm on real data and compare our results with another tool capable to work in external memory on large collections of strings.Comment: This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/ The final version of this manuscript is in press in Journal of Discrete Algorithm

    Optimal Substring-Equality Queries with Applications to Sparse Text Indexing

    Full text link
    We consider the problem of encoding a string of length nn from an integer alphabet of size σ\sigma so that access and substring equality queries (that is, determining the equality of any two substrings) can be answered efficiently. Any uniquely-decodable encoding supporting access must take nlogσ+Θ(log(nlogσ))n\log\sigma + \Theta(\log (n\log\sigma)) bits. We describe a new data structure matching this lower bound when σnO(1)\sigma\leq n^{O(1)} while supporting both queries in optimal O(1)O(1) time. Furthermore, we show that the string can be overwritten in-place with this structure. The redundancy of Θ(logn)\Theta(\log n) bits and the constant query time break exponentially a lower bound that is known to hold in the read-only model. Using our new string representation, we obtain the first in-place subquadratic (indeed, even sublinear in some cases) algorithms for several string-processing problems in the restore model: the input string is rewritable and must be restored before the computation terminates. In particular, we describe the first in-place subquadratic Monte Carlo solutions to the sparse suffix sorting, sparse LCP array construction, and suffix selection problems. With the sole exception of suffix selection, our algorithms are also the first running in sublinear time for small enough sets of input suffixes. Combining these solutions, we obtain the first sublinear-time Monte Carlo algorithm for building the sparse suffix tree in compact space. We also show how to derandomize our algorithms using small space. This leads to the first Las Vegas in-place algorithm computing the full LCP array in O(nlogn)O(n\log n) time and to the first Las Vegas in-place algorithms solving the sparse suffix sorting and sparse LCP array construction problems in O(n1.5logσ)O(n^{1.5}\sqrt{\log \sigma}) time. Running times of these Las Vegas algorithms hold in the worst case with high probability.Comment: Refactored according to TALG's reviews. New w.h.p. bounds and Las Vegas algorithm

    Truly Subquadratic-Time Extension Queries and Periodicity Detection in Strings with Uncertainties

    Get PDF
    Strings with don\u27t care symbols, also called partial words, and more general indeterminate strings are a natural representation of strings containing uncertain symbols. A considerable effort has been made to obtain efficient algorithms for pattern matching and periodicity detection in such strings. Among those, a number of algorithms have been proposed that behave well on random data, but still their worst-case running time is Theta(n^2). We present the first truly subquadratic-time solutions for a number of such problems on partial words that can also be adapted to indeterminate strings over a constant-sized alphabet. We show that nn longest common compatible prefix queries (which correspond to longest common extension queries in regular strings) can be answered on-line in O(n * sqrt(n * log(n)) time after O(n * sqrt(n * log(n))-time preprocessing. We also present O(n * sqrt(n * log(n))-time algorithms for computing the prefix array and two types of border array of a partial word

    Covering Problems for Partial Words and for Indeterminate Strings

    Full text link
    We consider the problem of computing a shortest solid cover of an indeterminate string. An indeterminate string may contain non-solid symbols, each of which specifies a subset of the alphabet that could be present at the corresponding position. We also consider covering partial words, which are a special case of indeterminate strings where each non-solid symbol is a don't care symbol. We prove that indeterminate string covering problem and partial word covering problem are NP-complete for binary alphabet and show that both problems are fixed-parameter tractable with respect to kk, the number of non-solid symbols. For the indeterminate string covering problem we obtain a 2O(klogk)+nkO(1)2^{O(k \log k)} + n k^{O(1)}-time algorithm. For the partial word covering problem we obtain a 2O(klogk)+nkO(1)2^{O(\sqrt{k}\log k)} + nk^{O(1)}-time algorithm. We prove that, unless the Exponential Time Hypothesis is false, no 2o(k)nO(1)2^{o(\sqrt{k})} n^{O(1)}-time solution exists for either problem, which shows that our algorithm for this case is close to optimal. We also present an algorithm for both problems which is feasible in practice.Comment: full version (simplified and corrected); preliminary version appeared at ISAAC 2014; 14 pages, 4 figure

    Dynamic Relative Compression, Dynamic Partial Sums, and Substring Concatenation

    Get PDF
    Given a static reference string RR and a source string SS, a relative compression of SS with respect to RR is an encoding of SS as a sequence of references to substrings of RR. Relative compression schemes are a classic model of compression and have recently proved very successful for compressing highly-repetitive massive data sets such as genomes and web-data. We initiate the study of relative compression in a dynamic setting where the compressed source string SS is subject to edit operations. The goal is to maintain the compressed representation compactly, while supporting edits and allowing efficient random access to the (uncompressed) source string. We present new data structures that achieve optimal time for updates and queries while using space linear in the size of the optimal relative compression, for nearly all combinations of parameters. We also present solutions for restricted and extended sets of updates. To achieve these results, we revisit the dynamic partial sums problem and the substring concatenation problem. We present new optimal or near optimal bounds for these problems. Plugging in our new results we also immediately obtain new bounds for the string indexing for patterns with wildcards problem and the dynamic text and static pattern matching problem

    Faster Compact On-Line Lempel-Ziv Factorization

    Get PDF
    We present a new on-line algorithm for computing the Lempel-Ziv factorization of a string that runs in O(NlogN)O(N\log N) time and uses only O(Nlogσ)O(N\log\sigma) bits of working space, where NN is the length of the string and σ\sigma is the size of the alphabet. This is a notable improvement compared to the performance of previous on-line algorithms using the same order of working space but running in either O(Nlog3N)O(N\log^3N) time (Okanohara & Sadakane 2009) or O(Nlog2N)O(N\log^2N) time (Starikovskaya 2012). The key to our new algorithm is in the utilization of an elegant but less popular index structure called Directed Acyclic Word Graphs, or DAWGs (Blumer et al. 1985). We also present an opportunistic variant of our algorithm, which, given the run length encoding of size mm of a string of length NN, computes the Lempel-Ziv factorization on-line, in O(mmin{(loglogm)(loglogN)logloglogN,logmloglogm})O\left(m \cdot \min \left\{\frac{(\log\log m)(\log \log N)}{\log\log\log N}, \sqrt{\frac{\log m}{\log \log m}} \right\}\right) time and O(mlogN)O(m\log N) bits of space, which is faster and more space efficient when the string is run-length compressible
    corecore