334 research outputs found
Space-efficient detection of unusual words
Detecting all the strings that occur in a text more frequently or less
frequently than expected according to an IID or a Markov model is a basic
problem in string mining, yet current algorithms are based on data structures
that are either space-inefficient or incur large slowdowns, and current
implementations cannot scale to genomes or metagenomes in practice. In this
paper we engineer an algorithm based on the suffix tree of a string to use just
a small data structure built on the Burrows-Wheeler transform, and a stack of
O(σ² log² n) bits, where n is the length of the string and σ
is the size of the alphabet. The size of the stack is o(n) except for very
large values of σ. We further improve the algorithm by removing its time
dependency on σ, by reporting only a subset of the maximal repeats and
of the minimal rare words of the string, and by detecting and scoring candidate
under-represented strings that do not occur in the string. Our
algorithms are practical and work directly on the BWT, thus they can be
immediately applied to a number of existing datasets that are available in this
form, returning this string mining problem to a manageable scale.
Comment: arXiv admin note: text overlap with arXiv:1502.0637
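To make the notion of an over- or under-represented word concrete, the following is a minimal sketch that compares the observed count of each length-k substring with its expected count under an IID model estimated from the text itself. It is a naive illustration, not the BWT-based algorithm of the paper: the function name `iid_expected_counts` is hypothetical, it uses space proportional to the text, and it only scores words that actually occur.

```python
from collections import Counter
from math import prod


def iid_expected_counts(text, k):
    """Map each length-k substring of text to (observed, expected) counts.

    The expected count assumes characters are drawn independently with the
    empirical character frequencies of text (an IID model).
    """
    n = len(text)
    char_freq = Counter(text)
    prob = {c: char_freq[c] / n for c in char_freq}
    positions = n - k + 1  # number of starting positions for a length-k word
    observed = Counter(text[i:i + k] for i in range(positions))
    return {
        word: (count, positions * prod(prob[c] for c in word))
        for word, count in observed.items()
    }


scores = iid_expected_counts("abababab", 2)
for word, (obs, exp) in sorted(scores.items()):
    print(word, obs, exp)
```

A word whose observed count is far above or below its expected count is a candidate unusual word; a real scoring function would normalize the difference, for example by an estimate of the variance.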
Worst-case efficient single and multiple string matching on packed texts in the word-RAM model
In this paper, we explore worst-case solutions for the problems of single and multiple matching on strings in the word-RAM model with word length w. In the first problem, we have to build a data structure based on a pattern p of length m over an alphabet of size σ such that we can answer the following query: given a text T of length n, where each character is encoded using log σ bits, return the positions of all the occurrences of p in T (in the following, occ denotes the number of reported occurrences). For the multi-pattern matching problem, we have a set S of d patterns of total length m, and a query on a text T consists of finding all positions of all occurrences in T of the patterns in S. As each character of the text is encoded using log σ bits and we can read w bits in constant time in the RAM model, we assume that we can read up to Θ(w/log σ) consecutive characters of the text in one time step. This implies that the fastest possible query time for both problems is O(n log σ / w + occ). In this paper we present several results for both problems that come close to this best possible query time. We first present two linear-space data structures, one for each problem: the first answers single pattern matching queries in time O(n(1/m + log σ / w) + occ), while the second answers multiple pattern matching queries in time O(n((log d + log y + log log m)/y + log σ / w) + occ), where y is the length of the shortest pattern. We then show how a simple application of the Four Russians technique yields data structures with query times independent of the length of the shortest pattern (the length of the only pattern in the case of single string matching), at the expense of using more space.
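The packed-text assumption can be illustrated with a short sketch: characters are stored with a fixed number of bits each, so a single machine word holds about w/log σ of them, and one word comparison tests several characters at once. The code below is only an illustration under stated assumptions. The names `chars_per_word`, `pack`, and `packed_find` are hypothetical, Python integers stand in for machine words, and the scan still tries every alignment, so it does not achieve the query times stated in the abstract.

```python
def chars_per_word(w, sigma):
    """Number of characters that fit in a w-bit word for alphabet size sigma."""
    bits_per_char = max(1, (sigma - 1).bit_length())  # ceil(log2(sigma))
    return w // bits_per_char


def pack(codes, bits_per_char):
    """Pack integer character codes into one integer, first code in the low bits."""
    word = 0
    for i, code in enumerate(codes):
        word |= code << (i * bits_per_char)
    return word


def packed_find(text_codes, pattern_codes, bits_per_char):
    """Return all start positions of the pattern, comparing m characters per step."""
    n, m = len(text_codes), len(pattern_codes)
    if m == 0 or m > n:
        return []
    mask = (1 << (m * bits_per_char)) - 1
    packed_text = pack(text_codes, bits_per_char)
    packed_pattern = pack(pattern_codes, bits_per_char)
    hits = []
    for i in range(n - m + 1):
        # Extract the m-character window at position i and compare it in one step.
        window = (packed_text >> (i * bits_per_char)) & mask
        if window == packed_pattern:
            hits.append(i)
    return hits


# DNA alphabet with 2 bits per character (an assumed encoding).
code = {"A": 0, "C": 1, "G": 2, "T": 3}
text = [code[c] for c in "ACGTACG"]
pattern = [code[c] for c in "ACG"]
print(chars_per_word(64, 4))          # 32 characters per 64-bit word
print(packed_find(text, pattern, 2))  # [0, 4]
```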
R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space
Enumerating characteristic substrings (e.g., maximal repeats, minimal unique substrings, and minimal absent words) in a given string is an important research topic with a wide variety of applications in areas such as string processing and computational biology. Although several enumeration algorithms for characteristic substrings have been proposed, they are not space-efficient in that their space usage is proportional to the length of the input string. Recently, the run-length encoded Burrows-Wheeler transform (RLBWT) has attracted increased attention in string processing, and various algorithms for the RLBWT have been developed. Developing enumeration algorithms for characteristic substrings with the RLBWT, however, remains a challenge. In this paper, we present r-enum (RLBWT-based enumeration), the first enumeration algorithm for characteristic substrings based on the RLBWT. R-enum runs in O(n log log (n/r)) time and with O(r log n) bits of working space for string length n and number r of runs in the RLBWT. Here, r is expected to be significantly smaller than n for highly repetitive strings (i.e., strings with many repetitions). Experiments using a benchmark dataset of highly repetitive strings show that r-enum is more space-efficient than previous methods. In addition, we demonstrate the applicability of r-enum to a huge string by performing experiments on a 300-gigabyte string of 100 human genomes.
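The parameter r in the bounds above is the number of maximal runs of equal characters in the BWT. A minimal sketch of how r can be computed on a small input follows. It builds the BWT naively by sorting all rotations, which is far too slow and memory-hungry for genome-scale data and is not how r-enum works; the function names and the choice of `$` as the end marker are assumptions made for the example.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform via sorted rotations of text + sentinel.

    Assumes the sentinel does not occur in text and sorts before every
    character of text.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for _ in groupby(s))


transformed = bwt("banana")
print(transformed)              # annb$aa
print(count_runs(transformed))  # 5
```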