On Maximal Repeats in Compressed Strings
This paper presents and proves a new non-trivial upper bound on the number of maximal repeats of compressed strings. Using Theorem 1 of Raffinot's article "On Maximal Repeats in Strings", this upper bound can be directly translated into an upper bound on the number of nodes in the Compacted Directed Acyclic Word Graphs of compressed strings.
More formally, this paper proves that the number of maximal repeats in a string with z (self-referential) LZ77-factors and without q-th powers is at most 3q(z+1)^3-2. Also, this paper proves that for 2000 <= z <= q this upper bound is tight up to a constant factor.
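As an illustration of the parameter z used in the bound above, the following is a minimal sketch of a greedy self-referential LZ77 factorization. It assumes the variant in which each factor is either the longest prefix of the remaining suffix that also starts at an earlier position (overlap with the factor itself allowed) or a single fresh character. The paper may use a slightly different variant, and the function name `lz77_factors` is hypothetical.

```python
def lz77_factors(s):
    """Greedy self-referential LZ77 factorization (quadratic-time sketch).

    Assumption: a factor is the longest prefix of s[i:] that also occurs
    starting at some earlier position j < i (the copy may overlap the
    factor), or a single character if no such prefix exists.
    """
    factors = []
    i, n = 0, len(s)
    while i < n:
        best = 0
        for j in range(i):
            # Extend the match between the copy at j and the suffix at i.
            length = 0
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            best = max(best, length)
        step = max(best, 1)  # a fresh character forms a factor of length 1
        factors.append(s[i:i + step])
        i += step
    return factors


if __name__ == "__main__":
    print(lz77_factors("abababab"))  # z is the length of the returned list
```

Under this variant, a highly periodic string such as "abababab" has only three factors, which is why z can be much smaller than the string length.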
On Extensions of Maximal Repeats in Compressed Strings
This paper provides an upper bound for several subsets of maximal repeats and
maximal pairs in compressed strings and also presents a formerly unknown
relationship between maximal pairs and the run-length Burrows-Wheeler
transform.
This relationship is used to obtain a different proof for the Burrows-Wheeler
conjecture which has recently been proven by Kempa and Kociumaka in "Resolution
of the Burrows-Wheeler Transform Conjecture".
More formally, this paper proves that a string with LZ77-factors and
without -th powers has at most runs in the
run-length Burrows-Wheeler transform and the number of arcs in the compacted
directed acyclic word graph of is bounded from above by
Composite repetition-aware data structures
In highly repetitive strings, like collections of genomes from the same
species, distinct measures of repetition all grow sublinearly in the length of
the text, and indexes targeted to such strings typically depend only on one of
these measures. We describe two data structures whose size depends on multiple
measures of repetition at once, and that provide competitive tradeoffs between
the time for counting and reporting all the exact occurrences of a pattern, and
the space taken by the structure. The key component of our constructions is the
run-length encoded BWT (RLBWT), which takes space proportional to the number of
BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it
with data structures from LZ77 indexes, which take space proportional to the
number of LZ77 factors, and with the compact directed acyclic word graph
(CDAWG), which takes space proportional to the number of extensions of maximal
repeats. The combination of CDAWG and RLBWT also enables a new representation
of the suffix tree, whose size depends again on the number of extensions of
maximal repeats, and that is powerful enough to support matching statistics and
constant-space traversal.
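To make the "number of BWT runs" measure concrete, here is a small sketch that builds the Burrows-Wheeler transform by sorting rotations and then run-length encodes it. This is a naive construction for illustration only, not the data structure described in the abstract. It assumes the sentinel `$` does not occur in the input and sorts before every input character; the function names are hypothetical.

```python
def bwt(s, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (naive sketch).

    Assumption: `sentinel` is absent from s and smaller than all its characters.
    """
    t = s + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(text):
    """Collapse maximal runs of equal characters into (char, length) pairs."""
    runs = []
    for c in text:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return [(c, k) for c, k in runs]


if __name__ == "__main__":
    transformed = bwt("banana")
    runs = run_length_encode(transformed)
    print(transformed, runs, len(runs))
```

The length of the list returned by `run_length_encode` is the number of runs, the quantity to which the space of an RLBWT is proportional.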
Faster algorithms for computing maximal multirepeats in multiple sequences
A repeat in a string is a substring that occurs more than once. A repeat is extendible if every occurrence of the repeat has an identical letter either on the left or on the right; otherwise, it is maximal. A multirepeat is a repeat that occurs at least m_min times (m_min >= 2) in each of at least q >= 1 strings in a given set of strings. In this paper, we describe a family of efficient algorithms based on suffix arrays to compute maximal multirepeats under various constraints. Our algorithms are faster, more flexible and much more space-efficient than algorithms recently proposed for this problem. The results extend recent work by two of the authors computing all maximal repeats in a single string.
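The definition of a maximal repeat given above can be checked directly with a brute-force sketch for a single string. This is not the suffix-array algorithm of the paper, only an illustration of the definition. It assumes that a string boundary counts as a letter different from every other letter, and the function name `maximal_repeats` is hypothetical.

```python
from collections import defaultdict


def maximal_repeats(s):
    """Return all maximal repeats of s by brute force (roughly cubic time).

    Assumption: the start and end of the string act as unique letters, so an
    occurrence touching a boundary cannot be extended in that direction.
    """
    n = len(s)
    starts_of = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            starts_of[s[i:j]].append(i)

    result = set()
    for w, starts in starts_of.items():
        if len(starts) < 2:
            continue  # not a repeat
        lefts = {s[i - 1] if i > 0 else None for i in starts}
        rights = {s[i + len(w)] if i + len(w) < n else None for i in starts}
        # Extendible means all left letters agree or all right letters agree.
        if len(lefts) > 1 and len(rights) > 1:
            result.add(w)
    return result


if __name__ == "__main__":
    print(sorted(maximal_repeats("banana")))
```

For "banana", the repeats "an", "na" and "n" are extendible (for example, every "an" is followed by "a"), while "a" and "ana" are maximal.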
Space-efficient detection of unusual words
Detecting all the strings that occur in a text more frequently or less
frequently than expected according to an IID or a Markov model is a basic
problem in string mining, yet current algorithms are based on data structures
that are either space-inefficient or incur large slowdowns, and current
implementations cannot scale to genomes or metagenomes in practice. In this
paper we engineer an algorithm based on the suffix tree of a string to use just
a small data structure built on the Burrows-Wheeler transform, and a stack of
bits, where is the length of the string and
is the size of the alphabet. The size of the stack is except for very
large values of . We further improve the algorithm by removing its time
dependency on , by reporting only a subset of the maximal repeats and
of the minimal rare words of the string, and by detecting and scoring candidate
under-represented strings that do not occur in the string. Our
algorithms are practical and work directly on the BWT, thus they can be
immediately applied to a number of existing datasets that are available in this
form, returning this string mining problem to a manageable scale.
A framework for space-efficient string kernels
String kernels are typically used to compare genome-scale sequences whose
length makes alignment impractical, yet their computation is based on data
structures that are either space-inefficient, or incur large slowdowns. We show
that a number of exact string kernels, like the k-mer kernel, the substrings
kernels, a number of length-weighted kernels, the minimal absent words kernel,
and kernels with Markovian corrections, can all be computed in time and
in bits of space in addition to the input, using just a
data structure on the Burrows-Wheeler transform of the
input strings, which takes time per element in its output. The same
bounds hold for a number of measures of compositional complexity based on
multiple values of k, like the k-mer profile and the k-th order empirical
entropy, and for calibrating the value of k using the data.
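For readers unfamiliar with the k-mer profile and the k-mer kernel mentioned above, the following is a naive sketch that counts k-mers with a hash table and takes the inner product of two profiles. It is not the space-efficient BWT-based method of the abstract, and it assumes the plain unnormalized inner product as the kernel. The function names are hypothetical.

```python
from collections import Counter


def kmer_profile(s, k):
    """Count every substring of length k in s (empty if len(s) < k)."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def kmer_kernel(s, t, k):
    """Unnormalized k-mer kernel: inner product of the two k-mer profiles."""
    profile_s = kmer_profile(s, k)
    profile_t = kmer_profile(t, k)
    # Counter returns 0 for k-mers that are absent from t.
    return sum(count * profile_t[w] for w, count in profile_s.items())


if __name__ == "__main__":
    print(kmer_profile("banana", 2))
    print(kmer_kernel("banana", "ananas", 2))
```

This version stores every distinct k-mer explicitly, which is the kind of space overhead the abstract aims to avoid at genome scale.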
Fast Label Extraction in the CDAWG
The compact directed acyclic word graph (CDAWG) of a string of length
takes space proportional just to the number of right extensions of the
maximal repeats of , and it is thus an appealing index for highly repetitive
datasets, like collections of genomes from similar species, in which grows
significantly more slowly than . We reduce from to
the time needed to count the number of occurrences of a pattern of
length , using an existing data structure that takes an amount of space
proportional to the size of the CDAWG. This implies a reduction from
to in the time needed to
locate all the occurrences of the pattern. We also reduce from
to the time needed to read the characters of the
label of an edge of the suffix tree of , and we reduce from
to the time needed to compute the matching
statistics between a query of length and , using an existing
representation of the suffix tree based on the CDAWG. All such improvements
derive from extracting the label of a vertex or of an arc of the CDAWG using a
straight-line program induced by the reversed CDAWG.
Comment: 16 pages, 1 figure. In proceedings of the 24th International
Symposium on String Processing and Information Retrieval (SPIRE 2017).