Longest Common Extensions in Trees
The longest common extension (LCE) of two indices in a string is the length
of the longest identical substrings starting at these two indices. The LCE
problem asks to preprocess a string into a compact data structure that supports
fast LCE queries. In this paper we generalize the LCE problem to trees and
suggest a few applications of LCE in trees to tries and XML databases. Given a
labeled and rooted tree of size n, the goal is to preprocess the tree into a
compact data structure that supports the following LCE queries between subpaths
and subtrees of the tree. Let v_1, v_2, w_1, and w_2 be nodes of the tree such
that w_1 and w_2 are descendants of v_1 and v_2, respectively.
\begin{itemize}
\item \LCEPP(v_1, w_1, v_2, w_2): (path-path \LCE) return the longest common prefix of the paths from v_1 to w_1 and from v_2 to w_2.
\item \LCEPT(v_1, w_1, v_2): (path-tree \LCE) return a maximal path-path LCE of the path from v_1 to w_1 and any path from v_2 to a descendant leaf.
\item \LCETT(v_1, v_2): (tree-tree \LCE) return a maximal path-path LCE of any pair of paths from v_1 and v_2 to descendant leaves.
\end{itemize}
We present the first non-trivial bounds for supporting these
queries. For \LCEPP queries, we present a linear-space solution with
query time. For \LCEPT queries, we present a linear-space
solution with query time, and complement this with a
lower bound showing that any path-tree LCE structure of size O(n \polylog(n))
must necessarily use time to answer queries. For \LCETT
queries, we present a time-space trade-off, that given any parameter , , leads to an space and query-time
solution. This is complemented with a reduction to the set intersection
problem, implying that a fast linear-space solution is not likely to exist.
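To make the path-path query concrete, here is a naive baseline sketch in Python. It is not the data structure from the paper: it walks both paths explicitly, so a query costs time proportional to the path lengths. The `Node` class, the choice to put labels on nodes, and the decision to return the length of the common prefix rather than the prefix itself are all assumptions made for illustration.

```python
class Node:
    """A tree node with a label and a parent pointer (hypothetical layout)."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent


def downward_path(v, w):
    """Return the labels on the path from v down to its descendant w, inclusive."""
    labels = []
    node = w
    # Climb from w to v through parent pointers, then reverse.
    while node is not v:
        if node is None:
            raise ValueError("w is not a descendant of v")
        labels.append(node.label)
        node = node.parent
    labels.append(v.label)
    labels.reverse()
    return labels


def lce_pp(v1, w1, v2, w2):
    """Naive path-path LCE: length of the longest common prefix of the two paths."""
    length = 0
    for a, b in zip(downward_path(v1, w1), downward_path(v2, w2)):
        if a != b:
            break
        length += 1
    return length
```

A baseline like this is mainly useful as a reference against which a faster structure can be checked on small trees.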
Finger Search in Grammar-Compressed Strings
Grammar-based compression, where one replaces a long string by a small
context-free grammar that generates the string, is a simple and powerful
paradigm that captures many popular compression schemes. Given a grammar, the
random access problem is to compactly represent the grammar while supporting
random access, that is, given a position in the original uncompressed string
report the character at that position. In this paper we study the random access
problem with the finger search property, that is, the time for a random access
query should depend on the distance between a specified index , called the
\emph{finger}, and the query index . We consider both a static variant,
where we first place a finger and subsequently access indices near the finger
efficiently, and a dynamic variant that also supports moving the finger in
time that depends on the distance moved.
Let be the size of the grammar, and let be the size of the string. For
the static variant we give a linear space representation that supports placing
the finger in time and subsequently accessing in time,
where is the distance between the finger and the accessed index. For the
dynamic variant we give a linear space representation that supports placing the
finger in time and accessing and moving the finger in time. Compared to the best linear space solution to random
access, we improve a query bound to for the static
variant and to for the dynamic variant, while
maintaining linear space. As an application of our results we obtain an
improved solution to the longest common extension problem in grammar compressed
strings. To obtain our results, we introduce several new techniques of
independent interest, including a novel van Emde Boas style decomposition of
grammars.
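The random access problem described above can be illustrated with a plain top-down descent through the grammar. The Python sketch below is a simple baseline, not the finger search structure from the paper: each query costs time proportional to the height of the grammar and ignores any finger. The rule encoding (a one-character string for a terminal rule, a pair of symbol names for a binary rule) and the name `make_accessor` are assumptions made for illustration.

```python
def make_accessor(rules, start):
    """Build a random-access function for the string generated by `start`.

    `rules` maps each nonterminal name either to a single character
    (terminal rule) or to a pair (left, right) of nonterminal names.
    Returns (access, n), where n is the length of the generated string.
    """
    lengths = {}

    def length(sym):
        # Memoized expansion length of each nonterminal.
        if sym not in lengths:
            rhs = rules[sym]
            if isinstance(rhs, str):
                lengths[sym] = 1
            else:
                left, right = rhs
                lengths[sym] = length(left) + length(right)
        return lengths[sym]

    n = length(start)

    def access(i):
        """Return the character at position i (0-based) of the string."""
        if not 0 <= i < n:
            raise IndexError(i)
        sym = start
        while True:
            rhs = rules[sym]
            if isinstance(rhs, str):
                return rhs
            left, right = rhs
            if i < length(left):
                sym = left           # position lies in the left part
            else:
                i -= length(left)    # skip the left part
                sym = right

    return access, n
```

For example, with the rules `F1 -> b`, `F2 -> a`, `F3 -> F2 F1`, `F4 -> F3 F2`, `F5 -> F4 F3`, `F6 -> F5 F4`, the symbol `F6` generates the 8-character string `abaababa`, and `access(i)` recovers each character without expanding the whole string.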
Fast Label Extraction in the CDAWG
The compact directed acyclic word graph (CDAWG) of a string of length
takes space proportional just to the number of right extensions of the
maximal repeats of , and it is thus an appealing index for highly repetitive
datasets, like collections of genomes from similar species, in which grows
significantly more slowly than . We reduce from to
the time needed to count the number of occurrences of a pattern of
length , using an existing data structure that takes an amount of space
proportional to the size of the CDAWG. This implies a reduction from
to in the time needed to
locate all the occurrences of the pattern. We also reduce from
to the time needed to read the characters of the
label of an edge of the suffix tree of , and we reduce from
to the time needed to compute the matching
statistics between a query of length and , using an existing
representation of the suffix tree based on the CDAWG. All such improvements
derive from extracting the label of a vertex or of an arc of the CDAWG using a
straight-line program induced by the reversed CDAWG.
Comment: 16 pages, 1 figure. In proceedings of the 24th International
Symposium on String Processing and Information Retrieval (SPIRE 2017). arXiv
admin note: text overlap with arXiv:1705.0864
Space-efficient detection of unusual words
Detecting all the strings that occur in a text more frequently or less
frequently than expected according to an IID or a Markov model is a basic
problem in string mining, yet current algorithms are based on data structures
that are either space-inefficient or incur large slowdowns, and current
implementations cannot scale to genomes or metagenomes in practice. In this
paper we engineer an algorithm based on the suffix tree of a string to use just
a small data structure built on the Burrows-Wheeler transform, and a stack of
bits, where is the length of the string and
is the size of the alphabet. The size of the stack is except for very
large values of . We further improve the algorithm by removing its time
dependency on the alphabet size, by reporting only a subset of the maximal repeats and
of the minimal rare words of the string, and by detecting and scoring candidate
under-represented strings that do not occur in the string. Our
algorithms are practical and work directly on the BWT, thus they can be
immediately applied to a number of existing datasets that are available in this
form, returning this string mining problem to a manageable scale.
Comment: arXiv admin note: text overlap with arXiv:1502.0637