12 research outputs found
Fast Label Extraction in the CDAWG
The compact directed acyclic word graph (CDAWG) of a string of length
takes space proportional just to the number of right extensions of the
maximal repeats of , and it is thus an appealing index for highly repetitive
datasets, like collections of genomes from similar species, in which grows
significantly more slowly than . We reduce from to
the time needed to count the number of occurrences of a pattern of
length , using an existing data structure that takes an amount of space
proportional to the size of the CDAWG. This implies a reduction from
to in the time needed to
locate all the occurrences of the pattern. We also reduce from
to the time needed to read the characters of the
label of an edge of the suffix tree of , and we reduce from
to the time needed to compute the matching
statistics between a query of length and , using an existing
representation of the suffix tree based on the CDAWG. All such improvements
derive from extracting the label of a vertex or of an arc of the CDAWG using a
straight-line program induced by the reversed CDAWG.Comment: 16 pages, 1 figure. In proceedings of the 24th International
Symposium on String Processing and Information Retrieval (SPIRE 2017). arXiv
admin note: text overlap with arXiv:1705.0864
Composite repetition-aware data structures
In highly repetitive strings, like collections of genomes from the same
species, distinct measures of repetition all grow sublinearly in the length of
the text, and indexes targeted to such strings typically depend only on one of
these measures. We describe two data structures whose size depends on multiple
measures of repetition at once, and that provide competitive tradeoffs between
the time for counting and reporting all the exact occurrences of a pattern, and
the space taken by the structure. The key component of our constructions is the
run-length encoded BWT (RLBWT), which takes space proportional to the number of
BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it
with data structures from LZ77 indexes, which take space proportional to the
number of LZ77 factors, and with the compact directed acyclic word graph
(CDAWG), which takes space proportional to the number of extensions of maximal
repeats. The combination of CDAWG and RLBWT enables also a new representation
of the suffix tree, whose size depends again on the number of extensions of
maximal repeats, and that is powerful enough to support matching statistics and
constant-space traversal.Comment: (the name of the third co-author was inadvertently omitted from
previous version
Flexible Indexing of Repetitive Collections
Highly repetitive strings are increasingly being amassed by genome sequencing experiments, and by versioned archives of source code and webpages. We describe practical data structures that support counting and locating all the exact occurrences of a pattern in a repetitive text, by combining the run-length encoded Burrows-Wheeler transform (RLBWT) with the boundaries of Lempel-Ziv 77 factors. One such variant uses an amount of space comparable to LZ77 indexes, but it answers count queries between two and four orders of magnitude faster than all LZ77 and hybrid index implementations, at the cost of slower locate queries. Combining the RLBWT with the compact directed acyclic word graph answers locate queries for short patterns between four and ten times faster than a version of the run-length compressed suffix array (RLCSA) that uses comparable memory, and with very short patterns our index achieves speedups even greater than ten with respect to RLCSA
Flexible indexing of repetitive collections
Highly repetitive strings are increasingly being amassed by genome sequencing experiments, and by versioned archives of source code and webpages. We describe practical data structures that support counting and locating all the exact occurrences of a pattern in a repetitive text, by combining the run-length encoded Burrows-Wheeler transform (RLBWT) with the boundaries of Lempel-Ziv 77 factors. One such variant uses an amount of space comparable to LZ77 indexes, but it answers count queries between two and four orders of magnitude faster than all LZ77 and hybrid index implementations, at the cost of slower locate queries. Combining the RLBWT with the compact directed acyclic word graph answers locate queries for short patterns between four and ten times faster than a version of the run-length compressed suffix array (RLCSA) that uses comparablememory, and with very short patterns our index achieves speedups even greater than ten with respect to RLCS