12 research outputs found

    Fast Label Extraction in the CDAWG

    Full text link
    The compact directed acyclic word graph (CDAWG) of a string TT of length nn takes space proportional just to the number ee of right extensions of the maximal repeats of TT, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which ee grows significantly more slowly than nn. We reduce from O(mloglogn)O(m\log{\log{n}}) to O(m)O(m) the time needed to count the number of occurrences of a pattern of length mm, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from O(mloglogn+occ)O(m\log{\log{n}}+\mathtt{occ}) to O(m+occ)O(m+\mathtt{occ}) in the time needed to locate all the occ\mathtt{occ} occurrences of the pattern. We also reduce from O(kloglogn)O(k\log{\log{n}}) to O(k)O(k) the time needed to read the kk characters of the label of an edge of the suffix tree of TT, and we reduce from O(mloglogn)O(m\log{\log{n}}) to O(m)O(m) the time needed to compute the matching statistics between a query of length mm and TT, using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG.Comment: 16 pages, 1 figure. In proceedings of the 24th International Symposium on String Processing and Information Retrieval (SPIRE 2017). arXiv admin note: text overlap with arXiv:1705.0864

    Composite repetition-aware data structures

    Get PDF
    In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend only on one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure. The key component of our constructions is the run-length encoded BWT (RLBWT), which takes space proportional to the number of BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it with data structures from LZ77 indexes, which take space proportional to the number of LZ77 factors, and with the compact directed acyclic word graph (CDAWG), which takes space proportional to the number of extensions of maximal repeats. The combination of CDAWG and RLBWT enables also a new representation of the suffix tree, whose size depends again on the number of extensions of maximal repeats, and that is powerful enough to support matching statistics and constant-space traversal.Comment: (the name of the third co-author was inadvertently omitted from previous version

    Flexible Indexing of Repetitive Collections

    Get PDF
    Highly repetitive strings are increasingly being amassed by genome sequencing experiments, and by versioned archives of source code and webpages. We describe practical data structures that support counting and locating all the exact occurrences of a pattern in a repetitive text, by combining the run-length encoded Burrows-Wheeler transform (RLBWT) with the boundaries of Lempel-Ziv 77 factors. One such variant uses an amount of space comparable to LZ77 indexes, but it answers count queries between two and four orders of magnitude faster than all LZ77 and hybrid index implementations, at the cost of slower locate queries. Combining the RLBWT with the compact directed acyclic word graph answers locate queries for short patterns between four and ten times faster than a version of the run-length compressed suffix array (RLCSA) that uses comparable memory, and with very short patterns our index achieves speedups even greater than ten with respect to RLCSA

    Flexible indexing of repetitive collections

    No full text
    Highly repetitive strings are increasingly being amassed by genome sequencing experiments, and by versioned archives of source code and webpages. We describe practical data structures that support counting and locating all the exact occurrences of a pattern in a repetitive text, by combining the run-length encoded Burrows-Wheeler transform (RLBWT) with the boundaries of Lempel-Ziv 77 factors. One such variant uses an amount of space comparable to LZ77 indexes, but it answers count queries between two and four orders of magnitude faster than all LZ77 and hybrid index implementations, at the cost of slower locate queries. Combining the RLBWT with the compact directed acyclic word graph answers locate queries for short patterns between four and ten times faster than a version of the run-length compressed suffix array (RLCSA) that uses comparablememory, and with very short patterns our index achieves speedups even greater than ten with respect to RLCS
    corecore