4 research outputs found
Prefix-free parsing for building large tunnelled Wheeler graphs
We propose a new technique for creating a space-efficient index for large
repetitive text collections, such as pangenomic databases containing sequences
of many individuals from the same species. We combine two recent techniques
from this area: Wheeler graphs (Gagie et al., 2017) and prefix-free parsing
(PFP, Boucher et al., 2019). Wheeler graphs (WGs) are a general framework
encompassing several indexes based on the Burrows-Wheeler transform (BWT), such
as the FM-index. Wheeler graphs admit a succinct representation which can be
further compacted by employing the idea of tunnelling, which exploits
redundancies in the form of parallel, equally-labelled paths called blocks that
can be merged into a single path. The problem of finding the optimal set of
blocks for tunnelling, i.e. the one that minimizes the size of the resulting
WG, is known to be NP-complete and remains the most computationally challenging
part of the tunnelling process.
To find an adequate set of blocks in less time, we propose a new method based
on prefix-free parsing (PFP). The idea of PFP is to divide the input text
into phrases of roughly equal sizes that overlap by a fixed number of
characters. The original text is represented by a sequence of phrase ranks (the
parse) and a list of all used phrases (the dictionary). In repetitive texts,
the PFP of the text is generally much shorter than the original. To speed up
the block selection for tunnelling, we apply the PFP to obtain the parse and
the dictionary of the text, tunnel the WG of the parse using existing
heuristics and subsequently use this tunnelled parse to construct a compact WG
of the original text. Compared with constructing a WG from the original text
without PFP, our method is much faster and uses less memory on collections of
pangenomic sequences. Therefore, our method enables the use of WGs as a
pangenomic reference for real-world datasets.
Comment: 12 pages, 3 figures, 2 tables, to be published in the WABI (Workshop
on Algorithms in Bioinformatics) 2022 conference proceedings.
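The parsing step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the window length w, the modulus p, the toy hash, the sentinel characters, and the function names are all assumptions chosen for readability. A window whose hash is divisible by p closes the current phrase and opens the next one, so consecutive phrases overlap by exactly w characters.

```python
def window_hash(window):
    # Toy polynomial hash (assumption); practical tools use a rolling hash.
    h = 0
    for ch in window:
        h = (h * 256 + ord(ch)) % 1_000_000_007
    return h


def prefix_free_parse(text, w=2, p=3):
    """Return (dictionary, parse) for text, with phrases overlapping by w."""
    # Sentinels are assumed not to occur in the input text.
    s = "\x01" + text + "\x00" * w
    phrases = []
    start = 0
    for i in range(1, len(s) - w + 1):
        is_last = i == len(s) - w
        if is_last or window_hash(s[i:i + w]) % p == 0:
            # The phrase ends with this window; the next phrase starts with it.
            phrases.append(s[start:i + w])
            start = i
    dictionary = sorted(set(phrases))            # all distinct phrases
    rank = {ph: r for r, ph in enumerate(dictionary)}
    parse = [rank[ph] for ph in phrases]         # sequence of phrase ranks
    return dictionary, parse


def reconstruct(dictionary, parse, w=2):
    """Invert the parse: drop the w-character overlap, then the sentinels."""
    phrases = [dictionary[r] for r in parse]
    s = phrases[0] + "".join(ph[w:] for ph in phrases[1:])
    return s[1:-w]


text = "GATTACA" * 20
dictionary, parse = prefix_free_parse(text)
print(len(dictionary), "distinct phrases,", len(parse), "parse entries")
```

On a repetitive input such as this one, the dictionary holds far fewer phrases than the parse has entries, which is the effect the method relies on when it tunnels the parse instead of the full text.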
Space-efficient conversions from SLPs
We give algorithms that, given a straight-line program (SLP) with g rules
that generates (only) a text T, build within […] space the
Lempel-Ziv (LZ) parse of T (of z phrases) in time […] or in time
[…]. We also show how to build a locally consistent grammar
(LCG) of optimal size […] from the SLP
within […] space and in […] time, where δ is the
substring complexity measure of T. Finally, we show how to build the LZ parse
of T from such an LCG within […] space and in time […]. All our results hold with high probability.
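To make the two objects in this abstract concrete, here is a small Python sketch of an SLP and of a Lempel-Ziv parse of the text it generates. It is deliberately naive: it expands the SLP to the full text and parses it in quadratic time, which is exactly the cost the paper's algorithms are designed to avoid. The rule encoding, the function names, and the particular LZ variant (each phrase is the longest prefix of the remaining text that also starts at an earlier position, overlaps allowed, or a single new character) are assumptions made for illustration.

```python
def expand(rules, symbol):
    """Expand an SLP nonterminal into the text it generates.

    rules maps a nonterminal either to a single terminal character
    or to a pair (left, right) of nonterminals.
    """
    out = []
    stack = [symbol]
    while stack:
        rhs = rules[stack.pop()]
        if isinstance(rhs, str):
            out.append(rhs)          # terminal rule
        else:
            left, right = rhs
            stack.append(right)      # push right first so left is emitted first
            stack.append(left)
    return "".join(out)


def lz_parse(text):
    """Greedy LZ-style parse by brute force (quadratic or worse)."""
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best = 0
        for j in range(i):           # every earlier starting position
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            best = max(best, length)
        length = max(best, 1)        # a new character forms its own phrase
        phrases.append(text[i:i + length])
        i += length
    return phrases


# A Fibonacci-style SLP with 6 rules (hypothetical example).
rules = {
    "X1": "b",
    "X2": "a",
    "X3": ("X2", "X1"),
    "X4": ("X3", "X2"),
    "X5": ("X4", "X3"),
    "X6": ("X5", "X4"),
}
text = expand(rules, "X6")   # "abaababa"
print(lz_parse(text))        # ['a', 'b', 'a', 'aba', 'ba']
```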
MARIA: Multiple-alignment r-index with aggregation
There now exist compact indexes that can efficiently list all the occurrences
of a pattern in a dataset consisting of thousands of genomes, or even all the
occurrences of all the pattern's maximal exact matches (MEMs) with respect to
the dataset. Unless we are lucky and the pattern is specific to only a few
genomes, however, we could be swamped by hundreds of matches -- or even
hundreds per MEM -- only to discover that most or all of the matches are to
substrings that occupy the same few columns in a multiple alignment. To address
this issue, in this paper we present a simple and compact data index MARIA that
stores a multiple alignment such that, given the position of one match of a
pattern (or a MEM or other substring of a pattern) and its length, we can
quickly list all the distinct columns of the multiple alignment where matches
start.
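The aggregation idea can be shown with a brute-force Python sketch. It is not MARIA's data structure or query interface: it takes the pattern itself rather than the position and length of one match, scans every row, and uses a hypothetical function name and a "-" gap character as assumptions. It only illustrates how many matches can collapse to a few distinct starting columns.

```python
def distinct_start_columns(alignment, pattern, gap="-"):
    """List the distinct alignment columns where matches of pattern start.

    alignment is a list of equal-length rows; gap characters are skipped
    when matching, but columns are counted in the gapped coordinates.
    """
    if not pattern:
        return []
    columns = set()
    for row in alignment:
        # col_of[k] is the alignment column of the k-th ungapped character.
        col_of = [c for c, ch in enumerate(row) if ch != gap]
        seq = "".join(row[c] for c in col_of)
        start = seq.find(pattern)
        while start != -1:
            columns.add(col_of[start])
            start = seq.find(pattern, start + 1)
    return sorted(columns)


alignment = [
    "GATTACA-",
    "GAT-ACAT",
    "GATTACAT",
]
print(distinct_start_columns(alignment, "ACA"))  # [4]
print(distinct_start_columns(alignment, "AT"))   # [1, 6]
```

Here "ACA" matches once in each of the three rows, yet all three matches start in the same column, so a single column is reported.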
Wheeler Maps
Motivated by challenges in pangenomic read alignment, we propose a generalization of Wheeler graphs that we call Wheeler maps. A Wheeler map stores a text T[1..n] and an assignment of tags to the characters of T such that we can preprocess a pattern P[1..m] and then, given i and j, quickly return all the distinct tags labeling the first characters of the occurrences of P[i..j] in T. For the applications that most interest us, characters with long common contexts are likely to have the same tag, so we consider the number t of runs in the list of tags sorted by their characters’ positions in the Burrows-Wheeler Transform (BWT) of T. We show how, given a straight-line program with g rules for T, we can build an O(g + r + t)-space Wheeler map, where r is the number of runs in the BWT of T, with which we can preprocess a pattern P[1..m] in O(m log n) time and then return the k distinct tags for P[i..j] in optimal O(k) time for any given i and j.
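The quantities r and t, and the query a Wheeler map answers, can be made concrete with a naive Python sketch. It builds the BWT by sorting suffixes and answers queries by scanning the text, so it has none of the space or time guarantees stated in the abstract. The function names, the sentinel character, and the choice to skip the sentinel (which has no tag) when listing tags are assumptions made for illustration.

```python
def suffix_array(s):
    # Naive construction by sorting suffixes; fine for tiny examples only.
    return sorted(range(len(s)), key=lambda k: s[k:])


def count_runs(seq):
    """Number of maximal runs of equal consecutive items."""
    return sum(1 for k, x in enumerate(seq) if k == 0 or x != seq[k - 1])


def bwt_and_tags(text, tags):
    """Return the BWT of text plus the tags listed in BWT order."""
    s = text + "\x00"                     # sentinel assumed absent from text
    bwt, tag_list = [], []
    for pos in suffix_array(s):
        if pos == 0:
            bwt.append("\x00")            # the sentinel carries no tag
        else:
            bwt.append(s[pos - 1])
            tag_list.append(tags[pos - 1])
    return "".join(bwt), tag_list


def distinct_tags(text, tags, pattern):
    """Distinct tags on the first characters of the occurrences of pattern."""
    found = []
    start = text.find(pattern)
    while start != -1:
        if tags[start] not in found:
            found.append(tags[start])
        start = text.find(pattern, start + 1)
    return found


text = "ACGTACGT"
tags = [0, 1, 2, 3, 0, 1, 2, 3]           # e.g. alignment columns
bwt, tag_list = bwt_and_tags(text, tags)
print(count_runs(bwt))                    # r = 5
print(tag_list, count_runs(tag_list))     # [3, 3, 0, 0, 1, 1, 2, 2] and t = 4
print(distinct_tags(text, tags, "CG"))    # [1]
```

In this toy input, characters that share a right context also share a tag, so the tags fall into only four runs in BWT order even though the text has eight characters.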