6,715 research outputs found
Haplotype-aware graph indexes
The variation graph toolkit (VG) represents genetic variation as a graph. Each path in the graph is a potential haplotype, though most paths are unlikely recombinations of true haplotypes. We augment the VG model with haplotype information to identify which paths are more likely to be correct. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by indexing the 1000 Genomes Project haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes
String Synchronizing Sets: Sublinear-Time BWT Construction and Optimal LCE Data Structure
Burrows-Wheeler transform (BWT) is an invertible text transformation that,
given a text of length , permutes its symbols according to the
lexicographic order of suffixes of . BWT is one of the most heavily studied
algorithms in data compression with numerous applications in indexing, sequence
analysis, and bioinformatics. Its construction is a bottleneck in many
scenarios, and settling the complexity of this task is one of the most
important unsolved problems in sequence analysis that has remained open for 25
years. Given a binary string of length , occupying machine
words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009)
runs in time and space. Recent advancements (Belazzougui,
STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size
dependency in the time complexity, but they still require time.
In this paper, we propose the first algorithm that breaks the -time
barrier for BWT construction. Given a binary string of length , our
procedure builds the Burrows-Wheeler transform in time and
space. We complement this result with a conditional lower bound
proving that any further progress in the time complexity of BWT construction
would yield faster algorithms for the very well studied problem of counting
inversions: it would improve the state-of-the-art -time
solution by Chan and P\v{a}tra\c{s}cu (SODA 2010). Our algorithm is based on a
novel concept of string synchronizing sets, which is of independent interest.
As one of the applications, we show that this technique lets us design a data
structure of the optimal size that answers Longest Common
Extension queries (LCE queries) in time and, furthermore, can be
deterministically constructed in the optimal time.Comment: Full version of a paper accepted to STOC 201
Sorting conjugates and Suffixes of Words in a Multiset
In this paper we are interested in the study of the combinatorial aspects related to the extension of the Burrows-Wheeler transform to a multiset of words. Such study involves the notion of suffixes and conjugates of words and is based on two different order relations, denoted by <_lex and âș_Ï, that, even if strictly connected, are quite different from the computational point of view. In particular, we introduce a method that only uses the <_lex sorting among suffixes of a multiset of words in order to sort their conjugates according to âș_Ï-order. In this study an important role is played by Lyndon words. This strategy could be used in applications specially in the field of Bioinformatics, where for instance the advent of "next-generation" DNA sequencing technologies has meant that huge collections of DNA sequences are now commonplace
Clustering words
We characterize words which cluster under the Burrows-Wheeler transform as
those words such that occurs in a trajectory of an interval exchange
transformation, and build examples of clustering words
Recommended from our members
Haplotype-aware graph indexes.
MOTIVATION: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes. RESULTS: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108Â 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes. AVAILABILITY AND IMPLEMENTATION: Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online
Universal lossless source coding with the Burrows Wheeler transform
The Burrows Wheeler transform (1994) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: (1) statistical characterizations of the BWT output on both finite strings and sequences of length n â â, (2) a variety of very simple new techniques for BWT-based lossless source coding, and (3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory source
String Comparison in -Order: New Lexicographic Properties & On-line Applications
-order is a global order on strings related to Unique Maximal
Factorization Families (UMFFs), which are themselves generalizations of Lyndon
words. -order has recently been proposed as an alternative to
lexicographical order in the computation of suffix arrays and in the
suffix-sorting induced by the Burrows-Wheeler transform. Efficient -ordering
of strings thus becomes a matter of considerable interest. In this paper we
present new and surprising results on -order in strings, then go on to
explore the algorithmic consequences
On Bijective Variants of the Burrows-Wheeler Transform
The sort transform (ST) is a modification of the Burrows-Wheeler transform
(BWT). Both transformations map an arbitrary word of length n to a pair
consisting of a word of length n and an index between 1 and n. The BWT sorts
all rotation conjugates of the input word, whereas the ST of order k only uses
the first k letters for sorting all such conjugates. If two conjugates start
with the same prefix of length k, then the indices of the rotations are used
for tie-breaking. Both transforms output the sequence of the last letters of
the sorted list and the index of the input within the sorted list. In this
paper, we discuss a bijective variant of the BWT (due to Scott), proving its
correctness and relations to other results due to Gessel and Reutenauer (1993)
and Crochemore, Desarmenien, and Perrin (2005). Further, we present a novel
bijective variant of the ST.Comment: 15 pages, presented at the Prague Stringology Conference 2009 (PSC
2009
- âŠ