    Haplotype-aware graph indexes

    The variation graph toolkit (VG) represents genetic variation as a graph. Each path in the graph is a potential haplotype, though most paths are unlikely recombinations of true haplotypes. We augment the VG model with haplotype information to identify which paths are more likely to be correct. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by indexing the 1000 Genomes Project haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.

    String Synchronizing Sets: Sublinear-Time BWT Construction and Optimal LCE Data Structure

    Burrows-Wheeler transform (BWT) is an invertible text transformation that, given a text $T$ of length $n$, permutes its symbols according to the lexicographic order of suffixes of $T$. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling the complexity of this task is one of the most important unsolved problems in sequence analysis that has remained open for 25 years. Given a binary string of length $n$, occupying $O(n/\log n)$ machine words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009) runs in $O(n)$ time and $O(n/\log n)$ space. Recent advancements (Belazzougui, STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size dependency in the time complexity, but they still require $\Omega(n)$ time. In this paper, we propose the first algorithm that breaks the $O(n)$-time barrier for BWT construction. Given a binary string of length $n$, our procedure builds the Burrows-Wheeler transform in $O(n/\sqrt{\log n})$ time and $O(n/\log n)$ space. We complement this result with a conditional lower bound proving that any further progress in the time complexity of BWT construction would yield faster algorithms for the very well studied problem of counting inversions: it would improve the state-of-the-art $O(m\sqrt{\log m})$-time solution by Chan and Pătraşcu (SODA 2010). Our algorithm is based on a novel concept of string synchronizing sets, which is of independent interest. As one of the applications, we show that this technique lets us design a data structure of the optimal size $O(n/\log n)$ that answers Longest Common Extension queries (LCE queries) in $O(1)$ time and, furthermore, can be deterministically constructed in the optimal $O(n/\log n)$ time. Comment: Full version of a paper accepted to STOC 201
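
    As a point of reference for the definition in the abstract above, the following minimal Python sketch computes the BWT by sorting suffixes directly and then inverts it. It is a naive illustration only, not the sublinear-time construction the paper proposes; the function names and the `$` end marker are assumptions made for this example.

```python
def bwt_via_suffixes(text, sentinel="$"):
    """Naive BWT: sort the suffixes of text + sentinel, then emit the
    symbol that precedes each suffix in that order.

    Assumption: the sentinel is absent from the text and compares
    smaller than every symbol in it.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    t = text + sentinel
    # Suffix array by direct comparison of suffixes (quadratic-ish; fine for a demo).
    suffix_array = sorted(range(len(t)), key=lambda i: t[i:])
    # t[i - 1] wraps around to the sentinel when i == 0.
    return "".join(t[i - 1] for i in suffix_array)


def inverse_bwt(last, sentinel="$"):
    """Recover the text from its BWT, illustrating that the transform is invertible."""
    # A stable sort of positions by symbol pairs the j-th symbol of the sorted
    # (first) column with its occurrence in the last column.
    order = sorted(range(len(last)), key=lambda i: last[i])
    row = last.index(sentinel)  # the row that corresponds to the whole text
    out = []
    for _ in range(len(last) - 1):
        row = order[row]        # step to the row of the next suffix
        out.append(last[row])
    return "".join(out)


transformed = bwt_via_suffixes("banana")
restored = inverse_bwt(transformed)
```

    For the input `banana` this yields `annb$aa`, and `inverse_bwt` returns the original word.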

    Sorting conjugates and Suffixes of Words in a Multiset

    In this paper we are interested in the study of the combinatorial aspects related to the extension of the Burrows-Wheeler transform to a multiset of words. Such study involves the notion of suffixes and conjugates of words and is based on two different order relations, denoted by <_lex and ≺_ω, that, even if strictly connected, are quite different from the computational point of view. In particular, we introduce a method that only uses the <_lex sorting among suffixes of a multiset of words in order to sort their conjugates according to ≺_ω-order. In this study an important role is played by Lyndon words. This strategy could be used in applications, especially in the field of Bioinformatics, where for instance the advent of "next-generation" DNA sequencing technologies has meant that huge collections of DNA sequences are now commonplace.

    Clustering words

    We characterize words which cluster under the Burrows-Wheeler transform as those words $w$ such that $ww$ occurs in a trajectory of an interval exchange transformation, and build examples of clustering words.
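
    The abstract above does not spell out what "cluster" means. The sketch below assumes the common reading: a word clusters if, in the BWT built from its sorted rotations, every letter occupies a single contiguous run. The function names are hypothetical, and the check says nothing about the interval exchange characterization itself.

```python
from itertools import groupby


def bwt_rotations(word):
    """BWT of a word taken as the last column of its sorted rotations (no end marker)."""
    rotations = sorted(word[i:] + word[:i] for i in range(len(word)))
    return "".join(rotation[-1] for rotation in rotations)


def clusters(word):
    """Return True if each letter forms one contiguous run in the BWT.

    Assumption: this is the intended meaning of "clustering" in the abstract.
    """
    run_letters = [letter for letter, _ in groupby(bwt_rotations(word))]
    # A letter that starts two separate runs appears twice in run_letters.
    return len(run_letters) == len(set(run_letters))


banana_bwt = bwt_rotations("banana")   # all equal letters end up adjacent
banana_clusters = clusters("banana")
abcab_clusters = clusters("abcab")     # the letter b is split into two runs
```

    Under this reading `banana` clusters (its BWT is `nnbaaa`), while `abcab` does not (its BWT is `cbaab`).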

    Universal lossless source coding with the Burrows Wheeler transform

    The Burrows Wheeler transform (1994) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: (1) statistical characterizations of the BWT output on both finite strings and sequences of length n → ∞, (2) a variety of very simple new techniques for BWT-based lossless source coding, and (3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory source

    String Comparison in V-Order: New Lexicographic Properties & On-line Applications

    V-order is a global order on strings related to Unique Maximal Factorization Families (UMFFs), which are themselves generalizations of Lyndon words. V-order has recently been proposed as an alternative to lexicographical order in the computation of suffix arrays and in the suffix-sorting induced by the Burrows-Wheeler transform. Efficient V-ordering of strings thus becomes a matter of considerable interest. In this paper we present new and surprising results on V-order in strings, then go on to explore the algorithmic consequences.

    On Bijective Variants of the Burrows-Wheeler Transform

    The sort transform (ST) is a modification of the Burrows-Wheeler transform (BWT). Both transformations map an arbitrary word of length n to a pair consisting of a word of length n and an index between 1 and n. The BWT sorts all rotation conjugates of the input word, whereas the ST of order k only uses the first k letters for sorting all such conjugates. If two conjugates start with the same prefix of length k, then the indices of the rotations are used for tie-breaking. Both transforms output the sequence of the last letters of the sorted list and the index of the input within the sorted list. In this paper, we discuss a bijective variant of the BWT (due to Scott), proving its correctness and relations to other results due to Gessel and Reutenauer (1993) and Crochemore, Desarmenien, and Perrin (2005). Further, we present a novel bijective variant of the ST. Comment: 15 pages, presented at the Prague Stringology Conference 2009 (PSC 2009
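
    The difference between the two (non-bijective) transforms described in the abstract above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's construction: the function names are hypothetical, rotation `i` is taken to start at position `i` of the word, and equal rotations in the BWT are also ordered by rotation index so that the result is deterministic.

```python
def rotations(word):
    """All rotation conjugates of word; rotation i starts at position i."""
    return [word[i:] + word[:i] for i in range(len(word))]


def sort_transform(word, k):
    """ST of order k: sort rotations by their first k letters only,
    breaking ties by rotation index.

    Returns (last letters of the sorted list, 1-based index of the input).
    """
    rots = rotations(word)
    order = sorted(range(len(word)), key=lambda i: (rots[i][:k], i))
    last_letters = "".join(rots[i][-1] for i in order)
    return last_letters, order.index(0) + 1


def bwt_with_index(word):
    """BWT with index: sort rotations by their full length.

    Equivalent to the ST of order len(word) under the tie-breaking
    assumption stated above.
    """
    return sort_transform(word, len(word))


st_order_1 = sort_transform("banana", 1)   # only the first letter is compared
st_order_2 = sort_transform("banana", 2)   # the first two letters are compared
full_bwt = bwt_with_index("banana")        # all letters are compared
```

    For `banana`, order 1 gives `("bnnaaa", 4)`, order 2 gives `("nbnaaa", 4)`, and the full BWT gives `("nnbaaa", 4)`, which shows how a larger k moves the ST output toward the BWT output.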
