
6,715 research outputs found

Haplotype-aware graph indexes

Author: Durbin Richard
Garrison Erik
Novak Adam M.
Paten Benedict J.
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)
Publication date: 01/01/2018

The variation graph toolkit (VG) represents genetic variation as a graph. Each path in the graph is a potential haplotype, though most paths are unlikely recombinations of true haplotypes. We augment the VG model with haplotype information to identify which paths are more likely to be correct. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by indexing the 1000 Genomes Project haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

eScholarship - University of California

Apollo (Cambridge)

String Synchronizing Sets: Sublinear-Time BWT Construction and Optimal LCE Data Structure

Author: A
Alzamel Mai
Counting
Grossi Roberto
Hagerup Torben
Optimal
Uniqueness
Wavelet
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 05/05/2019

Burrows-Wheeler transform (BWT) is an invertible text transformation that, given a text $T$ of length $n$, permutes its symbols according to the lexicographic order of suffixes of $T$. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling the complexity of this task is one of the most important unsolved problems in sequence analysis that has remained open for 25 years. Given a binary string of length $n$, occupying $O(n/\log n)$ machine words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009) runs in $O(n)$ time and $O(n/\log n)$ space. Recent advancements (Belazzougui, STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size dependency in the time complexity, but they still require $\Omega(n)$ time. In this paper, we propose the first algorithm that breaks the $O(n)$-time barrier for BWT construction. Given a binary string of length $n$, our procedure builds the Burrows-Wheeler transform in $O(n/\sqrt{\log n})$ time and $O(n/\log n)$ space. We complement this result with a conditional lower bound proving that any further progress in the time complexity of BWT construction would yield faster algorithms for the very well studied problem of counting inversions: it would improve the state-of-the-art $O(m\sqrt{\log m})$-time solution by Chan and Pătrașcu (SODA 2010). Our algorithm is based on a novel concept of string synchronizing sets, which is of independent interest. As one of the applications, we show that this technique lets us design a data structure of the optimal size $O(n/\log n)$ that answers Longest Common Extension queries (LCE queries) in $O(1)$ time and, furthermore, can be deterministically constructed in the optimal $O(n/\log n)$ time. Comment: Full version of a paper accepted to STOC 201
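For readers unfamiliar with the transform defined in the abstract above, the following is a minimal Python sketch of a textbook BWT and its inverse. It is not the sublinear-time algorithm of the paper: it sorts suffixes directly, which is far slower than the bounds discussed. The function names and the `$` sentinel are assumptions made for illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort the suffixes of text + sentinel and emit,
    for each suffix in sorted order, the symbol that precedes it."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # Direct suffix sorting; quadratic or worse, for illustration only.
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # s[i - 1] wraps around to the sentinel when i == 0.
    return "".join(s[i - 1] for i in suffix_array)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Invert the transform using a stable sort of the output symbols."""
    n = len(transformed)
    # order[j] is the position in `transformed` of the j-th symbol of the
    # sorted first column (ties keep their original relative order).
    order = sorted(range(n), key=lambda i: transformed[i])
    first_column = sorted(transformed)
    # The row whose last symbol is the sentinel is text + sentinel itself.
    row = transformed.index(sentinel)
    out = []
    for _ in range(n - 1):
        out.append(first_column[row])
        row = order[row]  # move to the row rotated left by one symbol
    return "".join(out)


if __name__ == "__main__":
    encoded = bwt("banana")
    print(encoded)
    print(inverse_bwt(encoded))
```

The round trip through `inverse_bwt` demonstrates the invertibility mentioned in the abstract; the sentinel only marks where the original text ends.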

arXiv.org e-Print Archive

Crossref

Sorting conjugates and Suffixes of Words in a Multiset

Author: Bonomo S
Mantaci S
Restivo A
Rosone G
Sciortino M
Publication venue: 'World Scientific Pub Co Pte Lt'
Publication date: 01/01/2014

In this paper we are interested in the study of the combinatorial aspects related to the extension of the Burrows-Wheeler transform to a multiset of words. Such study involves the notion of suffixes and conjugates of words and is based on two different order relations, denoted by <_lex and ≺_ω, that, even if strictly connected, are quite different from the computational point of view. In particular, we introduce a method that only uses the <_lex sorting among suffixes of a multiset of words in order to sort their conjugates according to ≺_ω-order. In this study an important role is played by Lyndon words. This strategy could be used in applications, especially in the field of Bioinformatics, where for instance the advent of "next-generation" DNA sequencing technologies has meant that huge collections of DNA sequences are now commonplace.

Archivio della Ricerca - Università di Pisa

Clustering words

Author: Ferenczi Sébastien
Zamboni Luca Q.
Publication date: 06/04/2012

We characterize words which cluster under the Burrows-Wheeler transform as those words $w$ such that $ww$ occurs in a trajectory of an interval exchange transformation, and build examples of clustering words.
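To make the clustering property in the abstract above concrete, here is a minimal Python sketch. It assumes the rotation-based transform (no sentinel) and takes "clusters" to mean that all occurrences of each letter are adjacent in the transformed word; that reading, and the function names, are assumptions for illustration rather than the paper's formal definitions.

```python
from itertools import groupby


def rotation_bwt(word: str) -> str:
    """BWT via sorted rotations (conjugates): last letter of each rotation."""
    rotations = sorted(word[i:] + word[:i] for i in range(len(word)))
    return "".join(rotation[-1] for rotation in rotations)


def clusters(word: str) -> bool:
    """True if every letter forms a single run in the transformed word."""
    # One entry per maximal run of equal letters in the transform.
    run_letters = [letter for letter, _ in groupby(rotation_bwt(word))]
    # A letter appearing in two separate runs means the word does not cluster.
    return len(run_letters) == len(set(run_letters))


if __name__ == "__main__":
    for w in ("banana", "aabb"):
        print(w, rotation_bwt(w), clusters(w))
```

Under this reading, `banana` clusters (its transform is `nnbaaa`), while `aabb` does not (its transform is `baba`).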

arXiv.org e-Print Archive

HAL-UJM

Hal-Diderot

Recommended from our members

Haplotype-aware graph indexes.

Author: Durbin Richard
Garrison Erik
Novak Adam M
Paten Benedict
Sirén Jouni
Publication venue: Bioinformatics
Publication date: 15/01/2020

MOTIVATION: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes. RESULTS: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes. AVAILABILITY AND IMPLEMENTATION: Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

eScholarship - University of California

Apollo (Cambridge)

Universal lossless source coding with the Burrows Wheeler transform

Author: Effros Michelle
Kulkarni Sanjeev R.
Verdú Sergio
Visweswariah Karthik
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2002

The Burrows Wheeler transform (1994) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: (1) statistical characterizations of the BWT output on both finite strings and sequences of length n → ∞, (2) a variety of very simple new techniques for BWT-based lossless source coding, and (3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory sources.

CiteSeerX

Caltech Authors

String Comparison in $V$-Order: New Lexicographic Properties & On-line Applications

Author: Alatabbi Ali
Daykin Jacqueline W.
Rahman M. Sohel
Smyth W. F.
Publication date: 01/01/2015

$V$-order is a global order on strings related to Unique Maximal Factorization Families (UMFFs), which are themselves generalizations of Lyndon words. $V$-order has recently been proposed as an alternative to lexicographical order in the computation of suffix arrays and in the suffix-sorting induced by the Burrows-Wheeler transform. Efficient $V$-ordering of strings thus becomes a matter of considerable interest. In this paper we present new and surprising results on $V$-order in strings, then go on to explore the algorithmic consequences.

arXiv.org e-Print Archive

Research Repository

On Bijective Variants of the Burrows-Wheeler Transform

Author: Kufleitner Manfred
Publication date: 01/01/2009

The sort transform (ST) is a modification of the Burrows-Wheeler transform (BWT). Both transformations map an arbitrary word of length n to a pair consisting of a word of length n and an index between 1 and n. The BWT sorts all rotation conjugates of the input word, whereas the ST of order k only uses the first k letters for sorting all such conjugates. If two conjugates start with the same prefix of length k, then the indices of the rotations are used for tie-breaking. Both transforms output the sequence of the last letters of the sorted list and the index of the input within the sorted list. In this paper, we discuss a bijective variant of the BWT (due to Scott), proving its correctness and relations to other results due to Gessel and Reutenauer (1993) and Crochemore, Desarmenien, and Perrin (2005). Further, we present a novel bijective variant of the ST. Comment: 15 pages, presented at the Prague Stringology Conference 2009 (PSC 2009)
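The description of the sort transform in the abstract above can be sketched in a few lines of Python. This is only an illustration of the ordinary ST of order k as described there, not of the bijective variants the paper develops; the function name and the choice to take k equal to the word length as the BWT case are assumptions.

```python
def sort_transform(word: str, k: int) -> tuple[str, int]:
    """ST of order k: sort rotations by their first k letters, breaking
    ties by rotation index. Returns the last letters of the sorted list
    and the 1-based position of the input word within that list."""
    n = len(word)
    if n == 0 or not 1 <= k <= n:
        raise ValueError("need a non-empty word and 1 <= k <= len(word)")
    doubled = word + word  # rotation i is doubled[i:i + n]
    # Key: length-k prefix of the rotation, then the rotation index.
    order = sorted(range(n), key=lambda i: (doubled[i:i + k], i))
    last_letters = "".join(doubled[i + n - 1] for i in order)
    index = order.index(0) + 1  # rotation 0 is the input word itself
    return last_letters, index


if __name__ == "__main__":
    for k in (1, 2, 6):
        print(k, sort_transform("banana", k))
```

With k equal to the word length, the whole rotation is compared, so the output coincides with the rotation-based BWT of the word.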

arXiv.org e-Print Archive

CiteSeerX