225 research outputs found
Space efficient merging of de Bruijn graphs and Wheeler graphs
The merging of succinct data structures is a well established technique for
the space efficient construction of large succinct indexes. In the first part
of the paper we propose a new algorithm for merging succinct representations of
de Bruijn graphs. Our algorithm has the same asymptotic cost of the state of
the art algorithm for the same problem but it uses less than half of its
working space. A novel important feature of our algorithm, not found in any of
the existing tools, is that it can compute the Variable Order succinct
representation of the union graph within the same asymptotic time/space bounds.
In the second part of the paper we consider the more general problem of merging
succinct representations of Wheeler graphs, a recently introduced graph family
which includes as special cases de Bruijn graphs and many other known succinct
indexes based on the BWT or one of its variants. We show that Wheeler graphs
merging is in general a much more difficult problem, and we provide a space
efficient algorithm for the slightly simplified problem of determining whether
the union graph has an ordering that satisfies the Wheeler conditions.Comment: 24 pages, 10 figures. arXiv admin note: text overlap with
arXiv:1902.0288
Space-efficient merging of succinct de Bruijn graphs
We propose a new algorithm for merging succinct representations of de Bruijn
graphs introduced in [Bowe et al. WABI 2012]. Our algorithm is based on the
lightweight BWT merging approach by Holt and McMillan [Bionformatics 2014,
ACM-BCB 2014]. Our algorithm has the same asymptotic cost of the state of the
art tool for the same problem presented by Muggli et al. [bioRxiv 2017,
Bioinformatics 2019], but it uses less than half of its working space. A novel
important feature of our algorithm, not found in any of the existing tools, is
that it can compute the Variable Order succinct representation of the union
graph within the same asymptotic time/space bounds.Comment: Accepted to SPIRE'1
MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph
MEGAHIT is a NGS de novo assembler for assembling large and complex
metagenomics data in a time- and cost-efficient manner. It finished assembling
a soil metagenomics dataset with 252Gbps in 44.1 hours and 99.6 hours on a
single computing node with and without a GPU, respectively. MEGAHIT assembles
the data as a whole, i.e., it avoids pre-processing like partitioning and
normalization, which might compromise on result integrity. MEGAHIT generates 3
times larger assembly, with longer contig N50 and average contig length than
the previous assembly. 55.8% of the reads were aligned to the assembly, which
is 4 times higher than the previous. The source code of MEGAHIT is freely
available at https://github.com/voutcn/megahit under GPLv3 license.Comment: 2 pages, 2 tables, 1 figure, submitted to Oxford Bioinformatics as an
Application Not
External memory BWT and LCP computation for sequence collections with applications
Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows-Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory and time using in the best possible way the available RAM.ResultsWe propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix-prefix overlaps, and the construction of succinct de Bruijn graphs.ConclusionsWe prove that our algorithm performs O(nmaxlcp) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.14CONSELHO NACIONAL DE DESENVOLVIMENTO CIENTÍFICO E TECNOLÓGICO - CNPQCOORDENAÇÃO DE APERFEIÇOAMENTO DE PESSOAL DE NÍVEL SUPERIOR - CAPESUniversity of Eastern Piedmont project Behavioural Types for Dependability Analysis with Bayesian Networks; Sao Paulo Research Foundation (FAPESP)Fundacao de Amparo a Pesquisa do Estado de Sao Paulo (FAPESP) [2017/09105-0, 2018/21509-2]; PRIN grant [201534HNXC]; INdAM-GNCS Project 2019 Innovative methods for the solution of medical and biological big data; Brazilian agency Conselho Nacional de Desenvolvimento Cientifico e Tecnologico (CNPq)National Council for Scientific and Technological Development (CNPq); Brazilian agency Coordenacao de Aperfeicoamento de Pessoal de Nivel Superior (CAPES)CAPE
External memory BWT and LCP computation for sequence collections with applications
We propose an external memory algorithm for the computation of the BWT and LCP array for a collection of sequences. Our algorithm takes the amount of available memory as an input parameter, and tries to make the best use of it by splitting the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external memory and in the process it also computes the LCP values. We show that our algorithm performs O(n maxlcp) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm outperforms the current best algorithm for collections of sequences with different lengths and when the average LCP of the collection is relatively small compared to the length of the sequences.
In the second part of the paper, we show that our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP arrays, provide simple, scan based, external memory algorithms for three well known problems in bioinformatics: the computation of the all pairs suffix-prefix overlaps, the computation of maximal repeats, and the construction of succinct de Bruijn graphs
Wheeler graphs: A framework for BWT-based data structures
The famous Burrows\u2013Wheeler Transform (BWT) was originally defined for a single string but variations have been developed for sets of strings, labeled trees, de Bruijn graphs, etc. In this paper we propose a framework that includes many of these variations and that we hope will simplify the search for more.
We first define Wheeler graphs and show they have a property we call path coherence. We show that if the state diagram of a finite-state automaton is a Wheeler graph then, by its path coherence, we can order the nodes such that, for any string, the nodes reachable from the initial state or states by processing that string are consecutive. This means that even if the automaton is non-deterministic, we can still store it compactly and process strings with it quickly.
We then rederive several variations of the BWT by designing straightforward finite-state automata for the relevant problems and showing that their state diagrams are Wheeler graphs
- …