682 research outputs found
Succinct dynamic de Bruijn graphs
Motivation: The de Bruijn graph is one of the fundamental data structures for analysis of high throughput sequencing data. In order to be applicable to population-scale studies, it is essential to build and store the graph in a space- and time-efficient manner. In addition, due to the ever-changing nature of population studies, it has become essential to update the graph after construction, e.g. add and remove nodes and edges. Although there has been substantial effort on making the construction and storage of the graph efficient, there is a limited amount of work in building the graph in an efficient and mutable manner. Hence, most space efficient data structures require complete reconstruction of the graph in order to add or remove edges or nodes. Results: In this article, we present DynamicBOSS, a succinct representation of the de Bruijn graph that allows for an unlimited number of additions and deletions of nodes and edges. We compare our method with other competing methods and demonstrate that DynamicBOSS is the only method that supports both addition and deletion and is applicable to very large samples (e.g. greater than 15 billion k-mers). Competing dynamic methods, e.g. FDBG cannot be constructed on large scale datasets, or cannot support both addition and deletion, e.g. BiFrost.Peer reviewe
GPU accelerating distributed succinct de Bruijn graph construction
The research and methods in the field of computational biology have grown in the last decades, thanks to the availability of biological data. One of the applications in computational biology is genome sequencing or sequence alignment, a method to arrange sequences of, for example, DNA or RNA, to determine regions of similarity between these sequences. Sequence alignment applications include public health purposes, such as monitoring antimicrobial resistance.
Demand for fast sequence alignment has led to the usage of data structures, such as the de Bruijn graph, to store a large amount of information efficiently. De Bruijn graphs are currently one of the top data structures used in indexing genome sequences, and different methods to represent them have been explored. One of these methods is the BOSS data structure, a special case of Wheeler graph index, which uses succinct data structures to represent a de Bruijn graph.
As genomes can take a large amount of space, the construction of succinct de Bruijn graphs is slow. This has led to experimental research on using large-scale cluster engines such as Apache Spark and Graphic Processing Units (GPUs) in genome data processing.
This thesis explores the use of Apache Spark and Spark RAPIDS, a GPU computing library for Apache Spark, in the construction of a succinct de Bruijn graph index from genome sequences. The experimental results indicate that Spark RAPIDS can provide up to 8 times speedups to specific operations, but for some other operations has severe limitations that limit its processing power in terms of succinct de Bruijn graph index construction
Space efficient merging of de Bruijn graphs and Wheeler graphs
The merging of succinct data structures is a well established technique for
the space efficient construction of large succinct indexes. In the first part
of the paper we propose a new algorithm for merging succinct representations of
de Bruijn graphs. Our algorithm has the same asymptotic cost of the state of
the art algorithm for the same problem but it uses less than half of its
working space. A novel important feature of our algorithm, not found in any of
the existing tools, is that it can compute the Variable Order succinct
representation of the union graph within the same asymptotic time/space bounds.
In the second part of the paper we consider the more general problem of merging
succinct representations of Wheeler graphs, a recently introduced graph family
which includes as special cases de Bruijn graphs and many other known succinct
indexes based on the BWT or one of its variants. We show that Wheeler graphs
merging is in general a much more difficult problem, and we provide a space
efficient algorithm for the slightly simplified problem of determining whether
the union graph has an ordering that satisfies the Wheeler conditions.Comment: 24 pages, 10 figures. arXiv admin note: text overlap with
arXiv:1902.0288
Space-efficient merging of succinct de Bruijn graphs
We propose a new algorithm for merging succinct representations of de Bruijn
graphs introduced in [Bowe et al. WABI 2012]. Our algorithm is based on the
lightweight BWT merging approach by Holt and McMillan [Bionformatics 2014,
ACM-BCB 2014]. Our algorithm has the same asymptotic cost of the state of the
art tool for the same problem presented by Muggli et al. [bioRxiv 2017,
Bioinformatics 2019], but it uses less than half of its working space. A novel
important feature of our algorithm, not found in any of the existing tools, is
that it can compute the Variable Order succinct representation of the union
graph within the same asymptotic time/space bounds.Comment: Accepted to SPIRE'1
Relative Select
Motivated by the problem of storing coloured de Bruijn graphs, we show how,
if we can already support fast select queries on one string, then we can store
a little extra information and support fairly fast select queries on a similar
string
Detecting Superbubbles in Assembly Graphs
We introduce a new concept of a subgraph class called a superbubble for
analyzing assembly graphs, and propose an efficient algorithm for detecting it.
Most assembly algorithms utilize assembly graphs like the de Bruijn graph or
the overlap graph constructed from reads. From these graphs, many assembly
algorithms first detect simple local graph structures (motifs), such as tips
and bubbles, mainly to find sequencing errors. These motifs are easy to detect,
but they are sometimes too simple to deal with more complex errors. The
superbubble is an extension of the bubble, which is also important for
analyzing assembly graphs. Though superbubbles are much more complex than
ordinary bubbles, we show that they can be efficiently enumerated. We propose
an average-case linear time algorithm (i.e., O(n+m) for a graph with n vertices
and m edges) for graphs with a reasonable model, though the worst-case time
complexity of our algorithm is quadratic (i.e., O(n(n+m))). Moreover, the
algorithm is practically very fast: Our experiments show that our algorithm
runs in reasonable time with a single CPU core even against a very large graph
of a whole human genome.Comment: Peer-reviewed and presented as part of the 13th Workshop on
Algorithms in Bioinformatics (WABI2013
- …