A polynomial delay algorithm for the enumeration of bubbles with length constraints in directed graphs and its application to the detection of alternative splicing in RNA-seq data
We present a new algorithm for enumerating bubbles with length constraints in
directed graphs. This problem arises in transcriptomics, where the question is
to identify all alternative splicing events present in a sample of mRNAs
sequenced by RNA-seq. This is the first polynomial-delay algorithm for this
problem and we show that in practice, it is faster than previous approaches.
This enables us to deal with larger instances and therefore to discover novel
alternative splicing events, especially long ones, that were previously
overlooked by existing methods.
Comment: Peer-reviewed and presented as part of the 13th Workshop on
Algorithms in Bioinformatics (WABI2013)
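To make the problem statement concrete, the following is a naive brute-force sketch in Python. It is not the polynomial-delay algorithm of the paper. It assumes that a bubble is a pair of internally vertex-disjoint paths sharing a source s and a target t, and that the length constraints are upper bounds a1 and a2 on the number of edges of the two paths. The function names and the adjacency-dictionary input are hypothetical.

```python
from itertools import combinations


def simple_paths(succ, s, max_len):
    """Yield every simple path that starts at s and has 1..max_len edges."""
    stack = [(s, [s])]
    while stack:
        v, path = stack.pop()
        for u in succ.get(v, ()):
            if u in path:  # keep paths simple (no repeated vertex)
                continue
            new_path = path + [u]
            yield new_path
            if len(new_path) - 1 < max_len:
                stack.append((u, new_path))


def naive_bubbles(succ, s, a1, a2):
    """Brute-force baseline (exponential time in general).

    Returns (t, p, q) for every pair of internally vertex-disjoint
    s-t paths whose edge counts fit the bounds a1 and a2 in some order.
    """
    by_target = {}
    for p in simple_paths(succ, s, max(a1, a2)):
        by_target.setdefault(p[-1], []).append(p)

    bubbles = []
    for t, paths in by_target.items():
        for p, q in combinations(paths, 2):
            if set(p[1:-1]) & set(q[1:-1]):  # shared internal vertex
                continue
            lp, lq = len(p) - 1, len(q) - 1
            if (lp <= a1 and lq <= a2) or (lp <= a2 and lq <= a1):
                bubbles.append((t, p, q))
    return bubbles


if __name__ == "__main__":
    graph = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": []}
    print(naive_bubbles(graph, "s", 2, 2))
```

Because it materializes every bounded simple path before pairing them, this baseline can take exponential time between outputs, which is the behaviour a polynomial-delay enumeration avoids.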
Navigating in a sea of repeats in RNA-seq without drowning
The main challenge in de novo assembly of NGS data is certainly to deal with
repeats that are longer than the reads. This is particularly true for RNA-seq
data, since coverage information cannot be used to flag repeated sequences, of
which transposable elements are one of the main examples. Most transcriptome
assemblers are based on de Bruijn graphs and have no clear and explicit model
for repeats in RNA-seq data, relying instead on heuristics to deal with them.
The results of this work are twofold. First, we introduce a formal model for
representing high copy number repeats in RNA-seq data and exploit its
properties for inferring a combinatorial characteristic of repeat-associated
subgraphs. We show that the problem of identifying in a de Bruijn graph a
subgraph with this characteristic is NP-complete. In a second step, we show
that in the specific case of a local assembly of alternative splicing (AS)
events, we can implicitly avoid such subgraphs. In particular, we designed and
implemented an algorithm to efficiently identify AS events that are not
included in repeated regions. Finally, we validate our results using synthetic
data. We also give an indication of the usefulness of our method on real data.
Simplicial and Cellular Trees
Much information about a graph can be obtained by studying its spanning
trees. On the other hand, a graph can be regarded as a 1-dimensional cell
complex, raising the question of developing a theory of trees in higher
dimension. As observed first by Bolker, Kalai and Adin, and more recently by
numerous authors, the fundamental topological properties of a tree --- namely
acyclicity and connectedness --- can be generalized to arbitrary dimension as
the vanishing of certain cellular homology groups. This point of view is
consistent with the matroid-theoretic approach to graphs, and yields
higher-dimensional analogues of classical enumerative results including
Cayley's formula and the matrix-tree theorem. A subtlety of the
higher-dimensional case is that enumeration must account for the possibility of
torsion homology in trees, which is always trivial for graphs. Cellular trees
are the starting point for further high-dimensional extensions of concepts from
algebraic graph theory including the critical group, cut and flow spaces, and
discrete dynamical systems such as the abelian sandpile model.
Comment: 39 pages (including 5-page bibliography); 5 figures. Chapter for
forthcoming IMA volume "Recent Trends in Combinatorics"
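For the classical one-dimensional case mentioned in this abstract, the matrix-tree theorem states that the number of spanning trees of a graph equals the determinant of its Laplacian with one row and the matching column removed, and Cayley's formula gives n^(n-2) for the complete graph on n vertices. The Python sketch below checks the two against each other. It covers ordinary graphs only, not the higher-dimensional cellular setting, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def determinant(matrix):
    """Exact determinant by Gaussian elimination over the rationals."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def count_spanning_trees(n, edges):
    """Matrix-tree theorem for an undirected graph on vertices 0..n-1."""
    lap = [[Fraction(0)] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1
        lap[v][v] += 1
        lap[u][v] -= 1
        lap[v][u] -= 1
    reduced = [row[1:] for row in lap[1:]]  # delete row 0 and column 0
    return int(determinant(reduced))


def complete_graph_edges(n):
    return list(combinations(range(n), 2))


if __name__ == "__main__":
    for n in range(2, 7):
        trees = count_spanning_trees(n, complete_graph_edges(n))
        print(n, trees, n ** (n - 2))
```

Exact rational arithmetic is used so that the count comes out as an integer without floating-point rounding.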
Linear-Time Superbubble Identification Algorithm for Genome Assembly
DNA sequencing is the process of determining the exact order of the
nucleotide bases of an individual's genome in order to catalogue sequence
variation and understand its biological implications. Whole-genome sequencing
techniques produce masses of data in the form of short sequences known as
reads. Assembling these reads into a whole genome constitutes a major
algorithmic challenge. Most assembly algorithms utilize de Bruijn graphs
constructed from reads for this purpose. A critical step of these algorithms is
to detect typical motif structures in the graph caused by sequencing errors and
genome repeats, and filter them out; one such complex subgraph class is a
so-called superbubble. In this paper, we propose an O(n+m)-time algorithm to
detect all superbubbles in a directed acyclic graph with n nodes and m
(directed) edges, improving the best-known O(m log m)-time algorithm by Sung et
al.
Detecting Superbubbles in Assembly Graphs
We introduce a new concept of a subgraph class called a superbubble for
analyzing assembly graphs, and propose an efficient algorithm for detecting it.
Most assembly algorithms utilize assembly graphs like the de Bruijn graph or
the overlap graph constructed from reads. From these graphs, many assembly
algorithms first detect simple local graph structures (motifs), such as tips
and bubbles, mainly to find sequencing errors. These motifs are easy to detect,
but they are sometimes too simple to deal with more complex errors. The
superbubble is an extension of the bubble, which is also important for
analyzing assembly graphs. Though superbubbles are much more complex than
ordinary bubbles, we show that they can be efficiently enumerated. We propose
an average-case linear time algorithm (i.e., O(n+m) for a graph with n vertices
and m edges) for graphs with a reasonable model, though the worst-case time
complexity of our algorithm is quadratic (i.e., O(n(n+m))). Moreover, the
algorithm is practically very fast: Our experiments show that our algorithm
runs in reasonable time with a single CPU core even on a very large graph
of a whole human genome.
Comment: Peer-reviewed and presented as part of the 13th Workshop on
Algorithms in Bioinformatics (WABI2013)
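The idea of testing each vertex as a candidate entrance can be sketched in Python as follows. This is an illustrative sketch only and may differ in detail from the algorithm in the paper. It assumes a simple directed graph (no parallel edges) given as an adjacency dictionary, and the names `build_pred` and `superbubble_exit` are hypothetical. Starting from a candidate entrance s, it releases a vertex only once all of its parents have been processed, and reports an exit t when t is the only vertex still pending.

```python
def build_pred(succ):
    """Build the predecessor lists from the successor lists."""
    pred = {v: [] for v in succ}
    for v, children in succ.items():
        for u in children:
            pred.setdefault(u, []).append(v)
    return pred


def superbubble_exit(succ, pred, s):
    """Return the exit matching entrance s, or None if there is none.

    One call touches each reachable edge at most once, so trying every
    vertex as an entrance costs O(n(n+m)) in the worst case.
    """
    visited_parents = {}  # vertex -> number of its parents already processed
    seen = set()          # discovered but not yet processed
    stack = [s]
    while stack:
        v = stack.pop()
        seen.discard(v)
        children = succ.get(v, [])
        if not children:
            return None  # dead end (tip) inside the candidate region
        for u in children:
            if u == s:
                return None  # cycle back to the entrance
            seen.add(u)
            visited_parents[u] = visited_parents.get(u, 0) + 1
            if visited_parents[u] == len(pred[u]):
                stack.append(u)  # every parent of u has been processed
        if len(stack) == 1 and seen == {stack[0]}:
            t = stack[0]
            return None if s in succ.get(t, []) else t
    return None


if __name__ == "__main__":
    graph = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": ["x"], "x": []}
    parents = build_pred(graph)
    print(superbubble_exit(graph, parents, "s"))
```

A vertex with a parent outside the region reachable from s is never released, so the traversal stops without reporting an exit in that case.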