1,406 research outputs found

    A polynomial delay algorithm for the enumeration of bubbles with length constraints in directed graphs and its application to the detection of alternative splicing in RNA-seq data

    Full text link
    We present a new algorithm for enumerating bubbles with length constraints in directed graphs. This problem arises in transcriptomics, where the question is to identify all alternative splicing events present in a sample of mRNAs sequenced by RNA-seq. This is the first polynomial-delay algorithm for this problem and we show that in practice, it is faster than previous approaches. This enables us to deal with larger instances and therefore to discover novel alternative splicing events, especially long ones, that were previously overseen using existing methods.Comment: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI2013

    Navigating in a sea of repeats in RNA-seq without drowning

    Full text link
    The main challenge in de novo assembly of NGS data is certainly to deal with repeats that are longer than the reads. This is particularly true for RNA- seq data, since coverage information cannot be used to flag repeated sequences, of which transposable elements are one of the main examples. Most transcriptome assemblers are based on de Bruijn graphs and have no clear and explicit model for repeats in RNA-seq data, relying instead on heuristics to deal with them. The results of this work are twofold. First, we introduce a formal model for repre- senting high copy number repeats in RNA-seq data and exploit its properties for inferring a combinatorial characteristic of repeat-associated subgraphs. We show that the problem of identifying in a de Bruijn graph a subgraph with this charac- teristic is NP-complete. In a second step, we show that in the specific case of a local assembly of alternative splicing (AS) events, we can implicitly avoid such subgraphs. In particular, we designed and implemented an algorithm to efficiently identify AS events that are not included in repeated regions. Finally, we validate our results using synthetic data. We also give an indication of the usefulness of our method on real data

    Simplicial and Cellular Trees

    Get PDF
    Much information about a graph can be obtained by studying its spanning trees. On the other hand, a graph can be regarded as a 1-dimensional cell complex, raising the question of developing a theory of trees in higher dimension. As observed first by Bolker, Kalai and Adin, and more recently by numerous authors, the fundamental topological properties of a tree --- namely acyclicity and connectedness --- can be generalized to arbitrary dimension as the vanishing of certain cellular homology groups. This point of view is consistent with the matroid-theoretic approach to graphs, and yields higher-dimensional analogues of classical enumerative results including Cayley's formula and the matrix-tree theorem. A subtlety of the higher-dimensional case is that enumeration must account for the possibility of torsion homology in trees, which is always trivial for graphs. Cellular trees are the starting point for further high-dimensional extensions of concepts from algebraic graph theory including the critical group, cut and flow spaces, and discrete dynamical systems such as the abelian sandpile model.Comment: 39 pages (including 5-page bibliography); 5 figures. Chapter for forthcoming IMA volume "Recent Trends in Combinatorics

    Linear-Time Superbubble Identification Algorithm for Genome Assembly

    Get PDF
    DNA sequencing is the process of determining the exact order of the nucleotide bases of an individual's genome in order to catalogue sequence variation and understand its biological implications. Whole-genome sequencing techniques produce masses of data in the form of short sequences known as reads. Assembling these reads into a whole genome constitutes a major algorithmic challenge. Most assembly algorithms utilize de Bruijn graphs constructed from reads for this purpose. A critical step of these algorithms is to detect typical motif structures in the graph caused by sequencing errors and genome repeats, and filter them out; one such complex subgraph class is a so-called superbubble. In this paper, we propose an O(n+m)-time algorithm to detect all superbubbles in a directed acyclic graph with n nodes and m (directed) edges, improving the best-known O(m log m)-time algorithm by Sung et al

    Detecting Superbubbles in Assembly Graphs

    Full text link
    We introduce a new concept of a subgraph class called a superbubble for analyzing assembly graphs, and propose an efficient algorithm for detecting it. Most assembly algorithms utilize assembly graphs like the de Bruijn graph or the overlap graph constructed from reads. From these graphs, many assembly algorithms first detect simple local graph structures (motifs), such as tips and bubbles, mainly to find sequencing errors. These motifs are easy to detect, but they are sometimes too simple to deal with more complex errors. The superbubble is an extension of the bubble, which is also important for analyzing assembly graphs. Though superbubbles are much more complex than ordinary bubbles, we show that they can be efficiently enumerated. We propose an average-case linear time algorithm (i.e., O(n+m) for a graph with n vertices and m edges) for graphs with a reasonable model, though the worst-case time complexity of our algorithm is quadratic (i.e., O(n(n+m))). Moreover, the algorithm is practically very fast: Our experiments show that our algorithm runs in reasonable time with a single CPU core even against a very large graph of a whole human genome.Comment: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI2013
    • …
    corecore