137 research outputs found

    On the complexity of Minimum Path Cover with Subpath Constraints for multi-assembly

    From RECOMB-Seq: Fourth Annual RECOMB Satellite Workshop on Massively Parallel Sequencing, Pittsburgh, PA, USA, 31 March - 05 April 2014. Background: Multi-assembly problems have attracted much attention in recent years, as Next-Generation Sequencing technologies have started being applied to mixed settings, such as reads from the transcriptome (RNA-Seq) or from viral quasi-species. One classical model that has resurfaced in many multi-assembly methods (e.g. in Cufflinks, ShoRAH, BRANCH, CLASS) is the Minimum Path Cover (MPC) Problem, which asks for the minimum number of directed paths that cover all the nodes of a directed acyclic graph. The MPC Problem is highly popular because the acyclicity of the graph ensures its polynomial-time solvability. Results: In this paper, we consider two generalizations of it that integrate constraints arising from long reads or paired-end reads; these extensions have also been considered by two recent methods, but not fully solved. More specifically, we study the two problems in which a set of subpaths, or pairs of subpaths, of the graph must also be entirely covered by some path in the MPC. We show that in the case of long reads (subpaths), the generalized problem can be solved in polynomial time by a reduction to the classical MPC Problem. We also consider the weighted case, and show that it can be solved in polynomial time by a reduction to a min-cost circulation problem. As a side result, we also improve the time complexity of the classical minimum weight MPC Problem. In the case of paired-end reads (pairs of subpaths), the generalized problem becomes NP-hard, but we show that it is fixed-parameter tractable (FPT) in the total number of constraints. This computational dichotomy between long reads and paired-end reads is also a general insight into multi-assembly problems. Peer reviewed
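    The classical MPC Problem mentioned in this abstract can be illustrated with the textbook reduction to bipartite matching. The sketch below is not the paper's algorithm and handles neither subpath constraints nor weights; the function name `min_path_cover_size` and the flag `allow_shared_nodes` are hypothetical. It assumes nodes are numbered 0..n-1 and the input graph is acyclic.

```python
def min_path_cover_size(n, edges, allow_shared_nodes=True):
    """Return the minimum number of paths covering all n nodes of a DAG.

    Textbook reduction: the answer is n minus the size of a maximum
    bipartite matching between "out" copies and "in" copies of the nodes.
    If paths may share nodes, the matching is computed on the transitive
    closure of the DAG instead of on the DAG itself.
    """
    succ = [set() for _ in range(n)]
    for u, v in edges:
        succ[u].add(v)

    if allow_shared_nodes:
        # Transitive closure: everything reachable from each node.
        closure = []
        for s in range(n):
            seen = set()
            stack = list(succ[s])
            while stack:
                x = stack.pop()
                if x not in seen:
                    seen.add(x)
                    stack.extend(succ[x])
            closure.append(seen)
        succ = closure

    # match_of[v] = node whose "out" copy is matched to the "in" copy of v.
    match_of = [-1] * n

    def augment(u, visited):
        # Kuhn's augmenting-path search from the "out" copy of u.
        for v in succ[u]:
            if v in visited:
                continue
            visited.add(v)
            if match_of[v] == -1 or augment(match_of[v], visited):
                match_of[v] = u
                return True
        return False

    matching = sum(augment(u, set()) for u in range(n))
    return n - matching


# Two sources merging at node 2, which then branches to two sinks.
edges = [(0, 2), (1, 2), (2, 3), (2, 4)]
print(min_path_cover_size(5, edges))                            # shared nodes allowed
print(min_path_cover_size(5, edges, allow_shared_nodes=False))  # node-disjoint paths
```

    On this small graph, two paths suffice when paths may share node 2 (for example 0-2-3 and 1-2-4), while three are needed if paths must be node-disjoint.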

    Flow Decomposition With Subpath Constraints

    Flow network decomposition is a natural model for problems where we are given a flow network arising from superimposing a set of weighted paths and would like to recover the underlying data, i.e., decompose the flow into the original paths and their weights. Thus, variations on flow decomposition are often used as subroutines in multi-assembly problems such as RNA transcript assembly. In practice, we frequently have access to information beyond flow values in the form of subpaths, and many tools incorporate these heuristically. But despite acknowledging their utility in practice, previous work has not formally addressed the effect of subpath constraints on the accuracy of flow network decomposition approaches. We formalize the flow decomposition with subpath constraints problem, give the first algorithms for it, and study its usefulness for recovering ground truth decompositions. For finding a minimum decomposition, we propose both a heuristic and an FPT algorithm. Experiments on RNA transcript datasets show that for instances with larger solution path sets, the addition of subpath constraints finds 13% more ground truth solutions when minimal decompositions are found exactly, and 30% more ground truth solutions when minimal decompositions are found heuristically. Peer reviewed
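    To make the notion of decomposing a flow into weighted paths concrete, here is a minimal sketch of the well-known greedy-width heuristic, which repeatedly removes a source-to-sink path of maximum bottleneck. It is not the heuristic or the FPT algorithm proposed in the paper and ignores subpath constraints. The names `widest_path` and `greedy_width` are hypothetical, and the sketch assumes a DAG with positive integer flows that satisfy flow conservation.

```python
def widest_path(residual, source, sink):
    """Return (path, width) for a max-bottleneck source-sink path, or (None, 0)."""
    out = {}
    for (u, v), f in residual.items():
        out.setdefault(u, []).append((v, f))

    # memo[u] = (best bottleneck from u to sink, next node on that path)
    memo = {sink: (float("inf"), None)}

    def best(u):
        if u not in memo:
            choice = (0, None)
            for v, f in out.get(u, []):
                width = min(f, best(v)[0])
                if width > choice[0]:
                    choice = (width, v)
            memo[u] = choice
        return memo[u]

    width, _ = best(source)
    if width == 0:
        return None, 0
    path = [source]
    while path[-1] != sink:
        path.append(memo[path[-1]][1])
    return path, width


def greedy_width(flow, source, sink):
    """Decompose flow (dict: (u, v) -> int) into a list of (path, weight)."""
    residual = {e: f for e, f in flow.items() if f > 0}
    result = []
    while True:
        path, width = widest_path(residual, source, sink)
        if path is None:
            break
        # Subtract the path's weight; drop edges whose flow is used up.
        for e in zip(path, path[1:]):
            residual[e] -= width
            if residual[e] == 0:
                del residual[e]
        result.append((path, width))
    return result


flow = {("s", "a"): 5, ("a", "b"): 3, ("a", "c"): 2, ("b", "t"): 3, ("c", "t"): 2}
for path, weight in greedy_width(flow, "s", "t"):
    print(weight, "-".join(path))
```

    Each round saturates at least one edge, so the loop terminates after at most as many rounds as there are edges. The result is a valid decomposition but not necessarily a minimum one.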

    Feasibility of Flow Decomposition with Subpath Constraints in Linear Time


    Improving RNA Assembly via Safety and Completeness in Flow Decompositions

    Extended version of RECOMB 2022 paper. Decomposing a network flow into weighted paths is a problem with numerous applications, ranging from networking and transportation planning to bioinformatics. In some applications we look for a decomposition that is optimal with respect to some property, such as the number of paths used, robustness to edge deletion, or length of the longest path. However, in many bioinformatic applications, we seek a specific decomposition where the paths correspond to some underlying data that generated the flow. In these cases, no optimization criteria guarantee the identification of the correct decomposition. Therefore, we propose to instead report the safe paths, which are subpaths of at least one path in every flow decomposition. In this work, we give the first local characterization of safe paths for flow decompositions in directed acyclic graphs, leading to a practical algorithm for finding the complete set of safe paths. In addition, we evaluate our algorithm on RNA transcript data sets against a trivial safe algorithm (extended unitigs), the recently proposed safe paths for path covers (TCBB 2021) and the popular heuristic greedy-width. On the one hand, we found that besides maintaining perfect precision, our safe and complete algorithm reports a significantly higher coverage (≈50%) compared with the other safe algorithms. On the other hand, the greedy-width algorithm, although it reports a better coverage, also reports a significantly lower precision on complex graphs (for genes expressing a large number of transcripts). Overall, our safe and complete algorithm outperforms greedy-width (by ≈20%) on a unified metric (F-score) considering both coverage and precision when the evaluated data set has a significant number of complex graphs. Moreover, it also has a superior time (4−5×) and space performance (1.2−2.2×), resulting in a better and more practical approach for bioinformatic applications of flow decomposition. Peer reviewed

    Covering Pairs in Directed Acyclic Graphs

    The Minimum Path Cover (MinPC) problem on directed acyclic graphs (DAGs) is a classical problem in graph theory that provides a clear and simple mathematical formulation for several applications in computational biology. In this paper, we study the computational complexity of three constrained variants of MinPC motivated by the recent introduction of Next-Generation Sequencing technologies. The first variant (MinRPC), given a DAG and a set of pairs of vertices, asks for a minimum-cardinality set of (not necessarily disjoint) paths such that both vertices of each pair belong to the same path. For this problem, we establish a sharp tractability borderline depending on the 'overlapping degree' of the instance, a natural parameter in some applications of the problem. The second variant we consider (MinPCRP), given a DAG and a set of pairs of vertices, asks for a minimum-cardinality set of (not necessarily disjoint) paths 'covering' all the vertices of the graph and such that both vertices of each pair belong to the same path. For this problem, we show that, while it is NP-hard to decide whether there exists a solution consisting of at most three paths, it is possible to decide in polynomial time whether a solution consisting of at most two paths exists. The third variant (MaxRPSP), given a DAG and a set of pairs of vertices, asks for a single path containing the maximum number of the given pairs of vertices. We show that MaxRPSP is W[1]-hard when parameterized by the number of covered pairs and we give a fixed-parameter algorithm when the parameter is the maximum overlapping degree.

    Accelerating ILP solvers for Minimum Flow Decompositions through search space and dimensionality reductions

    Given a flow network, the Minimum Flow Decomposition (MFD) problem asks for the smallest possible set of weighted paths whose superposition equals the flow. It is a classical, strongly NP-hard problem that has proven useful in RNA transcript assembly and in applications outside of bioinformatics. We improve an existing ILP (Integer Linear Programming) model by Dias et al. [RECOMB 2022] for DAGs by decreasing the solver's search space using solution safety and several other optimizations. This results in a significant speedup compared to the original ILP, of up to 55-90x on average on the hardest instances. Moreover, we show that our optimizations also apply to MFD problem variants, resulting in similar speedups, going up to 123x on the hardest instances. We also developed an ILP model of reduced dimensionality for an MFD variant in which the solution path weights are restricted to a given set. This model can find an optimal MFD solution for most instances, and overall, its accuracy significantly outperforms that of previous greedy algorithms while being up to an order of magnitude faster than our optimized ILP.
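    The defining condition of a flow decomposition in this abstract, a set of weighted paths whose superposition equals the flow, can be checked directly. The sketch below only verifies a candidate solution; it does not solve MFD and is unrelated to the ILP models described. The names `superpose` and `is_flow_decomposition` are hypothetical, and flows are assumed to be given as a dict from edges to numbers.

```python
from collections import Counter


def superpose(weighted_paths):
    """Sum the weights of all paths on every edge they traverse."""
    total = Counter()
    for path, weight in weighted_paths:
        for edge in zip(path, path[1:]):
            total[edge] += weight
    return total


def is_flow_decomposition(flow, weighted_paths):
    """True if the weighted paths superpose exactly to the given flow."""
    total = superpose(weighted_paths)
    # Compare only edges with nonzero value on either side.
    nonzero_flow = {e: f for e, f in flow.items() if f != 0}
    nonzero_total = {e: w for e, w in total.items() if w != 0}
    return nonzero_flow == nonzero_total


flow = {("s", "a"): 5, ("a", "b"): 3, ("a", "c"): 2, ("b", "t"): 3, ("c", "t"): 2}
candidate = [(["s", "a", "b", "t"], 3), (["s", "a", "c", "t"], 2)]
print(is_flow_decomposition(flow, candidate))
```

    A candidate with fewer paths that still passes this check is a smaller decomposition; the MFD problem asks for the smallest such set.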