Sparse Dynamic Programming on DAGs with Small Width
The minimum path cover problem asks us to find a minimum-cardinality set of paths that cover all the nodes of a directed acyclic graph (DAG). We study the case when the size k of a minimum path cover is small, that is, when the DAG has a small width. This case is motivated by applications in pan-genomics, where the genomic variation of a population is expressed as a DAG. We observe that classical alignment algorithms exploiting sparse dynamic programming can be extended to the sequence-against-DAG case by mimicking the algorithm for sequences on each path of a minimum path cover and handling an evaluation order anomaly with reachability queries. Namely, we introduce a general framework for DAG-extensions of sparse dynamic programming. This framework produces algorithms that are slower than their counterparts on sequences only by a factor k. We illustrate this on two classical problems extended to DAGs: longest increasing subsequence and longest common subsequence. For the former, we obtain an algorithm with running time O(k|E| log |V|). This matches the optimal solution to the classical problem variant when the input sequence is modeled as a path. We obtain an analogous result for the longest common subsequence problem. We then apply this technique to the co-linear chaining problem, which is a generalization of the above two problems. The algorithm for this problem turns out to be more involved, needing further ingredients, such as an FM-index tailored for large alphabets and a two-dimensional range search tree modified to support range maximum queries. We also study a general sequence-to-DAG alignment formulation that allows affine gap costs in the sequence.
The main ingredient of the proposed framework is a new algorithm for finding a minimum path cover of a DAG (V, E) in O(k|E| log |V|) time, improving all known time-bounds when k is small and the DAG is not too dense. In addition to boosting the sparse dynamic programming framework, an immediate consequence of this new minimum path cover algorithm is an improved space/time tradeoff for reachability queries in arbitrary directed graphs.
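For orientation, the sequence baseline that the abstract refers to can be sketched as follows. This is the classical O(n log n) longest increasing subsequence routine on a plain sequence (a DAG that is a single path, so k = 1), not the DAG algorithm of the paper; the function name `lis_length` is chosen here for illustration only.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of a longest strictly increasing subsequence, in O(n log n)."""
    # tails[i] is the smallest possible last value of an increasing
    # subsequence of length i + 1 seen so far; tails is always sorted.
    tails = []
    for x in seq:
        i = bisect_left(tails, x)  # first position whose value is >= x
        if i == len(tails):
            tails.append(x)        # x extends the longest subsequence found
        else:
            tails[i] = x           # x is a smaller tail for length i + 1
    return len(tails)
```

The sorted `tails` array plays the role of the sparse dynamic programming table: each element triggers one binary search instead of a scan over all earlier elements.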
High-dimensional learning of linear causal networks via inverse covariance estimation
We establish a new framework for statistical estimation of directed acyclic
graphs (DAGs) when data are generated from a linear, possibly non-Gaussian
structural equation model. Our framework consists of two parts: (1) inferring
the moralized graph from the support of the inverse covariance matrix; and (2)
selecting the best-scoring graph amongst DAGs that are consistent with the
moralized graph. We show that when the error variances are known or estimated
to close enough precision, the true DAG is the unique minimizer of the score
computed using the reweighted squared l_2-loss. Our population-level results
have implications for the identifiability of linear SEMs when the error
covariances are specified up to a constant multiple. On the statistical side,
we establish rigorous conditions for high-dimensional consistency of our
two-part algorithm, defined in terms of a "gap" between the true DAG and the
next best candidate. Finally, we demonstrate that dynamic programming may be
used to select the optimal DAG in linear time when the treewidth of the
moralized graph is bounded. Comment: 41 pages, 7 figures
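Step (1) of the framework can be illustrated at the population level with a small sketch. It assumes the linear SEM is written as X_j = sum_i B[i][j] X_i + e_j with independent errors of variance omega[j], in which case the inverse covariance is (I - B) diag(1/omega) (I - B)^T. The function names, the tolerance, and the example weights are hypothetical, and the sketch works with exact parameters rather than an estimate from data.

```python
def precision_matrix(B, omega):
    """Inverse covariance of the linear SEM X_j = sum_i B[i][j] * X_i + e_j.

    B[i][j] is the weight of edge i -> j; omega[j] is the variance of e_j.
    Uses Theta = (I - B) diag(1/omega) (I - B)^T.
    """
    p = len(B)
    A = [[(1.0 if j == l else 0.0) - B[j][l] for l in range(p)] for j in range(p)]
    return [
        [sum(A[j][l] * A[k][l] / omega[l] for l in range(p)) for k in range(p)]
        for j in range(p)
    ]


def moral_graph_edges(theta, tol=1e-12):
    """Undirected edges read off the support of the precision matrix."""
    p = len(theta)
    return {(j, k) for j in range(p) for k in range(j + 1, p) if abs(theta[j][k]) > tol}


# v-structure 0 -> 2 <- 1: the two parents become adjacent in the moral graph
v_structure = [[0, 0, 0.8], [0, 0, -0.5], [0, 0, 0]]
# chain 0 -> 1 -> 2: no edge between 0 and 2
chain = [[0, 0.7, 0], [0, 0, 0.3], [0, 0, 0]]
```

For particular edge weights, entries of the precision matrix can cancel to zero, so reading the moral graph off the support relies on the absence of such cancellations.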
Forbidden Directed Minors and Kelly-width
Partial 1-trees are undirected graphs of treewidth at most one. Similarly,
partial 1-DAGs are directed graphs of Kelly-width at most two. It is well-known
that an undirected graph is a partial 1-tree if and only if it has no K_3
minor. In this paper, we generalize this characterization to partial 1-DAGs. We
show that partial 1-DAGs are characterized by three forbidden directed minors,
K_3, N_4 and M_5.
Advances in Learning Bayesian Networks of Bounded Treewidth
This work presents novel algorithms for learning Bayesian network structures
with bounded treewidth. Both exact and approximate methods are developed. The
exact method combines mixed-integer linear programming formulations for
structure learning and treewidth computation. The approximate method consists
in uniformly sampling k-trees (maximal graphs of treewidth k), and
subsequently selecting, exactly or approximately, the best structure whose
moral graph is a subgraph of that k-tree. Some properties of these methods
are discussed and proven. The approaches are empirically compared to each other
and to a state-of-the-art method for learning bounded treewidth structures on a
collection of public data sets with up to 100 variables. The experiments show
that our exact algorithm outperforms the state of the art, and that the
approximate approach is fairly accurate. Comment: 23 pages, 2 figures, 3 tables
Co-Linear Chaining on Pangenome Graphs
Pangenome reference graphs are useful in genomics because they compactly represent the genetic diversity within a species, a capability that linear references lack. However, efficiently aligning sequences to these graphs with complex topology and cycles can be challenging. Seed-chain-extend alignment algorithms use co-linear chaining as a standard technique to identify a good cluster of exact seed matches that can be combined to form an alignment. Recent works show how the co-linear chaining problem can be efficiently solved for acyclic pangenome graphs by exploiting their small width [Mäkinen et al., TALG'19] and how incorporating gap cost in the scoring function improves alignment accuracy [Chandra and Jain, RECOMB'23]. However, it remains open how to effectively generalize these techniques to general pangenome graphs, which contain cycles. Here we present the first practical formulation and an exact algorithm for co-linear chaining on cyclic pangenome graphs. We rigorously prove the correctness and computational complexity of the proposed algorithm. We evaluate the empirical performance of our algorithm by aligning simulated long reads from the human genome to a cyclic pangenome graph constructed from 95 publicly available haplotype-resolved human genome assemblies. While the existing heuristic-based algorithms are faster, the proposed algorithm provides a significant advantage in terms of accuracy.
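As a point of reference, the chaining step can be sketched in its simplest sequence-to-sequence form. This is a quadratic dynamic program over anchors with no gap cost and no overlaps, not the graph algorithm of the paper. The anchor layout `(q_start, q_end, r_start, r_end)` with half-open, non-empty intervals and the scoring by anchor length are assumptions made for illustration.

```python
def chain_score(anchors):
    """Best total anchor length over chains of co-linear, non-overlapping anchors.

    Each anchor is (q_start, q_end, r_start, r_end) with half-open, non-empty
    intervals on the query and the reference. Anchor a may precede anchor b
    when a ends before b starts on both sequences.
    """
    # A valid predecessor always starts earlier on the query, so sorting by
    # q_start guarantees predecessors are processed first.
    anchors = sorted(anchors)
    best = []  # best[j] = best score of a chain ending at anchors[j]
    for j, (qs, qe, rs, re) in enumerate(anchors):
        weight = qe - qs
        score = weight  # chain consisting of this anchor alone
        for i in range(j):
            _, prev_qe, _, prev_re = anchors[i]
            if prev_qe <= qs and prev_re <= rs:
                score = max(score, best[i] + weight)
        best.append(score)
    return max(best, default=0)
```

Sparse dynamic programming replaces the inner loop with a range-maximum query, which is where the search-tree structures mentioned in the literature come in.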
Minimum Path Cover: The Power of Parameterization
Computing a minimum path cover (MPC) of a directed acyclic graph (DAG) is a
fundamental problem with a myriad of applications, including reachability.
Although it is known how to solve the problem by a simple reduction to minimum
flow, recent theoretical advances exploit this idea to obtain algorithms
parameterized by the number of paths of an MPC, known as the width. These
results obtain fast [Mäkinen et al., TALG] and even linear time [Cáceres et
al., SODA 2022] algorithms in the small-width regime.
In this paper, we present the first publicly available high-performance
implementation of state-of-the-art MPC algorithms, including the parameterized
approaches. Our experiments on random DAGs show that parameterized algorithms
are orders-of-magnitude faster on dense graphs. Additionally, we present new
pre-processing heuristics based on transitive edge sparsification. We show that
our heuristics improve MPC-solvers by orders-of-magnitude.
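To make the notion of width concrete, here is a minimal sketch that computes the size of a minimum path cover of a small DAG when paths are allowed to share nodes. It uses the textbook reduction to bipartite matching on the transitive closure (width = |V| minus the size of a maximum matching), not the minimum-flow reduction or the parameterized algorithms discussed above, and it is only suitable for small inputs. The function name `dag_width` is hypothetical.

```python
def dag_width(n, edges):
    """Size of a minimum path cover of a DAG on nodes 0..n-1 (paths may share nodes)."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)

    # reach[s] = nodes reachable from s by a non-empty path (assumes the input is acyclic).
    reach = []
    for s in range(n):
        seen, stack = set(), list(adj[s])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(adj[v])
        reach.append(seen)

    # Maximum bipartite matching (augmenting paths) between a left and a right
    # copy of the nodes, with an edge u -> v whenever v is reachable from u.
    match_right = [-1] * n  # match_right[v] = left node matched to the right copy of v

    def augment(u, visited):
        for v in reach[u]:
            if v not in visited:
                visited.add(v)
                if match_right[v] == -1 or augment(match_right[v], visited):
                    match_right[v] = u
                    return True
        return False

    matching = sum(augment(u, set()) for u in range(n))
    return n - matching
```

For the diamond `0 -> 1 -> 3`, `0 -> 2 -> 3` the result is 2, since nodes 1 and 2 cannot lie on a common path.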