Consistency and convergence rate of phylogenetic inference via regularization
It is common in phylogenetics to have some, perhaps partial, information
about the overall evolutionary tree of a group of organisms and wish to find an
evolutionary tree of a specific gene for those organisms. There may not be
enough information in the gene sequences alone to accurately reconstruct the
correct "gene tree." Although the gene tree may deviate from the "species tree"
due to a variety of genetic processes, in the absence of evidence to the
contrary it is parsimonious to assume that they agree. A common statistical
approach in these situations is to develop a likelihood penalty to incorporate
such additional information. Recent studies using simulation and empirical data
suggest that a likelihood penalty quantifying concordance with a species tree
can significantly improve the accuracy of gene tree reconstruction compared to
using sequence data alone. However, the consistency of such an approach has not
yet been established, nor have convergence rates been bounded. Because
phylogenetics is a non-standard inference problem, the standard theory does not
apply. In this paper, we propose a penalized maximum likelihood estimator for
gene tree reconstruction, where the penalty is the square of the
Billera-Holmes-Vogtmann geodesic distance from the gene tree to the species
tree. We prove that this method is consistent, and derive its convergence rate
for estimating the discrete gene tree structure and continuous edge lengths
(representing the amount of evolution that has occurred on that branch)
simultaneously. We find that the regularized estimator is "adaptive fast
converging," meaning that it can reconstruct all edges of length greater than
any given threshold from gene sequences of polynomial length. Our method does
not require the species tree to be known exactly; in fact, our asymptotic
theory holds for any such guide tree.
Comment: 34 pages, 5 figures. To appear in The Annals of Statistics.
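The penalised objective can be illustrated in one dimension. The sketch below is a toy analogue, not the paper's estimator: it fits a single Jukes-Cantor branch length by grid search, with a quadratic penalty pulling the estimate toward a guide value (standing in for the squared BHV geodesic distance to a species tree). The site counts, guide value, and penalty weight are made up.

```python
import math

def jc69_mismatch_prob(t):
    # Jukes-Cantor probability that a site differs across a branch of length t
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))

def neg_log_lik(t, n_sites, n_mismatches):
    p = jc69_mismatch_prob(t)
    return -(n_mismatches * math.log(p) + (n_sites - n_mismatches) * math.log(1.0 - p))

def estimate(n_sites, n_mismatches, guide_t=0.0, lam=0.0):
    # Grid search over branch lengths; the penalty is the squared distance
    # to the guide value (a 1-D analogue of the squared BHV geodesic penalty).
    grid = [i * 5e-4 for i in range(1, 2001)]
    return min(grid, key=lambda t: neg_log_lik(t, n_sites, n_mismatches)
                                   + lam * (t - guide_t) ** 2)

t_mle = estimate(100, 30)                       # unpenalised MLE, about 0.383
t_pen = estimate(100, 30, guide_t=0.2, lam=50)  # shrunk toward the guide
```

The penalised estimate lands strictly between the guide value and the unpenalised MLE, which is the shrinkage behaviour the likelihood penalty is meant to produce.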
Uncovering latent structure in valued graphs: A variational approach
As more and more network-structured data sets become available, the statistical
analysis of valued graphs has become commonplace. Looking for a latent
structure is one of the many strategies used to better understand the behavior
of a network. Several methods already exist for the binary case. We present a
model-based strategy to uncover groups of nodes in valued graphs. This
framework can be used for a wide range of parametric random graph models and
allows covariates to be included. Variational tools allow us to achieve approximate
maximum likelihood estimation of the parameters of these models. We provide a
simulation study showing that our estimation method performs well over a broad
range of situations. We apply this method to analyze host--parasite interaction
networks in forest ecosystems.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS361 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
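A minimal mean-field sketch of the latent-group idea for a valued graph (with hypothetical numbers; the paper's framework is more general and handles covariates): for Poisson-valued edges, node memberships tau are iterated through the usual fixed point log tau[i][q] proportional to log pi[q] plus the expected edge log-likelihoods under the other nodes' memberships.

```python
import math

# Toy valued graph: heavy edge counts within {0,1} and {2,3}, none across.
X = [[0, 6, 0, 0],
     [6, 0, 0, 0],
     [0, 0, 0, 6],
     [0, 0, 6, 0]]
pi = [0.5, 0.5]                    # group proportions (assumed known here)
lam = [[5.0, 0.5], [0.5, 5.0]]     # Poisson rates lam[q][r] between groups
n, Q = 4, 2

def poisson_loglik(x, rate):
    # log Poisson pmf up to the x!-term, which cancels in the normalisation
    return x * math.log(rate) - rate

# Mean-field fixed point: log tau[i][q] = log pi[q]
#   + sum_{j != i} sum_r tau[j][r] * log f(X[i][j]; lam[q][r])
tau = [[0.5] * Q for _ in range(n)]
tau[0] = [0.9, 0.1]                # break the label symmetry
for _ in range(20):
    new = []
    for i in range(n):
        logs = []
        for q in range(Q):
            s = math.log(pi[q])
            for j in range(n):
                if j != i:
                    s += sum(tau[j][r] * poisson_loglik(X[i][j], lam[q][r])
                             for r in range(Q))
            logs.append(s)
        m = max(logs)                       # softmax with max-subtraction
        w = [math.exp(v - m) for v in logs]
        z = sum(w)
        new.append([v / z for v in w])
    tau = new

groups = [max(range(Q), key=lambda q: t[q]) for t in tau]
```

On this toy graph the fixed-point iteration recovers the two blocks {0, 1} and {2, 3} from the edge values alone.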
The Markov blankets of life: autonomy, active inference and the free energy principle
This work addresses the autonomous organization of biological systems. It does so by considering the boundaries of biological systems, from individual cells to Homo sapiens, in terms of the presence of Markov blankets under the active inference scheme—a corollary of the free energy principle. A Markov blanket defines the boundaries of a system in a statistical sense. Here we consider how a collective of Markov blankets can self-assemble into a global system that itself has a Markov blanket, thereby providing an illustration of how autonomous systems can be understood as having layers of nested and self-sustaining boundaries. This allows us to show that: (i) any living system is a Markov blanketed system and (ii) the boundaries of such systems need not be co-extensive with the biophysical boundaries of a living organism. In other words, autonomous systems are hierarchically composed of Markov blankets of Markov blankets—all the way down to individual cells, all the way up to you and me, and all the way out to include elements of the local environment.
Template-Based Static Posterior Inference for Bayesian Probabilistic Programming
In Bayesian probabilistic programming, a central problem is to estimate the
normalised posterior distribution (NPD) of a probabilistic program with
conditioning. Prominent approximate approaches to address this problem include
Markov chain Monte Carlo and variational inference, but neither can deliver
guaranteed bounds within limited time. Moreover, most existing formal
approaches that perform exact inference for NPD are restricted to programs with
closed-form solutions or bounded loops/recursion. A recent work (Beutner et
al., PLDI 2022) derived guaranteed bounds for NPD over programs with unbounded
recursion. However, as this approach requires recursion unrolling, it suffers
from the path explosion problem. Furthermore, previous approaches do not
consider score-recursive probabilistic programs, which allow score statements
inside loops; this setting is non-trivial and requires careful treatment to
ensure the integrability of the normalising constant in NPD.
In this work, we propose a novel automated approach to derive bounds for NPD
via polynomial templates. Our approach can handle probabilistic programs with
unbounded while loops and continuous distributions with infinite supports. The
novelties in our approach are three-fold: First, we use polynomial templates to
circumvent the path explosion problem from recursion unrolling; Second, we
derive a novel multiplicative variant of Optional Stopping Theorem that
addresses the integrability issue in score-recursive programs; Third, to
increase the accuracy of the derived bounds via polynomial templates, we
propose a novel technique of truncation that truncates a program into a bounded
range of program values. Experiments over a wide range of benchmarks
demonstrate that our approach is time-efficient and can derive bounds for NPD
that are comparable with (or tighter than) the recursion-unrolling approach
(Beutner et al., PLDI 2022).
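The flavour of guaranteed NPD bounds can be conveyed on a made-up score-recursive program. Note that this sketch uses finite unrolling plus a tail bound, i.e. the baseline strategy the paper's polynomial templates are designed to avoid; it only illustrates what "guaranteed lower and upper bounds on the normalising constant" means.

```python
# Toy score-recursive program:
#   k = 0
#   while flip(0.5):      # continue with probability 1/2
#       score(0.9)        # multiply the trace weight by 0.9
#       k += 1
# Unnormalised weight of the trace with k loop iterations: 0.5**(k+1) * 0.9**k.
# Exact normalising constant: sum_k 0.5**(k+1) * 0.9**k = 0.5 / (1 - 0.45).

def bounds_on_Z(depth):
    # Lower bound: total weight of all traces explored up to `depth`.
    lower = sum(0.5 ** (k + 1) * 0.9 ** k for k in range(depth + 1))
    # Upper bound: every unexplored trace has score weight <= 0.9**(depth+1),
    # and the probability of running past `depth` iterations is 0.5**(depth+1).
    upper = lower + 0.9 ** (depth + 1) * 0.5 ** (depth + 1)
    return lower, upper

exact = 0.5 / (1.0 - 0.45)
lo, hi = bounds_on_Z(10)
```

The exact constant is provably sandwiched between the two bounds, and the gap shrinks geometrically with the unrolling depth, which is exactly the kind of guarantee that sampling-based approximations cannot provide.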
Variational Inference in Nonconjugate Models
Mean-field variational methods are widely used for approximate posterior
inference in many probabilistic models. In a typical application, mean-field
methods approximately compute the posterior with a coordinate-ascent
optimization algorithm. When the model is conditionally conjugate, the
coordinate updates are easily derived and in closed form. However, many models
of interest---like the correlated topic model and Bayesian logistic
regression---are nonconjugate. In these models, mean-field methods cannot be
directly applied and practitioners have had to develop variational algorithms
on a case-by-case basis. In this paper, we develop two generic methods for
nonconjugate models, Laplace variational inference and delta method variational
inference. Our methods have several advantages: they allow for easily derived
variational algorithms with a wide class of nonconjugate models; they extend
and unify some of the existing algorithms that have been derived for specific
models; and they work well on real-world datasets. We studied our methods on
the correlated topic model, Bayesian logistic regression, and hierarchical
Bayesian logistic regression.
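The Laplace step at the heart of Laplace variational inference can be shown on the smallest possible case: one-parameter Bayesian logistic regression, where the posterior is approximated by a Gaussian centred at the MAP with variance given by the inverse negative Hessian there. The data set and prior variance below are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny data set: labels flip from 0 to 1 as x increases.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [0, 0, 1, 1, 1]
prior_var = 10.0  # w ~ N(0, prior_var), an arbitrary choice

def grad_hess(w):
    # Gradient and Hessian of the log joint  log p(y | w) + log p(w)
    g = -w / prior_var
    h = -1.0 / prior_var
    for x, y in zip(xs, ys):
        p = sigmoid(w * x)
        g += x * (y - p)
        h -= x * x * p * (1.0 - p)
    return g, h

# Newton's method finds the MAP; the Gaussian approximation uses the
# curvature (negative inverse Hessian) at that mode.
w = 0.0
for _ in range(50):
    g, h = grad_hess(w)
    w -= g / h
laplace_mean, laplace_var = w, -1.0 / grad_hess(w)[1]
```

The resulting N(laplace_mean, laplace_var) is the Gaussian that Laplace-style methods use in place of the intractable logistic-regression posterior.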
Prior Density Learning in Variational Bayesian Phylogenetic Parameters Inference
The advances in variational inference are providing promising paths in
Bayesian estimation problems. These advances make variational phylogenetic
inference an alternative approach to Markov Chain Monte Carlo methods for
approximating the phylogenetic posterior. However, one of the main drawbacks of
such approaches is the modelling of the prior through fixed distributions,
which could bias the posterior approximation if they are distant from the
current data distribution. In this paper, we propose an approach and an
implementation framework to relax the rigidity of the prior densities by
learning their parameters using a gradient-based method and a neural
network-based parameterization. We applied this approach to estimate branch
lengths and evolutionary parameters under several Markov chain substitution
models. Simulation results show that the approach performs well in estimating
branch lengths and evolutionary model parameters. They also show
that a flexible prior model can provide better results than a predefined
that a flexible prior model could provide better results than a predefined
prior model. Finally, the results highlight that using neural networks improves
the initialization of the optimization of the prior density parameters.
Comment: Accepted as a full paper for publication at RECOMB-CG 2023 (camera-ready version). 15 pages (excluding references), 6 tables and 1 figure.
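The idea of tuning prior parameters by gradient rather than fixing them can be reduced to a toy empirical-Bayes analogue (not the paper's neural parameterization): observations y_i ~ N(theta_i, 1) with prior theta_i ~ N(m, 1) give the marginal y_i ~ N(m, 2), and gradient ascent on the log marginal learns the prior mean m. The observations below are made up.

```python
# Gradient ascent on the marginal log-likelihood of the prior mean m:
#   y_i ~ N(theta_i, 1),  theta_i ~ N(m, 1)  =>  y_i ~ N(m, 2)
#   d/dm log p(y | m) = sum_i (y_i - m) / 2
ys = [1.2, 0.7, 2.1, 1.5, 0.9]   # hypothetical observations
m, lr = 0.0, 0.1
for _ in range(200):
    grad = sum((y - m) / 2.0 for y in ys)
    m += lr * grad
# The optimum here is the sample mean, which the iteration recovers;
# a learned prior thus adapts to the data rather than staying fixed.
```

In the paper's setting the same principle is applied to phylogenetic prior densities, with a neural network replacing this single scalar parameter.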
Phylogenetic information complexity: Is testing a tree easier than finding it?
Phylogenetic trees describe the evolutionary history of a group of
present-day species from a common ancestor. These trees are typically
reconstructed from aligned DNA sequence data. In this paper we analytically
address the following question: is the amount of sequence data required to
accurately reconstruct a tree significantly more than the amount required to
test whether or not a candidate tree was the `true' tree? By `significantly',
we mean that the two quantities behave the same way as a function of the number
of species being considered. We prove that, for a certain type of model, the
amount of information required is not significantly different; while for
another type of model, the information required to test a tree is independent
of the number of leaves, while that required to reconstruct it grows with this
number. Our results combine probabilistic and combinatorial arguments.
Comment: 15 pages, 3 figures.