Consistency and convergence rate of phylogenetic inference via regularization
It is common in phylogenetics to have some, perhaps partial, information
about the overall evolutionary tree of a group of organisms and wish to find an
evolutionary tree of a specific gene for those organisms. There may not be
enough information in the gene sequences alone to accurately reconstruct the
correct "gene tree." Although the gene tree may deviate from the "species tree"
due to a variety of genetic processes, in the absence of evidence to the
contrary it is parsimonious to assume that they agree. A common statistical
approach in these situations is to develop a likelihood penalty to incorporate
such additional information. Recent studies using simulation and empirical data
suggest that a likelihood penalty quantifying concordance with a species tree
can significantly improve the accuracy of gene tree reconstruction compared to
using sequence data alone. However, the consistency of such an approach has not
yet been established, nor have convergence rates been bounded. Because
phylogenetics is a non-standard inference problem, the standard theory does not
apply. In this paper, we propose a penalized maximum likelihood estimator for
gene tree reconstruction, where the penalty is the square of the
Billera-Holmes-Vogtmann geodesic distance from the gene tree to the species
tree. We prove that this method is consistent, and derive its convergence rate
for estimating the discrete gene tree structure and continuous edge lengths
(representing the amount of evolution that has occurred on that branch)
simultaneously. We find that the regularized estimator is "adaptive fast
converging," meaning that it can reconstruct all edges of length greater than
any given threshold from gene sequences of polynomial length. Our method does
not require the species tree to be known exactly; in fact, our asymptotic
theory holds for any such guide tree. Comment: 34 pages, 5 figures. To appear in The Annals of Statistics
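
The penalized estimator described in this abstract can be written schematically as follows. The notation is assumed here for illustration and is not taken from the paper: $\ell_n(T)$ is the log-likelihood of a candidate gene tree $T$ (topology and edge lengths) given $n$ sites, $T_0$ is the guide (species) tree, $d_{\mathrm{BHV}}$ is the Billera-Holmes-Vogtmann geodesic distance, and $\lambda_n \ge 0$ is a regularization weight. The paper's exact normalization and choice of weight may differ.

```latex
\hat{T}_n \in \operatorname*{arg\,max}_{T}
  \left\{ \frac{1}{n}\,\ell_n(T) \;-\; \lambda_n \, d_{\mathrm{BHV}}(T, T_0)^2 \right\}
```

Setting $\lambda_n = 0$ recovers the ordinary maximum likelihood estimator, while larger values pull the estimate toward the guide tree.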
On the convergence of the maximum likelihood estimator for the transition rate under a 2-state symmetric model
Maximum likelihood estimators are used extensively to estimate unknown
parameters of stochastic trait evolution models on phylogenetic trees. Although
the MLE has been proven to converge to the true value in the independent-sample
case, we cannot appeal to this result because trait values of different species
are correlated due to shared evolutionary history. In this paper, we consider a
2-state symmetric model for a single binary trait and investigate the
theoretical properties of the MLE for the transition rate in the large-tree
limit. Here, the large-tree limit is a theoretical scenario where the number of
taxa increases to infinity and we can observe the trait values for all species.
Specifically, we prove that the MLE converges to the true value under some
regularity conditions. These conditions ensure that the tree shape is not too
irregular, and hold for many practical scenarios such as trees with bounded
edges, trees generated from the Yule (pure birth) process, and trees generated
from the coalescent point process. Our result also provides an upper bound for
the distance between the MLE and the true value.
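
For readers unfamiliar with the 2-state symmetric model, the following minimal Python sketch shows its transition probabilities along a single branch. It assumes the parameterization in which each state flips to the other at rate `rate` (the paper may scale the rate differently), and the function names are hypothetical. The `log_likelihood` helper treats observations as independent, which is exactly the simplification the abstract says does not hold on a tree, so it is shown only for contrast.

```python
import math


def transition_prob(same_state: bool, rate: float, t: float) -> float:
    """Probability of ending in the same (or the other) state after time t.

    Assumes a 2-state symmetric chain in which each state flips at `rate`.
    """
    decay = math.exp(-2.0 * rate * t)
    return 0.5 * (1.0 + decay) if same_state else 0.5 * (1.0 - decay)


def log_likelihood(rate: float, pairs) -> float:
    """Log-likelihood of independent (same_state, t) observations.

    Independence is an illustrative simplification; tip states on a
    phylogeny are correlated through shared ancestry.
    """
    return sum(math.log(transition_prob(same, rate, t)) for same, t in pairs)


# Hypothetical observations: (did the state stay the same?, branch length)
observations = [(True, 0.1), (True, 0.2), (False, 1.5), (True, 0.4)]
for candidate_rate in (0.1, 0.5, 2.0):
    print(candidate_rate, log_likelihood(candidate_rate, observations))
```

As the branch length grows, both probabilities approach 1/2, which is why long branches carry little information about the rate.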
SPADE4: Sparsity and Delay Embedding based Forecasting of Epidemics
Predicting the evolution of diseases is challenging, especially when the data
are scarce and incomplete. The most popular tools for modelling and
predicting infectious disease epidemics are compartmental models. They stratify
the population into compartments according to health status and model the
dynamics of these compartments using dynamical systems. However, these
predefined systems may not capture the true dynamics of the epidemic due to the
complexity of the disease transmission and human interactions. In order to
overcome this drawback, we propose Sparsity and Delay Embedding based
Forecasting (SPADE4) for predicting epidemics. SPADE4 predicts the future
trajectory of an observable variable without the knowledge of the other
variables or the underlying system. We use a random features model with sparse
regression to handle the data scarcity issue and employ Takens' delay embedding
theorem to capture the nature of the underlying system from the observed
variable. We show that our approach outperforms compartmental models when
applied to both simulated and real data. Comment: 24 pages, 13 figures, 2 tables
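
The delay-embedding step mentioned in this abstract can be illustrated with a short Python sketch. It only builds the delay vectors from a single observed series; the random features model and sparse regression that SPADE4 applies afterwards are not shown. The function name `delay_embed`, its parameters, and the sample series are hypothetical.

```python
def delay_embed(series, dim, lag=1):
    """Build delay vectors [x[i], x[i + lag], ..., x[i + (dim - 1) * lag]].

    `dim` is the embedding dimension and `lag` the spacing between samples.
    """
    if dim < 1 or lag < 1:
        raise ValueError("dim and lag must be positive integers")
    span = (dim - 1) * lag  # distance from first to last entry of a vector
    return [
        [series[i + k * lag] for k in range(dim)]
        for i in range(len(series) - span)
    ]


# Hypothetical observed case counts (illustrative values only)
cases = [1, 2, 4, 8, 16, 32]
vectors = delay_embed(cases, dim=3, lag=1)
print(vectors)
```

Each delay vector can then serve as the input to a regression model that predicts the next value or rate of change of the observed variable.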
Birth/birth-death processes and their computable transition probabilities with biological applications
Birth-death processes track the size of a univariate population, but many
biological systems involve interaction between populations, necessitating
models for two or more populations simultaneously. A lack of efficient methods
for evaluating finite-time transition probabilities of bivariate processes,
however, has restricted statistical inference in these models. Researchers rely
on computationally expensive methods such as matrix exponentiation or Monte
Carlo approximation, restricting likelihood-based inference to small systems,
or indirect methods such as approximate Bayesian computation. In this paper, we
introduce the birth(death)/birth-death process, a tractable bivariate extension
of the birth-death process. We develop an efficient and robust algorithm to
calculate the transition probabilities of birth(death)/birth-death processes
using a continued fraction representation of their Laplace transforms. Next, we
identify several exemplary models arising in molecular epidemiology,
macro-parasite evolution, and infectious disease modeling that fall within this
class, and demonstrate advantages of our proposed method over existing
approaches to inference in these models. Notably, the ubiquitous stochastic
susceptible-infectious-removed (SIR) model falls within this class, and we
emphasize that computable transition probabilities newly enable direct
inference of parameters in the SIR model. We also propose a very fast method
for approximating the transition probabilities under the SIR model via a novel
branching process simplification, and compare it to the continued fraction
representation method with application to the 17th century plague in Eyam.
Although the two methods produce similar maximum a posteriori estimates, the
branching process approximation fails to capture the correlation structure in
the joint posterior distribution.
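
To make the bivariate structure of the stochastic SIR model concrete, here is a minimal Python simulation sketch. The susceptible count can only decrease while the infectious count can move in both directions. This is a Gillespie-style simulation, not the continued fraction method or the branching process approximation from the paper. The function name, the mass-action infection rate `beta * s * i`, and all parameter values are assumptions chosen for illustration.

```python
import random


def simulate_sir(s, i, beta, gamma, t_end, rng):
    """Simulate one stochastic SIR path; return a list of (time, S, I)."""
    t = 0.0
    path = [(t, s, i)]
    while i > 0:
        infection_rate = beta * s * i  # S -> S - 1, I -> I + 1
        removal_rate = gamma * i       # I -> I - 1
        total = infection_rate + removal_rate
        t += rng.expovariate(total)    # waiting time to the next event
        if t > t_end:
            break
        if rng.random() < infection_rate / total:
            s, i = s - 1, i + 1
        else:
            i -= 1
        path.append((t, s, i))
    return path


# Hypothetical parameters, not the Eyam data
rng = random.Random(0)
path = simulate_sir(s=50, i=1, beta=0.01, gamma=0.5, t_end=10.0, rng=rng)
print(len(path), path[-1])
```

Estimating a transition probability from many such simulated paths is the kind of Monte Carlo approximation that the abstract describes as computationally expensive.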