The identifiability of tree topology for phylogenetic models, including covarion and mixture models
For a model of molecular evolution to be useful for phylogenetic inference,
the topology of evolutionary trees must be identifiable. That is, from a joint
distribution the model predicts, it must be possible to recover the tree
parameter. We establish tree identifiability for a number of phylogenetic
models, including a covarion model and a variety of mixture models with a
limited number of classes. The proof is based on the introduction of a more
general model, allowing more states at internal nodes of the tree than at
leaves, and the study of the algebraic variety formed by the joint
distributions to which it gives rise. Tree identifiability is first established
for this general model through the use of certain phylogenetic invariants.
Comment: 20 pages, 1 figur
Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites
The general Markov plus invariable sites (GM+I) model of biological sequence
evolution is a two-class model in which an unknown proportion of sites are not
allowed to change, while the remainder undergo substitutions according to a
Markov process on a tree. For statistical use it is important to know if the
model is identifiable: can both the tree topology and the numerical parameters
be determined from a joint distribution describing sequences only at the leaves
of the tree? We establish that for generic parameters both the tree and all
numerical parameter values can be recovered, up to clearly understood issues of
"label swapping." The method of analysis is algebraic, using phylogenetic
invariants to study the variety defined by the model. Simple rational formulas,
expressed in terms of determinantal ratios, are found for recovering numerical
parameters describing the invariable sites.
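
The "label swapping" caveat in the abstract above can be illustrated with a small numerical sketch. It is not taken from the paper: the three-leaf star tree, the two-state alphabet, and every name and number below (`gm_joint`, `gmi_joint`, `swap_labels`, `delta`, `inv_pi`, the matrices) are assumptions chosen for illustration. It builds the joint distribution at the leaves for a general Markov process mixed with an invariable class, then checks that relabeling the hidden states at the internal node leaves that distribution unchanged.

```python
from itertools import product


def gm_joint(pi, mats):
    """Leaf joint distribution on a star tree under a general Markov model.

    pi   -- distribution of the hidden state at the internal (root) node
    mats -- one row-stochastic matrix per leaf; mats[j][i][s] is the
            probability that leaf j shows state s given hidden state i
    """
    n_leaf_states = len(mats[0][0])
    joint = {}
    for pattern in product(range(n_leaf_states), repeat=len(mats)):
        total = 0.0
        for root, weight in enumerate(pi):
            term = weight
            for matrix, leaf_state in zip(mats, pattern):
                term *= matrix[root][leaf_state]
            total += term
        joint[pattern] = total
    return joint


def gmi_joint(delta, inv_pi, pi, mats):
    """Mix in an invariable class of proportion delta (hypothetical helper).

    Invariable sites show the same state at every leaf, drawn from inv_pi.
    """
    variable = gm_joint(pi, mats)
    mixed = {}
    for pattern, prob in variable.items():
        constant = len(set(pattern)) == 1
        inv_prob = inv_pi[pattern[0]] if constant else 0.0
        mixed[pattern] = delta * inv_prob + (1.0 - delta) * prob
    return mixed


def swap_labels(pi, mats, perm):
    """Relabel the hidden states at the internal node by the permutation perm."""
    new_pi = [pi[i] for i in perm]
    new_mats = [[matrix[i] for i in perm] for matrix in mats]
    return new_pi, new_mats


# Illustrative parameter values only; none come from the paper.
pi = [0.3, 0.7]
mats = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6]],
    [[0.8, 0.2], [0.1, 0.9]],
]
delta, inv_pi = 0.25, [0.5, 0.5]

base = gmi_joint(delta, inv_pi, pi, mats)
swapped_pi, swapped_mats = swap_labels(pi, mats, [1, 0])
swapped = gmi_joint(delta, inv_pi, swapped_pi, swapped_mats)

# Largest difference between the two joint distributions (rounding error only).
max_gap = max(abs(base[p] - swapped[p]) for p in base)
```

Because the two parameter sets produce the same leaf distribution, no method working from leaf data alone can tell them apart, which is why recovery of numerical parameters is only claimed up to this relabeling.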
Identifiability of parameters in latent structure models with many observed variables
While hidden class models of various types arise in many statistical
applications, it is often difficult to establish the identifiability of their
parameters. Focusing on models in which there is some structure of independence
of some of the observed variables conditioned on hidden ones, we demonstrate a
general approach for establishing identifiability utilizing algebraic
arguments. A theorem of J. Kruskal for a simple latent-class model with finite
state space lies at the core of our results, though we apply it to a diverse
set of models. These include mixtures of both finite and nonparametric product
distributions, hidden Markov models and random graph mixture models, and lead
to a number of new results and improvements to old ones. In the parametric
setting, this approach indicates that for such models, the classical definition
of identifiability is typically too strong. Instead generic identifiability
holds, which implies that the set of nonidentifiable parameters has measure
zero, so that parameter inference is still meaningful. In particular, this
sheds light on the properties of finite mixtures of Bernoulli products, which
have been used for decades despite being known to have nonidentifiable
parameters. In the nonparametric setting, we again obtain identifiability only
when certain restrictions are placed on the distributions that are mixed, but
we explicitly describe the conditions.
Comment: Published at http://dx.doi.org/10.1214/09-AOS689 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
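
The abstract above names Kruskal's theorem as its core tool but does not state it. The sketch below assumes the commonly cited form of the condition for a latent-class model with r classes and three observed variables: if the Kruskal ranks I1, I2, I3 of the three conditional-probability matrices satisfy I1 + I2 + I3 >= 2r + 2, the parameters are identifiable up to relabeling of the classes. The function names and example matrices are hypothetical, and exact rational arithmetic is used so that rank tests are not affected by rounding.

```python
from fractions import Fraction
from itertools import combinations


def rank(rows):
    """Rank of a matrix given as a list of rows, by exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    rk = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        pivot = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, len(m)):
            factor = m[i][col] / m[rk][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk


def kruskal_rank(rows):
    """Largest k such that every set of k rows is linearly independent."""
    for size in range(1, len(rows) + 1):
        for subset in combinations(rows, size):
            if rank(list(subset)) < size:
                return size - 1
    return len(rows)


def kruskal_condition(m1, m2, m3):
    """Check I1 + I2 + I3 >= 2r + 2, where r is the number of latent classes.

    Row i of each matrix is the conditional distribution of one observed
    variable given latent class i.
    """
    r = len(m1)
    total = kruskal_rank(m1) + kruskal_rank(m2) + kruskal_rank(m3)
    return total >= 2 * r + 2


F = Fraction
# Two latent classes, three binary observed variables (illustrative values).
good = [[F(9, 10), F(1, 10)], [F(1, 5), F(4, 5)]]
# Both classes give the same distribution, so the Kruskal rank drops to 1.
degenerate = [[F(1, 2), F(1, 2)], [F(1, 2), F(1, 2)]]

holds = kruskal_condition(good, good, good)
fails = kruskal_condition(good, good, degenerate)
```

In this example `holds` is true because each matrix has Kruskal rank 2, giving 6 >= 6, while `fails` is false because the degenerate matrix contributes only 1. The condition is sufficient rather than necessary, so a failed check does not by itself show that the parameters are nonidentifiable.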