Algebraic methods in phylogenetics
To those outside the field, and even to some focused on empirical applications, phylogenetics may appear to have little to do with algebra. Probability and statistics are clearly important ingredients, as modeling and inferring evolutionary relationships motivate the field. Combinatorics is also an obvious component, as the graph-theoretic notions of trees, and more recently networks, are used to describe the relationships. But where does the algebra arise? The models used in phylogenetics are necessarily complex. At the simplest, they depend on a tree structure, as well as Markov matrices describing changes in nucleotide sequences along the edges. These two components result in probability distributions given by rather complicated polynomials on the parameters of the models, whose precise form reflects the structure of the tree.
Peer reviewed. Postprint (author's final draft).
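As a rough illustration of how a tree and Markov matrices yield polynomial probabilities, the sketch below computes the joint distribution of leaf states on a three-leaf star tree. The two-state symmetric model, the function names and the parameter values are all assumptions chosen for brevity; they are not taken from the work itself.

```python
import numpy as np

def symmetric_markov(p):
    """2x2 Markov matrix with substitution probability p (illustrative model)."""
    return np.array([[1.0 - p, p], [p, 1.0 - p]])

def star_tree_distribution(root, M1, M2, M3):
    """Joint distribution of the three leaf states on a star tree.

    P[i, j, k] = sum_r root[r] * M1[r, i] * M2[r, j] * M3[r, k],
    so each entry is a polynomial in the root distribution and matrix entries.
    """
    return np.einsum("r,ri,rj,rk->ijk", root, M1, M2, M3)

root = np.array([0.5, 0.5])  # assumed uniform root distribution
P = star_tree_distribution(
    root, symmetric_markov(0.1), symmetric_markov(0.2), symmetric_markov(0.3)
)
print(P[0, 0, 0])  # 0.5*0.9*0.8*0.7 + 0.5*0.1*0.2*0.3, about 0.255
print(P.sum())     # a probability distribution, so the entries sum to 1
```

The sum over the unobserved root state is what turns products of matrix entries into a multi-term polynomial. On larger trees the same pattern repeats at every internal node, which is why the form of the polynomials reflects the tree.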
Identifiability of Large Phylogenetic Mixture Models
Phylogenetic mixture models are statistical models of character evolution
allowing for heterogeneity. Each of the classes in some unknown partition of
the characters may evolve by different processes, or even along different
trees. The fundamental question of whether parameters of such a model are
identifiable is difficult to address, due to the complexity of the
parameterization. We analyze mixture models on large trees, with many mixture
components, showing that both numerical and tree parameters are indeed
identifiable in these models when all trees are the same. We also explore the
extent to which our algebraic techniques can be employed to extend the result
to mixtures on different trees.
Comment: 15 pages
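A minimal sketch of what a same-tree mixture looks like numerically, assuming a three-leaf star tree, a two-state symmetric model and two classes. All names and values here are hypothetical and only illustrate the definition, not the techniques of the paper.

```python
import numpy as np

def symmetric_markov(p):
    """2x2 Markov matrix with substitution probability p (illustrative model)."""
    return np.array([[1.0 - p, p], [p, 1.0 - p]])

def star_tree_distribution(p):
    """Leaf-pattern distribution on a 3-leaf star tree, same matrix on every edge."""
    root = np.array([0.5, 0.5])
    M = symmetric_markov(p)
    return np.einsum("r,ri,rj,rk->ijk", root, M, M, M)

def mixture_distribution(weights, components):
    """Weighted sum of component distributions; weights must form a distribution."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return np.tensordot(weights, np.stack(components), axes=1)

slow = star_tree_distribution(0.05)  # slowly evolving class
fast = star_tree_distribution(0.30)  # quickly evolving class
mix = mixture_distribution([0.6, 0.4], [slow, fast])
print(mix.sum())  # still a probability distribution
```

Only `mix` is observable from data. The identifiability question is whether the class weights, the per-class numerical parameters and the tree can be recovered from it.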
PROCOV: maximum likelihood estimation of protein phylogeny under covarion models and site-specific covarion pattern analysis
Background: The covarion hypothesis of molecular evolution holds that selective pressures on a given amino acid or nucleotide site are dependent on the identity of other sites in the molecule that change throughout time, resulting in changes of evolutionary rates of sites along the branches of a phylogenetic tree. At the sequence level, covarion-like evolution at a site manifests as conservation of nucleotide or amino acid states among some homologs where the states are not conserved in other homologs (or groups of homologs). Covarion-like evolution has been shown to relate to changes in functions at sites in different clades, and, if ignored, can adversely affect the accuracy of phylogenetic inference.
Results: PROCOV (protein covarion analysis) is a software tool that implements a number of previously proposed covarion models of protein evolution for phylogenetic inference in a maximum likelihood framework. Several algorithmic and implementation improvements in this tool over previous versions make computationally expensive tree searches with covarion models more efficient and analyses of large phylogenomic data sets tractable. PROCOV can be used to identify covarion sites by comparing the site likelihoods under the covarion process to the corresponding site likelihoods under a rates-across-sites (RAS) process. Those sites with the greatest log-likelihood difference between a 'covarion' and an RAS process were found to be of functional or structural significance in a dataset of bacterial and eukaryotic elongation factors.
Conclusion: Covarion models implemented in PROCOV may be especially useful for phylogenetic estimation when ancient divergences between sequences have occurred and rates of evolution at sites are likely to have changed over the tree. PROCOV can also be used to study lineage-specific functional shifts in protein families that result in changes in the patterns of site variability among subtrees.
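The site-ranking idea described above can be sketched as follows. This is not PROCOV's interface or output format: the function name is hypothetical and the per-site log-likelihood values are made up for illustration. It assumes the two sets of site log-likelihoods have already been computed by some other means.

```python
import numpy as np

def rank_sites_by_loglik_gain(site_lnl_covarion, site_lnl_ras, top=5):
    """Return (site index, lnL difference) pairs, largest difference first.

    The difference is lnL(covarion) - lnL(RAS) at each site; large positive
    values mark sites that the covarion process explains better.
    """
    cov = np.asarray(site_lnl_covarion, dtype=float)
    ras = np.asarray(site_lnl_ras, dtype=float)
    if cov.shape != ras.shape:
        raise ValueError("both models must provide one log-likelihood per site")
    diff = cov - ras
    order = np.argsort(-diff, kind="stable")[:top]
    return [(int(i), float(diff[i])) for i in order]

# Made-up per-site log-likelihoods for four alignment sites.
lnl_covarion = [-3.2, -5.1, -2.0, -7.0]
lnl_ras = [-3.3, -6.9, -2.0, -7.5]
ranked = rank_sites_by_loglik_gain(lnl_covarion, lnl_ras, top=2)
print(ranked)  # sites 1 and 3 show the largest gains under the covarion model
```

Site indices here are zero-based positions in the input lists. Mapping them back to alignment columns, and deciding how large a difference is noteworthy, is left to the analysis.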
Identifiability of parameters in latent structure models with many observed variables
While hidden class models of various types arise in many statistical
applications, it is often difficult to establish the identifiability of their
parameters. Focusing on models in which there is some structure of independence
of some of the observed variables conditioned on hidden ones, we demonstrate a
general approach for establishing identifiability utilizing algebraic
arguments. A theorem of J. Kruskal for a simple latent-class model with finite
state space lies at the core of our results, though we apply it to a diverse
set of models. These include mixtures of both finite and nonparametric product
distributions, hidden Markov models and random graph mixture models, and lead
to a number of new results and improvements to old ones. In the parametric
setting, this approach indicates that for such models, the classical definition
of identifiability is typically too strong. Instead generic identifiability
holds, which implies that the set of nonidentifiable parameters has measure
zero, so that parameter inference is still meaningful. In particular, this
sheds light on the properties of finite mixtures of Bernoulli products, which
have been used for decades despite being known to have nonidentifiable
parameters. In the nonparametric setting, we again obtain identifiability only
when certain restrictions are placed on the distributions that are mixed, but
we explicitly describe the conditions.
Comment: Published at http://dx.doi.org/10.1214/09-AOS689 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
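For the latent-class model with three observed variables and r hidden classes, Kruskal's theorem gives a sufficient condition for identifiability up to relabeling of the classes: the Kruskal ranks of the three conditional-probability matrices must sum to at least 2r + 2. The sketch below checks that condition by brute force. The function names, the tolerance and the example matrices are assumptions, and each matrix is taken to have one row per hidden class.

```python
from itertools import combinations

import numpy as np

def kruskal_rank(M, tol=1e-10):
    """Largest k such that every set of k rows of M is linearly independent."""
    M = np.asarray(M, dtype=float)
    n_rows = M.shape[0]
    k = 0
    for size in range(1, n_rows + 1):
        subsets = combinations(range(n_rows), size)
        if all(np.linalg.matrix_rank(M[list(s)], tol=tol) == size for s in subsets):
            k = size
        else:
            break
    return k

def kruskal_condition_holds(M1, M2, M3):
    """Check I1 + I2 + I3 >= 2r + 2, with r the number of hidden classes (rows)."""
    r = np.asarray(M1).shape[0]
    total = sum(kruskal_rank(M) for M in (M1, M2, M3))
    return total >= 2 * r + 2

# Rows are conditional distributions of one binary variable given the class.
informative = [[0.9, 0.1], [0.2, 0.8]]    # distinct rows: Kruskal rank 2
uninformative = [[0.5, 0.5], [0.5, 0.5]]  # identical rows: Kruskal rank 1

print(kruskal_condition_holds(informative, informative, informative))    # True
print(kruskal_condition_holds(informative, informative, uninformative))  # False
```

The condition is sufficient only, so a `False` result does not by itself show that the parameters are nonidentifiable. The subset enumeration is exponential in the number of classes and is meant for small examples.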