Combinatorics of least squares trees
A recurring theme in the least squares approach to phylogenetics has been the
discovery of elegant combinatorial formulas for the least squares estimates of
edge lengths. These formulas have proved useful for the development of
efficient algorithms, and have also been important for understanding
connections among popular phylogeny algorithms. For example, the selection
criterion of the neighbor-joining algorithm is now understood in terms of the
combinatorial formulas of Pauplin for estimating tree length.
We highlight a phylogenetically desirable property that weighted least
squares methods should satisfy, and provide a complete characterization of
methods that satisfy the property. The necessary and sufficient condition is a
multiplicative four point condition that the variance matrix needs to
satisfy. The proof is based on the observation that the Lagrange multipliers in
the proof of the Gauss--Markov theorem are tree-additive. Our results
generalize and complete previous work on ordinary least squares, balanced
minimum evolution and the taxon weighted variance model. They also provide a
time optimal algorithm for computation
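
As a small illustration of the kind of combinatorial formula mentioned above, the ordinary least squares edge lengths for a four-taxon tree ab|cd reduce to simple averages of pairwise distances. The sketch below is illustrative only: the function name and the dictionary layout are assumptions, and it covers the unweighted quartet case, not the weighted methods characterized in the abstract.

```python
def ols_quartet_edges(d):
    """OLS edge lengths for the quartet tree ab|cd.

    `d` maps two-letter pair names ("ab", "ac", ...) to pairwise distances.
    The key names are a hypothetical convention chosen for this sketch.
    """
    ab, ac, ad = d["ab"], d["ac"], d["ad"]
    bc, bd, cd = d["bc"], d["bd"], d["cd"]
    cross = ac + ad + bc + bd  # the four distances that cross the internal edge
    return {
        "a": ab / 2 + (ac + ad - bc - bd) / 4,
        "b": ab / 2 + (bc + bd - ac - ad) / 4,
        "c": cd / 2 + (ac + bc - ad - bd) / 4,
        "d": cd / 2 + (ad + bd - ac - bc) / 4,
        "internal": cross / 4 - (ab + cd) / 2,
    }
```

When the input distances are exactly tree-additive on ab|cd, these averages return the generating edge lengths; for noisy distances they give the least squares fit, which may include negative values.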
Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting
Phylogenetic networks are necessary to represent the tree of life expanded by
edges to represent events such as horizontal gene transfers, hybridizations or
gene flow. Not all species follow the paradigm of vertical inheritance of their
genetic material. While a great deal of research has flourished into the
inference of phylogenetic trees, statistical methods to infer phylogenetic
networks are still limited and under development. The main disadvantage of
existing methods is a lack of scalability. Here, we present a statistical
method to infer phylogenetic networks from multi-locus genetic data in a
pseudolikelihood framework. Our model accounts for incomplete lineage sorting
through the coalescent model, and for horizontal inheritance of genes through
reticulation nodes in the network. Computation of the pseudolikelihood is fast
and simple, and it avoids the burdensome calculation of the full likelihood
which can be intractable with many species. Moreover, estimation at the
quartet-level has the added computational benefit that it is easily
parallelizable. Simulation studies comparing our method to a full likelihood
approach show that our pseudolikelihood approach is much faster without
compromising accuracy. We applied our method to reconstruct the evolutionary
relationships among swordtails and platyfishes (Xiphophorus: Poeciliidae),
a group characterized by widespread hybridizations
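
To make the quartet-level idea concrete, here is a minimal sketch of a log pseudolikelihood for the tree-only case, with no reticulation nodes. It relies on the standard coalescent result that, for a four-taxon species tree with internal branch length t in coalescent units, the concordant quartet has probability 1 - (2/3)exp(-t) and each discordant quartet has probability (1/3)exp(-t). The function names and the input layout are assumptions made for illustration; this is not the authors' implementation.

```python
import math

def quartet_cfs(t):
    """Expected quartet frequencies (concordant, discordant, discordant)
    for internal branch length t in coalescent units."""
    minor = math.exp(-t) / 3.0
    return (1.0 - 2.0 * minor, minor, minor)

def log_pseudolikelihood(quartets):
    """Sum quartet log-likelihoods as if quartets were independent.

    `quartets` is a list of (t, counts) pairs, where counts holds the
    observed gene-tree counts in the same order as quartet_cfs(t).
    This layout is a hypothetical choice for the sketch.
    """
    total = 0.0
    for t, counts in quartets:
        for cf, n in zip(quartet_cfs(t), counts):
            if n:  # skip zero counts so log() is only taken where needed
                total += n * math.log(cf)
    return total
```

Because each quartet contributes its own term to the sum, the terms can be evaluated separately, which is what makes this kind of computation easy to parallelize.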
A probabilistic model for gene content evolution with duplication, loss, and horizontal transfer
We introduce a Markov model for the evolution of a gene family along a
phylogeny. The model includes parameters for the rates of horizontal gene
transfer, gene duplication, and gene loss, in addition to branch lengths in the
phylogeny. The likelihood for the changes in the size of a gene family across
different organisms can be calculated in O(N+hM^2) time and O(N+M^2) space,
where N is the number of organisms, h is the height of the phylogeny, and M
is the sum of family sizes. We apply the model to the evolution of gene content
in Proteobacteria using the gene families in the COG (Clusters of Orthologous
Groups) database
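
The following sketch shows what a Markov model of family size along a single branch can look like. It assumes, beyond what the abstract states, that duplication and loss rates scale linearly with the current family size and that transfer adds genes at a constant rate; family size is capped at a hypothetical `max_size`. Transition probabilities are computed by uniformization in plain Python. This is not the O(N+hM^2) algorithm described above, only an illustration of the underlying process.

```python
import math

def transition_matrix(dup, loss, gain, t, max_size, tol=1e-12, max_terms=10000):
    """P[i][j] = probability of going from size i to size j in time t.

    Assumed rates (illustrative): i -> i+1 at dup*i + gain, i -> i-1 at loss*i.
    """
    n = max_size + 1
    up = [(dup * i + gain) if i < max_size else 0.0 for i in range(n)]
    down = [loss * i for i in range(n)]
    rate = max(u + d for u, d in zip(up, down))
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    if rate == 0 or t == 0:
        return identity
    # Uniformized jump matrix R = I + Q / rate (tridiagonal, rows sum to 1).
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        R[i][i] = 1.0 - (up[i] + down[i]) / rate
        if i + 1 < n:
            R[i][i + 1] = up[i] / rate
        if i > 0:
            R[i][i - 1] = down[i] / rate
    # P(t) = sum_k Poisson(k; rate*t) * R^k, truncated once the weights cover 1 - tol.
    weight = math.exp(-rate * t)
    covered = weight
    power = identity
    P = [[weight * x for x in row] for row in power]
    k = 0
    while covered < 1.0 - tol and k < max_terms:
        k += 1
        power = [[sum(power[i][m] * R[m][j] for m in range(n)) for j in range(n)]
                 for i in range(n)]
        weight *= rate * t / k
        covered += weight
        for i in range(n):
            for j in range(n):
                P[i][j] += weight * power[i][j]
    return P
```

The cap on family size and the dense matrix products keep the sketch short; they are design shortcuts for readability and would not scale to large families.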
Determining species tree topologies from clade probabilities under the coalescent
One approach to estimating a species tree from a collection of gene trees is
to first estimate probabilities of clades from the gene trees, and then to
construct the species tree from the estimated clade probabilities. While a
greedy consensus algorithm, which consecutively accepts the most probable
clades compatible with previously accepted clades, can be used for this second
stage, this method is known to be statistically inconsistent under the
multispecies coalescent model. This raises the question of whether it is
theoretically possible to reconstruct the species tree from known probabilities
of clades on gene trees. We investigate clade probabilities arising from the
multispecies coalescent model, with an eye toward identifying features of the
species tree. Clades on gene trees with probability greater than 1/3 are shown
to reflect clades on the species tree, while those with smaller probabilities
may not. Linear invariants of clade probabilities are studied both
computationally and theoretically, with certain linear invariants giving
insight into the clade structure of the species tree. For species trees with
generic edge lengths, these invariants can be used to identify the species tree
topology. These theoretical results both confirm that clade probabilities
contain full information on the species tree topology and suggest future
directions of study for developing statistically consistent inference methods
from clade frequencies on gene trees.
Comment: 25 pages, 2 figures
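
For reference, the greedy consensus step described above can be sketched as follows: clades are visited in order of decreasing probability and accepted when they are compatible with everything accepted so far. The function names and the use of frozensets as clades are assumptions made for illustration. As the abstract notes, this procedure is statistically inconsistent under the multispecies coalescent, so the sketch shows the baseline being discussed, not a recommended estimator.

```python
def compatible(x, y):
    """Two clades are compatible if they are disjoint or one contains the other."""
    return x.isdisjoint(y) or x <= y or y <= x

def greedy_consensus(clade_probs):
    """Return the clades accepted by greedy consensus.

    `clade_probs` maps frozensets of taxon names to estimated probabilities.
    """
    accepted = []
    # Visit clades from most to least probable (ties keep input order).
    for clade, _ in sorted(clade_probs.items(), key=lambda kv: -kv[1]):
        if all(compatible(clade, other) for other in accepted):
            accepted.append(clade)
    return accepted
```

A short usage example with made-up probabilities:

```python
probs = {
    frozenset("ab"): 0.5,
    frozenset("abc"): 0.4,
    frozenset("bc"): 0.3,
    frozenset("cd"): 0.2,
}
print(greedy_consensus(probs))  # accepts {a, b} and {a, b, c}; rejects the other two
```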