Minimizing the average distance to a closest leaf in a phylogenetic tree
When performing an analysis on a collection of molecular sequences, it can be
convenient to reduce the number of sequences under consideration while
maintaining some characteristic of a larger collection of sequences. For
example, one may wish to select a subset of high-quality sequences that
represent the diversity of a larger collection of sequences. One may also wish
to specialize a large database of characterized "reference sequences" to a
smaller subset that is as close as possible on average to a collection of
"query sequences" of interest. Such a representative subset can be useful
whenever one wishes to find a set of reference sequences that is appropriate to
use for comparative analysis of environmentally-derived sequences, such as for
selecting "reference tree" sequences for phylogenetic placement of metagenomic
reads. In this paper we formalize these problems in terms of the minimization
of the Average Distance to the Closest Leaf (ADCL) and investigate algorithms
to perform the relevant minimization. We show that the greedy algorithm is not
effective, show that a variant of the Partitioning Among Medoids (PAM)
heuristic gets stuck in local minima, and develop an exact dynamic programming
approach. Using this exact program we note that the performance of PAM appears
to be good for simulated trees, and is faster than the exact algorithm for
small trees. On the other hand, the exact program gives solutions for all
numbers of leaves less than or equal to the given desired number of leaves,
while PAM only gives a solution for the pre-specified number of leaves. Via
application to real data, we show that the ADCL criterion chooses chimeric
sequences less often than random subsets, while the maximization of
phylogenetic diversity chooses them more often than random. These algorithms
have been implemented in publicly available software.
Comment: Please contact us with any comments or questions
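As a rough illustration of the ADCL criterion described in the abstract above, the following minimal sketch computes the average distance to the closest selected leaf and finds the best subset by exhaustive search. It is not the paper's exact dynamic program or its PAM variant. It assumes leaf-to-leaf distances are already available as a symmetric matrix and that all leaves carry equal weight; the function names and the four-leaf distance matrix are hypothetical.

```python
from itertools import combinations


def adcl(dist, chosen):
    """Average, over all leaves, of the distance to the closest chosen leaf.

    Assumption: dist[i][j] is the path distance between leaves i and j,
    and every leaf is weighted equally.
    """
    n = len(dist)
    return sum(min(dist[i][j] for j in chosen) for i in range(n)) / n


def best_subset_brute_force(dist, k):
    """Try every k-subset of leaves and return one with minimal ADCL.

    Exhaustive search is only feasible for very small inputs; it stands in
    here for the exact algorithm, which the abstract does not spell out.
    """
    n = len(dist)
    return min(combinations(range(n), k), key=lambda s: adcl(dist, s))


# Hypothetical distances: leaves 0 and 1 form one tight pair, 2 and 3 another.
dist = [
    [0.0, 2.0, 6.0, 6.0],
    [2.0, 0.0, 6.0, 6.0],
    [6.0, 6.0, 0.0, 2.0],
    [6.0, 6.0, 2.0, 0.0],
]

print(adcl(dist, (0, 1)))                # both picks from the same pair
print(adcl(dist, (0, 2)))                # one pick from each pair
print(best_subset_brute_force(dist, 2))
```

In this toy case, choosing one leaf from each pair gives a lower ADCL than choosing both leaves from the same pair, which is the sense in which a low-ADCL subset "represents" the full collection.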
When two trees go to war
Rooted phylogenetic networks are often constructed by combining trees,
clusters, triplets or characters into a single network that in some
well-defined sense simultaneously represents them all. We review these four
models and investigate how they are related. In general, the model chosen
influences the minimum number of reticulation events required. However, when
one obtains the input data from two binary trees, we show that the minimum
number of reticulations is independent of the model. The number of
reticulations necessary to represent the trees, triplets, clusters (in the
softwired sense) and characters (with unrestricted multiple crossover
recombination) are all equal. Furthermore, we show that these results also hold
when not the number of reticulations but the level of the constructed network
is minimised. We use these unification results to settle several complexity
questions that have been open in the field for some time. We also give explicit
examples to show that already for data obtained from three binary trees the
models begin to diverge.
Polyhedral geometry of Phylogenetic Rogue Taxa
It is well known among phylogeneticists that adding an extra taxon (e.g.
species) to a data set can alter the structure of the optimal phylogenetic tree
in surprising ways. However, little is known about this "rogue taxon" effect.
In this paper we characterize the behavior of balanced minimum evolution (BME)
phylogenetics on data sets of this type using tools from polyhedral geometry.
First we show that for any distance matrix there exist distances to a "rogue
taxon" such that the BME-optimal tree for the data set with the new taxon does
not contain any nontrivial splits (bipartitions) of the optimal tree for the
original data. Second, we prove a theorem which restricts the topology of
BME-optimal trees for data sets of this type, thus showing that a rogue taxon
cannot have an arbitrary effect on the optimal tree. Third, we construct
polyhedral cones computationally which give complete answers for BME rogue
taxon behavior when our original data fits a tree on four, five, and six taxa.
We use these cones to derive sufficient conditions for rogue taxon behavior for
four taxa, and to understand the frequency of the rogue taxon effect via
simulation.
Comment: In this version, we add quartet distances and fix Table 4
Multivariate Approaches to Classification in Extragalactic Astronomy
Clustering objects into synthetic groups is a natural activity of any
science. Astrophysics is not an exception and is now facing a deluge of data.
For galaxies, the century-old Hubble classification and the Hubble tuning
fork are still largely in use, together with numerous mono- or bivariate
classifications most often made by eye. However, a classification must be
driven by the data, and sophisticated multivariate statistical tools are used
more and more often. In this paper we review these different approaches in
order to situate them in the general context of unsupervised and supervised
learning. We insist on the astrophysical outcomes of these studies to show that
multivariate analyses provide an obvious path toward a renewal of our
classification of galaxies and are invaluable tools to investigate the physics
and evolution of galaxies.
Comment: Open Access paper.
http://www.frontiersin.org/milky_way_and_galaxies/10.3389/fspas.2015.00003/abstract
<10.3389/fspas.2015.00003>