Reconciling taxonomy and phylogenetic inference: formalism and algorithms for describing discord and inferring taxonomic roots
Although taxonomy is often used informally to evaluate the results of
phylogenetic inference and find the root of phylogenetic trees, algorithmic
methods to do so are lacking. In this paper we formalize these procedures and
develop algorithms to solve the relevant problems. In particular, we introduce
a new algorithm that solves a "subcoloring" problem for expressing the
difference between the taxonomy and phylogeny at a given rank. This algorithm
improves upon the current best algorithm in terms of asymptotic complexity for
the parameter regime of interest; we also describe a branch-and-bound algorithm
that saves orders of magnitude in computation on real data sets. We also
develop a formalism and an algorithm for rooting phylogenetic trees according
to a taxonomy. All of these algorithms are implemented in freely-available
software. Comment: Version submitted to Algorithms for Molecular Biology. A number of
fixes from previous versio
Minimizing the average distance to a closest leaf in a phylogenetic tree
When performing an analysis on a collection of molecular sequences, it can be
convenient to reduce the number of sequences under consideration while
maintaining some characteristic of a larger collection of sequences. For
example, one may wish to select a subset of high-quality sequences that
represent the diversity of a larger collection of sequences. One may also wish
to specialize a large database of characterized "reference sequences" to a
smaller subset that is as close as possible on average to a collection of
"query sequences" of interest. Such a representative subset can be useful
whenever one wishes to find a set of reference sequences that is appropriate to
use for comparative analysis of environmentally-derived sequences, such as for
selecting "reference tree" sequences for phylogenetic placement of metagenomic
reads. In this paper we formalize these problems in terms of the minimization
of the Average Distance to the Closest Leaf (ADCL) and investigate algorithms
to perform the relevant minimization. We show that the greedy algorithm is not
effective, show that a variant of the Partitioning Among Medoids (PAM)
heuristic gets stuck in local minima, and develop an exact dynamic programming
approach. Using this exact program we note that the performance of PAM appears
to be good for simulated trees, and is faster than the exact algorithm for
small trees. On the other hand, the exact program gives solutions for all
numbers of leaves less than or equal to the given desired number of leaves,
while PAM only gives a solution for the pre-specified number of leaves. Via
application to real data, we show that the ADCL criterion chooses chimeric
sequences less often than random subsets, while the maximization of
phylogenetic diversity chooses them more often than random. These algorithms
have been implemented in publicly available software. Comment: Please contact us with any comments or questions
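As a rough illustration of the ADCL criterion described in this abstract, the sketch below computes a simplified leaf-to-leaf version: the average, over all leaves, of the distance to the nearest selected leaf. The distance matrix `DIST`, the function names `adcl` and `best_subset`, and the brute-force search are assumptions made for illustration; the sketch is not the paper's exact dynamic program or its PAM variant, and it ignores mass placed on internal branches.

```python
from itertools import combinations


def adcl(dist, selected):
    """Average, over all leaves, of the distance to the closest selected leaf.

    dist is a symmetric matrix of pairwise leaf distances (assumed to be
    patristic distances); selected is a non-empty collection of leaf indices.
    """
    if not selected:
        raise ValueError("selected must be non-empty")
    n = len(dist)
    return sum(min(dist[i][j] for j in selected) for i in range(n)) / n


def best_subset(dist, k):
    """Exhaustively find a size-k subset minimizing ADCL (tiny inputs only)."""
    n = len(dist)
    return min(combinations(range(n), k), key=lambda s: adcl(dist, s))


# Hypothetical distances between four leaves forming two tight pairs:
# leaves 0 and 1 are close to each other, as are leaves 2 and 3.
DIST = [
    [0.0, 0.2, 1.0, 1.1],
    [0.2, 0.0, 1.0, 1.1],
    [1.0, 1.0, 0.0, 0.3],
    [1.1, 1.1, 0.3, 0.0],
]

if __name__ == "__main__":
    chosen = best_subset(DIST, 2)
    print(chosen, adcl(DIST, chosen))
```

In this toy input, keeping one leaf from each tight pair gives a much lower ADCL than keeping both leaves of the same pair, which is the sense in which a low-ADCL subset represents the full collection.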
Molecular Infectious Disease Epidemiology: Survival Analysis and Algorithms Linking Phylogenies to Transmission Trees
Recent work has attempted to use whole-genome sequence data from pathogens to
reconstruct the transmission trees linking infectors and infectees in
outbreaks. However, transmission trees from one outbreak do not generalize to
future outbreaks. Reconstruction of transmission trees is most useful to public
health if it leads to generalizable scientific insights about disease
transmission. In a survival analysis framework, estimation of transmission
parameters is based on sums or averages over the possible transmission trees. A
phylogeny can increase the precision of these estimates by providing partial
information about who infected whom. The leaves of the phylogeny represent
sampled pathogens, which have known hosts. The interior nodes represent common
ancestors of sampled pathogens, which have unknown hosts. Starting from
assumptions about disease biology and epidemiologic study design, we prove that
there is a one-to-one correspondence between the possible assignments of
interior node hosts and the transmission trees simultaneously consistent with
the phylogeny and the epidemiologic data on person, place, and time. We develop
algorithms to enumerate these transmission trees and show these can be used to
calculate likelihoods that incorporate both epidemiologic data and a phylogeny.
A simulation study confirms that this leads to more efficient estimates of
hazard ratios for infectiousness and baseline hazards of infectious contact,
and we use these methods to analyze data from a foot-and-mouth disease virus
outbreak in the United Kingdom in 2001. These results demonstrate the
importance of data on individuals who escape infection, which is often
overlooked. The combination of survival analysis and algorithms linking
phylogenies to transmission trees is a rigorous but flexible statistical
foundation for molecular infectious disease epidemiology. Comment: 28 pages, 11 figures, 3 table
Visualizing Co-Phylogenetic Reconciliations
We introduce a hybrid metaphor for the visualization of the reconciliations
of co-phylogenetic trees, that are mappings among the nodes of two trees. The
typical application is the visualization of the co-evolution of hosts and
parasites in biology. Our strategy combines a space-filling and a node-link
approach. Differently from traditional methods, it guarantees an unambiguous
and `downward' representation whenever the reconciliation is time-consistent
(i.e., meaningful). We address the problem of the minimization of the number of
crossings in the representation, by giving a characterization of planar
instances and by establishing the complexity of the problem. Finally, we
propose heuristics for computing representations with few crossings. Comment: This paper appears in the Proceedings of the 25th International
Symposium on Graph Drawing and Network Visualization (GD 2017
A chloroplast phylogeny of Arisaema (Araceae) illustrates Tertiary floristic links between Asia, North America, and East Africa
The evolution of Arisaema is reconstructed, based on combined sequences (2048 aligned bases) from the chloroplast trnL intron, trnL-trnF spacer, and rpl20-rps12 spacer obtained for species from all 11 sections, including sectional type species and geographically disjunct East African and North American/Mexican species. Analyses were rooted with a representative sample of the closest outgroups, Pinellia and Typhonium, to rigorously test the monophyly of Arisaema. Sections in Arisaema are mostly based on leaf, stem, and inflorescence characters and, with one exception, are not rejected by the molecular data; however, statistical support for sectional relationships in the genus remains poor. Section Tortuosa, which includes eastern North American A. dracontium and Mexican A. macrospathum, is demonstrably polyphyletic. The third New World species, A. triphyllum, also occurs in eastern North America and groups with a different Asian clade than do A. dracontium/A. macrospathum. The genus thus appears to have entered North America twice. Fossil infructescences similar to those of A. triphyllum are known from approximately 18 million-year-old deposits in Washington State and can serve to calibrate a molecular clock. Constraining the age of A. triphyllum to 18 million years (my) and applying either a semiparametric or an ultrametric clock model to the combined data yields an age of approximately 31–49 my for the divergence of A. dracontium/A. macrospathum from their Asian relatives and of 19–32 my for the divergence between African A. schimperianum and a Tibetan/Nepalese relative. The genus thus provides an example of the Oligocene/Miocene floristic links between East Africa, Arabia, the Himalayan region, China, and North America. The phylogeny also suggests secondary loss of the environmental sex determination strategy that characterizes all arisaemas except for two subspecies of A. flavum, which have consistently bisexual spathes.
These subspecies are tetraploid and capable of selfing, while a third subspecies of A. flavum is diploid and retains the sex-changing strategy. In the molecular trees, the sex-changing subspecies is sister to the two non-sex-changing ones, and the entire species is not basal in the genus
Learning Latent Tree Graphical Models
We study the problem of learning a latent tree graphical model where samples
are available only from a subset of variables. We propose two consistent and
computationally efficient algorithms for learning minimal latent trees, that
is, trees without any redundant hidden nodes. Unlike many existing methods, the
observed nodes (or variables) are not constrained to be leaf nodes. Our first
algorithm, recursive grouping, builds the latent tree recursively by
identifying sibling groups using so-called information distances. One of the
main contributions of this work is our second algorithm, which we refer to as
CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree
over the observed variables is constructed. This global step groups the
observed nodes that are likely to be close to each other in the true latent
tree, thereby guiding subsequent recursive grouping (or equivalent procedures)
on much smaller subsets of variables. This results in more accurate and
efficient learning of latent trees. We also present regularized versions of our
algorithms that learn latent tree approximations of arbitrary distributions. We
compare the proposed algorithms to other methods by performing extensive
numerical experiments on various latent tree graphical models such as hidden
Markov models and star graphs. In addition, we demonstrate the applicability of
our methods on real-world datasets by modeling the dependency structure of
monthly stock returns in the S&P index and of the words in the 20 newsgroups
dataset
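To make the "information distances" mentioned in this abstract more concrete, the sketch below assumes the Gaussian case, where the distance between two variables is commonly taken to be the negative log of the absolute correlation coefficient. Under that assumption, distances add along paths of the tree, which is the property that lets sibling groups be identified. The function name and the correlation values are hypothetical and chosen only for illustration.

```python
import math


def information_distance(rho):
    """Information distance for jointly Gaussian variables (assumed form):
    d = -log|rho|, where rho is the correlation coefficient."""
    if rho == 0 or abs(rho) > 1:
        raise ValueError("rho must be nonzero and lie in [-1, 1]")
    return -math.log(abs(rho))


# Hypothetical chain X1 - X2 - X3: in a Gaussian tree model the correlation
# between the endpoints is the product of the correlations along the path.
rho_12 = 0.8
rho_23 = 0.5
rho_13 = rho_12 * rho_23

d_12 = information_distance(rho_12)
d_23 = information_distance(rho_23)
d_13 = information_distance(rho_13)

if __name__ == "__main__":
    # Additivity along the path: d_13 equals d_12 + d_23 up to rounding.
    print(d_13, d_12 + d_23)
```

Because the distances are additive on the tree, comparing differences such as d(i, k) - d(j, k) across other nodes k can reveal whether i and j share a parent, which is the kind of test a recursive grouping procedure relies on.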