Evolutionary Inference via the Poisson Indel Process
We address the problem of the joint statistical inference of phylogenetic
trees and multiple sequence alignments from unaligned molecular sequences. This
problem is generally formulated in terms of string-valued evolutionary
processes along the branches of a phylogenetic tree. The classical evolutionary
process, the TKF91 model, is a continuous-time Markov chain model comprised of
insertion, deletion and substitution events. Unfortunately, this model gives
rise to an intractable computational problem---the computation of the marginal
likelihood under the TKF91 model is exponential in the number of taxa. In this
work, we present a new stochastic process, the Poisson Indel Process (PIP), in
which the complexity of this computation is reduced to linear. The new model is
closely related to the TKF91 model, differing only in its treatment of
insertions, but the new model has a global characterization as a Poisson
process on the phylogeny. Standard results for Poisson processes allow key
computations to be decoupled, which yields the favorable computational profile
of inference under the PIP model. We present illustrative experiments in which
Bayesian inference under the PIP model is compared to separate inference of
phylogenies and alignments. Comment: 33 pages, 6 figures
A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity
Ortholog detection (OD) is a critical step for comparative genomic analysis
of protein-coding sequences. In this paper, we begin with a comprehensive
comparison of four popular, methodologically diverse OD methods: MultiParanoid,
Blat, Multiz, and OMA. In head-to-head comparisons, these methods are shown to
significantly outperform one another 12-30% of the time. This high
complementarity motivates the presentation of the first tool for integrating
methodologically diverse OD methods. We term this program MOSAIC, or Multiple
Orthologous Sequence Analysis and Integration by Cluster optimization. Relative
to component and competing methods, we demonstrate that MOSAIC more than
quintuples the number of alignments for which all species are present, while
simultaneously maintaining or improving functional-, phylogenetic-, and
sequence identity-based measures of ortholog quality. Further, we demonstrate
that this improvement in alignment quality yields 40-280% more confidently
aligned sites. Combined, these factors translate to higher estimated levels of
overall conservation, while at the same time allowing for the detection of up
to 180% more positively selected sites. MOSAIC is available as a Python package.
MOSAIC alignments, source code, and full documentation are available at
http://pythonhosted.org/bio-MOSAIC
Accurate Reconstruction of Molecular Phylogenies for Proteins Using Codon and Amino Acid Unified Sequence Alignments (CAUSA)
Based on the molecular clock hypothesis and the neutral theory of molecular evolution, molecular phylogenies have been widely used to infer the evolutionary history of organisms and individual genes. Traditionally, alignments and phylogenetic trees of proteins and of their coding DNA sequences are constructed separately, so the two often lead to different conclusions. Here we present a new strategy for sequence alignment and phylogenetic tree reconstruction, codon and amino acid unified sequence alignment (CAUSA), which aligns DNA and protein sequences and draws phylogenetic trees in a unified manner. We demonstrate that CAUSA improves the accuracy of both multiple sequence alignments and phylogenetic trees by solving a variety of molecular evolutionary problems in viruses, bacteria and mammals. Our results support the hypothesis that the molecular clock for proteins has two pointers, existing separately in DNA and protein sequences. It is more accurate to read the molecular clock by an additive combination of these two pointers, since their ticking rates are sometimes consistent and sometimes different. The CAUSA software was released as open source under the GNU GPL license and can be downloaded free of charge from the website www.dnapluspro.com
The Dawn of Open Access to Phylogenetic Data
The scientific enterprise depends critically on the preservation of and open
access to published data. This basic tenet applies acutely to phylogenies
(estimates of evolutionary relationships among species). Increasingly,
phylogenies are estimated from increasingly large, genome-scale datasets using
increasingly complex statistical methods that require increasing levels of
expertise and computational investment. Moreover, the resulting phylogenetic
data provide an explicit historical perspective that critically informs
research in a vast and growing number of scientific disciplines. One such use
is the study of changes in rates of lineage diversification (speciation -
extinction) through time. As part of a meta-analysis in this area, we sought to
collect phylogenetic data (comprising nucleotide sequence alignment and tree
files) from 217 studies published in 46 journals over a 13-year period. We
document our attempts to procure those data (from online archives and by direct
request to corresponding authors), and report results of analyses (using
Bayesian logistic regression) to assess the impact of various factors on the
success of our efforts. Overall, complete phylogenetic data for ~60% of these
studies are effectively lost to science. Our study indicates that phylogenetic
data are more likely to be deposited in online archives and/or shared upon
request when: (1) the publishing journal has a strong data-sharing policy; (2)
the publishing journal has a higher impact factor; and (3) the data are
requested from faculty rather than students. Although the situation appears
dire, our analyses suggest that it is far from hopeless: recent initiatives by
the scientific community -- including policy changes by journals and funding
agencies -- are improving the state of affairs.
Sequence alignment, mutual information, and dissimilarity measures for constructing phylogenies
Existing sequence alignment algorithms use heuristic scoring schemes which
cannot be used as objective distance metrics. Therefore one relies on measures
like the p- or log-det distances, or makes explicit, and often simplistic,
assumptions about sequence evolution. Information theory provides an
alternative, in the form of mutual information (MI) which is, in principle, an
objective and model independent similarity measure. MI can be estimated by
concatenating and zipping sequences, yielding thereby the "normalized
compression distance". So far this has produced promising results, but with
uncontrolled errors. We describe a simple approach to get robust estimates of
MI from global pairwise alignments. Using standard alignment algorithms, this
gives for animal mitochondrial DNA estimates that are strikingly close to
estimates obtained from the alignment free methods mentioned above. Our main
result uses algorithmic (Kolmogorov) information theory, but we show that
similar results can also be obtained from Shannon theory. Because it is not
additive, normalized compression distance is not an optimal metric
for phylogenetics, but we propose a simple modification that overcomes the
issue of additivity. We test several versions of our MI based distance measures
on a large number of randomly chosen quartets and demonstrate that they all
perform better than traditional measures like the Kimura or log-det (resp.
paralinear) distances. Even a simplified version based on single letter Shannon
entropies, which can be easily incorporated in existing software packages, gave
superior results throughout the entire animal kingdom. But we see the main
virtue of our approach in a more general way. For example, it can also help to
judge the relative merits of different alignment algorithms, by estimating the
significance of specific alignments. Comment: 19 pages + 16 pages of supplementary material
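
As a rough illustration of the two quantities named in the abstract above, the following Python sketch computes a normalized compression distance by "concatenating and zipping" and a single-letter Shannon mutual information from the two rows of a pairwise alignment. It is a minimal sketch under stated assumptions, not the authors' estimator: the use of zlib as the compressor, the standard NCD formula (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), and the function names `compressed_size`, `ncd`, and `alignment_mutual_information` are all choices made here for illustration. The additivity correction proposed in the paper is not reproduced.

```python
import math
import zlib
from collections import Counter


def compressed_size(data: bytes) -> int:
    """Byte length of the zlib-compressed input (assumed stand-in for Kolmogorov complexity)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # "concatenate and zip"
    return (cxy - min(cx, cy)) / max(cx, cy)


def alignment_mutual_information(a: str, b: str) -> float:
    """Single-letter Shannon MI, in bits per column, between two aligned rows.

    Gap characters are treated as ordinary symbols (an assumption of this sketch).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    count_ab = Counter(zip(a, b))
    # I(A;B) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
    return sum(
        (c / n) * math.log2(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )
```

With real data, `ncd` would be applied to raw unaligned sequences and `alignment_mutual_information` to the two rows of a global pairwise alignment produced by a standard aligner. Compression-based values depend on the compressor and on sequence length, which is one source of the "uncontrolled errors" the abstract mentions.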