4 research outputs found

    Co-evolution is Incompatible with the Markov Assumption in Phylogenetics

    Markov models are extensively used in the analysis of molecular evolution. A recent line of research suggests that pairs of proteins with functional and physical interactions co-evolve with each other. Here, by analyzing hundreds of orthologous sets of three fungi and their co-evolutionary relations, we demonstrate that the co-evolutionary assumption may violate the Markov assumption. Our results encourage developing alternative probabilistic models for cases of extreme co-evolution.

    UNSUPERVISED LEARNING IN PHYLOGENOMIC ANALYSIS OVER THE SPACE OF PHYLOGENETIC TREES

    A phylogenetic tree is a tree that represents the evolutionary history of species or other entities. Phylogenomics is a new field at the intersection of phylogenetics and genomics, and it is well known that statistical learning methods are needed to handle and analyze the large amounts of data that new technologies can generate relatively cheaply. Building on existing Markov models, we introduce a new method, CURatio, to identify outliers in a given gene data set. This intrinsically unsupervised method can find outliers among thousands of genes or more. Its ability to analyze large numbers of genes (even with missing information) sets it apart from many parametric methods. At the same time, the statistical analysis of high-dimensional spaces of phylogenetic trees continues to be explored, and many tree metrics have been proposed for statistical methodology. The tropical metric is one of them. We implement an MCMC sampling method to estimate the principal components in a tree space with the tropical metric, achieving dimension reduction and visualizing the result in a 2-D tropical triangle.
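
As a minimal sketch of the tropical metric named in this abstract: if each tree is encoded as a vector of pairwise leaf-to-leaf distances (an assumption here; the abstract does not state the encoding), the tropical distance between two vectors u and v is commonly defined as max_i(u_i - v_i) - min_i(u_i - v_i). The function name `tropical_distance` is illustrative and is not taken from the thesis.

```python
def tropical_distance(u, v):
    """Tropical distance between two equal-length vectors.

    Assumes each tree has been flattened to a vector of pairwise
    leaf distances (hypothetical encoding for this illustration).
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    if not u:
        raise ValueError("vectors must be non-empty")
    # Coordinate-wise differences between the two vectors.
    diffs = [a - b for a, b in zip(u, v)]
    # The spread of the differences; a constant shift of u or v cancels out.
    return max(diffs) - min(diffs)


# Usage: shifting every coordinate by the same constant leaves the distance unchanged.
d1 = tropical_distance([0, 1, 2], [0, 0, 0])
d2 = tropical_distance([5, 6, 7], [0, 0, 0])
```

Because only the spread of the coordinate-wise differences matters, vectors that differ by a constant shift are at distance zero, which is why this distance is usually considered on vectors taken up to such shifts.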

    A new phylogenetic protocol: dealing with model misspecification and confirmation bias in molecular phylogenetics

    Molecular phylogenetics plays a key role in comparative genomics and has increasingly significant impacts on science, industry, government, public health and society. In this paper, we posit that the current phylogenetic protocol is missing two critical steps, and that their absence allows model misspecification and confirmation bias to unduly influence phylogenetic estimates. Based on the potential offered by well-established but under-used procedures, such as assessment of phylogenetic assumptions and tests of goodness of fit, we introduce a new phylogenetic protocol that will reduce confirmation bias and increase the accuracy of phylogenetic estimates.

    Systematic bias in phylogenetic inference: Implications, Assessment, and Reduction

    Molecular phylogenetic inference is the process of reconstructing relationships between individuals, species, or higher groups from genomic sequence data. The reliability of phylogenetic analysis relies on the fit between the substitution models used and the evolutionary processes that generated the data. In phylogenetic inference, we commonly use substitution models which assume that sequence evolution is stationary, reversible, and homogeneous (SRH). Many empirical and simulation studies have shown that assuming SRH conditions can lead to significant errors in phylogenetic inference when the data violate these assumptions. Yet, the extent of SRH violations and their effects on phylogenetic inference of tree topologies are not very well understood. In Chapter I, I introduced and applied the maximal matched-pairs tests of homogeneity (MaxSym tests) to assess the scale and impact of SRH model violations on 3,572 partitions from 35 published phylogenetic data sets. I showed that roughly one-quarter of all the partitions I analysed reject the SRH assumptions and that for more than one-quarter of data sets, tree topologies inferred from all partitions differ significantly from topologies inferred using the subset of partitions that do not reject the SRH assumptions. In Chapter II, I simulated datasets under various degrees of non-SRH conditions, using empirically derived parameters to mimic real data, and examined the effects of incorrectly assuming SRH conditions on inferring phylogenies. I showed that maximum likelihood inference is generally quite robust to a wide range of SRH model violations but is inaccurate under extreme convergent evolution. In addition, I tested the power of the MaxSym tests and other popular tests to detect model violations due to non-SRH evolution. I showed that the MaxSym tests performed well under the different simulation schemes, and that of all the tests I studied, the MaxSym tests performed best at identifying datasets that might mislead phylogenetic inference. In Chapter III, I investigated the ability of nonreversible models to estimate the root of a phylogeny. In addition, I introduced a new measure of support for the placement of the root in a phylogenetic tree, the rootstrap support. I tested the ability of nonreversible models to recover the root placement of five clades of mammals for which prior studies give very strong evidence of a particular root position. I showed that the nonreversible model correctly inferred the root of all five clades with very high rootstrap support. I then applied the same approaches to infer the roots of two clades of mammals for which previous studies have repeatedly disagreed on the root position. I showed that nonreversible models recovered roots similar to those of previous studies, but with lower rootstrap support than for the other five clades. In Chapter IV, I investigated the homogeneity assumption widely used in phylogenetic inference. To check for homogeneity in empirical datasets, I introduced a computationally feasible test for homogeneity across lineages based on the AIC score. Using empirical datasets from three different clades of life, I tested the homogeneity assumption by estimating amino-acid substitution matrices for monophyletic sub-clades within each dataset. I showed that forcing the models to be homogeneous always provided a worse fit to the data than allowing each sub-clade to have its own model. In addition, for every dataset, I found that a simpler model in which two or more clades share the same substitution matrix was always better than the fully non-homogeneous model in terms of AIC score. Together, these chapters show the impact of model violation due to non-SRH evolution on phylogenetic inference and suggest the need to test for model violation prior to phylogenetic inference, or to develop and apply more complex substitution models to relax some of the assumptions associated with the most widely used models in phylogenetics.
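
The AIC-based comparison described for Chapter IV can be illustrated with a minimal sketch. It uses the standard definition AIC = 2k - 2 ln L, where k is the number of free parameters and ln L is the maximized log-likelihood; lower is better. The model names, log-likelihoods, and parameter counts below are hypothetical placeholders chosen only for illustration and are not results from the thesis, and the helper names `aic` and `best_by_aic` are likewise assumptions.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def best_by_aic(models):
    """Return the name of the lowest-AIC model and all AIC scores.

    `models` maps a model name to (log_likelihood, n_params).
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical values only: one substitution matrix for the whole tree,
# a matrix shared by some sub-clades, and one matrix per sub-clade.
candidate_models = {
    "homogeneous": (-10900.0, 208),
    "shared": (-10350.0, 416),
    "fully_non_homogeneous": (-10340.0, 624),
}

best, scores = best_by_aic(candidate_models)
```

The point of the sketch is the trade-off AIC encodes: extra substitution matrices raise the likelihood but also add parameters, so an intermediate model can score better than either extreme.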