567 research outputs found

    Bayesian phylogenetic modelling of lateral gene transfers

    Get PDF
    PhD ThesisPhylogenetic trees represent the evolutionary relationships between a set of species. Inferring these trees from data is particularly challenging sometimes since the transfer of genetic material can occur not only from parents to their o spring but also between organisms via lateral gene transfers (LGTs). Thus, the presence of LGTs means that genes in a genome can each have di erent evolutionary histories, represented by di erent gene trees. A few statistical approaches have been introduced to explore non-vertical evolution through collections of Markov-dependent gene trees. In 2005 Suchard described a Bayesian hierarchical model for joint inference of gene trees and an underlying species tree, where a layer in the model linked gene trees to the species tree via a sequence of unknown lateral gene transfers. In his model LGT was modeled via a random walk in the tree space derived from the subtree prune and regraft (SPR) operator on unrooted trees. However, the use of SPR moves to represent LGT in an unrooted tree is problematic, since the transference of DNA between two organisms implies the contemporaneity of both organisms and therefore it can allow unrealistic LGTs. This thesis describes a related hierarchical Bayesian phylogenetic model for reconstructing phylogenetic trees which imposes a temporal constraint on LGTs, namely that they can only occur between species which exist concurrently. This is achieved by taking into account possible time orderings of divergence events in trees, without explicitly modelling divergence times. An extended version of the SPR operator is introduced as a more adequate mechanism to represent the LGT e ect in a tree. The extended SPR operation respects the time ordering. It additionaly di ers from regular SPR as it maintains a 1-to-1 correspondence between points on the species tree and points on each gene tree. Each point on a gene tree represents the existence of a population containing that gene at some point in time. Hierarchical phylogenetic models were used in the reconstruction of each gene tree from its corresponding gene alignment, enabling the pooling of information across genes. In addition to Suchard's approach, we assume variation in the rate of evolution between di erent sites. The species tree is assumed to be xed. A Markov Chain Monte Carlo (MCMC) algorithm was developed to t the model in a Bayesian framework. A novel MCMC proposal mechanism for jointly proposing the gene tree topology and branch lengths, LGT distance and LGT history has been developed as well as a novel graphical tool to represent LGT history, the LGT Biplot. Our model was applied to simulated and experimental datasets. More speci cally we analysed LGT/reassortment presence in the evolution of 2009 Swine-Origin In uenza Type A virus. Future improvements of our model and algorithm should include joint inference of the species tree, improving the computational e ciency of the MCMC algorithm and better consideration of other factors that can cause discordance of gene trees and species trees such as gene loss

    Inferring phylogenetic trees under the general Markov model via a minimum spanning tree backbone

    Get PDF
    Phylogenetic trees are models of the evolutionary relationships among species, with species typically placed at the leaves of trees. We address the following problems regarding the calculation of phylogenetic trees. (1) Leaf-labeled phylogenetic trees may not be appropriate models of evolutionary relationships among rapidly evolving pathogens which may contain ancestor-descendant pairs. (2) The models of gene evolution that are widely used unrealistically assume that the base composition of DNA sequences does not evolve. Regarding problem (1) we present a method for inferring generally labeled phylogenetic trees that allow sampled species to be placed at non-leaf nodes of the tree. Regarding problem (2), we present a structural expectation maximization method (SEM-GM) for inferring leaf-labeled phylogenetic trees under the general Markov model (GM) which is the most complex model of DNA substitution that allows the evolution of base composition. In order to improve the scalability of SEM-GM we present a minimum spanning tree (MST) framework called MST-backbone. MST-backbone scales linearly with the number of leaves. However, the unrealistic location of the root as inferred on empirical data suggests that the GM model may be overtrained. MST-backbone was inspired by the topological relationship between MSTs and phylogenetic trees that was introduced by Choi et al. (2011). We discovered that the topological relationship does not necessarily hold if there is no unique MST. We propose so-called vertex-order based MSTs (VMSTs) that guarantee a topological relationship with phylogenetic trees.Phylogenetische BĂ€ume modellieren evolutionĂ€re Beziehungen zwischen Spezies, wobei die Spezies typischerweise an den BlĂ€ttern der BĂ€ume sitzen. Wir befassen uns mit den folgenden Problemen bei der Berechnung von phylogenetischen BĂ€umen. (1) Blattmarkierte phylogenetische BĂ€ume sind möglicherweise keine geeigneten Modelle der evolutionĂ€ren Beziehungen zwischen sich schnell entwickelnden Krankheitserregern, die Vorfahren-Nachfahren-Paare enthalten können. (2) Die weit verbreiteten Modelle der Genevolution gehen unrealistischerweise davon aus, dass sich die Basenzusammensetzung von DNA-Sequenzen nicht Ă€ndert. BezĂŒglich Problem (1) stellen wir eine Methode zur Ableitung von allgemein markierten phylogenetischen BĂ€umen vor, die es erlaubt, Spezies, fĂŒr die Proben vorliegen, an inneren des Baumes zu platzieren. BezĂŒglich Problem (2) stellen wir eine strukturelle Expectation-Maximization-Methode (SEM-GM) zur Ableitung von blattmarkierten phylogenetischen BĂ€umen unter dem allgemeinen Markov-Modell (GM) vor, das das komplexeste Modell von DNA-Substitution ist und das die Evolution von Basenzusammensetzung erlaubt. Um die Skalierbarkeit von SEM-GM zu verbessern, stellen wir ein Minimale Spannbaum (MST)-Methode vor, die als MST-Backbone bezeichnet wird. MST-Backbone skaliert linear mit der Anzahl der BlĂ€tter. Die Tatsache, dass die Lage der Wurzel aus empirischen Daten nicht immer realistisch abgeleitet warden kann, legt jedoch nahe, dass das GM-Modell möglicherweise ĂŒbertrainiert ist. MST-backbone wurde von einer topologischen Beziehung zwischen minimalen SpannbĂ€umen und phylogenetischen BĂ€umen inspiriert, die von Choi et al. 2011 eingefĂŒhrt wurde. Wir entdeckten, dass die topologische Beziehung nicht unbedingt Bestand hat, wenn es keinen eindeutigen minimalen Spannbaum gibt. Wir schlagen so genannte vertex-order-based MSTs (VMSTs) vor, die eine topologische Beziehung zu phylogenetischen BĂ€umen garantieren

    The EM Algorithm and the Rise of Computational Biology

    Get PDF
    In the past decade computational biology has grown from a cottage industry with a handful of researchers to an attractive interdisciplinary field, catching the attention and imagination of many quantitatively-minded scientists. Of interest to us is the key role played by the EM algorithm during this transformation. We survey the use of the EM algorithm in a few important computational biology problems surrounding the "central dogma"; of molecular biology: from DNA to RNA and then to proteins. Topics of this article include sequence motif discovery, protein sequence alignment, population genetics, evolutionary models and mRNA expression microarray data analysis.Comment: Published in at http://dx.doi.org/10.1214/09-STS312 the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Probabilistic Graphical Model Representation in Phylogenetics

    Get PDF
    Recent years have seen a rapid expansion of the model space explored in statistical phylogenetics, emphasizing the need for new approaches to statistical model representation and software development. Clear communication and representation of the chosen model is crucial for: (1) reproducibility of an analysis, (2) model development and (3) software design. Moreover, a unified, clear and understandable framework for model representation lowers the barrier for beginners and non-specialists to grasp complex phylogenetic models, including their assumptions and parameter/variable dependencies. Graphical modeling is a unifying framework that has gained in popularity in the statistical literature in recent years. The core idea is to break complex models into conditionally independent distributions. The strength lies in the comprehensibility, flexibility, and adaptability of this formalism, and the large body of computational work based on it. Graphical models are well-suited to teach statistical models, to facilitate communication among phylogeneticists and in the development of generic software for simulation and statistical inference. Here, we provide an introduction to graphical models for phylogeneticists and extend the standard graphical model representation to the realm of phylogenetics. We introduce a new graphical model component, tree plates, to capture the changing structure of the subgraph corresponding to a phylogenetic tree. We describe a range of phylogenetic models using the graphical model framework and introduce modules to simplify the representation of standard components in large and complex models. Phylogenetic model graphs can be readily used in simulation, maximum likelihood inference, and Bayesian inference using, for example, Metropolis-Hastings or Gibbs sampling of the posterior distribution

    MODELING HETEROTACHY IN PHYLOGENETICS

    Get PDF
    Il a Ă©tĂ© dĂ©montrĂ© que l’hĂ©tĂ©rotachie, variation du taux de substitutions au cours du temps et entre les sites, est un phĂ©nomĂšne frĂ©quent au sein de donnĂ©es rĂ©elles. Échouer Ă  modĂ©liser l’hĂ©tĂ©rotachie peut potentiellement causer des artĂ©facts phylogĂ©nĂ©tiques. Actuellement, plusieurs modĂšles traitent l’hĂ©tĂ©rotachie : le modĂšle Ă  mĂ©lange des longueurs de branche (MLB) ainsi que diverses formes du modĂšle covarion. Dans ce projet, notre but est de trouver un modĂšle qui prenne efficacement en compte les signaux hĂ©tĂ©rotaches prĂ©sents dans les donnĂ©es, et ainsi amĂ©liorer l’infĂ©rence phylogĂ©nĂ©tique. Pour parvenir Ă  nos fins, deux Ă©tudes ont Ă©tĂ© rĂ©alisĂ©es. Dans la premiĂšre, nous comparons le modĂšle MLB avec le modĂšle covarion et le modĂšle homogĂšne grĂące aux test AIC et BIC, ainsi que par validation croisĂ©e. A partir de nos rĂ©sultats, nous pouvons conclure que le modĂšle MLB n’est pas nĂ©cessaire pour les sites dont les longueurs de branche diffĂšrent sur l’ensemble de l’arbre, car, dans les donnĂ©es rĂ©elles, le signaux hĂ©tĂ©rotaches qui interfĂšrent avec l’infĂ©rence phylogĂ©nĂ©tique sont gĂ©nĂ©ralement concentrĂ©s dans une zone limitĂ©e de l’arbre. Dans la seconde Ă©tude, nous relaxons l’hypothĂšse que le modĂšle covarion est homogĂšne entre les sites, et dĂ©veloppons un modĂšle Ă  mĂ©langes basĂ© sur un processus de Dirichlet. Afin d’évaluer diffĂ©rents modĂšles hĂ©tĂ©rogĂšnes, nous dĂ©finissons plusieurs tests de non-conformitĂ© par Ă©chantillonnage postĂ©rieur prĂ©dictif pour Ă©tudier divers aspects de l’évolution molĂ©culaire Ă  partir de cartographies stochastiques. Ces tests montrent que le modĂšle Ă  mĂ©langes covarion utilisĂ© avec une loi gamma est capable de reflĂ©ter adĂ©quatement les variations de substitutions tant Ă  l’intĂ©rieur d’un site qu’entre les sites. Notre recherche permet de dĂ©crire de façon dĂ©taillĂ©e l’hĂ©tĂ©rotachie dans des donnĂ©es rĂ©elles et donne des pistes Ă  suivre pour de futurs modĂšles hĂ©tĂ©rotaches. Les tests de non conformitĂ© par Ă©chantillonnage postĂ©rieur prĂ©dictif fournissent des outils de diagnostic pour Ă©valuer les modĂšles en dĂ©tails. De plus, nos deux Ă©tudes rĂ©vĂšlent la non spĂ©cificitĂ© des modĂšles hĂ©tĂ©rogĂšnes et, en consĂ©quence, la prĂ©sence d’interactions entre diffĂ©rents modĂšles hĂ©tĂ©rogĂšnes. Nos Ă©tudes suggĂšrent fortement que les donnĂ©es contiennent diffĂ©rents caractĂšres hĂ©tĂ©rogĂšnes qui devraient ĂȘtre pris en compte simultanĂ©ment dans les analyses phylogĂ©nĂ©tiques.Heterotachy, substitution rate variation across sites and time, has shown to be a frequent phenomenon in the real data. Failure to model heterotachy could potentially cause phylogenetic artefacts. Currently, there are several models to handle heterotachy, the mixture branch length model (MBL) and several variant forms of the covarion model. In this project, our objective is to find a model that efficiently handles heterotachous signals in the data, and thereby improves phylogenetic inference. In order to achieve our goal, two individual studies were conducted. In the first study, we make comparisons among the MBL, covarion and homotachous models using AIC, BIC and cross validation. Based on our results, we conclude that the MBL model, in which sites have different branch lengths along the entire tree, is an over-parameterized model. Real data indicate that the heterotachous signals which interfere with phylogenetic inference are generally limited to a small area of the tree. In the second study, we relax the assumption of the homogeneity of the covarion parameters over sites, and develop a mixture covarion model using a Dirichlet process. In order to evaluate different heterogeneous models, we design several posterior predictive discrepancy tests to study different aspects of molecular evolution using stochastic mappings. The posterior predictive discrepancy tests demonstrate that the covarion mixture +Γ model is able to adequately model the substitution variation within and among sites. Our research permits a detailed view of heterotachy in real datasets and gives directions for future heterotachous models. The posterior predictive discrepancy tests provide diagnostic tools to assess models in detail. Furthermore, both of our studies reveal the non-specificity of heterogeneous models. Our studies strongly suggest that different heterogeneous features in the data should be handled simultaneously
    • 

    corecore