Search CORE

2,166 research outputs found

Phylogenetic inference under recombination using Bayesian stochastic topology selection

Author: Alex Webb
Boys
Chris C. Holmes
de Oliveira Martins
Felsenstein
Felsenstein
Felsenstein
Frazer
George
Grassly
Green
Hasegawa
Hobolth
Husmeier
Husmeier
Husmeier
Husmeier
John M. Hancock
Lehrach
Liitsola
McGuire
McGuire
Minin
Page
Pattengale
Rabiner
Rambaut
Schierup
Suchard
Yang
Publication venue: Oxford University Press
Publication date: 01/01/2009
Field of study

Motivation: Conventional phylogenetic analysis for characterizing the relatedness between taxa typically assumes that a single relationship exists between species at every site along the genome. This assumption fails to take into account recombination which is a fundamental process for generating diversity and can lead to spurious results. Recombination induces a localized phylogenetic structure which may vary along the genome. Here, we generalize a hidden Markov model (HMM) to infer changes in phylogeny along multiple sequence alignments while accounting for rate heterogeneity; the hidden states refer to the unobserved phylogenic topology underlying the relatedness at a genomic location. The dimensionality of the number of hidden states (topologies) and their structure are random (not known a priori) and are sampled using Markov chain Monte Carlo algorithms. The HMM structure allows us to analytically integrate out over all possible changepoints in topologies as well as all the unknown branch lengths

Crossref

PubMed Central

Oxford University Research Archive

The inference of gene trees with species trees

Author: Bastien Boussau
Eric Tannier
Gergely J. Szöllősi
Montbonnot France
Vincent Daubin
Publication venue
Publication date: 04/11/2013
Field of study

Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet, most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes duplicate, are lost or horizontally transferred, and because alleles can co-exist in populations for periods that may span several speciation events. Building models describing the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice-versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees. In this article we review the various models that have been used to describe the relationship between gene trees and species trees. These models account for gene duplication and loss, transfer or incomplete lineage sorting. Some of them consider several types of events together, but none exists currently that considers the full repertoire of processes that generate gene trees along the species tree. Simulations as well as empirical studies on genomic data show that combining gene tree-species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a better basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. We predict that gene tree-species tree methods that can deal with genomic data sets will be instrumental to advancing our understanding of genomic evolution.Comment: Review article in relation to the "Mathematical and Computational Evolutionary Biology" conference, Montpellier, 201

arXiv.org e-Print Archive

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

PubMed Central

HAL

Repository of the Academy's Library

ELTE Digital Institutional Repository (EDIT)

Hal-Diderot

Statistical analysis on detecting recombination sites in DNA-beta satellites associated with the old world geminiviruses

Author: Xu Kai
Yoshida Ruriko
Publication venue
Publication date: 01/01/2010
Field of study

Although an exchange of genetic information by recombination plays an important role in the evolution of viruses, it is not clear how it generates diversity. {\it Geminiviruses} are plant viruses which have ambisense single-stranded circular DNA genomes and one of the most economically important plant viruses in agricultural production. Small circular single-stranded DNA satellites, termed DNA-

\beta

, have recently been found associated with some geminivirus infections. In this paper we analyze a satellite molecule DNA-

\beta

of geminiviruses for recombination events using phylogenetic and statistical analysis and we find that one strain from ToLCMaB has a recombination pattern and is possibly recombinant molecule between two strains from two species, PaLCuB-[IN:Chi:05] (major parent) and ToLCB-[IN:CP:04] (minor parent).Comment: 8 figures and 2 tables. To appear in Frontiers in Systems Biolog

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

PubMed Central

Frontiers - Publisher Connector

Evaluation of phylogenetic reconstruction methods using bacterial whole genomes: a simulation based study

Author: Bentley SD
Colijn C
Harris SR
Kendall M
Lees JA
Parkhill J
Publication venue: 'F1000 Research Ltd'
Publication date: 01/01/2018
Field of study

Background: Phylogenetic reconstruction is a necessary first step in many analyses which use whole genome sequence data from bacterial populations. There are many available methods to infer phylogenies, and these have various advantages and disadvantages, but few unbiased comparisons of the range of approaches have been made. Methods: We simulated data from a defined "true tree" using a realistic evolutionary model. We built phylogenies from this data using a range of methods, and compared reconstructed trees to the true tree using two measures, noting the computational time needed for different phylogenetic reconstructions. We also used real data from Streptococcus pneumoniae alignments to compare individual core gene trees to a core genome tree. Results: We found that, as expected, maximum likelihood trees from good quality alignments were the most accurate, but also the most computationally intensive. Using less accurate phylogenetic reconstruction methods, we were able to obtain results of comparable accuracy; we found that approximate results can rapidly be obtained using genetic distance based methods. In real data we found that highly conserved core genes, such as those involved in translation, gave an inaccurate tree topology, whereas genes involved in recombination events gave inaccurate branch lengths. We also show a tree-of-trees, relating the results of different phylogenetic reconstructions to each other. Conclusions: We recommend three approaches, depending on requirements for accuracy and computational time. Quicker approaches that do not perform full maximum likelihood optimisation may be useful for many analyses requiring a phylogeny, as generating a high quality input alignment is likely to be the major limiting factor of accurate tree topology. We have publicly released our simulated data and code to enable further comparisons

Oxford University Research Archive

Spiral - Imperial College Digital Repository

Distinguishing regional from within-codon rate heterogeneity in DNA sequence alignments

Author: A. Gelman
A. Webb
C. Tuffley
D. Husmeier
D. Husmeier
G. Casella
J. Felsenstein
J. Felsenstein
J. Felsenstein
M. Hasegawa
M.A. Suchard
R.J. Boys
V.N. Minin
W.K. Hastings
W.P. Lehrach
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009
Field of study

We present an improved phylogenetic factorial hidden Markov model (FHMM) for detecting two types of mosaic structures in DNA sequence alignments, related to (1) recombination and (2) rate heterogeneity. The focus of the present work is on improving the modelling of the latter aspect. Earlier papers have modelled different degrees of rate heterogeneity with separate hidden states of the FHMM. This approach fails to appreciate the intrinsic difference between two types of rate heterogeneity: long-range regional effects, which are potentially related to differences in the selective pressure, and the short-term periodic patterns within the codons, which merely capture the signature of the genetic code. We propose an improved model that explicitly distinguishes between these two effects, and we assess its performance on a set of simulated DNA sequence alignments

Crossref

University of Strathclyde Institutional Repository

Enlighten

MetaPIGA v2.0: maximum likelihood large phylogeny estimation using the metapopulation genetic algorithm and other stochastic heuristics

Author: A Stamatakis
A Stamatakis
AR Lemmon
B Chor
BS Chang
BS Chang
D Posada
DJ Zwickl
DJ Zwickl
DL Swofford
DR Maddison
F Bossuyt
F Ronquist
H Matsuda
I Cassens
J Felsenstein
J Felsenstein
J Felsenstein
J Holland
J Zhang
JL Thorne
JL Thorne
JP Huelsenbeck
K Katoh
LA Salter
M Blanchette
M Holder
M Lundy
M Meegaskumbura
MA Suchard
Michel C Milinkovitch
MS Springer
N Saitou
PD Williams
PO Lewis
Raphaël Helaers
RL Graham
S Guindon
S Guindon
S Kirkpatrick
S Tavaré
T Gabaldon
W-H Li
X Gu
Z Yang
Z Yang
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Abstract Background The development, in the last decade, of stochastic heuristics implemented in robust application softwares has made large phylogeny inference a key step in most comparative studies involving molecular sequences. Still, the choice of a phylogeny inference software is often dictated by a combination of parameters not related to the raw performance of the implemented algorithm(s) but rather by practical issues such as ergonomics and/or the availability of specific functionalities. Results Here, we present MetaPIGA v2.0, a robust implementation of several stochastic heuristics for large phylogeny inference (under maximum likelihood), including a Simulated Annealing algorithm, a classical Genetic Algorithm, and the Metapopulation Genetic Algorithm (metaGA) together with complex substitution models, discrete Gamma rate heterogeneity, and the possibility to partition data. MetaPIGA v2.0 also implements the Likelihood Ratio Test, the Akaike Information Criterion, and the Bayesian Information Criterion for automated selection of substitution models that best fit the data. Heuristics and substitution models are highly customizable through manual batch files and command line processing. However, MetaPIGA v2.0 also offers an extensive graphical user interface for parameters setting, generating and running batch files, following run progress, and manipulating result trees. MetaPIGA v2.0 uses standard formats for data sets and trees, is platform independent, runs in 32 and 64-bits systems, and takes advantage of multiprocessor and multicore computers. Conclusions The metaGA resolves the major problem inherent to classical Genetic Algorithms by maintaining high inter-population variation even under strong intra-population selection. Implementation of the metaGA together with additional stochastic heuristics into a single software will allow rigorous optimization of each heuristic as well as a meaningful comparison of performances among these algorithms. MetaPIGA v2.0 gives access both to high customization for the phylogeneticist, as well as to an ergonomic interface and functionalities assisting the non-specialist for sound inference of large phylogenetic trees using nucleotide sequences. MetaPIGA v2.0 and its extensive user-manual are freely available to academics at <url>http://www.metapiga.org</url>.</p

Crossref

Springer

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

DIAL UCLouvain

Improved Bayesian methods for detecting recombination and rate heterogeneity in DNA sequence alignments

Author: Husmeir Dirk
Mantzaris Alexander Vassilios
Publication venue: University of Edinburgh
Publication date: 01/01/2011
Field of study

DNA sequence alignments are usually not homogeneous. Mosaic structures may result as a consequence of recombination or rate heterogeneity. Interspecific recombination, in which DNA subsequences are transferred between different (typically viral or bacterial) strains may result in a change of the topology of the underlying phylogenetic tree. Rate heterogeneity corresponds to a change of the nucleotide substitution rate. Various methods for simultaneously detecting recombination and rate heterogeneity in DNA sequence alignments have recently been proposed, based on complex probabilistic models that combine phylogenetic trees with factorial hidden Markov models or multiple changepoint processes. The objective of my thesis is to identify potential shortcomings of these models and explore ways of how to improve them. One shortcoming that I have identified is related to an approximation made in various recently proposed Bayesian models. The Bayesian paradigm requires the solution of an integral over the space of parameters. To render this integration analytically tractable, these models assume that the vectors of branch lengths of the phylogenetic tree are independent among sites. While this approximation reduces the computational complexity considerably, I show that it leads to the systematic prediction of spurious topology changes in the Felsenstein zone, that is, the area in the branch lengths configuration space where maximum parsimony consistently infers the wrong topology due to long-branch attraction. I demonstrate these failures by using two Bayesian hypothesis tests, based on an inter- and an intra-model approach to estimating the marginal likelihood. I then propose a revised model that addresses these shortcomings, and demonstrate its improved performance on a set of synthetic DNA sequence alignments systematically generated around the Felsenstein zone. The core model explored in my thesis is a phylogenetic factorial hidden Markov model (FHMM) for detecting two types of mosaic structures in DNA sequence alignments, related to recombination and rate heterogeneity. The focus of my work is on improving the modelling of the latter aspect. Earlier research efforts by other authors have modelled different degrees of rate heterogeneity with separate hidden states of the FHMM. Their work fails to appreciate the intrinsic difference between two types of rate heterogeneity: long-range regional effects, which are potentially related to differences in the selective pressure, and the short-term periodic patterns within the codons, which merely capture the signature of the genetic code. I have improved these earlier phylogenetic FHMMs in two respects. Firstly, by sampling the rate vector from the posterior distribution with RJMCMC I have made the modelling of regional rate heterogeneity more flexible, and I infer the number of different degrees of divergence directly from the DNA sequence alignment, thereby dispensing with the need to arbitrarily select this quantity in advance. Secondly, I explicitly model within-codon rate heterogeneity via a separate rate modification vector. In this way, the within-codon effect of rate heterogeneity is imposed on the model a priori, which facilitates the learning of the biologically more interesting effect of regional rate heterogeneity a posteriori. I have carried out simulations on synthetic DNA sequence alignments, which have borne out my conjecture. The existing model, which does not explicitly include the within-codon rate variation, has to model both effects with the same modelling mechanism. As expected, it was found to fail to disentangle these two effects. On the contrary, I have found that my new model clearly separates within-codon rate variation from regional rate heterogeneity, resulting in more accurate predictions

University of Strathclyde Institutional Repository

OpenGrey Repository

Recommended from our members

Phylogenetic patterns recover known HIV epidemiological relationships and reveal common transmission of multiple variants.

Author: Leitner Thomas
Romero-Severson Ethan
Publication venue: eScholarship, University of California
Publication date: 01/09/2018
Field of study

The growth of human immunodeficiency virus (HIV) sequence databases resulting from drug resistance testing has motivated efforts using phylogenetic methods to assess how HIV spreads1-4. Such inference is potentially both powerful and useful for tracking the epidemiology of HIV and the allocation of resources to prevention campaigns. We recently used simulation and a small number of illustrative cases to show that certain phylogenetic patterns are associated with different types of epidemiological linkage5. Our original approach was later generalized for large next-generation sequencing datasets and implemented as a free computational pipeline6. Previous work has claimed that direction and directness of transmission could not be established from phylogeny because one could not be sure that there were no intervening or missing links involved7-9. Here, we address this issue by investigating phylogenetic patterns from 272 previously identified HIV transmission chains with 955 transmission pairs representing diverse geography, risk groups, subtypes, and genomic regions. These HIV transmissions had known linkage based on epidemiological information such as partner studies, mother-to-child transmission, pairs identified by contact tracing, and criminal cases. We show that the resulting phylogeny inferred from real HIV genetic sequences indeed reveals distinct patterns associated with direct transmission contra transmissions from a common source. Thus, our results establish how to interpret phylogenetic trees based on HIV sequences when tracking who-infected-whom, when and how genetic information can be used for improved tracking of HIV spread. We also investigate limitations that stem from limited sampling and genetic time-trends in the donor and recipient HIV populations

eScholarship - University of California

A rapid and scalable method for multilocus species delimitation using Bayesian model comparison and rooted triplets

Author: Aswad A
Barraclough TG
Fujisawa T
Publication venue: 'Oxford University Press (OUP)'
Publication date: 21/03/2016
Field of study

Multilocus sequence data provide far greater power to resolve species limits than the single locus data typically used for broad surveys of clades. However, current statistical methods based on a multispecies coalescent framework are computationally demanding, because of the number of possible delimitations that must be compared and time-consuming likelihood calculations. New methods are therefore needed to open up the power of multilocus approaches to larger systematic surveys. Here, we present a rapid and scalable method that introduces two new innovations. First, the method reduces the complexity of likelihood calculations by decomposing the tree into rooted triplets. The distribution of topologies for a triplet across multiple loci has a uniform trinomial distribution when the 3 individuals belong to the same species, but a skewed distribution if they belong to separate species with a form that is specified by the multispecies coalescent. A Bayesian model comparison framework was developed and the best delimitation found by comparing the product of posterior probabilities of all triplets. The second innovation is a new dynamic programming algorithm for finding the optimum delimitation from all those compatible with a guide tree by successively analyzing subtrees defined by each node. This algorithm removes the need for heuristic searches used by current methods, and guarantees that the best solution is found and potentially could be used in other systematic applications. We assessed the performance of the method with simulated, published and newly generated data. Analyses of simulated data demonstrate that the combined method has favourable statistical properties and scalability with increasing sample sizes. Analyses of empirical data from both eukaryotes and prokaryotes demonstrate its potential for delimiting species in real cases

Spiral - Imperial College Digital Repository

The Probability of a Gene Tree Topology within a Phylogenetic Network with Applications to Hybridization Detection

Author: A Rokas
AD Leaché
B Carstens
B Holland
B Rannala
C Ané
C Ané
C Ané
C Meng
C Than
C Than
C Than
CR Linder
CV Than
D Huson
D Posada
D Ruths
D Swofford
DA Pollard
DL Swofford
ES Allman
ES Allman
EW Bloomquist
G Schwarz
H Akaike
H Huang
H Lanier
J Heled
J Mallet
J Mallet
J Wakeley
James H. Degnan
JH Degnan
JH Degnan
JH Degnan
JJ Doyle
Joseph Felsenstein
JP Huelsenbeck
K Burnham
L Liu
L Liu
L Nakhleh
LL Knowles
LS Kubatko
LS Kubatko
LS Kubatko
Luay Nakhleh
M DeGiorgio
M Nei
M Slatkin
ML Arnold
NA Rosenberg
SM Ross
SV Edwards
SV Edwards
TC Bruen
W Maddison
Y Wang
Y Wu
Y Yu
Yun Yu
Publication venue: Public Library of Science
Publication date: 01/01/2012
Field of study

Gene tree topologies have proven a powerful data source for various tasks, including species tree inference and species delimitation. Consequently, methods for computing probabilities of gene trees within species trees have been developed and widely used in probabilistic inference frameworks. All these methods assume an underlying multispecies coalescent model. However, when reticulate evolutionary events such as hybridization occur, these methods are inadequate, as they do not account for such events. Methods that account for both hybridization and deep coalescence in computing the probability of a gene tree topology currently exist for very limited cases. However, no such methods exist for general cases, owing primarily to the fact that it is currently unknown how to compute the probability of a gene tree topology within the branches of a phylogenetic network. Here we present a novel method for computing the probability of gene tree topologies on phylogenetic networks and demonstrate its application to the inference of hybridization in the presence of incomplete lineage sorting. We reanalyze a Saccharomyces species data set for which multiple analyses had converged on a species tree candidate. Using our method, though, we show that an evolutionary hypothesis involving hybridization in this group has better support than one of strict divergence. A similar reanalysis on a group of three Drosophila species shows that the data is consistent with hybridization. Further, using extensive simulation studies, we demonstrate the power of gene tree topologies at obtaining accurate estimates of branch lengths and hybridization probabilities of a given phylogenetic network. Finally, we discuss identifiability issues with detecting hybridization, particularly in cases that involve extinction or incomplete sampling of taxa

Public Library of Science (PLOS)

Crossref

UC Research Repository

Directory of Open Access Journals

PubMed Central

DSpace at Rice University

FigShare