
    BBCA: Improving the Scalability of *BEAST Using Random Binning

    Background: Species tree estimation can be challenging in the presence of gene tree conflict due to incomplete lineage sorting (ILS), which can occur when the time between speciation events is short relative to the population size. Of the many methods that have been developed to estimate species trees in the presence of ILS, *BEAST, a Bayesian method that co-estimates the species tree and gene trees given sequence alignments on multiple loci, has generally been shown to have the best accuracy. However, *BEAST is so computationally intensive that it cannot be used with large numbers of loci; hence, *BEAST is not suitable for genome-scale analyses. Results: We present BBCA (boosted binned coalescent-based analysis), a method that can be used with *BEAST (and other such co-estimation methods) to improve scalability. BBCA partitions the loci randomly into subsets, uses *BEAST on each subset to co-estimate the gene trees and species tree for the subset, and then combines the newly estimated gene trees using MP-EST, a popular coalescent-based summary method. We compare time-restricted versions of BBCA and *BEAST on simulated datasets, and show that BBCA is at least as accurate as *BEAST, and achieves better convergence rates for large numbers of loci. Conclusions: Phylogenomic analysis using *BEAST is currently limited to datasets with a small number of loci, and analyses with even just 100 loci can be computationally challenging. BBCA uses a very simple divide-and-conquer approach that makes it possible to use *BEAST on datasets containing hundreds of loci. This study shows that BBCA provides excellent accuracy and is highly scalable. Funding: Grant Agency of the Czech Republic P501-10-0208; Academy of Sciences of the Czech Republic AVOZ50040507, AVOZ50040702, MSMT LC0604; Ministry of Innovation and Science of Spain, MICINN CGL2007-64839-C02/BOS; CSIC (Superior Council of Scientific Investigations); José Castillejo Grant from the MICINN of the Spanish Government. Subject: Computer Science
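The random partitioning step described in the BBCA abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `random_bins`, the bin size of 25, and the fixed seed are all assumptions made here, and the per-bin *BEAST runs and the final MP-EST step are external programs that are not shown.

```python
import random


def random_bins(loci, bin_size, seed=0):
    """Shuffle the loci and split them into bins of at most bin_size loci.

    Hypothetical helper: the abstract only says the loci are partitioned
    randomly into subsets; bin_size and seed are illustrative parameters.
    """
    rng = random.Random(seed)      # local generator, so results are repeatable
    shuffled = list(loci)          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    return [shuffled[i:i + bin_size] for i in range(0, len(shuffled), bin_size)]


# Example: 100 loci split into 4 random bins of 25 loci each.
loci = [f"locus_{i:03d}" for i in range(100)]
bins = random_bins(loci, bin_size=25)
```

Each bin would then be analyzed separately with the co-estimation method, and the gene trees from all bins pooled as input to the summary method.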

    Disk Covering Methods Improve Phylogenomic Analyses

    Motivation: With the rapid growth rate of newly sequenced genomes, species tree inference from multiple genes has become a basic bioinformatics task in comparative and evolutionary biology. However, accurate species tree estimation is difficult in the presence of gene tree discordance, which is often due to incomplete lineage sorting (ILS), modelled by the multi-species coalescent. Several highly accurate coalescent-based species tree estimation methods have been developed over the last decade, including MP-EST. However, the running time for MP-EST increases rapidly as the number of species grows. Results: We present divide-and-conquer techniques that improve the scalability of MP-EST so that it can run efficiently on large datasets. Surprisingly, these techniques also improve the accuracy of species trees estimated by MP-EST, as our study shows on a collection of simulated and biological datasets. Funding: NSF DEB 0733029, DBI 1062335. Subject: Computer Science

    Maximize Resolution or Minimize Error? Using Genotyping-By-Sequencing to Investigate the Recent Diversification of Helianthemum (Cistaceae)

    A robust phylogenetic framework, in terms of extensive geographical and taxonomic sampling, well-resolved species relationships and high certainty of tree topologies and branch length estimations, is critical in the study of macroevolutionary patterns. Whereas Sanger sequencing-based methods usually recover insufficient phylogenetic signal, especially in recently diversified lineages, reduced-representation sequencing methods tend to provide well-supported phylogenetic relationships, but usually entail remarkable bioinformatic challenges due to the inherent trade-off between the number of SNPs and the magnitude of associated error rates. The genus Helianthemum (Cistaceae) is a species-rich and taxonomically complex Palearctic group of plants that diversified mainly since the Upper Miocene. It is a challenging case study since previous attempts using Sanger sequencing were unable to resolve the intrageneric phylogenetic relationships. Aiming to obtain a robust phylogenetic reconstruction based on genotyping-by-sequencing (GBS), we established a rigorous methodological workflow in which we i) explored how variable settings during dataset assembly have an impact on error rates and on the degree of resolution under concatenation and coalescent approaches, ii) assessed the effect of two extreme parameter configurations (minimizing error rates vs. maximizing phylogenetic resolution) on tree topology and branch lengths, and iii) evaluated the effects of these two configurations on estimates of divergence times and diversification rates. Our analyses produced highly supported topologically congruent phylogenetic trees for both configurations. However, minimizing error rates did produce more reliable branch lengths, critically affecting the accuracy of downstream analyses (i.e. divergence times and diversification rates). 
    In addition to recommending a revision of intrageneric systematics, our results enabled us to identify three highly diversified lineages in Helianthemum in contrasting geographical areas and ecological conditions, which started radiating in the Upper Miocene. Funding: Spain, MINECO grants CGL2014-52459-P and CGL2017-82465-P; Spain, Ministerio de Economía, Industria y Competitividad, reference IJCI-2015-2345

    Weighted Statistical Binning: enabling statistically consistent genome-scale phylogenetic analyses

    Because biological processes can make different loci have different evolutionary histories, species tree estimation requires multiple loci from across the genome. While many processes can result in discord between gene trees and species trees, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is considered to be a dominant cause of gene tree heterogeneity. Coalescent-based methods have been developed to estimate species trees, many of which operate by combining estimated gene trees, and so are called summary methods. Because summary methods are generally fast, they have become very popular techniques for estimating species trees from multiple loci. However, recent studies have established that summary methods can have reduced accuracy in the presence of gene tree estimation error, and also that many biological datasets have substantial gene tree estimation error, so that summary methods may not be highly accurate under biologically realistic conditions. Mirarab et al. (Science 2014) presented the statistical binning technique to improve gene tree estimation in multi-locus analyses, and showed that it improved the accuracy of MP-EST, one of the most popular coalescent-based summary methods. Statistical binning, which uses a simple statistical test for combinability and then uses the larger sets of genes to re-calculate gene trees, has good empirical performance, but using statistical binning within a phylogenomics pipeline does not have the desirable property of being statistically consistent. We show that weighting the recalculated gene trees by the bin sizes makes statistical binning statistically consistent under the multispecies coalescent, and maintains the good empirical performance. Thus, "weighted statistical binning" enables highly accurate genome-scale species tree estimation, and is also statistically consistent under the multi-species coalescent model. Comment: (1) In Press, PLoS ON
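One simple way to picture "weighting the recalculated gene trees by the bin sizes" is to repeat each bin's tree once per gene in that bin before handing the collection to a summary method. The sketch below assumes that reading; the function name `weight_by_bin_size`, the gene labels, and the Newick strings are hypothetical, and the abstract does not specify how the weighting is implemented.

```python
def weight_by_bin_size(bins, bin_trees):
    """Repeat each bin's recalculated tree once per gene in the bin.

    Assumption for illustration: weighting is realized by replication,
    so a bin with k genes contributes k copies of its tree.
    """
    if len(bins) != len(bin_trees):
        raise ValueError("need exactly one tree per bin")
    weighted = []
    for members, tree in zip(bins, bin_trees):
        weighted.extend([tree] * len(members))
    return weighted


# Example: a bin of three genes and a bin of one gene.
example_bins = [["g1", "g2", "g3"], ["g4"]]
example_trees = ["((A,B),C);", "((A,C),B);"]
weighted_trees = weight_by_bin_size(example_bins, example_trees)
```

Under this reading, the summary method sees as many input trees as there are genes, so large bins are not down-weighted relative to an unbinned analysis.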