39 research outputs found

    Quartet Sampling distinguishes lack of support from conflicting support in the green plant tree of life

    Peer Reviewed
    https://deepblue.lib.umich.edu/bitstream/2027.42/143738/1/ajb21016-sup-0009-AppendixS9.pdf
    https://deepblue.lib.umich.edu/bitstream/2027.42/143738/2/ajb21016.pdf
    https://deepblue.lib.umich.edu/bitstream/2027.42/143738/3/ajb21016-sup-0004-AppendixS4.pdf
    https://deepblue.lib.umich.edu/bitstream/2027.42/143738/4/ajb21016-sup-0001-AppendixS1.pdf
    https://deepblue.lib.umich.edu/bitstream/2027.42/143738/5/ajb21016-sup-0002-AppendixS2.pdf
    https://deepblue.lib.umich.edu/bitstream/2027.42/143738/6/ajb21016_am.pdf
    https://deepblue.lib.umich.edu/bitstream/2027.42/143738/7/ajb21016-sup-0005-AppendixS5.pdf
    https://deepblue.lib.umich.edu/bitstream/2027.42/143738/8/ajb21016-sup-0006-AppendixS6.pdf
    https://deepblue.lib.umich.edu/bitstream/2027.42/143738/9/ajb21016-sup-0008-AppendixS8.pdf
    https://deepblue.lib.umich.edu/bitstream/2027.42/143738/10/ajb21016-sup-0003-AppendixS3.pdf
    https://deepblue.lib.umich.edu/bitstream/2027.42/143738/11/ajb21016-sup-0007-AppendixS7.pd

    Species trees from consensus single nucleotide polymorphism (SNP) data: Testing phylogenetic approaches with simulated and empirical data

    Datasets of hundreds or thousands of SNPs (Single Nucleotide Polymorphisms) from multiple individuals per species are increasingly used to study population structure, species delimitation and shallow phylogenetics. The principal software tool to infer species or population trees from SNP data is currently the BEAST template SNAPP, which uses a Bayesian coalescent analysis. However, it is computationally extremely demanding and tolerates only small amounts of missing data. We used simulated and empirical SNPs from plants (Australian Craspedia, Asteraceae, and Pelargonium, Geraniaceae) to compare species trees produced (1) by SNAPP, (2) using SVD quartets, and (3) using Bayesian and parsimony analysis with several different approaches to summarising data from multiple samples into one set of traits per species. Our aims were to explore the impact of tree topology and missing data on the results, and to test which data summarising and analysis approaches would best approximate the results obtained from SNAPP for empirical data. SVD quartets retrieved the correct topology from simulated data, as did SNAPP except in the case of a very unbalanced phylogeny. Both methods failed to retrieve the correct topology when large amounts of data were missing. Bayesian analysis of species-level summary data scoring the two alleles of each SNP as independent characters and parsimony analysis of data scoring each SNP as one character produced trees with branch length distributions closest to the true trees on which SNPs were simulated. For empirical data, Bayesian inference and Dollo parsimony analysis of data scored allele-wise produced phylogenies most congruent with the results of SNAPP.
In the case of study groups divergent enough for missing data to be phylogenetically informative (because of additional mutations preventing amplification of genomic fragments or bioinformatic establishment of homology), scoring of SNP data as a presence/absence matrix irrespective of allele content might be an additional option. As this depends on sampling across species being reasonably even and on a random distribution of non-informative instances of missing data, however, further exploration of this approach is needed. Properly chosen data summary approaches to inferring species trees from SNP data may represent a potential alternative to currently available individual-level coalescent analyses, especially for quick data exploration and when dealing with computationally demanding or patchy datasets. This study was partly supported by a Centre of Biodiversity Analysis Ignition Grant to A.N.S.-L. and Justin Borevitz in 2013/14.

    Phylogenomics of the superfamily Dytiscoidea (Coleoptera: Adephaga) with an evaluation of phylogenetic conflict and systematic error.

    The beetle superfamily Dytiscoidea, placed within the suborder Adephaga, comprises six families. The phylogenetic relationships of these families, whose species are aquatic, remain highly contentious. In particular the monophyly of the geographically disjunct Aspidytidae (China and South Africa) remains unclear. Here we use a phylogenomic approach to demonstrate that Aspidytidae are indeed monophyletic, as we inferred this phylogenetic relationship from analyzing nucleotide sequence data filtered for compositional heterogeneity and from analyzing amino-acid sequence data. Our analyses suggest that Aspidytidae are the sister group of Amphizoidae, although the support for this relationship is not unequivocal. A sister group relationship of Hygrobiidae to a clade comprising Amphizoidae, Aspidytidae, and Dytiscidae is supported by analyses in which model assumptions are violated the least. In general, we find that both concatenation and the applied coalescent method are sensitive to the effect of among-species compositional heterogeneity. Four-cluster likelihood-mapping suggests that despite the substantial size of the dataset and the use of advanced analytical methods, statistical support is weak for the inferred phylogenetic placement of Hygrobiidae. These results indicate that other kinds of data (e.g. genomic meta-characters) are possibly required to resolve the above-specified persisting phylogenetic uncertainties. Our study illustrates various data-driven confounding effects in phylogenetic reconstructions and highlights the need for careful monitoring of model violations prior to phylogenomic analysis

    Split Analysis Methods and Parametric Bootstrapping in Molecular Phylogenetics: Taking a closer look at model adequacy

    Even though the size of datasets in molecular analyses has increased rapidly in recent years, undetected systematic errors as well as unsolved problems concerning the evaluation of data quality and adequate substitution model selection still persist. This not only hampers the correct analysis of these datasets but also leads to undetectable effects in phylogenetic tree reconstruction. Model-based tree reconstruction methods like maximum likelihood estimation and Bayesian inference have become the methods of choice for reconstruction of phylogenetic trees. Although maximum likelihood methods are known to be consistent if all necessary conditions are met, their accuracy depends strongly on the quality of the multiple sequence alignment and the ability of the chosen evolutionary model to reflect the underlying historical processes. This thesis addresses the assessment of the adequacy of estimated evolutionary models for multiple sequence alignments in the light of parametric bootstrapping and aims to find new methods for detecting model misspecifications with the help of split analyses. The second chapter focuses on the influence of the number of gamma rate categories used in modelling among-site rate variation when trying to assess model adequacy using an absolute goodness-of-fit test. The analyses of simulated alignments show that the Goldman-Cox test rejects models which were only approximated by four discrete gamma rate categories for various tree shapes and branch length setups, if the alignments were simulated with a continuous gamma distribution. Increasing the number of discrete rate categories leads to an acceptance of model adequacy for stationary datasets and a correct detection of non-stationarity and inhomogeneity in simulated data. The results illustrate that the application of the proposed Goldman-Cox test to evaluate model adequacy might be too strict with empirical data, in particular for large phylogenomic datasets.
Approaches such as the Goldman-Cox test evaluate the absolute fit of data and model but do not deliver a deeper insight into the structure of the misfit. The third chapter presents the visualisation of overrepresented splits within splits graphs, which provides a good tool for gaining an overview of possible patterns and contradictory signal or noise within datasets. The analysis of these split residuals, observed by comparison to parametric bootstrap datasets based on the estimated models, can help to gain a deeper insight into model adequacy. Highly overrepresented splits can give hints as to whether heterotachy or non-symmetric substitution processes apply. The fourth chapter aims to define a new split weighting scheme by formalising aspects like 'contrast of character states' or 'character state homogeneity' within split subsets. Splits which are detected by the proposed SAMS (Splits Analysis MethodS) algorithm are re-evaluated for a more objective and formal split weighting. A comparison of the published and the new approach showed that the developed weighting scheme delivers reasonable results but needs further improvement. The development of a new GUI offers a much more capable tool to perform a split analysis and visualise the results. The shape of a visualised split spectrum can indicate whether a dataset delivers a clear split signal or whether a lot of noise is present.

    Phylogenomics and the evolution of hemipteroid insects.

    Hemipteroid insects (Paraneoptera), with over 10% of all known insect diversity, are a major component of terrestrial and aquatic ecosystems. Previous phylogenetic analyses have not consistently resolved the relationships among major hemipteroid lineages. We provide maximum likelihood-based phylogenomic analyses of a taxonomically comprehensive dataset comprising sequences of 2,395 single-copy, protein-coding genes for 193 samples of hemipteroid insects and outgroups. These analyses yield a well-supported phylogeny for hemipteroid insects. Monophyly of each of the three hemipteroid orders (Psocodea, Thysanoptera, and Hemiptera) is strongly supported, as are most relationships among suborders and families. Thysanoptera (thrips) is strongly supported as sister to Hemiptera. However, as in a recent large-scale analysis sampling all insect orders, trees from our data matrices support Psocodea (bark lice and parasitic lice) as the sister group to the holometabolous insects (those with complete metamorphosis). In contrast, four-cluster likelihood mapping of these data does not support this result. A molecular dating analysis using 23 fossil calibration points suggests hemipteroid insects began diversifying before the Carboniferous, over 365 million years ago. We also explore implications for understanding the timing of diversification, the evolution of morphological traits, and the evolution of mitochondrial genome organization. These results provide a phylogenetic framework for future studies of the group

    Phylogenetics of the world’s largest beetle family (Coleoptera: Staphylinidae): A methodological exploration


    Amélioration de l'exactitude de l'inférence phylogénomique (Improving the accuracy of phylogenomic inference)

    The explosion in the number of available sequences has allowed phylogenomics, the study of species relationships based on large multi-gene alignments, to flourish. Phylogenomics is undoubtedly a way to overcome the stochastic errors of single-gene phylogenies, but numerous problems remain despite the progress made in modelling the evolutionary process. In this PhD thesis, we characterize some aspects of poor model fit and study their impact on the accuracy of phylogenetic inference. In contrast to heterotachy, variation over time in the amino acid substitution process has so far received little attention. We demonstrate not only that this heterogeneity is widespread within animals, but also that its existence can impair the quality of phylogenomic inference. In the absence of an adequate model, removing the heterogeneous columns, which are poorly handled by the model, can eliminate a reconstruction artefact. In a phylogenomic framework, the sequencing strategies used often mean that not all genes are present for all species. The controversy about the impact of the quantity of empty cells was recently revived, but the majority of studies on missing data are performed on small datasets of simulated sequences. We were therefore interested in quantifying this impact in the case of a large alignment of real data. With a reasonable amount of missing data, it appears that the incompleteness of the alignment affects the accuracy of the inference less than the choice of the model does. Conversely, adding an incomplete sequence that breaks a long branch can at least partially correct an erroneous phylogeny. Because model violations remain the major limitation on the accuracy of phylogenetic inference, improving species and gene sampling remains a useful alternative in the absence of an adequate model. We therefore developed sequence-selection software that constructs reproducible datasets based on the quantity of data present, the evolutionary rate and compositional bias. During this study, we found that human expertise still provides indispensable knowledge. The various analyses performed in the course of this PhD thesis agree on the paramount importance of the model of sequence evolution.

    Whole genome sequencing of the Asian Arowana (Scleropages formosus) provides insights into the evolution of ray-finned fishes

    The Asian arowana (Scleropages formosus) is of commercial importance and conservation concern, and is a representative of one of the oldest lineages of ray-finned fish, the Osteoglossomorpha. To add to genomic knowledge of this species and the evolution of teleosts, the genome of a Malaysian specimen of arowana was sequenced. A draft genome is presented consisting of 42,110 scaffolds with a total size of 708 Mb (2.85% gaps) representing 93.95% of core eukaryotic genes. Using a k-mer-based method, a genome size of 900 Mb was also estimated. We present an update on the phylogenomics of fishes based on a total of 27 species (23 fish species and 4 tetrapods) using 177 orthologous proteins (71,360 amino acid sites), which supports established relationships except that arowana, rather than the eels (Elopomorpha), is placed as the sister lineage to all teleost clades that evolved after the teleost genome duplication event (Bayesian posterior probability 1.00, bootstrap replicate 93%). Evolutionary rates are highly heterogeneous across the tree, with fishes represented by both slowly and rapidly evolving lineages. A total of 94 putative pigment genes were identified, providing the impetus for development of molecular markers associated with the spectacular colored phenotypes found within this species.