325 research outputs found

    Long-Branch Attraction Bias and Inconsistency in Bayesian Phylogenetics

    Bayesian inference (BI) of phylogenetic relationships uses the same probabilistic models of evolution as its precursor maximum likelihood (ML), so BI has generally been assumed to share ML's desirable statistical properties, such as largely unbiased inference of topology given an accurate model and increasingly reliable inferences as the amount of data increases. Here we show that BI, unlike ML, is biased in favor of topologies that group long branches together, even when the true model and prior distributions of evolutionary parameters over a group of phylogenies are known. Using experimental simulation studies and numerical and mathematical analyses, we show that this bias becomes more severe as more data are analyzed, causing BI to infer an incorrect tree as the maximum a posteriori phylogeny with asymptotically high support as sequence length approaches infinity. BI's long branch attraction bias is relatively weak when the true model is simple but becomes pronounced when sequence sites evolve heterogeneously, even when this complexity is incorporated in the model. This bias—which is apparent under both controlled simulation conditions and in analyses of empirical sequence data—also makes BI less efficient and less robust to the use of an incorrect evolutionary model than ML. Surprisingly, BI's bias is caused by one of the method's stated advantages—that it incorporates uncertainty about branch lengths by integrating over a distribution of possible values instead of estimating them from the data, as ML does. Our findings suggest that trees inferred using BI should be interpreted with caution and that ML may be a more reliable framework for modern phylogenetic analysis

    A Comparison of Phylogenetic Network Methods Using Computer Simulation

    Background: We present a series of simulation studies that explore the relative performance of several phylogenetic network approaches (statistical parsimony, split decomposition, union of maximum parsimony trees, neighbor-net, simulated history recombination upper bound, median-joining, reduced median joining and minimum spanning network) compared to standard tree approaches (neighbor-joining and maximum parsimony) in the presence and absence of recombination. Principal Findings: In the absence of recombination, all methods recovered the correct topology and branch lengths nearly all of the time when the substitution rate was low, except for minimum spanning networks, which did considerably worse. At a higher substitution rate, maximum parsimony and union of maximum parsimony trees were the most accurate. With recombination, the ability to infer the correct topology was halved for all methods and no method could accurately estimate branch lengths. Conclusions: Our results highlight the need for more accurate phylogenetic network methods and the importance of detecting and accounting for recombination in phylogenetic studies. Furthermore, we provide useful information for choosing a network algorithm and a framework in which to evaluate improvements to existing methods and nove

    Taxonomic Reliability of DNA Sequences in Public Sequence Databases: A Fungal Perspective

    BACKGROUND: DNA sequences are increasingly seen as one of the primary information sources for species identification in many organism groups. Such approaches, popularly known as barcoding, are underpinned by the assumption that the reference databases used for comparison are sufficiently complete and feature correctly and informatively annotated entries. METHODOLOGY/PRINCIPAL FINDINGS: The present study uses a large set of fungal DNA sequences from the inclusive International Nucleotide Sequence Database to show that the taxon sampling of fungi is far from complete, that about 20% of the entries may be incorrectly identified to species level, and that the majority of entries lack descriptive and up-to-date annotations. CONCLUSIONS: The problems with taxonomic reliability and insufficient annotations in public DNA repositories form a tangible obstacle to sequence-based species identification, and it is manifest that the greatest challenges to biological barcoding will be of taxonomical, rather than technical, nature

    Resurrection of an ancestral 5S rRNA

    Background: In addition to providing phylogenetic relationships, tree making procedures such as parsimony and maximum likelihood can make specific predictions of actual historical sequences. Resurrection of such sequences can be used to understand early events in evolution. In the case of RNA, the nature of parsimony is such that when applied to multiple RNA sequences it typically predicts ancestral sequences that satisfy the base pairing constraints associated with secondary structure. The case for such sequences being actual ancestors is greatly improved if they can be shown to be biologically functional. Results: A unique common ancestral sequence of 28 Vibrio 5S ribosomal RNA sequences predicted by parsimony was resurrected and found to be functional in the context of the E. coli cellular environment. The functionality of various point variants and intermediates that were constructed as part of the resurrection were examined in detail. When separately introduced, the changes at single stranded positions and individual double variants at base-paired positions were also viable. An additional double variant was examined at a different base-paired position and it was also valid. Conclusions: The results show that at least in the case of the 5S rRNAs considered here, ancestors predicted by parsimony are likely to be realistic when the prediction is not overly influenced by single outliers. It is especially noteworthy that the phenotype of the predicted ancestors could be anticipated as a cumulative consequence of the phenotypes of the individual variants that comprised them. Thus, point mutation data is potentially useful in evaluating the reasonableness of ancestral sequences predicted by parsimony or other methods. The results also suggest that in the absence of significant tertiary structure constraints double variants that preserve pairing in stem regions will typically be accepted. Overall, the results suggest that it will be feasible to resurrect additional meaningful 5S rRNA ancestors as well as ancestral sequences of many different types of RNA.
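
    As a rough illustration of how parsimony can predict an ancestral state, the sketch below runs the bottom-up pass of Fitch's small-parsimony algorithm on a single alignment column. The abstract does not say which parsimony procedure was used, so the choice of Fitch's algorithm, the nested-tuple tree encoding, and the function name `fitch_bottom_up` are all assumptions made for illustration only.

```python
from typing import Set, Tuple, Union

# Hypothetical tree encoding: a leaf is a one-letter state such as "A";
# an internal node is a pair (left_subtree, right_subtree).
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_bottom_up(node: Tree) -> Tuple[Set[str], int]:
    """Return (candidate states at this node, minimum number of changes below it)."""
    if isinstance(node, str):
        # A leaf carries its observed state and needs no changes.
        return {node}, 0
    left, right = node
    left_states, left_cost = fitch_bottom_up(left)
    right_states, right_cost = fitch_bottom_up(right)
    shared = left_states & right_states
    if shared:
        # The children agree on at least one state: no extra change is needed here.
        return shared, left_cost + right_cost
    # The children disagree: keep the union and count one substitution.
    return left_states | right_states, left_cost + right_cost + 1


if __name__ == "__main__":
    # One site observed in four sequences arranged as ((A, A), (A, G)).
    states, changes = fitch_bottom_up((("A", "A"), ("A", "G")))
    print(states, changes)
```

    This pass yields only the candidate states at the root and the parsimony score for one site; a full ancestral sequence would also need a top-down pass and one run per alignment column.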

    On the use of cartographic projections in visualizing phylogenetic tree space

    Phylogenetic analysis is becoming an increasingly important tool for biological research. Applications include epidemiological studies, drug development, and evolutionary analysis. Phylogenetic search is a known NP-Hard problem. The size of the data sets which can be analyzed is limited by the exponential growth in the number of trees that must be considered as the problem size increases. A better understanding of the problem space could lead to better methods, which in turn could lead to the feasible analysis of more data sets. We present a definition of phylogenetic tree space and a visualization of this space that shows significant exploitable structure. This structure can be used to develop search methods capable of handling much larger data sets
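
    To give a sense of the growth in the number of trees mentioned above, the sketch below counts unrooted, fully bifurcating, leaf-labelled trees using the standard formula (2n - 5)!! for n taxa. The abstract does not state which class of trees it counts, so restricting to unrooted binary trees is an assumption, and the function name `num_unrooted_binary_trees` is hypothetical.

```python
def num_unrooted_binary_trees(n_taxa: int) -> int:
    """Count unrooted, fully bifurcating, leaf-labelled trees: (2n - 5)!! for n >= 3."""
    if n_taxa < 3:
        # With one or two taxa there is only a single possible tree.
        return 1
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (the leading factor 1 is implicit).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20, 50):
        print(n, num_unrooted_binary_trees(n))
```

    Four taxa give 3 trees and ten taxa already give 2,027,025, which is why exhaustive search is abandoned in favor of heuristics for all but the smallest data sets.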

    Assessing the Value of DNA Barcodes for Molecular Phylogenetics: Effect of Increased Taxon Sampling in Lepidoptera

    BACKGROUND: A common perception is that DNA barcode datamatrices have limited phylogenetic signal due to the small number of characters available per taxon. However, another school of thought suggests that the massively increased taxon sampling afforded through the use of DNA barcodes may considerably increase the phylogenetic signal present in a datamatrix. Here I test this hypothesis using a large dataset of macrolepidopteran DNA barcodes. METHODOLOGY/PRINCIPAL FINDINGS: Taxon sampling was systematically increased in datamatrices containing macrolepidopteran DNA barcodes. Sixteen family groups were designated as concordance groups, and two quantitative measures, the taxon consistency index and the taxon retention index, were used to assess any changes in phylogenetic signal as a result of the increase in taxon sampling. DNA barcodes alone, even with maximal taxon sampling (500 species per family), were not sufficient to reconstruct monophyly of families, and increased taxon sampling generally increased the number of clades formed per family. However, the scores indicated a similar level of taxon retention (species from a family clustering together) in the cladograms as the number of species included in the datamatrix was increased, suggesting substantial phylogenetic signal below the 'family' branch. CONCLUSIONS/SIGNIFICANCE: The development of supermatrix, supertree or constrained tree approaches could enable the exploitation of the massive taxon sampling afforded through DNA barcodes for phylogenetics, connecting the twigs resolved by barcodes to the deep branches resolved through phylogenomics

    Additions to the Mycosphaerella complex

    Species in the present study were compared based on their morphology, growth characteristics in culture, and DNA sequences of the nuclear ribosomal RNA gene operon (including ITS1, ITS2, 5.8S nrDNA and the first 900 bp of the 28S nrDNA) for all species and partial actin and translation elongation factor 1-alpha gene sequences for Cladosporium species. New species of Mycosphaerella (Mycosphaerellaceae) introduced in this study include M. cerastiicola (on Cerastium semidecandrum, The Netherlands), and M. etlingerae (on Etlingera elatior, Hawaii). Mycosphaerella holualoana is newly reported on Hedychium coronarium (Hawaii). Epitypes are also designated for Hendersonia persooniae, the basionym of Camarosporula persooniae, and for Sphaerella agapanthi, the basionym of Teratosphaeria agapanthi comb. nov. (Teratosphaeriaceae) on Agapanthus umbellatus from South Africa. The latter pathogen is also newly recorded from A. umbellatus in Europe (Portugal). Furthermore, two sexual species of Cladosporium (Davidiellaceae) are described, namely C. grevilleae (on Grevillea sp., Australia), and C. silenes (on Silene maritima, UK). Finally, the phylogenetic positions of two genera are newly confirmed, namely Camarosporula (based on C. persooniae, teleomorph Anthracostroma persooniae), which is a leaf pathogen of Persoonia spp. in Australia and belongs to the Teratosphaeriaceae, and Sphaerulina (based on S. myriadea), which occurs on leaves of Fagaceae (Carpinus, Castanopsis, Fagus, Quercus) and belongs to the Mycosphaerellaceae

    Large-Scale Phylogenetic Analysis of Emerging Infectious Diseases

    Microorganisms that cause infectious diseases present critical issues of national security, public health, and economic welfare. For example, in recent years, highly pathogenic strains of avian influenza have emerged in Asia, spread through Eastern Europe and threaten to become pandemic. As demonstrated by the coordinated response to Severe Acute Respiratory Syndrome (SARS) and influenza, agents of infectious disease are being addressed via large-scale genomic sequencing. The goal of genomic sequencing projects is to rapidly put large amounts of data in the public domain to accelerate research on disease surveillance, treatment, and prevention. However, our ability to derive information from large comparative genomic datasets lags far behind acquisition. Here we review the computational challenges of comparative genomic analyses, specifically sequence alignment and reconstruction of phylogenetic trees. We present novel analytical results from two important infectious diseases, Severe Acute Respiratory Syndrome (SARS) and influenza. SARS and influenza have similarities and important differences both as biological and comparative genomic analysis problems. Influenza viruses (Orthomyxoviridae) are RNA based. Current evidence indicates that influenza viruses originate in aquatic birds from wild populations. Influenza has been studied for decades via well-coordinated international efforts. These efforts center on surveillance via antibody characterization of the hemagglutinin (HA) and neuraminidase (N) proteins of the circulating strains to inform vaccine design. However, we still do not have a clear understanding of: 1) various transmission pathways, such as the role of intermediate hosts such as swine and domestic birds, and 2) the key mutation and genomic recombination events that underlie periodic pandemics of influenza. In the past 30 years, sequence data from HA and N loci has become an important data type. In the past year, full genomic data has become prominent. These data present exciting opportunities to address unanswered questions in influenza pandemics. SARS is caused by a previously unrecognized lineage of coronavirus, SARS-CoV, which like influenza has an RNA based genome. Although SARS-CoV is widely believed to have originated in animals, there remains disagreement over the candidate animal source that led to the original outbreak of SARS. In contrast to the long history of the study of influenza, SARS was only recognized in late 2002 and the virus that causes SARS has been documented primarily by genomic sequencing. In the past, most studies of influenza were performed on a limited number of isolates and genes suited to a particular problem. Major goals in science today are to understand emerging diseases in broad geographic, environmental, societal, biological, and genomic contexts. Synthesizing diverse information brought together by various researchers is important to find out what can be done to prevent future outbreaks {JON03}. Thus comprehensive means to organize and analyze large amounts of diverse information are critical. For example, the relationships of isolates and patterns of genomic change observed in large datasets might not be consistent with hypotheses formed on partial data. Moreover, when researchers rely on partial datasets, they restrict the range of possible discoveries. Phylogenetics is well suited to the complex task of understanding emerging infectious disease. Phylogenetic analyses can test many hypotheses by comparing diverse isolates collected from various hosts, environments, and points in time and organizing these data into various evolutionary scenarios. The products of a phylogenetic analysis are a graphical tree of ancestor-descendant relationships and an inferred summary of mutations, recombination events, host shifts, geographic, and temporal spread of the viruses. However, this synthesis comes at a price. The cost of computation of phylogenetic analysis expands combinatorially as the number of isolates considered increases. Thus, large datasets like those currently produced are commonly considered intractable. We address this problem with synergistic development of heuristic tree search strategies and parallel computing. Affiliation: Janies, D., Ohio State University, United States. Affiliation: Pol, Diego, Ohio State University, United States; Consejo Nacional de Investigaciones Científicas y Técnicas, Argentin