    Twisted trees and inconsistency of tree estimation when gaps are treated as missing data -- the impact of model mis-specification in distance corrections

    Statistically consistent estimation of phylogenetic trees or gene trees is possible if pairwise sequence dissimilarities can be converted to a set of distances that are proportional to the true evolutionary distances. Susko et al. (2004) reported some strikingly broad results about the forms of inconsistency in tree estimation that can arise if corrected distances are not proportional to the true distances. They showed that if the corrected distance is a concave function of the true distance, then inconsistency due to long branch attraction will occur. If these functions are convex, then two "long branch repulsion" trees will be preferred over the true tree, though these two incorrect trees are expected to be tied as the preferred tree. Here we extend their results, and demonstrate the existence of a tree shape (which we refer to as a "twisted Farris-zone" tree) for which a single incorrect tree topology will be guaranteed to be preferred if the corrected distance function is convex. We also report that the standard practice of treating gaps in sequence alignments as missing data is sufficient to produce non-linear corrected distance functions if the substitution process is not independent of the insertion/deletion process. Taken together, these results imply inconsistent tree inference under mild conditions. For example, if some positions in a sequence are constrained to be free of substitutions and insertion/deletion events while the remaining sites evolve with independent substitutions and insertion/deletion events, then the distances obtained by treating gaps as missing data can support an incorrect tree topology even given an unlimited amount of data.
    Comment: 29 pages, 3 figures
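The concave and convex cases described in this abstract can be illustrated with a small four-taxon sketch. Everything below is an assumption made for illustration: the branch lengths, the two transform functions, and the use of the four-point rule (pick the split with the smallest sum of within-pair distances) as a stand-in for a distance-based tree method. It does not reproduce the "twisted Farris-zone" tree from the paper.

```python
import math

# Hypothetical branch lengths (expected substitutions per site), for illustration only.
LONG, SHORT, INTERNAL = 2.0, 0.1, 0.05

# The three possible unrooted splits of four taxa.
SPLITS = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}


def path_distances(a, b, c, d, internal):
    """True path-length distances on the unrooted tree ((A,B),(C,D))."""
    return {
        ("A", "B"): a + b,
        ("C", "D"): c + d,
        ("A", "C"): a + internal + c,
        ("B", "D"): b + internal + d,
        ("A", "D"): a + internal + d,
        ("B", "C"): b + internal + c,
    }


def split_scores(dist, transform=lambda x: x):
    """Sum of (transformed) within-pair distances for each split."""
    return {
        name: transform(dist[p]) + transform(dist[q])
        for name, (p, q) in SPLITS.items()
    }


def preferred(scores):
    """Four-point rule: the split with the smallest score is preferred."""
    return min(scores, key=scores.get)


def concave(x):
    """An illustrative concave, saturating 'corrected distance' function."""
    return 1.0 - math.exp(-x)


def convex(x):
    """An illustrative convex 'corrected distance' function (for x >= 0)."""
    return x * x


# Long branches A and C sit on opposite sides of the internal branch.
felsenstein = path_distances(LONG, SHORT, LONG, SHORT, INTERNAL)
# Long branches A and B are sister taxa.
farris = path_distances(LONG, LONG, SHORT, SHORT, INTERNAL)

print("true distances:", preferred(split_scores(felsenstein)))
print("concave:", preferred(split_scores(felsenstein, concave)))
print("convex scores:", split_scores(farris, convex))
```

With these illustrative values, the untransformed distances recover the true split AB|CD in both trees. The concave transform groups the two long branches (AC|BD), and under the convex transform the two incorrect splits tie with a lower score than the true split.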

    A Genomic Approach for Distinguishing between Recent and Ancient Admixture as Applied to Cattle

    Genomic data facilitate opportunities to track complex population histories of divergence and gene flow. We developed a metric, scaled block size (SBS), which uses the nonrecombined block size of introgressed regions of chromosomes to differentiate between recent and ancient types of admixture, and applied it to the reconstruction of admixture in cattle. Cattle are descendants of 2 independently domesticated lineages, taurine and indicine, which diverged more than 200 000 years ago. Several breeds have hybrid ancestry between these divergent lineages. Using 47 506 single-nucleotide polymorphisms, we analyzed the genomic architecture of the ancestry of 1369 individuals. We focused on 4 groups with admixed ancestry, including 2 anciently admixed African breeds (n = 58; n = 43), New World cattle of Spanish origin (n = 51), and known recent hybrids (n = 46). We estimated the ancestry of chromosomal regions for each individual and used the SBS metric to differentiate the timing of admixture among groups and among individuals within groups. By comparing SBS values of test individuals with standards with known recent hybrid ancestry, we were able to differentiate individuals of recent hybrid origin from other admixed cattle. We also estimated ancestry at the chromosomal scale. The X chromosome exhibits reduced indicine ancestry in recent hybrid, New World, and western African cattle, with virtually no evidence of indicine ancestry in New World cattle.
    Key words: cattle, chromosome painting, hybridization, introgression
    Graduate Program in Ecology, Evolution, and Behavior at the University of Texas at Austin; Texas EcoLabs; Texas Longhorn Cattleman’s Foundation; National Science Foundation BEACON (Cooperative Agreement DBI–0939454); Extreme Science and Engineering Discovery Environment (XSEDE), National Science Foundation (OCI–1053575)

    How do SNP ascertainment schemes and population demographics affect inferences about population history?

    Background: The selection of variable sites for inclusion in genomic analyses can influence results, especially when exemplar populations are used to determine polymorphic sites. We tested the impact of ascertainment bias on the inference of population genetic parameters using empirical and simulated data representing the three major continental groups of cattle: European, African, and Indian. We simulated data under three demographic models. Each simulated data set was subjected to three ascertainment schemes: (I) random selection; (II) geographically biased selection; and (III) selection biased toward loci polymorphic in multiple groups. Empirical data comprised samples of 25 individuals representing each continental group. These cattle were genotyped for 47,506 loci from the bovine 50 K SNP panel. We compared the inference of population histories for the empirical and simulated data sets across different ascertainment conditions using $F_{ST}$ and principal components analysis (PCA). Results: Bias toward shared polymorphism across continental groups is apparent in the empirical SNP data. Bias toward uneven levels of within-group polymorphism decreases estimates of $F_{ST}$ between groups. Subpopulation-biased selection of SNPs changes the weighting of principal component axes and can affect inferences about proportions of admixture and population histories using PCA. PCA-based inferences of population relationships are largely congruent across types of ascertainment bias, even when ascertainment bias is strong. Conclusions: Analyses of ascertainment bias in genomic data have largely been conducted on human data. As genomic analyses are being applied to non-model organisms, and across taxa with deeper divergences, care must be taken to consider the potential for bias in ascertainment of variation to affect inferences. Estimates of $F_{ST}$, time of separation, and population divergence as estimated by principal components analysis can be misleading if this bias is not taken into account.
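As a rough illustration of how the choice of loci can shift a differentiation estimate, the sketch below computes a simple heterozygosity-based estimate, (H_T - H_S) / H_T, for two populations. The abstract does not say which estimator the authors used, so this formula is an assumed stand-in. The allele frequencies are invented toy values, and the filter only loosely mimics ascertainment scheme (III).

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst(loci):
    """Ratio-of-sums estimate (H_T - H_S) / H_T for two equally weighted populations.

    `loci` is a list of (p1, p2) allele-frequency pairs, one pair per locus.
    """
    h_s = sum((heterozygosity(p1) + heterozygosity(p2)) / 2.0 for p1, p2 in loci)
    h_t = sum(heterozygosity((p1 + p2) / 2.0) for p1, p2 in loci)
    return (h_t - h_s) / h_t


def polymorphic_in_both(loci):
    """Keep only loci that are variable in both populations."""
    return [(p1, p2) for p1, p2 in loci if 0.0 < p1 < 1.0 and 0.0 < p2 < 1.0]


# Invented toy allele frequencies (population 1, population 2).
toy_loci = [(0.1, 0.2), (0.5, 0.4), (0.9, 0.6), (1.0, 0.0), (0.0, 0.7)]

fst_all = fst(toy_loci)
fst_shared = fst(polymorphic_in_both(toy_loci))

print(f"all loci: {fst_all:.3f}")
print(f"loci polymorphic in both groups: {fst_shared:.3f}")
```

In this toy data set, dropping the loci that are fixed in one population removes the sites that contribute most to between-group differentiation, so the estimate from the filtered set is lower than the estimate from all loci.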

    TreeToReads - a pipeline for simulating raw reads from phylogenies.

    Background: Using phylogenomic analysis tools for tracking pathogens has become standard practice in academia, public health agencies, and large industries. Using the same raw read genomic data as input, there are several different approaches being used to infer phylogenetic trees. These include many different SNP pipelines, wgMLST approaches, k-mer algorithms, whole genome alignment and others; each of these has advantages and disadvantages: some have been extensively validated, some are faster, some have higher resolution. A few of these analysis approaches are well-integrated into the regulatory process of US Federal agencies (e.g. the FDA's SNP pipeline for tracking foodborne pathogens). However, despite extensive validation on benchmark datasets and comparison with other pipelines, we lack methods for fully exploring the effects of multiple parameter values in each pipeline that can potentially have an effect on whether the correct phylogenetic tree is recovered. Results: To resolve this problem, we offer a program, TreeToReads, which can generate raw read data from mutated genomes simulated under a known phylogeny. This simulation pipeline allows direct comparisons of simulated and observed data in a controlled environment. At each step of these simulations, researchers can vary parameters of interest (e.g., input tree topology, amount of sequence divergence, rate of indels, read coverage, distance of reference genome, etc.) to assess the effects of various parameter values on correctly calling SNPs and reconstructing an accurate tree. Conclusions: Such critical assessments of the accuracy and robustness of analytical pipelines are essential to progress in both research and applied settings.

    Variation in Flight Morphology in a Damselfly with Female-Limited Polymorphism

    Background: Female-limited colour polymorphisms occur in many species of dragonflies and damselflies. Often one female morph appears male-like in coloration (androchromes) whereas one or more others are distinct from males (gynochromes). These androchromes are hypothesized to be male-mimics, thereby avoiding the harassment of excessive male mating attempts. Organism: The damselfly Ischnura ramburii, Rambur’s forktail, is a widespread New World species with androchrome and gynochrome females. It was introduced to the Hawaiian Islands in the mid-1970s and females were thought to be exclusively gynochromatic there. Questions: How do males and females differ in their flight apparatus? Do females with different colour morphologies also differ in flight morphology? Hypothesis: Because male-like coloration is sometimes associated with male-like flight behaviours, androchrome females should have more male-like wings than gynochrome females. Methods: We caught individuals of I. ramburii in the field from seven populations on three of the Hawaiian Islands and three populations in Texas (part of its native range). Using digitized wing and body images, we compared body size, wing size, and wing shape between sexes, between female morphs, and among geographic regions. Results: Male I. ramburii are smaller than females and have smaller, more slender wings. Although androchromes are absent from the Big Island of Hawaii, both androchrome and gynochrome females are common on Oahu and Kauai. Androchrome females are indistinguishable from gynochrome females in all aspects of their flight apparatus except for forewing size, which is smaller than that of gynochromes and thus more male-like. Wing shape and size vary geographically. Body- and wing-size differences between males and females are consistent across regions, although the degree and direction of sexual dimorphism in wing shape are not.

    Phylesystem: a git-based data store for community-curated phylogenetic estimates

    Motivation: Phylogenetic estimates from published studies can be archived using general platforms like Dryad (Vision, 2010) or TreeBASE (Sanderson et al., 1994). Such services fulfill a crucial role in ensuring transparency and reproducibility in phylogenetic research. However, digital tree data files often require some editing (e.g. rerooting) to improve the accuracy and reusability of the phylogenetic statements. Furthermore, establishing the mapping between tip labels used in a tree and taxa in a single common taxonomy dramatically improves the ability of other researchers to reuse phylogenetic estimates. As the process of curating a published phylogenetic estimate is not error-free, retaining a full record of the provenance of edits to a tree is crucial for openness, allowing editors to receive credit for their work and making errors introduced during curation easier to correct. Results: Here, we report the development of software infrastructure to support the open curation of phylogenetic data by the community of biologists. The backend of the system provides an interface for the standard database operations of creating, reading, updating and deleting records by making commits to a git repository. The record of the history of edits to a tree is preserved by git’s version control features. Hosting this data store on GitHub (http://github.com/) provides open access to the data store using tools familiar to many developers. We have deployed a server running the ‘phylesystem-api’, which wraps the interactions with git and GitHub. The Open Tree of Life project has also developed and deployed a JavaScript application that uses the phylesystem-api and other web services to enable input and curation of published phylogenetic statements

    Phylotastic! Making Tree-of-Life Knowledge Accessible, Reusable and Convenient

    Scientists rarely reuse expert knowledge of phylogeny, in spite of years of effort to assemble a great "Tree of Life" (ToL). A notable exception involves the use of Phylomatic, which provides tools to generate custom phylogenies from a large, pre-computed, expert phylogeny of plant taxa. This suggests great potential for a more generalized system that, starting with a query consisting of a list of any known species, would rectify non-standard names, identify expert phylogenies containing the implicated taxa, prune away unneeded parts, and supply branch lengths and annotations, resulting in a custom phylogeny suited to the user's needs. Such a system could become a sustainable community resource if implemented as a distributed system of loosely coupled parts that interact through clearly defined interfaces. Results: With the aim of building such a "phylotastic" system, the NESCent Hackathons, Interoperability, Phylogenies (HIP) working group recruited 2 dozen scientist-programmers to a weeklong programming hackathon in June 2012. During the hackathon (and a three-month follow-up period), 5 teams produced designs, implementations, documentation, presentations, and tests including: (1) a generalized scheme for integrating components; (2) proof-of-concept pruners and controllers; (3) a meta-API for taxonomic name resolution services; (4) a system for storing, finding, and retrieving phylogenies using semantic web technologies for data exchange, storage, and querying; (5) an innovative new service, DateLife.org, which synthesizes pre-computed, time-calibrated phylogenies to assign ages to nodes; and (6) demonstration projects. These outcomes are accessible via a public code repository (GitHub.com), a website (www.phylotastic.org), and a server image. 
Conclusions: Approximately 9 person-months of effort (centered on a software development hackathon) resulted in the design and implementation of proof-of-concept software for 4 core phylotastic components, 3 controllers, and 3 end-user demonstration tools. While these products have substantial limitations, they suggest considerable potential for a distributed system that makes phylogenetic knowledge readily accessible in computable form. Widespread use of phylotastic systems will create an electronic marketplace for sharing phylogenetic knowledge that will spur innovation in other areas of the ToL enterprise, such as annotation of sources and methods and third-party methods of quality assessment.
NESCent (the National Evolutionary Synthesis Center); NSF EF-0905606; iPlant Collaborative (NSF) DBI-0735191; Biodiversity Synthesis Center (BioSync) of the Encyclopedia of Life; Computer Science
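The pruning step mentioned in this abstract (cutting a large expert phylogeny down to the taxa in a query) can be sketched as follows. This is a minimal illustration and not the code of any phylotastic pruner: the nested-tuple tree representation, the `prune` function, and the example taxa are all assumptions, and branch lengths and annotations are ignored.

```python
def prune(tree, keep):
    """Return `tree` restricted to the leaf names in `keep`, or None if none remain.

    A tree is either a leaf name (str) or a tuple of subtrees.
    Internal nodes left with a single child are collapsed into that child.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    pruned_children = [prune(child, keep) for child in tree]
    kept = [child for child in pruned_children if child is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # collapse a node that now has only one descendant lineage
    return tuple(kept)


# Hypothetical source phylogeny and query list.
source_tree = ((("human", "chimp"), "gorilla"), ("mouse", "rat"))
query = {"human", "gorilla", "mouse"}

custom_tree = prune(source_tree, query)
print(custom_tree)
```

Collapsing single-child nodes keeps the result a proper tree over the requested taxa. A fuller version would also sum branch lengths along each collapsed path.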