83 research outputs found
Twisted trees and inconsistency of tree estimation when gaps are treated as missing data -- the impact of model mis-specification in distance corrections
Statistically consistent estimation of phylogenetic trees or gene trees is
possible if pairwise sequence dissimilarities can be converted to a set of
distances that are proportional to the true evolutionary distances. Susko et
al. (2004) reported some strikingly broad results about the forms of
inconsistency in tree estimation that can arise if corrected distances are not
proportional to the true distances. They showed that if the corrected distance
is a concave function of the true distance, then inconsistency due to long
branch attraction will occur. If these functions are convex, then two "long
branch repulsion" trees will be preferred over the true tree -- though these
two incorrect trees are expected to be tied as the preferred tree. Here we
extend their results, and demonstrate the existence of a tree shape (which we
refer to as a "twisted Farris-zone" tree) for which a single incorrect tree
topology will be guaranteed to be preferred if the corrected distance function
is convex. We also report that the standard practice of treating gaps in
sequence alignments as missing data is sufficient to produce non-linear
corrected distance functions if the substitution process is not independent of
the insertion/deletion process. Taken together, these results imply
inconsistent tree inference under mild conditions. For example, if some
positions in a sequence are constrained to be free of substitutions and
insertion/deletion events while the remaining sites evolve with independent
substitutions and insertion/deletion events, then the distances obtained by
treating gaps as missing data can support an incorrect tree topology even given
an unlimited amount of data.
Comment: 29 pages, 3 figures
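The concave and convex cases described in this abstract can be illustrated with a small four-taxon sketch. It assumes a four-point criterion (prefer the pairing of taxa with the smallest sum of corrected distances), and uses `math.sqrt` and squaring as stand-ins for concave and convex correction functions. The branch lengths and all function names are hypothetical choices for illustration, not taken from the paper, and the "twisted Farris-zone" tree itself is not reproduced here.

```python
import math
from itertools import combinations


def pairwise_distances(pendant, internal, split):
    """Additive path lengths on a four-taxon tree.

    pendant:  dict mapping taxon name -> pendant branch length
    internal: length of the single internal branch
    split:    ((a, b), (c, d)), the true pairing of taxa
    """
    (a, b), (c, d) = split
    side = {a: 0, b: 0, c: 1, d: 1}
    dist = {}
    for x, y in combinations(sorted(pendant), 2):
        path = pendant[x] + pendant[y]
        if side[x] != side[y]:  # path crosses the internal branch
            path += internal
        dist[frozenset((x, y))] = path
    return dist


def preferred_splits(dist, correction, tol=1e-9):
    """Return the pairing(s) with the smallest sum of corrected distances."""
    a, b, c, d = sorted({taxon for pair in dist for taxon in pair})
    candidates = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    scores = {
        s: correction(dist[frozenset(s[0])]) + correction(dist[frozenset(s[1])])
        for s in candidates
    }
    best = min(scores.values())
    return [s for s in candidates if scores[s] - best <= tol]


# Hypothetical branch lengths: A and C are long, B and D are short.
PENDANT = {"A": 2.0, "B": 0.1, "C": 2.0, "D": 0.1}

# Long branches on opposite sides of the internal branch.
felsenstein = pairwise_distances(PENDANT, 0.1, (("A", "B"), ("C", "D")))
# Long branches are sister taxa.
farris = pairwise_distances(PENDANT, 0.1, (("A", "C"), ("B", "D")))

print(preferred_splits(felsenstein, lambda d: d))      # undistorted distances
print(preferred_splits(felsenstein, math.sqrt))        # concave distortion
print(preferred_splits(farris, lambda d: d * d))       # convex distortion
```

With undistorted distances the true pairing is recovered. Under the concave stand-in the two long branches A and C are grouped, and under the convex stand-in the two incorrect pairings tie, matching the two behaviours the abstract attributes to Susko et al. (2004).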
A Genomic Approach for Distinguishing between Recent and Ancient Admixture as Applied to Cattle
Genomic data facilitate opportunities to track complex population histories of divergence and gene flow. We developed a metric, scaled block size (SBS), which uses the nonrecombined block size of introgressed regions of chromosomes to differentiate between recent and ancient types of admixture, and applied it to the reconstruction of admixture in cattle. Cattle are descendants of 2 independently domesticated lineages, taurine and indicine, which diverged more than 200 000 years ago. Several breeds have hybrid ancestry between these divergent lineages. Using 47 506 single-nucleotide polymorphisms, we analyzed the genomic architecture of the ancestry of 1369 individuals. We focused on 4 groups with admixed ancestry, including 2 anciently admixed African breeds (n = 58; n = 43), New World cattle of Spanish origin (n = 51), and known recent hybrids (n = 46). We estimated the ancestry of chromosomal regions for each individual and used the SBS metric to differentiate the timing of admixture among groups and among individuals within groups. By comparing SBS values of test individuals with standards with known recent hybrid ancestry, we were able to differentiate individuals of recent hybrid origin from other admixed cattle. We also estimated ancestry at the chromosomal scale. The X chromosome exhibits reduced indicine ancestry in recent hybrid, New World, and western African cattle, with virtually no evidence of indicine ancestry in New World cattle.
Key words: cattle, chromosome painting, hybridization, introgression
Funding: Graduate Program in Ecology, Evolution, and Behavior at the University of Texas at Austin; Texas EcoLabs; Texas Longhorn Cattleman’s Foundation; National Science Foundation BEACON (Cooperative Agreement DBI–0939454); Extreme Science and Engineering Discovery Environment (XSEDE), National Science Foundation (OCI–1053575)
How do SNP ascertainment schemes and population demographics affect inferences about population history?
Background: The selection of variable sites for inclusion in genomic analyses can influence results, especially when exemplar populations are used to determine polymorphic sites. We tested the impact of ascertainment bias on the inference of population genetic parameters using empirical and simulated data representing the three major continental groups of cattle: European, African, and Indian. We simulated data under three demographic models. Each simulated data set was subjected to three ascertainment schemes: (I) random selection; (II) geographically biased selection; and (III) selection biased toward loci polymorphic in multiple groups. Empirical data comprised samples of 25 individuals representing each continental group. These cattle were genotyped for 47,506 loci from the bovine 50 K SNP panel. We compared the inference of population histories for the empirical and simulated data sets across different ascertainment conditions using F_ST and principal components analysis (PCA).
Results: Bias toward shared polymorphism across continental groups is apparent in the empirical SNP data. Bias toward uneven levels of within-group polymorphism decreases estimates of F_ST between groups. Subpopulation-biased selection of SNPs changes the weighting of principal component axes and can affect inferences about proportions of admixture and population histories using PCA. PCA-based inferences of population relationships are largely congruent across types of ascertainment bias, even when ascertainment bias is strong.
Conclusions: Analyses of ascertainment bias in genomic data have largely been conducted on human data. As genomic analyses are being applied to non-model organisms, and across taxa with deeper divergences, care must be taken to consider the potential for bias in ascertainment of variation to affect inferences. Estimates of F_ST, time of separation, and population divergence as estimated by principal components analysis can be misleading if this bias is not taken into account.
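To make the ascertainment effect concrete, the following is a minimal sketch that assumes the fixation statistic in this abstract is Wright's F_ST, computed here as a ratio of averages over biallelic loci for two equally weighted populations. The allele frequencies and the filter `polymorphic_in_both` are hypothetical and only mimic a scheme that favours loci polymorphic in multiple groups. This does not reproduce the study's simulations or its estimator.

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst(loci):
    """Ratio-of-averages F_ST = (H_T - H_S) / H_T for two populations.

    loci: list of (p1, p2) allele-frequency pairs, one pair per locus.
    """
    h_s = sum((heterozygosity(p1) + heterozygosity(p2)) / 2.0 for p1, p2 in loci)
    h_t = sum(heterozygosity((p1 + p2) / 2.0) for p1, p2 in loci)
    return (h_t - h_s) / h_t


def polymorphic_in_both(loci):
    """Hypothetical ascertainment filter: keep loci variable in both groups."""
    return [(p1, p2) for p1, p2 in loci if 0.0 < p1 < 1.0 and 0.0 < p2 < 1.0]


# Made-up allele frequencies: three loci that are fixed or nearly fixed
# between the groups, and three with shared polymorphism.
LOCI = [(1.0, 0.0), (0.9, 0.0), (0.0, 0.8), (0.6, 0.4), (0.5, 0.3), (0.7, 0.5)]

all_loci_fst = fst(LOCI)
shared_only_fst = fst(polymorphic_in_both(LOCI))
print(f"all loci: {all_loci_fst:.3f}  shared-polymorphism loci: {shared_only_fst:.3f}")
```

In this toy data set, restricting the panel to loci that vary in both groups discards the most differentiated loci, so the estimate of F_ST drops sharply.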
TreeToReads - a pipeline for simulating raw reads from phylogenies.
Background: Using phylogenomic analysis tools for tracking pathogens has become standard practice in academia, public health agencies, and large industries. Using the same raw read genomic data as input, there are several different approaches being used to infer phylogenetic trees. These include many different SNP pipelines, wgMLST approaches, k-mer algorithms, whole genome alignment and others; each of these has advantages and disadvantages, some have been extensively validated, some are faster, some have higher resolution. A few of these analysis approaches are well-integrated into the regulatory process of US Federal agencies (e.g. the FDA's SNP pipeline for tracking foodborne pathogens). However, despite extensive validation on benchmark datasets and comparison with other pipelines, we lack methods for fully exploring the effects of multiple parameter values in each pipeline that can potentially have an effect on whether the correct phylogenetic tree is recovered.
Results: To resolve this problem, we offer a program, TreeToReads, which can generate raw read data from mutated genomes simulated under a known phylogeny. This simulation pipeline allows direct comparisons of simulated and observed data in a controlled environment. At each step of these simulations, researchers can vary parameters of interest (e.g., input tree topology, amount of sequence divergence, rate of indels, read coverage, distance of reference genome, etc.) to assess the effects of various parameter values on correctly calling SNPs and reconstructing an accurate tree.
Conclusions: Such critical assessments of the accuracy and robustness of analytical pipelines are essential to progress in both research and applied settings.
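The general idea of simulating reads from genomes mutated along a known tree can be sketched in a few lines. This is a toy illustration only, not TreeToReads itself: the tree encoding, the function names, and the parameters (`subs_per_unit`, `read_len`, `coverage`) are assumptions made for this example, and it models substitutions only, with no indels and no sequencing errors.

```python
import random

BASES = "ACGT"


def mutate(genome, n_subs, rng):
    """Apply n_subs substitutions at distinct, randomly chosen positions."""
    seq = list(genome)
    for pos in rng.sample(range(len(seq)), n_subs):
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)


def evolve(tree, genome, subs_per_unit, rng, tips=None):
    """Evolve a genome down a tree and collect the tip sequences.

    tree: a tip name (str), or a list of (subtree, branch_length) pairs.
    subs_per_unit: substitutions applied per unit of branch length.
    """
    tips = {} if tips is None else tips
    if isinstance(tree, str):
        tips[tree] = genome
        return tips
    for subtree, length in tree:
        child = mutate(genome, round(length * subs_per_unit), rng)
        evolve(subtree, child, subs_per_unit, rng, tips)
    return tips


def sample_reads(genome, read_len, coverage, rng):
    """Draw error-free reads uniformly to reach the target mean coverage."""
    n_reads = coverage * len(genome) // read_len
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


rng = random.Random(1)
root = "".join(rng.choice(BASES) for _ in range(200))
# Hypothetical tree: tipA branches off first; tipB and tipC are sisters.
tree = [("tipA", 0.01), ([("tipB", 0.02), ("tipC", 0.02)], 0.03)]
tip_genomes = evolve(tree, root, subs_per_unit=len(root), rng=rng)
reads = {name: sample_reads(g, read_len=50, coverage=5, rng=rng)
         for name, g in tip_genomes.items()}
```

Because the true tree and the true variant positions are known, reads produced this way can be fed to a SNP-calling or tree-inference pipeline and the output compared against the truth while parameters such as divergence or coverage are varied.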
Variation in Flight Morphology in a Damselfly with Female-Limited Polymorphism
Background: Female-limited colour polymorphisms occur in many species of dragonflies and damselflies. Often one female morph appears male-like in coloration (androchromes) whereas one or more others are distinct from males (gynochromes). These androchromes are hypothesized to be male-mimics, thereby avoiding the harassment of excessive male mating attempts.
Organism: The damselfly Ischnura ramburii, Rambur’s forktail, is a widespread New World species with androchrome and gynochrome females. It was introduced to the Hawaiian Islands in the mid-1970s and females were thought to be exclusively gynochromatic there.
Questions: How do males and females differ in their flight apparatus? Do females with different colour morphologies also differ in flight morphology? Hypothesis: Because male-like coloration is sometimes associated with male-like flight behaviours, androchrome females should have more male-like wings than gynochrome females.
Methods: We caught individuals of I. ramburii in the field from seven populations on three of the Hawaiian Islands and three populations in Texas (part of its native range). Using digitized wing and body images, we compared body size, wing size, and wing shape between sexes, between female morphs, and among geographic regions.
Results: Male I. ramburii are smaller than females and have smaller, more slender wings. Although androchromes are absent from the Big Island of Hawaii, both androchrome and gynochrome females are common on Oahu and Kauai. Androchrome females are indistinguishable from gynochrome females in all aspects of their flight apparatus except for forewing size, which is smaller than that of gynochromes and thus more male-like. Wing shape and size vary geographically. Body- and wing-size differences between males and females are consistent across regions, although the degree and direction of sexual dimorphism in wing shape are not.
Phylesystem: a git-based data store for community-curated phylogenetic estimates
Motivation: Phylogenetic estimates from published studies can be archived using general platforms like Dryad (Vision, 2010) or TreeBASE (Sanderson et al., 1994). Such services fulfill a crucial role in ensuring transparency and reproducibility in phylogenetic research. However, digital tree data files often require some editing (e.g. rerooting) to improve the accuracy and reusability of the phylogenetic statements. Furthermore, establishing the mapping between tip labels used in a tree and taxa in a single common taxonomy dramatically improves the ability of other researchers to reuse phylogenetic estimates. As the process of curating a published phylogenetic estimate is not error-free, retaining a full record of the provenance of edits to a tree is crucial for openness, allowing editors to receive credit for their work and making errors introduced during curation easier to correct.
Results: Here, we report the development of software infrastructure to support the open curation of phylogenetic data by the community of biologists. The backend of the system provides an interface for the standard database operations of creating, reading, updating and deleting records by making commits to a git repository. The record of the history of edits to a tree is preserved by git’s version control features. Hosting this data store on GitHub (http://github.com/) provides open access to the data store using tools familiar to many developers. We have deployed a server running the ‘phylesystem-api’, which wraps the interactions with git and GitHub. The Open Tree of Life project has also developed and deployed a JavaScript application that uses the phylesystem-api and other web services to enable input and curation of published phylogenetic statements
Phylotastic! Making Tree-of-Life Knowledge Accessible, Reusable and Convenient
Scientists rarely reuse expert knowledge of phylogeny, in spite of years of effort to assemble a great "Tree of Life" (ToL). A notable exception involves the use of Phylomatic, which provides tools to generate custom phylogenies from a large, pre-computed, expert phylogeny of plant taxa. This suggests great potential for a more generalized system that, starting with a query consisting of a list of any known species, would rectify non-standard names, identify expert phylogenies containing the implicated taxa, prune away unneeded parts, and supply branch lengths and annotations, resulting in a custom phylogeny suited to the user's needs. Such a system could become a sustainable community resource if implemented as a distributed system of loosely coupled parts that interact through clearly defined interfaces. Results: With the aim of building such a "phylotastic" system, the NESCent Hackathons, Interoperability, Phylogenies (HIP) working group recruited 2 dozen scientist-programmers to a weeklong programming hackathon in June 2012. During the hackathon (and a three-month follow-up period), 5 teams produced designs, implementations, documentation, presentations, and tests including: (1) a generalized scheme for integrating components; (2) proof-of-concept pruners and controllers; (3) a meta-API for taxonomic name resolution services; (4) a system for storing, finding, and retrieving phylogenies using semantic web technologies for data exchange, storage, and querying; (5) an innovative new service, DateLife.org, which synthesizes pre-computed, time-calibrated phylogenies to assign ages to nodes; and (6) demonstration projects. These outcomes are accessible via a public code repository (GitHub.com), a website (www.phylotastic.org), and a server image. 
Conclusions: Approximately 9 person-months of effort (centered on a software development hackathon) resulted in the design and implementation of proof-of-concept software for 4 core phylotastic components, 3 controllers, and 3 end-user demonstration tools. While these products have substantial limitations, they suggest considerable potential for a distributed system that makes phylogenetic knowledge readily accessible in computable form. Widespread use of phylotastic systems will create an electronic marketplace for sharing phylogenetic knowledge that will spur innovation in other areas of the ToL enterprise, such as annotation of sources and methods and third-party methods of quality assessment.
Funding: NESCent (the National Evolutionary Synthesis Center), NSF EF-0905606; iPlant Collaborative (NSF), DBI-0735191; Biodiversity Synthesis Center (BioSync) of the Encyclopedia of Life
Computer Science
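The "prune away unneeded parts" step described in this abstract can be sketched as follows. This is a minimal illustration, not one of the hackathon's pruners: it assumes a tree stored as nested Python tuples with tip names as strings, ignores branch lengths and annotations, and the function names and example taxa are hypothetical.

```python
def prune(tree, keep):
    """Restrict a nested-tuple tree to the tips named in keep.

    A tip is a str; an internal node is a tuple of subtrees. Internal nodes
    left with a single child are collapsed so the result has no unary nodes.
    Returns None if no requested tip occurs in the tree.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [c for c in (prune(child, keep) for child in tree) if c is not None]
    if not children:
        return None
    if len(children) == 1:
        return children[0]
    return tuple(children)


def to_newick(tree):
    """Serialise a nested-tuple tree as a topology-only Newick string."""
    def fmt(node):
        if isinstance(node, str):
            return node
        return "(" + ",".join(fmt(child) for child in node) + ")"
    return fmt(tree) + ";"


# Hypothetical "expert" phylogeny and a user's query list of taxa.
expert_tree = ((("Homo", "Pan"), "Gorilla"), ("Mus", "Rattus"))
query = {"Homo", "Gorilla", "Mus"}

custom_tree = prune(expert_tree, query)
print(to_newick(custom_tree))
```

A fuller pipeline of the kind the abstract outlines would first reconcile non-standard names in the query against the tree's tip labels, and would carry branch lengths through the pruning step by summing lengths across collapsed nodes.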
Phylotastic! Making tree-of-life knowledge accessible, reusable and convenient
Background: Scientists rarely reuse expert knowledge of phylogeny, in spite of years of effort to assemble a great
“Tree of Life” (ToL). A notable exception involves the use of Phylomatic, which provides tools to generate custom
phylogenies from a large, pre-computed, expert phylogeny of plant taxa. This suggests great potential for a more
generalized system that, starting with a query consisting of a list of any known species, would rectify non-standard
names, identify expert phylogenies containing the implicated taxa, prune away unneeded parts, and supply branch
lengths and annotations, resulting in a custom phylogeny suited to the user’s needs. Such a system could become
a sustainable community resource if implemented as a distributed system of loosely coupled parts that interact
through clearly defined interfaces.
Results: With the aim of building such a “phylotastic” system, the NESCent Hackathons, Interoperability, Phylogenies
(HIP) working group recruited 2 dozen scientist-programmers to a weeklong programming hackathon in June 2012.
During the hackathon (and a three-month follow-up period), 5 teams produced designs, implementations,
documentation, presentations, and tests including: (1) a generalized scheme for integrating components; (2)
proof-of-concept pruners and controllers; (3) a meta-API for taxonomic name resolution services; (4) a system for storing,
finding, and retrieving phylogenies using semantic web technologies for data exchange, storage, and querying; (5) an
innovative new service, DateLife.org, which synthesizes pre-computed, time-calibrated phylogenies to assign ages to
nodes; and (6) demonstration projects. These outcomes are accessible via a public code repository (GitHub.com), a
website (www.phylotastic.org), and a server image.
Conclusions: Approximately 9 person-months of effort (centered on a software development hackathon) resulted in
the design and implementation of proof-of-concept software for 4 core phylotastic components, 3 controllers, and 3
end-user demonstration tools. While these products have substantial limitations, they suggest considerable potential
for a distributed system that makes phylogenetic knowledge readily accessible in computable form. Widespread use of
phylotastic systems will create an electronic marketplace for sharing phylogenetic knowledge that will spur innovation
in other areas of the ToL enterprise, such as annotation of sources and methods and third-party methods of quality
assessment.
Keywords: Web services, Taxonomy, Data reuse, Phylogeny, Tree of life, Hackathon