31 research outputs found
DNA barcoding and taxonomy: dark taxa and dark texts
Both classical taxonomy and DNA barcoding are engaged in the task of digitizing the living world. Much of the taxonomic literature remains undigitized. The rise of open access publishing this century and the freeing of older literature from the shackles of copyright have greatly increased the online availability of taxonomic descriptions, but much of the literature of the mid- to late-twentieth century remains offline (‘dark texts’). DNA barcoding is generating a wealth of computable data that in many ways are much easier to work with than classical taxonomic descriptions, but many of the sequences are not identified to species level. These ‘dark taxa’ hamper the classical method of integrating biodiversity data, using shared taxonomic names. Voucher specimens are a potential common currency of both the taxonomic literature and sequence databases, and could be used to help link names, literature and sequences. An obstacle to this approach is the lack of stable, resolvable specimen identifiers. The paper concludes with an appeal for a global ‘digital dashboard’ to assess the extent to which biodiversity data are available online.
This article is part of the themed issue ‘From DNA barcodes to biomes’
Ozymandias: a biodiversity knowledge graph
Enormous quantities of biodiversity data are being made available online, but much of this data remains isolated in silos. One approach to breaking these silos is to map local, often database-specific identifiers to shared global identifiers. This mapping can then be used to construct a knowledge graph, where entities such as taxa, publications, people, places, specimens, sequences, and institutions are all part of a single, shared knowledge space. Motivated by the 2018 GBIF Ebbe Nielsen Challenge I explore the feasibility of constructing a “biodiversity knowledge graph” for the Australian fauna. The data cleaning and reconciliation steps involved in constructing the knowledge graph are described in detail. Examples are given of its application to understanding changes in patterns of taxonomic publication over time. A web interface to the knowledge graph (called “Ozymandias”) is available at https://ozymandias-demo.herokuapp.com
An interactive DNA barcode browser
This paper describes an interactive web application to display DNA barcode data. It supports both query by sequence and query by geographic area. By using n-gram indexing of DNA sequences, and alignment-free phylogeny construction, the user can interactively explore DNA barcode data in real time
Training and hackathon on building biodiversity knowledge graphs
Knowledge graphs have the potential to unite disconnected digitized biodiversity data, and there are a number of efforts underway to build biodiversity knowledge graphs. More generally, the recent popularity of knowledge graphs, driven in part by the advent and success of the Google Knowledge Graph, has breathed life into the ongoing development of semantic web infrastructure and prototypes in the biodiversity informatics community. We describe a one week training event and hackathon that focused on applying three specific knowledge graph technologies – the Neptune graph database; Metaphactory; and Wikidata - to a diverse set of biodiversity use cases.
We give an overview of the training, the projects that were advanced throughout the week, and the critical discussions that emerged. We believe that the main barriers towards adoption of biodiversity knowledge graphs are the lack of understanding of knowledge graphs and the lack of adoption of shared unique identifiers. Furthermore, we believe an important advancement in the outlook of knowledge graph development is the emergence of Wikidata as an identifier broker and as a scoping tool. To remedy the current barriers towards biodiversity knowledge graph development, we recommend continued discussions at workshops and at conferences, which we expect to increase awareness and adoption of knowledge graph technologies
People are essential to linking biodiversity data
People are one of the best known and most stable entities in the biodiversity knowledge graph. The wealth of public information associated with people and the ability to identify them uniquely open up the possibility to make more use of these data in biodiversity science. Person data are almost always associated with entities such as specimens, molecular sequences, taxonomic names, observations, images, traits and publications. For example, the digitization and the aggregation of specimen data from museums and herbaria allow us to view a scientist’s specimen collecting in conjunction with the whole corpus of their works. However, the metadata of these entities are also useful in validating data, integrating data across collections and institutional databases and can be the basis of future research into biodiversity and science. In addition, the ability to reliably credit collectors for their work has the potential to change the incentive structure to promote improved curation and maintenance of natural history collections
Phyloinformatics: toward a phylogenetic database
Much of the interest in the “tree of life” is motivated by the notion that we can make much more meaningful use of biological information if we query the information in a phylogenetic framework. Assembling the tree of life raises numerous computational and data management issues. Biologists are generating large numbers of evolutionary trees (phylogenies). In contrast to sequence data, very few phylogenies (and the data from which they were derived) are stored in publicly accessible databases. Part of the reason is the need to develop new methods for storing, querying, and visualizing trees. This chapter explores some of these issues; it discusses some prototypes with a view to determining how far phylogenetics is toward its goal of a phylogenetic database
Cospeciation
Cospeciation is joint speciation of both host and parasite, resulting in host and parasite phylogenies being mirror images of each other
Ozymandias: a biodiversity knowledge graph
Enormous quantities of biodiversity data are being made available online, but much of this data remains isolated in silos. One approach to breaking these silos is to map local, often database-specific identifiers to shared global identifiers. This mapping can then be used to construct a knowledge graph, where entities such as taxa, publications, people, places, specimens, sequences, and institutions are all part of a single, shared knowledge space. Motivated by the 2018 GBIF Ebbe Nielsen Challenge I explore the feasibility of constructing a “biodiversity knowledge graph” for the Australian fauna. The data cleaning and reconciliation steps involved in constructing the knowledge graph are described in detail. Examples are given of its application to understanding changes in patterns of taxonomic publication over time. A web interface to the knowledge graph (called “Ozymandias”) is available at https://ozymandias-demo.herokuapp.com.</jats:p
Taxonomy, supertrees, and the tree of life
Some of the main practical impediments to the application of supertrees in large-scale phylogenetic analysis are inconsistent use of taxonomic names, trees incorporating taxa of different ranks, and poor taxonomic overlap between different phylogenetic studies. This chapter considers these problems and suggests some solutions. The notion of a “classification graph” is introduced to test for consistency between higher-level classifications. One strategy for coping with poor taxonomic overlap is to use a constraint tree that specifies some taxonomic groups that must appear in the supertree
Panbiogeography: a cladistic approach
This thesis develops a quantitative cladistic approach to panbiogeography. Algorithms for constructing and comparing area cladograms are developed and implemented in a computer program. Examples of the use of this software are described. The principle results of this thesis are: (1) The description of algorithms for implementing Nelson and Platnick's (1981) methods for constructing area cladograms. These algorithms have been incorporated into a computer program. (2) Zandee and Roos' (1987) methods based on "component-compatibility" are shown to be flawed. (3) Recent criticisms of Nelson and Platnick's methods by E. O. Wiley are rebutted. (4) A quantitative reanalysis of Hafner and Nadler's (1988) allozyme data for gophers and their parasitic lice illustrates the utility of information on timing of speciation events in interpreting apparent incongruence between host and parasite cladograms. In addition the thesis contains a survey of some current themes in biogeography, a reply to criticisms of my earlier work on track analysis, and an application of bootstrap and consensus methods to place confidence limits on estimates of cladograms
