78 research outputs found
Applications of Natural Language Processing in Biodiversity Science
Centuries of biological knowledge are contained in the massive body of scientific literature, written for human-readability but too big for any one person to consume. Large-scale mining of information from the literature is necessary if biology is to transform into a data-driven science.
A computer can handle the volume but cannot make sense of the language. This paper reviews and discusses the use of natural language processing (NLP) and machine-learning algorithms to extract information from systematic literature. NLP algorithms have been used for decades, but require special development for application in the biological realm due to the special nature of the language. Many tools exist for biological information extraction (cellular processes, taxonomic names, and morphological characters), but none have been applied life wide and most still require testing and development. Progress has been made in developing algorithms for automated annotation of taxonomic text, identification of taxonomic names in text, and extraction of morphological character information from taxonomic descriptions. This manuscript will briefly discuss the key steps in applying information extraction tools to enhance biodiversity science
The taxonomic significance of species that have only been observed once : the genus Gymnodinium (Dinoflagellata) as an example
© The Author(s), 2012. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published in PLoS ONE 7 (2012): e44015, doi:10.1371/journal.pone.0044015. Taxonomists have been tasked with cataloguing and quantifying the Earth’s biodiversity. Their progress is measured in code-compliant species descriptions that include text, images, type material and molecular sequences. It is from this material that other researchers are to identify individuals of the same species in future observations. It has been estimated that 13% to 22% (depending on taxonomic group) of described species have only ever been observed once. Species that have only been observed at the time and place of their original description are referred to as oncers. Oncers are important to our current understanding of biodiversity. They may be validly described species that are members of a rare biosphere, or they may indicate endemism, or that these species are limited to very constrained niches. Alternatively, they may reflect that taxonomic practices are too poor to allow the organism to be re-identified or that the descriptions are unknown to other researchers. If the latter are true, our current tally of species will not be an accurate indication of what we know. In order to investigate this phenomenon and its potential causes, we examined the microbial eukaryote genus Gymnodinium. This genus contains 268 extant species, 103 (38%) of which have not been observed since their original description. We report traits of the original descriptions and interpret them in respect to the status of the species. We conclude that the majority of oncers were poorly described and their identity is ambiguous. As a result, we argue that the genus Gymnodinium contains only 234 identifiable species. Species that have been observed multiple times tend to have longer descriptions, written in English. 
The styles of individual authors have a major effect, with a few authors describing a disproportionate number of oncers. The information about the taxonomy of Gymnodinium that is available via the internet is incomplete, and reliance on it will not give access to all necessary knowledge. Six new names are presented – Gymnodinium campbelli for the homonymous name Gymnodinium translucens Campbell 1973, Gymnodinium antarcticum for the homonymous name Gymnodinium frigidum Balech 1965, Gymnodinium manchuriensis for the homonymous name Gymnodinium autumnale Skvortzov 1968, Gymnodinium christenum for the homonymous name Gymnodinium irregulare Christen 1959, Gymnodinium conkufferi for the homonymous name Gymnodinium irregulare Conrad & Kufferath 1954 and Gymnodinium chinensis for the homonymous name Gymnodinium frigidum Skvortzov 1968. This work was funded by grants from the John D and Catherine T MacArthur Foundation and the Alfred P Sloan Foundation to the Encyclopedia of Life and the National Science Foundation Data Net Program 0830976 and Global Names Project DBI-1062387
A new paradigm for the scientific enterprise: nurturing the ecosystem [version 1; referees: 2 approved]
The institutions of science are in a state of flux. Declining public funding for basic science, the increasingly corporatized administration of universities, increasing “adjunctification” of the professoriate and poor academic career prospects for postdoctoral scientists indicate a significant mismatch between the reality of the market economy and expectations in higher education for science. Solutions to these issues typically revolve around the idea of fixing the career "pipeline", which is envisioned as being a pathway from higher-education training to a coveted permanent position, and then up a career ladder until retirement. In this paper, we propose and describe the term “ecosystem” as a more appropriate way to conceptualize today’s scientific training and the professional landscape of the scientific enterprise. First, we highlight the issues around the concept of “fixing the pipeline”. Then, we articulate our ecosystem metaphor by describing a series of concrete design patterns that draw on peer-to-peer, decentralized, cooperative, and commons-based approaches for creating a new dynamic scientific enterprise
20 GB in 10 minutes: a case for linking major biodiversity databases using an open socio-technical infrastructure and a pragmatic, cross-institutional collaboration
Biodiversity information is made available through numerous databases that each have their own data models, web services, and data types. Combining data across databases leads to new insights, but is not easy because each database uses its own system of identifiers. In the absence of stable and interoperable identifiers, databases are often linked using taxonomic names. This labor-intensive, error-prone, and lengthy process relies on accessible versions of nomenclatural authorities and fuzzy-matching algorithms. To approach the challenge of linking diverse data, more than technology is needed: concrete progress on finding relationships between biodiversity datasets also requires new social collaborations. One such collaboration, the Global Unified Open Data Architecture (GUODA), combines skills from diverse groups of computer engineers from iDigBio, server resources from the Advanced Computing and Information Systems (ACIS) Lab, global-scale data presentation from EOL, and independent developers and researchers. This paper will discuss a technical solution developed by the GUODA collaboration for faster linking across databases with a use case linking Wikidata and the Global Biotic Interactions database (GloBI). The GUODA infrastructure is a 12-node, high performance computing cluster made up of about 192 threads with 12 TB of storage and 288 GB of memory. Using GUODA, 20 GB of compressed JSON from Wikidata was processed and linked to GloBI in about 10–11 min. Instead of comparing name strings or relying on a single identifier, Wikidata and GloBI were linked by comparing graphs of biodiversity identifiers external to each system. This method resulted in adding 119,957 Wikidata links in GloBI, an increase of 13.7% of all outgoing name links in GloBI. Wikidata and GloBI were compared to Open Tree of Life Reference Taxonomy to examine consistency and coverage. 
The process of parsing Wikidata, Open Tree of Life Reference Taxonomy and GloBI archives and calculating consistency metrics was done in minutes on the GUODA platform. As a model collaboration, GUODA has the potential to revolutionize biodiversity science by bringing diverse technically minded people together with high performance computing resources that are accessible from a laptop or desktop. However, participating in such a collaboration still requires basic programming skills
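The idea of linking two databases through identifiers external to both, rather than through name strings, can be illustrated with a minimal Python sketch. It covers only the simplest one-hop case (two records are linked when they share at least one external identifier) and is not the GUODA implementation; the record keys and identifier values are hypothetical placeholders.

```python
from collections import defaultdict


def link_by_shared_identifiers(source_a, source_b):
    """Pair records from two sources that share an external identifier.

    Each source maps a local record key to a set of external identifiers.
    Returns a set of (key_in_a, key_in_b) tuples.
    """
    # Index source B by external identifier so each lookup is O(1).
    index_b = defaultdict(set)
    for key_b, external_ids in source_b.items():
        for ext_id in external_ids:
            index_b[ext_id].add(key_b)

    links = set()
    for key_a, external_ids in source_a.items():
        for ext_id in external_ids:
            # .get avoids creating empty entries in the defaultdict.
            for key_b in index_b.get(ext_id, ()):
                links.add((key_a, key_b))
    return links


# Hypothetical records: all keys and identifiers are made up for illustration.
wikidata_like = {
    "wd-A": {"NCBI:0001", "ITIS:0101"},
    "wd-B": {"NCBI:0002"},
    "wd-C": {"EOL:0301"},  # shares no identifier with the other source
}
globi_like = {
    "gl-1": {"NCBI:0001"},
    "gl-2": {"ITIS:0101", "GBIF:0201"},
    "gl-3": {"NCBI:0002", "GBIF:0202"},
}

links = link_by_shared_identifiers(wikidata_like, globi_like)
print(sorted(links))
```

Because matching is on identifiers rather than spellings, a record can be linked even when the two sources disagree on the name string; records such as `wd-C` that share no identifier remain unlinked.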
The influence of droplet size and biodegradation on the transport of subsurface oil droplets during the Deepwater Horizon: a model sensitivity study
A better understanding of oil droplet formation, degradation, and dispersal in deep waters is needed to enhance prediction of the fate and transport of subsurface oil spills. This research evaluates the influence of initial droplet size and rates of biodegradation on the subsurface transport of oil droplets, specifically those from the Deepwater Horizon oil spill. A three-dimensional coupled model was employed with components that included analytical multiphase plume, hydrodynamic and Lagrangian models. Oil droplet biodegradation was simulated based on first order decay rates of alkanes. The initial diameter of droplets (10–300 μm) spanned a range of sizes expected from dispersant-treated oil. Results indicate that model predictions are sensitive to biodegradation processes, with depth distributions deepening by hundreds of meters, horizontal distributions decreasing by hundreds to thousands of kilometers, and mass decreasing by 92–99% when biodegradation is applied compared to simulations without biodegradation. In addition, there are two- to four-fold changes in the area of the seafloor contacted by oil droplets among scenarios with different biodegradation rates. The spatial distributions of hydrocarbons predicted by the model with biodegradation are similar to those observed in the sediment and water column, although the model predicts hydrocarbons to the northeast and east of the well where no observations were made. This study indicates that improvement in knowledge of droplet sizes and biodegradation processes is important for accurate prediction of subsurface oil spills. National Science Foundation (U.S.) (RAPID: Deepwater Horizon Grant OCE-1048630); National Science Foundation (U.S.) (RAPID: Deepwater Horizon Grant OCE-1044573); National Science Foundation (U.S.) (RAPID: Deepwater Horizon Grant CBET-1045831); Gulf of Mexico Research Initiative
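First-order decay, as named in the abstract, means that mass follows m(t) = m0 · exp(−k·t). The Python sketch below shows what decay rate would be needed for a given mass loss if decay alone accounted for it. The 60-day horizon is a hypothetical value chosen for illustration, not a parameter of the study, and the coupled model itself is not reproduced here.

```python
import math


def remaining_fraction(rate_per_day: float, days: float) -> float:
    """Fraction of initial mass left under first-order decay: exp(-k * t)."""
    return math.exp(-rate_per_day * days)


def rate_for_loss(loss_fraction: float, days: float) -> float:
    """Decay rate k (per day) that removes `loss_fraction` of the mass in `days`."""
    if not 0.0 <= loss_fraction < 1.0:
        raise ValueError("loss_fraction must be in [0, 1)")
    return -math.log(1.0 - loss_fraction) / days


HORIZON_DAYS = 60.0  # hypothetical simulation length, not taken from the study

for loss in (0.92, 0.99):
    k = rate_for_loss(loss, HORIZON_DAYS)
    half_life = math.log(2.0) / k
    print(f"loss={loss:.0%}  k={k:.4f}/day  half-life={half_life:.1f} days")
```

Whatever horizon is assumed, the dimensionless product k·t is about 2.5 for a 92% loss and about 4.6 for a 99% loss, so under this simplification the two ends of the reported range differ by less than a factor of two in decay rate.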
Using knowledge graphs to infer gene expression in plants
INTRODUCTION: Climate change is already affecting ecosystems around the world and forcing us to adapt to meet societal needs. The speed with which climate change is progressing necessitates a massive scaling up of the number of species with understood genotype-environment-phenotype (G×E×P) dynamics in order to increase ecosystem and agriculture resilience. An important part of predicting phenotype is understanding the complex gene regulatory networks present in organisms. Previous work has demonstrated that knowledge about one species can be applied to another using ontologically-supported knowledge bases that exploit homologous structures and homologous genes. These types of structures that can apply knowledge about one species to another have the potential to enable the massive scaling up that is needed through in silico experimentation.
METHODS: We developed one such structure, a knowledge graph (KG) using information from Planteome and the EMBL-EBI Expression Atlas that connects gene expression, molecular interactions, functions, and pathways to homology-based gene annotations. Our preliminary analysis uses data from gene expression studies in Arabidopsis thaliana and Populus trichocarpa plants exposed to drought conditions.
RESULTS: A graph query identified 16 pairs of homologous genes in these two taxa, some of which show opposite patterns of gene expression in response to drought. As expected, analysis of the upstream cis-regulatory region of these genes revealed that homologs with similar expression behavior had conserved cis-regulatory regions and potential interaction with similar trans-elements, unlike homologs that changed their expression in opposite ways.
DISCUSSION: This suggests that even though the homologous pairs share common ancestry and functional roles, predicting expression and phenotype through homology inference needs careful consideration of integrating cis and trans-regulatory components in the curated and inferred knowledge graph
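The kind of query described in the results, finding homologous gene pairs and separating those with the same drought response from those with opposite responses, can be sketched in Python over a toy in-memory graph. This is an illustrative sketch only: the gene identifiers and log2 fold-change values are hypothetical, and the actual knowledge graph and query language used by the authors are not shown in the abstract.

```python
def classify_homolog_pairs(homolog_pairs, log2_fold_change):
    """Split homolog pairs by whether their expression changes agree in sign.

    homolog_pairs: iterable of (gene_a, gene_b) homology edges.
    log2_fold_change: dict mapping gene -> log2 fold change under drought.
    Pairs with a missing or zero value are skipped.
    """
    concordant, opposite = [], []
    for gene_a, gene_b in homolog_pairs:
        if gene_a not in log2_fold_change or gene_b not in log2_fold_change:
            continue  # no expression evidence for one side of the edge
        product = log2_fold_change[gene_a] * log2_fold_change[gene_b]
        if product > 0:
            concordant.append((gene_a, gene_b))
        elif product < 0:
            opposite.append((gene_a, gene_b))
    return concordant, opposite


# Hypothetical homology edges between two species (placeholder identifiers).
pairs = [("At_g1", "Pt_g1"), ("At_g2", "Pt_g2"), ("At_g3", "Pt_g3")]

# Hypothetical drought responses; "Pt_g3" deliberately has no measurement.
log2fc = {
    "At_g1": 1.8, "Pt_g1": 2.1,   # both up-regulated
    "At_g2": 1.2, "Pt_g2": -0.9,  # opposite directions
    "At_g3": -1.5,
}

concordant, opposite = classify_homolog_pairs(pairs, log2fc)
print("same direction:", concordant)
print("opposite direction:", opposite)
```

Pairs in the `opposite` list are the ones for which, following the abstract's discussion, homology alone would be a poor predictor of expression and the cis- and trans-regulatory context would need to be examined.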
The Environmental Conditions, Treatments, and Exposures Ontology (ECTO): connecting toxicology and exposure to human health and beyond.
BACKGROUND: Evaluating the impact of environmental exposures on organism health is a key goal of modern biomedicine and is critically important in an age of greater pollution and chemicals in our environment. Environmental health utilizes many different research methods and generates a variety of data types. However, to date, no comprehensive database represents the full spectrum of environmental health data. Due to a lack of interoperability between databases, tools for integrating these resources are needed. In this manuscript we present the Environmental Conditions, Treatments, and Exposures Ontology (ECTO), a species-agnostic ontology focused on exposure events that occur as a result of natural and experimental processes, such as diet, work, or research activities. ECTO is intended for use in harmonizing environmental health data resources to support cross-study integration and inference for mechanism discovery.
METHODS AND FINDINGS: ECTO is an ontology designed for describing organismal exposures such as toxicological research, environmental variables, dietary features, and patient-reported data from surveys. ECTO utilizes the base model established within the Exposure Ontology (ExO). ECTO is developed using a combination of manual curation and Dead Simple OWL Design Patterns (DOSDP), and contains over 2700 environmental exposure terms, and incorporates chemical and environmental ontologies. ECTO is an Open Biological and Biomedical Ontology (OBO) Foundry ontology that is designed for interoperability, reuse, and axiomatization with other ontologies. ECTO terms have been utilized in axioms within the Mondo Disease Ontology to represent diseases caused or influenced by environmental factors, as well as for survey encoding for the Personalized Environment and Genes Study (PEGS).
CONCLUSIONS: We constructed ECTO to meet Open Biological and Biomedical Ontology (OBO) Foundry principles to increase translation opportunities between environmental health and other areas of biology. ECTO has a growing community of contributors consisting of toxicologists, public health epidemiologists, and health care providers to provide the necessary expertise for areas that have been identified previously as gaps
Transforming the study of organisms: Phenomic data models and knowledge bases
The rapidly decreasing cost of gene sequencing has resulted in a deluge of genomic data from across the tree of life; however, outside a few model organism databases, genomic data are limited in their scientific impact because they are not accompanied by computable phenomic data. The majority of phenomic data are contained in countless small, heterogeneous phenotypic data sets that are very difficult or impossible to integrate at scale because of variable formats, lack of digitization, and linguistic problems. One powerful solution is to represent phenotypic data using data models with precise, computable semantics, but adoption of semantic standards for representing phenotypic data has been slow, especially in biodiversity and ecology. Some phenotypic and trait data are available in a semantic language from knowledge bases, but these are often not interoperable. In this review, we will compare and contrast existing ontology and data models, focusing on nonhuman phenotypes and traits. We discuss barriers to integration of phenotypic data and make recommendations for developing an operationally useful, semantically interoperable phenotypic data ecosystem
People are essential to linking biodiversity data
People are one of the best known and most stable entities in the biodiversity knowledge graph. The wealth of public information associated with people and the ability to identify them uniquely open up the possibility to make more use of these data in biodiversity science. Person data are almost always associated with entities such as specimens, molecular sequences, taxonomic names, observations, images, traits and publications. For example, the digitization and the aggregation of specimen data from museums and herbaria allow us to view a scientist’s specimen collecting in conjunction with the whole corpus of their works. However, the metadata of these entities are also useful in validating data, integrating data across collections and institutional databases and can be the basis of future research into biodiversity and science. In addition, the ability to reliably credit collectors for their work has the potential to change the incentive structure to promote improved curation and maintenance of natural history collections