
    Applications of Natural Language Processing in Biodiversity Science

    Centuries of biological knowledge are contained in the massive body of scientific literature, written for human readability but too big for any one person to consume. Large-scale mining of information from the literature is necessary if biology is to transform into a data-driven science. A computer can handle the volume but cannot make sense of the language. This paper reviews and discusses the use of natural language processing (NLP) and machine-learning algorithms to extract information from systematic literature. NLP algorithms have been used for decades, but require special development for application in the biological realm due to the special nature of the language. Many tools exist for biological information extraction (cellular processes, taxonomic names, and morphological characters), but none have been applied life-wide, and most still require testing and development. Progress has been made in developing algorithms for automated annotation of taxonomic text, identification of taxonomic names in text, and extraction of morphological character information from taxonomic descriptions. This manuscript will briefly discuss the key steps in applying information extraction tools to enhance biodiversity science.

    FAIR data representation in times of eScience: a comparison of instance-based and class-based semantic representations of empirical data using phenotype descriptions as example

    Background: The size, velocity, and heterogeneity of Big Data outclass conventional data management tools and require data and metadata to be fully machine-actionable (i.e., eScience-compliant) and thus findable, accessible, interoperable, and reusable (FAIR). This can be achieved by using ontologies and through representing them as semantic graphs. Here, we discuss two different semantic graph approaches of representing empirical data and metadata in a knowledge graph, with phenotype descriptions as an example. Almost all phenotype descriptions are still being published as unstructured natural language texts, with far-reaching consequences for their FAIRness, substantially impeding their overall usability within the life sciences. However, with an increasing number of anatomy ontologies becoming available and semantic applications emerging, a solution to this problem becomes available. Researchers are starting to document and communicate phenotype descriptions through the Web in the form of highly formalized and structured semantic graphs that use ontology terms and Uniform Resource Identifiers (URIs) to circumvent the problems connected with unstructured texts. Results: Using phenotype descriptions as an example, we compare and evaluate two basic representations of empirical data and their accompanying metadata in the form of semantic graphs: the class-based TBox semantic graph approach called Semantic Phenotype and the instance-based ABox semantic graph approach called Phenotype Knowledge Graph. Their main difference is that only the ABox approach allows for identifying every individual part and property mentioned in the description in a knowledge graph. This technical difference results in substantial practical consequences that significantly affect the overall usability of empirical data. The consequences affect findability, accessibility, and explorability of empirical data as well as their comparability, expandability, universal usability and reusability, and overall machine-actionability. Moreover, TBox semantic graphs often require querying under entailment regimes, which is computationally more complex. Conclusions: We conclude that, from a conceptual point of view, the advantages of the instance-based ABox semantic graph approach outweigh its shortcomings and outweigh the advantages of the class-based TBox semantic graph approach. Therefore, we recommend the instance-based ABox approach as a FAIR approach for documenting and communicating empirical data and metadata in a knowledge graph.

    Genomic Changes Underlying Adaptive Traits and Reproductive Isolation Between Young Species of Cyprinodon Pupfishes

    Adaptive radiations showcase dramatic instances of biological diversification resulting from ecological speciation, which occurs when reproductive isolation evolves as a by-product of adaptive divergence between populations. While this process seems widespread and may account for much of life’s diversity, there is little known about genomic differences between species that influence differences in phenotypes and contribute to reproductive barriers. In my dissertation work, I used a variety of evolutionary genomic methods to study the genetic basis of rapid ecological speciation within an adaptive radiation of Cyprinodon pupfish endemic to San Salvador Island, Bahamas, which consists of a dietary generalist species and two trophic specialists – a molluscivore and a scale-eater. In my first chapter, I combined genome-wide divergence scans, selection scans, and association mapping to discover loci that were highly diverged between species, showed signs of recent selection, and were associated with variation in jaw size – the primary axis of phenotypic divergence in this system. In my second chapter, I found that the scale-eater and molluscivore species showed similar gene expression patterns compared to the generalist species, providing the first evidence of parallel changes in gene expression underlying adaptation to divergent niches. These findings indicated convergent adaptation to higher trophic levels through shared genetic pathways. In my third and fourth chapters, I measured gene expression levels in F1 hybrids generated from crosses between San Salvador species. Intriguingly, many genes that were differentially expressed between sympatric species were also misregulated in their F1 hybrids. These results indicate that divergent ecological selection in sympatry can drive hybrid gene misregulation, which may act as a primary reproductive barrier between nascent species. In my fifth chapter, I combined whole-genome resequencing data with total mRNA sequencing to identify candidate cis-acting genetic variation influencing rapidly evolving craniofacial phenotypes. I found very few alleles fixed between species – only 157 SNPs and 87 deletions. By measuring allele-specific expression in F1 hybrids, I found strong evidence for cis-regulatory alleles affecting expression divergence of genes with putative effects on skeletal development. These results highlight the utility of the San Salvador pupfish system as an evolutionary model for craniofacial development.
    Doctor of Philosophy