
    Applications of Natural Language Processing in Biodiversity Science

    Centuries of biological knowledge are contained in the massive body of scientific literature, written for human readability but too large for any one person to consume. Large-scale mining of information from the literature is necessary if biology is to transform into a data-driven science. A computer can handle the volume but cannot make sense of the language. This paper reviews and discusses the use of natural language processing (NLP) and machine-learning algorithms to extract information from systematic literature. NLP algorithms have been used for decades, but they require special development for biological applications because of the specialized nature of the language. Many tools exist for biological information extraction (of cellular processes, taxonomic names, and morphological characters), but none has been applied life-wide, and most still require testing and development. Progress has been made in developing algorithms for automated annotation of taxonomic text, identification of taxonomic names in text, and extraction of morphological character information from taxonomic descriptions. The manuscript closes with a brief discussion of the key steps in applying information-extraction tools to enhance biodiversity science.
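
    To give a flavor of the name-identification task discussed above, here is a minimal sketch of a candidate finder for Latin binomials. The regex and sample text are illustrative assumptions, a toy stand-in for the dictionary- and machine-learning-based recognizers the paper reviews.

```python
import re

# Simplified candidate pattern for Latin binomials: a capitalized genus
# (or its abbreviation, e.g. "B.") followed by a lowercase specific epithet.
# Real recognizers use far richer evidence; this regex is illustrative only.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+|[A-Z]\.)\s([a-z]{3,})\b")

def find_name_candidates(text: str):
    """Return raw binomial candidates found in free text."""
    return [m.group(0) for m in BINOMIAL.finditer(text)]

sample = ("Specimens of Bulbophyllum nocturnum and B. echinolabium were "
          "collected. The species flowers at night.")
print(find_name_candidates(sample))
# ['Bulbophyllum nocturnum', 'B. echinolabium', 'The species']
# Note the false positive "The species": verifying candidates against
# name databases is the necessary second step.
```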

    Site-specific mutagenesis of Drosophila proliferating cell nuclear antigen enhances its effects on calf thymus DNA polymerase δ

    BACKGROUND: We and others have shown four distinct and presumably related effects of mammalian proliferating cell nuclear antigen (PCNA) on DNA synthesis catalyzed by mammalian DNA polymerase δ (pol δ). In the presence of homologous PCNA, pol δ exhibits 1) increased absolute activity; 2) increased processivity of DNA synthesis; 3) stable binding of synthetic oligonucleotide template-primers (t½ of the pol δ•PCNA•template-primer complex ≥2.5 h); and 4) enhanced synthesis of DNA opposite and beyond template base lesions. This last effect is potentially mutagenic in vivo. Biochemical studies performed in parallel with in vivo genetic analyses would represent an extremely powerful approach for further investigating both DNA replication and repair in eukaryotes. RESULTS: Drosophila PCNA, although highly similar in structure to mammalian PCNA (e.g., it is >70% identical to human PCNA in amino acid sequence), can substitute only poorly for either calf thymus or human PCNA (~10% as well) in affecting calf thymus pol δ. However, by mutating one or only a few amino acids in the region of Drosophila PCNA thought to interact with pol δ, all four effects can be enhanced dramatically. CONCLUSIONS: Our results therefore suggest that all four effects above depend at least in part on the PCNA-pol δ interaction. Moreover, unlike mammals, Drosophila offers the potential for immediate in vivo genetic analyses. Although it has proven difficult to obtain sufficient amounts of homologous pol δ for parallel in vitro biochemical studies, by altering Drosophila PCNA through site-directed mutagenesis as suggested by our results, in vitro biochemical studies may now be performed using human and/or calf thymus pol δ preparations.

    SeaBase : a multispecies transcriptomic resource and platform for gene network inference

    Author Posting. © The Author(s), 2014. This is the author's version of the work. It is posted here by permission of Oxford University Press for personal use, not for redistribution. The definitive version was published in Integrative and Comparative Biology 54 (2014): 250-263, doi: 10.1093/icb/icu065. Marine and aquatic animals are extraordinarily useful as models for identifying mechanisms of development and evolution, regeneration, resistance to cancer, longevity, and symbiosis, among many other areas of research. This is due to the great diversity of these organisms and their wide-ranging capabilities. Genomics tools are essential for taking advantage of these “free lessons” of nature. However, genomics and transcriptomics are challenging in emerging model systems. Here, we present SeaBase, a tool for helping to meet these needs. Specifically, SeaBase provides a platform for sharing and searching transcriptome data. More importantly, SeaBase will support a growing number of tools for inferring gene network mechanisms. The first dataset available on SeaBase is a developmental transcriptome profile of the sea anemone Nematostella vectensis (Anthozoa, Cnidaria). Additional datasets are currently being prepared, and we aim to expand SeaBase to include user-supplied data for any number of marine and aquatic organisms, thereby supporting many potentially new models for gene network studies.
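
    For readers unfamiliar with gene network inference, a minimal sketch of one textbook baseline, a thresholded co-expression network built from an expression matrix, is shown below. This is a generic illustration, not SeaBase's own inference method; the toy data, gene names, and cutoff are assumptions.

```python
import numpy as np

# Co-expression baseline for gene network inference: correlate expression
# profiles (genes x time points) and keep strongly correlated pairs as edges.
rng = np.random.default_rng(0)
genes = ["gA", "gB", "gC", "gD"]
expr = rng.random((4, 12))       # toy data: 4 genes, 12 developmental stages
expr[1] = expr[0] * 0.9 + 0.1    # make gB track gA so one edge appears

corr = np.corrcoef(expr)         # gene-by-gene Pearson correlation matrix
THRESHOLD = 0.8                  # illustrative cutoff

edges = [(genes[i], genes[j], round(corr[i, j], 2))
         for i in range(len(genes)) for j in range(i + 1, len(genes))
         if abs(corr[i, j]) >= THRESHOLD]
print(edges)                     # e.g. [('gA', 'gB', 1.0)]
```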

    The taxonomic name resolution service : an online tool for automated standardization of plant names

    © The Author(s), 2013. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published in BMC Bioinformatics 14 (2013): 16, doi:10.1186/1471-2105-14-16. The digitization of biodiversity data is leading to the widespread application of taxon names that are superfluous, ambiguous or incorrect, resulting in mismatched records and inflated species numbers. The ultimate consequences of misspelled names and bad taxonomy are erroneous scientific conclusions and faulty policy decisions. The lack of tools for correcting this ‘names problem’ has become a fundamental obstacle to integrating disparate data sources and advancing the progress of biodiversity science. The TNRS, or Taxonomic Name Resolution Service, is an online application for automated and user-supervised standardization of plant scientific names. The TNRS builds upon and extends existing open-source applications for name parsing and fuzzy matching. Names are standardized against multiple reference taxonomies, including the Missouri Botanical Garden's Tropicos database. Capable of processing thousands of names in a single operation, the TNRS parses and corrects misspelled names and authorities, standardizes variant spellings, and converts nomenclatural synonyms to accepted names. Family names can be included to increase match accuracy and resolve many types of homonyms. Partial matching of higher taxa combined with extraction of annotations, accession numbers and morphospecies allows the TNRS to standardize taxonomy across a broad range of active and legacy datasets. We show how the TNRS can resolve many forms of taxonomic semantic heterogeneity, correct spelling errors and eliminate spurious names. As a result, the TNRS can aid the integration of disparate biological datasets. Although the TNRS was developed to aid in standardizing plant names, its underlying algorithms and design can be extended to all organisms and nomenclatural codes. The TNRS is accessible via a web interface at http://tnrs.iplantcollaborative.org/ and as a RESTful web service and application programming interface. Source code is available at https://github.com/iPlantCollaborativeOpenSource/TNRS/. BJE was supported by NSF grant DBI 0850373 and TR by CSIRO Marine and Atmospheric Research, Australia. BB and BJE acknowledge early financial support from Conservation International and TEAM, who funded the development of early prototypes of taxonomic name resolution. The iPlant Collaborative (http://www.iplantcollaborative.org) is funded by a grant from the National Science Foundation (#DBI-0735191).
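
    Since the abstract mentions a RESTful web service, a minimal client sketch may help. It is written under stated assumptions, not as the definitive interface: the endpoint path, the retrieve/names parameters, and the response field names reflect the historically documented TNRS service and should be checked against current documentation before use.

```python
import json
import urllib.parse
import urllib.request

# Minimal TNRS client sketch. Endpoint path, query parameters, and response
# field names are assumptions from the historically documented interface.
BASE = "http://tnrs.iplantcollaborative.org/tnrsm-svc/matchNames"

def resolve_names(names):
    """Submit a list of plant names, return the parsed JSON response."""
    query = urllib.parse.urlencode({"retrieve": "best",
                                    "names": ",".join(names)})
    with urllib.request.urlopen(f"{BASE}?{query}", timeout=60) as resp:
        return json.load(resp)

# Misspelled names on purpose: the service corrects spelling and converts
# synonyms to accepted names.
result = resolve_names(["Zea mayz", "Andropogon gerardi"])
for item in result.get("items", []):
    print(item.get("nameSubmitted"), "->", item.get("nameScientific"))
```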

    Connecting Taxonomic Backbones using Global Names Tools

    Biodiversity taxonomy provides a means to organize information about living organisms into maintainable tree- or graph-like structures (taxonomic backbones). Taxonomy is tightly bound to biodiversity nomenclature—a collection of recommendations, rules, and conventions for naming living organisms. Species are often considered the most important unit of taxonomic structures. Keeping the scientific names of species and other taxa accurate and up to date is a major challenge in the creation and maintenance of large taxonomic backbones. Global Names Architecture (Global Names) is an initiative that has developed tools and databases for detecting, parsing, and verifying scientific names. Its verification tools also report which taxonomic and nomenclatural resources contain information for a given scientific name. The taxonomic intelligence provided by the resources aggregated by Global Names makes it possible to resolve taxon names from different backbones, even if their "current" scientific names vary. Parsing scientific names with GNparser normalizes them and makes them comparable. Fast name matching (reconciliation) and discovery of taxonomic meaning (resolution) with GNverifier connect information from various resources. The most recently developed Global Names tools provide name verification and taxon matching on an unprecedented scale. During this presentation we will describe the Global Names tools and show how they can be used to reconcile lexical variants of scientific names, extract authorship metadata, verify and resolve names, and connect data to a variety of biodiversity resources.
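
    A short sketch shows how GNverifier's web API can be called to reconcile and resolve a name. The verifier.globalnames.org endpoint and the nameStrings payload field follow the published Global Names API, but both, as well as the response fields read below, should be treated as assumptions to check against current documentation.

```python
import json
import urllib.request

# Sketch of a GNverifier web API call. URL, payload field, and response
# fields follow the published Global Names verifier API; treat them as
# assumptions and verify against current documentation.
URL = "https://verifier.globalnames.org/api/v1/verifications"

payload = json.dumps({"nameStrings": ["Pomatomus soltator",
                                      "Bubo bubo"]}).encode()
req = urllib.request.Request(URL, data=payload,
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req, timeout=60) as resp:
    result = json.load(resp)

# "Pomatomus soltator" is a deliberate lexical variant; verification should
# reconcile it to the currently accepted name.
for name in result.get("names", []):
    best = name.get("bestResult") or {}
    print(name.get("name"), "->", best.get("currentCanonicalSimple"))
```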

    Biodiversity Heritage Library and Global Names: Successes, opportunities and the challenges for the future collaboration

    The Biodiversity Heritage Library (BHL) is a major aggregator of biodiversity literature with more than 200,000 volumes. The Global Names Architecture (GNA) strives to develop and provide tools for finding, parsing, and verifying scientific names. GNA and BHL have enjoyed 10 years of collaboration on the creation of a scientific names index for BHL. Such an index provides researchers with a means of finding data about more than a million species. Recently, BHL and GNA developed a workflow that creates an index covering more than 50 million BHL pages and finds and verifies scientific names in less than a day. The unprecedented speed of index creation opens an opportunity to dramatically increase its quality and reach. The following challenges can now be addressed.

    1. Abbreviated name reconciliation. From 20% to 25% of all scientific names in BHL are abbreviated. Abbreviated names are much harder to reconcile and verify because their specific epithets are not unique. We plan to reconcile the vast majority of such names via a statistical approach (see the sketch after this abstract).

    2. Linking biodiversity publication titles to actual pages in BHL. Scientific names are closely connected to publications of original description, taxonomic treatments, and other usages. We plan to build algorithms for disambiguating different lexical variants of the same publication reference and connecting them to the corresponding BHL pages.

    3. Using taxonomic intelligence to find information about species. According to our estimates, there are on average three scientific names (historical and current) per taxon. Names of species often change over time as a result of misspellings and homotypic or heterotypic synonymy. We plan to link outdated names to the currently accepted names of taxa. This functionality provides all the information about a taxon in BHL, no matter what names were used to reference the taxon at the time of publication.

    4. Finding information about original descriptions of genera and species. For every species there is a publication with the original description. We want to create an index of species that are described in the publications aggregated by BHL.

    5. Detecting species names in spite of "incorrect" capitalization. Previously, or in horticultural sources, specific epithets were often capitalized (e.g., Bulbophyllum Nocturnum), particularly for patronyms in which the species was named in honor of someone (e.g., Notiospathius Johnlennoni). We plan to detect names with non-standard capitalization of this sort.

    6. Removing false positives. Texts in Latin, names of people, and geographical entities often create false positives that look like scientific names. Machine learning techniques will allow us to detect and remove most of these errors from the names index.

    7. Detecting the names of biodiversity scientists and geographical entities in texts. Finding the names of biologists and geographical places, in addition to scientific names, would allow us to draw connections between these data and to create metadata demonstrating these links. We plan to add tools and algorithms for indexing person names and geographical names.

    In this talk I will present plans for a dramatic quality increase in the scientific name-finding algorithms, as well as other elements that would enhance the usability of BHL for its patrons.
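
    As referenced in challenge 1, here is a minimal sketch of abbreviated-name expansion. The greedy "most recently seen genus with the same initial" heuristic is an illustrative assumption, a simpler stand-in for the statistical approach the abstract plans.

```python
# Illustrative sketch: resolve an abbreviated genus such as "B." to the most
# recently seen full genus with the same initial on the page. The planned
# approach is statistical; this greedy heuristic only illustrates the problem.
def expand_abbreviations(names):
    """names: sequence of (genus, epithet) in page order; genus may be 'B.'."""
    last_seen = {}   # initial letter -> most recently seen full genus
    expanded = []
    for genus, epithet in names:
        if genus.endswith("."):                    # abbreviated, e.g. "B."
            full = last_seen.get(genus[0], genus)  # fall back to abbreviation
            expanded.append((full, epithet))
        else:
            last_seen[genus[0]] = genus
            expanded.append((genus, epithet))
    return expanded

page = [("Bulbophyllum", "nocturnum"), ("B.", "echinolabium"),
        ("Notiospathius", "johnlennoni"), ("N.", "paluma")]
print(expand_abbreviations(page))
# [('Bulbophyllum', 'nocturnum'), ('Bulbophyllum', 'echinolabium'),
#  ('Notiospathius', 'johnlennoni'), ('Notiospathius', 'paluma')]
```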

    Algorithms for connecting scientific names with literature in the Biodiversity Heritage Library via the Global Names Project and Catalogue of Life

    Being able to quickly find and access original species descriptions is essential for conducting taxonomic research efficiently. Linking scientific name queries to the original species description is challenging and requires taxonomic intelligence, because on average there are an estimated three scientific names associated with each currently accepted species, and many historical scientific names have fallen into disuse after being synonymized or forgotten. Additionally, non-standard usage of journal abbreviations can make it difficult to automatically disambiguate bibliographic citations and ascribe them to the correct publication. The largest open-access resource for biodiversity literature is the Biodiversity Heritage Library (BHL), which was built by a consortium of natural history institutions and contains over 200,000 digitized volumes of natural history publications spanning hundreds of years of biological research. Catalogue of Life (CoL) is the largest aggregator of scientific names globally, publishing an annual checklist of currently accepted scientific names and their historical synonyms. TaxonWorks is an integrative web-based workbench that facilitates collaboration on biodiversity informatics research between scientists and developers. The Global Names project has been collaborating with BHL, TaxonWorks, and CoL to develop a Global Names Index that links all of these services together by finding scientific names in BHL and using the taxonomic intelligence provided by CoL to link directly to the referenced page in BHL. The Global Names Index is continuously updated as metadata improves and digitization technologies advance to provide more accurate optical character recognition (OCR) of scanned texts. We developed an open-source tool, “BHLnames,” and launched a RESTful application programming interface (API) service with a freely available JavaScript widget that can be embedded on any website to link scientific names to literature citations in BHL. If no bibliographic citation is provided, the widget links to the oldest name usage in BHL, which is often the original species description. The BHLnames widget can also be used to browse all mentions of a scientific name and its synonyms in BHL, which could make the tool more broadly useful for studying the natural history of any species.
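
    A sketch of how the BHLnames API might be consumed follows. The abstract states only that a RESTful API exists; the host, path, and response shape below are hypothetical placeholders, not the documented interface.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical sketch of a BHLnames client. The host, path, and response
# shape are placeholders; only the existence of a RESTful API is stated
# in the abstract.
BASE = "https://bhlnames.example.org/api/v1/name_refs"  # placeholder URL

def bhl_references(name: str) -> dict:
    """Fetch BHL page references for a scientific name and its synonyms."""
    url = f"{BASE}/{urllib.parse.quote(name)}"
    with urllib.request.urlopen(url, timeout=60) as resp:
        return json.load(resp)

refs = bhl_references("Pomatomus saltatrix")
# Assumed shape: a list of references with links to BHL pages; the oldest
# usage would often be the original species description.
for ref in refs.get("references", [])[:5]:
    print(ref.get("year"), ref.get("pageUrl"))
```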

    Preservation Strategies for Biodiversity Data

    We are witnessing a fast proliferation of biodiversity informatics projects. The data accumulated by these initiatives often grows rapidly, even exponentially. Most of these projects start small and do not foresee the data architecture challenges of their future. Organizations may lack the necessary expertise and/or money to strategically address the care and feeding of this expanding data pile. In other cases, individuals with the expertise to address these needs may be present but lack the power, status, or bandwidth to take effective action. Over time, the data may grow so large that handling and preserving it becomes an almost insurmountable problem. The most common technical challenges include migrating data from one physical data storage to another, organizing backups, providing fast disaster recovery, and preparing data to be accessible for posterity. Sociotechnical and strategic hurdles to data stewardship include funding, data leadership (Stack and Stadolnik 2018) and vision (or the lack thereof), and organizational structure and culture. The biodiversity data collected today will be indispensable for future research, and it is our collective responsibility to preserve it for current and future generations.

    Some of the most common information-loss risk factors are the end of funding, the retirement of a researcher, or the departure of a critical researcher or programmer. Further risk factors, such as hardware malfunction, hurricanes, tornadoes, and severe magnetic storms, can destroy data carefully collected by large groups of people. The co-location of original data and backups creates a "Library of Alexandria," where a single disaster at that location can lead to permanent data loss and poses an existential threat to the project. Biodiversity data becomes more valuable over time and should survive for several centuries. However, SSD (solid-state drive) and HDD (hard disk drive) storage solutions have an expiration date of only a few years. We propose the following solutions (Fig. 1) to provide long-term data security.

    Technical tactics:

    1. Use immutable file storage for everything that was not entered very recently. Most biodiversity "big data" consists of files that are written once and never changed again. We suggest separating storage into a read-only part and small read/write sections. Data from the read/write section can be moved to the read-only part often, for example, daily.

    2. Use a copy-on-write file system, such as ZFS (Zettabyte File System). ZFS is widely used in industry and is known for its robustness and error resistance. It allows efficient incremental backups and much faster data transfer than other systems. Regular incremental backups can work even over slow internet connections. ZFS provides real-time data integrity checks and powerful tools for data healing.

    3. Split data and its backups into smaller chunks. Dividing backups into cost-effective 2-8 terabyte chunks allows backups to run on cheap hardware, with hardware costs dropping from tens of thousands of dollars (US) to less than two thousand dollars. We recognize that data storage costs drop with time, and larger chunks will be used. (A small sketch of this chunking tactic follows this abstract.)

    4. Split the data even further, to the size of the largest available long-term storage unit (currently an optical M-DISC). The write-once optical M-DISC is analogous to a Sumerian clay tablet. Data written on such discs does not deteriorate for hundreds of years. This option addresses the need for last-resort backups, because the storage does not depend on magnetic properties and is impervious to electromagnetic disasters. Optical discs can be easily and cheaply copied and distributed to libraries worldwide. In the future, the discs' data can be transferred to a different long-term storage medium. We also trust that these discs can be deciphered by those in the future, just like clay tablets.

    Sociotechnical insights:

    The comprehensive strategy above epitomizes "LOCKSS" (lots of copies keep stuff safe) and makes it clear that these copies need to be on multiple media types. Our suggestions here focus on projects that experience data growth pains. Such projects often look to see how others address these data needs. Recently, the Species File Group (SFG) did this exercise to evaluate and address our data growth needs (Mozzherin et al. 2023). We recognize and emphasize here the need for:

    1. personnel with the knowledge and skills to build, maintain, and evolve robust strategies and infrastructure to make data accessible and preserve it,

    2. funding to back the most suitable architectural strategies to do so, and

    3. people with expertise in long-term data security who have a seat at the leadership table in our organizations.

    We encourage our colleagues to evaluate the status of data leadership at their organizations (Stack and Stadolnik 2018, Kalms 2012). Implementing these suggestions will help ensure the survival of the data and accompanying software for hundreds of years to come.
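
    To make the chunking tactic concrete, here is a minimal sketch assuming a target medium capacity (roughly 100 GB for an optical disc) and a simple greedy packing strategy. The capacity, source path, and heuristic are illustrative assumptions, not a prescription.

```python
import hashlib
from pathlib import Path

# Illustrative sketch of "split backups into media-sized chunks": greedily
# pack read-only files into volumes no larger than a target medium and
# record a per-file checksum so integrity can be re-verified later.
VOLUME_BYTES = 100 * 10**9  # assumed capacity of one archival optical disc

def file_sha256(path: Path) -> str:
    """Checksum a file in 1 MiB blocks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()

def pack_volumes(root: Path):
    """Assign files under root to volumes; oversized files get their own."""
    volumes, current, used = [], [], 0
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        size = path.stat().st_size
        if current and used + size > VOLUME_BYTES:
            volumes.append(current)
            current, used = [], 0
        current.append((str(path), size, file_sha256(path)))
        used += size
    if current:
        volumes.append(current)
    return volumes

for i, vol in enumerate(pack_volumes(Path("/data/read-only")), start=1):
    print(f"volume {i}: {len(vol)} files, {sum(s for _, s, _ in vol):,} bytes")
```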

    A path to continuous reindexing of scientific names appearing in Biodiversity Heritage Library data

    The Biodiversity Heritage Library (BHL) is a massive, constantly expanding repository of Open Access biodiversity literature. It currently serves over 50 million pages of biological texts to the scientific community. Metadata attached to this textual data dramatically enhances its usefulness. One of the most important categories of metadata is an index of scientific names that binds names to pages in BHL. This index helps researchers find information about the species or higher taxa they study. Finding scientific names in 50 million pages is a challenging task, not only because of the scale but also because of the inevitable mistakes made during the process. Optical character recognition (OCR) mistakes, historical changes in common nomenclatural practices, name abbreviations, variations in the formats of scientific names, and so on all create potential problems for name indexing. The ability to cope with such problems depends mainly on the quality of the computer algorithms applied. Over time, improvements in OCR, error reports from BHL users, and improvements in name-finding algorithms make it beneficial to re-index the whole dataset. If we can regularly re-index BHL data, we can systematically improve the quality of the index. However, with the tools that existed until now, such a task was impractical. In 2012 our team at the Global Names Architecture (GNA) project completed a full index of BHL. The process took more than 40 days, during which the whole toolset of name-finding, name-parsing, and name-verification programs was unavailable for public use. Repeating such an extensive exercise has not been feasible since then. We think it is important to constantly improve indexing quality, and for that we need a much faster approach that allows us to re-index BHL data on a regular basis. The Global Names Architecture team is developing a system that should be able to scan 50 million pages and extract the scientific names from them in a matter of one to two days. Our approach includes a dramatic increase in the speed and scalability of the tools at every stage of the process. On the hardware level, we created a computer cluster of more than 20 servers. The cluster uses Kubernetes, an open-source cloud operating system that lets us flexibly increase or decrease the power of every component by scaling services up or down at will. This allows us to fine-tune the amount of resources for every component and use computing power to its maximum. On the software level, we have to solve several problems: scientific name recognition, name verification, and name parsing. We developed new, faster tools for each of these steps: gnparser breaks names into their elements, gnindex analyzes the quality of found names, and the Global Names Recognition and Detection service (GNRD) detects names in BHL texts. In this talk we present the architecture of the system for massive indexing of biodiversity information. The tools and system administration approaches are easily transferable to cloud services that support Kubernetes, which allows the system to be scaled even further with relative ease. After successfully scanning the BHL pages, the system should be fast enough to perform massive name finding in all available scientific literature.
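
    The fan-out pattern behind this architecture can be sketched in a few lines. The find_names function below is a stand-in (an assumption) for a call to a GNRD-like name-finding service; the point is the parallel map over pages, the same shape of work that the Kubernetes cluster scales across servers.

```python
from concurrent.futures import ProcessPoolExecutor
import re

# Sketch of the fan-out pattern behind the re-indexing pipeline: many
# workers process OCR pages in parallel. find_names is a toy stand-in
# for a real name-finding service call.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+\s[a-z]{3,}\b")

def find_names(page_text: str):
    """Return binomial-shaped candidates from one page of OCR text."""
    return CANDIDATE.findall(page_text)

def index_pages(pages, workers=20):
    """pages: iterable of (page_id, text); returns {page_id: [names]}."""
    ids, texts = zip(*pages)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(zip(ids, pool.map(find_names, texts, chunksize=100)))

if __name__ == "__main__":
    demo = [(1, "On Pomatomus saltatrix of the Atlantic."),
            (2, "Notes on Bubo bubo in Europe.")]
    print(index_pages(demo, workers=2))
```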
