    The CHEMDNER corpus of chemicals and drugs and its annotation principles

    The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text were measured using an agreement study between annotators, obtaining a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus
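The abstract reports a percentage agreement between annotators but does not say how it was computed. As a minimal sketch, assuming agreement is counted over exact-match mentions (the mentions both annotators marked, divided by the mentions either annotator marked), it could be calculated as below. The `(document id, start offset, end offset)` tuple layout and the function name `percentage_agreement` are hypothetical, and the sample mentions are made up for illustration.

```python
def percentage_agreement(annotator_a, annotator_b):
    """Percentage of mentions marked identically by both annotators,
    relative to all mentions marked by at least one of them.

    Assumption: a mention is a (doc_id, start, end) tuple and only
    exact span matches count as agreement.
    """
    a, b = set(annotator_a), set(annotator_b)
    union = a | b
    if not union:
        # No mentions from either annotator: treat as full agreement.
        return 100.0
    return 100.0 * len(a & b) / len(union)


# Made-up example: the annotators disagree on one span boundary in "d2".
mentions_a = {("d1", 0, 7), ("d1", 15, 22), ("d2", 3, 10), ("d2", 40, 47)}
mentions_b = {("d1", 0, 7), ("d1", 15, 22), ("d2", 3, 10), ("d2", 41, 47)}
agreement = percentage_agreement(mentions_a, mentions_b)
```

Other definitions (for example, also requiring the SACEM class to match, or allowing overlapping spans) would give different figures for the same annotations.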

    Hull fouling marine invasive species pose a very low, but plausible, risk of introduction to East Antarctica in climate change scenarios

    Aims: To identify potential hull fouling marine invasive species (MIS) that could survive in East Antarctica presently and in the future. Location: Australia's Antarctic continental stations: Davis, Mawson and Casey, East Antarctica; and subantarctic islands: Macquarie Island and Heard and McDonald Islands. Methods: Our study uses a novel machine-learning algorithm to predict which currently known hull fouling MIS could survive in shallow benthic ecosystems adjacent to Australian Antarctic research stations and subantarctic islands, where ship traffic is present. We used gradient boosted machine learning (XGBoost) with four important environmental variables (sea surface temperature, salinity, nitrate and pH) to develop models of suitable environments for each potentially invasive species. We then used these models to determine if any of Australia's three Antarctic research stations and two subantarctic islands could be environmentally suitable for MIS now and under two future climate scenarios. Results: Most of the species were predicted to be unable to survive at any location between now and the end of this century; however, four species were identified as potential current threats and five as threats under future climate change. Asterias amurensis was identified as a potential threat to all locations. Main conclusions: This study suggests that there is a very low, but plausible, risk that known hull fouling species could survive in the shallow benthic habitats near Australia's East Antarctica locations, and that a precautionary approach is needed by way of surveillance and monitoring in this region, particularly if propagule pressure increases. While some species could survive as adults in the region, their ability to reach these locations and undergo successful reproduction is considered unlikely based on current knowledge.
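To illustrate the general idea behind gradient boosted suitability models, here is a minimal sketch in plain Python. It is not XGBoost and not the study's pipeline: it is a hand-rolled gradient boosting loop over one-split decision stumps with a logistic loss, and all function names, hyperparameters and data values are hypothetical. The four feature columns follow the variables named in the abstract (sea surface temperature, salinity, nitrate, pH); the rows are made-up toy values, not data from the study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Find the one-feature threshold split that best fits the residuals
    (least squared error), returning (feature, threshold, left, right)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            err = sum((r - lmean) ** 2 for r in left)
            err += sum((r - rmean) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lmean, rmean)
    if best is None:
        raise ValueError("no feature has two distinct values to split on")
    return best[1:]


def fit_boosted(X, y, n_rounds=50, lr=0.3):
    """Each round fits a stump to the negative gradient of the log loss
    (label minus predicted probability) and adds it to the ensemble."""
    stumps = []
    scores = [0.0] * len(X)
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, t, left, right = fit_stump(X, residuals)
        stumps.append((j, t, left, right))
        scores = [s + lr * (left if row[j] <= t else right)
                  for row, s in zip(X, scores)]
    return lr, stumps


def predict_proba(model, row):
    """Predicted probability that the environment in `row` is suitable."""
    lr, stumps = model
    score = sum(lr * (left if row[j] <= t else right)
                for j, t, left, right in stumps)
    return sigmoid(score)


# Made-up toy rows: [sea surface temperature, salinity, nitrate, pH].
# Label 1 = species recorded as present, 0 = absent (hypothetical).
X = [
    [-1.8, 34.5, 28.0, 8.05],
    [-1.0, 34.2, 25.0, 8.10],
    [0.5, 34.0, 26.0, 8.00],
    [9.0, 34.4, 27.0, 8.08],
    [12.0, 35.0, 24.0, 8.02],
    [15.0, 34.1, 25.5, 8.12],
]
y = [0, 0, 0, 1, 1, 1]

model = fit_boosted(X, y)
cold_site = predict_proba(model, [-1.5, 34.3, 27.0, 8.05])
warm_site = predict_proba(model, [12.0, 34.3, 27.0, 8.05])
```

In this toy set only temperature separates the two classes, so the fitted model scores the cold site as unsuitable and the warm site as suitable. A real analysis would fit one model per species on observed occurrence data and then score each station under present and projected conditions.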