6 research outputs found

    The CHEMDNER corpus of chemicals and drugs and its annotation principles

    Get PDF
    The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative for all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text was measured using an agreement study between annotators, obtaining a percentage agreement of 91. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus

    Automatic analysis of online conversations as processes

    No full text
    International audienceThe tremendous use of social media has changed the way society communicates and interacts nowadays, leading to a plethora of online conversations (Perrin et al., 2017). The increasing availability of these conversations as behavioral traces has enabled automatic approaches for behavior discovery and analysis. These approaches, grounded in machine learning, data mining and language processing have become effective predictive components and intelligent descriptive tools for many domains

    Process Models of Interrelated Speech Intentions from Online Health-related Conversations

    No full text
    International audienceBeing related to the adoption of new beliefs, attitudes and, ultimately, behaviors, analyzing online communication is of utmost importance for medicine. Multiple health care, academic communities, such as information seeking and dissemination and persuasive technologies, acknowledge this need. However, in order to obtain understanding, a relevant way to model online communication for the study of behavior is required. In this paper, we propose an automatic method to reveal process models of interrelated speech intentions from conversations. Specifically, a domain-independent taxonomy of speech intentions is adopted, an annotated corpus of Reddit conversations is released, supervised classifiers for speech intention prediction from utterances are trained and assessed using 10-fold cross validation (multi-class, one-versus-all and multi-label setups) and an approach to transform conversations into well-defined, representative logs of verbal behavior, needed by process mining techniques, is designed. The experimental results show that: 1) the automatic classification of intentions is feasible (with Kappa scores varying between 0.52 and 1); 2) predicting pairs of intentions, also known as adjacency pairs, or including more utterances from even other heterogeneous corpora can improve the predictions of some classes; and 3) the classifiers in the current state are robust to be used on other corpora, although the results are poorer and suggest that the input corpus may not suciently capture varied ways of expressing certain intentions. The extracted process models of interrelated speech intentions open new views on grasping the formation of beliefs and behavioral intentions in and from speech, but in-depth evaluation of these models is further required

    The CHEMDNER corpus of chemicals and drugs and its annotation principles

    Get PDF
    The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative for all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text was measured using an agreement study between annotators, obtaining a percentage agreement of 91. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus
    corecore