2 research outputs found

Annotated chemical patent corpus: A gold standard for text mining

Authors: Akhondi S.A. (Saber)
Boppana K. (Kiran)
Jagarlapudi S.A.R.P. (Sarma A. R. P.)
Klenner A.G. (Alexander G.)
Kors J.A. (Jan)
Lowe D. (Daniel)
Manchala A.K. (Anil K.)
Muresan C. (Cornelia)
Sayle R. (Roger)
Tyrchan C. (Christian)
Zimmermann M. (Marc)
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2014

Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Manual extraction of chemical and biological entities from patents by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups, each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, t

Crossref

Directory of Open Access Journals

Fraunhofer-ePrints

PubMed Central

EUR Research Repository

Erasmus University Digital Repository

The CHEMDNER corpus of chemicals and drugs and its annotation principles

Authors: Akhondi S.A. (Saber A.)
Alves R. (Rui)
An X. (Xin)
Ata C. (Caglar)
Bajec M. (Marko)
Batista-Navarro R.T. (Riza Theresa)
Campos D. (David)
Can T. (Tolga)
Choi M. (Miji)
Couto F.M. (Francisco M.)
Dai H.J. (Hong-Jie)
Dieb T.M. (Thaer M.)
Ekbal A. (Asif)
Giles C.L. (C. Lee)
Huber T. (Torsten)
Irmer M. (Matthias)
Ji D. (Donghong)
Khabsa M. (Madian)
Kors J.A. (Jan A.)
Krallinger M. (Martin)
Lamurias A. (Andre)
Leaman R. (Robert)
Leitner F. (Florian)
Liu H. (Hongfang)
Lowe D.M. (Daniel M.)
Lu Y. (Yanan)
Lu Z. (Zhiyong)
Martínez P. (Paloma)
Matos S. (Sérgio)
Munkhdalai T. (Tsendsuren)
Nathan S. (Senthil)
Oyarzabal J. (Julen)
Rabal O. (Obdulia)
Rak R. (Rafal)
Ramanan S.V. (S.V.)
Ravikumar K.E. (Komandur Elayavilli)
Rocktäschel T. (Tim)
Ryu K.H. (Keun Ho)
Salgado D. (David)
Sayle R.A. (Roger A.)
Segura-Bedmar I. (Isabel)
Sikdar U.K. (Utpal Kumar)
Tang B. (Buzhou)
Tzong-Han-Tsai R. (Richard)
Usié A. (Anabel)
Valencia A. (Alfonso)
Vazquez M. (Miguel)
Verspoor K. (Karin)
Weber L. (Lutz)
Xu H. (Hua)
Xu S. (Shuo)
Yoshioka M. (Masaharu)
Zitnik S. (Slavko)
Publication venue: Chemistry Central
Publication date: 01/01/2015

The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text were measured using an agreement study between annotators, obtaining a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain-specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus

Universidad de Navarra

Erasmus University Digital Repository

Dadun, University of Navarra