
10 research outputs found

Annotated chemical patent corpus: A gold standard for text mining

Author: Akhondi S.A. (Saber)
Boppana K. (Kiran)
Jagarlapudi S.A.R.P. (Sarma A. R. P.)
Klenner A.G. (Alexander G.)
Kors J.A. (Jan)
Lowe D. (Daniel)
Manchala A.K. (Anil K.)
Muresan C. (Cornelia)
Sayle R. (Roger)
Tyrchan C. (Christian)
Zimmermann M. (Marc)
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2014

Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups, each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, t

Crossref

Directory of Open Access Journals

Fraunhofer-ePrints

PubMed Central

EUR Research Repository

Erasmus University Digital Repository

Analysis of in vitro bioactivity data extracted from drug discovery literature and patents: Ranking 1654 human protein targets by assayed compounds and molecular scaffolds

Author: A Monge
A Schuffenhauer
AL Hopkins
B Fabio
C Southan
C Tyrchan
Christopher Southan
CP Cannon
D Maglott
DS Wishart
E Ryberg
F Lovering
FNB Edfeldt
GV Paolini
GW Bemis
H Chen
H Ye
J Scheiber
JL Jenkins
JP Overington
K Mackie
Kiran Boppana
L Harland
MR Bowlby
PD Leeson
Q Li
R Christensen
S Devidas
S Muresan
S Wetzel
Sarma ARP Jagarlapudi
SARP Jagarlapudi
SJ Campbell
Sorel Muresan
T Joy
T Liu
TH Keller
X Chen
Y Wang
Y Yang
Y Yasuda
YJ Xu
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Since the classic Hopkins and Groom druggable genome review in 2002, there have been a number of publications updating both the hypothetical and successful human drug target statistics. However, listings of research targets that define the area between these two extremes are sparse because of the challenges of collating published information at the necessary scale. We have addressed this by interrogating databases, populated by expert curation, of bioactivity data extracted from patents and journal papers over the last 30 years. Results: From a subset of just over 27,000 documents we have extracted a set of compound-to-target relationships for biochemical in vitro binding-type assay data for 1,736 human proteins and 1,654 gene identifiers. These are linked to 1,671,951 compound records derived from 823,179 unique chemical structures. The distribution showed a compounds-per-target average of 964 with a maximum of 42,869 (Factor Xa). The list includes non-targets, failed targets and cross-screening targets. The top-278 most actively pursued targets cover 90% of the compounds. We further investigated target ranking by determining the number of molecular frameworks and scaffolds. These were compared to the compound counts as alternative measures of chemical diversity on a per-target basis. Conclusions: The compounds-per-protein listing generated in this work (provided as a supplementary file) represents the major proportion of the human drug target landscape defined by published data. We supplemented the simple ranking by the number of compounds assayed with additional rankings by molecular topology. These showed significant differences and provide complementary assessments of chemical tractability.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Annotated Chemical Patent Corpus: A Gold Standard for Text Mining - Figure 1

Author: Alexander G. Klenner (636395)
Anil K. Manchala (636397)
Christian Tyrchan (636396)
Daniel Lowe (401476)
Jan A. Kors (173657)
Kiran Boppana (477814)
Marc Zimmermann (636398)
Roger Sayle (636400)
Saber A. Akhondi (636394)
Sarma A. R. P. Jagarlapudi (636399)
Sorel Muresan (435355)

Example patent text with pre-annotations as shown by the Brat annotation tool.

FigShare

Number of annotated terms and unique terms in the harmonized set and in the full patent set of the gold standard corpus after disambiguation.

Author: Alexander G. Klenner (636395)
Anil K. Manchala (636397)
Christian Tyrchan (636396)
Daniel Lowe (401476)
Jan A. Kors (173657)
Kiran Boppana (477814)
Marc Zimmermann (636398)
Roger Sayle (636400)
Saber A. Akhondi (636394)
Sarma A. R. P. Jagarlapudi (636399)
Sorel Muresan (435355)

Number of annotated terms and unique terms in the harmonized set and in the full patent set of the gold standard corpus after disambiguation.

FigShare

Inter-annotator agreement (F-score) without ambiguity resolution.

Author: Alexander G. Klenner (636395)
Anil K. Manchala (636397)
Christian Tyrchan (636396)
Daniel Lowe (401476)
Jan A. Kors (173657)
Kiran Boppana (477814)
Marc Zimmermann (636398)
Roger Sayle (636400)
Saber A. Akhondi (636394)
Sarma A. R. P. Jagarlapudi (636399)
Sorel Muresan (435355)

Inter-annotator agreement (F-score) without ambiguity resolution.

FigShare

Target class distribution of the 8,066 patents from which the final set was drawn.

Author: Alexander G. Klenner (636395)
Anil K. Manchala (636397)
Christian Tyrchan (636396)
Daniel Lowe (401476)
Jan A. Kors (173657)
Kiran Boppana (477814)
Marc Zimmermann (636398)
Roger Sayle (636400)
Saber A. Akhondi (636394)
Sarma A. R. P. Jagarlapudi (636399)
Sorel Muresan (435355)

Target class distribution of the 8,066 patents from which the final set was drawn.

FigShare

Inter-annotator agreement after ambiguity resolution.

Author: Alexander G. Klenner (636395)
Anil K. Manchala (636397)
Christian Tyrchan (636396)
Daniel Lowe (401476)
Jan A. Kors (173657)
Kiran Boppana (477814)
Marc Zimmermann (636398)
Roger Sayle (636400)
Saber A. Akhondi (636394)
Sarma A. R. P. Jagarlapudi (636399)
Sorel Muresan (435355)

The lower left triangle presents the inter-annotator agreement scores (F-score). The upper right triangle shows the improvement gained through disambiguation.

FigShare

The effect of the disambiguation process on the annotations.

Author: Alexander G. Klenner (636395)
Anil K. Manchala (636397)
Christian Tyrchan (636396)
Daniel Lowe (401476)
Jan A. Kors (173657)
Kiran Boppana (477814)
Marc Zimmermann (636398)
Roger Sayle (636400)
Saber A. Akhondi (636394)
Sarma A. R. P. Jagarlapudi (636399)
Sorel Muresan (435355)

The effect of the disambiguation process on the annotations.

FigShare

Number of annotated terms and unique terms within the harmonized set prior to disambiguation.

Author: Alexander G. Klenner (636395)
Anil K. Manchala (636397)
Christian Tyrchan (636396)
Daniel Lowe (401476)
Jan A. Kors (173657)
Kiran Boppana (477814)
Marc Zimmermann (636398)
Roger Sayle (636400)
Saber A. Akhondi (636394)
Sarma A. R. P. Jagarlapudi (636399)
Sorel Muresan (435355)

Number of annotated terms and unique terms within the harmonized set prior to disambiguation.

FigShare

Annotated Chemical Patent Corpus: A Gold Standard for Text Mining

Author: Alexander G. Klenner
Anil K. Manchala
C Kolarik
C Kolárik
C Southan
C Tyrchan
Christian Tyrchan
D Weininger
Daniel Lowe
DM Jessop
DM Lowe
I Lewin
Jan A. Kors
JD Kim
K Degtyarenko
Kiran Boppana
M Kiss
M Krallinger
M Vazquez
M Zimmermann
Marc Zimmermann
P De Matos
P Stenetorp
R Klinger
R Sayle
Roger Sayle
S Heller
S Kulick
S Muresan
SA Akhondi
Saber A. Akhondi
Sarma A. R. P. Jagarlapudi
Shoba Ranganathan
Sorel Muresan
T Grego
Y-H Tseng
Publication venue: 'Public Library of Science (PLoS)'

Crossref