
A realistic assessment of methods for extracting gene/protein interactions from free text

Author: A Moschitti
AB Clegg
Adrian J Shepherd
AM Cohen
Andrew B Clegg
AS Yeh
B Settles
C Nédellec
D Rebholz-Schuhmann
H Jose
HL Johnson
J Ding
J Fluck
JD Kim
JD Kim
K Franzén
K Fundel
K Sagae
L Hunter
M Krallinger
N Domedel-Puig
R Bunescu
R Hoffmann
R Kabiljo
R Kabiljo
R Leaman
R Sætre
Renata Kabiljo
S Pyysalo
S Pyysalo
S Pyysalo
T Hara
WA Baumgartner
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

Background: The automated extraction of gene and/or protein interactions from the literature is one of the most important targets of biomedical text mining research. In this paper we present a realistic evaluation of gene/protein interaction mining relevant to potential non-specialist users. Hence we have specifically avoided methods that are complex to install or require reimplementation, and we coupled our chosen extraction methods with a state-of-the-art biomedical named entity tagger. Results: Our results show that performance across different evaluation corpora is extremely variable; that the use of tagged (as opposed to gold standard) gene and protein names has a significant impact on performance, with a drop in F-score of over 20 percentage points being commonplace; and that a simple keyword-based benchmark algorithm, when coupled with a named entity tagger, outperforms two of the tools most widely used to extract gene/protein interactions. Conclusion: In terms of availability, ease of use and performance, the potential non-specialist user community interested in automatically extracting gene and/or protein interactions from free text is poorly served by current tools and systems. The public release of extraction tools that are easy to install and use, and that achieve state-of-the-art levels of performance, should be treated as a high priority by the biomedical text mining community.
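The abstract above does not specify the keyword-based benchmark, so the following is only a minimal sketch of what such a baseline could look like, not the authors' algorithm. It assumes that entity names have already been supplied by an upstream tagger, and that any pair of tagged names sharing a sentence with an interaction keyword is reported as a candidate interaction. The keyword list, the function name `keyword_baseline` and the example sentences are all hypothetical.

```python
from itertools import combinations

# Hypothetical keyword list; the abstract does not give the one actually used.
INTERACTION_KEYWORDS = {"binds", "interacts", "activates", "inhibits", "phosphorylates"}


def keyword_baseline(sentence, entities):
    """Return candidate interaction pairs for a single sentence.

    sentence: plain-text sentence.
    entities: gene/protein names found in the sentence by an upstream tagger
              (assumed input; no tagging is performed here).
    """
    # Lower-case the tokens and strip surrounding punctuation.
    tokens = {tok.strip(".,;:()").lower() for tok in sentence.split()}
    # Without an interaction keyword, the baseline predicts nothing.
    if not tokens & INTERACTION_KEYWORDS:
        return []
    # Otherwise every unordered pair of distinct tagged names is a candidate.
    unique = sorted(set(entities))
    return list(combinations(unique, 2))


if __name__ == "__main__":
    print(keyword_baseline("RAD51 binds BRCA2 in vitro.", ["RAD51", "BRCA2"]))
    print(keyword_baseline("RAD51 and BRCA2 were studied.", ["RAD51", "BRCA2"]))
```

Because the sketch depends entirely on the names it is given, tagging errors propagate directly into its output, which is consistent with the drop in F-score the abstract reports when tagged names replace gold standard ones.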

UCL Discovery

Birkbeck Institutional Research Online

Automatic reconstruction of a bacterial regulatory network using Natural Language Processing

Author: AM Cohen
C Friedman
Carlos Rodríguez-Penagos
D Corney
G Demetriou
H Salgado
H Schmid
Heladia Salgado
IM Keseler
Irma Martínez-Flores
J Saric
J Saric
JM Cherry
Julio Collado-Vides
L Grivell
L Hirschman
M Hucka
M Krallinger
M Krallinger
M Scherf
MD Yandell
PD Karp
R Grishman
R Hoffmann
R Rodriguez-Esteban
S Abney
Publication venue: BioMed Central
Publication date: 01/08/2007

Background: Manual curation of biological databases, an expensive and labor-intensive process, is essential for high quality integrated data. In this paper we report the implementation of a state-of-the-art Natural Language Processing system that creates computer-readable networks of regulatory interactions directly from different collections of abstracts and full-text papers. Our major aim is to understand how automatic annotation using Text-Mining techniques can complement manual curation of biological databases. We implemented a rule-based system to generate networks from different sets of documents dealing with regulation in Escherichia coli K-12. Results: Performance evaluation is based on the most comprehensive transcriptional regulation database for any organism, the manually-curated RegulonDB, 45% of which we were able to recreate automatically. From our automated analysis we were also able to find some new interactions from papers not already curated, or that were missed in the manual filtering and review of the literature. We also put forward a novel Regulatory Interaction Markup Language better suited than SBML for simultaneously representing data of interest for biologists and text miners. Conclusion: Manual curation of the output of automatic processing of text is a good way to complement a more detailed review of the literature, either for validating the results of what has been already annotated, or for discovering facts and information that might have been overlooked at the triage or curation stages.

PIE: an online prediction system for protein–protein interactions from text

Author: B.-T. Zhang
Chen
Cohen
Hoffmann
I.-H. Lee
Jang
Jensen
Kim
Krallinger
R. Sriram
S. Kim
S.-J. Kim
S.-Y. Shin
Sanchez-Graillet
Publication venue: Oxford University Press
Publication date: 01/01/2008

Protein–protein interaction (PPI) extraction has been an important research topic in the bio-text mining area, since PPI information is critical for understanding biological processes. However, there are very few open systems available on the Web, and most of them focus on keyword searching based on predefined PPIs. PIE (Protein Interaction information Extraction system) is a configurable Web service to extract PPIs from literature, including user-provided papers as well as PubMed articles. After abstracts or papers are provided, the prediction results are displayed in an easily readable form with essential, yet compact features. The PIE interface supports further features such as PDF file extraction, a PubMed search tool and network communication, which are useful for biologists and bio-system developers. The PIE system utilizes natural language processing techniques and machine learning methodologies to predict PPI sentences, which results in high precision performance for Web users. PIE is freely available at http://bi.snu.ac.kr/pie/

CiteSeerX

The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text

Author: A Abi-Haidar
A Ceol
A Chatr-aryamontri
A Cohen
A Kolchinsky
A Lourenco
A McCallum
A Ng
A Yeh
Alfonso Valencia
AM Cohen
Andrew Chatr-aryamontri
Andrew Winter
Ashish V Tendulkar
B Aranda
B Settles
BP Suomela
C Blaschke
C Elkan
C Stark
Charles Elkan
D Bauer
D Salgado
David Salgado
E Marcotte
F Ehrler
F Leitner
F Leitner
F Leitner
F Rinaldi
F Rinaldi
F Rinaldi
Fabio Rinaldi
Feifan Liu
Florian Leitner
G Andrew
Gerold Schneider
Gianni Cesareni
GL Poulter
Graciela Gonzalez
H Daumé III
H Hermjakob
H Shatkay
H Wang
Hagit Shatkay
HK Rekapalli
I Donaldson
J Lin
Jean-Fred Fontaine
JR Curran
Keith Noto
KG Dowell
L Tanabe
Leonardo Briganti
Livia Perfetto
Luana Licata
Luis Rocha
Luisa Castagnoli
M Hall
M Harris
M Hollander
M Krallinger
M Krallinger
M Krallinger
M Krallinger
M Krallinger
M Oberoi
Marta Iannuccelli
Martin Krallinger
Miguel A Andrade-Navarro
Miguel Vazquez
Mike Tyers
P Wang
R Chowdhary
R Hoffmann
Rafal Rak
Rezarta Islamaj Dogan
Robert Leaman
S Kim
S Matos
S Orchard
Sergio Matos
Shashank Agarwal
Sun Kim
T Kappeler
T Ono
T Zhang
W Baumgartner
W Hersh
W Hersh
W John Wilbur
W Wilbur
Xinglong Wang
Y Niu
Y Sasaki
Z Cao
Zhiyong Lu
Publication venue: BioMed Central
Publication date: 01/01/2011

BACKGROUND: Determining the usefulness of biomedical text mining systems requires realistic task definition and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects including how the end user would oversee the generated output, for instance by providing ranked results, textual evidence for human interpretation or measuring time savings by using automated systems. Detecting articles describing complex biological events like PPIs was addressed in the Article Classification Task (ACT), where participants were asked to implement tools for detecting PPI-describing abstracts. For this purpose the BCIII-ACT corpus was provided, which includes a training, development and test set of over 12,000 PPI relevant and non-relevant PubMed abstracts labeled manually by domain experts, together with the recorded human classification times. The Interaction Method Task (IMT) went beyond abstracts and required mining for associations between more than 3,500 full text articles and interaction detection method ontology concepts that had been applied to detect the PPIs reported in them. RESULTS: A total of 11 teams participated in at least one of the two PPI tasks (10 in ACT and 8 in the IMT) and a total of 62 persons were involved either as participants or in preparing data sets/evaluating these tasks. Per task, each team was allowed to submit five runs offline and another five online via the BioCreative Meta-Server. From the 52 runs submitted for the ACT, the highest Matthews correlation coefficient (MCC) score measured was 0.55 at an accuracy of 89 and the best AUC iP/R was 68. Most ACT teams explored machine learning methods; some of them also used lexical resources like MeSH terms, PSI-MI concepts or particular lists of verbs and nouns, and some integrated NER approaches.
For the IMT, a total of 42 runs were evaluated by comparing systems against manually generated annotations done by curators from the BioGRID and MINT databases. The highest AUC iP/R achieved by any run was 53, the best MCC score 0.55. For competitive systems with an acceptable recall (above 35), the macro-averaged precision ranged between 50 and 80, with a maximum F-Score of 55. CONCLUSIONS: The results of the ACT task of BioCreative III indicate that classification of large unbalanced article collections reflecting the real class imbalance is still challenging. Nevertheless, text-mining tools that report ranked lists of relevant articles for manual selection can potentially reduce the time needed to identify half of the relevant articles to less than 1/4 of the time when compared to unranked results. Detecting associations between full text articles and interaction detection method PSI-MI terms (IMT) is more difficult than might be anticipated. This is due to the variability of method term mentions, errors resulting from pre-processing of articles provided as PDF files, and the heterogeneity and different granularity of method term concepts encountered in the ontology. However, combining the sophisticated techniques developed by the participants with supporting evidence strings derived from the articles for human interpretation could result in practical modules for biological annotation workflows.
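The abstract reports MCC alongside accuracy because accuracy alone can look strong on an unbalanced collection. The sketch below uses the standard definition of the Matthews correlation coefficient from a binary confusion matrix; the function names and the example counts are illustrative assumptions, not figures from the BioCreative III evaluation. Returning 0.0 when the denominator is zero is a common convention, also assumed here.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: undefined MCC (an empty row or column) is reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp, tn, fp, fn):
    """Fraction of all items that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


if __name__ == "__main__":
    # Made-up counts: a classifier that labels every abstract as non-relevant
    # in a collection where 15 of 100 abstracts are relevant.
    counts = dict(tp=0, tn=85, fp=0, fn=15)
    print("accuracy:", accuracy(**counts))
    print("MCC:", mcc(**counts))
```

In this made-up case the trivial classifier reaches an accuracy of 0.85 while its MCC is 0, which is the kind of gap that makes MCC the more informative score when the classes are unbalanced.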

HAL AMU

eScholarship - University of California

ZORA

ART

MDC Repository

Monash University Research Portal

A linguistic rule-based approach to extract drug-drug interactions from pharmacological documents

Author: A Airola
A Aronson
A Rodríguez-Terol
A Siddharthan
C de Pablo-Sánchez
C Fries
César de Pablo-Sánchez
D Klein
D Wishart
D Zhou
E Williams
G Curme
I Segura-Bedmar
I Segura-Bedmar
I Segura-Bedmar
Isabel Segura-Bedmar
J Wingersky
K Verspoora
M Huang
M Krallinger
M Krallinger
M Marcus
N Burton-Roberts
N Calzolari
O Jespersen
P Hansten
Paloma Martínez
R Bunescu
R Sætre
S Ahmed
S Pyysalo
S Pyysalo
W Francis
Y Chen
Y Tateisi
Z Yang
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: A drug-drug interaction (DDI) occurs when one drug influences the level or activity of another drug. The increasing volume of the scientific literature overwhelms health care professionals trying to keep up to date with all published studies on DDI. Methods: This paper describes a hybrid linguistic approach to DDI extraction that combines shallow parsing and syntactic simplification with pattern matching. Appositions and coordinate structures are interpreted based on shallow syntactic parsing provided by the UMLS MetaMap tool (MMTx). Subsequently, complex and compound sentences are broken down into clauses from which simple sentences are generated by a set of simplification rules. A pharmacist defined a set of domain-specific lexical patterns to capture the most common expressions of DDI in texts. These lexical patterns are matched with the generated sentences in order to extract DDIs. Results: We have performed different experiments to analyze the performance of the different processes. The lexical patterns achieve a reasonable precision (67.30%), but very low recall (14.07%). The inclusion of appositions and coordinate structures helps to improve the recall (25.70%); however, precision is lower (48.69%). The detection of clauses does not improve the performance. Conclusions: Information Extraction (IE) techniques can provide an interesting way of reducing the time spent by health care professionals on reviewing the literature. Nevertheless, no approach has been carried out to extract DDI from texts. To the best of our knowledge, this work proposes the first integral solution for the automatic extraction of DDI from biomedical texts.
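The abstract gives precision and recall for two configurations but no combined score. As a rough aid to comparing them, the sketch below computes the standard F-score (the weighted harmonic mean of precision and recall). The resulting values are derived here by arithmetic from the reported percentages and are not figures stated by the authors; the function name `f_score` is an assumption.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the balanced harmonic mean (F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Precision/recall percentages as reported in the abstract.
    patterns_only = f_score(67.30, 14.07)
    with_appositions_and_coordination = f_score(48.69, 25.70)
    print(f"lexical patterns only: F1 = {patterns_only:.2f}")
    print(f"with appositions and coordinate structures: F1 = {with_appositions_and_coordination:.2f}")
```

Under this calculation the first configuration comes to roughly 23.3 and the second to roughly 33.6, so the loss in precision is more than offset by the gain in recall when the two are weighted equally.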

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Universidad Carlos III de Madrid e-Archivo

Learning an enriched representation from unlabeled data for protein-protein interaction extraction

Author: A Airola
AM Cohen
C Giuliano
DP Corney
Hongfei Lin
J Taylor
J Wilbur
K Fundel
M Krallinger
M Miwa
M Miwa
R Bunescu
R Bunescu
R Bunescu
R Sætre
S Pyysalo
S Pyysalo
S Van Landeghem
T Mitsumori
W Hersh
X Hu
Xiaohua Hu
Y Li
Y Miyao
Yanpeng Li
Zhihao Yang
Publication venue: BioMed Central
Publication date: 01/01/2010

wKinMut: An integrated tool for the analysis and interpretation of mutations in human protein kinases

Author: A Baudot
A Gonzalez-Perez
A Torkamani
A Valencia
Alfonso Valencia
Angela del Pozo
B Reva
C Ferrer-Costa
C Greenman
C Greenman
C Ortutay
D Miranda-Saavedra
G Lopez
G Manning
G Wainreb
I Friedberg
IA Adzhubei
J Hurst
J Izarzugaza
JM Izarzugaza
JMG Izarzugaza
JMG Izarzugaza
Jose MG Izarzugaza
JS Kaminker
LD Wood
M Cline
M Krallinger
M Krallinger
Miguel Vazquez
MR Stratton
P Beltrao
P Lahiry
P Minguez
P Yue
PC Ng
R Calabrese
R Hoffmann
R Karchin
R Karchin
RJ Clifford
S Bamford
T Sjöblom
V Quesada
V Ramensky
VG Krishnan
XS Puente
Y Bromberg
YL Yip
Z Wang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

BACKGROUND: Protein kinases are involved in relevant physiological functions and a large number of mutations in this superfamily have been reported in the literature to affect protein function and stability. Unfortunately, the exploration of the consequences on the phenotypes of each individual mutation remains a considerable challenge. RESULTS: The wKinMut web-server offers direct prediction of the potential pathogenicity of the mutations from a number of methods, including our recently developed prediction method based on the combination of information from a range of diverse sources, including physicochemical properties and functional annotations from FireDB and Swissprot and kinase-specific characteristics such as membership of specific kinase groups, the annotation with disease-associated GO terms or the occurrence of the mutation in PFAM domains, and the relevance of the residues in determining kinase subfamily specificity from S3Det. This predictor yields interesting results that compare favourably with other methods in the field when applied to protein kinases. Together with the predictions, wKinMut offers a number of integrated services for the analysis of mutations. These include: the classification of the kinase, information about associations of the kinase with other proteins extracted from iHop, the mapping of the mutations onto PDB structures, pathogenicity records from a number of databases and the classification of mutations in large-scale cancer studies. Importantly, wKinMut is connected with the SNP2L system that extracts mentions of mutations directly from the literature, and therefore increases the possibilities of finding interesting functional information associated with the studied mutations.
CONCLUSIONS: wKinMut facilitates the exploration of the information available about individual mutations by integrating prediction approaches with the automatic extraction of information from the literature (text mining) and several state-of-the-art databases. wKinMut has been used during the last year for the analysis of the consequences of mutations in the context of a number of cancer genome projects, including the recent analysis of Chronic Lymphocytic Leukemia cases and is publicly available at http://wkinmut.bioinfo.cnio.es

Online Research Database In Technology

pubmed2ensembl: A Resource for Mining the Biological Literature on Genes

Author: A Doms
AA Morgan
AM Jenkinson
B Giardine
BA Eckman
C Plake
Casey M. Bergman
D Hull
D Maglott
D Smedley
E Ryder
EM Zdobnov
G Zhou
Goran Nenadic
H Miller
H Parkinson
J Hakenberg
J Hirschman
J Tamames
JM Fernandez
Joachim Baran
L Chen
L Hirschman
M Ashburner
M Gerner
M Haeussler
M Huang
M Krallinger
M Krallinger
Martin Gerner
Maximilian Haeussler
P Flicek
P Kersey
PA Fujita
R Drysdale
R Hoffmann
R Leinonen
R Lyne
RC Gentleman
S Matos
SM Gallo
SP Shah
SS Dwight
Stein Aerts
T Imanishi
TJ Lee
U Mudunuri
W Xuan
Y Makita
Y Yoshida
Z Lu
Publication venue: Public Library of Science
Publication date: 29/09/2011

The last two decades have witnessed a dramatic acceleration in the production of genomic sequence information and publication of biomedical articles. Despite the fact that genome sequence data and publications are two of the most heavily relied-upon sources of information for many biologists, very little effort has been made to systematically integrate data from genomic sequences directly with the biological literature. For a limited number of model organisms dedicated teams manually curate publications about genes; however, for species with no such dedicated staff many thousands of articles are never mapped to genes or genomic regions. To overcome the lack of integration between genomic data and biological literature, we have developed pubmed2ensembl (http://www.pubmed2ensembl.org), an extension to the BioMart system that links over 2,000,000 articles in PubMed to nearly 150,000 genes in Ensembl from 50 species. We use several curated (e.g., Entrez Gene) and automatically generated (e.g., gene names extracted through text-mining on MEDLINE records) sources of gene-publication links, allowing users to filter and combine different data sources to suit their individual needs for information extraction and biological discovery. In addition to extending the Ensembl BioMart database to include published information on genes, we also implemented a scripting language for automated BioMart construction and a novel BioMart interface that allows text-based queries to be performed against PubMed and PubMed Central documents in conjunction with constraints on genomic features.
Finally, we illustrate the potential of pubmed2ensembl through typical use cases that involve integrated queries across the biomedical literature and genomic data. By allowing biologists to find the relevant literature on specific genomic regions or sets of functionally related genes more easily, pubmed2ensembl offers a much-needed genome informatics inspired solution to accessing the ever-increasing biomedical literature.

The University of Manchester - Institutional Repository

Overlap in drug-disease associations between clinical practice guidelines and drug structured product label indications

Author: AM Cohen
J-D Kim
K Jung
M Herrero-Zazo
M Krallinger
M Krallinger
M Kuhn
M Kuhn
NH Shah
O Uzuner
R Hoehndorf
RD Boyce
S Liu
TI Leung
TI Leung
Publication venue: 'Springer Science and Business Media LLC'
Publication date