We participated in three of the protein-protein interaction subtasks of the
Second BioCreative Challenge: classification of abstracts relevant for
protein-protein interaction (IAS), discovery of protein pairs (IPS) and text
passages characterizing protein interaction (ISS) in full text documents. We
approached the abstract classification task with a novel, lightweight linear
model inspired by spam-detection techniques, as well as an uncertainty-based
integration scheme. We also used a Support Vector Machine and the Singular
Value Decomposition on the same features for comparison purposes. Our approach
to the full text subtasks (protein pair and passage identification) includes a
feature expansion method based on word-proximity networks. Our approach to the
abstract classification task (IAS) was among the top submissions for this task
in terms of the measures of performance used in the challenge evaluation
(accuracy, F-score and AUC). We also report on a web-tool we produced using our
approach: the Protein Interaction Abstract Relevance Evaluator (PIARE). Our
approach to the full text tasks resulted in one of the highest recall rates as
well as mean reciprocal rank of correct passages. Our approach to abstract
classification shows that a simple linear model, using relatively few features,
is capable of generalizing and uncovering the conceptual nature of
protein-protein interaction from the bibliome. Since the novel approach is
based on a very lightweight linear model, it can be easily ported and applied
to similar problems. In full text problems, the expansion of word features with
word-proximity networks is shown to be useful, though the need for some
improvements is discussed.
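One of the measures named above, the mean reciprocal rank of correct passages, can be illustrated with a short worked example. This is a minimal sketch of the standard definition (average over queries of 1/rank of the first correct item), not the challenge's official scoring script, which may differ in detail. The function names and the passage identifiers (`p1`, `p2`, ...) are hypothetical and used only for illustration.

```python
from typing import Sequence, Set


def reciprocal_rank(ranked: Sequence[str], correct: Set[str]) -> float:
    """Return 1/rank of the first correct item, or 0.0 if none is correct."""
    for rank, item in enumerate(ranked, start=1):
        if item in correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    runs: Sequence[Sequence[str]], gold: Sequence[Set[str]]
) -> float:
    """Average the reciprocal rank over all queries (0.0 for an empty input)."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, g) for r, g in zip(runs, gold)) / len(runs)


# Hypothetical ranked passage lists for three protein pairs.
runs = [["p3", "p1", "p7"], ["p2", "p9"], ["p5", "p4"]]
# Hypothetical sets of correct passages for the same three pairs.
gold = [{"p1"}, {"p2"}, {"p8"}]

# Reciprocal ranks are 1/2, 1/1 and 0, so the mean is 0.5.
mrr = mean_reciprocal_rank(runs, gold)
print(mrr)
```

A query with no correct passage in its ranked list contributes zero, so the measure rewards systems that place a correct passage near the top for as many queries as possible.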

Abi-Haidar, Alaa

Kaur, Jasleen

Maguitman, Ana G.

Radivojac, Predrag

Rechtsteiner, Andreas

Rocha, Luis M.

Verspoor, Karin

Wang, Zhiping

English

arXiv

Springer - Publisher Connector

Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks

Background: We participated in three of the protein-protein interaction subtasks of the Second BioCreative Challenge: classification of abstracts relevant for protein-protein interaction (interaction article subtask [IAS]), discovery of protein pairs (interaction pair subtask [IPS]), and identification of text passages characterizing protein interaction (interaction sentences subtask [ISS]) in full-text documents. We approached the abstract classification task with a novel, lightweight linear model inspired by spam detection techniques, as well as an uncertainty-based integration scheme. We also used a support vector machine and singular value decomposition on the same features for comparison purposes. Our approach to the full-text subtasks (protein pair and passage identification) includes a feature expansion method based on word proximity networks.

Results: Our approach to the abstract classification task (IAS) was among the top submissions for this task in terms of measures of performance used in the challenge evaluation (accuracy, F-score, and area under the receiver operating characteristic curve). We also report on a web tool that we produced using our approach: the Protein Interaction Abstract Relevance Evaluator (PIARE). Our approach to the full-text tasks resulted in one of the highest recall rates as well as mean reciprocal rank of correct passages.

Conclusion: Our approach to abstract classification shows that a simple linear model, using relatively few features, can generalize and uncover the conceptual nature of protein-protein interactions from the bibliome. Because the novel approach is based on a rather lightweight linear model, it can easily be ported and applied to similar problems. In full-text problems, the expansion of word features with word proximity networks is shown to be useful, although the need for some improvements is discussed.

Affiliation: Abi-Haidar, Alaa. Indiana University; United States. Fundação Luso-Americana para o Desenvolvimento; Portugal

Affiliation: Kaur, Jasleen. Indiana University; United States

Affiliation: Maguitman, Ana Gabriela. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca; Argentina. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación; Argentina

Affiliation: Radivojac, Predrag. Indiana University; United States

Affiliation: Rechtsteiner, Andreas. Indiana University; United States

Affiliation: Verspoor, Karin. Los Alamos National High Magnetic Field Laboratory; United States

Affiliation: Wang, Zhiping. Indiana University; United States

Affiliation: Rocha, Luis. Fundação Luso-Americana para o Desenvolvimento; Portugal. Indiana University; United States

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

CONICET Digital

University of Melbourne Institutional Repository

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2559982
