Adapting a relation extraction pipeline for the BioCreAtIvE II task

Author: Grover Claire
Haddow Barry
Klein Ewan
Matthews Michael
Nielsen Leif Arda
Tobin Richard
Wang Xinglong
Publication date: 01/01/2007

Edinburgh Research Explorer

Biomedical Text Mining and Its Applications

Crossref

Directory of Open Access Journals

PubMed Central

OpenDMAP: An open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression

BACKGROUND: Information extraction (IE) efforts are widely acknowledged to be important in harnessing the rapid advance of biomedical knowledge, particularly in areas where important factual information is published in a diverse literature. Here we report on the design, implementation and several evaluations of OpenDMAP, an ontology-driven, integrated concept analysis system. It significantly advances the state of the art in information extraction by leveraging knowledge in ontological resources, integrating diverse text processing applications, and using an expanded pattern language that allows the mixing of syntactic and semantic elements and variable ordering. RESULTS: OpenDMAP information extraction systems were produced for extracting protein transport assertions (transport), protein-protein interaction assertions (interaction) and assertions that a gene is expressed in a cell type (expression). Evaluations were performed on each system, resulting in F-scores ranging from .26 – .72 (precision .39 – .85, recall .16 – .85). Additionally, each of these systems was run over all abstracts in MEDLINE, producing a total of 72,460 transport instances, 265,795 interaction instances and 176,153 expression instances. CONCLUSION: OpenDMAP advances the performance standards for extracting protein-protein interaction predications from the full texts of biomedical research articles. Furthermore, this level of performance appears to generalize to other information extraction tasks, including extracting information about predicates of more than two arguments. The output of the information extraction system is always constructed from elements of an ontology, ensuring that the knowledge representation is grounded with respect to a carefully constructed model of reality. The results of these efforts can be used to increase the efficiency of manual curation efforts and to provide additional features in systems that integrate multiple sources for information extraction.
The open source OpenDMAP code library is freely available at http://bionlp.sourceforge.net/

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Automated recognition of malignancy mentions in biomedical literature

Author: Carroll Steven
Jin Yang
Lerman Kevin
Liberman Mark Y
Mandel Mark A
McDonald Ryan T
Pereira Fernando C
White Peter S
Winters Raymond S
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: The rapid proliferation of biomedical text makes it increasingly difficult for researchers to identify, synthesize, and utilize developed knowledge in their fields of interest. Automated information extraction procedures can assist in the acquisition and management of this knowledge. Previous efforts in biomedical text mining have focused primarily upon named entity recognition of well-defined molecular objects such as genes, but less work has been performed to identify disease-related objects and concepts. Furthermore, promise has been tempered by an inability to efficiently scale approaches in ways that minimize manual efforts and still perform with high accuracy. Here, we have applied a machine-learning approach previously successful for identifying molecular entities to a disease concept to determine if the underlying probabilistic model effectively generalizes to unrelated concepts with minimal manual intervention for model retraining. RESULTS: We developed a named entity recognizer (MTag), an entity tagger for recognizing clinical descriptions of malignancy presented in text. The application uses the machine-learning technique Conditional Random Fields with additional domain-specific features. MTag was tested with 1,010 training and 432 evaluation documents pertaining to cancer genomics. Overall, our experiments resulted in 0.85 precision, 0.83 recall, and 0.84 F-measure on the evaluation set. Compared with a baseline system using string matching of text with a neoplasm term list, MTag performed with a much higher recall rate (92.1% vs. 42.1% recall) and demonstrated the ability to learn new patterns. Application of MTag to all MEDLINE abstracts yielded the identification of 580,002 unique and 9,153,340 overall mentions of malignancy. Significantly, addition of an extensive lexicon of malignancy mentions as a feature set for extraction had minimal impact in performance. 
CONCLUSION: Together, these results suggest that the identification of disparate biomedical entity classes in free text may be achievable with high accuracy and only moderate additional effort for each new application domain
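The reported MTag figures can be cross-checked against each other. The sketch below assumes the abstract's "F-measure" is the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_measure` is illustrative and not taken from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 gives the balanced F1 (assumed here)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is correct
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported for the MTag evaluation set.
mtag_f1 = f_measure(0.85, 0.83)
print(round(mtag_f1, 2))  # 0.84
```

Under that assumption, a precision of 0.85 and a recall of 0.83 reproduce the reported F-measure of 0.84.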

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

ScholarlyCommons@Penn

Building a high-quality sense inventory for improved abbreviation disambiguation

Author: Ananiadou
Erhardt
Federiuk
J. Tsujii
Liu
N. Okazaki
S. Ananiadou
Sehgal
Wren
Yu
Publication venue: Oxford University Press
Publication date: 01/01/2010

Motivation: The ultimate goal of abbreviation management is to disambiguate every occurrence of an abbreviation into its expanded form (concept or sense). To collect expanded forms for abbreviations, previous studies have recognized abbreviations and their expanded forms in parenthetical expressions of bio-medical texts. However, expanded forms extracted by abbreviation recognition are mixtures of concepts/senses and their term variations. Consequently, a list of expanded forms should be structured into a sense inventory, which provides possible concepts or senses for abbreviation disambiguation

CiteSeerX

Crossref

PubMed Central

The University of Manchester - Institutional Repository

Can Bibliographic Pointers for Known Biological Data Be Found Automatically? Protein Interactions as a Case Study

Author: Alfonso Valencia
Andrade
Bader
Bairoch
Barker
Benson
Blaschke
Blaschke
Chien
Christian Blaschke
Eilbeck
Eisenberg
Enright
Fromont-Racine
Fukuda
Hishiki
Humphreys
Ito
Jenssen
Ohta
Proux
Proux
PubMed
Rain
Rindflesch
Rindflesch
Schwikowski
Sekimizu
Stapley
Sussman
Tanabe
Thomas
Uetz
Xenarios
Yakushiji
Publication venue: Hindawi Publishing Corporation
Publication date: 01/01/2001

The Database of Interacting Proteins (DIP) (Xenarios et al., 2000) is a large repository of protein interactions: its March 2000 release included 2379 protein pairs whose interactions have been detected by experimental methods. Even if many of these correspond to poorly characterized proteins, the result of massive yeast two-hybrid screenings, as many as 851 correspond to interactions detected using direct biochemical methods

Crossref

Directory of Open Access Journals

PubMed Central

Consolidating the set of known human protein-protein interactions in preparation for large-scale mapping of the human interactome

Author: Bunescu Razvan C
Marcotte Edward M
Mooney Raymond J
Ramani Arun K
Publication venue: BioMed Central
Publication date: 01/01/2005

BACKGROUND: Extensive protein interaction maps are being constructed for yeast, worm, and fly to ask how the proteins organize into pathways and systems, but no such genome-wide interaction map yet exists for the set of human proteins. To prepare for studies in humans, we wished to establish tests for the accuracy of future interaction assays and to consolidate the known interactions among human proteins. RESULTS: We established two tests of the accuracy of human protein interaction datasets and measured the relative accuracy of the available data. We then developed and applied natural language processing and literature-mining algorithms to recover from Medline abstracts 6,580 interactions among 3,737 human proteins. A three-part algorithm was used: first, human protein names were identified in Medline abstracts using a discriminator based on conditional random fields, then interactions were identified by the co-occurrence of protein names across the set of Medline abstracts, filtering the interactions with a Bayesian classifier to enrich for legitimate physical interactions. These mined interactions were combined with existing interaction data to obtain a network of 31,609 interactions among 7,748 human proteins, accurate to the same degree as the existing datasets. CONCLUSION: These interactions and the accuracy benchmarks will aid interpretation of current functional genomics data and provide a basis for determining the quality of future large-scale human protein interaction assays. Projecting from the approximately 15 interactions per protein in the best-sampled interaction set to the estimated 25,000 human genes implies more than 375,000 interactions in the complete human protein interaction network. This set therefore represents no more than 10% of the complete network
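As a rough illustration of the co-occurrence step and the closing projection, here is a minimal sketch. It assumes protein names have already been recognized in each abstract (the paper uses a conditional random field tagger for that) and it omits the Bayesian filtering step entirely. The function names and the toy protein lists are hypothetical; only the figures 15, 25,000 and 31,609 come from the abstract.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstract_proteins):
    """Count, per unordered protein pair, the abstracts mentioning both.

    `abstract_proteins` is a list of lists of already-recognized names;
    this is a simplification, not the paper's actual scoring.
    """
    counts = Counter()
    for names in abstract_proteins:
        # Deduplicate and sort so each pair is counted once per abstract.
        for pair in combinations(sorted(set(names)), 2):
            counts[pair] += 1
    return counts


def projected_network_size(per_protein: int, genes: int) -> int:
    """Reproduce the abstract's projection: interactions per protein x genes."""
    return per_protein * genes


# Toy input (hypothetical): three abstracts with tagged protein names.
toy = [["TP53", "MDM2"], ["MDM2", "TP53", "BRCA1"], ["BRCA1", "BARD1"]]
counts = cooccurrence_counts(toy)
print(counts[("MDM2", "TP53")])  # 2

projected = projected_network_size(15, 25_000)
coverage = 31_609 / projected
print(projected, round(coverage, 3))  # 375000 0.084
```

The arithmetic agrees with the abstract: 31,609 consolidated interactions are about 8.4% of the projected 375,000, which is under the stated 10% bound.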

Springer - Publisher Connector

PubMed Central

RetroMine, or how to provide in-depth retrospective studies from Medline in a glance: the hepcidin use-case

Author: Ameline de Cadeville Bertrand
Loreal Olivier
Moussouni-Marzolf Fouzia
Publication venue: IMBio e.V.
Publication date: 01/01/2015

The rapid expansion of biomedical literature has provoked an increased development of advanced text mining tools to rapidly extract relevant events from the continuously increasing amount of knowledge published periodically in PubMed. However, bioinvestigators are still reluctant to use these tools for two reasons: i) a large volume of events is often extracted upon a query, and this volume is hard to manage, and ii) background events dominate search results and overshadow more pertinent published information, especially for domain experts. In this paper, we propose an approach that incorporates the temporal dimension of published events into the process of information extraction to improve data selection and prioritize more pertinent periodically published knowledge for scientists. Indeed, instead of providing the total knowledge associated with a PubMed query, which is usually a mix of trivial background information and non-background information, we propose a method that incorporates time and selects non-background and highly relevant biological entities and events published over time for bioinvestigators. Before excluding background events from the total knowledge extracted, a quantification of their amount is also provided. This work is illustrated by a case study regarding Hepcidin gene publications over a decade, a duration that is sufficiently long to generate alternative views on the overall data extracted

HAL-Rennes 1