70 research outputs found

The GNAT library for local and remote gene mention normalization

Author: C. M. Bergman
C. Plake
G. Gonzalez
G. Nenadic
Gerner
I. Solt
J. Hakenberg
M. Gerner
M. Haeussler
M. Schroeder
Tamames
Publication venue: Oxford University Press
Publication date: 01/01/2011

Summary: Identifying mentions of named entities, such as genes or diseases, and normalizing them to database identifiers have become an important step in many text and data mining pipelines. Despite this need, very few entity normalization systems are publicly available as source code or web services for biomedical text mining. Here we present the GNAT Java library for text retrieval, named entity recognition, and normalization of gene and protein mentions in biomedical text. The library can be used as a component to be integrated with other text-mining systems, as a framework to add user-specific extensions, and as an efficient stand-alone application for the identification of gene and protein names for data analysis. On the BioCreative III test data, the current version of GNAT achieves a TAP-20 score of 0.1987

CiteSeerX

Crossref

PubMed Central

The University of Manchester - Institutional Repository

MDC Repository

SR4GN: A Species Recognition Software Tool for Gene Normalization

Author: AA Morgan
B D
C-H Wei
C-N Hsu
Chih-Hsuan Wei
H Cunningham
HD Carroll
Hung-Yu Kao
J Hakenberg
J William A Baumgartner
Jan Aerts
K Bontcheva
L Hirschman
M Gerner
M Krallinger
N Naderi
T Kappeler
T Mu
X Wang
Y Kano Jr
Z Lu
Zhiyong Lu
Publication venue: Public Library of Science
Publication date: 05/06/2012

As suggested in recent studies, species recognition and disambiguation is one of the most critical and challenging steps in many downstream text-mining applications such as the gene normalization task and protein-protein interaction extraction. We report SR4GN: an open source tool for species recognition and disambiguation in biomedical text. In addition to the species detection function in existing tools, SR4GN is optimized for the Gene Normalization task. As such it is developed to link detected species with corresponding gene mentions in a document. SR4GN achieves 85.42% in accuracy and compares favorably to the other state-of-the-art techniques in benchmark experiments. Finally, SR4GN is implemented as a standalone software tool, thus making it convenient and robust for use in many text-mining applications. SR4GN can be downloaded at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/downloads/SR4G

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Annotating genes and genomes with DNA sequences extracted from biomedical articles

Author: Aerts
Anderson
Benson
Casey M. Bergman
Cock
Colosimo
Dowell
Fulp
Garcia-Remesal
Gerner
Gibson
Gray
Hakenberg
Holley
Hubbard
Karolchik
Kent
Kersey
Krallinger
Maglott
Martin Gerner
Maximilian Haeussler
Morgan
Rhead
Roberts
Semon
Shtatland
The FlyBase Consortium
Vandesompele
Visel
Weiss
Wren
Yoshida
Publication venue: Oxford University Press
Publication date: 01/01/2011

Motivation: Increasing rates of publication and DNA sequencing make the problem of finding relevant articles for a particular gene or genomic region more challenging than ever. Existing text-mining approaches focus on finding gene names or identifiers in English text. These are often not unique and do not identify the exact genomic location of a study

CiteSeerX

Crossref

PubMed Central

The University of Manchester - Institutional Repository

pubmed2ensembl: A Resource for Mining the Biological Literature on Genes

Author: A Doms
AA Morgan
AM Jenkinson
B Giardine
BA Eckman
C Plake
Casey M. Bergman
D Hull
D Maglott
D Smedley
E Ryder
EM Zdobnov
G Zhou
Goran Nenadic
H Miller
H Parkinson
J Hakenberg
J Hirschman
J Tamames
JM Fernandez
Joachim Baran
L Chen
L Hirschman
M Ashburner
M Gerner
M Haeussler
M Huang
M Krallinger
Martin Gerner
Maximilian Haeussler
P Flicek
P Kersey
PA Fujita
R Drysdale
R Hoffmann
R Leinonen
R Lyne
RC Gentleman
S Matos
SM Gallo
SP Shah
SS Dwight
Stein Aerts
T Imanishi
TJ Lee
U Mudunuri
W Xuan
Y Makita
Y Yoshida
Z Lu
Publication venue: Public Library of Science
Publication date: 29/09/2011

The last two decades have witnessed a dramatic acceleration in the production of genomic sequence information and publication of biomedical articles. Despite the fact that genome sequence data and publications are two of the most heavily relied-upon sources of information for many biologists, very little effort has been made to systematically integrate data from genomic sequences directly with the biological literature. For a limited number of model organisms, dedicated teams manually curate publications about genes; however, for species with no such dedicated staff, many thousands of articles are never mapped to genes or genomic regions. To overcome the lack of integration between genomic data and biological literature, we have developed pubmed2ensembl (http://www.pubmed2ensembl.org), an extension to the BioMart system that links over 2,000,000 articles in PubMed to nearly 150,000 genes in Ensembl from 50 species. We use several curated (e.g., Entrez Gene) and automatically generated (e.g., gene names extracted through text-mining on MEDLINE records) sources of gene-publication links, allowing users to filter and combine different data sources to suit their individual needs for information extraction and biological discovery. In addition to extending the Ensembl BioMart database to include published information on genes, we also implemented a scripting language for automated BioMart construction and a novel BioMart interface that allows text-based queries to be performed against PubMed and PubMed Central documents in conjunction with constraints on genomic features.
Finally, we illustrate the potential of pubmed2ensembl through typical use cases that involve integrated queries across the biomedical literature and genomic data. By allowing biologists to find the relevant literature on specific genomic regions or sets of functionally related genes more easily, pubmed2ensembl offers a much-needed genome-informatics-inspired solution to accessing the ever-increasing biomedical literature

Crossref

Directory of Open Access Journals

PubMed Central

The University of Manchester - Institutional Repository

ProNormz – An integrated approach for human proteins and protein kinases normalization

Author: Natarajan Jeyakumar
Raja Kalpana
Subramani Suresh
Publication venue: Elsevier Inc.
Publication date: 28/02/2014

Recognizing and normalizing protein name mentions in biomedical literature is a challenging task that is important for text mining applications such as protein–protein interactions, pathway reconstruction and many more. In this paper, we present ProNormz, an integrated approach for tagging and normalizing human proteins (HPs). In Homo sapiens, a great number of biological processes are regulated through post-translational phosphorylation by a large human gene family called protein kinases. Recognition and normalization of human protein kinases (HPKs) is considered important for extracting the underlying information on their regulatory mechanisms from biomedical literature. Besides tagging and normalization, ProNormz distinguishes HPKs from other HPs. To our knowledge, ProNormz is the first available normalization system that distinguishes HPKs from other HPs in addition to performing the gene normalization task. ProNormz incorporates a specialized synonym dictionary for human proteins and protein kinases, a set of 15 string matching rules and a disambiguation module to achieve the normalization. Experimental results on the benchmark BioCreative II training and test datasets show that our integrated approach achieves fairly good performance and outperforms more sophisticated semantic similarity and disambiguation systems presented in the BioCreative II GN task. As a freely available web tool, ProNormz is useful to developers as an extensible gene normalization implementation, to researchers as a standard for comparing their innovative techniques, and to biologists for normalization and categorization of HP and HPK mentions in biomedical literature. URL: http://www.biominingbu.org/pronormz

Elsevier - Publisher Connector

Species identification for gene name normalization

Author: C Plake
Domonkos Tikk
H Salgado
Illés Solt
J Hakenberg
M Gerner
Ulf Leser
Publication venue: BioMed Central
Publication date: 01/01/2010

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

GeneTUKit: a software for document-level gene normalization

Author: J. Liu
M. Huang
Neves
X. Zhu
Publication venue: Oxford University Press

Motivation: Linking gene mentions in an article to entries of biological databases can greatly facilitate indexing and querying of the biological literature. Due to the high ambiguity of gene names, this task is particularly challenging. Manual annotation for this task is costly, time-consuming and labor-intensive. Therefore, providing assistive tools to facilitate the task is of high value

Crossref

PubMed Central

BioCreative III interactive task: an overview

The BioCreative challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. The biocurator community, as an active user of biomedical literature, provides a diverse and engaged end user group for text mining tools. Earlier BioCreative challenges involved many text mining teams in developing basic capabilities relevant to biological curation, but they did not address the issues of system usage, insertion into the workflow and adoption by curators. Thus in BioCreative III (BC-III), the InterActive Task (IAT) was introduced to address the utility and usability of text mining tools for real-life biocuration tasks. To support the aims of the IAT in BC-III, involvement of both developers and end users was solicited, and the development of a user interface to address the tasks interactively was requested

Crossref

The Jackson Laboratory: The Mouseion at the JAXlibrary

Springer

Springer - Publisher Connector

PubMed Central

ZORA

ART

NORA - Norwegian Open Research Archives

Preliminary evaluation of the CellFinder literature curation pipeline for gene expression in kidney cells and anatomical parts

Author: Damaschun A.
Fontaine J.F.
Kurtz A.
Lekschas F.
Leser U.
Mah N.
Neves M.
Seltmann S.
Stachelscheid H.
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2013

Biomedical literature curation is the process of automatically and/or manually deriving knowledge from scientific publications and recording it into specialized databases for structured delivery to users. It is a slow, error-prone, complex, costly and yet highly important task. Previous experience has shown that text mining can assist in its many phases, especially in the triage of relevant documents and the extraction of named entities and biological events. Here, we present the curation pipeline of the CellFinder database, a repository of cell research, which includes data derived from literature curation and microarrays to identify cell types, cell lines, organs and so forth, and especially patterns in gene expression. The curation pipeline is based on freely available tools in all text mining steps, as well as the manual validation of extracted data. Preliminary results are presented for a data set of 2376 full texts from which >4500 gene expression events in cells or anatomical parts have been extracted. Validation of half of this data resulted in a precision of ~50% of the extracted data, which indicates that we are on the right track with our pipeline for the proposed task. However, evaluation of the methods shows that there is still room for improvement in the named-entity recognition and that a larger and more robust corpus is needed to achieve better performance for event extraction. Database URL: http://www.cellfinder.org

CiteSeerX

SNU Open Repository and Archive

PubMed Central

MDC Repository