An environment for relation mining over richly annotated corpora: the case of GENIA

Author: A Koike
A Mikheev
A Ratnaparkhi
A Yakushiji
C Friedman
D Hindle
D Lin
DPA Corney
F Rinaldi
F Rinaldi
Fabio Rinaldi
G Leroy
G Minnen
G Schneider
G Schneider
G Schneider
Gerold Schneider
J Carroll
J Hakenberg
J Kim
J Preiss
J Saric
JC Reynar
K Kaljurand
Kaarel Kaljurand
LJ Jensen
M Collins
M Huang
M Marcus
M Romacker
Martin Romacker
Michael Hess
N Daraselia
S Novichkova
S Riedel
T Rindflesch
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: The biomedical domain is witnessing a rapid growth in the amount of published scientific results, which makes it increasingly difficult to filter the core information. There is a real need for support tools that 'digest' the published results and extract the most important information. RESULTS: We describe and evaluate an environment supporting the extraction of domain-specific relations, such as protein-protein interactions, from a richly-annotated corpus. We use full, deep-linguistic parsing and manually created, versatile patterns, expressing a large set of syntactic alternations, plus semantic ontology information. CONCLUSION: The experiments show that the approach described is capable of delivering high-precision results, while maintaining sufficient levels of recall. The high level of abstraction of the rules used by the system, which are considerably more powerful and versatile than finite-state approaches, allows speedy interactive development and validation.

Crossref

Springer - Publisher Connector

PubMed Central

ZORA

GenCLiP: a software program for clustering gene lists by literature profiling and constructing gene co-occurrence networks related to custom keywords

Author: AA Schaffer
BT Alako
C Plake
C Rodriguez-Penagos
D Chaussabel
D Lee
EG Cerami
G Karakiulakis
H Kim
Hui-Yong Tian
Jin Zhao
K Fundel
Kai-Tai Yao
KJ Bussey
LJ Jensen
M Bundschus
M Suderman
MB Eisen
N Daraselia
P Shannon
R Hammamieh
R Hoffmann
R Rubinstein
RT Tsai
S Li
T Ide
TK Jenssen
VK Gajendran
Yi-Bo Zhou
Z Huang
ZF Hu
Zhen-Fu Hu
Zhong-Xi Huang
Publication venue: BioMed Central
Publication date: 01/07/2008

BACKGROUND: Biomedical researchers often want to explore pathogenesis and pathways regulated by abnormally expressed genes, such as those identified by microarray analyses. Literature mining is an important way to assist in this task. Many literature mining tools are now available. However, few of them allow the user to make manual adjustments to zero in on what he or she wants to know in particular. RESULTS: We present our software program, GenCLiP (Gene Cluster with Literature Profiles), which is based on the methods presented by Chaussabel and Sher (Genome Biol 2002, 3(10):RESEARCH0055) that search gene lists to identify functional clusters of genes based on up-to-date literature profiling. Four features were added to this previously described method: the ability to 1) manually curate keywords extracted from the literature, 2) search genes and gene co-occurrence networks related to custom keywords, 3) compare analyzed gene results with negative and positive controls generated by GenCLiP, and 4) calculate probabilities that the resulting genes and gene networks are randomly related. In this paper, we show, with a set of genes differentially expressed between keloids and normal control, how implementation of functions in GenCLiP successfully identified keywords related to the pathogenesis of keloids and unknown gene pathways involved in the pathogenesis of keloids. CONCLUSION: With regard to the identification of disease-susceptibility genes, GenCLiP allows one to quickly acquire a primary pathogenesis profile and identify pathways involving abnormally expressed genes not previously associated with the disease.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

BioInfer: a corpus for information extraction in the biomedical domain

Author: A Yakushiji
CF Baker
D Lin
DD Sleator
E Alphonse
E Tsivtsivadze
E Tsivtsivadze
F Ginter
Filip Ginter
G Hripcsak
H Shatkay
J Cohen
J Ding
J Kim
Jari Björne
JM Temkin
Jorma Boberg
Jouni Järvinen
Juho Heimonen
K Franzén
K Kipper
KB Cohen
KB Cohen
L Hirschman
L Salwinski
M Ashburner
N Daraselia
P Kingsbury
P Kingsbury
P Szolovits
S Aubin
S Pyysalo
S Pyysalo
S Pyysalo
S Siegel
Sampo Pyysalo
T Ohta
T Pahikkala
T Wattarujeekrit
Tapio Salakoski
TH King
Y Tateisi
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: Lately, there has been a great interest in the application of information extraction methods to the biomedical domain, in particular, to the extraction of relationships of genes, proteins, and RNA from scientific publications. The development and evaluation of such methods requires annotated domain corpora. RESULTS: We present BioInfer (Bio Information Extraction Resource), a new public resource providing an annotated corpus of biomedical English. We describe an annotation scheme capturing named entities and their relationships along with a dependency analysis of sentence syntax. We further present ontologies defining the types of entities and relationships annotated in the corpus. Currently, the corpus contains 1100 sentences from abstracts of biomedical research articles annotated for relationships, named entities, as well as syntactic dependencies. Supporting software is provided with the corpus. The corpus is unique in the domain in combining these annotation types for a single set of sentences, and in the level of detail of the relationship annotation. CONCLUSION: We introduce a corpus targeted at protein, gene, and RNA relationships which serves as a resource for the development of information extraction systems and their components such as parsers and domain analyzers. The corpus will be maintained and further developed with a current version being available at

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation

BACKGROUND: High-throughput experiments, such as with DNA microarrays, typically result in hundreds of genes potentially relevant to the process under study, rendering the interpretation of these experiments problematic. Here, we propose and evaluate an approach to find functional associations between large numbers of genes and other biomedical concepts from free-text literature. For each gene, a profile of related concepts is constructed that summarizes the context in which the gene is mentioned in literature. We assign a weight to each concept in the profile based on a likelihood ratio measure. Gene concept profiles can then be clustered to find related genes and other concepts. RESULTS: The experimental validation was done in two steps. We first applied our method on a controlled test set. After this proved to be successful the datasets from two DNA microarray experiments were analyzed in the same way and the results were evaluated by domain experts. The first dataset was a gene-expression profile that characterizes the cancer cells of a group of acute myeloid leukemia patients. For this group of patients the biological background of the cancer cells is largely unknown. Using our methodology we found an association of these cells to monocytes, which agreed with other experimental evidence. The second data set consisted of differentially expressed genes following androgen receptor stimulation in a prostate cancer cell line. Based on the analysis we put forward a hypothesis about the biological processes induced in these studied cells: secretory lysosomes are involved in the production of prostatic fluid and their development and/or secretion are androgen-regulated processes. CONCLUSION: Our method can be used to analyze DNA microarray datasets based on information explicitly and implicitly available in the literature. We provide a publicly available tool, dubbed Anni, for this purpose

Maastricht University Research Portal

Crossref

Springer - Publisher Connector

PubMed Central

EUR Research Repository

Erasmus University Digital Repository

Extracting causal relations on HIV drug resistance from literature

Author: A Koike
AM Cohen
Breanndán Ó Nualláin
C Giles
Charles A Boucher
D Klein
D Zhou
DR Douglas
F Horn
F Leitner
F Rinaldi
G Erkan
H Jang
H Saigo
IH Witten
J Saric
J Vercauteren
JG Liao
JH Kim
K Fundel
LJ Jensen
M Abulaish
M Huang
MY Kim
N Daraselia
O Sanchez-Graillet
P Zweigenbaum
Peter MA Sloot
Quoc-Chinh Bui
R Chowdhary
R Malik
RA Erhardt
S Ananiadou
S Katrenko
S Kim
T Lengauer
VI Torvik
Y Miyao
Publication venue: BioMed Central
Publication date: 01/01/2010

BACKGROUND: In HIV treatment it is critical to have up-to-date resistance data of applicable drugs since HIV has a very high rate of mutation. These data are made available through scientific publications and must be extracted manually by experts in order to be used by virologists and medical doctors. Therefore there is an urgent need for a tool that partially automates this process and is able to retrieve relations between drugs and virus mutations from literature. RESULTS: In this work we present a novel method to extract and combine relationships between HIV drugs and mutations in viral genomes. Our extraction method is based on natural language processing (NLP) which produces grammatical relations and applies a set of rules to these relations. We applied our method to a relevant set of PubMed abstracts and obtained 2,434 extracted relations with an estimated F-score of 84%. We then combined the extracted relations using logistic regression to generate resistance values for each <drug, mutation> pair. The results of this relation combination show more than 85% agreement with the Stanford HIVDB for the ten most frequently occurring mutations. The system is used in 5 hospitals from the Virolab project (http://www.virolab.org) to preselect the most relevant novel resistance data from literature and present those to virologists and medical doctors for further evaluation. CONCLUSIONS: The proposed relation extraction and combination method performs well at extracting HIV drug resistance data. It can be used in large-scale relation extraction experiments. The developed methods can also be applied to extract other types of relations such as gene-protein, gene-disease, and disease-mutation.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

EUR Research Repository

Erasmus University Digital Repository

DR-NTU (Digital Repository of NTU)

UvA-DARE

International Migration, Integration and Social Cohesion online publications

Mining phenotypes for gene function prediction

Author: A Kahraman
A Keller
AA Dobritsa
AJ Butte
B Hur
B Schwikowski
Bertram Weiss
BP Kelley
CR Scriver
D Kuttenkeuler
D Lin
D Sieburth
E SanJuana
EC Green
F Piano
G Pandey
G Roman
GJ Hannon
Hans-Dieter Pohlenz
JZ Wang
KA Kellerman
KC Gunsalus
KC Gunsalus
KJ Gaulton
LB Vosshall
M Bate
M Steinbach
MA Huynen
MA van Driel
N Daraselia
N Freimer
P Bhandari
P Groth
P Groth
Philip Groth
PW Lord
RM Cripps
S Jaeger
S Raychaudhuri
SC Rison
SD Brown
T Schupbach
U Nongthomba
Ulf Leser
US Eggert
V Mermall
V Spirin
X Guo
Y Lussier
Y Shi
Y Tao
Y Zhao
Y Zhao
Publication venue: BioMed Central
Publication date: 01/01/2008

BACKGROUND: Health and disease of organisms are reflected in their phenotypes. Often, a genetic component to a disease is discovered only after clearly defining its phenotype. In recent years, many technologies to systematically generate phenotypes in a high-throughput manner, such as RNA interference or gene knock-out, have been developed and used to decipher functions for genes. However, there have been relatively few efforts to make use of phenotype data beyond the single genotype-phenotype relationships. RESULTS: We present results of a study in which we use a large set of phenotype data – in textual form – to predict gene annotation. To this end, we use text clustering to group genes based on their phenotype descriptions. We show that these clusters correlate well with several indicators for biological coherence in gene groups, such as functional annotations from the Gene Ontology (GO) and protein-protein interactions. We exploit these clusters for predicting gene function by carrying over annotations from well-annotated genes to other, less-characterized genes in the same cluster. For a subset of groups selected by applying objective criteria, we can predict GO-term annotations from the biological process sub-ontology with up to 72.6% precision and 16.7% recall, as evaluated by cross-validation. We manually verified some of these clusters and found them to exhibit high biological coherence, e.g. a group containing all available antennal Drosophila odorant receptors despite inconsistent GO-annotations. CONCLUSION: The intrinsic nature of phenotypes to visibly reflect genetic activity underlines their usefulness in inferring new gene functions. Thus, systematically analyzing these data on a large scale offers many possibilities for inferring functional annotation of genes. We show that text clustering can play an important role in this process.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Corpus annotation for mining biomedical events from literature

BACKGROUND: Advanced Text Mining (TM) such as semantic enrichment of papers, event or relation extraction, and intelligent Question Answering have increasingly attracted attention in the bio-medical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been tried on a large scale, apart from relatively simple term annotation. RESULTS: We have completed a new type of semantic annotation, event annotation, which is an addition to the existing annotations in the GENIA corpus. The corpus has already been annotated with POS (Parts of Speech), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design a scheme of annotation which meets specific requirements of text annotation, (2) to achieve biology-oriented annotation which reflects biologists' interpretation of text, and (3) to ensure the homogeneity of annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to successful completion of a large scale annotation. CONCLUSION: The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for NLP (Natural Language Processing)-based TM in the bio-medical domain.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Evidence-Based Annotation of Gene Function in Shewanella oneidensis MR-1 Using Genome-Wide Fitness Profiling across 121 Conditions

Author: A Mitchell
A Typas
Adam Deutschbauer
Adam P. Arkin
AM Deutschbauer
AM Smith
B Christen
B Efron
B Rost
BJ Akerley
C Yang
CP Ewing
CR Myers
DE Cameron
DJ Burdige
E Alm
E Fischer
G Butland
G Butland
G Giaever
GC Langridge
GE Pinchuk
GW Birrell
H Gao
H Ochman
HH Hau
HS Girgis
I Tagkopoulos
IM Keseler
J Oh
J Oh
J Quan
JA Gralnick
Jason K. Baumohl
JD Gawronski
JD Peterson
JF Heidelberg
JJ Faith
JK Fredrickson
JL Groh
JR Warner
K Kobayashi
K Suzuki
Kelly M. Wetmore
KR Brocklehurst
KT Konstantinidis
L Binnenkade
LA Gallagher
LA Gallagher
M Hashimoto
M Huynen
ME Driscoll
ME Hillenmeyer
ME Hillenmeyer
ME Kovach
Michelle Nguyen
MJ Lercher
MN Price
Morgan N. Price
MY Galperin
N Daraselia
N Ishii
NT Liberati
P Burghout
Paul M. Richardson
PS Dehal
PS Novichkov
Q Ren
R Bouhenni
R Zhang
RA Larsen
Raquel Tamse
RJ Nichols
RJ Roberts
RL Tatusov
RM Martinez
Ronald W. Davis
S Kuhner
S Kumari
S Weinitschke
SE Pierce
SJ Cooper
SK Sharan
SY Gerdes
T Baba
T van Opijnen
TR Hughes
V de Berardinis
Wenjun Shao
Y Liu
Zhuchen Xu
Publication venue: Public Library of Science
Publication date: 01/11/2011

Most genes in bacteria are experimentally uncharacterized and cannot be annotated with a specific function. Given the great diversity of bacteria and the ease of genome sequencing, high-throughput approaches to identify gene function experimentally are needed. Here, we use pools of tagged transposon mutants in the metal-reducing bacterium Shewanella oneidensis MR-1 to probe the mutant fitness of 3,355 genes in 121 diverse conditions including different growth substrates, alternative electron acceptors, stresses, and motility. We find that 2,350 genes have a pattern of fitness that is significantly different from random and 1,230 of these genes (37% of our total assayed genes) have enough signal to show strong biological correlations. We find that genes in all functional categories have phenotypes, including hundreds of hypotheticals, and that potentially redundant genes (over 50% amino acid identity to another gene in the genome) are also likely to have distinct phenotypes. Using fitness patterns, we were able to propose specific molecular functions for 40 genes or operons that lacked specific annotations or had incomplete annotations. In one example, we demonstrate that the previously hypothetical gene SO_3749 encodes a functional acetylornithine deacetylase, thus filling a missing step in S. oneidensis metabolism. Additionally, we demonstrate that the orphan histidine kinase SO_2742 and orphan response regulator SO_2648 form a signal transduction pathway that activates expression of acetyl-CoA synthase and is required for S. oneidensis to grow on acetate as a carbon source. Lastly, we demonstrate that gene expression and mutant fitness are poorly correlated and that mutant fitness generates more confident predictions of gene function than does gene expression. The approach described here can be applied generally to create large-scale gene-phenotype maps for evidence-based annotation of gene function in prokaryotes

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Re-Annotation Is an Essential Step in Systems Biology Modeling of Functional Genomics Data

Author: A Harel
A Hutloff
AM Schnoes
Bart H. J. van den Berg
BH van den Berg
C Smith
CA Ouzounis
CE Jones
CE Rudd
CH Wu
D Barrell
D Devos
D Kemmer
DA Benson
DP Wall
E Eyras
E Quevillon
F Meurens
Fiona M. McCarthy
FM McCarthy
G Moreno-Hagelsieb
H Zhou
ICGS Consortium
Iddo Friedberg
J Burnside
JC Camus
JR Wortman
K Sellheyer
KM Kim
L Tian
LL Chen
M Andersson
M Andersson
M Ashburner
M Pruess
M Schena
M Vidric
ME van Berkel
MK Richardson
N Daraselia
N Gupta
N Rocques
O Gundogdu
PB Neerincx
PE Neiman
R Apweiler
R Edgar
RA Shilling
S Washietl
SE Brenner
Shane C. Burgess
SL Salzberg
Susan J. Lamont
T Barrett
TJ Buza
TJ Buza
UM Braga-Neto
V Wood
X Wang
YP de Jong
Publication venue: Public Library of Science
Publication date: 01/01/2010

One motivation of systems biology research is to understand gene functions and interactions from functional genomics data such as that derived from microarrays. Up-to-date structural and functional annotations of genes are an essential foundation of systems biology modeling. We propose that the first essential step in any systems biology modeling of functional genomics data, especially for species with recently sequenced genomes, is gene structural and functional re-annotation. To demonstrate the impact of such re-annotation, we structurally and functionally re-annotated a microarray developed, and previously used, as a tool for disease research. We quantified the impact of this re-annotation on the array based on the total numbers of structural- and functional-annotations, the Gene Annotation Quality (GAQ) score, and canonical pathway coverage. We next quantified the impact of re-annotation on systems biology modeling using a previously published experiment that used this microarray. We show that re-annotation improves the quantity and quality of structural- and functional-annotations, allows a more comprehensive Gene Ontology based modeling, and improves pathway coverage for both the whole array and a differentially expressed mRNA subset. Our results also demonstrate that re-annotation can result in a different knowledge outcome derived from previous published research findings. We propose that, because of this, re-annotation should be considered to be an essential first step for deriving value from functional genomics data

Digital Repository @ Iowa State University (ISU)

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Scholars Junction - Mississippi State University Institutional Repository

An Empirical Strategy for Characterizing Bacterial Proteomes across Species in the Absence of Genomic Sequences

Global protein identification through current proteomics methods typically depends on the availability of sequenced genomes. In spite of increasingly high throughput sequencing technologies, this information is not available for every microorganism and rarely available for entire microbial communities. Nevertheless, the protein-level homology that exists between related bacteria makes it possible to extract biological information from the proteome of an organism or microbial community by using the genomic sequences of a near neighbor organism. Here, we demonstrate a trans-organism search strategy for determining the extent to which near-neighbor genome sequences can be applied to identify proteins in unsequenced environmental isolates. In proof of concept testing, we found that within a CLUSTAL W distance of 0.089, near-neighbor genomes successfully identified a high percentage of proteins within an organism. Application of this strategy to characterize environmental bacterial isolates lacking sequenced genomes, but having 16S rDNA sequence similarity to Shewanella resulted in the identification of 300–500 proteins in each strain. The majority of identified pathways mapped to core processes, as well as to processes unique to the Shewanellae, in particular to the presence of c-type cytochromes. Examples of core functional categories include energy metabolism, protein and nucleotide synthesis and cofactor biosynthesis, allowing classification of bacteria by observation of conserved processes. Additionally, within these core functionalities, we observed proteins involved in the alternative lactate utilization pathway, recently described in Shewanella

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central