
Mining phenotypes for gene function prediction

Author: A Kahraman
A Keller
AA Dobritsa
AJ Butte
B Hur
B Schwikowski
Bertram Weiss
BP Kelley
CR Scriver
D Kuttenkeuler
D Lin
D Sieburth
E SanJuana
EC Green
F Piano
G Pandey
G Roman
GJ Hannon
Hans-Dieter Pohlenz
JZ Wang
KA Kellerman
KC Gunsalus
KJ Gaulton
LB Vosshall
M Bate
M Steinbach
MA Huynen
MA van Driel
N Daraselia
N Freimer
P Bhandari
P Groth
Philip Groth
PW Lord
RM Cripps
S Jaeger
S Raychaudhuri
SC Rison
SD Brown
T Schupbach
U Nongthomba
Ulf Leser
US Eggert
V Mermall
V Spirin
X Guo
Y Lussier
Y Shi
Y Tao
Y Zhao
Publication venue: BioMed Central
Publication date: 01/01/2008

Abstract. Background: Health and disease of organisms are reflected in their phenotypes. Often, a genetic component to a disease is discovered only after clearly defining its phenotype. In recent years, many technologies to systematically generate phenotypes in a high-throughput manner, such as RNA interference or gene knock-out, have been developed and used to decipher functions for genes. However, there have been relatively few efforts to make use of phenotype data beyond the single genotype-phenotype relationships. Results: We present results of a study in which we use a large set of phenotype data – in textual form – to predict gene annotation. To this end, we use text clustering to group genes based on their phenotype descriptions. We show that these clusters correlate well with several indicators for biological coherence in gene groups, such as functional annotations from the Gene Ontology (GO) and protein-protein interactions. We exploit these clusters for predicting gene function by carrying over annotations from well-annotated genes to other, less-characterized genes in the same cluster. For a subset of groups selected by applying objective criteria, we can predict GO-term annotations from the biological process sub-ontology with up to 72.6% precision and 16.7% recall, as evaluated by cross-validation. We manually verified some of these clusters and found them to exhibit high biological coherence, e.g. a group containing all available antennal Drosophila odorant receptors despite inconsistent GO-annotations. Conclusion: The intrinsic nature of phenotypes to visibly reflect genetic activity underlines their usefulness in inferring new gene functions. Thus, systematically analyzing these data on a large scale offers many possibilities for inferring functional annotation of genes. We show that text clustering can play an important role in this process.
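The annotation-transfer step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `min_support` threshold, the toy gene and term labels, and the use of leave-one-out evaluation (the abstract only says "cross-validation") are all assumptions made for illustration.

```python
from collections import Counter


def predict_terms(cluster, annotations, target, min_support=0.5):
    """Predict terms for `target` from the other annotated genes in its cluster.

    A term is predicted when at least `min_support` (a hypothetical threshold)
    of the other annotated cluster members carry it.
    """
    others = [g for g in cluster if g != target and annotations.get(g)]
    if not others:
        return set()
    counts = Counter(term for g in others for term in annotations[g])
    return {t for t, c in counts.items() if c / len(others) >= min_support}


def leave_one_out(clusters, annotations, min_support=0.5):
    """Hide each annotated gene in turn and score the predicted terms."""
    tp = fp = fn = 0
    for cluster in clusters:
        for gene in cluster:
            truth = annotations.get(gene)
            if not truth:
                continue  # nothing to evaluate against
            predicted = predict_terms(cluster, annotations, gene, min_support)
            tp += len(predicted & truth)
            fp += len(predicted - truth)
            fn += len(truth - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: gene and term names are placeholders, not real identifiers.
clusters = [["g1", "g2", "g3"], ["g4", "g5"]]
annotations = {
    "g1": {"A", "B"},
    "g2": {"A"},
    "g3": {"A", "C"},
    "g4": {"D"},
    "g5": {"E"},
}
precision, recall = leave_one_out(clusters, annotations, min_support=1.0)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Raising `min_support` trades recall for precision, which matches the pattern reported in the abstract (high precision, low recall) without reproducing its numbers.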

Corpus annotation for mining biomedical events from literature

Abstract. Background: Advanced Text Mining (TM) such as semantic enrichment of papers, event or relation extraction, and intelligent Question Answering have increasingly attracted attention in the bio-medical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been tried on a large scale, apart from relatively simple term annotation. Results: We have completed a new type of semantic annotation, event annotation, which is an addition to the existing annotations in the GENIA corpus. The corpus has already been annotated with POS (Parts of Speech), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design an annotation scheme that meets the specific requirements of text annotation, (2) to achieve biology-oriented annotation that reflects biologists' interpretation of text, and (3) to ensure the homogeneity of annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to the successful completion of a large-scale annotation. Conclusion: The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for NLP (Natural Language Processing)-based TM in the bio-medical domain.

Extracting causal relations on HIV drug resistance from literature

Author: A Koike
AM Cohen
Breanndán Ó Nualláin
C Giles
Charles A Boucher
D Klein
D Zhou
DR Douglas
F Horn
F Leitner
F Rinaldi
G Erkan
H Jang
H Saigo
IH Witten
J Saric
J Vercauteren
JG Liao
JH Kim
K Fundel
LJ Jensen
M Abulaish
M Huang
MY Kim
N Daraselia
O Sanchez-Graillet
P Zweigenbaum
Peter MA Sloot
Quoc-Chinh Bui
R Chowdhary
R Malik
RA Erhardt
S Ananiadou
S Katrenko
S Kim
T Lengauer
VI Torvik
Y Miyao
Publication venue: BioMed Central
Publication date: 01/01/2010

Abstract. Background: In HIV treatment it is critical to have up-to-date resistance data of applicable drugs since HIV has a very high rate of mutation. These data are made available through scientific publications and must be extracted manually by experts in order to be used by virologists and medical doctors. Therefore there is an urgent need for a tool that partially automates this process and is able to retrieve relations between drugs and virus mutations from literature. Results: In this work we present a novel method to extract and combine relationships between HIV drugs and mutations in viral genomes. Our extraction method is based on natural language processing (NLP) which produces grammatical relations and applies a set of rules to these relations. We applied our method to a relevant set of PubMed abstracts and obtained 2,434 extracted relations with an estimated F-score of 84%. We then combined the extracted relations using logistic regression to generate resistance values for each <drug, mutation> pair. The results of this relation combination show more than 85% agreement with the Stanford HIVDB for the ten most frequently occurring mutations. The system is used in 5 hospitals from the Virolab project (http://www.virolab.org) to preselect the most relevant novel resistance data from literature and present those to virologists and medical doctors for further evaluation. Conclusions: The proposed relation extraction and combination method performs well on extracting HIV drug resistance data. It can be used in large-scale relation extraction experiments. The developed methods can also be applied to extract other types of relations such as gene-protein, gene-disease, and disease-mutation.

Erasmus University Digital Repository

EUR Research Repository

DR-NTU (Digital Repository of NTU)

UvA-DARE

International Migration, Integration and Social Cohesion online publications

BioInfer: a corpus for information extraction in the biomedical domain

Author: A Yakushiji
CF Baker
D Lin
DD Sleator
E Alphonse
E Tsivtsivadze
F Ginter
Filip Ginter
G Hripcsak
H Shatkay
J Cohen
J Ding
J Kim
Jari Björne
JM Temkin
Jorma Boberg
Jouni Järvinen
Juho Heimonen
K Franzén
K Kipper
KB Cohen
L Hirschman
L Salwinski
M Ashburner
N Daraselia
P Kingsbury
P Szolovits
S Aubin
S Pyysalo
S Siegel
Sampo Pyysalo
T Ohta
T Pahikkala
T Wattarujeekrit
Tapio Salakoski
TH King
Y Tateisi
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: Lately, there has been great interest in the application of information extraction methods to the biomedical domain, in particular, to the extraction of relationships of genes, proteins, and RNA from scientific publications. The development and evaluation of such methods require annotated domain corpora. RESULTS: We present BioInfer (Bio Information Extraction Resource), a new public resource providing an annotated corpus of biomedical English. We describe an annotation scheme capturing named entities and their relationships along with a dependency analysis of sentence syntax. We further present ontologies defining the types of entities and relationships annotated in the corpus. Currently, the corpus contains 1100 sentences from abstracts of biomedical research articles annotated for relationships, named entities, as well as syntactic dependencies. Supporting software is provided with the corpus. The corpus is unique in the domain in combining these annotation types for a single set of sentences, and in the level of detail of the relationship annotation. CONCLUSION: We introduce a corpus targeted at protein, gene, and RNA relationships which serves as a resource for the development of information extraction systems and their components such as parsers and domain analyzers. The corpus will be maintained and further developed with a current version being available at

Maastricht University Research Portal

Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation

BACKGROUND: High-throughput experiments, such as with DNA microarrays, typically result in hundreds of genes potentially relevant to the process under study, rendering the interpretation of these experiments problematic. Here, we propose and evaluate an approach to find functional associations between large numbers of genes and other biomedical concepts from free-text literature. For each gene, a profile of related concepts is constructed that summarizes the context in which the gene is mentioned in literature. We assign a weight to each concept in the profile based on a likelihood ratio measure. Gene concept profiles can then be clustered to find related genes and other concepts. RESULTS: The experimental validation was done in two steps. We first applied our method to a controlled test set. After this proved to be successful, the datasets from two DNA microarray experiments were analyzed in the same way and the results were evaluated by domain experts. The first dataset was a gene-expression profile that characterizes the cancer cells of a group of acute myeloid leukemia patients. For this group of patients the biological background of the cancer cells is largely unknown. Using our methodology we found an association of these cells to monocytes, which agreed with other experimental evidence. The second dataset consisted of differentially expressed genes following androgen receptor stimulation in a prostate cancer cell line. Based on the analysis we put forward a hypothesis about the biological processes induced in these studied cells: secretory lysosomes are involved in the production of prostatic fluid and their development and/or secretion are androgen-regulated processes. CONCLUSION: Our method can be used to analyze DNA microarray datasets based on information explicitly and implicitly available in the literature. We provide a publicly available tool, dubbed Anni, for this purpose.

Erasmus University Digital Repository

EUR Research Repository

Generation and Validation of a Shewanella oneidensis MR-1 Clone Set for Protein Expression and Phage Display

Author: A Dricot
A-C Gavin
AA Korenevsky
AB Leaphart
Adam B. Leaphart
AJ Darwin
AS Beliaev
B Arezi
C Schwalb
D Xu
DA Saffarini
Dawn M. Klingeman
DM Gelperin
Donna Pattison
E Kolker
EM Phizicky
G Marsischky
George M. Weinstock
H Gao
H Zhu
Haichun Gao
J LaBaer
J Park
J Reboul
JA Gralnick
JA Hoch
JC Aguiar
JF Heidelberg
Jizhong Zhou
JL Hartley
Joseph Petrosino
JW Kehoe
K Chourey
KL Tyson
KS Reece
Lisa Hemphill
M Forstner
M McKevitt
MF Romine
MM Bradford
MZ Li
N Daraselia
N Salamat-Miller
Niyaz Ahmed
P Uetz
Q Liu
R Bencheikh-Latmani
S Li
SD Brown
T Palzkill
Timothy Palzkill
Tingfen Yan
TM Maier
W Huang
X Qiu
X Tang
X Zhu
Xiaohu Wang
Xiufeng Wan
Y Hu
Y Liu
YJ Tang
Publication venue: Public Library of Science
Publication date: 01/01/2008

A comprehensive gene collection for S. oneidensis was constructed using the lambda recombinase (Gateway) cloning system. A total of 3584 individual ORFs (85%) have been successfully cloned into the entry plasmids. To validate the use of the clone set, three sets of ORFs were examined within three different destination vectors constructed in this study. Success rates for heterologous protein expression of S. oneidensis His- or His/GST-tagged proteins in E. coli were approximately 70%. The ArcA and NarP transcription factor proteins were tested in an in vitro binding assay to demonstrate that functional proteins can be successfully produced using the clone set. Further functional validation of the clone set was obtained from phage display experiments in which a phage encoding thioredoxin was successfully isolated from a pool of 80 different clones after three rounds of biopanning using immobilized anti-thioredoxin antibody as a target. This clone set complements existing genomic (e.g., whole-genome microarray) and other proteomic tools (e.g., mass spectrometry-based proteomic analysis), and facilitates a wide variety of integrated studies, including protein expression, purification, and functional analyses of proteins both in vivo and in vitro.

CiteSeerX

SHAREOK repository

An Empirical Strategy for Characterizing Bacterial Proteomes across Species in the Absence of Genomic Sequences

Global protein identification through current proteomics methods typically depends on the availability of sequenced genomes. In spite of increasingly high-throughput sequencing technologies, this information is not available for every microorganism and rarely available for entire microbial communities. Nevertheless, the protein-level homology that exists between related bacteria makes it possible to extract biological information from the proteome of an organism or microbial community by using the genomic sequences of a near-neighbor organism. Here, we demonstrate a trans-organism search strategy for determining the extent to which near-neighbor genome sequences can be applied to identify proteins in unsequenced environmental isolates. In proof-of-concept testing, we found that within a CLUSTAL W distance of 0.089, near-neighbor genomes successfully identified a high percentage of proteins within an organism. Application of this strategy to characterize environmental bacterial isolates lacking sequenced genomes, but having 16S rDNA sequence similarity to Shewanella, resulted in the identification of 300–500 proteins in each strain. The majority of identified pathways mapped to core processes, as well as to processes unique to the Shewanellae, in particular to the presence of c-type cytochromes. Examples of core functional categories include energy metabolism, protein and nucleotide synthesis and cofactor biosynthesis, allowing classification of bacteria by observation of conserved processes. Additionally, within these core functionalities, we observed proteins involved in the alternative lactate utilization pathway, recently described in Shewanella.

CiteSeerX

Integrated Bio-Entity Network: A System for Biological Knowledge Discovery

Author: A Ceol
A Chatr-aryamontri
A Coulet
A Grote
A Koike
A Mottaz
A Rzhetsky
A Yuryev
B Aranda
C Alfarano
C Blaschke
C Friedman
C Stark
CB Giles
CF Schaefer
D Barrell
D Hristovski
D Maglott
D Tikk
DR Swanson
EW Dijkstra
F Leitner
G Gonzalez
GR Mishra
H Liu
I Iossifov
I Vastrik
J Bjorne
JD Wren
Jinfeng Zhang
JO Korbel
Jun S. Liu
K Du
K Han
KD Pruitt
L Gong
L Salwinski
Lindsey Bell
LJ Jensen
LS Wong
M Ashburner
M Castagna
M Devignes
M Huang
M Kanehisa
M Krallinger
M Kuhn
M Yetisgen-Yildiz
MG Kann
N Daraselia
N Sierro
OL Griffith
P Pagel
P Shahi
P Srinivasan
QC Bui
R Apweiler
R Chowdhary
R Crnich
R Frijters
R Hoffmann
R Saetre
Rajesh Chowdhary
S Gama-Castro
S Mathivanan
S Naidu
S Yilmaz
T Beuming
TH Cormen
TS Keshava Prasad
V Matys
Xufeng Niu
Y Li
Y Wang
Ying Xu
Z Gao
Z Huang
Publication venue: Public Library of Science
Publication date: 27/06/2011

A significant part of our biological knowledge is centered on relationships between biological entities (bio-entities) such as proteins, genes, small molecules, pathways, gene ontology (GO) terms and diseases. Accumulated at an increasing speed, the information on bio-entity relationships is archived in different forms at scattered places. Most of this information is buried in scientific literature as unstructured text. Organizing heterogeneous information in a structured form not only facilitates study of biological systems using integrative approaches, but also allows discovery of new knowledge in an automatic and systematic way. In this study, we performed a large-scale integration of bio-entity relationship information from both databases containing manually annotated, structured information and automatic information extraction of unstructured text in scientific literature. The relationship information we integrated in this study includes protein–protein interactions, protein/gene regulations, protein–small molecule interactions, protein–GO relationships, protein–pathway relationships, and pathway–disease relationships. The relationship information is organized in a graph data structure, named integrated bio-entity network (IBN), where the vertices are the bio-entities and edges represent their relationships. Under this framework, graph theoretic algorithms can be designed to perform various knowledge discovery tasks. We designed breadth-first search with pruning (BFSP) and most probable path (MPP) algorithms to automatically generate hypotheses: the indirect relationships with high probabilities in the network. We show that IBN can be used to generate plausible hypotheses, which not only help to better understand the complex interactions in biological systems, but also provide guidance for experimental designs.
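The abstract does not spell out how the most probable path (MPP) algorithm works. One standard way to find the highest-probability path in a graph is to run Dijkstra's shortest-path algorithm on edge weights of -log(p). The sketch below shows that technique, not the authors' algorithm. It assumes each edge carries an independent probability in (0, 1], so a path's probability is the product of its edge probabilities. The node names and probabilities are made up.

```python
import heapq
import math


def most_probable_path(graph, source, target):
    """Return (path, probability) for the most probable path, or (None, 0.0).

    `graph` maps node -> {neighbor: edge probability in (0, 1]}.
    Maximizing a product of probabilities equals minimizing the sum of
    -log(p), which Dijkstra's algorithm handles since -log(p) >= 0.
    """
    best = {source: 0.0}   # lowest known cost (sum of -log p) per node
    prev = {}              # back-pointers for path reconstruction
    heap = [(0.0, source)]
    done = set()
    while heap:
        cost, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for nbr, p in graph.get(node, {}).items():
            if p <= 0.0:
                continue  # an impossible edge contributes no path
            new_cost = cost - math.log(p)
            if new_cost < best.get(nbr, math.inf):
                best[nbr] = new_cost
                prev[nbr] = node
                heapq.heappush(heap, (new_cost, nbr))
    if target not in done:
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, math.exp(-best[target])


# Hypothetical bio-entity network with illustrative edge probabilities.
graph = {
    "drugX": {"proteinA": 0.9, "proteinB": 0.5},
    "proteinA": {"pathwayP": 0.8},
    "proteinB": {"pathwayP": 0.9},
    "pathwayP": {"diseaseD": 0.7},
}
path, prob = most_probable_path(graph, "drugX", "diseaseD")
print(path, round(prob, 3))
```

An indirect link such as the one from `drugX` to `diseaseD` is the kind of hypothesis the abstract describes. Pruning by a minimum probability or a maximum path length, as BFSP suggests, would be a natural extension, but its details are not given in the source.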

Evidence-Based Annotation of Gene Function in Shewanella oneidensis MR-1 Using Genome-Wide Fitness Profiling across 121 Conditions

Author: A Mitchell
A Typas
Adam Deutschbauer
Adam P. Arkin
AM Deutschbauer
AM Smith
B Christen
B Efron
B Rost
BJ Akerley
C Yang
CP Ewing
CR Myers
DE Cameron
DJ Burdige
E Alm
E Fischer
G Butland
G Giaever
GC Langridge
GE Pinchuk
GW Birrell
H Gao
H Ochman
HH Hau
HS Girgis
I Tagkopoulos
IM Keseler
J Oh
J Quan
JA Gralnick
Jason K. Baumohl
JD Gawronski
JD Peterson
JF Heidelberg
JJ Faith
JK Fredrickson
JL Groh
JR Warner
K Kobayashi
K Suzuki
Kelly M. Wetmore
KR Brocklehurst
KT Konstantinidis
L Binnenkade
LA Gallagher
M Hashimoto
M Huynen
ME Driscoll
ME Hillenmeyer
ME Kovach
Michelle Nguyen
MJ Lercher
MN Price
Morgan N. Price
MY Galperin
N Daraselia
N Ishii
NT Liberati
P Burghout
Paul M. Richardson
PS Dehal
PS Novichkov
Q Ren
R Bouhenni
R Zhang
RA Larsen
Raquel Tamse
RJ Nichols
RJ Roberts
RL Tatusov
RM Martinez
Ronald W. Davis
S Kuhner
S Kumari
S Weinitschke
SE Pierce
SJ Cooper
SK Sharan
SY Gerdes
T Baba
T van Opijnen
TR Hughes
V de Berardinis
Wenjun Shao
Y Liu
Zhuchen Xu
Publication venue: Public Library of Science
Publication date: 01/11/2011

Most genes in bacteria are experimentally uncharacterized and cannot be annotated with a specific function. Given the great diversity of bacteria and the ease of genome sequencing, high-throughput approaches to identify gene function experimentally are needed. Here, we use pools of tagged transposon mutants in the metal-reducing bacterium Shewanella oneidensis MR-1 to probe the mutant fitness of 3,355 genes in 121 diverse conditions including different growth substrates, alternative electron acceptors, stresses, and motility. We find that 2,350 genes have a pattern of fitness that is significantly different from random and 1,230 of these genes (37% of our total assayed genes) have enough signal to show strong biological correlations. We find that genes in all functional categories have phenotypes, including hundreds of hypotheticals, and that potentially redundant genes (over 50% amino acid identity to another gene in the genome) are also likely to have distinct phenotypes. Using fitness patterns, we were able to propose specific molecular functions for 40 genes or operons that lacked specific annotations or had incomplete annotations. In one example, we demonstrate that the previously hypothetical gene SO_3749 encodes a functional acetylornithine deacetylase, thus filling a missing step in S. oneidensis metabolism. Additionally, we demonstrate that the orphan histidine kinase SO_2742 and orphan response regulator SO_2648 form a signal transduction pathway that activates expression of acetyl-CoA synthase and is required for S. oneidensis to grow on acetate as a carbon source. Lastly, we demonstrate that gene expression and mutant fitness are poorly correlated and that mutant fitness generates more confident predictions of gene function than does gene expression. The approach described here can be applied generally to create large-scale gene-phenotype maps for evidence-based annotation of gene function in prokaryotes.
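One plausible way to use fitness patterns for annotation is to look for genes whose fitness profiles across conditions are strongly correlated. The abstract does not state which similarity measure the authors used, so the Pearson correlation below is an assumption. The function names, gene labels and fitness values are also hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def top_correlated(fitness, gene, k=1):
    """Return the k genes whose fitness profiles correlate best with `gene`."""
    scores = [
        (pearson(fitness[gene], profile), other)
        for other, profile in fitness.items()
        if other != gene
    ]
    scores.sort(reverse=True)
    return [(other, r) for r, other in scores[:k]]


# Toy fitness values (one entry per condition); negative means the mutant is less fit.
fitness = {
    "geneA": [-2.0, 0.1, -1.5, 0.0],
    "geneB": [-1.8, 0.0, -1.6, 0.2],
    "geneC": [0.1, -2.0, 0.0, -1.7],
}
print(top_correlated(fitness, "geneA"))
```

If `geneB` were well annotated and `geneA` hypothetical, a high correlation would suggest a candidate function for `geneA` to be checked experimentally, in the spirit of the evidence-based annotation the abstract describes.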