
Subcellular location prediction of proteins using support vector machines with alignment of block sequences utilizing amino acid composition

Author: A Ben-Hur
A Reinhardt
BW Matthews
C Guda
C Leslie
CS Yu
CS Yu
CS Yu
H Nielsen
J Cedano
J Guo
K Nakai
KC Chou
KC Chou
KC Chou
KC Chou
KJ Park
M Bhasin
M Bhasin
M Bhasin
M Kumar
M Reczko
O Emanuelsson
O Emanuelsson
P Horton
P Horton
P Horton
P Pavlidis
R Nair
S Hua
S Matsuda
Takeyuki Tamura
Tatsuya Akutsu
YD Cai
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: Subcellular location prediction of proteins is an important and well-studied problem in bioinformatics. The task is to predict which part of a cell a given protein is transported to, given the protein's amino acid sequence as input. The problem is becoming more important because information on subcellular location helps annotate proteins and genes, and the number of complete genomes is rapidly increasing. Since existing predictors are based on various heuristics, it is important to develop a simple method with high prediction accuracy. Results: In this paper, we propose a novel and general prediction method that combines techniques for sequence alignment with feature vectors based on amino acid composition. We implemented this method with support vector machines on plant data sets extracted from the TargetP database. In fivefold cross-validation tests, the overall accuracy and average MCC were 0.9096 and 0.8655, respectively. We also applied our method to other datasets, including that of WoLF PSORT. Conclusion: Although there is a predictor that uses gene ontology information and yields higher accuracy than ours, our accuracies are higher than those of existing predictors that use only sequence information. Since information such as gene ontology can be obtained only for known proteins, our predictor is considered useful for subcellular location prediction of newly discovered proteins. Furthermore, the idea of combining alignment and amino acid frequency is novel and general, so it may be applied to other problems in bioinformatics. Our method for plant proteins is also implemented as a web system, available at http://sunflower.kuicr.kyoto-u.ac.jp/~tamura/slpfa.html
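The abstract above rests on two simple building blocks: a feature vector of amino acid frequencies and the Matthews correlation coefficient (MCC) used to report performance. The following is a minimal sketch of both, not the authors' implementation. It does not reproduce the alignment of block sequences or the SVM training, and the function names `aa_composition` and `mcc` are hypothetical.

```python
from collections import Counter
from math import sqrt

# The 20 standard amino acids, in a fixed order so that every
# sequence maps to a vector with the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Return the 20-dimensional amino acid frequency vector of a sequence.

    Characters outside the 20 standard residues are ignored
    (an assumption made for this sketch).
    """
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from a binary confusion matrix.

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

For a multi-class problem such as subcellular location, an MCC is typically computed per location (one class against the rest) and then averaged, which is presumably what "average MCC" refers to in the abstract.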

Kyoto University Research Information Repository

LipocalinPred: a SVM-based method for prediction of lipocalins

Author: A Ben-Hur
A Garg
A Garg
A Sali
AS Martin Vogt
B Adam
C Leslie
D Holloway
D Plewczynski
Dinesh Gupta
DR Flower
DR Flower
G Wang
H Saiga
H Saigo
J Ahnstrom
J Duan
J Hull-Thompson
J Thorsten
JA Swets
Jayashree Ramana
LJ McGuffin
M Sieber
M Zervakis
NV Vapnik
P Pavlidis
R Rajakariar
S Ahmad
S Arne
SF Altschul
SR Eddy
W Deng
X Yu
YR Chan
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Functional annotation of rapidly amassing nucleotide and protein sequences presents a challenging task for modern bioinformatics. This is particularly true for protein families sharing extremely low sequence identity, such as lipocalins, a family of proteins with varied functions and great diversity at the sequence level, yet conserved structures. Results: In the present study we propose an SVM-based method for identification of lipocalin protein sequences. The SVM models were trained with input features generated from amino acid, dipeptide and secondary structure compositions as well as PSSM profiles. The model derived using both PSSM and secondary structure emerged as the best model in the study. Apart from achieving a high prediction accuracy (>90% in leave-one-out), LipocalinPred correctly differentiates closely related fatty acid-binding proteins and triabins as non-lipocalins. Conclusion: The method offers a promising approach as a lipocalin prediction tool, complementing PROSITE, Pfam and homology modelling methods.

Public Library of Science (PLOS)

Inferring Pathway Activity toward Precise Disease Classification

Author: A Agresti
A Bhattacharjee
A Subramanian
AA Alizadeh
AH Bild
B Tian
CL Banka
DG Beer
Doheon Lee
E Segal
EJ Yeoh
Eunjung Lee
Greg Tucker-Kellogg
GV Glinsky
Han-Yu Chuang
HY Chuang
J Chen
J Lapointe
JA Swets
Jong-Won Kim
JP Svensson
JP Vert
KM Mani
L Ein-Dor
L Tian
LJ van 't Veer
MJ van de Vijver
P Pavlidis
P Pavlidis
R Sharan
RA Fisher
RA Gatenby
RA Gatenby
S Draghici
S Efroni
S Ramaswamy
SA Tomlins
SS Gambhir
SW Doniger
T Breslin
T Ideker
TR Golub
Trey Ideker
VK Mootha
WF Symmans
Y Saeys
Y Wang
Z Guo
Publication venue: Public Library of Science
Publication date: 01/01/2008

The advent of microarray technology has made it possible to classify disease states based on gene expression profiles of patients. Typically, marker genes are selected by measuring the power of their expression profiles to discriminate among patients of different disease states. However, expression-based classification can be challenging in complex diseases due to factors such as cellular heterogeneity within a tissue sample and genetic heterogeneity across patients. A promising technique for coping with these challenges is to incorporate pathway information into the disease classification procedure in order to classify disease based on the activity of entire signaling pathways or protein complexes rather than on the expression levels of individual genes or proteins. We propose a new classification method based on pathway activities inferred for each patient. For each pathway, an activity level is summarized from the gene expression levels of its condition-responsive genes (CORGs), defined as the subset of genes in the pathway whose combined expression delivers optimal discriminative power for the disease phenotype. We show that classifiers using pathway activity achieve better performance than classifiers based on individual gene expression, for both simple and complex case-control studies including differentiation of perturbed from non-perturbed cells and subtyping of several different kinds of cancer. Moreover, the new method outperforms several previous approaches that use a static (i.e., non-conditional) definition of pathways. Within a pathway, the identified CORGs may facilitate the development of better diagnostic markers and the discovery of core alterations in human disease
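The idea of summarizing a pathway by its condition-responsive genes (CORGs) can be illustrated with a small greedy search. This is a sketch under stated assumptions, not the published procedure: it assumes expression values are already z-scored per gene, combines member genes as a sum divided by the square root of the number of genes, scores discriminative power with the absolute value of a Welch t-statistic, and stops as soon as adding the next gene no longer improves the score. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_score(a, b):
    """Welch t-statistic between two lists of values (0.0 if undefined)."""
    denom = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if denom == 0:
        return 0.0
    return (mean(a) - mean(b)) / denom


def activity(z, genes):
    """Per-sample activity: summed z-scores of `genes` divided by sqrt(k).

    z maps gene name -> list of z-scored expression values, one per sample.
    """
    n_samples = len(next(iter(z.values())))
    k = len(genes)
    return [sum(z[g][j] for g in genes) / sqrt(k) for j in range(n_samples)]


def greedy_corgs(z, labels):
    """Greedily pick the gene subset whose activity best separates the classes.

    labels holds 1 for case samples and 0 for controls.
    Returns (chosen genes, discriminative score).
    """
    def score(genes):
        act = activity(z, genes)
        case = [a for a, lab in zip(act, labels) if lab == 1]
        ctrl = [a for a, lab in zip(act, labels) if lab == 0]
        return abs(t_score(case, ctrl))

    # Rank genes by their individual discriminative score, best first.
    ranked = sorted(z, key=lambda g: score([g]), reverse=True)
    chosen = [ranked[0]]
    best = score(chosen)
    for g in ranked[1:]:
        s = score(chosen + [g])
        if s <= best:
            break  # no improvement: stop the search (simplifying assumption)
        chosen.append(g)
        best = s
    return chosen, best
```

The resulting per-sample activity values, one per pathway, would then replace individual gene expression values as classifier features.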

CiteSeerX

Public Library of Science (PLOS)

Time to Recurrence and Survival in Serous Ovarian Tumors Predicted from Integrated Genomic Profiles

Author: A Daemen
A Schramm
AC Tan
AP Crijns
Chris Sander
CT Lopes
DM Witten
Douglas A. Levine
E Cerami
E Noetzel
G Heller
H Zou
HK Dressman
HM Bovelstad
J Helleman
J Subramanian
JJ Peluso
JV Rajan
K Yoshihara
KL Borden
L Ein-Dor
LC Hartmann
M Gönen
M Zangenberg
MW Causey
MY Park
Nikolaus Schultz
O Smaletz
P Pavlidis
Parminder K. Mankoo
PS Freemont
R Shen
R Tibshirani
Ronglai Shen
S Awasthi
S Dell'Orso
S L'Esperance
S Maere
S Mizuarai
S Wada
SF Slovin
Sumitra Deb
SY Yu
T Bonome
T Ota
V Poroyo
Y Jiang
YT Tai
ZZ Wu
Publication venue: Public Library of Science
Publication date: 03/11/2011

Serous ovarian cancer (SeOvCa) is an aggressive disease with differential and often inadequate therapeutic outcome after standard treatment. The Cancer Genome Atlas (TCGA) has provided rich molecular and genetic profiles from hundreds of primary surgical samples. These profiles confirm mutations of TP53 in ∼100% of patients and an extraordinarily complex profile of DNA copy number changes with considerable patient-to-patient diversity. This raises the joint challenge of exploiting all newly available datasets and reducing their confounding complexity for the purpose of predicting clinical outcomes and identifying disease-relevant pathway alterations. We therefore set out to use multi-data-type genomic profiles (mRNA, DNA methylation, DNA copy-number alteration and microRNA) available from TCGA to identify prognostic signatures for the prediction of progression-free survival (PFS) and overall survival (OS). [...] prediction algorithm and applied it to two datasets integrated from the four genomic data types. We (1) selected features through cross-validation; (2) generated a prognostic index for patient risk stratification; and (3) directly predicted continuous clinical outcome measures, that is, the time to recurrence and survival time. We used Kaplan-Meier p-values, hazard ratios (HR), and concordance probability estimates (CPE) to assess prediction performance, comparing separate and integrated datasets. Data integration resulted in the best PFS signature (withheld data: p-value = 0.008; HR = 2.83; CPE = 0.72). We provide a prediction tool that takes genomic profiles of primary surgical samples as input and generates patient-specific predictions for the time to recurrence and survival, along with outcome risk predictions. Using integrated genomic profiles resulted in information gain for prediction of outcomes. Pathway analysis provided potential insights into functional changes affecting disease progression. The prognostic signatures, if prospectively validated, may be useful for interpreting therapeutic outcomes for clinical trials that aim to improve the therapy for SeOvCa patients.

arXiv.org e-Print Archive

ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples

Author: A Su
B Brancotte
B Calvo
B Linghu
B Liu
B Schölkopf
B Schölkopf
B Schölkopf
C Giallourakis
C Perez-Iratxeta
C Son
CC Chang
EA Adie
F Denis
F Mordelet
Fantine Mordelet
FS Turner
G Lanckriet
GRG Lanckriet
J Freudenberg
Jean-Philippe Vert
K Bleakley
K Lage
L Jacob
L Jacob
LC Tranchevent
M van Driel
N López-Bigas
N Tiffin
O Vanunu
P Pavlidis
RI Kondor
S Aerts
S Köhler
S Yu
T De Bie
T Evgeniou
T Hwang
U Ala
V McKusick
X Wu
Y Yamanishi
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of disease gene candidates, the identification of disease genes among the candidates remains time-consuming and expensive. Efficient computational methods are therefore needed to prioritize genes within the list of candidates, by exploiting the wealth of information available about the genes in various databases. Results: We propose ProDiGe, a novel algorithm for Prioritization of Disease Genes. ProDiGe implements a novel machine learning strategy based on learning from positive and unlabeled examples, which makes it possible to integrate various sources of information about the genes, to share information about known disease genes across diseases, and to perform genome-wide searches for new disease genes. Experiments on real data show that ProDiGe outperforms state-of-the-art methods for the prioritization of genes in human diseases. Conclusions: ProDiGe implements a new machine learning paradigm for gene prioritization, which could help the identification of new disease genes. It is freely available at http://cbio.ensmp.fr/prodige.

Universidade do Minho: RepositoriUM

Dormancy within Staphylococcus epidermidis biofilms : a transcriptomic analysis by RNA-seq

Author: A Bink
A Franceschini
A Mortazavi
A Resch
AH Vliet van
AJ Westermann
AN Bhatt
Angela França
AS Kaprelyants
B Song
CF Schuster
CI Kint
D Mack
D Merico
D Shah
DL Piddington
F Cerca
F Cerca
F Cerca
F Sun
Filipe Cerca
G Mittenhuber
G Zandri
GD Bader
Gerald B. Pier
GW Li
HJ Blumenthal
I Keren
I Keren
I Keren
J Dworkin
J Kim
JA Shapiro
JD Oliver
JH Bullard
JT Trevors
K Lewis
K Lewis
K Lewis
KA Baggerly
KJ Livak
KR Allison
KR Allison
KS Williamson
M Ashburner
M Butala
M Fauvart
M Kanehisa
M Otto
M Punta
M Shu
MA Orman
Manuel Vilanova
MD Young
MS Cline
NJ Croucher
NQ Balaban
Nuno Cerca
NY Yu
P Pavlidis
P Shannon
PD Cotter
PS Stewart
R Jayaraman
R Sorek
RR Colwell
Rui Vitorino
S Rozen
S Tarazona
SR Gill
T Raz
T Wijtzes
Virginia Carvalhais
WM Dunne Jr
X Didelot
Y Pawitan
Y Yao
Z Fang
Z Fu
Z Wang
Publication venue: Springer Science and Business Media LLC
Publication date: 07/02/2014

The proportion of dormant bacteria within Staphylococcus epidermidis biofilms may determine their inflammatory profile. Previously, we have shown that S. epidermidis biofilms with higher proportions of dormant bacteria have reduced activation of murine macrophages. RNA-sequencing was used to identify the major transcriptomic differences between S. epidermidis biofilms with different proportions of dormant bacteria. To accomplish this goal, we used an in vitro model where magnesium allowed modulation of the proportion of dormant bacteria within S. epidermidis biofilms. Significant differences were found in the expression of 147 genes. A detailed analysis of the results was performed based on direct and functional gene interactions. Biological processes among the differentially expressed genes were mainly related to oxidation-reduction processes and acetyl-CoA metabolic processes. Gene set enrichment revealed that the translation process is related to the proportion of dormant bacteria. Transcription of mRNAs involved in oxidation-reduction processes was associated with higher proportions of dormant bacteria within S. epidermidis biofilms. Moreover, the pH of the culture medium did not change after the addition of magnesium, and genes related to magnesium transport did not seem to impact entrance of bacterial cells into dormancy.
The authors thank Stephen Lorry at Harvard Medical School for providing CLC Genomics software. This work was funded by Fundação para a Ciência e a Tecnologia (FCT) and COMPETE grants PTDC/BIA-MIC/113450/2009, FCOMP-01-0124-FEDER-014309, FCOMP-01-0124-FEDER-022718 (FCT PEst-C/SAU/LA0002/2011), QOPNA research unit (project PEst-C/QUI/UI0062/2011), and CENTRO-07-ST24-FEDER-002034. The following authors had an individual FCT fellowship: VC (SFRH/BD/78235/2011) and AF (2SFRH/BD/62359/2009).

The CRE1 carbon catabolite repressor of the fungus Trichoderma reesei: a master regulator of carbon assimilation

Background: The identification and characterization of the transcriptional regulatory networks governing the physiology and adaptation of microbial cells is a key step in understanding their behaviour. One such wide-domain regulatory circuit, essential to all cells, is carbon catabolite repression (CCR): it allows the cell to prefer some carbon sources, whose assimilation is of high nutritional value, over less profitable ones. In lower multicellular fungi, the C2H2 zinc finger CreA/CRE1 protein has been shown to act as the transcriptional repressor in this process. However, the complete list of its gene targets is not known. Results: Here, we deciphered the CRE1 regulatory range in the model cellulose- and hemicellulose-degrading fungus Trichoderma reesei (anamorph of Hypocrea jecorina) by profiling transcription in a wild-type and a delta-cre1 mutant strain on glucose at constant growth rates known to repress and de-repress CCR-affected genes. Analysis of genome-wide microarrays reveals 2.8% of transcripts whose expression was regulated in at least one of the four experimental conditions: 47.3% of these were repressed by CRE1, whereas 29.0% were actually induced by CRE1, and 17.2% were affected only by the growth rate, independently of CRE1. Among CRE1-repressed transcripts, genes encoding unknown proteins and transport proteins were overrepresented. In addition, we found CRE1 repression of the uptake of nitrogenous substances, of components of chromatin remodeling and the transcriptional mediator complex, and of developmental processes. Conclusions: Our study provides the first global insight into the molecular physiological response of a multicellular fungus to carbon catabolite regulation and identifies several not yet known targets in a growth-controlled environment.

DEA University of Debrecen Electronic Archive

HAL-Inserm

Large-scale integration of cancer microarray data identifies a robust common cancer signature

Author: A Bhattacharjee
A Cromer
AC Tan
AI Su
AI Su
BJ Quade
CA Iacobuzio-Donahue
CD Logsdon
CF Basil
D Geman
D Talantov
DG Beer
DH Gutmann
Donald Geman
DR Rhodes
DR Rhodes
DS Rickman
E Dehan
E Segal
F Zhan
GJ Gordon
HF Frierson Jr.
I Yanai
J Luo
JB Welsh
JM Lancaster
JPT Higgins
L Dyrskjot
L Liotta
L Xu
Lei Xu
LL Hsiao
M Bittner
M Lenburg
MA Watson
ND Price
P Pavlidis
PJ Hoffman
R Shai
Raimond L Winslow
RC Bast Jr.
RS Stearman
S Michiels
S Ramaswamy
S Wachi
S Welle
SL Pomeroy
SM Dhanasekaran
SS Yoon
T Barrett
T Yagi
TJ Giordano
TR Golub
X Chen
X Yang
Y Hippo
Y Huang
YP Yu
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: There is a continuing need to develop molecular diagnostic tools which complement histopathologic examination to increase the accuracy of cancer diagnosis. DNA microarrays provide a means for measuring gene expression signatures which can then be used as components of genomic-based diagnostic tests to determine the presence of cancer. Results: In this study, we collect and integrate ~1500 microarray gene expression profiles from 26 published cancer data sets across 21 major human cancer types. We then apply a statistical method, referred to as the Top-Scoring Pair of Groups (TSPG) classifier, and a repeated random sampling strategy to the integrated training data sets and identify a common cancer signature consisting of 46 genes. These 46 genes are naturally divided into two distinct groups; those in one group are typically expressed less than those in the other group for cancer tissues. Given a new expression profile, the classifier discriminates cancer from normal tissues by ranking the expression values of the 46 genes in the cancer signature and comparing the average ranks of the two groups. This signature is then validated by applying this decision rule to independent test data. Conclusion: By combining the TSPG method and repeated random sampling, a robust common cancer signature has been identified from large-scale microarray data integration. Upon further validation, this signature may be useful as a robust and objective diagnostic test for cancer.
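The decision rule described in this abstract (rank the signature genes within one profile, then compare the average ranks of the two gene groups) can be sketched in a few lines. This is an illustrative sketch only: the function name `tspg_classify`, the gene names in the usage example, the tie handling, and the choice to call equal average ranks "normal" are assumptions, and the selection of the 46 signature genes by repeated random sampling is not shown.

```python
from statistics import mean


def tspg_classify(profile, low_in_cancer, high_in_cancer):
    """Classify one expression profile by comparing average within-profile ranks.

    profile:        dict mapping gene name -> expression value
    low_in_cancer:  genes typically expressed less in cancer tissue
    high_in_cancer: genes typically expressed more in cancer tissue
    """
    genes = list(low_in_cancer) + list(high_in_cancer)
    # Rank 1 is the lowest expression value; ties keep list order
    # (a simplification, since tie handling is not specified in the abstract).
    ordered = sorted(genes, key=lambda g: profile[g])
    rank = {g: i + 1 for i, g in enumerate(ordered)}
    avg_low = mean(rank[g] for g in low_in_cancer)
    avg_high = mean(rank[g] for g in high_in_cancer)
    # Cancer is called when the "low" group really does rank below the "high" group.
    return "cancer" if avg_low < avg_high else "normal"


# Usage with four hypothetical genes instead of the 46-gene signature.
sample = {"g1": 1.0, "g2": 2.0, "g3": 5.0, "g4": 6.0}
label = tspg_classify(sample, ["g1", "g2"], ["g3", "g4"])
```

Because the rule depends only on the ordering of values within a single profile, it is unaffected by monotone rescaling of a sample, which is one reason rank-based rules suit data integrated from many microarray platforms.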

A taxonomy of epithelial human cancer and their metastases

Background: Microarray technology has made it possible to molecularly characterize many different cancer sites. This technology has the potential to individualize therapy and to discover new drug targets. However, due to technological differences and issues in standardized sample collection, no study has evaluated the molecular profile of epithelial human cancer in a large number of samples and tissues. Additionally, it has not yet been extensively investigated whether metastases resemble their tissue of origin or tissue of destination. Methods: We studied the expression profiles of a series of 1566 primary tumors and 178 metastases by unsupervised hierarchical clustering. The clustering profile was subsequently investigated and correlated with clinico-pathological data. Statistical enrichment of clinico-pathological annotations of groups of samples was investigated using the Fisher exact test. Gene set enrichment analysis (GSEA) and DAVID functional enrichment analysis were used to investigate the molecular pathways. Kaplan-Meier survival analysis and log-rank tests were used to investigate the prognostic significance of gene signatures. Results: Large clusters corresponding to breast, gastrointestinal, ovarian and kidney primary tissues emerged from the data. Chromophobe renal cell carcinoma clustered together with follicular differentiated thyroid carcinoma, which supports recent morphological descriptions of thyroid follicular carcinoma-like tumors in the kidney and suggests that they represent a subtype of chromophobe carcinoma. We also found an expression signature identifying primary tumors of squamous cell histology in multiple tissues. Next, a subset of ovarian tumors enriched with endometrioid histology clustered together with endometrium tumors, confirming that they share their etiopathogenesis, which strongly differs from that of serous ovarian tumors. In addition, the clustering of colon and breast tumors correlated with clinico-pathological characteristics. Moreover, a signature was developed based on our unsupervised clustering of breast tumors, and this was predictive for disease-specific survival in three independent studies. Next, the metastases from ovarian, breast, lung and vulva tumors clustered with their tissue of origin, while metastases from colon showed a bimodal distribution: a significant part clustered with the tissue of origin while the remaining tumors clustered with the tissue of destination. Conclusion: Our molecular taxonomy of epithelial human cancer indicates surprising correlations across tissues. This may have a significant impact on the classification of many cancer sites and may guide pathologists, both in research and daily practice. Moreover, these results based on unsupervised analysis yielded a signature predictive of clinical outcome in breast cancer. Additionally, we hypothesize that metastases of gastrointestinal origin either remember their tissue of origin or adapt to the tissue of destination. More specifically, colon metastases in the liver show strong evidence for such a bimodal tissue-specific profile.

Ghent University Academic Bibliography