Search CORE

7,264 research outputs found

Gene ontology based transfer learning for protein subcellular localization

Author: A Bateman
A Dijk
A Hoglund
A Hoglund
A Pierleoni
C Chen
C Leslie
C Leslie
DH Haft
E Marcotte
EM Zdobnov
F Corpet
FM Li
G Lanckriet
G Schneider
H Ding
H Lin
H Lin
H Liu
H Rangwala
H Shen
HB Shen
HB Shen
HB Shen
HB Shen
HB Shen
J Cedano
J Schultz
J Shen
JD Qiu
JD Qiu
K Chou
K Chou
K Chou
K Hofmann
K Lee
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
L Nanni
M Ashburner
M Esmaeili
M Mak
M Wang
Q Gu
Q Yang
R Apweiler
R Kuang
R Kuang
S Mei
S Pan
Shuigeng Zhou
Suyu Mei
T Blum
T Tung
TK Attwood
W Dai
W Dai
W Huang
W Huang
Wang Fei
X Jiang
X Xiao
XB Zhou
YH Zeng
YS Ding
YS Ding
Z Lei
Z Lu
Publication venue: BioMed Central
Publication date: 01/01/2011
Field of study

Abstract Background Prediction of protein subcellular localization generally involves many complex factors, and using only one or two aspects of data information may not tell the true story. For this reason, some recent predictive models are deliberately designed to integrate multiple heterogeneous data sources for exploiting multi-aspect protein feature information. Gene ontology, hereinafter referred to as <it>GO</it>, uses a controlled vocabulary to depict biological molecules or gene products in terms of biological process, molecular function and cellular component. With the rapid expansion of annotated protein sequences, gene ontology has become a general protein feature that can be used to construct predictive models in computational biology. Existing models generally either concatenated the <it>GO </it>terms into a flat binary vector or applied majority-vote based ensemble learning for protein subcellular localization, both of which can not estimate the individual discriminative abilities of the three aspects of gene ontology. Results In this paper, we propose a Gene Ontology Based Transfer Learning Model (<it>GO-TLM</it>) for large-scale protein subcellular localization. The model transfers the signature-based homologous <it>GO </it>terms to the target proteins, and further constructs a reliable learning system to reduce the adverse affect of the potential false <it>GO </it>terms that are resulted from evolutionary divergence. We derive three <it>GO </it>kernels from the three aspects of gene ontology to measure the <it>GO </it>similarity of two proteins, and derive two other spectrum kernels to measure the similarity of two protein sequences. We use simple non-parametric cross validation to explicitly weigh the discriminative abilities of the five kernels, such that the time & space computational complexities are greatly reduced when compared to the complicated semi-definite programming and semi-indefinite linear programming. The five kernels are then linearly merged into one single kernel for protein subcellular localization. We evaluate <it>GO-TLM </it>performance against three baseline models: <it>MultiLoc, MultiLoc-GO </it>and <it>Euk-mPLoc </it>on the benchmark datasets the baseline models adopted. 5-fold cross validation experiments show that <it>GO-TLM </it>achieves substantial accuracy improvement against the baseline models: 80.38% against model <it>Euk-mPLoc </it>67.40% with <it>12.98% </it>substantial increase; 96.65% and 96.27% against model <it>MultiLoc-GO </it>89.60% and 89.60%, with <it>7.05% </it>and <it>6.67% </it>accuracy increase on dataset <it>MultiLoc plant </it>and dataset <it>MultiLoc animal</it>, respectively; 97.14%, 95.90% and 96.85% against model <it>MultiLoc-GO </it>83.70%, 90.10% and 85.70%, with accuracy increase <it>13.44%</it>, <it>5.8% </it>and <it>11.15% </it>on dataset <it>BaCelLoc plant</it>, dataset <it>BaCelLoc fungi </it>and dataset <it>BaCelLoc animal </it>respectively. For <it>BaCelLoc </it>independent sets, <it>GO-TLM </it>achieves 81.25%, 80.45% and 79.46% on dataset <it>BaCelLoc plant holdout</it>, dataset <it>BaCelLoc plant holdout </it>and dataset <it>BaCelLoc animal holdout</it>, respectively, as compared against baseline model <it>MultiLoc-GO </it>76%, 60.00% and 73.00%, with accuracy increase <it>5.25%</it>, <it>20.45% </it>and <it>6.46%</it>, respectively. Conclusions Since direct homology-based <it>GO </it>term transfer may be prone to introducing noise and outliers to the target protein, we design an explicitly weighted kernel learning system (called Gene Ontology Based Transfer Learning Model, <it>GO-TLM</it>) to transfer to the target protein the known knowledge about related homologous proteins, which can reduce the risk of outliers and share knowledge between homologous proteins, and thus achieve better predictive performance for protein subcellular localization. Cross validation and independent test experimental results show that the homology-based <it>GO </it>term transfer and explicitly weighing the <it>GO </it>kernels substantially improve the prediction performance.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

BUSCA: An integrative web server to predict subcellular localization of proteins

Author: Casadio Rita
Fariselli Piero
Martelli Pier Luigi
Profiti Giuseppe
Savojardo Castrense
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2018
Field of study

Here, we present BUSCA (http://busca.biocomp.unibo.it), a novel web server that integrates different computational tools for predicting protein subcellular localization. BUSCA combines methods for identifying signal and transit peptides (DeepSig and TPpred3), GPI-anchors (PredGPI) and transmembrane domains (ENSEMBLE3.0 and BetAware) with tools for discriminating subcellular localization of both globular and membrane proteins (BaCelLo, MemLoci and SChloro). Outcomes from the different tools are processed and integrated for annotating subcellular localization of both eukaryotic and bacterial protein sequences. We benchmark BUSCA against protein targets derived from recent CAFA experiments and other specific data sets, reporting performance at the state-of-the-art. BUSCA scores better than all other evaluated methods on 2732 targets from CAFA2, with a F1 value equal to 0.49 and among the best methods when predicting targets from CAFA3. We propose BUSCA as an integrated and accurate resource for the annotation of protein subcellular localization

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Archivio istituzionale della ricerca - Università di Padova

Institutional Research Information System University of Turin

Assessing protein similarity with Gene Ontology and its use in subnuclear localization prediction

Author: AK Bjorklund
B Rost
BW Matthews
D Sarda
G Dellaire
H Wu
J Wang
JL Gardy
JL Gardy
K Itoh
K Nakai
K Tu
KC Chou
L Cocco
M Bhasin
MA Harris
P Zhang
PW Lord
PW Lord
R Gentleman
R Nair
R Nair
V Brendel
X Lu
X Wu
Yang Dai
YD Cai
Z Lei
Zhengdeng Lei
ZP Feng
Publication venue: BioMed Central
Publication date: 01/01/2006
Field of study

BACKGROUND: The accomplishment of the various genome sequencing projects resulted in accumulation of massive amount of gene sequence information. This calls for a large-scale computational method for predicting protein localization from sequence. The protein localization can provide valuable information about its molecular function, as well as the biological pathway in which it participates. The prediction of localization of a protein at subnuclear level is a challenging task. In our previous work we proposed an SVM-based system using protein sequence information for this prediction task. In this work, we assess protein similarity with Gene Ontology (GO) and then improve the performance of the system by adding a module of nearest neighbor classifier using a similarity measure derived from the GO annotation terms for protein sequences. RESULTS: The performance of the new system proposed here was compared with our previous system using a set of proteins resided within 6 localizations collected from the Nuclear Protein Database (NPD). The overall MCC (accuracy) is elevated from 0.284 (50.0%) to 0.519 (66.5%) for single-localization proteins in leave-one-out cross-validation; and from 0.420 (65.2%) to 0.541 (65.2%) for an independent set of multi-localization proteins. The new system is available at . CONCLUSION: The prediction of protein subnuclear localizations can be largely influenced by various definitions of similarity for a pair of proteins based on different similarity measures of GO terms. Using the sum of similarity scores over the matched GO term pairs for two proteins as the similarity definition produced the best predictive outcome. Substantial improvement in predicting protein subnuclear localizations has been achieved by combining Gene Ontology with sequence information

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Automated data integration for developmental biological research

Author: Sternberg Paul W.
Zhong Weiwei
Publication venue: 'The Company of Biologists'
Publication date: 15/09/2007
Field of study

In an era exploding with genome-scale data, a major challenge for developmental biologists is how to extract significant clues from these publicly available data to benefit our studies of individual genes, and how to use them to improve our understanding of development at a systems level. Several studies have successfully demonstrated new approaches to classic developmental questions by computationally integrating various genome-wide data sets. Such computational approaches have shown great potential for facilitating research: instead of testing 20,000 genes, researchers might test 200 to the same effect. We discuss the nature and state of this art as it applies to developmental research

Caltech Authors

Large-scale prediction and analysis of protein sub-mitochondrial localization with DeepMito

Author: Casadio R.
Martelli P.L.
Savojardo C.
Tartari G.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020
Field of study

Background: The prediction of protein subcellular localization is a key step of the big effort towards protein functional annotation. Many computational methods exist to identify high-level protein subcellular compartments such as nucleus, cytoplasm or organelles. However, many organelles, like mitochondria, have their own internal compartmentalization. Knowing the precise location of a protein inside mitochondria is crucial for its accurate functional characterization. We recently developed DeepMito, a new method based on a 1-Dimensional Convolutional Neural Network (1D-CNN) architecture outperforming other similar approaches available in literature. Results: Here, we explore the adoption of DeepMito for the large-scale annotation of four sub-mitochondrial localizations on mitochondrial proteomes of five different species, including human, mouse, fly, yeast and Arabidopsis thaliana. A significant fraction of the proteins from these organisms lacked experimental information about sub-mitochondrial localization. We adopted DeepMito to fill the gap, providing complete characterization of protein localization at sub-mitochondrial level for each protein of the five proteomes. Moreover, we identified novel mitochondrial proteins fishing on the set of proteins lacking any subcellular localization annotation using available state-of-the-art subcellular localization predictors. We finally performed additional functional characterization of proteins predicted by DeepMito as localized into the four different sub-mitochondrial compartments using both available experimental and predicted GO terms. All data generated in this study were collected into a database called DeepMitoDB (available at http://busca.biocomp.unibo.it/deepmitodb), providing complete functional characterization of 4307 mitochondrial proteins from the five species. Conclusions: DeepMitoDB offers a comprehensive view of mitochondrial proteins, including experimental and predicted fine-grain sub-cellular localization and annotated and predicted functional annotations. The database complements other similar resources providing characterization of new proteins. Furthermore, it is also unique in including localization information at the sub-mitochondrial level. For this reason, we believe that DeepMitoDB can be a valuable resource for mitochondrial research

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Integrative Identification of Arabidopsis Mitochondrial Proteome and Its Function Exploitation through Protein Interaction Network

Author: A Drawid
A Drawid
A Höglund
A Kumar
A Reinhardt
A Vianello
AA Vashisht
AH Millar
AH Millar
BJ Haas
BW Rhee SY
C Bai
C Guda
C Jonak
CG Bartoli
CG Kurland
CM Lee
D Skowyra
DA Bota
David Moore
E Delannoy
E Jambrina
E Mazzucotelli
EE Patton
EE Patton
EH Kruft V
EM Marcotte
EO Karlberg
F Rébeillé
GW Tian
H Bannai
H Fölsch
H Prokisch
HN Chua
I Small
ID Small
IM Moller
J Balk
J Bardel
J Cui
J Huang
J Kilian
JA Kreps
Jian Cui
Jinghua Liu
JK Zhu
JL Heazlewood
JL Heazlewood
JL Heazlewood
K Ishizaki
K Meierhoff
KP O'Brien
L Li
LJ Lu
M Teige
M Unseld
MG Claros
MG Claros
O Emanuelsson
O Van Aken
OA Koroleva
P Horton
P Pavlidis
R Nair
R Nair
RA Irizarry
S Hua
S Killcoyne
S Li
S Ma
S Maere
S Mahajan
S Mili
SG Andersson
T Sing
Tieliu Shi
V Gueguen
VK Mootha
Vladimir Uversky
W Werhahn
WK Huh
X Gong
Y Gavel
YD Cai
YD Cai
Yuhua Li
Z Liu
Z Yuan
Publication venue: Public Library of Science
Publication date: 31/01/2011
Field of study

Mitochondria are major players on the production of energy, and host several key reactions involved in basic metabolism and biosynthesis of essential molecules. Currently, the majority of nucleus-encoded mitochondrial proteins are unknown even for model plant Arabidopsis. We reported a computational framework for predicting Arabidopsis mitochondrial proteins based on a probabilistic model, called Naive Bayesian Network, which integrates disparate genomic data generated from eight bioinformatics tools, multiple orthologous mappings, protein domain properties and co-expression patterns using 1,027 microarray profiles. Through this approach, we predicted 2,311 candidate mitochondrial proteins with 84.67% accuracy and 2.53% FPR performances. Together with those experimental confirmed proteins, 2,585 mitochondria proteins (named CoreMitoP) were identified, we explored those proteins with unknown functions based on protein-protein interaction network (PIN) and annotated novel functions for 26.65% CoreMitoP proteins. Moreover, we found newly predicted mitochondrial proteins embedded in particular subnetworks of the PIN, mainly functioning in response to diverse environmental stresses, like salt, draught, cold, and wound etc. Candidate mitochondrial proteins involved in those physiological acitivites provide useful targets for further investigation. Assigned functions also provide comprehensive information for Arabidopsis mitochondrial proteome

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Proper integration of feature subsets boosts GO subcellular localization predictions

Author: Angelone Laura Monica
Bulacio Pilar Estela
Krsticevic Flavia Jorgelina
Murillo Javier
Ponce Sergio
Spetale Flavio Ezequiel
Tapia Paredes Elizabeth
Publication venue: Sociedad Argentina de Bioingeniería
Publication date: 01/03/2018
Field of study

La predicción de múltiples localizaciones subcelulares en proteínas brinda información relavante para el descubrimiento de funciones biológicas. El uso de métodos computacionales basados en el conocimiento puede ser un buen punto de partida para conducir a las costosas validaciones experimentales. En este trabajo, presentamos un framework de clasificación multi-etiqueta para para realizar la predicción en Gene Ontology - Componente Celular enfocada en la mejora de dos aspectos del diseño: i) la caracterización de la secuencia proteica, relacionando el conocimiento biológico con la evidencia experimental; y ii) la evaluación de errores al considerar un modelo de ruido inherente a los frameworks de predicción reales. Nuestra propuesta es validada contra un conjunto de secuencias de proteínas de cuatro organismos modelos D. rerio, A. thaliana, S. cerevisiae and D. melanogaster.Prediction of multiple subcellular localizations in proteins brings relevant information for biologicalfunction discovery. The use of computational methods based on knowledge can be a helpful starting point forguiding the costly experimental validation. In this work, we present a multilabel classifier framework to performGene Ontology - Cellular Component prediction focused on the improvement of two design aspects: i) the proteinsequence characterization, regarding biological knowledge with experimental evidence, and ii) the error evaluation byconsidering a noise model inherent in real prediction frameworks. Our proposal is validated against sets of well-knownprotein sequences of four model organisms D. rerio, A. thaliana, S. cerevisiae and D. melanogasterFil: Spetale, Flavio Ezequiel. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas. Universidad Nacional de Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; ArgentinaFil: Tapia Elizabeth. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas. Universidad Nacional de Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; ArgentinaFil: Murillo, Javier. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas. Universidad Nacional de Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; ArgentinaFil: Krsticevic Flavia. Universidad Tecnológica Nacional. Facultad Regional San Nicolás; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas. Universidad Nacional de Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; ArgentinaFil: Ponce Sergio. Universidad Tecnológica Nacional. Facultad Regional San Nicolás; ArgentinaFil: Angelone, Laura Monica. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas. Universidad Nacional de Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; ArgentinaFil: Bulacio, Pilar Estela. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas. Universidad Nacional de Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; Argentina. Universidad Tecnológica Nacional. Facultad Regional San Nicolás; Argentin

CONICET Digital