Search CORE

192 research outputs found

Assessing protein similarity with Gene Ontology and its use in subnuclear localization prediction

Author: AK Bjorklund
B Rost
BW Matthews
D Sarda
G Dellaire
H Wu
J Wang
JL Gardy
JL Gardy
K Itoh
K Nakai
K Tu
KC Chou
L Cocco
M Bhasin
MA Harris
P Zhang
PW Lord
PW Lord
R Gentleman
R Nair
R Nair
V Brendel
X Lu
X Wu
Yang Dai
YD Cai
Z Lei
Zhengdeng Lei
ZP Feng
Publication venue: BioMed Central
Publication date: 01/01/2006
Field of study

BACKGROUND: The accomplishment of the various genome sequencing projects resulted in accumulation of massive amount of gene sequence information. This calls for a large-scale computational method for predicting protein localization from sequence. The protein localization can provide valuable information about its molecular function, as well as the biological pathway in which it participates. The prediction of localization of a protein at subnuclear level is a challenging task. In our previous work we proposed an SVM-based system using protein sequence information for this prediction task. In this work, we assess protein similarity with Gene Ontology (GO) and then improve the performance of the system by adding a module of nearest neighbor classifier using a similarity measure derived from the GO annotation terms for protein sequences. RESULTS: The performance of the new system proposed here was compared with our previous system using a set of proteins resided within 6 localizations collected from the Nuclear Protein Database (NPD). The overall MCC (accuracy) is elevated from 0.284 (50.0%) to 0.519 (66.5%) for single-localization proteins in leave-one-out cross-validation; and from 0.420 (65.2%) to 0.541 (65.2%) for an independent set of multi-localization proteins. The new system is available at . CONCLUSION: The prediction of protein subnuclear localizations can be largely influenced by various definitions of similarity for a pair of proteins based on different similarity measures of GO terms. Using the sum of similarity scores over the matched GO term pairs for two proteins as the similarity definition produced the best predictive outcome. Substantial improvement in predicting protein subnuclear localizations has been achieved by combining Gene Ontology with sequence information

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Gene ontology based transfer learning for protein subcellular localization

Author: A Bateman
A Dijk
A Hoglund
A Hoglund
A Pierleoni
C Chen
C Leslie
C Leslie
DH Haft
E Marcotte
EM Zdobnov
F Corpet
FM Li
G Lanckriet
G Schneider
H Ding
H Lin
H Lin
H Liu
H Rangwala
H Shen
HB Shen
HB Shen
HB Shen
HB Shen
HB Shen
J Cedano
J Schultz
J Shen
JD Qiu
JD Qiu
K Chou
K Chou
K Chou
K Hofmann
K Lee
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
KC Chou
L Nanni
M Ashburner
M Esmaeili
M Mak
M Wang
Q Gu
Q Yang
R Apweiler
R Kuang
R Kuang
S Mei
S Pan
Shuigeng Zhou
Suyu Mei
T Blum
T Tung
TK Attwood
W Dai
W Dai
W Huang
W Huang
Wang Fei
X Jiang
X Xiao
XB Zhou
YH Zeng
YS Ding
YS Ding
Z Lei
Z Lu
Publication venue: BioMed Central
Publication date: 01/01/2011
Field of study

Abstract Background Prediction of protein subcellular localization generally involves many complex factors, and using only one or two aspects of data information may not tell the true story. For this reason, some recent predictive models are deliberately designed to integrate multiple heterogeneous data sources for exploiting multi-aspect protein feature information. Gene ontology, hereinafter referred to as <it>GO</it>, uses a controlled vocabulary to depict biological molecules or gene products in terms of biological process, molecular function and cellular component. With the rapid expansion of annotated protein sequences, gene ontology has become a general protein feature that can be used to construct predictive models in computational biology. Existing models generally either concatenated the <it>GO </it>terms into a flat binary vector or applied majority-vote based ensemble learning for protein subcellular localization, both of which can not estimate the individual discriminative abilities of the three aspects of gene ontology. Results In this paper, we propose a Gene Ontology Based Transfer Learning Model (<it>GO-TLM</it>) for large-scale protein subcellular localization. The model transfers the signature-based homologous <it>GO </it>terms to the target proteins, and further constructs a reliable learning system to reduce the adverse affect of the potential false <it>GO </it>terms that are resulted from evolutionary divergence. We derive three <it>GO </it>kernels from the three aspects of gene ontology to measure the <it>GO </it>similarity of two proteins, and derive two other spectrum kernels to measure the similarity of two protein sequences. We use simple non-parametric cross validation to explicitly weigh the discriminative abilities of the five kernels, such that the time & space computational complexities are greatly reduced when compared to the complicated semi-definite programming and semi-indefinite linear programming. The five kernels are then linearly merged into one single kernel for protein subcellular localization. We evaluate <it>GO-TLM </it>performance against three baseline models: <it>MultiLoc, MultiLoc-GO </it>and <it>Euk-mPLoc </it>on the benchmark datasets the baseline models adopted. 5-fold cross validation experiments show that <it>GO-TLM </it>achieves substantial accuracy improvement against the baseline models: 80.38% against model <it>Euk-mPLoc </it>67.40% with <it>12.98% </it>substantial increase; 96.65% and 96.27% against model <it>MultiLoc-GO </it>89.60% and 89.60%, with <it>7.05% </it>and <it>6.67% </it>accuracy increase on dataset <it>MultiLoc plant </it>and dataset <it>MultiLoc animal</it>, respectively; 97.14%, 95.90% and 96.85% against model <it>MultiLoc-GO </it>83.70%, 90.10% and 85.70%, with accuracy increase <it>13.44%</it>, <it>5.8% </it>and <it>11.15% </it>on dataset <it>BaCelLoc plant</it>, dataset <it>BaCelLoc fungi </it>and dataset <it>BaCelLoc animal </it>respectively. For <it>BaCelLoc </it>independent sets, <it>GO-TLM </it>achieves 81.25%, 80.45% and 79.46% on dataset <it>BaCelLoc plant holdout</it>, dataset <it>BaCelLoc plant holdout </it>and dataset <it>BaCelLoc animal holdout</it>, respectively, as compared against baseline model <it>MultiLoc-GO </it>76%, 60.00% and 73.00%, with accuracy increase <it>5.25%</it>, <it>20.45% </it>and <it>6.46%</it>, respectively. Conclusions Since direct homology-based <it>GO </it>term transfer may be prone to introducing noise and outliers to the target protein, we design an explicitly weighted kernel learning system (called Gene Ontology Based Transfer Learning Model, <it>GO-TLM</it>) to transfer to the target protein the known knowledge about related homologous proteins, which can reduce the risk of outliers and share knowledge between homologous proteins, and thus achieve better predictive performance for protein subcellular localization. Cross validation and independent test experimental results show that the homology-based <it>GO </it>term transfer and explicitly weighing the <it>GO </it>kernels substantially improve the prediction performance.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

ProLoc-GO: Utilizing informative Gene Ontology terms for sequence-based prediction of protein subcellular localization

Author: Ho Shih-Wen
Ho Shinn-Ying
Huang Wen-Lin
Hwang Shiow-Fen
Tung Chun-Wei
Publication venue: BioMed Central
Publication date: 01/01/2008
Field of study

Crossref

Springer - Publisher Connector

PubMed Central

Evaluation of GO-based functional similarity measures using S. cerevisiae protein interaction and expression profile data

Author: Du LinFang
Xu Tao
Zhou Yan
Publication venue: BioMed Central
Publication date: 01/11/2008
Field of study

Abstract Background Researchers interested in analysing the expression patterns of functionally related genes usually hope to improve the accuracy of their results beyond the boundaries of currently available experimental data. Gene ontology (GO) data provides a novel way to measure the functional relationship between gene products. Many approaches have been reported for calculating the similarities between two GO terms, known as semantic similarities. However, biologists are more interested in the relationship between gene products than in the scores linking the GO terms. To highlight the relationships among genes, recent studies have focused on functional similarities. Results In this study, we evaluated five functional similarity methods using both protein-protein interaction (PPI) and expression data of <it>S. cerevisiae</it>. The receiver operating characteristics (ROC) and correlation coefficient analysis of these methods showed that the maximum method outperformed the other methods. Statistical comparison of multiple- and single-term annotated proteins in biological process ontology indicated that genes with multiple GO terms may be more reliable for separating true positives from noise. Conclusion This study demonstrated the reliability of current approaches that elevate the similarity of GO terms to the similarity of proteins. Suggestions for further improvements in functional similarity analysis are also provided.</p

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Amino acid classification based spectrum kernel fusion for protein subnuclear localization

Author: A Dijk
A Hoglund
B Boeckmann
C Leslie
C Leslie
G Dellaire
G Lanckriet
G Schneider
H Rangwala
H Shen
J Cedano
J Guo
J Shen
J Taylor
K Chou
K Chou
K Lee
M Edward
M Mak
M Richard
P Jia
R Kuang
R Kuang
S Alejandro
S Mei
Suyu Mei
T Tung
V Vapnik
Wang Fei
Z Alexander
Z Lei
Z Lei
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Abstract Background Prediction of protein localization in subnuclear organelles is more challenging than general protein subcelluar localization. There are only three computational models for protein subnuclear localization thus far, to the best of our knowledge. Two models were based on protein primary sequence only. The first model assumed homogeneous amino acid substitution pattern across all protein sequence residue sites and used BLOSUM62 to encode <it>k</it>-mer of protein sequence. Ensemble of SVM based on different <it>k</it>-mers drew the final conclusion, achieving 50% overall accuracy. The simplified assumption did not exploit protein sequence profile and ignored the fact of heterogeneous amino acid substitution patterns across sites. The second model derived the <it>PsePSSM </it>feature representation from protein sequence by simply averaging the profile PSSM and combined the <it>PseAA </it>feature representation to construct a kNN ensemble classifier <it>Nuc-PLoc</it>, achieving 67.4% overall accuracy. The two models based on protein primary sequence only both achieved relatively poor predictive performance. The third model required that GO annotations be available, thus restricting the model's applicability. Methods In this paper, we only use the amino acid information of protein sequence without any other information to design a widely-applicable model for protein subnuclear localization. We use <it>K</it>-spectrum kernel to exploit the contextual information around an amino acid and the conserved motif information. Besides expanding window size, we adopt various amino acid classification approaches to capture diverse aspects of amino acid physiochemical properties. Each amino acid classification generates a series of spectrum kernels based on different window size. Thus, (I) window expansion can capture more contextual information and cover size-varying motifs; (II) various amino acid classifications can exploit multi-aspect biological information from the protein sequence. Finally, we combine all the spectrum kernels by simple addition into one single kernel called <it>SpectrumKernel+ </it>for protein subnuclear localization. Results We conduct the performance evaluation experiments on two benchmark datasets: <it>Lei </it>and <it>Nuc-PLoc</it>. Experimental results show that <it>SpectrumKernel+ </it>achieves substantial performance improvement against the previous model <it>Nuc-PLoc</it>, with overall accuracy <it>83.47% </it>against <it>67.4%</it>; and <it>71.23% </it>against <it>50% </it>of <it>Lei SVM Ensemble</it>, against 66.50% of <it>Lei GO SVM Ensemble</it>. Conclusion The method <it>SpectrumKernel</it>+ can exploit rich amino acid information of protein sequence by embedding into implicit size-varying motifs the multi-aspect amino acid physiochemical properties captured by amino acid classification approaches. The kernels derived from diverse amino acid classification approaches and different sizes of <it>k</it>-mer are summed together for data integration. Experiments show that the method <it>SpectrumKernel</it>+ significantly outperforms the existing models for protein subnuclear localization.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Prediction of eukaryotic protein subcellular multi- localisation with a combined KNN-SVM ensemble classifier

Author: Hong Kuang
Kaifa Wang
Liqi Li
Ying Wan
Yuan Zhang
Yue Zhou
Publication venue
Publication date: 24/04/2020
Field of study

Proteins may exist in or shift among two or more different subcellular locations, and this phenomenon is closely related to biological function. It is challenging to deal with multiple locations during eukaryotic protein subcellular localisation prediction with routine methods; therefore, a reliable and automatic ensemble classifier for protein subcellular localisation is needed. We propose a new ensemble classifier combined with the KNN (K-nearest neighbour) and SVM (support vector machine) algorithms to predict the subcellular localisation of eukaryotic proteins from the GO (gene ontology) annotations. This method was developed by fusing basic individual classifiers through a voting system. The overall prediction accuracies thus obtained via the jackknife test and resubstitution test were 70.5 and 77.6% for eukaryotic proteins respectively, which are significantly higher than other methods presented in the previous studies and reveal that our strategy better predicts eukaryotic protein subcellular localisation

CiteSeerX

Protein Interactome and Its Application to Protein Function Prediction

Author: Hyun-Hwan Jeong
KiYoung Lee
Woojin Jung
Publication venue: 'IntechOpen'
Publication date: 30/03/2012
Field of study

IntechOpen

Going from where to why—interpretable prediction of protein subcellular localization

Author: Bannai
Blum
Boden
Brady
Briesemeister
Carlson
Casadio
Cedano
Chou
Chou
Chou
Chou
Chou
Cokol
Cui
Emanuelsson
Fayyad
Fujiwara
Fyshe
Garg
Garg
Guo
Hall
Horton
Hua
Huang
Höglund
Jörg Rahnenführer
Kaiser
King
Lee
Lei
Lin
Lu
Lu
Nair
Nair
Nair
Nakai
Oliver Kohlbacher
Outten
Park
Petsalaki
Pierleoni
Reinhardt
Rish
Scott
Scott
Sebastian Briesemeister
Shin
Small
Takada
Tsoumakas
Whitten
Xie
Zhang
Publication venue: Oxford University Press
Publication date
Field of study

Motivation: Protein subcellular localization is pivotal in understanding a protein's function. Computational prediction of subcellular localization has become a viable alternative to experimental approaches. While current machine learning-based methods yield good prediction accuracy, most of them suffer from two key problems: lack of interpretability and dealing with multiple locations

Crossref

PubMed Central

An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology

Author: A del Pozo
A Hofer
A Patil
A Schlicker
AC Gavin
AJ Faller
C Pesquita
C Pesquita
C Pesquita
C Pesquita
CM Deane
D Li
D Lin
D Warde-Farley
DP Eisinger
DR Rhodes
F Azuaje
FM Couto
G Alterovitz
G van Rossum
G Yu
Gary D Bader
H Wu
H Yu
H Yu
I Xenarios
J Cheng
JJ Jiang
JL Sevilla
JZ Wang
K Strasser
K Xia
LJ Jensen
M Ashburner
M West
N Nariai
NJ Krogan
P Resnik
P Uetz
P Zhang
R Gentleman
R Shen
S Chavez
S Razick
Shobhit Jain
T Xu
U Stelzl
X Guo
Y Chen
Y Tao
Z Lei
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Abstract Background Semantic similarity measures are useful to assess the physiological relevance of protein-protein interactions (PPIs). They quantify similarity between proteins based on their function using annotation systems like the Gene Ontology (GO). Proteins that interact in the cell are likely to be in similar locations or involved in similar biological processes compared to proteins that do not interact. Thus the more semantically similar the gene function annotations are among the interacting proteins, more likely the interaction is physiologically relevant. However, most semantic similarity measures used for PPI confidence assessment do not consider the unequal depth of term hierarchies in different classes of cellular location, molecular function, and biological process ontologies of GO and thus may over-or under-estimate similarity. Results We describe an improved algorithm, Topological Clustering Semantic Similarity (TCSS), to compute semantic similarity between GO terms annotated to proteins in interaction datasets. Our algorithm, considers unequal depth of biological knowledge representation in different branches of the GO graph. The central idea is to divide the GO graph into sub-graphs and score PPIs higher if participating proteins belong to the same sub-graph as compared to if they belong to different sub-graphs. Conclusions The TCSS algorithm performs better than other semantic similarity measurement techniques that we evaluated in terms of their performance on distinguishing true from false protein interactions, and correlation with gene expression and protein families. We show an average improvement of 4.6 times the <it>F</it>1 score over Resnik, the next best method, on our <it>Saccharomyces cerevisiae </it>PPI dataset and 2 times on our <it>Homo sapiens </it>PPI dataset using cellular component, biological process and molecular function GO annotations.</p

University of Toronto Research Repository

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Metrics for GO based protein semantic similarity: a systematic evaluation

Author: A Schlicker
A Valencia
André O Falcão
António EN Ferreira
C Pesquita
C Wu
Catia Pesquita
D Devos
D Devos
D Faria
D Lin
Daniel Faria
E Camon
EB Camon
F Azuaje
F Azuaje
F Couto
F Couto
FM Couto
Francisco M Couto
Gentleman
Hugo Bastos
J Chabalier
J Jiang
J Tuikkala
JL Sevilla
L Stein
P Lord
P Lord
P Resnik
PH Lee
RM Othman
RM Riensche
S Cao
T Joshi
X Guo
X Wu
Y Tao
Z Lei
ZH Duan
Publication venue: BioMed Central
Publication date: 01/01/2008
Field of study

Abstract Background Several semantic similarity measures have been applied to gene products annotated with Gene Ontology terms, providing a basis for their functional comparison. However, it is still unclear which is the best approach to semantic similarity in this context, since there is no conclusive evaluation of the various measures. Another issue, is whether electronic annotations should or not be used in semantic similarity calculations. Results We conducted a systematic evaluation of GO-based semantic similarity measures using the relationship with sequence similarity as a means to quantify their performance, and assessed the influence of electronic annotations by testing the measures in the presence and absence of these annotations. We verified that the relationship between semantic and sequence similarity is not linear, but can be well approximated by a rescaled Normal cumulative distribution function. Given that the majority of the semantic similarity measures capture an identical behaviour, but differ in resolution, we used the latter as the main criterion of evaluation. Conclusions This work has provided a basis for the comparison of several semantic similarity measures, and can aid researchers in choosing the most adequate measure for their work. We have found that the hybrid <it>simGIC</it> was the measure with the best overall performance, followed by Resnik's measure using a best-match average combination approach. We have also found that the average and maximum combination approaches are problematic since both are inherently influenced by the number of terms being combined. We suspect that there may be a direct influence of data circularity in the behaviour of the results including electronic annotations, as a result of functional inference from sequence similarity.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Universidade de Lisboa: Repositório.UL