15 research outputs found

    A novel method for apoptosis protein subcellular localization prediction combining encoding based on grouped weight and support vector machine

    Apoptosis proteins have a central role in the development and homeostasis of an organism. These proteins are very important for understanding the mechanism of programmed cell death. Based on the idea of coarse-grained description and grouping in physics, a new feature extraction method with grouped weight for protein sequences is presented and applied, together with a support vector machine, to apoptosis protein subcellular localization prediction. For the same training dataset and the same predictive algorithm, the overall prediction accuracy of our method in the jackknife test is 13.2% and 15.3% higher than the accuracy based on the amino acid composition and the instability index, respectively. For the "else" class of apoptosis proteins in particular, the increase in prediction accuracy is 41.7 and 33.3 percentage points, respectively. The experimental results show that the new feature extraction method efficiently extracts the structural information implicit in a protein sequence and that the method reaches satisfactory performance despite its simplicity. The overall prediction accuracy of the EBGW_SVM model on dataset ZD98 reaches 92.9% in the jackknife test, which is 8.2–20.4 percentage points higher than other existing models. For a new dataset, ZW225, the overall prediction accuracy of EBGW_SVM is 83.1%. These results imply that EBGW_SVM is a simple but efficient model for apoptosis protein subcellular location prediction.
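
The jackknife (leave-one-out) test and the amino acid composition baseline named in this abstract can be sketched as follows. This is a minimal illustration only: the grouped-weight encoding and the SVM are not reproduced, a 1-nearest-neighbour classifier is swapped in so the example stays dependency-free, and the toy sequences and labels are made up.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Return the 20-dimensional amino acid composition (residue frequencies)."""
    length = len(sequence) or 1
    return [sequence.count(aa) / length for aa in AMINO_ACIDS]

def nearest_label(query, training):
    """Label of the training vector closest to `query` (squared Euclidean distance)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(training, key=lambda item: squared_distance(query, item[0]))[1]

def jackknife_accuracy(dataset):
    """Leave each (sequence, label) pair out in turn and predict it from the rest."""
    encoded = [(composition(seq), label) for seq, label in dataset]
    correct = 0
    for i, (vector, label) in enumerate(encoded):
        rest = encoded[:i] + encoded[i + 1:]  # every sample except the i-th
        correct += nearest_label(vector, rest) == label
    return correct / len(encoded)

# Hypothetical toy data: two made-up classes with clearly different compositions.
toy = [("AAAA", "x"), ("AAAC", "x"), ("WWWW", "y"), ("WWWY", "y")]
accuracy = jackknife_accuracy(toy)
```

Swapping `composition` for another encoder leaves the jackknife loop unchanged, which is how two feature sets can be compared under the same dataset and the same predictive algorithm.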

    An SVM-based system for predicting protein subnuclear localizations

    BACKGROUND: The large gap between the number of protein sequences in databases and the number of functionally characterized proteins calls for the development of a fast computational tool for the prediction of subnuclear and subcellular localizations that is generally applicable to protein sequences. Information on localization may reveal the molecular function of novel proteins, in addition to providing insight into the biological pathways in which they function. The bulk of past work has focused on protein subcellular localization, and no specific tool has been dedicated to prediction at the subnuclear level, despite its high importance. The key to designing a suitable predictive system is the extraction of subtle sequence signals that can discriminate among proteins with different subnuclear localizations. RESULTS: New kernel functions for measuring sequence similarity are introduced for use in a support vector machine (SVM) learning model. The k-peptide vectors are first mapped by a matrix of high-scoring pairs of k-peptides, which are measured by BLOSUM62 scores. The kernels, which measure the similarity between sequences, are then defined on the mapped vectors. By combining these new encoding methods, a multi-class classification system for the prediction of protein subnuclear localizations is established for the first time. The performance of the system is evaluated with a set of proteins collected in the Nuclear Protein Database (NPD). The overall accuracy of prediction for 6 localizations is about 50% (vs. 16.7% for random prediction) for single-localization proteins in leave-one-out cross-validation, and 65% for an independent set of multi-localization proteins. This integrated system can be accessed at . CONCLUSION: The integrated system benefits from the combination of predictions from several SVMs based on selected encoding methods. Finally, the predictive power of the system is expected to improve as more proteins with known subnuclear localizations become available.
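
As a rough illustration of the k-peptide vectors mentioned in this abstract, the sketch below counts overlapping k-peptides and compares two sequences with a plain cosine-normalised dot product. It assumes the simplest possible kernel; the BLOSUM62-based mapping of high-scoring k-peptide pairs described by the authors is not reproduced, and the function names are hypothetical.

```python
from collections import Counter
import math

def kpeptide_counts(sequence, k=2):
    """Count every overlapping k-peptide (length-k substring) in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def spectrum_kernel(seq_a, seq_b, k=2):
    """Cosine-normalised dot product of two k-peptide count vectors."""
    a, b = kpeptide_counts(seq_a, k), kpeptide_counts(seq_b, k)
    dot = sum(count * b[peptide] for peptide, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values()) * sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

# Chance level for six equally likely localizations, as quoted in the abstract.
random_baseline = 1 / 6
```

A kernel of this kind can be passed to any SVM implementation that accepts a precomputed similarity matrix.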

    Prediction of protein submitochondria locations by hybridizing pseudo-amino acid composition with various physicochemical features of segmented sequence

    BACKGROUND: Knowing the submitochondria localization of a mitochondria protein is an important step toward understanding its function. We develop a method, based on an extended version of pseudo-amino acid composition, to predict protein localization within mitochondria. This work goes one step further than predicting protein subcellular location. We also try to predict the membrane protein type for mitochondrial inner membrane proteins. RESULTS: Using leave-one-out cross-validation, the prediction accuracy is 85.5% for inner membrane, 94.5% for matrix and 51.2% for outer membrane. The overall prediction accuracy for submitochondria location prediction is 85.2%. For proteins predicted to localize at the inner membrane, the accuracy of membrane protein type prediction is 94.6%. CONCLUSION: Our method is effective for predicting protein submitochondria location. However, even with our method or the methods at the subcellular level, the prediction of protein submitochondria location remains a challenging problem. The online service SubMito is now available at

    Assessing protein similarity with Gene Ontology and its use in subnuclear localization prediction

    BACKGROUND: The completion of the various genome sequencing projects has resulted in the accumulation of a massive amount of gene sequence information. This calls for a large-scale computational method for predicting protein localization from sequence. The localization of a protein can provide valuable information about its molecular function, as well as the biological pathway in which it participates. Predicting the localization of a protein at the subnuclear level is a challenging task. In our previous work we proposed an SVM-based system using protein sequence information for this prediction task. In this work, we assess protein similarity with Gene Ontology (GO) and then improve the performance of the system by adding a nearest neighbor classifier module that uses a similarity measure derived from the GO annotation terms for protein sequences. RESULTS: The performance of the new system proposed here was compared with our previous system using a set of proteins residing within 6 localizations collected from the Nuclear Protein Database (NPD). The overall MCC (accuracy) is elevated from 0.284 (50.0%) to 0.519 (66.5%) for single-localization proteins in leave-one-out cross-validation; and from 0.420 (65.2%) to 0.541 (65.2%) for an independent set of multi-localization proteins. The new system is available at . CONCLUSION: The prediction of protein subnuclear localizations can be strongly influenced by the definition of similarity for a pair of proteins, which depends on the similarity measure used for GO terms. Using the sum of similarity scores over the matched GO term pairs for two proteins as the similarity definition produced the best predictive outcome. Substantial improvement in predicting protein subnuclear localizations has been achieved by combining Gene Ontology with sequence information.
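
For readers unfamiliar with the MCC figures quoted in this abstract, the standard binary Matthews correlation coefficient can be computed from confusion-matrix counts as in the sketch below. How the authors aggregate it into an "overall MCC" across six localizations is not stated here, so the one-class-versus-rest reading is an assumption.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions."""
    return (tp + tn) / (tp + tn + fp + fn)
```

Unlike accuracy, MCC stays near zero for a predictor that ignores a minority class, which is one reason the two measures can move differently.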

    'Unite and conquer': enhanced prediction of protein subcellular localization by integrating multiple specialized tools

    BACKGROUND: Knowing the subcellular location of proteins provides clues to their function as well as the interconnectivity of biological processes. Dozens of tools are available for predicting protein location in the eukaryotic cell. Each tool performs well on certain data sets, but their predictions often disagree for a given protein. Since the individual tools each have particular strengths, we set out to integrate them in a way that optimally exploits their potential. The method we present here is applicable to various subcellular locations, but tailored for predicting whether or not a protein is localized in mitochondria. Knowledge of the mitochondrial proteome is relevant to understanding the role of this organelle in global cellular processes. RESULTS: In order to develop a method for enhanced prediction of subcellular localization, we integrated the outputs of available localization prediction tools by several strategies, and tested the performance of each strategy with known mitochondrial proteins. The accuracy obtained (up to 92%) far surpasses that of the individual tools. The method of integration proved crucial to the performance. For the prediction of mitochondrion-located proteins, integration via a two-layer decision tree clearly outperforms simpler methods, as it allows emphasis of biologically relevant features such as the mitochondrial targeting peptide and transmembrane domains. CONCLUSION: We developed an approach that enhances the prediction accuracy of mitochondrial proteins by uniting the strengths of specialized tools. The combination of machine-learning based integration with biological expert knowledge leads to improved performance. This approach also alleviates the conundrum of how to choose between conflicting predictions. Our approach is easy to implement, and applicable to predicting subcellular locations other than mitochondria, as well as other biological features. For a trial of our approach, we provide a web service for mitochondrial protein prediction (named YimLOC), which can be accessed through the AnaBench suite at http://anabench.bcm.umontreal.ca/anabench/. The source code is provided in Additional File 2 (1471-2105-8-420-S2.pdf), which contains scripts for the online server YimLOC. Please note that these scripts only code for the ready-to-use STACK-mem-DT described in the main text; they do not provide the training process.

    A Comparative Study on Feature Extraction from Protein Sequences for Subcellular Localization Prediction


    Predicting multiplex subcellular localization of proteins using protein-protein interaction network: a comparative study

    BACKGROUND: Proteins that interact in vivo tend to reside within the same or "adjacent" subcellular compartments. This observation provides opportunities to reveal protein subcellular localization in the context of the protein-protein interaction (PPI) network. However, so far, only a few efforts based on heuristic rules have been made in this regard. RESULTS: We systematically and quantitatively validate the hypothesis that proteins physically interacting with each other probably share at least one common subcellular localization. Building on this result, four graph-based semi-supervised learning algorithms originally proposed for protein function prediction (Majority, χ²-score, GenMultiCut and FunFlow) are introduced, for the first time, to assign "multiplex localization" to proteins. We analyze these approaches by performing a large-scale cross-validation on a Saccharomyces cerevisiae proteome compiled from BioGRID and comparing their predictions for 22 protein subcellular localizations. Furthermore, we build an ensemble classifier to associate 529 unlabeled and 137 ambiguously-annotated proteins with subcellular localizations, most of which have been verified in previous experimental studies. CONCLUSIONS: Physical interaction of proteins provides an essential clue to their co-localization. Compared to the local approaches, the global algorithms consistently achieve superior performance.
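
The neighbour-voting idea behind the Majority approach named in this abstract can be sketched as below. This is a simplified reading, not the authors' implementation: each interaction partner with known annotations casts one vote per localization, and the `top_n` most frequent labels are returned. The protein identifiers and annotations are hypothetical.

```python
from collections import Counter

def majority_localizations(protein, interactions, known, top_n=1):
    """Rank localizations by how many interaction partners carry them."""
    votes = Counter()
    for neighbour in interactions.get(protein, ()):
        votes.update(known.get(neighbour, ()))  # unannotated partners cast no vote
    # Sort by descending vote count, breaking ties alphabetically for determinism.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [label for label, _ in ranked[:top_n]]

# Hypothetical toy network: P1 interacts with three annotated proteins.
interactions = {"P1": ["P2", "P3", "P4"]}
known = {
    "P2": {"nucleus"},
    "P3": {"nucleus", "cytoplasm"},
    "P4": {"mitochondrion"},
}
single = majority_localizations("P1", interactions, known)
multiplex = majority_localizations("P1", interactions, known, top_n=2)
```

This is a local method in the abstract's sense: it looks only at direct neighbours, whereas the global algorithms propagate information across the whole network.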

    Imbalanced Multi-Modal Multi-Label Learning for Subcellular Localization Prediction of Human Proteins with Both Single and Multiple Sites

    It is well known that an important step toward understanding the functions of a protein is to determine its subcellular location. Although numerous prediction algorithms have been developed, most of them have focused on proteins with only one location. In recent years, researchers have begun to pay attention to the subcellular localization prediction of proteins with multiple sites. However, almost all the existing approaches fail to take into account the correlations among the locations caused by proteins with multiple sites, which may be important information for improving the prediction accuracy for such proteins. In this paper, a new algorithm that can effectively exploit the correlations among the locations is proposed, using a Gaussian process model. In addition, the algorithm can realize an optimal linear combination of various feature extraction technologies and can be robust to imbalanced data sets. Experimental results on a human protein data set show that the proposed algorithm is valid and can achieve better performance than the existing approaches.

    Protein subcellular localization prediction based on compartment-specific features and structure conservation

    BACKGROUND: Protein subcellular localization is crucial for genome annotation, protein function prediction, and drug discovery. Determination of subcellular localization using experimental approaches is time-consuming; thus, computational approaches become highly desirable. Extensive studies of localization prediction have led to the development of several methods, including composition-based and homology-based methods. However, their performance might be significantly degraded if homologous sequences are not detected. Moreover, methods that integrate various features could suffer from low coverage in high-throughput proteomic analyses due to the lack of information to characterize unknown proteins. RESULTS: We propose a hybrid prediction method for Gram-negative bacteria that combines a one-versus-one support vector machine (SVM) model and a structural homology approach. The SVM model comprises a number of binary classifiers, in which biological features derived from Gram-negative bacteria translocation pathways are incorporated. In the structural homology approach, we employ secondary structure alignment for structural similarity comparison and assign the known localization of the top-ranked protein as the predicted localization of a query protein. The hybrid method achieves overall accuracy of 93.7% and 93.2% using ten-fold cross-validation on the benchmark data sets. On the evaluation data sets, our method also attains a prediction accuracy of 84.0%, especially when testing on sequences with a low level of homology to the training data. A three-way data split procedure is also incorporated to prevent overestimation of the predictive performance. In addition, we show that the prediction accuracy should be approximately 85% for non-redundant data sets with sequence identity below 30%. CONCLUSION: Our results demonstrate that biological features derived from Gram-negative bacteria translocation pathways yield a significant improvement. The biological features are interpretable and can be applied in advanced analyses and experimental designs. Moreover, combining the structural homology approach further improves the overall accuracy, which suggests that structural conservation could be a useful indicator for inferring localization in addition to sequence homology. The proposed method can be used in large-scale analyses of proteomes.
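
The one-versus-one scheme mentioned in this abstract trains one binary classifier per pair of classes and combines them by voting. The sketch below shows only that voting step, under stated assumptions: the three class labels are illustrative, and the threshold rules stand in for trained SVMs, which are not reproduced here.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(features, classes, binary_classifiers):
    """Run every pairwise classifier and return the class with the most votes."""
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[binary_classifiers[pair](features)] += 1
    # Ties are resolved in favour of the class listed first in `classes`.
    return max(classes, key=lambda label: votes[label])

# Hypothetical stand-ins for trained binary SVMs, keyed by class pair.
classes = ("cytoplasm", "periplasm", "extracellular")
binary_classifiers = {
    ("cytoplasm", "periplasm"): lambda x: "cytoplasm" if x[0] < 0.5 else "periplasm",
    ("cytoplasm", "extracellular"): lambda x: "cytoplasm" if x[0] < 0.7 else "extracellular",
    ("periplasm", "extracellular"): lambda x: "periplasm" if x[0] < 0.9 else "extracellular",
}
prediction = one_vs_one_predict([0.6], classes, binary_classifiers)
```

With n classes this scheme needs n(n-1)/2 binary classifiers, each of which can use features chosen for its own pair of compartments.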

    Protein subcellular localization prediction of eukaryotes using a knowledge-based approach

    BACKGROUND: The study of protein subcellular localization (PSL) is important for elucidating protein functions involved in various cellular processes. However, determining the localization sites of a protein through wet-lab experiments can be time-consuming and labor-intensive. Thus, computational approaches become highly desirable. Most PSL prediction systems are established for single-localized proteins. However, a significant number of eukaryotic proteins are known to be localized in multiple subcellular organelles. Many studies have shown that proteins may simultaneously reside in, or move between, different cellular compartments and be involved in different biological processes with different roles. RESULTS: In this study, we propose a knowledge-based method, called KnowPredsite, to predict the localization site(s) of both single-localized and multi-localized proteins. Based on local similarity, we can identify the "related sequences" for prediction. We construct a knowledge base to record the possible sequence variations for protein sequences. When predicting the localization annotation of a query protein, we search against the knowledge base and use a scoring mechanism to determine the predicted sites. We downloaded the dataset from ngLOC, which consists of ten distinct subcellular organelles from 1923 species, and performed ten-fold cross-validation experiments to evaluate KnowPredsite's performance. The experimental results show that KnowPredsite achieves higher prediction accuracy than ngLOC and the Blast-hit method. For single-localized proteins, the overall accuracy of KnowPredsite is 91.7%. For multi-localized proteins, the overall accuracy of KnowPredsite is 72.1%, which is significantly higher than that of ngLOC by 12.4%. Notably, half of the proteins in the dataset that cannot find any Blast hit sequence above a specified threshold can still be correctly predicted by KnowPredsite. CONCLUSION: KnowPredsite demonstrates the power of identifying related sequences in the knowledge base. The experimental results show that even when sequence similarity is low, local similarity is effective for prediction, and that KnowPredsite is a highly accurate prediction method for both single- and multi-localized proteins. It is worth mentioning that the prediction process of KnowPredsite is transparent and biologically interpretable, and that it shows the set of template sequences used to generate the prediction result. The KnowPredsite prediction server is available at http://bio-cluster.iis.sinica.edu.tw/kbloc/.