7,785 research outputs found

    Protein (Multi-)Location Prediction: Using Location Inter-Dependencies in a Probabilistic Framework

    Full text link
    Knowing the location of a protein within the cell is important for understanding its function, role in biological processes, and potential use as a drug target. Much progress has been made in developing computational methods that predict single locations for proteins, assuming that proteins localize to a single location. However, it has been shown that proteins localize to multiple locations. While a few recent systems have attempted to predict multiple locations of proteins, they typically treat locations as independent or capture inter-dependencies by treating each locations-combination present in the training set as an individual location-class. We present a new method and a preliminary system we have developed that directly incorporates inter-dependencies among locations into the multiple-location-prediction process, using a collection of Bayesian network classifiers. We evaluate our system on a dataset of single- and multi-localized proteins. Our results, obtained by incorporating inter-dependencies are significantly higher than those obtained by classifiers that do not use inter-dependencies. The performance of our system on multi-localized proteins is comparable to a top performing system (YLoc+), without restricting predictions to be based only on location-combinations present in the training set.Comment: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI2013

    Imbalanced Multi-Modal Multi-Label Learning for Subcellular Localization Prediction of Human Proteins with Both Single and Multiple Sites

    Get PDF
    It is well known that an important step toward understanding the functions of a protein is to determine its subcellular location. Although numerous prediction algorithms have been developed, most of them typically focused on the proteins with only one location. In recent years, researchers have begun to pay attention to the subcellular localization prediction of the proteins with multiple sites. However, almost all the existing approaches have failed to take into account the correlations among the locations caused by the proteins with multiple sites, which may be the important information for improving the prediction accuracy of the proteins with multiple sites. In this paper, a new algorithm which can effectively exploit the correlations among the locations is proposed by using Gaussian process model. Besides, the algorithm also can realize optimal linear combination of various feature extraction technologies and could be robust to the imbalanced data set. Experimental results on a human protein data set show that the proposed algorithm is valid and can achieve better performance than the existing approaches

    A method to improve protein subcellular localization prediction by integrating various biological data sources

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Protein subcellular localization is crucial information to elucidate protein functions. Owing to the need for large-scale genome analysis, computational method for efficiently predicting protein subcellular localization is highly required. Although many previous works have been done for this task, the problem is still challenging due to several reasons: the number of subcellular locations in practice is large; distribution of protein in locations is imbalanced, that is the number of protein in each location remarkably different; and there are many proteins located in multiple locations. Thus it is necessary to explore new features and appropriate classification methods to improve the prediction performance.</p> <p>Results</p> <p>In this paper we propose a new predicting method which combines two key ideas: 1) Information of neighbour proteins in a probabilistic gene network is integrated to enrich the prediction features. 2) Fuzzy k-NN, a classification method based on fuzzy set theory is applied to predict protein locating in multiple sites. Experiment was conducted on a dataset consisting of 22 locations from Budding yeast proteins and significant improvement was observed.</p> <p>Conclusion</p> <p>Our results suggest that the neighbourhood information from functional gene networks is predictive to subcellular localization. The proposed method thus can be integrated and complementary to other available prediction methods.</p

    PLPD: reliable protein localization prediction from imbalanced and overlapped datasets

    Get PDF
    Subcellular localization is one of the key functional characteristics of proteins. An automatic and efficient prediction method for the protein subcellular localization is highly required owing to the need for large-scale genome analysis. From a machine learning point of view, a dataset of protein localization has several characteristics: the dataset has too many classes (there are more than 10 localizations in a cell), it is a multi-label dataset (a protein may occur in several different subcellular locations), and it is too imbalanced (the number of proteins in each localization is remarkably different). Even though many previous works have been done for the prediction of protein subcellular localization, none of them tackles effectively these characteristics at the same time. Thus, a new computational method for protein localization is eventually needed for more reliable outcomes. To address the issue, we present a protein localization predictor based on D-SVDD (PLPD) for the prediction of protein localization, which can find the likelihood of a specific localization of a protein more easily and more correctly. Moreover, we introduce three measurements for the more precise evaluation of a protein localization predictor. As the results of various datasets which are made from the experiments of Huh et al. (2003), the proposed PLPD method represents a different approach that might play a complimentary role to the existing methods, such as Nearest Neighbor method and discriminate covariant method. Finally, after finding a good boundary for each localization using the 5184 classified proteins as training data, we predicted 138 proteins whose subcellular localizations could not be clearly observed by the experiments of Huh et al. (2003)

    MULocDeep: A deep-learning framework for protein subcellular and suborganellar localization prediction with residue-level interpretation

    Get PDF
    Prediction of protein localization plays an important role in understanding protein function and mechanisms. In this paper, we propose a general deep learning-based localization prediction framework, MULocDeep, which can predict multiple localizations of a protein at both subcellular and suborganellar levels. We collected a dataset with 44 suborganellar localization annotations in 10 major subcellular compartments—the most comprehensive suborganelle localization dataset to date. We also experimentally generated an independent dataset of mitochondrial proteins in Arabidopsis thaliana cell cultures, Solanum tuberosum tubers, and Vicia faba roots and made this dataset publicly available. Evaluations using the above datasets show that overall, MULocDeep outperforms other major methods at both subcellular and suborganellar levels. Furthermore, MULocDeep assesses each amino acid's contribution to localization, which provides insights into the mechanism of protein sorting and localization motifs. A web server can be accessed at http://mu-loc.org

    Predict gram - positive and gram - negative subcellular localization via incorporating evolutionary information and physicochemical features into Chou’s general PseAAC

    Get PDF
    In this study, we used structural and evolutionary based features to represent the sequences of gram-positive and gram-negative subcellular localizations. To do this, we proposed a normalization method to construct a normalize Position Specific Scoring Matrix (PSSM) using the information from original PSSM. To investigate the effectiveness of the proposed method we compute feature vectors from normalize PSSM and by applying Support Vector Machine (SVM) and Naïve Bayes classifier, respectively, we compared achieved results with the previously reported results. We also computed features from original PSSM and normalized PSSM and compared their results. The archived results show enhancement in gram-positive and gram-negative subcellular localizations. Evaluating localization for each feature, our results indicate that employing SVM and concatenating features (amino acid composition feature, Dubchak feature (physicochemical-based features), normalized PSSM based auto-covariance feature and normalized PSSM based bigram feature) have higher accuracy while employing Naïve Bayes classifier with normalized PSSM based auto-covariance feature proves to have high sensitivity for both benchmarks. Our reported results in terms of overall locative accuracy is 84.8% and overall absolute accuracy is 85.16% for gram-positive dataset; and, for gram- negative dataset, overall locative accuracy is 85.4% and overall absolute accuracy is 86.3%

    Machine learning methods for omics data integration

    Get PDF
    High-throughput technologies produce genome-scale transcriptomic and metabolomic (omics) datasets that allow for the system-level studies of complex biological processes. The limitation lies in the small number of samples versus the larger number of features represented in these datasets. Machine learning methods can help integrate these large-scale omics datasets and identify key features from each dataset. A novel class dependent feature selection method integrates the F statistic, maximum relevance binary particle swarm optimization (MRBPSO), and class dependent multi-category classification (CDMC) system. A set of highly differentially expressed genes are pre-selected using the F statistic as a filter for each dataset. MRBPSO and CDMC function as a wrapper to select desirable feature subsets for each class and classify the samples using those chosen class-dependent feature subsets. The results indicate that the class-dependent approaches can effectively identify unique biomarkers for each cancer type and improve classification accuracy compared to class independent feature selection methods. The integration of transcriptomics and metabolomics data is based on a classification framework. Compared to principal component analysis and non-negative matrix factorization based integration approaches, our proposed method achieves 20-30% higher prediction accuracies on Arabidopsis tissue development data. Metabolite-predictive genes and gene-predictive metabolites are selected from transcriptomic and metabolomic data respectively. The constructed gene-metabolite correlation network can infer the functions of unknown genes and metabolites. Tissue-specific genes and metabolites are identified by the class-dependent feature selection method. Evidence from subcellular locations, gene ontology, and biochemical pathways support the involvement of these entities in different developmental stages and tissues in Arabidopsis
    • …
    corecore