51 research outputs found
A study of network-based kernel methods on protein-protein interaction for protein functions prediction
Predicting protein functions is an important issue in the post-genomic era. In this paper, we studied several network-based kernels including Local Linear Embedding (LLE) kernel method, Diffusion kernel and Laplacian Kernel to uncover the relationship between proteins functions and Protein-Protein Interactions (PPI). We first construct kernels based on PPI networks, we then apply Support Vector Machine (SVM) techniques to classify proteins into different functional groups. 5-fold cross validation is then applied to the selected 359 GO terms to compare the performance of different kernels and guilt-by-association methods including neighbor counting methods and Chisquare methods. Finally we made predictions of functions of some unknown genes and verified the preciseness of our prediction in part by the information of other data source.postprintThe 3rd International Symposium on Optimization and Systems Biology (OSB 2009), Zhangjiajie, China, 20-22 September 2009. In Lecture Notes in Operations Research, 2009, v. 11, p. 25-3
A markov classification model for metabolic pathways
<p>Abstract</p> <p>Background</p> <p>This paper considers the problem of identifying pathways through metabolic networks that relate to a specific biological response. Our proposed model, HME3M, first identifies frequently traversed network paths using a Markov mixture model. Then by employing a hierarchical mixture of experts, separate classifiers are built using information specific to each path and combined into an ensemble prediction for the response.</p> <p>Results</p> <p>We compared the performance of HME3M with logistic regression and support vector machines (SVM) for both simulated pathways and on two metabolic networks, glycolysis and the pentose phosphate pathway for <it>Arabidopsis thaliana</it>. We use AltGenExpress microarray data and focus on the pathway differences in the developmental stages and stress responses of <it>Arabidopsis</it>. The results clearly show that HME3M outperformed the comparison methods in the presence of increasing network complexity and pathway noise. Furthermore an analysis of the paths identified by HME3M for each metabolic network confirmed known biological responses of <it>Arabidopsis</it>.</p> <p>Conclusions</p> <p>This paper clearly shows HME3M to be an accurate and robust method for classifying metabolic pathways. HME3M is shown to outperform all comparison methods and further is capable of identifying known biologically active pathways within microarray data.</p
On Evaluating MHC-II Binding Peptide Prediction Methods
Choice of one method over another for MHC-II binding peptide prediction is typically based on published reports of their estimated performance on standard benchmark datasets. We show that several standard benchmark datasets of unique peptides used in such studies contain a substantial number of peptides that share a high degree of sequence identity with one or more other peptide sequences in the same dataset. Thus, in a standard cross-validation setup, the test set and the training set are likely to contain sequences that share a high degree of sequence identity with each other, leading to overly optimistic estimates of performance. Hence, to more rigorously assess the relative performance of different prediction methods, we explore the use of similarity-reduced datasets. We introduce three similarity-reduced MHC-II benchmark datasets derived from MHCPEP, MHCBN, and IEDB databases. The results of our comparison of the performance of three MHC-II binding peptide prediction methods estimated using datasets of unique peptides with that obtained using their similarity-reduced counterparts shows that the former can be rather optimistic relative to the performance of the same methods on similarity-reduced counterparts of the same datasets. Furthermore, our results demonstrate that conclusions regarding the superiority of one method over another drawn on the basis of performance estimates obtained using commonly used datasets of unique peptides are often contradicted by the observed performance of the methods on the similarity-reduced versions of the same datasets. These results underscore the importance of using similarity-reduced datasets in rigorously comparing the performance of alternative MHC-II peptide prediction methods
PepDist: A New Framework for Protein-Peptide Binding Prediction based on Learning Peptide Distance Functions
BACKGROUND: Many different aspects of cellular signalling, trafficking and targeting mechanisms are mediated by interactions between proteins and peptides. Representative examples are MHC-peptide complexes in the immune system. Developing computational methods for protein-peptide binding prediction is therefore an important task with applications to vaccine and drug design. METHODS: Previous learning approaches address the binding prediction problem using traditional margin based binary classifiers. In this paper we propose PepDist: a novel approach for predicting binding affinity. Our approach is based on learning peptide-peptide distance functions. Moreover, we suggest to learn a single peptide-peptide distance function over an entire family of proteins (e.g. MHC class I). This distance function can be used to compute the affinity of a novel peptide to any of the proteins in the given family. In order to learn these peptide-peptide distance functions, we formalize the problem as a semi-supervised learning problem with partial information in the form of equivalence constraints. Specifically, we propose to use DistBoost [1,2], which is a semi-supervised distance learning algorithm. RESULTS: We compare our method to various state-of-the-art binding prediction algorithms on MHC class I and MHC class II datasets. In almost all cases, our method outperforms all of its competitors. One of the major advantages of our novel approach is that it can also learn an affinity function over proteins for which only small amounts of labeled peptides exist. In these cases, our method's performance gain, when compared to other computational methods, is even more pronounced. We have recently uploaded the PepDist webserver which provides binding prediction of peptides to 35 different MHC class I alleles. The webserver which can be found at is powered by a prediction engine which was trained using the framework presented in this paper. CONCLUSION: The results obtained suggest that learning a single distance function over an entire family of proteins achieves higher prediction accuracy than learning a set of binary classifiers for each of the proteins separately. We also show the importance of obtaining information on experimentally determined non-binders. Learning with real non-binders generalizes better than learning with randomly generated peptides that are assumed to be non-binders. This suggests that information about non-binding peptides should also be published and made publicly available
Genome-Wide Integration on Transcription Factors, Histone Acetylation and Gene Expression Reveals Genes Co-Regulated by Histone Modification Patterns
N-terminal tails of H2A, H2B, H3 and H4 histone families are subjected to posttranslational modifications that take part in transcriptional regulation mechanisms, such as transcription factor binding and gene expression. Regulation mechanisms under control of histone modification are important but remain largely unclear, despite of emerging datasets for comprehensive analysis of histone modification. In this paper, we focus on what we call genetic harmonious units (GHUs), which are co-occurring patterns among transcription factor binding, gene expression and histone modification. We present the first genome-wide approach that captures GHUs by combining ChIP-chip with microarray datasets from Saccharomyces cerevisiae. Our approach employs noise-robust soft clustering to select patterns which share the same preferences in transcription factor-binding, histone modification and gene expression, which are all currently implied to be closely correlated. The detected patterns are a well-studied acetylation of lysine 16 of H4 in glucose depletion as well as co-acetylation of five lysine residues of H3 with H4 Lys12 and H2A Lys7 responsible for ribosome biogenesis. Furthermore, our method further suggested the recognition of acetylated H4 Lys16 being crucial to histone acetyltransferase ESA1, whose essential role is still under controversy, from a microarray dataset on ESA1 and its bypass suppressor mutants. These results demonstrate that our approach allows us to provide clearer principles behind gene regulation mechanisms under histone modifications and detect GHUs further by applying to other microarray and ChIP-chip datasets. The source code of our method, which was implemented in MATLAB (http://www.mathworks.com/), is available from the supporting page for this paper: http://www.bic.kyoto-u.ac.jp/pathway/natsume/hm_detector.htm
Calpain Cleavage Prediction Using Multiple Kernel Learning
Calpain, an intracellular -dependent cysteine protease, is known to play a role in a wide range of metabolic pathways through limited proteolysis of its substrates. However, only a limited number of these substrates are currently known, with the exact mechanism of substrate recognition and cleavage by calpain still largely unknown. While previous research has successfully applied standard machine-learning algorithms to accurately predict substrate cleavage by other similar types of proteases, their approach does not extend well to calpain, possibly due to its particular mode of proteolytic action and limited amount of experimental data. Through the use of Multiple Kernel Learning, a recent extension to the classic Support Vector Machine framework, we were able to train complex models based on rich, heterogeneous feature sets, leading to significantly improved prediction quality (6% over highest AUC score produced by state-of-the-art methods). In addition to producing a stronger machine-learning model for the prediction of calpain cleavage, we were able to highlight the importance and role of each feature of substrate sequences in defining specificity: primary sequence, secondary structure and solvent accessibility. Most notably, we showed there existed significant specificity differences across calpain sub-types, despite previous assumption to the contrary. Prediction accuracy was further successfully validated using, as an unbiased test set, mutated sequences of calpastatin (endogenous inhibitor of calpain) modified to no longer block calpain's proteolytic action. An online implementation of our prediction tool is available at http://calpain.org
NetMHCpan, a Method for Quantitative Predictions of Peptide Binding to Any HLA-A and -B Locus Protein of Known Sequence
Binding of peptides to Major Histocompatibility Complex (MHC) molecules is the single most selective step in the recognition of pathogens by the cellular immune system. The human MHC class I system (HLA-I) is extremely polymorphic. The number of registered HLA-I molecules has now surpassed 1500. Characterizing the specificity of each separately would be a major undertaking.Here, we have drawn on a large database of known peptide-HLA-I interactions to develop a bioinformatics method, which takes both peptide and HLA sequence information into account, and generates quantitative predictions of the affinity of any peptide-HLA-I interaction. Prospective experimental validation of peptides predicted to bind to previously untested HLA-I molecules, cross-validation, and retrospective prediction of known HIV immune epitopes and endogenous presented peptides, all successfully validate this method. We further demonstrate that the method can be applied to perform a clustering analysis of MHC specificities and suggest using this clustering to select particularly informative novel MHC molecules for future biochemical and functional analysis.Encompassing all HLA molecules, this high-throughput computational method lends itself to epitope searches that are not only genome- and pathogen-wide, but also HLA-wide. Thus, it offers a truly global analysis of immune responses supporting rational development of vaccines and immunotherapy. It also promises to provide new basic insights into HLA structure-function relationships. The method is available at http://www.cbs.dtu.dk/services/NetMHCpan
A voting approach to identify a small number of highly predictive genes using multiple classifiers
<p>Abstract</p> <p>Background</p> <p>Microarray gene expression profiling has provided extensive datasets that can describe characteristics of cancer patients. An important challenge for this type of data is the discovery of gene sets which can be used as the basis of developing a clinical predictor for cancer. It is desirable that such gene sets be compact, give accurate predictions across many classifiers, be biologically relevant and have good biological process coverage.</p> <p>Results</p> <p>By using a new type of multiple classifier voting approach, we have identified gene sets that can predict breast cancer prognosis accurately, for a range of classification algorithms. Unlike a wrapper approach, our method is not specialised towards a single classification technique. Experimental analysis demonstrates higher prediction accuracies for our sets of genes compared to previous work in the area. Moreover, our sets of genes are generally more compact than those previously proposed. Taking a biological viewpoint, from the literature, most of the genes in our sets are known to be strongly related to cancer.</p> <p>Conclusion</p> <p>We show that it is possible to obtain superior classification accuracy with our approach and obtain a compact gene set that is also biologically relevant and has good coverage of different biological processes.</p
Improving model construction of profile HMMs for remote homology detection through structural alignment
<p>Abstract</p> <p>Background</p> <p>Remote homology detection is a challenging problem in Bioinformatics. Arguably, profile Hidden Markov Models (pHMMs) are one of the most successful approaches in addressing this important problem. pHMM packages present a relatively small computational cost, and perform particularly well at recognizing remote homologies. This raises the question of whether structural alignments could impact the performance of pHMMs trained from proteins in the <it>Twilight Zone</it>, as structural alignments are often more accurate than sequence alignments at identifying motifs and functional residues. Next, we assess the impact of using structural alignments in pHMM performance.</p> <p>Results</p> <p>We used the SCOP database to perform our experiments. Structural alignments were obtained using the 3DCOFFEE and MAMMOTH-mult tools; sequence alignments were obtained using CLUSTALW, TCOFFEE, MAFFT and PROBCONS. We performed leave-one-family-out cross-validation over super-families. Performance was evaluated through ROC curves and paired two tailed t-test.</p> <p>Conclusion</p> <p>We observed that pHMMs derived from structural alignments performed significantly better than pHMMs derived from sequence alignment in low-identity regions, mainly below 20%. We believe this is because structural alignment tools are better at focusing on the important patterns that are more often conserved through evolution, resulting in higher quality pHMMs. On the other hand, sensitivity of these tools is still quite low for these low-identity regions. Our results suggest a number of possible directions for improvements in this area.</p
- …