4 research outputs found

    Inferring a protein interaction map of Mycobacterium tuberculosis based on sequences and interologs

    Get PDF
    Background: Mycobacterium tuberculosis is an infectious bacterium posing serious threats to human health. Due to the difficulty in performing molecular biology experiments to detect protein interactions, reconstruction of a protein interaction map of M. tuberculosis by computational methods will provide crucial information to understand the biological processes in the pathogenic microorganism, as well as provide the framework upon which new therapeutic approaches can be developed.Results: In this paper, we constructed an integrated M. tuberculosis protein interaction network by machine learning and ortholog-based methods. Firstly, we built a support vector machine (SVM) method to infer the protein interactions of M. tuberculosis H37Rv by gene sequence information. We tested our predictors in Escherichia coli and mapped the genetic codon features underlying its protein interactions to M. tuberculosis. Moreover, the documented interactions of 14 other species were mapped to the interactome of M. tuberculosis by the interolog method. The ensemble protein interactions were validated by various functional relationships, i.e., gene coexpression, evolutionary relationship and functional similarity, extracted from heterogeneous data sources. The accuracy and validation demonstrate the effectiveness and efficiency of our framework.Conclusions: A protein interaction map of M. tuberculosis is inferred from genetic codons and interologs. The prediction accuracy and numerically experimental validation demonstrate the effectiveness and efficiency of our method. Furthermore, our methods can be straightforwardly extended to infer the protein interactions of other bacterial species. © 2012 Liu et al.; licensee BioMed Central Ltd.Link_to_subscribed_fulltex

    Predicting RNA-Protein Interactions Using Only Sequence Information

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>RNA-protein interactions (RPIs) play important roles in a wide variety of cellular processes, ranging from transcriptional and post-transcriptional regulation of gene expression to host defense against pathogens. High throughput experiments to identify RNA-protein interactions are beginning to provide valuable information about the complexity of RNA-protein interaction networks, but are expensive and time consuming. Hence, there is a need for reliable computational methods for predicting RNA-protein interactions.</p> <p>Results</p> <p>We propose <b><it>RPISeq</it></b>, a family of classifiers for predicting <b><it>R</it></b>NA-<b><it>p</it></b>rotein <b><it>i</it></b>nteractions using only <b><it>seq</it></b>uence information. Given the sequences of an RNA and a protein as input, <it>RPIseq </it>predicts whether or not the RNA-protein pair interact. The RNA sequence is encoded as a normalized vector of its ribonucleotide 4-mer composition, and the protein sequence is encoded as a normalized vector of its 3-mer composition, based on a 7-letter reduced alphabet representation. Two variants of <it>RPISeq </it>are presented: <it>RPISeq-SVM</it>, which uses a Support Vector Machine (SVM) classifier and <it>RPISeq-RF</it>, which uses a Random Forest classifier. On two non-redundant benchmark datasets extracted from the Protein-RNA Interface Database (PRIDB), <it>RPISeq </it>achieved an AUC (Area Under the Receiver Operating Characteristic (ROC) curve) of 0.96 and 0.92. On a third dataset containing only mRNA-protein interactions, the performance of <it>RPISeq </it>was competitive with that of a published method that requires information regarding many different features (e.g., mRNA half-life, GO annotations) of the putative RNA and protein partners. In addition, <it>RPISeq </it>classifiers trained using the PRIDB data correctly predicted the majority (57-99%) of non-coding RNA-protein interactions in NPInter-derived networks from <it>E. coli, S. cerevisiae, D. melanogaster, M. musculus</it>, and <it>H. sapiens</it>.</p> <p>Conclusions</p> <p>Our experiments with <it>RPISeq </it>demonstrate that RNA-protein interactions can be reliably predicted using only sequence-derived information. <it>RPISeq </it>offers an inexpensive method for computational construction of RNA-protein interaction networks, and should provide useful insights into the function of non-coding RNAs. <it>RPISeq </it>is freely available as a web-based server at <url>http://pridb.gdcb.iastate.edu/RPISeq/.</url></p

    A normalized differential sequence feature encoding method based on amino acid sequences

    Get PDF
    Protein interactions are the foundation of all metabolic activities of cells, such as apoptosis, the immune response, and metabolic pathways. In order to optimize the performance of protein interaction prediction, a coding method based on normalized difference sequence characteristics (NDSF) of amino acid sequences is proposed. By using the positional relationships between amino acids in the sequences and the correlation characteristics between sequence pairs, NDSF is jointly encoded. Using principal component analysis (PCA) and local linear embedding (LLE) dimensionality reduction methods, the coded 174-dimensional human protein sequence vector is extracted using sequence features. This study compares the classification performance of four ensemble learning methods (AdaBoost, Extra trees, LightGBM, XGBoost) applied to PCA and LLE features. Cross-validation and grid search methods are used to find the best combination of parameters. The results show that the accuracy of NDSF is generally higher than that of the sequence matrix-based coding method (MOS) coding method, and the loss and coding time can be greatly reduced. The bar chart of feature extraction shows that the classification accuracy is significantly higher when using the linear dimensionality reduction method, PCA, compared to the nonlinear dimensionality reduction method, LLE. After classification with XGBoost, the model accuracy reaches 99.2%, which provides the best performance among all models. This study suggests that NDSF combined with PCA and XGBoost may be an effective strategy for classifying different human protein interactions

    Computational prediction of RNA-protein interaction partners and interfaces

    Get PDF
    RNA-protein interactions play important roles in fundamental cellular processes involved in human diseases, viral replication and defense against pathogens in plants, animals and microbes. However, the detailed recognition mechanisms underlying these interactions are poorly understood. To gain a better understanding of the molecular recognition code for RNA-protein interactions, this dissertation has three related goals: i) to develop methods for predicting RNA-protein interaction partners; ii) to develop an approach for predicting interfacial residues in both the RNA and protein components of RNA-protein complexes; and iii) to develop computational tools and resources for investigating RNA-protein interactions. First, we present machine learning classifiers for predicting RNA-protein interaction partners. The classifiers use the amino acid composition of proteins and the ribonucleotide composition of RNAs as input to predict whether a given RNA-protein pair interacts. We show that protein and RNA sequences alone (i.e., in the absence of any structural information) contain enough signal to allow reliable prediction of interaction partners. Second, we present RPISeq, a webserver that predicts the interaction probabilities of input RNA-protein pairs, using the above-mentioned machine learning classifiers. A comprehensive database of RNA-protein interactions, RPIntDB, is integrated with the webserver to allow users to search for homologous proteins and their known interacting RNA partners. Finally, we perform an analysis of contiguous interfacial amino acids and ribonucleotides in RNA-protein complexes for which structures are known. We generate a dataset of bipartite RNA-protein motifs that can be used to predict interfacial residues in both the RNA and protein sequences of a given RNA-protein pair simultaneously. We show that taking binding partner information into account leads to higher precision in the prediction of RNA-binding residues in proteins. Taken together, these studies have increased our understanding of how RNA and proteins interact
    corecore