7 research outputs found
SPRINT: Ultrafast protein-protein interaction prediction of the entire human interactome
Proteins usually perform their functions by interacting with other proteins.
Predicting which proteins interact is a fundamental problem. Experimental
methods are slow, expensive, and have a high rate of error. Many computational
methods have been proposed among which sequence-based ones are very promising.
However, so far no such method is able to predict effectively the entire human
interactome: they require too much time or memory. We present SPRINT (Scoring
PRotein INTeractions), a new sequence-based algorithm and tool for predicting
protein-protein interactions. We comprehensively compare SPRINT with
state-of-the-art programs on seven of the most reliable human PPI datasets and show
that it is more accurate while running orders of magnitude faster and using
very little memory. SPRINT is the only program that can predict the entire
human interactome. Our goal is to transform the very challenging problem of
predicting the entire human interactome into a routine task. The source code of
SPRINT is freely available from github.com/lucian-ilie/SPRINT/ and the datasets
and predicted PPIs from www.csd.uwo.ca/faculty/ilie/SPRINT/
Issues in performance evaluation for host–pathogen protein interaction prediction
The study of interactions between host and pathogen proteins is important for understanding the underlying mechanisms of infectious diseases and for developing novel therapeutic solutions. Wet-lab techniques for detecting protein–protein interactions (PPIs) can benefit from computational predictions. Machine learning is one of the computational approaches that can assist biologists by predicting promising PPIs. A number of machine-learning-based methods for predicting host–pathogen interactions (HPI) have been proposed in the literature. The techniques used for assessing the accuracy of such predictors are of critical importance in this domain. In this paper, we question the effectiveness of K-fold cross-validation for estimating the generalization ability of HPI prediction for proteins with no known interactions. K-fold cross-validation does not model this scenario, and we demonstrate a sizable difference between the performance it estimates and the performance estimated by an alternative evaluation scheme called leave-one-pathogen-protein-out (LOPO) cross-validation. LOPO is more effective in modeling the real-world use of HPI predictors, specifically for cases in which no information about the interacting partners of a pathogen protein is available during training. We also point out that currently used metrics such as areas under the precision-recall or receiver operating characteristic curves are not intuitive to biologists and propose simpler and more directly interpretable metrics for this purpose.
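The LOPO scheme named in this abstract can be illustrated with a minimal Python sketch. It assumes that the data are a list of (host, pathogen) protein pairs; the function name `lopo_splits` and the toy identifiers are hypothetical and are not taken from the paper. Each fold holds out every pair involving one pathogen protein, so the model never sees that protein's partners during training.

```python
from collections import defaultdict


def lopo_splits(pairs):
    """Yield (pathogen, train_idx, test_idx), one fold per pathogen protein.

    pairs: list of (host_protein, pathogen_protein) tuples (assumed layout).
    """
    # Group the indices of all pairs by their pathogen protein.
    by_pathogen = defaultdict(list)
    for i, (_host, pathogen) in enumerate(pairs):
        by_pathogen[pathogen].append(i)

    for pathogen in sorted(by_pathogen):
        test_idx = by_pathogen[pathogen]
        held_out = set(test_idx)
        # Training data exclude every pair that involves the held-out protein.
        train_idx = [i for i in range(len(pairs)) if i not in held_out]
        yield pathogen, train_idx, test_idx


# Toy data with hypothetical identifiers.
pairs = [("H1", "P1"), ("H2", "P1"), ("H1", "P2"), ("H3", "P3"), ("H2", "P3")]
splits = list(lopo_splits(pairs))
for pathogen, train_idx, test_idx in splits:
    print(pathogen, train_idx, test_idx)
```

By contrast, a random K-fold split would typically place some pairs of the same pathogen protein in both training and test folds, which is the scenario the abstract argues does not reflect proteins with no known interactions.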
ProfPPIdb: Pairs of physical protein-protein interactions predicted for entire proteomes
Motivation: Protein-protein interactions (PPIs) play a key role in many cellular processes. Most annotations of PPIs mix experimental and computational data. The mix optimizes coverage, but obfuscates the annotation origin. Some resources excel at focusing on reliable experimental data. Here, we focused on new pairs of interacting proteins for several model organisms based solely on sequence-based prediction methods. Results: We extracted reliable experimental data about which proteins interact (binary) for eight diverse model organisms from public databases, namely from Escherichia coli, Schizosaccharomyces pombe, Plasmodium falciparum, Drosophila melanogaster, Caenorhabditis elegans, Mus musculus, Rattus norvegicus, Arabidopsis thaliana, and for the previously used Homo sapiens and Saccharomyces cerevisiae. Those data were the base to develop a PPI prediction method for each model organism. The method used evolutionary information through a profile-kernel Support Vector Machine (SVM). With the resulting eight models, we predicted all possible protein pairs in each organism and made the top predictions available through a web application. Almost all of the PPIs made available were predicted between proteins that have not been observed in any interaction, in particular for less well-studied organisms. Thus, our work complements existing resources and is particularly helpful for designing experiments because of its uniqueness. Experimental annotations and computational predictions are strongly influenced by the fact that some proteins have many partners and others few. To optimize machine learning, recent methods explicitly ignored such a network structure and relied either on domain knowledge or sequence-only methods. Our approach is independent of domain knowledge and leverages evolutionary information. The database interface representing our results is accessible from https://rostlab.org/services/ppipair/.
The data can also be downloaded from https://figshare.com/collections/ProfPPI-DB/4141784
Improved K-mer Based Prediction of Protein-Protein Interactions With Chaos Game Representation, Deep Learning and Reduced Representation Bias
Protein-protein interactions drive many biological processes, including the
detection of phytopathogens by plants' R-Proteins and cell surface receptors.
Many machine learning studies have attempted to predict protein-protein
interactions but performance is highly dependent on training data; models have
been shown to accurately predict interactions when the proteins involved are
included in the training data, but achieve consistently poorer results when
applied to previously unseen proteins. In addition, models that are trained
using proteins that take part in multiple interactions can suffer from
representation bias, where predictions are driven not by learned biological
features but by learning of the structure of the interaction dataset.
We present a method for extracting unique pairs from an interaction dataset,
generating non-redundant paired data for unbiased machine learning. After
applying the method to datasets containing _Arabidopsis thaliana_ and pathogen
effector interactions, we developed a convolutional neural network model capable
of learning and predicting interactions from Chaos Game Representations of
proteins' coding genes.
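For readers unfamiliar with the Chaos Game Representation mentioned above, here is a minimal Python sketch of the standard construction for a nucleotide sequence. The corner assignment (A, C, G, T), the grid resolution `k`, and the function names are assumptions for illustration; the abstract does not state which conventions the authors used.

```python
# One common corner convention for the unit square (an assumption here).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory of a DNA sequence as a list of (x, y) points."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous bases such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the base's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_grid(seq, k=2):
    """Count CGR points in a 2^k x 2^k grid, giving an image-like matrix."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


print(cgr_points("AG"))
print(cgr_grid("ACGTN", k=2))
```

A count matrix of this kind has a fixed size regardless of sequence length, which is what makes it a convenient input for a convolutional network.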
FunFam protein families improve residue level molecular function prediction
BACKGROUND: The CATH database provides a hierarchical classification of protein domain structures including a sub-classification of superfamilies into functional families (FunFams). We analyzed the similarity of binding site annotations in these FunFams and incorporated FunFams into the prediction of protein binding residues. RESULTS: FunFam members agreed, on average, in 36.9 ± 0.6% of their binding residue annotations. This constituted a 6.7-fold increase over randomly grouped proteins and a 1.2-fold increase (1.1-fold on the same dataset) over proteins with the same enzymatic function (identical Enzyme Commission, EC, number). Mapping de novo binding residue prediction methods (BindPredict-CCS, BindPredict-CC) onto FunFam resulted in consensus predictions for those residues that were aligned and predicted alike (binding/non-binding) within a FunFam. This simple consensus increased the F1-score (for binding) 1.5-fold over the original prediction method. Variation of the threshold for how many proteins in the consensus prediction had to agree provided a convenient control of accuracy/precision and coverage/recall, e.g. reaching a precision as high as 60.8 ± 0.4% for a stringent threshold. CONCLUSIONS: The FunFams outperformed even the carefully curated EC numbers in terms of agreement of binding site residues. Additionally, we assume that our proof-of-principle through the prediction of protein binding residues will be relevant for many other solutions profiting from FunFams to infer functional information at the residue level
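The consensus step with an adjustable agreement threshold can be sketched as follows. This is an illustrative vote over aligned columns under assumed conventions (1 for binding, 0 for non-binding, `None` for an alignment gap); the function name `consensus_binding` is hypothetical, and the exact rule used by the authors is not given in the abstract.

```python
def consensus_binding(predictions, threshold=0.5):
    """Combine per-residue binding predictions of aligned family members.

    predictions: list of equal-length lists, one per protein, holding
                 1 (binding), 0 (non-binding) or None (gap) per column.
    threshold:   minimum fraction of non-gap members that must predict
                 binding for the column to be called binding.
    """
    n_cols = len(predictions[0])
    consensus = []
    for col in range(n_cols):
        votes = [p[col] for p in predictions if p[col] is not None]
        if not votes:
            consensus.append(None)  # no member is aligned at this column
            continue
        fraction = sum(votes) / len(votes)
        consensus.append(1 if fraction >= threshold else 0)
    return consensus


# Toy alignment of three hypothetical family members.
preds = [
    [1, 0, 1, None],
    [1, 0, 0, None],
    [1, 1, 0, None],
]
print(consensus_binding(preds, threshold=0.5))
print(consensus_binding(preds, threshold=0.3))
```

Raising `threshold` demands more agreement before a residue is called binding, which trades recall for precision in the way the abstract describes.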
IDPpi: Protein-protein interaction analyses of human intrinsically disordered proteins
Intrinsically disordered proteins (IDPs) are characterized by the lack of a fixed tertiary structure and are involved in the regulation of key biological processes via binding to multiple protein partners. IDPs are malleable, adapting to structurally different partners, and this flexibility stems from features encoded in the primary structure. The assumption that universal sequence information will facilitate coverage of the sparse zones of the human interactome motivated us to explore the possibility of predicting protein-protein interactions (PPIs) that involve IDPs based on sequence characteristics. We developed a method that relies on features of the interacting and non-interacting protein pairs and utilizes machine learning to classify and predict IDP PPIs. Consideration of both sequence determinants specific for conformational organizations and the multiplicity of IDP interactions in the training phase ensured a reliable approach that is superior to current state-of-the-art methods. By applying a strict evaluation procedure, we confirm that our method predicts interactions of the IDP of interest even on the proteome-scale. This service is provided as a web tool to expedite the discovery of new interactions and IDP functions with enhanced efficiency. © 2018 The Author(s)
In silico Prediction and Validations of Domains Involved in Gossypium hirsutum SnRK1 Protein Interaction With Cotton Leaf Curl Multan Betasatellite Encoded βC1
Cotton leaf curl disease (CLCuD), caused by viruses of the genus Begomovirus, is a major constraint to cotton (Gossypium hirsutum) production in many cotton-growing regions of the world. Symptoms of the disease are caused by Cotton leaf curl Multan betasatellite (CLCuMB), which encodes a pathogenicity determinant protein, βC1. Here, we report the identification of interacting regions in the βC1 protein by using computational approaches including sequence recognition, and binding site and interface prediction methods. We show the domain-level interactions based on the structural analysis of G. hirsutum SnRK1 protein and its domains with CLCuMB-βC1. To verify and validate the in silico predictions, three different experimental approaches were used: yeast two-hybrid, bimolecular fluorescence complementation, and pull-down assays. Our results showed that the ubiquitin-associated (UBA) and autoinhibitory sequence (AIS) domains of G. hirsutum-encoded SnRK1 are involved in the CLCuMB-βC1 interaction. This is the first comprehensive investigation that combined in silico interaction prediction with experimental validation of the interaction between CLCuMB-βC1 and a host protein. We demonstrated that data from computational biology could provide binding site information between CLCuD-associated viruses/satellites and new hosts that lack known binding site information for protein–protein interaction studies. Implications of these findings are discussed.