266 research outputs found

    A novel structure-based encoding for machine-learning applied to the inference of SH3 domain specificity

    MOTIVATION: Unravelling the rules underlying protein-protein and protein-ligand interactions is a crucial step in understanding cell machinery. Peptide recognition modules (PRMs) are globular protein domains which focus their binding targets on short protein sequences and play a key role in the frame of protein-protein interactions. High-throughput techniques permit the whole proteome scanning of each domain, but they are characterized by a high incidence of false positives. In this context, there is a pressing need for the development of in silico experiments to validate experimental results and of computational tools for the inference of domain-peptide interactions. RESULTS: We focused on the SH3 domain family and developed a machine-learning approach for inferring interaction specificity. SH3 domains are well-studied PRMs which typically bind proline-rich short sequences characterized by the PxxP consensus. The binding information is known to be held in the conformation of the domain surface and in the short sequence of the peptide. Our method relies on interaction data from high-throughput techniques and benefits from the integration of sequence and structure data of the interacting partners. Here, we propose a novel encoding technique aimed at representing binding information on the basis of the domain-peptide contact residues in complexes of known structure. Remarkably, the new encoding requires few variables to represent an interaction, thus avoiding the 'curse of dimensionality'. Our results display an accuracy >90% in detecting new binders of known SH3 domains, thus outperforming neural models on standard binary encodings, profile methods and recent statistical predictors. The method, moreover, shows a generalization capability, inferring specificity of unknown SH3 domains displaying some degree of similarity with the known data.
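
The idea of encoding only contact residues can be illustrated with a minimal Python sketch. It is not the authors' encoding: the contact list, the positions in it, and the one-hot scheme are all assumptions made for illustration. It only shows why restricting the features to residue pairs in contact keeps the number of variables small compared with a binary encoding of the full domain and peptide sequences.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical contact list: (domain position, peptide position) pairs assumed
# to be in contact in a reference complex. The numbers are made up.
CONTACTS = [(8, 0), (10, 3), (49, 3), (52, 6)]


def one_hot(residue):
    """Return a 20-element binary vector marking the given amino acid."""
    vec = [0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(residue)] = 1
    return vec


def encode_interaction(domain_seq, peptide_seq, contacts=CONTACTS):
    """Concatenate one-hot codes of the residues found at each contact pair."""
    features = []
    for d_pos, p_pos in contacts:
        features.extend(one_hot(domain_seq[d_pos]))
        features.extend(one_hot(peptide_seq[p_pos]))
    return features
```

With four assumed contacts this gives 4 x 2 x 20 = 160 binary variables, whereas a one-hot encoding of every residue in a 60-residue domain and a 10-residue peptide would need 1400.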

    Bayesian machine learning methods for predicting protein-peptide interactions and detecting mosaic structures in DNA sequences alignments

    Short well-defined domains known as peptide recognition modules (PRMs) regulate many important protein-protein interactions involved in the formation of macromolecular complexes and biochemical pathways. High-throughput experiments like yeast two-hybrid and phage display are expensive and intrinsically noisy, therefore it would be desirable to target informative interactions and pursue in silico approaches. We propose a probabilistic discriminative approach for predicting PRM-mediated protein-protein interactions from sequence data. The model suffered from over-fitting, so Laplacian regularisation was found to be important in achieving a reasonable generalisation performance. A hybrid approach yielded the best performance, where the binding site motifs were initialised with the predictions of a generative model. We also propose another discriminative model which can be applied to all sequences present in the organism at a significantly lower computational cost. This is due to its additional assumption that the underlying binding sites tend to be similar. It is difficult to distinguish between the binding site motifs of the PRM due to the small number of instances of each binding site motif. However, closely related species are expected to share similar binding sites, which would be expected to be highly conserved. We investigated rate variation along DNA sequence alignments, modelling confounding effects such as recombination. Traditional approaches to phylogenetic inference assume that a single phylogenetic tree can represent the relationships and divergences between the taxa. However, taxa sequences exhibit varying levels of conservation, e.g. due to regulatory elements and active binding sites, and certain bacteria and viruses undergo interspecific recombination. We propose a phylogenetic factorial hidden Markov model to infer recombination and rate variation. We examined the performance of our model and inference scheme on various synthetic alignments, and compared it to state-of-the-art breakpoint models. We investigated three DNA sequence alignments: one of maize actin genes, one bacterial (Neisseria), and the other of HIV-1. Inference is carried out in the Bayesian framework, using Reversible Jump Markov Chain Monte Carlo.

    Satisfiability, sequence niches, and molecular codes in cellular signaling

    Biological information processing as implemented by regulatory and signaling networks in living cells requires sufficient specificity of molecular interaction to distinguish signals from one another, but much of regulation and signaling involves somewhat fuzzy and promiscuous recognition of molecular sequences and structures, which can leave systems vulnerable to crosstalk. This paper examines a simple computational model of protein-protein interactions which reveals both a sharp onset of crosstalk and a fragmentation of the neutral network of viable solutions as more proteins compete for regions of sequence space, revealing intrinsic limits to reliable signaling in the face of promiscuity. These results suggest connections to both phase transitions in constraint satisfaction problems and coding theory bounds on the size of communication codes.

    Using genome-wide measurements for computational prediction of SH2–peptide interactions

    Peptide-recognition modules (PRMs) are used throughout biology to mediate protein–protein interactions, and many PRMs are members of large protein domain families. Recent genome-wide measurements describe networks of peptide–PRM interactions. In these networks, very similar PRMs recognize distinct sets of peptides, raising the question of how peptide-recognition specificity is achieved using similar protein domains. The analysis of individual protein complex structures often gives answers that are not easily applicable to other members of the same PRM family. Bioinformatics-based approaches, on the other hand, may be difficult to interpret physically. Here we integrate structural information with a large, quantitative data set of SH2 domain–peptide interactions to study the physical origin of domain–peptide specificity. We develop an energy model, inspired by protein folding, based on interactions between the amino-acid positions in the domain and peptide. We use this model to successfully predict which SH2 domains and peptides interact and uncover the positions in each that are important for specificity. The energy model is general enough that it can be applied to other members of the SH2 family or to new peptides, and the cross-validation results suggest that these energy calculations will be useful for predicting binding interactions. It can also be adapted to study other PRM families, predict optimal peptides for a given SH2 domain, or study other biological interactions, e.g. protein–DNA interactions. Funding: National Institutes of Health, National Centers for Biomedical Computing (Informatics for Integrating Biology and the Bedside); National Institutes of Health (U.S.) (grant U54LM008748).
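
A position-pair energy model of the kind described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the paper's model: the additive form, the `pair_energies` layout, the zero default for unlisted residue pairs, and the binding threshold are all hypothetical, and the fitting of the energies to interaction data is left out.

```python
def interaction_energy(domain_seq, peptide_seq, pair_energies):
    """Sum pair energies over all parameterised (domain, peptide) position pairs.

    pair_energies maps (domain_pos, peptide_pos) to a table
    {(domain_residue, peptide_residue): energy}. Residue pairs that are
    missing from a table contribute 0 (an assumption of this sketch).
    """
    total = 0.0
    for (d_pos, p_pos), table in pair_energies.items():
        residue_pair = (domain_seq[d_pos], peptide_seq[p_pos])
        total += table.get(residue_pair, 0.0)
    return total


def predicts_binding(domain_seq, peptide_seq, pair_energies, threshold=0.0):
    """Call an interaction when the summed energy falls below the threshold."""
    return interaction_energy(domain_seq, peptide_seq, pair_energies) < threshold


# Toy parameters with made-up values, for illustration only.
toy_energies = {
    (0, 0): {("R", "E"): -1.5},
    (2, 1): {("L", "P"): -0.5, ("L", "G"): 0.7},
}
print(interaction_energy("RAL", "EP", toy_energies))
print(predicts_binding("RAL", "EG", toy_energies, threshold=-1.0))
```

Because the total is a sum over position pairs, the pairs with the largest spread of energies across residues are the natural candidates for the positions that matter most for specificity.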

    Phosphoproteomics Analyses to Identify the Candidate Substrates and Signaling Intermediates of the Non-Receptor Tyrosine Kinase, SRMS

    SRMS (Src-related kinase lacking C-terminal regulatory tyrosine and N-terminal myristoylation sites) is a non-receptor tyrosine kinase that belongs to the BRK family kinases (BFKs) and is evolutionarily related to the Src family kinases (SFKs). Like SFKs and BFKs, the SRMS protein comprises two domains involved in protein-protein interactions, namely, the Src-homology 3 domain (SH3) and Src-homology 2 domain (SH2), and one catalytic kinase domain. Unlike members of the BFKs and SFKs, the biochemical and cellular role of SRMS is poorly understood, primarily due to the lack of information on the substrates and signaling intermediates regulated by the kinase. Previous biochemical studies have shown that wild type SRMS is enzymatically active and leads to the tyrosine-phosphorylation of several proteins when expressed exogenously in mammalian cells. These tyrosine-phosphorylated proteins represent the candidate cellular substrates of SRMS, which are largely unknown. Further, previous studies have determined that the SRMS protein displays a characteristic punctate cytoplasmic localization pattern in mammalian cells. These SRMS cytoplasmic puncta are uncharacterized and may provide insights into the biochemical and cellular role of the kinase. Here, we utilized mass spectrometry-based quantitative label-free phosphoproteomics to identify (a) the candidate SRMS cellular substrates and (b) candidate signaling intermediates regulated by SRMS, in HEK293 cells expressing ectopic SRMS. Specifically, using a phosphotyrosine enrichment strategy we identified 663 candidate SRMS substrates and consensus substrate-motifs of SRMS. We used customized peptide arrays and performed high-throughput validation of a subset of the identified candidate SRMS substrates. Further, we independently validated Vimentin and Sam68 as bona fide SRMS substrates. Next, using titanium dioxide (TiO2)-based phosphopeptide enrichment columns, we identified multiple signaling intermediates of SRMS. Functional gene enrichment analyses revealed several common and unique cellular processes regulated by the candidate SRMS substrates and signaling intermediates. Overall, these studies led to the identification of a significant number of novel and biologically relevant SRMS candidate substrates and signaling intermediates, which mapped to a number of cellular and biological processes primarily involved in cell cycle regulation, apoptosis, RNA processing, DNA repair and protein synthesis. These findings provide an important resource for future mechanistic studies to investigate the cellular and physiological functions of SRMS. Studies towards characterizing the SRMS cytoplasmic puncta showed that the SRMS punctate structures do not colocalize with some of the major cellular organelles investigated, such as the mitochondria, endoplasmic reticulum, Golgi bodies and lysosomes. However, studies investigating the involvement of the SRMS domains in puncta-localization revealed that the SRMS SH2 domain partly regulates this localization pattern. These results highlight the potential role of the SRMS SH2 domain in the localization of SRMS to these cytoplasmic sites and lay important groundwork for future characterization studies.

    Caretta – A multiple protein structure alignment and feature extraction suite

    The vast number of protein structures currently available opens exciting opportunities for machine learning on proteins, aimed at predicting and understanding functional properties. In particular, in combination with homology modelling, it is now possible to not only use sequence features as input for machine learning, but also structure features. However, in order to do so, robust multiple structure alignments are imperative. Here we present Caretta, a multiple structure alignment suite meant for homologous but sequentially divergent protein families which consistently returns accurate alignments with a higher coverage than current state-of-the-art tools. Caretta is available as a GUI and command-line application and additionally outputs an aligned structure feature matrix for a given set of input structures, which can readily be used in downstream steps for supervised or unsupervised machine learning. We show Caretta's performance on two benchmark datasets, and present an example application of Caretta in predicting the conformational state of cyclin-dependent kinases.

    Interpretability-oriented data-driven modelling of bladder cancer via computational intelligence


    Computational identification of new structured cis-regulatory elements in the 3'-untranslated region of human protein coding genes

    Messenger ribonucleic acids (mRNAs) contain a large number of cis-regulatory RNA elements that function in many types of post-transcriptional regulation. These cis-regulatory elements are often characterized by conserved structures and/or sequences. Although some classes are well known, given the wide range of RNA-interacting proteins in eukaryotes, it is likely that many new classes of cis-regulatory elements are yet to be discovered. One approach is to use computational methods, which have the advantage of analysing genomic data, particularly comparative data, on a large scale. In this study, a set of structural discovery algorithms was applied, followed by support vector machine (SVM) classification. We trained a new classification model (CisRNA-SVM) on a set of known structured cis-regulatory elements from 3′-untranslated regions (UTRs) and successfully distinguished these, and groups of cis-regulatory elements that the model had not been trained on, from control genomic and shuffled sequences. The new method outperformed previous methods in classification of cis-regulatory RNA elements. This model was then used to predict new elements from cross-species conserved regions of human 3′-UTRs. Clustering of these elements identified new classes of potential cis-regulatory elements. The model, training and testing sets and novel human predictions are available at: http://mRNA.otago.ac.nz/CisRNA-SVM
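
Shuffled control sequences like those mentioned above can be generated with a short Python sketch. The abstract does not say how the shuffling was done, so this assumes the simplest option: a random permutation that preserves each sequence's base composition. The function names and the seeding scheme are hypothetical, and a study of structured RNA might instead use a shuffle that also preserves dinucleotide frequencies.

```python
import random


def shuffled_control(sequence, seed=None):
    """Return a random permutation of `sequence` with the same base composition."""
    rng = random.Random(seed)
    bases = list(sequence)
    rng.shuffle(bases)
    return "".join(bases)


def make_controls(sequences, n_per_sequence=1, seed=0):
    """Build a reproducible set of shuffled negatives for a list of sequences."""
    rng = random.Random(seed)
    controls = []
    for seq in sequences:
        for _ in range(n_per_sequence):
            # Draw a fresh integer seed so each control is shuffled independently.
            controls.append(shuffled_control(seq, seed=rng.getrandbits(32)))
    return controls
```

A classifier trained to separate real elements from such controls cannot rely on base composition alone, since each control matches its source sequence in composition and length.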