266 research outputs found
A novel structure-based encoding for machine-learning applied to the inference of SH3 domain specificity
MOTIVATION: Unravelling the rules underlying protein-protein and protein-ligand interactions is a crucial step in understanding cell machinery. Peptide recognition modules (PRMs) are globular protein domains which focus their binding targets on short protein sequences and play a key role in the frame of protein-protein interactions. High-throughput techniques permit the whole proteome scanning of each domain, but they are characterized by a high incidence of false positives. In this context, there is a pressing need for the development of in silico experiments to validate experimental results and of computational tools for the inference of domain-peptide interactions. RESULTS: We focused on the SH3 domain family and developed a machine-learning approach for inferring interaction specificity. SH3 domains are well-studied PRMs which typically bind proline-rich short sequences characterized by the PxxP consensus. The binding information is known to be held in the conformation of the domain surface and in the short sequence of the peptide. Our method relies on interaction data from high-throughput techniques and benefits from the integration of sequence and structure data of the interacting partners. Here, we propose a novel encoding technique aimed at representing binding information on the basis of the domain-peptide contact residues in complexes of known structure. Remarkably, the new encoding requires few variables to represent an interaction, thus avoiding the 'curse of dimension'. Our results display an accuracy >90% in detecting new binders of known SH3 domains, thus outperforming neural models on standard binary encodings, profile methods and recent statistical predictors. The method, moreover, shows a generalization capability, inferring specificity of unknown SH3 domains displaying some degree of similarity with the known data
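The PxxP consensus mentioned in this abstract can be illustrated with a short scan for candidate proline-rich cores. This is only a minimal sketch of motif matching, not the structure-based encoding the authors propose; the function name `find_pxxp` and the example sequence are hypothetical.

```python
import re

# Zero-width lookahead so that overlapping PxxP matches are all reported.
# "x" in the consensus stands for any residue, hence the two wildcards.
PXXP = re.compile(r"(?=(P..P))")


def find_pxxp(sequence):
    """Return (start_index, motif) for every PxxP occurrence in a protein sequence."""
    return [(m.start(), m.group(1)) for m in PXXP.finditer(sequence.upper())]


if __name__ == "__main__":
    # Hypothetical peptide, used only to show the call.
    print(find_pxxp("AAPPLPPRAA"))
```

A scan like this only flags candidate sites; deciding which SH3 domain actually binds them is the specificity problem the abstract addresses.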
Bayesian machine learning methods for predicting protein-peptide interactions and detecting mosaic structures in DNA sequences alignments
Short well-defined domains known as peptide recognition modules (PRMs) regulate many important protein-protein interactions involved in the formation of macromolecular complexes
and biochemical pathways. High-throughput experiments like yeast two-hybrid and phage
display are expensive and intrinsically noisy, so it would be desirable to target informative interactions and pursue in silico approaches. We propose a probabilistic discriminative
approach for predicting PRM-mediated protein-protein interactions from sequence data. The
model suffered from over-fitting, so Laplacian regularisation was found to be important in
achieving a reasonable generalisation performance. A hybrid approach yielded the best performance, where the binding site motifs were initialised with the predictions of a generative
model. We also propose another discriminative model which can be applied to all sequences
present in the organism at a significantly lower computational cost. This is due to its additional
assumption that the underlying binding sites tend to be similar. It is difficult to distinguish between the binding site motifs of the PRM due to the small
number of instances of each binding site motif. However, closely related species are expected
to share similar binding sites, which would be expected to be highly conserved. We investigated
rate variation along DNA sequence alignments, modelling confounding effects such as recombination. Traditional approaches to phylogenetic inference assume that a single phylogenetic
tree can represent the relationships and divergences between the taxa. However, taxa sequences
exhibit varying levels of conservation, e.g. due to regulatory elements and active binding sites,
and certain bacteria and viruses undergo interspecific recombination. We propose a phylogenetic factorial hidden Markov model to infer recombination and rate variation. We examined
the performance of our model and inference scheme on various synthetic alignments, and compared it to state of the art breakpoint models. We investigated three DNA sequence alignments:
one of maize actin genes, one bacterial (Neisseria), and the other of HIV-1. Inference is carried
out in the Bayesian framework, using Reversible Jump Markov Chain Monte Carlo
Satisfiability, sequence niches, and molecular codes in cellular signaling
Biological information processing as implemented by regulatory and signaling
networks in living cells requires sufficient specificity of molecular
interaction to distinguish signals from one another, but much of regulation and
signaling involves somewhat fuzzy and promiscuous recognition of molecular
sequences and structures, which can leave systems vulnerable to crosstalk. This
paper examines a simple computational model of protein-protein interactions
which reveals both a sharp onset of crosstalk and a fragmentation of the
neutral network of viable solutions as more proteins compete for regions of
sequence space, revealing intrinsic limits to reliable signaling in the face of
promiscuity. These results suggest connections to both phase transitions in
constraint satisfaction problems and coding theory bounds on the size of
communication codes
Using genome-wide measurements for computational prediction of SH2–peptide interactions
Peptide-recognition modules (PRMs) are used throughout biology to mediate protein–protein interactions, and many PRMs are members of large protein domain families. Recent genome-wide measurements describe networks of peptide–PRM interactions. In these networks, very similar PRMs recognize distinct sets of peptides, raising the question of how peptide-recognition specificity is achieved using similar protein domains. The analysis of individual protein complex structures often gives answers that are not easily applicable to other members of the same PRM family. Bioinformatics-based approaches, on the other hand, may be difficult to interpret physically. Here we integrate structural information with a large, quantitative data set of SH2 domain–peptide interactions to study the physical origin of domain–peptide specificity. We develop an energy model, inspired by protein folding, based on interactions between the amino-acid positions in the domain and peptide. We use this model to successfully predict which SH2 domains and peptides interact and uncover the positions in each that are important for specificity. The energy model is general enough that it can be applied to other members of the SH2 family or to new peptides, and the cross-validation results suggest that these energy calculations will be useful for predicting binding interactions. It can also be adapted to study other PRM families, predict optimal peptides for a given SH2 domain, or study other biological interactions, e.g. protein–DNA interactions. Funding: National Institutes of Health, National Centers for Biomedical Computing (Informatics for Integrating Biology and the Bedside); National Institutes of Health (U.S.) (grant U54LM008748)
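An energy model "based on interactions between the amino-acid positions in the domain and peptide" can be pictured as a sum of pairwise terms over contacting positions. The sketch below assumes that additive form, which the abstract does not spell out; the contact list, the energy table, the threshold, and the names `binding_energy` and `predicts_binding` are all hypothetical.

```python
def binding_energy(domain_seq, peptide_seq, contacts, pair_energy, default=0.0):
    """Sum pairwise energies over contacting (domain position, peptide position) pairs.

    contacts:    iterable of (i, j) index pairs assumed to be in contact.
    pair_energy: dict mapping (i, j, domain_residue, peptide_residue) to an energy;
                 pairs missing from the table contribute `default`.
    """
    total = 0.0
    for i, j in contacts:
        key = (i, j, domain_seq[i], peptide_seq[j])
        total += pair_energy.get(key, default)
    return total


def predicts_binding(domain_seq, peptide_seq, contacts, pair_energy, threshold=0.0):
    """Call a pair 'binding' when its total energy falls below a chosen threshold."""
    return binding_energy(domain_seq, peptide_seq, contacts, pair_energy) < threshold


if __name__ == "__main__":
    # Toy values for illustration only; not fitted to any data.
    contacts = [(0, 1), (2, 0)]
    pair_energy = {(0, 1, "R", "Y"): -1.5, (2, 0, "K", "E"): -0.5}
    print(binding_energy("RAK", "EY", contacts, pair_energy))
    print(predicts_binding("RAK", "EY", contacts, pair_energy))
```

In a model of this shape, the positions whose energy terms vary most across residue pairs would be the ones contributing most to specificity.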
Phosphoproteomics Analyses to Identify the Candidate Substrates and Signaling Intermediates of the Non-Receptor Tyrosine Kinase, SRMS
SRMS (Src-related kinase lacking C-terminal regulatory tyrosine and N-terminal myristoylation sites) is a non-receptor tyrosine kinase that belongs to the BRK family kinases (BFKs) and is evolutionarily related to the Src family kinases (SFKs). Like SFKs and BFKs, the SRMS protein comprises two domains involved in protein-protein interactions, namely, the Src-homology 3 domain (SH3) and Src-homology 2 domain (SH2), and one catalytic kinase domain. Unlike members of the BFKs and SFKs, the biochemical and cellular role of SRMS is poorly understood, primarily due to the lack of information on the substrates and signaling intermediates regulated by the kinase. Previous biochemical studies have shown that wild type SRMS is enzymatically active and leads to the tyrosine-phosphorylation of several proteins when expressed exogenously in mammalian cells. These tyrosine-phosphorylated proteins represent the candidate cellular substrates of SRMS, which are largely unknown. Further, previous studies have determined that the SRMS protein displays a characteristic punctate cytoplasmic localization pattern in mammalian cells. These SRMS cytoplasmic puncta are uncharacterized and may provide insights into the biochemical and cellular role of the kinase.
Here, we utilized mass spectrometry-based quantitative label-free phosphoproteomics to identify (a) the candidate SRMS cellular substrates and (b) candidate signaling intermediates regulated by SRMS, in HEK293 cells expressing ectopic SRMS. Specifically, using a phosphotyrosine enrichment strategy we identified 663 candidate SRMS substrates and consensus substrate-motifs of SRMS. We used customized peptide arrays and performed the high-throughput validation of a subset of the identified candidate SRMS substrates. Further, we independently validated Vimentin and Sam68 as bona fide SRMS substrates. Next, using titanium dioxide (TiO2)-based phosphopeptide enrichment columns, we identified multiple signaling intermediates of SRMS. Functional gene enrichment analyses revealed several common and unique cellular processes regulated by the candidate SRMS substrates and signaling intermediates. Overall, these studies led to the identification of a significant number of novel and biologically relevant SRMS candidate substrates and signaling intermediates, which mapped to a number of cellular and biological processes primarily involved in cell cycle regulation, apoptosis, RNA processing, DNA repair and protein synthesis. These findings provide an important resource for future mechanistic studies to investigate the cellular and physiological functions of SRMS.
Studies towards characterizing the SRMS cytoplasmic puncta showed that the SRMS punctate structures do not colocalize with some of the major cellular organelles investigated, such as the mitochondria, endoplasmic reticulum, Golgi bodies and lysosomes. However, studies investigating the involvement of the SRMS domains in puncta-localization revealed that the SRMS SH2 domain partly regulates this localization pattern. These results highlight the potential role of the SRMS SH2 domain in the localization of SRMS to these cytoplasmic sites and lay important groundwork for future characterization studies
Caretta – A multiple protein structure alignment and feature extraction suite
The vast number of protein structures currently available opens exciting opportunities for machine learning on proteins, aimed at predicting and understanding functional properties. In particular, in combination with homology modelling, it is now possible to not only use sequence features as input for machine learning, but also structure features. However, in order to do so, robust multiple structure alignments are imperative. Here we present Caretta, a multiple structure alignment suite meant for homologous but sequentially divergent protein families which consistently returns accurate alignments with a higher coverage than current state-of-the-art tools. Caretta is available as a GUI and command-line application and additionally outputs an aligned structure feature matrix for a given set of input structures, which can readily be used in downstream steps for supervised or unsupervised machine learning. We show Caretta's performance on two benchmark datasets, and present an example application of Caretta in predicting the conformational state of cyclin-dependent kinases.
Quantitative Approaches to the Genomics of Clonal Evolution
Many problems in the biological sciences reduce to questions of genetic evolution. Entire classes of medical pathology, such as malignant neoplasia or infectious disease, can be viewed in the light of Darwinian competition of genomes. With the benefit of today's maturing sequencing technologies we can observe and quantify genetic evolution with nucleotide resolution. This provides a molecular view of genetic material that has adapted, or is in the process of adapting, to its local selection pressures. A series of problems will be discussed in this thesis, all involving the mathematical modeling of genomic data derived from clonally evolving populations. We use a variety of computational approaches to characterize over-represented features in the data, with the underlying hypothesis that we may be detecting fitness-conferring features of the biology.
In Part I we consider the cross-sectional sampling of human tumors via RNA-sequencing, and devise computational pipelines for detecting oncogenic gene fusions and oncovirus infections. Genomic translocation and oncovirus infection can each be a highly penetrant alteration in a tumor's evolutionary history, with famous examples of both populating the cancer biology literature. In order to exert a transforming influence over the host cell, gene fusions and viral genetic programs need to be expressed and thus can be detected via whole transcriptome sequencing of a malignant cell population. We describe our approaches to predicting oncogenic gene fusions (Chapter 2) and quantifying host-viral interactions (Chapter 3) in large panels of human tumor tissue. The alterations that we characterize prompt the larger question of how the genetics of tumors and viruses might vary in time, leading us to the study of serially sampled populations.
In Part II we consider longitudinal sampling of a clonally evolving population. Phylogenetic trees are the standard representation of a clonal process, an evolutionary picture as old as Darwin's voyages on the Beagle. Chapter 4 first reviews phylogenetic inference and then introduces a certain phylogenetic tree space that forms the starting point of our work on the topic. Specifically, Chapter 4 describes the construction of our projective tree space along with an explicit implementation for visualizing point clouds of rescaled trees. The Chapter finishes by defining a method for stable dimensionality reduction of large phylogenies, which is useful for analyzing long genomic time series. In Chapter 5 we consider medically relevant instances of clonal evolution and the longitudinal genetic data sets to which they give rise. We analyze data from (i) the sequencing of cancers along their therapeutic course, (ii) the passaging of a xenografted tumor through a mouse model, and (iii) the seasonal surveillance of H3N2 influenza's hemagglutinin segment. A novel approach to predicting influenza vaccine effectiveness is demonstrated using statistics of point clouds in tree spaces.
Our investigations into clonal processes may be extended beyond naturally occurring genomes. In Part III we focus on the directed clonal evolution of populations of synthetic RNAs in vitro. Analogous to the selection pressures exerted upon malignant cells or viral particles, these synthetic RNA genomes can be evolved against a desired fitness objective. We investigate fitness objectives related to reprogramming ribosomal translation. Chapter 6 identifies high fitness RNA pseudoknot geometries capable of inducing ribosomal frameshift, while Chapter 7 takes an unbiased approach to evolving sequence and structural elements that promote stop codon readthrough
Computational identification of new structured cis-regulatory elements in the 3'-untranslated region of human protein coding genes
Messenger ribonucleic acids (RNAs) contain a large number of cis-regulatory RNA elements that function in many types of post-transcriptional regulation. These cis-regulatory elements are often characterized by conserved structures and/or sequences. Although some classes are well known, given the wide range of RNA-interacting proteins in eukaryotes, it is likely that many new classes of cis-regulatory elements are yet to be discovered. An approach to this is to use computational methods that have the advantage of analysing genomic data, particularly comparative data on a large scale. In this study, a set of structural discovery algorithms was applied followed by support vector machine (SVM) classification. We trained a new classification model (CisRNA-SVM) on a set of known structured cis-regulatory elements from 3′-untranslated regions (UTRs) and successfully distinguished these, and groups of cis-regulatory elements it had not been trained on, from control genomic and shuffled sequences. The new method outperformed previous methods in classification of cis-regulatory RNA elements. This model was then used to predict new elements from cross-species conserved regions of human 3′-UTRs. Clustering of these elements identified new classes of potential cis-regulatory elements. The model, training and testing sets and novel human predictions are available at: http://mRNA.otago.ac.nz/CisRNA-SVM