
    A novel structure-based encoding for machine-learning applied to the inference of SH3 domain specificity

    MOTIVATION: Unravelling the rules underlying protein-protein and protein-ligand interactions is a crucial step in understanding cell machinery. Peptide recognition modules (PRMs) are globular protein domains that bind short protein sequences and play a key role in protein-protein interactions. High-throughput techniques permit whole-proteome scanning for each domain, but they are characterized by a high incidence of false positives. In this context, there is a pressing need for in silico experiments to validate experimental results and for computational tools to infer domain-peptide interactions. RESULTS: We focused on the SH3 domain family and developed a machine-learning approach for inferring interaction specificity. SH3 domains are well-studied PRMs which typically bind short proline-rich sequences characterized by the PxxP consensus. The binding information is known to be held in the conformation of the domain surface and in the short sequence of the peptide. Our method relies on interaction data from high-throughput techniques and benefits from the integration of sequence and structure data of the interacting partners. Here, we propose a novel encoding technique aimed at representing binding information on the basis of the domain-peptide contact residues in complexes of known structure. Remarkably, the new encoding requires few variables to represent an interaction, thus avoiding the 'curse of dimensionality'. Our results display an accuracy >90% in detecting new binders of known SH3 domains, thus outperforming neural models on standard binary encodings, profile methods and recent statistical predictors. The method, moreover, shows a generalization capability, inferring the specificity of unknown SH3 domains that display some degree of similarity with the known data.
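
The PxxP consensus mentioned in this abstract can be illustrated with a short scan over a peptide sequence. This is only a minimal sketch of motif matching, not the structure-based encoding the authors propose; the function name `find_pxxp` and the example peptide in the usage note are hypothetical.

```python
import re

# A lookahead is used so that overlapping PxxP occurrences are all reported.
# "P..P" means: proline, any two residues, proline.
PXXP = re.compile(r"(?=(P..P))")


def find_pxxp(peptide):
    """Return (0-based start, matched 4-mer) for every PxxP occurrence."""
    return [(m.start(), m.group(1)) for m in PXXP.finditer(peptide.upper())]
```

For an illustrative peptide such as `"APPLPPRNRP"`, the call `find_pxxp("APPLPPRNRP")` reports the two overlapping matches starting at positions 1 and 2. A plain motif scan like this says nothing about binding specificity, which is the problem the abstract's encoding addresses.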

    An integrated bioinformatics and computational biology approach identifies new BH3-only protein candidates

    FoxD4L1 is a forkhead transcription factor that expands the neural ectoderm by down-regulating genes that promote the onset of neural differentiation and up-regulating genes that maintain proliferative neural precursors in an immature state. We previously demonstrated that binding of Grg4 to an Eh-1 motif enhances the ability of FoxD4L1 to down-regulate target neural genes but does not account for all of its repressive activity. Herein we analyzed the protein sequence for additional interaction motifs and secondary structure. Eight conserved motifs were identified in the C-terminal region of fish and frog proteins. Extending the analysis to mammals identified a high scoring motif downstream of the Eh-1 domain that contains a tryptophan residue implicated in protein-protein interactions. In addition, secondary structure prediction programs predicted an α-helical structure overlapping with amphibian-specific Motif 6 in Xenopus, and similarly located α-helical structures in other vertebrate FoxD proteins. We tested functionality of this site by inducing a glutamine-to-proline substitution expected to break the predicted α-helical structure; this significantly reduced FoxD4L1’s ability to repress zic3 and irx1. Because this mutation does not interfere with Grg4 binding, these results demonstrate that at least two regions, the Eh-1 motif and a more C-terminal predicted α-helical/Motif 6 site, additively contribute to repression. In the N-terminal region we previously identified a 14 amino acid motif that is required for the up-regulation of target genes. Secondary structure prediction programs predicted a short β-strand separating two acidic domains. Mutant constructs show that the β-strand itself is not required for transcriptional activation. Instead, activation depends upon a glycine residue that is predicted to provide sufficient flexibility to bring the two acidic domains into close proximity. These results identify conserved predicted motifs with secondary structures that enable FoxD4L1 to carry out its essential functions as both a transcriptional repressor and activator of neural genes.

    Predicting domain-domain interaction based on domain profiles with feature selection and support vector machines

    BACKGROUND: Protein-protein interaction (PPI) plays essential roles in cellular functions. The cost, time and other limitations associated with the current experimental methods have motivated the development of computational methods for predicting PPIs. As protein interactions generally occur via domains instead of the whole molecules, predicting domain-domain interaction (DDI) is an important step toward PPI prediction. Computational methods developed so far have utilized information from various sources at different levels, from primary sequences, to molecular structures, to evolutionary profiles. RESULTS: In this paper, we propose a computational method to predict DDI using support vector machines (SVMs), based on domains represented as interaction profile hidden Markov models (ipHMM) where interacting residues in domains are explicitly modeled according to the three dimensional structural information available at the Protein Data Bank (PDB). Features about the domains are extracted first as the Fisher scores derived from the ipHMM and then selected using singular value decomposition (SVD). Domain pairs are represented by concatenating their selected feature vectors, and classified by a support vector machine trained on these feature vectors. The method is tested by leave-one-out cross validation experiments with a set of interacting protein pairs adopted from the 3DID database. The prediction accuracy has shown significant improvement as compared to InterPreTS (Interaction Prediction through Tertiary Structure), an existing method for PPI prediction that also uses the sequences and complexes of known 3D structure. CONCLUSIONS: We show that domain-domain interaction prediction can be significantly enhanced by exploiting information inherent in the domain profiles via feature selection based on Fisher scores, singular value decomposition and supervised learning based on support vector machines. Datasets and source code are freely available on the web at http://liao.cis.udel.edu/pub/svdsvm. Implemented in Matlab and supported on Linux and MS Windows.
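
The two evaluation ideas in this abstract, representing a domain pair by concatenating the two domains' feature vectors and scoring by leave-one-out cross validation, can be sketched in a few lines. This is a hedged illustration only: a 1-nearest-neighbour rule stands in for the SVM, no ipHMM, Fisher-score or SVD step is performed, and all function names are hypothetical.

```python
import math


def concat_pair(feat_a, feat_b):
    """Represent a domain pair by concatenating the two feature vectors."""
    return list(feat_a) + list(feat_b)


def nearest_neighbour_label(train, query):
    """Stand-in classifier (NOT an SVM): label of the closest training vector."""
    closest = min(train, key=lambda item: math.dist(item[0], query))
    return closest[1]


def leave_one_out_accuracy(samples):
    """Hold out each (vector, label) sample once and classify it from the rest."""
    correct = 0
    for i, (vec, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if nearest_neighbour_label(rest, vec) == label:
            correct += 1
    return correct / len(samples)
```

With `n` labelled pairs this performs `n` train-and-test rounds, which is affordable for small structural datasets but grows expensive when each round retrains a heavier model.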

    A NEW METHODOLOGY FOR IDENTIFYING INTERFACE RESIDUES INVOLVED IN BINDING PROTEIN COMPLEXES

    Genome-sequencing projects with advanced technologies have rapidly increased the number of known protein sequences, and the demand for identifying protein interaction sites has increased significantly because of its impact on understanding cellular processes, biochemical events and drug design studies. However, the capacity of current wet laboratory techniques is not enough to handle the exponentially growing protein sequence data; therefore, sequence-based predictive methods for identifying protein interaction sites have drawn increasing interest. In this article, we propose a new predictive model that can be valuable as a first approach for guiding experimental methods that investigate protein-protein interactions and localize the specific interface residues. The proposed method extracts a wide range of features from protein sequences. The random forests framework is redesigned to utilize these features effectively and to address the problems of imbalanced data classification commonly encountered in binding site predictions. The method is evaluated with 2,829 interface residues and 24,616 non-interface residues extracted from 99 polypeptide chains in the Protein Data Bank. The experimental results show that the proposed method performs significantly better than two other conventional predictive methods and can reliably predict residues involved in protein interaction sites. As blind tests, the proposed method predicts interaction sites and constructs three protein complexes: the DnaK molecular chaperone system, 1YUW and 1DKG, which provide new insight into the sequence-function relationship. Finally, the robustness of the proposed method is assessed by evaluating the performances obtained from four different ensemble methods.
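
The class imbalance described in this abstract can be made concrete with the residue counts it reports. The sketch below computes inverse-frequency class weights, one common way to compensate for imbalance; it is an assumption for illustration and not the redesigned random forests scheme the authors describe. The function name and dictionary keys are hypothetical.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Residue counts taken from the abstract above.
counts = {"interface": 2829, "non_interface": 24616}
weights = balanced_class_weights(counts)

# Non-interface residues per interface residue (about 8.7 for these counts).
imbalance_ratio = counts["non_interface"] / counts["interface"]
```

Under this weighting each class contributes the same total weight, so a classifier that labels every residue as non-interface no longer looks accurate by default.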

    Computational Algorithms for Predicting Membrane Protein Assembly From Angstrom to Micron Scale

    Biological barriers in the human body are among the most crucial interfaces perfected through evolution for diverse and unique functions. Of the wide range of barriers, the paracellular protein interfaces of epithelial and endothelial cells called tight junctions, with their high molecular specificities, are vital for homeostasis and for maintaining proper health. While the breakdown of these barriers is associated with serious pathological consequences, their intact presence also poses a challenge to effective delivery of therapeutic drugs. Complementing a rigorous combination of in vitro and in vivo approaches to establishing the fundamental biological construct, in addition to elucidating pathological implications and pharmaceutical interests, a systematic in silico approach is undertaken in this work in order to complete the molecular puzzle of the tight junctions. This work presents a bottom-up approach involving a careful consideration of protein interactions with Angstrom-level details integrated systematically, based on the principles of statistical thermodynamics and probabilities and designed using well-structured computational algorithms, up to micron-level molecular architecture of tight junctions, forming a robust prediction with molecular details packed for up to four orders of magnitude in length scale. This work is intended to bridge the gap between the computational nano-scale studies and the experimental micron-scale observations and provide a molecular explanation for cellular behaviors in the maintenance of these barriers and for the adverse consequences of their breakdown. Furthermore, a comprehensive understanding of tight junctions shall enable development of safe strategies for enhanced delivery of therapeutics.

    ROLE OF SULFIREDOXIN INTERACTING PROTEINS IN LUNG CANCER DEVELOPMENT

    Sulfiredoxin (Srx) is an antioxidant enzyme that can be induced by oxidative stress. It promotes oncogenic phenotypes of cell proliferation, colony formation, migration, and metastasis in lung, skin and colon cancers. Srx reduces the overoxidation of 2-cysteine peroxiredoxins in cells, in addition to its role of removing glutathione modification from several proteins. In this study, I explored additional physiological functions of Srx in lung cancer through studying its interacting proteins. Protein disulfide isomerase (PDI) family members, thioredoxin domain containing protein 5 (TXNDC5) and protein disulfide isomerase family A member 6 (PDIA6), were detected to interact with Srx. Therefore, I proposed that TXNDC5 and PDIA6 are important for the oncogenic phenotypes of Srx in lung cancer. In chapter one, I presented background information about the role of Srx as an antioxidant enzyme in cancer. I also explained the functional significance of PDIs as oxidoreductases and chaperones in cells. In chapter two, I verified the Srx-TXNDC5/PDIA6 interaction in HEK293T and A549 cells by co-immunoprecipitation and other assays. In TXNDC5 and PDIA6, the N-terminal thioredoxin-like domain (D1) was determined to be the main platform for interaction with Srx. The Srx-TXNDC5 interaction was enhanced by H2O2 treatment in A549 cells. Srx was determined to localize in the endoplasmic reticulum (ER) of A549 cells along with TXNDC5 and PDIA6. This localization was confirmed by both subcellular fractionation and immunofluorescence imaging experiments. In chapter three, I focused on studying the physiological function of Srx interacting proteins in the ER. A549 subcellular fractionation results showed that TXNDC5 facilitates Srx retention in the ER. Moreover, TXNDC5 and Srx were found to participate in chaperone activities in lung cancer. Both proteins contributed to the refolding of heat-shock induced protein aggregates. In addition, TXNDC5 and PDIA6 were found to enhance protein refolding in response to H2O2 treatment. Conversely, Srx appeared to have an inhibitory effect on protein folding under the same treatment conditions. Downregulation of Srx, TXNDC5, or PDIA6 significantly reduced cell viability in response to tunicamycin treatment. TXNDC5 knockdown decreased the time required for the splicing of X-box binding protein-1 (XBP-1). In either Srx or TXNDC5 knockdown cells, there was an observable decrease in the expression of GRP78 and the splicing of XBP-1. These results suggest a possible role of Srx in unfolded protein response signaling. TXNDC5 and PDIA6, similar to Srx, contribute to the proliferation, anchorage independent colony formation and migration of lung cancer cells. In this dissertation I concluded that the Srx, TXNDC5, and PDIA6 proteins participate in oxidative protein folding in lung cancer. Srx and TXNDC5 can modulate unfolded protein response (UPR) sensor activation and growth inhibition. Furthermore, TXNDC5 and PDIA6 can promote tumorigenesis of lung cancer cells. Therefore, the molecular interaction of Srx with TXNDC5/PDIA6 has the potential to be used as a novel therapeutic target for lung cancer treatment.

    Phylogenetic correlations can suffice to infer protein partners from sequences

    Determining which proteins interact together is crucial to a systems-level understanding of the cell. Recently, algorithms based on Direct Coupling Analysis (DCA) pairwise maximum-entropy models have made it possible to identify interaction partners among paralogous proteins from sequence data. This success of DCA at predicting protein-protein interactions could be mainly based on its known ability to identify pairs of residues that are in contact in the three-dimensional structure of protein complexes and that coevolve to remain physicochemically complementary. However, interacting proteins possess similar evolutionary histories. What is the role of purely phylogenetic correlations in the performance of DCA-based methods to infer interaction partners? To address this question, we employ controlled synthetic data that only involve phylogeny and no interactions or contacts. We find that DCA accurately identifies the pairs of synthetic sequences that share evolutionary history. While phylogenetic correlations confound the identification of contacting residues by DCA, they are thus useful to predict interacting partners among paralogs. We find that DCA performs as well as phylogenetic methods to this end, and slightly better than them with large and accurate training sets. Employing DCA or phylogenetic methods within an Iterative Pairing Algorithm (IPA) makes it possible to predict pairs of evolutionary partners without a training set. We further demonstrate the ability of these various methods to correctly predict pairings among real paralogous proteins with genome proximity but no known direct physical interaction, illustrating the importance of phylogenetic correlations in natural data. However, for physically interacting and strongly coevolving proteins, DCA and mutual information outperform phylogenetic methods. We finally discuss how to distinguish physically interacting proteins from proteins that only share a common evolutionary history.
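
Mutual information, one of the coevolution measures this abstract compares against phylogenetic methods, can be sketched for a single pair of alignment columns. This is the plain textbook estimator in nats, assuming no pseudocounts, sequence reweighting or finite-size corrections; it is not the paper's pipeline, and the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Mutual information (in nats) between two equally long alignment columns."""
    if len(col_a) != len(col_b) or len(col_a) == 0:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_a)
    freq_a = Counter(col_a)               # residue counts in column A
    freq_b = Counter(col_b)               # residue counts in column B
    freq_ab = Counter(zip(col_a, col_b))  # joint counts per sequence
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((freq_a[a] / n) * (freq_b[b] / n)))
    return mi
```

Two columns that always change together, such as `"AAGG"` and `"KKEE"`, give a value of ln 2, while columns that vary independently, such as `"AAGG"` and `"KEKE"`, give zero. As the abstract notes, shared phylogeny alone can also raise such scores, so a high value does not by itself imply physical contact.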

    Machine Learning based Protein Sequence to (un)Structure Mapping and Interaction Prediction

    Proteins are the fundamental macromolecules within a cell that carry out most of the biological functions. The computational study of protein structure and function, using machine learning and data analytics, is essential to advancing life-science research, given the fast-growing biological data and the complexity of analysing them to discover meaningful insights. We extend the mapping of a protein’s primary sequence beyond its structure to its disordered component, known as Intrinsically Disordered Proteins or Regions (IDPs/IDRs), and hence to the dynamics involved, which helps explain complex interactions within a cell that are otherwise obscured. The objective of this dissertation is to develop effective machine learning based tools to predict disordered proteins, their properties and dynamics, and their interaction paradigm by systematically mining and analyzing large-scale biological data. In this dissertation, we propose a robust framework to predict disordered proteins given only sequence information, using an optimized SVM with RBF kernel. Through appropriate reasoning, we highlight the structure-like behavior of IDPs in disease-associated complexes. Further, we develop a fast and effective predictor of Accessible Surface Area (ASA) of protein residues, a useful structural property that defines a protein’s exposure to partners, using regularized regression with a 3rd-degree polynomial kernel function and a genetic algorithm. As a key outcome of this research, we then introduce a novel method to extract position specific energy (PSEE) of protein residues by modeling the pairwise thermodynamic interactions and hydrophobic effect. PSEE is found to be an effective feature in identifying the enthalpy-gain of the folded state of a protein and otherwise the neutral state of the unstructured proteins. Moreover, we study the transient peptide-protein interactions that involve the induced folding of short peptides through disorder-to-order conformational changes to bind to an appropriate partner. A suite of predictors is developed, using the stacked generalization ensemble technique, to identify from protein sequence the residue patterns of Peptide-Recognition Domains that can recognize and bind to peptide motifs and phospho-peptides carrying post-translational modifications (PTMs) of amino acids, which are responsible for critical human diseases. The biologically relevant case studies demonstrate possibilities of discovering new knowledge using the developed tools.

    DISCRETIZED GEOMETRIC APPROACHES TO THE ANALYSIS OF PROTEIN STRUCTURES

    Proteins play crucial roles in a variety of biological processes. While we know that their amino acid sequence determines their structure, which in turn determines their function, we do not know why particular sequences fold into particular structures. My work focuses on discretized geometric descriptions of protein structure (conceptualizing native structure space as composed of mostly discrete, geometrically defined fragments) to better understand the patterns underlying why particular sequence elements correspond to particular structure elements. This discretized geometric approach is applied to multiple levels of protein structure, from conceptualizing contacts between residues as interactions between discrete structural elements to treating protein structures as an assembly of discrete fragments. My earlier work focused on better understanding inter-residue contacts and estimating their energies statistically. By scoring structures with energies derived from a stricter notion of contact, I show that native protein structures can be identified out of a set of decoy structures more often than when using energies derived from traditional definitions of contact and how this has implications for the evaluation of predictions that rely on structurally defined contacts for validation. Demonstrating how useful simple geometric descriptors of structure can be, I then show that these energies identify native structures on par with well-validated, detailed, atomistic energy functions. Moving to a higher level of structure, in my later work I demonstrate that discretized, geometrically defined structural fragments make good objects for the interactive assembly of protein backbones and present a software application which lets users do so. Finally, I use these fragments to generate structure-conditioned statistical energies, generalizing the classic idea of contact energies by incorporating specific structural context, enabling these energies to reflect the interaction geometries they come from. These structure-conditioned energies contain more information about native sequence preferences, correlate more highly with experimentally determined energies, and show that pairwise sequence preferences are tightly coupled to their structural context. Considered jointly, these projects highlight the degree to which protein structures and the interactions they comprise can be understood as geometric elements coming together in finely tuned ways.
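
The idea of estimating contact energies statistically can be sketched as a log-odds score of observed versus expected contact counts. The abstract does not state which reference state or contact definition the thesis uses, so the random-mixing expectation below is an assumption for illustration, energies are in arbitrary units, and the function name is hypothetical.

```python
import math
from collections import Counter


def contact_energies(contacts):
    """Log-odds contact energies from a list of (residue, residue) contacts.

    Assumed reference state: residues pair at random in proportion to how
    often each residue type appears in any contact.
    """
    contacts = list(contacts)
    pair_counts = Counter(tuple(sorted(pair)) for pair in contacts)
    residue_counts = Counter()
    for a, b in contacts:
        residue_counts[a] += 1
        residue_counts[b] += 1
    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())
    energies = {}
    for (a, b), observed in pair_counts.items():
        freq_a = residue_counts[a] / total_residues
        freq_b = residue_counts[b] / total_residues
        # Unlike pairs can arise in two orders, hence the factor of 2.
        expected = total_pairs * freq_a * freq_b * (1 if a == b else 2)
        energies[(a, b)] = -math.log(observed / expected)
    return energies
```

Pairs seen more often than random mixing predicts receive negative (favourable) energies. A structure-conditioned variant, as described in the abstract, would keep separate counts for each structural context rather than pooling all contacts.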