33 research outputs found

    Navigating the Extremes of Biological Datasets for Reliable Structural Inference and Design

    Structural biologists currently confront serious challenges in the effective interpretation of experimental data due to two contradictory situations: a severe lack of structural data for certain classes of proteins, and an incredible abundance of data for other classes. The challenge with small data sets is how to extract sufficient information to draw meaningful conclusions, while the challenge with large data sets is how to curate, categorize, and search the data to allow for its meaningful interpretation and application to scientific problems. Here, we develop computational strategies to address both sparse and abundant data sets. In the category of sparse data sets, we focus our attention on the problem of transmembrane (TM) protein structure determination. As X-ray crystallography and NMR data are notoriously difficult to obtain for TM proteins, we develop a novel algorithm which uses low-resolution data from protein cross-linking or scanning mutagenesis studies to produce models of TM helix oligomers and show that our method produces models with an accuracy on par with X-ray crystallography or NMR for a test set of known TM proteins. Turning to instances of data abundance, we examine how to mine the vast stores of protein structural data in the Protein Data Bank (PDB) to aid in the design of proteins with novel binding properties. We show how the identification of an anion binding motif in an antibody structure allowed us to develop a phosphate binding module that can be used to produce novel antibodies to phosphorylated peptides, creating antibodies to 7 novel phospho-peptides to illustrate the utility of our approach. We then describe a general strategy for designing binders to a target protein epitope based upon recapitulating protein interaction geometries which are over-represented in the PDB. We follow this by using data describing the transition probabilities of amino acids to develop a novel set of degenerate codons to create more efficient gene libraries. We conclude by describing a novel, real-time, all-atom structural search engine, giving researchers the ability to quickly search known protein structures for a motif of interest and providing a new interactive paradigm of protein design.

    From genotypes to organisms: state-of-the-art and perspectives of a cornerstone in evolutionary dynamics

    Understanding how genotypes map onto phenotypes, fitness, and eventually organisms is arguably the next major missing piece in a fully predictive theory of evolution. We refer to this generally as the problem of the genotype-phenotype map. Though we are still far from achieving a complete picture of these relationships, our current understanding of simpler questions, such as the structure induced in the space of genotypes by sequences mapped to molecular structures, has revealed important facts that deeply affect the dynamical description of evolutionary processes. Empirical evidence supporting the fundamental relevance of features such as phenotypic bias is mounting as well, while the synthesis of conceptual and experimental progress leads to questioning current assumptions on the nature of evolutionary dynamics, with cancer progression models or synthetic biology approaches being notable examples. This work delves with a critical and constructive attitude into our current knowledge of how genotypes map onto molecular phenotypes and organismal functions, and discusses theoretical and empirical avenues to broaden and improve this comprehension. As a final goal, this community should aim at deriving an updated picture of evolutionary processes soundly relying on the structural properties of genotype spaces, as revealed by modern techniques of molecular and functional analysis.

    Correlated Mutations: A Hallmark of Phenotypic Amino Acid Substitutions

    Point mutations resulting in the substitution of a single amino acid can cause severe functional consequences, but can also be completely harmless. Understanding what determines the phenotypical impact is important both for planning targeted mutation experiments in the laboratory and for analyzing naturally occurring mutations found in patients. Common wisdom suggests using the extent of evolutionary conservation of a residue or a sequence motif as an indicator of its functional importance and thus vulnerability in case of mutation. In this work, we put forward the hypothesis that in addition to conservation, co-evolution of residues in a protein influences the likelihood of a residue to be functionally important and thus associated with disease. While the basic idea of a relation between co-evolution and functional sites has been explored before, we have conducted the first systematic and comprehensive analysis of point mutations causing disease in humans with respect to correlated mutations. We included 14,211 distinct positions with known disease-causing point mutations in 1,153 human proteins in our analysis. Our data show that (1) correlated positions are significantly more likely to be disease-associated than expected by chance, and that (2) this signal cannot be explained by conservation patterns of individual sequence positions. Although correlated residues have primarily been used to predict contact sites, our data are in agreement with previous observations that (3) many such correlations do not relate to physical contacts between amino acid residues. Access to our analysis results is provided at http://webclu.bio.wzw.tum.de/~pagel/supplements/correlated-positions/

    From genotypes to organisms: State-of-the-art and perspectives of a cornerstone in evolutionary dynamics

    Understanding how genotypes map onto phenotypes, fitness, and eventually organisms is arguably the next major missing piece in a fully predictive theory of evolution. We refer to this generally as the problem of the genotype-phenotype map. Though we are still far from achieving a complete picture of these relationships, our current understanding of simpler questions, such as the structure induced in the space of genotypes by sequences mapped to molecular structures, has revealed important facts that deeply affect the dynamical description of evolutionary processes. Empirical evidence supporting the fundamental relevance of features such as phenotypic bias is mounting as well, while the synthesis of conceptual and experimental progress leads to questioning current assumptions on the nature of evolutionary dynamics, with cancer progression models or synthetic biology approaches being notable examples. This work delves with a critical and constructive attitude into our current knowledge of how genotypes map onto molecular phenotypes and organismal functions, and discusses theoretical and empirical avenues to broaden and improve this comprehension. As a final goal, this community should aim at deriving an updated picture of evolutionary processes soundly relying on the structural properties of genotype spaces, as revealed by modern techniques of molecular and functional analysis. Comment: 111 pages, 11 figures, uses elsarticle LaTeX class

    Knob-socket Investigation of Stability and Specificity in Alpha-helical Secondary and Quaternary Packing Structure

    The novel knob-socket (KS) model provides a construct to interpret and analyze the direct contributions of amino acid residues to the stability of α-helical protein structures. Based on residue preferences derived from a set of protein structures, the KS construct characterizes intra- and inter-helical packing into regular patterns of simple motifs. The KS model was used in the de novo design of an α-helical homodimer, KSα1.1. Using site-directed mutagenesis, KSα1.1 point mutants were designed to selectively increase and decrease stability by relating KS propensities with changes to α-helical structure. This study suggests that the sockets from the KS model can be used as a measure of α-helical structure and stability. The KS model was also used to investigate coiled-coil specificity in bZIP proteins. Identifying and characterizing the interactions that determine the dimerization specificity between bZIP proteins is a crucial factor in better understanding disease formation and proliferation, as well as developing drugs or therapeutics to combat these diseases. Knob-socket mapping methods identified Asn residues at a positions within the helices, which were determined to be crucial factors in coiled-coil specificity. Site-directed mutagenesis was conducted to investigate the role of the Asn residues, as well as the role played by the neighboring residues at the g and b positions. The results indicate that the Asn at the a position defines coiled-coil specificity, and that the knob-socket model can be used to determine bZIP protein quaternary interactions.

    Probabilistic grammatical model of protein language and its application to helix-helix contact site classification

    BACKGROUND: Hidden Markov Models power many state‐of‐the‐art tools in the field of protein bioinformatics. While excelling in their tasks, these methods of protein analysis do not directly convey information on medium‐ and long‐range residue‐residue interactions. This requires an expressive power of at least context‐free grammars. However, application of more powerful grammar formalisms to protein analysis has been surprisingly limited. RESULTS: In this work, we present a probabilistic grammatical framework for problem‐specific protein languages and apply it to classification of transmembrane helix‐helix pair configurations. The core of the model consists of a probabilistic context‐free grammar, automatically inferred by a genetic algorithm from only a generic set of expert‐based rules and positive training samples. The model was applied to produce sequence‐based descriptors of four classes of transmembrane helix‐helix contact site configurations. The highest performance of the classifiers reached AUCROC of 0.70. The analysis of grammar parse trees revealed the ability of representing structural features of helix‐helix contact sites. CONCLUSIONS: We demonstrated that our probabilistic context‐free framework for analysis of protein sequences outperforms the state of the art in the task of helix‐helix contact site classification. However, this is achieved without necessarily requiring modeling long‐range dependencies between interacting residues. A significant feature of our approach is that grammar rules and parse trees are human‐readable. Thus they could provide biologically meaningful information for molecular biologists.

    Modulating Protein Function with Small Molecules through Computational and Experimental Design Techniques

    The ability to modulate protein function using exogenous small molecules is a longstanding goal in chemical biology. Selective activation or inhibition of a particular protein function can help elucidate crucial molecular mechanisms and enables important advances in cell biology. Small-molecule controlled molecular systems also possess tremendous value in bioengineering and biomedical applications: activation of protein function allows the construction of protein switches and biosensor proteins, whereas inhibition of protein function contributes to the development of novel therapeutic agents. The discovery of small-molecule modulators of function is greatly aided by computational modeling methodologies. By utilizing structural information obtained through X-ray crystallography or NMR spectroscopy, these tools allow efficient and affordable examination of large small-molecule databases and provide quantitative evaluation of the likelihood that a given protein-ligand interaction occurs. Advances in computer algorithms and hardware development continue to accelerate and scale up the computation and lower the cost of this discovery process. The primary focus of this thesis is the development of structure-based computer-aided methodologies for designing small-molecule modulators of protein function. To this end I explored two parallel paths, one to study activation and one to study inhibition of protein functions. Taken together, my work aims to not only apply rational design strategies to specific proteins, but also demonstrate their general applicability. The first project, focused on activation of protein function, is built on an approach developed by our laboratory that designs a de novo allosteric binding site directly into the catalytic domain of an enzyme. This approach achieves modulation of function by a novel "chemical rescue of structure" approach: a tryptophan-to-glycine mutation disrupts local structure and induces conformational changes that distort the geometry at the active site; the subsequent binding of exogenous indole then reverts this conformational change and restores the native enzyme structure. The main challenge of generalizing this approach, however, is the difficulty of rationally designing analogous conformational changes in other proteins. It is therefore important to study the possible mechanisms that can be utilized by chemical rescue of structure. Through collaborative and multidisciplinary efforts, we find that the switchable proteins built via the chemical rescue of structure are frequently controlled indirectly by modulating protein stability, rather than discrete conformational changes. Since energetic evaluation of protein stability is far more tractable than designing and/or predicting allosteric conformational changes, this finding demonstrates how chemical rescue of structure can be applied to other systems for building a variety of new protein switches. To further generalize the applicability of chemical rescue of structure, I sought to extend it to include multiple amino acids, rather than just one. I chose ChxR, a homodimeric response regulator in Chlamydia, as the model protein to examine the feasibility of this strategy. I mutated a pair of tryptophans at the dimer interface to glycine in order to disrupt the dimerization of ChxR. To enable the subsequent functional rescue, I used the removed structural elements as a template for ligand-based virtual screening and discovered a set of candidate small molecules that mimic the three-dimensional geometry and chemical properties of the removed chemical moieties. Biophysical characterization of these compounds suggests that the majority of them selectively bind to the engineered ChxR variant. This observation shows promise in extending this generalized design strategy to allow alternate activating ligands. In parallel to these efforts I carried out studies aimed at inhibition of protein function, as exemplified by my project that uses small molecules to disrupt a protein-RNA interaction. Conventional methods of inhibitor design mostly target RNA-processing enzymes and cannot be generalized to the majority of RNA-binding proteins (RBPs). I contributed to the development of a general strategy of designing competitive inhibitors targeting RBPs. This method involves identifying "hotspot pharmacophores" from the protein-RNA interaction and using them as a template in ligand-based virtual screening. To evaluate the performance of this approach, my collaborators and I applied it to Musashi-1 (Msi1), a protein that upregulates the Notch and Wnt signaling pathways and promotes cell cycle progression. Our "hotspot mimicry" approach led us to discover compounds that match the hotspot pharmacophore, and thus enabled the development of novel inhibitors to the Msi1/RNA interaction that we validated in both biochemical and cell-based assays. This approach extends the "hotspot" paradigm from protein-protein complexes to protein-RNA complexes, and helps establish the "druggability" of RNA-binding interfaces. It is the first example of a rationally-designed competitive inhibitor for a non-enzymatic RNA-binding protein. Owing to its simplicity and generality, I anticipate that the hotspot mimicry approach may lead to the identification of inhibitors of other protein-RNA complexes, which in future may serve as starting points for the development of a novel class of therapeutic agents.

    Soybean aphid biotype 1 genome: Insights into the invasive biology and adaptive evolution of a major agricultural pest

    The soybean aphid, Aphis glycines Matsumura (Hemiptera: Aphididae), is a serious pest of the soybean plant, Glycine max, a major world-wide agricultural crop. We assembled a de novo genome sequence of Ap. glycines Biotype 1, from a culture established shortly after this species invaded North America. 20.4% of the Ap. glycines proteome is duplicated. These in-paralogs are enriched with Gene Ontology (GO) categories mostly related to apoptosis, a possible adaptation to plant chemistry and other environmental stressors. Approximately one-third of these genes show parallel duplication in other aphids. However, Ap. gossypii, its closest related species, has the lowest number of these duplicated genes. An Illumina GoldenGate assay of 2380 SNPs was used to determine the world-wide population structure of Ap. glycines. Chinese and South Korean aphids are the closest to those in North America. China is the likely origin of other Asian aphid populations. The most distantly related aphids to those in North America are from Australia. The diversity of Ap. glycines in North America has decreased over time since its arrival. The genetic diversity of the Ap. glycines North American population, sampled from shortly after its first detection in 2001 up to 2012, does not appear to correlate with geography. However, aphids collected on soybean Rag experimental varieties in Minnesota (MN), Iowa (IA), and Wisconsin (WI), closer to high density Rhamnus cathartica stands, appear to have higher capacity to colonize resistant soybean plants than aphids sampled in Ohio (OH), North Dakota (ND), and South Dakota (SD). Samples from the former states have SNP alleles with high FST values and frequencies that overlap with genes involved in iron metabolism, a crucial metabolic pathway that may be affected by the Rag-associated soybean plant response. The Ap. glycines Biotype 1 genome will provide needed information for future analyses of mechanisms of aphid virulence and pesticide resistance as well as facilitate comparative analyses between aphids with differing natural history and host plant range.

    Frustration in Biomolecules

    Biomolecules are the prime information processing elements of living matter. Most of these inanimate systems are polymers that compute their structures and dynamics using as input seemingly random character strings of their sequence, following which they coalesce and perform integrated cellular functions. In large computational systems with finite interaction codes, the appearance of conflicting goals is inevitable. Simple conflicting forces can lead to quite complex structures and behaviors, leading to the concept of "frustration" in condensed matter. We present here some basic ideas about frustration in biomolecules and how the frustration concept leads to a better appreciation of many aspects of the architecture of biomolecules, and how structure connects to function. These ideas are simultaneously both seductively simple and perilously subtle to grasp completely. The energy landscape theory of protein folding provides a framework for quantifying frustration in large systems and has been implemented at many levels of description. We first review the notion of frustration from the areas of abstract logic and its uses in simple condensed matter systems. We then discuss how the frustration concept applies specifically to heteropolymers, testing folding landscape theory in computer simulations of protein models and in experimentally accessible systems. Studying the aspects of frustration averaged over many proteins provides ways to infer energy functions useful for reliable structure prediction. We discuss how frustration affects folding, how a large part of the biological functions of proteins are related to subtle local frustration effects and how frustration influences the appearance of metastable states, the nature of binding processes, catalysis and allosteric transitions. We hope to illustrate how frustration is a fundamental concept in relating function to structural biology. Comment: 97 pages, 30 figures