53 research outputs found

    Exploring RNA and protein 3D structures by geometric algorithms

    Many problems in RNA and protein structures are related to their specific geometric properties. Geometric algorithms can be used to explore possible solutions to these problems. This dissertation investigates the geometric properties of RNA and protein structures and explores three different ways that geometric algorithms can aid the study of these structures. Determine accurate structures. Accurate details in RNA structures are important for understanding RNA function, but the backbone conformation is difficult to determine and most existing RNA structures show serious steric clashes (greater than or equal to 0.4 Å overlap). I developed a program called RNABC (RNA Backbone Correction) that searches for alternative clash-free conformations with acceptable geometry. It rebuilds a suite (unit from sugar to sugar) by anchoring phosphorus and base positions, which are clearest in crystallographic electron density, and reconstructing other atoms using forward kinematics and conjugate gradient methods. Two tests show that RNABC improves backbone conformations for most problem suites in S-motifs and for many of the worst problem suites identified by members of the Richardson lab. Display structure commonalities. Structure alignment commonly uses root mean squared distance (RMSD) to measure structural similarity. I first extend RMSD to weighted RMSD (wRMSD) for multiple structures and show that using wRMSD with multiplicative weights implies the average is a consensus structure. Although I show that finding the optimal translations and rotations for minimizing wRMSD cannot be decoupled for multiple structures, I develop a near-linear iterative algorithm to converge to a local minimum of wRMSD. Finally, I propose a heuristic algorithm to iteratively reassign weights to reduce the effect of outliers and find well-aligned positions that determine structurally conserved regions. Distinguish local structural features. 
Identifying common motifs (fragments of structures common to a group of molecules) is one way to further our understanding of the structure and function of molecules. I apply a graph database mining technique to identify RNA tertiary motifs. I abstract RNA molecules as labeled graphs, use a frequent subgraph mining algorithm to derive tertiary motifs, and present an iterative structure alignment algorithm to classify tertiary motifs and generate consensus motifs. Tests on ribosomal and transfer RNA families show that this method can identify most known RNA tertiary motifs in these families and suggest candidates for novel tertiary motifs
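
    The weighted RMSD mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the dissertation's exact formulation, so the function below assumes the common definition (square root of the weighted mean of squared distances) for two structures that are already superposed; the name weighted_rmsd and the sample coordinates are hypothetical.

```python
import math


def weighted_rmsd(coords_a, coords_b, weights):
    """Weighted RMSD between two already-superposed lists of (x, y, z) points.

    Assumed definition (not taken from the dissertation):
    sqrt(sum_i w_i * |a_i - b_i|^2 / sum_i w_i).
    """
    if not (len(coords_a) == len(coords_b) == len(weights)):
        raise ValueError("coordinates and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    weighted_sq = 0.0
    for (ax, ay, az), (bx, by, bz), w in zip(coords_a, coords_b, weights):
        # Squared Euclidean distance between matched positions, scaled by weight.
        weighted_sq += w * ((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2)
    return math.sqrt(weighted_sq / total_weight)


# Hypothetical two-atom example: the second position differs by 2 units in y.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]

uniform = weighted_rmsd(structure_a, structure_b, [1.0, 1.0])
outlier_ignored = weighted_rmsd(structure_a, structure_b, [1.0, 0.0])
```

    Setting the weight of the deviating position to zero removes its contribution, which is the intuition behind reassigning weights to reduce the effect of outliers.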

    Probabilistic grammatical model of protein language and its application to helix-helix contact site classification

    BACKGROUND: Hidden Markov Models power many state‐of‐the‐art tools in the field of protein bioinformatics. While excelling in their tasks, these methods of protein analysis do not directly convey information on medium‐ and long‐range residue‐residue interactions. Conveying such information requires the expressive power of at least context‐free grammars. However, application of more powerful grammar formalisms to protein analysis has been surprisingly limited. RESULTS: In this work, we present a probabilistic grammatical framework for problem‐specific protein languages and apply it to the classification of transmembrane helix‐helix pair configurations. The core of the model consists of a probabilistic context‐free grammar, automatically inferred by a genetic algorithm from only a generic set of expert‐based rules and positive training samples. The model was applied to produce sequence‐based descriptors of four classes of transmembrane helix‐helix contact site configurations. The highest performance of the classifiers reached an AUCROC of 0.70. The analysis of grammar parse trees revealed the ability to represent structural features of helix‐helix contact sites. CONCLUSIONS: We demonstrated that our probabilistic context‐free framework for analysis of protein sequences outperforms the state of the art in the task of helix‐helix contact site classification. However, this is achieved without necessarily requiring modeling of long‐range dependencies between interacting residues. A significant feature of our approach is that grammar rules and parse trees are human‐readable. Thus they could provide biologically meaningful information for molecular biologists

    Protein structure prediction and modelling

    The prediction of protein structures from their amino acid sequence alone is a very challenging problem. Using the variety of methods available, it is often possible to achieve good models or at least to gain some more information, to aid scientists in their research. This thesis uses many of the widely available methods for the prediction and modelling of protein structures and proposes some new ideas for aiding the process. A new method for measuring the buriedness (or exposure) of residues is discussed which may lead to a potential way of assessing proteins' individual amino acid placement and whether they have a standard profile. This may become useful in assessing predicted models. Threading analysis and modelling of structures for the Critical Assessment of Techniques for Protein Structure Prediction (CASP2) highlights inaccuracies in the current state of protein prediction, particularly with the alignment predictions of sequence on structure. An in-depth analysis of the placement of gaps within a multiple sequence threading method is discussed, with ideas for the improvement of threading predictions by the construction of an improved gap penalty. A threading-based homology model was constructed with an RMSD of 6.2 Å, showing how combinations of methods can give usable results. Using a distance geometry method, DRAGON, the ab initio prediction of a protein (NK Lysin) for the CASP2 assessment was achieved with an accuracy of 4.6 Å. This highlighted several ideas in disulphide prediction and a novel method for predicting which cysteine residues might form disulphide bonds in proteins. Using a combination of all the methods, with some like threading and homology modelling proving inadequate, an ab initio model of the N-terminal domain of a GPCR was built based on secondary structure and predictions of disulphide bonds. 
Use of multiple sequences in comparing sequences to structures in threading should give enough information to enable the improvements required before threading can become a major way of building homology models. Furthermore, with the ability to predict disulphide bonds, restraints can be placed when building models, ab initio or otherwise

    Computational Molecular Coevolution

    A major goal in computational biochemistry is to obtain three-dimensional structure information from protein sequence. Coevolution represents a biological mechanism through which structural information can be obtained from a family of protein sequences. Evolutionary relationships within a family of protein sequences are revealed through sequence alignment. Statistical analyses of these sequence alignments reveal positions in the protein family that covary, and thus appear to be dependent on one another throughout the evolution of the protein family. These covarying positions are inferred to be coevolving via one of two biological mechanisms, both of which imply that coevolution is facilitated by inter-residue contact. Thus, high-quality multiple sequence alignments and robust coevolution-inferring statistics can produce structural information from sequence alone. This work characterizes the relationship between coevolution statistics and sequence alignments and highlights the implicit assumptions and caveats associated with coevolutionary inference. An investigation of sequence alignment quality and coevolutionary-inference methods revealed that such methods are very sensitive to the systematic misalignments discovered in public databases. However, repairing the misalignments in such alignments restores the predictive power of coevolution statistics. To overcome the sensitivity to misalignments, two novel coevolution-inferring statistics were developed that show increased contact prediction accuracy, especially in alignments that contain misalignments. These new statistics were developed into a suite of coevolution tools, the MIpToolset. Because systematic misalignments produce a distinctive pattern when analyzed by coevolution-inferring statistics, a new method for detecting systematic misalignments was created to exploit this phenomenon. 
This new method, called "local covariation", was used to analyze publicly available multiple sequence alignment databases. Local covariation detected putative misalignments in a database designed to benchmark sequence alignment software accuracy. Local covariation was incorporated into a new software tool, LoCo, which displays regions of potential misalignment during alignment editing and assists in their correction. This work represents advances in multiple sequence alignment creation and coevolutionary inference
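
    To make the idea of covarying alignment positions concrete, the sketch below computes plain mutual information between two alignment columns. This is a widely used covariation statistic chosen here as a stand-in; the abstract does not describe the statistics developed in the thesis, so nothing below should be read as the MIpToolset or LoCo implementation. The function name and the toy columns are hypothetical.

```python
import math
from collections import Counter


def mutual_information(column_i, column_j):
    """Mutual information (in bits) between two equal-length alignment columns.

    Each column is a string holding one residue per aligned sequence.
    """
    if len(column_i) != len(column_j) or not column_i:
        raise ValueError("columns must be non-empty and of equal length")

    n = len(column_i)
    count_i = Counter(column_i)                  # residue counts in column i
    count_j = Counter(column_j)                  # residue counts in column j
    count_ij = Counter(zip(column_i, column_j))  # counts of residue pairs

    mi = 0.0
    for (a, b), pair_count in count_ij.items():
        p_ab = pair_count / n
        # p_ab / (p_a * p_b), written with raw counts to limit rounding error.
        ratio = pair_count * n / (count_i[a] * count_j[b])
        mi += p_ab * math.log2(ratio)
    return mi


# Toy columns from a four-sequence alignment (hypothetical data).
covarying = mutual_information("AAGG", "CCTT")    # residues change together
independent = mutual_information("AAGG", "CTCT")  # no dependence
```

    In this toy case the perfectly covarying pair carries one bit of information and the independent pair carries none; real alignments need many more sequences before such estimates are meaningful.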

    Hydropathic Interactions and Protein Structure: Utilizing the HINT Force Field in Structure Prediction and Protein‐Protein Docking.

    Protein structure prediction is a field of computational molecular modeling with an enormous potential for improvement. Side-chain geometry prediction is a critical component of this process that is crucial both for computational protein structure prediction and for crystallographers refining experimentally determined protein crystal structures. The cornerstone of side-chain geometry prediction is the side-chain rotamer library, usually obtained through exhaustive statistical analysis of existing protein structures. Little is known, however, about the driving forces leading to the preference or suitability of one rotamer over another. Construction of 3D hydropathic interaction maps for nearly 30,000 tyrosines extracted from the PDB reveals their environments, in terms of hydrophobic and polar (collectively “hydropathic”) interactions. Using a unique 3D similarity metric, these environments were clustered with k-means. In the ϕ, ψ region (–200° < ϕ < –155°; –205° < ψ < –160°) representing 631 tyrosines, clustering reduced the set to 14 unique hydropathic environments, with most diversity arising from favorable hydrophobic interactions. Polar interactions for tyrosine include ubiquitous hydrogen bonding with the phenolic OH and a handful of unique environments surrounding the backbone. The memberships of all but one of the 14 environments are dominated by a single χ1/χ2 rotamer. Each tyrosine residue attempts to fulfill its hydropathic valence. Structural water molecules are thus used in a variety of roles throughout protein structure. A second project involves elucidating the 3D structure of CRIP1a, a cannabinoid 1 receptor (CB1R) binding protein that could provide information for designing small molecules targeting the CRIP1a-CB1R interaction. The CRIP1a protein was produced in high purity. Crystallization experiments failed, both with and without the last 9 or 12 amino acid peptide of the CB1R C-terminus. 
Attempts were made to use NMR for structure determination; however, the protein precipitated out during data acquisition. A model was thus built computationally to which the CB1R C-terminus peptide was docked. HINT was used in selecting optimum models and analyzing interactions involved in the CRIP1a-CB1R complex. The final model demonstrated key putative interactions between CRIP1a and CB1R while also predicting highly flexible areas of the CRIP1a possibly contributing to the difficulties faced during crystallization

    Understanding Molecular Interactions: Application of HINT-based Tools in the Structural Modeling of Novel Anticancer and Antiviral Targets, and in Protein-Protein Docking

    Computationally driven drug design/discovery efforts generally rely on accurate assessment of the forces that guide the molecular recognition process. HINT (Hydropathic INTeraction) is a natural force field, derived from experimentally determined partition coefficients, that quantifies all non-bonded interactions in the biological environment, including hydrogen bonding, electrostatic and hydrophobic interactions, and the energy of desolvation. The overall goal of this work is to apply the HINT-based atomic level description of molecular systems to biologically important proteins, to better understand their biochemistry – a key step in exploiting them for therapeutic purposes. This dissertation discusses the results of three diverse projects: i) structural modeling of human sphingosine kinase 2 (SphK2, a novel anticancer target) and binding mode determination of an isoform selective thiazolidine-2,4-dione (TZD) analog; ii) structural modeling of human cytomegalovirus (HCMV) alkaline nuclease (AN) UL98 (a novel antiviral target) and subsequent virtual screening of its active site; and iii) explicit treatment of interfacial waters during the protein-protein docking process using HINT-based computational tools. SphK2 is a key regulator of the sphingosine-rheostat, and its upregulation/overexpression has been associated with cancer development. We report structural modeling studies of a novel TZD-analog that selectively inhibits SphK2, in a HINT analysis that identifies the key structural features of ligand and protein binding site responsible for isoform selectivity. The second aim was to build a three-dimensional structure of a novel HCMV target – AN UL98, to identify its catalytically important residues. HINT analysis of the interaction of the 5′ DNA end at its active site is reported. 
A parallel aim, to perform in silico screening with a site-based pharmacophore model, identified several novel hits with potentially desirable chemical features for interaction with UL98 AN. The majority of current protein-protein docking algorithms fail to account for water molecules involved in bridging interactions between partners, mediating and stabilizing their association. HINT is capable of reproducing the physical and chemical properties of such waters, while accounting for their energetic stabilizing contributions. We have designed a solvated protein-protein docking protocol that explicitly models the relevant bridging waters, and demonstrate that more accurate results are obtained when water is not ignored

    Protein structure prediction: improving and automating knowledge-based approaches

    This work presents a computational approach to improve the automatic prediction of protein structures from sequence. Its main focus was twofold. An automated method for guiding the modeling process was first developed. This was tested and found to be state of the art in the CASP4 structure prediction contest in 2000. The second focus was the development of a novel divide-and-conquer algorithm for modeling flexible loops in proteins. Implementation of the search procedure and subsequent ranking is presented. The results are again compared with state-of-the-art methods

    Expanding the ancient DNA bioinformatics toolbox, and its applications to archeological microbiomes

    The 1980s were very prolific years not only for music, but also for molecular biology and genetics, with the first publications on the microbiome and ancient DNA. Several technical revolutions later, the field of ancient metagenomics is now progressing full steam ahead, at an unprecedented pace. While generating sequencing data is becoming cheaper every year, the bioinformatics methods and the compute power needed to analyze them are struggling to catch up. In this thesis, I propose new methods to reduce the sequencing-to-analysis gap, by introducing scalable and parallelized software for ancient DNA metagenomics analysis. In manuscript A, I first introduce a method for estimating the mixtures of different sources in a sequencing sample, a problem known as source tracking. I then apply this method to predict the original sources of paleofeces in manuscript B. In manuscript C, I propose a new method to scale the lowest common ancestor calling from sequence alignment files, which brings a solution for the computational intractability of fitting ever-growing metagenomic reference database indices in memory. In manuscript D, I present a method to statistically estimate in parallel the ancient DNA deamination damage, and test it in the context of de novo assembly. Finally, in manuscript E, I apply some of the methods developed in this thesis to the analysis of ancient wine fermentation samples, and present the first ancient genomes of fermentation bacteria. Taken together, the tools developed in this thesis will help researchers working in the field of ancient DNA metagenomics to scale their analysis to the massive amount of sequencing data routinely produced nowadays
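
    Lowest common ancestor (LCA) calling, mentioned for manuscript C, assigns a read that aligns to several reference taxa to the deepest taxonomic node they all share. The sketch below shows only that core idea on a toy taxonomy; it is not the scalable method from the thesis, and the parent map, taxon names, and function names are hypothetical. It assumes an acyclic taxonomy with a single root.

```python
def lineage(taxon, parent):
    """Return the path from a taxon up to the root, leaf first."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Deepest node shared by the lineages of all given taxa."""
    if not taxa:
        raise ValueError("at least one taxon is required")
    # Start from the first lineage (leaf to root) and keep only shared nodes.
    common = lineage(taxa[0], parent)
    for taxon in taxa[1:]:
        other = set(lineage(taxon, parent))
        common = [node for node in common if node in other]
    # The list is still ordered leaf to root, so the first entry is the deepest.
    return common[0]


# Toy, heavily simplified taxonomy (child -> parent); for illustration only.
toy_parent = {
    "Escherichia coli": "Enterobacteriaceae",
    "Salmonella enterica": "Enterobacteriaceae",
    "Enterobacteriaceae": "Bacteria",
    "Lactobacillus": "Bacteria",
}

family_level = lowest_common_ancestor(
    ["Escherichia coli", "Salmonella enterica"], toy_parent
)
root_level = lowest_common_ancestor(
    ["Escherichia coli", "Salmonella enterica", "Lactobacillus"], toy_parent
)
```

    A read hitting only the two enteric species is placed at the family node, while adding a hit to an unrelated genus pushes the assignment up to the root.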

    A New Method for Ligand-supported Homology Modelling of Protein Binding Sites: Development and Application to the neurokinin-1 receptor

    In this thesis, a novel strategy, MOBILE (Modelling Binding Sites Including Ligand Information Explicitly), was developed that models protein binding-sites while simultaneously considering information about the binding mode of bioactive ligands during the homology modelling process. As a result, protein binding-site models of higher accuracy and relevance can be generated. Starting with the (crystal) structure of one or more template proteins, in the first step several preliminary homology models of the target protein are generated using the homology modelling program MODELLER. Ligands are then placed into these preliminary models using different strategies depending on the amount of experimental information about the binding mode of the ligands. (1.) If a ligand is known to bind to the target protein and the crystal structure of the protein-ligand complex with the related template protein is available, it can be assumed that the ligand binding modes are similar in the target and template protein. Accordingly, ligands are then transferred among these structures keeping their orientation as a restraint for the subsequent modelling process. (2.) If no complex crystal structure with the template is available, the ligand(s) can be placed into the template protein structure by docking, and the resulting orientation can then be used to restrain the following protein modelling process. Alternatively, (3.) in cases where knowledge about the binding mode cannot be inferred from the template protein, ligand docking is performed into an ensemble of homology models. The ligands are placed into a crude binding-site representation via docking into averaged property fields derived from knowledge-based potentials. Once the ligands are placed, a new set of homology models is generated. However, in this step, ligand information is considered as an additional restraint in terms of the knowledge-based DrugScore protein-ligand atom pair potentials. 
Consulting a large ensemble of produced models exhibiting different side-chain rotamers for the binding-site residues, a composite picture is assembled considering the individually best-scored rotamers with respect to the ligand. After a local force-field optimisation, the obtained binding-site models can be used for structure-based drug design

    Graph-Based Approaches to Protein Structure Comparison - From Local to Global Similarity

    The comparative analysis of protein structure data is a central aspect of structural bioinformatics. Drawing upon structural information allows the inference of function for unknown proteins even in cases where no apparent homology can be found on the sequence level. Regarding the function of an enzyme, the overall fold topology might be less important than the specific structural conformation of the catalytic site or the surface region of a protein, where the interaction with other molecules, such as binding partners, substrates and ligands, occurs. Thus, a comparison of these regions is especially interesting for functional inference, since structural constraints imposed by the demands of the catalyzed biochemical function make them more likely to exhibit structural similarity. Moreover, the comparative analysis of protein binding sites is of special interest in pharmaceutical chemistry, in order to predict cross-reactivities and gain a deeper understanding of the catalysis mechanism. From an algorithmic point of view, the comparison of structured data, or, more generally, complex objects, can be attempted based on different methodological principles. Global methods aim at comparing structures as a whole, while local methods transfer the problem to multiple comparisons of local substructures. In the context of protein structure analysis, it is not a priori clear which strategy is more suitable. In this thesis, several conceptually different algorithmic approaches have been developed, based on local, global and semi-global strategies, for the task of comparing protein structure data, more specifically protein binding pockets. The use of graphs for the modeling of protein structure data has a long-standing tradition in structural bioinformatics. Recently, graphs have been used to model the geometric constraints of protein binding sites. 
The algorithms developed in this thesis are based on this modeling concept, hence, from a computer scientist's point of view, they can also be regarded as global, local and semi-global approaches to graph comparison. The developed algorithms were mainly designed on the premise to allow for a more approximate comparison of protein binding sites, in order to account for the molecular flexibility of the protein structures. A main motivation was to allow for the detection of more remote similarities, which are not apparent by using more rigid methods. Subsequently, the developed approaches were applied to different problems typically encountered in the field of structural bioinformatics in order to assess and compare their performance and suitability for different problems. Each of the approaches developed during this work was capable of improving upon the performance of existing methods in the field. Another major aspect in the experiments was the question, which methodological concept, local, global or a combination of both, offers the most benefits for the specific task of protein binding site comparison, a question that is addressed throughout this thesis