118 research outputs found

    Accelerated Profile HMM Searches

    Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the “multiple segment Viterbi” (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call “sparse rescaling”. These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches
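
The ungapped-segment idea behind MSV can be illustrated with a much simpler relative. The sketch below is not the MSV algorithm and not HMMER3 code: it scores only a single best ungapped local segment (MSV sums several), uses plain Python instead of striped vector instructions, and assumes a hypothetical profile format (a list of per-position score dictionaries) and a hypothetical flat `mismatch` score.

```python
def best_ungapped_segment_score(profile, seq, mismatch=-1):
    """Best score of any single ungapped local alignment of seq to profile.

    profile:  list of dicts, one per model position, mapping residue -> score
              (hypothetical format chosen for this illustration).
    mismatch: score for residues missing from a position's dict (assumed).
    """
    best = 0
    prev = [0] * (len(profile) + 1)  # prev[k]: best segment ending at (i-1, k)
    for residue in seq:
        cur = [0] * (len(profile) + 1)
        for k, column in enumerate(profile, start=1):
            # Extend the diagonal segment ending at (i-1, k-1), or start fresh.
            cur[k] = max(prev[k - 1], 0) + column.get(residue, mismatch)
            best = max(best, cur[k])
        prev = cur
    return best


if __name__ == "__main__":
    toy_profile = [{"A": 2}, {"C": 2}, {"G": 2}]
    print(best_ungapped_segment_score(toy_profile, "TTACGTT"))  # prints 6
```

Because no gaps are allowed, each cell depends only on its diagonal predecessor, which is the property that makes this kind of recurrence easy to vectorize.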

    Quantifying the Role of Water in Ligand-Protein Binding Processes

    The aim of this thesis is to quantify the contributions of water thermodynamics to the binding free energy in protein-ligand complexes. Various computational tools were applied, implemented, benchmarked and discussed. A custom implementation of the IFST formulation was developed to facilitate easy integration into workflows based on Schrödinger software. By applying the tool to a well-defined test set of congeneric ligand pairs, the potential of IFST for quantitative predictions in lead optimization was assessed. Furthermore, FEP calculations were applied to an extended test set to validate whether these simulations can accurately account for solvent displacement in ligand modifications. Finally, as a fast tool with applications in virtual screening, we developed and validated a new scoring function that incorporates terms for protein and ligand desolvation. In total, this resulted in three distinct studies, each of which elucidated a different aspect of water thermodynamics in CADD. These three studies are presented in the next section. In the conclusion, the results and implications of these studies are discussed jointly, along with possible future developments. An additional study focused on virtual screening and toxicity prediction at the androgen receptor, where distinguishing agonists from antagonists is difficult. We proposed and validated an approach based on MD simulations and ensemble docking to improve predictions of androgen agonists and antagonists

    Probing Local Atomic Environments to Model RNA Energetics and Structure

    Ribonucleic acids (RNA) are critical components of living systems. Understanding RNA structure and its interaction with other molecules is an essential step in understanding RNA-driven processes within the cell. Experimental techniques like X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and chemical probing methods have provided insights into RNA structures on the atomic scale. To effectively exploit experimental data and characterize features of an RNA structure, quantitative descriptors of local atomic environments are required. Here, I investigated different ways to describe RNA local atomic environments. First, I investigated the solvent-accessible surface area (SASA) as a probe of RNA local atomic environment. SASA contains information on the level of exposure of an RNA atom to solvents and, in some cases, is highly correlated to reactivity profiles derived from chemical probing experiments. Using Bayesian/maximum entropy (BME), I was able to reweight RNA structure models based on the agreement between SASA and chemical reactivities. Next, I developed a numerical descriptor (the atomic fingerprint) that is capable of discriminating different atomic environments. Using atomic fingerprints as features enables the prediction of RNA structure and structure-related properties. Two detailed examples are discussed. Firstly, a classification model was developed to predict Mg$^{2+}$ ion binding sites. Results indicate that the model could predict Mg$^{2+}$ binding sites with reasonable accuracy, and it appears to outperform existing methods. Secondly, a set of models was developed to identify cavities in RNA that are likely to accommodate small-molecule ligands. The models were also used to identify bound-like conformations from an ensemble of RNA structures. 
The frameworks presented here provide paths to connect the local atomic environment to RNA structure, and I envision they will provide opportunities to develop novel RNA modeling tools.
PhD, Physics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163135/1/jingrux_1.pd
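
The reweighting step mentioned in this abstract can be illustrated with a stripped-down maximum-entropy sketch. It is not the thesis's BME procedure: it fits a single averaged observable exactly, omits the error-tolerance term that Bayesian/maximum entropy adds, and every name here (`maxent_reweight`, `values`, `target`, the search bounds) is a hypothetical choice for illustration. The weights take the standard exponential form w_i ∝ w0_i exp(-λ f_i), and λ is found by bisection.

```python
import math


def maxent_reweight(values, target, prior=None, lo=-50.0, hi=50.0, iters=200):
    """Reweight ensemble members so the weighted mean of values hits target.

    values: per-structure observable f_i (e.g. a back-calculated quantity).
    target: desired ensemble average (e.g. an experimental value).
    prior:  strictly positive initial weights; uniform if omitted (assumed).
    """
    if not min(values) < target < max(values):
        raise ValueError("target must lie strictly inside the range of values")
    n = len(values)
    prior = prior or [1.0 / n] * n

    def weights(lam):
        # Work in log space and subtract the maximum to avoid overflow.
        logs = [math.log(p) - lam * v for p, v in zip(prior, values)]
        shift = max(logs)
        unnorm = [math.exp(x - shift) for x in logs]
        total = sum(unnorm)
        return [u / total for u in unnorm]

    def mean(lam):
        return sum(w * v for w, v in zip(weights(lam), values))

    # The weighted mean decreases monotonically as lam grows, so bisect.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(mid) > target:
            lo = mid
        else:
            hi = mid
    return weights(0.5 * (lo + hi))


if __name__ == "__main__":
    print(maxent_reweight([0.0, 1.0, 2.0], target=1.5))
```

Among all weight sets that reproduce the target average, this exponential form is the one that stays closest to the prior in relative entropy, which is why it is the usual starting point for ensemble reweighting.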

    Mass & secondary structure propensity of amino acids explain their mutability and evolutionary replacements

    Why is an amino acid replacement in a protein accepted during evolution? The answer given by bioinformatics relies on the frequency of change of each amino acid by another one and the propensity of each to remain unchanged. We propose that these replacement rules are recoverable from the secondary structural trends of amino acids. A distance measure between high-resolution Ramachandran distributions reveals that structurally similar residues coincide with those found in substitution matrices such as BLOSUM: Asn ↔ Asp, Phe ↔ Tyr, Lys ↔ Arg, Gln ↔ Glu, Ile ↔ Val, Met → Leu; with Ala, Cys, His, Gly, Ser, Pro, and Thr as structurally idiosyncratic residues. We also found a high average correlation ($\overline{R} = 0.85$) between thirty amino acid mutability scales and the mutational inertia ($I_X$), which measures the energetic cost weighted by the number of observations at the most probable amino acid conformation. These results indicate that amino acid substitutions follow two optimally efficient principles: (a) amino acid interchangeability favors secondary structural similarity, and (b) amino acid mutability depends directly on biosynthetic energy cost and inversely on frequency. These two principles are the underlying rules governing the observed amino acid substitutions. © 2017 The Author(s)

    Discovery and development of novel inhibitors for the kinase Pim-1 and G-Protein Coupled Receptor Smoothened

    Investigation of the cause of disease is no easy business. This is particularly so when one reflects upon the lessons taught us in antiquity. Prior to the beginning of the last century, diagnosis and treatment of diseases such as cancers was so bereft of hope that there was little physicians could offer in the way of comfort, let alone treatment. One of the major insights from investigations into cancers this century has been that those involved in research leading to treatments are not dealing with a singular malady but multiple families of diseases with different mechanisms and modes of action. Therefore, despite the end game being similar in cancers, that of uncontrolled growth and replication leading to cellular dysfunction, different diseases require different approaches in targeting them. This leads us to a particular broad treatment approach, that of drug design. A drug is, in the classical sense, a small molecule that, upon introduction into the body, interacts with biochemical targets to induce a wider biological effect, ideally with both an intended target and intended effect. The conceptual basis underpinning this 'lock-and-key' paradigm was elucidated over a century ago and the primary occupation of those involved in biochemical research has been to determine as much information as possible about both of these protein locks and drug keys. And, as the paradigm implies, molecular shape is all-important in determining and controlling action against the most important locks with the most potent and specific keys. The two most important target classes in drug discovery for some time have been protein kinases and G Protein-Coupled Receptors (GPCRs). Both classes of proteins are large families that perform very different tasks within the body. Kinases activate and inactivate many cellular processes by catalysing the transfer of a phosphate group from Adenosine Tri-Phosphate (ATP) to other targets. 
GPCRs perform the job of interacting with chemical signals and communicating them into a biological response. Dysfunction in both types of proteins in certain cells can lead to a loss of biological control and, ultimately, a cancer. Kinases and GPCRs have entirely different chemical structures, so structural knowledge becomes crucial in any approach targeting cells where dysfunction has occurred. Thus, for this thesis, a member from each class was investigated using a combination of structural approaches: from the kinase class, the kinase Proviral Integration site for MuLV (Pim-1), and from the GPCR class, the cell membrane-bound Smoothened receptor (SMO). The kinase Pim-1 was the target of various approaches in Chapter 3. Although a heavily studied target from the mid-2000s, there is a paucity of inhibitors targeting residues more remote from structural characteristics that define kinases. Further limiting extension possibilities is that Pim-1 is constitutively active, so no inhibitors targeting an inactive state are possible. An initial project (Project 1) used the known binding properties of small molecules, or 'fragments', to elucidate structural and dynamic information useful for targeting Pim-1. This was followed by three projects, all with the goal of inhibitor discovery, all with different foci. In Project 2, fragment binding modes from Project 1 provided the basis for the extension and development of drug-like inhibitors with a focus on synthetic feasibility. In contrast, inhibitors were found in Project 3 via a large-scale public dataset of purchasable molecules that possess drug-like properties. Finally, Project 4 took the truncated form of a particularly attractive fragment from Project 1 that was crystallised with Pim-1, verified its binding mode and then generated extensions with, again, a focus on synthetic feasibility. The GPCR SMO has fewer molecular studies and much about its structural behaviour remains unknown. 
As the most 'druggable' protein in the Hedgehog pathway, structural studies have primarily focussed on stabilising its inactive state to prevent signal transduction. Allied to this is that there are generally few inhibitors for SMO and the drugs for cancers related to its dysfunction are vulnerable to mutations that significantly reduce their effectiveness or abrogate it entirely. The elucidation of structural information is therefore of high priority. An initial study attempting to identify an unknown molecule from prior experiments led to insights regarding binding characteristics of specific moieties. This was particularly important to understand not just where favourable moieties bind but also sections of the SMO binding pocket with unfavourable binding. In both subsequent virtual screens performed in Chapter 4, the primary aim was to find new drug-like inhibitors of SMO using large public datasets of commercially-available molecules. The initial screen retrieved relatively few inhibitors so the binding pocket was modified to find a structural state more amenable to small molecule binding. These modifications led to a significant number of new, chemically novel inhibitors for SMO, some structural information useful for future inhibitors and the elucidation of structure-activity relationships useful for inhibitor design. This underpins the idea that structural information is of critical importance in the discovery and design of molecular inhibitors

    Machine Learning and Solvation Theory for Drug Discovery

    Drug discovery is a notoriously expensive and time-consuming process; hence, developing computational methods to facilitate the discovery process and lower the associated costs is a long-sought goal of computational chemists. Protein-ligand binding, which provides the physical and chemical basis for the mechanism of action of most drugs, occurs in an aqueous environment, and binding affinity is determined not only by atomic interactions between the protein and ligand but also by changes in their interactions with surrounding water molecules that occur upon binding. Thus, a quantitative understanding of the roles water molecules play in the protein-ligand binding process is an essential foundation for developing computational methods and tools to aid the drug discovery process. Grid inhomogeneous solvation theory (GIST) is a tool that measures the thermodynamic and structural properties of water molecules on protein surfaces. Since its implementation, GIST has been used to study water behavior upon protein-ligand binding and to account for solvent effects in scoring functions used in virtual screening. This thesis is comprised of two research projects that extend the applications and functionality of GIST. In the first project, we investigated whether the water properties measured by GIST could improve the performance of machine learning models, specifically, convolutional neural networks (CNN) applied to virtual screening (GIST-CNN project). In the second project, we implemented the particle mesh Ewald (PME) algorithm for energy calculation in GIST, enabling GIST to become a more accurate and more efficient tool for end-state free energy calculation (PME-GIST project). The GIST-CNN project arose in response to reports indicating that convolutional neural network (CNN) models were able to outperform classical scoring functions in virtual screening. 
We noticed that all the reported machine learning models had been trained only on protein-ligand structures, while water molecules were completely neglected. Given that water molecules play essential roles in protein-ligand binding, we hypothesized that we could further improve the performance of CNN models in terms of enrichment efficiency by adding water features, measured by GIST, to the data used to train the model. Contrary to our hypothesis, we found that adding water features could not further improve the performance of a CNN model trained on protein-ligand structures, which was already very high. However, further investigation revealed that the high performance and reported enrichment efficiency of a CNN model trained on protein-ligand information was solely attributable to biases in the Database of Useful Decoys-Enhanced (DUD-E), which was used to train and test the model. In this project, we also established a suite of methods to investigate what a model learns from the input during training and argued that machine learning models should be thoroughly validated before being applied in real drug discovery projects. The motivations for the PME-GIST project were twofold. First, although GIST provides the statistical thermodynamic framework for end-state free energy calculation, inconsistencies in energy calculations between the previous GIST implementation (GIST-2016) and modern molecular dynamics engines prevent precise comparison of the GIST end-state method to other reference free energy calculation methods such as thermodynamic integration (TI). Second, the O(N^2) nonbonded energy calculation is the most expensive step in the entire GIST calculation process. 
By implementing the PME algorithm in GIST, we aimed to achieve GIST energy calculations consistent with those of modern molecular dynamics engines and to accelerate the energy calculation to O(N log N), which is highly desirable when applying GIST to the measurement of water properties across an entire protein surface. In addition to implementing PME, we derived a simple empirical estimator for high order entropies, which are truncated in GIST. After incorporating PME-based energy calculation and the high order entropy estimator, we used PME-GIST to calculate end-state solvation free energy for a wide range of small molecules and achieved results highly consistent with TI (= 0.99, mean unsigned difference = 0.44 kcal/mol). The PME-GIST code we developed in this project was integrated into the open-source molecular dynamics analysis software CPPTRAJ for easy access by others in the drug discovery community. In summary, in this thesis, we explored the potential of adding solvation thermodynamics to machine learning-based virtual screening and found that the high performance reported for machine learning models in this application reflected biases in the dataset used to construct and test them rather than successful generalization of the physical principles that govern molecular interactions. We also addressed the inconsistent energy calculation between GIST and modern molecular simulation engines by developing PME-GIST. We hope the research work presented in this thesis will further expand and accelerate the application of GIST to drug discovery
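
The quadratic cost that motivates PME in this abstract is easy to see in a direct pairwise sum. The sketch below is only an illustration of that scaling, not GIST or CPPTRAJ code: it assumes point charges in open (non-periodic) space, sets the Coulomb constant to 1 so the units are arbitrary, and uses hypothetical names throughout. PME itself, which splits the sum into a short-range part and an FFT-based reciprocal part, is not implemented here.

```python
import itertools
import math


def coulomb_energy_direct(charges, coords):
    """Direct-sum electrostatic energy and the number of pairs visited.

    charges: list of point charges (arbitrary units, Coulomb constant = 1).
    coords:  list of (x, y, z) positions, same length as charges.
    """
    energy = 0.0
    pairs = 0
    particles = zip(charges, coords)
    # Every unordered pair is visited once: N * (N - 1) / 2 terms in total.
    for (qi, ri), (qj, rj) in itertools.combinations(particles, 2):
        energy += qi * qj / math.dist(ri, rj)
        pairs += 1
    return energy, pairs


if __name__ == "__main__":
    e, n_pairs = coulomb_energy_direct([1.0, -1.0], [(0, 0, 0), (2, 0, 0)])
    print(e, n_pairs)  # prints -0.5 1
```

Doubling the number of particles roughly quadruples the pair count, which is the growth that an O(N log N) method avoids.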

    Acceleration and Verification of Virtual High-throughput Multiconformer Docking

    The work in this dissertation explores the use of massive computational power available through modern supercomputers as a virtual laboratory to aid drug discovery. As of November 2013, Tianhe-2, the fastest supercomputer in the world, has a theoretical performance peak of 54,902 TFlop/s or nearly 55 thousand trillion calculations per second. The Titan supercomputer located at Oak Ridge National Laboratory has 560,640 computing cores that can work in parallel to solve scientific problems. To harness this computational power for drug discovery, tools are developed to aid in the preparation and analysis of high-throughput virtual docking screens, which predict how, and how well, small molecules bind to disease-associated proteins and could therefore serve as novel drug candidates. Methods and software for performing large screens are developed that run on high-performance computer systems. The future potential and benefits of using these tools to study polypharmacology and to revolutionize the pharmaceutical industry are also discussed