43 research outputs found

    Using evolutionary covariance to infer protein sequence-structure relationships

    During the last half century, a deep knowledge of the actions of proteins has emerged from a broad range of experimental and computational methods. As a result, there are now many opportunities to understand how the varieties of proteins affect larger-scale behaviors of organisms, in terms of phenotypes and diseases. It is broadly acknowledged that sequence, structure and dynamics are the three essential components for understanding proteins, so learning about the relationships among them is one of the most important steps toward understanding protein mechanisms. Together with the rapid growth in the efficiency of computers, there has been a commensurate growth in the sizes of the public protein databases. The field of computational biology has undergone a paradigm shift from investigating single proteins to looking collectively at sets of related proteins and broadly across all proteins. We develop a novel approach that combines structural knowledge from the PDB and the CATH database with sequence information from the Pfam database, using co-evolution in sequences to achieve the following goals: (a) collection of co-evolution information on a large scale using protein domain family data; (b) development of novel amino acid substitution matrices that incorporate structural information; (c) detection of higher-order co-evolution correlations. The results presented here show that important gains can come from improvements to the sequence matching. The approach taken here is simple: the pair correlations in sequence have been decomposed into singlet terms, which amounts to discarding much of the correlation information itself.
The gains shown here are encouraging, and we would like to develop a sequence matching method that retains the pair (or higher-order) correlation information directly; this should be possible by developing the sequence matching separately for different domain structures. Many-body correlations in particular have the potential to shift the common perception in biology from pairs, which are not actually so very informative, to higher-order interactions. Fully understanding cellular processes will require a large body of higher-order correlation information such as has been initiated here for single proteins.
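
As an illustration of what "co-evolution in sequences" can mean in practice, the sketch below computes the mutual information between two columns of a multiple sequence alignment, a common pairwise covariance measure. The abstract does not state which measure the author used, so mutual information, the function names, and the toy alignment are all assumptions made for this example.

```python
import math
from collections import Counter


def column(msa, i):
    """Return column i of an alignment given as equal-length strings."""
    return [seq[i] for seq in msa]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b), written with raw counts to avoid extra divisions
        mi += p_ab * math.log2(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi


# Toy alignment (hypothetical data): columns 0 and 1 co-vary perfectly,
# column 2 is fully conserved and therefore carries no covariance signal.
toy_msa = ["AKL", "AKL", "DEL", "DEL"]
mi_01 = mutual_information(column(toy_msa, 0), column(toy_msa, 1))
mi_02 = mutual_information(column(toy_msa, 0), column(toy_msa, 2))
```

A pairwise score of this kind captures only two-body correlations; the higher-order correlations discussed in the abstract would need joint statistics over three or more columns.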

    Development of novel Classical and Quantum Information Theory Based Methods for the Detection of Compensatory Mutations in MSAs

    Multiple sequence alignments (MSAs) of homologous proteins are useful tools for characterizing compensatory mutations between non-conserved residues. Identifying these residues in MSAs is an important task for better understanding the structural basis and molecular mechanisms of protein function. Despite the large body of literature on compensatory mutations and on sequence conservation analysis for detecting important residues, previous methods have mostly not taken into account the biochemical properties of amino acids, which can nevertheless be decisive for detecting compensatory mutation signals. Moreover, compensatory mutation signals in MSAs are often distorted by noise. For this reason, a further problem in bioinformatics is separating significant signals from phylogenetic noise and unrelated pair signals. The aim of this work is to develop methods that integrate biochemical properties, such as similarities and dissimilarities of amino acids, into the identification of compensatory mutations and that deal with the noise. To this end, we develop different methods based on classical and quantum information theory as well as multiple testing procedures. Our first method is based on classical information theory. It mainly considers BLOSUM62-dissimilar pairs of amino acids as a model of compensatory mutations and integrates them into the identification of important residues. To complement this method, we develop our second method using the foundations of quantum information theory. This new method differs from the first by simultaneously modeling similar and dissimilar signals in the compensatory mutation analysis.
Furthermore, to separate significant signals from noise, we develop an MSA-specific statistical model based on multiple testing procedures. We apply our methods to two human proteins, namely epidermal growth factor receptor (EGFR) and glucokinase (GCK). The results show that the MSA-specific statistical model can separate the significant signals from phylogenetic noise and from unrelated pair signals. Considering only BLOSUM62-dissimilar pairs of amino acids, the first method successfully identifies the disease-associated important residues of both proteins. In contrast, by simultaneously modeling similar and dissimilar signals of amino acid pairs, the second method is more sensitive for identifying catalytic and allosteric residues.

    Disentangling the 4D Nucleome

    The dynamical relationship between 3D genome structure, genome function, and cellular phenotype is referred to as the 4D Nucleome (4DN). 4DN analysis remains difficult, since multiple data modalities must be integrated and comprehensively studied in order to obtain new insights. In my dissertation work, I present a computational toolbox which offers both novel and established methods to integrate and analyze time series genome structure and function data. I also provide an extension of the 4DN that captures the contributions of the maternal and paternal genomes. I uncover differences between the two genomes’ structural and functional features across the cell cycle, and reveal an allele-specific relationship between local genome structures and gene expression. In addition, I present a computational framework for analyzing multi-way genomic interactions, which allows us to identify transcription clusters in the human genome. Finally, I introduce a computational method to characterize the differences between memory and plasma B cells in the adaptive immune system, which guides us in developing an immune-system-inspired learning system. PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/169908/1/lindsly_1.pd

    The Role of Mutations in Protein Structural Dynamics and Function: A Multi-scale Computational Approach

    Proteins are a fundamental unit in biology. Although proteins have been extensively studied, there is still much to investigate. The mechanism by which proteins fold into their native state, how evolution shapes structural dynamics, and the dynamic mechanisms of many diseases are not well understood. In this thesis, protein folding is explored using a multi-scale modeling method including (i) geometric constraint-based simulations that efficiently search for native-like topologies and (ii) reservoir replica exchange molecular dynamics, which identifies the low free energy structures and refines these structures toward the native conformation. A test set of eight proteins and three ancestral steroid receptor proteins are folded to 2.7 Å all-atom RMSD from their experimental crystal structures. Protein evolution and disease associated mutations (DAMs) are most commonly studied by in silico multiple sequence alignment methods. Here, however, the structural dynamics are incorporated to give insight into the evolution of three ancestral proteins and the mechanism of several diseases in human ferritin protein. The differences in conformational dynamics of these evolutionarily related, functionally diverged ancestral steroid receptor proteins are investigated by obtaining the most collective motion through essential dynamics. Strikingly, this analysis shows that evolutionarily diverged proteins of the same family do not share the same dynamic subspace. Rather, those sharing the same function are simultaneously clustered together and distant from those functionally diverged homologs. This dynamics analysis also identifies 77% of mutations (functional and permissive) necessary to evolve new function. In silico methods for prediction of DAMs rely on differences in evolution rate due to purifying selection, and therefore the accuracy of DAM prediction decreases at fast and slow evolvable sites.
Here, we investigate structural dynamics by computing the contribution of each residue to the biologically relevant fluctuations and from this define a metric: the dynamic stability index (DSI). Using DSI, we study the mechanism for three diseases observed in the human ferritin protein. The T30I and R40G DAMs show a loss of dynamic stability at the C-terminus helix and nearby regulatory loop, agreeing with experimental results implicating the same regulatory loop as a cause in cataracts syndrome. Dissertation/Thesis, Ph.D. Physics 201

    Doctor of Philosophy

    Rapidly evolving technologies such as chip arrays and next-generation sequencing are uncovering human genetic variants at an unprecedented pace. Unfortunately, this ever-growing collection of gene sequence variation has limited clinical utility without clear association to disease outcomes. As electronic medical records begin to incorporate genetic information, gene variant classification and accurate interpretation of gene test results play a critical role in customizing patient therapy. To verify the functional impact of a given gene variant, laboratories rely on confirming evidence such as previous literature reports, patient history and disease segregation in a family. By definition, variants of uncertain significance (VUS) lack this supporting evidence, and in such cases computational tools are often used to evaluate the predicted functional impact of a gene mutation. This study evaluates the use of high-quality genotype-phenotype disease variant data from 20 genes and 3986 variants to develop gene-specific predictors that combine changes in primary amino acid sequence, amino acid properties as descriptors of mutation severity, and Naïve Bayes classification. A Primary Sequence Amino Acid Properties (PSAAP) prediction algorithm was then combined with well-established predictors in a weighted Consensus sum in the context of gene-specific reference intervals for known phenotypes. PSAAP and Consensus were also used to evaluate known variants of uncertain significance in the RET proto-oncogene as a model gene. The PSAAP algorithm was successfully extended to many genes and diseases. Gene-specific algorithms typically outperform generalized prediction tools. Characteristic mutation properties of a given gene and disease may be lost when diluted into genome-wide data sets.
A reliable computational phenotype classification framework with quantitative metrics and disease-specific reference ranges allows objective evaluation of novel or uncertain gene variants and augments decision making when confirming clinical information is limited.

    The intrinsic dimension of biological data landscapes

    Analyzing large volumes of high-dimensional data is an issue of fundamental importance in science and beyond. Several approaches work on the assumption that the important content of a dataset belongs to a manifold whose Intrinsic Dimension (ID) is much lower than the raw number of coordinates. That manifold, however, is generally twisted and curved; in addition, points on it will be non-uniformly distributed: two factors that make the identification of the ID and its exploitation difficult. Here we propose a new ID estimator using only the distances to the first and the second nearest neighbor of each point in the sample. This extreme minimality enables us to reduce the effects of curvature and of density variation, as well as the resulting computational cost. The ID estimator is theoretically exact in uniformly distributed data sets, and provides consistent measures in general. When used in combination with block analysis, it allows discriminating the relevant dimensions as a function of the block size. This allows estimating the ID even when the data lie on a manifold perturbed by high-dimensional noise, a situation often encountered in real-world data sets. Upon defining a notion of distance between protein sequences, this tool is used to estimate the ID of protein families, and to assess the consistency of generative models. Moreover, if coupled with a density estimator, our ID allows measuring the density of points by taking into account the space in which they actually lie, thus allowing for a cleaner estimation. Here we move a step further towards an automatic classification of protein sequences by using three new tools: our ID estimator, a density estimator and a clustering algorithm. We present the analysis performed on a Pfam PUA clan, showing that these combined tools successfully separate protein domains into architectures.
Finally, we present a generalized model for the estimation of the ID that is able to work in data sets with multiple dimensionalities: taking advantage of Bayesian inference techniques, the method allows discriminating manifolds with different dimensions as well as assigning all the points to the respective manifolds. We test the method on a molecular dynamics trajectory, showing that the folded state has a higher dimension than the unfolded one.
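
The two-nearest-neighbor idea described in this abstract can be sketched briefly. If the data are locally uniform on a manifold of dimension d, the ratio of the second to the first nearest-neighbor distance follows a Pareto law with exponent d, so d can be estimated from the logarithms of those ratios. The maximum-likelihood form used below, the function name, and the synthetic test data are assumptions made for this example; the abstract does not say how the author fits the ratios, and the block analysis and Bayesian extensions are not reproduced here.

```python
import math
import random


def two_nn_dimension(points):
    """Estimate intrinsic dimension from second/first neighbor distance ratios.

    Brute-force O(N^2) neighbor search; assumes at least three distinct points.
    """
    log_ratios = []
    for i, p in enumerate(points):
        r1 = r2 = math.inf  # distances to the first and second nearest neighbor
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < r1:
                r1, r2 = d, r1
            elif d < r2:
                r2 = d
        if r1 > 0:  # skip exact duplicates, whose ratio is undefined
            log_ratios.append(math.log(r2 / r1))
    # Maximum-likelihood estimate for a Pareto-distributed ratio
    return len(log_ratios) / sum(log_ratios)


# Synthetic check (hypothetical data): a 2D plane linearly embedded in 5D.
random.seed(0)
sample = []
for _ in range(400):
    x, y = random.random(), random.random()
    sample.append((x, y, x + y, x - y, 2.0 * x))
estimate = two_nn_dimension(sample)
```

Although each point has five coordinates, the estimate should come out close to 2, the dimension of the embedded plane, since only two coordinates vary independently.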

    Computational Approaches To Anti-Toxin Therapies And Biomarker Identification

    This work describes the fundamental study of two bacterial toxins with computational methods, the rational design of a potent inhibitor using molecular dynamics, and the development of two bioinformatic methods for mining genomic data. Clostridium difficile is an opportunistic bacillus which produces two large glucosylating toxins. These toxins, TcdA and TcdB, cause severe intestinal damage. As Clostridium difficile harbors considerable antibiotic resistance, one treatment strategy is to prevent the tissue damage that the toxins cause. The catalytic glucosyltransferase domain of TcdA and TcdB was studied using molecular dynamics in the presence of both a protein-protein binding partner and several substrates. These experiments were combined with lead optimization techniques to create a potent irreversible inhibitor which protects 95% of cells in vitro. Dynamics studies on a TcdB cysteine protease domain were performed to identify an allosteric communication pathway. Comparative analysis of the static and dynamic properties of the TcdA and TcdB glucosyltransferase domains was carried out to determine the basis for the differential lethality of these toxins. Large-scale biological data is readily available in the post-genomic era, but it can be difficult to use that data effectively. Two bioinformatics methods were developed to process whole-genome data. Software was developed to return all genes containing a motif in a single genome. This provides a list of genes which may be within the same regulatory network or targeted by a specific DNA binding factor. A second bioinformatic method was created to link the data from genome-wide association studies (GWAS) to specific genes. GWAS results are frequently subjected to statistical analysis, but mutations are rarely investigated structurally. HyDn-SNP-S allows a researcher to find mutations in a gene that correlate with a GWAS-studied phenotype.
Across human DNA polymerases, this resulted in strongly predictive haplotypes for breast and prostate cancer. Molecular dynamics applied to DNA Polymerase Lambda suggested a structural explanation for the decrease in polymerase fidelity with that mutant. When applied to Histone Deacetylases, mutations were found that alter substrate binding and post-translational modification.
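
The first bioinformatic method in this abstract, returning all genes that contain a motif in a single genome, can be illustrated with a minimal sketch. The abstract does not describe the software's interface or matching rules, so the function name, the dictionary input format, the regular-expression motif syntax, and the toy sequences below are all hypothetical; only the forward strand is scanned.

```python
import re


def genes_with_motif(genes, motif):
    """Map each gene name to the start positions of all motif matches.

    `genes` maps gene names to DNA sequences; `motif` is a regular expression.
    A zero-width lookahead is used so that overlapping matches are reported.
    """
    pattern = re.compile(f"(?=(?:{motif}))")
    hits = {}
    for name, sequence in genes.items():
        positions = [m.start() for m in pattern.finditer(sequence.upper())]
        if positions:
            hits[name] = positions
    return hits


# Toy sequences (hypothetical data) scanned for a TATA-box-like motif.
toy_genes = {
    "geneA": "GGTATAAAGC",
    "geneB": "CCCCGGGG",
    "geneC": "ttatatatacc",
}
hits = genes_with_motif(toy_genes, "TATA[AT]A")
```

Genes without a match are simply omitted from the result, so the keys of `hits` form the candidate gene list that the abstract describes.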

    Biological Systems Workbook: Data modelling and simulations at molecular level

    Huge quantities of data from experiments and theoretical simulations now surround the different fields of biology, and the results are often stored in biological databases that grow at a dizzying rate every year. There is therefore increasing research interest in applying mathematical and physical models able to produce reliable predictions and explanations that help to understand and rationalize that information. These investigations are helping to answer biological questions and to advance the solution of problems faced by our society. In this Biological Systems Workbook, we aim to introduce the basic pieces that allow life to take place, from the 3D structural point of view. We will start by learning how to look at the 3D structure of molecules, studying small organic molecules used as drugs. Meanwhile, we will learn some methods that help us to generate models of these structures. Then we will move to more complex natural organic molecules such as lipids or carbohydrates, learning how to estimate and reproduce their dynamics. Later, we will review the structure of more complex macromolecules such as proteins or DNA. Throughout this process, we will refer to different computational tools and databases that will help us to search, analyze and model the different molecular systems studied in this course.

    The characterization of GTP Cyclohydrolase I and 6-Pyruvoyl Tetrahydropterin Synthase enzymes as potential anti-malarial drug targets

    Malaria remains a public health problem and a high burden of disease, especially in developing countries. The unicellular protozoan malaria parasite of the genus Plasmodium infects about a quarter of a billion people annually, with an estimated 409 000 deaths. The majority of malaria cases occur in Africa; hence, the region is regarded as endemic for malaria. Global efforts to eradicate the disease led to a decrease in morbidity and mortality rates. However, an enormous burden of malaria infection remains, and it cannot go unnoticed. Countries with limited resources are more affected by the disease, mainly in their public health and socio-economic development, due to many factors besides malaria itself, such as lack of access to adequate, affordable treatments and preventative regimes. Furthermore, the current antimalarial drugs are losing their efficacy because of parasite drug resistance. This resistance has reduced drug efficacy in clearing the parasite from the host system, causing prolonged illness and a higher risk of death. It has therefore hindered the global efforts for malaria control and elimination and established an urgent need for new treatment strategies. When resistance against classical antimalarial drugs emerged, the class of antifolate antimalarial medicines became the most common alternative. The antifolate antimalarial drugs target the malaria parasite de novo folate biosynthesis pathway by limiting folate derivatives, which are essential for parasite cell growth and survival. Yet again, the malaria parasite developed resistance against the available antifolate drugs, rendering the drugs ineffective in many cases. Given the previous success in targeting the malaria parasite de novo folate biosynthesis pathway, alternative enzymes within this pathway stand as good targets and can be explored to develop new antifolate drugs with novel mechanisms of action.
The primary focus of this thesis is to contribute to the existing and growing knowledge of antimalarial drug discovery. The study aims to characterise the malaria parasite de novo folate synthesis pathway enzymes guanosine-5'-triphosphate (GTP) cyclohydrolase I (GCH1) and 6-pyruvoyl tetrahydropterin synthase (PTPS) as alternative drug targets for malaria treatment by using computational approaches. It further aims to discover new allosteric drug targeting sites within the two enzymes' 3D structures for future drug design and discovery. Sequence and structural analyses were carried out to characterise and pinpoint the two enzymes' unique sequence and structure-based features. From the analyses, key sequence and structure differences were identified between the malaria parasite enzymes and their human homologs; the identified sites can aid significantly in designing and developing new antimalarial antifolate drugs with good selectivity toward the parasites’ enzymes. GCH1 and PTPS contain a catalytically essential metal ion in their active site; therefore, force field parameters were needed to study their active sites accurately during all-atom molecular dynamics (MD) simulations. The force field parameters were derived through quantum mechanics potential energy surface scans of the metal's bonded terms and evaluated via all-atom MD simulations. Protein structural dynamics is imperative for many biological processes; thus, it is essential to consider the structural dynamics of proteins when seeking to understand their function. In this regard, the normal mode analysis (NMA) approach based on the elastic network model (ENM) was employed to study the intrinsic dynamics and conformational changes of the GCH1 and PTPS enzymes. The NMA disclosed essential structural information about the proteins’ intrinsic dynamics and the mechanism of allosteric modulation of their binding properties, further highlighting regions that govern their conformational changes.
The analysis also disclosed hotspot residues that are crucial for the proteins' fold stability and function. The NMA was further combined with sequence motif results and showed that conserved residues of GCH1 and PTPS were located within the identified key structural sites modulating the proteins' conformational rearrangement. The characterized structural features and hotspot residues were regarded as potential allosteric sites of important value for the design and development of allosteric drugs. Neither GCH1 nor PTPS has been targeted before, and both can provide an excellent opportunity to overcome the antimalarial antifolate drug resistance problem. The data presented in this thesis contribute to the understanding of the sequence, structure, and global dynamics of both GCH1 and PTPS, and further disclose potential allosteric drug targeting sites and unique structural features of both enzymes that can establish a solid starting point for the design and development of new antimalarial drugs with novel mechanisms of action. Lastly, the reported force field parameters will be of value for MD simulations in future in silico drug discovery studies involving the two enzymes and other enzymes with the same Zn2+ binding motifs and coordination environments. This research can facilitate the discovery of new effective antimalarial medicines with novel mechanisms of action. Thesis (PhD) -- Faculty of Science, Biochemistry and Microbiology, 202