621 research outputs found

    Three mathematical issues in reconstructing ancestral genome

    Ph.D., Doctor of Philosophy

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Since their emergence around 2006, Next-Generation Sequencing (NGS) technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long reads of DNA sequence from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computer capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm, and little attention has been paid to non-standard mapping approaches. Here, we propose so-called dynamic mapping, which we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping exploits the information from previously computed alignments to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program that collects alignment statistics and guides updates of the reference in an online fashion. We provide Ococo, the first online consensus caller, which maintains statistics for individual genomic positions using compact bit counters.
Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk. Metagenomic classification of NGS reads is another major topic studied in the thesis. Given a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge number of NGS reads to tree nodes, and possibly to estimate the relative abundance of the species involved. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, which yields a much smaller and more informative index than Kraken's. We provide a modified version of BWA that improves the BWT-index for quick k-mer look-up.
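
The online consensus calling idea in the abstract above can be illustrated with a minimal sketch. It is not Ococo's implementation: it uses plain integer counters rather than compact bit counters, handles only ungapped alignments, and the class name `OnlineConsensus` and the thresholds `min_coverage` and `min_fraction` are hypothetical choices made for illustration.

```python
NUCLEOTIDES = "ACGT"


class OnlineConsensus:
    """Toy online consensus caller keeping per-position nucleotide counts."""

    def __init__(self, reference, min_coverage=3, min_fraction=0.6):
        self.reference = list(reference)
        self.min_coverage = min_coverage    # assumed threshold, not from the source
        self.min_fraction = min_fraction    # assumed threshold, not from the source
        # One small counter table per reference position.
        self.counts = [dict.fromkeys(NUCLEOTIDES, 0) for _ in reference]

    def add_alignment(self, start, read):
        """Record an ungapped alignment at 0-based `start`; return reference updates."""
        updates = []
        for offset, base in enumerate(read):
            pos = start + offset
            if pos >= len(self.reference) or base not in NUCLEOTIDES:
                continue
            self.counts[pos][base] += 1
            called = self._call(pos)
            if called is not None and called != self.reference[pos]:
                # The reference is updated immediately, so later reads
                # are compared against the corrected sequence.
                updates.append((pos, self.reference[pos], called))
                self.reference[pos] = called
        return updates

    def _call(self, pos):
        """Return the majority base if coverage and support suffice, else None."""
        total = sum(self.counts[pos].values())
        if total < self.min_coverage:
            return None
        base, n = max(self.counts[pos].items(), key=lambda kv: kv[1])
        return base if n / total >= self.min_fraction else None


caller = OnlineConsensus("ACGTACGT")
for _ in range(3):
    changes = caller.add_alignment(0, "ACTT")
print(changes, "".join(caller.reference))
```

Each alignment is processed as it arrives and nothing is stored beyond the counters, which is what allows variant calling from a stream without saving alignments on disk.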

    Modern Computing Techniques for Solving Genomic Problems

    With the advent of high-throughput genomics, biological big data brings challenges to scientists in handling, analyzing, processing and mining this massive data. In this new interdisciplinary field, diverse theories, methods, tools and knowledge are utilized to solve a wide variety of problems. As an exploration, this dissertation project is designed to combine concepts and principles from multiple areas, including signal processing, information-coding theory, artificial intelligence and cloud computing, in order to solve the following problems in computational biology: (1) comparative gene structure detection, (2) DNA sequence annotation, (3) investigation of CpG islands (CGIs) for epigenetic studies. Briefly, in problem #1, sequences are transformed into signal series or binary codes. As in speech/voice recognition, similarity is calculated between two signal series and the signals are subsequently stitched/matched into a temporal sequence. Because the operations are binary, all calculations/steps can be performed efficiently and accurately. Improving performance in terms of accuracy and specificity is the key for a comparative method. In problem #2, DNA sequences are encoded and transformed into numeric representations for deep learning methods. Encoding schemes greatly influence the performance of deep learning algorithms, so finding the best encoding scheme for a particular application of deep learning is significant. Three applications (detection of protein-coding splicing sites, detection of lincRNA splicing sites and improvement of comparative gene structure identification) are used to show the computing power of deep neural networks. In problem #3, CpG sites are assigned a certain energy and a Gaussian filter is applied to detect CpG islands. By using the CpG box and a Markov model, we investigate the properties of CGIs and redefine the CGIs using the emerging epigenetic data.
In summary, these three problems and their solutions are not isolated; they are linked to modern techniques in such diverse areas as signal processing, information-coding theory, artificial intelligence and cloud computing. These novel methods are expected to improve the efficiency and accuracy of computational tools and bridge the gap between biology and scientific computing.
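
The Gaussian-filter approach to CpG island detection mentioned for problem #3 can be sketched roughly as follows. The abstract does not give the energy assignment, kernel width, or threshold, so everything here is an assumption: each CpG dinucleotide gets unit weight, and the function names, `sigma`, `radius`, and `threshold` are hypothetical.

```python
import math


def cpg_signal(seq):
    """Return 1.0 at every position where a CpG dinucleotide starts, else 0.0."""
    seq = seq.upper()
    return [1.0 if seq[i:i + 2] == "CG" else 0.0 for i in range(len(seq))]


def gaussian_kernel(sigma, radius):
    """Discrete Gaussian weights on [-radius, radius], normalized to sum to 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, kernel):
    """Convolve the signal with the kernel, treating out-of-range values as 0."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def candidate_regions(smoothed, threshold):
    """Return half-open (start, end) intervals where the smoothed signal >= threshold."""
    regions, start = [], None
    for i, value in enumerate(smoothed):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(smoothed)))
    return regions


seq = "AT" * 20 + "CG" * 10 + "AT" * 20
smoothed = smooth(cpg_signal(seq), gaussian_kernel(sigma=3.0, radius=9))
print(candidate_regions(smoothed, threshold=0.25))
```

Smoothing turns isolated CpG hits into a density profile, so a threshold picks out CpG-rich stretches rather than individual sites; the thesis additionally uses a CpG box and a Markov model, which this sketch does not attempt.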

    The Role of Mutations in Protein Structural Dynamics and Function: A Multi-scale Computational Approach

    Proteins are a fundamental unit in biology. Although proteins have been extensively studied, there is still much to investigate. The mechanism by which proteins fold into their native state, how evolution shapes structural dynamics, and the dynamic mechanisms of many diseases are not well understood. In this thesis, protein folding is explored using a multi-scale modeling method including (i) geometric-constraint-based simulations that efficiently search for native-like topologies and (ii) reservoir replica exchange molecular dynamics, which identifies the low free energy structures and refines these structures toward the native conformation. A test set of eight proteins and three ancestral steroid receptor proteins are folded to 2.7 Å all-atom RMSD from their experimental crystal structures. Protein evolution and disease-associated mutations (DAMs) are most commonly studied by in silico multiple sequence alignment methods. Here, however, the structural dynamics are incorporated to give insight into the evolution of three ancestral proteins and the mechanism of several diseases in human ferritin protein. The differences in conformational dynamics of these evolutionarily related, functionally diverged ancestral steroid receptor proteins are investigated by obtaining the most collective motion through essential dynamics. Strikingly, this analysis shows that evolutionarily diverged proteins of the same family do not share the same dynamic subspace. Rather, those sharing the same function are clustered together and distant from their functionally diverged homologs. This dynamics analysis also identifies 77% of mutations (functional and permissive) necessary to evolve new function. In silico methods for prediction of DAMs rely on differences in evolution rate due to purifying selection, and therefore the accuracy of DAM prediction decreases at fast and slow evolvable sites.
Here, we investigate structural dynamics by computing the contribution of each residue to the biologically relevant fluctuations, and from this define a metric: the dynamic stability index (DSI). Using DSI, we study the mechanism for three diseases observed in the human ferritin protein. The T30I and R40G DAMs show a loss of dynamic stability at the C-terminus helix and nearby regulatory loop, agreeing with experimental results implicating the same regulatory loop as a cause in cataracts syndrome. Dissertation/Thesis, Ph.D. Physics 201

    Evolutionary genomics : statistical and computational methods

    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward.

    Mass & secondary structure propensity of amino acids explain their mutability and evolutionary replacements

    Why is an amino acid replacement in a protein accepted during evolution? The answer given by bioinformatics relies on the frequency of change of each amino acid by another one and the propensity of each to remain unchanged. We propose that these replacement rules are recoverable from the secondary structural trends of amino acids. A distance measure between high-resolution Ramachandran distributions reveals that structurally similar residues coincide with those found in substitution matrices such as BLOSUM: Asn ↔ Asp, Phe ↔ Tyr, Lys ↔ Arg, Gln ↔ Glu, Ile ↔ Val, Met → Leu; with Ala, Cys, His, Gly, Ser, Pro, and Thr as structurally idiosyncratic residues. We also found a high average correlation ($\overline{R} = 0.85$) between thirty amino acid mutability scales and the mutational inertia ($I_X$), which measures the energetic cost weighted by the number of observations at the most probable amino acid conformation. These results indicate that amino acid substitutions follow two optimally efficient principles: (a) amino acid interchangeability privileges secondary structural similarity, and (b) amino acid mutability depends directly on biosynthetic energy cost and inversely on frequency. These two principles are the underlying rules governing the observed amino acid substitutions. © 2017 The Author(s).

    Abstracts of Papers, 86th Annual Meeting of the Virginia Academy of Science

    Abstracts for the 86th Annual Meeting of the Virginia Academy of Science, May 20-23, 2008, Hampton University, Hampton, VA

    Lead optimization for new antimalarials and Successful lead identification for metalloproteinases: A Fragment-based approach Using Virtual Screening

    Computer-aided drug design is an essential part of modern medicinal chemistry and has accelerated many projects. This thesis presents examples of its application to lead optimization and lead identification for three metalloproteins. DOXP-reductoisomerase (DXR) is a key enzyme of the mevalonate-independent isoprenoid biosynthesis. Structure-activity relationships for 43 DXR inhibitors are established, derived from protein-based docking, ligand-based 3D QSAR and a combination of both approaches as realized by AFMoC. As part of an effort to optimize the properties of the established inhibitor Fosmidomycin, analogues have been synthesized and tested to gain further insights into the primary determinants of structural affinity. Unfortunately, these structures still leave the active Fosmidomycin conformation and detailed reaction mechanism undetermined. This fact, together with the small inhibitor data set, poses a major challenge for presently available docking programs and 3D QSAR tools. Using the recently developed protein-tailored scoring protocol AFMoC, precise prediction of binding affinities for related ligands, as well as the capability to estimate the affinities of structurally distinct inhibitors, has been achieved. Farnesyltransferase is a zinc metalloenzyme that catalyzes the posttranslational modification of numerous proteins involved in intracellular signal transduction. The development of farnesyltransferase inhibitors is directed towards the so-called non-thiol inhibitors because of adverse drug effects connected to free thiols. A first step on the way to non-thiol farnesyltransferase inhibitors was the development of a CAAX-benzophenone peptidomimetic based on a pharmacophore model.
On its basis, bisubstrate analogues were developed as one class of non-thiol farnesyltransferase inhibitors. In further studies, two aryl binding sites and two distinct specificity sites were postulated. Flexible docking of model compounds was applied to investigate the sub-pockets and design highly active non-thiol farnesyltransferase inhibitors. In addition to affinity, special attention was paid to in vivo activity and species specificity. The second part of this thesis describes a possible strategy for computer-aided lead discovery. Assembling a complex ligand from simple fragments has recently been introduced as an alternative to traditional HTS. While frequently applied experimentally, only a few examples are known for computational fragment-based approaches. Mostly, computational tools are applied to compile the libraries and to finally assess the assembled ligands. Using the metalloproteinase thermolysin (TLN) as a model target, a computational fragment-based screening protocol has been established. Starting with a data set of commercially available chemical compounds, a fragment library has been compiled considering (1) fragment likeness and (2) similarity to known drugs. The library is screened for target specificity, resulting in 112 fragments targeting the zinc binding area and 75 fragments targeting the hydrophobic specificity pocket of the enzyme. After analyzing the performance of multiple docking programs and scoring functions for fragments in general, the fragments derived from the screening are docked and the top 14 candidates are selected for further analysis. Soaking experiments were performed for a reference fragment to derive a generally applicable crystallization protocol for TLN and subsequently for new protein-fragment complex structures. 3-Methylaspirin could be determined to bind to TLN. Additional studies addressed a retrospective performance analysis of the applied scoring functions and modifications of the screening hit.
To explore the differences between aspirin and 3-methylaspirin, 3-chloroaspirin has been synthesized, and the affinities could be determined to be 2.42 mM, 1.73 mM, and 522 μM, respectively. The results of the thesis show that computer-aided drug design approaches could successfully support projects in lead optimization and lead identification.