281 research outputs found

    Modeling the spatio-temporal organization and segregation of bacterial chromosomes

    This work examined the spatio-temporal organization and segregation of bacterial DNA in order to investigate the fundamental processes regulating the inheritance of genetic material and the proliferation of life. Fundamental physical principles were used to investigate the spatio-temporal organization of genetic material in the cell. The aim was to use concepts of polymer physics to formulate physical models of the complex biological reality. These models were evaluated in computer simulations and compared with experimental data. In the first project of this thesis, the spatial organization of DNA in multipartite bacteria (bacteria with multiple replicons) was investigated. The results reveal a high degree of spatial organization even in multipartite bacteria. The organization could be reproduced using a physical model of compacted DNA and geometric constraints on individual genes. Furthermore, the developed model made it possible to make accurate predictions for different mutants and to predict interactions between replicons. The second project focused on the simultaneous replication and segregation of bacterial DNA. Segregation patterns of the ori were analyzed in the model organism Bacillus subtilis. Using Molecular Dynamics simulations, it was shown that entropic segregation of chromosomes is a plausible mechanism for the segregation of genetic material that would also explain the observed variability in the experimental data. The model of entropic segregation of bacterial chromosomes was extended in the third project by implementing additional segregation mechanisms, so that a large data set of different trajectories of the ori through the cell could be generated. Machine learning models could thus be used to classify the different segregation movements.
The evaluation of the predictions showed very good results and encourages future classification of experimental data based on the developed models. This work is intended to provide new perspectives on the organization of DNA in the bacterial cell as well as a better understanding of the physical basis of cellular processes.

    Laboratory Directed Research and Development Program Activities for FY 2007.


    BIOINFORMATIC TOOLS FOR NEXT GENERATION GENOMICS

    New sequencing strategies have redefined the concept of “high-throughput sequencing”, and many companies, researchers, and recent reviews use the term “Next-Generation Sequencing” (NGS) instead of high-throughput sequencing. These advances have introduced a new era in genomics and bioinformatics. During my years as a PhD student I developed various software tools, algorithms and procedures for the analysis of Next-Generation Sequencing data required for distinct biological research projects and collaborations in which our research group was involved. The tools and algorithms are thus presented in their appropriate biological contexts. Initially I dedicated myself to the development of scripts and pipelines which were used to assemble and annotate the mitochondrial genome of the model plant Vitis vinifera. The sequence was subsequently used as a reference to study the RNA editing of mitochondrial transcripts, using data produced by the Illumina and SOLiD platforms. I subsequently developed a new approach and a new software package for the detection of relatively small indels between a donor and a reference genome, using NGS paired-end (PE) data and machine learning algorithms. I was able to show that suitable paired-end data, contrary to previous assertions, can be used to detect, with high confidence, very small indels in low-complexity genomic contexts. Finally, I participated in a project aimed at the reconstruction of the genomic sequences of two distinct strains of the biotechnologically relevant fungus Fusarium. In this context I performed the sequence assembly to obtain the initial contigs, and devised and implemented a new scaffolding algorithm which has proved to be particularly efficient.

    Bayesian machine learning methods for predicting protein-peptide interactions and detecting mosaic structures in DNA sequences alignments

    Short well-defined domains known as peptide recognition modules (PRMs) regulate many important protein-protein interactions involved in the formation of macromolecular complexes and biochemical pathways. High-throughput experiments like yeast two-hybrid and phage display are expensive and intrinsically noisy; it would therefore be desirable to target informative interactions and pursue in silico approaches. We propose a probabilistic discriminative approach for predicting PRM-mediated protein-protein interactions from sequence data. The model suffered from over-fitting, so Laplacian regularisation was found to be important in achieving a reasonable generalisation performance. A hybrid approach yielded the best performance, where the binding site motifs were initialised with the predictions of a generative model. We also propose another discriminative model which can be applied to all sequences present in the organism at a significantly lower computational cost. This is due to its additional assumption that the underlying binding sites tend to be similar. It is difficult to distinguish between the binding site motifs of the PRM due to the small number of instances of each binding site motif. However, closely related species are expected to share similar binding sites, which would be expected to be highly conserved. We investigated rate variation along DNA sequence alignments, modelling confounding effects such as recombination. Traditional approaches to phylogenetic inference assume that a single phylogenetic tree can represent the relationships and divergences between the taxa. However, taxa sequences exhibit varying levels of conservation, e.g. due to regulatory elements and active binding sites, and certain bacteria and viruses undergo interspecific recombination. We propose a phylogenetic factorial hidden Markov model to infer recombination and rate variation.
We examined the performance of our model and inference scheme on various synthetic alignments, and compared it to state-of-the-art breakpoint models. We investigated three DNA sequence alignments: one of maize actin genes, one bacterial (Neisseria), and the other of HIV-1. Inference is carried out in the Bayesian framework, using Reversible Jump Markov Chain Monte Carlo.

    Proceedings of Abstracts, School of Physics, Engineering and Computer Science Research Conference 2022

    © 2022 The Author(s). This is an open-access work distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. For further details please see https://creativecommons.org/licenses/by/4.0/. Plenary by Prof. Timothy Foat, ‘Indoor dispersion at Dstl and its recent application to COVID-19 transmission’ is © Crown copyright (2022), Dstl. This material is licensed under the terms of the Open Government Licence except where otherwise stated. To view this licence, visit http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3 or write to the Information Policy Team, The National Archives, Kew, London TW9 4DU, or email: [email protected]. The present proceedings record the abstracts submitted and accepted for presentation at SPECS 2022, the second edition of the School of Physics, Engineering and Computer Science Research Conference, which took place online on 12 April 2022.

    MACHINE LEARNING AND BIOINFORMATIC INSIGHTS INTO KEY ENZYMES FOR A BIO-BASED CIRCULAR ECONOMY

    The world is presently faced with a sustainability crisis; it is becoming increasingly difficult to meet the energy and material needs of a growing global population without depleting and polluting our planet. Greenhouse gases released from the continuous combustion of fossil fuels engender accelerated climate change, and plastic waste accumulates in the environment. There is a need for a circular economy, where energy and materials are renewably derived from waste items, rather than by consuming limited resources. Deconstruction of the recalcitrant linkages in natural and synthetic polymers is crucial for a circular economy, as deconstructed monomers can be used to manufacture new products. In Nature, organisms utilize enzymes for the efficient depolymerization and conversion of macromolecules. Consequently, by employing enzymes industrially, biotechnology holds great promise for energy- and cost-efficient conversion of materials for a circular economy. However, there is a need for enhanced molecular-level understanding of enzymes to enable economically viable technologies that can be applied on a global scale. This work is a computational study of key enzymes that catalyze important reactions that can be utilized for a bio-based circular economy. Specifically, bioinformatics and data-mining approaches were employed to study family 7 glycoside hydrolases (GH7s), which are the principal enzymes in Nature for deconstructing cellulose to simple sugars; a cytochrome P450 enzyme (GcoA) that catalyzes the demethylation of lignin subunits; and MHETase, a tannase-family enzyme utilized by the bacterium Ideonella sakaiensis in the degradation and assimilation of polyethylene terephthalate (PET).
Since enzyme function is fundamentally dependent on the primary amino-acid sequence, we hypothesize that machine-learning algorithms can be trained on an ensemble of functionally related enzymes to reveal functional patterns in the enzyme family, and to map the primary sequence to enzyme function such that functional properties can be predicted for a new enzyme sequence with significant accuracy. We find that supervised machine learning identifies important residues for processivity and accurately predicts functional subtypes and domain architectures in GH7s. Bioinformatic analyses revealed conserved active-site residues in GcoA and informed protein engineering that enabled expanded enzyme specificity and improved activity. Similarly, bioinformatic studies and phylogenetic analysis provided evolutionary context and identified crucial residues for MHET-hydrolase activity in a tannase-family enzyme (MHETase). Lastly, we developed machine-learning models to predict enzyme thermostability, allowing for high-throughput screening of enzymes that can catalyze reactions at elevated temperatures. Altogether, this work provides a solid basis for a computational data-driven approach to understanding, identifying, and engineering enzymes for biotechnological applications towards a more sustainable world.

    CaTCHing the functional and structural properties of chromosome folding

    Proper development requires that genes are expressed at the right time, in the right tissue, and at the right transcriptional level. In metazoans, this involves long-range cis-regulatory elements such as enhancers, which can be located up to hundreds of kilobases away from their target promoters. How enhancers find their target genes and avoid aberrant interactions with non-target genes is currently under intense investigation. The predominant model for enhancer function involves direct physical looping between the enhancer and the target promoter. The three-dimensional organization of chromatin, which accommodates promoter-enhancer interactions, might therefore play an important role in the specificity of these interactions. In the last decade, the development of a class of techniques called chromosome conformation capture (3C) and its derivatives has revolutionized the field of chromatin folding. In particular, the genome-wide version of 3C, Hi-C, revealed that mammalian chromosomes possess a rich hierarchy of folding layers, from multi-megabase compartments corresponding to mutually exclusive associations of active and inactive chromatin to topologically associating domains (TADs), which reflect regions with preferential internal interactions. Although the mechanisms that give rise to this hierarchy are still poorly understood, there is increasing evidence to suggest that TADs represent fundamental functional units for establishing the correct pattern of enhancer-promoter interactions. This is thought to occur through two complementary mechanisms: on the one hand, TADs are thought to increase the chances that regulatory elements meet each other by confining them within the same domain; on the other hand, they are thought to segregate physical interactions across the boundary, so that unwanted events do not occur frequently.
It is, however, unclear whether the properties that have been attributed to TADs are specific to TADs, or rather common features of the whole hierarchy. To address this question, I have implemented an algorithm named Caller of Topological Chromosomal Hierarchies (CaTCH). CaTCH is able to detect nested hierarchies of domains, allowing a comprehensive analysis of structural and functional properties across the folding hierarchy. By applying CaTCH to published Hi-C data in mouse embryonic stem cells (ESCs) and neural progenitor cells (NPCs), I showed that TADs emerge as a functionally privileged scale. In particular, TADs appear to be the scale where accumulation of CTCF at domain boundaries and transcriptional co-regulation during differentiation are maximal. Moreover, TADs appear to be the folding scale where the partitioning of interactions within transcriptionally active domains (and notably between active enhancers and promoters) is optimized. 3C-based methods have enabled fundamental discoveries such as the existence of TADs and CTCF-mediated chromatin loops. 3C methods detect chromatin interactions as ligation products after crosslinking the DNA. Crosslinking and ligation have often been criticized as potential sources of experimental biases, raising the question of whether TADs and CTCF-mediated chromatin loops actually exist in living cells. To address this, in collaboration with Josef Redolfi, we developed a new method termed ‘DamC’ which combines DNA methylation with physical modeling to detect chromosomal interactions in living cells, at the molecular scale, without relying on crosslinking and ligation. By applying DamC to mouse ESCs, we provide the first in vivo, crosslinking- and ligation-free validation of chromosomal structures detected by 3C methods, namely TADs and CTCF-mediated chromatin loops. DamC, together with 3C-based methods, has thus shown that mammalian chromosomes possess a rich hierarchy of folding layers.
An important challenge in the field is to understand the mechanisms that drive the establishment of these folding layers. In this sense, polymer physics represents a powerful tool to gain mechanistic insights into the hierarchical folding of mammalian chromosomes. In polymer models, the scaling of contact probability, i.e. the contact probability as a function of genomic distance, has often been used to benchmark polymer simulations and test alternative models. However, the scaling of contact probability is only one of the many properties that characterize polymer models, raising the question of whether it is enough to discriminate between alternative polymer models. To address this, I have built finite-size heteropolymer models characterized by random interactions. I showed that finite-size effects, together with the heterogeneity of the interactions, are sufficient to reproduce the observed range of scaling of contact probability. This suggests that one should be careful in discriminating between polymer models of chromatin folding based solely on the scaling. In conclusion, my findings have contributed to a better understanding of chromatin folding, which is essential to understand how enhancers act on promoters. The comprehensive analyses using CaTCH have provided conceptually new insights into how the architectural functionality of TADs may be established. My work on heteropolymer models has highlighted the fact that one should be careful in using scaling alone to discriminate between physical models for chromatin folding. Finally, the ability to detect TADs and chromatin loops using DamC represents a fundamental result, since it provides the first orthogonal in vivo validation of chromosomal structures that had essentially relied on a single technology.
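
The abstract above describes the scaling of contact probability as the contact probability as a function of genomic distance. As a rough illustration only, the quantity could be computed from a binned contact map by averaging each off-diagonal. This is a minimal sketch, not the thesis's method: the function name `contact_probability_by_distance` and the toy 1/s map are assumptions made for the example.

```python
def contact_probability_by_distance(contacts):
    """Average contact value at each genomic separation s (in bins).

    `contacts` is a square, symmetric list-of-lists contact map
    (hypothetical input format). Entry s-1 of the result is the mean
    of the s-th off-diagonal.
    """
    n = len(contacts)
    scaling = []
    for s in range(1, n):
        # All bin pairs (i, i + s) separated by exactly s bins
        values = [contacts[i][i + s] for i in range(n - s)]
        scaling.append(sum(values) / len(values))
    return scaling


# Toy symmetric map whose contacts decay as 1/s (illustrative data only)
n = 6
toy = [[0.0 if i == j else 1.0 / abs(i - j) for j in range(n)] for i in range(n)]
scaling = contact_probability_by_distance(toy)
```

Plotting `scaling` against the separation on log-log axes would give the kind of curve typically used to compare polymer models; as the abstract notes, agreement on this curve alone may not be enough to tell models apart.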

    Evolutionary genomics : statistical and computational methods

    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward.