134 research outputs found

    Learning the Regulatory Code of Gene Expression

    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states, and mRNA and protein levels. Deep neural networks automatically learn informative sequence representations, and interpreting them improves our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.

    Modeling nucleosome mediated mechanisms of gene regulation

    The genomes of all eukaryotic organisms are packaged into nucleosomes, the fundamental units of chromatin, each composed of approximately 147 base pairs of DNA wrapped around a histone octamer. Because 70-90% of the eukaryotic genome is packaged into nucleosomes, they modulate the accessibility of DNA to transcription factors (TFs) and play an important role in the regulation of transcription. This thesis is devoted to the mathematical modeling of effects caused by direct competition between nucleosomes and transcription factors. The contents of the thesis are organized as follows. In chapter 1 we introduce experimental methods and recent discoveries in chromatin biology. In chapter 2 we introduce a thermodynamic biophysical model for calculating nucleosome and transcription factor occupancies. We also introduce the statistical positioning effect and how it may affect the binding of transcription factors. In chapter 2 we mostly address the question of how competition with transcription factors can affect nucleosome positioning. We first examine experimental nucleosome data and assess how reproducible they are across experiments carried out in several labs. Then, we introduce a new method for assessing the quality of the model's predictions and use it to optimize the model parameters to fit the experimental data. We focus on how transcription factors can explain observed in vivo nucleosome positioning and which transcription factors play crucial roles in establishing nucleosome patterns at gene promoters. In chapter 3 we address the question of how nucleosomes and promoter architecture affect the binding of TFs. We model the binding of TFs to a cluster of binding sites in the context of chromatin and investigate which features of the binding site cluster determine the main characteristics of TF binding. Finally, we study how TF binding sites (TFBSs) in real genomes are positioned relative to each other and show that there are certain biases in the spacing between TFBSs, probably due to effects caused by competition with nucleosomes.

    Single-Molecule Investigation of Chromatin-Associated Factors in Genome Organization and Epigenetic Maintenance

    The central dogma of biology has laid the foundation for understanding gene expression through the mechanisms of transcription and translation. However, another layer of eukaryotic gene regulation lies in the complex structure of chromatin. This scaffold of structural proteins and enzymatic regulators determines which genes are expressed at what times, leading to cell differentiation, cell fate, and often disease. To date, the field of chromatin biology has relied on basic biochemistry and cellular assays to identify key epigenetic regulators and their role in genomic maintenance. For this thesis work, I have developed a biophysical platform to study chromatin-associated factors at the single-molecule level (Chapter 2). This methodology allows us to extract key mechanistic details often obscured by standard bulk methodologies. Using this platform, we asked how the epigenetic factor Polycomb repressive complex 2 (PRC2) engages with chromatin (Chapter 3). PRC2 is a major epigenetic machinery that maintains transcriptionally silent heterochromatin in the nucleus and plays critical roles in embryonic development and oncogenesis. It is generally thought that PRC2 propagates repressive histone marks by modifying neighboring nucleosomes in a strictly linear progression. However, the behavior of PRC2 on native-like chromatin substrates remains incompletely characterized, making the precise mechanism of PRC2-mediated heterochromatin maintenance elusive. Our understanding of this process was limited by the resolution of structural techniques that fail to identify PRC2-binding modes on long chromatin substrates. In short, we found direct evidence that PRC2 can simultaneously engage nonadjacent nucleosome pairs. The demonstration of PRC2's ability to bridge noncontiguous chromosomal segments furthers our understanding of how Polycomb complexes spread epigenetic modifications and compact chromatin. In addition to this single-molecule chromatin binding technology, I also created a single-molecule platform harnessing correlative force and fluorescence microscopy to assay the material properties of phase-separated condensates (Chapter 2). This assay combines visualization of condensate formation at the single-molecule level with optical trapping of individual droplets to investigate their material properties. Utilizing this technology, we interrogated the role of linker histone H1 (Chapter 4). The linker histones are the most abundant group of chromatin-binding proteins that bind and organize eukaryotic chromatin. However, roles for the diverse and largely unstructured H1 proteins beyond chromatin compaction remain unclear. We used correlative single-molecule force and fluorescence microscopy to directly visualize the behavior of H1 on DNA under different tensions. Unexpectedly, our results show that H1 preferentially coalesces around nascent, relaxed single-stranded DNA. In vitro bulk assays confirmed that H1 has a higher propensity to form phase-separated condensates with single-stranded DNA than with double-stranded DNA. Furthermore, we dissected the material properties of different H1:DNA condensates by controlled droplet fusion with optical tweezers, and found that increased DNA length and GC content result in more viscous, gel-like H1 condensates. Overall, our findings suggest a potential role for linker histones to sense and coacervate single-stranded nucleic acids in the nucleus, forming reaction hubs for genome maintenance. This work also provides a new perspective for understanding how various H1 subtypes and disease-associated mutations affect chromatin structure and function. In summary, we have gained a greater understanding of the biophysical basis for chromatin regulation by both PRC2 and histone H1. Both of the biophysical platforms created for these studies can be applied to various new targets in chromatin biology. They will enable the investigation of a multiplicity of binding interactions, regulatory mechanisms, and material properties of protein-nucleic acid complexes (Chapters 5 & 6). I believe single-molecule techniques will become a major toolset for studying chromatin biology, identifying the intricacies and interactions between epigenetic factors and our genome.

    Systems Analytics and Integration of Big Omics Data

    A “genotype” is essentially an organism's full hereditary information, which is obtained from its parents. A “phenotype” is an organism's actual observed physical and behavioral properties. These may include traits such as morphology, size, height, eye color, metabolism, etc. One of the pressing challenges in computational and systems biology is genotype-to-phenotype prediction. This is challenging given the amount of data generated by modern Omics technologies. This “Big Data” is so large and complex that traditional data processing applications are not up to the task. Challenges arise in collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of these data. In this Special Issue, there is a focus on the systems-level analysis of Omics data, recent developments in gene ontology annotation, and advances in biological pathways and network biology. The integration of Omics data with clinical and biomedical data using machine learning is explored. This Special Issue covers new methodologies in the context of gene–environment interactions, tissue-specific gene expression, and how external factors or host genetics impact the microbiome.

    Development of Computational Techniques for Regulatory DNA Motif Identification Based on Big Biological Data

    Accurate regulatory DNA motif (or motif) identification plays a fundamental role in the elucidation of transcriptional regulatory mechanisms in a cell and can strongly support regulatory network construction for both prokaryotic and eukaryotic organisms. Next-generation sequencing techniques generate a huge amount of biological data for motif identification. Specifically, Chromatin Immunoprecipitation followed by high-throughput DNA sequencing (ChIP-seq) enables researchers to identify motifs on a genome scale. Recently, technological improvements have allowed DNA structural information to be obtained in a high-throughput manner, which can provide four DNA shape features. DNA shape has been found to complement genomic sequence in predicting transcription factor (TF)-DNA binding specificity with traditional machine learning models. Recent studies have demonstrated that deep learning (DL), especially the convolutional neural network (CNN), enables identification of motifs directly from DNA sequence. Although numerous algorithms and tools have been proposed and developed in this field, (1) the lack of intuitive and integrative web servers impedes effective use of emerging algorithms and tools; (2) DNA shape has not been integrated with DL; and (3) existing DL models still suffer from high false positive and false negative rates in motif identification. This thesis focuses on developing an integrated web server for motif identification based on DNA sequences either from users or from built-in databases. This web server allows further motif-related analysis and Cytoscape-like network interpretation and visualization. We then proposed a DL framework for both sequence and shape motif identification from ChIP-seq data using a binomial distribution strategy. This framework can accept different combinations of DNA sequence and DNA shape as input. Finally, we developed a gated convolutional neural network (GCNN) for capturing motif dependencies among long DNA sequences. Results show that our web server provides more comprehensive motif analysis functionality than existing web servers. The DL framework can identify motifs using an optimized threshold and reveals the strong predictive power of DNA shape for TF-DNA binding specificity. The identified sequence and shape motifs can contribute to the interpretation of TF-DNA binding mechanisms. Additionally, GCNN improves TF-DNA binding specificity prediction over CNN on most of the datasets.

    Investigating the 3D chromatin architecture with fluorescence microscopy

    Chromatin is an assembly of DNA and nuclear proteins, which on the one hand stores the 2 meters of DNA of a diploid human nucleus in a small volume and on the other hand regulates the accessibility of specific DNA segments to proteins. Many cellular processes, such as gene expression and DNA repair, are affected by the three-dimensional architecture of chromatin. Cohesin is an important and well-studied protein that affects three-dimensional chromatin organization. One of the functions of this motor protein is the active generation of specific domain structures (topologically associating domains (TADs)) by the process of loop extrusion. Studies of cohesin-depleted cells showed that TAD structures were lost on a population average. This finding raised the question of the extent to which the functional nuclear architecture that can be detected by confocal and structured illumination microscopy is impaired in cohesin-depleted cells. The work presented in this thesis shows that the structuring of the nucleus into areas with different chromatin densities, including the localization of important nuclear proteins as well as replication patterns, was retained. Interestingly, cohesin-depleted cells proceeded through an endomitosis leading to the formation of multilobulated nuclei. Evidently, important structural features of chromatin can form even in the absence of cohesin. Fluorescence microscopic methods were used throughout the work presented here, and an innovative technique was developed that allows flexible labeling of proteins with different fluorophores in fixed cells. With this technique, DNA as well as peptide nucleic acid (PNA) oligonucleotides can be site-specifically coupled to antibodies via the Tub-tag technology and visualized by complementary fluorescently labeled oligonucleotides. The advantages and disadvantages of PNAs as docking strands are discussed in this thesis, as well as the use of PNAs in fluorescence in situ hybridization (FISH). The next study in this work used a combination of FISH and super-resolution microscopy. It showed that DNA segments of 5 kb can form both compact and elongated configurations in regulatorily active as well as inactive chromatin. Coarse-grained modeling of these microscopic data, in agreement with published data from other groups, suggested that elongated configurations occur more frequently in DNA segments in which nucleosome occupancy is reduced. The microscopically measured distance distributions could only be simulated with models that assume different densities of nucleosomes in the population. Another result of this study was that inactive chromatin, as expected, shows a high level of compaction, which can hardly be explained with common coarse-grained models. It is possible that environmental effects that are difficult to simulate play a role here. Chromatin is a highly dynamic structure, and its architecture is constantly changing, be it through active processes such as the effect of cohesin investigated here or through thermodynamic interactions of nucleosomes as they are simulated in coarse-grained models. It will take a long time until we adequately understand these dynamic processes and their interplay.

    Twisted bilayer systems
