
    GAMIBHEAR: whole-genome haplotype reconstruction from genome architecture mapping data

    Motivation: Understanding haplotype-specific regulatory mechanisms is becoming increasingly important in genomics and medical research. Investigating differences in allele-specific gene expression, epigenetic changes and their causal variants greatly benefits from haplotype reconstruction, or phasing, of genetic variants, but direct evidence for the haplotype structure is difficult to obtain from standard short-read sequencing data. Chromatin conformation data obtained from 3C experiments allow inference of haplotypes because intra-chromosomal contacts are more frequent than contacts between homologous chromosomes, but these data suffer from technical biases owing to the digestion and ligation steps of the 3C technique. Genome Architecture Mapping (GAM) is a novel digestion- and ligation-free method for the inference of chromatin conformation from nuclear cryosections. Due to its high resolution and independence from enzymatic digestion, it is well suited for haplotype reconstruction and for detecting haplotype-specific chromatin contacts. Results: Here, we present GAMIBHEAR, a tool for accurate haplotype reconstruction from GAM data. GAMIBHEAR aggregates allelic co-observation frequencies across multiple nuclear slices and employs a GAM-specific probabilistic model of haplotype capture to optimise phasing accuracy. Using a hybrid mouse embryonic stem cell line with known haplotype structure as a benchmark dataset, we assess correctness and completeness of the reconstructed haplotypes, and demonstrate the power of GAM data and the accuracy of GAMIBHEAR to infer genome-wide haplotypes. Availability: GAMIBHEAR is available as an R package under the open source GPL-2 license at https://bitbucket.org/schwarzlab/gamibhear. Maintainer: julia.markowski{at}mdc-berlin.d
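To make the idea of aggregating allelic co-observations across slices concrete, here is a minimal Python sketch. It is not GAMIBHEAR's algorithm or API (the package is written in R and uses a GAM-specific probabilistic model that is not reproduced here). The function names, the input layout (one dict per slice mapping a variant id to the observed allele, 0 or 1) and the greedy voting rule are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def co_observation_evidence(slices):
    """Aggregate pairwise evidence across slices (hypothetical helper).

    slices: list of dicts mapping variant id -> observed allele (0 or 1).
    Returns {(a, b): score}, where the score is the number of slices in
    which a and b show the same allele label minus the number in which
    the labels differ.
    """
    evidence = defaultdict(int)
    for observed in slices:
        for a, b in combinations(sorted(observed), 2):
            evidence[(a, b)] += 1 if observed[a] == observed[b] else -1
    return dict(evidence)


def greedy_phase(variants, evidence):
    """Assign each variant the allele (0 or 1) carried by one haplotype.

    The first variant is fixed to allele 0; every later variant is placed
    by a vote over the variants that are already phased.
    """
    phase = {variants[0]: 0}
    for variant in variants[1:]:
        vote = 0
        for other, allele in phase.items():
            key = (other, variant) if other < variant else (variant, other)
            score = evidence.get(key, 0)
            # A positive score means equal labels travel together, so the
            # new variant should copy the allele of the phased one.
            vote += score if allele == 0 else -score
        phase[variant] = 0 if vote >= 0 else 1
    return phase


# Toy data: one haplotype carries s1=0, s2=1, s3=0; the other the complement.
toy_slices = [
    {"s1": 0, "s2": 1},
    {"s2": 1, "s3": 0},
    {"s1": 1, "s3": 1},
    {"s1": 1, "s2": 0, "s3": 1},
]
toy_phase = greedy_phase(["s1", "s2", "s3"], co_observation_evidence(toy_slices))
```

The sketch assumes each slice captures at most one allele per variant and ignores sequencing errors; a real phasing tool has to handle both.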

    Subtle changes in chromatin loop contact propensity are associated with differential gene regulation and expression.

    While genetic variation at chromatin loops is relevant for human disease, the relationships between contact propensity (the probability that loci at loops physically interact), genetics, and gene regulation are unclear. We quantitatively interrogate these relationships by comparing Hi-C and molecular phenotype data across cell types and haplotypes. While chromatin loops consistently form across different cell types, they show subtle quantitative differences in contact frequency that are associated with larger changes in gene expression and H3K27ac. For the vast majority of loci with quantitative differences in contact frequency across haplotypes, the magnitude of the changes is smaller than that across cell types; however, the proportional relationships between contact propensity, gene expression, and H3K27ac are consistent. These findings suggest that subtle changes in contact propensity have a biologically meaningful role in gene regulation and could be a mechanism by which regulatory genetic variants in loop anchors mediate effects on expression.

    Sequence-based Multiscale Model (SeqMM) for High-throughput chromosome conformation capture (Hi-C) data analysis

    In this paper, I introduce a Sequence-based Multiscale Model (SeqMM) for biomolecular data analysis. By combining it with the spectral graph method, I reveal the essential difference between global-scale and local-scale models in structure clustering, namely that they optimize different distances: Euclidean (or spatial) distances versus sequential (or genomic) distances. More specifically, clusters from global-scale models optimize Euclidean distance relations. Local-scale models, on the other hand, result in clusters that optimize genomic distance relations. For biomolecular data, Euclidean distances and sequential distances are two independent variables that can never be optimized simultaneously in data clustering. However, the sequence scale in SeqMM can work as a tuning parameter that balances these two variables and delivers different clusterings depending on the purpose. Further, SeqMM is used to explore the hierarchical structures of chromosomes. I find that at the global scale, the Fiedler vector from SeqMM bears a great similarity to the principal vector from principal component analysis and can be used to study genomic compartments. In TAD analysis, I find that TADs evaluated at different scales are not consistent and vary considerably. In particular, when the sequence scale is small, the calculated TAD boundaries are dramatically different. Even for regions with high contact frequencies, TAD regions show no obvious consistency. However, when the scale value increases further, although TADs are still quite different, TAD boundaries in these high-contact-frequency regions become more and more consistent. Finally, I find that for a fixed local scale, my method delivers very robust TAD boundaries for different cluster numbers.
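The Fiedler vector mentioned above is the eigenvector of a graph Laplacian belonging to its second-smallest eigenvalue, and its signs give a two-way split of the nodes. The Python sketch below illustrates that generic spectral bipartition on a toy contact matrix. It is not the SeqMM model: the function name, the shifted power iteration and the toy matrix are assumptions chosen so the example runs with the standard library only.

```python
def fiedler_vector(weights, iterations=2000):
    """Approximate the Fiedler vector of a symmetric, non-negative matrix.

    Uses power iteration on (shift * I - L), where L = D - W is the graph
    Laplacian, and removes the constant component at every step so the
    iteration converges to the second-smallest eigenvector of L.
    Assumes a connected graph with at least two nodes.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    # By Gershgorin's theorem, 2 * max degree bounds the largest eigenvalue of L.
    shift = 2.0 * max(degree)
    # Deterministic, non-constant starting vector (a centred ramp).
    v = [i - (n - 1) / 2.0 for i in range(n)]
    for _ in range(iterations):
        w = [
            (shift - degree[i]) * v[i]
            + sum(weights[i][j] * v[j] for j in range(n))
            for i in range(n)
        ]
        mean = sum(w) / n
        w = [x - mean for x in w]  # project out the constant vector
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Toy contact matrix: two blocks of three bins, strong contacts within a
# block (1.0) and weak contacts between blocks (0.1).
toy_contacts = [
    [0.0 if i == j else (1.0 if (i < 3) == (j < 3) else 0.1) for j in range(6)]
    for i in range(6)
]
toy_fiedler = fiedler_vector(toy_contacts)
# Bins with the same sign fall on the same side of the bipartition.
toy_labels = [x > 0 for x in toy_fiedler]
```

For real Hi-C matrices a dense or sparse eigensolver would normally replace the hand-written iteration; the loop is used here only to keep the sketch self-contained.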

    Data mining and machine learning methods for chromosome conformation data analysis

    Sixteen years after the Human Genome Project (HGP) sequenced the human genome, and 17 years after the introduction of Chromosome Conformation Capture (3C) technologies, three-dimensional (3D) inference and big data remain problematic in the field of genomics, and specifically in the field of 3C data analysis. Three-dimensional inference involves the reconstruction of a genome's 3D structure or, in some cases, an ensemble of structures from contact interaction frequencies extracted from a variant of the 3C technology called Hi-C. Further questions remain about chromosome topology and structure; enhancer-promoter interactions; the location of genes, gene clusters, and transcription factors; the relationship between gene expression and epigenetics; and chromosome visualization at a higher scale, among others. This dissertation describes four major contributions: first, 3DMax, a tool for chromosome and genome 3D structure prediction from Hi-C data using an optimization algorithm; second, GSDB, a comprehensive and common repository that contains 3D structures for Hi-C datasets from novel 3D structure reconstruction tools developed over the years; third, ClusterTAD, a method for extracting topologically associating domains (TADs) from Hi-C data using an unsupervised learning algorithm; and finally, GenomeFlow, a comprehensive graphical tool that facilitates the entire process of modeling and analyzing 3D genome organization. It is worth noting that GenomeFlow and GSDB are the first of their kind in the 3D chromosome and genome research field. All the methods are available as software tools that are freely available to the scientific community.

    Development of New Computational Tools for Analyzing Hi-C Data and Predicting Three-Dimensional Genome Organization

    Background: The development of Hi-C (and related methods) has allowed for unprecedented sequence-level investigations into the structure-function relationship of the genome. There has been extensive effort in developing new tools to analyze these data in order to better understand the relationship between 3D genomic structure and function. While useful, the existing tools are far from mature and (in some cases) lack the generalizability that would be required for application in a diverse set of organisms. This is problematic since the research community has proposed many cross-species "hallmarks" of 3D genome organization without confirming their existence in a variety of organisms. Research Objective: Develop new, generalizable computational tools for Hi-C analysis and 3D genome prediction. Results: Three new computational tools were developed for Hi-C analysis or 3D genome prediction: GrapHi-C (visualization), GeneRHi-C (3D prediction) and StoHi-C (3D prediction). Each tool has the potential to be used for 3D genome analysis in both model and non-model organisms since the underlying algorithms do not rely on any organism-specific constraints. A brief description of each tool follows. GrapHi-C is a graph-based visualization of Hi-C data. Unlike existing visualization methods, GrapHi-C allows for a more intuitive structural visualization of the underlying data. GeneRHi-C and StoHi-C are tools that can be used to predict 3D genome organizations from Hi-C data (the 3D-genome reconstruction problem). GeneRHi-C uses a combination of mixed integer programming and network layout algorithms to generate 3D coordinates from a ploidy-dependent subset of the Hi-C data. Alternatively, StoHi-C uses t-distributed stochastic neighbour embedding with the complete set of Hi-C data to generate 3D coordinates of the genome. Each tool was applied to multiple, independent existing Hi-C datasets from fission yeast to demonstrate its utility. This is the first time 3D genome prediction has been successfully applied to these datasets. Overall, the tools developed here more clearly recapitulated documented features of fission yeast genomic organization when compared to existing techniques. Future work will focus on extending and applying these tools to analyze Hi-C datasets from other organisms. Additional Information: This thesis contains a collection of papers pertaining to the development of new tools for analyzing Hi-C data and predicting 3D genome organization. Each paper's publication status (as of January 2020) has been provided at the beginning of the corresponding chapter. For published papers, reprint permission was obtained and is available in the appendix.

    3D Organization of Eukaryotic and Prokaryotic Genomes

    There is a complex mutual interplay between three-dimensional (3D) genome organization and cellular activities in bacteria and eukaryotes. The aim of this thesis is to investigate such structure-function relationships. A main part of this thesis deals with the study of three-dimensional genome organization using novel techniques for detecting genome-wide contacts with next-generation sequencing. These so-called chromatin conformation capture-based methods, such as 5C and Hi-C, give deep insights into the architecture of the genome inside the nucleus, even on a small scale. We shed light on the question of how the rapidly growing body of Hi-C data can generate new insights into the way the genome is organized in 3D. To this end, we first present the typical Hi-C data processing workflow used to obtain Hi-C contact maps and show potential pitfalls in the interpretation of such contact maps using our own data pipeline and publicly available Hi-C data sets. Subsequently, we focus on approaches to modeling 3D genome organization based on contact maps. In this context, a computational tool was developed which interactively visualizes contact maps alongside complementary genomic data tracks. Inspired by machine learning with probabilistic graphical models, we developed a tool that detects the compartmentalization structure within contact maps on multiple scales. In a further project, we propose and test one possible mechanism for the observed compartmentalization within contact maps of genomes across multiple species: dynamic formation of loops within domains. In the context of the 3D organization of bacterial chromosomes, we present the first direct evidence for global restructuring by long-range interactions of a DNA-binding protein. Using Hi-C and live-cell imaging of DNA loci, we show that the DNA-binding protein Rok forms insulator-like complexes looping the B. subtilis genome over large distances. This biological mechanism agrees with our model based on dynamic formation of loops affecting domain formation in eukaryotic genomes. We further investigate the spatial segregation of the E. coli chromosome during cell division. In particular, we are interested in the positioning of the chromosomal replication origin region based on its interaction with the protein complex MukBEF. We tackle the problem using a combined approach of stochastic and polymer simulations. Finally, we develop a completely new methodology, based on topological data analysis, to analyze single-molecule localization microscopy images. By using this new approach in the analysis of irradiated cells, we are able to show that the topology of repair foci can be categorized depending on the distance to heterochromatin.