
    Methods for developing a machine learning framework for precise 3D domain boundary prediction at base-level resolution

    High-throughput chromosome conformation capture technology (Hi-C) has revealed extensive DNA looping and folding into discrete 3D domains. These include Topologically Associating Domains (TADs) and chromatin loops, the 3D domains critical for cellular processes like gene regulation and cell differentiation. The relatively low resolution of Hi-C data (regions of several kilobases in size) prevents precise mapping of domain boundaries by conventional TAD/loop-callers. However, high-resolution genomic annotations associated with boundaries, such as CTCF and members of the cohesin complex, suggest a computational approach for precisely locating domain boundaries. We developed preciseTAD, an optimized machine learning framework that leverages a random forest model to improve the location of domain boundaries. Our method introduces three concepts - shifted binning, distance-type predictors, and random under-sampling - which we use to build classification models for predicting boundary regions. The algorithm then uses density-based clustering (DBSCAN) and partitioning around medoids (PAM) to extract the most biologically meaningful domain boundary from models trained on high-resolution genome annotation data and boundaries from low-resolution Hi-C data. We benchmarked our method against a popular TAD-caller and a novel chromatin loop prediction algorithm. Boundaries predicted by preciseTAD were more enriched for known molecular drivers of 3D chromatin including CTCF, RAD21, SMC3, and ZNF143. preciseTAD-predicted boundaries were more conserved across cell lines, highlighting their higher biological significance. Additionally, models pre-trained in one cell line accurately predict boundaries in another cell line. Using cell line-specific genomic annotations, the pre-trained models enable detecting domain boundaries in cells without Hi-C data. The research presented provides a unified approach for precisely predicting domain boundaries.
This improved precision will provide insight into the association between genomic regulators and 3D genome organization. Furthermore, our methods will provide researchers with flexible and easy-to-use tools to continue to annotate the 3D structure of the human genome without relying on costly high-resolution Hi-C data. The preciseTAD R package and supplementary ExperimentHub package, preciseTADhub, are available on Bioconductor (version 3.13; https://bioconductor.org/packages/preciseTAD/; https://bioconductor.org/packages/preciseTADhub/).
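
    The three concepts named in the abstract above (shifted binning, distance-type predictors, random under-sampling) can be illustrated with a minimal Python sketch. This is not the preciseTAD implementation, which is an R package; all function names, the half-bin offset, the log2(distance + 1) transform, and the toy coordinates are assumptions chosen for illustration only.

```python
import bisect
import math
import random

def shifted_bins(chrom_length, resolution):
    """Bins offset by half a resolution step (assumed convention), so that
    boundaries called on the regular grid fall at bin centers, not bin edges."""
    start = resolution // 2
    return [(s, s + resolution)
            for s in range(start, chrom_length - resolution + 1, resolution)]

def label_bins(bins, boundaries):
    """Label a bin 1 if it contains at least one called boundary, else 0."""
    sorted_b = sorted(boundaries)
    labels = []
    for start, end in bins:
        i = bisect.bisect_left(sorted_b, start)
        labels.append(1 if i < len(sorted_b) and sorted_b[i] < end else 0)
    return labels

def log2_distance(bins, sites):
    """Distance-type predictor: log2(distance + 1) from each bin center to the
    nearest annotation site (e.g. a CTCF peak). Assumes `sites` is non-empty."""
    sorted_s = sorted(sites)
    feature = []
    for start, end in bins:
        center = (start + end) // 2
        i = bisect.bisect_left(sorted_s, center)
        nearest = min(abs(center - sorted_s[j])
                      for j in (i - 1, i) if 0 <= j < len(sorted_s))
        feature.append(math.log2(nearest + 1))
    return feature

def random_undersample(labels, seed=0):
    """Keep every minority-class bin and an equally sized random sample of the
    majority class; returns the sorted indices of the retained bins."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return sorted(minority + rng.sample(majority, len(minority)))

# Toy example with hypothetical coordinates (not real data).
bins = shifted_bins(chrom_length=100, resolution=10)
labels = label_bins(bins, boundaries=[30, 70])
feature = log2_distance(bins, sites=[31, 90])
kept = random_undersample(labels)
```

    The balanced subset `kept`, together with features such as `feature`, would then be passed to a classifier (a random forest in the abstract); the later DBSCAN and PAM steps are not covered by this sketch.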

    A (3D-nuclear) space odyssey: making sense of Hi-C maps

    Three-dimensional (3D) chromatin organization is critical for proper enhancer-promoter communication and, therefore, for the precise execution of the transcriptional programs governing cellular processes. The emergence of Chromosome Conformation Capture (3C) methods, in particular Hi-C, has allowed the investigation of chromatin interactions on a genome-wide scale, revealing the existence of overlapping molecular mechanisms that we are just starting to decipher. Therefore, disentangling the Hi-C signal into these individual components is essential for meaningful biological interpretation of the data. Here, we discuss emerging views on the molecular forces shaping the genome in 3D, with a focus on their respective contributions and interdependence. We discuss Hi-C data at both population and single-cell levels, thus providing criteria to interpret genomic function in the 3D-nuclear space.

    Order and disorder: abnormal 3D chromatin organization in human disease

    A precise three-dimensional (3D) organization of chromatin is central to achieving the intricate transcriptional patterns that are required to form complex organisms. Growing evidence supports an important role of 3D chromatin architecture in development and delineates its alterations as prominent causes of disease. In this review, we discuss emerging concepts on the fundamental forces shaping genomes in space and on how their disruption can lead to pathogenic phenotypes. We describe the molecular mechanisms underlying a wide range of diseases, from the systemic effects of coding mutations on 3D architectural factors, to the more tissue-specific phenotypes resulting from genetic and epigenetic modifications at specific loci. Understanding the connection between the 3D organization of the genome and its underlying biological function will allow a better interpretation of human pathogenesis.

    Measuring chromosomal interactions in living cells

    3C-based high-throughput sequencing methods such as Hi-C, 5C and 4C have substantially contributed to our current understanding of genome folding. These techniques have been instrumental in demonstrating that mammalian chromosomes possess a rich hierarchy of structural layers, at the heart of which topologically associating domains (TADs) stand out as preferential functional units in the genome. TADs have been suggested to establish the correct interaction patterns between regulatory sequences, supported by genetic studies where the deletion of boundary elements resulted in ectopic gene expression in the neighboring domain. Within TADs, looping interactions occur between regulatory sequences and convergent binding sites of the architectural protein CTCF, the latter as a consequence of loop extrusion by cohesin that is blocked by CTCF bound to DNA in a defined orientation. The dominant role of CTCF in loop formation is further highlighted by induced depletion experiments and by targeted deletions and inversions of CTCF sites, which result in loss of these interactions. Despite these fundamental discoveries and their implications for transcriptional control by cis-regulatory sequences, 3C and its derivatives are based on formaldehyde crosslinking and ligation, which have often been criticized as a source of important experimental bias. This has raised the question of whether structures detected by 3C methods really exist in living cells. Based on discrepancies between 5C and DNA-FISH data, it was suggested that 3C-based methods might not always capture spatial proximity or molecular-scale interactions, but rather detect DNA fragments which are hundreds of nanometers apart through crosslinking of macromolecular protein complexes between them. At the same time, it was debated whether the capture of ligation products might vary with sequence context, thereby over- or underrepresenting some interactions detected by 3C-based methods.
Even though several other methods including native 4C/Hi-C, GAM and SPRITE have also detected chromatin compartmentalization, TADs and looping interactions, they still involve substantial biochemical manipulation of cells, notably either crosslinking or ligation. Importantly, many mechanistic models of chromosome folding rely on 3C-based data, making the assumption that crosslinking frequency is proportional to absolute contact frequency. However, a formal proof of this is still missing. In order to measure chromosomal contacts directly in living cells, without using chemical fixation or ligation, I developed an alternative approach based on the DamID technique, which exploits detection of ectopic adenine methylation by the bacterial methyltransferase Dam. In the original version of DamID, Dam is fused to a DNA binding protein of interest, resulting in adenine methylation within GATC motifs in the neighborhood of the DNA binding sites. The methylation-sensitive restriction enzyme DpnI is then used to detect methylated GATCs, followed by high-throughput sequencing of the restriction sites. After mapping the reads and normalizing for non-specific methylation by freely diffusing Dam, the binding sites of the protein of interest can be detected genome-wide. I established a new modified version of this technique called DamC, where Dam is recruited in an inducible way to ectopically inserted Tet operators through fusion to the reverse tetracycline receptor. The detection of methylated DNA by high-throughput sequencing then makes it possible to identify chromosomal contacts at high genomic resolution across hundreds of kilobases around viewpoints. Importantly, modeling of this process provides a theoretical framework showing that the experimental output of DamC is indeed proportional to chromosomal contact probabilities. DamC provides the first crosslinking- and ligation-free validation of key structural features of mammalian chromosomes identified by 3C methods.
It confirms the existence of TADs and CTCF loops as well as the scaling of contact probabilities measured in 4C and Hi-C, which supports the validity of physical models of chromosome folding based on 3C-based data. Finally, it demonstrates that ectopic insertion of CTCF sites can lead to the formation of new loops with endogenous CTCF-bound sequences. This shows that chromosome structure can be engineered by inserting short ectopic sequences that rewire interactions within TADs, opening interesting avenues for modifying gene expression by altering chromosomal interactions rather than regulatory DNA sequences directly.
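
    One reason proportionality matters for comparing the scaling of contact probabilities across methods: if a measured signal equals a constant times the contact probability, and the contact probability follows a power law in genomic distance, the constant changes only the prefactor of the fit and leaves the exponent unchanged. The Python sketch below illustrates this with a log-log least-squares fit. It is not taken from the thesis; the function name, the power-law form, and the synthetic values are assumptions for illustration.

```python
import math

def fit_scaling_exponent(distances, signal):
    """Least-squares fit of log(signal) = log(prefactor) - alpha * log(distance).
    Returns (alpha, prefactor). Assumes all inputs are strictly positive."""
    xs = [math.log(d) for d in distances]
    ys = [math.log(s) for s in signal]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)

# Synthetic, noise-free values generated from an assumed power law
# (illustrative only, not experimental data).
distances = [10_000, 20_000, 40_000, 80_000, 160_000]
contact_prob = [5.0 * d ** -0.75 for d in distances]

# A signal proportional to contact probability (hypothetical factor of 3).
proportional_signal = [3.0 * p for p in contact_prob]

alpha_a, prefactor_a = fit_scaling_exponent(distances, contact_prob)
alpha_b, prefactor_b = fit_scaling_exponent(distances, proportional_signal)
```

    Both fits recover the same exponent, while the prefactors differ by the proportionality factor.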

    Modeling chromosomes: Beyond pretty pictures

    Recently, Chromosome Conformation Capture (3C) based experiments have highlighted the importance of computational models for the study of chromosome organization. In this review, we propose that current computational models can be grouped into roughly four classes, with two classes of data-driven models: consensus structures and data-driven ensembles, and two classes of de novo models: structural ensembles and mechanistic ensembles. Finally, we highlight specific questions mechanistic ensembles can address. National Institutes of Health (U.S.) (Grant R01HG003143); National Institutes of Health (U.S.) (Grant R01 GM114190)