305 research outputs found

    Methods for Joint Normalization and Comparison of Hi-C data

    The development of chromatin conformation capture technology has opened new avenues of study into the 3D structure and function of the genome. Chromatin structure is known to influence gene regulation, and differences in structure are now emerging as a mechanism of regulation in, e.g., cell differentiation and disease vs. normal states. Hi-C sequencing technology now provides a way to study the 3D interactions of the chromatin over the whole genome. However, like all sequencing technologies, Hi-C suffers from several forms of bias stemming from both the technology and the DNA sequence itself. Several normalization methods have been developed for individual Hi-C datasets, but little work has been done on joint normalization methods for comparing two or more Hi-C datasets. To make full use of Hi-C data, joint normalization and statistical comparison techniques are needed to identify regions where chromatin structure differs between conditions. We develop methods for the joint normalization and comparison of two Hi-C datasets, which we then extend to more complex experimental designs. Our normalization method is novel in that it makes use of the distance-dependent nature of chromatin interactions. Our modification of the Minus vs. Average (MA) plot to the Minus vs. Distance (MD) plot allows for a nonparametric data-driven normalization technique using loess smoothing. Additionally, we present a simple statistical method using Z-scores for detecting differentially interacting regions between two datasets. Our initial method was published as the Bioconductor R package HiCcompare [http://bioconductor.org/packages/HiCcompare/](http://bioconductor.org/packages/HiCcompare/). We then further extended our normalization and comparison method for use in complex Hi-C experiments with more than two datasets and optional covariates. 
We extended the normalization method to jointly normalize any number of Hi-C datasets by using a cyclic loess procedure on the MD plot. The cyclic loess normalization technique can remove between-dataset biases efficiently and effectively even when several datasets are analyzed at one time. Our comparison method implements a generalized linear model-based approach for comparing complex Hi-C experiments, which may have more than two groups and additional covariates. The extended methods are also available as a Bioconductor R package [http://bioconductor.org/packages/multiHiCcompare/](http://bioconductor.org/packages/multiHiCcompare/). Finally, we demonstrate the use of HiCcompare and multiHiCcompare in several test cases on real data, in addition to comparing them to other similar methods (https://doi.org/10.1002/cpbi.76).
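The MD-plot idea in the abstract above can be illustrated with a minimal sketch. It is not the HiCcompare implementation: the function names and toy data are hypothetical, and a per-distance mean of M is used as a crude stand-in for the loess fit, purely to keep the example dependency-free. M is taken here as the log2 ratio of the two interaction frequencies, and D as the distance between the interacting bins.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def md_adjust(pairs):
    """Remove the distance-dependent trend from M = log2(IF2 / IF1).

    pairs: list of (distance_in_bins, if1, if2) with positive frequencies.
    Assumption: the per-distance mean of M replaces the loess curve.
    Returns the adjusted M values in input order.
    """
    m_values = [(d, math.log2(if2 / if1)) for d, if1, if2 in pairs]

    # Group M by distance and estimate the trend at each distance.
    by_distance = defaultdict(list)
    for d, m in m_values:
        by_distance[d].append(m)
    trend = {d: mean(ms) for d, ms in by_distance.items()}

    # Subtract the trend so that M is centred on zero at every distance.
    return [m - trend[d] for d, m in m_values]


def z_scores(adjusted_m):
    """Convert adjusted M values to Z-scores (all zeros if there is no spread)."""
    mu = mean(adjusted_m)
    sigma = pstdev(adjusted_m)
    if sigma == 0:
        return [0.0 for _ in adjusted_m]
    return [(m - mu) / sigma for m in adjusted_m]


# Toy data: (distance, IF in dataset 1, IF in dataset 2).
example = [(1, 100, 200), (1, 100, 200), (2, 50, 50), (2, 50, 200)]
adjusted = md_adjust(example)
scores = z_scores(adjusted)
```

In this toy input the two distance-1 pairs share the same twofold shift, so it is treated as bias and removed, while the two distance-2 pairs disagree with each other and keep non-zero adjusted values. A real analysis would use a smooth fit across distances and many more bin pairs.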

    ANALYSIS OF CHROMOSOME SPATIAL ORGANIZATION DATA AND INTEGRATION WITH GENE MAPPING FOR COMPLEX TRAITS

    Studying the 3D chromosomal organization is crucial to understanding processes of transcription, histone modifications, and DNA repair and replication. Chromatin conformation shapes molecular functions beyond genetic variation at the sequence level and epigenetic footprints along the one-dimensional genome. DNA spatial organization features can influence molecular and organism-level phenotypes, from regulation of the expression of target genes (which can be megabases [Mb] away), to the development of various diseases including autoimmune diseases, neurological diseases, and cancer. The genome-wide chromosome conformation capture technology Hi-C captures genomic interactions among all loci. Hi-C data allow us to investigate chromatin organization at various levels and resolutions, including the Mb resolution chromosome compartments and topologically associated domains (TADs), 10-40Kb resolution frequently interacting regions (FIREs), and 1-40Kb resolution chromatin loops and long-range chromatin interactions. FIREs have been demonstrated to provide valuable information for tissue or cell type-specific transcriptional regulation, a characteristic distinct from other domain features observed in the 3D genome. Until now, there has been no stand-alone software package for the detection of FIREs. To fill this gap, I first present a user-friendly R package to identify FIREs and clusters of FIREs (super-FIREs), accessible to the general scientific community. Next, I further explore the 3D genome and analyze brain tissue Hi-C data from 3 fetal and 3 adult human cortex samples with a total of 10.4 billion raw reads, the most deeply sequenced human brain tissue Hi-C datasets we are aware of to date. 
My analysis of these Hi-C data (identifying compartments, TAD boundaries, FIREs, and long-range chromatin interactions) generated mechanistic insights at GWAS loci for psychiatric disorders, brain-based traits, and neurological conditions, particularly schizophrenia. Lastly, as incorporating annotation can provide insights at GWAS loci, I annotate 148,019 variants identified in a recent trans-ethnic analysis for hematological traits in 746,667 participants. I present my findings in an R Shiny app, ABCx: Annotator for Blood Cell Traits, which highlights the variants' 1D epigenomic signatures, impact on gene expression, and chromatin conformation information to aid in further functional follow-up. Doctor of Public Healt

    BISER: Fast Characterization of Segmental Duplication Structure in Multiple Genome Assemblies

    The increasing availability of high-quality genome assemblies has raised interest in the characterization of genomic architecture. Major architectural parts, such as common repeats and segmental duplications (SDs), increase genome plasticity that stimulates further evolution by changing the genomic structure. However, optimal computation of SDs through standard local alignment algorithms is impractical due to the size of most genomes. A cross-genome evolutionary analysis of SDs is even harder, as one needs to characterize SDs in multiple genomes and find relations between those SDs and unique segments in other genomes. Thus there is a need for fast and accurate algorithms to characterize SD structure in multiple genome assemblies to better understand the evolutionary forces that shaped the genomes of today. Here we introduce a new tool, BISER, to quickly detect SDs in multiple genomes and identify elementary SDs and core duplicons that drive the formation of such SDs. BISER improves on earlier tools by (i) scaling the detection of SDs with low homology (75%) to multiple genomes while introducing further 8-24x speed-ups over the existing tools, and by (ii) characterizing elementary SDs and detecting core duplicons to help trace the evolutionary history of duplications to as far as 90 million years.

    Computational investigation of cancer genomes

    Cancer is a leading cause of death worldwide, and its incidence is increasing due to modern lifestyles that have prolonged human life. All cancers originate from a single cell that has acquired genetic aberrations enabling uncontrolled proliferation. Each cancer is unique in its aberrant genetic makeup, which defines, to a large extent, its biology, aggressiveness, and vulnerabilities to different treatments. Furthermore, the genetic makeup of each cancer is heterogeneous among its constituent cancer cells, and dynamic, with the ability to evolve in order to preserve the survival of cancer cells. Sequencing technologies are currently producing massive amounts of data that, with the help of specialized computational methods, can revolutionize our knowledge of cancer. A key question in cancer research is how to personalize the treatment of cancer patients, so that each cancer is treated according to its molecular characteristics. The first study in this thesis takes a step in that direction through a proposed novel molecular classification system of diffuse large B-cell lymphoma (DLBCL), which is the most common hematological malignancy in adults. The suggested classification, derived from the integrative analysis of gene expression and DNA mutations, stratifies DLBCL into four groups with distinct biology, genetic landscapes, and clinical outcome. These subtypes could help identify patients at high risk who may benefit from an altered treatment plan. Understanding the genomic evolution of cancer that transforms a typically curable primary tumor into an incurable drug-resistant metastasis is another aspect of cancer research under intensive investigation. The second study in this thesis investigates the spreading patterns of metastasis in breast cancer, which is the most common cancer in women. Using phylogenetic analysis of somatic mutations from longitudinal breast cancer samples, the metastasis routes were uncovered. 
The study revealed that breast cancer spreads either in parallel from the primary tumor to multiple distant sites, or linearly from the primary tumor to a distant site, and then from that site to another. However, in all cases, axillary lymph nodes did not mediate the spreading to distant sites. This provided genetic evidence for the redundancy of lymph node dissection in breast cancer management. Towards genetics-based diagnostics in cancer, the computational methods used to detect genetic aberrations need to be evaluated for their accuracy. The third study in this thesis performs a comparison of methods for detecting somatic copy number alterations from cancer samples. The study evaluated several commonly used methods for two different sequencing platforms using simulated and real cancer data. The results provided an overview of the weaknesses of the different methods that could be methodologically improved. Altogether, this thesis gives an overview of the field of computational cancer genomics and presents three studies that exemplify the clinical relevance of computational research.

    3D Organization of Eukaryotic and Prokaryotic Genomes

    There is a complex mutual interplay between three-dimensional (3D) genome organization and cellular activities in bacteria and eukaryotes. The aim of this thesis is to investigate such structure-function relationships. A main part of this thesis deals with the study of the three-dimensional genome organization using novel techniques for detecting genome-wide contacts using next-generation sequencing. These so-called chromatin conformation capture-based methods, such as 5C and Hi-C, give deep insights into the architecture of the genome inside the nucleus, even on a small scale. We shed light on the question of how the vastly increasing Hi-C data can generate new insights about the way the genome is organized in 3D. To this end, we first present the typical Hi-C data processing workflow to obtain Hi-C contact maps and show potential pitfalls in the interpretation of such contact maps using our own data pipeline and publicly available Hi-C data sets. Subsequently, we focus on approaches to modeling 3D genome organization based on contact maps. In this context, a computational tool was developed which interactively visualizes contact maps alongside complementary genomic data tracks. Inspired by machine learning with the help of probabilistic graphical models, we developed a tool that detects the compartmentalization structure within contact maps on multiple scales. In a further project, we propose and test one possible mechanism for the observed compartmentalization within contact maps of genomes across multiple species: dynamic formation of loops within domains. In the context of 3D organization of bacterial chromosomes, we present the first direct evidence for global restructuring by long-range interactions of a DNA binding protein. Using Hi-C and live cell imaging of DNA loci, we show that the DNA binding protein Rok forms insulator-like complexes looping the B. subtilis genome over large distances. 
This biological mechanism agrees with our model based on dynamic formation of loops affecting domain formation in eukaryotic genomes. We further investigate the spatial segregation of the E. coli chromosome during cell division. In particular, we are interested in the positioning of the chromosomal replication origin region based on its interaction with the protein complex MukBEF. We tackle the problem using a combined approach of stochastic and polymer simulations. Last but not least, we develop a completely new methodology to analyze single molecule localization microscopy images based on topological data analysis. By using this new approach in the analysis of irradiated cells, we are able to show that the topology of repair foci can be categorized depending on the distance to heterochromatin.

    An Investigation of Segmental Duplications across Topologically Associating Domains

    High-throughput chromosome conformation capture (Hi-C) reveals organization within genomes. Topologically associating domains (TADs) make up one level of organization and are identified by applying algorithms to Hi-C data. TADs have boundaries that can be disrupted by structural variants (SVs), which are hypothesized to form through recombination between segmental duplications (SDs). Little research is available about the effects of SDs at TAD boundaries. This project aimed to understand the distribution of SDs near TADs and determine any overlap between the two features. We analyzed public data and found SDs to have low breakpoint frequency and coverage at TAD boundaries. We then processed a new set of Hi-C data and found that most SDs lay within 200 kb of a TAD, with a modest bimodal distribution. Of the total SDs analyzed, fewer than half had at least one overlap within a TAD. Further statistical analysis must be done, as we only conducted a preliminary investigation.
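The kind of distance measurement summarized in the abstract above can be sketched as follows. This is an illustrative example only, not the project's pipeline: the function names, the coordinates, and the choice to measure from an SD's nearer end to the closest TAD boundary on a single chromosome are all assumptions made for the sketch.

```python
from bisect import bisect_left


def nearest_boundary_distance(position, boundaries):
    """Distance in bp from `position` to the closest entry of sorted `boundaries`."""
    if not boundaries:
        raise ValueError("boundaries must not be empty")
    i = bisect_left(boundaries, position)
    candidates = []
    if i < len(boundaries):
        candidates.append(boundaries[i] - position)      # boundary at or right of position
    if i > 0:
        candidates.append(position - boundaries[i - 1])  # boundary left of position
    return min(candidates)


def sd_boundary_distances(sds, boundaries):
    """For each SD (start, end), return its distance to the nearest TAD boundary.

    An SD that contains a boundary gets distance 0; otherwise the nearer
    of its two ends is used.
    """
    ordered = sorted(boundaries)
    distances = []
    for start, end in sds:
        i = bisect_left(ordered, start)
        if i < len(ordered) and ordered[i] <= end:
            distances.append(0)
        else:
            distances.append(min(nearest_boundary_distance(start, ordered),
                                 nearest_boundary_distance(end, ordered)))
    return distances


# Hypothetical coordinates on one chromosome (bp).
tad_boundaries = [1_000_000, 2_000_000]
segmental_dups = [(1_150_000, 1_200_000), (1_990_000, 2_010_000), (400_000, 500_000)]
distances = sd_boundary_distances(segmental_dups, tad_boundaries)
fraction_within_200kb = sum(d <= 200_000 for d in distances) / len(distances)
```

A histogram of such distances over all SDs is one way a bimodal pattern like the one reported could be inspected.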

    Limited heterogeneity of known driver gene mutations among the metastases of individual patients with pancreatic cancer

    The extent of heterogeneity among driver gene mutations present in naturally occurring metastases - that is, treatment-naive metastatic disease - is largely unknown. To address this issue, we carried out 60× whole-genome sequencing of 26 metastases from four patients with pancreatic cancer. We found that identical mutations in known driver genes were present in every metastatic lesion for each patient studied. Passenger gene mutations, which do not have known or predicted functional consequences, accounted for all intratumoral heterogeneity. Even with respect to these passenger mutations, our analysis suggests that the genetic similarity among the founding cells of metastases was higher than that expected for any two cells randomly taken from a normal tissue. The uniformity of known driver gene mutations among metastases in the same patient has critical and encouraging implications for the success of future targeted therapies in advanced-stage disease.

    Transposable element insertions are associated with Batesian mimicry in the pantropical butterfly Hypolimnas misippus

    Hypolimnas misippus is a Batesian mimic of the toxic African Queen butterfly (Danaus chrysippus). Female H. misippus butterflies use two major wing patterning loci (M and A) to imitate three color morphs of D. chrysippus found in different regions of Africa. In this study, we examine the evolution of the M locus and identify it as an example of adaptive atavism. This phenomenon involves a morphological reversion to an ancestral character that results in an adaptive phenotype. We show that H. misippus has re-evolved an ancestral wing pattern present in other Hypolimnas species, repurposing it for Batesian mimicry of a D. chrysippus morph. Using haplotagging, a linked-read sequencing technology, and our new analytical tool, Wrath, we discover two large transposable element insertions located at the M locus and establish that these insertions are present in the dominant allele responsible for producing the mimetic phenotype. By conducting a comparative analysis involving additional Hypolimnas species, we demonstrate that the dominant allele is derived. This suggests that, in the derived allele, the transposable elements disrupt a cis-regulatory element, leading to the reversion to an ancestral phenotype that is then utilized for Batesian mimicry of a distinct model, a different morph of D. chrysippus. Our findings present a compelling instance of convergent evolution and adaptive atavism, in which the same pattern element has independently evolved multiple times in Hypolimnas butterflies, repeatedly playing a role in Batesian mimicry of diverse model species.

    Investigating the role of enhancer-mediated gene expression in the human brain and its potential contribution to psychiatric disorders

    Autism spectrum disorder (ASD) and schizophrenia (SCZ) are two neuropsychiatric conditions with variable times of onset that are influenced by both genetic and environmental factors. Genome-wide association studies (GWASs) have identified numerous genetic loci common to both disorders; however, our understanding remains far from complete, and many clinical cases lack an identified genetic cause. While increasing the statistical power of GWASs to find additional risk variants could rule in or rule out rare cases of ASD and SCZ, this presently remains a difficult task. Furthermore, the biological functions of genetic susceptibility loci remain poorly understood, particularly for more recent discoveries of loci devoid of gene bodies. On the other hand, recent biotechnological developments have made it possible to conduct high-resolution experimental measurements of the three-dimensional architecture of the genome, including enhancer-promoter interactions (EPIs). Such data have been used to connect GWAS risk variants to their potential target genes which, in turn, provide insights into underlying molecular mechanisms and cellular processes. The functions of EPIs in controlling gene expression programmes are crucial to how implicated genes mediate neurological function and disease. Yet knowledge of EPIs has still to be used in conjunction with GWAS data, particularly data from specific brain cell types, which may be useful to uncover the biological underpinnings of psychiatric conditions. This thesis examines the role of enhancer-mediated gene expression in the human brain and its potential contribution to psychiatric conditions. 
In Chapter 2, I report on the identification of significant chromosomal interactions from studies of brain Hi-C data generated from neuronal and glial cells, with the goal of investigating the impact of EPIs genome-wide, as well as providing a template for an in-depth understanding of how EPIs impact transcriptional regulation. In Chapter 3, I discuss a novel approach integrating Activity by Contact (ABC) and gene set enrichment analyses of GWAS data in two steps. In the first step, ABC is used to predict enhancer-gene regulatory interactions in a given cell type (e.g., glial cells, neurons). In the second step, Hi-C coupled multi-marker analysis of genomic annotation (H-MAGMA) is used to assign the SNPs located in the regulatory regions identified by ABC to each gene and calculate gene-level association p-values. I applied this novel framework (ABC-HMAGMA) to GWAS data from SCZ and ASD to identify novel SCZ and ASD trait-associated genes and molecular pathways. In Chapter 4, I evaluate a potential novel mechanism for the regulation of enhancer activity within cells. I hypothesized that, in addition to its known roles in DNA replication and transcription, Topoisomerase I may regulate enhancer activity in brain cells. To test this hypothesis, I employed RNA-seq and transient transcriptome sequencing (TT-seq), a method that enriches for short-lived enhancer-derived RNAs (eRNAs). These data showed that Topoisomerase I inhibition leads to significant changes in eRNA expression and offer evidence that such changes are relevant to the homeostatic functions of Topoisomerase I in cellular gene expression regulation.
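The first step of the framework described above relies on Activity by Contact scoring. As a rough illustration, the sketch below uses the commonly stated form of the ABC score, in which each enhancer's activity multiplied by its contact frequency with the gene is divided by the sum of the same product over all candidate enhancers for that gene. The function name and the numbers are hypothetical, and the thesis may use different inputs, windows, or thresholds.

```python
def abc_scores(enhancers):
    """Compute ABC-style scores for the candidate enhancers of one gene.

    enhancers: dict mapping enhancer name -> (activity, contact), where
    `activity` is an enhancer activity estimate and `contact` is the
    Hi-C contact frequency between the enhancer and the gene promoter.
    Returns a dict of scores that sum to 1 (all zeros if every product is 0).
    """
    products = {name: activity * contact
                for name, (activity, contact) in enhancers.items()}
    total = sum(products.values())
    if total == 0:
        return {name: 0.0 for name in products}
    return {name: value / total for name, value in products.items()}


# Hypothetical values for three candidate enhancers of a single gene.
candidates = {"e1": (10.0, 0.5), "e2": (5.0, 0.2), "e3": (2.0, 2.0)}
scores = abc_scores(candidates)
```

Here e3 has low activity but frequent contact and still receives a larger share than e2, which shows how the contact term can outweigh activity when linking enhancers to genes.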