38 research outputs found

    Computational methods for studying epigenomic regulation

    In the nucleus, DNA is tightly wrapped around proteins in a structure called chromatin in order to protect it from degradation. Chromatin is composed of nucleosomes, each consisting of eight histone proteins around which the DNA is wrapped. Histones can be modified by enzymes at amino acids located on their N-terminal tails. These modifications allow the chromatin to open and close in targeted regions, providing control over gene expression. At present, chromatin immunoprecipitation (ChIP) and the assay for transposase-accessible chromatin (ATAC), each combined with high-throughput sequencing (ChIP-seq and ATAC-seq), are the major high-throughput methods for studying histone modifications and genome-wide chromatin openness, respectively. Typically, ChIP-seq targets one histone modification at a time by using immunoprecipitation to enrich the regions of the genome bound by the modified histones, while ATAC-seq uses a transposase enzyme to cut the open chromatin into fragments of DNA. The DNA fragments obtained from both techniques can be sequenced and aligned against a reference genome. Once the location of the fragments is determined, the genome is scanned for significant enrichment in a process called peak calling. Differential analysis is then used to compare local enrichment-level variations between different biological conditions. Combining ChIP-seq and ATAC-seq data with other information, such as RNA-seq–derived transcriptomics data, can further help to build a comprehensive picture of the complex underlying biology. This work therefore focuses on the development of computational tools to help with the analysis of epigenomics research data. In this thesis, a robust workflow for the differential analysis of ChIP-seq and ATAC-seq data is developed and evaluated against existing tools using one synthetic dataset, two biological ChIP-seq datasets and two biological ATAC-seq datasets. RNA-seq data is then further correlated with the detected peaks. 
An efficient, replicate-driven tool for visualising the coverage of DNA fragments on the genome is also proposed and compared with two existing tools. Lastly, two studies are presented that showcase the usefulness of the differential analysis approaches for extracting knowledge in a real-life biological setting.
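
The peak-calling step mentioned in this abstract (scanning the genome for significant enrichment of aligned fragments) can be illustrated with a minimal sketch. The abstract does not say which peak caller the thesis uses, so the code below is only an assumption-laden toy: it splits the genome into fixed windows and flags windows whose read count is improbably high under a Poisson background whose rate is the genome-wide mean. The function names `poisson_sf` and `call_peaks`, the window size and the threshold `alpha` are all hypothetical choices made for illustration.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term            # add P(X = i)
        term *= lam / (i + 1)  # advance to P(X = i + 1)
    return max(0.0, 1.0 - cdf)


def call_peaks(positions, genome_length, window, alpha=1e-3):
    """Flag fixed-size windows whose read count is enriched.

    positions: 0-based start coordinates of aligned fragments (toy input).
    Returns a list of (start, end, count) tuples for enriched windows.
    The background rate is the mean count per window, a simplifying
    assumption; real peak callers use more careful local backgrounds.
    """
    n_windows = -(-genome_length // window)  # ceiling division
    counts = [0] * n_windows
    for p in positions:
        counts[p // window] += 1
    lam = len(positions) / n_windows
    peaks = []
    for i, c in enumerate(counts):
        if poisson_sf(c, lam) < alpha:
            start = i * window
            end = min(start + window, genome_length)
            peaks.append((start, end, c))
    return peaks


if __name__ == "__main__":
    # One background read per 100 bp window plus a dense cluster at 420-439.
    background = [50 + 100 * i for i in range(10)]
    cluster = list(range(420, 440))
    print(call_peaks(background + cluster, genome_length=1000, window=100))
```

No multiple-testing correction is applied here; a differential analysis between conditions would then compare the counts in such regions across samples.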

    Differential ATAC-seq and ChIP-seq peak detection using ROTS

    Changes in cellular chromatin states fine-tune transcriptional output and ultimately lead to phenotypic changes. Here we propose a novel application of our reproducibility-optimized test statistics (ROTS) to detect differential chromatin states (ATAC-seq) or differential chromatin modification states (ChIP-seq) between conditions. We compare the performance of ROTS to existing and widely used methods for ATAC-seq and ChIP-seq data using both synthetic and real datasets. Our results show that ROTS outperformed other commonly used methods when analyzing ATAC-seq data. ROTS also displayed the most accurate detection of small differences when modeling with synthetic data. We observed that two-step methods that require the use of a separate peak caller often more accurately called enrichment borders, whereas one-step methods without a separate peak calling step were more versatile in calling sub-peaks. The top ranked differential regions detected by the methods had marked correlation with transcriptional differences of the closest genes. Overall, our study provides evidence that ROTS is a useful addition to the available differential peak detection methods to study chromatin and performs especially well when applied to study differential chromatin states in ATAC-seq data.

    Statistical analysis of genomic binding sites using high-throughput ChIP-seq data

    This thesis focuses on the statistical analysis of Chromatin immunoprecipitation sequencing (ChIP-Seq) data produced by Next Generation Sequencing (NGS). ChIP-Seq is a method to investigate interactions between protein and DNA. Specifically, the method aims to identify the binding sites of a particular protein of interest, such as a transcription factor, in the genome. In the context of cancer research, this information is important to check whether, for example, a particular transcription factor can be considered as a therapeutic target. The sequence data produced by ChIP-Seq experiment are in the form of mapped short sequences, which are called reads. The reads are counted at each single genomic position, and the read counts are the data to be analysed. There are many problems related to the analysis of ChIP-Seq data, and in this research we focus on three of them. First, in the analysis of ChIP-Seq data, the genome is not analysed in its entirety; instead the intensity of read counts is estimated locally. Estimating the intensity of read counts usually involves dividing the genome into small regions (windows). If the window size is small, the noise level (low read counts) would dominate and many empty windows would be observed. If the window size is large, the windows would have many small read counts, which would smooth out some important features. The need exists for an approach that enables researchers to choose an appropriate window size. To address this problem, an approach was developed to optimise the window size. The approach optimises the window size based on histogram construction. Note, the developed methodology is published in [46]. Second, different studies of ChIP-Seq can target different transcription factors and then give different conclusions, which is expected. However, they are all ChIP-Seq datasets and many of them are performed on the same genome, for example the human genome. So is there a pattern for the distribution of the counts? 
If the answer is yes, is the pattern common in all ChIP-Seq data? Answering this question can help in better understanding the biology behind this experiment. We try to answer this question by investigating RUNX1/ETO ChIP-Seq data. We try to develop a statistical model that is able to describe the data. We employ some observed features in ChIP-Seq data to improve the performance of the model. Although we obtained a model that is able to describe the RUNX1/ETO data, the model does not provide a good statistical fit to the data. Third, it is biologically important to know what changes (if any) occur at the binding sites under some biological conditions, for example in knock-out experiments. Changes in the binding sites can be either in the location of the sites or in the characteristics of the sites (for example, the density of the read counts), or sometimes both. Current approaches to differential binding site analysis suffer from two major drawbacks: first, the underlying models are unclear because of dependencies between the methods used, for example peak finding and testing methods; second, they lack accurate control of the type-I error. Hence there is a need for approaches that address these drawbacks. To this end, we developed three statistical tests that are able to detect significantly differential regions between two ChIP-Seq datasets. The tests are evaluated and compared to some current methodologies by using simulated and real ChIP-Seq datasets. The proposed tests exhibit more power and greater accuracy than current methodologies.
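
The window-size trade-off described in this abstract (small windows are dominated by noise and empty bins, large windows smooth out features) can be made concrete with a short sketch. The abstract says only that the window size is optimised "based on histogram construction" and does not give the criterion, so the code below swaps in a well-known histogram bin-width rule, the Shimazaki-Shinomoto cost C(w) = (2·mean − variance) / w², as an assumption rather than as the thesis's method. The function names and the candidate window sizes are hypothetical.

```python
def histogram_cost(positions, genome_length, window):
    """Shimazaki-Shinomoto cost for one window size (lower is better).

    positions: 0-based read start coordinates (toy input).
    Candidate windows should divide genome_length evenly so that every
    bin covers the same number of bases.
    """
    n_windows = -(-genome_length // window)  # ceiling division
    counts = [0] * n_windows
    for p in positions:
        counts[p // window] += 1
    mean = sum(counts) / n_windows
    # Biased (population) variance of the per-window counts.
    var = sum((c - mean) ** 2 for c in counts) / n_windows
    return (2 * mean - var) / window ** 2


def best_window(positions, genome_length, candidates):
    """Return the candidate window size with the lowest cost."""
    return min(
        candidates,
        key=lambda w: histogram_cost(positions, genome_length, w),
    )


if __name__ == "__main__":
    # Evenly spread reads: no structure to resolve, so a wide window wins.
    uniform = [0, 1, 2, 3]
    print(best_window(uniform, 4, [1, 2, 4]))

    # Two sharp pile-ups: a narrow window is needed to resolve them.
    clustered = [0, 0, 0, 0, 8, 8, 8, 8]
    print(best_window(clustered, 16, [1, 2, 4, 8, 16]))
```

The cost rewards windows whose counts vary more than a Poisson process with the same mean would, which is one way of formalising the balance between noise and over-smoothing that the abstract describes.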

    Statistical Methods for the Analysis of Epigenomic Data

    Epigenomics, the study of the human genome and its interactions with proteins and other cellular elements, has become of significant interest in the past decade. Several landmark studies have shown that these interactions regulate essential cellular processes (gene transcription, gene silencing, etc.) and are associated with multiple complex disorders such as cancer incidence, cardiovascular disease, etc. Chromatin immunoprecipitation followed by massively-parallel sequencing (ChIP-seq) is one of several techniques used to (1) detect protein-DNA interaction sites, (2) classify differential epigenomic activity across conditions, and (3) characterize subpopulations of single-cells in heterogeneous samples. In this dissertation, we present statistical methods to tackle problems (1-3) in contexts where protein-DNA interaction sites expand across broad genomic domains. First, we present a statistical model that integrates data from multiple epigenomic assays and detects protein-DNA interaction sites in consensus across multiple replicates. We introduce a class of zero-inflated mixed-effects hidden Markov models (HMMs) to account for the excess of observed zeros, the latent sample-specific differences, and the local dependency of sequencing read counts. By integrating multiple samples into a statistical model tailored for broad epigenomic marks, our model shows high sensitivity and specificity in both simulated and real datasets. Second, we present an efficient framework for the detection and classification of regions exhibiting differential epigenomic activity in multi-sample multi-condition designs. The presented model utilizes a finite mixture model embedded into a HMM to classify patterns of broad and short differential epigenomic activity across conditions. We utilize a fast rejection-controlled EM algorithm that makes our implementation among the fastest algorithms available, while showing improvement in performance in data from broad epigenomic marks. 
Lastly, we analyze data from single-cell ChIP-seq assays and present a statistical model that allows the simultaneous clustering and characterization of single-cell subpopulations. The presented framework is robust to the sparsity often observed in single-cell epigenomic data and accounts for the local dependency of counts. We introduce a scheme for initializing the EM algorithm and for identifying the number of single-cell subpopulations in the data, a common task in current single-cell epigenomic algorithms.

    A functional and regulatory perspective on Arabidopsis thaliana


    Integrative analysis of ChIP-chip datasets in Saccharomyces cerevisiae

    ChIP-chip is a technology originally developed to determine the binding sites of proteins in chromatin on a genome-wide scale. Its uses have since been expanded to analyse other genome features, such as epigenetic modifications and, in our laboratory, DNA damage. Datasets comprise many thousands of data points and therefore require bioinformatic tools for their analysis. Currently available tools are limited in their applications and lack the ability to normalise data so as to allow relative comparisons between different datasets. This has limited the analyses of multiple ChIP-chip datasets from different experimental conditions. The first part of the study presented here is bioinformatic, presenting a selection of tools written in R for ChIP-chip data analysis, including a novel normalisation procedure which allows datasets from different conditions to be analysed together, permitting comparisons of values between different experiments and opening up a new dimension of analysis of these datasets. A novel enrichment detection procedure is presented, suited to many formats of data, including protein binding (which forms peaks) and epigenetic modifications (which can form extended regions of enrichment). Graphical tools are also presented, to facilitate the analysis of these large datasets. A method of predicting the output of a ChIP-chip dataset is presented, which has been used to show that ChIP-chip is capable of detecting sequence-dependent damage events. All functions work together, using a common data format, and are efficient and easy to use. The second part of this study applies these bioinformatic tools in a biological context. An analysis of Abf1 protein binding datasets has been undertaken, revealing many more binding sites than had previously been identified. 
Analysis of the sequences at these binding sites identified the previously determined consensus binding motif in only a subset, with no novel motif identifiable in the remainder, suggesting binding may be influenced by factors other than sequence.

    Identification and Characteristics of Factors Regulating Hepatocellular Carcinoma Progression and Metastasis: A Dissertation

    Hepatocellular carcinoma (HCC) is a common malignancy of the liver that is one of the most frequent causes of cancer-related death in the world. Surgical resection and liver transplantation are the only curative options for HCC, and tumor invasion and metastasis render many patients ineligible for these treatments. Identification of the mechanisms that contribute to invasive and metastatic disease may inform therapeutic strategies for those not eligible for surgical treatments. In this dissertation, I describe two sets of experiments to elucidate mechanisms underlying HCC dissemination, involving the activities of Krüppel-like factor 6 and a particular p53 point mutation, R172H. Gene expression profiling of migratory HCC subpopulations demonstrated reduced expression of Krüppel-like factor 6 (KLF6) in invasive HCC cells. Knockdown of KLF6 in HCC cells increased cell transformation and migration. Single-copy deletion of Klf6 in an HCC mouse model results in increased tumor formation, increased metastasis to the lungs, and decreased survival, indicating that KLF6 suppresses both tumor formation and metastasis in HCC. To elucidate the mechanism of KLF6-mediated tumor and metastasis suppression, we performed gene expression profiling and ChIP-sequencing to identify direct transcriptional targets of KLF6 in HCC cells. This analysis revealed novel transcriptional targets of KLF6 in HCC including CDC42EP3 and VAV3, both of which are positive regulators of Rho family GTPases. Concordantly, KLF6 knockdown cells demonstrate increased activity of the Rho family GTPases RAC1 and CDC42, and RAC1 is required for migration induced following KLF6 knockdown. Moreover, VAV3 and CDC42EP3 are also required for enhanced cell migration in HCC cells with KLF6 knockdown. Together, this work describes a novel signaling axis through which KLF6-mediated repression of VAV3 and CDC42EP3 inhibits RAC1-mediated HCC cell migration in culture, and potentially HCC metastasis in vivo. 
TP53 gene mutations are commonly found in HCC and are associated with poor prognosis. Prior studies have suggested that p53 mutants can display gain-of-function properties in other tumor types. Therefore, I sought to determine if a particular hotspot p53 mutation, p53R172H, provided enhanced, gain-of-function properties compared to p53 loss in HCC. In vitro, soft agar colony formation and cell migration are reduced upon knockdown of p53R172H, indicating that this mutation is required for transformation-associated phenotypes in these cells. However, p53R172H-expressing mice did not have enhanced tumor formation or metastasis compared to p53-null mice. These data suggest that p53R172H and p53 deletion are functionally equivalent in vivo, and that p53R172H is not a gain-of-function mutant in HCC. Inhibition of the related transcription factors p63 and p73 has been suggested as a potential mechanism by which mutant p53 exerts its gain-of-function effects. Analysis of p63 and p73 target genes demonstrated that they are similarly suppressed in p53-null and p53R172H-expressing HCC cell lines, suggesting a potential explanation for the phenotypes I observed in vivo and in vitro. Together, the studies described in this dissertation increase our understanding of the mechanisms underlying HCC progression and metastasis. Specifically, we find and characterize KLF6 as a novel suppressor of HCC metastasis, and determine the contribution of a common p53 point mutation in HCC. This work contributes to ongoing efforts to improve treatment options for HCC patients.

    Plant Genetics and Molecular Biology

    This book reviews the latest advances in multiple fields of plant biotechnology and the opportunities that plant genetics, genomics and molecular biology have offered for agricultural improvement. Advanced technologies can dramatically enhance our capacity to understand the molecular basis of traits and to utilize the available resources for accelerated development of high-yielding, nutritious, input-use-efficient and climate-smart crop varieties. In this book, readers will discover the significant advances in plant genetics, structural and functional genomics, trait and gene discovery, transcriptomics, proteomics, metabolomics, epigenomics, nanotechnology and analytical & decision support tools in breeding. This book appeals to researchers, academics and other stakeholders of global agriculture.