Computational methods for studying epigenomic regulation
In the nucleus, DNA is tightly wrapped around proteins in a structure called chromatin in order to protect it from degradation. Chromatin is composed of nucleosomes which are a structure of eight histones around which the DNA is wrapped. Nucleosomes can be modified by enzymes on amino acids located on their N-terminal tails. These modifications allow the chromatin to open and close in targeted regions, providing control over gene expression.
At present, chromatin immuno-precipitation (ChIP) and assay of transposase-accessible chromatin (ATAC) combined with high-throughput sequencing (ChIP-seq and ATAC-seq) are the major high-throughput methods allowing the study of histone modifications and genome-wide chromatin openness, respectively. Typically, ChIP-seq targets one histone at a time by enriching the histone-bound regions of the genome using immuno-precipitation, while ATAC-seq uses a transposase enzyme to cut the open chromatin into fragments of DNA. The DNA fragments obtained from both techniques can be sequenced and aligned against a reference genome. Once the location of the fragments is determined, the genome is scanned for significant enrichment in a process called peak calling. Differential analysis is then used to compare local enrichment-level variations between different biological conditions. Combining ChIP-seq and ATAC-seq data with other information, such as RNA-seq–derived transcriptomics data, can further help to build a comprehensive picture of the complex underlying biology. This work therefore focuses on the development of computational tools to help with the analysis of epigenomics research data.
In this thesis, a robust workflow for the differential analysis of ChIP-seq and ATAC-seq data is developed and evaluated against existing tools using one synthetic dataset, two biological ChIP-seq datasets and two biological ATAC-seq datasets. RNA-seq data is then further correlated with the detected peaks. An efficient replicate-driven visualisation tool is also proposed to visualise coverage of DNA fragments on the genome, which is compared to two existing tools, highlighting its efficiency. Lastly, two studies are presented showcasing the usefulness of the differential analysis approaches in extracting knowledge in a real-life biological setting.
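The peak-calling step described in this abstract (scanning the genome for significant enrichment) can be illustrated with a minimal sketch. It is not the workflow developed in the thesis: it assumes reads have already been counted in fixed-size windows, models the background as a single genome-wide Poisson rate, and applies a Bonferroni correction. The function names `poisson_sf` and `call_peaks` are hypothetical; dedicated peak callers use more elaborate local background models and control samples.

```python
import math

def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)        # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term              # add P(X = i)
        term *= lam / (i + 1)    # advance to P(X = i + 1)
    # Clamp: rounding can push 1 - cdf marginally below zero.
    return max(0.0, 1.0 - cdf)

def call_peaks(window_counts, alpha=0.05):
    """Return indices of windows enriched over a genome-wide background.

    Assumption: a single uniform Poisson rate describes the background.
    """
    n = len(window_counts)
    lam = sum(window_counts) / n     # estimated background rate
    threshold = alpha / n            # Bonferroni-corrected significance level
    return [i for i, c in enumerate(window_counts)
            if poisson_sf(c, lam) < threshold]

# Toy example: 99 background windows and one strongly enriched window.
counts = [1] * 99 + [30]
print(call_peaks(counts))
```

Because the tail probability is computed as one minus a cumulative sum, very small p-values lose precision; this is acceptable for thresholding in a sketch but not for ranking peaks.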
Differential ATAC-seq and ChIP-seq peak detection using ROTS
Changes in cellular chromatin states fine-tune transcriptional output and ultimately lead to phenotypic changes. Here we propose a novel application of our reproducibility-optimized test statistics (ROTS) to detect differential chromatin states (ATAC-seq) or differential chromatin modification states (ChIP-seq) between conditions. We compare the performance of ROTS to existing and widely used methods for ATAC-seq and ChIP-seq data using both synthetic and real datasets. Our results show that ROTS outperformed other commonly used methods when analyzing ATAC-seq data. ROTS also displayed the most accurate detection of small differences when modeling with synthetic data. We observed that two-step methods that require the use of a separate peak caller often more accurately called enrichment borders, whereas one-step methods without a separate peak calling step were more versatile in calling sub-peaks. The top ranked differential regions detected by the methods had marked correlation with transcriptional differences of the closest genes. Overall, our study provides evidence that ROTS is a useful addition to the available differential peak detection methods to study chromatin and performs especially well when applied to study differential chromatin states in ATAC-seq data.
Statistical analysis of genomic binding sites using high-throughput ChIP-seq data
This thesis focuses on the statistical analysis of Chromatin immunoprecipitation
sequencing (ChIP-Seq) data produced by Next Generation Sequencing (NGS). ChIP-Seq
is a method to investigate interactions between protein and DNA. Specifically, the method
aims to identify the binding sites of a particular protein of interest, such as a transcription
factor, in the genome. In the context of cancer research, this information is important
to check whether, for example, a particular transcription factor can be considered as a
therapeutic target.
The sequence data produced by a ChIP-Seq experiment are in the form of mapped short
sequences, which are called reads. The reads are counted at each single genomic position,
and the read counts are the data to be analysed. There are many problems related to the
analysis of ChIP-Seq data, and in this research we focus on three of them.
First, in the analysis of ChIP-Seq data, the genome is not analysed in its entirety; instead
the intensity of read counts is estimated locally. Estimating the intensity of read counts
usually involves dividing the genome into small regions (windows). If the window size
is small, the noise level (low read counts) would dominate and many empty windows
would be observed. If the window size is large, the windows would have many small read
counts, which would smooth out some important features. The need exists for an approach
that enables researchers to choose an appropriate window size. To address this problem,
an approach was developed to optimise the window size. The approach optimises the
window size based on histogram construction. Note, the developed methodology is
published in [46].
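The idea of choosing a window size through histogram construction can be sketched as follows. The abstract does not name the criterion used, so as a stand-in assumption this sketch uses the Shimazaki-Shinomoto histogram cost, C(w) = (2k - v) / w^2, where k and v are the mean and (biased) variance of the read counts per window of size w. The function names and the simulated data are hypothetical and are not taken from the thesis or from [46].

```python
import random

def histogram_cost(positions, genome_length, window):
    """Shimazaki-Shinomoto cost for one candidate window size."""
    n_windows = -(-genome_length // window)   # ceiling division
    counts = [0] * n_windows
    for p in positions:
        counts[p // window] += 1
    mean = sum(counts) / n_windows
    var = sum((c - mean) ** 2 for c in counts) / n_windows  # biased variance
    return (2 * mean - var) / window ** 2

def best_window(positions, genome_length, candidates):
    """Return the candidate window size with the lowest cost."""
    return min(candidates,
               key=lambda w: histogram_cost(positions, genome_length, w))

# Simulated example: uniform background reads plus one enriched region.
random.seed(0)
genome_length = 10_000
reads = [random.randrange(genome_length) for _ in range(200)]
reads += [random.randrange(4000, 4500) for _ in range(300)]
candidates = [10, 20, 50, 100, 200, 500, 1000]
print(best_window(reads, genome_length, candidates))
```

Very small windows are penalised because they are dominated by noise, and very large windows are penalised because they smooth away real structure, which matches the trade-off described above.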
Second, different studies of ChIP-Seq can target different transcription factors and then
give different conclusions, which is expected. However, they are all ChIP-Seq datasets
and many of them are performed on the same genome, for example the human genome.
So is there a pattern for the distribution of the counts? If the answer is yes, is the pattern common in all ChIP-Seq data? Answering this question can help in better understanding
the biology behind this experiment. We try to answer this question by investigating
RUNX1/ETO ChIP-Seq data. We try to develop a statistical model that is able to describe
the data. We employ some observed features in ChIP-Seq data to improve the performance
of the model. Although we obtained a model that is able to describe the RUNX1/ETO
data, the model does not provide a good statistical fit to the data.
Third, it is biologically important to know what changes (if any) occur at the binding sites
under some biological conditions, for example in knock-out experiments. Changes in the
binding sites can be either in the location of the sites or in the characteristics of the sites
(for example, the density of the read counts), or sometimes both. Current approaches for
differential binding site analysis suffer from two major drawbacks. First, their underlying
models are unclear because of dependencies between the methods used, for example peak
finding and testing methods. Second, they lack accurate control of type-I error. Hence there
is a need for approaches that address these drawbacks. To this end, we developed three
statistical tests that are able to detect significantly differential regions between two ChIP-Seq
datasets. The tests are evaluated and compared to some current methodologies
using simulated and real ChIP-Seq datasets. The proposed tests exhibit more power as
well as accuracy compared to current methodologies.
Statistical Methods for the Analysis of Epigenomic Data
Epigenomics, the study of the human genome and its interactions with proteins and other cellular elements, has become of significant interest in the past decade. Several landmark studies have shown that these interactions regulate essential cellular processes (gene transcription, gene silencing, etc.) and are associated with multiple complex disorders such as cancer incidence, cardiovascular disease, etc. Chromatin immunoprecipitation followed by massively-parallel sequencing (ChIP-seq) is one of several techniques used to (1) detect protein-DNA interaction sites, (2) classify differential epigenomic activity across conditions, and (3) characterize subpopulations of single cells in heterogeneous samples. In this dissertation, we present statistical methods to tackle problems (1-3) in contexts where protein-DNA interaction sites expand across broad genomic domains. First, we present a statistical model that integrates data from multiple epigenomic assays and detects protein-DNA interaction sites in consensus across multiple replicates. We introduce a class of zero-inflated mixed-effects hidden Markov models (HMMs) to account for the excess of observed zeros, the latent sample-specific differences, and the local dependency of sequencing read counts. By integrating multiple samples into a statistical model tailored for broad epigenomic marks, our model shows high sensitivity and specificity in both simulated and real datasets. Second, we present an efficient framework for the detection and classification of regions exhibiting differential epigenomic activity in multi-sample multi-condition designs. The presented model utilizes a finite mixture model embedded into an HMM to classify patterns of broad and short differential epigenomic activity across conditions. We utilize a fast rejection-controlled EM algorithm that makes our implementation among the fastest algorithms available, while showing improvement in performance in data from broad epigenomic marks.
Lastly, we analyze data from single-cell ChIP-seq assays and present a statistical model that allows the simultaneous clustering and characterization of single-cell subpopulations. The presented framework is robust to the often observed sparsity in single-cell epigenomic data and accounts for the local dependency of counts. We introduce an initialization scheme for the EM algorithm as well as for the identification of the number of single-cell subpopulations in the data, a common task in current single-cell epigenomic algorithms.
Integrative analysis of ChIP-chip datasets in Saccharomyces cerevisiae
ChIP-chip is a technology originally developed to determine the binding sites
of proteins in chromatin on a genome wide scale. Its uses have since been
expanded to analyse other genome features, such as epigenetic modifications
and, in our laboratory, DNA damage. Datasets comprise many thousands of
data points and therefore require bioinformatic tools for their analysis. Currently
available tools are limited in their applications and lack the ability to
normalise data so as to allow relative comparisons between different datasets.
This has limited the analyses of multiple ChIP-chip datasets from different
experimental conditions.
The first part of the study presented here is bioinformatic, presenting a
selection of tools written in R for ChIP-chip data analysis, including a novel
normalisation procedure which allows datasets from different conditions to be
analysed together, permitting comparisons of values between different experiments
and opening up a new dimension of analysis of these datasets. A novel
enrichment detection procedure is presented, suited to many formats of data,
including protein binding (which forms peaks) and epigenetic modifications
(which can form extended regions of enrichment). Graphical tools are also
presented, to facilitate the analysis of these large datasets. A method of predicting
the output of a ChIP-chip dataset is presented, which has been used
to show that ChIP-chip is capable of detecting sequence dependent damage
events. All functions work together, using a common data format, and are
efficient and easy to use.
The second part of this study applies these bioinformatic tools in a biological
context. An analysis of Abf1 protein binding datasets has been
undertaken, revealing many more binding sites than had previously been
identified. Analysis of the sequences at these binding sites identified the previously
determined consensus binding motif in only a subset, with no novel
motif identifiable in the remainder, suggesting binding may be influenced by
factors other than sequence.
Identification and Characteristics of Factors Regulating Hepatocellular Carcinoma Progression and Metastasis: A Dissertation
Hepatocellular carcinoma (HCC) is a common malignancy of the liver that is one of the most frequent causes of cancer-related death in the world. Surgical resection and liver transplantation are the only curative options for HCC, and tumor invasion and metastasis render many patients ineligible for these treatments. Identification of the mechanisms that contribute to invasive and metastatic disease may enlighten therapeutic strategies for those not eligible for surgical treatments. In this dissertation, I describe two sets of experiments to elucidate mechanisms underlying HCC dissemination, involving the activities of Krüppel-like factor 6 and a particular p53 point mutation, R172H.
Gene expression profiling of migratory HCC subpopulations demonstrated reduced expression of Krüppel-like factor 6 (KLF6) in invasive HCC cells. Knockdown of KLF6 in HCC cells increased cell transformation and migration. Single-copy deletion of Klf6 in an HCC mouse model results in increased tumor formation, increased metastasis to the lungs, and decreased survival, indicating that KLF6 suppresses both tumor formation and metastasis in HCC.
To elucidate the mechanism of KLF6-mediated tumor and metastasis suppression, we performed gene expression profiling and ChIP-sequencing to identify direct transcriptional targets of KLF6 in HCC cells. This analysis revealed novel transcriptional targets of KLF6 in HCC including CDC42EP3 and VAV3, both of which are positive regulators of Rho family GTPases. Concordantly, KLF6 knockdown cells demonstrate increased activity of the Rho family GTPases RAC1 and CDC42, and RAC1 is required for migration induced following KLF6 knockdown. Moreover, VAV3 and CDC42EP3 are also required for enhanced cell migration in HCC cells with KLF6 knockdown. Together, this work describes a novel signaling axis through which KLF6-mediated repression of VAV3 and CDC42EP3 inhibits RAC1-mediated HCC cell migration in culture, and potentially HCC metastasis in vivo.
TP53 gene mutations are commonly found in HCC and are associated with poor prognosis. Prior studies have suggested that p53 mutants can display gain-of-function properties in other tumor types. Therefore, I sought to determine if a particular hotspot p53 mutation, p53R172H, provided enhanced, gain-of-function properties compared to p53 loss in HCC. In vitro, soft agar colony formation and cell migration are reduced upon knockdown of p53R172H, indicating that this mutation is required for transformation-associated phenotypes in these cells. However, p53R172H-expressing mice did not have enhanced tumor formation or metastasis compared to p53-null mice. These data suggest that p53R172H and p53 deletion are functionally equivalent in vivo, and that p53R172H is not a gain-of-function mutant in HCC. Inhibition of the related transcription factors p63 and p73 has been suggested as a potential mechanism by which mutant p53 exerts its gain-of-function effects. Analysis of p63 and p73 target genes demonstrated that they are similarly suppressed in p53-null and p53R172H-expressing HCC cell lines, suggesting a potential explanation for the phenotypes I observed in vivo and in vitro.
Together, the studies described in this dissertation increase our understanding of the mechanisms underlying HCC progression and metastasis. Specifically, we find and characterize KLF6 as a novel suppressor of HCC metastasis, and determine the contribution of a common p53 point mutation in HCC. This work contributes to ongoing efforts to improve treatment options for HCC patients.
Elucidating gene transcriptional regulatory mechanisms by constructing a ChIP-seq database
Tohoku University, Kengo Kinoshita (木下賢吾)
Plant Genetics and Molecular Biology
This book reviews the latest advances in multiple fields of plant biotechnology and the opportunities that plant genetics, genomics and molecular biology have offered for agricultural improvement. Advanced technologies can dramatically enhance our capacity in understanding the molecular basis of traits and utilizing the available resources for accelerated development of high yielding, nutritious, input-use efficient and climate-smart crop varieties. In this book, readers will discover the significant advances in plant genetics, structural and functional genomics, trait and gene discovery, transcriptomics, proteomics, metabolomics, epigenomics, nanotechnology and analytical & decision support tools in breeding. This book appeals to researchers, academics and other stakeholders of global agriculture.