IDENTIFICATION OF CIS-REGULATORY MODULES AND NON-CODING VARIATION USING MACHINE LEARNING METHODS
Table of Contents
CHAPTER I: INTRODUCTION
1. Transcriptional regulation
1.1 Classes of cis-regulatory modules
1.1.1 CRM architecture
1.2 Chromatin signatures of CRMs
1.3 Motif and evolutionary constraint in noncoding regions
1.4 Detecting regulatory regions using experimental methods
1.4.1 Genome-wide identification of TF binding with ChIP and DamID
1.4.2 Identification of enhancers using open chromatin profiling
1.4.3 Functional validation of enhancers
1.4.3.1 Massively parallel reporter assay
1.4.3.2 STARR-seq
1.4.3.3 Assays using genomic integration
2. Computational identification of regulatory elements in the genome
2.1 Motif-based approaches
2.2 Comparative genomics approaches to identify functional binding sites
2.3 CRM detection using motif clustering
2.4 Machine learning approaches to find CRMs
2.4.1 Unsupervised learning methods
2.4.1.1 Hidden Markov Models
2.4.2 Supervised methods
2.4.2.1 Evaluation of model performance
2.4.2.2 Regularized linear models
2.4.2.3 SVM for CRM prediction
2.4.2.4 Ensemble of decision trees
2.4.2.4.1 Algorithms to train a decision tree classifier
2.4.2.4.2 Parameters of the Random Forest
2.4.2.5 Feature selection methods
2.4.2.5.1 Filter methods
2.4.2.5.2 Wrapper methods
2.4.2.5.3 Embedded methods
2.4.2.6 Deep learning methods
2.4.2.6.1 Convolutional Neural Networks
2.4.2.6.2 Overfitting in the CNN
2.4.2.6.3 CNNs for computational identification of CRMs
3. Transcriptional regulation and cancer
3.1 Role of TP53 in cancer
3.2 Role of non-coding mutations in cancer
CHAPTER II: OBJECTIVES
CHAPTER III: RESULTS
PAPER 1: Identification of High-Impact cis-Regulatory Mutations Using Transcription Factor Specific Random Forest Models
PAPER 2: Multiplex enhancer-reporter assays uncover unsophisticated TP53 enhancer logic
CHAPTER IV: DISCUSSION
5.1 Computational models to identify TF-specific enhancers
5.2 Prediction of high-impact cis-regulatory mutations with enhancer models
5.3 Deciphering p53 enhancer logic using high-throughput enhancer reporter assays coupled with machine learning
5.4 General conclusion
5.5 Future perspectives
BIBLIOGRAPHY
nrpages: 172
status: published
A novel High-throughput Enhancer reporter assay reveals unsophisticated p53 enhancer logic
Deciphering the cis-regulatory logic encoded in enhancer sequences requires large-scale reporter assays to experimentally validate candidate enhancers predicted by genomic approaches such as chromatin accessibility profiling and ChIP-seq. Here, we propose a novel high-throughput enhancer-reporter assay called CHEQ-Seq (Captured High-throughput Enhancer testing by Quantitative Sequencing). A set of candidate enhancers is pre-selected as regions of 0.5-1 kb and enriched from sheared genomic DNA using custom-designed capture baits. The enriched fragments are subsequently cloned into a reporter library and randomly combined with unique barcodes, before being tested under various conditions in cell culture. The relationship between each enhancer and its reporter barcode is determined by PacBio long-read sequencing of the entire library, while the barcode expression level is determined by Illumina short-read cDNA sequencing. We have applied CHEQ-Seq to test the enhancer activity of 1526 p53 ChIP-seq peaks under p53 knock-down and p53 over-activating conditions. We obtained reproducible reporter expression for 1060 captured enhancers, of which 397 showed a significant p53-dependent activation. Strikingly, the large majority (99%) of p53 target enhancers can be characterized and distinguished from negative sequences by the occurrence of a single p53 binding site. Thus, the p53 enhancer logic represents a new ancestral class of enhancers, distinct from developmental enhancers that adhere to the billboard and enhanceosome models. The p53 enhancers do not contain obvious combinatorial complexity of binding sites for multiple transcription factors. This suggests that p53 acts alone on its target enhancers, and that context-dependent regulation of target genes is not encoded in the p53 enhancer sequences, but at different upstream or downstream layers of the cell’s gene regulatory network.
status: accepted
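The two sequencing read-outs described in this abstract can be combined with a simple lookup. The sketch below is only an illustration under stated assumptions: the function name `enhancer_expression`, the dictionary layout, and the toy barcodes and counts are hypothetical, and summing raw cDNA counts per enhancer is a simplification rather than the normalization used in the study.

```python
from collections import defaultdict


def enhancer_expression(barcode_to_enhancer, cdna_counts):
    """Sum cDNA read counts over all barcodes assigned to each enhancer.

    barcode_to_enhancer: barcode -> enhancer ID (as derived from long reads).
    cdna_counts: barcode -> read count (as derived from short-read cDNA data).
    """
    totals = defaultdict(int)
    for barcode, count in cdna_counts.items():
        enhancer = barcode_to_enhancer.get(barcode)
        if enhancer is None:
            continue  # barcode was never linked to an enhancer; skip it
        totals[enhancer] += count
    return dict(totals)


# Hypothetical toy data, not taken from the study.
barcode_map = {"ACGT": "enh1", "TTGA": "enh1", "GGCC": "enh2"}
counts = {"ACGT": 10, "TTGA": 5, "GGCC": 2, "AAAA": 7}
print(enhancer_expression(barcode_map, counts))
```

Barcodes that appear in the cDNA data but not in the long-read map are ignored here, which is one possible design choice among several.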
Identification of High-Impact cis-Regulatory Mutations Using Transcription Factor Specific Random Forest Models
Cancer genomes contain vast numbers of somatic mutations, many of which are passenger mutations not involved in oncogenesis. Whereas driver mutations in protein-coding genes can be distinguished from passenger mutations based on their recurrence, non-coding mutations are usually not recurrent at the same position. Therefore, it is still unclear how to identify cis-regulatory driver mutations, particularly when chromatin data from the same patient are not available and the analysis must rely only on sequence and expression information. Here we use machine-learning methods to predict functional regulatory regions using sequence information alone, and compare the predicted activity of the mutated region with that of the reference sequence. In this way we define the Predicted Regulatory Impact of a Mutation in an Enhancer (PRIME). We find that the recently identified driver mutation in the TAL1 enhancer has a high PRIME score, representing a "gain-of-target" for MYB, whereas the highly recurrent TERT promoter mutation has a surprisingly low PRIME score. We trained Random Forest models for 45 cancer-related transcription factors, and used these to score variations in the HeLa genome and somatic mutations across more than five hundred cancer genomes. Each model predicts only a small fraction of non-coding mutations with a potential impact on the function of the encompassing regulatory region. Nevertheless, as these few candidate driver mutations are often linked to gains in chromatin activity and gene expression, they may contribute to the oncogenic program by altering the expression levels of specific oncogenes and tumor suppressor genes.
status: published
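The comparison of mutated and reference sequence described in this abstract amounts to a difference of two model scores. The following is a minimal sketch, assuming any sequence-scoring function can be plugged in. `toy_score` is a hypothetical stand-in (best fractional match to an arbitrary motif string) and not the Random Forest models of the paper; the name `prime_like_delta` and the example sequences are likewise illustrative.

```python
def toy_score(seq, motif="AACGG"):
    """Hypothetical stand-in scorer: best fraction of bases matching `motif`
    over all windows of the sequence. NOT the paper's Random Forest model."""
    best = 0.0
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        matches = sum(a == b for a, b in zip(window, motif))
        best = max(best, matches / len(motif))
    return best


def prime_like_delta(ref_seq, mut_seq, score=toy_score):
    """Predicted activity of the mutated sequence minus that of the reference.

    Positive values suggest a gain, negative values a loss, under the
    chosen scoring function.
    """
    return score(mut_seq) - score(ref_seq)


# Hypothetical sequences: a single substitution completes the arbitrary motif.
reference = "TTTAACTGTTT"
mutated = "TTTAACGGTTT"
print(round(prime_like_delta(reference, mutated), 3))
```

Because the two sequences are scored independently, the same function also handles insertions, where reference and mutated sequences differ in length.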
Identification of cis-regulatory mutations generating de novo edges in personalized cancer gene regulatory networks
The identification of functional non-coding mutations is a key challenge in the field of genomics. Here we introduce μ-cisTarget to filter, annotate and prioritize cis-regulatory mutations based on their putative effect on the underlying "personal" gene regulatory network. We validated μ-cisTarget by re-analyzing the TAL1 and LMO1 enhancer mutations in T-ALL, and the TERT promoter mutation in melanoma. Next, we re-sequenced the full genomes of ten cancer cell lines and used matched transcriptome data and motif discovery to identify master regulators with de novo binding sites that result in the up-regulation of nearby oncogenic drivers. μ-cisTarget is available from http://mucistarget.aertslab.org.
status: published
Overview of the methodology.
<p>A) To identify functional CRMs we searched for significant correlations between TF ChIP-seq tracks and TF target genes using i-cisTarget [<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004590#pcbi.1004590.ref028" target="_blank">28</a>], and selected peaks (marked in green) that are located in the 20 kb regulatory space around up- or down-regulated TF target genes. B) Feature selection was performed on the set of functional CRMs to select TF and co-regulatory PWMs and data tracks. C) The performance of each of the 45 TF models was evaluated by 5-fold cross-validation, using the area under the precision-recall and receiver-operating characteristic curves. D) The 45 learned classifiers were used to identify <i>cis</i>-regulatory somatic mutations that have an impact on the CRM score, defining a PRIME score (Predicted Regulatory Impact of a Mutation in an Enhancer).</p>
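The evaluation in panel C can be illustrated with plain Python. This is a sketch under assumptions, not the evaluation code of the study: `auroc` uses the pairwise ranking definition of the ROC area, `average_precision` is one common summary of the precision-recall curve (it ignores score ties and assumes at least one positive example), and `k_fold_indices` is a simple unstratified split. All function names and the toy labels and scores are hypothetical.

```python
def auroc(labels, scores):
    """Area under the ROC curve: the probability that a random positive
    outranks a random negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for a simple k-fold split."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


# Hypothetical toy labels and classifier scores.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(auroc(labels, scores), round(average_precision(labels, scores), 4))
```

In a full cross-validation run, a model would be trained on each `train` list and both metrics computed on the matching `test` list, then averaged over the five folds.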
Candidate <i>cis</i>-regulatory driver SNVs and insertions across 498 breast cancer genomes.
<p>A) All SNVs and insertions with a high PRIME score (>0.3) (insertions are within the black box) found by M1 models in the regulatory regions around cancer-related genes and 167 TFs expressed in breast cancer (all significant PRIME scores with model-specific thresholds are provided in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004590#pcbi.1004590.s028" target="_blank">S5</a>–<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004590#pcbi.1004590.s029" target="_blank">S6</a> Tables). Values inside boxes indicate the recurrence, that is, the number of samples in which the variant was found across the 498 TCGA samples. B) An example of a high-scoring recurrent insertion that is predicted to generate a TP53 gain of target in the vicinity of SOX5. Z-scores of SOX5 gene expression are significantly higher (Wilcoxon rank sum test) in the 33 samples with the insertion than in samples without the insertion.</p>
Regulatory impact score on simulated substitutions.
<p>A) Nucleotide substitutions with higher PRIME scores are under constraint. B) An example of the <i>E2F1</i> promoter for which each possible substitution is evaluated by M0 and M1 models. The M1 model (Random Forest) identifies a 15 bp region that is highly vulnerable to mutations, while three different M0 models (using only the PWM) identify excessive numbers of false-positive substitutions, demonstrating the higher specificity of the Random Forest classifiers compared to single PWMs. C) Barplot showing an example from A), namely averaged phastCons scores as a function of the PRIME score threshold, for the E2F4 model. Error bars represent the standard error of the mean.</p>
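Evaluating each possible substitution, as in panel B, can be sketched as an exhaustive scan over positions and alternative bases. The code below is illustrative only: `all_substitutions` and `scan_substitutions` are hypothetical names, and `gc_fraction` is a deliberately trivial placeholder scorer standing in for the M0 (PWM) or M1 (Random Forest) models, which are not reproduced here.

```python
def all_substitutions(seq, alphabet="ACGT"):
    """Yield (position, ref_base, alt_base, mutated_sequence) for every
    possible single-nucleotide substitution in `seq`."""
    for pos, ref_base in enumerate(seq):
        for alt in alphabet:
            if alt != ref_base:
                yield pos, ref_base, alt, seq[:pos] + alt + seq[pos + 1:]


def scan_substitutions(seq, score):
    """Score every substitution as (mutant score - reference score)."""
    ref_score = score(seq)
    return [
        (pos, ref, alt, score(mutant) - ref_score)
        for pos, ref, alt, mutant in all_substitutions(seq)
    ]


def gc_fraction(seq):
    """Placeholder scorer (assumption): fraction of G and C bases."""
    return sum(base in "GC" for base in seq) / len(seq)


# Hypothetical 4 bp sequence: 4 positions x 3 alternatives = 12 substitutions.
for row in scan_substitutions("ACGT", gc_fraction):
    print(row)
```

Plotting the per-position deltas along the sequence would highlight regions where substitutions change the score most, which is the kind of view the caption describes.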
Comparison of PWMs and Random Forest classifiers on the known <i>TAL1</i> insertion.
<p>We scored the known <i>TAL1</i> enhancer insertion that occurs in the Jurkat cell line [<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004590#pcbi.1004590.ref006" target="_blank">6</a>] with Random Forest (M1) and PWM (M0) MYB-specific models. As a control, we scored all SNVs and insertions in promoters across 498 breast cancer genomes with the same MYB models, to calculate a background distribution of impact scores. A) The distribution of background PRIME scores (i.e., delta Random Forest scores) and the observed PRIME score for M1, indicated by the orange arrow. B) The distribution of background PWM-delta scores (M0 model) and the observed score. C) Feature importance within the MYB model indicates that both MYB motifs and co-regulatory TF motifs contribute significantly to the classification decision; the most important co-regulatory motif is RUNX, a known co-regulatory factor of MYB. D) The known driver insertion in the <i>TAL1</i> enhancer generates a gain of an H3K27Ac peak, whereas the known SNV in the <i>TERT</i> promoter does not. The red highlighted region indicates which samples harbor the respective <i>cis</i>-regulatory mutation.</p>
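One way to summarize where an observed score falls relative to a background distribution, as in panels A and B, is the fraction of background scores that are at least as large. This is an assumption about how such a comparison could be quantified, not a statistic reported in the caption. The function name, the pseudocount convention, and the toy numbers are all hypothetical.

```python
def empirical_tail_fraction(observed, background):
    """Fraction of background scores >= observed, with a +1 pseudocount
    so that the result is never exactly zero."""
    exceed = sum(1 for score in background if score >= observed)
    return (exceed + 1) / (len(background) + 1)


# Hypothetical background delta scores and one observed delta score.
background_scores = [0.0, 0.01, 0.02, 0.05, 0.4]
print(round(empirical_tail_fraction(0.3, background_scores), 3))
```

A small value indicates that few background mutations reach the observed impact score under the same model.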
Feature importance.
<p>A) Three examples of TFs, each with several (for NANOG and TP53) or one (for MYC) target CRMs, illustrating the feature importance in the Random Forest classifier, in the M3 model. For NANOG, co-regulatory PWMs contribute more to the classification performance than the PWM of NANOG itself. For TP53, the contribution of the co-regulatory PWMs is not strong and the classification decision is largely based on the presence of strong binding sites of TP53 itself. For the MYC model, the most important features are regulatory tracks. B) Examples of a decision tree in the ensemble. C) Averaged feature importance across trees, showing the contribution of various features to the classification decision. For example, TCF12 and ATF2 tracks are dominant for the NANOG model; for TP53 the most relevant features are motifs of the query TF (red), and particularly important ones are represented with logos. The colored region around the dashed line shows the standard deviation of the feature importance across trees.</p>
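The averaging in panel C, with its standard deviation band, can be sketched from a table of per-tree importances. The code assumes such a table is already available (one row per tree, one column per feature); how it is extracted depends on the Random Forest implementation and is not shown. The function name, feature names, and values are hypothetical, and the population standard deviation is one possible choice.

```python
from statistics import mean, pstdev


def summarize_importance(per_tree, feature_names):
    """Return {feature: (mean importance, standard deviation)} across trees.

    per_tree: list of rows, one per tree, each holding one importance
    value per feature in the order given by `feature_names`.
    """
    summary = {}
    for j, name in enumerate(feature_names):
        values = [tree[j] for tree in per_tree]
        summary[name] = (mean(values), pstdev(values))
    return summary


# Hypothetical importances for three trees and two features.
per_tree_importance = [[0.6, 0.4], [0.8, 0.2], [0.7, 0.3]]
names = ["query_TF_motif", "co_factor_motif"]
for feature, (avg, sd) in summarize_importance(per_tree_importance, names).items():
    print(feature, round(avg, 3), round(sd, 3))
```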