31 research outputs found

    IDENTIFICATION OF CIS-REGULATORY MODULES AND NON-CODING VARIATION USING MACHINE LEARNING METHODS

    No full text
    Table of Contents
    CHAPTER 1: INTRODUCTION
    1. Transcriptional regulation
    1.1 Classes of cis-regulatory modules
    1.1.1 CRM architecture
    1.2 Chromatin signatures of CRMs
    1.3 Motif and evolutionary constraint in noncoding regions
    1.4 Detecting regulatory regions using experimental methods
    1.4.1 Genome-wide identification of TF binding with ChIP and DamID
    1.4.2 Identification of enhancers using open chromatin profiling
    1.4.3 Functional validation of enhancers
    1.4.3.1 Massively parallel reporter assay
    1.4.3.2 STARR-seq
    1.4.3.3 Assays using genomic integration
    2. Computational identification of regulatory elements in the genome
    2.1 Motif-based approaches
    2.2 Comparative genomics approaches to identify functional binding sites
    2.3 CRM detection using motif clustering
    2.4 Machine learning approaches to find CRMs
    2.4.1 Unsupervised learning methods
    2.4.1.1 Hidden Markov Models
    2.4.2 Supervised methods
    2.4.2.1 Evaluation of model performance
    2.4.2.2 Regularized linear models
    2.4.2.3 SVM for CRM prediction
    2.4.2.4 Ensemble of decision trees
    2.4.2.4.1 Algorithms to train a decision tree classifier
    2.4.2.4.2 Parameters of the Random Forest
    2.4.2.5 Feature selection methods
    2.4.2.5.1 Filter methods
    2.4.2.5.2 Wrapper methods
    2.4.2.5.3 Embedded methods
    2.4.2.6 Deep learning methods
    2.4.2.6.1 Convolutional Neural Networks
    2.4.2.6.2 Overfitting in the CNN
    2.4.2.6.3 CNNs for computational identification of CRMs
    3. Transcriptional regulation and cancer
    3.1 Role of TP53 in cancer
    3.2 Role of non-coding mutations in cancer
    CHAPTER II: Objectives
    CHAPTER III: Results
    PAPER 1: Identification of High-Impact cis-Regulatory Mutations Using Transcription Factor Specific Random Forest Models
    PAPER 2: Multiplex enhancer-reporter assays uncover unsophisticated TP53 enhancer logic
    CHAPTER IV: DISCUSSION
    5.1 Computational models to identify TF-specific enhancers
    5.2 Prediction of high-impact cis-regulatory mutations with enhancer models
    5.3 Deciphering p53 enhancer logic using high-throughput enhancer reporter assays coupled with machine learning
    5.4 General conclusion
    5.5 Future perspectives
    BIBLIOGRAPHY
    nrpages: 172
    status: published

    A novel High-throughput Enhancer reporter assay reveals unsophisticated p53 enhancer logic

    No full text
    Deciphering the cis-regulatory logic encoded in enhancer sequences requires large-scale reporter assays to experimentally validate candidate enhancers predicted by genomic approaches such as chromatin accessibility profiling and ChIP-seq. Here, we propose a novel high-throughput enhancer-reporter assay called CHEQ-Seq (Captured High-throughput Enhancer testing by Quantitative Sequencing). A set of candidate enhancers is pre-selected as regions of 0.5-1 kb and enriched from sheared genomic DNA using custom-designed capture baits. The enhancers are subsequently cloned into a reporter library and randomly combined with unique barcodes, before being tested under various conditions in cell culture. The relationship between each enhancer and its reporter barcode is determined by PacBio long-read sequencing of the entire library, while the barcode expression level is determined by Illumina short-read cDNA sequencing. We have applied CHEQ-Seq to test the enhancer activity of 1526 p53 ChIP-seq peaks under p53 knock-down and p53 over-activating conditions. We obtained reproducible reporter expression for 1060 captured enhancers, of which 397 showed a significant p53-dependent activation. Strikingly, the large majority (99%) of p53 target enhancers can be characterized and distinguished from negative sequences by the occurrence of a single p53 binding site. Thus, the p53 enhancer logic represents a new ancestral class of enhancers, distinct from developmental enhancers that adhere to the billboard and enhanceosome models. The p53 enhancers do not contain obvious combinatorial complexity of binding sites for multiple transcription factors. This suggests that p53 acts alone on its target enhancers, and that context-dependent regulation of target genes is not encoded in the p53 enhancer sequences, but at different upstream or downstream layers of the cell's gene regulatory network.
    status: accepted
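The barcode bookkeeping described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the toy barcodes, and the pseudocount-based ratio are all assumptions, and library-depth normalization and the significance test mentioned in the abstract are left out.

```python
from collections import defaultdict

def enhancer_activity(barcode_to_enhancer, cdna_counts):
    """Sum cDNA read counts over all barcodes assigned to each enhancer."""
    totals = defaultdict(int)
    for barcode, count in cdna_counts.items():
        enhancer = barcode_to_enhancer.get(barcode)
        if enhancer is None:
            continue  # barcode absent from the long-read barcode-enhancer map
        totals[enhancer] += count
    return dict(totals)

def activation_ratio(activated, knockdown, pseudocount=1.0):
    """Per-enhancer ratio of activated to knock-down counts (hypothetical metric)."""
    enhancers = set(activated) | set(knockdown)
    return {
        e: (activated.get(e, 0) + pseudocount) / (knockdown.get(e, 0) + pseudocount)
        for e in enhancers
    }

# Toy data (invented): the map stands in for the long-read assignments,
# the count tables for short-read cDNA barcode counts in two conditions.
barcode_to_enhancer = {"AAAC": "enh1", "GGTT": "enh1", "CCGA": "enh2"}
cdna_active = {"AAAC": 30, "GGTT": 9, "CCGA": 4, "TTTT": 7}
cdna_knockdown = {"AAAC": 2, "GGTT": 1, "CCGA": 4}

active = enhancer_activity(barcode_to_enhancer, cdna_active)
knockdown = enhancer_activity(barcode_to_enhancer, cdna_knockdown)
ratios = activation_ratio(active, knockdown)
```

Summing over barcodes before taking the ratio means an enhancer tagged by several barcodes is not dominated by any single low-count barcode; a per-barcode ratio followed by a median would be an equally plausible choice.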

    Identification of High-Impact cis-Regulatory Mutations Using Transcription Factor Specific Random Forest Models

    No full text
    Cancer genomes contain vast numbers of somatic mutations, many of which are passenger mutations not involved in oncogenesis. Whereas driver mutations in protein-coding genes can be distinguished from passenger mutations based on their recurrence, non-coding mutations are usually not recurrent at the same position. Therefore, it is still unclear how to identify cis-regulatory driver mutations, particularly when chromatin data from the same patient are not available and the analysis must rely only on sequence and expression information. Here we use machine-learning methods to predict functional regulatory regions using sequence information alone, and compare the predicted activity of the mutated region with that of the reference sequence. This way we define the Predicted Regulatory Impact of a Mutation in an Enhancer (PRIME). We find that the recently identified driver mutation in the TAL1 enhancer has a high PRIME score, representing a "gain-of-target" for MYB, whereas the highly recurrent TERT promoter mutation has a surprisingly low PRIME score. We trained Random Forest models for 45 cancer-related transcription factors, and used these to score variations in the HeLa genome and somatic mutations across more than five hundred cancer genomes. Each model predicts only a small fraction of non-coding mutations with a potential impact on the function of the encompassing regulatory region. Nevertheless, as these few candidate driver mutations are often linked to gains in chromatin activity and gene expression, they may contribute to the oncogenic program by altering the expression levels of specific oncogenes and tumor suppressor genes.
    status: published
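The comparison of mutated and reference sequence described in this abstract can be sketched as a score difference. The sketch assumes the impact is a plain difference between two model outputs, which the abstract does not state explicitly. `toy_score` and its motif `AACGG` are hypothetical stand-ins for a trained classifier, not the Random Forest models of the paper.

```python
def apply_snv(seq, pos, alt):
    """Return seq with the base at 0-based position pos replaced by alt."""
    if not 0 <= pos < len(seq):
        raise ValueError("position outside sequence")
    return seq[:pos] + alt + seq[pos + 1:]

def score_delta(score, ref_seq, pos, alt):
    """Predicted activity of the mutated sequence minus that of the reference."""
    return score(apply_snv(ref_seq, pos, alt)) - score(ref_seq)

def toy_score(seq, motif="AACGG"):
    """Stand-in scorer: best fraction of motif positions matched in any window."""
    best = 0.0
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        matches = sum(a == b for a, b in zip(window, motif))
        best = max(best, matches / len(motif))
    return best

# A T>G change at position 5 completes the hypothetical motif,
# so the delta is positive (a toy "gain-of-target").
reference = "TTAACTGTT"
delta = score_delta(toy_score, reference, 5, "G")
```

Passing the scorer as an argument keeps the delta computation independent of the model, so any function mapping a sequence to a number can be substituted.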

    Identification of cis-regulatory mutations generating de novo edges in personalized cancer gene regulatory networks

    No full text
    The identification of functional non-coding mutations is a key challenge in the field of genomics. Here we introduce μ-cisTarget to filter, annotate and prioritize cis-regulatory mutations based on their putative effect on the underlying "personal" gene regulatory network. We validated μ-cisTarget by re-analyzing the TAL1 and LMO1 enhancer mutations in T-ALL, and the TERT promoter mutation in melanoma. Next, we re-sequenced the full genomes of ten cancer cell lines and used matched transcriptome data and motif discovery to identify master regulators with de novo binding sites that result in the up-regulation of nearby oncogenic drivers. μ-cisTarget is available from http://mucistarget.aertslab.org.
    status: published

    Validation of classifiers by genome-wide CRM prediction.

    No full text
    After genome-wide CRM scoring and removal of the training CRMs, we evaluated the enrichment of ChIP-seq peaks of the corresponding TF, and the enrichment of motifs of the corresponding TF, within the top 1000 newly predicted CRMs. Enrichment is calculated by i-cisTarget [28] and represented as a Normalized Enrichment Score (NES). A) Significant enrichment of ChIP-seq peaks (orange color corresponds to NES>2.5) for 31/45 M1 models, compared to 17/45 of the Mk models. B) The motif of the respective TF is also enriched in the top 1000 newly predicted functional CRMs, for those in orange (NES>2.5).

    Overview of the methodology.

    No full text
    A) To identify functional CRMs we searched for significant correlations between TF ChIP-seq tracks and TF target genes using i-cisTarget [28], and selected peaks (marked in green) that are located in the 20 kb regulatory space around up- or down-regulated TF target genes. B) Feature selection was performed on the set of functional CRMs to select TF and co-regulatory PWMs and data tracks. C) The performance of each of the 45 TF models was evaluated by 5-fold cross-validation, using the area under the precision-recall and receiver-operating characteristic curves. D) The 45 learned classifiers were used to identify cis-regulatory somatic mutations that have an impact on the CRM score, defining a PRIME score (Predicted Regulatory Impact of a Mutation in an Enhancer).
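The evaluation in panel C can be illustrated with dependency-free versions of the two metrics and a fold splitter. This is a sketch under assumptions: average precision is used here as a common summary of the precision-recall curve and may differ from the exact area computed in the paper, the folds are contiguous rather than shuffled or stratified, and the labels and scores are invented.

```python
def auroc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean precision at the rank of each positive, ranking by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits

def kfold_indices(n, k=5):
    """Yield (train, test) index lists for k contiguous folds over n items."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

# Toy held-out predictions (invented numbers).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
roc = auroc(labels, scores)
ap = average_precision(labels, scores)
folds = list(kfold_indices(10, k=5))
```

In a real evaluation each fold's test indices would be scored by a model trained on the matching train indices, and the two metrics would be averaged over the five folds.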

    Candidate cis-regulatory driver SNVs and insertions across 498 breast cancer genomes.

    No full text
    A) All SNVs and insertions with a high PRIME score (>0.3) found by M1 models in the regulatory regions around cancer-related genes and 167 TFs expressed in breast cancer; insertions are within the black box. All significant PRIME scores with model-specific thresholds are provided in S5–S6 Tables. Values inside boxes indicate the recurrence, that is, the number of samples in which the variant was found across the 498 TCGA samples. B) An example of a high-scoring recurrent insertion that is predicted to generate a TP53 gain of target in the vicinity of SOX5. Z-scores of SOX5 gene expression are significantly higher (Wilcoxon rank sum test) in the 33 samples with the insertion than in samples without the insertion.

    Regulatory impact score on simulated substitutions.

    No full text
    A) Nucleotide substitutions with higher PRIME scores are under constraint. B) An example of the E2F1 promoter for which each possible substitution is evaluated by M0 and M1 models. The M1 model (Random Forest) identifies a 15 bp region that is highly vulnerable to mutations, while three different M0 models (using only the PWM) identify excessive numbers of false-positive substitutions, demonstrating the higher specificity of the Random Forest classifiers compared to single PWMs. C) Bar plot showing an example from A): averaged phastCons scores as a function of the PRIME score threshold for the E2F4 model. Error bars represent standard error of the mean.
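Evaluating each possible substitution, as in panel B, amounts to enumerating three alternative bases per position and rescoring each variant. The sketch below shows only that enumeration. `gc_fraction` is a deliberately trivial placeholder scorer and the per-position summary (largest absolute change) is an assumption, not the measure used in the figure.

```python
def all_substitutions(seq, alphabet="ACGT"):
    """Yield (position, ref_base, alt_base, mutated_seq) for every single-base change."""
    for pos, ref in enumerate(seq):
        for alt in alphabet:
            if alt != ref:
                yield pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]

def max_abs_delta_per_position(seq, score):
    """Largest absolute score change caused by any substitution at each position."""
    base = score(seq)
    deltas = [0.0] * len(seq)
    for pos, _ref, _alt, mutated in all_substitutions(seq):
        deltas[pos] = max(deltas[pos], abs(score(mutated) - base))
    return deltas

def gc_fraction(seq):
    """Placeholder scorer (fraction of G or C bases); stands in for a real model."""
    return sum(base in "GC" for base in seq) / len(seq)

variants = list(all_substitutions("ACGT"))
profile = max_abs_delta_per_position("ACGT", gc_fraction)
```

With a real classifier in place of `gc_fraction`, a run of consecutive positions with large values in `profile` would correspond to the kind of mutation-sensitive window described for the E2F1 promoter.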
