1,118 research outputs found

    CpG-depleted promoters harbor tissue-specific transcription factor binding signals—implications for motif overrepresentation analyses

    Motif overrepresentation analysis of proximal promoters is a common approach to characterize the regulatory properties of co-expressed sets of genes. Here we show that these approaches perform well on mammalian CpG-depleted promoter sets that regulate expression in terminally differentiated tissues such as liver and heart. In contrast, CpG-rich promoters show very little overrepresentation signal, even when associated with genes that display highly constrained spatiotemporal expression. For instance, while ∼50% of heart specific genes possess CpG-rich promoters we find that the frequently observed enrichment of MEF2-binding sites upstream of heart-specific genes is solely due to contributions from CpG-depleted promoters. Similar results are obtained for all sets of tissue-specific genes indicating that CpG-rich and CpG-depleted promoters differ fundamentally in their distribution of regulatory inputs around the transcription start site. In order not to dilute the respective transcription factor binding signals, the two promoter types should thus be treated as separate sets in any motif overrepresentation analysis
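The recommendation to treat the two promoter classes separately can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the observed/expected CpG ratio is the standard normalized CpG measure, but the 0.5 cutoff, the function names, and the use of a one-sided hypergeometric test are all assumptions made for the example.

```python
from math import comb


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def split_by_cpg(promoters: dict, threshold: float = 0.5):
    """Partition promoters into CpG-rich and CpG-depleted sets.

    The 0.5 threshold is an illustrative assumption, not a value
    taken from the abstract.
    """
    rich = {n: s for n, s in promoters.items() if cpg_obs_exp(s) >= threshold}
    depleted = {n: s for n, s in promoters.items() if n not in rich}
    return rich, depleted


def enrichment_pvalue(hits_fg: int, n_fg: int, hits_bg: int, n_bg: int) -> float:
    """One-sided hypergeometric P(X >= hits_fg).

    hits_fg / n_fg: promoters with a motif match / total, foreground set.
    hits_bg / n_bg: the same counts for the background set.
    """
    total, total_hits = n_fg + n_bg, hits_fg + hits_bg
    upper = min(n_fg, total_hits)
    tail = sum(
        comb(total_hits, k) * comb(total - total_hits, n_fg - k)
        for k in range(hits_fg, upper + 1)
    )
    return tail / comb(total, n_fg)
```

In use, one would call `split_by_cpg` on both the tissue-specific set and the background set, then run `enrichment_pvalue` once per promoter class so that a signal confined to CpG-depleted promoters is not diluted by the CpG-rich ones.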

    A Systems Biology Approach to Transcription Factor Binding Site Prediction

    The elucidation of mammalian transcriptional regulatory networks holds great promise for both basic and translational research and remains one of the greatest challenges to systems biology. Recent reverse engineering methods deduce regulatory interactions from large-scale mRNA expression profiles and cross-species conserved regulatory regions in DNA. Technical challenges faced by these methods include distinguishing between direct and indirect interactions, associating transcription regulators with predicted transcription factor binding sites (TFBSs), identifying non-linearly conserved binding sites across species, and providing realistic accuracy estimates. We address these challenges by closely integrating proven methods for regulatory network reverse engineering from mRNA expression data, linearly and non-linearly conserved regulatory region discovery, and TFBS evaluation and discovery. Using an extensive test set of high-likelihood interactions, which we collected in order to provide realistic prediction-accuracy estimates, we show that a careful integration of these methods leads to significant improvements in prediction accuracy. To verify our methods, we biochemically validated TFBS predictions made for both transcription factors (TFs) and co-factors; we validated binding site predictions made using a known E2F1 DNA-binding motif on E2F1 predicted promoter targets, known E2F1 and JUND motifs on JUND predicted promoter targets, and a de novo discovered motif for BCL6 on BCL6 predicted promoter targets. Finally, to demonstrate accuracy of prediction using an external dataset, we showed that sites matching predicted motifs for ZNF263 are significantly enriched in recent ZNF263 ChIP-seq data. Using an integrative framework, we were able to address technical challenges faced by state-of-the-art network reverse engineering methods, leading to significant improvement in direct-interaction detection and TFBS-discovery accuracy. 
We estimated the accuracy of our framework on a human B-cell specific test set, which may help guide future methodological development

    Mapping and Functional Analysis of cis-Regulatory Elements in Mouse Photoreceptors

    Photoreceptors are light-sensitive neurons that mediate vision, and they are the most commonly affected cell type in genetic forms of blindness. In mice, there are two basic types of photoreceptors, rods and cones, which mediate vision in dim and bright environments, respectively. The transcription factors (TFs) that control rod and cone development have been studied in detail, but the cis-regulatory elements (CREs) through which these TFs act are less well understood. To comprehensively identify photoreceptor CREs in mice and to understand their relationship with gene expression, we performed open chromatin (ATAC-seq) and transcriptome (RNA-seq) profiling of FACS-purified rods and cones. We find that rods have significantly fewer regions of open chromatin than cones (as well as >60 additional cell types and tissues), and we demonstrate that this uniquely closed chromatin architecture depends on the rod master regulator Nrl. Finally, we find that regions of rod- and cone-specific open chromatin are enriched for distinct sets of TF binding sites, providing insight into the cis-regulatory grammar of these cell types. We also sought to understand how the regulatory activity of rod and cone open chromatin regions is encoded in DNA sequence. Cone-rod homeobox (CRX) is a paired-like homeodomain TF and master regulator of both rod and cone development, and CRX binding sites are by far the most enriched TF binding sites in photoreceptor CREs. The in vitro DNA binding preferences of CRX have been extensively characterized, but how well in vitro models of TF binding site affinity predict in vivo regulatory activity is not known. In addition, paired-class homeodomain TFs bind DNA as both monomers and dimers, but whether monomeric and dimeric CRX binding sites have distinct regulatory activities is not known. 
To address these questions, we used a massively parallel reporter assay to quantify the activity of thousands of native and mutant CRX binding sites in explanted mouse retinas. These data reveal that dimeric CRX binding sites encode stronger enhancers than monomeric CRX binding sites. Moreover, the activity of half-sites within dimeric CRX binding sites is cooperative and spacing-dependent. In addition, saturating mutagenesis of 195 CRX binding sites reveals that, while TF binding site affinity and activity are moderately correlated across mutations within individual CREs, they are poorly correlated across mutations from distinct CREs. Accordingly, we show that accounting for baseline CRE activity improves the prediction of the effects of mutations in regulatory DNA from sequence-based models. Taken together, these data demonstrate that the activity of CRX binding sites depends on multiple layers of sequence context, providing insight into photoreceptor gene regulation and illustrating functional principles of homeodomain TF binding sites

    Transcription factor binding site clusters identify target genes with similar tissue-wide expression and buffer against mutations.

    Background: The distribution and composition of cis-regulatory modules composed of transcription factor (TF) binding site (TFBS) clusters in promoters substantially determine gene expression patterns and TF targets. TF knockdown experiments have revealed that TF binding profiles and gene expression levels are correlated. We use TFBS features within accessible promoter intervals to predict genes with similar tissue-wide expression patterns and TF targets using Machine Learning (ML). Methods: Bray-Curtis Similarity was used to identify genes with correlated expression patterns across 53 tissues. TF targets from knockdown experiments were also analyzed by this approach to set up the ML framework. TFBSs were selected within DNase I-accessible intervals of corresponding promoter sequences using information theory-based position weight matrices (iPWMs) for each TF. Features from information-dense clusters of TFBSs were input to ML classifiers which predict these gene targets along with their accuracy, specificity and sensitivity. Mutations in TFBSs were analyzed in silico to examine their impact on TFBS clustering and predict changes in gene regulation. Results: The glucocorticoid receptor gene (NR3C1), whose regulation has been extensively studied, was selected to test this approach. SLC25A32 and TANK exhibited the most similar expression patterns to NR3C1. A Decision Tree classifier exhibited the best performance in detecting such genes, based on Area Under the Receiver Operating Characteristic curve (ROC). TF target gene prediction was confirmed using siRNA knockdown, which was more accurate than CRISPR/CAS9 inactivation. TFBS mutation analyses revealed that accurate target gene prediction required at least 1 information-dense TFBS cluster. Conclusions: ML based on TFBS information density, organization, and chromatin accessibility accurately identifies gene targets with comparable tissue-wide expression patterns. 
Multiple information-dense TFBS clusters in promoters appear to protect promoters from effects of deleterious binding site mutations in a single TFBS that would otherwise alter regulation of these genes
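The Bray-Curtis similarity used above to find genes with correlated tissue-wide expression can be illustrated with a short sketch. It assumes non-negative expression values and defines similarity as one minus the Bray-Curtis dissimilarity; the function names and the toy profiles in the test are hypothetical, and the abstract does not say how the authors normalized their data.

```python
def bray_curtis_similarity(u, v):
    """Return 1 - sum(|u_i - v_i|) / sum(u_i + v_i) for two expression profiles.

    Both profiles must have the same length (one value per tissue) and
    are assumed to be non-negative.
    """
    if len(u) != len(v):
        raise ValueError("profiles must cover the same tissues")
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    if denominator == 0:
        return 1.0  # two all-zero profiles are treated as identical
    return 1.0 - numerator / denominator


def most_similar(target, profiles, top=2):
    """Rank the genes in `profiles` by similarity to the profile of `target`."""
    reference = profiles[target]
    scored = [
        (gene, bray_curtis_similarity(reference, values))
        for gene, values in profiles.items()
        if gene != target
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top]
```

Called with a dictionary mapping gene names to 53-element expression vectors, `most_similar("NR3C1", profiles)` would return the genes whose tissue-wide profiles are closest to that of NR3C1 under this measure.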

    Predicting Combinatorial Binding of Transcription Factors to Regulatory Elements in the Human Genome by Association Rule Mining

    Cis-acting transcriptional regulatory elements in mammalian genomes typically contain specific combinations of binding sites for various transcription factors. Although some cis-regulatory elements have been well studied, the combinations of transcription factors that regulate normal expression levels for the vast majority of the 20,000 genes in the human genome are unknown. We hypothesized that it should be possible to discover transcription factor combinations that regulate gene expression in concert by identifying over-represented combinations of sequence motifs that occur together in the genome. In order to detect combinations of transcription factor binding motifs, we developed a data mining approach based on the use of association rules, which are typically used in market basket analysis. We scored each segment of the genome for the presence or absence of each of 83 transcription factor binding motifs, then used association rule mining algorithms to mine this dataset, thus identifying frequently occurring pairs of distinct motifs within a segment. Results: Support for most pairs of transcription factor binding motifs was highly correlated across different chromosomes although pair significance varied. Known true positive motif pairs showed higher association rule support, confidence, and significance than background. Our subsets of high-confidence, high-significance mined pairs of transcription factors showed enrichment for co-citation in PubMed abstracts relative to all pairs, and the predicted associations were often readily verifiable in the literature. 
Conclusion: Functional elements in the genome where transcription factors bind to regulate expression in a combinatorial manner are more likely to be predicted by identifying statistically and biologically significant combinations of transcription factor binding motifs than by simply scanning the genome for the occurrence of binding sites for a single transcription factor. NIAAA Alcohol Training Grant; National Science Foundation; Cellular and Molecular Biolog
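The market-basket framing above treats each genome segment as a "basket" of motifs. The sketch below computes support, confidence, and lift for every motif pair under that framing. It is an illustration of the standard association-rule measures only: the function name, the dictionary layout, and the use of lift as the significance-like measure are assumptions, and the abstract does not state which mining algorithm or significance statistic the authors used.

```python
from collections import Counter
from itertools import combinations


def mine_motif_pairs(segments, min_support=0.0):
    """Compute association-rule measures for co-occurring motif pairs.

    segments: iterable of collections, each holding the motif names
              found in one genome segment.
    Returns a list of dicts sorted by descending support.
    """
    segments = [set(s) for s in segments]
    n = len(segments)
    single = Counter()
    pair = Counter()
    for motifs in segments:
        single.update(motifs)
        # sorted() gives each pair one canonical (a, b) ordering
        pair.update(combinations(sorted(motifs), 2))

    rules = []
    for (a, b), count in pair.items():
        support = count / n
        if support < min_support:
            continue
        rules.append({
            "pair": (a, b),
            "support": support,               # P(a and b)
            "confidence": count / single[a],  # P(b | a)
            "lift": support / ((single[a] / n) * (single[b] / n)),
        })
    return sorted(rules, key=lambda rule: rule["support"], reverse=True)
```

A lift above 1 means the two motifs share a segment more often than expected if they occurred independently, which is the kind of over-representation the abstract describes.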

    Developmental constraints, innovations and robustness

    During my PhD, I have been working on Evo-Devo patterns (especially the debate around the hourglass model) in transcriptomes, with an emphasis on adaptation. I have characterized patterns in model organisms in terms of constraints and especially in terms of positive selection. I found that the phylotypic stage (a stage in mid-embryonic development) is an evolutionary lockdown, with stronger purifying selection and less positive selection than other stages in terms of the evolution of protein sequences and of regulatory elements. To study the adaptive evolution of gene regulation during development, I have developed a machine learning based in silico mutagenesis approach to detect positive selection on regulatory elements. In addition to transcriptome evolution, I have been working on the tension between precision and stochasticity of gene expression during development. More precisely, I have shown that expression noise follows an hourglass pattern, with lower noise at the phylotypic stage. This pattern can be explained by stronger histone modification mediated noise control at this stage. In addition, I propose that histone modifications contribute to mutational robustness in regulatory elements, and thus to conserved expression levels. These results provide insight into the role of robustness in the phenotypic and genetic patterns of evolutionary conservation in animal development

    Deciphering Transcriptional Regulation using Deep Neural Networks

    The DNA holds the recipe of all life functions. To decipher the instructions, one has to learn and understand its complex syntax. The non-coding DNA contains regulatory elements that are essential to control and activate gene expression in the right place at the right time. Previous studies have successfully applied deep learning to predict gene expression directly from non-coding sequences. Almeida et al. [1] showed that a Convolutional Neural Network could learn regulatory syntax from long same-length fragments from the fruit fly. In this thesis, we tested how well deep neural networks could predict gene expression from short DNA fragments of varying lengths from the Atlantic salmon. Furthermore, we extracted what the models had learned, and tested if the sequence features corresponded to known regulatory sequence patterns (motifs). Two deep neural network architectures were built, a Convolutional Neural Network (CNN) and a hybrid Convolutional and Long Short-Term Memory Neural Network (CNN-LSTM). We trained the models to predict the gene expression effect of DNA fragments from open chromatin of liver cells. The two model architectures performed equally well, and the performances depended on the amount of noise in the validation data, reaching a correlation of 0.68 on the sequences of top 10% base mean. We extracted motifs both from the first convolutional filters and from sequence importance scores, and we compared the motifs to the JASPAR database of known vertebrate transcription factor binding site motifs. Among the significant matches to JASPAR, we found some general transcription factors such as TFCP2, HSF and AP-1, as well as some liver-specific transcription factors such as KLF15 and HNF6. Most motifs did not match any JASPAR motif. We explained the tendency of CNNs to distribute partial motifs across several filters, and that other sequence features might be important for prediction as well. 
Our results suggest that the models learned regulatory DNA syntax equally well, despite their different architectures, and we compared the motif findings in light of these differences. This thesis demonstrates the potential of deep neural networks for analysis of ATAC-STARR-seq data, and suggests improvements worth exploring further to possibly increase performance. We also stress the need for more robust model interpretation techniques, which could unlock valuable knowledge in the future of genomics

    Inference of transcriptional regulation using gene expression data from the bovine and human genomes

    Background: Gene expression is in part regulated by sequences in promoters that bind transcription factors. Thus, co-expressed genes may have shared sequence motifs representing putative transcription factor binding sites (TFBSs). However, for agriculturally important animals the genomic sequence is often incomplete. The more complete human genome may be able to be used for this prediction by taking advantage of the expected evolutionary conservation in TFBSs between the species. Results: A method of de novo TFBS prediction based on MEME was implemented, tested, and validated on a muscle-specific dataset. Muscle specific expression data from EST library analysis from cattle was used to predict sets of genes whose expression was enriched in muscle and cardiac tissues. The upstream 1500 bases from calculated orthologous genes were extracted from the human reference set. A set of common motifs were discovered in these promoters. Slightly over one third of these motifs were identified as known TFBSs including known muscle specific binding sites. This analysis also predicted several highly statistically significantly overrepresented sites that may be novel TFBS. An independent analysis of the equivalent bovine genomic sequences was also done, this gave less detailed results than the human analysis due to both the quality of orthologue prediction and assembly in promoter regions. However, the most common motifs could be detected in both sets. Conclusion: Using promoter sequences from human genes is a useful approach when studying gene expression in species with limited or non-existing genomic sequence. As the bovine genome becomes better annotated it can in turn serve as the reference genome for other agriculturally important ruminants, such as sheep, goat and deer.

    Responsiveness of genes to long-range transcriptional regulation

    Developmental genes are highly regulated at the level of transcription and exhibit complex spatial and temporal expression patterns. Key developmental loci are frequently spanned by clusters of conserved non-coding elements (CNEs), referred to as genomic regulatory blocks (GRBs), that have been subject to extreme levels of purifying selection during metazoan evolution. CNEs have been shown to function as long-range enhancers, activating transcription of their developmental target genes over vast genomic distances and bypassing more proximally located unresponsive genes (bystanders). Despite their role in the establishment of cell identity during development, many of these long-range regulatory landscapes remain poorly characterised. In this thesis, I develop a computational method for the genome-wide identification of regulatory enhancer-promoter associations in human and mouse, based on co-variation of enhancer and promoter transcriptional activity across a comprehensive set of tissues and cell types, in combination with chromatin contact data. Using this method, I demonstrate that previously predicted GRB target genes are amongst the genes with the highest level of enhancer responsiveness in the genome, and are frequently associated with extremely long-range enhancers. Remarkably, the activity of some previously predicted bystanders is also weakly but significantly associated with enhancer activity, challenging the notion that the promoters of bystanders are unresponsive to enhancers. Next, I systematically annotate human genes with elevated enhancer responsiveness and identify more than 600 putative target genes, associated with the regulation of a wide range of developmental processes, from pattern specification to axonogenesis, as well as with disease. 
The analysis performed in this thesis has facilitated the identification of hundreds of previously uncharacterised enhancer-responsive genes and their long-range regulatory landscapes, allowing the study of their unique properties.
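The idea of linking enhancers to promoters by co-variation of activity across samples, restricted to pairs supported by chromatin contact data, can be sketched as below. This is only an illustration: the use of Pearson correlation, the 0.5 cutoff, the representation of contacts as a set of enhancer identifiers, and all names are assumptions, since the abstract does not specify the statistic or thresholds used in the thesis.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long activity vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no co-variation signal
    return cov / sqrt(var_x * var_y)


def link_enhancers(promoter_activity, enhancers, in_contact, min_r=0.5):
    """Return (enhancer_id, r) pairs that pass both filters.

    promoter_activity: activity of one promoter across samples.
    enhancers:  dict mapping enhancer id -> activity across the same samples.
    in_contact: set of enhancer ids with chromatin-contact support
                for this promoter (hypothetical input format).
    min_r:      illustrative correlation cutoff.
    """
    links = []
    for enhancer_id, activity in enhancers.items():
        if enhancer_id not in in_contact:
            continue
        r = pearson(promoter_activity, activity)
        if r >= min_r:
            links.append((enhancer_id, r))
    return sorted(links, key=lambda item: item[1], reverse=True)
```

Counting or summing the links returned per promoter would give one simple, assumed proxy for how responsive a gene is to enhancers.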