503 research outputs found

    Novel computational methods for studying the role and interactions of transcription factors in gene regulation

    Regulating which genes are expressed, and when, allows different cell types to exist while sharing the same genetic code in their DNA. Malfunctioning gene regulation can lead to diseases such as cancer. Gene regulatory programs can malfunction in several ways. Often, if a disease is caused by a defective protein, the cause is a mutation in the gene coding for the protein that renders the protein unable to perform its functions properly. However, protein-coding genes make up only about 1.5% of the human genome, and the majority of all disease-associated mutations discovered reside outside protein-coding genes. The mechanisms of action of these non-coding disease-associated mutations are far less well understood. Binding of transcription factors (TFs) to DNA controls the rate of transcribing genetic information from the coding DNA sequence to RNA. Binding affinities of TFs to DNA have been extensively measured in vitro with methods such as SELEX (systematic evolution of ligands by exponential enrichment) and Protein Binding Microarrays (PBMs), and the genome-wide binding locations and patterns of TFs have been mapped in dozens of cell types. Despite this, our understanding of how TF binding to regulatory regions of the genome, promoters and enhancers, leads to gene expression is not at the level where gene expression could be reliably predicted from DNA sequence alone. In this work, we develop and apply computational tools to analyze and model the effects of TF-DNA binding. We also develop new methods for interpreting and understanding deep learning-based models trained on biological sequence data. In biological applications, the ability to understand how machine learning models make predictions is as important as, or even more important than, raw predictive performance. This has created a demand for approaches that help researchers extract biologically meaningful information from deep learning model predictions.
We develop a novel computational method for determining TF binding sites genome-wide from recently developed high-resolution ChIP-exo and ChIP-nexus experiments. We demonstrate that our method performs as well as or better than previously published methods while making fewer assumptions about the data. We also describe an improved algorithm for calling allele-specific TF-DNA binding. We utilize deep learning methods to learn features predicting transcriptional activity of human promoters and enhancers. The deep learning models are trained on massively parallel reporter gene assay (MPRA) data from human genomic regulatory elements, designed regulatory elements, and promoters and enhancers selected from a completely random pool of synthetic input DNA. This unprecedentedly large set of measurements of human gene regulatory element activities, in total more than 100 times the size of the human genome, allowed us to train models that predicted genomic transcription start site positions more accurately than models trained on genomic promoters, and that correctly predicted the effects of disease-associated promoter variants. We also found that interactions between promoters and local classical enhancers are non-specific in nature. The MPRA data, integrated with extensive epigenetic measurements, supports the existence of three different classes of enhancers: classical enhancers, closed chromatin enhancers and chromatin-dependent enhancers. We also show that TFs can be divided into four different, non-exclusive classes based on their activities: chromatin opening, enhancing, promoting and TSS determining TFs. Interpreting the deep learning models of human gene regulatory elements required applying several existing model interpretation tools as well as developing new approaches. Here, we describe two new methods for visualizing features and interactions learned by deep learning models.
Firstly, we describe an algorithm for testing whether a deep learning model has learned an existing binding motif of a TF. Secondly, we visualize mutual information between pairwise k-mer distributions in sample inputs selected according to predictions by a machine learning model. This method highlights pairwise and positional dependencies learned by a machine learning model. We demonstrate the use of this model-agnostic approach with classification and regression models trained on DNA, RNA and amino acid sequences.
Many organisms consist of several different cell types, even though all cells of these organisms carry the same DNA code. Regulation of gene expression makes the different cell types possible. Malfunctioning regulation can lead to diseases, for example the onset of cancer. If a disease is caused by a defective protein, the cause is often a mutation in the gene encoding that protein, which alters the protein so that it can no longer perform its function well enough. However, only 1.5% of the human genome consists of protein-coding genes. Most of the disease-associated mutations discovered lie outside these so-called coding regions. The mechanisms of action of non-coding disease-associated mutations are generally much less well understood than those of mutations in coding regions. Binding of transcription factors to DNA regulates transcription, that is, the reading of the genetic information in genes and its conversion into RNA. Transcription factor binding to DNA has been measured extensively under in vitro conditions, and the binding sites of many transcription factors have been mapped genome-wide in several cell types. Despite this, our understanding of how transcription factor binding to the regulatory elements of the genome, that is, promoters and enhancers, leads to gene expression is not at a level where we could reliably predict gene expression from DNA sequence alone.
In this work, we develop and apply computational tools for analyzing and modeling gene expression resulting from transcription factor binding. We also develop new methods for interpreting deep learning models trained on biological sequence data. In biological applications, the interpretability of the predictions made by a machine learning model is usually as important as, if not more important than, raw predictive accuracy. This has created a need for new methods that help researchers mine biologically meaningful information from the predictions of deep learning models. In this work, we developed a new computational tool for determining transcription factor binding sites genome-wide using measurement data from recently developed high-resolution ChIP-exo and ChIP-nexus experiments. We show that our method performs better than, or at least as well as, previously published methods while making fewer assumptions about the shape of the signal. We also present an improved algorithm for determining allele-specific binding of transcription factors. We use deep learning methods to learn which features predict the activity of human promoter and enhancer elements. These deep learning models were trained on data from massively parallel reporter gene assays of human genomic regulatory elements, as well as of active promoters and enhancers selected from a random pool of synthetic DNA sequences. This unprecedentedly large set of measurements of human regulatory element activity, covering more than a hundred times as much DNA sequence as the human genome, made it possible to predict the positions of transcription start sites in the human genome more accurately than models trained on the human genome. These models also correctly predicted the effects of disease-associated mutations in human promoters.
Our results showed that interactions between human promoters and classical local enhancers are non-specific. The MPRA data, integrated with extensive epigenetic measurements, made it possible to divide enhancer elements into three classes: classical, closed-chromatin, and chromatin-dependent enhancers. Our study showed that transcription factors can be divided into four partially overlapping classes based on their activities: chromatin-opening, enhancing, promoting, and transcription-start-site-determining transcription factors. Interpreting the deep learning models describing the regulatory elements of the human genome required both applying existing methods and developing new ones. In this work, we developed two new methods for visualizing the features learned by deep learning models and the interactions between them. First, we present an algorithm for testing whether a deep learning model has learned the binding motif of an already known transcription factor. Second, we visualize the mutual information of position-specific k-mer distributions in sequences selected based on the predictions of a deep learning model. This method reveals the pairwise interactions and positional dependencies learned by the deep learning model. We show that our method is independent of model architecture by applying it to both classifiers and regression models trained on DNA, RNA, or amino acid sequence data.
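The mutual-information idea mentioned in this abstract can be illustrated with a small sketch. This is not the thesis's implementation: the function names `pairwise_kmer_mi` and `mi_table`, the choice of equal-length input sequences, and the plain-count probability estimates are all assumptions made for illustration. It computes, for a set of sequences that a model has scored highly, the mutual information between the k-mers found at two positions.

```python
from collections import Counter
from math import log2


def pairwise_kmer_mi(seqs, i, j, k=1):
    """Mutual information (in bits) between the k-mers starting at positions i and j.

    Assumes all sequences have the same length (hypothetical simplification).
    """
    n = len(seqs)
    joint = Counter((s[i:i + k], s[j:j + k]) for s in seqs)
    left = Counter(s[i:i + k] for s in seqs)
    right = Counter(s[j:j + k] for s in seqs)
    mi = 0.0
    for (a, b), count in joint.items():
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written in terms of raw counts
        mi += (count / n) * log2(count * n / (left[a] * right[b]))
    return mi


def mi_table(seqs, k=1):
    """MI for every pair of non-overlapping k-mer start positions."""
    n_starts = len(seqs[0]) - k + 1
    return {
        (i, j): pairwise_kmer_mi(seqs, i, j, k)
        for i in range(n_starts)
        for j in range(i + k, n_starts)
    }


# Toy input: sequences a (hypothetical) model scored highly.
selected = ["AA", "CC", "AA", "CC"]
table = mi_table(selected)
```

In the toy input the two positions always agree, so the pair carries one full bit of shared information; a heat map of such a table over all position pairs is one way the dependencies could be displayed.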

    Synthetic and genomic regulatory elements reveal aspects of cis-regulatory grammar in mouse embryonic stem cells

    In embryonic stem cells (ESCs), a core transcription factor (TF) network establishes the gene expression program necessary for pluripotency. To address how interactions between four key TFs contribute t

    Computational prediction and experimental validation of novel Hedgehog-responsive enhancers linked to genes of the Hedgehog pathway

    Background: The Hedgehog (Hh) signaling pathway, acting through three homologous transcription factors (GLI1, GLI2, GLI3) in vertebrates, plays multiple roles in embryonic organ development and adult tissue homeostasis. At the level of the genome, GLI factors bind to specific motifs in enhancers, some of which are hundreds of kilobases removed from the gene promoter. These enhancers integrate the Hh signal in a context-specific manner to control the spatiotemporal pattern of target gene expression. Importantly, a number of genes that encode Hh pathway molecules are themselves targets of Hh signaling, allowing pathway regulation by an intricate balance of feedback activation and inhibition. However, surprisingly few of the critical enhancer elements that control these pathway target genes have been identified, despite the fact that such elements are central determinants of Hh signaling activity. Recently, ChIP studies have been carried out in multiple tissue contexts using mouse models carrying FLAG-tagged GLI proteins (GLIFLAG). Using these datasets, we tested whether a meta-analysis of GLI binding sites, coupled with a machine learning approach, could reveal genomic features that could be used to empirically identify Hh-regulated enhancers linked to loci of the Hh signaling pathway.
Results: A meta-analysis of four existing GLIFLAG datasets revealed a library of GLI binding motifs that was substantially more restricted than the potential sites predicted by previous in vitro binding studies. A machine learning method (kmer-SVM) was then applied to these datasets, and enriched k-mers were identified that, when applied to the mouse genome, predicted as many as 37,000 potential Hh enhancers. For functional analysis, we selected nine regions that were annotated to putative Hh pathway molecules and found that seven exhibited GLI-dependent activity, indicating that they are directly regulated by Hh signaling (78% success rate).
Conclusions: The results suggest that Hh enhancer regions share common sequence features. The kmer-SVM machine learning approach identifies those features and can successfully predict functional Hh regulatory regions in genomic DNA surrounding Hh pathway molecules and, likely, other Hh targets. Additionally, the library of enriched GLI binding motifs that we have identified may allow improved identification of functional GLI binding sites.
http://deepblue.lib.umich.edu/bitstream/2027.42/134520/1/12861_2016_Article_106.pd
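As a rough illustration of the kind of input a k-mer based classifier works on, the sketch below turns a DNA sequence into a vector of normalized k-mer frequencies. It is not the kmer-SVM tool itself: the helper names `kmer_counts` and `feature_vector` are hypothetical, the classifier step is left out, and details that tools of this kind often include (such as collapsing reverse complements) are omitted.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def feature_vector(seq, k, alphabet="ACGT"):
    """Normalized k-mer frequencies in a fixed order (AA, AC, AG, ... for k=2).

    Hypothetical helper: a vector like this could be fed to any standard
    classifier, such as a support vector machine.
    """
    counts = kmer_counts(seq, k)
    total = max(sum(counts.values()), 1)  # guard against sequences shorter than k
    return [counts["".join(p)] / total for p in product(alphabet, repeat=k)]


# Toy usage on a made-up sequence.
vec = feature_vector("ACAC", 2)
```

Sequences from bound and unbound regions would each be converted this way, and the classifier then learns which k-mers separate the two sets.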

    Responsiveness of genes to long-range transcriptional regulation

    Developmental genes are highly regulated at the level of transcription and exhibit complex spatial and temporal expression patterns. Key developmental loci are frequently spanned by clusters of conserved non-coding elements (CNEs), referred to as genomic regulatory blocks (GRBs), that have been subject to extreme levels of purifying selection during metazoan evolution. CNEs have been shown to function as long-range enhancers, activating transcription of their developmental target genes over vast genomic distances and bypassing more proximally located unresponsive genes (bystanders). Despite their role in the establishment of cell identity during development, many of these long-range regulatory landscapes remain poorly characterised. In this thesis, I develop a computational method for the genome-wide identification of regulatory enhancer-promoter associations in human and mouse, based on co-variation of enhancer and promoter transcriptional activity across a comprehensive set of tissues and cell types, in combination with chromatin contact data. Using this method, I demonstrate that previously predicted GRB target genes are amongst the genes with the highest level of enhancer responsiveness in the genome, and are frequently associated with extremely long-range enhancers. Remarkably, the activity of some previously predicted bystanders is also weakly but significantly associated with enhancer activity, challenging the notion that the promoters of bystanders are unresponsive to enhancers. Next, I systematically annotate human genes with elevated enhancer responsiveness and identify more than 600 putative target genes, associated with the regulation of a wide range of developmental processes, from pattern specification to axonogenesis, as well as with disease. 
The analysis performed in this thesis has facilitated the identification of hundreds of previously uncharacterised enhancer-responsive genes and their long-range regulatory landscapes, allowing the study of their unique properties.
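The co-variation idea in this abstract can be sketched as a correlation between a promoter's activity and each candidate enhancer's activity across the same set of tissues. The code below is only an illustration under simple assumptions: it uses Pearson correlation, the names `pearson` and `rank_enhancers` are hypothetical, the activity values are made up, and the chromatin contact data that the thesis combines with co-variation is not modeled.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length activity profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no co-variation signal
    return cov / (sd_x * sd_y)


def rank_enhancers(promoter_activity, enhancer_activities):
    """Sort candidate enhancers by correlation with the promoter, highest first."""
    scores = {
        name: pearson(promoter_activity, activity)
        for name, activity in enhancer_activities.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up activities across four tissues.
promoter = [1.0, 2.0, 3.0, 4.0]
enhancers = {
    "e1": [2.0, 4.0, 6.0, 8.0],   # rises with the promoter
    "e2": [4.0, 3.0, 2.0, 1.0],   # falls as the promoter rises
    "e3": [1.0, 1.0, 1.0, 1.0],   # constant
}
ranked = rank_enhancers(promoter, enhancers)
```

A gene whose promoter correlates strongly with many enhancers, including distant ones, would score as highly enhancer-responsive under this simplified view.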

    A framework to identify epigenome and transcription factor crosstalk

    While changes in chromatin are integral to transcriptional reprogramming during cellular differentiation, it is currently unclear how chromatin modifications are targeted to specific loci. To systematically identify transcription factors (TFs) that can direct chromatin changes during cell fate decisions, we model the genome-wide dynamics of chromatin marks in terms of computationally predicted TF binding sites. By applying this computational approach to a time course of Polycomb-mediated H3K27me3 marks during neuronal differentiation of murine stem cells, we identify several motifs that likely regulate dynamics of this chromatin mark. Among these, the motifs bound by REST and by the SNAIL family of TFs are predicted to transiently recruit H3K27me3 in neuronal progenitors. We validate these predictions experimentally and show that absence of REST indeed causes loss of H3K27me3 at target promoters in trans, specifically at the neuronal progenitor state. Moreover, using targeted transgenic insertion, we show that promoter fragments containing REST or SNAIL binding sites are sufficient to recruit H3K27me3 in cis, while deletion of these sites results in loss of H3K27me3. These findings illustrate that the occurrence of TF binding sites can determine chromatin dynamics. Local determination of Polycomb activity by REST and SNAIL motifs exemplifies such TF-based regulation of chromatin. Furthermore, our results show that key TFs can be identified ab initio through computational modeling of epigenome datasets using a modeling approach that we make readily accessible.

    The evolutionary dynamics of genomic regulatory blocks in metazoan genomes

    Developmental genes require intricate control of the timing, location and magnitude of their expression. This is provided by multiple evolutionarily conserved enhancers, known as conserved non-coding elements (CNEs). CNEs cluster around their target genes, forming long syntenic arrays known as genomic regulatory blocks (GRBs). Current methods for GRB identification rely on the selection of arbitrary minimum conservation thresholds, impeding their performance in many contexts. In this thesis, I propose a novel measure of pairwise genome conservation that eliminates the need for conservation thresholds, and use this measure to study the evolutionary dynamics of GRBs in metazoa. I define sets of GRBs based on their rate of regulatory turnover, namely high turnover GRBs (htGRBs) and low turnover GRBs (ltGRBs), in three independent metazoan lineages. I show that ht- and ltGRBs target functionally distinct classes of genes, and that these genes tend to be expressed during late and early development respectively, potentially contributing to their differing tolerance of regulatory turnover. Moreover, the differences between ht- and ltGRBs are consistent across all three lineages, suggesting that similar evolutionary pressures have defined the rate of turnover in these GRBs since their emergence in the metazoan ancestor. Next, I identify GRBs in the extremely compact Caenorhabditis elegans and Oikopleura dioica genomes for the first time, and use these GRBs to investigate the effects of genome compaction on GRB size and composition. I show that GRB size scales proportionally with genome size and that GRBs exhibit similar enrichment and depletion of specific genomic features. This suggests that regardless of background genome content, GRBs are under similar pressure to maintain a permissive environment for long-range gene regulation.
The development of a threshold-free GRB identification method has facilitated the analysis of GRBs in both closely related species and compact genomes, providing further insights into their origin and evolution.

    Iterative Machine Learning of a Cis-Regulatory Grammar

    Gene regulation allows for the quantitative control of gene expression. Gene regulation is a complex process encoded through cis-regulatory sequences, short DNA sequences containing clusters of transcription factor binding sites. Each binding site can occur millions of times in multicellular genomes, and seemingly similar collections of binding sites can have very different activities. A leading model to explain these degeneracies is that cis-regulatory sequences follow a “grammar” defined by the number, identity, strength, arrangement, and/or context of the underlying binding sites. Understanding cis-regulatory grammar requires high-throughput technology, quantitative measurements, and computational modeling. This thesis describes an iterative machine learning approach to study cis-regulatory grammar using mouse photoreceptors as a model system. First, I characterized sequence features associated with enhancer and silencer activity in sequences bound by the transcription factor CRX. I showed that both enhancers and silencers are highly occupied by CRX compared to inactive sequences, and enhancers are uniquely enriched for a diverse but degenerate collection of eight motifs. I demonstrated that this information captures a majority of the available signal in genomic sequences and developed an information content metric that summarizes the effects of motif number and diversity. Second, I developed an active machine learning framework that iteratively samples informative perturbations to address the limitations of training quantitative models on genomic sequences alone. I showed that this approach, when complemented with human decision-making, effectively guides machine learning models towards a biologically relevant representation of cis-regulatory grammar. I also highlighted how perturbations selected with active learning are more informative than other perturbations generated by the same procedure.
The final machine learning model can capture global and local context-dependencies of transcription factor binding motifs. Using this model, I found that the same motifs can produce the same activity in multiple arrangements. Thus, active machine learning is an effective way to sample perturbations that improve quantitative models of cis-regulatory grammar. Collectively, these results provide an iterative framework to design and sample perturbations that reveal the complexities of cis-regulatory grammar underlying gene regulation.

    Grammar and Variation: Understanding How cis-Regulatory Information is Encoded in Mammalian Genomes

    Understanding how genotype leads to phenotype is key to understanding both the development and dysfunction of complex organisms. In the context of regulating the gene expression patterns that contribute to cell identity and function, the goal of my thesis research is to understand how changes in genome sequence may impact gene expression by determining how sequence features contribute to regulatory potential. To accomplish this goal, I first leveraged the key regulatory role of pluripotency transcription factors (TFs) in mouse embryonic stem cells (mESCs) and tested synthetically generated and genomically identified combinations of binding sites for four TFs: OCT4, SOX2, KLF4, and ESRRB. I found that although the position of binding sites explained 87% of the variation in expression observed for synthetic elements, the position of binding sites did not explain the expression of tested genomic sequences despite roughly similar binding site composition. Instead, for genomic sequences I found that the quality and spacing of the binding sites contribute more to distinguishing active sequences, suggesting that the arrangements of binding sites are less important for controlling expression in mESCs. In a separate set of experiments, I tested regions of the human genome assigned a regulatory function based on chromatin features and predicted to have high to low probabilities of being under selection in a commonly used human immune progenitor cell culture model, GM12878. Although only a quarter of the library was assigned as ‘Repressive’ according to chromatin marks, 45% of tested sequences showed repressive activity. Sequences predicted to have high probabilities of being under selection have a small but significantly higher average level of activation, but not a higher likelihood of either repression or activation.
By making single substitutions found at those loci in human populations for a subset of sequences, I tested the predictive power of two independent programs that aim to integrate both functional annotations and evolutionary signals. I found that neither set of predictions was enriched for variants that impacted regulatory activity. This suggests that although we can survey human genotypes for impacts on regulation, it may be difficult to separate organismal-level selection from other processes that contribute to the proper control of gene expression. These results demonstrate that in mESCs, the fixed affinity and fixed spacing found in synthetic combinations of binding sites are unlikely to predict the activity of genomic sequences. Furthermore, testing sequences from the human genome in GM12878 shows that repression may be more prevalent than estimated by chromatin features alone and that predictions of selection do not enrich for human variants that impact regulatory activity. Together, these experiments demonstrate that the relationship between genotype and proper regulatory function is complex and that understanding this relationship is important for understanding both subtle and severe impacts on phenotype.

    Common variants in signaling transcription-factor-binding sites drive phenotypic variability in red blood cell traits

    Genome-wide association studies identify genomic variants associated with human traits and diseases. Most trait-associated variants are located within cell-type-specific enhancers, but the molecular mechanisms governing phenotypic variation are less well understood. Here, we show that many enhancer variants associated with red blood cell (RBC) traits map to enhancers that are co-bound by lineage-specific master transcription factors (MTFs) and signaling transcription factors (STFs) responsive to extracellular signals. The majority of enhancer variants reside on STF and not MTF motifs, perturbing DNA binding by various STFs (BMP/TGF-β-directed SMADs or WNT-induced TCFs) and affecting target gene expression. Analyses of engineered human blood cells and expression quantitative trait loci verify that disrupted STF binding leads to altered gene expression. Our results indicate that the majority of the RBC-trait-associated variants that reside on transcription-factor-binding sequences fall in STF target sequences, suggesting that the phenotypic variation of RBC traits could stem from altered responsiveness to extracellular stimuli.