2,665 research outputs found

    Novel computational methods for studying the role and interactions of transcription factors in gene regulation

    Get PDF
    Regulation of which genes are expressed and when enables the existence of different cell types sharing the same genetic code in their DNA. Erroneously functioning gene regulation can lead to diseases such as cancer. Gene regulatory programs can malfunction in several ways. Often if a disease is caused by a defective protein, the cause is a mutation in the gene coding for the protein rendering the protein unable to perform its functions properly. However, protein-coding genes make up only about 1.5% of the human genome, and majority of all disease-associated mutations discovered reside outside protein-coding genes. The mechanisms of action of these non-coding disease-associated mutations are far more incompletely understood. Binding of transcription factors (TFs) to DNA controls the rate of transcribing genetic information from the coding DNA sequence to RNA. Binding affinities of TFs to DNA have been extensively measured in vitro, ligands by exponential enrichment) and Protein Binding Microarrays (PBMs), and the genome-wide binding locations and patterns of TFs have been mapped in dozens of cell types. Despite this, our understanding of how TF binding to regulatory regions of the genome, promoters and enhancers, leads to gene expression is not at the level where gene expression could be reliably predicted based on DNA sequence only. In this work, we develop and apply computational tools to analyze and model the effects of TF-DNA binding. We also develop new methods for interpreting and understanding deep learning-based models trained on biological sequence data. In biological applications, the ability to understand how machine learning models make predictions is as, or even more important as raw predictive performance. This has created a demand for approaches helping researchers extract biologically meaningful information from deep learning model predictions. We develop a novel computational method for determining TF binding sites genome-wide from recently developed high-resolution ChIP-exo and ChIP-nexus experiments. We demonstrate that our method performs similarly or better than previously published methods while making less assumptions about the data. We also describe an improved algorithm for calling allele-specific TF-DNA binding. We utilize deep learning methods to learn features predicting transcriptional activity of human promoters and enhancers. The deep learning models are trained on massively parallel reporter gene assay (MPRA) data from human genomic regulatory elements, designed regulatory elements and promoters and enhancers selected from totally random pool of synthetic input DNA. This unprecedentedly large set of measurements of human gene regulatory element activities, in total more than 100 times the size of the human genome, allowed us to train models that were able to predict genomic transcription start site positions more accurately than models trained on genomic promoters, and to correctly predict effects of disease-associated promoter variants. We also found that interactions between promoters and local classical enhancers are non-specific in nature. The MPRA data integrated with extensive epigenetic measurements supports existence of three different classes of enhancers: classical enhancers, closed chromatin enhancers and chromatin-dependent enhancers. We also show that TFs can be divided into four different, non-exclusive classes based on their activities: chromatin opening, enhancing, promoting and TSS determining TFs. Interpreting the deep learning models of human gene regulatory elements required application of several existing model interpretation tools as well as developing new approaches. Here, we describe two new methods for visualizing features and interactions learned by deep learning models. Firstly, we describe an algorithm for testing if a deep learning model has learned an existing binding motif of a TF. Secondly, we visualize mutual information between pairwise k-mer distributions in sample inputs selected according to predictions by a machine learning model. This method highlights pairwise, and positional dependencies learned by a machine learning model. We demonstrate the use of this model-agnostic approach with classification and regression models trained on DNA, RNA and amino acid sequences.Monet eliöt koostuvat useista erilaisista solutyypeistä, vaikka kaikissa näiden eliöiden soluissa onkin sama DNA-koodi. Geenien ilmentymisen säätely mahdollistaa erilaiset solutyypit. Virheellisesti toimiva säätely voi johtaa sairauksiin, esimerkiksi syövän puhkeamiseen. Jos sairauden aiheuttaa viallinen proteiini, on syynä usein mutaatio tätä proteiinia koodaavassa geenissä, joka muuttaa proteiinia siten, ettei se enää pysty toimittamaan tehtäväänsä riittävän hyvin. Kuitenkin vain 1,5 % ihmisen genomista on proteiineja koodaavia geenejä. Suurin osa kaikista löydetyistä sairauksiin liitetyistä mutaatioista sijaitsee näiden ns. koodaavien alueiden ulkopuolella. Ei-koodaavien sairauksiin liitetyiden mutaatioiden vaikutusmekanismit ovat yleisesti paljon huonommin tunnettuja, kuin koodaavien alueiden mutaatioiden. Transkriptiotekijöiden sitoutuminen DNA:han säätelee transkriptiota, eli geeneissä olevan geneettisen informaation lukemista ja muuntamista RNA:ksi. Transkriptiotekijöiden sitoutumista DNA:han on mitattu kattavasti in vitro-olosuhteissa, ja monien transkriptiotekijöiden sitoutumiskohdat on mitattu genominlaajuisesti useissa eri solutyypeissä. Tästä huolimatta ymmärryksemme siitä miten transkriptioitekijöiden sitoutuminen genomin säätelyelementteihin, eli promoottoreihin ja vahvistajiin, johtaa geenien ilmentymiseen ei ole sellaisella tasolla, että voisimme luotettavasti ennustaa geenien ilmentymistä pelkästään DNA-sekvenssin perusteella. Tässä työssä kehitämme ja sovellamme laskennallisia työkaluja transkriptiotekijöiden sitoutumisesta johtuvan geenien ilmentymisen analysointiin ja mallintamiseen. Kehitämme myös uusia menetelmiä biologisella sekvenssidatalla opetettujen syväoppimismallien tulkitsemiseksi. Koneoppimismallin tekemien ennusteiden ymmärrettävyys on biologisissa sovelluksissa yleensä yhtä tärkeää, ellei jopa tärkeämpää kuin pelkkä raaka ennustetarkkuus. Tämä on synnyttänyt tarpeen uusille menetelmille, jotka auttavat tutkijoita louhimaan biologisesti merkityksellistä tietoa syväoppimismallien ennusteista. Kehitimme tässä työssä uuden laskennallisen työkalun, jolla voidaan määrittää transkriptiotekijöiden sitoutumiskohdat genominlaajuisesti käyttäen mittausdataa hiljattain kehitetyistä korkearesoluutioisista ChIP-exo ja ChIP-nexus kokeista. Näytämme, että kehittämämme menetelmä suoriutuu paremmin, tai vähintään yhtä hyvin kuin aiemmin julkaistut menetelmät tehden näitä vähemmän oletuksia signaalin muodosta. Esittelemme myös parannellun algoritmin transkriptiotekijöiden alleelispesifin sitoutumisen määrittämiseksi. Käytämme syväoppimismenetelmiä oppimaan mitkä ominaisuudet ennustavat ihmisen promoottori- ja voimistajaelementtien aktiivisuutta. Nämä syväoppimismallit on opetettu valtavien rinnakkaisten reportterigeenikokeiden datalla ihmisen genomisista säätelyelementeistä, sekä aktiivisista promoottoreista ja voimistajista, jotka ovat valikoituneet satunnaisesta joukosta synteettisiä DNA-sekvenssejä. Tämä ennennäkemättömän laaja joukko mittauksia ihmisen säätelyelementtien aktiivisuudesta - yli satakertainen määrä DNA sekvenssiä ihmisen genomiin verrattuna - mahdollisti transkription aloituskohtien sijainnin ennustamisen ihmisen genomissa tarkemmin kuin ihmisen genomilla opetetut mallit. Nämä mallit myös ennustivat oikein sairauksiin liitettyjen mutaatioiden vaikutukset ihmisen promoottoreilla. Tuloksemme näyttivät, että vuorovaikutukset ihmisen promoottorien ja klassisten paikallisten voimistajien välillä ovat epäspesifejä. MPRA-data, integroituna kattavien epigeneettisten mittausten kanssa mahdollisti voimistajaelementtien jaon kolmeen luokkaan: klassiset, suljetun kromatiinin, ja kromatiinista riippuvat voimistajat. Tutkimuksemme osoitti, että transkriptiotekijät voidaan jakaa neljään, osittain päällekkäiseen luokkaan niiden aktiivisuuksien perusteella: kromatiinia avaaviin, voimistaviin, promotoiviin ja transkription aloituskohdan määrittäviin transkriptiotekijöihin. Ihmisen genomin säätelyelementtejä kuvaavien syväoppimismallien tulkitseminen vaati sekä olemassa olevien menetelmien soveltamista, että uusien kehittämistä. Kehitimme tässä työssä kaksi uutta menetelmää syväoppimismallien oppimien muuttujien ja niiden välisten vuorovaikutusten visualisoimiseksi. Ensin esittelemme algoritmin, jonka avulla voidaan testata onko syväoppimismalli oppinut jonkin jo tunnetun transkriptiotekijän sitoutumishahmon. Toiseksi, visualisoimme positiokohtaisten k-meerijakaumien keskeisinformaatiota sekvensseissä, jotka on valittu syväoppimismallin ennusteiden perusteella. Tämä menetelmä paljastaa syväoppimismallin oppimat parivuorovaikutukset ja positiokohtaiset riippuvuudet. Näytämme, että kehittämämme menetelmä on mallin arkkitehtuurista riippumaton soveltamalla sitä sekä luokittelijoihin, että regressiomalleihin, jotka on opetettu joko DNA-, RNA-, tai aminohapposekvenssidatalla

    Deep Learning for Genomics: A Concise Overview

    Full text link
    Advancements in genomic research such as high-throughput sequencing techniques have driven modern genomic studies into "big data" disciplines. This data explosion is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in a variety of fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning since we are expecting from deep learning a superhuman intelligence that explores beyond our knowledge to interpret the genome. A powerful deep learning model should rely on insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with a proper deep architecture, and remark on practical considerations of developing modern deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research, as well as pointing out potential opportunities and obstacles for future genomics applications.Comment: Invited chapter for Springer Book: Handbook of Deep Learning Application

    Learning the Regulatory Code of Gene Expression

    Get PDF
    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology

    Chromatin interaction neural network (ChINN): a machine learning-based method for predicting chromatin interactions from DNA sequences.

    Get PDF
    Chromatin interactions play important roles in regulating gene expression. However, the availability of genome-wide chromatin interaction data is limited. We develop a computational method, chromatin interaction neural network (ChINN), to predict chromatin interactions between open chromatin regions using only DNA sequences. ChINN predicts CTCF- and RNA polymerase II-associated and Hi-C chromatin interactions. ChINN shows good across-sample performances and captures various sequence features for chromatin interaction prediction. We apply ChINN to 6 chronic lymphocytic leukemia (CLL) patient samples and a published cohort of 84 CLL open chromatin samples. Our results demonstrate extensive heterogeneity in chromatin interactions among CLL patient samples

    The Pharmacoepigenomics Informatics Pipeline and H-GREEN Hi-C Compiler: Discovering Pharmacogenomic Variants and Pathways with the Epigenome and Spatial Genome

    Full text link
    Over the last decade, biomedical science has been transformed by the epigenome and spatial genome, but the discipline of pharmacogenomics, the study of the genetic underpinnings of pharmacological phenotypes like drug response and adverse events, has not. Scientists have begun to use omics atlases of increasing depth, and inferences relating to the bidirectional causal relationship between the spatial epigenome and gene expression, as a foundational underpinning for genetics research. The epigenome and spatial genome are increasingly used to discover causative regulatory variants in the significance regions of genome-wide association studies, for the discovery of the biological mechanisms underlying these phenotypes and the design of genetic tests to predict them. Such variants often have more predictive power than coding variants, but in the area of pharmacogenomics, such advances have been radically underapplied. The majority of pharmacogenomics tests are designed manually on the basis of mechanistic work with coding variants in candidate genes, and where genome wide approaches are used, they are typically not interpreted with the epigenome. This work describes a series of analyses of pharmacogenomics association studies with the tools and datasets of the epigenome and spatial genome, undertaken with the intent of discovering causative regulatory variants to enable new genetic tests. It describes the potent regulatory variants discovered thereby to have a putative causative and predictive role in a number of medically important phenotypes, including analgesia and the treatment of depression, bipolar disorder, and traumatic brain injury with opiates, anxiolytics, antidepressants, lithium, and valproate, and in particular the tendency for such variants to cluster into spatially interacting, conceptually unified pathways which offer mechanistic insight into these phenotypes. It describes the Pharmacoepigenomics Informatics Pipeline (PIP), an integrative multiple omics variant discovery pipeline designed to make this kind of analysis easier and cheaper to perform, more reproducible, and amenable to the addition of advanced features. It described the successes of the PIP in rediscovering manually discovered gene networks for lithium response, as well as discovering a previously unknown genetic basis for warfarin response in anticoagulation therapy. It describes the H-GREEN Hi-C compiler, which was designed to analyze spatial genome data and discover the distant target genes of such regulatory variants, and its success in discovering spatial contacts not detectable by preceding methods and using them to build spatial contact networks that unite disparate TADs with phenotypic relationships. It describes a potential featureset of a future pipeline, using the latest epigenome research and the lessons of the previous pipeline. It describes my thinking about how to use the output of a multiple omics variant pipeline to design genetic tests that also incorporate clinical data. And it concludes by describing a long term vision for a comprehensive pharmacophenomic atlas, to be constructed by applying a variant pipeline and machine learning test design system, such as is described, to thousands of phenotypes in parallel. Scientists struggled to assay genotypes for the better part of a century, and in the last twenty years, succeeded. The struggle to predict phenotypes on the basis of the genotypes we assay remains ongoing. The use of multiple omics variant pipelines and machine learning models with omics atlases, genetic association, and medical records data will be an increasingly significant part of that struggle for the foreseeable future.PHDBioinformaticsUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/145835/1/ariallyn_1.pd

    Modelling the transcriptional regulation of androgen receptor in prostate cancer

    Get PDF
    Transcription of genes and production of proteins are essential functions of a normal cell. If disturbed, misregulation of crucial genes leads to aberrant cell behaviour and in some cases, leads to the development of diseased states such as cancer. One major transcriptional regulation tool involves the binding of transcription factor onto enhancer sequences that will encourage or repress transcription depending on the role of the transcription factor. In prostate cells, misregulation of the androgen receptor(AR), a key transcriptional regulator, leads to the development and maintenance of prostate cancer. Androgen receptor binds to numerous locations in the genome, but it is still unclear how and which other key transcription factors aid and repress AR-mediated transcription. Here I analyzed the data that contained the transcriptional activity of 4139 putative AR binding sites (ARBS) in the genome with and without the presence of hormone using the STARR-seq assay. Only a small fraction of ARBS showed significant differential expression when treated with hormone. To understand the underlying essential factors behind hormone-dependent behaviour, we developed both machine learning and biophysical models to identify active enhancers in prostate cancer cells. We also identify potentially crucial transcription factors for androgen-dependent behaviour and discuss the benefits and shortcomings of each modelling method

    The non-coding genome in Autism Spectrum Disorders

    Get PDF
    Autism Spectrum Disorders (ASD) are a group of neurodevelopmental disorders (NDDs) characterized by difficulties in social interaction and communication, repetitive behavior, and restricted interests. While ASD have been proven to have a strong genetic component, current research largely focuses on coding regions of the genome. However, non-coding DNA, which makes up for ∼99% of the human genome, has recently been recognized as an important contributor to the high heritability of ASD, and novel sequencing technologies have been a milestone in opening up new directions for the study of the gene regulatory networks embedded within the non-coding regions. Here, we summarize current progress on the contribution of non-coding alterations to the pathogenesis of ASD and provide an overview of existing methods allowing for the study of their functional relevance, discussing potential ways of unraveling ASD's “missing heritability”S
    corecore