152 research outputs found

    Novel computational methods for studying the role and interactions of transcription factors in gene regulation

    Get PDF
    Regulation of which genes are expressed and when enables the existence of different cell types sharing the same genetic code in their DNA. Erroneously functioning gene regulation can lead to diseases such as cancer. Gene regulatory programs can malfunction in several ways. Often if a disease is caused by a defective protein, the cause is a mutation in the gene coding for the protein rendering the protein unable to perform its functions properly. However, protein-coding genes make up only about 1.5% of the human genome, and majority of all disease-associated mutations discovered reside outside protein-coding genes. The mechanisms of action of these non-coding disease-associated mutations are far more incompletely understood. Binding of transcription factors (TFs) to DNA controls the rate of transcribing genetic information from the coding DNA sequence to RNA. Binding affinities of TFs to DNA have been extensively measured in vitro, ligands by exponential enrichment) and Protein Binding Microarrays (PBMs), and the genome-wide binding locations and patterns of TFs have been mapped in dozens of cell types. Despite this, our understanding of how TF binding to regulatory regions of the genome, promoters and enhancers, leads to gene expression is not at the level where gene expression could be reliably predicted based on DNA sequence only. In this work, we develop and apply computational tools to analyze and model the effects of TF-DNA binding. We also develop new methods for interpreting and understanding deep learning-based models trained on biological sequence data. In biological applications, the ability to understand how machine learning models make predictions is as, or even more important as raw predictive performance. This has created a demand for approaches helping researchers extract biologically meaningful information from deep learning model predictions. We develop a novel computational method for determining TF binding sites genome-wide from recently developed high-resolution ChIP-exo and ChIP-nexus experiments. We demonstrate that our method performs similarly or better than previously published methods while making less assumptions about the data. We also describe an improved algorithm for calling allele-specific TF-DNA binding. We utilize deep learning methods to learn features predicting transcriptional activity of human promoters and enhancers. The deep learning models are trained on massively parallel reporter gene assay (MPRA) data from human genomic regulatory elements, designed regulatory elements and promoters and enhancers selected from totally random pool of synthetic input DNA. This unprecedentedly large set of measurements of human gene regulatory element activities, in total more than 100 times the size of the human genome, allowed us to train models that were able to predict genomic transcription start site positions more accurately than models trained on genomic promoters, and to correctly predict effects of disease-associated promoter variants. We also found that interactions between promoters and local classical enhancers are non-specific in nature. The MPRA data integrated with extensive epigenetic measurements supports existence of three different classes of enhancers: classical enhancers, closed chromatin enhancers and chromatin-dependent enhancers. We also show that TFs can be divided into four different, non-exclusive classes based on their activities: chromatin opening, enhancing, promoting and TSS determining TFs. Interpreting the deep learning models of human gene regulatory elements required application of several existing model interpretation tools as well as developing new approaches. Here, we describe two new methods for visualizing features and interactions learned by deep learning models. Firstly, we describe an algorithm for testing if a deep learning model has learned an existing binding motif of a TF. Secondly, we visualize mutual information between pairwise k-mer distributions in sample inputs selected according to predictions by a machine learning model. This method highlights pairwise, and positional dependencies learned by a machine learning model. We demonstrate the use of this model-agnostic approach with classification and regression models trained on DNA, RNA and amino acid sequences.Monet eliöt koostuvat useista erilaisista solutyypeistä, vaikka kaikissa näiden eliöiden soluissa onkin sama DNA-koodi. Geenien ilmentymisen säätely mahdollistaa erilaiset solutyypit. Virheellisesti toimiva säätely voi johtaa sairauksiin, esimerkiksi syövän puhkeamiseen. Jos sairauden aiheuttaa viallinen proteiini, on syynä usein mutaatio tätä proteiinia koodaavassa geenissä, joka muuttaa proteiinia siten, ettei se enää pysty toimittamaan tehtäväänsä riittävän hyvin. Kuitenkin vain 1,5 % ihmisen genomista on proteiineja koodaavia geenejä. Suurin osa kaikista löydetyistä sairauksiin liitetyistä mutaatioista sijaitsee näiden ns. koodaavien alueiden ulkopuolella. Ei-koodaavien sairauksiin liitetyiden mutaatioiden vaikutusmekanismit ovat yleisesti paljon huonommin tunnettuja, kuin koodaavien alueiden mutaatioiden. Transkriptiotekijöiden sitoutuminen DNA:han säätelee transkriptiota, eli geeneissä olevan geneettisen informaation lukemista ja muuntamista RNA:ksi. Transkriptiotekijöiden sitoutumista DNA:han on mitattu kattavasti in vitro-olosuhteissa, ja monien transkriptiotekijöiden sitoutumiskohdat on mitattu genominlaajuisesti useissa eri solutyypeissä. Tästä huolimatta ymmärryksemme siitä miten transkriptioitekijöiden sitoutuminen genomin säätelyelementteihin, eli promoottoreihin ja vahvistajiin, johtaa geenien ilmentymiseen ei ole sellaisella tasolla, että voisimme luotettavasti ennustaa geenien ilmentymistä pelkästään DNA-sekvenssin perusteella. Tässä työssä kehitämme ja sovellamme laskennallisia työkaluja transkriptiotekijöiden sitoutumisesta johtuvan geenien ilmentymisen analysointiin ja mallintamiseen. Kehitämme myös uusia menetelmiä biologisella sekvenssidatalla opetettujen syväoppimismallien tulkitsemiseksi. Koneoppimismallin tekemien ennusteiden ymmärrettävyys on biologisissa sovelluksissa yleensä yhtä tärkeää, ellei jopa tärkeämpää kuin pelkkä raaka ennustetarkkuus. Tämä on synnyttänyt tarpeen uusille menetelmille, jotka auttavat tutkijoita louhimaan biologisesti merkityksellistä tietoa syväoppimismallien ennusteista. Kehitimme tässä työssä uuden laskennallisen työkalun, jolla voidaan määrittää transkriptiotekijöiden sitoutumiskohdat genominlaajuisesti käyttäen mittausdataa hiljattain kehitetyistä korkearesoluutioisista ChIP-exo ja ChIP-nexus kokeista. Näytämme, että kehittämämme menetelmä suoriutuu paremmin, tai vähintään yhtä hyvin kuin aiemmin julkaistut menetelmät tehden näitä vähemmän oletuksia signaalin muodosta. Esittelemme myös parannellun algoritmin transkriptiotekijöiden alleelispesifin sitoutumisen määrittämiseksi. Käytämme syväoppimismenetelmiä oppimaan mitkä ominaisuudet ennustavat ihmisen promoottori- ja voimistajaelementtien aktiivisuutta. Nämä syväoppimismallit on opetettu valtavien rinnakkaisten reportterigeenikokeiden datalla ihmisen genomisista säätelyelementeistä, sekä aktiivisista promoottoreista ja voimistajista, jotka ovat valikoituneet satunnaisesta joukosta synteettisiä DNA-sekvenssejä. Tämä ennennäkemättömän laaja joukko mittauksia ihmisen säätelyelementtien aktiivisuudesta - yli satakertainen määrä DNA sekvenssiä ihmisen genomiin verrattuna - mahdollisti transkription aloituskohtien sijainnin ennustamisen ihmisen genomissa tarkemmin kuin ihmisen genomilla opetetut mallit. Nämä mallit myös ennustivat oikein sairauksiin liitettyjen mutaatioiden vaikutukset ihmisen promoottoreilla. Tuloksemme näyttivät, että vuorovaikutukset ihmisen promoottorien ja klassisten paikallisten voimistajien välillä ovat epäspesifejä. MPRA-data, integroituna kattavien epigeneettisten mittausten kanssa mahdollisti voimistajaelementtien jaon kolmeen luokkaan: klassiset, suljetun kromatiinin, ja kromatiinista riippuvat voimistajat. Tutkimuksemme osoitti, että transkriptiotekijät voidaan jakaa neljään, osittain päällekkäiseen luokkaan niiden aktiivisuuksien perusteella: kromatiinia avaaviin, voimistaviin, promotoiviin ja transkription aloituskohdan määrittäviin transkriptiotekijöihin. Ihmisen genomin säätelyelementtejä kuvaavien syväoppimismallien tulkitseminen vaati sekä olemassa olevien menetelmien soveltamista, että uusien kehittämistä. Kehitimme tässä työssä kaksi uutta menetelmää syväoppimismallien oppimien muuttujien ja niiden välisten vuorovaikutusten visualisoimiseksi. Ensin esittelemme algoritmin, jonka avulla voidaan testata onko syväoppimismalli oppinut jonkin jo tunnetun transkriptiotekijän sitoutumishahmon. Toiseksi, visualisoimme positiokohtaisten k-meerijakaumien keskeisinformaatiota sekvensseissä, jotka on valittu syväoppimismallin ennusteiden perusteella. Tämä menetelmä paljastaa syväoppimismallin oppimat parivuorovaikutukset ja positiokohtaiset riippuvuudet. Näytämme, että kehittämämme menetelmä on mallin arkkitehtuurista riippumaton soveltamalla sitä sekä luokittelijoihin, että regressiomalleihin, jotka on opetettu joko DNA-, RNA-, tai aminohapposekvenssidatalla

    Constructivist neural network models of cognitive development

    Get PDF
    In this thesis I investigate the modelling of cognitive development with constructivist neural networks. I argue that the constructivist nature of development, that is, the building of a cognitive system through active interactions with its environment, is an essential property of human development and should be considered in models of cognitive development. I evaluate this claim on the basis of evidence from cortical development, cognitive development, and learning theory. In an empirical evaluation of this claim, I then present a constructivist neural network model of the acquisition of the English past tense and of impaired inflectional processing in German agrammatic aphasics. The model displays a realistic course of acquisition, closely modelling the U-shaped learning curve and more detailed effects such as frequency and family effects. Further, the model develops double dissociations between regular and irregular verbs. I argue that the ability of the model to account for the hu..

    Iterative Machine Learning of a Cis-Regulatory Grammar

    Get PDF
    Gene regulation allows for the quantitative control of gene expression. Gene regulation is a complex process encoded through cis-regulatory sequences, short DNA sequences containing clusters of transcription factor binding sites. Each binding site can occur millions of times in multicellular genomes, and seemingly similar collections of binding sites can have very different activities. A leading model to explain these degeneracies is that cis-regulatory sequences follow a “grammar” defined by the number, identity, strength, arrangement, and/or context of the underlying binding sites. Understanding cis-regulatory grammar requires high-throughput technology, quantitative measurements, and computational modeling. This thesis describes an iterative machine learning approach to study cis-regulatory grammar using mouse photoreceptors as a model system. First, I characterized sequence features associated with enhancer and silencer activity in sequences bound by the transcription factor CRX. I showed that both enhancers and silencers are highly occupied by CRX compared to inactive sequences, and enhancers are uniquely enriched for a diverse but degenerate collection of eight motifs. I demonstrated that this information captures a majority of the available signal in genomic sequences and developed an information content metric that summarizes the effects of motif number and diversity. Second, I developed an active machine learning framework that iteratively samples informative perturbations to address the limitations of training quantitative models on genomic sequences alone. I showed that this approach, when complemented with human decision-making, effectively guides machine learning models towards a biologically relevant representation of cis-regulatory grammar. I also highlighted how perturbations selected with active learning are more informative than other perturbations generated by the same procedure. The final machine learning model can capture global and local context-dependencies of transcription factor binding motifs. Using this model, I found that the same motifs can produce the same activity in multiple arrangements. Thus, active machine learning is an effective way to sample perturbations that improve quantitative models of cis-regulatory grammar. Collectively, these results provide an iterative framework to design and sample perturbations that reveal the complexities of cis-regulatory grammar underlying gene regulation

    On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator

    Get PDF
    Deployed image classification pipelines are typically dependent on the images captured in real-world environments. This means that images might be affected by different sources of perturbations (e.g. sensor noise in low-light environments). The main challenge arises by the fact that image quality directly impacts the reliability and consistency of classification tasks. This challenge has, hence, attracted wide interest within the computer vision communities. We propose a transformation step that attempts to enhance the generalization ability of CNN models in the presence of unseen noise in the test set. Concretely, the delineation maps of given images are determined using the CORF push-pull inhibition operator. Such an operation transforms an input image into a space that is more robust to noise before being processed by a CNN. We evaluated our approach on the Fashion MNIST data set with an AlexNet model. It turned out that the proposed CORF-augmented pipeline achieved comparable results on noise-free images to those of a conventional AlexNet classification model without CORF delineation maps, but it consistently achieved significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise

    27th Annual Computational Neuroscience Meeting (CNS*2018): Part One

    Get PDF

    Deep learning models of biological visual information processing

    Get PDF
    Improved computational models of biological vision can shed light on key processes contributing to the high accuracy of the human visual system. Deep learning models, which extract multiple layers of increasingly complex features from data, achieved recent breakthroughs on visual tasks. This thesis proposes such flexible data-driven models of biological vision and also shows how insights regarding biological visual processing can lead to advances within deep learning. To harness the potential of deep learning for modelling the retina and early vision, this work introduces a new dataset and a task simulating an early visual processing function and evaluates deep belief networks (DBNs) and deep neural networks (DNNs) on this input. The models are shown to learn feature detectors similar to retinal ganglion and V1 simple cells and execute early vision tasks. To model high-level visual information processing, this thesis proposes novel deep learning architectures and training methods. Biologically inspired Gaussian receptive field constraints are imposed on restricted Boltzmann machines (RBMs) to improve the fidelity of the data representation to encodings extracted by visual processing neurons. Moreover, concurrently with learning local features, the proposed local receptive field constrained RBMs (LRF-RBMs) automatically discover advantageous non-uniform feature detector placements from data. Following the hierarchical organisation of the visual cortex, novel LRF-DBN and LRF-DNN models are constructed using LRF-RBMs with gradually increasing receptive field sizes to extract consecutive layers of features. On a challenging face dataset, unlike DBNs, LRF-DBNs learn a feature hierarchy exhibiting hierarchical part-based composition. Also, the proposed deep models outperform DBNs and DNNs on face completion and dimensionality reduction, thereby demonstrating the strength of methods inspired by biological visual processing
    corecore