549 research outputs found

    A Neural Network Classifier for the COI Barcode Gene

    Mitochondrial Cytochrome C Oxidase subunit I (CO I, read as “see-oh-one”) is a gene, a 658 base pair region of which is proposed as the standard barcode for animals. In other words, CO I is a region of animal DNA that is studied to identify the species of an animal. An existing algorithm, ARBitrator, identifies and extracts CO I sequences from the very large gene sequence database GenBank. ARBitrator extracts CO I sequences with better specificity and accuracy than other existing algorithms for CO I sequence identification[1][2]. This project aims to train a neural network to learn the features of the CO I sequences extracted by ARBitrator, so that the network can later be used to recognize further CO I sequences. In effect, we aim to design, train, and use a deep learning neural network that learns to recognize CO I sequences in a supervised way. This is the first time that a neural network has been explored and used for this purpose.
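
    As an illustration of how a DNA sequence might be presented to a neural network, the sketch below one-hot encodes a sequence into a fixed-length matrix. The abstract does not say which input encoding the project used, so the one-hot scheme, the function name `one_hot`, and the padding behaviour are all assumptions; only the 658 base pair length comes from the text.

```python
# Hypothetical input encoding for a sequence classifier (an assumption,
# not the method described in the abstract).
ALPHABET = "ACGT"
INDEX = {base: i for i, base in enumerate(ALPHABET)}


def one_hot(seq, length=658):
    """Encode a DNA string as `length` rows of 4 binary values.

    Bases outside A/C/G/T (for example N) become all-zero rows.
    Sequences are truncated or zero-padded to exactly `length` rows.
    """
    rows = []
    for base in seq.upper()[:length]:
        row = [0, 0, 0, 0]
        if base in INDEX:
            row[INDEX[base]] = 1
        rows.append(row)
    # Pad short sequences so every example has the same shape.
    while len(rows) < length:
        rows.append([0, 0, 0, 0])
    return rows


if __name__ == "__main__":
    encoded = one_hot("ACGTN", length=6)
    for row in encoded:
        print(row)
```

    A fixed-length encoding of this kind lets every candidate sequence share one input shape, which is what most supervised network architectures expect.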

    Functional Identification and Characterization of cis-Regulatory Elements

    Transcription is regulated through interactions between regulatory proteins, such as transcription factors (TFs), and DNA sequence. It is known that TFs act combinatorially in some cases to regulate transcription, but in which situations and to what degree is unclear. I first studied the contribution of TF binding sites to expression in mouse embryonic stem (ES) cells by using synthetic cis-regulatory elements (CREs). The synthetic CREs were comprised of combinations of binding sites for the pluripotency TFs Oct4, Sox2, Klf4, and Esrrb. A statistical thermodynamic model explained 72% of the variation in expression driven by these CREs. The high predictive power of this model depended on five TF interaction parameters, including favorable heterotypic interactions between Oct4 and Sox2, Klf4 and Sox2, and Klf4 and Esrrb. The model also included two unfavorable homotypic interaction parameters. These homotypic parameters help to explain the fact that synthetic CREs with mixtures of binding sites for various TFs drive much higher expression than multiple binding sites for the same TF. I then found that the expression of these synthetic CREs largely changes as ES cells differentiate down the neural lineage. However, CREs with no repeat binding sites drove similar levels of expression, suggesting that heterotypic interactions may be similar in the two conditions. In a separate set of experiments I interrogated the determinants of expression driven by genomic sequences previously segmented into classes based on chromatin features. A set of these sequences was assayed in K562 cells. As expected, we found that Enhancers and Weak Enhancers drove expression over background, while Repressed elements and Enhancers from another cell type did not. Unexpectedly, we found that Weak Enhancers drove higher expression than Enhancers, possibly based on their lower H3K36me3 and H3K27ac, which we found to be weakly associated with lower expression. 
Using a logistic regression model, we showed that matches to TF binding motifs were best able to predict active sequences, but chromatin features contributed significantly as well. These results demonstrate that interactions between certain combinations of pluripotency TFs, but not all combinations, are important to transcriptional regulation. Furthermore, chromatin modifications can still contribute to predictions of expression even after accounting for binding site motifs. Better understanding of the process of cis-regulation will allow us to predict which sequences can drive expression and how perturbations affect this expression

    Development of a simple artificial intelligence method to accurately subtype breast cancers based on gene expression barcodes

    Magister Scientiae - MSc. INTRODUCTION: Breast cancer is a highly heterogeneous disease. The complexity of achieving an accurate diagnosis and an effective treatment regimen lies within this heterogeneity. Subtypes of the disease are not simply molecular, i.e. defined by hormone receptor over-expression or absence; the tumour itself is also heterogeneous in terms of tissue of origin, metastases, and histopathological variability. Accurate tumour classification vastly improves treatment decisions, patient outcomes and 5-year survival rates. Gene expression studies aided by transcriptomic technologies such as microarrays and next-generation sequencing (e.g. RNA-Sequencing) have improved researchers' and clinicians' understanding of the complex molecular portraits of malignant breast tumours. Mechanisms governing cancers, which include tumorigenesis, gene fusions, gene over-expression and suppression, and cellular process and pathway involvement, have been elucidated through comprehensive analyses of the cancer transcriptome. Over the past 20 years, gene expression signatures discovered with both microarray and RNA-Seq have reached clinical and commercial application through the development of tests such as Mammaprint®, OncotypeDX®, and FoundationOne® CDx, all of which focus on chemotherapy sensitivity, prediction of cancer recurrence, and tumour mutational level. The Gene Expression Barcode (GExB) algorithm was developed to allow for easy interpretation and integration of microarray data through data normalization with frozen RMA (fRMA) preprocessing and conversion of relative gene expression to a sequence of 1's and 0's. Unfortunately, the algorithm has not yet been developed for RNA-Seq data. However, implementation of the GExB with feature-selection would contribute to a robust, machine-learning-based breast cancer and subtype classifier. METHODOLOGY: For microarray data, we applied the GExB algorithm to generate barcodes for normal breast and breast tumour samples.
A two-class classifier for malignancy was developed through feature-selection on barcoded samples by selecting genes with 85% stable absence or presence within a tissue type that were differentially stable between tissues. A multi-class feature-selection method was employed to identify genes with variable expression in one subtype, but 80% stable absence or presence in all other subtypes, i.e. 80% in n-1 subtypes. For RNA-Seq data, a barcoding method needed to be developed which could mimic the GExB algorithm for microarray data. A z-score-to-barcode method was implemented, together with differential gene expression analysis and selection of the top 100 genes as informative features for classification. The accuracy and discriminatory capability of both the microarray-based and the RNA-Seq-based gene signatures were assessed through unsupervised and supervised machine-learning algorithms, i.e., K-means and Hierarchical clustering, as well as binary and multi-class Support Vector Machine (SVM) implementations. RESULTS: The GExB-FS method for microarray data yielded an 85-probe and a 346-probe informative set for the two-class and multi-class classifiers, respectively. The two-class classifier predicted samples as either normal or malignant with 100% accuracy, and the multi-class classifier predicted molecular subtype with 96.5% accuracy with SVM. Combining RNA-Seq DE analysis for feature-selection with the z-score-to-barcode method resulted in a two-class classifier for malignancy, and a multi-class classifier for normal-from-healthy, normal-adjacent-tumour (from cancer patients), and breast tumour samples with 100% accuracy. Most notably, a normal-adjacent-tumour gene expression signature emerged, which differentiated it from normal breast tissues in healthy individuals. CONCLUSION: A potentially novel method for microarray and RNA-Seq data transformation, feature selection and classifier development was established.
The universal application of the microarray signatures and the validity of the z-score-to-barcode method were proven with 95% accurate classification of RNA-Seq barcoded samples with a microarray-discovered gene expression signature. The results from this comprehensive study into the discovery of robust gene expression signatures hold immense potential for further R&D towards implementation at the clinical endpoint, and translation to simpler and cost-effective laboratory methods such as qtPCR-based tests.
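
The barcoding and two-class feature-selection ideas in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the z-score cutoff of 0, the use of the population standard deviation, and the function names `zscore_barcode` and `stable_genes` are hypothetical. Only the 1/0 barcode idea and the 85% stability threshold come from the text.

```python
from statistics import mean, pstdev


def zscore_barcode(expr, threshold=0.0):
    """Turn expression values into 1/0 calls per gene.

    expr maps gene -> list of expression values (one per sample).
    A sample is called 1 when its z-score exceeds `threshold`
    (the cutoff is an assumption, not taken from the abstract).
    """
    barcodes = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        if sd == 0:
            # No variation across samples: call the gene absent everywhere.
            barcodes[gene] = [0] * len(values)
        else:
            barcodes[gene] = [1 if (v - mu) / sd > threshold else 0 for v in values]
    return barcodes


def stable_genes(barcodes, labels, stability=0.85):
    """Select genes stably present in one class and stably absent in the other."""
    classes = sorted(set(labels))
    selected = []
    for gene, calls in barcodes.items():
        states = []
        for cls in classes:
            cls_calls = [c for c, lab in zip(calls, labels) if lab == cls]
            frac_present = sum(cls_calls) / len(cls_calls)
            if frac_present >= stability:
                states.append("present")
            elif frac_present <= 1 - stability:
                states.append("absent")
            else:
                states.append("unstable")
        # Keep genes that are stable in both classes but in different states.
        if "unstable" not in states and len(set(states)) == 2:
            selected.append(gene)
    return selected


if __name__ == "__main__":
    expr = {"G1": [1, 1, 1, 9, 9, 9], "G2": [5, 5, 5, 5, 5, 5], "G3": [1, 9, 1, 9, 1, 9]}
    labels = ["normal"] * 3 + ["tumour"] * 3
    print(stable_genes(zscore_barcode(expr), labels))
```

In this toy input only `G1` is retained, because it is absent in every normal sample and present in every tumour sample; `G2` never varies and `G3` is unstable within each class.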

    Dissecting regional heterogeneity and modeling transcriptional cascades in brain organoids

    Over the past decade, there has been a rapid expansion in the development and utilization of brain organoid models, enabling three-dimensional in vivo-like views of fundamental neurodevelopmental features of corticogenesis in health and disease. Nonetheless, the methods used for generating cortical organoid fates exhibit widespread heterogeneity across different cell lines. Here, we show that a combination of dual SMAD and WNT inhibition (Triple-i protocol) establishes a robust cortical identity in brain organoids, while other widely used derivation protocols are inconsistent with respect to regional specification. In order to measure this heterogeneity, we employ single-cell RNA-sequencing (scRNA-Seq), enabling the sampling of the gene expression profiles of thousands of cells in an individual sample. However, in order to draw meaningful conclusions from scRNA-Seq data, technical artifacts must be identified and removed. In this thesis, we present a method to detect one such artifact: empty droplets that do not contain a cell and consist mainly of free-floating mRNA in the sample. Furthermore, from their expression profiles, cells can be ordered along a developmental trajectory which recapitulates the progression of cells as they differentiate. Based on this ordering, we model gene expression using a Bayesian inference approach in order to measure transcriptional dynamics within differentiating cells. This enables the ordering of genes along transcriptional cascades, statistical testing for differences in gene expression changes, and measuring potential regulatory gene interactions. We apply this approach to cortical neural stem cells differentiating into cortical neurons via an intermediate progenitor cell type in brain organoids, to provide a detailed characterization of the endogenous molecular processes underlying neurogenesis.

    cis-Regulation in the Mammalian Rod Photoreceptor

    Transcription factors regulate the expression level of target genes by binding to cis-regulatory elements (CREs) present in gene promoters. The goal of my thesis research is to define the sequence components of CREs that determine transcriptional output. In order to accomplish this goal, I developed a method to measure the regulatory activity of thousands of CREs in a single experiment. In this method I insert unique barcodes in the 3' UTR of a reporter gene and multiplex expression measurements with RNA sequencing. Using this technique in explanted retinas, I determined the impact of single nucleotide variants in a mammalian promoter by measuring expression controlled by all single nucleotide variants of the Rhodopsin proximal promoter. I found that nearly all (86%) sequence variants drive significantly different activity than the wild-type promoter and that the mechanism of most variants can be interpreted as altered transcription factor binding. In addition, we found that the largest changes in expression resulted from variants located in characterized transcription factor binding site sequences. Next, I explored how combinations of binding sites drive particular levels of gene expression by utilizing a synthetic biology approach. I generated synthetic CREs composed of various combinations of binding sites found in the Rhodopsin promoter and measured the expression driven by these sequences. In this study I found that synthetic CREs containing binding sites for transcriptional activators yielded diverse expression outputs, including both activation and repression of a minimal promoter. Together, these experiments demonstrate that interactions between binding sites and dual regulation of a single binding site can produce diverse gene expression patterns.
I conclude that simple cis-regulatory elements can produce complex expression outputs due to interactions between transcriptional activators, and that detailed quantitative models will be necessary to predict expression from these sequences.
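
The barcoded reporter readout described in this abstract can be illustrated with a small counting sketch. The abstract states only that unique barcodes in the 3' UTR are read out by RNA sequencing. Normalizing RNA barcode counts to DNA (input library) counts, the pseudocount, and the function name `cre_activity` are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def cre_activity(rna_reads, dna_reads, barcode_to_cre, pseudocount=1.0):
    """Estimate per-CRE activity from barcode read counts.

    rna_reads / dna_reads: lists of barcode strings observed in each library.
    barcode_to_cre: maps each known barcode to the CRE it tags.
    Activity per barcode is (RNA + pseudocount) / (DNA + pseudocount);
    activity per CRE is the mean over its barcodes. Normalizing to DNA
    counts is an assumption, not something the abstract specifies.
    """
    rna = Counter(rna_reads)
    dna = Counter(dna_reads)
    per_cre = defaultdict(list)
    for barcode, cre in barcode_to_cre.items():
        ratio = (rna[barcode] + pseudocount) / (dna[barcode] + pseudocount)
        per_cre[cre].append(ratio)
    # Reads whose barcode is not in the map are ignored.
    return {cre: sum(r) / len(r) for cre, r in per_cre.items()}


if __name__ == "__main__":
    mapping = {"AAAA": "cre1", "CCCC": "cre1", "GGGG": "cre2"}
    rna = ["AAAA"] * 3 + ["CCCC"] * 3 + ["GGGG"] + ["TTTT"]
    dna = ["AAAA", "CCCC", "GGGG"]
    print(cre_activity(rna, dna, mapping))
```

Tagging each CRE with several barcodes, as assumed here, lets the per-barcode ratios be averaged so that no single barcode sequence dominates the estimate.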

    Novel Approaches to Studying the Effects of Cis-Regulatory Variants in the Central Nervous System

    For decades, studies of the genetic basis of disease have focused on rare coding mutations that disrupt protein function, leading to the identification of hundreds of genes underlying Mendelian diseases. However, many complex diseases are non-Mendelian, and less than 2% of the genome is coding. It is now clear that non-coding variants contribute to disease susceptibility, but the precise underlying mechanisms are generally unknown. Cis-regulatory elements (CREs) are transcription factor (TF)-bound genomic regions that regulate gene expression, and variants within CREs can therefore modify gene expression. The putative locations of CREs in a variety of cell types have been identified through genome-wide assays of TF binding and epigenomic signatures, providing a starting point for probing the effects of cis-regulatory variants. Unlike coding mutations, which can be interpreted based on the genetic code, the functional consequence of any given cis-regulatory variant is difficult to predict even at the molecular level. Therefore, a major bottleneck lies in interpreting the functional significance of these variants. In the present work, I study the effects of cis-regulatory variants in the central nervous system (CNS), specifically in retina and brain. The retina is composed of well-characterized neuronal cell types and an extensively studied transcriptional network, while the brain is the center of human cognition and a target of devastating neuropsychiatric diseases. First, I take advantage of the genetic diversity between two distantly related mouse strains to describe the relationship between cis-regulatory variants and differences in retinal gene expression. I identify cis- and trans-regulatory effects, as well as parent-of-origin effects. Second, I develop a new technology based on an existing massively parallel reporter assay, CRE-seq, to enable the functional study of long CREs in the CNS in vivo for the first time. 
I demonstrate the ability of this approach to measure tissue-specific cis-regulatory activity in the brain and to pinpoint DNA bases critical for activity. Finally, I conduct a detailed mechanistic study of a non-coding region containing variants associated with both human cognitive performance and bipolar disorder. This last study illustrates the complexities and challenges of establishing the causal role of non-coding variants in disease

    Whole-genome functional characterization of RE1 silencers using a modified massively parallel reporter assay.

    Transcriptional silencers are understudied compared with activating elements. By using MPRAduo, Mouri et al. perform a whole-genome functional characterization screen of RE1 silencers and identify REST-binding motif characteristics and cofactor localization required for a functional silencer. They also identify human genetic variants that impact RE1 activity.

    Towards a multisensor station for automated biodiversity monitoring

    Rapid changes of the biosphere observed in recent years are caused by both small and large scale drivers, like shifts in temperature, transformations in land-use, or changes in the energy budget of systems. While the latter processes are easily quantifiable, documentation of the loss of biodiversity and community structure is more difficult. Changes in organismal abundance and diversity are barely documented. Censuses of species are usually fragmentary and inferred by often spatially, temporally and ecologically unsatisfactory simple species lists for individual study sites. Thus, detrimental global processes and their drivers often remain unrevealed. A major impediment to monitoring species diversity is the lack of human taxonomic expertise that is implicitly required for large-scale and fine-grained assessments. Another is the large amount of personnel and associated costs needed to cover large scales, or the inaccessibility of remote but nonetheless affected areas. To overcome these limitations we propose a network of Automated Multisensor stations for Monitoring of species Diversity (AMMODs) to pave the way for a new generation of biodiversity assessment centers. This network combines cutting-edge technologies with biodiversity informatics and expert systems that conserve expert knowledge. Each AMMOD station combines autonomous samplers for insects, pollen and spores, audio recorders for vocalizing animals, sensors for volatile organic compounds emitted by plants (pVOCs) and camera traps for mammals and small invertebrates. AMMODs are largely self-containing and have the ability to pre-process data (e.g. for noise filtering) prior to transmission to receiver stations for storage, integration and analyses. 
Installation at sites that are difficult to access requires a sophisticated and challenging system design with an optimum balance between power requirements, bandwidth for data transmission, required service, and operation under all environmental conditions for years. An important prerequisite for automated species identification is the availability of databases of DNA barcodes, animal sounds, pVOCs, and images that serve as training data. AMMOD stations thus become a key component to advance the field of biodiversity monitoring for research and policy by delivering biodiversity data at an unprecedented spatial and temporal resolution. © 2022 Published by Elsevier GmbH on behalf of Gesellschaft für Ökologie.

    Exploiting natural and induced genetic variation to study hematopoiesis

    PUZZLING WITH DNA. Blood cell formation can be studied by making use of natural genetic variation across mouse strains. There are, for example, two mouse strains that differ not only in fur color, but also in average life span and, more specifically, in the number of blood-forming stem cells in their bone marrow. The cause of these differences can be found in the DNA of these mice. This DNA differs slightly between the two mouse strains, making some genes in one strain just a bit more or less active compared to those same genes in the other strain. The aim of part I of this thesis was to study the influence of genetic variation on gene expression and how this might explain the specific characteristics of the mouse strains. One of the findings in this study was that the influence of genetic variation on gene expression is strongly cell-type-dependent. Additionally, blood cell formation can be studied by introducing genetic variation into the system. In part II of this thesis genetic variation was introduced into mouse blood-forming stem cells by letting random DNA sequences or “barcodes” integrate into the DNA of these cells. Thereby, these cells were provided with a unique and identifiable label that was heritable from mother to daughter cell. The DNA sequence of the label can be read using a sequencing technique. In this manner the fate of blood-forming stem cells and their progeny could be tracked following transplantation in mice. This technique is very promising for monitoring blood cell formation in future clinical gene therapy studies in humans.

    Mapping and Functional Analysis of cis-Regulatory Elements in Mouse Photoreceptors

    Photoreceptors are light-sensitive neurons that mediate vision, and they are the most commonly affected cell type in genetic forms of blindness. In mice, there are two basic types of photoreceptors, rods and cones, which mediate vision in dim and bright environments, respectively. The transcription factors (TFs) that control rod and cone development have been studied in detail, but the cis-regulatory elements (CREs) through which these TFs act are less well understood. To comprehensively identify photoreceptor CREs in mice and to understand their relationship with gene expression, we performed open chromatin (ATAC-seq) and transcriptome (RNA-seq) profiling of FACS-purified rods and cones. We find that rods have significantly fewer regions of open chromatin than cones (as well as >60 additional cell types and tissues), and we demonstrate that this uniquely closed chromatin architecture depends on the rod master regulator Nrl. Finally, we find that regions of rod- and cone-specific open chromatin are enriched for distinct sets of TF binding sites, providing insight into the cis-regulatory grammar of these cell types. We also sought to understand how the regulatory activity of rod and cone open chromatin regions is encoded in DNA sequence. Cone-rod homeobox (CRX) is a paired-like homeodomain TF and master regulator of both rod and cone development, and CRX binding sites are by far the most enriched TF binding sites in photoreceptor CREs. The in vitro DNA binding preferences of CRX have been extensively characterized, but how well in vitro models of TF binding site affinity predict in vivo regulatory activity is not known. In addition, paired-class homeodomain TFs bind DNA as both monomers and dimers, but whether monomeric and dimeric CRX binding sites have distinct regulatory activities is not known.
To address these questions, we used a massively parallel reporter assay to quantify the activity of thousands of native and mutant CRX binding sites in explanted mouse retinas. These data reveal that dimeric CRX binding sites encode stronger enhancers than monomeric CRX binding sites. Moreover, the activity of half-sites within dimeric CRX binding sites is cooperative and spacing-dependent. In addition, saturating mutagenesis of 195 CRX binding sites reveals that, while TF binding site affinity and activity are moderately correlated across mutations within individual CREs, they are poorly correlated across mutations from distinct CREs. Accordingly, we show that accounting for baseline CRE activity improves the prediction of the effects of mutations in regulatory DNA from sequence-based models. Taken together, these data demonstrate that the activity of CRX binding sites depends on multiple layers of sequence context, providing insight into photoreceptor gene regulation and illustrating functional principles of homeodomain TF binding sites.