
    Deep sequencing of pre-translational mRNPs reveals hidden flux through evolutionarily conserved AS-NMD pathways

    Deep sequencing of mRNAs (RNA-Seq) is now the preferred method for transcriptome-wide quantification of gene expression. Yet many mRNA isoforms, such as those eliminated by nonsense-mediated decay (NMD), are inherently unstable. Thus a significant drawback of steady-state RNA-Seq is that it provides marginal information on the flux through alternative splicing pathways. Measurement of such flux necessitates capture of newly made species prior to mRNA decay. One means to capture nascent mRNAs is affinity purifying either the exon junction complex (EJC) or activated spliceosomes. Late-stage spliceosomes deposit the EJC upstream of exon-exon junctions, where it remains associated until the first round of translation. As most mRNA decay pathways are translation-dependent, these EJC- or spliceosome-associated, pre-translational mRNAs should provide an accurate record of the initial population of alternative mRNA isoforms. Previous work has analyzed the protein composition and structure of pre-translational mRNPs in detail. While in the Moore lab, my project has focused on exploring the diversity of mRNA isoforms contained within these complexes. As expected, known NMD isoforms are more highly represented in pre-translational mRNPs than in RNA-Seq libraries. To investigate whether pre-translational mRNPs contain novel mRNA isoforms, we created a bioinformatics pipeline that identified thousands of previously unannotated splicing events. Though many can be attributed to “splicing noise”, others are evolutionarily conserved events that produce new AS-NMD isoforms likely involved in maintenance of protein homeostasis. Several of these occur in genes whose overexpression has been linked to poor cancer prognosis.

    Predicting gene expression in the human malaria parasite Plasmodium falciparum using histone modification, nucleosome positioning, and 3D localization features.

    Empirical evidence suggests that the malaria parasite Plasmodium falciparum employs a broad range of mechanisms to regulate gene transcription throughout the organism's complex life cycle. To better understand this regulatory machinery, we assembled a rich collection of genomic and epigenomic data sets, including information about transcription factor (TF) binding motifs, patterns of covalent histone modifications, nucleosome occupancy, GC content, and global 3D genome architecture. We used these data to train machine learning models to discriminate between high-expression and low-expression genes, focusing on three distinct stages of the red blood cell phase of the Plasmodium life cycle. Our results highlight the importance of histone modifications and 3D chromatin architecture in Plasmodium transcriptional regulation and suggest that AP2 transcription factors may play a limited regulatory role, perhaps operating in conjunction with epigenetic factors

    Computational studies of genome evolution and regulation

    This thesis takes on the challenge of extracting information from large volumes of biological data produced with newly established experimental techniques. The different types of information present in a particular dataset have been carefully identified to maximise the information gained from the data. This also prevents attempts to infer types of information that are not present in the data. In the first part of the thesis I examined the evolutionary origins of de novo taxonomically restricted genes (TRGs) in the Drosophila subgenus. De novo TRGs are genes that have originated after the speciation of a particular clade from previously non-coding regions - functional ncRNA, within introns or alternative frames of older protein-coding genes, or from intergenic sequences. TRGs are clade-specific tool-kits that are likely to contain proteins with as yet undocumented functions and new protein folds that are yet to be discovered. One of the main challenges in studying de novo TRGs is the trade-off between false positives (non-functional open reading frames) and false negatives (true TRGs that have properties distinct from well established genes). Here I identified two de novo TRG families in the Drosophila subgenus that have not been previously reported as de novo originated genes, and to our knowledge they are the best candidates identified so far for experimental studies aimed at elucidating the properties of de novo genes. In the second part of the thesis I examined the information contained in single cell RNA sequencing (scRNA-seq) data and proposed a method for extracting biological knowledge from this data using generative neural networks. The main challenge is the noisiness of scRNA-seq data - the number of transcripts sequenced is not proportional to the number of mRNAs present in the cell. I used an autoencoder to reduce the dimensionality of the data without making untestable assumptions about the data.
This embedding into lower dimensional space, alongside the features learned by an autoencoder, contains information about the cell populations, differentiation trajectories and the regulatory relationships between the genes. Unlike most methods currently used, an autoencoder does not assume that these regulatory relationships are the same in all cells in the data set. The main advantages of our approach are that it makes minimal assumptions about the data, it is robust to noise and it is possible to assess its performance. In the final part of the thesis I summarise lessons learnt from analysing various types of biological data and make suggestions for the future direction of similar computational studies.

    Computational methods for RNA integrative biology

    Ribonucleic acid (RNA) is an essential molecule, which carries out a wide variety of functions within the cell, from its crucial involvement in protein synthesis to catalysing biochemical reactions and regulating gene expression. This diverse functional repertoire is owed to the complex structures that RNA can adopt and its flexibility as an interacting molecule. Technological advances such as next-generation sequencing (NGS) have made it possible to measure these two crucial aspects of RNA's regulatory role experimentally. NGS methods can rapidly obtain the nucleotide sequence of many molecules in parallel. Designing experiments in which only the desired parts of the molecule (or specific parts of the transcriptome) are sequenced makes it possible to study various aspects of RNA biology. Analysis of NGS data is infeasible without computational methods. One such experimental method is RNA structure probing, which aims to infer RNA structure from sequencing chemically altered transcripts. RNA structure probing data is inherently noisy, affected both by technological biases and the stochasticity of the underlying process. Most existing methods do not adequately address the issue of noise, resorting to heuristics and limiting the informativeness of their output. In this thesis, a statistical pipeline was developed for modelling RNA structure probing data, which explicitly captures biological variability, provides automated bias-correcting strategies, and generates a probabilistic output based on experimental measurements. The output of our method agrees with known RNA structures, can be used to constrain structure prediction algorithms, and remains robust to reduced sequence coverage, thereby increasing sensitivity of the technology. Another recent experimental innovation maps RNA-protein interactions at very high temporal resolution, making it possible to study rapid binding events happening on a minute time scale.
In this thesis, a non-parametric algorithm was developed for identifying significant changes in RNA-protein binding time-series between different conditions. The method was applied to novel yeast RNA-protein binding time-course data to study the role of RNA degradation in stress response. It revealed pervasive changes in the binding to the transcriptome of the yeast transcription termination factor Nab3 and the cytoplasmic exoribonuclease Xrn1 under nutrient stress. This challenged the common assumption of viewing transcriptional changes as the major driver of changes in RNA expression during stress and highlighted the importance of degradation. These findings inspired a dynamical model for RNA expression, where transcription and degradation rates are modelled using RNA-protein binding time-series data
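The abstract above does not say which non-parametric algorithm was developed. As a loosely related illustration only, the sketch below uses a generic label-permutation test (a swapped-in technique, not the thesis method) to ask whether the mean binding time course differs between two conditions. The function names, the distance statistic and the idea of replicate time series per condition are all assumptions made for this example.

```python
import random


def mean_curve(replicates):
    """Average replicate time series point by point."""
    n = len(replicates)
    return [sum(values) / n for values in zip(*replicates)]


def curve_distance(group_a, group_b):
    """Sum of absolute differences between the two mean curves."""
    return sum(abs(a - b) for a, b in zip(mean_curve(group_a), mean_curve(group_b)))


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a p-value by shuffling condition labels across replicates.

    Hypothetical helper: each group is a list of equally long time series.
    """
    rng = random.Random(seed)
    observed = curve_distance(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if curve_distance(pooled[:n_a], pooled[n_a:]) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)
```

With only a few replicates per condition the smallest attainable p-value is bounded by the number of distinct label assignments, so a test like this needs enough replicates to be informative.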

    Learning the Regulatory Code of Gene Expression

    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology

    Random Parametric Perturbations of Gene Regulatory Circuit Uncover State Transitions in Cell Cycle.

    Many biological processes involve precise cellular state transitions controlled by complex gene regulation. Here, we use the budding yeast cell cycle as a model system and explore how a gene regulatory circuit encodes essential information of state transitions. We present a generalized random circuit perturbation method for circuits containing heterogeneous regulation types and its usage to analyze both steady and oscillatory states from an ensemble of circuit models with random kinetic parameters. The stable steady states form robust clusters with a circular structure that are associated with cell cycle phases. This circular structure in the clusters is consistent with single-cell RNA sequencing data. The oscillatory states specify the irreversible state transitions along cell cycle progression. Furthermore, we identify possible mechanisms to understand the irreversible state transitions from the steady states. We expect this approach to be robust and generally applicable to unbiasedly predict dynamical transitions of a gene regulatory circuit.
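The idea of an ensemble of circuit models with random kinetic parameters can be illustrated with a toy example. The sketch below is not the authors' method or their yeast cell cycle circuit: it assumes a two-gene mutual-inhibition circuit, Hill-type regulation, arbitrary parameter ranges and simple forward-Euler integration, all chosen here for illustration. It samples random parameters, integrates from random initial conditions and collects the end states, which could then be clustered.

```python
import random


def inhibition(x, threshold, hill_n):
    """Inhibitory Hill function: 1 at x = 0, falling toward 0 for large x."""
    return 1.0 / (1.0 + (x / threshold) ** hill_n)


def sample_params(rng):
    """Draw one random parameter set (ranges are illustrative assumptions)."""
    return {
        "g_a": rng.uniform(1.0, 20.0),   # maximal production rate of A
        "g_b": rng.uniform(1.0, 20.0),   # maximal production rate of B
        "k_a": rng.uniform(0.1, 1.0),    # degradation rate of A
        "k_b": rng.uniform(0.1, 1.0),    # degradation rate of B
        "th_a": rng.uniform(1.0, 20.0),  # threshold for A inhibiting B
        "th_b": rng.uniform(1.0, 20.0),  # threshold for B inhibiting A
        "n": rng.randint(1, 4),          # Hill coefficient
    }


def simulate(p, a0, b0, dt=0.02, steps=5000):
    """Forward-Euler integration of the toy circuit; returns the end state."""
    a, b = a0, b0
    for _ in range(steps):
        da = p["g_a"] * inhibition(b, p["th_b"], p["n"]) - p["k_a"] * a
        db = p["g_b"] * inhibition(a, p["th_a"], p["n"]) - p["k_b"] * b
        a, b = a + dt * da, b + dt * db
    return a, b


def ensemble_end_states(n_models=30, n_inits=2, seed=0):
    """Simulate many random models, each from several random initial states."""
    rng = random.Random(seed)
    states = []
    for _ in range(n_models):
        p = sample_params(rng)
        for _ in range(n_inits):
            a0 = rng.uniform(0.0, p["g_a"] / p["k_a"])
            b0 = rng.uniform(0.0, p["g_b"] / p["k_b"])
            states.append(simulate(p, a0, b0))
    return states
```

The fixed integration time means the end states are only approximately steady. A fuller treatment would check convergence and handle oscillatory solutions separately, as the abstract describes.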

    RNA Exosome & Chromatin: The Yin & Yang of Transcription: A Dissertation

    Eukaryotic genomes can produce two types of transcripts: protein-coding and non-coding RNAs (ncRNAs). Cryptic ncRNA transcripts are bona fide RNA Pol II products that originate from bidirectional promoters, yet they are degraded by the RNA exosome. Such pervasive transcription is prevalent across eukaryotes, yet its regulation and function are poorly understood. We hypothesized that chromatin architecture at cryptic promoters may regulate ncRNA transcription. Nucleosomes that flank promoters are highly enriched in two histone marks: H3-K56Ac and the variant H2A.Z, which make nucleosomes highly dynamic. These histone modifications are present at a majority of promoters and their stereotypic pattern is conserved from yeast to mammals, suggesting their evolutionary importance. Although required for inducing a handful of genes, their contribution to steady-state transcription has remained elusive. In this work, we set out to understand whether dynamic nucleosomes regulate cryptic transcription and how this is coordinated with the RNA exosome. Remarkably, we find that H3-K56Ac promotes RNA polymerase II occupancy at a large number of protein coding and noncoding loci, yet neither histone mark has a significant impact on steady state mRNA levels in budding yeast. Instead, broad effects of H3-K56Ac or H2A.Z on levels of both coding and ncRNAs are only revealed in the absence of the nuclear RNA exosome. We show that H2A.Z functions with H3-K56Ac in chromosome folding, facilitating formation of Chromosomal Interaction Domains (CIDs). Our study suggests that H2A.Z and H3-K56Ac work in concert with the RNA exosome to control mRNA and ncRNA levels, perhaps in part by regulating higher order chromatin structures. Together, these chromatin factors achieve a balance of RNA exosome activity (yin; negative) and Pol II (yang; positive) to maintain transcriptional homeostasis.

    Detection of regulator genes and eQTLs in gene networks

    Genetic differences between individuals associated with quantitative phenotypic traits, including disease states, are usually found in non-coding genomic regions. These genetic variants are often also associated with differences in expression levels of nearby genes (they are "expression quantitative trait loci" or eQTLs for short) and presumably play a gene regulatory role, affecting the status of molecular networks of interacting genes, proteins and metabolites. Computational systems biology approaches to reconstruct causal gene networks from large-scale omics data have therefore become essential to understand the structure of networks controlled by eQTLs together with other regulatory genes, and to generate detailed hypotheses about the molecular mechanisms that lead from genotype to phenotype. Here we review the main analytical methods and software to identify eQTLs and their associated genes, to reconstruct co-expression networks and modules, to reconstruct causal Bayesian gene and module networks, and to validate predicted networks in silico. Comment: minor revision with typos corrected; review article; 24 pages, 2 figures