    Learning the Regulatory Code of Gene Expression

    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology

    Riboswitches as hormone receptors: hypothetical cytokinin-binding riboswitches in Arabidopsis thaliana

    BACKGROUND: Riboswitches are mRNA elements that change conformation when bound to small molecules. They are known to be key regulators of biosynthetic pathways in both prokaryotes and eukaryotes. PRESENTATION OF THE HYPOTHESIS: The hypothesis presented here is that riboswitches function as receptors in hormone perception. We propose that riboswitches initiate or integrate signaling cascades upon binding to classic signaling molecules. The molecular interactions for ligand binding and gene expression control would be the same as for biosynthetic pathways, but the context and the cadre of ligands to consider are dramatically different. The hypothesis arose from the observation that a compound used to identify adenine-binding RNA sequences is chemically similar to the classic plant hormone, or growth regulator, cytokinin. A general tenet of the hypothesis is that riboswitch-binding metabolites can be used to make predictions about chemically related signaling molecules. In fact, all cell-permeable signaling compounds can be considered as potential riboswitch ligands. The hypothesis is plausible, as demonstrated by a cursory review of the transcriptome and genome of the model plant Arabidopsis thaliana for transcripts that (i) contain an adenine aptamer motif, and (ii) are also predicted to be cytokinin-regulated. Here, one gene, CRK10 (for Cysteine-rich Receptor-like Kinase 10, At4g23180), contains an adenine aptamer-related sequence and is down-regulated by cytokinin approximately three-fold in public gene expression data. To illustrate the hypothesis, implications of cytokinin binding to the CRK10 mRNA are discussed. TESTING THE HYPOTHESIS: At the broadest level, screening various cell-permeable signaling molecules against random RNA libraries and comparing hits to sequence and gene expression databases could determine how broadly the hypothesis applies. Specific cases, such as CRK10 presented here, will require experimental validation of direct ligand binding, altered RNA conformation, and effect on gene expression. Each case will be different depending on the signaling pathway and the physiology involved. IMPLICATIONS OF THE HYPOTHESIS: This would be a very direct signal perception mechanism for regulating gene expression, rivaling animal steroid hormone receptors, which are frequently ligand-dependent transcription initiation factors. Riboswitch-regulated responses could occur by modulating target RNA stability, translatability, and alternative splicing, all known expression platforms used in riboswitches. The specific illustration presented, CRK10, implies a new mechanism for the perception of cytokinin, a classic plant hormone. Experimental support for the hypothesis would add breadth to the growing list of important functions attributed to riboswitches. REVIEWERS: This article was reviewed by Anthony Poole, Rob Knight, Mikhail Gelfand.

    Predictive modeling of plant messenger RNA polyadenylation sites

    BACKGROUND: One of the essential processing events during pre-mRNA maturation is the post-transcriptional addition of a polyadenine [poly(A)] tail. The 3'-end poly(A) track protects mRNA from unregulated degradation, and indicates the integrity of mRNA through recognition by mRNA export and translation machinery. The position of a poly(A) site is predetermined by signals in the pre-mRNA sequence that are recognized by a complex of polyadenylation factors. These signals are generally tri-part sequence patterns around the cleavage site that serves as the future poly(A) site. In plants, there is little sequence conservation among these signal elements, which makes it difficult to develop an accurate algorithm to predict the poly(A) site of a given gene. We attempted to solve this problem. RESULTS: Based on our current working model and the profile of nucleotide sequence distribution of the poly(A) signals and around poly(A) sites in Arabidopsis, we have devised a Generalized Hidden Markov Model based algorithm to predict potential poly(A) sites. The high specificity and sensitivity of the algorithm were demonstrated by testing several datasets, and at the best combinations, both reach 97%. The accuracy of the program, called poly(A) site sleuth or PASS, has been demonstrated by the prediction of many validated poly(A) sites. PASS also predicted the changes of poly(A) site efficiency in poly(A) signal mutants that were constructed and characterized by traditional genetic experiments. The efficacy of PASS was demonstrated by predicting poly(A) sites within long genomic sequences. CONCLUSION: Based on the features of plant poly(A) signals, a computational model was built to effectively predict the poly(A) sites in Arabidopsis genes. The algorithm will be useful in gene annotation because a poly(A) site signifies the end of the transcript. 
This algorithm can also be used to predict alternative poly(A) sites in known genes, and will be useful in the design of transgenes for crop genetic engineering by predicting and eliminating undesirable poly(A) sites

    Identification and Functional Annotation of Alternatively Spliced Isoforms

    Alternative splicing is a key mechanism for increasing the complexity of transcriptome and proteome in eukaryotic cells. A large portion of multi-exon genes in humans undergo alternative splicing, and this can have significant functional consequences as the proteins translated from alternatively spliced mRNA might have different amino acid sequences and structures. The study of alternative splicing events has been accelerated by the next-generation sequencing technology. However, reconstruction of transcripts from short-read RNA sequencing is not sufficiently accurate. Recent progress in single-molecule long-read sequencing has provided researchers alternative ways to help solve this problem. With the help of both short and long RNA sequencing technologies, tens of thousands of splice isoforms have been catalogued in humans and other species, but relatively few of the protein products of splice isoforms have been characterized functionally, structurally and biochemically. The scope of this dissertation includes using short and long RNA sequencing reads together for the purpose of transcript reconstruction, and using high-throughput RNA-sequencing data and gene ontology functional annotations on gene level to predict functions for alternatively spliced isoforms in mouse and human. In the first chapter, I give an introduction of alternative splicing and discuss the existing studies where next generation sequencing is used for transcript identification. Then, I define the isoform function prediction problem, and explain how it differs from better known gene function prediction problem. In the second chapter of this dissertation, I describe our study where the overall transcriptome of kidney is studied using both long reads from PacBio platform and RNA-seq short reads from Illumina platform. 
We used short reads to validate full-length transcripts found by long PacBio reads, and generated two high quality sets of transcript isoforms that are expressed in glomerular and tubulointerstitial compartments. In the third chapter, I describe our generic framework, where we implemented and evaluated several related algorithms for isoform function prediction for mouse isoforms. We tested these algorithms through both computational evaluation and experimental validation of the predicted ‘responsible’ isoform(s) and the predicted disparate functions of the isoforms of Cdkn2a and of Anxa6. Our algorithm is the first effort to predict and differentiate isoform functions through large-scale genomic data integration. In the fourth chapter, I present the extension of isoform function prediction study to the protein coding isoforms in human. We used a similar multiple instance learning (MIL)-based approach for predicting the function of protein coding splice variants in human. We evaluated our predictions using literature evidence of ADAM15, LMNA/C, and DMXL2 genes. And in the fifth and final chapter, I give a summary of previous chapters and outline the future directions for alternatively spliced isoform reconstruction and function prediction studies. PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/144017/1/ridvan_1.pd

    A global genetic interaction network maps a wiring diagram of cellular function

    We generated a global genetic interaction network for Saccharomyces cerevisiae, constructing more than 23 million double mutants, identifying about 550,000 negative and about 350,000 positive genetic interactions. This comprehensive network maps genetic interactions for essential gene pairs, highlighting essential genes as densely connected hubs. Genetic interaction profiles enabled assembly of a hierarchical model of cell function, including modules corresponding to protein complexes and pathways, biological processes, and cellular compartments. Negative interactions connected functionally related genes, mapped core bioprocesses, and identified pleiotropic genes, whereas positive interactions often mapped general regulatory connections among gene pairs, rather than shared functionality. The global network illustrates how coherent sets of genetic interactions connect protein complex and pathway modules to map a functional wiring diagram of the cell. INTRODUCTION: Genetic interactions occur when mutations in two or more genes combine to generate an unexpected phenotype. An extreme negative or synthetic lethal genetic interaction occurs when two mutations, neither lethal individually, combine to cause cell death. Conversely, positive genetic interactions occur when two mutations produce a phenotype that is less severe than expected. Genetic interactions identify functional relationships between genes and can be harnessed for biological discovery and therapeutic target identification. They may also explain a considerable component of the undiscovered genetics associated with human diseases. Here, we describe construction and analysis of a comprehensive genetic interaction network for a eukaryotic cell. RATIONALE: Genome sequencing projects are providing an unprecedented view of genetic variation. 
However, our ability to interpret genetic information to predict inherited phenotypes remains limited, in large part due to the extensive buffering of genomes, making most individual eukaryotic genes dispensable for life. To explore the extent to which genetic interactions reveal cellular function and contribute to complex phenotypes, and to discover the general principles of genetic networks, we used automated yeast genetics to construct a global genetic interaction network. RESULTS: We tested most of the ~6000 genes in the yeast Saccharomyces cerevisiae for all possible pairwise genetic interactions, identifying nearly 1 million interactions, including ~550,000 negative and ~350,000 positive interactions, spanning ~90% of all yeast genes. Essential genes were network hubs, displaying five times as many interactions as nonessential genes. The set of genetic interactions or the genetic interaction profile for a gene provides a quantitative measure of function, and a global network based on genetic interaction profile similarity revealed a hierarchy of modules reflecting the functional architecture of a cell. Negative interactions connected functionally related genes, mapped core bioprocesses, and identified pleiotropic genes, whereas positive interactions often mapped general regulatory connections associated with defects in cell cycle progression or cellular proteostasis. Importantly, the global network illustrates how coherent sets of negative or positive genetic interactions connect protein complex and pathways to map a functional wiring diagram of the cell. CONCLUSION: A global genetic interaction network highlights the functional organization of a cell and provides a resource for predicting gene and pathway function. This network emphasizes the prevalence of genetic interactions and their potential to compound phenotypes associated with single mutations. 
Negative genetic interactions tend to connect functionally related genes and thus may be predicted using alternative functional information. Although less functionally informative, positive interactions may provide insights into general mechanisms of genetic suppression or resiliency. We anticipate that the ordered topology of the global genetic network, in which genetic interactions connect coherently within and between protein complexes and pathways, may be exploited to decipher genotype-to-phenotype relationships
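
The definitions in the abstract above (an interaction is a double-mutant phenotype that deviates from expectation, negative when more severe and positive when less severe) can be sketched numerically. The following is a minimal illustration, not the authors' pipeline: it assumes fitness as the phenotype, a multiplicative expectation for the double mutant, and a hypothetical cutoff `threshold`; none of these details is stated in the abstract.

```python
def interaction_score(f_a, f_b, f_ab):
    """Deviation of the observed double-mutant fitness from expectation.

    Assumption: the expected double-mutant fitness is the product of the
    two single-mutant fitness values (a multiplicative model).
    """
    return f_ab - f_a * f_b


def classify(eps, threshold=0.08):
    """Label an interaction score; `threshold` is a hypothetical cutoff."""
    if eps < -threshold:
        return "negative"  # more severe than expected (e.g. synthetic lethal)
    if eps > threshold:
        return "positive"  # less severe than expected
    return "neutral"


# Two mildly sick single mutants whose double mutant is dead.
print(classify(interaction_score(0.9, 0.9, 0.0)))
# Double mutant no sicker than either single mutant.
print(classify(interaction_score(0.5, 0.5, 0.5)))
```

Under these assumptions the first case scores well below zero (a synthetic lethal pair) and the second above zero, matching the verbal definitions of negative and positive interactions.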

    Determinants of RNA metabolism in the Schizosaccharomyces pombe genome

    To decrypt the regulatory code of the genome, sequence elements must be defined that determine the kinetics of RNA metabolism and thus gene expression. Here, we attempt such decryption in a eukaryotic model organism, the fission yeast S. pombe. We first derive an improved genome annotation that redefines borders of 36% of expressed mRNAs and adds 487 non-coding RNAs (ncRNAs). We then combine RNA labeling in vivo with mathematical modeling to obtain rates of RNA synthesis and degradation for 5,484 expressed RNAs and splicing rates for 4,958 introns. We identify functional sequence elements in DNA and RNA that control RNA metabolic rates and quantify the contributions of individual nucleotides to RNA synthesis, splicing, and degradation. Our approach reveals distinct kinetics of mRNA and ncRNA metabolism, separates antisense regulation by transcription interference from RNA interference, and provides a general tool for studying the regulatory code of genomes.

    Transient transcriptome sequencing captures enhancer landscapes immediately after T-cell stimulation

    Transcription regulation is poorly understood. Transcriptional enhancers produce enhancer RNAs (eRNAs), a class of transient RNAs, whose function remains mainly unclear. To monitor transcriptional regulation in human cells, rapid changes in enhancer and promoter activity must be captured with high sensitivity and temporal resolution. Here I show that the recently established protocol TT-seq (‘transient transcriptome sequencing’) can monitor rapid changes in transcription from enhancers and promoters during the immediate response of T-cells to ionomycin and phorbol 12-myristate 13-acetate (PMA). TT-seq maps eRNAs and mRNAs every 5 minutes after T-cell stimulation with high sensitivity, and identifies many new primary response genes. TT-seq reveals that the synthesis of 1,601 eRNAs and 650 mRNAs changes significantly within only 15 minutes after stimulation, when standard RNA-seq does not detect differentially expressed genes. Transcription of enhancers that are primed for activation by nucleosome depletion can occur immediately and simultaneously with transcription of target gene promoters. My results indicate that enhancer transcription is a good proxy for enhancer regulatory activity in target gene activation, and establish TT-seq as a tool for monitoring the dynamics of enhancer landscapes and transcription programs during cellular responses and differentiation. Additionally, I developed a normalization method for TT-seq that scales labeled and total RNA-seq samples relative to each other, allowing absolute half-lives to be determined. The method provides a powerful tool to normalize various samples relative to each other on a global scale, and therefore allows global changes in RNA synthesis and degradation to be observed. Taken together, metabolic labeling of RNA followed by kinetic modeling makes it possible to quantify RNA metabolism rates and to detect dynamic changes in enhancer landscapes and RNA expression levels.

    A bioinformatics framework for RNA structure mining, motif discovery and polyadenylation analysis

    The RNA molecules play various important roles in the cell and their functionality depends not only on the sequence information but to a large extent on their structure. The development of computational and predictive approaches to study RNA molecules is extremely valuable. In this research, a tool named RADAR was developed that provides a multitude of functionality for RNA data analysis and research. It aligns structure annotated RNA sequences so that both the sequence as well as structure information is taken into consideration. This tool is capable of performing pair-wise structure alignment, multiple structure alignment, database search and clustering. In addition, it provides two salient features: (i) constrained alignment of RNA secondary structures, and (ii) prediction of consensus structure for a set of RNA sequences. This tool is also hosted on the web and can be freely accessed and the software can be downloaded from http://datalab.njitedu/biodata/rna/RSmatch/server.htm . The RADAR software has been applied to various datasets (genomes of various mammals, viruses and parasites) and our experimental results show that this approach is capable of detecting functionally important regions. As an application of RADAR, a systematic data mining approach was developed, termed GLEAN-UTR, to identify small stem loop RNA structure elements in the untranslated regions (UTRs) that are conserved between human and mouse orthologs and exist in multiple genes with common Gene Ontology terms. This study resulted in 90 distinct RNA structure groups containing 748 structures, with 3' Histone stem loop (HSL3) and Iron Response element (IRE) among the top hits. Further, the role played by structure in mRNA polyadenylation was investigated. Polyadenylation is an important step towards the maturation of almost all cellular mRNAs in eukaryotes. 
Studies have identified several cis-elements besides the widely known polyadenylation signal (PAS) element (AATAAA or ATTAAA or a close variant) which may have a role to play in poly(A) site identification. In this study, the differences in structural stability of sequences surrounding poly(A) sites were investigated, and it was found that for genes containing a single poly(A) site, the surrounding sequence is more stable than the sequences surrounding alternative poly(A) sites. This indicates that structure may be providing an evolutionary advantage for single poly(A) sites that prevents multiple poly(A) sites from arising. In addition, the study found that the structural stability of the region surrounding a polyadenylation site correlates with its distance from the next gene, with the shortest distance corresponding to greater structural stability.
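
As a rough illustration of the PAS element mentioned in the abstract above, the sketch below scans a window upstream of a given poly(A) site for the two hexamers named there (AATAAA and ATTAAA). It is not the RADAR or GLEAN-UTR method: the function name `find_pas`, the 40-nucleotide `window`, and the 0-based `site` coordinate are all assumptions made for the example, and close variants and RNA structure are ignored.

```python
PAS_HEXAMERS = ("AATAAA", "ATTAAA")  # the two hexamers named in the abstract


def find_pas(seq, site, window=40):
    """List (offset, hexamer) pairs for PAS hexamers upstream of `site`.

    `site` is a 0-based index of the poly(A) site; `offset` is the start of
    the hexamer relative to the site (negative means upstream). The window
    size is a hypothetical default, not a value taken from the source.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    start = max(0, site - window)
    hits = []
    # Stop early enough that the whole hexamer lies upstream of the site.
    for i in range(start, site - 5):
        hexamer = seq[i:i + 6]
        if hexamer in PAS_HEXAMERS:
            hits.append((i - site, hexamer))
    return hits


example = "GGGG" + "AATAAA" + "C" * 20
print(find_pas(example, site=30))
```

A real analysis would also need to handle the close variants and the structural stability measures the abstract refers to, which this sketch leaves out.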