10 research outputs found

    Detecting and comparing non-coding RNAs in the high-throughput era.

    In recent years there has been a growing interest in the field of non-coding RNA. This surge is a direct consequence of the discovery of a huge number of new non-coding genes and of the finding that many of these transcripts are involved in key cellular functions. In this context, accurately detecting and comparing RNA sequences has become important. Aligning nucleotide sequences is a key requisite when searching for homologous genes. Accurate alignments reveal evolutionary relationships, conserved regions and, more generally, any biologically relevant pattern. Comparing RNA molecules is, however, a challenging task. The nucleotide alphabet is simpler and therefore less informative than that of amino acids. Moreover, for many non-coding RNAs, evolution is likely to be mostly constrained at the structural level and not at the sequence level. This results in very poor sequence conservation, impeding comparison of these molecules. These difficulties define a context where new methods are urgently needed in order to exploit experimental results to their full potential. This review focuses on the comparative genomics of non-coding RNAs in the context of new sequencing technologies, dealing especially with two extremely important and timely research aspects: the development of new methods to align RNAs and the analysis of high-throughput data.

    ncRNA orthologies in the vertebrate lineage.

    Annotation of orthologous and paralogous genes is necessary for many aspects of evolutionary analysis. Methods to infer these homology relationships have traditionally focused on protein-coding genes, and the evolutionary models used by these methods normally assume that the positions in the protein evolve independently. However, as our appreciation for the roles of non-coding RNA genes has increased, consistently annotated sets of orthologous and paralogous ncRNA genes are increasingly needed. At the same time, methods such as PHASE or RAxML have implemented substitution models that consider pairs of sites to enable proper modelling of the loops and other features of RNA secondary structure. Here, we present a comprehensive analysis pipeline for the automatic detection of orthologues and paralogues for ncRNA genes. We focus on gene families represented in Rfam and for which a specific covariance model is provided. For each family, ncRNA genes found in all Ensembl species are aligned using Infernal, and several trees are built using different substitution models. In parallel, a genomic alignment that includes the ncRNA genes and their flanking sequence regions is built with PRANK. This alignment is used to create two additional phylogenetic trees using the neighbour-joining (NJ) and maximum-likelihood (ML) methods. The trees arising from both the ncRNA and genomic alignments are merged using TreeBeST, which reconciles them with the species tree in order to identify speciation and duplication events. The final tree is used to infer the orthologues and paralogues following Fitch's definition. We also determine gene gain and loss events for each family using CAFE. All data are accessible through the Ensembl Comparative Genomics ('Compara') API and on our FTP site, and are fully integrated in the Ensembl genome browser, where they can be accessed in a user-friendly manner. Database URL: http://www.ensembl.org
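The final inference step described above can be illustrated with a minimal Python sketch. Under Fitch's definition, two genes are orthologues if their last common ancestor in the reconciled gene tree is a speciation node, and paralogues if it is a duplication node. The `Node` class, the event labels and the toy gene names below are assumptions made for illustration only; they are not the Ensembl Compara or TreeBeST data model.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Dict, FrozenSet, List


@dataclass
class Node:
    """Node of a reconciled gene tree (hypothetical structure for illustration)."""
    name: str = ""
    event: str = "leaf"  # one of "leaf", "speciation", "duplication"
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> List[str]:
    """Return the names of all leaf genes below a node."""
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in leaves(child)]


def classify_pairs(root: Node) -> Dict[FrozenSet[str], str]:
    """Label every gene pair by the event at its last common ancestor."""
    relations: Dict[FrozenSet[str], str] = {}

    def visit(node: Node) -> None:
        # Genes in different child subtrees have this node as their
        # last common ancestor, so its event decides the relationship.
        label = "orthologue" if node.event == "speciation" else "paralogue"
        for left, right in combinations(node.children, 2):
            for a in leaves(left):
                for b in leaves(right):
                    relations[frozenset((a, b))] = label
        for child in node.children:
            visit(child)

    visit(root)
    return relations


# Toy tree: an ancient duplication followed by a speciation in each copy.
toy_tree = Node(event="duplication", children=[
    Node(event="speciation", children=[Node("human_A"), Node("mouse_A")]),
    Node(event="speciation", children=[Node("human_B"), Node("mouse_B")]),
])
toy_relations = classify_pairs(toy_tree)
```

In this toy tree, `human_A` and `mouse_A` are separated by a speciation node and are therefore labelled orthologues, whereas any A/B pair traces back to the duplication at the root and is labelled a paralogue.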

    Expanding the repertoire of bacterial (non-)coding RNAs

    The detection of non-protein-coding RNA (ncRNA) genes in bacteria and their diverse regulatory modes of action has moved the experimental and bio-computational analysis of ncRNAs into the focus of attention. Regulatory ncRNA transcripts are not translated into proteins but function directly at the RNA level. These typically small RNAs have been found to be involved in diverse processes such as (post-)transcriptional regulation and modification, translation, protein translocation, protein degradation and sequestration. Bacterial ncRNAs either arise from independent primary transcripts or their mature sequence is generated via processing from a precursor. Besides these autonomous transcripts, RNA regulators (e.g. riboswitches and RNA thermometers) also form chimeras with protein-coding sequences. These structured regulatory elements are encoded within the messenger RNA and directly regulate the expression of their “host” gene. The quality and completeness of genome annotation is essential for all subsequent analyses. In contrast to protein-coding genes, ncRNAs lack clear statistical signals at the sequence level. Thus, sophisticated tools have been developed to automatically identify ncRNA genes. Unfortunately, these tools are not part of generic genome annotation pipelines, and therefore computational searches for known ncRNA genes are the starting point of each study. Moreover, prokaryotic genome annotation lacks essential features of protein-coding genes. Many known ncRNAs regulate translation via base-pairing to the 5’ UTR (untranslated region) of mRNA transcripts. Eukaryotic 5’ UTRs have been routinely annotated by sequencing of ESTs (expressed sequence tags) for more than a decade. Only recently have experimental setups been developed to systematically identify these elements on a genome-wide scale in prokaryotes.
The first part of this thesis describes three experimental surveys of exploratory field studies to analyze transcript organization in pathogenic bacteria. To identify ncRNAs in Pseudomonas aeruginosa we used a combination of an experimental RNomics approach and ncRNA prediction. Besides already known ncRNAs, we identified and validated the expression of six novel RNA genes. Global detection of transcripts by next-generation RNA sequencing techniques unraveled an unexpectedly complex transcript organization in many bacteria. These ultra-high-throughput methods give us the appealing opportunity to analyze the complete RNA output of any species at once. The development of the differential RNA sequencing (dRNA-seq) approach enabled us to analyze the primary transcriptome of Helicobacter pylori and Xanthomonas campestris. For the first time we generated a comprehensive and precise transcription start site (TSS) map for both species and provide a general framework for the analysis of dRNA-seq data. Focusing on computer-aided analysis, we developed new tools to annotate TSS, to detect small protein-coding genes and to infer homology of newly detected transcripts. We discovered hundreds of TSS in intergenic regions, upstream of protein-coding genes, within operons and antisense to annotated genes. Analysis of 5’ UTRs (spanning from the TSS to the start codon of the adjacent protein-coding gene) revealed an unexpected size diversity ranging from zero to several hundred nucleotides. We identified and validated the expression of about 60 and about 20 ncRNA candidates in Helicobacter and Xanthomonas, respectively. Among these ncRNA candidates we found several small protein-coding genes that have previously evaded annotation in both species. We showed that the combination of dRNA-seq and computational analysis is a powerful method to examine prokaryotic transcriptomes. Experimental setups are time-consuming and often associated with high costs.
Another limitation of experimental approaches is that genes which are expressed only in specific developmental stages or stress conditions are likely to be missed. Bioinformatic tools offer an alternative to overcome such constraints. General approaches usually depend on comparative genomic data, and evolutionary signatures are used to analyze the (non-)coding potential of multiple sequence alignments. In the second part of my thesis we present our major update of the widely used ncRNA gene finder RNAz and introduce RNAcode, an efficient tool to assess the local protein-coding potential of genomic regions. RNAz has been successfully used to identify structured RNA elements in all domains of life. However, our own experience and the user feedback not only demonstrated the applicability of the RNAz approach, but also helped us to identify limitations of the current implementation. Using a much larger training set and a new classification model we significantly improved the prediction accuracy of RNAz. During transcriptome analysis we repeatedly identified small protein-coding genes that have not been annotated so far. Only a few of those genes are known to date and standard protein-coding gene finding tools suffer from the lack of training data. To avoid an excess of false positive predictions, gene finding software is usually run with an arbitrary cutoff of 40-50 amino acids and therefore misses the small protein-coding genes. We have implemented RNAcode, which is optimized for emerging applications not covered by standard protein-coding gene annotation software. In addition to complementing classical protein gene annotation, a major field of application of RNAcode is the functional classification of transcribed regions. RNA sequencing analyses are likely to falsely report transcript fragments (e.g. mRNA degradation products) as non-coding. Hence, an evaluation of the protein-coding potential of these fragments is an essential task.
RNAcode reports local regions of high coding potential instead of complete protein-coding genes. Training on known protein-coding sequences is not necessary, and RNAcode can therefore be applied to any species. We showed this with our analysis of the Escherichia coli genome, where the current annotation could be accurately reproduced. We furthermore identified novel small protein-coding genes with RNAcode in this extensively studied genome. Using transcriptome and proteome data we found compelling evidence that several of the identified candidates are bona fide proteins. In summary, this thesis clearly demonstrates that bioinformatic methods are mandatory to analyze the huge amount of transcriptome data and to identify novel (non-)coding RNA genes. With the major update of RNAz and the implementation of RNAcode we contributed to completing the repertoire of gene finding software, which will help to unearth hidden treasures of the RNA World.
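To make the length-cutoff problem mentioned in this abstract concrete, here is a minimal Python sketch of a naive open reading frame (ORF) scan with a minimum-length threshold. It is not how RNAcode works (RNAcode evaluates evolutionary signatures in multiple sequence alignments); the function name `find_orfs`, the forward-strand-only scan, the ATG-only start codon and the toy sequence are all assumptions for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_aa):
    """Return (start, end, length_aa) for forward-strand ORFs.

    An ORF runs from the first ATG in a frame to the next in-frame stop
    codon; length_aa counts codons including the ATG, excluding the stop.
    Only ORFs with at least min_aa codons are reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                length_aa = (i - start) // 3
                if length_aa >= min_aa:
                    orfs.append((start, i + 3, length_aa))
                start = None  # close the ORF and keep scanning
    return orfs


# Toy gene encoding a 30-residue peptide (hypothetical sequence).
small_gene = "ATG" + "GCT" * 29 + "TAA"
found_with_strict_cutoff = find_orfs(small_gene, min_aa=50)
found_with_relaxed_cutoff = find_orfs(small_gene, min_aa=10)
```

With a 50-codon cutoff the 30-codon toy gene is discarded, while a relaxed cutoff reports it. Lowering the cutoff on a real genome would also admit many spurious ORFs, which is the trade-off the abstract describes.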

    Annotation and evolution of bacterial ncRNA genes

    Successful pathogenic bacteria must alter gene expression in response to changing and hostile environments. Non-coding RNAs (ncRNAs) contribute to adaptability and pathogenicity by forming complex regulatory networks, and include riboswitches, cis-regulatory elements and sRNAs. Despite their important biological function, the annotation and discovery of ncRNAs is hindered by a lack of sequence conservation or other distinguishing sequence features. Studies of the evolutionary dynamics and origins of sRNA genes have been hindered by poor sequence conservation, which makes annotation via sequence homology challenging. The short length and relative simplicity of sRNA genes also make them interesting candidates for observing de novo gene formation from transcriptional noise, or exaptation from existing elements. We have used a pipeline based on profile hidden Markov models to study the conservation patterns of sRNA genes from Salmonella Typhimurium. Our results show that sRNAs are both rapidly acquired and exhibit rapid sequence turnover. We found that horizontal gene transfer is the main driver of sRNA acquisition in Salmonella, and identified Salmonella-specific sRNAs that appear to be derived from phage control systems and other mobile genetic elements, as well as Type I toxin-antitoxin systems. This method was then applied to study ncRNAs in Pseudomonas syringae pv. actinidiae (Psa), the causal agent of kiwifruit canker disease. We have generated transcriptomes of a pandemic strain of Psa in multiple growth conditions in vitro, analysed gene expression changes and identified novel non-coding transcripts. We then studied the expression and conservation of these candidate ncRNAs, and identified several with predicted secondary structure motifs characteristic of known functional ncRNAs. This thesis also includes a summary of two genome assembly projects of Gemmata and Legionella isolates, as part of larger collaborations.
All diagrams in this thesis are my own work, unless otherwise stated.

    Investigation of RNA-Protein Interactions in PRC2 Function

    Chromatin regulation contributes to control of gene expression and what identity a cell will adopt. In the last decade the role that RNA plays in chromatin regulation has become increasingly clear. RNA mediates protein recruitment and eviction from chromatin, forms nuclear condensates with proteins and DNA, and contributes to proper chromatin organization. Yet our knowledge of the mechanisms that govern RNA activity on chromatin lags significantly and limits our ability to understand nuclear function. To effectively answer some of the questions of RNA function in the nucleus we need a comprehensive atlas of RNA-protein interactions, which would enable generation of protein mutants defective in RNA binding. The goal of my thesis was to develop an unbiased method to profile RNA-binding proteins in the nucleus and apply it to Polycomb repressive complex 2 (PRC2). PRC2 is an epigenetic regulatory complex that deposits mono-, di- and tri-methyl lysine onto histone H3 (H3K27me3) and maintains gene silencing during development. PRC2 shows extensive contacts with RNA but their function remains unclear. In the first chapter, we present a novel method, dubbed RBR-ID, for the identification of RNA-protein interactions, which uses UV-crosslinking of photosensitive nucleotide analogs to proteins followed by high-resolution mass spectrometry (LC-MS/MS). We identified over 800 RNA-binding proteins, of which 427 were novel and enriched for chromatin-related functions. In the second chapter we adapted RBR-ID to study PRC2, identifying RNA-binding regions (RBRs) on every subunit of the complex. An RBR identified on EED fell near the regulatory center of PRC2, and we showed that RNA-mediated inhibition of PRC2 can be reversed by stimulatory peptides that bind in the regulatory center, reflecting the antagonistic relationship between RNA and PRC2. In the final chapter we present a testing method we developed for the SARS-CoV-2 virus.
Our method, COV-ID, uses reverse transcription and loop-mediated isothermal amplification (RT-LAMP) from patient saliva paired with high-throughput sequencing. Using this method we can detect as few as 5-10 SARS-CoV-2 virions/μL, and we successfully replicate classification of saliva samples (10/10) from clinical COVID-19 patients. We show that COV-ID can be multiplexed to detect influenza as well as SARS-CoV-2. Finally, we demonstrate that COV-ID can process saliva samples collected on filter paper with sensitivity as low as 50 virions/μL.

    Computational methods for RNA integrative biology

    Ribonucleic acid (RNA) is an essential molecule, which carries out a wide variety of functions within the cell, from its crucial involvement in protein synthesis to catalysing biochemical reactions and regulating gene expression. Such a diverse functional repertoire is owed to the complex structures that RNA can adopt and to its flexibility as an interacting molecule. It has become possible to experimentally measure these two crucial aspects of RNA's regulatory role with technological advancements such as next-generation sequencing (NGS). NGS methods can rapidly obtain the nucleotide sequence of many molecules in parallel. Designing experiments in which only the desired parts of the molecule (or specific parts of the transcriptome) are sequenced makes it possible to study various aspects of RNA biology. Analysis of NGS data is infeasible without computational methods. One such experimental method is RNA structure probing, which aims to infer RNA structure from sequencing chemically altered transcripts. RNA structure probing data is inherently noisy, affected both by technological biases and the stochasticity of the underlying process. Most existing methods do not adequately address the issue of noise, resorting to heuristics and limiting the informativeness of their output. In this thesis, a statistical pipeline was developed for modelling RNA structure probing data, which explicitly captures biological variability, provides automated bias-correcting strategies, and generates a probabilistic output based on experimental measurements. The output of our method agrees with known RNA structures, can be used to constrain structure prediction algorithms, and remains robust to reduced sequence coverage, thereby increasing sensitivity of the technology. Another recent experimental innovation maps RNA-protein interactions at very high temporal resolution, making it possible to study rapid binding events happening on a minute time scale.
In this thesis, a non-parametric algorithm was developed for identifying significant changes in RNA-protein binding time-series between different conditions. The method was applied to novel yeast RNA-protein binding time-course data to study the role of RNA degradation in stress response. It revealed pervasive changes in the binding to the transcriptome of the yeast transcription termination factor Nab3 and the cytoplasmic exoribonuclease Xrn1 under nutrient stress. This challenged the common assumption that transcriptional changes are the major driver of changes in RNA expression during stress and highlighted the importance of degradation. These findings inspired a dynamical model for RNA expression, where transcription and degradation rates are modelled using RNA-protein binding time-series data.

    Investigation into the Potential Application of Microbial Enhanced Oil Recovery on Unconventional Oil: A Field Specific Approach

    A substantial amount of the world’s recoverable oil reserves comprise unconventional resources. However, great difficulty has been encountered in recovering oil lower than 22° API. Therefore, advanced methods of enhanced oil recovery (EOR) such as microbial enhanced oil recovery (MEOR) have been employed to increase the amount of recovered residual oil. MEOR involves the use of bacteria and their metabolic products to alter the oil properties or rock permeability within a reservoir in order to promote the flow of oil. Although MEOR has been trialled in the past with mixed outcomes, its feasibility on heavier oils has not been fully demonstrated. The aim of this study was to show that MEOR can be successfully applied to unconventional oil fields to increase oil production. Using both genomic and microbiologically applied petroleum engineering techniques, it was possible to target and isolate key indigenous microorganisms with MEOR potential from the reservoir of interest. In this study we have identified an indigenous microorganism (Bacillus licheniformis Bi10) that was capable of enhancing heavy oil recovery. This strain was applied to field-specific microcosms and the effect of this microorganism was compared to variant inoculate, showing improved recovery beyond levels shown by previous MEOR-related bacteria (Additional Oil Recovery: 11.8%). Furthermore, we also confirmed that the use of the biosurfactant lichenysin alone was not as effective in MEOR as viable cell treatment, and hypothesized that a dual mechanism of action may be taking place within the microcosm, combining bio-plugging and wettability alteration.
The interfacial tension of biosurfactant produced by the Bi10 isolate also showed a substantial decrease in wettability calculations, to < 5 mN m⁻¹, lower than has been shown for any other bacterial surfactant in heavy oil environments. Comparative genomics also revealed key genetic variations between this and similar MEOR strains that could hold the key to its increased potential for future MEOR strategies. The results presented in this thesis were part of an ERDF project, involving academic and industrial partner, BiSN Laboratory Services, on fundamental and applied aspects of microbial enhanced oil recovery in heavy oilfield environments, which was funded to improve the understanding of MEOR and its processes in these unconventional oil environments.