61 research outputs found

    A linear memory algorithm for Baum-Welch training

    Background: Baum-Welch training is an expectation-maximisation algorithm for training the emission and transition probabilities of hidden Markov models in a fully automated way. Methods and results: We introduce a linear space algorithm for Baum-Welch training. For a hidden Markov model with M states, T free transition and E free emission parameters, and an input sequence of length L, our new algorithm requires O(M) memory and O(L M T_max (T + E)) time for one Baum-Welch iteration, where T_max is the maximum number of states that any state is connected to. The most memory-efficient algorithm until now was the checkpointing algorithm, with O(log(L) M) memory and O(log(L) L M T_max) time requirements. Our novel algorithm thus renders the memory requirement completely independent of the length of the training sequences. More generally, for an n-hidden Markov model and n input sequences of length L, the memory requirement of O(log(L) L^(n-1) M) is reduced to O(L^(n-1) M), while the running time changes from O(log(L) L^n M T_max + L^n (T + E)) to O(L^n M T_max (T + E)). Conclusions: For the large class of hidden Markov models used, for example, in gene prediction, whose number of states does not scale with the length of the input sequence, our novel algorithm can thus be both faster and more memory-efficient than any of the existing algorithms. Comment: 14 pages, 1 figure; version 2: fixed some errors, final version of paper
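
The O(M) memory claim can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name, the argument layout (`init`, `trans`, `emit` as plain nested lists) and the absence of an explicit end state are assumptions made for illustration. The sketch computes the expected number of uses of a single transition i -> j, the quantity a Baum-Welch update needs, during one forward sweep that keeps only two vectors of length M. Repeating the sweep for every free parameter is what would give a running time proportional to the number of parameters.

```python
def expected_transition_count(init, trans, emit, obs, i, j):
    """Expected number of i -> j transitions given obs, using O(M) memory.

    init[m]     : probability of starting in state m (assumed layout)
    trans[n][m] : probability of moving from state n to state m
    emit[m][x]  : probability that state m emits symbol x
    obs         : non-empty list of observed symbol indices
    """
    M = len(init)
    # f[m]: forward probability of the prefix read so far, ending in state m.
    f = [init[m] * emit[m][obs[0]] for m in range(M)]
    # t[m]: sum over state paths ending in m of
    #       P(path, prefix) * (number of i -> j transitions on that path).
    t = [0.0] * M
    for x in obs[1:]:
        new_f = [emit[m][x] * sum(f[n] * trans[n][m] for n in range(M))
                 for m in range(M)]
        # Extend every path by one step, carrying its count along ...
        new_t = [emit[m][x] * sum(t[n] * trans[n][m] for n in range(M))
                 for m in range(M)]
        # ... and add one count for paths that take i -> j in this step.
        new_t[j] += f[i] * trans[i][j] * emit[j][x]
        f, t = new_f, new_t  # only the current column is kept
    # Divide by the sequence likelihood to obtain the posterior expectation.
    return sum(t) / sum(f)
```

Because only the current column of `f` and `t` is stored, memory use does not grow with the sequence length. The sketch works with unscaled probabilities, so a practical version would need scaling or log-space arithmetic to avoid underflow on long sequences.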

    SimulFold: Simultaneously Inferring RNA Structures Including Pseudoknots, Alignments, and Trees Using a Bayesian MCMC Framework

    Computational methods for predicting evolutionarily conserved rather than thermodynamic RNA structures have recently attracted increased interest. These methods are indispensable not only for elucidating the regulatory roles of known RNA transcripts, but also for predicting RNA genes. It has been notoriously difficult to devise them to make the best use of the available data and to predict high-quality RNA structures that may also contain pseudoknots. We introduce a novel theoretical framework for co-estimating an RNA secondary structure including pseudoknots, a multiple sequence alignment, and an evolutionary tree, given several RNA input sequences. We also present an implementation of the framework in a new computer program, called SimulFold, which employs a Bayesian Markov chain Monte Carlo method to sample from the joint posterior distribution of RNA structures, alignments, and trees. We use the new framework to predict RNA structures, and comprehensively evaluate the quality of our predictions by comparing our results to those of several other programs. We also present preliminary data that show SimulFold's potential as an alignment and phylogeny prediction method. SimulFold overcomes many conceptual limitations that current RNA structure prediction methods face, introduces several new theoretical techniques, and generates high-quality predictions of conserved RNA structures that may include pseudoknots. It is thus likely to have a strong impact, both on the field of RNA structure prediction and on a wide range of data analyses

    ShapeSorter: a fully probabilistic method for detecting conserved RNA structure features supported by SHAPE evidence

    There is increasing interest in the determination of RNA structures in vivo, as it is now possible to probe them in a high-throughput manner, e.g. using SHAPE protocols. By now, there exists a range of computational methods that integrate experimental SHAPE-probing evidence into computational RNA secondary structure prediction. The state of the art in this field is currently provided by computational methods that employ the minimum-free energy strategy for predicting RNA secondary structures with SHAPE-probing evidence. These methods, however, rely on the assumption that transcripts in vivo fold into the thermodynamically most stable configuration and ignore evolutionary evidence for conserved RNA structure features. We here present a new computational method, ShapeSorter, that predicts RNA structure features without employing the thermodynamic strategy. Instead, ShapeSorter employs a fully probabilistic framework to identify RNA structure features that are supported by evolutionary and SHAPE-probing evidence. Our method can capture RNA structure heterogeneity, pseudo-knotted RNA structures as well as transient and mutually exclusive RNA structure features. Moreover, it estimates P-values for the predicted RNA structure features, which allows for easy filtering and ranking. We investigate the merits of our method in a comprehensive performance benchmark and conclude that ShapeSorter performs significantly better at predicting base pairs than the existing state-of-the-art methods

    Statistical evidence for conserved, local secondary structure in the coding regions of eukaryotic mRNAs and pre-mRNAs

    Owing to the degeneracy of the genetic code, protein-coding regions of mRNA sequences can harbour more than only amino acid information. We search the mRNA sequences of 11 human protein-coding genes for evolutionarily conserved secondary structure elements using RNA-Decoder, a comparative secondary structure prediction program that is capable of explicitly taking the known protein-coding context of the mRNA sequences into account. We detect well-defined, conserved RNA secondary structure elements in the coding regions of the mRNA sequences and show that base-paired codons strongly correlate with sparse codons. We also investigate the role of repetitive elements in the formation of secondary structure and explain the use of alternate start codons in the caveolin-1 gene by a conserved secondary structure element overlapping the nominal start codon. We discuss the functional roles of our novel findings in regulating gene expression at the mRNA level. We also investigate the role of secondary structure in the correct splicing of the human CFTR gene. We study the wild-type version of the pre-mRNA as well as 29 variants with synonymous mutations in exon 12. By comparing our predicted secondary structures to the experimentally determined splicing efficiencies, we find with weak statistical significance that pre-mRNAs with high splicing efficiencies have different predicted secondary structures from pre-mRNAs with low splicing efficiencies

    CYCLeR—a novel tool for the full isoform assembly and quantification of circRNAs

    Splicing is one key mechanism determining the state of any eukaryotic cell. Apart from linear splice variants, circular splice variants (circRNAs) can arise via non-canonical splicing involving a back-splice junction (BSJ). Most existing methods only identify circRNAs via the corresponding BSJ, but do not aim to estimate their full sequence identity or to identify different, alternatively spliced circular isoforms arising from the same BSJ. We here present CYCLeR, the first computational method for identifying the full sequence identity of new and alternatively spliced circRNAs and their abundances while simultaneously co-estimating the abundances of known linear splicing isoforms. We show that CYCLeR significantly outperforms existing methods in terms of F score and quantification of transcripts in simulated data. In a comparative study with long-read data, we also show the advantages of CYCLeR compared to existing methods. When analysing Drosophila melanogaster data, CYCLeR uncovers biological patterns of circRNA expression that other methods fail to observe

    e-RNA: a collection of web-servers for the prediction and visualisation of RNA secondary structure and their functional features

    e-RNA is a collection of web-servers for the prediction and visualisation of RNA secondary structures and their functional features, including in particular RNA–RNA interactions. In this updated version, we have added novel tools for RNA secondary structure prediction and have significantly updated the visualisation functionality. The new method CoBold can identify transient RNA structure features and their potential functional effects on a known RNA structure during co-transcriptional structure formation. The new tool ShapeSorter can predict evolutionarily conserved RNA secondary structure features while simultaneously taking experimental SHAPE probing evidence into account. The web-server R-Chie, which visualises RNA secondary structure information in terms of arc diagrams, can now also be used to visualise and intuitively compare RNA–RNA, RNA–DNA and DNA–DNA interactions alongside multiple sequence alignments and quantitative information. The prediction generated by any method in e-RNA can be readily visualised on the web-server. For completed tasks, users can download their results and readily visualise them later on with R-Chie without having to re-run the predictions. e-RNA can be found at http://www.e-rna.org

    HMMConverter 1.0: a toolbox for hidden Markov models

    Hidden Markov models (HMMs) and their variants are widely used in Bioinformatics applications that analyze and compare biological sequences. Designing a novel application requires the insight of a human expert to define the model's architecture. The implementation of prediction algorithms and algorithms to train the model's parameters, however, can be a time-consuming and error-prone task. We here present HMMConverter, a software package for setting up probabilistic HMMs, pair-HMMs as well as generalized HMMs and pair-HMMs. The user defines the model itself and the algorithms to be used via an XML file which is then directly translated into efficient C++ code. The software package provides linear-memory prediction algorithms, such as the Hirschberg algorithm, banding and the integration of prior probabilities and is the first to present computationally efficient linear-memory algorithms for automatic parameter training. Users of HMMConverter can thus set up complex applications with a minimum of effort and also perform parameter training and data analyses for large data sets

    R-chie: a web server and R package for visualizing RNA secondary structures

    Visually examining RNA structures can greatly aid in understanding their potential functional roles and in evaluating the performance of structure prediction algorithms. As many functional roles of RNA structures can already be studied given the secondary structure of the RNA, various methods have been devised for visualizing RNA secondary structures. Most of these methods depict a given RNA secondary structure as a planar graph consisting of base-paired stems interconnected by roundish loops. In this article, we present an alternative method of depicting RNA secondary structure as arc diagrams. This is well suited for structures that are difficult or impossible to represent as planar stem-loop diagrams. Arc diagrams can intuitively display pseudo-knotted structures, as well as transient and alternative structural features. In addition, they facilitate the comparison of known and predicted RNA secondary structures. An added benefit is that structure information can be displayed in conjunction with a corresponding multiple sequence alignment, thereby highlighting structure and primary sequence conservation and variation. We have implemented the visualization algorithm as a web server R-chie as well as a corresponding R package called R4RNA, which allows users to run the software locally and across a range of common operating systems
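
The geometry behind an arc diagram can be sketched in a few lines of Python. This is an illustrative sketch only, not the R-chie or R4RNA implementation: the function names and the choice of extended dot-bracket notation as input are assumptions. Each base pair (i, j) becomes a semicircle over the sequence axis, centred at (i + j) / 2 with radius (j - i) / 2. Pseudoknots appear as crossing arcs, which is why they pose no problem for this representation.

```python
def base_pairs(dot_bracket):
    """Return sorted 0-based (i, j) base pairs from extended dot-bracket notation."""
    closers = {")": "(", "]": "[", "}": "{", ">": "<"}
    # One stack per bracket type, so different types may interleave (pseudoknots).
    stacks = {opener: [] for opener in closers.values()}
    pairs = []
    for pos, symbol in enumerate(dot_bracket):
        if symbol in stacks:
            stacks[symbol].append(pos)
        elif symbol in closers:
            stack = stacks[closers[symbol]]
            if not stack:
                raise ValueError(f"unmatched {symbol!r} at position {pos}")
            pairs.append((stack.pop(), pos))
    for opener, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched {opener!r} at position {stack[-1]}")
    return sorted(pairs)


def arcs(pairs):
    """Return (centre, radius) of the semicircle drawn for each base pair."""
    return [((i + j) / 2, (j - i) / 2) for i, j in pairs]


def crossing_pairs(pairs):
    """Return pairs of base pairs whose arcs cross (pseudoknot signature)."""
    ordered = sorted(pairs)
    return [(p, q)
            for a, p in enumerate(ordered)
            for q in ordered[a + 1:]
            if p[0] < q[0] < p[1] < q[1]]
```

The `(centre, radius)` values could be passed to any plotting library to draw the semicircles. Nested helices give concentric arcs, while a non-empty result from `crossing_pairs` indicates a pseudo-knotted structure.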

    The International Virus Bioinformatics Meeting 2020.

    The International Virus Bioinformatics Meeting 2020 was originally planned to take place in Bern, Switzerland, in March 2020. However, the COVID-19 pandemic put a spoke in the wheel of almost all conferences to be held in 2020. After moving the conference to 8-9 October 2020, we got hit by the second wave and finally decided at short notice to go fully online. On the other hand, the pandemic has made us even more aware of the importance of accelerating research in viral bioinformatics. Advances in bioinformatics have led to improved approaches to investigate viral infections and outbreaks. The International Virus Bioinformatics Meeting 2020 attracted approximately 120 experts in virology and bioinformatics from all over the world to join the two-day virtual meeting. Despite concerns being raised that virtual meetings lack possibilities for face-to-face discussion, the participants from this small community created a highly interactive scientific environment, engaging in lively and inspiring discussions and suggesting new research directions and questions. The meeting featured five invited and twelve contributed talks on four main topics: (1) proteome and RNAome of RNA viruses, (2) viral metagenomics and ecology, (3) virus evolution and classification and (4) viral infections and immunology. Further, the meeting featured 20 oral poster presentations, all of which focused on specific areas of virus bioinformatics. This report summarizes the main research findings and highlights presented at the meeting

    Reciprocal regulation of glycine-rich RNA-binding proteins via an interlocked feedback loop coupling alternative splicing to nonsense-mediated decay in Arabidopsis

    The Arabidopsis RNA-binding protein AtGRP8 undergoes negative autoregulation at the post-transcriptional level. An elevated AtGRP8 protein level promotes the use of a cryptic 5′ splice site to generate an alternatively spliced transcript, as_AtGRP8, retaining the 5′ half of the intron with a premature termination codon. In mutants defective in nonsense-mediated decay (NMD) abundance of as_AtGRP8 but not its pre-mRNA is elevated, indicating that as_AtGRP8 is a direct NMD target, thus limiting the production of functional AtGRP8 protein. In addition to its own pre-mRNA, AtGRP8 negatively regulates the AtGRP7 transcript through promoting the formation of the equivalent alternatively spliced as_AtGRP7 transcript, leading to a decrease in AtGRP7 abundance. Recombinant AtGRP8 binds to its own and the AtGRP7 pre-mRNA, suggesting that this interaction is relevant for the splicing decision in vivo. AtGRP7 itself is part of a negative autoregulatory circuit that influences circadian oscillations of its own and the AtGRP8 transcript through alternative splicing linked to NMD. Thus, we identify an interlocked feedback loop through which two RNA-binding proteins autoregulate and reciprocally crossregulate by coupling unproductive splicing to NMD. A high degree of evolutionary sequence conservation in the introns retained in as_AtGRP8 or as_AtGRP7 points to an important function of these sequences