144 research outputs found

    From RNA folding to inverse folding: a computational study: Folding and design of RNA molecules

    Since the discovery of the structure of DNA in the early 1950s and its double-chained complement of information hinting at its means of replication, biologists have recognized the strong connection between molecular structure and function. In the past two decades, there has been a surge of research on an ever-growing class of RNA molecules that are non-coding but whose various folded structures allow a diverse array of vital functions. From the well-known splicing and modification of ribosomal RNA, non-coding RNAs (ncRNAs) are now known to be intimately involved in possibly every stage of DNA transcription and protein translation, as well as RNA signalling and gene regulation processes. Despite the rapid development and declining cost of modern molecular methods, they typically can only describe the structural conformations of ncRNAs in vitro, which differ from their in vivo counterparts. Moreover, it is estimated that only a tiny fraction of known ncRNAs has been documented experimentally, often at a high cost. There is thus a growing realization that computational methods must play a central role in the analysis of ncRNAs. Not only do computational approaches hold the promise of rapidly characterizing many ncRNAs yet to be described, but there is also the hope that by understanding the rules that determine their structure, we will gain better insight into their function and design. Many studies have revealed that ncRNA functions are performed by high-level structures that often depend on their low-level structures, such as the secondary structure. This thesis studies the computational folding mechanism and inverse folding of ncRNAs at the secondary level. In this thesis, we describe the development of two bioinformatic tools that have the potential to improve our understanding of RNA secondary structure. 
These tools are as follows: (1) RAFFT, for efficient prediction of pseudoknot-free RNA folding pathways using the fast Fourier transform (FFT); (2) aRNAque, an evolutionary algorithm inspired by Lévy flights for RNA inverse folding with or without pseudoknots (a secondary structure motif that often poses difficulties for bio-computational detection). The first tool, RAFFT, implements a novel heuristic to predict RNA secondary structure formation pathways that has two components: (i) a folding algorithm and (ii) a kinetic ansatz. When considering the best prediction in the ensemble of 50 secondary structures predicted by RAFFT, its performance matches the recent deep-learning-based structure prediction methods. RAFFT also acts as a folding kinetic ansatz, which we tested on two RNAs: the CFSE and a classic bi-stable sequence. In both test cases, fewer structures were required to reproduce the full kinetics, whereas known methods (such as Treekin) required a sample of 20,000 structures or more. The second tool, aRNAque, implements an evolutionary algorithm (EA) inspired by the Lévy flight, which allows both local and global search and supports pseudoknotted target structures. The number of point mutations at every step of aRNAque's EA is drawn from a Zipf distribution. Therefore, our proposed method increases the diversity of designed RNA sequences and reduces the average number of evaluations of the evolutionary algorithm. The overall performance showed improved empirical results compared to existing tools through intensive benchmarks on both pseudoknotted and pseudoknot-free datasets. In conclusion, we highlight some promising extensions of the versatile RAFFT method to RNA-RNA interaction studies. We also provide an outlook on both tools' implications in studying evolutionary dynamics.
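
The Zipf-distributed mutation step described in the abstract above can be illustrated with a minimal sketch. It is not taken from aRNAque: the function names, the exponent value of 1.5, and the truncation of the distribution at the sequence length are all assumptions made for illustration. Most draws give one or two mutations (local search), while occasional large draws give long jumps (global search).

```python
import random

def zipf_mutation_count(seq_len, exponent=1.5, rng=random):
    """Draw k in 1..seq_len with probability proportional to k**-exponent.

    The exponent and the truncation at seq_len are illustrative assumptions.
    """
    ks = range(1, seq_len + 1)
    weights = [k ** -exponent for k in ks]
    return rng.choices(ks, weights=weights, k=1)[0]

def mutate(sequence, exponent=1.5, rng=random):
    """Apply a Zipf-distributed number of point mutations to an RNA sequence.

    Returns the mutated sequence and the number of mutations applied.
    """
    n_mut = zipf_mutation_count(len(sequence), exponent, rng)
    # Distinct positions, so the Hamming distance to the parent equals n_mut.
    positions = rng.sample(range(len(sequence)), n_mut)
    bases = list(sequence)
    for pos in positions:
        # Always switch to a different nucleotide.
        bases[pos] = rng.choice([b for b in "ACGU" if b != bases[pos]])
    return "".join(bases), n_mut
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.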

    The 3′ Splice Site of Influenza A Segment 7 mRNA Can Exist in Two Conformations: A Pseudoknot and a Hairpin

    The 3′ splice site of influenza A segment 7 is used to produce mRNA for the M2 ion-channel protein, which is critical to the formation of viable influenza virions. Native gel analysis, enzymatic/chemical structure probing, and oligonucleotide binding studies of a 63 nt fragment, containing the 3′ splice site, key residues of an SF2/ASF splicing factor binding site, and a polypyrimidine tract, provide evidence for an equilibrium between pseudoknot and hairpin structures. This equilibrium is sensitive to multivalent cations, and can be forced towards the pseudoknot by addition of 5 mM cobalt hexammine. In the two conformations, the splice site and other functional elements exist in very different structural environments. In particular, the splice site is sequestered in the middle of a double helix in the pseudoknot conformation, while in the hairpin it resides in a two-by-two nucleotide internal loop. The results suggest that segment 7 mRNA splicing can be controlled by a conformational switch that exposes or hides the splice site.

    Geometric combinatorics and computational molecular biology: branching polytopes for RNA sequences

    Questions in computational molecular biology generate various discrete optimization problems, such as DNA sequence alignment and RNA secondary structure prediction. However, the optimal solutions are fundamentally dependent on the parameters used in the objective functions. The goal of a parametric analysis is to elucidate such dependencies, especially as they pertain to the accuracy and robustness of the optimal solutions. Techniques from geometric combinatorics, including polytopes and their normal fans, have been used previously to give parametric analyses of simple models for DNA sequence alignment and RNA branching configurations. Here, we present a new computational framework, and proof-of-principle results, which give the first complete parametric analysis of the branching portion of the nearest neighbor thermodynamic model for secondary structure prediction for real RNA sequences. Comment: 17 pages, 8 figures

    Graphical methods in RNA structure matching

    Eukaryotic genomes are pervasively transcribed; almost every base can be found in an RNA transcript. This is a surprising observation since most of the genome does not encode proteins. This RNA must serve an important regulatory function – important because producing non-coding RNA is an energy intensive process, and in the absence of strong selection one would expect it to disappear. RNA families with common functions have specifically conserved structural motifs, which are directly related to the functional roles of RNA in catalysis and regulation. Because the conserved structures depend on base-pairing, similar RNA structures may have little or no detectable sequence similarity, making the identification of conserved RNAs difficult. This is a particularly serious problem when studying regulatory structures in RNA. In many cases, such as that of cellular internal ribosome entry sites, although we can identify RNAs that have similar regulatory responses, it is difficult to tell whether the RNAs have common structural features using current methods. Available tools for identifying common structures based on RNA sequence suffer from one or more of the following problems: they do not consider pseudoknots, which are important in many catalytic and regulatory structures; they do not consider near minimum free energy structures, which is important as many RNAs exist as an ensemble of structures of nearly equal energy; they require many examples of known structures in order to train a computational model; they require impractical amounts of computational time, precluding their use on long sequences or genomic scale; or they use a similarity function that cannot identify RNAs as having similar structure, even when they are from one of the well characterized known classes. The approach presented here has the potential to address all of these issues, allowing novel RNA structures that are shared between RNAs with little or no sequence similarity to be discovered. 
This provides a powerful tool to investigate and explain the pervasive transcription observed in eukaryotic genomes.

    Computing the Partition Function for Kinetically Trapped RNA Secondary Structures

    An RNA secondary structure is locally optimal if there is no lower energy structure that can be obtained by the addition or removal of a single base pair, where energy is defined according to the widely accepted Turner nearest neighbor model. Locally optimal structures form kinetic traps, since any evolution away from a locally optimal structure must involve energetically unfavorable folding steps. Here, we present a novel, efficient algorithm to compute the partition function over all locally optimal secondary structures of a given RNA sequence. Our software, RNAlocopt, runs in time and space. Additionally, RNAlocopt samples a user-specified number of structures from the Boltzmann subensemble of all locally optimal structures. We apply RNAlocopt to show that (1) the number of locally optimal structures is far fewer than the total number of structures – indeed, the number of locally optimal structures is approximately equal to the square root of the number of all structures, (2) the structural diversity of this subensemble may be either similar to or quite different from the structural diversity of the entire Boltzmann ensemble, a situation that depends on the type of input RNA, (3) the (modified) maximum expected accuracy structure, computed by taking into account base pairing frequencies of locally optimal structures, is a more accurate prediction of the native structure than other current thermodynamics-based methods. The software RNAlocopt constitutes a technical breakthrough in our study of the folding landscape for RNA secondary structures. For the first time, locally optimal structures (kinetic traps in the Turner energy model) can be rapidly generated for long RNA sequences, which was previously impossible with methods that involved exhaustive enumeration. Use of locally optimal structures leads to state-of-the-art secondary structure prediction, as benchmarked against methods involving the computation of minimum free energy and of maximum expected accuracy. 
Web server and source code available at http://bioinformatics.bc.edu/clotelab/RNAlocopt/
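
The definition of local optimality given in the abstract above can be made concrete with a small sketch. This is not RNAlocopt's algorithm: it simply enumerates the one-base-pair neighbours of a structure and compares energies. The Turner nearest neighbor model is replaced here by a toy energy of -1 per base pair, and the function names, the minimum hairpin loop of 3 unpaired bases, and the restriction to pseudoknot-free structures are all assumptions for illustration.

```python
from itertools import combinations

# Watson-Crick and GU wobble pairs.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def toy_energy(pairs):
    """Stand-in for the Turner model: -1 per base pair (an assumption)."""
    return -len(pairs)

def neighbors(seq, pairs, min_loop=3):
    """Yield every structure that differs from `pairs` by exactly one base pair."""
    pairs = frozenset(pairs)
    # Removal of one existing pair.
    for p in pairs:
        yield pairs - {p}
    # Addition of one new pair between unpaired, complementary positions.
    paired = {i for p in pairs for i in p}
    for i, j in combinations(range(len(seq)), 2):
        if i in paired or j in paired:
            continue
        if j - i <= min_loop or (seq[i], seq[j]) not in CANONICAL:
            continue
        # Reject crossing pairs: exactly one endpoint of (k, l) inside (i, j).
        if any((i < k < j) != (i < l < j) for k, l in pairs):
            continue
        yield pairs | {(i, j)}

def is_locally_optimal(seq, pairs, energy=toy_energy):
    """True if no single-pair addition or removal lowers the energy."""
    e0 = energy(frozenset(pairs))
    return all(energy(nb) >= e0 for nb in neighbors(seq, pairs))
```

With the toy energy, a structure is locally optimal exactly when no further base pair can be added. Any other energy function that takes a set of `(i, j)` pairs can be passed as `energy`.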

    Analysis of Genomic and Proteomic Sequences using DSP Techniques

    Analysis of biological sequences by detecting the hidden periodicities and symbolic patterns has been an active area of research for a couple of decades. The hidden periodic components and the patterns help locate biologically relevant motifs such as protein coding regions (exons), CpG islands (CGI) and hot-spots that characterize various biological functions. The discrete nature of biological sequences has prompted many researchers to use digital signal processing (DSP) techniques for their analysis. After mapping the biological sequences to numerical sequences, various DSP techniques using digital filters, wavelets, neural networks, filter banks etc. have been developed to detect the hidden periodicities and recurring patterns in these sequences. This thesis attempts to develop effective DSP-based techniques to solve some of the important problems in biological sequence analysis. Specifically, DSP techniques such as statistically optimal null filters (SONF), matched filters and neural-network-based algorithms are developed for the analysis of deoxyribonucleic acid (DNA), ribonucleic acid (RNA) and protein sequences. In the first part of this study, DNA sequences are investigated in order to identify the locations of CGIs and protein coding regions, i.e., exons. SONFs, which are known for their ability to efficiently estimate short-duration signals embedded in noise by combining the maximum signal-to-noise ratio and the least squares optimization criteria, are utilized to solve these problems. Basis sequences characterizing CGIs and exons are formulated to be used in the SONF technique for solving the problems. In the second part of this study, RNA sequences are analyzed to predict their secondary structures. For this purpose, matched filters based on 2-dimensional convolution are developed to identify the locations of stem and loop patterns in the RNA secondary structure. 
The knowledge of the stem and loop patterns thus obtained is then used to predict the presence of pseudoknots, leading to the determination of the entire RNA secondary structure. Finally, in the third part of this thesis, protein sequences are analyzed to solve the problems of predicting protein secondary structure and identifying the locations of hot-spots. For predicting the protein secondary structure, a two-stage neural network scheme is developed, whereas for predicting the locations of hot-spots an SONF-based approach is proposed. Hot-spots in proteins exhibit a characteristic frequency corresponding to their biological function. A basis function is formulated based on this characteristic frequency to be used in SONFs to detect the locations of hot-spots belonging to the corresponding functional group. Extensive experiments are performed throughout the thesis to demonstrate the effectiveness and validity of the various schemes and techniques developed in this investigation. The performance of the proposed techniques is compared with that of the previously reported techniques for the analysis of biological sequences. For this purpose, the results obtained are validated using databases with known annotations. It is shown that the proposed schemes result in performance superior to that of some of the existing techniques.
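
The general idea in the abstract above of mapping a biological sequence to numerical sequences and then looking for hidden periodicities can be sketched as follows. This is not the SONF method of the thesis: it swaps in a plain discrete Fourier coefficient evaluated at one period, applied to binary indicator sequences (one per nucleotide). The function names and the normalisation by sequence length are assumptions for illustration.

```python
import cmath

def indicator_sequences(dna):
    """Map a DNA string to four binary sequences, one per nucleotide."""
    return {b: [1.0 if c == b else 0.0 for c in dna] for b in "ACGT"}

def spectral_power(dna, period):
    """Summed power of the four indicator sequences at the given period.

    Evaluates one Fourier coefficient per indicator sequence at frequency
    1/period and normalises the summed squared magnitudes by sequence length.
    """
    n = len(dna)
    total = 0.0
    for x in indicator_sequences(dna).values():
        coeff = sum(
            v * cmath.exp(-2j * cmath.pi * t / period) for t, v in enumerate(x)
        )
        total += abs(coeff) ** 2
    return total / n
```

A sequence with a strong three-base repeat, such as `"ATG" * 20`, gives a much larger value of `spectral_power(seq, 3)` than of `spectral_power(seq, 4)`. Peaks at period 3 are the classic signal used to flag candidate protein coding regions.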