7 research outputs found

    Asymmetry in RNA pseudoknots: observation and theory

    RNA can fold into a topological structure called a pseudoknot, composed of non-nested double-stranded stems connected by single-stranded loops. Our examination of the PseudoBase database of pseudoknotted RNA structures reveals asymmetries in the stem and loop lengths and provocative composition differences between the loops. By taking into account differences between major and minor grooves of the RNA double helix, we explain much of the asymmetry with a simple polymer physics model and statistical mechanical theory, with only one adjustable parameter.
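To make the "non-nested stems" definition above concrete, the following is a minimal sketch (not taken from the paper) that checks whether a set of base pairs contains a crossing, which is the usual combinatorial signature of a pseudoknot. The function name `has_pseudoknot` and the example pair lists are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def has_pseudoknot(pairs):
    """Return True if any two base pairs cross, i.e. i < k < j < l."""
    # Normalize each pair to (smaller index, larger index) and sort by opening position.
    norm = sorted((min(a, b), max(a, b)) for a, b in pairs)
    for (i, j), (k, l) in combinations(norm, 2):
        # After sorting, i <= k, so a crossing means k opens inside (i, j) and l closes outside it.
        if i < k < j < l:
            return True
    return False


# A simple hairpin stem: every pair is nested inside the previous one.
nested_stem = [(0, 9), (1, 8), (2, 7)]

# Two stems whose pairs interleave (illustrative positions only).
crossing_stems = [(0, 12), (1, 11), (2, 10), (6, 18), (7, 17), (8, 16)]

print(has_pseudoknot(nested_stem))     # nested pairs only
print(has_pseudoknot(crossing_stems))  # interleaved stems
```

The quadratic pairwise check is the simplest correct formulation; it is adequate for short structures such as those catalogued in pseudoknot databases.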

    From RNA folding to inverse folding: a computational study: Folding and design of RNA molecules

    Since the discovery of the structure of DNA in the early 1950s and its double-chained complement of information hinting at its means of replication, biologists have recognized the strong connection between molecular structure and function. In the past two decades, there has been a surge of research on an ever-growing class of RNA molecules that are non-coding but whose various folded structures allow a diverse array of vital functions. From the well-known splicing and modification of ribosomal RNA, non-coding RNAs (ncRNAs) are now known to be intimately involved in possibly every stage of DNA transcription and protein translation, as well as RNA signalling and gene regulation processes. Despite the rapid development and declining cost of modern molecular methods, they typically can only describe the structural conformations of ncRNAs in vitro, which differ from their in vivo counterparts. Moreover, it is estimated that only a tiny fraction of known ncRNAs has been documented experimentally, often at a high cost. There is thus a growing realization that computational methods must play a central role in the analysis of ncRNAs. Not only do computational approaches hold the promise of rapidly characterizing many ncRNAs yet to be described, but there is also the hope that by understanding the rules that determine their structure, we will gain better insight into their function and design. Many studies have revealed that ncRNA functions are performed by high-level structures that often depend on their low-level structures, such as the secondary structure. This thesis studies the computational folding mechanism and inverse folding of ncRNAs at the secondary level. In this thesis, we describe the development of two bioinformatic tools that have the potential to improve our understanding of RNA secondary structure. 
These tools are as follows: (1) RAFFT, for efficient prediction of pseudoknot-free RNA folding pathways using the fast Fourier transform (FFT); (2) aRNAque, an evolutionary algorithm inspired by Lévy flights for RNA inverse folding with or without pseudoknots (a secondary structure motif that often poses difficulties for bio-computational detection). The first tool, RAFFT, implements a novel heuristic to predict RNA secondary structure formation pathways that has two components: (i) a folding algorithm and (ii) a kinetic ansatz. When considering the best prediction in the ensemble of 50 secondary structures predicted by RAFFT, its performance matches the recent deep-learning-based structure prediction methods. RAFFT also acts as a folding kinetic ansatz, which we tested on two RNAs: the CFSE and a classic bi-stable sequence. In both test cases, fewer structures were required to reproduce the full kinetics, whereas known methods (such as Treekin) required a sample of 20,000 structures or more. The second tool, aRNAque, implements an evolutionary algorithm (EA) inspired by the Lévy flight, which allows both local and global search and supports pseudoknotted target structures. The number of point mutations at every step of aRNAque's EA is drawn from a Zipf distribution. As a result, our proposed method increases the diversity of designed RNA sequences and reduces the average number of evaluations of the evolutionary algorithm. Intensive benchmarks on both pseudoknotted and pseudoknot-free datasets showed improved empirical performance compared to existing tools. In conclusion, we highlight some promising extensions of the versatile RAFFT method to RNA-RNA interaction studies. We also provide an outlook on both tools' implications in studying evolutionary dynamics.
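The Zipf-distributed mutation step mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not aRNAque's actual implementation: the truncation of the Zipf distribution to the sequence length, the exponent value `1.5`, and the function names `zipf_mutation_count` and `mutate` are all hypothetical. Most draws yield one or two mutations (local search), while occasional large draws change many positions at once (global jumps), which is the Lévy-flight-like behavior.

```python
import random

BASES = "ACGU"


def zipf_mutation_count(length, exponent=1.5, rng=random):
    """Draw k in 1..length with probability proportional to k ** (-exponent)."""
    ks = list(range(1, length + 1))
    weights = [k ** -exponent for k in ks]
    return rng.choices(ks, weights=weights, k=1)[0]


def mutate(sequence, exponent=1.5, rng=random):
    """Apply a Zipf-distributed number of point mutations at distinct positions."""
    n_mutations = zipf_mutation_count(len(sequence), exponent, rng)
    positions = rng.sample(range(len(sequence)), n_mutations)
    seq = list(sequence)
    for pos in positions:
        # Always substitute a different nucleotide so every chosen position changes.
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)


rng = random.Random(0)  # seeded only to make the illustration reproducible
parent = "GGGAAAUCCC"
child = mutate(parent, rng=rng)
print(parent, "->", child)
```

A smaller exponent makes the tail heavier and large jumps more frequent; a larger exponent makes the search almost purely local.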

    Thermodynamic Analysis of Interacting Nucleic Acid Strands

    Motivated by the analysis of natural and engineered DNA and RNA systems, we present the first algorithm for calculating the partition function of an unpseudoknotted complex of multiple interacting nucleic acid strands. This dynamic program is based on a rigorous extension of secondary structure models to the multistranded case, addressing representation and distinguishability issues that do not arise for single-stranded structures. We then derive the form of the partition function for a fixed volume containing a dilute solution of nucleic acid complexes. This expression can be evaluated explicitly for small numbers of strands, allowing the calculation of the equilibrium population distribution for each species of complex. Alternatively, for large systems (e.g., a test tube), we show that the unique complex concentrations corresponding to thermodynamic equilibrium can be obtained by solving a convex programming problem. Partition function and concentration information can then be used to calculate equilibrium base-pairing observables. The underlying physics and mathematical formulation of these problems lead to an interesting blend of approaches, including ideas from graph theory, group theory, dynamic programming, combinatorics, convex optimization, and Lagrange duality.

    Analysis of Genomic and Proteomic Sequences using DSP Techniques

    Analysis of biological sequences by detecting hidden periodicities and symbolic patterns has been an active area of research for a couple of decades. The hidden periodic components and patterns help locate biologically relevant motifs such as protein coding regions (exons), CpG islands (CGIs) and hot-spots that characterize various biological functions. The discrete nature of biological sequences has prompted many researchers to use digital signal processing (DSP) techniques for their analysis. After mapping the biological sequences to numerical sequences, various DSP techniques using digital filters, wavelets, neural networks, filter banks etc. have been developed to detect the hidden periodicities and recurring patterns in these sequences. This thesis attempts to develop effective DSP-based techniques to solve some of the important problems in biological sequence analysis. Specifically, DSP techniques such as statistically optimal null filters (SONFs), matched filters and neural-network-based algorithms are developed for the analysis of deoxyribonucleic acid (DNA), ribonucleic acid (RNA) and protein sequences. In the first part of this study, DNA sequences are investigated in order to identify the locations of CGIs and protein coding regions, i.e., exons. SONFs, which are known for their ability to efficiently estimate short-duration signals embedded in noise by combining the maximum signal-to-noise ratio and the least squares optimization criteria, are utilized to solve these problems. Basis sequences characterizing CGIs and exons are formulated for use in the SONF technique. In the second part of this study, RNA sequences are analyzed to predict their secondary structures. For this purpose, matched filters based on 2-dimensional convolution are developed to identify the locations of stem and loop patterns in the RNA secondary structure. 
The knowledge of the stem and loop patterns thus obtained is then used to predict the presence of pseudoknots, leading to the determination of the entire RNA secondary structure. Finally, in the third part of this thesis, protein sequences are analyzed to solve the problems of predicting protein secondary structure and identifying the locations of hot-spots. For predicting the protein secondary structure, a two-stage neural network scheme is developed, whereas for predicting the locations of hot-spots an SONF-based approach is proposed. Hot-spots in proteins exhibit a characteristic frequency corresponding to their biological function. A basis function is formulated based on this characteristic frequency to be used in SONFs to detect the locations of hot-spots belonging to the corresponding functional group. Extensive experiments are performed throughout the thesis to demonstrate the effectiveness and validity of the various schemes and techniques developed in this investigation. The performance of the proposed techniques is compared with that of previously reported techniques for the analysis of biological sequences. For this purpose, the results obtained are validated using databases with known annotations. It is shown that the proposed schemes result in performance superior to that of some of the existing techniques.
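As a rough illustration of the "map to numbers, then look for hidden periodicities" idea described in this abstract, the sketch below uses binary indicator sequences (one per nucleotide) and evaluates the discrete Fourier transform at frequency 1/3, the classic period-3 signature associated with coding regions. This is a widely used textbook approach substituted here for illustration; it is not the SONF or matched-filter method developed in the thesis, and the function names are hypothetical.

```python
import cmath


def indicator(seq, base):
    """Binary indicator sequence: 1.0 where seq has the given base, else 0.0."""
    return [1.0 if ch == base else 0.0 for ch in seq]


def period3_power(seq):
    """Length-normalized spectral power at frequency 1/3, summed over the four bases."""
    total = 0.0
    for base in "ACGT":
        x = indicator(seq, base)
        # Single DFT coefficient at frequency 1/3 (period of three nucleotides).
        coeff = sum(v * cmath.exp(-2j * cmath.pi * t / 3) for t, v in enumerate(x))
        total += abs(coeff) ** 2
    return total / len(seq)


codon_like = "ATG" * 30   # strictly period-3 toy sequence
period_four = "ACGT" * 21  # period-4 toy sequence with no period-3 component

print(period3_power(codon_like))
print(period3_power(period_four))
```

In practice this statistic is computed in a sliding window along a genome so that peaks mark candidate coding regions; the toy sequences here only show the contrast between a period-3 signal and its absence.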

    Computational Design and Experimental Validation of Functional Ribonucleic Acid Nanostructures

    In living cells, two major classes of ribonucleic acid (RNA) molecules can be found. The first class, called messenger RNA (mRNA), contains the genetic information that the ribosome reads and translates into proteins. The second class, called non-coding RNA (ncRNA), does not code for proteins and is involved in key cellular processes, such as gene expression regulation, splicing, differentiation, and development. NcRNAs fold into an ensemble of thermodynamically stable secondary structures, which eventually lead the molecule to fold into a specific 3D structure. It is widely known that ncRNAs carry out their functions via their 3D structures as well as their molecular composition. The secondary structure of ncRNAs is composed of different types of structural elements (motifs) such as stacking base pairs, internal loops, hairpin loops and pseudoknots. Pseudoknots are particularly difficult to model, are abundant in nature and are known to stabilize the functional form of the molecule. Due to the diverse range of functions of ncRNAs, their computational design and analysis have numerous applications in nano-technology, therapeutics, synthetic biology, and materials engineering. The RNA design problem is to find novel RNA sequences that are predicted to fold into target structure(s) while satisfying specific qualitative characteristics and constraints. RNA design can be modeled as a combinatorial optimization problem (COP) and is known to be computationally challenging or, more precisely, NP-hard. Numerous algorithms to solve the RNA design problem have been developed over the past two decades, but most ignore pseudoknots, which limits their application to only a slice of real-world modeling and design problems. Moreover, the few existing pseudoknot design methods, which were developed only recently, do not provide any evidence about the applicability of their proposed design methodology in biological contexts. 
The two objectives of this thesis are set to address these two shortcomings. First, we are interested in developing an efficient computational method for the design of RNA secondary structures including pseudoknots that shows significantly better in-silico quality characteristics than the state of the art. Second, we are interested in showing the real-world worthiness of the proposed method by validating it experimentally. More precisely, our aim is to design instances of certain types of RNA enzymes (i.e. ribozymes) and demonstrate that they are functionally active. This would likely only happen if their predicted folding matched their actual folding in the in-vitro experiments. In this thesis, we present four contributions. First, we propose a novel adaptive defect weighted sampling algorithm to efficiently solve the RNA secondary structure design problem where pseudoknots are included. We compare the performance of our design algorithm with the state of the art and show that our method generates molecules that are thermodynamically more stable and less defective than those generated by state-of-the-art methods. Moreover, we show that when the effect of fitness evaluation is decoupled from the search and optimization process, our optimization method converges faster than the non-dominated sorting genetic algorithm (NSGA II) and the ant colony optimization (ACO) algorithm do. Second, we use our algorithmic development to implement an RNA design pipeline called Enzymer and make it available as an open source package useful for wet lab practitioners and RNA bioinformaticians. Enzymer uses multiple sequence alignment (MSA) data to generate initial design templates for further optimization. Our design pipeline can then be used to re-engineer naturally occurring RNA enzymes such as ribozymes and riboswitches. Our first and second contributions are published in the RNA section of the Journal of Frontiers in Genetics. 
Third, we use Enzymer to re-engineer three different species of pseudoknotted ribozymes: a hammerhead ribozyme from the mouse gut metagenome, a hammerhead ribozyme from Yarrowia lipolytica and a glmS ribozyme from Thermoanaerobacter tengcongensis. We designed a total of 18 ribozyme sequences and showed that 16 of them were active in vitro. Our experimental results have been submitted to the RNA journal and strongly suggest that Enzymer is a reliable tool to design pseudoknotted ncRNAs with a desired secondary structure. Finally, we propose a novel architecture for a new ribozyme-based gene regulatory network where a hammerhead ribozyme modulates expression of a reporter gene when an external stimulus, IPTG, is present. Our in-vivo experiments show the expected results in 7 out of 12 cases.

    Thermodynamic Analysis of Interacting Nucleic Acid Strands

    Insights into RNA structure by melding experiment and computation

    The ability of RNA to perform diverse cellular functions depends on its capability to form complex structures. Therefore, determining RNA structure is critical to understanding RNA function. Computational methods allow for quick determination of RNA structures, but are often prone to inaccuracies in their predictions. A newly developed technology, known as SHAPE, can be used to probe RNA structure and identify nucleotides that are likely to be single stranded or base paired. These SHAPE data can be used as input to an RNA structure prediction program to refine predictions. Previous studies have shown that the incorporation of SHAPE data can increase the accuracy of prediction by over 30% compared to traditional mFold-class algorithms. In this work, I utilize SHAPE technology to refine RNA predictions and solve new challenges. First, I create an algorithm, ShapeKnots, which incorporates SHAPE data and the prediction of pseudoknots. Pseudoknots are relatively rare RNA structural motifs that tend to occur in functional regions but, due to their complexity, are often excluded from structure prediction. Second, I utilize the ShapeKnots algorithm to identify pseudoknots in HIV-1 and test their role in viral replication. Third, I develop a modified partition function calculation to identify the de novo accuracy of secondary structure predictions. This allows end users not only to obtain a predicted structure, but also to know the confidence of that prediction. Fourth, I utilize SHAPE-directed folding to identify potential alternative structures in the ribosome. Finally, I create a method to identify the accuracy of tertiary structure predictions. This allows for a quantitative measurement of accuracy when comparing predicted tertiary structures with previously determined conventional structures.