46 research outputs found

    Dynamic programming based RNA pseudoknot alignment

    Pseudoknots are certain structural motifs of RNA molecules. In this thesis we consider the problem of RNA pseudoknot alignment. Most current approaches either discard pseudoknots in order to be efficient or rely on heuristics generating only approximate solutions. This work focuses on dynamic programming based alignment methods and proposes two new approaches for an exact solution of the alignment problem in the presence of pseudoknot structures. The first approach is able to handle arbitrary pseudoknots but, due to the NP-hardness of the problem, does not guarantee a polynomial runtime for all instances. Nevertheless, an analysis in terms of parameterized complexity shows that the algorithm is fixed parameter tractable for a parameter that is small in practice. The second approach is a general scheme for the alignment of restricted classes of pseudoknots in polynomial time. It is motivated by existing RNA pseudoknot prediction algorithms. We show how to embed seven of those algorithms in a common scheme and present an analogous scheme for the alignment problem, which yields for each of the structure prediction algorithms a corresponding alignment algorithm. The alignment algorithms handle the same class of pseudoknots as the corresponding prediction algorithms, and the time and space complexity is only increased by a linear factor compared to the respective prediction algorithm. Both approaches have been implemented to evaluate their applicability in practice.
    German abstract (translated): In this dissertation I address the alignment of certain RNA structures known as pseudoknots. Because this problem is NP-hard, most alignment methods available so far either ignore pseudoknots in order to be efficient or compute only approximate solutions by means of heuristics. In the present work I describe two new methods that use dynamic programming to compute an exact solution to the alignment problem for pseudoknot structures. The first method can align arbitrary pseudoknots and, since this is an NP-hard problem, does not in general have a polynomially bounded runtime. A parameterized complexity analysis shows, however, that the algorithm is fixed parameter tractable with respect to a parameter that is small in practice. The second method makes it possible to align various restricted classes of pseudoknots in polynomial time. As a first step, I show how existing prediction algorithms for seven such classes can be embedded in a common scheme. I then develop an analogous scheme for the alignment of pseudoknots, which yields for each of the prediction algorithms a corresponding alignment algorithm whose space and time complexity is increased only by a linear factor. Both methods have also been implemented to evaluate their practical applicability.
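As a small illustration of what "pseudoknot" means in the abstract above, the following sketch checks whether a set of base pairs contains a crossing, i.e. two pairs (i, j) and (k, l) with i < k < j < l. It is not taken from the thesis. The function names `parse_pairs` and `has_pseudoknot` are hypothetical, and the input format is assumed to be dot-bracket notation with square brackets for the second helix, a common convention that the abstract itself does not mention.

```python
def parse_pairs(structure):
    """Parse dot-bracket notation using () and [] into a sorted list of (i, j) pairs.

    Assumption: each bracket type is well nested on its own; dots are unpaired.
    """
    closers = {")": "(", "]": "["}
    stacks = {"(": [], "[": []}
    pairs = []
    for pos, ch in enumerate(structure):
        if ch in stacks:
            stacks[ch].append(pos)          # remember the opening position
        elif ch in closers:
            pairs.append((stacks[closers[ch]].pop(), pos))
    return sorted(pairs)


def has_pseudoknot(pairs):
    """Return True if two base pairs (i, j) and (k, l) cross, i.e. i < k < j < l."""
    pairs = sorted(pairs)
    for idx, (i, j) in enumerate(pairs):
        for k, l in pairs[idx + 1:]:        # later pairs have k > i
            if i < k < j < l:
                return True
    return False


if __name__ == "__main__":
    print(has_pseudoknot(parse_pairs("((..)).[[..]]")))  # nested/adjacent helices only
    print(has_pseudoknot(parse_pairs("(([[))]]")))       # the two helices interleave
```

The quadratic pairwise check is chosen for clarity, not speed; it only detects the presence of crossing pairs and says nothing about which restricted pseudoknot class a structure belongs to.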

    Automated Design of Dynamic Programming Schemes for RNA Folding with Pseudoknots

    Despite being a textbook application of dynamic programming (DP) and a routine task in RNA structure analysis, RNA secondary structure prediction remains challenging whenever pseudoknots come into play. To circumvent the NP-hardness of energy minimization in realistic energy models, specialized algorithms have been proposed for restricted conformation classes that capture the most frequently observed configurations. While these methods rely on hand-crafted DP schemes, we generalize and fully automate the design of DP pseudoknot prediction algorithms. We formalize the problem of designing DP algorithms for an (infinite) class of conformations, modeled by (a finite number of) fatgraphs, and automatically build DP schemes minimizing their algorithmic complexity. We propose an algorithm for the problem, based on the tree-decomposition of a well-chosen representative structure, which we simplify and reinterpret as a DP scheme. The algorithm is fixed-parameter tractable for the tree-width tw of the fatgraph, and its output represents an O(n^{tw+1}) algorithm for predicting the MFE folding of an RNA of length n. Our general framework supports general energy models, partition function computations, recursive substructures and partial folding, and could pave the way for algebraic dynamic programming beyond the context-free case.

    Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics

    BACKGROUND: The general problem of RNA secondary structure prediction under the widely used thermodynamic model is known to be NP-complete when the structures considered include arbitrary pseudoknots. For restricted classes of pseudoknots, several polynomial time algorithms have been designed, where the O(n^6) time and O(n^4) space algorithm by Rivas and Eddy is currently the best available program. RESULTS: We introduce the class of canonical simple recursive pseudoknots and present an algorithm that requires O(n^4) time and O(n^2) space to predict the energetically optimal structure of an RNA sequence, possibly containing such pseudoknots. Evaluation against a large collection of known pseudoknotted structures shows the adequacy of the canonization approach and our algorithm. CONCLUSIONS: RNA pseudoknots of medium size can now be predicted reliably as well as efficiently by the new algorithm.

    Tree Diet: Reducing the Treewidth to Unlock FPT Algorithms in RNA Bioinformatics

    Hard graph problems are ubiquitous in Bioinformatics, inspiring the design of specialized Fixed-Parameter Tractable algorithms, many of which rely on a combination of tree-decomposition and dynamic programming. The time/space complexities of such approaches hinge critically on low values for the treewidth tw of the input graph. In order to extend their scope of applicability, we introduce the Tree-Diet problem, i.e. the removal of a minimal set of edges such that a given tree-decomposition can be slimmed down to a prescribed treewidth tw'. Our rationale is that the time gained thanks to a smaller treewidth in a parameterized algorithm compensates for the extra post-processing needed to take deleted edges into account. Our core result is an FPT dynamic programming algorithm for Tree-Diet, using 2^{O(tw)}n time and space. We complement this result with parameterized complexity lower-bounds for stronger variants (e.g., NP-hardness when tw' or tw-tw' is constant). We propose a prototype implementation for our approach which we apply on difficult instances of selected RNA-based problems: RNA design, sequence-structure alignment, and search of pseudoknotted RNAs in genomes, revealing very encouraging results. This work paves the way for a wider adoption of tree-decomposition-based algorithms in Bioinformatics.

    Geometric combinatorics and computational molecular biology: branching polytopes for RNA sequences

    Questions in computational molecular biology generate various discrete optimization problems, such as DNA sequence alignment and RNA secondary structure prediction. However, the optimal solutions are fundamentally dependent on the parameters used in the objective functions. The goal of a parametric analysis is to elucidate such dependencies, especially as they pertain to the accuracy and robustness of the optimal solutions. Techniques from geometric combinatorics, including polytopes and their normal fans, have been used previously to give parametric analyses of simple models for DNA sequence alignment and RNA branching configurations. Here, we present a new computational framework, and proof-of-principle results, which give the first complete parametric analysis of the branching portion of the nearest neighbor thermodynamic model for secondary structure prediction for real RNA sequences. Comment: 17 pages, 8 figures.

    Algorithms for RNA secondary structure analysis : prediction of pseudoknots and the consensus shapes approach

    Reeder J. Algorithms for RNA secondary structure analysis : prediction of pseudoknots and the consensus shapes approach. Bielefeld (Germany): Bielefeld University; 2007. Our understanding of the role of RNA has undergone a major change in the last decade. Once believed to be only a mere carrier of information and structural component of the ribosomal machinery in the advent of the genomic age, it is now clear that RNAs play a much more active role. RNAs can act as regulators and can have catalytic activity - roles previously only attributed to proteins. There is still much speculation in the scientific community as to what extent RNAs are responsible for the complexity in higher organisms which can hardly be explained with only proteins as regulators. In order to investigate the roles of RNA, it is therefore necessary to search for new classes of RNA. For those and already known classes, analyses of their presence in different species of the tree of life will provide further insight about the evolution of biomolecules and especially RNAs. Since RNA function often follows its structure, computer programs for RNA structure prediction are an integral part of this procedure. The secondary structure of RNA - the level of base pairing - strongly determines the tertiary structure. As the latter is computationally intractable and experimentally expensive to obtain, secondary structure analysis has become an accepted substitute. In this thesis, I present two new algorithms (and a few variations thereof) for the prediction of RNA secondary structures. The first algorithm addresses the problem of predicting a secondary structure from a single sequence including RNA pseudoknots. Pseudoknots have been shown to be functionally relevant in many RNA mediated processes. However, pseudoknots are excluded from considerations by state-of-the-art RNA folding programs for reasons of computational complexity.
While folding a sequence of length n into unknotted structures requires O(n^3) time and O(n^2) space, finding the best structure including arbitrary pseudoknots has been proven to be NP-complete. Nevertheless, I demonstrate in this work that certain types of pseudoknots can be included in the folding process with only a moderate increase of computational cost. In analogy to protein coding RNA, where a conserved encoded protein hints at a similar metabolic function, structural conservation in RNA may give clues to RNA function and to the discovery of RNA genes. However, structure conservation is more complex to deal with computationally than sequence conservation. The method considered to be at least conceptually the ideal approach in this situation is the Sankoff algorithm. It simultaneously aligns two sequences and predicts a common secondary structure. Unfortunately, it is computationally rather expensive - O(n^6) time and O(n^4) space for two sequences, and for more than two sequences it becomes exponential in the number of sequences! Therefore, several heuristic implementations emerged in the last decade trying to make the Sankoff approach practical by introducing pragmatic restrictions on the search space. In this thesis, I propose to redefine the consensus structure prediction problem in a way that does not imply a multiple sequence alignment step. For a family of RNA sequences, my method explicitly and independently enumerates the near-optimal abstract shape space and predicts an abstract shape as the consensus for all sequences. For each sequence, it delivers the thermodynamically best structure which has this shape. The technique of abstract shapes analysis is employed here for a synoptic view of the suboptimal folding space. As the shape space is much smaller than the structure space, and identification of common shapes can be done in linear time (in the number of shapes considered), the method is essentially linear in the number of sequences.
Evaluations show that the new method compares favorably with available alternatives.
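The O(n^3) time and O(n^2) space bound for unknotted folding quoted in the abstract above can be illustrated with a Nussinov-style dynamic program. This is a deliberately simplified sketch, not the thesis's method. It maximizes the number of base pairs instead of minimizing free energy under the thermodynamic model, and the names `can_pair`, `max_pairs` and the `min_loop` parameter are hypothetical choices made for this example.

```python
def can_pair(a, b):
    """Watson-Crick and G-U wobble pairs (a simplifying assumption)."""
    return {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})


def max_pairs(seq, min_loop=3):
    """Maximum number of nested (pseudoknot-free) base pairs in seq.

    best[i][j] holds the optimum for the subsequence seq[i..j].
    The table has O(n^2) entries and each is filled in O(n) time,
    giving O(n^3) time and O(n^2) space overall.
    """
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # shorter spans cannot hold a pair
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]               # case 1: position j stays unpaired
            for k in range(i, j - min_loop):     # case 2: j pairs with some k
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    print(max_pairs("GGGAAAUCC"))
```

Because every pair (k, j) splits the problem into two independent intervals, the recursion can only produce nested structures; crossing pairs, and hence pseudoknots, are outside its search space.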

    A Progressive Folding Algorithm for RNA Secondary Structure Prediction

    RNA secondary structure prediction is an area where computational techniques have shown great promise. Most RNA secondary structure prediction algorithms use dynamic programming to compute a secondary structure with minimum free energy. Energy minimization algorithms are less accurate on larger RNA molecules. One potential reason is that larger RNA molecules do not fold instantaneously. Instead, several studies show that RNA molecules fold progressively during transcription. This process could encourage the molecule to fold into a structure that is not at the global lowest energy level. Additionally, dynamic programming algorithms do not allow for an important type of structure called a pseudoknot. Secondary structure prediction allowing pseudoknots was recently shown to be NP-complete. We have created a simulation that captures these biological insights. Our simulation uses a probabilistic approach to fold the molecule progressively as it is synthesized. This thesis evaluates the performance of the simulation and presents several enhancements to improve efficiency and accuracy. Our results show that our progressive folding algorithm did not improve on current techniques. Additionally, we found that a simulated annealing algorithm using our probability models was more accurate than our progressive folding algorithm.

    A modular data analysis pipeline for the discovery of novel RNA motifs

    This dissertation presents a modular software pipeline that searches collections of RNA sequences for novel RNA motifs. In this case the motifs incorporate elements of primary and secondary structure. The motif search pipeline breaks up sets of RNA sequences into shortened segments of RNA primary sequence. The shortened segments are then folded to obtain low energy secondary structures. The distance estimation module of the pipeline then calculates distances between the folded bricks and analyzes the resulting distance matrices for patterns. An initial implementation of the pipeline is applied to synthetic and biological data sets. This implementation introduces a new distance measure for comparing RNA sequences based on structural annotation of the folded sequence as well as a new data analysis technique called non-linear projection. The modular nature of the pipeline is then used to explore the relationships between several different distance measures on random data, synthetic data, and a biological data set consisting of iron response elements. It is shown that the different distance measures capture different relationships between the RNA sequences. The non-linear projection algorithm is used to produce 2-dimensional projections of the distance matrices which are examined via inspection and k-means multiclustering. The pipeline is able to successfully cluster synthetic RNA sequences based only on primary sequence data as well as the iron response elements data set. The dissertation also presents a preliminary analysis of a large biological data set of HIV sequences.

    Transat—A Method for Detecting the Conserved Helices of Functional RNA Structures, Including Transient, Pseudo-Knotted and Alternative Structures

    The prediction of functional RNA structures has attracted increased interest, as it allows us to study the potential functional roles of many genes. RNA structure prediction methods, however, assume that there is a unique functional RNA structure and also do not predict functional features required for in vivo folding. In order to understand how functional RNA structures form in vivo, we require sophisticated experiments or reliable prediction methods. So far, there exist only a few experimentally validated transient RNA structures. On the computational side, there exist several computer programs which aim to predict the co-transcriptional folding pathway in vivo, but these make a range of simplifying assumptions and do not capture all features known to influence RNA folding in vivo. We want to investigate if evolutionarily related RNA genes fold in a similar way in vivo. To this end, we have developed a new computational method, Transat, which detects conserved helices of high statistical significance. We introduce the method, present a comprehensive performance evaluation and show that Transat is able to predict the structural features of known reference structures including pseudo-knotted ones as well as those of known alternative structural configurations. Transat can also identify unstructured sub-sequences bound by other molecules and provides evidence for new helices which may define folding pathways, supporting the notion that homologous RNA sequences not only assume a similar reference RNA structure, but also fold similarly. Finally, we show that the structural features predicted by Transat differ from those assuming thermodynamic equilibrium. Unlike the existing methods for predicting folding pathways, our method works in a comparative way.
This has the disadvantage of not being able to predict features as a function of time, but has the considerable advantage of highlighting conserved features and of not requiring a detailed knowledge of the cellular environment.

    MaxSAT Evaluation 2017 : Solver and Benchmark Descriptions
