702 research outputs found

    Accelerated probabilistic inference of RNA structure evolution

    Get PDF
    BACKGROUND: Pairwise stochastic context-free grammars (Pair SCFGs) are powerful tools for evolutionary analysis of RNA, including simultaneous RNA sequence alignment and secondary structure prediction, but the associated algorithms are intensive in both CPU and memory usage. The same problem is faced by other RNA alignment-and-folding algorithms based on Sankoff's 1985 algorithm. It is therefore desirable to constrain such algorithms, by pre-processing the sequences and using this first pass to limit the range of structures and/or alignments that can be considered. RESULTS: We demonstrate how flexible classes of constraint can be imposed, greatly reducing the computational costs while maintaining a high quality of structural homology prediction. Any score-attributed context-free grammar (e.g. energy-based scoring schemes, or conditionally normalized Pair SCFGs) is amenable to this treatment. It is now possible to combine independent structural and alignment constraints of unprecedented general flexibility in Pair SCFG alignment algorithms. We outline several applications to the bioinformatics of RNA sequence and structure, including Waterman-Eggert N-best alignments and progressive multiple alignment. We evaluate the performance of the algorithm on test examples from the RFAM database. CONCLUSION: A program, Stemloc, that implements these algorithms for efficient RNA sequence alignment and structure prediction is available under the GNU General Public License

    A Combinatorial Framework for Designing (Pseudoknotted) RNA Algorithms

    Get PDF
    We extend an hypergraph representation, introduced by Finkelstein and Roytberg, to unify dynamic programming algorithms in the context of RNA folding with pseudoknots. Classic applications of RNA dynamic programming energy minimization, partition function, base-pair probabilities...) are reformulated within this framework, giving rise to very simple algorithms. This reformulation allows one to conceptually detach the conformation space/energy model -- captured by the hypergraph model -- from the specific application, assuming unambiguity of the decomposition. To ensure the latter property, we propose a new combinatorial methodology based on generating functions. We extend the set of generic applications by proposing an exact algorithm for extracting generalized moments in weighted distribution, generalizing a prior contribution by Miklos and al. Finally, we illustrate our full-fledged programme on three exemplary conformation spaces (secondary structures, Akutsu's simple type pseudoknots and kissing hairpins). This readily gives sets of algorithms that are either novel or have complexity comparable to classic implementations for minimization and Boltzmann ensemble applications of dynamic programming

    Algorithms for RNA secondary structure analysis : prediction of pseudoknots and the consensus shapes approach

    Get PDF
    Reeder J. Algorithms for RNA secondary structure analysis : prediction of pseudoknots and the consensus shapes approach. Bielefeld (Germany): Bielefeld University; 2007.Our understanding of the role of RNA has undergone a major change in the last decade. Once believed to be only a mere carrier of information and structural component of the ribosomal machinery in the advent of the genomic age, it is now clear that RNAs play a much more active role. RNAs can act as regulators and can have catalytic activity - roles previously only attributed to proteins. There is still much speculation in the scientific community as to what extent RNAs are responsible for the complexity in higher organisms which can hardly be explained with only proteins as regulators. In order to investigate the roles of RNA, it is therefore necessary to search for new classes of RNA. For those and already known classes, analyses of their presence in different species of the tree of life will provide further insight about the evolution of biomolecules and especially RNAs. Since RNA function often follows its structure, the need for computer programs for RNA structure prediction is an immanent part of this procedure. The secondary structure of RNA - the level of base pairing - strongly determines the tertiary structure. As the latter is computationally intractable and experimentally expensive to obtain, secondary structure analysis has become an accepted substitute. In this thesis, I present two new algorithms (and a few variations thereof) for the prediction of RNA secondary structures. The first algorithm addresses the problem of predicting a secondary structure from a single sequence including RNA pseudoknots. Pseudoknots have been shown to be functionally relevant in many RNA mediated processes. However, pseudoknots are excluded from considerations by state-of-the-art RNA folding programs for reasons of computational complexity. While folding a sequence of length n into unknotted structures requires O(n^3) time and O(n^2) space, finding the best structure including arbitrary pseudoknots has been proven to be NP-complete. Nevertheless, I demonstrate in this work that certain types of pseudoknots can be included in the folding process with only a moderate increase of computational cost. In analogy to protein coding RNA, where a conserved encoded protein hints at a similar metabolic function, structural conservation in RNA may give clues to RNA function and to finding of RNA genes. However, structure conservation is more complex to deal with computationally than sequence conservation. The method considered to be at least conceptually the ideal approach in this situation is the Sankoff algorithm. It simultaneously aligns two sequences and predicts a common secondary structure. Unfortunately, it is computationally rather expensive - O(n^6) time and O(n^4) space for two sequences, and for more than two sequences it becomes exponential in the number of sequences! Therefore, several heuristic implementations emerged in the last decade trying to make the Sankoff approach practical by introducing pragmatic restrictions on the search space. In this thesis, I propose to redefine the consensus structure prediction problem in a way that does not imply a multiple sequence alignment step. For a family of RNA sequences, my method explicitly and independently enumerates the near-optimal abstract shape space and predicts an abstract shape as the consensus for all sequences. For each sequence, it delivers the thermodynamically best structure which has this shape. The technique of abstract shapes analysis is employed here for a synoptic view of the suboptimal folding space. As the shape space is much smaller than the structure space, and identification of common shapes can be done in linear time (in the number of shapes considered), the method is essentially linear in the number of sequences. Evaluations show that the new method compares favorably with available alternatives

    Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics

    Get PDF
    BACKGROUND: The general problem of RNA secondary structure prediction under the widely used thermodynamic model is known to be NP-complete when the structures considered include arbitrary pseudoknots. For restricted classes of pseudoknots, several polynomial time algorithms have been designed, where the O(n(6))time and O(n(4)) space algorithm by Rivas and Eddy is currently the best available program. RESULTS: We introduce the class of canonical simple recursive pseudoknots and present an algorithm that requires O(n(4)) time and O(n(2)) space to predict the energetically optimal structure of an RNA sequence, possible containing such pseudoknots. Evaluation against a large collection of known pseudoknotted structures shows the adequacy of the canonization approach and our algorithm. CONCLUSIONS: RNA pseudoknots of medium size can now be predicted reliably as well as efficiently by the new algorithm

    Higher-Order Equational Pattern Anti-Unification

    Get PDF
    We consider anti-unification for simply typed lambda terms in associative, commutative, and associative-commutative theories and develop a sound and complete algorithm which takes two lambda terms and computes their generalizations in the form of higher-order patterns. The problem is finitary: the minimal complete set of generalizations contains finitely many elements. We define the notion of optimal solution and investigate special fragments of the problem for which the optimal solution can be computed in linear or polynomial time

    Unity and disunity in evolutionary sciences: Process-based analogies open common research avenues for biology and linguistics

    Get PDF
    International audienceBackgroundFor a long time biologists and linguists have been noticing surprising similarities between the evolution of life forms and languages. Most of the proposed analogies have been rejected. Some, however, have persisted, and some even turned out to be fruitful, inspiring the transfer of methods and models between biology and linguistics up to today. Most proposed analogies were based on a comparison of the research objects rather than the processes that shaped their evolution. Focusing on process-based analogies, however, has the advantage of minimizing the risk of overstating similarities, while at the same time reflecting the common strategy to use processes to explain the evolution of complexity in both fields.ResultsWe compared important evolutionary processes in biology and linguistics and identified processes specific to only one of the two disciplines as well as processes which seem to be analogous, potentially reflecting core evolutionary processes. These new process-based analogies support novel methodological transfer, expanding the application range of biological methods to the field of historical linguistics. We illustrate this by showing (i) how methods dealing with incomplete lineage sorting offer an introgression-free framework to analyze highly mosaic word distributions across languages; (ii) how sequence similarity networks can be used to identify composite and borrowed words across different languages; (iii) how research on partial homology can inspire new methods and models in both fields; and (iv) how constructive neutral evolution provides an original framework for analyzing convergent evolution in languages resulting from common descent (Sapir’s drift).ConclusionsApart from new analogies between evolutionary processes, we also identified processes which are specific to either biology or linguistics. This shows that general evolution cannot be studied from within one discipline alone. In order to get a full picture of evolution, biologists and linguists need to complement their studies, trying to identify cross-disciplinary and discipline-specific evolutionary processes. The fact that we found many process-based analogies favoring transfer from biology to linguistics further shows that certain biological methods and models have a broader scope than previously recognized. This opens fruitful paths for collaboration between the two disciplines

    Extensible Languages for Flexible and Principled Domain Abstraction

    Get PDF
    Die meisten Programmiersprachen werden als Universalsprachen entworfen. UnabhĂ€ngig von der zu entwickelnden Anwendung, stellen sie die gleichen Sprachfeatures und Sprachkonstrukte zur VerfĂŒgung. Solch universelle Sprachfeatures ignorieren jedoch die spezifischen Anforderungen, die viele Softwareprojekte mit sich bringen. Als Gegenkraft zu Universalsprachen fördern domĂ€nenspezifische Programmiersprachen, modellgetriebene Softwareentwicklung und sprachorientierte Programmierung die Verwendung von DomĂ€nenabstraktion, welche den Einsatz von domĂ€nenspezifischen Sprachfeatures und Sprachkonstrukten ermöglicht. Insbesondere erlaubt DomĂ€nenabstraktion Programmieren auf dem selben Abstraktionsniveau zu programmieren wie zu denken und vermeidet dadurch die Notwendigkeit DomĂ€nenkonzepte mit universalsprachlichen Features zu kodieren. Leider ermöglichen aktuelle AnsĂ€tze zur DomĂ€nenabstraktion nicht die Entfaltung ihres ganzen Potentials. Einerseits mangelt es den AnsĂ€tzen fĂŒr interne domĂ€nenspezifische Sprachen an FlexibilitĂ€t bezĂŒglich der Syntax, statischer Analysen, und WerkzeugunterstĂŒtzung, was das tatsĂ€chlich erreichte Abstraktionsniveau beschrĂ€nkt. Andererseits mangelt es den AnsĂ€tzen fĂŒr externe domĂ€nenspezifische Sprachen an wichtigen Prinzipien, wie beispielsweise modularem Schließen oder Komposition von DomĂ€nenabstraktionen, was die Anwendbarkeit dieser AnsĂ€tze in der Entwicklung grĂ¶ĂŸerer Softwaresysteme einschrĂ€nkt. Wir verfolgen in der vorliegenden Doktorarbeit einen neuartigen Ansatz, welcher die Vorteile von internen und externen domĂ€nenspezifischen Sprachen vereint um flexible und prinzipientreue DomĂ€nenabstraktion zu unterstĂŒtzen. Wir schlagen bibliotheksbasierte erweiterbare Programmiersprachen als Grundlage fĂŒr DomĂ€nenabstraktion vor. In einer erweiterbaren Sprache kann DomĂ€nenabstraktion durch die Erweiterung der Sprache mit domĂ€nenspezifischer Syntax, statischer Analyse, und WerkzeugunterstĂŒtzung erreicht werden . Dies ermöglicht DomĂ€nenabstraktionen die selbe FlexibilitĂ€t wie externe domĂ€nenspezifische Sprachen. Um die Einhaltung ĂŒblicher Prinzipien zu gewĂ€hrleisten, organisieren wir Spracherweiterungen als Bibliotheken und verwenden einfache Import-Anweisungen zur Aktivierung von Erweiterungen. Dies erlaubt modulares Schließen (durch die Inspektion der Import-Anweisungen), unterstĂŒtzt die Komposition von DomĂ€nenabstraktionen (durch das Importieren mehrerer Erweiterungen), und ermöglicht die uniforme Selbstanwendbarkeit von Spracherweiterungen in der Entwicklung zukĂŒnftiger Erweiterungen (durch das Importieren von Erweiterungen in einer Erweiterungsdefinition). Die Organisation von Erweiterungen in Form von Bibliotheken ermöglicht DomĂ€nenabstraktionen die selbe Prinzipientreue wie interne domĂ€nenspezifische Sprachen. Wir haben die bibliotheksbasierte erweiterbare Programmiersprache SugarJ entworfen und implementiert. SugarJ Bibliotheken können Erweiterungen der Syntax, der statischen Analyse, und der WerkzeugunterstĂŒtzung von SugarJ deklarieren. Eine syntaktische Erweiterung besteht dabei aus einer erweiterten Syntax und einer Transformation der erweiterten Syntax in die Basissyntax von SugarJ. Eine Erweiterung der Analyse testet Teile des abstrakten Syntaxbaums der aktuellen Datei und produziert eine Liste von Fehlern. Eine Erweiterung der WerkzeugunterstĂŒtzung deklariert Dienste wie SyntaxfĂ€rbung oder CodevervollstĂ€ndigung fĂŒr bestimmte Sprachkonstrukte. SugarJ Erweiterungen sind vollkommen selbstanwendbar: Eine erweiterte Syntax kann in eine Erweiterungsdefinition transformiert werden, eine erweiterte Analyse kann Erweiterungsdefinitionen testen, und eine erweiterte WerkzeugunterstĂŒtzung kann Entwicklern beim Definieren von Erweiterungen assistieren. Um eine Quelldatei mit Erweiterungen zu verarbeiten, inspizieren der SugarJ Compiler und die SugarJ IDE die importierten Bibliotheken um die aktiven Erweiterungen zu bestimmen. Der Compiler und die IDE adaptieren den Parser, den Codegenerator, die Analyseroutine und die WerkzeugunterstĂŒtzung der Quelldatei entsprechend der aktiven Erweiterungen. Wir beschreiben in der vorliegenden Doktorarbeit nicht nur das Design und die Implementierung von SugarJ, sondern berichten darĂŒber hinaus ĂŒber Erweiterungen unseres ursprĂŒnglich Designs. Insbesondere haben wir eine Generalisierung des SugarJ Compilers entworfen und implementiert, die neben Java alternative Basissprachen unterstĂŒtzt. Wir haben diese Generalisierung verwendet um die bibliotheksbasierten erweiterbaren Programmiersprachen SugarHaskell, SugarProlog, und SugarFomega zu entwickeln. Weiterhin haben wir SugarJ ergĂ€nzt um polymorphe DomĂ€nenabstraktion und KommunikationsintegritĂ€t zu unterstĂŒtzen. Polymorphe DomĂ€nenabstraktion ermöglicht Programmierern mehrere Transformationen fĂŒr die selbe domĂ€nenspezifische Syntax bereitzustellen. Dies erhöht die FlexibilitĂ€t von SugarJ und unterstĂŒtzt bekannte Szenarien aus der modellgetriebenen Entwicklung. KommunikationsintegritĂ€t spezifiziert, dass die Komponenten eines Softwaresystems nur ĂŒber explizite KanĂ€le kommunizieren dĂŒrfen. Im Kontext von Codegenerierung stellt dies eine interessante Eigenschaft dar, welche die Generierung von impliziten ModulabhĂ€ngigkeiten untersagt. Wir haben KommunikationsintegritĂ€t als weiteres Prinzip zu SugarJ hinzugefĂŒgt. Basierend auf SugarJ und zahlreicher Fallstudien argumentieren wir, dass flexible und prinzipientreue DomĂ€nenabstraktion ein skalierbares Programmiermodell fĂŒr die Entwicklung komplexer Softwaresysteme darstellt

    LCGbase: A Comprehensive Database for Lineage-Based Co-regulated Genes

    Get PDF
    Animal genes of different lineages, such as vertebrates and arthropods, are well-organized and blended into dynamic chromosomal structures that represent a primary regulatory mechanism for body development and cellular differentiation. The majority of genes in a genome are actually clustered, which are evolutionarily stable to different extents and biologically meaningful when evaluated among genomes within and across lineages. Until now, many questions concerning gene organization, such as what is the minimal number of genes in a cluster and what is the driving force leading to gene co-regulation, remain to be addressed. Here, we provide a user-friendly database—LCGbase (a comprehensive database for lineage-based co-regulated genes)—hosting information on evolutionary dynamics of gene clustering and ordering within animal kingdoms in two different lineages: vertebrates and arthropods. The database is constructed on a web-based Linux-Apache-MySQL-PHP framework and effective interactive user-inquiry service. Compared to other gene annotation databases with similar purposes, our database has three comprehensible advantages. First, our database is inclusive, including all high-quality genome assemblies of vertebrates and representative arthropod species. Second, it is human-centric since we map all gene clusters from other genomes in an order of lineage-ranks (such as primates, mammals, warm-blooded, and reptiles) onto human genome and start the database from well-defined gene pairs (a minimal cluster where the two adjacent genes are oriented as co-directional, convergent, and divergent pairs) to large gene clusters. Furthermore, users can search for any adjacent genes and their detailed annotations. Third, the database provides flexible parameter definitions, such as the distance of transcription start sites between two adjacent genes, which is extendable to genes that flanking the cluster across species. We also provide useful tools for sequence alignment, gene ontology (GO) annotation, promoter identification, gene expression (co-expression), and evolutionary analysis. This database not only provides a way to define lineage-specific and species-specific gene clusters but also facilitates future studies on gene co-regulation, epigenetic control of gene expression (DNA methylation and histone marks), and chromosomal structures in a context of gene clusters and species evolution. LCGbase is freely available at http://lcgbase.big.ac.cn/LCGbase
    • 

    corecore