49 research outputs found

    Counting, generating and sampling tree alignments

    Pairwise ordered tree alignments are combinatorial objects that appear in RNA secondary structure comparison. However, the usual representation of tree alignments as supertrees is ambiguous, i.e. two distinct supertrees may induce identical sets of matches between identical pairs of trees. This ambiguity is uninformative, and detrimental to any probabilistic analysis. In this work, we consider tree alignments up to equivalence. Our first result is a precise asymptotic enumeration of tree alignments, obtained from a context-free grammar by means of basic analytic combinatorics. Our second result focuses on alignments between two given ordered trees S and T. By refining our grammar to align specific trees, we obtain a decomposition scheme for the space of alignments, and use it to design an efficient dynamic programming algorithm for sampling alignments under the Gibbs-Boltzmann probability distribution. This generalizes existing tree alignment algorithms, and opens the door for a probabilistic analysis of the space of suboptimal RNA secondary structure alignments. Comment: ALCOB - 3rd International Conference on Algorithms for Computational Biology - 2016, Jun 2016, Trujillo, Spain. 201

    Algorithms for pre-microrna classification and a GPU program for whole genome comparison

    MicroRNAs (miRNAs) are non-coding RNAs of approximately 22 nucleotides that are derived from precursor molecules. These precursor molecules, or pre-miRNAs, often fold into stem-loop hairpin structures. However, a large number of sequences with pre-miRNA-like hairpins can be found in genomes. It is a challenge to distinguish the real pre-miRNAs from other hairpin sequences with similar stem-loops (referred to as pseudo pre-miRNAs). The first part of this dissertation presents a new method, called MirID, for identifying and classifying microRNA precursors. MirID comprises three steps. First, a combinatorial feature mining algorithm is developed to identify suitable feature sets. Then, the feature sets are used to train support vector machines to obtain classification models, from which a classifier ensemble is constructed. Finally, an AdaBoost algorithm is adopted to further enhance the accuracy of the classifier ensemble. Experimental results on a variety of species demonstrate the good performance of the proposed approach, and its superiority over existing methods. In the second part of this dissertation, a GPU (Graphics Processing Unit) program is developed for whole genome comparison. The goal of this research is to identify the commonalities and differences of two genomes from closely related organisms via multiple sequence alignments, using a seed-and-extend technique to choose reliable subsets of exact or near-exact matches, which are called anchors. A rigorous method, Smith-Waterman search, is applied to find the anchors, but it takes days to months to map millions of bases of mammalian genome sequences. With GPU programming, which runs hundreds of short functions called threads in parallel, a speedup of up to 100X is achieved over similar CPU executions
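For readers unfamiliar with the Smith-Waterman search mentioned in this abstract, the following is a minimal CPU sketch of the local alignment score recurrence. It is not the dissertation's GPU program: the function name `smith_waterman_score` and the scoring parameters (`match`, `mismatch`, `gap`, with a linear gap penalty) are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions (linear gap penalty),
    not parameters taken from the abstract above.
    """
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0,
                          prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                          prev[j] + gap,       # gap in b
                          curr[j - 1] + gap)   # gap in a
            best = max(best, curr[j])
        prev = curr
    return best
```

The double loop makes the cost proportional to the product of the two sequence lengths, which is consistent with the abstract's remark that an exhaustive search over mammalian-scale sequences is slow on a CPU.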

    The super-n-motifs model : a novel alignment-free approach for representing and comparing RNA secondary structures

    Abstract: Motivation: Comparing ribonucleic acid (RNA) secondary structures of arbitrary size uncovers structural patterns that can provide a better understanding of RNA functions. However, performing fast and accurate secondary structure comparisons is challenging when we take into account the RNA configuration (i.e. linear or circular), the presence of pseudoknot and G-quadruplex (G4) motifs and the increasing number of secondary structures generated by high-throughput probing techniques. To address this challenge, we propose the super-n-motifs model based on a latent analysis of enhanced motifs comprising not only basic motifs but also adjacency relations. The super-n-motifs model computes a vector representation of secondary structures as linear combinations of these motifs. Results: We demonstrate the accuracy of our model for comparison of secondary structures from linear and circular RNA while also considering pseudoknot and G4 motifs. We show that the super-n-motifs representation effectively captures the most important structural features of secondary structures, as compared to other representations such as ordered tree, arc-annotated and string representations. Finally, we demonstrate the time efficiency of our model, which is alignment-free and capable of performing large-scale comparisons of 10 000 secondary structures up to 4 orders of magnitude faster than existing approaches

    Alignment and analysis of noncoding DNA sequences in Drosophila


    Computational Methods for Comparative Non-coding RNA Analysis: from Secondary Structures to Tertiary Structures

    Unlike messenger RNAs (mRNAs), whose information is encoded in the primary sequences, the cellular roles of non-coding RNAs (ncRNAs) originate from their structures. Therefore, studying the structural conservation in ncRNAs is important for an in-depth understanding of their functionalities. In the past years, many computational methods have been proposed to analyze the common structural patterns in ncRNAs using comparative methods. However, RNA structural comparison is not a trivial task, and the existing approaches still have numerous issues in efficiency and accuracy. In this dissertation, we introduce a suite of novel computational tools that extend the classic models for ncRNA secondary and tertiary structure comparisons. For RNA secondary structure analysis, we first developed a computational tool, named PhyloRNAalifold, to integrate phylogenetic information into consensus structural folding. The underlying idea of this algorithm is that the importance of a co-varying mutation should be determined by its position on the phylogenetic tree. By assigning high scores to the critical covariances, the prediction of RNA secondary structure can be more accurate. Besides structure prediction, we also developed a computational tool, named ProbeAlign, to improve the efficiency of genome-wide ncRNA screening by using high-throughput RNA structural probing data. It treats the chemical reactivities embedded in the probing information as pairing attributes of the search targets. This approach avoids the time-consuming base pair matching in secondary structure alignment. The application of ProbeAlign to the FragSeq datasets shows its capability for genome-wide ncRNA analysis. For RNA tertiary structure analysis, we first developed a computational tool, named STAR3D, to find the global conservation in RNA 3D structures. STAR3D aims at finding the consensus of stacks by using 2D topology and 3D geometry together. Then, the loop regions can be ordered and aligned according to their relative positions in the consensus. This stack-guided alignment method brings the divide-and-conquer strategy to RNA 3D structural alignment, which improves its efficiency dramatically. Furthermore, we clustered all loop regions in non-redundant RNA 3D structures to detect plausible RNA structural motifs de novo. The computational pipeline, named RNAMSC, was extended to handle large-scale PDB datasets, and solid downstream analysis was performed to ensure that the clustering results are valid and easy to apply in further research. The final results contain many interesting variations of known motifs, such as the GNAA tetraloop, kink-turn, sarcin-ricin and T-loop motifs. We also discovered novel functional motifs that are conserved in a wide range of ncRNAs, including ribosomal RNA, sgRNA, SRP RNA, GlmS riboswitch and twister ribozyme

    Counting, Generating, Analyzing and Sampling Tree Alignments

    Pairwise ordered tree alignments are combinatorial objects that appear in important applications, such as RNA secondary structure comparison. However, the usual representation of tree alignments as supertrees is ambiguous, i.e. two distinct supertrees may induce identical sets of matches between identical pairs of trees. This ambiguity is uninformative, and detrimental to any probabilistic analysis. In this work, we consider tree alignments up to equivalence. Our first result is a precise asymptotic enumeration of tree alignments, obtained from a context-free grammar by means of basic analytic combinatorics. Our second result focuses on alignments between two given ordered trees S and T. By refining our grammar to align specific trees, we obtain a decomposition scheme for the space of alignments, and use it to design an efficient dynamic programming algorithm for sampling alignments under the Gibbs-Boltzmann probability distribution. This generalizes existing tree alignment algorithms, and opens the door for a probabilistic analysis of the space of suboptimal alignments
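As a rough illustration of what sampling under a Gibbs-Boltzmann distribution means, the sketch below draws from an explicit list of candidates with probability proportional to exp(-E/kT). It is not the paper's dynamic programming algorithm, which samples without enumerating alignments. The helper names, the `energies` list, the `kT` parameter and the sign convention (lower energy is more probable) are all assumptions made for illustration.

```python
import math
import random


def boltzmann_probabilities(energies, kT=1.0):
    """Return probabilities proportional to exp(-E / kT) for each energy.

    Assumes lower energy means a better candidate; this sign convention
    is an illustrative assumption, not taken from the abstract.
    """
    m = min(energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-(e - m) / kT) for e in energies]
    z = sum(weights)   # partition function of the shifted weights
    return [w / z for w in weights]


def sample_index(energies, kT=1.0, rng=random):
    """Draw the index of one candidate under the Boltzmann distribution."""
    probs = boltzmann_probabilities(energies, kT)
    return rng.choices(range(len(energies)), weights=probs, k=1)[0]
```

With a large `kT` the distribution approaches uniform, and with a small `kT` it concentrates on the minimum-energy candidates, which is why this kind of sampling is useful for exploring suboptimal solutions.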

    Exploration des structures secondaires de l’ARN

    In the digital era, extracting value from data by giving it meaning is a key challenge for supporting strategic decision-making in many fields, notably digital marketing and health, and, in our context, for a better understanding of the biology of nucleic acid structures. One of the major challenges of structural biology concerns the study of the structures of ribonucleic acids (RNA) and the effects of these structures and of their alterations on RNA function. Contributing to this important challenge is the goal of this thesis. It focuses mainly on the development of methods and tools for the efficient exploration of RNA secondary structures. Indeed, exploring RNA secondary structures helps to shed light on their function and to better understand their specific involvement in cellular processes. In this context, we developed the super-n-motifs model, which contributes to a better representation of the structural complexity of RNAs and offers an efficient way to assess the similarity of RNA structures while taking this complexity into account. The super-n-motifs model facilitates the study of RNAs whose role is unknown. It makes it possible to formulate hypotheses about the function or functions of RNAs when they share an unequivocal structural similarity. We also developed the structurexplor platform to facilitate the exploration of secondary structures, that is, to make it possible, in a few clicks, to characterize populations of RNA structures by, for example, highlighting groups of RNAs that share similar structures. The implementation of the super-n-motifs model and of the structurexplor platform contributed to a better understanding of the structural phylogeny of viroids, which are RNA pathogens that attack plants, a phylogeny that had until then been based only on their sequences

    Inférence des acteurs de la régulation des expressions géniques

    The increasing amount of available data raises many issues in bioinformatics, such as the development of new methods for the efficient processing and analysis of these data. In particular, regulatory networks are at the heart of many projects. In order to understand regulatory systems, it is therefore necessary to characterize and understand the actors of these systems, such as RNAs and pseudogenes. We developed a new method to compare a query RNA with a static set of target RNAs. Our method is based on (i) a preliminary indexing of the sequence/structure seeds of the target RNAs, (ii) searching for potentially homologous RNAs by detecting seeds of the query that are present in the targets and chaining these seeds, then (iii) completing the alignment using an anchor-based exact alignment algorithm. We applied our method to the BraliBase2.1 benchmark. We compared the accuracy and efficiency of our method with the exact method LocARNA and its recent seed-based speed-up ExpLocP. Our pipeline, RNA-unchained, greatly improves on the computation time of LocARNA and is comparable to that of ExpLocP, while improving the overall accuracy of the final alignments. Moreover, we developed a new method, PseudOE, to detect and characterize the pseudome of a genome, and to compare this pseudome across two or more genomes. This method made it possible to analyse the pan-pseudome of two distantly related Oenococcus oeni strains with opposite oenological properties. Quite interestingly, with 8.5% of pseudogenes for a compact 1.8 Mb genome, O. oeni appears to be prone to pseudogenization compared to other bacteria. A large proportion of the pseudogenes were found to come from mutational degradation and to be present in only one of the genomes, suggesting a relatively recent origin that could illustrate the natural propensity of O. oeni for hypermutability. In addition, we identified a spatial organization of pseudogenes into dedicated chromosomal territories. These analyses illustrate peculiar properties of O. oeni pseudogenes, providing additional insights into gene/genome evolution from which future genome annotation will benefit

    Global characterization of the immune response to inoculation of aluminium hydroxide-based vaccines by RNA sequencing

    xix, 195 p. In this work, samples from a long-term vaccination experiment were analysed. Multiple sheep were exposed to several vaccines containing aluminium hydroxide as adjuvant over a period of 475 days, with the aim of studying the mechanism of action of this adjuvant on the immune system and of checking whether it is able to reach distant organs such as the brain after inoculation. To this end, samples of peripheral blood mononuclear cells and of the parietal lobe cortex were collected and used to prepare RNA and microRNA sequencing libraries (Total RNA-seq and miRNA-seq). The libraries were analysed with bioinformatic tools and multiple analyses were performed: 1. Differential expression for both the RNA-seq and the miRNA-seq data; 2. Annotation of new miRNAs in sheep; 3. Target prediction for the miRNAs and co-expression analysis with the RNA-seq data. In addition, since Total RNA-seq libraries retain non-coding RNA, which is poorly annotated in sheep, these data were used to annotate circular RNAs in sheep, and it was studied whether these non-coding RNAs could play a role in the activity of aluminium as an adjuvant