27 research outputs found

    Comparative analysis of RNA genes: the caRNAc software

    RNA genes are ubiquitous in the cell and are involved in a number of biochemical processes. Since there is a close relationship between function and structure, software tools that predict the secondary structure of non-coding RNAs from the base sequence are very helpful. In this article, we focus our attention on the inference of conserved secondary structure for a group of homologous RNA sequences. We present the caRNAc software which enables the analysis of families of homologous sequences without prior alignment. The method relies both on comparative analysis and thermodynamic information

    CG-seq: a toolbox for automatic annotation of genomes by comparative analysis

    CG-seq is a software pipeline that identifies functional regions, such as noncoding RNAs or protein-coding genes, in a genomic sequence by comparative analysis and multispecies comparison. It takes as input a genomic sequence to annotate and a set of sequences from a variety of species to compare against it, and returns a list of candidate regions with their annotation. The pipeline integrates several external software components for sequence analysis, along with new modules developed specifically for this purpose. CG-seq is distributed under the GPL licence and can be used from the command line or through a graphical user interface. It can be downloaded from http://bioinfo.lifl.fr/CGseq. A web version can also be run from the same website for input data of limited length.

    Biomanycores, open-source parallel code for many-core bioinformatics

    Biomanycores is a collection of bioinformatics tools designed to bridge the gap between research in OpenCL/CUDA high-performance computing on GPUs and other "manycore processors" and everyday bioinformaticians and biologists.

    DiNAMO: Exact method for degenerate IUPAC motifs discovery, characterization of sequence-specific errors

    Next-generation sequencing technologies are still associated with relatively high error rates, about 1%, which corresponds to thousands of errors at the scale of a complete genome. Each region therefore needs to be sequenced several times, and variants are usually filtered based on depth criteria. The significant number of artifacts that remain in spite of those filters shows the limits of conventional approaches and indicates that some sequencing artifacts are recurrent. This recurrence suggests that sequencing errors can depend on the upstream nucleotide sequence context. Our goal is to search for overrepresented motifs that tend to induce sequencing errors. Previous studies showed that some motifs, such as GGT [1,2], induce sequencing errors in Illumina technologies. However, these studies were dedicated to exact motifs and did not take approximate motifs into account, which limits their statistical power. On the other hand, some tools, such as FIRE [3], DREME [4] and Discrover [5], were developed to search for degenerate motifs over the 15-letter IUPAC alphabet in the context of ChIP-seq studies. However, these tools use greedy algorithms, which implies a lack of sensitivity. We therefore developed an exact algorithm that searches for degenerate motifs by enumerating all possible IUPAC motifs. This algorithm is based on mutual information and stores the motifs in hash tables combined with a graph data structure. It is independent of the sequencing technology. Experimental results on real data show that there are many overrepresented motifs upstream of sequencing artifacts. The artifacts are identified through the strand bias between forward and reverse reads. The homopolymer CCC, of length 3, seems to be sufficient to induce errors on IonTorrent. On Illumina, motifs are mainly composed of GGC followed by GGT (for example TGGCNGGT) or of homopolymers. We have also noticed a drop in base quality after the detected motifs.
Our exact algorithm requires less than one minute (Intel Core i5-4570 CPU, 3.20 GHz) and less than 2 GB of RAM to search for fully degenerate motifs of length 6 on a dataset of approximately 24,000 sequences, extracted from 11 exomes sequenced on IonTorrent Proton.
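As a rough illustration of the two ingredients named in the abstract above, degenerate IUPAC motifs and a mutual-information score, the following Python sketch matches a motif against sequences and scores it. It is not the DiNAMO implementation: the function names, the choice to score "sequence contains motif" against "sequence is in the positive set", and the brute-force matching are all assumptions made for illustration.

```python
import math

# The 15-letter IUPAC nucleotide alphabet: each code stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def matches_at(motif, seq, pos):
    """Return True if the degenerate motif matches seq starting at pos."""
    if pos + len(motif) > len(seq):
        return False
    return all(seq[pos + i] in IUPAC[code] for i, code in enumerate(motif))


def contains(motif, seq):
    """Return True if the motif occurs anywhere in seq."""
    return any(matches_at(motif, seq, p)
               for p in range(len(seq) - len(motif) + 1))


def mutual_information(motif, positives, negatives):
    """Mutual information (in bits) between motif presence and class label.

    Hypothetical scoring: 'positives' are sequences upstream of artifacts,
    'negatives' are control sequences.
    """
    counts = {
        (True, True): sum(contains(motif, s) for s in positives),
        (True, False): sum(contains(motif, s) for s in negatives),
    }
    counts[(False, True)] = len(positives) - counts[(True, True)]
    counts[(False, False)] = len(negatives) - counts[(True, False)]
    total = len(positives) + len(negatives)

    mi = 0.0
    for (present, label), n in counts.items():
        if n == 0:
            continue  # a zero cell contributes nothing to the sum
        p_joint = n / total
        p_present = (counts[(present, True)] + counts[(present, False)]) / total
        p_label = (counts[(True, label)] + counts[(False, label)]) / total
        mi += p_joint * math.log2(p_joint / (p_present * p_label))
    return mi
```

For instance, `matches_at("TGGCNGGT", "ATGGCAGGTC", 1)` is true because `N` accepts any base. An exhaustive search in this style would score every motif of a given length over the 15 codes, which is where an efficient motif store such as the one the abstract mentions becomes necessary.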

    Analysis of tree edit distance algorithms (Serge Dulucq)

    In this article, we study the behaviour of dynamic programming methods for the tree edit distance problem, such as [4] and [2]. We show that those two algorithms can be described in a more general framework of cover strategies. This analysis allows us to define a new tree edit distance algorithm that is optimal for cover strategies.

    Self-Overlapping Occurrences and Knuth-Morris-Pratt Algorithm for Weighted Matching

    Position Weight Matrices are broadly used probabilistic motif models. In this paper, we address the problem of identifying and characterizing potential overlaps between occurrences of such a motif. This has useful applications to the statistics of the number of occurrences, and to weighted pattern matching with an extension of the well-known Knuth-Morris-Pratt algorithm.
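    To make the notion of overlapping occurrences concrete, the sketch below scores words against a Position Weight Matrix and reports occurrences that overlap one another. It is a naive quadratic scan, not the Knuth-Morris-Pratt extension described in the paper; the matrix representation (one dict of base scores per column), the threshold convention, and all function names are assumptions made for illustration.

```python
def pwm_score(pwm, word):
    """Score of a word: the sum of the per-position scores of its bases."""
    return sum(column[base] for column, base in zip(pwm, word))


def pwm_occurrences(pwm, seq, threshold):
    """Start positions where the window score reaches the threshold."""
    m = len(pwm)
    return [i for i in range(len(seq) - m + 1)
            if pwm_score(pwm, seq[i:i + m]) >= threshold]


def overlapping_pairs(positions, motif_length):
    """Pairs of consecutive occurrences closer than one motif length."""
    return [(a, b) for a, b in zip(positions, positions[1:])
            if b - a < motif_length]


# Hypothetical 3-column matrix favouring the word ATA (integer scores).
PWM = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": -1, "T": 2},
    {"A": 2, "C": -1, "G": -1, "T": -1},
]

hits = pwm_occurrences(PWM, "ATATA", threshold=6)
print(hits, overlapping_pairs(hits, len(PWM)))
```

    With this matrix, the word ATA overlaps itself with a shift of 2, so the two occurrences in ATATA share a position. This kind of self-overlap is what complicates both occurrence statistics and shift-based matching.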

    Algorithms with Polynomial Interpretation Termination Proof

    We study the effect of polynomial interpretation termination proofs of deterministic (resp. non-deterministic) algorithms defined by confluent (resp. non-confluent) rewrite systems over data structures which include strings, lists and trees, and we classify them according to the interpretations of the constructors. This leads to the definition of six function classes which turn out to be exactly the deterministic (resp. non-deterministic) polynomial time, linear exponential time and linear doubly exponential time computable functions when the class is based on confluent (resp. non-confluent) rewrite systems. We also obtain a characterisation of the linear space computable functions. Finally, we demonstrate that functions with exponential interpretation termination proofs are super-elementary.
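    For readers unfamiliar with the technique, here is a standard textbook-style example of a polynomial interpretation termination proof, for unary addition. It is not taken from the paper; the rewrite system and the chosen polynomials are assumptions picked for illustration. Each symbol is mapped to a polynomial over the positive integers that is strictly increasing in every argument, and each rule must strictly decrease under the interpretation.

```latex
\begin{align*}
&\text{Rules:} && \mathrm{add}(0, y) \to y, \qquad \mathrm{add}(s(x), y) \to s(\mathrm{add}(x, y)) \\
&\text{Interpretation:} && [0] = 1, \quad [s](X) = X + 1, \quad [\mathrm{add}](X, Y) = 2X + Y \\
&\text{Rule 1:} && [\mathrm{add}(0, y)] = 2 + Y > Y = [y] \\
&\text{Rule 2:} && [\mathrm{add}(s(x), y)] = 2X + Y + 2 > 2X + Y + 1 = [s(\mathrm{add}(x, y))]
\end{align*}
```

    Both rules decrease strictly, so every rewrite sequence terminates. Here the constructor s is interpreted by a polynomial of the form X + c; the shape of the constructor interpretations is the criterion by which the abstract says the function classes are distinguished.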