17 research outputs found

    A Coverage Criterion for Spaced Seeds and its Applications to Support Vector Machine String Kernels and k-Mer Distances

    Get PDF
    Spaced seeds have been recently shown to not only detect more alignments, but also to give a more accurate measure of phylogenetic distances (Boden et al., 2013, Horwege et al., 2014, Leimeister et al., 2014), and to provide a lower misclassification rate when used with Support Vector Machines (SVMs) (On-odera and Shibuya, 2013), We confirm by independent experiments these two results, and propose in this article to use a coverage criterion (Benson and Mak, 2008, Martin, 2013, Martin and No{\'e}, 2014), to measure the seed efficiency in both cases in order to design better seed patterns. We show first how this coverage criterion can be directly measured by a full automaton-based approach. We then illustrate how this criterion performs when compared with two other criteria frequently used, namely the single-hit and multiple-hit criteria, through correlation coefficients with the correct classification/the true distance. At the end, for alignment-free distances, we propose an extension by adopting the coverage criterion, show how it performs, and indicate how it can be efficiently computed.Comment: http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.017

    Indexing k-mers in linear space for quality value compression.

    Get PDF
    Many bioinformatics tools heavily rely on [Formula: see text]-mer dictionaries to describe the composition of sequences and allow for faster reference-free algorithms or look-ups. Unfortunately, naive [Formula: see text]-mer dictionaries are very memory-inefficient, requiring very large amount of storage space to save each [Formula: see text]-mer. This problem is generally worsened by the necessity of an index for fast queries. In this work, we discuss how to build an indexed linear reference containing a set of input [Formula: see text]-mers and its application to the compression of quality scores in FASTQ files. Most of the entropies of sequencing data lie in the quality scores, and thus they are difficult to compress. Here, we present an application to improve the compressibility of quality values while preserving the information for SNP calling. We show how a dictionary of significant [Formula: see text]-mers, obtained from SNP databases and multiple genomes, can be indexed in linear space and used to improve the compression of quality value. Availability: The software is freely available at https://github.com/yhhshb/yalff

    Indexing k

    No full text

    Iterative Spaced Seed Hashing: Closing the Gap Between Spaced Seed Hashing and k-mer Hashing

    No full text
    International audienceAlignment-free classification of sequences has enabled high-throughput processing of sequencing data in many bioinformatics pipelines. Much work has been done to speed-up the indexing of k-mers through hash-table and other data structures. These efforts have led to very fast indexes, but because they are k-mer based, they often lack sensitivity due to sequencing errors or polymorphisms. Spaced seeds are a special type of pattern that accounts for errors or mutations. They allow to improve the sensitivity and they are now routinely used instead of k-mers in many applications. The major drawback of spaced seeds is that they cannot be efficiently hashed and thus their usage increases substantially the computational time. In this paper we address the problem of efficient spaced seed hashing. We propose an iterative algorithm that combines multiple spaced seed hashes by exploiting the similarity of adjacent hash values in order to efficiently compute the next hash. We report a series of experiments on HTS reads hashing, with several spaced seeds. Our algorithm can compute the hashing values of spaced seeds with a speedup of 6.2x, outperforming previous methods. Software and Datasets are available at ISS

    Improving Metagenomic Classification Using Discriminative k-mers from Sequencing Data

    No full text
    The major problem when analyzing a metagenomic sample is to taxonomically annotate its reads to identify the species they contain. Most of the methods currently available focus on the classification of reads using a set of reference genomes and their k-mers. While in terms of precision these methods have reached percentages of correctness close to perfection, in terms of recall (the actual number of classified reads) the performances fall at around 50%. One of the reasons is the fact that the sequences in a sample can be very different from the corresponding reference genome, e.g. viral genomes are highly mutated. To address this issue, in this paper we study the problem of metagenomic reads classification by improving the reference k-mers library with novel discriminative k-mers from the input sequencing reads. We evaluated the performance in different conditions against several other tools and the results showed an improved F-measure, especially when close reference genomes are not available. Availability: https://github.com/davide92/K2Mem.gi

    Fast Approximation of Frequent k-mers and Applications to Metagenomics

    No full text
    Estimating the abundances of all kk-mers in a set of biological sequences is a fundamental and challenging problem with many applications in biological analysis. While several methods have been designed for the exact or approximate solution of this problem, they all require to process the entire dataset, that can be extremely expensive for high-throughput sequencing datasets. While in some applications it is crucial to estimate all kk-mers and their abundances, in other situations reporting only frequent kk-mers, that appear with relatively high frequency in a dataset, may suffice. This is the case, for example, in the computation of kk-mers' abundance-based distances among datasets of reads, commonly used in metagenomic analyses. In this work, we develop, analyze, and test, a sampling-based approach, called SAKEIMA, to approximate the frequent kk-mers and their frequencies in a high-throughput sequencing dataset while providing rigorous guarantees on the quality of the approximation. SAKEIMA employs an advanced sampling scheme and we show how the characterization of the VC dimension, a core concept from statistical learning theory, of a properly defined set of functions leads to practical bounds on the sample size required for a rigorous approximation. Our experimental evaluation shows that SAKEIMA allows to rigorously approximate frequent kk-mers by processing only a fraction of a dataset and that the frequencies estimated by SAKEIMA lead to accurate estimates of kk-mer based distances between high-throughput sequencing datasets. Overall, SAKEIMA is an efficient and rigorous tool to estimate kk-mers abundances providing significant speed-ups in the analysis of large sequencing datasets.Comment: Accepted for RECOMB 201

    SimBA: A methodology and tools for evaluating the performance of RNA-Seq bioinformatic pipelines

    No full text
    International audienceBACKGROUND:The evolution of next-generation sequencing (NGS) technologies has led to increased focus on RNA-Seq. Many bioinformatic tools have been developed for RNA-Seq analysis, each with unique performance characteristics and configuration parameters. Users face an increasingly complex task in understanding which bioinformatic tools are best for their specific needs and how they should be configured. In order to provide some answers to these questions, we investigate the performance of leading bioinformatic tools designed for RNA-Seq analysis and propose a methodology for systematic evaluation and comparison of performance to help users make well informed choices.RESULTS:To evaluate RNA-Seq pipelines, we developed a suite of two benchmarking tools. SimCT generates simulated datasets that get as close as possible to specific real biological conditions accompanied by the list of genomic incidents and mutations that have been inserted. BenchCT then compares the output of any bioinformatics pipeline that has been run against a SimCT dataset with the simulated genomic and transcriptional variations it contains to give an accurate performance evaluation in addressing specific biological question. We used these tools to simulate a real-world genomic medicine question s involving the comparison of healthy and cancerous cells. Results revealed that performance in addressing a particular biological context varied significantly depending on the choice of tools and settings used. We also found that by combining the output of certain pipelines, substantial performance improvements could be achieved.CONCLUSION:Our research emphasizes the importance of selecting and configuring bioinformatic tools for the specific biological question being investigated to obtain optimal results. Pipeline designers, developers and users should include benchmarking in the context of their biological question as part of their design and quality control process. Our SimBA suite of benchmarking tools provides a reliable basis for comparing the performance of RNA-Seq bioinformatics pipelines in addressing a specific biological question. We would like to see the creation of a reference corpus of data-sets that would allow accurate comparison between benchmarks performed by different groups and the publication of more benchmarks based on this public corpus. SimBA software and data-set are available at http://cractools.gforge.inria.fr/softwares/simba/
    corecore