
    Multiple Comparative Metagenomics using Multiset k-mer Counting

    Background. Large-scale metagenomic projects aim to extract biodiversity knowledge from different environmental conditions. Current methods for comparing microbial communities face important limitations. Those based on taxonomic or functional assignment rely on the small subset of sequences that can be associated with known organisms. On the other hand, de novo methods, which compare the whole sets of sequences, either do not scale up to ambitious metagenomic projects or do not provide precise and exhaustive results. Methods. These limitations motivated the development of a new de novo metagenomic comparative method, called Simka. This method computes a large collection of standard ecological distances by replacing species counts with k-mer counts. Simka scales up to today's metagenomic projects thanks to a new parallel k-mer counting strategy over multiple datasets. Results. Experiments on public Human Microbiome Project datasets demonstrate that Simka captures the essential underlying biological structure. Simka was able to compute, in a few hours, both qualitative and quantitative ecological distances on hundreds of metagenomic samples (690 samples, 32 billion reads). We also demonstrate that analyzing metagenomes at the k-mer level yields results highly correlated with those of very precise de novo comparison techniques relying on all-versus-all sequence alignment or on taxonomic profiling.
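    To illustrate the idea of replacing species counts with k-mer counts, the sketch below (plain standard C++, not Simka code; the function names and toy data are invented for the example) counts the k-mers of two small read sets and derives a Bray-Curtis-style dissimilarity from the two count vectors, one of the quantitative ecological distances mentioned above.

        // Minimal illustration (not Simka): comparing two read sets through their k-mer counts.
        #include <algorithm>
        #include <cstdint>
        #include <iostream>
        #include <string>
        #include <unordered_map>
        #include <vector>

        using KmerCounts = std::unordered_map<std::string, uint64_t>;

        // Count all k-mers of length k occurring in a set of reads.
        KmerCounts countKmers(const std::vector<std::string>& reads, std::size_t k) {
            KmerCounts counts;
            for (const auto& read : reads)
                for (std::size_t i = 0; i + k <= read.size(); ++i)
                    ++counts[read.substr(i, k)];
            return counts;
        }

        // Bray-Curtis dissimilarity computed on k-mer counts instead of species counts.
        double brayCurtis(const KmerCounts& a, const KmerCounts& b) {
            uint64_t shared = 0, totalA = 0, totalB = 0;
            for (const auto& [kmer, c] : a) {
                totalA += c;
                auto it = b.find(kmer);
                if (it != b.end()) shared += std::min(c, it->second);
            }
            for (const auto& [kmer, c] : b) totalB += c;
            return 1.0 - 2.0 * double(shared) / double(totalA + totalB);
        }

        int main() {
            std::vector<std::string> sampleA = {"ACGTACGTGG", "TTGCACGTAC"};
            std::vector<std::string> sampleB = {"ACGTACGAGG", "TTGCACGTAC"};
            std::cout << "Bray-Curtis (k=4): "
                      << brayCurtis(countKmers(sampleA, 4), countKmers(sampleB, 4)) << "\n";
        }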

    Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

    Data volumes generated by next-generation sequencing (NGS) technologies are now a major concern for both data storage and transmission. This triggered the need for more efficient methods than general-purpose compression tools, such as the widely used gzip method. We present a novel reference-free method meant to compress data produced by high-throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring k-mer and a list of bifurcations. The same probabilistic de Bruijn graph is used to perform a lossy transformation of the quality scores, which allows higher compression rates to be obtained without losing information pertinent to downstream analyses. LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq and metagenomics). In all cases, LEON showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole-genome sequencing dataset, LEON divided the original file size by more than 20. LEON is an open-source software, distributed under the GNU Affero GPL license, available for download at http://gatb.inria.fr/software/leon/.
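    The encoding principle described above can be sketched in a few lines of standard C++ (an illustration only, not LEON's implementation; the Bloom filter below uses two toy hash functions and arbitrary toy reads): the k-mers of the read set are inserted into a Bloom filter acting as a probabilistic de Bruijn graph, and a read is re-traced from an anchoring k-mer, recording the base taken whenever several successors exist in the graph.

        // Minimal illustration (not LEON): k-mers stored in a Bloom filter approximate a
        // de Bruijn graph; a read is re-traced from an anchoring k-mer, and only the
        // choices made at bifurcations (several existing successors) need to be recorded.
        #include <bitset>
        #include <functional>
        #include <iostream>
        #include <string>
        #include <vector>

        constexpr std::size_t BLOOM_BITS = 1 << 20;

        struct Bloom {
            std::bitset<BLOOM_BITS> bits;
            // Two cheap hash functions derived from std::hash (illustrative only).
            std::size_t h1(const std::string& s) const { return std::hash<std::string>{}(s) % BLOOM_BITS; }
            std::size_t h2(const std::string& s) const { return std::hash<std::string>{}(s + "#") % BLOOM_BITS; }
            void insert(const std::string& kmer) { bits.set(h1(kmer)); bits.set(h2(kmer)); }
            bool maybeContains(const std::string& kmer) const { return bits.test(h1(kmer)) && bits.test(h2(kmer)); }
        };

        int main() {
            // Two toy reads differing at one position, so that a bifurcation exists.
            std::vector<std::string> reads = {"ACGTACGTGGA", "ACGTATGTGGA"};
            const std::size_t k = 5;

            // Build the probabilistic de Bruijn graph: insert every k-mer.
            Bloom graph;
            for (const auto& r : reads)
                for (std::size_t i = 0; i + k <= r.size(); ++i)
                    graph.insert(r.substr(i, k));

            // Re-trace the first read and report each bifurcation encountered.
            const std::string& read = reads[0];
            std::string current = read.substr(0, k);            // anchoring k-mer
            for (std::size_t i = k; i < read.size(); ++i) {
                std::string branches;
                for (char c : std::string("ACGT"))
                    if (graph.maybeContains(current.substr(1) + c)) branches += c;
                if (branches.size() > 1)
                    std::cout << "bifurcation after " << current << ", took " << read[i] << "\n";
                current = current.substr(1) + read[i];
            }
        }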

    KLAST: fast and sensitive software to compare large genomic databanks on cloud

    As the genomic data generated by high-throughput sequencing machines continue to grow exponentially, the need for highly efficient bioinformatics tools to extract relevant knowledge from this mass of data does not weaken. Comparing sequences is still a major task in this discovery process, but it is increasingly time-consuming. KLAST is a sequence comparison software optimized to compare two nucleotide or protein data sets, typically a set of query sequences and a reference bank. The performance of KLAST comes from a new indexing scheme, an optimized seed-extend methodology, and a multi-level parallel implementation. To scale up to NGS data processing, a Hadoop version has been designed. Experiments demonstrate good scalability and a large speed-up over BLAST, the reference software in the field. In addition, computation can optionally be performed on compressed data without any loss in performance.
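    As a rough illustration of the seed-and-extend approach on which such tools are built (this is not KLAST's indexing scheme; sequences and parameters are chosen arbitrarily), the following sketch indexes the k-mers of a small reference bank in a hash table, looks up each query k-mer as a seed, and extends each seed without gaps for as long as the characters keep matching. Overlapping seeds produce redundant hits here; a real tool would merge and score them.

        // Minimal seed-and-extend illustration (not KLAST's algorithm).
        #include <iostream>
        #include <string>
        #include <unordered_map>
        #include <vector>

        int main() {
            const std::string reference = "ACGTTGCAACGTTGACCA";
            const std::string query     = "TTGCAACGTTGA";
            const std::size_t k = 6;

            // 1. Index every k-mer position of the reference (the "bank").
            std::unordered_map<std::string, std::vector<std::size_t>> index;
            for (std::size_t i = 0; i + k <= reference.size(); ++i)
                index[reference.substr(i, k)].push_back(i);

            // 2. For each query k-mer (seed), look it up and extend the match ungapped.
            for (std::size_t qi = 0; qi + k <= query.size(); ++qi) {
                auto it = index.find(query.substr(qi, k));
                if (it == index.end()) continue;
                for (std::size_t ri : it->second) {
                    std::size_t len = k;
                    while (qi + len < query.size() && ri + len < reference.size()
                           && query[qi + len] == reference[ri + len]) ++len;
                    std::cout << "hit: query[" << qi << "] vs ref[" << ri
                              << "], length " << len << "\n";
                }
            }
        }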

    Speeding up NGS software development

    The analysis of NGS data remains a time- and space-consuming task. Many efforts have been made to provide efficient data structures for indexing the terabytes of data generated by fast sequencing machines (suffix array, Burrows-Wheeler transform, Bloom filter, etc.). Mappers, genome assemblers, SNP callers, etc., make intensive use of these data structures to keep their memory footprint as low as possible. The overall efficiency of NGS software comes from a smart combination of how data are represented inside computer memory and how they are processed by the available processing units inside a processor. Developing such software is thus a real challenge, as it requires a large spectrum of competences, from high-level data structure and algorithm concepts to tiny implementation details. We have developed a C++ library, called GATB (Genomic Assembly and Analysis Tool Box), to speed up the design of NGS algorithms. This library offers a panel of high-level optimized building blocks. The underlying data structure is the de Bruijn graph, and the general parallelism model is multithreading. The GATB library targets standard computing resources such as current multicore processors (laptop computers, small servers) with a few GB of memory. Hence, from a high-level C++ API, NGS software designers can rapidly elaborate their own software based on state-of-the-art algorithms and data structures of the domain. To demonstrate the efficiency of the GATB library, several NGS tools have been designed, such as a contiger (Minia), a read corrector (Bloocoo) and an SNP discovery tool (DiscoSNP). The GATB library is written in C++ and is available at http://gatb.inria.fr under the GNU Affero GPL license.
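    As an illustration of the multithreading parallelism model mentioned above (a minimal sketch in standard C++, not the GATB API; the data and thread count are arbitrary), the fragment below counts k-mers across reads using per-thread local tables that are merged under a lock. A library building block would expose a similar pattern behind a higher-level interface.

        // Illustrative sketch only (this is not the GATB API): multithreaded k-mer
        // counting with per-thread local tables merged into a shared table at the end.
        #include <cstdint>
        #include <iostream>
        #include <mutex>
        #include <string>
        #include <thread>
        #include <unordered_map>
        #include <vector>

        int main() {
            std::vector<std::string> reads = {"ACGTACGT", "CGTACGTA", "GTACGTAC", "TACGTACG"};
            const std::size_t k = 4;
            const unsigned nThreads = 2;

            std::unordered_map<std::string, uint64_t> global;
            std::mutex mergeMutex;
            std::vector<std::thread> workers;

            for (unsigned t = 0; t < nThreads; ++t) {
                workers.emplace_back([&, t]() {
                    // Each thread counts the k-mers of its share of the reads locally...
                    std::unordered_map<std::string, uint64_t> local;
                    for (std::size_t i = t; i < reads.size(); i += nThreads)
                        for (std::size_t j = 0; j + k <= reads[i].size(); ++j)
                            ++local[reads[i].substr(j, k)];
                    // ...then merges its counts into the shared table under a lock.
                    std::lock_guard<std::mutex> lock(mergeMutex);
                    for (const auto& [kmer, c] : local) global[kmer] += c;
                });
            }
            for (auto& w : workers) w.join();

            for (const auto& [kmer, c] : global) std::cout << kmer << "\t" << c << "\n";
        }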

    Quality metrics for benchmarking sequences comparison tools

    Comparing sequences is a daily task in bioinformatics, and many software tools try to fulfil this need by offering fast execution times and accurate results. Introducing a new tool in this field requires comparing it to recognized tools with the help of well-defined metrics. A set of quality metrics is proposed that enables a systematic approach to comparing alignment tools. These metrics have been implemented in a dedicated software package, allowing textual and graphical benchmark artifacts to be produced.
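    A minimal example of this kind of metric (assumed for illustration; the paper's own metric definitions are not reproduced here) is to compare the hits reported by a candidate tool against those of a trusted reference tool and derive recall and precision:

        // Illustrative sketch: recall and precision of a candidate tool's hits
        // with respect to a trusted tool's hits, each hit being a (query, subject) pair.
        #include <iostream>
        #include <set>
        #include <string>
        #include <utility>

        int main() {
            using Hit = std::pair<std::string, std::string>;   // (query id, subject id)
            std::set<Hit> trusted   = {{"q1", "s1"}, {"q2", "s3"}, {"q3", "s2"}};
            std::set<Hit> candidate = {{"q1", "s1"}, {"q2", "s3"}, {"q4", "s9"}};

            std::size_t shared = 0;
            for (const auto& h : candidate)
                if (trusted.count(h)) ++shared;

            double recall    = double(shared) / trusted.size();    // trusted hits recovered
            double precision = double(shared) / candidate.size();  // reported hits that are trusted
            std::cout << "recall " << recall << ", precision " << precision << "\n";
        }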

    KLAST: a new high-­performance sequence similarity search tool

    KLAST is a fast, accurate and NGS-scalable bank-to-bank sequence similarity search tool providing significant accelerations of seed-based heuristic comparison methods such as the BLAST suite. Relying on a unique software architecture, KLAST takes full advantage of recent multi-core personal computers without requiring any additional hardware. KLAST is a new, optimized implementation of the PLAST algorithm (1), to which several improvements have been made. KLAST is designed to compare query and subject sets comprising large collections of DNA, RNA and protein sequences, using the KLASTn, KLASTp, KLASTx, tKLASTx and tKLASTn methods. It is significantly faster than the original PLAST, while providing sensitivity comparable to the BLAST and SSearch algorithms. KLAST contains a fully integrated data-filtering engine capable of selecting relevant hits with user-defined criteria (E-value, identity, coverage, alignment length, etc.). KLAST has been benchmarked on metagenomic data sets from the Tara Oceans International Research Project (2). The main goal of the test was to evaluate the speed-up and the quality of results obtained by KLAST in comparison with BLAST, which is usually used at Genoscope to run sequence comparisons. Quality was evaluated in two ways: first, by comparing the crude results from both tools, i.e. how many of the results from BLAST are also found by KLAST; second, by using the results from both tools to assign each query to a taxonomy entry. KLAST achieved sequence comparisons up to 18 times faster than BLAST, while covering up to 96% of the results produced by BLAST. This benchmark illustrates the benefits of using KLAST, both in terms of result quality and speed, for deciphering the Tara Oceans metagenomic data. To provide users with an advanced sequence similarity search platform, the KLAST engine has been integrated into several software tools, from the command line up to full-featured graphical data analysis platforms such as ngKLAST, KNIME and CLC bio's Genomics Workbench. In all cases, the KLAST system provides an integrated algorithm suite that automatically processes analysis workflows including similarity searches, hit annotation and data filtering.
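    As an illustration of the data-filtering idea (a sketch only, not KLAST's filtering engine; the hit values and thresholds are arbitrary), the fragment below keeps the hits that satisfy user-defined criteria on E-value, identity and query coverage:

        // Illustrative sketch: filtering alignment hits with user-defined thresholds.
        #include <iostream>
        #include <vector>

        struct Hit {
            double eValue;     // statistical significance of the alignment
            double identity;   // percent identity of the alignment
            double coverage;   // percent of the query covered by the alignment
        };

        int main() {
            std::vector<Hit> hits = {{1e-30, 98.0, 95.0}, {1e-3, 70.0, 40.0}, {1e-12, 85.0, 80.0}};

            // User-defined criteria (values chosen arbitrarily for the example).
            const double maxEValue = 1e-5, minIdentity = 80.0, minCoverage = 60.0;

            std::vector<Hit> kept;
            for (const auto& h : hits)
                if (h.eValue <= maxEValue && h.identity >= minIdentity && h.coverage >= minCoverage)
                    kept.push_back(h);

            std::cout << kept.size() << " of " << hits.size() << " hits pass the filters\n";
        }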

    From medico-administrative databases analysis to care trajectories analytics: an example with the French SNDS

    Medico-administrative data such as the SNDS (Système National des Données de Santé) are not collected primarily for epidemiological purposes. Moreover, the data model and the tools proposed to SNDS users make their in-depth exploitation difficult. We propose a data model, called the ePEPS model, based on healthcare trajectories, to provide a medical view of the raw data. A data abstraction process enables the clinician to have an intuitive medical view of the raw data and to design a study-specific view. This view is based on a generic model of care trajectory, that is, a sequence of time-stamped medical events for a given patient. This model is combined with tools to manipulate care trajectories efficiently. Medico-administrative databases hold rich information about healthcare trajectories (or healthcare pathways) at the individual level. Such data are very valuable for carrying out pharmaco-epidemiological studies on large, representative cohorts of patients in real-life conditions. Moreover, historical data are readily available for longitudinal analysis of care trajectories. These opportunities are offered by the database of the French healthcare system, the so-called SNDS database, which covers 98.8% of the French population with a sliding period of three years. A classical pharmaco-epidemiological study on medico-administrative databases consists of three main steps: (i) defining the inclusion and exclusion criteria of a cohort, (ii) specifying proxies for events of interest, and (iii) analyzing the transformed data. In practice, these three steps are closely intertwined and make use of digital data management tools (e.g., SQL databases, R, or SAS). The study outcomes depend on the available data as much as on the tools to manage and process them. But the data model (that is, the abstract model describing how the data are organized; in a relational database, the description of the tables, their attributes and their relations), designed for administrative purposes, is not suitable for pharmaco-epidemiological studies without careful data preparation. This makes it difficult for epidemiologists to access the useful information, or even to know what can be obtained from such databases. For instance, the SNDS database is a relational database with hundreds of tables and very complex join relations; the set of drugs prescribed to a patient is only accessible through a query containing ten join relations involving attributes with unintuitive names. Mastering data management with such complex models requires a lot of time, good knowledge of the database content, and some technical skills. It is a practical bottleneck to exploiting the potential of the database.
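    A minimal sketch of what a generic care-trajectory model can look like (illustrative standard C++, not the ePEPS model itself; event types and codes are arbitrary example values) represents a patient's trajectory as a chronologically ordered sequence of time-stamped medical events and offers a small helper to extract the events of a given type:

        // Illustrative sketch (not the ePEPS model): a care trajectory as a time-stamped
        // sequence of medical events for one patient.
        #include <algorithm>
        #include <iostream>
        #include <string>
        #include <vector>

        struct MedicalEvent {
            long timestamp;        // e.g. days since an arbitrary origin
            std::string type;      // "drug", "hospital_stay", "procedure", ...
            std::string code;      // identifier in some coding system (example values only)
        };

        struct CareTrajectory {
            std::string patientId;
            std::vector<MedicalEvent> events;   // kept sorted by timestamp

            std::vector<MedicalEvent> eventsOfType(const std::string& type) const {
                std::vector<MedicalEvent> out;
                for (const auto& e : events)
                    if (e.type == type) out.push_back(e);
                return out;
            }
        };

        int main() {
            CareTrajectory t{"patient-42",
                             {{100, "drug", "N02BE01"}, {130, "hospital_stay", "I21"}, {135, "drug", "B01AC06"}}};
            std::sort(t.events.begin(), t.events.end(),
                      [](const MedicalEvent& a, const MedicalEvent& b) { return a.timestamp < b.timestamp; });
            for (const auto& e : t.eventsOfType("drug"))
                std::cout << t.patientId << " day " << e.timestamp << " " << e.code << "\n";
        }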

    Simka: large scale de novo comparative metagenomics

    Reference paper: Benoit et al. (2016) Multiple comparative metagenomics using multiset k-mer counting. PeerJ Computer Science. https://doi.org/10.7717/peerj-cs.94. Simka: large scale de novo comparative metagenomics. JOBIM 2017 - Journées Ouvertes Biologie Informatique Mathématique.

    GATB: Genome Assembly & Analysis Tool Box

    Motivation: Efficient and fast NGS algorithms are essential to analyze the terabytes of data generated by next-generation sequencing machines. A serious bottleneck can be the design of such algorithms, as they require sophisticated data structures and advanced hardware implementation. Results: We propose an open-source library dedicated to genome assembly and analysis to speed up the process of developing efficient software. The library is based on a recent optimized de Bruijn graph implementation allowing complex genomes to be processed on desktop computers using fast algorithms with very low memory footprints. Availability and implementation: The GATB library is written in C++ and is available at http://gatb.inria.fr under the A-GPL license.
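    For readers unfamiliar with the central data structure, the sketch below (plain standard C++, not GATB's optimized implementation; the reads are a toy example) builds the simplest possible de Bruijn graph, with k-mers as nodes and an edge whenever the (k-1)-suffix of one node equals the (k-1)-prefix of another:

        // Illustrative sketch (not GATB): a naive de Bruijn graph over a toy read set.
        #include <iostream>
        #include <string>
        #include <unordered_set>
        #include <vector>

        int main() {
            std::vector<std::string> reads = {"ACGTACGTGG", "CGTACGTGGA"};
            const std::size_t k = 5;

            // Nodes of the graph: all distinct k-mers of the read set.
            std::unordered_set<std::string> nodes;
            for (const auto& r : reads)
                for (std::size_t i = 0; i + k <= r.size(); ++i)
                    nodes.insert(r.substr(i, k));

            // Enumerate out-edges: a successor of node n is n's (k-1)-suffix plus one base.
            for (const auto& n : nodes)
                for (char c : std::string("ACGT")) {
                    std::string succ = n.substr(1) + c;
                    if (nodes.count(succ))
                        std::cout << n << " -> " << succ << "\n";
                }
        }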

    GATB: a software toolbox for genome assembly and analysis

    The analysis of NGS data remains a time- and space-consuming task. Many efforts have been made to provide efficient data structures for indexing the terabytes of data generated by fast sequencing machines (suffix array, Burrows-Wheeler transform, Bloom filter, etc.). Mappers, genome assemblers, SNP callers, etc., make intensive use of these data structures to keep their memory footprint as low as possible. The overall efficiency of NGS software comes from a smart combination of how data are represented inside computer memory and how they are processed by the available processing units inside a processor. Developing such software is thus a real challenge, as it requires a large spectrum of competences, from high-level data structure and algorithm concepts to tiny implementation details.
    The GATB software toolbox aims to lighten the design of NGS algorithms. It offers a panel of high-level optimized building blocks to speed up the development of NGS tools related to genome assembly and/or genome analysis. The underlying data structure is the de Bruijn graph, and the general parallelism model is multithreading. The GATB library targets standard computing resources such as current multicore processors (laptop computers, small servers) with a few GB of memory. From a high-level C++ API, NGS software designers can rapidly elaborate their own software based on state-of-the-art algorithms and data structures of the domain. The GATB library is written in C++ and is available at http://gatb.inria.fr under the GNU Affero GPL license.
    From the GATB toolbox, various software tools targeting specific genomic treatments have been designed; a short list of tools currently available follows, and many other tools are under development. Minia is a short-read assembler capable of assembling large and complex genomes into contigs on a desktop computer. The assembler produces contigs of similar length and accuracy compared to other assemblers. As an example, a Boa constrictor constrictor (1.6 Gbp) dataset (Illumina 2x120 bp reads, 125x coverage) from Assemblathon 2 can be processed in approximately 45 hours and 3 GB of memory on a standard computer (3.4 GHz 8-core processor) using a single core, yielding a contig N50 of 3.6 Kbp (prior to scaffolding and gap-filling). Bloocoo is a k-mer spectrum-based read error corrector, designed to correct large datasets with a very low memory footprint. The correction procedure is similar to the Musket multistage approach; Bloocoo yields similar results while requiring far less memory: as an example, it can correct whole-human-genome re-sequencing reads at 70x coverage with less than 4 GB of memory. DiscoSNP aims to discover Single Nucleotide Polymorphisms (SNPs) from non-assembled reads. Applied to a mouse dataset (2.88 Gbp, 100 bp Illumina reads), DiscoSNP takes 34 hours and at most 4.5 GB of RAM. In the same spirit, the TakeABreak software discovers inversions from non-assembled reads; it directly finds particular patterns in the de Bruijn graph, and provides execution performance similar to DiscoSNP.
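    The k-mer spectrum idea behind a corrector such as Bloocoo can be sketched as follows (an illustration under simplifying assumptions, not Bloocoo's actual procedure; the reads and threshold are toy values): k-mers seen fewer times than a threshold are considered weak, a base covered only by weak k-mers is suspicious, and a substitution that makes at least one covering k-mer solid again is reported as a plausible correction.

        // Illustrative sketch (not Bloocoo): k-mer spectrum based error detection and
        // single-base correction on a toy read set whose last read carries one error.
        #include <cstdint>
        #include <iostream>
        #include <string>
        #include <unordered_map>
        #include <vector>

        using Counts = std::unordered_map<std::string, uint64_t>;

        // True if at least one k-mer covering position pos has a solid (frequent) count.
        bool solidAround(const std::string& read, std::size_t pos, std::size_t k,
                         const Counts& counts, uint64_t threshold) {
            std::size_t start = pos + 1 >= k ? pos + 1 - k : 0;
            for (std::size_t i = start; i <= pos && i + k <= read.size(); ++i) {
                auto it = counts.find(read.substr(i, k));
                if (it != counts.end() && it->second >= threshold) return true;
            }
            return false;
        }

        int main() {
            // Toy read set: the last read carries a sequencing error (G -> T at position 6).
            std::vector<std::string> reads = {"ACGTACGTGG", "ACGTACGTGG", "ACGTACGTGG", "ACGTACTTGG"};
            const std::size_t k = 5;
            const uint64_t threshold = 2;

            Counts counts;
            for (const auto& r : reads)
                for (std::size_t i = 0; i + k <= r.size(); ++i)
                    ++counts[r.substr(i, k)];

            std::string read = reads.back();
            for (std::size_t pos = 0; pos < read.size(); ++pos) {
                if (solidAround(read, pos, k, counts, threshold)) continue;   // position looks fine
                for (char c : std::string("ACGT")) {                          // try each substitution
                    std::string candidate = read;
                    candidate[pos] = c;
                    if (solidAround(candidate, pos, k, counts, threshold)) {
                        std::cout << "position " << pos << ": " << read[pos]
                                  << " -> " << c << " restores solid k-mers\n";
                        break;
                    }
                }
            }
        }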