36 research outputs found

    A note on the shortest common superstring of NGS reads

    The Shortest Superstring Problem (SSP) consists, for a set of strings S = {s_1,...,s_n}, in finding a minimum-length string that contains all s_i, 1 <= i <= n, as substrings. This problem has been proved to be NP-Complete and APX-hard. Guaranteed approximation algorithms have been proposed, the current best ratio being 2+11/23, which was achieved after a long and difficult quest. However, SSP is widely used in practice on next generation sequencing (NGS) data, which plays an increasingly important role in sequencing. In this note, we show that the SSP approximation ratio can be improved on NGS reads by assuming specific characteristics of NGS data that are experimentally verified on a very large sampling set.
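    To make the problem statement concrete, here is a minimal Python sketch of the classical greedy merge heuristic for superstrings. It is not the algorithm analysed in the note above; the function names `overlap` and `greedy_superstring` are illustrative, and the quadratic pair scan is only suitable for small inputs.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings: list[str]) -> str:
    """Repeatedly merge the pair of strings with the largest overlap."""
    unique = set(strings)
    # Strings contained in another string are already covered by it.
    pool = sorted(s for s in unique
                  if not any(s != t and s in t for t in unique))
    while len(pool) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Glue pool[best_j] onto pool[best_i], sharing the overlapping part.
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [s for idx, s in enumerate(pool)
                if idx not in (best_i, best_j)] + [merged]
    return pool[0] if pool else ""


reads = ["ACGT", "CGTA", "GTAC"]
superstring = greedy_superstring(reads)
print(superstring, all(r in superstring for r in reads))
```

    The result is always a valid superstring of the input, but the greedy choice gives no guarantee of minimality in general.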

    On improving the approximation ratio of the r-shortest common superstring problem

    The Shortest Common Superstring problem (SCS) consists, for a set of strings S = {s_1,...,s_n}, in finding a minimum-length string that contains all s_i, 1 <= i <= n, as substrings. While a 2+11/30 approximation ratio algorithm has recently been published, the general objective is now to break the conceptual lower bound barrier of 2. This paper is a step forward in this direction. Here we focus on a particular instance of the SCS problem, namely the r-SCS problem, which requires all input strings to be of the same length, r. Golovnev et al. proved an approximation ratio which is better than the general one for r <= 6. Here we extend their approach and improve their approximation ratio, which is now better than the general one for r <= 7, and less than or equal to 2 up to r = 6.

    Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

    Data volumes generated by next-generation sequencing (NGS) technologies are now a major concern for both data storage and transmission. This has triggered the need for more efficient methods than general-purpose compression tools, such as the widely used gzip method. We present a novel reference-free method meant to compress data issued from high throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn Graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring k-mer and a list of bifurcations. The same probabilistic de Bruijn Graph is used to perform a lossy transformation of the quality scores, which yields higher compression rates without losing pertinent information for downstream analyses. LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq or metagenomics). In all cases, LEON showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole genome sequencing dataset, LEON divided the original file size by more than 20. LEON is open source software, distributed under the GNU Affero GPL license, available for download at http://gatb.inria.fr/software/leon/
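    The idea of storing a read as an anchoring k-mer plus a list of bifurcations can be sketched as follows. This is a simplified illustration, not LEON's implementation: it assumes an exact Python set in place of the Bloom filter, always anchors on the first k-mer of the read, ignores quality scores and sequencing errors, and requires reads of length at least K. The names `K`, `build_kmer_set`, `successors`, `encode` and `decode` are hypothetical.

```python
K = 4  # hypothetical k-mer size, chosen small for the example


def build_kmer_set(reads, k=K):
    """All k-mers of the reads; an exact stand-in for the Bloom filter."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def successors(kmer, kmers):
    """Bases that extend kmer to another k-mer present in the graph."""
    return [b for b in "ACGT" if kmer[1:] + b in kmers]


def encode(read, kmers, k=K):
    """Return (anchor, read length, bases chosen at bifurcations)."""
    anchor = read[:k]
    bifurcations = []
    kmer = anchor
    for base in read[k:]:
        # Only store the base when the graph does not determine it.
        if len(successors(kmer, kmers)) != 1:
            bifurcations.append(base)
        kmer = kmer[1:] + base
    return anchor, len(read), bifurcations


def decode(anchor, length, bifurcations, kmers):
    """Walk the graph from the anchor, consuming one stored base per bifurcation."""
    out = anchor
    kmer = anchor
    choices = iter(bifurcations)
    while len(out) < length:
        nxt = successors(kmer, kmers)
        base = nxt[0] if len(nxt) == 1 else next(choices)
        out += base
        kmer = kmer[1:] + base
    return out


reads = ["ACGTACGGA", "ACGTTCGGA"]
kmers = build_kmer_set(reads)
for r in reads:
    assert decode(*encode(r, kmers), kmers) == r
```

    Because encoder and decoder query the same k-mer structure, false positives in a real Bloom filter would only add extra bifurcations; they would not break the round trip.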

    Consensus clustering applied to multi-omics disease subtyping

    Background: Facing the diversity of omics data and the difficulty of selecting one result over all those produced by several methods, consensus strategies have the potential to reconcile multiple inputs and to produce robust results. Results: Here, we introduce ClustOmics, a generic consensus clustering tool that we use in the context of cancer subtyping. ClustOmics relies on a non-relational graph database, which allows for the simultaneous integration of both multiple omics data and results from various clustering methods. This new tool reconciles input clusterings, regardless of their origin, their number, their size or their shape. ClustOmics implements an intuitive and flexible strategy, based upon the idea of evidence accumulation clustering. ClustOmics computes co-occurrences of pairs of samples in input clusters and uses this score as a similarity measure to reorganize data into consensus clusters. Conclusion: We applied ClustOmics to multi-omics disease subtyping on real TCGA cancer data from ten different cancer types. We showed that ClustOmics is robust to heterogeneous qualities of input partitions, smoothing and reconciling preliminary predictions into high-quality consensus clusters, both from a computational and a biological point of view. The comparison to a state-of-the-art consensus-based integration tool, COCA, further corroborated this statement. However, the main interest of ClustOmics is not to compete with other tools, but rather to benefit from their various predictions when no gold-standard metric is available to assess their significance. Availability: The ClustOmics source code, released under MIT license, and the results obtained on TCGA cancer data are available on GitHub: https://github.com/galadrielbriere/ClustOmics
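    The co-occurrence score at the heart of evidence accumulation clustering can be illustrated with a short Python sketch. It is not the ClustOmics code: the grouping step (connected components of pairs supported by at least `min_support` input clusterings) is an assumption made for the example, and the names `cooccurrence` and `consensus_clusters` are hypothetical.

```python
from itertools import combinations


def cooccurrence(clusterings):
    """Count, for each pair of samples, how many clusterings group them together.

    clusterings: list of dicts mapping sample name -> cluster label.
    """
    counts = {}
    for labels in clusterings:
        for a, b in combinations(sorted(labels), 2):
            if labels[a] == labels[b]:
                counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


def consensus_clusters(clusterings, min_support):
    """Group samples linked by pairs co-clustered at least min_support times."""
    counts = cooccurrence(clusterings)
    samples = sorted({s for labels in clusterings for s in labels})
    parent = {s: s for s in samples}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), c in counts.items():
        if c >= min_support:
            parent[find(a)] = find(b)

    groups = {}
    for s in samples:
        groups.setdefault(find(s), []).append(s)
    return sorted(groups.values())


inputs = [
    {"s1": 0, "s2": 0, "s3": 1, "s4": 1},
    {"s1": "a", "s2": "a", "s3": "a", "s4": "b"},
    {"s1": 1, "s2": 1, "s3": 2, "s4": 2},
]
print(consensus_clusters(inputs, min_support=2))
```

    Cluster labels only need to be comparable within a single input clustering, so results from different methods can be mixed without relabelling.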

    Reference-free detection of isolated SNPs

    Detecting Single Nucleotide Polymorphisms (SNPs) between genomes is becoming a routine task with Next Generation Sequencing. Generally, SNP detection methods use a reference genome. As non-model organisms are increasingly investigated, the need for reference-free methods has been amplified. Most of the existing reference-free methods have fundamental limitations: they can only call SNPs between exactly two datasets, and/or they require a prohibitive amount of computational resources. The method we propose, DISCOSNP, detects both heterozygous and homozygous isolated SNPs from any number of read datasets, without a reference genome, and with very low memory and time footprints (billions of reads can be analyzed with a standard desktop computer). To facilitate downstream genotyping analyses, DISCOSNP ranks predictions and outputs quality and coverage per allele. Compared to finding isolated SNPs using a state-of-the-art assembly and mapping approach, DISCOSNP requires significantly fewer computational resources, shows similar precision/recall values, and highly ranked predictions are less likely to be false positives. An experimental validation was conducted on an arthropod species (the tick Ixodes ricinus) on which de novo sequencing was performed. Among the predicted SNPs that were tested, 96% were successfully genotyped and truly exhibited polymorphism.

    Colib'read on galaxy : a tools suite dedicated to biological information extraction from raw NGS reads

    Background: With next-generation sequencing (NGS) technologies, the life sciences face a deluge of raw data. Classical analysis processes for such data often begin with an assembly step, needing large amounts of computing resources, and potentially removing or modifying parts of the biological information contained in the data. Our approach proposes to focus directly on biological questions, by considering raw unassembled NGS data, through a suite of six command-line tools. Findings: Dedicated to 'whole-genome assembly-free' treatments, the Colib'read tools suite uses optimized algorithms for various analyses of NGS datasets, such as variant calling or read set comparisons. Based on the use of a de Bruijn graph and Bloom filter, such analyses can be performed in a few hours, using small amounts of memory. Applications using real data demonstrate the good accuracy of these tools compared to classical approaches. To facilitate data analysis and tools dissemination, we developed Galaxy tools and tool shed repositories. Conclusions: With the Colib'read Galaxy tools suite, we enable a broad range of life scientists to analyze raw NGS data. More importantly, our approach allows the maximum biological information to be retained in the data, and uses a very low memory footprint.

    Algorithmes de comparaison de génomes appliqués aux génomes bactériens

    With more than 1000 complete genomes available (the vast majority of which come from bacteria), comparative genomic analyses have become essential for the functional annotation of genomes and the understanding of their structure and evolution, and have applications in phylogenomics and vaccine design. One of the main approaches for comparing genomes is aligning their DNA sequences, i.e. whole genome alignment (WGA), which means identifying the similarity regions without any prior annotation knowledge. Despite significant improvements in recent years, reliable tools for WGA and a methodology for estimating its quality, in particular for bacterial genomes, still need to be designed. Besides their extremely large lengths, which make classical dynamic programming alignment methods unsuitable, aligning whole genomes involves several additional difficulties, due to the mechanisms through which genomes evolve: divergence, which lets sequence similarity vanish over time, the reordering of genomic segments (rearrangements), and the acquisition of external genetic material generating regions that are unalignable between sequences, e.g. horizontal gene transfer, phages. Therefore, whole genome alignment tools implement heuristics, among which the most common is the anchor-based strategy. It starts by detecting an initial set of similarity regions (phase 1), and, through a chaining phase (phase 2), selects a non-overlapping, maximum-weighted, usually collinear subset of those similarities, called anchors. Phases 1 and 2 are recursively applied on yet unaligned regions (phase 3). The last phase (phase 4) consists in systematically applying classical alignment tools to all short regions still left unaligned. This thesis addresses several problems related to whole genome alignment: the evaluation of the quality of results given by WGA tools and the improvement of the classical anchor-based strategy. 
We first designed a protocol for evaluating the quality of alignment results, based on both computational and biological measures. An evaluation of the results given by two state-of-the-art WGA tools on pairs of intra-species bacterial genomes revealed their shortcomings: the failure to detect some of the similarities between sequences and the misalignment of some regions. Based on these results, which imply a lack of both sensitivity and specificity, we propose a novel pairwise whole genome alignment tool, YOC, implementing a simplified two-phase version of the anchor strategy. In phase 1, YOC improves sensitivity by using as anchors, for the first time, local similarities based on spaced seeds that are capable of detecting larger similarity regions in divergent sequences. This phase is followed by a chaining method adapted to local similarities, a novel type of collinear chaining, allowing for proportional overlaps. We give a formulation for this novel problem and provide the first algorithm for it. The algorithm, implementing a dynamic programming approach based on the sweep-line paradigm, is exact and runs in quadratic time. We show that, compared to classical collinear chaining, chaining with overlaps improves results on real bacterial data, while remaining almost as efficient in practice. Our novel tool, YOC, is evaluated together with four other WGA tools on a dataset composed of 694 pairs of intra-species bacterial genomes. The results show that YOC improves on divergent cases by detecting more distant similarities and by avoiding misaligned regions. 
In conclusion, YOC should be easier to apply automatically and systematically to incoming genomes, for it does not require a post-filtering step to detect misalignment and is less complex to calibrate.
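The chaining phase of the anchor-based strategy can be illustrated with a minimal Python sketch of classical collinear chaining: pick a maximum-weight subset of anchors that are non-overlapping and in the same order on both genomes. This is the classical formulation only, solved by a plain quadratic dynamic program. It is not YOC's overlap-tolerant chaining and does not use the sweep-line paradigm; the `Anchor` type and `chain` function are hypothetical, and anchors are assumed to have positive length with exclusive end coordinates.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    x_start: int   # start on genome X
    x_end: int     # end on genome X (exclusive)
    y_start: int   # start on genome Y
    y_end: int     # end on genome Y (exclusive)
    weight: float


def chain(anchors):
    """Return (total weight, anchors) of a maximum-weight collinear chain."""
    if not anchors:
        return 0.0, []
    # Any valid predecessor ends before its successor starts on X,
    # so sorting by x_end places predecessors first.
    order = sorted(anchors, key=lambda a: (a.x_end, a.y_end))
    best = [a.weight for a in order]   # best chain weight ending at anchor i
    prev = [-1] * len(order)           # back-pointers for reconstruction
    for i, a in enumerate(order):
        for j in range(i):
            b = order[j]
            precedes = b.x_end <= a.x_start and b.y_end <= a.y_start
            if precedes and best[j] + a.weight > best[i]:
                best[i] = best[j] + a.weight
                prev[i] = j
    i = max(range(len(order)), key=best.__getitem__)
    total = best[i]
    path = []
    while i != -1:
        path.append(order[i])
        i = prev[i]
    return total, path[::-1]


anchors = [
    Anchor(0, 10, 0, 10, 10.0),
    Anchor(12, 20, 12, 20, 8.0),
    Anchor(5, 30, 40, 60, 15.0),   # conflicts with the diagonal anchors
    Anchor(25, 35, 25, 35, 10.0),
]
total, path = chain(anchors)
print(total, len(path))
```

In this example the heaviest single anchor is left out because it cannot be combined with the three anchors along the diagonal.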

    Algorithms for the comparisons of genomic sequences applied to bacterial genomes

    With more than 1000 complete genomes available (the vast majority of which come from bacteria), comparative genomic analyses have become essential for the functional annotation of genomes and the understanding of their structure and evolution, and have applications in phylogenomics and vaccine design. One of the main approaches for comparing genomes is aligning their DNA sequences, i.e. whole genome alignment (WGA), which means identifying the similarity regions without any prior annotation knowledge. Despite significant improvements in recent years, reliable tools for WGA and a methodology for estimating its quality, in particular for bacterial genomes, still need to be designed. Besides their extremely large lengths, which make classical dynamic programming alignment methods unsuitable, aligning whole genomes involves several additional difficulties, due to the mechanisms through which genomes evolve: divergence, which lets sequence similarity vanish over time, the reordering of genomic segments (rearrangements), and the acquisition of external genetic material generating regions that are unalignable between sequences, e.g. horizontal gene transfer, phages. Therefore, whole genome alignment tools implement heuristics, among which the most common is the anchor-based strategy. It starts by detecting an initial set of similarity regions (phase 1), and, through a chaining phase (phase 2), selects a non-overlapping, maximum-weighted, usually collinear subset of those similarities, called anchors. Phases 1 and 2 are recursively applied on yet unaligned regions (phase 3). 
The last phase (phase 4) consists in systematically applying classical alignment tools to all short regions still left unaligned. This thesis addresses several problems related to whole genome alignment: the evaluation of the quality of results given by WGA tools and the improvement of the classical anchor-based strategy. We first designed a protocol for evaluating the quality of alignment results, based on both computational and biological measures. An evaluation of the results given by two state-of-the-art WGA tools on pairs of intra-species bacterial genomes revealed their shortcomings: the failure to detect some of the similarities between sequences and the misalignment of some regions. Based on these results, which imply a lack of both sensitivity and specificity, we propose a novel pairwise whole genome alignment tool, YOC, implementing a simplified two-phase version of the anchor strategy. In phase 1, YOC improves sensitivity by using as anchors, for the first time, local similarities based on spaced seeds that are capable of detecting larger similarity regions in divergent sequences. This phase is followed by a chaining method adapted to local similarities, a novel type of collinear chaining, allowing for proportional overlaps. We give a formulation for this novel problem and provide the first algorithm for it. The algorithm, implementing a dynamic programming approach based on the sweep-line paradigm, is exact and runs in quadratic time. We show that, compared to classical collinear chaining, chaining with overlaps improves results on real bacterial data, while remaining almost as efficient in practice. Our novel tool, YOC, is evaluated together with four other WGA tools on a dataset composed of 694 pairs of intra-species bacterial genomes. The results show that YOC improves on divergent cases by detecting more distant similarities and by avoiding misaligned regions. 
In conclusion, YOC should be easier to apply automatically and systematically to incoming genomes, for it does not require a post-filtering step to detect misalignment and is less complex to calibrate.
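Why spaced seeds help in divergent sequences can be shown with a small Python sketch. A spaced seed is a pattern of "match" (1) and "don't care" (0) positions, so two windows still produce a hit when they differ only at don't-care positions. This is a generic illustration, not YOC's seeding code; the pattern `1101` and the names `spaced_key` and `seed_hits` are chosen for the example.

```python
def spaced_key(seq, pos, pattern):
    """Characters of the window at pos that fall on '1' positions of the pattern."""
    return "".join(seq[pos + i] for i, c in enumerate(pattern) if c == "1")


def seed_hits(a, b, pattern):
    """All (i, j) such that windows a[i:] and b[j:] agree on the '1' positions."""
    span = len(pattern)
    index = {}
    for i in range(len(a) - span + 1):
        index.setdefault(spaced_key(a, i, pattern), []).append(i)
    return [(i, j)
            for j in range(len(b) - span + 1)
            for i in index.get(spaced_key(b, j, pattern), [])]


a = "ACGTAC"
b = "ACTTAC"  # differs from a at position 2 only
print(seed_hits(a, b, "1101"))  # spaced seed tolerates the mismatch
print(seed_hits(a, b, "1111"))  # contiguous seed of the same span
```

With the contiguous pattern every window of `a` overlaps the mismatch, so no hit is found, while the spaced pattern still reports the window pair starting at position 0 in both sequences.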