32 research outputs found

    A resource-frugal probabilistic dictionary and applications in (meta)genomics

    The genomic and metagenomic fields, which generate huge sets of short genomic sequences, have brought their own share of high-performance computing problems. To extract relevant pieces of information from the huge data sets produced by current sequencing techniques, one must rely on extremely scalable methods and solutions. Indexing billions of objects is considered too expensive a task, yet it is a fundamental need in this field. In this paper we propose a straightforward indexing structure that scales to billions of elements, and we propose two direct applications in genomics and metagenomics. We show that our proposal solves problem instances for which no other known solution scales up. We believe that many tools and applications could benefit either from the fundamental data structure we provide or from the applications developed on top of it. Comment: Submitted to PSC 201
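    A common way to realize such a resource-frugal probabilistic index is to pair a minimal perfect hash function (MPHF) with a short fingerprint per key. The Python sketch below is illustrative only and not the paper's exact data structure: the class name and parameters are assumptions, and a plain dict stands in for the MPHF.

```python
import hashlib

class ProbabilisticDict:
    """Illustrative sketch: MPHF + per-slot fingerprint + value array.

    A plain dict stands in for the MPHF (a real MPHF needs only a few
    bits per key and returns an arbitrary slot for unknown keys)."""

    def __init__(self, keys, values, fingerprint_bits=8):
        assert len(keys) == len(values)
        self.fp_bits = fingerprint_bits
        self.mphf = {k: i for i, k in enumerate(keys)}      # stand-in MPHF
        self.fingerprints = [self._fingerprint(k) for k in keys]
        self.values = list(values)

    def _fingerprint(self, key):
        digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
        return int.from_bytes(digest, "little") & ((1 << self.fp_bits) - 1)

    def get(self, key):
        """Value for an indexed key; alien keys are rejected with
        probability 1 - 2**-fingerprint_bits."""
        slot = self.mphf.get(key)
        if slot is None:                  # emulate an MPHF answering anyway
            slot = hash(key) % len(self.values)
        if self.fingerprints[slot] != self._fingerprint(key):
            return None
        return self.values[slot]

if __name__ == "__main__":
    pd = ProbabilisticDict(["ACGT", "CGTA", "GTAC"], [10, 20, 30])
    print(pd.get("CGTA"))   # 20
    print(pd.get("TTTT"))   # None with high probability
```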

    Toward Optimal Fingerprint Indexing for Large Scale Genomics

    Motivation. To keep up with the scale of genomic databases, several methods rely on locality-sensitive hashing to efficiently find potential matches within large genome collections. Existing solutions rely on MinHash or HyperLogLog fingerprints and require reading the whole index to perform a query; such solutions cannot be considered scalable given the growing number of documents to index. Results. We present NIQKI, a novel structure with well-designed fingerprints that leads to theoretical and practical query-time improvements, outperforming the state of the art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of HyperMinHash fingerprints into (h,m)-HMH fingerprints that can be tuned to present the lowest false-positive rate given the expected sub-sampling applied. Second, we provide a structure based on inverted indexes that can index any kind of fingerprint and answers queries in optimal time, namely linear in the size of the output. Third, we implemented these approaches in a tool dubbed NIQKI that can index and compute pairwise distances for over one million bacterial genomes from GenBank in a few days on a small cluster. We show that our approach can be orders of magnitude faster than the state of the art with comparable precision. We believe this approach can lead to tremendous improvements, allowing fast queries and scaling to extensive genomic databases.
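    The inverted-index idea can be illustrated independently of the (h,m)-HMH fingerprints themselves. The sketch below is an assumption-laden stand-in: it uses plain MinHash fingerprints rather than NIQKI's more compact scheme, and the function and class names are illustrative. It only shows why query work is proportional to the number of matches returned rather than to the index size.

```python
import hashlib
from collections import defaultdict, Counter

def _h(seed, kmer):
    data = hashlib.blake2b(f"{seed}:{kmer}".encode(), digest_size=8).digest()
    return int.from_bytes(data, "little")

def minhash_fingerprints(kmers, num_hashes=4, fp_bits=16):
    """One short fingerprint per hash function: the minimum, truncated.

    Stand-in for NIQKI's (h,m)-HMH fingerprints, used here only so the
    inverted index below has something to index."""
    return [min(_h(seed, k) for k in kmers) & ((1 << fp_bits) - 1)
            for seed in range(num_hashes)]

class InvertedFingerprintIndex:
    """(hash id, fingerprint value) -> postings list of genome ids."""

    def __init__(self):
        self.postings = defaultdict(list)
        self.names = []

    def insert(self, name, fingerprints):
        gid = len(self.names)
        self.names.append(name)
        for pos, fp in enumerate(fingerprints):
            self.postings[(pos, fp)].append(gid)

    def query(self, fingerprints):
        """Count shared fingerprints per indexed genome.

        Only postings lists that actually contain hits are touched, so the
        work is linear in the size of the output, not in the index size."""
        hits = Counter()
        for pos, fp in enumerate(fingerprints):
            for gid in self.postings.get((pos, fp), ()):
                hits[gid] += 1
        return {self.names[g]: c for g, c in hits.items()}

if __name__ == "__main__":
    idx = InvertedFingerprintIndex()
    idx.insert("genomeA", minhash_fingerprints(["ACGT", "CGTA", "GTAC", "TACG"]))
    idx.insert("genomeB", minhash_fingerprints(["ACGT", "CGTT", "GTTC", "TTCA"]))
    print(idx.query(minhash_fingerprints(["ACGT", "CGTA", "GTAC", "TACG"])))
```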

    Locality-preserving minimal perfect hashing of k-mers

    Motivation: Minimal perfect hashing is the problem of mapping a static set of n distinct keys into the address space {1, ..., n} bijectively. It is well known that n log₂(e) bits are necessary to specify a minimal perfect hash function (MPHF) f when no additional knowledge of the input keys is to be used. However, it is often the case in practice that the input keys have intrinsic relationships that we can exploit to lower the bit complexity of f. For example, consider a string and the set of all its distinct k-mers as input keys: since two consecutive k-mers share an overlap of k - 1 symbols, it seems possible to beat the classic log₂(e) bits/key barrier in this case. Moreover, we would like f to map consecutive k-mers to consecutive addresses, so as to preserve as much as possible their relationship in the codomain. This is a useful feature in practice as it guarantees a certain degree of locality of reference for f, resulting in better evaluation time when querying consecutive k-mers. Results: Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF designed for k-mers extracted consecutively from a collection of strings. We design a construction whose space usage decreases for growing k and discuss experiments with a practical implementation of the method: in practice, the functions built with our method can be several times smaller and even faster to query than the most efficient MPHFs in the literature.
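    To make the k-mer setting concrete, the toy sketch below extracts consecutive k-mers (which overlap on k - 1 symbols) and assigns addresses in order of first occurrence, so consecutive k-mers usually land at consecutive addresses. It is illustrative only: the paper's locality-preserving MPHF achieves this property without storing the keys, whereas this sketch uses an ordinary dict.

```python
def kmers(s, k):
    """Yield the consecutive k-mers of s; neighbours overlap on k-1 symbols."""
    for i in range(len(s) - k + 1):
        yield s[i:i + k]

def locality_preserving_addresses(s, k):
    """Toy bijection from the distinct k-mers of s to {0, ..., n-1}.

    Assigning addresses in order of first occurrence keeps consecutive
    k-mers at consecutive addresses whenever they are first seen together.
    A real locality-preserving MPHF gives a similar mapping without
    storing the keys; the dict here is only for illustration."""
    addr = {}
    for km in kmers(s, k):
        if km not in addr:
            addr[km] = len(addr)
    return addr

if __name__ == "__main__":
    s, k = "ACGTACGGT", 4
    f = locality_preserving_addresses(s, k)
    for km in kmers(s, k):
        print(km, "->", f[km])
```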

    Fractional Hitting Sets for Efficient and Lightweight Genomic Data Sketching

    The exponential increase in publicly available sequencing data and genomic resources necessitates the development of highly efficient methods for data processing and analysis. Locality-sensitive hashing techniques have successfully transformed large datasets into smaller, more manageable sketches while maintaining comparability using metrics such as Jaccard and containment indices. However, fixed-size sketches encounter difficulties when applied to divergent datasets. Scalable sketching methods, such as Sourmash, provide valuable solutions but still lack resource-efficient, tailored indexing. Our objective is to create lighter sketches with comparable results while enhancing efficiency. We introduce the concept of Fractional Hitting Sets, a generalization of Universal Hitting Sets, which uniformly cover a specified fraction of the k-mer space. In theory and practice, we demonstrate the feasibility of achieving such coverage with simple but highly efficient schemes. By encoding the covered k-mers as super-k-mers, we provide a space-efficient exact representation that also enables optimized comparisons. Our novel tool, SuperSampler, implements this scheme, and experimental results with real bacterial collections closely match our theoretical findings. In comparison to Sourmash, SuperSampler achieves similar outcomes while utilizing an order of magnitude less space and memory and operating several times faster. This highlights the potential of our approach in addressing the challenges presented by the ever-expanding landscape of genomic data.
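    As a point of reference, the simplest way to select a uniform fraction of the k-mer space is a hash-threshold rule, sketched below in Python. The function names and the fraction parameter are illustrative assumptions; SuperSampler's actual scheme additionally groups the covered k-mers into super-k-mers for a space-efficient exact representation, which this sketch does not attempt.

```python
import hashlib

MAX_HASH = (1 << 64) - 1

def _h64(kmer):
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little")

def fractional_sketch(kmers, fraction=0.01):
    """Keep the k-mers whose hash falls below fraction * 2^64.

    In expectation this selects a uniform `fraction` of all k-mers;
    it illustrates only the subsampling side of the scheme."""
    threshold = int(fraction * MAX_HASH)
    return {k for k in kmers if _h64(k) <= threshold}

def containment(sketch_a, sketch_b):
    """Estimate |A ∩ B| / |A| from two sketches built with the same fraction."""
    return len(sketch_a & sketch_b) / len(sketch_a) if sketch_a else 0.0
```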

    Fast and Scalable Minimal Perfect Hashing for Massive Key Sets

    Minimal perfect hash functions provide space-efficient and collision-free hashing on static sets. Existing algorithms and implementations that build such functions have practical limitations on the number of input elements they can process, due to high construction time, RAM or external memory usage. We revisit a simple algorithm and show that it is highly competitive with the state of the art, especially in terms of construction time and memory usage. We provide a parallel C++ implementation called BBhash. It is capable of creating a minimal perfect hash function of 10^{10} elements in less than 7 minutes using 8 threads and 5 GB of memory, and the resulting function uses 3.7 bits/element. To the best of our knowledge, this is also the first implementation that has been successfully tested on an input of cardinality 10^{12}. Source code: https://github.com/rizkg/BBHas
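    For intuition, here is a much-simplified Python sketch of a multi-level, bit-array-based MPHF construction in the spirit of the simple algorithm the paper revisits. The gamma parameter, helper names and plain Python lists are illustrative stand-ins for the succinct bit vectors, constant-time rank structures and parallel C++ code of the real tool.

```python
import hashlib

def _h(key, level):
    data = hashlib.blake2b(f"{level}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(data, "little")

class SimpleMPHF:
    """Simplified multi-level MPHF sketch.

    Each level hashes the remaining keys into a bit array; positions hit by
    exactly one key are settled, colliding keys move to the next level.
    The rank of a key's set bit, over all levels, is its MPHF value."""

    def __init__(self, keys, gamma=2.0, max_levels=32):
        self.levels = []                      # list of (size, bit list)
        remaining = list(keys)
        for level in range(max_levels):
            if not remaining:
                break
            size = max(1, int(gamma * len(remaining)))
            counts = [0] * size
            for k in remaining:
                counts[_h(k, level) % size] += 1
            bits = [c == 1 for c in counts]
            self.levels.append((size, bits))
            remaining = [k for k in remaining if not bits[_h(k, level) % size]]
        self.fallback = {k: i for i, k in enumerate(remaining)}   # leftovers
        self.offsets, total = [], 0
        for _, bits in self.levels:
            self.offsets.append(total)
            total += sum(bits)
        self.fallback_offset = total

    def __call__(self, key):
        for level, (size, bits) in enumerate(self.levels):
            pos = _h(key, level) % size
            if bits[pos]:
                # rank of this bit within the level (linear scan for clarity;
                # real implementations answer this in O(1) with a rank index)
                return self.offsets[level] + sum(bits[:pos])
        return self.fallback_offset + self.fallback[key]

if __name__ == "__main__":
    keys = [f"key{i}" for i in range(1000)]
    f = SimpleMPHF(keys)
    assert sorted(f(k) for k in keys) == list(range(1000))   # minimal, perfect
    print("bijection onto {0, ..., 999} verified")
```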

    BGREAT: A De Bruijn graph read mapping tool

    Mapping reads on references is a central task in numerous genomic studies. Since references are increasingly extracted from assembly graphs, it is of high interest to map efficiently on such structures. The problem of mapping sequences on a de Bruijn graph has been shown to be NP-complete [1] and no scalable generic tool exists yet. We motivate here the problem of mapping reads on a de Bruijn graph and we present a practical solution and its implementation, called BGREAT. BGREAT handles real-world instances of billions of reads with moderate resources. Mapping on a de Bruijn graph makes it possible to keep the whole genomic information and to avoid possible assembly mistakes. However, the problem is theoretically hard to handle on real-world datasets. Using a set of heuristics, our proposed tool is able to map millions of reads per CPU hour, even on complex human genomes. BGREAT is available at github.com/Malfoy/BGREAT. [1] Limasset, A., & Peterlongo, P. (2015). Read Mapping on de Bruijn graph. arXiv preprint arXiv:1505.04911. [2] Langmead, Ben, et al. "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome." Genome Biol 10.3 (2009): R25.
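    To make the setting concrete, the sketch below builds a node-centric de Bruijn graph as a plain Python set of k-mers and threads a read through it in the error-free case. All names are illustrative assumptions, and this is not BGREAT's heuristic mapping, which additionally handles sequencing errors and branching paths.

```python
def build_dbg(sequences, k):
    """Node-centric de Bruijn graph: the set of k-mers seen in the input."""
    nodes = set()
    for s in sequences:
        for i in range(len(s) - k + 1):
            nodes.add(s[i:i + k])
    return nodes

def successors(nodes, kmer):
    """Graph edges: k-mers reachable from `kmer` via a k-1 overlap."""
    return [kmer[1:] + c for c in "ACGT" if kmer[1:] + c in nodes]

def map_read_exact(nodes, read, k):
    """Thread a read through the graph without mismatches.

    Consecutive k-mers of the read overlap on k-1 symbols by construction,
    so if every k-mer is a node, the read spells a walk in the graph.
    Handling sequencing errors and choosing among branches is the hard
    part addressed by BGREAT's heuristics, not attempted here."""
    path = [read[i:i + k] for i in range(len(read) - k + 1)]
    return path if all(km in nodes for km in path) else None

if __name__ == "__main__":
    graph = build_dbg(["ACGTACGTGACCT"], k=4)
    print(successors(graph, "ACGT"))            # branching neighbours
    print(map_read_exact(graph, "GTACGTGA", k=4))
```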

    Minimal perfect hash functions in large scale bioinformatics Problem

    Genomic and metagenomic fields, generating huge sets of short genomic sequences, have brought their own share of high-performance problems. To extract relevant pieces of information from the huge data sets generated by current sequencing techniques, one must rely on extremely scalable methods and solutions. Indexing billions of objects is considered too expensive a task, yet it is a fundamental need in this field. In this paper we propose a straightforward indexing structure that scales to billions of elements and we propose two direct applications in genomics and metagenomics. We show that our proposal solves problem instances for which no other known solution scales up. We believe that many tools and applications could benefit from either the fundamental data structure we provide or from the applications developed from this structure.

    New approaches for the exploitation of high-throughput genomic sequence data

    Novel approaches for the exploitation of high-throughput sequencing data. In this thesis we discuss computational methods to deal with DNA sequences produced by high-throughput sequencers. We mostly focus on the reconstruction of genomes from DNA fragments (genome assembly) and on closely related problems. These tasks combine huge amounts of data with combinatorial problems. Various graph structures are used to handle them, presenting trade-offs between scalability and assembly quality. This thesis introduces several contributions to cope with these tasks. First, novel representations of assembly graphs are proposed to allow better scaling. We also present novel uses of those graphs beyond assembly, and we propose tools to use such graphs as references when a fully assembled genome is not available. Finally, we show how to use those methods to produce less fragmented assemblies while remaining tractable.

    New approaches for exploitation of high throughput sequencing data

    Novel approaches for the exploitation of high-throughput sequencing data. In this thesis we discuss computational methods to deal with DNA sequences produced by high-throughput sequencers. We mostly focus on the reconstruction of genomes from DNA fragments (genome assembly) and on closely related problems. These tasks combine huge amounts of data with combinatorial problems. Various graph structures are used to handle them, presenting trade-offs between scalability and assembly quality. This thesis introduces several contributions to cope with these tasks. First, novel representations of assembly graphs are proposed to allow better scaling. We also present novel uses of those graphs beyond assembly, and we propose tools to use such graphs as references when a fully assembled genome is not available. Finally, we show how to use those methods to produce less fragmented assemblies while remaining tractable.