32 research outputs found

    A resource-frugal probabilistic dictionary and applications in (meta)genomics

    The genomic and metagenomic fields, which generate huge sets of short genomic sequences, have brought their own share of high-performance computing problems. To extract relevant pieces of information from the huge data sets produced by current sequencing techniques, one must rely on extremely scalable methods and solutions. Indexing billions of objects is considered too expensive a task, yet it is a fundamental need in this field. In this paper we propose a straightforward indexing structure that scales to billions of elements, and we propose two direct applications in genomics and metagenomics. We show that our proposal solves problem instances for which no other known solution scales up. We believe that many tools and applications could benefit either from the fundamental data structure we provide or from the applications developed on top of it. Comment: Submitted to PSC 201
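    A common way to realize such a resource-frugal probabilistic index is to pair a minimal perfect hash function (MPHF) with a short fingerprint per key. The Python sketch below is illustrative only and not the paper's exact data structure: the class name and parameters are assumptions, and a plain dict stands in for the MPHF.

```python
import hashlib

class ProbabilisticDict:
    """Illustrative sketch: MPHF + per-slot fingerprint + value array.

    A plain dict stands in for the MPHF (a real MPHF needs only a few
    bits per key and returns an arbitrary slot for unknown keys)."""

    def __init__(self, keys, values, fingerprint_bits=8):
        assert len(keys) == len(values)
        self.fp_bits = fingerprint_bits
        self.mphf = {k: i for i, k in enumerate(keys)}      # stand-in MPHF
        self.fingerprints = [self._fingerprint(k) for k in keys]
        self.values = list(values)

    def _fingerprint(self, key):
        digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
        return int.from_bytes(digest, "little") & ((1 << self.fp_bits) - 1)

    def get(self, key):
        """Value for an indexed key; alien keys are rejected with
        probability 1 - 2**-fingerprint_bits."""
        slot = self.mphf.get(key)
        if slot is None:                  # emulate an MPHF answering anyway
            slot = hash(key) % len(self.values)
        if self.fingerprints[slot] != self._fingerprint(key):
            return None
        return self.values[slot]

if __name__ == "__main__":
    pd = ProbabilisticDict(["ACGT", "CGTA", "GTAC"], [10, 20, 30])
    print(pd.get("CGTA"))   # 20
    print(pd.get("TTTT"))   # None with high probability
```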

    Toward Optimal Fingerprint Indexing for Large Scale Genomics

    Motivation. To keep up with the scale of genomic databases, several methods rely on locality-sensitive hashing to efficiently find potential matches within large genome collections. Existing solutions rely on MinHash or HyperLogLog fingerprints and require reading the whole index to perform a query; such solutions cannot be considered scalable given the growing number of documents to index. Results. We present NIQKI, a novel structure with well-designed fingerprints that leads to theoretical and practical query-time improvements, outperforming the state of the art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of HyperMinHash fingerprints into (h,m)-HMH fingerprints that can be tuned to present the lowest false-positive rate given the expected sub-sampling applied. Second, we provide a structure based on inverted indexes that can index any kind of fingerprint and answers queries in optimal time, namely linear in the size of the output. Third, we implemented these approaches in a tool dubbed NIQKI that can index and compute pairwise distances for over one million bacterial genomes from GenBank in a few days on a small cluster. We show that our approach can be orders of magnitude faster than the state of the art with comparable precision. We believe this approach can lead to tremendous improvements, allowing fast queries and scaling to extensive genomic databases.
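    The inverted-index idea can be illustrated independently of the (h,m)-HMH fingerprints themselves. The sketch below is an assumption-laden stand-in: it uses plain MinHash fingerprints rather than NIQKI's more compact scheme, and the function and class names are illustrative. It only shows why query work is proportional to the number of matches returned rather than to the index size.

```python
import hashlib
from collections import defaultdict, Counter

def _h(seed, kmer):
    data = hashlib.blake2b(f"{seed}:{kmer}".encode(), digest_size=8).digest()
    return int.from_bytes(data, "little")

def minhash_fingerprints(kmers, num_hashes=4, fp_bits=16):
    """One short fingerprint per hash function: the minimum, truncated.

    Stand-in for NIQKI's (h,m)-HMH fingerprints, used here only so the
    inverted index below has something to index."""
    return [min(_h(seed, k) for k in kmers) & ((1 << fp_bits) - 1)
            for seed in range(num_hashes)]

class InvertedFingerprintIndex:
    """(hash id, fingerprint value) -> postings list of genome ids."""

    def __init__(self):
        self.postings = defaultdict(list)
        self.names = []

    def insert(self, name, fingerprints):
        gid = len(self.names)
        self.names.append(name)
        for pos, fp in enumerate(fingerprints):
            self.postings[(pos, fp)].append(gid)

    def query(self, fingerprints):
        """Count shared fingerprints per indexed genome.

        Only postings lists that actually contain hits are touched, so the
        work is linear in the size of the output, not in the index size."""
        hits = Counter()
        for pos, fp in enumerate(fingerprints):
            for gid in self.postings.get((pos, fp), ()):
                hits[gid] += 1
        return {self.names[g]: c for g, c in hits.items()}

if __name__ == "__main__":
    idx = InvertedFingerprintIndex()
    idx.insert("genomeA", minhash_fingerprints(["ACGT", "CGTA", "GTAC", "TACG"]))
    idx.insert("genomeB", minhash_fingerprints(["ACGT", "CGTT", "GTTC", "TTCA"]))
    print(idx.query(minhash_fingerprints(["ACGT", "CGTA", "GTAC", "TACG"])))
```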

    Locality-preserving minimal perfect hashing of k-mers

    Motivation: Minimal perfect hashing is the problem of mapping a static set of n distinct keys into the address space {1, ..., n} bijectively. It is well known that n log₂(e) bits are necessary to specify a minimal perfect hash function (MPHF) f when no additional knowledge of the input keys is to be used. However, it is often the case in practice that the input keys have intrinsic relationships that we can exploit to lower the bit complexity of f. For example, consider a string and the set of all its distinct k-mers as input keys: since two consecutive k-mers share an overlap of k - 1 symbols, it seems possible to beat the classic log₂(e) bits/key barrier in this case. Moreover, we would like f to map consecutive k-mers to consecutive addresses, so as to preserve as much as possible their relationship in the codomain. This is a useful feature in practice as it guarantees a certain degree of locality of reference for f, resulting in better evaluation time when querying consecutive k-mers. Results: Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF designed for k-mers extracted consecutively from a collection of strings. We design a construction whose space usage decreases for growing k and discuss experiments with a practical implementation of the method: in practice, the functions built with our method can be several times smaller and even faster to query than the most efficient MPHFs in the literature.
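    To make the k-mer setting concrete, the toy sketch below extracts consecutive k-mers (which overlap on k - 1 symbols) and assigns addresses in order of first occurrence, so consecutive k-mers usually land at consecutive addresses. It is illustrative only: the paper's locality-preserving MPHF achieves this property without storing the keys, whereas this sketch uses an ordinary dict.

```python
def kmers(s, k):
    """Yield the consecutive k-mers of s; neighbours overlap on k-1 symbols."""
    for i in range(len(s) - k + 1):
        yield s[i:i + k]

def locality_preserving_addresses(s, k):
    """Toy bijection from the distinct k-mers of s to {0, ..., n-1}.

    Assigning addresses in order of first occurrence keeps consecutive
    k-mers at consecutive addresses whenever they are first seen together.
    A real locality-preserving MPHF gives a similar mapping without
    storing the keys; the dict here is only for illustration."""
    addr = {}
    for km in kmers(s, k):
        if km not in addr:
            addr[km] = len(addr)
    return addr

if __name__ == "__main__":
    s, k = "ACGTACGGT", 4
    f = locality_preserving_addresses(s, k)
    for km in kmers(s, k):
        print(km, "->", f[km])
```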

    Fractional Hitting Sets for Efficient and Lightweight Genomic Data Sketching

    The exponential increase in publicly available sequencing data and genomic resources necessitates the development of highly efficient methods for data processing and analysis. Locality-sensitive hashing techniques have successfully transformed large datasets into smaller, more manageable sketches while maintaining comparability using metrics such as Jaccard and containment indices. However, fixed-size sketches encounter difficulties when applied to divergent datasets. Scalable sketching methods, such as Sourmash, provide valuable solutions but still lack resource-efficient, tailored indexing. Our objective is to create lighter sketches with comparable results while enhancing efficiency. We introduce the concept of Fractional Hitting Sets, a generalization of Universal Hitting Sets, which uniformly cover a specified fraction of the k-mer space. In theory and practice, we demonstrate the feasibility of achieving such coverage with simple but highly efficient schemes. By encoding the covered k-mers as super-k-mers, we provide a space-efficient exact representation that also enables optimized comparisons. Our novel tool, SuperSampler, implements this scheme, and experimental results with real bacterial collections closely match our theoretical findings. In comparison to Sourmash, SuperSampler achieves similar outcomes while utilizing an order of magnitude less space and memory and operating several times faster. This highlights the potential of our approach in addressing the challenges presented by the ever-expanding landscape of genomic data.
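    As a point of reference, the simplest way to select a uniform fraction of the k-mer space is a hash-threshold rule, sketched below in Python. The function names and the fraction parameter are illustrative assumptions; SuperSampler's actual scheme additionally groups the covered k-mers into super-k-mers for a space-efficient exact representation, which this sketch does not attempt.

```python
import hashlib

MAX_HASH = (1 << 64) - 1

def _h64(kmer):
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little")

def fractional_sketch(kmers, fraction=0.01):
    """Keep the k-mers whose hash falls below fraction * 2^64.

    In expectation this selects a uniform `fraction` of all k-mers;
    it illustrates only the subsampling side of the scheme."""
    threshold = int(fraction * MAX_HASH)
    return {k for k in kmers if _h64(k) <= threshold}

def containment(sketch_a, sketch_b):
    """Estimate |A ∩ B| / |A| from two sketches built with the same fraction."""
    return len(sketch_a & sketch_b) / len(sketch_a) if sketch_a else 0.0
```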

    Fast and Scalable Minimal Perfect Hashing for Massive Key Sets

    Minimal perfect hash functions provide space-efficient and collision-free hashing on static sets. Existing algorithms and implementations that build such functions have practical limitations on the number of input elements they can process, due to high construction time, RAM or external memory usage. We revisit a simple algorithm and show that it is highly competitive with the state of the art, especially in terms of construction time and memory usage. We provide a parallel C++ implementation called BBhash. It is capable of creating a minimal perfect hash function of 10^{10} elements in less than 7 minutes using 8 threads and 5 GB of memory, and the resulting function uses 3.7 bits/element. To the best of our knowledge, this is also the first implementation that has been successfully tested on an input of cardinality 10^{12}. Source code: https://github.com/rizkg/BBHas
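    For intuition, here is a much-simplified Python sketch of a multi-level, bit-array-based MPHF construction in the spirit of the simple algorithm the paper revisits. The gamma parameter, helper names and plain Python lists are illustrative stand-ins for the succinct bit vectors, constant-time rank structures and parallel C++ code of the real tool.

```python
import hashlib

def _h(key, level):
    data = hashlib.blake2b(f"{level}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(data, "little")

class SimpleMPHF:
    """Simplified multi-level MPHF sketch.

    Each level hashes the remaining keys into a bit array; positions hit by
    exactly one key are settled, colliding keys move to the next level.
    The rank of a key's set bit, over all levels, is its MPHF value."""

    def __init__(self, keys, gamma=2.0, max_levels=32):
        self.levels = []                      # list of (size, bit list)
        remaining = list(keys)
        for level in range(max_levels):
            if not remaining:
                break
            size = max(1, int(gamma * len(remaining)))
            counts = [0] * size
            for k in remaining:
                counts[_h(k, level) % size] += 1
            bits = [c == 1 for c in counts]
            self.levels.append((size, bits))
            remaining = [k for k in remaining if not bits[_h(k, level) % size]]
        self.fallback = {k: i for i, k in enumerate(remaining)}   # leftovers
        self.offsets, total = [], 0
        for _, bits in self.levels:
            self.offsets.append(total)
            total += sum(bits)
        self.fallback_offset = total

    def __call__(self, key):
        for level, (size, bits) in enumerate(self.levels):
            pos = _h(key, level) % size
            if bits[pos]:
                # rank of this bit within the level (linear scan for clarity;
                # real implementations answer this in O(1) with a rank index)
                return self.offsets[level] + sum(bits[:pos])
        return self.fallback_offset + self.fallback[key]

if __name__ == "__main__":
    keys = [f"key{i}" for i in range(1000)]
    f = SimpleMPHF(keys)
    assert sorted(f(k) for k in keys) == list(range(1000))   # minimal, perfect
    print("bijection onto {0, ..., 999} verified")
```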

    BGREAT: A De Bruijn graph read mapping tool

    Mapping reads on references is a central task in numerous genomic studies. Since references are increasingly extracted from assembly graphs, it is of high interest to map efficiently on such structures. The problem of mapping sequences on a de Bruijn graph has been shown to be NP-complete [1] and no scalable generic tool exists yet. We motivate here the problem of mapping reads on a de Bruijn graph and we present a practical solution and its implementation, called BGREAT. BGREAT handles real-world instances of billions of reads with moderate resources. Mapping on a de Bruijn graph makes it possible to keep the whole genomic information and to avoid possible assembly mistakes. However, the problem is theoretically hard to handle on real-world datasets. Using a set of heuristics, our proposed tool is able to map millions of reads per CPU hour, even on complex human genomes. BGREAT is available at github.com/Malfoy/BGREAT. [1] Limasset, A., & Peterlongo, P. (2015). Read Mapping on de Bruijn graph. arXiv preprint arXiv:1505.04911. [2] Langmead, Ben, et al. "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome." Genome Biol 10.3 (2009): R25.
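    To make the setting concrete, the sketch below builds a node-centric de Bruijn graph as a plain Python set of k-mers and threads a read through it in the error-free case. All names are illustrative assumptions, and this is not BGREAT's heuristic mapping, which additionally handles sequencing errors and branching paths.

```python
def build_dbg(sequences, k):
    """Node-centric de Bruijn graph: the set of k-mers seen in the input."""
    nodes = set()
    for s in sequences:
        for i in range(len(s) - k + 1):
            nodes.add(s[i:i + k])
    return nodes

def successors(nodes, kmer):
    """Graph edges: k-mers reachable from `kmer` via a k-1 overlap."""
    return [kmer[1:] + c for c in "ACGT" if kmer[1:] + c in nodes]

def map_read_exact(nodes, read, k):
    """Thread a read through the graph without mismatches.

    Consecutive k-mers of the read overlap on k-1 symbols by construction,
    so if every k-mer is a node, the read spells a walk in the graph.
    Handling sequencing errors and choosing among branches is the hard
    part addressed by BGREAT's heuristics, not attempted here."""
    path = [read[i:i + k] for i in range(len(read) - k + 1)]
    return path if all(km in nodes for km in path) else None

if __name__ == "__main__":
    graph = build_dbg(["ACGTACGTGACCT"], k=4)
    print(successors(graph, "ACGT"))            # branching neighbours
    print(map_read_exact(graph, "GTACGTGA", k=4))
```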

    Minimal perfect hash functions in large scale bioinformatics Problem

    Genomic and metagenomic fields, generating huge sets of short genomic sequences, have brought their own share of high-performance problems. To extract relevant pieces of information from the huge data sets generated by current sequencing techniques, one must rely on extremely scalable methods and solutions. Indexing billions of objects is considered too expensive a task, yet it is a fundamental need in this field. In this paper we propose a straightforward indexing structure that scales to billions of elements and we propose two direct applications in genomics and metagenomics. We show that our proposal solves problem instances for which no other known solution scales up. We believe that many tools and applications could benefit from either the fundamental data structure we provide or from the applications developed from this structure.

    New approaches for the exploitation of high-throughput genomic sequence data

    Novel approaches for the exploitation of high-throughput sequencing data. In this thesis we discuss computational methods to deal with DNA sequences produced by high-throughput sequencers. We mostly focus on the reconstruction of genomes from DNA fragments (genome assembly) and on closely related problems. These tasks combine huge amounts of data with combinatorial problems. Various graph structures are used to handle them, presenting trade-offs between scalability and assembly quality. This thesis introduces several contributions to cope with these tasks. First, novel representations of assembly graphs are proposed to allow better scaling. We also present novel uses of those graphs beyond assembly, and we propose tools to use such graphs as references when a fully assembled genome is not available. Finally, we show how to use those methods to produce less fragmented assemblies while remaining tractable.

    New approaches for exploitation of high throughput sequencing data

    Novel approaches for the exploitation of high-throughput sequencing data. In this thesis we discuss computational methods to deal with DNA sequences produced by high-throughput sequencers. We mostly focus on the reconstruction of genomes from DNA fragments (genome assembly) and on closely related problems. These tasks combine huge amounts of data with combinatorial problems. Various graph structures are used to handle them, presenting trade-offs between scalability and assembly quality. This thesis introduces several contributions to cope with these tasks. First, novel representations of assembly graphs are proposed to allow better scaling. We also present novel uses of those graphs beyond assembly, and we propose tools to use such graphs as references when a fully assembled genome is not available. Finally, we show how to use those methods to produce less fragmented assemblies while remaining tractable.