Fractional Hitting Sets for Efficient and Lightweight Genomic Data Sketching

Author: Limasset Antoine
Marchet Camille
Martayan Igor
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)
Publication date: 01/01/2023

The exponential increase in publicly available sequencing data and genomic resources necessitates the development of highly efficient methods for data processing and analysis. Locality-sensitive hashing techniques have successfully transformed large datasets into smaller, more manageable sketches while maintaining comparability using metrics such as Jaccard and containment indices. However, fixed-size sketches encounter difficulties when applied to divergent datasets. Scalable sketching methods, such as Sourmash, provide valuable solutions but still lack resource-efficient, tailored indexing. Our objective is to create lighter sketches with comparable results while enhancing efficiency. We introduce the concept of Fractional Hitting Sets, a generalization of Universal Hitting Sets, which uniformly cover a specified fraction of the k-mer space. In theory and practice, we demonstrate the feasibility of achieving such coverage with simple but highly efficient schemes. By encoding the covered k-mers as super-k-mers, we provide a space-efficient exact representation that also enables optimized comparisons. Our novel tool, SuperSampler, implements this scheme, and experimental results with real bacterial collections closely match our theoretical findings. In comparison to Sourmash, SuperSampler achieves similar outcomes while utilizing an order of magnitude less space and memory and operating several times faster. This highlights the potential of our approach in addressing the challenges presented by the ever-expanding landscape of genomic data
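To make the Jaccard and containment indices mentioned in the abstract concrete, here is a minimal sketch that computes them exactly on full k-mer sets. It is not SuperSampler's or Sourmash's method, which compare much smaller sketches; the function names `kmer_set`, `jaccard`, and `containment` are hypothetical and chosen only for illustration.

```python
def kmer_set(seq: str, k: int) -> set:
    """Return the set of distinct length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A & B| / |A | B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set, b: set) -> float:
    """Containment of A in B: |A & B| / |A|; defined here as 0.0 when A is empty."""
    return len(a & b) / len(a) if a else 0.0


# Toy sequences (illustrative only, not real genomic data).
A = kmer_set("ACGTAC", 3)   # {"ACG", "CGT", "GTA", "TAC"}
B = kmer_set("GTACAA", 3)   # {"GTA", "TAC", "ACA", "CAA"}
print(jaccard(A, B), containment(A, B))
```

A sketching method approximates these two quantities from a small sample of each k-mer set instead of the full sets used here.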

Dagstuhl Research Online Publication Server

Sparse and skew hashing of K-mers

Author: Pibiri G. E.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2022

Motivation: A dictionary of k-mers is a data structure that stores a set of n distinct k-mers and supports membership queries. This data structure is at the heart of many important tasks in computational biology. High-throughput sequencing of DNA can produce very large k-mer sets, on the order of billions of strings; in such cases, the memory consumption and query efficiency of the data structure are a concrete challenge. Results: To tackle this problem, we describe a compressed and associative dictionary for k-mers, that is, a data structure where strings are represented in compact form and each of them is associated with a unique integer identifier in the range [0,n). We show that some statistical properties of k-mer minimizers can be exploited by minimal perfect hashing to substantially improve the space/time trade-off of the dictionary compared to the best-known solutions

PubMed Central

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Locality-preserving minimal perfect hashing of k-mers

Author: Limasset Antoine
Pibiri Giulio Ermanno
Shibuya Yoshihiro
Publication date: 01/01/2023

Motivation: Minimal perfect hashing is the problem of mapping a static set of n distinct keys into the address space {1,...,n} bijectively. It is well known that $n \log_2 e$ bits are necessary to specify a minimal perfect hash function (MPHF) f when no additional knowledge of the input keys is used. However, it is often the case in practice that the input keys have intrinsic relationships that we can exploit to lower the bit complexity of f. For example, consider a string and the set of all its distinct k-mers as input keys: since two consecutive k-mers share an overlap of k - 1 symbols, it seems possible to beat the classic $\log_2 e$ bits/key barrier in this case. Moreover, we would like f to map consecutive k-mers to consecutive addresses, so as to preserve their relationship in the codomain as much as possible. This is a useful feature in practice as it guarantees a certain degree of locality of reference for f, resulting in a better evaluation time when querying consecutive k-mers. Results: Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF designed for k-mers extracted consecutively from a collection of strings. We design a construction whose space usage decreases for growing k and discuss experiments with a practical implementation of the method: in practice, the functions built with our method can be several times smaller and even faster to query than the most efficient MPHFs in the literature
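As a small worked illustration of the two facts the abstract relies on, the sketch below evaluates the classic lower bound of n times log base 2 of e bits and checks that consecutive k-mers of a string overlap in k - 1 symbols. The helper names `mphf_lower_bound_bits` and `consecutive_kmers` are hypothetical, and nothing here reproduces the authors' construction.

```python
import math


def mphf_lower_bound_bits(n: int) -> float:
    """Classic space lower bound, in bits, for an MPHF over n arbitrary keys."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * math.log2(math.e)


def consecutive_kmers(s: str, k: int) -> list:
    """Return the k-mers of s in order of their starting position."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


# log2(e) is about 1.4427, so the barrier is roughly 1.44 bits per key.
BITS_PER_KEY = mphf_lower_bound_bits(1)

# Each k-mer shares its last k - 1 symbols with the first k - 1 of the next one.
KMERS = consecutive_kmers("ACGTACGT", 4)
OVERLAPS_OK = all(x[1:] == y[:-1] for x, y in zip(KMERS, KMERS[1:]))
print(BITS_PER_KEY, OVERLAPS_OK)
```

The overlap check is the structural redundancy that a locality-preserving scheme can try to exploit; the bound itself applies only when keys are treated as unrelated.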

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Positive Definite Kernels in Machine Learning

Author: Cuturi Marco
Publication date: 01/01/2009

This survey is an introduction to positive definite kernels and the set of methods they have inspired in the machine learning literature, namely kernel methods. We first discuss some properties of positive definite kernels as well as reproducing kernel Hilbert spaces, the natural extension of the set of functions $\{k(x,\cdot),x\in\mathcal{X}\}$ associated with a kernel $k$ defined on a space $\mathcal{X}$. We discuss at length the construction of kernel functions that take advantage of well-known statistical models. We provide an overview of numerous data-analysis methods which take advantage of reproducing kernel Hilbert spaces and discuss the idea of combining several kernels to improve the performance on certain tasks. We also provide a short cookbook of different kernels which are particularly useful for certain data-types such as images, graphs or speech segments. Comment: draft. corrected a typo in figure

arXiv.org e-Print Archive

CiteSeerX

Locality-Sensitive Bucketing Functions for the Edit Distance

Author: Chen Ke
Shao Mingfu
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)
Publication date: 01/01/2022

Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rate, but encounter much reduced sensitivity on data with high error rate. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. Here we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be $(d_1, d_2)$-sensitive if any two sequences within an edit distance of $d_1$ are mapped into at least one shared bucket, and any two sequences with distance at least $d_2$ are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of $(d_1, d_2)$ and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal. These results provide theoretical foundations for their practical use in analyzing sequences with high error rate while also providing insights for the hardness of designing ungapped LSH functions
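The sensitivity definition above can be checked by brute force on a small set of sequences. The sketch below writes the two thresholds as `d1` (close) and `d2` (far), which are names assumed here, and tests a toy bucketing function, `deletion_buckets`, that is purely illustrative and not claimed to be one of the paper's constructions. All helper names are hypothetical.

```python
from itertools import combinations, product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # match or substitute
        prev = curr
    return prev[-1]


def is_sensitive(bucket_fn, seqs, d1: int, d2: int) -> bool:
    """Exhaustively check the (d1, d2)-sensitivity condition on all pairs in seqs."""
    for s, t in combinations(seqs, 2):
        dist = edit_distance(s, t)
        shared = bucket_fn(s) & bucket_fn(t)
        if dist <= d1 and not shared:
            return False   # close pair with no common bucket
        if dist >= d2 and shared:
            return False   # far pair that collides
    return True


def deletion_buckets(s: str) -> set:
    """Toy bucketing: one bucket per string obtained by deleting one character."""
    return {s[:i] + s[i + 1:] for i in range(len(s))}


# All binary sequences of fixed length 4.
SEQS = ["".join(p) for p in product("01", repeat=4)]
print(is_sensitive(deletion_buckets, SEQS, 1, 3))
print(is_sensitive(deletion_buckets, SEQS, 1, 2))
```

On this toy input the first check passes and the second fails: two equal-length sequences that differ by one substitution share the bucket obtained by deleting the differing position, while sequences such as `0111` and `1110` are at distance 2 yet still share the bucket `111`.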

Dagstuhl Research Online Publication Server

Bidirectional string anchors: A new string sampling mechanism

Author: Loukides G. (Grigorios)
Pissis S. (Solon)
Publication date: 01/01/2021

The minimizers sampling mechanism is a popular mechanism for string sampling introduced independently by Schleimer et al. [SIGMOD 2003] and by Roberts et al. [Bioinf. 2004]. Given two positive integers w and k, it selects the lexicographically smallest length-k substring in every fragment of w consecutive length-k substrings (in every sliding window of length w+k-1). Minimizers samples are approximately uniform, locally consistent, and computable in linear time. Although they do not have good worst-case guarantees on their size, they are often small in practice. They thus have been successfully employed in several string processing applications. Two main disadvantages of minimizers sampling mechanisms are: first, they also do not have good guarantees on the expected size of their samples for every combination of w and k; and, second, indexes that are constructed over their samples do not have good worst-case guarantees for on-line pattern searches. To alleviate these disadvantages, we introduce bidirectional string anchors (bd-anchors), a new string sampling mechanism. Given a positive integer $\ell$, our mechanism selects the lexicographically smallest rotation in every length-$\ell$ fragment (in every sliding window of length $\ell$). We show that bd-anchors samples are also approximately uniform, locally consistent, and computable in linear time. In addition, our experimen
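The two sampling rules described above can be sketched directly from their definitions. The code below is a naive illustration, not the linear-time algorithms the abstract refers to. The function names, the fragment-length parameter name `ell`, and the leftmost tie-breaking rule are assumptions made here for the example.

```python
def minimizers(s: str, w: int, k: int) -> list:
    """Start positions of the smallest k-mer in every window of w consecutive k-mers.

    Ties are broken towards the leftmost occurrence (an assumption of this sketch).
    """
    kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        selected.add(best)
    return sorted(selected)


def bd_anchors(s: str, ell: int) -> list:
    """Start positions of the smallest rotation of every length-ell fragment of s.

    Ties are broken towards the leftmost rotation (an assumption of this sketch).
    """
    selected = set()
    for start in range(len(s) - ell + 1):
        frag = s[start:start + ell]
        best = min(range(ell), key=lambda r: (frag[r:] + frag[:r], r))
        selected.add(start + best)   # position in s where the rotation begins
    return sorted(selected)


print(minimizers("banana", w=2, k=3))
print(bd_anchors("banana", ell=3))
```

Both functions rescan each window from scratch, so their running time grows with the window size; they are meant only to make the selection rules explicit.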

VU Research Portal

CWI's Institutional Repository

INRIA a CCSD electronic archive server

Dagstuhl Research Online Publication Server

Ab-Initio Molecular Dynamics

Author: Alavi
Alder
Almlöf
Alonso
Andrade
Arias
Baroni
Bendt
Benoit
Berendsen
Berne
Binder
Binder
Blöchl
Blöchl
Blöchl
Born
Bornemann
Briggs
Camellone
Car
Caravati
Caravati
Caravati
Caravati
Caravati
Caravati
Ceperley
Ceperley
Ceriotti
Ceriotti
Chandler
Cucinotta
Dai
Dreizler
Ehrenfest
Engel
Ferguson
Fermi
Feynman
Feynman
Foulkes
Frenkel
Galli
Gan
Gilbert
Gillan
Goedecker
Goedecker
Guidon
Guidon
Guidon
Habershon
Harriman
Harris
Hartree
Hassanali
Hellmann
Herbert
Hohenberg
Hutter
Iannuzzi
Jones
Kalos
Khaliullin
Khaliullin
Koch
Kohanoff
Kohn
Kohn
Kolafa
Kolafa
Krack
Krack
Krack
Krajewski
Kresse
Kresse
Kresse
Kühne
Kühne
Kühne
Kühne
Laasonen
Landau
Levy
Lieb
Lippert
Lippert
Liu
Los
Los
Luduena
Luduena
Martin
Martyna
Marx
Marx
Marzari
McWeeny
Mermin
Metropolis
Modine
Morrone
Mostofi
Niklasson
Niklasson
Palser
Parr
Parrinello
Pascal
Pastore
Payne
Perdew
Pulay
Putrino
Rahman
Rapacioli
Rapaport
Ricci
Richters
Röhrig
Scheffler
Schlegel
Schmid
Selloni
Sharma
Smargiassi
Tangney
Tangney
Thomas
Thomas
Todorova
Tuckerman
Tuckerman
Tymczak
VandeVondele
VandeVondele
VandeVondele
Yang
Zhang
Publication venue: 'Wiley'
Publication date: 26/03/2013

Computer simulation methods, such as Monte Carlo or Molecular Dynamics, are very powerful computational techniques that provide detailed and essentially exact information on classical many-body problems. With the advent of ab-initio molecular dynamics, where the forces are computed on-the-fly by accurate electronic structure calculations, the scope of either method has been greatly extended. This new approach, which unifies Newton's and Schrödinger's equations, allows for complex simulations without relying on any adjustable parameter. This review is intended to outline the basic principles and to survey the field. Beginning with the derivation of Born-Oppenheimer molecular dynamics, the Car-Parrinello method and the recently devised efficient and accurate Car-Parrinello-like approach to Born-Oppenheimer molecular dynamics, which unifies the best of both schemes, are discussed. The predictive power of this novel second-generation Car-Parrinello approach is demonstrated by a series of applications ranging from liquid metals to semiconductors and water. This development allows for ab-initio molecular dynamics simulations on much larger length and time scales than previously thought feasible. Comment: 13 pages, 3 figures

arXiv.org e-Print Archive

Crossref