Search CORE

3,087 research outputs found

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure

Author: Brown C. Titus
Canino-Koning Rosangela
Howe Adina Chuang
Pell Jason
Zhang Qingpeng
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2014
Field of study

K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Scalable Techniques for Similarity Search

Author: Nagireddy Siddartha Reddy
Publication venue: SJSU ScholarWorks
Publication date: 01/10/2015
Field of study

Document similarity is similar to the nearest neighbour problem and has applications in various domains. In order to determine the similarity / dissimilarity of the documents first they need to be converted into sets containing shingles. Each document is converted into k-shingles, k being the length of each shingle. The similarity is calculated using Jaccard distance between sets and output into a characteristic matrix, the complexity to parse this matrix is significantly high especially when the sets are large. In this project we explore various approaches such as Min hashing, LSH & Bloom Filter to decrease the matrix size and to improve the time complexity. Min hashing creates a signature matrix which significantly smaller compared to a characteristic matrix. In this project we will look into Min-Hashing implementation, pros and cons. Also we will explore Locality Sensitive Hashing, Bloom Filters and their advantages

SJSU ScholarWorks

Adaptive one memory access bloom filters

Author: Larrabeiti López David
Reviriego Vasallo Pedro
Rottenstreich Ori
Sánchez-Macián Alfonso
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/06/2022
Field of study

Bloom filters are widely used to perform fast approximate membership checking in networking applications. The main limitation of Bloom filters is that they suffer from false positives that can only be reduced by using more memory. We suggest to take advantage of a common repetition in the identity of queried elements to adapt Bloom filters for avoiding false positives for elements that repeat upon queries. In this paper, one memory access Bloom filters are used to design an adaptation scheme that can effectively remove false positives while completing all queries in a single memory access. The proposed filters are well suited for scenarios on which the number of memory bits per element is low and thus complement existing adaptive cuckoo filters that are not efficient in that case. The evaluation results using packet traces show that the proposed adaptive Bloom filters can significantly reduce the false positive rate in networking applications with the single memory access. In particular, when using as few as four bits per element, false positive rates below 5% are achieved.This work was supported by the ACHILLES project PID2019-104207RB-I00 and the Go2Edge network RED2018-102585-T funded by the Spanish Agencia Estatal de Investigación (AEI) 10.13039/501100011033 and by the Madrid Community research project TAPIR-CM grant no. P2018/TCS-4496

Universidad Carlos III de Madrid e-Archivo