8,082 research outputs found
These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure
K-mer abundance analysis is widely used for many purposes in nucleotide
sequence analysis, including data preprocessing for de novo assembly, repeat
detection, and sequencing coverage estimation. We present the khmer software
package for fast and memory efficient online counting of k-mers in sequencing
data sets. Unlike previous methods based on data structures such as hash
tables, suffix arrays, and trie structures, khmer relies entirely on a simple
probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits
online updating and retrieval of k-mer counts in memory which is necessary to
support online k-mer analysis algorithms. On sparse data sets this data
structure is considerably more memory efficient than any exact data structure.
In exchange, the use of a Count-Min Sketch introduces a systematic overcount
for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we
analyze the speed, the memory usage, and the miscount rate of khmer for
generating k-mer frequency distributions and retrieving k-mer counts for
individual k-mers. We also compare the performance of khmer to several other
k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC,
Turtle and KAnalyze. Finally, we examine the effectiveness of profiling
sequencing error, k-mer abundance trimming, and digital normalization of reads
in the context of high khmer false positive rates. khmer is implemented in C++
wrapped in a Python interface, offers a tested and robust API, and is freely
available under the BSD license at github.com/ged-lab/khmer
Fast and Powerful Hashing using Tabulation
Randomized algorithms are often enjoyed for their simplicity, but the hash
functions employed to yield the desired probabilistic guarantees are often too
complicated to be practical. Here we survey recent results on how simple
hashing schemes based on tabulation provide unexpectedly strong guarantees.
Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as
consisting of characters and we have precomputed character tables
mapping characters to random hash values. A key
is hashed to . This schemes is
very fast with character tables in cache. While simple tabulation is not even
4-independent, it does provide many of the guarantees that are normally
obtained via higher independence, e.g., linear probing and Cuckoo hashing.
Next we consider twisted tabulation where one input character is "twisted" in
a simple way. The resulting hash function has powerful distributional
properties: Chernoff-Hoeffding type tail bounds and a very small bias for
min-wise hashing. This also yields an extremely fast pseudo-random number
generator that is provably good for many classic randomized algorithms and
data-structures.
Finally, we consider double tabulation where we compose two simple tabulation
functions, applying one to the output of the other, and show that this yields
very high independence in the classic framework of Carter and Wegman [1977]. In
fact, w.h.p., for a given set of size proportional to that of the space
consumed, double tabulation gives fully-random hashing. We also mention some
more elaborate tabulation schemes getting near-optimal independence for given
time and space.
While these tabulation schemes are all easy to implement and use, their
analysis is not
Closing the Gap Between Short and Long XORs for Model Counting
Many recent algorithms for approximate model counting are based on a
reduction to combinatorial searches over random subsets of the space defined by
parity or XOR constraints. Long parity constraints (involving many variables)
provide strong theoretical guarantees but are computationally difficult. Short
parity constraints are easier to solve but have weaker statistical properties.
It is currently not known how long these parity constraints need to be. We
close the gap by providing matching necessary and sufficient conditions on the
required asymptotic length of the parity constraints. Further, we provide a new
family of lower bounds and the first non-trivial upper bounds on the model
count that are valid for arbitrarily short XORs. We empirically demonstrate the
effectiveness of these bounds on model counting benchmarks and in a
Satisfiability Modulo Theory (SMT) application motivated by the analysis of
contingency tables in statistics.Comment: The 30th Association for the Advancement of Artificial Intelligence
(AAAI-16) Conferenc
More Analysis of Double Hashing for Balanced Allocations
With double hashing, for a key , one generates two hash values and
, and then uses combinations for
to generate multiple hash values in the range from the initial two.
For balanced allocations, keys are hashed into a hash table where each bucket
can hold multiple keys, and each key is placed in the least loaded of
choices. It has been shown previously that asymptotically the performance of
double hashing and fully random hashing is the same in the balanced allocation
paradigm using fluid limit methods. Here we extend a coupling argument used by
Lueker and Molodowitch to show that double hashing and ideal uniform hashing
are asymptotically equivalent in the setting of open address hash tables to the
balanced allocation setting, providing further insight into this phenomenon. We
also discuss the potential for and bottlenecks limiting the use this approach
for other multiple choice hashing schemes.Comment: 13 pages ; current draft ; will be submitted to conference shortl
A unified approach to linear probing hashing with buckets
We give a unified analysis of linear probing hashing with a general bucket
size. We use both a combinatorial approach, giving exact formulas for
generating functions, and a probabilistic approach, giving simple derivations
of asymptotic results. Both approaches complement nicely, and give a good
insight in the relation between linear probing and random walks. A key
methodological contribution, at the core of Analytic Combinatorics, is the use
of the symbolic method (based on q-calculus) to directly derive the generating
functions to analyze.Comment: 49 page
- …